|
![]() |
||||||||||||||||||||||||||||||||||||||||||||||||||||||
Stackish—An XML AlternativeI was checking out some of the XML Alternatives that PaulT maintains and was laughing at some of the proposals. I decided to throw my own ridiculous and charming hat into the ring of fools with the Stackish XML Alternative. This XML alternative is radically different since it writes the exact same stuff you get from an s- expr or XML document, but does it the way a stack language like FORTH, Onyx, and Joy would. This has a few advantages, but the main thing is that it beats the Lisp and Scheme whiners at their own game by being even more concise and having two additional features that s-expr don’t have. To put it simply, Stackish is a “data language” that is written more like FORTH than like other languages. This means that everything is actually ordered backwards, and the majority of processing happens with an implicit stack. As a data language, it is better for encoding data structures and their contents like strings and numbers. As a stack language, you construct the Stackish file inverted so that the structure is built directly rather than relying on parsers and such. For those people who are familiar with s-expressions, you would write:
What you should notice is that there’s no closing brackets, only opening brackets with node names and content. This is allowed based on a simple trick that I’ll explain later. Since the serializer and deserializer (or scanner and emitter) is processing a language that is working an implied stack, and is already in stack order, there’s no need for a lot of parsing machinery. This makes Stackish easier to parse since all you need is a scanner to recognize each token, and a set of simple stack operations to build the structures. With nothing more than a scanner and a set of simple rules and operations you can encode and decode Stackish source almost directly. You can even easily convert it to any other format. CodeIf you want to download the current reference implementation then you can download it from the downloads section A Bit Of BackgroundOriginally, I did Stackish as a satire similar to INTERCAL and other great faux languages not intended for actual use. After I wrote it and started to play with it I found that I actually do like it for a data language. I wouldn’t be so bold as to cram it down a student’s throat as a full programming language like some other people have done, but amazingly enough it does have some advantages. In the end, it’s a perfectly valid XML alternative That’s an alternative not a replacement. I get really annoyed when people put something forward as an alternative when they actually mean a replacement. An alternative in my mind means another choice while a replacement is intended to Stackish is not a replacement for XML but in some situations it will work as an alternative. I think Stackish has some advantages over XML that are similar to s-expressions, and some advantages over s-expressions based on what Schemers like. But I would never claim it is better than XML for all situations, and maybe not even most situations. The FORTH ConnectionFORTH is an interesting language created by Charles Moore in 1968. The fundamental difference with FORTH was that it worked as a stack with words being operations. This is different from other languages which require the source to be processed before operations are performed. With FORTH, the source is processed “on line” and since it is organized as a series of stack operations the processing is very minimal. The second aspect of FORTH that’s important is that operations are not called functions but are called “words”. A FORTH programmer will typically approach a programming problem as if he/she is building a language to describe a solution. They’ll then create words that are processed in order to create the full program. Think of it as a very big RPN calculator. As an example, let’s say we have a function in C of Meanwhile … Back At The LabStackish takes the FORTH way of organizing operations in “stack order” and adds a Lisp like structuring of the source to allow for arbitrary structures. It also attempts to remove any unnecessary punctuation in order to simplify the syntax even further. The end result is a very tight syntax that is incredibly easy to parse and generate consistently. Stackish is best understood by exploring the tokens allowed, then the processing rules, and finally examples of how to write Stackish. Scanning TokensThe following table describes the small set of tokens that a Stackish scanner will handle: Stackish Tokens
Each token is processed by a separate processing rule which builds the structure. It’s not entirely necessary to use a stack to process the input, and the reference implementation actually doesn’t use a stack at all. The reference implementation simply builds the structure directly using parent and child links inside the nodes. A stack is useful when translating or processing the input for other purposes. You can expect the format for FLOAT and NUMBER to improve to encompass common things like negative numbers and other notation. The following Stackish is converted into tokens like so:
You can probably figure out from this how it would be structured. The WORD things is associated with the MARK right after “child” which makes the subgroup [“hello” “I” ‘4:like’ 200]. Then the WORD root is associated with the very first MARK and contains [ “child” things ] structure. As a tree it looks like this:
How all of this is done depends on a set of simple processing rules for each token that are actually similar to SAX events in the XML world, but much simpler. Processing RulesStackish Processing
These simple rules require only a few basic operations which even assembler on nearly any processor should have. The reference implementation also does all operations without using a stack to prove that it’s not entirely necessary. The reference implementation uses a single Node element that has a link to the parent, first child, and next sibling. This creates a binary tree which is traversed during the operations to construct the final structure. A primary thing to note is that we make an assumption that a WORD always means
we have named a group node and there’s no need for a closing ‘]’. This implicit
assumption removes one frequently repeated character from the input and output
which isn’t necessary. The ‘]’ now can be distinct and mean only an anonymous
group, which removes some ambiguity in the processing. This also lets us assign
an anonymous group to an attribute by placing the attribute after the group
like so: Converting From A Tree to StackishMaking a Stackish stream from a typical tree structure (a binary tree works best) is very easy. You basically traverse it in post order with siblings first, children second. This means you traverse the tree to the bottom far right and then work your way up to the final root. This is probably the biggest disadvantage of Stackish since it means that you have to usually do recursive processing. The processing rules for serializing are fairly simple and are based on traversing a tree where each node has three links: parent, child, sibling. The parent link goes to the parent, the child link goes to the first child, and sibling goes to the first sibling. The following table summarizes the encoding rules: Stackish Serialization Rules
These rules are processed in order and produce a consistent canonical form as long as a space is printed after each element. Check out the Node_dump() function in the reference implementation for exact code on how this is done. It’s not that difficult. Interesting ObservationsHere’s some things I’ve noticed since playing with Stackish. Some of these are interesting, but for the most part I like the way Stackish is now. Normal Rather Than Stack OrderThe first observation about Stackish is that it’s not entirely necessary to process the tree in stack order. Let’s say you have the short bit of Stackish:
and you want to process it from rear to front or so it is more like this:
Well, almost all of the same rules would apply, just in the reverse. You’d still make a root node, still add child to it, and still add “data” to child and so on. The difference of course being that the root and child nodes get named when they’re created rather than after. In the end it’s still pretty much the same and you still don’t need a parser. Where the normal order (in contrast to stack order) falls down is when you add attributes which can contain any other structure in them. Let’s say in Stackish we want to add an attribute to the previous example that contains a group of two numbers:
This basically creates an attribute of the child node named @numbers that contains a group with 1 and 2 in it. Now, if we place this in normal order we’d have:
You might be able to see where it gets sticky. The problem lies in the fact that after we read the @numbers attribute, we now have to recursively process the following group as if it stands alone. Since we don’t know when the [ 1 2 ] group actually ends, we’re forced to do odd special case handling or limit what can come immediately after a number. This also means we need at least 1 level of look-ahead to handle the @numbers case. In the end it turns into at least a minimal parser just to handle this, and there are other situations where it becomes apparent. In contrast, Stackish and stack ordering does not have the same problem because the [ 1 2 ] group is already fully constructed and known when we encounter the @numbers attribute. We already know what’s supposed to go into the @numbers attribute so it’s just a simple name assign and we’re on our way. Distinct Token MeaningOne thing that makes Stackish so easy to process is that each token has one
distinct meaning and one distinct structure. For example, the ‘(’ character in
s-expressions is used distinctly for creating lists, but in actuality it
semantically means either “list with name” like in In contrast Stackish has ‘[’ as starting a list, and WORDs (like root) starting named lists. This distinction makes processing Stackish much easier since the processor can immediately handle the distinct cases without extra logic and checking. Knowing When To EndAnother interesting thing about the Stackish way of ordering input is that it is able to detect when a structure is not complete and return to allow the input to be continued. The reason is simply because, at the end of processing there should be only one item on the stack which is the root element. In the reference implementation I don’t use a stack but instead keep track of the current node and the root node. If the current and root node don’t match at the end of the input then we’re not done yet. As an example, let’s say I have the following incomplete structure:
The structure is incomplete because the [ (MARK) and WORD tokens do not balance. I know that I need at least one more WORD or GROUP token to finish the structure. A Stackish processor only needs to scan the stack or walk up the unfinished tree to the root in order to figure out how many structural elements are missing. Also important is the fact that the processor can easily return and indicate that it needs more data to finish the structure. Why is this important? It makes it much easier to write an incremental data parser in situations such as network I/O where the amount of data received is unreliable. It also allows you to read a stream of structures without having to read the entire input. You can read only enough to create a rooted structure, and then return, leaving the remaining input for another call. This is very important in network I/O where being able to send a batch of requests or documents is much better than one at a time. For example, XML has a very hard time handling streamed data since it’s difficult to really know when XML is finished. Stackish Is Diff FriendlySince Stackish has a fairly consistent serializing and deserializing process, and you can put \t \n or ’ ’ as a token separator, you can get much better diffs and even tokenize the diff results to get an almost semantic diff. XML is much more difficult to process in this way, usually requiring specialized XML diff algorithms and tools. To improve the diff results, I would probably want to add an additional set of rules such as attributes are always placed at the top of a node and are always alphabetically organized, but it’s not entirely necessary. I’m still interested in comparing the diff output against canonical XML and other methods. SpeedMy rough experimental results shows that Stackish is actually about twice as fast as the fastest s-expression implementation I could find. The s-expression is the one to beat for speed as compared with XML. The primary thing to keep in mind when comparing Stackish to other formats is that Stackish both fires events as in SAX and builds an entire fully functional binary tree structure with back parent links. In comparison, the above sexpr project only builds the tree. Even then, Stackish in it’s early stages is still twice as fast. The test involved loading large Stackish stream document and serializing them back to Stackish. This was compared with sexpr’s test of doing the same loading and reprinting the s-expressions of the same size. Additionally, the s-expressions used in the sexpr test were generated by Stackish so that the comparison is as fair as possible. When comparing with XML we’d have to compare Stackish not with expat (which only does SAX like events), but something more like Xalan which loads the document entirely into memory and builds the internal DOM tree. I haven’t measured this yet, but my past experience with several XML parsers in numerous languages has showed me that they can rarely compete with s-expressions. I’m planning to do a full analysis of the speed between various formats in the future. Tuning OpportunitiesThe best profiling software out there (commercial or otherwise) is Kcachegrind. I’m sure some people will whine about that statement, but I find Cachegrind and KCachegrind to be the best. I also frequently use Valgrind to verify the correctness of the memory usage in my software. I use it religiously when I code in C. What Kcachegrind has told me is that almost all of the time in the program is spent making memory for the nodes. I’d say the first place to look at improving speed is to use pooled memory so that large numbers of nodes are created at once. The interesting thing is that there’s an almost one-to-one relationship between tokens and nodes in Stackish. Almost each token event results in the creation of a Node, with a few exceptions being WORDs and ATTRIBUTEs. It might be better in this situation then to have the scanner simply load all of the tokens directly as an array of Nodes. Using large chunks of pooled memory would speed up the memory allocation. Then processing the nodes involves simply running through them in order to adjust their links as needed. The use of an array of Nodes ripped out directly with the scanner also brings up that it might not be necessary to have child and sibling links be just straight pointers, but instead pointers combined with lengths. Organizing the array of Nodes from the scanner would then be a process of finding the beginning of a MARK, then the WORD, and that’s the children. This removes a lot of pointer arithmetic. Another area of improvement is with the scanner itself. I currently use re2c to generate the scanner since it was the easiest to use yet still fast. The syntax for Stackish is so simple though that it might just be easier to write a scanner by hand similar to how s-expression parsers are written. This would also allow for Stackish to be processed with character level continuation, which is always nice. Finally, I think that Stackish is tight enough that one could implement a wicked fast processor or a very small one in directly assembly language. While this may just be an exercise to prove your prowess, it would mean that very small devices could have a structure that is just as expressive as s- expressions and XML, but which is much easier to process and store. SAXish TooIn the reference implementation of Stackish I use a SAX style processing method where the scanner calls a set of function pointers to process each token it receives. These methods are similar in style to SAX events, and they let you create translators that can convert directly from Stackish to other formats and structures without using the intermediary Node structure. I successfully used the initial version of this API to convert directly from Stackish to s- expressions. The only thing about using the SAXish API is that you have to process everything in stack order, which means you probably need some form of stack structure. This isn’t too hard to implement with the included sglib ADT library (which is awesome by the way). Also look at the bstring library (bstrlib.c) for a nice set of APIs to help you deal with strings safely in C. I used a combination of an sglib LIST with the bstring formatting to create the test Stackish to s-expression setup. Writing Other FormatsSince Stackish is very similar to s-expressions, it’s possible to write almost any Stackish structure to an s-expression. The only two exceptions are that s- expressions do not support the BLOB type, and Stackish does not have atoms starting with ’ (like ‘blah). I’m considering adding the ability to put an “escaped WORD” which would be the same as the atom type. Included is a Node_dump_sexpr() function which dumps a Node structure as an s-expression. Creating XML is also possible, but there’s a problem since XML does not support the ability to have an entire XML document assigned to an attribute. Also, XML attributes are considered “out of band data” and are not part of the normal node structure, so they are placed more or less first in the parsing stream before children. Stackish and s-expressions (and many other formats) can place attributes just about anywhere and assign just about anything to them. Because of this we need to treat Stackish attributes as actual XML nodes, not really as XML attributes. A smarter (future) XML output algorithm would detect the difference and write either regular XML attributes or a child tag. Another area where XML is lacking is that it doesn’t have anonymous nodes. Stackish and s-expressions can have anonymous groups, but XML doesn’t allow it. Because of this, Stackish must give all anonymous groups an XML tag. Take a look at the output of Node_dump_xml() as an example of how it is done. Future DirectionsI’m probably going to continue to explore Stackish, beef up it’s scanner and see just how fast I can get it. I’d like to also see about using it in some of my other projects as a configuration and possibly RPC language. I’d also like to make native implementations in other languages. Otherwise it’s just an interesting novelty. |