Latest News >> 2008-03-26

First off, thanks to everyone who has emailed me with jobs for my people at Bear Stearns. I’ve received about 20 job offers for them so far, so I think that’s enough. That’s actually quite a large number of jobs from just my blog so I’m thinking of doing a blog post where I put these emails into a nice list for anyone else interested. Depends on if the original submitters are into that or not.

2008-03-17

Update: 2008-03-17 @ 5:30PM Other people who work with me are looking as well. If you have RoR, C#, or Java positions, preferably stable ones, let me know. I might even need to help some non-technical folks with manager experience too. Again, email me if you have anything.

2008-03-16

How I Spent My PyCon Vacation

2008-03-14

Well I just popped my Python cherry by releasing the Zapps project to the world. Zapps is a fork of Amit Patel’s Yapps2 that I’ll be using to do the Python Stackish parser and Utu protocol design. Yapps2 is a great little parser generator that is easily hackable, so adding things like binary parsing and improving the performance is very easy.

Idiopidae: code is code; prose is prose.

Idiopidae is my attempt at finally releasing something that makes it easier for technical documentation authors to write.

The purpose of Idiopidae is to keep the code in the code, and the prose in the prose, and then merge the two together based on very light comments in the source.

You can see the original HTML of this file as well as the final output to compare the two. You can also look at a version that’s done before wiki rendering which shows how Idiopidae automagically figures out that the sample.page file is a text file and properly formats for that output.

Concepts

Idiopidae works on the idea that in your “prose” file you’ll put include statements as comments, and in your “code” file you’ll put export statements to mark off regions of code that need to be named.

When you run the idio Python script on your prose file, it follows the include statements and loads the file and section you specify into an output result. It will also format it with the Pygments library to produce nice typesetting (currently defaults to HTML).
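
To make the idea concrete, here is a minimal Python 3 sketch of the merge step. It is not Idiopidae's implementation: the `### @export`, `### @include`, and `### @end` marker spellings are inferred from the parser grammar shown later on this page, the regexes are simplified, and the names `split_sections` and `merge` are hypothetical. Unlike the real tool, it drops anything outside a named export and does no Pygments formatting.

```python
import re

# Simplified directive patterns (assumption: quoted names only).
EXPORT = re.compile(r'^\s*### @export "([^"]+)"')
INCLUDE = re.compile(r'^\s*### @include "([^"]+)" "([^"]+)"')
END = re.compile(r'^\s*### @end')


def split_sections(code):
    """Return {section name: [lines]} for every named export region."""
    sections, current = {}, None
    for line in code.splitlines():
        m = EXPORT.match(line)
        if m:
            current = sections.setdefault(m.group(1), [])
        elif END.match(line):
            current = None          # lines after @end are not exported here
        elif current is not None:
            current.append(line)
    return sections


def merge(prose, files):
    """Replace each include line in the prose with the named section."""
    out = []
    for line in prose.splitlines():
        m = INCLUDE.match(line)
        if m:
            name, section = m.groups()
            out.extend(split_sections(files[name])[section])
        else:
            out.append(line)
    return "\n".join(out)


code = '### @export "greet"\nprint("hi")\n### @end\nprint("not exported")\n'
prose = 'Here is the greeting:\n### @include "demo.py" "greet"\nThat is all.'
merged = merge(prose, {"demo.py": code})
print(merged)
```

The prose file never contains the code itself, only the one-line include comment, which is the whole point of keeping code in the code and prose in the prose.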

This file you’re reading right now is simply a Textile prose file that includes and describes Idiopidae’s source. The process for creating it was:

  > cd doc
  > webgen
  > idio output/index.html  > output/test.html

The source is available from a Bazaar repository at:

http://www.zedshaw.com/repository/zapps

It is currently just a demo of Zapps, but it will be moved into its own project folder as soon as it is good enough to use and distribute.

The Runtime

It’s best if we start with the runtime.py file, which is responsible for using the IdiopidaeParser to process the files. It starts off with your typical boilerplate code, but I like the with statement, so I pull it in from __future__:

    1: from __future__ import with_statement
    2: 
    3: # Copyright (C) Zed A. Shaw, licensed under the GPLv3
    4: 
    5: import idiopidae
    6: from pygments import highlight
    7: from pygments.formatters import get_formatter_for_filename, get_formatter_by_name
    8: from pygments.lexers import guess_lexer_for_filename, get_lexer_by_name
    9: 
   10: 

Next we need to keep track of stuff:

   11: class Builder(object):
   12:     """Used by IdiopidaeParser to construct the data structure of
   13:         a parsed document.  Composer then uses this to unify a file
   14:         against a directory of other files to produce an output."""
   15: 
   16: 
   17:     def __init__(self):
   18:         self.index = 0
   19:         self.line = 1
   20:         self.current = {
   21:                 "command": "export", "section": self.next_anonymous(), 
   22:                 "language": None, 
   23:                 "lines": []}
   24:         self.statements = [self.current]
   25:         self.exports = {}
   26:         self.sections = []
   27: 
   28: 

Now, there are four methods that the parser uses heavily during the parsing phase to chunk up a document into the proper structure for later analysis:

   29:     def include(self, file, section, format):
   30:         """ Creates a new include statement which lines are next appended to."""
   31:         self.next_statement({
   32:             "command": "include", 
   33:             "file": file, 
   34:             "section": section, 
   35:             "format": format,
   36:             "language": None,
   37:             "lines": []
   38:             })
   39: 
   40: 
   41:     def export(self, section, language):
   42:         """ Creates a new export statement which lines are next appended to."""
   43:         if not section: section = self.next_anonymous()
   44: 
   45:         self.next_statement({
   46:             "command": "export", 
   47:             "section": section, 
   48:             "language": language,
   49:             "lines": []})
   50: 
   51: 
   52:     def end(self):
   53:         """Just a method that ends a section to start the
   54:         next anonymous one."""
   55:         self.export(None, None)
   56: 
   57: 
   58:     def append(self, text):
   59:         """ Appends a line to the current statement with line numbers."""
   60:         self.current["lines"].append((self.line, text))
   61:         self.line += 1
   62: 
   63: 

These aren’t used by callers so much as by the IdiopidaeParser and the Composer. They in turn use two helpers:

   79:     def next_statement(self, statement):
   80:         """Just slaps this new statement onto the list of existing
   81:            statements and then sets the current one for appending
   82:            the lines."""
   83:         self.append_current_export()
   84:         self.current = statement
   85:         self.statements.append(self.current)
   86: 
   87:     def next_anonymous(self):
   88:         """Increments the anonymous section counter for tracking
   89:         sections without names."""
   90:         self.index += 1
   91:         return str(self.index)
   92: 

to swap in the next statement (and to number the anonymous sections), and:

   93:     def append_current_export(self):
   94:         """When a new export statement is hit, this updates the 
   95:            internals that track sequential export statements 
   96:            for later analysis."""
   97:         if self.current["command"] == "export":
   98:             section = self.current["section"]
   99:             self.exports[section] = self.current
  100:             self.sections.append(section)

To append each export to a list of exports found.
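
A short walk-through may help show how these methods fit together. The block below is a condensed Python 3 copy of the Builder methods listed above (docstrings and the include method trimmed), followed by the sequence of calls I would expect the parser to make for a small five-line file. The call sequence is my reading of the grammar, not captured output from the real parser.

```python
class Builder:
    """Condensed Python 3 copy of the methods shown above."""

    def __init__(self):
        self.index = 0
        self.line = 1
        self.current = {"command": "export", "section": self.next_anonymous(),
                        "language": None, "lines": []}
        self.statements = [self.current]
        self.exports = {}
        self.sections = []

    def next_anonymous(self):
        self.index += 1
        return str(self.index)

    def append_current_export(self):
        if self.current["command"] == "export":
            section = self.current["section"]
            self.exports[section] = self.current
            self.sections.append(section)

    def next_statement(self, statement):
        self.append_current_export()
        self.current = statement
        self.statements.append(self.current)

    def export(self, section, language):
        if not section:
            section = self.next_anonymous()
        self.next_statement({"command": "export", "section": section,
                             "language": language, "lines": []})

    def end(self):
        self.export(None, None)

    def append(self, text):
        self.current["lines"].append((self.line, text))
        self.line += 1


# Assumed call order for: a code line, an export directive, a code line,
# an end directive, and a trailing comment.
b = Builder()
b.append("import sys")          # goes into anonymous section "1"
b.export("main", "python")      # directive lines are never appended
b.append("print(sys.argv)")     # goes into section "main"
b.end()                         # opens anonymous section "2"
b.append("# trailing")
b.append_current_export()       # what the Document rule does at end of input

print(b.sections)
print(b.exports["main"]["lines"])
```

Note that the line counter only advances in append, so directive lines do not consume a number; the second content line is numbered 2 even though it is the third line of the file.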

The process we’re describing involves the IdiopidaeParser using the Builder under the direction of the Composer:

  103: class Composer(object):
  104:     """Uses idiopidae.parse to parse the given file into a 
  105:     builder, and then spits out the results using the self.process()
  106:     method."""
  107: 
  108:     def __init__(self):
  109:         self.includes = {}
  110:         self.loads = {}
  111: 
  112: 

The Composer is driven by a simple loop in the idio file, which is the script users actually run:

    1: #!/usr/bin/env python
    2: 
    3: import runtime
    4: import sys
    5: 
    6: c = runtime.Composer()
    7: 
    8: for file in sys.argv[1:]:
    9:     print c.process(file)
   10: 

First we have how a file is loaded and parsed by the composer:

  113:     def load(self, name):
  114:         """Does the actual parsing of a file into a Builder and caches the results
  115:         into self.loads for faster calls later."""
  116:         if not self.loads.has_key(name):
  117:             with open(name) as file:
  118:                 text = file.read() + "\n\0"
  119:                 self.loads[name] = idiopidae.parse('Document', text)
  120:         return self.loads[name]
  121: 
  122: 
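
The caching pattern in load is easy to try in isolation. This is a Python 3 sketch under a few assumptions: `CachingLoader`, `load_text`, and `fake_parse` are hypothetical names, the text is passed in directly instead of being read from disk, and `fake_parse` merely stands in for idiopidae.parse. The `"\n\0"` suffix is kept because the grammar's EOD token is a NUL character, so it presumably guarantees that the last line ends with a newline and that the input is terminated.

```python
class CachingLoader:
    """Memoizes parse results by file name, in the style of Composer.load."""

    def __init__(self, parse):
        self.parse = parse      # stand-in for idiopidae.parse
        self.loads = {}

    def load_text(self, name, text):
        # dict.has_key() is gone in Python 3; "in" is the replacement.
        if name not in self.loads:
            self.loads[name] = self.parse(text + "\n\0")
        return self.loads[name]


calls = []


def fake_parse(text):
    """Hypothetical parser: records its input and returns the lines."""
    calls.append(text)
    return text.splitlines()


loader = CachingLoader(fake_parse)
first = loader.load_text("a.txt", "one\ntwo")
second = loader.load_text("a.txt", "ignored: already cached")
print(len(calls))
```

The second call returns the cached object without parsing again, which matters because the same source file is typically included many times from one prose file.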

This is what the process method and its helpers use:

  123:     def process(self, name):
  124:         """Performs a full processing of the file returning a string
  125:         with all the @include sections replaced."""
  126:         self.builder = self.load(name)
  127:         results = []
  128:         for st in self.builder.statements:
  129:             if st["command"] == "export":
  130:                 self.append_export(results, st)
  131:             elif st["command"] == "include":
  132:                 self.append_include(results, name, st)
  133:         return "\n".join(results)
  134: 
  135:     def append_include(self, results, name, st):
  136:         key = "%s/%s/%s" % (name, st["file"], st["section"])
  137: 
  138:         if self.includes.has_key(key):
  139:             # look it up in the cache instead of processing it again
  140:             text = self.includes[key]
  141:         else:
  142:             lines, firsts = self.include(st["file"], st["section"])
  143:             lexer = self.resolve_lexer(st, firsts)
  144:             format = self.resolve_format(name, st)
  145:             text = self.format(lines, lexer, format, numbered=True)
  146:             self.includes[key] = text
  147: 
  148:         results.append(text)
  149: 
  150:     def append_export(self, results, st):
  151:         results.append(self.format(st["lines"]))
  152: 
  153: 
  154:     def resolve_lexer(self, st, firsts):
  155:         """Responsible for resolving the lexer that should be used on the
  156:         section of code.  It will use the one specified in the export, and
  157:         then try to guess based on the file name/extension and the first line
  158:         of the text file."""
  159:         file, lang = st["file"], st["language"]
  160: 
  161:         if lang:
  162:             return get_lexer_by_name(lang)
  163:         try:
  164:             return guess_lexer_for_filename(file, firsts)
  165:         except:
  166:             return get_lexer_by_name("text")
  167:         
  168: 
  169:     def resolve_format(self, file, st):
  170:         """Resolves formats that are specified based on either the
  171:         file name/extension or an explicitly given format."""
  172:         # TODO: let them specify options too, probably from some yaml
  173:         if st["format"]:
  174:             return get_formatter_by_name(st["format"])
  175:         else:
  176:             try:
  177:                 return get_formatter_for_filename(file)
  178:             except:
  179:                 return get_formatter_by_name("text")
  180: 
  181: 

The process method is the most complex, since it is where all the real work is being done. It loads the file we want to compose and goes through all the statements. Any statement that’s an export is just appended to the output as-is, but any statement that’s an include is resolved by a call to include and then to format to get the text. Here is format:

  182:     def format(self, lines, lexer = None, format = None, numbered=False):
  183:         """Given a set of (#,"") line tuples it will return a 
  184:         string with line numbers or not."""
  185:         # TODO: need to figure out if the format has line numbers and do that instead
  186:         if numbered:
  187:             text = "\n".join(["%5d: %s" % l for l in lines])
  188:         else:
  189:             text = "\n".join([l[1] for l in lines])
  190: 
  191:         if format and lexer:
  192:             return highlight(text, lexer, format)
  193:         else:
  194:             return text
  195: 
  196: 
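
The numbering half of format can be tried without Pygments. This is a small Python 3 sketch that reuses the same `"%5d: %s"` format string; the function name `number_lines` and the sample tuples are made up for illustration.

```python
def number_lines(lines, numbered=False):
    """Join (line number, text) tuples, optionally with a numbered prefix."""
    if numbered:
        # Each tuple supplies both values for the format string.
        return "\n".join("%5d: %s" % pair for pair in lines)
    return "\n".join(text for _, text in lines)


lines = [(11, "class Builder(object):"), (12, '    """doc"""')]
numbered = number_lines(lines, numbered=True)
plain = number_lines(lines)
print(numbered)
```

The five-character right-aligned number followed by a colon is the same prefix you see on the code listings throughout this page.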

The include method is actually very simple:

  197:     def include(self, file, section):
  198:         """Loads the requested section and returns those lines and the first
  199:         few lines of the whole file for guessing the format.  Also does some 
  200:         caching of the requested sections, firsts, and loaded files."""
  201: 
  202:         try:
  203:             target = self.load(file)
  204:             if not target: 
  205:                 print "!!!! ERROR: Failed to parse file %s (see above for error)" % file
  206:                 raise RuntimeError("ERROR: Failed to parse %s (see output)" % file)
  207:             else:
  208:                 lines = target.lines_for(section)
  209:                 firsts = self.format(target.lines_for(target.sections[0]), numbered=False)
  210: 
  211:             return lines, firsts
  212:         except KeyError:
  213:             raise KeyError("ERROR: Key '%s' not exported or included in file '%s'" % (section, file))
  214: 
  215: 

And that’s all of idiopidae except the parser, which we’ll go over next.

The Parser

The parser is the key to how Idiopidae works, and it uses Zapps, the Yapps2 fork I released recently. It shows you can easily crank out little parsers for little languages that are fast enough for real work.

Since most people don’t get parsers, you would do well to use bzr to grab the code and study how this file is translated into the idiopidae.py file.

Every parser generator has three main components: code stuff, tokens, and grammar rules. For Idiopidae there’s not much code stuff other than the import of the runtime:

    1: 
    2: # Copyright (C) Zed A. Shaw, licensed under the GPLv3
    3: 
    4: import runtime
    5: 
    6: %%
    7: 

Then we just start off the parser declaration, which will be turned into a class named idiopidae.IdiopidaeParser that you can run:

    8: parser IdiopidaeParser:
    9: 

Now, we need to have a bunch of tokens which we want to either discard as just visual aids for the user, or keep as input data:

   10:     token WS: "[ \t]+"
   11:     token NUMBER: "[0-9]+[0-9\.]*"
   12:     token STRING: '\'([^\\n\'\\\\]|\\\\.)*\'|"([^\\n"\\\\]|\\\\.)*"'
   13:     token EOD: "\\0"
   14:     token EOL: "(\\n|\\r\\n)"
   15:     token END: "end"
   16:     token ID: "[a-zA-Z][a-zA-Z\-_0-9]+"
   17:     token INCLUDE: "include"
   18:     token EXPORT: "export"
   19:     token STARTER: "[ \t]*(###|//|\\*)+ @"
   20:     token NOT_STARTER: "([^#]|[^//]|[^\\*])"
   21:     token JUNK: "[^\\n]*" 
   22: 
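
If you want to poke at these patterns outside the parser generator, Python's re module is enough. The sketch below copies four of the token patterns, unescaped from their string-literal form into raw strings, and tries them on a few made-up sample lines. It only shows what each regex matches on its own; how the generated scanner chooses between competing tokens is not modelled here.

```python
import re

# Token patterns from the grammar, written as raw strings.
STARTER = re.compile(r"[ \t]*(###|//|\*)+ @")
ID = re.compile(r"[a-zA-Z][a-zA-Z\-_0-9]+")
NUMBER = re.compile(r"[0-9]+[0-9\.]*")
NOT_STARTER = re.compile(r"([^#]|[^//]|[^\*])")

samples = [
    '### @export "imports"',        # hash-style comment marker
    '// @include "a.py" "main"',    # slash-style comment marker
    ' * @end',                      # star-style, as inside a block comment
    '# a normal comment',           # a single hash is not a starter
    'x = 1',
]
starts = [bool(STARTER.match(s)) for s in samples]
print(starts)

# ID needs at least two characters because of the trailing "+".
print(ID.fullmatch("x"), ID.fullmatch("html"))

# Taken alone, NOT_STARTER accepts any single character, "#" included,
# because a "#" still satisfies the second alternative.
print(bool(NOT_STARTER.match("#")))
```

Two things stand out when reading the patterns this way: a one-letter section or language name will not lex as an ID, and NOT_STARTER by itself does not exclude comment characters, so telling a Statement from Junk has to come from the grammar and the scanner rather than from that token.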

You can’t tell from the above list what is dropped and what is kept; for that you have to look in the grammar. The trick is that we define all the base “words” or tokens and then use the grammar to sift through them and pull out what is considered Junk or a Statement:

   23:     rule Section:  
   24:             ID {{ return ID }} 
   25:             | NUMBER {{ return NUMBER }} 
   26:             | STRING {{ return STRING[1:-1] }}
   27:     rule Language: WS ID {{ return ID }} 
   28:     rule Format:   WS ID {{ return ID }}
   29:     rule File: STRING {{ return STRING[1:-1] }}
   30:     rule Include: 
   31:         INCLUDE WS File WS Section Format? {{ self.doc.include(File, Section, Format) }}
   32:     rule Export: EXPORT WS Section Language?  {{ self.doc.export(Section, Language) }}
   33:     rule Command: Include 
   34:         | Export 
   35:         | END {{ self.doc.end() }}
   36:     rule Statement: STARTER Command (WS)* EOL
   37:     rule Junk: (
   38:             NOT_STARTER JUNK EOL {{ self.doc.append(NOT_STARTER + JUNK) }}
   39:             | EOL {{ self.doc.append('') }}
   40:             )
   41:     rule Line: Statement | Junk 
   42:     rule Document: 
   43:         {{ self.doc = runtime.Builder() }} 
   44:         (Line)* 
   45:         EOD {{ self.doc.append_current_export();  return self.doc }}
   46: 
   47: 

More on reading this later.