I’m taking a course on building compilers at the Israeli Open University and just learned how to use flex. It occurred to me that building a simple lexical analyzer should be quite easy with Python’s re module. A typical lexical analyzer reads a stream of text input and splits it into a list of tokens. The simplest example of such a thing is the split function, which takes a sentence and returns the list of words in it.
s = "A simple lexer in Python"
s.split()
['A', 'simple', 'lexer', 'in', 'Python']
The problem becomes more complex when you need to separate the tokens you find into different kinds: words and numbers, for instance. We’ll use a well-known lyric as our sample text:
s = """99 bottles of beer on the wall, 99 bottles of beer.
Take one down and pass it around, 98 bottles of beer on the wall."""
The first thing we need to do is build a regular expression that recognizes words and another one that recognizes numbers. Although there are shorter ways to build those regular expressions, I like the less obscure form:
import re

wordsRegex = "[A-Za-z]+"
numbersRegex = "[0-9]+"
We could now use findall on the string and get all the numbers and words out of it.
re.findall(wordsRegex, s)
['bottles', 'of', 'beer', 'on', 'the', 'wall', 'bottles', 'of', 'beer',
 'Take', 'one', 'down', 'and', 'pass', 'it', 'around', 'bottles', 'of',
 'beer', 'on', 'the', 'wall']

re.findall(numbersRegex, s)
['99', '99', '98']
But wait, you say, that isn’t what we wanted at all! We need to get the tokens in the order of their appearance in the text and still know the type of each token. Something along the lines of
print tokenType, tokenText
would be really nice.
In order to do that, we’ll need to combine both regular expressions into one and iterate over the result of findall, examining each token to decide on its type.
regex = "([A-Za-z]+)|([0-9]+)"
re.findall(regex, s)
[('', '99'), ('bottles', ''), ('of', ''), ('beer', ''),
 ('on', ''), ('the', ''), ('wall', ''), ('', '99'),
 ('bottles', ''), ('of', ''), ('beer', ''), ('Take', ''),
 ('one', ''), ('down', ''), ('and', ''), ('pass', ''),
 ('it', ''), ('around', ''), ('', '98'), ('bottles', ''),
 ('of', ''), ('beer', ''), ('on', ''), ('the', ''), ('wall', '')]
As you can see, the result of the call to findall is a list of tuples, each containing a single match. If you look closely at the way I’ve combined the two regular expressions, you’ll see that each part is surrounded with parentheses and that there’s a pipe (|) between the expressions. The compound regular expression matches either a number or a word, and each tuple in the return value of findall contains the matches for each parenthesized part of the regexp. However, since we combined the parts using a pipe (|), only one of the parts matches each time.
Using that knowledge we can now construct a simple loop that shows the token type for each of the tokens in the lyric:
for t in re.findall(regex, s):
    if t[0]:
        print "word", t[0]
    elif t[1]:
        print "number", t[1]
We now have most of the knowledge we need to build ourselves a lexer that will take a list of regular expressions and some text and return (or even better, generate) a list of tokens and their types. We’ll need to combine the regular expressions for each token into one big regex using pipes, scan the string, and gather the tokens and their types.
Our usage code looks like this:
("word", "[A-Za-z]+"),
("number", "[0-9]+"),
]
lex = Lexer(definitions)
for tokenType, tokenValue in lex.parse(s):
    print tokenType, tokenValue
And here is the code for the lexer itself:
import re

class Lexer(object):
    def __init__(self, definitions):
        self.definitions = definitions
        parts = []
        for name, part in definitions:
            # wrap each token regex in a named group: (?P<name>regex)
            parts.append("(?P<%s>%s)" % (name, part))
        self.regexpString = "|".join(parts)
        self.regexp = re.compile(self.regexpString, re.MULTILINE)

    def parse(self, text):
        # yield (tokenType, tokenValue) pairs as they are found in the text
        for match in self.regexp.finditer(text):
            # exactly one of the named groups matched; find it and yield the lexeme
            for name, rexp in self.definitions:
                m = match.group(name)
                if m is not None:
                    yield (name, m)
                    break
Some notes on the implementation are in order. I’ve used the little-known (?P<name>...) syntax for naming the parenthesized groups of regular expressions. Using that syntax, the expression (?P<word>[A-Za-z]+) matches a word and that match is accessible with match.group('word'), where match is a match object returned by re.
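For example, here is the same idea in miniature (a throwaway snippet, not part of the lexer itself):

m = re.match("(?P<word>[A-Za-z]+)|(?P<number>[0-9]+)", "bottles")
m.group("word")      # -> 'bottles'
m.group("number")    # -> None (the number alternative didn't match)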
In order to speed things up a bit, I’ve compiled the regular expression when the Lexer object is created, used the finditer method instead of findall, and made parse a generator instead of a list-returning function.
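Because parse is a generator, a caller can also stop consuming tokens early without the rest of the input ever being scanned. A small illustration (itertools.islice is just one convenient way to take the first few tokens):

import itertools

# consume only the first three tokens; the remainder of the text is never scanned
for tokenType, tokenValue in itertools.islice(lex.parse(s), 3):
    print tokenType, tokenValue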
Using this little lexer it was quite simple to create a Python-to-HTML converter with syntax highlighting that works well enough to highlight the code of the highlighter itself!
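To give a flavour of it, the token definitions and driver for such a highlighter might look roughly like the sketch below. This is a simplified illustration with made-up CSS class names and a stripped-down keyword list, not the actual highlighter code linked below:

import cgi

python_definitions = [
    ("comment", "#[^\n]*"),
    ("string",  "'[^']*'|\"[^\"]*\""),
    ("number",  "[0-9]+"),
    ("name",    "[A-Za-z_][A-Za-z0-9_]*"),
    ("other",   "[^A-Za-z0-9_#'\"]+"),
]

KEYWORDS = set(["def", "class", "import", "return", "yield",
                "for", "in", "if", "elif", "else", "print"])

def highlight(source):
    # wrap every token in a <span> carrying its type as a CSS class;
    # whitespace and punctuation ("other") are passed through escaped
    html = []
    for tokenType, tokenValue in Lexer(python_definitions).parse(source):
        if tokenType == "name" and tokenValue in KEYWORDS:
            tokenType = "keyword"
        if tokenType == "other":
            html.append(cgi.escape(tokenValue))
        else:
            html.append('<span class="%s">%s</span>'
                        % (tokenType, cgi.escape(tokenValue)))
    return "<pre>" + "".join(html) + "</pre>"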
The code for the lexer and the syntax highlighter example is available here and on my snippets page. You can also see the result of running the syntax highlighter on itself here.
Enjoy lexing and let me know if you found this useful.