HiveBrain v1.2.0
Get Started
← Back to all entries
patternMinor

Python tokenizer in Haskell + Parsec

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
parsechaskellpythontokenizer

Problem

I wrote a tokenizer/lexer (difference?) for Python in Haskell: this is my code on GitHub.

I already tested it out with some of CPython's standard library scripts, and it doesn't fail. However, it probably outputs non-conforming output. But I don't care about this right now.

What I'm looking for is a general (functional) style review. This is my first time using Parsec and Haskell for something relatively big, so I expect my functional style to be suboptimal.

Significant code snippets:

parseTokens :: LexemeParser [PositionedToken]
parseTokens = do
  P.skipMany ignorable
  tokens  P.many parseLogicalLine
  P.eof
  endIndents 0)  getIndentLevels
  dedents  P.many (P.char ' ')
      prev  return []
        GT -> return [Indent]  do
          levels  curr) levels
          putIndentLevels levels'
          return $ replicate (length pop) Dedent

parseLogicalLine :: LexemeParser [PositionedToken]
parseLogicalLine = do
  P.skipMany ignorable
  indentation  begin
  where
    begin = do
      tokens  begin

      implicitJoin  0)  getOpenBraces
      if implicitJoin then do
        whitespace
        parseComment  (P.optional backslash *> eol)
        P.skipMany ignorable
        whitespace
        continue
      else do
        whitespace
        explicitJoin  eol)
        case explicitJoin of
          Nothing   -> ((\t -> tokens ++ [t])  position NewLine)  whitespace *> continue


(I was largely inspired by PureScript's lexer, though I probably ended up with something a lot worse.)

Solution

for comparison

Here are some other ways I've found to parse indentation sensitive language:

-
Use Text.Parsec.Indent Here is a blog post containing an example: (link)

-
Have a look at the language-python package. It's lexer is built using alex: (link)

-
Purescript handles indentation via the mark, checkIndentation, indented and same functions: (link) Grep the rest of the source code for these functions to see how they are used. Its lexer defines a Indent token (link) but from what I can tell it is never returned, i.e. see (link). The indentation level is kept track of by the parser (link) via the mark function.

returning a list of tokens?

I'm not sure there is a lot of value in parsers which return [PositionedToken] except, perhaps, for testing purposes.

Most of the time you want to run a token parser incrementally, but parseTokens has to consume the entire file before it can return anything.

There should be a function which just returns the next token. This will come in handy in conjunction with the next comment I have...

handling dedents at eof

I would find another way to handle implicit dedents at the end of the file. I'm sure this can be handled at the grammar level. The code to add them is extremely messy - don't you agree?

Have a look at how the language-python package does it:

https://github.com/bjpop/language-python/blob/master/src/Language/Python/Version2/Parser/Lexer.x#L229-L242

The function lexToken returns just a single token.
The lexer has state, so when it reaches EOF it checks the current indentation level. If the indentation level is > 1, it decrements it and returns a dedentToken. Otherwise it returns an EOF token.

Context

StackExchange Code Review Q#101691, answer score: 3

Revisions (0)

No revisions yet.