patternMinor
Python tokenizer in Haskell + Parsec
Viewed 0 times
parsechaskellpythontokenizer
Problem
I wrote a tokenizer/lexer (difference?) for Python in Haskell: this is my code on GitHub.
I already tested it out with some of CPython's standard library scripts, and it doesn't fail. However, it probably outputs non-conforming output. But I don't care about this right now.
What I'm looking for is a general (functional) style review. This is my first time using Parsec and Haskell for something relatively big, so I expect my functional style to be suboptimal.
Significant code snippets:
(I was largely inspired by PureScript's lexer, though I probably ended up with something a lot worse.)
I already tested it out with some of CPython's standard library scripts, and it doesn't fail. However, it probably outputs non-conforming output. But I don't care about this right now.
What I'm looking for is a general (functional) style review. This is my first time using Parsec and Haskell for something relatively big, so I expect my functional style to be suboptimal.
Significant code snippets:
parseTokens :: LexemeParser [PositionedToken]
parseTokens = do
P.skipMany ignorable
tokens P.many parseLogicalLine
P.eof
endIndents 0) getIndentLevels
dedents P.many (P.char ' ')
prev return []
GT -> return [Indent] do
levels curr) levels
putIndentLevels levels'
return $ replicate (length pop) Dedent
parseLogicalLine :: LexemeParser [PositionedToken]
parseLogicalLine = do
P.skipMany ignorable
indentation begin
where
begin = do
tokens begin
implicitJoin 0) getOpenBraces
if implicitJoin then do
whitespace
parseComment (P.optional backslash *> eol)
P.skipMany ignorable
whitespace
continue
else do
whitespace
explicitJoin eol)
case explicitJoin of
Nothing -> ((\t -> tokens ++ [t]) position NewLine) whitespace *> continue(I was largely inspired by PureScript's lexer, though I probably ended up with something a lot worse.)
Solution
for comparison
Here are some other ways I've found to parse indentation sensitive language:
-
Use
-
Have a look at the language-python package. It's lexer is built using
-
Purescript handles indentation via the
returning a list of tokens?
I'm not sure there is a lot of value in parsers which return
Most of the time you want to run a token parser incrementally, but
There should be a function which just returns the next token. This will come in handy in conjunction with the next comment I have...
handling dedents at eof
I would find another way to handle implicit dedents at the end of the file. I'm sure this can be handled at the grammar level. The code to add them is extremely messy - don't you agree?
Have a look at how the
https://github.com/bjpop/language-python/blob/master/src/Language/Python/Version2/Parser/Lexer.x#L229-L242
The function
The lexer has state, so when it reaches EOF it checks the current indentation level. If the indentation level is > 1, it decrements it and returns a
Here are some other ways I've found to parse indentation sensitive language:
-
Use
Text.Parsec.Indent Here is a blog post containing an example: (link)-
Have a look at the language-python package. It's lexer is built using
alex: (link)-
Purescript handles indentation via the
mark, checkIndentation, indented and same functions: (link) Grep the rest of the source code for these functions to see how they are used. Its lexer defines a Indent token (link) but from what I can tell it is never returned, i.e. see (link). The indentation level is kept track of by the parser (link) via the mark function.returning a list of tokens?
I'm not sure there is a lot of value in parsers which return
[PositionedToken] except, perhaps, for testing purposes.Most of the time you want to run a token parser incrementally, but
parseTokens has to consume the entire file before it can return anything.There should be a function which just returns the next token. This will come in handy in conjunction with the next comment I have...
handling dedents at eof
I would find another way to handle implicit dedents at the end of the file. I'm sure this can be handled at the grammar level. The code to add them is extremely messy - don't you agree?
Have a look at how the
language-python package does it:https://github.com/bjpop/language-python/blob/master/src/Language/Python/Version2/Parser/Lexer.x#L229-L242
The function
lexToken returns just a single token.The lexer has state, so when it reaches EOF it checks the current indentation level. If the indentation level is > 1, it decrements it and returns a
dedentToken. Otherwise it returns an EOF token.Context
StackExchange Code Review Q#101691, answer score: 3
Revisions (0)
No revisions yet.