Schemey - lexer
Problem
Note: This is the first post in (hopefully) a series of posts I plan to make.
Project background
Schemey is a project I've been working on for the past 3-4 weeks. It is an implementation of a subset of the Scheme programming language. The project is still very new and most likely has bugs in certain parts (which is why I'm posting it here).
I just finished posting the repository of my project on GitHub, and documentation of the project can be found here. Please note that, while I've tried my best to make sure my project is "release ready", this is the first version of the repository and its documentation, so typos can be expected. Also note that all code in the repository is in the public domain, so feel free to use it as you please.
Structure of the lexer
I designed my lexer around the maximal-munch rule instead of using regular expressions, as I had originally planned.
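For readers unfamiliar with the rule, here is a minimal sketch of maximal munch in general. It is not the Schemey lexer: the `OPERATORS` table and the `longest_operator` and `tokenize` helpers are hypothetical names made up for this illustration. At each position the lexer consumes the longest piece of input that forms a valid token.

```python
# Hypothetical operator table for illustration only; not Schemey's tokens.
OPERATORS = {'<', '<=', '=', '=>', '(', ')'}
MAX_OPERATOR_LEN = max(len(op) for op in OPERATORS)


def longest_operator(source, pos):
    """Return the longest operator starting at source[pos], or None."""
    best = None
    limit = min(len(source), pos + MAX_OPERATOR_LEN)
    # Try every candidate length and remember the longest match.
    for end in range(pos + 1, limit + 1):
        candidate = source[pos:end]
        if candidate in OPERATORS:
            best = candidate
    return best


def tokenize(source):
    """Split source into operators and alphanumeric words."""
    tokens = []
    pos = 0
    while pos < len(source):
        char = source[pos]
        if char.isspace():
            pos += 1
            continue
        if char.isalnum():
            # Consume as many alphanumeric characters as possible.
            end = pos
            while end < len(source) and source[end].isalnum():
                end += 1
            tokens.append(source[pos:end])
            pos = end
            continue
        operator = longest_operator(source, pos)
        if operator is None:
            raise ValueError('unexpected character at position %d' % pos)
        tokens.append(operator)
        pos += len(operator)
    return tokens
```

With this sketch, `tokenize('(<= x1 10)')` keeps `<=` together as one token instead of splitting it into `<` and `=`, because the longer match wins.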
The lexer itself is fairly small (around 150-200 lines), and I've added documentation throughout the code, so I'm confident I don't have to explain it in much detail. Without further ado, here is the code:
lexer.py
```python
"""
lexer.py
----------------------------------------
A simple lexer based upon the "maximal munch" rule.
Because of this, the lexer is not generic and must
be created anew for each specific language.
----------------------------------------
Algerbrex
All code in this module is
public domain.
Last modified: February 5 2017
"""
from collections import namedtuple
from _builtins import builtin_map

# Sometimes we need to return the current position
# the lexer is on, to raise an appropriate error.
# Whenever there is an error, an instance is returned
# to the parser.
Error = namedtuple('Error', 'pos')


def is_identifier(char):
    """ Test if `char` is a valid Scheme identifier."""
    return char.isalnum() or char in builtin_map.keys() or char in ('?', '!', '.')


class Token:
    """ A simple Token structure.
    Contains the token type,
```
Solution
Overall, the code is quite clean and understandable. I have never done anything like what you are doing, but here are some notes about the code:
- The PEP 8 import guidelines suggest putting a blank line between the different types of imports. Replace:

  ```python
  from collections import namedtuple
  from _builtins import builtin_map
  ```

  with:

  ```python
  from collections import namedtuple

  from _builtins import builtin_map
  ```

- One-line docstrings can stay on a single line (PEP 257).

  As a side note, I think you should not put the "modified date" into the module itself; let GitHub and git handle it.
- The `Token` and `TokenTypes` classes may use `__slots__` for faster attribute access and memory savings.
- `char in builtin_map.keys()` can and should be replaced with just `char in builtin_map`. In Python 2, `.keys()` creates a list of the keys and the lookup scans that list instead of using the hash table: O(n) vs O(1). In Python 3, `.keys()` returns a view with fast membership tests, so there the call is merely redundant.
- `char in ('?', '!', '.')` should be replaced with something like `char in ADDITIONAL_BUILTIN_CHARS`, where `ADDITIONAL_BUILTIN_CHARS = {'?', '!', '.'}` (note: Python 2.7+ syntax). It is a set that you should define at the module level, or maybe even move to the `_builtins` module to keep all the related constants together.
- All the other places where you use `in` to check whether a character is one of multiple characters can benefit from a similar improvement: define the characters as a set, give it a meaningful name and move it to the module level.
- The Lexer should probably support the "iterator" protocol (advancing to the next token continuously), like for example this or this one does.
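The set, `__slots__` and iterator suggestions above could be combined roughly as follows. This is only a sketch under assumptions: `PUNCTUATION_CHARS`, `BUILTIN_MAP`, the token type strings and the shape of `Token` and `Lexer` are hypothetical stand-ins, since the full Schemey classes are not shown here.

```python
# Hypothetical names throughout; the real Schemey classes may differ.
PUNCTUATION_CHARS = frozenset('?!.')        # module-level set constant
BUILTIN_MAP = {'+': 'PLUS', '-': 'MINUS'}   # stand-in for builtin_map


def is_identifier(char):
    """Test if `char` may appear in an identifier."""
    return (char.isalnum()
            or char in BUILTIN_MAP          # dict membership, no .keys()
            or char in PUNCTUATION_CHARS)


class Token(object):
    """A token holding a type and a value."""
    __slots__ = ('type', 'value')           # no per-instance __dict__

    def __init__(self, type_, value):
        self.type = type_
        self.value = value


class Lexer(object):
    """Produce tokens one at a time through the iterator protocol."""

    def __init__(self, source):
        self.source = source
        self.pos = 0

    def __iter__(self):
        return self

    def __next__(self):
        source = self.source
        # Skip whitespace between tokens.
        while self.pos < len(source) and source[self.pos].isspace():
            self.pos += 1
        if self.pos >= len(source):
            raise StopIteration
        char = source[self.pos]
        if char in '()':
            self.pos += 1
            return Token('PAREN', char)
        if is_identifier(char):
            start = self.pos
            while self.pos < len(source) and is_identifier(source[self.pos]):
                self.pos += 1
            return Token('IDENTIFIER', source[start:self.pos])
        raise ValueError('unexpected character at position %d' % self.pos)

    next = __next__                         # Python 2 spelling
```

With the iterator protocol in place, a caller can write `for token in Lexer('(null? x)'):` or collect everything with `list(Lexer(...))`, without the lexer having to build the whole token list up front.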
Context
StackExchange Code Review Q#154735, answer score: 5