Schemey - lexer

Submitted by: @import:stackexchange-codereview·Mar 10, 2026·

Problem

Note: This is the first post in, hopefully, a series of posts I plan to make.

Project background

Schemey is a project I've been working on for the past 3-4 weeks. It is an implementation of a subset of the Scheme programming language. The project is still very new and most likely has bugs in certain parts (which is why I'm posting this here).

Actually, I just finished posting the repository of my project on GitHub, and documentation of the project can be found here. Please note, while I've tried my best to make sure my project is "release ready", this is the first version of my project's repository and its documentation, so typos can be expected. Also note, all code in the repository is in the public domain, so feel free to use it as you please.

Structure of the lexer

I designed my lexer around the maximal-munch rule instead of using regular expressions, as I had originally planned.
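For readers unfamiliar with the term, here is a minimal sketch of the maximal-munch idea: at each position, take the longest piece of input that forms a valid token. This is not the Schemey lexer; the `OPERATORS` table and the `munch` and `tokenize` names are hypothetical and exist only for illustration.

```python
# Hypothetical operator table for illustration; not taken from Schemey.
OPERATORS = {'<', '<=', '>', '>=', '='}
MAX_OPERATOR_LEN = max(len(op) for op in OPERATORS)


def munch(text, pos):
    """Return the longest operator starting at `pos`, or None."""
    # Try the longest candidate first, then shrink: this is the "maximal" part.
    for length in range(MAX_OPERATOR_LEN, 0, -1):
        candidate = text[pos:pos + length]
        if candidate in OPERATORS:
            return candidate
    return None


def tokenize(text):
    """Split `text` into operators and alphanumeric runs."""
    tokens = []
    pos = 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        op = munch(text, pos)
        if op is not None:
            tokens.append(op)
            pos += len(op)
            continue
        # Otherwise consume the longest run of alphanumeric characters.
        end = pos
        while end < len(text) and text[end].isalnum():
            end += 1
        if end == pos:
            raise ValueError('unexpected character at position %d' % pos)
        tokens.append(text[pos:end])
        pos = end
    return tokens
```

With this table, `<=` is read as one token rather than `<` followed by `=`, because the two-character candidate is tried first.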

The lexer itself is fairly small (around 150-200 lines), and I've added documentation throughout the code, so I'm confident I don't have to explain it in very much detail. Without further ado, here is the code:

lexer.py

```python
"""
lexer.py
----------------------------------------
A simple lexer based upon the "maximal munch" rule.
Because of this, the lexer is not generic and must
be created anew for each specific language.
----------------------------------------
Algerbrex
All code in this module is
public domain.
Last modified: February 5 2017
"""

from collections import namedtuple
from _builtins import builtin_map

# Sometimes we need to return the current position
# the lexer is on, to raise an appropriate error.
# Whenever there is an error, an instance is returned
# to the parser.
Error = namedtuple('Error', 'pos')

def is_identifier(char):
    """ Test if `char` is a valid Scheme identifier.
    """
    return char.isalnum() or char in builtin_map.keys() or char in ('?', '!', '.')

class Token:
    """ A simple Token structure.
    Contains the token type,
```

Solution

Overall, the code is quite clean and understandable. I have never done anything like what you are doing, but here are some notes about the code:

- The `Token` and `TokenTypes` classes could define `__slots__` for faster attribute access and memory savings.

- `char in builtin_map.keys()` can and should be replaced with just `char in builtin_map`. In Python 2, `.keys()` builds a list of the keys, so the test scans that list instead of using the hash table: O(n) vs. O(1). In Python 3, `.keys()` returns a view with O(1) membership tests, but the shorter form is still the idiomatic one.

- `char in ('?', '!', '.')` should be replaced with something like `char in ADDITIONAL_BUILTIN_CHARS`, where `ADDITIONAL_BUILTIN_CHARS = {'?', '!', '.'}` (note: Python 2.7+ syntax) is a set defined at module level, or maybe even moved to the `_builtins` module to keep all the related constants together.

- All the other places where you use `in` to check whether a character is one of several characters can benefit from a similar improvement: define a set, give it a meaningful name, and move it to module level.

- The `Lexer` should probably support the iterator protocol (advancing to the next token continuously), as some other lexer implementations do.

- The PEP 8 import guidelines suggest putting a blank line between the different groups of imports. Replace:

  ```python
  from collections import namedtuple
  from _builtins import builtin_map
  ```

  with:

  ```python
  from collections import namedtuple

  from _builtins import builtin_map
  ```

- One-line docstrings can stay on a single line (PEP 257).
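As a rough sketch of the `__slots__` and set-membership notes above: the `builtin_map` contents here are a stand-in, since the real `_builtins` module is not shown in the question, and the `value` attribute on `Token` is an assumption (the question only mentions a token type).

```python
# Stand-in for the `builtin_map` imported from `_builtins` in the question;
# its real contents are not shown there, so these entries are assumptions.
builtin_map = {'+': 'add', '-': 'sub', '*': 'mul', '/': 'div'}

# Module-level set with a meaningful name, as suggested in the notes.
ADDITIONAL_BUILTIN_CHARS = {'?', '!', '.'}


def is_identifier(char):
    """Test if `char` is a valid Scheme identifier character."""
    return (char.isalnum()
            or char in builtin_map              # dict membership, no .keys()
            or char in ADDITIONAL_BUILTIN_CHARS)


class Token(object):
    """A token restricted to a fixed set of attributes."""

    # No per-instance __dict__ is created; only these names can be assigned.
    __slots__ = ('type', 'value')

    def __init__(self, type, value):
        self.type = type
        self.value = value
```

One trade-off of `__slots__` is that assigning any attribute not listed raises `AttributeError`, which is usually what you want for a small record type like a token.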

As a side note, I think you should not be putting the "modified date" into the module itself; let Git and GitHub handle it.
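Returning to the iterator-protocol note in the list above, here is a minimal sketch of what that could look like. The `Lexer` below is a hypothetical toy that only splits on whitespace and parentheses; the method name `next_token` and the `PARENS` constant are assumptions, and only the `__iter__`/`__next__` plumbing is the point.

```python
PARENS = {'(', ')'}


class Lexer(object):
    """Toy lexer: parentheses are single tokens, everything else is a word."""

    def __init__(self, text):
        self.text = text
        self.pos = 0

    def next_token(self):
        """Return the next token, or None when the input is exhausted."""
        text = self.text
        # Skip leading whitespace.
        while self.pos < len(text) and text[self.pos].isspace():
            self.pos += 1
        if self.pos >= len(text):
            return None
        start = self.pos
        if text[start] in PARENS:
            self.pos += 1
        else:
            # Consume until whitespace or a parenthesis.
            while (self.pos < len(text)
                   and not text[self.pos].isspace()
                   and text[self.pos] not in PARENS):
                self.pos += 1
        return text[start:self.pos]

    def __iter__(self):
        return self

    def __next__(self):
        token = self.next_token()
        if token is None:
            raise StopIteration
        return token

    next = __next__  # Python 2 spelling of the same method
```

With this in place, a parser can write `for token in Lexer(source):` or call `list(Lexer(source))` instead of driving the lexer by hand.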

Code Snippets

```python
from collections import namedtuple
from _builtins import builtin_map
```

```python
from collections import namedtuple

from _builtins import builtin_map
```

Context

StackExchange Code Review Q#154735, answer score: 5
