
Python parser for attributes in a HAML template


Problem

I'm working on a feature for the HamlPy (Haml for Django) project:

About Haml

For those who don't know, Haml is an indentation-based markup language which compiles to HTML:

%ul#athletes
    - for athlete in athlete_list
        %li.athlete{'id': 'athlete_{{ athlete.pk }}'}= athlete.name

compiles to

<ul id='athletes'>
    {% for athlete in athlete_list %}
        <li class='athlete' id='athlete_{{ athlete.pk }}'>{{ athlete.name }}</li>
    {% endfor %}
</ul>


The code

{'id': 'athlete_{{ athlete.pk }}'} is referred to as the 'attribute dictionary'. It is an (almost) valid Python dictionary and is currently parsed with some very ugly regular expressions and an eval(). However, I would like to add some features to it that would make it no longer a valid Python dictionary, e.g. using Haml within the attributes:

%a.link{
    'class':
        - if forloop.first
            link-first
        - else
            - if forloop.last
                link-last
    'href':
        - url some_view
    }


among other things.

I began by writing a class that I could swap in for the eval() and that would pass all of the current tests:

```
import re

# Valid characters for a dictionary key
re_key = re.compile(r'[a-zA-Z0-9-_]+')
re_nums = re.compile(r'[0-9\.]+')

class AttributeParser:
    """Parses comma-separated HamlPy attribute values"""

    def __init__(self, data, terminator):
        self.terminator = terminator
        self.s = data.lstrip()
        # Index of current character being read
        self.ptr = 1

    def consume_whitespace(self, include_newlines=False):
        """Moves the pointer to the next non-whitespace character"""
        whitespace = (' ', '\t', '\r', '\n') if include_newlines else (' ', '\t')

        while self.ptr < len(self.s) and self.s[self.ptr] in whitespace:
            self.ptr += 1
        return self.ptr

    def consume_end_of_value(self):
        # End of value: a comma, or the end of the attribute string
        self.ptr = self.consume_whitespace()
        if self.s[self.ptr] != self.terminator:
            if self.s[self.ptr] == ',':
                self.ptr += 1
            else:
                raise Exception("Expected ',' or '%s'" % self.terminator)
```

Solution


1. Answers to your questions



1. If the goal of the project is to be able to include Haml in attribute values, then you have no choice but to switch to your own parser. I haven't looked at the set of test cases, but it does seem plausible that you are going to introduce incompatibilities, because of the complexity of Python's own parser: you are going to find that you have users who relied on the oddities of Python's string syntax (r-strings, \u-escapes and all).

   The way to manage the transition from the old parser to the new one is to start out by shipping both, with the old parser selected by default but the new parser selectable with an option (see the sketch after this list). This gives your users time to discover the incompatibilities and fix them (or submit bug reports). Then, in a later release, make the new parser the default, keeping the old parser available but deprecated. Finally, remove the old parser.

2. Correctness and simplicity first, speed later. You can always port the parser to C if nothing else will do.

3. My answer to question 1 applies here too.

4. See below.
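
To make that transition concrete, here is a minimal sketch of what such an option might look like. This is not HamlPy's actual API: parse_attributes, use_new_parser and legacy_eval_parse are hypothetical names, and Parser is the class developed in section 3 below.

```
import warnings

def parse_attributes(source, use_new_parser=False):
    """Hypothetical wrapper that dispatches between the legacy
    eval()-based parser and the new hand-written parser, so users
    can opt in before the default changes."""
    if use_new_parser:
        return Parser(source).parse_attribute_dict()
    warnings.warn("the eval()-based attribute parser is deprecated; "
                  "pass use_new_parser=True to try its replacement",
                  PendingDeprecationWarning)
    return legacy_eval_parse(source)  # the existing regex + eval() path
```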

2. Designing a parser



Now, let's look at the code. I thought about making a series of comments on the various misfeatures, but that seems less than helpful, given that the whole design of the parser isn't quite right:

- There's no separation between the lexer and the parser.

- You have different classes for different productions in your syntax, so that each time you need to parse a tuple or list, you construct a new AttributeTupleAndListParser object, construct a string for it to parse (by copying the tail of the original string), and then throw away the parser object when done.

- Some of your parsing methods don't seem well matched to the syntax of the language, making it difficult to understand what they do. consume_end_of_value is a good example: it doesn't correspond to anything natural in the syntax.

Computer science is by no means a discipline with all the answers, but one thing we do know how to do is write a parser! You don't have to have read the dragon book from cover to cover to know the convention: develop a formal grammar for your language (often written down in a variation on Backus–Naur form), then split your code into a lexical analyzer (which transforms source code into tokens, using a finite state machine or something similar) and a parser (which takes a stream of tokens and constructs a syntax tree, or some other form of output, based on the syntax of the input).

Sticking to this convention has a bunch of advantages: the existence of a formal grammar makes it easier to build compatible implementations; you can modify and test the lexical analyzer independently from the parser and vice versa; and other programmers will find it easier to understand and modify your code.
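
For example, the small subset of the attribute language implemented in section 3 below can be written down as the following grammar (it is the same one that appears in the Parser docstring):

```
attribute-dict ::= '{' [attribute-list] '}'
attribute-list ::= attribute (',' attribute)*
attribute      ::= string ':' value
value          ::= string | '[' [value-list] ']'
value-list     ::= value (',' value)*
```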
3. Rewriting your parser conventionally



Here's how I might start rewriting your parser to use the conventional approach. This implements a deliberately incomplete subset of the HamlPy attribute language, in order to keep the code short and to get it finished in a reasonable amount of time.

First, a class whose instances represent source tokens. The original string and the position of each token are recorded in the token so that we can easily produce an error message relating to it. I've used the built-in exception SyntaxError here so that the error messages match those from other Python libraries. (You might later want to extend the class so that it can represent tokens from files as well as tokens from strings.) The Token class, together with the Tokenizer and Parser built on top of it, is given in full in the Code Snippets section below.

Code Snippets

```
import re

class Token(object):
    """
    An object representing a token in a HamlPy document. Construct it
    using `Token(type, value, source, start, end)` where:

    `type` is the token type (`Token.DELIMITER`, `Token.STRING`, etc);
    `value` is the token value;
    `source` is the string from which the token was taken;
    `start` is the character position in `source` where the token starts;
    `end` is the character position in `source` where the token finishes.
    """

    # Enumeration of token types.
    DELIMITER = 1
    STRING = 2
    END = 3
    ERROR = 4

    def __init__(self, type, value, source, start, end):
        self.type = type
        self.value = value
        self.source = source
        self.start = start
        self.end = end

    def __repr__(self):
        type_name = 'UNKNOWN'
        for attr in dir(self):
            if getattr(self, attr) == self.type:
                type_name = attr
                break
        return ('Token(Token.{0}, {1}, {2}, {3}, {4})'
                .format(type_name, repr(self.value), repr(self.source),
                        self.start, self.end))

    def matches(self, type, value):
        """
        Return True iff this token matches the given `type` and `value`.
        """
        return self.type == type and self.value == value

    def error(self, msg):
        """
        Return a `SyntaxError` object describing a problem with this
        token. The argument `msg` is the error message; the token's
        line number and position are also reported.
        """
        line_start = 1 + self.source.rfind('\n', 0, self.start)
        line_end = self.source.find('\n', self.end)
        if line_end == -1: line_end = len(self.source)
        e = SyntaxError(msg)
        e.lineno = 1 + self.source.count('\n', 0, self.start)
        e.text = self.source[line_start: line_end]
        e.offset = self.start - line_start + 1
        return e

class Tokenizer(object):
    """
    Tokenizer for a subset of HamlPy. Instances of this class support
    the iterator protocol, and yield tokens from the string `s` as
    `Token` objects. When the string is exhausted, an END token is
    yielded.

    >>> from pprint import pprint
    >>> pprint(list(Tokenizer('{"a":"b"}')))
    [Token(Token.DELIMITER, '{', '{"a":"b"}', 0, 1),
     Token(Token.STRING, 'a', '{"a":"b"}', 2, 3),
     Token(Token.DELIMITER, ':', '{"a":"b"}', 4, 5),
     Token(Token.STRING, 'b', '{"a":"b"}', 6, 7),
     Token(Token.DELIMITER, '}', '{"a":"b"}', 8, 9),
     Token(Token.END, '', '{"a":"b"}', 9, 9)]
    """
    def __init__(self, s):
        self.iter = self.tokenize(s)

    def __iter__(self):
        return self

    def next(self):
        return next(self.iter)

    # Alias so the iterator protocol also works on Python 3.
    __next__ = next

    # Regular expression matching a source token.
    token_re = re.compile(r'''
        \s*                                 # Ignore initial whitespace
        (?:([][{},:])                       # 1. Delimiter
          |'([^\\']*(?:\\.[^\\']*)*)'       # 2. Single-quoted string
          |"([^\\"]*(?:\\.[^\\"]*)*)"       # 3. Double-quoted string
          |(\S)                             # 4. Something else
        )''', re.X)

    # Regular expression matching a backslash and following character.
    backslash_re = re.compile(r'\\(.)')

    def tokenize(self, s):
        for m in self.token_re.finditer(s):
            # Compare the groups against None rather than testing
            # truthiness, so that empty strings ('' or "") still
            # produce STRING tokens.
            if m.group(1) is not None:
                yield Token(Token.DELIMITER, m.group(1),
                            s, m.start(1), m.end(1))
            elif m.group(2) is not None:
                yield Token(Token.STRING,
                            self.backslash_re.sub(r'\1', m.group(2)),
                            s, m.start(2), m.end(2))
            elif m.group(3) is not None:
                yield Token(Token.STRING,
                            self.backslash_re.sub(r'\1', m.group(3)),
                            s, m.start(3), m.end(3))
            else:
                t = Token(Token.ERROR, m.group(4), s, m.start(4), m.end(4))
                raise t.error('Unexpected character')
        yield Token(Token.END, '', s, len(s), len(s))

class Parser(object):
    """
    Parser for the subset of HamlPy with the following grammar:

    attribute-dict ::= '{' [attribute-list] '}'
    attribute-list ::= attribute (',' attribute)*
    attribute      ::= string ':' value
    value          ::= string | '[' [value-list] ']'
    value-list     ::= value (',' value)*
    """

    def __init__(self, s):
        self.tokenizer = Tokenizer(s)
        self.lookahead = None       # The lookahead token.
        self.next_token()           # Lookahead one token.

    def next_token(self):
        """
        Consume and return the current lookahead token, reading the
        next token from the lexer into the lookahead.
        """
        t = self.lookahead
        self.lookahead = next(self.tokenizer)
        return t

    # Regular expression matching an allowable key.
    key_re = re.compile(r'[a-zA-Z_0-9-]+$')

    def parse_value(self):
        t = self.next_token()
        if t.type == Token.STRING:
            return t.value
        elif t.matches(Token.DELIMITER, '['):
            return list(self.parse_value_list())
        else:
            raise t.error('Expected a value')

    def parse_value_list(self):
        if self.lookahead.matches(Token.DELIMITER, ']'):
            self.next_token()
            return
        while True:
            yield self.parse_value()
            t = self.next_token()
            if t.matches(Token.DELIMITER, ']'):
                return
            elif not t.matches(Token.DELIMITER, ','):
                raise t.error('Expected "," or "]"')

    def parse_attribute(self):
        t = self.next_token()
        if t.type != Token.STRING:
            raise t.error('Expected a string')
        key = t.value
        if not self.key_re.match(key):
            raise t.error('Invalid key')
        t = self.next_token()
        if not t.matches(Token.DELIMITER, ':'):
            raise t.error('Expected ":"')
        value = self.parse_value()
        return key, value

    def parse_attribute_list(self):
        if self.lookahead.matches(Token.DELIMITER, '}'):
            self.next_token()
            return
        while True:
            yield self.parse_attribute()
            t = self.next_token()
            if t.matches(Token.DELIMITER, '}'):
                return
            elif not t.matches(Token.DELIMITER, ','):
                raise t.error('Expected "," or "}"')

    def parse_attribute_dict(self):
        t = self.next_token()
        if not t.matches(Token.DELIMITER, '{'):
            raise t.error('Expected "{"')
        return dict(self.parse_attribute_list())
```
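
As a quick illustration (this example is mine, not part of the original answer; the dictionary repr assumes Python 3.7+ key ordering), here is the parser exercised end to end, including the error reporting provided by Token.error:

```
>>> Parser("{'id': 'athlete_1', 'class': ['link', 'first']}").parse_attribute_dict()
{'id': 'athlete_1', 'class': ['link', 'first']}
>>> try:
...     Parser("{'id' 'x'}").parse_attribute_dict()
... except SyntaxError as e:
...     print(e.msg, e.lineno, e.offset)
Expected ":" 1 8
```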

Context

StackExchange Code Review Q#15395, answer score: 7
