patternpythonMinor

SVG path parsing

Submitted by: @import:stackexchange-codereview·Mar 10, 2026·

Viewed 0 times

pathsvgparsing

Problem

I have a module in Python for dealing with SVG paths. One of the problems with this is that the SVG spec is obsessed with saving characters to a pointless extent. As such, this path is valid:

"M3.4E-34-2.2e56L23-34z"

And should be parsed as follows:

"M", "3.4E-34", "-2.2e56", "L", "23", "-34", "z"

As you see, anything non-ambiguous is allowed, including separating two numbers only by a minus sign, as long as that minus-sign is not preceded by an "E" or an "e", in which case it should be interpreted as an exponent of the first number. Letters are commands (except "E" and "e" of course), and both comma and any sort of whitespace is allowed as separators.

My module currently uses a rather ugly way of tokenizing the SVG path by multiple string replacements and then a split:

COMMANDS = set('MmZzLlHhVvCcSsQqTtAa')

def _tokenize_path_replace(pathdef):
    # First handle negative exponents:
    pathdef = pathdef.replace('e-', 'NEGEXP').replace('E-', 'NEGEXP')
    # Commas and minus-signs are separators, just like spaces.
    pathdef = pathdef.replace(',', ' ').replace('-', ' -')
    pathdef = pathdef.replace('NEGEXP', 'e-')
    # Commands are allowed without spaces around. Let's insert spaces so it's
    # easier to split later.
    for c in COMMANDS:
        pathdef = pathdef.replace(c, ' %s ' % c)

    # Split the path into elements
    return pathdef.split()

This in fact is doing a total of 23 string replacements on the path, and this is easy, but seems like it should be slow. I tried doing this other ways, but to my surprise they were all slower. I did a character by character tokenizer, which took around 30-40% more time. A user of the module also suggested a regex:

import re
TOKEN_RE = re.compile("[MmZzLlHhVvCcSsQqTtAa]|[-+]?[0-9]*\.?[0-9]+(?:[eE]
[-+]?[0-9]+)?")

def _tokenize_path_replace(pathdef):
    return TOKEN_RE.findall(pathdef)

To my surprise this was also slower than doing 23 string replacements, although just 20-30

Solution

Actually, the problem at hand might not be ideally addressed by replacements or regular expressions. The way how SVG path data is designed, seems to make going through the path string character by character more efficient.

There are essentially five different cases one has to examine. This is a direct consequence of the BNF of SVG Paths (http://www.w3.org/TR/SVG/paths.html#PathDataBNF). When looping over the string, the next character could be

a digit or one of the letters 'e' and 'E',

a comma or a whitespace,

a command letter, i.e. one of the letters 'MmZzLlHhVvCcSsQqTtAa',

the dot '.',

a sign, i.e. '+' or '-'.

In these five cases the following things are done:

We are inside a number and can simply append the character to the current entity.

We have encountered a separator and if not already done in the last step, a new empty entity is started.

We found a command. This command is added as a separate entity and a new empty one is started.

In this case it could be that we were already inside a float (if the flag 'float' is True). Then the dot starts a new entity. Otherwise the dot indicates that we are now in a float. The flag 'float' is then set to True and we stay in the current entity.

In this case it could be that the sign is the one of an exponent. Then it is just appended to the current entity. Otherwise a new one is started.

A code example might look like this:

def parse_path(path_data):
    digit_exp = '0123456789eE'
    comma_wsp = ', \t\n\r\f\v'
    drawto_command = 'MmZzLlHhVvCcSsQqTtAa'
    sign = '+-'
    exponent = 'eE'
    float = False
    entity = ''
    for char in path_data:
        if char in digit_exp:
            entity += char
        elif char in comma_wsp and entity:
            yield entity
            float = False
            entity = ''
        elif char in drawto_command:
            if entity:
                yield entity
                float = False
                entity = ''
            yield char
        elif char == '.':
            if float:
                yield entity
                entity = '.'
            else:
                entity += '.'
                float = True
        elif char in sign:
            if entity and entity[-1] not in exponent:
                yield entity
                float = False
                entity = char
            else:
                entity += char
    if entity:
        yield entity

I've run some tests with the above code and it was mostly twice as fast as the regexp version.

In addition, due to the clear cases it is relativly easy to understand what's going on here. This is also helpful for finding and correcting errors.

The problem in the regexp version Peter Taylor pointed to is not severe in most cases. But it might actually lead to wrong parsing. If one considers for example the coordinate 2.e2, the regexp version leads to the two coordinates '2','2'.

Code Snippets

def parse_path(path_data):
    digit_exp = '0123456789eE'
    comma_wsp = ', \t\n\r\f\v'
    drawto_command = 'MmZzLlHhVvCcSsQqTtAa'
    sign = '+-'
    exponent = 'eE'
    float = False
    entity = ''
    for char in path_data:
        if char in digit_exp:
            entity += char
        elif char in comma_wsp and entity:
            yield entity
            float = False
            entity = ''
        elif char in drawto_command:
            if entity:
                yield entity
                float = False
                entity = ''
            yield char
        elif char == '.':
            if float:
                yield entity
                entity = '.'
            else:
                entity += '.'
                float = True
        elif char in sign:
            if entity and entity[-1] not in exponent:
                yield entity
                float = False
                entity = char
            else:
                entity += char
    if entity:
        yield entity

Context

StackExchange Code Review Q#28502, answer score: 8

Revisions (0)

No revisions yet.