patternpythonMinor
SVG path parsing
Viewed 0 times
pathsvgparsing
Problem
I have a module in Python for dealing with SVG paths. One of the problems with this is that the SVG spec is obsessed with saving characters to a pointless extent. As such, this path is valid:
And should be parsed as follows:
As you see, anything non-ambiguous is allowed, including separating two numbers only by a minus sign, as long as that minus-sign is not preceded by an "E" or an "e", in which case it should be interpreted as an exponent of the first number. Letters are commands (except "E" and "e" of course), and both comma and any sort of whitespace is allowed as separators.
My module currently uses a rather ugly way of tokenizing the SVG path by multiple string replacements and then a split:
This in fact is doing a total of 23 string replacements on the path, and this is easy, but seems like it should be slow. I tried doing this other ways, but to my surprise they were all slower. I did a character by character tokenizer, which took around 30-40% more time. A user of the module also suggested a regex:
To my surprise this was also slower than doing 23 string replacements, although just 20-30
"M3.4E-34-2.2e56L23-34z"And should be parsed as follows:
"M", "3.4E-34", "-2.2e56", "L", "23", "-34", "z"As you see, anything non-ambiguous is allowed, including separating two numbers only by a minus sign, as long as that minus-sign is not preceded by an "E" or an "e", in which case it should be interpreted as an exponent of the first number. Letters are commands (except "E" and "e" of course), and both comma and any sort of whitespace is allowed as separators.
My module currently uses a rather ugly way of tokenizing the SVG path by multiple string replacements and then a split:
COMMANDS = set('MmZzLlHhVvCcSsQqTtAa')
def _tokenize_path_replace(pathdef):
# First handle negative exponents:
pathdef = pathdef.replace('e-', 'NEGEXP').replace('E-', 'NEGEXP')
# Commas and minus-signs are separators, just like spaces.
pathdef = pathdef.replace(',', ' ').replace('-', ' -')
pathdef = pathdef.replace('NEGEXP', 'e-')
# Commands are allowed without spaces around. Let's insert spaces so it's
# easier to split later.
for c in COMMANDS:
pathdef = pathdef.replace(c, ' %s ' % c)
# Split the path into elements
return pathdef.split()This in fact is doing a total of 23 string replacements on the path, and this is easy, but seems like it should be slow. I tried doing this other ways, but to my surprise they were all slower. I did a character by character tokenizer, which took around 30-40% more time. A user of the module also suggested a regex:
import re
TOKEN_RE = re.compile("[MmZzLlHhVvCcSsQqTtAa]|[-+]?[0-9]*\.?[0-9]+(?:[eE]
[-+]?[0-9]+)?")
def _tokenize_path_replace(pathdef):
return TOKEN_RE.findall(pathdef)To my surprise this was also slower than doing 23 string replacements, although just 20-30
Solution
Actually, the problem at hand might not be ideally addressed by replacements or regular expressions. The way how SVG path data is designed, seems to make going through the path string character by character more efficient.
There are essentially five different cases one has to examine. This is a direct consequence of the BNF of SVG Paths (http://www.w3.org/TR/SVG/paths.html#PathDataBNF). When looping over the string, the next character could be
In these five cases the following things are done:
A code example might look like this:
I've run some tests with the above code and it was mostly twice as fast as the regexp version.
In addition, due to the clear cases it is relativly easy to understand what's going on here. This is also helpful for finding and correcting errors.
The problem in the regexp version Peter Taylor pointed to is not severe in most cases. But it might actually lead to wrong parsing. If one considers for example the coordinate
There are essentially five different cases one has to examine. This is a direct consequence of the BNF of SVG Paths (http://www.w3.org/TR/SVG/paths.html#PathDataBNF). When looping over the string, the next character could be
- a digit or one of the letters 'e' and 'E',
- a comma or a whitespace,
- a command letter, i.e. one of the letters 'MmZzLlHhVvCcSsQqTtAa',
- the dot '.',
- a sign, i.e. '+' or '-'.
In these five cases the following things are done:
- We are inside a number and can simply append the character to the current entity.
- We have encountered a separator and if not already done in the last step, a new empty entity is started.
- We found a command. This command is added as a separate entity and a new empty one is started.
- In this case it could be that we were already inside a float (if the flag 'float' is True). Then the dot starts a new entity. Otherwise the dot indicates that we are now in a float. The flag 'float' is then set to True and we stay in the current entity.
- In this case it could be that the sign is the one of an exponent. Then it is just appended to the current entity. Otherwise a new one is started.
A code example might look like this:
def parse_path(path_data):
digit_exp = '0123456789eE'
comma_wsp = ', \t\n\r\f\v'
drawto_command = 'MmZzLlHhVvCcSsQqTtAa'
sign = '+-'
exponent = 'eE'
float = False
entity = ''
for char in path_data:
if char in digit_exp:
entity += char
elif char in comma_wsp and entity:
yield entity
float = False
entity = ''
elif char in drawto_command:
if entity:
yield entity
float = False
entity = ''
yield char
elif char == '.':
if float:
yield entity
entity = '.'
else:
entity += '.'
float = True
elif char in sign:
if entity and entity[-1] not in exponent:
yield entity
float = False
entity = char
else:
entity += char
if entity:
yield entityI've run some tests with the above code and it was mostly twice as fast as the regexp version.
In addition, due to the clear cases it is relativly easy to understand what's going on here. This is also helpful for finding and correcting errors.
The problem in the regexp version Peter Taylor pointed to is not severe in most cases. But it might actually lead to wrong parsing. If one considers for example the coordinate
2.e2, the regexp version leads to the two coordinates '2','2'.Code Snippets
def parse_path(path_data):
digit_exp = '0123456789eE'
comma_wsp = ', \t\n\r\f\v'
drawto_command = 'MmZzLlHhVvCcSsQqTtAa'
sign = '+-'
exponent = 'eE'
float = False
entity = ''
for char in path_data:
if char in digit_exp:
entity += char
elif char in comma_wsp and entity:
yield entity
float = False
entity = ''
elif char in drawto_command:
if entity:
yield entity
float = False
entity = ''
yield char
elif char == '.':
if float:
yield entity
entity = '.'
else:
entity += '.'
float = True
elif char in sign:
if entity and entity[-1] not in exponent:
yield entity
float = False
entity = char
else:
entity += char
if entity:
yield entityContext
StackExchange Code Review Q#28502, answer score: 8
Revisions (0)
No revisions yet.