Using Python to tokenize a string in a stack-based programming language
Problem
As an avid user of esoteric programming languages, I'm currently designing one, interpreted in Python.
The language will be stack-based and scripts will be a mix of symbols, letters, and digits, like so:
ab123c-43=f54g6!6
This will be read from the file and passed to a new Script object, to tokenize. It needs to be tokenized into a list of strings, where adjacent digits are grouped together and converted into ints, but letters are individual, resulting in this array:
['a', 'b', 123, 'c', '-', 43, '=', 'f', 54, 'g', 6, '!', 6]
Currently, I am doing this using regex, with the following code (get_tokens is the main method):
import re
class Script:
    def __init__(self, text):
        self._tokens = Script.get_tokens(text)
    def get_tokens(text):
        """ Static tokenizer function.
        Uses regex to turn the string into a list of tokens.
        Groups adjacent digits together and converts to integers. """
        tokens = re.findall('\d+|[^\d]', text)
        for i in range(len(tokens)):
            if tokens[i].isdigit():
                tokens[i] = int(tokens[i])
        return tokens
    def run(self):
        raise NotImplementedError()
The resulting tokens can be seen with Script("ab123c-43=f54g6!6")._tokens.
I have a few questions about my method:
- Is there any more Pythonic or elegant way to do this? (e.g, an alternative instead of regexing)
- Should I really be using a static function like that, or should I move it outside the class?
- If the loop is to be kept, should I be using enumerate instead?
for i, t in enumerate(tokens):
    if t.isdigit():
        tokens[i] = int(t)
Solution
It would be good practice to use r'raw strings' for regexes, especially since this one contains a backslash. Instead of [^\d], you could use \D, or better yet, just ., since any digits would have been slurped up by the \d+ already.
Testing .isdigit() feels redundant to me, since the regex was already detecting digits. If you use capturing parentheses, then you could immediately tell which portion of the regex matched, and avoid having to make a second pass to convert the integers.
If, as you say, this Script is meant to be interpreted, consider making _tokens an iterator rather than a list, which saves you from having to parse the entire string at once. To that end, I recommend re.finditer() rather than re.findall().
I don't see any need to define a public method for tokenizing. The generator expression could be written directly within the constructor:
import re
class Script:
    def __init__(self, text):
        self._tokens = (
            match.group(1) or int(match.group(0))
            for match in re.finditer(r'\d+|(.)', text)
        )
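To see the capture-group idea in isolation, here is a minimal sketch that pulls the same regex into a standalone generator. The names `tokenize` and `_TOKEN_RE` are hypothetical and not part of the original post. It assumes the same pattern as above and illustrates two consequences of the suggested design: the token stream is lazy and can be consumed only once, and `.` does not match a newline by default, so newlines are silently skipped (unlike with `[^\d]`).

```python
import re

# Hypothetical module-level helper; the names below are illustrative only.
_TOKEN_RE = re.compile(r'\d+|(.)')

def tokenize(text):
    """Lazily yield tokens: ints for digit runs, single characters otherwise."""
    for match in _TOKEN_RE.finditer(text):
        # group(1) is set only when the single-character branch matched;
        # otherwise the whole match is a run of digits.
        char = match.group(1)
        yield char if char is not None else int(match.group(0))

tokens = tokenize("ab123c-43=f54g6!6")
first_pass = list(tokens)   # consumes the generator
second_pass = list(tokens)  # the generator is exhausted, so this is empty

print(first_pass)   # ['a', 'b', 123, 'c', '-', 43, '=', 'f', 54, 'g', 6, '!', 6]
print(second_pass)  # []
```

If the interpreter ever needs to rewind or look at tokens more than once, wrapping the call in list() restores the original list behaviour; passing re.DOTALL to re.compile would make `.` match newlines as well.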
Context
StackExchange Code Review Q#150157, answer score: 9