Using Python to tokenize a string in a stack-based programming language

Submitted by: @import:stackexchange-codereview

Problem

As an avid user of esoteric programming languages, I'm currently designing one, interpreted in Python.

The language will be stack-based and scripts will be a mix of symbols, letters, and digits, like so:

ab123c-43=f54g6!6


This will be read from a file and passed to a new Script object to be tokenized. The result should be a list in which adjacent digits are grouped together and converted into ints, while every other character stays an individual single-character string, resulting in this list:

['a', 'b', 123, 'c', '-', 43, '=', 'f', 54, 'g', 6, '!', 6]


Currently, I am doing this using regex, with the following code (get_tokens is the main method):

import re

class Script:
    def __init__(self, text):
        self._tokens = Script.get_tokens(text)

    def get_tokens(text):
        """ Static tokenizer function.
        Uses regex to turn the string into a list of tokens.
        Groups adjacent digits together and converts to integers. """

        tokens = re.findall('\d+|[^\d]', text)

        for i in range(len(tokens)):
            if tokens[i].isdigit():
                tokens[i] = int(tokens[i])

        return tokens

    def run(self):
        raise NotImplementedError()


The resulting tokens can be seen with Script("ab123c-43=f54g6!6")._tokens.

I have a few questions about my method:

  • Is there a more Pythonic or elegant way to do this (e.g., an alternative to regex)?

  • Should I really be using a static function like that, or should I move it outside the class?

  • If the loop is to be kept, should I be using enumerate instead?

        for i, t in enumerate(tokens):
            if t.isdigit():
                tokens[i] = int(t)

Solution

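It would be good practice to use r'raw strings' for regexes, especially since this one contains a backslash. Instead of [^\d], you could use \D, or simply ., since any digits would have been slurped up by the \d+ already. (One difference: . does not match a newline unless the re.DOTALL flag is set, so prefer \D if newlines in the script should become tokens.)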
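As a quick check of the pattern suggestions, the sketch below runs the three spellings against the sample script from the question. The variable names are illustrative only, and the two-line input at the end is a made-up example, not something from the original post.

```python
import re

text = "ab123c-43=f54g6!6"

# The original pattern, written as a raw string.
original = re.findall(r'\d+|[^\d]', text)
# \D is shorthand for [^\d].
shorthand = re.findall(r'\d+|\D', text)
# '.' gives the same result here because \d+ is tried first at each position.
dot = re.findall(r'\d+|.', text)

assert original == shorthand == dot
print(dot)

# Made-up input containing a newline: '.' skips it, \D keeps it.
with_dot = re.findall(r'\d+|.', "a\n1")
with_nondigit = re.findall(r'\d+|\D', "a\n1")
print(with_dot)       # ['a', '1']
print(with_nondigit)  # ['a', '\n', '1']
```

For single-line scripts the three patterns are interchangeable, so the choice mostly comes down to readability.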

Testing .isdigit() feels redundant to me, since the regex was already detecting digits. If you use capturing parentheses, then you could immediately tell which portion of the regex matched, and avoid having to make a second pass to convert the integers.
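One possible way to let the regex itself report which alternative matched is to use named groups together with match.lastgroup. This is a sketch of that idea rather than the approach the answer ends up using; the group names num and sym and the tokenize helper are hypothetical names chosen for the example.

```python
import re

# Hypothetical group names: 'num' for digit runs, 'sym' for any other character.
TOKEN_RE = re.compile(r'(?P<num>\d+)|(?P<sym>.)')

def tokenize(text):
    """Return the tokens of text, converting digit runs to int in one pass."""
    tokens = []
    for match in TOKEN_RE.finditer(text):
        # lastgroup names the capturing group that took part in this match.
        if match.lastgroup == 'num':
            tokens.append(int(match.group()))
        else:
            tokens.append(match.group())
    return tokens

print(tokenize("ab123c-43=f54g6!6"))
# ['a', 'b', 123, 'c', '-', 43, '=', 'f', 54, 'g', 6, '!', 6]
```

Because the branch is known at match time, no separate .isdigit() pass over the token list is needed.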

If, as you say, this Script is meant to be interpreted, consider making _tokens an iterator rather than a list, which saves you from having to parse the entire string at once. To that end, I recommend re.finditer() rather than re.findall().
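To illustrate the difference, here is a small sketch on a shortened input. re.findall() builds the whole list up front, while re.finditer() hands back match objects one at a time as they are requested. Note that the source string itself is still held in memory either way; only the matching work is deferred.

```python
import re

text = "ab123c"

# findall scans the entire string and returns a complete list.
eager = re.findall(r'\d+|.', text)

# finditer returns an iterator; each match is located only when asked for.
lazy = re.finditer(r'\d+|.', text)
first = next(lazy).group()         # 'a'
second = next(lazy).group()        # 'b'
rest = [m.group() for m in lazy]   # whatever has not been consumed yet

print(eager)                 # ['a', 'b', '123', 'c']
print(first, second, rest)   # a b ['123', 'c']
```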

I don't see any need to define a public method for tokenizing. The generator expression could be written directly within the constructor:

import re

class Script:
    def __init__(self, text):
        self._tokens = (
            match.group(1) or int(match.group(0))
            for match in re.finditer(r'\d+|(.)', text)
        )
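A short usage sketch of the class above, repeated here so the example runs on its own. It checks the output against the token list from the question and shows one consequence of storing a generator: the tokens can be consumed only once. The names script, first_pass, and second_pass are illustrative.

```python
import re

class Script:
    def __init__(self, text):
        # group(1) is set for a non-digit character; otherwise the match
        # was a run of digits, which is converted to int.
        self._tokens = (
            match.group(1) or int(match.group(0))
            for match in re.finditer(r'\d+|(.)', text)
        )

script = Script("ab123c-43=f54g6!6")

first_pass = list(script._tokens)
second_pass = list(script._tokens)  # the generator is already exhausted

print(first_pass)
# ['a', 'b', 123, 'c', '-', 43, '=', 'f', 54, 'g', 6, '!', 6]
print(second_pass)
# []
```

If the interpreter needs to revisit tokens (for loops or jumps, say), materializing them with list() may be the simpler choice.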

Context

StackExchange Code Review Q#150157, answer score: 9
