patternpythonMinor

Converting domain-specific regular-expressions to a list of all matching instances

Submitted by: @import:stackexchange-codereview·Mar 10, 2026·

Viewed 0 times

instancesallregularexpressionsspecificconvertinglistdomainmatching

Problem

There seem to be several questions floating around Stackexchange regarding how to take a python regular expression list the matching instances. This problem is a bit different because 1) I'm need to start off with a domain-specific regex system, not just the standard python regex, and 2) the possible matches for any given location in string are limited to a specific 'alphabet' of characters.

The current implementation works, but I've got a nagging feeling there's a much more succinct way to do it.

```
import numpy as np

def shorthand_to_instances(shorthand,alphabet):
"""
Converts the shorthand-notation for bioinformatics 'motifs'
into the all the instances of sequences they would match.
Brackets means "it could be any of these letters" in this location.
Curly braces means "it's not one of these letters" in this location,
it can be any letter in the given "alphabet" except these.
Lowercase "x" means it could be any of the letters in the given
"alphabet".

>>> binf_motifs_to_sequences('T[AG]','AGCT')
['TA', 'TG']

>>> binf_motifs_to_sequences('T{A}T','AGCT')
['TGT', 'TCT', 'TTT']

>>> binf_motifs_to_sequences('G{AGC}AGTC','AGCT')
['GTAGTC']

>>> binf_motifs_to_sequences('xGTC','AGCT')
['AGTC', 'GGTC', 'CGTC', 'TGTC']

"""
instances=[]
shc = shorthand
while len(shc)>0:
char = shc[0]
assert char in alphabet+'['+']'+'{'+'}'+'x', "Failed assert: '{0}'".format(char)
if char in alphabet:
# it's in the alphabet
if len(instances)==0:
instances.append(char)
else:
for i in list(np.arange(len(instances))):
instances[i] += char
shc = shc[1:]
elif char=='[':
# it's a 'one of these'
chargroup,c,e = shc[1:].partition(']')
if len(instances)==0:
for char in chargroup:
instances.append

Solution

Review

-
This is clear, well-documented code.

-
The doctests all call binf_motifs_to_sequences but your function is named shorthand_to_instances so all the doctests fail.

-
There's no need to have this kind of infrastructure for running doctests:

if __name__ == "__main__":
    import doctest
    doctest.testmod()

since you can easily run doctests from the command line using Python's -m option:

python -m doctest mymodule.py

-
Assertions should be used for programming errors (conditions that you believe can't happen), but this assertion:

assert char in alphabet+'['+']'+'{'+'}'+'x'

is really a runtime error (it might happen if bad data was passed to the function), so it would be better to raise an exception.

-
A given shorthand can have exponentially many matching sequences. For example, xxxx already has 256 matching sequences over a four-character alphabet. For this reason, I would avoid returning a list of matching sequences, as this requires building a list and storing all the sequences in memory. It would be better to generate the sequences one at a time.

Simplifying

The way to make this kind of code simpler is to use functional decomposition to divide it into pieces that are simpler to understand, implement, and test.

A sensible division into two pieces would be to first, parse the shorthand notation, producing some kind of data structure; and second, take the data structure and generate the matching sequences. Since the alphabet is likely to be small, the natural data structure to use is just a sequence of strings giving the characters that match at each position. So given the shorthand xT[AG]{AT} over the alphabet ACGT, the intermediate data would be the sequence ['ACGT', 'T', 'AG', 'CG'].

(If alphabets were expected to be much larger, then this might be somewhat wasteful in space, and so there might be a need to have a more sophisticated data structure here.)

I've called the first step "parsing" but really it's so simple that it's just lexical analysis and in Python this can be done simply by writing a regular expression that matches a token, and calling re.finditer to loop over the tokens:

import re

class ParseError(Exception): pass

def shorthand_parse(shorthand, alphabet):
    """Parse bioinformatics shorthand notation and generate strings giving
    the characters in alphabet that are allowed to match at each
    position in the sequence.

    Square brackets means "it could be any of these letters" in this
    location. Curly braces means "it's *not* one of these letters" in
    this location (it can be any letter in the given alphabet except
    these). Lowercase "x" means it could be any of the letters in the
    given alphabet.

    >>> list(shorthand_parse('xT[AG]{AT}', 'ACGT'))
    ['ACGT', 'T', 'AG', 'CG']

    """
    # Use set() to remove duplicates, and sorted() to ensure that the
    # output is in a canonical order (for predictability and testing).
    alphabet = set(alphabet)
    alphabet_str = ''.join(sorted(alphabet))    
    shorthand_re = r'''(x)             # Any character in alphabet
                      |([{0}])         # Single character
                      |\[([{0}]*)\]    # One of these characters
                      |{{([{0}]*)}}    # Not one of these characters
                      |(.)             # Anything else is an error
                    '''.format(alphabet_str)
    for m in re.finditer(shorthand_re, shorthand, re.VERBOSE):
        x, single, oneof, noneof, error = m.groups()
        if x:
            yield alphabet_str
        elif single:
            yield single
        elif oneof:
            yield ''.join(sorted(set(oneof)))
        elif noneof:
            yield ''.join(sorted(alphabet - set(noneof)))
        else:
            raise ParseError("unexpected character '{}'".format(error))

Now that we have the strings of characters that are valid at each position, we want the Cartesian product of those strings. This can be computed by itertools.product, and in fact the output of shorthand_parse is already in the right format for input to that function:

import itertools

def shorthand_sequences(shorthand, alphabet):
    """Generate all sequences matching the given bioinformatics shorthand
    notation over the given alphabet.

    >>> list(shorthand_sequences('T[AG]', 'AGCT'))
    ['TA', 'TG']
    >>> list(shorthand_sequences('T{A}T', 'AGCT'))
    ['TCT', 'TGT', 'TTT']
    >>> list(shorthand_sequences('G{AGC}AGTC', 'AGCT'))
    ['GTAGTC']
    >>> list(shorthand_sequences('xGTC', 'AGCT'))
    ['AGTC', 'CGTC', 'GGTC', 'TGTC']
    >>> list(shorthand_sequences('[AG]{AT}[CG]', 'AGCT'))
    ['ACC', 'ACG', 'AGC', 'AGG', 'GCC', 'GCG', 'GGC', 'GGG']

    """    
    for seq in itertools.product(*shorthand_parse(shorthand, alphabet)):
        yield ''.join(seq)

Code Snippets

if __name__ == "__main__":
    import doctest
    doctest.testmod()

python -m doctest mymodule.py

assert char in alphabet+'['+']'+'{'+'}'+'x'

import re

class ParseError(Exception): pass

def shorthand_parse(shorthand, alphabet):
    """Parse bioinformatics shorthand notation and generate strings giving
    the characters in alphabet that are allowed to match at each
    position in the sequence.

    Square brackets means "it could be any of these letters" in this
    location. Curly braces means "it's *not* one of these letters" in
    this location (it can be any letter in the given alphabet except
    these). Lowercase "x" means it could be any of the letters in the
    given alphabet.

    >>> list(shorthand_parse('xT[AG]{AT}', 'ACGT'))
    ['ACGT', 'T', 'AG', 'CG']

    """
    # Use set() to remove duplicates, and sorted() to ensure that the
    # output is in a canonical order (for predictability and testing).
    alphabet = set(alphabet)
    alphabet_str = ''.join(sorted(alphabet))    
    shorthand_re = r'''(x)             # Any character in alphabet
                      |([{0}])         # Single character
                      |\[([{0}]*)\]    # One of these characters
                      |{{([{0}]*)}}    # Not one of these characters
                      |(.)             # Anything else is an error
                    '''.format(alphabet_str)
    for m in re.finditer(shorthand_re, shorthand, re.VERBOSE):
        x, single, oneof, noneof, error = m.groups()
        if x:
            yield alphabet_str
        elif single:
            yield single
        elif oneof:
            yield ''.join(sorted(set(oneof)))
        elif noneof:
            yield ''.join(sorted(alphabet - set(noneof)))
        else:
            raise ParseError("unexpected character '{}'".format(error))

import itertools

def shorthand_sequences(shorthand, alphabet):
    """Generate all sequences matching the given bioinformatics shorthand
    notation over the given alphabet.

    >>> list(shorthand_sequences('T[AG]', 'AGCT'))
    ['TA', 'TG']
    >>> list(shorthand_sequences('T{A}T', 'AGCT'))
    ['TCT', 'TGT', 'TTT']
    >>> list(shorthand_sequences('G{AGC}AGTC', 'AGCT'))
    ['GTAGTC']
    >>> list(shorthand_sequences('xGTC', 'AGCT'))
    ['AGTC', 'CGTC', 'GGTC', 'TGTC']
    >>> list(shorthand_sequences('[AG]{AT}[CG]', 'AGCT'))
    ['ACC', 'ACG', 'AGC', 'AGG', 'GCC', 'GCG', 'GGC', 'GGG']

    """    
    for seq in itertools.product(*shorthand_parse(shorthand, alphabet)):
        yield ''.join(seq)

Context

StackExchange Code Review Q#88097, answer score: 4

Revisions (0)

No revisions yet.