patternpythonMinor
Converting domain-specific regular-expressions to a list of all matching instances
Viewed 0 times
instancesallregularexpressionsspecificconvertinglistdomainmatching
Problem
There seem to be several questions floating around Stackexchange regarding how to take a python regular expression list the matching instances. This problem is a bit different because 1) I'm need to start off with a domain-specific regex system, not just the standard python regex, and 2) the possible matches for any given location in string are limited to a specific 'alphabet' of characters.
The current implementation works, but I've got a nagging feeling there's a much more succinct way to do it.
```
import numpy as np
def shorthand_to_instances(shorthand,alphabet):
"""
Converts the shorthand-notation for bioinformatics 'motifs'
into the all the instances of sequences they would match.
Brackets means "it could be any of these letters" in this location.
Curly braces means "it's not one of these letters" in this location,
it can be any letter in the given "alphabet" except these.
Lowercase "x" means it could be any of the letters in the given
"alphabet".
>>> binf_motifs_to_sequences('T[AG]','AGCT')
['TA', 'TG']
>>> binf_motifs_to_sequences('T{A}T','AGCT')
['TGT', 'TCT', 'TTT']
>>> binf_motifs_to_sequences('G{AGC}AGTC','AGCT')
['GTAGTC']
>>> binf_motifs_to_sequences('xGTC','AGCT')
['AGTC', 'GGTC', 'CGTC', 'TGTC']
"""
instances=[]
shc = shorthand
while len(shc)>0:
char = shc[0]
assert char in alphabet+'['+']'+'{'+'}'+'x', "Failed assert: '{0}'".format(char)
if char in alphabet:
# it's in the alphabet
if len(instances)==0:
instances.append(char)
else:
for i in list(np.arange(len(instances))):
instances[i] += char
shc = shc[1:]
elif char=='[':
# it's a 'one of these'
chargroup,c,e = shc[1:].partition(']')
if len(instances)==0:
for char in chargroup:
instances.append
The current implementation works, but I've got a nagging feeling there's a much more succinct way to do it.
```
import numpy as np
def shorthand_to_instances(shorthand,alphabet):
"""
Converts the shorthand-notation for bioinformatics 'motifs'
into the all the instances of sequences they would match.
Brackets means "it could be any of these letters" in this location.
Curly braces means "it's not one of these letters" in this location,
it can be any letter in the given "alphabet" except these.
Lowercase "x" means it could be any of the letters in the given
"alphabet".
>>> binf_motifs_to_sequences('T[AG]','AGCT')
['TA', 'TG']
>>> binf_motifs_to_sequences('T{A}T','AGCT')
['TGT', 'TCT', 'TTT']
>>> binf_motifs_to_sequences('G{AGC}AGTC','AGCT')
['GTAGTC']
>>> binf_motifs_to_sequences('xGTC','AGCT')
['AGTC', 'GGTC', 'CGTC', 'TGTC']
"""
instances=[]
shc = shorthand
while len(shc)>0:
char = shc[0]
assert char in alphabet+'['+']'+'{'+'}'+'x', "Failed assert: '{0}'".format(char)
if char in alphabet:
# it's in the alphabet
if len(instances)==0:
instances.append(char)
else:
for i in list(np.arange(len(instances))):
instances[i] += char
shc = shc[1:]
elif char=='[':
# it's a 'one of these'
chargroup,c,e = shc[1:].partition(']')
if len(instances)==0:
for char in chargroup:
instances.append
Solution
- Review
-
This is clear, well-documented code.
-
The doctests all call
binf_motifs_to_sequences but your function is named shorthand_to_instances so all the doctests fail.-
There's no need to have this kind of infrastructure for running doctests:
if __name__ == "__main__":
import doctest
doctest.testmod()since you can easily run doctests from the command line using Python's
-m option:python -m doctest mymodule.py-
Assertions should be used for programming errors (conditions that you believe can't happen), but this assertion:
assert char in alphabet+'['+']'+'{'+'}'+'x'is really a runtime error (it might happen if bad data was passed to the function), so it would be better to raise an exception.
-
A given shorthand can have exponentially many matching sequences. For example,
xxxx already has 256 matching sequences over a four-character alphabet. For this reason, I would avoid returning a list of matching sequences, as this requires building a list and storing all the sequences in memory. It would be better to generate the sequences one at a time.- Simplifying
The way to make this kind of code simpler is to use functional decomposition to divide it into pieces that are simpler to understand, implement, and test.
A sensible division into two pieces would be to first, parse the shorthand notation, producing some kind of data structure; and second, take the data structure and generate the matching sequences. Since the alphabet is likely to be small, the natural data structure to use is just a sequence of strings giving the characters that match at each position. So given the shorthand
xT[AG]{AT} over the alphabet ACGT, the intermediate data would be the sequence ['ACGT', 'T', 'AG', 'CG'].(If alphabets were expected to be much larger, then this might be somewhat wasteful in space, and so there might be a need to have a more sophisticated data structure here.)
I've called the first step "parsing" but really it's so simple that it's just lexical analysis and in Python this can be done simply by writing a regular expression that matches a token, and calling
re.finditer to loop over the tokens:import re
class ParseError(Exception): pass
def shorthand_parse(shorthand, alphabet):
"""Parse bioinformatics shorthand notation and generate strings giving
the characters in alphabet that are allowed to match at each
position in the sequence.
Square brackets means "it could be any of these letters" in this
location. Curly braces means "it's *not* one of these letters" in
this location (it can be any letter in the given alphabet except
these). Lowercase "x" means it could be any of the letters in the
given alphabet.
>>> list(shorthand_parse('xT[AG]{AT}', 'ACGT'))
['ACGT', 'T', 'AG', 'CG']
"""
# Use set() to remove duplicates, and sorted() to ensure that the
# output is in a canonical order (for predictability and testing).
alphabet = set(alphabet)
alphabet_str = ''.join(sorted(alphabet))
shorthand_re = r'''(x) # Any character in alphabet
|([{0}]) # Single character
|\[([{0}]*)\] # One of these characters
|{{([{0}]*)}} # Not one of these characters
|(.) # Anything else is an error
'''.format(alphabet_str)
for m in re.finditer(shorthand_re, shorthand, re.VERBOSE):
x, single, oneof, noneof, error = m.groups()
if x:
yield alphabet_str
elif single:
yield single
elif oneof:
yield ''.join(sorted(set(oneof)))
elif noneof:
yield ''.join(sorted(alphabet - set(noneof)))
else:
raise ParseError("unexpected character '{}'".format(error))Now that we have the strings of characters that are valid at each position, we want the Cartesian product of those strings. This can be computed by
itertools.product, and in fact the output of shorthand_parse is already in the right format for input to that function:import itertools
def shorthand_sequences(shorthand, alphabet):
"""Generate all sequences matching the given bioinformatics shorthand
notation over the given alphabet.
>>> list(shorthand_sequences('T[AG]', 'AGCT'))
['TA', 'TG']
>>> list(shorthand_sequences('T{A}T', 'AGCT'))
['TCT', 'TGT', 'TTT']
>>> list(shorthand_sequences('G{AGC}AGTC', 'AGCT'))
['GTAGTC']
>>> list(shorthand_sequences('xGTC', 'AGCT'))
['AGTC', 'CGTC', 'GGTC', 'TGTC']
>>> list(shorthand_sequences('[AG]{AT}[CG]', 'AGCT'))
['ACC', 'ACG', 'AGC', 'AGG', 'GCC', 'GCG', 'GGC', 'GGG']
"""
for seq in itertools.product(*shorthand_parse(shorthand, alphabet)):
yield ''.join(seq)Code Snippets
if __name__ == "__main__":
import doctest
doctest.testmod()python -m doctest mymodule.pyassert char in alphabet+'['+']'+'{'+'}'+'x'import re
class ParseError(Exception): pass
def shorthand_parse(shorthand, alphabet):
"""Parse bioinformatics shorthand notation and generate strings giving
the characters in alphabet that are allowed to match at each
position in the sequence.
Square brackets means "it could be any of these letters" in this
location. Curly braces means "it's *not* one of these letters" in
this location (it can be any letter in the given alphabet except
these). Lowercase "x" means it could be any of the letters in the
given alphabet.
>>> list(shorthand_parse('xT[AG]{AT}', 'ACGT'))
['ACGT', 'T', 'AG', 'CG']
"""
# Use set() to remove duplicates, and sorted() to ensure that the
# output is in a canonical order (for predictability and testing).
alphabet = set(alphabet)
alphabet_str = ''.join(sorted(alphabet))
shorthand_re = r'''(x) # Any character in alphabet
|([{0}]) # Single character
|\[([{0}]*)\] # One of these characters
|{{([{0}]*)}} # Not one of these characters
|(.) # Anything else is an error
'''.format(alphabet_str)
for m in re.finditer(shorthand_re, shorthand, re.VERBOSE):
x, single, oneof, noneof, error = m.groups()
if x:
yield alphabet_str
elif single:
yield single
elif oneof:
yield ''.join(sorted(set(oneof)))
elif noneof:
yield ''.join(sorted(alphabet - set(noneof)))
else:
raise ParseError("unexpected character '{}'".format(error))import itertools
def shorthand_sequences(shorthand, alphabet):
"""Generate all sequences matching the given bioinformatics shorthand
notation over the given alphabet.
>>> list(shorthand_sequences('T[AG]', 'AGCT'))
['TA', 'TG']
>>> list(shorthand_sequences('T{A}T', 'AGCT'))
['TCT', 'TGT', 'TTT']
>>> list(shorthand_sequences('G{AGC}AGTC', 'AGCT'))
['GTAGTC']
>>> list(shorthand_sequences('xGTC', 'AGCT'))
['AGTC', 'CGTC', 'GGTC', 'TGTC']
>>> list(shorthand_sequences('[AG]{AT}[CG]', 'AGCT'))
['ACC', 'ACG', 'AGC', 'AGG', 'GCC', 'GCG', 'GGC', 'GGG']
"""
for seq in itertools.product(*shorthand_parse(shorthand, alphabet)):
yield ''.join(seq)Context
StackExchange Code Review Q#88097, answer score: 4
Revisions (0)
No revisions yet.