patternpythonMinor
All possible values that will match a regular expression
Viewed 0 times
expressionallregularmatchpossiblewillthatvalues
Problem
This is a similar question to this, but I am looking for the set of all possible values that will match a regular expression pattern.
To avoid an infinite set of possible values, I am willing to restrict the regular expression pattern to a subset of the regular expression language.
Here's the approach I took (Python code):
Supported subset of regular expressions language
Basically what this does is locate all parts of a regular expression that could match a single character (
This algorithm is slow for regular expressions with many fields or if
Is there a faster approach to solving this problem, or an approach that supports a larger percentage of the standard re
To avoid an infinite set of possible values, I am willing to restrict the regular expression pattern to a subset of the regular expression language.
Here's the approach I took (Python code):
def generate_possible_strings(pattern):
'''
input: 'K0[2468]'
output: ['K02', 'K04', 'K06', 'K08']
generates a list of possible strings that would match pattern
ie, any value X such that re.search(pattern, X) is a match
'''
query = re.compile(pattern, re.IGNORECASE)
fill_in = string.uppercase + string.digits + '_'
# Build a re for a language subset that is supported by reverse_search
bracket = r'\[[^\]]*\]' #finds [A-Z], [0-5], [02468]
symbol = r'\\.' #finds \w, \d
expression = '|'.join((bracket,symbol)) #search query
tokens = re.split(expression, pattern)
for c in product(fill_in, repeat=len(tokens)-1):
candidate = ''.join(roundrobin(tokens, c)) #roundrobin recipe from itertools documentation
if query.match(candidate):
yield candidateSupported subset of regular expressions language
- Supports
[]set of characters ([A-Z],[0-5], etc)
- Supports escaped special characters (
\w,\d,\D, etc)
Basically what this does is locate all parts of a regular expression that could match a single character (
[A-Z] or [0-5] or [02468] or \w or \d), then for all of the valid replacement characters A-Z0-9_ test to see if the replacement matches the regular expression.This algorithm is slow for regular expressions with many fields or if
fill_in is not restricted to just A-Z0-9_, but at least it guarantees finding every possible string that will match a regular expression in finite time (if the solution set is finite).Is there a faster approach to solving this problem, or an approach that supports a larger percentage of the standard re
Solution
A major inefficiency in your solution is that you try every
For a more complete existing solution, you may want to look into these:
fill_in character as a replacement for any character class in the pattern. Instead, you could use the character class to select matching characters from fill_in and only loop over those. >>> pattern = 'K0[2468]'
>>> re.findall(expression, pattern)
['[2468]']
>>> re.findall('[2468]', fill_in)
['2', '4', '6', '8']For a more complete existing solution, you may want to look into these:
- Regular expression parsing in Python
- invRegex.py in Pyparsing examples
Code Snippets
>>> pattern = 'K0[2468]'
>>> re.findall(expression, pattern)
['[2468]']
>>> re.findall('[2468]', fill_in)
['2', '4', '6', '8']Context
StackExchange Code Review Q#43981, answer score: 4
Revisions (0)
No revisions yet.