HiveBrain v1.2.0
Get Started
← Back to all entries
patternpythonMinor

All possible values that will match a regular expression

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
expressionallregularmatchpossiblewillthatvalues

Problem

This is a similar question to this, but I am looking for the set of all possible values that will match a regular expression pattern.

To avoid an infinite set of possible values, I am willing to restrict the regular expression pattern to a subset of the regular expression language.

Here's the approach I took (Python code):

def generate_possible_strings(pattern):
    '''
    input: 'K0[2468]'
    output: ['K02', 'K04', 'K06', 'K08']

    generates a list of possible strings that would match pattern
    ie, any value X such that re.search(pattern, X) is a match
    '''
    query = re.compile(pattern, re.IGNORECASE)
    fill_in = string.uppercase + string.digits + '_'

    # Build a re for a language subset that is supported by reverse_search
    bracket = r'\[[^\]]*\]' #finds [A-Z], [0-5], [02468]
    symbol = r'\\.' #finds \w, \d
    expression = '|'.join((bracket,symbol)) #search query

    tokens = re.split(expression, pattern)
    for c in product(fill_in, repeat=len(tokens)-1):
        candidate = ''.join(roundrobin(tokens, c)) #roundrobin recipe from itertools documentation
        if query.match(candidate):
            yield candidate


Supported subset of regular expressions language

  • Supports [] set of characters ([A-Z], [0-5], etc)



  • Supports escaped special characters (\w, \d, \D, etc)



Basically what this does is locate all parts of a regular expression that could match a single character ([A-Z] or [0-5] or [02468] or \w or \d), then for all of the valid replacement characters A-Z0-9_ test to see if the replacement matches the regular expression.

This algorithm is slow for regular expressions with many fields or if fill_in is not restricted to just A-Z0-9_, but at least it guarantees finding every possible string that will match a regular expression in finite time (if the solution set is finite).

Is there a faster approach to solving this problem, or an approach that supports a larger percentage of the standard re

Solution

A major inefficiency in your solution is that you try every fill_in character as a replacement for any character class in the pattern. Instead, you could use the character class to select matching characters from fill_in and only loop over those.

>>> pattern = 'K0[2468]'
>>> re.findall(expression, pattern)
['[2468]']
>>> re.findall('[2468]', fill_in)
['2', '4', '6', '8']


For a more complete existing solution, you may want to look into these:

  • Regular expression parsing in Python



  • invRegex.py in Pyparsing examples

Code Snippets

>>> pattern = 'K0[2468]'
>>> re.findall(expression, pattern)
['[2468]']
>>> re.findall('[2468]', fill_in)
['2', '4', '6', '8']

Context

StackExchange Code Review Q#43981, answer score: 4

Revisions (0)

No revisions yet.