HiveBrain v1.2.0
Get Started
← Back to all entries
patternjavaMinor

Check whether a list of list of string fits a regex efficiently

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
efficientlyregexfitslistcheckwhetherstring

Problem

I have the following structure for my object:

Word (Object)

  • Word (String)



  • List



Symbol (Object)

  • Symbol (String)



  • List



SymbolChoice (Object)

  • Symbol (String)



The background of this is that this is processed OCR output and I want to force a word to match a certain regex. If it cannot find a match, it should return an empty List, otherwise it should return all List that match the specified regex.

Example:

  • Found word: "1o0"



  • Symbols: "1", "o" and "0"



  • Choices of "1": "1"



  • Choices of "o": "o", "0"



  • Choices of "0": "0"



When forcing this word to abide regex "\\d{3}", it has to return [["1", "0", "0"]], here I am denoting the SymbolChoice objects as strings.

I have a solution, however it is not fast enough, more details after the method:

```
public List> forceFitRegex(final String regex) {
Objects.requireNonNull(regex, "regex");

if (getSymbols().size() >();
}

int combinations = 1;
for (Symbol symbol : getSymbols()) {
combinations *= symbol.getSymbolChoices().size();
}
System.out.println("Combinations: " + combinations);

List> listOfSymbolChoices = new ArrayList>();
for (SymbolChoice symbolChoice : getSymbols().get(0).getSymbolChoices()) {
List symbolChoices = new ArrayList();
symbolChoices.add(symbolChoice);
listOfSymbolChoices.add(symbolChoices);
}

int i = 1;
while (i > listOfSymbolChoicesCopy = new ArrayList>(listOfSymbolChoices);
for (SymbolChoice symbolChoice : getSymbols().get(i).getSymbolChoices()) {
for (List symbolChoicesCopy : listOfSymbolChoicesCopy) {
List symbolChoices = new ArrayList(symbolChoicesCopy);
symbolChoices.add(symbolChoice);
listOfSymbolChoices.add(symbolChoices);
}
}
i++;
}

Pattern pattern = Pattern.compile(regex);
ListIterator> symbolChoicesListIterator = listOfSymbolChoices.listIterator

Solution

Bonus points if someone can calculate the asymptotic time my algorithm takes.

Too much. It's exponential, as you're generating the whole candidate set. Even the improvement by rolfl doesn't change it.

You could do much better by analyzing the regex. For example "\\d{3}" will never match anything starting with a letter and you could cut off many possibilities. However, doing this in general is pretty complicated.

There's a method which might do this for you: hitEnd. I'm not sure if it does the right thing, but I guess so:

Returns true if the end of input was hit by the search engine in the last match operation performed by this matcher.

When this method returns true, then it is possible that more input would have changed the result of the last search.

So you could test you incomplete candidates against the regex and possibly save yourself the work with expanding them.

Context

StackExchange Code Review Q#53981, answer score: 3

Revisions (0)

No revisions yet.