HiveBrain v1.2.0
Get Started
← Back to all entries
snippetMinor

How to find specificity of a regex match?

Submitted by: @import:stackexchange-cs··
0
Viewed 0 times
howmatchspecificityfindregex

Problem

I'm thinking about a routing system. Imagine I have the two following regexes

  • pathpart1/pathpart2 => specific match that routes to controller1



  • .* => catch-all that routes to controller2



And I let them match on a URL, e.g. 'pathpart1/pathpart2'.

They both match, but I would want to give prevalence to the most specific regex, i.e. the regex where the cardinality of all possible matches of that regex is the lowest.


Is there a good way to calculate that the first regex has a low cardinality on its match-set (so I want to to go with that match) and the second has a very high cardinality on its match-set (i.e. it is completely not specific, so a match is basically a catch all last resort)...?

I do not know upfront which routes are registered with the router, so I can't loop over them in order of cardinality by hand (i.e. low cardinality first, and the catch-all last if all others don't match).

Solution

There are two angles I see to define "specifity" in this context.

  • Specifity of a regular expression $r$ that matches input $w$ is


the (inverse of the) number of words of length $|w|$ that also match $r$.

  • Specifity of a regular expression $r$ that matches input $w$ is (the ratio of)


the (maximum) number of symbols of $w$ that are matched by non-wildcard symbols
in $r$ (and $|w|$).

The first option follows your idea: choose among all matches the regular expressions that
matches the fewest others. Since there are in general infinitely many matching
strings, we have to restrict ourselves to a certain length, or a finite set of lengths.

The second option follows another approach. If the input string is well described
by a given expression, long parts of it should be (more or less) explicitly described
in the regular expression. For example, $(aa|ab)+.^*bba$ is a more specific match
for $aaababaababababba$ than $.^*$.

Option 1

In this case, we have to count all words in $\mathcal{L}(r)$ that have length $n$.
Luckily, this is easy once we have an unambiguous, regular grammar¹ $G$ for $\mathcal{L}(r)$.
See here how this can be done.

Option 2

Let's assume we have an unambiguous regular expression (for instance by specifying
lazy wildcards). In that case, we can translate the regular expression into a
DFA. Run the word through it and count how many non-wildcard transitions are taken.

If the expression is ambiguous, we can do the same with the corresponding NFA
if we explore all possible paths.

  • Convert $r$ to an $NFA$, determinise, convert to right-regular grammar. All


conversions are standard and algorithmically feasible (if not necessarily
efficient).

Context

StackExchange Computer Science Q#10786, answer score: 3

Revisions (0)

No revisions yet.