snippetMinor
How to find specificity of a regex match?
Viewed 0 times
howmatchspecificityfindregex
Problem
I'm thinking about a routing system. Imagine I have the two following regexes
And I let them match on a URL, e.g. 'pathpart1/pathpart2'.
They both match, but I would want to give prevalence to the most specific regex, i.e. the regex where the cardinality of all possible matches of that regex is the lowest.
Is there a good way to calculate that the first regex has a low cardinality on its match-set (so I want to to go with that match) and the second has a very high cardinality on its match-set (i.e. it is completely not specific, so a match is basically a catch all last resort)...?
I do not know upfront which routes are registered with the router, so I can't loop over them in order of cardinality by hand (i.e. low cardinality first, and the catch-all last if all others don't match).
- pathpart1/pathpart2 => specific match that routes to controller1
- .* => catch-all that routes to controller2
And I let them match on a URL, e.g. 'pathpart1/pathpart2'.
They both match, but I would want to give prevalence to the most specific regex, i.e. the regex where the cardinality of all possible matches of that regex is the lowest.
Is there a good way to calculate that the first regex has a low cardinality on its match-set (so I want to to go with that match) and the second has a very high cardinality on its match-set (i.e. it is completely not specific, so a match is basically a catch all last resort)...?
I do not know upfront which routes are registered with the router, so I can't loop over them in order of cardinality by hand (i.e. low cardinality first, and the catch-all last if all others don't match).
Solution
There are two angles I see to define "specifity" in this context.
the (inverse of the) number of words of length $|w|$ that also match $r$.
the (maximum) number of symbols of $w$ that are matched by non-wildcard symbols
in $r$ (and $|w|$).
The first option follows your idea: choose among all matches the regular expressions that
matches the fewest others. Since there are in general infinitely many matching
strings, we have to restrict ourselves to a certain length, or a finite set of lengths.
The second option follows another approach. If the input string is well described
by a given expression, long parts of it should be (more or less) explicitly described
in the regular expression. For example, $(aa|ab)+.^*bba$ is a more specific match
for $aaababaababababba$ than $.^*$.
Option 1
In this case, we have to count all words in $\mathcal{L}(r)$ that have length $n$.
Luckily, this is easy once we have an unambiguous, regular grammar¹ $G$ for $\mathcal{L}(r)$.
See here how this can be done.
Option 2
Let's assume we have an unambiguous regular expression (for instance by specifying
lazy wildcards). In that case, we can translate the regular expression into a
DFA. Run the word through it and count how many non-wildcard transitions are taken.
If the expression is ambiguous, we can do the same with the corresponding NFA
if we explore all possible paths.
conversions are standard and algorithmically feasible (if not necessarily
efficient).
- Specifity of a regular expression $r$ that matches input $w$ is
the (inverse of the) number of words of length $|w|$ that also match $r$.
- Specifity of a regular expression $r$ that matches input $w$ is (the ratio of)
the (maximum) number of symbols of $w$ that are matched by non-wildcard symbols
in $r$ (and $|w|$).
The first option follows your idea: choose among all matches the regular expressions that
matches the fewest others. Since there are in general infinitely many matching
strings, we have to restrict ourselves to a certain length, or a finite set of lengths.
The second option follows another approach. If the input string is well described
by a given expression, long parts of it should be (more or less) explicitly described
in the regular expression. For example, $(aa|ab)+.^*bba$ is a more specific match
for $aaababaababababba$ than $.^*$.
Option 1
In this case, we have to count all words in $\mathcal{L}(r)$ that have length $n$.
Luckily, this is easy once we have an unambiguous, regular grammar¹ $G$ for $\mathcal{L}(r)$.
See here how this can be done.
Option 2
Let's assume we have an unambiguous regular expression (for instance by specifying
lazy wildcards). In that case, we can translate the regular expression into a
DFA. Run the word through it and count how many non-wildcard transitions are taken.
If the expression is ambiguous, we can do the same with the corresponding NFA
if we explore all possible paths.
- Convert $r$ to an $NFA$, determinise, convert to right-regular grammar. All
conversions are standard and algorithmically feasible (if not necessarily
efficient).
Context
StackExchange Computer Science Q#10786, answer score: 3
Revisions (0)
No revisions yet.