Optimize regex for maximum speed
Problem
Following https://stackoverflow.com/questions/19608546/optimize-regex-for-maximum-speed and the comments there, I am asking my question here.
Please help me optimize the following regexes for best performance. I have read some articles, but this problem needs to be solved quickly to reduce CPU usage and delay, so I don't have enough time for trial and error.
The first one should match, for example:
```
http://microsoft.com/test/temp.iso
http://download.microsoft.com/TEMP.iso
http://www.download.microsoft.com/test.aspx?hdwjdhcjdgcjhdc=TEMP.iso
```
Note:
- All URLs start with `http://`, so I don't know whether it is better to put `^http://` at the start or not.
- The first and last lines have specific rules, but the lines between them may be combined.
- These regexes are used in the squid configuration.
Any help is appreciated.
```
refresh_pattern -i (.+\.||)(microsoft|windowsupdate).com/.*\.(cab|exe|dll|ms[i|u|f]|asf|wm[v|a]|dat|zip|iso|psf)
refresh_pattern -i (.+\.||)eset.com/.*\.(cab|exe|dll|ms[i|u|f]|asf|wm[v|a]|dat|zip|ver|nup)
refresh_pattern -i (.+\.||)avg.com/.*\.(cab|exe|dll|ms[i|u|f]|asf|wm[v|a]|dat|zip|ctf|bin|gz)
refresh_pattern -i (.+\.||)grisoft.(com|cz)/.*\.(cab|exe|dll|ms[i|u|f]|asf|wm[v|a]|dat|zip|ctf|bin|gz)
refresh_pattern -i (.+\.||)avast.com/.*\.(cab|exe|dll|ms[i|u|f]|asf|wm[v|a]|dat|zip|vpx|vpu|vpa|vpaa|def|stamp)
refresh_pattern -i (.+\.||)(kaspersky-labs|kaspersky).com/.*\.(cab|zip|exe|msi|msp|bz2|avc|kdc|klz|dif|dat|kdz|kdl|kfb)
refresh_pattern -i (.+\.||)nai.com/.*\.(gem|zip|mcs|tar|exe|)
refresh_pattern -i (.+\.||)adobe.com/.*\.(cab|aup|exe|msi|upd|msp)
refresh_pattern -i (.+\.||)symantecliveupdate.com/.*\.(zip|exe|msi)
refresh_pattern -i (.+\.||)(192\.168\.10\.34|mywebsite.com)/.*
```
After help from the Stack Overflow folks, I have changed it to this. Any more optimization?
```
refresh_pattern -i (.+\.)?(microsoft|windowsupdate)\.com/.*?\.(cab|exe|dll|ms[i|u|f]|asf|wm[v|a]|dat|zip|iso|psf)$
refresh_pattern -i (.+\.)?eset\.com/.*?\.(cab|exe|dll|ms[i|u|f]|asf|wm[v|a]
```
Solution
Since all your patterns have a similar structure, we only need to focus on one, e.g.
```
refresh_pattern -i (.+\.)?avg\.com/.*?\.(cab|exe|dll|ms[i|u|f]|asf|wm[v|a]|dat|zip|ctf|bin|gz)$
```
This pattern starts with `(.+\.)?`. It says "match any characters that end in a '.'", but the `?` means that if nothing matches, then that's OK too. So what `(.+\.)?` really means is "match something with a dot at the end, or match nothing".
This is apparently sensible, but because the entire URL is not anchored to the start of the line (there is no `^`), it really means "match anything".
For example, all of these URLs will successfully match:
```
http://www.avg.com/file.gz          // we would expect this
http://avg.com/file.gz              // we would maybe expect this
http://cravg.com/file.gz            // we would not expect this
https://downloads.avg.com/file.gz   // we would expect this
mailto:me@www.avg.com/file.gz       // we would expect this
ftp://ftp.avg.com/file.gz           // we would expect this
```
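The claim that the optional prefix changes nothing can be checked with a small script. This is only a sketch using Python's `re` module as a stand-in for squid's regex engine, which may behave differently in places; the names `WITH_PREFIX`, `WITHOUT_PREFIX`, `matches`, and the `UNRELATED` URL are made up for illustration.

```python
import re

# Extension list copied verbatim from the example pattern in the answer.
EXTENSIONS = r"\.(cab|exe|dll|ms[i|u|f]|asf|wm[v|a]|dat|zip|ctf|bin|gz)$"

# The pattern as posted, and the same pattern with the optional prefix removed.
WITH_PREFIX = re.compile(r"(.+\.)?avg\.com/.*?" + EXTENSIONS, re.IGNORECASE)
WITHOUT_PREFIX = re.compile(r"avg\.com/.*?" + EXTENSIONS, re.IGNORECASE)

# The six example URLs listed in the answer.
URLS = [
    "http://www.avg.com/file.gz",
    "http://avg.com/file.gz",
    "http://cravg.com/file.gz",
    "https://downloads.avg.com/file.gz",
    "mailto:me@www.avg.com/file.gz",
    "ftp://ftp.avg.com/file.gz",
]

# Hypothetical URL on another domain, used only as a sanity check.
UNRELATED = "http://example.com/file.gz"


def matches(pattern, url):
    """Return True if the pattern is found anywhere in the URL (unanchored search)."""
    return pattern.search(url) is not None


with_prefix = [matches(WITH_PREFIX, u) for u in URLS]
without_prefix = [matches(WITHOUT_PREFIX, u) for u in URLS]

print(with_prefix)     # every example URL matches, including cravg.com
print(without_prefix)  # identical result without the (.+\.)? prefix
print(matches(WITH_PREFIX, UNRELATED))
```

Both lists come out the same, which is the point of the argument: without an anchor, the prefix group filters nothing.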
This is all beside the point, though, because whether we use `(.+\.)?` or not makes no difference whatsoever (functionality-wise).
It is possible that the squid regex engine is smart enough to identify that the `(.+\.)?` is useless and optimize it out, but I doubt it.
So, you can remove all the `(.+\.)?` structures.
I believe that you should replace them with a `\b`, which is a 'word break'. This will match things like `http://avg.com/...` and `http://www.avg.com/...`, but it will not match `http://cravg.com/...`. I believe this is what you want.
It would likely be better to also anchor the regex to the beginning of the URL. If you can assume that it is an http or https URL, then I would absolutely recommend that you include the protocol as part of the regex. It will anchor the query well.
I will assume that the protocol could be https as well as http, in which case the match would be:
```
refresh_pattern -i ^https?://.*\bavg\.com/.*?\.(cab|exe|dll|ms[i|u|f]|asf|wm[v|a]|dat|zip|ctf|bin|gz)$
```
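A quick way to see the effect of the anchor and the word break is to run the anchored pattern against the same example URLs. Again, this is a sketch in Python's `re`, not squid itself; squid's regex library may treat constructs such as `\b` or `.*?` differently, so it is worth verifying the pattern there too. The names `ANCHORED` and `is_matched` are illustrative only.

```python
import re

# Anchored pattern: protocol at the start, word break before the domain.
ANCHORED = re.compile(
    r"^https?://.*\bavg\.com/.*?\.(cab|exe|dll|ms[i|u|f]|asf|wm[v|a]|dat|zip|ctf|bin|gz)$",
    re.IGNORECASE,
)

# The six example URLs listed in the answer.
URLS = [
    "http://www.avg.com/file.gz",
    "http://avg.com/file.gz",
    "http://cravg.com/file.gz",
    "https://downloads.avg.com/file.gz",
    "mailto:me@www.avg.com/file.gz",
    "ftp://ftp.avg.com/file.gz",
]


def is_matched(url):
    """Return True if the anchored pattern matches the URL."""
    return ANCHORED.search(url) is not None


results = [is_matched(u) for u in URLS]
for url, ok in zip(URLS, results):
    print(f"{ok!s:5} {url}")
```

Under these assumptions, only the http and https URLs on `avg.com` or one of its subdomains match; `cravg.com` is rejected by `\b`, and the `mailto:` and `ftp:` URLs are rejected by the `^https?://` anchor.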
This change can be applied to all the refresh_pattern lines (replace `(.+\.)?` with `^https?://.*\b`).
Context
StackExchange Code Review Q#33319, answer score: 4