Optimize regex for maximum speed

Submitted by: @import:stackexchange-codereview·Mar 10, 2026·

Problem

Following my Stack Overflow question at https://stackoverflow.com/questions/19608546/optimize-regex-for-maximum-speed and the comments there, I am asking my question here.
Please help me optimize the following regexes for best performance. I have read some articles, but this problem needs to be solved quickly to reduce CPU usage and latency, so I don't have enough time for trial and error.

The first one should match, for example:

http://microsoft.com/test/temp.iso

http://download.microsoft.com/TEMP.iso

http://www.download.microsoft.com/test.aspx?hdwjdhcjdgcjhdc=TEMP.iso

Notes:

- All URLs start with http://, so I don't know whether it is better to put ^http:// at the start or not.
- The first and last lines have specific rules, but the lines between them may be combined.
- These regexes are used in a Squid configuration.

Any help is appreciated.

refresh_pattern -i (.+\.||)(microsoft|windowsupdate).com/.*\.(cab|exe|dll|ms[i|u|f]|asf|wm[v|a]|dat|zip|iso|psf) 
refresh_pattern -i (.+\.||)eset.com/.*\.(cab|exe|dll|ms[i|u|f]|asf|wm[v|a]|dat|zip|ver|nup) 
refresh_pattern -i (.+\.||)avg.com/.*\.(cab|exe|dll|ms[i|u|f]|asf|wm[v|a]|dat|zip|ctf|bin|gz) 
refresh_pattern -i (.+\.||)grisoft.(com|cz)/.*\.(cab|exe|dll|ms[i|u|f]|asf|wm[v|a]|dat|zip|ctf|bin|gz) 
refresh_pattern -i (.+\.||)avast.com/.*\.(cab|exe|dll|ms[i|u|f]|asf|wm[v|a]|dat|zip|vpx|vpu|vpa|vpaa|def|stamp) 
refresh_pattern -i (.+\.||)(kaspersky-labs|kaspersky).com/.*\.(cab|zip|exe|msi|msp|bz2|avc|kdc|klz|dif|dat|kdz|kdl|kfb) 
refresh_pattern -i (.+\.||)nai.com/.*\.(gem|zip|mcs|tar|exe|) 
refresh_pattern -i (.+\.||)adobe.com/.*\.(cab|aup|exe|msi|upd|msp) 
refresh_pattern -i (.+\.||)symantecliveupdate.com/.*\.(zip|exe|msi) 
refresh_pattern -i (.+\.||)(192\.168\.10\.34|mywebsite.com)/.*

With help from the Stack Overflow folks, I have changed it to this. Is any further optimization possible?

```
refresh_pattern -i (.+\.)?(microsoft|windowsupdate)\.com/.*?\.(cab|exe|dll|ms[i|u|f]|asf|wm[v|a]|dat|zip|iso|psf)$
refresh_pattern -i (.+\.)?eset\.com/.*?\.(cab|exe|dll|ms[i|u|f]|asf|wm[v|a]
```

Solution

Since all your patterns have similar structure, we only need to focus on one, e.g.

refresh_pattern -i (.+\.)?avg\.com/.*?\.(cab|exe|dll|ms[i|u|f]|asf|wm[v|a]|dat|zip|ctf|bin|gz)$

This pattern starts with (.+\.)?. That group says "match any characters that end in a '.'", but the ? means that if nothing matches, that's OK too. So what (.+\.)? really means is "match something with a dot at the end, or match nothing".

This looks sensible, but because the pattern is not anchored to the start of the URL (there is no ^), it really means "match anything".

For example, all of these URLs will successfully match:

http://www.avg.com/file.gz // we would expect this

http://avg.com/file.gz // we would maybe expect this

http://cravg.com/file.gz // we would not expect this

https://downloads.avg.com/file.gz // we would expect this

mailto:me@www.avg.com/file.gz // we would expect this

ftp://ftp.avg.com/file.gz // we would expect this
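
The list above can be checked with a quick script. This is only a sketch: it uses Python's `re` module as a stand-in for Squid's regex engine (an assumption, since Squid links against a POSIX-style regex library and details such as `.*?` may behave differently there), with `re.IGNORECASE` playing the role of `-i`. The helper name `unanchored_matches` is made up for this example.

```python
import re

# Pattern copied from the refresh_pattern line above; re.IGNORECASE mimics -i.
# Assumption: Python's re is only a stand-in for Squid's regex library.
PATTERN = re.compile(
    r"(.+\.)?avg\.com/.*?\.(cab|exe|dll|ms[i|u|f]|asf|wm[v|a]|dat|zip|ctf|bin|gz)$",
    re.IGNORECASE,
)

URLS = [
    "http://www.avg.com/file.gz",
    "http://avg.com/file.gz",
    "http://cravg.com/file.gz",
    "https://downloads.avg.com/file.gz",
    "mailto:me@www.avg.com/file.gz",
    "ftp://ftp.avg.com/file.gz",
]


def unanchored_matches(urls):
    """Return the URLs in which the pattern is found anywhere (unanchored search)."""
    return [u for u in urls if PATTERN.search(u)]


if __name__ == "__main__":
    for url in unanchored_matches(URLS):
        print("match:", url)
```

Under these assumptions every URL in the list is reported as a match, including the cravg.com one.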

This is all beside the point, though, because whether we use (.+\.)? or not makes no difference whatsoever (functionality-wise).

It is possible that the Squid regex engine is smart enough to identify that the (.+\.)? is useless and optimize it out, but I doubt it.

So, you can remove all the (.+\.)? structures.
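
To illustrate why the group is redundant, here is a small sketch that compares the pattern with and without the (.+\.)? prefix on a handful of URLs. Python's `re` is again an assumed stand-in for Squid's engine, and the sample URLs (including example.com) and the helper `same_verdict` are made up for the comparison.

```python
import re

# Shared tail of the avg.com pattern from the text above.
TAIL = r"avg\.com/.*?\.(cab|exe|dll|ms[i|u|f]|asf|wm[v|a]|dat|zip|ctf|bin|gz)$"

WITH_PREFIX = re.compile(r"(.+\.)?" + TAIL, re.IGNORECASE)
WITHOUT_PREFIX = re.compile(TAIL, re.IGNORECASE)

# Hypothetical sample URLs: some should match, some should not.
SAMPLES = [
    "http://www.avg.com/file.gz",
    "http://avg.com/file.gz",
    "http://cravg.com/file.gz",
    "http://www.avg.com/index.html",
    "http://example.com/file.gz",
]


def same_verdict(url):
    """True if both patterns agree on whether the URL matches."""
    with_prefix = WITH_PREFIX.search(url) is not None
    without_prefix = WITHOUT_PREFIX.search(url) is not None
    return with_prefix == without_prefix


if __name__ == "__main__":
    for url in SAMPLES:
        print(url, "agree" if same_verdict(url) else "DIFFER")
```

Because the search is unanchored, an optional group at the front can always match nothing, so the two patterns accept exactly the same URLs. The version without the group simply gives the engine less to try.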

I believe that you should replace them with a \b, which is a 'word boundary'. This will match things like http://avg.com/... and http://www.avg.com/... but it will not match http://cravg.com/.... I believe this is what you want.
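
A minimal sketch of the \b variant follows, again using Python's `re` as a stand-in. Whether Squid's regex library treats \b the same way is an assumption worth verifying on the target system. The helper name `matches` is hypothetical.

```python
import re

# Assumption: Python's re stands in for Squid's regex library, and \b is
# taken to mean "word boundary" there as well.
BOUNDARY_PATTERN = re.compile(
    r"\bavg\.com/.*?\.(cab|exe|dll|ms[i|u|f]|asf|wm[v|a]|dat|zip|ctf|bin|gz)$",
    re.IGNORECASE,
)


def matches(url):
    """True if the word-boundary pattern is found anywhere in the URL."""
    return BOUNDARY_PATTERN.search(url) is not None


if __name__ == "__main__":
    for url in (
        "http://avg.com/file.gz",
        "http://www.avg.com/file.gz",
        "http://cravg.com/file.gz",
    ):
        print(url, matches(url))
```

One caveat: a hyphen is not a word character, so a host such as not-avg.com (a made-up name) would still pass the \b check.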

It would likely be better to also anchor the regex to the beginning of the URL. If you can assume that it is an http or https URL then I would absolutely recommend that you include it as part of the regex. It will anchor the query well.

I will assume that the protocol could be https as well as http, in which case the match would be:

refresh_pattern -i ^https?://.*\bavg\.com/.*?\.(cab|exe|dll|ms[i|u|f]|asf|wm[v|a]|dat|zip|ctf|bin|gz)$

This change can be applied to all the refresh_patterns: replace (.+\.)? with ^https?://.*\b.
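
The replacement can be sketched as a plain string substitution, with the prefix written as `^https?://.*\b` (the `://` separator spelled out). The helper `anchor_pattern` is hypothetical, and Python's `re` is only an assumed stand-in for Squid's engine, so the results should be re-checked against Squid itself.

```python
import re

OLD_PREFIX = r"(.+\.)?"          # optional-subdomain group used in the question
NEW_PREFIX = r"^https?://.*\b"   # anchored prefix suggested in the answer


def anchor_pattern(pattern):
    """Swap a leading (.+\\.)? for the anchored prefix (hypothetical helper)."""
    if pattern.startswith(OLD_PREFIX):
        return NEW_PREFIX + pattern[len(OLD_PREFIX):]
    return pattern


AVG = r"(.+\.)?avg\.com/.*?\.(cab|exe|dll|ms[i|u|f]|asf|wm[v|a]|dat|zip|ctf|bin|gz)$"
ANCHORED = re.compile(anchor_pattern(AVG), re.IGNORECASE)


def is_cached(url):
    """True if the anchored pattern matches the URL."""
    return ANCHORED.search(url) is not None


if __name__ == "__main__":
    for url in (
        "http://www.avg.com/file.gz",
        "https://downloads.avg.com/file.gz",
        "http://cravg.com/file.gz",
        "ftp://ftp.avg.com/file.gz",
        "mailto:me@www.avg.com/file.gz",
    ):
        print(url, is_cached(url))
```

With the anchor in place, only http and https URLs can match, so the ftp and mailto examples from the earlier list are rejected along with cravg.com.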

Context

StackExchange Code Review Q#33319, answer score: 4
