
What do you think of my regex for URL validation?

Submitted by: @import:stackexchange-codereview

Problem

I would like you to review my regex. It's supposed to recognize common URLs like:

http://www.google.com
http://www.sub1.sub2.google.com
https://www.google.com
http://www.google.com/path1/path2
http://www.google.com/path1/path2/
http://www.google.com/path1/path2/a=b&c=d
http://www.google.com/path1/path2/a=b&c=d/

var url = new RegExp("^(?:https?:\/\/)?(?:[-a-z0-9]+\\.)+[-a-z0-9]+(?:(?:(?:\/[-=&?.#a-z0-9]+)+)\/?)?$");

Solution

Your regex produces many false positives (matches where there is no valid URL) and many false negatives (valid URLs that it rejects). You should read the relevant RFCs, such as RFC 3986, which contain the applicable parts of the grammar.
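To make both kinds of failure visible, a small harness along these lines can help. It is only a sketch: the regex is copied from the question, the inputs are taken from this review, and the `cases` list, the `expected` flags and the `classify` helper are names made up for illustration.

```javascript
// The pattern exactly as given in the question.
const urlPattern = new RegExp("^(?:https?:\/\/)?(?:[-a-z0-9]+\\.)+[-a-z0-9]+(?:(?:(?:\/[-=&?.#a-z0-9]+)+)\/?)?$");

// `expected` records whether a URL validator should reasonably accept the input
// (an assumption of this sketch, not part of the question).
const cases = [
  { input: "http://www.google.com/path1/path2", expected: true },
  { input: "-.-", expected: false },
  { input: "999.999.999.999/this-is-no-ip4-address", expected: false },
  { input: "http://localhost/", expected: true },
  { input: "http://example.com:80/", expected: true },
];

// Compare the regex verdict with the expectation and name the kind of failure.
function classify(pattern, { input, expected }) {
  const actual = pattern.test(input);
  if (actual === expected) return "ok";
  return actual ? "false positive" : "false negative";
}

const report = cases.map((c) => ({ input: c.input, result: classify(urlPattern, c) }));
console.table(report);
```

Growing the `cases` list as new scenarios turn up gives you a regression test for whatever pattern you end up with.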

Here are some failure scenarios:

-
⊖ You don't match schemes other than HTTP or HTTPS. Consider FTP, SFTP, data URLs, and other widespread schemes like file:, mailto:, irc: or skype:. Consider also that the scheme is sometimes omitted, so that the same protocol as the current document is assumed. Test cases:

ftp://example.com/
data:text/html,Hello World!
mailto:foo@example.com
//example.com/


Every scheme has a slightly different syntax, so you'll have to decide which ones you support. Note that matching an HTTP(S) URL is different from matching all URLs.

-
⊕ You match some domain names that are syntactically correct, but semantically only consist of a top level domain. You may or may not wish to rule these out:

http://co.uk/


See the Mozilla Public Suffix List for the only reliable way to do this.

-
⊖ Some host names don't have a TLD:

http://localhost/


-
⊕ You match loads of stuff that isn't ok, e.g.:

999.999.999.999/this-is-no-ip4-address
-.-


-
⊖ You don't have support for ports:

http://example.com:80/


-
⊖ You don't have support for IPv6 addresses, which have to be enclosed in square brackets inside a URL:

http://[::1]/
http://[::1]:80/


-
⊖ You can't match any user info in the authority part:

ftp://user:password@example.com/


-
⊖ You don't support URL-encoding:

http://example.com/some%20path


-
⊖ You don't support Unicode chars:

http://example.com/smørrebrød/greek-Λεττερς


Consider also that the host name may contain Unicode characters, which have an equivalent Punycode representation.

-
⊖ Come to think of it, you actually forbid quite a lot of (special) characters that would be perfectly valid in the path (the host part is more restricted).

http://example.com/stuff_(in parens)/


-
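⊖ In a URL, the path is followed by at most one query string, introduced by a ?. Your test cases put a=b&c=d directly into the path, and your character class allows ? and # anywhere in it. Within the query string, & is the usual separator between parameters; the semicolon ; was once recommended as an equivalent alternative, but not every parser accepts it. In form-encoded query strings, + stands for a space. The query string may only be followed by a fragment, which is introduced by #. Any other occurrence of these characters has to be percent-encoded.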
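The split into path, query string and fragment can be inspected with the standard `URL` and `URLSearchParams` classes available in browsers and Node.js. This is just an illustration, not a validator; the first URL below is a made-up example, the second is one of the test cases from the question.

```javascript
// Hypothetical example URL with a query string and a fragment.
const parsed = new URL("http://example.com/path1/path2?a=b&c=hello+world#section-2");

console.log(parsed.pathname); // "/path1/path2"
console.log(parsed.search);   // "?a=b&c=hello+world"
console.log(parsed.hash);     // "#section-2"

// URLSearchParams splits on "&" and decodes "+" as a space.
const params = Object.fromEntries(parsed.searchParams);
console.log(params.c); // "hello world"

// Without a "?", the a=b&c=d part of the question's test case is simply path.
const fromQuestion = new URL("http://www.google.com/path1/path2/a=b&c=d");
console.log(fromQuestion.pathname); // "/path1/path2/a=b&c=d"
console.log(fromQuestion.search);   // ""

// This particular parser does not treat ";" as a separator.
const semicolon = new URLSearchParams("a=b;c=d");
console.log(semicolon.get("a")); // "b;c=d"
```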

For an overview of URL syntax, start with this Wikipedia article. Also look at URI Schemes. These articles also link to relevant specs.
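If you only need to accept or reject input, one alternative to a hand-written pattern is to let the platform's URL parser do the work and then check the scheme against a list you choose. This swaps the regex for a different technique, so take it as a sketch: `isSupportedUrl` and `ALLOWED_SCHEMES` are hypothetical names, and the allowlist is an assumption about what you want to support.

```javascript
// Hypothetical allowlist: the schemes this sketch decides to support.
const ALLOWED_SCHEMES = new Set(["http:", "https:"]);

// Returns true if `text` parses as an absolute URL with an allowed scheme.
function isSupportedUrl(text, allowed = ALLOWED_SCHEMES) {
  let parsed;
  try {
    parsed = new URL(text); // throws a TypeError if text is not an absolute URL
  } catch {
    return false;
  }
  return allowed.has(parsed.protocol);
}

// Inputs taken from the failure scenarios above.
const samples = [
  "http://localhost/",
  "http://example.com:80/",
  "http://[::1]:80/",
  "http://example.com/some%20path",
  "ftp://user:password@example.com/",
  "mailto:foo@example.com",
  "//example.com/",
  "-.-",
];

for (const sample of samples) {
  console.log(sample, isSupportedUrl(sample));
}
```

Be aware that this parser is lenient: it normalizes some inputs (for example, it percent-encodes a space in the path) rather than rejecting them, so it answers "can this be read as a URL?" rather than "is this string strictly valid?".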

Context

StackExchange Code Review Q#33739, answer score: 15
