Regex to parse URLs for their correctness according to RFC 3986

Problem

I recently wrote a regex to parse URLs. Now I wonder: did I miss something? Did I make a mistake, or could I have written it more cleanly? That's why I'm here.

To write the regex, I took RFC 3986 as a reference (together with some Wikipedia articles about URIs/URLs).

//According to http://www.ietf.org/rfc/rfc3986.txt
private static final String URL_UNRESERVED_CHARS = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";
private static final String URL_UNRESERVED_SPECIAL_CHARS = "-._~";
private static final String URL_UNRESERVED = URL_UNRESERVED_CHARS + URL_UNRESERVED_SPECIAL_CHARS;
private static final String URL_RESERVED_GEN_DELIMS = ":/?#[]@";
private static final String URL_RESERVED_SUB_DELIMS = "!$&'()*+,;=";
private static final String URL_CHAR_ENCODING_SIGN = "%";

public static final String URL_ALLOWED_CHARS = URL_UNRESERVED + URL_RESERVED_GEN_DELIMS + URL_RESERVED_SUB_DELIMS + URL_CHAR_ENCODING_SIGN;

private static final String REGEX_SCHEME = "[A-Za-z][A-Za-z0-9+.-]*:"; //Also called 'protocol'
private static final String REGEX_AUTHORATIVE_DECLARATION = "/{2}";
private static final String REGEX_USERINFO = "(?:[A-Za-z0-9-._~]|%[A-Fa-f0-9]{2})+(?::(?:[A-Za-z0-9-._~]|%[A-Fa-f0-9]{2})+)?@";
private static final String REGEX_HOST = "(?:[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?\\.){1,126}[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?";
private static final String REGEX_PORT = ":[0-9]+";
private static final String REGEX_PATH = "/(?:[A-Za-z0-9-._~]|%[A-Fa-f0-9]{2})*";
private static final String REGEX_QUERY = "\\?(?:[A-Za-z0-9-._~]+(?:=(?:[A-Za-z0-9-._~+]|%[A-Fa-f0-9]{2})+)?)(?:[&|;][A-Za-z0-9-._~]+(?:=(?:[A-Za-z0-9-._~+]|%[A-Fa-f0-9]{2})+)?)*";
//FRAGMENTs don't need to be parsed as they won't be sent to the server anyways

public static final String REGEX_URL = "(?:" + REGEX_SCHEME + REGEX_AUTHORATIVE_DECLARATION + ")?(?:" + REGEX_USERINFO + ")?" + REGEX_HOST + "(?:" + REGEX_PORT + ")?(?:" + REGEX_PATH + ")*(?:" + REGEX_QUERY + ")?";
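
As a minimal sketch of how the composed pattern behaves, the following copies the REGEX_* constants from above into a small class and checks whole-string matches. The class name UrlRegexDemo, the helper isUrl, and the sample URLs are assumptions made for illustration; they are not part of the original code.

```java
import java.util.regex.Pattern;

public class UrlRegexDemo {
    // Constants copied unchanged from the question.
    private static final String REGEX_SCHEME = "[A-Za-z][A-Za-z0-9+.-]*:";
    private static final String REGEX_AUTHORATIVE_DECLARATION = "/{2}";
    private static final String REGEX_USERINFO = "(?:[A-Za-z0-9-._~]|%[A-Fa-f0-9]{2})+(?::(?:[A-Za-z0-9-._~]|%[A-Fa-f0-9]{2})+)?@";
    private static final String REGEX_HOST = "(?:[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?\\.){1,126}[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?";
    private static final String REGEX_PORT = ":[0-9]+";
    private static final String REGEX_PATH = "/(?:[A-Za-z0-9-._~]|%[A-Fa-f0-9]{2})*";
    private static final String REGEX_QUERY = "\\?(?:[A-Za-z0-9-._~]+(?:=(?:[A-Za-z0-9-._~+]|%[A-Fa-f0-9]{2})+)?)(?:[&|;][A-Za-z0-9-._~]+(?:=(?:[A-Za-z0-9-._~+]|%[A-Fa-f0-9]{2})+)?)*";

    public static final String REGEX_URL = "(?:" + REGEX_SCHEME + REGEX_AUTHORATIVE_DECLARATION + ")?(?:" + REGEX_USERINFO + ")?" + REGEX_HOST + "(?:" + REGEX_PORT + ")?(?:" + REGEX_PATH + ")*(?:" + REGEX_QUERY + ")?";

    // Compile once; Pattern instances are immutable and safe to reuse.
    private static final Pattern URL = Pattern.compile(REGEX_URL);

    // Hypothetical helper: true only if the whole input matches REGEX_URL.
    public static boolean isUrl(String candidate) {
        return URL.matcher(candidate).matches();
    }

    public static void main(String[] args) {
        String[] samples = {
            "http://example.com/path?x=1",
            "https://user:pw@sub.example.org:8080/a/b?k=v&k2=v2",
            "http://localhost/",        // host without a dot
            "http://example.com/#top"   // contains a fragment
        };
        for (String s : samples) {
            System.out.println(s + " -> " + isUrl(s));
        }
    }
}
```

With matches(), the first two samples are accepted and the last two are rejected: REGEX_HOST requires at least one dot, so a bare host name such as localhost fails, and because fragments are deliberately left out, any input containing a #fragment fails as a whole-string match.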

Solution

Holy regex Batman!

Without testing, you should be able to replace all occurrences of:

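  • [0-9] with \d
  • [A-Za-z0-9] with \p{Alnum} (or [\w&&[^_]]; note that [\w^_] does not exclude the underscore, because ^ only negates at the start of a character class)
  • [A-Fa-f0-9] with \p{XDigit}
  • [A-Za-z0-9-._~] with [\w.~-]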

and so forth. See documentation for Pattern.
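
Since these replacements are suggested without testing, here is a minimal sketch that checks them by comparing each long-hand character class with its shorthand over all ASCII characters. The class name ShorthandCheck and the helper sameOnAscii are hypothetical names chosen for this example, and the check assumes Java's default (ASCII-only) behaviour for \d, \w and the POSIX classes, i.e. no UNICODE_CHARACTER_CLASS flag.

```java
import java.util.regex.Pattern;

public class ShorthandCheck {
    // True if both character classes accept exactly the same ASCII characters.
    static boolean sameOnAscii(String longForm, String shortForm) {
        Pattern a = Pattern.compile(longForm);
        Pattern b = Pattern.compile(shortForm);
        for (char c = 0; c < 128; c++) {
            String s = String.valueOf(c);
            if (a.matcher(s).matches() != b.matcher(s).matches()) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(sameOnAscii("[0-9]", "\\d"));                  // true
        System.out.println(sameOnAscii("[A-Za-z0-9]", "\\p{Alnum}"));     // true
        System.out.println(sameOnAscii("[A-Za-z0-9]", "[\\w&&[^_]]"));    // true
        System.out.println(sameOnAscii("[A-Fa-f0-9]", "\\p{XDigit}"));    // true
        System.out.println(sameOnAscii("[A-Za-z0-9-._~]", "[\\w.~-]"));   // true
        // [\w^_] is NOT equivalent: it additionally accepts '^' and '_'.
        System.out.println(sameOnAscii("[A-Za-z0-9]", "[\\w^_]"));        // false
    }
}
```

Note that in [\w.~-] the hyphen is placed last so that it is read as a literal rather than as a range operator.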

As for your capturing groups, I'm quite sure you could do some clever back-referencing there, but I will save my sanity and not try to parse the regexp. :)

Context

StackExchange Code Review Q#78768, answer score: 7
