snippetjavaMinor
Regex to parse URLs for their correctness according to RFC 3986
Problem
I recently wrote a regex to parse URLs, and now I wonder: did I miss something? Did I make a mistake, or could I have written it more cleanly? That's why I'm here.
To write the regex, I used RFC 3986 as a reference (together with some Wikipedia articles about URIs/URLs).
//According to http://www.ietf.org/rfc/rfc3986.txt
private static final String URL_UNRESERVED_CHARS = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";
private static final String URL_UNRESERVED_SPECIAL_CHARS = "-._~";
private static final String URL_UNRESERVED = URL_UNRESERVED_CHARS + URL_UNRESERVED_SPECIAL_CHARS;
private static final String URL_RESERVED_GEN_DELIMS = ":/?#[]@";
private static final String URL_RESERVED_SUB_DELIMS = "!$&'()*+,;=";
private static final String URL_CHAR_ENCODING_SIGN = "%";
public static final String URL_ALLOWED_CHARS = URL_UNRESERVED + URL_RESERVED_GEN_DELIMS + URL_RESERVED_SUB_DELIMS + URL_CHAR_ENCODING_SIGN;
private static final String REGEX_SCHEME = "[A-Za-z][A-Za-z0-9+.-]*:"; //Also called 'protocol'
private static final String REGEX_AUTHORATIVE_DECLARATION = "/{2}";
private static final String REGEX_USERINFO = "(?:[A-Za-z0-9-._~]|%[A-Fa-f0-9]{2})+(?::(?:[A-Za-z0-9-._~]|%[A-Fa-f0-9]{2})+)?@";
private static final String REGEX_HOST = "(?:[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?\\.){1,126}[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?";
private static final String REGEX_PORT = ":[0-9]+";
private static final String REGEX_PATH = "/(?:[A-Za-z0-9-._~]|%[A-Fa-f0-9]{2})*";
private static final String REGEX_QUERY = "\\?(?:[A-Za-z0-9-._~]+(?:=(?:[A-Za-z0-9-._~+]|%[A-Fa-f0-9]{2})+)?)(?:[&|;][A-Za-z0-9-._~]+(?:=(?:[A-Za-z0-9-._~+]|%[A-Fa-f0-9]{2})+)?)*";
//FRAGMENTs don't need to be parsed as they won't be sent to the server anyways
public static final String REGEX_URL = "(?:" + REGEX_SCHEME + REGEX_AUTHORATIVE_DECLARATION + ")?(?:" + REGEX_USERINFO + ")?" + REGEX_HOST + "(?:" + REGEX_PORT + ")?(?:" + REGEX_PATH + ")*(?:" + REGEX_QUERY + ")?";
Solution
Holy regex Batman!
Without testing, you should be able to replace all occurrences of:
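[0-9] with \d
[A-Za-z0-9] with [\w&&[^_]] (a ^ that is not the first character of a class is a literal caret, so [\w^_] would not exclude the underscore)
[A-Fa-f0-9] with \p{XDigit}
[A-Za-z0-9-._~] with [\w.~-]
and so forth. See the documentation for Pattern.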
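Since these replacements are suggested without testing, a quick check may be worthwhile. The following is only a sketch: the class name CharClassEquivalence and the helper sameAsciiSet are made up for illustration, and the comparison covers ASCII input only, with no Pattern flags set. It also shows why an underscore cannot be excluded from \w by writing a caret in the middle of a character class.

```java
import java.util.regex.Pattern;

class CharClassEquivalence {
    // Hypothetical helper: true when both character classes accept
    // exactly the same set of ASCII characters (code points 0..127).
    static boolean sameAsciiSet(String classA, String classB) {
        Pattern a = Pattern.compile(classA);
        Pattern b = Pattern.compile(classB);
        for (char c = 0; c < 128; c++) {
            String s = String.valueOf(c);
            if (a.matcher(s).matches() != b.matcher(s).matches()) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(sameAsciiSet("[0-9]", "\\d"));
        System.out.println(sameAsciiSet("[A-Fa-f0-9]", "\\p{XDigit}"));
        System.out.println(sameAsciiSet("[A-Za-z0-9-._~]", "[\\w.~-]"));
        // \w includes the underscore, so it has to be subtracted with
        // Java's class intersection syntax rather than a mid-class caret.
        System.out.println(sameAsciiSet("[A-Za-z0-9]", "[\\w&&[^_]]"));
        System.out.println(sameAsciiSet("[A-Za-z0-9]", "[\\w^_]"));
    }
}
```

With default flags the first four comparisons hold and the last one does not, because [\w^_] still accepts the underscore.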
As for your capturing groups, I'm quite sure you can do some clever back-referencing there, but I will save my sanity and not try to parse the regexp. :)
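One caveat on back-references: in Java, \1 matches the same text an earlier group captured, not the same subpattern, so it probably would not shorten these expressions. Plain string constants may be the simpler way to remove the repetition. A minimal sketch, assuming hypothetical constant names (UNRESERVED, PCT_ENCODED, UNRESERVED_OR_ENCODED) that are not in the original post:

```java
import java.util.regex.Pattern;

class SharedSubpatterns {
    // Hypothetical building blocks; the names are illustrative only.
    static final String UNRESERVED = "[A-Za-z0-9-._~]";
    static final String PCT_ENCODED = "%[A-Fa-f0-9]{2}";
    static final String UNRESERVED_OR_ENCODED =
            "(?:" + UNRESERVED + "|" + PCT_ENCODED + ")";

    // Same patterns as the question's REGEX_USERINFO and REGEX_PATH,
    // assembled from the shared pieces instead of being written out twice.
    static final String REGEX_USERINFO =
            UNRESERVED_OR_ENCODED + "+(?::" + UNRESERVED_OR_ENCODED + "+)?@";
    static final String REGEX_PATH = "/" + UNRESERVED_OR_ENCODED + "*";

    public static void main(String[] args) {
        System.out.println(REGEX_USERINFO);
        System.out.println(REGEX_PATH);
        System.out.println(Pattern.matches(REGEX_PATH, "/a%20b"));
    }
}
```

The assembled strings are character-for-character identical to the literals in the question, so behavior should not change; only the source gets shorter.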
Context
StackExchange Code Review Q#78768, answer score: 7