RFC 2812 message validation regex
Problem
In my ongoing quest for an IRC client, I'm working on more stringent validation per RFC 2812. As with everything else I'm doing on this project, I'm trying to implement it all from scratch for the sake of joy/frustration and education. Thus I'm not going to be using any other libraries or tools designed to work with IRC clients/servers/messages/etc.
I've written this regex to validate a message. It's a little hairy, so I'm wondering if there's a way to reduce repetition, or at least to make it cleaner and easier to read. I didn't always add comments when I felt a section was easy to read, or if it was identical to another one I had previously commented.
I've also thought about maybe splitting it up into the different sections of prefix, command, and parameters, and then validating each separately.
Alternately, do you think that ditching regexes entirely might be a better solution? Right now I do consider this readable and maintainable, albeit with some effort, but I'd like to know if you feel the same.
```
import re
message_regex = re.compile(
r"""
# Validation of RFC 2812 messages
# https://www.rfc-editor.org/rfc/rfc2812#page-6
^ # Match from the start
# Optional prefix
( # leading colon required
:(?P
# prefix group can be the server's name, which consists of
(?P
[a-zA-Z0-9] # A leading alphanumeric character
[a-zA-Z0-9\-]* # Followed by the same with hyphens
[a-zA-Z0-9]* # Can't have the final character be a hyphen
# Then the above with a leading period
(\.[a-zA-Z0-9][a-zA-Z0-9\-]*[a-zA-Z0-9]*)*
) | # otherwise it can be a nickname
(?P
[a-zA-Z\x5B-\x60\x7B-\x7D] # letter or special character
[a-zA-Z0-9\x5B-\x60\x7B-\x7D\-]{,8} # up to 8 of the same, plus digits/hyphens
# followed by an optional host, which in turn has a
```
Solution
- You're interested in matching byte strings here (not Unicode strings), and so to make this clear (and for compatibility with Python 3) I recommend adding the `b` prefix to the regular expression.
- I don't think there's anywhere you care about the difference between `a` and `A` (you're always matching case-insensitively). So you could pass the `re.IGNORECASE` flag and use `[A-Z]` instead of `[a-zA-Z]`, which would shorten things.
- There's no need to escape a hyphen in a character class if it's the first or last character in the class. Also, `\d` matches a digit inside a character class (in a bytes pattern it matches only `[0-9]`). So `[a-zA-Z0-9\-]` can become `[a-zA-Z\d-]`, or `[A-Z\d-]` if you use `re.IGNORECASE`.
- `{1,}` can be written as `+`.
- This part of the regular expression doesn't prevent the final character from being a hyphen:

      [a-zA-Z0-9]     # A leading alphanumeric character
      [a-zA-Z0-9\-]*  # Followed by the same with hyphens
      [a-zA-Z0-9]*    # Can't have the final character be a hyphen

  That's because `*` means "zero or more times", and so the final `[a-zA-Z0-9]` might match zero times.

  I can see that you are just following the syntax in RFC 2812, but that's a mistake! If you look at RFC 952 then you'll see the syntax for hostnames was originally given as follows:

      <hname> ::= <name>*["."<name>]
      <name>  ::= <let>[*[<let-or-digit-or-hyphen>]<let-or-digit>]

  and then in RFC 1123 "the restriction on the first character is relaxed to allow either a letter or a digit". So this corresponds to:

      [A-Z\d]        # A leading alphanumeric character
      (?:
        [A-Z\d-]*    # Optionally followed by the same with hyphens
        [A-Z\d]      # But the final character must not be a hyphen
      )?
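
The trailing-hyphen point can be checked directly. The following is a minimal sketch, assuming each label is tested in isolation with `fullmatch`; the names `original_label`, `fixed_label` and `accepts` are made up for this illustration and do not come from the question.

```python
import re

# Label pattern as written in the question: the final class is optional,
# so nothing forces the last character to be alphanumeric.
original_label = re.compile(rb'[a-zA-Z0-9][a-zA-Z0-9\-]*[a-zA-Z0-9]*')

# Label pattern following RFC 952 as relaxed by RFC 1123.
fixed_label = re.compile(rb'[A-Z\d](?:[A-Z\d-]*[A-Z\d])?', re.IGNORECASE)


def accepts(pattern, label):
    """Return True if the whole label matches the compiled pattern."""
    return pattern.fullmatch(label) is not None


# Print how each pattern treats a few sample labels.
for label in (b'irc', b'irc-', b'a-b', b'-irc', b'a'):
    print(label, accepts(original_label, label), accepts(fixed_label, label))
```

The interesting row is `b'irc-'`: the question's pattern accepts it, while the corrected one rejects it.
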
- This part of the regular expression:

      [a-zA-Z0-9\-]*
      [a-zA-Z0-9]*

  can match the same string in multiple ways, which is unwise for performance reasons. The problem is that when a string fails to match, the regular expression engine will have to backtrack many times before it can prove that there is no match.

  For example, it takes half a millisecond for Python to determine that the regular expression `A*Z` fails to match a string of 100,000 letter ‘A’s:

      >>> import re
      >>> from timeit import timeit
      >>> timeit(lambda:re.match('A*Z', 'A'*100000), number=1)
      0.0005234569543972611

  but it takes nearly 14 seconds (more than 26,000 times as long) to determine that `A*A*Z` fails to match this string:

      >>> timeit(lambda:re.match('A*A*Z', 'A'*100000), number=1)
      13.715154931996949

  So what you need here is:

      [A-Z\d]+         # Initial alphanumeric word.
      (?:-+[A-Z\d]+)*  # Hyphen(s) followed by word, zero or more times.
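
To see the growth without waiting 14 seconds, the same experiment can be repeated on much shorter inputs. This is only a rough sketch: the helper name `time_failed_match` and the chosen lengths are assumptions for illustration, and the absolute timings will vary with the machine and the Python version.

```python
import re
from timeit import timeit


def time_failed_match(pattern, length, repeats=3):
    """Best-of-`repeats` time for `pattern` to fail on `length` letter A's."""
    subject = 'A' * length
    # The experiment only makes sense if the match really fails.
    assert re.match(pattern, subject) is None
    return min(timeit(lambda: re.match(pattern, subject), number=1)
               for _ in range(repeats))


# Doubling the input length should roughly double the time for A*Z,
# but roughly quadruple it for the ambiguous A*A*Z.
for length in (500, 1000, 2000):
    single = time_failed_match('A*Z', length)
    double = time_failed_match('A*A*Z', length)
    print(f'{length:>5}  A*Z: {single:.6f}s  A*A*Z: {double:.6f}s')
```
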
- `(:|[^\x00\x0A\x0D\x20\x3A])` is a long-winded way of writing `[^\x00\n\r ]`.
- There are a lot of unused capturing parentheses. For example, in

      ([a-fA-F0-9]{1,}(:[a-fA-F0-9]{1,}){7})|
      0:0:0:0:0:(0|FFFF):(\d{1,3}\.){3}\d{1,3}

  there are four capturing parentheses. These can either be omitted if unnecessary, or changed to be non-capturing:

      [a-fA-F\d]+(?::[a-fA-F\d]+){7}|
      0:0:0:0:0:(?:0|FFFF):(?:\d{1,3}\.){3}\d{1,3}
- If you're worried about the readability of this giant regular expression, consider building it out of named parts using string formatting. This allows you to avoid some of the repetition by reusing parts. For example, you might start like this:

      def build_message_regexp():
          # bytes objects have no format() method, so build the pattern as
          # str (doubling literal braces such as {{7}} in formatted parts)
          # and encode it to bytes only when compiling.
          word = r'[A-Z\d]+'
          shortname = r'{word}(?:-+{word})*'.format(**locals())
          hostname = r'{shortname}(?:\.{shortname})*'.format(**locals())
          ip4addr = r'\d{1,3}(?:\.\d{1,3}){3}'
          hexdigit = r'[A-F\d]'
          ip6addr1 = r'{hexdigit}+(?::{hexdigit}+){{7}}|'.format(**locals())
          ip6addr2 = r'0:0:0:0:0:(?:0|FFFF):{ip4addr}'.format(**locals())
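
To show where this approach might lead, here is a small self-contained sketch that carries the host-related parts through to compiled patterns. The function name `build_host_regexps`, its return shape, and the inner helper `compile_bytes` are assumptions made for this example; it covers only hostnames and host addresses, not the whole message grammar, and it passes the parts as explicit keyword arguments rather than `**locals()`.

```python
import re


def build_host_regexps():
    """Return (hostname, hostaddr) as compiled bytes patterns.

    Hypothetical helper: the name and return shape are illustrative only.
    """
    # Parts are built as str because bytes has no format() method.
    # Literal quantifier braces are doubled only in strings that are
    # passed through format().
    word = r'[A-Z\d]+'
    shortname = r'{word}(?:-+{word})*'.format(word=word)
    hostname = r'{shortname}(?:\.{shortname})*'.format(shortname=shortname)

    ip4addr = r'\d{1,3}(?:\.\d{1,3}){3}'
    hexdigit = r'[A-F\d]'
    ip6addr = (
        r'{hexdigit}+(?::{hexdigit}+){{7}}'
        r'|0:0:0:0:0:(?:0|FFFF):{ip4addr}'
    ).format(hexdigit=hexdigit, ip4addr=ip4addr)
    hostaddr = r'(?:{ip4addr}|{ip6addr})'.format(ip4addr=ip4addr,
                                                 ip6addr=ip6addr)

    def compile_bytes(pattern):
        # Encode once at the end so the compiled pattern matches bytes.
        return re.compile(pattern.encode('ascii'), re.IGNORECASE)

    return compile_bytes(hostname), compile_bytes(hostaddr)


hostname_re, hostaddr_re = build_host_regexps()
print(bool(hostname_re.fullmatch(b'irc.example.com')))
print(bool(hostname_re.fullmatch(b'irc-.example.com')))
print(bool(hostaddr_re.fullmatch(b'0:0:0:0:0:FFFF:127.0.0.1')))
```

Keeping each part in a named variable means a fix such as the hyphen rule is made once in `shortname` and picked up everywhere it is reused.
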
Context
StackExchange Code Review Q#116909, answer score: 6