Matching script tags with regexes
Problem
Like anything that shouldn't be done, I decided to see if it is possible to match `<script>` tags robustly using regexes in PHP. Since there is no arbitrary nesting, I figured it should at least be possible.
This is what I came up with. It is designed to handle every edge case I could think of, including:
- arbitrary attributes in the opening script tag
- single and multiline comments and single and double-quoted strings (which might include arbitrary escape sequences) in the javascript which may contain the characters `</script>`
- Captures the smallest script tag it finds.
Did I miss anything? Ideally, I want it to match exclusively anything a browser would consider a script element (might not be possible), but at the very least, I would like it to match only well-formed script tags with well-formed javascript.
Here is the string for the regex that I am passing to preg_match:

    '#<script(?:[^>"]*(?:"[^"]*")?)*>((?:"(?:[^\\\\\\n"]*(?:\\\\.)*)*"|\'(?:[^\\\\\\n\']*(?:\\\\.)*)*\'|#';

Note: I am not using this in production.
Solution
Instead of writing a long uncommented regex, compose it out of multiple parts. This makes it easier to understand what you're trying to do.
Let's go through the spec for a start tag:
- The first character of a start tag must be a "<" (U+003C) character.
- The next few characters of a start tag must be the element's tag name.
- If there are to be any attributes in the next step, there must first be one or more space characters.
- Then, the start tag may have a number of attributes, the syntax for which is described below. Attributes must be separated from each other by one or more space characters.
- After the attributes, or after the tag name if there are no attributes, there may be one or more space characters. (Some attributes are required to be followed by a space. See the attributes section below.)
- Then, if the element is one of the void elements, or if the element is a foreign element, then there may be a single "/" (U+002F) character. This character has no effect on void elements, but on foreign elements it marks the start tag as self-closing.
- Finally, start tags must be closed by a ">" (U+003E) character.
Expressed as a regex fragment, with insignificant whitespace:

    $start_tag = "(?: [<] script (?: $ws+ (?:$attribute $ws+)* $attribute)? $ws* [>] )"

Attributes themselves are rather complex. They can be empty, unquoted, single-quoted or double-quoted.

    $attribute = "(?: $attr_name
                  | $attr_name $ws* [=] $ws* (?:[^${space_characters}\"'<>`&]+|$character_reference)+
                  | $attr_name $ws* [=] $ws* [\"] (?:[^\"&]|$character_reference)* [\"]
                  | $attr_name $ws* [=] $ws* ['] (?:[^'&]|$character_reference)* [']
                  )"

Now once we fill in the appropriate values for character references, the space characters recognized by HTML5, and possible attribute names, we have finally correctly matched the `<script>` start tag.
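As a rough illustration of "filling in the values", the sketch below assembles the two fragments into one pattern and tries it on a few inputs. The values chosen for `$space_characters`, `$attr_name` and `$character_reference` are simplified assumptions, not taken from the answer or copied from the spec, and the function names and the `i` modifier are likewise hypothetical additions. The pattern relies on the `x` modifier so the whitespace inside the fragments is ignored.

```php
<?php
// Builds an anchored pattern for a <script> start tag from small fragments.
// All concrete values below are simplifying assumptions for illustration.
function script_start_tag_pattern(): string
{
    $space_characters = '\\t\\n\\f\\r\\x20';   // tab, LF, FF, CR, space
    $ws = "[{$space_characters}]";
    // Simplified attribute name: no whitespace, quotes, <, >, /, = or control characters.
    $attr_name = "[^{$space_characters}\"'<>/=\\x00-\\x1F\\x7F]+";
    // Simplified character reference; [#] stops "#" from starting a comment in x mode.
    $character_reference = '&(?:[a-zA-Z][a-zA-Z0-9]*|[#][0-9]+|[#][xX][0-9a-fA-F]+);';

    // The two fragments from the answer: empty, unquoted, double-quoted, single-quoted.
    $attribute = "(?: $attr_name
                  | $attr_name $ws* [=] $ws* (?:[^{$space_characters}\"'<>`&]+|$character_reference)+
                  | $attr_name $ws* [=] $ws* [\"] (?:[^\"&]|$character_reference)* [\"]
                  | $attr_name $ws* [=] $ws* ['] (?:[^'&]|$character_reference)* [']
                  )";
    $start_tag = "(?: [<] script (?: $ws+ (?:$attribute $ws+)* $attribute)? $ws* [>] )";

    // x: ignore whitespace in the pattern; i: match tag names case-insensitively.
    return "~\\A{$start_tag}\\z~xi";
}

// Returns true when the whole string is a <script> start tag under the rules above.
function is_script_start_tag(string $s): bool
{
    return preg_match(script_start_tag_pattern(), $s) === 1;
}

var_dump(is_script_start_tag('<script src="a.js" async>')); // bool(true)
var_dump(is_script_start_tag('<script "">'));               // bool(false)
```

Because each fragment has a name, a wrong rule (say, the attribute name class) can be changed in one place without touching the rest of the pattern.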
How does your solution hold up? I am led to believe that this part is supposed to match the start tag (I added whitespace for clarity):

    <script (?: [^>"]* (?:"[^"]*")? )* >

Err, no. This fails to match single-quoted attribute values that contain a `>`. It will also match some strings that do not contain valid HTML at all.
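Both problems can be checked quickly by running the quoted start-tag fragment on its own. The inputs below are made-up examples, not the ones from the original answer, and `asker_match` is a hypothetical helper name.

```php
<?php
// Runs the asker's start-tag fragment (as quoted above, in x mode) and
// returns the text it matched, or null when nothing matched.
function asker_match(string $html): ?string
{
    $pattern = '~<script (?: [^>"]* (?:"[^"]*")? )* >~x';
    return preg_match($pattern, $html, $m) === 1 ? $m[0] : null;
}

// A ">" inside a double-quoted value is handled.
var_dump(asker_match('<script src="a>b.js">'));  // string(21) "<script src="a>b.js">"

// A ">" inside a single-quoted value ends the match too early.
var_dump(asker_match("<script data-x='>'>"));    // string(17) "<script data-x='>"

// Inputs that are not valid script start tags are accepted anyway.
var_dump(asker_match('<script "">'));            // string(11) "<script "">"
var_dump(asker_match('<scripted>'));             // string(10) "<scripted>"
```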
What is the point of this exercise? It is absolutely possible to correctly match HTML with PCRE. It is not possible with a regular language, but that term has a specific computer science meaning, and too many people confuse the theoretical concept with the similarly named practical tool, which happens to be more powerful. However, if you do want to do this, you have to follow the spec. Don't fudge it, read it. (Actually, I fudged it as well, but to a lesser degree. Do as I say, not as I do.)
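To make the difference between regular languages and PCRE concrete: balanced parentheses are a standard example of a language that is not regular, yet PCRE can match them with a recursive subpattern. This is only a side illustration with a hypothetical function name; it is not part of the script-tag pattern.

```php
<?php
// Returns true when $s is a single, fully balanced parenthesised group.
// (?1) re-enters capture group 1, which a theoretical regular expression cannot do.
function is_balanced(string $s): bool
{
    return preg_match('~\\A ( \\( (?: [^()]++ | (?1) )* \\) ) \\z~x', $s) === 1;
}

var_dump(is_balanced('(a(b)c)')); // bool(true)
var_dump(is_balanced('(a(b)c'));  // bool(false)
```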
It is absolutely possible to write readable regexes. Compose them from multiple reusable parts. Use the `/x` option to include insignificant whitespace. But do not ram them onto a single line. That's just obfuscation, and you wouldn't do that in other languages.

Code Snippets

    $start_tag = "(?: [<] script (?: $ws+ (?:$attribute $ws+)* $attribute)? $ws* [>] )"

    $attribute = "(?: $attr_name
                  | $attr_name $ws* [=] $ws* (?:[^${space_characters}\"'<>`&]+|$character_reference)+
                  | $attr_name $ws* [=] $ws* [\"] (?:[^\"&]|$character_reference)* [\"]
                  | $attr_name $ws* [=] $ws* ['] (?:[^'&]|$character_reference)* [']
                  )"

    <script (?: [^>"]* (?:"[^"]*")? )* >

Context
StackExchange Code Review Q#40843, answer score: 11