Matching script tags with regexes


Problem

Like anything that shouldn't be done, I decided to see if it is possible to match `<script>` tags robustly using regexes in PHP. Since there is no arbitrary nesting, I figured it should at least be possible.

This is what I came up with. It is designed to handle every edge case I could think of, including:

  • arbitrary attributes in the opening script tag



  • single and multiline comments and single and double-quoted strings (which might include arbitrary escape sequences) in the javascript which may contain the characters `</script>`



  • Captures the smallest script tag it finds.



Did I miss anything? Ideally, I want it to match exclusively anything a browser would consider a script element (might not be possible), but at the very least, I would like it to match only well-formed script tags with well-formed javascript.

Here is the string for the regex that I am passing to preg_match:

'#"]*(?:"[^"]*")?)*>((?:"(?:[^\\\\\\n"]*(?:\\\\.)*)*"|\'(?:[^\\\\\\n\']*(?:\\\\.)*)*\'|#';


Note: I am not using this in production.

Solution

Instead of writing a long uncommented regex, compose it out of multiple parts. This makes it easier to understand what you're trying to do.

Let's go through the spec for a start tag:



  • The first character of a start tag must be a "<" (U+003C) character.



  • The next few characters of a start tag must be the element's tag name.



  • If there are to be any attributes in the next step, there must first be one or more space characters.



  • Then, the start tag may have a number of attributes, the syntax for which is described below. Attributes must be separated from each other by one or more space characters.



  • After the attributes, or after the tag name if there are no attributes, there may be one or more space characters. (Some attributes are required to be followed by a space. See the attributes section below.)



  • Then, if the element is one of the void elements, or if the element is a foreign element, then there may be a single "/" (U+002F) character. This character has no effect on void elements, but on foreign elements it marks the start tag as self-closing.



  • Finally, start tags must be closed by a ">" (U+003E) character.




Expressed as a regex fragment, with insignificant whitespace:

$start_tag = "(?: [<] script (?: $ws+ (?:$attribute $ws+)* $attribute)? $ws* [>] )"


Attributes themselves are rather complex. They can either be empty, unquoted, single-quoted or double-quoted.

$attribute = "(?: $attr_name
              |   $attr_name $ws* [=] $ws* (?:[^${space_characters}\"'<>`&]+|$character_reference)+
              |   $attr_name $ws* [=] $ws* [\"] (?:[^\"&]|$character_reference)* [\"]
              |   $attr_name $ws* [=] $ws* ['] (?:[^'&]|$character_reference)* [']
              )"


Now once we fill in the appropriate values for character references, the space characters recognized by HTML5, and possible attribute names, we have finally correctly matched the `<script>` start tag.
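As a rough sketch of what "filling in the values" could look like, the fragments can be assembled as below. The definitions of `$space_characters`, `$ws`, `$attr_name` and `$character_reference` are my own simplified assumptions (they are not given in the answer and are looser than the spec in places), the unquoted-value branch is written without the nested `+`, and the `~` delimiter with the `x` and `i` flags is just one workable choice.

```php
<?php
// Assumption: HTML5 space characters are space, tab, LF, FF, CR.
// Single quotes keep the backslashes so that PCRE interprets the escapes.
$space_characters = '\x20\t\n\f\r';
$ws = "[{$space_characters}]";

// Assumption: simplified attribute name (no space characters, NUL, quotes, ">", "/", "=").
$attr_name = "[^{$space_characters}\\x00\"'>/=]+";

// Assumption: named, decimal, or hexadecimal character reference.
// "#" is escaped because it would start a comment under the x flag.
$character_reference = '(?: & (?: [a-zA-Z][a-zA-Z0-9]* | \# [0-9]+ | \# [xX] [0-9a-fA-F]+ ) ; )';

// Attribute forms from the answer: empty, unquoted, double-quoted, single-quoted.
$attribute = "(?: $attr_name
              |   $attr_name $ws* [=] $ws* (?:[^{$space_characters}\"'<>`&]|$character_reference)+
              |   $attr_name $ws* [=] $ws* [\"] (?:[^\"&]|$character_reference)* [\"]
              |   $attr_name $ws* [=] $ws* ['] (?:[^'&]|$character_reference)* [']
              )";

$start_tag = "(?: [<] script (?: $ws+ (?:$attribute $ws+)* $attribute)? $ws* [>] )";

// x: whitespace in the pattern is insignificant; i: tag names are case-insensitive.
$pattern = "~{$start_tag}~xi";

// Hypothetical sample inputs, chosen only for illustration.
$samples = ['<script>', '<script type="text/javascript" async>', "<script data-x='a>b'>", '<script """">'];
foreach ($samples as $sample) {
    echo $sample, ' => ', preg_match($pattern, $sample), "\n";
}
```

Because every fragment is a named variable, each one can be tested or tightened toward the spec independently of the others.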

How does your solution hold up? I am led to believe that this part is supposed to match the start tag (I added whitespace for clarity):

<script (?: [^>"]* (?:"[^"]*")? )* >


Err, no. This fails to handle single-quoted attribute values: a `>` inside single quotes ends the match early. It will also match some strings that are not valid HTML at all.
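To see both problems concretely, here is a small check of the question's start-tag fragment. The two input strings are hypothetical examples picked for illustration; they are not necessarily the ones the answer originally used.

```php
<?php
// The start-tag fragment from the question; the x flag makes the added whitespace insignificant.
$naive = '~<script (?: [^>"]* (?:"[^"]*")? )* >~x';

// A single-quoted attribute value containing ">": the match stops too early.
$single_quoted = "<script data-x='a>b'>";
preg_match($naive, $single_quoted, $m);
echo $m[0], "\n"; // prints: <script data-x='a>

// Not a valid start tag (attribute names cannot contain quotes), yet it matches.
$invalid = '<script """">';
echo preg_match($naive, $invalid), "\n"; // prints: 1
```

The first case is the important one: the regex reports a match, but the captured tag is cut off inside the attribute value.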

What is the point of this exercise? It is absolutely possible to match HTML correctly with PCRE. (It is not possible with a regular language, a term with a specific meaning in computer science; too many people confuse that theoretical concept with the similarly named practical tool, which happens to be more powerful.) However, if you do want to do this, you have to follow the spec. Don't fudge it, read it. (Actually, I fudged it as well, but to a lesser degree. Do as I say, not as I do.)
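A quick way to see that PCRE goes beyond regular languages is balanced parentheses, the textbook example of a non-regular language. The sketch below uses a recursive subpattern call; the group name `parens` and the test strings are my own choices for illustration.

```php
<?php
// (?&parens) calls the named group recursively, which no regular language can express.
$balanced = '~^ (?<parens> \( (?: [^()]++ | (?&parens) )* \) ) $~x';

// Hypothetical inputs: two balanced strings and one with a missing closing parenthesis.
foreach (['(a(b)c)', '((()))', '(a(b)c'] as $input) {
    echo $input, ' => ', preg_match($balanced, $input), "\n";
}
```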

It is absolutely possible to write readable regexes. Compose them from multiple reusable parts. Use the `/x` option to include insignificant whitespace. But do not ram them onto a single line. That's just obfuscation, and you wouldn't do that in other languages.
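As an illustration of the `/x` style, here is a sketch of a fragment for a double-quoted JavaScript string literal, loosely modeled on the corresponding branch of the question's regex. The variable name `$dq_string`, the nowdoc layout, and the sample input are assumptions of mine, and the fragment deliberately ignores finer points of JavaScript string syntax.

```php
<?php
// A nowdoc avoids PHP-level escaping, so the text below is exactly what PCRE sees.
// Under the x flag, whitespace is ignored and "#" starts a comment.
$dq_string = <<<'RE'
    "                # opening quote
    (?: [^\\\n"]     # any character except backslash, newline, or quote
    |   \\ .         # or a backslash followed by any one character
    )*
    "                # closing quote
RE;

$pattern = "~{$dq_string}~x";

// Hypothetical input containing escaped quotes inside the first string literal.
$source = 'var s = "a \"quoted\" word"; var t = "x";';
preg_match($pattern, $source, $m);
echo $m[0], "\n";
```

The same fragment can then be interpolated into a larger pattern alongside fragments for single-quoted strings and comments, each documented in place.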

Code Snippets

$start_tag = "(?: [<] script (?: $ws+ (?:$attribute $ws+)* $attribute)? $ws* [>] )"

$attribute = "(?: $attr_name
              |   $attr_name $ws* [=] $ws* (?:[^${space_characters}\"'<>`&]+|$character_reference)+
              |   $attr_name $ws* [=] $ws* [\"] (?:[^\"&]|$character_reference)* [\"]
              |   $attr_name $ws* [=] $ws* ['] (?:[^'&]|$character_reference)* [']
              )"

<script (?: [^>"]* (?:"[^"]*")? )* >

Context

StackExchange Code Review Q#40843, answer score: 11
