
Regex to remove inline javascript from string


Problem

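I need to remove inline JavaScript from a given string. For example, if the user typed a tag carrying an inline event handler such as onload or onclick, I need to convert it into the same tag without that attribute.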


I created this PHP code and it works (apparently without issues):

http://writecodeonline.com/php/

function test_input($input){
  //I have a list with all events but for this example I used two
  $html_events = 'onload|onclick';
  $pattern = "/(<[^>]*)($html_events)([\s]*=[\s]*)('[^>]*'|\"[^>]*\")([^>]*>)/i";
  $replacement = '$1$5';
  while( preg_match($pattern, $input) ){
    $input = preg_replace($pattern, $replacement, $input);
  }
  return htmlentities($input);
}

echo test_input(''). '';
echo test_input(''). '';
echo test_input('hello buddies'). '';


I'm just looking for improvements, or for use cases that I don't support or that break my regex. I would appreciate it if you could tell me something like:

This:
test_input('something bad'); breaks your regex.

Or, if you find an improvement that a benchmark shows performs better, I would be happy to apply it as long as it does not break the use cases already supported.

Thank You!

Update
I finally used HTML Purifier.

Solution

Parsing markup with regex is like building your house out of Lego: it's not the right tool for the job. HTML is not a regular language, so regular expressions don't cut the mustard. More than that, you're actively working to bring the world as we know it to an end, which drives people insane.
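To make that concrete, here is a small sketch of inputs that slip past a pattern of this shape. It assumes the question's pattern began with the group `(<[^>]*)`; that first group and the three sample inputs are assumptions made for illustration, not taken from the question.

```php
<?php
// ASSUMPTION: the first group "(<[^>]*)" is a reconstruction of the question's pattern.
$html_events = 'onload|onclick';
$pattern = "/(<[^>]*)($html_events)([\s]*=[\s]*)('[^>]*'|\"[^>]*\")([^>]*>)/i";

// Hypothetical inputs, chosen only to exercise the pattern.
$cases = array(
    'quoted value'       => '<img src="a.png" onload="alert(1)">',
    'unquoted value'     => '<img src="a.png" onload=alert(1)>',
    'value containing >' => '<a onclick="if (a > b) { go(); }">hello</a>',
);

foreach ($cases as $label => $html) {
    // Same replacement as in the question: keep what precedes and follows the handler.
    $stripped = preg_replace($pattern, '$1$5', $html);
    echo $label, ': ', ($stripped === $html ? 'handler survives' : 'handler removed'), PHP_EOL;
}
```

Only the quoted case is matched. The pattern insists on a quoted value, and `[^>]*` cannot cross a `>` inside that value, so the other two handlers are left in the string. Patching each such case by hand is exactly the kind of upkeep a parser saves you.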

What you need is a DOM parser, and as luck would have it, PHP has the DOMDocument object, which is just that:

$dom = new DOMDocument;
$dom->loadHTML('<img onload="alert(\'hello world\');" onclick="alert(\'hello world\');" />');
$nodes = $dom->getElementsByTagName('*'); // just get all nodes
// $dom->getElementsByTagName('img'); would work, too
foreach($nodes as $node)
{
    if ($node->hasAttribute('onload'))
    {
        $node->removeAttribute('onload');
    }
    if ($node->hasAttribute('onclick'))
    {
        $node->removeAttribute('onclick');
    }
}
echo $dom->saveHTML();//will include html, head, body tags and doctype


Tadaa... both onload and onclick have been removed from the markup, without the pain of writing a reliable and stable regex that can deal with inline JS. As an added bonus, this code will be far more maintainable (and extensible) in the future. I'd much rather maintain this code than rework a regular expression somebody wrote a couple of months ago...

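If you want, you can echo only the tags you've changed. Use this loop in place of the one above, since it reuses $dom and $nodes and only reports nodes that still carry one of the attributes: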

$changed = array();
$attributesOfDeath = array('onload', 'onclick');
foreach($nodes as $node)
{
    $current = null;
    foreach($attributesOfDeath as $attr)
    {
        if ($node->hasAttribute($attr))
        {
            $node->removeAttribute($attr);
            $current = $node;
        }
    }
    if ($current)
    {
        $changed[] = $current;//add to changed array
    }
}
$changed = array_map(array($dom, 'saveXML'), $changed);
echo implode(PHP_EOL, $changed);


As Jan said, for maintainability it's best to use an array of "forbidden attributes". That's what the $attributesOfDeath array is for. If you later want to check for a third or fourth attribute, you can simply add it to the array, and nothing else in your code needs to change. It'll just keep on working as before.
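For what it's worth, the pieces above could be wrapped into one reusable function along these lines. This is only a sketch: the name strip_attributes, its parameters, and the sample markup are made up for illustration. It also serializes just the children of the body element, which is one way to avoid the doctype, html, and body wrappers that a bare saveHTML() call adds.

```php
<?php
// Hypothetical helper: remove every attribute listed in $forbidden from $html.
function strip_attributes($html, array $forbidden)
{
    if (trim($html) === '') {
        return ''; // nothing to parse; loadHTML() does not accept an empty string
    }

    $dom = new DOMDocument;
    $previous = libxml_use_internal_errors(true); // user markup is rarely valid; keep warnings quiet
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($dom->getElementsByTagName('*') as $node) {
        foreach ($forbidden as $attr) {
            if ($node->hasAttribute($attr)) {
                $node->removeAttribute($attr);
            }
        }
    }

    // Serialize only what sits inside <body>, skipping the wrappers added by the parser.
    $body = $dom->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return '';
    }
    $out = '';
    foreach ($body->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}

$attributesOfDeath = array('onload', 'onclick');
echo strip_attributes('<img src="a.png" onload="alert(1)" onclick="alert(2)">', $attributesOfDeath), PHP_EOL;
```

Adding another attribute is still a one-line change to $attributesOfDeath. Bear in mind that this only strips the listed attributes; it is not a full sanitizer, which is where a dedicated library such as HTML Purifier comes in.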

Code Snippets

$dom = new DOMDocument;
$dom->loadHTML('<img onload="alert(\'hello world\');" onclick="alert(\'hello world\');" />');
$nodes = $dom->getElementsByTagName('*');//just get all nodes, 
//$dom->getElementsByTagName('img'); would work, too
foreach($nodes as $node)
{
    if ($node->hasAttribute('onload'))
    {
        $node->removeAttribute('onload');
    }
    if ($node->hasAttribute('onclick'))
    {
        $node->removeAttribute('onclick');
    }
}
echo $dom->saveHTML();//will include html, head, body tags and doctype

$changed = array();
$attributesOfDeath = array('onload', 'onclick');
foreach($nodes as $node)
{
    $current = null;
    foreach($attributesOfDeath as $attr)
    {
        if ($node->hasAttribute($attr))
        {
            $node->removeAttribute($attr);
            $current = $node;
        }
    }
    if ($current)
    {
        $changed[] = $current;//add to changed array
    }
}
$changed = array_map(array($dom, 'saveXML'), $changed);
echo implode(PHP_EOL, $changed);

Context

StackExchange Code Review Q#30045, answer score: 5
