
Is my PHP script/embed remover robust?

Submitted by: @import:stackexchange-codereview·Mar 10, 2026·


script remover robust php embed

Problem

The goal of this question:

Your goal here is to find a security hole in my code which allows a user to create input that contains a script doing anything they want, without that script being stopped by my code.

Please bear in mind that this is the ONLY goal of this post. I am not here to talk about how good the algorithm is, only whether it works. That said, I am still open to ideas on how to improve the algorithm, make it faster, etc., but please post them as comments. I will upvote any good suggestions :)

With that out of the way,

My code:

As you may have read above, you are trying to trick my code into letting a script through. What does that mean exactly? Well, I have created a PHP script which attempts to parse HTML and remove scripts and embeds which are not from trusted websites. Yes... using some regex. I know, but I do have a good reason. You see, the only reasons I have found and can think of NOT to use regex are:

HTML is recursive, regex is not. (Hence all this talk about context-free vs. regular. I am sure there are more reasons for HTML being "above regex", but the recursion is the best one I can personally come up with.)

However, the interesting thing about my problem is this: Scripts are not recursive! This means that WHENEVER I come across a script tag, everything between that and the next end tag will NEVER be html! Thinking of HTML as a bunch of random letters, with every once in a while an open-script tag and a close-script tag actually brings HTML down to a level that regex can handle. And that's almost exactly what I did...

The other reason is memory constraints. I've used a lot of non-capturing groups in my regex, so that should not be a problem. I also don't use 100% regex; I only use it as a generic way to detect tags. The actual handling of the tags is done with my own code (with just a bit more regex to select src attributes).
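To make the idea concrete, here is a minimal sketch of that approach: a regex finds each script element, and ordinary PHP code decides whether to keep it. This is not the code under review. The function name `stripUntrustedScripts`, the `TRUSTED_HOSTS` list, and the host `cdn.example.com` are assumptions made up for illustration.

```php
<?php
// Hypothetical whitelist; the post does not list its trusted hosts.
const TRUSTED_HOSTS = ['cdn.example.com'];

/**
 * Remove <script> elements unless their src points at a trusted host.
 * Illustrative sketch only, not the code under review.
 */
function stripUntrustedScripts(string $html): string
{
    // Non-greedy match from an opening <script ...> tag to the next </script>.
    $pattern = '~<script\b([^>]*)>.*?</script\s*>~is';

    $result = preg_replace_callback($pattern, function (array $m): string {
        // Pull a quoted src attribute out of the opening tag, if there is one.
        if (preg_match('~\bsrc\s*=\s*(["\'])(.*?)\1~is', $m[1], $src)) {
            $host = parse_url($src[2], PHP_URL_HOST);
            if (is_string($host) && in_array(strtolower($host), TRUSTED_HOSTS, true)) {
                return $m[0]; // trusted external script: keep it unchanged
            }
        }
        return ''; // inline or untrusted script: drop it
    }, $html);

    // preg_replace_callback returns null on a regex error; fail closed.
    return $result ?? '';
}
```

Note that this sketch only ever looks at script elements, which is exactly the assumption the question asks reviewers to attack.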

The reason I decided to push so hard for regex is this:

DomDocument only handles CLEAN html. I

Solution

The filter doesn't block inline JavaScript.

Example 1:

...

Example 2:

X
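As a hedged illustration of what "inline JavaScript" can mean here, the sketch below runs two hypothetical payloads through a stand-in filter. `removeScriptElements` is an assumed, simplified tag-based filter, not the code under review, and the payloads (an event-handler attribute and a `javascript:` URL) are illustrative rather than the exact examples from this answer.

```php
<?php
// Stand-in for a filter that only removes <script> and <embed> elements.
// NOT the code under review; a simplified assumption for demonstration.
function removeScriptElements(string $html): string
{
    // Match the opening tag and, if present, everything up to its closing tag.
    $pattern = '~<(script|embed)\b[^>]*>(?:.*?</\1\s*>)?~is';
    return preg_replace($pattern, '', $html) ?? '';
}

// Hypothetical payloads that contain no <script> or <embed> element at all.
$payloads = [
    '<div onmouseover="alert(1)">hover me</div>',
    '<a href="javascript:alert(1)">click me</a>',
];

foreach ($payloads as $payload) {
    $filtered = removeScriptElements($payload);
    // Report whether the filter changed the payload in any way.
    echo ($filtered === $payload ? 'passed through: ' : 'altered: '), $payload, PHP_EOL;
}
```

Both payloads pass through unchanged because neither contains a tag the stand-in filter looks for, yet a browser would still run the JavaScript in them.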

Also, it doesn't encode the HTML, so this will break your filter.



Example:

Inserts HTML:

I'm a bug



Inserts Script:

alert(\'I'm a bug\')



If you're trying to prevent XSS attacks, then your goal should be not to allow ANY HTML to render. One way to do this would be to replace the `<` and `>` symbols with their respective HTML entities, `&lt;` and `&gt;`.
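In PHP, the built-in `htmlspecialchars` function does this replacement. A minimal sketch, assuming the input is UTF-8 and will be echoed into an HTML page; the wrapper name `encodeForHtml` is made up for illustration.

```php
<?php
// Encode user input so the browser shows it as text instead of parsing it as HTML.
function encodeForHtml(string $input): string
{
    // ENT_QUOTES also encodes single and double quotes, which matters
    // when the value ends up inside an HTML attribute.
    // ENT_SUBSTITUTE replaces invalid UTF-8 sequences instead of returning ''.
    return htmlspecialchars($input, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');
}

// The angle brackets are encoded, so no <script> element reaches the page.
echo encodeForHtml('<script>alert(1)</script>'), PHP_EOL;
```

With every `<` and `>` encoded, there is no tag left for a filter to miss, so the whitelist logic is no longer the only line of defence.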

Context

StackExchange Code Review Q#15361, answer score: 3
