Golang function to clean a string of scripts

Submitted by: @import:stackexchange-codereview

Problem

I am trying to make an efficient algorithm for removing script tags from an HTML string. Can someone point out any flaws in this? This seems to be the best I could think of.

func removeScripts(s string) string {
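    // both tag literals are assumed; they are missing from the post as shown
    startingScriptTag := "<script"
    endingScriptTag := "</script>"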

    var script string

    for {
        startingScriptTagIndex := strings.Index(s, startingScriptTag)
        endingScriptTagIndex := strings.Index(s, endingScriptTag)

        if startingScriptTagIndex > -1 && endingScriptTagIndex > -1 {
            script = s[startingScriptTagIndex:endingScriptTagIndex + len(endingScriptTag)]
            s = strings.Replace(s, script, "", 1)
            continue
        }

        break
    }

    return s
}

Solution

As usual, I'd say the most reliable way to get rid of script tags in an HTML string is to use a parser. HTML is too complex to consume with standard string functions and regular expressions. It's a hierarchical language and is best processed as such. Thankfully, Go has a package for this (golang.org/x/net/html), and it makes removing script tags easy:

package main

import (
    "bytes"
    "fmt"
    "log"
    "strings"

    "golang.org/x/net/html" // go get golang.org/x/net/html
)

// htmlString is a placeholder input; substitute your own document.
const htmlString = `<p>Hello</p><script>alert(1)</script>`

func main() {
    doc, err := html.Parse(strings.NewReader(htmlString))
    if err != nil {
        log.Fatal(err)
    }
    removeScript(doc)
    var buf bytes.Buffer
    if err := html.Render(&buf, doc); err != nil {
        log.Fatal(err)
    }
    fmt.Println(buf.String())
}

func removeScript(n *html.Node) {
    // if the node is a script element, detach it from its parent
    if n.Type == html.ElementNode && n.Data == "script" {
        n.Parent.RemoveChild(n)
        return // script tag is gone...
    }
    // traverse the DOM; remember the next sibling before recursing,
    // because RemoveChild clears the removed node's sibling pointers
    for c := n.FirstChild; c != nil; {
        next := c.NextSibling
        removeScript(c)
        c = next
    }
}


The use of n.Data is not a typo BTW. The field name is a bit unfortunate, but as the doc pages state:


A Node consists of a NodeType and some Data (tag name for element nodes, content for text)

This code was not tested. It's loosely based on the parse example in the official godoc pages.
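
To make the earlier point about string functions concrete, here is a small sketch that only uses the standard library. naiveRemoveScripts is a hypothetical reconstruction of the loop from the question, and its two tag literals are assumptions because they are not visible in the post. The inputs are made up for illustration: two of them are scripts a browser would still run, and the third contains "<script" inside an attribute value.

```go
package main

import (
    "fmt"
    "strings"
)

// naiveRemoveScripts mirrors the string-based loop from the question.
// The name and both tag literals are assumptions made for this sketch.
func naiveRemoveScripts(s string) string {
    const startTag = "<script"
    const endTag = "</script>"
    for {
        start := strings.Index(s, startTag)
        end := strings.Index(s, endTag)
        if start == -1 || end == -1 {
            return s
        }
        // cut from the first start tag to the first end tag
        s = strings.Replace(s, s[start:end+len(endTag)], "", 1)
    }
}

func main() {
    inputs := []string{
        // tag names are case-insensitive in HTML, the search is not
        `<p>hi</p><SCRIPT>alert(1)</SCRIPT>`,
        // whitespace before '>' is allowed in an end tag
        `<p>hi</p><script>alert(1)</script >`,
        // "<script" inside an attribute value is treated as a tag
        `<a title="<script">x</a><script>y()</script>`,
    }
    for _, in := range inputs {
        fmt.Printf("%q -> %q\n", in, naiveRemoveScripts(in))
    }
}
```

The first two inputs come back unchanged, so the script survives, and the third loses the link text along with the script. Inputs where an end tag appears before the first start tag are left out on purpose, since the slice expression can then panic or the loop can fail to make progress.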

Although not relevant in this case, it is worth looking into the tokenizer API, too. It is a lower-level API that can help you process an HTML stream (e.g. parsing or validating a large file without building the whole tree). You can use it to check how many script tags there are, for example:

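// needs "io", "strings" and "golang.org/x/net/html"
func countScriptTags(htmlString string) (int, error) {
    z := html.NewTokenizer(strings.NewReader(htmlString))
    tagCount := 0
    for {
        switch z.Next() {
        case html.ErrorToken:
            // io.EOF marks the normal end of the input
            if err := z.Err(); err != io.EOF {
                return tagCount, err
            }
            return tagCount, nil
        // case html.TextToken: ignore, we don't need it here
        case html.StartTagToken: // ignore end tags to make life easier
            tn, _ := z.TagName()
            if string(tn) == "script" {
                tagCount++
            }
        }
    }
}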


Do with it as you like. Again, in this case, I don't think there's any reason to use the tokenizer, unless you want to manually write all tags that aren't script tags to a separate buffer and process them some more. Just thought it worth mentioning here...

Context

StackExchange Code Review Q#161461, answer score: 5
