HiveBrain v1.2.0
Get Started
← Back to all entries
patterncsharpMinor

Cleaning up HTML created by users

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
createduserscleaninghtml

Problem

I have a system that accepts free text from users. This can either be plain text or a heavily limited subset of HTML. I've written the below with Html Agility Pack and Microsoft's AntiXss library. I have had to remove a few things from the below code which do things like whitespace normalization, punctuation normalization etc. as I know they're already fine.

There's some additional functionality which aims to delete some common html patterns (from pasting from MS Word) which include pointlessly nested spans (A word) and

.

Expected use case:

HtmlUtility.SanitizeResponse("An html response with bad stuff  EVIL SCRIPT  ")
// An html response with bad stuff EVIL SCRIPT  


I want to retain as much of the original text as I can.

```
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;
using HtmlAgilityPack;
using Encoder = Microsoft.Security.Application.Encoder;

public static class HtmlUtility
{
public static HashSet WhiteSpaceSignificantAttributes = new HashSet
{ "class", "alt", "title", "style" };

private static readonly Dictionary AllowedResponseHtml =
new Dictionary
{
["p"] = new[] { "class" },
["a"] = new[] { "href", "title", "class" },
["strong"] = new string[0],
["em"] = new string[0],
["span"] = new string[0],
["br"] = new string[0]
};

private static readonly HashSet BlockElements = new HashSet(StringComparer.OrdinalIgnoreCase)
{
"address", "article", "aside", "blockquote", "canvas", "dd", "div", "dl", "fieldset", "figcaption",
"figure", "footer", "form", "h1", "h2", "h3", "h4", "h5", "h6", "header", "hgroup", "hr", "main",
"nav", "noscript", "ol", "output", "p", "pre", "section", "table", "tfoot", "ul", "video"
};

private static readonly HashSet FlowContentElements = new HashSet(S

Solution

Magic String/Lists

Personally I'd reduce away those lists of strings that're really just configuration settings. e.g. FlowContentElements could become something like this:

private static List FlowContentElements
{
    get
    {
        if (FlowContentElementsCache == null)
        {
            // Load list from xml or some source here.
        }

        return FlowContentElementsCache;
    }
}
private static List FlowContentElementsCache;


This way on first use it'll cache the list and then from there on out you will have the list already loaded. This way if you were to ever change what is white-listed or want to use this in another project it's already setup to be re-configured.

Pointless variables

In many locations you have something like this: var validTag = tag; where you don't actually need the variable. The places validTag and tag are used are as follows:

var attributes = rootNode
    .DescendantsAndSelf()
    .Where(n => n.Name == validTag.Key)
    .Where(n => n.HasAttributes)
    .SelectMany(n => n.Attributes.ToArray());
    
if (!validTag.Value.Contains(attribute.Name))
{
    attribute.Remove();
    continue;
}


Neither of which need the variable re-declared. So you very much could just use tag.

var attributes = rootNode
    .DescendantsAndSelf()
    .Where(n => n.Name == tag.Key)
    .Where(n => n.HasAttributes)
    .SelectMany(n => n.Attributes.ToArray());
    
if (!tag.Value.Contains(attribute.Name))
{
    attribute.Remove();
    continue;
}


Again,

private static HtmlNode GetRootNode(string response)
{
    var html = GetHtml(response);

    var rootNode = html?.DocumentNode;
    return rootNode;
}


could be

private static HtmlNode GetRootNode(string response)
{
    var html = GetHtml(response);
    return html?.DocumentNode;
}


If you are using the variables for debugging purposes you can type "result" into the watch window and it'll display a result (reference here, VS2013+).

Simplify

if (rootNode.Name == "span" && (rootNode.ChildNodes.All(n => n.NodeType == HtmlNodeType.Text) || (rootNode.ChildNodes.Count == 1 && rootNode.ChildNodes.All(n => n.Name == "span"))))
{
    // code from `RemoveSuperfluousSpans`
}


Any way you format this, it's ugly. Break these large or even just hard to read boolean expressions down to much more managable and readable expressions like this: (naming might be wrong but you get the point)

private static bool IsSpan(HtmlNode node)
{
    return node.Name == "span";
}

private static bool IsSuperfuousSpan(HtmlNode rootNode)
{
    return HasOnlyTextChildren(rootNode) || HasOnlyASpanChild(rootNode);
}

private static bool HasOnlyTextChildren(HtmlNode node)
{
    return node.ChildNodes.All(n => n.NodeType == HtmlNodeType.Text);
}

private static bool HasOnlyASpanChild(HtmlNode node)
{
    return node.ChildNodes.Count == 1 && node.ChildNodes.All(n => n.Name == "span");
}


Then your if-statement becomes:

if (IsSpan(rootNode) && IsSuperfuousSpan(rootNode))
{
    // code from `RemoveSuperfluousSpans`
}

Code Snippets

private static List<string> FlowContentElements
{
    get
    {
        if (FlowContentElementsCache == null)
        {
            // Load list from xml or some source here.
        }

        return FlowContentElementsCache;
    }
}
private static List<string> FlowContentElementsCache;
var attributes = rootNode
    .DescendantsAndSelf()
    .Where(n => n.Name == validTag.Key)
    .Where(n => n.HasAttributes)
    .SelectMany(n => n.Attributes.ToArray());
    
if (!validTag.Value.Contains(attribute.Name))
{
    attribute.Remove();
    continue;
}
var attributes = rootNode
    .DescendantsAndSelf()
    .Where(n => n.Name == tag.Key)
    .Where(n => n.HasAttributes)
    .SelectMany(n => n.Attributes.ToArray());
    
if (!tag.Value.Contains(attribute.Name))
{
    attribute.Remove();
    continue;
}
private static HtmlNode GetRootNode(string response)
{
    var html = GetHtml(response);

    var rootNode = html?.DocumentNode;
    return rootNode;
}
private static HtmlNode GetRootNode(string response)
{
    var html = GetHtml(response);
    return html?.DocumentNode;
}

Context

StackExchange Code Review Q#102054, answer score: 2

Revisions (0)

No revisions yet.