HiveBrain v1.2.0
Get Started
← Back to all entries
patterncsharpMinor

Reference implementation of HTML-to-JSON converter for PCG

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
referencepcgforjsonimplementationhtmlconverter

Problem

I have opened a puzzle at Programming Puzzles and Code Golf. The question is not perceived well (yet) and one of the reasons is that no reference implementation is provided. To mitigate this shortcoming, I have implemented a (non-golfed and documented) reference solution.

I'd like your feedback about

  • Correctness according the PCG puzzle specification



  • Comprehensibility of the code, e.g. feedback about the documentation



If you can't review everything, please focus on the following points which IMHO could need some attention:

  • ParseHtmlToObjects(), which is a bit too long for my taste



  • AddProperty(), which has a ref parameter and causes a side effect. If possible, please suggest a better solution, ideally with a code snippet.



I'd not like feedback about:

  • Performance, since it is ok with the largest real world test file I have



  • Converting loops to LINQ, since I simply prefer the loops



  • Other nifty C# 10.5 lambda inline tricks etc. that nobody can read 6 months from now



The environment:

  • Implementation language is C# 4.5



  • Input is HTML, output is JSON



  • Libraries: HtmlAgilityPack, ExCSS and JSON.NET



  • Implemented with R# support. The code is almost green with warnings about "Possible multiple enumeration" inside Convert(). Maybe someone can elaborate how critical that is.



The task:

  • Read some awkward and verbose HTML document, extract the relevant information based on various indicators (HTML elements, attributes and CSS style information)



  • Output valid JSON which contains the relevant information



For more details please see the question at PCG.

The major part of the code:

  • Entry point is the constructor HtmlToJsonConverter()



  • Convert() method



I don't provide CustomJsonConverter here, since that is straightforward.

```
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Globalization;
using System.IO;
using System.Linq;
using ExCSS;
using HtmlAgilityPack;
using Newtonsoft.Json;

na

Solution

With the IsKey() method I go with @mjolka and also you prefer loops I will suggest to use Any() like

private static bool IsKey(HtmlNode span)
{
    var styleSheet = ExtractStyle(span);
    bool isKey = styleSheet.StyleRules[0].Declarations
                      .Any(cssAttribute.Name == "font-weight" && cssAttribute.Term.ToString() == "bold");

    return isKey;
}


because it really is the same and it is more obvious what is meant.

In the GetIndentationFromNode() method, there should be only one cssAttribute with the name left so you should break out of the loop after you have found the value.

You should let your variables some space to breathe. Instead of e.g decimal thisIndent=0; you better should write decimal thisIndent = 0;, because this will make your code more readable.

In the ParseHtmlToObjects() method there is no need to call ToList() on the IEnumerable<> here page.Descendants().Where(x => (x.Name == "span")).ToList();.

Always code against interfaces if the implementation isn't needed.

If step isn't a key you can by adding a continue; omit the else and therefor save horizontal spacing.

By removing the if (level == currentObject.level) condition you can shorten the code.

By ectracting the analyzing of the Descendants of the current page this method can be shortened and be more readable.

By extracting the while loop to equalize the level to a meaningful method, you can omit the comment also this is only a cosmetic change.

The check if (key != null) could be omitted, because if key == null the AddProperty() method just returns the passed in DataObject.

Applying these points lead to

private static DataObject ParseHtmlToObjects(IEnumerable pages)
{
    var rootObject = new DataObject();
    var currentObject = rootObject;

    foreach (var page in pages)
    {
        var steps = page.Descendants().Where(x => (x.Name == "span")).ToList();
        currentObject = AnalyzeSpanTags(steps, currentObject);
    }

    return rootObject;
}

private static DataObject AnalyzeSpanTags(IEnumerable steps, DataObject currentObject)
{
    string key = null;
    foreach (var step in steps)
    {
        if (!IsKey(step))
        {
            // If this is not a key, the key was detected before. Use it to populate the object
            currentObject = AddProperty(currentObject, key, GetTextFromSpan(step));
            key = null;
            continue;
        }

        // Special case: Maybe we detected a new key, although the old key has not been used as property yet
        // This can happen for keys without value, so add it empty.
        currentObject = AddProperty(currentObject, key, "");

        key = GetKeyFromNode(step);
        var level = GetIndentationFromNode(step);

        if (level > currentObject.level)
        {
            // Decend to lower level: create a new child
            var child = new DataObject { level = level, Parent = currentObject };
            currentObject.Children.Add(child);
            currentObject = child;
        }
        else
        {
            currentObject = EqualizeLevel(currentObject, level);
        }
    }
    return currentObject;
}

private static DataObject EqualizeLevel(DataObject obj, decimal level)
{
    while (level < obj.level)
    {
        obj = obj.Parent;
    }
    return obj;
}


Speaking about comments. Comments should describe why something is done. You should let the code speak for itself about what is done.

So comments like

// Go through all pages
foreach (var page in pages)


are superfluous, because they don't add any value.

By introducing a GetAddedSiblingIfKeyExists() method (the name is not that optimal, but I couldn't come up with a better one) like

private static DataObject GetAddedSiblingIfKeyExists(DataObject obj, string key)
{
    if (key == null || !obj.Properties.ContainsKey(key)) { return obj; }

    var sibling = new DataObject { level = obj.level, Parent = obj.Parent };
    obj.Parent.Children.Add(sibling);
    return sibling;
}


the AddProperty() method would result in

private static DataObject AddProperty(DataObject obj, string key, string value)
{
    // Special case:  which contains the page information. Skip it.
    if (key == null) return obj;

    obj.Properties.Add(key, value);
    return obj;
}


which would change the AnalyzeSpanTags() method like

private static DataObject AnalyzeSpanTags(IEnumerable steps, DataObject currentObject)
{
    string key = null;
    foreach (var step in steps)
    {
        currentObject = GetAddedSiblingIfKeyExists(currentObject, key);

        if (!IsKey(step))
        {
        .....


maybe GetAddedSiblingForExistingKey() would be a little bit more meaningful.

/// 
    /// Converts the HTML input file into JSON and writes the output file
    /// 
    public void Convert()


the xml documentation clearly states that this method is d

Code Snippets

private static bool IsKey(HtmlNode span)
{
    var styleSheet = ExtractStyle(span);
    bool isKey = styleSheet.StyleRules[0].Declarations
                      .Any(cssAttribute.Name == "font-weight" && cssAttribute.Term.ToString() == "bold");

    return isKey;
}
private static DataObject ParseHtmlToObjects(IEnumerable<HtmlNode> pages)
{
    var rootObject = new DataObject();
    var currentObject = rootObject;

    foreach (var page in pages)
    {
        var steps = page.Descendants().Where(x => (x.Name == "span")).ToList();
        currentObject = AnalyzeSpanTags(steps, currentObject);
    }

    return rootObject;
}

private static DataObject AnalyzeSpanTags(IEnumerable<HtmlNode> steps, DataObject currentObject)
{
    string key = null;
    foreach (var step in steps)
    {
        if (!IsKey(step))
        {
            // If this is not a key, the key was detected before. Use it to populate the object
            currentObject = AddProperty(currentObject, key, GetTextFromSpan(step));
            key = null;
            continue;
        }

        // Special case: Maybe we detected a new key, although the old key has not been used as property yet
        // This can happen for keys without value, so add it empty.
        currentObject = AddProperty(currentObject, key, "");

        key = GetKeyFromNode(step);
        var level = GetIndentationFromNode(step);

        if (level > currentObject.level)
        {
            // Decend to lower level: create a new child
            var child = new DataObject { level = level, Parent = currentObject };
            currentObject.Children.Add(child);
            currentObject = child;
        }
        else
        {
            currentObject = EqualizeLevel(currentObject, level);
        }
    }
    return currentObject;
}

private static DataObject EqualizeLevel(DataObject obj, decimal level)
{
    while (level < obj.level)
    {
        obj = obj.Parent;
    }
    return obj;
}
// Go through all pages
foreach (var page in pages)
private static DataObject GetAddedSiblingIfKeyExists(DataObject obj, string key)
{
    if (key == null || !obj.Properties.ContainsKey(key)) { return obj; }

    var sibling = new DataObject { level = obj.level, Parent = obj.Parent };
    obj.Parent.Children.Add(sibling);
    return sibling;
}
private static DataObject AddProperty(DataObject obj, string key, string value)
{
    // Special case: <Span> which contains the page information. Skip it.
    if (key == null) return obj;

    obj.Properties.Add(key, value);
    return obj;
}

Context

StackExchange Code Review Q#83233, answer score: 5

Revisions (0)

No revisions yet.