patterncsharpMinor
Reference implementation of HTML-to-JSON converter for PCG
Viewed 0 times
referencepcgforjsonimplementationhtmlconverter
Problem
I have opened a puzzle at Programming Puzzles and Code Golf. The question is not perceived well (yet) and one of the reasons is that no reference implementation is provided. To mitigate this shortcoming, I have implemented a (non-golfed and documented) reference solution.
I'd like your feedback about
If you can't review everything, please focus on the following points which IMHO could need some attention:
I'd not like feedback about:
The environment:
The task:
For more details please see the question at PCG.
The major part of the code:
I don't provide
```
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Globalization;
using System.IO;
using System.Linq;
using ExCSS;
using HtmlAgilityPack;
using Newtonsoft.Json;
na
I'd like your feedback about
- Correctness according the PCG puzzle specification
- Comprehensibility of the code, e.g. feedback about the documentation
If you can't review everything, please focus on the following points which IMHO could need some attention:
ParseHtmlToObjects(), which is a bit too long for my taste
AddProperty(), which has arefparameter and causes a side effect. If possible, please suggest a better solution, ideally with a code snippet.
I'd not like feedback about:
- Performance, since it is ok with the largest real world test file I have
- Converting loops to LINQ, since I simply prefer the loops
- Other nifty C# 10.5 lambda inline tricks etc. that nobody can read 6 months from now
The environment:
- Implementation language is C# 4.5
- Input is HTML, output is JSON
- Libraries: HtmlAgilityPack, ExCSS and JSON.NET
- Implemented with R# support. The code is almost green with warnings about "Possible multiple enumeration" inside
Convert(). Maybe someone can elaborate how critical that is.
The task:
- Read some awkward and verbose HTML document, extract the relevant information based on various indicators (HTML elements, attributes and CSS style information)
- Output valid JSON which contains the relevant information
For more details please see the question at PCG.
The major part of the code:
- Entry point is the constructor
HtmlToJsonConverter()
Convert()method
I don't provide
CustomJsonConverter here, since that is straightforward.```
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Globalization;
using System.IO;
using System.Linq;
using ExCSS;
using HtmlAgilityPack;
using Newtonsoft.Json;
na
Solution
With the
because it really is the same and it is more obvious what is meant.
In the
You should let your variables some space to breathe. Instead of e.g
In the
Always code against interfaces if the implementation isn't needed.
If
By removing the
By ectracting the analyzing of the
By extracting the
The check
Applying these points lead to
Speaking about comments. Comments should describe why something is done. You should let the code speak for itself about what is done.
So comments like
are superfluous, because they don't add any value.
By introducing a
the
which would change the
maybe
the xml documentation clearly states that this method is d
IsKey() method I go with @mjolka and also you prefer loops I will suggest to use Any() like private static bool IsKey(HtmlNode span)
{
var styleSheet = ExtractStyle(span);
bool isKey = styleSheet.StyleRules[0].Declarations
.Any(cssAttribute.Name == "font-weight" && cssAttribute.Term.ToString() == "bold");
return isKey;
}because it really is the same and it is more obvious what is meant.
In the
GetIndentationFromNode() method, there should be only one cssAttribute with the name left so you should break out of the loop after you have found the value. You should let your variables some space to breathe. Instead of e.g
decimal thisIndent=0; you better should write decimal thisIndent = 0;, because this will make your code more readable. In the
ParseHtmlToObjects() method there is no need to call ToList() on the IEnumerable<> here page.Descendants().Where(x => (x.Name == "span")).ToList();. Always code against interfaces if the implementation isn't needed.
If
step isn't a key you can by adding a continue; omit the else and therefor save horizontal spacing. By removing the
if (level == currentObject.level) condition you can shorten the code. By ectracting the analyzing of the
Descendants of the current page this method can be shortened and be more readable. By extracting the
while loop to equalize the level to a meaningful method, you can omit the comment also this is only a cosmetic change. The check
if (key != null) could be omitted, because if key == null the AddProperty() method just returns the passed in DataObject. Applying these points lead to
private static DataObject ParseHtmlToObjects(IEnumerable pages)
{
var rootObject = new DataObject();
var currentObject = rootObject;
foreach (var page in pages)
{
var steps = page.Descendants().Where(x => (x.Name == "span")).ToList();
currentObject = AnalyzeSpanTags(steps, currentObject);
}
return rootObject;
}
private static DataObject AnalyzeSpanTags(IEnumerable steps, DataObject currentObject)
{
string key = null;
foreach (var step in steps)
{
if (!IsKey(step))
{
// If this is not a key, the key was detected before. Use it to populate the object
currentObject = AddProperty(currentObject, key, GetTextFromSpan(step));
key = null;
continue;
}
// Special case: Maybe we detected a new key, although the old key has not been used as property yet
// This can happen for keys without value, so add it empty.
currentObject = AddProperty(currentObject, key, "");
key = GetKeyFromNode(step);
var level = GetIndentationFromNode(step);
if (level > currentObject.level)
{
// Decend to lower level: create a new child
var child = new DataObject { level = level, Parent = currentObject };
currentObject.Children.Add(child);
currentObject = child;
}
else
{
currentObject = EqualizeLevel(currentObject, level);
}
}
return currentObject;
}
private static DataObject EqualizeLevel(DataObject obj, decimal level)
{
while (level < obj.level)
{
obj = obj.Parent;
}
return obj;
}Speaking about comments. Comments should describe why something is done. You should let the code speak for itself about what is done.
So comments like
// Go through all pages
foreach (var page in pages)are superfluous, because they don't add any value.
By introducing a
GetAddedSiblingIfKeyExists() method (the name is not that optimal, but I couldn't come up with a better one) like private static DataObject GetAddedSiblingIfKeyExists(DataObject obj, string key)
{
if (key == null || !obj.Properties.ContainsKey(key)) { return obj; }
var sibling = new DataObject { level = obj.level, Parent = obj.Parent };
obj.Parent.Children.Add(sibling);
return sibling;
}the
AddProperty() method would result in private static DataObject AddProperty(DataObject obj, string key, string value)
{
// Special case: which contains the page information. Skip it.
if (key == null) return obj;
obj.Properties.Add(key, value);
return obj;
}which would change the
AnalyzeSpanTags() method like private static DataObject AnalyzeSpanTags(IEnumerable steps, DataObject currentObject)
{
string key = null;
foreach (var step in steps)
{
currentObject = GetAddedSiblingIfKeyExists(currentObject, key);
if (!IsKey(step))
{
.....maybe
GetAddedSiblingForExistingKey() would be a little bit more meaningful. ///
/// Converts the HTML input file into JSON and writes the output file
///
public void Convert()the xml documentation clearly states that this method is d
Code Snippets
private static bool IsKey(HtmlNode span)
{
var styleSheet = ExtractStyle(span);
bool isKey = styleSheet.StyleRules[0].Declarations
.Any(cssAttribute.Name == "font-weight" && cssAttribute.Term.ToString() == "bold");
return isKey;
}private static DataObject ParseHtmlToObjects(IEnumerable<HtmlNode> pages)
{
var rootObject = new DataObject();
var currentObject = rootObject;
foreach (var page in pages)
{
var steps = page.Descendants().Where(x => (x.Name == "span")).ToList();
currentObject = AnalyzeSpanTags(steps, currentObject);
}
return rootObject;
}
private static DataObject AnalyzeSpanTags(IEnumerable<HtmlNode> steps, DataObject currentObject)
{
string key = null;
foreach (var step in steps)
{
if (!IsKey(step))
{
// If this is not a key, the key was detected before. Use it to populate the object
currentObject = AddProperty(currentObject, key, GetTextFromSpan(step));
key = null;
continue;
}
// Special case: Maybe we detected a new key, although the old key has not been used as property yet
// This can happen for keys without value, so add it empty.
currentObject = AddProperty(currentObject, key, "");
key = GetKeyFromNode(step);
var level = GetIndentationFromNode(step);
if (level > currentObject.level)
{
// Decend to lower level: create a new child
var child = new DataObject { level = level, Parent = currentObject };
currentObject.Children.Add(child);
currentObject = child;
}
else
{
currentObject = EqualizeLevel(currentObject, level);
}
}
return currentObject;
}
private static DataObject EqualizeLevel(DataObject obj, decimal level)
{
while (level < obj.level)
{
obj = obj.Parent;
}
return obj;
}// Go through all pages
foreach (var page in pages)private static DataObject GetAddedSiblingIfKeyExists(DataObject obj, string key)
{
if (key == null || !obj.Properties.ContainsKey(key)) { return obj; }
var sibling = new DataObject { level = obj.level, Parent = obj.Parent };
obj.Parent.Children.Add(sibling);
return sibling;
}private static DataObject AddProperty(DataObject obj, string key, string value)
{
// Special case: <Span> which contains the page information. Skip it.
if (key == null) return obj;
obj.Properties.Add(key, value);
return obj;
}Context
StackExchange Code Review Q#83233, answer score: 5
Revisions (0)
No revisions yet.