
Extracting text fields from <span> tags in an HTML message


Problem

What I'm doing

I have a string with HTML content like this:

<p><span class="fieldText" fieldId="field-4">Some text</span> this is a test</p>


The goal of the method is to build a dictionary with this entry:

**key**     **value**
field-4    Some text


This is the code that I'm using to accomplish my task:

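public static Dictionary<int, string> getFields(String mensaje)
    {
        Dictionary<int, string> fields = new Dictionary<int, string>();
        // NOTE: the tag text inside both patterns and the loop header did not survive in this copy;
        // they are reconstructed from the sample input and may differ from the original.
        Match m = Regex.Match(mensaje, @"^(.*?<span .*?>(.*?)</span>.*?)+$", RegexOptions.Singleline);
        for (int i = 0; i < m.Groups[2].Captures.Count; i++)
        {
            Match m2 = Regex.Match(m.Groups[1].Captures[i].Value, @"^(.*?<span .*?fieldId=""(.*?)"".*?>.*?)+$", RegexOptions.Singleline);
            String fieldId = m2.Groups[2].Captures[0].Value;
            fieldId = fieldId.Replace("field-", String.Empty);
            fields.Add(int.Parse(fieldId), m.Groups[2].Captures[i].Value);
        }

        return fields;
    }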


How can I improve my code?

Solution

I know this is Code Review, not Rewrite My Code, but I would suggest using a third-party HTML parser (the Html Agility Pack, for example) rather than regular expressions, if that's an option.

I realize you're doing very trivial parsing here, but in my experience regular expressions become unmaintainable faster than almost anything else in software development.
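If a third-party parser is not an option, the regex version can at least be flattened. The following is only a sketch under stated assumptions: the name GetFields, the named groups id and text, and the pattern itself are illustrative and not from the question. It assumes fieldId="field-N" sits inside the opening span tag and that spans are never nested. It uses one Regex.Matches pass in place of a repeated capture group plus a second regex per capture.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper (not from the question): one Regex.Matches pass
// instead of a repeated group plus a second regex for every capture.
static Dictionary<int, string> GetFields(string message)
{
    // Assumes fieldId="field-N" appears inside the opening <span> tag
    // and that <span> elements are not nested.
    var pattern = new Regex(
        @"<span\b[^>]*\bfieldId=""field-(?<id>\d+)""[^>]*>(?<text>.*?)</span>",
        RegexOptions.Singleline | RegexOptions.IgnoreCase);

    var fields = new Dictionary<int, string>();
    foreach (Match match in pattern.Matches(message))
    {
        // Indexer assignment keeps the last value if an id repeats.
        fields[int.Parse(match.Groups["id"].Value)] = match.Groups["text"].Value;
    }
    return fields;
}

var result = GetFields(
    "<p><span class=\"fieldText\" fieldId=\"field-4\">Some text</span> this is a test</p>");
Console.WriteLine(result[4]); // Some text
```

Named groups make each match self-describing, so the index bookkeeping over Groups[n].Captures[i] goes away. The caveat about regexes against real-world HTML still applies.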

If you were to use an HTML parser, you could do something like this:

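using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

string htmlToParse = "<p><span class=\"fieldText\" fieldId=\"field-4\">Some text</span> this is a test</p><p><span class=\"fieldText\" fieldId=\"field-5\">Some more text</span> this is another test</p>";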
const string ElementToParse = "span";
const string IdField = "FieldId";

HtmlDocument htmlDocument = new HtmlDocument();
htmlDocument.LoadHtml(htmlToParse);

int fieldId = default( int );

Dictionary<int,string> fieldValuesTable = 
(
    from
        htmlNode in htmlDocument.DocumentNode.DescendantNodes()
    where
        htmlNode.Name.Equals( ElementToParse, StringComparison.InvariantCultureIgnoreCase )
        &&
        htmlNode.Attributes.Contains( IdField )
    let
        id = htmlNode.Attributes[ IdField ].Value
    where
        Int32.TryParse( id.Substring( id.IndexOf( "-" ) + 1 ), out fieldId ) // this is still not ideal: the query writes to the outer fieldId variable
    select
        new { Id = fieldId, Text = htmlNode.InnerText }
).ToDictionary( f => f.Id, f => f.Text );


You get the output:

4 : Some text
5 : Some more text


IMHO, it's much cleaner and more maintainable.
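The code comment above flags the out variable as not ideal: the query writes to a local declared outside it. One possible way around that, sketched here with hypothetical names (ParseFieldId, spans), is to move the parsing into a helper that returns a nullable int. To stay self-contained, the sketch uses plain tuples as a stand-in for the attribute value and InnerText of each parsed node. It is not tied to any particular parser API.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: returns the number after the first '-' in a value
// such as "field-4", or null when that part is not a valid integer.
static int? ParseFieldId(string rawId)
{
    string suffix = rawId.Substring(rawId.IndexOf('-') + 1);
    return int.TryParse(suffix, out int parsed) ? parsed : (int?)null;
}

// Stand-in for the (fieldId attribute, InnerText) pairs read from parsed nodes.
var spans = new[]
{
    (RawId: "field-4", Text: "Some text"),
    (RawId: "field-5", Text: "Some more text"),
    (RawId: "no-number", Text: "skipped"),
};

Dictionary<int, string> fieldValuesTable =
(
    from span in spans
    let id = ParseFieldId(span.RawId)
    where id.HasValue          // drop entries whose id could not be parsed
    select new { Id = id.Value, span.Text }
).ToDictionary(f => f.Id, f => f.Text);

foreach (var pair in fieldValuesTable)
    Console.WriteLine($"{pair.Key} : {pair.Value}");
```

With this shape the query has no side effects on outer state. Note that ToDictionary throws an ArgumentException if two entries produce the same key, so duplicate field ids would need to be handled separately.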

Code Snippets

using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

string htmlToParse = "<p><span class=\"fieldText\" fieldId=\"field-4\">Some text</span> this is a test</p><p><span class=\"fieldText\" fieldId=\"field-5\">Some more text</span> this is another test</p>";
const string ElementToParse = "span";
const string IdField = "FieldId";

HtmlDocument htmlDocument = new HtmlDocument();
htmlDocument.LoadHtml(htmlToParse);

int fieldId = default( int );

Dictionary<int,string> fieldValuesTable = 
(
    from
        htmlNode in htmlDocument.DocumentNode.DescendantNodes()
    where
        htmlNode.Name.Equals( ElementToParse, StringComparison.InvariantCultureIgnoreCase )
        &&
        htmlNode.Attributes.Contains( IdField )
    let
        id = htmlNode.Attributes[ IdField ].Value
    where
        Int32.TryParse( id.Substring( id.IndexOf( "-" ) + 1 ), out fieldId ) // this is still not ideal: the query writes to the outer fieldId variable
    select
        new { Id = fieldId, Text = htmlNode.InnerText }
).ToDictionary( f => f.Id, f => f.Text );

4 : Some text
5 : Some more text

Context

StackExchange Code Review Q#3547, answer score: 9
