HiveBrain v1.2.0
Get Started
← Back to all entries
patternMinor

Downloading HTML documents from the web

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
thedownloadingwebfromdocumentshtml

Problem

In this project I'm working on, I need to extract articles text from their original HTML document. This class, HtmlConnection, receives the URL of the article, and eventually it needs to produce a collection of the paragraphs inside the article. I'm using HTML agility pack and XPath to extract only relevent text from the article, "removing" irrelevant text that comes alongside in the HTML, such as JavaScript, etc. Notice that this class does not produce the final text of the article (another class deals with it), but rather a HtmlNodeCollection that consist of all the paragraphs in the article.

There is one main issue in the code: It's too slow.

I did some test, and came up with these numbers:

  • Number of articles downloaded: 25



  • Average download time: 4958 Miliseconds



This is too much. As you can see, just 25 articles take about 2 minutes. And I plan to download hundreds of articles per run. It could be a problem with my internet connection, but when I'm surfing normally, it's pretty fast and clean.

```
Imports HtmlAgilityPack
Imports System.Text.RegularExpressions
Imports System.Net

'''
''' Represents a single Html document.
'''
Public Class HtmlConnection

' XPath for all the paragraphs inside the body.
Private Const BodyPath As String = "//body//p"

' RegEx for a single word.
Private Const WordPath As String = "[a-zA-Z]+"

'''
''' Constructor to initialize Url property
''' and to call the DownloadHtml sub.
'''
'''
''' The Url of the current article.
'''
Public Sub New(ByVal url As String)

Me.Url = url
DownloadHtml()

End Sub ' Constructor

'''
''' Represents the Url of the current article.
'''
Private Property Url As String
'''
''' Represents all the Html code
''' received from the article.
'''
Private Property FullHtml As HtmlDocument

Private _BodyHtml As HtmlNodeCollection
'''
''' Represents the Html of all the paragraph

Solution

Let's break down your code and look at what you're doing wrong/right and what can be improved.

I - Class

Public Class HtmlConnection


The name of your class is quite misleading as it's not a connection object as per se. It's an object which contains html. The underlying HttpClient used in HAP is closer to being (if not is) a html connector. So we'll rename the class so that it reflects what it is/represents, a html article.

Public Class HtmlArticle


II - Constants

Private Const BodyPath As String = "//body//p"
Private Const WordPath As String = "[a-zA-Z]+"


This is good! You're using constants instead of magic strings/numbers. Nothing to change here except that I'll introduce a new constant.

Private Const MinLength As Integer = 10


III - Fields

Private _BodyHtml As HtmlNodeCollection


Fields should be placed at the top and written in lowerCamelCase. It's also a bad habit to start a member name with an underscore as this makes your code non CLS compliant if the member is anything other than private.

We'll rename the member and introduce a new one. I'll explain why later.

Private m_url As String
Private m_paragraphs As HtmlNodeCollection


IV - Constructors

Public Sub New(ByVal url As String)
    Me.Url = url
    DownloadHtml()
End Sub


You're doing a very big mistake in the constructor. It's doing too much work. Constructors should be as light as possible. The html should only be downloaded when needed, when you decide to invoke DownloadHtml. Same logic applies for the SqlConnection class. It doesn't invoke Openin the constructor. You have to do this as a separate call.

Your casing is correct but you can remove the ByVal keyword as this is default by design. I'll introduce a new parameter and make the constructor private. More about that later.

Private Sub New(url As String, paragraphs As HtmlNodeCollection)
    Me.m_url = url
    Me.m_paragraphs = paragraphs
End Sub


V - Properties

Private Property Url As String

Private Property FullHtml As HtmlDocument

Public Property BodyHtml As HtmlNodeCollection
    Get
        Return _BodyHtml
    End Get
    Set(value As HtmlNodeCollection)

        Dim WordsMatches As MatchCollection
        _BodyHtml = value

        For Paragraph As Integer = value.Count - 1 To 0 Step -1
            WordsMatches = Regex.Matches(value.Item(Paragraph).InnerText, WordPath)
            If WordsMatches.Count < 10 Then
                _BodyHtml.RemoveAt(Paragraph)
            End If
        Next

    End Set
End Property


Private auto-implemented get-set properties should always be turned into fields.

You should avoid heavy code in properties. Properties should mainly be used to get and set the value of a backing field, not processing data. The code should be moved to the DownloadHtml method.

The expression WordsMatches.Count < 10 contains a magic number (10) which should be turned into a constant (ref. the beginning of the review).

I don't see any real use of the FullHtml property other than storing a reference so we'll change the scope and remove it.

The name of the BodyHtml property is misleading. It's not a html body. It contains our paragraphs so we'll name it accordingly. Same reason why we changed the name of the backing field earlier.

The name of the Url property is good so we'll keep that.

Since the backing fields of the properties are provided in the constructor we'll remove the setters and mark the properties read only.

Public ReadOnly Property Paragraphs As HtmlNodeCollection
    Get
        Return Me.m_paragraphs
    End Get
End Property

Public ReadOnly Property Url As String
    Get
        Return Me.m_url
    End Get
End Property


VI - Methods

Private Sub DownloadHtml()

    Dim HtmlWeb As HtmlWeb = New HtmlWeb

    FullHtml = New HtmlDocument

    _BodyHtml = New HtmlNodeCollection(FullHtml.DocumentNode)

    FullHtml = (HtmlWeb.Load(Url))
    FullHtml.OptionFixNestedTags = True

    BodyHtml = FullHtml.DocumentNode.SelectNodes(BodyPath)

End Sub


This method should be public and it should do all the heavy work. I also suggest you make it static (shared) and return an instance of our class based on downloaded data. Doing this makes it obvious why we made the constructor private and the properties read only.

```
Public Shared Function Download(url As String) As HtmlArticle

If (String.IsNullOrWhiteSpace(url)) Then
Throw New ArgumentNullException(NameOf(url))
End If

Dim web As New HtmlWeb()
Dim document As HtmlDocument = web.Load(url)

document.OptionFixNestedTags = True

Dim paragraphs As HtmlNodeCollection = document.DocumentNode.SelectNodes(HtmlArticle.BodyPath)

For index As Integer = (paragraphs.Count - 1) To 0 Step -1
If (Regex.Matches(paragraphs.Item(index).InnerText, HtmlArticle.WordPath).Count < HtmlArticle.MinLength) Then
paragraphs.RemoveAt(index)
End If
Next

Ret

Code Snippets

Public Class HtmlConnection
Public Class HtmlArticle
Private Const BodyPath As String = "//body//p"
Private Const WordPath As String = "[a-zA-Z]+"
Private Const MinLength As Integer = 10
Private _BodyHtml As HtmlNodeCollection

Context

StackExchange Code Review Q#108144, answer score: 3

Revisions (0)

No revisions yet.