Downloading HTML documents from the web
Problem
In this project I'm working on, I need to extract articles' text from their original HTML documents. This class, HtmlConnection, receives the URL of an article and eventually needs to produce a collection of the paragraphs inside it. I'm using HTML Agility Pack and XPath to extract only the relevant text from the article, "removing" the irrelevant content that comes alongside it in the HTML, such as JavaScript. Note that this class does not produce the final text of the article (another class deals with that), but rather an HtmlNodeCollection consisting of all the paragraphs in the article.

There is one main issue in the code: it's too slow.

I did some tests and came up with these numbers:

- Number of articles downloaded: 25
- Average download time: 4958 milliseconds

This is too much. As you can see, just 25 articles take about 2 minutes, and I plan to download hundreds of articles per run. It could be a problem with my internet connection, but when I'm surfing normally it's pretty fast and clean.
```
Imports HtmlAgilityPack
Imports System.Text.RegularExpressions
Imports System.Net

''' <summary>
''' Represents a single Html document.
''' </summary>
Public Class HtmlConnection

    ' XPath for all the paragraphs inside the body.
    Private Const BodyPath As String = "//body//p"
    ' RegEx for a single word.
    Private Const WordPath As String = "[a-zA-Z]+"

    ''' <summary>
    ''' Constructor to initialize Url property
    ''' and to call the DownloadHtml sub.
    ''' </summary>
    ''' <param name="url">
    ''' The Url of the current article.
    ''' </param>
    Public Sub New(ByVal url As String)
        Me.Url = url
        DownloadHtml()
    End Sub ' Constructor

    ''' <summary>
    ''' Represents the Url of the current article.
    ''' </summary>
    Private Property Url As String

    ''' <summary>
    ''' Represents all the Html code
    ''' received from the article.
    ''' </summary>
    Private Property FullHtml As HtmlDocument

    Private _BodyHtml As HtmlNodeCollection
    ''' <summary>
    ''' Represents the Html of all the paragraphs
    ''' in the article.
    ''' </summary>
    Public Property BodyHtml As HtmlNodeCollection
        Get
            Return _BodyHtml
        End Get
        Set(value As HtmlNodeCollection)
            Dim WordsMatches As MatchCollection
            _BodyHtml = value
            For Paragraph As Integer = value.Count - 1 To 0 Step -1
                WordsMatches = Regex.Matches(value.Item(Paragraph).InnerText, WordPath)
                If WordsMatches.Count < 10 Then
                    _BodyHtml.RemoveAt(Paragraph)
                End If
            Next
        End Set
    End Property

    Private Sub DownloadHtml()
        Dim HtmlWeb As HtmlWeb = New HtmlWeb
        FullHtml = New HtmlDocument
        _BodyHtml = New HtmlNodeCollection(FullHtml.DocumentNode)
        FullHtml = (HtmlWeb.Load(Url))
        FullHtml.OptionFixNestedTags = True
        BodyHtml = FullHtml.DocumentNode.SelectNodes(BodyPath)
    End Sub

End Class
```
Solution
Let's break down your code and look at what you're doing wrong/right and what can be improved.
I - Class

```
Public Class HtmlConnection
```

The name of your class is quite misleading, as it's not a connection object per se. It's an object which contains html. The underlying HttpClient used in HAP is closer to being (if not actually being) a html connector. So we'll rename the class so that it reflects what it represents: a html article.

```
Public Class HtmlArticle
```

II - Constants
```
Private Const BodyPath As String = "//body//p"
Private Const WordPath As String = "[a-zA-Z]+"
```

This is good! You're using constants instead of magic strings/numbers. Nothing to change here except that I'll introduce a new constant.

```
Private Const MinLength As Integer = 10
```

III - Fields
```
Private _BodyHtml As HtmlNodeCollection
```

Fields should be placed at the top and written in lowerCamelCase. It's also a bad habit to start a member name with an underscore, as this makes your code non-CLS-compliant if the member is anything other than private.

We'll rename the member and introduce a new one. I'll explain why later.

```
Private m_url As String
Private m_paragraphs As HtmlNodeCollection
```

IV - Constructors
```
Public Sub New(ByVal url As String)
    Me.Url = url
    DownloadHtml()
End Sub
```

You're making a very big mistake in the constructor: it's doing too much work. Constructors should be as light as possible. The html should only be downloaded when needed, when you decide to invoke DownloadHtml. The same logic applies to the SqlConnection class; it doesn't invoke Open in the constructor. You have to do that as a separate call.

Your casing is correct, but you can remove the ByVal keyword, as this is the default by design. I'll introduce a new parameter and make the constructor private. More about that later.

```
Private Sub New(url As String, paragraphs As HtmlNodeCollection)
    Me.m_url = url
    Me.m_paragraphs = paragraphs
End Sub
```

V - Properties
```
Private Property Url As String

Private Property FullHtml As HtmlDocument

Public Property BodyHtml As HtmlNodeCollection
    Get
        Return _BodyHtml
    End Get
    Set(value As HtmlNodeCollection)
        Dim WordsMatches As MatchCollection
        _BodyHtml = value
        For Paragraph As Integer = value.Count - 1 To 0 Step -1
            WordsMatches = Regex.Matches(value.Item(Paragraph).InnerText, WordPath)
            If WordsMatches.Count < 10 Then
                _BodyHtml.RemoveAt(Paragraph)
            End If
        Next
    End Set
End Property
```

Private auto-implemented get-set properties should always be turned into fields.

You should avoid heavy code in properties. Properties should mainly be used to get and set the value of a backing field, not to process data. The code should be moved to the DownloadHtml method.

The expression WordsMatches.Count < 10 contains a magic number (10) which should be turned into a constant (ref. the beginning of the review).

I don't see any real use of the FullHtml property other than storing a reference, so we'll change the scope and remove it.

The name of the BodyHtml property is misleading. It's not a html body; it contains our paragraphs, so we'll name it accordingly. Same reason why we changed the name of the backing field earlier.

The name of the Url property is good, so we'll keep that.

Since the backing fields of the properties are provided in the constructor, we'll remove the setters and mark the properties read only.
```
Public ReadOnly Property Paragraphs As HtmlNodeCollection
    Get
        Return Me.m_paragraphs
    End Get
End Property

Public ReadOnly Property Url As String
    Get
        Return Me.m_url
    End Get
End Property
```

VI - Methods
```
Private Sub DownloadHtml()
    Dim HtmlWeb As HtmlWeb = New HtmlWeb
    FullHtml = New HtmlDocument
    _BodyHtml = New HtmlNodeCollection(FullHtml.DocumentNode)
    FullHtml = (HtmlWeb.Load(Url))
    FullHtml.OptionFixNestedTags = True
    BodyHtml = FullHtml.DocumentNode.SelectNodes(BodyPath)
End Sub
```

This method should be public and it should do all the heavy work. I also suggest you make it static (shared) and return an instance of our class based on the downloaded data. Doing this makes it obvious why we made the constructor private and the properties read only.
```
Public Shared Function Download(url As String) As HtmlArticle
    If (String.IsNullOrWhiteSpace(url)) Then
        Throw New ArgumentNullException(NameOf(url))
    End If
    Dim web As New HtmlWeb()
    Dim document As HtmlDocument = web.Load(url)
    document.OptionFixNestedTags = True
    Dim paragraphs As HtmlNodeCollection = document.DocumentNode.SelectNodes(HtmlArticle.BodyPath)
    For index As Integer = (paragraphs.Count - 1) To 0 Step -1
        If (Regex.Matches(paragraphs.Item(index).InnerText, HtmlArticle.WordPath).Count < HtmlArticle.MinLength) Then
            paragraphs.RemoveAt(index)
        End If
    Next
    Return New HtmlArticle(url, paragraphs)
End Function
```
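With the shared factory in place, a call site would look something like `Dim article As HtmlArticle = HtmlArticle.Download("http://example.com/some-article")`.

One closing note on the original complaint, the roughly 5 seconds per article: almost all of that time is spent waiting on the network, so the biggest practical win is downloading several articles concurrently instead of one at a time. Here is a minimal sketch of that idea, assuming the HtmlArticle class above; the module and the DownloadAllAsync name are my own invention, not part of the review. HtmlWeb.Load is synchronous, so each call is pushed onto the thread pool with Task.Run and the results are awaited together with Task.WhenAll.

```
Imports System.Collections.Generic
Imports System.Linq
Imports System.Threading.Tasks

Public Module ArticleDownloader

    ''' <summary>
    ''' Downloads all the given urls concurrently and returns
    ''' the resulting articles once every download has finished.
    ''' </summary>
    Public Async Function DownloadAllAsync(urls As IEnumerable(Of String)) As Task(Of HtmlArticle())
        ' Start one thread-pool task per url; they all run at the same time.
        Dim downloads As IEnumerable(Of Task(Of HtmlArticle)) =
            urls.Select(Function(url) Task.Run(Function() HtmlArticle.Download(url)))
        ' Await all of them together instead of one after another.
        Return Await Task.WhenAll(downloads)
    End Function

End Module
```

The caller then becomes `Dim articles = Await DownloadAllAsync(urlList)`. With several downloads in flight the total time for a batch approaches that of the slowest requests rather than the sum of all of them, though the exact gain depends on the sites and your connection.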
StackExchange Code Review Q#108144, answer score: 3