patternMinor
VBA - XMLHTTP web scraping
Viewed 0 times
xmlhttpscrapingwebvba
Problem
I navigate with IE, do various things, then select all results option from a list and fire on click event. Once all results have been listed, I loop through their URLs, using the following code to gather data inside these URLs.
What can I do to improve speed of data scraping (apart from VB Script
multithreading)?
With
Is it better to declare and create objects within this Sub (meaning inside the loop for 42,000 links) or outside and make them public variables?
What can I do to improve speed of data scraping (apart from VB Script
multithreading)?
With
- IE navigation,
visible = falseand no images I had 1000 links extracted in 3min 45sec
- msxml2.XMLHTTP in 3min 30sec
- msxml2.serverXMLHTTP 1min 40sec -
.setRequestHeader "Content-Type", "text/xml"and.setRequestHeader "Accept-Encoding", "compress, gzip"didn't affect performance
Is it better to declare and create objects within this Sub (meaning inside the loop for 42,000 links) or outside and make them public variables?
Public Sub PrintCompanyData()
Dim oDom As Object: Set oDom = CreateObject("htmlFile")
Dim htmlelePopUp As IHTMLElement
Dim unformattedData As String
With CreateObject("msxml2.serverXMLHTTP") 'CreateObject("msxml2.xmlhttp")
.Open "GET", Link, False
'.setRequestHeader "Content-Type", "text/xml"
'.setRequestHeader "Accept-Encoding", "compress, gzip"
.send
If .ReadyState = 4 And .Status = 200 Then
oDom.body.innerHTML = .responseText
With WS
For Each htmlelePopUp In oDom.getElementsByTagName("tbody")
unformattedData = htmlelePopUp.Children(htmlelePopUp.Children.Length - 1).innerText
.Range("a" & ItemCount + 1).Value2 = Replace(unformattedData, Chr(10), vbNullString)
Next htmlelePopUp
End With
End If
End With
Set oDom = Nothing
End SubSolution
It's likely not a huge improvement, because the vast majority of this program's time will be spent going over the network, but there is a micro-optimization you can make here.
Instead of latebinding like this
You can early bind your objects, eliminating the time COM takes to do a dynamic lookup. However, the time this saves will be minuscule compared to the total runtime.
I'm not bashing on VBA, I <3 VBA, but it's just not the right tool for the job if you need it to run faster than this. What you really need is a language that's capable of multi-threading and asynchronous programming to make the most of the CPU time while you're waiting for those Http requests to return. Sounds like it's time to learn some C# or F# to me. Even Powershell may be a good option.
Instead of latebinding like this
With CreateObject("msxml2.serverXMLHTTP")You can early bind your objects, eliminating the time COM takes to do a dynamic lookup. However, the time this saves will be minuscule compared to the total runtime.
I'm not bashing on VBA, I <3 VBA, but it's just not the right tool for the job if you need it to run faster than this. What you really need is a language that's capable of multi-threading and asynchronous programming to make the most of the CPU time while you're waiting for those Http requests to return. Sounds like it's time to learn some C# or F# to me. Even Powershell may be a good option.
Code Snippets
With CreateObject("msxml2.serverXMLHTTP")Context
StackExchange Code Review Q#145592, answer score: 2
Revisions (0)
No revisions yet.