HiveBrain v1.2.0

Web scraping VBA - Internet Explorer

Submitted by: @import:stackexchange-codereview
Viewed 0 times
Tags: explorer, scraping, internet, web, vba

Problem

The code below extracts data from one web page: I emulate a search, select all results from the list, and when the list appears (42,000 items) I loop through the items.

I get an id value from each item's href to create a proper link (an example href is href="javascript:NeuFenster('rb_id=570964&land_abk=bw', so rather than any JS interaction I can just build working links). Then I navigate to that link and extract the whole text of the last table row.
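Extracting the id from such an href can be done with plain string functions; here is a minimal sketch (the function name and the target URL pattern are assumptions - the real base URL isn't shown in the question):

```vba
' Hypothetical sketch: pull the rb_id value out of the javascript: href
' and build a direct URL from it, avoiding any JS interaction.
Private Function BuildDetailLink(ByVal href As String) As String
    Dim startPos As Long, endPos As Long
    startPos = InStr(href, "rb_id=") + Len("rb_id=")
    endPos = InStr(startPos, href, "&")
    ' base URL is a placeholder assumption
    BuildDetailLink = "http://example.invalid/detail?rb_id=" & _
        Mid$(href, startPos, endPos - startPos) & "&land_abk=bw"
End Function
```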

The main reason I am posting this code is a memory issue. From all my testing, here is what happens:

  • An iexplore.exe process starts for the main result page, along with a 32-bit helper process, iexplore.exe *32.

  • Memory used by iexplore.exe *32 climbs to about 308 MB; once the result page is fully loaded, it remains constant (±2 MB) through the whole extraction process.

  • When a popup window (the subpage for each result) is opened, another iexplore.exe *32 starts; unlike the previous process, its memory usage never stops increasing, despite PopUpWindow.Quit and Set PopupWindow = Nothing.

  • A third iexplore.exe process starts once the first popup window is opened; it has a constant memory usage of around 24 MB. I have trouble understanding why such a process appears at all.

  • Each time I have scraped around 6,000 elements, an IE automation error appears. My guess is that at this point all the iexplore.exe *32 processes together consume around 500 MB of memory, which may be hitting some limit.

  • When I run this macro on another, older PC with hardly any free RAM, the automation errors appear roughly 10 times sooner, which makes using the macro impossible on that machine.
Is there any way to stop this growing memory usage? Should I be setting the IHTMLElement objects to Nothing as well?

Are there any other things that could improve the efficiency / transparency of the code (apart from moving the code to a module, applying multithreading, and using XML)?

```
Option Explicit

Private Sub CommandButton1_Click()

    Dim htmlele As IHTMLElement
```

Solution

That is a lot of work for a little button Click handler: your UserForm is running the show, it's doing much more than just collecting user input - it is the program. This is a design pattern known as "Smart UI", where the whole application logic is implemented in the UI. That's good for prototyping, but quickly becomes an unmaintainable tangle of globals and click handlers with hundreds of lines of code and countless responsibilities; you'll want to take a look at UserForm best practices on SO Documentation to see how you can separate presentation concerns from the actual application logic.

So I'll pretend that code isn't in some click handler and actually a specialized method of some dedicated class module. But how specialized is it?

By dividing the procedure into multiple smaller ones that do one thing, you will gain in maintainability, and more importantly you will be reducing your objects' scopes.

In VBA the smallest possible scope is at procedure level. This means an object reference doesn't go out of scope until its procedure exits; while in theory setting a reference to Nothing destroys it, in practice, if it's still used later in the same scope, then it's still in scope.

By splitting up your logic in smaller scopes you give much clearer clues to the VBA runtime: objects don't need to be set to Nothing when they only ever live long enough to do their thing; in fact an object going out of scope is the only reliable way to make sure VBA cleans it up properly. PopupWindow doesn't belong in the same scope as ie.
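To illustrate, here is a minimal sketch of how the popup work could live in its own procedure, so the reference is released on exit (all names are assumptions, as is using a fresh InternetExplorer instance in place of the NeuFenster popup):

```vba
' Sketch (assumed names): the popup lives only inside this procedure,
' so its reference goes out of scope - and gets cleaned up - on exit.
Private Function ScrapeDetailPage(ByVal url As String) As String
    Dim popup As Object
    Set popup = CreateObject("InternetExplorer.Application")
    popup.navigate url
    Do While popup.Busy Or popup.readyState <> 4 ' READYSTATE_COMPLETE
        DoEvents
    Loop
    ' grab the text of the last table row (assumed page structure)
    Dim rows As Object
    Set rows = popup.document.getElementsByTagName("tr")
    If rows.Length > 0 Then ScrapeDetailPage = rows(rows.Length - 1).innerText
    popup.Quit
    ' no Set popup = Nothing needed: the reference dies right here
End Function
```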

Then there's the separation of concerns: collecting inputs (reading the scraped data) doesn't belong in the same procedure as outputting results (writing to some target worksheet).

There should be a function responsible for collecting the data, stuffing it into an array, and returning that array to the caller; then another procedure takes that array and a given Worksheet object, and writes all results in one single operation - no looping necessary. By separating inputs from outputs you'll be specializing each procedure, making the whole thing more performant. You don't need ScreenUpdating turned off while you're collecting the data - you could use Application.StatusBar to update a progress indicator as you go (say, once every 5%) without affecting performance much; what's costing you here is that you're writing to individual cells in a nested loop.
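The output side could look something like this minimal sketch (the procedure name, the array layout, and the progress-reporting interval are all assumptions):

```vba
' Sketch (assumed names): write the whole 2-D result array to the
' worksheet in a single Range assignment - no cell-by-cell loop.
Private Sub WriteResults(ByVal target As Worksheet, ByRef results As Variant)
    Dim rowCount As Long, colCount As Long
    rowCount = UBound(results, 1) - LBound(results, 1) + 1
    colCount = UBound(results, 2) - LBound(results, 2) + 1
    target.Range("A1").Resize(rowCount, colCount).Value = results
End Sub

' Inside the collecting loop, a cheap progress indicator (every ~5%):
'     If i Mod (total \ 20) = 0 Then
'         Application.StatusBar = "Scraping... " & Format$(i / total, "0%")
'     End If
```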

Extract procedures from each loop body, reduce variables/objects' scopes as much as possible, and separate concerns - you'll gain performance, lose memory footprint, and end up with overall more maintainable code.

Context

StackExchange Code Review Q#143015, answer score: 2

Revisions (0)

No revisions yet.