HiveBrain v1.2.0
Get Started
← Back to all entries
patternMinor

Matching between files and a list of filenames at scale

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
filenamesbetweenfilesandlistscalematching

Problem

Given a folder typically containing ~250,000 small (20-100kb) files in HTML format (they open as single-sheet workbooks in Excel) and a list of ~ 1 million filenames, I need to open all the files that match the list and Do Stuff.

Overview of Main code Loop: "Test 1 Start" to "Test 1 End"

Once we get to the main file loop, I have a 1-D array arrCompanyNumbers with approx. 1 Million 8-digit company numbers (extracted from a 2-d array arrFilteredAddresses where the company number is only one of about 12 columns).

There is a folder with a couple hundred thousand files in it named like this.


"Prod224_0005_00040751_20131231"

"Prod224_0005_00040789_20130930"

.......

That number in the middle is the company number. The number on the end is the file date.

The macro loops through the folder using strFilename = dir. For each file, it extracts the company number and tries to match it against the array. If it does, it calls Check_File(). If it doesn't it goes to the next file and repeats.

Overview of Check_File()

Each file opens as a one-sheet workbook with a corporate accounts filing in it. I want to find the Cash, Assets and Profits in the last year. I have 3 collections of phrases that correspond to those values e.g. "Cash at Bank:".

The macro searches the first 200 rows of the first 2 columns for those phrases. Then searches 10 cells across the row and returns the first number it finds.

Once it has Cash, Assets and Profits (or failed to find them), it filters them against set criteria. If they pass, it copies the results to a second worksheet in the main workbook (which is eventually just one long list of companies, file dates and cash/assets/profits) and closes the file.

Optimisation Parameters:

I've already optimised it as far as I think I can e.g. Running speed tests on using vlookup instead of iterative searching. Even after all that, it will typically run for 6-24 hours to filter an entire month of data, and comes dangerously clo

Solution

Here are a couple of things to look into:

-
In the Check_Companies_House_Files, near the start you have this block of code:

ReDim arrFilteredAddresses(1 To lngFinalRow, 1 To lngFinalColumn)

'/ Done iteratively because excel throws an "Out of memory" if I try to pass the whole range to the array in one go. Approx. 2 minutes for 1Million length list
For L = 1 To lngFinalRow
    For M = 1 To lngFinalColumn
        arrFilteredAddresses(L, M) = wsFilteredAddresses.Cells(L, M).Text
    Next M
Next L

ReDim arrCompanyNumbers(1 To lngFinalRow)

For L = 1 To lngFinalRow
    arrCompanyNumbers(L) = Right("00000000" & arrFilteredAddresses(L, 2), 8) '/ company numbers in the filenames are always 8 digits long, with 0's filling up any extra digits
Next L


Why have two separate loops? You should be able to merge them.

ReDim arrFilteredAddresses(1 To lngFinalRow, 1 To lngFinalColumn)
ReDim arrCompanyNumbers(1 To lngFinalRow)

'/ Done iteratively because excel throws an "Out of memory" if I try to pass the whole range to the array in one go. Approx. 2 minutes for 1Million length list
For L = 1 To lngFinalRow
    For M = 1 To lngFinalColumn
        arrFilteredAddresses(L, M) = wsFilteredAddresses.Cells(L, M).Text
    Next M
    arrCompanyNumbers(L) = Right("00000000" & arrFilteredAddresses(L, 2), 8) '/ company numbers in the filenames are always 8 digits long, with 0's filling up any extra digits
Next L


When you build the arrCompanyNumbers array, you do that text formatting 1 million times. Why not add a column to your Excel sheet with the value already formatted using the TEXT function? You then just read that column into the array.

-
Your code is looping through your array arrCompanyNumbers looking for a matching entry. You might to look into using a Scripting.Dictionary instead because the Exists method is possibly much quicker than a 1 million long array. Presumably your company number values are all unique?

-
If your files are HTML, Excel probably spends a lot of time parsing the HTML code when opening the file. You could try using the Scripting.FileSystemObject and the Scripting.TextStream.ReadAll method. This will load the file into a string variable and you could then use the InStr function to search for your text entries. If the HTML is complicated/fussy it might be too tricky to find the value that goes with your heading.

-
You do this kind of thing a couple of times:

varHolder1 = Application.Match(strPhraseHolder, arrFirstColumn, 0)
    varHolder2 = UCase(Application.Index(arrFirstColumn, varHolder1))
    If IsError(varHolder1) _
        Then
        varHolder1 = Application.Match(strPhraseHolder, arrSecondColumn, 0)
        varHolder2 = UCase(Application.Index(arrSecondColumn, varHolder1))
    End If


I've got two issues with this. First, why bother getting the value for varHolder2 before you've checked if varHolder1 is an error? You should check varHolder1 first and then decide what to do. (By the way, why do you put the Then onto a new line and indent it? I would leave it on the same line as the If unless the line is very long.)

varHolder1 = Application.Match(strPhraseHolder, arrFirstColumn, 0)
    If Not IsError(varHolder1) Then
        varHolder2 = UCase(Application.Index(arrFirstColumn, varHolder1))
    Else
        varHolder1 = Application.Match(strPhraseHolder, arrSecondColumn, 0)
        varHolder2 = UCase(Application.Index(arrSecondColumn, varHolder1))
    End If


The second thing, is why use the Index function at all? The Match function has told you the row number/index number of the element. Why not use:

varHolder2 = UCase(arrSecondColumn(varHolder1))


-
A minor niggle is you hardcoding the max number of rows in your code, e.g. lngFinalRow = Cells(1048576, 1).End(xlUp).Row. This is fragile and might break if you upgrade Excel. Better to use lngFinalRow = Cells(wsAccountsData.Rows.Count, 1).End(xlUp).Row

Code Snippets

ReDim arrFilteredAddresses(1 To lngFinalRow, 1 To lngFinalColumn)

'/ Done iteratively because excel throws an "Out of memory" if I try to pass the whole range to the array in one go. Approx. 2 minutes for 1Million length list
For L = 1 To lngFinalRow
    For M = 1 To lngFinalColumn
        arrFilteredAddresses(L, M) = wsFilteredAddresses.Cells(L, M).Text
    Next M
Next L

ReDim arrCompanyNumbers(1 To lngFinalRow)

For L = 1 To lngFinalRow
    arrCompanyNumbers(L) = Right("00000000" & arrFilteredAddresses(L, 2), 8) '/ company numbers in the filenames are always 8 digits long, with 0's filling up any extra digits
Next L
ReDim arrFilteredAddresses(1 To lngFinalRow, 1 To lngFinalColumn)
ReDim arrCompanyNumbers(1 To lngFinalRow)

'/ Done iteratively because excel throws an "Out of memory" if I try to pass the whole range to the array in one go. Approx. 2 minutes for 1Million length list
For L = 1 To lngFinalRow
    For M = 1 To lngFinalColumn
        arrFilteredAddresses(L, M) = wsFilteredAddresses.Cells(L, M).Text
    Next M
    arrCompanyNumbers(L) = Right("00000000" & arrFilteredAddresses(L, 2), 8) '/ company numbers in the filenames are always 8 digits long, with 0's filling up any extra digits
Next L
varHolder1 = Application.Match(strPhraseHolder, arrFirstColumn, 0)
    varHolder2 = UCase(Application.Index(arrFirstColumn, varHolder1))
    If IsError(varHolder1) _
        Then
        varHolder1 = Application.Match(strPhraseHolder, arrSecondColumn, 0)
        varHolder2 = UCase(Application.Index(arrSecondColumn, varHolder1))
    End If
varHolder1 = Application.Match(strPhraseHolder, arrFirstColumn, 0)
    If Not IsError(varHolder1) Then
        varHolder2 = UCase(Application.Index(arrFirstColumn, varHolder1))
    Else
        varHolder1 = Application.Match(strPhraseHolder, arrSecondColumn, 0)
        varHolder2 = UCase(Application.Index(arrSecondColumn, varHolder1))
    End If
varHolder2 = UCase(arrSecondColumn(varHolder1))

Context

StackExchange Code Review Q#101461, answer score: 8

Revisions (0)

No revisions yet.