patternMinor
Matching between files and a list of filenames at scale
Viewed 0 times
filenamesbetweenfilesandlistscalematching
Problem
Given a folder typically containing ~250,000 small (20-100kb) files in HTML format (they open as single-sheet workbooks in Excel) and a list of ~ 1 million filenames, I need to open all the files that match the list and Do Stuff.
Overview of Main code Loop: "Test 1 Start" to "Test 1 End"
Once we get to the main file loop, I have a 1-D array
There is a folder with a couple hundred thousand files in it named like this.
"Prod224_0005_00040751_20131231"
"Prod224_0005_00040789_20130930"
.......
That number in the middle is the company number. The number on the end is the file date.
The macro loops through the folder using
Overview of
Each file opens as a one-sheet workbook with a corporate accounts filing in it. I want to find the Cash, Assets and Profits in the last year. I have 3 collections of phrases that correspond to those values e.g. "Cash at Bank:".
The macro searches the first 200 rows of the first 2 columns for those phrases. Then searches 10 cells across the row and returns the first number it finds.
Once it has Cash, Assets and Profits (or failed to find them), it filters them against set criteria. If they pass, it copies the results to a second worksheet in the main workbook (which is eventually just one long list of companies, file dates and cash/assets/profits) and closes the file.
Optimisation Parameters:
I've already optimised it as far as I think I can e.g. Running speed tests on using
Overview of Main code Loop: "Test 1 Start" to "Test 1 End"
Once we get to the main file loop, I have a 1-D array
arrCompanyNumbers with approx. 1 Million 8-digit company numbers (extracted from a 2-d array arrFilteredAddresses where the company number is only one of about 12 columns). There is a folder with a couple hundred thousand files in it named like this.
"Prod224_0005_00040751_20131231"
"Prod224_0005_00040789_20130930"
.......
That number in the middle is the company number. The number on the end is the file date.
The macro loops through the folder using
strFilename = dir. For each file, it extracts the company number and tries to match it against the array. If it does, it calls Check_File(). If it doesn't it goes to the next file and repeats.Overview of
Check_File()Each file opens as a one-sheet workbook with a corporate accounts filing in it. I want to find the Cash, Assets and Profits in the last year. I have 3 collections of phrases that correspond to those values e.g. "Cash at Bank:".
The macro searches the first 200 rows of the first 2 columns for those phrases. Then searches 10 cells across the row and returns the first number it finds.
Once it has Cash, Assets and Profits (or failed to find them), it filters them against set criteria. If they pass, it copies the results to a second worksheet in the main workbook (which is eventually just one long list of companies, file dates and cash/assets/profits) and closes the file.
Optimisation Parameters:
I've already optimised it as far as I think I can e.g. Running speed tests on using
vlookup instead of iterative searching. Even after all that, it will typically run for 6-24 hours to filter an entire month of data, and comes dangerously cloSolution
Here are a couple of things to look into:
-
In the
Why have two separate loops? You should be able to merge them.
When you build the
-
Your code is looping through your array
-
If your files are HTML, Excel probably spends a lot of time parsing the HTML code when opening the file. You could try using the
-
You do this kind of thing a couple of times:
I've got two issues with this. First, why bother getting the value for
The second thing, is why use the
-
A minor niggle is you hardcoding the max number of rows in your code, e.g.
-
In the
Check_Companies_House_Files, near the start you have this block of code:ReDim arrFilteredAddresses(1 To lngFinalRow, 1 To lngFinalColumn)
'/ Done iteratively because excel throws an "Out of memory" if I try to pass the whole range to the array in one go. Approx. 2 minutes for 1Million length list
For L = 1 To lngFinalRow
For M = 1 To lngFinalColumn
arrFilteredAddresses(L, M) = wsFilteredAddresses.Cells(L, M).Text
Next M
Next L
ReDim arrCompanyNumbers(1 To lngFinalRow)
For L = 1 To lngFinalRow
arrCompanyNumbers(L) = Right("00000000" & arrFilteredAddresses(L, 2), 8) '/ company numbers in the filenames are always 8 digits long, with 0's filling up any extra digits
Next LWhy have two separate loops? You should be able to merge them.
ReDim arrFilteredAddresses(1 To lngFinalRow, 1 To lngFinalColumn)
ReDim arrCompanyNumbers(1 To lngFinalRow)
'/ Done iteratively because excel throws an "Out of memory" if I try to pass the whole range to the array in one go. Approx. 2 minutes for 1Million length list
For L = 1 To lngFinalRow
For M = 1 To lngFinalColumn
arrFilteredAddresses(L, M) = wsFilteredAddresses.Cells(L, M).Text
Next M
arrCompanyNumbers(L) = Right("00000000" & arrFilteredAddresses(L, 2), 8) '/ company numbers in the filenames are always 8 digits long, with 0's filling up any extra digits
Next LWhen you build the
arrCompanyNumbers array, you do that text formatting 1 million times. Why not add a column to your Excel sheet with the value already formatted using the TEXT function? You then just read that column into the array.-
Your code is looping through your array
arrCompanyNumbers looking for a matching entry. You might to look into using a Scripting.Dictionary instead because the Exists method is possibly much quicker than a 1 million long array. Presumably your company number values are all unique?-
If your files are HTML, Excel probably spends a lot of time parsing the HTML code when opening the file. You could try using the
Scripting.FileSystemObject and the Scripting.TextStream.ReadAll method. This will load the file into a string variable and you could then use the InStr function to search for your text entries. If the HTML is complicated/fussy it might be too tricky to find the value that goes with your heading.-
You do this kind of thing a couple of times:
varHolder1 = Application.Match(strPhraseHolder, arrFirstColumn, 0)
varHolder2 = UCase(Application.Index(arrFirstColumn, varHolder1))
If IsError(varHolder1) _
Then
varHolder1 = Application.Match(strPhraseHolder, arrSecondColumn, 0)
varHolder2 = UCase(Application.Index(arrSecondColumn, varHolder1))
End IfI've got two issues with this. First, why bother getting the value for
varHolder2 before you've checked if varHolder1 is an error? You should check varHolder1 first and then decide what to do. (By the way, why do you put the Then onto a new line and indent it? I would leave it on the same line as the If unless the line is very long.)varHolder1 = Application.Match(strPhraseHolder, arrFirstColumn, 0)
If Not IsError(varHolder1) Then
varHolder2 = UCase(Application.Index(arrFirstColumn, varHolder1))
Else
varHolder1 = Application.Match(strPhraseHolder, arrSecondColumn, 0)
varHolder2 = UCase(Application.Index(arrSecondColumn, varHolder1))
End IfThe second thing, is why use the
Index function at all? The Match function has told you the row number/index number of the element. Why not use:varHolder2 = UCase(arrSecondColumn(varHolder1))-
A minor niggle is you hardcoding the max number of rows in your code, e.g.
lngFinalRow = Cells(1048576, 1).End(xlUp).Row. This is fragile and might break if you upgrade Excel. Better to use lngFinalRow = Cells(wsAccountsData.Rows.Count, 1).End(xlUp).RowCode Snippets
ReDim arrFilteredAddresses(1 To lngFinalRow, 1 To lngFinalColumn)
'/ Done iteratively because excel throws an "Out of memory" if I try to pass the whole range to the array in one go. Approx. 2 minutes for 1Million length list
For L = 1 To lngFinalRow
For M = 1 To lngFinalColumn
arrFilteredAddresses(L, M) = wsFilteredAddresses.Cells(L, M).Text
Next M
Next L
ReDim arrCompanyNumbers(1 To lngFinalRow)
For L = 1 To lngFinalRow
arrCompanyNumbers(L) = Right("00000000" & arrFilteredAddresses(L, 2), 8) '/ company numbers in the filenames are always 8 digits long, with 0's filling up any extra digits
Next LReDim arrFilteredAddresses(1 To lngFinalRow, 1 To lngFinalColumn)
ReDim arrCompanyNumbers(1 To lngFinalRow)
'/ Done iteratively because excel throws an "Out of memory" if I try to pass the whole range to the array in one go. Approx. 2 minutes for 1Million length list
For L = 1 To lngFinalRow
For M = 1 To lngFinalColumn
arrFilteredAddresses(L, M) = wsFilteredAddresses.Cells(L, M).Text
Next M
arrCompanyNumbers(L) = Right("00000000" & arrFilteredAddresses(L, 2), 8) '/ company numbers in the filenames are always 8 digits long, with 0's filling up any extra digits
Next LvarHolder1 = Application.Match(strPhraseHolder, arrFirstColumn, 0)
varHolder2 = UCase(Application.Index(arrFirstColumn, varHolder1))
If IsError(varHolder1) _
Then
varHolder1 = Application.Match(strPhraseHolder, arrSecondColumn, 0)
varHolder2 = UCase(Application.Index(arrSecondColumn, varHolder1))
End IfvarHolder1 = Application.Match(strPhraseHolder, arrFirstColumn, 0)
If Not IsError(varHolder1) Then
varHolder2 = UCase(Application.Index(arrFirstColumn, varHolder1))
Else
varHolder1 = Application.Match(strPhraseHolder, arrSecondColumn, 0)
varHolder2 = UCase(Application.Index(arrSecondColumn, varHolder1))
End IfvarHolder2 = UCase(arrSecondColumn(varHolder1))Context
StackExchange Code Review Q#101461, answer score: 8
Revisions (0)
No revisions yet.