HiveBrain v1.2.0
Get Started
← Back to all entries
patternMinor

Parsing 4 Million Filenames

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
millionfilenamesparsing

Problem

This is the next step in my project to query Companies House records. The first step, retrieving and validating the company numbers to be targeted, was covered here and I will include that code at the end for context.

In this stage, GetTargetFilenames, I need to parse the entire series of Corporate Filings (typically 100,000 per month, going back 30 months at this point) and, for each:

  • Extract the company number from the filename



  • Check the company number against the ones I am targeting



  • If it is being targeted, add it to a Dictionary of target filenames



With this project, I am trying to take particular care with regards to future maintainability (naming, commenting etc.), so any critiques of that aspect would be especially welcome.

```
Option Explicit

Public Const COMPANY_NUMBER_COLUMN As Long = 1
Public Const parentFolderPath As String = "S:\Investments\Data\Companies House\Monthly Companies House Downloads\"

Public Sub ParseAllCompanyRecords()
'/ Data Structure: "Company Numbers", once input, will be stored as strings
'/ Company Number: 8-character string, generally 8-digits but sometimes with text prefixes E.G. "OC374102"

'/ Folder Path for monthly CH downloads: "S:\Investments\Data\Companies House\Monthly Companies House Downloads\"
'/ Filename Strucutre of a Monthly Folder: [parentFolderPath]"Accounts_Monthly_Data-"[Full Month Name][yyyy]"\" - Square Brackets not in filename
'/ Filename Structure of an individual filing: [Monthly Folder Path]"Prod224_"[4-character code]"_"[8-character Company Registration Number]"_"[yyyymmdd][.html OR .xml] - Square Brackets not in filename

Dim targetCompanyNumbers As Dictionary
Set targetCompanyNumbers = GetTargetCompanyNumbers

Dim targetFilenames As Dictionary
Set targetFilenames = GetTargetFilenames(targetCompanyNumbers)

End Sub

Public Function GetTargetFilenames(ByRef targetCompanyNumbers As Dictionary) As Dictionary
'/ Folder Path for monthly CH downloads: "S:\I

Solution

To improve the performance, you should list the files in an Array instead of a Collection. I would also drop the unnecessary * for the file expression:

Sub GetFileList(folder As String, outList() As String)
Dim fname$, count&
ReDim outList(0 To 10000) ' set initial array size

fname = FileSystem.Dir(folder, vbNormal)
Do While Len(fname)

' double the size of the array if necessary
If count > UBound(outList) Then ReDim Preserve outList(0 To UBound(outList) * 2)

' insert the file path
outList(count) = fname
count = count + 1

' next file
fname = FileSystem.Dir()
Loop

' set the final size for the array
If count Then
ReDim Preserve outList(0 To count - 1)
Else
outList = Split(Empty) ' set an empty array
End If
End Sub

Public Function GetTargetFilenames(ByRef targetCompanyNumbers As Dictionary) As Dictionary
...

Dim fileList() As String
GetFileList monthFolderPath, fileList

For i = LBound(fileList) To UBound(fileList)
...
Next

End Function

Context

StackExchange Code Review Q#126095, answer score: 3

Revisions (0)

No revisions yet.