HiveBrain v1.2.0
Get Started
← Back to all entries
patternMinor

Load txt to Scripting.Dictionary

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
dictionarytxtscriptingload

Problem

This code load a 40MB txt into a dictionary. It runs in about 40 seconds (sometimes 20, no idea why). Is there a way to make it run under 4 seconds, or hopefully in 1 second?

Sub ScriptDic()

Dim FileNum As Integer
Dim DataLine As String
Dim tmp As Variant
Dim Dict As Object
Dim duplicatecount As Object
Dim key As String
Dim count As Long

Set Dict = CreateObject("Scripting.Dictionary")
Set duplicatecount = CreateObject("Scripting.Dictionary")

Sheets("Control").Cells(5, 3).Value2 = Now
Filename = "C:\Users\MyFolder\Documents\PerformanceTests\MyData.txt"
FileNum = FreeFile()

Open Filename For Input As #FileNum
While Not EOF(FileNum)
    Line Input #FileNum, DataLine ' read in data 1 line at a time
    tmp = Split(DataLine, Chr(9))
    key = tmp(7) & "-" & tmp(8)
    If Not Dict.exists(key) Then
        Dict.Add key, tmp
        duplicatecount.Add key, 1
    Else
        count = duplicatecount(key)
        duplicatecount.Remove (key)

        Dict.Add key & ">" & count, tmp
        duplicatecount.Add key, count + 1
    End If
Wend

Sheets("Control").Cells(6, 3).Value2 = Now
End Sub

Solution

I assume you're working with a 40MB file instead of a 40GB file. The performance will vary greatly depending upon:

  • the length of each line in the file



  • the number of lines in the file



  • the number of uniquely keyed-lines in the file



I've worked with a contrived data-file that is 5.127MB in size and has 250,000 data rows like:

a   b   c   d   e   f   g   h   i   j
b   c   d   e   f   g   h   i   j   k
a   b   c   d   e   f   g   h   i   j


The file has nearly all duplicated records, so I'm pushing the limits on the unique key approach. A file with all unique values will perform differently.

Using your code against the 5MB file, it runs in 6.18s. If I only read the file, it reads in 0.11s. If I only read the file, split each line and build a key, it runs in 0.76s. So approximately 5.42s, or 88% of the duration is related to dictionary manipulation.

So, what can we do in VBA, to improve your code

Option Explicit
You haven't included it in your code, so I assume it isn't declared.

Declare filename As String
The filename variable isn't declared, although you may have declared it at global scope. Option Explicit shows e this right away.

Dim filename As String


Declare tmp as a String Array
tmp is assigned with the Split function, which returns a String array, so declare tmp accordingly and you'll use a great deal less memory.

Dim tmp() As String


Early Binding
You're using CreateObject("Scripting.Dictionary"), so your code will fail anyway, if Microsoft Scripting Runtime isn't available. But more importantly, by using late-binding, COM has to work harder to find the methods you're using. It's much better to early-bind by adding a reference to Microsoft Scripting Runtime.:

Dictionary variable names
You've used ambiguous names: Dict and duplicatecount. You'd be better off with more meaningful names (I don't know what your data is, so I'm using less un-meaningful names) like:

Dim allRecords As Scripting.Dictionary
Dim uniqueKeyCounts As Scripting.Dictionary

Set allRecords = New Scripting.Dictionary
Set uniqueKeyCounts = New Scripting.Dictionary


Avoid using magic function/literal values
You're referring to the Tab character Chr(9) which incurs time on every line. VBA has a built-in constant for Tab: vbTab, which makes it more efficient and easier to read.

tmp = Split(DataLine, vbTab)


Unique keys and Dictionary usage
The code to build a unique key and populate the dictionaries is the most expensive, so let's be careful about checking the smallest dictionary more frequently than the large dictionary.

Also, there's no need to Remove and then Add back, when we can just increment the count of the existing entry.

If uniqueKeyCounts.Exists(key) Then
  'Retrieve the current count once
  Dim currentCount As Long
  currentCount = uniqueKeyCounts.Item(key)
  allRecords.Add key & ">" & currentCount, DataLine
  'Don't remove and re-add, just increment the counter
  uniqueKeyCounts(key) = currentCount + 1
Else
  uniqueKeyCounts.Add key, 1
  allRecords.Add key, tmp
End If


Close the file
You've opened the file and read it all, but you've forgotten to close the file:

Close #FileNum


Timing events
Using Now() offers very limited granularity. Timer gives you improved granularity but also a few problems when timed-code runs through midnight, but it avoids the need for win32 functions. If you want better timing see GetTickCount, or better still, QueryPerformanceCounter.
I've used Timer for the sake of simplicity.

Dim start As Double
start = Timer
'...Do stuff...
Debug.Print Timer - start


So, when I put it altogether, the file is read in 4.85s, or about 22% faster.

```
Sub ScriptDic()

'Declare fileName
Dim filename As String

Dim FileNum As Integer
Dim DataLine As String
'Declare tmp as a string array
Dim tmp() As String
'Declare as Scripting.Dictionary to get early-binding benefits
Dim allRecords As Scripting.Dictionary
Dim uniqueKeyCounts As Scripting.Dictionary
Dim key As String
Dim count As Long

'USe early binding
Set allRecords = New Scripting.Dictionary
Set uniqueKeyCounts = New Scripting.Dictionary

Sheets("Control").Cells(5, 3).Value2 = Now
filename = "C:\Temp\test100k.txt"
FileNum = FreeFile()

Dim start As Double
start = Timer
Open filename For Input As #FileNum
Do While Not EOF(FileNum)
Line Input #FileNum, DataLine ' read in data 1 line at a time
'USe the vbTab constant
tmp = Split(DataLine, vbTab)
key = tmp(7) & "-" & tmp(8)
'Check the unique dictionary - it's almost always smaller
If uniqueKeyCounts.Exists(key) Then
'Retrieve the current count once
Dim currentCount As Long
currentCount = uniqueKeyCounts.Item(key)
allRecords.Add key & ">" & currentCount, DataLine
'Don't remove and re-add, just increment the counter
uniqueKeyCounts(key) = currentCount + 1
Else
unique

Code Snippets

a   b   c   d   e   f   g   h   i   j
b   c   d   e   f   g   h   i   j   k
a   b   c   d   e   f   g   h   i   j
Dim filename As String
Dim tmp() As String
Dim allRecords As Scripting.Dictionary
Dim uniqueKeyCounts As Scripting.Dictionary

Set allRecords = New Scripting.Dictionary
Set uniqueKeyCounts = New Scripting.Dictionary
tmp = Split(DataLine, vbTab)

Context

StackExchange Code Review Q#141276, answer score: 4

Revisions (0)

No revisions yet.