patternMinor
Load txt to Scripting.Dictionary
Viewed 0 times
dictionarytxtscriptingload
Problem
This code load a 40MB txt into a dictionary. It runs in about 40 seconds (sometimes 20, no idea why). Is there a way to make it run under 4 seconds, or hopefully in 1 second?
Sub ScriptDic()
Dim FileNum As Integer
Dim DataLine As String
Dim tmp As Variant
Dim Dict As Object
Dim duplicatecount As Object
Dim key As String
Dim count As Long
Set Dict = CreateObject("Scripting.Dictionary")
Set duplicatecount = CreateObject("Scripting.Dictionary")
Sheets("Control").Cells(5, 3).Value2 = Now
Filename = "C:\Users\MyFolder\Documents\PerformanceTests\MyData.txt"
FileNum = FreeFile()
Open Filename For Input As #FileNum
While Not EOF(FileNum)
Line Input #FileNum, DataLine ' read in data 1 line at a time
tmp = Split(DataLine, Chr(9))
key = tmp(7) & "-" & tmp(8)
If Not Dict.exists(key) Then
Dict.Add key, tmp
duplicatecount.Add key, 1
Else
count = duplicatecount(key)
duplicatecount.Remove (key)
Dict.Add key & ">" & count, tmp
duplicatecount.Add key, count + 1
End If
Wend
Sheets("Control").Cells(6, 3).Value2 = Now
End SubSolution
I assume you're working with a 40MB file instead of a 40GB file. The performance will vary greatly depending upon:
I've worked with a contrived data-file that is 5.127MB in size and has 250,000 data rows like:
The file has nearly all duplicated records, so I'm pushing the limits on the unique key approach. A file with all unique values will perform differently.
Using your code against the 5MB file, it runs in 6.18s. If I only read the file, it reads in 0.11s. If I only read the file, split each line and build a key, it runs in 0.76s. So approximately 5.42s, or 88% of the duration is related to dictionary manipulation.
So, what can we do in VBA, to improve your code
Option Explicit
You haven't included it in your code, so I assume it isn't declared.
Declare
The
Declare
tmp is assigned with the
Early Binding
You're using
Dictionary variable names
You've used ambiguous names:
Avoid using magic function/literal values
You're referring to the Tab character
Unique keys and Dictionary usage
The code to build a unique key and populate the dictionaries is the most expensive, so let's be careful about checking the smallest dictionary more frequently than the large dictionary.
Also, there's no need to
Close the file
You've opened the file and read it all, but you've forgotten to close the file:
Timing events
Using
I've used
So, when I put it altogether, the file is read in 4.85s, or about 22% faster.
```
Sub ScriptDic()
'Declare fileName
Dim filename As String
Dim FileNum As Integer
Dim DataLine As String
'Declare tmp as a string array
Dim tmp() As String
'Declare as Scripting.Dictionary to get early-binding benefits
Dim allRecords As Scripting.Dictionary
Dim uniqueKeyCounts As Scripting.Dictionary
Dim key As String
Dim count As Long
'USe early binding
Set allRecords = New Scripting.Dictionary
Set uniqueKeyCounts = New Scripting.Dictionary
Sheets("Control").Cells(5, 3).Value2 = Now
filename = "C:\Temp\test100k.txt"
FileNum = FreeFile()
Dim start As Double
start = Timer
Open filename For Input As #FileNum
Do While Not EOF(FileNum)
Line Input #FileNum, DataLine ' read in data 1 line at a time
'USe the vbTab constant
tmp = Split(DataLine, vbTab)
key = tmp(7) & "-" & tmp(8)
'Check the unique dictionary - it's almost always smaller
If uniqueKeyCounts.Exists(key) Then
'Retrieve the current count once
Dim currentCount As Long
currentCount = uniqueKeyCounts.Item(key)
allRecords.Add key & ">" & currentCount, DataLine
'Don't remove and re-add, just increment the counter
uniqueKeyCounts(key) = currentCount + 1
Else
unique
- the length of each line in the file
- the number of lines in the file
- the number of uniquely keyed-lines in the file
I've worked with a contrived data-file that is 5.127MB in size and has 250,000 data rows like:
a b c d e f g h i j
b c d e f g h i j k
a b c d e f g h i jThe file has nearly all duplicated records, so I'm pushing the limits on the unique key approach. A file with all unique values will perform differently.
Using your code against the 5MB file, it runs in 6.18s. If I only read the file, it reads in 0.11s. If I only read the file, split each line and build a key, it runs in 0.76s. So approximately 5.42s, or 88% of the duration is related to dictionary manipulation.
So, what can we do in VBA, to improve your code
Option Explicit
You haven't included it in your code, so I assume it isn't declared.
Declare
filename As StringThe
filename variable isn't declared, although you may have declared it at global scope. Option Explicit shows e this right away.Dim filename As StringDeclare
tmp as a String Arraytmp is assigned with the
Split function, which returns a String array, so declare tmp accordingly and you'll use a great deal less memory.Dim tmp() As StringEarly Binding
You're using
CreateObject("Scripting.Dictionary"), so your code will fail anyway, if Microsoft Scripting Runtime isn't available. But more importantly, by using late-binding, COM has to work harder to find the methods you're using. It's much better to early-bind by adding a reference to Microsoft Scripting Runtime.:Dictionary variable names
You've used ambiguous names:
Dict and duplicatecount. You'd be better off with more meaningful names (I don't know what your data is, so I'm using less un-meaningful names) like:Dim allRecords As Scripting.Dictionary
Dim uniqueKeyCounts As Scripting.Dictionary
Set allRecords = New Scripting.Dictionary
Set uniqueKeyCounts = New Scripting.DictionaryAvoid using magic function/literal values
You're referring to the Tab character
Chr(9) which incurs time on every line. VBA has a built-in constant for Tab: vbTab, which makes it more efficient and easier to read.tmp = Split(DataLine, vbTab)Unique keys and Dictionary usage
The code to build a unique key and populate the dictionaries is the most expensive, so let's be careful about checking the smallest dictionary more frequently than the large dictionary.
Also, there's no need to
Remove and then Add back, when we can just increment the count of the existing entry.If uniqueKeyCounts.Exists(key) Then
'Retrieve the current count once
Dim currentCount As Long
currentCount = uniqueKeyCounts.Item(key)
allRecords.Add key & ">" & currentCount, DataLine
'Don't remove and re-add, just increment the counter
uniqueKeyCounts(key) = currentCount + 1
Else
uniqueKeyCounts.Add key, 1
allRecords.Add key, tmp
End IfClose the file
You've opened the file and read it all, but you've forgotten to close the file:
Close #FileNumTiming events
Using
Now() offers very limited granularity. Timer gives you improved granularity but also a few problems when timed-code runs through midnight, but it avoids the need for win32 functions. If you want better timing see GetTickCount, or better still, QueryPerformanceCounter.I've used
Timer for the sake of simplicity.Dim start As Double
start = Timer
'...Do stuff...
Debug.Print Timer - startSo, when I put it altogether, the file is read in 4.85s, or about 22% faster.
```
Sub ScriptDic()
'Declare fileName
Dim filename As String
Dim FileNum As Integer
Dim DataLine As String
'Declare tmp as a string array
Dim tmp() As String
'Declare as Scripting.Dictionary to get early-binding benefits
Dim allRecords As Scripting.Dictionary
Dim uniqueKeyCounts As Scripting.Dictionary
Dim key As String
Dim count As Long
'USe early binding
Set allRecords = New Scripting.Dictionary
Set uniqueKeyCounts = New Scripting.Dictionary
Sheets("Control").Cells(5, 3).Value2 = Now
filename = "C:\Temp\test100k.txt"
FileNum = FreeFile()
Dim start As Double
start = Timer
Open filename For Input As #FileNum
Do While Not EOF(FileNum)
Line Input #FileNum, DataLine ' read in data 1 line at a time
'USe the vbTab constant
tmp = Split(DataLine, vbTab)
key = tmp(7) & "-" & tmp(8)
'Check the unique dictionary - it's almost always smaller
If uniqueKeyCounts.Exists(key) Then
'Retrieve the current count once
Dim currentCount As Long
currentCount = uniqueKeyCounts.Item(key)
allRecords.Add key & ">" & currentCount, DataLine
'Don't remove and re-add, just increment the counter
uniqueKeyCounts(key) = currentCount + 1
Else
unique
Code Snippets
a b c d e f g h i j
b c d e f g h i j k
a b c d e f g h i jDim filename As StringDim tmp() As StringDim allRecords As Scripting.Dictionary
Dim uniqueKeyCounts As Scripting.Dictionary
Set allRecords = New Scripting.Dictionary
Set uniqueKeyCounts = New Scripting.Dictionarytmp = Split(DataLine, vbTab)Context
StackExchange Code Review Q#141276, answer score: 4
Revisions (0)
No revisions yet.