patternMinor
Searching across text files
Viewed 0 times
textfilessearchingacross
Problem
I have 7 source files that I am searching and cross-referencing. I have tried a few different methods inside Excel (cell-by-cell iteration, dictionary), and found select statements as recordsets to be the fastest thus far.
Unfortunately, it still takes around 5 seconds for my code to search 4 of the files for all instances of a string from another file. Since this file currently has 9,000 (and will soon grow to over half a million), I'm wondering how I can improve the speed.
The 7 files are structured as follows:
The purpose of my code:
I have a key value that will be identical across the source and vendor files. Unfortunately, this value might be found multiple times in my source file, and multiple times or not at all in my vendor files. This value represents a model number, unique to the manufacturer, but each vendor includes multiple manufacturers in their files.
A problem is that the manufacturer name may (often does) have a different spelling, not only from file to file, but also within vendor files. Therefore I can't just concatenate the two fields into one and search for matches. What I have been doing is finding every instance of the model number, and comparing the manufacturer name fields to see if they refer to the same company (sometimes the manufacturer is spelled out, sometimes acronym, sometimes location appended, sometimes sub-brand used instead of or in addition to manufacturer, and often just spelled wro
Unfortunately, it still takes around 5 seconds for my code to search 4 of the files for all instances of a string from another file. Since this file currently has 9,000 (and will soon grow to over half a million), I'm wondering how I can improve the speed.
The 7 files are structured as follows:
String-Source(10+ fields, only 2 are needed, current product listing downloaded in .xlsx format from a website)
- 4 vendor files (.csv or tab delimited or .xls or the like, with/without/maybe-sometimes headers, ranging in size from small [only 5,000 records] to larger [about a million] records, 20+ fields but only 3 are needed)
sheets("yes")andsheets("no")(3 fields each, very small less than 1000, only used when I find a match, of which I only expect less than 100,000 hits each, so I am focusing less on optimizing search here)
The purpose of my code:
I have a key value that will be identical across the source and vendor files. Unfortunately, this value might be found multiple times in my source file, and multiple times or not at all in my vendor files. This value represents a model number, unique to the manufacturer, but each vendor includes multiple manufacturers in their files.
A problem is that the manufacturer name may (often does) have a different spelling, not only from file to file, but also within vendor files. Therefore I can't just concatenate the two fields into one and search for matches. What I have been doing is finding every instance of the model number, and comparing the manufacturer name fields to see if they refer to the same company (sometimes the manufacturer is spelled out, sometimes acronym, sometimes location appended, sometimes sub-brand used instead of or in addition to manufacturer, and often just spelled wro
Solution
The moment you start referring to file contents using X Million (even if X is a fraction) then you're well beyond the point where Excel is an appropriate application framework.
For robustness and future scalability, and given your client's inclination to the Microsoft Environment, I would recommend your client gets a license for Microsoft SQL Server (The express version is free and will handle up to 10GB of data) and builds a framework using that.
For your code as it stands, you might want to try
For robustness and future scalability, and given your client's inclination to the Microsoft Environment, I would recommend your client gets a license for Microsoft SQL Server (The express version is free and will handle up to 10GB of data) and builds a framework using that.
For your code as it stands, you might want to try
Application.EnableEvents = False and Application.Calculation = xlManual, though since you're mainly working with ADODB and recordsets, it might not make much of a difference.Context
StackExchange Code Review Q#128766, answer score: 3
Revisions (0)
No revisions yet.