Finding Matches and writing results to file C#
Problem
I have a folder that contains around 1000 text files; their sizes vary from a few KB up to 500 MB (total size is around 60 GB).
I also have a text file that contains 166k lines; each line contains a set of numbers (length up to 11). What I'm doing here is going through each file in that folder and trying to find a match between the 166k lines and those files, then storing the matching line from the file and writing it out to a file.
My approach is working but there are several issues:
- It is not fast (it takes a lot of time to go through the files)
- Memory consumption: I believe that is because I'm storing everything in a list of strings and then printing all of it at once
Please take a look at my code below, and let me know if I can enhance it and improve it in any possible way.
static void Main(string[] args)
{
    char splitter = '\u0001';
    //166k file
    string path = "Z:\\subid.txt";
    //Destination file
    string RxPath = "Z:\\matched.txt";
    List<string> subid = File.ReadAllLines(path).ToList();
    List<string> RxClaims = new List<string>();
    string[] lineObject;
    int count = 0;
    //folder location (contains 1000 text files)
    string folderPath = "Z:\\rawfiles";
    foreach (string file in Directory.EnumerateFiles(folderPath))
    {
        Console.WriteLine("Processing " + file);
        foreach (string line in File.ReadLines(file))
        {
            lineObject = line.Split(splitter);
            //Check if that value is equal to any of the numbers in the 166k, if so store in the list to print out later
            if (subid.Contains(lineObject[14]))
            {
                count++;
                RxClaims.Add(line);
            }
        }
    }
    File.WriteAllLines(RxPath, RxClaims);
    Console.WriteLine("Done, Number of Claims" + count);
    Console.ReadLine();
}
Solution
Some things to speed this process up:
- subid is only used to check whether it contains a certain string. Use a HashSet or Dictionary, where each lookup is O(1) instead of O(n), which in this case is O(166K) in the worst case.
- EDIT: StreamReader and File.ReadLines actually behave the same.
- Start wrapping input/output in using statements; this makes sure the object is disposed (and its file handle released) once it goes out of scope. The using statement declares that scope.
- Personally, I put all configuration at the top.
- You could also use StreamWriter to write the file, which is just a handle (same principle as the StreamReader, but for writing).
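The points above can be combined into a minimal sketch that also addresses the memory concern from the question: matches are written through a StreamWriter as they are found instead of being collected in a list first. The class name ClaimFilter, the method WriteMatches, and its parameters are hypothetical names chosen for illustration, and the length check on the split fields is an added assumption (lines with too few fields are skipped rather than throwing).

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class ClaimFilter
{
    // Hypothetical helper, not part of the original answer.
    // Streams every line of every input file and writes the lines whose field
    // at keyIndex is present in the ID file straight to outputPath.
    // Returns the number of lines written.
    public static int WriteMatches(IEnumerable<string> inputFiles, string idsPath,
                                   string outputPath, char splitter, int keyIndex)
    {
        // HashSet gives O(1) average lookups; ordinal comparison is enough
        // for plain digit strings.
        var ids = new HashSet<string>(File.ReadLines(idsPath), StringComparer.Ordinal);
        int count = 0;

        // using disposes the writer (flushing and closing the file) when the
        // block ends, even if an exception is thrown.
        using (var writer = new StreamWriter(outputPath))
        {
            foreach (string file in inputFiles)
            {
                // File.ReadLines is lazy: only one line is held in memory at a time.
                foreach (string line in File.ReadLines(file))
                {
                    string[] fields = line.Split(splitter);

                    // Assumption: lines with too few fields are skipped.
                    if (fields.Length > keyIndex && ids.Contains(fields[keyIndex]))
                    {
                        writer.WriteLine(line); // write immediately, nothing is buffered in a list
                        count++;
                    }
                }
            }
        }

        return count;
    }
}
```

With the paths from the question, a call might look like ClaimFilter.WriteMatches(Directory.EnumerateFiles("Z:\\rawfiles"), "Z:\\subid.txt", "Z:\\matched.txt", '\u0001', 14). The output file should live outside the folder being enumerated, as it does in the question, so that it is not picked up as an input file.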
With only minimal adjustments, this should be significantly faster.
Improved code
public static void Main(string[] args)
{
    // file paths
    string path = "Z:\\subid.txt"; //166k file
    string RxPath = "Z:\\matched.txt"; //Destination file
    string folderPath = "Z:\\rawfiles"; //folder location (contains 1000 text files)
    char splitter = '\u0001';
    // subid is ONLY used to check if it contains something. Make it a hashset
    HashSet<string> subid = new HashSet<string>(File.ReadAllLines(path));
    List<string> RxClaims = new List<string>();
    string[] lineObject;
    int count = 0;
    foreach (string file in Directory.EnumerateFiles(folderPath))
    {
        Console.WriteLine("Processing " + file);
        // use a streamreader to go through files!
        using (StreamReader reader = new StreamReader(file))
        {
            string line;
            // read every line until the end of the file (ReadLine returns null there)
            while ((line = reader.ReadLine()) != null)
            {
                lineObject = line.Split(splitter);
                //Check if that value is equal to any of the numbers in the 166k, if so store in the list to print out later
                if (subid.Contains(lineObject[14]))
                {
                    count++;
                    RxClaims.Add(line);
                }
            }
        }
    }
    File.WriteAllLines(RxPath, RxClaims);
    Console.WriteLine("Done, Number of Claims" + count);
    Console.ReadLine();
}
Code Snippets
public static void Main(string[] args)
{
    // file paths
    string path = "Z:\\subid.txt"; //166k file
    string RxPath = "Z:\\matched.txt"; //Destination file
    string folderPath = "Z:\\rawfiles"; //folder location (contains 1000 text files)
    char splitter = '\u0001';
    // subid is ONLY used to check if it contains something. Make it a hashset
    HashSet<string> subid = new HashSet<string>(File.ReadAllLines(path));
    List<string> RxClaims = new List<string>();
    string[] lineObject;
    int count = 0;
    foreach (string file in Directory.EnumerateFiles(folderPath))
    {
        Console.WriteLine("Processing " + file);
        // use a streamreader to go through files!
        using (StreamReader reader = new StreamReader(file))
        {
            string line;
            // read every line until the end of the file (ReadLine returns null there)
            while ((line = reader.ReadLine()) != null)
            {
                lineObject = line.Split(splitter);
                //Check if that value is equal to any of the numbers in the 166k, if so store in the list to print out later
                if (subid.Contains(lineObject[14]))
                {
                    count++;
                    RxClaims.Add(line);
                }
            }
        }
    }
    File.WriteAllLines(RxPath, RxClaims);
    Console.WriteLine("Done, Number of Claims" + count);
    Console.ReadLine();
}
Context
StackExchange Code Review Q#141727, answer score: 3