Finding matches and writing results to a file in C#

Submitted by: @import:stackexchange-codereview

Problem

I have a folder that contains around 1000 text files. Their sizes vary from a few KB up to 500 MB (the total size is around 60 GB).

I also have a text file that contains 166k lines; each line contains a string of digits (up to 11 long). I go through every file in that folder and look for lines whose value matches one of the 166k entries, store each matching line, and then write all of them out to a file.

My approach works, but there are several issues:

  • It is not fast (it takes a long time to go through the files).
  • Memory consumption is high. I believe that is because I store everything in a list of strings and then write all of it at once.

Please take a look at my code below and let me know if I can improve it in any way.

static void Main(string[] args)
    {
        char splitter = '\u0001';

        //166k file
        string path = "Z:\\subid.txt";

        //Destination file
        string RxPath = "Z:\\matched.txt";

        List<string> subid = File.ReadAllLines(path).ToList();
        List<string> RxClaims = new List<string>();

        string[] lineObject;
        int count = 0;

        //folder location (contains 1000 text files)
        string folderPath = "Z:\\rawfiles";

        foreach (string file in Directory.EnumerateFiles(folderPath))
        {
            Console.WriteLine("Processing " + file);
            foreach (string line in File.ReadLines(file))
            {
                lineObject = line.Split(splitter);

                //Check if that value is equal to any of the numbers in the 166k, if so store in the list to print out later
                if(subid.Contains(lineObject[14]))
                {
                    count++;
                    RxClaims.Add(line);
                }

            }
        }

        File.WriteAllLines(RxPath, RxClaims);
        Console.WriteLine("Done, Number of Claims" + count);

        Console.ReadLine();
    }

Solution

Some things to speed this process up:

  • subid is only used to check whether it contains a given string. Use a HashSet (or a Dictionary), where each lookup is O(1) on average instead of O(n), which in this case means scanning up to 166k entries in the worst case.
  • EDIT: StreamReader and File.ReadLines actually behave the same here: both read the file lazily, one line at a time.
  • Start using the using statement for input/output objects. It declares a scope and guarantees that the object is disposed (so the file handle is closed) when that scope ends, even if an exception is thrown.
  • Personally, I put all configuration at the top.
  • You could also use a StreamWriter to write the file. It is just a handle to the file (the same principle as StreamReader, but for writing).

With only minimal adjustments, this should be significantly faster.
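
To illustrate the first point, here is a minimal sketch of the lookup difference. The class name LookupSketch, the helper BuildIdSet, and the generated sample IDs are made up for illustration; only List<T>.Contains and HashSet<T>.Contains come from the framework.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class LookupSketch
{
    // Hypothetical helper: builds the set once. Each later Contains call
    // hashes the key instead of scanning every element like List<T>.Contains.
    public static HashSet<string> BuildIdSet(IEnumerable<string> ids)
    {
        // Ordinal comparison: the IDs are plain digit strings, so no culture rules are needed.
        return new HashSet<string>(ids, StringComparer.Ordinal);
    }

    public static void Main()
    {
        // Made-up sample data standing in for the 166k-line ID file.
        List<string> idList = Enumerable.Range(0, 166000).Select(i => i.ToString()).ToList();
        HashSet<string> idSet = BuildIdSet(idList);

        Console.WriteLine(idList.Contains("165999")); // linear scan over the list
        Console.WriteLine(idSet.Contains("165999"));  // hash lookup, O(1) on average
    }
}
```

Both calls return the same answer; the difference is that the list version repeats its scan for every line of every input file, while the set version does not.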

Improved code

public static void Main(string[] args)
{
    // file paths
    string path = "Z:\\subid.txt";      //166k file
    string RxPath = "Z:\\matched.txt";  //Destination file
    string folderPath = "Z:\\rawfiles"; //folder location (contains 1000 text files)

    char splitter = '\u0001';

    // subid is ONLY used to check if it contains something. Make it a hashset     
    HashSet<string> subid = new HashSet<string>(File.ReadAllLines(path));
    List<string> RxClaims = new List<string>();

    string[] lineObject;
    int count = 0;

    foreach (string file in Directory.EnumerateFiles(folderPath))
    {
        Console.WriteLine("Processing " + file);

        // use a StreamReader to go through the file line by line
        using (StreamReader reader = new StreamReader(file))
        {
            string line;
            // ReadLine returns null once the end of the file is reached
            while ((line = reader.ReadLine()) != null)
            {
                lineObject = line.Split(splitter);

                //Check if that value is equal to any of the numbers in the 166k, if so store in the list to print out later
                if (subid.Contains(lineObject[14]))
                {
                    count++;
                    RxClaims.Add(line);
                }
            }
        }
    }

    File.WriteAllLines(RxPath, RxClaims);
    Console.WriteLine("Done, Number of Claims" + count);

    Console.ReadLine();
}
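
The improved code still collects every matching line in RxClaims before writing, so it does not address the memory concern from the question. A sketch of the StreamWriter suggestion, writing each match as soon as it is found, might look like the following. The class name StreamingMatcher, the method CopyMatchingLines, its parameters, and the length check that skips short lines are assumptions added for illustration, not part of the original answer.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class StreamingMatcher
{
    // Hypothetical helper: writes every line whose field at fieldIndex is in ids
    // straight to outputPath and returns the number of lines written.
    public static int CopyMatchingLines(
        IEnumerable<string> inputFiles, string outputPath,
        HashSet<string> ids, char splitter, int fieldIndex)
    {
        int count = 0;
        using (StreamWriter writer = new StreamWriter(outputPath))
        {
            foreach (string file in inputFiles)
            {
                // File.ReadLines is lazy, so only one line is held in memory at a time.
                foreach (string line in File.ReadLines(file))
                {
                    string[] fields = line.Split(splitter);
                    // Assumption: lines with too few fields are skipped rather than treated as errors.
                    if (fields.Length > fieldIndex && ids.Contains(fields[fieldIndex]))
                    {
                        writer.WriteLine(line);
                        count++;
                    }
                }
            }
        }
        return count;
    }
}
```

With the variables from the code above, a call could look like CopyMatchingLines(Directory.EnumerateFiles(folderPath), RxPath, subid, splitter, 14). The output file should live outside the input folder so it is not picked up as an input while it is being written.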

Context

StackExchange Code Review Q#141727, answer score: 3
