HiveBrain v1.2.0
Get Started
← Back to all entries
patterncsharpCritical

Counting rows in a CSV file that correspond to a database row, each with a million records

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
rowsmillionfilecountingeachwithcsvrecordsdatabasethat

Problem

I have two DataTables:

  • dt: is populated from a CSV file with over 1.7 million rows



  • dataStructure.Tables["AccountData"]: is populated from a database query also roughly a million rows



I use the following code to iterate through and compare the data from each set of rows. The code takes over 48 hours to complete. I have changed the properties of the application to x64 to allow it to use more Process Memory. It now uses roughly 2.5GB.

My question is, is there a more efficient way of doing this that would decrease run time?

//set is_legal column value for each row
foreach (DataRow row in dt.Rows)
{
    var acctNum = row[0].ToString().Replace("\"", "");
    foreach (DataRow queryRow in dataStructure.Tables["AccountData"].Rows)
    {
        var queryAcctNum = queryRow[0].ToString();
        if (acctNum.Equals(queryAcctNum))
        {
            row[12] = "Y";
            Console.WriteLine("Yes count: " + cnt);
        }
        else
        {
            row[12] = "N";
        }
    }
    cnt++;
};


How dataStructure.Tables["AccountData"] is being populated:

//Read each row from the table and output the results into the data set
while (readFile.Read())
{
    //Create a row to hold data
    DataRow datarow = dataStructure.Tables["AccountData"].NewRow();

    datarow["AccountNumber"] = readFile.GetString(0).Trim();
    datarow["LegalStatus"] = readFile.GetString(1);

    //add the row to the data table AccountData
    dataStructure.Tables["AccountData"].Rows.Add(datarow);
}

Solution

Your inner loop appears to be unnecessary. Why not create a lookup:

var knownAccountNumbers = new HashSet(
    dataStructure.Tables["AccountData"].Rows
        .Cast()
        .Select(row => row[0].ToString()));


Now your loop is simply:

foreach (DataRow row in dt.Rows)
{
    var accountNumber = row[0].ToString().Replace("\"", "");
    row[12] = knownAccountNumbers.Contains(accountNumber) ? "Y" : "N";
}


I think I remember reading once that the memory usage of a HashSet is 12 bytes per entry + size of the entry. So you're looking at 12MB + 1,000,000 (2 accountNumber.Length). So basically nothing in the grand scheme of things. However, you are gaining constant time lookups which should be a huge benefit to this kind of work.

You should take more care when naming things. Don't abbreviate e.g. acctNum -> accountNumber.

Code Snippets

var knownAccountNumbers = new HashSet<string>(
    dataStructure.Tables["AccountData"].Rows
        .Cast<DataRow>()
        .Select(row => row[0].ToString()));
foreach (DataRow row in dt.Rows)
{
    var accountNumber = row[0].ToString().Replace("\"", "");
    row[12] = knownAccountNumbers.Contains(accountNumber) ? "Y" : "N";
}

Context

StackExchange Code Review Q#120175, answer score: 53

Revisions (0)

No revisions yet.