patterncsharpCritical
Counting rows in a CSV file that correspond to a database row, each with a million records
Viewed 0 times
rowsmillionfilecountingeachwithcsvrecordsdatabasethat
Problem
I have two DataTables:
I use the following code to iterate through and compare the data from each set of rows. The code takes over 48 hours to complete. I have changed the properties of the application to x64 to allow it to use more Process Memory. It now uses roughly 2.5GB.
My question is, is there a more efficient way of doing this that would decrease run time?
How
dt: is populated from a CSV file with over 1.7 million rows
dataStructure.Tables["AccountData"]: is populated from a database query also roughly a million rows
I use the following code to iterate through and compare the data from each set of rows. The code takes over 48 hours to complete. I have changed the properties of the application to x64 to allow it to use more Process Memory. It now uses roughly 2.5GB.
My question is, is there a more efficient way of doing this that would decrease run time?
//set is_legal column value for each row
foreach (DataRow row in dt.Rows)
{
var acctNum = row[0].ToString().Replace("\"", "");
foreach (DataRow queryRow in dataStructure.Tables["AccountData"].Rows)
{
var queryAcctNum = queryRow[0].ToString();
if (acctNum.Equals(queryAcctNum))
{
row[12] = "Y";
Console.WriteLine("Yes count: " + cnt);
}
else
{
row[12] = "N";
}
}
cnt++;
};How
dataStructure.Tables["AccountData"] is being populated://Read each row from the table and output the results into the data set
while (readFile.Read())
{
//Create a row to hold data
DataRow datarow = dataStructure.Tables["AccountData"].NewRow();
datarow["AccountNumber"] = readFile.GetString(0).Trim();
datarow["LegalStatus"] = readFile.GetString(1);
//add the row to the data table AccountData
dataStructure.Tables["AccountData"].Rows.Add(datarow);
}Solution
Your inner loop appears to be unnecessary. Why not create a lookup:
Now your loop is simply:
I think I remember reading once that the memory usage of a
You should take more care when naming things. Don't abbreviate e.g.
var knownAccountNumbers = new HashSet(
dataStructure.Tables["AccountData"].Rows
.Cast()
.Select(row => row[0].ToString()));Now your loop is simply:
foreach (DataRow row in dt.Rows)
{
var accountNumber = row[0].ToString().Replace("\"", "");
row[12] = knownAccountNumbers.Contains(accountNumber) ? "Y" : "N";
}I think I remember reading once that the memory usage of a
HashSet is 12 bytes per entry + size of the entry. So you're looking at 12MB + 1,000,000 (2 accountNumber.Length). So basically nothing in the grand scheme of things. However, you are gaining constant time lookups which should be a huge benefit to this kind of work.You should take more care when naming things. Don't abbreviate e.g.
acctNum -> accountNumber.Code Snippets
var knownAccountNumbers = new HashSet<string>(
dataStructure.Tables["AccountData"].Rows
.Cast<DataRow>()
.Select(row => row[0].ToString()));foreach (DataRow row in dt.Rows)
{
var accountNumber = row[0].ToString().Replace("\"", "");
row[12] = knownAccountNumbers.Contains(accountNumber) ? "Y" : "N";
}Context
StackExchange Code Review Q#120175, answer score: 53
Revisions (0)
No revisions yet.