patterncsharpMinor
CSV concatenator
Viewed 0 times
csvconcatenatorstackoverflow
Problem
I have the following code (running in LINQPad) which takes tens of thousands of CSV files and concatenates them to create a single file.
Each CSV file has two lines: a header line and a data line. I need the header line only once, then the rest of the lines are data.
However, it appears to be getting slower as it runs over the course of several minutes. Why is it doing this, and what can I do to make it faster overall?
Each file is only 3KB apiece. For example, the output below was from a run of about 55K files and the resulting CSV file was just under 8MB.
Output via
`Processed: 1000 files. Elapsed Time
Each CSV file has two lines: a header line and a data line. I need the header line only once, then the rest of the lines are data.
However, it appears to be getting slower as it runs over the course of several minutes. Why is it doing this, and what can I do to make it faster overall?
Each file is only 3KB apiece. For example, the output below was from a run of about 55K files and the resulting CSV file was just under 8MB.
// Directory path - This is the directory where all of the csv's you want to combine reside
var path = @"E:\a1";
// Destination path - Needs to be different from the directory Path we defined in the first step. This is where the combined csv file will be placed
var destination = @"E:\MasterFileA1.csv";
var dirInfo = new DirectoryInfo(path);
var enumFiles = dirInfo.EnumerateFiles("*.csv", SearchOption.TopDirectoryOnly);
int counter = 0;
string header;
string data;
DateTime start = DateTime.Now;
TimeSpan elapsed;
FileStream stream = new FileStream(destination, FileMode.Append, FileAccess.Write);
StreamWriter writer = new StreamWriter(stream);
foreach(var file in enumFiles)
{
StreamReader reader = new StreamReader(file.FullName);
header = reader.ReadLine();
data = reader.ReadLine();
reader.Close();
if (counter == 0)
{
writer.WriteLine(header);
}
writer.WriteLine(data);
counter++;
if (counter % 1000 == 0)
{
//writer.Flush();
elapsed = DateTime.Now.Subtract(start);
Console.WriteLine("Processed: " + counter + " files. Elapsed Time: " + String.Format("{0}:{1}", elapsed.Minutes, elapsed.Seconds) +
" Time per file: " + Math.Round(elapsed.TotalMilliseconds/counter, 4) + " ms");
}
}
writer.Flush();
writer.Close();Output via
Console:`Processed: 1000 files. Elapsed Time
Solution
I would suggest a couple of things (I can't reproduce since I don't have the files).
Here is a re-written version I did in LinqPad using
This allows the runtime to optimize memory usage and make sure that instances get cleaned up properly.
This allows the buffer to flush regularly and creates a nice atomic operation.
Let me know how this version performs.
When performing operations like this, the question frequently comes up as to what is the source of the performance degradation. Any number of things could cause it but to get concrete reasons to help your refactor pen, Performance Counters are your friend.
When the .NET Framework is installed, it comes with MANY handy little counters you can use to identify key metrics of your applications performance.
Here is some information on the performance counters available.
https://msdn.microsoft.com/en-us/library/w8f5kw2e(v=vs.110).aspx
Using the perf counters allows you to more easily identify things like if you have a memory leak, are bound by disk I/O, are killing the CPU, or are spending all of your time in GC. Once you know that, the solution to your performance issue should be much easier to see and implement.
Here is a re-written version I did in LinqPad using
Stopwatch and using statements.void Main() {
// Directory path - This is the directory where all of the csv's you want to combine reside
var path = @"E:\a1";
// Destination path - Needs to be different from the directory Path we defined in the first step. This is where the combined csv file will be placed
var destination = @"E:\MasterFileA1.csv";
var enumFiles = GetFilesToProcess(path);
int counter = 0;
string header;
string data;
Stopwatch globalTimer = Stopwatch.StartNew();
foreach (var file in enumFiles) {
Stopwatch fileReadTimer = Stopwatch.StartNew();
using (StreamReader reader = new StreamReader(file.FullName)) {
header = reader.ReadLine();
data = reader.ReadLine();
}
fileReadTimer.Stop();
LogTime(string.Concat("File Read: ", file.FullName), fileReadTimer);
Stopwatch writeTimer = Stopwatch.StartNew();
using (FileStream stream = new FileStream(destination, FileMode.Append, FileAccess.Write)) {
using (StreamWriter writer = new StreamWriter(stream)) {
if (counter == 0) {
writer.WriteLine(header);
}
writer.WriteLine(data);
}
}
writeTimer.Stop();
LogTime("Finished write operation.", writeTimer);
counter++;
if (counter % 1000 == 0) {
//writer.Flush();
Console.WriteLine("Processed: " + counter + " files. Elapsed Time: " + globalTimer.Elapsed.ToString() +
" Time per file: " + Math.Round((double)(globalTimer.ElapsedMilliseconds / counter), 4) + " ms");
}
}
globalTimer.Stop();
}
public FileInfo[] GetFilesToProcess(string path) {
Stopwatch timer = Stopwatch.StartNew();
return new DirectoryInfo(path).EnumerateFiles("*.csv", SearchOption.TopDirectoryOnly).ToArray();
timer.Stop();
LogTime("Enumerated input CSVs", timer);
}
public void LogTime(string message, Stopwatch timer) {
var now = DateTime.Now;
Console.WriteLine("[{0}] - {1}: Elapsed Time: {2}", string.Concat(now.ToShortDateString(), " ", now.ToShortTimeString()), message, timer.Elapsed.ToString());
}- Dispose of unmanaged resources with using statements:
using (FileStream stream = new FileStream(destination, FileMode.Append, FileAccess.Write)) {}This allows the runtime to optimize memory usage and make sure that instances get cleaned up properly.
- Only keep the stream open for the shortest amount of time possible.
This allows the buffer to flush regularly and creates a nice atomic operation.
Let me know how this version performs.
- Performance Profiling
When performing operations like this, the question frequently comes up as to what is the source of the performance degradation. Any number of things could cause it but to get concrete reasons to help your refactor pen, Performance Counters are your friend.
When the .NET Framework is installed, it comes with MANY handy little counters you can use to identify key metrics of your applications performance.
Here is some information on the performance counters available.
https://msdn.microsoft.com/en-us/library/w8f5kw2e(v=vs.110).aspx
Using the perf counters allows you to more easily identify things like if you have a memory leak, are bound by disk I/O, are killing the CPU, or are spending all of your time in GC. Once you know that, the solution to your performance issue should be much easier to see and implement.
Code Snippets
void Main() {
// Directory path - This is the directory where all of the csv's you want to combine reside
var path = @"E:\a1";
// Destination path - Needs to be different from the directory Path we defined in the first step. This is where the combined csv file will be placed
var destination = @"E:\MasterFileA1.csv";
var enumFiles = GetFilesToProcess(path);
int counter = 0;
string header;
string data;
Stopwatch globalTimer = Stopwatch.StartNew();
foreach (var file in enumFiles) {
Stopwatch fileReadTimer = Stopwatch.StartNew();
using (StreamReader reader = new StreamReader(file.FullName)) {
header = reader.ReadLine();
data = reader.ReadLine();
}
fileReadTimer.Stop();
LogTime(string.Concat("File Read: ", file.FullName), fileReadTimer);
Stopwatch writeTimer = Stopwatch.StartNew();
using (FileStream stream = new FileStream(destination, FileMode.Append, FileAccess.Write)) {
using (StreamWriter writer = new StreamWriter(stream)) {
if (counter == 0) {
writer.WriteLine(header);
}
writer.WriteLine(data);
}
}
writeTimer.Stop();
LogTime("Finished write operation.", writeTimer);
counter++;
if (counter % 1000 == 0) {
//writer.Flush();
Console.WriteLine("Processed: " + counter + " files. Elapsed Time: " + globalTimer.Elapsed.ToString() +
" Time per file: " + Math.Round((double)(globalTimer.ElapsedMilliseconds / counter), 4) + " ms");
}
}
globalTimer.Stop();
}
public FileInfo[] GetFilesToProcess(string path) {
Stopwatch timer = Stopwatch.StartNew();
return new DirectoryInfo(path).EnumerateFiles("*.csv", SearchOption.TopDirectoryOnly).ToArray();
timer.Stop();
LogTime("Enumerated input CSVs", timer);
}
public void LogTime(string message, Stopwatch timer) {
var now = DateTime.Now;
Console.WriteLine("[{0}] - {1}: Elapsed Time: {2}", string.Concat(now.ToShortDateString(), " ", now.ToShortTimeString()), message, timer.Elapsed.ToString());
}Context
StackExchange Code Review Q#137961, answer score: 3
Revisions (0)
No revisions yet.