HiveBrain v1.2.0
Get Started
← Back to all entries
patterncsharpMinor

Extracting data from .txt files and writing to Excel

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
excelwritingfilestxtextractingandfromdata

Problem

I am extracting data from .txt files and writing needed data to Excel. My problem is that my code is very very slow and at one point even slower (+35k txt files). How should I rethink this problem?

```
Microsoft.Office.Interop.Excel.Application excelapp = new Microsoft.Office.Interop.Excel.Application();
excelapp.Visible = true;
_Workbook workbook = (_Workbook)(excelapp.Workbooks.Open(textBox2.Text));
_Worksheet worksheet = (_Worksheet)workbook.ActiveSheet;
DateTime dt = DateTime.Now;

foreach (string fileName in Directory.GetFiles(textBox1.Text, "*.txt"))
{
int row = 1, EmptyRow = 0;
while (Convert.ToString(worksheet.Range["A" + row].Value) != null)
{
row++;
EmptyRow = row;
}
string lines1 = GetLine(fileName, 10);
File.WriteAllText(@"D:/text.txt", lines1);
string[] ParsedLines = File.ReadAllLines(@"D:/text.txt");

string comp = ParsedLines[0];
string compare = comp.Substring(30, comp.Length - 30);
if (compare == "Failed")
{
string[] lines = File.ReadAllLines(fileName);
string serial = lines[3];

string SN = serial.Substring(30, serial.Length - 30);
worksheet.Range["A" + EmptyRow].Value2 = SN;

string data = lines[4];
string DT = data.Substring(30, data.Length - 30);
worksheet.Range["B" + EmptyRow].Value2 = DT;

string time = lines[5];
string TM = time.Substring(30, time.Length - 30);
worksheet.Range["C" + EmptyRow].Value2 = TM;

string operat = lines[6];
string OP = operat.Substring(30, operat.Length - 30);
worksheet.Range["D" + Empty

Solution

Input Validation

Currently the code just takes user-input and assumes it's well-formed and valid. That is not a reasonable assumption for all accounts and purposes.

The first place this happens is when the workbook is retrieved. The msdn-Documentation states, that the "file name" is required to open the Workbook. What if the user gives you something obviously incorrect like "my$\important\$Wb.exe"?

It would be a much cleaner UX, if you used a FilePicker here to get the workbook's filename.

The next place is where the code assumes textBox1.Text contains a valid Path. Here the "correct" solution is to show a FolderPicker

Stuff ...

The casing on the variable names is inconsistent. You should be consistent at almost any cost. It's important to make sure future maintainers (including you) can read the code easily and know what to expect.

In that vein variable names like SN and DT don't even remotely help, because nobody beside you right now can tell what they mean. You'll have a very hard time reading the code in 2 or 3 months, because you will have forgotten most of what the names mean. serialNumber instead of SN and timestamp instead of TM as well as better naemes for all these variables will help you when you need to touch this code again.

On that note there's quite a bunch of unnecessary assignments in the code. You're iterating foreach (string line in lines) multiple times. For one you could merge all of these loops, because they're completely independent.

Additionally whenever there's a match for what you need, the first thing you do is copying the line into an intermediary variable (which is busywork for the CPU, if it's not optimized out). Let me show what I mean in an example:

foreach (string line in lines) 
{
        if (line.Contains("FixtureCoverResistance:")) 
        {
                (worksheet.Cells[emptyRow, 6] as Excel.Range).Value2 = line.Substring(31, line.length - 31);
        }
        if (line.Contains("FwProgrammingCheck:"))
        {
                (worksheet.Cells[emptyRow, 7] as Excel.Range).Value2 = line.Substring(31, line.length - 31);
        }
        // ...
}


If you want to stop iterating lines when all of these properties have been used (which seems like a micro-optimization to me), you'll need to change your foreach loop to a for-loop over an index and stop iterating depending on flags.

You can see here that I use the Cells property to access the relevant cells in the worksheet instead of Range. This should be minimally faster, because accessing a Range over a string is always bound to make Excel parse the specifier string into something more suitable for addressing the range.

Additionally it integrates nicely with the use of EmptyRow

On a last note: Since the else-block is empty, you may want to consider inverting the if-condition to save one level of indentation and switch away early:

if (compare != "Failed") 
{
        continue;
}
string[] lines = File.ReadAllLines(fileName);
// ...

Code Snippets

foreach (string line in lines) 
{
        if (line.Contains("FixtureCoverResistance:")) 
        {
                (worksheet.Cells[emptyRow, 6] as Excel.Range).Value2 = line.Substring(31, line.length - 31);
        }
        if (line.Contains("FwProgrammingCheck:"))
        {
                (worksheet.Cells[emptyRow, 7] as Excel.Range).Value2 = line.Substring(31, line.length - 31);
        }
        // ...
}
if (compare != "Failed") 
{
        continue;
}
string[] lines = File.ReadAllLines(fileName);
// ...

Context

StackExchange Code Review Q#148304, answer score: 2

Revisions (0)

No revisions yet.