patterncsharpModerate
Extract Excel data using Interop.Excel from C#
Viewed 0 times
interopexcelusingextractfromdata
Problem
Code Objective
I need to create a
The Issue
My code is horrendously slow. The workbook I'm using to test my code has one worksheet which uses the Excel columns A through CO and uses rows 1 through 11361, so I don't expect it to be too blazing fast, but I finally stopped it after 20 minutes of letting it run. That's really slow, even for a big workbook.
My Goal Here
I would love your help in determining the cause of the slowness, or perhaps if there's an altogether more efficient way of going about this. My instinct was to post this to Stack Overflow, but I figured this would be a better place because the code actually works and does what I want it to do, it just does it very slowly.
The Code
I used Dylan Morley's code here for creating an Excel "wrapper" which aids with orphaned Excel processes and the release of COM objects.
```
namespace Excel.Helpers
{
internal class ExcelWrapper : IDisposable
{
private class Window
{
[DllImport("user32.dll", SetLastError = true)]
static extern IntPtr FindWindow(string lpClassName, string lpWindowName);
[DllImport("user32.dll")]
private static extern IntPtr GetWindowThreadProcessId(IntPtr hWnd,
out IntPtr ProcessID);
public static IntPtr GetWindowThreadProcessId(IntPtr hWnd)
{
IntPtr processID;
IntPtr returnResult = GetWindowThreadProcessId(hWnd, out processID);
return processID;
}
I need to create a
System.Data.DataSet object from an Excel workbook. Each DataTable within the DataSet must correspond to a nonempty worksheet in the workbook. The top row of each sheet will be used as column names for each DataTable and the columns will be populated as the string representation of the worksheet data. The string part here is very important--everything in the final DataSet must be visually identical to what is in the Excel file with no assumptions about data types.The Issue
My code is horrendously slow. The workbook I'm using to test my code has one worksheet which uses the Excel columns A through CO and uses rows 1 through 11361, so I don't expect it to be too blazing fast, but I finally stopped it after 20 minutes of letting it run. That's really slow, even for a big workbook.
My Goal Here
I would love your help in determining the cause of the slowness, or perhaps if there's an altogether more efficient way of going about this. My instinct was to post this to Stack Overflow, but I figured this would be a better place because the code actually works and does what I want it to do, it just does it very slowly.
The Code
I used Dylan Morley's code here for creating an Excel "wrapper" which aids with orphaned Excel processes and the release of COM objects.
```
namespace Excel.Helpers
{
internal class ExcelWrapper : IDisposable
{
private class Window
{
[DllImport("user32.dll", SetLastError = true)]
static extern IntPtr FindWindow(string lpClassName, string lpWindowName);
[DllImport("user32.dll")]
private static extern IntPtr GetWindowThreadProcessId(IntPtr hWnd,
out IntPtr ProcessID);
public static IntPtr GetWindowThreadProcessId(IntPtr hWnd)
{
IntPtr processID;
IntPtr returnResult = GetWindowThreadProcessId(hWnd, out processID);
return processID;
}
Solution
My other answer proposed a radically different approach. This is more of a code review.
Naming
The naming isn't following the typical C# naming conventions:
Would be more readable like this:
Or, with type inference / implicit typing:
Avoid arbitrary prefixes:
Same here, the
Use meaningful names:
Readability
This:
Would more explicitly read as an increment if written like this:
Your main loop (inside the
I'd extract 3 methods out of the main loop, to make it easier to parse:
I dropped the redundant
The
This isn't perfect,
Would parallelization help?
As a possible performance helper, the main loop could then be parallelized:
So you'd possibly have multiple worksheets being processed at the same time, depending on number of available CPU cores and some other factors. The overall workload remains the same though.
Naming
The naming isn't following the typical C# naming conventions:
Application oXL = wrapper.Excel;
Workbook oWB = oXL.Workbooks.Open(this.File, 0, true);Would be more readable like this:
Application app = wrapper.Excel;
Workbook workbook = app.Workbooks.Open(this.File, 0, true);Or, with type inference / implicit typing:
var app = wrapper.Excel;
var workbook = app.Workbooks.Open(this.File, 0, true);Avoid arbitrary prefixes:
int nColumns = oSheet.UsedRange.Columns.Count;
int nRows = oSheet.UsedRange.Rows.Count;n means what? number? it's redundant. Just call a cat, a cat:var columns = worksheet.UsedRange.Columns.Count;
var rows = worksheet.UsedRange.Rows.Count;Same here, the
o prefix isn't buying you anything, and only makes the pronounciation of that variable's name sound funny:int oRange = (Range)oSheet.Cells[i, j];Use meaningful names:
public string File { get; set; }File, in .net, usually refers to the System.IO.File class. Given that it's a string, I'd suggest the name FileName for that property instead.Readability
This:
i += 1;Would more explicitly read as an increment if written like this:
i++;Your main loop (inside the
try block) is doing 3 things:- Create a new
DataTablefor each worksheet.
- Create a new
DataColumnfor each column on each worksheet.
- Create a new
DataRowfor each row on each worksheet, and populate each column.
I'd extract 3 methods out of the main loop, to make it easier to parse:
var result = new DataSet();
try
{
foreach (var worksheet in workbook.Worksheets)
{
DataTable worksheetData;
If (ReadWorksheetData(worksheet, out worksheetData))
{
result.Tables.Add(worksheetData);
}
}
}
finally
{
workbook.Close();
app.Quit();
}I dropped the redundant
catch clause, and removed the call to wrapper.Dispose(), since the wrapper instance is created in a using scope, which already ensures its proper disposal - you're basically calling Dispose twice on it. Also nulling references that you're not going to reuse, doesn't buy you anything really, so I removed them too.The
ReadWorksheetData method would look something like this:private bool ReadWorksheetData(Worksheet worksheet, out DataTable worksheetData)
{
var result = false;
var columns = worksheet.UsedRange.Columns.Count;
if (columns > 0)
{
var data = new DataTable(worksheet.Name);
worksheetData = data;
AddWorksheetColumns(worksheet, worksheetData, columns);
AddWorksheetRows(worksheet, worksheetData); // temporal coupling here...
result = true;
}
else
{
worksheetData = null;
}
return result;
}This isn't perfect,
AddWorksheetColumns must run before AddWorksheetRows and that creates temporal coupling between the two methods, but you get the idea - you break the main loop into multiple, smaller methods that do as little as possible.Would parallelization help?
As a possible performance helper, the main loop could then be parallelized:
Parallel.ForEach(workbook.Worksheets, worksheet =>
{
DataTable worksheetData;
If (ReadWorksheetData(worksheet, out worksheetData))
{
result.Tables.Add(worksheetData);
}
});So you'd possibly have multiple worksheets being processed at the same time, depending on number of available CPU cores and some other factors. The overall workload remains the same though.
Code Snippets
Application oXL = wrapper.Excel;
Workbook oWB = oXL.Workbooks.Open(this.File, 0, true);Application app = wrapper.Excel;
Workbook workbook = app.Workbooks.Open(this.File, 0, true);var app = wrapper.Excel;
var workbook = app.Workbooks.Open(this.File, 0, true);int nColumns = oSheet.UsedRange.Columns.Count;
int nRows = oSheet.UsedRange.Rows.Count;var columns = worksheet.UsedRange.Columns.Count;
var rows = worksheet.UsedRange.Rows.Count;Context
StackExchange Code Review Q#49162, answer score: 11
Revisions (0)
No revisions yet.