HiveBrain v1.2.0
Get Started
← Back to all entries
patterncsharpModerate

Extract Excel data using Interop.Excel from C#

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
interopexcelusingextractfromdata

Problem

Code Objective

I need to create a System.Data.DataSet object from an Excel workbook. Each DataTable within the DataSet must correspond to a nonempty worksheet in the workbook. The top row of each sheet will be used as column names for each DataTable and the columns will be populated as the string representation of the worksheet data. The string part here is very important--everything in the final DataSet must be visually identical to what is in the Excel file with no assumptions about data types.

The Issue

My code is horrendously slow. The workbook I'm using to test my code has one worksheet which uses the Excel columns A through CO and uses rows 1 through 11361, so I don't expect it to be too blazing fast, but I finally stopped it after 20 minutes of letting it run. That's really slow, even for a big workbook.

My Goal Here

I would love your help in determining the cause of the slowness, or perhaps if there's an altogether more efficient way of going about this. My instinct was to post this to Stack Overflow, but I figured this would be a better place because the code actually works and does what I want it to do, it just does it very slowly.

The Code

I used Dylan Morley's code here for creating an Excel "wrapper" which aids with orphaned Excel processes and the release of COM objects.

```
namespace Excel.Helpers
{
internal class ExcelWrapper : IDisposable
{
private class Window
{
[DllImport("user32.dll", SetLastError = true)]
static extern IntPtr FindWindow(string lpClassName, string lpWindowName);

[DllImport("user32.dll")]
private static extern IntPtr GetWindowThreadProcessId(IntPtr hWnd,
out IntPtr ProcessID);

public static IntPtr GetWindowThreadProcessId(IntPtr hWnd)
{
IntPtr processID;
IntPtr returnResult = GetWindowThreadProcessId(hWnd, out processID);
return processID;
}

Solution

My other answer proposed a radically different approach. This is more of a code review.

Naming

The naming isn't following the typical C# naming conventions:

Application oXL = wrapper.Excel;

    Workbook oWB = oXL.Workbooks.Open(this.File, 0, true);


Would be more readable like this:

Application app = wrapper.Excel; 
     Workbook workbook = app.Workbooks.Open(this.File, 0, true);


Or, with type inference / implicit typing:

var app = wrapper.Excel; 
     var workbook = app.Workbooks.Open(this.File, 0, true);


Avoid arbitrary prefixes:

int nColumns = oSheet.UsedRange.Columns.Count;
    int nRows = oSheet.UsedRange.Rows.Count;


n means what? number? it's redundant. Just call a cat, a cat:

var columns = worksheet.UsedRange.Columns.Count;
    var rows = worksheet.UsedRange.Rows.Count;


Same here, the o prefix isn't buying you anything, and only makes the pronounciation of that variable's name sound funny:

int oRange = (Range)oSheet.Cells[i, j];


Use meaningful names:

public string File { get; set; }


File, in .net, usually refers to the System.IO.File class. Given that it's a string, I'd suggest the name FileName for that property instead.

Readability

This:

i += 1;


Would more explicitly read as an increment if written like this:

i++;


Your main loop (inside the try block) is doing 3 things:

  • Create a new DataTable for each worksheet.



  • Create a new DataColumn for each column on each worksheet.



  • Create a new DataRow for each row on each worksheet, and populate each column.



I'd extract 3 methods out of the main loop, to make it easier to parse:

var result = new DataSet();
try
{
    foreach (var worksheet in workbook.Worksheets)
    {
        DataTable worksheetData;
        If (ReadWorksheetData(worksheet, out worksheetData))
        {
            result.Tables.Add(worksheetData);
        }
    }
}
finally
{
    workbook.Close();
    app.Quit();
}


I dropped the redundant catch clause, and removed the call to wrapper.Dispose(), since the wrapper instance is created in a using scope, which already ensures its proper disposal - you're basically calling Dispose twice on it. Also nulling references that you're not going to reuse, doesn't buy you anything really, so I removed them too.

The ReadWorksheetData method would look something like this:

private bool ReadWorksheetData(Worksheet worksheet, out DataTable worksheetData)
{
    var result = false;

    var columns = worksheet.UsedRange.Columns.Count;        
    if (columns > 0)
    {
        var data = new DataTable(worksheet.Name);
        worksheetData = data;
        AddWorksheetColumns(worksheet, worksheetData, columns);
        AddWorksheetRows(worksheet, worksheetData); // temporal coupling here...

        result = true;
    }
    else
    {
        worksheetData = null;
    }

    return result;
}


This isn't perfect, AddWorksheetColumns must run before AddWorksheetRows and that creates temporal coupling between the two methods, but you get the idea - you break the main loop into multiple, smaller methods that do as little as possible.

Would parallelization help?

As a possible performance helper, the main loop could then be parallelized:

Parallel.ForEach(workbook.Worksheets, worksheet =>
        {
            DataTable worksheetData;
            If (ReadWorksheetData(worksheet, out worksheetData))
            {
                result.Tables.Add(worksheetData);
            }
        });


So you'd possibly have multiple worksheets being processed at the same time, depending on number of available CPU cores and some other factors. The overall workload remains the same though.

Code Snippets

Application oXL = wrapper.Excel;

    Workbook oWB = oXL.Workbooks.Open(this.File, 0, true);
Application app = wrapper.Excel; 
     Workbook workbook = app.Workbooks.Open(this.File, 0, true);
var app = wrapper.Excel; 
     var workbook = app.Workbooks.Open(this.File, 0, true);
int nColumns = oSheet.UsedRange.Columns.Count;
    int nRows = oSheet.UsedRange.Rows.Count;
var columns = worksheet.UsedRange.Columns.Count;
    var rows = worksheet.UsedRange.Rows.Count;

Context

StackExchange Code Review Q#49162, answer score: 11

Revisions (0)

No revisions yet.