HiveBrain v1.2.0
Get Started
← Back to all entries
snippethtmlMinor

PowerShell script to create Excel spreadsheet from HTML file

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
scriptfileexcelcreatepowershellspreadsheetfromhtml

Problem

This script is designed to create an Excel spreadsheet from each HTML table from auto-generated index.html file.

It needs to be very robust but unfortunately I don't have many example files to test it against. Any feedback on style, possible points of failure, or vulnerabilities would be very appreciated.

Powershell

```
function Main
{

# find each file called "index.html" in the folders below the directory of the script
$Directory = $PSScriptRoot
$htmlFiles = Get-ChildItem -path $Directory -Recurse -Include index.html

# for each html file, create an excel file in the directory of the script
# called Exceptions Log .xlsx
ForEach($htmlFile in $htmlFiles)
{
$fileCreationDate = (Get-ChildItem $htmlFile).CreationTime
$fileString = $fileCreationDate.ToString("yyyy-mm-dd hh.mm.ss")
$outputFile = "$Directory\Exceptions Log $fileString.xlsx"

Write-Host "Parsing HTML file: $htmlFile"
$document = ParseHtml $htmlFile

Write-Host "Creating Excel file: $outputFile"
CreateExcelFromHtml $document $outputFile
}

}

# load local HTML file and return COM object that we can query
function ParseHtml([string]$fileIn)
{
$html = New-Object -ComObject "HTMLFile"
$source = Get-Content $fileIn -Raw
$html.IHTMLDocument2_write($source)
return $html
}

# create excel spreadsheet from html document
function CreateExcelFromHtml($document, [string]$excelFile)
{
$excel = New-Object -ComObject Excel.Application
$excel.Visible = $false # $true
$excel.DisplayAlerts = $false
$workbook = $excel.Workbooks.Add()
$sheet = $workbook.ActiveSheet

Write-Host "Filling in Excel file"
$row = 1;
# iterate over tables
$tables = $document.getElementsByTagName("table")
$captions = $document.getElementsByTagName("caption")

for($t=0; $t -lt $tables.length; $t++){

# create 2 row space between tables
$row += 2

#iterate over rows

Solution

Your code looks fine. Just some notes:

Excel

The main weakness of your design as I see it is its reliance on the Excel application. Since you say robustness is important, Excel would be the main weak point. Microsoft says don't use Office as a server. From my own bitter experience, I think they are right to say that.

It's not so bad in your case because you are doing the same thing over and over, so there should be no surprises, but you never know. Excel can be flakey. It can pop up dialogs that will halt processing. It's especially flakey if you are processing multiple documents simultaneously because Excel was never designed to handle that case.

As an alternative, you could use a library to produce your Excel, such as EPPlus. I've never used that, so I can't endorse it, but it looks good. There may be others out there too. Remember that it's easy to use .Net libraries from PowerShell.

HTML

Your HTML is very nearly XHTML. It looks like just the
tags would need to be fixed. If you have control over the thing that produces the HTML, then you could change it to produce XHTML instead, and then you could consume that with Import-XML. I'm guessing that's not the case, but I thought it worth mentioning.

Error Handling

You don't have any error handling. If you want your code to be robust you will have to think about what it should do when things go wrong.

Write-Host

Write-Host "Formating Excel file"


If you want to capture these messages in a log, then it would be better to use Write-Output.

Line Continuation

$tableRange = $sheet.Range($sheet.Cells($startRow, 1), ` 
    $sheet.Cells($startRow + $numberOfRows - 1, $endColumn))


I don't believe you need to use the backtick character there.

-Include

$htmlFiles = Get-ChildItem -path $Directory -Recurse -Include index.html


One oddity of PowerShell is that -Filter is much faster than -Include. If the number of folders and files you are searching is small, then it doesn't matter, but if you are searching through lots of them, it can make a huge difference.

(What is the difference in functionality between the two? None, except that -Include is more general and will work in non-filesystems, but that isn't relevant here.)

Code Snippets

Write-Host "Formating Excel file"
$tableRange = $sheet.Range($sheet.Cells($startRow, 1), ` 
    $sheet.Cells($startRow + $numberOfRows - 1, $endColumn))
$htmlFiles = Get-ChildItem -path $Directory -Recurse -Include index.html

Context

StackExchange Code Review Q#139670, answer score: 6

Revisions (0)

No revisions yet.