snippethtmlMinor
PowerShell script to create Excel spreadsheet from HTML file
Viewed 0 times
scriptfileexcelcreatepowershellspreadsheetfromhtml
Problem
This script is designed to create an Excel spreadsheet from each HTML table from auto-generated
It needs to be very robust but unfortunately I don't have many example files to test it against. Any feedback on style, possible points of failure, or vulnerabilities would be very appreciated.
Powershell
```
function Main
{
# find each file called "index.html" in the folders below the directory of the script
$Directory = $PSScriptRoot
$htmlFiles = Get-ChildItem -path $Directory -Recurse -Include index.html
# for each html file, create an excel file in the directory of the script
# called Exceptions Log .xlsx
ForEach($htmlFile in $htmlFiles)
{
$fileCreationDate = (Get-ChildItem $htmlFile).CreationTime
$fileString = $fileCreationDate.ToString("yyyy-mm-dd hh.mm.ss")
$outputFile = "$Directory\Exceptions Log $fileString.xlsx"
Write-Host "Parsing HTML file: $htmlFile"
$document = ParseHtml $htmlFile
Write-Host "Creating Excel file: $outputFile"
CreateExcelFromHtml $document $outputFile
}
}
# load local HTML file and return COM object that we can query
function ParseHtml([string]$fileIn)
{
$html = New-Object -ComObject "HTMLFile"
$source = Get-Content $fileIn -Raw
$html.IHTMLDocument2_write($source)
return $html
}
# create excel spreadsheet from html document
function CreateExcelFromHtml($document, [string]$excelFile)
{
$excel = New-Object -ComObject Excel.Application
$excel.Visible = $false # $true
$excel.DisplayAlerts = $false
$workbook = $excel.Workbooks.Add()
$sheet = $workbook.ActiveSheet
Write-Host "Filling in Excel file"
$row = 1;
# iterate over tables
$tables = $document.getElementsByTagName("table")
$captions = $document.getElementsByTagName("caption")
for($t=0; $t -lt $tables.length; $t++){
# create 2 row space between tables
$row += 2
#iterate over rows
index.html file. It needs to be very robust but unfortunately I don't have many example files to test it against. Any feedback on style, possible points of failure, or vulnerabilities would be very appreciated.
Powershell
```
function Main
{
# find each file called "index.html" in the folders below the directory of the script
$Directory = $PSScriptRoot
$htmlFiles = Get-ChildItem -path $Directory -Recurse -Include index.html
# for each html file, create an excel file in the directory of the script
# called Exceptions Log .xlsx
ForEach($htmlFile in $htmlFiles)
{
$fileCreationDate = (Get-ChildItem $htmlFile).CreationTime
$fileString = $fileCreationDate.ToString("yyyy-mm-dd hh.mm.ss")
$outputFile = "$Directory\Exceptions Log $fileString.xlsx"
Write-Host "Parsing HTML file: $htmlFile"
$document = ParseHtml $htmlFile
Write-Host "Creating Excel file: $outputFile"
CreateExcelFromHtml $document $outputFile
}
}
# load local HTML file and return COM object that we can query
function ParseHtml([string]$fileIn)
{
$html = New-Object -ComObject "HTMLFile"
$source = Get-Content $fileIn -Raw
$html.IHTMLDocument2_write($source)
return $html
}
# create excel spreadsheet from html document
function CreateExcelFromHtml($document, [string]$excelFile)
{
$excel = New-Object -ComObject Excel.Application
$excel.Visible = $false # $true
$excel.DisplayAlerts = $false
$workbook = $excel.Workbooks.Add()
$sheet = $workbook.ActiveSheet
Write-Host "Filling in Excel file"
$row = 1;
# iterate over tables
$tables = $document.getElementsByTagName("table")
$captions = $document.getElementsByTagName("caption")
for($t=0; $t -lt $tables.length; $t++){
# create 2 row space between tables
$row += 2
#iterate over rows
Solution
Your code looks fine. Just some notes:
Excel
The main weakness of your design as I see it is its reliance on the Excel application. Since you say robustness is important, Excel would be the main weak point. Microsoft says don't use Office as a server. From my own bitter experience, I think they are right to say that.
It's not so bad in your case because you are doing the same thing over and over, so there should be no surprises, but you never know. Excel can be flakey. It can pop up dialogs that will halt processing. It's especially flakey if you are processing multiple documents simultaneously because Excel was never designed to handle that case.
As an alternative, you could use a library to produce your Excel, such as EPPlus. I've never used that, so I can't endorse it, but it looks good. There may be others out there too. Remember that it's easy to use .Net libraries from PowerShell.
HTML
Your HTML is very nearly XHTML. It looks like just the
Error Handling
You don't have any error handling. If you want your code to be robust you will have to think about what it should do when things go wrong.
Write-Host
If you want to capture these messages in a log, then it would be better to use
Line Continuation
I don't believe you need to use the backtick character there.
-Include
One oddity of PowerShell is that
(What is the difference in functionality between the two? None, except that
Excel
The main weakness of your design as I see it is its reliance on the Excel application. Since you say robustness is important, Excel would be the main weak point. Microsoft says don't use Office as a server. From my own bitter experience, I think they are right to say that.
It's not so bad in your case because you are doing the same thing over and over, so there should be no surprises, but you never know. Excel can be flakey. It can pop up dialogs that will halt processing. It's especially flakey if you are processing multiple documents simultaneously because Excel was never designed to handle that case.
As an alternative, you could use a library to produce your Excel, such as EPPlus. I've never used that, so I can't endorse it, but it looks good. There may be others out there too. Remember that it's easy to use .Net libraries from PowerShell.
HTML
Your HTML is very nearly XHTML. It looks like just the
tags would need to be fixed. If you have control over the thing that produces the HTML, then you could change it to produce XHTML instead, and then you could consume that with Import-XML. I'm guessing that's not the case, but I thought it worth mentioning.Error Handling
You don't have any error handling. If you want your code to be robust you will have to think about what it should do when things go wrong.
Write-Host
Write-Host "Formating Excel file"If you want to capture these messages in a log, then it would be better to use
Write-Output.Line Continuation
$tableRange = $sheet.Range($sheet.Cells($startRow, 1), `
$sheet.Cells($startRow + $numberOfRows - 1, $endColumn))I don't believe you need to use the backtick character there.
-Include
$htmlFiles = Get-ChildItem -path $Directory -Recurse -Include index.htmlOne oddity of PowerShell is that
-Filter is much faster than -Include. If the number of folders and files you are searching is small, then it doesn't matter, but if you are searching through lots of them, it can make a huge difference. (What is the difference in functionality between the two? None, except that
-Include is more general and will work in non-filesystems, but that isn't relevant here.)Code Snippets
Write-Host "Formating Excel file"$tableRange = $sheet.Range($sheet.Cells($startRow, 1), `
$sheet.Cells($startRow + $numberOfRows - 1, $endColumn))$htmlFiles = Get-ChildItem -path $Directory -Recurse -Include index.htmlContext
StackExchange Code Review Q#139670, answer score: 6
Revisions (0)
No revisions yet.