snippetshellMinor
Batch convert txt files to xls
Viewed 0 times
convertxlsbatchfilestxt
Problem
In this SO question I provided two answers for the batch conversion of more than 100K txt to xlsx files
I suspect any automation of Excel is too slow, but would appreciate if there is a superior (i.e. faster) approach using powershell.
I suspect any automation of Excel is too slow, but would appreciate if there is a superior (i.e. faster) approach using powershell.
$files = Get-ChildItem C:\Temp\*.txt
Write "Loading Files..."
$Excel = New-Object -ComObject Excel.Application
$Excel.visible = $false
$Excel.DisplayAlerts = $false
ForEach ($file in $files)
{
$WorkBook = $Excel.Workbooks.Open($file.Fullname)
$NewFilepath = $file.Fullname -replace ".{4}$"
$NewFilepath = $NewFilepath + ".xls"
$Workbook.SaveAs($NewFilepath,56)
}
Stop-Process -processname EXCEL
$Excel.Quit()Solution
As far as performance goes there are two general approaches to address this
Parallel processing
Running the com object is dragging in itself. Processing 10's of 1000's of files with one Excel instance will be draining. PowerShell does support multiple avenues of mitigating this
-
https://stackoverflow.com/questions/4016451/can-powershell-run-commands-in-parallel
-
https://stackoverflow.com/questions/8781666/run-n-parallel-jobs-in-powershell
-
about_foreach_parallel
I am going to show an approach using Jobs. What this will do is group up identified files. Each group will be past to its own Excel job for processing. Some comments in the code below explain it more.
Excel automation is full of pitfalls but the general approach seems to be working. For one com objects can prevent from being closed. Best thing to do is just kill any remaining processes. There is logic that can help to be sure any Excel opened before the script was run are not affected. Jobs are done in the way as well to not affect processing of other scripts.
This all would be better as a script using proper parameters but showing the logic of jobs is the point here.
Automate without Excel
There are a number of libraries out there that can create Excel documents without the need for COM objects or Excel itself. These would perform faster as well. EPPlus comes to mind and even [
Magic Numbers
Your 56 would be an example of that. You have to look it up to know what it represents. So 56 is for Excel Version 8. At a minimum that should be its own variable. My code above uses
If you wanted to go a little crazier you could actually import the assembly to access the enum
BaseName
This is what you are doing to replace the extension of the file. File objects have a basename property that is the file name without the extension. I know that you are dealing with txt files, so you already kno
Parallel processing
Running the com object is dragging in itself. Processing 10's of 1000's of files with one Excel instance will be draining. PowerShell does support multiple avenues of mitigating this
-
https://stackoverflow.com/questions/4016451/can-powershell-run-commands-in-parallel
-
https://stackoverflow.com/questions/8781666/run-n-parallel-jobs-in-powershell
-
about_foreach_parallel
I am going to show an approach using Jobs. What this will do is group up identified files. Each group will be past to its own Excel job for processing. Some comments in the code below explain it more.
# Root directory containing your files.
$path = "E:\temp\csv"
# Get Current EXCEL Process ID's so they are not affected but the scripts cleanup
# SilentlyContinue in case there are no active Excels
$currentExcelProcessIDs = (Get-Process excel -ErrorAction SilentlyContinue).Id
# Collect the files
$files = Get-ChildItem -Path $path -Filter "*.txt" -File
# Split the files up into processing groups. For Each group and excel process will be started.
$numberOfGroups = 5
$maxGroupMemberSize = [math]::Ceiling($files.Count / $numberOfGroups)
# Create as many file groups
$fileGroups = 0..($numberOfGroups - 1) | Foreach-object{
$groupIndexStart = $maxGroupMemberSize * $_
# Use the unary comma operator to be sure an array is returned and not unrolled
,$files[$groupIndexStart..($groupIndexStart + $maxGroupMemberSize - 1)]
}
# Create a job for each file group.
for($jobCount = 0; $jobCount -lt $fileGroups.Count; $jobCount++){
# Start a unique Excel instance for this group of files.
Start-Job -Name "Excel$jobCount" -ScriptBlock {
param($files)
$excelFileFormat = 56 #xlExcel8 format
# Create a new Excel Instance
$excel = New-Object -ComObject Excel.Application
$excel.Visible = $false
$excel.DisplayAlerts = $false
ForEach ($file in $files){
$workbook = $Excel.Workbooks.Open($file.Fullname)
$newFilepath = $file.Fullname -replace "\..*$",".xls"
$workbook.SaveAs($newFilepath, $excelFileFormat)
}
# Quit this instance and return its memory
$excel.Quit()
while([System.Runtime.Interopservices.Marshal]::ReleaseComObject($workbook)){}
while([System.Runtime.Interopservices.Marshal]::ReleaseComObject($excel)){}
Remove-Variable "workbook","excel"
} -ArgumentList (,($fileGroups[$jobCount])) | Out-Null
}
# Wait for the jobs to be completed and remove them from inventory since they won't have output we need
Get-Job -Name "Excel*" | Wait-Job | Receive-Job
# Remove any stale Excel processes created by this scripts execution
Get-Process excel -ErrorAction SilentlyContinue | Where-Object{$currentExcelProcessIDs -notcontains $_.id} | Stop-ProcessExcel automation is full of pitfalls but the general approach seems to be working. For one com objects can prevent from being closed. Best thing to do is just kill any remaining processes. There is logic that can help to be sure any Excel opened before the script was run are not affected. Jobs are done in the way as well to not affect processing of other scripts.
This all would be better as a script using proper parameters but showing the logic of jobs is the point here.
Automate without Excel
There are a number of libraries out there that can create Excel documents without the need for COM objects or Excel itself. These would perform faster as well. EPPlus comes to mind and even [
ImportExcel module] (which uses EPPlus) would be solid choices as well. I don't know what your text files look like so you will need to experiment.Get-Content E:\temp\csv\data.txt | Export-Excel -Path "e:\temp\csv\file.xlsx"Export-Excel works better with objects so, depending on your data structure, you would do this instead.Import-Csv E:\temp\csv\data.txt | Export-Excel -Path "e:\temp\csv\file.xlsx"Import-CSV is generally a slow process so you would want to try using streamreader and Convert-FromCSV if performance suffers.Magic Numbers
Your 56 would be an example of that. You have to look it up to know what it represents. So 56 is for Excel Version 8. At a minimum that should be its own variable. My code above uses
$excelFileFormat = 56 #xlExcel8 formatIf you wanted to go a little crazier you could actually import the assembly to access the enum
[reflection.assembly]::LoadWithPartialName("Microsoft.Office.InterOp.Excel") | Out-Null
[Microsoft.Office.Interop.Excel.XlFileFormat]::xlExcel8BaseName
$NewFilepath = $file.Fullname -replace ".{4}$"
$NewFilepath = $NewFilepath + ".xls"This is what you are doing to replace the extension of the file. File objects have a basename property that is the file name without the extension. I know that you are dealing with txt files, so you already kno
Code Snippets
# Root directory containing your files.
$path = "E:\temp\csv"
# Get Current EXCEL Process ID's so they are not affected but the scripts cleanup
# SilentlyContinue in case there are no active Excels
$currentExcelProcessIDs = (Get-Process excel -ErrorAction SilentlyContinue).Id
# Collect the files
$files = Get-ChildItem -Path $path -Filter "*.txt" -File
# Split the files up into processing groups. For Each group and excel process will be started.
$numberOfGroups = 5
$maxGroupMemberSize = [math]::Ceiling($files.Count / $numberOfGroups)
# Create as many file groups
$fileGroups = 0..($numberOfGroups - 1) | Foreach-object{
$groupIndexStart = $maxGroupMemberSize * $_
# Use the unary comma operator to be sure an array is returned and not unrolled
,$files[$groupIndexStart..($groupIndexStart + $maxGroupMemberSize - 1)]
}
# Create a job for each file group.
for($jobCount = 0; $jobCount -lt $fileGroups.Count; $jobCount++){
# Start a unique Excel instance for this group of files.
Start-Job -Name "Excel$jobCount" -ScriptBlock {
param($files)
$excelFileFormat = 56 #xlExcel8 format
# Create a new Excel Instance
$excel = New-Object -ComObject Excel.Application
$excel.Visible = $false
$excel.DisplayAlerts = $false
ForEach ($file in $files){
$workbook = $Excel.Workbooks.Open($file.Fullname)
$newFilepath = $file.Fullname -replace "\..*$",".xls"
$workbook.SaveAs($newFilepath, $excelFileFormat)
}
# Quit this instance and return its memory
$excel.Quit()
while([System.Runtime.Interopservices.Marshal]::ReleaseComObject($workbook)){}
while([System.Runtime.Interopservices.Marshal]::ReleaseComObject($excel)){}
Remove-Variable "workbook","excel"
} -ArgumentList (,($fileGroups[$jobCount])) | Out-Null
}
# Wait for the jobs to be completed and remove them from inventory since they won't have output we need
Get-Job -Name "Excel*" | Wait-Job | Receive-Job
# Remove any stale Excel processes created by this scripts execution
Get-Process excel -ErrorAction SilentlyContinue | Where-Object{$currentExcelProcessIDs -notcontains $_.id} | Stop-ProcessGet-Content E:\temp\csv\data.txt | Export-Excel -Path "e:\temp\csv\file.xlsx"Import-Csv E:\temp\csv\data.txt | Export-Excel -Path "e:\temp\csv\file.xlsx"$excelFileFormat = 56 #xlExcel8 format[reflection.assembly]::LoadWithPartialName("Microsoft.Office.InterOp.Excel") | Out-Null
[Microsoft.Office.Interop.Excel.XlFileFormat]::xlExcel8Context
StackExchange Code Review Q#152541, answer score: 4
Revisions (0)
No revisions yet.