HiveBrain v1.2.0
Get Started
← Back to all entries
patternshellMinor

PowerShell script to read line by line large CSV files

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
scriptlinereadpowershellcsvlargefiles

Problem

I am managing large CSV files (files ranging from 750 Mb to 10+ Gb), parsing their data into PSObjects, then processing each of those objects based on what is required.

I wrote the following script to churn through these files line by line, filter based on one of the data fields, then close the file. The script works but I feel that it could be faster. For instance, it took 4.5 hours to parse a 389k line csv file. Taking that filesize and timeline, it would take over two and a half days to work through just the sorting and filtering of the data!

Before anyone suggests "use a database!", I'm 100% with you and have started the RFC for adding a database server to our network. Unfortunately our CAB only meets quarterly and this wasn't deemed an emergency. So, I'm left without a database solution for at least 2-3 months.

Anyhow, here is the code:

```
[void][reflection.assembly]::LoadWithPartialName("Microsoft.VisualBasic")

$source = Get-FileName "C:\users\$env.username\Downloads"

$reader = New-Object Microsoft.VisualBasic.FileIO.TextFieldParser $source

$reader.SetDelimiters(",")

While(!$reader.EndOfData)
{
$line = $reader.ReadFields()

$details = [ordered]@{
"Plugin ID" = $line[0]
CVSS = $line[2]
Risk = $line[3]
Host = $line[4]
Protocol = $line[5]
Port = $line[6]
Name = $line[7]
Description = $line[9]
Solution = $line[10]
"Plugin Output" = $line[12]
}

$pluginID = $line[0]
$risk = $line[3]

if ($risk -eq "Critical" -or $risk -eq "High" -or $risk -eq "Medium" -or $risk -eq "Low")
{
$allVulns += New-Object PSObject -Property $details
}
else
{
# Filters data into objects based on their plugin ID

Solution

It might be a little hard to make a suggestion about this as I cannot see how the data is being used beyond this. However I can point out the large performance flaw in your logic.
Building arrays with +=

You have multiple lines of code that go like this:

$20811 += New-Object PSObject -Property $details


What is actually happening there is PowerShell is destroying the array $20811 and creating a new one that is one element larger to house the data on the right hand side. For small operations this performance hit is negligible. However you will certainly feel it when you get into the thousands of elements. You are rebuilding new arrays with every operation.
Consider array lists

I would instead just create all of the custom objects in one pass into one large variable instead. Then you could use Where-Object to process the groups of rows as you see fit. Or you can still keep them as separate objects. Either way I would use an arraylist instead.

$11936 = New-Object System.Collections.ArrayList

# truncated ....
11936 # OS Identification
{
    [void]($11936.Add($details))
    break
}


The [void] cast suppresses the output created from the Add method

This should be significantly faster.
Read the CSV as a database

There are methods to read the CSV as a database as well. The one I tested was using the Microsoft.ACE.OLEDB.12.0 provider. This method allows you to use SQL statements against the CSV file. I don't really have the time to properly benchmark this but this should be faster than your current method as well. Consider a simple example using mock "employee" data:

@"
id,first_name,last_name,email,ip_address
1,Edward,Richards,erichards0@businessweek.com,201.133.112.30
2,Jimmy,Scott,jscott1@clickbank.net,103.231.149.144
3,Marilyn,Williams,mwilliams2@chicagotribune.com,52.180.157.43
4,Frank,Morales,fmorales3@google.ru,218.175.165.205
5,Chris,Watson,cwatson4@ed.gov,75.251.1.149
6,Albert,Ross,aross5@abc.net.au,89.56.133.54
7,Diane,Daniels,ddaniels6@washingtonpost.com,197.156.129.45
8,Nancy,Carter,ncarter7@surveymonkey.com,75.162.65.142
9,John,Kennedy,jkennedy8@tumblr.com,85.35.177.235
10,Bonnie,Bradley,bbradley9@dagondesign.com,255.67.106.193
"@ | Set-Content .\test.csv 

$conn = New-Object System.Data.OleDb.OleDbConnection("Provider=Microsoft.ACE.OLEDB.12.0;Data Source='C:\Users\Cameron';Extended Properties='Text;HDR=Yes;FMT=Delimited';")
$cmd=$conn.CreateCommand()
$cmd.CommandText="Select * from test.csv where first_name like '%n%'"
$conn.open()
$data = $cmd.ExecuteReader()

$data | ForEach-Object{
    [pscustomobject]@{
        id=$_.GetValue(0) 
        first_name=$_.GetValue(1) 
        last_name=$_.GetValue(2)
        ip_address=$_.GetValue(4)
    }
}

$cmd.Dispose()
$conn.Dispose()


I have a SQL statement in there that will return all the records where the first name has a "n" in it. Using the field indecies we build a custom psobject that gets sent down the pipe.

Note that the source in the connection string is the folder that contains the csv file. And the table in the SQL statement is the CSV file name.

Code Snippets

$20811 += New-Object PSObject -Property $details
$11936 = New-Object System.Collections.ArrayList

# truncated ....
11936 # OS Identification
{
    [void]($11936.Add($details))
    break
}
@"
id,first_name,last_name,email,ip_address
1,Edward,Richards,erichards0@businessweek.com,201.133.112.30
2,Jimmy,Scott,jscott1@clickbank.net,103.231.149.144
3,Marilyn,Williams,mwilliams2@chicagotribune.com,52.180.157.43
4,Frank,Morales,fmorales3@google.ru,218.175.165.205
5,Chris,Watson,cwatson4@ed.gov,75.251.1.149
6,Albert,Ross,aross5@abc.net.au,89.56.133.54
7,Diane,Daniels,ddaniels6@washingtonpost.com,197.156.129.45
8,Nancy,Carter,ncarter7@surveymonkey.com,75.162.65.142
9,John,Kennedy,jkennedy8@tumblr.com,85.35.177.235
10,Bonnie,Bradley,bbradley9@dagondesign.com,255.67.106.193
"@ | Set-Content .\test.csv 

$conn = New-Object System.Data.OleDb.OleDbConnection("Provider=Microsoft.ACE.OLEDB.12.0;Data Source='C:\Users\Cameron';Extended Properties='Text;HDR=Yes;FMT=Delimited';")
$cmd=$conn.CreateCommand()
$cmd.CommandText="Select * from test.csv where first_name like '%n%'"
$conn.open()
$data = $cmd.ExecuteReader()

$data | ForEach-Object{
    [pscustomobject]@{
        id=$_.GetValue(0) 
        first_name=$_.GetValue(1) 
        last_name=$_.GetValue(2)
        ip_address=$_.GetValue(4)
    }
}

$cmd.Dispose()
$conn.Dispose()

Context

StackExchange Code Review Q#140892, answer score: 3

Revisions (0)

No revisions yet.