
Parsing large CSV in Perl


Problem

I am getting Out of Memory errors when I try to parse a large CSV file (2.5 GB). My computer has 32 GB of memory, but Perl uses all of it. The CSV has two columns: the first is the time as a Unix epoch, and the second is an XML document of 10,000+ lines flattened onto a single line. There are around 13k rows. I use XML::XPath to retrieve the customer name and save the XML to [customername]-[time].xml. If there is an error, it is because the XML is invalid, and I skip that row.

Is there any way to make my code run more efficiently and look better?

#!/usr/bin/perl

use strict;
use warnings;
use XML::XPath;
use XML::XPath::XMLParser;
use File::Slurp;

my $file = '../FILENAME.csv';
open my $info, $file or die "Could not open $file: $!";
my $count = 0;
$| = 1;

while( my $line = <$info> )  {
    next if ++$count == 1; #Ignore headers
    my ($time, $report) = ($line =~ /(\d+),(.*)$/); # time, XML file
    eval {
        my $xp = XML::XPath->new(xml => $report);
        our $ext = $xp->getNodeText('/report/customer') . "-" . $time . ".xml";
        write_file($ext, $report);
    };
    if ( $@ ) {printf "ERROR ";}
    else {printf "$count ";}
}

close $info;

Solution

I am suspicious of this line here:

my ($time, $report) = ($line =~ /(\d+),(.*)$/); # time, XML file


That regex has to scan and capture the entire XML document. My instinct is that a split with a LIMIT of 2 (see the note on LIMIT in the split documentation) would be more efficient:

my ($time, $report) = split /,/, $line, 2;
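
As a minimal sketch of the split approach (the sample row below is made up for illustration, not taken from the real file): with a LIMIT of 2, split stops at the first comma, so commas inside the XML stay in $report. One difference from the regex is worth noting: there, . does not match the trailing newline, whereas split keeps it, so the sketch chomps the line first.

```perl
use strict;
use warnings;

# Hypothetical sample row: epoch time, then single-line XML that itself contains a comma.
my $line = qq{1431388800,<report><customer>Acme, Inc.</customer></report>\n};

# Remove the trailing newline; split would otherwise leave it on $report.
chomp $line;

# LIMIT of 2: split at the first comma only, leaving the rest of the line intact.
my ($time, $report) = split /,/, $line, 2;

print "time:   $time\n";
print "report: $report\n";
```

Whether this is measurably faster than the regex on 2.5 GB of input is something to benchmark rather than assume.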


Apart from that, everything looks sane, and far better than a typical Perl hack. I am impressed.

About Perl using all your memory: are you sure? Perl may simply be failing to allocate more than it is allowed. Do you have a ulimit setting that constrains memory allocation? Are there problems with heavily fragmented memory? (You're not running a 32-bit perl, are you? Just checking.)
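
One way to rule out the 32-bit question is to ask perl itself through the core Config module. This is only a sketch: it reports how the interpreter was built, and says nothing about ulimit settings, which are checked from the shell.

```perl
use strict;
use warnings;
use Config;   # core module exposing the interpreter's build configuration

# ptrsize is the pointer size in bytes: 8 on a 64-bit perl, 4 on a 32-bit one.
my $bits = $Config{ptrsize} * 8;

print "perl $Config{version} on $Config{archname}: ${bits}-bit pointers\n";
```

A 32-bit build cannot use more than a few gigabytes of address space per process, no matter how much RAM the machine has.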

As it happens, for a task like this you may find that a language other than Perl serves you better. This is not to dismiss Perl as incapable, but with the XPath expressions and other pieces, your code looks surprisingly complicated. Files of 2.5 GB are well beyond what I would normally consider reasonable for Perl performance.

Context

StackExchange Code Review Q#90441, answer score: 2
