Parsing large CSV in Perl
csv large parsing perl
Problem
I am getting Out of Memory errors when I try to parse a large CSV file (2.5 GB). My computer has 32 GB of memory, but Perl uses all of it up. The CSV has 2 columns: the first is the time in epoch and the second is a 10000+ line XML file as a single line. There are around 13k rows. I then use XML::XPath to retrieve the customer name and save the XML to [customername]-[time].xml. If there is an error, it is because the XML is invalid and I skip it.
Is there any way to make my code run more efficiently and look better?
#!/usr/bin/perl
use strict;
use warnings;
use XML::XPath;
use XML::XPath::XMLParser;
use File::Slurp;
my $file = '../FILENAME.csv';
open my $info, $file or die "Could not open $file: $!";
my $count = 0;
$| = 1;
while( my $line = <$info> ) {
next if ++$count == 1; #Ignore headers
my ($time, $report) = ($line =~ /(\d+),(.*)$/); # time, XML file
eval {
my $xp = XML::XPath->new(xml => $report);
our $ext = $xp->getNodeText('/report/customer') . "-" . $time . ".xml";
write_file($ext, $report);
};
if ( $@ ) {printf "ERROR ";}
else {printf "$count ";}
}
close $info;
Solution
I am suspicious of this line here:
my ($time, $report) = ($line =~ /(\d+),(.*)$/); # time, XML file
That line needs to scan and group the entire XML doc, but my instinct is that a 2-limited split (see note on LIMIT) would be more efficient:
my ($time, $report) = split /,/, $line, 2;
Apart from that, really, everything looks sane, and far better than a typical perl hack. I am impressed.
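To make the LIMIT behaviour concrete, here is a minimal sketch of the 2-limited split on a made-up row (the sample data and the `$row` variable are assumptions for illustration, not taken from the original file). One difference worth noting: the original regex's `(.*)$` stops before the trailing newline, whereas split keeps it, so the sketch chomps the line first.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample row: epoch time, then single-line XML that itself contains a comma.
my $line = "1425168000,<report><customer>Acme, Inc.</customer></report>\n";

# split does not strip the trailing newline the way (.*)$ did, so remove it first.
chomp(my $row = $line);

# LIMIT of 2: split at the first comma only; later commas stay inside $report.
my ($time, $report) = split /,/, $row, 2;

print "$time\n";
print "$report\n";
```

With a LIMIT of 2, any commas inside the XML are left alone, so the second field comes back whole.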
About perl using all your memory... are you sure? perl may just be struggling to allocate more than it is allowed. Do you have a ulimit condition that is constraining memory allocation? Are there problems with heavily fragmented memory? (You're not running 32-bit, are you... just checking?)
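A quick way to answer the 32-bit and ulimit questions might look like the sketch below. It relies on the core Config module's `ptrsize` and `archname` entries. Shelling out to `sh -c 'ulimit -v'` is an assumption that only holds on a Unix-like system whose shell supports the `-v` option.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Config;

# Pointer size in bytes as this perl was built: 4 means 32-bit, 8 means 64-bit.
my $bits = $Config{ptrsize} * 8;
print "This perl is $bits-bit ($Config{archname})\n";

# Assumption: Unix-like system; 'ulimit -v' prints the virtual memory limit
# in kilobytes, or "unlimited". Fall back to "unknown" if the call fails.
my $limit = `sh -c 'ulimit -v' 2>/dev/null`;
$limit = 'unknown' unless defined $limit && length $limit;
chomp $limit;
print "Virtual memory limit: $limit\n";
```

A 32-bit perl cannot address more than about 4 GB no matter how much RAM the machine has, and a finite ulimit would cap the process well before 32 GB is reached.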
As it happens, for a task like this, you may find that everything is better with a language other than Perl. Not to dismiss Perl as incapable, but with the XPath expressions and other items your code looks surprisingly complicated. 2.5 GB files are well beyond what I would normally consider for Perl performance.
Code Snippets
my ($time, $report) = ($line =~ /(\d+),(.*)$/); # time, XML file
my ($time, $report) = split /,/, $line, 2;
Context
StackExchange Code Review Q#90441, answer score: 2