patternMinor
Parsing psql output into multiple files
Viewed 0 times
intooutputpsqlfilesparsingmultiple
Problem
I wrote the following simple Perl script to read lines from
This does the trick, but there are 2 things I'm not thrilled about
-
I'd like to be able to do the
-
I don't like that I have to save
Anyhow, any suggestions on making this a bit more slick?
stdin that are the output from a psql call that returns lines of the form key1 | key2 | long text field and create a separate output file for each line, whose name is key1_key2_NNN.txt, where NNN is just a counter to ensure unique file names.my $count = 0;
while (<>) {
if (/(.*)\|(.*)\|(.*)/) {
$count++;
my $outname = "$1_$2_$count.txt";
my $text = $3;
$outname =~ s/\s+//g;
my $outfile = new IO::File("> $outname");
$outfile->print("$text");
}
}This does the trick, but there are 2 things I'm not thrilled about
-
I'd like to be able to do the
$count++ inline while putting the incremented number into the string, but I'm not sure how to not just get 0++ plugged in (i.e. how to make it know that the ++ is a command and not part of the literal string).-
I don't like that I have to save
$3 to another variable, but if I don't, it gets cleared out before I need it.Anyhow, any suggestions on making this a bit more slick?
Solution
I don't know if this is the complete script, but the absence of
Your usage of the regex
-
Don't use a regex to extract the data. Instead:
-
Some guides recommend putting special characters into a character class
-
Immediately after matching, assign the values of the captures to permanent variables. Otherwise, they could be modified by a subroutine you call – the next match can clear those values. Note also that ways exist to booby trap such an innocent-looking operation like incrementing an integer to clear the last match.
So we either write code such as:
or we don't use the capture variables at all, and use the regex in list context instead:
Technically, this isn't absolutely equivalent (e.g. in case you aren't using any captures, or if some captures are conditional, or …), but it's far better style.
If course, you'd assign to a list of names rather than an array:
The substitution
If you don't want to remove all space in those strings (remember that most filesystems can deal with spaces in filenames all right) but only at the start or beginning of the string, you might want to
The expression
As a fix, use curly braces to delimit the variable names:
The line
-
Don't use the “indirect object notation”
If you need help to get rid of that habit, you may enjoy the
-
What you're doing is basically
or in the object-oriented wrapper:
The next problem here is that you aren't doing any error handling! Check the return value of the constructor to assert that you could actually open the file. Well, calling a method on an undefined value will
The
Then, you are using
Another aspect that just came to my mind is that your input file could contain slashes, allowing a filename like
The script now looks like:
```
use strict;
use warnings;
use autodie; # easier than explicit, manual error handling
use feature qw/say/; # say is like print, but appends newline
my $count = 0;
while (my $line = <>) {
chomp $line;
my ($name_a, $name_b, $text) = split /[|]/, $line, 3
or next;
for ($name_a, $name_b) {
s/\s+//g;
die qq(The first two columns in "$line" may not contain slashes) if m[/];
}
my $filename = sprintf '%s_%s_%d.txt', $name_a, $name_b,
use strict; use warnings is pretty noticeable. Always use these as a basic safety net, no matter how small or simple your program.Your usage of the regex
/(.)\|(.)\|(.*)/ can be improved in some ways:-
Don't use a regex to extract the data. Instead:
my @cols = split /[|]/, $_, 3;. This really is the correct way to go, the following points however deal with normal regex usage.-
Some guides recommend putting special characters into a character class
[|] instead of escaping them \|. It's my experience that this makes complicated regexes easier to read.-
Immediately after matching, assign the values of the captures to permanent variables. Otherwise, they could be modified by a subroutine you call – the next match can clear those values. Note also that ways exist to booby trap such an innocent-looking operation like incrementing an integer to clear the last match.
So we either write code such as:
if (/(.*)[|](.*)[|](.*)/) {
my @fields = ($1, $2, $3);
...
}or we don't use the capture variables at all, and use the regex in list context instead:
if (my @fields = /(.*)[|](.*)[|](.*)/) { ... }Technically, this isn't absolutely equivalent (e.g. in case you aren't using any captures, or if some captures are conditional, or …), but it's far better style.
If course, you'd assign to a list of names rather than an array:
my ($name_a, $name_b, $text) = ...The substitution
s/\s+//g might be considered slightly silly, as it's equivalent to s/\s//g. The former variant has the advantage of less substitution operations, so it might actually be preferable. But why are you sanitizing the $outname rather than the parts of which it's made up? Note that you can apply a substitution to multiple variables likes/\s+//g for $x, $y;If you don't want to remove all space in those strings (remember that most filesystems can deal with spaces in filenames all right) but only at the start or beginning of the string, you might want to
split with a different separator instead:split /\s*[|]\s*/, ...The expression
"$1_$2_$count.txt" does not contain a bug, but this is accidental. For example, "$foo_$bar" is equivalent to $foo_ . $bar – underscores can be part of variable names, but variables can't start with numbers, so the problem isn't visible here (variable names can consist solely of numbers, and all such variables are reserved for capture groups).As a fix, use curly braces to delimit the variable names:
"${1}_${2}_$count.txt", or use sprintf:sprintf '%s_%s_%d.txt', $1, $2, $countThe line
my $outfile = new IO::File("> $outname"); is a big no-no for two reasons:-
Don't use the “indirect object notation”
method $object @arglist. Instead use $object->method(@arglist). The former variant may look nicer, but is often ambiguous, has rather confusing precedence, and widely considered to be a syntactic mistake.If you need help to get rid of that habit, you may enjoy the
no indirect pragma.-
What you're doing is basically
open my $outfile, "> $outname". Use the three-argument form of open:open my $outfile, ">", $outnameor in the object-oriented wrapper:
IO::File->new($outname, '>').The next problem here is that you aren't doing any error handling! Check the return value of the constructor to assert that you could actually open the file. Well, calling a method on an undefined value will
die, but checking manually allows you to output a sensible error message.The
$outfile->print("$text") would usually be written print { $outfile } $text. For one thing, the object-oriented interface to file handles is used very rarely. You can use the normal interface directly. Also, you are unnecessarily stringifying the $text. If it isn't already a string, that will be managed by print. Note that explicit stringification creates a copy of that value, something which you would usually (but not always) want to avoid.Then, you are using
print, not say. This means (together with the fact that in a regex, . does not match newlines), that you won't end your output to the file with a newline. This is probably an oversight.Another aspect that just came to my mind is that your input file could contain slashes, allowing a filename like
/you/were/pwned/_key2_1234.txt to be created (assuming the path already exists). Let's put in a bit of validation.The script now looks like:
```
use strict;
use warnings;
use autodie; # easier than explicit, manual error handling
use feature qw/say/; # say is like print, but appends newline
my $count = 0;
while (my $line = <>) {
chomp $line;
my ($name_a, $name_b, $text) = split /[|]/, $line, 3
or next;
for ($name_a, $name_b) {
s/\s+//g;
die qq(The first two columns in "$line" may not contain slashes) if m[/];
}
my $filename = sprintf '%s_%s_%d.txt', $name_a, $name_b,
Code Snippets
if (/(.*)[|](.*)[|](.*)/) {
my @fields = ($1, $2, $3);
...
}if (my @fields = /(.*)[|](.*)[|](.*)/) { ... }my ($name_a, $name_b, $text) = ...s/\s+//g for $x, $y;split /\s*[|]\s*/, ...Context
StackExchange Code Review Q#42030, answer score: 7
Revisions (0)
No revisions yet.