patternMinor

Parsing psql output into multiple files

Submitted by: @import:stackexchange-codereview·Mar 10, 2026·

Viewed 0 times

intooutputpsqlfilesparsingmultiple

Problem

I wrote the following simple Perl script to read lines from stdin that are the output from a psql call that returns lines of the form key1 | key2 | long text field and create a separate output file for each line, whose name is key1_key2_NNN.txt, where NNN is just a counter to ensure unique file names.

my $count = 0;

while (<>) {
    if (/(.*)\|(.*)\|(.*)/) {
        $count++;
        my $outname = "$1_$2_$count.txt";
        my $text = $3;
        $outname =~ s/\s+//g;
        my $outfile = new IO::File("> $outname");
        $outfile->print("$text");
    }
}

This does the trick, but there are 2 things I'm not thrilled about

-
I'd like to be able to do the $count++ inline while putting the incremented number into the string, but I'm not sure how to not just get 0++ plugged in (i.e. how to make it know that the ++ is a command and not part of the literal string).

-
I don't like that I have to save $3 to another variable, but if I don't, it gets cleared out before I need it.

Anyhow, any suggestions on making this a bit more slick?

Solution

I don't know if this is the complete script, but the absence of use strict; use warnings is pretty noticeable. Always use these as a basic safety net, no matter how small or simple your program.

Your usage of the regex /(.)\|(.)\|(.*)/ can be improved in some ways:

-
Don't use a regex to extract the data. Instead: my @cols = split /[|]/, $_, 3;. This really is the correct way to go, the following points however deal with normal regex usage.

-
Some guides recommend putting special characters into a character class [|] instead of escaping them \|. It's my experience that this makes complicated regexes easier to read.

-
Immediately after matching, assign the values of the captures to permanent variables. Otherwise, they could be modified by a subroutine you call – the next match can clear those values. Note also that ways exist to booby trap such an innocent-looking operation like incrementing an integer to clear the last match.

So we either write code such as:

if (/(.*)[|](.*)[|](.*)/) {
  my @fields = ($1, $2, $3);
  ...
}

or we don't use the capture variables at all, and use the regex in list context instead:

if (my @fields = /(.*)[|](.*)[|](.*)/) { ... }

Technically, this isn't absolutely equivalent (e.g. in case you aren't using any captures, or if some captures are conditional, or …), but it's far better style.

If course, you'd assign to a list of names rather than an array:

my ($name_a, $name_b, $text) = ...

The substitution s/\s+//g might be considered slightly silly, as it's equivalent to s/\s//g. The former variant has the advantage of less substitution operations, so it might actually be preferable. But why are you sanitizing the $outname rather than the parts of which it's made up? Note that you can apply a substitution to multiple variables like

s/\s+//g for $x, $y;

If you don't want to remove all space in those strings (remember that most filesystems can deal with spaces in filenames all right) but only at the start or beginning of the string, you might want to split with a different separator instead:

split /\s*[|]\s*/, ...

The expression "$1_$2_$count.txt" does not contain a bug, but this is accidental. For example, "$foo_$bar" is equivalent to $foo_ . $bar – underscores can be part of variable names, but variables can't start with numbers, so the problem isn't visible here (variable names can consist solely of numbers, and all such variables are reserved for capture groups).

As a fix, use curly braces to delimit the variable names: "${1}_${2}_$count.txt", or use sprintf:

sprintf '%s_%s_%d.txt', $1, $2, $count

The line my $outfile = new IO::File("> $outname"); is a big no-no for two reasons:

-
Don't use the “indirect object notation” method $object @arglist. Instead use $object->method(@arglist). The former variant may look nicer, but is often ambiguous, has rather confusing precedence, and widely considered to be a syntactic mistake.

If you need help to get rid of that habit, you may enjoy the no indirect pragma.

-
What you're doing is basically open my $outfile, "> $outname". Use the three-argument form of open:

open my $outfile, ">", $outname

or in the object-oriented wrapper: IO::File->new($outname, '>').

The next problem here is that you aren't doing any error handling! Check the return value of the constructor to assert that you could actually open the file. Well, calling a method on an undefined value will die, but checking manually allows you to output a sensible error message.

The $outfile->print("$text") would usually be written print { $outfile } $text. For one thing, the object-oriented interface to file handles is used very rarely. You can use the normal interface directly. Also, you are unnecessarily stringifying the $text. If it isn't already a string, that will be managed by print. Note that explicit stringification creates a copy of that value, something which you would usually (but not always) want to avoid.

Then, you are using print, not say. This means (together with the fact that in a regex, . does not match newlines), that you won't end your output to the file with a newline. This is probably an oversight.

Another aspect that just came to my mind is that your input file could contain slashes, allowing a filename like /you/were/pwned/_key2_1234.txt to be created (assuming the path already exists). Let's put in a bit of validation.

The script now looks like:

```
use strict;
use warnings;
use autodie; # easier than explicit, manual error handling
use feature qw/say/; # say is like print, but appends newline

my $count = 0;
while (my $line = <>) {
chomp $line;
my ($name_a, $name_b, $text) = split /[|]/, $line, 3
or next;
for ($name_a, $name_b) {
s/\s+//g;
die qq(The first two columns in "$line" may not contain slashes) if m[/];
}
my $filename = sprintf '%s_%s_%d.txt', $name_a, $name_b,

Code Snippets

if (/(.*)[|](.*)[|](.*)/) {
  my @fields = ($1, $2, $3);
  ...
}

if (my @fields = /(.*)[|](.*)[|](.*)/) { ... }

my ($name_a, $name_b, $text) = ...

s/\s+//g for $x, $y;

split /\s*[|]\s*/, ...

Context

StackExchange Code Review Q#42030, answer score: 7

Revisions (0)

No revisions yet.