
Locating matching files with input folder and file prefix

Submitted by: @import:stackexchange-codereview··

Problem

This is my first more-than-1-line script. It takes an input folder and a file prefix and gets all the matching files. For the first of the files, the script grabs the first line, appends an extra label string, and puts that into $data. For all files (including the first), it takes the second line, adds the timestamp found in the filename, and appends that as a new line to $data. When it's done, it writes the output.

The problem is that it's pretty slow. One directory might have 10,000 files, in which case the script takes about 3-4 minutes to complete. Doing this over 10 such directories gets quite annoying.

So, I'm hoping someone here could help speed things up, preferably with explanations so I also get better.

for cs in {47..52}; do
  csdirnm="CASE$cs";
  tsdirnm="eta_at_x2";
  label='flow-time';
  dir="$csdirnm/$tsdirnm/";
  files=$(ls -tr $dir); # -tr sort on time created, reversed (newest last).
  i=0;
  for f in $files ; do
    fn=$dir$f;
    if [[ $i = 0 ]]; then
      data="$(sed -n 1p $fn) $label";
      ((i++));
    fi
    d=$(sed -n 2p $fn);
    t=$(echo $fn | perl -pe 's|.*?(\d+\.\d+)|\1|');
    data+="\n$d $t";
  done;
  echo -e "$data">"$csdirnm/${csdirnm}_${tsdirnm}_20T.time-series";
done;

Solution

The problem is almost certainly this line here:

t=$(echo $fn | perl -pe 's|.*?(\d+\.\d+)|\1|');


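If you are starting the Perl interpreter once for each file, 10,000 times per directory, you will struggle.

A close second is that, for each file, you invoke two subshells and two other programs (sed and echo). echo is a bash builtin, but as part of a pipeline it still costs a forked process.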
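
To illustrate the point about sed, here is a minimal sketch of reading the first two lines of a file with the read builtin instead, so no extra process is started per file. The helper name first_two_lines and the variables line1 and line2 are made up for this example; they are not from the original script.

```shell
#!/usr/bin/env bash
# Sketch: fetch lines 1 and 2 of a file with the read builtin, so that
# no sed process is started. first_two_lines is a hypothetical helper.
first_two_lines() {
  local file=$1
  # Results are handed back in the global variables line1 and line2.
  line1='' line2=''
  # One redirection feeds both reads; each read consumes one line.
  { IFS= read -r line1; IFS= read -r line2; } < "$file"
}
```

In the original loop this would stand in for both sed -n 1p and sed -n 2p, with $line1 and $line2 used where the sed output was captured.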

My recommendation is to rewrite the whole thing in Perl, but as a smaller change you may find it faster to use sed for the extraction:

t=$(echo $fn | sed -e '$s@.*[^0-9]\([0-9]\+\.[0-9]\+\).*@\1@g')


The above expression takes advantage of the fact that there has to be some non-digit character before the first digit of the timestamp. Note that the \+ quantifier is a GNU sed extension.
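
If you stay in bash, the same extraction can probably be done with no external program at all, using the =~ operator and BASH_REMATCH. This is only a sketch: the function name timestamp_of and the sample path are assumptions, and it picks the first decimal number in the path, which may differ from the sed expression if a name contains more than one.

```shell
#!/usr/bin/env bash
# Sketch: extract the first decimal number from a path with bash's =~
# operator. timestamp_of and the sample path below are hypothetical.
timestamp_of() {
  # Keeping the pattern in a variable avoids quoting surprises with =~.
  local re='([0-9]+\.[0-9]+)'
  [[ $1 =~ $re ]] && printf '%s\n' "${BASH_REMATCH[1]}"
}

timestamp_of "CASE47/eta_at_x2/eta-12.345"   # prints 12.345
```

The function returns a non-zero status when the name contains no decimal number, so the caller can decide whether to skip the file.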

But, as I say, it is my suggestion that you do the whole thing in Perl, and avoid having to do all the execs.

EDIT: In Perl

#!/usr/bin/perl -w
#
#for cs in {47..52}; do
#  csdirnm="CASE$cs";
#  tsdirnm="eta_at_x2";
#  label='flow-time';
#  dir="$csdirnm/$tsdirnm/";
#  files=$(ls -tr $dir); # -tr sort on time created, reversed (newest last).
#  i=0;
#  for f in $files ; do
#    fn=$dir$f;
#    if [[ $i = 0 ]]; then
#      data="$(sed -n 1p $fn) $label";
#      ((i++));
#    fi
#    d=$(sed -n 2p $fn);
#    t=$(echo $fn | perl -pe 's|.*?(\d+\.\d+)|\1|');
#    data+="\n$d $t";
#  done;
#  echo -e "$data">"$csdirnm/${csdirnm}_${tsdirnm}_20T.time-series";
#done;

use strict;

foreach my $dirnum (47 .. 52) {
  my $csdirnm = "CASE$dirnum";
  print "Processing Dir $csdirnm\n";
  my $tsdirnm = "eta_at_x2";
  # Use ls to get the file ordering right.
  my $subdir = "$csdirnm/$tsdirnm";
  next unless -d $subdir; # Skip non-existent dirs
  open DIRLIST, "-|", "ls -tr $subdir" or die "Unable to list files in $subdir";
  my $outfile = "${csdirnm}/${csdirnm}_${tsdirnm}_20T.time-series";
  open REPORTFILE, ">", $outfile or die "Unable to write to report file $outfile";

  my $file;
  my $cnt = 0;
  while ($file = <DIRLIST>) {
    chomp $file;
    print "Processing $file\n";
    my $datafile = "$subdir/$file";
    open DATA, "<", $datafile or die "Unable to read file $datafile";
    my $line1 = <DATA>;
    my $line2 = <DATA>;
    close DATA;
    print REPORTFILE $line1 unless $cnt;
    print REPORTFILE $line2;
    $cnt++;
  }

  close DIRLIST;
  close REPORTFILE;
}
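
Note that the Perl script above passes the two lines through unchanged; it does not append the label or the timestamp that the original loop adds. For comparison, here is a rough bash sketch that keeps those two steps and still avoids per-file processes, at the cost of one ls -tr per directory. The function name build_series, its three arguments, and the assumption that each file name contains a decimal timestamp are mine, not the asker's.

```shell
#!/usr/bin/env bash
# Sketch: build one time-series file from a directory using builtins,
# apart from a single `ls -tr` per directory for oldest-first ordering.
# build_series and its arguments (dir, label, out) are hypothetical.
build_series() {
  local dir=$1 label=$2 out=$3
  local f header row first=1
  local ts_re='([0-9]+\.[0-9]+)'
  while IFS= read -r f; do
    # Read the first two lines of the data file without spawning sed.
    { IFS= read -r header; IFS= read -r row; } < "$dir/$f"
    if (( first )); then
      # Header line from the first file, with the label appended.
      printf '%s %s\n' "$header" "$label"
      first=0
    fi
    # Skip files whose name carries no decimal timestamp (an assumption).
    [[ $f =~ $ts_re ]] || continue
    printf '%s %s\n' "$row" "${BASH_REMATCH[1]}"
  done < <(ls -tr -- "$dir") > "$out"
}
```

A call such as build_series "CASE47/eta_at_x2" flow-time "CASE47/CASE47_eta_at_x2_20T.time-series" would mirror one pass of the original outer loop. Like the original, it assumes file names without embedded newlines, since it still reads the output of ls.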

Context

StackExchange Code Review Q#41508, answer score: 7
