Locating matching files with input folder and file prefix
Problem
This is my first more-than-1-line script. It takes an input folder and a file prefix and gets all the matching files. For the first of the files, the script grabs the first line, appends an extra label string, and puts that into $data. For all files (including the first), it takes the second line, adds the timestamp found in the filename, and appends that as a new line to $data. When it's done, it writes the output.
The problem is it's pretty slow. One directory might have 10,000 files. In that case, the script takes about 3-4 minutes to complete. When doing this over 10 such directories, it begins to get quite annoying.
So, I'm hoping someone here could help speed things up, preferably with explanations so I also get better.
for cs in {47..52}; do
    csdirnm="CASE$cs";
    tsdirnm="eta_at_x2";
    label='flow-time';
    dir="$csdirnm/$tsdirnm/";
    files=$(ls -tr $dir); # -tr sort on time created, reversed (newest last).
    i=0;
    for f in $files ; do
        fn=$dir$f;
        if [[ $i = 0 ]]; then
            data="$(sed -n 1p $fn) $label";
            ((i++));
        fi
        d=$(sed -n 2p $fn);
        t=$(echo $fn | perl -pe 's|.*?(\d+\.\d+)|\1|');
        data+="\n$d $t";
    done;
    echo -e "$data">"$csdirnm/${csdirnm}_${tsdirnm}_20T.time-series";
done;
Solution
The problem is almost certainly this line here:
t=$(echo $fn | perl -pe 's|.*?(\d+\.\d+)|\1|');
If you are invoking the Perl interpreter once for every file, you will struggle.
A close second is that, for each file, you invoke 2 subshells and two other programs (sed and echo).
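The per-file perl call can also be avoided by letting bash do the match itself. The following is a minimal sketch, assuming bash 3.2 or later; the helper name extract_timestamp and the sample file name are hypothetical, since the question does not show the real naming scheme.

```shell
#!/usr/bin/env bash
# Extract the first "digits.digits" run from a path using only bash builtins.
# Prints the match, or returns 1 if the path contains none.
extract_timestamp() {
    local path=$1
    # Keeping the regex in a variable avoids quoting surprises inside [[ ]].
    local re='[0-9]+\.[0-9]+'
    if [[ $path =~ $re ]]; then
        # BASH_REMATCH[0] holds the text matched by the whole expression.
        printf '%s\n' "${BASH_REMATCH[0]}"
    else
        return 1
    fi
}

# Hypothetical file name, for illustration only.
extract_timestamp "CASE47/eta_at_x2/eta_at_x2-12.5000"   # prints 12.5000
```

Because the match runs inside the shell, no subshell or external process is started for the timestamp. In the loop, the result could be stored with t=${BASH_REMATCH[0]} directly instead of calling the helper through a command substitution.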
My recommendation is for you to actually rewrite the whole thing in Perl...
....
but, you may find it faster to use
t=$(echo $fn | sed -e '$s@.*[^0-9]\([0-9]\+\.[0-9]\+\).*@\1@g')
The above expression takes advantage of the fact that there has to be some non-digit character before the first date digit.
But, as I say, it is my suggestion that you do the whole thing in Perl, and avoid having to do all the execs.
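If staying in bash is preferable, the same idea (no exec per file) can be sketched with builtins only. This is an illustration rather than a tested replacement for the original script: the function name build_time_series and its three arguments are hypothetical, and it assumes file names contain no newlines and that the timestamp is the first "digits.digits" run in the file name.

```shell
#!/usr/bin/env bash
# Build one time-series file from a directory of data files without
# starting an external program for every file.
build_time_series() {
    local dir=$1 label=$2 outfile=$3
    local re='[0-9]+\.[0-9]+'
    local f line1 line2 ts first=1

    # A single ls per directory keeps the oldest-first ordering of the original.
    ls -tr -- "$dir" | while IFS= read -r f; do
        line1='' line2=''
        # The read builtin fetches the first two lines; no sed is needed.
        # A file shorter than two lines simply leaves line2 empty.
        { IFS= read -r line1; IFS= read -r line2; } < "$dir/$f" || true

        if (( first )); then
            # Header: first line of the first file plus the label.
            printf '%s %s\n' "$line1" "$label"
            first=0
        fi

        # Match on the bare file name so digits in the directory path are ignored.
        ts=''
        [[ $f =~ $re ]] && ts=${BASH_REMATCH[0]}
        printf '%s %s\n' "$line2" "$ts"
    done > "$outfile"   # one redirection for the whole loop
}
```

A call such as build_time_series "CASE47/eta_at_x2" flow-time "CASE47/CASE47_eta_at_x2_20T.time-series" would mirror one iteration of the outer loop in the question. The only external process left is the one ls per directory, and the output is streamed to the file instead of being accumulated in a shell variable.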
EDIT: In Perl
#!/usr/bin/perl -w
#
#for cs in {47..52}; do
# csdirnm="CASE$cs";
# tsdirnm="eta_at_x2";
# label='flow-time';
# dir="$csdirnm/$tsdirnm/";
# files=$(ls -tr $dir); # -tr sort on time created, reversed (newest last).
# i=0;
# for f in $files ; do
# fn=$dir$f;
# if [[ $i = 0 ]]; then
# data="$(sed -n 1p $fn) $label";
# ((i++));
# fi
# d=$(sed -n 2p $fn);
# t=$(echo $fn | perl -pe 's|.*?(\d+\.\d+)|\1|');
# data+="\n$d $t";
# done;
# echo -e "$data">"$csdirnm/${csdirnm}_${tsdirnm}_20T.time-series";
#done;
use strict;
foreach my $dirnum (47 .. 52) {
    my $csdirnm = "CASE$dirnum";
    print "Processing Dir $csdirnm\n";
    my $tsdirnm = "eta_at_x2";
    # Use ls to get the file ordering right.
    my $subdir = "$csdirnm/$tsdirnm";
    next unless -d $subdir; # Skip non-existent dirs
    open DIRLIST, "-|", "ls -tr $subdir" or die "Unable to list files in $subdir";
    my $outfile = "${csdirnm}/${csdirnm}_${tsdirnm}_20T.time-series";
    open REPORTFILE, ">", $outfile or die "Unable to write to report file $outfile";
    my $file;
    my $cnt = 0;
    while ($file = <DIRLIST>) {
        chomp $file;
        print "Processing $file\n";
        my $datafile = "$subdir/$file";
        # Read only the first two lines of each data file.
        open DATA, "<", $datafile or die "Unable to read file $datafile";
        my $line1 = <DATA>;
        my $line2 = <DATA>;
        close DATA;
        print REPORTFILE $line1 unless $cnt;
        print REPORTFILE $line2;
        $cnt++;
    }
    close DIRLIST;
    close REPORTFILE;
}
Code Snippets
t=$(echo $fn | perl -pe 's|.*?(\d+\.\d+)|\1|');
t=$(echo $fn | sed -e '$s@.*[^0-9]\([0-9]\+\.[0-9]\+\).*@\1@g')
#!/usr/bin/perl -w
#
#for cs in {47..52}; do
# csdirnm="CASE$cs";
# tsdirnm="eta_at_x2";
# label='flow-time';
# dir="$csdirnm/$tsdirnm/";
# files=$(ls -tr $dir); # -tr sort on time created, reversed (newest last).
# i=0;
# for f in $files ; do
# fn=$dir$f;
# if [[ $i = 0 ]]; then
# data="$(sed -n 1p $fn) $label";
# ((i++));
# fi
# d=$(sed -n 2p $fn);
# t=$(echo $fn | perl -pe 's|.*?(\d+\.\d+)|\1|');
# data+="\n$d $t";
# done;
# echo -e "$data">"$csdirnm/${csdirnm}_${tsdirnm}_20T.time-series";
#done;
use strict;
foreach my $dirnum (47 .. 52) {
    my $csdirnm = "CASE$dirnum";
    print "Processing Dir $csdirnm\n";
    my $tsdirnm = "eta_at_x2";
    # Use ls to get the file ordering right.
    my $subdir = "$csdirnm/$tsdirnm";
    next unless -d $subdir; # Skip non-existent dirs
    open DIRLIST, "-|", "ls -tr $subdir" or die "Unable to list files in $subdir";
    my $outfile = "${csdirnm}/${csdirnm}_${tsdirnm}_20T.time-series";
    open REPORTFILE, ">", $outfile or die "Unable to write to report file $outfile";
    my $file;
    my $cnt = 0;
    while ($file = <DIRLIST>) {
        chomp $file;
        print "Processing $file\n";
        my $datafile = "$subdir/$file";
        open DATA, "<", $datafile or die "Unable to read file $datafile";
        my $line1 = <DATA>;
        my $line2 = <DATA>;
        close DATA;
        print REPORTFILE $line1 unless $cnt;
        print REPORTFILE $line2;
        $cnt++;
    }
    close DIRLIST;
    close REPORTFILE;
}
Context
StackExchange Code Review Q#41508, answer score: 7