Sorting a file with three-line blocks by the second word of the first line in each block
Problem
My script sorts a text file alphabetically by each person's last name and preserves the formatting. I would love to see other options to improve this in Bash. For example, the group command that redirects into final.txt repeats a lot. Also, it would be nice to have the content of output.txt in a variable instead of creating a file. The text file has the following three-line pattern, with blank lines in between:
    Sally Smith
    UniqueStringSmith_1
    UniqueStringSmith_2

    Wally Wilson
    UniqueStringWilson_1
    UniqueStringWilson_2

    Tod Taylor
    UniqueStringTaylor_1
    UniqueStringTaylor_2

    Judy Johnson
    UniqueStringJohnson_1
    UniqueStringJohnson_2

The result looks like the following, sorted alphabetically by last name:
    Judy Johnson
    UniqueStringJohnson_1
    UniqueStringJohnson_2

    Sally Smith
    UniqueStringSmith_1
    UniqueStringSmith_2

    Tod Taylor
    UniqueStringTaylor_1
    UniqueStringTaylor_2

    Wally Wilson
    UniqueStringWilson_1
    UniqueStringWilson_2

Here is my script:
    #!/bin/bash

    # Get the number of lines in the document.
    lines=$(cat my-file.txt | wc -l)

    # This is the starting range and end range. Each section is three lines.
    x=1
    y=3

    until [ "$x" -gt "$lines" ]; do
        # Store the three lines to one line.
        block=$(awk 'NR=="'"$x"'",NR=="'"$y"'"' my-file.txt)
        # Echo each instance into my file.
        # The $block variable is not double-quoted so new lines are not honored.
        echo $block >> output.txt
        # Increment so it goes on to the next block.
        x=$((x+4))
        y=$((y+4))
    done

    # Sort the output file in place by the second column.
    sort -k2 output.txt -o output.txt

    # Put it back into original formatting.
    while read i; do
        (echo "$i" | awk '{ print $1 " " $2 }'; echo "$i" | awk '{ print $3 }'; echo "$i" | awk '{ print $4 }'; echo "") >> final.txt
    done < output.txt

    rm output.txt

Solution
Usability
Hardcoded input and output filenames are not easy to use.
This script only works with one specific input file name,
and it may inadvertently overwrite a file.
It would be better to take the input file as a command line argument,
and write the output to stdout,
letting the user redirect it to any file.
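As a minimal sketch of that interface: the function name `first_lines` and the file names are hypothetical, chosen only for illustration, and the body is a stand-in for the real processing. It reads the file named by its first argument and prints to stdout, so the caller decides where the output goes.

```shell
#!/bin/bash
# Hypothetical stand-in for the real processing: print the name line of each
# four-line group (three lines of data plus a blank separator line).
first_lines() {
    awk 'NR % 4 == 1' "$1"
}

# The caller picks the destination by redirecting stdout, for example:
#   first_lines my-file.txt > names.txt
```

Nothing in the function mentions an output file, so it cannot overwrite one by accident.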
Error handling
If the input file doesn't exist, the script prints a bunch of error messages:

    cat: my-file.txt: No such file or directory
    sort: open failed: output.txt: No such file or directory
    script.sh: line 29: output.txt: No such file or directory
    rm: output.txt: No such file or directory

It would be better to check first that the file exists and fail early.
Keep in mind that after an error in one of the commands,
the script continues to run and executes the rest of the commands anyway.
I've seen cases where this caused real damage,
for example with `rm -fr` commands that assumed they were running in a different directory, which was not the case due to earlier errors.
So it's important to look out for possible errors, check the exit code of commands and halt execution early.
You could do something like this:

    input=$1
    if ! test -f "$input"; then
        echo "fatal: input file argument missing or not a file: $input" >&2
        echo "usage: $0 input" >&2
        exit 1
    fi

Bash arithmetic
The `-gt` operator in `[ ... ]` is the portable way to compare numbers, but in a Bash script the arithmetic command `((...))` is easier to read. Instead of:

    until [ "$x" -gt "$lines" ]; do

You can write like this:

    until (( x > lines )); do

Simpler quoting
You can simplify the quoting here:

    block=$(awk 'NR=="'"$x"'",NR=="'"$y"'"' "$input")

Like this:

    block=$(awk "NR==$x,NR==$y" "$input")

Initializing output.txt
In the `until` loop, you append to `output.txt`.
What if the file already existed before running the script?
The new lines would be appended to the old content, and the result would be wrong.
To make sure the file is empty, you can do this:

    > output.txt

But this is still not great. A file with that name may exist, and now its content will be destroyed.
Instead of using a temporary file in the current folder,
it would be better to use one in `$TMP/output.txt`.
And to avoid clashing with other scripts that might do the same,
you can add the process ID to the filename, for example `$TMP/output-$$.txt`.
But the best solution is to use the `mktemp` command:

    tmpfile=$(mktemp)

Deleting temporary files at the end
One problem with deleting temporary files at the end of the script like you did with `rm output.txt` is that you might forget to do it.
Another problem is that the end of the script might not be reached,
if the script gets interrupted due to an error, a signal, or the user pressing Control-C.
You can protect against these by using the `trap` builtin:

    tmpfile=$(mktemp)
    trap "rm -f '$tmpfile'; exit 1" 1 2 3 15

I repeated the line creating the temporary file,
because it's best to put the `trap` command right after that line,
so it won't be forgotten.
The first parameter of `trap` is the command to run,
typically more than one command,
and it's important that the last one is `exit`.
The other parameters are the signals that will be trapped.
1, 2, 3, 15 are typical signals to trap; for example 2 is `SIGINT`,
which is sent when the user presses Control-C while the script is running.
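As a variation on the `trap` line above, the shell also accepts the pseudo-signal `EXIT`, whose handler runs whenever the shell exits, including after a normal finish. That covers the "forgot to delete it" case as well. This is a minimal sketch rather than a drop-in replacement: the function name `demo_cleanup` is hypothetical, and the body runs in a subshell only so that the cleanup can be observed from outside.

```shell
#!/bin/bash
# Hypothetical demo: create a temp file, use it, and let the traps remove it.
# The body is wrapped in ( ... ) so the EXIT trap fires when the subshell ends.
demo_cleanup() (
    tmpfile=$(mktemp) || exit 1

    # EXIT covers normal termination and explicit `exit` calls.
    trap 'rm -f "$tmpfile"' EXIT
    # On these signals, exit explicitly so that the EXIT handler above runs.
    trap 'exit 1' HUP INT QUIT TERM

    # Report the path so the caller can check afterwards that it is gone.
    printf '%s\n' "$tmpfile"
    echo "working data" > "$tmpfile"
)
```

After `path=$(demo_cleanup)` returns, the file named by `$path` should no longer exist. The signal names `HUP INT QUIT TERM` correspond to the numbers 1, 2, 3, 15 used above.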
More Bash arithmetic
Instead of this:

    x=$((x+4))
    y=$((y+4))

You can simplify to:

    ((x+=4))
    ((y+=4))

Fewer variables
`y` is not really necessary. Instead of incrementing it by 4 in parallel with `x`, you can just increment `x`, and use `x + 2` in `awk`:

    block=$(awk "NR==$x,NR==$x+2" "$input")

Fewer redirections
Instead of redirecting output in every iteration of the `until` loop,
you could redirect the entire loop, just once:

    until (( x > lines )); do
        block=$(awk "NR==$x,NR==$x+2" "$input")
        echo $block
        ((x+=4))
    done > "$tmpfile"

Fewer processes
Instead of running an `awk` process for every block in the file in an `until` loop,
you could move the same logic inside `awk` itself,
and achieve the same using a single process:

    awk '{printf "%s ", $0} NR % 4 == 0 {print ""}' "$input" > "$tmpfile"

In the `while` loop too, there is some waste.
Multiple commands are on one line separated by `;`,
and enclosed within `(...)`.
It's equivalent to this:

    while read i; do
        echo "$i" | awk '{ print $1 " " $2 }'
        echo "$i" | awk '{ print $3 }'
        echo "$i" | awk '{ print $4 }'
        echo
    done < "$tmpfile"

Note that `i` is a poor name for a variable that contains a line.
But the bigger issue is that a single line with `awk` could replace the 4 lines of `echo`:

    echo "$line" | awk '{ print $1 " " $2; print $3; print $4; print ""; }'

Even better, a single `awk` process could replace the entire loop:

    awk '{ print $1 " " $2; print $3; print $4; print ""; }' "$tmpfile"

Putting it together
At this point, we have:
- An `until` loop that creates `$tmpfile`
- A `sort` that sorts `$tmpfile`
- An `awk` command that processes `$tmpfile`
We can chain them a
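A minimal sketch of one way the three steps listed above might be chained with pipes, so that no temporary file is needed. The function name `sort_blocks` is hypothetical, and the `END` clause is an addition that terminates the last joined line when the input has no trailing blank line. Like the commands above, the sketch assumes that every block is exactly three lines followed by one blank line, that each name is exactly two words, and that the other two lines contain no spaces.

```shell
#!/bin/bash
# Hypothetical helper: print the blocks of file "$1" sorted by last name.
sort_blocks() {
    # 1. Join each four-line group (three lines plus a blank) into one line.
    awk '{ printf "%s ", $0 }
         NR % 4 == 0 { print "" }
         END { if (NR % 4) print "" }' "$1" |
    # 2. Sort the joined lines starting at the second word (the last name).
    sort -k2 |
    # 3. Split every line back into a three-line block plus a blank line.
    awk '{ print $1 " " $2; print $3; print $4; print "" }'
}
```

Called as `sort_blocks my-file.txt > final.txt`, this writes the sorted blocks without creating `output.txt` at all.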
Context
StackExchange Code Review Q#127261, answer score: 3