Extracting data from text file in bash using awk, grep, head and tail
Problem
I've been writing bash scripts on and off, with pretty good results in terms of getting the job done. However, as a beginner, I'm worried that my scripts might be very ugly. I'm looking for advice on this particular one.
I want to extract some part of a big text file (here a lammps .lmps file), and manipulate it to create another file (here .xyz).
The interesting parts of the big file are:
The beginning of the file:
300 atoms
300 bonds
450 angles
600 dihedrals
150 impropers
The part about atom tag (here either "1" or "2") and mass:
Masses
1 12.011150
2 1.007970
Pair Coeffs
The part concerning atom coordinates:
Atoms
1 1 1 -0.126800 20.511864 28.359121 11.290877
2 1 1 -0.126800 21.779636 28.644716 10.779171
3 1 1 -0.126800 20.381316 27.822484 12.573717
4 1 1 -0.126800 21.518471 27.571445 13.344853
5 1 1 -0.126800 22.786244 27.857074 12.833161
6 1 1 -0.126800 22.916794 28.393694 11.550321
7 1 2 0.126800 19.390282 27.599170 12.973874
8 1 2 0.126800 19.622826 28.555315 10.688110
9 1 2 0.126800 23.907808 28.617021 11.150121
10 1 2 0.126800 21.881943 29.064261 9.776262
11 1 2 0.126800 23.675251 27.660865 13.435963
12 1 2 0.126800 21.416213 27.151893 14.347761
Bonds
Now here's the script I wrote, with some comments to explain what I wanted to do:
#!/bin/bash/
echo "LAMMPS file name? (without .lmps)"
read -r filename
# Make sure file exists
if [ -r "$filename".lmps ]; then
# Appends number of atoms to xyz file
head "$filename".lmps | grep atoms | awk -F' ' '{print $1}' > "$filename".xyz
echo >> "$filename".xyz
# Extracts "atom coordinates" and "masses" sections from original lmps file
awk '/Atoms/,/Bonds/' "$filename".l
Solution
General remarks
The first line should be #!/bin/bash, without the trailing slash.
Don't use `cmd` style command substitution; it's a legacy form that is awkward to nest. Use $(cmd) instead: it nests without escaping and is easier to read.
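Tests like [ "$var" = "$other" ] still work, but in bash [[ $var = $other ]] is the more robust choice. Variables inside [[ ]] are not subject to word splitting or filename expansion, so you can omit the double quotes on the left-hand side. Keep the right-hand side quoted ("$other") when you want a literal comparison, because an unquoted right-hand side of = is treated as a glob pattern.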
The http://www.shellcheck.net/ site is great for checking your code for common mistakes.
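A minimal sketch of both remarks. The variable names and sample values are made up for illustration and are not taken from the original script:

```shell
#!/bin/bash
# Sample value for illustration only (not from the original script).
filename="benzene"

# $(...) nests without escaping, unlike backticks.
upper=$(printf '%s' "$(basename "$filename.lmps" .lmps)" | tr '[:lower:]' '[:upper:]')
echo "$upper"

# Inside [[ ]], an unquoted left-hand side is not word-split...
value="two words"
literal_match=no
if [[ $value = "two words" ]]; then
    literal_match=yes
fi

# ...but an unquoted right-hand side is a glob pattern, so quote it
# when a literal comparison is intended.
pattern="t*"
glob_match=no
quoted_match=no
if [[ $value = $pattern ]]; then glob_match=yes; fi
if [[ $value = "$pattern" ]]; then quoted_match=yes; fi
echo "$literal_match $glob_match $quoted_match"
```

The last line shows the difference between the quoted and unquoted right-hand side: the unquoted pattern matches, the quoted one does not.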
If there's awk in the pipeline, use it well
Consider a pipeline like this:
head "$filename".lmps | grep atoms | awk -F' ' '{print $1}' > "$filename".xyz
There is already an awk in the pipeline, and awk can do the work of the other two commands by itself. This code is equivalent (head prints the first 10 lines by default, hence the NR == 10 condition):
awk -F' ' '/atoms/ {print $1} NR == 10 { exit }' "$filename".lmps > "$filename".xyz
This is better, because instead of 3 processes (head + grep + awk), you've managed to do everything in just one process.
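To check the equivalence on a small input, here is a sketch that runs both forms against a sample header. The first line of the sample and the temporary file are assumptions for the demo; the counts are the ones from the question:

```shell
#!/bin/bash
# Sample header modeled on the excerpt in the question.
# The title line is an assumption, added only for the demo.
sample=$(mktemp)
printf '%s\n' 'LAMMPS data file' '' '300 atoms' '300 bonds' '450 angles' \
    '600 dihedrals' '150 impropers' > "$sample"

# Original form: three processes.
three_processes=$(head "$sample" | grep atoms | awk '{print $1}')

# Single awk: print the first field of matching lines, stop after line 10.
one_process=$(awk '/atoms/ {print $1} NR == 10 {exit}' "$sample")

echo "pipeline: $three_processes"
echo "awk only: $one_process"
rm -f "$sample"
```

The -F' ' option is left out here because a single space is already awk's default field separator.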
Note: some of your other pipelines with awk are not well-suited for this, for example:
awk '/Atoms/,/Bonds/' "$filename".lmps | head -n -2 | tail -n +3 > coordinates.tmp
This is different from the first case, because awk has no easy equivalent of head -n -2 (drop the last two lines; note that a negative count is a GNU extension). Moving the tail -n +3 logic into awk would be possible, but too complicated for this example, so it's OK to leave this statement as it is. It's only executed once per run, so a couple of extra processes is not a big problem.
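If the layout of the section is known, one awk-only variant avoids counting lines altogether. This is only a sketch: it assumes the section header sits on its own line, followed and preceded by blank lines (which is what the head/tail offsets imply), and unlike the original pipeline it also drops any blank line inside the section:

```shell
#!/bin/bash
# Sample excerpt built from the data in the question; the blank lines
# around the section headers and the line after "Bonds" are assumptions.
sample=$(mktemp)
printf '%s\n' 'Masses' '' '1 12.011150' '2 1.007970' '' 'Atoms' '' \
    '1 1 1 -0.126800 20.511864 28.359121 11.290877' \
    '7 1 2 0.126800 19.390282 27.599170 12.973874' \
    '' 'Bonds' '' '1 1 1 2' > "$sample"

coordinates=$(awk '
    /^Bonds/      { exit }          # stop at the next section header
    inside && NF  { print }         # inside the section: keep non-blank lines
    /^Atoms/      { inside = 1 }    # header seen: start keeping the lines after it
' "$sample")

echo "$coordinates"
rm -f "$sample"
```

Because the rule that sets the flag comes last, the header line itself is never printed.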
Reading multiple variables from a line
You can simplify this:
while read line_masses
do
    mass=`echo $line_masses | awk -F' ' '{print $2}'`
    tag=`echo $line_masses | awk -F' ' '{print $1}'`
    # ...
by writing it like this:
while read tag mass
do
    # ...
This is much better, as you just got rid of 2 extra processes per iteration.
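A small sketch of the idea, feeding the two Masses lines from the question through a here-document. The summary variable is made up for the demo, and -r is added so that read does not interpret backslashes:

```shell
#!/bin/bash
# read splits each line on whitespace: the first word goes to tag,
# the remainder of the line goes to mass (the last variable).
summary=""
while read -r tag mass; do
    summary+="tag=$tag mass=$mass;"
done <<'EOF'
1 12.011150
2 1.007970
EOF

echo "$summary"
```

No awk or echo process is started inside the loop; the splitting is done by the shell itself.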
You can do similarly for the outer loop as well:
while read f1 f2 atag f4 f5 f6 f7
This will simplify the if statements inside the case $mass in block, like this:
    12.011150)
        if [[ $tag = $atag ]]; then
            echo -e "C\t$f5\t$f6\t$f7" >> "$filename".xyz
        fi
        ;;
    1.007970)
        if [[ $tag = $atag ]]; then
            echo -e "H\t$f5\t$f6\t$f7" >> "$filename".xyz
        fi
        ;;
Calculate once, save in variable to reuse
Be careful with code like this:
while read line_atoms
do
    while read line_masses
    do
        if [ "$tag" == `echo $line_atoms | awk -F' ' '{print $3}'` ]; then
            echo -e "C\t`echo $line_atoms | awk -F' ' '{print $5,"\t",$6,"\t",$7}'`"
The big problem here is that the echo $line_atoms | awk commands are re-evaluated for every mass line in the input. It would be more efficient to compute these values once, before the inner loop starts.
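One way the two loops could look once the fields are split a single time per atom line. This is a sketch rather than a drop-in replacement: the temporary files, the xyz variable, and the X fallback for unlisted masses are assumptions for the demo, while the data lines come from the question:

```shell
#!/bin/bash
# Temp files stand in for the intermediate files of the original script.
masses=$(mktemp)
atoms=$(mktemp)
printf '%s\n' '1 12.011150' '2 1.007970' > "$masses"
printf '%s\n' \
    '1 1 1 -0.126800 20.511864 28.359121 11.290877' \
    '7 1 2 0.126800 19.390282 27.599170 12.973874' > "$atoms"

xyz=""
# Split each atom line once, in the outer loop; _ discards unused fields.
while read -r _ _ atag _ x y z; do
    # The inner loop only compares values that are already in variables.
    while read -r tag mass; do
        [[ $tag = "$atag" ]] || continue
        case $mass in
            12.011150) element=C ;;
            1.007970)  element=H ;;
            *)         element=X ;;  # placeholder for masses not listed here
        esac
        xyz+="$element $x $y $z"$'\n'
    done < "$masses"
done < "$atoms"

printf '%s' "$xyz"
rm -f "$masses" "$atoms"
```

With this layout no external process runs inside either loop.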
Reduce nesting
The main part of the script is wrapped inside this large if block:
if [ -r "$filename".lmps ]; then
    # do the main work
fi
It would be better to reverse this logic, like this:
if [ ! -r "$filename".lmps ]; then
    echo "Error: $filename.lmps doesn't exist or is not readable" >&2
    exit 1
fi
# do the main work
Related to this, it's good practice to exit 1 to indicate an error to the caller.
Finally, you did some cleanup at the end of the script:
# Gets rid of temporary files
rm *.tmp
However, this is pointless if the input file did not exist. One more reason to exit early, so there's nothing to clean up.
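The early-exit shape can be sketched as follows. The convert function is a hypothetical stand-in for the body of the script, and the private temporary directory (instead of *.tmp files in the current directory) is an assumption, not something the original script does:

```shell
#!/bin/bash
# convert is a hypothetical function standing in for the body of the script.
convert() {
    local input=$1.lmps

    # Guard clause: fail early, before any temporary file exists.
    if [[ ! -r $input ]]; then
        echo "Error: $input doesn't exist or is not readable" >&2
        return 1
    fi

    local tmpdir
    tmpdir=$(mktemp -d) || return 1
    # ... main work, writing temporary files under "$tmpdir" ...
    rm -rf "$tmpdir"
}

# Demo call with a path that is assumed not to exist.
status=0
convert "/nonexistent/example" 2>/dev/null || status=$?
echo "exit status: $status"
```

Since the guard returns before anything is created, the failure path has nothing to clean up, and the caller sees a non-zero status.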
Context
StackExchange Code Review Q#59417, answer score: 7