Extracting data from text file in bash using awk, grep, head and tail

Submitted by: @import:stackexchange-codereview

Problem

I've been writing bash scripts on and off, with pretty good results in terms of getting the job done. However, I'm worried that my script might be very ugly, as I am a beginner. I'm looking for advice concerning this particular one.

I want to extract some parts of a big text file (here a LAMMPS .lmps file), and manipulate them to create another file (here .xyz).

The interesting parts of the big file are:

The beginning of the file:

300 atoms
300 bonds
450 angles
600 dihedrals
150 impropers


The part about atom tags (here either "1" or "2") and masses:

Masses

1 12.011150
2 1.007970

Pair Coeffs


The part concerning atom coordinates:

Atoms

1 1 1 -0.126800 20.511864 28.359121 11.290877
2 1 1 -0.126800 21.779636 28.644716 10.779171
3 1 1 -0.126800 20.381316 27.822484 12.573717
4 1 1 -0.126800 21.518471 27.571445 13.344853
5 1 1 -0.126800 22.786244 27.857074 12.833161
6 1 1 -0.126800 22.916794 28.393694 11.550321
7 1 2 0.126800 19.390282 27.599170 12.973874
8 1 2 0.126800 19.622826 28.555315 10.688110
9 1 2 0.126800 23.907808 28.617021 11.150121
10 1 2 0.126800 21.881943 29.064261 9.776262
11 1 2 0.126800 23.675251 27.660865 13.435963
12 1 2 0.126800 21.416213 27.151893 14.347761

Bonds


Now here's the script I wrote, with some comments to explain what I wanted to do:

#!/bin/bash/

echo "LAMMPS file name? (without .lmps)"
read -r filename

# Make sure file exists
if [ -r "$filename".lmps ]; then

# Appends number of atoms to xyz file
head "$filename".lmps | grep atoms | awk -F' ' '{print $1}' > "$filename".xyz
echo >> "$filename".xyz

# Extracts "atom coordinates" and "masses" sections from original lmps file
awk '/Atoms/,/Bonds/' "$filename".l

Solution

General remarks

The first line should be #!/bin/bash.

Don't use backtick-style command substitution (`cmd`); it is considered legacy. Use $(cmd) instead: it nests without escaping and is easier to read.

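Tests like [ "$var" = "$other" ] still work and are POSIX, but in a bash script [[ $var = "$other" ]] is the more robust choice. Inside [[ ]] variables are not word-split, so the double-quotes on the left-hand side can be omitted. Keep them on the right-hand side, though: an unquoted right-hand side is treated as a glob pattern rather than a literal string.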

The http://www.shellcheck.net/ site is great for checking your code for common mistakes.
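
The first two remarks can be tried out with a small sketch. The path and the strings below are made up for illustration and do not come from the original script. It shows that $(...) nests without escaping, and a caveat worth knowing about [[ ]]: an unquoted right-hand side is matched as a glob pattern, while a quoted one is compared literally.

```shell
#!/bin/bash
# Illustrative values only; not taken from the original script.
path=/tmp/data/sample.lmps

# $(...) nests without any backslash escaping.
parent=$(basename "$(dirname "$path")")
echo "$parent"    # prints "data"

pattern='s*'

# Inside [[ ]], an unquoted right-hand side is treated as a glob pattern.
if [[ sample = $pattern ]]; then
    echo "pattern match"
fi

# A quoted right-hand side is compared literally, so this does not match.
if [[ sample = "$pattern" ]]; then
    echo "literal match"
else
    echo "no literal match"
fi
```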

If there's awk in the pipeline, use it well

Consider a pipeline like this:

head "$filename".lmps | grep atoms | awk -F' ' '{print $1}' > "$filename".xyz


Here awk sits in a pipeline alongside other operations that awk could do all by itself. This code is equivalent, since head prints the first 10 lines by default:

awk -F' ' '/atoms/ {print $1} NR == 10 { exit }' "$filename".lmps > "$filename".xyz


This is better, because instead of 3 processes (head + grep + awk), you've managed to do everything in just one process.

Note: some of your other pipelines with awk are not well-suited for this, for example:

awk '/Atoms/,/Bonds/' "$filename".lmps | head -n -2 | tail -n +3 > coordinates.tmp


This is different from the first case, because there's no easy way to do the equivalent of head -n -2 with awk. Moving the tail -n +3 logic inside the awk would be possible, but in this example it would be too complicated, so it's OK to leave this statement as it is. It's only executed once per run, so using 3 processes instead of 2 is not a big problem.
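
That said, if the section layout can be assumed to be exactly as in the excerpts (a header line, a blank line, the data lines, a blank line, then the next header), a flag-based awk program may be a workable single-process alternative. This is only a sketch under that assumption: the function name extract_section is made up here, and unlike head -n -2 it drops every blank line in the range instead of counting lines from the end.

```shell
#!/bin/bash
# Hypothetical helper: print the non-blank lines strictly between a line
# starting with "$1" and the next line starting with "$2" in file "$3".
extract_section() {
    awk -v start="$1" -v end="$2" '
        index($0, end) == 1   { inside = 0 }   # closing header: stop printing
        inside && NF          { print }        # inside the section, skip blank lines
        index($0, start) == 1 { inside = 1 }   # opening header: print from the next line on
    ' "$3"
}

# Usage: extract_section Atoms Bonds "$filename".lmps > coordinates.tmp
```

The rule order matters: the closing check runs before the print rule and the opening check after it, so neither header line is printed.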

Reading multiple variables from a line

You can simplify this:

while read line_masses
do
        mass=`echo $line_masses | awk -F' ' '{print $2}'`
        tag=`echo $line_masses | awk -F' ' '{print $1}'`
        # ...


by writing like this:

while read -r tag mass
do
        # ...


This is much better, as you just got rid of 2 extra processes per iteration.

You can do similarly for the outer loop as well:

while read -r f1 f2 atag f4 f5 f6 f7


This will simplify your if statements in the case $mass in block, like this:

12.011150)
if [[ $tag = $atag ]]; then
    echo -e "C\t$f5\t$f6\t$f7" >> "$filename".xyz
fi
;;

1.007970)
if [[ $tag = $atag ]]; then
    echo -e "H\t$f5\t$f6\t$f7" >> "$filename".xyz
fi
;;


Calculate once, save in variable to reuse

Be careful with code like this:

while read line_atoms
do
    while read line_masses
    do
        if [ "$tag" == `echo $line_atoms | awk -F' ' '{print $3}'` ]; then
            echo -e "C\t`echo $line_atoms | awk -F' ' '{print $5,"\t",$6,"\t",$7}'`"


A big problem here is the repeated evaluation of those echo $line_atoms | awk commands for each mass line in the input, when it would have been more efficient to calculate these values once, before starting the inner loop.
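
One way to apply this advice, sketched under some assumptions: read the masses into a bash associative array once, then make a single pass over the atom lines with the fields already split by read. The function name atoms_to_xyz, the two file arguments, and the "X" placeholder for unknown masses are illustrative and not part of the original script. Associative arrays need bash 4 or later.

```shell
#!/bin/bash
# Sketch: convert atom lines to xyz rows with one table lookup per atom.
# Assumes a masses file with "tag mass" lines and an atoms file with
# "id mol tag charge x y z" lines, as in the excerpts from the question.
atoms_to_xyz() {
    local masses_file=$1 atoms_file=$2
    local -A symbol_of_mass=( [12.011150]=C [1.007970]=H )
    local -A symbol_of_tag=()
    local tag mass id mol atag charge x y z

    # Build the tag -> element table once, before the atom loop starts.
    while read -r tag mass; do
        if [[ -n $tag && -n $mass ]]; then
            # "X" is a made-up placeholder for masses missing from the table.
            symbol_of_tag[$tag]=${symbol_of_mass[$mass]:-X}
        fi
    done < "$masses_file"

    # Single pass over the atoms; no subprocess is started per line.
    while read -r id mol atag charge x y z; do
        if [[ -n $atag ]]; then
            printf '%s\t%s\t%s\t%s\n' "${symbol_of_tag[$atag]:-X}" "$x" "$y" "$z"
        fi
    done < "$atoms_file"
}

# Usage: atoms_to_xyz masses.tmp coordinates.tmp >> "$filename".xyz
```

With this shape the work grows with the number of atoms plus the number of masses, instead of their product.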

Reduce nesting

The main part of the script is wrapped inside this large if block:

if [ -r "$filename".lmps ]; then
    # do the main work
fi


It would be better to reverse this logic, like this:

if [ ! -r "$filename".lmps ]; then
    echo "Error: $filename.lmps doesn't exist or isn't readable" >&2
    exit 1
fi

# do the main work


Related to this, it's good practice to exit with a non-zero status, such as exit 1, to indicate an error to the caller.

Finally, you did some cleanup at the end of the script:

# Gets rid of temporary files
rm *.tmp


However, this is pointless if the input file did not exist. One more reason to exit early, so there's nothing to clean up.
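
The early-exit structure might look like the sketch below. The function names die and convert, the message wording, and sending the error to stderr are choices made here for illustration; the echo in convert only stands in for the real work.

```shell
#!/bin/bash
# Sketch of a guard clause: fail fast, then do the work without nesting.
# die and convert are hypothetical names, not taken from the original script.
die() {
    echo "Error: $*" >&2   # errors go to stderr, not into the output
    exit 1                 # non-zero status tells the caller it failed
}

convert() {
    local filename=$1
    [ -r "$filename.lmps" ] || die "$filename.lmps doesn't exist or isn't readable"

    # Main work starts here, one indentation level shallower than before.
    echo "converting $filename.lmps"
}

# Usage: convert "$filename"
```

Because the check runs before any temporary file is created, a failed run leaves nothing behind to clean up.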

Context

StackExchange Code Review Q#59417, answer score: 7
