
Extracting data from text file in bash using awk, grep, head and tail

Submitted by: @import:stackexchange-codereview·Mar 10, 2026·


Problem

I've been writing bash scripts on and off, with pretty good results in terms of getting the job done. However, I'm worried that my scripts might be very ugly, as I am a beginner. I'm looking for advice concerning this particular one.

I want to extract some parts of a big text file (here a LAMMPS .lmps file), and manipulate them to create another file (here .xyz).

The interesting parts of the big file are:

The beginning of the file:

300 atoms
300 bonds
450 angles
600 dihedrals
150 impropers

The part about atom tag (here either "1" or "2") and mass:

Masses

1         12.011150
2          1.007970

Pair Coeffs

The part concerning atom coordinates:

Atoms

1       1    1     -0.126800     20.511864     28.359121     11.290877
2       1    1     -0.126800     21.779636     28.644716     10.779171
3       1    1     -0.126800     20.381316     27.822484     12.573717
4       1    1     -0.126800     21.518471     27.571445     13.344853
5       1    1     -0.126800     22.786244     27.857074     12.833161
6       1    1     -0.126800     22.916794     28.393694     11.550321
7       1    2      0.126800     19.390282     27.599170     12.973874
8       1    2      0.126800     19.622826     28.555315     10.688110
9       1    2      0.126800     23.907808     28.617021     11.150121
10      1    2      0.126800     21.881943     29.064261      9.776262
11      1    2      0.126800     23.675251     27.660865     13.435963
12      1    2      0.126800     21.416213     27.151893     14.347761

Bonds

Now here's the script I wrote, with some comments to explain what I wanted to do:

#!/bin/bash/

echo "LAMMPS file name? (without .lmps)"
read -r filename

# Make sure file exists
if [ -r "$filename".lmps ]; then

# Appends number of atoms to xyz file
head "$filename".lmps | grep atoms | awk -F' ' '{print $1}' > "$filename".xyz
echo >> "$filename".xyz

# Extracts "atom coordinates" and "masses" sections from original lmps file
awk '/Atoms/,/Bonds/' "$filename".l

Solution

General remarks

The first line should be #!/bin/bash, without the trailing slash.

Don't use `cmd` style command substitution; it's considered legacy syntax. Use $(cmd) instead: it's easier to read and it nests without escaping.
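To illustrate the nesting point, here is a small sketch that takes the name of the directory containing a file. The path is a made-up example, not something from your script.

```bash
#!/bin/bash
# Hypothetical path, used only for illustration.
path="/data/runs/benzene.lmps"

# $(...) nests naturally: the inner call runs first, no escaping needed.
parent=$(basename "$(dirname "$path")")

# The backtick form needs the inner backticks escaped to do the same thing.
legacy=`basename \`dirname "$path"\``

echo "$parent"
echo "$legacy"
```

Both lines print runs, but the first form stays readable as the nesting gets deeper.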

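Tests like [ "$var" = "$other" ] still work and are the portable POSIX form, but in a bash script the [[ $var = $other ]] form is generally preferred. As you can see in this example, you can omit the double-quotes in this modern version, because [[ ]] does no word splitting on its operands. One caveat: an unquoted right-hand side of = is treated as a pattern, so quote it when you want a literal comparison.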
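A short sketch of how quoting behaves inside [[ ]]. The function names are made up for this example, and it needs bash rather than plain sh.

```bash
#!/bin/bash
# Succeeds when the two arguments are the same string.
same_string() {
    [[ $1 = "$2" ]]   # quoted right-hand side: literal comparison
}

# Succeeds when $1 matches the glob pattern given in $2.
matches_pattern() {
    [[ $1 = $2 ]]     # unquoted right-hand side: treated as a pattern
}

same_string "1" "1" && echo "tags are equal"
same_string "file.lmps" "*.lmps" || echo "not literally equal"
matches_pattern "file.lmps" "*.lmps" && echo "matches the pattern"

# An empty left-hand side needs no quotes, unlike with [ ].
empty=""
[[ $empty = "" ]] && echo "empty value compared safely"
```

For the atom tags in your script (plain integers) both forms give the same answer, so the difference only matters once the values can contain characters such as * or ?.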

The http://www.shellcheck.net/ site is great for checking your code for common mistakes.

If there's awk in the pipeline, use it well

Consider a pipeline like this:

head "$filename".lmps | grep atoms | awk -F' ' '{print $1}' > "$filename".xyz


Here there is an awk in the pipeline, along with other operations that awk could do all by itself. This code is equivalent (head prints the first 10 lines by default, hence the NR == 10):

awk -F' ' '/atoms/ {print $1} NR == 10 { exit }' "$filename".lmps > "$filename".xyz


This is better, because instead of 3 processes (head + grep + awk), you've managed to do everything in just one process.
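If you want to convince yourself that the two forms agree, a quick check on a small sample file could look like this. The sample content and the count_atoms name are assumptions for the demo; -F' ' is left out because splitting on whitespace is already awk's default.

```bash
#!/bin/bash
# Print the first field of lines mentioning "atoms" within the first 10 lines.
count_atoms() {
    awk '/atoms/ { print $1 } NR == 10 { exit }' "$1"
}

# Made-up sample that mimics the start of a .lmps file.
sample=$(mktemp)
printf '%s\n' 'LAMMPS data file' '' '300 atoms' '300 bonds' '450 angles' > "$sample"

count_atoms "$sample"                              # single awk process
head "$sample" | grep atoms | awk '{ print $1 }'   # original three-process pipeline

rm -f "$sample"
```

Both commands print 300 for this sample. Note that the exit rule comes after the print rule, so a match on line 10 itself is still printed.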

Note: some of your other pipelines with awk are not well-suited for this, for example:

awk '/Atoms/,/Bonds/' "$filename".lmps | head -n -2 | tail -n +3 > coordinates.tmp


This is different from the first case, because there's no easy way to do the equivalent of head -n -2 with awk. Also, moving the tail -n +3 logic inside the awk would be possible, but in this example too complicated, so it's OK to leave this statement as it is. It's only executed once per run, so using 3 processes instead of 2 is not a big problem.
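If you ever do want a single awk here, one option is not to emulate head and tail at all but to track the section with a flag. This is only a sketch under two assumptions: the section headers start at the beginning of a line, and every blank line inside the section can be dropped (which is slightly different from trimming exactly two lines at each end). It also avoids head -n -2, which is a GNU extension. The extract_atoms name and the sample data are made up.

```bash
#!/bin/bash
# Print the non-blank lines strictly between the "Atoms" and "Bonds" headers.
extract_atoms() {
    awk '
        /^Bonds/ { inside = 0 }   # stop before the closing header is printed
        inside && NF              # print non-blank lines while inside
        /^Atoms/ { inside = 1 }   # start with the line after the opening header
    ' "$1"
}

# Made-up sample that mimics the layout of a .lmps file.
sample=$(mktemp)
printf '%s\n' 'Masses' '' '1 12.011150' '' 'Atoms' '' \
    '1 1 1 -0.126800 20.511864 28.359121 11.290877' \
    '2 1 1 -0.126800 21.779636 28.644716 10.779171' \
    '' 'Bonds' '' '1 1 1 2' > "$sample"

extract_atoms "$sample"
rm -f "$sample"
```

The order of the three rules matters: checking for Bonds first and for Atoms last keeps both header lines out of the output.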

Reading multiple variables from a line

You can simplify this:

while read line_masses
do
        mass=`echo $line_masses | awk -F' ' '{print $2}'`
        tag=`echo $line_masses | awk -F' ' '{print $1}'`
        # ...


by writing like this:

while read tag mass
do
        # ...


This is much better, as you just got rid of 2 extra processes per iteration.
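Here is a self-contained sketch of that loop, fed with the two mass lines from your sample. The tags_to_elements name and the "?" fallback are my own additions, and so is the -r flag, which stops read from interpreting backslashes.

```bash
#!/bin/bash
# Read "tag mass" pairs from stdin and print the tag with an element symbol.
# Only the two masses from the sample file are known; anything else prints "?".
tags_to_elements() {
    local tag mass
    while read -r tag mass; do
        case $mass in
            12.011150) echo "$tag C" ;;
            1.007970)  echo "$tag H" ;;
            *)         echo "$tag ?" ;;
        esac
    done
}

printf '%s\n' '1         12.011150' '2          1.007970' | tags_to_elements
```

read splits each line on whitespace by itself, so the runs of spaces between the columns need no special handling.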

You can do similarly for the outer loop as well:

while read f1 f2 atag f4 f5 f6 f7


This will simplify the if statements inside your case $mass in block, like this:

12.011150)
if [[ $tag = $atag ]]; then
    echo -e "C\t$f5\t$f6\t$f7" >> "$filename".xyz
fi
;;

1.007970)
if [[ $tag = $atag ]]; then
    echo -e "H\t$f5\t$f6\t$f7" >> "$filename".xyz
fi
;;


Calculate once, save in variable to reuse

Be careful with code like this:

while read line_atoms
do
    while read line_masses
    do
        if [ "$tag" == `echo $line_atoms | awk -F' ' '{print $3}'` ]; then
            echo -e "C\t`echo $line_atoms | awk -F' ' '{print $5,"\t",$6,"\t",$7}'`"


A big problem here is repeated evaluation of those echo $line_atoms | awk commands for each mass line in the input, when it would have been more efficient to calculate these before starting the inner loop.
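Taking the same idea one step further, you could build the tag-to-element table once and then make a single pass over the atom lines, which removes the inner loop entirely. This is a rough sketch, not a drop-in replacement: it assumes the tags are small positive integers (so they can index a plain bash array), that the two files hold the already extracted Masses and Atoms lines, and the column names id, mol and charge are my own labels.

```bash
#!/bin/bash
# Convert atom lines to xyz-style lines using a table that is built once.
# $1: file with "tag mass" lines, $2: file with 7-column atom lines.
atoms_to_xyz() {
    local -a element=()
    local tag mass id mol atag charge x y z

    # Pass 1: fill the lookup table, one entry per atom tag.
    while read -r tag mass; do
        case $mass in
            12.011150) element[$tag]=C ;;
            1.007970)  element[$tag]=H ;;
        esac
    done < "$1"

    # Pass 2: each atom line is split once and looked up once.
    while read -r id mol atag charge x y z; do
        [[ -n $atag ]] || continue   # skip blank lines
        printf '%s\t%s\t%s\t%s\n' "${element[$atag]:-?}" "$x" "$y" "$z"
    done < "$2"
}

workdir=$(mktemp -d)
printf '%s\n' '1         12.011150' '2          1.007970' > "$workdir/masses.tmp"
printf '%s\n' \
    '1       1    1     -0.126800     20.511864     28.359121     11.290877' \
    '7       1    2      0.126800     19.390282     27.599170     12.973874' \
    > "$workdir/coordinates.tmp"

atoms_to_xyz "$workdir/masses.tmp" "$workdir/coordinates.tmp"
rm -rf "$workdir"
```

With this structure the work grows with the number of atoms plus the number of masses, rather than with their product.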

Reduce nesting

The main part of the script is wrapped inside this large if block:

if [ -r "$filename".lmps ]; then
    # do the main work
fi


It would be better to reverse this logic, like this:

if [ ! -r "$filename".lmps ]; then
    echo "Error: $filename.lmps doesn't exist or is not readable" >&2
    exit 1
fi

# do the main work


Related to this, it's good practice to exit 1 (or any other non-zero status) to indicate an error to the caller.

Finally, you did some cleanup at the end of the script:

# Gets rid of temporary files
rm *.tmp

However, this is pointless if the input file did not exist. One more reason to exit early, so there's nothing to clean up.
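Putting the early exit and the cleanup together could look roughly like the sketch below. It is wrapped in a function so it can be tried in isolation, which is why it uses return where the top level of a script would use exit. The convert_file name, the private mktemp directory and the count.tmp file are assumptions for illustration, not part of your script.

```bash
#!/bin/bash
# Validate the input first, then do the work with no extra nesting.
convert_file() {
    local input=$1 workdir

    if [[ ! -r $input ]]; then
        echo "Error: $input doesn't exist or is not readable" >&2
        return 1
    fi

    # Temporary files are created only once the input is known to be usable,
    # and in a private directory, so cleanup cannot touch unrelated *.tmp files.
    workdir=$(mktemp -d) || return 1
    awk '/atoms/ { print $1 } NR == 10 { exit }' "$input" > "$workdir/count.tmp"
    cat "$workdir/count.tmp"
    rm -rf "$workdir"
}

convert_file "/no/such/file.lmps" || echo "conversion failed with status $?"
```

Because the check comes first, the failure path never creates anything, so there is nothing left behind to remove.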

Context

StackExchange Code Review Q#59417, answer score: 7
