
Parsing Gaussian 09 output for the energy statement in one or more files and reformatting it into a table

Submitted by: @import:stackexchange-codereview

Problem

I am a computational chemist working with the program Gaussian 09. After I manually check the output(s), I want to create a summary for easier processing of the obtained values, and to avoid opening all the files again and again. The script searches for the last occurrence of the energy statement. A portion of the output will be at the end of the post.

The outputs can become quite long and I am not completely satisfied with the performance of the script. It does its job, but if there are many big files, I can get a coffee in between. I do know that it is still faster than doing it by hand, but if it could be improved I would be quite happy.

I am pretty certain the problem comes from actually finding the string, i.e. the grep command. I am using tac here, since there could be multiple occurrences, but I am only interested in the last one. I have tried some of these solutions, too, but the tac|grep was the fastest. Depending on the steps of the optimisation it is therefore easier to read from the back. Since I already checked the file I also know the last value is the one I want.

Unfortunately, this is slow when the calculation also computes properties, because there is then a huge block of properties at the end of the file. I don't know how I could skip it.

```
#!/bin/bash

# Find energy statement from a Gaussian 09 calculation
# Find energy statement from all G09 log files in working directory

findEnergy ()
{
# Initiate variables necessary for parsing output
local readWholeLine pattern functional energy cycles
# Find match from the end of the file
# Ref: https://unix.stackexchange.com/q/112159/160000
readWholeLine=$(tac $1 | grep -m1 'SCF Done')
# Gaussian output has following format, trap important information
pattern="(E\(.+\)) = (.+) A\.U\.[^0-9]+([0-9]+) cycles"
if [[ $readWholeLine =~ $pattern ]]
then
functional="${BASH_REMATCH[1]}"
energy="${BASH_REMATCH[2]}"
cycles="${BASH_REMATCH[3]}"
fi

# Print the line, format
```

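To make the matching step easier to follow, here is a minimal sketch that runs the same pattern against a single line. The sample line is an assumption, modelled on the usual layout of an "SCF Done" statement, and is not taken from a real calculation. The `read` step is also an addition of mine: the greedy `(.+)` captures the padding around the number, so the sketch trims it before printing.

```bash
#!/bin/bash
# Hypothetical sample line, modelled on the usual "SCF Done" layout.
line=' SCF Done:  E(RB3LYP) =  -76.4089533236     A.U. after   10 cycles'

# Same pattern as in the script above.
pattern="(E\(.+\)) = (.+) A\.U\.[^0-9]+([0-9]+) cycles"

if [[ $line =~ $pattern ]]; then
    functional="${BASH_REMATCH[1]}"
    # The second group includes leading and trailing blanks;
    # read splits on whitespace and so strips them.
    read -r energy <<< "${BASH_REMATCH[2]}"
    cycles="${BASH_REMATCH[3]}"
    printf '%-15s %20s ( %6s )\n' "$functional" "$energy" "$cycles"
fi
```
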
Solution

I don't see any obvious reason why this script should be slow.
There are no unnecessary sub-processes,
no compute-heavy operations or nested loops,
and so I don't know how to help you speed this up.
The tac | grep -m1 combination is well suited to this purpose,
and I don't think a specialized custom implementation would make a significant difference.
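
For readers less familiar with the combination, a small sketch of what it does: tac prints the file last line first, and grep -m1 quits at the first match, which is the last match in the original file. The log contents and the names `demo_log` and `last_match` are made up for illustration.

```bash
#!/bin/bash
# Build a small stand-in for a log file with two energy statements.
# The contents are invented for this example.
demo_log=$(mktemp)
cat > "$demo_log" <<'EOF'
 SCF Done:  E(RB3LYP) =  -76.4000000000     A.U. after   12 cycles
 some optimisation output
 SCF Done:  E(RB3LYP) =  -76.4089533236     A.U. after   10 cycles
 trailing output after the final energy
EOF

# tac emits the lines in reverse order; grep -m1 stops reading after the
# first line that matches, so only the tail of the file is scanned.
last_match=$(tac "$demo_log" | grep -m1 'SCF Done')
printf '%s\n' "$last_match"

rm -f "$demo_log"
```

Note that grep still has to scan every line that follows the final statement before it reaches the match, which is consistent with the slowdown described for files that end in a large block of properties.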

I only have a few tips in terms of technique.
In getAll, part of the code is identical to what you have in getOnly.
You can reuse it by calling getOnly from getAll:

getAll() {
    # run over all commandfiles
    # ToDo: specify file suffixes
    local commandfile logfile
    printf "%-25s %s\n" "Summary for " "${PWD#\/*\/*\/}"
    printf "%-25s %s\n\n" "Created " "$(date +"%Y/%m/%d %k:%M:%S")"
    # Print a header
    printf "%-25s %-15s   %20s ( %6s )\n" "Command file" "Functional" "Energy / Hartree" "cycles"
    for commandfile in *com; do
        getOnly "$commandfile"
    done
}


This code is a bit sloppy: if $1 is empty, getAll runs,
and the while loop that follows has nothing left to do, yet its condition is still evaluated:

if [[ -z $1 ]]; then getAll; fi
while [[ ! -z $1 ]]; do
    getOnly $1
    shift
done


It would be more appropriate to move the loop inside the else branch:

if [[ -z $1 ]]; then
    getAll
else
    while [[ ! -z $1 ]]; do
        getOnly $1
        shift
    done
fi


I'm guessing that the intention here is to call getAll if there are no arguments.
But strictly speaking [[ -z $1 ]] doesn't mean there are no arguments,
it just means that the first argument is empty.
The correct way to check that there are no arguments:

if [[ $# == 0 ]]; then
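
A quick way to see the difference between the two tests, as a sketch. The helper names `firstIsEmpty` and `noArguments` are hypothetical and exist only for this comparison.

```bash
#!/bin/bash
# Hypothetical helpers that report what each test evaluates to.
firstIsEmpty() { [[ -z $1 ]] && echo yes || echo no; }
noArguments()  { [[ $# == 0 ]] && echo yes || echo no; }

firstIsEmpty            # no arguments at all       -> yes
firstIsEmpty ""         # one empty argument        -> yes
firstIsEmpty "" b.com   # empty first, then a file  -> yes
noArguments             # no arguments at all       -> yes
noArguments ""          # one empty argument        -> no
noArguments "" b.com    # two arguments             -> no
```

Only the `$#` test tells "nothing was passed" apart from "something empty was passed".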


And instead of a while loop, it would be more natural to use a for loop here:

for commandfile in "$@"; do
    getOnly "$commandfile"
done
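
Putting the last few points together, the dispatch could look roughly like this. It is only a sketch: getOnly and getAll are replaced by stand-ins that just print what they were given, and wrapping the logic in a function called `main` is my own assumption, not something from your script.

```bash
#!/bin/bash
# Stand-ins for the real functions, so the control flow can run on its own.
getOnly() { printf 'getOnly: <%s>\n' "$1"; }
getAll()  { printf 'getAll\n'; }

main() {
    local commandfile
    if [[ $# == 0 ]]; then
        # Nothing was passed: summarise everything.
        getAll
    else
        # "$@" expands to each argument as its own word.
        for commandfile in "$@"; do
            getOnly "$commandfile"
        done
    fi
}

main                          # no arguments: falls back to getAll
main "a.com" "my job.com"     # each argument is passed on unchanged
```

Unlike the while loop with the -z test, this loop does not stop early if one of the arguments happens to be empty.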


Lastly, in many places you did not quote variables that hold paths.
I'm guessing you did that because you are certain they will never contain spaces. Even so, it's a good habit to double-quote such variables.
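
To illustrate what goes wrong without the quotes, here is a small sketch. `countArgs` is a hypothetical helper and the file name is made up.

```bash
#!/bin/bash
# countArgs is a hypothetical helper that reports how many arguments it got.
countArgs() { echo "$#"; }

logfile="my job.log"     # made-up file name containing a space

countArgs $logfile       # unquoted: split into "my" and "job.log" -> 2
countArgs "$logfile"     # quoted: passed as a single argument     -> 1
```

A command such as tac would receive two non-existent file names in the unquoted case instead of the one file you meant.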

Context

StackExchange Code Review Q#129854, answer score: 5
