HiveBrain v1.2.0
Get Started
← Back to all entries
patternbashMinor

bash script for constructing RNA pipeline

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
scriptrnapipelineconstructingforbash

Problem

I have written a bash script that consists of multiple commands and Python scripts. The goal is to make a pipeline for detecting long non coding RNA from a certain input. Ultimately I would like to turn this into a app and host it on some bioinformatics website. One problem I am facing is using getopt tools in bash. I couldn't find a good tutorial that I understand clearly. In addition any other comments related to improving on the code is much appreciated.

#!/bin/bash
if [ "$1" == "-h" ]
then
    echo "Usage: sh $0 cuffcompare_output reference_genome blast_file"
    exit
else
wget https://github.com/TransDecoder/TransDecoder/archive/2.0.1.tar.gz && tar xvf 2.0.1 && rm -r 2.0.1
makeblastdb -in $3 -dbtype nucl -out $3.blast.out
grep '"u"' $1 | \
gffread -w transcripts_u.fa -g $2 - && \
python2.7 get_gene_length_filter.py transcripts_u.fa transcripts_u_filter.fa && \
TransDecoder-2.0.1/TransDecoder.LongOrfs -t transcripts_u_filter.fa
sed 's/ .*//' transcripts_u_filter.fa | grep ">" | sed 's/>//' > transcripts_u_filter.fa.genes
cd transcripts_u_filter.fa.transdecoder_dir
sed 's/|.*//' longest_orfs.cds | grep ">" | sed 's/>//' | uniq > longest_orfs.cds.genes
grep -v -f longest_orfs.cds.genes ../transcripts_u_filter.fa.genes > longest_orfs.cds.genes.not.genes
sed 's/^/>/' longest_orfs.cds.genes.not.genes > temp && mv temp longest_orfs.cds.genes.not.genes
python ../extract_sequences.py longest_orfs.cds.genes.not.genes ../transcripts_u_filter.fa longest_orfs.cds.genes.not.genes.fa
blastn -query longest_orfs.cds.genes.not.genes.fa -db ../$3.blast.out -out longest_orfs.cds.genes.not.genes.fa.blast.out -outfmt 6
python ../filter_sequences.py longest_orfs.cds.genes.not.genes.fa.blast.out longest_orfs.cds.genes.not.genes.fa.blast.out.filtered
grep -v -f longest_orfs.cds.genes.not.genes.fa.blast.out.filtered longest_orfs.cds.genes.not.genes.fa > lincRNA_final.fa
fi


Here is how I run it:

```
sh test.sh cuffcompare_out_annot_no_annot.combined.gtf /mydata/db/Brapa_sequ

Solution

This script looks like a wall of text and barely readable.
Let's break it down from top to bottom.

Simplify the handling of -h

Since the if at the beginning to check if the first parameter is -h will exit if true,
it would be simpler to drop the else part entirely:

if [ "$1" == -h ]
then
    echo "Usage: sh $0 cuffcompare_output reference_genome blast_file"
    exit
fi


I also dropped the quotes around the -h, they are not necessary.

Validate the command line arguments

The script expects 3 parameters, but doesn't validate if it really received them.
I suggest to rewrite the top part like this:

help() {
    exitcode=$1
    echo "Usage: sh $0 cuffcompare_output reference_genome blast_file"
    exit $exitcode
}

[ "$1" == -h ] && help 0
[ $# -lt 3 ] && help 1


I added a help function to avoid duplication, because I wasn't too reuse the same help message when exiting the program in both cases:

  • When the first param is -h, print help and exit with 0, meaning success



  • When there are less than 3 params, print help and exit with 1, meaning failure



Instead of if statements I used && expressions to make it shorter and simpler. The result is equivalent.

Give the parameters meaningful names

The parameters $1, $2 and $3 are scattered here and there in the wall of text,
and they are not exactly easy to remember.
Give them proper names right after validation:

cuffcompare_output=$1
reference_genome=$2
blast_file=$3


And use these names throughout the script.

Taming the wall of text...

Add some line breaks.
Group together lines that are most tightly related.
For example,
for the lines that are the continuation of a pipeline, put
a line break before and after would make this easier to read.
And when a pipeline gets too long,
add some more line breaks (escaped with \) to make the important ending part more visible,
for example when redirecting output to a file.

For example, I think this is more readable:

wget https://github.com/TransDecoder/TransDecoder/archive/2.0.1.tar.gz && tar xvf 2.0.1 && rm -r 2.0.1

makeblastdb -in $3 -dbtype nucl -out $3.blast.out

grep '"u"' $1 | \
gffread -w transcripts_u.fa -g $2 - && \
python2.7 get_gene_length_filter.py transcripts_u.fa transcripts_u_filter.fa && \
TransDecoder-2.0.1/TransDecoder.LongOrfs -t transcripts_u_filter.fa

sed 's/ .*//' transcripts_u_filter.fa | grep ">" | sed 's/>//' \
> transcripts_u_filter.fa.genes

cd transcripts_u_filter.fa.transdecoder_dir

sed 's/|.*//' longest_orfs.cds | grep ">" | sed 's/>//' | uniq \
> longest_orfs.cds.genes

grep -v -f longest_orfs.cds.genes ../transcripts_u_filter.fa.genes \
> longest_orfs.cds.genes.not.genes

sed 's/^/>/' longest_orfs.cds.genes.not.genes \
> temp && mv temp longest_orfs.cds.genes.not.genes

python ../extract_sequences.py longest_orfs.cds.genes.not.genes ../transcripts_u_filter.fa longest_orfs.cds.genes.not.genes.fa

blastn -query longest_orfs.cds.genes.not.genes.fa -db ../$3.blast.out -out longest_orfs.cds.genes.not.genes.fa.blast.out -outfmt 6

python ../filter_sequences.py longest_orfs.cds.genes.not.genes.fa.blast.out longest_orfs.cds.genes.not.genes.fa.blast.out.filtered

grep -v -f longest_orfs.cds.genes.not.genes.fa.blast.out.filtered longest_orfs.cds.genes.not.genes.fa > lincRNA_final.fa


As a small tip, a shorter equivalent of grep ">" | sed 's/>//' is sed -ne 's/>//p'.

Code Snippets

if [ "$1" == -h ]
then
    echo "Usage: sh $0 cuffcompare_output reference_genome blast_file"
    exit
fi
help() {
    exitcode=$1
    echo "Usage: sh $0 cuffcompare_output reference_genome blast_file"
    exit $exitcode
}

[ "$1" == -h ] && help 0
[ $# -lt 3 ] && help 1
cuffcompare_output=$1
reference_genome=$2
blast_file=$3
wget https://github.com/TransDecoder/TransDecoder/archive/2.0.1.tar.gz && tar xvf 2.0.1 && rm -r 2.0.1

makeblastdb -in $3 -dbtype nucl -out $3.blast.out

grep '"u"' $1 | \
gffread -w transcripts_u.fa -g $2 - && \
python2.7 get_gene_length_filter.py transcripts_u.fa transcripts_u_filter.fa && \
TransDecoder-2.0.1/TransDecoder.LongOrfs -t transcripts_u_filter.fa

sed 's/ .*//' transcripts_u_filter.fa | grep ">" | sed 's/>//' \
> transcripts_u_filter.fa.genes

cd transcripts_u_filter.fa.transdecoder_dir

sed 's/|.*//' longest_orfs.cds | grep ">" | sed 's/>//' | uniq \
> longest_orfs.cds.genes

grep -v -f longest_orfs.cds.genes ../transcripts_u_filter.fa.genes \
> longest_orfs.cds.genes.not.genes

sed 's/^/>/' longest_orfs.cds.genes.not.genes \
> temp && mv temp longest_orfs.cds.genes.not.genes

python ../extract_sequences.py longest_orfs.cds.genes.not.genes ../transcripts_u_filter.fa longest_orfs.cds.genes.not.genes.fa

blastn -query longest_orfs.cds.genes.not.genes.fa -db ../$3.blast.out -out longest_orfs.cds.genes.not.genes.fa.blast.out -outfmt 6

python ../filter_sequences.py longest_orfs.cds.genes.not.genes.fa.blast.out longest_orfs.cds.genes.not.genes.fa.blast.out.filtered

grep -v -f longest_orfs.cds.genes.not.genes.fa.blast.out.filtered longest_orfs.cds.genes.not.genes.fa > lincRNA_final.fa

Context

StackExchange Code Review Q#82772, answer score: 4

Revisions (0)

No revisions yet.