patternbashMinor
bash script for constructing RNA pipeline
Viewed 0 times
scriptrnapipelineconstructingforbash
Problem
I have written a bash script that consists of multiple commands and Python scripts. The goal is to make a pipeline for detecting long non coding RNA from a certain input. Ultimately I would like to turn this into a app and host it on some bioinformatics website. One problem I am facing is using
Here is how I run it:
```
sh test.sh cuffcompare_out_annot_no_annot.combined.gtf /mydata/db/Brapa_sequ
getopt tools in bash. I couldn't find a good tutorial that I understand clearly. In addition any other comments related to improving on the code is much appreciated.#!/bin/bash
if [ "$1" == "-h" ]
then
echo "Usage: sh $0 cuffcompare_output reference_genome blast_file"
exit
else
wget https://github.com/TransDecoder/TransDecoder/archive/2.0.1.tar.gz && tar xvf 2.0.1 && rm -r 2.0.1
makeblastdb -in $3 -dbtype nucl -out $3.blast.out
grep '"u"' $1 | \
gffread -w transcripts_u.fa -g $2 - && \
python2.7 get_gene_length_filter.py transcripts_u.fa transcripts_u_filter.fa && \
TransDecoder-2.0.1/TransDecoder.LongOrfs -t transcripts_u_filter.fa
sed 's/ .*//' transcripts_u_filter.fa | grep ">" | sed 's/>//' > transcripts_u_filter.fa.genes
cd transcripts_u_filter.fa.transdecoder_dir
sed 's/|.*//' longest_orfs.cds | grep ">" | sed 's/>//' | uniq > longest_orfs.cds.genes
grep -v -f longest_orfs.cds.genes ../transcripts_u_filter.fa.genes > longest_orfs.cds.genes.not.genes
sed 's/^/>/' longest_orfs.cds.genes.not.genes > temp && mv temp longest_orfs.cds.genes.not.genes
python ../extract_sequences.py longest_orfs.cds.genes.not.genes ../transcripts_u_filter.fa longest_orfs.cds.genes.not.genes.fa
blastn -query longest_orfs.cds.genes.not.genes.fa -db ../$3.blast.out -out longest_orfs.cds.genes.not.genes.fa.blast.out -outfmt 6
python ../filter_sequences.py longest_orfs.cds.genes.not.genes.fa.blast.out longest_orfs.cds.genes.not.genes.fa.blast.out.filtered
grep -v -f longest_orfs.cds.genes.not.genes.fa.blast.out.filtered longest_orfs.cds.genes.not.genes.fa > lincRNA_final.fa
fiHere is how I run it:
```
sh test.sh cuffcompare_out_annot_no_annot.combined.gtf /mydata/db/Brapa_sequ
Solution
This script looks like a wall of text and barely readable.
Let's break it down from top to bottom.
Simplify the handling of
Since the
it would be simpler to drop the
I also dropped the quotes around the
Validate the command line arguments
The script expects 3 parameters, but doesn't validate if it really received them.
I suggest to rewrite the top part like this:
I added a help function to avoid duplication, because I wasn't too reuse the same help message when exiting the program in both cases:
Instead of if statements I used && expressions to make it shorter and simpler. The result is equivalent.
Give the parameters meaningful names
The parameters
and they are not exactly easy to remember.
Give them proper names right after validation:
And use these names throughout the script.
Taming the wall of text...
Add some line breaks.
Group together lines that are most tightly related.
For example,
for the lines that are the continuation of a pipeline, put
a line break before and after would make this easier to read.
And when a pipeline gets too long,
add some more line breaks (escaped with
for example when redirecting output to a file.
For example, I think this is more readable:
As a small tip, a shorter equivalent of
Let's break it down from top to bottom.
Simplify the handling of
-hSince the
if at the beginning to check if the first parameter is -h will exit if true,it would be simpler to drop the
else part entirely:if [ "$1" == -h ]
then
echo "Usage: sh $0 cuffcompare_output reference_genome blast_file"
exit
fiI also dropped the quotes around the
-h, they are not necessary.Validate the command line arguments
The script expects 3 parameters, but doesn't validate if it really received them.
I suggest to rewrite the top part like this:
help() {
exitcode=$1
echo "Usage: sh $0 cuffcompare_output reference_genome blast_file"
exit $exitcode
}
[ "$1" == -h ] && help 0
[ $# -lt 3 ] && help 1I added a help function to avoid duplication, because I wasn't too reuse the same help message when exiting the program in both cases:
- When the first param is -h, print help and exit with 0, meaning success
- When there are less than 3 params, print help and exit with 1, meaning failure
Instead of if statements I used && expressions to make it shorter and simpler. The result is equivalent.
Give the parameters meaningful names
The parameters
$1, $2 and $3 are scattered here and there in the wall of text,and they are not exactly easy to remember.
Give them proper names right after validation:
cuffcompare_output=$1
reference_genome=$2
blast_file=$3And use these names throughout the script.
Taming the wall of text...
Add some line breaks.
Group together lines that are most tightly related.
For example,
for the lines that are the continuation of a pipeline, put
a line break before and after would make this easier to read.
And when a pipeline gets too long,
add some more line breaks (escaped with
\) to make the important ending part more visible,for example when redirecting output to a file.
For example, I think this is more readable:
wget https://github.com/TransDecoder/TransDecoder/archive/2.0.1.tar.gz && tar xvf 2.0.1 && rm -r 2.0.1
makeblastdb -in $3 -dbtype nucl -out $3.blast.out
grep '"u"' $1 | \
gffread -w transcripts_u.fa -g $2 - && \
python2.7 get_gene_length_filter.py transcripts_u.fa transcripts_u_filter.fa && \
TransDecoder-2.0.1/TransDecoder.LongOrfs -t transcripts_u_filter.fa
sed 's/ .*//' transcripts_u_filter.fa | grep ">" | sed 's/>//' \
> transcripts_u_filter.fa.genes
cd transcripts_u_filter.fa.transdecoder_dir
sed 's/|.*//' longest_orfs.cds | grep ">" | sed 's/>//' | uniq \
> longest_orfs.cds.genes
grep -v -f longest_orfs.cds.genes ../transcripts_u_filter.fa.genes \
> longest_orfs.cds.genes.not.genes
sed 's/^/>/' longest_orfs.cds.genes.not.genes \
> temp && mv temp longest_orfs.cds.genes.not.genes
python ../extract_sequences.py longest_orfs.cds.genes.not.genes ../transcripts_u_filter.fa longest_orfs.cds.genes.not.genes.fa
blastn -query longest_orfs.cds.genes.not.genes.fa -db ../$3.blast.out -out longest_orfs.cds.genes.not.genes.fa.blast.out -outfmt 6
python ../filter_sequences.py longest_orfs.cds.genes.not.genes.fa.blast.out longest_orfs.cds.genes.not.genes.fa.blast.out.filtered
grep -v -f longest_orfs.cds.genes.not.genes.fa.blast.out.filtered longest_orfs.cds.genes.not.genes.fa > lincRNA_final.faAs a small tip, a shorter equivalent of
grep ">" | sed 's/>//' is sed -ne 's/>//p'.Code Snippets
if [ "$1" == -h ]
then
echo "Usage: sh $0 cuffcompare_output reference_genome blast_file"
exit
fihelp() {
exitcode=$1
echo "Usage: sh $0 cuffcompare_output reference_genome blast_file"
exit $exitcode
}
[ "$1" == -h ] && help 0
[ $# -lt 3 ] && help 1cuffcompare_output=$1
reference_genome=$2
blast_file=$3wget https://github.com/TransDecoder/TransDecoder/archive/2.0.1.tar.gz && tar xvf 2.0.1 && rm -r 2.0.1
makeblastdb -in $3 -dbtype nucl -out $3.blast.out
grep '"u"' $1 | \
gffread -w transcripts_u.fa -g $2 - && \
python2.7 get_gene_length_filter.py transcripts_u.fa transcripts_u_filter.fa && \
TransDecoder-2.0.1/TransDecoder.LongOrfs -t transcripts_u_filter.fa
sed 's/ .*//' transcripts_u_filter.fa | grep ">" | sed 's/>//' \
> transcripts_u_filter.fa.genes
cd transcripts_u_filter.fa.transdecoder_dir
sed 's/|.*//' longest_orfs.cds | grep ">" | sed 's/>//' | uniq \
> longest_orfs.cds.genes
grep -v -f longest_orfs.cds.genes ../transcripts_u_filter.fa.genes \
> longest_orfs.cds.genes.not.genes
sed 's/^/>/' longest_orfs.cds.genes.not.genes \
> temp && mv temp longest_orfs.cds.genes.not.genes
python ../extract_sequences.py longest_orfs.cds.genes.not.genes ../transcripts_u_filter.fa longest_orfs.cds.genes.not.genes.fa
blastn -query longest_orfs.cds.genes.not.genes.fa -db ../$3.blast.out -out longest_orfs.cds.genes.not.genes.fa.blast.out -outfmt 6
python ../filter_sequences.py longest_orfs.cds.genes.not.genes.fa.blast.out longest_orfs.cds.genes.not.genes.fa.blast.out.filtered
grep -v -f longest_orfs.cds.genes.not.genes.fa.blast.out.filtered longest_orfs.cds.genes.not.genes.fa > lincRNA_final.faContext
StackExchange Code Review Q#82772, answer score: 4
Revisions (0)
No revisions yet.