Parsing GTF file using command-line

Submitted by: @import:stackexchange-codereview

Problem

I am extracting exon details from a GTF file using Unix command-line tools such as cut, awk, grep, or sed.

Input file.gtf:

chrI    ce11_ws245Genes CDS 8378308 8378427 0.000000    -   0   gene_id "T19A6.1a.2"; transcript_id "T19A6.1a.2"; 
chrI    ce11_ws245Genes exon    8377602 8378427 0.000000    -   .   gene_id "T19A6.1a.2"; transcript_id "T19A6.1a.2"; 
chrI    ce11_ws245Genes CDS 8379137 8379239 0.000000    -   1   gene_id "T19A6.1a.2"; transcript_id "T19A6.1a.2"; 
chrI    ce11_ws245Genes exon    8379137 8379239 0.000000    -   .   gene_id "T19A6.1a.2"; transcript_id "T19A6.1a.2"; 
chrI    ce11_ws245Genes CDS 8379706 8379815 0.000000    -   0   gene_id "T19A6.1a.2"; transcript_id "T19A6.1a.2"; 
chrI    ce11_ws245Genes exon    8379706 8379815 0.000000    -   .   gene_id "T19A6.1a.2"; transcript_id "T19A6.1a.2"; 
chrI    ce11_ws245Genes CDS 8380330 8380445 0.000000    -   2   gene_id "T19A6.1a.2"; transcript_id "T19A6.1a.2"; 
chrI    ce11_ws245Genes exon    8380330 8380445 0.000000    -   .   gene_id "T19A6.1a.2"; transcript_id "T19A6.1a.2"; 
chrI    ce11_ws245Genes CDS 8388028 8388092 0.000000    -   1   gene_id "T19A6.1a.2"; transcript_id "T19A6.1a.2";


Desired output:

chrI 8377602 8378427 -  T19A6.1a.2
chrI 8379137 8379239 -  T19A6.1a.2
chrI 8379706 8379815 -  T19A6.1a.2
chrI 8380330 8380445 -  T19A6.1a.2


My successful attempts to solve the problem:

awk '/exon/ {print $1 " " $4 " " $5 " " $7 " " $10;}' file.gtf | awk '{sub(/gene_id/,"",$5)};1' | awk -F'"' '{print $1, $2}'

grep 'exon' file.gtf | cut -f1,4,5,7,9 | cut -d ';' -f1 | awk '{sub(/gene_id/,"",$5)};1' | awk -F'"' '{print $1, $2}'


Steps:

  • search for lines which contain the word 'exon'
  • cut the fields of interest 1,4,5,7,9
  • in field 9: cut using the delimiter ';'
  • remove 'gene_id'
  • remove the double quotations around the genes' names
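
The steps above can be sketched as a pipeline with one stage per step. This is only an illustration: the file names steps.gtf and steps.txt are hypothetical, the two sample records are copied from the input above, and sed and tr are swapped in for the awk calls used in the attempts.

```shell
# Two sample records (hypothetical file name steps.gtf); printf emits real tabs.
printf 'chrI\tce11_ws245Genes\tCDS\t8378308\t8378427\t0.000000\t-\t0\tgene_id "T19A6.1a.2"; transcript_id "T19A6.1a.2";\n' > steps.gtf
printf 'chrI\tce11_ws245Genes\texon\t8377602\t8378427\t0.000000\t-\t.\tgene_id "T19A6.1a.2"; transcript_id "T19A6.1a.2";\n' >> steps.gtf

grep 'exon' steps.gtf |   # step 1: keep lines containing the word exon
cut -f1,4,5,7,9 |         # step 2: keep tab-separated fields 1, 4, 5, 7 and 9
cut -d ';' -f1 |          # step 3: keep the attribute text before the first semicolon
sed 's/gene_id //' |      # step 4: drop the gene_id key
tr -d '"' |               # step 5: drop the double quotes
tr '\t' ' ' > steps.txt   # extra: turn the remaining tabs into single spaces

cat steps.txt
```

For the sample records this prints the single exon line as `chrI 8377602 8378427 - T19A6.1a.2`.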

Solution

Your code can be simplified to a single awk command:

awk '/exon/ {gsub("[\";]","", $10);print $1,$4,$5,$7,$10}' file.gtf


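gsub removes every occurrence of " or ; from the 10th field. With awk's default whitespace splitting, the word gene_id is field 9 and the quoted gene name is field 10.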
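
One possible refinement, offered as a sketch rather than as part of the original answer: the pattern /exon/ matches the text anywhere on the line, so a record of another feature type whose attributes happen to contain "exon" would also be selected. Assuming the file is tab-delimited (as the cut -f call in the question implies) and that gene_id is the first attribute in column 9, the feature column can be compared exactly. The file names sample.gtf and exons.txt are hypothetical.

```shell
# Two sample records (hypothetical file name sample.gtf); printf emits real tabs.
printf 'chrI\tce11_ws245Genes\tCDS\t8378308\t8378427\t0.000000\t-\t0\tgene_id "T19A6.1a.2"; transcript_id "T19A6.1a.2";\n' > sample.gtf
printf 'chrI\tce11_ws245Genes\texon\t8377602\t8378427\t0.000000\t-\t.\tgene_id "T19A6.1a.2"; transcript_id "T19A6.1a.2";\n' >> sample.gtf

# -F'\t' splits on tabs, so $3 is the feature type and $9 the whole attribute string.
awk -F'\t' '$3 == "exon" {
    split($9, attr, "\"")          # attr[2] is the first quoted value (gene_id, by assumption)
    print $1, $4, $5, $7, attr[2]
}' sample.gtf > exons.txt

cat exons.txt
```

For the sample records this prints `chrI 8377602 8378427 - T19A6.1a.2`. If gene_id is not guaranteed to come first in the attribute column, the split would need to be replaced by a search for the gene_id key.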

Context

StackExchange Code Review Q#143298, answer score: 3
