Parsing a GTF file using the command line

Submitted by: @import:stackexchange-codereview·Mar 10, 2026·

Problem

I am extracting exon details from a GTF file using Unix command-line tools such as cut, awk, grep, or sed.

input file.gtf:

chrI    ce11_ws245Genes CDS 8378308 8378427 0.000000    -   0   gene_id "T19A6.1a.2"; transcript_id "T19A6.1a.2"; 
chrI    ce11_ws245Genes exon    8377602 8378427 0.000000    -   .   gene_id "T19A6.1a.2"; transcript_id "T19A6.1a.2"; 
chrI    ce11_ws245Genes CDS 8379137 8379239 0.000000    -   1   gene_id "T19A6.1a.2"; transcript_id "T19A6.1a.2"; 
chrI    ce11_ws245Genes exon    8379137 8379239 0.000000    -   .   gene_id "T19A6.1a.2"; transcript_id "T19A6.1a.2"; 
chrI    ce11_ws245Genes CDS 8379706 8379815 0.000000    -   0   gene_id "T19A6.1a.2"; transcript_id "T19A6.1a.2"; 
chrI    ce11_ws245Genes exon    8379706 8379815 0.000000    -   .   gene_id "T19A6.1a.2"; transcript_id "T19A6.1a.2"; 
chrI    ce11_ws245Genes CDS 8380330 8380445 0.000000    -   2   gene_id "T19A6.1a.2"; transcript_id "T19A6.1a.2"; 
chrI    ce11_ws245Genes exon    8380330 8380445 0.000000    -   .   gene_id "T19A6.1a.2"; transcript_id "T19A6.1a.2"; 
chrI    ce11_ws245Genes CDS 8388028 8388092 0.000000    -   1   gene_id "T19A6.1a.2"; transcript_id "T19A6.1a.2";

Desired output:

chrI 8377602 8378427 -  T19A6.1a.2
chrI 8379137 8379239 -  T19A6.1a.2
chrI 8379706 8379815 -  T19A6.1a.2
chrI 8380330 8380445 -  T19A6.1a.2

My successful attempts to solve the problem:

awk '/exon/ {print $1 " " $4 " " $5 " " $7 " " $10;}' file.gtf | awk '{sub(/gene_id/,"",$5)};1' | awk -F'"' '{print $1, $2}'

grep 'exon' file.gtf | cut -f1,4,5,7,9 | cut -d ';' -f1 | awk '{sub(/gene_id/,"",$5)};1' | awk -F'"' '{print $1, $2}'

steps:

search for lines which contain the word 'exon'

cut the fields of interest 1,4,5,7,9

in field 9: cut using the delimiter ';'

remove 'gene_id'

remove the double quotations around the genes' names

Solution

Your code can be simplified to a single awk script:

awk '/exon/ {gsub("[\";]","", $10);print $1,$4,$5,$7,$10}' file.gtf

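gsub removes every occurrence of " or ; in the 10th field. With awk's default whitespace splitting, the 9th field is the literal word gene_id and the 10th is the quoted gene name.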
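
One caveat: /exon/ matches the word anywhere on the line, so a CDS row whose attributes happened to contain "exon" would also be printed. A stricter variant, sketched below, compares the feature column directly. It assumes the file is tab-separated (as GTF normally is) and that gene_id is the first attribute in column 9; the two-line sample file is made up from the rows in the question.

```shell
# Build a tiny sample file (tab-separated; rows copied from the question).
gtf=$(mktemp)
printf 'chrI\tce11_ws245Genes\tCDS\t8378308\t8378427\t0.000000\t-\t0\tgene_id "T19A6.1a.2"; transcript_id "T19A6.1a.2";\n' > "$gtf"
printf 'chrI\tce11_ws245Genes\texon\t8377602\t8378427\t0.000000\t-\t.\tgene_id "T19A6.1a.2"; transcript_id "T19A6.1a.2";\n' >> "$gtf"

# Split on tabs, keep only rows whose feature column is exactly "exon",
# then pull the first quoted value out of the attribute column.
# Assumption: gene_id is the first attribute, so its value is attr[2].
result=$(awk -F'\t' '$3 == "exon" {
    split($9, attr, "\"")      # attr[2] is the text between the first pair of quotes
    print $1, $4, $5, $7, attr[2]
}' "$gtf")

printf '%s\n' "$result"
rm -f "$gtf"
```

On the sample this should print only the exon row. Fields are separated by a single space, rather than the double space before the gene name in the desired output above.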

Context

StackExchange Code Review Q#143298, answer score: 3
