HiveBrain v1.2.0
Get Started
← Back to all entries
patternMinor

Counting GUAG introns in chromosomes

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
guagchromosomesintronscounting

Problem

I have this code that is working fine but it's taking pretty much 100% of my cpu to run and it takes around 25min. I'd really like to optimize it but don't know what parts I could improve.

The main problem I guess is that I'm using coordinates of a file to be compared with a 3,5GB file.
So I tried putting the information of this 3,5GB file in a hash to make it faster but it's not what I thought it would be.
I want to compare exon coordinates to the whole human genome (hg38), see which two letters are before the exon position and which two are in the end, so that would count as an intron. Then, for each intron I want to see if those two letters are compatible with GUAG or not.

```
#!/usr/bin/perl
use warnings;
use strict;

print "Enter file path (refGene)\n";
my $ref_gene = <>;
chomp ($ref_gene);
my $local1 = "./$ref_gene";

print "Enter file path (hg38)\n";
my $filename = <>;
chomp ($filename);
my $local2 = "./$filename";

open (HG, $local2) or die "error opening $filename\n";

my %chromosome;
my $key;
while (my $line = ) {
chomp ($line);
if ($line =~ /^\>/) {
$key = $line;
$key =~ s/\>//; # / stop editor red coloring
$chromosome{$key} = "";
}
else {
$chromosome{$key} .= uc $line;
}}

close (HG);

open (REFGENE, $local1) or die "error opening $ref_gene\n";

my $total_introns = 0;
my $GUAG_introns = 0;
while (my $line = ) {
my @column = split ("\t", $line);
my $chr = $column[2];
my $sense = $column[3];
my $number_exons = $column[8];
my $beginning_of_exon = $column[9];
my $end_of_exon = $column[10];

my @beginning_exons = split (",", $beginning_of_exon);
my @end_exons = split (",", $end_of_exon);
foreach my $element (@end_exons) { #subtract 1 from end of exon because its 1-based index
$element -= 1;
}

my $introns = ($number_exons - 1); #amount of introns = exons - 1
$total_introns += $introns;

my $current_seq = $chromosome{$chr};

for (m

Solution

I have a program that seems to work and produce the same results as your own code. However I have a problem in that the intron doesn't seem to be defined properly

Suppose my sequence is the alphabet and the exons we have are defined by

@beginning_exons = (5, 17);
@end_exons = (12, 21);


1 2
1 2 3 4|5 6 7 8 9 0 1 2|3 4 5 6|7 8 9 0 1|2 3 4 5 6
A B C D|E F G H I J K L|M N O P|Q R S T U|V W X Y Z
[=============] [=======]


so they are EFGHIJKL and QRSTU, forming an intron MNOP. The calculation in your program, and in my code that produces the same result, gives $end = 'MN', which is fine, but $beginning = 'PQ', which I don't understand. It is the last character of the intron and the first character of the exon

Can this be right? I have put exactly the values above into your code and mine and get the same result

Update

As usual, the solution was staring me in the face. I am confident that the values you extract to @beginning_exons and @end_exons are not character positions but offsets, that may be passed directly to substr. It was your comment "subtract 1 from end of exon because its 1-based index" that threw me

So it is the gaps between the letters that are numbers from zero at the start of the string. Like this

1 2
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6
A B C D|E F G H I J K L|M N O P|Q R S T U|V W X Y Z
[=============] [=======]


Now the "exon" EFGHIJKL starts at 4 and ends at 12, and QRSTU starts at 16 and ends at 21, making the data look like this

@beginning_exons = (4, 16);
@end_exons = (12, 21);


Assuming this makes the arithmetic much easier, because there is no off-by-one error to account for, and the start and end of the intron are 12 and 16—the end of the first exon and the beginning of the next. Furthermore, numbers can be fed directly into substr without any offset, except that the last two characters of the intron are clearly at (6-2)

I've modified my code below to take this into account, and it makes it much clearer to read

Here's my solution

The main reason yours is so slow is that it's reading the HG38 file line by line, which means it has to upper case each line and append it to the sequence up to 4 million times per sequence

That's very slow because the repeated appending will mean that the string will frequently become too big for the space allocated to it, and it has to be copied to a larger space before it can be expanded. Copying such a huge string thousands of times takes a lot of processor work

I have written it so that a whole chromosome is read at a cctime. Then all that has to be done is to remove the newlines and set it to upper case, just once

I've also printed the each chromosome's name when it is encountered, to give some confirmation that the process is progressing. (You can stop this by removing the print " $key\n" statement.) The time taken to test the introns at the end is minimal

As I said, this code produces the results 492784 /499504 that you expect, but I'm still concerned about the end pair of bases. Please let me know if you have an explanation

use strict;
use warnings FATAL => 'all';

STDOUT->autoflush;

my ( $ref_gene_file, $hg38_file ) = qw/ refGene.txt hg38.fa /;

my %chromosome;

{
print "Processing HG38:\n";

open my $hg38_fh, '';

my $key;

while ( ) {

next unless /[^>\s]/;

s/(.+)// and $key = $1;
printf " %s\n", $key;

$chromosome{$key} = uc tr/A-Za-z//cdr;
}

print "\n";
}

my $total_introns = 0;
my $GUAG_introns = 0;

{
open my $ref_gene_fh, ' ) {

my @column = split;

my ( $chr, $sense, $n, $exon_starts, $exon_ends ) = @column[ 2, 3, 8, 9, 10 ];
my @exon_starts = $exon_starts =~ /\d+/g;
my @exon_ends = $exon_ends =~ /\d+/g;

die "Data discrepancy" unless @exon_starts == $n and @exon_ends == $n;

my $introns = $n - 1; #amount of introns = exons - 1
$total_introns += $introns;

my $sequence = $chromosome{$chr} or die "No such chromosome";

for my $i ( 0 .. $introns-1 ) {

my ($intron_start, $intron_end) = ( $exon_ends[$i], $exon_starts[$i+1] );

my $intron_prefix = substr( $sequence, $intron_start, 2 );
my $intron_suffix = substr( $sequence, $intron_end-2, 2 );

if ( $sense eq '+' ) {
++$GUAG_introns if $intron_prefix eq 'GT' and $intron_suffix eq 'AG';
}
elsif ( $sense eq '-' ) {
++$GUAG_introns if $intron_prefix eq 'CT' and $intron_suffix eq 'AC';
}
}
}
}

printf "Number of GUAG introns = %d\n", $GUAG_introns;
printf "Total introns = %d\n", $total_introns;
printf "%.1f%% of GT-AG introns.\n\n", $GUAG_introns / $total_introns * 100;


output

`Processing HG38:
chr1
chr10
chr11
chr11_K

Context

StackExchange Code Review Q#132055, answer score: 6

Revisions (0)

No revisions yet.