patternMinor
Data screening using Perl
Viewed 0 times
screeningdatausingperl
Problem
Background information
I've been asked to write a little Perl script that allows genomic data to be screened against reference files in order to determine locations of specific mutations.
The input file format look like this (information are tab-separated). In the script it's referred to as Funestus_NON_SYN.txt.
```
KB669306 9962 . C T 520.77 PASS AC=13;AF=0.929;AN=14;BaseQRankSum=0.540;DP=103;Dels=0.00;EFF=SYNONYMOUS_CODING(LOW|SILENT|cgC/cgT|R15||AFUN012013|||AFUN012013-RA|1|1);FS=0.000;HaplotypeScore=0.0000;MQ0=0;MQRankSum=0.231;ReadPosRankSum=0.694;set=variant-variant2-variant3-variant5-variant6-variant7-variant10 GT:AD:DP:GQ:PL 1/1:0,18:18:42:549,42,0 1/1:0,12:12:33:399,33,0 1/1:0,13:13:39:505,39,0 ./. 0/1:9,4:13:99:104,0,297 1/1:0,17:17:45:570,45,0 1/1:0,18:18:51:624,51,0 ./. ./. 1/1:0,12:12:33:420,33,0
KB669306 10439 . C T 668.77 PASS AC=15;AF=0.938;AN=16;BaseQRankSum=-0.643;DP=139;Dels=0.00;EFF=SYNONYMOUS_CODING(LOW|SILENT|gcC/gcT|A174||AFUN012013|||AFUN012013-RA|1|1);FS=0.000;MQ0=0;MQRankSum=-0.746;ReadPosRankSum=-1.106;set=variant-variant3-variant5-variant6-variant7-variant8-variant9-variant10 GT:AD:DP:GQ:PL 1/1:0,22:22:51:697,51,0 ./. 1/1:0,20:20:48:667,48,0 ./. 0/1:14,12:25:99:317,0,403 1/1:0,15:15:36:469,36,0 1/1:0,16:16:36:499,36,0 1/1:0,13:13:30:415,30,0 1/1:0,14:14:30:412,30,0 1/1:0,13:13:36:492,36,0
KB668289 903 . T C 577.77 PASS AC=14;AF=0.875;AN=16;DP=173;Dels=0.00;EFF=DOWNSTREAM(MODIFIER||3412|||AFUN005870|||AFUN005870-RA||1),SYNONYMOUS_CODING(LOW|SILENT|ctA/ctG|L578||AFUN005868|||AFUN005868-RA|4|1),UPSTREAM(MODIFIER||2667|||AFUN005869|||AFUN005869-RA||1);MQ0=0;set=variant-variant2-variant3-variant5-variant6-variant7-variant9-variant10 GT:AD:DP:GQ:PL 1/1:0,20:20:45:606,45,0 1/1:0,19:19:42:588,42,0 1/1:0,22:22:54:720,54,0 ./. 1/1:0,29:29:63:882,63,0 0/1:16,13:28:99:401,0,391 1/1:0,13:13:30:425,30,0./. 1/1:0,26:26:57:785,57,0 0/1:8,7:15:99:223,0,243
KB668289 1224 . A C 572.77 PAS
I've been asked to write a little Perl script that allows genomic data to be screened against reference files in order to determine locations of specific mutations.
The input file format look like this (information are tab-separated). In the script it's referred to as Funestus_NON_SYN.txt.
```
KB669306 9962 . C T 520.77 PASS AC=13;AF=0.929;AN=14;BaseQRankSum=0.540;DP=103;Dels=0.00;EFF=SYNONYMOUS_CODING(LOW|SILENT|cgC/cgT|R15||AFUN012013|||AFUN012013-RA|1|1);FS=0.000;HaplotypeScore=0.0000;MQ0=0;MQRankSum=0.231;ReadPosRankSum=0.694;set=variant-variant2-variant3-variant5-variant6-variant7-variant10 GT:AD:DP:GQ:PL 1/1:0,18:18:42:549,42,0 1/1:0,12:12:33:399,33,0 1/1:0,13:13:39:505,39,0 ./. 0/1:9,4:13:99:104,0,297 1/1:0,17:17:45:570,45,0 1/1:0,18:18:51:624,51,0 ./. ./. 1/1:0,12:12:33:420,33,0
KB669306 10439 . C T 668.77 PASS AC=15;AF=0.938;AN=16;BaseQRankSum=-0.643;DP=139;Dels=0.00;EFF=SYNONYMOUS_CODING(LOW|SILENT|gcC/gcT|A174||AFUN012013|||AFUN012013-RA|1|1);FS=0.000;MQ0=0;MQRankSum=-0.746;ReadPosRankSum=-1.106;set=variant-variant3-variant5-variant6-variant7-variant8-variant9-variant10 GT:AD:DP:GQ:PL 1/1:0,22:22:51:697,51,0 ./. 1/1:0,20:20:48:667,48,0 ./. 0/1:14,12:25:99:317,0,403 1/1:0,15:15:36:469,36,0 1/1:0,16:16:36:499,36,0 1/1:0,13:13:30:415,30,0 1/1:0,14:14:30:412,30,0 1/1:0,13:13:36:492,36,0
KB668289 903 . T C 577.77 PASS AC=14;AF=0.875;AN=16;DP=173;Dels=0.00;EFF=DOWNSTREAM(MODIFIER||3412|||AFUN005870|||AFUN005870-RA||1),SYNONYMOUS_CODING(LOW|SILENT|ctA/ctG|L578||AFUN005868|||AFUN005868-RA|4|1),UPSTREAM(MODIFIER||2667|||AFUN005869|||AFUN005869-RA||1);MQ0=0;set=variant-variant2-variant3-variant5-variant6-variant7-variant9-variant10 GT:AD:DP:GQ:PL 1/1:0,20:20:45:606,45,0 1/1:0,19:19:42:588,42,0 1/1:0,22:22:54:720,54,0 ./. 1/1:0,29:29:63:882,63,0 0/1:16,13:28:99:401,0,391 1/1:0,13:13:30:425,30,0./. 1/1:0,26:26:57:785,57,0 0/1:8,7:15:99:223,0,243
KB668289 1224 . A C 572.77 PAS
Solution
Style Guides
The Perl community does not have a single universal style – there is more than one way to do it. But sometimes, consistency isn't a bad thing either. Here are the three important cornerstones of Perl style:
Based on those references, I make the recommendations below.
Objective Style Issues
Now lets look at what you've written.
-
Don't use the
-
Don't call functions with an ampersand, like
-
When looping over a range, do not use the C-style
The Perl community does not have a single universal style – there is more than one way to do it. But sometimes, consistency isn't a bad thing either. Here are the three important cornerstones of Perl style:
- A
perlstylemanpage exists which explains a sensible core of a style guide.
- Damian Conway published the Perl Best Practices book with many valuable tips, although I'd view his preferences as over-zealous.
- The
perlcritictool exists to automatically check your code for a set of style violations. It's inspired by Perl Best Practices. You can install it via the Perl::Critic module. You can layout your code automatically withperltidy, installable via Perl::Tidy.
Based on those references, I make the recommendations below.
Objective Style Issues
Now lets look at what you've written.
-
Don't use the
-w command line flag, only use warnings, and use it always. They are fundamentally the same (they activate warnings), but -w has an unfortunate global effect. In your case it doesn't matter, but -w is a bad habit to get into.-
Don't call functions with an ampersand, like
&foo(). This has a specific meaning (it disables so-called prototypes), but it hasn't been necessary to use it for over 20 years. Simply invoke your functions like foo().-
When looping over a range, do not use the C-style
for (my $j = 1; $j
-
Since perl 5.10.0 (released 2007), you can use feature 'say' to enable the say function. It behaves exactly like print but always appends a newline, making it more convenient to use in most cases.
-
The keywords for and foreach are absolutely equivalent, so we'd use the shorter form. When used as a statement modifier, the parens around the iterated values are optional:
... for @file_genes_print;
-
In a foreach loop, always name the iteration variable rather than using $_. This is not possible in the statement modifier use case.
for my $iteration_variable (@list) {
...
}
-
Hash variables are initialized by declaration. Assigning them the empty list is unnecessary – my %hash; rather than my %hash = ();.
-
For readability, any comma should be followed by a space.
open my $fh_out, ">", "$output_file.tsv" or die ...
-
Prefer interpolation over concatenation. $scaffold."_".$location could be "${scaffold}_${location}" and $output_file.".tsv" could be "$output_file.tsv".
-
Sometimes, join is more preferable than both interpolation or concatenation. For example, "$scaffold\t$location\t$mutation\t$rest" becomes more readable as
join "\t", $scaffold, $location, $mutation, $rest
-
The print (and say) syntax is horrible because of the difference between print $foo $bar and print $foo, $bar – the former prints $bar to the filehandle $foo, whereas the latter prints "$foo$bar" to the currently selected filehandle (usually STDOUT).
-
We could use the object-oriented interface.
use IO::File;
$foo->print($bar);
-
We can use curly braces for the indirect object notation
print { $foo } $bar
I'd recommend to always use the curly braces. You only really need them when the file handle is something else than a simple variable, but it's a good visual distinction.
-
The ScanPrint uses the %genes hash which is therefore essentially a global variable. Instead, pass it in as an argument:
scan_print(\%genes, @file_genes);
...
sub scan_print {
my ($genes, @array) = @_;
...
for my $key (keys %$genes) {
-
Don't load a whole file into memory unless you have to. If you can, process files line-by-line. Especially your use of InputFileHandler violates this. On the other hand this may be absoluetly OK for smaller files, or when you are going to use that amount of memory anyway.
Subjective Style Issues
-
Variable and function names should generally be snake_case: lowercase, and words separated by underscores. You apply this for variables, but not for subroutine names.
-
Consider using the K&R style/egyptian brackets. This means that the opening curly brace is on the same line as it's introducing keyword and not on a line of its own:
for (...) {
...
}
Why? For no other reason than that most Perl programmers do this.
-
It is customary to call builtins without parentheses. This is partly because builtins are actually operators, not functions, and partly because it looks nicer. So let's do close $fh rather than close ($fh).
-
In regexes, put symbols into character classes to emphasize them: /; / might become /[;][ ]/.
Performance Issues, Bugs, and Other Notes
-
You are using a regex /^KB./. The trailing . is entirely unnecessary as this always matches any string. Remove it: /^KB/.
-
Do not prompt for file names. Instead, take these from the command line arguments. The arguments are in the @ARGV array. So you could do:
my ($input_file, @reference_files) = @ARGV;
and invoke your script like pCode Snippets
for my $j (1 .. $i) {... for @file_genes_print;for my $iteration_variable (@list) {
...
}open my $fh_out, ">", "$output_file.tsv" or die ...join "\t", $scaffold, $location, $mutation, $restContext
StackExchange Code Review Q#48184, answer score: 6
Revisions (0)
No revisions yet.