The Genetic Code
Problem
This question is part of a series solving the Rosalind challenges. For the previous question in this series, see Wascally wabbits. The repository with all my up-to-date solutions so far can be found here.
Problem: PROT
The 20 commonly occurring amino acids are abbreviated by using 20 letters from the English alphabet (all letters except for B, J, O, U, X, and Z). Protein strings are constructed from these 20 symbols. Henceforth, the term genetic string will incorporate protein strings along with DNA strings and RNA strings.
The RNA codon table dictates the details regarding the encoding of specific codons into the amino acid alphabet.
Given:
An RNA string $s$ corresponding to a strand of mRNA (of length at most 10 kbp).
Return:
The protein string encoded by $s$.
Sample Dataset:

    AUGGCCAUGGCGCCCAGAACUGAGAUCAAUAGUACCCGUAUUAACGGGUGA

Sample Output:

    MAMAPRTEINSTRING

My solution solves the sample dataset and the actual dataset given.
Dataset:
AUGCGCCCUUGGUCGCUCCUUGGAUCAGAGCAUAUUCUAUCACGGCGCGUCGAAGGAAUAACCCACGACAUCUCUCUAUAUUGGAUUCCCUUUUUUUCGGUUCAGGUAGAUCAUUUGGCUACUGGACUUUCUAAGAUUUACUCCGCCAUGUUCCUAUAUGUUACAUUCUCAGCCGAAGUCGCUGUAUAUCACGUUAAGGUAGACGGUUCCUUGACUACCAGCGACGCCUGUAGGGAGAAUUCCAUCCAUCAUGCAUGUAUGGGCAGUGCGCACUUACAGCGCCAUAGGCAACGGACGGACAGACCUCCUUUCCUGUCGGACGGUAAGCCGCGAUCCAAUACAGAGCAAAGUCCCACGCCCUCCUAUAGACUCACGCCAAGAUUGUAUUCCCCGUUAACCGCUCUCUCAGGGAAGUUGUAUCUACUCGGAUCGGGAUGUCCUUGGAAAUGUAGGAAAAUGGCUCAAACUACGAUUGUAUACCGUGCGAGACGUUGGAUCCCGCUUAUCACUGAUACCAUAAUCUGUGUGGCCCCCUUACCACAACCUAACCAUGGAGUAGUAGCCCUGGCCGUCCCUUCAAGGGCAAGACCUCAUUGUCUUGUACGCUUAUCACAAGGGCCAUCUAACAAUGUGUACCGGUAUAAUUUUACGUGGUAUUGUCCAGACGGCGGUACGGCCGAUCCGUUGCCAUUUCGUCAUGGCAUAACCUCGGUCUAUCUUCCUUCCUACUCGGGAAUAGUUCGCAGUACACCAUACCUCAUCGGCACUUACGCUGUUCCAACACAAAAUUCUGAUCCCUUCGCUACCACCCGCUGGGUAUCUGUCAGGUUACUGGCCUCCACUACCGGAGAGGGCGAUACGGGGGACCGCGGAACACUUUCUACAUUUUUGGACUGCUGUAUGUCUACUUCGGCUCUUCCUCCCGCGAGAUUUAUAUCGGCAUACAGAGUAAACGCUACCCAUGGCGACGACACUCGUCUCACCGUUAAGAAGCUGUG
Solution
Some notes:
- As others have already pointed out, you should use a hash instead of a gigantic `case`. But make sure the lookups on that hash are O(1) (codon as key), otherwise the method will be very inefficient.
- You can use `Enumerable#take_while` to handle the stop codons.
- Encapsulate the code in a module/class.
- You need a `return` because it's not the last expression of the method: it's inside the `scan` block, which you want to break out of.
- Note that this works: `"123456".gsub(/.../) { |triplet| triplet[0] } #=> "14"`
- This is a common pattern: write the data structure in the most declarative/simple way and then programmatically build (on initialization) whatever (efficient) data structures you need in the algorithm.
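The `return` note in the list above can be sketched roughly as follows. This is only an illustration, not the original poster's code: the method name `translate_with_early_return` and the three-entry `TINY_TABLE` are hypothetical, and the table is deliberately abbreviated.

```ruby
# Hypothetical, abbreviated codon table: only the codons used below.
TINY_TABLE = { "AUG" => "M", "GCC" => "A", "UGA" => "STOP" }.freeze

def translate_with_early_return(rna)
  protein = +""
  rna.scan(/.../) do |codon|
    aminoacid = TINY_TABLE.fetch(codon)
    # `return` leaves the whole method from inside the block; without it
    # the scan would keep translating past the stop codon.
    return protein if aminoacid == "STOP"
    protein << aminoacid
  end
  protein
end

translate_with_early_return("AUGGCCUGAAUG") #=> "MA"
```

The explicit `return` is what the `take_while` version avoids: there, stopping at the first stop codon is expressed as part of the pipeline instead of as control flow inside a block.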
I'd write it in functional style:
    module Rosalind
      # Declarative table: each amino acid (or "STOP") with the codons encoding it.
      CODONS_BY_AMINOACID = {
        "F" => ["UUU", "UUC"],
        "L" => ["UUA", "UUG", "CUU", "CUC", "CUA", "CUG"],
        "S" => ["UCU", "UCC", "UCA", "UCG", "AGU", "AGC"],
        "Y" => ["UAU", "UAC"],
        "C" => ["UGU", "UGC"],
        "W" => ["UGG"],
        "P" => ["CCU", "CCC", "CCA", "CCG"],
        "H" => ["CAU", "CAC"],
        "Q" => ["CAA", "CAG"],
        "R" => ["CGU", "CGC", "CGA", "CGG", "AGA", "AGG"],
        "I" => ["AUU", "AUC", "AUA"],
        "M" => ["AUG"],
        "T" => ["ACU", "ACC", "ACA", "ACG"],
        "N" => ["AAU", "AAC"],
        "K" => ["AAA", "AAG"],
        "V" => ["GUU", "GUC", "GUA", "GUG"],
        "A" => ["GCU", "GCC", "GCA", "GCG"],
        "D" => ["GAU", "GAC"],
        "E" => ["GAA", "GAG"],
        "G" => ["GGU", "GGC", "GGA", "GGG"],
        "STOP" => ["UGA", "UAA", "UAG"],
      }

      # Inverted table (codon => amino acid), built once, for O(1) lookups.
      AMINOACID_BY_CODON = CODONS_BY_AMINOACID.
        flat_map { |aminoacid, codons| codons.map { |codon| [codon, aminoacid] } }.to_h

      # Translates an RNA string into a protein string, stopping at the first stop codon.
      def self.problem_prot(rna_string)
        rna_string.
          scan(/[UGCA]{3}/).
          map { |codon| AMINOACID_BY_CODON[codon] }.
          take_while { |aminoacid| aminoacid != "STOP" }.
          join
      end
    end
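As a quick check of the approach, here is a minimal, self-contained sketch that applies the same "declarative table, then inverted lookup" pattern to the sample dataset. The module name `ProtSketch` and the method `translate` are hypothetical, and the table is intentionally abbreviated to the codons that occur in the sample, so it is not the full genetic code.

```ruby
module ProtSketch
  # Abbreviated, hypothetical table: only the codons that occur in the
  # sample dataset, not the full genetic code.
  CODONS_BY_AMINOACID = {
    "M" => ["AUG"],
    "A" => ["GCC", "GCG"],
    "P" => ["CCC"],
    "R" => ["AGA", "CGU"],
    "T" => ["ACU", "ACC"],
    "E" => ["GAG"],
    "I" => ["AUC", "AUU"],
    "N" => ["AAU", "AAC"],
    "S" => ["AGU"],
    "G" => ["GGG"],
    "STOP" => ["UGA"],
  }.freeze

  # Inverted once at load time, so each lookup is a single hash access.
  AMINOACID_BY_CODON = CODONS_BY_AMINOACID.
    flat_map { |aminoacid, codons| codons.map { |codon| [codon, aminoacid] } }.
    to_h.freeze

  def self.translate(rna)
    rna.
      scan(/[ACGU]{3}/).
      map { |codon| AMINOACID_BY_CODON.fetch(codon) }.
      take_while { |aminoacid| aminoacid != "STOP" }.
      join
  end
end

sample = "AUGGCCAUGGCGCCCAGAACUGAGAUCAAUAGUACCCGUAUUAACGGGUGA"
puts ProtSketch.translate(sample) # prints MAMAPRTEINSTRING
```

`fetch` is used in this sketch instead of `[]` so that a codon missing from the abbreviated table raises a `KeyError` rather than silently contributing `nil` to the result.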
Context
StackExchange Code Review Q#128203, answer score: 7