The Genetic Code

Submitted by: @import:stackexchange-codereview

Problem

This question is part of a series solving the Rosalind challenges. For the previous question in this series, see Wascally wabbits. The repository with all my up-to-date solutions so far can be found here.

Problem: PROT


The 20 commonly occurring amino acids are abbreviated by using 20 letters from the English alphabet (all letters except for B, J, O, U, X, and Z). Protein strings are constructed from these 20 symbols. Henceforth, the term genetic string will incorporate protein strings along with DNA strings and RNA strings.


The RNA codon table dictates the details regarding the encoding of specific codons into the amino acid alphabet.

Given:


An RNA string $s$ corresponding to a strand of mRNA (of length at most 10 kbp).

Return:


The protein string encoded by $s$.

Sample Dataset:

AUGGCCAUGGCGCCCAGAACUGAGAUCAAUAGUACCCGUAUUAACGGGUGA


Sample Output:

MAMAPRTEINSTRING


My solution solves the sample dataset and the actual dataset given.

Dataset:

```
AUGCGCCCUUGGUCGCUCCUUGGAUCAGAGCAUAUUCUAUCACGGCGCGUCGAAGGAAUAACCCACGACAUCUCUCUAUAUUGGAUUCCCUUUUUUUCGGUUCAGGUAGAUCAUUUGGCUACUGGACUUUCUAAGAUUUACUCCGCCAUGUUCCUAUAUGUUACAUUCUCAGCCGAAGUCGCUGUAUAUCACGUUAAGGUAGACGGUUCCUUGACUACCAGCGACGCCUGUAGGGAGAAUUCCAUCCAUCAUGCAUGUAUGGGCAGUGCGCACUUACAGCGCCAUAGGCAACGGACGGACAGACCUCCUUUCCUGUCGGACGGUAAGCCGCGAUCCAAUACAGAGCAAAGUCCCACGCCCUCCUAUAGACUCACGCCAAGAUUGUAUUCCCCGUUAACCGCUCUCUCAGGGAAGUUGUAUCUACUCGGAUCGGGAUGUCCUUGGAAAUGUAGGAAAAUGGCUCAAACUACGAUUGUAUACCGUGCGAGACGUUGGAUCCCGCUUAUCACUGAUACCAUAAUCUGUGUGGCCCCCUUACCACAACCUAACCAUGGAGUAGUAGCCCUGGCCGUCCCUUCAAGGGCAAGACCUCAUUGUCUUGUACGCUUAUCACAAGGGCCAUCUAACAAUGUGUACCGGUAUAAUUUUACGUGGUAUUGUCCAGACGGCGGUACGGCCGAUCCGUUGCCAUUUCGUCAUGGCAUAACCUCGGUCUAUCUUCCUUCCUACUCGGGAAUAGUUCGCAGUACACCAUACCUCAUCGGCACUUACGCUGUUCCAACACAAAAUUCUGAUCCCUUCGCUACCACCCGCUGGGUAUCUGUCAGGUUACUGGCCUCCACUACCGGAGAGGGCGAUACGGGGGACCGCGGAACACUUUCUACAUUUUUGGACUGCUGUAUGUCUACUUCGGCUCUUCCUCCCGCGAGAUUUAUAUCGGCAUACAGAGUAAACGCUACCCAUGGCGACGACACUCGUCUCACCGUUAAGAAGCUGUG
```

Solution

Some notes:

- As others have already pointed out, use a hash instead of a gigantic case. Key it by codon so that each lookup is O(1); searching the hash's values for every codon would make the method very inefficient.
- You can use Enumerable#take_while to stop at the first stop codon.
- Encapsulate the code in a module/class.
- You need the explicit return because it is not the last expression of the method: it sits inside the scan block, which you want to break out of.
- Note that this works: "123456".gsub(/.../) { |triplet| triplet[0] } #=> "14"
- This is a common pattern: write the data in the most declarative, simple form, then programmatically build (on initialization) whatever efficient data structures the algorithm needs.
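A minimal sketch of the hash-related notes, assuming a deliberately tiny subset of the codon table. The module name `CodonLookupSketch` and the helpers `slow_lookup` and `until_stop` are hypothetical and exist only for illustration. It contrasts searching the declarative table on every lookup with inverting it once, and shows how take_while cuts the result at the stop marker.

```ruby
# Hypothetical module name; the table is a tiny subset of the genetic code.
module CodonLookupSketch
  CODONS_BY_AMINOACID = {
    "M"    => ["AUG"],
    "A"    => ["GCU", "GCC", "GCA", "GCG"],
    "STOP" => ["UGA", "UAA", "UAG"],
  }.freeze

  # Linear lookup: walks the value arrays on every call.
  def self.slow_lookup(codon)
    pair = CODONS_BY_AMINOACID.find { |_aminoacid, codons| codons.include?(codon) }
    pair && pair.first
  end

  # Inverted once at load time: every codon becomes a key, so lookups are O(1).
  AMINOACID_BY_CODON = CODONS_BY_AMINOACID.
    flat_map { |aminoacid, codons| codons.map { |codon| [codon, aminoacid] } }.
    to_h.freeze

  # take_while keeps elements until the block first returns false,
  # which drops the stop marker and everything after it.
  def self.until_stop(codons)
    codons.map { |codon| AMINOACID_BY_CODON[codon] }.
      take_while { |aminoacid| aminoacid != "STOP" }.
      join
  end
end

p CodonLookupSketch.slow_lookup("GCC")               # "A"
p CodonLookupSketch::AMINOACID_BY_CODON["GCC"]       # "A"
p CodonLookupSketch.until_stop(%w[AUG GCC UGA GCG])  # "MA"
```

Both lookups return the same answer; the difference is that the inverted hash does the search work once instead of once per codon.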

I'd write it in functional style:

module Rosalind
  CODONS_BY_AMINOACID = {
    "F" => ["UUU", "UUC"],
    "L" => ["UUA", "UUG", "CUU", "CUC", "CUA", "CUG"],
    "S" => ["UCU", "UCC", "UCA", "UCG", "AGU", "AGC"],
    "Y" => ["UAU", "UAC"],
    "C" => ["UGU", "UGC"],
    "W" => ["UGG"],
    "P" => ["CCU", "CCC", "CCA", "CCG"],
    "H" => ["CAU", "CAC"],
    "Q" => ["CAA", "CAG"],
    "R" => ["CGU", "CGC", "CGA", "CGG", "AGA", "AGG"],
    "I" => ["AUU", "AUC", "AUA"],
    "M" => ["AUG"],
    "T" => ["ACU", "ACC", "ACA", "ACG"],
    "N" => ["AAU", "AAC"],
    "K" => ["AAA", "AAG"],
    "V" => ["GUU", "GUC", "GUA", "GUG"],
    "A" => ["GCU", "GCC", "GCA", "GCG"],
    "D" => ["GAU", "GAC"],
    "E" => ["GAA", "GAG"],
    "G" => ["GGU", "GGC", "GGA", "GGG"],
    "STOP" => ["UGA", "UAA", "UAG"],
  }
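  # Invert the table once so each codon lookup is a single O(1) hash access.
  AMINOACID_BY_CODON = CODONS_BY_AMINOACID.
    flat_map { |aminoacid, codons| codons.map { |codon| [codon, aminoacid] } }.to_h

  # Translates an RNA string into a protein string, stopping at the first stop codon.
  def self.problem_prot(rna_string)
    rna_string.
      scan(/[ACGU]{3}/).
      map { |codon| AMINOACID_BY_CODON[codon] }.
      take_while { |aminoacid| aminoacid != "STOP" }.
      join
  end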
end
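To check the same pipeline against the sample dataset in a self-contained script, the sketch below rebuilds the codon lookup from the standard 64-letter encoding of the genetic code instead of repeating the full table. That construction, the module name `RosalindSketch`, the method name `translate`, and the use of fetch are all assumptions for illustration, not part of the code above. Using fetch means an unrecognized codon raises a KeyError rather than becoming nil and silently vanishing in join.

```ruby
# Hypothetical module; builds the codon table from a compact encoding.
module RosalindSketch
  BASES = %w[U C A G].freeze

  # Standard genetic code with codons ordered UUU, UUC, UUA, UUG, UCU, ...
  # "*" marks a stop codon.
  AMINOACIDS =
    "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG".freeze

  # product varies the last base fastest, matching the order of AMINOACIDS.
  CODONS = BASES.product(BASES, BASES).map(&:join).freeze
  AMINOACID_BY_CODON = CODONS.zip(AMINOACIDS.chars).to_h.freeze

  def self.translate(rna_string)
    rna_string.
      scan(/.{3}/).                                      # split into triplets
      map { |codon| AMINOACID_BY_CODON.fetch(codon) }.   # raises on unknown codons
      take_while { |aminoacid| aminoacid != "*" }.       # cut at the first stop
      join
  end
end

sample = "AUGGCCAUGGCGCCCAGAACUGAGAUCAAUAGUACCCGUAUUAACGGGUGA"
puts RosalindSketch.translate(sample)  # MAMAPRTEINSTRING
```

Because map is eager here, every triplet is looked up, including any that follow the stop codon; inserting `.lazy` after scan and `.to_a` before join would stop the lookups at the first stop codon.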

Context

StackExchange Code Review Q#128203, answer score: 7
