
Count the occurrences of nucleobases in a DNA string

Submitted by: @import:stackexchange-codereview

Problem

Inspired by this meta question I decided to take a look at Rosalind. Their first challenge seemed easy enough:

An example of a length 21 DNA string (whose alphabet contains the symbols 'A', 'C', 'G', and 'T') is "ATGCTTCAGAAAGGTCTTACG."

Given: A DNA string s of length at most 1000 nt.

Return: Four integers (separated by spaces) counting the respective number of times that the symbols 'A', 'C', 'G', and 'T' occur in s.

Sample Dataset

AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC


Sample Output

20 12 17 21


Since I'm still on my quest to learn both regex and Ruby, I decided to go that route:

def countACGT(str)
  list = [0,0,0,0]
  str.scan(/A|C|G|T/) do |sub|
    if sub == "A"
      list[0] += 1
    end
    if sub == "C"
      list[1] += 1
    end
    if sub == "G"
      list[2] += 1
    end
    if sub == "T"
      list[3] += 1
    end
  end
  return list
end


I'm not a big fan of long if chains. Luckily, Ruby has a case statement as well:

def countACGT(str)
  list = [0,0,0,0]
  str.scan(/A|C|G|T/) do |sub|
    puts case sub
    when "A"
      list[0] += 1
    when "C"
      list[1] += 1
    when "G"
      list[2] += 1
    when "T"
      list[3] += 1
    end
  end
  return list
end


Both can be invoked like:

```
p countACGT("TCCCACTTCAGGGTCAGGGAGCTCCAAACTCTCTTTCTAGAGATGACAATCGAGAGTGAGATAAGGTGGATAGCAATCGTTATGGGATGTAAGCGCCAAGCGTTCGGGTAGCCCACGTTGCGGGCTAATCGCTAGGCTAGAACCTCTAAGCTGTACTTCTGTCAAAACGGAAAGAATCATACCGCACACCAACACTCGATGTAATGTAAGGATATCCTGTGCAGATGAGGTGCTTGGTACGCTAGATACTAGTATTACTAACACACAACATTACCGCCCAAGCGTGTCAGCCACGGACCAGATGACTCTTGCCGATTGAATACCTATCATCCTTACGGTCCGGAATCAGTATATCGCGTGCACAGTTACAGTGGTTAACTTGAGCTAGAGCAAGATAATGTGCGATCTGCGCACTCGGTGGGCTTGGATCACCCTACTTCCAATTGCCCGCGTATGATAGTTCCACCACTCACAAGTCTGTCATAGTGATTATCAAGAGTAGGCGTAGTGGGCACCCAAGAAATTAATGAATCTCACAGTCGAGTGTATCTTCGGCCATATCCCTACGGCAAATGGTCGCTCAGCTTGTCTCCGAGAGTTCGTTGGTTCAGAACCTCCGAAGGGTTGGGTGATTGTTGCGGCGCGCATGCGAGCTATGGTGGCTGTGTGTGGAGGTATTATCA
```

Solution

Your code is fine and readable from a C/Java perspective. It isn't particularly idiomatic Ruby to use explicit return statements, though: a method returns the value of its last expression, so just put list at the end on its own.

Why your case is slow

You have this extra puts here:

str.scan(/A|C|G|T/) do |sub|
  puts case sub
  ^^^^
  when "A"
     ...


You may want to get rid of that :) That's why your interpreter prints something on every match: the case expression evaluates to the updated count, and puts writes it out. You told it to...
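For reference, here is a minimal sketch of the case version with the stray puts removed and the explicit return dropped. The name count_acgt_case is a hypothetical rename used only for this example; the logic is otherwise the same as in the question.

```ruby
# Sketch: the case-based counter without the stray `puts`.
# `count_acgt_case` is a hypothetical name used only for this example.
def count_acgt_case(str)
  list = [0, 0, 0, 0]
  str.scan(/A|C|G|T/) do |sub|
    # Without `puts`, the value of the case expression is simply discarded.
    case sub
    when "A" then list[0] += 1
    when "C" then list[1] += 1
    when "G" then list[2] += 1
    when "T" then list[3] += 1
    end
  end
  list # the last expression is the method's return value
end

# The sample dataset from the problem statement.
p count_acgt_case("AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC")
# prints [20, 12, 17, 21]
```

Nothing is printed while scanning; the only output comes from the final p call.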

Functionally better

But functionally, we can do way better. Ruby's enumerables have a group_by method:

def countACGT(str)
    str.chars.group_by(&:chr)
end


That'll give you a hash from each key (nucleotide base) to the values in the collection (a list of every occurrence of that base). All you have to do then is map it to just give you the size:
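To make the shape of that hash concrete, here is a small sketch on a made-up input ("GATTACA" is an illustration, not data from the question). Keys appear in the order each base is first seen.

```ruby
# Sketch: what group_by produces for a short, made-up input.
groups = "GATTACA".chars.group_by(&:chr)

# Each key maps to an array holding every occurrence of that character.
groups.each do |base, hits|
  puts "#{base}: #{hits.join} (#{hits.size})"
end
# prints:
# G: G (1)
# A: AAA (3)
# T: TT (2)
# C: C (1)
```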

def countACGT(str)
    str.chars.group_by(&:chr).map { |k, v| [k, v.size] }
end


That'll give you, for your example, the list:

[["T", 228], ["C", 209], ["A", 214], ["G", 220]]


If you want the counts in ACGT order like your original, you can sort by key and then map each pair to just its size:

def countACGT(str)
    str.chars.group_by(&:chr).sort.map{|k, v| v.size}
end
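One caveat with the sorted group_by version: a base that never occurs in the input has no group, so the result would have fewer than four entries. As a hedged alternative, the sketch below swaps in String#count (a different technique from the one above) so that four counts always come back in A, C, G, T order. The name count_acgt_fixed is hypothetical.

```ruby
# Sketch: always return four counts in A, C, G, T order.
# `count_acgt_fixed` is a hypothetical name; String#count is swapped in
# for group_by so that absent bases are reported as 0.
def count_acgt_fixed(str)
  %w[A C G T].map { |base| str.count(base) }
end

# The sample dataset from the problem statement.
puts count_acgt_fixed("AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC").join(" ")
# prints 20 12 17 21

# A base missing from the input still gets a slot.
p count_acgt_fixed("AAAA")
# prints [4, 0, 0, 0]
```

This walks the string once per base rather than once overall, which should be negligible for inputs of at most 1000 nt.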


Context

StackExchange Code Review Q#115504, answer score: 15
