Count the occurrence of nucleobases in a DNA string
Problem
Inspired by this meta question I decided to take a look at Rosalind. Their first challenge seemed easy enough:
An example of a length 21 DNA string (whose alphabet contains the symbols 'A', 'C', 'G', and 'T') is "ATGCTTCAGAAAGGTCTTACG."
Given: A DNA string s of length at most 1000 nt.
Return: Four integers (separated by spaces) counting the respective number of times that the symbols 'A', 'C', 'G', and 'T' occur in s.
Sample Dataset
```
AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC
```
Sample Output
```
20 12 17 21
```
Since I'm still on my quest to learn both regex and Ruby, I decided to go that route:
```
def countACGT(str)
  list = [0,0,0,0]
  str.scan(/A|C|G|T/) do |sub|
    if sub == "A"
      list[0] += 1
    end
    if sub == "C"
      list[1] += 1
    end
    if sub == "G"
      list[2] += 1
    end
    if sub == "T"
      list[3] += 1
    end
  end
  return list
end
```
I'm not a big fan of long `if` chains. Luckily, Ruby has a `case` statement as well:
```
def countACGT(str)
  list = [0,0,0,0]
  str.scan(/A|C|G|T/) do |sub|
    puts case sub
      when "A"
        list[0] += 1
      when "C"
        list[1] += 1
      when "G"
        list[2] += 1
      when "T"
        list[3] += 1
    end
  end
  return list
end
```
Both can be invoked like:
```
p countACGT("TCCCACTTCAGGGTCAGGGAGCTCCAAACTCTCTTTCTAGAGATGACAATCGAGAGTGAGATAAGGTGGATAGCAATCGTTATGGGATGTAAGCGCCAAGCGTTCGGGTAGCCCACGTTGCGGGCTAATCGCTAGGCTAGAACCTCTAAGCTGTACTTCTGTCAAAACGGAAAGAATCATACCGCACACCAACACTCGATGTAATGTAAGGATATCCTGTGCAGATGAGGTGCTTGGTACGCTAGATACTAGTATTACTAACACACAACATTACCGCCCAAGCGTGTCAGCCACGGACCAGATGACTCTTGCCGATTGAATACCTATCATCCTTACGGTCCGGAATCAGTATATCGCGTGCACAGTTACAGTGGTTAACTTGAGCTAGAGCAAGATAATGTGCGATCTGCGCACTCGGTGGGCTTGGATCACCCTACTTCCAATTGCCCGCGTATGATAGTTCCACCACTCACAAGTCTGTCATAGTGATTATCAAGAGTAGGCGTAGTGGGCACCCAAGAAATTAATGAATCTCACAGTCGAGTGTATCTTCGGCCATATCCCTACGGCAAATGGTCGCTCAGCTTGTCTCCGAGAGTTCGTTGGTTCAGAACCTCCGAAGGGTTGGGTGATTGTTGCGGCGCGCATGCGAGCTATGGTGGCTGTGTGTGGAGGTATTATCA
```
Solution
Your code is fine and readable from a C/Java perspective. I don't think it's particularly Ruby-ic to use `return` statements, though. A method returns the value of its last expression, so just put `list` at the end on its own.
Why your case is slow
You have this extra `puts` here:
```
str.scan(/A|C|G|T/) do |sub|
  puts case sub
  ^^^^
    when "A"
    ...
```
You may want to get rid of that :) That's why your interpreter prints something on every match: `puts` outputs whatever the `case` expression evaluates to. You told it to...
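As a rough sketch of what that cleanup might look like, here is the `case` version with the stray `puts` removed and the explicit `return` dropped. The method name and the counting logic are the question's; the one-line `when ... then` layout is just one possible way to write it, not something from the original post.

```ruby
# Sketch: the question's case-based counter, minus the `puts` and the
# explicit `return`.
def countACGT(str)
  list = [0, 0, 0, 0]
  str.scan(/A|C|G|T/) do |sub|
    case sub
    when "A" then list[0] += 1
    when "C" then list[1] += 1
    when "G" then list[2] += 1
    when "T" then list[3] += 1
    end
  end
  list # the last expression is the method's return value
end

# Sample dataset from the problem statement.
p countACGT("AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC")
# prints [20, 12, 17, 21]
```

With the `puts` gone, nothing is written to the console inside the loop; only the final `p` prints.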
Functionally better
But functionally, we can do way better. Ruby's enumerables have a `group_by` method:
```
def countACGT(str)
  str.chars.group_by(&:chr)
end
```
That'll give you a hash from each key (nucleotide base) to the values in the collection (a list of each occurrence of each nucleotide base). All you have to do then is map it to just give you the size:
```
def countACGT(str)
  str.chars.group_by(&:chr).map { |k, v| [k, v.size] }
end
```
That'll give you, for your example, the list:
```
[["T", 228], ["C", 209], ["A", 214], ["G", 220]]
```
If you want to get it in ACGT order like your original, you can just sort it and map off the key:
```
def countACGT(str)
  str.chars.group_by(&:chr).sort.map{|k, v| v.size}
end
```
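One thing to watch with the sorted `group_by` version: it only reports characters that actually occur in the string, so an input with no `C` or `T` yields fewer than four numbers. If a fixed four-slot result matters, one possible alternative is `String#count`. This is a minimal sketch and not part of the original answer; the helper name `count_bases` is made up here for illustration.

```ruby
# Hypothetical helper (the name is illustrative, not from the original answer).
# String#count returns how many times the given character occurs, so every
# base gets a slot even when it never appears in the input.
def count_bases(str)
  %w[A C G T].map { |base| str.count(base) }
end

# Sample dataset from the problem statement, printed space-separated.
puts count_bases("AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC").join(" ")
# prints "20 12 17 21"

# An input missing two of the bases:
p count_bases("AAGG")                                      # prints [2, 0, 2, 0]
p "AAGG".chars.group_by(&:chr).sort.map { |k, v| v.size }  # prints [2, 2]
```

The trade-off is that `count` walks the string once per base, while `group_by` makes a single pass; for inputs of at most 1000 nt that difference is unlikely to matter.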
Context
StackExchange Code Review Q#115504, answer score: 15