Counting nucleobases in a nucleotide
Problem
This question is part of a series solving the Rosalind challenges. For the previous question in this series, see Calculating protein mass ruby. The repository with all my up-to-date solutions so far can be found here.
I started the Rosalind challenges roughly a year ago in Ruby. Now I got curious whether I could do the same challenges in C.
Problem: DNA
A string is simply an ordered collection of symbols selected from some alphabet and formed into a word; the length of a string is the number of symbols that it contains.
An example of a length 21 DNA string (whose alphabet contains the symbols 'A', 'C', 'G', and 'T') is "ATGCTTCAGAAAGGTCTTACG."
Given:
A DNA string s of length at most 1000 nt.
Return:
Four integers (separated by spaces) counting the respective number of times that the symbols 'A', 'C', 'G', and 'T' occur in s.
Sample Dataset:
AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC
Sample Output:
20 12 17 21
Actual Dataset:
CTCCTCAGATCTCAAACGGCTCTATATTACTAGATAGGAGACACGCCCATACCAGCGACGCGGGGTCACTCATTTTCCCAAGAATCCATGAGTGCGAAGCGCACGTCCATGTGACACAAAATTACTAGAGAGTTTTCAAGTCTGATTACCCGTAGTAAACGACCTTGTGCCGGGTCACTAGTGCAATGAAGAATATGTCAACTATTACTCCCGTGGGATCTATAAAACCAGAAGATCCATTGCACTTGTAGTCGCTGTATAGTCTCTCGTCGTCACCTAGCCGATATGACCGTGCGCGAGTTATCCGGAACCTATAAGTGTTTGCTCTCAACAGTGTCTCAACACATGGAGTCGGTAACCTACTACGAAGCCTGCACCAAGATCGATCAGGGAGAATACCCCCTGACGGTCAACGCCGAAGATCAAAGAGAATGATTCGGCCTAGGGCGATTGGCTATTATCCCGGTCTAACCGCCAGGATACTTCAGTAGATCCCGCTCGACATCTGCCCCCCACAAAGTTATTCAGTTTCGGTGATAATTTCGCTTGAACTCCTATCTATTTAAAAGTTTTCCTATACGATGACTAGTCCCTTGCGAACGATCTTTGCCAGGATGCACGACGGCGAGACAATATTACAATACCGAGTGGAGTGATTGGTATCTACACATACGAAATCTCAATGAGAATGGAAGGTCACACTCGTAACAAACTCCTAAGCGGCGGAGAGCGGAAAGGTATAGTCGAGTCGAAGCCTTTATATCGTGTGGCCAGCAGCTAACACAGAGAAATATGGCGGGAATCATC
Actual output:
231 201 181 194
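The sample output can be cross-checked independently. The following is a minimal sketch, not the code under review: it tallies every byte into a 256-entry table and then reads off the four bases. The function name count_bases_table and the out[4] layout (A, C, G, T order) are assumptions made for this illustration.

```c
#include <limits.h>
#include <stdint.h>

/* Hypothetical helper: tallies all bytes of s, then copies the counts of
 * 'A', 'C', 'G' and 'T' (in that order) into out[0..3]. */
void count_bases_table(const char *s, uint32_t out[4])
{
    uint32_t table[UCHAR_MAX + 1] = {0};

    for (; *s != '\0'; s++) {
        /* Cast to unsigned char so the index is never negative. */
        table[(unsigned char)*s]++;
    }

    out[0] = table['A'];
    out[1] = table['C'];
    out[2] = table['G'];
    out[3] = table['T'];
}
```

Calling it on the sample dataset fills out with 20, 12, 17 and 21, matching the sample output above.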
DNA.c:
    #include "DNA.h"
    int main()
    {
        size_t MAX_LENGTH = 1000;
        char *user_input;
        user_input = (char *)
Solution
Advice 1
    typedef struct {
        uint8_t A;
        uint8_t C;
        uint8_t G;
        uint8_t T;
    } uint8_t_container;
In a real-world setting, uint8_t values are not sufficient for representing (absolute) frequencies: a uint8_t tops out at 255, while a count here can reach 1000. Consider using at least uint32_t.
Advice 2
    if isalnum(nucleobase)
You forgot the parentheses around the condition. It should read:
    if (isalnum(nucleobase))
Advice 3
    uint8_t_container *count = calloc(4, sizeof(uint8_t_container));
You ask for 4 times more memory than you need. Consider using
    uint8_t_container *count = calloc(1, sizeof(uint8_t_container));
                                      ^
Advice 4
    return *count;
Above, count is a pointer to a heap-allocated structure. Since countACGT returns the structure itself rather than a pointer to it, *count is copied by value and the allocated block is never freed: a memory leak.
Advice 5
    if (nucleobase == 'A')
    {
        count->A += 1;
    }
    else if (nucleobase == 'C')
    {
        count->C += 1;
    }
    else if (nucleobase == 'G')
    {
        count->G += 1;
    }
    else if (nucleobase == 'T')
    {
        count->T += 1;
    }
This is asking for a switch.
Advice 6
Finally, more idiomatic C would be to have a char pointer slide through the nucleotides and halt at the zero terminator. That way, there is no need for strlen.
Summa summarum
All in all, I had this in mind:
    #include <stdint.h>

    typedef struct {
        uint32_t a_count;
        uint32_t c_count;
        uint32_t g_count;
        uint32_t t_count;
    } uint32_t_container;

    uint32_t_container count_nucleotides(char *nucleotides)
    {
        uint32_t_container result;
        result.a_count = 0;
        result.c_count = 0;
        result.g_count = 0;
        result.t_count = 0;

        /* Walk the string until the zero terminator; no strlen needed. */
        for (char *c = nucleotides;; c++)
        {
            switch (*c)
            {
                case 'A':
                    result.a_count++;
                    break;
                case 'C':
                    result.c_count++;
                    break;
                case 'G':
                    result.g_count++;
                    break;
                case 'T':
                    result.t_count++;
                    break;
                case 0:
                    return result;
            }
        }
    }
Hope that helps.
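To illustrate Advice 1: the problem allows strings of up to 1000 nt, so a single base can occur more than 255 times, which is the largest value a uint8_t can hold. The sketch below, with the hypothetical helper names count_a_narrow and count_a_wide, counts 'A' with both counter widths so the difference is visible.

```c
#include <stdint.h>

/* Hypothetical helper: counts 'A' with an 8-bit counter.
 * Unsigned arithmetic wraps, so 255 + 1 becomes 0. */
uint8_t count_a_narrow(const char *s)
{
    uint8_t n = 0;
    for (; *s != '\0'; s++) {
        if (*s == 'A') {
            n++;
        }
    }
    return n;
}

/* Hypothetical helper: same loop with a 32-bit counter,
 * wide enough for any input the problem allows. */
uint32_t count_a_wide(const char *s)
{
    uint32_t n = 0;
    for (; *s != '\0'; s++) {
        if (*s == 'A') {
            n++;
        }
    }
    return n;
}
```

For a string of 300 'A' characters, count_a_narrow returns 44 (300 modulo 256) while count_a_wide returns 300.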
Context
StackExchange Code Review Q#149490, answer score: 5