
Counting nucleobases in a nucleotide

Submitted by: @import:stackexchange-codereview
Tags: nucleotide, counting, nucleobases

Problem

This question is part of a series solving the Rosalind challenges. For the previous question in this series, see Calculating protein mass ruby. The repository with all my up-to-date solutions so far can be found here.

I started the Rosalind challenges in Ruby roughly a year ago. Now I am curious whether I can solve the same challenges in C.

Problem: DNA


A string is simply an ordered collection of symbols selected from some alphabet and formed into a word; the length of a string is the number of symbols that it contains.
An example of a length 21 DNA string (whose alphabet contains the symbols 'A', 'C', 'G', and 'T') is "ATGCTTCAGAAAGGTCTTACG."

Given:


A DNA string s of length at most 1000 nt.

Return:


Four integers (separated by spaces) counting the respective number of times that the symbols 'A', 'C', 'G', and 'T' occur in s.

Sample Dataset:

AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC


Sample Output:

20 12 17 21


Actual Dataset:

CTCCTCAGATCTCAAACGGCTCTATATTACTAGATAGGAGACACGCCCATACCAGCGACGCGGGGTCACTCATTTTCCCAAGAATCCATGAGTGCGAAGCGCACGTCCATGTGACACAAAATTACTAGAGAGTTTTCAAGTCTGATTACCCGTAGTAAACGACCTTGTGCCGGGTCACTAGTGCAATGAAGAATATGTCAACTATTACTCCCGTGGGATCTATAAAACCAGAAGATCCATTGCACTTGTAGTCGCTGTATAGTCTCTCGTCGTCACCTAGCCGATATGACCGTGCGCGAGTTATCCGGAACCTATAAGTGTTTGCTCTCAACAGTGTCTCAACACATGGAGTCGGTAACCTACTACGAAGCCTGCACCAAGATCGATCAGGGAGAATACCCCCTGACGGTCAACGCCGAAGATCAAAGAGAATGATTCGGCCTAGGGCGATTGGCTATTATCCCGGTCTAACCGCCAGGATACTTCAGTAGATCCCGCTCGACATCTGCCCCCCACAAAGTTATTCAGTTTCGGTGATAATTTCGCTTGAACTCCTATCTATTTAAAAGTTTTCCTATACGATGACTAGTCCCTTGCGAACGATCTTTGCCAGGATGCACGACGGCGAGACAATATTACAATACCGAGTGGAGTGATTGGTATCTACACATACGAAATCTCAATGAGAATGGAAGGTCACACTCGTAACAAACTCCTAAGCGGCGGAGAGCGGAAAGGTATAGTCGAGTCGAAGCCTTTATATCGTGTGGCCAGCAGCTAACACAGAGAAATATGGCGGGAATCATC


Actual output:

231 201 181 194


DNA.c:

#include "DNA.h"

int main()
{
size_t MAX_LENGTH = 1000;
char *user_input;

user_input = (char *)

Solution

Advice 1

typedef struct {
    uint8_t A;
    uint8_t C;
    uint8_t G;
    uint8_t T;
} uint8_t_container;


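In a real-world setting, uint8_t values are not sufficient for representing (absolute) frequencies. A uint8_t can hold at most 255, while the input may be up to 1000 nt long, so any symbol that occurs more than 255 times silently wraps around. Consider using at least uint32_t.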
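To make the wrap-around concrete, here is a minimal sketch that counts a single symbol in both widths. The helper names count_a_narrow and count_a_wide are hypothetical and do not appear in the reviewed code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: counts 'A' in an 8-bit counter, like the fields
   of the reviewed struct. The counter wraps modulo 256 past 255. */
uint8_t count_a_narrow(const char *s)
{
    assert(s != NULL);
    uint8_t n = 0;
    for (; *s != '\0'; s++) {
        if (*s == 'A') {
            n++; /* 255 + 1 becomes 0 */
        }
    }
    return n;
}

/* Hypothetical helper: same loop with a 32-bit counter, which has ample
   room for a 1000 nt input. */
uint32_t count_a_wide(const char *s)
{
    assert(s != NULL);
    uint32_t n = 0;
    for (; *s != '\0'; s++) {
        if (*s == 'A') {
            n++;
        }
    }
    return n;
}
```

For a string of 300 'A' symbols, the narrow version reports 44 (300 modulo 256), while the wide version reports 300.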

Advice 2

if isalnum(nucleobase)


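You forgot the parentheses around the condition. This compiles only when isalnum happens to be a macro that expands to a parenthesized expression, so it is not portable. It should read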

if (isalnum(nucleobase))


Advice 3

uint8_t_container *count = calloc(4, sizeof(uint8_t_container));


You request four times as much memory as you need: calloc(4, ...) allocates room for four structures, but only one is ever used. Consider using

uint8_t_container *count = calloc(1, sizeof(uint8_t_container)); 
                                  ^


Advice 4

return *count;


Above, count is a pointer to a heap-allocated structure, but the return value of countACGT is not a pointer to that structure. As far as I can tell, return *count; therefore copies the structure into the return value and leaves the allocated block without ever freeing it: a memory leak. Since the function returns by value anyway, a local structure would do and no allocation is needed.
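The two ways out of the leak can be sketched as follows. This is only an illustration under stated assumptions: base_counts, tally, count_on_stack and count_on_heap are hypothetical names, and the original countACGT signature is not fully visible in the question.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for the reviewed struct, with wider counters. */
typedef struct {
    uint32_t A, C, G, T;
} base_counts;

/* Bumps the counter that matches one symbol; other symbols are ignored. */
static void tally(base_counts *count, char nucleobase)
{
    switch (nucleobase) {
    case 'A': count->A++; break;
    case 'C': count->C++; break;
    case 'G': count->G++; break;
    case 'T': count->T++; break;
    default: break;
    }
}

/* Variant 1: automatic storage, so there is nothing to free. */
base_counts count_on_stack(const char *s)
{
    assert(s != NULL);
    base_counts count = {0};
    for (; *s != '\0'; s++) {
        tally(&count, *s);
    }
    return count;
}

/* Variant 2: keeps the allocation, but for exactly one struct, and
   releases it after copying the result out. */
base_counts count_on_heap(const char *s)
{
    assert(s != NULL);
    base_counts result = {0};
    base_counts *count = calloc(1, sizeof *count);
    if (count == NULL) {
        return result; /* allocation failed: this sketch reports zeros */
    }
    for (; *s != '\0'; s++) {
        tally(count, *s);
    }
    result = *count; /* copy the values out before freeing */
    free(count);
    return result;
}
```

Both variants return the same counts. The first is simpler and cannot leak, which is why returning a local structure by value is usually the better fit here.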

Advice 5

if (nucleobase == 'A')
{
    count->A += 1;
}
else if (nucleobase == 'C')
{
    count->C += 1;
}
else if (nucleobase == 'G')
{
    count->G += 1;
}
else if (nucleobase == 'T')
{
    count->T += 1;
}


This is asking for a switch.

Advice 6

Finally, more idiomatic C would be to have a char pointer slide through the nucleotides and halt at the zero terminator. That way, there is no need for strlen.

Summa summarum

All in all, I had this in mind:

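#include <stdint.h>

typedef struct {
    uint32_t a_count;
    uint32_t c_count;
    uint32_t g_count;
    uint32_t t_count;
} uint32_t_container;

/* Counts the occurrences of 'A', 'C', 'G' and 'T' in a zero-terminated string. */
uint32_t_container count_nucleotides(const char *nucleotides)
{
    uint32_t_container result;

    result.a_count = 0;
    result.c_count = 0;
    result.g_count = 0;
    result.t_count = 0;

    /* Slide a pointer along the string; the zero terminator ends the loop. */
    for (const char *c = nucleotides;; c++)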
    {
        switch (*c)
        {
            case 'A':
                result.a_count++;
                break;

            case 'C':
                result.c_count++;
                break;

            case 'G':
                result.g_count++;
                break;

            case 'T':
                result.t_count++;
                break;

            case 0:
                return result;
        }
    }
}
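As a usage sketch, the function can be wired to the output format the challenge asks for. The counting function is repeated so that the block compiles on its own; format_counts is a hypothetical helper, not part of the reviewed code.

```c
#include <assert.h>
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

typedef struct {
    uint32_t a_count;
    uint32_t c_count;
    uint32_t g_count;
    uint32_t t_count;
} uint32_t_container;

/* Same logic as above: walk the string until the zero terminator. */
uint32_t_container count_nucleotides(const char *nucleotides)
{
    uint32_t_container result = {0, 0, 0, 0};

    for (const char *c = nucleotides;; c++) {
        switch (*c) {
        case 'A': result.a_count++; break;
        case 'C': result.c_count++; break;
        case 'G': result.g_count++; break;
        case 'T': result.t_count++; break;
        case 0:   return result;
        }
    }
}

/* Hypothetical helper: writes the four counts, separated by spaces, into
   buf and returns what snprintf returns. */
int format_counts(const char *nucleotides, char *buf, size_t size)
{
    assert(nucleotides != NULL && buf != NULL);
    uint32_t_container r = count_nucleotides(nucleotides);
    return snprintf(buf, size,
                    "%" PRIu32 " %" PRIu32 " %" PRIu32 " %" PRIu32,
                    r.a_count, r.c_count, r.g_count, r.t_count);
}
```

Called with the sample dataset from the question and a buffer of, say, 64 bytes, format_counts fills the buffer with "20 12 17 21", matching the sample output.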


Hope that helps.


Context

StackExchange Code Review Q#149490, answer score: 5
