HiveBrain v1.2.0
Get Started
← Back to all entries
patterncModerate

Counting DNA nucleotides in C

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
nucleotidescountingdna

Problem

I have written code to solve the following Rosalind problem. This is my first time writing in C and I would like a review of my code, particularly in regard to correctness and performance.

#include 
#include 
int main() {

    /* Input DNA of length at most 1000 nucleotides */
    char dna[1000];
    FILE *input;
    input = fopen("/Users/James/Desktop/rosalind_dna.txt", "r");
    fgets(dna, 1000, input);
    fclose(input);

    /* Count frequency of symbols 'A', 'C', 'G' and 'T' in DNA */
    int i, a, c, g, t;
    for(i=0; i<strlen(dna); i++) {
        if (dna[i] == 'A') {
            a += 1;
        }
        else if (dna[i] == 'C') {
            c += 1;
        }
        else if (dna[i] == 'G') {
            g += 1;
        }
        else if (dna[i] == 'T') {
            t += 1;
        }
    }

    /* Output frequency of DNA nucleotides */
    printf("%i %i %i %i", a, c, g, t);
    return 0;
}


Updated Code

```
#include
#include
#define READ_BUFFER_SIZE 1000

int main() {

/ Input DNA of length at most 1000 nucleotides /
char dna[READ_BUFFER_SIZE];
FILE *input;
input = fopen("/Users/James/Desktop/rosalind_dna.txt", "r");
if (input == NULL) {
perror("Error");
}
fgets(dna, READ_BUFFER_SIZE, input);
fclose(input);

/ Count frequency of symbols 'A', 'C', 'G' and 'T' in DNA /
int aCount, cCount, gCount, tCount;
char *current;
for(current = dna; *current; ++current) {
switch(*current){
case 'A':
++aCount;
break;
case 'C':
++cCount;
break;
case 'G':
++gCount;
break;
case 'T':
++tCount;
break;
default:
printf("Error: Invalid nucleotide\n");
break;
}
}

/ Output frequency of DNA nucleotides /
printf("%i %i %i %i", aCount, cCount, gCount, tCount);

Solution

I am no C programmer but there are a few things that come to my eye:

Static size

You always read at most 1000 characters, what if the file size is much bigger?
Instead you would loop trough the file, reading/processing it chunkwise.

Magic Number

Even if you use the static size you should not enter it at multiple locations. Instead use a constant for it to have to apply changes only in one place:

#define READ_BUFFER_SIZE 1000
//...
char dna[READ_BUFFER_SIZE];
//...
fgets(dna, READ_BUFFER_SIZE, input);


Loop termination

It might be optimized by the compiler but unoptimized your loop end check is n times (for n = strlen(dna)) which itself is takes \$O(n)\$, making your loop \$O(n^2)\$.

Looping

You index through the loop which is more costly than walking a pointer through it like:

char *current;
// loop until the '\0' terminator is encountered
for(current = dna; *current; ++current)


However, note as stated in https://stackoverflow.com/a/2305783 that modern compiler will optimize the indexed version to the same efficient code as the pointer version.

Switching

To find the counter for the current base you are using cascading ifs, which is suboptimal. Instead use a switch statement:

switch(*current) {
default: assert("Invalid base name!");
case 'A':
//...


Incrementing

This is most likely optimized by the compiler but still: You are incrementing so use the (prefix) increment operator ++a instead of a += 1;.

Naming

In this short example the names are nearly okay but in a longer program you would want to expand the variable names e.g.:

  • dna -> dnaChunkBuffer (assuming you work in chunks)



  • input -> inputFile



  • i -> currentBaseIndex



  • a, c, g, t -> base*Count



Code-Reuse

As I said I am no C programmer but I could imagine that there is a standard algorithm which helps you count the different characters in a string. If so, I would go for that.

Error Handling

It is just a small example but what happens if there is no file at the location you entered? Your program does not account for that. You should at least test if the fopen function succeeded and given an error otherwise.

Code Snippets

#define READ_BUFFER_SIZE 1000
//...
char dna[READ_BUFFER_SIZE];
//...
fgets(dna, READ_BUFFER_SIZE, input);
char *current;
// loop until the '\0' terminator is encountered
for(current = dna; *current; ++current)
switch(*current) {
default: assert("Invalid base name!");
case 'A':
//...

Context

StackExchange Code Review Q#54142, answer score: 11

Revisions (0)

No revisions yet.