HiveBrain v1.2.0
Get Started
← Back to all entries
patterncMinor

Duplicate counter in C

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
duplicatecounterstackoverflow

Problem

My goal was to create a function that could check for duplicates in an array no matter what the data type was, given that the size of the data type was provided. This was done by using the \$\mathcal{O}(n^2)\$ algorithm and casting to char because of its 1 byte property.

/**
 * Returns count of duplicates in an array
 * @param  a     array to check in
 * @param  items how many entries in array
 * @param  size  size of each entry
 * @return       returns amount of duplicates/conflicts
 */
int dubCount(void *a, size_t items, size_t size){
    int i,j;
    int conflicts = 0;
    char *x = (char *)a;

    for (i = 0; i < items - 1; i++){
        for (j = i + 1; j < items; j++){
            /* Check if chunks are equal, if so count */
            if(memcmp(&(x[i*size]), &(x[j*size]), size) == 0){
                conflicts++;
            }
        }
    }

    return conflicts;
}


The code has been confirmed to work with many data types.
As I'm quite new to the C world, I'd like for someone to point out anything wrong or dangerous in this function.

Solution

This is a very straightforward implementation which is great for maintainability and readability. Here are some things you could improve:

Not All Comparisons Work In Obvious Ways

As mentioned in the comments by Jerry Coffin and Loki Astari, you may run into issues with floating point values or values which may not be padded how you think, or for which the padding contains random values. You could solve this by also passing in a pointer to a comparison function and calling that function to do the comparison.

Naming

Your naming could be improved a little. The function is named dubCount(), but from the comments and the name of the return value it should probably be conflictCount() or at least dupeCount(). It's not clear what dub refers to in dubCount() (at least to me). Also the parameter name a tells me nothing, and since it's a pointer to void, I have no contextual clue, either. I'd rename it to something like items and rename the items parameter to numberOfItems or numItems, and rename size to itemSize or sizePerItem or bytesPerItem so that it's more clear just from the function prototype what the parameters mean.

Use const where appropriate

In this function, none of the parameters to the function are modified. When your function doesn't modify a parameter, it should be marked as const. This lets both the compiler and a reader know the function won't change those values. The compiler may be able to make some optimizations with that info, and a reader can do more reasoning about the function without reading its actual code, which is helpful.

Performance

You noted that the algorithm you're using is \$\mathcal{O}(n^2)\$. If you sorted the items in the array first, you could use an O(n) search on the result as you'd be able to simply run through the array until you hit a different value. (Note that this may depend on the statistical nature of your input data and the sort algorithm you use. But in general, I'd expect to see a speedup in such a case.) Most fast sort algorithms are \$\mathcal{O}(n log(n))\$ so the result would be \$\mathcal{O}(n + n log(n))\$ (I think? Check my math). Note that you would use the same comparison function for the sort as you would for checking for duplicates.

Context

StackExchange Code Review Q#149602, answer score: 2

Revisions (0)

No revisions yet.