memcpy() implementation

Submitted by: @import:stackexchange-codereview·Mar 10, 2026·

implementation, memcpy, stackoverflow

Problem

I have tried to write a function like memcpy. It copies sizeof(long) bytes at a time.

What surprised me is how inefficient it is. With -O3 it is only 17% faster than the most naive implementation. With optimizations off it is a lot faster than the naive version, so perhaps the compiler is doing this automatically?

It copies 1 byte at a time until one of the addresses is aligned, then sizeof(long) bytes at a time, and finally 1 byte at a time for the remainder, to avoid writing beyond the bounds.

How can I make this faster?

```
#include <stddef.h>

#define THRESHOLD sizeof(long)

static size_t min(size_t a, size_t b)
{
    return (a < b) ? a : b;
}

static void big_copy(void *dest, const void *src, size_t iterations)
{
    long *d = dest;
    const long *s = src;

    size_t eight = iterations / 8;
    size_t single = iterations % 8;

    while(eight > 0){
        *d++ = *s++;
        *d++ = *s++;
        *d++ = *s++;
        *d++ = *s++;
        *d++ = *s++;
        *d++ = *s++;
        *d++ = *s++;
        *d++ = *s++;
        --eight;
    }

    while(single > 0){
        *d++ = *s++;
        --single;
    }
}

static void small_copy(void *dest, const void *src, size_t iterations)
{
    char *d = dest;
    const char *s = src;

    while(iterations > 0){
        *d++ = *s++;
        --iterations;
    }
}

void copy_memory(void *dest, const void *src, size_t size)
{
    //Small size is handled here
    if(size 0){
        small_copy(position, src, bytes_to_align);
        position = (char *)position + bytes_to_align;
        src = (char *)src + bytes_to_align;
        size -= bytes_to_align;
    }

    //How many iterations can be done
    size_t safe_big_iterations = size / sizeof(long);
    size_t remaining_bytes = size % sizeof(long);

    //Copy most bytes here
    big_copy(position, src, safe_big_iterations);
    position = (char *)position + safe_big_iterations * sizeof(long);
    src = (char *)src + safe_big_iterations * sizeof(long);

    //Process the remaining bytes
    small_
```

Solution

The last time I saw source for a C run-time-library implementation of memcpy (Microsoft's compiler in the 1990s), it used the algorithm you describe, but it was written in assembly. It might (my memory is uncertain) have used rep movsd in the inner loop.

Your code says, //Start copying 8 bytes as soon as one of the pointers is aligned. When you're performance-testing, you should know whether both buffers are aligned, because that is the case in which you might expect the best performance.
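As a rough illustration of what "knowing whether both buffers are aligned" could mean in a test harness, here is a minimal sketch. The helper names is_long_aligned and can_align_both are hypothetical (they are not in your code), and the sketch assumes that converting a pointer to uintptr_t yields its address, which holds on common platforms but is implementation-defined.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not from the original post: nonzero when p sits on a
   sizeof(long) boundary. Assumes the uintptr_t value is the address. */
static int is_long_aligned(const void *p)
{
    return ((uintptr_t)p % sizeof(long)) == 0;
}

/* Hypothetical helper: nonzero when dest and src have the same offset within
   a long, so a byte-wise prologue can bring both onto a boundary together. */
static int can_align_both(const void *dest, const void *src)
{
    return ((uintptr_t)dest % sizeof(long)) == ((uintptr_t)src % sizeof(long));
}
```

Recording the result of both checks next to each timing would let you separate the case where every long load and store is aligned from the case where only one side is.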

On the subject of alignment there is an interesting (but unrelated to your question) question here on StackOverflow: Why speed of memcpy() drops dramatically every 4KB?

I only vaguely understand what kind of effect you're looking for in your code, and I don't know what assembly your compiler is actually producing.
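One way to find out is to compile a small baseline and read the generated assembly (with GCC or Clang, the -S option emits it). The function below is a hypothetical baseline named naive_copy, not code from your post. At -O3 a compiler may vectorize a loop like this or replace it with a call to the library memcpy, which would be consistent with the small gap you measured, but that is something to confirm in the output rather than assume.

```c
#include <stddef.h>

/* Hypothetical baseline, not from the original post: a plain byte loop.
   restrict promises the buffers do not overlap, as memcpy also requires. */
void naive_copy(void *restrict dest, const void *restrict src, size_t size)
{
    unsigned char *d = dest;
    const unsigned char *s = src;

    for (size_t i = 0; i < size; i++)
        d[i] = s[i];
}
```

Comparing the assembly for this loop with the assembly for copy_memory at the same optimization level shows whether the manual unrolling still does anything the compiler was not already doing.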

The accepted answer to this StackOverflow question demonstrates the kind of assembly that is used nowadays: Very fast memcpy for image processing?
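The linked answer is not reproduced here. Assuming the assembly it shows is SIMD-based, a similar idea can be sketched in C with SSE2 intrinsics instead of assembly. The function name copy_sse2_sketch is hypothetical, the SSE2 path is only compiled when the compiler defines __SSE2__ (GCC and Clang do on x86-64), and everything else falls back to the library memcpy.

```c
#include <stddef.h>
#include <string.h>

#if defined(__SSE2__)
#include <emmintrin.h>  /* SSE2 intrinsics */
#endif

/* Hypothetical sketch, not the code from the linked answer: copy 16 bytes
   per iteration with unaligned SSE2 loads and stores where available. */
void copy_sse2_sketch(void *dest, const void *src, size_t size)
{
    unsigned char *d = dest;
    const unsigned char *s = src;

#if defined(__SSE2__)
    while (size >= 16) {
        __m128i chunk = _mm_loadu_si128((const __m128i *)s);
        _mm_storeu_si128((__m128i *)d, chunk);
        s += 16;
        d += 16;
        size -= 16;
    }
#endif
    /* Tail bytes, or the whole buffer when SSE2 is not available. */
    memcpy(d, s, size);
}
```

The unaligned load and store intrinsics are used so that the sketch is correct for any pointers; whether aligned variants or non-temporal stores would be faster for large buffers is something to measure on the target machine.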

Context

StackExchange Code Review Q#41094, answer score: 5
