patterncMinor

Re-implementing memcpy

Submitted by: @import:stackexchange-codereview·Mar 10, 2026·

Viewed 0 times

codereview reinventing-the-wheel c stackoverflow memory-management performance kernel

implementingmemcpystackoverflow

Problem

I have written an implementation of memcpy function and I want to know how good is it and if it gets the most out of the processor capabilities or not.

The reason for this is because I am writing my operating system (which will be of course reviewed on parts here on Code Review) it is supposed to work efficiently on both x86_64 and x86 machines

#include 
#include 
void *memcpy(void *dest, const void *src,size_t n)
{
        if (n==0)
                return dest;
        #if defined(__x86_64__) || defined(_M_X64)
        size_t i=0;
        if(n>=8)
                while(i=4){
                *(uint32_t*)(dest + i) = *(uint32_t*)(src+i);
                i+=4;
        }
        if(n-i>=2){
                *(uint16_t*)(dest+i) = *(uint16_t*)(src+i);
                i+=2;
        }
        if(n-i>=1)
                *(uint8_t*)(dest+i) = *(uint8_t*)(src+i);

        #elif defined(__i386) || defined(_M_IX86)
        size_t i=0;
        if(n>=4)
                while(i=2){
                *(uint16_t*)(dest+i) = *(uint16_t*)(src+i);
                i+=2;
        }
        if(n-i>=1)
                *(uint8_t*)(dest+i) = *(uint8_t*)(src+i);

        #endif
        return dest;
}

I made some aggressive testing with memcmp() to check for correct data transmission and valgrind to check for memory leaks and It passed all the tests. I didn't post the testing code because I think it could be useless since I don't want it to be reviewed.

Solution

Not very fast

Your memcpy() implementation is not really better than a standard byte by byte copy. Even though you attempt to copy more bytes at a time, the limiting factor isn't actually the number of bytes you copy per instruction.

If you research the various memcpy() implementations there are for x86 targets, you will find a wealth of information about how to get faster speeds. I think the simplest thing for you to do is to just use the simple "rep movsb" implementation.
My own benchmarks

I ran your version against the following two versions. One is a straightforward byte by byte copy, and the other is just using "rep movsb", which on modern processors is highly optimized:

void *memcpy2(void *dst, const void *src,size_t n)
{
    size_t i;

    for (i=0;i<n;i++)
        *(char *) dst++ = *(char *) src++;
    return dst;
}

void *memcpy3(void *dst, const void *src, size_t n)
{
    void *ret = dst;
    asm volatile("rep movsb" : "+D" (dst) : "c"(n), "S"(src) : "cc", "memory");
    return ret;
}

My results for copying 2GB (32-bit host):

OP's function  : 3.74 sec
memcpy2 (naive): 3.74 sec
memcpy3 (movsb): 2.96 sec

Code Snippets

void *memcpy2(void *dst, const void *src,size_t n)
{
    size_t i;

    for (i=0;i<n;i++)
        *(char *) dst++ = *(char *) src++;
    return dst;
}

void *memcpy3(void *dst, const void *src, size_t n)
{
    void *ret = dst;
    asm volatile("rep movsb" : "+D" (dst) : "c"(n), "S"(src) : "cc", "memory");
    return ret;
}

OP's function  : 3.74 sec
memcpy2 (naive): 3.74 sec
memcpy3 (movsb): 2.96 sec

Context

StackExchange Code Review Q#105114, answer score: 7

Revisions (0)

No revisions yet.