Zero-initializing large dynamically allocated arrays of double
Problem
Arrays of double can be zero-initialized with the { 0 } initializer, which works even on systems whose binary representation of 0.0 is not all bits zero. While the IEEE 754 representation of floating point zero is all bits zero, the C Standard makes no guarantees about the representation of floating point numbers. Large dynamically allocated arrays of integer types can be zero-initialized by using calloc(), but this method is not portable for floating point types, since calloc() initializes all bits to zero. Assigning a value of 0.0 to each individual double in the segment using a loop seems grossly inefficient.

I wrote the function calloc_d() to solve this problem. The function takes a size_t argument nmemb and returns a zero-initialized segment of memory large enough to hold nmemb doubles. If there is an allocation error, the function returns NULL. The caller is, of course, responsible for deallocation.

After using malloc() to allocate the required amount of memory, the first bytes are assigned the value 0.0. Then memmove() is used to copy the bytes of the first zero, then the bytes of the first two zeros, and so on until at least half of the memory has been initialized. Finally, the remaining memory is initialized to 0.0 by copying bytes from the first half.

The code works, and has no obvious issues, as far as I can tell. I am interested in any comments about shortcomings of this approach, or of my code, and suggestions for improvement. I would also be interested in suggestions for alternative, possibly more efficient methods.

I have included the function in a working program below. MAXCOUNT doubles are allocated, initialized, and displayed. The value of MAXCOUNT is currently set to 100, but I have tested it for values up to 1000000.

```
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAXCOUNT 100 // number of doubles to allocate for

double * calloc_d(size_t nmemb);

int
/* ... the remainder of the program was truncated in the source ... */
```
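The body of calloc_d() itself did not survive in the source. A minimal sketch of the doubling-memmove scheme as the question describes it (my reconstruction under the question's stated design, not the OP's exact code) might look like:

```c
#include <stdlib.h>
#include <string.h>

/* Reconstruction of the described approach: zero the first double,
 * then repeatedly double the initialized prefix with memmove until
 * the whole buffer is covered. */
double *calloc_d(size_t nmemb)
{
    if (nmemb == 0)
        return NULL;

    double *p = malloc(nmemb * sizeof *p);
    if (p == NULL)
        return NULL;

    p[0] = 0.0;          /* one zeroed double to copy from */
    size_t done = 1;     /* number of doubles initialized so far */
    while (done < nmemb) {
        /* copy as many already-zeroed doubles as will fit */
        size_t chunk = (done <= nmemb - done) ? done : nmemb - done;
        memmove(p + done, p, chunk * sizeof *p);
        done += chunk;
    }
    return p;
}
```

Each pass doubles the initialized region, so the number of memmove() calls is logarithmic in nmemb.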
Solution
So my first hunch when I read your question is:
> Okay, this guy is making a complicated solution to a simple problem, based on a hunch that the trivial solution is "grossly inefficient", without any measurements to back up said claim. I suspect the OP's solution is slower than the trivial solution.
So I set out to test this, pitting your fancy implementation against my trivial implementation:
```
double * calloc_d2(size_t nmemb){
    double* ret = malloc(sizeof(double) * nmemb);
    if(ret){
        double* end = ret + nmemb; // computed only after the NULL check
        double* i = ret;
        while(i != end){
            *i = 0.0;
            ++i;
        }
    }
    return ret;
}
```

I tested this on ideone.com (I think it is using clang), which shows that this simple bit of code is consistently around 3-5% faster than your fancy code.
So I have two pieces of advice:
- Express your intent as clearly as possible, and let the compiler worry about generating high performance code. Compilers are pretty smart...
- Always, always, always measure.
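The second point can be acted on directly. Below is a deliberately crude timing sketch of my own (not from the answer) comparing the trivial loop against memset; clock() resolution, cache effects, and the optimizer eliding dead stores all matter in a serious benchmark:

```c
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Illustrative timing helpers; return elapsed milliseconds,
 * or -1.0 on allocation failure. */
double fill_loop_ms(size_t n)
{
    double *a = malloc(n * sizeof *a);
    if (a == NULL)
        return -1.0;
    clock_t t0 = clock();
    for (size_t i = 0; i < n; i++)
        a[i] = 0.0;
    clock_t t1 = clock();
    volatile double keep = a[n / 2]; /* discourage dead-store elimination */
    (void)keep;
    free(a);
    return 1000.0 * (double)(t1 - t0) / CLOCKS_PER_SEC;
}

double fill_memset_ms(size_t n)
{
    double *a = malloc(n * sizeof *a);
    if (a == NULL)
        return -1.0;
    clock_t t0 = clock();
    memset(a, 0, n * sizeof *a); /* all-bits-zero is 0.0 only on IEEE 754 */
    clock_t t1 = clock();
    volatile double keep = a[n / 2];
    (void)keep;
    free(a);
    return 1000.0 * (double)(t1 - t0) / CLOCKS_PER_SEC;
}
```

Run each variant several times and compare medians; a single run over a short loop is dominated by noise.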
Addendum/Edit
Seeing as this answer has received quite a lot of attention, I've gone ahead and dug deeper.
Using the Godbolt compiler explorer, I compiled (with -O3) and disassembled my trivial loop implementation. Here is the disassembly from GCC 6.2:
```
calloc_d2(unsigned long):
        push rbp
        push rbx
        lea rbx, [0+rdi*8]
        sub rsp, 8
        mov rdi, rbx
        call malloc
        test rax, rax
        mov rbp, rax
        je .L1
        lea rax, [rax+rbx]
        cmp rbp, rax
        je .L1
        mov rdx, rbx
        xor esi, esi
        mov rdi, rbp
        call memset
.L1:
        add rsp, 8
        mov rax, rbp
        pop rbx
        pop rbp
        ret
```

Notice how the compiler has just gone "Oh, you want to zero that? Cool, I'll just replace your code with a memset." This goes to show that an optimising compiler is quite smart and is able to transform your code in astounding ways.

I also tried different versions of GCC and clang on ARM, MIPS, and PowerPC; they all generate similar instructions, which essentially boil down to a call to malloc followed by memset.

The sole exception is ICC 17 (-O3 -march=native), which goes above and beyond:

```
calloc_d2(unsigned long):
        push rbp                              #5.1
        mov rbp, rsp                          #5.1
        and rsp, -32                          #5.1
        push r15                              #5.1
        push rbx                              #5.1
        sub rsp, 16                           #5.1
        mov rbx, rdi                          #5.1
        lea rdi, QWORD PTR [rbx*8]            #6.50
        call malloc
        -- snip --
        call __intel_avx_rep_memset           #12.8
        -- snip --
```

Notice that it calls into __intel_avx_rep_memset instead; this is a 512-bit-wide AVX-optimized memset that will only run on modern CPUs (it did this because I passed -march=native, asking it to throw backward compatibility out the window). There is some branching code at the end which deals with cases where nmemb isn't evenly divisible by 512/64 = 8.

But here is the key takeaway: we clearly expressed our intent with the trivial loop, and the compiler could then replace that loop (after deducing its functionality) with a 512-bit AVX-optimised memset written by Intel's engineers specifically for modern Intel CPUs.
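As an aside (my own sketch, not part of the answer): since the only obstacle to using calloc() directly is the theoretical possibility that 0.0 is not all bits zero, one can test that property once at runtime and fall back to the explicit loop only when it fails:

```c
#include <stdlib.h>
#include <string.h>

/* Illustrative sketch: use calloc when this platform's 0.0 is all
 * bits zero (true on IEEE 754 systems), otherwise fall back to a
 * plain assignment loop. */
static int double_zero_is_all_bits_zero(void)
{
    const double z = 0.0;
    static const unsigned char zeros[sizeof(double)] = {0};
    return memcmp(&z, zeros, sizeof z) == 0;
}

double *calloc_d_portable(size_t nmemb)
{
    if (double_zero_is_all_bits_zero())
        return calloc(nmemb, sizeof(double)); /* fast path */

    double *p = malloc(nmemb * sizeof *p);
    if (p != NULL)
        for (size_t i = 0; i < nmemb; i++)
            p[i] = 0.0;
    return p;
}
```

On any IEEE 754 machine the fast path is taken and the loop is never compiled into the hot path at all.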
There is this thing called the "as-if rule" in C/C++, which pretty much says:

> As long as the program does the same reads and writes to volatile memory, in the same order, as if the program were executed according to the language specification, then the compiler may do ANY transformation of the code it pleases, including reordering instructions, reordering stores and loads to non-volatile memory, or removing instructions altogether.

Just to show you how awesome some compilers can be, look at what clang does here:
```
auto stupid(){
    auto x = new int[30];
    for(int i = 0; i < 30; i++)
        x[i] = i;
    auto sum = 0;
    for(int i = 0; i < 30; i++)
        sum += x[i]*x[i];
    delete [] x;
    return sum;
}
```

gets compiled into:
```
stupid():                               # @stupid()
        mov eax, 8555
        ret
```

It deduces that the function always returns the same value, completely removes all the code, and just returns a constant! It also removes the calls to new and delete, as it realises they are pointless and don't affect the output of the function. If that is not impressive, I don't know what is. :)

So I would like to take this opportunity to reiterate my first piece of advice:
Clearly express what it is you want your code to do, and let the compiler worry about generating good machine code; the compiler is better at that than you.
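Incidentally, the folded constant can be verified by hand: the loop computes the sum of i² for i = 0..29, which is 29·30·59/6 = 8555. A trivial check (hypothetical helper name, written in plain C to match the rest of the post):

```c
/* Computes the same value clang folded: sum of i*i for i = 0..29. */
int stupid_by_hand(void)
{
    int sum = 0;
    for (int i = 0; i < 30; i++)
        sum += i * i;
    return sum; /* 8555 */
}
```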
Context
StackExchange Code Review Q#150245, answer score: 27