Zero-initializing large dynamically allocated arrays of double
Problem
Arrays of double can be zero-initialized with the { 0 } initializer, which works even on systems whose binary representation of 0.0 is not all bits zero. While the IEEE 754 representation of floating point zero is all bits zero, the C Standard makes no guarantees about the representation of floating point numbers. Large dynamically allocated arrays of integer types can be zero-initialized by using calloc(), but this method is not portable for floating point types, since calloc() initializes all bits to zero. Assigning a value of 0.0 to each individual double in the segment using a loop seems grossly inefficient.

I wrote the function calloc_d() to solve this problem. The function takes a size_t argument nmemb and returns a zero-initialized segment of memory large enough to hold nmemb doubles. If there is an allocation error, the function returns NULL. The caller is, of course, responsible for deallocation.

After using malloc() to allocate the required amount of memory, the first bytes are assigned the value 0.0. Then memmove() is used to copy the bytes of the first zero, then the bytes of the first two zeros, and so on until at least half of the memory has been initialized. Finally, the remaining memory is initialized to 0.0 by copying bytes from the first half.

The code works, and has no obvious issues, as far as I can tell. I am interested in any comments about shortcomings of this approach, or of my code, and suggestions for improvement. I would also be interested in suggestions for alternative, possibly more efficient methods.

I have included the function in a working program below. MAXCOUNT doubles are allocated, initialized, and displayed. The value of MAXCOUNT is currently set to 100, but I have tested it for values up to 1000000.

```
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAXCOUNT 100 // number of doubles to allocate for

double * calloc_d(size_t nmemb);

int
/* ... the remainder of the program was truncated in the source ... */
```
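The body of calloc_d() itself did not survive in the source. A minimal sketch of the doubling-memmove scheme as the question describes it (my reconstruction under the question's stated design, not the OP's exact code) might look like:

```c
#include <stdlib.h>
#include <string.h>

/* Reconstruction of the described approach: zero the first double,
 * then repeatedly double the initialized prefix with memmove until
 * the whole buffer is covered. */
double *calloc_d(size_t nmemb)
{
    if (nmemb == 0)
        return NULL;

    double *p = malloc(nmemb * sizeof *p);
    if (p == NULL)
        return NULL;

    p[0] = 0.0;          /* one zeroed double to copy from */
    size_t done = 1;     /* number of doubles initialized so far */
    while (done < nmemb) {
        /* copy as many already-zeroed doubles as will fit */
        size_t chunk = (done <= nmemb - done) ? done : nmemb - done;
        memmove(p + done, p, chunk * sizeof *p);
        done += chunk;
    }
    return p;
}
```

Each pass doubles the initialized region, so the number of memmove() calls is logarithmic in nmemb.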
Solution
So my first hunch when I read your question is:
> Okay, this guy is making a complicated solution to a simple problem, based on a hunch that the trivial solution is "grossly inefficient", without any measurements to back up said claim. I suspect the OP's solution is slower than the trivial solution.
So I set out to test this, pitting your fancy implementation against my trivial implementation:
```
double * calloc_d2(size_t nmemb){
    double* ret = malloc(sizeof(double) * nmemb);
    if(ret){
        double* end = ret + nmemb; // computed only after the NULL check
        double* i = ret;
        while(i != end){
            *i = 0.0;
            ++i;
        }
    }
    return ret;
}
```

I tested this on ideone.com (I think it is using clang), which shows that this simple bit of code is consistently around 3-5% faster than your fancy code.
So I have two pieces of advice:
- Express your intent as clearly as possible, and let the compiler worry about generating high performance code. Compilers are pretty smart...
- Always, always, always measure.
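The second point can be acted on directly. Below is a deliberately crude timing sketch of my own (not from the answer) comparing the trivial loop against memset; clock() resolution, cache effects, and the optimizer eliding dead stores all matter in a serious benchmark:

```c
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Illustrative timing helpers; return elapsed milliseconds,
 * or -1.0 on allocation failure. */
double fill_loop_ms(size_t n)
{
    double *a = malloc(n * sizeof *a);
    if (a == NULL)
        return -1.0;
    clock_t t0 = clock();
    for (size_t i = 0; i < n; i++)
        a[i] = 0.0;
    clock_t t1 = clock();
    volatile double keep = a[n / 2]; /* discourage dead-store elimination */
    (void)keep;
    free(a);
    return 1000.0 * (double)(t1 - t0) / CLOCKS_PER_SEC;
}

double fill_memset_ms(size_t n)
{
    double *a = malloc(n * sizeof *a);
    if (a == NULL)
        return -1.0;
    clock_t t0 = clock();
    memset(a, 0, n * sizeof *a); /* all-bits-zero is 0.0 only on IEEE 754 */
    clock_t t1 = clock();
    volatile double keep = a[n / 2];
    (void)keep;
    free(a);
    return 1000.0 * (double)(t1 - t0) / CLOCKS_PER_SEC;
}
```

Run each variant several times and compare medians; a single run over a short loop is dominated by noise.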
Addendum/Edit
Seeing as this answer has received quite a lot of attention, I've gone ahead and dug deeper.
Using the Godbolt compiler explorer, I compiled (with -O3) and disassembled my trivial loop implementation. Here is the disassembly from GCC 6.2:
```
calloc_d2(unsigned long):
        push rbp
        push rbx
        lea rbx, [0+rdi*8]
        sub rsp, 8
        mov rdi, rbx
        call malloc
        test rax, rax
        mov rbp, rax
        je .L1
        lea rax, [rax+rbx]
        cmp rbp, rax
        je .L1
        mov rdx, rbx
        xor esi, esi
        mov rdi, rbp
        call memset
.L1:
        add rsp, 8
        mov rax, rbp
        pop rbx
        pop rbp
        ret
```

Notice how the compiler has just gone "Oh, you want to zero that? Cool, I'll just replace your code with a memset." This goes to show that an optimising compiler is quite smart and is able to transform your code in astounding ways.

I also tried different versions of GCC and clang on ARM, MIPS, and PowerPC; they all generate similar instructions, which essentially boil down to a call to malloc followed by memset.

The sole exception is ICC 17 (-O3 -march=native), which goes above and beyond:

```
calloc_d2(unsigned long):
        push rbp                              #5.1
        mov rbp, rsp                          #5.1
        and rsp, -32                          #5.1
        push r15                              #5.1
        push rbx                              #5.1
        sub rsp, 16                           #5.1
        mov rbx, rdi                          #5.1
        lea rdi, QWORD PTR [rbx*8]            #6.50
        call malloc
        -- snip --
        call __intel_avx_rep_memset           #12.8
        -- snip --
```

Notice that it calls into __intel_avx_rep_memset instead; this is a 512-bit-wide AVX-optimized memset that will only run on modern CPUs (it did this because I passed -march=native, asking it to throw backward compatibility out the window). There is some branching code at the end which deals with cases where nmemb isn't evenly divisible by 512/64 = 8.

But here is the key takeaway: we clearly expressed our intent with the trivial loop, and the compiler could then replace that loop (after deducing its functionality) with a 512-bit AVX-optimised memset written by Intel's engineers specifically for modern Intel CPUs.
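As an aside (my own sketch, not part of the answer): since the only obstacle to using calloc() directly is the theoretical possibility that 0.0 is not all bits zero, one can test that property once at runtime and fall back to the explicit loop only when it fails:

```c
#include <stdlib.h>
#include <string.h>

/* Illustrative sketch: use calloc when this platform's 0.0 is all
 * bits zero (true on IEEE 754 systems), otherwise fall back to a
 * plain assignment loop. */
static int double_zero_is_all_bits_zero(void)
{
    const double z = 0.0;
    static const unsigned char zeros[sizeof(double)] = {0};
    return memcmp(&z, zeros, sizeof z) == 0;
}

double *calloc_d_portable(size_t nmemb)
{
    if (double_zero_is_all_bits_zero())
        return calloc(nmemb, sizeof(double)); /* fast path */

    double *p = malloc(nmemb * sizeof *p);
    if (p != NULL)
        for (size_t i = 0; i < nmemb; i++)
            p[i] = 0.0;
    return p;
}
```

On any IEEE 754 machine the fast path is taken and the loop is never compiled into the hot path at all.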
There is this thing called the "as-if rule" in C/C++, which pretty much says:

> As long as the program does the same reads and writes to volatile memory, in the same order, as if the program were executed according to the language specification, then the compiler may do ANY transformation of the code it pleases, including reordering instructions, reordering stores and loads to non-volatile memory, or removing instructions altogether.

Just to show you how awesome some compilers can be, look at what clang does here:
```
auto stupid(){
    auto x = new int[30];
    for(int i = 0; i < 30; i++)
        x[i] = i;
    auto sum = 0;
    for(int i = 0; i < 30; i++)
        sum += x[i]*x[i];
    delete [] x;
    return sum;
}
```

gets compiled into:
```
stupid():                               # @stupid()
        mov eax, 8555
        ret
```

It deduces that the function always returns the same value, completely removes all the code, and just returns a constant! It also removes the calls to new and delete, as it realises they are pointless and don't affect the output of the function. If that is not impressive, I don't know what is. :)

So I would like to take this opportunity to reiterate my first piece of advice:
Clearly express what it is you want your code to do, and let the compiler worry about generating good machine code; the compiler is better at that than you.
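Incidentally, the folded constant can be verified by hand: the loop computes the sum of i² for i = 0..29, which is 29·30·59/6 = 8555. A trivial check (hypothetical helper name, written in plain C to match the rest of the post):

```c
/* Computes the same value clang folded: sum of i*i for i = 0..29. */
int stupid_by_hand(void)
{
    int sum = 0;
    for (int i = 0; i < 30; i++)
        sum += i * i;
    return sum; /* 8555 */
}
```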
Context
StackExchange Code Review Q#150245, answer score: 27