patterncppMinor
measuring the time taken to copy a std stl container
Viewed 0 times
stdthestlcontainertimetakenmeasuringcopy
Problem
I'm trying to measure the time taken to copy a
std::vector. Would like to know if this is a correct approach. I tried to do some trivial computations on the copy to prevent the compiler from optimizing out the copy. Should I explicitly write a copy-constructor?#include
#include
#include
#include
#include
#include
using namespace std;
int NUM_OF_ELEMENTS;
int NUM_OF_TRIALS;
char CONTAINER_TYPE;
template
class NonPrimitiveType
{
public:
int x[OBJECT_SIZE/4];
NonPrimitiveType()
{
for(int i=0;i
void copyContainer()
{
T container;
container.reserve(NUM_OF_ELEMENTS);
for (int i=0;i));
}
long long hash = 0;
auto begin = chrono::high_resolution_clock::now();
for (int i=0;i(end - begin).count() / NUM_OF_TRIALS;
printf("%c;%d;%d;%lu;%llu\n",CONTAINER_TYPE, NUM_OF_ELEMENTS, (int)OBJECT_SIZE, totalTime, hash);
}
int main(int argc, char *argv[])
{
NUM_OF_ELEMENTS = atoi(argv[1]);
CONTAINER_TYPE = *argv[2];
NUM_OF_TRIALS = atoi(argv[3]);
copyContainer>, 16>();
copyContainer>, 64>();
copyContainer>, 256>();
copyContainer>, 1024>();
copyContainer>, 4096>();
return 0;
}Solution
First regarding the correct approach.
-
-
Using
-
Preparing for measurement. Also as this part goes:
You're just wasting time here by searching memory to reallocate. Since stl container call default ctor your above code could be replaced with:
And this will save you lots of time for redundant calls to
-
-
I'd also avoid calling every test together:
Make a switch statement and call only one of them per run, so as to avoid compiler further optimizing the switches between those statements.
-
If you turn off optimization flags, then you'll get quite an interesting picture of what's happening but your insights wont be applicable once the optimizations kick in. I think its true that measuring timings without optimization flags set to true is quite pointless.
Your code with
transforms to:
So - you're correct that compiler optimizes out the actual copy and leaves the sequential increment of
Now, when you write
If you declare a set outside the loop:
and inside the loop add
In a real life application you'd probably copy the container and let it be consumed by some other function, meaning that a call to
So with this in mind, if you provide a copy ctor that wont really help measuring imo. I'd try to narrow down your request. What do you actually want to measure? Copy of arrays with dtor? Without dtor? Memory reallocation on the heap?
If you declare
then your code with `-O
-
printf usage: My output of hash for every case after 200000 trials was always 85899345920000, which is incorrect - due to wrong printf specifiers - "%c;%d;%d;%lu;%llu\n". Correct ones are "%c;%d;%d;%lld;%lld\n" if you declare totalTime and hash as long long. Why not use cout i wonder? -
Using
float for division results. Then again after fixing this in some cases of a small container with few elements i've kept getting totalTime = 0. Thats not good, so you'd want to use a floating variable to store the duration divided by NUM_OF_TRIALS. Otherwise the precision is reduced due to implicit conversion of the division result to integer.-
Preparing for measurement. Also as this part goes:
T container;
container.reserve(NUM_OF_ELEMENTS);
for (int i=0;i));
}You're just wasting time here by searching memory to reallocate. Since stl container call default ctor your above code could be replaced with:
T container(NUM_OF_ELEMENTS);And this will save you lots of time for redundant calls to
new here.-
sizeof(int) == 4 is true only generally. Use int32_t which gives that guarantee, and then OBJECT_SIZE/4 makes sense. Or better yet divide OBJECT_SIZE by sizeof(int32_t). That way nothing is going to surprise you in the future. -
I'd also avoid calling every test together:
copyContainer>, 16>();
copyContainer>, 64>();
copyContainer>, 256>();
copyContainer>, 1024>();
copyContainer>, 4096>();Make a switch statement and call only one of them per run, so as to avoid compiler further optimizing the switches between those statements.
-
If you turn off optimization flags, then you'll get quite an interesting picture of what's happening but your insights wont be applicable once the optimizations kick in. I think its true that measuring timings without optimization flags set to true is quite pointless.
Your code with
-O3 optimization:auto begin = chrono::high_resolution_clock::now();
for (int i=0;i<NUM_OF_TRIALS;i++)
{
T copy = container;
hash = hash + copy.at(0).x[1];
}
auto end = chrono::high_resolution_clock::now();transforms to:
92 [2] for (int i=0;i 85 c0 test eax,eax
0x407d08 0f 8e 02 02 00 00 jle 0x407f10 , std::allocator > >, 16u>()+592>
0x407d0e 31 f6 xor esi,esi
0x407d10 c7 45 c8 00 00 00 00 mov DWORD PTR [ebp-0x38],0x0
0x407d17 c7 45 cc 00 00 00 00 mov DWORD PTR [ebp-0x34],0x0
0x407d1e 66 90 xchg ax,ax
95 [1] hash = hash + copy.at(0).x[1];
0x407daa 8b 43 04 mov eax,DWORD PTR [ebx+0x4]
0x407db0 99 cdq
0x407db1 01 45 c8 add DWORD PTR [ebp-0x38],eax
0x407db4 11 55 cc adc DWORD PTR [ebp-0x34],edx
92 [3] for (int i=0;i 83 c6 01 add esi,0x1
0x407dbf 39 35 3c f0 40 00 cmp DWORD PTR ds:0x40f03c,esi
0x407dc5 0f 8f 55 ff ff ff jg 0x407d20 , std::allocator > >, 16u>()+96>
97 [1] auto end = chrono::high_resolution_clock::now();So - you're correct that compiler optimizes out the actual copy and leaves the sequential increment of
hash by 1 up to NUM_OF_TRIALS times.Now, when you write
T copy = container; in a loop you're actually calling Ctor of T, copy assignment operator of T, copy ctor, and then dtor. Then loop counter increments and again 4 calls. Also since nothing is happening to 'T copy', you're getting the same address on the stack reused for the copy and have the stack top pointer moving up and down, while the memory for the elements of the container on the heap also get reused. If you declare a set outside the loop:
unordered_set vAddr;//(NUM_OF_TRIALS + 1);
vAddr.insert(&(container[0].x[1]));and inside the loop add
vAddr.insert(&(copy[0].x[1])); then by calling vAddr.size() after the loop you should get the actual number of distinct memory location used by first element. Answer is 2. One for container and one for copy.In a real life application you'd probably copy the container and let it be consumed by some other function, meaning that a call to
T copy = container would probably have to allocate a new space for copy's elements on the heap. So while moving the stack pointer up down is a single instruction, what can become a bottleneck is allocating new memory on the heap. So with this in mind, if you provide a copy ctor that wont really help measuring imo. I'd try to narrow down your request. What do you actually want to measure? Copy of arrays with dtor? Without dtor? Memory reallocation on the heap?
If you declare
T copy outside the loop T copy;
auto begin = chrono::high_resolution_clock::now();
for (int i=0;i<NUM_OF_TRIALS;i++)
{
/*T*/ copy = container;
}then your code with `-O
Code Snippets
T container;
container.reserve(NUM_OF_ELEMENTS);
for (int i=0;i<NUM_OF_ELEMENTS;i++)
{
container.push_back(*(new NonPrimitiveType<OBJECT_SIZE>));
}copyContainer<vector<NonPrimitiveType<16>>, 16>();
copyContainer<vector<NonPrimitiveType<64>>, 64>();
copyContainer<vector<NonPrimitiveType<256>>, 256>();
copyContainer<vector<NonPrimitiveType<1024>>, 1024>();
copyContainer<vector<NonPrimitiveType<4096>>, 4096>();auto begin = chrono::high_resolution_clock::now();
for (int i=0;i<NUM_OF_TRIALS;i++)
{
T copy = container;
hash = hash + copy.at(0).x[1];
}
auto end = chrono::high_resolution_clock::now();92 [2] for (int i=0;i<NUM_OF_TRIALS;i++)
0x407d06 <+0x0046> 85 c0 test eax,eax
0x407d08 <+0x0048> 0f 8e 02 02 00 00 jle 0x407f10 <copyContainer<std::vector<NonPrimitiveType<16u>, std::allocator<NonPrimitiveType<16u> > >, 16u>()+592>
0x407d0e <+0x004e> 31 f6 xor esi,esi
0x407d10 <+0x0050> c7 45 c8 00 00 00 00 mov DWORD PTR [ebp-0x38],0x0
0x407d17 <+0x0057> c7 45 cc 00 00 00 00 mov DWORD PTR [ebp-0x34],0x0
0x407d1e <+0x005e> 66 90 xchg ax,ax
95 [1] hash = hash + copy.at(0).x[1];
0x407daa <+0x00ea> 8b 43 04 mov eax,DWORD PTR [ebx+0x4]
0x407db0 <+0x00f0> 99 cdq
0x407db1 <+0x00f1> 01 45 c8 add DWORD PTR [ebp-0x38],eax
0x407db4 <+0x00f4> 11 55 cc adc DWORD PTR [ebp-0x34],edx
92 [3] for (int i=0;i<NUM_OF_TRIALS;i++)
0x407db7 <+0x00f7> 83 c6 01 add esi,0x1
0x407dbf <+0x00ff> 39 35 3c f0 40 00 cmp DWORD PTR ds:0x40f03c,esi
0x407dc5 <+0x0105> 0f 8f 55 ff ff ff jg 0x407d20 <copyContainer<std::vector<NonPrimitiveType<16u>, std::allocator<NonPrimitiveType<16u> > >, 16u>()+96>
97 [1] auto end = chrono::high_resolution_clock::now();unordered_set<int *> vAddr;//(NUM_OF_TRIALS + 1);
vAddr.insert(&(container[0].x[1]));Context
StackExchange Code Review Q#140185, answer score: 2
Revisions (0)
No revisions yet.