HiveBrain v1.2.0
Get Started
← Back to all entries
patterncppMinor

measuring the time taken to copy a std stl container

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
stdthestlcontainertimetakenmeasuringcopy

Problem

I'm trying to measure the time taken to copy a std::vector. Would like to know if this is a correct approach. I tried to do some trivial computations on the copy to prevent the compiler from optimizing out the copy. Should I explicitly write a copy-constructor?

#include 
#include 
#include 
#include 
#include 
#include 
using namespace std;

int NUM_OF_ELEMENTS;
int NUM_OF_TRIALS;
char CONTAINER_TYPE;

template
class NonPrimitiveType
{
public:
    int x[OBJECT_SIZE/4];
    NonPrimitiveType()
    {
        for(int i=0;i
void copyContainer()
{
    T container;
    container.reserve(NUM_OF_ELEMENTS);
    for (int i=0;i));
    }

    long long hash = 0;
    auto begin = chrono::high_resolution_clock::now();
    for (int i=0;i(end - begin).count() / NUM_OF_TRIALS;
    printf("%c;%d;%d;%lu;%llu\n",CONTAINER_TYPE, NUM_OF_ELEMENTS, (int)OBJECT_SIZE, totalTime, hash);
}

int main(int argc, char *argv[])
{
    NUM_OF_ELEMENTS = atoi(argv[1]);
    CONTAINER_TYPE = *argv[2];
    NUM_OF_TRIALS = atoi(argv[3]);

    copyContainer>, 16>();
    copyContainer>, 64>();
    copyContainer>, 256>();
    copyContainer>, 1024>();
    copyContainer>, 4096>();

    return 0;
}

Solution

First regarding the correct approach.

-
printf usage: My output of hash for every case after 200000 trials was always 85899345920000, which is incorrect - due to wrong printf specifiers - "%c;%d;%d;%lu;%llu\n". Correct ones are "%c;%d;%d;%lld;%lld\n" if you declare totalTime and hash as long long. Why not use cout i wonder?

-
Using float for division results. Then again after fixing this in some cases of a small container with few elements i've kept getting totalTime = 0. Thats not good, so you'd want to use a floating variable to store the duration divided by NUM_OF_TRIALS. Otherwise the precision is reduced due to implicit conversion of the division result to integer.

-
Preparing for measurement. Also as this part goes:

T container;
container.reserve(NUM_OF_ELEMENTS);
for (int i=0;i));
}


You're just wasting time here by searching memory to reallocate. Since stl container call default ctor your above code could be replaced with:

T container(NUM_OF_ELEMENTS);

And this will save you lots of time for redundant calls to new here.

-
sizeof(int) == 4 is true only generally. Use int32_t which gives that guarantee, and then OBJECT_SIZE/4 makes sense. Or better yet divide OBJECT_SIZE by sizeof(int32_t). That way nothing is going to surprise you in the future.

-
I'd also avoid calling every test together:

copyContainer>, 16>();
copyContainer>, 64>();
copyContainer>, 256>();
copyContainer>, 1024>();
copyContainer>, 4096>();


Make a switch statement and call only one of them per run, so as to avoid compiler further optimizing the switches between those statements.

-
If you turn off optimization flags, then you'll get quite an interesting picture of what's happening but your insights wont be applicable once the optimizations kick in. I think its true that measuring timings without optimization flags set to true is quite pointless.

Your code with -O3 optimization:

auto begin = chrono::high_resolution_clock::now();
for (int i=0;i<NUM_OF_TRIALS;i++)
{
    T copy = container;
    hash = hash + copy.at(0).x[1];
}
auto end = chrono::high_resolution_clock::now();


transforms to:

92 [2]      for (int i=0;i        85 c0                    test   eax,eax
0x407d08          0f 8e 02 02 00 00        jle    0x407f10 , std::allocator > >, 16u>()+592>
0x407d0e          31 f6                    xor    esi,esi
0x407d10          c7 45 c8 00 00 00 00     mov    DWORD PTR [ebp-0x38],0x0
0x407d17          c7 45 cc 00 00 00 00     mov    DWORD PTR [ebp-0x34],0x0
0x407d1e          66 90                    xchg   ax,ax
95 [1]          hash = hash + copy.at(0).x[1];
0x407daa          8b 43 04                 mov    eax,DWORD PTR [ebx+0x4]
0x407db0          99                       cdq
0x407db1          01 45 c8                 add    DWORD PTR [ebp-0x38],eax
0x407db4          11 55 cc                 adc    DWORD PTR [ebp-0x34],edx
92 [3]      for (int i=0;i        83 c6 01                 add    esi,0x1
0x407dbf          39 35 3c f0 40 00        cmp    DWORD PTR ds:0x40f03c,esi
0x407dc5          0f 8f 55 ff ff ff        jg     0x407d20 , std::allocator > >, 16u>()+96>
97 [1]      auto end = chrono::high_resolution_clock::now();


So - you're correct that compiler optimizes out the actual copy and leaves the sequential increment of hash by 1 up to NUM_OF_TRIALS times.

Now, when you write T copy = container; in a loop you're actually calling Ctor of T, copy assignment operator of T, copy ctor, and then dtor. Then loop counter increments and again 4 calls. Also since nothing is happening to 'T copy', you're getting the same address on the stack reused for the copy and have the stack top pointer moving up and down, while the memory for the elements of the container on the heap also get reused.

If you declare a set outside the loop:

unordered_set vAddr;//(NUM_OF_TRIALS + 1);
vAddr.insert(&(container[0].x[1]));


and inside the loop add vAddr.insert(&(copy[0].x[1])); then by calling vAddr.size() after the loop you should get the actual number of distinct memory location used by first element. Answer is 2. One for container and one for copy.

In a real life application you'd probably copy the container and let it be consumed by some other function, meaning that a call to T copy = container would probably have to allocate a new space for copy's elements on the heap. So while moving the stack pointer up down is a single instruction, what can become a bottleneck is allocating new memory on the heap.

So with this in mind, if you provide a copy ctor that wont really help measuring imo. I'd try to narrow down your request. What do you actually want to measure? Copy of arrays with dtor? Without dtor? Memory reallocation on the heap?

If you declare T copy outside the loop

T copy;
auto begin = chrono::high_resolution_clock::now();
for (int i=0;i<NUM_OF_TRIALS;i++)
{
    /*T*/ copy = container;
}


then your code with `-O

Code Snippets

T container;
container.reserve(NUM_OF_ELEMENTS);
for (int i=0;i<NUM_OF_ELEMENTS;i++)
{
    container.push_back(*(new NonPrimitiveType<OBJECT_SIZE>));
}
copyContainer<vector<NonPrimitiveType<16>>, 16>();
copyContainer<vector<NonPrimitiveType<64>>, 64>();
copyContainer<vector<NonPrimitiveType<256>>, 256>();
copyContainer<vector<NonPrimitiveType<1024>>, 1024>();
copyContainer<vector<NonPrimitiveType<4096>>, 4096>();
auto begin = chrono::high_resolution_clock::now();
for (int i=0;i<NUM_OF_TRIALS;i++)
{
    T copy = container;
    hash = hash + copy.at(0).x[1];
}
auto end = chrono::high_resolution_clock::now();
92 [2]      for (int i=0;i<NUM_OF_TRIALS;i++)
0x407d06  <+0x0046>        85 c0                    test   eax,eax
0x407d08  <+0x0048>        0f 8e 02 02 00 00        jle    0x407f10 <copyContainer<std::vector<NonPrimitiveType<16u>, std::allocator<NonPrimitiveType<16u> > >, 16u>()+592>
0x407d0e  <+0x004e>        31 f6                    xor    esi,esi
0x407d10  <+0x0050>        c7 45 c8 00 00 00 00     mov    DWORD PTR [ebp-0x38],0x0
0x407d17  <+0x0057>        c7 45 cc 00 00 00 00     mov    DWORD PTR [ebp-0x34],0x0
0x407d1e  <+0x005e>        66 90                    xchg   ax,ax
95 [1]          hash = hash + copy.at(0).x[1];
0x407daa  <+0x00ea>        8b 43 04                 mov    eax,DWORD PTR [ebx+0x4]
0x407db0  <+0x00f0>        99                       cdq
0x407db1  <+0x00f1>        01 45 c8                 add    DWORD PTR [ebp-0x38],eax
0x407db4  <+0x00f4>        11 55 cc                 adc    DWORD PTR [ebp-0x34],edx
92 [3]      for (int i=0;i<NUM_OF_TRIALS;i++)
0x407db7  <+0x00f7>        83 c6 01                 add    esi,0x1
0x407dbf  <+0x00ff>        39 35 3c f0 40 00        cmp    DWORD PTR ds:0x40f03c,esi
0x407dc5  <+0x0105>        0f 8f 55 ff ff ff        jg     0x407d20 <copyContainer<std::vector<NonPrimitiveType<16u>, std::allocator<NonPrimitiveType<16u> > >, 16u>()+96>
97 [1]      auto end = chrono::high_resolution_clock::now();
unordered_set<int *> vAddr;//(NUM_OF_TRIALS + 1);
vAddr.insert(&(container[0].x[1]));

Context

StackExchange Code Review Q#140185, answer score: 2

Revisions (0)

No revisions yet.