Parallelising random reads seems to work well — why?
Problem
Consider the following very simple computer program:
    for i = 1 to n:
        y[i] = x[p[i]]

Here $x$ and $y$ are $n$-element arrays of bytes, and $p$ is an $n$-element array of words. Here $n$ is large, e.g., $n = 2^{31}$ (so that only a negligible fraction of the data fits in any kind of cache memory).
Assume that $p$ consists of random numbers, uniformly distributed between $1$ and $n$.
From the perspective of modern hardware, this should mean the following:
- reading $p[i]$ is cheap (sequential read)
- reading $x[p[i]]$ is very expensive (random reads; almost all reads are cache misses; we will have to fetch each individual byte from the main memory)
- writing $y[i]$ is cheap (sequential write).
And this is indeed what I am observing. The program is very slow in comparison with a program that does only sequential reads and writes. Great.
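For reference, here is a minimal single-threaded C sketch of such an experiment; the array size, the RNG, and the timing routine are illustrative choices of mine, not from the original post:

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    int main(void)
    {
        /* Illustrative size: 2^26 elements (x, y: 64 MB each; p: 256 MB).
           The post uses n = 2^31, which makes the effect far more pronounced. */
        size_t n = (size_t)1 << 26;

        uint8_t  *x = malloc(n);
        uint8_t  *y = malloc(n);
        uint32_t *p = malloc(n * sizeof *p);
        if (!x || !y || !p) return 1;

        srand(1);
        for (size_t i = 0; i < n; i++) {
            x[i] = (uint8_t)i;
            /* rand() may yield only 15 bits, so combine two calls;
               good enough for a timing sketch. */
            p[i] = (((uint32_t)rand() << 15) ^ (uint32_t)rand()) % (uint32_t)n;
        }

        /* Baseline: sequential reads and writes only. */
        clock_t t0 = clock();
        for (size_t i = 0; i < n; i++)
            y[i] = x[i];
        clock_t t1 = clock();

        /* Random gather: sequential read of p, random read of x,
           sequential write of y. */
        for (size_t i = 0; i < n; i++)
            y[i] = x[p[i]];
        clock_t t2 = clock();

        /* Consume y so the compiler cannot discard the loops. */
        unsigned long sum = 0;
        for (size_t i = 0; i < n; i++) sum += y[i];

        printf("sequential: %.2f s, gather: %.2f s (checksum %lu)\n",
               (double)(t1 - t0) / CLOCKS_PER_SEC,
               (double)(t2 - t1) / CLOCKS_PER_SEC, sum);

        free(x); free(y); free(p);
        return 0;
    }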
Now comes the question: how well does this program parallelise on modern multi-core platforms?
My hypothesis was that this program does not parallelise well. After all, the bottleneck is the main memory. A single core is already wasting most of its time just waiting for some data from the main memory.
However, this was not what I observed when I started experimenting with some algorithms where the bottleneck was this kind of operation!
I simply replaced the naive for-loop with an OpenMP parallel for-loop (in essence, it just splits the range $[1,n]$ into smaller parts and runs these parts on different CPU cores in parallel).
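In C with OpenMP that change is a single pragma; a minimal sketch, assuming the arrays from the program above (compile with -fopenmp on GCC/Clang):

    /* The pragma splits [0, n) into contiguous per-thread chunks
       (schedule(static)); no synchronization is needed inside the loop,
       since each iteration writes a distinct y[i]. */
    #pragma omp parallel for schedule(static)
    for (size_t i = 0; i < n; i++)
        y[i] = x[p[i]];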
On low-end computers, speedups were indeed minor. But on higher-end platforms I was surprised that I was getting excellent near-linear speedups. Some concrete examples (the exact timings may be a bit off; there is a lot of random variation, and these were just quick experiments):
- 2 x 4-core Xeon (in total 8 cores): factor 5-8 speedups in comparison to the single-threaded version.
- 2 x 6-core Xeon (in total 12 cores): factor 8-14 speedups in comparison to the single-threaded version.
Now this was unexpected. Why does this kind of operation parallelise so well?
Solution
Forget for a moment all of the issues related to access to main memory and the level-3 cache. From a parallel perspective, ignoring these issues, the program parallelizes perfectly when using $p$ processors (or cores): once you partition the work through domain decomposition, each core must process either $\left\lfloor \frac{n}{p} \right\rfloor$ or $\left\lceil \frac{n}{p} \right\rceil$ elements, and there is no communication or synchronization overhead, since there is no functional dependence among the processors. Therefore, ignoring memory issues, you expect a speedup equal to $p$.
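In symbols (my notation, not the answerer's): if a single core takes $T_1 = cn$ for some per-element cost $c$, then under this idealisation

$$T_p \approx c \left\lceil \frac{n}{p} \right\rceil \qquad\Longrightarrow\qquad S(p) = \frac{T_1}{T_p} \approx p.$$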
Now, let's take the memory issues into account. The super-linear speedup you actually observed on your high-end Xeon-based node can be justified as follows.
A parallel system can exhibit such behaviour if its memory is hierarchical and if access time increases (in discrete steps) with the amount of memory used by the program. In this case the effective computation speed can be lower on a serial processor than on a parallel computer built from similar processors: a sequential algorithm using $n$ bytes of memory keeps all $n$ bytes on one processor, while on a $p$-processor parallel system each processor handles only about $n/p$ bytes, so a larger fraction of each processor's working set fits in fast memory; cache and virtual memory effects meanwhile reduce the serial processor's effective computation rate.
For $n = 2^{31}$ bytes, we need 2048 Mbytes of memory. But when using 12 cores, as in your last example, every core needs to deal with only 2048/12 Mbytes of data, which is about 170 Mbytes. High-end Xeon processors are equipped with a level-3 cache whose size ranges from 15 to 30 Mbytes. Clearly, with this huge cache size the cache hit ratio is high, and this explains the good, or even super-linear, speedup observed.
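Spelling out the arithmetic:

$$n = 2^{31}\ \text{bytes} = 2048\ \text{Mbytes}, \qquad \frac{2048\ \text{Mbytes}}{12\ \text{cores}} \approx 170.7\ \text{Mbytes per core}.$$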
Regarding your second question, current architectures already prefetch data, evicting cache lines and replacing them as required to exploit temporal and spatial locality. But this will not be enough for a single core processing 2048 Mbytes of data. If you restrict $n$ to about 170 Mbytes, then you should see more or less the same performance on a single core, since you are now running under (more or less, though not exactly) the same conditions.
Finally, besides QSM (Queueing Shared Memory), I am not aware of any other theoretical parallel model that accounts, at the same level, for contention in access to shared memory (in your case, when using OpenMP the main memory is shared among the cores, and the level-3 cache is likewise shared among the cores). Anyway, even though the model is interesting, it has not enjoyed great success.
Context
StackExchange Computer Science Q#18834, answer score: 9