patterncppCritical
Why is my program slow when looping over exactly 8192 elements?
Viewed 0 times
programloopingwhyoverwhen8192slowelementsexactly
Problem
Here is the extract from the program in question. The matrix
Then, you make a matrix
That's all there's to the program. For completeness' sake, here is what comes before. No code comes after. As you can see, it's just initialization.
Basically, this program is slow when SIZE is a multiple of 2048, e.g. the execution times:
The compiler is GCC.
From what I know, this is because of memory management, but I don't really know too much about that subject, which is why I'm asking here.
Also how to fix this would be nice, but if someone could explain these execution times I'd already be happy enough.
I already know of malloc/free, but the problem is not amount of memory used, it's merely execution time, so I don't know how that would help.
img[][] has the size SIZE×SIZE, and is initialized at:img[j][i] = 2 * j + iThen, you make a matrix
res[][], and each field in here is made to be the average of the 9 fields around it in the img matrix. The border is left at 0 for simplicity.for(i=1;i<SIZE-1;i++)
for(j=1;j<SIZE-1;j++) {
res[j][i]=0;
for(k=-1;k<2;k++)
for(l=-1;l<2;l++)
res[j][i] += img[j+l][i+k];
res[j][i] /= 9;
}That's all there's to the program. For completeness' sake, here is what comes before. No code comes after. As you can see, it's just initialization.
#define SIZE 8192
float img[SIZE][SIZE]; // input image
float res[SIZE][SIZE]; //result of mean filter
int i,j,k,l;
for(i=0;i<SIZE;i++)
for(j=0;j<SIZE;j++)
img[j][i] = (2*j+i)%8196;Basically, this program is slow when SIZE is a multiple of 2048, e.g. the execution times:
SIZE = 8191: 3.44 secs
SIZE = 8192: 7.20 secs
SIZE = 8193: 3.18 secsThe compiler is GCC.
From what I know, this is because of memory management, but I don't really know too much about that subject, which is why I'm asking here.
Also how to fix this would be nice, but if someone could explain these execution times I'd already be happy enough.
I already know of malloc/free, but the problem is not amount of memory used, it's merely execution time, so I don't know how that would help.
Solution
The difference is caused by the same super-alignment issue from the following related questions:
But that's only because there's one other problem with the code.
Starting from the original loop:
First notice that the two inner loops are trivial. They can be unrolled as follows:
So that leaves the two outer-loops that we're interested in.
Now we can see the problem is the same in this question: Why does the order of the loops affect performance when iterating over a 2D array?
You are iterating the matrix column-wise instead of row-wise.
To solve this problem, you should interchange the two loops.
This eliminates all the non-sequential access completely so you no longer get random slow-downs on large powers-of-two.
Core i7 920 @ 3.5 GHz
Original code:
Interchanged Outer-Loops:
- Why is transposing a matrix of 512x512 much slower than transposing a matrix of 513x513?
- Matrix multiplication: Small difference in matrix size, large difference in timings
But that's only because there's one other problem with the code.
Starting from the original loop:
for(i=1;i<SIZE-1;i++)
for(j=1;j<SIZE-1;j++) {
res[j][i]=0;
for(k=-1;k<2;k++)
for(l=-1;l<2;l++)
res[j][i] += img[j+l][i+k];
res[j][i] /= 9;
}First notice that the two inner loops are trivial. They can be unrolled as follows:
for(i=1;i<SIZE-1;i++) {
for(j=1;j<SIZE-1;j++) {
res[j][i]=0;
res[j][i] += img[j-1][i-1];
res[j][i] += img[j ][i-1];
res[j][i] += img[j+1][i-1];
res[j][i] += img[j-1][i ];
res[j][i] += img[j ][i ];
res[j][i] += img[j+1][i ];
res[j][i] += img[j-1][i+1];
res[j][i] += img[j ][i+1];
res[j][i] += img[j+1][i+1];
res[j][i] /= 9;
}
}So that leaves the two outer-loops that we're interested in.
Now we can see the problem is the same in this question: Why does the order of the loops affect performance when iterating over a 2D array?
You are iterating the matrix column-wise instead of row-wise.
To solve this problem, you should interchange the two loops.
for(j=1;j<SIZE-1;j++) {
for(i=1;i<SIZE-1;i++) {
res[j][i]=0;
res[j][i] += img[j-1][i-1];
res[j][i] += img[j ][i-1];
res[j][i] += img[j+1][i-1];
res[j][i] += img[j-1][i ];
res[j][i] += img[j ][i ];
res[j][i] += img[j+1][i ];
res[j][i] += img[j-1][i+1];
res[j][i] += img[j ][i+1];
res[j][i] += img[j+1][i+1];
res[j][i] /= 9;
}
}This eliminates all the non-sequential access completely so you no longer get random slow-downs on large powers-of-two.
Core i7 920 @ 3.5 GHz
Original code:
8191: 1.499 seconds
8192: 2.122 seconds
8193: 1.582 secondsInterchanged Outer-Loops:
8191: 0.376 seconds
8192: 0.357 seconds
8193: 0.351 secondsCode Snippets
for(i=1;i<SIZE-1;i++)
for(j=1;j<SIZE-1;j++) {
res[j][i]=0;
for(k=-1;k<2;k++)
for(l=-1;l<2;l++)
res[j][i] += img[j+l][i+k];
res[j][i] /= 9;
}for(i=1;i<SIZE-1;i++) {
for(j=1;j<SIZE-1;j++) {
res[j][i]=0;
res[j][i] += img[j-1][i-1];
res[j][i] += img[j ][i-1];
res[j][i] += img[j+1][i-1];
res[j][i] += img[j-1][i ];
res[j][i] += img[j ][i ];
res[j][i] += img[j+1][i ];
res[j][i] += img[j-1][i+1];
res[j][i] += img[j ][i+1];
res[j][i] += img[j+1][i+1];
res[j][i] /= 9;
}
}for(j=1;j<SIZE-1;j++) {
for(i=1;i<SIZE-1;i++) {
res[j][i]=0;
res[j][i] += img[j-1][i-1];
res[j][i] += img[j ][i-1];
res[j][i] += img[j+1][i-1];
res[j][i] += img[j-1][i ];
res[j][i] += img[j ][i ];
res[j][i] += img[j+1][i ];
res[j][i] += img[j-1][i+1];
res[j][i] += img[j ][i+1];
res[j][i] += img[j+1][i+1];
res[j][i] /= 9;
}
}8191: 1.499 seconds
8192: 2.122 seconds
8193: 1.582 seconds8191: 0.376 seconds
8192: 0.357 seconds
8193: 0.351 secondsContext
Stack Overflow Q#12264970, score: 1008
Revisions (0)
No revisions yet.