Speed optimization for transparent gradient blend algorithm
Problem
The following algorithm works correctly for the purposes that I need, which is to set the top side of an image with a transparent gradient blend. Any tips to optimize it for speed?
void ImageUtil::sectionAlphaGradient(uint32_t* pixelBuffer, const int width, const int gradientHeight)
{
    unsigned int pixelOffsetY, pixelIndex;
    uint8_t A, R, G, B;
    const unsigned short OPAQUE = 255;
    const uint32_t BACKCOLOR = 0x00000000;
    const uint8_t backR = (BACKCOLOR & 0x00FF0000) >> 16;
    const uint8_t backG = (BACKCOLOR & 0x0000FF00) >> 8;
    const uint8_t backB = (BACKCOLOR & 0x000000FF);
    for (unsigned int y = 0; y < gradientHeight; y++)
    {
        A = (uint8_t)((OPAQUE * y) / gradientHeight);
        pixelOffsetY = y * width;
        for (unsigned int x = 0; x < width; x++)
        {
            pixelIndex = pixelOffsetY + x;
            uint32_t pixelValue = pixelBuffer[pixelIndex];
            R = (pixelValue & 0x00FF0000) >> 16;
            G = (pixelValue & 0x0000FF00) >> 8;
            B = (pixelValue & 0x000000FF);
            R = (R * A + backR * (OPAQUE - A)) / OPAQUE;
            G = (G * A + backG * (OPAQUE - A)) / OPAQUE;
            B = (B * A + backB * (OPAQUE - A)) / OPAQUE;
            pixelBuffer[pixelIndex] = (uint32_t)((A << 24) | ((R & 0xFF) << 16) | ((G & 0xFF) << 8) | (B & 0xFF));
        }
    }
}
Solution
There are a number of things you could do to make this faster and better.
Use a pointer rather than array references
The code currently computes a pixelIndex and uses it as in pixelValue = pixelBuffer[pixelIndex], but what's actually happening is that each pixel is visited in order. The code can be sped up quite a bit with just this one change.
    pixelValue = *pixelBuffer;
    // .. do processing
    ++pixelBuffer; // advance to the next pixel
Avoid shifting
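Right now, the code masks off the pixel value for each color and then shifts it, does calculations and then shifts it back into place. Those shifts are not needed. Instead of this:
    const uint8_t backR = (BACKCOLOR & 0x00FF0000) >> 16;
    R = (pixelValue & 0x00FF0000) >> 16;
    pixelBuffer[pixelIndex] = (uint32_t)((A << 24) | ((R & 0xFF) << 16)
        | ((G & 0xFF) << 8) | (B & 0xFF));
The code could instead do this:
    const uint32_t maskR = 0x00FF0000;
    const uint32_t backR = BACKCOLOR & maskR;
    R = *pixelBuffer & maskR;
    *pixelBuffer++ = (uint32_t)((A << 24) | (R & maskR)
        | (G & maskG) | (B & maskB));
This works because the blend is only a multiply, an add and a divide, so a channel left at its original bit position carries its scale factor through the arithmetic unchanged. It does require R, G and B to be uint32_t rather than uint8_t. The largest intermediate value, 0x00FF0000 * 255, still fits in 32 bits, and the final mask discards any bits that spill below the channel.
Use same-sized constants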
Using a mix of uint8_t, uint32_t and unsigned short constants means the compiler has to insert implicit conversions between them. Better would be to use all the same size, such as uint32_t, the same as your pixel data. The code generated by the compiler may be simpler, smaller and faster as a result.
Move loop invariants out of the loop
Compilers can often perform this optimization for you, but it's good to give them a hint. This code computes this each time through the inner loop:
    R = (R * A + backR * (OPAQUE - A)) / OPAQUE;
But A, backR and OPAQUE don't change within that loop, so better would be to calculate backR * (OPAQUE - A) as part of the outer loop.
Use named constants
This isn't a speed issue, but instead of using the constant 0x00FF0000 in multiple places, it's better from a maintenance standpoint to turn that into a named constant maskR to avoid errors. Alternatively, one could create a function or macro that does the masking.
Be careful with signed vs. unsigned
The loop counters x and y are compared to width and gradientHeight, which are declared as int, but are you ever really going to want negative numbers for any of those? It seems to me that they'd all be better as unsigned.
Is BACKCOLOR really 0?
More of the math could be eliminated if BACKCOLOR is really always supposed to be zero. I assumed instead that it might be set to some other color, and so I left it in place.
Declare variables as late as possible
Rather than using the old C-style of declaring all variables at the top of a function, use the modern C++-style and declare variables as late as possible. Doing so can sometimes help the compiler figure out register allocation, resulting in faster, smaller code.
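As a small illustration of that style, here is a minimal sketch. The helper fadeBlueRows is hypothetical and not part of the original code; it only touches the blue channel so the scoping is easy to see. Each value is declared at the point where it is first computed: the per-row alpha in the outer loop, the per-pixel values in the inner loop.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not from the original post): fades the blue channel
// of the first `rows` rows toward black, row 0 being fully faded.
void fadeBlueRows(uint32_t* pixels, unsigned width, unsigned rows)
{
    assert(pixels != nullptr);
    const uint32_t opaque = 255;
    for (unsigned y = 0; y < rows; ++y)
    {
        // Per-row value: declared here, where it is first computed.
        const uint32_t alpha = (opaque * y) / rows;
        for (unsigned x = 0; x < width; ++x)
        {
            // Per-pixel values: scoped to a single iteration and never reassigned.
            const uint32_t pixel = *pixels;
            const uint32_t blue = (pixel & 0xFFu) * alpha / opaque;
            *pixels++ = (pixel & 0xFFFFFF00u) | blue;
        }
    }
}
```

Because every value is const and lives only in the block that needs it, there is no stale state carried between iterations, which is easier for both the reader and the optimizer to reason about.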
Putting it all together
When all of these are applied, the resulting code is shorter, cleaner and faster:
// The argument is parenthesized so the macros stay correct for any expression.
#define maskR(x) ((x) & 0x00FF0000)
#define maskG(x) ((x) & 0x0000FF00)
#define maskB(x) ((x) & 0x000000FF)
void sectionAlphaGradient2(uint32_t* pixelBuffer,
    const unsigned width, const unsigned gradientHeight)
{
    const uint32_t OPAQUE = 255;
    const uint32_t BACKCOLOR = 0x00000000;
    const uint32_t backR = maskR(BACKCOLOR);
    const uint32_t backG = maskG(BACKCOLOR);
    const uint32_t backB = maskB(BACKCOLOR);
    for (unsigned y = 0; y < gradientHeight; y++)
    {
        // Everything that depends only on the row is computed once per row.
        uint32_t A = (uint8_t)((OPAQUE * y) / gradientHeight);
        uint32_t shiftedA = A << 24;
        uint32_t altR = backR * (OPAQUE - A);
        uint32_t altG = backG * (OPAQUE - A);
        uint32_t altB = backB * (OPAQUE - A);
        for (unsigned int x = 0; x < width; x++)
        {
            // Channels are blended in place, without shifting them down first.
            uint32_t R = (maskR(*pixelBuffer) * A + altR) / OPAQUE;
            uint32_t G = (maskG(*pixelBuffer) * A + altG) / OPAQUE;
            uint32_t B = (maskB(*pixelBuffer) * A + altB) / OPAQUE;
            *pixelBuffer++ = shiftedA | maskR(R) | maskG(G) | maskB(B);
        }
    }
}
Results
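Before comparing timings, it is worth confirming that both approaches write the same pixels. The sketch below is not the test code behind the numbers in this section; it is an illustration built on assumptions. blendShifted and blendUnshifted are hypothetical stand-ins that mirror the per-pixel arithmetic of the shift-based and mask-only versions, blendsAgree compares them over every alpha and channel value for a few sample back colors, and timeMicroseconds is an illustrative timing helper based on std::chrono::steady_clock.

```cpp
#include <chrono>
#include <cstdint>
#include <vector>

// Shift-based blend: each channel is moved down to bits 0..7, blended, moved back.
inline uint32_t blendShifted(uint32_t pixel, uint32_t a, uint32_t back)
{
    const uint32_t opaque = 255;
    uint32_t out = a << 24;
    for (unsigned shift = 0; shift <= 16; shift += 8)
    {
        const uint32_t c = (pixel >> shift) & 0xFFu;
        const uint32_t b = (back >> shift) & 0xFFu;
        out |= ((c * a + b * (opaque - a)) / opaque) << shift;
    }
    return out;
}

// Mask-only blend: channels stay at their bit positions throughout.
inline uint32_t blendUnshifted(uint32_t pixel, uint32_t a, uint32_t back)
{
    const uint32_t opaque = 255;
    const uint32_t masks[] = {0x00FF0000u, 0x0000FF00u, 0x000000FFu};
    uint32_t out = a << 24;
    for (uint32_t mask : masks)
    {
        // Worst case is 0x00FF0000 * 255, which still fits in 32 bits.
        const uint32_t blended = ((pixel & mask) * a + (back & mask) * (opaque - a)) / opaque;
        out |= blended & mask;
    }
    return out;
}

// True when both blends agree for every alpha and channel value tried.
bool blendsAgree()
{
    const uint32_t backs[] = {0x00000000u, 0x00FFFFFFu, 0x00336699u};
    for (uint32_t a = 0; a <= 255; ++a)
        for (uint32_t c = 0; c <= 255; ++c)
        {
            const uint32_t pixel = (c << 16) | ((255 - c) << 8) | c;
            for (uint32_t back : backs)
                if (blendShifted(pixel, a, back) != blendUnshifted(pixel, a, back))
                    return false;
        }
    return true;
}

// Applies `blend` to every pixel (back color 0) and returns the elapsed microseconds.
template <typename Blend>
long long timeMicroseconds(std::vector<uint32_t>& pixels, uint32_t a, Blend blend)
{
    const auto start = std::chrono::steady_clock::now();
    for (uint32_t& p : pixels)
        p = blend(p, a, 0x00000000u);
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count();
}
```

Timings from a harness like this depend heavily on compiler flags and buffer size, so they are only meaningful as a relative comparison on the same machine with the same build settings.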
On my machine, a 64-bit Linux box running g++ 4.8.3 and using test code with an 8000 by 6000 pixel buffer, I get the following results:
original: 240977 microseconds
improved: 125714 microseconds
Context
StackExchange Code Review Q#69826, answer score: 12