Saving image data
Problem
Given a heap of pixel data, and a width and height associated with that pixel data, I've written code which will save that data as an image, and it more-or-less works as expected.
//I'm using the FreeImage library.
#include <FreeImage.h>
/*Some other, irrelevant code*/
namespace filesystem = std::experimental::filesystem;
//'heap' is an object wrapping around a std::vector which allows me to query its
//width and height, as opposed to its "size"
void save_image(filesystem::path path, font_heap const& heap) {
    uint32_t width = heap.get_width(), height = heap.get_height();
    FIBITMAP * image = FreeImage_Allocate(width, height, 8);
    BYTE * bits = (BYTE*)FreeImage_GetBits(image);
    auto it = stdext::make_checked_array_iterator(bits, width * height);
    //BEGIN CONCERNING CODE
    for (size_t y = 0; y < height; y++) {
        it = std::transform(
            heap.begin() + (height - y - 1) * width,
            heap.begin() + (height - y) * width,
            it,
            [](uint8_t val) {return 255 - val; }
        );
    }
    //[... the lines that encode the image and obtain raw_memory and size are missing from this copy; only the tail of the final write call survives ...]
    ... (raw_memory), size);
    FreeImage_CloseMemory(memory);
    FreeImage_Unload(image);
}
But, aside from some trivial optimizations in surrounding parts of the code (which don't concern me because they make up a tiny proportion of CPU time), the code in the for loop is most troubling to me:
auto it = stdext::make_checked_array_iterator(bits, width * height);
for (size_t y = 0; y < height; y++) {
    it = std::transform(
        heap.begin() + (height - y - 1) * width,
        heap.begin() + (height - y) * width,
        it,
        [](uint8_t val) {return 255 - val; }
    );
}
What needs to happen in this code is:
- All the rows need to be swapped, because images are saved x→, y↓, but the data is saved in memory as x→, y↑
- The colors need to be inverted, because the image is a raw grayscale image, and the image should be black with a white background (but is saved in data as white with a black background).
The code does what I'm expecting, but it requires a rather significant amount of CPU overhead to do it, and I'd like to know if there's a better solution.
- Are there STL algorithms that I could/should be using instead of
std::transform? Maybe st
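For reference, the two requirements listed above (reverse the row order, invert every value) can be sketched without FreeImage. This is only a minimal illustration under stated assumptions: `flip_and_invert` is a hypothetical name, and a plain `std::vector<std::uint8_t>` stands in for both `font_heap` and the FreeImage bitmap.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not part of the original code): copies `src`, whose
// rows are stored bottom-up, into a new buffer with rows stored top-down,
// inverting each 8-bit grayscale value on the way.
std::vector<std::uint8_t> flip_and_invert(std::vector<std::uint8_t> const& src,
                                          std::size_t width, std::size_t height) {
    std::vector<std::uint8_t> dst(src.size());
    for (std::size_t y = 0; y < height; ++y) {
        // Source row counted from the end, destination row counted from the start.
        std::uint8_t const* in = src.data() + (height - y - 1) * width;
        std::uint8_t* out = dst.data() + y * width;
        std::transform(in, in + width, out, [](std::uint8_t v) {
            return static_cast<std::uint8_t>(255 - v);
        });
    }
    return dst;
}
```

Calling `flip_and_invert(pixels, width, height)` on a buffer of `width * height` bytes returns the flipped, inverted copy; the input is left untouched.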
Solution
Here are some OpenCL test results on Intel HD Graphics 400 with 12 compute units and single-channel 1600 MHz DDR3 RAM:
Timings include buffer copies. The first figure on each line was measured without memory mapping; with memory mapping the time drops considerably (figures in parentheses).
Edit: fixed the mapped buffer access; it is even faster now (measured with the laptop battery nearly empty).
- 1 byte per pixel
- 1024 x 1024: 4.5 ms (1.52 ms with mapping)
- 2048 x 2048: 9.7 ms (4.19 ms with mapping)
- 4096 x 4096: 21 ms (13.93 ms with mapping)
- 8192 x 8192: 65 ms (55.16 ms with mapping)
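To compare these timings across sizes, they can be converted to an effective throughput. The sketch below is only arithmetic on the figures listed above; `throughput_mb_per_s` is a hypothetical helper, and it assumes square images at 1 byte per pixel with 1 MB taken as 10^6 bytes.

```cpp
#include <cstddef>

// Hypothetical helper: effective throughput in megabytes per second for a
// square image of `side` x `side` pixels (1 byte each) processed in `ms`
// milliseconds. Buffer copies are included because the timings include them.
double throughput_mb_per_s(std::size_t side, double ms) {
    double const bytes = static_cast<double>(side) * static_cast<double>(side);
    return (bytes / 1.0e6) / (ms / 1.0e3);
}
```

With the unmapped timings, `throughput_mb_per_s(1024, 4.5)` comes to roughly 233 MB/s and `throughput_mb_per_s(8192, 65.0)` to roughly 1030 MB/s.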
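Kernel code (the number of threads is half the total number of pixels; each thread swaps a pixel in the upper half with the pixel in the mirrored row of the lower half, inverting both):
__kernel void test0(__global uchar *imagebuf)
{
    // Image size is hard-coded for the 8192 x 8192 test; launch width * height / 2 work-items.
    int i = get_global_id(0);
    int height = 8192;
    int width = 8192;
    int y = i / width; // row in the upper half
    int x = i % width; // column
    // uchar rather than char, so that 255 - value stays within the 8-bit pixel range
    uchar tmp = 255 - imagebuf[((height - y) - 1) * width + x]; // inverted pixel from the mirrored row
    uchar tmp2 = 255 - imagebuf[x + y * width];                 // inverted pixel from this row
    imagebuf[x + y * width] = tmp;
    imagebuf[((height - y) - 1) * width + x] = tmp2;
}
Throughput increases for larger images; the minimum latency depends on the hardware and on the thickness of the OpenCL wrapper. This example was run on a wrapper that is not thin.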
One of the pros: the CPU can be used for other things while the GPU is computing this.
One of the cons: depending on the image size, the completion time will vary.
The kernel has a low compute-to-data ratio, so it will not be faster on computers with the same PCI-e bandwidth (I just tried an R7 240: 8192 x 8192 took 67 ms).
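The kernel's index arithmetic can be checked on the CPU with a sequential sketch in which one loop iteration stands in for one work-item. This is an illustration rather than the code that was benchmarked: `flip_invert_in_place` is a hypothetical name, width and height are passed as parameters instead of being hard-coded, and the handling of an odd middle row is an addition the kernel does not have.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical CPU reference for the kernel's indexing: flips the rows of
// `buf` in place and inverts every 8-bit value.
void flip_invert_in_place(std::vector<std::uint8_t>& buf,
                          std::size_t width, std::size_t height) {
    // One "work-item" per pixel of the upper half, as in the kernel.
    std::size_t const work_items = width * (height / 2);
    for (std::size_t i = 0; i < work_items; ++i) {
        std::size_t const y = i / width;
        std::size_t const x = i % width;
        std::size_t const top = y * width + x;
        std::size_t const bottom = (height - y - 1) * width + x;
        auto const from_bottom = static_cast<std::uint8_t>(255 - buf[bottom]);
        auto const from_top = static_cast<std::uint8_t>(255 - buf[top]);
        buf[top] = from_bottom;
        buf[bottom] = from_top;
    }
    // Assumption beyond the kernel: with an odd height the middle row has no
    // partner row, so it is only inverted.
    if (height % 2 == 1) {
        std::size_t const mid = (height / 2) * width;
        for (std::size_t x = 0; x < width; ++x) {
            buf[mid + x] = static_cast<std::uint8_t>(255 - buf[mid + x]);
        }
    }
}
```

Running this on a small buffer and comparing it with the GPU result is one way to confirm that the mapped and unmapped paths produce the same image.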
Context
StackExchange Code Review Q#152180, answer score: 2