patterncppMinor
Parallel Reduction method with C++AMP
Viewed 0 times
methodwithampparallelreduction
Problem
I am writing a C++AMP library, and as one of my utility methods I am implementing a parallel reduction algorithm based on the cascade method documented on this blog post with slight improvements by utilizing asynchronous CPU execution.
Does anyone have any performance/correctness/language improvements to offer?
```
template
auto ParallelAccumulate( const concurrency::array_view& avData ) -> decltype(std::declval() + std::declval())
{
static_assert(IsPowerOfTwo( TileSize ), "Tile Size must be a power of two");
static_assert(is_amp_compatible::value, "The internal type of the array_view must be amp comptible");
static_assert(NumTiles > 0, "There must be a non-zero number of tiles");
std::size_t sArrayLength = avData.extent.size();
const std::size_t sStrideLength = TileSize NumTiles 2U;
// Accumulate tail (if necessary):
const std::size_t sTailLength = sArrayLength % sStrideLength;
std::future() + std::declval())> futTailSum;
if( sTailLength != 0 )
{
std::vector vecTail( sTailLength );
concurrency::copy( avData.section( concurrency::index( sArrayLength - sTailLength ) ), vecTail.begin() );
futTailSum = std::async( std::launch::async, [&vecTail]{ return concurrency::parallel_reduce( vecTail.begin(), vecTail.end(), static_cast(0) ); } );
if( (sArrayLength -= sTailLength) == 0 )
{
return futTailSum.get();
}
}
concurrency::array() + std::declval()), 1> arrPartialResult( NumTiles );
concurrency::parallel_for_each( concurrency::extent( TileSize * NumTiles ).tile(), =, &arrPartialResult restrict( amp ) {
tile_static decltype(std::declval() + std::declval()) tTileData[TileSize];
std::size_t sLocalIndex = tIndex.local[0];
std::size_t sInputIndex = (tIndex.tile[0] 2U TileSize) + sLocalIndex;
tTileData[sLocalIndex] = static_cast() + std::declval())>(0);
do {
tTileDa
Does anyone have any performance/correctness/language improvements to offer?
```
template
auto ParallelAccumulate( const concurrency::array_view& avData ) -> decltype(std::declval() + std::declval())
{
static_assert(IsPowerOfTwo( TileSize ), "Tile Size must be a power of two");
static_assert(is_amp_compatible::value, "The internal type of the array_view must be amp comptible");
static_assert(NumTiles > 0, "There must be a non-zero number of tiles");
std::size_t sArrayLength = avData.extent.size();
const std::size_t sStrideLength = TileSize NumTiles 2U;
// Accumulate tail (if necessary):
const std::size_t sTailLength = sArrayLength % sStrideLength;
std::future() + std::declval())> futTailSum;
if( sTailLength != 0 )
{
std::vector vecTail( sTailLength );
concurrency::copy( avData.section( concurrency::index( sArrayLength - sTailLength ) ), vecTail.begin() );
futTailSum = std::async( std::launch::async, [&vecTail]{ return concurrency::parallel_reduce( vecTail.begin(), vecTail.end(), static_cast(0) ); } );
if( (sArrayLength -= sTailLength) == 0 )
{
return futTailSum.get();
}
}
concurrency::array() + std::declval()), 1> arrPartialResult( NumTiles );
concurrency::parallel_for_each( concurrency::extent( TileSize * NumTiles ).tile(), =, &arrPartialResult restrict( amp ) {
tile_static decltype(std::declval() + std::declval()) tTileData[TileSize];
std::size_t sLocalIndex = tIndex.local[0];
std::size_t sInputIndex = (tIndex.tile[0] 2U TileSize) + sLocalIndex;
tTileData[sLocalIndex] = static_cast() + std::declval())>(0);
do {
tTileDa
Solution
I only see some possible stylistic/reusability improvements:
-
You use
-
By the way, are there many types for which
-
Whenever possible, try to use
-
By default, if you don't specify the rank of an
-
Your lines are really long. As I said in the comments, you might want to split some of them to improve readability:
-
You use
decltype(std::declval() + std::declval()) five times in the function (plus the one in the return type). You could probably typedef that to improve readability. I don't know how you would call it, but I am pretty sure that you can find a meaningful name.-
By the way, are there many types for which
decltype(std::declval() + std::declval) is different from T while still being amp compatible? If it is the case, that may break some other pieces of your code anyway (I didn't check though; that's a wild guess).-
Whenever possible, try to use
std::begin and std::end instead of the member functions. It helps to write reusable code; you might someday want to extract parts of your function to write smaller and more generic functions that could work on old C arrays.-
By default, if you don't specify the rank of an
array_view, it is one. Therefore, you actually don't have to specify it since you always use 1.-
Your lines are really long. As I said in the comments, you might want to split some of them to improve readability:
futTailSum = std::async( std::launch::async, [&vecTail]
{
return concurrency::parallel_reduce( std::begin(vecTail),
std::end(vecTail),
static_cast(0) );
} );Code Snippets
futTailSum = std::async( std::launch::async, [&vecTail]
{
return concurrency::parallel_reduce( std::begin(vecTail),
std::end(vecTail),
static_cast<T>(0) );
} );Context
StackExchange Code Review Q#58858, answer score: 3
Revisions (0)
No revisions yet.