patterncppMinor

Parallel Reduction method with C++AMP

Submitted by: @import:stackexchange-codereview·Mar 10, 2026·

Viewed 0 times

methodwithampparallelreduction

Problem

I am writing a C++AMP library, and as one of my utility methods I am implementing a parallel reduction algorithm based on the cascade method documented on this blog post with slight improvements by utilizing asynchronous CPU execution.

Does anyone have any performance/correctness/language improvements to offer?

```
template
auto ParallelAccumulate( const concurrency::array_view& avData ) -> decltype(std::declval() + std::declval())
{
static_assert(IsPowerOfTwo( TileSize ), "Tile Size must be a power of two");
static_assert(is_amp_compatible::value, "The internal type of the array_view must be amp comptible");
static_assert(NumTiles > 0, "There must be a non-zero number of tiles");

std::size_t sArrayLength = avData.extent.size();
const std::size_t sStrideLength = TileSize NumTiles 2U;

// Accumulate tail (if necessary):
const std::size_t sTailLength = sArrayLength % sStrideLength;
std::future() + std::declval())> futTailSum;
if( sTailLength != 0 )
{
std::vector vecTail( sTailLength );
concurrency::copy( avData.section( concurrency::index( sArrayLength - sTailLength ) ), vecTail.begin() );
futTailSum = std::async( std::launch::async, [&vecTail]{ return concurrency::parallel_reduce( vecTail.begin(), vecTail.end(), static_cast(0) ); } );

if( (sArrayLength -= sTailLength) == 0 )
{
return futTailSum.get();
}
}

concurrency::array() + std::declval()), 1> arrPartialResult( NumTiles );
concurrency::parallel_for_each( concurrency::extent( TileSize * NumTiles ).tile(), =, &arrPartialResult restrict( amp ) {
tile_static decltype(std::declval() + std::declval()) tTileData[TileSize];
std::size_t sLocalIndex = tIndex.local[0];

std::size_t sInputIndex = (tIndex.tile[0] 2U TileSize) + sLocalIndex;
tTileData[sLocalIndex] = static_cast() + std::declval())>(0);

do {
tTileDa

Solution

I only see some possible stylistic/reusability improvements:

-
You use decltype(std::declval() + std::declval()) five times in the function (plus the one in the return type). You could probably typedef that to improve readability. I don't know how you would call it, but I am pretty sure that you can find a meaningful name.

-
By the way, are there many types for which decltype(std::declval() + std::declval) is different from T while still being amp compatible? If it is the case, that may break some other pieces of your code anyway (I didn't check though; that's a wild guess).

-
Whenever possible, try to use std::begin and std::end instead of the member functions. It helps to write reusable code; you might someday want to extract parts of your function to write smaller and more generic functions that could work on old C arrays.

-
By default, if you don't specify the rank of an array_view, it is one. Therefore, you actually don't have to specify it since you always use 1.

-
Your lines are really long. As I said in the comments, you might want to split some of them to improve readability:

futTailSum = std::async( std::launch::async, [&vecTail]
    {
        return concurrency::parallel_reduce( std::begin(vecTail),
                                             std::end(vecTail),
                                             static_cast(0) );
    } );

Code Snippets

futTailSum = std::async( std::launch::async, [&vecTail]
    {
        return concurrency::parallel_reduce( std::begin(vecTail),
                                             std::end(vecTail),
                                             static_cast<T>(0) );
    } );

Context

StackExchange Code Review Q#58858, answer score: 3

Revisions (0)

No revisions yet.