patterncppMinor

3D vector CUDA kernel

Submitted by: @import:stackexchange-codereview·Mar 10, 2026·

Viewed 0 times

kernelcudavector

Problem

I designed this CUDA kernel to compute a function on a 3D domain:

p and Ap are 3D vectors that are actually implemented as a single long array:

__global__ void update(P_REAL* data, P_REAL* tmp, P_REAL* f, P_REAL* reduced, 
       const unsigned int sizeX,const unsigned int sizeY,const unsigned int sizeZ,
       const P_REAL aX, const P_REAL aY, const P_REAL aZ, const P_REAL aF,
       const P_REAL cX, const P_REAL cY, const P_REAL cZ){

extern __shared__ P_REAL s_pAp[];

// Get grid size
unsigned int gY((sizeY-2)/blockDim.y),gZ((sizeZ-2)/blockDim.z);

unsigned int idX(blockDim.x*blockIdx.x+threadIdx.x+1);
unsigned int idY(blockDim.y*blockIdx.y+threadIdx.y+1);
unsigned int idZ(blockDim.z*blockIdx.z+threadIdx.z+1);

unsigned int sX(sizeY*sizeZ);
unsigned int i((idX*sizeY+idY)*sizeZ+idZ);
unsigned int bi((threadIdx.x*blockDim.y+threadIdx.y)*blockDim.z+threadIdx.z);

Ap[i]=cX*(p[i-sX]+p[i+sX]-2.0f*p[i])
     +cY*(p[i-sizeZ]+p[i+sizeZ]-2.0f*p[i])
     +cZ*(p[i-1]+p[i+1]-2.0f*p[i]);
s_pAp[bi] = Ap[i]*p[i];

// Wait for all threads
__syncthreads();

Reduction of s_pAp (to be summed):

// Take advantage of the geometry of the block
// At each iteration, there are 7 points to consider
// Use byte-shift to ease division by 2

for(int offset = blockDim.x/2;offset>0;offset>>=1){
    if(threadIdx.x < offset && threadIdx.y < offset && threadIdx.z < offset){
        s_pAp[bi]+=s_pAp[bi+offset]+s_pAp[bi+offset*blockDim.z]+s_pAp[bi+offset*blockDim.z*blockDim.y]
                +s_pAp[bi+offset*(1+blockDim.z)]
                +s_pAp[bi+offset*(1+blockDim.z*blockDim.y)]
                +s_pAp[bi+offset*blockDim.z*(1+blockDim.y)]
                +s_pAp[bi+offset*(1+blockDim.z*(1+blockDim.y))];
    }

    // Wait for all threads
    __syncthreads();
}                                   

// Write final result with (0,0,0) thread
if(threadIdx.x==0 && threadIdx.y==0 && threadIdx.z == 0){
    reduced[(blockIdx.x*gY+blockIdx.y)*gZ+blockIdx.z] = s_pAp[0];

}
}

This wo

Solution

-
The indentation within update() is misleading. Most of all, everything inside of the function should be indented. Otherwise, it'll be hard to tell what exactly is inside of the function.

I do, however, like that you've indented the (lengthy) parameter list by eight spaces instead of the standard four. It should be easier to tell them apart from the code inside of the function.

-
More whitespace within statements can be added for more readability.

This line, for instance, is hard to read:

Ap[i]=cX*(p[i-sX]+p[i+sX]-2.0f*p[i])

You can add more whitespace as such:

Ap[i] = cX * (p[i-sX] + p[i+sX] - 2.0f * p[i])

Now you can really tell what everything is at a glance.

You can also add a space before the opening curly braces for functions. It may not be a big deal with functions with smaller signatures, but yours are pretty lengthy, and it's harder to see the opening curly brace with them.

You also have inconsistent whitespace, such as in here:

for(int offset = blockDim.x/2;offset>0;offset>>=1){

There is spacing between the =, but nowhere else. This should look clearer:

for (int offset = blockDim.x / 2; offset > 0; offset >>= 1) {

-
Considering that CUDA's blockIdx and blockDim are primarily signed, you should make your thread ID variables of the same type. If they won't be too large, make them int. If they can be larger than an int, make them either long or long long and cast blockIdx to the same.

Code Snippets

Ap[i]=cX*(p[i-sX]+p[i+sX]-2.0f*p[i])

Ap[i] = cX * (p[i-sX] + p[i+sX] - 2.0f * p[i])

for(int offset = blockDim.x/2;offset>0;offset>>=1){

for (int offset = blockDim.x / 2; offset > 0; offset >>= 1) {

Context

StackExchange Code Review Q#52623, answer score: 3

Revisions (0)

No revisions yet.