A "policy-based" design for a generic CUDA kernel

Submitted by: @import:stackexchange-codereview·Mar 10, 2026·

Problem

I am facing a design issue that has been discussed on Stack Overflow several times, the most similar question being this one. Essentially, I want polymorphism for a CUDA kernel: a "generic" kernel that applies a "user-supplied" function to every element of an array.

In C++, one could achieve this design by using abstract base classes. Here's a minimal example:

#include <iostream>
#include <vector>

class AbstractFunction{
    public:
        virtual double EvalFunction(double) const = 0;
        virtual std::vector<double> operator()(std::vector<double>) const;
};

std::vector<double> AbstractFunction::operator()(std::vector<double> x) const{
    std::vector<double> y;
    for(auto it=x.begin();it!=x.end();++it)
        y.push_back(EvalFunction(*it));
    return y;
}

class ConcreteFunction : public AbstractFunction{
    public:
        double EvalFunction(double x) const {return x*x;}
};

int main()
{
    ConcreteFunction f;
    std::vector<double> x(2,1);
    x[0] = 1.0; x[1] = 2.0;
    std::vector<double> y = f(x);
    for(auto it=y.begin();it!=y.end();++it)
        std::cout<<*it<<std::endl;

    return 0;
}

As it turns out, this design is nearly impossible to port to CUDA, because CUDA doesn't play nicely with abstract base classes: an object with virtual functions that is constructed on the host cannot have those virtual functions called from device code. Fine. After some digging (this question and this question were very useful), I came up with the following design. It seems to work quite well, but I'd like to hear some critiques:

```
#include <cuda_runtime.h>  // cudaMallocManaged, cudaFree, cudaDeviceSynchronize

// The Managed class allows us to use CUDA unified memory.
// See http://devblogs.nvidia.com/parallelforall/unified-memory-in-cuda-6/
class Managed {
public:
    void *operator new(size_t len) {
        void *ptr;
        cudaMallocManaged(&ptr, len);
        cudaDeviceSynchronize();
        return ptr;
    }
    void operator delete(void *ptr) {
        cudaDeviceSynchronize();
        cudaFree(ptr);
    }
};

// Evaluate function on GPU
template <typename FunType>
__global__ void EvalKernelGPU(double* x, double* y, int N, FunType* f)
{
    unsigned int tId = blockIdx.x*blockDim.x + threadIdx.x;
    if(tId
class Evaluator : public
```

Solution

Is there a good reason not to use the Thrust library? It ships with the CUDA Toolkit alongside nvcc and provides this kind of functionality:

#include <iostream>
#include <thrust/transform.h>

__device__ __managed__ float x[2], y[2];

template <typename T> struct square {
    __host__ __device__ T operator()(const T& x) const {
        return x * x;
    }
};

int main(int argc, char const *argv[]) {
    x[0] = 1; x[1] = 2;
    thrust::transform(x, x + 2, y, square<float>());
    std::cout << x[0] << "**2 = " << y[0] << "\n";
    std::cout << x[1] << "**2 = " << y[1];
    return 0;
}
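
The functor passed to thrust::transform above acts as the policy: its call operator is resolved at compile time, so no virtual dispatch is involved. The same pattern can be sketched on the host in standard C++, assuming std::transform as a stand-in for thrust::transform. The names Square and apply_elementwise are hypothetical and do not appear in the original post:

```cpp
#include <algorithm>
#include <vector>

// Hypothetical policy: any type with a const call operator that takes and
// returns T can be plugged in; no base class or virtual function is needed.
template <typename T>
struct Square {
    T operator()(const T& x) const { return x * x; }
};

// Hypothetical generic "kernel": applies the policy f to every element of x.
// F is a template parameter, so the call f(...) is bound at compile time.
template <typename T, typename F>
std::vector<T> apply_elementwise(const std::vector<T>& x, F f) {
    std::vector<T> y(x.size());
    std::transform(x.begin(), x.end(), y.begin(), f);
    return y;
}
```

Calling apply_elementwise(std::vector<double>{1.0, 2.0}, Square<double>{}) yields a vector holding 1 and 4. For use in device code, the call operator would additionally need the __host__ __device__ qualifiers shown in the answer's functor.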

Code Snippets

#include <iostream>
#include <thrust/transform.h>


__device__ __managed__ float x[2], y[2];


template <typename T> struct square {
    __host__ __device__ T operator()(const T& x) const {
        return x * x;
    }
};


int main(int argc, char const *argv[]) {
    x[0] = 1; x[1] = 2;
    thrust::transform(x, x + 2, y, square<float>());
    std::cout << x[0] << "**2 = " << y[0] << "\n";
    std::cout << x[1] << "**2 = " << y[1];
    return 0;
}

Context

StackExchange Code Review Q#124497, answer score: 2
