
Calculating the distance between several spatial points

Submitted by: @import:stackexchange-codereview·Mar 10, 2026·


Problem

I am developing a CUDA program and want to improve its performance. One kernel function consumes more than 70% of the execution time. The kernel calculates the distances between several spatial points and fills a boolean vector according to whether each pair of points are neighbors.

Any ideas on how to get more speedup?

Here is the code:

```
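// Header names were lost in extraction. The headers below are assumed:
// they are the ones the visible code needs.
#include <cstdlib>                 // std::srand
#include <cuda_runtime.h>          // cudaEventRecord
#include <thrust/device_vector.h>  // thrust::device_vector, thrust::raw_pointer_cast
#include <thrust/host_vector.h>    // host-side vectors (assumed)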

#define _SQR(a) ((a)*(a))
#define _BLOCKSIZE 32

__host__ void RandGen(double* A, int n){

double a = 1.0;
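for (int i = 0; i < n; i++) {
// [loop body lost in extraction]
}
}

// [Lost in extraction: the DistanceChecker kernel and the beginning of main(),
//  including the declarations of num, start and stop.]

// Host container type assumed; the original template arguments were stripped.
thrust::host_vector<double> h_xPos(num), h_yPos(num), h_zPos(num), h_h(num,0.001);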

std::srand(11);
RandGen(&h_xPos[0],num);
std::srand(15);
RandGen(&h_yPos[0],num);
std::srand(19);
RandGen(&h_zPos[0],num);

// Element type restored as double to match the host data filled by RandGen(double*, int).
thrust::device_vector<double> d_xPos(h_xPos), d_yPos(h_yPos), d_zPos(h_zPos), d_h(h_h);

float dummymili;
float distanceCheck = 0.f;

int nBranches = 1024;

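// Loop bound and the int element type are assumed; both were lost in extraction.
for (int i = 0; i < nBranches; i++) {
thrust::device_vector<int> particles1(500);
thrust::device_vector<int> particles2(500);
// One flag per (particle1, particle2) pair.
thrust::device_vector<bool> distance(particles1.size()*particles2.size(), true);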

dim3 blockSize(32,32); // also tested for blockSize(16,16)
dim3 gridSize;
gridSize.x = (particles1.size() + blockSize.x - 1) / blockSize.x;
gridSize.y = (particles2.size() + blockSize.y - 1) / blockSize.y;

cudaEventRecord(start);
DistanceChecker<<<gridSize, blockSize>>>(
thrust::raw_pointer_cast(&d_xPos[0]),
thrust::raw_pointer_cast(&d_yPos[0]),
thrust::raw_pointer_cast(&d_zPos[0]),
thrust::raw_pointer_cast(&d_h[0]),
thrust::raw_pointer_cast(&particles1[0]),
thrust::raw_pointer_cast(&particles2[0]),
particles1.size(), particles2.size(),
thrust::raw_pointer_cast(&distance[0]));
// [remainder of the listing is truncated in the source]
```

Solution

[It would help to know the total time the code needs to run on your GPU as well as the kernel time. Treat this as a comment, since I cannot comment yet.]

Two likely reasons why your runtime is so long:

Hardware

Since you want to do the calculations in double precision, you should look for hardware that provides many more double-precision units. Your GPU (Quadro K2000) has only 384/24 = 16 of them, or 32 double-precision operations per cycle if a fused multiply-add is counted as two (cf. anandtech.com). At ~0.95 GHz this gives a peak of only about 15 GFLOP/s when no fused multiply-adds are used.

Workload

Another problem is the small problem size. Each kernel launch uses a grid of only 16x16 blocks, or ~250k threads. In addition, every thread performs only 10 double-precision operations (14 if a is evaluated twice for a*a), which gives a total of 2.5 MFLOP (or 3.5 MFLOP). Even on your GPU, the kernel would run for only about 0.17 ms (or 0.23 ms) at peak performance. GPUs approach their peak performance only as the problem size grows.

On my GPU, a kernel with the calculations removed still takes about 20% of the runtime of the full kernel (compiled without optimization flags), which suggests that a noticeable part of the time is fixed overhead. (Inaccurate time measurement may be a problem, too.)

You may want to test your code on another GPU or use single precision and check if it runs noticeably faster.
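
One low-effort way to try single precision is to write the distance test once as a template on the floating-point type, so that `float` and `double` variants can be compared without duplicating code. The sketch below makes several assumptions: the kernel body is not shown here, so the neighbor criterion (squared distance compared with the squared sum of the two `h` values) and the name `isNeighbor` are hypothetical. It also replaces the `_SQR` macro with an inline function, so that the argument is evaluated only once.

```cpp
// Squares its argument; unlike a macro, the argument is evaluated only once.
template <typename Real>
inline Real sqr(Real a) {
    return a * a;
}

// Hypothetical neighbor test: two points count as neighbors when their
// distance is smaller than the sum of their h values (an assumed criterion).
// Comparing squared quantities avoids a square root.
template <typename Real>
inline bool isNeighbor(Real x1, Real y1, Real z1, Real h1,
                       Real x2, Real y2, Real z2, Real h2) {
    const Real d2 = sqr(x1 - x2) + sqr(y1 - y2) + sqr(z1 - z2);
    return d2 < sqr(h1 + h2);
}
```

In a CUDA build the two functions would additionally be marked `__host__ __device__` so that the kernel can call them, and the kernel itself would be templated on the same type and instantiated once for `float` and once for `double` to compare the timings.
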

There are ways to do DP calculations without DP units, but I don't know whether that is reasonable for GPU computing.

Context

StackExchange Code Review Q#116038, answer score: 3
