HiveBrain v1.2.0

C and CUDA: circular buffer implementation

Submitted by: @import:stackexchange-codereview

Problem

I have a program which uses many circular buffers in an identical fashion on a CPU and a GPU (C and C/C++ CUDA). I essentially require many queues; however, since this runs on a GPU, I have limited each buffer's length so memory can be set up once at the beginning of the program. The code below therefore implements a circular buffer/queue. Device and host code are similar, with only minor changes for memory efficiency on the GPU, which are described when necessary. I have implemented this myself so it can be used within a hand-written kernel on the GPU and so I can compare results between the CPU and GPU for verification purposes. I omit the code which allocates/frees memory on the host and device for conciseness.

Worth noting is that, for my purposes, I do not require pop to return the value. Instead I first peek and compare with some condition, and if it is true, pop and discard the data. I could also include a peek_tail; however, I do not currently require this. During a typical program run I am continuously pushing new data (for all i) while at the same time checking whether the oldest item (via peek) is ready to be discarded (again for all i). While this is happening, I periodically iterate over all items (j) within each buffer (i). Lastly, when compiling I assume that no pointers alias, and I compile with strict pointer-aliasing flags for both GCC and nvcc where appropriate.

Data structure (note power of 2 for capacity):

typedef struct
{
  // Parameters (here only host is shown, identical versions are placed in
  // constant memory of the device to access)
  int N; // multiple of warp size
  int capacity; // power of 2
  // Host data
  int* head_h;
  int* size_h;
  float* peek_head_h;
  float* data_h;
  // Device data
  int* head_d;
  int* size_d;
  float* peek_head_d;
  float* data_d;
} ring;


Host code (I will explain later why indexing to data is offset), with i = {0,...,N-1} and j = {0,...,capacity-1}, where j is used in a for loop.

Solution

Unfortunately I can't answer your initial question, as the analysis would take more time than I can afford; but the second question I can give you some bullet-points on:

  • From my experience the CUDA compiler is not as smart as the mainstream C/C++ compilers, and there are a lot of things that would be optimized out by more advanced compilers that aren't in CUDA, for example, ternary use vs. if/else blocks. Likewise, combining statements may have an effect, or not.

  • You've got a lot of branching there, the less you can do, the more synchronized your threads will be and the faster block processing will complete. Using the ternary operator may produce faster code, depending on the CUDA version.



  • I forget which subsets of the C++ language CUDA supports (it's not quite C or C++, more like C+) but use prefix increment if you can.



  • In other compilers such as GCC, combining ++ and if statements can sometimes have small performance benefits, where you'd think they'd be automatically optimized. I would expect CUDA to be no different, but experimentation might be required.



  • Remember the const rules: read from right to left, so "const int * const i" means "i is a constant pointer to a constant int". Again, I forget what level of const use CUDA supports. The more const, the more assumptions the compiler can make, generally speaking - which may or may not lead to better opportunity for optimization.



  • As Reinderien noted, forcing inline is best left to the compiler, with the exception of very small functions. Even then, using 'inline' might be better practice.



Bearing all that in mind, see below - none of the code logic should've changed. There are some additional suggestions I've included in the code. Apologies if I've assumed any particular feature which is not actually available in CUDA (it's been a while since I used it):

```
__device__ static void
Push_ring_GPU(int* const head, int* const size, float* const peek_head,
              float* const data, const int i, const int capacity,
              const int N, const float val)
{
  // use temp variables
  int headTemp = head[i];
  int sizeTemp = size[i];

  // wrap tail if needed and set value (capacity is a power of 2)
  data[(((headTemp + sizeTemp) & (capacity - 1)) * N) + i] = val;

  // increase size
  if (++sizeTemp >= capacity)
    {
#ifdef WITH_WARN
      printf("Ring full.\n");
#endif // WITH_WARN
      sizeTemp = capacity;

      // wrap headTemp if needed
      head[i] = (++headTemp == capacity) ? 0 : headTemp;
    }

  // if first, set peek head
  if (sizeTemp == 1)
    peek_head[i] = val;

  // update from temp variables
  size[i] = sizeTemp;
}


// NOTE: What is the logic of including headTemp and sizeTemp in the parameters?
// Why not use temporary values within the function, as in the function above,
// and then copy those values back? That will be faster.
// If you need their values afterwards you can always obtain them from head, size and i.
__device__ static void
Push_ring_Loading_GPU(int* const headTemp, int* const sizeTemp, int* const head,
                      int* const size, float* const peek_head,
                      float* const data, const int i,
                      const int capacity, const int N,
                      const float val)
{
  // use loading variables
  *headTemp = head[i];
  *sizeTemp = size[i];

  // wrap tail if needed
  data[(((*headTemp + *sizeTemp) & (capacity - 1)) * N) + i] = val;

  // increase size

  // NOTE: If you can find a function to reduce *sizeTemp to a MAX value of
  // capacity in CUDA, *use that* instead of creating branches!
  if (++(*sizeTemp) >= capacity)
    {
#ifdef WITH_WARN
      printf("Ring full.\n");
#endif // WITH_WARN
      *sizeTemp = capacity;

      // wrap headTemp if needed
      if (++(*headTemp) == capacity)
        *headTemp = 0;
    }

  // if first, set peek head
  if (*sizeTemp == 1)
    peek_head[i] = val;
}


__device__ static void
Pop_ring_GPU(int* const head, int* const size, float* const peek_head,
             float* const data, const int i, const int capacity,
             const int N)
{
  // use temporary variables
  int sizeTemp = size[i];
  int headTemp = head[i];

  // if empty
  if (sizeTemp == 0)
    return;

  // wrap head if needed
  if (++headTemp == capacity)
    headTemp = 0;

  // NOTE: an alternative you may want to test and see whether it yields better performance:
  // ++headTemp;
  // headTemp = (headTemp == capacity) ? 0 : headTemp;

  // update size and peek head
  if (--sizeTemp < 0)
    sizeTemp = 0;

  peek_head[i] = (sizeTemp > 0) ? data[(headTemp * N) + i] : -1.f;

  // update from temporary variables
  head[i] = headTemp;
  size[i] = sizeTemp;
}


__device__ static void
Pop_ring_Loaded_GPU(int* const headTemp, int* const sizeTemp,
                    float* const peek_head, float* const data,
                    const int i, const int capacity, const int N)
{
  // if empty
  if (*sizeTemp == 0)
    return;

  // wrap head if needed
  if (++(*headTemp) == capacity)
    *headTemp = 0;

  // update size and peek head
  if (--(*sizeTemp) < 0)
    *sizeTemp = 0;

  peek_head[i] = (*sizeTemp > 0) ? data[((*headTemp) * N) + i] : -1.f;
}
```

Context

StackExchange Code Review Q#114266, answer score: 8
