patternModerate

Why are comparisons so expensive on a GPU?

Submitted by: @import:stackexchange-cs·Mar 10, 2026·

Viewed 0 times

computer-architecture cs parallel-computing stackoverflow

whygpuareexpensivecomparisons

Problem

While trying to improve the performance of my collision detection class, I found that ~80% of the time spent at the gpu, it spent on if/else conditions just trying to figure out the bounds for the buckets it should loop through.

More precisely:

-
each thread gets an ID, by that ID it fetches its triangle from the memory (3 integers each) and by those 3 it fetches its vertices(3 floats each).

-
Then it transforms the vertices into integer grid points (currently 8x8x8) and transforms them into the triangle bounds on that grid

-
To transform the 3 points into bounds, it finds the min/max of each dimension among each of the points

Since the programming language I am using is missing a minmax intrinsic, I made one myself, looks like this:

procedure MinMax(a, b, c):
   local min, max

   if a > b:
      max = a
      min = b
   else:
      max = b
      min = a
   if c > max:
      max = c
   else:
      if c < min:
         min = c

   return (min, max)

So on the average it should be 2.5 3 3 = 22.5 comparisons which ends up eating up way more time than the actual triangle - edge intersection tests (around 100 * 11-50 instructions).

In fact, I found that pre-calculating the required buckets on the cpu (single threaded, no vectorization), stacking them in a gpu view along with bucket definition and making the gpu do ~4 extra reads per thread was 6 times faster than trying to figure out the bounds on the spot. (note that they get recalculated before every execution since I'm dealing with dynamic meshes)

So why is the comparison so horrendously slow on a gpu?

Solution

GPUs are SIMD architectures. In SIMD architectures every instruction needs to be executed for every element that you process. (There's an exception to this rule, but it rarely helps).

So in your MinMax routine not only does every call need to fetch all three branch instructions, (even if on average only 2.5 are evaluated), but every assignment statement takes up a cycle as well (even if it doesn't actually get "executed").

This problem is sometimes called thread divergence. If your machine has something like 32 SIMD execution lanes, it will still have only a single fetch unit. (Here the term "thread" basically means "SIMD execution lane".) So internally each SIMD execution lane has a "I'm enabled/disabled" bit, and the branches actually just manipulate that bit. (The exception is that at the point where every SIMD lane becomes disabled, the fetch unit will generally jump directly to the "else" clause.)

So in your code, every SIMD execution lane is doing:

compare (a > b)
assign (max = a if a>b)
assign (min = b if a>b)
assign (max = b if not(a>b))
assign (min = a if not(a>b))
compare (c > max)
assign (max = c if c>max)
compare (c max))
assign (min = c if not(c>max) and c<min)

It may be the case that on some GPUs this conversion of conditionals to predication is slower if the GPU is doing it itself. As pointed out by @PaulA.Clayton, if your programming language and architecture has a predicated conditional move operation (especially one of the form if (c) x = y else x = z) you might be able to do better. (But probably not much better).

Also, placing the c max is unnecessary. It certainly isn't saving you anything, and (given that the GPU has to automatically convert it to predication) may actually be hurting to have it nested in two different conditionals.

Code Snippets

compare (a > b)
assign (max = a if a>b)
assign (min = b if a>b)
assign (max = b if not(a>b))
assign (min = a if not(a>b))
compare (c > max)
assign (max = c if c>max)
compare (c < min if not(c>max))
assign (min = c if not(c>max) and c<min)

Context

StackExchange Computer Science Q#39871, answer score: 10

Revisions (0)

No revisions yet.