patterncppMinor
High performance triangle - axis aligned bounding box clipping
Viewed 0 times
clippingaxisboundinghightrianglealignedperformancebox
Problem
Implementation of a Robust (i.e. with a finite plane thickness) Sutherland–Hodgman algorithm for clipping polygons against an axis-aligned bounding box. I use the following code only for clipping triangles. So the clipped polygon can consist only of at most 9 vertices. The third-party code of
The C++ code works fine (Python version available at) and fast for clipping some triangles. Unfortunately, I typically need to perform 1 billion of these operations. Profiling learns most of the time is spent on these clipping operations (which is not the core of my apllication). So my question: is it possible to make the code more performance efficient (e.g. eliminating most of the branching)?
```
#pragma once
#define PLANE_THICKNESS_EPSILON 0.00001f
// Classify a given vertex against an axis-aligned plane
//
// @param sign min/max clipping plane
// @param axis axis of the clipping plane
// @param c_v one vertex of the clipping plane
// @param p_v vertex to classify
//
// @return classification of the vertex
template
inline int8_t Classify(int8_t sign, uint8_t axis, const Point3 &c_v, const Point3d &p_v) {
const double d = sign * (p_v[axis] - c_v[axis]);
if (d > PLANE_THICKNESS_EPSILON) return 1;
else if (d
inline void Clip3D_plane(Point3d p_vs, uint8_t nb_p_vs, int8_t sign, uint8_t axis, const Point3 &c_v) {
uint8_t nb = (*nb_p_vs);
if (nb == 0) return;
else if (nb == 1) {
*nb_p_vs = 0;
return;
}
Point3d new_p_vs[POINT_BUFFER_SIZE];
uint8_t k = 0;
bool b = true; // polygon is fully located on clipping plane
Point3d p_v1 = p_vs[nb-1];
int8_t d1 = Classify(sign, axis, c_v, p_v1);
for (uint8_t j = 0; j (sign, axis, c_v, p_v2);
if (d2 0) {
const double alpha = (p_v2[axis] - c_v[axis]) / (p_v2[axis] - p_v1[axis]);
new_p_vs[k++] = Lerp(alpha, p_v2
Point, BBox and Lerp is not given, since this is rather trivial.The C++ code works fine (Python version available at) and fast for clipping some triangles. Unfortunately, I typically need to perform 1 billion of these operations. Profiling learns most of the time is spent on these clipping operations (which is not the core of my apllication). So my question: is it possible to make the code more performance efficient (e.g. eliminating most of the branching)?
```
#pragma once
#define PLANE_THICKNESS_EPSILON 0.00001f
// Classify a given vertex against an axis-aligned plane
//
// @param sign min/max clipping plane
// @param axis axis of the clipping plane
// @param c_v one vertex of the clipping plane
// @param p_v vertex to classify
//
// @return classification of the vertex
template
inline int8_t Classify(int8_t sign, uint8_t axis, const Point3 &c_v, const Point3d &p_v) {
const double d = sign * (p_v[axis] - c_v[axis]);
if (d > PLANE_THICKNESS_EPSILON) return 1;
else if (d
inline void Clip3D_plane(Point3d p_vs, uint8_t nb_p_vs, int8_t sign, uint8_t axis, const Point3 &c_v) {
uint8_t nb = (*nb_p_vs);
if (nb == 0) return;
else if (nb == 1) {
*nb_p_vs = 0;
return;
}
Point3d new_p_vs[POINT_BUFFER_SIZE];
uint8_t k = 0;
bool b = true; // polygon is fully located on clipping plane
Point3d p_v1 = p_vs[nb-1];
int8_t d1 = Classify(sign, axis, c_v, p_v1);
for (uint8_t j = 0; j (sign, axis, c_v, p_v2);
if (d2 0) {
const double alpha = (p_v2[axis] - c_v[axis]) / (p_v2[axis] - p_v1[axis]);
new_p_vs[k++] = Lerp(alpha, p_v2
Solution
As you say that this is the bottle neck and you mention big numbers I'm going to put my performance goggles on.
Avoid conversions
You are mixing
Also your
Your
Use template value parameters
This section is speculative, you compiler might be good enough to do the constant propagation and optimization any way, or it might not. You have to profile to find out, in either case I will show you the technique and you can test.
You can pass values to template (non class) arguments and specialize on these values so that the compiler can more easily generate optimized code if the values are known during compile-time (they are here).
For example:
could be something like this:
Note that I also showed you how to get rid of the branches in this function. You'll have to benchmark and see if there is any use from making
Consolidate branches
Note how the branches for
You can re-organize this like so:
Again YMMV the compiler may have been able to do the above transform for you, or it may not be any faster at all. It is only my gut feeling telling me that the above might be slightly better. I have not profiled and you should test to see if it gives any improvement at all.
Better formulation for Lerp
The current code is:
If you instead formulate it like this you have one multiplication less:
Better implementation of
Use a anonymous union and struct to better implement
That's all I have time for right now, I have to go to bed. Hope that helps!
Avoid conversions
You are mixing
double and float precision numbers with Point3d (which is double) and Point with T=float. Conversion between these two consumes some time. Pick either double or float and use it throughout, the performance difference between them should be minimal.Also your
sign parameter is int8_t and you multiply it with double values this causes an integer to floating point conversion which can be somewhat costly if done in a hot loop. At the very least sign should be double. But I will show you how you can avoid the multiplication all together further down.Your
axis parameter should be of type size_t. Using uint8_t is not in any way faster because it is smaller, the argument is likely passed in registers anyway where the size doesn't matter. At worst you're causing conversions/type promotions.Use template value parameters
This section is speculative, you compiler might be good enough to do the constant propagation and optimization any way, or it might not. You have to profile to find out, in either case I will show you the technique and you can test.
You can pass values to template (non class) arguments and specialize on these values so that the compiler can more easily generate optimized code if the values are known during compile-time (they are here).
For example:
template
inline int8_t Classify(int8_t sign, uint8_t axis, const Point3 &c_v, const Point3d &p_v) {
const double d = sign * (p_v[axis] - c_v[axis]);
if (d > PLANE_THICKNESS_EPSILON) return 1;
else if (d < -PLANE_THICKNESS_EPSILON) return -1;
else return 0;
}could be something like this:
template
inline int Classify(const Point3& c_v, const Point3& p_v) {
// because "positive_sign" is a compile time constant, the compiler can
// remove the branch here.
const double d = positive_sign ? (p_v[axis] - c_v[axis]) :
(c_v[axis] - p_v[axis]);
auto c = ((d > PLANE_THICKNESS_EPSILON)<<1) |
(d < -PLANE_THICKNESS_EPSILON);
static int v[3] = {0, -1, 1};
return v[c];
}Note that I also showed you how to get rid of the branches in this function. You'll have to benchmark and see if there is any use from making
axis a template value. I'm suspecting it wont make a difference.Consolidate branches
Note how the branches for
d2 0 have almost the same code?if (d2 0) { // NOTE: Differs here
const double alpha = (p_v2[axis] - c_v[axis]) / (p_v2[axis] - p_v1[axis]);
new_p_vs[k++] = Lerp(alpha, p_v2, p_v1);
}
else if (d1 == 0 && (k == 0 || new_p_vs[k-1] != p_v1))
new_p_vs[k++] = p_v1;
}
else if (d2 > 0) {
b = false;
if (d1 < 0) { // NOTE: And here
const double alpha = (p_v2[axis] - c_v[axis]) / (p_v2[axis] - p_v1[axis]);
new_p_vs[k++] = Lerp(alpha, p_v2, p_v1);
}
else if (d1 == 0 && (k == 0 || new_p_vs[k-1] != p_v1))
new_p_vs[k++] = p_v1;
new_p_vs[k++] = p_v2; // Note: And here
}
else {
if (d1 != 0)
new_p_vs[k++] = p_v2;
}You can re-organize this like so:
if(d2 == 0){
if (d1 != 0)
new_p_vs[k++] = p_v2;
}else{ // d2 0
b = false;
// test for d1 == 0 first, because if that is true,
// none of the expressions in the else statement can be true.
if (d1 == 0){
// You might want to consider special casing k == 0 outside of the
// loop so you can avoid the check for k==0 in here which is only
// true once.
if(k == 0 || new_p_vs[k-1] != p_v1)
new_p_vs[k++] = p_v1;
}
else if ((d2 0) || (d2 > 0 && d1 0)
new_p_vs[k++] = p_v2;
}Again YMMV the compiler may have been able to do the above transform for you, or it may not be any faster at all. It is only my gut feeling telling me that the above might be slightly better. I have not profiled and you should test to see if it gives any improvement at all.
Better formulation for Lerp
The current code is:
template
Point3 Lerp(double t, const Point3 &p0, const Point3 &p1) {
return (1 - t) * p0 + t * p1;
}If you instead formulate it like this you have one multiplication less:
return p0 + t*(p1-p0);Better implementation of
Point3::operator[]Use a anonymous union and struct to better implement
operator []. Like so:T operator[](int i) const {
return raw[i];
}
T &operator[](int i) {
return raw[i];
}
union{
struct{
T x, y, z;
};
T raw[3];
};
};That's all I have time for right now, I have to go to bed. Hope that helps!
Code Snippets
template <typename T>
inline int8_t Classify(int8_t sign, uint8_t axis, const Point3<T> &c_v, const Point3d &p_v) {
const double d = sign * (p_v[axis] - c_v[axis]);
if (d > PLANE_THICKNESS_EPSILON) return 1;
else if (d < -PLANE_THICKNESS_EPSILON) return -1;
else return 0;
}template<typename T, bool positive_sign, int axis>
inline int Classify(const Point3<T>& c_v, const Point3<T>& p_v) {
// because "positive_sign" is a compile time constant, the compiler can
// remove the branch here.
const double d = positive_sign ? (p_v[axis] - c_v[axis]) :
(c_v[axis] - p_v[axis]);
auto c = ((d > PLANE_THICKNESS_EPSILON)<<1) |
(d < -PLANE_THICKNESS_EPSILON);
static int v[3] = {0, -1, 1};
return v[c];
}if (d2 < 0) {
b = false;
if (d1 > 0) { // NOTE: Differs here
const double alpha = (p_v2[axis] - c_v[axis]) / (p_v2[axis] - p_v1[axis]);
new_p_vs[k++] = Lerp(alpha, p_v2, p_v1);
}
else if (d1 == 0 && (k == 0 || new_p_vs[k-1] != p_v1))
new_p_vs[k++] = p_v1;
}
else if (d2 > 0) {
b = false;
if (d1 < 0) { // NOTE: And here
const double alpha = (p_v2[axis] - c_v[axis]) / (p_v2[axis] - p_v1[axis]);
new_p_vs[k++] = Lerp(alpha, p_v2, p_v1);
}
else if (d1 == 0 && (k == 0 || new_p_vs[k-1] != p_v1))
new_p_vs[k++] = p_v1;
new_p_vs[k++] = p_v2; // Note: And here
}
else {
if (d1 != 0)
new_p_vs[k++] = p_v2;
}if(d2 == 0){
if (d1 != 0)
new_p_vs[k++] = p_v2;
}else{ // d2 < 0 OR d2 > 0
b = false;
// test for d1 == 0 first, because if that is true,
// none of the expressions in the else statement can be true.
if (d1 == 0){
// You might want to consider special casing k == 0 outside of the
// loop so you can avoid the check for k==0 in here which is only
// true once.
if(k == 0 || new_p_vs[k-1] != p_v1)
new_p_vs[k++] = p_v1;
}
else if ((d2 < 0 && d1 > 0) || (d2 > 0 && d1 < 0)) {
auto alpha = (p_v2[axis] - c_v[axis]) / (p_v2[axis] - p_v1[axis]);
new_p_vs[k++] = Lerp(alpha, p_v2, p_v1);
}
if(d2 > 0)
new_p_vs[k++] = p_v2;
}template <typename T>
Point3<T> Lerp(double t, const Point3<T> &p0, const Point3<T> &p1) {
return (1 - t) * p0 + t * p1;
}Context
StackExchange Code Review Q#131852, answer score: 4
Revisions (0)
No revisions yet.