Octree code performance
Tags: code, octree, performance
Problem
I'm using a simple octree in my program and wanted to know if my implementation could be faster. It is part of this program.
Octree.H:

```
#ifndef OCTREE_H
#define OCTREE_H

#include "Rendera.H"

class Octree
{
public:
  struct node_t
  {
    float value;
    struct node_t *child[8];
  };

  Octree();
  virtual ~Octree();

  void clear(struct node_t *);
  void write(const int &, const int &, const int &, const float &);
  void writePath(const int &, const int &, const int &, const float &);
  float read(const int &, const int &, const int &);

  struct node_t *root;
};

#endif
```

Octree.cxx:

```
#include "Octree.H"

Octree::Octree()
{
  root = new node_t;
  root->value = 0;

  for(int i = 0; i < 8; i++)
    root->child[i] = 0;
}

Octree::~Octree()
{
  clear(root);
}

void Octree::clear(struct node_t *node)
{
  for(int i = 0; i < 8; i++)
    if(node->child[i])
      clear(node->child[i]);

  delete node;
}

void Octree::write(const int &r, const int &g, const int &b,
                   const float &value)
{
  struct node_t *node = root;

  for(int i = 7; i >= 0; i--)
  {
    const int index = ((r >> i) & 1) |
                      (((g >> i) & 1) << 1) |
                      (((b >> i) & 1) << 2);

    if(!node->child[index])
    {
      node->child[index] = new node_t;
      node = node->child[index];
      node->value = 0;

      for(int j = 0; j < 8; j++)
        node->child[j] = 0;
    }
    else
    {
      node = node->child[index];
    }
  }

  node->value = value;
}

// Sets entire path to value.
// This allows the octree to be used in a different way.
// (Needed for palette lookup.)
void Octree::writePath(const int &r, const int &g, const int &b,
                       const float &value)
{
  struct node_t *node = root;

  for(int i = 7; i >= 0; i--)
  {
    const int index = ((r >> i) & 1) |
                      (((g >> i) & 1) << 1) |
                      (((b >> i) & 1) << 2);

    if(!node->child[index])
    {
      node->child[index] = new node_t;
      node = node->child[index];
      node->value = value;

      for(int j = 0; j < 8; j++)
        node->child[j] = 0;
    }
    else
    {
      node = node->child[index];
    }
  }
}

float Octree::read(const int &r, const int &g, const int &b)
{
  struct node_t *node = root;

  for(int i = 7; i >= 0; i--)
  {
    const int index = ((r >> i) & 1) |
                      (((g >> i) & 1) << 1) |
                      (((b >> i) & 1) << 2);

    if(!node->child[index])
      return 0;

    node = node->child[index];
  }

  return node->value;
}
```
Solution
Performance
(Without any profiling data, I'm guessing blindly here)
You have a linked data structure (a tree). Traversing a tree has poor cache locality, and this is where your performance will likely suffer: at worst, every node access can be a cache miss. But please do measure this first; if you see that most of your time is spent getting the next node, or close to a line of code that does, you have cache issues.
One thing you can do to try to improve your cache performance is to encode your tree into a linear sequence (preferably a `std::vector`). Then measure to see if it actually is faster.

As for the encoding, we'll refer to a level of the tree as `k`, with the root at `k = 0`. This means that level `k` has at most `8^k` nodes. If we refer to each node in a level by its grid coordinate `(x, y, z)`, then the node `n(k, x, y, z)` (at level `k` and position `(x, y, z)`) will have its child nodes at:

- `n(k+1, 2*x, 2*y, 2*z)`
- ... (all combinations of +1 to the x, y, z) ...
- `n(k+1, 2*x + 1, 2*y + 1, 2*z + 1)`

We can order all the nodes into a linear random access sequence `v` (a `std::vector`) like this:

```
level_base = (ipow(8, k) - 1) / 7;  // total node count of levels 0 .. k-1
level_width = ipow(2, k);
grid_index = x + level_width * y + level_width * level_width * z;
n(k, x, y, z) = v[level_base + grid_index];
```

Note that the width of the grid at level `k` is `2^k` (example: 1, 2, 4, 8). Here `ipow` is an optimized integer power function. If `ipow` proves to be a bottleneck, you can use a LUT for the powers of 8 and 2.

Benefits
- The benefit of doing this is that breadth-first traversal/search becomes a linear access pattern, which your CPU's prefetcher will love you for. This is very fast.
- If you know beforehand which node you want, `(k, x, y, z)`, you can get it directly.
- You can get all the neighbours easily.
- Given your `(r, g, b)` triplet, you may be able to just directly compute the index of the destination node, write the value, and you're done. No iteration to reach the right node.
- Less memory fragmentation.
- For dense trees this actually takes less memory, as you don't have the heap bookkeeping overhead etc.
- You do not need the `child[]` array, making your data structure smaller, so more nodes fit in each cache line.
Drawbacks
- The `std::vector` is dense; if your tree is very sparse, you may end up allocating a lot more nodes than you need. On the other hand, if you have the memory, it might be acceptable.
- It's a more complex indexing scheme, which can result in bugs or more complex code if proper caution isn't taken.
- If you need to distinguish between empty and non-empty nodes (as opposed to treating `value == 0` as empty), you need to introduce a `has_data` flag. But the nodes are still smaller than with the child array.
- Not as resilient against off-by-one type bugs. In a linked-tree structure, if you somehow manage to write outside a node, you'll write into the canary values on the heap or into the heap bookkeeping, and under debug mode your runtime should let you know you have an issue. With this approach you will silently corrupt nearby nodes instead. Not necessarily likely, but something to be aware of.
Code Snippets

```
level_base = (ipow(8, k) - 1) / 7;
level_width = ipow(2, k);
grid_index = x + level_width * y + level_width * level_width * z;
n(k, x, y, z) = v[level_base + grid_index];
```

Context
StackExchange Code Review Q#70563, answer score: 8
Revisions (0)
No revisions yet.