HiveBrain v1.2.0
Get Started
← Back to all entries
patterncppMinor

Optimising a custom tree data structure

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
optimisingcustomstructuredatatree

Problem

I have the following 'custom' tree data structure. This concept was to be thought of as, if given ("abb", 20) and ("abc", 30), then

Tree['a']['b']['b'] == 20


Tree['a']['b']['c'] == 30


Or

a
 \
  b
 / \
b   c


With b == 30, and c == 20.

Therefore taking up very limited space, and a very fast search.
However, as depth increases, the number of bytes for a fully explored tree was exponential. In the above case, 4 * 256 bytes were required for valid indexing to occur. This was not good.

It now works in the following way; for each branch, compare its value against the n character; i.e. introducing a worst case O(256) character comparison for each depth. This is not ideal, and what I want to attempt to change.

The other question was (and still is), what is it? I couldn't quite classify it as anything but a Tree; nor relate it to any std:: containers for comparison. It is very good at what it does though, out-performing a std::map with small strings and very large strings (sizes ~20, ~500 respectively); past 600, the data structure begins to get too heavy.

Obviously, this code is a bit sketchy in how it handles the templates at the moment. A number of things could go wrong, but my interest (other than classification) is on improving this data structure.

```
template
class Tree {
unsigned int depth;
Tree* parent;

std::vector branches;
public:
K key;
V val;

int n_trees;
Tree(K key = 0, int depth = 0, Tree* parent = 0) : depth(depth), parent(parent), key(key) {
val = 0;
n_trees = 0;
inc();
}

void inc() {
if (parent) parent->inc();
n_trees++;
}

void add(const K* search, const unsigned int length, const V value) {
const int index = search[depth];

if (length <= depth) {
val = value;
return;
}

for (unsigned int i = 0; i < branches.size(); ++i) {
if (branches[i].key == index) {

Solution

Why do you have m_parent? I don't think there are any reasonable operations on a trie that require a parent pointer in each node, and removing it can reduce the space used a lot.

I don't love using a depth argument in add. I'd use the following instead:

void add(const std::string& key, const T& value) {
  return add(key.data(), key.size(), value);
}
void add(const char *key, const T& value) {
  return add(key, std::strlen(key), value);
}
void add(const char* key, std::size_t len, const T& value);  // impl below


I suggest that you keep the children sorted by m_key. Write a comparison functor like so:

struct KeyLessThan {
  bool operator()(const PrefixTree& l, const PrefixTree& r) const {
    return l.m_key < r.m_key;
  }
  bool operator()(const PrefixTree& tree, char c) const {
    return l.m_key < c;
  }
};


Now you can rewrite add like this (using the signature above):

void PrefixTree::add(const char* key, std::size_t len, const T& value) {
  assert(len >= 0);
  if (len == 0) {
    m_value = value;
    return;
  }
  auto it = std::lower_bound(begin(m_branches), end(m_branches),
                             *key, KeyLessThan());
  if (it == end || it->m_key != *key) {
    it = m_branches.emplace(it, *key);
  }
  it->add(key + 1, len - 1, value);
}


The size of m_branches is bounded to be quite small (256 elements in theory; in practice if you're doing English words only it'll be 26 or 52 at the root and much smaller in interior nodes), so this should be quite fast, and it'll make find faster.

Speaking of m_branches, m_children would be a more usual name for it.

It's not really clear that you want find to return a T*; it's odd to give out handles to internal memory in a data structure like this. I would make the signature one of

const T* find(const std::string& key) const
std::pair find(const std::string& key) const


or

std::pair find(const std::string& key) const


(if you have access to a C++14 implementation, even better is

std::optional find(const std::string& key) const


Whichever you choose, use the same overloads as with add for const char *, etc.

It is very weird to me to see an #include that's not at the top of the file, and to see printfs in the middle of a C++ file.

Code Snippets

void add(const std::string& key, const T& value) {
  return add(key.data(), key.size(), value);
}
void add(const char *key, const T& value) {
  return add(key, std::strlen(key), value);
}
void add(const char* key, std::size_t len, const T& value);  // impl below
struct KeyLessThan {
  bool operator()(const PrefixTree& l, const PrefixTree& r) const {
    return l.m_key < r.m_key;
  }
  bool operator()(const PrefixTree& tree, char c) const {
    return l.m_key < c;
  }
};
void PrefixTree::add(const char* key, std::size_t len, const T& value) {
  assert(len >= 0);
  if (len == 0) {
    m_value = value;
    return;
  }
  auto it = std::lower_bound(begin(m_branches), end(m_branches),
                             *key, KeyLessThan());
  if (it == end || it->m_key != *key) {
    it = m_branches.emplace(it, *key);
  }
  it->add(key + 1, len - 1, value);
}
const T* find(const std::string& key) const
std::pair<bool, const T&> find(const std::string& key) const
std::pair<bool, T> find(const std::string& key) const

Context

StackExchange Code Review Q#1322, answer score: 7

Revisions (0)

No revisions yet.