HiveBrain v1.2.0
Get Started
← Back to all entries
patterncMinor

Improving Trie Implementation

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
implementationimprovingtrie

Problem

This is an implementation of a trie, that I will be using on a code editor to parse keywords from a given programming language. I decided to use a trie when I asked this question, and the implementation seems to perform pretty well and without errors from valgrind/memcheck.

I want to make sure I'm squeezing every bit of performance out of this thing, though. So I want to know if there's anything more I can do to improve performance (I guess that makes this a performance-related review). Are there any performance tips?

While I'm not specifically looking for a review of the coding style, if I'm doing anything weird please point it out.

Implementation:

Header:

// trie.h

#include 

#define TRIE_SIZE 256 // would be different for unicode

struct trie_node {
    int class; /* acts as a class or id of the word
    * and used for checking for a match
    */

    struct trie_node * nodes[TRIE_SIZE];
};

struct trie {
    struct trie_node head, * current;
};

int trie_init (struct trie * t);
int trie_adds (struct trie_node * head, char * str, int class);
int trie_freenode (struct trie_node * node);
int trie_free (struct trie * t);
int trie_pass (struct trie * t, unsigned char c);
int trie_reset (struct trie * t);


Implementation:

```
#include "trie.h"
#include

int trie_init (struct trie * t){

t->current = &t->head;

for (int i = 0; i head.nodes[i] = 0;
}

t->head.class = 0;

return 0;
}

int trie_adds (struct trie_node head, char str, int class){

if (str[0]){

int num = str[0];

if (!head->nodes[num]){ // create new node

head->nodes[num] = calloc (1, sizeof (struct trie_node));

if (!head->nodes[num]) return -1;
}

if (str[1]) return trie_adds (head->nodes[num], &str[1], class);
else head->nodes[num]->class = class;
}

return 0;
}

// used in trie_free
static int trie_freenode (struct trie_node * node){

if (!node || !node->nodes){
return -1;

Solution

It's encouraging that your test indicates that the trie execution time is (almost) independent of the number of keywords.

I think you'll find that isqualifier is a significant contributor to the execution time, and gets worse as the complexity of the list of qualifying characters grows. For anything more complicated than [A-Za-z], a look-up table will be just as good. Only 256 bytes are needed; if that bothers you, you can actually make it work (for certain definitions of qualifying sets) with a couple of 64-bit constants and some bit-whacking:

// Matches [0-9A-Za-z_]
static const uint64_t low_64 =  0x03FF000000000000UL;
static const uint64_t high_64 = 0x07FFFFFE87FFFFFEUL;
static inline bool qualifies(uint8_t c) {
  uint8_t bit7 = (c >> 6);
  return ((c >> 7) ^ 1) & (
             ((low_64  >> (c & 0x7F)) & (bit7 ^ 1) |
             ((high_64 >> (c & 0x7F)) & bit7));
}


There are lots of variations on the above theme; you'd need to profile. But I wouldn't actually do that with tries, because you can take advantage of the trie itself to discriminate.

The trie is really a simple state machine. With a small modification, we can actually build the precise state machine. Here it is in a simple version, recognizing a, on and one. Note that all the keywords end with a transition to "Accept"; this is really a transition to "Start" which also indicates that the keyword was accepted. (Code below.)

Start:  a                --> State1
        o                --> State2
        other qualifying --> Skip
        non_qualifying   --> Start

# Come here after matching `a`
State1: qualifying       --> Skip
        non_qualifying   --> Accept

# Come here after matching `o`
State2: n                --> State3
        other qualifying --> Skip
        non_qualifying   --> Start

# Come here after matching `on`
State3: e                --> State4
        other qualifying --> Skip
        non_qualifying   --> Accept

# Come here after matching `one`
State4: qualifying       --> Skip
        non_qualifying   --> Accept

# Come here after matching any other word.
Skip:   qualifying       --> Skip
        non_qualifying   --> Start


Building this state machine is almost exactly the same as building the trie. The following is based on your trie code:

struct trie_node {
  int class;
  struct trie_node* nodes[TRIE_SIZE];
};

struct trie {
  trie_node skip;
  trie_node start;
  trie_node* current;
}

int trie_init(struct trie* t) {
  for (int c = 0; c skip[c] = isqualifying(c) ? &t->skip
                                 : &t->start;
  }
  // Initially, no keywords
  t->start = t->skip;
  t->current = &t->start;
}

int trie_adds(struct trie* t, const char* s, int class) {
  struct trie_node* current = t->start;
  for (; *s; ++s) {
    uint8_t ch = *s;
    if (current[ch] == &t->skip) {
      current[ch] = malloc(sizeof *current[ch]);
      *current[ch] = t->skip;
    }
    current = current[ch];
  }
  current->class = class;
}

static inline int trie_step(struct trie* t, int ch) {
  int class = t->current->class;
  t->current = t->current.nodes[ch];
  return t->current == &t->start ? class : 0;
}


(The above code is completely untested. It probably doesn't even compile.)

With the above, you no longer need to call is_qualifying on every character. You just pass each character in turn -- including the terminating NUL -- through trie_step, and any time it returns a non-zero value, you've found a keyword.

Code Snippets

// Matches [0-9A-Za-z_]
static const uint64_t low_64 =  0x03FF000000000000UL;
static const uint64_t high_64 = 0x07FFFFFE87FFFFFEUL;
static inline bool qualifies(uint8_t c) {
  uint8_t bit7 = (c >> 6);
  return ((c >> 7) ^ 1) & (
             ((low_64  >> (c & 0x7F)) & (bit7 ^ 1) |
             ((high_64 >> (c & 0x7F)) & bit7));
}
Start:  a                --> State1
        o                --> State2
        other qualifying --> Skip
        non_qualifying   --> Start

# Come here after matching `a`
State1: qualifying       --> Skip
        non_qualifying   --> Accept

# Come here after matching `o`
State2: n                --> State3
        other qualifying --> Skip
        non_qualifying   --> Start

# Come here after matching `on`
State3: e                --> State4
        other qualifying --> Skip
        non_qualifying   --> Accept

# Come here after matching `one`
State4: qualifying       --> Skip
        non_qualifying   --> Accept

# Come here after matching any other word.
Skip:   qualifying       --> Skip
        non_qualifying   --> Start
struct trie_node {
  int class;
  struct trie_node* nodes[TRIE_SIZE];
};

struct trie {
  trie_node skip;
  trie_node start;
  trie_node* current;
}

int trie_init(struct trie* t) {
  for (int c = 0; c < TRIE_SIZE; ++c) {
    t->skip[c] = isqualifying(c) ? &t->skip
                                 : &t->start;
  }
  // Initially, no keywords
  t->start = t->skip;
  t->current = &t->start;
}

int trie_adds(struct trie* t, const char* s, int class) {
  struct trie_node* current = t->start;
  for (; *s; ++s) {
    uint8_t ch = *s;
    if (current[ch] == &t->skip) {
      current[ch] = malloc(sizeof *current[ch]);
      *current[ch] = t->skip;
    }
    current = current[ch];
  }
  current->class = class;
}

static inline int trie_step(struct trie* t, int ch) {
  int class = t->current->class;
  t->current = t->current.nodes[ch];
  return t->current == &t->start ? class : 0;
}

Context

StackExchange Code Review Q#30469, answer score: 2

Revisions (0)

No revisions yet.