HiveBrain v1.2.0
Get Started
← Back to all entries
patterncMinor

Spell checker using trie

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
spellcheckerusingtrie

Problem

This program implements spell checking by loading a given dictionary file into a trie structure and comparing against a given text file: Usage: speller [dictionary] text. When run, each misspelled word is printed along with benchmarks:

This was part of a problem set with a given mess of skeleton code (speller.c), which I've done my best to make presentable for code review here. The goal was to get the best timing possible (see reqs below), so we've been instructed to use getrusage to gather benchmarks. Please don't focus much critique on speller.c as it is a given piece of the problem set.

Other requirements were as follows:

  • You may assume that any dictionary passed to your program will be structured exactly like ours, lexicographically sorted from top to bottom with one word per line, each of which ends with \n. You may also assume that dictionary will contain at least one word, that no word will be longer than LENGTH (a constant defined in dictionary.h) characters, that no word will appear more than once, and that each word will contain only lowercase alphabetical characters and possibly apostrophes.



  • You may assume that check will only be passed strings with alphabetical characters and/or apostrophes.



  • Your implementation of check must be case-insensitive.



One area I'm specifically interested in review is performance. I ran gperf and callgrind to profile my code and made some changes based on analysis; however, it seems that load_dict might be underachieving when I compare to staff solution benchmarks (run on same environment). Does anything stick out that could be improved to get better performance?

speller.c

```
#include "dbg.h"
#include "dictionary.h"

#include
#include
#include
#include

#undef difference_in_seconds
#undef getrusage

#define DEFAULT_DICT "dictionaries/large"

typedef struct {
struct rusage before;
struct rusage after;
double time_load;
double time_check;
double time_size;
double time_unload;
u

Solution

Buffer overflow

This line is vulnerable to buffer overflow:

while (fgets(word, sizeof(word)+1, in)) {


I'm not sure why you added one here, but it makes it possible to overflow the buffer by one character. According to the problem statement, the dictionary will not have any words longer than the buffer, so your code is safe as long as you use the provided dictionary. But for an arbitrary input file, you should remove that +1 from that line.
Uninitialized variables

When I ran your program, I got this as part of the output:

WORDS MISSPELLED:   1629196271


This was due to the variable bm_data being uninitialized in main, and none of the other functions zeroing it either. Since that code was not written by you, you aren't really to blame for that.
Performance in loading

After playing with your code a bit, I found that the loading performance was slow mostly due to lack of buffering. That is to say, calling fgets() to read each word is slower than reading into a big buffer and then parsing through the buffer. When I used fread() to read into a buffer, the dictionary loaded in 0.06 seconds compared to 0.11 seconds when using fgets(). I tried different buffer sizes and I got pretty much the same results with buffer sizes ranging from 64 bytes to 64 KB. Here is code I used:

#define BUFSIZE    1024

bool Trie_insert_file(TrieNode **root, FILE *in)
{
    check(root, "Root is null.");
    char      buffer[BUFSIZE];
    TrieNode *cur = *root;

    while(1) {
        size_t bufLen = fread(buffer, 1, sizeof(buffer), in);
        if (bufLen end_of_word = true;
                cur = *root;
                continue;
            }
            size_t index = get_index(c);

            if (cur->children[index] == NULL)
                cur->children[index] = Trie_create();
            cur = cur->children[index];
        }
    }
error:
    return false;
}

Code Snippets

while (fgets(word, sizeof(word)+1, in)) {
WORDS MISSPELLED:   1629196271
#define BUFSIZE    1024

bool Trie_insert_file(TrieNode **root, FILE *in)
{
    check(root, "Root is null.");
    char      buffer[BUFSIZE];
    TrieNode *cur = *root;

    while(1) {
        size_t bufLen = fread(buffer, 1, sizeof(buffer), in);
        if (bufLen <= 0)
            return true;
        for (size_t i = 0; i < bufLen; i++) {
            char c = buffer[i];
            if (c == '\n') {
                trie_word_count++;
                cur->end_of_word = true;
                cur = *root;
                continue;
            }
            size_t index = get_index(c);

            if (cur->children[index] == NULL)
                cur->children[index] = Trie_create();
            cur = cur->children[index];
        }
    }
error:
    return false;
}

Context

StackExchange Code Review Q#138026, answer score: 8

Revisions (0)

No revisions yet.