patterncMinor
Spell checker using trie
Viewed 0 times
spellcheckerusingtrie
Problem
This program implements spell checking by loading a given dictionary file into a trie structure and comparing against a given text file:
This was part of a problem set with a given mess of skeleton code (speller.c), which I've done my best to make presentable for code review here. The goal was to get the best timing possible (see reqs below), so we've been instructed to use
Other requirements were as follows:
One area I'm specifically interested in review is performance. I ran gperf and callgrind to profile my code and made some changes based on analysis; however, it seems that
speller.c
```
#include "dbg.h"
#include "dictionary.h"
#include
#include
#include
#include
#undef difference_in_seconds
#undef getrusage
#define DEFAULT_DICT "dictionaries/large"
typedef struct {
struct rusage before;
struct rusage after;
double time_load;
double time_check;
double time_size;
double time_unload;
u
Usage: speller [dictionary] text. When run, each misspelled word is printed along with benchmarks:This was part of a problem set with a given mess of skeleton code (speller.c), which I've done my best to make presentable for code review here. The goal was to get the best timing possible (see reqs below), so we've been instructed to use
getrusage to gather benchmarks. Please don't focus much critique on speller.c as it is a given piece of the problem set.Other requirements were as follows:
- You may assume that any dictionary passed to your program will be structured exactly like ours, lexicographically sorted from top to bottom with one word per line, each of which ends with \n. You may also assume that dictionary will contain at least one word, that no word will be longer than LENGTH (a constant defined in dictionary.h) characters, that no word will appear more than once, and that each word will contain only lowercase alphabetical characters and possibly apostrophes.
- You may assume that check will only be passed strings with alphabetical characters and/or apostrophes.
- Your implementation of check must be case-insensitive.
One area I'm specifically interested in review is performance. I ran gperf and callgrind to profile my code and made some changes based on analysis; however, it seems that
load_dict might be underachieving when I compare to staff solution benchmarks (run on same environment). Does anything stick out that could be improved to get better performance?speller.c
```
#include "dbg.h"
#include "dictionary.h"
#include
#include
#include
#include
#undef difference_in_seconds
#undef getrusage
#define DEFAULT_DICT "dictionaries/large"
typedef struct {
struct rusage before;
struct rusage after;
double time_load;
double time_check;
double time_size;
double time_unload;
u
Solution
Buffer overflow
This line is vulnerable to buffer overflow:
I'm not sure why you added one here, but it makes it possible to overflow the buffer by one character. According to the problem statement, the dictionary will not have any words longer than the buffer, so your code is safe as long as you use the provided dictionary. But for an arbitrary input file, you should remove that
Uninitialized variables
When I ran your program, I got this as part of the output:
This was due to the variable
Performance in loading
After playing with your code a bit, I found that the loading performance was slow mostly due to lack of buffering. That is to say, calling
This line is vulnerable to buffer overflow:
while (fgets(word, sizeof(word)+1, in)) {I'm not sure why you added one here, but it makes it possible to overflow the buffer by one character. According to the problem statement, the dictionary will not have any words longer than the buffer, so your code is safe as long as you use the provided dictionary. But for an arbitrary input file, you should remove that
+1 from that line.Uninitialized variables
When I ran your program, I got this as part of the output:
WORDS MISSPELLED: 1629196271This was due to the variable
bm_data being uninitialized in main, and none of the other functions zeroing it either. Since that code was not written by you, you aren't really to blame for that.Performance in loading
After playing with your code a bit, I found that the loading performance was slow mostly due to lack of buffering. That is to say, calling
fgets() to read each word is slower than reading into a big buffer and then parsing through the buffer. When I used fread() to read into a buffer, the dictionary loaded in 0.06 seconds compared to 0.11 seconds when using fgets(). I tried different buffer sizes and I got pretty much the same results with buffer sizes ranging from 64 bytes to 64 KB. Here is code I used:#define BUFSIZE 1024
bool Trie_insert_file(TrieNode **root, FILE *in)
{
check(root, "Root is null.");
char buffer[BUFSIZE];
TrieNode *cur = *root;
while(1) {
size_t bufLen = fread(buffer, 1, sizeof(buffer), in);
if (bufLen end_of_word = true;
cur = *root;
continue;
}
size_t index = get_index(c);
if (cur->children[index] == NULL)
cur->children[index] = Trie_create();
cur = cur->children[index];
}
}
error:
return false;
}Code Snippets
while (fgets(word, sizeof(word)+1, in)) {WORDS MISSPELLED: 1629196271#define BUFSIZE 1024
bool Trie_insert_file(TrieNode **root, FILE *in)
{
check(root, "Root is null.");
char buffer[BUFSIZE];
TrieNode *cur = *root;
while(1) {
size_t bufLen = fread(buffer, 1, sizeof(buffer), in);
if (bufLen <= 0)
return true;
for (size_t i = 0; i < bufLen; i++) {
char c = buffer[i];
if (c == '\n') {
trie_word_count++;
cur->end_of_word = true;
cur = *root;
continue;
}
size_t index = get_index(c);
if (cur->children[index] == NULL)
cur->children[index] = Trie_create();
cur = cur->children[index];
}
}
error:
return false;
}Context
StackExchange Code Review Q#138026, answer score: 8
Revisions (0)
No revisions yet.