patterncModerate
Spell Corrector in C
Viewed 0 times
correctorspellstackoverflow
Problem
I recently stumbled across this article on how to write a spelling corrector, and figured I'd try to have a go at it in C (mainly because the link at the end of the page for the C code is broken).
Here is what I would like reviewed:
-
Accuracy: What can I do to make the program output a more accurate prediction of what it thinks the correct word should be?
-
Speed: Are there any improvements that could be made to improve runtime speeds?
-
Refinement: In what ways could I shorten my code down a bit?
Code:
Test run:
As you can see, the accuracy is somewhat low. The runtime speeds are faster than the original Python program in the post above, but I feel could still be improved upon.
Here is what I would like reviewed:
-
Accuracy: What can I do to make the program output a more accurate prediction of what it thinks the correct word should be?
-
Speed: Are there any improvements that could be made to improve runtime speeds?
-
Refinement: In what ways could I shorten my code down a bit?
Code:
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
#define SIZE_DICT 235886
char *dictionary = "/usr/share/dict/words";
const char alphabet[] = "abcdefghijklmnopqrstuvwxyz0123456789";
char *strtolower(char *word)
{
for (char *s = word; *s; ++s) *s = tolower(*s);
return word;
}
ENTRY *find(char *word)
{
return hsearch((ENTRY){.key = word}, FIND);
}
int update(char *word)
{
ENTRY *e = find(word);
if (!e) return 0;
e->data++;
return 1;
}
int readFile(const char* fileName, ENTRY dict)
{
int fd = open(fileName, O_RDONLY);
if (fd str_size) || ((offset + limit) > str_size) || (str_size data > max_size))
{
max_size = (int) e->data;
max_word = e->key;
}
}
return max_word;
}
void array_cleanup(char **array, int rows)
{
for (int i = 0; i ");
return 1;
}
ENTRY dict = {};
hcreate(SIZE_DICT);
if (!readFile(dictionary, dict)) return -1;
char *corrected_word = correct(argv[argc - 1]);
puts(corrected_word);
}Test run:
As you can see, the accuracy is somewhat low. The runtime speeds are faster than the original Python program in the post above, but I feel could still be improved upon.
time ./check miror
miro
real 0m0.118s
user 0m0.092s
sys 0m0.014s
Solution
Didn't work for me
First off, I ran your program but it didn't ever find any word. I suspect the problem was in the function
I didn't see anywhere in the program where
Messy concatenation
The way you build new strings is very messy. First of all, the function
Beyond that, between
Here is a rewrite of your concatenation functions, which doubled the speed of your program:
Reallocation scheme
In
Here is a rewrite of the reallocation scheme:
```
char known_edits2(char array, int rows, int *e2_rows)
{
size_t e1_rows = 0;
int res_size = 0;
int res_max = 0;
char **res = NULL;
char **e1 = NULL;
for (int i = 0; i = res_max) {
// F
First off, I ran your program but it didn't ever find any word. I suspect the problem was in the function
max:if (e && ((int) e->data > max_size))I didn't see anywhere in the program where
data was ever incremented, except if you had a duplicate word in the dictionary. My dictionary had no duplicate words. Perhaps yours had duplicate words such as Miro and miro. I changed the code to just return the first word in the array and it did start to work. Although it seems like a waste to find all possible words just to return the first one. It would be nice if the words in the dictionary were scored in some fashion and the highest scored word would be returned (for example, words with common letters could score higher than ones with uncommon letters). You could do this by initializing data to some function like this:dict.data = score_word(w);Messy concatenation
The way you build new strings is very messy. First of all, the function
concat() has a couple of problems:- Doesn't check return value of
mallocandrealloc.
- If
str2is NULL, there is a memory leak.
Beyond that, between
concat() and substr(), you end up allocating, reallocating, and freeing many temporary strings. It would be simpler and even faster to:- Allocate one string the length of the final string. You know this length because you are either deleting, modifying, or inserting into the original string.
- Copy the relevant parts of the string into that final string.
Here is a rewrite of your concatenation functions, which doubled the speed of your program:
static void *checked_malloc(int len)
{
void *ret = malloc(len);
if (ret == NULL) {
fprintf(stderr, "Out of memory\n");
exit(0);
}
return ret;
}
/**
* Takes a part of the source string and appends it to the destination string.
*
* @param dst Destination string to append to.
* @param dstLen Current length of the destination string. This will
* be updated with the new length after appending.
* @param src Source string.
* @param srcBegin Starting index in the source string to copy from.
* @param len Length of portion to copy.
*/
static void append(char *dst, int *dstLen, const char *src, int srcBegin,
int len)
{
if (len > 0) {
memcpy(&dst[*dstLen], &src[srcBegin], len);
*dstLen += len;
}
dst[*dstLen] = 0;
}
int deletion(char *word, char **array, int start_idx)
{
int i = 0;
size_t word_len = strlen(word);
for (; i < word_len; i++)
{
int pos = 0;
array[i+start_idx] = checked_malloc(word_len);
append(array[i+start_idx], &pos, word, 0, i);
append(array[i+start_idx], &pos, word, i+1, word_len-(i+1));
}
return i;
}
int transposition(char *word, char **array, int start_idx)
{
int i = 0;
size_t word_len = strlen(word);
for (; i < word_len-1; i++)
{
int pos = 0;
array[i+start_idx] = checked_malloc(word_len+1);
append(array[i+start_idx], &pos, word, 0, i);
append(array[i+start_idx], &pos, word, i+1, 1);
append(array[i+start_idx], &pos, word, i, 1);
append(array[i+start_idx], &pos, word, i+2, word_len-(i+2));
}
return i;
}
int alteration(char *word, char **array, int start_idx)
{
int k = 0;
size_t word_len = strlen(word);
char c[2] = {};
for (int i = 0; i < word_len; ++i)
{
for (int j = 0; j < ALPHABET_SIZE; ++j, ++k)
{
int pos = 0;
c[0] = alphabet[j];
array[k+start_idx] = checked_malloc(word_len+1);
append(array[k+start_idx], &pos, word, 0, i);
append(array[k+start_idx], &pos, c , 0, 1);
append(array[k+start_idx], &pos, word, i+1, word_len-(i+1));
}
}
return k;
}
int insertion(char *word, char **array, int start_idx)
{
int k = 0;
size_t word_len = strlen(word);
char c[2] = {};
for (int i = 0; i <= word_len; ++i)
{
for (int j = 0; j < ALPHABET_SIZE; ++j, ++k)
{
int pos = 0;
c[0] = alphabet[j];
array[k+start_idx] = checked_malloc(word_len+2);
append(array[k+start_idx], &pos, word, 0, i);
append(array[k+start_idx], &pos, c , 0, 1);
append(array[k+start_idx], &pos, word, i, word_len-i);
}
}
return k;
}Reallocation scheme
In
known_edits2(), you realloc your array every time you add a new string to it. This will take \$O(n^2)\$ time. I suggest doubling the size of the array each time to avoid slowing things down.Here is a rewrite of the reallocation scheme:
```
char known_edits2(char array, int rows, int *e2_rows)
{
size_t e1_rows = 0;
int res_size = 0;
int res_max = 0;
char **res = NULL;
char **e1 = NULL;
for (int i = 0; i = res_max) {
// F
Code Snippets
if (e && ((int) e->data > max_size))dict.data = score_word(w);static void *checked_malloc(int len)
{
void *ret = malloc(len);
if (ret == NULL) {
fprintf(stderr, "Out of memory\n");
exit(0);
}
return ret;
}
/**
* Takes a part of the source string and appends it to the destination string.
*
* @param dst Destination string to append to.
* @param dstLen Current length of the destination string. This will
* be updated with the new length after appending.
* @param src Source string.
* @param srcBegin Starting index in the source string to copy from.
* @param len Length of portion to copy.
*/
static void append(char *dst, int *dstLen, const char *src, int srcBegin,
int len)
{
if (len > 0) {
memcpy(&dst[*dstLen], &src[srcBegin], len);
*dstLen += len;
}
dst[*dstLen] = 0;
}
int deletion(char *word, char **array, int start_idx)
{
int i = 0;
size_t word_len = strlen(word);
for (; i < word_len; i++)
{
int pos = 0;
array[i+start_idx] = checked_malloc(word_len);
append(array[i+start_idx], &pos, word, 0, i);
append(array[i+start_idx], &pos, word, i+1, word_len-(i+1));
}
return i;
}
int transposition(char *word, char **array, int start_idx)
{
int i = 0;
size_t word_len = strlen(word);
for (; i < word_len-1; i++)
{
int pos = 0;
array[i+start_idx] = checked_malloc(word_len+1);
append(array[i+start_idx], &pos, word, 0, i);
append(array[i+start_idx], &pos, word, i+1, 1);
append(array[i+start_idx], &pos, word, i, 1);
append(array[i+start_idx], &pos, word, i+2, word_len-(i+2));
}
return i;
}
int alteration(char *word, char **array, int start_idx)
{
int k = 0;
size_t word_len = strlen(word);
char c[2] = {};
for (int i = 0; i < word_len; ++i)
{
for (int j = 0; j < ALPHABET_SIZE; ++j, ++k)
{
int pos = 0;
c[0] = alphabet[j];
array[k+start_idx] = checked_malloc(word_len+1);
append(array[k+start_idx], &pos, word, 0, i);
append(array[k+start_idx], &pos, c , 0, 1);
append(array[k+start_idx], &pos, word, i+1, word_len-(i+1));
}
}
return k;
}
int insertion(char *word, char **array, int start_idx)
{
int k = 0;
size_t word_len = strlen(word);
char c[2] = {};
for (int i = 0; i <= word_len; ++i)
{
for (int j = 0; j < ALPHABET_SIZE; ++j, ++k)
{
int pos = 0;
c[0] = alphabet[j];
array[k+start_idx] = checked_malloc(word_len+2);
append(array[k+start_idx], &pos, word, 0, i);
append(array[k+start_idx], &pos, c , 0, 1);
append(array[k+start_idx], &pos, word, i, word_len-i);
}
}
return k;
}char **known_edits2(char **array, int rows, int *e2_rows)
{
size_t e1_rows = 0;
int res_size = 0;
int res_max = 0;
char **res = NULL;
char **e1 = NULL;
for (int i = 0; i < rows; i++)
{
e1 = edits1(array[i]);
e1_rows = edits1_rows(array[i]);
for (int j = 0; j < e1_rows; j++)
{
if (find(e1[j]) && !array_exist(res, res_size, e1[j]))
{
if (res_size >= res_max) {
// First time, allocate 50 entries. After that, double
// the size of the array.
if (res_max == 0)
res_max = 50;
else
res_max *= 2;
}
res = checked_realloc(res, sizeof(char *) * res_max);
res[res_size++] = e1[j];
}
}
}
*e2_rows = res_size;
return res;
}array[i + start_idx] = concat(substr(word, 0, i), substr(word, i+1, word_len-(i+1)));Context
StackExchange Code Review Q#93377, answer score: 13
Revisions (0)
No revisions yet.