HiveBrain v1.2.0
Get Started
← Back to all entries
patterncModerate

Spell Corrector in C

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
correctorspellstackoverflow

Problem

I recently stumbled across this article on how to write a spelling corrector, and figured I'd try to have a go at it in C (mainly because the link at the end of the page for the C code is broken).

Here is what I would like reviewed:

-
Accuracy: What can I do to make the program output a more accurate prediction of what it thinks the correct word should be?

-
Speed: Are there any improvements that could be made to improve runtime speeds?

-
Refinement: In what ways could I shorten my code down a bit?

Code:

#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 

#define SIZE_DICT 235886

char *dictionary = "/usr/share/dict/words";

const char alphabet[] = "abcdefghijklmnopqrstuvwxyz0123456789";

char *strtolower(char *word)
{
    for (char *s = word; *s; ++s) *s = tolower(*s);
    return word;
}

ENTRY *find(char *word)
{
    return hsearch((ENTRY){.key = word}, FIND);
}

int update(char *word)
{
    ENTRY *e = find(word);

    if (!e) return 0;

    e->data++;
    return 1;
}

int readFile(const char* fileName, ENTRY dict)
{
    int fd = open(fileName, O_RDONLY);
    if (fd  str_size) || ((offset + limit) > str_size) || (str_size data > max_size))
        {
            max_size = (int) e->data;
            max_word = e->key;
        }
    }

    return max_word;
}

void array_cleanup(char **array, int rows)
{
    for (int i = 0; i ");
        return 1;
    }

    ENTRY dict = {};
    hcreate(SIZE_DICT);

    if (!readFile(dictionary, dict)) return -1;

    char *corrected_word = correct(argv[argc - 1]);
    puts(corrected_word);
}


Test run:

As you can see, the accuracy is somewhat low. The runtime speeds are faster than the original Python program in the post above, but I feel could still be improved upon.

time ./check miror
miro

real 0m0.118s
user 0m0.092s
sys 0m0.014s

Solution

Didn't work for me

First off, I ran your program but it didn't ever find any word. I suspect the problem was in the function max:

if (e && ((int) e->data > max_size))


I didn't see anywhere in the program where data was ever incremented, except if you had a duplicate word in the dictionary. My dictionary had no duplicate words. Perhaps yours had duplicate words such as Miro and miro. I changed the code to just return the first word in the array and it did start to work. Although it seems like a waste to find all possible words just to return the first one. It would be nice if the words in the dictionary were scored in some fashion and the highest scored word would be returned (for example, words with common letters could score higher than ones with uncommon letters). You could do this by initializing data to some function like this:

dict.data = score_word(w);


Messy concatenation

The way you build new strings is very messy. First of all, the function concat() has a couple of problems:

  • Doesn't check return value of malloc and realloc.



  • If str2 is NULL, there is a memory leak.



Beyond that, between concat() and substr(), you end up allocating, reallocating, and freeing many temporary strings. It would be simpler and even faster to:

  • Allocate one string the length of the final string. You know this length because you are either deleting, modifying, or inserting into the original string.



  • Copy the relevant parts of the string into that final string.



Here is a rewrite of your concatenation functions, which doubled the speed of your program:

static void *checked_malloc(int len)
{
    void *ret = malloc(len);

    if (ret == NULL) {
        fprintf(stderr, "Out of memory\n");
        exit(0);
    }
    return ret;
}

/**
 * Takes a part of the source string and appends it to the destination string.
 *
 * @param     dst       Destination string to append to.
 * @param     dstLen    Current length of the destination string.  This will
 *                      be updated with the new length after appending.
 * @param     src       Source string.
 * @param     srcBegin  Starting index in the source string to copy from.
 * @param     len       Length of portion to copy.
 */
static void append(char *dst, int *dstLen, const char *src, int srcBegin,
        int len)
{
    if (len > 0) {
        memcpy(&dst[*dstLen], &src[srcBegin], len);
        *dstLen += len;
    }
    dst[*dstLen] = 0;
}

int deletion(char *word, char **array, int start_idx)
{
    int i = 0;
    size_t word_len = strlen(word);

    for (; i < word_len; i++)
    {
        int pos = 0;
        array[i+start_idx] = checked_malloc(word_len);
        append(array[i+start_idx], &pos, word, 0, i);
        append(array[i+start_idx], &pos, word, i+1, word_len-(i+1));
    }
    return i;
}

int transposition(char *word, char **array, int start_idx)
{
    int i = 0;
    size_t word_len = strlen(word);

    for (; i < word_len-1; i++)
    {
        int pos = 0;
        array[i+start_idx] = checked_malloc(word_len+1);
        append(array[i+start_idx], &pos, word, 0,   i);
        append(array[i+start_idx], &pos, word, i+1, 1);
        append(array[i+start_idx], &pos, word, i,   1);
        append(array[i+start_idx], &pos, word, i+2, word_len-(i+2));
    }
    return i;
}

int alteration(char *word, char **array, int start_idx)
{
    int k = 0;
    size_t word_len = strlen(word);
    char c[2] = {};

    for (int i = 0; i < word_len; ++i)
    {
        for (int j = 0; j < ALPHABET_SIZE; ++j, ++k)
        {
            int pos = 0;
            c[0] = alphabet[j];
            array[k+start_idx] = checked_malloc(word_len+1);
            append(array[k+start_idx], &pos, word, 0, i);
            append(array[k+start_idx], &pos, c   , 0, 1);
            append(array[k+start_idx], &pos, word, i+1, word_len-(i+1));
        }
    }
    return k;
}

int insertion(char *word, char **array, int start_idx)
{
    int k = 0;
    size_t word_len = strlen(word);
    char c[2] = {};

    for (int i = 0; i <= word_len; ++i)
    {
        for (int j = 0; j < ALPHABET_SIZE; ++j, ++k)
        {
            int pos = 0;
            c[0] = alphabet[j];
            array[k+start_idx] = checked_malloc(word_len+2);
            append(array[k+start_idx], &pos, word, 0, i);
            append(array[k+start_idx], &pos, c   , 0, 1);
            append(array[k+start_idx], &pos, word, i, word_len-i);
        }
    }
    return k;
}


Reallocation scheme

In known_edits2(), you realloc your array every time you add a new string to it. This will take \$O(n^2)\$ time. I suggest doubling the size of the array each time to avoid slowing things down.

Here is a rewrite of the reallocation scheme:

```
char known_edits2(char array, int rows, int *e2_rows)
{
size_t e1_rows = 0;
int res_size = 0;
int res_max = 0;
char **res = NULL;
char **e1 = NULL;

for (int i = 0; i = res_max) {
// F

Code Snippets

if (e && ((int) e->data > max_size))
dict.data = score_word(w);
static void *checked_malloc(int len)
{
    void *ret = malloc(len);

    if (ret == NULL) {
        fprintf(stderr, "Out of memory\n");
        exit(0);
    }
    return ret;
}

/**
 * Takes a part of the source string and appends it to the destination string.
 *
 * @param     dst       Destination string to append to.
 * @param     dstLen    Current length of the destination string.  This will
 *                      be updated with the new length after appending.
 * @param     src       Source string.
 * @param     srcBegin  Starting index in the source string to copy from.
 * @param     len       Length of portion to copy.
 */
static void append(char *dst, int *dstLen, const char *src, int srcBegin,
        int len)
{
    if (len > 0) {
        memcpy(&dst[*dstLen], &src[srcBegin], len);
        *dstLen += len;
    }
    dst[*dstLen] = 0;
}

int deletion(char *word, char **array, int start_idx)
{
    int i = 0;
    size_t word_len = strlen(word);

    for (; i < word_len; i++)
    {
        int pos = 0;
        array[i+start_idx] = checked_malloc(word_len);
        append(array[i+start_idx], &pos, word, 0, i);
        append(array[i+start_idx], &pos, word, i+1, word_len-(i+1));
    }
    return i;
}

int transposition(char *word, char **array, int start_idx)
{
    int i = 0;
    size_t word_len = strlen(word);

    for (; i < word_len-1; i++)
    {
        int pos = 0;
        array[i+start_idx] = checked_malloc(word_len+1);
        append(array[i+start_idx], &pos, word, 0,   i);
        append(array[i+start_idx], &pos, word, i+1, 1);
        append(array[i+start_idx], &pos, word, i,   1);
        append(array[i+start_idx], &pos, word, i+2, word_len-(i+2));
    }
    return i;
}

int alteration(char *word, char **array, int start_idx)
{
    int k = 0;
    size_t word_len = strlen(word);
    char c[2] = {};

    for (int i = 0; i < word_len; ++i)
    {
        for (int j = 0; j < ALPHABET_SIZE; ++j, ++k)
        {
            int pos = 0;
            c[0] = alphabet[j];
            array[k+start_idx] = checked_malloc(word_len+1);
            append(array[k+start_idx], &pos, word, 0, i);
            append(array[k+start_idx], &pos, c   , 0, 1);
            append(array[k+start_idx], &pos, word, i+1, word_len-(i+1));
        }
    }
    return k;
}

int insertion(char *word, char **array, int start_idx)
{
    int k = 0;
    size_t word_len = strlen(word);
    char c[2] = {};

    for (int i = 0; i <= word_len; ++i)
    {
        for (int j = 0; j < ALPHABET_SIZE; ++j, ++k)
        {
            int pos = 0;
            c[0] = alphabet[j];
            array[k+start_idx] = checked_malloc(word_len+2);
            append(array[k+start_idx], &pos, word, 0, i);
            append(array[k+start_idx], &pos, c   , 0, 1);
            append(array[k+start_idx], &pos, word, i, word_len-i);
        }
    }
    return k;
}
char **known_edits2(char **array, int rows, int *e2_rows)
{
    size_t e1_rows = 0;
    int res_size = 0;
    int res_max  = 0;
    char **res = NULL;
    char **e1  = NULL;

    for (int i = 0; i < rows; i++)
    {
        e1      = edits1(array[i]);
        e1_rows = edits1_rows(array[i]);

        for (int j = 0; j < e1_rows; j++)
        {
            if (find(e1[j]) && !array_exist(res, res_size, e1[j]))
            {
                if (res_size >= res_max) {
                    // First time, allocate 50 entries.  After that, double
                    // the size of the array.
                    if (res_max == 0)
                        res_max = 50;
                    else
                        res_max *= 2;
                }
                res = checked_realloc(res, sizeof(char *) * res_max);
                res[res_size++] = e1[j];
            }
        }
    }

    *e2_rows = res_size;

    return res;
}
array[i + start_idx] = concat(substr(word, 0, i), substr(word, i+1, word_len-(i+1)));

Context

StackExchange Code Review Q#93377, answer score: 13

Revisions (0)

No revisions yet.