Recognizing a sequence read through OCR software
Problem
I am trying to recognize a sentence that I have read through optical character recognition software. This code will eventually run on a Raspberry Pi.
I know for certain that it's meant to be one of a few thousand sentences which I have written down in a file (sentencelist.txt), but the character recognition has messed up a few of the words.
For example:
The quick brown fox jumps over the lazy dog
has been read as
He quick broom fax jumps over the lazy dig
I want to compare the incorrect sentence to every sentence in my list, and then figure out which one it's meant to be.
I currently have a working program, but it is just too slow! I have about 10,000 entries in my sentence file, and the whole process takes over a minute.
I initially used the Levenshtein algorithm to compare, but am now using a different algorithm which compares words rather than characters to speed it up.
How can I speed this baby up?
```
// C++ source for sentence lookup from OCR string. In main, you can set the distance calculator to be used.
#include "stdafx.h"
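// NOTE: the header names on the original eight #include lines were lost in extraction;
// the three below are the ones the visible code requires.
#include <sstream>
#include <string>
#include <vector>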
using namespace std;
// Levenshtein Distance Function - callable from Main
size_t LevenshteinDistance(const std::string &s1, const std::string &s2)
{
const size_t m(s1.size());
const size_t n(s2.size());
if (m == 0) return n;
if (n == 0) return m;
size_t *costs = new size_t[n + 1];
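// [The rest of LevenshteinDistance() was lost in extraction, starting at "for (size_t k = 0; k <".]
typedef std::vector<std::string> Sentence; // element type inferred from split() below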
Sentence &split(const std::string &s, char delim, Sentence &elems) {
std::stringstream ss(s);
std::string item;
while (std::getline(ss, item, delim)) {
elems.push_back(item);
}
return elems;
}
Sentence split(const std::string &s, char delim) {
Sentence elems;
split(s, delim, elems);
return elems;
}
unsigned int edit_distance(const Sentence& s1, const Sentence& s2)
{
const std::size_t len1 = s1.size(), len2 = s2.size();
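std::vector<std::vector<unsigned int>> d(len1 + 1, std::vector<unsigned int>(le
// [Listing truncated in the source; element type assumed from the unsigned int return type.]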
```
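Because the listing above is cut off, here is a minimal, self-contained sketch of the word-level comparison the question describes. It is not the original code: the names `split_words` and `word_edit_distance` are illustrative, and splitting on whitespace is an assumption.

```cpp
#include <algorithm>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

using Sentence = std::vector<std::string>;

// Split a line into whitespace-separated words (assumed tokenization).
Sentence split_words(const std::string& line) {
    std::istringstream stream(line);
    Sentence words;
    std::string word;
    while (stream >> word) {
        words.push_back(word);
    }
    return words;
}

// Word-level Levenshtein distance: each word counts as one symbol, so a
// misread word costs 1 no matter how many of its characters differ.
std::size_t word_edit_distance(const Sentence& a, const Sentence& b) {
    std::vector<std::size_t> previous(b.size() + 1);
    std::vector<std::size_t> current(b.size() + 1);
    for (std::size_t j = 0; j <= b.size(); ++j) {
        previous[j] = j;  // turning an empty prefix into j words takes j insertions
    }
    for (std::size_t i = 1; i <= a.size(); ++i) {
        current[0] = i;   // deleting i words
        for (std::size_t j = 1; j <= b.size(); ++j) {
            const std::size_t substitution =
                previous[j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1);
            current[j] = std::min({previous[j] + 1, current[j - 1] + 1, substitution});
        }
        previous.swap(current);
    }
    return previous[b.size()];
}
```

For the two example sentences in the question, `word_edit_distance(split_words(...), split_words(...))` gives 4, one for each misread word.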
Solution
- I'm not sure if the manually-allocated array in `LevenshteinDistance()` is necessary when you can probably still use `std::vector`.
  If this is still necessary, then you could consider accounting for failure with `new` if it could help make your program more robust. Still, having to do something like this suggests that you should avoid manual memory allocation as much as possible in C++.
- Try to avoid single-character variable names, unless they're loop counters. The names `m` and `n` may work in a mathematical sense, but you shouldn't assume that the user (or even you in several years) will not be confused by this.
- Consider better names than `s1` and `s2`, based on their intended uses (you already have info on this in some comments in `main()`).
- Keep your whitespace use consistent and don't add it where unnecessary. For instance, there's a lot of excess whitespace before and within `main()` for some reason. This does nothing to improve readability.
- This is quite unclear and is practically a magic number: `int num = 13309;`
  The comment next to it describes this variable, so just rename it as such, and make it `const` since it's a constant. If you need a comment to describe something like this, then you may need to rethink your naming. Comments should mostly be needed for more complex explanations.
- You attempt to open a file for reading, but only display an error on failure and continue executing the program. You should instead terminate the program with a valid error code on failure after displaying the error. The error itself should also be printed to `std::cerr` instead of `std::cout`.
- Prefer a better alternative to a "pause" when running the code through certain IDEs, such as: `std::cin.get();`
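The points about `std::vector`, descriptive names, and file-open failure could look something like the sketch below. This is one possible shape, not the original program: the names `levenshtein_distance`, `source`, `target`, and `load_sentences` are hypothetical, and the file is assumed to hold one sentence per line.

```cpp
#include <algorithm>
#include <cstddef>
#include <fstream>
#include <iostream>
#include <string>
#include <vector>

// Character-level Levenshtein distance with a std::vector row buffer in place
// of new[]; the buffer is released automatically on every exit path.
std::size_t levenshtein_distance(const std::string& source, const std::string& target) {
    const std::size_t source_length = source.size();
    const std::size_t target_length = target.size();
    if (source_length == 0) return target_length;
    if (target_length == 0) return source_length;

    std::vector<std::size_t> costs(target_length + 1);
    for (std::size_t column = 0; column <= target_length; ++column) {
        costs[column] = column;
    }
    for (std::size_t row = 0; row < source_length; ++row) {
        std::size_t diagonal = costs[0];  // value from the previous row, one column left
        costs[0] = row + 1;
        for (std::size_t column = 0; column < target_length; ++column) {
            const std::size_t above = costs[column + 1];
            const std::size_t substitution =
                diagonal + (source[row] == target[column] ? 0 : 1);
            costs[column + 1] = std::min({above + 1, costs[column] + 1, substitution});
            diagonal = above;
        }
    }
    return costs[target_length];
}

// Read one sentence per line (assumed file layout). On failure, report to
// std::cerr and return false so the caller can stop with an error code.
bool load_sentences(const std::string& path, std::vector<std::string>& sentences) {
    std::ifstream input(path);
    if (!input) {
        std::cerr << "Could not open " << path << '\n';
        return false;
    }
    std::string line;
    while (std::getline(input, line)) {
        sentences.push_back(line);
    }
    return true;
}
```

In `main()`, the caller would then do something like `if (!load_sentences("sentencelist.txt", sentences)) return EXIT_FAILURE;` (with `EXIT_FAILURE` from `<cstdlib>`), so the program stops instead of continuing with an empty list.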
Context
StackExchange Code Review Q#98115, answer score: 4