principlejavaMinor
Using Levenshtein distance to compare strings
Viewed 0 times
distancelevenshteinusingcomparestrings
Problem
My overall challenge and why this code was written was to compare two strings. One string being a description of a item in inventory and one string being a description of a item, but possibly truncated/missing some characters. I am trying to match up the items I have posted on Craigslist and the items I think I have posted.
So I go to Craigslist and copy the table showing each active item, parse the text, and come up with the item descriptions from craigslist. When they are initially put on craigslist they could have been truncated or possibly I might have missed a few chars when pasting. So I cannot try to find an exact match for each item.
After search for some time I found the Levenshtein edit distance algorithm, which helps. It can reduce the search space by a good amount but it still gives me many false positives (low edit distance, but not the same item string) before I come to the true answer.
It also does not do well when trying to match strings that have been severely truncated (their description is 1.75x bigger than allowed text in craigslist description) because their edit distance is huge.
About the code
After the text is parsed I have a linked list of strings which are from craigslist, I then create a dummy item for each and have a list of possible items to set to active. So for each possible
For each comparison between a
In this function is where my issues lies, if I allow Levenshtein distance to be looked at that are too large I will have to answer the
So I go to Craigslist and copy the table showing each active item, parse the text, and come up with the item descriptions from craigslist. When they are initially put on craigslist they could have been truncated or possibly I might have missed a few chars when pasting. So I cannot try to find an exact match for each item.
After search for some time I found the Levenshtein edit distance algorithm, which helps. It can reduce the search space by a good amount but it still gives me many false positives (low edit distance, but not the same item string) before I come to the true answer.
It also does not do well when trying to match strings that have been severely truncated (their description is 1.75x bigger than allowed text in craigslist description) because their edit distance is huge.
About the code
After the text is parsed I have a linked list of strings which are from craigslist, I then create a dummy item for each and have a list of possible items to set to active. So for each possible
clItem I go through every item in the inventory until I find a match, then break. (I have multiple of the same inventory items, I only want to set one to true if one of them is posted).For each comparison between a
clItem and a Item in my inventory I call sameItemDesc() which uses the Levenshtein distance to determine if two strings are the same. In this function is where my issues lies, if I allow Levenshtein distance to be looked at that are too large I will have to answer the
JOptionPane numerous times for items that have numerous low Levenshtein distance. If it is too low I will not set enough and have to go through manually to set the restSolution
The overall question could be better for stackoverflow. This is more about reviewing code, where one would expect to see the code. There is no source for the Levenshtein distance.
But well, let me separate my answer in two parts.
First, the overall question
Are there any good ideas as to how to improve the code, or trim my search space so I might come to a faster solution?
According to the given preconditions:
One string being a description of a item in inventory and one string being a description of a item, but possibly truncated/missing some characters
One string is correct, the other one has flaws in a way, that characters are missing.
If we rephrase this, we have a string
For all characters
We use this properties to base our algorithm.
Trivial version:
For every flawed string, take every char, look if it is inside any correct string, if yes, remove char from correct string and continue with next char. If we found a correct string, which includes all chars from flawed string, we have a good candidate.
An implementation could be:
We can improve this, if you know that at least x of the first characters must be the same. Then do a startsWith for the substring.
The algorithm has a worst time complexity of (nmlen(longest string))
The not trivial solution:
Create a modified trie data structure. If we did not found a character for a node, look for all subnodes. The complexity is hard to guess, but in the average case should be better than the trivial version.
Second part, Code quality
Some parts are already pointed out, I try to avoid them.
Bad name for method (what is the meaning of the 2? What is the difference between 1 and 2?), expected result from a update method with return boolean is the success state, which is not the case(?). Rename it perhaps to printAllMatchingDescriptions or something like that. getAllMatchingDescriptions if you want to return it later.
Bad name for argument. Clearly, this is an argument, and arg could be an abbreviation for this. listDescriptions could be a better name. And use
Yes, the comment describes the next line. But it has no additional value as already written by the source code. You should avoid such comments.
Again, i would avoid abbreviations and would name it listItems.
It looks like the description is a good candidate for the constructor.
And again, try to avoid abbreviations.
It could be:
A logger could be nice. But ok, for quick & dirty debugging in small programs, println could be fine.
```
// Try to set one item in the inventory to active for each clItem
int setNum = 1;
for(Item cl : clItems)
for(Item cur : theApp
But well, let me separate my answer in two parts.
First, the overall question
Are there any good ideas as to how to improve the code, or trim my search space so I might come to a faster solution?
According to the given preconditions:
One string being a description of a item in inventory and one string being a description of a item, but possibly truncated/missing some characters
One string is correct, the other one has flaws in a way, that characters are missing.
If we rephrase this, we have a string
correct and a string flawed with this properties:For all characters
char in flawedcharis included incorrectandcount(char)inflawedis smaller or equal tocount(char)incorrect("contained").
index(char)inflawedis smaller or equal toindex(char)incorrect("ordered").
We use this properties to base our algorithm.
Trivial version:
For every flawed string, take every char, look if it is inside any correct string, if yes, remove char from correct string and continue with next char. If we found a correct string, which includes all chars from flawed string, we have a good candidate.
An implementation could be:
public static boolean isCharArrayIncludedInString(final char[] arrayChar, final String string) {
final char[] targetArray = string.toCharArray();
for (final char arrayCharItem : arrayChar) {
boolean isFound = false;
for (int i = 0; i < targetArray.length; ++i) { //if we have large targetArrays, we could improve this by a linkedlist and removing the entries
if (arrayCharItem == targetArray[i]) {
targetArray[i] = 0;
isFound = true;
break;
}
}
if (!isFound)
return false;
}
return true;
}
//if we use the ordered property, too:
public static boolean isCharArrayIncludedInString(final char[] arrayChar, final String string) {
final char[] targetArray = string.toCharArray();
int innerIndex = 0;
for (final char arrayCharItem : arrayChar) {
boolean isFound = false;
while (innerIndex < targetArray.length) {
if (arrayCharItem == targetArray[innerIndex++]) { // here we use the "ordered" property
isFound = true;
break;
}
}
if (!isFound)
return false;
}
return true;
}
// hint: i choosed isCharArrayIncludedInString instad of isStringIncludedInString, because the second typically assumes contiguous appearanceWe can improve this, if you know that at least x of the first characters must be the same. Then do a startsWith for the substring.
The algorithm has a worst time complexity of (nmlen(longest string))
The not trivial solution:
Create a modified trie data structure. If we did not found a character for a node, look for all subnodes. The complexity is hard to guess, but in the average case should be better than the trivial version.
Second part, Code quality
Some parts are already pointed out, I try to avoid them.
public boolean updateItems2(LinkedList arg)Bad name for method (what is the meaning of the 2? What is the difference between 1 and 2?), expected result from a update method with return boolean is the success state, which is not the case(?). Rename it perhaps to printAllMatchingDescriptions or something like that. getAllMatchingDescriptions if you want to return it later.
public boolean updateItems2(LinkedList arg)Bad name for argument. Clearly, this is an argument, and arg could be an abbreviation for this. listDescriptions could be a better name. And use
List if there is no specific need for LinkedList:public boolean printAllMatchingDescriptions(List listDescriptions)// Create a list of items with the CL descriptions
LinkedList clItems = new LinkedList();Yes, the comment describes the next line. But it has no additional value as already written by the source code. You should avoid such comments.
LinkedList clItems = new LinkedList();Again, i would avoid abbreviations and would name it listItems.
for(String a : arg)
{
Item newCL = new Item();
newCL.setDesc(a);
clItems.add(newCL);
}It looks like the description is a good candidate for the constructor.
And again, try to avoid abbreviations.
It could be:
for (final String listDescriptionsItem : listDescriptions)
listItems.add(new Item(listDescriptionsItem));System.out.println("Update Items got " + clItems.size() + " items");A logger could be nice. But ok, for quick & dirty debugging in small programs, println could be fine.
```
// Try to set one item in the inventory to active for each clItem
int setNum = 1;
for(Item cl : clItems)
for(Item cur : theApp
Code Snippets
public static boolean isCharArrayIncludedInString(final char[] arrayChar, final String string) {
final char[] targetArray = string.toCharArray();
for (final char arrayCharItem : arrayChar) {
boolean isFound = false;
for (int i = 0; i < targetArray.length; ++i) { //if we have large targetArrays, we could improve this by a linkedlist and removing the entries
if (arrayCharItem == targetArray[i]) {
targetArray[i] = 0;
isFound = true;
break;
}
}
if (!isFound)
return false;
}
return true;
}
//if we use the ordered property, too:
public static boolean isCharArrayIncludedInString(final char[] arrayChar, final String string) {
final char[] targetArray = string.toCharArray();
int innerIndex = 0;
for (final char arrayCharItem : arrayChar) {
boolean isFound = false;
while (innerIndex < targetArray.length) {
if (arrayCharItem == targetArray[innerIndex++]) { // here we use the "ordered" property
isFound = true;
break;
}
}
if (!isFound)
return false;
}
return true;
}
// hint: i choosed isCharArrayIncludedInString instad of isStringIncludedInString, because the second typically assumes contiguous appearancepublic boolean updateItems2(LinkedList<String> arg)public boolean updateItems2(LinkedList<String> arg)public boolean printAllMatchingDescriptions(List<String> listDescriptions)// Create a list of items with the CL descriptions
LinkedList<Item> clItems = new LinkedList<Item>();Context
StackExchange Code Review Q#21425, answer score: 5
Revisions (0)
No revisions yet.