Analyze huge set of sentences for word presence
Problem
I have a huge file (1 GB) of English sentences, and I need to keep only those containing the word "Alice".
Actual tests could be more complex, e.g. matching a verb by its word forms (go, goes, gone, went, going).
To solve the task I designed a method that feeds each English word to a consumer as soon as the word is complete; if the test returns true, the method returns immediately.
This is twice as fast as storing the words in a list first and matching them afterwards. Still, I get the following benchmarks:
- read: 1.6s / 100MB;
- process: 1.5s / 100MB.
Could this code be improved further?
```
import java.util.function.Function;

public class ExtractEnglishWordsAndTest {

    public static boolean extractEnglishWordsAndTest(String text, Function<String, Boolean> consumer) {
        if (text == null || text.isEmpty()) {
            return false;
        }
        char[] buf = new char[text.length()];
        int bufIndex = -1;
        boolean isEnglishPiece = isEnglishLetterOrHyphen(text.charAt(0));
        for (char ch : text.toCharArray()) {
            boolean isEnglishLetter = isEnglishLetterOrHyphen(ch);
            if (isEnglishPiece && !isEnglishLetter || !isEnglishPiece && isEnglishLetter) {
                if (isEnglishPiece) {
                    if (consumer.apply(new String(buf, 0, bufIndex + 1))) {
                        return true;
                    }
                }
                isEnglishPiece = !isEnglishPiece;
                bufIndex = -1;
            }
            bufIndex++;
            buf[bufIndex] = ch;
        }
        if (isEnglishPiece) {
            if (consumer.apply(new String(buf, 0, bufIndex + 1))) {
                return true;
            }
        }
        return false;
    }

    public static boolean isEnglishLetterOrHyphen(char ch) {
        return ch >= 'a' && ch <= 'z' || ch >= 'A' && ch <= 'Z' || ch == '-';
    }

    // Usage fragment; the sample input and the rest of main are truncated in the source:
    // extractEnglishWordsAndTest(text, word -> {
    //     System.out.print(word + " ");
    //     return false;
    // });
    // System.out.println();
}
```
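The more complex test mentioned in the question (matching a verb by its word forms) could be expressed as a function of the same shape as the consumer. The following is only a sketch: the class name `WordFormTest`, the helper `matchesAnyOf`, and the case-insensitive comparison are assumptions for illustration, not part of the original code.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.function.Function;

class WordFormTest {
    // Hypothetical list of word forms for the verb "go" (taken from the question's example).
    static final Set<String> GO_FORMS =
            new HashSet<>(Arrays.asList("go", "goes", "gone", "went", "going"));

    // Builds a test that returns true when the lower-cased word is one of the given forms.
    static Function<String, Boolean> matchesAnyOf(Set<String> forms) {
        return word -> forms.contains(word.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        Function<String, Boolean> isGo = matchesAnyOf(GO_FORMS);
        System.out.println(isGo.apply("Went"));  // true
        System.out.println(isGo.apply("Alice")); // false
    }
}
```

A hash-set lookup keeps the per-word cost roughly constant regardless of how many forms are listed, which matters when the test runs once per word of a 1 GB file.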
Solution
The word-extraction logic seems convoluted. Consider two loops instead, along the lines of
```
int text_length = text.length();
int i = 0;
while (i < text_length) {
    // Skip characters that are not part of a word.
    while ((i < text_length) && !isEnglishLetterOrHyphen(text.charAt(i))) {
        i++;
    }
    int wordStart = i;
    // Advance to the end of the word.
    while ((i < text_length) && isEnglishLetterOrHyphen(text.charAt(i))) {
        i++;
    }
    // wordStart == i only when the text ends with non-word characters.
    if ((wordStart < i) && consumer.apply(text.substring(wordStart, i))) {
        return true;
    }
}
return false;
```
BTW,
```
isEnglishPiece && !isEnglishLetter || !isEnglishPiece && isEnglishLetter
```
is a long (and unclear) way to say
```
isEnglishPiece != isEnglishLetter
```
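For reference, the two-loop idea can be assembled into a self-contained class roughly as follows. This is a sketch under assumptions: the names `TwoLoopWordScanner` and `anyWordMatches` are invented here, and `Predicate<String>` is swapped in for the question's `Function<String, Boolean>` to avoid boxing the result; neither choice comes from the original post.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

class TwoLoopWordScanner {
    static boolean isEnglishLetterOrHyphen(char ch) {
        return (ch >= 'a' && ch <= 'z') || (ch >= 'A' && ch <= 'Z') || ch == '-';
    }

    // Feeds each maximal run of letters/hyphens to the test and stops at the first match.
    static boolean anyWordMatches(String text, Predicate<String> test) {
        if (text == null) {
            return false;
        }
        int n = text.length();
        int i = 0;
        while (i < n) {
            // First loop: skip separators.
            while (i < n && !isEnglishLetterOrHyphen(text.charAt(i))) {
                i++;
            }
            int wordStart = i;
            // Second loop: consume one word.
            while (i < n && isEnglishLetterOrHyphen(text.charAt(i))) {
                i++;
            }
            // An empty range can only occur when the text ends with separators.
            if (i > wordStart && test.test(text.substring(wordStart, i))) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> seen = new ArrayList<>();
        boolean found = anyWordMatches("Hello, Alice! How are you?", w -> {
            seen.add(w);
            return w.equals("Alice");
        });
        System.out.println(found + " " + seen); // true [Hello, Alice]
    }
}
```

Compared with the original, this version needs no scratch buffer and no `toCharArray()` copy; whether that translates into a measurable speed-up on the 100MB benchmark would have to be measured.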
Context
StackExchange Code Review Q#134996, answer score: 3