Analyze a huge set of sentences for word presence

Submitted by: @import:stackexchange-codereview

Problem

I have a huge file (1 GB) of English sentences, and I need to keep only the sentences that contain the word Alice.

Actual tests could be more complex, e.g. matching a verb by any of its word forms (go, goes, gone, went, going).

To solve the task, I wrote a method that feeds each English word to a consumer as soon as the word is complete; if the consumer's test returns true, the method returns immediately.
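
A consumer for the word-form test mentioned above could be a simple set lookup. The following is only a sketch: the class name `WordFormMatcher` and the helper `anyFormOf` are hypothetical and not part of the original code, and the case-insensitive comparison is an assumption about what "matching" should mean here.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.function.Function;

// Hypothetical helper, not from the original post.
final class WordFormMatcher {

    // Builds a consumer that returns true when a word equals one of the
    // given forms, ignoring case (an assumption made for this sketch).
    static Function<String, Boolean> anyFormOf(String... forms) {
        Set<String> lookup = new HashSet<>();
        for (String form : Arrays.asList(forms)) {
            lookup.add(form.toLowerCase(Locale.ROOT));
        }
        return word -> lookup.contains(word.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        Function<String, Boolean> isGo =
                anyFormOf("go", "goes", "gone", "went", "going");
        System.out.println(isGo.apply("Went"));   // prints "true"
        System.out.println(isGo.apply("Alice"));  // prints "false"
    }
}
```

Such a function can be passed as the `consumer` argument, so the extraction method stops at the first word form it finds.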

This made the method twice as fast as first storing the words in a list and matching them afterwards. Still, I get the following benchmarks:

  • read: 1.6s / 100MB;
  • process: 1.5s / 100MB.

Can this code be improved further?

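```
import java.util.function.Function;

public class ExtractEnglishWordsAndTest {

    // Splits the text into runs of English letters/hyphens and feeds each run
    // to the consumer; returns true as soon as the consumer accepts a word.
    public static boolean extractEnglishWordsAndTest(String text, Function<String, Boolean> consumer) {
        if (text == null || text.isEmpty()) {
            return false;
        }

        char[] buf = new char[text.length()];
        int bufIndex = -1;

        boolean isEnglishPiece = isEnglishLetterOrHyphen(text.charAt(0));

        for (char ch : text.toCharArray()) {
            boolean isEnglishLetter = isEnglishLetterOrHyphen(ch);

            // A switch between word and non-word characters ends the current piece.
            if (isEnglishPiece && !isEnglishLetter || !isEnglishPiece && isEnglishLetter) {
                if (isEnglishPiece) {
                    if (consumer.apply(new String(buf, 0, bufIndex + 1))) {
                        return true;
                    }
                }

                isEnglishPiece = !isEnglishPiece;

                bufIndex = -1;
            }

            bufIndex++;
            buf[bufIndex] = ch;
        }

        // Flush the trailing word, if any.
        if (isEnglishPiece) {
            if (consumer.apply(new String(buf, 0, bufIndex + 1))) {
                return true;
            }
        }

        return false;
    }

    // Reconstructed from a garbled line in the source: ASCII letters or '-'.
    public static boolean isEnglishLetterOrHyphen(char ch) {
        return ch >= 'a' && ch <= 'z' || ch >= 'A' && ch <= 'Z' || ch == '-';
    }

    public static void main(String[] args) {
        // The sample input was lost in the source; "..." is a placeholder.
        extractEnglishWordsAndTest("...", word -> {
            System.out.print(word + " ");
            return false;
        });

        System.out.println();

        // (the rest of the original listing is truncated)
    }
}
```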

Solution

The logic of word extraction seems convoluted. Consider two loops instead, along the lines of

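    int textLength = text.length();
    int i = 0;
    while (i < textLength) {
        // Skip characters that cannot be part of a word.
        while ((i < textLength) && !isEnglishLetterOrHyphen(text.charAt(i))) {
            i++;
        }
        if (i == textLength) {
            break; // only separators were left
        }

        // Consume one word.
        int wordStart = i;
        while ((i < textLength) && isEnglishLetterOrHyphen(text.charAt(i))) {
            i++;
        }

        if (consumer.apply(text.substring(wordStart, i))) {
            return true;
        }
    }
    return false;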
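
The two-loop fragment leaves out the enclosing method and the character test. A self-contained sketch might look like the following; the names `WordScanner` and `anyWordMatches` are hypothetical, the character test assumes ASCII letters plus the hyphen, and the explicit end-of-text check is there so the outer loop stops once no words are left.

```java
import java.util.function.Function;

// Hypothetical wrapper around the two-loop idea; not the original author's code.
final class WordScanner {

    // Assumption: a "word" character is an ASCII letter or a hyphen.
    static boolean isEnglishLetterOrHyphen(char ch) {
        return (ch >= 'a' && ch <= 'z') || (ch >= 'A' && ch <= 'Z') || ch == '-';
    }

    // Returns true as soon as the consumer accepts one of the words in text.
    static boolean anyWordMatches(String text, Function<String, Boolean> consumer) {
        if (text == null) {
            return false;
        }
        int length = text.length();
        int i = 0;
        while (i < length) {
            // First loop: skip separators.
            while (i < length && !isEnglishLetterOrHyphen(text.charAt(i))) {
                i++;
            }
            if (i == length) {
                break; // nothing but separators remained
            }
            // Second loop: advance to the end of the current word.
            int wordStart = i;
            while (i < length && isEnglishLetterOrHyphen(text.charAt(i))) {
                i++;
            }
            if (consumer.apply(text.substring(wordStart, i))) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String sentence = "Alice went down the rabbit-hole.";
        System.out.println(anyWordMatches(sentence, word -> word.equals("Alice"))); // prints "true"
        System.out.println(anyWordMatches(sentence, word -> word.equals("Bob")));   // prints "false"
    }
}
```

Compared with the buffer-based version, this keeps only two indices into the original string and allocates a substring once per word.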


BTW,

    isEnglishPiece && !isEnglishLetter || !isEnglishPiece && isEnglishLetter

is a long (and unclear) way to say

    isEnglishPiece != isEnglishLetter
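
The equivalence can be checked by enumerating all four combinations of the two flags. This is a small illustrative sketch; the class `XorEquivalence` and its method names are made up for the demonstration.

```java
// Hypothetical demo class; compares the long condition with the short one.
final class XorEquivalence {

    // The condition as written in the question.
    static boolean longForm(boolean isEnglishPiece, boolean isEnglishLetter) {
        return isEnglishPiece && !isEnglishLetter || !isEnglishPiece && isEnglishLetter;
    }

    // The shorter form suggested in the answer.
    static boolean shortForm(boolean isEnglishPiece, boolean isEnglishLetter) {
        return isEnglishPiece != isEnglishLetter;
    }

    public static void main(String[] args) {
        boolean[] values = {false, true};
        for (boolean piece : values) {
            for (boolean letter : values) {
                // Both columns agree on every row of the truth table.
                System.out.println(piece + " " + letter + " -> "
                        + longForm(piece, letter) + " / " + shortForm(piece, letter));
            }
        }
    }
}
```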

Context

StackExchange Code Review Q#134996, answer score: 3
