HiveBrain v1.2.0
Get Started
← Back to all entries
patternjavaMinor

Function to get top k words in a given String

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
topwordsfunctiongetstringgiven

Problem

This is a two-part job:

  • to identify a word



  • to find the top k words



I've tried to use regex to split but it's taking 10x more time when string is really long. The top k strings are found using a priority queue. I've created a min priority queue for first k element, if the frequency of the next word is greater than the min element. I then remove min and insert the new word.

```
public static List getTopStrings(String input, int k)
{
if(k list = new ArrayList();
if(input==null || input.isEmpty() || k==0)
{
return list;
}
Set separators = new HashSet();
separators.add(' ');
separators.add(',');
separators.add('.');
separators.add(';');
separators.add(':');
separators.add('"');
separators.add('(');
separators.add(')');
separators.add('-');
separators.add('/');
separators.add('\\');
separators.add('\'');
separators.add('?');
separators.add('\n');
separators.add('\r');
separators.add('!');
separators.add('|');
separators.add('~');
separators.add('\'');
separators.add('[');
separators.add(']');
separators.add('{');
separators.add('}');
separators.add('&');
separators.add('%');
separators.add('$');
separators.add('#');
separators.add('@');
separators.add('*');
separators.add('=');
separators.add('+');
separators.add('>');
separators.add(' wordMap = new HashMap();
int count, wordStart=0;
for(int i=0;i keySet = new ArrayList();
keySet.addAll(wordMap.keySet());
if(keySet.size() minHeap = new PriorityQueue(k, new Comparator() {

Solution

Initialize Set with values

Your method of initializing the hashset with values takes quite a bit of space. I would create a static field:

public static final String[] SEPARATORS = new String[] { " ", ",", ".", ";" };


And then use it like this:

public static final Set separatorsSet = new HashSet(Arrays.asList(SEPARATORS));


Extract code to methods

Your getTopStrings method currently does three things: extract words from string, count amount of duplicate words, sort the result of that by frequency. I would at least create separate methods for the first two and the third functionality.

Simplify Sorting

Your sorting mechanism using a queue seems overly complicated. If you are using Java 8, it could be reduced to three lines (or one really long one):

// sort by value, reversed
    Stream> sorted = wordMap.entrySet().stream().sorted(Map.Entry.comparingByValue((x,y) -> (y  collected = sorted.map(entry -> entry.getKey()).collect(Collectors.toCollection(ArrayList::new));
    // top k:
    List topK = collected.subList(0, k);


Misc

  • in Java, it is customary to place curly brackets on the same line as the opening statements.



  • use curly brackets even for one line statements for improved readability and to avoid future bugs.



  • use more spaces, eg around ==, `



  • declare variables in as small a scope as possible to increase readability.

Code Snippets

public static final String[] SEPARATORS = new String[] { " ", ",", ".", ";" };
public static final Set<String> separatorsSet = new HashSet<String>(Arrays.asList(SEPARATORS));
// sort by value, reversed
    Stream<Entry<String, Integer>> sorted = wordMap.entrySet().stream().sorted(Map.Entry.comparingByValue((x,y) -> (y < x) ? -1 : ((x == y) ? 0 : 1))); // alternatively: use reversed and cast
    // flatten map to list:
    List<String> collected = sorted.map(entry -> entry.getKey()).collect(Collectors.toCollection(ArrayList::new));
    // top k:
    List<String> topK = collected.subList(0, k);

Context

StackExchange Code Review Q#82264, answer score: 7

Revisions (0)

No revisions yet.