HiveBrain v1.2.0
Get Started
← Back to all entries
patternjavaMinor

Efficient top k words using MultiMap (Google Guava ext lib JAR)

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
topgooglelibjarwordsefficientextguavausingmultimap

Problem

Basic algorithm:

  • Open document/huge text file



  • Use Map and List to separate the words. \$O(K)\$ time complexity (TC) considering \$K\$ distinct words.



  • If the word already exists in the list, increment the count of frequency, else add the new word. Constant TC.



  • Use a Collection to sort the list on the basis of frequency of usage. \$n*log(n)\$ TC.



  • Print the list, Entry.getKey (max count word), and its Entry.getValue (max count).



Now, I tried to use Multimap before sorting so that I can group same frequency count words. This will drastically decrease the sorting complexity as the number of elements to compare will decrease.

Is it efficient enough or are there other ways to make it more efficient?

```
//Below code works fine (Please include lib files as per your IDE setting)
package Topk;
//---import necessary lib---
class Words
{
int count=1;
String word;
Words(int count, String word)
{
this.count = count;
this.word = word;
}
}

public class TopK
{
public static void main(String argv[]) throws IOException
{
int Tcount = 0;
Map map = new TreeMap();
BufferedReader br = new BufferedReader(new FileReader("text.txt"));
String line;
while((line = br.readLine())!=null)
{
String[] words = line.split("\\W");
for(String word:words)
{
word = word.toLowerCase();
Tcount++;
if(word.equals(""))
continue;
insert(word,map);
}
}
Multimap mm = ArrayListMultimap.create();
Iterator itr = map.keySet().iterator();
while(itr.hasNext())
{
String key = (String) itr.next();
int tempi = map.get(key);
Stri

Solution

Java/Java 8 tips

Words is an unused class, and even if it is in used, class naming convention recommends the singular form rather than the plural. This is because you will have a Word instance or many Word instances, but a Words instance doesn't sound as right. The only pluralized class names that I can recall off-hand are for utilities classes, i.e. those that only provide a bunch of static methods.

Since Java 7, the recommended approach for reading from a Reader source is try-with-resources, e.g.

try (BufferedReader reader = new BufferedReader(new FileReader("text.txt"))) {
    String line;
    while ((line = reader.readLine()) != null) {
        // ...
    }
}


In Java 8, there is an even more convenient way of specifying how you want to read and process lines from a text file, which I will explain below...

Your Guava-based approach



  • Read each line and split into words.



  • Store a running count as the variable Tcount.



  • For each non-empty word in words, insert the lowercase form into a map.





  • Insertion is based on checking if word exists in the map. If it does, increment the counter, else add word => 1 to the map.




  • Create a Multimap, mm.



  • Loop through the keys of map to add the count-per-word pairing to mm.



  • Create another HashMap, fmap.



  • Loop through mm to add the toString() representation of each value as the key for fmap, with the key being the value for fmap.



  • Sort fmap into a List> data structure.





  • This is done by wrapping the entrySet() into a List, and applying Collections.sort() on the new List.




  • Display the result.




-
Tcount can be better renamed as totalCount or wordCount, following the camelCase naming convention for variables, and the unnecessary need to shorten variable names (unless they are really long).

-
Your implementation for step 3, i.e. the insert() method, has a simpler implementation in Java 8 using Map.merge():

map.merge(word, 1, (a, b) -> a + b);


What this means is to either add word => 1 to the map if it does not exist, or use the BiFunction lambda declaration to add the existing and new (i.e. 1) values.

-
Steps 4 to 7 are... quite lengthy and you can rely on some of Guava's built-in methods to perform the same operations too:

Multimap inverseMap = Multimaps.invertFrom(Multimaps.forMap(map),
                                            ArrayListMultimap.create());


  • You can use Multimaps.forMap(Map) to give you a Multimap wrapper over the original Map so that it can be used as an argument for...



  • Multimaps.invertFrom, inverting the key => value mapping of a Multimap into a resulting Multimap instance, which you can declare as ArrayListMultimap.create().



-
Step 8 is quite an odd step, partly because I feel you have deconstructed a Map semantic as a sorted List of Entry instances. This definitely can be implemented in the form of a TreeMap, which allows you to sort on the keys while giving you additional flexibility of retrieving using Map-friendly methods. After sorting the keys, you can simply use the largest key to get() from your Multimap as your result, instead of having to deal with a List data structure.

-
Putting together, steps 4 to 8 can probably be done as such:

private static > Collection getMaximumByValue(
        Map map) {
    Multimap inverseMap = Multimaps.invertFrom(Multimaps.forMap(map),
                                            ArrayListMultimap.create());
    return inverseMap.get(Ordering.natural().immutableSortedCopy(inverseMap.keys())
                                    .reverse().get(0));
}


A non-Guava-based approach

With Java 8, one can also learn to apply the new Stream-based processing techniques that can simplify most of the explicit looping.

For example, to read a file in as a Stream of lines using Files.lines(Path):

try (Stream lines = Files.lines(Paths.get(path))) {
    // perform further operations on lines, e.g. lines.collect()...
}


Processing each line into your map, i.e. step 3 above, can be done by:

  • flapMap()-ing every line with a replacement of Stream of words,



  • The method reference to use here will be Pattern::splitAsStream.



  • filter()-ing for non-empty Strings, before performing a



-
collect(), groupingBy(), and then counting() on the word occurrences:

private static final Pattern LINE_SPLIITER = Pattern.compile("\\W");

private static Map process(Stream lines) {
    return lines.flatMap(LINE_SPLIITER::splitAsStream).filter(s -> !s.isEmpty())
            .collect(Collectors.groupingBy(String::toLowerCase, Collectors.counting()));
}


The equivalent of using a Guava Multimap so that the values can be sorted is to apply another form of groupingBy() on the resulting Map from above:

```
private static > Collection getMaximumByValue(
Map map) {

Code Snippets

try (BufferedReader reader = new BufferedReader(new FileReader("text.txt"))) {
    String line;
    while ((line = reader.readLine()) != null) {
        // ...
    }
}
map.merge(word, 1, (a, b) -> a + b);
Multimap<Integer, String> inverseMap = Multimaps.invertFrom(Multimaps.forMap(map),
                                            ArrayListMultimap.create());
private static <K, V extends Comparable<V>> Collection<K> getMaximumByValue(
        Map<K, V> map) {
    Multimap<V, K> inverseMap = Multimaps.invertFrom(Multimaps.forMap(map),
                                            ArrayListMultimap.create());
    return inverseMap.get(Ordering.natural().immutableSortedCopy(inverseMap.keys())
                                    .reverse().get(0));
}
try (Stream<String> lines = Files.lines(Paths.get(path))) {
    // perform further operations on lines, e.g. lines.collect()...
}

Context

StackExchange Code Review Q#101695, answer score: 2

Revisions (0)

No revisions yet.