HiveBrain v1.2.0
Get Started
← Back to all entries
patternjavaMinor

N grams generator

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
gramsgeneratorstackoverflow

Problem

I recently had a programming challenge, which was to create the N grams. The description is as follows:


Trigram analysis is very simple. Look at each set of three adjacent
words in a document. Use the first two words of the set as a key, and
remember the fact that the third word followed that key. Once you’ve
finished, you know the list of individual words that can follow each
two word sequence in the document. For example, given the input:

I wish I may I wish I might




You might generate:

"I wish" => ["I", "I"]
"wish I" => ["may", "might"]
"may I"  => ["wish"]
"I may"  => ["I"]




This says that the words "I wish" are twice followed by the word "I", the words "wish I" are followed once by "may" and once by "might"
and so on.
To generate new text from this analysis, choose an arbitrary word pair as a starting point. Use these to look up a random next word
(using the table above) and append this new word to the text so far.
This now gives you a new word pair at the end of the text, so look up
a potential next word based on these. Add this to the list, and so on.
In the previous example, we could start with "I may". The only
possible next word is "I", so now we have:

I may I




The last two words are "may I", so the next word is "wish". We then
look up "I wish", and find our choice is constrained to another "I".

I may I wish I




Now we look up "wish I", and find we have a choice. Let’s choose
"may".

I may I wish I may




Now we’re back where we started from, with "I may." Following the same
sequence, but choosing "might" this time, we get:

I may I wish I may I wish I might




At this point we stop, as no sequence starts "I might."
Given a book, so the resulting output can be surprising.

I've implemented this idea in Java with generic Ngrams, whose value should be > 1. I want to know whether I can improve the efficiency of my code like runtime

Solution

If text is null, you throw an IllegalArgumentException. It should throw a NullPointerException.

if (text == null) throw new NullPointerException("Check the input parameters: String should not be null");
if(n 1");


If the user is inputting the text directly I don't believe text can be null (I'm not completely sure on this), you can simplify your if condition test down a bit using the String method .equals().

if (n<1 || text.equals(""))


I included the upper example in the final code.

I am usually in support of using whitespace in programming. However, in this case I think you have a lot of superfluous whitespace in your program. Especially in between braces.

// There are only 4 real lines of code here, 5 if you include the comment.
// With your spacing scheme you have 10 (comment included).

                    }

//                  System.out.println("Key:  "+key+"       Value :   "+words[i+j]);
        }

        return nGramsMap;

    }


You comment what the variable n is, instead of giving it a more helpful name to begin with. It's good that you had the comment though, many people don't comment what obscure variables mean.

private int value; // Value for NGram


I don't see the imports you need. I assume you have all of these somewhere else in your program.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;


Other than that, the program looks pretty good. You didn't include main(), so I couldn't do too much refining without making sure I wasn't breaking the program.

Final code:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;
public class Test
{

    private String words[]; // List of words in Supplied String
    private HashMap> nGramsMap; // Mapping of the Ngrams
    private int value; // Value for NGram

    Test(String text, int n)
    {
        if (text == null) throw new NullPointerException("Check the input parameters: String should not be null");
        if(n 1");

        words = text.split(" ");
        nGramsMap = new HashMap>();
        this.value = n;
    }

    public Map> generateNgrams()
    {
        // Filling the HashMap based on N value and using WOrds Array

        /*
         * Look at each set of N adjacent words in a document. Use the first N-1 words of the set as a key, and remember the fact that the nth word
         * followed that key. Once you’ve finished, you know the list of individual words that can follow each N-1 word sequence in the document.
         */
        for (int i = 0; i  list = new ArrayList<>();
                list.add(words[i + j]);
                nGramsMap.put(key, list);

            }
            else
            {
                List list = nGramsMap.get(key);
                list.add(words[i + j]);
            }
            // System.out.println("Key:  "+key+"       Value :   "+words[i+j]);
        }
        return nGramsMap;
    }

    public String getNGramsText()
    {
        /*
         * To generate new text from this analysis, choose an arbitrary word pair as a starting point. Use these to look up a random next word (using
         * the table above) and append this new word to the text so far. This now gives you a new word pair at the end of the text, so look up a
         * potential next word based on these. Add this to the list, and so on.
         */
        if (nGramsMap.size() == 0) generateNgrams();
        StringBuilder result = new StringBuilder();
        Random gen = new Random();
        int st = gen.nextInt(words.length - value); // arbitrary word pair
        StringBuilder startSb = new StringBuilder();

        for (int i = 0; i  1)
            {
                st = gen.nextInt(size - 1);
                next = nGramsMap.get(start).remove(st);
            }
            else
            {
                st = 0;
                next = nGramsMap.get(start).get(st);
            }

            String start_split[] = start.split(" ");
            String nextKey;
            if (start_split.length > 1)
            {
                nextKey = start.substring(start_split[0].length() + 1);
                start = nextKey + " " + next;
            }
            else start = next;

            result.append(next); // append this new word to the text so far

            if (nGramsMap.containsKey(start)) result.append(" ");
            else break;
        }
        return result.toString();
    }
}

Code Snippets

if (text == null) throw new NullPointerException("Check the input parameters: String should not be null");
if(n<1 || text.length()==0)  throw new IllegalArgumentException("Check the input parameters : String should not  empty and Ngram value should be > 1");
if (n<1 || text.equals(""))
// There are only 4 real lines of code here, 5 if you include the comment.
// With your spacing scheme you have 10 (comment included).

                    }


//                  System.out.println("Key:  "+key+"       Value :   "+words[i+j]);
        }

        return nGramsMap;


    }
private int value; // Value for NGram
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

Context

StackExchange Code Review Q#33300, answer score: 5

Revisions (0)

No revisions yet.