patternjavaMinor
N grams generator
Viewed 0 times
gramsgeneratorstackoverflow
Problem
I recently had a programming challenge, which was to create the N grams. The description is as follows:
Trigram analysis is very simple. Look at each set of three adjacent
words in a document. Use the first two words of the set as a key, and
remember the fact that the third word followed that key. Once you’ve
finished, you know the list of individual words that can follow each
two word sequence in the document. For example, given the input:
You might generate:
This says that the words "I wish" are twice followed by the word "I", the words "wish I" are followed once by "may" and once by "might"
and so on.
To generate new text from this analysis, choose an arbitrary word pair as a starting point. Use these to look up a random next word
(using the table above) and append this new word to the text so far.
This now gives you a new word pair at the end of the text, so look up
a potential next word based on these. Add this to the list, and so on.
In the previous example, we could start with "I may". The only
possible next word is "I", so now we have:
The last two words are "may I", so the next word is "wish". We then
look up "I wish", and find our choice is constrained to another "I".
Now we look up "wish I", and find we have a choice. Let’s choose
"may".
Now we’re back where we started from, with "I may." Following the same
sequence, but choosing "might" this time, we get:
At this point we stop, as no sequence starts "I might."
Given a book, so the resulting output can be surprising.
I've implemented this idea in Java with generic Ngrams, whose value should be > 1. I want to know whether I can improve the efficiency of my code like runtime
Trigram analysis is very simple. Look at each set of three adjacent
words in a document. Use the first two words of the set as a key, and
remember the fact that the third word followed that key. Once you’ve
finished, you know the list of individual words that can follow each
two word sequence in the document. For example, given the input:
I wish I may I wish I mightYou might generate:
"I wish" => ["I", "I"]
"wish I" => ["may", "might"]
"may I" => ["wish"]
"I may" => ["I"]This says that the words "I wish" are twice followed by the word "I", the words "wish I" are followed once by "may" and once by "might"
and so on.
To generate new text from this analysis, choose an arbitrary word pair as a starting point. Use these to look up a random next word
(using the table above) and append this new word to the text so far.
This now gives you a new word pair at the end of the text, so look up
a potential next word based on these. Add this to the list, and so on.
In the previous example, we could start with "I may". The only
possible next word is "I", so now we have:
I may IThe last two words are "may I", so the next word is "wish". We then
look up "I wish", and find our choice is constrained to another "I".
I may I wish INow we look up "wish I", and find we have a choice. Let’s choose
"may".
I may I wish I mayNow we’re back where we started from, with "I may." Following the same
sequence, but choosing "might" this time, we get:
I may I wish I may I wish I mightAt this point we stop, as no sequence starts "I might."
Given a book, so the resulting output can be surprising.
I've implemented this idea in Java with generic Ngrams, whose value should be > 1. I want to know whether I can improve the efficiency of my code like runtime
Solution
If
If the user is inputting the text directly I don't believe text can be
I included the upper example in the final code.
I am usually in support of using whitespace in programming. However, in this case I think you have a lot of superfluous whitespace in your program. Especially in between braces.
You comment what the variable
I don't see the imports you need. I assume you have all of these somewhere else in your program.
Other than that, the program looks pretty good. You didn't include
Final code:
text is null, you throw an IllegalArgumentException. It should throw a NullPointerException.if (text == null) throw new NullPointerException("Check the input parameters: String should not be null");
if(n 1");If the user is inputting the text directly I don't believe text can be
null (I'm not completely sure on this), you can simplify your if condition test down a bit using the String method .equals().if (n<1 || text.equals(""))I included the upper example in the final code.
I am usually in support of using whitespace in programming. However, in this case I think you have a lot of superfluous whitespace in your program. Especially in between braces.
// There are only 4 real lines of code here, 5 if you include the comment.
// With your spacing scheme you have 10 (comment included).
}
// System.out.println("Key: "+key+" Value : "+words[i+j]);
}
return nGramsMap;
}You comment what the variable
n is, instead of giving it a more helpful name to begin with. It's good that you had the comment though, many people don't comment what obscure variables mean.private int value; // Value for NGramI don't see the imports you need. I assume you have all of these somewhere else in your program.
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;Other than that, the program looks pretty good. You didn't include
main(), so I couldn't do too much refining without making sure I wasn't breaking the program.Final code:
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;
public class Test
{
private String words[]; // List of words in Supplied String
private HashMap> nGramsMap; // Mapping of the Ngrams
private int value; // Value for NGram
Test(String text, int n)
{
if (text == null) throw new NullPointerException("Check the input parameters: String should not be null");
if(n 1");
words = text.split(" ");
nGramsMap = new HashMap>();
this.value = n;
}
public Map> generateNgrams()
{
// Filling the HashMap based on N value and using WOrds Array
/*
* Look at each set of N adjacent words in a document. Use the first N-1 words of the set as a key, and remember the fact that the nth word
* followed that key. Once you’ve finished, you know the list of individual words that can follow each N-1 word sequence in the document.
*/
for (int i = 0; i list = new ArrayList<>();
list.add(words[i + j]);
nGramsMap.put(key, list);
}
else
{
List list = nGramsMap.get(key);
list.add(words[i + j]);
}
// System.out.println("Key: "+key+" Value : "+words[i+j]);
}
return nGramsMap;
}
public String getNGramsText()
{
/*
* To generate new text from this analysis, choose an arbitrary word pair as a starting point. Use these to look up a random next word (using
* the table above) and append this new word to the text so far. This now gives you a new word pair at the end of the text, so look up a
* potential next word based on these. Add this to the list, and so on.
*/
if (nGramsMap.size() == 0) generateNgrams();
StringBuilder result = new StringBuilder();
Random gen = new Random();
int st = gen.nextInt(words.length - value); // arbitrary word pair
StringBuilder startSb = new StringBuilder();
for (int i = 0; i 1)
{
st = gen.nextInt(size - 1);
next = nGramsMap.get(start).remove(st);
}
else
{
st = 0;
next = nGramsMap.get(start).get(st);
}
String start_split[] = start.split(" ");
String nextKey;
if (start_split.length > 1)
{
nextKey = start.substring(start_split[0].length() + 1);
start = nextKey + " " + next;
}
else start = next;
result.append(next); // append this new word to the text so far
if (nGramsMap.containsKey(start)) result.append(" ");
else break;
}
return result.toString();
}
}Code Snippets
if (text == null) throw new NullPointerException("Check the input parameters: String should not be null");
if(n<1 || text.length()==0) throw new IllegalArgumentException("Check the input parameters : String should not empty and Ngram value should be > 1");if (n<1 || text.equals(""))// There are only 4 real lines of code here, 5 if you include the comment.
// With your spacing scheme you have 10 (comment included).
}
// System.out.println("Key: "+key+" Value : "+words[i+j]);
}
return nGramsMap;
}private int value; // Value for NGramimport java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;Context
StackExchange Code Review Q#33300, answer score: 5
Revisions (0)
No revisions yet.