patternjavaMinor
AutoComplete program using the n-gram model
Viewed 0 times
theprogramautocompletegramusingmodel
Problem
For my Advanced Data Mining class (undergrad) we were to design a program that would predict the next word a user is likely to type via automatic text classification using the n-gram model.
The following is what I came up with. The reason I am posting this is because I am about to graduate and I want to be aware of any bad habits, inefficiencies, or examples of poor implementation that I might be prone to before I have to go out and interview for Data Science Graduate programs. Unfortunately, my professors are too busy to give this kind of analysis and it seems like they just grade us on whether a program works or not (this one does).
Note: This program uses the AOL search data from here in a text file called searchterms.txt.
```
import java.io.*;
import java.util.*;
public class AutoComplete {
static LinkedList sentences = new LinkedList();
public static void main(String[] args) throws FileNotFoundException {
/ Define Variables /
int n = 3;
Hashtable> nGram = new Hashtable>();
Scanner inFile = new Scanner(new File("searchterms.txt"));
Scanner input = new Scanner(System.in);
/ Output for progress tracking /
System.out.println("Reading in search data...");
/ Populate the LinkedList sentences with data from AOL search dataset /
while(inFile.hasNext()) {
String unparsed = inFile.nextLine().intern();
String[] parsed = unparsed.split("\t");
sentences.add(" " + parsed[1] + " ");
}
inFile.close();
/ Output for progress tracking /
System.out.println("Successfully archived searches.");
System.out.println("Creating 3 grams...");
/ Split sentences into words /
for(String s : sentences) {
String[] words = s.split("[\\s]");
for(int i = 0; i " + terms[0];
//Output for testing
//System.out.println(sTerm);
} else {
The following is what I came up with. The reason I am posting this is because I am about to graduate and I want to be aware of any bad habits, inefficiencies, or examples of poor implementation that I might be prone to before I have to go out and interview for Data Science Graduate programs. Unfortunately, my professors are too busy to give this kind of analysis and it seems like they just grade us on whether a program works or not (this one does).
Note: This program uses the AOL search data from here in a text file called searchterms.txt.
```
import java.io.*;
import java.util.*;
public class AutoComplete {
static LinkedList sentences = new LinkedList();
public static void main(String[] args) throws FileNotFoundException {
/ Define Variables /
int n = 3;
Hashtable> nGram = new Hashtable>();
Scanner inFile = new Scanner(new File("searchterms.txt"));
Scanner input = new Scanner(System.in);
/ Output for progress tracking /
System.out.println("Reading in search data...");
/ Populate the LinkedList sentences with data from AOL search dataset /
while(inFile.hasNext()) {
String unparsed = inFile.nextLine().intern();
String[] parsed = unparsed.split("\t");
sentences.add(" " + parsed[1] + " ");
}
inFile.close();
/ Output for progress tracking /
System.out.println("Successfully archived searches.");
System.out.println("Creating 3 grams...");
/ Split sentences into words /
for(String s : sentences) {
String[] words = s.split("[\\s]");
for(int i = 0; i " + terms[0];
//Output for testing
//System.out.println(sTerm);
} else {
Solution
Your code looks at first glance quite complete and professional, so onto the points. I hereby assume that you are using Java 7, since you have not made any restrictions and it is the most common version, though I may be wrong.
-
Consider changing your programs design. Currently you have an
I would advice to change the following points:
-
Use diamond inference where possible, this means that for example
-
Code against interfaces instead of against classes. Take your
-
I see that you only loop over the
-
Consider changing from using the
-
Prefer a class that receives print statements over directly printing to
-
A
-
Do not catch all exceptions with
As a whole, the best advice I can give you is to consider more abstraction, make your methods smaller and give them a single responsibility. A very important second advice is to use language features that are the standard in this day.
-
Consider changing your programs design. Currently you have an
AutoComplete class, with almost everything in the main class. Now what happens if you want to run two AutoComplete instances simultaneously? You cannot do that in one program.I would advice to change the following points:
- Make an
AutoCompleteclass that can operate on it's own, you tell it what to do, what the inputs are, and you can call methods on it that give you output.
- One candidate for refactoring is the input file, this should be an input argument.
- Another point is that you request user input inside your processing, the user input should be asked beforehand and also be an input parameter.
- The prediction which gets printed while processing, should be an output.
-
Use diamond inference where possible, this means that for example
LinkedList sentences = new LinkedList(); can be written as LinkedList sentences = new LinkedList<>();.-
Code against interfaces instead of against classes. Take your
LinkedList sentences again. Nowhere I see a requirement to use a LinkedList here, you just want to use a list, so only constrain yourself to writing: List sentences = new LinkedLIst<>(). This allows you to change the exact type of List at a later point.-
I see that you only loop over the
LinkedList sentences, you have no special requirement to use a linked list, consider using the more or less default ArrayList, which provides constant lookup times and generally performs better. In your case the performance seems to be equal as all you do is, underlying to the enhanced for-loop, use an Iterator.-
Consider changing from using the
File API to the Path API at some point, it offers more future-ready changes and will coöperate better with Java 8.-
Prefer a class that receives print statements over directly printing to
System.out during processing. In bigger projects this usually is a logger framework, to which you then attach System.out writers and also file writers for logfiles. In your case you may use a simplified version of this.-
A
Hashtable is old, very old, use the nowadays standard called Map, with as default implementation a HashMap. Some method names/semantics may have changed, but they both serve the same purpose.-
Do not catch all exceptions with
Exception NullPointerException, you may have confused yourself here, but this catches all exceptions of type Exception (so all), and gives the caught exception the name NullPointerException. But even then, catching NullPointerExceptions is not good and you should just let them fall through such that they terminate your program (or thread), so you can actually fix the issue, rather than a Cannot make prediction. message.As a whole, the best advice I can give you is to consider more abstraction, make your methods smaller and give them a single responsibility. A very important second advice is to use language features that are the standard in this day.
Context
StackExchange Code Review Q#47560, answer score: 4
Revisions (0)
No revisions yet.