patternjavaMinor
Document clustering / NLP
Viewed 0 times
clusteringnlpdocument
Problem
This is a solution from the CodeSprint 2012 coding competition. Any suggestions on ways to improve the object orientation or general style would be much appreciated. The code takes several text documents, encodes them into feature vectors using TF-IDF, and then applies k-means clustering to group documents with similar topics/themes.
```
/**
* Class containing an individual document
*/
private class Document implements Comparable {
// document title, contents and ID
private final String title;
private final String contents;
private final long id;
// whether document has been allocated to a cluster
private boolean allocated;
// document word histogram
private Vector histogram;
// encoded document vector (TF-IDF)
private Vector vector;
// precalculated document vector norms
private double norm;
public Document(long id, String title, String contents) {
this.id = id;
this.title = title;
this.contents = contents;
}
public void setVector(Vector vector) {
this.vector = vector;
}
public boolean isAllocated() {
return allocated;
}
/**
* Mark document as having been allocated to a cluster
*/
public void setIsAllocated() {
allocated = true;
}
/**
* Mark document as not being allocated to a cluster
*/
public void clearIsAllocated() {
allocated = false;
}
public long getId() {
return id;
}
public String getContents() {
return contents;
}
public Vector getVector() {
return vector;
}
public Vector getHistogram() {
return histogram;
}
public void setHistogram(Vector histogram) {
this.histogram = histogram;
}
public void setNorm(double norm) {
this.norm = norm;
}
public double getNorm() {
return norm;
}
/**
* Allow documents to be sorted by ID
*/
public int compareTo(Document document) {
if (id > document.getId()) {
return 1;
} else if (id {
/**
* Parse input string into document ID, document t
```
/**
* Class containing an individual document
*/
private class Document implements Comparable {
// document title, contents and ID
private final String title;
private final String contents;
private final long id;
// whether document has been allocated to a cluster
private boolean allocated;
// document word histogram
private Vector histogram;
// encoded document vector (TF-IDF)
private Vector vector;
// precalculated document vector norms
private double norm;
public Document(long id, String title, String contents) {
this.id = id;
this.title = title;
this.contents = contents;
}
public void setVector(Vector vector) {
this.vector = vector;
}
public boolean isAllocated() {
return allocated;
}
/**
* Mark document as having been allocated to a cluster
*/
public void setIsAllocated() {
allocated = true;
}
/**
* Mark document as not being allocated to a cluster
*/
public void clearIsAllocated() {
allocated = false;
}
public long getId() {
return id;
}
public String getContents() {
return contents;
}
public Vector getVector() {
return vector;
}
public Vector getHistogram() {
return histogram;
}
public void setHistogram(Vector histogram) {
this.histogram = histogram;
}
public void setNorm(double norm) {
this.norm = norm;
}
public double getNorm() {
return norm;
}
/**
* Allow documents to be sorted by ID
*/
public int compareTo(Document document) {
if (id > document.getId()) {
return 1;
} else if (id {
/**
* Parse input string into document ID, document t
Solution
Some notes:
-
This comment is unnecessary:
Since they are inside the
-
Here I'd rename the variable to
-
For the same reason as above,
etc.
-
Input check is missing in the constructor:
Does it make sense to have a
-
Instead of this code I'd use
-
Check Effective Java, Second Edition, Item 16: Favor composition over inheritance.
-
Sometimes comments should be method names:
Create an
-
This comment is unnecessary:
// document title, contents and ID
private final String title;
private final String contents;
private final long id;Since they are inside the
Document class, it's obvious that the title is a title of a document, and so on.-
// whether document has been allocated to a cluster
private boolean allocated;Here I'd rename the variable to
allocatedToCluster and remove the comment. It would make the code more readable since you don't have to check the declaration to figure out what it is allocated to.-
For the same reason as above,
histogramshould bewordHistogram,
vectorshould beencodedDocumentVector,
etc.
-
Input check is missing in the constructor:
public Document(long id, String title, String contents) {
this.id = id;
this.title = title;
this.contents = contents;
}Does it make sense to have a
Document with null or empty String title or contents? If not, check it and throw an IllegalArgumentException. (Effective Java, Second Edition, Item 38: Check parameters for validity)-
if (id > document.getId()) {
return 1;
} else if (id < document.getId()) {
return -1;
} else {
return 0;
}Instead of this code I'd use
ObjectUtils.compare from Apache Commons Lang.-
class DocumentList extends ArrayListCheck Effective Java, Second Edition, Item 16: Favor composition over inheritance.
-
Sometimes comments should be method names:
// add remaining documents to one of the k existing clusters
for (int iter = 0; iter < clusteringIterations; iter++) {Create an
addRemainingDocuments... method and move the for loop in it instead of the comment.Code Snippets
// document title, contents and ID
private final String title;
private final String contents;
private final long id;// whether document has been allocated to a cluster
private boolean allocated;public Document(long id, String title, String contents) {
this.id = id;
this.title = title;
this.contents = contents;
}if (id > document.getId()) {
return 1;
} else if (id < document.getId()) {
return -1;
} else {
return 0;
}class DocumentList extends ArrayList<Document>Context
StackExchange Code Review Q#7942, answer score: 2
Revisions (0)
No revisions yet.