HiveBrain v1.2.0
Get Started
← Back to all entries
patternjavaMinor

Document clustering / NLP

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
clusteringnlpdocument

Problem

This is a solution from the CodeSprint 2012 coding competition. Any suggestions on ways to improve the object orientation or general style would be much appreciated. The code takes several text documents, encodes them into feature vectors using TF-IDF, and then applies k-means clustering to group documents with similar topics/themes.

```
/**
* Class containing an individual document
*/
private class Document implements Comparable {

// document title, contents and ID
private final String title;
private final String contents;
private final long id;
// whether document has been allocated to a cluster
private boolean allocated;
// document word histogram
private Vector histogram;
// encoded document vector (TF-IDF)
private Vector vector;
// precalculated document vector norms
private double norm;

public Document(long id, String title, String contents) {
this.id = id;
this.title = title;
this.contents = contents;
}

public void setVector(Vector vector) {
this.vector = vector;
}

public boolean isAllocated() {
return allocated;
}

/**
* Mark document as having been allocated to a cluster
*/
public void setIsAllocated() {
allocated = true;
}

/**
* Mark document as not being allocated to a cluster
*/
public void clearIsAllocated() {
allocated = false;
}

public long getId() {
return id;
}

public String getContents() {
return contents;
}

public Vector getVector() {
return vector;
}

public Vector getHistogram() {
return histogram;
}

public void setHistogram(Vector histogram) {
this.histogram = histogram;
}

public void setNorm(double norm) {
this.norm = norm;
}

public double getNorm() {
return norm;
}

/**
* Allow documents to be sorted by ID
*/
public int compareTo(Document document) {
if (id > document.getId()) {
return 1;
} else if (id {

/**
* Parse input string into document ID, document t

Solution

Some notes:

-
This comment is unnecessary:

// document title, contents and ID
private final String title;
private final String contents;
private final long id;


Since they are inside the Document class, it's obvious that the title is a title of a document, and so on.

-
// whether document has been allocated to a cluster
private boolean allocated;


Here I'd rename the variable to allocatedToCluster and remove the comment. It would make the code more readable since you don't have to check the declaration to figure out what it is allocated to.

-
For the same reason as above,

  • histogram should be wordHistogram,



  • vector should be encodedDocumentVector,



etc.

-
Input check is missing in the constructor:

public Document(long id, String title, String contents) {
    this.id = id;
    this.title = title;
    this.contents = contents;
}


Does it make sense to have a Document with null or empty String title or contents? If not, check it and throw an IllegalArgumentException. (Effective Java, Second Edition, Item 38: Check parameters for validity)

-

if (id > document.getId()) {
    return 1;
} else if (id < document.getId()) {
    return -1;
} else {
    return 0;
}


Instead of this code I'd use ObjectUtils.compare from Apache Commons Lang.

-
class DocumentList extends ArrayList


Check Effective Java, Second Edition, Item 16: Favor composition over inheritance.

-
Sometimes comments should be method names:

// add remaining documents to one of the k existing clusters
for (int iter = 0; iter < clusteringIterations; iter++) {


Create an addRemainingDocuments... method and move the for loop in it instead of the comment.

Code Snippets

// document title, contents and ID
private final String title;
private final String contents;
private final long id;
// whether document has been allocated to a cluster
private boolean allocated;
public Document(long id, String title, String contents) {
    this.id = id;
    this.title = title;
    this.contents = contents;
}
if (id > document.getId()) {
    return 1;
} else if (id < document.getId()) {
    return -1;
} else {
    return 0;
}
class DocumentList extends ArrayList<Document>

Context

StackExchange Code Review Q#7942, answer score: 2

Revisions (0)

No revisions yet.