Document term matrix in Clojure

Submitted by: @import:stackexchange-codereview
matrix term document clojure

Problem

This is my very first foray into Clojure (I'm normally a Python-pushing data-type). I'm trying to create a simple term-document matrix as a vector of vectors, out of a vector of strings.

For those who aren't into text mining, the term-document matrix is a dataset in matrix form where each column represents one distinct word in a set of documents, each row is a document, and each cell is the number of times a given word appears in a given document.
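To make that layout concrete, here is a tiny hand-written illustration. It follows the shape used later in this entry (a header row of terms as keywords, then one row of counts per document). The names `example-docs`, `example-matrix`, and `cell` are made up for this sketch and are not part of the code under review.

```clojure
;; Two toy documents (hypothetical data, for illustration only).
(def example-docs ["a cat" "a dog a dog"])

;; The matching term-document matrix, written out by hand:
;; the first vector is the header row of terms, and each
;; following vector holds the counts for one document.
(def example-matrix
  [[:a :cat :dog]
   [1 1 0]    ; "a cat"
   [2 0 2]])  ; "a dog a dog"

(defn cell
  "Look up how often `term` occurs in the document at `doc-index`."
  [matrix doc-index term]
  (let [[terms & rows] matrix]
    (get (zipmap terms (nth rows doc-index)) term)))

(cell example-matrix 1 :dog) ; => 2
(cell example-matrix 0 :dog) ; => 0
```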

This is the very first step in what I hope will be a useful text data-cleaning library, as well as a Clojure learning project. After the basics are nailed down, I want to add functionality like n-grams, stemming, removing sparse terms, etc. My ultimate goals involve performance, so I want to optimize this beginning part within an inch of its life right from the start in order to build on it later.

I'm also trying to minimize dependencies (right now there are none), though I'm willing to use Incanter or clojure.core.matrix if there are big performance gains to be had.

So obviously I have a long way to go, but here are some questions on this initial step:

- Is this "good clojure?" I tried to stick to sort of basic functional programming practice, composing lots of short functions with discrete behavior and such. But I'm not yet sure what the norms are otherwise.

- How do I optimize this? The particular parts of the existing code that smell funny to me, performance-wise, are:

  - terdocmmap: there's gotta be a more efficient way of handling the sorting here than building a bunch of sparse maps then sorting them all. Ideally I'd like to build them in sorted form from the start somehow.

  - termdocmatrix: the maps -> sequences -> vectors conversion seems really wasteful; I'd like to come up with a more efficient way.
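One possible direction for both points, offered as an unbenchmarked sketch rather than a drop-in replacement: collect the vocabulary into a sorted set once, then turn each document's `frequencies` map straight into a count vector in vocabulary order. The names `tokenize` and `td-matrix` are hypothetical, terms are kept as strings instead of keywords, and the input is assumed to be already cleaned, space-separated text.

```clojure
(require '[clojure.string :as str])

(defn tokenize
  "Split one document on runs of whitespace (assumes pre-cleaned text)."
  [doc]
  (str/split (str/trim doc) #"\s+"))

(defn td-matrix
  "Hypothetical sketch: returns a header row of sorted terms followed by
  one count vector per document."
  [docs]
  (let [tokenized (mapv tokenize docs)
        ;; The vocabulary is sorted exactly once, for all documents.
        vocab     (vec (into (sorted-set) cat tokenized))]
    (into [vocab]
          (map (fn [tokens]
                 ;; One frequency map per document, read out in vocab order.
                 (let [freqs (frequencies tokens)]
                   (mapv #(get freqs % 0) vocab))))
          tokenized)))

(td-matrix ["a cat" "a dog a dog"])
;; => [["a" "cat" "dog"] [1 1 0] [2 0 2]]
```

This avoids building a sparse map per document that includes every term, and it never sorts more than once. Whether it is actually faster on real data would need measuring.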

I'm not worrying about namespace and project structure at this stage.

```
(require '[clojure.string :as str])
(require '[clojure.walk :as walk])

(defn whitesplit
"split a vecto
```

Solution

I'm pretty new at Clojure myself, and haven't studied the collection algorithms very much yet, so this may not address your performance concerns, but I did find a few things that could be improved.

Potential problem with "real" document input

As I started going through your functions and how they work together, I noticed that your logic makes the assumption that all input will always just be words separated by spaces. Perhaps your data set is already preprocessed before it enters your termdocmatrix function? Unless that is the case, any text from actual documents written by humans will have many artifacts like punctuation marks and such that you should probably account for.

I ran these to illustrate what happens with more "natural" document text:

(def docs-punc ["this is, a cat" "this is a dog." "woof: and a meow" "woof; woof woof! meow? meow words"])
(whitesplit docs-punc)
; => ([this is, a cat] [this is a dog.] [woof: and a meow] [woof; woof woof! meow? meow words])
(termdocmatrix docs-punc)
; => [[:cat :dog. :is :this :woof: :is, :words :meow? :woof! :and :meow :woof; :woof :a] [1 0 0 1 0 1 0 0 0 0 0 0 0 1] [0 1 1 1 0 0 0 0 0 0 0 0 0 1] [0 0 0 0 1 0 0 0 0 1 1 0 0 1] [0 0 0 0 0 0 1 1 1 0 1 1 1 0]]


That totally messed up the results as you can see. I added a strip-punc function at the top (I made punc-to-remove its own form for readability, personal preference), and a helper function to apply it to a vector of strings:

;; Punctuation characters to strip; extend the character class as needed.
;; Square brackets are included so that tokens like "[a]" are cleaned too.
(def punc-to-remove #"[.,;:!?$%&*()\[\]]")

(defn strip-punc
  "Remove punctuation marks matched by `punc-to-remove` from the string `s`."
  [s]
  (str/replace s punc-to-remove ""))

(defn vec-strip-punc
  "Apply strip-punc to each string in a collection; returns a lazy seq."
  [strs]
  (map strip-punc strs))


Then change your bigmap function accordingly to call it before you split the strings:

(let [docs-no-punc (vec-strip-punc docs)
      stringvecs (whitesplit docs-no-punc)] ; etc.


Or alternatively inline style:

(let [stringvecs (whitesplit (vec-strip-punc docs))]


This will take care of pretty much all your general punctuation cases, and you can easily tweak the regex pattern as needed:

(def docs-punc ["this is, a cat%" "this $is a dog." "woof: and [a] meow*" "woof; (woof woof!) meow? meow words"])
(termdocmatrix docs-punc)
; => [[:cat :is :this :words :dog :and :meow :woof :a] [1 1 1 0 0 0 0 0 1] [0 1 1 0 1 0 0 0 1] [0 0 0 0 0 1 1 1 1] [0 0 0 1 0 0 2 3 0]]
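A different technique from the stripping approach above, shown only as a sketch: instead of listing the characters to remove, extract the runs of characters you want to keep with `re-seq`. The function name `words` and the choice of "letters, digits, and apostrophes" as word characters are assumptions made for this example, not something taken from the original code.

```clojure
(defn words
  "Return the runs of letters, digits, and apostrophes found in `doc`.
  Anything else (punctuation, brackets, whitespace) acts as a separator."
  [doc]
  (re-seq #"[\p{L}\p{N}']+" doc))

(words "woof; (woof woof!) meow? [a] meow*")
;; => ("woof" "woof" "woof" "meow" "a" "meow")

(words "this $is a dog.")
;; => ("this" "is" "a" "dog")
```

With this approach, tokenizing and cleaning happen in one pass, and unforeseen punctuation does not need to be added to a removal list. The trade-off is that you have to decide up front what counts as part of a word.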


Naming

Your names don't follow the typical Lisp naming convention. According to the Wikipedia article on naming conventions (programming):

Common practice in most Lisp dialects is to use dashes to separate words in identifiers, as in with-open-file and make-hash-table. Global variable names conventionally start and end with asterisks: *map-walls*. Constants names are marked by plus signs: +map-size+.
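A short sketch of how this looks in Clojure specifically. All names here (`count-terms`, `*min-count*`, `frequent-terms`) are invented for the example. Note that in Clojure the asterisk "earmuffs" are conventionally reserved for vars declared `^:dynamic`, which are meant to be rebound with `binding`.

```clojure
;; Dashes separate the words in a function name.
(defn count-terms
  "Count how often each token occurs."
  [tokens]
  (frequencies tokens))

;; Earmuffs mark a dynamic var that callers may rebind.
(def ^:dynamic *min-count* 1)

(defn frequent-terms
  "Keep only the terms that occur at least *min-count* times."
  [tokens]
  (into {}
        (filter (fn [[_ n]] (>= n *min-count*)))
        (count-terms tokens)))

(binding [*min-count* 2]
  (frequent-terms ["woof" "woof" "meow"]))
;; => {"woof" 2}
```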

Also, since most or all of your functions transform your data structure, I would suggest naming them in a way that says so. Using an abbreviation consistently, say td (or even TD) for term-document, would make the names read better without being really verbose.

termdocmatrix -> TD-matrix-from-docs
terdocmmap -> TD-map-from-docs
tdseqs -> TD-seqs-from-TD-map


I don't think bigmap is a descriptive name. What is "big" in this context? In truth it reminds me of a Cartesian product, since each document entry in the docs vector returns its own map of all possible words, e.g., {this 1, is 1, a 1, cat 1, dog 0, woof 0, and 0, meow 0, words 0}. I would be tempted to call it something like cartesian-product-map or perhaps just cartesian-map.

I would also suggest changing whitesplit to space-split, since that is really what it is doing (it is not splitting on other whitespace like \r, \n, or \t). Or, if you want to make it a true whitespace split, change #" " to #"\s", the character class that matches any whitespace character. Here is an article on RegexOne about it.
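The difference is easy to see at the REPL. The sample string below is made up for this illustration; the third call, with `#"\s+"`, goes one step beyond the suggestion above and treats a run of whitespace as a single separator.

```clojure
(require '[clojure.string :as str])

;; A tab, two consecutive spaces, and a newline between the words.
(def sample "woof\tand  a\nmeow")

(str/split sample #" ")
;; => ["woof\tand" "" "a\nmeow"]      ; only spaces split; tab and newline stay

(str/split sample #"\s")
;; => ["woof" "and" "" "a" "meow"]    ; every whitespace character splits

(str/split sample #"\s+")
;; => ["woof" "and" "a" "meow"]       ; runs of whitespace split once
```

The empty strings in the first two results would otherwise end up as a term of their own in the matrix.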

Context

StackExchange Code Review Q#121958, answer score: 8
