patternMinor

Using Viterbi algorithm to analyze sentences

Submitted by: @import:stackexchange-codereview·Mar 10, 2026·

Viewed 0 times

viterbianalyzealgorithmusingsentences

Problem

I've probably done some pretty horrendous things here, but I'm throwing it out for people to give me some feedback that I can start using to immediately improve my Clojure coding style.

Additional suggestions would be performance enhancements as well as areas where I could use transients if it is advisable.

So far I've been told:

I should make use of vector-of

Use primitives such as int and double to avoid boxing and unboxing

Type hint string functions

I would be grateful for suggestions that I use to turn this code into more idiomatic Clojure code.

```
(ns tagger.core
"Running Viterbi on some text"
(:require [clojure.string :as str]
[clojure.contrib.generic.functor :as functor]
[clojure.contrib.math :as math]
[clojure.set :as set]
[clojure.data :as data]
[clojure.data.finger-tree :as ft]))

(def ^:dynamic epsilon 0.01)

(defn applyAll [fs x]
(map #(% x) fs))

(defn split-evenly [coll]
(partition (quot (count coll) 10) coll))

(defn nil?-zero [fn & args]
(let [val (apply fn args)]
(if (nil? val)
0
val)))

;; Counts needed are:
;; word counts W
;; tag counts T
;; word-tag counts W-T
;; previous tag to current tag counts T(i+1)-Ti

;; Problem set
;; A sample set to play with
(def str-to-tags (slurp "resources/sample.txt"))

(defn str->tags [string]
(filter #(not (empty? %))
(str/split string #"[\s]")))

(defn tag->W-T [tag]
"Converts a tag such as In/IN into a W-T such as [In IN]"
(str/split tag #"[//]"))

(defn sentence->tags [sentence]
(map #(second %) sentence))

(defn strip-tags [tag]
(first tag))

(defn this-and-that [xs]
(map-indexed (fn [i x]
[x (concat (take i xs)
(drop (inc i) xs))])
xs))

(def cleaned-tag-str (filter #(= (count %) 2) (map tag->W-T (str->tags str-to-tags))))

(defn split-sentences [tag-str]
"Splits the str into sentences"
(reduce #(if (= (second (first %2)) "

Solution

First of all, good job! This is obviously a complex algorithm and it looks like it's working.

I'm going to do this incrementally. So I'll save this and keep editing as I go. And since it's so long, I won't get to everything. Plus I don't understand the algorithm too well.

First of all, doc strings in Clojure go before the arguments. I used to make this mistake all the time. The reason is that you can have multi-variate functions.

1:

(defn insert [m k]
  "Inserts a key k into a map m if it does not exist or increments the count if it does"
  (let [val (m k)]
    (assoc m k (inc (if (nil? val) 0 val)))))

My version:

(defn insert 
  "Inserts a key k into a map m if it does not exist or increments the count if it does"
  [m k]
  (update-in m [k] (fnil inc 0)))

See update-in and fnil.

2:

#() construction is unnecessary here:

(defn sentence->tags [sentence]
  (map #(second %) sentence))

My version:

(defn sentence->tags [sentence]
  (map second sentence))

(defn word-count [tagged-str]
  "Example of how to get word counts"
  (reduce #(insert %1 (first %2)) {} tagged-str))

Could this be replaced by using frequencies?

4:

Now it's starting to get hairy. I may make some mistakes here, because the code is not factored.

(defn viterbi-init [v path obs states start-p emit-p]
  "Initializes viterbi for us"
  (reduce
    #(into %1 {%2 [(* (start-p %2)
                 (emit-p (first obs) %2))
              (conj path %2)]})
    {}
    states))

This definitely has too many levels of into/reduce. You could do a (reduce #(assoc %1 %s ...) {} states) pattern. Do we need v? And why are we passing in a list obs when we only need the first? And it's often a good idea to put the driving sequence in the first or last position, so you can do threading. Let's try it this way:

(defn viterbi-init 
  "Initializes viterbi for us"
  [states path ob start-p emit-p]
  (into {}
    (for [state states]
      [state
       [(* (start-p state) (emit-p ob state))
        (conj path state)]])))

5:

This one is very hairy. I don't quite understand it, but I will try.

(defn viterbi-step [prior obs states trans-p emit-p]
  "Goes through one step of viterbi for us, taking a prior state and performing one step"
  (apply merge (map
                (comp (fn [[path v]] {(last path) [v path]}) #(apply max-key val %) #(apply merge %))
                ((fn [obs]
                   (map #(applyAll (map (comp (fn [[v past-st]]
                                                (fn [current-st]
                                                  {(conj (second (prior past-st)) current-st)
                                                   (* v (trans-p current-st past-st)
                                                      (emit-p obs current-st))}))
                                              (partial extract-prob-state prior)) states) %) states))
                 obs))))

So, I tried and failed to refactor this myself. But I will give my general feedback. What this function, which is a map of a map of a map of a map, tells me is that there is a failure of abstraction. viterbi-step should be should be a high-level function which should read somewhat like the inner loop of a pseudo-code implementation of Viterbi. This function relies too much on the structure of the data structures involved. Deeply nested structures are common, but a single function that accesses them so deeply is not. A good rule of thumb is at most 1 nested map/reduce within a function.

There need to be functions which act as your primitive operations here. I can see that you began writing some near the top. You should continue that trend here. Then your functions would be operating at a certain level and calling functions from the level below.

An alternative approach would be to turn the algorithm into a sequential series of steps. This may or may not apply here, but it is hard for me to tell. As an example (not real code!):

(defn viterbi-step [prior obs states trans-p emit-p]
  (-> obs
    (calculate-priors)
    (extract-prob-states states)
    (extract-path)
    (merge)))

Again, it's just an example. But the idea is that each function takes the data it needs and creates a new data structure that is the result of that calculation. I don't know if this is possible with this algorithm. But it could be. One hint I can give is that you know you are on the right track when your functions are returning "appropriate" data structures. That is, when the data is a mapping, you return a map. When it's a set, you return a set. Also, the functions don't take much more data than they need to calculate the answer. I suggest you take the iterative algorithm description and work backwards from the final output.

Again, nice going. It was a pleasure to go through it.

Code Snippets

(defn insert [m k]
  "Inserts a key k into a map m if it does not exist or increments the count if it does"
  (let [val (m k)]
    (assoc m k (inc (if (nil? val) 0 val)))))

(defn insert 
  "Inserts a key k into a map m if it does not exist or increments the count if it does"
  [m k]
  (update-in m [k] (fnil inc 0)))

(defn sentence->tags [sentence]
  (map #(second %) sentence))

(defn sentence->tags [sentence]
  (map second sentence))

(defn word-count [tagged-str]
  "Example of how to get word counts"
  (reduce #(insert %1 (first %2)) {} tagged-str))

Context

StackExchange Code Review Q#8270, answer score: 6

Revisions (0)

No revisions yet.