patternMinor
Parsing US address with Clojure
Viewed 0 times
withclojureparsingaddress
Problem
The parser below is not designed for every single US address. I am parsing through a file where each line may or may not be an address. I am more focused on speed rather than robustness.
-
-
-
Next in the code is some checks for invalid values. I finally have to split the address into the appropriate parts:
Overall, it feels like too much is going on, but I'm not sure how best to make it the most Clojure-like.
```
(def
^{:const true
:source "https://www.usps.com/send/official-abbreviations.htm"}
state-abbreviation-lookup
(hash-map
"AL" "AL"
"ALABAMA" "AL"
"AK" "AK"
"ALASKA" "AK"
;... removed for brevity
))
(def
^{:const true
:source "https://www.usps.com/send/official-abbreviations.htm"}
street-suffix-lookup
(hash-map
;... removed for brevity
"STREET" "ST"
"ST" "ST"
"STR" "ST"
"STRT" "ST"
;... removed for brevity
))
(defrecord UsAddress [street city state zipcode])
(letfn
[(find-last-occurence-index-of [lookup-map coll]
(last (keep-indexed #(when (contains? lookup-map %2) %1) coll)))
(split-string [s]
((comp #(clojure.string/split % #"\s+")
clojure.string/upper-case
#(clojure.string/replace % #"," ""))
s))]
(defn extract-us-address [s]
(let [parts (split-string s)]
(when (re-matches #"^\d{5}$" (last parts))
(let [state-index (find-last-occurence-index-of state-abbreviation-lookup parts)
-
state-abbreviation-lookup and street-suffix-lookup are booth used for consistency (addresses will always be "ST" instead of "STREET" or "STR")-
find-last-occurence-index-of searches through a list of string and finds the last string that matches a key in a hash-map. This is used with state-abbreviation-lookup and street-suffix-lookup to find where in the address the state and or street is.-
split-string removes commas, capitalizes everything and then splits on whitespace. Next in the code is some checks for invalid values. I finally have to split the address into the appropriate parts:
street, city, state, zipcode.Overall, it feels like too much is going on, but I'm not sure how best to make it the most Clojure-like.
```
(def
^{:const true
:source "https://www.usps.com/send/official-abbreviations.htm"}
state-abbreviation-lookup
(hash-map
"AL" "AL"
"ALABAMA" "AL"
"AK" "AK"
"ALASKA" "AK"
;... removed for brevity
))
(def
^{:const true
:source "https://www.usps.com/send/official-abbreviations.htm"}
street-suffix-lookup
(hash-map
;... removed for brevity
"STREET" "ST"
"ST" "ST"
"STR" "ST"
"STRT" "ST"
;... removed for brevity
))
(defrecord UsAddress [street city state zipcode])
(letfn
[(find-last-occurence-index-of [lookup-map coll]
(last (keep-indexed #(when (contains? lookup-map %2) %1) coll)))
(split-string [s]
((comp #(clojure.string/split % #"\s+")
clojure.string/upper-case
#(clojure.string/replace % #"," ""))
s))]
(defn extract-us-address [s]
(let [parts (split-string s)]
(when (re-matches #"^\d{5}$" (last parts))
(let [state-index (find-last-occurence-index-of state-abbreviation-lookup parts)
Solution
split-string can be simplifiedYou're defining a function that takes an argument
s and applies the composite of 3 different clojure.string functions to s within the body of the function. You could simplify this in 2 ways:-
(require '[clojure.string] :as s) -- now you can call all three functions in a much more concise way, e.g. s/split.-
(comp fn1 fn2 fn3 s) is already a function, so you could actually define split-string as just that, without needing to repeat s as part of the function. See below:(require '[clojure.string :as s])
(def split-string
(comp #(s/split % #"\s+") s/upper-case #(s/replace % #"," "")))This is just a matter of preference, but I would find it easier to read if you used a threading macro instead:
(defn split-string [s]
(-> s (s/replace #"," "") s/upper-case (s/split #"\s+")))defn should only be at the top level
In your code, you have the equivalent of this:
(letfn [(foo [x] (bar x))]
(defn baz [y] (foo y)))(To be more specific, I'm talking about
(letfn [(find-last-occurrence-index-of ...) (split-string [s] ...)] (defn extract-us address [s] ...)))Having a
defn inside of a let form is un-idiomatic, and can be a little bit confusing at first glance. The general rule is to only have def forms (including defn, defmacro, etc.) at the top level of your code. You can fix this by simply flipping it so that the letfn part is inside the defn part, like this:(defn baz [y]
(letfn [(foo [x] (bar x))]
(foo y)))This way, it is more clear that the functions being "
let'd" are internal to the function.Combine
when/let bindings when possibleThe main thing I notice, looking at your
extract-us-address function, is that there are multiple let and when statements. Sometimes this is unavoidable, but often you can tidy up your code by combining multiple bindings into a single let form. You could do something like the code below. Notice also that I've included your "letfns" find-last-occurrence-index-of and split-string as anonymous functions within the let bindings. (defn extract-us-address [s]
(let [find-last-occurrence-index-of
(fn [lookup-map coll]
(last (keep-indexed #(when (contains? lookup-map %2) %1) coll)))
split-string
(fn [s]
(-> s (s/replace #"," "") s/upper-case (s/split #"\s+")))
parts (split-string s)]
...Use helper functions
The other thing I notice is that you have one fairly complex function, where you could have one "main" function (
extract-us-address) which relies on a handful of seperate "helper" functions. Doing it that way generally makes your code easier to read. Here's an idea of how you could refactor your code into several smaller functions:; (to make it clearer which functions are helper functions, mark
; them as private functions by using defn- instead of defn)
(defn- find-last-occurrence-index-of [lookup-map coll]
(last (keep-indexed #(when (contains? lookup-map %2) %1) coll)))
; You could rename split-string to parts, then refer to (parts s),
; thereby avoiding needing to bind (split-string s) to parts in a
; let binding. I think this more clearly describes what the function
; is doing, as well.
(defn- parts [s]
(-> s (s/replace #"," "") s/upper-case (s/split #"\s+")))
(defn- zip-code [s]
(re-matches #"^\d{5}$" (last (parts s))))
(defn- street-index [s]
(find-last-occurrence-index-of street-suffix-lookup (parts s)))
(defn- state-index [s]
(find-last-occurrence-index-of state-abbreviation-lookup (parts s)))
(defn extract-us-address [s]
; this let binding is here to help performance -- we only want to evaluate
; each of these once, and bind it to a symbol
(let [zip-code (zip-code s)
street-index (street-index s)
state-index (state-index s)]
(when (and zip-code street-index state-index)
(UsAddress.
(clojure.string/join " " (take (inc street-index) (parts s)))
(clojure.string/join " " (drop (inc street-index) (take state-index (parts s))))
(state-abbreviation-lookup (nth (parts s) state-index))
zip-code))))Is there a better way to do this?
I noticed that there are a couple of potential flaws with your algorithm:
-
It relies on the fact(/assumption?) that there are no street suffixes that overlap with state abbreviations. This might be OK (I can't think of any examples that would break this), but it seems like a potential problem. This is a really contrived idea, but what if the street suffix "Ca" for "calle" catches on? Your code wouldn't know how to tell apart CAlle from CAlifornia.
-
It doesn't return
nil if the state is missing from where it's supposed to be, but present somewhere else, like in the street address. For example, try running your program on an address like "1000 NC HWY 70, RALEIGH, 27602." Assuming you have "HWY" listed as a suffix, you would expect nil (becauCode Snippets
(require '[clojure.string :as s])
(def split-string
(comp #(s/split % #"\s+") s/upper-case #(s/replace % #"," "")))(defn split-string [s]
(-> s (s/replace #"," "") s/upper-case (s/split #"\s+")))(letfn [(foo [x] (bar x))]
(defn baz [y] (foo y)))(defn baz [y]
(letfn [(foo [x] (bar x))]
(foo y)))(defn extract-us-address [s]
(let [find-last-occurrence-index-of
(fn [lookup-map coll]
(last (keep-indexed #(when (contains? lookup-map %2) %1) coll)))
split-string
(fn [s]
(-> s (s/replace #"," "") s/upper-case (s/split #"\s+")))
parts (split-string s)]
...Context
StackExchange Code Review Q#56275, answer score: 2
Revisions (0)
No revisions yet.