HiveBrain v1.2.0
Get Started
← Back to all entries
patternhtmlMinor

Selecting HTML nodes

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
nodeshtmlselecting

Problem

I'm new to Clojure, and I'm writing some code that processes HTML. I'm parsing the HTML using clj-tagsoup and then trying to pull out the relevant bits with this function:

(defn find-nodes [tag-prototype base-node]
  (letfn [(node-matches [tag-prototype node]
            (let [[tag attrs] tag-prototype]
              (and (= (nth node 0) tag)
                   (every? (fn [[key val]]
                             (= (key (nth node 1)) val))
                           (seq attrs)))))]
    (if (node-matches tag-prototype base-node)
      [base-node]
      (loop [nodes base-node matches []]
        (if (seq nodes)
          (recur (next nodes)
                 (let [node (first nodes)]
                   (if (vector? node)
                     (if (node-matches tag-prototype node)
                       (conj matches node)
                       (concat matches (find-nodes tag-prototype node)))
                     matches)))
          matches)))))


Used something like this:

(->> (tagsoup/parse file)
     (find-nodes [:div {:class "magic"}])
     (find-nodes [:p]))


The code works (finally), but the style is probably horrible. What would you suggest to improve it?

Sample input:


  
    This is a test page
  
  
    
      
        ...
      
    
    
      
        Hello World.
      
    
  


Sample query:

(pprint (find-nodes [:div {:id "something-important"}]
                    (tagsoup/parse "test.html")))


Sample output:

([:div
  {:id "something-important"}
  [:p {:class "alert"} "\n        ...\n      "]])


Sample query:

(pprint (find-nodes [:p]
                    (tagsoup/parse "test.html")))


Sample output:

([:p {:class "alert"} "\n        ...\n      "]
 [:p {:class "message"} "\n        Hello World.\n      "])

Solution

First things first: any time you find yourself using letfn, you should strongly consider simply breaking those functions out into their own defns. In this case, node-matches is large enough and decoupled enough to make that clearly the right choice.

One way you can make your code more concise is to take advantage of the fact that you can do destructuring directly in function arguments:

(defn node-matches [[tag attrs] [node-tag node-attrs]]
(and (= node-tag tag)
(every? (fn [[key val]]
(= (key node-attrs) val))
(seq attrs))))


If you want to make clear what it is that you're destructuring, you could add :as prototype and :as node to your parameters.

When you think about it, though, that every? expression is really just trying to determine if attrs is a subset of node-attrs. In Clojure, you can use the subset? function to do that. So, assuming that you've already required [clojure.set :as set]:

(set/subset? (set attrs) (set node-attrs))


A lot of the complexity in the rest of find-nodes comes from the fact that you're complecting the logic that traverses the HTML tree and the logic that uses node-matches to do the matching and assemble the final result.

The best way to improve find-nodes, then, is to break it apart. Actually, traversing a Hiccup-style HTML tree is incredibly easy in Clojure using the tree-seq function:

(tree-seq vector? tagsoup/children (tagsoup/parse "test.html"))


This will give you a sequence of all the nodes in the HTML tree:

([:html {}
[:head {}
[:title {} "This is a test page"]]
[:body {}
[:div {:id "something-important"}
[:p {:class "alert"}
"\n ...\n "]]
[:div {:id "something-else"}
[:p {:class "message"}
"\n Hello World.\n "]]]]
[:head {}
[:title {} "This is a test page"]]
[:title {} "This is a test page"]
"This is a test page"
[:body {}
[:div {:id "something-important"}
[:p {:class "alert"}
"\n ...\n "]]
[:div {:id "something-else"}
[:p {:class "message"}
"\n Hello World.\n "]]]
[:div {:id "something-important"}
[:p {:class "alert"}
"\n ...\n "]]
[:p {:class "alert"}
"\n ...\n "]
"\n ...\n "
[:div {:id "something-else"}
[:p {:class "message"}
"\n Hello World.\n "]]
[:p {:class "message"}
"\n Hello World.\n "]
"\n Hello World.\n ")


And at this point, you're pretty much done. You just need to filter this sequence using the shortened version of node-matches that we wrote above:

(defn find-nodes [[tag attrs] root]
(filter (fn [[node-tag node-attrs]]
(and (= tag node-tag) (set/subset? (set attrs) (set node-attrs))))
(tree-seq vector? tagsoup/children root)))

Context

StackExchange Code Review Q#126175, answer score: 3

Revisions (0)

No revisions yet.