HiveBrain v1.2.0
Get Started
← Back to all entries
patternMinor

Is this the Clojure way to web-scrape a book cover image?

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
thistheclojureimagebookwayscrapewebcover

Problem

Is there a way to write this better or more Clojure way? Especially the last part with with-open and the let. Should I put the -> form into a separate function? Is this function to overloaded (to much stuff) or is this still ok?

(defn scrape-book-cover [url image-filename]
   (let [out-stream (io/output-stream image-filename)
         image (-> url
                   (java.net.URL.)
                   (html/html-resource)
                   (html/select [:div.bookcoverXXL :> :div :> :img])
                   (first) (:attrs) (:src)
                   (->> (str "http:"))
                   (http-client/get {:as :byte-array})
                   (:body))]
     (with-open [out out-stream]
       (.write out image))))

Solution

@abuzittingillifirca made a good point in that you can leave the parentheses off of the -> forms that are just a single element.

html/html-resource is actually quite flexible; it can take a variety of different kinds of arguments, including strings, URLs, input streams, etc. Assuming the url argument to your function is a string, you can take out the java.net.URL. part and just feed url directly into html/html-resource. In fact, I would begin the -> part like this:

(-> (html/html-resource url)
    ...


So (html/html-resource url) is like the "first step" in the threading macro. Less steps makes your code appear simpler and easier to read, just as long as the steps aren't overly complicated.

For the (first) (:attrs) (:src) part, you might consider using get-in. This function comes in handy for navigating nested associative structures like vectors and maps -- it's perfect for something like enlive. You could re-word this part of your code as:

(get-in [0 :attrs :src])


It isn't much shorter than the original, but I think it reads better, and it "feels" shorter because you're only calling one function.

It's probably not very Clojurian to have a ->> macro among the steps to a -> macro. If you have a situation like this, where one of the steps to the -> requires you to put the argument last instead of second, you can do something like (#(str "http:" %)) -- create an anonymous function literal with the % wherever you want the argument to go, and then call it by putting parentheses around it.

So, here is how I would refactor your code:

(defn scrape-book-cover [url image-filename]
  (let [out-stream (io/output-stream image-filename)
        image (-> (html/html-resource url)
                  (html/select [:div.bookcoverXXL :> :div :> :img])
                  (get-in [0 :attrs :src])
                  (#(str "http:" %))
                  (http-client/get {:as :byte-array})
                  :body)]
    (with-open [out out-stream]
      (.write out image))))


EDIT: To answer your question: No, I don't think this function is overloaded. When you simplify like I did above, it doesn't seem too long or complicated.

EDIT 2: Of course, there's nothing stopping you from pulling out the -> part and making it a separate function. It might make your code easier to read. You could call the function something like imgurl->bytes.

(defn imgurl->bytes [url]
  (-> (html/html-resource url)
      (html/select [:div.bookcoverXXL :> :div :> :img])
      (get-in [0 :attrs :src])
      (#(str "http:" %))
      (http-client/get {:as :byte-array})
      :body))

(defn scrape-book-cover [url image-filename]
  (let [out-stream (io/output-stream image-filename)
        image (imgurl->bytes url)]
    (with-open [out out-stream]
      (.write out image))))


Come to think of it, I like this better. If you look at a lot of Clojure code on github, you will see a lot of things broken up into simple helper functions. I suppose it is generally idiomatic in Clojure to break things up in helper functions.

EDIT 3 (5/16/14): Someone just upvoted my answer, which made me remember this code and have a quick re-visit. You could actually further refactor this version of the scrape-book-cover function by taking out the let statement:

(defn scrape-book-cover [url image-filename]
  (with-open [out (io/output-stream image-filename)]
    (.write out (imgurl->bytes url))))

Code Snippets

(-> (html/html-resource url)
    ...
(get-in [0 :attrs :src])
(defn scrape-book-cover [url image-filename]
  (let [out-stream (io/output-stream image-filename)
        image (-> (html/html-resource url)
                  (html/select [:div.bookcoverXXL :> :div :> :img])
                  (get-in [0 :attrs :src])
                  (#(str "http:" %))
                  (http-client/get {:as :byte-array})
                  :body)]
    (with-open [out out-stream]
      (.write out image))))
(defn imgurl->bytes [url]
  (-> (html/html-resource url)
      (html/select [:div.bookcoverXXL :> :div :> :img])
      (get-in [0 :attrs :src])
      (#(str "http:" %))
      (http-client/get {:as :byte-array})
      :body))

(defn scrape-book-cover [url image-filename]
  (let [out-stream (io/output-stream image-filename)
        image (imgurl->bytes url)]
    (with-open [out out-stream]
      (.write out image))))
(defn scrape-book-cover [url image-filename]
  (with-open [out (io/output-stream image-filename)]
    (.write out (imgurl->bytes url))))

Context

StackExchange Code Review Q#44862, answer score: 9

Revisions (0)

No revisions yet.