HiveBrain v1.2.0
Get Started
← Back to all entries
patternMinor

Simple web-scraper in Common Lisp (SBCL)

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
simplelispsbclwebcommonscraper

Problem

I have written a simple web-scraper in Common Lisp, & would greatly appreciate any feedback:

``
(defpackage :myfitnessdata
(:use :common-lisp)
(:export #:main))

(in-package :myfitnessdata)

(require :sb-posix)
(load (merge-pathnames "quicklisp/setup.lisp" (user-homedir-pathname)))
(ql:quickload '("drakma"
"closure-html"
"cxml-stp"
"net-telent-date"))

(defun show-usage ()
(format t "MyFitnessData - a CSV web scraper for the MyFitnessPal website.~%")
;; snip
(format t "'c:\\Users\\bob\\weights.csv', overwriting it if it exists.~%"))

(defun login (username password)
"Logs in to www.myfitnesspal.com. Returns a cookie-jar containing authentication details."
(let ((cookie-jar (make-instance 'drakma:cookie-jar)))
(drakma:http-request "http://www.myfitnesspal.com/account/login"
:method :post
:parameters
(("username" . ,username) ("password" . ,password))
:cookie-jar cookie-jar)
cookie-jar))

(defun logged-in? (cookie-jar)
"Returns true if a cookie-jar contains login information for www.myfitnesspal.com, and nil otherwise."
(let ((logged-in? nil))
(loop for cookie in (drakma:cookie-jar-cookies cookie-jar) do
(if (and (equal (drakma:cookie-name cookie) "known_user")
(equal (drakma:cookie-domain cookie) "www.myfitnesspal.com")
(drakma:cookie-value cookie))
(setq logged-in? t)))
logged-in?))

(defun get-page (page-num cookie-jar)
"Downloads a potentially invalid HTML page containing data to scrape. Returns a string containing the HTML."
(let ((url (concatenate 'string "http://www.myfitnesspal.com/measurements/edit?type=1&page=" (write-to-string page-num))))
(let ((body (drakma:http-request url :cookie-jar cookie-jar)))
(if (search "No measurements found." body)
nil
body))))

(defun scrape-body (body)
"Scrapes data from a potentially invalid HTML document, returning a list of lists of values."
(let (

Solution

You might want to take a look at defining asdf systems instead of using quicklisp to load dependencies internally.

The standard way of doing this is to set up an asd file. Here's a decent walk-through of that process. It's more verbose than ql:quickload, but it lets people who don't have quicklisp use your package regardless.

On second thought, screw those guys, keep it up.

(defun logged-in? (cookie-jar)       
  "Returns true if a cookie-jar contains login information for www.myfitnesspal.com, and nil otherwise."
  (let ((logged-in? nil))
    (loop for cookie in (drakma:cookie-jar-cookies cookie-jar) do
      (if (and (equal (drakma:cookie-name cookie) "known_user")
           (equal (drakma:cookie-domain cookie) "www.myfitnesspal.com")
           (drakma:cookie-value cookie))
          (setq logged-in? t)))
    logged-in?))


There's actually a loop shorthand for "make sure each member of list satisfies predicate". The above function can be written as

(defun logged-in? (cookie-jar)       
  "Returns true if a cookie-jar contains login information for www.myfitnesspal.com, and nil otherwise."
  (loop for cookie in (drakma:cookie-jar-cookies cookie-jar)
        always (and (equal (drakma:cookie-name cookie) "known_user")
                    (equal (drakma:cookie-domain cookie) "www.myfitnesspal.com"))))


foo? is the Scheme convention for predicates. The common CL conventions are foop or foo-p. Personally, I prefer foo? too, just be aware that it's not standard.

...
(sorted-list (sort list #'first-column-as-date-ascending)))
...


This can get you into trouble. The Common Lisp sort should really be named sort!, because it's destructive (so sorted-list will now contain a sorted list, but list won't still be the unsorted list, and isn't guaranteed to be the complete sequence anymore). If you might use list again later, instead do

...
(sorted-list (sort (copy-list list) #'first-column-as-date-ascending)))
...


(if (search "No measurements found." body)
    nil
    body)


Can be written as

(unless (search "No measurements found." body) body)


EDIT:

format can accept nested iterations in the directive, so you can eliminate separate-values by writing make-csv as

(defun make-csv (list)
  "Takes a list of lists of values, and returns a string containing a CSV file representing each top-level list as a row."
  (let ((sorted-list (sort list #'first-column-as-date-ascending)))
    (format nil "~{~{~A~^,~}~^~%~}" sorted-list)))


You could eliminate make-csv entirely by putting the above sort+directive directly into write-csv (this would also save you a trip through the CSV string, which may or may not make a significant difference).

recursive-scrape-page can be simplified down to

(defun scrape-page (page-num cookie-jar)
  (loop for i from page-num 
    if (get-page i cookie-jar) collect it into pg
      else return pg))


As a rule, Common Lisp doesn't guarantee tail-calls the way Scheme does, so it's generally a better idea to use a loop than raw recursion. SBCL does support some tail calls, but it isn't guaranteed (though this situation looks simple enough that it just might; do some profiling and compare).

You should be able to simplify scrape-xhtml in a similar way to eliminate (let ((results nil)).

Note that I haven't tested or profiled any of this since I don't have a "MyFitnessPal" account. Check that it works first.

EDIT the Second:

...
 (let ((valid-xhtml (chtml:parse body (cxml:make-string-sink))))
   (let ((xhtml-tree (chtml:parse valid-xhtml (cxml-stp:make-builder))))
 ...


You use this nested let idiom in a couple of places. I assume this is just because the value of xhtml-tree depends on the value of valid-html. In this case, you can instead write

...
 (let* ((valid-xhtml (chtml:parse body (cxml:make-string-sink)))
        (xhtml-tree (chtml:parse valid-xhtml (cxml-stp:make-builder))))
 ...

Code Snippets

(defun logged-in? (cookie-jar)       
  "Returns true if a cookie-jar contains login information for www.myfitnesspal.com, and nil otherwise."
  (let ((logged-in? nil))
    (loop for cookie in (drakma:cookie-jar-cookies cookie-jar) do
      (if (and (equal (drakma:cookie-name cookie) "known_user")
           (equal (drakma:cookie-domain cookie) "www.myfitnesspal.com")
           (drakma:cookie-value cookie))
          (setq logged-in? t)))
    logged-in?))
(defun logged-in? (cookie-jar)       
  "Returns true if a cookie-jar contains login information for www.myfitnesspal.com, and nil otherwise."
  (loop for cookie in (drakma:cookie-jar-cookies cookie-jar)
        always (and (equal (drakma:cookie-name cookie) "known_user")
                    (equal (drakma:cookie-domain cookie) "www.myfitnesspal.com"))))
...
(sorted-list (sort list #'first-column-as-date-ascending)))
...
...
(sorted-list (sort (copy-list list) #'first-column-as-date-ascending)))
...
(if (search "No measurements found." body)
    nil
    body)

Context

StackExchange Code Review Q#2277, answer score: 6

Revisions (0)

No revisions yet.