patternMinor
Simple web-scraper in Common Lisp (SBCL)
Viewed 0 times
simplelispsbclwebcommonscraper
Problem
I have written a simple web-scraper in Common Lisp, & would greatly appreciate any feedback:
``
:cookie-jar cookie-jar)
cookie-jar))
(defun logged-in? (cookie-jar)
"Returns true if a cookie-jar contains login information for www.myfitnesspal.com, and nil otherwise."
(let ((logged-in? nil))
(loop for cookie in (drakma:cookie-jar-cookies cookie-jar) do
(if (and (equal (drakma:cookie-name cookie) "known_user")
(equal (drakma:cookie-domain cookie) "www.myfitnesspal.com")
(drakma:cookie-value cookie))
(setq logged-in? t)))
logged-in?))
(defun get-page (page-num cookie-jar)
"Downloads a potentially invalid HTML page containing data to scrape. Returns a string containing the HTML."
(let ((url (concatenate 'string "http://www.myfitnesspal.com/measurements/edit?type=1&page=" (write-to-string page-num))))
(let ((body (drakma:http-request url :cookie-jar cookie-jar)))
(if (search "No measurements found." body)
nil
body))))
(defun scrape-body (body)
"Scrapes data from a potentially invalid HTML document, returning a list of lists of values."
(let (
``
(defpackage :myfitnessdata
(:use :common-lisp)
(:export #:main))
(in-package :myfitnessdata)
(require :sb-posix)
(load (merge-pathnames "quicklisp/setup.lisp" (user-homedir-pathname)))
(ql:quickload '("drakma"
"closure-html"
"cxml-stp"
"net-telent-date"))
(defun show-usage ()
(format t "MyFitnessData - a CSV web scraper for the MyFitnessPal website.~%")
;; snip
(format t "'c:\\Users\\bob\\weights.csv', overwriting it if it exists.~%"))
(defun login (username password)
"Logs in to www.myfitnesspal.com. Returns a cookie-jar containing authentication details."
(let ((cookie-jar (make-instance 'drakma:cookie-jar)))
(drakma:http-request "http://www.myfitnesspal.com/account/login"
:method :post
:parameters (("username" . ,username) ("password" . ,password)):cookie-jar cookie-jar)
cookie-jar))
(defun logged-in? (cookie-jar)
"Returns true if a cookie-jar contains login information for www.myfitnesspal.com, and nil otherwise."
(let ((logged-in? nil))
(loop for cookie in (drakma:cookie-jar-cookies cookie-jar) do
(if (and (equal (drakma:cookie-name cookie) "known_user")
(equal (drakma:cookie-domain cookie) "www.myfitnesspal.com")
(drakma:cookie-value cookie))
(setq logged-in? t)))
logged-in?))
(defun get-page (page-num cookie-jar)
"Downloads a potentially invalid HTML page containing data to scrape. Returns a string containing the HTML."
(let ((url (concatenate 'string "http://www.myfitnesspal.com/measurements/edit?type=1&page=" (write-to-string page-num))))
(let ((body (drakma:http-request url :cookie-jar cookie-jar)))
(if (search "No measurements found." body)
nil
body))))
(defun scrape-body (body)
"Scrapes data from a potentially invalid HTML document, returning a list of lists of values."
(let (
Solution
You might want to take a look at defining
The standard way of doing this is to set up an
On second thought, screw those guys, keep it up.
There's actually a
This can get you into trouble. The Common Lisp
Can be written as
EDIT:
You could eliminate
As a rule, Common Lisp doesn't guarantee tail-calls the way Scheme does, so it's generally a better idea to use a
You should be able to simplify
Note that I haven't tested or profiled any of this since I don't have a "MyFitnessPal" account. Check that it works first.
EDIT the Second:
You use this nested let idiom in a couple of places. I assume this is just because the value of
asdf systems instead of using quicklisp to load dependencies internally.The standard way of doing this is to set up an
asd file. Here's a decent walk-through of that process. It's more verbose than ql:quickload, but it lets people who don't have quicklisp use your package regardless.On second thought, screw those guys, keep it up.
(defun logged-in? (cookie-jar)
"Returns true if a cookie-jar contains login information for www.myfitnesspal.com, and nil otherwise."
(let ((logged-in? nil))
(loop for cookie in (drakma:cookie-jar-cookies cookie-jar) do
(if (and (equal (drakma:cookie-name cookie) "known_user")
(equal (drakma:cookie-domain cookie) "www.myfitnesspal.com")
(drakma:cookie-value cookie))
(setq logged-in? t)))
logged-in?))There's actually a
loop shorthand for "make sure each member of list satisfies predicate". The above function can be written as(defun logged-in? (cookie-jar)
"Returns true if a cookie-jar contains login information for www.myfitnesspal.com, and nil otherwise."
(loop for cookie in (drakma:cookie-jar-cookies cookie-jar)
always (and (equal (drakma:cookie-name cookie) "known_user")
(equal (drakma:cookie-domain cookie) "www.myfitnesspal.com"))))foo? is the Scheme convention for predicates. The common CL conventions are foop or foo-p. Personally, I prefer foo? too, just be aware that it's not standard....
(sorted-list (sort list #'first-column-as-date-ascending)))
...This can get you into trouble. The Common Lisp
sort should really be named sort!, because it's destructive (so sorted-list will now contain a sorted list, but list won't still be the unsorted list, and isn't guaranteed to be the complete sequence anymore). If you might use list again later, instead do...
(sorted-list (sort (copy-list list) #'first-column-as-date-ascending)))
...(if (search "No measurements found." body)
nil
body)Can be written as
(unless (search "No measurements found." body) body)EDIT:
format can accept nested iterations in the directive, so you can eliminate separate-values by writing make-csv as(defun make-csv (list)
"Takes a list of lists of values, and returns a string containing a CSV file representing each top-level list as a row."
(let ((sorted-list (sort list #'first-column-as-date-ascending)))
(format nil "~{~{~A~^,~}~^~%~}" sorted-list)))You could eliminate
make-csv entirely by putting the above sort+directive directly into write-csv (this would also save you a trip through the CSV string, which may or may not make a significant difference).recursive-scrape-page can be simplified down to(defun scrape-page (page-num cookie-jar)
(loop for i from page-num
if (get-page i cookie-jar) collect it into pg
else return pg))As a rule, Common Lisp doesn't guarantee tail-calls the way Scheme does, so it's generally a better idea to use a
loop than raw recursion. SBCL does support some tail calls, but it isn't guaranteed (though this situation looks simple enough that it just might; do some profiling and compare).You should be able to simplify
scrape-xhtml in a similar way to eliminate (let ((results nil)).Note that I haven't tested or profiled any of this since I don't have a "MyFitnessPal" account. Check that it works first.
EDIT the Second:
...
(let ((valid-xhtml (chtml:parse body (cxml:make-string-sink))))
(let ((xhtml-tree (chtml:parse valid-xhtml (cxml-stp:make-builder))))
...You use this nested let idiom in a couple of places. I assume this is just because the value of
xhtml-tree depends on the value of valid-html. In this case, you can instead write...
(let* ((valid-xhtml (chtml:parse body (cxml:make-string-sink)))
(xhtml-tree (chtml:parse valid-xhtml (cxml-stp:make-builder))))
...Code Snippets
(defun logged-in? (cookie-jar)
"Returns true if a cookie-jar contains login information for www.myfitnesspal.com, and nil otherwise."
(let ((logged-in? nil))
(loop for cookie in (drakma:cookie-jar-cookies cookie-jar) do
(if (and (equal (drakma:cookie-name cookie) "known_user")
(equal (drakma:cookie-domain cookie) "www.myfitnesspal.com")
(drakma:cookie-value cookie))
(setq logged-in? t)))
logged-in?))(defun logged-in? (cookie-jar)
"Returns true if a cookie-jar contains login information for www.myfitnesspal.com, and nil otherwise."
(loop for cookie in (drakma:cookie-jar-cookies cookie-jar)
always (and (equal (drakma:cookie-name cookie) "known_user")
(equal (drakma:cookie-domain cookie) "www.myfitnesspal.com"))))...
(sorted-list (sort list #'first-column-as-date-ascending)))
......
(sorted-list (sort (copy-list list) #'first-column-as-date-ascending)))
...(if (search "No measurements found." body)
nil
body)Context
StackExchange Code Review Q#2277, answer score: 6
Revisions (0)
No revisions yet.