patternpythonMinor
Python script for backing up reading information from Goodreads
Viewed 0 times
scriptgoodreadsreadingbackingforpythonfrominformation
Problem
I’ve written a script for automatically backing up my Goodreads data. This mimics the Export function on the Goodreads website, which returns a CSV file – but this doesn’t require going through a website, and provides JSON instead of CSV. The idea is that I could run this on a cron job, and have regular backups of my user data.
The script can be invoked without any arguments, or with one argument to specify the filename to write the backup to:
I’d be interested in any feedback, but particularly around:
-
My use of xml.etree. The Goodreads API mostly serves XML, but I’m not very familiar with this library, so I might not be making the best use of it. For ease of reviewing, I’ve posted an API response in a Gist so you can see what the API responses look like.
-
Is the code easy to follow? It makes a lot of sense to me, but I wrote it! I’m not sure whether it’s easily comprehensible to somebody else.
-
Uncaught errors. Are there any obvious error cases I’m failing to handle correctly? (Assume I trust that if I get a successful response, the XML will have the right structure.)
Here’s the code:
``
first command-line argument.
See the README for more details.
"""
from datetime import datetime, timezone
import json
import sys
import xml.etree.ElementTree as ET
import keyring
import requests
# Goodreads user ID
USER_ID = 'redacted' # keyring.get_password('goodreads', 'user_id')
# Goodreads API key. Obtain one from https://www.goodreads.com/api
API_KEY = 'redacted' # keyring.get_password('goodreads', 'api_key')
class TagDescriptor(object):
"""
Used internally by the Review class. The review class has an ETree
element representing an XML document of th
The script can be invoked without any arguments, or with one argument to specify the filename to write the backup to:
$ python3 backup_goodreads.py
$ python3 backup_goodreads.py my backup.jsonI’d be interested in any feedback, but particularly around:
-
My use of xml.etree. The Goodreads API mostly serves XML, but I’m not very familiar with this library, so I might not be making the best use of it. For ease of reviewing, I’ve posted an API response in a Gist so you can see what the API responses look like.
-
Is the code easy to follow? It makes a lot of sense to me, but I wrote it! I’m not sure whether it’s easily comprehensible to somebody else.
-
Uncaught errors. Are there any obvious error cases I’m failing to handle correctly? (Assume I trust that if I get a successful response, the XML will have the right structure.)
Here’s the code:
``
#!/usr/bin/env python3
# -- encoding: utf-8 --
"""
A script for backing up your Goodreads library. Writes a backup to
goodreads_backup.json`, or you can provide an alternative name as thefirst command-line argument.
See the README for more details.
"""
from datetime import datetime, timezone
import json
import sys
import xml.etree.ElementTree as ET
import keyring
import requests
# Goodreads user ID
USER_ID = 'redacted' # keyring.get_password('goodreads', 'user_id')
# Goodreads API key. Obtain one from https://www.goodreads.com/api
API_KEY = 'redacted' # keyring.get_password('goodreads', 'api_key')
class TagDescriptor(object):
"""
Used internally by the Review class. The review class has an ETree
element representing an XML document of th
Solution
- Review
-
This code uses
xml.etree.ElementTree. But the documentation says:Warning The
xml.etree.ElementTree module is not secure against maliciously constructed data. If you need to parse untrusted or unauthenticated data see XML vulnerabilities.Perhaps you're happy to trust the Goodreads API. But I think it's worth mentioning the concern.
-
It would make sense to pass in the user id and API key as parameters to the
get_reviews function.-
The function name
date_factory could be improved: what this function does is to convert a date from Goodreads format to your preferred format. So I would call it convert_date.-
If the results span multiple pages,
get_reviews calls itself recursively. But Python doesn't have tail-recursion elimination, so this will eventually result in a stack overflow if there are enough pages. It would be better to write a loop. See the revised code below.-
The code reads the API output as an XML document, and then builds
Review objects, each of which wraps a element in the XML document. The Review class knows how to search the XML document for each of the properties it wants (via the TagDescriptor descriptors, which call find on the XML element).I don't like this approach because (i) it seems overly complex (why keep the XML document around as your data store, when you could convert it once to some more convenient representation?); and (ii) it has the wrong complexity: finding the attributes for a review should be an \$O(n)\$ operation (it should only be necessary to pass over the children of each XML review element once), but the implementation searches for each attribute, making this \$Ω(n^2)\$.
I think the code would be simpler and clearer if it iterated once over the children of each `
element, applying a converter function to each attribute as it is encountered. This would allow you to drop the TagDescriptor and Review classes, and the asdict` function too, reducing the length and complexity of the code.- Revised code
(Untested, but if I made mistakes then I'm sure you can figure out how to fix them.)
# Identity function (for attributes that don't need conversion).
identity = lambda x: x
def convert_rating(element):
# The Goodreads API returns '0' to indicate an unrated book; make
# this a proper null type.
rating = element.text
if rating == '0':
return None
else:
return rating
def convert_authors(element):
return [a.find('name').text for a in element.findall('author')]
def convert_shelves(element):
return [s.find('name').text for s in element.findall('shelf')]
# Map from tag name to pair (output key, converter)
REVIEW_TAGS = {
'read_at': ('date_read', convert_date),
'date_added': ('date_added', convert_date),
'body': ('review', str.strip),
'rating': ('my_rating', convert_rating),
'shelves': ('bookshelves', convert_shelves),
}
# Map from tag name to pair (output key, converter)
BOOK_TAGS = {
'authors': ('authors', convert_authors),
'id': ('book_id', identity),
'title': ('title', identity),
'isbn': ('isbn', identity),
'isbn13': ('isbn13', identity),
'average_rating': ('average_rating', identity),
'publisher': ('publisher', identity),
'format': ('binding', identity),
'num_pages': ('page_count', int),
'publication_year': ('year_published', identity),
'published': ('orig_year_published', identity),
}
from itertools import count
def get_reviews(user_id, api_key):
"""Generate reviews associated with a Goodreads user as dictionaries.
:param user_id: The id of the Goodreads user.
:param api_key: The Goodreads API key.
"""
for page_no in count(1):
# ... code for calling the API goes here ...
reviews = ET.fromstring(req.text).find('reviews')
for review in reviews.findall('review')
data = {}
def convert(element, tag_mapping):
if element.tag in tag_mapping:
key, converter = tag_mapping[element.tag]
data[key] = converter(element.text)
for review_elt in review:
if review_elt.tag == 'book':
for book_elt in review_elt:
convert(book_elt, BOOK_TAGS)
else:
convert(review_elt, REVIEW_TAGS)
yield data
if int(reviews.attrib['end']) >= int(reviews.attrib['total']):
breakCode Snippets
# Identity function (for attributes that don't need conversion).
identity = lambda x: x
def convert_rating(element):
# The Goodreads API returns '0' to indicate an unrated book; make
# this a proper null type.
rating = element.text
if rating == '0':
return None
else:
return rating
def convert_authors(element):
return [a.find('name').text for a in element.findall('author')]
def convert_shelves(element):
return [s.find('name').text for s in element.findall('shelf')]
# Map from <review> tag name to pair (output key, converter)
REVIEW_TAGS = {
'read_at': ('date_read', convert_date),
'date_added': ('date_added', convert_date),
'body': ('review', str.strip),
'rating': ('my_rating', convert_rating),
'shelves': ('bookshelves', convert_shelves),
}
# Map from <book> tag name to pair (output key, converter)
BOOK_TAGS = {
'authors': ('authors', convert_authors),
'id': ('book_id', identity),
'title': ('title', identity),
'isbn': ('isbn', identity),
'isbn13': ('isbn13', identity),
'average_rating': ('average_rating', identity),
'publisher': ('publisher', identity),
'format': ('binding', identity),
'num_pages': ('page_count', int),
'publication_year': ('year_published', identity),
'published': ('orig_year_published', identity),
}
from itertools import count
def get_reviews(user_id, api_key):
"""Generate reviews associated with a Goodreads user as dictionaries.
:param user_id: The id of the Goodreads user.
:param api_key: The Goodreads API key.
"""
for page_no in count(1):
# ... code for calling the API goes here ...
reviews = ET.fromstring(req.text).find('reviews')
for review in reviews.findall('review')
data = {}
def convert(element, tag_mapping):
if element.tag in tag_mapping:
key, converter = tag_mapping[element.tag]
data[key] = converter(element.text)
for review_elt in review:
if review_elt.tag == 'book':
for book_elt in review_elt:
convert(book_elt, BOOK_TAGS)
else:
convert(review_elt, REVIEW_TAGS)
yield data
if int(reviews.attrib['end']) >= int(reviews.attrib['total']):
breakContext
StackExchange Code Review Q#146416, answer score: 3
Revisions (0)
No revisions yet.