Scraping SEDE query results with caching
Problem
I use this script to scrape the results of a SEDE page and return them as a BeautifulSoup object.
A small twist is that if I don't use a SEDE query manually in the browser for a few days, then non-interactive downloads get empty results. (I suspect there is a robot test there.)
Rather than fixing the robot test issue, my workaround is to cache successfully downloaded pages, and use the cache when downloading is not working well.
    import logging
    import os

    import requests
    from bs4 import BeautifulSoup

    BASE_DIR = os.path.dirname(__file__)
    CACHE_DIR = os.path.join(BASE_DIR, '.cache')


    def fetch_sede_soup(label, url):
        def is_valid(soup):
            for script in soup.findAll('script'):
                if 'resultSets' in script.text:
                    return True
            return False

        if not os.path.isdir(CACHE_DIR):
            os.mkdir(CACHE_DIR)

        logging.info('fetching {} as {}'.format(label, url))
        html = requests.get(url).text
        soup = BeautifulSoup(html)

        cache_path = os.path.join(CACHE_DIR, '{}.html'.format(label))
        debug_cache_path = os.path.join(CACHE_DIR, '{}-debug.html'.format(label))

        if is_valid(soup):
            logging.info('updating cache')
            with open(cache_path, 'w') as fh:
                fh.write(html)
            return soup

        with open(debug_cache_path, 'w') as fh:
            fh.write(html)

        logging.warning('result not valid')
        if os.path.exists(cache_path):
            logging.info('using previous cache')
            with open(cache_path) as fh:
                return BeautifulSoup(fh)

Can this be written better?
In case it helps, here are some sample output files:
- sede-output.html - a successful download.
- sede-output-debug.html - a "failed" download (empty results).
Solution
It shows that you've written code before, so style-wise there is not much to comment on. What I would like to comment on are a few points about code structure:
- Single responsibility concern: Your function is named fetch_sede_soup() and all of its actions are related to that. However, it also creates CACHE_DIR if needed, writes output files, and verifies validity in an inner function. This could be separated into more functions, although the function is not so large that it is hard to get an overview of.

  One place where I would at least add a new function is around the return soup near the middle. The code after it appears to be error handling, but that is somewhat hidden by the return, so it is not very clear. If you moved the remaining part into a function, you would see a return soup followed by an else: clause calling the error handling.

- Add comments and docstrings: As a side effect of the different aspects your function handles, I would at least add more comments within the code. This could justify keeping more within the larger function whilst still maintaining readability.

- Add error handling around mkdir: Especially around external os calls like mkdir, I would add try ... except code to handle potential error situations. The code that creates CACHE_DIR could possibly be run before this function is called, as part of initialization.

But in general: nice, clean code, but a little thin on documentation and error handling of os operations.
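To make the points above concrete, here is a minimal sketch of how the split could look. It is an illustration under assumptions, not a drop-in replacement. The helper names (ensure_cache_dir, looks_valid, read_cached_html, handle_invalid_result, fetch_sede_html) are hypothetical. The download step is passed in as a callable so the structure can be tried without network access. The validity check is simplified to a substring test on the raw HTML instead of the original per-script-tag check. Parsing the returned HTML with BeautifulSoup is left to the caller.

```python
import logging
import os


def ensure_cache_dir(cache_dir):
    """Create cache_dir if it is missing; return True when it is usable."""
    try:
        os.makedirs(cache_dir, exist_ok=True)
    except OSError as exc:
        logging.error('cannot create cache dir %s: %s', cache_dir, exc)
        return False
    return True


def looks_valid(html):
    """Simplified stand-in for is_valid(); the original inspects <script> tags."""
    return 'resultSets' in html


def read_cached_html(cache_path):
    """Return the previously cached HTML, or None if nothing is cached."""
    if not os.path.exists(cache_path):
        return None
    logging.info('using previous cache')
    with open(cache_path) as fh:
        return fh.read()


def handle_invalid_result(html, cache_path, debug_cache_path):
    """Keep the failed download for debugging, then fall back to the cache."""
    logging.warning('result not valid')
    with open(debug_cache_path, 'w') as fh:
        fh.write(html)
    return read_cached_html(cache_path)


def fetch_sede_html(label, url, download, cache_dir):
    """Return fresh HTML for url, or the cached copy if the download is empty.

    download is a hypothetical callable that takes a URL and returns HTML
    text, for example: lambda u: requests.get(u).text
    """
    if not ensure_cache_dir(cache_dir):
        return None

    cache_path = os.path.join(cache_dir, '{}.html'.format(label))
    debug_cache_path = os.path.join(cache_dir, '{}-debug.html'.format(label))

    logging.info('fetching %s as %s', label, url)
    html = download(url)

    if looks_valid(html):
        logging.info('updating cache')
        with open(cache_path, 'w') as fh:
            fh.write(html)
        return html
    else:
        return handle_invalid_result(html, cache_path, debug_cache_path)
```

With this layout the success path and the fallback path sit side by side in the if/else, and the directory creation could just as well be called once during initialization instead of on every fetch.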
Context
StackExchange Code Review Q#113982, answer score: 3