HiveBrain v1.2.0

MagPi Magazine Downloader

Tags: downloader, magpi, magazine

Problem

I have created a simple program in Python which downloads all issues of the MagPi magazine by parsing this web page for links ending in .pdf, then downloading them using the urllib module. It also supports command line options, using the argparse module:

$ ./magpi_downloader.py -h
usage: magpi_downloader.py [-h] [-q] [-r] [--view] [-t FILETYPE] [-a] [-i]
                           [-e]
                           DIR [REMOTE_DIR]

Download issues of the MagPi magazine

positional arguments:
  DIR               The directory to install into.
  REMOTE_DIR        The directory to fetch from. Files must be links on that
                    page. Works best with Apache servers with the default
                    directory listing. Default: http://www.raspberrypi.org
                    /magpi-issues/

optional arguments:
  -h, --help        show this help message and exit
  -q, --quiet       Silence progress output
  -r, --reinstall   Reinstall all issues
  --view            List the issues available for install
  -t FILETYPE       The extension of the files to download. Default: pdf
  -a, --all         Install all files with the right extension.
  -i, --issues      Install the regular issues (default behavior)
  -e, --essentials  Install the 'Essentials' collection


I am looking for ways to improve the user interface and the speed, and to make the code generally cleaner and more Pythonic.

Source code:

```
import urllib, os, re, time, sys, argparse
from bs4 import BeautifulSoup

ISSUES_REGEX = u'MagPi[0-9]+'
ESSENTIALS_REGEX = u'Essentials_.*'

def get_installed(directory):
    if not os.path.exists(directory):
        if not quiet:
            print("Directory doesn't exist\nCreating new directory with the name {}".format(directory))
        os.mkdir(directory)
    for filename in os.listdir(directory):
        if re.match(regex, filename):
            yield filename

def download(filename, localfilename):
    localfile = get_full_path(download_dir, localfilename)
```

Solution

The code overall is not readable. It is quite lengthy, has no modular structure, and contains no meaningful comments or docstrings. The first thing I would do is split it into multiple modules grouped logically and define docstrings for each of the functions (which may actually lead to some functions being joined together or even removed).

Also, some of the code blocks can be extracted into separate functions. For instance, apply the "Extract Method" refactoring and pull the "parsing arguments with argparse" part into a separate function.
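As a sketch of that extraction, covering only a few of the script's options (the function name parse_args and its argv parameter are illustrative choices, not taken from the original code):

```python
import argparse

def parse_args(argv=None):
    """Build the parser in one place and return the parsed arguments."""
    parser = argparse.ArgumentParser(
        description="Download issues of the MagPi magazine")
    parser.add_argument("DIR", help="The directory to install into.")
    parser.add_argument("-q", "--quiet", action="store_true",
                        help="Silence progress output")
    parser.add_argument("-t", dest="filetype", default="pdf",
                        help="The extension of the files to download.")
    return parser.parse_args(argv)

args = parse_args(["downloads", "-q"])
print(args.DIR, args.quiet, args.filetype)  # downloads True pdf
```

Taking an optional argv also makes the function easy to test without touching sys.argv.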

Now, let's go over the issues one by one:

  • import organization: avoid importing multiple built-in modules on a single line. Third-party imports need a blank line before them, and two blank lines should follow the whole import block (PEP 8 import guidelines)

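Under those guidelines, the script's import block would be laid out roughly as follows (only the ordering and spacing matter here):

```python
# Standard-library imports, one module per line
import argparse
import os
import re
import sys
import time
import urllib

# Blank line before the third-party import
from bs4 import BeautifulSoup


# Two blank lines separate the imports from the first definition
ISSUES_REGEX = u'MagPi[0-9]+'
```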
  • since you are issuing multiple requests to the same domain, I would use the requests module, maintaining a web-scraping session via requests.Session - this might result in a significant performance boost. As the requests documentation puts it:


So if you're making several requests to the same host, the underlying TCP connection will be reused, which can result in a significant performance increase
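As a minimal sketch (the User-Agent header and the variable names are illustrative, not from the original script), the session would be created once and then reused for the listing page and every PDF download:

```python
import requests

# One Session is reused for every request, so the TCP connection to the
# server can be kept alive between downloads.
session = requests.Session()
session.headers.update({"User-Agent": "magpi-downloader"})

# The listing page and each PDF would then be fetched through the session:
# page = session.get("http://www.raspberrypi.org/magpi-issues/").text
# pdf = session.get(pdf_url).content
```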

  • when instantiating a "soup" object, it's highly recommended to explicitly specify the underlying parser rather than letting BeautifulSoup choose one automatically - it might choose html.parser on your machine, but another machine may pick lxml or html5lib, which can produce different parsing results (see "Differences between parsers"):

soup = BeautifulSoup(page, "html.parser")
# soup = BeautifulSoup(page, "html5lib")
# soup = BeautifulSoup(page, "lxml")


  • you can improve the way you locate the elements on a page. Currently, you are finding all the links via find_all("a") and then applying a regular expression to the href attribute values. BeautifulSoup can do both in a single find_all() call:

soup.find_all("a", href=re.compile(r"expression here"))
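For instance, against a made-up fragment of the listing page (the HTML below is illustrative, not the real page), filtering on href while searching looks like this:

```python
import re
from bs4 import BeautifulSoup

# Stand-in for the directory-listing page; the real page is an Apache
# index whose links end in .pdf, as described in the question
page = """
<html><body>
<a href="MagPi01.pdf">MagPi01.pdf</a>
<a href="Essentials_Scratch.pdf">Essentials</a>
<a href="footer.html">footer</a>
</body></html>
"""

soup = BeautifulSoup(page, "html.parser")
# Filter on the href attribute while searching - no separate regex pass
links = soup.find_all("a", href=re.compile(r"MagPi[0-9]+\.pdf$"))
print([a["href"] for a in links])  # ['MagPi01.pdf']
```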


  • the print_to_install_info() function can benefit from using a multi-line string (to avoid multiple newline characters in the format template string)

  • except for the printing, the install() and install_quiet() functions duplicate each other. Either have a single function with a quiet argument or, even better, use the logging module with a configurable log level
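A sketch of the merged function using logging (the helper name configure_verbosity and the message text are illustrative, not from the original script):

```python
import logging

logger = logging.getLogger("magpi_downloader")
logging.basicConfig(format="%(message)s")

def configure_verbosity(quiet):
    # -q/--quiet raises the threshold so INFO progress messages are dropped
    logger.setLevel(logging.WARNING if quiet else logging.INFO)

def install(issue):
    # Single replacement for install()/install_quiet(): the old print()
    # calls become logger.info() calls, and verbosity is controlled
    # centrally by the logger level
    logger.info("Installing %s", issue)
    # ... download logic would go here ...

configure_verbosity(quiet=False)
install("MagPi01.pdf")  # progress line is emitted
configure_verbosity(quiet=True)
install("MagPi02.pdf")  # suppressed
```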

  • see if you can replace listdir() and the inner regex check with a single glob.glob() (or glob.iglob()) call
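A sketch of get_installed() rebuilt on glob (the directory-creation logic from the original is omitted here); note that glob patterns are shell wildcards rather than regular expressions, so the regex 'MagPi[0-9]+' becomes the pattern 'MagPi[0-9]*':

```python
import glob
import os
import tempfile

def get_installed(directory, pattern="MagPi[0-9]*"):
    # Yields just the file names, mirroring the listdir()+re.match version
    for path in glob.iglob(os.path.join(directory, pattern)):
        yield os.path.basename(path)

# Quick demonstration against a throwaway directory
with tempfile.TemporaryDirectory() as d:
    for name in ("MagPi01.pdf", "MagPi42.pdf", "notes.txt"):
        open(os.path.join(d, name), "w").close()
    print(sorted(get_installed(d)))  # ['MagPi01.pdf', 'MagPi42.pdf']
```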


Context

StackExchange Code Review Q#156311, answer score: 6
