MagPi Magazine Downloader
Problem
I have created a simple program in Python which downloads all issues of the MagPi magazine by parsing this web page for links ending in .pdf, then downloading them using the urllib module. It also supports command line options, using the argparse module:
```
$ ./magpi_downloader.py -h
usage: magpi_downloader.py [-h] [-q] [-r] [--view] [-t FILETYPE] [-a] [-i]
                           [-e]
                           DIR [REMOTE_DIR]

Download issues of the MagPi magazine

positional arguments:
  DIR               The directory to install into.
  REMOTE_DIR        The directory to fetch from. Files must be links on that
                    page. Works best with Apache servers with the default
                    directory listing. Default:
                    http://www.raspberrypi.org/magpi-issues/

optional arguments:
  -h, --help        show this help message and exit
  -q, --quiet       Silence progress output
  -r, --reinstall   Reinstall all issues
  --view            List the issues available for install
  -t FILETYPE       The extension of the files to download. Default: pdf
  -a, --all         Install all files with the right extension.
  -i, --issues      Install the regular issues (default behavior)
  -e, --essentials  Install the 'Essentials' collection
```
I am looking for ways to improve the user interface and the speed, and to make it generally cleaner and more pythonic.
Source code:
```
import urllib, os, re, time, sys, argparse
from bs4 import BeautifulSoup

ISSUES_REGEX = u'MagPi[0-9]+'
ESSENTIALS_REGEX = u'Essentials_.*'


def get_installed(directory):
    if not os.path.exists(directory):
        if not quiet:
            print("Directory doesn't exist\nCreating new directory with the name {}".format(directory))
        os.mkdir(directory)
    for filename in os.listdir(directory):
        if re.match(regex, filename):
            yield filename


def download(filename, localfilename):
    localfile = get_full_path(download_dir, localfilename)
```
Solution
Overall, the code is not very readable: it's quite lengthy, there is no modular structure, and there are no meaningful comments or docstrings. The first thing I would do is split it into multiple modules grouped logically and define a docstring for each of the functions (which may actually lead to some functions being joined together or even removed).
Also, some of the code blocks can be extracted into separate functions. For instance, apply the "Extract Method" refactoring and pull the "parsing arguments with argparse" part into a separate function (sketched below).
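To illustrate the extraction - here is a minimal sketch built from the --help output above (the name parse_args and the trimmed set of options are my own choices, not from the original code):

```
import argparse


def parse_args():
    """Collect all the argparse setup in one dedicated function."""
    parser = argparse.ArgumentParser(
        description="Download issues of the MagPi magazine")
    parser.add_argument("directory", metavar="DIR",
                        help="The directory to install into.")
    parser.add_argument("remote_dir", metavar="REMOTE_DIR", nargs="?",
                        default="http://www.raspberrypi.org/magpi-issues/",
                        help="The directory to fetch from.")
    parser.add_argument("-q", "--quiet", action="store_true",
                        help="Silence progress output")
    parser.add_argument("-t", dest="filetype", default="pdf",
                        help="The extension of the files to download.")
    # ... the remaining flags (-r, --view, -a, -i, -e) follow the same pattern
    return parser.parse_args()
```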
Now, let's go over the issues one by one:
- import organization - avoid importing multiple built-in modules on a single line; third-party imports need a blank line before them, and put two blank lines after all the imports (PEP8 import guidelines)
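Applied to this script's imports (and the two blank lines before the module-level code), that would look something like:

```
import argparse
import os
import re
import sys
import time
import urllib

from bs4 import BeautifulSoup


ISSUES_REGEX = u'MagPi[0-9]+'
ESSENTIALS_REGEX = u'Essentials_.*'
```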
- since you are issuing multiple requests to the same domain, I would use the requests module, maintaining a web-scraping session via requests.Session - this might result in a significant performance boost: "So if you're making several requests to the same host, the underlying TCP connection will be reused, which can result in a significant performance increase."
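A minimal sketch of the download loop with a shared session (download_all and the way the local filename is derived are illustrative, not taken from the original code):

```
import requests


def download_all(urls):
    # a single Session reuses the TCP connection to the host, so every
    # download after the first one skips the connection setup cost
    session = requests.Session()
    for url in urls:
        response = session.get(url)
        response.raise_for_status()
        local_name = url.rsplit("/", 1)[-1]  # e.g. "MagPi01.pdf"
        with open(local_name, "wb") as f:
            f.write(response.content)
```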
- when instantiating a "soup" object, it's highly recommended to explicitly specify the underlying parser to avoid letting BeautifulSoup pick one automatically - it might choose html.parser on your machine, but another machine may pick lxml or html5lib, which may produce different parsing results (see "Differences between parsers"):

```
soup = BeautifulSoup(page, "html.parser")
# soup = BeautifulSoup(page, "html5lib")
# soup = BeautifulSoup(page, "lxml")
```
- you can improve the way you locate the elements on a page. Currently, you are finding all the links via find_all("a") and then applying a regular expression to the href attribute values. BeautifulSoup can do both in a single "find" command:

```
soup.find_all("a", href=re.compile(r"expression here"))
```

- the print_to_install_info() function can benefit from using a multi-line string (to avoid multiple newline characters in the format template string)
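For instance - the body here is hypothetical, since print_to_install_info() itself is not shown in the excerpt above:

```
def print_to_install_info(to_install):
    # a triple-quoted template keeps the intended layout readable in the source
    print("""\
The following files will be installed:

{}

Total: {} file(s)""".format("\n".join(to_install), len(to_install)))
```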
- except for printing, the install() and install_quiet() functions really duplicate each other. Either have a single function with a quiet argument or, even better, use the logging module with a configurable/controllable log level
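A sketch of the logging-based variant; the install() body is a placeholder, only the logging calls and the level configuration are the point:

```
import logging

logger = logging.getLogger("magpi_downloader")


def install(issues):
    for issue in issues:
        logger.info("Installing %s", issue)  # silenced when the level is above INFO
        # ... the actual download logic goes here ...


# e.g. in main(), after parsing the arguments:
# logging.basicConfig(level=logging.WARNING if args.quiet else logging.INFO)
```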
- see if you can replace listdir() and an inner regex check with a single glob.glob() (or glob.iglob()) call.
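Something along these lines - note that glob patterns are shell-style rather than regular expressions, so "MagPi[0-9]*" (one digit followed by anything) only approximates the MagPi[0-9]+ regex:

```
import glob
import os


def get_installed(directory):
    for path in glob.iglob(os.path.join(directory, "MagPi[0-9]*")):
        yield os.path.basename(path)
```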
Context
StackExchange Code Review Q#156311, answer score: 6