A simple PyPI crawler
Problem
I had to make a program to crawl through a bunch of PyPI packages and see how many of them implement custom compare operators (i.e. grep for `def __le__` etc). After downloading an HTML file with links to all Python 3.4 packages on PyPI (i.e. the directory page), I wrote this simple crawler to go through all the links, download and unzip each package, and grep them for custom compare definitions. It's rudimentary, but still, what are your comments? This is my first "shell-script" style Python program, i.e. a program where you're not computing stuff but just moving around files and doing networking.
Code:
```
import sys
assert(sys.version_info >= (3,5))
import re
from requests import get
import subprocess

def run(s):
    return subprocess.run(s,
                          shell=True,
                          stdout=subprocess.DEVNULL,
                          stderr=subprocess.DEVNULL)

directory = open('list.html').read()  # download list from Browse Packages ->
                                      # Python 3.4 -> Show All
custom = no_custom = failiures = 0
for (package_url, package_name) in \
        re.findall('(https://pypi\.python\.org/pypi/([^/]+)/)', directory):
    print(custom+no_custom+failiures,
          custom,
          no_custom,
          failiures)
    try:
        package_page = get(package_url).text
        (download_url, file_type) = re.search('.+(\.tar\.gz|\.zip)',
                                              package_page).groups()
        print(package_name)
        archive = open('archive', 'wb')
        archive.write(get(download_url).content)
        archive.close()
        run('rm -r package_code')
        run('mkdir package_code')
        if file_type == '.tar.gz':
            run('tar -xzf archive -C package_code')
        if file_type == '.zip':
            run('unzip archive -d package_code')
        return_code = run('grep -Er "def __(le|lt|ge|gt)__" ./package_code').returncode
        if return_code == 0:
            custom += 1
        elif return_code == 1:
            no_custom += 1
        else:
            failiures += 1
    except Exception:
        failiures += 1
```
Solution
Here are some comments/notes about the code and potential improvements:
- use the `with` context manager when opening files.
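For example, writing the downloaded archive in the posted code could go through a context manager, so the file is closed even if the write fails part-way (a minimal sketch reusing the question's get and download_url):
```
with open('archive', 'wb') as archive:
    # The file is closed automatically when the block exits,
    # even if the download or write raises an exception.
    archive.write(get(download_url).content)
```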
- parsing HTML with regular expressions has always been a very controversial thing to do. I would switch to an HTML parser like BeautifulSoup or lxml.html. For example, getting all the PyPI links with BeautifulSoup can be as straightforward as:
```
from bs4 import BeautifulSoup

with open('list.html') as directory:
    soup = BeautifulSoup(directory, "html.parser")

    for link in soup.select("a[href*=pypi]"):
        print(link.get_text())
```
where `a[href*=pypi]` is a CSS selector that matches all `a` elements that have the `pypi` substring inside an `href` attribute.
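If you prefer lxml.html, roughly the same link extraction can be written with an XPath query; this is only a sketch assuming the lxml package is installed, with the contains() test mirroring the CSS selector above:
```
from lxml import html

tree = html.parse('list.html')
# Select every <a> element whose href contains "pypi".
for link in tree.xpath('//a[contains(@href, "pypi")]'):
    print(link.text_content())
```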
- instead of using requests.get() directly, initialize a "session" to reuse the underlying TCP connection:

..if you're making several requests to the same host, the underlying TCP connection will be reused, which can result in a significant performance increase..
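A minimal sketch of what that could look like in the crawler (requests.Session() is the standard requests API; package_url and download_url are the variables from the posted code, archive_data is an illustrative name):
```
import requests

session = requests.Session()
# Both requests go through the same Session object, so the TCP connection
# to the PyPI host can be reused between them.
package_page = session.get(package_url).text
archive_data = session.get(download_url).content
```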
- If you want to scale this up, you would need to switch from a synchronous and blocking code/approach to an asynchronous one - look into the Scrapy web-scraping framework, which is based on the twisted networking library.
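For a rough idea of the shape such a spider could take (a hedged sketch only; the spider name, start URL and output field are illustrative, not from the answer, and it assumes a reasonably recent Scrapy version):
```
import scrapy

class PypiLinksSpider(scrapy.Spider):
    name = "pypi_links"                             # illustrative name
    start_urls = ["https://pypi.python.org/pypi"]   # hypothetical starting page

    def parse(self, response):
        # Scrapy schedules follow-up requests asynchronously on top of twisted,
        # instead of fetching one package page at a time.
        for href in response.css("a[href*=pypi]::attr(href)").getall():
            yield {"package_url": response.urljoin(href)}
```
Run it with something like scrapy runspider pypi_links_spider.py -o packages.json to collect the links.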
Context
StackExchange Code Review Q#155804, answer score: 4
Revisions (0)
No revisions yet.