A simple PyPI crawler
Problem
I had to make a program to crawl through a bunch of PyPI packages and see how many of them implement custom compare operators (i.e. grep for `def __le__` etc). After downloading an HTML file with links to all Python 3.4 packages on PyPI (i.e. the directory page), I wrote this simple crawler to go through all the links, download and unzip each package, and grep them for custom compare definitions. It's rudimentary, but still, what are your comments? This is my first "shell-script" style Python program, i.e. a program where you're not computing stuff but just moving around files and doing networking.
Code:
```
import sys
assert(sys.version_info >= (3,5))
import re
from requests import get
import subprocess

def run(s):
    return subprocess.run(s,
                          shell=True,
                          stdout=subprocess.DEVNULL,
                          stderr=subprocess.DEVNULL)

directory = open('list.html').read()  # download list from Browse Packages ->
                                      # Python 3.4 -> Show All
custom = no_custom = failiures = 0
for (package_url, package_name) in \
        re.findall('(https://pypi\.python\.org/pypi/([^/]+)/)', directory):
    print(custom+no_custom+failiures,
          custom,
          no_custom,
          failiures)
    try:
        package_page = get(package_url).text
        (download_url, file_type) = re.search('.+(\.tar\.gz|\.zip)',
                                              package_page).groups()
        print(package_name)
        archive = open('archive', 'wb')
        archive.write(get(download_url).content)
        archive.close()
        run('rm -r package_code')
        run('mkdir package_code')
        if file_type == '.tar.gz':
            run('tar -xzf archive -C package_code')
        if file_type == '.zip':
            run('unzip archive -d package_code')
        return_code = run('grep -Er "def __(le|lt|ge|gt)__" ./package_code').returncode
        if return_code == 0:
            custom += 1
        elif return_code == 1:
            no_custom += 1
        else:
            failiures += 1
    except Exception:
        failiures += 1
```
Solution
Here are some comments/notes about the code and potential improvements:
- use the `with` context manager when opening files.
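For example, writing the downloaded archive in the posted code could go through a context manager, so the file is closed even if the write fails part-way (a minimal sketch reusing the question's get and download_url):
```
with open('archive', 'wb') as archive:
    # The file is closed automatically when the block exits,
    # even if the download or write raises an exception.
    archive.write(get(download_url).content)
```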
- parsing HTML with regular expressions has always been a very controversial thing to do. I would switch to an HTML parser like BeautifulSoup or lxml.html. For example, getting all the PyPI links with BeautifulSoup can be as straightforward as:
```
from bs4 import BeautifulSoup

with open('list.html') as directory:
    soup = BeautifulSoup(directory, "html.parser")

    for link in soup.select("a[href*=pypi]"):
        print(link.get_text())
```
where `a[href*=pypi]` is a CSS selector that matches all `a` elements that have the `pypi` substring inside an `href` attribute.
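If you prefer lxml.html, roughly the same link extraction can be written with an XPath query; this is only a sketch assuming the lxml package is installed, with the contains() test mirroring the CSS selector above:
```
from lxml import html

tree = html.parse('list.html')
# Select every <a> element whose href contains "pypi".
for link in tree.xpath('//a[contains(@href, "pypi")]'):
    print(link.text_content())
```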
- instead of using requests.get() directly, initialize a "session" to reuse the underlying TCP connection:

..if you're making several requests to the same host, the underlying TCP connection will be reused, which can result in a significant performance increase..
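A minimal sketch of what that could look like in the crawler (requests.Session() is the standard requests API; package_url and download_url are the variables from the posted code, archive_data is an illustrative name):
```
import requests

session = requests.Session()
# Both requests go through the same Session object, so the TCP connection
# to the PyPI host can be reused between them.
package_page = session.get(package_url).text
archive_data = session.get(download_url).content
```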
- If you want to scale this up, you would need to switch from a synchronous and blocking code/approach to an asynchronous one - look into the Scrapy web-scraping framework, which is based on the twisted networking library.
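For a rough idea of the shape such a spider could take (a hedged sketch only; the spider name, start URL and output field are illustrative, not from the answer, and it assumes a reasonably recent Scrapy version):
```
import scrapy

class PypiLinksSpider(scrapy.Spider):
    name = "pypi_links"                             # illustrative name
    start_urls = ["https://pypi.python.org/pypi"]   # hypothetical starting page

    def parse(self, response):
        # Scrapy schedules follow-up requests asynchronously on top of twisted,
        # instead of fetching one package page at a time.
        for href in response.css("a[href*=pypi]::attr(href)").getall():
            yield {"package_url": response.urljoin(href)}
```
Run it with something like scrapy runspider pypi_links_spider.py -o packages.json to collect the links.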
Context
StackExchange Code Review Q#155804, answer score: 4
Revisions (0)
No revisions yet.