patternpythonMinor
Web Scraping with Python + asyncio
with asyncio scraping python web
Problem
I've been working on speeding up my web scraping with the asyncio library. I have a working solution, but I'm unsure how Pythonic it is or whether I'm using the library properly. Any input would be appreciated.

import aiohttp
import asyncio
import requests
from lxml import etree


@asyncio.coroutine
def get(*args, **kwargs):
    """
    A wrapper method for aiohttp's get method. Taken from Georges Dubus' article at
    http://compiletoi.net/fast-scraping-in-python-with-asyncio.html
    """
    response = yield from aiohttp.request('GET', *args, **kwargs)
    return (yield from response.read_and_close())


@asyncio.coroutine
def extract_text(url):
    """
    Given the url for a chapter, extract the relevant text from it
    :param url: the url for the chapter to scrape
    :return: a string containing the chapter's text
    """
    sem = asyncio.Semaphore(5)
    with (yield from sem):
        page = yield from get(url)

    tree = etree.HTML(page)
    paragraphs = tree.findall('.//*/div[@class="entry-content"]/p')[1: -1]
    return b'\n'.join(etree.tostring(paragraph) for paragraph in paragraphs)


def generate_links():
    """
    Generate the links to each of the chapters
    :return: A list of strings containing every url to visit
    """
    start_url = 'https://twigserial.wordpress.com/'
    base_url = 'https://twigserial.wordpress.com/category/story/'
    tree = etree.HTML(requests.get(start_url).text)
    xpath = './/*/option[@class="level-2"]/text()'
    return [base_url + suffix.strip() for suffix in tree.xpath(xpath)]


@asyncio.coroutine
def run():
    links = generate_links()
    chapters = []
    for f in asyncio.as_completed([extract_text(link) for link in links]):
        result = yield from f
        chapters.append(result)
    return chapters


def main():
    loop = asyncio.get_event_loop()
    chapters = loop.run_until_complete(run())
    print(len(chapters))


if __name__ == '__main__':
    main()

Solution
Looks ... great? Not a lot to complain about really.
The semaphore doesn't do anything when used like this, though: every call to extract_text creates its own, so it should be passed in from the top to protect the get/aiohttp.request call. You can see that if you print something right before the HTTP request.

Also, the result of asyncio.as_completed will be in completion order rather than the order of the links, so be sure to sort the resulting chapters somehow, e.g. by returning both the URL and the collected text from extract_text.

A couple of small things as well:
- List comprehensions are okay, but with just a single argument it can be shorter and equally performant just to use map.
- The URL constants should ideally be defined on the top level; at least base_url can also be defined by concatenating with start_url. Alternatively they could be passed in to generate_links. Then again, it's unlikely that another blog has the exact same layout?
- The manual append in run seems unnecessary, I'd rewrite it into a list of generators and use a list comprehension instead.
- At the moment generate_links is called from run; I think it makes more sense to call it from the main function: it doesn't need to run concurrently and you could think of a situation where you'd pass in the result of a different function to be fetched and collected.
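
The semaphore point above can be checked without touching the network. The following is only a sketch under a few assumptions: it uses the newer async/await syntax and asyncio.run (Python 3.7+) rather than the @asyncio.coroutine style of the question, and fetch, tracker and the two peak_* helpers are hypothetical names invented for the demonstration, with asyncio.sleep standing in for the HTTP request.

```python
import asyncio


async def fetch(sem, tracker):
    """Stand-in for an HTTP request; records how many calls overlap."""
    async with sem:
        tracker["active"] += 1
        tracker["peak"] = max(tracker["peak"], tracker["active"])
        await asyncio.sleep(0.01)  # simulated network latency
        tracker["active"] -= 1


async def peak_with_private_semaphores(n, limit):
    tracker = {"active": 0, "peak": 0}
    # Each call gets a brand-new semaphore, as in the question's extract_text.
    calls = [fetch(asyncio.Semaphore(limit), tracker) for _ in range(n)]
    await asyncio.gather(*calls)
    return tracker["peak"]


async def peak_with_shared_semaphore(n, limit):
    tracker = {"active": 0, "peak": 0}
    sem = asyncio.Semaphore(limit)  # one semaphore shared by every call
    await asyncio.gather(*(fetch(sem, tracker) for _ in range(n)))
    return tracker["peak"]


private_peak = asyncio.run(peak_with_private_semaphores(20, 5))
shared_peak = asyncio.run(peak_with_shared_semaphore(20, 5))
print(private_peak, shared_peak)
```

With a private semaphore per call, all 20 simulated requests are in flight at once; with the shared one, no more than 5 overlap.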
All in all, I'd maybe change things to the code below. Of course if you
were to add things to it, I'd recommend looking into command line
arguments and configuration files, ...
import aiohttp
import asyncio
import requests
from lxml import etree


@asyncio.coroutine
def get(*args, **kwargs):
    """
    A wrapper method for aiohttp's get method. Taken from Georges Dubus' article at
    http://compiletoi.net/fast-scraping-in-python-with-asyncio.html
    """
    response = yield from aiohttp.request('GET', *args, **kwargs)
    return (yield from response.read_and_close())


@asyncio.coroutine
def extract_text(url, sem):
    """
    Given the url for a chapter, extract the relevant text from it
    :param url: the url for the chapter to scrape
    :param sem: the shared semaphore limiting concurrent requests
    :return: a tuple of the url and the chapter's text
    """
    with (yield from sem):
        page = yield from get(url)

    tree = etree.HTML(page)
    paragraphs = tree.findall('.//*/div[@class="entry-content"]/p')[1:-1]
    return url, b'\n'.join(map(etree.tostring, paragraphs))


def generate_links():
    """
    Generate the links to each of the chapters
    :return: A list of strings containing every url to visit
    """
    start_url = 'https://twigserial.wordpress.com/'
    base_url = start_url + 'category/story/'
    tree = etree.HTML(requests.get(start_url).text)
    xpath = './/*/option[@class="level-2"]/text()'
    return [base_url + suffix.strip() for suffix in tree.xpath(xpath)]


@asyncio.coroutine
def run(links):
    sem = asyncio.Semaphore(5)
    fetchers = [extract_text(link, sem) for link in links]
    return [(yield from f) for f in asyncio.as_completed(fetchers)]


def main():
    loop = asyncio.get_event_loop()
    chapters = loop.run_until_complete(run(generate_links()))
    print(len(chapters))


if __name__ == '__main__':
    main()
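
The code above returns (url, text) pairs but leaves them in completion order. One way to put them back into link order might look like the sketch below. It is written with async/await and asyncio.run (Python 3.7+) instead of the @asyncio.coroutine style, and fake_extract_text, collect_in_link_order and the delays are hypothetical stand-ins so that it runs without a network connection.

```python
import asyncio


async def fake_extract_text(url, delay):
    """Hypothetical stand-in for extract_text: returns (url, text) after a delay."""
    await asyncio.sleep(delay)
    return url, "text of " + url


async def collect_in_link_order(links, delays):
    fetchers = [fake_extract_text(link, delay) for link, delay in zip(links, delays)]
    # as_completed yields results as they finish, not in submission order.
    finished = [await f for f in asyncio.as_completed(fetchers)]
    # Restore the original order by looking up each URL's position in links.
    position = {link: i for i, link in enumerate(links)}
    return sorted(finished, key=lambda pair: position[pair[0]])


links = ["chapter-1", "chapter-2", "chapter-3"]
delays = [0.03, 0.01, 0.02]  # chapter-2 finishes first, chapter-1 last
chapters = asyncio.run(collect_in_link_order(links, delays))
print([url for url, _ in chapters])
```

If the results are only needed once everything has finished, asyncio.gather is an alternative worth considering, since it returns results in the order the awaitables were passed in and makes the sort unnecessary.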
Context
StackExchange Code Review Q#91869, answer score: 3