Web Scraping with Python + asyncio

Submitted by: @import:stackexchange-codereview

Problem

I've been working on speeding up my web scraping with the asyncio library. I have a working solution, but I am unsure how Pythonic it is or whether I am using the library properly. Any input would be appreciated.

import aiohttp
import asyncio
import requests
from lxml import etree

@asyncio.coroutine
def get(*args, **kwargs):
    """
    A wrapper method for aiohttp's get method. Taken from Georges Dubus' article at
    http://compiletoi.net/fast-scraping-in-python-with-asyncio.html
    """
    response = yield from aiohttp.request('GET', *args, **kwargs)
    return (yield from response.read_and_close())

@asyncio.coroutine
def extract_text(url):
    """
    Given the url for a chapter, extract the relevant text from it
    :param url: the url for the chapter to scrape
    :return: a string containing the chapter's text
    """
    sem = asyncio.Semaphore(5)
    with (yield from sem):
        page = yield from get(url)

    tree = etree.HTML(page)
    paragraphs = tree.findall('.//*/div[@class="entry-content"]/p')[1: -1]
    return b'\n'.join(etree.tostring(paragraph) for paragraph in paragraphs)

def generate_links():
    """
    Generate the links to each of the chapters
    :return: A list of strings containing every url to visit
    """
    start_url = 'https://twigserial.wordpress.com/'
    base_url = 'https://twigserial.wordpress.com/category/story/'
    tree = etree.HTML(requests.get(start_url).text)
    xpath = './/*/option[@class="level-2"]/text()'
    return [base_url + suffix.strip() for suffix in tree.xpath(xpath)]

@asyncio.coroutine
def run():
    links = generate_links()
    chapters = []

    for f in asyncio.as_completed([extract_text(link) for link in links]):
        result = yield from f
        chapters.append(result)

    return chapters

def main():
    loop = asyncio.get_event_loop()
    chapters = loop.run_until_complete(run())
    print(len(chapters))

if __name__ == '__main__':
    main()

Solution

Looks ... great? Not a lot to complain about really.

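The semaphore doesn't do anything when used like this, though: every call
to extract_text creates its own Semaphore(5), so no two requests ever
wait on the same one. It should be created once and passed in from the
top to protect the get/aiohttp.request call. You can see this if you
print something right before the HTTP request.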
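
To make the difference visible without touching the network, here is a minimal sketch that replaces the HTTP request with asyncio.sleep and records how many simulated requests are in flight at once. It is written with async/await rather than the generator-based coroutines used in the post, and Tracker, fake_get, fetch_private, fetch_shared and peak_concurrency are made-up names for illustration only.

```python
import asyncio

class Tracker:
    """Counts how many simulated requests are in flight at once."""
    def __init__(self):
        self.active = 0
        self.peak = 0

async def fake_get(tracker):
    # Stand-in for the real HTTP request (assumption: no network involved).
    tracker.active += 1
    tracker.peak = max(tracker.peak, tracker.active)
    await asyncio.sleep(0.01)
    tracker.active -= 1

async def fetch_private(tracker):
    # Like the original: a fresh semaphore per call, so nothing ever waits.
    sem = asyncio.Semaphore(5)
    async with sem:
        await fake_get(tracker)

async def fetch_shared(tracker, sem):
    # One semaphore shared by every call, so at most 5 run at once.
    async with sem:
        await fake_get(tracker)

async def peak_concurrency(shared, n=20):
    tracker = Tracker()
    if shared:
        sem = asyncio.Semaphore(5)
        tasks = [fetch_shared(tracker, sem) for _ in range(n)]
    else:
        tasks = [fetch_private(tracker) for _ in range(n)]
    await asyncio.gather(*tasks)
    return tracker.peak

if __name__ == '__main__':
    print(asyncio.run(peak_concurrency(shared=False)))
    print(asyncio.run(peak_concurrency(shared=True)))
```

With 20 simulated requests, the per-call semaphore lets all 20 run at once, while the shared one caps the peak at 5.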

Also, the results from asyncio.as_completed arrive in completion order,
not in the order of the links, so be sure to sort the resulting chapters
somehow, e.g. by returning both the URL and the collected text from
extract_text.
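
One way to restore the original order is sketched below, under the assumption that every link is unique. The fetch is simulated with asyncio.sleep, the code uses async/await rather than the post's generator-based coroutines, and fake_extract_text and collect_in_order are hypothetical names.

```python
import asyncio

async def fake_extract_text(url, delay):
    # Stand-in for extract_text: returns the URL together with its fake text.
    await asyncio.sleep(delay)
    return url, b'text of ' + url.encode()

async def collect_in_order(links, delays):
    fetchers = [fake_extract_text(link, d) for link, d in zip(links, delays)]
    results = []
    for f in asyncio.as_completed(fetchers):
        results.append(await f)
    # results is in completion order here; sort it back into link order.
    # Assumption: links are unique, so each URL maps to a single position.
    position = {link: i for i, link in enumerate(links)}
    results.sort(key=lambda pair: position[pair[0]])
    return results
```

Because each result carries its URL, the sort key only needs the position of that URL in the original list.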

A couple of small things as well:

  • List comprehensions and generator expressions are okay, but when the
    expression is just a function called with a single argument, map is
    shorter and performs equally well.

  • The URL constants should ideally be defined at the top level; at
    least base_url can be defined by concatenating onto start_url.
    Alternatively they could be passed in to generate_links. Then again,
    it's unlikely that another blog has exactly the same layout.

  • The manual append in run seems unnecessary; I'd rewrite it to build a
    list of coroutines and collect the results with a list comprehension
    instead.

  • At the moment generate_links is called from run; I think it makes
    more sense to call it from the main function: it doesn't need to run
    concurrently, and you could think of a situation where you'd pass in
    the result of a different function to be fetched and collected.
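
As a rough illustration of the first and third points, the sketch below compares a generator expression with map and collects results with a comprehension instead of a manual append. It uses async/await rather than the post's generator-based coroutines; serialize stands in for etree.tostring, and fake_fetch and collect are hypothetical names.

```python
import asyncio

def serialize(paragraph):
    # Hypothetical stand-in for etree.tostring.
    return ('<p>' + paragraph + '</p>').encode()

def join_with_genexp(paragraphs):
    return b'\n'.join(serialize(p) for p in paragraphs)

def join_with_map(paragraphs):
    # Same result; map avoids naming the loop variable.
    return b'\n'.join(map(serialize, paragraphs))

async def fake_fetch(n):
    # Stand-in for a network fetch (assumption: no real I/O).
    await asyncio.sleep(0)
    return n * 2

async def collect(numbers):
    fetchers = [fake_fetch(n) for n in numbers]
    # An asynchronous comprehension replaces the manual append loop.
    return [await f for f in asyncio.as_completed(fetchers)]
```

Both join functions return the same bytes, and collect returns the fetched values in completion order.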

All in all, I'd maybe change things to the code below. Of course if you
were to add things to it, I'd recommend looking into command line
arguments and configuration files, ...

import aiohttp
import asyncio
import requests
from lxml import etree

@asyncio.coroutine
def get(*args, **kwargs):
    """
    A wrapper method for aiohttp's get method. Taken from Georges Dubus' article at
    http://compiletoi.net/fast-scraping-in-python-with-asyncio.html
    """
    response = yield from aiohttp.request('GET', *args, **kwargs)
    return (yield from response.read_and_close())

@asyncio.coroutine
def extract_text(url, sem):
    """
    Given the url for a chapter, extract the relevant text from it
    :param url: the url for the chapter to scrape
    :param sem: the shared semaphore limiting concurrent requests
    :return: a tuple of the url and a bytes object containing the chapter's text
    """
    with (yield from sem):
        page = yield from get(url)

    tree = etree.HTML(page)
    paragraphs = tree.findall('.//*/div[@class="entry-content"]/p')[1:-1]
    return url, b'\n'.join(map(etree.tostring, paragraphs))

def generate_links():
    """
    Generate the links to each of the chapters
    :return: A list of strings containing every url to visit
    """
    start_url = 'https://twigserial.wordpress.com/'
    base_url = start_url + 'category/story/'
    tree = etree.HTML(requests.get(start_url).text)
    xpath = './/*/option[@class="level-2"]/text()'
    return [base_url + suffix.strip() for suffix in tree.xpath(xpath)]

@asyncio.coroutine
def run(links):
    # One semaphore shared by all fetchers, so at most 5 requests run at once.
    sem = asyncio.Semaphore(5)
    fetchers = [extract_text(link, sem) for link in links]
    # Note: `yield from` inside a comprehension is a SyntaxError on
    # Python 3.8+; use an explicit loop there.
    return [(yield from f) for f in asyncio.as_completed(fetchers)]

def main():
    loop = asyncio.get_event_loop()
    chapters = loop.run_until_complete(run(generate_links()))
    print(len(chapters))

if __name__ == '__main__':
    main()

Context

StackExchange Code Review Q#91869, answer score: 3
