
Web Scraper in Python

Submitted by: @import:stackexchange-codereview
Tags: python, scraper, web

Problem

So, this is my first web scraper (or part of it at least), and I'm looking for things that I may have done wrong, or things that could be improved, so I can learn from my mistakes.

I made a few short functions that can take a user and search TPB to get the maximum number of pages of content that user has, by requesting the URL, parsing the HTML for the links div, and then crawling through those links to see whether each page is valid (since some linked pages actually have no content).

I know that I haven't covered all cases, such as invalid URLs, non-existent users, etc. I'm just trying to find what I've done wrong so far in parsing existing content before I go further.

There are two main things I'm concerned about:

First, I have filter/map/lambda combos all over the place here. These are generally slow from what I remember about their efficiency, and they seemed to be a fairly easy and concise way to get the filtering I needed (although not the prettiest). So, is this acceptable and/or is there a better way with bs4 or another alternative?

Second, since BeautifulSoup is recursive in finding nested tags anyway, does calling my get_max_pages function recursively really matter here?

```
import re
import requests
import urllib.request

from bs4 import BeautifulSoup

BASE_USER = "https://thepiratebay.se/user/"

def request_html(url):
    hdr = {'User-Agent': 'Mozilla/5.0'}
    request = urllib.request.Request(url, headers=hdr)
    html = urllib.request.urlopen(request)
    return html

def get_url(*args):
    for arg in args:
        url = BASE_USER + "%s/" % arg
    return url

def check_page_content(url):
    bs = BeautifulSoup(request_html(url), "html.parser")
    rows = bs.findAll("tr")
    rows = list(filter(None, map(lambda row: row if row.findChildren('div', {'class': 'detName'}) else None, rows)))
    return True if rows else False

def get_max_pages(user, url=None, placeholder=0, links = list(), valid=True):
    if url is None:
        url = get_url(user, str(placeholder), 3)
    # ... (rest of the function truncated in the original post)
```

Solution

Your get_url function is confusing. It looks like you keep assigning new values to url and ignoring all the previous values. This is what happens when I run it:

>>> get_url("Hello", "World")
'https://thepiratebay.se/user/World/'


Surely this is either a bug or redundant behaviour, since the only argument that matters is the last one? It seems like you should instead be using str.join, which concatenates all the arguments with a separator string. For example:

def get_url(*args):
    return BASE_USER + "/".join(args)

>>> get_url("Hello", "World")
'https://thepiratebay.se/user/Hello/World'
>>> get_url("ban", "an", "a")
'https://thepiratebay.se/user/ban/an/a'


Though as @Mathias Ettinger noted, you should use map(str, args) to ensure that all the arguments are converted to strings, since join raises a TypeError if any of them aren't.
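Putting both suggestions together, the joined version might look like this (a sketch; the map(str, ...) matters because get_max_pages passes page numbers in as ints):

    def get_url(*args):
        # convert each argument to a string first, since str.join
        # raises a TypeError on non-string values such as page numbers
        return BASE_USER + "/".join(map(str, args))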

check_page_content is hard to understand at first; a lot happens in one line, so you should try to spread it out over multiple lines. It would also be easier to build rows with a list comprehension that filters in advance, rather than gathering up a lot of rows just to remove them later:

def check_page_content(url):
    bs = BeautifulSoup(request_html(url), "html.parser")
    rows = [row for row in bs.findAll("tr")
            if row.findChildren('div', {'class': 'detName'})]
    return True if rows else False


But then I see that you're just returning the boolean value of the list! That means you can stop as soon as you find any matching row; there's no need to store the list at all. Luckily the built-in any function will do this for you. It short-circuits, meaning it returns as soon as it finds a value that evaluates as True:

def check_page_content(url):
    bs = BeautifulSoup(request_html(url), "html.parser")
    return any(row.findChildren('div', {'class': 'detName'})
               for row in bs.findAll("tr"))


any applies truthiness to each value coming out of the generator, so your function returns as soon as one row has a matching child. Even if every single row has to be checked, this is still faster than building a full list, mapping it, and filtering it.
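To see the short-circuiting in action, here's a small illustration (noisy is a hypothetical helper used only to show which values get checked):

>>> def noisy(x):
    print("checking", x)
    return x

>>> any(noisy(x) for x in [0, 0, 3, 5])
checking 0
checking 0
checking 3
True

any stops at the first truthy value (3), so 5 is never evaluated.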

In get_max_pages you have the default value links=list(). I'm not sure if you knew about the mutable default argument pitfall and thought this would avoid it, but it won't: links is created once, as an empty list, when the function is defined. Every call then shares that same list, so appends accumulate across calls, which is not what you want. Here's a simple example:

>>> def a(b=list()):
    b.append("another")
    return b

>>> a()
['another']
>>> a()
['another', 'another']
>>> a()
['another', 'another', 'another']


Instead you need to use links=None and then write if links is None: links = [] inside the function, so that a new list is created on each call.
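Applied to get_max_pages, the start of the function would then look something like this (a sketch of the fix just described):

    def get_max_pages(user, url=None, placeholder=0, links=None, valid=True):
        # None is a safe sentinel: a fresh list is created on every
        # call, rather than one shared list at definition time
        if links is None:
            links = []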

You have most of your code nested inside if valid, but if you flip the condition around you can save a level of nesting:

def get_max_pages(user, url=None, placeholder=0, links = list(), valid=True):
    if url is None:
        url = get_url(user, str(placeholder), 3)
    if not valid:
        return links

    td = BeautifulSoup(request_html(url), "html.parser").find(
        'td', {'colspan': 9, 'style': 'text-align:center;'})


Less nesting generally makes it easier to read code like this.

You have another map/filter combination; I'd recommend a similar list comprehension, though this one is a little harder to follow:

    regex = re.compile(r"/user/%s/\d{,3}/\d{,2}" % user)
    pages = [int(a.text) - 1
             for a in td.findAll('a', {'href': regex}) if a.text]


What's happening here is that I'm iterating through td.findAll, and whenever a.text is truthy I store int(a.text) - 1 in pages. This does what your next line does without needing separate map and filter calls.
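For comparison, the map/filter spelling of the same idea would look something like this (a reconstruction of the pattern, not the exact line from the question):

    pages = list(map(lambda a: int(a.text) - 1,
                     filter(lambda a: a.text,
                            td.findAll('a', {'href': regex}))))

The comprehension reads left to right in a single expression, while the map/filter version has to be read inside out.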

However, you may then need to filter pages further based on a condition. You can still do that with a comprehension; just run a second list comprehension over pages itself:

if links:
    pages = [x for x in pages if x > placeholder]


Note that I removed the int call, because the values are already stored as integers, so it's not necessary.

Lastly, your while loop here is strange. It would be easier to write for element in pages and break out of the loop instead of setting valid = False. That gives you a much simpler loop with fewer lines:

    for element in pages:
        valid_page = check_page_content(get_url(user, element, 3))
        if valid_page:
            links.append(element)
        else:
            break
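As an aside (not from the original answer), the same take-while-valid pattern can also be spelled with itertools.takewhile, which consumes items only until the predicate first fails:

    from itertools import takewhile

    # alternative sketch: extend links with pages while each page's
    # content checks out, stopping at the first invalid one
    links.extend(takewhile(
        lambda page: check_page_content(get_url(user, page, 3)),
        pages))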


Context

StackExchange Code Review Q#111980, answer score: 4
