
We'll be counting stars


Problem

Lately, I've been, I've been losing sleep

Dreaming about the things that we could be

But baby, I've been, I've been praying hard,

Said, no more counting dollars

We'll be counting stars, yeah we'll be counting stars

(One Republic - Counting Stars)

The 2nd Monitor is known to be a quite star-happy chat room, but exactly how many stars are there? And who are the most starred users? I decided to write a script to find out.

I chose to write this in Python because:

  • I've never used Python before
  • @200_success used it and it didn't look that hard
  • I found Beautiful Soup, which seemed powerful and easy to use
The script performs a number of HTTP requests to the list of starred messages, keeps track of the counts in a dict, and also saves the raw HTML data to files (to make it easier to perform other calculations on the data in the future, and it gave me an opportunity to learn file I/O in Python).

To be safe, I included a little delay between some of the requests (don't want to risk getting blocked by the system).

Unfortunately there's no way to see which user starred the most messages, but we all know who that is anyway...

The code:

```
from time import sleep
from bs4 import BeautifulSoup
from urllib import request
from collections import OrderedDict
import operator

room = 8595 # The 2nd Monitor (Code Review)
url = 'http://chat.stackexchange.com/rooms/info/{0}/the-2nd-monitor/?tab=stars&page={1}'
pages = 125

def write_files(filename, content):
    with open(filename, 'w', encoding = 'utf-8') as f:
        f.write(content)

def fetch_soup(room, page):
    resource = request.urlopen(url.format(room, page))
    content = resource.read().decode('utf-8')
    mysoup = BeautifulSoup(content)
    return mysoup

allstars = {}
def add_stars(message):
    message_soup = BeautifulSoup(str(message))
    stars = message_soup.select('.times').pop().string
    who = message_soup.select(".username a").pop().string

    # If there is only one star, the `.times` span item does not
    # contain a number, so treat a missing count as one star.
    if stars == None:
        count = 1
    else:
        count = int(stars)

    if who in allstars:
        allstars[who] += count
    else:
        allstars[who] = count
```

Solution

That's a beautiful excuse for writing code, and the final product is quite nice as well.

★ First of all, congrats to your Java-ridden mind for not forcing classes into Python where they aren't needed.

★ You import but do not use OrderedDict and operator. Remove unused imports.

★ It is customary to write your code in a way that allows it to be used as either a module, or a script. For this, the if __name__ == '__main__' trick is used:

if __name__ == '__main__':
    # executed only if used as a script


★ You declare a few variables like room, url, and pages up front. This hinders code reuse:

- The room variable should not be declared up front as a global variable, but in your main section. From there, it can be passed down through all functions.

- The url specifically but unnecessarily mentions the-2nd-monitor. This isn't harmful, but unnecessary as only the ID is relevant. Furthermore, url is an awfully short name for such a large scope. Something like star_url_pattern would be better – except that global “constants” should be named all-uppercase:

      STAR_URL_PATTERN = 'http://chat.stackexchange.com/rooms/info/{0}/?tab=stars&page={1}'

- Reserve plural names for collections. pages should rather be page_count. But wait – why are we hardcoding this rather than fetching it from the page itself? Just follow the rel="next" links until you reach the end.

★ That last idea could be implemented with a generator function. A Python generator function is similar to a simple iterator. It can yield elements, or return when exhausted. We could build such a generator function that yields a beautiful soup object for each page, and takes care of fetching the next one. As a sketch:

from urllib.parse import urljoin

def walk_pages(start_url):
    current_page = start_url
    while True:
        content = ... # fetch the current_page
        soup = BeautifulSoup(content)
        yield soup
        # find the next page
        next_link = soup.find('a', {'rel': 'next'})
        if next_link is None:
            return
        # urljoin takes care of resolving the relative URL
        current_page = urljoin(current_page, next_link['href'])


★ Please don't use urllib.request. That library has a horrible interface and is more or less broken by design. You might notice that the .read() method returns raw bytes, rather than using the charset from the Content-Type header to automatically decode the content. This is useful when handling binary data, but an HTML page is text. Instead of hardcoding the encoding utf-8 (which isn't even the default encoding for HTML), we could use a better library like requests. Then:

import requests
response = requests.get(current_page)
response.raise_for_status()  # raise an error for 4xx or 5xx responses
content = response.text  # transparently decodes the content


★ Your allstars variable should not only be named something like all_stars (notice the separation of words via an underscore), but also not be a global variable. Consider passing it in as a parameter to add_stars, or wrapping this dictionary in an object where add_stars would be a method.
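As a sketch of the second option (the name StarCounter and its methods are illustrative, not from the original code; collections.Counter handles the missing-key case for us):

```python
from collections import Counter

class StarCounter:
    """Accumulates star counts per user; a hypothetical wrapper
    around what was previously the global allstars dictionary."""

    def __init__(self):
        self.stars = Counter()

    def add_stars(self, who, count):
        # Counter defaults missing keys to 0, so no membership test needed
        self.stars[who] += count

    def top(self, n=3):
        # most_common returns (user, count) pairs, highest count first
        return self.stars.most_common(n)

counter = StarCounter()
counter.add_stars('alice', 2)
counter.add_stars('bob', 1)
counter.add_stars('alice', 3)
print(counter.top(1))  # [('alice', 5)]
```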

★ I don't quite understand why you write each page to a file. I suspect this was intended as a debugging help, but it doesn't add any value to a user of that script. Instead of cluttering the current working directory, make this behavior optional.
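One way to make the saving optional is a keyword parameter that defaults to off. A minimal sketch (save_page, process_page, and the save_dir parameter are hypothetical names, not from the original script):

```python
import os

def save_page(html, save_dir, name):
    # write the raw HTML to save_dir/<name>.html and return the path
    path = os.path.join(save_dir, name + '.html')
    with open(path, 'w', encoding='utf-8') as f:
        f.write(html)
    return path

def process_page(html, page_number, save_dir=None):
    # save_dir=None (the default) keeps the working directory clean;
    # passing a directory opts back in to saving every page
    if save_dir is not None:
        save_page(html, save_dir, 'page-{}'.format(page_number))
    # ... parse html with BeautifulSoup here ...
```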

★ Do not compare to None via the == operator – this is for general comparison. To test for identity, use the is operator: if stars is None. Sometimes, it is preferable to rely on the boolean overload of an object. For example, lists are considered falsy if they are empty.
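A small illustration of the difference (the values here are made up for the example):

```python
stars = None
# identity test: the recommended way to check for None
assert stars is None

# boolean overload: empty collections are falsy
messages = []
if not messages:
    print("no messages yet")

# beware: 0 and '' are also falsy, so `if not x` is not a
# substitute for `if x is None` when 0 is a meaningful value
count = 0
assert not count
assert count is not None
```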

Talking is easy,

coding is hard.

Is this refactoring

equally bad?

```
import time
from bs4 import BeautifulSoup
import requests
from urllib.parse import urljoin

STAR_URL_TEMPLATE = 'http://chat.{0}/rooms/info/{1}/?tab=stars'

def star_pages(start_url):
    current_page = start_url
    while True:
        print("GET {}".format(current_page))
        response = requests.get(current_page)
        response.raise_for_status()
        soup = BeautifulSoup(response.text)
        yield soup
        # find the next page
        next_link = soup.find('a', {'rel': 'next'})
        if next_link is None:
            return
        # urljoin takes care of resolving the relative URL
        current_page = urljoin(current_page, next_link['href'])

def star_count(room_id, site='stackexchange.com'):
    stars = {}
    for page in star_pages(STAR_URL_TEMPLATE.format(site, room_id)):
        for message in page.find_all(attrs={'class': 'monologue'}):
            author = message.find(attrs={'class': 'username'}).string

            star_count = message.find(attrs={'class': 'times'}).string
            if star_count is None:
                star_count = 1

            if author not in stars:
                stars[author] = 0
            stars[author] += int(star_count)

        # be nice to the server, and wait after each page
        time.sleep(1)
    return stars

if __name__ == '__main__':
    the_2nd_monitor_id = 8595
    stars = star_count(the_2nd_monitor_id)
    # print out the stars in descending order
    for author, count in sorted(stars.items(), key=lambda pair: pair[1], reverse=True):
        print("{}: {}".format(author, count))
```

Context

StackExchange Code Review Q#48258, answer score: 28
