We'll be counting stars
Problem
Lately, I've been, I've been losing sleep
Dreaming about the things that we could be
But baby, I've been, I've been praying hard,
Said, no more counting dollars
We'll be counting stars, yeah we'll be counting stars
(One Republic - Counting Stars)
The 2nd Monitor is known to be quite a star-happy chat room, but exactly how many stars are there? And who are the most starred users? I decided to write a script to find out.
I chose to write this in Python because:
- I've never used Python before
- @200_success used it and it didn't look that hard
- I found Beautiful Soup which seemed powerful and easy to use
The script performs a number of HTTP requests to the list of starred messages, keeps track of the numbers in a `dict`, and also saves the pure HTML data to files (to make it easier to perform other calculations on the data in the future, and it gave me the opportunity to learn file I/O in Python).

To be safe, I included a little delay between some of the requests (don't want to risk getting blocked by the system).
Unfortunately there's no way to see which user starred the most messages, but we all know who that is anyway...
The code:
```
from time import sleep
from bs4 import BeautifulSoup
from urllib import request
from collections import OrderedDict
import operator

room = 8595  # The 2nd Monitor (Code Review)
url = 'http://chat.stackexchange.com/rooms/info/{0}/the-2nd-monitor/?tab=stars&page={1}'
pages = 125

def write_files(filename, content):
    with open(filename, 'w', encoding = 'utf-8') as f:
        f.write(content)

def fetch_soup(room, page):
    resource = request.urlopen(url.format(room, page))
    content = resource.read().decode('utf-8')
    mysoup = BeautifulSoup(content)
    return mysoup

allstars = {}

def add_stars(message):
    message_soup = BeautifulSoup(str(message))
    stars = message_soup.select('.times').pop().string
    who = message_soup.select(".username a").pop().string
    # If there is only one star, the `.times` span item does
    # not contain a number
    if stars == None:
        stars = 1
    if who in allstars:
        allstars[who] += int(stars)
    else:
        allstars[who] = int(stars)

for page in range(1, pages + 1):
    soup = fetch_soup(room, page)
    write_files('stars_page_{0}.html'.format(page), str(soup))
    for message in soup.select('.monologue'):
        add_stars(message)
    sleep(1)

for user, count in sorted(allstars.items(), key=lambda kv: kv[1], reverse=True):
    print('{0}: {1}'.format(user, count))
```
Solution
That's a beautiful excuse for writing code, and the final product is quite nice as well.
★ First of all, congrats to your Java-ridden mind for not forcing classes into Python where they aren't needed.

★ You import but do not use `OrderedDict` and `operator`. Remove unused imports.

★ It is customary to write your code in a way that allows it to be used as either a module or a script. For this, the `if __name__ == '__main__'` trick is used:

```
if __name__ == '__main__':
    # executed only if used as a script
```

★ You declare a few variables like `room`, `url`, and `pages` up front. This hinders code reuse:

- The `room` variable should not be declared up front as a global variable, but in your main section. From there, it can be passed down through all functions.
- The `url` specifically but unnecessarily mentions `the-2nd-monitor`. This isn't harmful, but unnecessary as only the ID is relevant. Furthermore, `url` is an awfully short name for such a large scope. Something like `star_url_pattern` would be better – except that global “constants” should be named all-uppercase:

  ```
  STAR_URL_PATTERN = 'http://chat.stackexchange.com/rooms/info/{0}/?tab=stars&page={1}'
  ```

- Reserve plural names for collections. `pages` should rather be `page_count`. But wait – why are we hardcoding this rather than fetching it from the page itself? Just follow the `rel="next"` links until you reach the end.

★ That last idea could be implemented with a generator function. A Python generator function is similar to a simple iterator. It can `yield` elements, or `return` when exhausted. We could build such a generator function that yields a Beautiful Soup object for each page, and takes care of fetching the next one. As a sketch:

```
from urllib.parse import urljoin

def walk_pages(start_url):
    current_page = start_url
    while True:
        content = ...  # fetch the current_page
        soup = BeautifulSoup(content)
        yield soup
        # find the next page
        next_link = soup.find('a', {'rel': 'next'})
        if next_link is None:
            return
        # urljoin takes care of resolving the relative URL
        current_page = urljoin(current_page, next_link['href'])
```

★ Please don't use `urllib.request`. That library has a horrible interface and is more or less broken by design. You might notice that the `.read()` method returns raw bytes, rather than using the charset from the `Content-Type` header to automatically decode the content. This is useful when handling binary data, but an HTML page is text. Instead of hardcoding the encoding `utf-8` (which isn't even the default encoding for HTML), we could use a better library like `requests`. Then:

```
import requests

response = requests.get(current_page)
response.raise_for_status()  # throw an error (only for 4xx or 5xx responses)
content = response.text  # transparently decodes the content
```

★ Your `allstars` variable should not only be named something like `all_stars` (notice the separation of words via an underscore), but also not be a global variable. Consider passing it in as a parameter to `add_stars`, or wrapping this dictionary in an object where `add_stars` would be a method.

★ I don't quite understand why you write each page to a file. I suspect this was intended as a debugging help, but it doesn't add any value to a user of that script. Instead of cluttering the current working directory, make this behavior optional.
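The parameter-passing idea can be sketched in a few lines. This is a simplified, hypothetical signature where the parsed author name and star string are passed in directly rather than the message soup, so the structure is visible without any HTML parsing:

```python
def add_stars(who, stars, all_stars):
    # `stars` mirrors the string from the `.times` span; it is None when
    # the span holds no number (a single star), so test it with `is`,
    # not `==`
    count = int(stars) if stars is not None else 1
    all_stars[who] = all_stars.get(who, 0) + count

tally = {}
add_stars('alice', '3', tally)
add_stars('bob', None, tally)    # lone star: no count in the markup
add_stars('alice', '2', tally)
print(tally)   # {'alice': 5, 'bob': 1}
```

With the tally passed in explicitly, the function can be unit-tested without touching global state, and `dict.get` with a default removes the need for a separate membership check.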
★ Do not compare to `None` via the `==` operator – this is for general comparison. To test for identity, use the `is` operator: `if stars is None`. Sometimes, it is preferable to rely on the boolean overload of an object. For example, arrays are considered falsey if they are empty.

Talking is easy,
coding is hard.
Is this refactoring
equally bad?
```
import time
from bs4 import BeautifulSoup
import requests
from urllib.parse import urljoin

STAR_URL_TEMPLATE = 'http://chat.{0}/rooms/info/{1}/?tab=stars'

def star_pages(start_url):
    current_page = start_url
    while True:
        print("GET {}".format(current_page))
        response = requests.get(current_page)
        response.raise_for_status()
        soup = BeautifulSoup(response.text)
        yield soup
        # find the next page
        next_link = soup.find('a', {'rel': 'next'})
        if next_link is None:
            return
        # urljoin takes care of resolving the relative URL
        current_page = urljoin(current_page, next_link['href'])

def star_count(room_id, site='stackexchange.com'):
    stars = {}
    for page in star_pages(STAR_URL_TEMPLATE.format(site, room_id)):
        for message in page.find_all(attrs={'class': 'monologue'}):
            author = message.find(attrs={'class': 'username'}).string
            star_count = message.find(attrs={'class': 'times'}).string
            if star_count is None:
                star_count = 1
            if author not in stars:
                stars[author] = 0
            stars[author] += int(star_count)
        # be nice to the server, and wait after each page
        time.sleep(1)
    return stars

if __name__ == '__main__':
    the_2nd_monitor_id = 8595
    stars = star_count(the_2nd_monitor_id)
    # print out the stars in descending order
    for author, count in sorted(stars.items(), key=lambda pair: pair[1], reverse=True):
        print("{}: {}".format(author, count))
```
Context
StackExchange Code Review Q#48258, answer score: 28