Let's read a random Goodreads book in an optimal way
Problem
I have made the following program to gather data on random books from Goodreads, via their random books feature.

    import requests
    import re

    URL = "https://www.goodreads.com/book/random"

    while True:
        html_text = requests.get(URL).text
        # Rating Count
        bg_rating_count = html_text.find("", bg_rating_count)
        rating_count = int(html_text[bg_rating_count : end_rating_count].replace(',', ''))
        if rating_count >= 30:
            if "" in html_text:
                # Title
                bg_title = html_text.find("", bg_title)
                title = html_text[bg_title : end_title].replace("&amp;", '&')
                title = re.sub(r'\((.*)\)','', title)
                # Pages
                bg_pages = html_text.find("") + 31
                end_pages = html_text.find(" page", bg_pages)
                pages = int(html_text[bg_pages : end_pages])
                # Rating
                bg_rating = html_text.find("") + 45
                end_rating = html_text.find("

This is my first time working with anything of this sort, and I am still learning Python as I go. Here's a breakdown of what the code does (or, is intended to do):
- This program gets the HTML code from a random book via the URL.
- Finds the book's rating count by searching for a specific HTML tag. Removes any commas in the rating number before making sure it's an integer.
- If the rating count is >= 30, it accepts it and continues to gather data. Otherwise it moves on and tries another random book.
- If it passes the rating count test, it checks whether it has a listed number of pages. If it does, it continues to gather data. Otherwise it moves on and tries another random book.
- If it passes the page count test, it then gathers the title. If the title has the ampersand HTML entity &amp;, it replaces it with the actual ampersand character &. As well, if the title contains text identifying a series, it removes the series text. Here's an example of what I'm talking about, where it say
Solution
After going through your code, I decided to rewrite it, because using regex to parse HTML isn't a good idea at all.
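To make that concrete, here's a rough sketch of how a pattern tied to the exact markup stops matching as soon as the attributes are reordered or requoted, while a real HTML parser still finds the element. It uses only the standard library's html.parser so it runs without extra installs. The two markup samples and all the helper names are made up for illustration; they are not taken from the actual Goodreads pages.

```python
import re
from html.parser import HTMLParser

# Pattern that only matches one exact spelling of the opening tag.
PAGES_RE = re.compile(r'<span itemprop="numberOfPages">(.*?)</span>')


class ItempropText(HTMLParser):
    """Collect the text of the first element whose itemprop matches."""

    def __init__(self, itemprop):
        super().__init__()
        self.itemprop = itemprop
        self.capturing = False
        self.text = None

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs, whatever their order or quoting.
        if self.text is None and dict(attrs).get("itemprop") == self.itemprop:
            self.capturing = True

    def handle_data(self, data):
        if self.capturing:
            self.text = data.strip()
            self.capturing = False


def pages_via_regex(html_text):
    match = PAGES_RE.search(html_text)
    return match.group(1) if match else None


def pages_via_parser(html_text):
    parser = ItempropText("numberOfPages")
    parser.feed(html_text)
    return parser.text


# Hypothetical markup: same element, written two different ways.
PLAIN = '<span itemprop="numberOfPages">320 pages</span>'
REORDERED = "<span class='x' itemprop='numberOfPages'>320 pages</span>"

for sample in (PLAIN, REORDERED):
    print(pages_via_regex(sample), pages_via_parser(sample))
```

The regex returns None for the second sample, while the parser returns the same text for both. This is a simplified sketch (it only grabs the first chunk of text inside the element), not a replacement for a full parsing library.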
When you're parsing HTML, it's recommended to use BeautifulSoup, so I'll rewrite your code using it.

First, I'll start with the imports:

    from bs4 import BeautifulSoup as bs
    import requests

Nothing too fancy so far: we're importing the modules we need to get the work done.

You had a magic number (30) in your code, which we can define as a constant at the top of the program, just below the imports:

    MIN_RATING_COUNT = 30

You had other numbers in your code which didn't make sense to me, so I removed them. Let me know what their purpose was (if any) if there's any difference between my proposed solution and yours.
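One plausible reading of those bare offsets, and it is only a guess since the searched-for tags aren't visible in the question, is that each one is the length of the tag being searched for, added so the slice starts right after the tag. For instance, the string `<span itemprop="numberOfPages">` happens to be 31 characters long, which matches the `+ 31` in the pages lookup. If that's the intent, a small helper built on `len()` says the same thing without magic numbers. The helper name, the marker constant and the sample markup below are hypothetical.

```python
# Hypothetical marker; its length (31) is what the bare offset may have stood for.
PAGES_MARKER = '<span itemprop="numberOfPages">'


def text_between(haystack, start_marker, end_marker):
    """Return the text between two markers, or None if either is missing."""
    start = haystack.find(start_marker)
    if start == -1:
        return None
    start += len(start_marker)  # skip past the marker itself
    end = haystack.find(end_marker, start)
    if end == -1:
        return None
    return haystack[start:end]


# Made-up sample markup, not copied from a real page.
SAMPLE = '<span itemprop="numberOfPages">320 pages</span>'

print(len(PAGES_MARKER))                              # prints 31
print(text_between(SAMPLE, PAGES_MARKER, " page"))    # prints 320
```

Checking for -1 also avoids silently slicing from the wrong position when a marker isn't on the page.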

Moving on, we can now build a function which returns a bs object that we can work on later.

    def get_html_source():
        """Fetch a random book page and return it as a BeautifulSoup object."""
        html_source = requests.get(URL).text
        return bs(html_source, 'html.parser')

Now, let's build another four functions which will get us the rating count, title, pages and rating of a random book.

    def get_book_rating_count(soup):
        """Return the raw rating-count text of the book."""
        return soup.find('span', attrs={'class': 'value-title'}).get_text()


    def get_book_title(soup):
        """Return the raw (unformatted) title of the book."""
        return soup.find('h1', attrs={'class': 'bookTitle'}).get_text()


    def get_book_pages(soup):
        """Return the page-count text of the book."""
        return soup.find('span', attrs={'itemprop': 'numberOfPages'}).get_text()


    def get_book_rating(soup):
        """Return the average-rating text of the book."""
        return soup.find('span', attrs={'itemprop': 'ratingValue'}).get_text()

As I was testing this out, I noticed that book_rating_count might come in different formats, so let's build another function to handle each case:

    def to_float(rating_count):
        """Convert rating-count text to a float, dropping thousands separators."""
        rating = rating_count.split()[0]
        # Commas are thousands separators here, so remove them.
        return float(rating.replace(',', ''))

Regarding the title, we can build another function to format it nicely: collapse the newlines and extra whitespace in it and replace &amp; with &:

    def format_title(book_title):
        """Collapse whitespace in the title and decode the ampersand entity."""
        return ' '.join(book_title.split()).replace('&amp;', '&')

Last, but not least, let's build our main function:

    def main():
        """Keep fetching random books and print those with enough ratings."""
        while True:
            soup = get_html_source()
            book_rating_count = get_book_rating_count(soup)
            if to_float(book_rating_count) >= MIN_RATING_COUNT:
                book_pages = get_book_pages(soup)
                book_title = format_title(get_book_title(soup))
                book_rating = get_book_rating(soup)
                print('Title: {}\n'
                      'Pages: {}\n'
                      'Rating: {}\n\n'.format(book_title, book_pages, book_rating))

If you're also going to terminate the program using CTRL + C, I suggest you put the call to your main function into a try/except block:

    try:
        main()
    except KeyboardInterrupt:
        print("You've decided to close the program")

The final code:

    from bs4 import BeautifulSoup as bs
    import requests

    URL = "https://www.goodreads.com/book/random"
    MIN_RATING_COUNT = 30


    def to_float(rating_count):
        """Convert rating-count text to a float, dropping thousands separators."""
        rating = rating_count.split()[0]
        # Commas are thousands separators here, so remove them.
        return float(rating.replace(',', ''))


    def format_title(book_title):
        """Collapse whitespace in the title and decode the ampersand entity."""
        return ' '.join(book_title.split()).replace('&amp;', '&')


    def get_html_source():
        """Fetch a random book page and return it as a BeautifulSoup object."""
        html_source = requests.get(URL).text
        return bs(html_source, 'html.parser')


    def get_book_rating_count(soup):
        """Return the raw rating-count text of the book."""
        return soup.find('span', attrs={'class': 'value-title'}).get_text()


    def get_book_title(soup):
        """Return the raw (unformatted) title of the book."""
        return soup.find('h1', attrs={'class': 'bookTitle'}).get_text()


    def get_book_pages(soup):
        """Return the page-count text of the book."""
        return soup.find('span', attrs={'itemprop': 'numberOfPages'}).get_text()


    def get_book_rating(soup):
        """Return the average-rating text of the book."""
        return soup.find('span', attrs={'itemprop': 'ratingValue'}).get_text()


    def main():
        """Keep fetching random books and print those with enough ratings."""
        while True:
            soup = get_html_source()
            book_rating_count = get_book_rating_count(soup)
            if to_float(book_rating_count) >= MIN_RATING_COUNT:
                try:
                    book_pages = get_book_pages(soup)
                except AttributeError:
                    # Some books don't list a page count.
                    book_pages = 'No pages available'
                book_title = format_title(get_book_title(soup))
                book_rating = get_book_rating(soup)
                print('Title: {}\n'
                      'Pages: {}\n'
                      'Rating: {}\n\n'.format(book_title, book_pages, book_rating))


    if __name__ == '__main__':
        try:
            main()
        except KeyboardInterrupt:
            print("You've decided to close the program")

If you let the step-by-step version run for several minutes, you'll eventually get an AttributeError, for the simple fact that some books don't list a number of pages, so the final code also wraps that lookup in a try/except block.
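As an alternative to catching AttributeError (an option, not what the code above does): soup.find() returns None when nothing matches, so the missing case can also be checked explicitly before calling get_text(). The helper name text_or_default is hypothetical, and the sketch is deliberately library-free so it only relies on the node being None or having a get_text() method.

```python
def text_or_default(node, default="No pages available"):
    """Return the node's stripped text, or `default` when the lookup found nothing."""
    if node is None:  # soup.find() returns None when no element matches
        return default
    return node.get_text().strip()


# With no matching element, the fallback text is used.
print(text_or_default(None))  # prints No pages available
```

In the scraper this could be used as `text_or_default(soup.find('span', attrs={'itemprop': 'numberOfPages'}))`, which keeps the "no pages" case out of the exception path and doesn't hide AttributeErrors coming from somewhere else.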
Context
StackExchange Code Review Q#162083, answer score: 20