Let's read a random Goodreads book in an optimal way

Submitted by: @import:stackexchange-codereview

Problem

I have made the following program to gather data on random books from Goodreads, via their random books feature.

import requests
import re

URL = "https://www.goodreads.com/book/random"

while True:
html_text = requests.get(URL).text

# Rating Count
bg_rating_count = html_text.find("", bg_rating_count)
rating_count = int(html_text[bg_rating_count : end_rating_count].replace(',', ''))

if rating_count >= 30:
if "" in html_text:

# Title
bg_title = html_text.find("", bg_title)
title = html_text[bg_title : end_title].replace("&amp;", '&')
title = re.sub(r'\((.*)\)','', title)

# Pages
bg_pages = html_text.find("") + 31
end_pages = html_text.find(" page", bg_pages)
pages = int(html_text[bg_pages : end_pages])

# Rating
bg_rating = html_text.find("") + 45
end_rating = html_text.find("

This is my first time working with anything of this sort, and I am still learning Python as I go. Here's a breakdown of what the code does (or, is intended to do):

- This program gets the HTML code from a random book via the URL.
- Finds the book's rating count by searching for a specific HTML tag. Removes any commas in the rating number before making sure it's an integer.
- If the rating count is >= 30, it accepts it and continues to gather data. Otherwise it moves on and tries another random book.
- If it passes the rating count test, it checks whether it has a listed number of pages. If it does, it continues to gather data. Otherwise it moves on and tries another random book.
- If it passes the page count test, it then gathers the title. If the title has an ampersand HTML code &amp;, it replaces it with the actual ampersand character &. As well, if the title contains text identifying a series, it removes the series text. Here's an example of what I'm talking about, where it say

Solution

After going through your code, I decided to rewrite it, because using regex to parse HTML isn't a good idea at all.

When you're parsing HTML, it is recommended to use BeautifulSoup, so I'll rewrite your code using it.

First, I'll start from the imports:

from bs4 import BeautifulSoup as bs
import requests


Nothing too fancy so far: we're importing the modules we need to get the work done.

You had in your code a magic number (30), which we can define at the top of our program, just below the imports:

MIN_RATING_COUNT = 30


You had other numbers in your code which didn't make sense to me, so I removed them. Just let me know what their purpose was (if any) if there's any difference between my proposed solution and yours.

Moving on, we can now build a function which returns a BeautifulSoup object that we can work with later.

def get_html_source():
    """Docstring here."""
    html_source = requests.get(URL).text
    return bs(html_source, 'html.parser')
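
A slightly more defensive variant of the fetch step, offered as a sketch rather than a drop-in replacement: it passes a timeout and raises on HTTP error statuses, so an error page is not parsed as if it were a book. The name fetch_html and the injected get parameter are illustrative assumptions that do not appear in the code above; in real use, get would be requests.get.

```python
def fetch_html(url, get, timeout=10):
    """Download `url` and return the response body as text.

    `get` is expected to behave like requests.get: it accepts a
    `timeout` keyword and returns an object with `.text` and
    `.raise_for_status()`. Passing it in keeps the function easy
    to exercise without touching the network.
    """
    response = get(url, timeout=timeout)
    # Raise an exception for 4xx/5xx responses instead of parsing them.
    response.raise_for_status()
    return response.text
```

With requests installed, the call would look like fetch_html(URL, requests.get), and the returned text can be handed to bs(..., 'html.parser') as before.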


Now, let's build another four functions which will get us the rating count, title, pages and rating of a random book.

def get_book_rating_count(soup):
    """Docstring here."""
    return soup.find('span', attrs={'class': 'value-title'}).get_text()

def get_book_title(soup):
    """Docstring here."""
    return soup.find('h1', attrs={'class': 'bookTitle'}).get_text()

def get_book_pages(soup):
    """Docstring here."""
    return soup.find('span', attrs={'itemprop': 'numberOfPages'}).get_text()

def get_book_rating(soup):
    """Docstring here."""
    return soup.find('span', attrs={'itemprop': 'ratingValue'}).get_text()
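
One thing to keep in mind with these getters: soup.find returns None when nothing matches, so chaining .get_text() onto it raises AttributeError for pages that lack the element. A minimal sketch of a shared helper that returns a fallback instead is shown here; text_or_default is a hypothetical name, not part of BeautifulSoup or of the code above.

```python
def text_or_default(soup, name, attrs, default=None):
    """Return the stripped text of the first matching tag, or `default`.

    `soup` is any object with a BeautifulSoup-style
    find(name, attrs=...) method.
    """
    tag = soup.find(name, attrs=attrs)
    if tag is None:
        # No such element on this page: report the fallback value.
        return default
    return tag.get_text(strip=True)
```

For example, text_or_default(soup, 'span', {'itemprop': 'numberOfPages'}, 'No pages available') would cover books without a page count in a single call.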


As I was testing this out, I noticed that book_rating_count might come in different value formats, so let's build another function to handle each case:

def to_float(rating_count):
    """Docstring here."""
    rating = rating_count.split()[0]
    # Drop the thousands separator, as the original code did ('1,234' -> 1234.0).
    return float(rating.replace(',', ''))
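
Since a rating count is a whole number, it may be simpler to parse it straight to an int. The sketch below assumes the text looks like '1,234 ratings' and that the comma is a thousands separator (the question's code removes commas before calling int); parse_rating_count is a hypothetical name.

```python
import re


def parse_rating_count(text):
    """Extract an integer count from text such as '1,234 ratings'."""
    # First run of digits, allowing commas inside it.
    match = re.search(r'\d[\d,]*', text)
    if match is None:
        raise ValueError('no number found in {!r}'.format(text))
    # Remove the separators before converting.
    return int(match.group().replace(',', ''))
```

Raising ValueError for text without digits keeps a changed page layout from silently turning into a count of zero.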


Regarding the title, we can build another function to nicely format it: remove the newlines in it and replace &amp; with &:

def format_title(book_title):
    """Docstring here."""
    return ' '.join(book_title.split()).replace('&amp;', '&')
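
The question also asked for series text to be removed from the title, which the function above does not do. Here is a sketch that handles both, under the assumption that the series appears as a trailing parenthesised suffix such as "(Harry Potter, #4)". It uses html.unescape from the standard library so that entities other than &amp; are decoded too (text returned by get_text() is usually decoded already, in which case that step changes nothing). clean_title is a hypothetical name.

```python
import html
import re


def clean_title(raw_title):
    """Decode HTML entities, collapse whitespace, drop a trailing '(series)'."""
    # Decode entities, then squeeze newlines and runs of spaces.
    text = ' '.join(html.unescape(raw_title).split())
    # Remove one parenthesised group only when it ends the title.
    return re.sub(r'\s*\([^()]*\)$', '', text)
```

Anchoring the pattern at the end of the string leaves parentheses in the middle of a title untouched, unlike a greedy pattern such as \((.*)\).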


Last, but not least, let's build our main function:

def main():
    """Docstring here."""
    while True:
        soup = get_html_source()
        book_rating_count = get_book_rating_count(soup)
    
        if to_float(book_rating_count) >= MIN_RATING_COUNT:
            book_pages = get_book_pages(soup)
            book_title = format_title(get_book_title(soup))
            book_rating = get_book_rating(soup)
    
            print('Title: {}\n'
                  'Pages: {}\n'
                  'Rating: {}\n\n'.format(book_title, book_pages, book_rating))
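
The loop above runs until it is interrupted and requests pages back to back. A bounded variant with a pause between requests could look roughly like this. Everything in it is an illustrative assumption: collect_books, its parameters, and the idea that fetch_book returns a dict with an integer 'rating_count' key are not part of the code above.

```python
import time


def collect_books(fetch_book, min_rating_count=30, wanted=3,
                  max_attempts=50, delay=0.0):
    """Call fetch_book() until `wanted` books pass the filter.

    Stops early after `max_attempts` calls. `fetch_book` is a
    hypothetical callable returning a dict that contains at least
    an integer 'rating_count'.
    """
    accepted = []
    for _ in range(max_attempts):
        book = fetch_book()
        if book['rating_count'] >= min_rating_count:
            accepted.append(book)
            if len(accepted) == wanted:
                break
        # Pause between requests to go easy on the site.
        time.sleep(delay)
    return accepted
```

Passing the fetch step in as a callable means the filtering logic can be tried with fake data before it is pointed at the real site.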


If you're also going to terminate the program using CTRL + C, I suggest you put your main function into a try/except block.

try:
    main()
except KeyboardInterrupt:
    print("You've decided to close the program")


The final code:

from bs4 import BeautifulSoup as bs
import requests

URL = "https://www.goodreads.com/book/random"
MIN_RATING_COUNT = 30

def to_float(rating_count):
    """Docstring here."""
    rating = rating_count.split()[0]
    # Drop the thousands separator, as the original code did ('1,234' -> 1234.0).
    return float(rating.replace(',', ''))

def format_title(book_title):
    """Docstring here."""
    return ' '.join(book_title.split()).replace('&amp;', '&')

def get_html_source():
    """Docstring here."""
    html_source = requests.get(URL).text
    return bs(html_source, 'html.parser')

def get_book_rating_count(soup):
    """Docstring here."""
    return soup.find('span', attrs={'class': 'value-title'}).get_text()

def get_book_title(soup):
    """Docstring here."""
    return soup.find('h1', attrs={'class': 'bookTitle'}).get_text()

def get_book_pages(soup):
    """Docstring here."""
    return soup.find('span', attrs={'itemprop': 'numberOfPages'}).get_text()

def get_book_rating(soup):
    """Docstring here."""
    return soup.find('span', attrs={'itemprop': 'ratingValue'}).get_text()

def main():
    """Docstring here."""
    while True:
        soup = get_html_source()
        book_rating_count = get_book_rating_count(soup)

        if to_float(book_rating_count) >= MIN_RATING_COUNT:
            try:
                book_pages = get_book_pages(soup)
            except AttributeError:
                book_pages = 'No pages available'
            book_title = format_title(get_book_title(soup))
            book_rating = get_book_rating(soup)

            print('Title: {}\n'
                  'Pages: {}\n'
                  'Rating: {}\n\n'.format(book_title, book_pages, book_rating))

if __name__ == '__main__':
    try:
        main()
    except KeyboardInterrupt:
        print("You've decided to close the program")


If you let the above run for several minutes, you'll eventually get an AttributeError for the simple fact that some books don't have the number of pages, so I also added that

Context

StackExchange Code Review Q#162083, answer score: 20
