HiveBrain v1.2.0
Get Started
← Back to all entries
patternpythonMinor

Finding the occurrences of all words in movie scripts

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
thescriptsoccurrencesallmoviewordsfinding

Problem

I was wondering if someone could tell me things I could improve in this code. This is one of my first Python projects. This program gets the script of a movie (in this case Interstellar) and then finds the occurrences of all words and prints out the 20 most common words in the script.

from bs4 import BeautifulSoup
import requests
from collections import Counter 
import operator
import re

class MovieScript(object):
    def __init__(self, movie):
        self.movie = movie
        self.url = "http://www.springfieldspringfield.co.uk/movie_script.php?movie={0}".format(movie.replace(" ", "-"))

    def get_movie(self):
        return self.movie

    def get_text(self):
        r = requests.get(self.url)
        soup = BeautifulSoup(r.text)
        return soup

    def parse_text(self):
        text = self.get_text().findAll('div', {"class":"scrolling-script-container"})[0].text.lower()
        text = re.sub(r'\W+', ' ', text) #remove puncation from the text
        return text.split()

    def find_all_occurences(self):
        return Counter(self.parse_text())

    def get_all_occurences(self):
        sorted_occurences = sorted(self.find_all_occurences().items(), key=operator.itemgetter(1))
        return sorted_occurences

def Main():
    db = MovieScript("interstellar")
    print "The Movie: " + db.get_movie()
    all_occurences = db.get_all_occurences()
    for i in range(-1, -21, -1):
        print str(-i) + ". " + str(all_occurences[i][0]) + " -> " + str(all_occurences[i][1])

if __name__ == "__main__":
    Main()

Solution

A few things I can think of are:

  • parse_text can be marked as a private method. def _parse_text(self):. I understand this won't actually be hidden, but its a convention followed in python to treat methods starting with an underscore as private.



  • In MovieScript __init__ method maybe have the url as a kwarg (defaulting to http://www.springfieldspringfield.co.uk, but can be overwritten and point to another site in case you ever have to scrape another site)



  • In def get_text: You can cache the result from the request.get you're running and check the cache before making the request



  • Finally consider moving the logic of calculating the top keywords inside the MovieScript class itself (call it something like def get_common_words(self, count=20):) and allow the count to change in case you ever want more than the default.



I'm sure there are more and people will have there own preferences, but this is just what I could suggest in a quick glimpse :)

Context

StackExchange Code Review Q#73887, answer score: 8

Revisions (0)

No revisions yet.