patternpythonMinor
Finding the occurrences of all words in movie scripts
Viewed 0 times
thescriptsoccurrencesallmoviewordsfinding
Problem
I was wondering if someone could tell me things I could improve in this code. This is one of my first Python projects. This program gets the script of a movie (in this case Interstellar) and then finds the occurrences of all words and prints out the 20 most common words in the script.
from bs4 import BeautifulSoup
import requests
from collections import Counter
import operator
import re
class MovieScript(object):
def __init__(self, movie):
self.movie = movie
self.url = "http://www.springfieldspringfield.co.uk/movie_script.php?movie={0}".format(movie.replace(" ", "-"))
def get_movie(self):
return self.movie
def get_text(self):
r = requests.get(self.url)
soup = BeautifulSoup(r.text)
return soup
def parse_text(self):
text = self.get_text().findAll('div', {"class":"scrolling-script-container"})[0].text.lower()
text = re.sub(r'\W+', ' ', text) #remove puncation from the text
return text.split()
def find_all_occurences(self):
return Counter(self.parse_text())
def get_all_occurences(self):
sorted_occurences = sorted(self.find_all_occurences().items(), key=operator.itemgetter(1))
return sorted_occurences
def Main():
db = MovieScript("interstellar")
print "The Movie: " + db.get_movie()
all_occurences = db.get_all_occurences()
for i in range(-1, -21, -1):
print str(-i) + ". " + str(all_occurences[i][0]) + " -> " + str(all_occurences[i][1])
if __name__ == "__main__":
Main()Solution
A few things I can think of are:
I'm sure there are more and people will have there own preferences, but this is just what I could suggest in a quick glimpse :)
parse_textcan be marked as a private method.def _parse_text(self):. I understand this won't actually be hidden, but its a convention followed in python to treat methods starting with an underscore as private.
- In
MovieScript__init__method maybe have the url as a kwarg (defaulting to http://www.springfieldspringfield.co.uk, but can be overwritten and point to another site in case you ever have to scrape another site)
- In
def get_text: You can cache the result from the request.get you're running and check the cache before making the request
- Finally consider moving the logic of calculating the top keywords inside the
MovieScriptclass itself (call it something likedef get_common_words(self, count=20):) and allow the count to change in case you ever want more than the default.
I'm sure there are more and people will have there own preferences, but this is just what I could suggest in a quick glimpse :)
Context
StackExchange Code Review Q#73887, answer score: 8
Revisions (0)
No revisions yet.