Non-repeatability (plagiarism and time travel)
Problem
I heard/read that the writer of a text can be recognized by counting the words they use and comparing these to their previous works. As a code kata I experimented with this a bit and found some disturbing facts.
- 'Romeo and Juliet' was written by one of the Bronte sisters.
- Shakespeare plagiarized it from the future.
- My program is not repeatable.
I think the last is the most disturbing; see the number of corresponding words in brackets in my results below.
The code I used is also below. The .txt files were downloaded from https://www.gutenberg.org/ and conveniently renamed; they are temporarily available in a zip at http://kuiken.dyndns.org/spul/books.zip.
Questions:
- Why is my code not repeatable?
- Coding style comments
Results:
jk@ASUS-flaptop:~/PythonFun/Boeken$ python books.py
I read WutheringHeights.txt by Bronte
I read PrideAndPrejudice.txt by Austen
I read TheMerchantOfVenice.txt by Shakespeare
I think SensAndSensibility.txt is written by Austen (405)
I think RomeoAndJuliet.txt is written by Bronte (278)
I think JaneEyre.txt is written by Bronte (376)
I think Hamlet.txt is written by Shakespeare (324)
jk@ASUS-flaptop:~/PythonFun/Boeken$ python books.py
I read WutheringHeights.txt by Bronte
I read PrideAndPrejudice.txt by Austen
I read TheMerchantOfVenice.txt by Shakespeare
I think SensAndSensibility.txt is written by Austen (404)
I think RomeoAndJuliet.txt is written by Bronte (279)
I think JaneEyre.txt is written by Bronte (376)
I think Hamlet.txt is written by Shakespeare (319)
Code:
```python
from collections import Counter
from string import ascii_letters, whitespace


def read_book(filename):
    with open(filename) as f:
        return f.read()


def make_words(text):
    '''removes weird characters and split into words'''
    remain = ascii_letters + whitespace
    filtered = ""
    for ch in text:
        if ch in remain:
            filtered += ch
    return filtered.lower().split()


def most_used(words, n=500):
    '''returns a set of the n` mo
```
Solution
More classes
Making the author guesser a class was a good start, but I think authors also should be a class. Making them a class now will allow you to work more with the concept of an author in the future (see "More work to do").
Cleaning up
You don't really need the book reading method, and you can simplify how you teach your guesser about authors and then test them later. I've tried to leave the general spirit of your code intact:
```python
from collections import Counter
from string import ascii_letters, whitespace


def filter_words(text):
    '''removes weird characters and splits into words'''
    remain = ascii_letters + whitespace
    filtered = [ch.lower() for ch in text if ch in remain]
    return "".join(filtered).split()


def most_used(words, n=500):
    '''returns a set of the "n" most used words'''
    most_used_words = Counter(words).most_common(n)
    return set(word for word, frequency in most_used_words)


def read_and_analyse(filename):
    '''reads a book and returns most used words'''
    with open(filename) as book:
        # Pass the text, not the file object: iterating a file yields
        # whole lines, which would not match single characters in `remain`.
        words = filter_words(book.read())
    favorites = most_used(words)
    return favorites


class Author(object):
    def __init__(self, name, favorite_words):
        self.name = name
        self.favorite_words = favorite_words


class WriterGuesser(object):
    def __init__(self):
        self.analyzed_authors = []

    def learn_about_authors(self, book, name):
        favorite_words = read_and_analyse(book)
        author = Author(name, favorite_words)
        self.analyzed_authors.append(author)
        print("I read {} by {}".format(book, name))

    def recognize_author(self, book):
        favorite_words = read_and_analyse(book)
        max_identical_words = 0
        guess = None
        for author in self.analyzed_authors:
            common = len(favorite_words.intersection(author.favorite_words))
            if common > max_identical_words:
                max_identical_words = common
                guess = author.name
        print("I think {} is written by {} ({} words in common)".format(
            book, guess, max_identical_words))


def main():
    nerd = WriterGuesser()
    learning_criteria = [
        ('WutheringHeights.txt', 'Bronte'),
        ('PrideAndPrejudice.txt', 'Austen'),
        ('TheMerchantOfVenice.txt', 'Shakespeare')
    ]
    for book, author in learning_criteria:
        nerd.learn_about_authors(book, author)
    books_to_recognize = [
        'SensAndSensibility.txt',
        'RomeoAndJuliet.txt',
        'JaneEyre.txt',
    ]
    for book in books_to_recognize:
        nerd.recognize_author(book)


if __name__ == "__main__":
    main()
```
More work to do
- You can only learn about one book from an author. Surely, you'd like a system that allows you to create an author and then use several books to develop their favorite words. This is where a class is helpful: you can create the author class, get their favorite words, and then update them as you learn more books.
- You should expand this so that you can use many books by the same author to learn favorite words. Perhaps return the frequency of a word along with the word itself (which Counter already does), and remove less popular words as more popular ones come in.
- You should also use the frequency of the word to determine how confident you are that the authors are the same. Say an author's favorite words are ["cat", "purple", "clock"], but they use "clock" 600 times and use the rest only 10 times. Your code will pick a book with "cat" and "purple" over a book with 500 instances of "clock", because 2 matching words > 1 matching word. But we know "clock" is a more important word!
- Everybody's favorite words are going to be "the", "it", "a", "and", etc. Consider making a list of boring words and not using them for your filtering.
Context
StackExchange Code Review Q#133543, answer score: 4