Non-repeatability (plagiarism and time travel)

Submitted by: @import:stackexchange-codereview

Problem

I heard/read that the writer of a text can be recognized by counting the words they use and comparing these counts to previous works. As a code kata I experimented with this a bit and found some disturbing facts.

  • 'Romeo and Juliet' was written by one of the Bronte sisters.
  • Shakespeare plagiarized it from the future.
  • My program is not repeatable.

I think the last is the most disturbing; see the number of corresponding words in brackets in my results below.

The code I used is also below. The .txt files were downloaded from https://www.gutenberg.org/ and conveniently renamed; they are temporarily available in a zip at http://kuiken.dyndns.org/spul/books.zip.

Questions:

  • Why is my code not repeatable?
  • Coding style comments

Results:

jk@ASUS-flaptop:~/PythonFun/Boeken$ python books.py 

I read WutheringHeights.txt by Bronte
I read PrideAndPrejudice.txt by Austen
I read TheMerchantOfVenice.txt by Shakespeare

I think SensAndSensibility.txt is written by Austen (405)
I think RomeoAndJuliet.txt is written by Bronte (278)
I think JaneEyre.txt is written by Bronte (376)
I think Hamlet.txt is written by Shakespeare (324)

jk@ASUS-flaptop:~/PythonFun/Boeken$ python books.py 

I read WutheringHeights.txt by Bronte
I read PrideAndPrejudice.txt by Austen
I read TheMerchantOfVenice.txt by Shakespeare

I think SensAndSensibility.txt is written by Austen (404)
I think RomeoAndJuliet.txt is written by Bronte (279)
I think JaneEyre.txt is written by Bronte (376)
I think Hamlet.txt is written by Shakespeare (319)


Code:

``
from collections import Counter
from string import ascii_letters, whitespace

def read_book(filename):
    with open(filename) as f:
        return f.read()

def make_words(text):
    '''removes weird characters and split into words'''
    remain = ascii_letters + whitespace
    filtered = ""
    for ch in text:
        if ch in remain:
            filtered += ch
    return filtered.lower().split()

def most_used(words, n=500):
'''returns a
set of the n` mo

Solution

More classes

Making the author guesser a class was a good start, but I think authors also should be a class. Making them a class now will allow you to work more with the concept of an author in the future (see "More work to do").

Cleaning up

You don't really need the book reading method, and you can simplify how you teach your guesser about authors and then test them later. I've tried to leave the general spirit of your code intact:

from collections import Counter
from string import ascii_letters, whitespace

def filter_words(text):
    '''removes weird characters and split into words'''
    remain = ascii_letters + whitespace
    filtered = [ch.lower() for ch in text if ch in remain]
    return "".join(filtered).split()

def most_used(words, n=500):
    '''returns a set of the "n" most used words'''
    most_used_words = Counter(words).most_common(n)
    return set([word for word, frequency in most_used_words])

def read_and_analyse(filename):
    '''reads a book and returns a set of its most used words'''
    with open(filename) as book:
        # read() gives the whole text; iterating the file object would yield lines
        words = filter_words(book.read())
    return most_used(words)

class Author(object):
    def __init__(self, name, favorite_words):
        self.name = name
        self.favorite_words = favorite_words

class WriterGuesser(object):
    def __init__(self):
        self.analyzed_authors = []

    def learn_about_authors(self, book, name):
        favorite_words = read_and_analyse(book)
        author = Author(name, favorite_words)
        self.analyzed_authors.append(author)
        print("I read {} by {}".format(book, name))

    def recognize_author(self, book):
        favorite_words = read_and_analyse(book)
        max_identical_words = 0
        guess = None
        for author in self.analyzed_authors:
            common = len(favorite_words.intersection(author.favorite_words))
            if common > max_identical_words:
                max_identical_words = common
                guess = author.name
        print("I think {} is written by {} ({} words in common)".format(book, guess, max_identical_words))

def main():
    nerd = WriterGuesser()

    learning_criteria = [
        ('WutheringHeights.txt', 'Bronte'),
        ('PrideAndPrejudice.txt', 'Austen'),
        ('TheMerchantOfVenice.txt', 'Shakespeare')
    ]

    for book, author in learning_criteria:
        nerd.learn_about_authors(book, author)

    books_to_recognize = [
        'SensAndSensibility.txt',
        'RomeoAndJuliet.txt',
        'JaneEyre.txt',
    ]

    for book in books_to_recognize:
        nerd.recognize_author(book)

if __name__ == "__main__":
    main()
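
On the repeatability question: one possible contributor, offered as an assumption rather than a confirmed diagnosis, is tie-breaking at the 500-word cutoff. Many words share the same count, and on interpreters where dicts are unordered and string hashing is randomized per run (for example CPython 3.3 to 3.5), Counter.most_common(n) may keep different tied words each time, which would shift the overlap counts by a few words. A minimal sketch of a deterministic variant follows; the name most_used_stable is hypothetical and not part of the original code:

```python
from collections import Counter


def most_used_stable(words, n=500):
    """Return the n most used words, breaking count ties alphabetically."""
    counts = Counter(words)
    # Sort by descending count, then by the word itself, so the result
    # does not depend on dict iteration order.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return set(word for word, _ in ranked[:n])


# Three words tie for the top count; with n=2 the alphabetically
# first two win the tie on every run.
sample = ["b", "a", "c", "a", "b", "c", "d"]
print(sorted(most_used_stable(sample, n=2)))  # prints ['a', 'b']
```

If the numbers in brackets stop changing after swapping this in for most_used, ties at the cutoff were the cause; if they still change, the variation comes from somewhere else.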


More work to do

- You can only learn about one book from an author. Surely, you'd like a system that allows you to create an author and then use several books to develop their favorite words. This is where a class is helpful: you can create the author class, get their favorite words, and then update them as you learn from more books.

- You should expand this so that you can use many books by the same author to learn favorite words. Perhaps return the frequency of a word along with the word itself (which Counter already does), and remove less popular words as more popular ones come in.

- You should also use the frequency of the word to determine how confident you are that the authors are the same. Say an author's favorite words are ["cat", "purple", "clock"], but they use "clock" 600 times and use the rest only 10 times. Your code will pick a book with "cat" and "purple" over a book with 500 instances of "clock", because 2 matching words > 1 matching word. But we know "clock" is a more important word!

- Everybody's favorite words are going to be "the", "it", "a", "and", etc. Consider making a list of boring words and excluding them from your filtering.
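
The suggestions in the list above could be combined roughly as follows. This is a sketch under assumptions, not a finished design: BORING_WORDS is a hypothetical and deliberately tiny stop-word list, the learn, favorite_words and similarity names are invented for illustration, and the min-of-frequencies score is just one simple way to weight shared words by how often they are used.

```python
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; a real one would be longer.
BORING_WORDS = frozenset(["the", "it", "a", "and", "of", "to", "in", "i"])


class Author(object):
    """Accumulates word counts over any number of books."""

    def __init__(self, name):
        self.name = name
        self.word_counts = Counter()

    def learn(self, words):
        """Add one book's words (an iterable of lowercase strings)."""
        self.word_counts.update(w for w in words if w not in BORING_WORDS)

    def favorite_words(self, n=500):
        """Return the n most used words mapped to their relative frequency."""
        total = float(sum(self.word_counts.values()))
        return {word: count / total
                for word, count in self.word_counts.most_common(n)}


def similarity(author, words, n=500):
    """Score a book against an author, weighting shared words by frequency.

    Sums min(author frequency, book frequency) over the shared favorites,
    so one heavily used word can outweigh several rarely used ones.
    """
    book = Author("unknown")
    book.learn(words)
    author_freq = author.favorite_words(n)
    book_freq = book.favorite_words(n)
    return sum(min(freq, book_freq[word])
               for word, freq in author_freq.items() if word in book_freq)
```

With the "clock" example, an author who uses "clock" 600 times and "cat" and "purple" 10 times each scores a book that is half "clock" higher than a book that mentions "cat" and "purple" once each, which is the behavior the third point asks for. Calling learn again with another book by the same author simply adds to the existing counts.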

Context

StackExchange Code Review Q#133543, answer score: 4
