patternpythonMinor

Finding the similarity between the two movies using Pearson correlation coefficient

Submitted by: @import:stackexchange-codereview·Mar 10, 2026·

Viewed 0 times

themoviescoefficienttwobetweenpearsonusingfindingsimilaritycorrelation

Problem

I am trying to find the similarity between the two movies using Pearson correlation coefficient. The programs is working well for small inputs but for large inputs (like 100000 lines) it takes forever. My professor said it would take few minutes, but my program is executing forever.

The input format: user|movie|rating.

If two or more users watch two movies, we will find similarity between them based on rating.

```
# -- coding: utf-8 --
"""
Created on June 06, 2016

@author: Praveen Allam
"""

from mrjob.job import MRJob
from mrjob.step import MRStep
from itertools import combinations
from itertools import izip
from math import sqrt

class PearsonCorrelation(MRJob):
def steps(self):
return [
MRStep(mapper=self.mapper1,
reducer=self.reducer1),
MRStep(mapper=self.mapper2,
reducer=self.reducer2)
]

def mapper1(self, _, line):

##yield each line to first mapper
user,movie,rating=line.split('|')
yield None,[user,movie,rating]

def reducer1(self, _, value):

##yield all combinations to second mapper
for item1,item2 in combinations(value,2):
yield item1,item2

def mapper2(self,value1,value2):

##yield movie1,movie2 and corresponding ratings of user.
if(value1[0]==value2[0]):
yield [value1[1],value2[1]],[float(value1[2]),float(value2[2])]

def reducer2(self,movies,ratings):
rating=[]
for item in ratings:
rating.append(item)

##extract using izip only if there are more than one instance.
if(len(rating)>1):
v1,v2=izip(*rating)

##calculate the pearson coefficient
corr=self.pearsonCoefficient(v1,v2)

##yield the result
yield "The Similarity between "+movies[0]+" and "+movies[1]+" is: " , corr
def pearsonCoefficient(self,a,b):
n=len(b)
value=range(n)

#sums of indiv

Solution

So many problems!

-
All the outputs of mapper1 have the same key, so the first mapping step is wasted.

-
reducer1 generates all pairs of records, but most of these pairs will be worthless because they have different users and so will be discarded by mapper2

-
The similarity between movie A and movie B can end up being computed twice on half the data.

These problems are really easy to see if you work through a simple example by hand to see what your code will do.

For example, suppose the data is:

Haruko|Koyaanisqatsi|8
Natsuko|Powaqqatsi|9
Haruko|Powaqqatsi|7
Natsuko|Koyaanisqatsi|10
Natsuko|Naqoyqatsi|8
Haruko|Naqoyqatsi|6

Then mapper1 generates the following key–value pairs:

Key   Value
----  --------------------------------
None  ['Haruko', 'Koyaanisqatsi', 8]
None  ['Natsuko', 'Powaqqatsi', 9]
None  ['Haruko', 'Powaqqatsi', 7]
None  ['Natsuko', 'Koyaanisqatsi', 10]
None  ['Natsuko', 'Naqoyqatsi', 8]
None  ['Haruko', 'Naqoyqatsi', 6]

Since these all have the same key (None) all the values get collected into a single list and passed to one instance of reducer1. So the mapping step was wasted — there's no parallelism here.

Next, reducer1 generates the following key–value pairs:

Key                                Value
--------------------------------   --------------------------------
['Haruko', 'Koyaanisqatsi', 8]     ['Natsuko', 'Powaqqatsi', 9]
['Haruko', 'Koyaanisqatsi', 8]     ['Haruko', 'Powaqqatsi', 7]
['Haruko', 'Koyaanisqatsi', 8]     ['Natsuko', 'Koyaanisqatsi', 10]
['Haruko', 'Koyaanisqatsi', 8]     ['Natsuko', 'Naqoyqatsi', 8]
['Haruko', 'Koyaanisqatsi', 8]     ['Haruko', 'Naqoyqatsi', 6]
['Natsuko', 'Powaqqatsi', 9]       ['Haruko', 'Powaqqatsi', 7]
['Natsuko', 'Powaqqatsi', 9]       ['Natsuko', 'Koyaanisqatsi', 10]
['Natsuko', 'Powaqqatsi', 9]       ['Natsuko', 'Naqoyqatsi', 8]
['Natsuko', 'Powaqqatsi', 9]       ['Haruko', 'Naqoyqatsi', 6]
['Haruko', 'Powaqqatsi', 7]        ['Natsuko', 'Koyaanisqatsi', 10]
['Haruko', 'Powaqqatsi', 7]        ['Natsuko', 'Naqoyqatsi', 8]
['Haruko', 'Powaqqatsi', 7]        ['Haruko', 'Naqoyqatsi', 6]
['Natsuko', 'Koyaanisqatsi', 10]   ['Natsuko', 'Naqoyqatsi', 8]
['Natsuko', 'Koyaanisqatsi', 10]   ['Haruko', 'Naqoyqatsi', 6]
['Natsuko', 'Naqoyqatsi', 8]       ['Haruko', 'Naqoyqatsi', 6]

You can see that most of the output of this step is useless because it consists of records for different users.

Then the mapper2 step filters key–value pairs that have the same user, getting:

Key                               Value
-------------------------------   --------
['Koyaanisqatsi', 'Powaqqatsi']   [8, 7]
['Koyaanisqatsi', 'Naqoyqatsi']   [8, 6]
['Powaqqatsi', 'Koyaanisqatsi']   [9, 10]
['Powaqqatsi', 'Naqoyqatsi']      [9, 8]
['Powaqqatsi', 'Naqoyqatsi']      [7, 6]
['Koyaanisqatsi', 'Naqoyqatsi']   [10, 8]

The values are collected under the different keys and passed to four instances of reducer2:

Key                               Values
-------------------------------   -----------------
['Koyaanisqatsi', 'Powaqqatsi']   [[8, 7]]
['Koyaanisqatsi', 'Naqoyqatsi']   [[8, 6], [10, 8]]
['Powaqqatsi', 'Koyaanisqatsi']   [[9, 10]]
['Powaqqatsi', 'Naqoyqatsi']      [[9, 8], [7, 6]]

Notice that there should be only three pairs of movies, but one of the pairs appears twice.

So, how should you go about fixing the code? Your best bet, I think, is to leave the code alone for the moment, and to work out (on paper, or in a text editor) exactly what the data needs to look like after each step, using a small example (like I did above). When you're happy with the design of each step, then it should be trivial to program it.

I recommend giving the mappers and reducers names that describe what they do. This will make the code easier to understand. See the examples in the MRJob documentation which have names like mapper_get_words and reducer_count_words.

Code Snippets

Haruko|Koyaanisqatsi|8
Natsuko|Powaqqatsi|9
Haruko|Powaqqatsi|7
Natsuko|Koyaanisqatsi|10
Natsuko|Naqoyqatsi|8
Haruko|Naqoyqatsi|6

Key   Value
----  --------------------------------
None  ['Haruko', 'Koyaanisqatsi', 8]
None  ['Natsuko', 'Powaqqatsi', 9]
None  ['Haruko', 'Powaqqatsi', 7]
None  ['Natsuko', 'Koyaanisqatsi', 10]
None  ['Natsuko', 'Naqoyqatsi', 8]
None  ['Haruko', 'Naqoyqatsi', 6]

Key                                Value
--------------------------------   --------------------------------
['Haruko', 'Koyaanisqatsi', 8]     ['Natsuko', 'Powaqqatsi', 9]
['Haruko', 'Koyaanisqatsi', 8]     ['Haruko', 'Powaqqatsi', 7]
['Haruko', 'Koyaanisqatsi', 8]     ['Natsuko', 'Koyaanisqatsi', 10]
['Haruko', 'Koyaanisqatsi', 8]     ['Natsuko', 'Naqoyqatsi', 8]
['Haruko', 'Koyaanisqatsi', 8]     ['Haruko', 'Naqoyqatsi', 6]
['Natsuko', 'Powaqqatsi', 9]       ['Haruko', 'Powaqqatsi', 7]
['Natsuko', 'Powaqqatsi', 9]       ['Natsuko', 'Koyaanisqatsi', 10]
['Natsuko', 'Powaqqatsi', 9]       ['Natsuko', 'Naqoyqatsi', 8]
['Natsuko', 'Powaqqatsi', 9]       ['Haruko', 'Naqoyqatsi', 6]
['Haruko', 'Powaqqatsi', 7]        ['Natsuko', 'Koyaanisqatsi', 10]
['Haruko', 'Powaqqatsi', 7]        ['Natsuko', 'Naqoyqatsi', 8]
['Haruko', 'Powaqqatsi', 7]        ['Haruko', 'Naqoyqatsi', 6]
['Natsuko', 'Koyaanisqatsi', 10]   ['Natsuko', 'Naqoyqatsi', 8]
['Natsuko', 'Koyaanisqatsi', 10]   ['Haruko', 'Naqoyqatsi', 6]
['Natsuko', 'Naqoyqatsi', 8]       ['Haruko', 'Naqoyqatsi', 6]

Key                               Value
-------------------------------   --------
['Koyaanisqatsi', 'Powaqqatsi']   [8, 7]
['Koyaanisqatsi', 'Naqoyqatsi']   [8, 6]
['Powaqqatsi', 'Koyaanisqatsi']   [9, 10]
['Powaqqatsi', 'Naqoyqatsi']      [9, 8]
['Powaqqatsi', 'Naqoyqatsi']      [7, 6]
['Koyaanisqatsi', 'Naqoyqatsi']   [10, 8]

Key                               Values
-------------------------------   -----------------
['Koyaanisqatsi', 'Powaqqatsi']   [[8, 7]]
['Koyaanisqatsi', 'Naqoyqatsi']   [[8, 6], [10, 8]]
['Powaqqatsi', 'Koyaanisqatsi']   [[9, 10]]
['Powaqqatsi', 'Naqoyqatsi']      [[9, 8], [7, 6]]

Context

StackExchange Code Review Q#131294, answer score: 6

Revisions (0)

No revisions yet.