patternpythonMinor
Finding the similarity between the two movies using Pearson correlation coefficient
Viewed 0 times
themoviescoefficienttwobetweenpearsonusingfindingsimilaritycorrelation
Problem
I am trying to find the similarity between the two movies using Pearson correlation coefficient. The programs is working well for small inputs but for large inputs (like 100000 lines) it takes forever. My professor said it would take few minutes, but my program is executing forever.
The input format:
If two or more users watch two movies, we will find similarity between them based on rating.
```
# -- coding: utf-8 --
"""
Created on June 06, 2016
@author: Praveen Allam
"""
from mrjob.job import MRJob
from mrjob.step import MRStep
from itertools import combinations
from itertools import izip
from math import sqrt
class PearsonCorrelation(MRJob):
def steps(self):
return [
MRStep(mapper=self.mapper1,
reducer=self.reducer1),
MRStep(mapper=self.mapper2,
reducer=self.reducer2)
]
def mapper1(self, _, line):
##yield each line to first mapper
user,movie,rating=line.split('|')
yield None,[user,movie,rating]
def reducer1(self, _, value):
##yield all combinations to second mapper
for item1,item2 in combinations(value,2):
yield item1,item2
def mapper2(self,value1,value2):
##yield movie1,movie2 and corresponding ratings of user.
if(value1[0]==value2[0]):
yield [value1[1],value2[1]],[float(value1[2]),float(value2[2])]
def reducer2(self,movies,ratings):
rating=[]
for item in ratings:
rating.append(item)
##extract using izip only if there are more than one instance.
if(len(rating)>1):
v1,v2=izip(*rating)
##calculate the pearson coefficient
corr=self.pearsonCoefficient(v1,v2)
##yield the result
yield "The Similarity between "+movies[0]+" and "+movies[1]+" is: " , corr
def pearsonCoefficient(self,a,b):
n=len(b)
value=range(n)
#sums of indiv
The input format:
user|movie|rating.If two or more users watch two movies, we will find similarity between them based on rating.
```
# -- coding: utf-8 --
"""
Created on June 06, 2016
@author: Praveen Allam
"""
from mrjob.job import MRJob
from mrjob.step import MRStep
from itertools import combinations
from itertools import izip
from math import sqrt
class PearsonCorrelation(MRJob):
def steps(self):
return [
MRStep(mapper=self.mapper1,
reducer=self.reducer1),
MRStep(mapper=self.mapper2,
reducer=self.reducer2)
]
def mapper1(self, _, line):
##yield each line to first mapper
user,movie,rating=line.split('|')
yield None,[user,movie,rating]
def reducer1(self, _, value):
##yield all combinations to second mapper
for item1,item2 in combinations(value,2):
yield item1,item2
def mapper2(self,value1,value2):
##yield movie1,movie2 and corresponding ratings of user.
if(value1[0]==value2[0]):
yield [value1[1],value2[1]],[float(value1[2]),float(value2[2])]
def reducer2(self,movies,ratings):
rating=[]
for item in ratings:
rating.append(item)
##extract using izip only if there are more than one instance.
if(len(rating)>1):
v1,v2=izip(*rating)
##calculate the pearson coefficient
corr=self.pearsonCoefficient(v1,v2)
##yield the result
yield "The Similarity between "+movies[0]+" and "+movies[1]+" is: " , corr
def pearsonCoefficient(self,a,b):
n=len(b)
value=range(n)
#sums of indiv
Solution
So many problems!
-
All the outputs of
-
-
The similarity between movie A and movie B can end up being computed twice on half the data.
These problems are really easy to see if you work through a simple example by hand to see what your code will do.
For example, suppose the data is:
Then
Since these all have the same key (
Next,
You can see that most of the output of this step is useless because it consists of records for different users.
Then the
The values are collected under the different keys and passed to four instances of
Notice that there should be only three pairs of movies, but one of the pairs appears twice.
So, how should you go about fixing the code? Your best bet, I think, is to leave the code alone for the moment, and to work out (on paper, or in a text editor) exactly what the data needs to look like after each step, using a small example (like I did above). When you're happy with the design of each step, then it should be trivial to program it.
I recommend giving the mappers and reducers names that describe what they do. This will make the code easier to understand. See the examples in the MRJob documentation which have names like
-
All the outputs of
mapper1 have the same key, so the first mapping step is wasted.-
reducer1 generates all pairs of records, but most of these pairs will be worthless because they have different users and so will be discarded by mapper2-
The similarity between movie A and movie B can end up being computed twice on half the data.
These problems are really easy to see if you work through a simple example by hand to see what your code will do.
For example, suppose the data is:
Haruko|Koyaanisqatsi|8
Natsuko|Powaqqatsi|9
Haruko|Powaqqatsi|7
Natsuko|Koyaanisqatsi|10
Natsuko|Naqoyqatsi|8
Haruko|Naqoyqatsi|6Then
mapper1 generates the following key–value pairs:Key Value
---- --------------------------------
None ['Haruko', 'Koyaanisqatsi', 8]
None ['Natsuko', 'Powaqqatsi', 9]
None ['Haruko', 'Powaqqatsi', 7]
None ['Natsuko', 'Koyaanisqatsi', 10]
None ['Natsuko', 'Naqoyqatsi', 8]
None ['Haruko', 'Naqoyqatsi', 6]Since these all have the same key (
None) all the values get collected into a single list and passed to one instance of reducer1. So the mapping step was wasted — there's no parallelism here.Next,
reducer1 generates the following key–value pairs:Key Value
-------------------------------- --------------------------------
['Haruko', 'Koyaanisqatsi', 8] ['Natsuko', 'Powaqqatsi', 9]
['Haruko', 'Koyaanisqatsi', 8] ['Haruko', 'Powaqqatsi', 7]
['Haruko', 'Koyaanisqatsi', 8] ['Natsuko', 'Koyaanisqatsi', 10]
['Haruko', 'Koyaanisqatsi', 8] ['Natsuko', 'Naqoyqatsi', 8]
['Haruko', 'Koyaanisqatsi', 8] ['Haruko', 'Naqoyqatsi', 6]
['Natsuko', 'Powaqqatsi', 9] ['Haruko', 'Powaqqatsi', 7]
['Natsuko', 'Powaqqatsi', 9] ['Natsuko', 'Koyaanisqatsi', 10]
['Natsuko', 'Powaqqatsi', 9] ['Natsuko', 'Naqoyqatsi', 8]
['Natsuko', 'Powaqqatsi', 9] ['Haruko', 'Naqoyqatsi', 6]
['Haruko', 'Powaqqatsi', 7] ['Natsuko', 'Koyaanisqatsi', 10]
['Haruko', 'Powaqqatsi', 7] ['Natsuko', 'Naqoyqatsi', 8]
['Haruko', 'Powaqqatsi', 7] ['Haruko', 'Naqoyqatsi', 6]
['Natsuko', 'Koyaanisqatsi', 10] ['Natsuko', 'Naqoyqatsi', 8]
['Natsuko', 'Koyaanisqatsi', 10] ['Haruko', 'Naqoyqatsi', 6]
['Natsuko', 'Naqoyqatsi', 8] ['Haruko', 'Naqoyqatsi', 6]You can see that most of the output of this step is useless because it consists of records for different users.
Then the
mapper2 step filters key–value pairs that have the same user, getting:Key Value
------------------------------- --------
['Koyaanisqatsi', 'Powaqqatsi'] [8, 7]
['Koyaanisqatsi', 'Naqoyqatsi'] [8, 6]
['Powaqqatsi', 'Koyaanisqatsi'] [9, 10]
['Powaqqatsi', 'Naqoyqatsi'] [9, 8]
['Powaqqatsi', 'Naqoyqatsi'] [7, 6]
['Koyaanisqatsi', 'Naqoyqatsi'] [10, 8]The values are collected under the different keys and passed to four instances of
reducer2:Key Values
------------------------------- -----------------
['Koyaanisqatsi', 'Powaqqatsi'] [[8, 7]]
['Koyaanisqatsi', 'Naqoyqatsi'] [[8, 6], [10, 8]]
['Powaqqatsi', 'Koyaanisqatsi'] [[9, 10]]
['Powaqqatsi', 'Naqoyqatsi'] [[9, 8], [7, 6]]Notice that there should be only three pairs of movies, but one of the pairs appears twice.
So, how should you go about fixing the code? Your best bet, I think, is to leave the code alone for the moment, and to work out (on paper, or in a text editor) exactly what the data needs to look like after each step, using a small example (like I did above). When you're happy with the design of each step, then it should be trivial to program it.
I recommend giving the mappers and reducers names that describe what they do. This will make the code easier to understand. See the examples in the MRJob documentation which have names like
mapper_get_words and reducer_count_words.Code Snippets
Haruko|Koyaanisqatsi|8
Natsuko|Powaqqatsi|9
Haruko|Powaqqatsi|7
Natsuko|Koyaanisqatsi|10
Natsuko|Naqoyqatsi|8
Haruko|Naqoyqatsi|6Key Value
---- --------------------------------
None ['Haruko', 'Koyaanisqatsi', 8]
None ['Natsuko', 'Powaqqatsi', 9]
None ['Haruko', 'Powaqqatsi', 7]
None ['Natsuko', 'Koyaanisqatsi', 10]
None ['Natsuko', 'Naqoyqatsi', 8]
None ['Haruko', 'Naqoyqatsi', 6]Key Value
-------------------------------- --------------------------------
['Haruko', 'Koyaanisqatsi', 8] ['Natsuko', 'Powaqqatsi', 9]
['Haruko', 'Koyaanisqatsi', 8] ['Haruko', 'Powaqqatsi', 7]
['Haruko', 'Koyaanisqatsi', 8] ['Natsuko', 'Koyaanisqatsi', 10]
['Haruko', 'Koyaanisqatsi', 8] ['Natsuko', 'Naqoyqatsi', 8]
['Haruko', 'Koyaanisqatsi', 8] ['Haruko', 'Naqoyqatsi', 6]
['Natsuko', 'Powaqqatsi', 9] ['Haruko', 'Powaqqatsi', 7]
['Natsuko', 'Powaqqatsi', 9] ['Natsuko', 'Koyaanisqatsi', 10]
['Natsuko', 'Powaqqatsi', 9] ['Natsuko', 'Naqoyqatsi', 8]
['Natsuko', 'Powaqqatsi', 9] ['Haruko', 'Naqoyqatsi', 6]
['Haruko', 'Powaqqatsi', 7] ['Natsuko', 'Koyaanisqatsi', 10]
['Haruko', 'Powaqqatsi', 7] ['Natsuko', 'Naqoyqatsi', 8]
['Haruko', 'Powaqqatsi', 7] ['Haruko', 'Naqoyqatsi', 6]
['Natsuko', 'Koyaanisqatsi', 10] ['Natsuko', 'Naqoyqatsi', 8]
['Natsuko', 'Koyaanisqatsi', 10] ['Haruko', 'Naqoyqatsi', 6]
['Natsuko', 'Naqoyqatsi', 8] ['Haruko', 'Naqoyqatsi', 6]Key Value
------------------------------- --------
['Koyaanisqatsi', 'Powaqqatsi'] [8, 7]
['Koyaanisqatsi', 'Naqoyqatsi'] [8, 6]
['Powaqqatsi', 'Koyaanisqatsi'] [9, 10]
['Powaqqatsi', 'Naqoyqatsi'] [9, 8]
['Powaqqatsi', 'Naqoyqatsi'] [7, 6]
['Koyaanisqatsi', 'Naqoyqatsi'] [10, 8]Key Values
------------------------------- -----------------
['Koyaanisqatsi', 'Powaqqatsi'] [[8, 7]]
['Koyaanisqatsi', 'Naqoyqatsi'] [[8, 6], [10, 8]]
['Powaqqatsi', 'Koyaanisqatsi'] [[9, 10]]
['Powaqqatsi', 'Naqoyqatsi'] [[9, 8], [7, 6]]Context
StackExchange Code Review Q#131294, answer score: 6
Revisions (0)
No revisions yet.