Average movie rankings
Problem
Given a list of tuples of the form (a, b, c), is there a more direct or optimized way to calculate the average of all the c's with PySpark? Below is what I have, but I feel there must be a more direct/optimized approach.
This is the classic movie recommendation example, where each tuple is (userID, movieID, rating). How do we get the average of all of the ratings in a direct/optimized fashion?

    ds_movie = sc.parallelize([(1,1,2.25), (1,2,3.0), (2,1,4.5)])
    total = (ds_movie
             .map(lambda (userid, movieid, rating): rating)
             .reduce(lambda x, y: x + y))
    num = ds_movie.count()
    average = total / num
    # in this example, average = 3.25

Solution
I would recommend using the mean method:

    # Python 2 syntax (tuple parameter unpacking), as in the question
    ds_movie.map(lambda (userid, movieid, rating): rating).mean()

It is not only more concise but should also have much better numerical properties (it uses a modified version of the online algorithm).
On a side note, it is better to avoid tuple parameter unpacking, which has been removed in Python 3. You can check PEP 3113 for details. Instead you can use the Rating class as follows:

    from pyspark.mllib.recommendation import Rating
    ratings = ds_movie.map(lambda xs: Rating(*xs))
    ratings.map(lambda r: r.rating).mean()

indexing (arguably much uglier than unpacking):

    ds_movie.map(lambda r: r[2]).mean()

or a standard function instead of a lambda expression (kind of verbose for such a simple use case):

    def get_rating(rating):
        userid, movieid, rating = rating
        return rating

    ds_movie.map(get_rating).mean()

Code Snippets
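To illustrate the remark in the Solution about numerical properties, here is a minimal pure-Python sketch of an online (running) mean whose partial results can be merged, which is the general idea behind computing a mean across partitions. It is an assumption about the technique in general, not PySpark's actual implementation; the names RunningMean and partition_mean and the two-way split of the data are hypothetical and chosen only for this example.

```python
from functools import reduce


class RunningMean:
    """Mergeable running mean (illustrative only, not PySpark code)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0

    def add(self, x):
        # Online update: move the mean toward x by 1/n of the gap,
        # so no large running total is ever accumulated.
        self.n += 1
        self.mean += (x - self.mean) / self.n
        return self

    def merge(self, other):
        # Combine two partial results, weighting by their counts.
        if other.n == 0:
            return self
        total = self.n + other.n
        self.mean += (other.mean - self.mean) * other.n / total
        self.n = total
        return self


def partition_mean(values):
    # Fold one partition's values into a single accumulator.
    acc = RunningMean()
    for v in values:
        acc.add(v)
    return acc


rows = [(1, 1, 2.25), (1, 2, 3.0), (2, 1, 4.5)]
partitions = [rows[:2], rows[2:]]  # hypothetical split into two partitions
partials = [partition_mean(r[2] for r in part) for part in partitions]
average = reduce(lambda a, b: a.merge(b), partials, RunningMean()).mean
print(average)
```

On the example data this gives the same average as the sum-and-count version in the question, while only ever carrying a count and a current mean per partition.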
    ds_movie.map(lambda (userid, movieid, rating): rating).mean()

    from pyspark.mllib.recommendation import Rating
    ratings = ds_movie.map(lambda xs: Rating(*xs))
    ratings.map(lambda r: r.rating).mean()

    ds_movie.map(lambda r: r[2]).mean()

    def get_rating(rating):
        userid, movieid, rating = rating
        return rating

    ds_movie.map(get_rating).mean()

Context
StackExchange Code Review Q#95162, answer score: 6