
Average movie rankings

Submitted by: @import:stackexchange-codereview

Problem

Given a list of tuples of the form (a, b, c), is there a more direct or optimized way to calculate the average of all the c values with PySpark? Below is what I have, but I suspect there is a more direct approach.

This is the classic movie recommendation example, where each tuple is (userID, movieID, rating). How do we get the average of all the ratings in a direct, optimized fashion?

ds_movie = sc.parallelize([(1,1,2.25), (1,2,3.0), (2,1,4.5)])
total = (ds_movie
         .map(lambda (userid, movieid, rating): rating)
         .reduce(lambda x, y: x + y))
num = ds_movie.count()
average = total / num
# in this example, average = 3.25

Solution

I would recommend using the mean method:

ds_movie.map(lambda (userid, movieid, rating): rating).mean()


It is not only more concise but should also have much better numerical properties than summing and dividing (it uses a modified version of the online algorithm).
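
To illustrate what an online mean looks like, here is a minimal plain-Python sketch. It is not Spark's actual implementation; the function names running_mean and merge and the two-partition split are assumptions made for the example. It shows a single-pass update that never builds a large running total, plus a way to combine per-partition summaries.

```python
def running_mean(values):
    """Welford-style online mean: one pass, no large running total."""
    count = 0
    mean = 0.0
    for x in values:
        count += 1
        # Nudge the current mean toward the new value.
        mean += (x - mean) / count
    return count, mean


def merge(a, b):
    """Combine two (count, mean) summaries, e.g. one per partition."""
    count_a, mean_a = a
    count_b, mean_b = b
    total = count_a + count_b
    if total == 0:
        return 0, 0.0
    return total, mean_a + (mean_b - mean_a) * count_b / total


# Hypothetical split of the example ratings into two partitions.
partition_1 = [2.25, 3.0]
partition_2 = [4.5]

count, mean = merge(running_mean(partition_1), running_mean(partition_2))
print(count, mean)
```

Because each update adds only a small correction to the current mean, the intermediate values stay on the scale of the data, which is the property the answer alludes to.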

As a side note, it is better to avoid tuple parameter unpacking, which was removed in Python 3 (see PEP 3113 for details). Instead, you can use the Rating class as follows:

from pyspark.mllib.recommendation import Rating

ratings = ds_movie.map(lambda xs: Rating(*xs))
ratings.map(lambda r: r.rating).mean()


Or indexing (arguably much uglier than unpacking):

ds_movie.map(lambda r: r[2]).mean()


Or a standard function instead of a lambda expression (rather verbose for such a simple use case):

def get_rating(record):
    # Unpack inside the body, which works in both Python 2 and Python 3.
    userid, movieid, rating = record
    return rating

ds_movie.map(get_rating).mean()
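
The three alternatives can be compared without a Spark cluster. The sketch below is plain Python under stated assumptions: MovieRating is a hypothetical namedtuple standing in for a record type with named fields, and list comprehensions stand in for RDD.map. It also checks that the tuple-unpacking lambda is rejected by a Python 3 parser.

```python
from collections import namedtuple

# Tuple parameter unpacking in a lambda is a syntax error in Python 3.
py2_only = "lambda (userid, movieid, rating): rating"
try:
    compile(py2_only, "<example>", "eval")
    accepted = True
except SyntaxError:
    accepted = False

# Hypothetical stand-in for a record type with named fields.
MovieRating = namedtuple("MovieRating", ["userid", "movieid", "rating"])

rows = [(1, 1, 2.25), (1, 2, 3.0), (2, 1, 4.5)]

# Named-field access, positional indexing, and a plain function
# all extract the same column.
by_name = [MovieRating(*row).rating for row in rows]
by_index = [row[2] for row in rows]


def get_rating(record):
    userid, movieid, rating = record
    return rating


by_function = [get_rating(row) for row in rows]
print(accepted, by_name, by_index, by_function)
```

All three extract the same values; the choice is mostly about readability.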

Context

StackExchange Code Review Q#95162, answer score: 6
