HiveBrain v1.2.0
Get Started
← Back to all entries
patternpythonMinor

Code for creating combinations taking a long time to finish

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
combinationscreatinglongtimefinishforcodetaking

Problem

I have the following code that I'm using to create combinations of elements in my dataset:

start_time = time.time()
question_pairs = []
import itertools as iter

for memberid in IncorrectQuestions.memberid.unique():
    combinations = IncorrectQuestions[IncorrectQuestions.memberid == memberid].groupby('memberid').questionid.apply(
        lambda x: list(iter.combinations(x, 2)))
    for elem in combinations:
        for el in elem:
            question_pairs.append(list(el))

print("--- %s seconds ---" % (time.time() - start_time))


the DataFrame IncorrectQuestions has about 40 Million records. Here's the sample dataset:

memberid    created firstencodedid  questionid
0   9   2016-01-18 05:10:44 MAT.CAL.110 5696d0248e0e0869c96357d3
1   9   2016-01-18 05:10:44 MAT.CAL.110 5696cbc45aa413444ffd9973
2   9   2016-01-18 05:10:44 MAT.CAL.110 5696cf86da2cfe6f21d09879
3   34  2016-11-10 04:24:14 MAT.ARI.300 51d8cd415aa41337ec50425a
4   34  2016-11-10 04:24:14 MAT.ARI.300 559a84505aa4136cb37be676


The piece of code above takes way too long. It has been an hour and it is still running. Is there a way to optimize this piece of code so it takes less time?

The desired output would create all possible combinations of questionid for each memberid. For example, the combinations for memberid = 9 would be:

['5696d0248e0e0869c96357d3','5696cbc45aa413444ffd9973']

['5696d0248e0e0869c96357d3','5696cf86da2cfe6f21d09879']

['5696cbc45aa413444ffd9973','5696cf86da2cfe6f21d09879']


The combinations for memberid = 34 would be:

['51d8cd415aa41337ec50425a','559a84505aa4136cb37be676']


And finally, the combined output would be both the lists combined, i.e.:

['5696d0248e0e0869c96357d3','5696cbc45aa413444ffd9973']

['5696d0248e0e0869c96357d3','5696cf86da2cfe6f21d09879']

['5696cbc45aa413444ffd9973','5696cf86da2cfe6f21d09879']

['51d8cd415aa41337ec50425a','559a84505aa4136cb37be676']


I hope this makes things clearer.

Any pointers would be app

Solution

You take the combinations in a way too convoluted way. For starter, I would simplify the retrieval of "same memberid" questions:

for memberid in IncorrectQuestions.memberid.unique():
    questions = IncorrectQuestions[IncorrectQuestions.memberid == memberid].questionid


No need to groupby here since you’re already filtering by the one memberid you’re interested in. Thus producing combinations can be as simple as:

for elem in itertools.combinations(questions, 2):
        question_pairs.append(list(elem))


Now, I don't see why you want to explicitly turn the tuples returned by itertools.combinations into lists as it adds up time and memory usage. So I suggest dropping it.

Now in the for loop, the use of unique() plus filtering on these unique values is the exact use-case for groupby. So you can rewrite it:

for member_id, questions in IncorrectQuestions.groupby('memberid'):
    for pair in itertools.combinations(questions.questionid, 2):
        question_pairs.append(pair)


I would then turn it into a function for better reusability/testing and use Python naming conventions:

def generate_combination_of_questions(dataframe):
    question_pairs = []
    for _, questions in dataframe.groupby('memberid'):
        for pair in itertools.combinations(questions.questionid, 2):
            question_pairs.append(pair)
    return question_pairs


But the combo var = [] + for loop + var.append should be turned into either a list-comprehension or a generator for better performances:

def generate_combination_of_questions(dataframe):
    return = [
        pair
        for _, questions in dataframe.groupby('memberid')
        for pair in itertools.combinations(questions.questionid, 2)
    ]


or

def generate_combination_of_questions(dataframe):
    for _, questions in dataframe.groupby('memberid'):
        yield from itertools.combinations(questions.questionid, 2)


The second version uses Python 3 syntax and is called as follow:

question_pairs = list(generate_combination_of_questions(IncorrectQuestions))


If you don't have Python 3 available, you can use the more verbose:

def generate_combination_of_questions(dataframe):
    for _, questions in dataframe.groupby('memberid'):
        for pair in itertools.combinations(questions.questionid, 2):
            yield pair


Your timing procedure can also be enhanced by the use of either time.perf_counter() or, better, the timeit module

Example code with timeit:

import itertools

def generate_combination_of_questions(dataframe):
    for _, questions in dataframe.groupby('memberid'):
        yield from itertools.combinations(questions.questionid, 2)

if __name__ == '__main__':
    import timeit
    setup = """\
import pandas as pd
from __main__ import generate_combination_of_questions
incorrect = pd.read_csv()"""

    print(timeit.timeit(
        'list(generate_combination_of_questions(incorrect))',
        setup=setup, number=1))


Reading at the various comments about the size of your dataset, you definitely want the generator version (the one with yield instead of return) as it will allow you to iterate over the results one by one and not store all of them in memory before. Otherwise, the list will fill up all your memory.

For instance, writing each pair into a file might look like:

import itertools
import csv
import pandas as pd

def generate_combination_of_questions(dataframe):
    for _, questions in dataframe.groupby('memberid'):
        yield from itertools.combinations(questions.questionid, 2)

def main(file_in, file_out):
    incorrect_questions = pd.read_csv(file_in)
    with open(file_out, 'w') as output:
        writer = csv.writer(output)
        for pair in generate_combination_of_questions(incorrect_questions):
            writer.writerow(pair)

if __name__ == '__main__':
    main(, )


But you're obviously free to do any form of processing while iterating over the generator; writing to a file is just the simplest example.

Code Snippets

for memberid in IncorrectQuestions.memberid.unique():
    questions = IncorrectQuestions[IncorrectQuestions.memberid == memberid].questionid
for elem in itertools.combinations(questions, 2):
        question_pairs.append(list(elem))
for member_id, questions in IncorrectQuestions.groupby('memberid'):
    for pair in itertools.combinations(questions.questionid, 2):
        question_pairs.append(pair)
def generate_combination_of_questions(dataframe):
    question_pairs = []
    for _, questions in dataframe.groupby('memberid'):
        for pair in itertools.combinations(questions.questionid, 2):
            question_pairs.append(pair)
    return question_pairs
def generate_combination_of_questions(dataframe):
    return = [
        pair
        for _, questions in dataframe.groupby('memberid')
        for pair in itertools.combinations(questions.questionid, 2)
    ]

Context

StackExchange Code Review Q#156571, answer score: 7

Revisions (0)

No revisions yet.