patternpythonMinor
Code for creating combinations taking a long time to finish
Viewed 0 times
combinationscreatinglongtimefinishforcodetaking
Problem
I have the following code that I'm using to create combinations of elements in my dataset:
the
The piece of code above takes way too long. It has been an hour and it is still running. Is there a way to optimize this piece of code so it takes less time?
The desired output would create all possible combinations of
The combinations for
And finally, the combined output would be both the lists combined, i.e.:
I hope this makes things clearer.
Any pointers would be app
start_time = time.time()
question_pairs = []
import itertools as iter
for memberid in IncorrectQuestions.memberid.unique():
combinations = IncorrectQuestions[IncorrectQuestions.memberid == memberid].groupby('memberid').questionid.apply(
lambda x: list(iter.combinations(x, 2)))
for elem in combinations:
for el in elem:
question_pairs.append(list(el))
print("--- %s seconds ---" % (time.time() - start_time))the
DataFrame IncorrectQuestions has about 40 Million records. Here's the sample dataset:memberid created firstencodedid questionid
0 9 2016-01-18 05:10:44 MAT.CAL.110 5696d0248e0e0869c96357d3
1 9 2016-01-18 05:10:44 MAT.CAL.110 5696cbc45aa413444ffd9973
2 9 2016-01-18 05:10:44 MAT.CAL.110 5696cf86da2cfe6f21d09879
3 34 2016-11-10 04:24:14 MAT.ARI.300 51d8cd415aa41337ec50425a
4 34 2016-11-10 04:24:14 MAT.ARI.300 559a84505aa4136cb37be676The piece of code above takes way too long. It has been an hour and it is still running. Is there a way to optimize this piece of code so it takes less time?
The desired output would create all possible combinations of
questionid for each memberid. For example, the combinations for memberid = 9 would be:['5696d0248e0e0869c96357d3','5696cbc45aa413444ffd9973']
['5696d0248e0e0869c96357d3','5696cf86da2cfe6f21d09879']
['5696cbc45aa413444ffd9973','5696cf86da2cfe6f21d09879']The combinations for
memberid = 34 would be:['51d8cd415aa41337ec50425a','559a84505aa4136cb37be676']And finally, the combined output would be both the lists combined, i.e.:
['5696d0248e0e0869c96357d3','5696cbc45aa413444ffd9973']
['5696d0248e0e0869c96357d3','5696cf86da2cfe6f21d09879']
['5696cbc45aa413444ffd9973','5696cf86da2cfe6f21d09879']
['51d8cd415aa41337ec50425a','559a84505aa4136cb37be676']I hope this makes things clearer.
Any pointers would be app
Solution
You take the combinations in a way too convoluted way. For starter, I would simplify the retrieval of "same
No need to
Now, I don't see why you want to explicitly turn the tuples returned by
Now in the for loop, the use of
I would then turn it into a function for better reusability/testing and use Python naming conventions:
But the combo
or
The second version uses Python 3 syntax and is called as follow:
If you don't have Python 3 available, you can use the more verbose:
Your timing procedure can also be enhanced by the use of either
Example code with
Reading at the various comments about the size of your dataset, you definitely want the generator version (the one with
For instance, writing each pair into a file might look like:
But you're obviously free to do any form of processing while iterating over the generator; writing to a file is just the simplest example.
memberid" questions:for memberid in IncorrectQuestions.memberid.unique():
questions = IncorrectQuestions[IncorrectQuestions.memberid == memberid].questionidNo need to
groupby here since you’re already filtering by the one memberid you’re interested in. Thus producing combinations can be as simple as:for elem in itertools.combinations(questions, 2):
question_pairs.append(list(elem))Now, I don't see why you want to explicitly turn the tuples returned by
itertools.combinations into lists as it adds up time and memory usage. So I suggest dropping it.Now in the for loop, the use of
unique() plus filtering on these unique values is the exact use-case for groupby. So you can rewrite it:for member_id, questions in IncorrectQuestions.groupby('memberid'):
for pair in itertools.combinations(questions.questionid, 2):
question_pairs.append(pair)I would then turn it into a function for better reusability/testing and use Python naming conventions:
def generate_combination_of_questions(dataframe):
question_pairs = []
for _, questions in dataframe.groupby('memberid'):
for pair in itertools.combinations(questions.questionid, 2):
question_pairs.append(pair)
return question_pairsBut the combo
var = [] + for loop + var.append should be turned into either a list-comprehension or a generator for better performances:def generate_combination_of_questions(dataframe):
return = [
pair
for _, questions in dataframe.groupby('memberid')
for pair in itertools.combinations(questions.questionid, 2)
]or
def generate_combination_of_questions(dataframe):
for _, questions in dataframe.groupby('memberid'):
yield from itertools.combinations(questions.questionid, 2)The second version uses Python 3 syntax and is called as follow:
question_pairs = list(generate_combination_of_questions(IncorrectQuestions))If you don't have Python 3 available, you can use the more verbose:
def generate_combination_of_questions(dataframe):
for _, questions in dataframe.groupby('memberid'):
for pair in itertools.combinations(questions.questionid, 2):
yield pairYour timing procedure can also be enhanced by the use of either
time.perf_counter() or, better, the timeit moduleExample code with
timeit:import itertools
def generate_combination_of_questions(dataframe):
for _, questions in dataframe.groupby('memberid'):
yield from itertools.combinations(questions.questionid, 2)
if __name__ == '__main__':
import timeit
setup = """\
import pandas as pd
from __main__ import generate_combination_of_questions
incorrect = pd.read_csv()"""
print(timeit.timeit(
'list(generate_combination_of_questions(incorrect))',
setup=setup, number=1))Reading at the various comments about the size of your dataset, you definitely want the generator version (the one with
yield instead of return) as it will allow you to iterate over the results one by one and not store all of them in memory before. Otherwise, the list will fill up all your memory.For instance, writing each pair into a file might look like:
import itertools
import csv
import pandas as pd
def generate_combination_of_questions(dataframe):
for _, questions in dataframe.groupby('memberid'):
yield from itertools.combinations(questions.questionid, 2)
def main(file_in, file_out):
incorrect_questions = pd.read_csv(file_in)
with open(file_out, 'w') as output:
writer = csv.writer(output)
for pair in generate_combination_of_questions(incorrect_questions):
writer.writerow(pair)
if __name__ == '__main__':
main(, )But you're obviously free to do any form of processing while iterating over the generator; writing to a file is just the simplest example.
Code Snippets
for memberid in IncorrectQuestions.memberid.unique():
questions = IncorrectQuestions[IncorrectQuestions.memberid == memberid].questionidfor elem in itertools.combinations(questions, 2):
question_pairs.append(list(elem))for member_id, questions in IncorrectQuestions.groupby('memberid'):
for pair in itertools.combinations(questions.questionid, 2):
question_pairs.append(pair)def generate_combination_of_questions(dataframe):
question_pairs = []
for _, questions in dataframe.groupby('memberid'):
for pair in itertools.combinations(questions.questionid, 2):
question_pairs.append(pair)
return question_pairsdef generate_combination_of_questions(dataframe):
return = [
pair
for _, questions in dataframe.groupby('memberid')
for pair in itertools.combinations(questions.questionid, 2)
]Context
StackExchange Code Review Q#156571, answer score: 7
Revisions (0)
No revisions yet.