patternpythonMinor
Count values in one dataframe that are less than all the values in another
Viewed 0 times
theallarethanonethatdataframelessanothervalues
Problem
I currently have 2 dataframes, A and B. These dataframes are generated in runtime, and increase in size according to parameters in the program execution.
I need to evaluate how many times a value in dataframe A is lesser than all the values in dataframe B.
For example:
I need to check how many values greater than 0.23 (for example) are in dataframe B. in this case 4 of the 6.
My first try with this was using this code. In this case,
a = bio_row[2]
total = random_dataframe.eval('final_res > @a')
`
It works excellent until 100000 rows, beyond that the story repeats.
I'm hitting a wall here and I've run out of ideas to test. Is there any way to improve the code? Am I missing something, or some snippet to make it faster?
I need to evaluate how many times a value in dataframe A is lesser than all the values in dataframe B.
For example:
Dataframe A
+-----+-------+
| id | value |
+-----+-------+
| 1 | 0.23 |
| 2 | 1.2 |
+-----+-------+
Dataframe B
+-----+-------+
| id | value |
+-----+-------+
| 1 | 0.22 |
| 2 | 1.25 |
| 3 | 0.3 |
| 4 | 0.5 |
| 5 | 0.9 |
| 6 | 0.0 |
+-----+-------+
I need to check how many values greater than 0.23 (for example) are in dataframe B. in this case 4 of the 6.
My first try with this was using this code. In this case,
bio_dataframe is dataframe A, an random_seq_df is dataframe B.for bio_row in bio_dataframe.itertuples():
total = 0
for ran_row in random_seq_df.itertuples():
if bio_row[2]
As you can see, i use itertuples for fast iteration of the rows of the dataframes.
This approach works "well" for dataframes below 25000 rows, but beyond that it starts to get painfully slow.
So my next approach was this.
final_res is a column in the dataframe.
for bio_row in bio_dataframe.itertuples():a = bio_row[2]
total = random_dataframe.eval('final_res > @a')
`
It works excellent until 100000 rows, beyond that the story repeats.
I'm hitting a wall here and I've run out of ideas to test. Is there any way to improve the code? Am I missing something, or some snippet to make it faster?
Solution
It may be a good idea to sort the columns you wish to compare first. That way you can iterate over both lists a single time. The idea is to use the index to keep track of how many items are larger than the values in your first list. A relatively untested function is:
The problem with your initial solution is that you iterate over your second list for every value in your first list giving you an n^2 run time. This can be seen if you increase the length of the first list a bit. Ive modified your function a little to demonstrate:
Testing with the following shows how the n^2 approach can get slow with longer lists, using the sorted list approach is ~750x quicker in this example:
import pandas as pd
import time
df1 = pd.DataFrame()
df2 = pd.DataFrame()
df1['a'] = [0.23, 1.2]
df2['a'] = [0.22, 1.25, 0.3, 0.5, 0.9, 0.0] * 1000 # Add some extra work
def sol(da, db):
# Sort your columns
x = sorted(da)
y = sorted(db)
t = [] # The results
yi = iter(y) # Use of an iterator to move over y
yindex = 0
y_item = next(yi)
for val in x:
# Search through y to find the index of an item bigger than val
while y_item >> [4000, 1000]
0.00200009346008
The problem with your initial solution is that you iterate over your second list for every value in your first list giving you an n^2 run time. This can be seen if you increase the length of the first list a bit. Ive modified your function a little to demonstrate:
def sol0(df1, df2):
b = df2['a']
for bio_row in df1.itertuples():
a = float(bio_row[1])
total = pd.eval('b > a')
df1.set_value(bio_row[0], 'sim_p_val', total.sum())
return df1
Testing with the following shows how the n^2 approach can get slow with longer lists, using the sorted list approach is ~750x quicker in this example:
df1['a'] = [0.23, 1.2] * 1000
df2['a'] = [0.22, 1.25, 0.3, 0.5, 0.9, 0.0] * 1000
t0 = time.time()
res = sol(df1['a'], df2['a'])
print time.time() - t0
df1['sim_p_val'] = [0]*len(df1)
t0 = time.time()
df1 = sol0(df1, df2)
print time.time() - t0
# Verify the result is the same
print res == sorted(list(df1['sim_p_val']), reverse=True)
>>>0.0024299621582
1.78539991379
True
Context
StackExchange Code Review Q#159506, answer score: 3
Revisions (0)
No revisions yet.