patternpythonMinor

Count values in one dataframe that are less than all the values in another

Submitted by: @import:stackexchange-codereview·Mar 10, 2026·

Viewed 0 times

theallarethanonethatdataframelessanothervalues

Problem

I currently have 2 dataframes, A and B. These dataframes are generated in runtime, and increase in size according to parameters in the program execution.

I need to evaluate how many times a value in dataframe A is lesser than all the values in dataframe B.

For example:

Dataframe A
+-----+-------+
| id  | value |
+-----+-------+
|   1 | 0.23  |
|   2 | 1.2   |
+-----+-------+
Dataframe B
+-----+-------+
| id  | value |
+-----+-------+
|   1 | 0.22  |
|   2 | 1.25  |
|   3 | 0.3   |
|   4 | 0.5   |
|   5 | 0.9   |
|   6 | 0.0   |
+-----+-------+

I need to check how many values greater than 0.23 (for example) are in dataframe B. in this case 4 of the 6.

My first try with this was using this code. In this case, bio_dataframe is dataframe A, an random_seq_df is dataframe B.

for bio_row in bio_dataframe.itertuples():
    total = 0
    for ran_row in random_seq_df.itertuples():
        if bio_row[2] 

As you can see, i use

itertuples

 for fast iteration of the rows of the dataframes.
This approach works "well" for dataframes below 25000 rows, but beyond that it starts to get painfully slow.

So my next approach was this.

final_res

 is a column in the dataframe.

for bio_row in bio_dataframe.itertuples():
a = bio_row[2]
total = random_dataframe.eval('final_res > @a')
`

It works excellent until 100000 rows, beyond that the story repeats.

I'm hitting a wall here and I've run out of ideas to test. Is there any way to improve the code? Am I missing something, or some snippet to make it faster?

Solution

It may be a good idea to sort the columns you wish to compare first. That way you can iterate over both lists a single time. The idea is to use the index to keep track of how many items are larger than the values in your first list. A relatively untested function is:

import pandas as pd
import time

df1 = pd.DataFrame()
df2 = pd.DataFrame()
df1['a'] = [0.23, 1.2]
df2['a'] = [0.22, 1.25, 0.3, 0.5, 0.9, 0.0] * 1000 # Add some extra work

def sol(da, db):
    # Sort your columns 
    x = sorted(da)
    y = sorted(db)

    t = []  # The results
    yi = iter(y)  # Use of an iterator to move over y
    yindex = 0
    y_item = next(yi)

    for val in x:
        # Search through y to find the index of an item bigger than val
        while y_item >> [4000, 1000]
0.00200009346008

The problem with your initial solution is that you iterate over your second list for every value in your first list giving you an n^2 run time. This can be seen if you increase the length of the first list a bit. Ive modified your function a little to demonstrate:

def sol0(df1, df2):
    b = df2['a']
    for bio_row in df1.itertuples():
        a = float(bio_row[1])
        total = pd.eval('b > a')
        df1.set_value(bio_row[0], 'sim_p_val', total.sum())
    return df1

Testing with the following shows how the n^2 approach can get slow with longer lists, using the sorted list approach is ~750x quicker in this example:

df1['a'] = [0.23, 1.2] * 1000
df2['a'] = [0.22, 1.25, 0.3, 0.5, 0.9, 0.0] * 1000

t0 = time.time()
res = sol(df1['a'], df2['a'])
print time.time() - t0

df1['sim_p_val'] = [0]*len(df1)
t0 = time.time()
df1 = sol0(df1, df2)
print time.time() - t0

# Verify the result is the same
print res == sorted(list(df1['sim_p_val']), reverse=True)

>>>0.0024299621582
1.78539991379
True

Context

StackExchange Code Review Q#159506, answer score: 3

Revisions (0)

No revisions yet.