HiveBrain v1.2.0
Get Started
← Back to all entries
patternMinor

Estimate result size of Natural Join

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
estimateresultsizejoinnatural

Problem

I have the following tables

R1(A, B, C) with 1.000 rows

R2(C, D, E) with 1.500 rows

R3(E, F) with 750 rows

where a bold letter denotes a primary key.

I need to estimate the number of rows of the natural join R1 |x| R2 |x| R3.

My textbook has proposed the following solution

The Natural join of R1, R2 and R3 will be the same no matter which way we join them (join is both associative and commutative). Size can be estimated using the strategy of first joining R1 and R2, and then joining the result with R3.
Joining R1 with R2 will yield a table of at most 1.000 rows, since C is a key for R2. Likewise, joining that result with R3 will yield a table of at most 1.000 rows because E is a key for R3. Therefore the final relation will have at most 1.000 rows!

I would have thought that the final relation will have at most 750 rows, since R3 only has 750 rows.

Is the textbook solution incorrect, or am I missing something?

Solution

You have at most 1000 rows from the natural join of R1 and R2 because there are two possible cases: either 1) all the values of R1.C are present in R2.C (i.e. R1.C is a “foreign key” for R2), or, 2) there are values in R1.C not present in R2.C.

In the first case you have exactly 1000 rows in the join, since the for each row of R1 there is exactly one row in R2 with the same value of C, since C is a key of R2, and for a certain value of C there is only one row in it, so you can join a row in R1 with only one row in R2, and the result has exactly 1000 rows.

In the second case you have instead n in rows in the result, with 0 ≤ n ≤ 1000, where n is the number of row in R1 whose value of C is present in R2: in fact, when a row of R1 has a value of C present in R2, you obtain again a single row in the result.

Combining these two possibilities, we can say that the number of rows in R1 ⨝ R2 is a value n, with 0 ≤ n ≤ 1000.

Now consider the second join. Since you have at most 1000 rows in R1 ⨝ R2, in each of those 1000 rows you can have a value of E which is present or not in table R3. But since E is a primary key in R3, when you join each of those n rows with R3 you can find at maximum one corresponding row in R3. So your result will have a number of rows which will be equal to m, such that m ≤ n.

Putting together these results, we know that for m, the number of rows of R1 ⨝ R2 ⨝ R2, holds the following: 0 ≤ m ≤ 1000.

Added

(Based on the comment of @ypercubeᵀᴹ).

Note that the result can be easily generalized. If you have a natural join or an inner equijoin between k tables Ri, i = 1, ..., k, and if you can reorder the join as: R1 ⨝ R2 ⨝ ... ⨝ Rk, such that Ri ⨝ Ri+1 is on the primary key (or unique attribute) of Ri+1, then the cardinality of the result will be always less than or equal to the cardinality of R1.

Context

StackExchange Database Administrators Q#139204, answer score: 5

Revisions (0)

No revisions yet.