patternMinor
Optimizing a join where each table has a selection
Viewed 0 times
selectioneachwherejoinhasoptimizingtable
Problem
Consider the following query:
Let's suppose all of the relevant attributes (Customer.Preferred, Order.Complete, Order.CustomerId and Customer.Id) are indexed. How can I evaluate this as quickly as possible?
Standard optimization advice would say that I should do the select on each table first, then the join using sort-merge or whatever the cardinality would imply. But this involves two passes through the data - I'm wondering if there's a better way.
EDIT: I think asking if there was a "better way" was too ill-defined. Suppose we are trying to find $\sigma_a(A)\bowtie_j\sigma_b(B)$. Observe that we can find this in $O(\alpha)$ (where $\alpha$ is the cardinality of $\sigma_a(A)$) with the following pseudocode:
As mentioned by some answerers, there are various minor permutations (do the for loop over B instead, etc.). But they all seem to be $O(\alpha)$ or $O(\beta)$. Is there a better way?
Note that if it the query were a self join, we could just do the merge part of a sort-merge join, (since our indexes would already be sorted) which would run in time proportional to the number of results. So I ask if a similar thing can be done here.
I am more than happy to accept a proof that there is no better method as an answer. I believe that there is no faster algorithm, but I'm unable to prove it.
SELECT Customer.Name FROM Customer
INNER JOIN Order on Order.CustomerId = Customer.Id
WHERE Customer.Preferred = True AND
Order.Complete = FalseLet's suppose all of the relevant attributes (Customer.Preferred, Order.Complete, Order.CustomerId and Customer.Id) are indexed. How can I evaluate this as quickly as possible?
Standard optimization advice would say that I should do the select on each table first, then the join using sort-merge or whatever the cardinality would imply. But this involves two passes through the data - I'm wondering if there's a better way.
EDIT: I think asking if there was a "better way" was too ill-defined. Suppose we are trying to find $\sigma_a(A)\bowtie_j\sigma_b(B)$. Observe that we can find this in $O(\alpha)$ (where $\alpha$ is the cardinality of $\sigma_a(A)$) with the following pseudocode:
for each a in A:
find foreign tuple in B // constant-time, if using hash table
check if foreign tuple meets foreign constraint // again, constant timeAs mentioned by some answerers, there are various minor permutations (do the for loop over B instead, etc.). But they all seem to be $O(\alpha)$ or $O(\beta)$. Is there a better way?
Note that if it the query were a self join, we could just do the merge part of a sort-merge join, (since our indexes would already be sorted) which would run in time proportional to the number of results. So I ask if a similar thing can be done here.
I am more than happy to accept a proof that there is no better method as an answer. I believe that there is no faster algorithm, but I'm unable to prove it.
Solution
When an SQL statement is turned into an execution plan, several optimization techniques are used. The use of indices allow to efficiently (without a full scan) select tuples that agree with a selection condition. Another technique in use is semantic optimization, id est, to turn a query into an equivalent one with better behaviour. To do so, identities of relational algebra are used.
For specifically, in the example, the identity between selection and product
$\qquad \displaystyle \sigma_{\alpha}(C \times O) = \sigma_{\beta \wedge \gamma \wedge \delta}(R \times P) = \sigma_{\delta}(\sigma_{\beta}(R) \times \sigma_{\gamma}(P))$
(with $\beta$ being the part of the condition $\alpha$ which refers only to attributes from $C$ and $\gamma$ the one with attributes from $O$) can be used. Informally worded this identity is "pushing selection through joins/product" (the identity holding for join too). The identity can be thought as a kind of rewrite rule, reading it from left to right. So the following identity holds in the example :
$\qquad \displaystyle \begin{align}
&\pi_{Name}(\sigma_{Preferred=True \wedge Complete=False}(C \Join_{CustomerId = Id} O)) \\
= &\pi_{Name}(\sigma_{Preferred=True}(C) \Join_{CustomerId = Id} \sigma_{Complete=False}(O))
\end{align}$
Doing so, you do not waste time computing the product of irrelevant tuples of $C$ and $O$. The more selective the $\beta$ and $\gamma$ predicates are, the larger the gain is. And IMO, I don't think you can ever lose some time using this rule.
For a more formal treatment of this aspect of optimization, one can look at the Static Analysis and Optimization chapter of the Foundation of Databases book.
For specifically, in the example, the identity between selection and product
$\qquad \displaystyle \sigma_{\alpha}(C \times O) = \sigma_{\beta \wedge \gamma \wedge \delta}(R \times P) = \sigma_{\delta}(\sigma_{\beta}(R) \times \sigma_{\gamma}(P))$
(with $\beta$ being the part of the condition $\alpha$ which refers only to attributes from $C$ and $\gamma$ the one with attributes from $O$) can be used. Informally worded this identity is "pushing selection through joins/product" (the identity holding for join too). The identity can be thought as a kind of rewrite rule, reading it from left to right. So the following identity holds in the example :
$\qquad \displaystyle \begin{align}
&\pi_{Name}(\sigma_{Preferred=True \wedge Complete=False}(C \Join_{CustomerId = Id} O)) \\
= &\pi_{Name}(\sigma_{Preferred=True}(C) \Join_{CustomerId = Id} \sigma_{Complete=False}(O))
\end{align}$
Doing so, you do not waste time computing the product of irrelevant tuples of $C$ and $O$. The more selective the $\beta$ and $\gamma$ predicates are, the larger the gain is. And IMO, I don't think you can ever lose some time using this rule.
For a more formal treatment of this aspect of optimization, one can look at the Static Analysis and Optimization chapter of the Foundation of Databases book.
Context
StackExchange Computer Science Q#1980, answer score: 4
Revisions (0)
No revisions yet.