patternMinor
Should i use subquery to limit table before a join?
Viewed 0 times
limitjoinsubqueryshouldbeforeusetable
Problem
In the case of a join followed by a where clause, would it be better to use a subquery to limit the results, and then do the join? Example:
In this case, it seams the DBMS will join the entire Customers table with the entire Orders table, and then filter the results based on the where clause. An equivalent query using subquery would be:
Here, there is potentially a smaller Orders table to JOIN with. Likewise, if there was a where clause that limited both Customers and Orders:
An equivalent subquery query:
SELECT *
FROM Customers
NATURAL JOIN Orders
WHERE shipped=1In this case, it seams the DBMS will join the entire Customers table with the entire Orders table, and then filter the results based on the where clause. An equivalent query using subquery would be:
SELECT *
FROM Customers
NATURAL JOIN (SELECT *
FROM Orders
WHERE shipped=1) AS OHere, there is potentially a smaller Orders table to JOIN with. Likewise, if there was a where clause that limited both Customers and Orders:
SELECT *
FROM Customers
NATURAL JOIN Orders
WHERE country='US' AND shipped=1
(assuming country attribute belongs to Customers table)An equivalent subquery query:
SELECT *
FROM (SELECT *
FROM Customers
WHERE country='US') AS C
NATURAL JOIN (SELECT *
FROM Orders
WHERE shipped=1) AS OSolution
The answer to your question depends on the specific database and version you are using. In any case most current databases will optimize the query and end-up having the same execution plan for all cases.
I'd go for the simplest syntax that clearly states your purpose. It is probably the one that the database will be able to best optimize. And, specially, it is the one that will be easier to interpret by yourself, or anyone modifying any application later on. If you find out that some query seems to underperform, then you should check execution plans, alternatives, and find out if it can actually be improved.
You can check it with most DBs with a statement similar to
This is how you would do it when using PostgreSQL:
Creation and population of (mockup) tables:
Make sure the database has proper statistics:
Check for the exeuction plan of the simplest query:
| QUERY PLAN |
| :-------------------------------------------------------------------------------------------------------------------- |
| Hash Join (cost=28.50..705.05 rows=21131 width=24) (actual time=0.946..29.305 rows=21131 loops=1) |
| Hash Cond: (orders.customer_id = customers.customer_id) |
| -> Seq Scan on orders (cost=0.00..386.00 rows=21131 width=16) (actual time=0.008..10.089 rows=21131 loops=1) |
| Filter: shipped |
| Rows Removed by Filter: 3869 |
| -> Hash (cost=16.00..16.00 rows=1000 width=12) (actual time=0.906..0.906 rows=1000 loops=1) |
| Buckets: 1024 Batches: 1 Memory Usage: 52kB |
| -> Seq Scan on customers (cost=0.00..16.00 rows=1000 width=12) (actual time=0.005..0.445 rows=1000 loops=1) |
| Planning time: 0.419 ms |
| Execution time: 34.613 ms |
Check for the execution plan of the second version of the query:
| QUERY PLAN |
| :-------------------------------------------------------------------------------------------------------------------- |
| Hash Join (cost=28.50..705.05 rows=21131 width=24) (actual time=0.693..24.537 rows=21131 loops=1) |
| Hash Cond: (orders.customer_id = customers.customer_id) |
| -> Seq Scan on orders (cost=0.00..386.00 rows=21131 width=16) (actual time=0.007..8.534 rows=21131 loops=1) |
| Filter: shipped |
| Rows Removed by Filter: 3869 |
| -> Hash (cost=16.00..16.00 rows=1000 width=12) (actual time=0.676..0.676 rows=1000 loops=1) |
| Buckets: 1024 Batches: 1 Memory Usage: 52kB |
| -> Seq Scan on customers (cost=0.00..16.00 rows=1000 width=12) (actual time=0.003..0.312 rows=1000 loops=1) |
| Planning time: 0.263 ms |
| Execution time: 29.083 ms |
You can see that, in this case, PostgreSQ
I'd go for the simplest syntax that clearly states your purpose. It is probably the one that the database will be able to best optimize. And, specially, it is the one that will be easier to interpret by yourself, or anyone modifying any application later on. If you find out that some query seems to underperform, then you should check execution plans, alternatives, and find out if it can actually be improved.
You can check it with most DBs with a statement similar to
EXPLAIN SELECT * FROM Customers NATURAL JOIN Orders WHERE shipped=1 and check what the execution plan is.This is how you would do it when using PostgreSQL:
Creation and population of (mockup) tables:
CREATE TABLE Customers
(
customer_id integer primary key,
customer_name text not null
) ;
CREATE TABLE Orders
(
order_id integer primary key,
customer_id integer NOR NULL REFERENCES Customers(customer_id),
whatever text,
shipped boolean
) ;
-- We invent 1_000 customers
INSERT INTO
Customers (customer_id, customer_name)
SELECT
i, 'Name ' || i
FROM
generate_series (1, 1000) as s(i) ;
-- and 25_000 orders
INSERT INTO
Orders (order_id, customer_id, whatever, shipped)
SELECT
i AS order_id,
1 + 999*random() AS customer_id,
'a text' AS whatever,
(random() < 0.85) AS shipped
FROM
generate_series(1, 25000) AS s(i) ;Make sure the database has proper statistics:
ANALYZE Orders;
ANALYZE Customers;Check for the exeuction plan of the simplest query:
EXPLAIN ANALYZE
SELECT * FROM Customers NATURAL JOIN Orders WHERE shipped| QUERY PLAN |
| :-------------------------------------------------------------------------------------------------------------------- |
| Hash Join (cost=28.50..705.05 rows=21131 width=24) (actual time=0.946..29.305 rows=21131 loops=1) |
| Hash Cond: (orders.customer_id = customers.customer_id) |
| -> Seq Scan on orders (cost=0.00..386.00 rows=21131 width=16) (actual time=0.008..10.089 rows=21131 loops=1) |
| Filter: shipped |
| Rows Removed by Filter: 3869 |
| -> Hash (cost=16.00..16.00 rows=1000 width=12) (actual time=0.906..0.906 rows=1000 loops=1) |
| Buckets: 1024 Batches: 1 Memory Usage: 52kB |
| -> Seq Scan on customers (cost=0.00..16.00 rows=1000 width=12) (actual time=0.005..0.445 rows=1000 loops=1) |
| Planning time: 0.419 ms |
| Execution time: 34.613 ms |
Check for the execution plan of the second version of the query:
EXPLAIN ANALYZE
SELECT * FROM Customers NATURAL JOIN (SELECT * FROM Orders WHERE shipped) AS O| QUERY PLAN |
| :-------------------------------------------------------------------------------------------------------------------- |
| Hash Join (cost=28.50..705.05 rows=21131 width=24) (actual time=0.693..24.537 rows=21131 loops=1) |
| Hash Cond: (orders.customer_id = customers.customer_id) |
| -> Seq Scan on orders (cost=0.00..386.00 rows=21131 width=16) (actual time=0.007..8.534 rows=21131 loops=1) |
| Filter: shipped |
| Rows Removed by Filter: 3869 |
| -> Hash (cost=16.00..16.00 rows=1000 width=12) (actual time=0.676..0.676 rows=1000 loops=1) |
| Buckets: 1024 Batches: 1 Memory Usage: 52kB |
| -> Seq Scan on customers (cost=0.00..16.00 rows=1000 width=12) (actual time=0.003..0.312 rows=1000 loops=1) |
| Planning time: 0.263 ms |
| Execution time: 29.083 ms |
You can see that, in this case, PostgreSQ
Code Snippets
CREATE TABLE Customers
(
customer_id integer primary key,
customer_name text not null
) ;
CREATE TABLE Orders
(
order_id integer primary key,
customer_id integer NOR NULL REFERENCES Customers(customer_id),
whatever text,
shipped boolean
) ;
-- We invent 1_000 customers
INSERT INTO
Customers (customer_id, customer_name)
SELECT
i, 'Name ' || i
FROM
generate_series (1, 1000) as s(i) ;
-- and 25_000 orders
INSERT INTO
Orders (order_id, customer_id, whatever, shipped)
SELECT
i AS order_id,
1 + 999*random() AS customer_id,
'a text' AS whatever,
(random() < 0.85) AS shipped
FROM
generate_series(1, 25000) AS s(i) ;ANALYZE Orders;
ANALYZE Customers;EXPLAIN ANALYZE
SELECT * FROM Customers NATURAL JOIN Orders WHERE shippedEXPLAIN ANALYZE
SELECT * FROM Customers NATURAL JOIN (SELECT * FROM Orders WHERE shipped) AS OCREATE INDEX idx_shipped_orders ON Orders(shipped, order_id) ;Context
StackExchange Database Administrators Q#168874, answer score: 8
Revisions (0)
No revisions yet.