Should i use subquery to limit table before a join?

Problem

In the case of a join followed by a where clause, would it be better to use a subquery to limit the results, and then do the join? Example:

SELECT * 
FROM Customers 
    NATURAL JOIN Orders 
WHERE shipped=1

In this case, it seams the DBMS will join the entire Customers table with the entire Orders table, and then filter the results based on the where clause. An equivalent query using subquery would be:

SELECT * 
FROM Customers 
    NATURAL JOIN (SELECT * 
                     FROM Orders 
                     WHERE shipped=1) AS O

Here, there is potentially a smaller Orders table to JOIN with. Likewise, if there was a where clause that limited both Customers and Orders:

SELECT * 
FROM Customers 
    NATURAL JOIN Orders 
WHERE country='US' AND shipped=1
(assuming country attribute belongs to Customers table)

An equivalent subquery query:

SELECT * 
FROM (SELECT *
            FROM Customers 
            WHERE country='US') AS C 
    NATURAL JOIN (SELECT * 
                                 FROM Orders 
                                 WHERE shipped=1) AS O

Solution

The answer to your question depends on the specific database and version you are using. In any case most current databases will optimize the query and end-up having the same execution plan for all cases.

I'd go for the simplest syntax that clearly states your purpose. It is probably the one that the database will be able to best optimize. And, specially, it is the one that will be easier to interpret by yourself, or anyone modifying any application later on. If you find out that some query seems to underperform, then you should check execution plans, alternatives, and find out if it can actually be improved.

You can check it with most DBs with a statement similar to EXPLAIN SELECT * FROM Customers NATURAL JOIN Orders WHERE shipped=1 and check what the execution plan is.

This is how you would do it when using PostgreSQL:

Creation and population of (mockup) tables:

CREATE TABLE Customers
 (
     customer_id integer primary key,
     customer_name text not null
 ) ;

 CREATE TABLE Orders
 (
     order_id integer primary key,
     customer_id integer NOR NULL REFERENCES Customers(customer_id),
     whatever text,
     shipped boolean
 ) ;

 -- We invent 1_000 customers
 INSERT INTO 
     Customers (customer_id, customer_name)
 SELECT
     i, 'Name ' || i
 FROM
     generate_series (1, 1000) as s(i) ;

 -- and 25_000 orders
 INSERT INTO
     Orders (order_id, customer_id, whatever, shipped)
 SELECT
     i AS order_id,  
     1 + 999*random() AS customer_id,
     'a text' AS whatever,
     (random() < 0.85) AS shipped
 FROM
     generate_series(1, 25000) AS s(i) ;

Make sure the database has proper statistics:

ANALYZE Orders;
ANALYZE Customers;

Check for the exeuction plan of the simplest query:

EXPLAIN ANALYZE
SELECT * FROM Customers NATURAL JOIN Orders WHERE shipped

| QUERY PLAN |
| :-------------------------------------------------------------------------------------------------------------------- |
| Hash Join (cost=28.50..705.05 rows=21131 width=24) (actual time=0.946..29.305 rows=21131 loops=1) |
| Hash Cond: (orders.customer_id = customers.customer_id) |
| -> Seq Scan on orders (cost=0.00..386.00 rows=21131 width=16) (actual time=0.008..10.089 rows=21131 loops=1) |
| Filter: shipped |
| Rows Removed by Filter: 3869 |
| -> Hash (cost=16.00..16.00 rows=1000 width=12) (actual time=0.906..0.906 rows=1000 loops=1) |
| Buckets: 1024 Batches: 1 Memory Usage: 52kB |
| -> Seq Scan on customers (cost=0.00..16.00 rows=1000 width=12) (actual time=0.005..0.445 rows=1000 loops=1) |
| Planning time: 0.419 ms |
| Execution time: 34.613 ms |

Check for the execution plan of the second version of the query:

EXPLAIN ANALYZE
 SELECT * FROM Customers NATURAL JOIN (SELECT * FROM Orders WHERE shipped) AS O

| QUERY PLAN |
| :-------------------------------------------------------------------------------------------------------------------- |
| Hash Join (cost=28.50..705.05 rows=21131 width=24) (actual time=0.693..24.537 rows=21131 loops=1) |
| Hash Cond: (orders.customer_id = customers.customer_id) |
| -> Seq Scan on orders (cost=0.00..386.00 rows=21131 width=16) (actual time=0.007..8.534 rows=21131 loops=1) |
| Filter: shipped |
| Rows Removed by Filter: 3869 |
| -> Hash (cost=16.00..16.00 rows=1000 width=12) (actual time=0.676..0.676 rows=1000 loops=1) |
| Buckets: 1024 Batches: 1 Memory Usage: 52kB |
| -> Seq Scan on customers (cost=0.00..16.00 rows=1000 width=12) (actual time=0.003..0.312 rows=1000 loops=1) |
| Planning time: 0.263 ms |
| Execution time: 29.083 ms |

You can see that, in this case, PostgreSQ

Code Snippets

CREATE TABLE Customers
 (
     customer_id integer primary key,
     customer_name text not null
 ) ;

 CREATE TABLE Orders
 (
     order_id integer primary key,
     customer_id integer NOR NULL REFERENCES Customers(customer_id),
     whatever text,
     shipped boolean
 ) ;

 -- We invent 1_000 customers
 INSERT INTO 
     Customers (customer_id, customer_name)
 SELECT
     i, 'Name ' || i
 FROM
     generate_series (1, 1000) as s(i) ;

 -- and 25_000 orders
 INSERT INTO
     Orders (order_id, customer_id, whatever, shipped)
 SELECT
     i AS order_id,  
     1 + 999*random() AS customer_id,
     'a text' AS whatever,
     (random() < 0.85) AS shipped
 FROM
     generate_series(1, 25000) AS s(i) ;

ANALYZE Orders;
ANALYZE Customers;

EXPLAIN ANALYZE
SELECT * FROM Customers NATURAL JOIN Orders WHERE shipped

EXPLAIN ANALYZE
 SELECT * FROM Customers NATURAL JOIN (SELECT * FROM Orders WHERE shipped) AS O

CREATE INDEX idx_shipped_orders ON Orders(shipped, order_id) ;

Context

StackExchange Database Administrators Q#168874, answer score: 8

Revisions (0)

No revisions yet.