HiveBrain v1.2.0
Get Started
← Back to all entries
principlesqlMinor

A costly approach to filtering on a calculated date

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
dateapproachcostlycalculatedfiltering

Problem

This builds off my previously wonderfully answered question Running a complex query for every date in a range I've got an expanded problem and I'm certainly not solving it in a performant way.

The order table setup is the same. A table of repeating orders that can be overridden. There's another table of order items, each order item has a quantity, a product, lead time in days and relates to the order table.

CREATE TABLE orders (
    id integer NOT NULL,
    client_id integer NOT NULL,
    start_date date NOT NULL,
    end_date date,
    order_type character varying NOT NULL
);

CREATE TABLE order_items (
    id integer NOT NULL,
    order_id integer NOT NULL,
    product_id integer NOT NULL,
    quantity integer NOT NULL,
    lead_days integer NOT NULL  
);


The problem was to figure out which orders were "active" on a range of ship dates allowing "temporary" orders to override "standing orders" on each specific date. We solved that with a query like this.

```
INSERT INTO orders
(id, client_id, start_date, end_date, order_type) VALUES
(1, 1, '2014-02-05', NULL, 'standing'),
(2, 2, '2014-07-16', '2015-07-19', 'standing'),
(3, 3, '2015-04-01', NULL, 'standing'),
(4, 3, '2015-07-18', '2015-07-18', 'temporary'),
(5, 4, '2015-04-01', NULL, 'standing'),
(6, 4, '2015-07-18', '2015-07-18', 'temporary');

SELECT DISTINCT
ON (client_id) *
FROM
orders
WHERE
start_date = DATE '2015-07-18'
)
ORDER BY
client_id,
order_type DESC

| id | client_id | start_date | end_date | order_type |
|----|-----------|----------------------------|------------------------|------------|
| 1 | 1 | February, 05 2014 00:00:00 | (null) | standing |
| 2 | 2 | July, 16 2014 00:00:00 | July, 19 2015 00:00:00 | standing |
| 4 | 3 | July, 18 2015 00:00:00 | July, 18 2015 00:00:00 | temporary |
| 6 | 4 | July, 18 2015 00:00:00 | July, 18 2015 00:00:00 | temporary

Solution

WARNING! Depending on authors response, my answer could be way off. Awaiting his reponse.

reconbot, we need more info, because it seems like your entire subquery

SELECT DISTINCT ON (client_id) id
FROM orders
WHERE
  start_date = ship_date)
  AND client_id = items_to_work.client_id
ORDER BY client_id, order_type DESC


is completely redundant. You appear to be re-checking for already checked conditions.

For example, try just running

SELECT *
FROM (
  SELECT
    order_items.id item_id,
    order_items.product_id,
    order_items.quantity,
    order_items.lead_days,
    orders.id order_id,
    orders.client_id client_id,
    (DATE '2014-07-18' + order_items.lead_days) ship_date
  FROM order_items
  INNER JOIN orders ON orders.id = order_items.order_id
  WHERE
    -- find all order items that have active orders on the production_date + lead time
    DATE '2014-07-18' >= (orders.start_date - order_items.lead_days)
    AND (
      DATE '2014-07-18' <= (orders.end_date - order_items.lead_days)
      OR orders.end_date IS NULL
    )
) items_to_work;


and you get the exact same results. See the SQL Fiddle here. If you're still concerned about distinct client_ids, you could also use

SELECT *
FROM (
  SELECT DISTINCT ON (client_id)
    order_items.id item_id,
    order_items.product_id,
    order_items.quantity,
    order_items.lead_days,
    orders.id order_id,
    orders.client_id client_id,
    (DATE '2014-07-18' + order_items.lead_days) ship_date
  FROM order_items
  INNER JOIN orders ON orders.id = order_items.order_id
  WHERE
    -- find all order items that have active orders on the production_date + lead time
    DATE '2014-07-18' >= (orders.start_date - order_items.lead_days)
    AND (
      DATE '2014-07-18' <= (orders.end_date - order_items.lead_days)
      OR orders.end_date IS NULL
    )
) items_to_work;


as in this SQL Fiddle.

So, I'm either misunderstanding, or your originally given data set doesn't reflect the significance of your sub-SELECT in the original question.

IGNORE EVERYTHING BELOW HERE...

Until we get clarification from the author...

I think I've got your solution. At the very least, I've made a SQLFiddle with the results, and it appears that it will be much less costly.

LATERAL sub-queries in PostgreSQL

I love chances when a LATERAL sub-query can be used to save some time in your query. Unfortunately, I think I'm pretty bad at explaining when and where it should be used, and I'm only OK at recognizing instances of when to use it. It just doesn't come up too often in my particular query designs.

Take a look at the Postgres documentation on LATERAL keyword for some ideas, and also I really like this SlideShare presentation by Markus Winand for helping to explain LATERAL a bit better. In essence, it has a flavor of a "for each" statement in typical pseudo-coding vernacular.

The reason I looked into it for your case was: you were building the items_to_work table, and then using the client_id attribute of items_to_work in your sub-SELECT, where you checked if items_to_work.order_id was IN the distinct returned set of orders.id values. Using items_to_work attributes in the WHERE clause of the sub-SELECT was the red flag for me.

OK, so I realize the explanation of my motivation isn't so hot, sorry! :P On to the results...

New Query

Without further ado, here it is:

SELECT *
FROM (
  SELECT
    order_items.id item_id,
    order_items.product_id,
    order_items.quantity,
    order_items.lead_days,
    orders.id order_id,
    orders.client_id client_id,
    (DATE '2014-07-18' + order_items.lead_days) ship_date
  FROM order_items
  INNER JOIN orders ON orders.id = order_items.order_id
  WHERE
-- find all order items that have active orders on the production_date + lead time
    DATE '2014-07-18' >= (orders.start_date - order_items.lead_days)
    AND (
      DATE '2014-07-18' = ship_date)
  AND items_to_work.order_id = orders.id) lat;


Also, you can note that we get back identical results as those you are expecting, except for a single added column on the right side, as part of the LATERAL selection, which is a replication of the orders.id. I'm not super experienced with LATERAL (it doesn't come up too often in my work), so if that's a problem for you, we can sort out a way to drop it. :P

EXPLAIN results

So, we don't have the big data set which you have, so that we can really test out the results of the new query versus the old. I'm relying on the EXPLAIN estimates here, but...

Using the Old SQLFiddle, we can see that the overall estimated cost is approximately 283,000 Postgres units. :P

Using the New SQLFiddle, we can see the much better estimated cost of only 204!!

Code Snippets

SELECT DISTINCT ON (client_id) id
FROM orders
WHERE
  start_date <= ship_date and (end_date is NULL OR end_date >= ship_date)
  AND client_id = items_to_work.client_id
ORDER BY client_id, order_type DESC
SELECT *
FROM (
  SELECT
    order_items.id item_id,
    order_items.product_id,
    order_items.quantity,
    order_items.lead_days,
    orders.id order_id,
    orders.client_id client_id,
    (DATE '2014-07-18' + order_items.lead_days) ship_date
  FROM order_items
  INNER JOIN orders ON orders.id = order_items.order_id
  WHERE
    -- find all order items that have active orders on the production_date + lead time
    DATE '2014-07-18' >= (orders.start_date - order_items.lead_days)
    AND (
      DATE '2014-07-18' <= (orders.end_date - order_items.lead_days)
      OR orders.end_date IS NULL
    )
) items_to_work;
SELECT *
FROM (
  SELECT DISTINCT ON (client_id)
    order_items.id item_id,
    order_items.product_id,
    order_items.quantity,
    order_items.lead_days,
    orders.id order_id,
    orders.client_id client_id,
    (DATE '2014-07-18' + order_items.lead_days) ship_date
  FROM order_items
  INNER JOIN orders ON orders.id = order_items.order_id
  WHERE
    -- find all order items that have active orders on the production_date + lead time
    DATE '2014-07-18' >= (orders.start_date - order_items.lead_days)
    AND (
      DATE '2014-07-18' <= (orders.end_date - order_items.lead_days)
      OR orders.end_date IS NULL
    )
) items_to_work;
SELECT *
FROM (
  SELECT
    order_items.id item_id,
    order_items.product_id,
    order_items.quantity,
    order_items.lead_days,
    orders.id order_id,
    orders.client_id client_id,
    (DATE '2014-07-18' + order_items.lead_days) ship_date
  FROM order_items
  INNER JOIN orders ON orders.id = order_items.order_id
  WHERE
-- find all order items that have active orders on the production_date + lead time
    DATE '2014-07-18' >= (orders.start_date - order_items.lead_days)
    AND (
      DATE '2014-07-18' <= (orders.end_date - order_items.lead_days)
      OR orders.end_date IS NULL
    )
) items_to_work,
LATERAL(
  SELECT id FROM orders
  WHERE orders.client_id = items_to_work.client_id
  AND start_date <= ship_date and (end_date is NULL OR end_date >= ship_date)
  AND items_to_work.order_id = orders.id) lat;

Context

StackExchange Database Administrators Q#108767, answer score: 3

Revisions (0)

No revisions yet.