HiveBrain v1.2.0
Get Started
← Back to all entries
patternsqlMinor

Fastest way to choose distinct rows and a different order by without using a subquery

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
rowsdistinctorderwithoutchoosewaydifferentsubqueryfastestusing

Problem

PostgresSQL - Join and get distinct left table rows only without gigantic subquery sort

We have a midsized database that has 50k rows of products and 33m rows of product prices at various store locations.

We join the products table to the access table with various where conditions / order by the suit the end user the best.

A simple explanation of a typical query would be 100 distinct products, sorted by highest price first, sorted by in stock first.

Seems simple enough... Everything I read online says to join the tables with a LEFT INNER JOIN (perhaps this is the wrong type of join to get unique rows?) on products_displayproductaccess in a sub query, then order the outer query by the order we're looking for. Problem is the performance is brutal... We partitioned the tables and it did help, but not significantly due the large sort size.

I'm surprised this is such an issue in Postgres (or perhaps other DB's as well) or the solution has just alluded me and the other upwork DBA's we've paid. It's just a join to get unique rows from the left table...

products_displayproduct

```
Table "public.products_displayproduct"
Column | Type | Collation | Nullable | Default | Storage | Stats target | Description
-----------------------+--------------------------+-----------+----------+-----------------------------------------------------+----------+--------------+-------------
id | integer | | not null | nextval('products_displayproduct_id_seq'::regclass) | plain | |
date_created | timestamp with time zone | | not null | | plain | |
date_updated | timestamp with time zone | | not null | | plain | |
name | character varying(255) |

Solution

The working query you have, should get substantially faster if you do ORDER BY / LIMIT before the join:

SELECT p.id
     , p.date_created
     , p.date_updated
     , p.name
     , p.sub_title
     , p.tags
     , p.has_multiple_variants
     , p.placement_id
     , p.brand_id
     , p.poster_image_id
     , p.rating
     , p.reviews
     , p.is_toppick
     , p.last_amalgamation
     , p.search_index
     , pa.price
     , pa.is_instock
FROM  (
   SELECT *
   FROM  (
      SELECT DISTINCT ON (product_id)
             product_id, price, is_instock
      FROM   products_displayproductaccess
      WHERE  location_intspace = 755
      AND    location_id IN (55, 60, 61, 1532)  -- many more
      ORDER  BY product_id, price DESC, is_instock DESC
      ) sub
   ORDER  BY price DESC, is_instock DESC
   LIMIT  100
   ) pa
JOIN   products_displayproduct p ON p.id = pa.product_id;


Depending on data distribution and typical query predicates, DISTINCT ON in the subquery might be replaced with something more efficient. Maybe a recursive CTE or a LATERAL join to a subquery with ORDER BY / LIMIT. See:

  • Optimize GROUP BY query to retrieve latest row per user



  • Select first row in each GROUP BY group?



  • Postgres sometimes uses inferior index for WHERE a IN (...) ORDER BY b LIMIT N



  • Can spatial index help a "range - order by - limit" query



And there is probably more potential to make this faster, yet. Looks like a major project for paid consultant work - starting by sanitizing your table partitioning. The undisclosed triggers after_insert_products_displayproductaccess_trigger and before_insert_products_displayproductaccess_trigger look like a sub-optimal implementation.

Code Snippets

SELECT p.id
     , p.date_created
     , p.date_updated
     , p.name
     , p.sub_title
     , p.tags
     , p.has_multiple_variants
     , p.placement_id
     , p.brand_id
     , p.poster_image_id
     , p.rating
     , p.reviews
     , p.is_toppick
     , p.last_amalgamation
     , p.search_index
     , pa.price
     , pa.is_instock
FROM  (
   SELECT *
   FROM  (
      SELECT DISTINCT ON (product_id)
             product_id, price, is_instock
      FROM   products_displayproductaccess
      WHERE  location_intspace = 755
      AND    location_id IN (55, 60, 61, 1532)  -- many more
      ORDER  BY product_id, price DESC, is_instock DESC
      ) sub
   ORDER  BY price DESC, is_instock DESC
   LIMIT  100
   ) pa
JOIN   products_displayproduct p ON p.id = pa.product_id;

Context

StackExchange Database Administrators Q#260852, answer score: 2

Revisions (0)

No revisions yet.