HiveBrain v1.2.0
Get Started
← Back to all entries
patternsqlMinor

Speed up count queries on a couple million rows

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
millionrowsqueriescouplecountspeed

Problem

Assuming a db full of products. A product can belong to exactly 1 collection and is created by a user. Rough scale of the db:

  • Products: 52.000.000



  • Collections: 9.000.000



  • Users: roughly 9.000.000 as well



I am trying to retrieve the amount of products+collections a user has, and the amount of products inside each collection (this information is supposed to get generated all x days and indexed in ElasticSearch) .

For the user query, I am currently doing something like this:

SELECT
        users.*,
        (SELECT
          count(*)
        FROM
          products product
        WHERE
          product.user_id = user.id
        ) AS product_count,
        (SELECT
          count(*)
        FROM
          collections collection
        WHERE
          collection.user_id = user.id
        ) AS collection_count
      FROM
        users user


all *_id fields are indexed. Using explain(analyze, verbose) (sensitive information removed):

```
Limit (cost=0.00..156500.97 rows=100 width=41) (actual time=0.064..28345.363 rows=100 loops=1)
Output: (...), ((SubPlan 1)), ((SubPlan 2))
-> Seq Scan on public.users user (cost=0.00..14549429167.11 rows=9296702 width=41) (actual time=0.064..28345.241 rows=100 loops=1)
Output: (...), (SubPlan 1), (SubPlan 2)
SubPlan 1
-> Aggregate (cost=1415.84..1415.85 rows=1 width=0) (actual time=261.101..261.102 rows=1 loops=100)
Output: count(*)
-> Bitmap Heap Scan on public.products product (cost=7.32..1414.95 rows=355 width=0) (actual time=0.282..260.767 rows=382 loops=100)
Output: (...)
Recheck Cond: (product.user_id = user.id)
Heap Blocks: exact=32882
-> Bitmap Index Scan on products_user_id_index (cost=0.00..7.23 rows=355 width=0) (actual time=0.165..0.165 rows=382 loops=100)
Index Cond: (product.user_id = user.id)
SubPl

Solution

While computing numbers for all or most users, it's much more efficient to use plain subqueries to aggregate counts per user before joining instead of correlated subqueries:

SELECT u.*, p.product_count, c.collection_count
FROM   users u
LEFT   JOIN (
   SELECT user_id AS id, count(*) AS product_count
   FROM   products
   GROUP  BY 1
   ) p USING (id)
LEFT   JOIN (
   SELECT user_id  As id, count(*) AS collection_count
   FROM   collections
   GROUP  BY 1
   ) c USING (id);


The index-only scan and bitmap index we see in the EXPLAIN output only benefit queries for a small subset of rows (LIMIT 100). Your test is misleading in this respect. While computing numbers for all (or most) users, indexes won't help. Sequential scans will be faster.

The Bitmap Heap Scan you see is only the second step needed for the bitmap index scan - which comes as a surprise for just 100 rows. Either your table statistics are out of date or those 100 users have a lot of related products or rows in products are highly clustered (physically, meaning multiple rows for one user reside on the same or few data pages). Postgres only switches to a bitmap index scan if it expects to find multiple rows per data page, which comes as a surprise for 100 users and 52.000.000 products (rows=355 expected and rows=382 found).

Related:

  • Index usage on a temporary table



  • Configuring PostgreSQL for read performance

Code Snippets

SELECT u.*, p.product_count, c.collection_count
FROM   users u
LEFT   JOIN (
   SELECT user_id AS id, count(*) AS product_count
   FROM   products
   GROUP  BY 1
   ) p USING (id)
LEFT   JOIN (
   SELECT user_id  As id, count(*) AS collection_count
   FROM   collections
   GROUP  BY 1
   ) c USING (id);

Context

StackExchange Database Administrators Q#153739, answer score: 5

Revisions (0)

No revisions yet.