patternsqlMinor
Speed up count queries on a couple million rows
Viewed 0 times
millionrowsqueriescouplecountspeed
Problem
Assuming a db full of products. A product can belong to exactly 1 collection and is created by a user. Rough scale of the db:
I am trying to retrieve the amount of products+collections a user has, and the amount of products inside each collection (this information is supposed to get generated all x days and indexed in ElasticSearch) .
For the user query, I am currently doing something like this:
all *_id fields are indexed. Using explain(analyze, verbose) (sensitive information removed):
```
Limit (cost=0.00..156500.97 rows=100 width=41) (actual time=0.064..28345.363 rows=100 loops=1)
Output: (...), ((SubPlan 1)), ((SubPlan 2))
-> Seq Scan on public.users user (cost=0.00..14549429167.11 rows=9296702 width=41) (actual time=0.064..28345.241 rows=100 loops=1)
Output: (...), (SubPlan 1), (SubPlan 2)
SubPlan 1
-> Aggregate (cost=1415.84..1415.85 rows=1 width=0) (actual time=261.101..261.102 rows=1 loops=100)
Output: count(*)
-> Bitmap Heap Scan on public.products product (cost=7.32..1414.95 rows=355 width=0) (actual time=0.282..260.767 rows=382 loops=100)
Output: (...)
Recheck Cond: (product.user_id = user.id)
Heap Blocks: exact=32882
-> Bitmap Index Scan on products_user_id_index (cost=0.00..7.23 rows=355 width=0) (actual time=0.165..0.165 rows=382 loops=100)
Index Cond: (product.user_id = user.id)
SubPl
- Products: 52.000.000
- Collections: 9.000.000
- Users: roughly 9.000.000 as well
I am trying to retrieve the amount of products+collections a user has, and the amount of products inside each collection (this information is supposed to get generated all x days and indexed in ElasticSearch) .
For the user query, I am currently doing something like this:
SELECT
users.*,
(SELECT
count(*)
FROM
products product
WHERE
product.user_id = user.id
) AS product_count,
(SELECT
count(*)
FROM
collections collection
WHERE
collection.user_id = user.id
) AS collection_count
FROM
users userall *_id fields are indexed. Using explain(analyze, verbose) (sensitive information removed):
```
Limit (cost=0.00..156500.97 rows=100 width=41) (actual time=0.064..28345.363 rows=100 loops=1)
Output: (...), ((SubPlan 1)), ((SubPlan 2))
-> Seq Scan on public.users user (cost=0.00..14549429167.11 rows=9296702 width=41) (actual time=0.064..28345.241 rows=100 loops=1)
Output: (...), (SubPlan 1), (SubPlan 2)
SubPlan 1
-> Aggregate (cost=1415.84..1415.85 rows=1 width=0) (actual time=261.101..261.102 rows=1 loops=100)
Output: count(*)
-> Bitmap Heap Scan on public.products product (cost=7.32..1414.95 rows=355 width=0) (actual time=0.282..260.767 rows=382 loops=100)
Output: (...)
Recheck Cond: (product.user_id = user.id)
Heap Blocks: exact=32882
-> Bitmap Index Scan on products_user_id_index (cost=0.00..7.23 rows=355 width=0) (actual time=0.165..0.165 rows=382 loops=100)
Index Cond: (product.user_id = user.id)
SubPl
Solution
While computing numbers for all or most users, it's much more efficient to use plain subqueries to aggregate counts per user before joining instead of correlated subqueries:
The index-only scan and bitmap index we see in the
The
Related:
SELECT u.*, p.product_count, c.collection_count
FROM users u
LEFT JOIN (
SELECT user_id AS id, count(*) AS product_count
FROM products
GROUP BY 1
) p USING (id)
LEFT JOIN (
SELECT user_id As id, count(*) AS collection_count
FROM collections
GROUP BY 1
) c USING (id);The index-only scan and bitmap index we see in the
EXPLAIN output only benefit queries for a small subset of rows (LIMIT 100). Your test is misleading in this respect. While computing numbers for all (or most) users, indexes won't help. Sequential scans will be faster.The
Bitmap Heap Scan you see is only the second step needed for the bitmap index scan - which comes as a surprise for just 100 rows. Either your table statistics are out of date or those 100 users have a lot of related products or rows in products are highly clustered (physically, meaning multiple rows for one user reside on the same or few data pages). Postgres only switches to a bitmap index scan if it expects to find multiple rows per data page, which comes as a surprise for 100 users and 52.000.000 products (rows=355 expected and rows=382 found).Related:
- Index usage on a temporary table
- Configuring PostgreSQL for read performance
Code Snippets
SELECT u.*, p.product_count, c.collection_count
FROM users u
LEFT JOIN (
SELECT user_id AS id, count(*) AS product_count
FROM products
GROUP BY 1
) p USING (id)
LEFT JOIN (
SELECT user_id As id, count(*) AS collection_count
FROM collections
GROUP BY 1
) c USING (id);Context
StackExchange Database Administrators Q#153739, answer score: 5
Revisions (0)
No revisions yet.