Improve performance of COUNT/GROUP BY in large PostgreSQL table?
Problem
I am running PostgreSQL 9.2 and have a 12-column relation with about 6,700,000 rows. It contains nodes in a 3D space, each one referencing a user (who created it). To query which user has created how many nodes, I do the following (added `explain analyze` for more information):
```
EXPLAIN ANALYZE SELECT user_id, count(user_id) FROM treenode WHERE project_id=1 GROUP BY user_id;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=253668.70..253669.07 rows=37 width=8) (actual time=1747.620..1747.623 rows=38 loops=1)
-> Seq Scan on treenode (cost=0.00..220278.79 rows=6677983 width=8) (actual time=0.019..886.803 rows=6677983 loops=1)
Filter: (project_id = 1)
Total runtime: 1747.653 ms
```

As you can see, this takes about 1.7 seconds. This isn't too bad considering the amount of data, but I wonder if this can be improved. I tried to add a B-tree index on the user column, but this didn't help in any way.
Do you have alternative suggestions?
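(Presumably, the attempted index looked something like the following; the exact statement isn't shown in the question, so the name and definition are assumptions. It can't help this query, because the filter is on `project_id`, not `user_id`.)

```
-- Hypothetical reconstruction of the index attempt mentioned above:
CREATE INDEX treenode_user_id_index ON treenode (user_id);
```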
For the sake of completeness, this is the complete table definition with all its indices (without foreign key constraints, references and triggers):
```
    Column     |           Type           |                      Modifiers
---------------+--------------------------+------------------------------------------------------
 id            | bigint                   | not null default nextval('concept_id_seq'::regclass)
 user_id       | bigint                   | not null
 creation_time | timestamp with time zone | not null default now()
 edition_time  | timestamp with time zone | not null default now()
 project_id    | bigint                   | not null
 location      | double3d                 | not null
 reviewer_id   | integer                  | not null default (-1)
 review_time   | ti
```
Solution
The main problem is the missing index. But there is more.
The improved query:

```
SELECT user_id, count(*) AS ct
FROM treenode
WHERE project_id = 1
GROUP BY user_id;
```
- You have many `bigint` columns. Probably overkill. Typically, `integer` is more than enough for columns like `project_id` and `user_id`. This would also help the next item.

  While optimizing the table definition, consider this related answer, with an emphasis on data alignment and padding. Most of the rest applies, too:
  - Configuring PostgreSQL for read performance
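  A minimal sketch of that narrowing, assuming every existing value fits into `integer` (each `ALTER` rewrites the table and takes an exclusive lock, and any referencing foreign key columns would need the same change):

  ```
  -- Check that nothing exceeds the integer range first:
  SELECT max(project_id) AS max_project, max(user_id) AS max_user FROM treenode;

  -- Narrow the columns (table rewrite; plan for a maintenance window):
  ALTER TABLE treenode
    ALTER COLUMN project_id TYPE integer,
    ALTER COLUMN user_id    TYPE integer;
  ```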
- The elephant in the room: there is no index on `project_id`. Create one. This is more important than the rest of this answer.

  While being at it, make that a multicolumn index:

  ```
  CREATE INDEX treenode_project_id_user_id_index ON treenode (project_id, user_id);
  ```

  If you followed my advice, `integer` would be perfect here. See:

  - Is a composite index also good for queries on the first field?
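  On that last point, a quick sketch of checking whether the composite index serves a filter on the first column alone (with only a handful of distinct `project_id` values, the planner may still prefer a sequential scan, as discussed under "Added information" below):

  ```
  EXPLAIN
  SELECT user_id
  FROM treenode
  WHERE project_id = 1;
  ```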
- `user_id` is defined `NOT NULL`, so `count(user_id)` is equivalent to `count(*)`, but the latter is a bit shorter and faster. (For the given query, this even applies without `user_id` being defined `NOT NULL`.)

- `id` is already the primary key; the additional `UNIQUE` constraint is useless ballast. Drop it:

  ```
  "treenode_pkey" PRIMARY KEY, btree (id)
  "treenode_id_key" UNIQUE CONSTRAINT, btree (id)
  ```
Q: "How many different project_id and user_id?"
A: "Not more than five different project_id."
That means Postgres has to read about 20% of the whole table to satisfy your query: with roughly 6,700,000 rows and no more than five distinct values of project_id, a single project_id covers at least a fifth of the table on average (and the plan above shows project_id = 1 in fact matches almost every row). Unless it can use an index-only scan, a sequential scan on the table will be faster than involving any indexes. No more performance to gain here - except by optimizing the table and server settings.
As for the index-only scan: to see how effective that can be, run `VACUUM ANALYZE`, then try your query again. It should now be moderately faster, using only the index. Read this related answer first:

- Optimize simple query using ORDER BY date and text
As well as:
- The manual on index-only scans
- Postgres Wiki on index-only scans.
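Putting it together, a sketch of that check (plan shape and timings depend on your data, version, and settings):

```
VACUUM ANALYZE treenode;

EXPLAIN ANALYZE
SELECT user_id, count(*) AS ct
FROM treenode
WHERE project_id = 1
GROUP BY user_id;

-- Ideally the plan now shows
-- "Index Only Scan using treenode_project_id_user_id_index"
-- with few or zero "Heap Fetches".
```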
Code Snippets
```
SELECT user_id, count(*) AS ct
FROM treenode
WHERE project_id = 1
GROUP BY user_id;
```

```
CREATE INDEX treenode_project_id_user_id_index ON treenode (project_id, user_id);
```

```
"treenode_pkey" PRIMARY KEY, btree (id)
"treenode_id_key" UNIQUE CONSTRAINT, btree (id)
```

Context
StackExchange Database Administrators Q#62500, answer score: 40