HiveBrain v1.2.0

Very slow execution of a simple query

Submitted by: @import:stackexchange-dba

Problem

I'm trying to create a report on my data, but the query is really slow on a big table.

The table structure is:

CREATE TABLE posts
(
  id serial NOT NULL,
  project_id integer,
  moderation character varying(255),
  keyword_id integer,
  author_id integer,
  created_at timestamp without time zone,
  updated_at timestamp without time zone,
  server_id character varying(255),
  social_creation_time integer,
  social_id character varying(255),
  network character varying(255),
  mood character varying(255) DEFAULT NULL::character varying,
  url text,
  source_id integer,
  location character varying(255),
  subject_id integer,
  conversation_id integer,
  CONSTRAINT posts_pkey PRIMARY KEY (id)
);

CREATE INDEX index_posts_on_author_id ON posts (author_id);
CREATE INDEX index_posts_on_keyword_id ON posts (keyword_id);
CREATE INDEX index_posts_on_project_id_and_network_and_social_id 
    ON posts (project_id, network, social_id);
CREATE INDEX index_posts_on_project_id_and_social_creation_time 
    ON posts (project_id, social_creation_time DESC);
CREATE INDEX index_posts_on_server_id ON posts (server_id);
CREATE INDEX index_posts_on_social_id ON posts (social_id);


The query:

SELECT date_trunc('hour', timestamp 'epoch' 
           + (posts.social_creation_time * INTERVAL '1 second')) creating,
       network, 
       count(*) posts 
FROM posts 
WHERE posts.project_id = 7 
   AND (posts.moderation NOT IN ('junk','spam')) 
   AND (posts.social_creation_time BETWEEN 1391716800 AND 1392839999) 
GROUP BY network, creating 
ORDER BY creating
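
As a sanity check, the epoch bounds in the WHERE clause translate to a roughly 13-day window (assuming the values are UTC epoch seconds):

```sql
SELECT timestamp 'epoch' + 1391716800 * interval '1 second' AS lower_bound,
       timestamp 'epoch' + 1392839999 * interval '1 second' AS upper_bound;

-- lower_bound: 2014-02-06 20:00:00
-- upper_bound: 2014-02-19 19:59:59
```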


The count is 3940689.

Explain plan

```
GroupAggregate (cost=631282.11..671932.05 rows=338750 width=12) (actual time=22576.318..23826.124 rows=1776 loops=1)
  ->  Sort (cost=631282.11..639750.85 rows=3387494 width=12) (actual time=22576.188..23438.485 rows=3536790 loops=1)
        Sort Key: (date_trunc('hour'::text, ('1970-01-01 00:00:00'::timestamp without time zone + ((social_creation_time)::double precision * '00:00:01'::in
```

Solution

In addition to the good advice by @Craig and @dezso:
Statistics

The count is 3940689.

Yet, your query plan says:

Seq Scan on posts  (cost=0.00..205984.62 rows=3387494 width=12)


And your count is based on a selection:

Rows Removed by Filter: 404218


4344907 (3940689 + 404218) >> 3387494. Your statistics are not up to date. Something may be wrong with your autovacuum settings - autovacuum, among other things, runs ANALYZE automatically. Stale statistics are very bad for overall db performance. For the query at hand, run this before you retry anything:

ANALYZE posts


If you can afford a lock on the table for some time, run

VACUUM FULL ANALYZE posts


to clean house. See the PostgreSQL documentation on VACUUM for more.
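
To see whether autovacuum has touched the table at all, query the statistics collector (pg_stat_user_tables is a standard system view):

```sql
SELECT relname, n_live_tup, n_dead_tup,
       last_vacuum, last_autovacuum,
       last_analyze, last_autoanalyze
FROM   pg_stat_user_tables
WHERE  relname = 'posts';
```

If last_autoanalyze is NULL or very old, autovacuum is not doing its job for this table.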
Index

The numbers indicate your query touches about 90% of all rows. Therefore, a sequential scan will be faster than any possible index scan - except for a covering index (index-only scan), which requires Postgres 9.2+. Be sure to read the Postgres Wiki on index-only scans.

Since you only use two small columns out of a long list of columns, such an index would be small and fast. While you're at it, and depending on your overall requirements, a tailored index like the following would squeeze out the maximum performance: a partial, functional, multicolumn, covering index - at some cost for write operations:

CREATE INDEX test_idx ON posts (
   date_trunc('hour', timestamp 'epoch' + social_creation_time * interval  '1 sec')
  ,network)
WHERE moderation NOT IN ('junk','spam')
AND   project_id = 7                         -- ??
AND   social_creation_time BETWEEN 1391716800 AND 1392839999; -- ??


The actual WHERE conditions depend on your actual requirements and need to be repeated in a more or less identical form in any query that wants to use this index. Trim rows that are never used from the index, but only use conditions that eliminate more than a few rows; tailor the index to a super-set of the rows your queries need.
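
For the partial index to be usable, a query has to repeat its predicates more or less verbatim. A sketch of the report query rewritten to match the index above:

```sql
SELECT date_trunc('hour', timestamp 'epoch'
           + social_creation_time * interval '1 sec') AS creating,
       network,
       count(*) AS posts
FROM   posts
WHERE  moderation NOT IN ('junk', 'spam')
AND    project_id = 7
AND    social_creation_time BETWEEN 1391716800 AND 1392839999
GROUP  BY network, creating
ORDER  BY creating;
```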

Generally, covering indexes work well for fairly static tables. Read the Wiki. A quick test:

SELECT relallvisible, relpages
FROM   pg_class
WHERE  oid = 'posts'::regclass;


If relallvisible is not much smaller than relpages, chances are good. Be sure autovacuum is running properly before you try this.
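
To put a number on it, compute the fraction of all-visible pages directly (the NULLIF guards against a freshly created table with relpages = 0):

```sql
SELECT relallvisible, relpages,
       round(relallvisible::numeric / NULLIF(relpages, 0), 2) AS visible_fraction
FROM   pg_class
WHERE  oid = 'posts'::regclass;
```

A value close to 1 means most of the table is visible to all transactions, so index-only scans can pay off.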

I would also test without the functional aspect to see which is used / faster:

CREATE INDEX test_idx ON posts (social_creation_time, network)
WHERE moderation NOT IN ('junk','spam')
AND   project_id = 7                         -- ??
AND   social_creation_time BETWEEN 1391716800 AND 1392839999; -- ??
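
Either way, verify which index the planner actually picks before settling on one. EXPLAIN (ANALYZE, BUFFERS) shows the chosen plan plus buffer usage; "Index Only Scan" in the output confirms the covering index is working:

```sql
EXPLAIN (ANALYZE, BUFFERS)
SELECT network, count(*)
FROM   posts
WHERE  moderation NOT IN ('junk', 'spam')
AND    project_id = 7
AND    social_creation_time BETWEEN 1391716800 AND 1392839999
GROUP  BY network;
```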


Search for related answers dealing with partial covering indexes.
Table layout

Finally, in your table definition integer and text columns alternate, bloating the table quite a bit due to data alignment and padding. More in this related answer:

Configuring PostgreSQL for read performance

I would recreate your table along these lines:

CREATE TABLE post (
  post_id serial PRIMARY KEY,
  project_id integer,
  created_at timestamp,
  updated_at timestamp,
  keyword_id integer,
  author_id integer,
  source_id integer,
  subject_id integer,
  conversation_id integer,
  social_creation_time integer,

  server_id  text,   -- could be integer?
  social_id  text,
  moderation text,
  network    text,
  url        text,
  location   text,
  mood       text
);


The table will be a bit smaller, which helps overall performance.
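
To quantify the difference, compare the on-disk footprint of the old and the rewritten table with the standard size functions:

```sql
SELECT pg_size_pretty(pg_total_relation_size('posts')) AS old_size,
       pg_size_pretty(pg_total_relation_size('post'))  AS new_size;
```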

Why text instead of varchar(255)?


Context

StackExchange Database Administrators Q#59170, answer score: 11
