HiveBrain v1.2.0
pattern · sql · Minor

Slow query on primary server runs fast on hot standby

Submitted by: @import:stackexchange-dba

Problem

I have a primary PostgreSQL 9.3.4 server with 64GB of RAM that is replicated using streaming replication to a hot standby server with 32GB of RAM. My problem is as follows: I've detected a query that takes a long time to run on my primary server but runs very fast on the standby server. I did an EXPLAIN ANALYZE on the query:

```
EXPLAIN ANALYZE
SELECT this_.id AS id1_31_0_,
this_.nom AS nom2_31_0_,
this_.nom_slug AS nom3_31_0_,
this_.descripcio AS descripc4_31_0_,
this_.url AS url5_31_0_,
this_.data_captura AS data6_31_0_,
this_.data_publicacio AS data7_31_0_,
this_.propietari AS propieta8_31_0_,
this_.privacitat AS privacit9_31_0_,
this_.desnivellpujada AS desnive10_31_0_,
this_.desnivellbaixada AS desnive11_31_0_,
this_.longitud AS longitu13_31_0_,
this_.beginpoint AS beginpo14_31_0_,
this_.endpoint AS endpoin15_31_0_,
this_.caixa3d AS caixa16_31_0_,
this_.pic_id AS pic17_31_0_,
this_.skill AS skill18_31_0_,
this_.spatial_type AS spatial19_31_0_,
this_.tags_cached AS tags20_31_0_,
this_.images_cached AS images21_31_0_,
this_.ncomments AS ncommen22_31_0_,
this_.group_order AS group23_31_0_,
this_.author AS author24_31_0_,
this_.proper_a AS proper25_31_0_,
this_.duration AS duratio26_31_0_,
this_.isloop AS isloop27_31_0_,
this_.seo_country AS seo28_31_0_,
this_.seo_region AS seo29_31_0_,
this_.seo_place AS seo30_31_0_,
this_.source AS source31_31_0_,
this_.source_name AS source32_31_0_,
this_.api_key AS api33_31_0_,
this_.ratingui AS ratingu34_31_0_,
this_.nratings AS nrating35_31_0_,
this_.trailrank AS trailra36_31_0_,
this_.ncoords AS ncoords37_31_0_,
this_.egeom AS egeom38_31_0_,
this_.elevels AS elevels39_31_0_,
this_.elevations AS elevati40_31_0_,
this_.nphotoswpts AS nphotos41_31_0_,
```

Solution

The bad plan is probably the culmination of many problems, which means there are many ways of tackling it. My guess is that those problems make two plans look falsely close to each other in cost, and then the difference in memory settings (probably effective_cache_size) between master and slave is the straw that breaks the camel's back and tips the master over to using the wrong one.

The simplest starting point for tackling it may be this one:

```
        ->  Index Scan using idx_own_spas on spatial_artifact this_  (cost=0.43..23351.70 rows=6841 width=604) (actual time=0.037..2.119 rows=618 loops=1)
              Index Cond: (propietari = 7649)
```


This is simple because the scan runs to completion and has only a single index qualification, yet the estimate is off by a factor of 10. Since your statistics are up to date, the problem is most likely an inadequate default_statistics_target. You can increase that globally, or increase it just for the column propietari using ALTER TABLE ... SET STATISTICS, and then re-analyze the table. If this estimation problem were fixed, the master would almost certainly switch to the faster plan, as it would then appear even faster.
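The per-column route above might be sketched like this; the target value of 1000 is an illustrative assumption, not a figure from the original answer (the PostgreSQL default is 100):

```sql
-- Raise the statistics target for the skewed column only
-- (1000 is an assumed example value, not a recommendation).
ALTER TABLE spatial_artifact
  ALTER COLUMN propietari SET STATISTICS 1000;

-- Rebuild statistics so the planner sees the richer histogram.
ANALYZE spatial_artifact;
```

A larger target makes ANALYZE sample more rows and keep more most-common values, which is what narrows the rows=6841 vs. rows=618 gap seen in the plan.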

Another problem is that the planner assumes rows meeting the criteria ((group_id IS NULL) AND (propietari = 7649)) are distributed randomly over the values of this_.id, which they apparently are not. That makes it think the LIMIT clause will kick in much sooner than it actually does, so the slow plan looks faster than it really is. Unfortunately, there is nothing simple you can do to improve this estimate. But if you create the right index, you can provide a third plan that both is, and looks, much faster than either of the two current ones.

That index would be:

```
CREATE INDEX idx_own_spas
  ON spatial_artifact
  USING btree
  (propietari, id)
  WHERE group_id IS NULL;
```


This index can replace the existing similar index that lacks the id column, so it should not take up much additional space or maintenance overhead.
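Since the plan shows the existing index is itself named idx_own_spas, the swap might be sketched as follows; the temporary name is an assumption, and CONCURRENTLY is used so the build does not block writes:

```sql
-- Build the wider index first under a temporary (assumed) name.
CREATE INDEX CONCURRENTLY idx_own_spas_new
  ON spatial_artifact
  USING btree (propietari, id)
  WHERE group_id IS NULL;

-- Drop the old single-column index and take over its name.
DROP INDEX idx_own_spas;
ALTER INDEX idx_own_spas_new RENAME TO idx_own_spas;
```

Building the replacement before dropping the original avoids a window in which queries on propietari have no index at all.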

Finally, effective_cache_size can be changed within a session, so it is easy to adjust it on the master and on the slave to see whether that is truly what tips the plan from good to bad. Even if it is, that still doesn't tell you what to do about it: a value that causes a bad planner choice for one query may still be set correctly for the system as a whole. So hopefully one of the first two methods will work.
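A quick session-level experiment along those lines, with the value an assumption chosen to mimic the standby's smaller 32GB of RAM:

```sql
-- In a session on the master, pretend less cache is available
-- (value is an illustrative guess based on the standby's RAM).
SET effective_cache_size = '24GB';

EXPLAIN
SELECT ...;   -- re-run the problem query here and compare plans

-- Restore the server default for this session.
RESET effective_cache_size;
```

SET only affects the current session, so this is safe to try on the production master without touching other queries.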


Context

StackExchange Database Administrators Q#144168, answer score: 3
