HiveBrain v1.2.0
Get Started
← Back to all entries
patternsqlMinor

inner join on PK with extra criteria slow despite indices?

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
indicesdespitecriteriawithjoinslowinnerextra

Problem

Given the two tables below I am struggling to understand:

  • why is the third query slow even though first two queries are fast



  • what exactly is EXPLAIN saying



  • can I do anything to significantly speed up the slow query?



joining two tables on PK is fast:

mysql> select sql_no_cache p.id, sv.postProcessed 
       from product_views p, site_visits sv 
       where p.siteVisitId=sv.id 
       limit 1;
+----+---------------+
| id | postProcessed |
+----+---------------+
|  1 |             1 |
+----+---------------+
1 row in set (0.10 sec)


so is just selecting PVs by timestamp range:

mysql> select sql_no_cache p.id, p.timestamp
from product_views p
where p.timestamp >= "2012-10-10"
and p.timestamp

but joining the two is really slow (takes 5 min+ ):

mysql> select sql_no_cache p.id, p.timestamp, sv.postProcessed
from product_views p, site_visits sv
where p.siteVisitId=sv.id
and p.timestamp >= "2012-10-10"
and p.timestamp

here's the EXPLAIN

```
mysql> explain select sql_no_cache p.id, p.timestamp, sv.postProcessed from product_views p, site_visits sv where p.siteVisitId=sv.id and p.timestamp >= "2012-10-10" and p.timestamp < "2012-11-10" limit 1;
+----+-------------+-------+--------+------------------------------------------------------------+--------------------+---------+---------------------+-----------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+------------------------------------------------------------+--------------------+---------+---------------------+-----------+--------------------------+
| 1 | SIMPLE | p | index | FK52C29B1E3CAB9CC4,timestamp_idx,siteVisitId_timestamp_idx | FK52C29B1E3CAB9CC4 | 8 | NULL | 119195469 | Using where; Using in

Solution

The optimizer does not see that your conditions are correlated and picks the wrong access method.

Basically, it considers two options:

-
Scan the index on siteVisitId until the first match on site_visits and the first satisfied timestamp condition.

-
Scan the index on timestamp until the first match on site_visits.

Since timestamp is a part of the primary key and siteVisitId is not, the second plan would involve table lookups on product_views which is several times more slow than a pure index scan (note Using index in the plan).

The optimizer calculates the conditional probability of the timestamp condition being satisfied (given that a corresponding site_visit record exists) and compares it to the overhead of the table access.

Since your timestamp condition is quite wide (as seen on the index histograms), the optimizer prefers the first method.

However, since both siteVisitId and timestamp are incremental, they are correlated and the conditional probability of both matches is not a mere product of their independent probabilities.

In simple words, you have to filter through many low siteVisitId until you find the first matching timestamp, which is exactly what is happening to your query.

You should add ORDER BY timestamp to your query to make the timestamp index cheaper as it won't have to sort. It would also help to create an index on timestamp, siteVisitId (in this order) to avoid table lookups.

Context

StackExchange Database Administrators Q#28828, answer score: 2

Revisions (0)

No revisions yet.