patternsqlMinor
inner join on PK with extra criteria slow despite indices?
Viewed 0 times
indicesdespitecriteriawithjoinslowinnerextra
Problem
Given the two tables below I am struggling to understand:
joining two tables on PK is fast:
so is just selecting PVs by timestamp range:
from product_views p, site_visits sv
where p.siteVisitId=sv.id
and p.timestamp >= "2012-10-10"
and p.timestamp
here's the EXPLAIN
```
mysql> explain select sql_no_cache p.id, p.timestamp, sv.postProcessed from product_views p, site_visits sv where p.siteVisitId=sv.id and p.timestamp >= "2012-10-10" and p.timestamp < "2012-11-10" limit 1;
+----+-------------+-------+--------+------------------------------------------------------------+--------------------+---------+---------------------+-----------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+------------------------------------------------------------+--------------------+---------+---------------------+-----------+--------------------------+
| 1 | SIMPLE | p | index | FK52C29B1E3CAB9CC4,timestamp_idx,siteVisitId_timestamp_idx | FK52C29B1E3CAB9CC4 | 8 | NULL | 119195469 | Using where; Using in
- why is the third query slow even though first two queries are fast
- what exactly is EXPLAIN saying
- can I do anything to significantly speed up the slow query?
joining two tables on PK is fast:
mysql> select sql_no_cache p.id, sv.postProcessed
from product_views p, site_visits sv
where p.siteVisitId=sv.id
limit 1;
+----+---------------+
| id | postProcessed |
+----+---------------+
| 1 | 1 |
+----+---------------+
1 row in set (0.10 sec)so is just selecting PVs by timestamp range:
mysql> select sql_no_cache p.id, p.timestamp
from product_views p
where p.timestamp >= "2012-10-10"
and p.timestamp
but joining the two is really slow (takes 5 min+ ):
mysql> select sql_no_cache p.id, p.timestamp, sv.postProcessed from product_views p, site_visits sv
where p.siteVisitId=sv.id
and p.timestamp >= "2012-10-10"
and p.timestamp
here's the EXPLAIN
```
mysql> explain select sql_no_cache p.id, p.timestamp, sv.postProcessed from product_views p, site_visits sv where p.siteVisitId=sv.id and p.timestamp >= "2012-10-10" and p.timestamp < "2012-11-10" limit 1;
+----+-------------+-------+--------+------------------------------------------------------------+--------------------+---------+---------------------+-----------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+------------------------------------------------------------+--------------------+---------+---------------------+-----------+--------------------------+
| 1 | SIMPLE | p | index | FK52C29B1E3CAB9CC4,timestamp_idx,siteVisitId_timestamp_idx | FK52C29B1E3CAB9CC4 | 8 | NULL | 119195469 | Using where; Using in
Solution
The optimizer does not see that your conditions are correlated and picks the wrong access method.
Basically, it considers two options:
-
Scan the index on
-
Scan the index on
Since
The optimizer calculates the conditional probability of the
Since your
However, since both
In simple words, you have to filter through many low
You should add
Basically, it considers two options:
-
Scan the index on
siteVisitId until the first match on site_visits and the first satisfied timestamp condition.-
Scan the index on
timestamp until the first match on site_visits.Since
timestamp is a part of the primary key and siteVisitId is not, the second plan would involve table lookups on product_views which is several times more slow than a pure index scan (note Using index in the plan).The optimizer calculates the conditional probability of the
timestamp condition being satisfied (given that a corresponding site_visit record exists) and compares it to the overhead of the table access.Since your
timestamp condition is quite wide (as seen on the index histograms), the optimizer prefers the first method.However, since both
siteVisitId and timestamp are incremental, they are correlated and the conditional probability of both matches is not a mere product of their independent probabilities.In simple words, you have to filter through many low
siteVisitId until you find the first matching timestamp, which is exactly what is happening to your query.You should add
ORDER BY timestamp to your query to make the timestamp index cheaper as it won't have to sort. It would also help to create an index on timestamp, siteVisitId (in this order) to avoid table lookups.Context
StackExchange Database Administrators Q#28828, answer score: 2
Revisions (0)
No revisions yet.