snippetsqlMinor
How to optimize this tricky SQL self-join?
Viewed 0 times
thistrickysqljoinoptimizehowself
Problem
I want to perform a complex self-join on a table. I know that this could in theory be done very efficiently (see below), but I have trouble getting SQL (on Microsoft SQL Server) to do so.
My question is:
How can I get SQL to do this efficiently? What information do I need to provide it to be able to infer the optimal solution, or something similarly fast?
The input:
I have a table of events. Each event belongs to a certain item and has one of two types. I am free to do with this table whatever I want, and create any index I want, as it is an intermediate table. It is also only used for offline processing, so no new data will be added later.
The table has hundreds of millions of rows. Entries with type=0 and entries with type=1 appear roughly with the same frequency, and they are more or less fairly distributed because the input data is created in a way that follows certain rules, so that the following can be assumed to be true for the data:
Each time an event of
The table currently looks like this, but you can feel free to suggest changes:
I have tried several types of indices and several orders of the columns in those indices, but the below task will not get any faster:
The task:
"For each event of type=0, find the chronologically next event of type=1 that involved the same item. Get tuples of (
A query to get this could look like this:
```
select
My question is:
How can I get SQL to do this efficiently? What information do I need to provide it to be able to infer the optimal solution, or something similarly fast?
The input:
I have a table of events. Each event belongs to a certain item and has one of two types. I am free to do with this table whatever I want, and create any index I want, as it is an intermediate table. It is also only used for offline processing, so no new data will be added later.
The table has hundreds of millions of rows. Entries with type=0 and entries with type=1 appear roughly with the same frequency, and they are more or less fairly distributed because the input data is created in a way that follows certain rules, so that the following can be assumed to be true for the data:
Each time an event of
type=0 happens, a counter increases for the item involved, and each time an event of type=1 happens, it decreases again. The counter will always lie between 0 and 3 (inclusive).The table currently looks like this, but you can feel free to suggest changes:
select
a.item
,case when a. then 1 else 0 end as event_type
,row_number() over(partition by a.item order by a.date asc) as sequence_id -- this makes the order clearer and deals with duplicate dates in a manner that is acceptable for these purposes
, as counter_after_event -- this lies in [1;3] if event_type=0, and in [0;2] if event_type=1
from a;I have tried several types of indices and several orders of the columns in those indices, but the below task will not get any faster:
The task:
"For each event of type=0, find the chronologically next event of type=1 that involved the same item. Get tuples of (
item, sequence_id_for_type_0, sequence_id_for_type_1)."A query to get this could look like this:
```
select
Solution
You should consider using LEAD and LAG Analytic Functions and appropriate indices for sorting. I've changed one self-join query with them. Elapsed time was reduced tenfold.
Context
StackExchange Database Administrators Q#128837, answer score: 5
Revisions (0)
No revisions yet.