HiveBrain v1.2.0
Get Started
← Back to all entries
snippetsqlMinor

How to optimize this tricky SQL self-join?

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
thistrickysqljoinoptimizehowself

Problem

I want to perform a complex self-join on a table. I know that this could in theory be done very efficiently (see below), but I have trouble getting SQL (on Microsoft SQL Server) to do so.

My question is:

How can I get SQL to do this efficiently? What information do I need to provide it to be able to infer the optimal solution, or something similarly fast?

The input:

I have a table of events. Each event belongs to a certain item and has one of two types. I am free to do with this table whatever I want, and create any index I want, as it is an intermediate table. It is also only used for offline processing, so no new data will be added later.

The table has hundreds of millions of rows. Entries with type=0 and entries with type=1 appear roughly with the same frequency, and they are more or less fairly distributed because the input data is created in a way that follows certain rules, so that the following can be assumed to be true for the data:
Each time an event of type=0 happens, a counter increases for the item involved, and each time an event of type=1 happens, it decreases again. The counter will always lie between 0 and 3 (inclusive).

The table currently looks like this, but you can feel free to suggest changes:

select
    a.item
    ,case when a. then 1 else 0 end as event_type
    ,row_number() over(partition by a.item order by a.date asc) as sequence_id -- this makes the order clearer and deals with duplicate dates in a manner that is acceptable for these purposes
    , as counter_after_event -- this lies in [1;3] if event_type=0, and in [0;2] if event_type=1
from  a;


I have tried several types of indices and several orders of the columns in those indices, but the below task will not get any faster:

The task:

"For each event of type=0, find the chronologically next event of type=1 that involved the same item. Get tuples of (item, sequence_id_for_type_0, sequence_id_for_type_1)."

A query to get this could look like this:

```
select

Solution

You should consider using LEAD and LAG Analytic Functions and appropriate indices for sorting. I've changed one self-join query with them. Elapsed time was reduced tenfold.

Context

StackExchange Database Administrators Q#128837, answer score: 5

Revisions (0)

No revisions yet.