HiveBrain v1.2.0
Get Started
← Back to all entries
patternsqlMinor

Remove similar and successive rows

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
rowssuccessiveremoveandsimilar

Problem

I have a table with many entries each describing the commerce state of an item.

The table has the following structure:

CREATE TABLE CommerceState
(
    Id INT NOT NULL PRIMARY KEY
    , ItemId INT NOT NULL
    , Listings INT NOT NULL
    , UnitPrice BIGINT NOT NULL
    , Quantity INT NOT NULL
    , [Timestamp] DATETIME NOT NULL
);


At certain intervals I'm saving the current state of an item into this table. It can happen that two consecutive updates contain the same data and just differ in the timestamp.

What I want to do now is to remove those kind of duplicates. I couldn't come up with a script so far. I just got solutions to completely delete all duplicates without respect to the actual timeline.

To visualize what I would like to achieve:

Before:

+ Id + ItemId + Listings + UnitPrice + Quantity + Date                    +
| 1  | 1234   | 10       | 100       | 50       | 2015-10-22 15:55:00.000 |
| 2  | 1235   | 10       | 25        | 150      | 2015-10-22 16:00:00.000 |
| 3  | 1234   | 9        | 100       | 50       | 2015-10-22 16:05:00.000 |
| 4  | 1235   | 5        | 25        | 30       | 2015-10-22 16:10:00.000 |
| 5  | 1235   | 10       | 25        | 150      | 2015-10-22 16:15:00.000 |
| 6  | 1235   | 10       | 25        | 150      | 2015-10-22 16:20:00.000 |
| 7  | 1234   | 10       | 100       | 50       | 2015-10-22 16:25:00.000 |
| 8  | 1234   | 10       | 100       | 50       | 2015-10-22 16:30:00.000 |
| 9  | 1234   | 10       | 100       | 50       | 2015-10-22 16:35:00.000 |
| 10 | 1234   | 9        | 100       | 50       | 2015-10-22 16:40:00.000 |


After:

```
+ Id + ItemId + Listings + UnitPrice + Quantity + Date +
| 1 | 1234 | 10 | 100 | 50 | 2015-10-22 15:55:00.000 |
| 2 | 1235 | 10 | 25 | 150 | 2015-10-22 16:00:00.000 |
| 3 | 1234 | 9 | 100 | 50 | 2015-10-22 16:05:00.000 |
| 4 | 1235 | 5 | 25 | 30 | 2015-10-22 1

Solution

One way to do this is to look up the previous row for each item and check if that row has the same data. If so, we can delete the row:

DELETE c
FROM CommerceState c
CROSS APPLY (
    -- For each row, find the previous row for this item
    SELECT TOP 1 prev.id, prev.Listings, prev.UnitPrice, prev.Quantity
    FROM CommerceState prev
    WHERE prev.itemId = c.itemId
        AND prev.[Timestamp] < c.[Timestamp]
    ORDER BY [Timestamp] DESC
) d
-- If that row matches the current row, we'll delete the current row
WHERE d.Listings = c.Listings
    AND d.UnitPrice = c.UnitPrice
    AND d.Quantity = c.Quantity


Here is a full test script that creates your dummy data, runs the query, and confirms the results.

You can optionally add the following index in order to optimize the lookup of the most recent row, but it would depend on your workload whether the index is worth the extra overhead it incurs when inserting and deleting data.

-- Optionally add this index if you deem the tradeoff to be worthwhile
CREATE INDEX IX_CommerceState
    ON CommerceState (ItemId, [Timestamp])
    INCLUDE (Listings, UnitPrice, Quantity)


With the index, here is the query plan:

Code Snippets

DELETE c
FROM CommerceState c
CROSS APPLY (
    -- For each row, find the previous row for this item
    SELECT TOP 1 prev.id, prev.Listings, prev.UnitPrice, prev.Quantity
    FROM CommerceState prev
    WHERE prev.itemId = c.itemId
        AND prev.[Timestamp] < c.[Timestamp]
    ORDER BY [Timestamp] DESC
) d
-- If that row matches the current row, we'll delete the current row
WHERE d.Listings = c.Listings
    AND d.UnitPrice = c.UnitPrice
    AND d.Quantity = c.Quantity
-- Optionally add this index if you deem the tradeoff to be worthwhile
CREATE INDEX IX_CommerceState
    ON CommerceState (ItemId, [Timestamp])
    INCLUDE (Listings, UnitPrice, Quantity)

Context

StackExchange Database Administrators Q#118903, answer score: 4

Revisions (0)

No revisions yet.