snippetsqlModerate
How to get the last not-null value in an ordered column of a huge table?
Viewed 0 times
lastthecolumntablenullvaluehugegethownot
Problem
I have the following input:
I expect the following result:
The trivial solution would be join the tables with a
However, the trivial execution of this code would create internally the square of the count of the rows of the input table ( O(n^2) ). I expected t-sql to optimize it out - on a block/record level, the task to do is very easy and linear, essentially a for loop ( O(n) ).
However, on my experiments, the latest MS SQL 2016 can't optimize this query correctly, making this query impossible to execute for a large input table.
Furthermore, the query has to run quickly, making a similarly easy (but very different) cursor-based solution infeasible.
Using some memory-backed temporary table could be a good compromise, but I am not sure if it can be run significantly quicker, considered that my example query using subqueries didn't work.
I am also thinking on to dig out some windowing function from the t-sql docs, what could be tricked to do what I want. For example, cumulative sum is doing some very similar, but I couldn't trick it to give the latest non-null element, and not the sum of the elements before.
The ideal solution would be a quick query without procedural code or temporary tables. Alternatively, also a solution with temporary tables is okay, but iterating the table procedurally is not.
id | value
----+-------
1 | 136
2 | NULL
3 | 650
4 | NULL
5 | NULL
6 | NULL
7 | 954
8 | NULL
9 | 104
10 | NULLI expect the following result:
id | value
----+-------
1 | 136
2 | 136
3 | 650
4 | 650
5 | 650
6 | 650
7 | 954
8 | 954
9 | 104
10 | 104The trivial solution would be join the tables with a
< relation, and then selecting the MAX value in a GROUP BY:WITH tmp AS (
SELECT t2.id, MAX(t1.id) AS lastKnownId
FROM t t1, t t2
WHERE
t1.value IS NOT NULL
AND
t2.id >= t1.id
GROUP BY t2.id
)
SELECT
tmp.id, t.value
FROM t, tmp
WHERE t.id = tmp.lastKnownId;However, the trivial execution of this code would create internally the square of the count of the rows of the input table ( O(n^2) ). I expected t-sql to optimize it out - on a block/record level, the task to do is very easy and linear, essentially a for loop ( O(n) ).
However, on my experiments, the latest MS SQL 2016 can't optimize this query correctly, making this query impossible to execute for a large input table.
Furthermore, the query has to run quickly, making a similarly easy (but very different) cursor-based solution infeasible.
Using some memory-backed temporary table could be a good compromise, but I am not sure if it can be run significantly quicker, considered that my example query using subqueries didn't work.
I am also thinking on to dig out some windowing function from the t-sql docs, what could be tricked to do what I want. For example, cumulative sum is doing some very similar, but I couldn't trick it to give the latest non-null element, and not the sum of the elements before.
The ideal solution would be a quick query without procedural code or temporary tables. Alternatively, also a solution with temporary tables is okay, but iterating the table procedurally is not.
Solution
A common solution to this type of problem is given by Itzik Ben-Gan in his article The Last non NULL Puzzle:
Demo: db<>fiddle
In SQL Server 2022, the query can be simplified to:
Demo: db<>fiddle
DROP TABLE IF EXISTS dbo.Example;
CREATE TABLE dbo.Example
(
id integer PRIMARY KEY,
val integer NULL
);
INSERT dbo.Example
(id, val)
VALUES
(1, 136),
(2, NULL),
(3, 650),
(4, NULL),
(5, NULL),
(6, NULL),
(7, 954),
(8, NULL),
(9, 104),
(10, NULL);
SELECT
E.id,
E.val,
lastval =
CAST(
SUBSTRING(
MAX(CAST(E.id AS binary(4)) + CAST(E.val AS binary(4))) OVER (
ORDER BY E.id
ROWS UNBOUNDED PRECEDING),
5, 4)
AS integer)
FROM dbo.Example AS E
ORDER BY
E.id;Demo: db<>fiddle
In SQL Server 2022, the query can be simplified to:
SELECT
E.id,
E.val,
lastval =
LAST_VALUE(E.val) IGNORE NULLS OVER (
ORDER BY E.id ROWS UNBOUNDED PRECEDING)
FROM dbo.Example AS E
ORDER BY
E.id;Demo: db<>fiddle
Code Snippets
DROP TABLE IF EXISTS dbo.Example;
CREATE TABLE dbo.Example
(
id integer PRIMARY KEY,
val integer NULL
);
INSERT dbo.Example
(id, val)
VALUES
(1, 136),
(2, NULL),
(3, 650),
(4, NULL),
(5, NULL),
(6, NULL),
(7, 954),
(8, NULL),
(9, 104),
(10, NULL);
SELECT
E.id,
E.val,
lastval =
CAST(
SUBSTRING(
MAX(CAST(E.id AS binary(4)) + CAST(E.val AS binary(4))) OVER (
ORDER BY E.id
ROWS UNBOUNDED PRECEDING),
5, 4)
AS integer)
FROM dbo.Example AS E
ORDER BY
E.id;SELECT
E.id,
E.val,
lastval =
LAST_VALUE(E.val) IGNORE NULLS OVER (
ORDER BY E.id ROWS UNBOUNDED PRECEDING)
FROM dbo.Example AS E
ORDER BY
E.id;Context
StackExchange Database Administrators Q#233610, answer score: 13
Revisions (0)
No revisions yet.