patternsqlMinor
Aggregate data from multiple rows into single row
Viewed 0 times
rowsintosinglemultiplefromdatarowaggregate
Problem
In my table I have rows, and each row has some data and its version. When column is not-null, it overrides an old value otherwise nothing happens.
Pseudocode (in C#)
Table is large enough (several tens of columns, several hundreds of thousands rows), so it's desired to do it in single table scan (if possible).
I appreciate any help and any possibility to achieve it without any bias, SQLCLR, DSQL and so on, everything is applicable if it works fast and makes its job.
Current script for a test table:
input:
Pseudocode (in C#)
var resultTable = new List();
foreach (var group in table.GroupBy(x => x.ID).Select(g => g.OrderBy(x => x.Priority)))
{
var row = Row.GetEmpty();
foreach (var anotherRow in group)
{
foreach (var column in group.Columns)
{
if (column.GetValue(anotherRow) != null)
column.SetValue(row, anotherRow);
}
}
resultTable.Add(row);
}Table is large enough (several tens of columns, several hundreds of thousands rows), so it's desired to do it in single table scan (if possible).
I appreciate any help and any possibility to achieve it without any bias, SQLCLR, DSQL and so on, everything is applicable if it works fast and makes its job.
Current script for a test table:
CREATE TABLE [dbo].[T](
[id] [int] NOT NULL,
[A] [int] NULL,
[B] [int] NULL,
[C] [int] NULL,
[Priority] [int] NOT NULL
) ON [PRIMARY]
GO
CREATE UNIQUE CLUSTERED INDEX [ClusteredIndex-20160204-122857] ON [dbo].[T]
(
[id] ASC,
[Priority] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
GO
INSERT [dbo].[T] ([id], [A], [B], [C], [Priority]) VALUES (1, NULL, 3, 4, 1)
GO
INSERT [dbo].[T] ([id], [A], [B], [C], [Priority]) VALUES (1, 5, 6, NULL, 2)
GO
INSERT [dbo].[T] ([id], [A], [B], [C], [Priority]) VALUES (1, 8, NULL, NULL, 3)
GO
INSERT [dbo].[T] ([id], [A], [B], [C], [Priority]) VALUES (2, 634, 346, 359, 1)
GO
INSERT [dbo].[T] ([id], [A], [B], [C], [Priority]) VALUES (2, 34, NULL, 734, 2)
GOinput:
id A B C Priority
1 NULL 3 4 1
1 5 6 NULL 2
1 8 NULL NULL 3
2 634 346 359 1
2 34 NULL 734 2Solution
Itzik Ben-Gan's blog post (link provided by Daniel Hutmacher) has some nice solutions to the problem for SQL Server 2012 and later:
The Last non
Here is one more that will work in older versions (even 2005) and can be easily adapted for more columns. I haven't tested for efficiency or compared to the other solutions but I'd expect it to work better if the table had a clustered index on
Test in SQLfiddle.
The
If however, we want the latest non-null values for a specific point in time (say for when
The Last non
NULL Puzzle. Here is one more that will work in older versions (even 2005) and can be easily adapted for more columns. I haven't tested for efficiency or compared to the other solutions but I'd expect it to work better if the table had a clustered index on
(id, priority). But performance may depend on many other things, like the distribution, the number of distinct id values, the number of versions (rows per id values), the number and types of columns, etc.:SELECT
d.id,
d1.col1,
d2.col2,
d3.col3
FROM
( SELECT id, @point_in_time AS priority
FROM tablename
GROUP BY id
) AS d
OUTER APPLY
( SELECT TOP (1) col1
FROM tablename
WHERE id = d.id
AND priority <= d.priority -- optional
AND col1 IS NOT NULL
ORDER BY priority DESC
) AS d1
OUTER APPLY
( SELECT TOP (1) col2
FROM tablename
WHERE id = d.id
AND priority <= d.priority -- optional
AND col2 IS NOT NULL
ORDER BY priority DESC
) AS d2
OUTER APPLY
( SELECT TOP (1) col3
FROM tablename
WHERE id = d.id
AND priority <= d.priority -- optional
AND col3 IS NOT NULL
ORDER BY priority DESC
) AS d3 ;Test in SQLfiddle.
The
@point_in_time above is a parameter, that can be removed if we want the latest non-null values across all the table. If however, we want the latest non-null values for a specific point in time (say for when
priority=55), we can replace that @point_in_time with 55 and have the latest non-null values up to that point in time (55).Code Snippets
SELECT
d.id,
d1.col1,
d2.col2,
d3.col3
FROM
( SELECT id, @point_in_time AS priority
FROM tablename
GROUP BY id
) AS d
OUTER APPLY
( SELECT TOP (1) col1
FROM tablename
WHERE id = d.id
AND priority <= d.priority -- optional
AND col1 IS NOT NULL
ORDER BY priority DESC
) AS d1
OUTER APPLY
( SELECT TOP (1) col2
FROM tablename
WHERE id = d.id
AND priority <= d.priority -- optional
AND col2 IS NOT NULL
ORDER BY priority DESC
) AS d2
OUTER APPLY
( SELECT TOP (1) col3
FROM tablename
WHERE id = d.id
AND priority <= d.priority -- optional
AND col3 IS NOT NULL
ORDER BY priority DESC
) AS d3 ;Context
StackExchange Database Administrators Q#128239, answer score: 3
Revisions (0)
No revisions yet.