patternsqlModerate
Solution to assigning unique values to rows with finite collaboration distance
Viewed 0 times
uniquerowsfinitewithcollaborationdistancevaluessolutionassigning
Problem
I have a table that can be created and populated with the following code:
For all rows that have a finite collaboration distance based on
A correct result set that meets what I'm asking for can be generated with the following query:
To better aid what I'm asking, I'll explain why
I should add that the solution needs to be generic. The above table and result set was merely an example.
Addendum
I removed the requirement for the solution to be non-iterative. While I would prefer such a solution, I believe it's an unreasonable constraint. Unfortunat
CREATE TABLE dbo.Example(GroupKey int NOT NULL, RecordKey varchar(12) NOT NULL);
ALTER TABLE dbo.Example
ADD CONSTRAINT iExample PRIMARY KEY CLUSTERED(GroupKey ASC, RecordKey ASC);
INSERT INTO dbo.Example(GroupKey, RecordKey)
VALUES (1, 'Archimedes'), (1, 'Newton'), (1, 'Euler'), (2, 'Euler'), (2, 'Gauss'),
(3, 'Gauss'), (3, 'Poincaré'), (4, 'Ramanujan'), (5, 'Neumann'),
(5, 'Grothendieck'), (6, 'Grothendieck'), (6, 'Tao');For all rows that have a finite collaboration distance based on
RecordKey with another row, I would like to assign a unique value—I don't care how or what data type the unique value is.A correct result set that meets what I'm asking for can be generated with the following query:
SELECT 1 AS SupergroupKey, GroupKey, RecordKey
FROM dbo.Example
WHERE GroupKey IN(1, 2, 3)
UNION ALL
SELECT 2 AS SupergroupKey, GroupKey, RecordKey
FROM dbo.Example
WHERE GroupKey = 4
UNION ALL
SELECT 3 AS SupergroupKey, GroupKey, RecordKey
FROM dbo.Example
WHERE GroupKey IN(5, 6)
ORDER BY SupergroupKey ASC, GroupKey ASC, RecordKey ASC;To better aid what I'm asking, I'll explain why
GroupKeys 1–3 have the same SupergroupKey:GroupKey1 contains theRecordKeyEuler which in turn is contained inGroupKey2; thusGroupKeys 1 and 2 must have the sameSupergroupKey.
- Because Gauss is contained in both
GroupKeys 2 and 3, they too must have the sameSupergroupKey. This leads toGroupKeys 1–3 having the sameSupergroupKey.
- Since
GroupKeys 1–3 don't share anyRecordKeys with the remainingGroupKeys, they are the only ones assigned aSupergroupKeyvalue of 1.
I should add that the solution needs to be generic. The above table and result set was merely an example.
Addendum
I removed the requirement for the solution to be non-iterative. While I would prefer such a solution, I believe it's an unreasonable constraint. Unfortunat
Solution
This problem is about following links between items. This puts it in the realm of graphs and graph processing. Specifically, the whole dataset forms a graph and we are looking for components of that graph. This can be illustrated by a plot of the sample data from the question.
The question says we can follow GroupKey or RecordKey to find other rows that share that value. So we can treat both as vertexes in a graph. The question goes on to explain how GroupKeys 1–3 have the same SupergroupKey. This can be seen as the cluster on the left joined by thin lines. The picture also shows the two other components (SupergroupKey) formed by the original data.
SQL Server has some graph processing ability built into T-SQL. At this time it is quite meagre, however, and not helpful with this problem. SQL Server also has the ability to call out to R and Python, and the rich & robust suite of packages available to them. One such is igraph. It is written for "fast handling of large graphs, with millions of vertices and edges(link)."
Using R and igraph I was able to process one million rows in 2 minutes 22 seconds in local testing1. This is how it compares against the current best solution:
When processing 1M rows, 1m40s were used to load & process the graph, and to update the table. 42s were required to populate an SSMS result table with the output.
Observation of Task Manager while 1M rows were processed suggest about 3GB of working memory were required. This was available on this system without paging.
I can confirm Ypercube's assessment of the recursive CTE approach. With a few hundred record keys it consumed 100% of CPU and all available RAM. Eventually tempdb grew to over 80GB and the SPID crashed.
I used Paul's table with the SupergroupKey column so there's a fair comparison between the solutions.
For some reason R objected to the accent on Poincaré. Changing it to a plain "e" allowed it to run. I didn't investigate since it's not germane to the problem at hand. I'm sure there's a solution.
Here is the code
This is what the R code does
-
-
-
-
I'm not an R expert. Likely this could be optimised.
Test Data
The OP's data was used for validation. For scale tests I used the following script.
```
drop table if exists Records;
drop table if exists Groups;
create table Groups(GroupKey int NOT NULL primary key);
create table Records(RecordKey varchar(12) NOT NULL primary key);
go
set nocount on;
-- Set @RecordCount to the number of distinct RecordKey values desired.
-- The number of rows in dbo.Example will be 8 * @RecordCount.
declare @RecordCount int = 1000000;
-- @Multiplier was determined by experiment.
-- It gives the OP's "8 RecordKeys per GroupKey and 4 GroupKeys per RecordKey"
-- and allows for clashes of the chosen random values.
declare @Multiplier numeric(4, 2) = 2.7;
-- The number of groups required to reproduce the OP's distribution.
declare @GroupCount int = FLOOR(@RecordCount * @Multiplier);
-- This is a poor man's numbers table.
insert Groups(GroupKey)
The question says we can follow GroupKey or RecordKey to find other rows that share that value. So we can treat both as vertexes in a graph. The question goes on to explain how GroupKeys 1–3 have the same SupergroupKey. This can be seen as the cluster on the left joined by thin lines. The picture also shows the two other components (SupergroupKey) formed by the original data.
SQL Server has some graph processing ability built into T-SQL. At this time it is quite meagre, however, and not helpful with this problem. SQL Server also has the ability to call out to R and Python, and the rich & robust suite of packages available to them. One such is igraph. It is written for "fast handling of large graphs, with millions of vertices and edges(link)."
Using R and igraph I was able to process one million rows in 2 minutes 22 seconds in local testing1. This is how it compares against the current best solution:
Record Keys Paul White R
------------ ---------- --------
Per question 15ms ~220ms
100 80ms ~270ms
1,000 250ms 430ms
10,000 1.4s 1.7s
100,000 14s 14s
1M 2m29 2m22s
1M n/a 1m40 process only, no display
The first column is the number of distinct RecordKey values. The number of rows
in the table will be 8 x this number.When processing 1M rows, 1m40s were used to load & process the graph, and to update the table. 42s were required to populate an SSMS result table with the output.
Observation of Task Manager while 1M rows were processed suggest about 3GB of working memory were required. This was available on this system without paging.
I can confirm Ypercube's assessment of the recursive CTE approach. With a few hundred record keys it consumed 100% of CPU and all available RAM. Eventually tempdb grew to over 80GB and the SPID crashed.
I used Paul's table with the SupergroupKey column so there's a fair comparison between the solutions.
For some reason R objected to the accent on Poincaré. Changing it to a plain "e" allowed it to run. I didn't investigate since it's not germane to the problem at hand. I'm sure there's a solution.
Here is the code
-- This captures the output from R so the base table can be updated.
drop table if exists #Results;
create table #Results
(
Component int not NULL,
Vertex varchar(12) not NULL primary key
);
truncate table #Results; -- facilitates re-execution
declare @Start time = sysdatetimeoffset(); -- for a 'total elapsed' calculation.
insert #Results(Component, Vertex)
exec sp_execute_external_script
@language = N'R',
@input_data_1 = N'select GroupKey, RecordKey from dbo.Example',
@script = N'
library(igraph)
df.g <- graph.data.frame(d = InputDataSet, directed = FALSE)
cpts <- components(df.g, mode = c("weak"))
OutputDataSet <- data.frame(cpts$membership)
OutputDataSet$VertexName <- V(df.g)$name
';
-- Write SuperGroupKey to the base table, as other solutions do
update e
set
SupergroupKey = r.Component
from dbo.Example as e
inner join #Results as r
on r.Vertex = e.RecordKey;
-- Return all rows, as other solutions do
select
e.SupergroupKey,
e.GroupKey,
e.RecordKey
from dbo.Example as e;
-- Calculate the elapsed
declare @End time = sysdatetimeoffset();
select Elapse_ms = DATEDIFF(MILLISECOND, @Start, @End);This is what the R code does
-
@input_data_1 is how SQL Server transfers data from a table to R code and translates it to an R dataframe called InputDataSet.-
library(igraph) imports the library into the R execution environment.-
df.g
-
cpts -
OutputDataSet
-
OutputDataSet$VertexName I'm not an R expert. Likely this could be optimised.
Test Data
The OP's data was used for validation. For scale tests I used the following script.
```
drop table if exists Records;
drop table if exists Groups;
create table Groups(GroupKey int NOT NULL primary key);
create table Records(RecordKey varchar(12) NOT NULL primary key);
go
set nocount on;
-- Set @RecordCount to the number of distinct RecordKey values desired.
-- The number of rows in dbo.Example will be 8 * @RecordCount.
declare @RecordCount int = 1000000;
-- @Multiplier was determined by experiment.
-- It gives the OP's "8 RecordKeys per GroupKey and 4 GroupKeys per RecordKey"
-- and allows for clashes of the chosen random values.
declare @Multiplier numeric(4, 2) = 2.7;
-- The number of groups required to reproduce the OP's distribution.
declare @GroupCount int = FLOOR(@RecordCount * @Multiplier);
-- This is a poor man's numbers table.
insert Groups(GroupKey)
Code Snippets
Record Keys Paul White R
------------ ---------- --------
Per question 15ms ~220ms
100 80ms ~270ms
1,000 250ms 430ms
10,000 1.4s 1.7s
100,000 14s 14s
1M 2m29 2m22s
1M n/a 1m40 process only, no display
The first column is the number of distinct RecordKey values. The number of rows
in the table will be 8 x this number.-- This captures the output from R so the base table can be updated.
drop table if exists #Results;
create table #Results
(
Component int not NULL,
Vertex varchar(12) not NULL primary key
);
truncate table #Results; -- facilitates re-execution
declare @Start time = sysdatetimeoffset(); -- for a 'total elapsed' calculation.
insert #Results(Component, Vertex)
exec sp_execute_external_script
@language = N'R',
@input_data_1 = N'select GroupKey, RecordKey from dbo.Example',
@script = N'
library(igraph)
df.g <- graph.data.frame(d = InputDataSet, directed = FALSE)
cpts <- components(df.g, mode = c("weak"))
OutputDataSet <- data.frame(cpts$membership)
OutputDataSet$VertexName <- V(df.g)$name
';
-- Write SuperGroupKey to the base table, as other solutions do
update e
set
SupergroupKey = r.Component
from dbo.Example as e
inner join #Results as r
on r.Vertex = e.RecordKey;
-- Return all rows, as other solutions do
select
e.SupergroupKey,
e.GroupKey,
e.RecordKey
from dbo.Example as e;
-- Calculate the elapsed
declare @End time = sysdatetimeoffset();
select Elapse_ms = DATEDIFF(MILLISECOND, @Start, @End);drop table if exists Records;
drop table if exists Groups;
create table Groups(GroupKey int NOT NULL primary key);
create table Records(RecordKey varchar(12) NOT NULL primary key);
go
set nocount on;
-- Set @RecordCount to the number of distinct RecordKey values desired.
-- The number of rows in dbo.Example will be 8 * @RecordCount.
declare @RecordCount int = 1000000;
-- @Multiplier was determined by experiment.
-- It gives the OP's "8 RecordKeys per GroupKey and 4 GroupKeys per RecordKey"
-- and allows for clashes of the chosen random values.
declare @Multiplier numeric(4, 2) = 2.7;
-- The number of groups required to reproduce the OP's distribution.
declare @GroupCount int = FLOOR(@RecordCount * @Multiplier);
-- This is a poor man's numbers table.
insert Groups(GroupKey)
select top(@GroupCount)
ROW_NUMBER() over (order by (select NULL))
from sys.objects as a
cross join sys.objects as b
--cross join sys.objects as c -- include if needed
declare @c int = 0
while @c < @RecordCount
begin
-- Can't use a set-based method since RAND() gives the same value for all rows.
-- There are better ways to do this, but it works well enough.
-- RecordKeys will be 10 letters, a-z.
insert Records(RecordKey)
select
CHAR(97 + (26*RAND())) +
CHAR(97 + (26*RAND())) +
CHAR(97 + (26*RAND())) +
CHAR(97 + (26*RAND())) +
CHAR(97 + (26*RAND())) +
CHAR(97 + (26*RAND())) +
CHAR(97 + (26*RAND())) +
CHAR(97 + (26*RAND())) +
CHAR(97 + (26*RAND())) +
CHAR(97 + (26*RAND()));
set @c += 1;
end
-- Process each RecordKey in alphabetical order.
-- For each choose 8 GroupKeys to pair with it.
declare @RecordKey varchar(12) = '';
declare @Groups table (GroupKey int not null);
truncate table dbo.Example;
select top(1) @RecordKey = RecordKey
from Records
where RecordKey > @RecordKey
order by RecordKey;
while @@ROWCOUNT > 0
begin
print @Recordkey;
delete @Groups;
insert @Groups(GroupKey)
select distinct C
from
(
-- Hard-code * from OP's statistics
select FLOOR(RAND() * @GroupCount)
union all
select FLOOR(RAND() * @GroupCount)
union all
select FLOOR(RAND() * @GroupCount)
union all
select FLOOR(RAND() * @GroupCount)
union all
select FLOOR(RAND() * @GroupCount)
union all
select FLOOR(RAND() * @GroupCount)
union all
select FLOOR(RAND() * @GroupCount)
union all
select FLOOR(RAND() * @GroupCount)
) as T(C);
insert dbo.Example(GroupKey, RecordKey)
select
GroupKey, @RecordKey
from @Groups;
select top(1) @RecordKey = RecordKey
from Records
where RecordKey > @RecordKey
order by RecordKey;
end
-- Rebuild the indexes to have a consistent environment
alter index iExample on dbo.Example rebuild partition = all
WITH (PAD_INDEX = OFF, STATISTICS_NOContext
StackExchange Database Administrators Q#229559, answer score: 10
Revisions (0)
No revisions yet.