patternsqlModerate
Select data divided in groups evenly distributed by value
Viewed 0 times
groupsvaluedivideddistributedselectdataevenly
Problem
I would like to select into 4 groups the data from a table having the sum of a values in the groups as evenly distributed possible. I am sure that I am not explaining it clear enough so I will try to give an example.
Here I use NTILE(4) to create the 4 groups:
In the above query and result, the other columns have been omitted for brevity.
So you can see the groups also as follows:
Notice that the Sum Totals of Time using NTile is not really balanced between the groups. A better distribution of the Time values would be for example:
Here the Sum Totals of Time is more evenly distributed over the 4 groups.
How can I perform this through a TSQL statements?
Furthermore I have to say that I am using SQL Server 2012.
If you have something that can help me out, let me know.
I wish you a nice day.
Stan
Here I use NTILE(4) to create the 4 groups:
SELECT Time, NTILE(4) OVER (ORDER BY Time DESC) AS N FROM TableX
Time - N
-------------
10 - 1
9 - 2
8 - 3
7 - 4
6 - 1
5 - 2
4 - 3
3 - 4
2 - 1
1 - 2In the above query and result, the other columns have been omitted for brevity.
So you can see the groups also as follows:
1 2 3 4
--- --- --- ---
10 9 8 7
6 5 4 3
2 1
--- --- --- ---
18 15 12 10 Sum Totals of TimeNotice that the Sum Totals of Time using NTile is not really balanced between the groups. A better distribution of the Time values would be for example:
1 2 3 4
--- --- --- ---
10 9 8 7
3 5 4 6
1 2
--- --- --- ---
14 14 14 13 Sum Totals of TimeHere the Sum Totals of Time is more evenly distributed over the 4 groups.
How can I perform this through a TSQL statements?
Furthermore I have to say that I am using SQL Server 2012.
If you have something that can help me out, let me know.
I wish you a nice day.
Stan
Solution
Here's a stab at an algorithm. It's not perfect, and depending on how much time you want to spend refining it, there are probably some further small gains to be made.
Let's assume you have a table of tasks to be performed by four queues. You know the amount of work associated with performing each task, and you want all four queues to get an almost equal amount of work to do, so all queues will complete at about the same time.
First off, I'd partition the tasks using a modulous, ordered by their size, from small to large.
The
For ease of use, I'm storing the
Now, we can perform a few calculations on this data:
The column
The idea is to identify the two best rows, one in a "positive" group (with too much work) and one in a "negative" group (with too little work). If we can swap groups on those two rows, we could reduce the absolute
Example:
With a grand total of 727, each group should have a score of about 182 for the distribution to be perfect. The difference between the group's score and 182 is what we're putting in the
As you can see now, in the best of worlds, we should move about 40 points worth of rows from group 1 to group 2 and about 24 points from group 3 to group 0.
Here's the code to identify those candidate rows:
I'm self-joining the common table expression that we created before,
The
Now, all we need to to is add an
TL;DR - here's the query
Here's the complete code:
Let's assume you have a table of tasks to be performed by four queues. You know the amount of work associated with performing each task, and you want all four queues to get an almost equal amount of work to do, so all queues will complete at about the same time.
First off, I'd partition the tasks using a modulous, ordered by their size, from small to large.
SELECT [time], ROW_NUMBER() OVER (ORDER BY [time])%4 AS grp, 0The
ROW_NUMBER() orders every row by size, then assigns a row number, starting at 1. This row number is assigned a "group" (the grp column) on a round-robin basis. First row is group 1, second row is group 2, then 3, the fourth gets group 0, and so on.time ROW_NUMBER() grp
---- ------------ ---
1 1 1
10 2 2
12 3 3
15 4 0
19 5 1
22 6 2
...For ease of use, I'm storing the
time and grp columns in a table variable called @work.Now, we can perform a few calculations on this data:
WITH cte AS (
SELECT *, SUM([time]) OVER (PARTITION BY grp)
-SUM([time]) OVER (PARTITION BY (SELECT NULL))/4 AS _grpoffset
FROM @work)
...The column
_grpoffset is how much the total time per grp differs from the "ideal" average. If the total time of the all tasks is 1000 and there are four groups, there should ideally be a total of 250 in each group. If a group contains a total of 268, that group's _grpoffset=18.The idea is to identify the two best rows, one in a "positive" group (with too much work) and one in a "negative" group (with too little work). If we can swap groups on those two rows, we could reduce the absolute
_grpoffset of both groups.Example:
time grp total _grpoffset
---- --- ----- ----------
3 1 222 40
46 1 222 40
73 1 222 40
100 1 222 40
6 2 134 -48
52 2 134 -48
76 2 134 -48
11 3 163 -21
66 3 163 -21
86 3 163 -21
45 0 208 24
71 0 208 24
92 0 208 24
----
=727With a grand total of 727, each group should have a score of about 182 for the distribution to be perfect. The difference between the group's score and 182 is what we're putting in the
_grpoffset column.As you can see now, in the best of worlds, we should move about 40 points worth of rows from group 1 to group 2 and about 24 points from group 3 to group 0.
Here's the code to identify those candidate rows:
SELECT TOP 1 pos._row AS _pos_row, pos.grp AS _pos_grp,
neg._row AS _neg_row, neg.grp AS _neg_grp
FROM cte AS pos
INNER JOIN cte AS neg ON
pos._grpoffset>0 AND
neg._grpoffset<0 AND
--- To prevent infinite recursion:
pos.moved<4 AND
neg.moved<4
WHERE --- must improve positive side's offset:
ABS(pos._grpoffset-pos.[time]+neg.[time])<=pos._grpoffset AND
--- must improve negative side's offset:
ABS(neg._grpoffset-neg.[time]+pos.[time])<=ABS(neg._grpoffset)
--- Largest changes first:
ORDER BY ABS(pos.[time]-neg.[time]) DESC
) AS x ON w._row IN (x._pos_row, x._neg_row);I'm self-joining the common table expression that we created before,
cte: On one side, groups with a positive _grpoffset, on the other side groups with negative ones. To further filter out which rows are supposed to match each other, the swap of the positive and negative sides' rows must improve _grpoffset, i.e. get it closer to 0.The
TOP 1 and ORDER BY selects the "best" match to swap first.Now, all we need to to is add an
UPDATE, and loop it until there's no more optimization to be found.TL;DR - here's the query
Here's the complete code:
DECLARE @work TABLE (
_row int IDENTITY(1, 1) NOT NULL,
[time] int NOT NULL,
grp int NOT NULL,
moved tinyint NOT NULL,
PRIMARY KEY CLUSTERED ([time], _row)
);
WITH cte AS (
SELECT 0 AS n, CAST(1+100*RAND(CHECKSUM(NEWID())) AS int) AS [time]
UNION ALL
SELECT n+1, CAST(1+100*RAND(CHECKSUM(NEWID())) AS int) AS [time]
FROM cte WHERE n0 AND
neg._grpoffset<0 AND
--- To prevent infinite recursion:
pos.moved<4 AND
neg.moved<4
WHERE --- must improve positive side's offset:
ABS(pos._grpoffset-pos.[time]+neg.[time])<=pos._grpoffset AND
--- must improve negative side's offset:
ABS(neg._grpoffset-neg.[time]+pos.[time])<=ABS(neg._grpoffset)
--- Largest changes first:
ORDER BY ABS(pos.[time]-neg.[time]) DESC
) AS x ON w._row IN (x._pos_row, x._neg_row);Code Snippets
SELECT [time], ROW_NUMBER() OVER (ORDER BY [time])%4 AS grp, 0time ROW_NUMBER() grp
---- ------------ ---
1 1 1
10 2 2
12 3 3
15 4 0
19 5 1
22 6 2
...WITH cte AS (
SELECT *, SUM([time]) OVER (PARTITION BY grp)
-SUM([time]) OVER (PARTITION BY (SELECT NULL))/4 AS _grpoffset
FROM @work)
...time grp total _grpoffset
---- --- ----- ----------
3 1 222 40
46 1 222 40
73 1 222 40
100 1 222 40
6 2 134 -48
52 2 134 -48
76 2 134 -48
11 3 163 -21
66 3 163 -21
86 3 163 -21
45 0 208 24
71 0 208 24
92 0 208 24
----
=727SELECT TOP 1 pos._row AS _pos_row, pos.grp AS _pos_grp,
neg._row AS _neg_row, neg.grp AS _neg_grp
FROM cte AS pos
INNER JOIN cte AS neg ON
pos._grpoffset>0 AND
neg._grpoffset<0 AND
--- To prevent infinite recursion:
pos.moved<4 AND
neg.moved<4
WHERE --- must improve positive side's offset:
ABS(pos._grpoffset-pos.[time]+neg.[time])<=pos._grpoffset AND
--- must improve negative side's offset:
ABS(neg._grpoffset-neg.[time]+pos.[time])<=ABS(neg._grpoffset)
--- Largest changes first:
ORDER BY ABS(pos.[time]-neg.[time]) DESC
) AS x ON w._row IN (x._pos_row, x._neg_row);Context
StackExchange Database Administrators Q#127920, answer score: 16
Revisions (0)
No revisions yet.