snippetsqlModerate
How do I get the aggregate of a window function in Postgres?
Viewed 0 times
thepostgresfunctiongethowwindowaggregate
Problem
I have a table containing a two columns of permutations/combinations of integer arrays, and a third column containing a value, like so:
I want to find out the average and standard deviation for each permutation, as well as for each combination. I can do that with this query:
However, that query can get pretty slow when I have a lot of data, because the "foo" table (which in reality, consists of 14 partitions each with roughly 4 million rows) needs to be scanned twice.
Recently, I learned that Postgres supports "Window Functions", which is basically like a GROUP BY for a particular column. I modified my query to use these like so:
```
SELECT
perm,
combo,
avg( value ) as perm_average_value,
avg( avg( value ) ) over w_c
CREATE TABLE foo
(
perm integer[] NOT NULL,
combo integer[] NOT NULL,
value numeric NOT NULL DEFAULT 0
);
INSERT INTO foo
VALUES
( '{3,1,2}', '{1,2,3}', '1.1400' ),
( '{3,1,2}', '{1,2,3}', '0' ),
( '{3,1,2}', '{1,2,3}', '1.2680' ),
( '{3,1,2}', '{1,2,3}', '0' ),
( '{3,1,2}', '{1,2,3}', '1.2680' ),
( '{3,1,2}', '{1,2,3}', '0' ),
( '{3,1,2}', '{1,2,3}', '0' ),
( '{3,1,2}', '{1,2,3}', '1.2680' ),
( '{3,1,2}', '{1,2,3}', '0.9280' ),
( '{3,1,2}', '{1,2,3}', '0' ),
( '{3,1,2}', '{1,2,3}', '1.2680' ),
( '{3,1,2}', '{1,2,3}', '0' ),
( '{3,1,2}', '{1,2,3}', '0' ),
( '{3,1,2}', '{1,2,3}', '1.2680' ),
( '{3,1,2}', '{1,2,3}', '0' ),
( '{3,2,1}', '{1,2,3}', '0' ),
( '{3,2,1}', '{1,2,3}', '0.8000' )I want to find out the average and standard deviation for each permutation, as well as for each combination. I can do that with this query:
SELECT
f1.perm,
f2.combo,
f1.perm_average_value,
f2.combo_average_value,
f1.perm_stddev,
f2.combo_stddev,
f1.perm_count,
f2.combo_count
FROM
(
SELECT
perm,
combo,
avg( value ) AS perm_average_value,
stddev_pop( value ) AS perm_stddev,
count( * ) AS perm_count
FROM foo
GROUP BY perm, combo
) AS f1
JOIN
(
SELECT
combo,
avg( value ) AS combo_average_value,
stddev_pop( value ) AS combo_stddev,
count( * ) AS combo_count
FROM foo
GROUP BY combo
) AS f2 ON ( f1.combo = f2.combo );However, that query can get pretty slow when I have a lot of data, because the "foo" table (which in reality, consists of 14 partitions each with roughly 4 million rows) needs to be scanned twice.
Recently, I learned that Postgres supports "Window Functions", which is basically like a GROUP BY for a particular column. I modified my query to use these like so:
```
SELECT
perm,
combo,
avg( value ) as perm_average_value,
avg( avg( value ) ) over w_c
Solution
You can have window functions on the result of aggregate functions in a single query level.
This would all work nicely after a few modifications - except that it fails for the standard deviation on mathematical principal. The involved calculations are not linear, so you cannot simply combine standard deviations of sub-populations.
For
Since you need a weighted average. (The average of a group with 10 members weighs more than the average of a group with just 2 members!)
This works:
I am using two different windows here, and reduce the rows with
But I seriously doubt it will be faster than your original query. I am pretty sure it isn't.
Better performance with altered table layout
Arrays have an overhead of 24 bytes (slight variations depending on type). Also, you seem to have quite a few items per array and many repetitions. For a huge table like yours it would pay to normalize the schema. Example layout:
If you don't need referential integrity you can omit the foreign key constraints.
The connection to
This would result in a row size of 32 bytes (tuple header + padding: 24 bytes, 2 x int (8 byte), no padding), plus the unknown size of your
More on physical storage in this related answer on SO or here:
Configuring PostgreSQL for read performance
Anyway, that's only a fraction of what you have now and would make your query a lot faster by size alone. Grouping and sorting on simple integers is also a lot faster.
You would first aggregate in a subquery and then join to
This would all work nicely after a few modifications - except that it fails for the standard deviation on mathematical principal. The involved calculations are not linear, so you cannot simply combine standard deviations of sub-populations.
SELECT perm
,combo
,avg(value) AS perm_average_value
,sum(avg(value) * count(*)) OVER w_combo /
sum(count(*)) OVER w_combo AS combo_average_value
,stddev_pop(value) AS perm_stddev
,0 AS combo_stddev -- doesn't work!
,count(*) AS perm_count
,sum(count(*)) OVER w_combo AS combo_count
FROM foo
GROUP BY perm, combo
WINDOW w_combo AS (PARTITION BY combo);For
combo_average_value you would need this expressionsum(avg(value) * count(*)) OVER w_combo / sum(count(*)) OVER w_comboSince you need a weighted average. (The average of a group with 10 members weighs more than the average of a group with just 2 members!)
This works:
SELECT DISTINCT ON (perm, combo)
perm
,combo
,avg(value) OVER wpc AS perm_average_value
,avg(value) OVER wc AS combo_average_value
,stddev_pop(value) OVER wpc AS perm_stddev
,stddev_pop(value) OVER wc AS combo_stddev
,count(*) OVER wpc AS perm_count
,count(*) OVER wc AS combo_count
FROM foo
WINDOW wc AS (PARTITION BY combo)
,wpc AS (PARTITION BY perm, combo);I am using two different windows here, and reduce the rows with
DISTINCT which is applied even after window functions.But I seriously doubt it will be faster than your original query. I am pretty sure it isn't.
Better performance with altered table layout
Arrays have an overhead of 24 bytes (slight variations depending on type). Also, you seem to have quite a few items per array and many repetitions. For a huge table like yours it would pay to normalize the schema. Example layout:
CREATE TABLE combo (
combo_id serial PRIMARY KEY
,combo int[] NOT NULL
);
CREATE TABLE perm (
perm_id serial PRIMARY KEY
,perm int[] NOT NULL
);
CREATE TABLE value (
perm_id int REFERENCES perm(perm_id)
,combo_id int REFERENCES combo(combo_id)
,value numeric NOT NULL DEFAULT 0
);If you don't need referential integrity you can omit the foreign key constraints.
The connection to
combo_id could also be placed in the table perm, but in this scenario I would store it (slightly de-normalized) in value for better performance.This would result in a row size of 32 bytes (tuple header + padding: 24 bytes, 2 x int (8 byte), no padding), plus the unknown size of your
numeric column. (If you don't need extreme precision, a double precision or even a real column might do, too.)More on physical storage in this related answer on SO or here:
Configuring PostgreSQL for read performance
Anyway, that's only a fraction of what you have now and would make your query a lot faster by size alone. Grouping and sorting on simple integers is also a lot faster.
You would first aggregate in a subquery and then join to
perm and combo for best performance.Code Snippets
SELECT perm
,combo
,avg(value) AS perm_average_value
,sum(avg(value) * count(*)) OVER w_combo /
sum(count(*)) OVER w_combo AS combo_average_value
,stddev_pop(value) AS perm_stddev
,0 AS combo_stddev -- doesn't work!
,count(*) AS perm_count
,sum(count(*)) OVER w_combo AS combo_count
FROM foo
GROUP BY perm, combo
WINDOW w_combo AS (PARTITION BY combo);sum(avg(value) * count(*)) OVER w_combo / sum(count(*)) OVER w_comboSELECT DISTINCT ON (perm, combo)
perm
,combo
,avg(value) OVER wpc AS perm_average_value
,avg(value) OVER wc AS combo_average_value
,stddev_pop(value) OVER wpc AS perm_stddev
,stddev_pop(value) OVER wc AS combo_stddev
,count(*) OVER wpc AS perm_count
,count(*) OVER wc AS combo_count
FROM foo
WINDOW wc AS (PARTITION BY combo)
,wpc AS (PARTITION BY perm, combo);CREATE TABLE combo (
combo_id serial PRIMARY KEY
,combo int[] NOT NULL
);
CREATE TABLE perm (
perm_id serial PRIMARY KEY
,perm int[] NOT NULL
);
CREATE TABLE value (
perm_id int REFERENCES perm(perm_id)
,combo_id int REFERENCES combo(combo_id)
,value numeric NOT NULL DEFAULT 0
);Context
StackExchange Database Administrators Q#46297, answer score: 10
Revisions (0)
No revisions yet.