HiveBrain v1.2.0
Get Started
← Back to all entries
patternsqlMinor

Query to obtain weighted percentile

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
weightedqueryobtainpercentile

Problem

Trying generate SQL to compute a weighted continuous value at a given set of percentile values (the 25%,50%,and 75% levels used below, but solution should allow for an arbitrary parameter level). In other words, want to find the interpolated "raw" values, weighted by "cnt", at each of the 25%, 50% and 75% cumulative percentiles for the test data in "source" table below.

NB: cnt represents the number of times that the raw value occurred during the sampling period, and the expected output would weight the raw value by cnt to arrive at the percentile (akin to quantile/ median and similar statistics)

Test data: (Table: source)

|  site  |  dateval   |  raw  |   cnt   |
+--------+------------+-------+---------+
|   A    | 2019-01-05 |   45  |      14 |
|   A    | 2019-01-05 |   52  |     178 |
|   A    | 2019-01-05 |   45  |       9 |
|   A    | 2019-01-05 |   37  |      75 |
|   A    | 2019-01-05 |   23  |      98 |
|   A    | 2019-01-05 |   78  |     102 |
|   A    | 2019-01-05 |   56  |       9 |
|   A    | 2019-01-05 |   17  |      54 |
|   A    | 2019-01-05 |   56  |       8 |
|   A    | 2019-01-06 |   33  |      35 |
|   A    | 2019-01-06 |   67  |      45 |
|   A    | 2019-01-06 |   65  |      93 |
|   A    | 2019-01-06 |   89  |     113 |
|   A    | 2019-01-06 |   52  |      64 |
|   A    | 2019-01-06 |  101  |      12 |
|   B    | 2019-01-05 |    5  |      25 |
|   B    | 2019-01-05 |   16  |      48 |
|   B    | 2019-01-05 |   12  |     107 |
|   B    | 2019-01-05 |   25  |      78 |
|   B    | 2019-01-05 |   44  |      53 |
|   B    | 2019-01-05 |    8  |      12 |
|   B    | 2019-01-05 |   31  |      32 |
|   B    | 2019-01-06 |   34  |      87 |
|   B    | 2019-01-06 |   18  |      35 |
|   B    | 2019-01-06 |   51  |      17 |
|   B    | 2019-01-06 |   22  |      23 |
|   B    | 2019-01-06 |   14  |      52 |
|   B    | 2019-01-06 |    6  |      34 |
+--------+------------+-------+---------+


Expected output (rounded to nearest 1/100th):

Solution

Postgres has Ordered-Set Aggregate Functions for your purpose.

The special difficulty: you want rows "weighted" by cnt. If that's supposed to mean that each row represents cnt identical rows, you can multiply input rows by joining to generate_series(1, cnt):

SELECT site, dateval
     , percentile_cont('{0,.25,.5,.75,1}'::float8[]) WITHIN GROUP (ORDER BY raw)
FROM   source s, generate_series(1, s.cnt)
GROUP  BY 1, 2;


db<>fiddle here

But results differ from your expected output (except for the 0 and 100 percentile). So you are "weighting" differently ...

Aside, your original query can be simplified to this equivalent:

SELECT site, dateval, raw, sum(cnt) AS sumcnt
     , sum(sum(cnt)) OVER w AS cumsumcnt
     , sum(sum(cnt)) OVER w / sum(sum(cnt)) OVER (PARTITION BY site, dateval) AS percentile 
FROM   source
GROUP  BY site, dateval, raw
WINDOW w AS (PARTITION BY site, dateval ORDER BY raw);


You can run a window function over the result of an aggregate function in the same SELECT (but not vice versa). See:

  • Need separate columns that calculate percentage



I added a demo to the fiddle above.

But neither explains the odd numbers in your "expected results". Those strike me as incorrect, no matter how you interpolate. Example: 22.07 in the first line for p25 does not seem to make sense - the value 23 occupies all rows up to the 27.7879 percentile after factoring in cnt according to your own query ...

Code Snippets

SELECT site, dateval
     , percentile_cont('{0,.25,.5,.75,1}'::float8[]) WITHIN GROUP (ORDER BY raw)
FROM   source s, generate_series(1, s.cnt)
GROUP  BY 1, 2;
SELECT site, dateval, raw, sum(cnt) AS sumcnt
     , sum(sum(cnt)) OVER w AS cumsumcnt
     , sum(sum(cnt)) OVER w / sum(sum(cnt)) OVER (PARTITION BY site, dateval) AS percentile 
FROM   source
GROUP  BY site, dateval, raw
WINDOW w AS (PARTITION BY site, dateval ORDER BY raw);

Context

StackExchange Database Administrators Q#227543, answer score: 3

Revisions (0)

No revisions yet.