HiveBrain v1.2.0
Get Started
← Back to all entries
patternsqlMinor

PostgreSQL equivalent to Oracle's ANY_VALUE(...) KEEP (DENSE_RANK FIRST/LAST ORDER BY ...)

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
postgresqllastorderequivalentkeepdense_rankany_valuefirstoracle

Problem

There's a technique in Oracle SQL that can be used to simplify aggregation queries:

Aggregate on a particular column, but get information from a different column, using a simple calculated column in the SELECT list.

--Oracle
--For a given country, what city has the highest population? (where the country has more than one city)
--Include the city name as a column.
select
    country,
    count(*),
    max(population),
    any_value(city) keep (dense_rank first order by population desc)   -- 1


db<>fiddle

As shown above, the following column can bring in the city name, even though the city name isn't in the GROUP BY:

any_value(city) keep (dense_rank first order by population desc)


There are a number of ways to achieve that kind of thing using SQL. I'm looking for a solution in PostgreSQL that lets me do it in a calculated column -- all within a single SELECT query (no subqueries, joins, WITH, etc.).

Question: Is there equivalent functionality to Oracle's ANY_VALUE(...) KEEP (DENSE_RANK FIRST/LAST ORDER BY ...) in PostgreSQL?

Related:

  • YouTube: The KEEP clause will KEEP your SQL queries SIMPLE (Oracle)



  • Stack Overflow: Explanation of KEEP in Oracle FIRST/LAST



  • db-oriented.com: ANY_VALUE and FIRST/LAST (KEEP)



  • DBA Stack Exchange: How to request an enhancement to PostgreSQL



Edit:

I changed MAX() to ANY_VALUE(), since I think ANY_VALUE() is easier to read.

Ties can be broken by adding , city desc to the order by, making it deterministic:

any_value(city) keep (dense_rank first order by population desc, city desc)

Solution

first_last_agg

There is the additional module first_last_agg to make this simple.
It's available from apt.postgresql.org (among others). Read instructions in the Postgres Wiki. Install it once per database with:

CREATE EXTENSION first_last_agg;


It provides two aggregate functions: first() and last().

Most hosted services don't provide the module. If you can't install it, the next best option is to create aggregate functions yourself like demonstrated in the Postgres Wiki, and also in my fiddle below. Or here:

  • Fetch a row that contains the set of last non-NULL values for each column



But the C implementation of the module first_last_agg is faster.

Then:

SELECT country
     , count(*) AS ct_cities
     , max(population) AS highest_population
     , last(city ORDER BY population, city) AS biggest_city  -- !
FROM   cities
GROUP  BY country
HAVING count(*) > 1;


fiddle

Same as:

, first(city ORDER BY population DESC NULLS LAST, city DESC NULLS LAST) AS biggest_city


Why NULLS LAST? See:

  • Sort by column ASC, but NULL values first?



Either reports the city with the highest population and the name sorting alphabetically last - like your original.
Without additional module

If you cannot install additional modules. And you still insist on:

all within a single SELECT query (no subqueries, joins, WITH, etc.).

DISTINCT ON in combination with window functions does it, too:

SELECT DISTINCT ON (country)
       country
     , count(*) OVER (PARTITION BY country) AS ct_cities
     , population AS highest_population
     , city AS biggest_city
FROM   cities c
ORDER  BY country, population DESC NULLS LAST, city DESC NULLS LAST;


See:

  • Calculating follower growth over time for each influencer



To also eliminate countries with only a single entry:

SELECT DISTINCT ON (country)
       country
     , count(*) OVER (PARTITION BY country) AS ct_cities
     , population AS highest_population
     , city AS biggest_city
FROM   cities c
WHERE  EXISTS (SELECT FROM cities c1 WHERE c1.country = c.country AND c1.ctid <> c.ctid)
ORDER  BY country, population DESC NULLS LAST, city DESC NULLS LAST;


Use your PK instead of ctid if you have one. See:

  • Is the system column "ctid" legitimate for identifying rows to delete?



If a subquery is allowed, then rather:

SELECT *
FROM  (
   SELECT DISTINCT ON (country)
          country
        , count(*) OVER (PARTITION BY country) AS ct_cities
        , population AS highest_population
        , city AS biggest_city
   FROM   cities c
   ORDER  BY country, population DESC NULLS LAST, city DESC NULLS LAST
   ) sub
WHERE  ct_cities > 1;


(array_agg(city ORDER BY population DESC NULLS LAST))[1] typically performs poorly for more than a few rows per country. Aggregating big arrays, only to get the first element is expensive. See performance benchmark:

  • Select first row in each GROUP BY group?

Code Snippets

CREATE EXTENSION first_last_agg;
SELECT country
     , count(*) AS ct_cities
     , max(population) AS highest_population
     , last(city ORDER BY population, city) AS biggest_city  -- !
FROM   cities
GROUP  BY country
HAVING count(*) > 1;
, first(city ORDER BY population DESC NULLS LAST, city DESC NULLS LAST) AS biggest_city
SELECT DISTINCT ON (country)
       country
     , count(*) OVER (PARTITION BY country) AS ct_cities
     , population AS highest_population
     , city AS biggest_city
FROM   cities c
ORDER  BY country, population DESC NULLS LAST, city DESC NULLS LAST;
SELECT DISTINCT ON (country)
       country
     , count(*) OVER (PARTITION BY country) AS ct_cities
     , population AS highest_population
     , city AS biggest_city
FROM   cities c
WHERE  EXISTS (SELECT FROM cities c1 WHERE c1.country = c.country AND c1.ctid <> c.ctid)
ORDER  BY country, population DESC NULLS LAST, city DESC NULLS LAST;

Context

StackExchange Database Administrators Q#324426, answer score: 7

Revisions (0)

No revisions yet.