HiveBrain v1.2.0
Get Started
← Back to all entries
patternsqlMinor

Use array expression from subquery in ANY() condition

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
expressionfromconditionarrayanysubqueryuse

Problem

I have a relatively complicated query with a subquery fetching an array like so:

...
ARRAY(SELECT category_id FROM category_schedule_con con
      WHERE s.id = con.schedule_id ORDER BY category_id) AS cats,
...


and would like to use the array 'cats' in a later WHERE condition like

...
WHERE 4 = ANY(cats)
...


But this does not work as it states that the column 'cats' doesn't exist.
c/p'ing the subquery into the ANY clause yields the expected result.

Solution

Explanation

By definition in the SQL standard (which Postgres implements) you can reference output columns in ORDER BY or GROUP BY, but not in the WHERE or HAVING clause. The manual:


An output column's name can be used to refer to the column's value in
ORDER BY and GROUP BY clauses, but not in the WHERE or HAVING clauses;
there you must write out the expression instead.

Related:

  • PostgreSQL reusing computation result in select query



  • PostgreSQL Where count condition



  • GROUP BY + CASE statement



Obviously, your subquery is a correlated subquery expression in the SELECT list (which is hidden in the question due to over-simplification).

To avoid repeating lengthy / expensive expressions in the WHERE clause you can use a subquery in the FROM list. A correlated subquery in the SELECT list cannot be referenced by alias in the WHERE clause, that's just an output column like any other.

Chances are, your query can be much more efficient ...

Better query

Applying the above, this query would work:

SELECT s.*, con.cats
FROM   some_table s  -- guessing the missing query
JOIN  (
   SELECT schedule_id
        , array_agg(category_id ORDER BY category_id) AS cats
   FROM   category_schedule_con
   GROUP  BY 1
   ) con ON con.schedule_id = s.id
WHERE  3 = ANY(cats);


Now you can reference the column alias cats in the WHERE clause. But this query is terribly inefficient for big tables for multiple reasons. Most importantly, the predicate is not sargable.

Using GROUP BY and the aggregate function array_agg() instead of the ARRAY constructor because we are producing multiple arrays in a single query.

You can apply an ORDER BY clause to almost any aggregate function, but a sorted subquery typically performs better:

SELECT s.*, con.cats
FROM   some_table s
JOIN  (
   SELECT id, array_agg(category_id) AS cats
   FROM   (
      SELECT schedule_id AS id, category_id  -- alias id for convenience
      FROM   category_schedule_con
      ORDER  BY 1, 2  -- to get ordered list per schedule_id
      ) con
   GROUP  BY 1
   ) con USING (id)
WHERE  3 = ANY(cats);


More importantly, pull the predicate (the WHERE condition) down into the subquery to make it possible to use an index and exclude irrelevant rows early. Much faster with big tables:

SELECT s.*, con.cats
FROM   some_table s
JOIN  (
   SELECT id, array_agg(category_id) AS cats
   FROM  (
      SELECT schedule_id AS id, category_id
      FROM   category_schedule_con c
      WHERE  EXISTS (
         SELECT 1 FROM category_schedule_con
         WHERE  schedule_id = c.schedule_id
         AND    category_id = 3
         )
      ORDER  BY 1, 2
      ) con
   GROUP BY 1
   ) con USING (id);


Depending on data distribution, a LATERAL join (requires Postgres 9.3+) may be more efficient:

SELECT s.*, con.cats
FROM   some_table s
     , LATERAL (
   SELECT ARRAY (
      SELECT category_id
      FROM   category_schedule_con
      WHERE  schedule_id = s.id
      ORDER  BY 1
      ) AS cats
   ) con
WHERE  EXISTS (
   SELECT 1 FROM category_schedule_con
   WHERE  schedule_id = s.id
   AND    category_id = 3
   );


About LATERAL:

  • What is the difference between LATERAL and a subquery in PostgreSQL?



But it should be the fastest to invert the logic: start by finding schedule_id that have category_id = 3, self-join to category_schedule_con and aggregate before joining to the other table:

SELECT s.*, con.cats
FROM  (
   SELECT id, array_agg(c.category_id) AS cats
   FROM  (
      SELECT schedule_id AS id
      FROM   category_schedule_con
      WHERE  category_id = 3
      ) x
   JOIN   category_schedule_con c USING (id)
   GROUP  BY id
   ) con
JOIN  some_table s USING (id);


Index

Be sure to have a multicolumn index like:

CREATE INDEX category_schedule_con_foo_idx ON category_schedule_con
(schedule_id, category_id);


For the last query, we'd need the inverted sequence of columns:

CREATE INDEX category_schedule_con_bar_idx ON category_schedule_con
(category_id, schedule_id);


and another one with two columns switched back again. Two indexes on just (category_id) and (schedule_id) would work fast, too;

A PK or UNIQUE constraint on both columns serves as well.

Code Snippets

SELECT s.*, con.cats
FROM   some_table s  -- guessing the missing query
JOIN  (
   SELECT schedule_id
        , array_agg(category_id ORDER BY category_id) AS cats
   FROM   category_schedule_con
   GROUP  BY 1
   ) con ON con.schedule_id = s.id
WHERE  3 = ANY(cats);
SELECT s.*, con.cats
FROM   some_table s
JOIN  (
   SELECT id, array_agg(category_id) AS cats
   FROM   (
      SELECT schedule_id AS id, category_id  -- alias id for convenience
      FROM   category_schedule_con
      ORDER  BY 1, 2  -- to get ordered list per schedule_id
      ) con
   GROUP  BY 1
   ) con USING (id)
WHERE  3 = ANY(cats);
SELECT s.*, con.cats
FROM   some_table s
JOIN  (
   SELECT id, array_agg(category_id) AS cats
   FROM  (
      SELECT schedule_id AS id, category_id
      FROM   category_schedule_con c
      WHERE  EXISTS (
         SELECT 1 FROM category_schedule_con
         WHERE  schedule_id = c.schedule_id
         AND    category_id = 3
         )
      ORDER  BY 1, 2
      ) con
   GROUP BY 1
   ) con USING (id);
SELECT s.*, con.cats
FROM   some_table s
     , LATERAL (
   SELECT ARRAY (
      SELECT category_id
      FROM   category_schedule_con
      WHERE  schedule_id = s.id
      ORDER  BY 1
      ) AS cats
   ) con
WHERE  EXISTS (
   SELECT 1 FROM category_schedule_con
   WHERE  schedule_id = s.id
   AND    category_id = 3
   );
SELECT s.*, con.cats
FROM  (
   SELECT id, array_agg(c.category_id) AS cats
   FROM  (
      SELECT schedule_id AS id
      FROM   category_schedule_con
      WHERE  category_id = 3
      ) x
   JOIN   category_schedule_con c USING (id)
   GROUP  BY id
   ) con
JOIN  some_table s USING (id);

Context

StackExchange Database Administrators Q#110743, answer score: 6

Revisions (0)

No revisions yet.