HiveBrain v1.2.0
Get Started
← Back to all entries
patternsqlMinor

Select rows for which at least one row per set meets a condition

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
rowsconditionperoneforleastwhichselectrowmeets

Problem

I have following table:

create table test (
  company_id integer not null, 
  client_id integer not null, 
  client_status text,
  unique (company_id, client_id)
);

insert into test values
  (1, 1, 'y'),    -- company1

  (2, 2, null),   -- company2

  (3, 3, 'n'),    -- company3

  (4, 4, 'y'),    -- company4
  (4, 5, 'n'),

  (5, 6, null),   -- company5
  (5, 7, 'n')
;


Basically, there are 5 different companies, each one has one or more clients and each client has status: 'y' or 'n' (might be null as well).

What I have to do is to select all pairs (company_id, client_id) for all companies for which there is at least one client whose status is not 'n' ('y' or null). So for the example data above, the output should be:

company_id;client_id
1;1
2;2
4;4
4;5
5;6
5;7


I tried something with window functions but I can't figure out how to compare the number of ALL clients with the number of clients with STATUS = 'n'.

select company_id,
count(*) over (partition by company_id) as all_clients_count
from test
-- where all_clients_count != ... ?


I figured out how to do this, but I am not sure if it's the right way:

select sub.company_id, unnest(sub.client_ids)
from (
  select company_id, array_agg(client_id) as client_ids
  from test
  group by company_id
  having count(*) != count( (case when client_status = 'n' then 1 else null end) )
) sub

Solution

Basically you are looking for the expression:

client_status IS DISTINCT FROM 'n'


The column client_status should really be data type boolean, not text, which would allow the simpler expression:

client_status IS NOT FALSE


The manual has details in the chapter Comparison Operators.

Assuming your actual table has a UNIQUE or PK constraint, we arrive at:

CREATE TABLE test (
  company_id    integer NOT NULL, 
  client_id     integer NOT NULL, 
  client_status boolean,
  PRIMARY KEY (company_id, client_id)
);


Queries

All of these do the same (what you asked), which is the fastest depends on data distribution:

SELECT company_id, client_id
FROM   test t
WHERE  EXISTS (
   SELECT FROM test
   WHERE  company_id = t.company_id
   AND    client_status IS NOT FALSE
   );


Or:

SELECT company_id, client_id
FROM   test t
JOIN  (
   SELECT company_id
   FROM   test t
   GROUP  BY 1
   HAVING bool_or(client_status IS NOT FALSE)
   ) c USING (company_id);


Or:

SELECT company_id, client_id
FROM   test t
JOIN  (
   SELECT DISTINCT company_id, client_status 
   FROM   test t
   ORDER  BY company_id, client_status DESC
   ) c USING (company_id)
WHERE  c.client_status IS NOT FALSE;


Boolean values sort FALSE -> TRUE -> NULL in ascending sort order. So FALSE comes last in descending order. If there's is any other value available, then that one's picked first ...

  • Sorting null values after all others, except special



The added PK is implemented with a useful index for these queries. If you want faster, yet, add a partial index for query 1:

CREATE INDEX test_special_idx ON test (company_id, client_id)
WHERE  client_status IS NOT FALSE;


You could use window functions, too, but that would be slower. Example with first_value():

SELECT company_id, client_id
FROM  (
   SELECT company_id, client_id
        , first_value(client_status) OVER (PARTITION BY company_id
                                           ORDER BY client_status DESC) AS stat
   FROM   test t
   ) sub
WHERE  stat IS NOT FALSE;


For lots of rows per company_id, one of these techniques may be faster, still:

  • Optimize GROUP BY query to retrieve latest record per user

Code Snippets

client_status IS DISTINCT FROM 'n'
client_status IS NOT FALSE
CREATE TABLE test (
  company_id    integer NOT NULL, 
  client_id     integer NOT NULL, 
  client_status boolean,
  PRIMARY KEY (company_id, client_id)
);
SELECT company_id, client_id
FROM   test t
WHERE  EXISTS (
   SELECT FROM test
   WHERE  company_id = t.company_id
   AND    client_status IS NOT FALSE
   );
SELECT company_id, client_id
FROM   test t
JOIN  (
   SELECT company_id
   FROM   test t
   GROUP  BY 1
   HAVING bool_or(client_status IS NOT FALSE)
   ) c USING (company_id);

Context

StackExchange Database Administrators Q#139980, answer score: 7

Revisions (0)

No revisions yet.