HiveBrain v1.2.0
Get Started
← Back to all entries
snippetsqlMinor

How to (optimally) get random sample of (a.id, b.id) pairs from two tables (a, b)?

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
randompairstablessampletwogethowfromoptimally

Problem

Let's assume I have to very simple tables

CREATE TABLE a(id integer PRIMARY KEY, 
       t timestamp default now(), 
       sensor_readings real[]);
CREATE TABLE b(id integer PRIMARY KEY, 
       t timestamp default now(), 
       sensor_readings real[]);


with some data on them

INSERT INTO a(id) SELECT generate_series(    1,   100);
INSERT INTO b(id) SELECT generate_series(10001, 10100);


In reality, table a might have about 100_000 rows, and table b about 50_000. In practive, also, the id sequence might have gaps (in the order of a few %). Thus, the cartesian product a x b has a cardinality of billions.

I want to take a random sample of 1_000 sorted pairs (a.id, b.id).
I can use something like the following query:

SELECT  
    *
FROM
(
    SELECT
        *
    FROM
        (
        SELECT 
            a.id AS a_id, b.id AS b_id
        FROM
            a CROSS JOIN b
        ORDER BY
            random()
        ) AS s0
    LIMIT
        1000 
) AS s1
ORDER BY
    a_id, b_id ;


... but it would become extremely inefficient as soon as the number of rows on a or b grows (because of the growth of the CROSS JOIN).

Is there any way of doing something equivalent to this in an optimal manner? That is, is there a practical way of getting a random sample of rows from the a x b relation without actually having to instantiate it.

NOTE: There's no limitation on the fact that a.id or b.id can be repeated. Although the pair (a.id, b.id) cannot.

If I were trying to program this in an imperative language, I'd most probably would use a loop and be doing something like the following pseudo-code (and then, have it checked by a statistitian, to make sure I really take a sample where all the pairs have the same probability of being chosen):

```
start with a result set equal to {} (empty set)
while size of result set rand_id_a
Pick the id value from a random row from table b -> rand_id_b
If (rand_id_a, rand_id_b) not in result set

Solution

The best solution depends on the exact definition of your setup. For the example setup it's trivial:

  • Serial integer columns without gaps.



SELECT DISTINCT
           1 + trunc(random() * 100)::int AS a_id
     , 10001 + trunc(random() * 100)::int AS b_id
FROM   generate_series(1, 1100) g  -- enough excess to make up for possible dupes
LIMIT  1000;  -- only take 1000


The only interesting question: how to fold dupes efficiently. The solution: let Postgres decide. Simply use DISTINCT.

We don't even need to involve the tables at all. Super fast.

Note that random() generates (per documentation):

random value in the range 0.0

Code Snippets

SELECT DISTINCT
           1 + trunc(random() * 100)::int AS a_id
     , 10001 + trunc(random() * 100)::int AS b_id
FROM   generate_series(1, 1100) g  -- enough excess to make up for possible dupes
LIMIT  1000;  -- only take 1000
CREATE TABLE a(a_id integer PRIMARY KEY, a text);
CREATE TABLE b(b_id integer PRIMARY KEY, b text);

INSERT INTO a(a_id, a)
SELECT g, 't' || g FROM generate_series(    1,   100) g;
INSERT INTO b(b_id, b)
SELECT g, 't' || g FROM generate_series(10001, 10100) g;
SELECT a.a_id, a.a, b.b_id, b.b
FROM  (
    SELECT DISTINCT
               1 + trunc(random() * 100)::int AS a_id  -- cover *whole* key space
         , 10001 + trunc(random() * 100)::int AS b_id  -- maybe add reserve for new rows
    FROM   generate_series(1, 1100) g
    LIMIT  1000
    ) ra
JOIN   a USING (a_id)
JOIN   b USING (b_id);
SELECT a.a_id, a.a, b.b_id, b.b
FROM  (
    SELECT DISTINCT
               1 + trunc(random() * 100)::int AS a_id
         , 10001 + trunc(random() * 100)::int AS b_id
    FROM   generate_series(1, 1100) g  -- enough to cover dupes *and* gaps
    ) ra
JOIN   a USING (a_id)
JOIN   b USING (b_id)
LIMIT  1000;  -- LIMIT moves to outer query to cover gaps

Context

StackExchange Database Administrators Q#159213, answer score: 4

Revisions (0)

No revisions yet.