snippetsqlMinor
How to (optimally) get random sample of (a.id, b.id) pairs from two tables (a, b)?
Viewed 0 times
randompairstablessampletwogethowfromoptimally
Problem
Let's assume I have to very simple tables
with some data on them
In reality, table a might have about 100_000 rows, and table b about 50_000. In practive, also, the id sequence might have gaps (in the order of a few %). Thus, the cartesian product a x b has a cardinality of billions.
I want to take a random sample of 1_000 sorted pairs (a.id, b.id).
I can use something like the following query:
... but it would become extremely inefficient as soon as the number of rows on a or b grows (because of the growth of the CROSS JOIN).
Is there any way of doing something equivalent to this in an optimal manner? That is, is there a practical way of getting a random sample of rows from the
NOTE: There's no limitation on the fact that a.id or b.id can be repeated. Although the pair (a.id, b.id) cannot.
If I were trying to program this in an imperative language, I'd most probably would use a loop and be doing something like the following pseudo-code (and then, have it checked by a statistitian, to make sure I really take a sample where all the pairs have the same probability of being chosen):
```
start with a result set equal to {} (empty set)
while size of result set rand_id_a
Pick the id value from a random row from table b -> rand_id_b
If (rand_id_a, rand_id_b) not in result set
CREATE TABLE a(id integer PRIMARY KEY,
t timestamp default now(),
sensor_readings real[]);
CREATE TABLE b(id integer PRIMARY KEY,
t timestamp default now(),
sensor_readings real[]);with some data on them
INSERT INTO a(id) SELECT generate_series( 1, 100);
INSERT INTO b(id) SELECT generate_series(10001, 10100);In reality, table a might have about 100_000 rows, and table b about 50_000. In practive, also, the id sequence might have gaps (in the order of a few %). Thus, the cartesian product a x b has a cardinality of billions.
I want to take a random sample of 1_000 sorted pairs (a.id, b.id).
I can use something like the following query:
SELECT
*
FROM
(
SELECT
*
FROM
(
SELECT
a.id AS a_id, b.id AS b_id
FROM
a CROSS JOIN b
ORDER BY
random()
) AS s0
LIMIT
1000
) AS s1
ORDER BY
a_id, b_id ;... but it would become extremely inefficient as soon as the number of rows on a or b grows (because of the growth of the CROSS JOIN).
Is there any way of doing something equivalent to this in an optimal manner? That is, is there a practical way of getting a random sample of rows from the
a x b relation without actually having to instantiate it.NOTE: There's no limitation on the fact that a.id or b.id can be repeated. Although the pair (a.id, b.id) cannot.
If I were trying to program this in an imperative language, I'd most probably would use a loop and be doing something like the following pseudo-code (and then, have it checked by a statistitian, to make sure I really take a sample where all the pairs have the same probability of being chosen):
```
start with a result set equal to {} (empty set)
while size of result set rand_id_a
Pick the id value from a random row from table b -> rand_id_b
If (rand_id_a, rand_id_b) not in result set
Solution
The best solution depends on the exact definition of your setup. For the example setup it's trivial:
The only interesting question: how to fold dupes efficiently. The solution: let Postgres decide. Simply use
We don't even need to involve the tables at all. Super fast.
Note that
random value in the range 0.0
- Serial integer columns without gaps.
SELECT DISTINCT
1 + trunc(random() * 100)::int AS a_id
, 10001 + trunc(random() * 100)::int AS b_id
FROM generate_series(1, 1100) g -- enough excess to make up for possible dupes
LIMIT 1000; -- only take 1000The only interesting question: how to fold dupes efficiently. The solution: let Postgres decide. Simply use
DISTINCT.We don't even need to involve the tables at all. Super fast.
Note that
random() generates (per documentation):random value in the range 0.0
Code Snippets
SELECT DISTINCT
1 + trunc(random() * 100)::int AS a_id
, 10001 + trunc(random() * 100)::int AS b_id
FROM generate_series(1, 1100) g -- enough excess to make up for possible dupes
LIMIT 1000; -- only take 1000CREATE TABLE a(a_id integer PRIMARY KEY, a text);
CREATE TABLE b(b_id integer PRIMARY KEY, b text);
INSERT INTO a(a_id, a)
SELECT g, 't' || g FROM generate_series( 1, 100) g;
INSERT INTO b(b_id, b)
SELECT g, 't' || g FROM generate_series(10001, 10100) g;SELECT a.a_id, a.a, b.b_id, b.b
FROM (
SELECT DISTINCT
1 + trunc(random() * 100)::int AS a_id -- cover *whole* key space
, 10001 + trunc(random() * 100)::int AS b_id -- maybe add reserve for new rows
FROM generate_series(1, 1100) g
LIMIT 1000
) ra
JOIN a USING (a_id)
JOIN b USING (b_id);SELECT a.a_id, a.a, b.b_id, b.b
FROM (
SELECT DISTINCT
1 + trunc(random() * 100)::int AS a_id
, 10001 + trunc(random() * 100)::int AS b_id
FROM generate_series(1, 1100) g -- enough to cover dupes *and* gaps
) ra
JOIN a USING (a_id)
JOIN b USING (b_id)
LIMIT 1000; -- LIMIT moves to outer query to cover gapsContext
StackExchange Database Administrators Q#159213, answer score: 4
Revisions (0)
No revisions yet.