HiveBrain v1.2.0
Get Started
← Back to all entries
patternsqlMinor

Sampling in PostgreSQL

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
postgresqlsamplingstackoverflow

Problem

I am looking for possible ways of random sampling in PostgreSQL. I found a couple of methods to do that with different advantages and disadvantages. The naive way to do that is:

select * from Table_Name
order by random()
limit 10;


Another faster method is:

select * from Table_Name
WHERE random() <= 0.01
order by random()
limit 10;


(Although that 0.01 depends on the table size and the sample size; this is just an example.)

In both of these queries a random number is generated for each row and sorted based on those random generated numbers. Then in the sorted numbers the first 10 are selected as the final result, so I think these should be sampling without replacement.

Now what I want to do is to somehow turn this sampling methods into sampling with replacement. How is that possible? Or is there any other random sampling method with replacement in PostgreSQL?

I have to say that I do have an idea how this might be possible but I don't know how to implement it in Postgres. Here is my idea:

If instead of generating one random value we generate S random values where S is the sample size, then order all of the random generated values, it will be sampling with replacement. (I don't know if I am right.)

At this point I don't mind about the performance of the query.

Solution

Corrective

If instead of generating one random value we generate S random values
where S is the sample size, then order all of the random generated
values, it will be sampling with replacement. (I don't know if I am right.)

You are wrong in two points.

  • Your original query does not generate "one random value". random() is a volatile function that is called for every row. Then all rows are sorted by the result, that's why this is hugely inefficient for big tables.



  • Accordingly, the approach "if we generate S random values ..." is going nowhere. You have to "granulate" random rows to matching "row numbers" - either existing IDs or surrogate numbers. That's what I am going to demonstrate.



I know you said:

At this point I don't mind about the performance of the query.

So @ypercube's answer is correct as usual. But my heart is bleeding. There are much faster ways.

Assuming the worst case that you don't know anything about the table.

Attach a sequential number to each row in arbitrary order, then you can select random rows reliably, multiple times if you want. The table is scanned once, which is still expensive for huge tables and small samples, but much better than scanning n times:

WITH t AS (SELECT *, row_number() OVER () AS rn FROM tbl)
SELECT * FROM (
    SELECT trunc(random() * (SELECT max(rn) FROM t))::int + 1 AS rn
    FROM   generate_series(1, 10) g
    ) r
JOIN   t USING (rn);


Each row has the same chance to be selected any number of times.

If you have an ID column with few or no gaps there are much faster options for big tables:

  • Best way to select random rows PostgreSQL

Code Snippets

WITH t AS (SELECT *, row_number() OVER () AS rn FROM tbl)
SELECT * FROM (
    SELECT trunc(random() * (SELECT max(rn) FROM t))::int + 1 AS rn
    FROM   generate_series(1, 10) g
    ) r
JOIN   t USING (rn);

Context

StackExchange Database Administrators Q#96610, answer score: 4

Revisions (0)

No revisions yet.