HiveBrain v1.2.0
Get Started
← Back to all entries
patternsqlModerate

Get a truly RANDOM row from a PostgreSQL table quickly

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
randomfrompostgresqlquicklytrulygetrowtable

Problem

I always used to do:

SELECT column FROM table ORDER BY random() LIMIT 1;


For large tables, this was unbearably, impossibly slow, to the point of being useless in practice. That's why I started hunting for more efficient methods. People recommended:

SELECT column FROM table TABLESAMPLE BERNOULLI(1) LIMIT 1;


While fast, it also provides worthless randomness. It appears to always pick the same damn records, so this is also worthless.

I've also tried:

SELECT column FROM table TABLESAMPLE BERNOULLI(100) LIMIT 1;


It gives even worse randomness. It picks the same few records every time. This is completely worthless. I need actual randomness.

Why is it apparently so difficult to just pick a random record? Why does it have to grab EVERY record and then sort them (in the first case)? And why do the "TABLESAMPLE" versions just grab the same stupid records all the time? Why aren't they random whatsoever? Who would ever want to use this "BERNOULLI" stuff when it just picks the same few records over and over? I can't believe I'm still, after all these years, asking about grabbing a random record... it's one of the most basic possible queries.

What is the actual command to use for grabbing a random record from a table in PG which isn't so slow that it takes several full seconds for a decent-sized table?

Solution

Interesting question - which has many possibilities/permutations (this answer has been extensively revised).

Basically, this problem can be divided into two main streams.

-
A single random record

-
Multiple random records (not in the question - see reference and discussion at bottom)

Having researched this, I believe that the fastest solution to the single record problem is via the tsm_system_rows extension to PostgreSQL provided by Evan Carroll's answer.

If you're using a binary distribution, I'm not sure, but I think that the contrib modules (of which tsm_system_rows is one) are available by default - at least they were for the EnterpriseDB Windows version I used for my Windows testing (see below). My main testing was done on 12.1 compiled from source on Linux (make world and make install-world).

The reason why I feel that it is best for the single record use case is that the only problem mentioned concerning this extension is that:


Like the built-in SYSTEM sampling method, SYSTEM_ROWS performs
block-level sampling, so that the sample is not completely random but
may be subject to clustering effects, especially if only a small
number of rows are requested.

however, since you are only interested in selecting 1 row, the block-level clustering effect should not be an issue. This article from 2ndQuadrant shows why this shouldn't be a problem for a sample of one record! It is a major problem for small subsets (see end of post) - OR if you wish to generate a large sample of random records from one large table (again, see the discussion of tsm_system_rows and tsm_system_time below).

Then I created and populated a table like this:

CREATE TABLE rand AS SELECT generate_series(1, 100000000) AS seq, MD5(random()::text);


So, I now have a table with 100,000,000 (100 million) records. Then I added a PRIMARY KEY:

ALTER TABLE rand ADD PRIMARY KEY (seq);


So, now to SELECT random records:

SELECT LENGTH((seq/100)::TEXT), seq/100::FLOAT, md5 FROM rand TABLESAMPLE SYSTEM_ROWS(1);


Notice that I have used a slightly modified command so that I could "see" the randomness - I also set the \timing command so that I could get empirical measurements.

I used the LENGTH() function so that I could readily perceive the size of the PRIMARY KEY integer being returned. Here is a sample of records returned:

test=# SELECT LENGTH((seq/100)::TEXT), seq/100::FLOAT, md5 FROM rand TABLESAMPLE SYSTEM_ROWS(1);
 length | ?column?  |               md5                
--------+-----------+----------------------------------
      6 | 970749.61 | bf18719016ff4f5d16ed54c5f4679e20
(1 row)

Time: 30.606 ms
test=# SELECT LENGTH((seq/100)::TEXT), seq/100::FLOAT, md5 FROM rand TABLESAMPLE SYSTEM_ROWS(1);
 length | ?column?  |               md5                
--------+-----------+----------------------------------
      6 | 512101.21 | d27fbeea30b79d3e4eacdfea7a62b8ac
(1 row)

Time: 0.556 ms
test=# SELECT LENGTH((seq/100)::TEXT), seq/100::FLOAT, md5 FROM rand TABLESAMPLE SYSTEM_ROWS(1);
 length | ?column?  |               md5                
--------+-----------+----------------------------------
      6 | 666476.41 | c7c0c34d59229bdc42d91d0d4d9d1403
(1 row)

Time: 0.650 ms
test=# SELECT LENGTH((seq/100)::TEXT), seq/100::FLOAT, md5 FROM rand TABLESAMPLE SYSTEM_ROWS(1);
 length | ?column? |               md5                
--------+----------+----------------------------------
      5 | 49152.01 | 0a2ff4da00a2b81697e7e465bd67d85c
(1 row)

Time: 0.593 ms
test=# SELECT LENGTH((seq/100)::TEXT), seq/100::FLOAT, md5 FROM rand TABLESAMPLE SYSTEM_ROWS(1);
 length | ?column? |               md5                
--------+----------+----------------------------------
      5 | 18061.21 | ee46adc96a6f8264a5c6614f8463667d
(1 row)

Time: 0.616 ms
test=# SELECT LENGTH((seq/100)::TEXT), seq/100::FLOAT, md5 FROM rand TABLESAMPLE SYSTEM_ROWS(1);
 length | ?column?  |               md5                
--------+-----------+----------------------------------
      6 | 691962.01 | 4bac0d051490c47716f860f8afb8b24a
(1 row)

Time: 0.743 ms


So, as you can see, the LENGTH() function returns 6 most of the time - this is to be expected as most records will be between 10,000,000 and 100,000,000, but there are a couple which show a value of 5 (also have seen values of 3 & 4 - data not shown).

Now, notice the timings. The first is 30 milliseconds (ms) but the rest are sub millisecond (approx. 0.6 - 0.7ms). Most of the random samples are returned in this sub-millisecond range, but, there are results returned in 25 - 30 ms (1 in 3 or 4 on average).

From time to time, this multi-millisecond result can occur twice or even three times in a row, but, as I said, the majority of results (approx. 66 - 75%) are sub-millisecond. None of the response times for my solution that I have seen has been in excess of 75ms.

During my research I also discovered the tsm_system_time extension which is similar to `tsm_s

Code Snippets

CREATE TABLE rand AS SELECT generate_series(1, 100000000) AS seq, MD5(random()::text);
ALTER TABLE rand ADD PRIMARY KEY (seq);
SELECT LENGTH((seq/100)::TEXT), seq/100::FLOAT, md5 FROM rand TABLESAMPLE SYSTEM_ROWS(1);
test=# SELECT LENGTH((seq/100)::TEXT), seq/100::FLOAT, md5 FROM rand TABLESAMPLE SYSTEM_ROWS(1);
 length | ?column?  |               md5                
--------+-----------+----------------------------------
      6 | 970749.61 | bf18719016ff4f5d16ed54c5f4679e20
(1 row)

Time: 30.606 ms
test=# SELECT LENGTH((seq/100)::TEXT), seq/100::FLOAT, md5 FROM rand TABLESAMPLE SYSTEM_ROWS(1);
 length | ?column?  |               md5                
--------+-----------+----------------------------------
      6 | 512101.21 | d27fbeea30b79d3e4eacdfea7a62b8ac
(1 row)

Time: 0.556 ms
test=# SELECT LENGTH((seq/100)::TEXT), seq/100::FLOAT, md5 FROM rand TABLESAMPLE SYSTEM_ROWS(1);
 length | ?column?  |               md5                
--------+-----------+----------------------------------
      6 | 666476.41 | c7c0c34d59229bdc42d91d0d4d9d1403
(1 row)

Time: 0.650 ms
test=# SELECT LENGTH((seq/100)::TEXT), seq/100::FLOAT, md5 FROM rand TABLESAMPLE SYSTEM_ROWS(1);
 length | ?column? |               md5                
--------+----------+----------------------------------
      5 | 49152.01 | 0a2ff4da00a2b81697e7e465bd67d85c
(1 row)

Time: 0.593 ms
test=# SELECT LENGTH((seq/100)::TEXT), seq/100::FLOAT, md5 FROM rand TABLESAMPLE SYSTEM_ROWS(1);
 length | ?column? |               md5                
--------+----------+----------------------------------
      5 | 18061.21 | ee46adc96a6f8264a5c6614f8463667d
(1 row)

Time: 0.616 ms
test=# SELECT LENGTH((seq/100)::TEXT), seq/100::FLOAT, md5 FROM rand TABLESAMPLE SYSTEM_ROWS(1);
 length | ?column?  |               md5                
--------+-----------+----------------------------------
      6 | 691962.01 | 4bac0d051490c47716f860f8afb8b24a
(1 row)

Time: 0.743 ms
SELECT LENGTH((seq/100)::TEXT), seq/100::FLOAT, md5 FROM rand TABLESAMPLE SYSTEM_TIME(0.001) LIMIT 1;

Context

StackExchange Database Administrators Q#259205, answer score: 15

Revisions (0)

No revisions yet.