patternsqlMinor
Improve performance for order by with columns from many tables
Viewed 0 times
tablesordercolumnswithimproveforperformancemanyfrom
Problem
Using PostgreSQL 8.4, I'm trying to consult two tables with 1 million records using order by with indexed columns of the two tables, and I'm losing performance (with 1 column takes 30 ms and with two columns takes 5 minutes). E.g.:
Table Info:
select r.code, r.test_code, r.sample_code, s.barcode, s.registry_date
from requests r
inner join samples s on (s.code = r.sample_code)
order by s.barcode asc , r.code asc
limit 21;Table Info:
CREATE TABLE public.samples (
code BIGINT NOT NULL,
barcode VARCHAR(40) NOT NULL,
registry_date TIMESTAMP WITH TIME ZONE NOT NULL,
CONSTRAINT samples_pkey PRIMARY KEY(code)
);
CREATE INDEX idx_samp_barcode ON public.samples (barcode);
CREATE INDEX idx_samp_barcode_code ON public.samples (barcode, code);
CREATE INDEX idx_samp_barcode_code_desc ON public.samples (barcode DESC, code DESC);
CREATE INDEX idx_test_identifier_desc ON public.samples (barcode DESC);
CREATE TABLE public.requests (
code BIGINT NOT NULL,
test_code INTEGER NOT NULL,
sample_code BIGINT NOT NULL,
CONSTRAINT requests_pkey PRIMARY KEY(code),
CONSTRAINT "Requests_S_fk" FOREIGN KEY (sample_code)
REFERENCES public.samples(code)
ON DELETE NO ACTION ON UPDATE NO ACTION NOT DEFERRABLE,
CONSTRAINT "Requests_T_fk" FOREIGN KEY (test_code)
REFERENCES public.tests(code)
ON DELETE NO ACTION ON UPDATE NO ACTION NOT DEFERRABLE
);
CREATE INDEX idx_req_test_code ON public.requests (test_code);
CREATE INDEX idx_req_test_code_desc ON public.requests (test_code DESC);
CREATE INDEX request_sample_code_index ON public.requests (sample_code);
CREATE INDEX requests_sample_code_desc_idx ON public.requests (sample_code DESC, code DESC);
CREATE INDEX requests_sample_code_idx ON public.requests (sample_code, code);- Each table has 1 million rows.
- All rows in both tables are distinct.
- If I order by either column alone, the execution time is ~ 30 ms.
- There is no sample without request.
- A sample can have many
s.codefor the smae`s.barcode
Solution
Problem
This is a more complex problem than is obvious on a quick glance. You are sorting by two columns, each from a different table, while you join on two other columns. This makes it impossible for Postgres to use the provided indexes and it has to default to (very) expensive sequential scans. Here is a related case on dba.SE:
Indexes
You need two indexes for best performance, both of which are already present:
But the plain query you have can not make use of them, not even in pg 9.4.
Query
Postgres 8.4 is quite a hurdle to take. But I think I found a way with a recursive CTE:
Tested and works for me in pg 9.4. It uses the indexes (partly in index-only scans in pg 9.2+).
pg 8.4 already has recursive CTEs. The rest is basic SQL (even
Explanation
The commented parts:
would be needed if there could be samples with no requests at all, which you ruled out in a comment.
The query relies on this documented behavior of recursive CTEs:
Tip: The recursive query evaluation algorithm produces its output in
breadth-first search order.
Bold emphasis mine. And on this, too:
This works because PostgreSQL's implementation evaluates only as many
rows of a
Using this trick in production is not recommended, because other
systems might work differently. Also, it usually won't work if you
make the outer query sort the recursive query's results or join them
to some other table.
Bold emphasis mine. So, it's only guaranteed to work in PostgreSQL and only if you do not add another
This is fast for a (relatively) small
PL/pgSQL function
The other option I can think of is a procedural solution with plpgsql. Should be a bit faster yet, especially for repeated calls:
Call:
A large
Postgres 9.3+ has a built-in feature, with your outdated software you have to knit the solution by hand. It's not that complicated, though. Basically a snapshot- table thats populated from a pre-defined
```
CREATE TABLE req_sample AS
SELECT row_numbe
This is a more complex problem than is obvious on a quick glance. You are sorting by two columns, each from a different table, while you join on two other columns. This makes it impossible for Postgres to use the provided indexes and it has to default to (very) expensive sequential scans. Here is a related case on dba.SE:
- Can spatial index help a “range - order by - limit” query
Indexes
You need two indexes for best performance, both of which are already present:
CREATE INDEX idx_samp_barcode_code ON public.samples (barcode, code);
CREATE INDEX requests_sample_code_idx ON public.requests (sample_code, code);But the plain query you have can not make use of them, not even in pg 9.4.
Query
Postgres 8.4 is quite a hurdle to take. But I think I found a way with a recursive CTE:
WITH RECURSIVE cte AS (
( -- all parentheses are required
SELECT r.code, r.test_code, r.sample_code, s.barcode, s.registry_date
FROM (
SELECT s.barcode
FROM samples s
-- WHERE EXISTS (SELECT 1 FROM requests WHERE r.sample_code = s.code)
ORDER BY s.barcode
LIMIT 1 -- get smallest barcode to start with
) s0
JOIN samples s USING (barcode) -- join all samples with same barcode
JOIN requests r ON r.sample_code = s.code
ORDER BY r.code -- start with ordered list
)
UNION ALL
(
SELECT r.code, r.test_code, r.sample_code, s.barcode, s.registry_date
FROM (
SELECT s.barcode
FROM cte c
JOIN samples s ON s.barcode > c.barcode
-- WHERE EXISTS (SELECT 1 FROM requests WHERE r.sample_code = s.code)
ORDER BY s.barcode
LIMIT 1 -- get next higher barcode
) s0
JOIN samples s USING (barcode) -- join all samples with same barcode
JOIN requests r ON r.sample_code = s.code
ORDER BY r.code -- again, ordered list
)
)
TABLE cte
LIMIT 21;Tested and works for me in pg 9.4. It uses the indexes (partly in index-only scans in pg 9.2+).
pg 8.4 already has recursive CTEs. The rest is basic SQL (even
TABLE is available in pg 8.4 (and can be replaced with SELECT * FROM). I hope there are no restrictions to spoil the party, I don't have an installation of pg 8.4 any more.Explanation
The commented parts:
-- WHERE EXISTS (SELECT 1 FROM requests WHERE r.sample_code = s.code)would be needed if there could be samples with no requests at all, which you ruled out in a comment.
The query relies on this documented behavior of recursive CTEs:
Tip: The recursive query evaluation algorithm produces its output in
breadth-first search order.
Bold emphasis mine. And on this, too:
This works because PostgreSQL's implementation evaluates only as many
rows of a
WITH query as are actually fetched by the parent query.Using this trick in production is not recommended, because other
systems might work differently. Also, it usually won't work if you
make the outer query sort the recursive query's results or join them
to some other table.
Bold emphasis mine. So, it's only guaranteed to work in PostgreSQL and only if you do not add another
ORDER BY in the outer query.This is fast for a (relatively) small
LIMIT in the outer query. For the case at hand, it should be very fast. The query would be no good for a large LIMIT.PL/pgSQL function
The other option I can think of is a procedural solution with plpgsql. Should be a bit faster yet, especially for repeated calls:
CREATE OR REPLACE FUNCTION f_demo(_limit int)
RETURNS TABLE (code int8, test_code int, sample_code int8
, barcode varchar, registry_date timestamptz) AS
$func$
DECLARE
_barcode text;
_n int;
_rest int := _limit; -- init with _limit parameter
BEGIN
FOR _barcode IN
SELECT DISTINCT s.barcode
FROM samples s
-- WHERE EXISTS (SELECT 1 FROM requests WHERE r.sample_code = s.code)
ORDER BY s.barcode
LOOP
RETURN QUERY
SELECT r.code, r.test_code, r.sample_code, s.barcode, s.registry_date
FROM samples s
JOIN requests r ON r.sample_code = s.code
WHERE s.barcode = _barcode
ORDER BY r.code
LIMIT _limit;
GET DIAGNOSTICS _n = ROW_COUNT;
_rest := _rest - _n;
EXIT WHEN _rest < 1;
END LOOP;
END
$func$ LANGUAGE plpgsql;Call:
SELECT * FROM f_demo(21);MATERIALIZED VIEWA large
OFFSET has a similar effect as a large LIMIT. While all the skipped rows don't add to I/O, they still have to be computed to determine the first rows after the offset. If you need that, use a MATERIALIZED VIEW with an added row number and an index on that.Postgres 9.3+ has a built-in feature, with your outdated software you have to knit the solution by hand. It's not that complicated, though. Basically a snapshot- table thats populated from a pre-defined
SELECT statement or VIEW, like. Basically, create the table as:```
CREATE TABLE req_sample AS
SELECT row_numbe
Code Snippets
CREATE INDEX idx_samp_barcode_code ON public.samples (barcode, code);
CREATE INDEX requests_sample_code_idx ON public.requests (sample_code, code);WITH RECURSIVE cte AS (
( -- all parentheses are required
SELECT r.code, r.test_code, r.sample_code, s.barcode, s.registry_date
FROM (
SELECT s.barcode
FROM samples s
-- WHERE EXISTS (SELECT 1 FROM requests WHERE r.sample_code = s.code)
ORDER BY s.barcode
LIMIT 1 -- get smallest barcode to start with
) s0
JOIN samples s USING (barcode) -- join all samples with same barcode
JOIN requests r ON r.sample_code = s.code
ORDER BY r.code -- start with ordered list
)
UNION ALL
(
SELECT r.code, r.test_code, r.sample_code, s.barcode, s.registry_date
FROM (
SELECT s.barcode
FROM cte c
JOIN samples s ON s.barcode > c.barcode
-- WHERE EXISTS (SELECT 1 FROM requests WHERE r.sample_code = s.code)
ORDER BY s.barcode
LIMIT 1 -- get next higher barcode
) s0
JOIN samples s USING (barcode) -- join all samples with same barcode
JOIN requests r ON r.sample_code = s.code
ORDER BY r.code -- again, ordered list
)
)
TABLE cte
LIMIT 21;-- WHERE EXISTS (SELECT 1 FROM requests WHERE r.sample_code = s.code)CREATE OR REPLACE FUNCTION f_demo(_limit int)
RETURNS TABLE (code int8, test_code int, sample_code int8
, barcode varchar, registry_date timestamptz) AS
$func$
DECLARE
_barcode text;
_n int;
_rest int := _limit; -- init with _limit parameter
BEGIN
FOR _barcode IN
SELECT DISTINCT s.barcode
FROM samples s
-- WHERE EXISTS (SELECT 1 FROM requests WHERE r.sample_code = s.code)
ORDER BY s.barcode
LOOP
RETURN QUERY
SELECT r.code, r.test_code, r.sample_code, s.barcode, s.registry_date
FROM samples s
JOIN requests r ON r.sample_code = s.code
WHERE s.barcode = _barcode
ORDER BY r.code
LIMIT _limit;
GET DIAGNOSTICS _n = ROW_COUNT;
_rest := _rest - _n;
EXIT WHEN _rest < 1;
END LOOP;
END
$func$ LANGUAGE plpgsql;SELECT * FROM f_demo(21);Context
StackExchange Database Administrators Q#112679, answer score: 5
Revisions (0)
No revisions yet.