HiveBrain v1.2.0
Get Started
← Back to all entries
snippetsqlMinor

How to index two tables for JOINed query optimisation

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
tablesjoinedqueryoptimisationtwoforhowindex

Problem

I'm using PostgreSQL 9.5 on High Sierra.

Over the two tables:

request_logs - ~ 26K rows

response_logs - ~ 9K rows

I've the following query (with JOIN):

SELECT
req.uuid,
res.status,
req.method,
req.requesturi,
req.accessed,
req.payload reqpayload,

res.payload respayload,
COUNT(*) OVER() AS total_rows
FROM
request_logs req
LEFT OUTER JOIN response_logs res ON req.uuid = res.uuid
WHERE
req.accountid = 2 AND
req.requesturi not ilike '/v1/sessions%' AND
req.accessed BETWEEN “2018-01-01 15:04:05 +0000” and “2019-01-02 15:04:05+0000” AND
res.status IS NOT NULL AND
req.applicationid = 1
ORDER BY
accessed DESC LIMIT 1000

As I'm trying to optimise the query, I've experimented with different indexes:
Here's a list of what I tried:

Configuration 1:
1. request_log.uuid (pkey, unique)
2. response_log.uuid (pkey, unique, foreign key)

Response time avg. : 260 ms

Configuration 2:
1. request_log.uuid (pkey, unique)
2. request_log.applicationid
3. response_log.uuid (pkey, unique, foreign key)

Response time avg. : 230 ms

Configuration 3:
1. request_log.uuid (pkey, unique)
2. request_log.applicationid
3. request_log.accessed (timestampz)
4. response_log.uuid (pkey, unique, foreign key)

Response time avg. : 230 ms

Configuration 4:
1. request_log.uuid (pkey, unique)
2. request_log.applicationid
3. request_log.accessed (timestampz)
4. request_log.accountid
5. response_log.uuid (pkey, unique, foreign key)

Response time avg. : 230 ms

Configuration 5:
1. request_log.uuid (pkey, unique)
2. request_log.applicationid, request_log.accessed (combined)
3. response_log.uuid (pkey, unique, foreign key)

Response time avg. : 240 ms

As visible from the result, indexing by applicationid (an int8) did help a little, while indexing by the timestampz accessed didn't help at all.
Maybe the bad performance is due to the JOIN?
Altogether, it seems quite slow and I try not to think what will happen when these tables contain millio

Solution

For starters, get your LEFT JOIN right and try to get rid of total rows (as discussed in comments):
Query

Updated to your comment:

I only want to returns rows from request_logs that have a parallel entry in response_logs with non nil status:

SELECT q.uuid
     , s.status
     , q.method
     , q.requesturi
     , q.accessed
     , q.payload reqpayload
     , s.payload respayload
--   , COUNT(*) OVER() AS total_rows  -- see below
FROM   request_logs q
JOIN   response_logs s USING (uuid)   -- not LEFT JOIN
WHERE  q.accountid = 2
AND    q.requesturi NOT ILIKE '/v1/sessions%'
AND    q.accessed BETWEEN '2018-01-01 15:04:05 +0' AND '2019-01-02 15:04:05+0'
AND    q.applicationid = 1
AND    s.status IS NOT NULL           -- see below
ORDER  BY q.accessed DESC
LIMIT  1001;                          -- see below


Your LEFT [OUTER] JOIN burns down to a plain [INNER] JOIN. Related:

  • Query with LEFT JOIN not returning rows for count of 0



COUNT(*) OVER() AS total_rows is expensive for counts substantially bigger than LIMIT (and you are expecting "10m+ rows"). Maybe it's good enough to use LIMIT 1001, only use the first 1000 rows, check the row count and if it's 1001 then you know there are "more than 1000 matching rows". Related:

  • Best way to get result count before LIMIT was applied



Indexes

If (like you commented)

the only 'moving parts' in this query are the dates (from and to) that are compared with accessed

... and more than a few rows are excluded with the other predicates
on applicationid, requesturi and accountid, then a partial index should help read performance (a lot):

CREATE INDEX ON request_logs (uuid, accessed DESC)
WHERE  accountid = 2
AND    requesturi NOT ILIKE '/v1/sessions%'
AND    applicationid = 1


And if a substantial percentage of rows is eliminated by status IS NOT NULL, also:

CREATE INDEX ON response_logs (uuid)
WHERE  status IS NOT NULL;


It may pay to append the columns status and payload as index columns if you can get index-only scans out of it. Some preconditions apply.

Code Snippets

SELECT q.uuid
     , s.status
     , q.method
     , q.requesturi
     , q.accessed
     , q.payload reqpayload
     , s.payload respayload
--   , COUNT(*) OVER() AS total_rows  -- see below
FROM   request_logs q
JOIN   response_logs s USING (uuid)   -- not LEFT JOIN
WHERE  q.accountid = 2
AND    q.requesturi NOT ILIKE '/v1/sessions%'
AND    q.accessed BETWEEN '2018-01-01 15:04:05 +0' AND '2019-01-02 15:04:05+0'
AND    q.applicationid = 1
AND    s.status IS NOT NULL           -- see below
ORDER  BY q.accessed DESC
LIMIT  1001;                          -- see below
CREATE INDEX ON request_logs (uuid, accessed DESC)
WHERE  accountid = 2
AND    requesturi NOT ILIKE '/v1/sessions%'
AND    applicationid = 1
CREATE INDEX ON response_logs (uuid)
WHERE  status IS NOT NULL;

Context

StackExchange Database Administrators Q#213327, answer score: 3

Revisions (0)

No revisions yet.