patternsqlModerate
Get fast count after join on "containment" expression
Viewed 0 times
fastaftercontainmentexpressionjoingetcount
Problem
I have three tables in a PostgreSQL database that I'm querying via a view and some joins.
All of the tables have ~150K rows right now, and it takes a really long time (~3 hours) to run
Is there a way I can restructure or optimize this particular schema or view to make the query faster, or is it a hardware issue? I'm going to spin up a beefy VM in the cloud and test, but wanted to see if there's a way to optimize w/out just throwing more hardware at it.
CREATE TABLE network_info (
network CIDR NOT NULL,
some_info TEXT NULL,
PRIMARY KEY (network)
);
CREATE TABLE ipaddr_info (
ipaddr INET NOT NULL,
some_info INT NULL,
PRIMARY KEY (ipaddr, some_info)
);
CREATE TABLE ipaddrs (
addr INET NOT NULL,
PRIMARY KEY (addr)
);
CREATE VIEW ipaddr_summary AS
SELECT DISTINCT
i.addr AS ip_address,
a.some_info AS network_info,
COUNT(b.ipaddr) AS ip_info_count
FROM ipaddrs AS i
LEFT JOIN network_info AS a
ON (i.addr << a.network)
LEFT JOIN ipaddr_info AS b
ON (i.addr = b.ipaddr)
GROUP BY i.addr, a.some_info
;All of the tables have ~150K rows right now, and it takes a really long time (~3 hours) to run
SELECT * from ipaddr_summary; on an Intel Pentium 4 2.8GHz dual core with 2G of memory running PostgreSQL 9.3.Is there a way I can restructure or optimize this particular schema or view to make the query faster, or is it a hardware issue? I'm going to spin up a beefy VM in the cloud and test, but wanted to see if there's a way to optimize w/out just throwing more hardware at it.
Solution
There might be hardware issues, too - how should we know? But there are certainly issues with the query.
First of all, remove
Arriving at this (cleaned up) query:
Next, the query looks suspiciously like it's going very wrong. Two uncorrelated joins can multiply rows:
With 150k rows in each table, there is potential for a huge (unintended) Cartesian product. My educated guess, you really want this:
I suspect that
You could first join to
Finally, you need index support. Equality between
For the "is contained by" operator `
First of all, remove
DISTINCT from your VIEW definition. It's doing nothing at all (but complicating and slowing things down). Related answer on SO with explanation:- PostgreSQL - Slow query joining on a VIEW
Arriving at this (cleaned up) query:
SELECT i.addr AS ip_address
, a.some_info AS network_info
, COUNT(b.ipaddr) AS ip_info_count
FROM ipaddrs i
LEFT JOIN ipaddr_info b ON i.addr = b.ipaddr
LEFT JOIN network_info a ON i.addr << a.network
GROUP BY 1,2;Next, the query looks suspiciously like it's going very wrong. Two uncorrelated joins can multiply rows:
- Two SQL LEFT JOINS produce incorrect result
With 150k rows in each table, there is potential for a huge (unintended) Cartesian product. My educated guess, you really want this:
SELECT addr AS ip_address
, a.some_info AS network_info
, b.ip_info_count
FROM ipaddrs i
LEFT JOIN (
SELECT ipaddr AS addr, count(*) AS ip_info_count
FROM ipaddr_info
GROUP BY 1
) b USING (addr)
LEFT JOIN network_info a ON i.addr << a.network;I suspect that
GROUP BY in the outer SELECT is not needed now, either. Besides fixing the count, avoiding the proxy cross-join should also be faster by orders of magnitude.You could first join to
ipaddrs (especially if you have predicates filtering rows from it) and then aggregate, or first aggregate in the subquery like displayed. Usefulness of this variant largely depends on data distribution. It's great for few ipaddr with big counts. See:- Slow queries related to subqueries using aggregation
Finally, you need index support. Equality between
ipaddr and addr is covered by the default btree indexes of the PRIMARY KEY. The query on the whole table is probably using a sequential scan anyway.For the "is contained by" operator `
Code Snippets
SELECT i.addr AS ip_address
, a.some_info AS network_info
, COUNT(b.ipaddr) AS ip_info_count
FROM ipaddrs i
LEFT JOIN ipaddr_info b ON i.addr = b.ipaddr
LEFT JOIN network_info a ON i.addr << a.network
GROUP BY 1,2;SELECT addr AS ip_address
, a.some_info AS network_info
, b.ip_info_count
FROM ipaddrs i
LEFT JOIN (
SELECT ipaddr AS addr, count(*) AS ip_info_count
FROM ipaddr_info
GROUP BY 1
) b USING (addr)
LEFT JOIN network_info a ON i.addr << a.network;CREATE INDEX ON network_info USING gist (network inet_ops);SELECT addr AS ip_address
, a.network_info
, b.ip_info_count
FROM ipaddrs i
LEFT JOIN (
SELECT ipaddr AS addr, count(*) AS ip_info_count
FROM ipaddr_info
GROUP BY 1
) b USING (addr)
LEFT JOIN LATERAL (
SELECT some_info AS network_info
FROM network_info
WHERE network >> i.addr
) a ON true;Context
StackExchange Database Administrators Q#90237, answer score: 10
Revisions (0)
No revisions yet.