HiveBrain v1.2.0
Get Started
← Back to all entries
patternsqlModerate

Get fast count after join on "containment" expression

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
fastaftercontainmentexpressionjoingetcount

Problem

I have three tables in a PostgreSQL database that I'm querying via a view and some joins.

CREATE TABLE network_info (
  network         CIDR          NOT NULL,
  some_info       TEXT          NULL,
  PRIMARY KEY (network)
);

CREATE TABLE ipaddr_info (
  ipaddr          INET          NOT NULL,
  some_info       INT           NULL,
  PRIMARY KEY (ipaddr, some_info)
);

CREATE TABLE ipaddrs (
  addr            INET          NOT NULL,
  PRIMARY KEY (addr)
);

CREATE VIEW ipaddr_summary AS
SELECT DISTINCT
  i.addr                  AS ip_address,
  a.some_info             AS network_info,
  COUNT(b.ipaddr)         AS ip_info_count
FROM ipaddrs AS i
LEFT JOIN network_info AS a
  ON (i.addr << a.network)
LEFT JOIN ipaddr_info AS b
  ON (i.addr = b.ipaddr)
GROUP BY i.addr, a.some_info
;


All of the tables have ~150K rows right now, and it takes a really long time (~3 hours) to run SELECT * from ipaddr_summary; on an Intel Pentium 4 2.8GHz dual core with 2G of memory running PostgreSQL 9.3.

Is there a way I can restructure or optimize this particular schema or view to make the query faster, or is it a hardware issue? I'm going to spin up a beefy VM in the cloud and test, but wanted to see if there's a way to optimize w/out just throwing more hardware at it.

Solution

There might be hardware issues, too - how should we know? But there are certainly issues with the query.

First of all, remove DISTINCT from your VIEW definition. It's doing nothing at all (but complicating and slowing things down). Related answer on SO with explanation:

  • PostgreSQL - Slow query joining on a VIEW



Arriving at this (cleaned up) query:

SELECT i.addr      AS ip_address
     , a.some_info AS network_info
     , COUNT(b.ipaddr) AS ip_info_count
FROM   ipaddrs           i
LEFT   JOIN ipaddr_info  b  ON i.addr = b.ipaddr
LEFT   JOIN network_info a  ON i.addr << a.network
GROUP  BY 1,2;


Next, the query looks suspiciously like it's going very wrong. Two uncorrelated joins can multiply rows:

  • Two SQL LEFT JOINS produce incorrect result



With 150k rows in each table, there is potential for a huge (unintended) Cartesian product. My educated guess, you really want this:

SELECT addr        AS ip_address
     , a.some_info AS network_info
     , b.ip_info_count
FROM   ipaddrs i
LEFT   JOIN (
   SELECT  ipaddr AS addr, count(*) AS ip_info_count
   FROM    ipaddr_info
   GROUP   BY 1
  ) b USING (addr)
LEFT   JOIN network_info a ON i.addr << a.network;


I suspect that GROUP BY in the outer SELECT is not needed now, either. Besides fixing the count, avoiding the proxy cross-join should also be faster by orders of magnitude.

You could first join to ipaddrs (especially if you have predicates filtering rows from it) and then aggregate, or first aggregate in the subquery like displayed. Usefulness of this variant largely depends on data distribution. It's great for few ipaddr with big counts. See:

  • Slow queries related to subqueries using aggregation



Finally, you need index support. Equality between ipaddr and addr is covered by the default btree indexes of the PRIMARY KEY. The query on the whole table is probably using a sequential scan anyway.

For the "is contained by" operator `

Code Snippets

SELECT i.addr      AS ip_address
     , a.some_info AS network_info
     , COUNT(b.ipaddr) AS ip_info_count
FROM   ipaddrs           i
LEFT   JOIN ipaddr_info  b  ON i.addr = b.ipaddr
LEFT   JOIN network_info a  ON i.addr << a.network
GROUP  BY 1,2;
SELECT addr        AS ip_address
     , a.some_info AS network_info
     , b.ip_info_count
FROM   ipaddrs i
LEFT   JOIN (
   SELECT  ipaddr AS addr, count(*) AS ip_info_count
   FROM    ipaddr_info
   GROUP   BY 1
  ) b USING (addr)
LEFT   JOIN network_info a ON i.addr << a.network;
CREATE INDEX ON network_info USING gist (network inet_ops);
SELECT addr AS ip_address
     , a.network_info
     , b.ip_info_count
FROM   ipaddrs i
LEFT   JOIN (
   SELECT  ipaddr AS addr, count(*) AS ip_info_count
   FROM    ipaddr_info
   GROUP   BY 1
  ) b USING (addr)
LEFT   JOIN LATERAL (
   SELECT some_info AS network_info
   FROM   network_info
   WHERE  network >> i.addr
   ) a ON true;

Context

StackExchange Database Administrators Q#90237, answer score: 10

Revisions (0)

No revisions yet.