HiveBrain v1.2.0
Get Started
← Back to all entries
snippetsqlMinor

Filter is applied after DISTINCT ON

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
afterdistinctfilterapplied

Problem

There's a script at the end that will fully create the schema and populate it with sample data.

Schema

Consider these two tables:

Table of polygons:

CREATE TABLE my_polygon (
    my_polygon_id SERIAL PRIMARY KEY,
    common_id INTEGER NOT NULL,
    value1 NUMERIC NOT NULL,
    value2 NUMERIC NOT NULL,
    value3 NUMERIC NOT NULL,
    geom GEOMETRY(Polygon) NOT NULL
)
;

CREATE INDEX ON my_polygon (common_id);
CREATE INDEX ON my_polygon USING GIST (common_id, geom);


Table of points contained inside polygons:

CREATE TABLE my_point (
    my_point_id SERIAL PRIMARY KEY,
    common_id INTEGER NOT NULL,
    pointvalue NUMERIC NOT NULL,
    geom GEOMETRY(Point) NOT NULL
);

CREATE INDEX ON my_point (common_id);
CREATE INDEX ON my_point USING GIST (common_id, geom);


The fact that I'm using geometries isn't strictly related to the issue here; however, I think this makes the motives for what I'm attempting to do much clearer.

Problem query

The problem is that there are very tiny, insignificant overlaps between the polygons. (Trying to clean them is really not an option. The overlaps come from some kind of floating point error in generating them, as near as I can figure.) But some points can fall inside these tiny overlaps, resulting in two rows when I JOIN them based on containment. But really, each point should only be associated with a single polygon. When one does fall within two of them, it doesn't really matter which one it ends up associated with, so it's fine to make the query just pick one, like this:

SELECT DISTINCT ON (my_point.my_point_id)
    my_polygon.*,
    my_point.my_point_id,
    my_point.pointvalue,
    my_point.geom AS pointgeom
FROM my_polygon
JOIN my_point ON my_point.common_id = my_polygon.common_id AND ST_Contains(my_polygon.geom, my_point.geom)
WHERE my_polygon.common_id = 1
ORDER BY my_point.my_point_id, my_polygon.my_polygon_id


As in the above query, I normally want to SELECT based on the common_id. This quer

Solution

It turns out there's a way to work around this and get PG to optimize correctly: you have to include common_id in the DISTINCT ON clause.

Like this:

SELECT *
FROM (
    SELECT DISTINCT ON (my_point.my_point_id, my_polygon.common_id)
        my_polygon.*,
        my_point.my_point_id,
        my_point.pointvalue,
        my_point.geom AS pointgeom
    FROM my_polygon
    JOIN my_point ON my_point.common_id = my_polygon.common_id AND ST_Contains(my_polygon.geom, my_point.geom)
    ORDER BY my_point.my_point_id, my_polygon.common_id, my_polygon.my_polygon_id
) point_with_polygon
WHERE common_id = 1


This results in the same query plan as including the WHERE clause before the DISTINCT ON:

Unique  (cost=2307.77..2345.55 rows=5038 width=212)
  ->  Sort  (cost=2307.77..2320.36 rows=5038 width=212)
        Sort Key: my_point.my_point_id, my_polygon.my_polygon_id
        ->  Nested Loop  (cost=9.36..1479.97 rows=5038 width=212)
              ->  Bitmap Heap Scan on my_polygon  (cost=4.93..190.19 rows=100 width=164)
                    Recheck Cond: (common_id = 1)
                    ->  Bitmap Index Scan on my_polygon_common_id_geom_idx  (cost=0.00..4.90 rows=100 width=0)
                          Index Cond: (common_id = 1)
              ->  Bitmap Heap Scan on my_point  (cost=4.43..12.89 rows=1 width=52)
                    Recheck Cond: ((common_id = 1) AND (my_polygon.geom && geom))
                    Filter: _st_contains(my_polygon.geom, geom)
                    ->  Bitmap Index Scan on my_point_common_id_geom_idx  (cost=0.00..4.43 rows=2 width=0)
                          Index Cond: ((common_id = 1) AND (my_polygon.geom && geom))


Including common_id is kind of redundant since it's part of the JOIN condition anyway, but that fact also means that this won't change the result of the query.

Warning: Make sure everything matches up

It's very important that you use the common_id that's in the SELECT result. Take this query for example:

SELECT *
FROM (
    SELECT DISTINCT ON (my_point.my_point_id, my_point.common_id)
        my_polygon.*,
        my_point.my_point_id,
        my_point.pointvalue,
        my_point.geom AS pointgeom
    FROM my_polygon
    JOIN my_point ON my_point.common_id = my_polygon.common_id AND ST_Contains(my_polygon.geom, my_point.geom)
    ORDER BY my_point.my_point_id, my_point.common_id, my_polygon.my_polygon_id
) point_with_polygon
WHERE common_id = 1


This query uses my_polygon.common_id in the SELECT clause, but it uses my_point.common_id in the ORDER BY and DISTINCT ON clauses. PG will not push the filter down into the subquery in this case.

Code Snippets

SELECT *
FROM (
    SELECT DISTINCT ON (my_point.my_point_id, my_polygon.common_id)
        my_polygon.*,
        my_point.my_point_id,
        my_point.pointvalue,
        my_point.geom AS pointgeom
    FROM my_polygon
    JOIN my_point ON my_point.common_id = my_polygon.common_id AND ST_Contains(my_polygon.geom, my_point.geom)
    ORDER BY my_point.my_point_id, my_polygon.common_id, my_polygon.my_polygon_id
) point_with_polygon
WHERE common_id = 1
Unique  (cost=2307.77..2345.55 rows=5038 width=212)
  ->  Sort  (cost=2307.77..2320.36 rows=5038 width=212)
        Sort Key: my_point.my_point_id, my_polygon.my_polygon_id
        ->  Nested Loop  (cost=9.36..1479.97 rows=5038 width=212)
              ->  Bitmap Heap Scan on my_polygon  (cost=4.93..190.19 rows=100 width=164)
                    Recheck Cond: (common_id = 1)
                    ->  Bitmap Index Scan on my_polygon_common_id_geom_idx  (cost=0.00..4.90 rows=100 width=0)
                          Index Cond: (common_id = 1)
              ->  Bitmap Heap Scan on my_point  (cost=4.43..12.89 rows=1 width=52)
                    Recheck Cond: ((common_id = 1) AND (my_polygon.geom && geom))
                    Filter: _st_contains(my_polygon.geom, geom)
                    ->  Bitmap Index Scan on my_point_common_id_geom_idx  (cost=0.00..4.43 rows=2 width=0)
                          Index Cond: ((common_id = 1) AND (my_polygon.geom && geom))
SELECT *
FROM (
    SELECT DISTINCT ON (my_point.my_point_id, my_point.common_id)
        my_polygon.*,
        my_point.my_point_id,
        my_point.pointvalue,
        my_point.geom AS pointgeom
    FROM my_polygon
    JOIN my_point ON my_point.common_id = my_polygon.common_id AND ST_Contains(my_polygon.geom, my_point.geom)
    ORDER BY my_point.my_point_id, my_point.common_id, my_polygon.my_polygon_id
) point_with_polygon
WHERE common_id = 1

Context

StackExchange Database Administrators Q#157540, answer score: 2

Revisions (0)

No revisions yet.