snippetsqlMinor
Filter is applied after DISTINCT ON
Viewed 0 times
afterdistinctfilterapplied
Problem
There's a script at the end that will fully create the schema and populate it with sample data.
Schema
Consider these two tables:
Table of polygons:
Table of points contained inside polygons:
The fact that I'm using geometries isn't strictly related to the issue here; however, I think this makes the motives for what I'm attempting to do much clearer.
Problem query
The problem is that there are very tiny, insignificant overlaps between the polygons. (Trying to clean them is really not an option. The overlaps come from some kind of floating point error in generating them, as near as I can figure.) But some points can fall inside these tiny overlaps, resulting in two rows when I
As in the above query, I normally want to
Schema
Consider these two tables:
Table of polygons:
CREATE TABLE my_polygon (
my_polygon_id SERIAL PRIMARY KEY,
common_id INTEGER NOT NULL,
value1 NUMERIC NOT NULL,
value2 NUMERIC NOT NULL,
value3 NUMERIC NOT NULL,
geom GEOMETRY(Polygon) NOT NULL
)
;
CREATE INDEX ON my_polygon (common_id);
CREATE INDEX ON my_polygon USING GIST (common_id, geom);Table of points contained inside polygons:
CREATE TABLE my_point (
my_point_id SERIAL PRIMARY KEY,
common_id INTEGER NOT NULL,
pointvalue NUMERIC NOT NULL,
geom GEOMETRY(Point) NOT NULL
);
CREATE INDEX ON my_point (common_id);
CREATE INDEX ON my_point USING GIST (common_id, geom);The fact that I'm using geometries isn't strictly related to the issue here; however, I think this makes the motives for what I'm attempting to do much clearer.
Problem query
The problem is that there are very tiny, insignificant overlaps between the polygons. (Trying to clean them is really not an option. The overlaps come from some kind of floating point error in generating them, as near as I can figure.) But some points can fall inside these tiny overlaps, resulting in two rows when I
JOIN them based on containment. But really, each point should only be associated with a single polygon. When one does fall within two of them, it doesn't really matter which one it ends up associated with, so it's fine to make the query just pick one, like this:SELECT DISTINCT ON (my_point.my_point_id)
my_polygon.*,
my_point.my_point_id,
my_point.pointvalue,
my_point.geom AS pointgeom
FROM my_polygon
JOIN my_point ON my_point.common_id = my_polygon.common_id AND ST_Contains(my_polygon.geom, my_point.geom)
WHERE my_polygon.common_id = 1
ORDER BY my_point.my_point_id, my_polygon.my_polygon_idAs in the above query, I normally want to
SELECT based on the common_id. This querSolution
It turns out there's a way to work around this and get PG to optimize correctly: you have to include
Like this:
This results in the same query plan as including the
Including
Warning: Make sure everything matches up
It's very important that you use the
This query uses
common_id in the DISTINCT ON clause.Like this:
SELECT *
FROM (
SELECT DISTINCT ON (my_point.my_point_id, my_polygon.common_id)
my_polygon.*,
my_point.my_point_id,
my_point.pointvalue,
my_point.geom AS pointgeom
FROM my_polygon
JOIN my_point ON my_point.common_id = my_polygon.common_id AND ST_Contains(my_polygon.geom, my_point.geom)
ORDER BY my_point.my_point_id, my_polygon.common_id, my_polygon.my_polygon_id
) point_with_polygon
WHERE common_id = 1This results in the same query plan as including the
WHERE clause before the DISTINCT ON:Unique (cost=2307.77..2345.55 rows=5038 width=212)
-> Sort (cost=2307.77..2320.36 rows=5038 width=212)
Sort Key: my_point.my_point_id, my_polygon.my_polygon_id
-> Nested Loop (cost=9.36..1479.97 rows=5038 width=212)
-> Bitmap Heap Scan on my_polygon (cost=4.93..190.19 rows=100 width=164)
Recheck Cond: (common_id = 1)
-> Bitmap Index Scan on my_polygon_common_id_geom_idx (cost=0.00..4.90 rows=100 width=0)
Index Cond: (common_id = 1)
-> Bitmap Heap Scan on my_point (cost=4.43..12.89 rows=1 width=52)
Recheck Cond: ((common_id = 1) AND (my_polygon.geom && geom))
Filter: _st_contains(my_polygon.geom, geom)
-> Bitmap Index Scan on my_point_common_id_geom_idx (cost=0.00..4.43 rows=2 width=0)
Index Cond: ((common_id = 1) AND (my_polygon.geom && geom))Including
common_id is kind of redundant since it's part of the JOIN condition anyway, but that fact also means that this won't change the result of the query.Warning: Make sure everything matches up
It's very important that you use the
common_id that's in the SELECT result. Take this query for example:SELECT *
FROM (
SELECT DISTINCT ON (my_point.my_point_id, my_point.common_id)
my_polygon.*,
my_point.my_point_id,
my_point.pointvalue,
my_point.geom AS pointgeom
FROM my_polygon
JOIN my_point ON my_point.common_id = my_polygon.common_id AND ST_Contains(my_polygon.geom, my_point.geom)
ORDER BY my_point.my_point_id, my_point.common_id, my_polygon.my_polygon_id
) point_with_polygon
WHERE common_id = 1This query uses
my_polygon.common_id in the SELECT clause, but it uses my_point.common_id in the ORDER BY and DISTINCT ON clauses. PG will not push the filter down into the subquery in this case.Code Snippets
SELECT *
FROM (
SELECT DISTINCT ON (my_point.my_point_id, my_polygon.common_id)
my_polygon.*,
my_point.my_point_id,
my_point.pointvalue,
my_point.geom AS pointgeom
FROM my_polygon
JOIN my_point ON my_point.common_id = my_polygon.common_id AND ST_Contains(my_polygon.geom, my_point.geom)
ORDER BY my_point.my_point_id, my_polygon.common_id, my_polygon.my_polygon_id
) point_with_polygon
WHERE common_id = 1Unique (cost=2307.77..2345.55 rows=5038 width=212)
-> Sort (cost=2307.77..2320.36 rows=5038 width=212)
Sort Key: my_point.my_point_id, my_polygon.my_polygon_id
-> Nested Loop (cost=9.36..1479.97 rows=5038 width=212)
-> Bitmap Heap Scan on my_polygon (cost=4.93..190.19 rows=100 width=164)
Recheck Cond: (common_id = 1)
-> Bitmap Index Scan on my_polygon_common_id_geom_idx (cost=0.00..4.90 rows=100 width=0)
Index Cond: (common_id = 1)
-> Bitmap Heap Scan on my_point (cost=4.43..12.89 rows=1 width=52)
Recheck Cond: ((common_id = 1) AND (my_polygon.geom && geom))
Filter: _st_contains(my_polygon.geom, geom)
-> Bitmap Index Scan on my_point_common_id_geom_idx (cost=0.00..4.43 rows=2 width=0)
Index Cond: ((common_id = 1) AND (my_polygon.geom && geom))SELECT *
FROM (
SELECT DISTINCT ON (my_point.my_point_id, my_point.common_id)
my_polygon.*,
my_point.my_point_id,
my_point.pointvalue,
my_point.geom AS pointgeom
FROM my_polygon
JOIN my_point ON my_point.common_id = my_polygon.common_id AND ST_Contains(my_polygon.geom, my_point.geom)
ORDER BY my_point.my_point_id, my_point.common_id, my_polygon.my_polygon_id
) point_with_polygon
WHERE common_id = 1Context
StackExchange Database Administrators Q#157540, answer score: 2
Revisions (0)
No revisions yet.