patternMinor
Redshift - Optimize Expensive Query
Viewed 0 times
queryexpensiveoptimizeredshift
Problem
I have two tables in Redshift that I am trying to do a join on to get zip code demographics based on a users normalized ip address. By normalized address, I mean that it is concerted to a uniform length string that has the periods stripped out and can be directly compared to one another. For example, this is applied to all ips before any joins are done and is stored in the tables:
so
I have two tables. This first is visitor_details which contains the following:
The I have a table called geo_ip which has the following structure:
I'm trying to run the following query:
The sort keys on geo ip are an interleaved sort key using both startip and endip. The table also doesn't seem to be skewed. However, running the query results in a very long execution time (never completed). Looking at the explain I see the following:
```
XN Limit (cost=0.00..245.17 rows=500 width=238)
-> XN Nested Loop DS_BCAST_INNER (cost=0.00..18442148764959.20 rows=37610983146614 width=2
lpad(split_part(ip, '.', 1), 3, '0') ||
lpad(split_part(ip, '.', 2), 3, '0') ||
lpad(split_part(ip, '.', 3), 3, '0') ||
lpad(split_part(ip, '.', 4), 3, '0')so
209.170.151.71 would be transformed into 209170151071.I have two tables. This first is visitor_details which contains the following:
-----------------------------
| visitor_id | ip |
-----------------------------
| 1 | 209170151071 |
| 2 | 123170167071 |
... ...
| 50000000 | 001213020341 |
-----------------------------The I have a table called geo_ip which has the following structure:
----------------------------------------
| start_ip | end_ip | zip |
----------------------------------------
|209170151071 | 209170151071 | 11101 |
|309170151071 | 409170151071 | 11102 |
... ... ...
|509170151071 | 609170151071 | 11103 |
----------------------------------------I'm trying to run the following query:
WITH vd AS (
SELECT visitor_id,
ip_address as c_ip
FROM dev.visitor_details
)
SELECT
visitor_id,
c_ip,
g.*
FROM
vd
JOIN
dev.geo_ip g
ON vd.c_ip BETWEEN g.startip and g.endip
LIMIT 500;The sort keys on geo ip are an interleaved sort key using both startip and endip. The table also doesn't seem to be skewed. However, running the query results in a very long execution time (never completed). Looking at the explain I see the following:
```
XN Limit (cost=0.00..245.17 rows=500 width=238)
-> XN Nested Loop DS_BCAST_INNER (cost=0.00..18442148764959.20 rows=37610983146614 width=2
Solution
The main thing is to avoid the nested loop join that is caused by the "between" in the join condition.
In your example specifically, I would start by rewriting this as
Note that Redshift will only do a full outer join if considers it a merge joinable condition, which means you should set your distribution and sort key for both tables to be on visitor.ip and geo_ip.start_ip, respectively. If this is not an option, or too much of a bother, you can do this instead as a UNION ALL since you don't actually need the geo and visitor records to be on the same row:
These would get rid of the nested loop join. You can still improve upon this for larger clusters by allowing the operation to be distributable.
For example, if you can partition your IP address table by the first octet (i.e, no row in the table has a start and end of the range with different first octets), you could add a partition to your window function, which should let the sorting be processed on separate nodes in your cluster:
In your example specifically, I would start by rewriting this as
SELECT
visitor.id,
visitor.ip,
LAST_VALUE(zip IGNORE NULLS) OVER (
ORDER BY COALESCE(geo_ip.start_ip, visitor.ip)
ROWS UNBOUNDED PRECEDING
) as zip
FROM geo_ip
FULL OUTER JOIN visitor
ON visitor.ip=geo_ip.start_ipNote that Redshift will only do a full outer join if considers it a merge joinable condition, which means you should set your distribution and sort key for both tables to be on visitor.ip and geo_ip.start_ip, respectively. If this is not an option, or too much of a bother, you can do this instead as a UNION ALL since you don't actually need the geo and visitor records to be on the same row:
SELECT
visitor_id,
visitor_ip,
LAST_VALUE(zip IGNORE NULLS) OVER (
ORDER BY ip, CASE WHEN zip IS NULL THEN 0 ELSE 1 END
ROWS UNBOUNDED PRECEDING
) as zip
FROM (
SELECT null as visitor_id, start_ip as ip, zip
FROM geo_ip
UNION ALL
SELECT id as visitor_id, ip, null as zip
FROM visitor
)These would get rid of the nested loop join. You can still improve upon this for larger clusters by allowing the operation to be distributable.
For example, if you can partition your IP address table by the first octet (i.e, no row in the table has a start and end of the range with different first octets), you could add a partition to your window function, which should let the sorting be processed on separate nodes in your cluster:
SELECT
visitor.id,
visitor.ip,
LAST_VALUE(zip IGNORE NULLS) OVER (
PARTITION BY (COALESCE(geo_ip.start_ip, visitor.ip) / 1000000000 )::int
ORDER BY COALESCE(geo_ip.start_ip, visitor.ip)
ROWS UNBOUNDED PRECEDING
) as zip
FROM geo_ip
FULL OUTER JOIN visitor
ON visitor.ip=geo_ip.start_ipCode Snippets
SELECT
visitor.id,
visitor.ip,
LAST_VALUE(zip IGNORE NULLS) OVER (
ORDER BY COALESCE(geo_ip.start_ip, visitor.ip)
ROWS UNBOUNDED PRECEDING
) as zip
FROM geo_ip
FULL OUTER JOIN visitor
ON visitor.ip=geo_ip.start_ipSELECT
visitor_id,
visitor_ip,
LAST_VALUE(zip IGNORE NULLS) OVER (
ORDER BY ip, CASE WHEN zip IS NULL THEN 0 ELSE 1 END
ROWS UNBOUNDED PRECEDING
) as zip
FROM (
SELECT null as visitor_id, start_ip as ip, zip
FROM geo_ip
UNION ALL
SELECT id as visitor_id, ip, null as zip
FROM visitor
)SELECT
visitor.id,
visitor.ip,
LAST_VALUE(zip IGNORE NULLS) OVER (
PARTITION BY (COALESCE(geo_ip.start_ip, visitor.ip) / 1000000000 )::int
ORDER BY COALESCE(geo_ip.start_ip, visitor.ip)
ROWS UNBOUNDED PRECEDING
) as zip
FROM geo_ip
FULL OUTER JOIN visitor
ON visitor.ip=geo_ip.start_ipContext
StackExchange Database Administrators Q#150505, answer score: 3
Revisions (0)
No revisions yet.