HiveBrain v1.2.0
Get Started
← Back to all entries
patternMinor

Redshift - Optimize Expensive Query

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
queryexpensiveoptimizeredshift

Problem

I have two tables in Redshift that I am trying to do a join on to get zip code demographics based on a users normalized ip address. By normalized address, I mean that it is concerted to a uniform length string that has the periods stripped out and can be directly compared to one another. For example, this is applied to all ips before any joins are done and is stored in the tables:

lpad(split_part(ip, '.', 1), 3, '0') ||
lpad(split_part(ip, '.', 2), 3, '0') ||
lpad(split_part(ip, '.', 3), 3, '0') ||
lpad(split_part(ip, '.', 4), 3, '0')


so 209.170.151.71 would be transformed into 209170151071.

I have two tables. This first is visitor_details which contains the following:

-----------------------------
| visitor_id |      ip      |
-----------------------------
|      1     | 209170151071 |
|      2     | 123170167071 |
      ...           ...
| 50000000   | 001213020341 |
-----------------------------


The I have a table called geo_ip which has the following structure:

----------------------------------------
|    start_ip |    end_ip      |  zip  |
----------------------------------------
|209170151071 | 209170151071   | 11101 |
|309170151071 | 409170151071   | 11102 |
      ...           ...           ...
|509170151071 | 609170151071   | 11103 |
----------------------------------------


I'm trying to run the following query:

WITH vd AS (
  SELECT visitor_id,
         ip_address as c_ip
  FROM dev.visitor_details
)
SELECT
  visitor_id,
  c_ip,
  g.*
FROM
  vd
JOIN
  dev.geo_ip g
  ON vd.c_ip BETWEEN g.startip and g.endip
LIMIT 500;


The sort keys on geo ip are an interleaved sort key using both startip and endip. The table also doesn't seem to be skewed. However, running the query results in a very long execution time (never completed). Looking at the explain I see the following:

```
XN Limit (cost=0.00..245.17 rows=500 width=238)
-> XN Nested Loop DS_BCAST_INNER (cost=0.00..18442148764959.20 rows=37610983146614 width=2

Solution

The main thing is to avoid the nested loop join that is caused by the "between" in the join condition.

In your example specifically, I would start by rewriting this as

SELECT 
  visitor.id,
  visitor.ip,
  LAST_VALUE(zip IGNORE NULLS) OVER (
    ORDER BY COALESCE(geo_ip.start_ip, visitor.ip)
    ROWS UNBOUNDED PRECEDING
  ) as zip
FROM geo_ip
FULL OUTER JOIN visitor
ON visitor.ip=geo_ip.start_ip


Note that Redshift will only do a full outer join if considers it a merge joinable condition, which means you should set your distribution and sort key for both tables to be on visitor.ip and geo_ip.start_ip, respectively. If this is not an option, or too much of a bother, you can do this instead as a UNION ALL since you don't actually need the geo and visitor records to be on the same row:

SELECT 
  visitor_id,
  visitor_ip,
  LAST_VALUE(zip IGNORE NULLS) OVER (
    ORDER BY ip, CASE WHEN zip IS NULL THEN 0 ELSE 1 END
    ROWS UNBOUNDED PRECEDING
  ) as zip
FROM (
  SELECT null as visitor_id, start_ip as ip, zip
  FROM geo_ip
  UNION ALL
  SELECT id as visitor_id, ip, null as zip
  FROM visitor
)


These would get rid of the nested loop join. You can still improve upon this for larger clusters by allowing the operation to be distributable.

For example, if you can partition your IP address table by the first octet (i.e, no row in the table has a start and end of the range with different first octets), you could add a partition to your window function, which should let the sorting be processed on separate nodes in your cluster:

SELECT 
  visitor.id,
  visitor.ip,
  LAST_VALUE(zip IGNORE NULLS) OVER (
    PARTITION BY (COALESCE(geo_ip.start_ip, visitor.ip) / 1000000000 )::int
    ORDER BY COALESCE(geo_ip.start_ip, visitor.ip)
    ROWS UNBOUNDED PRECEDING
  ) as zip
FROM geo_ip
FULL OUTER JOIN visitor
ON visitor.ip=geo_ip.start_ip

Code Snippets

SELECT 
  visitor.id,
  visitor.ip,
  LAST_VALUE(zip IGNORE NULLS) OVER (
    ORDER BY COALESCE(geo_ip.start_ip, visitor.ip)
    ROWS UNBOUNDED PRECEDING
  ) as zip
FROM geo_ip
FULL OUTER JOIN visitor
ON visitor.ip=geo_ip.start_ip
SELECT 
  visitor_id,
  visitor_ip,
  LAST_VALUE(zip IGNORE NULLS) OVER (
    ORDER BY ip, CASE WHEN zip IS NULL THEN 0 ELSE 1 END
    ROWS UNBOUNDED PRECEDING
  ) as zip
FROM (
  SELECT null as visitor_id, start_ip as ip, zip
  FROM geo_ip
  UNION ALL
  SELECT id as visitor_id, ip, null as zip
  FROM visitor
)
SELECT 
  visitor.id,
  visitor.ip,
  LAST_VALUE(zip IGNORE NULLS) OVER (
    PARTITION BY (COALESCE(geo_ip.start_ip, visitor.ip) / 1000000000 )::int
    ORDER BY COALESCE(geo_ip.start_ip, visitor.ip)
    ROWS UNBOUNDED PRECEDING
  ) as zip
FROM geo_ip
FULL OUTER JOIN visitor
ON visitor.ip=geo_ip.start_ip

Context

StackExchange Database Administrators Q#150505, answer score: 3

Revisions (0)

No revisions yet.