HiveBrain v1.2.0
Get Started
← Back to all entries
patternsqlMinor

Slow query in PostgreSQL selecting a single row from between a range defined in two columns

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
postgresqlfromcolumnsrangequeryslowtwosinglebetweenselecting

Problem

I have imported a copy of the ip2location_db11 lite database, which contains 3,319,097 rows, and I am looking to optimize a numeric range query, where the low and high values are in separate columns of the table (ip_from, ip_to).

Importing the database:

CREATE TABLE ip2location_db11
(
  ip_from bigint NOT NULL, -- First IP address in netblock.
  ip_to bigint NOT NULL, -- Last IP address in netblock.
  country_code character(2) NOT NULL, -- Two-character country code based on ISO 3166.
  country_name character varying(64) NOT NULL, -- Country name based on ISO 3166.
  region_name character varying(128) NOT NULL, -- Region or state name.
  city_name character varying(128) NOT NULL, -- City name.
  latitude real NOT NULL, -- City latitude. Default to capital city latitude if city is unknown.
  longitude real NOT NULL, -- City longitude. Default to capital city longitude if city is unknown.
  zip_code character varying(30) NOT NULL, -- ZIP/Postal code.
  time_zone character varying(8) NOT NULL, -- UTC time zone (with DST supported).
  CONSTRAINT ip2location_db11_pkey PRIMARY KEY (ip_from, ip_to)
);
\copy ip2location_db11 FROM 'IP2LOCATION-LITE-DB11.CSV' WITH CSV QUOTE AS '"';


My first naive indexing attempt was to create separate indices on each of those columns, which resulted in a sequential scan with query times of 400ms:

```
account=> CREATE INDEX ip_from_db11_idx ON ip2location_db11 (ip_from);
account=> CREATE INDEX ip_to_db11_idx ON ip2location_db11 (ip_to);

account=> EXPLAIN ANALYZE VERBOSE SELECT * FROM ip2location_db11 WHERE 2538629520 BETWEEN ip_from AND ip_to;

QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------
Seq Scan on public.ip2location_db11 (cost=0.00..48930.99 rows=43111 width=842) (actual time=286.714..401.805 rows=1 loops=1)
Output: ip_from, ip_to, country_code, co

Solution

This is a little bit of a different solution to those already offered which involved using spacial indexes to do some tricks.

Instead its worth remembering that with IP addresses you cannot have overlapping ranges. That is A -> B cannot intersect X -> Y in any way. Knowing this you can change your SELECT query slightly and take advantage of this trait. In taking advantage of this trait you do not have to have any "clever" indexing at all. In fact you only need to index your ip_from column.

Previously, the query being analyzed was:

SELECT * FROM ip2location_db11 WHERE 2538629520 BETWEEN ip_from AND ip_to;


Lets assume that the range that 2538629520 falls into happens to be 2538629512 and 2538629537.


Note: It really does not matter what the range is, this is just to help demonstrate the pattern we can take advantage of.

From this we can assume that the next ip_from value is 2538629538. We actually dont need to worry about any records above this ip_from value. Indeed all we actually care about is the range where ip_from equals 2538629512.

Knowing this fact, our query actually becomes (in English):


Find me the maximumip_from value where my IP address is higher than ip_from. Show me the record where you find this value.


Or in otherwords:
Find me the ip_from value just before my IP address and give me that record

Because we never have overlapping ranges of ip_from to ip_to this holds true and allows us to write the query as:

SELECT * 
FROM ip2location
WHERE ip_from = (
    SELECT MAX(ip_from)
    FROM ip2location
    WHERE ip_from <= 2538629520
    )


Back to the indexing to take advantage of all this. All we're actually looking across is ip_from and we are doing integer comparisons. The MIN(ip_from) have PostgreSQL find the first record available. This is good because we can seek right to that and then not worry about any other records at all.

All we really need is an index like:

CREATE UNIQUE INDEX CONCURRENTLY ix_ip2location_ipFrom ON public.ip2location(ip_from)

We can make the index unique because we will not have overlapping records. I would even make this column the primary key myself.

With this index and this query, the explain plan is:

Index Scan using ix_ip2location_ipfrom on public.ip2location  (cost=0.90..8.92 rows=1 width=69) (actual time=0.530..0.533 rows=1 loops=1)
Output: ip2location.ip_from, ip2location.ip_to, ip2location.country_code, ip2location.country_name, ip2location.region_name, ip2location.city_name, ip2location.latitude, ip2location.longitude, ip2location.zip_code, ip2location.time_zone
Index Cond: (ip2location.ip_from = $1)
InitPlan 2 (returns $1)
    ->  Result  (cost=0.46..0.47 rows=1 width=8) (actual time=0.452..0.452 rows=1 loops=1)
        Output: $0
        InitPlan 1 (returns $0)
            ->  Limit  (cost=0.43..0.46 rows=1 width=8) (actual time=0.443..0.444 rows=1 loops=1)
                Output: ip2location_1.ip_from
                ->  Index Only Scan using ix_ip2location_ipfrom on public.ip2location ip2location_1  (cost=0.43..35440.79 rows=1144218 width=8) (actual time=0.438..0.438 rows=1 loops=1)
                        Output: ip2location_1.ip_from
                        Index Cond: ((ip2location_1.ip_from IS NOT NULL) AND (ip2location_1.ip_from >= '2538629520'::bigint))
                        Heap Fetches: 0


To give you an idea of improvement in query performance with this approach I tested this on my Raspberry Pi. The original approach took approx 4 secs. This approach takes approx 120ms. The big win is from individual row seeks verses some scans. The original query would suffer EXTREMELY from low range values as more of the table needs to be considered in the results. This query will exhibit consistent performance across the range of values.

Hope this helps and my explanation makes sense to you all.

Code Snippets

SELECT * FROM ip2location_db11 WHERE 2538629520 BETWEEN ip_from AND ip_to;
SELECT * 
FROM ip2location
WHERE ip_from = (
    SELECT MAX(ip_from)
    FROM ip2location
    WHERE ip_from <= 2538629520
    )
Index Scan using ix_ip2location_ipfrom on public.ip2location  (cost=0.90..8.92 rows=1 width=69) (actual time=0.530..0.533 rows=1 loops=1)
Output: ip2location.ip_from, ip2location.ip_to, ip2location.country_code, ip2location.country_name, ip2location.region_name, ip2location.city_name, ip2location.latitude, ip2location.longitude, ip2location.zip_code, ip2location.time_zone
Index Cond: (ip2location.ip_from = $1)
InitPlan 2 (returns $1)
    ->  Result  (cost=0.46..0.47 rows=1 width=8) (actual time=0.452..0.452 rows=1 loops=1)
        Output: $0
        InitPlan 1 (returns $0)
            ->  Limit  (cost=0.43..0.46 rows=1 width=8) (actual time=0.443..0.444 rows=1 loops=1)
                Output: ip2location_1.ip_from
                ->  Index Only Scan using ix_ip2location_ipfrom on public.ip2location ip2location_1  (cost=0.43..35440.79 rows=1144218 width=8) (actual time=0.438..0.438 rows=1 loops=1)
                        Output: ip2location_1.ip_from
                        Index Cond: ((ip2location_1.ip_from IS NOT NULL) AND (ip2location_1.ip_from >= '2538629520'::bigint))
                        Heap Fetches: 0

Context

StackExchange Database Administrators Q#190533, answer score: 5

Revisions (0)

No revisions yet.