patternsqlMinor
Joining two large tables on a date range with Postgres
Viewed 0 times
tablespostgreswithrangedatetwolargejoining
Problem
I know this question must come up often, but I can't seem to find a recent answer that addresses something more simple like what I'm trying to do. I've read several similar questions.
I'm using PostgreSQL 9.4 and have 2 tables, each with close to 200M rows containing two types of metrics. The columns are the_date, feature_A, feature_B and 24 hourly metrics. The metrics in the two tables are different enough such that the tables can't be combined. Or at least I don't think so. Some metrics don't apply to all features and that's why they're separated. Both tables have indexes on the date column.
When I try to do a query that joins them, if I specify a single date the query will return in about 3 seconds which seems reasonable to me. But if I simply do
Why is that and how can I get it to use the indexes again? I'd like to do queries like
The query:
```
select a.feature_a, a.feature_b, count(*) tot_ct,
count(case when b.br01 > 0 and a.bc01 0 and a.bc02 0 and a.bc03 0 and a.bc04 0 and a.bc05 0 and a.bc06 0 and a.bc07 0 and a.bc08 0 and a.bc09 0 and a.bc10 0 and a.bc11 0 and a.bc12 0 and a.bc13 0 and a.bc14 0 and a.bc15 0 and a.bc16 0 and a.bc17 0 and a.bc18 0 and a.bc19 0 and a.bc20 0 and a.bc21 0 and a.bc22 0 and a.bc23 0 and a.bc24 <= 10 then 1 end) cx24,
avg(b.br01) av01, avg(b.br02) av02, avg(b.br03) av03,
avg(b.br04) av04, avg(b.br05) av05, avg(b.br06) av06,
avg(b.br07) av07, avg(b.br08) av08, avg(b.br09) av09,
avg(b.br10) av10, avg(b.br11) av11, av
I'm using PostgreSQL 9.4 and have 2 tables, each with close to 200M rows containing two types of metrics. The columns are the_date, feature_A, feature_B and 24 hourly metrics. The metrics in the two tables are different enough such that the tables can't be combined. Or at least I don't think so. Some metrics don't apply to all features and that's why they're separated. Both tables have indexes on the date column.
When I try to do a query that joins them, if I specify a single date the query will return in about 3 seconds which seems reasonable to me. But if I simply do
a.the_date between '8/1/2017' and '8/1/2017' the query now takes essentially forever. I kill it after 10 minutes because I'm impatient. Obviously PostgreSQL is no longer using the indexes.Why is that and how can I get it to use the indexes again? I'd like to do queries like
a.the_date between '7/1/2017' and '7/31/2017' or a.the_date >= '1/1/2017'. Right now I've simply got a python program running the query for one date at a time and then totaling the results, which seems like a hack. But at least it's predictable and much faster at 3-seconds-per-date.The query:
```
select a.feature_a, a.feature_b, count(*) tot_ct,
count(case when b.br01 > 0 and a.bc01 0 and a.bc02 0 and a.bc03 0 and a.bc04 0 and a.bc05 0 and a.bc06 0 and a.bc07 0 and a.bc08 0 and a.bc09 0 and a.bc10 0 and a.bc11 0 and a.bc12 0 and a.bc13 0 and a.bc14 0 and a.bc15 0 and a.bc16 0 and a.bc17 0 and a.bc18 0 and a.bc19 0 and a.bc20 0 and a.bc21 0 and a.bc22 0 and a.bc23 0 and a.bc24 <= 10 then 1 end) cx24,
avg(b.br01) av01, avg(b.br02) av02, avg(b.br03) av03,
avg(b.br04) av04, avg(b.br05) av05, avg(b.br06) av06,
avg(b.br07) av07, avg(b.br08) av08, avg(b.br09) av09,
avg(b.br10) av10, avg(b.br11) av11, av
Solution
Solved my index issue, for anyone who might find this...
It was as simple as adding the date range to the where clause of both tables. Now both indexes get used and no more full table scans.
Seems unnecessary, but I remember doing things like this many years ago when optimizing Oracle queries. Sometimes I'd even have to do things like
It was as simple as adding the date range to the where clause of both tables. Now both indexes get used and no more full table scans.
and a.the_date between '7/1/2017' and '7/31/2017'
and b.the_date between '7/1/2017' and '7/31/2017'
and a.the_date = b.the_dateSeems unnecessary, but I remember doing things like this many years ago when optimizing Oracle queries. Sometimes I'd even have to do things like
a.col1 = b.col1 and a.col1 = b.col1 because having the = clause in there once wasn't enough.Code Snippets
and a.the_date between '7/1/2017' and '7/31/2017'
and b.the_date between '7/1/2017' and '7/31/2017'
and a.the_date = b.the_dateContext
StackExchange Database Administrators Q#184817, answer score: 5
Revisions (0)
No revisions yet.