patternsqlMinor
Managing and speeding up queries on PostgreSQL table with over 3 trillion rows
Viewed 0 times
postgresqlrowsmanagingspeedingwithandqueriesovertabletrillion
Problem
I have time series data which spans over 10 years and has over 3 trillion rows and 10 columns.
At the moment I use a PCIe SSD with 128GB of RAM and I am finding that querying takes a significant amount of time. For example running the below command takes well over 15 mins:
The table is mostly used for reads. The only time the table is written to is during weekly updates which insert about 15 million rows.
What are the best ways to manage tables so large? Would you recommend splitting it by year?
The table size is 542 GB and the external size is 109 GB.
The table was created using the following code:
At the moment I use a PCIe SSD with 128GB of RAM and I am finding that querying takes a significant amount of time. For example running the below command takes well over 15 mins:
SELECT * FROM tbl WHERE column_a = 'value1' AND column_b = 'value2';The table is mostly used for reads. The only time the table is written to is during weekly updates which insert about 15 million rows.
What are the best ways to manage tables so large? Would you recommend splitting it by year?
The table size is 542 GB and the external size is 109 GB.
EXPLAIN (BUFFERS, ANALYZE) output:"Seq Scan on table (cost=0.00..116820941.44 rows=758 width=92) (actual time=0.011..1100643.844 rows=667 loops=1)"
" Filter: (("COLUMN_A" = 'Value1'::text) AND ("COLUMN_B" = 'Value2'::text))"
" Rows Removed by Filter: 4121893840"
" Buffers: shared hit=2 read=56640470 dirtied=476248 written=476216"
"Total runtime: 1100643.967 ms"The table was created using the following code:
CREATE TABLE tbl (
DATE timestamp with time zone,
COLUMN_A text,
COLUMN_B text,
VALUE_1 double precision,
VALUE_2 double precision,
VALUE_3 double precision,
VALUE_4 double precision,
VALUE_5 double precision,
VALUE_6 double precision,
VALUE_7 double precision,
);
CREATE INDEX ix_table_name_date ON table_name (DATE);Solution
Your existing index on
is an index for
Details:
Next, I would consider partial indexes if you can safely exclude large parts of the table from the queries. Or partition your table in something like 100 partitions. And use constraint exclusion.
"Splitting it by year", like you contemplated, would be just as useless (harmful, even) as your index on
If partitioning is no good, a less invasive measure would be to
-
Configuring PostgreSQL for read performance
-
Best way to populate a new column in a large table?
Be sure to run the latest version of Postgres. The upcoming Postgres 9.5 might be particularly interesting for you, since it introduces BRIN indexes (block range indexes), which can reduce the index size dramatically for very big tables. Might be exactly what you are looking for.
DATE is obviously useless for the query. The first obvious step for your query:SELECT * FROM tbl WHERE column_a = 'value1' AND column_b = 'value2';is an index for
column_a or column_b (which ever is more selective) or possibly a multicolumn index on (column_a, column_b), like:CREATE INDEX tbl_a_b_idx ON tbl(column_a, column_b);Details:
- Is a composite index also good for queries on the first field?
Next, I would consider partial indexes if you can safely exclude large parts of the table from the queries. Or partition your table in something like 100 partitions. And use constraint exclusion.
"Splitting it by year", like you contemplated, would be just as useless (harmful, even) as your index on
DATE for the given query. Partitions need to be based on columns in the predicates of the query (column_a and column_b) to be helpful. Predicates on just one column would be much easier in this respect. If you have important queries filtering on various columns, partitioning is probably not the way to go. (Partial indexes still might.)If partitioning is no good, a less invasive measure would be to
CLUSTER data based on the new index (you can't use a partial index for this). Or simply create a new table from the sorted output of a query. This is particularly interesting since your table is mostly read-only. It should pay to do this at least once, but that one time will be very expensive: the whole table has to be re-written one or the other way. You need enough free space, as much RAM as you can get and an exclusive lock on the table. Or use pg_repack, to avoid the exclusive lock:-
Configuring PostgreSQL for read performance
-
Best way to populate a new column in a large table?
Be sure to run the latest version of Postgres. The upcoming Postgres 9.5 might be particularly interesting for you, since it introduces BRIN indexes (block range indexes), which can reduce the index size dramatically for very big tables. Might be exactly what you are looking for.
Code Snippets
SELECT * FROM tbl WHERE column_a = 'value1' AND column_b = 'value2';CREATE INDEX tbl_a_b_idx ON tbl(column_a, column_b);Context
StackExchange Database Administrators Q#110912, answer score: 8
Revisions (0)
No revisions yet.