HiveBrain v1.2.0
Get Started
← Back to all entries
patternsqlMinor

PostgreSQL time series table with composite partitions

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
postgresqlwithtimepartitionscompositeseriestable

Problem

I am investigating how I might structure a PostgreSQL table to store a large amount of time stamped data that also needs to be portioned by another field.
My expected data structure will be something like:

CREATE TABLE event (
    event_time       timestamp not null,
    object_sha       char(64) not null,       ; sha256 as hex digits
    username         text not null,           ; actual name not a foreign key
    payload          jsonb not null           ; many other fields, not indexed
);


Events will be written into the table at a fast rate, possibly as high as 1000 per second, stored for around 6 months, and exported to cheaper storage. (perhaps pgdump files to an S3 bucket). So it would make sense to use declarative portioning on the event time, using pg_partman to create and manage new partition tables each week or so.

However, there is a strong requirement to run queries on the data by the object_sha, and my concern is that if the data is portioned only by timestamp, then the most recent partition table will be under heavy I/O load and might not keep up, so it appears logical to me that the data should also be partitioned by the prefix on object_sha, (perhaps on the first one or two hex digits), as that way reads and writes will be evenly distributed across many tables.

My questions are:

  • The pg_partman documentation says Sub-partitioning with multiple levels is supported, but it is of very limited use in PostgreSQL and provides next to NO PERFORMANCE BENEFIT – Why is that? Is that advice up to date and correct, as I can’t see how it would be a bad idea.



  • All the queries that I plan to run will always have order by event_time desc limit 50 or suchlike included and most of the time will be satisfied by recent events in one or two recent tables. Is the PostgreSQL query engine smart enough to limit the query to the most recent table, and only look in older tables if it does not find enough results?



  • Given that my front end application will kn

Solution

my concern is that if the data is portioned only by timestamp, then the most recent partition table will be under heavy I/O load and might not keep up

I think your fundamental premise is flawed. It is the job of your storage layer (RAID controller, LVM, etc.) to balance the IO load over the available underlying storage hardware by using striping; and with modern systems they are good at it. Trying to do it in PostgreSQL with partitioning (and different tablespaces for the different partitions) is the road to madness. I think this answers your first question: assuming your storage layer is vaguely competent, there is no additional performance benefit to be had.

All the queries that I plan to run will always have order by event_time desc limit 50 or suchlike included and most of the time will be satisfied by recent events in one or two recent tables.

As long as there is an index leading with "event_time" and that is also the range partition key, yes it will start with the newest partitions. If there is no such index, then it will not start with the newest ones and stop once it reaches the LIMIT (even though I would argue that this is when it is most important to start with the newest ones). Of course "suchlike" might hide a multitude of sins...

For your 3rd question, I don't think it matters much with declarative partitioning, though it would have been important under the older inheritance-based partitioning. But if you can arrange for different partitions to take their turn in the "hot seat" that could make a big difference in cache effectiveness, and having the clients aware of the partitioning scheme would probably be necessary (but not sufficient) to make that happen.

Context

StackExchange Database Administrators Q#322077, answer score: 2

Revisions (0)

No revisions yet.