How to structure an append-only database in PostgreSQL?
Tags: postgresql, database, structure, how, append-only
Problem
Preface
We have been stuck for years using a monolithic program with a terrible and inefficient database structure. Now we are finally moving away from it with our own program.

The only area we are having difficulty with is the database. Our data is financial, and our existing program causes loss of data, mostly through overwrites (UPDATE, DELETE). We have decided to move toward an append-only structure to keep all of the data we accumulate. We may look into using a warehouse if it grows past our initial estimates, but in the meantime we expect the database to grow to a few million rows over a few years for some of the larger tables; not massive in scale at this point.

We are using PostgreSQL and will be denying modification privileges such as UPDATE and DELETE to users to ensure the historical consistency of the data.

Our Ideas

We thought we might be able to contain all current and historic data in one table, such that we query the last record (the one with the newest timestamp) for each id, as the following table suggests.

To clarify, we have dozens of tables; below is a simplified version of one such table.
╔═════╦════╦════════╦════════╦═════════════════════╗
║ key ║ id ║ field1 ║ field2 ║ timestamp ║
╠═════╬════╬════════╬════════╬═════════════════════╣
║ 0 ║ 0 ║ a ║ b ║ 1970-01-01 00:00:01 ║
╠═════╬════╬════════╬════════╬═════════════════════╣
║ 1 ║ 1 ║ x ║ y ║ 1970-01-01 00:00:02 ║
╠═════╬════╬════════╬════════╬═════════════════════╣
║ 2 ║ 0 ║ a ║ c ║ 1970-01-01 00:05:01 ║ # new version of id=0 row
╚═════╩════╩════════╩════════╩═════════════════════╝

We have also considered NoSQL solutions, but we prefer to remain ACID compliant (and most NoSQL solutions cannot meet that expectation).
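The "query the last record per id" idea above can be expressed directly with PostgreSQL's DISTINCT ON. A minimal sketch, assuming a hypothetical table named ledger with the columns shown in the example:

```sql
-- Latest version of every id: DISTINCT ON keeps the first row of each
-- (id) group, and the ORDER BY puts the newest timestamp first.
SELECT DISTINCT ON (id) key, id, field1, field2, "timestamp"
FROM ledger
ORDER BY id, "timestamp" DESC;
```

An index on (id, "timestamp" DESC) would typically support this access path as the table grows.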
We have also looked into other products such as AWS QLDB, but we're limited by budget, so we've decided to go with PostgreSQL for the time being and will re-evaluate when necessary.
Questions
- How can we best stru
Solution
This sounds like a use-case for temporal tables. AFAIK PostgreSQL does not implement this natively. There are extensions for it. It can be written in user-code, usually with triggers.
The basic idea is that the data table has two date/time columns. These define the interval during which the value in a row is applicable - it is "the" value for that key during that interval. By key I mean the natural key, not a surrogate row-labeling key, i.e. the id column, not the key column in your example. It is best to make this a "closed-open" interval, so the start time is included but the end time is excluded. The resolution of the interval start and end columns has to be appropriate for the update frequency.

When a new value for that key arrives, the existing row is updated to set its end, and a new row is written. The end date/time for the old row and the start date/time for the new row are equal, hence the closed-open formulation.
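A table following this idea might be sketched as below. The names and types are illustrative, not taken from the question, and the far-future sentinel could equally be PostgreSQL's special 'infinity' timestamp. Note that from and to are reserved words in SQL, hence the quoting:

```sql
CREATE TABLE ledger (
    key     bigserial   PRIMARY KEY,   -- surrogate row label
    id      bigint      NOT NULL,      -- natural key
    field1  text,
    field2  text,
    "from"  timestamptz NOT NULL,      -- interval start, inclusive
    "to"    timestamptz NOT NULL       -- interval end, exclusive
                DEFAULT '9999-12-31 23:59:59'
);
```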
To use your example: the first row arrives

║ key ║ id ║ field1 ║ field2 ║ from                ║ to                  ║
║  0  ║  0 ║   a    ║   b    ║ 1970-01-01 00:00:01 ║ 9999-12-31 23:59:59 ║

Some time later the second row for this key arrives. The first row is updated, the second row is written

║ key ║ id ║ field1 ║ field2 ║ from                ║ to                  ║
║  0  ║  0 ║   a    ║   b    ║ 1970-01-01 00:00:01 ║ 1970-01-01 00:05:01 ║
║  2  ║  0 ║   a    ║   c    ║ 1970-01-01 00:05:01 ║ 9999-12-31 23:59:59 ║

To read a value:
select <columns>
from <table>
where id = 0
and from <= <right now>
and to > <right now>

The placeholder `<right now>` represents the current instant on the server. It can be local, UTC or whatever; it just has to be consistent with what was used when the row was written. Using UTC uniformly makes this easier.
A consistent set of values can be read from the database for any point in history by using that historical time in the predicate instead of `<right now>`.
To delete a value, set its to column to the current time.
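The write path described above (close the old interval, insert the new row) might be sketched as one transaction, so that readers never see a gap or an overlap. The table name is hypothetical and the values come from the example:

```sql
BEGIN;
-- close the interval on the current version of id = 0
UPDATE ledger
   SET "to" = '1970-01-01 00:05:01'
 WHERE id = 0
   AND "to" = '9999-12-31 23:59:59';
-- write the new version, starting exactly where the old one ended
INSERT INTO ledger (id, field1, field2, "from", "to")
VALUES (0, 'a', 'c', '1970-01-01 00:05:01', '9999-12-31 23:59:59');
COMMIT;
```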
There are, of course, downsides. Each table must have extra columns. These must be indexed. Ensuring intervals for each key do not overlap using only constraints is non-trivial. Every query must have temporal predicates and they must all use exactly the same time value.
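One mitigation for the overlap problem is worth noting: PostgreSQL can enforce non-overlapping intervals per key declaratively with an exclusion constraint. A sketch, assuming a hypothetical ledger table with id, "from" and "to" columns, and the btree_gist extension (needed for plain equality inside a GiST index):

```sql
CREATE EXTENSION IF NOT EXISTS btree_gist;

-- Reject any two rows with the same id whose [from, to) ranges overlap;
-- tstzrange defaults to closed-open bounds, matching the formulation above.
ALTER TABLE ledger
  ADD CONSTRAINT ledger_no_overlap
  EXCLUDE USING gist (id WITH =, tstzrange("from", "to") WITH &&);
```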
These complexities can be mitigated by adding an archive table. When an existing row's to value is set, that row is also deleted from the data table and inserted into the corresponding archive table. So each id value will only have one row in the data table, and many retired rows in the archive table. Queries no longer need the additional temporal predicates. Indexes (on the data table) need not include the interval columns. Historical queries become more complex, and every new value stored will incur the cost of an additional delete and insert.

Code Snippets
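The archive-table variant might look like the following sketch, with hypothetical ledger and ledger_archive tables and a new value 'd' arriving for id = 0:

```sql
BEGIN;
-- retire the current row into the archive, closing its interval at now()
INSERT INTO ledger_archive (key, id, field1, field2, "from", "to")
SELECT key, id, field1, field2, "from", now()
  FROM ledger
 WHERE id = 0;
DELETE FROM ledger WHERE id = 0;
-- the data table keeps exactly one (current) row per id
INSERT INTO ledger (id, field1, field2, "from", "to")
VALUES (0, 'a', 'd', now(), '9999-12-31 23:59:59');
COMMIT;
```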
║ key ║ id ║ field1 ║ field2 ║ from                ║ to                  ║
║  0  ║  0 ║   a    ║   b    ║ 1970-01-01 00:00:01 ║ 9999-12-31 23:59:59 ║

║ key ║ id ║ field1 ║ field2 ║ from                ║ to                  ║
║  0  ║  0 ║   a    ║   b    ║ 1970-01-01 00:00:01 ║ 1970-01-01 00:05:01 ║
║  2  ║  0 ║   a    ║   c    ║ 1970-01-01 00:05:01 ║ 9999-12-31 23:59:59 ║

select <columns>
from <table>
where id = 0
and from <= <right now>
and to > <right now>

Context
StackExchange Database Administrators Q#240323, answer score: 7
Revisions (0)
No revisions yet.