PostgreSQL - Design database to avoid frequent update on large dataset
Problem
We are trying to optimize the performance of a database with these numbers and query requirements:
- 200-400k network segments, each identified by a unique ID
- each network segment has a state with a limited number of dynamic attributes (e.g. an average speed); a single state of dynamic attributes could be stored in 8 bytes
- the segment state changes every 3 minutes, 24 hours a day, 7 days a week
- it must be possible to query the state of a group of segments (sometimes all segments) within a particular date range, or just the current situation
- a spatial query may be requested to find all segments "around me" at a particular date (normally "now")
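As a concrete illustration of the 8-byte state, two 16-bit attributes can be packed into and unpacked from a single bigint with bit operations. The field widths and attribute names below are assumptions for illustration, not part of the question:

```sql
-- Hypothetical packing: average speed (km/h * 100) in bits 0-15,
-- vehicle count in bits 16-31; the remaining bits stay free.
SELECT (1234::bigint & 65535)            -- speed 12.34 km/h
     | ((56::bigint & 65535) << 16)      -- 56 vehicles
     AS packed_state;                    -- 3671250

-- Unpacking the same value:
SELECT packed & 65535         AS speed_x100,    -- 1234
       (packed >> 16) & 65535 AS vehicle_count  -- 56
FROM (VALUES (3671250::bigint)) AS t(packed);
```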
With these requirements, we arrived at this solution (which has a strong drawback, explained below).
[A] TABLE main_segments_history(
id_segment integer NOT NULL,
day date NOT NULL,
day_slices bigint[],
CONSTRAINT main_segments_history_pk PRIMARY KEY (id_segment,day)
)
[B] TABLE current_segment_release_state(
id_segment integer NOT NULL,
release_date timestamptz,
... all other attributes ...
CONSTRAINT currsegm_release_state_pk PRIMARY KEY (id_segment,release_date)
)

Explaining the [A] table:
- It is partitioned on the field "day" with the partition manager (pg_partman). Each partition covers one month
- The day_slices array is a one-dimensional array of 480 elements, one per 3-minute slice of the full day
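With this layout, reading the state of one segment at a given instant means deriving the slice index from the time of day. The query below is a sketch under that assumption (480 slices of 180 seconds; PostgreSQL arrays are 1-based); the segment ID and timestamp are made up:

```sql
-- Hypothetical point-in-time read for one segment.
SELECT day_slices[ (EXTRACT(EPOCH FROM ts::time)::int / 180) + 1 ] AS state
FROM   main_segments_history,
       (VALUES ('2017-01-01 08:31:00'::timestamp)) AS q(ts)
WHERE  id_segment = 42
AND    day = ts::date;
```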
Explaining the [B] table:
- It's just the current release state for each segment
There is a back-end process that elaborates the network.
It inserts or updates the state of each segment every 3 minutes.
In other words, this process inserts new rows at the start of each day and updates the inner array every 3 minutes.
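A minimal sketch of that 3-minute cycle, assuming the row is pre-sized with array_fill at day start (the placeholder segment ID, slice index, and state value are made up; the question does not show the process's SQL):

```sql
-- Day start: create the row with an empty 480-slot array,
-- doing nothing if it already exists (PostgreSQL 9.5+).
INSERT INTO main_segments_history (id_segment, day, day_slices)
VALUES (42, current_date, array_fill(NULL::bigint, ARRAY[480]))
ON CONFLICT (id_segment, day) DO NOTHING;

-- Every 3 minutes: overwrite just the current slice
-- (here slice 170, with a made-up packed state).
UPDATE main_segments_history
SET    day_slices[170] = 3671250
WHERE  id_segment = 42
AND    day = current_date;
```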
The advantages of this solution:
- A limited number of rows for each month partition table
- Good performance when joining the static data of the segment (i.e. the geometry)
- The slight redundancy of the current release table is well worth it
Solution
The only way to avoid updates to your tables is simply not to perform them. Given that you are just logging data, I would suggest a daily staging table where you store the data for one day. Instead of roughly 190 M updates/day (400 k segments × 480 slices ≈ 192 M), you would have roughly 190 M (smaller) inserts/day, and only as many updates as there are segments.
CREATE TABLE main_segments_history
(
id_segment integer NOT NULL,
day date NOT NULL,
day_slices bigint[],
CONSTRAINT main_segments_history_pk PRIMARY KEY (id_segment, day)
) ;
CREATE TABLE dayly_segments
(
id_segment integer NOT NULL,
day date NOT NULL,
id_slice integer NOT NULL,
slice bigint,
PRIMARY KEY (id_segment, day, id_slice)
) ;

This would simulate the data for one day (and 200 segments, instead of the millions you might have):
INSERT INTO
dayly_segments
(id_segment, day, id_slice, slice)
SELECT
id_segment, '2017-01-01', id_slice, (random()*1e7)::bigint
FROM
generate_series (1, 200) AS s1(id_segment)
CROSS JOIN generate_series (1, 20*24) AS s2(id_slice) ;

I assume you have some time with low activity, when you run the vacuuming processes. At that time, you can move the data from the dayly_segments table into main_segments_history, storing the whole day as an array. This way, you never update main_segments_history. This is, basically, what you would do:
-- Move segments from dayly_segments to main_segment_history
INSERT INTO
main_segments_history
(id_segment, day, day_slices)
SELECT
id_segment, day,
(SELECT array_agg(slice)
FROM (SELECT slice
FROM dayly_segments s1
WHERE s1.id_segment = s0.id_segment AND s1.day = s0.day
ORDER BY id_slice) AS s2)
FROM
(SELECT DISTINCT
id_segment, day
FROM
dayly_segments s0
WHERE
day = '2017-01-01'
) AS s0 ;
-- Delete them from original
DELETE FROM
dayly_segments
WHERE
day = '2017-01-01' ;
-- At this point, you should also...
VACUUM dayly_segments ;

dbfiddle here
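For reference, the same move can be written more compactly with an ordered aggregate (PostgreSQL supports ORDER BY inside array_agg since 9.0); this rewrite is a sketch, not part of the original answer:

```sql
INSERT INTO main_segments_history (id_segment, day, day_slices)
SELECT id_segment, day, array_agg(slice ORDER BY id_slice)
FROM   dayly_segments
WHERE  day = '2017-01-01'
GROUP  BY id_segment, day;
```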
Assumptions:
- You never "miss" any $current_value; that is, there are no holes in your arrays.
- id_slice can be assigned from any increasing sequence. Instead of an integer value, you could also use a time value.
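If id_slice is derived from the clock rather than from a sequence, one possible mapping (an assumption, matching the 3-minute granularity above) is:

```sql
-- Seconds since midnight, integer-divided by 180 s, 1-based:
SELECT (EXTRACT(EPOCH FROM now()::time)::int / 180) + 1 AS id_slice;
-- e.g. 08:31:00 -> (30660 / 180) + 1 = 171
```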
Context
StackExchange Database Administrators Q#173663, answer score: 2