HiveBrain v1.2.0

PostgreSQL - Design database to avoid frequent update on large dataset

Submitted by: @import:stackexchange-dba

Problem

We are trying to optimize the performance of a database with the following numbers and query requirements:

  • 200-400k network segments, each identified by a unique ID

  • each network segment has a state with a limited number of dynamic attributes (e.g. an average speed); a single state of dynamic attributes can be stored in 8 bytes

  • the segment state changes every 3 minutes, 24 hours a day, 7 days a week

  • it must be possible to query the state of a group of segments (sometimes all segments) in a particular date range, or just the current situation

  • a spatial query may be requested to find all segments "around me" on a particular date (normally "now")
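
Taken together, these requirements imply a heavy write volume, which a quick back-of-envelope calculation makes concrete (using the 400k upper bound from the first bullet):

```sql
-- Back-of-envelope write volume implied by the requirements above.
-- 24 h x 60 min / 3 min = 480 state changes per segment per day;
-- at the 400k upper bound that is ~192 M state writes per day.
SELECT 24 * 60 / 3            AS slices_per_day,   -- 480
       400000 * (24 * 60 / 3) AS writes_per_day;   -- 192000000
```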



With these requirements, we arrived at this solution (which has a strong drawback, explained below).

[A] TABLE main_segments_history( 
      id_segment integer NOT NULL,
      day date NOT NULL,    
      day_slices bigint[],
      CONSTRAINT main_segments_history_pk PRIMARY KEY (id_segment,day)
)

[B] TABLE current_segment_release_state(
      id_segment integer NOT NULL,
      release_date timestamptz,
      ... all other attributes ...
      CONSTRAINT currsegm_release_state_pk PRIMARY KEY (id_segment,release_date)
)


Explaining the [A] table:

  • It is partitioned on the field "day" with pg_partman (partition manager); each partition covers one month



  • The day_slices array is a one-dimensional array of 480 elements, one per 3-minute slice of the full day
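
As a sketch (not part of the original design), a timestamp can be mapped to its position in the 480-element array like this, assuming slice 1 covers 00:00-00:03:

```sql
-- Map a timestamp to its 3-minute slice index (1..480); hypothetical helper.
SELECT (extract(hour   FROM now())::int * 60
      + extract(minute FROM now())::int) / 3 + 1 AS id_slice;
```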



Explaining the [B] table:

  • It's just the current release state for each segment



There is a back-end process that processes the network.

It inserts or updates the state of each segment every 3 minutes.

In other words, this process inserts a new row at the start of each day and then updates the inner array every 3 minutes.
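
A minimal sketch of that write cycle under design [A] (the segment ID, day, and slice values are placeholders):

```sql
-- At day start: one row per segment, with a NULL-filled 480-slot array.
INSERT INTO main_segments_history (id_segment, day, day_slices)
VALUES (42, '2017-01-01', array_fill(NULL::bigint, ARRAY[480]));

-- Every 3 minutes: update one array element in place.
-- This is the strong drawback: ~480 updates per segment per day.
UPDATE main_segments_history
   SET day_slices[241] = 87          -- slice 241 ~ 12:00-12:03
 WHERE id_segment = 42
   AND day = '2017-01-01';
```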

The advantages of this solution:

  • A limited number of rows in each monthly partition table



  • Good performance when joining the static data of the segment (e.g. the geometry)



  • The little redundancy of the current release state is very good

Solution

The only way to avoid updates to your tables is simply not to do them. Given that you're just logging data, I would suggest a daily table where you store the data for one day. Instead of 190 M updates/day, you would have 190 M (smaller) inserts/day, plus only as many updates as there are segments.

CREATE TABLE main_segments_history
 ( 
       id_segment integer NOT NULL,
       day date NOT NULL,    
       day_slices bigint[],
       CONSTRAINT main_segments_history_pk PRIMARY KEY (id_segment, day)
 ) ;

 CREATE TABLE dayly_segments
 (
      id_segment integer NOT NULL,
      day date NOT NULL,
      id_slice integer NOT NULL,
      slice bigint,
      PRIMARY KEY (id_segment, day, id_slice)
 ) ;


This would simulate the data for one day (and 200 segments; instead of the millions you might have):

INSERT INTO 
     dayly_segments
     (id_segment, day, id_slice, slice)
 SELECT
     id_segment, '2017-01-01', id_slice, (random()*1e7)::bigint
 FROM
     generate_series (1, 200) AS s1(id_segment)
     CROSS JOIN generate_series (1, 20*24) AS s2(id_slice) ;


I assume you have some time with low activity, when you run the vacuuming processes. At that time, you can move data from the dayly_segments table to the main_segments_history, and put all the data as an array. This way, you're never updating main_segments_history.

This is, basically, what you would do:

-- Move segments from dayly_segments to main_segment_history
 INSERT INTO
     main_segments_history
     (id_segment, day, day_slices)
 SELECT
     id_segment, day, 
     (SELECT array_agg(slice) 
      FROM (SELECT slice 
              FROM dayly_segments s1 
             WHERE s1.id_segment = s0.id_segment AND s1.day = s0.day 
          ORDER BY id_slice) AS s2)
 FROM
     (SELECT DISTINCT 
         id_segment, day 
     FROM
         dayly_segments s0
     WHERE 
         day = '2017-01-01'
     ) AS s0 ;

 -- Delete them from original
 DELETE FROM
     dayly_segments
 WHERE
     day = '2017-01-01' ;

 -- At this point, you should also...
 VACUUM dayly_segments ;
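
Once the data has been folded into main_segments_history, reading a state back is a plain array subscript. A hypothetical read for the noon slice of one segment over a date range:

```sql
-- Slice 241 corresponds to 12:00-12:03 under the 3-minute granularity.
SELECT day, day_slices[241] AS noon_state
  FROM main_segments_history
 WHERE id_segment = 42
   AND day BETWEEN '2017-01-01' AND '2017-01-31'
 ORDER BY day;
```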



Assumptions:

  • You never "miss" any $current_value. That is, there are no holes in your arrays.



  • id_slice can be assigned the result of any increasing sequence. You could also use a time value instead of an integer.
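
For instance, a time-derived id_slice (an illustration, not part of the original answer) could be computed as:

```sql
-- 3-minute buckets since midnight give an increasing id_slice in 1..480.
SELECT (extract(epoch FROM now()::time)::int / 180) + 1 AS id_slice;
```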


Context

StackExchange Database Administrators Q#173663, answer score: 2
