HiveBrain v1.2.0
Get Started
← Back to all entries
patternsqlMinor

Is Postgres ideal for Many to Many data (M:N) which keeps increasing?

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
idealkeepspostgresforincreasingmanywhichdata

Problem

I have a use-case where the data is many to many and needs a wide range of querying functionality.

Participants and Events

A User/Participant can register for multiple events. And each event can have many participants. It's a many-to-many relationship.

Consider a data set like this.

  • Each event can have 10Million users registered.



  • Each User can register a max of 1000 events



  • There are 1000 Events running



The following queries are required:

  • Query 1. Get all Participants who registered for an event



  • Query 2. Get all Events registered by a Participant



  • Query 3. Get all Events which are upcoming for a participant



For handling Query 1 & Query 2

EventParticipantTable : (eventId, participantId) : 1000 x 10M records

This needs to search 1000 x 10M records?

The dataset can be split into blocks per eventId to make it ideally scan only 10M records but not sure how this can be handled in PostgreSQL.

For handling Query 3

Event Table + EventParticipantTable Join

This needs a join of two tables where I first fetch the Events table for upcoming events(based on start and end timestamps) and for each eventId matched need to find if queried participant id exists in in EventParticipantTable.

This needs to search 1000 events (1000 10M) event-participant-table entries ?

Is 1000 x 10M records per table is not an issue in this scenario?

Solution

To resolve your issues, I did the following (all the code below is available on the fiddle here):

These tests have been run on the db<>fiddle server - we don't exactly know the configuration of the machine(s) nor do we know what else is happening while we're running our queries.

I also ran the tests on my home laptop:

  • Linux Fedora 34



  • 1TB Samsung SSD



  • 4 CPUs, 2 cores



  • nothing else running apart from standard Linux processes



The PostgreSQL 12.7 instance was compiled from source with the following options:

./configure --prefix=/home/pol/Downloads/db/dba_test/12.7/inst --enable-nls --with-python --with-icu --with-openssl --with-uuid=e2fs


The system settings are the defaults except for the recommendations of pgtune as follows:

DB Version: 12
OS Type: linux
DB Type: dw
Total Memory (RAM): 32 GB
CPUs num: 4
Data Storage: ssd


Recommended changes from defaults:

max_connections = 40
shared_buffers = 8GB
effective_cache_size = 24GB
maintenance_work_mem = 2GB
checkpoint_completion_target = 0.9
wal_buffers = 16MB
default_statistics_target = 500
random_page_cost = 1.1
effective_io_concurrency = 200
work_mem = 52428kB
min_wal_size = 4GB   -- used 16GB for this setting
max_wal_size = 16GB  -- used 64GB for this setting
max_worker_processes = 4
max_parallel_workers_per_gather = 2
max_parallel_workers = 4
max_parallel_maintenance_workers = 2


The min_ and max_wal settings were bumped up because of stuff I read for systems with heavy writes to speed up loading - shouldn't affect reads - lost references(s)...

Firstly, I created a function to generate random strings (from here):

CREATE FUNCTION random_text(INTEGER)
RETURNS TEXT
LANGUAGE SQL
AS $ 
  select upper(
    substring(
      (SELECT string_agg(md5(random()::TEXT), '')
       FROM generate_series(
           1,
           CEIL($1 / 32.)::integer) 
       ), 1, $1) );
$;


Then, I created an event table:

CREATE TABLE event 
(
  event_id SMALLINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  event_name TEXT NOT NULL UNIQUE,
  event_date DATE NOT NULL
);


gave it an index on event_name also - I can imagine many scenarios where one would wish to search by name.

CREATE INDEX ev_name_ix ON event USING BTREE
(event_name ASC);


and also on event_date:

CREATE INDEX ev_date_ix ON event USING BTREE
(event_date ASC);


Then I created 100 (1,000 on laptop) events as follows:

INSERT INTO event (event_name, event_date)
SELECT random_text(10), CURRENT_DATE - INTERVAL '7 DAY'
FROM GENERATE_SERIES(1, 100);


BUT!, you might scream... all the event dates are in the past - yes, but if you do this, then you'll have 50% in the past and 50% in the future:

UPDATE event 
SET event_date = 
  (
    CASE 
      WHEN MOD(event_id, 2) = 1 THEN event_date  -- i.e. no change!
      ELSE CURRENT_DATE + INTERVAL '7 DAY'
    END
  );


Checking with SELECT * FROM event; - result:

event_id    event_name  event_date
       1    A653585119  2021-07-30
       2    01563801BB  2021-08-13
       3    4ED87ABDEC  2021-07-30
       4    EF0394645B  2021-08-13
...     
... snipped for brevity
...


Doing it this way (rather than with literal dates) means that the fiddle will work years from now because the event_date depends only on when the fiddle is run and not some constant!

The participant table:

CREATE TABLE participant
(
  participant_id INTEGER GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  participant_name TEXT NOT NULL  -- might not be UNIQUE
);


participant_name index:

CREATE INDEX par_name_ix ON participant USING BTREE
(participant_name ASC);


Then created 10,000 (10,000,000 - 10M on laptop) participants:

INSERT INTO participant (participant_name)
SELECT random_text(10)
FROM GENERATE_SERIES(1, 10000);


Now, our joining table (or Associative Entity):

CREATE TABLE ev_par
(
  ev_id SMALLINT NOT NULL,
  par_id INTEGER NOT NULL,
  CONSTRAINT ev_par_pk PRIMARY KEY (ev_id, par_id),
  CONSTRAINT ev_id_fk  FOREIGN KEY (ev_id)  REFERENCES event (event_id),
  CONSTRAINT par_id_fk FOREIGN KEY (par_id) REFERENCES participant (participant_id)
);


Now, this is where things become interesting. Running Query 1 on the laptop (see below) gives a response time of ~ 25 mins - not ideal!

I tried all sorts of "tricks" (SET enable_seqscan = off and SET enable_bitmapscan = off - see here) - basically, I was just flailing around trying anything I could find on the web...

I finally bit the bullet and went for partitioning - so, what's a logical partition key for the ev_par table? Well, the event_id appears to be the best candidate - there's 1,000 of them - the entire table (data only) was ~ 350GB so that would give 1,000 tables of ~ 350MB - more manageable!

With indexes (PK + par_ev_ix - see below), the table is ~ 750GB!

So, after the final bracket ();) and before the semi-colon, we put:

) PARTITION BY LIST (ev_id);


I found useful information here (the m

Code Snippets

./configure --prefix=/home/pol/Downloads/db/dba_test/12.7/inst --enable-nls --with-python --with-icu --with-openssl --with-uuid=e2fs
DB Version: 12
OS Type: linux
DB Type: dw
Total Memory (RAM): 32 GB
CPUs num: 4
Data Storage: ssd
max_connections = 40
shared_buffers = 8GB
effective_cache_size = 24GB
maintenance_work_mem = 2GB
checkpoint_completion_target = 0.9
wal_buffers = 16MB
default_statistics_target = 500
random_page_cost = 1.1
effective_io_concurrency = 200
work_mem = 52428kB
min_wal_size = 4GB   -- used 16GB for this setting
max_wal_size = 16GB  -- used 64GB for this setting
max_worker_processes = 4
max_parallel_workers_per_gather = 2
max_parallel_workers = 4
max_parallel_maintenance_workers = 2
CREATE FUNCTION random_text(INTEGER)
RETURNS TEXT
LANGUAGE SQL
AS $$ 
  select upper(
    substring(
      (SELECT string_agg(md5(random()::TEXT), '')
       FROM generate_series(
           1,
           CEIL($1 / 32.)::integer) 
       ), 1, $1) );
$$;
CREATE TABLE event 
(
  event_id SMALLINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  event_name TEXT NOT NULL UNIQUE,
  event_date DATE NOT NULL
);

Context

StackExchange Database Administrators Q#297662, answer score: 8

Revisions (0)

No revisions yet.