patternsqlMinor
Is Postgres ideal for Many to Many data (M:N) which keeps increasing?
Viewed 0 times
idealkeepspostgresforincreasingmanywhichdata
Problem
I have a use-case where the data is many to many and needs a wide range of querying functionality.
Participants and Events
A User/Participant can register for multiple events. And each event can have many participants. It's a many-to-many relationship.
Consider a data set like this.
The following queries are required:
For handling Query 1 & Query 2
EventParticipantTable : (eventId, participantId) : 1000 x 10M records
This needs to search 1000 x 10M records?
The dataset can be split into blocks per eventId to make it ideally scan only 10M records but not sure how this can be handled in PostgreSQL.
For handling Query 3
Event Table + EventParticipantTable Join
This needs a join of two tables where I first fetch the Events table for upcoming events(based on start and end timestamps) and for each eventId matched need to find if queried participant id exists in in EventParticipantTable.
This needs to search 1000 events (1000 10M) event-participant-table entries ?
Is 1000 x 10M records per table is not an issue in this scenario?
Participants and Events
A User/Participant can register for multiple events. And each event can have many participants. It's a many-to-many relationship.
Consider a data set like this.
- Each event can have 10Million users registered.
- Each User can register a max of 1000 events
- There are 1000 Events running
The following queries are required:
- Query 1. Get all Participants who registered for an event
- Query 2. Get all Events registered by a Participant
- Query 3. Get all Events which are upcoming for a participant
For handling Query 1 & Query 2
EventParticipantTable : (eventId, participantId) : 1000 x 10M records
This needs to search 1000 x 10M records?
The dataset can be split into blocks per eventId to make it ideally scan only 10M records but not sure how this can be handled in PostgreSQL.
For handling Query 3
Event Table + EventParticipantTable Join
This needs a join of two tables where I first fetch the Events table for upcoming events(based on start and end timestamps) and for each eventId matched need to find if queried participant id exists in in EventParticipantTable.
This needs to search 1000 events (1000 10M) event-participant-table entries ?
Is 1000 x 10M records per table is not an issue in this scenario?
Solution
To resolve your issues, I did the following (all the code below is available on the fiddle here):
These tests have been run on the db<>fiddle server - we don't exactly know the configuration of the machine(s) nor do we know what else is happening while we're running our queries.
I also ran the tests on my home laptop:
The PostgreSQL 12.7 instance was compiled from source with the following options:
The system settings are the defaults except for the recommendations of pgtune as follows:
Recommended changes from defaults:
The
Firstly, I created a function to generate random strings (from here):
Then, I created an
gave it an index on
and also on
Then I created 100 (1,000 on laptop) events as follows:
BUT!, you might scream... all the event dates are in the past - yes, but if you do this, then you'll have 50% in the past and 50% in the future:
Checking with
Doing it this way (rather than with literal dates) means that the fiddle will work years from now because the
The
Then created 10,000 (10,000,000 - 10M on laptop) participants:
Now, our joining table (or
Now, this is where things become interesting. Running Query 1 on the laptop (see below) gives a response time of ~ 25 mins - not ideal!
I tried all sorts of "tricks" (
I finally bit the bullet and went for partitioning - so, what's a logical partition key for the
With indexes (PK + par_ev_ix - see below), the table is ~ 750GB!
So, after the final bracket (
I found useful information here (the m
These tests have been run on the db<>fiddle server - we don't exactly know the configuration of the machine(s) nor do we know what else is happening while we're running our queries.
I also ran the tests on my home laptop:
- Linux Fedora 34
- 1TB Samsung SSD
- 4 CPUs, 2 cores
- nothing else running apart from standard Linux processes
The PostgreSQL 12.7 instance was compiled from source with the following options:
./configure --prefix=/home/pol/Downloads/db/dba_test/12.7/inst --enable-nls --with-python --with-icu --with-openssl --with-uuid=e2fsThe system settings are the defaults except for the recommendations of pgtune as follows:
DB Version: 12
OS Type: linux
DB Type: dw
Total Memory (RAM): 32 GB
CPUs num: 4
Data Storage: ssdRecommended changes from defaults:
max_connections = 40
shared_buffers = 8GB
effective_cache_size = 24GB
maintenance_work_mem = 2GB
checkpoint_completion_target = 0.9
wal_buffers = 16MB
default_statistics_target = 500
random_page_cost = 1.1
effective_io_concurrency = 200
work_mem = 52428kB
min_wal_size = 4GB -- used 16GB for this setting
max_wal_size = 16GB -- used 64GB for this setting
max_worker_processes = 4
max_parallel_workers_per_gather = 2
max_parallel_workers = 4
max_parallel_maintenance_workers = 2The
min_ and max_wal settings were bumped up because of stuff I read for systems with heavy writes to speed up loading - shouldn't affect reads - lost references(s)...Firstly, I created a function to generate random strings (from here):
CREATE FUNCTION random_text(INTEGER)
RETURNS TEXT
LANGUAGE SQL
AS $
select upper(
substring(
(SELECT string_agg(md5(random()::TEXT), '')
FROM generate_series(
1,
CEIL($1 / 32.)::integer)
), 1, $1) );
$;Then, I created an
event table:CREATE TABLE event
(
event_id SMALLINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
event_name TEXT NOT NULL UNIQUE,
event_date DATE NOT NULL
);gave it an index on
event_name also - I can imagine many scenarios where one would wish to search by name.CREATE INDEX ev_name_ix ON event USING BTREE
(event_name ASC);and also on
event_date:CREATE INDEX ev_date_ix ON event USING BTREE
(event_date ASC);Then I created 100 (1,000 on laptop) events as follows:
INSERT INTO event (event_name, event_date)
SELECT random_text(10), CURRENT_DATE - INTERVAL '7 DAY'
FROM GENERATE_SERIES(1, 100);BUT!, you might scream... all the event dates are in the past - yes, but if you do this, then you'll have 50% in the past and 50% in the future:
UPDATE event
SET event_date =
(
CASE
WHEN MOD(event_id, 2) = 1 THEN event_date -- i.e. no change!
ELSE CURRENT_DATE + INTERVAL '7 DAY'
END
);Checking with
SELECT * FROM event; - result:event_id event_name event_date
1 A653585119 2021-07-30
2 01563801BB 2021-08-13
3 4ED87ABDEC 2021-07-30
4 EF0394645B 2021-08-13
...
... snipped for brevity
...Doing it this way (rather than with literal dates) means that the fiddle will work years from now because the
event_date depends only on when the fiddle is run and not some constant!The
participant table:CREATE TABLE participant
(
participant_id INTEGER GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
participant_name TEXT NOT NULL -- might not be UNIQUE
);participant_name index:CREATE INDEX par_name_ix ON participant USING BTREE
(participant_name ASC);Then created 10,000 (10,000,000 - 10M on laptop) participants:
INSERT INTO participant (participant_name)
SELECT random_text(10)
FROM GENERATE_SERIES(1, 10000);Now, our joining table (or
Associative Entity):CREATE TABLE ev_par
(
ev_id SMALLINT NOT NULL,
par_id INTEGER NOT NULL,
CONSTRAINT ev_par_pk PRIMARY KEY (ev_id, par_id),
CONSTRAINT ev_id_fk FOREIGN KEY (ev_id) REFERENCES event (event_id),
CONSTRAINT par_id_fk FOREIGN KEY (par_id) REFERENCES participant (participant_id)
);Now, this is where things become interesting. Running Query 1 on the laptop (see below) gives a response time of ~ 25 mins - not ideal!
I tried all sorts of "tricks" (
SET enable_seqscan = off and SET enable_bitmapscan = off - see here) - basically, I was just flailing around trying anything I could find on the web...I finally bit the bullet and went for partitioning - so, what's a logical partition key for the
ev_par table? Well, the event_id appears to be the best candidate - there's 1,000 of them - the entire table (data only) was ~ 350GB so that would give 1,000 tables of ~ 350MB - more manageable!With indexes (PK + par_ev_ix - see below), the table is ~ 750GB!
So, after the final bracket (
);) and before the semi-colon, we put:) PARTITION BY LIST (ev_id);I found useful information here (the m
Code Snippets
./configure --prefix=/home/pol/Downloads/db/dba_test/12.7/inst --enable-nls --with-python --with-icu --with-openssl --with-uuid=e2fsDB Version: 12
OS Type: linux
DB Type: dw
Total Memory (RAM): 32 GB
CPUs num: 4
Data Storage: ssdmax_connections = 40
shared_buffers = 8GB
effective_cache_size = 24GB
maintenance_work_mem = 2GB
checkpoint_completion_target = 0.9
wal_buffers = 16MB
default_statistics_target = 500
random_page_cost = 1.1
effective_io_concurrency = 200
work_mem = 52428kB
min_wal_size = 4GB -- used 16GB for this setting
max_wal_size = 16GB -- used 64GB for this setting
max_worker_processes = 4
max_parallel_workers_per_gather = 2
max_parallel_workers = 4
max_parallel_maintenance_workers = 2CREATE FUNCTION random_text(INTEGER)
RETURNS TEXT
LANGUAGE SQL
AS $$
select upper(
substring(
(SELECT string_agg(md5(random()::TEXT), '')
FROM generate_series(
1,
CEIL($1 / 32.)::integer)
), 1, $1) );
$$;CREATE TABLE event
(
event_id SMALLINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
event_name TEXT NOT NULL UNIQUE,
event_date DATE NOT NULL
);Context
StackExchange Database Administrators Q#297662, answer score: 8
Revisions (0)
No revisions yet.