HiveBrain v1.2.0
Get Started
← Back to all entries
patternsqlMinor

Postgresql 11 logical replication - stuck in `catchup` state

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
postgresqlstuckreplicationstatecatchuplogical

Problem

I'm running two postgresql 11 servers - master and slave (setup with logical replication).

The problem I'm facing is that today after weeks of uninterrupted work slave got out of sync with this error message:

2019-09-16 07:39:44.332 CEST [30117] ERROR:  could not send data to WAL stream: server closed the connection unexpectedly
                This probably means the server terminated abnormally
                before or while processing the request.
2019-09-16 07:39:44.539 CEST [12932] LOG:  logical replication apply worker for subscription "logical_from_master" has started
2019-09-16 07:39:44.542 CEST [27972] LOG:  background worker "logical replication worker" (PID 30117) exited with exit code 1


I did see this error message before and my process was to increase wal_sender_timeout on master (more details on this here: logical replication in postgresql - "server closed the connection unexpectedly")

So then I wanted to restore replication however the state of replication is stuck on catchup:

master=# select * from pg_stat_replication;
pid | usesysid | usename | application_name | client_addr | client_hostname | client_port | backend_start | backend_xmin | state | sent_lsn | write_lsn | flush_lsn | replay_lsn | write_lag | flush_lag | replay_lag | sync_priority | sync_state
-------+----------+---------+-------------------+---------------+-----------------+-------------+-------------------------------+--------------+---------+--------------+--------------+--------------+--------------+-----------------+-----------------+-----------------+---------------+------------
86864 | 16680 | my_user | logical_from_master | 10.10.10.10 | | 46110 | 2019-09-16 12:45:56.491325+02 | | catchup | D55/FA04D4B8 | D55/F9E74158 | D55/F9E44CD8 | D55/F9E74030 | 00:00:03.603104 | 00:00:03.603104 | 00:00:03.603104 | 0 | async
(1 row)


I tried to restart slave a few tim

Solution

As you already found out your master database is too busy for single replication worker to handle all of the changes.

You need to cluster your tables - but make sure you do it in such way that tables with foreign keys are handled with the same worker, otherwise you might get into situation where foreign key constraint will prevent data from being inserted into one table because table foreign key points at had not been updated yet.

Context

StackExchange Database Administrators Q#248863, answer score: 2

Revisions (0)

No revisions yet.