patternMinor
PostgreSQL Replication delay debugging
Viewed 0 times
postgresqldebuggingdelayreplication
Problem
I've got a cluster of 9.1 servers running streaming replication and using WAL-E based log shipping for when that falls behind.
Generally speaking it's a pretty solid setup and does not fall behind, however we appear to have one slave in the cluster that consistently falls behind when replicating in the wee hours of the morning. It always catches up again shortly after but I have not yet been able understand why it does this.
There does not appear to be consistency with regards to the timing apart from it happens when our application is not handling a lot of load. Is it possible that I'm seeing this occur because of lack of data to be replicated?
Here's what the relevant portions of the postgresql.conf and recovery.conf look like
postgresql.conf
recovery.conf
To add a little more context, you can see the replication delay for this server is creeping up as time goes on.
What are some good strategies for debugging this kind of issue?
EDIT
I had it suggested to me that my methodology for computing the replication delay may be incorrect, here's what I'm using:
Generally speaking it's a pretty solid setup and does not fall behind, however we appear to have one slave in the cluster that consistently falls behind when replicating in the wee hours of the morning. It always catches up again shortly after but I have not yet been able understand why it does this.
There does not appear to be consistency with regards to the timing apart from it happens when our application is not handling a lot of load. Is it possible that I'm seeing this occur because of lack of data to be replicated?
Here's what the relevant portions of the postgresql.conf and recovery.conf look like
postgresql.conf
archive_mode = on
archive_command = '. /mnt/data/postgresql/.profile && wal-e wal-push %p'recovery.conf
recovery_target_timeline = 'latest'
standby_mode = 'on'
primary_conninfo = 'host=master_db_ip_address port=5432 user=postgres'
restore_command = '. /mnt/data/postgresql/.profile && wal-e wal-fetch "%f" "%p"'To add a little more context, you can see the replication delay for this server is creeping up as time goes on.
- This should not occur. The streaming replication delay should be
- There are no scheduled jobs running on the db cluster at the time these spikes in the replication timing occur.
What are some good strategies for debugging this kind of issue?
EDIT
I had it suggested to me that my methodology for computing the replication delay may be incorrect, here's what I'm using:
SELECT EXTRACT(EPOCH FROM NOW() - pg_last_xact_replay_timestamp())Solution
Maybe you are experienceing what is described at the end of the following article:
http://www.niwi.be/2013/02/16/replication-status-in-postgresql/
In a very busy database, with many writes per second, this number will remain fairly accurate. However, in a system where there are few writes, the "replication_delay" will continually grow because the last replayed transaction timestamp isn't increasing (this is generally the same limitation as MySQL's SHOW SLAVE STATUS output).
http://www.niwi.be/2013/02/16/replication-status-in-postgresql/
In a very busy database, with many writes per second, this number will remain fairly accurate. However, in a system where there are few writes, the "replication_delay" will continually grow because the last replayed transaction timestamp isn't increasing (this is generally the same limitation as MySQL's SHOW SLAVE STATUS output).
Context
StackExchange Database Administrators Q#83496, answer score: 2
Revisions (0)
No revisions yet.