debugsqlMinor
Could not receive data from WAL stream: ERROR: requested WAL segment has already been removed
Viewed 0 times
streamerrorremovedcouldbeensegmenthasreceivealreadywal
Problem
Configuration: Postgres 9.6 with a 3 cluster node. db1 is the master, db2 and db3 are replicas. WAL files are archived in AWS S3 using custom
To simulate: On db1, do heavy writes (like vacuum a large table) and then stop db1 by
Now, start db1 again by (
could not receive data from WAL stream: ERROR: requested WAL segment 0000002400053C55000000AE has already been removed.
This error message is reported by db3 (the leader), which db1 just logs it.
So, let's see the timeline. Initially db1 was in timeline 35 (0x23) and did write the following files to archive:
db1 is stopped at this point.
db3's logs show this:
and db3 copies the following files to archives
As db3 became leader, db2 starts the process to become replica to db3 (which it successfully becomes) and here is the summary of the logs:
`restored log file "00000024.history" from archive
restored log file "0000002300053C55000000AE" from archive
restored log file "00000023.history" from archive
redo starts at 53C55/AE0001A8
restored log file "0000002300053C55000000AF" from archive
...
restored log file "0000002300053C5600000042" from archive
Retrieving 0000002300053C5600000043 to pg_xlog/RECOVERYXLOG FAILED (log by
pgrsync tool. Cluster managed by patroni. The archive_command and restore_command is properly configured on all the nodes.To simulate: On db1, do heavy writes (like vacuum a large table) and then stop db1 by
sudo systemctl stop patroni). db3 becomes the new leader. db2 requests more WAL files, which it gets via the proper restore command from AWS S3, becomes replica to db3.Now, start db1 again by (
sudo systemctl start patroni). But db1 (the old leader and the new to-be-replica) never comes up as a replica and gives the error message:could not receive data from WAL stream: ERROR: requested WAL segment 0000002400053C55000000AE has already been removed.
This error message is reported by db3 (the leader), which db1 just logs it.
So, let's see the timeline. Initially db1 was in timeline 35 (0x23) and did write the following files to archive:
0000002300053C55000000AE (TL is 23 and archive at 53C55/AE)
..
0000002300053C5600000043 (TL is 23 and archive at 53C56/43)
db1 is stopped at this point.
db3's logs show this:
received promote request
redo done at 53C56/A3FFC8C0
selected new timeline ID: 36
database system is ready to accept connections
and db3 copies the following files to archives
00000024.history
0000002300053C56000000A3.partial
0000002400053C56000000A3
0000002200053C55000000AE
As db3 became leader, db2 starts the process to become replica to db3 (which it successfully becomes) and here is the summary of the logs:
`restored log file "00000024.history" from archive
restored log file "0000002300053C55000000AE" from archive
restored log file "00000023.history" from archive
redo starts at 53C55/AE0001A8
restored log file "0000002300053C55000000AF" from archive
...
restored log file "0000002300053C5600000042" from archive
Retrieving 0000002300053C5600000043 to pg_xlog/RECOVERYXLOG FAILED (log by
Solution
We had a similar situation and ended up initializing from the current primary. There is a nasty bug here. So, to play safe my advice is to re instantiate the slave. (until the bug fix is in place)
If you have a nice reproducible case, it would help to document it so it can be studied and with a little luck fixed in the code.
Reporting bugs: https://www.postgresql.org/docs/current/bug-reporting.html
PostgreSQL 12.3, 11.8, 10.13, 9.6.18, and 9.5.22 Released!
This bug was fixed in 9.5.22 and 9.6.18 and 10.13 and 11.8 and 12.3:
Avoid premature recycling of WAL segments during crash recovery (Jehan-Guillaume de Rorthais)
WAL segments that become ready to be archived during crash recovery were potentially recycled without being archived.
If you have a nice reproducible case, it would help to document it so it can be studied and with a little luck fixed in the code.
Reporting bugs: https://www.postgresql.org/docs/current/bug-reporting.html
PostgreSQL 12.3, 11.8, 10.13, 9.6.18, and 9.5.22 Released!
This bug was fixed in 9.5.22 and 9.6.18 and 10.13 and 11.8 and 12.3:
Avoid premature recycling of WAL segments during crash recovery (Jehan-Guillaume de Rorthais)
WAL segments that become ready to be archived during crash recovery were potentially recycled without being archived.
Context
StackExchange Database Administrators Q#284146, answer score: 2
Revisions (0)
No revisions yet.