HiveBrain v1.2.0
Get Started
← Back to all entries
debugsqlMinor

Could not receive data from WAL stream: ERROR: requested WAL segment has already been removed

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
streamerrorremovedcouldbeensegmenthasreceivealreadywal

Problem

Configuration: Postgres 9.6 with a 3 cluster node. db1 is the master, db2 and db3 are replicas. WAL files are archived in AWS S3 using custom pgrsync tool. Cluster managed by patroni. The archive_command and restore_command is properly configured on all the nodes.

To simulate: On db1, do heavy writes (like vacuum a large table) and then stop db1 by sudo systemctl stop patroni). db3 becomes the new leader. db2 requests more WAL files, which it gets via the proper restore command from AWS S3, becomes replica to db3.

Now, start db1 again by (sudo systemctl start patroni). But db1 (the old leader and the new to-be-replica) never comes up as a replica and gives the error message:

could not receive data from WAL stream: ERROR: requested WAL segment 0000002400053C55000000AE has already been removed.

This error message is reported by db3 (the leader), which db1 just logs it.

So, let's see the timeline. Initially db1 was in timeline 35 (0x23) and did write the following files to archive:
0000002300053C55000000AE (TL is 23 and archive at 53C55/AE)
..
0000002300053C5600000043 (TL is 23 and archive at 53C56/43)


db1 is stopped at this point.
db3's logs show this:
received promote request
redo done at 53C56/A3FFC8C0
selected new timeline ID: 36
database system is ready to accept connections


and db3 copies the following files to archives
00000024.history
0000002300053C56000000A3.partial
0000002400053C56000000A3
0000002200053C55000000AE


As db3 became leader, db2 starts the process to become replica to db3 (which it successfully becomes) and here is the summary of the logs:
`restored log file "00000024.history" from archive
restored log file "0000002300053C55000000AE" from archive
restored log file "00000023.history" from archive
redo starts at 53C55/AE0001A8
restored log file "0000002300053C55000000AF" from archive
...
restored log file "0000002300053C5600000042" from archive
Retrieving 0000002300053C5600000043 to pg_xlog/RECOVERYXLOG FAILED (log by

Solution

We had a similar situation and ended up initializing from the current primary. There is a nasty bug here. So, to play safe my advice is to re instantiate the slave. (until the bug fix is in place)
If you have a nice reproducible case, it would help to document it so it can be studied and with a little luck fixed in the code.
Reporting bugs: https://www.postgresql.org/docs/current/bug-reporting.html

PostgreSQL 12.3, 11.8, 10.13, 9.6.18, and 9.5.22 Released!
This bug was fixed in 9.5.22 and 9.6.18 and 10.13 and 11.8 and 12.3:
Avoid premature recycling of WAL segments during crash recovery (Jehan-Guillaume de Rorthais)

WAL segments that become ready to be archived during crash recovery were potentially recycled without being archived.

Context

StackExchange Database Administrators Q#284146, answer score: 2

Revisions (0)

No revisions yet.