
PostgreSQL + Docker: incorrect resource manager data checksum in record at 46F/6A7B6D28

Submitted by: @import:stackexchange-dba

Problem

I'm running PostgreSQL 9.3.9, Docker 1.8.2 and Ubuntu 14.04. I have an issue where my hot standby keeps failing with the following error message:

```
incorrect resource manager data checksum in record at 46F/6A7B6D28
```


After raising the log level to DEBUG2, this is the log I see:

```
2015-10-02 06:56:34.033 UTCDEBUG:  sending write 477/1E9C9990 flush 477/1E9C6700 apply 477/1E9C6700
2015-10-02 06:56:34.078 UTCDEBUG:  sending write 477/1E9C9990 flush 477/1E9C9990 apply 477/1E9C6700
2015-10-02 06:56:34.078 UTCDEBUG:  sendtime 2015-10-02 06:56:34.027356+00 receipttime 2015-10-02 06:56:34.078378+00 replication apply delay 0 ms transfer latency 51 ms
2015-10-02 06:56:34.078 UTCLOG:  incorrect resource manager data checksum in record at 477/1E9C8488
2015-10-02 06:56:34.078 UTCDEBUG:  sending write 477/1E9CE560 flush 477/1E9C9990 apply 477/1E9C8488
2015-10-02 06:56:34.095 UTCDEBUG:  sending write 477/1E9CE560 flush 477/1E9CE560 apply 477/1E9C8488
2015-10-02 06:56:34.095 UTCFATAL:  terminating walreceiver process due to administrator command
2015-10-02 06:56:34.195 UTCDEBUG:  switched WAL source from stream to archive after failure
2015-10-02 06:56:34.195 UTCLOG:  incorrect resource manager data checksum in record at 477/1E9C8488
2015-10-02 06:56:34.195 UTCDEBUG:  switched WAL source from archive to stream after failure
2015-10-02 06:56:34.196 UTCLOG:  incorrect resource manager data checksum in record at 477/1E9C8488
2015-10-02 06:56:39.200 UTCDEBUG:  switched WAL source from stream to archive after failure
2015-10-02 06:56:39.200 UTCDEBUG:  incorrect resource manager data checksum in record at 477/1E9C8488
2015-10-02 06:56:39.200 UTCDEBUG:  switched WAL source from archive to stream after failure
```
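
When chasing an error like this, it can help to know which WAL segment file contains the record the standby is choking on, so that file can be compared between the two machines. The following is a minimal sketch, not part of the original question. It assumes the default 16 MB WAL segment size and timeline 1, neither of which is stated above, and the function names are made up for illustration.

```python
import re

# Assumption: PostgreSQL's default WAL segment size (16 MB).
WAL_SEGMENT_SIZE = 16 * 1024 * 1024

CHECKSUM_ERROR = re.compile(
    r"incorrect resource manager data checksum in record at "
    r"([0-9A-Fa-f]+)/([0-9A-Fa-f]+)"
)


def parse_lsn(lsn):
    """Convert an LSN such as '477/1E9C8488' into a 64-bit byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def wal_segment_name(lsn, timeline=1, segment_size=WAL_SEGMENT_SIZE):
    """Name of the WAL segment file containing `lsn`.

    `timeline` defaults to 1, which is an assumption; use the timeline
    your cluster is actually on.
    """
    segment_number = parse_lsn(lsn) // segment_size
    segments_per_id = 0x100000000 // segment_size
    return "%08X%08X%08X" % (
        timeline,
        segment_number // segments_per_id,
        segment_number % segments_per_id,
    )


def offset_in_segment(lsn, segment_size=WAL_SEGMENT_SIZE):
    """Byte offset of `lsn` inside its segment file."""
    return parse_lsn(lsn) % segment_size


def failing_lsns(log_text):
    """Distinct LSNs reported in checksum error lines, sorted as strings."""
    return sorted({"%s/%s" % pair for pair in CHECKSUM_ERROR.findall(log_text)})
```

Under those assumptions, `wal_segment_name("477/1E9C8488")` gives `00000001000004770000001E`, which is the file to look for in `pg_xlog` or the WAL archive.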


Or this log upon failure:

```
2015-10-02 00:55:19.838 UTCDEBUG: sendtime 2015-10-02 00:55:19.836191+00 receipttime 2015-10-02 00:55:19.838961+00 replication apply delay 0 ms transfer latency 2 ms
2015-10-02 00:55:19.839 UTCDEBUG: sending write 476/E48058A0 flush 476/E4805798 ap
```

Solution

Ok, I finally figured it out (for the most part)!

The important detail I left out of the question was that I was running PostgreSQL on btrfs and streaming to a different server on ext4. There seemed to be some race condition during high-load periods that caused the streamed data to be corrupted or read incorrectly. I don't know the exact mechanism. Sometimes it failed after 30 seconds, sometimes after 30 minutes.
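
For anyone wanting to check whether their own setup matches, here is a rough sketch (not something I ran at the time) that reports which filesystem a data directory lives on, given the contents of `/proc/mounts` on Linux. The function name is hypothetical, and it ignores the escaping that `/proc/mounts` applies to mount points containing spaces.

```python
import os


def filesystem_type(path, mounts_text):
    """Return (mount_point, fstype) for the mount that contains `path`.

    `mounts_text` is expected to look like /proc/mounts: one mount per
    line, with the mount point in the second field and the filesystem
    type in the third. Returns None if no mount matches.
    """
    path = os.path.abspath(path)
    best = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        mount_point, fstype = fields[1], fields[2]
        # Compare with trailing slashes so that /var/lib/postgresql
        # does not match a path such as /var/lib/postgres.
        prefix = mount_point.rstrip("/") + "/"
        if (path + "/").startswith(prefix):
            # Prefer the longest mount point; on a tie, the later entry wins.
            if best is None or len(mount_point) >= len(best[0]):
                best = (mount_point, fstype)
    return best
```

Reading `/proc/mounts` and passing its text together with the data directory path (for example `/var/lib/postgresql/9.3/main`, which is an assumed location) returns a pair such as `("/var/lib/postgresql", "btrfs")`. Inside a Docker container the result reflects the container's view of its mounts, so it is worth running on the host as well.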

So last night I shut down the system, backed everything up to a separate HDD, reformatted my btrfs partition as ext4, moved everything back, and brought the system back up. Once I restarted streaming replication, the standby caught up, and 24 hours later it is still perfectly in sync with no errors!

So whatever was going on, it was related to the btrfs partition. I spent this entire week trying to figure it out, so I hope this saves someone else some time. :-)

Context

StackExchange Database Administrators Q#116569, answer score: 3
