Postgres is stuck in recovery mode

Submitted by: @import:stackexchange-dba

Problem

I have a stand-alone instance of PostgreSQL that is stuck in recovery mode. For many hours it has been reporting:

2014-03-24 18:45:57 MDT FATAL:  the database system is starting up


Meanwhile, ps shows:

postgres  2637  0.1  0.1 116916  4420 ?        Ds   15:43   0:18 postgres: startup process   recovering 00000001000000040000007E


Is there any way to watch the progress of the recovery, ideally with an ETA? How can I get this process "un-stuck"?
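
One rough way to watch for progress is to poll ps and see whether the WAL segment named after "recovering" changes over time. The sketch below decodes that name into its three 8-digit hexadecimal fields (timeline, log id, segment), which is the layout used by the 9.1 series. The function names are hypothetical, and this only shows whether recovery is advancing; it does not give an ETA.

```python
import re

def segment_from_ps(line):
    """Pull the 24-hex-digit WAL segment name out of a startup-process ps line."""
    match = re.search(r"recovering ([0-9A-F]{24})", line)
    return match.group(1) if match else None

def parse_wal_segment_name(name):
    """Split a WAL segment name into (timeline, log id, segment) integers."""
    if not re.fullmatch(r"[0-9A-F]{24}", name):
        raise ValueError("not a WAL segment name: %r" % name)
    return int(name[0:8], 16), int(name[8:16], 16), int(name[16:24], 16)

# The process title from the ps output above.
PS_LINE = "postgres: startup process   recovering 00000001000000040000007E"

current = segment_from_ps(PS_LINE)
timeline, log_id, segment = parse_wal_segment_name(current)
print(current, timeline, log_id, segment)
```

If the decoded tuple is the same across many polls, as it was here, recovery is not moving forward.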

I can stop postgres using the standard start/stop scripts, but when I start it again, it's still stuck in recovery mode.

Debian 7.4, Linux kernel 3.2.0, PostgreSQL 9.1.12

Output of strace -p 2637:

Process 2637 attached - interrupt to quit
close(154)                              = 0
getppid()                               = 2600
open("pg_clog/0003", O_RDWR|O_CREAT, 0600) = 154
lseek(154, 221184, SEEK_SET)            = 221184
write(154, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192) = 8192
fsync(154)                              = 0


The above output repeats, seemingly without end.

A gdb backtrace (three subsequent bts were all practically identical):

```
(gdb) bt
#0 0x00007f0c8f917d70 in fsync () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007f0c9169f255 in pg_fsync_no_writethrough (fd=) at /home/cbe/projects/postgresql/9.1/postgresql-9.1-9.1.12/build/../src/backend/storage/file/fd.c:286
#2 0x00007f0c9169f265 in pg_fsync (fd=) at /home/cbe/projects/postgresql/9.1/postgresql-9.1-9.1.12/build/../src/backend/storage/file/fd.c:274
#3 0x00007f0c9151917b in SlruPhysicalWritePage (ctl=ctl@entry=0x7f0c91b856c0, pageno=pageno@entry=123, slotno=slotno@entry=3, fdata=fdata@entry=0x0) at /home/cbe/projects/postgresql/9.1/postgresql-9.1-9.1.12/build/../src/backend/access/transam/slru.c:801
#4 0x00007f0c91519925 in SlruInternalWritePage (ctl=ctl@entry=0x7f0c91b856c0, slotno=3, fdata=fdata@entry=0x0) at /home/cbe/projects/postgresql/9.1/postgresql-9.1-9.1.12/build/../src/backend/access/t
```

Solution

After googling for hours, I stumbled across a thread that, as best I could tell, wasn't really related to my issue, but the fix it suggested seemed harmless enough to try, and voilà!

I looked at /var/lib/postgresql/9.1/main/pg_clog, and saw:

drwx------  2 postgres postgres   4096 Mar 15 15:20 .
drwx------ 13 postgres postgres   4096 Mar 25 12:15 ..
-rw-------  1 postgres postgres 262144 Feb  4 19:39 0000
-rw-------  1 postgres postgres 262144 Feb 13 11:10 0001
-rw-------  1 postgres postgres 262144 Mar 15 15:20 0002
-rw-------  1 postgres postgres 229376 Mar 25 14:51 0003
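
The strace and gdb output in the question line up with this listing. As a minimal sketch, assuming the default 8192-byte block size and 32 pages per SLRU segment file (PostgreSQL defaults, not stated in the post), the page number 123 from the backtrace maps to segment 0003 at offset 221184, which is exactly the lseek target in the strace output. The helper name is hypothetical.

```python
BLCKSZ = 8192            # assumed default PostgreSQL block size
PAGES_PER_SEGMENT = 32   # assumed default pages per SLRU segment file

def locate_slru_page(pageno):
    """Return (segment file name, byte offset) for an SLRU page number."""
    segno, page_in_segment = divmod(pageno, PAGES_PER_SEGMENT)
    return "%04X" % segno, page_in_segment * BLCKSZ

# pageno=123 comes from frame #3 of the gdb backtrace.
segment_name, offset = locate_slru_page(123)

# Size of pg_clog/0003 taken from the directory listing above.
file_size = 229376
writes_last_page = (offset + BLCKSZ == file_size)

print(segment_name, offset, writes_last_page)
```

Under those assumptions, the page being rewritten in the loop is the last page currently present in 0003, and the three full segments are each 32 × 8192 = 262144 bytes.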


Noticing that /var/lib/postgresql/9.1/main/pg_clog/0003 was the file being opened and seeked in the strace output, and that it was also the file in question in the forum thread, I tried the action suggested there, which appends one 8 kB page of zeros to the file:

dd if=/dev/zero bs=8k count=1 >> /var/lib/postgresql/9.1/main/pg_clog/0003


And now postgres starts again.
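
For reference, the dd command appends a single 8192-byte page of zeros, growing 0003 from 28 pages to 29. The sketch below does the same thing to a scratch file so the effect on the size can be checked without touching a real data directory. The function name and scratch path are hypothetical; on a real cluster, stopping the server and copying the data directory first would be a sensible precaution.

```python
import os
import tempfile

PAGE_SIZE = 8192  # same as bs=8k in the dd command

def append_zero_page(path, page_size=PAGE_SIZE):
    """Append one page of zero bytes to the file and return its new size."""
    with open(path, "ab") as f:
        f.write(bytes(page_size))
    return os.path.getsize(path)

# Demonstration on a scratch file that mimics the size of pg_clog/0003.
with tempfile.TemporaryDirectory() as tmp:
    scratch = os.path.join(tmp, "0003")
    with open(scratch, "wb") as f:
        f.write(bytes(229376))
    new_size = append_zero_page(scratch)

print(new_size, new_size // PAGE_SIZE)
```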

Context

StackExchange Database Administrators Q#61650, answer score: 13
