Why is my postmaster process (sometimes) becoming unmanageable after a WAL base restore?

Submitted by: @import:stackexchange-dba

Problem

TL;DR: When Postgres is started right after its data directory has been restored from a WAL base backup, it sometimes spawns a postmaster that is unusable and cannot be stopped. Why?

Context:

We run PostgreSQL 8.4 on CentOS 6, using the PGDG packages. We have a script for developer test environments that restores a nightly backup of our production server's data directory (created between calls to pg_start_backup and pg_stop_backup). The script decompresses the backup file and uses restore_command to replay any WAL files that were generated on production while the backup was being taken.
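For readers unfamiliar with this setup, here is a minimal sketch of how such a script might point the restored server at the archived WAL files. The function name and both paths are hypothetical and not taken from the original script. In 8.4, recovery is driven by a recovery.conf file in the data directory, which the server renames to recovery.done once recovery finishes.

```shell
#!/bin/sh
# Sketch only: write_recovery_conf and its arguments are illustrative names,
# not part of the script described in the question.
write_recovery_conf() {
    datadir="$1"   # restored data directory
    archive="$2"   # directory holding the archived WAL segments

    # %f is the WAL file name the server asks for,
    # %p is the path the server wants it copied to.
    cat > "$datadir/recovery.conf" <<EOF
restore_command = 'cp $archive/%f "%p"'
EOF
}

# Example call (hypothetical paths):
#   write_recovery_conf /var/lib/pgsql/data /mnt/wal_archive
```

The file must exist before the server is started, otherwise the server performs ordinary crash recovery instead of archive recovery.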

It usually works fine, and restores hundreds of times faster than an SQL-based restore of a pg_dump'ed file.

Problem:

Sometimes, after unzipping the data directory, the script starts Postgres by running /etc/init.d/postgresql start (a symlink to /etc/init.d/postgresql-8.4, which keeps the init script name predictable for when we eventually upgrade to 9.*). The init script reports "OK", as if the server had started correctly. But the WALs are never restored; the script hangs indefinitely, waiting for a recovery.done file to appear.

What I've Tried:

When I ran /etc/init.d/postgresql status during the indefinite hang, the init script reported "dead but pid file exists".

Then I ran ps -ef | grep post. Oddly, the postmaster process, the archiver, and the other helper processes were all running. All of the invocation parameters were correct (right data directory, and so on).
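One way to cross-check a disagreement between the init script's status and the ps output is to read the PID straight from the data directory. The sketch below assumes only that the first line of postmaster.pid in the data directory holds the postmaster's PID; the function name is hypothetical.

```shell
#!/bin/sh
# Sketch only: postmaster_state is an illustrative helper name.
# Prints "no-pidfile", "running <pid>" or "stale <pid>".
postmaster_state() {
    pidfile="$1/postmaster.pid"

    if [ ! -f "$pidfile" ]; then
        echo "no-pidfile"
        return 0
    fi

    # The first line of postmaster.pid is the postmaster's PID.
    pid=$(head -n 1 "$pidfile")

    # ps -p exits non-zero when no process with that PID exists.
    if ps -p "$pid" > /dev/null 2>&1; then
        echo "running $pid"
    else
        echo "stale $pid"
    fi
}
```

This only shows whether some process with that PID exists. It does not confirm that the process is a postmaster or that it is serving the expected data directory, so the ps -ef output is still worth checking alongside it.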

When I ran psql, it connected to a running postmaster and found an initialized postgres database, but not the main database, which is the one we care about restoring via the WAL script.

I then checked the perms on the data dir, and everything looked OK.

Running /etc/init.d/postgresql stop reported "OK", and killed the archiver/watcher processes, but the postmaster stayed running.

The same thing happened when I tried killall -r '.postmaster.'.

The only thing that worked to resume the stuck WAL restore was a killall -s 3 -r '.postmaster.' (Signal 3 is SIGQUIT), and then a `/etc/init.d/postgresql st
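For reference, the postmaster treats SIGTERM, SIGINT and SIGQUIT as requests for a smart, fast and immediate shutdown respectively, and pg_ctl exposes the same three modes through its -m option. A plain killall sends SIGTERM, which is the gentlest of the three. The following is a minimal sketch of an escalating stop, not a fix for the hang described here. The function name and the PG_CTL override variable are hypothetical.

```shell
#!/bin/sh
# Sketch only: stop_escalating and PG_CTL are illustrative names.
# Tries each shutdown mode in turn and prints the one that succeeded.
stop_escalating() {
    datadir="$1"

    for mode in smart fast immediate; do
        echo "trying $mode shutdown" >&2
        # -w makes pg_ctl wait for the shutdown to complete; pg_ctl exits
        # non-zero if the server is still up when the wait runs out.
        if "${PG_CTL:-pg_ctl}" stop -D "$datadir" -m "$mode" -w; then
            echo "$mode"
            return 0
        fi
    done

    return 1
}
```

An immediate shutdown skips the normal shutdown checkpoint, so the server goes through recovery again on its next start. That makes it a last resort rather than a routine step.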

Solution

In general, if you are seeing problems of this sort, it may be best to take them to the pgsql-bugs list. People there can help work out what information to gather, determine the scope of the misbehavior, and help get it fixed.

Also, a WAL restore between 8.4.11 and 8.4.12 should work just fine.

If this only happens occasionally, I don't think your explanations account for it. It sounds like something that really needs further troubleshooting by people who can determine whether a code fix is required.

Context

StackExchange Database Administrators Q#20361, answer score: 2
