What happens in PostgreSQL checkpoint?
Problem
Here's part of my checkpoint log:
I noticed that sometimes our database is very slow - you can see a very large number of normally short queries stuck for much longer than normal. It happens regularly without a clear culprit.
Question: Could checkpoint cause this? What happens in the "sync" phase of checkpoint?
2014-03-26 11:51:29.341 CDT,,,18682,,532854fc.48fa,4985,,2014-03-18 09:15:24 CDT,,0,LOG,00000,"checkpoint complete: wrote 15047 buffers (1.4%); 0 transaction log file(s) added, 0 removed, 30 recycled; write=68.980 s, sync=1.542 s, total=70.548 s; sync files=925, longest=0.216 s, average=0.001 s",,,,,,,,,""
2014-03-26 11:56:05.430 CDT,,,18682,,532854fc.48fa,4987,,2014-03-18 09:15:24 CDT,,0,LOG,00000,"checkpoint complete: wrote 16774 buffers (1.6%); 0 transaction log file(s) added, 0 removed, 31 recycled; write=72.542 s, sync=17.164 s, total=89.733 s; sync files=885, longest=3.812 s, average=0.019 s",,,,,,,,,""
2014-03-26 12:01:21.650 CDT,,,18682,,532854fc.48fa,4989,,2014-03-18 09:15:24 CDT,,0,LOG,00000,"checkpoint complete: wrote 14436 buffers (1.4%); 0 transaction log file(s) added, 0 removed, 33 recycled; write=122.350 s, sync=5.212 s, total=127.676 s; sync files=924, longest=3.740 s, average=0.005 s",,,,,,,,,""
2014-03-26 12:06:25.028 CDT,,,18682,,532854fc.48fa,4991,,2014-03-18 09:15:24 CDT,,0,LOG,00000,"checkpoint complete: wrote 13277 buffers (1.3%); 0 transaction log file(s) added, 0 removed, 29 recycled; write=126.217 s, sync=5.733 s, total=131.991 s; sync files=894, longest=1.859 s, average=0.006 s",,,,,,,,,""
2014-03-26 12:10:41.958 CDT,,,18682,,532854fc.48fa,4993,,2014-03-18 09:15:24 CDT,,0,LOG,00000,"checkpoint complete: wrote 20765 buffers (2.0%); 0 transaction log file(s) added, 0 removed, 28 recycled; write=88.015 s, sync=10.818 s, total=98.872 s; sync files=881, longest=2.690 s, average=0.012 s",,,,,,,,,""
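To make spikes like the sync=17.164 s line stand out, one can pull the timings out of these "checkpoint complete" messages. Below is a hypothetical helper (not part of PostgreSQL or any tool; the function name and regex are my own) that extracts the write/sync/total seconds from such a line:

```python
# Hypothetical helper: extract the write/sync/total timings from a
# PostgreSQL "checkpoint complete" log line so slow sync phases are
# easy to spot. The regex matches the "write=... s, sync=... s,
# total=... s" fragment shown in the log excerpt above.
import re

PATTERN = re.compile(
    r"write=(?P<write>[\d.]+) s, sync=(?P<sync>[\d.]+) s, total=(?P<total>[\d.]+) s"
)

def checkpoint_timings(line):
    """Return (write, sync, total) in seconds, or None if the line has no timings."""
    m = PATTERN.search(line)
    if m is None:
        return None
    return tuple(float(m.group(k)) for k in ("write", "sync", "total"))

line = ('LOG,00000,"checkpoint complete: wrote 16774 buffers (1.6%); '
        '0 transaction log file(s) added, 0 removed, 31 recycled; '
        'write=72.542 s, sync=17.164 s, total=89.733 s; '
        'sync files=885, longest=3.812 s, average=0.019 s"')
print(checkpoint_timings(line))  # (72.542, 17.164, 89.733)
```

Run over the whole log, this makes it easy to correlate sync-phase spikes with the periods when queries stall.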
Solution
During its operation, PostgreSQL records changes to transaction log files, but it doesn't immediately flush them to the actual database tables. It usually just keeps the changes in memory, and returns them from memory when they are requested, unless RAM starts getting full and it has to write them out.
This means that if it crashes, the on-disk tables won't be up to date. It has to replay the transaction logs, applying the changes to the on-disk tables, before it can start back up. That can take a while for a big, busy database.
For that reason, and so that the transaction logs do not keep growing forever, PostgreSQL periodically does a checkpoint where it makes sure the DB is in a clean state. It flushes all pending changes to disk and recycles the transaction logs that were being used to keep a crash recovery record of the changes.
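The log-then-checkpoint cycle described above can be illustrated with a toy sketch (purely illustrative; the class and its fields are my own invention and bear no resemblance to PostgreSQL's actual implementation): changes go to an append-only log first, the "on-disk" tables are only brought up to date at checkpoint time, and after a crash the log is replayed.

```python
# Toy sketch of the WAL-then-checkpoint cycle (illustration only,
# nothing like PostgreSQL's real implementation).
class ToyDatabase:
    def __init__(self):
        self.wal = []        # durable log of changes (would be fsync'd per commit)
        self.tables = {}     # "on-disk" table state, updated lazily
        self.dirty = {}      # in-memory changes not yet flushed to tables

    def update(self, key, value):
        self.wal.append((key, value))  # record the change first (write-ahead)
        self.dirty[key] = value        # keep the new value in memory only

    def checkpoint(self):
        self.tables.update(self.dirty)  # flush all pending changes to tables
        self.dirty.clear()
        self.wal.clear()                # log no longer needed for crash recovery

    def recover(self):
        # After a crash the in-memory state is gone; replay the log
        # to bring the tables back up to date before restarting.
        for key, value in self.wal:
            self.tables[key] = value
        self.wal.clear()

db = ToyDatabase()
db.update("a", 1)
db.update("a", 2)
db.checkpoint()
print(db.tables)  # {'a': 2}
```

Note how checkpointing both bounds the log's size and bounds recovery time: the longer you go without one, the more log there is to replay after a crash.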
This flush happens in two phases:
- Buffered write()s of dirty shared_buffers to the tables; and
- fsync() of affected files to make sure the changes really hit disk.
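The two phases can be mimicked directly with the same system calls. A minimal sketch using Python's os module (a demonstration of the pattern, not PostgreSQL code; real checkpoints do this across hundreds of data files, which is why the "sync files=" counts above are so high):

```python
# Minimal sketch of the two flush phases. A buffered write() just puts
# data in the OS page cache and usually returns quickly; fsync() then
# forces it to stable storage, which is where the latency in the
# "sync=" figures above comes from.
import os
import tempfile

fd, path = tempfile.mkstemp()
try:
    os.write(fd, b"dirty page contents")  # phase 1: buffered write, fast
    os.fsync(fd)                          # phase 2: force to disk, can stall
finally:
    os.close(fd)
    os.remove(path)
```

If the kernel has accumulated a lot of dirty data for a file, that fsync() call can block for seconds, and every query that needs the same disks queues up behind it.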
Both of those can increase disk I/O load. Contention caused by these writes can slow down reads, and can also slow down flushing of WAL segments that's required in order to commit transactions.
It's been a longstanding challenge, but it's getting worse as we see systems with more and more RAM so they can buffer more data and take longer to write it out. There's discussion between the Linux and PostgreSQL communities on how to deal with this at the moment, as discussed in this LWN.net article. (LWN.net won't be able to keep writing this sort of great work if people don't subscribe. I'm a subscriber and sharing this link because it's useful and informative. Please consider subscribing if you want to see more of this sort of thing.)
The main thing you can do to reduce the impact of checkpoints at the moment is to spread checkpoint activity out by increasing checkpoint_completion_target, so that more of the data has been written out by the time the final checkpoint arrives. This has a cost, though: if you update a page (say) ten times, it might be written to disk multiple times before the checkpoint with a high completion target, even though it only strictly had to be written out once for crash safety. A higher completion target makes for smoother I/O patterns but more overall I/O overhead.

The other thing you can do to help is tell your operating system to immediately start writing data when it gets buffered writes. This is like the kernel side of setting checkpoint_completion_target and has a similar trade-off. See the Linux vm documentation, in particular dirty_background_bytes, dirty_background_ratio and dirty_expire_centisecs.

Context
StackExchange Database Administrators Q#61822, answer score: 40
Revisions (0)
No revisions yet.