During its operation, PostgreSQL records changes to transaction log (WAL) files, but it doesn't immediately flush them to the actual database tables. It usually just keeps the changes in memory and returns them from memory when they are requested, unless RAM starts getting full and it has to write them out.
This means that if it crashes, the on-disk tables won't be up to date. It has to replay the transaction logs, applying the changes to the on-disk tables, before it can start back up. That can take a while for a big, busy database.
For that reason, and so that the transaction logs do not keep growing forever, PostgreSQL periodically does a checkpoint where it makes sure the DB is in a clean state. It flushes all pending changes to disk and recycles the transaction logs that were being used to keep a crash recovery record of the changes.
This flush happens in two phases:
- Buffered write()s of dirty shared_buffers to the tables; and
- fsync() of affected files to make sure the changes really hit disk.
Both of those can increase disk I/O load. Contention caused by these writes can slow down reads, and can also slow down flushing of WAL segments that's required in order to commit transactions.
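For reference, how often these checkpoints happen is controlled from postgresql.conf. As a rough sketch (the parameter names vary by version - newer releases use max_wal_size, older ones use checkpoint_segments - so check the docs for your release rather than copying these values):

checkpoint_timeout = 5min      # force a checkpoint at least this often
max_wal_size = 1GB             # 9.5+: checkpoint once roughly this much WAL has built up
#checkpoint_segments = 3       # pre-9.5 equivalent, counted in 16MB WAL segments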
It's been a longstanding challenge, but it's getting worse as systems ship with more and more RAM, so they can buffer more data and take longer to write it out. There's ongoing discussion between the Linux and PostgreSQL communities on how to deal with this, covered in this LWN.net article. (LWN.net won't be able to keep writing this sort of great work if people don't subscribe. I'm a subscriber and sharing this link because it's useful and informative. Please consider subscribing if you want to see more of this sort of thing.)
The main thing you can do to reduce the impact of checkpoints at the moment is to spread checkpoint activity out by increasing checkpoint_completion_target, so that more of the data has already been written out by the time the checkpoint has to complete. This has a cost, though - if you update a page (say) ten times, it might be written to disk multiple times before the checkpoint with a high completion target, even though it only strictly had to be written out once for crash safety. A higher completion target makes for smoother I/O patterns but more overall I/O overhead.
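As a sketch, that might look like the following in postgresql.conf (0.9 is a common choice, but treat it as illustrative rather than a one-size-fits-all recommendation):

checkpoint_completion_target = 0.9   # aim to finish checkpoint writes at 90% of the checkpoint interval

The setting takes effect on a reload (pg_ctl reload, or SELECT pg_reload_conf(); from psql) - no restart needed.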
The other thing you can do to help is tell your operating system to start writing out buffered data sooner. This is the kernel-side counterpart of setting checkpoint_completion_target and has a similar trade-off. See the Linux vm documentation, in particular dirty_background_bytes, dirty_background_ratio and dirty_expire_centisecs.
Flushing the dirty OS file system buffers caused by exceeding dirty_bytes or dirty_ratio is a foreground blocking operation!
The kernel tunables dirty_bytes, dirty_background_bytes, dirty_ratio, dirty_background_ratio and dirty_expire_centisecs control flushing of dirty OS file system buffers to disk. dirty_bytes is the threshold in bytes; dirty_ratio is the threshold as a ratio of total memory. dirty_background_bytes and dirty_background_ratio are similar thresholds, but flushing happens in the background and does not block other read/write operations until it completes. dirty_expire_centisecs is how many centiseconds dirty data may sit in memory before it becomes eligible for flushing.
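Before touching anything, it's worth noting down what your kernel is currently using. Something like this will show the live values (note that the _bytes and _ratio variants are mutually exclusive - setting one zeroes the other):

$ sysctl vm.dirty_background_ratio vm.dirty_background_bytes vm.dirty_ratio vm.dirty_bytes vm.dirty_expire_centisecs

or, equivalently, cat the matching files under /proc/sys/vm/.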
Recently the defaults for these tunables were lowered in Linux, as the memory size of modern machines has increased dramatically. Even ratios of 5% and 10% for dirty_background_ratio and dirty_ratio on a 256GB machine can flood an I/O system.
Tuning dirty_background_bytes or dirty_background_ratio to start flushing dirty buffers in the background is tricky. Fortunately you can tune these settings without having to stop either PostgreSQL or the host by echoing new values to the appropriate files:
$ echo [int value of bytes] | sudo tee /proc/sys/vm/dirty_background_bytes
for example, to set the number of dirtied bytes that triggers a background flush. If you are using a battery-backed, capacitor-backed, or flash memory RAID card (you do want to keep your data in case of a crash, don't you?), start by tuning dirty_background_bytes to 1/2 the write cache buffer size and dirty_bytes to 3/4 of that size. Monitor your I/O profile with iostat; if you are still seeing latency issues, your database write load is still overwhelming the file buffer cache flushes. Turn the values down until latency improves, or consider upgrading your I/O subsystem. FusionIO cards and SSDs are two possibilities for extreme I/O throughput.
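To make that concrete, here's a sketch assuming a hypothetical RAID controller with a 1GB write cache - adjust for your own hardware. Half of 1GB is 512MB (536870912 bytes) for dirty_background_bytes, and three quarters is 768MB (805306368 bytes) for dirty_bytes:

$ echo 536870912 | sudo tee /proc/sys/vm/dirty_background_bytes
$ echo 805306368 | sudo tee /proc/sys/vm/dirty_bytes

To keep the values across reboots, add the matching vm.dirty_background_bytes and vm.dirty_bytes lines to /etc/sysctl.conf (or a file under /etc/sysctl.d/).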
Good luck!
Call pg_basebackup with the option --checkpoint=fast to force a fast checkpoint rather than waiting for a spread one to complete.
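For example (the target directory and any connection options are placeholders for whatever you normally use):

$ pg_basebackup -D /path/to/backup --checkpoint=fast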
It's also possible to force an immediate checkpoint by hand. To do so, run CHECKPOINT; on the master server:
$ sudo su - postgres
$ psql
postgres=# CHECKPOINT;