PostgreSQL and the OOM Killer: Why Strict Memory Overcommit is Essential
PostgreSQL's Vulnerability to the Linux OOM Killer
A single OOM (out of memory) kill can crash an entire PostgreSQL cluster, forcing a full database restart and crash recovery. This happens because of PostgreSQL's shared memory architecture. The main supervisor process (postmaster) forks backend processes for each connection, and these backends share critical memory segments for shared buffers, WAL buffers, and lock tables.
When the Linux OOM killer terminates a backend process to free memory, it does so without understanding these shared dependencies. If a backend is killed while modifying a shared memory segment, that segment can be left in an inconsistent state. To prevent silent data corruption, the postmaster detects the loss of a child process and immediately terminates all other remaining backends, dropping every active connection and aborting all in-flight transactions. The subsequent crash recovery process can cause significant downtime, especially under high write volumes.
Strict Memory Overcommit: Converting Catastrophes to Routine Errors
Strict memory overcommit (vm.overcommit_memory=2) protects PostgreSQL by failing memory allocations early and gracefully rather than allowing the system to over-allocate and crash later.
Linux provides three overcommit policies:
- Mode 0 (Heuristic): The default. The kernel allows most allocations and refuses only obvious overcommits, such as a single request far larger than the memory available.
- Mode 1 (Always): The kernel never refuses an allocation. If physical memory runs out, the OOM killer terminates processes.
- Mode 2 (Strict): The kernel tracks total committed virtual memory (Committed_AS) and enforces a CommitLimit. Any allocation exceeding this limit is immediately refused with an ENOMEM error.
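As a rough illustration of the two counters that Mode 2 compares, the sketch below parses text in the /proc/meminfo format and reports how much commit headroom remains. The helper names (parse_meminfo_kb, commit_headroom_kb) and the sample figures are invented for this example; on a real Linux host you would pass in the contents of /proc/meminfo instead.

```python
def parse_meminfo_kb(text):
    """Parse /proc/meminfo-style text into {field: integer value}.

    Most fields are in kB; a few (e.g. hugepage counters) carry no unit
    and are returned as plain counts.
    """
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines that are not "Name: value" pairs
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def commit_headroom_kb(fields):
    """kB that can still be committed before strict mode refuses requests."""
    return fields["CommitLimit"] - fields["Committed_AS"]


# Illustrative sample values, not taken from a real host.
SAMPLE = """\
MemTotal:        8388608 kB
CommitLimit:     7130316 kB
Committed_AS:    5242880 kB
"""

if __name__ == "__main__":
    info = parse_meminfo_kb(SAMPLE)
    print("headroom (kB):", commit_headroom_kb(info))
```

A negative or rapidly shrinking headroom is the condition under which allocations would start failing with ENOMEM in Mode 2.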
For PostgreSQL, ENOMEM is a routine error. A backend that cannot allocate memory will report the error to the client and cancel the transaction, but the postmaster and all other connections remain unaffected. This converts a system-wide failure into a localized, manageable error.
Case Study: The "Phantom Memory" Kernel Bug
A subtle bug in the Linux 6.5 kernel caused Committed_AS to leak, leading to false OOM errors even when physical memory was abundant.
Ubicloud discovered that some 8 GB servers reported over 650 GB of committed memory. Analysis revealed that servers running kernel 6.5.0 were 52x more likely to experience this inflation than those on 6.8.0, with the leak growing at roughly 4.7% compound per week.
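To get a feel for what 4.7% compound weekly growth means, the following back-of-the-envelope sketch computes the doubling time and the growth over a year. It assumes the rate stays constant, which is a simplification; the function names are illustrative only.

```python
import math


def weeks_to_double(weekly_rate):
    """Weeks needed for a quantity to double at a constant compound rate."""
    return math.log(2) / math.log(1 + weekly_rate)


def growth_factor(weekly_rate, weeks):
    """Multiplicative growth after the given number of weeks."""
    return (1 + weekly_rate) ** weeks


if __name__ == "__main__":
    rate = 0.047  # 4.7% per week, as reported above
    print("weeks to double:", round(weeks_to_double(rate), 1))
    print("growth over 52 weeks:", round(growth_factor(rate, 52), 1))
```

Under that assumption the leaked counter doubles roughly every 15 weeks and grows about elevenfold over a year, which is consistent with a long-running host eventually reporting far more committed memory than it physically has.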
The Root Cause
The bug was introduced in commit 408579c and centered on a one-character change in mm/mremap.c within the move_vma() function. The check on the return value of do_vmi_munmap() was changed from < 0 (branch taken on error) to ! (branch taken on success).
Because the condition was inverted, the kernel re-incremented the committed memory counter on every successful memory remap instead of only on failure. This caused Committed_AS to grow monotonically over time. This bug remained hidden under default heuristic overcommit because the kernel does not use Committed_AS to gate allocations in Mode 0; it only causes failures when strict overcommit (Mode 2) is enabled.
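The effect of the inverted check can be shown with a toy model. This is not kernel code: it is a simplified Python sketch of the accounting described above, with a hypothetical function account_after_munmap standing in for the branch in move_vma(), and it assumes the C convention that a return value of 0 means success and a negative value means error.

```python
def account_after_munmap(committed_pages, old_len_pages, munmap_result, inverted):
    """Return the committed-page counter after the post-munmap check.

    munmap_result: 0 on success, negative on error (assumed C convention).
    inverted: True models the buggy `!result` check, False the `< 0` check.
    """
    if inverted:
        take_branch = not munmap_result   # buggy: true when the call succeeded
    else:
        take_branch = munmap_result < 0   # intended: true only on error
    if take_branch:
        committed_pages += old_len_pages  # re-add the old mapping's pages
    return committed_pages


def simulate(remaps, pages_per_remap, inverted):
    """Run a series of successful remaps and return the final counter."""
    committed = 0
    for _ in range(remaps):
        committed = account_after_munmap(committed, pages_per_remap, 0, inverted)
    return committed


if __name__ == "__main__":
    print("intended check:", simulate(1000, 256, inverted=False))
    print("inverted check:", simulate(1000, 256, inverted=True))
```

With the intended check, successful remaps leave the counter unchanged; with the inverted check, every successful remap adds pages that are never subtracted again, so the counter only ever grows.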
Heuristics for Setting the Commit Limit
To implement strict overcommit safely, the commit limit must account for both kernel overhead and sidecar processes.
Ubicloud uses the following formula to calculate overcommit_kbytes:
overcommit_kbytes = (total_memory_kb * 0.75 * 0.8) + 2 * 1048576
Breakdown of the Calculation
- The 80% Rule: Userspace may commit up to 80% of the non-hugepage memory. The other 20% is held back for kernel data structures (page tables, slab caches, network buffers). This 20% is not wasted: it remains available to the page cache, which improves PostgreSQL read performance. Page cache is reclaimable and does not count toward Committed_AS.
- The 2 GB Buffer: A fixed 2 GB offset is added to accommodate sidecar processes (e.g., Prometheus, node_exporter, wal-g). Many of these are written in Go, which reserves large virtual memory regions upfront. Ubicloud's data showed that 96% of their servers' sidecars consumed less than 1 GB of committed memory, making 2 GB a safe, generous ceiling.
- Hugepage Adjustment: In this formula, total memory is first multiplied by 0.75 because 25% of memory is reserved for hugepages, which are accounted separately and do not count toward the commit limit.
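The formula can be worked through for a concrete machine. The sketch below is one possible rendering in Python, using integer arithmetic to avoid fractional kilobytes; the function names, parameter names, and default values are choices made for this example (the defaults mirror the 25%, 80%, and 2 GB figures above), and sysctl_lines simply formats the two kernel settings as text.

```python
KB_PER_GB = 1048576


def overcommit_kbytes(total_memory_kb, hugepage_pct=25, userspace_pct=80, sidecar_gb=2):
    """Commit limit in kB, following the formula above."""
    # Remove the share reserved for hugepages (multiply by 0.75 by default).
    non_hugepage_kb = total_memory_kb * (100 - hugepage_pct) // 100
    # Let userspace commit 80% of what remains.
    userspace_kb = non_hugepage_kb * userspace_pct // 100
    # Add the fixed buffer for sidecar processes.
    return userspace_kb + sidecar_gb * KB_PER_GB


def sysctl_lines(total_memory_kb):
    """Render the two sysctl settings as text lines."""
    return [
        "vm.overcommit_memory = 2",
        f"vm.overcommit_kbytes = {overcommit_kbytes(total_memory_kb)}",
    ]


if __name__ == "__main__":
    eight_gb_kb = 8 * KB_PER_GB
    for line in sysctl_lines(eight_gb_kb):
        print(line)
```

For an 8 GB server this gives 7130316 kB, or about 6.8 GB: 6 GB left after the hugepage reservation, 80% of that for userspace, plus the 2 GB buffer.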
Implementation Summary
For production PostgreSQL deployments, enabling vm.overcommit_memory=2 is recommended to avoid catastrophic OOM kills. However, this should only be done after monitoring the memory characteristics of the workload to ensure the CommitLimit is not set so low that it triggers frequent, unnecessary ENOMEM errors.
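One way to act on that advice is to compare Committed_AS samples collected from the workload against a proposed limit before switching modes. The sketch below is a minimal example of such a check; the function names and the 90% utilization threshold are assumptions for illustration, not values from Ubicloud.

```python
def peak_utilization(committed_as_samples_kb, proposed_limit_kb):
    """Highest observed Committed_AS as a fraction of the proposed limit."""
    if not committed_as_samples_kb:
        raise ValueError("need at least one Committed_AS sample")
    return max(committed_as_samples_kb) / proposed_limit_kb


def limit_looks_safe(committed_as_samples_kb, proposed_limit_kb, max_utilization=0.9):
    """True if the observed peak stays under the chosen utilization threshold."""
    return peak_utilization(committed_as_samples_kb, proposed_limit_kb) <= max_utilization


if __name__ == "__main__":
    # Hypothetical samples in kB, e.g. scraped periodically from /proc/meminfo.
    samples = [4_000_000, 5_242_880, 6_000_000]
    print(limit_looks_safe(samples, 7_130_316))
```

If the peak sits close to or above the proposed limit, the workload would likely see ENOMEM errors under strict overcommit, and either the limit or the memory settings of the workload need revisiting first.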