Linux Swap Doesn’t Completely Suck Anymore
For the longest time, seeing your Linux server hit swap meant your pager was about to go off. Swap was basically the waiting room for the OOM (Out of Memory) killer. You’d watch your load average spike to 80, SSH would freeze, and you’d just sit there waiting for the inevitable crash.
My rule for the last decade was simple: disable swap entirely on production database nodes. Let the kernel kill the offending process immediately rather than dragging the whole system down into an IO-bound death spiral. But I might actually have to change my defaults.
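For the record, "disable entirely" means more than flipping a sysctl: the runtime half is swapoff -a, and the persistent half looks roughly like this (the filenames here are conventions, not requirements):

```
# /etc/sysctl.d/99-noswap.conf  (hypothetical filename)
# Belt-and-braces: even if a swap device sneaks back, avoid using it.
vm.swappiness = 0

# /etc/fstab: comment out the swap entry so it stays off across reboots
# UUID=xxxx-xxxx  none  swap  sw  0  0
```

Note that vm.swappiness=0 alone doesn't disable swap; it only makes the kernel strongly prefer reclaiming page cache over swapping anonymous pages.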
A recent, massive overhaul of the Linux kernel’s swap code has fundamentally changed how memory pages are pushed to disk. We’re seeing performance gains north of 20% in specific synthetic tests, but more importantly, real-world workloads are actually surviving memory pressure without completely locking up.
What actually changed in the memory management
If you haven’t been following the mailing lists, the core issue with the old swap implementation was lock contention. That was the real killer.
When you have a modern 64-core processor trying to dump gigabytes of stale memory pages to an NVMe drive, having all those threads fight over global spinlocks in the swap cache is exactly as disastrous as it sounds. The CPU spends more time fighting itself for permission to write to the swap table than it does actually moving data.
The recent patches basically ripped out the old swap table and swap cache mechanisms, replacing them with a much more granular, mostly lockless approach. Threads can now allocate and free swap slots concurrently without waiting in a massive single-file line.
I decided to test this myself last Thursday. I had a spare bare-metal Debian 12 box with 64GB of RAM and a fast PCIe 4.0 NVMe drive. I compiled the patched kernel, booted it up, and intentionally caused a memory crisis.
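If you want to manufacture a crisis like this yourself, one low-tech option is to fill tmpfs, which lives in RAM, so the kernel is forced into reclaim and then swap. This is a sketch, not my exact procedure; the mount point and sizes are placeholders you should scale to just under your machine's RAM.

```shell
#!/bin/sh
# Fill RAM via tmpfs to force the kernel into reclaim and swap.
# /dev/shm is tmpfs on most distros; the argument is a size in MiB.
memhog() {
    dd if=/dev/zero of=/dev/shm/memhog bs=1M count="$1" status=none
}

# Example: eat ~57 GiB on a 64 GiB box (adjust to your machine):
# memhog 58368
# Clean up afterwards, or the pages stay pinned in tmpfs:
# rm -f /dev/shm/memhog
```

Watch vmstat 1 in another terminal while this runs; the si/so columns light up once the kernel starts paging.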
Benchmarking Valkey under heavy memory pressure
Synthetic benchmarks are fine, but I care about databases. Specifically, in-memory datastores that occasionally need to dump to disk.
I spun up an instance of Valkey 8.0.1. I loaded it up with about 58GB of data, leaving barely enough headroom for the OS. Then I triggered a heavy BGSAVE operation while simultaneously hammering the server with a massive write workload from memtier_benchmark.
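For anyone who wants to replicate the shape of this test, the load side looks roughly like the wrapper below. The host, port, and flag values are placeholders, not my exact invocation, so tune them to your environment.

```shell
#!/bin/sh
# Hypothetical load wrapper around memtier_benchmark.
# --ratio=1:0 makes it a pure write (SET) workload; -d is the value
# size in bytes; -t and -c control threads and connections per thread.
hammer() {
    memtier_benchmark -s "${1:-127.0.0.1}" -p "${2:-6379}" \
        --ratio=1:0 -d 1024 -t 8 -c 50 --test-time=300
}

# In one shell:  valkey-cli BGSAVE
# In another:    hammer
```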
Here are the actual numbers from my test:
Old Kernel (6.1 LTS):
- Average latency during BGSAVE: 45ms
- 99th percentile (p99) latency: 412ms
- System load average: 48.2
New Patched Kernel:
- Average latency during BGSAVE: 12ms
- 99th percentile (p99) latency: 89ms
- System load average: 14.5
That p99 drop is massive. Going from 412ms to 89ms is the difference between an application timing out and a user just thinking the app is slightly slow. The system actually remained responsive while it was heavily swapping to disk.
Kernel compilation times showed a similar trend. I ran a make -j128 on a memory-constrained VM, forcing it to swap. It finished about 8% faster than the baseline. Not a miracle, but I’ll take free performance any day.
The zswap gotcha
There is one edge case the initial reports didn’t mention loudly enough. I wasted about two hours trying to replicate these 20% gains on a different staging cluster before I realized my mistake.
If you are heavily relying on zswap (which compresses memory pages before they ever hit the physical disk), the gains from this new swap overhaul are much less pronounced. Zswap already masks a lot of the underlying block IO pain. On my staging cluster running Ubuntu 24.04 with zswap aggressively configured, the performance difference was maybe 2-3%.
So if you’re already compressing swap in RAM, don’t expect a massive difference. But for raw disk swap? Night and day.
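If you're not sure whether a given box is even using zswap, the kernel exposes its knobs under /sys. A quick check (the paths are standard, but the module may simply be absent on some kernels):

```shell
#!/bin/sh
# Report whether zswap is enabled, plus its compressor, if the
# kernel exposes the module parameters under /sys.
zswap_status() {
    if [ -r /sys/module/zswap/parameters/enabled ]; then
        echo "zswap enabled: $(cat /sys/module/zswap/parameters/enabled)"
        echo "compressor:    $(cat /sys/module/zswap/parameters/compressor)"
    else
        echo "zswap module parameters not exposed on this kernel"
    fi
}

zswap_status
```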
How I’m monitoring this now
Since I’m actually letting my servers swap again, I needed better visibility into what the swap cache is doing. Standard vmstat is a bit too coarse for debugging lock contention.
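That said, before reaching for eBPF it's worth knowing the raw counters: /proc/vmstat exposes pswpin and pswpout (pages swapped in and out since boot), and diffing them over an interval gives you a swap rate. A minimal sketch:

```shell
#!/bin/sh
# Print pages swapped in/out per second by diffing /proc/vmstat
# counters across an interval (default: 1 second).
swap_rate() {
    interval="${1:-1}"
    in0=$(awk '$1=="pswpin"{print $2}' /proc/vmstat)
    out0=$(awk '$1=="pswpout"{print $2}' /proc/vmstat)
    sleep "$interval"
    in1=$(awk '$1=="pswpin"{print $2}' /proc/vmstat)
    out1=$(awk '$1=="pswpout"{print $2}' /proc/vmstat)
    echo "swapin/s:  $(( (in1 - in0) / interval ))"
    echo "swapout/s: $(( (out1 - out0) / interval ))"
}

swap_rate 1
```

This tells you how much swapping is happening, but not where the time goes; that's what the eBPF script below is for.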
I wrote this quick eBPF script using bpftrace to watch swap allocations in real-time. It hooks into the kernel’s page allocation functions to show exactly how much latency the swap process is adding.
#!/usr/bin/env bpftrace
// Watch swap-out latency
kprobe:add_to_swap
{
    @start[tid] = nsecs;
}

kretprobe:add_to_swap
/@start[tid]/
{
    // nanoseconds -> microseconds
    $duration = (nsecs - @start[tid]) / 1000;
    @swap_out_us = hist($duration);
    delete(@start[tid]);
}

// Print the latency histogram every 5 seconds
interval:s:5
{
    print(@swap_out_us);
    clear(@swap_out_us);
}

// Don't dump the in-flight timestamp map when the script exits
END
{
    clear(@start);
}
Run that while your system is under load. If your latency histogram is heavily skewed to the right (lots of operations taking thousands of microseconds), your swap table is bottlenecking. On the new kernel code, you’ll see this histogram shift dramatically to the left.
I expect this to become the default very quickly. By Q1 2027, when these patches fully trickle down into the ultra-conservative enterprise distributions, a lot of sysadmins are going to notice their memory-heavy applications mysteriously surviving load spikes that used to crash them.
I’m still not going to run my primary database clusters with 100GB of swap space. That’s just asking for trouble. But I am going to start re-enabling small, fast NVMe swap partitions on my cache nodes. The penalty for hitting swap is no longer a death sentence for the server.
It only took them a few decades to fix it. But better late than never.
