on swapping and kernels

There is much more to write about the work we do at Facebook on memory management efficiency, but there was one detour investigation in mid-2012 that I recently had to revisit, courtesy of Wikipedia.

There are lots of factors that make machines page memory out to disk, slowing everything down and locking software up – from file system cache pressure to runaway memory leaks to greedy kernel drivers. But certain swap-out scenarios are confusing – the system seems to have plenty of memory available, with proper settings the file system cache should not cause swapping, and obviously in a production environment all the memory leaks have been ironed out.

And yet in mid-2012 we noticed that machines running our new kernel were swapping out for no obvious reason. When it comes to swapping, the MySQL community will always point to Jeremy’s post on “swap insanity” – it has something to do with NUMA and whatnot. But what we observed was odd – there was free memory available on multiple NUMA nodes when the swapping happened. Of course, one of our kernel engineers wrote a NUMA rebalancing tool that attaches to running CPUs and evens out memory allocations without any downtime (not that we ended up using it…) – just in case the issue Jeremy described was actually an issue for us.

In some cases systems threw warning messages into kernel logs that immediately helped us get closer to the problem – the network device driver was failing to allocate 16k memory blocks.

Inside the Linux kernel there are two main ways to allocate memory: kmalloc and vmalloc. Generally, vmalloc goes through standard memory management – if you ask for 16k, it will glue together 4k pages and the allocation will succeed without any problems.

kmalloc, though, is what device drivers use when the hardware does direct memory access (DMA) – those address ranges have to be physically contiguous, so to satisfy the allocation one has to find adjacent free pages. Unfortunately, the easiest way to free up memory is to look at the tail of the LRU list and drop some pages – and that does not yield contiguous ranges.
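
To make the contrast concrete, here is a sketch of the two calls as a driver might use them – not code from any real driver, just an illustration of the difference (GFP_ATOMIC is the usual flag in interrupt context, which is where network drivers tend to allocate):

/* illustration only: two ways to get a 16k buffer inside the kernel */
#include <linux/slab.h>     /* kmalloc / kfree */
#include <linux/vmalloc.h>  /* vmalloc / vfree */

static void *dma_buf;      /* must be physically contiguous (DMA target) */
static void *scratch_buf;  /* only has to look contiguous to kernel code */

static int alloc_buffers(void)
{
        /* physically contiguous 16k – the allocation that can fail on a
           fragmented machine, exactly the warning we saw in the logs */
        dma_buf = kmalloc(16 * 1024, GFP_ATOMIC);
        if (!dma_buf)
                return -ENOMEM;

        /* virtually contiguous 16k – stitched together from whatever free
           4k pages exist, so fragmentation does not matter here */
        scratch_buf = vmalloc(16 * 1024);
        if (!scratch_buf) {
                kfree(dma_buf);
                return -ENOMEM;
        }
        return 0;
}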

The actual solution, for ages, has been to organize free memory into power-of-two sized buckets (4k pages, 8k, 16k, …) – the buddy allocator (interestingly, it was first implemented by Harry Markowitz, the Nobel Prize winner in Economics, back in 1964). A request of any size can be satisfied by splitting blocks from larger buckets, and once the larger buckets run dry one has to compact the free memory by shuffling pages around.
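
The bucket (“order”) a request lands in is simply the smallest power of two, in pages, that covers it – a 16k kmalloc needs an order-2 block, and if the order-2 bucket is empty an order-3 block gets split into two “buddies”. A toy userspace version of that mapping, for illustration only:

/* toy illustration of buddy orders: order 0 = 4k, 1 = 8k, 2 = 16k, ... */
#include <stdio.h>

#define PAGE_SIZE 4096UL

static int size_to_order(unsigned long bytes)
{
        int order = 0;
        unsigned long block = PAGE_SIZE;

        while (block < bytes) {
                block <<= 1;    /* buckets always double in size */
                order++;
        }
        return order;
}

int main(void)
{
        printf("16k request -> order %d\n", size_to_order(16 * 1024));   /* 2 */
        printf("64k request -> order %d\n", size_to_order(64 * 1024));   /* 4 */
        return 0;
}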

One can see the current state of the buddy allocator in /proc/buddyinfo:

Node 0, zone      DMA      0      0      1      0      2      1
Node 0, zone    DMA32    229    434    689    472    364    197
Node 0, zone   Normal  11093   1193    415    182     38     12
Node 1, zone   Normal  10417     53    139    159     47      0

(Columns on the left show how many of the smallest blocks are free; each column to the right is the next power-of-two size up.)

It is actually the chase for performance that leads device drivers to allocate memory dynamically all the time (e.g. to avoid copying data from static device buffers to userland memory). On a machine handling lots of network traffic, the network interface will group packets on a stream into large segments, write them into these freshly allocated areas, and drop it all as soon as the application has consumed the bits – so the technique is genuinely useful.

On the other side of the Linux device driver spectrum are latency-sensitive workloads, such as gaming and music listening and production. This millennium being the Millennium of the Linux Desktop, Advanced Linux Sound Architecture users (alsa-users) complained that this kind of memory management sometimes makes their sound drivers misbehave. That would not be much of an issue on well-tuned multi-core servers with hardware interrupt handling spread across multiple cores, but Linux kernel engineers prefer the desktop, and compaction was disabled altogether in 2011.

If memory is not fragmented at all, nobody notices. But on busy servers one may need to flush gigabytes or tens of gigabytes of pages (dropping them if they are file system cache, swapping them out if they belong to programs) just to find a single contiguous region (though I’m not sure how exactly the kernel decides when to stop flushing).

Fortunately, there is a manual trigger to force a compaction, which my fellow kernel engineers were glad to tell me about (otherwise we’d have had to engineer a kernel fix or go for some other workaround). A script that triggered compaction whenever needed was deployed immediately, and I got to forget about the problem.

Until now, when I saw the same problem confusing engineers at Wikipedia – servers with 192GB of memory were constantly losing their filesystem cache and showing all sorts of other weird memory behaviors. Those servers run Varnish, which assumes that kernels are awesome and perfect, and if one is unhappy, one can always use FreeBSD :)

There are multiple ways to deal with the issue – one is simply disabling the hardware features that need these allocations (e.g. no more TCP offloading), another is writing 1 into /proc/sys/vm/compact_memory whenever needed – and maybe newer kernels alleviate the problem somewhat.

Update: By popular demand I published the script that can be used in cron
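
The general shape of such a job is simple – parse /proc/buddyinfo, and if free higher-order blocks look scarce, write 1 into /proc/sys/vm/compact_memory. What follows is not the published script, just a minimal sketch of the idea with made-up numbers (order 4 and a threshold of 100 blocks); it has to run as root, e.g. from cron:

/* sketch: trigger memory compaction when high-order free blocks run low */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MIN_ORDER 4      /* assumed: watch 64k-and-larger blocks */
#define THRESHOLD 100    /* assumed: fewer than this in total means "scarce" */

int main(void)
{
        FILE *f = fopen("/proc/buddyinfo", "r");
        char line[512];
        long high_order_free = 0;

        if (!f) {
                perror("/proc/buddyinfo");
                return 1;
        }

        while (fgets(line, sizeof(line), f)) {
                /* after the "zone <name>" prefix a line is just the
                   per-order free block counts, order 0 upwards */
                char *tok = strstr(line, "zone");
                long counts[16];
                int n = 0;

                if (!tok)
                        continue;

                for (tok = strtok(tok, " \t\n"); tok; tok = strtok(NULL, " \t\n")) {
                        char *end;
                        long v = strtol(tok, &end, 10);

                        if (*end == '\0' && n < 16)
                                counts[n++] = v;    /* keep numeric tokens only */
                }
                for (int i = MIN_ORDER; i < n; i++)
                        high_order_free += counts[i];
        }
        fclose(f);

        if (high_order_free < THRESHOLD) {
                FILE *c = fopen("/proc/sys/vm/compact_memory", "w");

                if (!c) {
                        perror("/proc/sys/vm/compact_memory");
                        return 1;
                }
                fputs("1\n", c);        /* same effect as echo 1 > ... */
                fclose(c);
        }
        return 0;
}

How aggressive to be (which orders to watch, how often to run it) is entirely workload-dependent, so treat the numbers above as placeholders.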