linux memory management for servers

We’ve been learning for many years how to run Linux for databases, but over time we realized that many of our lessons apply to many other server workloads. Generally, a server process will have to interact with network clients, access memory, do some storage operations and do some processing work – all under supervision of the kernel.

Unfortunately, from what I learned, there’re various problems in pretty much every area of server operation. By keeping the operational knowledge in narrow camps we did not help others. Finding out about these problems requires quite intimate understanding of how things work and slightly more than beginner kernel knowledge.

Many different choices could be made by doing empirical tests, sometimes with outcomes that guide or misguide direction for many years. In our work we try to understand the reasons behind the differences that we observe when randomly poking at a problem.

In order to qualify and quantify operational properties from our server systems we have to understand what we should expect from them. If we build a user-facing service where we expect sub-millisecond response times of individual parts of the system, great performance from all of the components is needed. If we want to build high-efficiency archive and optimize data access patterns, any non-optimized behavior will really stand out. High throughput system should not operate at low throughput, etc.

In this post I’ll quickly glance over some areas in memory management that we found problematic in our operations.

Whenever you want to duplicate a string or send a packet over the network, that has to go via allocators (e.g. some flavor of malloc in userland or SLUB in kernel). Over many years state of the art in user-land has evolved to support all sorts of properties better – memory efficiency, concurrency, performance, etc – and some of added features were there to avoid dealing with the kernel too much.

Modern allocators like jemalloc have per-thread caches, as well as multiple memory arenas that can be managed concurrently. In some cases the easiest way to make kernel memory management easier is to avoid it as much as possible (jemalloc can be much greedier and not give memory back to the kernel via the lg_dirty_mult setting).

Just hinting the kernel that you don’t care about page contents gets them immediately taken away from you. Once you want to take a page back, even if nobody else used it, the kernel will have to clean it for you, shuffle it around multiple lists, etc. Although that is considerable overhead, it is far from the worst that can happen.

Your page can be given to someone else – for example, file system cache, some other process or kernel’s own needs like network stack. When you want your page back, you can’t take it from all these allocations that easily, and your page has to come from free memory pool.

Linux free memory pool is something that probably works better on desktops and batch processing and not low latency services. It is governed by vm.min_free_kbytes setting, which has very scarce documentation and even more scarce resource allocation – on 1GB machine you can find yourself with 5% of your memory kept free, but then there’re caps on it at 64MB when autosizing it on large machines.

Although it may seem that all this free memory is a waste, one has to look at how the kernel reclaims memory. This limit sets up how much to clean up, but not when to trigger background reclamation – that is done at only 25% of the free memory limit – so the memory pool that can be used for instant memory allocation is at a measly 16MB – just two userland stacks.
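To make that arithmetic concrete, here is a minimal sketch that uses only the figures quoted in this post (the 64MB autosizing cap and the 25% trigger). The 8MB stack size is an assumption (a common default stack limit), the function name is made up for illustration, and real kernels compute per-zone watermarks that this does not try to model.

```python
MB = 1024 * 1024

def instant_allocation_pool(min_free_bytes, reclaim_fraction=0.25):
    """Memory usable before background reclaim kicks in, following the
    simplified 25% rule from the text (not the exact kernel formula)."""
    return int(min_free_bytes * reclaim_fraction)

autosized_cap = 64 * MB   # cap quoted in the text for large machines
stack_size = 8 * MB       # assumption: typical default userland stack limit

pool = instant_allocation_pool(autosized_cap)
print(pool // MB, "MB pool,", pool // stack_size, "default-sized stacks")
```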

Once you exhaust the free memory limit the kernel has to go into “direct reclaim” mode – it will stall your program and try to get memory from somewhere (thanks, Johannes, for the vmscan/mm_vmscan_direct_reclaim_begin hint). If you’re lucky, it will drop some file system pages; if you’re less lucky it will start swapping, putting pressure on all sorts of other kernel caches, possibly even shrinking TCP windows and what not. Understanding what the kernel will do in direct reclaim has never been easy, and I’ve observed cases of systems going into multi-second allocation stalls where nothing seems to work and fancy distributed systems failover can declare the node dead.

Obviously, raising free memory reserves helps a lot, and on various servers we maintain 1GB free memory pool just because low watermark is too low otherwise. Johannes Weiner from our kernel team has proposed tunable change in behaviors there. That still requires teams to understand implications of free memory needs and not run with defaults.

Addressing this issue gets servers into much healthier states, but doesn’t always help with memory allocation stalls – there’s another class of issues that was being addressed lately.

I wrote about it before – kernel has all sorts of nasty behaviors whenever it can’t allocate memory, and certain memory allocation attempts are much more aggressive – atomic contiguous allocations of memory end up scanning (and evicting) many pages because it can’t find readily available contiguous segments of free memory.

These behaviors can lead to unpredictable chain of events – sometimes TCP packets arrive and are forced to wait until some I/O gets done as memory compaction ended up stealing dirty inodes or something like that. One has to know memory subsystem much more than I do in order to build beautiful reproducible test-cases.

This area can be addressed in multiple ways – one can soften allocation needs of various processes on the system (does iptables really need a 128k allocation for an arriving segment to log it via NFLOG to some user land collection process?), also it is possible to tweak kernel background threads to have less fragmented memory (like a cronjob I deployed many years ago) or, of course, get the memory reclamation order into decent shape instead of treating it as a black box that “should work for you unless you do something wrong” (like using the TCP stack).

Some of our quick changes (like net: don’t wait for order-3 page allocation) were addressing this on a case-by-case basis, but it was amazing to see that this kind of optimization was pulled in to cover many, many more allocations via a wide-reaching change (mm/slub: don’t wait for high-order page allocation). From my experience, this addresses a huge class of reliability and stability issues in Linux environments and makes system behavior way more adaptive and fluid.

There are still many gray areas in the Linux kernel and the desktop direction may not always help with addressing them. I have test-cases where the kernel is only able to reclaim memory at ~100MB/s (orders of magnitude away from RAM performance) – and what these test cases usually have in common is “this would happen on a server but never on a desktop”. For example if your process writes a [transaction] log file and you forget to remove it from cache yourself, Linux will thrash on the inode mutex quite a bit.

There’re various zone reclaim contract violations that are easy to trigger with simple test cases – those test cases definitely expose edge behaviors, but many system reliability issues we investigate in our team are edge behaviors.

In the database world we exacerbate these behaviors when we bypass various kernel subsystems – memory is pooled inside the process, files are cached inside the process, threads are pooled inside the process, network connections are pooled by clients, etc. The kernel ends up being so dumb that it breaks on simple problems like ‘find /proc’ (the directory entry cache blows up courtesy of the /proc/X/task/Y/fd/Z explosion).

Although cgroups and other methods allow to constrain some sets of resources within various process groups, it doesn’t help when a shared kernel subsystem goes into an overdrive.

There’re also various problems with memory accounting – although kernel may report you quickly how many dirty file system pages it has, it doesn’t give equal opportunities to network stack. Figuring out how much of memory is in socket buffers (and how full these buffers are) is a non-trivial operation, and on many of our systems we will have much more memory allocated to network stack than to many other categories in /proc/meminfo. I’ve written scripts that pull socket data from netlink, try to guess what is the real memory allocation (it is not straightforward math) to produce a very approximate result.
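As a much cruder alternative to the netlink scripts mentioned above (this is not what those scripts do), the kernel's own counters in /proc/net/sockstat give a first approximation. A minimal sketch, assuming the usual Linux layout in which the `mem` fields for TCP and UDP are counted in pages; the function name and the sample text are illustrative only.

```python
import os

def socket_memory_bytes(sockstat_text, page_size=None):
    """Sum the TCP and UDP 'mem' counters (in pages) and convert to bytes."""
    if page_size is None:
        page_size = os.sysconf("SC_PAGE_SIZE")
    total_pages = 0
    for line in sockstat_text.splitlines():
        proto, _, rest = line.partition(":")
        fields = rest.split()
        # fields look like: inuse 12 orphan 0 tw 0 alloc 14 mem 2
        pairs = dict(zip(fields[0::2], fields[1::2]))
        if proto in ("TCP", "UDP") and "mem" in pairs:
            total_pages += int(pairs["mem"])
    return total_pages * page_size

# Made-up sample in the /proc/net/sockstat layout.
sample = """sockets: used 294
TCP: inuse 12 orphan 0 tw 0 alloc 14 mem 2
UDP: inuse 5 mem 3
FRAG: inuse 0 memory 0
"""
sample_bytes = socket_memory_bytes(sample, page_size=4096)
print(sample_bytes)
```

On a live system one would pass in the contents of /proc/net/sockstat instead of the sample. This only reports what the kernel has charged to the protocols, and says nothing about how full individual buffers are.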

Lack of proper memory attribution and accounting has been a regular issue – in 3.14 a new metric (MemAvailable) has been added, which sums up part of cache and reclaimable slab, but if you pay more attention to it, there’s lots of guessing whether your cache or slab is actually reclaimable (or what the costs are).
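To see how much guessing is involved, one can put MemAvailable next to the naive sum of free memory, cache and reclaimable slab. A minimal sketch that parses /proc/meminfo style text; the sample numbers are made up for illustration and the helper name is hypothetical.

```python
def meminfo_kb(meminfo_text):
    """Parse 'Name:   value kB' lines into a dict of integers."""
    values = {}
    for line in meminfo_text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            values[name.strip()] = int(parts[0])
    return values

# Invented figures, only to show the comparison.
sample = """MemTotal:       16384000 kB
MemFree:          512000 kB
MemAvailable:    9000000 kB
Cached:          8000000 kB
SReclaimable:     700000 kB
"""
info = meminfo_kb(sample)
naive_kb = info["MemFree"] + info["Cached"] + info["SReclaimable"]
print("MemAvailable:", info["MemAvailable"], "kB, naive sum:", naive_kb, "kB")
```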

Currently when we want to understand what is cached, we have to walk the file system, map the files and use mincore() to get a basic idea of our cache composition and age – and only then can we tell whether it is safe to reclaim pages from memory. Quite a while ago I wrote a piece of software that removes files from cache (now vmtouch does the same).
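The map-and-mincore() step can be sketched for a single file as follows. This is a minimal illustration through ctypes, not the tooling we actually run; the function name is hypothetical, it assumes Linux, and walking the file system around it is left out.

```python
import ctypes
import mmap
import os
import tempfile

# Symbols of the running process, which include libc on Linux.
libc = ctypes.CDLL(None, use_errno=True)
libc.mincore.argtypes = [ctypes.c_void_p, ctypes.c_size_t,
                         ctypes.POINTER(ctypes.c_ubyte)]

def cached_pages(path):
    """Return (resident_pages, total_pages) for one file, using mincore()."""
    size = os.path.getsize(path)
    if size == 0:
        return 0, 0
    total = (size + mmap.PAGESIZE - 1) // mmap.PAGESIZE
    with open(path, "rb") as f:
        # ACCESS_COPY is a private mapping whose address ctypes can take;
        # nothing is ever written through it.
        m = mmap.mmap(f.fileno(), size, access=mmap.ACCESS_COPY)
    try:
        vec = (ctypes.c_ubyte * total)()
        buf = ctypes.c_char.from_buffer(m)
        try:
            rc = libc.mincore(ctypes.addressof(buf), size, vec)
            err = ctypes.get_errno()
        finally:
            del buf  # release the exported buffer so the mapping can close
        if rc != 0:
            raise OSError(err, os.strerror(err))
        # The lowest bit of each byte says whether that page is resident.
        return sum(b & 1 for b in vec), total
    finally:
        m.close()

with tempfile.NamedTemporaryFile() as tmp:
    tmp.write(b"x" * (3 * mmap.PAGESIZE))
    tmp.flush()
    resident, total = cached_pages(tmp.name)
print(resident, "of", total, "pages resident")
```

Newer kernels may report every page as resident for files the caller neither owns nor can write to, so the numbers are only meaningful for your own files.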

Nowadays on some of our systems we have much more complicated cache management. Pretty much every buffered write that we do is followed by asynchronous cache purge later so that we are not at the mercy of the kernel and its expensive behaviors.
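The idea of a write followed by a purge can be sketched like this. It is a simplified, synchronous version (ours is asynchronous and happens later), the function name is made up, and it relies on os.posix_fadvise being available, which means Python 3 on a Unix system.

```python
import os
import tempfile

def append_and_uncache(fd, data):
    """Buffered append, then ask the kernel to drop the written range."""
    offset = os.lseek(fd, 0, os.SEEK_END)
    written = os.write(fd, data)
    # Dirty pages cannot be dropped, so they have to be flushed first.
    os.fsync(fd)
    os.posix_fadvise(fd, offset, written, os.POSIX_FADV_DONTNEED)
    return written

record = b"transaction record\n"
with tempfile.TemporaryFile() as log:
    written = append_and_uncache(log.fileno(), record)
print(written, "bytes written and advised away")
```

Doing the fsync() inline like this costs latency on every write, which is one reason to defer the purge to a background step instead.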

So, you either have to get the kernel entirely out of your way and manage everything yourself, or blindly trust whatever is going on and lose efficiency on the way. There must be a middle ground somewhere, hopefully, and from time to time we move in the right direction.

In desktop world you’re not supposed to run your system 100% loaded or look for those 5% optimizations and 0.01% probability stalls. In massively interconnected service fabrics we have to care about these areas and address them all the time, and as long as these kinds of optimizations reach wider set of systems, everybody wins.

TL;DR: upgrade your kernels and bump vm.min_free_kbytes :-)

on swapping and kernels

There is much more to write about all the work we do at Facebook with memory management efficiency on our systems, but there was this one detour investigation in the middle of 2012 that I had to revisit recently courtesy of Wikipedia.

There are lots of factors that make machines page out memory segments to disk, thus slowing everything down and locking software up – from file system cache pressure to runaway memory leaks to kernel drivers being greedy. But certain swap-out scenarios are confusing – systems seem to have lots of memory available, with proper settings file system cache should not cause swapping, and obviously in a production environment all the memory leaks are ironed out.

And yet in mid-2012 we noticed that our new kernel machines were swapping out for no obvious reason. When it comes to swapping, MySQL community will always point to Jeremy’s post on “swap insanity” – it has something to do with NUMA and what not. But what we observed was odd – there was free memory available on multiple nodes when swapping out happened. Of course, one of our kernel engineers wrote a NUMA rebalancing tool that attaches to running CPUs and evens out memory allocations without any downtime (not that we ended up using it…) – just in case Jeremy’s described issue is actually an issue for us.

In some cases systems threw warning messages in kernel logs that immediately helped us to get closer to the problem – network device driver was failing to allocate 16k memory pages.

Inside Linux kernel one has two ways to allocate memory, kmalloc and vmalloc. Generally, vmalloc will go through standard memory management, and if you ask for 16k, it will glue together 4k pages and allocation will succeed without any problems.

kmalloc though is used for device drivers when hardware is doing direct memory access (DMA) – so these address ranges have to be contiguous, and therefore to allocate them one has to find adjacent free pages that can be used. Unfortunately, the easiest way to free up memory is to look at the tail of the LRU list and drop some pages – but that does not give contiguous ranges.

The actual solution for ages was to organize the available free memory into power-of-2 sized buckets (4k pages, 8k, 16k, and so on) – called the Buddy Allocator (interesting – it was implemented first by Nobel Prize winner in Economics Harry Markowitz back in 1964). Any request for any memory size can be satisfied from larger buckets, and once there’s nothing in larger buckets one would compact the free memory by shuffling bits around.

One can see the details of buddy allocator in /proc/buddyinfo:

Node 0, zone      DMA      0      0      1      0      2      1
Node 0, zone    DMA32    229    434    689    472    364    197
Node 0, zone   Normal  11093   1193    415    182     38     12
Node 1, zone   Normal  10417     53    139    159     47      0

(Columns on the left are indicating numbers of small memory segments available, columns on the right – larger).
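Reading that output programmatically is straightforward. A minimal sketch, assuming 4k base pages (so the column at index n counts free blocks of 4k × 2^n) and using the sample lines from above; the function names are illustrative.

```python
PAGE_KB = 4  # assumption: 4k base pages, as in the text

def parse_buddyinfo(text):
    """Map (node, zone) -> list of free block counts, index = order."""
    zones = {}
    for line in text.splitlines():
        parts = line.replace(",", " ").split()
        if len(parts) < 5 or parts[0] != "Node":
            continue
        zones[(int(parts[1]), parts[3])] = [int(n) for n in parts[4:]]
    return zones

def free_kb(counts):
    """Total free memory represented by one zone's counters."""
    return sum(n * (PAGE_KB << order) for order, n in enumerate(counts))

sample = """Node 0, zone      DMA      0      0      1      0      2      1
Node 0, zone    DMA32    229    434    689    472    364    197
Node 0, zone   Normal  11093   1193    415    182     38     12
Node 1, zone   Normal  10417     53    139    159     47      0
"""
zones = parse_buddyinfo(sample)
for (node, zone), counts in sorted(zones.items()):
    # Blocks of order 2 and above can satisfy a contiguous 16k request.
    print(node, zone, free_kb(counts), "kB free,", sum(counts[2:]), "blocks >= 16k")
```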

It is actually aiming for performance that leads to device drivers dynamically allocating memory all the time (e.g. to avoid copying of data from static device buffers to userland memory). On a machine that is doing lots of e.g. network traffic it will be network interface grouping packets on a stream into large segments and writing them to these allocated areas in memory, then dropping all that right after application consumed network bits, so this technique is really useful.

On the other side of the Linux device driver spectrum there are latency sensitive operations, such as gaming and music listening and production. This millennium being the Millennium of Linux Desktop, Advanced Linux Sound Architecture users (alsa-users) complained that such memory management sometimes makes their sound drivers complain. That would not be much of an issue on well-tuned multi-core servers with hardware interrupt handling spread across multiple threads, but Linux kernel engineers prefer the desktop and disabled compaction altogether in 2011.

If memory is not fragmented at all, nobody notices. On busy servers, though, one may need to flush gigabytes or tens of gigabytes of pages (drop caches if it is file system cache or swap out if it is memory allocated to programs) to find a single contiguous region (though I’m not sure how exactly it chooses when to stop flushing).

Fortunately, there is a manual trigger to force a compaction that my fellow kernel engineers were glad to inform me about (as otherwise we’d have to engineer a kernel fix or go for some other workarounds). Immediately a script was deployed that would trigger compaction whenever needed, so I got to forget the problem.

Until now, when I saw this problem confusing engineers at Wikipedia – servers with 192GB of memory were constantly losing their filesystem cache and having all sorts of other weird memory behaviors. Those servers were running Varnish, which assumes that kernels are awesome and perfect, and if one is unhappy, he can use FreeBSD :)

There were multiple ways to deal with the issue – one was just disabling features on hardware that use the memory (e.g. no more TCP offloading), another is writing 1s into /proc/sys/vm/compact_memory – and maybe some new kernels have some alleviations for the problem.

Update: By popular demand I published the script that can be used in cron
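The published script is not reproduced here. What follows is only a minimal sketch of the idea: look at /proc/buddyinfo and write 1 into /proc/sys/vm/compact_memory when blocks big enough for a 16k allocation run low. The function names and the thresholds are arbitrary assumptions, and the write needs root.

```python
def needs_compaction(buddyinfo_text, order=2, min_blocks=100):
    """True if any Normal zone has fewer than min_blocks free blocks
    of the given order or larger (order 2 = 16k with 4k pages)."""
    for line in buddyinfo_text.splitlines():
        parts = line.replace(",", " ").split()
        if len(parts) < 5 or parts[3] != "Normal":
            continue
        counts = [int(n) for n in parts[4:]]
        if sum(counts[order:]) < min_blocks:
            return True
    return False

def trigger_compaction(path="/proc/sys/vm/compact_memory"):
    """Ask the kernel to compact memory (requires root)."""
    with open(path, "w") as f:
        f.write("1\n")

sample = """Node 0, zone   Normal  11093   1193    415    182     38     12
Node 1, zone   Normal  10417     53    139    159     47      0
"""
print(needs_compaction(sample, min_blocks=100))
print(needs_compaction(sample, min_blocks=400))
```

From cron one would feed it the contents of /proc/buddyinfo and call trigger_compaction() only when the check says so.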

Blowing up in memory

MySQL isn’t too concerned about table handler memory usage – it will allocate row size buffer thrice per each table invocation. There’s a few year old bug discussing UNION memory usage – for each mention in an union one can allocate nearly 200k of unaccounted memory – so a megabyte sized query can consume 7GB of RAM already.

Partitioning though adds even more pain here – it will allocate those three buffers per each partition, so opening a table with 1000 partitions stands out on a memory profile: it shows 191MB spent to execute a simple single-row fetching query (I filed a bug on this).

There’re multiple real life situations when this is painful (e.g. any kind of server stall may lead to multiple concurrent threads reading from same table, consuming additional gigabytes or tens of gigabytes of memory). It gets even more painful when combined with UNION bug – a megabyte query on an empty table can now consume 7TB of memory and I doubt anyone has that much on their MySQL servers :-)

P.S. Also, check out how much memory can be wasted for malloc overhead, once discussed here.
P.P.S. And here you can see why innodb_max_dirty_pages_pct=0 doesn’t do what you’d expect.

Stonebraker trapped in Stonebraker 'fate worse than death'

Oh well, I know I shouldn’t poke directly at people, but they deserve that sometimes (at least in my very personal opinion). Heck, I even gave 12h window for this not to be hot-headed opinion.

Those who followed MySQL at facebook development probably know how much we focus on actual performance on top of mixed-composition I/O devices (flashcache, etc) – not just retreating to comfortable zone of in-memory (or in-pure-flash) data.

I feel somewhat sad that I have to put this truism out here – disks are way more cost efficient, and if used properly can be used to facilitate way more long-term products, not just real time data. Think Wikipedia without history, think comments that disappear on old posts, together with old posts, think all 404s you hit on various articles you remember from the past and want to read.

Building the web that lasts is completely different task from what academia people imagine building the web is.

I already had this issue with other RDBMS pioneer (there’s something in common among top database luminaries) – he also suggested that disks are things of the past and now everything has to be in memory, because memory is cheap. And data can be whatever unordered clutter, because CPUs can sort it, because CPUs are cheap.

They probably missed Al Gore message. Throwing more and more hardware without fine tuning for actual operational efficiency requirements is wasteful and harms our planet. Yes, we do lots of in-memory efficiency work, so that we reduce our I/O, but at the same time we balance the workload so that I/O subsystem provides as efficient as possible delivery of the long tail.

What happens in real world if one gets 2x efficiency gain? Twice more data can be stored, twice more data intensive products can be launched.
What happens in academia of in-memory databases, if one gets 2x efficiency gain? A paper.
What happens when real world doesn’t read your papers anymore? You troll everyone via GigaOM.

Though sure, there’s some operational overhead in handling sharding and availability of MySQL deployments, at large scale it becomes somewhat constant cost, whereas operational efficiency gains are linear.

Update: Quite a few people pointed out that I was dissing a person who has done incredible amount of contributions, or that I’m anti-academia. I’m not, and I extremely value any work that people do wherever they are, albeit I do apply critical thinking to whatever they speak.

In my text above (I don’t want to edit and hide what I said) I don’t mean that “a paper” is useless. Me and my colleagues do read papers and try to understand the direction of computer science and how it applies to our work (there are indeed various problems yet to solve). I’d love to come up with something worth a paper (and quite a few of my colleagues did).

Still, if someone does not find that direction useful, there’s no way to portray them the way the original GigaOM article did.

Logs memory pressure

Warning, this may be kernel version specific, albeit this kernel is used by many database systems

Lately I’ve been working on getting more memory used by InnoDB buffer pool – besides obvious things like InnoDB memory tax there were seemingly external factors that were pushing MySQL out into swap (even with swappiness=0). We were working a lot on low hanging fruit like scripts that use too much memory, and those seem to be mostly gone, yet MySQL still has way too much memory pressure from outside.

I grabbed my uncache utility to assist with the investigation and started uncaching various bits on two systems, one that had a larger buffer pool (60G), which was already being sent to swap, and a conservatively allocated (55G) machine, both 72G boxes. Initial findings were somewhat surprising – apparently on both machines most of the external-to-mysqld memory was consumed by two sets of items:

  • binary logs – write once, read only tail (sometimes, if MySQL I/O cache cannot satisfy) – we saw nearly 10G consumed by binlogs on conservatively allocated machines
  • transaction logs – write many, read never (by MySQL), buffered I/O – full set of transaction logs was found in memory

It was remarkably easy to get rid of binlogs from cache, both by calling out ‘uncache’ from scripts, or using this tiny Python class:

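import ctypes
import io

libc = ctypes.CDLL("libc.so.6")
class cachedfile(io.FileIO):  # the Python 2 `file` builtin no longer exists
    FADV_DONTNEED = 4  # value of POSIX_FADV_DONTNEED on Linux
    def uncache(self):
        # offset 0 and length 0 mean the whole file
        libc.posix_fadvise(self.fileno(), 0, 0, self.FADV_DONTNEED)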

As it was major memory stress source, it was somewhat a no brainer that binlogs have to be removed from cache – something that can be serially re-read is taking space away from a buffer pool which avoids random reads. It may make sense to call posix_fadvise() right after writes to them, even.

Transaction logs, on the other hand, are entirely different beast. From MySQL perspective they should be uncached immediately, as nobody ever ever reads them (crash recovery aside, but re-reading then is relatively cheap, as no writes or random reads are done during log read phase). Unfortunately, the problem lies way below MySQL, and thanks to PeterZ for reminding me (we had a small chat about this at Jeremy’s Silicon Valley MySQL Meetup).

MySQL transaction records are stored in multiple log groups per transaction, then written out as per-log-group writes (each is in multiple of 512 bytes), followed by fsync(). This allows FS to do transaction log write as single I/O operation. This also means that it will be doing partial page writes to buffered files – overwriting existing data in part of the page, so it has to be read from storage.
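Which pages end up partially overwritten is simple arithmetic on offsets. A minimal sketch, assuming 4k pages and ignoring whether a page is already cached or lies beyond the end of the file; the function name is made up for illustration.

```python
def partially_written_pages(offset, length, page_size=4096):
    """Indexes of pages that a write touches without covering completely.
    Such pages must be read in first if they are not already cached."""
    if length <= 0:
        return []
    end = offset + length
    first, last = offset // page_size, (end - 1) // page_size
    partial = []
    if offset % page_size:
        partial.append(first)
    if end % page_size and last not in partial:
        partial.append(last)
    return partial

print(partially_written_pages(0, 512))      # [0]: one 512-byte block
print(partially_written_pages(3584, 1024))  # [0, 1]: straddles a page boundary
print(partially_written_pages(0, 8192))     # []: two full pages, nothing to read
```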

So, if all transaction log pages are removed from cache, quite a few of them will have to be read back in (depending on sizes of transactions, probably all of them in some cases). Oddly enough, when I tried to hit the edge case, single thread transactions-per-second remained the same, but I saw consistent read I/O traffic on disks. So, this would probably work on systems that have spare I/O (e.g. flash based ones).

Of course, as writes are already in multiples of 512 (and appears that memory got allocated just fine), I could try out direct I/O – it should avoid page read-in problem and not cause any memory pressure by itself. In this case switching InnoDB to use O_DIRECT was a bit dirtier – one needs to edit source code and rebuild the server, restart, etc, or…

# lsof ib_logfile*
# gdb -p $(pidof mysqld)
(gdb) call os_file_set_nocache(9, "test", "test")
(gdb) call os_file_set_nocache(10, "test", "test")

I did not remove the fsync() call – it is somewhat of a no-op on O_DIRECT files, so I left it there. Removing it would probably change benchmark results, but not much.

Some observations:

  • O_DIRECT was ~10% faster at best case scenario – lots of tiny transactions in single thread
  • If group commit is used (without binlogs), InnoDB can have way more transactions with multiple threads using buffered I/O, as it does multiple writes per fsync
  • Enabling sync_binlog makes the difference not that big – even with many parallel writes direct writes are 10-20% slower than buffered ones
  • Same for innodb_flush_log_at_trx_commit=0 – multiple writes per fsync are much more efficient with buffered I/O
  • One would need to do log group merge to have more efficient O_DIRECT for larger transactions
  • O_DIRECT does not have a theoretical disadvantage, current deficiencies are just the implementation being oriented at buffered I/O – and can be resolved by (in some areas – extensive) engineering
  • YMMV. In certain cases it definitely makes sense even right now, in some other – not so much

So, the outcome here depends on many variables – with flash read-on-write is not as expensive, especially if read-ahead works. With disks one has to see what is better use for the memory – using it for buffer pool reduces amount of data reads, but causes log reads. And of course, O_DIRECT wins in the long run :-)

With this data moved away from cache and InnoDB memory tax reduced one could switch from using 75% of memory to 90% or even 95% for InnoDB buffer pools. Yay?

uncache!

This is source code for a tiny program I just wrote that traverses specified directories and removes the files in them from file system cache.

There are few use cases for it. One is for all these people who benchmark stuff and want selective OS cache purges, another is for those who run high performance databases. Remember the O_DIRECT serialization everywhere? Well, XFS does direct I/O in parallel, unless there are cached pages (and they can happen because of any random outside-of-database activity, like ‘file’ command). Once you ‘uncache’ the files, XFS will be very much parallel again \o/ \o/
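The same idea can be re-sketched in a few lines of Python. This is not the program's actual source, just an illustration built on os.walk and os.posix_fadvise; the function name is made up.

```python
import os
import stat
import tempfile

def uncache_tree(top):
    """Advise the kernel to drop cached pages of regular files under top.
    Returns the number of files advised; dirty pages are not dropped."""
    count = 0
    for dirpath, _dirs, files in os.walk(top):
        for name in files:
            path = os.path.join(dirpath, name)
            try:
                # Skip FIFOs, sockets and symlinks: only regular files matter.
                if not stat.S_ISREG(os.lstat(path).st_mode):
                    continue
                fd = os.open(path, os.O_RDONLY)
            except OSError:
                continue
            try:
                os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
                count += 1
            except OSError:
                pass
            finally:
                os.close(fd)
    return count

with tempfile.TemporaryDirectory() as d:
    for name in ("a.log", "b.log"):
        with open(os.path.join(d, name), "wb") as f:
            f.write(b"x" * 8192)
    count = uncache_tree(d)
print(count, "files advised")
```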

Memcached for small objects

Memcached quite often ends up as a store for very small objects (small key and some integer value), though it isn’t really designed to do this kind of work by default. Current memory management is based on slabs (200 of them), where objects are grouped by similar size – though actual sizes are pre-defined at startup based on few configuration parameters.

By default memcached would have slabs based on the assumption that the smallest object size will have 48 bytes of data (that’s without the item header), and will increase the slab sizes in +25% steps:

slab class   1: chunk size    104 perslab 10082
slab class   2: chunk size    136 perslab  7710
slab class   3: chunk size    176 perslab  5957
slab class   4: chunk size    224 perslab  4681
...
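The listing above can be reproduced with a few lines of arithmetic. A minimal sketch, in which the 56-byte item header (104 minus 48), the 1MB slab page and the 8-byte alignment are inferred from these numbers rather than taken from memcached's source; the function name is illustrative.

```python
def slab_classes(chunk_data=48, item_header=56, factor=1.25,
                 page_size=1024 * 1024, align=8, limit=4):
    """Return (chunk size, chunks per slab) for the first few classes."""
    size = item_header + chunk_data
    classes = []
    for _ in range(limit):
        if size % align:
            size += align - (size % align)  # round up to the alignment
        classes.append((size, page_size // size))
        size = int(size * factor)           # next class is ~25% larger
    return classes

classes = slab_classes()
for number, (chunk, perslab) in enumerate(classes, start=1):
    print("slab class %3d: chunk size %6d perslab %5d" % (number, chunk, perslab))
```

Passing a smaller chunk_data or a smaller factor to the same function shows how the class sizes would shift for tiny objects.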

So, in this case, it allocates at least 104 bytes per object, and next steps are way behind. Fortunately, there’re some quick steps to have better efficiency.