on swapping and kernels

There is much more to write about all the work we do at Facebook with memory management efficiency on our systems, but there was this one detour investigation in the middle of 2012 that I had to revisit recently courtesy of Wikipedia.

There are lots of factors that make machines page out memory segments to disk, slowing everything down and locking software up – from file system cache pressure to runaway memory leaks to greedy kernel drivers. But certain swap-out scenarios are confusing – the system seems to have plenty of memory available, with proper settings the file system cache should not cause swapping, and obviously in a production environment all the memory leaks have been ironed out.

And yet in mid-2012 we noticed that machines running our new kernel were swapping out for no obvious reason. When it comes to swapping, the MySQL community will always point to Jeremy’s post on “swap insanity” – it has something to do with NUMA and what not. But what we observed was odd – there was free memory available on multiple nodes when the swapping happened. Of course, one of our kernel engineers wrote a NUMA rebalancing tool that attaches to running processes and evens out their memory allocations across nodes without any downtime (not that we ended up using it…) – just in case the issue Jeremy described was actually an issue for us.

In some cases systems left warning messages in kernel logs that immediately got us closer to the problem – the network device driver was failing to allocate contiguous 16k memory blocks.

Inside the Linux kernel there are two main ways to allocate memory: kmalloc and vmalloc. Generally, vmalloc goes through standard virtual memory management – if you ask for 16k, it will glue together 4k pages in the virtual address space and the allocation will succeed without any problems.

kmalloc, though, is what device drivers use when hardware does direct memory access (DMA) – those address ranges have to be physically contiguous, so to satisfy the allocation the kernel has to find adjacent free pages. Unfortunately, the easiest way to free up memory is to look at the tail of the LRU list and drop pages from there – and that does not produce contiguous ranges.

The actual solution has for ages been to organize free memory into power-of-two sized buckets (4k pages, 8k, 16k, …) – the buddy allocator (interesting bit of trivia: it was first implemented by Harry Markowitz, later a Nobel Prize winner in Economics, back in 1964). A request of any size can be satisfied by splitting a larger bucket, and once nothing is left in the larger buckets, free memory has to be compacted by shuffling pages around.

One can see the state of the buddy allocator in /proc/buddyinfo:

Node 0, zone      DMA      0      0      1      0      2      1
Node 0, zone    DMA32    229    434    689    472    364    197
Node 0, zone   Normal  11093   1193    415    182     38     12
Node 1, zone   Normal  10417     53    139    159     47      0

(Columns on the left indicate the number of small free segments available; columns further to the right – progressively larger ones.)
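If you want to keep an eye on fragmentation while reproducing a problem, plain old watch is enough (the -d flag highlights the buckets that change between refreshes):

watch -d cat /proc/buddyinfo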

It is actually the chase for performance that leads device drivers to allocate memory dynamically all the time (e.g. to avoid copying data from static device buffers to userland memory). On a machine doing lots of network traffic, for example, the network interface will group packets on a stream into large segments, write them into these freshly allocated areas, and drop them right after the application has consumed the bytes – so the technique is genuinely useful.

On the other side of the Linux device driver spectrum there are latency-sensitive operations, such as gaming and music listening or production. This millennium being the Millennium of the Linux Desktop, Advanced Linux Sound Architecture users (alsa-users) complained that such memory management behavior sometimes makes their sound drivers stutter. That would not be much of an issue on well-tuned multi-core servers with hardware interrupt handling spread across multiple threads, but Linux kernel engineers prefer the desktop and disabled compaction altogether in 2011.

If memory is not fragmented at all, nobody notices. On busy servers, though, the kernel may need to flush gigabytes or tens of gigabytes of pages (dropping them if it is file system cache, swapping them out if it is memory allocated to programs) just to find a single contiguous region (though I’m not sure how exactly it decides when to stop flushing).

Fortunately, there is a manual trigger to force a compaction, which my fellow kernel engineers were glad to point me at (otherwise we would have had to engineer a kernel fix or go for some other workaround). A script that triggers compaction whenever needed was deployed immediately, and I got to forget about the problem.

Until now, when I saw the same problem confusing engineers at Wikipedia – servers with 192GB of memory were constantly losing their filesystem cache and exhibiting all sorts of other weird memory behaviors. Those servers were running Varnish, which assumes that kernels are awesome and perfect, and if one is unhappy with that, one can always use FreeBSD :)

There were multiple ways to deal with the issue – one was simply disabling the hardware features that need these allocations (e.g. no more TCP offloading), another was writing 1 into /proc/sys/vm/compact_memory – and newer kernels may have alleviated the problem somewhat.
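For illustration, here is a minimal sketch of such a compaction trigger – not the exact script we deployed, just one way to do it, assuming you pick your own threshold for free high-order blocks:

#!/bin/sh
# /proc/buddyinfo columns 5 and up are free block counts per order,
# so order-2 (16k on 4k-page systems) and larger start at column 7
highorder=$(awk '{ for (i = 7; i <= NF; i++) sum += $i } END { print sum }' /proc/buddyinfo)
# the threshold is arbitrary here - tune it to your workload
if [ "$highorder" -lt 100 ]; then
    echo 1 > /proc/sys/vm/compact_memory
fi

Dropped into cron, something like this keeps enough contiguous blocks around for the driver allocations described above.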

Update: By popular demand I published the script that can be used in cron

how MySQL engineering broke the backups

MySQL has an exceptional track record of introducing minor fixes that cause major breakage. Usually I could blame the naiveté of engineers who never had to deal with production implications, but lately I am starting to sense various business motives working against the open-source offering.

As an original author of mydumper I really cannot get it out of my mind that the 5.5 and 5.6 metadata locking changes are there to screw with anyone who is building a backup solution using stable snapshot views of MySQL (for example mysqldump --single-transaction, the gold standard of backing things up in the MySQL world).
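For context, the kind of invocation I mean – a logical dump taken from a consistent InnoDB snapshot (exact options vary per setup; --master-data=2 is only there to record the binlog position as a comment):

# consistent logical backup without locking everything for the duration of the dump
mysqldump --single-transaction --master-data=2 --all-databases | gzip > backup.sql.gz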

As seen in bug #71017 (a palindrome!) filed by my esteemed colleague Eric, the newly introduced behaviors gobble up all the locks possible, even where that makes absolutely no sense for backup/ETL/migration/etc. scenarios.

The only supported way out of that is MySQL Enterprise Backup, which is proprietary software and does not produce logical backups, so no selective data restores, no ETL, nothing of the sort. You get complete vendor lock-in, with no way to get your data out of the system in a consistent manner – unless, of course, you restrict yourself to a “no metadata changes allowed in production” mode.

on nuodb and falcon

Warning: this is a mixture of historical content, biases, stupid marketing and unknown/proprietary/closed source technologies. Proceed with caution.

NuoDB marketing was sending out this message, encouraging me to blog (they were looking for bloggers too):

And while Facebook sharded MySQL 4000 times, even they call it a “fate worse than death.”

We’ve seen this phrase before, and it did not come from us. For whatever reason the NewSQL echo chamber keeps repeating it with less and less truth in it. In various whitepapers (all behind registration walls) they cite analyst estimates and draw a parallel between the operating costs of large companies and what a fresh developer would do, as if everyone were living under the same constraints.

I don’t know whether NuoDB is a good technology for the customers they’re targeting – all their diagrams essentially say “we have blocks of magic and we multiply them”, and if you approach them at a conference, their “tech guy is away at the moment”. Still, the key message is that what they do is the Holy Grail of databases and we should believe in it.

It is still an open question whether NuoDB solves the problems of massive-scale web deployments. They seem to diss existing operational costs with “thousands of servers and storage are required”, and I’m not sure what the cost of their alternative would be.

We’ve revealed some aggregate numbers for our MySQL-based data platform before – tens of millions of queries (billions at the cache level), millions of IOPS – and it is somewhat difficult to squeeze that into fewer than “thousands of servers”.

There are more than a billion users to serve, and the sheer amount of data in the social graph is also not something you can put on a few thumb drives. If only any of these software vendors could tell us how much their platform would cost in such a case.

I am not an expert on the optimistic concurrency control that seems to be in there – I have yet to see a large-scale system using it. Optimistic concurrency control (the use of “control” here sounds like an oxymoron) means that if users talking to different servers do the same operation, one of them is going to get an error on commit (instead of waiting for the lock to be released and doing his job on top). It also cannot hide any latencies if consistency is required. Unfortunately, retries in higher-latency environments are even more expensive, and writing software for such a model becomes more complicated than writing software for sharded datasets.
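To make the retry cost concrete, here is a minimal sketch of what application-level optimistic concurrency usually looks like against any SQL store – the appdb.counters table and its version column are made up for illustration, and none of this is NuoDB-specific:

# read the current version, then only commit if nobody changed it in the meantime
while true; do
    ver=$(mysql -N -e "SELECT version FROM appdb.counters WHERE id = 42")
    ok=$(mysql -N -e "UPDATE appdb.counters SET value = value + 1, version = version + 1 WHERE id = 42 AND version = $ver; SELECT ROW_COUNT()")
    [ "$ok" = "1" ] && break    # committed; on conflict we pay for another round trip and retry
done

Every conflict turns into extra round trips, which is exactly why this hurts more as latency grows.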

Software that does not have to be aware of sharding and underlying partitioning is easier to implement. Unfortunately, writing efficient software on top of distributed datasets is not that easy. Fan-out in distributed systems is expensive (especially at the thousands-of-machines level) and is not an operation that should be done often in web-facing environments.

Usually developers already have abstractions that allow them to do efficient data retrieval without thinking about shards, while still forcing them to think twice before doing something expensive. The cost is always there, visible or invisible, and someone has to deal with it.

At least we know who is behind this. Meet Jim Starkey, the database luminary behind it and a few other database management systems. In the MySQL world he gained his fame during the rocky period of the Oracle acquisition of InnoDB. MySQL AB had to do something, and acquired Netfrastructure – a Java application-server/database hybrid. Some of that technology was rewritten in C++ and used as a storage engine for MySQL. This whole new development had two names:

MySQL 6 and Falcon

Jim Starkey captivated crowds by dissing the status quo (I saw famous people leave the room) while not being all that correct himself. His storage engine was supposed to be architected for the web. Well, it wasn’t. At that time I was doing some work on Wikipedia and had various ideas about what does and doesn’t work in web-facing databases. Jim Starkey had different ideas.

One of them was that RAM was going to be cheap, so Falcon was designed to be memory-hungry for various operations. Sure, RAM got cheaper, but data volumes got larger. The trend I observed later was that the amount of RAM per byte stored was decreasing rather than increasing.

Another Falcon bet was that CPUs were going to be much faster, so instead of storing and reading data in any ordered fashion, one was supposed to read unsorted data and then sort it in memory (RAM being cheap too). Again, a major web pattern – open-ended range reads (ORDER BY … LIMIT) – got missed. Lots of web-facing range queries would become very expensive, so in order to be web scale one would have to rewrite applications to fit the new model.

Random disk access was supposed to go away – even if an index lookup hit sparse data, Starkey believed that doing the disk reads in physical order would give better performance. That was not a correct assumption for concurrent workloads, and the design also missed a few other important optimizations, such as covering index reads. We did not see too much flash around at that time, and I’m not sure how the Falcon design would have worked on flash anyway.

I wrote some of these observations down and sent them to MySQL engineering. Jim Starkey did not reply; someone else close to him did, with “let’s wait for benchmarks, then talk”. Unfortunately, decent benchmarks never arrived. I was not the only one who had questions.

There were various performance issues. For a while the line was that one should not implement low-level concurrency primitives and should use OS-provided methods (e.g. pthreads) everywhere instead. Apparently, when Falcon tried implementing internal spinlocks, they did not work that well. Mark Callaghan pointed out that the spinlock implementation inside Falcon was not actually spinning (the compiler had optimized the loop away) and was just using OS mutexes.

The engine had some interesting properties that could have been valuable – a row cache at its core, transactional overhead kept in memory (you were not supposed to run long-running transactions), etc.

Eventually the Falcon leadership changed and the remaining team of engineers tried to salvage the product for quite a while (success of the project was measured in how many minutes it could stay up without crashing), but it all became moot once the InnoDB and MySQL teams were reunited under Oracle.

And no, I will not make another “fate worse than death” joke; those expired long ago. I don’t think the Falcon record has expired by now, though, so I will take NuoDB’s claims with a grain of salt. I don’t know exactly what problems they are solving, so it is difficult to come up with a good analysis or comparison. I won’t, and neither should they.

on durability

MySQL did not start out as a durable data store and got plenty of mockery for it (ISAM? no replication?). Eventually InnoDB took over and brought at least parts of MySQL into the reliable storage world. Checksummed pages, decent crash recovery and good synchronous behavior kept InnoDB ahead of the open source competition for quite a while, and on par with other solutions. Unfortunately, that safety was limited to InnoDB row operations and did not cover DDL or replication state.

In a world where nothing used to be synchronous, the transition to reliable storage introduced lots of slowdowns, and it still was not good enough.
A lot of work was done outside of internal MySQL/Sun/Oracle development to help with some of these problems. For example, the Google 4.0 patch tried to solve slave crash safety by storing replication state inside InnoDB – not only did that allow slaves to recover properly after a crash, it was also achieved without syncing storage on every transaction. The wish to run masters reliably required synchronizing the binary log with the data store, leading to three synchronous data writes per transaction – forcing multiple parties to work on group commit implementations. Costly checksums were offloaded to on-chip CRC32 implementations, etc.

Eventually we got MySQL to the point where it was considered reliable. Of course, that came at a cost.

One problem is that log writes suffer huge write amplification induced by the OS paging system: a tiny synchronous log write of a few hundred bytes gets written out as a full 4k page. We observe ~3x-4x write amplification from this on our database masters. That is not much of an issue if the underlying hardware (e.g. NVRAM on a RAID controller) can absorb it all, but on systems using SSDs the hardware may no longer do such merging, and the limited write cycles of flash storage suffer a lot from this amplification. We end up writing more log pages than data pages (though far fewer log bytes), so it is a major issue for write endurance on flash devices.

The other problem is that the underlying hardware isn’t always fast. On modern devices super-capacitors don’t fail as often as batteries used to and are not subject to recharge cycles, but there are still various kinds of I/O stalls that affect the durable behavior of high-performance systems. Flash devices run all sorts of background activity that can make fast writes suddenly not so fast. In such cases a system that was happily eating tens of thousands of writes a second suddenly has to accommodate a backlog of thousands of transactions, assuming they are not locking each other out. Group commit may not be much help here.

In the MySQL world durability is controlled via two settings – one tells InnoDB whether to fsync on every transaction, the other how often the binary log should be synced. So either you get a fully durable system, or you buy into an unsynchronized environment with up to a second of data loss. I was making fun of MemSQL before for their initial durability implementation, but honestly both MemSQL and PostgreSQL now have durability settings that allow millisecond-precision control.
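For reference, the two knobs in question are innodb_flush_log_at_trx_commit and sync_binlog, shown here as live changes (the same values go into my.cnf) – fully durable at one extreme, “up to a second of loss” at the other:

# fully durable: fsync the InnoDB log and the binary log on every commit
mysql -e "SET GLOBAL innodb_flush_log_at_trx_commit = 1; SET GLOBAL sync_binlog = 1"
# relaxed: InnoDB log flushed roughly once a second, binlog syncing left to the OS
mysql -e "SET GLOBAL innodb_flush_log_at_trx_commit = 2; SET GLOBAL sync_binlog = 0"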

In MySQL 5.6 we finally get transactional replication state on slaves (I cannot imagine running a replicated environment without that), and semi-synchronous replication allows for network durability, which may extend or replace host-local durability. Even though this allows higher availability and consistency of a replica set, it still does not make masters crash-safe – the replication state of a master is not synchronized with the transactional state of the data subsystem, so in case of a master crash one is supposed to discard it instead of being able to resync it. That may not sound like an issue for a ten-user website, but when instances go into the terabyte or tens-of-terabytes size range, rebuilding masters after crashes is costlier than one would think.

The solution sounds somewhat obvious – include replication state within the transactional store and use it to re-synchronize with the replica set, which allows skipping most of the expensive synchronous page overwrites and introduces best-effort background syncing (e.g. syncing data written up to a page boundary). Unfortunately, even with GTIDs and semi-sync replication that may not be exactly straightforward or 100% reliable. Still, in large environments it is more about statistics and costs at scale than about standalone system operation, so with a good understanding of the impact, tradeoffs can be made.

The cost of the doublewrite buffer has long been neglected as well (in compressed environments it is even triple- or nonuple-writing), and even though some hardware vendors offer atomic writes, the more standard stack still has to rely on it to make sure recovery succeeds – apparently it is exercised more than we expected. In large-scale environments one may just want to quickly detect broken pages rather than fully recover them, so it may be possible to shrink the doublewrite buffer to store just page IDs. Of course, with more devices supporting atomic writes the future may be better here, but alternative approaches can be useful as well (including network-based physical data recovery).

There are more and more systems that provide proper network durability without having to pay for expensive host-level durability, and the MySQL world has quite a bit of catching up to do in order to stay useful in datacenter environments. Just as crash-safe slaves made it feasible to run replication at scale five years ago, network-durable crash-safe masters may be needed to compete with other solutions today.

Hash of shame

I may retract this post, I may have been way too incorrect here, not sure yet.

Hash maps were invented sixty years ago, apparently. MySQL reinvented them.

The original idea was too fancy and too good, I guess. It allows very cheap access to data if you know the key: a hashing function picks a slot, and you go directly to that slot. It is used in your computer all the time. ALL THE TIME.

They are so fast and useful that they are always treated as basic building blocks. There have been various iterations since – concurrency support, evolving hash functions, etc. – but the premise stayed the same.

If we look at the dictionary, it is obvious that “hash” is:

a mixture of jumbled incongruous things; a mess.

Yes, MySQL, the whole concept is to make the data distribution as messy as possible, so that there are no natural collisions. Because if there is a collision, then there is contention – over the list sitting in a hash bucket, or over a worker ready to do the work, or over some fancy-shmancy high-performance concurrent workflow.

But no, MySQL wants order; it does not want its hashing function to be messy, so it will redefine the term and sixty years of industry practice. For example, given these inputs, it will calculate hashes that are as close to each other as possible (tested with (gdb) print my_calc_hash(&mapping_db_to_worker, "db100002", 8)):

  • db100002 – 0xace3c8bc
  • db100003 – 0xace3c8f3
  • db100004 – 0xace3c836

It decided that for each change in an input byte it will change only one byte of the hash value (pretty much every other hashing function – CRC32, Adler, Fletcher, HeikkiFolding, MD5, SHA1 – will have all bytes differ) (this was debunked by a commenter, I was completely wrong here).
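For the curious, no gdb is needed to see what a run-of-the-mill CRC does with the same keys – cksum implements a plain CRC, and the point is only to eyeball the spread, not to benchmark anything:

for key in db100002 db100003 db100004; do
    printf '%s: ' "$key"
    printf '%s' "$key" | cksum
done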

Then MySQL decides that the resulting hash is still too messy, does not even look at the differing bytes, and just maps everything to exactly the same slot.

Mark saw this in another place – his 5.6 performance was not too good, and he investigated the hash problems in this bug report. MySQL developers looked at it and filed a separate bug, which still does not catch the whole essence – MDL lock is blah blah blah.

No, MySQL, it is not related to any particular use case of your hashing implementation. This is a bug in your critical data structure code. You should just be happy that it is not used too much – only in two places: one is on Mark’s radar, the other is in the flashy new beautiful feature of MySQL 5.6, the Multi-Threaded Slave. Or should I call it the Not-So-Multi-Threaded Slave. NSMTS.

P.S. On the upside, I can skip talking about MTS in my “high performance replication” talk at MySQL Conference next week \o/

Update: It may be that MTS is not working well because of other issues – the checksum function is getting empty inputs.

The saddest bug of them all (SQL is dead?)

From time to time I observe servers wasting lots of CPU on batch row operations. In perf top it looks like this:

8.24% mysqld [.] Arg_comparator::compare_int_unsigned()
7.17% mysqld [.] Item_cond_and::val_int()
4.37% mysqld [.] Item_field::val_int()
4.37% mysqld [.] Item_cond_or::val_int()
2.90% mysqld [.] MYSQLparse(void*)
2.64% mysqld [.] Item::val_bool()

Essentially, if you construct queries like (a=1 AND b=2) OR (a=3 AND b=4) OR ..., at a large enough batch size evaluating the WHERE clause becomes far more expensive than anything else (yes, more expensive than decompressing rows or doing all the InnoDB magic and what not).

MySQL has awesome syntax that makes certain batch lookups much faster: WHERE a IN (1,2,3). It constructs a tree that each row can then be compared against, so one does not have to iterate through a list of predicates to check whether a row returned by the batch index lookup needs to be filtered away. One would think that the composite syntax the server supports – WHERE (a,b) IN ((1,2),(3,4)) – would help here.

Unfortunately, if you run a query using this syntax, a full table scan will be done even if there’s a great index for it – the syntax exists, but none of the sane ways to execute it apply. Of course, there have to be bugs about that:

Compound (cola,colb) IN ((x,y)) statements do not use indexes. (filed by yours truly).

Well, it was closed by the optimization lead (now at Maria), who decided that this bug was a duplicate of another problem, (a,b,c)=(A,B,C). I probably should have contested that, but there might have been other people who’d do it? Sure there were – in 2007 Mark filed another bug:

Optimizer does not do index range scans using predicates in IN lists

If one looks more closely, there’s this small exchange in 2010:

[7 Jan 2010 21:35] Mark Callaghan
This is the same as http://bugs.mysql.com/bug.php?id=16247, 
but the author of this feature request is much more convincing.
[7 Jan 2010 21:36] Domas Mituzas
hehe.

He wrote a Facebook note back then too.

Apparently Mark’s bug became the master bug for this problem, even though it arrived a bit later – but I can’t argue, he is far more convincing :-)
There’s a bit of odd banter there: Harrison (ohi!) points out that the behavior is documented in the manual (albeit incorrectly, as Sergey Petrunia – ohi! – notices). No, it isn’t this exact issue that is noted in the manual.
Duplicates keep pouring in:

  • Multi column IN does not use index, says Peter Zaitsev (ohi!) in 2008
  • MySQL not using an index when using tuples in IN clause, says Shlomo (ohi!) in 2010

There are a few more.

Sinisa (ohi!) pronounces:

In my humble opinion, this is still not a bug, but feature request. This feature request, however, should be very high on TODO list, as it would make a feature that is sorely missing, particularly since tuple equality expressions have been already optimized.

Fast-forward to 2013: even in 5.6 we still have to use expressions with quadratic evaluation cost, (a,b) IN (( still causes full table scans, and people are switching to NoSQLs, NewSQLs and OtherSQLs because (among other reasons) there is no way to efficiently express batch predicates in MySQL. Back in the day MySQL was ahead of everyone else in ease of use for simple expressions, but almost nothing has changed since then to make developers’ lives easier. Did you know that it may actually be faster to create a temporary table with all the values and join against it than to use any of the existing SQL methods?
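A sketch of that workaround, with made-up database, table and column names (mydb, a table t with a composite index on (a, b), and a throwaway lookup table), just to show the shape of it:

mysql mydb <<'SQL'
CREATE TEMPORARY TABLE lookup (a INT NOT NULL, b INT NOT NULL, PRIMARY KEY (a, b));
INSERT INTO lookup VALUES (1, 2), (3, 4), (5, 6);
-- the join can use the (a, b) index on t, unlike (a,b) IN ((1,2),(3,4),...)
SELECT t.* FROM t JOIN lookup USING (a, b);
SQL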

Where is MySQL innovation nowadays? In this exact case I guess you’re supposed to use the memcached API (forget authorization, common interfaces, transactions and what not) – solving this in SQL is too complicated.

on wikipedia and mariadb

There’s some media coverage about Wikipedia switching to MariaDB; I just want to point out that the performance figures cited are somewhat incorrect and don’t attribute the gains to the correct authors.

A proper performance evaluation should include not just MariaDB 5.5 but Oracle’s MySQL 5.5 as well, because that’s where most of the performance development happened (multiple buffer pools, rollback segments, change buffering et al.).

5.5 is faster for some workloads, while 5.1-fb can outperform 5.5 in others (ones with lots of I/O). It is good to know that there’s a beneficial impact from upgrading (though I’d wait for 5.6), but it is important to state that it is an effort from Oracle as well, not just from MariaDB developers.

P.S. As far as I understand, the decision to switch is political, and with the 5.6 momentum right now it may not be the best one – 5.6 is going to rock :-)

on seconds_behind_master sleuthing

With large enough infrastructure it gets more and more complicated to tell whether an incident or a problem is a real systems problem or a monitoring glitch. This is a story of one such investigation.

With a sufficiently large set of machines, there are multiple graphs/sets of data to look at to understand what is going on:

  • Group average – the easiest to look at for long-term trends or system-wide problems
  • Top-N chart – our favorite for dive-ins, looking only at the top offender lines (quite often some of the offender lines end up matching the tier baseline)
  • Percentiles chart – knowing the best performers, medians, 99% or 99.9% behavior is incredibly useful for understanding medium-term dynamics and gives lots of insight into where to look for problems. It is way more complicated to collect, though.

Another very important property of the data one looks at is collection frequency – while thousands of servers aggregated into a single chart already provide a smoothed view of what is going on, individual charts will always look way choppier, without intermediate dots to show how a metric progressed. You will see that there are lots of threads running, but it won’t explain how that built up – only longer-term incidents averaged out over lots of machines will show that.

In our deployment, replication lag is a very important metric of how well we’re doing – with faker in production, any lag already means that a machine (or machines) is in trouble, and usually that is either hardware trouble or way too much workload from applications; both problems need immediate action.

It is even more interesting on pure flash machines – those are not supposed to lag at all once they’re in production, unless something is really, really wrong.

This is why, once I saw a group of flash-only machines reporting higher replication lag values, I jumped in to investigate. There could have been various reasons – e.g. long transactions increasing the cost of write queries, and what not.

Looking at the charts I saw that there was some lag, that some machines spiked to 1000+ second lag values, and the percentiles showed that only a small number of machines were hitting this. Interestingly, each of those machines would hit a lag spike and then behave normally again afterwards.

I tried to find a machine that was in the high-lag condition and check what was going on there. Listing machines with their lag values was relatively easy:

pmysql "SHOW SLAVE STATUS" | cut -f 1,34 | sort -nr -k 2,2 | head

What was odd was that each time I ran this, the machines on the list were different. My first thought was that some old transactions with old timestamps were showing up (that is the usual reason for odd Seconds_Behind_Master spikes); figuring out the statements would have been a piece of cake:

pmysql "SELECT * FROM information_schema.processlist where 
    user='system user' and info is not null and time>100"

Unfortunately, even though “SHOW SLAVE STATUS” kept showing me machines with 100s+ of lag, none of them showed up in the processlist query.

The aha! moment came when I looked at the binlog/relay log values – every “lagging” machine was at one of two relay log positions, 298 or 308. So I looked at the timestamps of the events at those relay log positions.
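A plain mysqlbinlog dump of the relay log is enough for that – something along these lines (the relay log file name here is made up) produces the listing below:

mysqlbinlog relay-bin.003412 | head -n 30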

#120921 13:21:42 server id 22222  end_log_pos 106 
    Start: binlog v 4, server v 5.1.53-log created 120921 13:21:42
#120921 13:21:42 server id 4294967295  end_log_pos 151 
    Rotate to binary-logs.017880  pos: 81259576
#691231 16:00:00 server id 11111  end_log_pos 0 
    Rotate to binary-logs.017880  pos: 81259576
#120921 13:17:39 server id 11111  end_log_pos 0 
    Start: binlog v 4, server v 5.1.53-log created 120921 13:17:39
#120921 13:21:42 server id 11111  end_log_pos 81259647 
    Query ...

The first record was the relay log header (a format description event); the next two were rotation events carrying the binary log position, one with a bogus server_id (-1), the other with a bogus timestamp (0). Then we suddenly get the master’s binary log header, which is taken from the very beginning of the binary log and carries the binlog creation timestamp, and only then a query from the middle of the binary log, with a proper timestamp on it.

For the whole duration of the SQL thread acting on that first query, the server thought it was executing events from 13:17:29 and not 13:21:42. Seconds_Behind_Master was telling us that a gross offense of 253s of replication lag was being committed, and database users were suffering.

Fortunately, it was just a MySQL status variable glitch and data integrity was completely fine, but this is a reminder of how important quality monitoring and internal metrics are for large environments – and how easy it is to draw the wrong conclusions otherwise.

P.S. Of course, a bug was filed :)