The “DBMS” part of MySQL is fine – storage engines (especially with new kids on the block), replication, etc, but it really sucks at executing SQL.
I won’t even get into complex SQL that has complex data dependencies, will start with very basic tasks.
SELECT * FROM table WHERE indexed = "A" LIMIT 1
If multiple indexes can satisfy the query, MySQL will hit each of them at least twice – looking up first “A” and last “A” records. It will do that while planning the SQL execution. Once it comes up with a plan, it will go and hit the index it picked again, this time once. So, if you have two candidate indexes, you will have 5 index accesses at 4 positions.
How would a direct cursor access work? Single index hit. Want to simulate that in SQL? You can add a ‘FORCE INDEX’ hint, then only one index will be considered, but still at two positions and you will have three index accesses. I wrote about this before.
SELECT * FROM table WHERE indexed IN (101, 201, 301) LIMIT 1
Would do six index dives during preparation as well (two for each IN position), plus another one during execution – seven dives when one would suffice.
What we learn from this is that whenever there’s an opportunity for lazy evaluation, MySQL will gladly miss it and do most work possible to run your query.
If you could express it yourself, you’d be lazy and you’d be efficient. It gets even worse if we start expecting something more advanced than a basic cursor positioning.
Whenever you’re doing a range scan on (a, b) indexed table:
SELECT * FROM table WHERE a BETWEEN 1 and 100 AND b=1234567 LIMIT 1
There’s an easy optimization given low-cardinality ‘a’ – you jump to each ‘a’ position and then you can do a dive for the ‘b’ value. In a way you can emulate this behavior with:
SELECT * FROM table JOIN ( SELECT DISTINCT a FROM table ) x USING (a) WHERE b=1234567 LIMIT 1
As I mentioned, MySQL will skip any opportunity to be lazy and in this case it will fully materialize all distinct ‘a’ values. If it were able to lazy evaluate, there’s a chance we can terminate the scan early.
We’re adding ability to skip-scan records in our builds (you can follow the work at https://reviews.facebook.net/D59877).
It is quite easy to describe whatever is needed in basic loop though – storage engines already provide with necessary hooks for these types of access methods.
Another basic single-table SELECT that fails with SQL is a standard “feed” query:
SELECT * FROM table WHERE category IN (5,6) ORDER BY time DESC LIMIT 5;
MySQL will have to pick between two alternatives, one would be just scanning ‘time’ index and searching for specified categories among found rows, another is read all entries for each category, sort them all and return top 5.
Efficient way to execute such query would involve having per-category cursors and merging their reads, something like:
( SELECT * FROM table WHERE category = 5 ORDER BY time DESC LIMIT 5 ) UNION ALL ( SELECT * FROM table WHERE category = 6 ORDER BY time DESC LIMIT 5 ) ORDER BY time DESC LIMIT 5
MySQL is unable to merge these two cursors without writing all the data into temporary file and sorting it – although we already are reading much smaller datasets than with alternative naive (and readable) query. Besides, each subselect will open a table handler with all the associated buffers and if you want to merge hundred cursors you’re looking at hundred open tables.
You can do this operation efficiently in bash (with ‘sort -m’), it isn’t that complicated algorithm in any scripting language, but having such access method in MySQL’s SQL doesn’t seem likely.
Even where MySQL is already doing efficient loose scan (like in indexed-based GROUP BY), there were basic bugs open for years where efficiency would go down the drain simply because a LIMIT is added.
There’re quite a few other limitations in access methods that are quite annoying to work around. For example a query like this one:
SELECT * FROM table ORDER BY priority DESC, time ASC
Would be unable to use a regular index on (priority, time) – it can only walk everything in one direction and cannot have mixed order scans that are trivial to implement in basic procedural algorithm (move cursor to lowest time of lower priority, read all records for that priority in ascending order before repositioning the cursor to even lower priority).
Of course, one can change the direction of an index, or even have multiple indexes on same columns just to get an ordered read efficient. But none of that is needed if there’s a proper access method that can be used by SQL execution.
Directional index hints may be needed just to specify common order (e.g. prefix compression in RocksDB makes one direction cheaper than the other, although both still much cheaper than full blown filesort).
Even basic single order traversal breaks if you’re joining two tables:
SELECT * FROM t1 JOIN t2 USING (b) ORDER BY t1.a, t1.b, t2.c
In some cases I (*shudder*) had to use ‘FORCE INDEX’ without an ORDER BY to pick a direction I needed. One could expect basic functionality like ordered JOIN traversal to be part of SQL package.
There’re various other issues related to JOINs – again, in the best tradition of MySQL, lazy evaluation of joins is not possible – it will gladly do more work than you’d ever need, reading from tables you did not ask for.
These things slowly slowly change with each version, and to get basic things right we have to yell a lot. Hurray, 5.7 may finally have fixes for issues we were working around back when 5.x series did not exist) – but that is just way too slow for an SQL layer to evolve.
MySQL employs many bright engineers many of whom are trying to make better cost decisions on same existing limited set of access methods they use, by spending more cycles and adding more and more overhead.
In order to execute various queries one has to either resort to network roundtrips after each row or over-fetch data all the time. You may suggest stored procedures, but their data access is limited to running SQL queries and returning them as multiple result sets (or buffer everything in temporary table). Amount of deferred work and lazy evaluation you can do in a stored procedure is quite limited and overhead of running many SQL statements within a procedure is high.
Ability to access data via lower level interfaces by high performance server-side language (Lua? JS?) would allow to circumvent many many many limitations and inefficiencies of existing SQL optimizer and execution layer while still allowing occasional SQL access, high performance storage engines, replication and all the tooling that has been built to support MySQL at scale.
What do we do today? We run lots of that conditional logic in our application and absorb lots of the cost with a cache layer in the middle. There’s a chance that we would be able to spend less CPU, less I/O, less memory and less network resources if we could ask smarter queries expressed as procedures instead of wrangling all the relational algebra on top of dumb executor.
In some cases we have different database systems.
P.S. Many of the hypothetical scenarios map to workloads where amounts of data and query volume warrants all the optimizations I discussed here.
P.P.S. Other databases may have other set of issues.
We’ve been learning for many years how to run Linux for databases, but over time we realized that many of our lessons learned apply to many other server workloads. Generally, server process will have to interact with network clients, access memory, do some storage operations and do some processing work – all under supervision of the kernel.
Unfortunately, from what I learned, there’re various problems in pretty much every area of server operation. By keeping the operational knowledge in narrow camps we did not help others. Finding out about these problems requires quite intimate understanding of how things work and slightly more than beginner kernel knowledge.
Many different choices could be made by doing empiric tests, sometimes with outcomes that guide or misguide direction for many years. In our work we try to understand the reasons behind differences that we observe in random poking at a problem.
In order to qualify and quantify operational properties from our server systems we have to understand what we should expect from them. If we build a user-facing service where we expect sub-millisecond response times of individual parts of the system, great performance from all of the components is needed. If we want to build high-efficiency archive and optimize data access patterns, any non-optimized behavior will really stand out. High throughput system should not operate at low throughput, etc.
In this post I’ll quickly glance over some areas in memory management that we found problematic in our operations.
Whenever you want to duplicate a string or send a packet over the network, that has to go via allocators (e.g. some flavor of malloc in userland or SLUB in kernel). Over many years state of the art in user-land has evolved to support all sorts of properties better – memory efficiency, concurrency, performance, etc – and some of added features were there to avoid dealing with the kernel too much.
Modern allocators like jemalloc have per-thread caches, as well as multiple memory arenas that can be managed concurrently. In some cases the easiest way to make kernel memory management easier is to avoid it as much as possible (jemalloc can be much greedy and not give memory back to the kernel via lg_dirty_mult setting).
Just hinting the kernel that you don’t care about page contents gets them immediately taken away from you. Once you want to take it back, even if nobody else used the page, kernel will have to clean it for you, shuffle it around multiple lists, etc. Although that is considerable overhead, it far from worst what can happen.
Your page can be given to someone else – for example, file system cache, some other process or kernel’s own needs like network stack. When you want your page back, you can’t take it from all these allocations that easily, and your page has to come from free memory pool.
Linux free memory pool is something that probably works better on desktops and batch processing and not low latency services. It is governed by vm.min_free_kbytes setting, which has very scarce documentation and even more scarce resource allocation – on 1GB machine you can find yourself with 5% of your memory kept free, but then there’re caps on it at 64MB when autosizing it on large machines.
Although it may seem that all this free memory is a waste, one has to look at how kernel reclaims memory. This limit sets up how much to clean up, but not at when to trigger background reclamation – that is done at only 25% of free memory limit – so memory pool that can be used for instant memory allocation is at measly 16MB – just two userland stacks.
Once you exhaust the free memory limit kernel has to go into “direct reclaim” mode – it will stall your program and try to get memory from somewhere (thanks, Johannes, for vmscan/mm_vmscan_direct_reclaim_begin hint). If you’re lucky, it will drop some file system pages, if you’re less lucky it will start swapping, putting pressure on all sorts of other kernel caches, possibly even shrinking TCP windows and what not. Understanding what kernel will do in the direct claim has never been easy, and I’ve observed cases of systems going into multi-second allocation stalls where nothing seems to work and fancy distributed systems failover can declare node dead.
Obviously, raising free memory reserves helps a lot, and on various servers we maintain 1GB free memory pool just because low watermark is too low otherwise. Johannes Weiner from our kernel team has proposed tunable change in behaviors there. That still requires teams to understand implications of free memory needs and not run with defaults.
Addressing this issue gets servers into much healthier states, but doesn’t always help with memory allocation stalls – there’s another class of issues that was being addressed lately.
I wrote about it before – kernel has all sorts of nasty behaviors whenever it can’t allocate memory, and certain memory allocation attempts are much more aggressive – atomic contiguous allocations of memory end up scanning (and evicting) many pages because it can’t find readily available contiguous segments of free memory.
These behaviors can lead to unpredictable chain of events – sometimes TCP packets arrive and are forced to wait until some I/O gets done as memory compaction ended up stealing dirty inodes or something like that. One has to know memory subsystem much more than I do in order to build beautiful reproducible test-cases.
This area can be addressed in multiple ways – one can soften allocation needs of various processes on the system (do iptables really need 128k allocation for an arriving segment to log it via NFLOG to some user land collection process?), also it is possible to tweak kernel background threads to have less fragmented memory (like a cronjob I deployed many years ago) or of course, getting the memory reclamation order into decent shape instead of treating it as a black box that “should work for you unless you do something wrong” (like using TCP stack).
Some of our quick changes (like net: don’t wait for order-3 page allocation) were addressing this case by case basis, but it was amazing to see that this kind of optimization was pulled in to cover many many more allocations via wide-reaching change (mm/slub: don’t wait for high-order page allocation). From my experience, this addresses huge class of reliability and stability issues in Linux environments and makes system behavior way more adaptive and fluid.
There are still many gray areas in Linux kernel and desktop direction may not always help addressing them. I have test-cases where kernel is only able to reclaim memory at ~100MB/s (orders of magnitudes away from RAM performance) – and what these test cases usually have in common is “this would happen on a server but never on a desktop”. For example if your process writes a [transaction] log file and you forget to remove it from cache yourself, Linux will thrash on the inode mutex quite a bit.
There’re various zone reclaim contract violations that are easy to trigger with simple test cases – those test cases definitely expose edge behaviors, but many system reliability issues we investigate in our team are edge behaviors.
In database world we exasperate these behaviors when we bypass various kernel subsystems – memory is pooled inside the process, files are cached inside the process, threads are pooled inside the process, network connections are pooled by clients, etc. Kernel ends up being so dumb that it breaks on a simple problems like ‘find /proc’ (directory entry cache blows up courtesy of /proc/X/task/Y/fd/Z explosion ).
Although cgroups and other methods allow to constrain some sets of resources within various process groups, it doesn’t help when a shared kernel subsystem goes into an overdrive.
There’re also various problems with memory accounting – although kernel may report you quickly how many dirty file system pages it has, it doesn’t give equal opportunities to network stack. Figuring out how much of memory is in socket buffers (and how full these buffers are) is a non-trivial operation, and on many of our systems we will have much more memory allocated to network stack than to many other categories in /proc/meminfo. I’ve written scripts that pull socket data from netlink, try to guess what is the real memory allocation (it is not straightforward math) to produce a very approximate result.
Lack of proper memory attribution and accounting has been a regular issue – in 3.14 a new metric (MemAvailable) has been added, which sums up part of cache and reclaimable slab, but if you pay more attention to it, there’s lots of guessing whether your cache or slab is actually reclaimable (or what the costs are).
Currently when we want to understand what is cached, we have to walk the file system, map the files and use mincore() to get basic idea of our cache composition and age – and only then we can tell that it is safe to reclaim pages from memory. Quite a while ago I have written a piece of software that removes files from cache (now vmtouch does the same).
Nowadays on some of our systems we have much more complicated cache management. Pretty much every buffered write that we do is followed by asynchronous cache purge later so that we are not at the mercy of the kernel and its expensive behaviors.
So, you either have to get kernel entirely out of your way and manage everything yourself, or blindly trust whatever is going on and losing efficiency on the way. There must be a middle ground somewhere, hopefully, and from time to time we move in the right direction.
In desktop world you’re not supposed to run your system 100% loaded or look for those 5% optimizations and 0.01% probability stalls. In massively interconnected service fabrics we have to care about these areas and address them all the time, and as long as these kinds of optimizations reach wider set of systems, everybody wins.
TL;DR: upgrade your kernels and bump vm.min_free_kbytes :-)
If you remove a file, file system generally just marks in its metadata that previously occupied blocks can now be used for other files – that operation is usually cheap, unless the file has millions of segments (that is such a rare case, only seen in experimental InnoDB features that Oracle thought was a good idea).
This changes a bit with SSDs – if you update underlying device metadata, it can have smarter compaction / grooming / garbage collection underneath. Linux file systems have ‘discard’ option that one should use on top of SSDs – that will extend the life time of their storage quite a bit by TRIM’ing underlying blocks.
Now, each type of storage device will react differently to that, some of them support large TRIM commands, some of them will support high rate of them, some of them won’t, etc – so one has to take that into account when removing files in production environments.
Currently Linux block layer sees TRIM commands in same shape as write commands, so if you are truncating a terabyte, it is seen as a terabyte of write activity (and managed in similar fashion). That may make your writes (and/or reads) suffer.
Probably correct place to handle this better could be a file system – but it doesn’t have good feedback signals and doesn’t really have the job of I/O scheduling. Currently it does this data discard at the same place as it returns blocks – in a path that is expected to be fast, so it may hold file system transaction for a duration of this operation. This may or may not stall other file system activity at the time.
So now that we know that devices can be stupid, device drivers are stupid, block layer is stupid and file systems are stupid, we have to somehow address this issue. An obvious solution is to delete files slower.
Lots of performance engineering can be done by adding right sleeps into appropriate places – so one can do that to ‘rm’ too – it can sleep a bit after certain amount of bytes removed. What to do with large files? We have to slowly truncate them before we unlink them.
So I built a slower rm:
Usage: slowrm [OPTION...] PATH [PATH] ... Help Options: -h, --help Show help options Application Options: -r, --recursive Dive into directories recursively -c, --chunk=128 Chunk size in megabytes -s, --sleep=0.1 Sleep time between chunks -f, --force Continue on errors (by default bail on everything) -x, --one-file-system Only operate on one file system
We do what we got to do.
For years it was very easy to defend InnoDB’s advantage over competition – covering index reads were saving I/O operations and CPU everywhere, table space and I/O management allowed to focus on database and not on file systems or virtual memory behaviors, and for past few years InnoDB compression was the way to have highly efficient OLTP (or in our case – SGTP – Social Graph Transaction Processing) environments. Until one day (for some it came sooner, for others later)…
InnoDB team announced that it will change how it is going to do compression in the future and that old ways (that we rely on) will be all gone. I’m not exactly sure if there was any definite messaging on the future of existing methods, but Oracle in public will never put out a roadmap, and there’s lots of uncertainty involved then. Unfortunately, with this uncertainty, we probably lost quite some momentum in InnoDB engineering efforts (we don’t get to see some of planned advancements like Nizam’s work on page reorganization).
The new way is “InnoDB Transparent PageIO Compression” – and it makes lots of sense from full-stack architecture perspective. It relies on the fact that high end flash storage devices already have a log-structured block storage internally, and if one ties directly into it, lots of overhead can be avoided (similar concepts are used by MariaDB’s atomic writes).
We were throwing this idea around as a thought exercise years ago, and we mentioned it here and there. As every thought exercise, we had lots of pros and cons to think about.
One problem is that even it is log structured internally, it is still glued together out of blocks. Few years ago disks and flash devices used to be 512-byte formatted. Nowadays industry is switching to 4k sectors (on disks it yields higher density, on flash it reduces flash translation layer (FTL) costs).
If 16k compresses into 9k, earlier assumption was that new layer will write only 9k. With 4k sectors it will actually write 12k, oh no. How do we solve that with old-style compression? We only partially fill InnoDB’s page so that we will write 8k. In this case InnoDB deciding to be naive and not do any speculative page size management ends up writing much more than solutions used at large scale environments.
Another problem is that buffer pool is no longer compressed. This may mean you will need to buy devices with more IOPS and higher write endurance. Compressed buffer pool is huge advantage, and without it users will just have to spend more on hardware (and Oracle is in selling hardware business, yay!).
Then there’s this whole other thing, which makes absolutely no sense. Why would Oracle decide to support single hardware vendor (it doesn’t even own) proprietary solution in its ubiquitous open-source product. They say they’re using APIs that work elsewhere, but thats where it is recycled bovine manure.
When you’re talking to flash device, its FTL is hiding the fact that everything is truly fragmented underneath you and the namespace it has to deal with it does not have any complicated dependencies – it is essentially log-structured K/V store, where key is block address. The ease of log structured design is that you’re writing to very few places (and you’re usually appending). General purpose file system such as XFS has to handle all the metadata between underlying flat-addressed block device and directories, file placement, extents and writes to files. On top of that it has to provide semantics like file expansion, renames, deletion, all happening on that single block device underneath.
For quite a while InnoDB was holding a global mutex when extending files – and that is very trivial operation comparing to what hole punching would mean. Hole punching inside a file system would make each InnoDB page a separate segment that has to be tracked via file system metadata management (so every page write will be accompanied by filesystem journal and metadata writes). There is a question whether file system is going to scale, and then there’s just basic efficiency (a sparse synchronous write is ~5x more expensive than non-sparse one).
Dropping a file with millions of file system segments in it will take minutes of CPU time and lock contention on allocation group (each segment has to be evaluated, added back to list of free space segments with possible merging, etc). Understanding implications of extreme fragmentation (can you even use the file system once it hits 50% full? 75% full?) is not that straightforward either.
I did not have to think at all about file system scalability before (as long as writes got through), now I can’t stop noticing things like XFS padding log writes to a imaginary or real stripe size (as if every RAID is RAID5).
So while Oracle has completely messed up with InnoDB compression roadmap, surrounding industry moved ahead in leaps and bounds. Remember that toy MongoDB with all of its inefficiencies? This is where it is today:
Chasing benchmarks is not enough to win a datacenter, especially when large scale environments are working on improving efficiency of systems, not just throughput. RocksDB has been making its way into InnoDB’s turf in MySQL world, MongoDB ecosystem has RocksDB, TokuDB, WiredTiger. Embeddable InnoDB does not exist anymore, so most of innovation in storage systems ends up completely ignoring it.
While Oracle orients MySQL towards proprietary file systems and hardware devices, we will see more and more new platforms on top of open-source pluggable storage engines.
Though we did deploy recently some non-compressed InnoDB environments (I am going to talk at MySQL Conference about our MySQL/InnoDB Messenger backend), Yoshinori is going to talk about LSM databases at Facebook too and Harrison’s keynote will be about all the different systems that are needed to deal with complex data problems.
There is much more to write about all the work we do at Facebook with memory management efficiency on our systems, but there was this one detour investigation in the middle of 2012 that I had to revisit recently courtesy of Wikipedia.
There are lots of factors that make machines page out memory segments into disk, thus slowing everything down and locking software up – from file system cache pressure to runaway memory leaks to kernel drivers being greedy. But certain swap-out scenarios are confusing – systems seem to have lots of memory available, with proper settings file system cache should not cause swapping, and obviously in production environment all the memory leaks are ironed out.
And yet in mid-2012 we noticed that our new kernel machines were swapping out for no obvious reason. When it comes to swapping, MySQL community will always point to Jeremy’s post on “swap insanity” – it has something to do with NUMA and what not. But what we observed was odd – there was free memory available on multiple nodes when swapping out happened. Of course, one of our kernel engineers wrote a NUMA rebalancing tool that attaches to running CPUs and evens out memory allocations without any downtime (not that we ended up using it…) – just in case Jeremy’s described issue is actually an issue for us.
In some cases systems threw warning messages in kernel logs that immediately helped us to get closer to the problem – network device driver was failing to allocate 16k memory pages.
Inside Linux kernel one has two ways to allocate memory, kmalloc and vmalloc. Generally, vmalloc will go through standard memory management, and if you ask for 16k, it will glue together 4k pages and allocation will succeed without any problems.
kmalloc though is used for device drivers when hardware is doing direct memory access (DMA) – so these address ranges have to be contiguous, and therefore to allocate it one has to find subsequent empty pages that can be used. Unfortunately, the easiest way to free up memory is looking at the tail of LRU list and drop some – but that does not give contiguous ranges.
Actual solution for ages was to organize the free memory available into powers-of-2 sized buckets (4k pages, 8k, 16k, ) – called Buddy Allocator (interesting – it was implemented first by Nobel Prize winner in Economics Harry Markowitz back in 1964). Any request for any memory size can be satisfied from larger buckets, and once there’s nothing in larger buckets one would compact the free memory by shuffling bits around.
One can see the details of buddy allocator in /proc/buddyinfo:
Node 0, zone DMA 0 0 1 0 2 1 Node 0, zone DMA32 229 434 689 472 364 197 Node 0, zone Normal 11093 1193 415 182 38 12 Node 1, zone Normal 10417 53 139 159 47 0
(Columns on the left are indicating numbers of small memory segments available, columns on the right – larger).
It is actually aiming for performance that leads to device drivers dynamically allocating memory all the time (e.g. to avoid copying of data from static device buffers to userland memory). On a machine that is doing lots of e.g. network traffic it will be network interface grouping packets on a stream into large segments and writing them to these allocated areas in memory, then dropping all that right after application consumed network bits, so this technique is really useful.
On the other side of the Linux device driver spectrum there are latency sensitive operations, such as gaming and music listening and production. This millennium being the Millennium of Linux Desktop results in Advanced Linux Sound Architecture users (alsa-users) to complain that such memory management sometimes makes their sound drivers complain. That would not be much of an issue on well-tuned multi-core servers with hardware interrupt handling spread across multiple threads, but Linux kernel engineers prefer the desktop and disabled compaction altogether in 2011.
If memory is not fragmented at all, nobody notices. Although on busy servers one may need to flush gigabytes or tens of gigabytes of pages (drop caches if it is file system cache or swap out if it is memory allocated to programs) to find a single contiguous region (though I’m not sure how exactly it chooses when to stop flushing).
Fortunately, there is a manual trigger to force a compaction that my fellow kernel engineers were glad to inform me about (as otherwise we’d have to engineer a kernel fix or go for some other workarounds). Immediatelly a script was deployed that would trigger compaction whenever needed, so I got to forget the problem.
Until now where I just saw this problem confusing engineers at Wikipedia – servers with 192GB of memory were constantly losing their filesystem cache and having all sorts of other weird memory behaviors. Those servers were running Varnish, which assumes that kernels are awesome and perfect, and if one is unhappy, he can use FreeBSD :)
There were multiple ways to deal with the issue – one was just disabling features on hardware that use the memory (e.g. no more TCP offloading), another is writing 1s into /proc/sys/vm/compact_memory – and maybe some new kernels have some of alleviations to the problem.
Update: By popular demand I published the script that can be used in cron
MySQL did not start as a durable data store and had lots of mockery for that – (ISAM? no replication?). Eventually InnoDB took over, and it brought at least parts of MySQL into a reliable storage world. Checksummed pages, decent crash recovery, good synchronous behavior had InnoDB ahead of open source competition for quite a while, as well as on par with other solutions. Unfortunately, that safety was limited only to InnoDB row operations and not DDL or replication state.
In the world where nothing before was synchronous, transitioning to reliable storage introduced lots of slowdowns, and still was not good enough.
There was lots of work done outside of internal MySQL/Sun/Oracle development to help with some of these problems. For example Google 4.0 patch tried to solve slave crash safety by storing replication state inside InnoDB – not only that allowed slaves to properly recover after a crash, but also that was achieved without synching storage at every transaction. A wish to run masters reliably required to synchronize binary logs with data store, leading to three synchronous data writes per transaction – forcing multiple parties to work on commit grouping implementations. Costly checksums were offloaded to on-chip CRC32 implementations, etc.
Eventually we got MySQL to the situation where it was considered to be reliable. Of course, that came with a cost.
One problem is that log writes have huge write amplification induced by OS paging system, so a synchronous tiny log write of few hundred bytes was written out as 4k page. We observe ~3x-4x write amplification from this on our database masters. That is not much of an issue if underlying hardware (e.g. NVRAM on RAID controller) can absorb all that, but on systems that use SSDs underlying hardware may no longer do such merging, and limited write cycles of flash storage suffer from such write amplification a lot. We end up writing more log pages than data pages (though much less log bytes), so it is a major issue for write endurance on flash devices.
Other problem is that underlying hardware isn’t always fast. Though on modern devices super-capacitors don’t break as much as batteries used to, and are not subject to recharge cycles, still, there are various sorts of I/O stalling, that impact durable behavior of high performance systems. Flash devices are running all sorts of background activity that can make fast writes suddenly be not so fast. In such cases a system that is gladly eating tens of thousands of writes a second suddenly has to accommodate a backlog of thousands of transactions, assuming that they are not locking each other out. Group commit may not be much of a help here.
In MySQL world durability is controlled via two settings – one tells whether InnoDB should fsync on every transaction, other is how often binary log should be synced. So, either you get to have fully durable system, or buy into unsynchronized environment with up to a second of data loss. I was making fun of MemSQL before for their initial durability implementation, but honestly now both MemSQL and PostgreSQL have durability settings that allow millisecond precision control.
In MySQL 5.6 finally we get the transactional replication states on slaves (I cannot imagine running a replicated environment without that) and semi-synchronous replication allows to have network durability, which may extend or replace existing host-local durability. Even though this allows higher availability and consistency of a replica set, it still does not make masters crash safe – replication state of a master is not synchronized with transactional state of data subsystem, so in case of master crash one is supposed to discard it instead of being able to resync it. That may not sound as an issue for ten user websites, but when instances go into terabyte or tens of terabytes size ranges, rebuilding masters after crashes is costlier than one would think.
Solution sounds somewhat obvious – include replication state within the transactional store and use it to re-synchronize with the replica set, allowing to skip most of expensive synchronous page overwrites, and introducing best-effort background syncing (e.g. sync data written up to a page boundary). Unfortunately, even with GTIDs and semi-sync replication that may not be exactly straightforward and 100% reliable. Still, in large environments it is more about statistics and costs at a large scale rather than standalone system operation, so with good understanding of the impact tradeoffs can be made.
The cost of double-write buffer has been long neglected as well (in compressed environments it is even triple-writing or nonuple-writing), and even some hardware vendors are offering atomic writes, more standard stack still has to rely on it to make sure recovery is successful – apparently it is used more than we expected. In large scale environments one may just want to quickly detect broken pages rather than fully recover, so it may be possible to shrink the double write buffer just to store page IDs. Of course, with more devices supporting atomic writes future may be better here, but alternative approaches can be useful as well (including network-based physical data recovery).
Currently there are more and more systems that provide proper network durability without having to do expensive host-level durability, and MySQL world has quite a bit of catching up to do in order to stay useful in datacenter environments. The way crash safe slaves made it feasible to run replication at scale five years ago, network-durable crash safe masters may be needed to compete with other solutions today.
Been a while since I wrote about replication work we did. Fake changes based approach was huge success, and now our prefetching has lots of coverage, where standard SELECTs cannot reach. We’re running our systems at replication pressure, where not running faker immediately results in replication lag. On busier machines Python implementation started using quite some CPU and ended up occasionally hitting GIL issues.
So, here’s the straightforward rewrite of fake changes replication prefetcher, faker. It can run 100k statements a sec, if needed. To get it, you can run:
bzr co lp:mysqlatfacebook/tools; cd faker
Our latest changes have been pushed to public mysql@facebook branch, allowing this post to happen \o/
Recently we started rolling out InnoDB compression to our main database tier, and that has been a huge undertaking for multiple teams and a major test for MySQL. Nizam was sure the hero of all this work, and make sure you don’t miss his talk about it at MySQL conference.
Though MySQL manuals have quite some introduction about benefits of compression, we agree that benefits are good – in theory we can do less reads from disk, keep more data in buffer pool or flashcache and take less disk space on premium disk property. The benefits sounded so great, that our engineering team decided to disregard what Oracle has to say about workload characteristics and make it work with whatever workload we have.
There were major architectural issues – for example writing full compressed page images to transaction log is huge flaw for busy systems, and even with write behind caching on underlying hardware that ended up being bottleneck and resource hog.
Another important architectural difference for OLTP is avoiding failed compressions – which was the major CPU cost. Solution to that was adaptive padding – server tries to maintain uncompressed images at a level that would nearly always compress into smaller block sizes.
There were also various bugs that caused servers to melt down if there was even single compressed table on them, as well as numerous other compression problems to fix.
Obviously compression means much more CPU work, and that is especially costly for the replication thread – as it has way less time to be blocked on disk reads and has to spend more time compressing and decompressing. There’re two ways to approach that problem, one is doing less disk reads, other is doing less of everything else. Of course, if there’re multiple ways to solve a problem, we will approach all of them :)
Proper replication prefetching was at the core of this effort – not only it precaches table data from disk, but also decompresses pages for replication thread, as well as loads relay logs if they have been paged out already. Our newest push has few stability and performance fixes for Percona’s fake changes – apparently sibling page read-ins for InnoDB latching was nearly 95% of our replication thread I/O at some time.
The “everything else” part consisted of various CPU inefficiencies and stalls. For example, InnoDB waits for five milliseconds if it detects that other thread is already reading the compressed page – and these collisions sure happen with active prefetching and busy workloads – we constantly saw replication thread “stuck in 90ies“.
Also, InnoDB was actively double-checksumming pages when decompressing them – though checksum on disk read is sure understandable, checksumming while reading a page from buffer pool is certainly not – few % went down that direction.
There were few other evil behaviors in new code paths – e.g. malloc() was being done while holding InnoDB buffer pool mutex, escalating stalls from other places to InnoDB lockups.
We’re still trying to understand implications of uncompressed LRU heuristics – InnoDB will increase amount of pages held uncompressed if it is doing 50x more decompressions than disk reads, which on 10000 IOPS machine means around 8GB/s of decompressed data. We added the tunables, but for now it looks that some of our machines are still I/O bound and we’re not sure if that is a problem on other type of hardware.
It was a bit fun to spot that more than 2% of time was spent loading database options (yes, that db.opt file that has pretty much nothing in it) – even if they are needed for CREATE TABLE only.
Of course, more instrumentation and monitoring was necessary to understand and manage compression in production – standard InnoDB gives just some global overview, but to properly understand what is going on one needs per-table information.
There’re plenty of possible next steps for the compression future – more efficient packing, performance improvements, different algorithms, etc – but for now we see that first phase worked out.
Our database environment is quite busy – there’re millions of row changes a second, millions of I/O operations a second and impact of that can be felt at each shard. Especially, as we also have to replicate to other datacenters, single threaded replication on MySQL becomes a real bottleneck.
We use multiple methods to understand and analyze replication lag composition – a simple replication thread state sampling via MySQL processlist helps to understand logical workload components (and work in that field yields great results), and pstack/GDB based replication thread sampling shows server internal behavior quite well too (a similar technique was used for accept thread visualisation).
The biggest problem with single replication thread is that it has to read data to execute queries (rather than applying physical page deltas, like PG or just appending to files like HBase, it does logical edits to page data) – we can observe 95% of process time at that state. As generally there’s just one outstanding data read per replication thread, other workload hitting the machine will also make replication reads slower.
Generally, the obvious way to deal with slow I/O is issue more outstanding parallel requests, and the only way to do that apart from parallel replication, is to predict what will be needed in future and try to fetch that.
Many many moons ago Paul Tuckfield discussed about the Youtube replication prefetcher – it would take write statements yet to be executed in relay logs, convert them to SELECTs and run them before replication thread needs that data. He still says that was one of most satisfying quick hacks :-)
Maatkit (now Percona Toolkit) introduced mk-slave-prefetch (I played with it back in 2008, didn’t put it into operation at that time though), and eventually that looked like a reasonable option for prefetching statements on our database cluster.
5000 lines of Perl is not the easiest code to work with (or to debug), so the journey was quite bumpy. We got it working in some shape, eventually, but Baron, original author, has something to say about it:
Please don’t use mk-slave-prefetch on MySQL unless you are Facebook. Or at least don’t tell your friends, so they won’t use it.
Anyway, our updates rate would saturate mksp.pl if we used anything fancier on it, so it was a constant balancing act, in which looking at the code was something nobody wanted to do ;-) Still, it was (and is) helping us, so getting rid of it wasn’t possible either.
At some point in time we decided to make an experiment – what if we executed statements, then rolled them back – so I did a quick implementation of that method from scratch in Python – resulting piece of code was relatively small and fun to experiment with.
There were multiple problems with such approach – one complication was that queries were grabbing locks for the duration of the statement, and some of those locks would collide with what actual replication thread is doing. Fixing that would require immediate lock wait timeout or transaction kill for prefetcher thread – so, relatively deep dive into InnoDB. Another problem was internal InnoDB lock contention on rollbacks – that was expensive operation, and benefits of pages read in were negated by rollback segments lock contention. Fixing that is even more extensive InnoDB work (though probably some people would like their rollbacks to be efficient ;-)
At that moment we came up with the idea, that InnoDB codebase could be instrumented to not do any real work on updates – just page data in and return to the caller, and if any change accidentally slips in, commits can fail. That looked like a feasible project for the future.
At some point in time we were rolling out a new database tier for one product, which was supposed to have really high volume of changes, but all coming in a uniform format. It took less than hour (as most of the work has been done to create rollback-based one) to come up with a prototype that would efficiently extract literals from uniform statements, then use them for prefetching.
This method worked fine – at tiny fraction of resources used by mk-slave-prefetch we were preloading secondary indexes and could have relatively extensive parallelism.
Meanwhile, our main database cluster was having more and more uniform query workload, thanks to various libraries, abstractions and middleware – so a day of work on lowest hanging fruits provided relatively good coverage of the workload.
We didn’t stop mksp.pl – it still provided some coverage for various odd cases, which were time-consuming to work on manually.
There were few other problems with the new method – apparently we were targeting our SELECTs too accurately – UPDATEs were spending plenty of time in records_in_range. Additionally, optimistic update path was reading in pages that selects wouldn’t (due to inefficiency in B-Tree locking code). There were some odd reads done for INSERTs.
Also, SELECTs are using indexing less efficiently – InnoDB can pinpoint entries in secondary indexes by using PK values, yet that ability is not exposed to SQL layer, so prefetching on indexes that don’t have explicitly defined all fields within them is not that easy.
In theory, all these issues are supposed to be ‘fixed’ by fake changes concept. Percona recently implemented it in their releases, and we started experimenting with those changes. It is still not that mature concept, so we will be revisiting how things are or should be done, but for now test results are quite positive (we did some changes to reduce locking and avoid deadlock in REPLACE INTO, among other things).
I still observe I/Os done by main replication thread, so we’re not in perfect shape yet, but method seems to be working relatively well (at least it definitely speeds up replication). We still have to do lots of testing to qualify this for large-scale production, but this may allow way more write workload on our machines until we get parallel replication all around.
Our code for custom query, fake changes or rollback prefetcher can be checked out from a public repo together with other tools (oops, Bazaar doesn’t give easy access to subdirectories:
bzr co lp:mysqlatfacebook/tools; cd prefetch
Or browse it online.
P.S. There’s also Tungsten Replicator for ones who don’t want to wait for 5.6 parallel replication.