Our latest changes have been pushed to public mysql@facebook branch, allowing this post to happen \o/
Recently we started rolling out InnoDB compression to our main database tier, and that has been a huge undertaking for multiple teams and a major test for MySQL. Nizam was sure the hero of all this work, and make sure you don’t miss his talk about it at MySQL conference.
Though MySQL manuals have quite some introduction about benefits of compression, we agree that benefits are good – in theory we can do less reads from disk, keep more data in buffer pool or flashcache and take less disk space on premium disk property. The benefits sounded so great, that our engineering team decided to disregard what Oracle has to say about workload characteristics and make it work with whatever workload we have.
There were major architectural issues – for example writing full compressed page images to transaction log is huge flaw for busy systems, and even with write behind caching on underlying hardware that ended up being bottleneck and resource hog.
Another important architectural difference for OLTP is avoiding failed compressions – which was the major CPU cost. Solution to that was adaptive padding – server tries to maintain uncompressed images at a level that would nearly always compress into smaller block sizes.
There were also various bugs that caused servers to melt down if there was even single compressed table on them, as well as numerous other compression problems to fix.
Obviously compression means much more CPU work, and that is especially costly for the replication thread – as it has way less time to be blocked on disk reads and has to spend more time compressing and decompressing. There’re two ways to approach that problem, one is doing less disk reads, other is doing less of everything else. Of course, if there’re multiple ways to solve a problem, we will approach all of them :)
Proper replication prefetching was at the core of this effort – not only it precaches table data from disk, but also decompresses pages for replication thread, as well as loads relay logs if they have been paged out already. Our newest push has few stability and performance fixes for Percona’s fake changes – apparently sibling page read-ins for InnoDB latching was nearly 95% of our replication thread I/O at some time.
The “everything else” part consisted of various CPU inefficiencies and stalls. For example, InnoDB waits for five milliseconds if it detects that other thread is already reading the compressed page – and these collisions sure happen with active prefetching and busy workloads – we constantly saw replication thread “stuck in 90ies“.
Also, InnoDB was actively double-checksumming pages when decompressing them – though checksum on disk read is sure understandable, checksumming while reading a page from buffer pool is certainly not – few % went down that direction.
There were few other evil behaviors in new code paths – e.g. malloc() was being done while holding InnoDB buffer pool mutex, escalating stalls from other places to InnoDB lockups.
We’re still trying to understand implications of uncompressed LRU heuristics – InnoDB will increase amount of pages held uncompressed if it is doing 50x more decompressions than disk reads, which on 10000 IOPS machine means around 8GB/s of decompressed data. We added the tunables, but for now it looks that some of our machines are still I/O bound and we’re not sure if that is a problem on other type of hardware.
It was a bit fun to spot that more than 2% of time was spent loading database options (yes, that db.opt file that has pretty much nothing in it) – even if they are needed for CREATE TABLE only.
Of course, more instrumentation and monitoring was necessary to understand and manage compression in production – standard InnoDB gives just some global overview, but to properly understand what is going on one needs per-table information.
There’re plenty of possible next steps for the compression future – more efficient packing, performance improvements, different algorithms, etc – but for now we see that first phase worked out.