For years it was very easy to defend InnoDB’s advantage over competition – covering index reads were saving I/O operations and CPU everywhere, table space and I/O management allowed to focus on database and not on file systems or virtual memory behaviors, and for past few years InnoDB compression was the way to have highly efficient OLTP (or in our case – SGTP – Social Graph Transaction Processing) environments. Until one day (for some it came sooner, for others later)…
InnoDB team announced that it will change how it is going to do compression in the future and that old ways (that we rely on) will be all gone. I’m not exactly sure if there was any definite messaging on the future of existing methods, but Oracle in public will never put out a roadmap, and there’s lots of uncertainty involved then. Unfortunately, with this uncertainty, we probably lost quite some momentum in InnoDB engineering efforts (we don’t get to see some of planned advancements like Nizam’s work on page reorganization).
The new way is “InnoDB Transparent PageIO Compression” – and it makes lots of sense from full-stack architecture perspective. It relies on the fact that high end flash storage devices already have a log-structured block storage internally, and if one ties directly into it, lots of overhead can be avoided (similar concepts are used by MariaDB’s atomic writes).
We were throwing this idea around as a thought exercise years ago, and we mentioned it here and there. As every thought exercise, we had lots of pros and cons to think about.
One problem is that even it is log structured internally, it is still glued together out of blocks. Few years ago disks and flash devices used to be 512-byte formatted. Nowadays industry is switching to 4k sectors (on disks it yields higher density, on flash it reduces flash translation layer (FTL) costs).
If 16k compresses into 9k, earlier assumption was that new layer will write only 9k. With 4k sectors it will actually write 12k, oh no. How do we solve that with old-style compression? We only partially fill InnoDB’s page so that we will write 8k. In this case InnoDB deciding to be naive and not do any speculative page size management ends up writing much more than solutions used at large scale environments.
Another problem is that buffer pool is no longer compressed. This may mean you will need to buy devices with more IOPS and higher write endurance. Compressed buffer pool is huge advantage, and without it users will just have to spend more on hardware (and Oracle is in selling hardware business, yay!).
Then there’s this whole other thing, which makes absolutely no sense. Why would Oracle decide to support single hardware vendor (it doesn’t even own) proprietary solution in its ubiquitous open-source product. They say they’re using APIs that work elsewhere, but thats where it is recycled bovine manure.
When you’re talking to flash device, its FTL is hiding the fact that everything is truly fragmented underneath you and the namespace it has to deal with it does not have any complicated dependencies – it is essentially log-structured K/V store, where key is block address. The ease of log structured design is that you’re writing to very few places (and you’re usually appending). General purpose file system such as XFS has to handle all the metadata between underlying flat-addressed block device and directories, file placement, extents and writes to files. On top of that it has to provide semantics like file expansion, renames, deletion, all happening on that single block device underneath.
For quite a while InnoDB was holding a global mutex when extending files – and that is very trivial operation comparing to what hole punching would mean. Hole punching inside a file system would make each InnoDB page a separate segment that has to be tracked via file system metadata management (so every page write will be accompanied by filesystem journal and metadata writes). There is a question whether file system is going to scale, and then there’s just basic efficiency (a sparse synchronous write is ~5x more expensive than non-sparse one).
Dropping a file with millions of file system segments in it will take minutes of CPU time and lock contention on allocation group (each segment has to be evaluated, added back to list of free space segments with possible merging, etc). Understanding implications of extreme fragmentation (can you even use the file system once it hits 50% full? 75% full?) is not that straightforward either.
I did not have to think at all about file system scalability before (as long as writes got through), now I can’t stop noticing things like XFS padding log writes to a imaginary or real stripe size (as if every RAID is RAID5).
So while Oracle has completely messed up with InnoDB compression roadmap, surrounding industry moved ahead in leaps and bounds. Remember that toy MongoDB with all of its inefficiencies? This is where it is today:
Chasing benchmarks is not enough to win a datacenter, especially when large scale environments are working on improving efficiency of systems, not just throughput. RocksDB has been making its way into InnoDB’s turf in MySQL world, MongoDB ecosystem has RocksDB, TokuDB, WiredTiger. Embeddable InnoDB does not exist anymore, so most of innovation in storage systems ends up completely ignoring it.
While Oracle orients MySQL towards proprietary file systems and hardware devices, we will see more and more new platforms on top of open-source pluggable storage engines.
Though we did deploy recently some non-compressed InnoDB environments (I am going to talk at MySQL Conference about our MySQL/InnoDB Messenger backend), Yoshinori is going to talk about LSM databases at Facebook too and Harrison’s keynote will be about all the different systems that are needed to deal with complex data problems.
9 thoughts on “how innodb lost its advantage”
Domas, great write up. I’ve always wondered what hole punching will do to a file system over time, you’ve given me a lot to think about.
FYI… The MongoDB world enjoys Fractal Tree indexing via TokuMX; the MySQL / MariaDB / InnoDB world via TokuDB.
But, your heart was in the right place ;)
For more on LSM vs Fractal Tree indexing: http://bit.ly/LSMvFT
Dave as in “Dave Rosenlund” the Tokutek VP of Marketing?
You may be surprised, but I get to know a bit about Tokutek. Unfortunately, not as much as some of my peers, as there was a huge (legal implication) wall between my production work and TokuDB evaluation efforts.
Lost its advantage huh… [takes a moment to stop laughing violently while rolling around on the road like a Molly overdose], have a gander at Postgres buddy boy. Postgres outperforms Mongo so badly that it’s almost embarrassing to be a human, knowing that humans created something as a replacement for RDBMS that is 1) not acid compliant, 2 lacking of a schema (supposedly a feature, despite every app ever made having specific requirements about data stucture)
you didn’t read, did you?
Isn’t it crazy to assume we can have a copy-on-write b-tree that does compression immediately before writeback (where write is a group of pages and the request size rounds up to the next sector boundary)? As a bonus, torn pages aren’t possible so no need for the doublewrite buffer.
Oh wait, we get that with the WiredTiger b-tree in MongoDB.
Comments are closed.