on durability

MySQL did not start as a durable data store and had lots of mockery for that – (ISAM? no replication?). Eventually InnoDB took over, and it brought at least parts of MySQL into a reliable storage world. Checksummed pages, decent crash recovery, good synchronous behavior had InnoDB ahead of open source competition for quite a while, as well as on par with other solutions. Unfortunately, that safety was limited only to InnoDB row operations and not DDL or replication state.

In the world where nothing before was synchronous, transitioning to reliable storage introduced lots of slowdowns, and still was not good enough.
There was lots of work done outside of internal MySQL/Sun/Oracle development to help with some of these problems. For example Google 4.0 patch tried to solve slave crash safety by storing replication state inside InnoDB – not only that allowed slaves to properly recover after a crash, but also that was achieved without synching storage at every transaction. A wish to run masters reliably required to synchronize binary logs with data store, leading to three synchronous data writes per transaction – forcing multiple parties to work on commit grouping implementations. Costly checksums were offloaded to on-chip CRC32 implementations, etc.

Eventually we got MySQL to the situation where it was considered to be reliable. Of course, that came with a cost.

One problem is that log writes have huge write amplification induced by OS paging system, so a synchronous tiny log write of few hundred bytes was written out as 4k page. We observe ~3x-4x write amplification from this on our database masters. That is not much of an issue if underlying hardware (e.g. NVRAM on RAID controller) can absorb all that, but on systems that use SSDs underlying hardware may no longer do such merging, and limited write cycles of flash storage suffer from such write amplification a lot. We end up writing more log pages than data pages (though much less log bytes), so it is a major issue for write endurance on flash devices.

Other problem is that underlying hardware isn’t always fast. Though on modern devices super-capacitors don’t break as much as batteries used to, and are not subject to recharge cycles, still, there are various sorts of I/O stalling, that impact durable behavior of high performance systems. Flash devices are running all sorts of background activity that can make fast writes suddenly be not so fast. In such cases a system that is gladly eating tens of thousands of writes a second suddenly has to accommodate a backlog of thousands of transactions, assuming that they are not locking each other out. Group commit may not be much of a help here.

In MySQL world durability is controlled via two settings – one tells whether InnoDB should fsync on every transaction, other is how often binary log should be synced. So, either you get to have fully durable system, or buy into unsynchronized environment with up to a second of data loss. I was making fun of MemSQL before for their initial durability implementation, but honestly now both MemSQL and PostgreSQL have durability settings that allow millisecond precision control.

In MySQL 5.6 finally we get the transactional replication states on slaves (I cannot imagine running a replicated environment without that) and semi-synchronous replication allows to have network durability, which may extend or replace existing host-local durability. Even though this allows higher availability and consistency of a replica set, it still does not make masters crash safe – replication state of a master is not synchronized with transactional state of data subsystem, so in case of master crash one is supposed to discard it instead of being able to resync it. That may not sound as an issue for ten user websites, but when instances go into terabyte or tens of terabytes size ranges, rebuilding masters after crashes is costlier than one would think.

Solution sounds somewhat obvious – include replication state within the transactional store and use it to re-synchronize with the replica set, allowing to skip most of expensive synchronous page overwrites, and introducing best-effort background syncing (e.g. sync data written up to a page boundary). Unfortunately, even with GTIDs and semi-sync replication that may not be exactly straightforward and 100% reliable. Still, in large environments it is more about statistics and costs at a large scale rather than standalone system operation, so with good understanding of the impact tradeoffs can be made.

The cost of double-write buffer has been long neglected as well (in compressed environments it is even triple-writing or nonuple-writing), and even some hardware vendors are offering atomic writes, more standard stack still has to rely on it to make sure recovery is successful – apparently it is used more than we expected. In large scale environments one may just want to quickly detect broken pages rather than fully recover, so it may be possible to shrink the double write buffer just to store page IDs. Of course, with more devices supporting atomic writes future may be better here, but alternative approaches can be useful as well (including network-based physical data recovery).

Currently there are more and more systems that provide proper network durability without having to do expensive host-level durability, and MySQL world has quite a bit of catching up to do in order to stay useful in datacenter environments. The way crash safe slaves made it feasible to run replication at scale five years ago, network-durable crash safe masters may be needed to compete with other solutions today.

3 thoughts on “on durability”

Comments are closed.

%d bloggers like this: