Once MySQL is deployed inside a datacenter environment (i.e. forms a cloud ;-), major feature in it becomes replication. It is used to maintain hot copies, standby copies, read-only copies, invalidate external systems, replicate to external systems, etc. If this functionality is broken, datacenter is broken – components are not synchronized anymore, invalidations not done, data not consistent.
From performance perspective, replication not working properly results in unusable slaves so load cannot be spread. This results in higher load on other machines, including master (especially on master, if environment needs stronger consistency guarantees).
Judging on replication importance in MySQL deployments, it should attract performance engineering as much as InnoDB and other critical pieces. Though slave replication performance is being increased in 5.6, master side is not (well, group commit may help a bit, but not as much).
The major problem here is LOCK_log.
It is a mutex that guards all MySQL log files (Update: that sentence is wrong, it is per-log-type mutex, is not shared among different types, at is class private variable) (general query log, binary log, etc) – only one read or write can happen to any of them. Other logs aside (their impact can be minimized), binary log finds this mutex extremely problematic, because:
- It stalls COMMITs
- It stalls slave binary log reads
Though one can argue, that once there’re less commits, there will be less binary log and that will result in some sort of equilibrium. Unfortunately, one doesn’t get anywhere close to decent efficiency, as there’re too many situations, when LOCK_log is kept for way too long:
- Slow binary log read (if they already got purged out of cache, highly likely)
- Slow binary log write (if controller cache is saturated by page flush workload)
- Slow memory allocation for larger events (on systems with malloc scalability issues)
Saturating RAID controller cache isn’t that difficult – and even if one dedicates separate disks to a separate log disk array (what a waste!), it doesn’t mean that out of the shelf hardware will support supplying separate cache for that disk array. Without a separate dedicated cache or high priority log processing one ends up blocking until disks are available for a write.
Read efficiency is a concern here too – MySQL uses statically sized buffer of 8k for all binary log reads, which means that on highly pressured system read-ahead may not work, and there will be way more I/O operations/seeks, while holding a precious mutex. Even if the operation to read the 8k page from memory buffers is very fast, slave threads will be forced to wait for other, much slower operations (a lagging slave reading yesterday’s data? high write workload?) to finish.
If binary log writes are on a separate device one can inspect the utilization on iostat. Usually there will be only one write or read outstanding, so the busy percentage shows really well how contended this resource is. A double-digit number already says that operations aren’t instant, and replication slaves can be subjected to log starvation, that is – not replicating.
From operation side solution to this is somehow obtaining a dedicated cache or fast device for binary log writes. One can simply disable write cache for the InnoDB data partition and use lots of parallel InnoDB page write threads, so that they win the fight against read threads. Using flashcache or similar technology to absorb writes can be beneficial in this case.
On the other hand, MySQL can be changed not to have giant log mutex – binary logs are append-only structures, and just guarding tail and rotation events could be enough, with additional one or two atomic variables to make sure that readers are fine.
Also, having compressed binlogs on disk would reduce the I/O pressure caused by writes or reads, as well as there’s a possible gain from not having to recompress logs for each remote reader.
Then datacenter will be happy, and datacenter users even happier.