Today I attended a kick-ass ZFS talk (3 hours of incredibly detailed material, presented by someone who knows the stuff and knows how to talk) at CEC (a Sun internal training event/conference), so now I know way more about ZFS than I used to. I probably know way more about ZFS than the Average Joe DBA \o/
And now I think ZFS has lots of brilliant design and implementation bits, except that it doesn’t match the access patterns databases need.
See, ZFS is not a regular POSIX-API -> HDD bridge, unlike pretty much everything out there. It is a transactional object store that allows multiple access semantics and APIs, and the standard ZFS POSIX Layer (ZPL) is just one of them. In MySQL terms, think of all other filesystems as MyISAM, and of ZFS as InnoDB :-)
So, putting InnoDB on top of ZFS ends up, after some high-school-like variable substitution, as “putting InnoDB on top of InnoDB”. Let’s go a bit deeper here:
- ZFS has checksums, so does InnoDB (though ZFS checksums are faster, Fletcher-derived, etc ;-)
- ZFS has atomicity, so does InnoDB
- ZFS has ZIL (Intent Log), so does InnoDB (Transaction Log)
- ZFS has background intelligent flushing of data, so does InnoDB (maybe not that intelligent though)
- ZFS has the Adaptive Replacement Cache, InnoDB has a cache too (it calls it the Buffer Pool; instead of ARC’s separate recently-used and frequently-used lists it uses just one LRU list, so it doesn’t account for MFU)
- ZFS has copy-on-write snapshotting, so does InnoDB (MVCC!)
- ZFS has compression, so does InnoDB (in plugin, though)
- ZFS has intelligent mirroring/striping/etc; InnoDB has nothing of the kind, which is why InnoDB people use RAID controllers.
- ZFS has bit-rot recovery and self healing and such, InnoDB has assertions and crashes :-)
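To make the "Fletcher-derived" remark in the list above a bit more concrete, here is a minimal Python sketch of a fletcher4-style checksum: four running sums over 32-bit words, each wrapping at 64 bits. It is an illustration under assumptions (little-endian words, block length a multiple of 4 bytes, my own function name), not the actual ZFS code.

```python
import struct

MASK64 = (1 << 64) - 1  # the running sums wrap around at 64 bits


def fletcher4(data):
    """Fletcher-style checksum over little-endian 32-bit words (sketch)."""
    if len(data) % 4:
        raise ValueError("length must be a multiple of 4 bytes")
    a = b = c = d = 0
    for (word,) in struct.iter_unpack("<I", data):
        a = (a + word) & MASK64  # plain sum of the words
        b = (b + a) & MASK64     # sum of sums: depends on word position
        c = (c + b) & MASK64
        d = (d + c) & MASK64
    return a, b, c, d


block = struct.pack("<II", 1, 2)
swapped = struct.pack("<II", 2, 1)
print(fletcher4(block))    # (3, 4, 5, 6)
print(fletcher4(swapped))  # (3, 5, 7, 9)
```

Both blocks have the same plain sum (the first value), but the higher-order sums differ, so reordered data is caught as well. The loop is nothing but additions, which is where the speed comes from.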
So, we have two intelligent layers on top of each other, and lots of duplicated work. Of course, we can try to eliminate some bits:
- Disable checksums at InnoDB level
- Unfortunately, there’s nothing to be done about two transaction logs
- InnoDB can flush dirty pages immediately; this is probably tunable at the ZFS level too
- The InnoDB buffer pool could probably be reduced to favor the ARC, or the other way around…
- Double copy-on-write is inevitable (and a copy-on-write transaction log does not really make sense…)
- Compression can be done at either level
- Using ZFS for volume management would be the major real win, along with all the self-healing capability
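For the InnoDB side of the list above, a my.cnf fragment might look roughly like this. It is only a sketch: the option names are stock InnoDB settings from the 5.x era (`innodb_checksums` was dropped in later MySQL releases), the values are illustrative rather than tuned recommendations, and none of it has been benchmarked on ZFS.

```ini
[mysqld]
# Rely on ZFS block checksums instead of InnoDB page checksums
innodb_checksums = 0
# Push dirty pages out as early as possible and leave write batching to ZFS
innodb_max_dirty_pages_pct = 0
# Illustrative size only: a smaller buffer pool leaves more RAM for the ARC
innodb_buffer_pool_size = 1G
```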
So, I’m not too convinced at this moment about using this combo, but there’s another idea that has been circulating for quite a while: what if MyISAM suddenly started using all the ZFS capabilities? Currently the ZPL and direct ZFS object store management are mutually exclusive (you have to pick one way), but if the ZPL were extended to support a few simple operations (create/drop snapshots of just a single file, wrap multiple write() calls into a transaction), MyISAM could get a different life:
- Non-blocking SELECTs could be implemented using snapshots
- Writes would be atomic and non-corrupting
- MyISAM would get checksummed, compressed, consistent data that is flushed by intelligent background threads, and would have immediate crash recovery
- For replication slaves, write concurrency would not be that necessary (a single thread is updating the data anyway)
- “Zettabyte” (was told not to use this ;-) File System would actually allow Zettabyte-MyISAM-Tables o/
- All the Linux people (including me :) would complain about Sun doing something just for [Open]Solaris, instead of working on [insert favorite storage engine here].
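The two ZPL extensions wished for above (snapshots of a single file, and several write() calls wrapped into one transaction) can be played with in a toy model. Everything in this Python sketch is hypothetical: `CowFile` and its methods are made-up names, not a ZFS or MyISAM API, and a dict of blocks stands in for the real object store.

```python
from contextlib import contextmanager


class CowFile:
    """Toy copy-on-write file. Hypothetical, not a ZFS or MyISAM API."""

    def __init__(self):
        self._live = {}       # block number -> bytes, never mutated in place
        self._snapshots = {}  # snapshot name -> block map frozen at that time

    def snapshot(self, name):
        # Cheap: shares the current block map, because writers replace the
        # map with a new one instead of modifying it.
        self._snapshots[name] = self._live

    def drop_snapshot(self, name):
        del self._snapshots[name]

    def read(self, blockno, snapshot=None):
        blocks = self._live if snapshot is None else self._snapshots[snapshot]
        return blocks.get(blockno)

    @contextmanager
    def transaction(self):
        pending = {}  # writes staged by the caller: pending[blockno] = data
        yield pending
        # Reached only if the with-body finished without raising, so the
        # staged writes become visible all at once or not at all.
        self._live = {**self._live, **pending}


table = CowFile()
with table.transaction() as tx:      # several writes, one atomic commit
    tx[0] = b"row-a"
    tx[1] = b"row-b"

table.snapshot("select-1")           # a long SELECT pins its own view
with table.transaction() as tx:      # a writer carries on meanwhile
    tx[0] = b"row-a2"

print(table.read(0, snapshot="select-1"))  # b'row-a'
print(table.read(0))                       # b'row-a2'
table.drop_snapshot("select-1")
```

The reader never blocks the writer and never sees a half-applied update, which is the behaviour the first two bullets are after. A transaction that raises midway simply leaves the live data as it was.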
Unfortunately, to implement that now, one would have to either tap directly into the object management API (which would mean quite a bit of rewriting) or wait for the ZFS people to extend the ZPL calls. And for now, I’d say, “not yet”.
Disclaimer: the opinion of the author does not represent the opinion of his employer (especially the Marketing people), and may be affected by the fact that the author was enjoying free wireless and who knows what else at Las Vegas McCarran International Airport.