solaris – domas mituzas

Solaris has this beautiful tool ‘plockstat’ that can report application lock contention and hold events (hold events turn into contention events under parallelism ;-). It is just a frontend to a set of dtrace probes that monitor mutexes and rwlocks.

Today I was testing an edge case (what happens when multiple threads are scanning lots of the same data), and plockstat/dtrace indicated that there were zero (0!!!) lock waits. I tried using ‘prstat’ with microstate accounting, and it did point out that there’s lots of LCK% activity going on (half of CPU usage…). The dtrace profiling oneliner (dtrace -n 'profile-997 {@a[ustack()]=count()}') immediately revealed the culprit:

              libc.so.1`clear_lockbyte+0x10
              libc.so.1`mutex_unlock+0x16a
              mysqld`mutex_exit+0x1d
              mysqld`buf_page_optimistic_get_func+0xa0

So, plenty of CPU time was spent trying to unlock a mutex, which seemed strange at first, but not that strange once I noticed the code:

/* clear the lock byte; retry if another thread changed the word in between */
do {
	old = *lockword64;
	new = old & ~LOCKMASK64;
} while (atomic_cas_64(lockword64, old, new) != old);

So, there’s an unaccounted busy loop (in dtrace it is just part of the hold event). What is odd is that nobody expects this place to loop much. What happens here is that the owner gives the mutex away to another thread that wants it, and every time the lock word changes between the read and the compare-and-swap, the swap fails and has to be retried. So, instead of having the new owner spin-lock (where it is accounted for properly), it has the old owner spinning. I’m not convinced this kind of behavior is one that should scale on large machines, but I’m not much of a locking expert.
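To make the retry loop a bit more tangible, here is a minimal sketch of the same pattern in portable C11 atomics, with a counter for how often the compare-and-swap has to be repeated. This is not the libc code: the function name, the DEMO_LOCKMASK value and the word layout are made up for illustration, and only the shape of the loop follows the snippet above.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Assumed layout for this demo only: the low byte is the lock byte.
 * The real libc mask and layout may differ. */
#define DEMO_LOCKMASK UINT64_C(0xff)

/* Clear the lock byte with a CAS retry loop and return how many times
 * the CAS failed because the word changed under us. */
static unsigned clear_lockbyte_counted(_Atomic uint64_t *lockword)
{
    unsigned retries = 0;
    uint64_t old = atomic_load(lockword);

    /* On failure, atomic_compare_exchange_strong stores the current
     * value of *lockword into `old`, so the next attempt starts from
     * fresh data. */
    while (!atomic_compare_exchange_strong(lockword, &old,
                                           old & ~DEMO_LOCKMASK))
        retries++;

    return retries;
}
```

With a single thread the counter stays at zero. The interesting numbers would only show up with many threads hammering the same word, which is exactly the time the hold-event accounting doesn't separate out.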

Without proper instrumentation, plockstat failed to provide information about locks that were consuming half of the CPU time. I hope that really was just an edge case; more real testing will follow soon, and we will see if plockstat fails as much. Oh well, I will find the information I need anyway :) Lesson learned: treat pretty much everything with a grain of salt, especially when the OS tells you mysql has no lock contention, haha.

Today I attended a kick-ass ZFS talk (3 hours of incredibly detailed material presented by someone who knows the stuff and knows how to talk) at CEC (Sun internal training event/conference), so now I know way more about ZFS than I used to. I probably know way more about ZFS than Average Joe DBA \o/

And now I think ZFS has lots of brilliant design and implementation bits, except it doesn’t match the needs of database access patterns.

See, ZFS is not a regular POSIX-API -> HDD bridge, unlike pretty much everything out there. It is a transactional object store which allows multiple access semantics and APIs, and the standard ZFS POSIX Layer (ZPL) is just one of them. In MySQL talk, think of all other filesystems as MyISAM, and of ZFS as InnoDB :-)

So, “putting InnoDB on top of ZFS”, after some high-school-like variable replacement, ends up as “putting InnoDB on top of InnoDB”. Let’s go a bit deeper here:

ZFS has checksums, so does InnoDB (though ZFS checksums are faster, Fletcher-derived, etc ;-)
ZFS has atomicity, so does InnoDB
ZFS has ZIL (Intent Log), so does InnoDB (Transaction Log)
ZFS has background intelligent flushing of data, so does InnoDB (maybe not that intelligent though)
ZFS has Adaptive Replacement Cache, so does InnoDB (calls it Buffer Pool, instead of three replacement queues uses just one – LRU, doesn’t account for MFU)
ZFS has copy-on-write snapshotting, so does InnoDB (MVCC!)
ZFS has compression, so does InnoDB (in plugin, though)
ZFS has intelligent mirroring/striping/etc, this is why InnoDB people use RAID controllers.
ZFS has bit-rot recovery and self healing and such, InnoDB has assertions and crashes :-)
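
For a feel of why a Fletcher-style checksum is cheap, here is a minimal sketch of a four-accumulator running sum over 32-bit words, which is the general shape usually described for the ZFS fletcher4 variant. The struct and function names are hypothetical, and byte order handling and the on-disk format are left out, so treat it as an illustration rather than the ZFS implementation.

```c
#include <stddef.h>
#include <stdint.h>

/* Four running sums; each one accumulates the one before it. */
struct fletcher4_sums {
    uint64_t a, b, c, d;
};

/* Fletcher-style checksum over an array of 32-bit words.
 * The only work per word is four 64-bit additions. */
static struct fletcher4_sums fletcher4_sketch(const uint32_t *words, size_t n)
{
    struct fletcher4_sums s = {0, 0, 0, 0};

    for (size_t i = 0; i < n; i++) {
        s.a += words[i];  /* plain sum of the data */
        s.b += s.a;       /* sum of sums: sensitive to word order */
        s.c += s.b;
        s.d += s.c;
    }
    return s;
}
```

There are no multiplications or table lookups per word, only additions, which is the whole point of the "faster" remark above.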

So, we have two intelligent layers on top of each other, and there’s lots of work duplicated. Of course, we can try to eliminate some bits:

Disable checksums at InnoDB level
Unfortunately, there’s nothing to be done about two transaction logs
Dirty pages can be flushed immediately by InnoDB, and this is probably tunable at the ZFS level too
The InnoDB buffer pool can probably be reduced to favor the ARC, or the opposite…
Double Copy-on-write is inevitable (and copy-on-write transaction log does not really make sense…)
Compression can be done at either level
ZFS use for volume management would be the major real win, as well as all the self healing capacity

So, I’m not too convinced at this moment about using this combo, but there’s another idea that has been circulating for quite a while: what if MyISAM suddenly started using all the ZFS capabilities? Currently the ZPL and actual ZFS object store management are mutually exclusive (you have to pick one way), but if the ZPL were extended to support a few simple operations (create/drop snapshots on just a single file, wrap multiple write() calls into a transaction), MyISAM could get a different life:

Non-blocking SELECTs could be implemented using snapshots
Writes would be atomic and non-corrupting
MyISAM would get checksummed, compressed, consistent data, that is flushed by intelligent background threads, and would have immediate crash recovery
For replication slaves write concurrency would not be that necessary (single thread is updating data anyway)
“Zettabyte” (was told not to use this ;-) File System would actually allow Zettabyte-MyISAM-Tables o/
All the Linux people (including me :) would complain about Sun doing something just for [Open]Solaris, instead of working on [insert favorite storage engine here].

Unfortunately, to implement that now one would have to either tap directly into the object management API (that would mean quite a bit of rewriting), or wait for the ZFS people to extend the ZPL calls. And for now, I’d say, “not yet”.

Disclaimer: the opinion of the author does not represent the opinion of his employer (especially Marketing people), and may be affected by the fact that the author was enjoying free wireless and who knows what else at Las Vegas McCarran International Airport.

Tag: solaris

plockstat fail!

ZFS and MySQL … not yet