plockstat fail!

Solaris has this beautiful tool ‘plockstat’ that can report application lock contention and hold events (hold events turn into contention events at parallelism.. ;-) It is just a frontend to a set of dtrace rules, that monitor mutexes and rwlocks.

Today I was testing an edge case (what happens, when multiple threads are scanning lots of same data) – and plockstat/dtrace indicated that there were zero (0!!!) lock waits. I tried using ‘prstat’ with microstate accounting, and it indeed pointed out that there’s lots of LCK% activity going on (half of CPU usage…). The dtrace profiling oneliner (dtrace -n 'profile-997 {@a[ustack()]=count()}') immediately revealed the culprit:


So, plenty of CPU time was spent when trying to unlock mutex (what seemed strange), but didn’t seem that strange once I noticed the code:

do {
	old = *lockword64;
	new = old & ~LOCKMASK64;
} while (atomic_cas_64(lockword64, old, new) != old);

So, there’s unaccounted busy loop (it is just part of hold event in dtrace). What is odd, is that nobody expects this place to loop too much, what happens here – it gives away mutex to other thread, which wants it. So, instead of having the new owner spin-lock (where it accounts properly), it has old owner spin-locking. I’m not convinced this kind of behavior is one that should scale on large machines, but I’m not much of a locking expert.

Without proper instrumentation plockstat failed to provide information about locks that were consuming half of CPU time. I hope that really was just an edge case – more real testing will follow soon, will see if plockstat will fail as much. Oh well, will find the information I need anyway :) Lesson learned – treat pretty much everything with grain of salt, especially when OS tells you mysql has no lock contention, haha.

3 thoughts on “plockstat fail!”

Comments are closed.