I always have difficulties with complex analysis schemes, so fall back to something that is somewhat easier. Or much easier. Here I will explain the super-powerful method of database write workload analysis.
Doing any analysis on master servers is already too complicated, as instead of analyzing write costs one can be too obsessed with locking and there’s sometimes uncontrollable amount of workload hitting the server beside writes. Fortunately, slaves are much better targets, not only because writes there are single-threaded, thus exposing every costly I/O as time component, but also one can drain traffic from slaves, or send more in order to cause more natural workload.
Also, there can be multiple states of slave load:
- Healthy, always at 0-1s lag, write statements are always immediate
- Spiky, usually at 0s lag, but has jumps due to sometimes occuring slow statements
- Lagging, because of read load stealing I/O capacity
- Lagging (or not catching up fast enough), because it can’t keep up with writes anymore, even with no read load
Each of these states are interesting by themselves, and may have slightly different properties, but pretty much all of them are quite easy to look at using replication profiling.
The code for it is somewhat straightforward:
(while true; do
echo 'SELECT info FROM information_schema.processlist
WHERE db IS NOT NULL AND user="system user"; '
sleep 0.1; done) | mysql -BN | head -n 100000 > replication-sample
There are multiple ways to analyze it, e.g. finding slowest statements is as easy as:
uniq -c replication-sample | sort -nr | head
More advanced methods may group up statements by statement types, tables, user IDs or any other random metadata embedded in query comments – and really lots of value can be obtained by doing ad-hoc analysis using simply ‘grep -c keyword replication-sample’ – to understand what share of your workload certain feature has.
I already mentioned, that there are different shapes of slave performance, and it is easy to test it in different shapes. One of methods is actually stopping a slave for a day, then running the sampler while it is trying to catch up. It will probably have much more buffer pool space usable for write operations, so keep that in mind – certain operations that are depending on larger buffer pools would be much faster.
This is really simple, although remarkably powerful method, that allows quite deep workload analysis without spending too much time on statistics features. As there’s no EXPLAIN for UPDATE or DELETE statements, longer, coarser samples allow detecting deviations from good query plans too.
Systematic use of it has allowed to reveal quite a few important issues that had to be fixed – which were not that obvious from general statistics view. I like.