As infrastructure grows, it gets more and more complicated to tell whether an incident is a real systems problem or a monitoring glitch. This is the story of one such investigation.
With a sufficiently large set of machines, there are multiple graphs and sets of data to look at to understand what is going on:
- Group average – the easiest to look at for long-term trends or system-wide problems
- Top-N chart – our favorite for deep dives, looking only at the top offender lines (quite often some of the offender lines end up matching the tier baseline)
- Percentiles chart – knowing the best performers, medians, and 99% or 99.9% behavior is incredibly useful for understanding medium-term dynamics and can give lots of insight into where to look for problems. It is far more complicated to collect, though.
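To make the difference between these three views concrete, here is a minimal Python sketch that computes each of them from one snapshot of per-host lag values. The host names, the sample numbers and the nearest-rank percentile method are all assumptions made for illustration; this is not how our monitoring actually collects or aggregates data.

```python
import math
from statistics import mean


def group_average(lag_by_host):
    """Mean lag across the whole group: good for trends, hides outliers."""
    return mean(lag_by_host.values())


def top_n(lag_by_host, n=10):
    """The n worst hosts as (host, lag) pairs, highest lag first."""
    return sorted(lag_by_host.items(), key=lambda kv: kv[1], reverse=True)[:n]


def percentile(lag_by_host, pct):
    """Nearest-rank percentile of lag (pct between 0 and 100)."""
    ordered = sorted(lag_by_host.values())
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical sample: 100 hosts, two of them spiking to 1000 s of lag.
sample = {f"db{i:03d}": 0 for i in range(1, 99)}
sample.update({"db099": 1000, "db100": 1000})

print("average:", group_average(sample))
print("top 2:  ", top_n(sample, 2))
print("p50/p99:", percentile(sample, 50), percentile(sample, 99))
```

With a sample like this, the average only hints that something is off, the top-N list names the offenders, and the percentiles show how few machines are actually affected.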
Another very important property of the data one looks at is collection frequency. Thousands of servers aggregated into a single chart already provide a smoothed view of what is going on, but individual charts will always look far choppier, with no intermediate points to show how a metric progressed. You will see that there are lots of threads running, but the chart won’t explain how that built up; only longer-term incidents averaged out over lots of machines will show that.
In our deployment, replication lag is a very important metric of how well we’re doing. With faker in production, lag already means that a machine (or machines) is in trouble, usually either hardware trouble or far too much workload from applications, and both problems need immediate action.
It is even more interesting on pure flash machines: those are not supposed to lag at all once they’re in production, unless something is really, really wrong.
This is why, once I saw a group of flash-only machines reporting higher replication lag values, I jumped to investigate. There could’ve been various reasons, e.g. long transactions increasing the cost of write queries, and whatnot.
Looking at the charts, I saw that there was some lag, that some machines spiked to lag values of 1000+ seconds, and the percentiles showed that only a small number of machines were hitting this. Interestingly, each of those machines would hit a lag spike and then behave normally again afterwards.
I tried to find a machine that was in a high-lag condition and check what was going on there. Listing machines with their lag values was relatively easy:
pmysql "SHOW SLAVE STATUS" | cut -f 1,34 | sort -nr -k 2,2 | head
What was odd was that each time I ran this, the machines on the list were different. My first thought was that some old transactions with old timestamps were showing up (that is usually the reason for odd Seconds_behind_master spikes), and figuring out the statements would’ve been a piece of cake:
pmysql "SELECT * FROM information_schema.processlist where user='system user' and info is not null and time>100"
Unfortunately, even though “SHOW SLAVE STATUS” was always showing me machines with 100s+ lag, none of them showed up in the processlist query.
The aha! moment came when I looked at the binlog/relay log values: every “lagging” machine was at one of two relay log positions, 298 or 308. So I looked at the timestamps of the events at those relay log positions:
#120921 13:21:42 server id 22222 end_log_pos 106 Start: binlog v 4, server v 5.1.53-log created 120921 13:21:42
#120921 13:21:42 server id 4294967295 end_log_pos 151 Rotate to binary-logs.017880 pos: 81259576
#691231 16:00:00 server id 11111 end_log_pos 0 Rotate to binary-logs.017880 pos: 81259576
#120921 13:17:39 server id 11111 end_log_pos 0 Start: binlog v 4, server v 5.1.53-log created 120921 13:17:39
#120921 13:21:42 server id 11111 end_log_pos 81259647 Query ...
The first record was the relay log header (format description event). The next two were rotation events with bogus values (a server_id of -1 on one, a zero timestamp on the other) and binary log positions. Then we suddenly have the master’s binary log header, which is taken from the beginning of the binary log and carries the binlog creation timestamp, followed by a query from the middle of the binary log with a proper timestamp on it.
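The same reading of the event headers can be sketched in Python. The regular expression below is an assumption based only on the header lines shown above, the function names are hypothetical, and the three checks (server_id of -1 stored as an unsigned 32-bit value, a timestamp at or before 1970, a timestamp older than the relay log's own first event) are illustrative heuristics rather than anything MySQL itself provides.

```python
import re
from datetime import datetime

# Assumed header layout, taken from the dump above:
# "#YYMMDD HH:MM:SS server id N end_log_pos N <event description>"
EVENT_RE = re.compile(
    r"^#(?P<ts>\d{6}\s+\d{1,2}:\d{2}:\d{2})\s+server id\s+(?P<server_id>\d+)"
    r"\s+end_log_pos\s+(?P<end_log_pos>\d+)\s+(?P<event>.*)$"
)
UNSIGNED_MINUS_ONE = 2**32 - 1  # 4294967295, i.e. -1 as an unsigned 32-bit int

SAMPLE = """\
#120921 13:21:42 server id 22222 end_log_pos 106 Start: binlog v 4, server v 5.1.53-log created 120921 13:21:42
#120921 13:21:42 server id 4294967295 end_log_pos 151 Rotate to binary-logs.017880 pos: 81259576
#691231 16:00:00 server id 11111 end_log_pos 0 Rotate to binary-logs.017880 pos: 81259576
#120921 13:17:39 server id 11111 end_log_pos 0 Start: binlog v 4, server v 5.1.53-log created 120921 13:17:39
#120921 13:21:42 server id 11111 end_log_pos 81259647 Query ...
"""


def parse_events(text):
    """Turn event header lines into dicts; lines that do not match are skipped."""
    events = []
    for line in text.splitlines():
        m = EVENT_RE.match(line.strip())
        if not m:
            continue
        events.append({
            "ts": datetime.strptime(m["ts"], "%y%m%d %H:%M:%S"),
            "server_id": int(m["server_id"]),
            "end_log_pos": int(m["end_log_pos"]),
            "event": m["event"],
        })
    return events


def suspicious(events):
    """Return (event, reason) pairs for headers that could skew lag reporting."""
    if not events:
        return []
    relay_created = events[0]["ts"]  # first event is the relay log's own header
    findings = []
    for ev in events:
        if ev["server_id"] == UNSIGNED_MINUS_ONE:
            findings.append((ev, "server_id is -1"))
        elif ev["ts"].year <= 1970:
            findings.append((ev, "zero timestamp"))
        elif ev["ts"] < relay_created:
            findings.append((ev, "timestamp older than the relay log itself"))
    return findings


for ev, reason in suspicious(parse_events(SAMPLE)):
    print(ev["ts"], ev["event"][:30], "->", reason)
```

The last check is the one that matters for this story: an event stamped earlier than the relay log it sits in.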
For the whole time the SQL thread was acting on the first query, the server thought it was executing events from 13:17:29 and not 13:21:42. Seconds_behind_master was reporting that the gross offense of 253s of replication lag was being committed, and that database users were suffering.
Fortunately, it was just a MySQL status variable glitch and data integrity was completely fine, but this reminds us how important quality monitoring and internal metrics are for large environments, and how easy it is to draw wrong conclusions otherwise.
P.S. Of course, a bug was filed :)