more on PMP

Lately we have been especially enjoying the opportunities that Poor Man’s Profiler provides us, and the technology itself has improved a lot too: there have been a few really useful mutations.

One mutation (hyper-pmp) was Ryan Mack’s approach to more efficient sampling: instead of firing up gdb each time, he instructed gdb to take backtraces every time the monitored process gets a signal (SIGUSR2, for example). This allows us to maintain a persistent debugger attachment and then signal the process periodically to get stacks for analysis.
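
A minimal sketch of how such a persistent attachment could look, not necessarily how hyper-pmp itself is written. The function names `hpmp_gdb_script` and `hpmp_attach` are made up for illustration; the gdb commands (`handle`, `continue`, `thread apply all bt`) are standard. It assumes SIGUSR2 is otherwise unused by the monitored process, and note that gdb will also stop and dump stacks on any other signal it is configured to stop for.

```sh
# Emit a gdb command script: swallow SIGUSR2 (nopass) so the target never
# sees it, stop on it, dump all stacks, resume, and wait for the next one.
hpmp_gdb_script() {
    cat <<'EOF'
set pagination 0
handle SIGUSR2 stop print nopass
while 1
  continue
  thread apply all bt
end
EOF
}

# Attach once to pid $1 and write stacks to stdout on every SIGUSR2.
# The loop ends when the target exits and "continue" fails.
hpmp_attach() {
    script=$(mktemp) || return 1
    hpmp_gdb_script > "$script"
    gdb --batch -x "$script" -p "$1"
    rm -f "$script"
}
```

With something like `hpmp_attach "$(pidof mysqld)" > stacks.log &` running, each `kill -USR2 "$(pidof mysqld)"` would add one sample to the log without paying the attach cost again.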

Another mutation was auto-pmp: high-frequency polling of process state (e.g. how many threads are running) and, when a certain threshold is exceeded, obtaining stacks for further analysis (this combines really well with the hpmp approach: one process is the stack reader, and another does the signaling on thresholds). My major problem with this approach was that the polling methods we chose were biased towards showing me the end of overload events, because during the overload itself the poll wouldn’t return process state due to internal process locking.
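
The signaling half could be sketched roughly like this. The helper names, the `Threads_running` status variable as the polled metric, and the use of the `mysql` client are assumptions for illustration, not a description of our actual poller. Important: only point this at a process that already has the persistent gdb attachment described above swallowing SIGUSR2, because the default action of SIGUSR2 is to terminate the process.

```sh
# Pure decision helper: succeeds when the running-thread count ($1)
# exceeds the limit ($2).
over_threshold() {
    [ "$1" -gt "$2" ]
}

# Read Threads_running from the server (assumes a working mysql client login).
threads_running() {
    mysql -N -B -e "SHOW GLOBAL STATUS LIKE 'Threads_running'" | awk '{ print $2 }'
}

# Poll every $3 seconds; signal pid $1 whenever the count exceeds limit $2.
# WARNING: without a debugger attached that swallows SIGUSR2, this kills the target.
auto_pmp() {
    while kill -0 "$1" 2>/dev/null; do
        running=$(threads_running)
        if over_threshold "${running:-0}" "$2"; then
            kill -USR2 "$1"
        fi
        sleep "$3"
    done
}
```

Something like `auto_pmp "$(pidof mysqld)" 32 0.1` would poll ten times a second (fractional `sleep` is a GNU extension). This kind of poller has exactly the bias described above: if the status query itself blocks during the overload, the signal is sent late.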

At one point I had an epiphany that was quickly melted by reality: in theory we could use gdb watchpoints to replace my external process polling. Watchpoints allow breaking a process when a variable inside the program changes (and conditions can be applied), so essentially we would be able to instruct gdb to get stacks exactly at the moment when there are stalls and spikes. Unfortunately, even though that worked fine in single-threaded or lightly loaded environments, the monitored process crashed horribly under more realistic workloads. We have yet to figure out whether that is a fundamental issue with the approach or a bug that may have been fixed in later versions.

Of course, there’s a workaround that we’re considering for high-performance system analysis: simply instrumenting the process to fire a signal or do a conditional jump whenever there’s an overload condition. Essentially that would implement in-process watchpoint-to-breakpoint translation, giving us just-in-time analytics, so we’d see pretty much every situation where running threads pile up (unless there’s a bottleneck that simply doesn’t allow the workload to arrive :)

PMP on demand allowed us to uncover various issues inside MySQL that have been overlooked in most benchmarking as insignificant, but that are critical for us in providing better quality of service for each query, not just the 99th percentile (I wrote about that recently). We keep thinking about how to provide instrumentation for some of the views we get inside MySQL (e.g. an ability to export the pthread lock graph without using external tools), as well as better visibility of I/O blocking…

But for now we have what we have, poor man’s profiler :-)

on performance stalls

We quite often say that benchmark performance is usually different from real-world performance, so performance engineering usually has to cover both. Benchmarks allow us to understand sustained performance bottlenecks, while real-world analysis usually concentrates on something that would be considered ‘exceptional’ and unimportant in benchmarks: stalls of various kinds. They are extremely important, as the state when our performance is lowest is the state of performance we provide to our platform users.

On a machine that is doing 5000qps, stalling for 100ms means that 500 queries were not served as fast as they could have been, or even hit application timeouts or exceptional MySQL conditions (like the 1023 transaction limit). Of course, stalling for a second means 5000 queries were not served in time…
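
The arithmetic is simply throughput times stall duration. As a tiny helper (the name `stalled_queries` is made up for illustration), assuming a steady arrival rate during the stall:

```sh
# Queries affected by a stall: throughput in queries/s ($1) times
# stall length in milliseconds ($2), converted back to seconds.
stalled_queries() {
    echo $(( $1 * $2 / 1000 ))
}
```

`stalled_queries 5000 100` gives the 500 queries from the example above, and `stalled_queries 5000 1000` gives 5000.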

We have multiple methods to approach this. One is our ‘dogpiled’ framework: an agent doing status polling every second and reporting information about I/O state, MySQL/InnoDB statuses, processlists, etc., so we see the scope of stalls in our environment. We try to stay at the threshold between complete information overload and something that reveals problems, so it is always a balancing act, especially with the great work done by the engineering team :)

The other approach, usually prompted by dogpiled information, is auto-PMP: high-frequency status polling combined with gdb invocations, which allows us to jump into the process whenever we notice something weird going on. We have some extensions to how we use PMP, but that’s worth another post.

The issues we find that harm us most in production environments are ones that are quite often dismissed as either “this never happens” or “get better hardware” or “your application is wrong”. Unfortunately, they do happen, we do have thousands of machines that aren’t free, and our application demands are our application demands :)

A few examples:

  • TRUNCATE stalls the server (oh well, DROP TABLE too): truncating a table grabs the dictionary mutex, another transaction blocks while holding LOCK_open, and everything else stops. Though truncating is supposed to be a fast operation, it has to unlink (delete) a file, and with large files that isn’t really instant on any filesystem. Even if one deletes all the data before truncating, the file is still on the filesystem.
  • Extending data files stalls the server: when a data file is being extended, a global mutex is held, which blocks all I/O (with limited concurrency that is a full server stall). Somewhat more impressive with file-per-table. This is the major reason for mini-stalls at the moment; on machines that grow at a gigabytes-a-day rate it is hit quite often.
  • Updating table statistics stalls the server: we hit this with high-performance task tracking machines, where row churn is quite amazing and dictionary statistics are reread more often than one would expect. Updating statistics means locking the table while doing random reads from disk. Once the major workload is hitting that table, it quickly escalates to a full server stall.
  • Fuzzy checkpoint stalls the server: this is one of the biggest issues outstanding in stock MySQL. Though one would expect a “fuzzy checkpoint” that uses async background threads to be nonblocking, all writes during it will actually stall, taking all concurrency slots and leading to a server stall. Mark’s fix was just doing this work in a background thread.
  • (no bug filed on this yet) Purge stalls the server: purge holds the dictionary lock while doing random reads from disk, with a table stall leading to a server stall.

There are more issues (mostly related to heavier in-memory activities of the server), but these are the most obvious ones, where a single I/O request escalates into a table or instance lockup during which no other work is done. Our machines have multiple disks, multiple CPUs and can support multiple SQL queries being executed at once, so any of these lockups effectively limits our available performance or damages the quality of service we can provide.

On the upside, my colleagues are absolutely amazing and I’m sure that we will have all these issues fixed in our deployment in the near future, and everyone will be able to pick the fixes up via the mysqlatfacebook branch.

When bad things happen!

When bad things happen, like… ‘strace -f’ or ‘gdb’ or any other process inspection tool decides to hang your precious processes (they show up in state T in process lists), there’s always help:

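#include <sys/types.h>
#include <sys/ptrace.h>
#include <signal.h>
#include <stdlib.h>

/* Takes the pid of the stuck process as its only argument. */
int main(int ac, char **av) {
    if (ac < 2) return 1;              /* a pid argument is required */
    pid_t pid = atoi(av[1]);
    ptrace(PTRACE_ATTACH, pid, 0, 0);  /* attach as the tracer */
    ptrace(PTRACE_DETACH, pid, 0, 0);  /* and let go again */
    kill(pid, SIGCONT);                /* wake the stopped process */
    return 0;
}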

Poor man’s contention profiling

I wrote already about poor man’s query profiling techniques last summer. Today I’m going to share a poor man’s contention profiling tool, for all those poor souls who do not have Solaris with dtrace or pstack, don’t want to run the tcmalloc contention profiler, and simply need some easy hack to see ‘what the heck is going on in my server’. Here it is:

gdb \
    -ex "set pagination 0" \
    -ex "thread apply all bt" \
    --batch -p $(pidof mysqld)

Run it a few times, and you will have enough samples to start judging. Do note, this may stop process execution for a second, so do not spam it in too tight a loop.
Once you have results, it is just a matter of a 20-line script to extract any useful calculations :)
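
One possible shape for such a script, as a sketch rather than a definitive implementation: collapse each thread’s backtrace into a single comma-separated line of function names, then count identical stacks. The function name `pmp_aggregate` is made up; it assumes the usual gdb frame format, where the function name is the fourth field when the frame line carries an address and the second field when it does not.

```sh
# Read "thread apply all bt" output on stdin and print
# "<count> <func,func,...>" lines, most frequent stack first.
pmp_aggregate() {
    awk '
        # A new thread header: flush the stack collected so far.
        /^Thread/ { if (s != "") print s; s = "" }
        # A frame line: pick the function name and append it.
        /^#/ {
            f = ($2 ~ /^0x/) ? $4 : $2
            s = (s == "") ? f : s "," f
        }
        END { if (s != "") print s }
    ' | sort | uniq -c | sort -rn
}
```

Usage would be piping the gdb command above into it, e.g. `gdb ... --batch -p $(pidof mysqld) | pmp_aggregate`. Stacks that show up with high counts across several samples are where threads are piling up.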

P.S. I’d love to see an efficient pstack implementation for 64-bit Linux :)

update: this now lives at http://poormansprofiler.org