more on PMP

Lately we have been especially enjoying the opportunities that Poor Man’s Profiler provides us – but also the technology has improved a lot too – there have been few really useful mutations.

One mutation (hyper-pmp) was Ryan Mack’s approach of having somewhat more efficient sampling – instead of firing gdb each time, he instructed gdb to get backtraces every time monitored process gets a signal (SIGUSR2 for example). This allows to maintain a persistent debugger attachment – and then signal periodically to get stacks analyzed.

Other mutation was auto-pmp – high frequency polling of process state (e.g. how many threads are running), and when a certain threshold is exceeded – obtaining stacks for further analysis (this combines really well with the hpmp approach – one process is the stacks reader, and other is signaling on thresholds). My major problem in such approach was that the polling methods we chose would be biased to show me end of overload events (because it wouldn’t return process state due to internal process locking).

At one point in time I had an epiphany, that was quickly melted by the reality – in theory we could use gdb watchpoints to replace my external process polling. Watchpoints allow to break a process when a change to a variable inside a program happens (and conditions can be applied), so essentially we would be able to instrument gdb to get stacks exactly at the moment when there’re stalls and spikes. Unfortunately, even though that worked fine in single-threaded or lightly loaded environments, monitored process crashed horribly in more realistic workloads – we have yet to figure out if it is a fundamental issue of the approach or actually a bug that may have been fixed in later versions.

Of course, there’s a workaround, that we’re considering for high performance system analysis – simply instrumenting a process to fire a signal or do a conditional jump whenever there’s an overload condition – so essentially that would be implementing in-process watchpoint-to-breakpoint translation giving us just-in-time analytics – so we’d see pretty much every situation where running threads pile up (unless there’s a bottleneck that simply doesn’t allow the workload to arrive :)

PMP on-demand allowed us to uncover various issues inside MySQL that have been overlooked in most of benchmarking as non-significant, but they are critical for us in providing better quality of service for each query, not just 99th percentile (I wrote about that recently). We keep thinking how to provide instrumentation for some of views we get inside MySQL (e.g. an ability to export pthread lock graph without using external tools), as well as better visibility of I/O blocking…

But for now we have what we have, poor man’s profiler :-)

Read ahead…

Mark wrote about how to find situations where InnoDB read-ahead is a bottleneck. What he didn’t disclose, though, is his trick to disable read-ahead without restart or recompile of MySQL. See, there’s no internal “disable read ahead knob”. But there is…

buf_read_ahead_random(...){ ...
       if (srv_startup_is_before_trx_rollback_phase) {
                /* No read-ahead to avoid thread deadlocks */
                return(0);
        }

This variable is tested at two functions – buf_read_ahead_linear() and buf_read_ahead_random() and nowhere else. So yeah, “server startup is before transaction rollback phase” is another way of saying “don’t do read ahead, please please”.

gdb -ex "set  srv_startup_is_before_trx_rollback_phase=1" \
    --batch -p $(pidof mysqld)

And many servers bottlenecked on this became much much much faster (and 2000 concurrent threads running dropped to 10). Of course, this is most visible in high-latency-high-throughput I/O situations, but we’re hitting this contention spot on local disk setups too.

Don’t forget to have the fix if gdb decides to be nasty and locks up your server :)

When bad things happen!

When bad things happen, like… ‘strace -f’ or ‘gdb’ or any other process inspection tool decides to hang your precious processes (they show up in state T in process lists), there’s always help:

#include <sys/ptrace.h>
#include <signal.h>
main(int ac, char **av) {
int pid; if (ac>1) pid=atoi(av[1]);
ptrace(PTRACE_ATTACH,pid,0,0);
ptrace(PTRACE_DETACH,pid,0,0);
kill(pid, SIGCONT); }

GDB 7!

I wasn’t prepared for this. After spending months playing with GDB development trees I somehow entirely missed that 7.0 release is getting close, and took me more than an hour to spot it.

My favorite features are python scripting and non-stop debugging. I was toying around with python scripting for a while, and was planning to make backtraces make sense. Having hands that open means that one can see PHP backtraces, when gdb’ing apache, see table names and states when MySQL thread access handler interfaces, or remote IPs and users, when it is writing to network. Process inspection can simply rock, if right tools are created using these new capabilities, and I’m way too excited when I think about those. “Always have debugging symbols” gets way more meaning now.

Another issue I’ve been trying to resolve lately is avoiding long locking periods for running processes (directly attaching to process can freeze its work for a second or so, which isn’t that tolerable in production environments). GDB is getting closer to the async debugging capabilities – where one can run a debugger without actually stopping anything.

So, congratulations GDB team, now it is job for us to find all the uses of the tool. It has been invaluable so far, but this is much much more.

Evil replication management

When one wants to script automated replication chain building, certain things are quite annoying, like immutable replication configuration variables. For example, at certain moments log_slave_updates is more than needed, and thats what the server says:

mysql> show variables like 'log_slave_updates';
+-------------------+-------+
| Variable_name     | Value |
+-------------------+-------+
| log_slave_updates | OFF   |
+-------------------+-------+
1 row in set (0.00 sec)

mysql> set global log_slave_updates=1;
ERROR 1238 (HY000): Variable 'log_slave_updates' is a read only variable

Of course, there are few options, roll in-house fork (heheeeee!), restart your server, and keep warming up your tens of gigabytes of cache arenas, or wait for MySQL to ship a feature change in next major release. Then there are evil tactics:

mysql> system gdb -p $(pidof mysqld)
                       -ex "set opt_log_slave_updates=1" -batch
mysql> show variables like 'log_slave_updates';
+-------------------+-------+
| Variable_name     | Value |
+-------------------+-------+
| log_slave_updates | ON    |
+-------------------+-------+
1 row in set (0.00 sec)

I don’t guarantee safety of this when slave is running, but… stopping and starting slave threads is somewhat cheaper, than stopping and starting big database instance, right?

What else can we do?

mysql> show slave status \G
...
     Replicate_Do_DB: test
...
mysql> system gdb -p $(pidof mysqld)
          -ex 'call rpl_filter->add_do_db(strdup("hehehe"))' -batch
mysql> show slave status \G
...
      Replicate_Do_DB: test,hehehe
...

It is actually possible to add all sorts of filters this way, rpl_filter.h can be good reference :) So now that you want to throw out some data from your slaves, restart isn’t needed. Unfortunately, deleting entries isn’t possible via rpl_filter methods, but you can always edit base_ilists, can’t you?

P.S. having this functionality inside server would definitely be best.

On binaries and -fomit-frame-pointer

Over last few years 64-bit x86 platform has became ubiquitous, thus making stupid memory limitations a thing of some forgotten past. 64-bit processors made internal memory structures bigger, but compensated that with twice the amount and twice larger registers.

But there’s one thing that definitely got worse – gcc, the compiler, has a change in default compilation options – it omits frame pointers from binaries in x86_64 architecture. People have advocated against that back in 1997 because of very simple reasons, that are still very much existing today too – frame pointers are needed for efficient stack trace calculations, and stack traces are very very useful, sometimes.

So, this change means that oprofile is not able to give hierarchical profiles, it also means that MySQL crash information will not include most useful bit – the failing stack trace. This decision has been just because of performance reasons – frame pointer takes whole register (though on x86_32 that meant 1 out of 8, on x86_64 it is 1 out of 16), which could be used to optimize the application.

I tested two MySQL builds, one built with ‘-O3 -g -fno-omit-frame-pointer’ and other with -fomit-frame-pointer instead – and performance difference was negligible. It was around 1% in sysbench tests, and slightly over 3% at tight-loop select benchmark(100000000,(select asin(5+5)+sin(5+5))); on a 2-cpu Opteron box.

The summary suggestion for this flag would be very simple. If you don’t care about fixing or making your product faster, you can probably use 1% speed-up. But if getting actual real-time performance data can lead to much better performance fixes, and fast introspection means qualified engineers can diagnose problems much faster, 1% or even 3% is not that much. So, add ‘-fno-omit-frame-pointer’ to CFLAGS and CXXFLAGS, and enjoy things getting back to normal :)

By the way, while I was at it, I did some empiric tests of other options, but one that irks me most is not using ‘-g’ on production binaries. See, debugging information, symbol tables, etc – they all cause around 0% performance difference. The only difference is that e.g. mysqld binary will weight 30M, instead of 6M (though that fat will not be loaded into RAM, and will only cost diskspace).

Why does debugging information matter? It doesn’t, if you don’t attempt to be power-user. It does, if you enjoy crazy debugger tricks (like one here or here). Oh, and of course, bonus GDB trick, how to run KILL without connecting to server:

(gdb) thread apply all bt
...
(gdb) thread 2
[Switching to thread 2 (Thread 0x44a76950 (LWP 23955))]#0  ...
(gdb) bt
#0  0x00007f7821e68e1d in pthread_cond_timedwait...
#1  0x000000000052dc0e in Item_func_sleep::val_int (this=0x12317f0)...
#2  0x0000000000501484 in Item::send (this=0x12317f0, ...
...
#15 0x00000000005caf29 in do_command (thd=0x11da290) ...
...
(gdb) frame 15
#15 0x00000000005caf29 in do_command (thd=0x11da290) at sql_parse.cc:854
854	  return_value= dispatch_command(command, thd, packet+1, ...
(gdb) set thd->killed = THD::KILL_QUERY
(gdb) continue

And the client gets ‘ERROR 1317 (70100): Query execution was interrupted’ :-)

Checksums again, some I/O too

When I was doing data loading tests, I realized that usually low checksum calculation CPU percentage is actually the blocking factor. See, usually when background writers do the flushing, it gets parallelized, but if active query is forcing a checkpoint, it all happens in ‘foreground’ thread, checksum computation included. This is where more Sun-ish wisdom (these people tune kernel with debugger all the time) comes in:

gdb -p $(pidof mysqld) -ex "set srv_use_checksums=0" --batch

Puff. Everything becomes much faster. Of course, one would be able to restart the server with –skip-innodb-checksums, but that would interrupt the whole process, etc. Of course, proper people would implement tunable parameter (5 lines of code, or so), but anyone with Solaris experience knows how to tune stuff with debuggers, hahaha.

Odd though, I was used to compiled -O3 mode optimizing checksums to disappear from profiles, so either this doesn’t work, or it just got so much emphasis to actually matter. This is why I told Heikki at the Users Conference, that checksums ‘must go’ or be fixed in one way or another. Even when they’re small part of operation and doesn’t cost anything if it doesn’t block primary operations (e.g. happens in parallel), people keep accessing SSDs, and then relative cost of checksum is insane.

It is quite easy to check that, just run InnoDB with tiny buffer pool, OS caching, and try some benchmarking with checksums enabled and disabled – there will be a huge difference, which may simply explain the performance difference of caching at buffer pool and OS buffers. Turn checksums off, and OS caching may be even tolerable for your workloads.

Other interesting issue here is that MarkC has lots of I/O path optimization in his fourth (and four is a good number!) patch. Now… maybe everyone will be soon caching pages in OS, once some more work is done in I/O access path cost work?