Profile guided optimization with gcc

Yesterday I wrote how certain build optimizations can have performance differences – and I decided to step a bit deeper into a quite interesting field – profile guided binary optimization. There’re quite a few interesting projects out there, like LLVM (I hear it is used extensively in iphone?) – which analyze run-time profile of compiled code and can do just in time adjustments of binary code. Apparently, you don’t need that fancy technology, and can use plain old gcc.

The whole plan is:

  1. Compile all code with -fprofile-generate in {C|CXX|LD}FLAGS
  2. Run the binary
  3. Run your application/benchmark against that binary
  4. Recompile all code with -fprofile-use (above steps will place lots of .gcda files in source tree)
  5. PROFIT!!! (note the omission of “???” step)

How much profit? I measured ~7% of sysbench performance increase (and probably would see much higher value in CPU-tight benchmarks). YMMV. Can such PGO be useful for every user out there? Maybe – but the best results are achieved once looking at actual use patterns – though of course, lots of them are similar everywhere around.

Also, I am showing the actual profiling process with too much of pink. Apparently gcc/gcov profiles tend to get corrupted in multithreaded applications, so I did multiple profile/build passes, until I managed to assemble final binary. :-)

Now I have to figure out how to use -combine flag in gcc, and treat whole MySQL codebase as one huge .c file (apparently compilers can make much much better decisions then).

On binaries and -fomit-frame-pointer

Over last few years 64-bit x86 platform has became ubiquitous, thus making stupid memory limitations a thing of some forgotten past. 64-bit processors made internal memory structures bigger, but compensated that with twice the amount and twice larger registers.

But there’s one thing that definitely got worse – gcc, the compiler, has a change in default compilation options – it omits frame pointers from binaries in x86_64 architecture. People have advocated against that back in 1997 because of very simple reasons, that are still very much existing today too – frame pointers are needed for efficient stack trace calculations, and stack traces are very very useful, sometimes.

So, this change means that oprofile is not able to give hierarchical profiles, it also means that MySQL crash information will not include most useful bit – the failing stack trace. This decision has been just because of performance reasons – frame pointer takes whole register (though on x86_32 that meant 1 out of 8, on x86_64 it is 1 out of 16), which could be used to optimize the application.

I tested two MySQL builds, one built with ‘-O3 -g -fno-omit-frame-pointer’ and other with -fomit-frame-pointer instead – and performance difference was negligible. It was around 1% in sysbench tests, and slightly over 3% at tight-loop select benchmark(100000000,(select asin(5+5)+sin(5+5))); on a 2-cpu Opteron box.

The summary suggestion for this flag would be very simple. If you don’t care about fixing or making your product faster, you can probably use 1% speed-up. But if getting actual real-time performance data can lead to much better performance fixes, and fast introspection means qualified engineers can diagnose problems much faster, 1% or even 3% is not that much. So, add ‘-fno-omit-frame-pointer’ to CFLAGS and CXXFLAGS, and enjoy things getting back to normal :)

By the way, while I was at it, I did some empiric tests of other options, but one that irks me most is not using ‘-g’ on production binaries. See, debugging information, symbol tables, etc – they all cause around 0% performance difference. The only difference is that e.g. mysqld binary will weight 30M, instead of 6M (though that fat will not be loaded into RAM, and will only cost diskspace).

Why does debugging information matter? It doesn’t, if you don’t attempt to be power-user. It does, if you enjoy crazy debugger tricks (like one here or here). Oh, and of course, bonus GDB trick, how to run KILL without connecting to server:

(gdb) thread apply all bt
(gdb) thread 2
[Switching to thread 2 (Thread 0x44a76950 (LWP 23955))]#0  ...
(gdb) bt
#0  0x00007f7821e68e1d in pthread_cond_timedwait...
#1  0x000000000052dc0e in Item_func_sleep::val_int (this=0x12317f0)...
#2  0x0000000000501484 in Item::send (this=0x12317f0, ...
#15 0x00000000005caf29 in do_command (thd=0x11da290) ...
(gdb) frame 15
#15 0x00000000005caf29 in do_command (thd=0x11da290) at
854	  return_value= dispatch_command(command, thd, packet+1, ...
(gdb) set thd->killed = THD::KILL_QUERY
(gdb) continue

And the client gets ‘ERROR 1317 (70100): Query execution was interrupted’ :-)

Crashes, complicated edition

Usually our 4.0.40 (aka ‘four oh forever’) build doesn’t crash, and if it does, it is always hardware problem or kernel/filesystem bug, or whatever else. So, we have a very calm life, until crashes start to happen…

As we used to run RAID0, a disk failure usually means system wipe and reinstall once fixed – so our machines all run relatively new kernels and OS (except some boxes which just refuse to die ;-), and we’re usually way more ahead than all the bunch of conservative RHEL users.

We had one machine which was reporting CPU/northbridge/RAM problems, and every MySQL crash was accompanied by MCEs, so after replacing RAM, CPU and motherboard itself, we just sent the machine back to service, and asked them to do whatever it takes to fix it.

So, this machine, with proud name of ‘db1’ comes and after entering the service starts crashing every day. I reduced InnoDB log file size, to make recovery faster, and would run it under ‘gdb’. Stacktrace on crash pointed to check-summing (aka folding) bunch of functions, so initial assumption was ‘here we get memory errors again’. So, for a while I thought that ‘db1’ needs some more hardware work, and just left it as is, as we were waiting for new database hardware batch to deploy and there was a bit more work around.

We started deploying new database hardware, and it started crashing every few hours instead of every few days. Here again, reduced InnoDB transaction log size and gdb attached allowed to trap the segfault, and it was pointing again to the very same adaptive hash key calculation (folding!).

Unfortunately, it was non-trivial chain of inlined functions (InnoDB is full of these), so I built ‘-g -fno-inline’ build, and was keenly waiting for a crash to happen, so I could investigate what and where gets corrupted. It did not. Then I looked at our zoo just to find out we have lots of different builds. On one hand it was a bit messy, on another hand, it showed few conclusions:

  • Only Opterons crashed (though there’re like three year gap between revisions)
  • Only Ubuntu 8.04 crashed
  • Only GCC-4.2 build crashed

After thinking a bit that:

  • We have Opterons that don’t crash (older gcc builds)
  • Xeons didn’t crash.
  • We have Ubuntu 8.04 that don’t crash (they either are Xeons or run older gcc-4.1 builds)
  • We have GCC-4.2 builds that run nice (all – on Xeons, all on 8.04 Ubuntu).

The next test was taking gcc-4.1 builds and running them on our new machines. No crash for next two days.
One new machine did have gcc-4.2 build and didn’t crash for few days of replicate-only load, but once it got some parallel load, it crashed in next few hours.

I tried to chat about it on Freenode’s #gcc, and I got just:

noshadow>	domas: almost everything that fails when
		optimized (as inlining opens many new
		optimisation possibilities)
noshadow>	i.e: const misuse, relying on undefined
		behaviour, breaking aliasing rules, ...
domas>		interesting though, I hit it just with
		gcc 4.2.3 and opterons only
noshadow>	domas: that makes it more likely that
		it is caused by optimisation unveiling
		programming bugs

In the end I know, that there’s programming bug in ancient code using inlined functions, that causes memory corruption in multithreaded load if compiled with gcc-4.2 and ran on Opteron. As for now it is our fork, pretty much everyone will point at each other and won’t try to fix it :)

And me? I can always do:

env CC=gcc-4.1 CXX=g++-4.1 ./configure ... 

I’m too lazy to learn how to disassemble and check compiled code differences, especially when every test takes few hours. I already destroyed my weekend with this :-) I’m just waiting for people to hit this with stock mysql – would be one of those things we love debugging ;-)