Crash recovery, again

There’s one stage in InnoDB crash recovery where it reads the log file, you know, this one:

InnoDB: Doing recovery: scanned up to log sequence number 354164119040
InnoDB: Doing recovery: scanned up to log sequence number 354169361920

On a machine with bigger logs it will spend nearly 100% of CPU time somewhere in recv_scan_log_recs. Guess what it does… -fno-inline builds to the rescue:

#0  mem_block_get_len at ./include/mem0mem.ic:86
#1  mem_heap_get_size at ./include/mem0mem.ic:591
#2  recv_scan_log_recs at log/log0recv.c:2727

And:

samples  %        symbol name
8467     72.9222  mem_heap_get_size
291       2.5062  recv_add_to_hash_table
95        0.8182  mem_block_get_len
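
(Both the backtrace and the profile come from a ‘-g -fno-inline’ build; for the record, producing one is just a matter of overriding compiler flags at configure time – the exact invocation below is illustrative:)

env CFLAGS="-O2 -g -fno-inline" CXXFLAGS="-O2 -g -fno-inline" ./configure ...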

To put it in layman’s terms: InnoDB does SUM(LENGTH(allocation)) over its rather wide memory arena (tens or hundreds of thousands of entries), FOR EVERY LOG SEGMENT, just to make sure it didn’t run out of the available 32GB. Hehehe, lame.
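
Here’s a toy model (illustrative only, not the InnoDB source) of what that mem_heap_get_size() call amounts to – walk the heap’s entire linked list of blocks and sum their lengths, once per scanned log segment:

/* Toy model of mem_heap_get_size(): with hundreds of thousands
   of blocks, calling this per log segment is where the CPU goes. */
#include <stddef.h>

typedef struct mem_block {
    size_t            len;   /* length of this allocation block */
    struct mem_block *next;  /* next block in the heap */
} mem_block_t;

static size_t heap_get_size(const mem_block_t *heap)
{
    size_t size = 0;
    const mem_block_t *b;

    for (b = heap; b != NULL; b = b->next) {
        size += b->len;      /* SUM(LENGTH(allocation)) */
    }
    return size;
}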

For now, I just killed the check and got my recovery much, much faster – it finished in 3 minutes what it wasn’t able to do in 30 before.
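
If one wanted to keep the safety check, a cheap fix would be incremental accounting – update a cached total whenever the recovery heap grows, so the per-segment check becomes O(1). A sketch with made-up names (not the actual code):

#include <stddef.h>

static size_t recv_heap_used = 0;   /* cached SUM(LENGTH(allocation)) */

/* call whenever a new block is added to the recovery heap */
static void recv_heap_grew(size_t block_len)
{
    recv_heap_used += block_len;    /* O(1) per allocation */
}

/* the per-segment limit check, now O(1) instead of a full walk */
static int recv_heap_over_limit(size_t available_memory)
{
    return recv_heap_used > available_memory;
}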

P.S. This is different from what I wrote about before (and from Yasufumi’s magic patch).
P.P.S. Now I’ve got to learn to reuse the LOG_DUMMY table during the recovery process, as it is the next low-hanging fruit there…

Progress in percents: 0 1 2 3 …

Well, servers usually don’t crash (our English Wikipedia master has been running for 800 days, on white-box hardware, RAID0, 4.0 ;-), but when they do (say, due to kernel bugs on our big big boxes), one of the most painful experiences is InnoDB log recovery.

Usually people reduce innodb-log-file-size to ease the recovery (it helps, in a way :), but the real problem is somewhere else.
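
That workaround usually means something like this in my.cnf (the value is illustrative, and the trade-off is more aggressive checkpointing during normal load):

[mysqld]
innodb_log_file_size = 64M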

See, when InnoDB does crash recovery, it applies the log changes in memory, and builds a flush list. It doesn’t flush any pages during the recovery process, so the flush list grows big, thousands, tens of thousands, maybe millions kind of big, anyway, big-number big.

Oh, did I mention? The flush list is actually a linked list, not some kind of hippy tree stuff. Every time a log record is read from the log and something gets updated, the flush list is traversed – thousands, tens of thousands, maybe millions of entries.

The expensive code looks something like this:

/* Walk from the head of the flush list to find the insertion
   point; the list is kept ordered by oldest_modification, so
   each insert is O(length of the list). */
while (b && (ut_dulint_cmp(b->oldest_modification,
             block->oldest_modification) > 0)) {
       prev_b = b;
       b = UT_LIST_GET_NEXT(flush_list, b);
}

Then your profile starts looking like this, and you wish your systems didn’t crash:

%        symbol name
87.6978  buf_flush_insert_sorted_into_flush_list
 5.8571  -kernel
 1.9793  recv_apply_hashed_log_recs
 0.8059  buf_calc_page_new_checksum

So, the recovery process cost is quadratic in the number of dirty pages, and people work around it by reducing the log file size – that is, by reducing the performance of their system – while the actual fix is right there, in optimizing the data structure. The current model is outdated for anything built in the last 5 years anyway.
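
To make the quadratic claim concrete, here’s a self-contained toy (illustrative only, not InnoDB code) that mimics the sorted-insert walk shown above; with random keys, n inserts walk roughly n²/4 nodes in total:

/* Toy: n sorted inserts into a descending-ordered linked list. */
#include <stdio.h>
#include <stdlib.h>

typedef struct node {
    unsigned long key;        /* stands in for oldest_modification */
    struct node  *next;
} node;

int main(void)
{
    node         *head = NULL;
    unsigned long walked = 0;
    int           n = 20000;  /* think: dirty pages on the flush list */
    int           i;

    for (i = 0; i < n; i++) {
        node *nn = malloc(sizeof(node));
        node *prev = NULL, *b = head;

        nn->key = (unsigned long) rand();

        /* the same walk as buf_flush_insert_sorted_into_flush_list */
        while (b && b->key > nn->key) {
            walked++;
            prev = b;
            b = b->next;
        }
        nn->next = b;
        if (prev) prev->next = nn; else head = nn;
    }
    printf("%d inserts, %lu nodes walked (~n^2/4)\n", n, walked);
    return 0;
}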

Oh, and of course, I’d like systems not to crash at all, like that database master on whitebox raid0 running for 800 days.

Update: this is old stuff. Peter wrote about it, Heikki opened a bug, then figured it would need more than five minutes to fix and reclassified it as a feature request, so Peter could write more about it. That makes it even sadder. We’d probably change the synopsis of the feature request to “make crash recovery work”.

Update 2: get the patch at Percona (Yasufumi is god :)

Crashes, complicated edition

Usually our 4.0.40 (aka ‘four oh forever’) build doesn’t crash, and if it does, it is always a hardware problem, a kernel/filesystem bug, or whatever else. So we have a very calm life, until crashes start happening…

As we run RAID0, a disk failure usually means a system wipe and reinstall once the box is fixed – so our machines all run relatively new kernels and OSes (except some boxes which just refuse to die ;-), and we’re usually way ahead of the whole bunch of conservative RHEL users.

We had one machine that kept reporting CPU/northbridge/RAM problems, and every MySQL crash was accompanied by MCEs, so after replacing the RAM, the CPU and the motherboard itself, we just sent the machine back to the service shop and asked them to do whatever it takes to fix it.

So this machine, with the proud name of ‘db1’, comes back and, after entering service, starts crashing every day. I reduced the InnoDB log file size to make recovery faster, and would run it under gdb. The stack trace on crash pointed to a bunch of checksumming (aka folding) functions, so the initial assumption was ‘here we get memory errors again’. For a while I thought ‘db1’ needed some more hardware work and just left it as it was, since we were waiting for a new database hardware batch to deploy and there was a bit of other work around.

We started deploying new database hardware, and it started crashing every few hours instead of every few days. Here again, a reduced InnoDB transaction log size and an attached gdb let me trap the segfault, and it pointed again to the very same adaptive hash key calculation (folding!).

Unfortunately, it was a non-trivial chain of inlined functions (InnoDB is full of these), so I made a ‘-g -fno-inline’ build and was keenly waiting for a crash to happen, so I could investigate what gets corrupted and where. It did not crash. Then I looked at our zoo, only to find out we have lots of different builds. On one hand it was a bit messy; on the other hand, it revealed a few patterns:

  • Only Opterons crashed (though there’s something like a three-year gap between revisions)
  • Only Ubuntu 8.04 crashed
  • Only GCC-4.2 build crashed

After thinking a bit more:

  • We have Opterons that don’t crash (older gcc builds)
  • Xeons didn’t crash.
  • We have Ubuntu 8.04 machines that don’t crash (they are either Xeons or run older gcc-4.1 builds)
  • We have GCC-4.2 builds that run nicely (all of them on Xeons, all on Ubuntu 8.04).

The next test was taking gcc-4.1 builds and running them on our new machines. No crash for the next two days.
One new machine did have a gcc-4.2 build and didn’t crash for a few days of replicate-only load, but once it got some parallel load, it crashed within a few hours.

I tried to chat about it on Freenode’s #gcc, and all I got was:

noshadow>	domas: almost everything that fails when
		optimized (as inlining opens many new
		optimisation possibilities)
noshadow>	i.e: const misuse, relying on undefined
		behaviour, breaking aliasing rules, ...
domas>		interesting though, I hit it just with
		gcc 4.2.3 and opterons only
noshadow>	domas: that makes it more likely that
		it is caused by optimisation unveiling
		programming bugs

In the end I know that there’s a programming bug in ancient code using inlined functions, one that causes memory corruption under multithreaded load when compiled with gcc-4.2 and run on Opterons. As it lives in our fork, pretty much everyone will point at each other and nobody will try to fix it :)

And me? I can always do:

env CC=gcc-4.1 CXX=g++-4.1 ./configure ... 
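
Or, assuming the #gcc folks are right that optimization is just unveiling an aliasing bug, one could keep gcc-4.2 but turn off the optimization that bites (an untested guess on my side, not a verified fix):

env CFLAGS="-O2 -fno-strict-aliasing" CXXFLAGS="-O2 -fno-strict-aliasing" ./configure ...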

I’m too lazy to learn how to disassemble and check compiled code differences, especially when every test takes a few hours. I already destroyed my weekend with this :-) I’m just waiting for people to hit this with stock MySQL – it would be one of those things we love debugging ;-)