Progress in percents: 0 1 2 3 …

Well, servers usually don’t crash (our English Wikipedia master has been running for 800 days, on white-box hardware, RAID0, 4.0 ;-), but when they do (like with some kernel bugs on our big big boxes), one of the most painful experiences is InnoDB log recovery.

Usually people will reduce the innodb-log-file-size to speed up the recovery (it helps, in a way :), but the real problem is somewhere else.

See, when InnoDB does crash recovery, it applies the log changes in memory, and builds a flush list. It doesn’t flush any pages during the recovery process, so the flush list grows big, thousands, tens of thousands, maybe millions kind of big, anyway, big-number big.

Oh, did I mention? The flush list is actually a linked list, not some kind of hippy tree stuff. Every time a log record is read from the log and something gets updated, the flush list is traversed – thousands, tens of thousands, maybe millions of entries.

The expensive code looks something like this:

/* linear scan for the insertion point – the list is ordered by
   oldest_modification, so this walk is O(flush list length) per page */
while (b && (ut_dulint_cmp(b->oldest_modification,
             block->oldest_modification) > 0)) {
       prev_b = b;
       b = UT_LIST_GET_NEXT(flush_list, b);
}

Then your profile starts looking like this, and you wish your systems didn’t crash:

%        symbol name
87.6978  buf_flush_insert_sorted_into_flush_list
 5.8571  -kernel
 1.9793  recv_apply_hashed_log_recs
 0.8059  buf_calc_page_new_checksum

So, the recovery cost grows quadratically with the number of dirty pages, and people work around it by reducing the log file size, i.e. by sacrificing the performance of their running system, while the actual fix is right there, in optimizing the data structure. The current model is outdated for anything built in the last 5 years anyway.
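Just to make the cost concrete, here is a toy sketch (plain C of my own, not InnoDB code, with a fake oldest_modification value): keeping a plain linked list sorted by linear scan makes N insertions cost O(N²) overall, which is exactly what recovery does to the flush list.

#include <stdio.h>
#include <stdlib.h>

/* toy model of the flush list: blocks kept ordered by oldest_modification */
struct block {
    unsigned long oldest_modification;
    struct block *next;
};

/* sorted insert via linear scan – the same pattern as
   buf_flush_insert_sorted_into_flush_list(): O(list length) per call */
static void insert_sorted(struct block **head, struct block *b)
{
    struct block *prev = NULL, *cur = *head;

    while (cur && cur->oldest_modification > b->oldest_modification) {
        prev = cur;
        cur = cur->next;
    }
    b->next = cur;
    if (prev)
        prev->next = b;
    else
        *head = b;
}

int main(void)
{
    struct block *head = NULL;
    int n = 50000;                        /* "big-number big" */

    for (int i = 0; i < n; i++) {
        struct block *b = malloc(sizeof(*b));
        b->oldest_modification = (unsigned long)rand();
        insert_sorted(&head, b);          /* n inserts x O(n) scan = O(n^2) */
    }
    printf("inserted %d blocks\n", n);
    return 0;
}

Double n and the runtime roughly quadruples; keep the list in a tree or skip list instead, and each insert becomes logarithmic.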

Oh, and of course, I’d like systems not to crash at all, like that database master on whitebox raid0 running for 800 days.

Update: this is old stuff. Peter wrote about it, Heikki opened a bug, then figured it would need more than five minutes to fix and classified it as a feature request, so Peter could write more about it. That makes it even more sad. We’d probably change the synopsis of the feature request to “make crash recovery work”.

Update 2: get the patch at Percona (Yasufumi is god :)

dtrace!

At the MySQL developer conference I accidentally showed off some of the things we’ve been doing with dtrace (I had used it in a few cases and realized the power it has), and saw some jaws drop. Then I ended up doing small demos around the event. What most people know about dtrace is that there are some probes and you can trace them. What people don’t know is that you can actually create lots of probes dynamically, and use them with lots of flexibility.

One of the major things not really grasped by many is that dtrace is a combination of a tracing tool, a debugger, a programming language and a database, having minor but very valuable functionality from each. It can attach to nearly any place in code, it can get stacks and function arguments, traverse structures, do some evaluations, aggregate data – and in the end that’s all compiled code executed by the kernel (or by the traced programs).

Sometimes a probe may not look that useful (strace would show file writes too, right?), but once it is combined with the ability to grab an immediate stack, as well as to set or read context variables (a previous probe on any other event could have saved some important information, e.g. host, user, table names, etc.), the final result can be statistics correlated with many other activities.

One developer (a traitor who left support for an easier life in the engineering dept) listened to all this, and I asked what his current project was – apparently he was adding static dtrace probes to MySQL. It ended up being quite an interesting discussion, as static probes provide two points of value. First of all, they provide an interface – whereas dynamic probes can change whenever the code changes (though that doesn’t happen too often :). Second, one can do additional calculations for a specific probe, which are done only on demand (when the probe is attached).

So, having a static probe that directly maps to an easy-mode dynamic one (it is straightforward to attach to a function, and quite easy to read its arguments) is a bit of a waste (both of development time and of the few instructions actually emitted there). Dynamic tracing generally modifies binaries on the fly, so otherwise it carries no additional cost. An example where a static probe would be awesome: a “query start” event carrying the query string canonicalized with all literals removed – this would allow on-demand query profiling for query groups, rather than stand-alone queries.
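Just to illustrate the canonicalization idea (a toy sketch in plain C with hypothetical helper names, nothing to do with MySQL’s parser): literals get replaced with placeholders, so all instances of the same query shape fall into one profiling group.

#include <ctype.h>
#include <stdio.h>

/* Hypothetical literal-stripping canonicalizer: numbers and quoted strings
   become '?', so "... WHERE id = 15" and "... WHERE id = 42" map to the same
   group. A real parser would do this properly; this only shows the idea. */
static void canonicalize(const char *in, char *out, size_t outlen)
{
    size_t o = 0;

    for (size_t i = 0; in[i] && o + 1 < outlen; i++) {
        if (in[i] == '\'' || in[i] == '"') {          /* quoted string literal */
            char q = in[i];
            out[o++] = '?';
            i++;
            while (in[i] && in[i] != q)
                i++;
            if (!in[i])                               /* unterminated – stop */
                break;
        } else if (isdigit((unsigned char)in[i])) {   /* numeric literal */
            while (isdigit((unsigned char)in[i + 1]))
                i++;
            out[o++] = '?';
        } else {
            out[o++] = in[i];
        }
    }
    out[o] = '\0';
}

int main(void)
{
    char canon[256];

    canonicalize("SELECT name FROM users WHERE id = 12345 AND note = 'abc'",
                 canon, sizeof(canon));
    printf("%s\n", canon);  /* -> SELECT name FROM users WHERE id = ? AND note = ? */
    return 0;
}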

The other major value is the ability to set thread-specific context variables in different probes, so they can read each other’s data. At the time an incoming packet arrives one can tag the thread with whatever information is needed – then any subsequent probes can reuse that information to filter out the important events. That also removes the need for static probes to provide multi-layer information – it can all be achieved by chaining events, without too much complexity.

I took a bit of a trollish stance when I approached a developer implementing internal performance statistics. We were playing a game – he’d tell me what kind of performance information he’d like to extract, and I’d show a method to do it with dtrace. More people from the monitoring field joined, and we ended up discussing what the perfect performance monitoring and analysis system would be. It is quite easy to understand that different people need different kinds of metrics. For MySQL development work a performance engineer will need mutex contention information, someone fixing a leak will need heap profiling, someone writing a feature will want an easy way to trace how the server executes their code – and all of that is far from any need an actual user or DBA has. Someone who writes a query just wants to see the query plan with some easy-to-understand costs (we just need to pump more steroids into EXPLAIN). DBAs may want to see resource consumption per-user, per-table, etc. (something the Google patch provides). It is interesting to find the balance between external tools and what should be supported out of the box internally – it is way easier to force the internal crowd to have proper tools, and it is always nice to provide as much instrumentation as possible for anyone external.

Of course, there’s a poor guy caught between the two camps – the support engineer – who needs easy performance metrics to be accessible from clients, but needs way more depth than standard tools provide. In an ideal case dtrace would be everywhere (someone recently said it’s one of the coolest things Sun has ever brought out) – then we’d be able to retrieve on-demand performance metrics from everywhere, and would be tempted to write a DTraceToolkit-like bunch of stuff (DTraceToolkit is a suite of programs that give lots and lots of information based on dtrace) for MySQL internals analysis.

I already made one very, very simple tool which visualizes dtrace output, so we can have a graphviz-based SVG callgraph for pretty much any type of probe (like, who in the application does expensive file reads) – all from a single dtrace one-liner. It seems I can sell the tool to Sun’s performance engineering team – they liked it. :)

Some people even installed Solaris afterwards for their performance tests. Great, I won’t have to (haha!).

Though the lack of dtrace in Linux is currently a blocker for the technology, lots of engineers already have it on their laptops – Mac OS X 10.5 ships it. It even has a visual toolkit that allows building some dtrace stuff in a GUI.

I’m pretty sure now that any engineer would love dtrace (or dtrace-based tools) – they just don’t know it yet.

Drizzle

Hi! It is about time to write some thoughts about Drizzle, even after it got so much blogging love elsewhere :)

I love some of the ideas – like employing a generic portable record format, and throwing away lots and lots of crufty code associated with reading internal structures.

Some of the ideas I probably love less, mostly the microkernel design.

See, I’m a believer in hacks. A hack in monolithic code blends in nicely; a hack in a microkernel design looks like a bunch of spaghetti on top of a kosher pork steak (well, probably a bad analogy :). Hacks start bloating the plugin interfaces, microkernel designers become unhappy, there’s lots of tension – instead of everything living in one huge nice pot of spaghetti.

Why does one need hacks? I like it when InnoDB controls replication (thus adding transactional consistency to it, or adding semi-sync properties), and I like it when replication controls InnoDB (asks for higher priorities, and such). These changes required changing the handler interface without even having replication as a module. In a spaghetti soup, one straw more or less doesn’t matter that much. :)

The very example of Apache proves the point that modules don’t work well together. There’s not that much synergy between, say, mod_php and mod_perl. Actually, there’s not much synergy between any Apache modules. People end up compressing, logging, filtering and redirecting inside PHP or Python or Perl code, not in dedicated Apache modules. Why? Simply because the interfaces are insufficient and the modules end up limited – there’s no real synergy out there. In the end, having the data logic in one piece is actually more maintainable than building bridges between entirely separate pools of logic.

It is a bit of hypocrisy to aim for a modular design with clear plugin interfaces, and at the same time remove all the features that make the design of other applications more modular and their interfaces clearer (SPs, prepared statements, triggers, etc. ;-)

Of course, I’m playing Devil’s Advocate a bit here, and I’m one of those people forced to know every reason why various features got removed, but I somehow feel that lots of the actual improvements (like protocol buffers) could be done without the stripping. Also, I know that most of the removed features are not harmful in any way if they are not used :)

In the end, most of the heavyweight database work is done at the storage engine layer anyway, most of the resource usage is by the storage engine, most of the scalability troubles are at the storage engine – and most of the actually needed improvements and features should be done at the storage engine layer.

mmap()

I’ve seen quite some work done on using mmap() in various places, including MySQL.
mmap() is also used for malloc()’ing huge blocks of memory.
mmap()’ed file data is cached as part of the VM cache, not the file cache (and though the two are tightly coupled inside the kernel, their priorities still remain different).

If a small program with a low memory footprint maps a file, it will probably make file access faster (as it will be cached more aggressively in memory, and will put pressure on other cached file data – that’s cheating though).

If a large program with lots and lots of allocated memory maps a file, that will pressure the filesystem cache to flush pages, and then… will pressure the existing VM pages of that very same large program to be swapped out. That’s certainly bad.

For now MySQL uses mmap() just for compressed MyISAM files. Vadim wrote a patch to do more mmap()ing.

If there’s less data than RAM, mmap() may use CPU cycles somewhat more efficiently. If there’s more data than RAM, mmap() will kill the system.
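To make the mechanics concrete, here is a minimal sketch (plain POSIX C, my own illustration, not MySQL code) of reading a file through mmap() – every page touched is faulted into the VM cache, which is exactly the memory that starts competing with the rest of the program once the file is bigger than RAM.

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc < 2) {
        fprintf(stderr, "usage: %s file\n", argv[0]);
        return 1;
    }

    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

    /* map the whole file: pages are faulted in on access and live in the VM cache */
    unsigned char *p = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    /* touch every page – with more data than RAM, this is what starts pushing
       out the rest of the program (and everything else) */
    unsigned long sum = 0;
    for (off_t off = 0; off < st.st_size; off += 4096)
        sum += p[off];

    printf("touched %ld bytes, sum %lu\n", (long)st.st_size, sum);
    munmap(p, st.st_size);
    close(fd);
    return 0;
}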

Interestingly though, a few months ago there was a discussion on lkml where Linus wrote:

Because quite frankly, the mixture of doing mmap() and write() system calls is quite fragile – and I’m not saying that just because of this particular bug, but because there are all kinds of nasty cache aliasing issues with virtually indexed caches etc that just fundamentally mean that it’s often a mistake to mix mmap with read/write at the same time.

So, simply, don’t.

Update: Oh well, 5.1 has a --myisam_use_mmap option… Argh.
Update on update: after a few minutes of internal testing all mmap()ed MyISAM tables went fubar.

Notes from the land of I/O

A discussion on IRC sparked some interest in how various I/O things work in Linux. I wrote a small microbenchmarking program (where all configuration is in the source file, and I/O modes can be changed by editing various places in the code ;-), and started playing with performance.

The machine for this testing was a 16-disk RAID10 box with a 2.6.24 kernel, and I tried to understand how O_DIRECT works and how fsync() works, and ended up digging into some other stuff.
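The actual program was throwaway code, but a minimal sketch of the idea looks roughly like this (my reconstruction, not the original: it assumes Linux, a pre-created 1 GB test file and 4 KiB aligned random writes; wrap it in time(1) to get writes per second):

#define _GNU_SOURCE
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BLOCK    4096                   /* O_DIRECT wants aligned buffers/offsets */
#define THREADS  16
#define WRITES   2000                   /* per thread */
#define FILESZ   (1024L * 1024 * 1024)  /* "testfile" assumed to exist, 1 GB */

static int fd;

static void *writer(void *arg)
{
    unsigned int seed = (unsigned int)(long)arg;
    void *buf;

    if (posix_memalign(&buf, BLOCK, BLOCK))
        return NULL;
    memset(buf, 'x', BLOCK);

    for (int i = 0; i < WRITES; i++) {
        /* random aligned offset – concurrent DIO writes like these get
           serialized on ext2/ext3/jfs, but can run in parallel on xfs */
        off_t off = ((off_t)rand_r(&seed) % (FILESZ / BLOCK)) * BLOCK;
        if (pwrite(fd, buf, BLOCK, off) != BLOCK) { perror("pwrite"); break; }
    }
    free(buf);
    return NULL;
}

int main(void)
{
    fd = open("testfile", O_WRONLY | O_DIRECT);
    if (fd < 0) { perror("open"); return 1; }

    pthread_t t[THREADS];
    for (long i = 0; i < THREADS; i++)
        pthread_create(&t[i], NULL, writer, (void *)i);
    for (int i = 0; i < THREADS; i++)
        pthread_join(t[i], NULL);

    close(fd);
    return 0;
}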

My notes for now are:

  • O_DIRECT serializes writes to a file on ext2, ext3 and jfs, so I got at most 200-250 w/s.
  • xfs allows parallel (and out-of-order, if that matters) DIO, so I got 1500-2700 w/s of random I/O without write-behind caching (depending on file size – seek time changes.. :). There are a few outstanding bugs that lock this back down to 250 w/s (#xfs@freenode: “yeah, we drop back to taking the i_mutex in teh case where we are writing beyond EOF or we have cached pages”), so
    posix_fadvise(fd, 0, filesize, POSIX_FADV_DONTNEED)

    helps.

  • fsync(), sync() and fdatasync() wait if there are any outstanding writes; the bad part – they can wait forever. Filesystem people say that’s a bug – a sync shouldn’t wait for I/O issued after it was called. I tend to believe them, as it causes stuff like InnoDB semaphore waits and such.

Of course, having write-behind caching at the controller (or disk, *shudder*) level allows filesystems to be lazy (and benchmarks are no longer that different), but having the upper layers work efficiently is quite important too, to avoid bottlenecks.

It is interesting that write-behind caching isn’t needed that much anymore for random writes, once the filesystem parallelizes the I/O – even direct, unbuffered I/O.

Anyway, now that I’ve found some of these I/O properties and issues, I should probably start thinking about how they apply to the upper layers like InnoDB.. :)

Crashes, complicated edition

Usually our 4.0.40 (aka ‘four oh forever’) build doesn’t crash, and if it does, it is always a hardware problem, a kernel/filesystem bug, or whatever else. So we have a very calm life, until crashes start to happen…

As we used to run RAID0, a disk failure usually meant a system wipe and reinstall once it was fixed – so our machines all run relatively new kernels and OSes (except some boxes which just refuse to die ;-), and we’re usually way ahead of the whole bunch of conservative RHEL users.

We had one machine which was reporting CPU/northbridge/RAM problems, and every MySQL crash was accompanied by MCEs, so after replacing the RAM, the CPU and the motherboard itself, we just sent the machine back to the service shop and asked them to do whatever it takes to fix it.

So, this machine, with the proud name of ‘db1’, comes back, and after entering service it starts crashing every day. I reduced the InnoDB log file size to make recovery faster, and would run it under gdb. The stacktrace on crash pointed to a bunch of checksumming (aka folding) functions, so the initial assumption was ‘here we get memory errors again’. So for a while I thought ‘db1’ needed some more hardware work and just left it as is, as we were waiting for a new batch of database hardware to deploy and there was a bit more work around.

We started deploying the new database hardware, and it started crashing every few hours instead of every few days. Here again, a reduced InnoDB transaction log size and an attached gdb allowed trapping the segfault, and it pointed again to the very same adaptive hash key calculation (folding!).

Unfortunately, it was a non-trivial chain of inlined functions (InnoDB is full of these), so I made a ‘-g -fno-inline’ build and was keenly waiting for a crash to happen, so I could investigate what gets corrupted and where. It did not crash. Then I looked at our zoo, just to find out we have lots of different builds. On one hand it was a bit messy, on the other hand it led to a few conclusions:

  • Only Opterons crashed (though there’s something like a three-year gap between revisions)
  • Only Ubuntu 8.04 crashed
  • Only GCC-4.2 build crashed

After noting that:

  • We have Opterons that don’t crash (older gcc builds)
  • Xeons didn’t crash.
  • We have Ubuntu 8.04 boxes that don’t crash (they are either Xeons or run older gcc-4.1 builds)
  • We have gcc-4.2 builds that run fine (all on Xeons, all on Ubuntu 8.04).

The next test was taking gcc-4.1 builds and running them on our new machines. No crash for the next two days.
One new machine did have a gcc-4.2 build and didn’t crash during a few days of replicate-only load, but once it got some parallel load, it crashed within the next few hours.

I tried to chat about it on Freenode’s #gcc, and all I got was:

noshadow>	domas: almost everything that fails when
		optimized (as inlining opens many new
		optimisation possibilities)
noshadow>	i.e: const misuse, relying on undefined
		behaviour, breaking aliasing rules, ...
domas>		interesting though, I hit it just with
		gcc 4.2.3 and opterons only
noshadow>	domas: that makes it more likely that
		it is caused by optimisation unveiling
		programming bugs

In the end I know that there’s a programming bug in ancient code using inlined functions, which causes memory corruption under multithreaded load if compiled with gcc-4.2 and run on Opterons. And since for now it is our fork, pretty much everyone will point at each other and won’t try to fix it :)

And me? I can always do:

env CC=gcc-4.1 CXX=g++-4.1 ./configure ... 

I’m too lazy to learn how to disassemble and check compiled-code differences, especially when every test takes a few hours. I already destroyed my weekend with this :-) I’m just waiting for people to hit this with stock MySQL – it would be one of those things we love debugging ;-)

Knol

There isn’t much to say about Knol technology – it is either nicely engineered or missing (they probably thought that search is the main tool for collaboration). Of course, many issues are already covered by others, but…

My first look was at the featured articles. What was wrong?

  • It features ‘closed collaboration’. Actually, that’s no different from a blog, then…
  • It doesn’t care much about licensing – featured articles had images with “all rights reserved”, or images taken from Wikipedia with attribution but without the share-alike clause. Also, the lack of a share-alike license forbids importing content from many other places, but as we can see – nobody cares. ;-)
  • It doesn’t care about linking. Google search was built on web links. Wikipedia was built on top of lots of broken links (oh, and working ones too). And nobody is going to type a Knol URL.
  • It doesn’t seem to have community tools. It just doesn’t.
  • WYSIWYG editing leads to articles without structure, just some parts of the text bolder than others.

So for now, it seems to be a pure-engineering approach to the problem, without looking at the actual work already done, the social implications, or properly respecting copyrights.

One needs a community for that. A community helps not only with content, but with style, metadata and organization, and most of all – it ensures that the project maintains its values and spirit.