mmap()

I’ve seen quite some work done on implementing mmap() in various places, including MySQL.
mmap() is also used for malloc()’ing huge blocks of memory.
mmap() data cache is part of VM cache, not file cache (though those are inside kernels tightly coupled, priorities still remain different).

If a small program with low memory footprint maps a file, it will probably make file access faster (as it will be cached more aggressively in memory, and will provide pressure on other cached file data -thats cheating though).

If a large program with lots and lots of allocated memory maps a file, that will pressure the filesystem cache to flush pages, and then… will pressure existing VM pages of the very same large program to be swapped out. Thats certainly bad.

For now MySQL is using mmap() just for compressed MyISAM files. Vadim wrote a patch to do more of mmap()ing.

If there’s less data than RAM, mmap() may provide somewhat more efficient CPU cycles. If there’s more data than RAM, mmap() will kill the system.

Interesting though, few months ago there was a discussion on lkml where Linus wrote:

Because quite frankly, the mixture of doing mmap() and write() system calls is quite fragile – and I’m not saying that just because of this particular bug, but because there are all kinds of nasty cache aliasing issues with virtually indexed caches etc that just fundamentally mean that it’s often a mistake to mix mmap with read/write at the same time.

So, simply, don’t.

Update: Oh well, 5.1: –myisam_use_mmap option… Argh.
Update on update: after few minutes of internal testing all mmap()ed MyISAM tables went fubar.

Notes from land of I/O

A discussion on IRC sparkled some interest on how various I/O things work in Linux. I wrote small microbenchmarking program (where all configuration is in source file, and I/O modes can be changed by editing various places in code ;-), and started playing with performance.

The machine for this testing was RAID10 16disk box with 2.6.24 kernel, and I tried to understand how O_DIRECT works, and how fsync() works and ended up digging into some other stuff.

My notes for now are:

  • O_DIRECT serializes writes to a file on ext2, ext3, jfs, so I got at most 200-250w/s.
  • xfs allows parallel (and out-of-order, if that matters) DIO, so I got 1500-2700w/s (depending on file size – seek time changes.. :) of random I/O without write-behind caching. There are few outstanding bugs that lock this down back to 250w/s (#xfs@freenode: “yeah, we drop back to taking the i_mutex in teh case where we are writing beyond EOF or we have cached pages”, so
    posix_fadvise(fd, 0, filesize, POSIX_FADV_DONTNEED)

    helps).

  • fsync(),sync(),fdatasync() wait if there are any writes, bad part – it can wait forever. Filesystems people say thats a bug – it shouldn’t wait for I/O that happened after sync being called. I tend to believe, as it causes stuff like InnoDB semaphore waits and such.

Of course, having write-behind caching at the controller (or disk, *shudder*) level allows filesystems to be lazy (and benchmarks are no longer that different), but having the upper layers work efficiently is quite important too, to avoid bottlenecks.

It is interesting, that write-behind caching isn’t needed that much anymore for random writes, once filesystem parallelizes I/O, even direct, nonbuffered one.

Anyway, now that I found some of I/O properties and issues, should probably start thinking how they apply to the upper layers like InnoDB.. :)

Crashes, complicated edition

Usually our 4.0.40 (aka ‘four oh forever’) build doesn’t crash, and if it does, it is always hardware problem or kernel/filesystem bug, or whatever else. So, we have a very calm life, until crashes start to happen…

As we used to run RAID0, a disk failure usually means system wipe and reinstall once fixed – so our machines all run relatively new kernels and OS (except some boxes which just refuse to die ;-), and we’re usually way more ahead than all the bunch of conservative RHEL users.

We had one machine which was reporting CPU/northbridge/RAM problems, and every MySQL crash was accompanied by MCEs, so after replacing RAM, CPU and motherboard itself, we just sent the machine back to service, and asked them to do whatever it takes to fix it.

So, this machine, with proud name of ‘db1’ comes and after entering the service starts crashing every day. I reduced InnoDB log file size, to make recovery faster, and would run it under ‘gdb’. Stacktrace on crash pointed to check-summing (aka folding) bunch of functions, so initial assumption was ‘here we get memory errors again’. So, for a while I thought that ‘db1’ needs some more hardware work, and just left it as is, as we were waiting for new database hardware batch to deploy and there was a bit more work around.

We started deploying new database hardware, and it started crashing every few hours instead of every few days. Here again, reduced InnoDB transaction log size and gdb attached allowed to trap the segfault, and it was pointing again to the very same adaptive hash key calculation (folding!).

Unfortunately, it was non-trivial chain of inlined functions (InnoDB is full of these), so I built ‘-g -fno-inline’ build, and was keenly waiting for a crash to happen, so I could investigate what and where gets corrupted. It did not. Then I looked at our zoo just to find out we have lots of different builds. On one hand it was a bit messy, on another hand, it showed few conclusions:

  • Only Opterons crashed (though there’re like three year gap between revisions)
  • Only Ubuntu 8.04 crashed
  • Only GCC-4.2 build crashed

After thinking a bit that:

  • We have Opterons that don’t crash (older gcc builds)
  • Xeons didn’t crash.
  • We have Ubuntu 8.04 that don’t crash (they either are Xeons or run older gcc-4.1 builds)
  • We have GCC-4.2 builds that run nice (all – on Xeons, all on 8.04 Ubuntu).

The next test was taking gcc-4.1 builds and running them on our new machines. No crash for next two days.
One new machine did have gcc-4.2 build and didn’t crash for few days of replicate-only load, but once it got some parallel load, it crashed in next few hours.

I tried to chat about it on Freenode’s #gcc, and I got just:

noshadow>	domas: almost everything that fails when
		optimized (as inlining opens many new
		optimisation possibilities)
noshadow>	i.e: const misuse, relying on undefined
		behaviour, breaking aliasing rules, ...
domas>		interesting though, I hit it just with
		gcc 4.2.3 and opterons only
noshadow>	domas: that makes it more likely that
		it is caused by optimisation unveiling
		programming bugs

In the end I know, that there’s programming bug in ancient code using inlined functions, that causes memory corruption in multithreaded load if compiled with gcc-4.2 and ran on Opteron. As for now it is our fork, pretty much everyone will point at each other and won’t try to fix it :)

And me? I can always do:

env CC=gcc-4.1 CXX=g++-4.1 ./configure ... 

I’m too lazy to learn how to disassemble and check compiled code differences, especially when every test takes few hours. I already destroyed my weekend with this :-) I’m just waiting for people to hit this with stock mysql – would be one of those things we love debugging ;-)

MySQL support fun, multiplication

There was a question how to do an aggregate multiplication in MySQL. MySQL does not provide such functionality, so we were looking at various workarounds.

We discussed UDF interface that allows to construct custom aggregates, also did look at @a:=@a*field hack, and how different initializers have results wrapped differently.

Then Scott killed our discussion with this simple query:

select exp(sum(log(c)))

Of course, thats nice mathematical approach to solve the multiplication issue with just SUM() at hands, but while we were still in awe, Scott explained it with this wit:

The Great Flood is over, and as the animals are departing 3×3, Noah is blessing each, saying “Go forth, be fruitful and multiply.” Two snakes come down the ramp and say to Noah, “We can’t. We’re adders.”

Noah groans and says, “That’s the worst pun I’ve heard in 40 days and nights. Go sit in that pile of sticks until I can deal with you!”

After mucking out the ark, Noah returns to the sticks and lifts them up, to find baby snakes everywhere. “What happened?” he asks. “I thought you were having problems!”

The snakes reply, “Even adders can multiply with logs.”

:-)

Velocity: Clouds!

This early morning I’ll start making a betting pool, if there will be a Velocity presentation that won’t mention ‘clouds’. While most of people enjoy the idea of clouds, thats actually where snow, hail, thunderstorms, and acid rain comes from. This industry needs better metaphors.

Update: Though I failed to mention a word cloud on my talk (I guess I was entirely alone in whole conference in that regard), it still made it to Slashdot.

Also, we already replaced ‘Wikimedia Grid’ with ‘Wikimedia Cloud’ on Ganglia.

On blocking

If a process has two blocking operations, each blocking other (like, I/O and networking), theoretical performance decrease will be 50%. Solution is very easy – convert one operation (quite often the one that blocks less, but I guess it doesn’t matter that much) into a nonblocking one.

Though MySQL has network-write buffer, which provides some async network behavior, it still has to get context switch into a thread to write stuff.

rsync and other file transfer protocols are even worse in this regard. On a regular Linux machine rsync even on gigabit network will keep kernel’s send-queue saturated (it is 128K by default anyway).

How to make MySQL’s or rsync networking snappier? If in ‘netstat’ sendq column is maxed out – just increase kernel buffers, instead of process buffers:

# increase TCP max buffer size
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
# increase Linux autotuning TCP buffer limits
# min, default, and max number of bytes to use
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216

This can add additional 10-20% of file transfer throughput (and sendq goes up to 500k – so it seems to be really worth it).

Wikipedia at Velocity conference

Next Monday I’ll be presenting (if jetlag doesn’t kill me) at Velocity 2008 – webops and performance conference. It won’t be my first time talking about Wikipedia infrastructure, but this time people will know the technology and scaling methods anyway.

As I see it, in such context Wikipedia is more interesting as a case of operations underdog – non-profit lean budgets, brave approaches in infrastructure, conservative feature development, and lots of cheating and cheap tricks (caching! caching! caching!).

Also, I’ll be able to share (making audience jealous) how it is great to be on non-profit ops team (and one of example perks – we can be cheap about getting conference passes too ;-)

The best part (for audience, not for me) – I will be forced to be honest. Nearly whole tech team will be at the event, and if I fail to attribute any developments, or start talking crap – not only they can throw rotten tomatoes, but also disable my login access and claim they never knew me, without me being able to fight back :) I didn’t publicly present in front of these guys since 2005 – will be tough.

On async

I’ve been telling to people that we need async MySQL client API. The fact that you have to block while database box is thinking (doesn’t happen too often, but still..) is holding back some nice response-time oriented coding practices.

Developers of various proxies (not to be named :) admit that the major inspiration was adding features to proxies was easier for them, than getting into main MySQL (where the way minor enhancements are treated leads to pure stagnation — my opinion).

And now I realized there’s an easy way out, how to make client libraries more async. Simply, don’t use mysql protocol – talk HTTP, thats what dbslayer provides. Then application-level can use cURL, PEAR libraries or other ways to access it – often with way better parallelization or non-blocking capabilities.

Now, add more HTTP capabilities to the mix – stateful connections, pipelining, and benefits of protocol designed to serve the Web, not just talk to single server – start showing up.

Or of course, one could implement pipelining, parallel query and async capabilities to MySQL client libraries… Maybe afterwards MySQL server itself will start supporting query multiplexing in the protocol. But for now it is job for clients. Oh, and proxies.

5.0 journal: various issues, replication prefetching, our branch

First of all, I have to apologize about some of my previous remark on 5.0 performance. I passed ‘-g’ CFLAGS to my build, and that replaced default ‘-O2’. Compiling MySQL without -O2 or -O3 makes it slower. Apparently, much slower.

Few migration notes – once I loaded the schema with character set set to binary (because we treat it as such), all VARCHAR fields were converted to VARBINARY, what I expected, but more annoying was CHAR converted to BINARY – which pads data with bytes. Solution was converting everything into VARBINARY – as actually it doesn’t have much overhead. TRIM('' FROM field) eventually helped too.

The other problem I hit was paramy operation issue. One table definition failed, so paramy exited immediately – though it had few more queries remaining in the queue – so most recent data from some table was not inserted. The cheap workaround was adding -f option, which just ignores errors. Had to reload all data though…

I had real fun experimenting with auto-inc locking. As it was major problem for initial paramy tests, I hacked InnoDB not to acquire auto-inc table-level lock (that was just commenting out few lines in ha_innodb.cc). After that change CPU use went to >300% instead of ~100% – so I felt nearly like I’ve done the good thing. Interesting though – profile showed that quite a lot of CPU time was spent in synchronization – mutexes and such – so I hit SMP contention at just 4 cores. Still, the import was faster (or at least the perception), and I already have in mind few cheap tricks to make it faster (like disabling mempool). The easiest way to make it manageable is simply provide a global variable for auto-inc behavior, though elegant solutions would attach to ‘ALTER TABLE … ENABLE KEYS’ or something similar.

Once loaded, catching up on replication was another task worth few experiments. As the data image was already quite a few days old, I had at least few hours to try to speed up replication. Apparently, Jay Janssen’s prefetcher has disappeared from the internets, so the only one left was maatkit’s mk-slave-prefetch. It rewrites UPDATEs into simple SELECTs, but executes them just on single thread, so the prefetcher was just few seconds ahead of SQL thread – and speedup was less than 50%. I made a quick hack that parallelized the task, and it managed to double replication speed.

Still, there’re few problems with the concept – it preheats just one index, used for lookup, and doesn’t work on secondary indexes. Actually analyzing the query, identifying what and where changes, and sending a select with UNIONs, preheating every index affected by write query could be more efficient. Additionally it would make adaptive hash or insert buffers useless – as all buffer pool pages required would be already in memory – thus leading to less spots of mutex contention.

We also managed to hit few optimizer bugs too, related to casting changes in 5.0. Back in 4.0 it was safe to pass all constants as strings, but 5.0 started making poor solutions then (like filesorting, instead of using existing ref lookup index, etc). I will have to review why this happens, does it make sense, and if not – file a bug. For now, we have some workarounds, and don’t seem to be bitten too much by the behavior.

Anyway, in the end I directed half of this site’s core database off-peak load to this machine, and it was still keeping up with replication at ~8000 queries per second. The odd thing yet is that though 5.0 eats ~30% more CPU, it shows up on profiling as faster-responding box. I guess we’re just doing something wrong.

I’ve published our MySQL branch at launchpad. Do note, release process is somewhat ad-hoc (or non-existing), and engineer doing it is clueless newbie. :)

I had plans to do some more scalability tests today, but apparently the server available is just two-core machine, so there’s nothing much I can do on it. I guess another option is grabbing some 8-core application server and play with it. :)

On checksums

InnoDB maintains two checksums per buffer pool block. Old formula of checksum, and new formula of checksum. Both are read, both are written. I guess this had to be some kind of transition period, but it obviously took too long (or was forgotten). Anyway, disabling checksums code entirely makes single-thread data load 7% faster – though in parallel activity locking contention provides with some extra CPU resources for checksum calculation.

Leaving just single version of checksum would cut this fat in half, without abandoning the feature entirely – probably worth trying.

Update: Benchmarked InnoDB checksum against Fletcher. Results were interesting (milliseconds for 10000 iterations):

Algorithm: InnoDB Fletcher
826 453
-O2: 316 133
-O3: 42 75

So, though using Fletcher doubles the performance, -O3 optimizes InnoDB checksumming much better. How many folks do run -O3 compiled mysqld?