MySQL support fun, multiplication

There was a question about how to do aggregate multiplication in MySQL. MySQL does not provide such functionality, so we were looking at various workarounds.

We discussed the UDF interface that allows constructing custom aggregates, looked at the @a:=@a*field hack, and at how different initializers end up wrapping the results differently.

Then Scott killed our discussion with this simple query:

select exp(sum(log(c)))

Of course, that’s a nice mathematical approach to solving the multiplication issue with just SUM() at hand (a small sketch below shows it in action), but while we were still in awe, Scott explained it with this wit:

The Great Flood is over, and as the animals are departing two by two, Noah is blessing each, saying “Go forth, be fruitful and multiply.” Two snakes come down the ramp and say to Noah, “We can’t. We’re adders.”

Noah groans and says, “That’s the worst pun I’ve heard in 40 days and nights. Go sit in that pile of sticks until I can deal with you!”

After mucking out the ark, Noah returns to the sticks and lifts them up, to find baby snakes everywhere. “What happened?” he asks. “I thought you were having problems!”

The snakes reply, “Even adders can multiply with logs.”

:-)
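Back to the query itself – a minimal sketch of the trick on a toy table (the table and values are made up). Note that LOG() returns NULL for zeros and negative values, so real data may need guarding:

create table t (c double);
insert into t values (2), (3), (4);

-- 2 * 3 * 4: exp(log(2) + log(3) + log(4)) = 24 (modulo floating point noise)
select exp(sum(log(c))) from t;

-- guarded version: a zero forces the product to 0, negatives flip the sign
select if(sum(c = 0) > 0, 0,
          exp(sum(log(abs(c)))) * if(sum(c < 0) % 2, -1, 1)) as product
from t;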

Velocity: Clouds!

Early this morning I’ll start a betting pool on whether there will be a Velocity presentation that won’t mention ‘clouds’. While most people enjoy the idea of clouds, that’s actually where snow, hail, thunderstorms, and acid rain come from. This industry needs better metaphors.

Update: Though I failed to mention the word ‘cloud’ in my talk (I guess I was entirely alone in the whole conference in that regard), it still made it to Slashdot.

Also, we already replaced ‘Wikimedia Grid’ with ‘Wikimedia Cloud’ on Ganglia.

On blocking

If a process has two blocking operations, each blocking the other (say, disk I/O and networking), the theoretical performance loss is 50% – while one operation runs, the other just sits and waits. The solution is very easy – convert one of the operations (quite often the one that blocks less, but I guess it doesn’t matter that much) into a non-blocking one.

Though MySQL has a network-write buffer, which provides some async network behavior, it still has to context-switch into a thread to write the data.

rsync and other file transfer protocols are even worse in this regard. On a regular Linux machine, rsync even on a gigabit network will keep the kernel’s send queue saturated (it is only 128K by default anyway).

How to make MySQL’s or rsync’s networking snappier? If the Send-Q column in ‘netstat’ output is maxed out – just increase the kernel buffers instead of the process buffers:

# increase TCP max buffer size
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
# increase Linux autotuning TCP buffer limits
# min, default, and max number of bytes to use
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
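(These are sysctl settings – put them into /etc/sysctl.conf and apply them on a running box with ‘sysctl -p’, or set them one at a time with ‘sysctl -w’.)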

This can add an additional 10-20% of file transfer throughput (and Send-Q goes up to 500k – so it seems to be really worth it).

Wikipedia at Velocity conference

Next Monday I’ll be presenting (if jetlag doesn’t kill me) at Velocity 2008 – the web operations and performance conference. It won’t be my first time talking about Wikipedia infrastructure, but this time the audience will already know the technology and scaling methods anyway.

As I see it, in such a context Wikipedia is more interesting as an operations underdog – lean non-profit budgets, brave approaches to infrastructure, conservative feature development, and lots of cheating and cheap tricks (caching! caching! caching!).

Also, I’ll be able to share (making the audience jealous) how great it is to be on a non-profit ops team (and one example perk – we can be cheap about getting conference passes too ;-)

The best part (for the audience, not for me) – I will be forced to be honest. Nearly the whole tech team will be at the event, and if I fail to attribute any developments, or start talking crap – not only can they throw rotten tomatoes, they can also disable my login access and claim they never knew me, without me being able to fight back :) I haven’t publicly presented in front of these guys since 2005 – it will be tough.

On async

I’ve been telling people that we need an async MySQL client API. The fact that you have to block while the database box is thinking (it doesn’t happen too often, but still..) is holding back some nice response-time-oriented coding practices.

Developers of various proxies (not to be named :) admit that the major inspiration was that adding features to proxies was easier for them than getting them into mainline MySQL (where the way minor enhancements are treated leads to pure stagnation – my opinion).

And now I’ve realized there’s an easy way out, a way to make client libraries more async. Simply don’t use the MySQL protocol – talk HTTP, which is what dbslayer provides. Then the application level can use cURL, PEAR libraries or other means to access it – often with way better parallelization and non-blocking capabilities.

Now add more HTTP capabilities to the mix – stateful connections, pipelining – and the benefits of a protocol designed to serve the Web, not just talk to a single server, start showing up.

Or of course, one could add pipelining, parallel query and async capabilities to the MySQL client libraries… Maybe afterwards the MySQL server itself will start supporting query multiplexing in the protocol. But for now it is a job for clients. Oh, and proxies.

5.0 journal: various issues, replication prefetching, our branch

First of all, I have to apologize about some of my previous remarks on 5.0 performance. I passed ‘-g’ in CFLAGS to my build, and that replaced the default ‘-O2’. Compiling MySQL without -O2 or -O3 makes it slower. Apparently, much slower.

A few migration notes – once I loaded the schema with the character set set to binary (because we treat it as such), all VARCHAR fields were converted to VARBINARY, which I expected, but more annoying was CHAR being converted to BINARY – which pads data with bytes. The solution was converting everything into VARBINARY, as it actually doesn’t have much overhead. TRIM('' FROM field) eventually helped too.
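To illustrate the kind of conversion involved (hypothetical table and column names, and the pad-stripping below is just one way to spell it):

-- CHAR(14) came out as BINARY(14) under the binary character set, so shorter
-- values got padded with trailing 0x00 bytes; VARBINARY avoids the padding
alter table t modify field varbinary(14) not null;
-- strip the padding the BINARY column already added
update t set field = trim(trailing 0x00 from field);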

The other problem I hit was a paramy operation issue. One table definition failed, so paramy exited immediately – though it still had a few more queries remaining in the queue – so the most recent data from some table was not inserted. The cheap workaround was adding the -f option, which just ignores errors. Had to reload all the data though…

I had real fun experimenting with auto-inc locking. As it was a major problem for the initial paramy tests, I hacked InnoDB not to acquire the auto-inc table-level lock (that was just commenting out a few lines in ha_innodb.cc). After that change CPU use went to >300% instead of ~100% – so I felt I had nearly done the good thing. Interestingly though, the profile showed that quite a lot of CPU time was spent in synchronization – mutexes and such – so I hit SMP contention at just 4 cores. Still, the import was faster (or at least the perception was), and I already have a few cheap tricks in mind to make it faster (like disabling the memory pool). The easiest way to make it manageable is simply to provide a global variable for the auto-inc behavior, though a more elegant solution would attach to ‘ALTER TABLE … ENABLE KEYS’ or something similar.

Once the data was loaded, catching up on replication was another task worth a few experiments. As the data image was already quite a few days old, I had at least a few hours to try to speed up replication. Apparently Jay Janssen’s prefetcher has disappeared from the internets, so the only one left was Maatkit’s mk-slave-prefetch. It rewrites UPDATEs into simple SELECTs, but executes them on just a single thread, so the prefetcher stayed only a few seconds ahead of the SQL thread – and the speedup was less than 50%. I made a quick hack that parallelized the task, and it managed to double replication speed.
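To illustrate the idea (the table, column and values are made up), the rewrite looks roughly like this:

-- a write coming through replication:
-- update user set visits = visits + 1 where user_id = 12345;
-- what the prefetcher issues ahead of the SQL thread, so the pages the UPDATE
-- needs are already warm in the buffer pool by the time it runs:
select visits from user where user_id = 12345;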

Still, there are a few problems with the concept – it preheats just the one index used for the lookup, and it doesn’t work on secondary indexes. Actually analyzing the query, identifying what changes and where, and sending a SELECT with UNIONs preheating every index affected by the write query could be more efficient. Additionally, it would make the adaptive hash index or insert buffer useless – as all the required buffer pool pages would already be in memory – thus leaving fewer spots of mutex contention.
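A sketch of what that could look like for a hypothetical write touching a couple of secondary indexes (all names and values are made up, and it assumes the old column values are known or read first):

-- the write: update user set email = 'new@example.org', last_seen = now()
--            where user_id = 12345;
-- one prefetch statement that warms the primary key and each secondary index
-- the write will modify, instead of just the lookup index:
select user_id from user where user_id = 12345
union all
select user_id from user force index (idx_email) where email = 'old@example.org'
union all
select user_id from user force index (idx_last_seen) where last_seen = '2008-06-25 12:00:00';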

We also managed to hit a few optimizer bugs, related to the casting changes in 5.0. Back in 4.0 it was safe to pass all constants as strings, but 5.0 started making poor decisions in such cases (like filesorting instead of using an existing ref-lookup index, etc). I will have to review why this happens, whether it makes sense, and if not – file a bug. For now we have some workarounds, and we don’t seem to be bitten too much by the behavior.
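The quick way to spot such cases is comparing plans with the constant quoted versus unquoted (the schema below is hypothetical, and whether the plan actually degrades depends on the column types and server version):

-- rev_page is an integer column covered by an index on (rev_page, rev_timestamp)
explain select * from revision where rev_page = '12345' order by rev_timestamp desc limit 10;
explain select * from revision where rev_page = 12345 order by rev_timestamp desc limit 10;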

Anyway, in the end I directed half of this site’s core database off-peak load to this machine, and it was still keeping up with replication at ~8000 queries per second. The odd thing is that though 5.0 eats ~30% more CPU, it shows up in profiling as the faster-responding box. I guess we’re just doing something wrong.

I’ve published our MySQL branch on Launchpad. Do note, the release process is somewhat ad-hoc (or non-existent), and the engineer doing it is a clueless newbie. :)

I had plans to do some more scalability tests today, but apparently the only server available is just a two-core machine, so there’s not much I can do on it. I guess another option is grabbing some 8-core application server and playing with it. :)

On checksums

InnoDB maintains two checksums per buffer pool block – the old checksum formula and the new one. Both are read, both are written. I guess this was meant to be some kind of transition period, but it has obviously taken too long (or was forgotten). Anyway, disabling the checksum code entirely makes a single-threaded data load 7% faster – though under parallel activity, locking contention frees up some extra CPU resources for checksum calculation.

Leaving just a single version of the checksum would cut this fat in half, without abandoning the feature entirely – probably worth trying.

Update: I benchmarked the InnoDB checksum against Fletcher. The results were interesting (milliseconds for 10000 iterations):

Algorithm:   InnoDB   Fletcher
                826        453
-O2:            316        133
-O3:             42         75

So, though using Fletcher doubles the performance, -O3 optimizes the InnoDB checksumming much better. How many folks run an -O3-compiled mysqld?

Wasting InnoDB memory

I usually get strange looks when I complain about memory handling inside InnoDB. It seems as if terabytes of RAM are so common and cheap that nobody should really care about memory efficiency. Unfortunately for me, I do.

Examples:

  • The infamous Bug#15815 – buffer pool mutex contention. The patch for the bug added lots of small mutexes, and by ‘lots’ I mean really, really lots – two mutexes (and a rwlock structure) for each buffer pool page. That makes two million mutexes for a 16GB buffer pool, um, four million mutexes for a 32GB buffer pool, and I guess more for larger buffer pools. Result – a 16GB buffer pool pays a 625MB locking tax to solve an 8-core locking problem. Solution? Between a giant lock and armies of page mutexes there lies a land of mutex pools, where locks are shared happily by multiple entities. I even made a patch; unfortunately it hits some ibuf assertion after a server restart, though at first everything works great :)
  • The InnoDB data dictionary always grows, never shrinks. It is not considered a bug, as it isn’t a memory leak – all the memory is accounted for by the (hidden) dict_sys->size, and valgrind doesn’t print errors. A 1-column table takes 2k of memory in the InnoDB data dictionary; a table with a few more columns and indexes already takes 10k. 100000 tables, and 1GB of memory is wasted. Who needs 100000 tables? People running application farms do. Actually, there even is code for cleaning up the data dictionary – it just wasn’t finished, and is commented out at the moment. Even worse, the fix for #20877 was a joke – it reduced the in-memory structure size, while still not caring about the structure count. And of course, do note that every InnoDB partition of a table takes space there too…

So generally, if you’re running a bigger InnoDB deployment, you may be hitting various hidden memory taxes – hundreds of megabytes, or gigabytes – that don’t provide much value anyway. Well, memory is cheap, our next database boxes will be 32GB-class instead of those ‘amnesia’ 16GB types, and I can probably stop ranting :)

Performance of status commands

Though some may think status command performance is overrated, our code heavily relies on SHOW STATUS LIKE "Thread%", and I like to spam SHOW INNODB STATUS. Add monitoring systems, which rely on ‘SHOW GLOBAL STATUS’ – and performance starts to matter.

And things are a bit abandoned. See, in performance engineering people always look for low-hanging fruit to pick, and there are quite a few:

  • Peter wrote a year ago about the performance regression in ‘SHOW STATUS’. Of course, now that Mark has filed a bug, the patch is already in there. This bug would byte us very very severely if we ran the live site against unpatched 5.0 – as it slows the command down from 0.2ms to 20ms. There’s a small ha-ha in this issue – guess where these two lines come from. ;-)
  • SHOW INNODB STATUS was taking 0.3s just because there were two rarely-looked-at variables that required traversing quite a few structures. I filed bugs #36941 and #36942 to remind developers of this problem. Disabling these two counters allowed running a few thousand ‘SHOW INNODB STATUS’ statements a second, instead of just three.
  • SHOW GLOBAL STATUS traverses statistics for each thread. That doesn’t matter much on a single-threaded development box, but it gets very very unhappy at higher thread concurrency – walking every variable on every thread is quite a task even for modern hardware. Generally, some SMP scaling has been added by reducing locking on global variables this way – at the cost of making the command unusable.