after the conference, mydumper, parallelism, etc

Though the slides for my MySQL Conference talks were on the O’Reilly website, I placed them on my talks page too, for both the dtrace and security presentations.

I also gave a lightning talk about mydumper. Since my original announcement, mydumper has changed a bit. It supports writing compressed files, detecting and killing slow queries that could block table flushes, and regular expressions for table names, and trunk is slowly moving towards understanding that storage engines differ :)

I’ve been using mydumper quite a lot in my deployments (and observing 10x faster dumps). Now, the sad part is how to do faster recovery. It is quite easy to parallelize loading the data (apparently, xargs supports running parallel processes):

ls *.sql.gz | xargs -P 16 -I % sh -c 'zcat % | mysql dbname'

Still, that doesn’t scale much – it only doubles the load speed compared to a single-threaded load, even on a quite powerful machine. The problem lives in the log_sys mutex – it is acquired for every InnoDB row operation to grab Logical Sequence Numbers (LSNs), so neither batching nor differentiation strategies really help, and the same problem is hit by LOAD DATA too. In certain cases I saw quite a bit of spinning on other mutexes, and it seems that InnoDB currently doesn’t scale that well with lots of small row operations. Maybe someone some day will pick this up and fix it; that’s why we go to conferences and share our findings :)
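
Just to illustrate the shape of the bottleneck – a toy sketch, not InnoDB code (names and sizes are made up): every row change ends up taking the same global lock to reserve log space and obtain its LSN, so sixteen parallel loader threads still line up single file at this one point:

#include <pthread.h>
#include <stdint.h>

static pthread_mutex_t log_sys_mutex = PTHREAD_MUTEX_INITIALIZER;
static uint64_t current_lsn = 0;

/* called at least once per row operation, from every writer thread */
static uint64_t reserve_log(uint64_t record_len)
{
        uint64_t lsn;

        pthread_mutex_lock(&log_sys_mutex);     /* all writers meet here */
        lsn = current_lsn;
        current_lsn += record_len;              /* hand out the next LSN range */
        pthread_mutex_unlock(&log_sys_mutex);

        return lsn;
}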

stupid innodb tricks

When it comes to extreme tuning, this is how I got +5% performance in one of my tests on an 8-CPU machine (contended at mtr_commit()):

gdb -p $(pidof mysqld) -ex "set srv_spin_wait_delay=50" -batch

P.S. It was probably worth posting this just to point out that there are ways to fill the gap between a recompile and my.cnf settings :)

Eyecandy mutexes!

In my quest to make MySQL usable, I managed to hit contention that wasn’t spotted by performance masters before. Meet the most useless mutex ever (this is an actual contention event, not just a hold):

Count     nsec Lock
 1451   511364 mysqld`ut_list_mutex

      nsec ---- Distribution --- count    Stack
      2048 |@@                 |   132    libc.so.1`mutex_lock_impl
      4096 |@@@                |   186    libc.so.1`mutex_lock
      8192 |                   |     3    ut_malloc_low
     16384 |@@                 |   137    mem_heap_create_block
     32768 |@                  |   105    row_sel_store_mysql_rec
     65536 |@                  |    89    row_search_for_mysql
    131072 |@                  |    99    ha_innobase::general_fetch
    262144 |@@                 |   131    ha_innobase::rnd_next
    524288 |@@@                |   195    rr_sequential
   1048576 |@@@                |   223
   2097152 |@@                 |   137
   4194304 |                   |    14
----------------------------------------------------------------

ut_list_mutex guards a memory structure which tracks all memory blocks allocated by InnoDB (via ut_malloc/ut_free).
It has two uses:

  1. Printing “Total memory allocated” in SHOW INNODB STATUS (though this can still be implemented lock-free)
  2. Deallocating all memory on shutdown (though all modern operating systems do that anyway, so this is just to shut up valgrind)

If you have any BLOB/TEXT data in your tables, you’re definitely hitting this contention spot (it is the #1 contention point in such cases).

Fix? Kill the eyecandy: replace ut_malloc and ut_free with direct calls to malloc() and free() – oh, and of course, use scalable allocators like tcmalloc or Hoard.
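
For illustration only – this is not the actual patch, just a sketch of the direction (signatures are simplified; malloc_usable_size() and the GCC __sync builtins are assumptions about the platform). The “Total memory allocated” number, use #1 above, is kept with a lock-free counter; the shutdown cleanup, use #2, is simply dropped:

#include <malloc.h>     /* malloc_usable_size() on glibc */
#include <stdlib.h>

static unsigned long ut_total_allocated = 0;    /* SHOW INNODB STATUS figure */

void *ut_malloc(unsigned long n)
{
        void *ptr = malloc(n);

        if (ptr == NULL) {
                abort();        /* stand-in for InnoDB's allocation assertion */
        }
        /* lock-free accounting instead of the ut_list_mutex-guarded block list */
        __sync_add_and_fetch(&ut_total_allocated, malloc_usable_size(ptr));
        return ptr;
}

void ut_free(void *ptr)
{
        if (ptr == NULL) {
                return;
        }
        __sync_sub_and_fetch(&ut_total_allocated, malloc_usable_size(ptr));
        free(ptr);      /* no global list walk; tcmalloc/Hoard handles the scaling */
}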

Charsets mutex meets InnoDB

When InnoDB compares data (for example, when looking up in indexes), it actually asks MySQL for help – character sets may provide different rules for evaluation. MySQL then looks up the character set information. This is where the fun begins – if the character set used for the comparison is the default server character set, or latin1, there are shortcuts. If not (say, some smart developer decided to use Unicode without forcing the DBA to set the default server charset, as it isn’t needed) – a very nice internal routine is called, which at the very beginning does exactly this:

 /*
    To make things thread safe we are not
    allowing other threads to interfere
    while we may changing the cs_info_table
  */
  pthread_mutex_lock(&THR_LOCK_charset);

Apparently, ha_innodb.cc has such a comment too:

/* Use the charset number to pick the right charset struct for
the comparison. Since the MySQL function get_charset may be
slow before Bar removes the mutex operation there, we first
look at 2 common charsets directly. */

if (charset_number == default_charset_info->number) {
    charset = default_charset_info;
} else if (charset_number == my_charset_latin1.number) {
    charset = &my_charset_latin1;
} else {
    charset = get_charset(charset_number, MYF(MY_WME));
    [...]

I’ll avoid going into discussions of why such a global lock at every row operation is harmful, but in case anyone is hitting lots of mutex contention there – just set the default server character set to whatever your databases are in (or use binary or latin1, or add a clause for utf8 up here, or remove the mutex – nobody is changing your character sets anyway ;-)
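
For completeness, the “add a clause for utf8 up here” workaround would look roughly like this – a hedged sketch, not the upstream fix; it assumes the columns really use utf8_general_ci, and relies on my_charset_utf8_general_ci being the built-in MySQL charset struct:

if (charset_number == default_charset_info->number) {
    charset = default_charset_info;
} else if (charset_number == my_charset_latin1.number) {
    charset = &my_charset_latin1;
} else if (charset_number == my_charset_utf8_general_ci.number) {
    /* extra shortcut: skip get_charset() and THR_LOCK_charset entirely */
    charset = &my_charset_utf8_general_ci;
} else {
    charset = get_charset(charset_number, MYF(MY_WME));
    [...]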

Update: Some of the discussion is at bug#42649

ZFS and MySQL … not yet

Today I attended a kick-ass ZFS talk (3 hours of incredibly detailed material presented by someone who knows the stuff and knows how to talk) at CEC (a Sun internal training event/conference), so now I know way more about ZFS than I used to. Probably I know way more about ZFS than the average Joe DBA \o/

And now I think ZFS has lots of brilliant design and implementation bits, except that it doesn’t match the access-pattern needs of databases.

See, ZFS is not a regular POSIX-API-to-HDD bridge, unlike pretty much everything else out there. It is a transactional object store that allows multiple access semantics and APIs, and the standard ZFS POSIX Layer (ZPL) is just one of them. In MySQL terms, think of all other filesystems as MyISAM, and of ZFS as InnoDB :-)

So, putting InnoDB on top of ZFS, after some high-school-style variable substitution, ends up as “putting InnoDB on top of InnoDB”. Let’s go a bit deeper here:

  • ZFS has checksums, so does InnoDB (though ZFS checksums are faster, Fletcher-derived, etc ;-)
  • ZFS has atomicity, so does InnoDB
  • ZFS has ZIL (Intent Log), so does InnoDB (Transaction Log)
  • ZFS has background intelligent flushing of data, so does InnoDB (maybe not that intelligent though)
  • ZFS has an Adaptive Replacement Cache, so does InnoDB (it calls it the Buffer Pool; instead of three replacement queues it uses just one – an LRU – and doesn’t account for MFU)
  • ZFS has copy-on-write snapshotting, so does InnoDB (MVCC!)
  • ZFS has compression, so does InnoDB (in plugin, though)
  • ZFS has intelligent mirroring/striping/etc. – this is why InnoDB people use RAID controllers.
  • ZFS has bit-rot recovery and self healing and such, InnoDB has assertions and crashes :-)

So, we have two intelligent layers on top of each other, and there’s lots of work duplicated. Of course, we can try to eliminate some bits:

  • Disable checksums at the InnoDB level
  • Unfortunately, there’s nothing to be done about the two transaction logs
  • Dirty pages can be flushed immediately by InnoDB; flushing is probably tunable at the ZFS level too
  • The InnoDB buffer pool could probably be reduced to favor the ARC, or the opposite…
  • Double copy-on-write is inevitable (and a copy-on-write transaction log does not really make sense…)
  • Compression can be done at either level
  • Using ZFS for volume management would be the major real win, as would all the self-healing capability

So, I’m not too convinced at this moment about using this combo, but there’s another idea that has been circulating for quite a while – what if MyISAM suddenly started using all the ZFS capabilities? Currently the ZPL and the actual ZFS object store management are mutually exclusive – you have to pick one way – but if the ZPL were extended to support a few simple operations (creating/dropping snapshots on a single file, wrapping multiple write() calls into a transaction), MyISAM could get a different life:

  • Non-blocking SELECTs could be implemented using snapshots
  • Writes would be atomic and non-corrupting
  • MyISAM would get checksummed, compressed, consistent data that is flushed by intelligent background threads, and would have immediate crash recovery
  • For replication slaves, write concurrency would not be that necessary (a single thread is updating the data anyway)
  • “Zettabyte” (was told not to use this ;-) File System would actually allow Zettabyte-MyISAM-Tables o/
  • All the Linux people (including me :) would complain about Sun doing something just for [Open]Solaris, instead of working on [insert favorite storage engine here]. 

Unfortunately, to implement that now one would either have to tap directly into the object-management API (which would mean quite a bit of rewriting), or wait for the ZFS people to extend the ZPL calls. And for now, I’d say, “not yet”.

Disclaimer: the opinion of the author does not represent the opinion of his employer (especially the Marketing people), and may be affected by the fact that the author was enjoying free wireless, and who knows what else, at Las Vegas McCarran International Airport.

On SSDs, rotations and I/O

Every time anyone mentions SSDs, I get a feeling of futility and of becoming useless in the near future. I have spent way too much time working around the limitations of rotational media, and understanding the implications for the whole vertical data stack on top.

The most interesting upcoming shift is not just the speed difference, but simply the different cost balance between reads and writes. With modern RAID controllers, modern disks and modern filesystems, reads are a way more expensive operation, from the application’s perspective, than writes.

Writes can be merged at the application and OS level, buffered at the I/O controller level, and even sped up by the on-disk volatile cache (NCQ write reordering can give +30% faster random write performance).

Reads have none of that. Of course, there are caches, but they don’t speed up actual read operations, they just help to avoid them. This leads to a very disproportionate amount of cache being needed for reads, compared to writes.

Simply put, a 32GB system with MySQL/InnoDB will be wasting 4GB on mutexes (argh!!..), a few more gigs on the data dictionary (arghhh #2), and everything else on read caching inside the buffer pool. There may be some dirty pages and adaptive hash or insert buffer entries, but they are all there not because systems lack write capacity, but simply because of the braindead InnoDB page flushing policy.

Also, database write performance is mostly limited not by the actual underlying write speed, but simply by the fact that every write has to read from multiple scattered places to find what needs to be changed.

This is where SSDs matter – they will have the same satisfactory write performance (and fixes for InnoDB are out there ;-), but the read performance will be satisfactory (uhm, much, much better) too.

What this would mean for MySQL use:

  • Buffer pool, key cache, read-ahead buffers – all gone (or drastically reduced).
  • Data locality wouldn’t matter that much anymore either; covering indexes would provide just double the performance, rather than up to a 100x speed increase.
  • Re-reading data may be cheaper than including it in various temporary sorting and grouping structures
  • RAIDs no longer needed (?), though RAM-backed write-behind caching would still be necessary
  • Log-based storage designs like PBXT will make much more sense
  • Complex data flushing logic like InnoDB’s will not be useful anymore (one can say it is useless already ;-) – and straightforward methods such as Maria’s are welcome again.

Probably the happiest camp out there are the PostgreSQL people – data locality issues were plaguing their performance the most, and that is a strong side of InnoDB. On the other hand, MySQL has pluggable engine support, so it may be way easier to produce SSD versions of anything we have now, or just use new engines (hello, Maria!).

Still, there is quite some work to adapt to the new storage model, and judging by how InnoDB works with modern rotational media, we will need a very strong push to adapt it for the new stuff.

You can sense the futility of any work done to optimize for rotation – all the “make reads fast” techniques will end up resolved at the hardware layer, and the human isn’t needed anymore (nor are all these servers with lots of memory and lots of spindles).

Progress in percents: 0 1 2 3 …

Well, servers usually don’t crash (our English Wikipedia master has been running for 800 days, on white-box hardware, RAID0, 4.0 ;-), but when they do (like with some kernel bugs on our big big boxes), one of the most painful experiences is InnoDB log recovery.

Usually people will reduce innodb-log-file-size to ease the recovery (it helps, in a way :), but the real problem is somewhere else.

See, when InnoDB does crash recovery, it applies the log changes in memory and builds a flush list. It doesn’t flush any pages during the recovery process, so the flush list grows big – thousands, tens of thousands, maybe millions kind of big; anyway, big-number big.

Oh, did I mention? The flush list is actually a linked list, not some kind of hippy tree stuff. Every time a log record is read from the log and something gets updated, the flush list will be traversed – thousands, tens of thousands, maybe millions of entries.

The expensive code looks something like this:

while (b && (ut_dulint_cmp(b->oldest_modification,
             block->oldest_modification) > 0)) {
       prev_b = b;
       b = UT_LIST_GET_NEXT(flush_list, b);
}

Then your profile starts looking like this, and you wish your systems didn’t crash:

%        symbol name
87.6978  buf_flush_insert_sorted_into_flush_list
 5.8571  -kernel
 1.9793  recv_apply_hashed_log_recs
 0.8059  buf_calc_page_new_checksum

So, the recovery process cost is quadratic (every one of those inserts rescans the ever-growing list), and people work around it by reducing the log file size – by reducing the performance of their system – while the actual fix is right there, in optimizing the data structure. The current model is outdated for anything built in the last 5 years anyway.
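
To put a rough number on it, here is a toy model – plain C, nothing to do with the actual InnoDB structures – that does N sorted inserts into a linked list, the same shape as the loop quoted above, and counts node visits. Doubling N roughly quadruples the work:

#include <stdio.h>
#include <stdlib.h>

struct node {
        unsigned long lsn;              /* stands in for oldest_modification */
        struct node *next;
};

int main(void)
{
        const int n = 50000;            /* think "pages dirtied during recovery" */
        struct node *head = NULL;
        unsigned long long visits = 0;
        int i;

        for (i = 0; i < n; i++) {
                struct node *blk = malloc(sizeof(*blk));
                struct node *prev = NULL;
                struct node *b = head;

                blk->lsn = (unsigned long) rand();

                /* same shape as the loop quoted above: linear scan for the
                   spot that keeps the list sorted in descending order */
                while (b && b->lsn > blk->lsn) {
                        prev = b;
                        b = b->next;
                        visits++;
                }
                blk->next = b;
                if (prev) {
                        prev->next = blk;
                } else {
                        head = blk;
                }
        }
        printf("%d inserts, %llu node visits (roughly n*n/4)\n", n, visits);
        return 0;
}

With a million pages in the flush list that becomes hundreds of billions of pointer hops, which is where the 87% of recovery time spent in buf_flush_insert_sorted_into_flush_list above comes from.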

Oh, and of course, I’d like systems not to crash at all, like that database master on whitebox RAID0 running for 800 days.

Update: this is old stuff. Peter wrote about it, Heikki opened a bug, then decided it would need more than five minutes to fix and classified it as a feature request, so Peter could write more about it. That makes it even more sad. We’d probably change the synopsis of the feature request to “make crash recovery work”.

Update 2: get the patch at Percona (Yasufumi is god :)

Notes from the land of I/O

A discussion on IRC sparked some interest in how various I/O things work in Linux. I wrote a small microbenchmarking program (where all configuration is in the source file, and I/O modes can be changed by editing various places in the code ;-), and started playing with performance.

The machine for this testing was a 16-disk RAID10 box with a 2.6.24 kernel, and I tried to understand how O_DIRECT works and how fsync() works, and ended up digging into some other stuff.
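
For reference, a minimal sketch of that kind of microbenchmark (not the original program – it assumes a pre-created test file of about 4GiB passed as argv[1], 16KiB aligned O_DIRECT writes at random offsets, and trims error handling to the essentials):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

int main(int argc, char **argv)
{
        const size_t block = 16384;             /* write unit, O_DIRECT aligned */
        const long long nblocks = 262144;       /* assumes a ~4GiB test file */
        const int writes = 10000;
        void *buf;

        if (argc < 2) {
                fprintf(stderr, "usage: %s <testfile>\n", argv[0]);
                return 1;
        }

        int fd = open(argv[1], O_WRONLY | O_DIRECT);
        if (fd < 0) {
                perror("open");
                return 1;
        }

        posix_memalign(&buf, 4096, block);      /* O_DIRECT needs aligned memory */
        memset(buf, 'x', block);

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < writes; i++) {
                off_t off = (off_t)(rand() % nblocks) * block;  /* random aligned offset */
                if (pwrite(fd, buf, block, off) != (ssize_t)block) {
                        perror("pwrite");
                        return 1;
                }
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        printf("%.0f w/s\n", writes / secs);
        close(fd);
        free(buf);
        return 0;
}

Compile with something like gcc -std=gnu99 -O2 (older glibc may also need -lrt for clock_gettime), and switch the open() flags or sprinkle fsync() calls to test the other modes.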

My notes for now are:

  • O_DIRECT serializes writes to a file on ext2, ext3 and jfs, so I got at most 200-250 w/s.
  • xfs allows parallel (and out-of-order, if that matters) DIO, so I got 1500-2700 w/s (depending on file size – seek time changes.. :) of random I/O without write-behind caching. There are a few outstanding bugs that lock this back down to 250 w/s (#xfs@freenode: “yeah, we drop back to taking the i_mutex in teh case where we are writing beyond EOF or we have cached pages”, so
    posix_fadvise(fd, 0, filesize, POSIX_FADV_DONTNEED)

    helps).

  • fsync(), sync() and fdatasync() wait if there are any writes in progress – the bad part is that they can wait forever. Filesystem people say that’s a bug – a sync shouldn’t wait for I/O issued after it was called. I tend to believe them, as it causes stuff like InnoDB semaphore waits and such.

Of course, having write-behind caching at the controller (or disk, *shudder*) level allows filesystems to be lazy (and benchmarks are no longer that different), but having the upper layers work efficiently is quite important too, to avoid bottlenecks.

It is interesting that write-behind caching isn’t needed that much anymore for random writes, once the filesystem parallelizes I/O – even direct, unbuffered I/O.

Anyway, now that I have found some of these I/O properties and issues, I should probably start thinking about how they apply to the upper layers like InnoDB.. :)

Crashes, complicated edition

Usually our 4.0.40 (aka ‘four oh forever’) build doesn’t crash, and if it does, it is always a hardware problem, a kernel/filesystem bug, or whatever else. So we have a very calm life, until crashes start to happen…

As we used to run RAID0, a disk failure usually means a system wipe and reinstall once fixed – so our machines all run relatively new kernels and OSes (except some boxes which just refuse to die ;-), and we’re usually way ahead of the bunch of conservative RHEL users.

We had one machine which was reporting CPU/northbridge/RAM problems, and every MySQL crash was accompanied by MCEs, so after replacing the RAM, the CPU and the motherboard itself, we just sent the machine back to service and asked them to do whatever it takes to fix it.

So, this machine, with the proud name of ‘db1’, comes back, and after entering service starts crashing every day. I reduced the InnoDB log file size to make recovery faster, and would run it under ‘gdb’. The stack trace on crash pointed to a bunch of checksumming (aka folding) functions, so the initial assumption was ‘here we get memory errors again’. So, for a while I thought that ‘db1’ needed some more hardware work, and just left it as is, as we were waiting for a new database hardware batch to deploy and there was a bit more work around.

We started deploying the new database hardware, and it started crashing every few hours instead of every few days. Here again, a reduced InnoDB transaction log size and an attached gdb allowed me to trap the segfault, and it pointed again to the very same adaptive hash key calculation (folding!).

Unfortunately, it was a non-trivial chain of inlined functions (InnoDB is full of these), so I made a ‘-g -fno-inline’ build and was keenly waiting for a crash to happen, so I could investigate what gets corrupted and where. It did not crash. Then I looked at our zoo, just to find out we have lots of different builds. On one hand it was a bit messy, on the other hand it allowed a few conclusions:

  • Only Opterons crashed (though there’s something like a three-year gap between revisions)
  • Only Ubuntu 8.04 crashed
  • Only GCC-4.2 builds crashed

After considering that:

  • We have Opterons that don’t crash (older gcc builds)
  • Xeons didn’t crash.
  • We have Ubuntu 8.04 boxes that don’t crash (they are either Xeons or run older gcc-4.1 builds)
  • We have GCC-4.2 builds that run fine (all on Xeons, all on Ubuntu 8.04).

The next test was taking gcc-4.1 builds and running them on our new machines. No crashes for the next two days.
One new machine did have a gcc-4.2 build and didn’t crash for a few days of replicate-only load, but once it got some parallel load, it crashed within the next few hours.

I tried to chat about it on Freenode’s #gcc, and I got just:

noshadow>	domas: almost everything that fails when
		optimized (as inlining opens many new
		optimisation possibilities)
noshadow>	i.e: const misuse, relying on undefined
		behaviour, breaking aliasing rules, ...
domas>		interesting though, I hit it just with
		gcc 4.2.3 and opterons only
noshadow>	domas: that makes it more likely that
		it is caused by optimisation unveiling
		programming bugs

In the end I know that there’s a programming bug in ancient code using inlined functions, one that causes memory corruption under multithreaded load if compiled with gcc-4.2 and run on Opterons. As for now it is in our fork, pretty much everyone will point at each other and won’t try to fix it :)

And me? I can always do:

env CC=gcc-4.1 CXX=g++-4.1 ./configure ... 

I’m too lazy to learn how to disassemble and check compiled code differences, especially when every test takes a few hours. I already destroyed my weekend with this :-) I’m just waiting for people to hit this with stock MySQL – it would be one of those things we love debugging ;-)

5.0 journal: various issues, replication prefetching, our branch

First of all, I have to apologize for some of my previous remarks on 5.0 performance. I passed ‘-g’ in CFLAGS to my build, and that replaced the default ‘-O2’. Compiling MySQL without -O2 or -O3 makes it slower. Apparently, much slower.

A few migration notes – once I loaded the schema with the character set set to binary (because we treat it as such), all VARCHAR fields were converted to VARBINARY, which I expected, but more annoying was CHAR being converted to BINARY – which pads data with \0 bytes. The solution was converting everything into VARBINARY – it doesn’t have much overhead anyway. TRIM('\0' FROM field) eventually helped too.

The other problem I hit was a paramy operation issue. One table definition failed, so paramy exited immediately – though it had a few more queries remaining in the queue – so the most recent data from some table was not inserted. The cheap workaround was adding the -f option, which just ignores errors. I had to reload all the data, though…

I had real fun experimenting with auto-inc locking. As it was a major problem for the initial paramy tests, I hacked InnoDB not to acquire the auto-inc table-level lock (that was just commenting out a few lines in ha_innodb.cc). After that change CPU use went to >300% instead of ~100% – so I felt like I’d nearly done a good thing. Interestingly though, the profile showed that quite a lot of CPU time was spent in synchronization – mutexes and such – so I hit SMP contention at just 4 cores. Still, the import was faster (or at least felt faster), and I already have in mind a few cheap tricks to make it faster (like disabling the mempool). The easiest way to make it manageable is to simply provide a global variable for the auto-inc behavior, though more elegant solutions would attach to ‘ALTER TABLE … ENABLE KEYS’ or something similar.

Once the data was loaded, catching up on replication was another task worth a few experiments. As the data image was already quite a few days old, I had at least a few hours to try to speed up replication. Apparently, Jay Janssen’s prefetcher has disappeared from the internets, so the only one left was Maatkit’s mk-slave-prefetch. It rewrites UPDATEs into simple SELECTs, but executes them on just a single thread, so the prefetcher was just a few seconds ahead of the SQL thread – and the speedup was less than 50%. I made a quick hack that parallelized the task, and it managed to double the replication speed.

Still, there are a few problems with the concept – it preheats just one index, the one used for the lookup, and doesn’t work on secondary indexes. Actually analyzing the query, identifying what changes and where, and sending a SELECT with UNIONs that preheats every index affected by the write query could be more efficient. Additionally, it would make the adaptive hash index and insert buffer useless – as all the buffer pool pages required would already be in memory – thus leading to fewer spots of mutex contention.

We also managed to hit a few optimizer bugs, related to casting changes in 5.0. Back in 4.0 it was safe to pass all constants as strings, but 5.0 started making poor decisions in such cases (like filesorting instead of using an existing ref lookup index, etc). I will have to review why this happens, whether it makes sense, and if not – file a bug. For now we have some workarounds, and don’t seem to be bitten too much by the behavior.

Anyway, in the end I directed half of this site’s core database off-peak load to this machine, and it was still keeping up with replication at ~8000 queries per second. The odd thing is that though 5.0 eats ~30% more CPU, it shows up in profiling as a faster-responding box. I guess we’re just doing something wrong.

I’ve published our MySQL branch at Launchpad. Do note, the release process is somewhat ad hoc (or non-existent), and the engineer doing it is a clueless newbie. :)

I had plans to do some more scalability tests today, but apparently the only server available is just a two-core machine, so there’s not much I can do on it. I guess another option is grabbing some 8-core application server and playing with it. :)
