again, on benchmarks

Dear interweb, if you have no idea what you’re writing about, keep it to yourself, don’t litter into the tubes. Some people may not notice they’re eating absolute crap and get diarrhea.

This particular benchmark has two favorite parts, that go with each other together really well:

I didnt change absolutely any parameters for the servers, eg didn’t change the innodb_buffer_pool_size or key_buffer_size.

And..

If you need speed just to fetch a data for a given combination or key, Redis is a solution that you need to look at. MySQL can no way compare to Redis and Memcache. …

Seriously, how does one repay for all the damage of such idiotic benchmarks?

P.S. I’ve ranted at benchmarks before, and will continue doing so.

On file system benchmarks

I see this benchmark being quoted in multiple places, and there I see stuff like:

When carrying out more database benchmarking, but this time with PostgreSQL, XFS and Btrfs were too slow to even complete this test, even when it had been running for more than an hour for a single run. Between EXT3, EXT4, and NILFS2, the fastest file-system was EXT3 and then its successor, EXT4, was slightly behind that. Far behind the position of EXT4 were NILFS2 and then Btrfs and XFS.

There were few other benchmarks, e.g. SQLite showed ‘bad performance’ on XFS and Btrfs.

*clear throat*

Dear benchmarkers, don’t compare apples and oranges. If you see differences between benchmarks, do some very very tiny research, and use some intellect, that you, as primates, do have. If database tests are slowest on filesystems created by Oracle (who know some stuff about systems in general) or SGI (who, despite giving away their campus to Google, still have lots of expertise in the field), that can indicate, that your tests are probably flawed somewhere, at least for that test domain.

Now, probably you’ve heard about such thing as ‘data consistency’. That is something what database stack tries to ensure, sometimes at higher costs, like not trusting volatile caches, enforcing certain write orders, depending on acknowledgements by underlying hardware.

So, in this case it wasn’t “benchmarking file systems”, it was simply, benchmarking “consistency” against “no consistency”. But don’t worry, most benchmarks have such flaws – getting numbers but not understanding them makes results much more interesting, right?

Oh, and… thanks for few more misguided people.

Linux 2.6.29

2.6.29 was released. I don’t usually write about linux kernel releases, thats what Slashdot is for :), but this one introduces write barriers in LVM, as well as ext4 with write barriers enabled by default. If you run this kernel and forget to turn off barrier support at filesystems (like XFS, nobarrier), you will see nasty performance slowdowns (recent post about it). Beware.

Eyecandy mutexes!

In my quest of making MySQL usable, I managed to hit contention that wasn’t spotted by performance masters before. Meet most useless mutex ever (this is actual contention event, not just a hold):

Count     nsec Lock
 1451   511364 mysqld`ut_list_mutex

      nsec ---- Distribution --- count    Stack
      2048 |@@                 |   132    libc.so.1`mutex_lock_impl
      4096 |@@@                |   186    libc.so.1`mutex_lock
      8192 |                   |     3    ut_malloc_low
     16384 |@@                 |   137    mem_heap_create_block
     32768 |@                  |   105    row_sel_store_mysql_rec
     65536 |@                  |    89    row_search_for_mysql
    131072 |@                  |    99    ha_innobase::general_fetch
    262144 |@@                 |   131    ha_innobase::rnd_next
    524288 |@@@                |   195    rr_sequential
   1048576 |@@@                |   223
   2097152 |@@                 |   137
   4194304 |                   |    14
----------------------------------------------------------------

ut_list_mutex guards a memory structure which has all memory blocks allocated by InnoDB (via ut_malloc/ut_free) in it.
It has two uses:

  1. Printing “Total memory allocated” in SHOW INNODB STATUS (though this can still be implemented lock-free)
  2. Deallocating all memory on shutdown (though, all modern operating systems do that anyway, so this is purely just to shut up valgrind)

If you have any BLOB/TEXT data in your tables, you’re definitely hitting this contention spot (it is #1 contention in such cases).

Fix? Kill the eyecandy, replace ut_malloc and ut_free with direct calls to malloc() and free(), oh and of course, use scalable allocators like tcmalloc or Hoard.

plockstat fail!

Solaris has this beautiful tool ‘plockstat’ that can report application lock contention and hold events (hold events turn into contention events at parallelism.. ;-) It is just a frontend to a set of dtrace rules, that monitor mutexes and rwlocks.

Today I was testing an edge case (what happens, when multiple threads are scanning lots of same data) – and plockstat/dtrace indicated that there were zero (0!!!) lock waits. I tried using ‘prstat’ with microstate accounting, and it indeed pointed out that there’s lots of LCK% activity going on (half of CPU usage…). The dtrace profiling oneliner (dtrace -n 'profile-997 {@a[ustack()]=count()}') immediately revealed the culprit:

              libc.so.1`clear_lockbyte+0x10
              libc.so.1`mutex_unlock+0x16a
              mysqld`mutex_exit+0x1d
              mysqld`buf_page_optimistic_get_func+0xa0

So, plenty of CPU time was spent when trying to unlock mutex (what seemed strange), but didn’t seem that strange once I noticed the code:

do {
	old = *lockword64;
	new = old & ~LOCKMASK64;
} while (atomic_cas_64(lockword64, old, new) != old);

So, there’s unaccounted busy loop (it is just part of hold event in dtrace). What is odd, is that nobody expects this place to loop too much, what happens here – it gives away mutex to other thread, which wants it. So, instead of having the new owner spin-lock (where it accounts properly), it has old owner spin-locking. I’m not convinced this kind of behavior is one that should scale on large machines, but I’m not much of a locking expert.

Without proper instrumentation plockstat failed to provide information about locks that were consuming half of CPU time. I hope that really was just an edge case – more real testing will follow soon, will see if plockstat will fail as much. Oh well, will find the information I need anyway :) Lesson learned – treat pretty much everything with grain of salt, especially when OS tells you mysql has no lock contention, haha.

ZFS and MySQL … not yet

Today I attended kick-ass ZFS talk (3 hours of incredibly detailed material presented by someone who knows the stuff and knows how to talk) at CEC (Sun internal training event/conference), so now I know way more about ZFS than I used to. Probably I know way more about ZFS than Average Joe DBA \o/ 

And now I think ZFS has lots of brilliant design and implementation bits, except it doesn’t match database access pattern needs. 

See, ZFS is not a regular POSIX-API -> HDD bridge, unlike pretty much everything out there. It is transactional object store which allows multiple access semantics, APIs, and standard ZFS POSIX Layer (ZPL) is just one of them. In MySQL talk, think of all other filesystems as of MyISAM, and ZFS is InnoDB :-) 

So, putting InnoDB on top of ZFS after some high-school-like variable replacement ends up “putting InnoDB on top of InnoDB”. Let’s go a bit deeper here:

  • ZFS has checksums, so does InnoDB (though ZFS checksums are faster, Fletcher-derived, etc ;-)
  • ZFS has atomicity, so does InnoDB
  • ZFS has ZIL (Intent Log), so does InnoDB (Transaction Log)
  • ZFS has background intelligent flushing of data, so does InnoDB (maybe not that intelligent though)
  • ZFS has Adaptive Replacement Cache, so does InnoDB (calls it Buffer Pool, instead of three replacement queues uses just one – LRU, doesn’t account for MFU)
  • ZFS has copy-on-write snapshotting, so does InnoDB (MVCC!)
  • ZFS has compression, so does InnoDB (in plugin, though)
  • ZFS has intelligent mirroring/striping/etc, this is why InnoDB people use RAID controllers.
  • ZFS has bit-rot recovery and self healing and such, InnoDB has assertions and crashes :-)

So, we have two intelligent layers on top of each other, and there’s lots of work duplicated. Of course, we can try to eliminate some bits:

  • Disable checksums at InnoDB level
  • Unfortunately, there’s nothing to be done about two transaction logs
  • Dirty pages can be flushed immediately by InnoDB, probably is tunable at ZFS level too
  • InnoDB buffer pool may be probably reduced, to favor ARC, or opposite…
  • Double Copy-on-write is inevitable (and copy-on-write transaction log does not really make sense…)
  • Compression can be done at either level
  • ZFS use for volume management would be the major real win, as well as all the self healing capacity

So, I’m not too convinced at this moment about using this combo, but there’s another idea circulating around for quite a while – what if MyISAM suddenly started using all the ZFS capabilities. Currently the ZPL and actual ZFS object store management are mutually exclusive – you have to pick one way, but if ZPL would be extended to support few simple operations (create/drop snapshots just on single file, wrap multiple write() calls into a transaction), MyISAM could get a different life:

  • Non-blocking SELECTs could be implemented using snapshots
  • Writes would be atomic and non-corrupting
  • MyISAM would get checksummed, compressed, consistent data, that is flushed by intelligent background threads, and would have immediate crash recovery
  • For replication slaves write concurrency would not be that necessary (single thread is updating data anyway)
  • “Zettabyte” (was told not to use this ;-) File System would actually allow Zettabyte-MyISAM-Tables o/
  • All the Linux people (including me :) would complain about Sun doing something just for [Open]Solaris, instead of working on [insert favorite storage engine here]. 

Unfortunately, to implement that now one would have either to tap directly into object management API (that would mean quite a bit of rewriting), or wait for ZFS people to extend the ZPL calls. And for now, I’d say, “not yet”.

Disclaimer: the opinion of the author does not represent opinion of his employer (especially Marketing people), and may be affected by the fact, that the author was enjoying free wireless and whoever knows what else in Las Vegas McCarran International Airport. 

On SSDs, rotations and I/O

Every time anyone mentions SSDs, I have a feeling of futility and being useless in near future. I have spent way too much time to work around limitations of rotational media, and understand the implications of whole vertical data stack on top.

The most interesting upcoming shift is not only the speed difference, but simply different cost balance between reads and writes. With modern RAID controllers and modern disks and modern filesystems reads are way more expensive operation from application perspective than writes.

Writes can be merged at application and OS level, buffered at I/O controller level, and even sped up by on-disk volatile cache (NCQ write reordering can give +30% faster random write performance).

Reads have none of that. Of course, there’re caches, but they don’t speed up actual read operations, they just help to avoid them. This leads to very disproportionate amount of caches needed for reads, compared to writes.

Simply, 32GB system with MySQL/InnoDB will be wasting 4GB on mutexes (argh!!..), few more gigs on data dictionary (arghhh #2), and everything else for read caching inside buffer pool. There may be some dirty pages and adaptive hash or insert buffer entries, but they are all there not because systems lack write output capacity, but simply because of braindead InnoDB page flushing policy.

Also, database write performance is mostly impacted not because of actual underlying write speed, but simply because every write has to read from multiple scattered places to actually find what needs to be changed.

This is where SSDs matter – they will have same satisfactory write performance (and fixes for InnoDB are out there ;-) – but the read performance will be satisfactory (uhm, much much better) too.

What this would mean for MySQL use:

  • Buffer pool, key cache, read-ahead buffers – all gone (or drastically reduced).
  • Data locality wouldn’t matter that much anymore either, covering indexes would provide just double performance, rather than up to 100x speed increase.
  • Re-reading data may be cheaper, than including it in various temporary sorting and grouping structures
  • RAIDs no longer needed (?), though RAM-backed write-behind caching would still be necessary
  • Log-based storage designs like PBXT will make much more sense
  • Complex data flushing logic like inside InnoDB’s will not be useful anymore (one can say, it is useless already ;-) – and straightforward methods such as in Maria are welcome again.

Probably the happiest camp out there are PostgreSQL people – data locality issues were plaguing their performance most, and it is strong side of InnoDB. On the other hand, MySQL has pluggable engine support, so it may be way easier to produce SSD versions for anything we have now, or just use new ones (hello, Maria!).

Still, there is quite some work to adapt to the new storage model, and judging by the fact how InnoDB works with modern rotational media, we will need some very strong push to adapt it for the new stuff.

You can sense the futility of any work done to optimize for rotation – all the “make reads fast” techniques will end up resolved at hardware layer, and the human isn’t needed anymore (nor all these servers with lots of memory and lots of spindles).

On XFS write barriers

I’m very naive, when I have to trust software. I just can’t believe a filesystem may have a tunable that makes it 20x faster (or rather, doesn’t make it 20x slower). I expect it to work out of the box. So, I was pondering, why in my testing XFS on LVM flushes data ~20x faster than on a box where it talks directly to device. Though I have noticed some warnings before, people on #xfs pointed out that LVM doesn’t support write barriers.

So, as I had no idea what write barriers are, had to read up a bit on that. There is a very nice phrase in there regarding battery-backed write-behind caching:

Using write barriers in this instance is not warranted and will in fact lower performance. Therefore, it is recommended to turn off the barrier support and mount the filesystem with “nobarrier”.

No shit, 20x lower performance :) As usually, I was not the only one to spot that..

So, I just ran this:

mount -o remount,nobarrier /a

And InnoDB flushed pages at 80MB/s instead of 4MB/s.

Update (2009/03): 2.6.29 kernel will support write barriers for LVM too – so XFS performance degradation is very much expected at very very wide scope. Also ext4 uses write barriers by default too. This thing is getting huge.