Tim is now vocal

Tim at the datacenter
Tim is one of the most humble and intelligent developers I’ve ever met – and we’re extremely happy to have him at Wikimedia. Now he has a blog, where the first entry is already epic by any standards. I mentioned the IE bug, and Tim has done a thorough analysis of this one and of similar problems.

I hope he continues to disclose the complexity of real web applications – that will always be a worthy read.

ZFS and MySQL … not yet

Today I attended a kick-ass ZFS talk (3 hours of incredibly detailed material presented by someone who knows the stuff and knows how to talk) at CEC (a Sun internal training event/conference), so now I know way more about ZFS than I used to. Probably I know way more about ZFS than the average Joe DBA \o/ 

And now I think ZFS has lots of brilliant design and implementation bits, except it doesn’t match the access pattern needs of databases. 

See, ZFS is not a regular POSIX-API -> HDD bridge, unlike pretty much everything else out there. It is a transactional object store that allows multiple access semantics and APIs, and the standard ZFS POSIX Layer (ZPL) is just one of them. In MySQL terms, think of all other filesystems as MyISAM, and of ZFS as InnoDB :-) 

So, putting InnoDB on top of ZFS, after some high-school-like variable replacement, ends up as “putting InnoDB on top of InnoDB”. Let’s go a bit deeper here:

  • ZFS has checksums, so does InnoDB (though ZFS checksums are faster, Fletcher-derived, etc ;-)
  • ZFS has atomicity, so does InnoDB
  • ZFS has ZIL (Intent Log), so does InnoDB (Transaction Log)
  • ZFS has background intelligent flushing of data, so does InnoDB (maybe not that intelligent though)
  • ZFS has an Adaptive Replacement Cache, so does InnoDB (it calls it the Buffer Pool, uses just one replacement queue – LRU – instead of three, and doesn’t account for MFU)
  • ZFS has copy-on-write snapshotting, so does InnoDB (MVCC!)
  • ZFS has compression, so does InnoDB (in plugin, though)
  • ZFS has intelligent mirroring/striping/etc built in; this is what InnoDB people use RAID controllers for.
  • ZFS has bit-rot recovery, self-healing and such; InnoDB has assertions and crashes :-)

So, we have two intelligent layers on top of each other, and there’s lots of work duplicated. Of course, we can try to eliminate some bits:

  • Disable checksums at the InnoDB level (a rough sketch follows this list)
  • Unfortunately, there’s nothing to be done about two transaction logs
  • Dirty pages can be flushed immediately by InnoDB; this is probably tunable at the ZFS level too
  • The InnoDB buffer pool could probably be reduced to favor the ARC, or the other way around…
  • Double copy-on-write is inevitable (and a copy-on-write transaction log does not really make sense…)
  • Compression can be done at either level
  • Using ZFS for volume management would be the major real win, as would all the self-healing capacity
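
To make a couple of those points concrete, here is a rough sketch of how such a combo could be tuned. The option and property names are standard, but the pool name and values are purely illustrative:

zfs set recordsize=16K tank/mysql      # align ZFS records with InnoDB’s 16KB pages
zfs set atime=off tank/mysql
# compression could be enabled here instead of in the InnoDB plugin:
# zfs set compression=on tank/mysql

# checksumming is already done by ZFS; shrink the buffer pool in favor of the ARC
mysqld --skip-innodb-checksums --innodb-buffer-pool-size=2G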

So, I’m not too convinced at this moment about using this combo, but there’s another idea that has been circulating around for quite a while – what if MyISAM suddenly started using all the ZFS capabilities? Currently the ZPL and direct ZFS object store management are mutually exclusive – you have to pick one way – but if the ZPL were extended to support a few simple operations (create/drop snapshots of just a single file, wrap multiple write() calls into a transaction), MyISAM could get a different life:

  • Non-blocking SELECTs could be implemented using snapshots
  • Writes would be atomic and non-corrupting
  • MyISAM would get checksummed, compressed, consistent data that is flushed by intelligent background threads, and would have immediate crash recovery
  • For replication slaves, write concurrency would not be that necessary (a single thread is updating the data anyway)
  • “Zettabyte” (was told not to use this ;-) File System would actually allow Zettabyte-MyISAM-Tables o/
  • All the Linux people (including me :) would complain about Sun doing something just for [Open]Solaris, instead of working on [insert favorite storage engine here]. 

Unfortunately, to implement that now, one would either have to tap directly into the object management API (which would mean quite a bit of rewriting), or wait for the ZFS people to extend the ZPL calls. And for now, I’d say, “not yet”.

Disclaimer: the opinion of the author does not represent the opinion of his employer (especially the Marketing people), and may be affected by the fact that the author was enjoying free wireless, and who knows what else, at Las Vegas McCarran International Airport. 

On SSDs, rotations and I/O

Every time anyone mentions SSDs, I get a feeling of futility, of being useless in the near future. I have spent way too much time working around the limitations of rotational media and understanding the implications for the whole vertical data stack on top.

The most interesting upcoming shift is not just the speed difference, but simply the different cost balance between reads and writes. With modern RAID controllers, modern disks and modern filesystems, reads are a way more expensive operation than writes, from the application’s perspective.

Writes can be merged at the application and OS level, buffered at the I/O controller level, and even sped up by the on-disk volatile cache (NCQ write reordering can give +30% faster random write performance).

Reads have none of that. Of course, there are caches, but they don’t speed up actual read operations, they just help to avoid them. This leads to a very disproportionate amount of cache being needed for reads, compared to writes.

Simply put, a 32GB system with MySQL/InnoDB will be wasting 4GB on mutexes (argh!!..), a few more gigs on the data dictionary (arghhh #2), and everything else on read caching inside the buffer pool. There may be some dirty pages and adaptive hash or insert buffer entries, but they are there not because systems lack write output capacity, but simply because of the braindead InnoDB page flushing policy.

Also, database write performance mostly suffers not because of the actual underlying write speed, but simply because every write has to read from multiple scattered places to find what needs to be changed.

This is where SSDs matter – they will have the same satisfactory write performance (and fixes for InnoDB are out there ;-), but the read performance will be satisfactory (uhm, much, much better) too.

What this would mean for MySQL use:

  • Buffer pool, key cache, read-ahead buffers – all gone (or drastically reduced).
  • Data locality wouldn’t matter that much anymore either; covering indexes would provide just double the performance, rather than up to a 100x speed increase.
  • Re-reading data may be cheaper than including it in various temporary sorting and grouping structures
  • RAIDs no longer needed (?), though RAM-backed write-behind caching would still be necessary
  • Log-based storage designs like PBXT will make much more sense
  • Complex data flushing logic like InnoDB’s will not be useful anymore (one can say it is useless already ;-) – and straightforward methods such as Maria’s are welcome again.

Probably the happiest camp out there are the PostgreSQL people – data locality issues were plaguing their performance the most, and that is a strong side of InnoDB. On the other hand, MySQL has pluggable engine support, so it may be way easier to produce SSD-oriented versions of anything we have now, or just use new engines (hello, Maria!).

Still, there is quite some work needed to adapt to the new storage model, and judging by how InnoDB works with modern rotational media, we will need a very strong push to adapt it to the new stuff.

You can sense the futility of any work done to optimize for rotation – all the “make reads fast” techniques will end up resolved at the hardware layer, and the human isn’t needed anymore (nor are all these servers with lots of memory and lots of spindles).

On XFS write barriers

I’m very naive when I have to trust software. I just can’t believe a filesystem may have a tunable that makes it 20x faster (or rather, doesn’t make it 20x slower) – I expect it to work out of the box. So I was pondering why, in my testing, XFS on LVM flushes data ~20x faster than on a box where it talks directly to the device. Though I had noticed some warnings before, people on #xfs pointed out that LVM doesn’t support write barriers.

So, as I had no idea what write barriers are, I had to read up a bit on them. There is a very nice phrase in there regarding battery-backed write-behind caching:

Using write barriers in this instance is not warranted and will in fact lower performance. Therefore, it is recommended to turn off the barrier support and mount the filesystem with “nobarrier”.

No shit, 20x lower performance :) As usual, I was not the only one to spot that…

So, I just ran this:

mount -o remount,nobarrier /a

And InnoDB flushed pages at 80MB/s instead of 4MB/s.
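
To make it survive a reboot, the same option would go into /etc/fstab (the device path here is just illustrative):

/dev/sdb1   /a   xfs   defaults,nobarrier   0 0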

Update (2009/03): the 2.6.29 kernel will support write barriers for LVM too – so XFS performance degradation is to be expected at a very, very wide scope. ext4 uses write barriers by default too. This thing is getting huge.

How did the Sunday go?

Yesterday Europe switched daylight saving time off. Though for most people that’s a great change (you can sleep one hour longer, for one day in the whole year you’re not late, etc), it’s a data nightmare…

  • Same time happens twice
  • Time is not a sequential value
  • Time criteria get fuzzy

So, if you are using local time instead of UTC, your data may hit unique constraints (if anyone has unique constraints on datetimes, of course ;-), or you simply won’t be able to tell which moment in time a value refers to (bad for audit trails!)

The fun part within MySQL is that DATETIME is time-zone agnostic (it just stores presentation values), whereas TIMESTAMP is not (it stores data in UTC and presents it based on the session timezone), so the behaviors of the two will be different.

How different? DATETIME will just have non-sequential, lossy information, and TIMESTAMP will have… it depends. If implicit TIMESTAMP default values are used, it will have correct UTC-based data. If NOW() or literals are inserted, those will provide lossy, presentation-time based values. In order for the database to actually understand what a literal value means, it would need an offset included in the string (“hh:mm:ss+hhmm”), though MySQL currently does not support this notation. The NOW() behavior is probably a bug.
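
As a quick illustration (the timezone, table and values are made up; 2008-10-26 was the autumn switch, when clocks went from 03:00 back to 02:00), a local-time literal from the repeated hour is simply ambiguous:

SET time_zone = 'Europe/Berlin';   -- assumes named timezones are loaded
CREATE TABLE events (dt DATETIME, ts TIMESTAMP);
-- 02:30 local happened twice that night: is it 00:30 UTC or 01:30 UTC?
-- DATETIME just keeps the string, TIMESTAMP has to pick one offset to store UTC
INSERT INTO events VALUES ('2008-10-26 02:30:00', '2008-10-26 02:30:00');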

Storing time correctly opens another can of worms – user-provided time criteria cannot be unambiguously converted into UTC, and therefore a lossy match is done. To complicate things even more, since indexes are in sequential time but data matching is done on non-sequential time, the data returned will differ depending on the data access method (see Bug#38455).

So, the elegant fix for timezone support inside MySQL would be extending the presentation/literal format to support specified offsets and fixing the datetime/timestamp code in multiple places – and that would be quite some work to fix one hour per year.

Timezone support is an old interest of mine – there’s lots of complexity and many different implications, but most people just don’t get that far, and may lose their data consistency.

So, how did your Sunday go?

Progress in percents: 0 1 2 3 …

Well, servers usually don’t crash (our English Wikipedia master has been running for 800 days, on white-box hardware, RAID0, MySQL 4.0 ;-), but when they do (like with some kernel bugs on our big, big boxes), one of the most painful experiences is InnoDB log recovery.

Usually people will reduce innodb-log-file-size to ease the recovery (it helps, in a way :), but the real problem is somewhere else.

See, when InnoDB does crash recovery, it applies the log changes in memory and builds a flush list. It doesn’t flush any pages during the recovery process, so the flush list grows big – thousands, tens of thousands, maybe millions kind of big; anyway, big-number big.

Oh, did I mention? The flush list is actually a linked list, not some kind of hippy tree stuff. Every time a log record is read from the log and something gets updated, the flush list is traversed – thousands, tens of thousands, maybe millions of entries.

The expensive code looks something like this:

/* walk the flush list linearly, looking for the position at which this
   block has to be inserted to keep the list sorted by oldest_modification */
while (b && (ut_dulint_cmp(b->oldest_modification,
             block->oldest_modification) > 0)) {
       prev_b = b;
       b = UT_LIST_GET_NEXT(flush_list, b);    /* one entry at a time */
}

Then your profile starts looking like this, and you wish your systems didn’t crash:

%        symbol name
87.6978  buf_flush_insert_sorted_into_flush_list
 5.8571  -kernel
 1.9793  recv_apply_hashed_log_recs
 0.8059  buf_calc_page_new_checksum

So, the recovery cost grows quadratically (every applied log record walks an ever longer list), and people work around it by reducing the log file size – that is, by reducing the performance of their system – while the actual fix is right there, in optimizing the data structure. The current model is outdated for anything built in the last 5 years anyway.

Oh, and of course, I’d like systems not to crash at all, like that database master on white-box RAID0 running for 800 days.

Update: this is old stuff. Peter wrote about it, Heikki opened a bug, then thought it would need more than five minutes to fix and classified it as a feature request, so Peter could write more about it. That makes it even more sad. We’d probably change the synopsis of the feature request to “make crash recovery work”.

Update 2: get the patch at Percona (Yasufumi is god :)

dtrace!

At the MySQL developer conference I accidentally showed off some of the things we’ve been doing with dtrace (I had used it in a few cases and realized the power it has), and saw some jaws drop. Then I ended up doing small demos around the event. What most people know about dtrace is that there are some probes and you can trace them. What people don’t know is that you can actually create lots of probes dynamically and use them with lots of flexibility.

One of the major things not really grasped by many is that dtrace is a combination of a tracing tool, a debugger, a programming language and a database, with minor but very valuable functionality from each. It can attach to any place in the code, it can get stacks and function arguments, traverse structures, do some evaluations, aggregate data – and in the end, it is all compiled code executed by the kernel (or by programs). 

Sometimes a probe may not look that useful (strace would show file writes too, right?), but combine it with the ability to get an immediate stack, as well as to set or read context variables (a previous probe on any other event could have saved some important information, e.g. host, user or table names), and the final result may provide statistics correlated with many other activities. 

One developer (a traitor who left support for an easier life in the engineering dept) listened to all this, and I asked what his current project was – apparently he was adding static dtrace probes to MySQL. It ended up being quite an interesting discussion, as static probes provide two value points. First of all, they provide an interface – whereas dynamic probes can change with code changes (though that doesn’t happen too often :). Second, one can do additional calculations for a specific probe, which will be done only on demand (when the probe is attached). 

So, having a static probe that directly maps to an easy-mode dynamic one (it is straightforward to attach to a function, and quite easy to read its arguments) is a bit of a waste (both in development time and in the few instructions actually written there). Dynamic tracing generally modifies binaries on the fly, so otherwise it carries no additional cost. An example where a static probe would be awesome: a “query start” event carrying the query string canonicalized with all literals removed – this would allow on-demand query profiling of query groups, rather than of stand-alone queries.

The other major value is the ability to set thread-specific context variables in different probes, so they can read each other’s data. When a packet comes in, one can tag the thread with whatever information is needed – then any subsequent probe can reuse that information to filter out the important events. That also removes the need for static probes to provide multi-layer information – it can all be achieved by chaining events, without too much complexity. 
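
A tiny sketch of such chaining, using dtrace’s thread-local self-> variables (the function patterns are only illustrative, the real symbols depend on the binary):

dtrace -p $(pgrep -x mysqld) -n '
    pid$target::*dispatch_command*:entry  { self->busy = 1; }
    pid$target::*os_file_read*:entry /self->busy/ { @reads[probefunc] = count(); }
    pid$target::*dispatch_command*:return { self->busy = 0; }'

Here one probe tags the thread while it is executing a command, and another counts file reads done on its behalf – exactly the kind of correlation mentioned above.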

I took a bit of a trollish stance when I approached a developer implementing internal performance statistics. We were playing a game – he’d tell me what kind of performance information he’d like to extract, and I’d show a method to do it with dtrace. More people from the monitoring field joined, and we ended up discussing what the perfect performance monitoring and analysis system would be. It is quite easy to understand that different people need different kinds of metrics. For MySQL development work, a performance engineer will need mutex contention information, someone fixing a leak will need heap profiling, someone writing a feature will want an easy way to trace how the server executes their code – and all of that is far away from any needs an actual user or DBA has. Someone who writes a query just wants to see the query plan with some easy-to-understand costs (we just need to pump more steroids into EXPLAIN). DBAs may want to see resource consumption per user, per table, etc (something the Google patch provides). It is interesting to find the balance between external tools and what should be supported out of the box internally – it is way easier to force the internal crowd to have proper tools, and it is always nice to provide as much instrumentation as possible for anyone external. 

Of course, there’s a poor guy in the middle of the two camps – the support engineer – who needs easy performance metrics to be accessible from clients, but needs way more depth than the standard tools provide. In an ideal case dtrace would be everywhere (someone recently said it’s one of the coolest things Sun has ever brought) – then we’d be able to retrieve on-demand performance metrics from everywhere, and we’d be tempted to write a DTraceToolkit-like bunch of stuff (a suite of programs that give lots and lots of information based on dtrace) for MySQL internals analysis.

I already made one very, very simple tool which visualizes dtrace output, so we can have a graphviz-based SVG callgraph for pretty much any type of probe (like, who in the application does expensive file reads) – all from a single dtrace one-liner. It seems I can sell the tool to Sun’s performance engineering team – they liked it. :) 
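
The one-liners feeding it are of roughly this kind (a sketch, picking whichever probe happens to be interesting): aggregate user stacks on every pread and let the visualizer draw who calls what:

dtrace -p $(pgrep -x mysqld) \
       -n 'syscall::pread*:entry /pid == $target/ { @[ustack()] = count(); }'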

Some people even installed Solaris afterwards for their performance tests. Great, I won’t have to (haha!).

Though the lack of dtrace on Linux is currently a blocker for the technology, lots of engineers already have it on their laptops – MacOSX 10.5 ships it. It even has a visual toolkit that allows building some dtrace stuff in a GUI. 

I’m pretty sure now that any engineer would love dtrace (or dtrace-based tools) – they just don’t know it yet.

Drizzle

Hi! It is about time to write down some thoughts about Drizzle, even after it got so much blogging love elsewhere :)

I love some of the ideas – like employing a generic portable record format, and throwing away lots and lots of crufty code associated with reading internal structures.

Some of the ideas I probably love less, mostly the microkernel design.

See, I’m a believer in hacks. A hack in monolithic code blends in nicely; a hack in a microkernel design looks like a bunch of spaghetti on top of a kosher pork steak (well, probably a bad analogy :). Hacks start bloating the plugin interfaces, microkernel designers become unhappy, there’s lots of tension – instead of everything living in one huge, nice pot of spaghetti.

Why does one need hacks? I like it when InnoDB controls replication (thus adding transactional consistency to it, or adding semi-sync properties), and I like it when replication controls InnoDB (asks for higher priorities, and such). These changes required changing the handler interface without even having replication as a module. In a spaghetti soup, one straw more or less doesn’t matter that much. :)

The very example of Apache proves the point that modules don’t work well together. There’s not that much synergy between, say, mod_php and mod_perl. Actually, there’s not much synergy between any Apache modules. People end up compressing, logging, filtering and redirecting inside PHP or Python or Perl code, not in dedicated Apache modules. Why? Simply because the interfaces are insufficient and the modules end up limited – there’s no real synergy out there. In the end, having the data logic in one piece is actually more maintainable than building bridges between entirely separate logic pools.

It is a bit of hypocrisy to aim for a modular design with clear plugin interfaces, and at the same time remove all the features that make the design of other applications more modular and their interfaces clearer (SPs, prepared statements, triggers, etc ;-)

Of course, I’m playing a bit of Devil’s Advocate here, and I’m one of those people forced to know every reason why various features got removed, but I somehow feel that lots of the actual improvements (like protocol buffers) could be done without the stripping. Also, I know that most of the removed features are not harmful in any way if not used :)

In the end, most of the heavyweight database work is done at the storage engine layer anyway: most of the resource usage is in the storage engine, most of the scalability troubles are in the storage engine, and most of the actually needed improvements and features should be done at the storage engine layer.