Stonebraker trapped in Stonebraker 'fate worse than death'

Oh well, I know I shouldn’t poke directly at people, but they deserve that sometimes (at least in my very personal opinion). Heck, I even gave 12h window for this not to be hot-headed opinion.

Those who followed MySQL at facebook development probably know how much we focus on actual performance on top of mixed-composition I/O devices (flashcache, etc) – not just retreating to comfortable zone of in-memory (or in-pure-flash) data.

I feel somewhat sad that I have to put this truism out here – disks are way more cost efficient, and if used properly can be used to facilitate way more long-term products, not just real time data. Think Wikipedia without history, think comments that disappear on old posts, together with old posts, think all 404s you hit on various articles you remember from the past and want to read.

Building the web that lasts is completely different task from what academia people imagine building the web is.

I already had this issue with other RDBMS pioneer (there’s something in common among top database luminaries) – he also suggested that disks are things of the past and now everything has to be in memory, because memory is cheap. And data can be whatever unordered clutter, because CPUs can sort it, because CPUs are cheap.

They probably missed Al Gore message. Throwing more and more hardware without fine tuning for actual operational efficiency requirements is wasteful and harms our planet. Yes, we do lots of in-memory efficiency work, so that we reduce our I/O, but at the same time we balance the workload so that I/O subsystem provides as efficient as possible delivery of the long tail.

What happens in real world if one gets 2x efficiency gain? Twice more data can be stored, twice more data intensive products can be launched.
What happens in academia of in-memory databases, if one gets 2x efficiency gain? A paper.
What happens when real world doesn’t read your papers anymore? You troll everyone via GigaOM.

Though sure, there’s some operational overhead in handling sharding and availability of MySQL deployments, at large scale it becomes somewhat constant cost, whereas operational efficiency gains are linear.

Update: Quite a few people pointed out that I was dissing a person who has done incredible amount of contributions, or that I’m anti-academia. I’m not, and I extremely value any work that people do wherever they are, albeit I do apply critical thinking to whatever they speak.

In my text above (I don’t want to edit and hide what I said) I don’t mean that “a paper” is useless. Me and my colleagues do read papers and try to understand the direction of computer science and how it applies to our work (there are indeed various problems yet to solve). I’d love to come up with something worth a paper (and quite a few of my colleagues did).

Still, if someone does not find that direction useful, there’s no way to portray them the way the original GigaOM article did.

On SSDs, rotations and I/O

Every time anyone mentions SSDs, I have a feeling of futility and being useless in near future. I have spent way too much time to work around limitations of rotational media, and understand the implications of whole vertical data stack on top.

The most interesting upcoming shift is not only the speed difference, but simply different cost balance between reads and writes. With modern RAID controllers and modern disks and modern filesystems reads are way more expensive operation from application perspective than writes.

Writes can be merged at application and OS level, buffered at I/O controller level, and even sped up by on-disk volatile cache (NCQ write reordering can give +30% faster random write performance).

Reads have none of that. Of course, there’re caches, but they don’t speed up actual read operations, they just help to avoid them. This leads to very disproportionate amount of caches needed for reads, compared to writes.

Simply, 32GB system with MySQL/InnoDB will be wasting 4GB on mutexes (argh!!..), few more gigs on data dictionary (arghhh #2), and everything else for read caching inside buffer pool. There may be some dirty pages and adaptive hash or insert buffer entries, but they are all there not because systems lack write output capacity, but simply because of braindead InnoDB page flushing policy.

Also, database write performance is mostly impacted not because of actual underlying write speed, but simply because every write has to read from multiple scattered places to actually find what needs to be changed.

This is where SSDs matter – they will have same satisfactory write performance (and fixes for InnoDB are out there ;-) – but the read performance will be satisfactory (uhm, much much better) too.

What this would mean for MySQL use:

  • Buffer pool, key cache, read-ahead buffers – all gone (or drastically reduced).
  • Data locality wouldn’t matter that much anymore either, covering indexes would provide just double performance, rather than up to 100x speed increase.
  • Re-reading data may be cheaper, than including it in various temporary sorting and grouping structures
  • RAIDs no longer needed (?), though RAM-backed write-behind caching would still be necessary
  • Log-based storage designs like PBXT will make much more sense
  • Complex data flushing logic like inside InnoDB’s will not be useful anymore (one can say, it is useless already ;-) – and straightforward methods such as in Maria are welcome again.

Probably the happiest camp out there are PostgreSQL people – data locality issues were plaguing their performance most, and it is strong side of InnoDB. On the other hand, MySQL has pluggable engine support, so it may be way easier to produce SSD versions for anything we have now, or just use new ones (hello, Maria!).

Still, there is quite some work to adapt to the new storage model, and judging by the fact how InnoDB works with modern rotational media, we will need some very strong push to adapt it for the new stuff.

You can sense the futility of any work done to optimize for rotation – all the “make reads fast” techniques will end up resolved at hardware layer, and the human isn’t needed anymore (nor all these servers with lots of memory and lots of spindles).