On SSDs, rotations and I/O

Every time anyone mentions SSDs, I get a feeling of futility, of being useless in the near future. I have spent way too much time working around the limitations of rotational media and understanding their implications for the whole vertical data stack on top.

The most interesting upcoming shift is not just the speed difference, but the different cost balance between reads and writes. With modern RAID controllers, modern disks, and modern filesystems, reads are a far more expensive operation from the application’s perspective than writes.

Writes can be merged at the application and OS levels, buffered at the I/O controller level, and even sped up by the on-disk volatile cache (NCQ write reordering can give ~30% faster random write performance).
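As a toy illustration of the merging step (plain Python, not real kernel or controller code): writes to adjacent blocks can be collapsed into a few larger physical I/Os, which is roughly what an OS elevator or a controller’s write-back cache does.

```python
def merge_writes(dirty_blocks):
    """Collapse dirty block numbers into (start, length) runs of adjacent blocks."""
    runs = []
    for block in sorted(set(dirty_blocks)):
        if runs and block == runs[-1][0] + runs[-1][1]:
            runs[-1] = (runs[-1][0], runs[-1][1] + 1)  # extend the current run
        else:
            runs.append((block, 1))                    # start a new run
    return runs

# Seven scattered logical writes become three physical I/Os:
print(merge_writes([10, 11, 12, 40, 41, 99, 12]))
# -> [(10, 3), (40, 2), (99, 1)]
```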

Reads have none of that. Of course, there are caches, but they don’t speed up actual read operations; they just help avoid them. This leads to a very disproportionate amount of cache being needed for reads compared to writes.

Put simply, a 32GB system running MySQL/InnoDB will waste 4GB on mutexes (argh!!..), a few more gigabytes on the data dictionary (arghhh #2), and devote everything else to read caching inside the buffer pool. There may be some dirty pages and adaptive hash or insert buffer entries in there, but they exist not because the system lacks write output capacity, but simply because of InnoDB’s braindead page flushing policy.

Also, database write performance suffers mostly not because of the actual underlying write speed, but because every write first has to read from multiple scattered places to find what needs to be changed.
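A back-of-envelope model makes this concrete. The fanout and latencies below are assumptions for illustration, not measurements: a single-row update in a B-tree must first read the root-to-leaf page path to locate the row, and on rotational media each uncached page read is a seek.

```python
import math

def update_cost_ms(n_rows, fanout=500, read_ms=8.0, write_ms=0.1):
    """Toy model: ms to update one row; reads dominate on rotational media."""
    depth = max(1, math.ceil(math.log(n_rows, fanout)))  # B-tree pages read root->leaf
    return depth * read_ms + write_ms

# A billion-row table is ~4 levels deep: ~32 ms of reads per 0.1 ms of writing.
print(round(update_cost_ms(1_000_000_000), 1))  # -> 32.1
```

With fully buffered writes the write side is nearly free; the scattered reads are what the disk actually pays for.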

This is where SSDs matter: write performance will remain satisfactory (and fixes for InnoDB are out there ;-), but read performance will become satisfactory (uhm, much, much better) too.

What this would mean for MySQL use:

  • Buffer pool, key cache, read-ahead buffers – all gone (or drastically reduced).
  • Data locality wouldn’t matter that much anymore; covering indexes would provide maybe a 2x performance gain, rather than an up-to-100x speedup.
  • Re-reading data may be cheaper than including it in various temporary sorting and grouping structures.
  • RAID no longer needed (?), though RAM-backed write-behind caching would still be necessary.
  • Log-based storage designs like PBXT will make much more sense
  • Complex data flushing logic like InnoDB’s will not be useful anymore (one could say it is useless already ;-), and straightforward methods such as Maria’s are welcome again.
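The locality point above can be shown with two assumed per-page read latencies (illustrative figures, not benchmarks):

```python
# Assumed per-page read latencies in milliseconds:
HDD = {"seq": 0.08, "rand": 8.0}   # a random read pays a full seek + rotation
SSD = {"seq": 0.08, "rand": 0.1}   # near-uniform access cost

for name, dev in (("HDD", HDD), ("SSD", SSD)):
    penalty = dev["rand"] / dev["seq"]
    print(f"{name}: a random read costs {penalty:.0f}x a sequential one")
```

On rotating disks, locality buys about two orders of magnitude; on SSDs it buys almost nothing, which is why covering indexes drop from ~100x wins to small constant-factor wins.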

Probably the happiest camp out there is the PostgreSQL crowd: data locality issues plagued their performance the most, whereas locality is a strong side of InnoDB. On the other hand, MySQL has pluggable engine support, so it may be much easier to produce SSD-oriented versions of anything we have now, or just use new engines (hello, Maria!).

Still, there is quite some work needed to adapt to the new storage model, and judging by how InnoDB works with modern rotational media, it will take a very strong push to adapt it to the new stuff.

You can sense the futility of any work done to optimize for rotation: all the “make reads fast” techniques will end up resolved at the hardware layer, and the human won’t be needed anymore (nor all those servers with lots of memory and lots of spindles).

10 thoughts on “On SSDs, rotations and I/O”

  1. What is ‘braindead’ about the page-flush design for InnoDB? What is the page-flush design for Maria?

    It is possible to significantly shrink the per-page overhead of the InnoDB buffer cache, but that will take time. Is the current overhead really 2kB per page? I thought it was a few hundred bytes per page, and that the ratio is reduced if you boost the InnoDB page size from 16kB to 32kB or 64kB.

  2. Mark, I think any static rate limiting is unnecessary; even if rate limiting is done, it could be expressed as time-based exponential pressure (the more you’re lagging, the more you pressure the writes to I/O).

    Maria also does background LRU flushing (see http://forge.mysql.com/worklog/task.php?id=3261), but doesn’t have adaptive hash/insert buffer steps.

    As for memory use – I’d have to show a full profile (this one is a bit outdated and 4.0-based: http://noc.wikimedia.org/~midom/mem.pdf ), and indeed it may be possible to change that just by bumping up the page size (does anyone do that in production with good results?). Last time I looked, it was wasteful enough :)
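For what it’s worth, the arithmetic on page-size bumping is simple. Assuming a ~200-byte descriptor per buffer pool page (an illustrative figure, not a measured InnoDB value), the relative overhead halves each time the page size doubles:

```python
OVERHEAD_BYTES = 200  # assumed per-page descriptor size, for illustration only

for page_kb in (16, 32, 64):
    ratio = OVERHEAD_BYTES / (page_kb * 1024)
    print(f"{page_kb}KB pages: {ratio:.2%} of the buffer pool goes to bookkeeping")
```

The trade-off is coarser-grained caching: with bigger pages, more cold rows ride along with every hot one.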

  3. A couple of comments:

    1) Getting rid of the buffer pool is not going to happen even with SSDs. Take a look at the code path for data in cache vs. data on disk in InnoDB (I have done work running MySQL on a RAM drive and on a Violin Memory system).

    2) Rate limiting makes sense. If you need to flush 10,000 pages and do it as fast as possible, you will see less bandwidth available for reads and a dip in results. InnoDB is doing it wrong, though. Check a DBT2 graph or any other flush-intensive benchmark and you will still see dips… because InnoDB either doesn’t flush fast enough or starts flushing like crazy and is unable to do anything else. An adaptive, configurable limit would make much more sense (and this is what we’re doing in the Percona patches).
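    A minimal sketch of such an adaptive limit (a hypothetical shape, not the actual Percona patch): scale the flush rate with checkpoint lag, so pressure rises smoothly instead of in panic bursts.

```python
def flush_rate(lag, max_iops=2000, exponent=2.0):
    """Pages/sec to flush, given checkpoint lag as a fraction of redo log capacity."""
    lag = min(max(lag, 0.0), 1.0)        # clamp to [0, 1]
    return max_iops * lag ** exponent    # gentle when ahead, aggressive when behind

for lag in (0.1, 0.5, 0.9, 1.0):
    print(f"lag {lag:.0%}: flush at {flush_rate(lag):.0f} pages/s")
```

    With an exponent above 1, light lag costs almost no read bandwidth, while a nearly full log still gets the full configured I/O capacity.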

  4. … At the same time.

    Yes, SSDs and Flash in general will change the landscape dramatically. Though I think of Flash more than SSDs: when you look at Flash-based systems, it does not really make sense to go through the disk interface. Solutions connecting directly to the bus (such as FusionIO) are likely to be more interesting. And as you mention RAID – for the same reason there is less requirement for hot swap, too.

  5. “Log-based storage designs like PBXT will make much more sense”

    I do not understand the reasoning behind this. With SSD, the advantage of doing sequential writes instead of random writes will be almost eliminated. Since one of the main motivations of log-based storage is to reduce the amount of random writes, I do not see how the log-based approach will make more sense with SSD.

  6. Øystein, PBXT’s per-thread log reduces lock contention (otherwise appends would contend on per-table logs). It makes at least one part of the operation lock-free.
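    A sketch of the contention argument (illustrative Python, not PBXT’s actual implementation): with a single shared log, every append serializes on one lock, while per-thread logs keep the hot path lock-free.

```python
import threading

class SharedLog:
    """Single append-only log: every writer contends on one lock."""
    def __init__(self):
        self.lock = threading.Lock()
        self.records = []

    def append(self, rec):
        with self.lock:                   # hot-path lock, taken on every append
            self.records.append(rec)

class PerThreadLog:
    """One private buffer per thread: appends take no lock at all."""
    def __init__(self):
        self.local = threading.local()
        self.all_logs = []
        self.reg_lock = threading.Lock()  # only guards one-time registration

    def append(self, rec):
        buf = getattr(self.local, "buf", None)
        if buf is None:                   # rare: first append by this thread
            buf = self.local.buf = []
            with self.reg_lock:
                self.all_logs.append(buf)
        buf.append(rec)                   # lock-free hot path

log = PerThreadLog()
workers = [threading.Thread(target=lambda: [log.append(n) for n in range(1000)])
           for _ in range(4)]
for t in workers:
    t.start()
for t in workers:
    t.join()
print(sum(len(b) for b in log.all_logs))  # -> 4000 records, no lock on the hot path
```

    The shared state is only touched at registration (and, in a real engine, at merge/checkpoint time), so ordinary appends never block each other.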

  7. I’ve been working on a long blog post about the pros/cons of in-kernel caching vs in-process caching.

    It’s a very complicated issue that I’ve been thinking about for a long time.

    If you can get your data on SSD, and avoid caching altogether, these issues totally vanish.

    The problem is that InnoDB/MySQL is not very good when you put it on an ultra-fast IO subsystem (as Peter mentioned).

    All of our SSD tests were CPU bound in InnoDB. In a number of them MyISAM crushed InnoDB on SSD because it wasn’t using any CPU.

    Hopefully some of the new performance patches will fix things but I think the data/IO path will need to be optimized.

    We simply didn’t have access to devices with these I/O capabilities in the past…


Comments are closed.