how innodb lost its advantage

For years it was very easy to defend InnoDB’s advantage over the competition – covering index reads saved I/O operations and CPU everywhere, tablespace and I/O management let us focus on the database rather than on file system or virtual memory behavior, and for the past few years InnoDB compression was the way to build highly efficient OLTP (or in our case – SGTP – Social Graph Transaction Processing) environments. Until one day (for some it came sooner, for others later)…

The InnoDB team announced that it would change how it does compression in the future and that the old ways (which we rely on) would all be gone. I’m not exactly sure whether there was any definite messaging on the future of the existing methods, but Oracle will never put out a roadmap in public, so there’s lots of uncertainty involved. Unfortunately, with this uncertainty, we probably lost quite some momentum in InnoDB engineering efforts (we don’t get to see some of the planned advancements, like Nizam’s work on page reorganization).

The new way is “InnoDB Transparent PageIO Compression” – and it makes lots of sense from a full-stack architecture perspective. It relies on the fact that high-end flash storage devices already have log-structured block storage internally, and if one ties directly into it, lots of overhead can be avoided (similar concepts are used by MariaDB’s atomic writes).

We were throwing this idea around as a thought exercise years ago, and we mentioned it here and there. As with every thought exercise, there were lots of pros and cons to think about.

One problem is that even though the device is log-structured internally, it is still glued together out of blocks. A few years ago disks and flash devices used to be 512-byte formatted. Nowadays the industry is switching to 4k sectors (on disks it yields higher density, on flash it reduces flash translation layer (FTL) costs).

If 16k compresses into 9k, the earlier assumption was that the new layer would write only 9k. With 4k sectors it will actually write 12k, oh no. How do we solve that with old-style compression? We only partially fill InnoDB’s page, so that it compresses into 8k and we write 8k. In this case InnoDB, by deciding to be naive and not do any speculative page size management, ends up writing much more than the solutions used in large-scale environments.
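To make the sector arithmetic concrete, here is a minimal shell sketch. The function name written_bytes is made up for illustration, and it assumes the device always writes whole sectors:

```shell
# Bytes a device actually writes for one compressed page image,
# rounded up to whole sectors (assumption: no partial-sector writes).
# Usage: written_bytes <compressed_bytes> <sector_bytes>
written_bytes() {
    compressed=$1
    sector=$2
    # integer ceiling division, then scale back to bytes
    echo $(( (compressed + sector - 1) / sector * sector ))
}

written_bytes 9216 4096   # 9k page image on 4k sectors
written_bytes 9216 512    # same image on 512-byte sectors
```

With 4k sectors the 9k image costs 12288 bytes (12k), while on 512-byte sectors it costs exactly 9216. An image that fits in 8k lands on a sector boundary either way.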

Another problem is that the buffer pool is no longer compressed. This may mean you will need to buy devices with more IOPS and higher write endurance. A compressed buffer pool is a huge advantage, and without it users will just have to spend more on hardware (and Oracle is in the business of selling hardware, yay!).

Then there’s this whole other thing, which makes absolutely no sense. Why would Oracle decide to support a proprietary solution from a single hardware vendor (one it doesn’t even own) in its ubiquitous open-source product? They say they’re using APIs that work elsewhere, but that’s where it becomes recycled bovine manure.

When you’re talking to a flash device, its FTL is hiding the fact that everything is truly fragmented underneath you, and the namespace it has to deal with does not have any complicated dependencies – it is essentially a log-structured K/V store, where the key is the block address. The ease of a log-structured design is that you’re writing to very few places (and you’re usually appending). A general-purpose file system such as XFS has to handle all the metadata between the underlying flat-addressed block device and directories, file placement, extents and writes to files. On top of that it has to provide semantics like file expansion, renames and deletion, all happening on that single block device underneath.

For quite a while InnoDB was holding a global mutex when extending files – and that is a trivial operation compared to what hole punching would mean. Hole punching inside a file system would make each InnoDB page a separate segment that has to be tracked via file system metadata management (so every page write will be accompanied by file system journal and metadata writes). There is a question of whether the file system is going to scale, and then there’s just basic efficiency (a sparse synchronous write is ~5x more expensive than a non-sparse one).

Dropping a file with millions of file system segments in it will take minutes of CPU time and cause lock contention on the allocation group (each segment has to be evaluated and added back to the list of free space segments, with possible merging, etc). Understanding the implications of extreme fragmentation (can you even use the file system once it hits 50% full? 75% full?) is not that straightforward either.
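Here is a rough way to see where “millions of segments” comes from. It assumes the worst case, where every punched page ends up as its own extent and nothing gets merged. The function max_segments and the 100 GiB tablespace are hypothetical, picked just for illustration:

```shell
# Upper bound on the number of data segments in a file if every
# page becomes a separately tracked extent (worst-case assumption).
# Usage: max_segments <file_bytes> <page_bytes>
max_segments() {
    echo $(( $1 / $2 ))
}

# hypothetical 100 GiB tablespace with 16 KiB InnoDB pages
max_segments $(( 100 * 1024 * 1024 * 1024 )) $(( 16 * 1024 ))
```

That works out to about 6.5 million segments for a single file, every one of which the file system has to walk when the file is dropped.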

I did not have to think at all about file system scalability before (as long as writes got through), but now I can’t stop noticing things like XFS padding log writes to an imaginary or real stripe size (as if every RAID is RAID5).

So while Oracle has completely messed up the InnoDB compression roadmap, the surrounding industry has moved ahead in leaps and bounds. Remember that toy MongoDB with all of its inefficiencies? This is where it is today:

Chasing benchmarks is not enough to win a datacenter, especially when large-scale environments are working on improving the efficiency of systems, not just throughput. RocksDB has been making its way into InnoDB’s turf in the MySQL world, and the MongoDB ecosystem has RocksDB, TokuDB and WiredTiger. Embeddable InnoDB does not exist anymore, so most of the innovation in storage systems ends up completely ignoring it.

While Oracle orients MySQL towards proprietary file systems and hardware devices, we will see more and more new platforms on top of open-source pluggable storage engines.

We did recently deploy some non-compressed InnoDB environments (I am going to talk at MySQL Conference about our MySQL/InnoDB Messenger backend), but Yoshinori is going to talk about LSM databases at Facebook too, and Harrison’s keynote will be about all the different systems that are needed to deal with complex data problems.

after the conference, mydumper, parallelism, etc

Though slides for my MySQL Conference talks were on the O’Reilly website, I placed them in my talks page too, for both dtrace and security presentations.

I also gave a lightning talk about mydumper. Since my original announcement mydumper has changed a bit. It supports writing compressed files, detecting and killing slow queries that could block table flushes, supports regular expressions for table names, and trunk is slowly moving towards understanding that storage engines differ :)

I’ve been using mydumper quite a lot in my deployments (and observing 10x faster dumps). Now, the sad part is how to do faster recovery. It is quite easy to parallelize the data load (apparently, xargs supports running parallel processes):

printf '%s\n' *.sql.gz | xargs -P 16 -I % sh -c 'zcat % | mysql dbname'

Still, that doesn’t scale much – it only doubles the load speed compared to a single-threaded load, even on quite a powerful machine. The problem lives in the log_sys mutex – it is acquired for every InnoDB row operation to grab log sequence numbers (LSNs), so neither batching nor differentiation strategies really help, and LOAD DATA hits the same problem too. In certain cases I saw quite some spinning on other mutexes, and it seems that InnoDB currently doesn’t scale that well with lots of small row operations. Maybe someone some day will pick this up and fix it, that’s why we go to conferences and share our findings :)

oracle?

oracle!

While everyone is sleeping and preparing for four busy days of MySQL Conference, here, in Santa Clara – I started getting SMSes asking if I already learnt PL/SQL, and here, I’m jetlagged, and finding out that I work for another company.

If they don’t kill MySQL, InnoDB and MySQL will finally be together.

If they kill MySQL, I’ll have to look for a job. Will anyone use MySQL then, or will I have to fall back to the more generic non-MySQL work I’ve been doing for my hobby projects? Teeeheeee.

And for now, I see 6AM faces showing up, and greeting Oracle buddies – some jetlagged, some just early birds.

Percona performance conference

Heee, Baron announced “Percona Performance Conference”.

How do I feel when somebody schedules that on top of MySQL Conference? Bad. Seriously, this was totally uncool.

I sure understand that Percona folks have to give the same talk over and over again (of course, there are a few new things every year) and need a venue for that, but… it takes incredible work, preparation and research to come up with new topics too. I may sound harsh, but I really don’t feel well when people we should be working together with end up blackmailing instead.

Update: apparently I was seriously misguided back then. Percona seems to have been shunned out of MySQL Conference by the organizers, and this was their way to get back into the community.

Packing for MySQL Conference 2009

Yay, coming to Santa Clara again (4th conference in a row! :). I can’t imagine my year without a MySQL Conference trip anymore. To get a free ticket I’ll present on two topics: MySQL Security (lately I have had a related role, and have already prepared a bunch of information) and deep-inspecting MySQL with DTrace (a voodoo session for all happy Solaris and MacOSX users :). See you there?

Speaking at MySQL Conference again, twice

Yay, coming to the MySQL Conference again this year. This time with two different talks (the second got approved just a few days ago) on two distinct, quite generic topics:

  • Practical MySQL for web applications
  • Practical character sets

The abstracts were submitted weeks apart, so ‘practical’ being in both is completely accidental :) Still, I’ll try to cover the problems met and solutions used in various environments and practices – both as a support engineer at MySQL and as an engineer working on Wikipedia bits.

Coming to the US and talking about character sets should be an interesting experience. Though most English-speaking people can stick to ASCII and be happy, current attempts to produce multilingual applications lead to various unexpected performance, security and usability problems.

And of course, web applications end up introducing quite a new model of managing data environments, bringing in a new set of rules and throwing away traditional OLTP approaches. It is easy to slap another label on these and call it OLRP – on-line response processing. It needs data prepared for reads more than for writes (though a balance has to be maintained). It needs data digested for immediate responses. It needs lightweight (and lightning) accesses that do the minimum work. That’s where MySQL fits nicely, if used properly.

MySQL Conference 2007: Piggyback riding Wikipedia again. \o/

This year I’m coming to MySQL Conference again. Last year it was a marvelous experience, with customers, community and colleagues (CCC!) gathering together, so I didn’t want to miss it this year at any cost :-)

This year, instead of describing Wikipedia internals, I’ll be disclosing them – all the important bits, configuration files, code, ideas, problems, bugs and work being done through the whole stack – starting with distributed caches in front, distributed middleware somewhere in the middle and distributed data storage in the back end. It will take three hours or so – bring your pillows. :)