on MySQL replication prefetching

For the impatient ones, or ones that prefer code to narrative, go here. This is long overdue anyway, and Yoshinori already beat me, hehe…

Our database environment is quite busy – there’re millions of row changes a second, millions of I/O operations a second and impact of that can be felt at each shard. Especially, as we also have to replicate to other datacenters, single threaded replication on MySQL becomes a real bottleneck.

We use multiple methods to understand and analyze replication lag composition – a simple replication thread state sampling via MySQL processlist helps to understand logical workload components (and work in that field yields great results), and pstack/GDB based replication thread sampling shows server internal behavior quite well too (a similar technique was used for accept thread visualisation).

The biggest problem with single replication thread is that it has to read data to execute queries (rather than applying physical page deltas, like PG or just appending to files like HBase, it does logical edits to page data) – we can observe 95% of process time at that state. As generally there’s just one outstanding data read per replication thread, other workload hitting the machine will also make replication reads slower.

Generally, the obvious way to deal with slow I/O is issue more outstanding parallel requests, and the only way to do that apart from parallel replication, is to predict what will be needed in future and try to fetch that.

Many many moons ago Paul Tuckfield discussed about the Youtube replication prefetcher – it would take write statements yet to be executed in relay logs,  convert them to SELECTs and run them before replication thread needs that data. He still says that was one of most satisfying quick hacks :-)

Maatkit (now Percona Toolkit) introduced mk-slave-prefetch (I played with it back in 2008, didn’t put it into operation at that time though), and eventually that looked like a reasonable option for prefetching statements on our database cluster.

5000 lines of Perl is not the easiest code to work with (or to debug), so the journey was quite bumpy. We got it working in some shape, eventually, but Baron, original author, has something to say about it:

Please don’t use mk-slave-prefetch on MySQL unless you are Facebook. Or at least don’t tell your friends, so they won’t use it.

Anyway, our updates rate would saturate mksp.pl if we used anything fancier on it, so it was a constant balancing act, in which looking at the code was something nobody wanted to do ;-) Still, it was (and is) helping us, so getting rid of it wasn’t possible either.

At some point in time we decided to make an experiment – what if we executed statements, then rolled them back – so I did a quick implementation of that method from scratch in Python – resulting piece of code was relatively small and fun to experiment with.

There were multiple problems with such approach – one complication was that queries were grabbing locks for the duration of the statement, and some of those locks would collide with what actual replication thread is doing. Fixing that would require immediate lock wait timeout or transaction kill for prefetcher thread – so, relatively deep dive into InnoDB. Another problem was internal InnoDB lock contention on rollbacks – that was expensive operation, and benefits of pages read in were negated by rollback segments lock contention. Fixing that is even more extensive InnoDB work (though probably some people would like their rollbacks to be efficient ;-)

At that moment we came up with the idea, that InnoDB codebase could be instrumented to not do any real work on updates – just page data in and return to the caller, and if any change accidentally slips in, commits can fail. That looked like a feasible project for the future.

At some point in time we were rolling out a new database tier for one product, which was supposed to have really high volume of changes, but all coming in a uniform format. It took less than hour (as most of the work has been done to create rollback-based one) to come up with a prototype that would efficiently extract literals from uniform statements, then use them for prefetching.

This method worked fine – at tiny fraction of resources used by mk-slave-prefetch we were preloading secondary indexes and could have relatively extensive parallelism.

Meanwhile, our main database cluster was having more and more uniform query workload, thanks to various libraries, abstractions and middleware – so a day of work on lowest hanging fruits provided relatively good coverage of the workload.

We didn’t stop mksp.pl – it still provided some coverage for various odd cases, which were time-consuming to work on manually.

There were few other problems with the new method – apparently we were targeting our SELECTs too accurately – UPDATEs were spending plenty of time in records_in_range. Additionally, optimistic update path was reading in pages that selects wouldn’t (due to inefficiency in B-Tree locking code). There were some odd reads done for INSERTs.

Also, SELECTs are using indexing less efficiently – InnoDB can pinpoint entries in secondary indexes by using PK values, yet that ability is not exposed to SQL layer, so prefetching on indexes that don’t have explicitly defined all fields within them is not that easy.

In theory, all these issues are supposed to be ‘fixed’ by fake changes concept. Percona recently implemented it in their releases, and we started experimenting with those changes. It is still not that mature concept, so we will be revisiting how things are or should be done, but for now test results are quite positive (we did some changes to reduce locking and avoid deadlock in REPLACE INTO, among other things).

I still observe I/Os done by main replication thread, so we’re not in perfect shape yet, but method seems to be working relatively well (at least it definitely speeds up replication). We still have to do lots of testing to qualify this for large-scale production, but this may allow way more write workload on our machines until we get parallel replication all around.

Our code for custom query, fake changes or rollback prefetcher can be checked out from a public repo together with other tools (oops, Bazaar doesn’t give easy access to subdirectories:

bzr co lp:mysqlatfacebook/tools; cd prefetch

Or browse it online.

P.S. There’s also Tungsten Replicator for ones who don’t want to wait for 5.6 parallel replication.