On binlogs and datacenters

Once MySQL is deployed inside a datacenter environment (i.e. forms a cloud ;-), replication becomes a major feature. It is used to maintain hot copies, standby copies, read-only copies, to invalidate external systems, to replicate to external systems, etc. If this functionality is broken, the datacenter is broken – components are no longer synchronized, invalidations are not done, data is not consistent.

From a performance perspective, replication not working properly results in unusable slaves, so load cannot be spread. This results in higher load on other machines, including the master (especially on the master, if the environment needs stronger consistency guarantees).

Judging by the importance of replication in MySQL deployments, it should attract as much performance engineering as InnoDB and other critical pieces. Though slave-side replication performance is being improved in 5.6, the master side is not (well, group commit may help a bit, but not that much).

The major problem here is LOCK_log. It is a mutex that guards a MySQL log file (Update: the original sentence said it guards all MySQL log files – general query log, binary log, etc; actually it is a per-log-type mutex, a class private variable, not shared among the different log types) – only one read or write can happen on a given log at a time. Other logs aside (their impact can be minimized), the binary log finds this mutex extremely problematic, because:

  • It stalls COMMITs
  • It stalls slave binary log reads

One can argue that once there are fewer commits, there will be less binary log activity, and that will result in some sort of equilibrium. Unfortunately, one doesn’t get anywhere close to decent efficiency, as there are too many situations when LOCK_log is held for way too long:

  • Slow binary log reads (if the logs have already been purged out of cache – highly likely)
  • Slow binary log writes (if the controller cache is saturated by the page flush workload)
  • Slow memory allocation for larger events (on systems with malloc scalability issues)
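
To make the contention pattern concrete, here is a toy Python model (my illustration, nothing to do with the actual MySQL code): commits and the binlog-reading dump thread share one lock, so a single slow read stalls every commit queued behind it.

# Toy model of a LOCK_log-style mutex: not MySQL code, just the locking pattern.
import threading, time

LOCK_log = threading.Lock()
commit_latencies = []

def commit():
    start = time.time()
    with LOCK_log:              # binlog append happens under the log mutex
        time.sleep(0.001)       # fast write
    commit_latencies.append(time.time() - start)

def dump_thread_read():
    with LOCK_log:              # slave binlog read takes the same mutex
        time.sleep(0.5)         # slow read, e.g. log already purged from cache

reader = threading.Thread(target=dump_thread_read)
reader.start()
writers = [threading.Thread(target=commit) for _ in range(10)]
for w in writers:
    w.start()
for w in writers:
    w.join()
reader.join()

print("worst commit latency: %.3fs" % max(commit_latencies))
# With the slow reader holding the mutex, commits stall for up to ~0.5s.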

Saturating a RAID controller cache isn’t that difficult – and even if one dedicates separate disks to a separate log disk array (what a waste!), it doesn’t mean that off-the-shelf hardware will supply a separate cache for that disk array. Without a dedicated cache or high-priority log processing, one ends up blocking until the disks are available for a write.

Read efficiency is a concern here too – MySQL uses a statically sized 8k buffer for all binary log reads, which means that on a highly pressured system read-ahead may not work, and there will be many more I/O operations/seeks while holding the precious mutex. Even if reading an 8k page from memory buffers is very fast, slave threads will be forced to wait for other, much slower operations (a lagging slave reading yesterday’s data? a high write workload?) to finish.

If binary log writes are on a separate device, one can inspect its utilization with iostat. Usually there will be only one outstanding write or read, so the busy percentage shows quite well how contended this resource is. A double-digit number already says that operations aren’t instant, and replication slaves can be subject to log starvation, that is – not replicating.
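
If iostat is not handy, the same busy percentage can be sampled straight from /proc/diskstats; a rough sketch (Linux only – the device name below is just a placeholder for wherever your binlogs live):

# Rough busy-% sampler for a block device, reading /proc/diskstats (Linux).
# The device name is an assumption - use whatever holds your binlogs.
import time

DEVICE = "sdb"      # hypothetical dedicated binlog device
INTERVAL = 1.0      # seconds

def io_ticks(device):
    with open("/proc/diskstats") as f:
        for line in f:
            fields = line.split()
            if fields[2] == device:
                return int(fields[12])   # milliseconds spent doing I/O
    raise ValueError("device %s not found" % device)

while True:
    before = io_ticks(DEVICE)
    time.sleep(INTERVAL)
    after = io_ticks(DEVICE)
    busy_pct = (after - before) / (INTERVAL * 1000.0) * 100
    print("%s busy: %5.1f%%" % (DEVICE, busy_pct))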

From the operations side, the solution to this is somehow obtaining a dedicated cache or a fast device for binary log writes. One can simply disable the write cache for the InnoDB data partition and use lots of parallel InnoDB page write threads, so that they win the fight against read threads. Using flashcache or similar technology to absorb writes can be beneficial in this case.

On the other hand, MySQL could be changed not to have a giant log mutex – binary logs are append-only structures, and guarding just the tail and rotation events could be enough, with an additional atomic variable or two to make sure that readers are fine.
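
A toy sketch of that idea (pure Python, not a server patch): writers serialize only among themselves and publish the new tail offset, while readers never touch the writers’ lock and simply read up to the last published offset.

# Toy sketch of lock-light append-only log access: writers serialize among
# themselves, readers only consult a published tail offset.
import threading

class AppendOnlyLog(object):
    def __init__(self, path):
        self.f = open(path, "ab")
        self.write_lock = threading.Lock()   # serializes appenders only
        self.committed_tail = 0              # published after each append

    def append(self, payload):
        with self.write_lock:
            self.f.write(payload)
            self.f.flush()
            self.committed_tail = self.f.tell()   # publish the new tail

    def read_from(self, offset):
        # Readers never take write_lock; they read up to the published tail.
        tail = self.committed_tail
        if offset >= tail:
            return b""
        with open(self.f.name, "rb") as r:
            r.seek(offset)
            return r.read(tail - offset)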

Also, having compressed binlogs on disk would reduce the I/O pressure caused by writes and reads, and there’s a possible gain from not having to recompress the logs for each remote reader.

Then the datacenter will be happy, and datacenter users even happier.

on MySQL replication prefetching

For the impatient ones, or those who prefer code to narrative, go here. This is long overdue anyway, and Yoshinori already beat me to it, hehe…

Our database environment is quite busy – there are millions of row changes a second and millions of I/O operations a second, and the impact of that can be felt at each shard. Especially as we also have to replicate to other datacenters, single-threaded replication on MySQL becomes a real bottleneck.

We use multiple methods to understand and analyze replication lag composition – simple replication thread state sampling via the MySQL processlist helps to understand logical workload components (and work in that field yields great results), and pstack/GDB based replication thread sampling shows server internal behavior quite well too (a similar technique was used for accept thread visualisation).
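
The GDB part needs very little tooling – a hedged sketch of such a stack sampler (my illustration, not our production script; it assumes gdb is installed and that you are allowed to attach to the mysqld pid):

# Poor-man's stack sampler: attach gdb a few times, aggregate similar stacks.
# Assumes gdb is installed and you may attach to the mysqld pid given as argv[1].
import collections, subprocess, sys, time

pid = sys.argv[1]
samples = collections.Counter()

for _ in range(20):
    out = subprocess.check_output(
        ["gdb", "-p", pid, "-batch", "-ex", "thread apply all bt"],
        stderr=subprocess.STDOUT).decode("utf-8", "replace")
    # Keep only function names; collapse each thread's top frames into one key.
    for stack in out.split("Thread "):
        frames = [l.split(" in ")[-1].split(" (")[0]
                  for l in stack.splitlines() if l.startswith("#")]
        if frames:
            samples[",".join(frames[:5])] += 1
    time.sleep(0.5)

for stack, count in samples.most_common(10):
    print(count, stack)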

The biggest problem with a single replication thread is that it has to read data to execute queries (rather than applying physical page deltas like PG, or just appending to files like HBase, it does logical edits to page data) – we can observe 95% of process time spent in that state. As there’s generally just one outstanding data read per replication thread, other workload hitting the machine will also make replication reads slower.

Generally, the obvious way to deal with slow I/O is to issue more outstanding parallel requests, and the only way to do that, apart from parallel replication, is to predict what will be needed in the future and try to fetch it ahead of time.

Many many moons ago Paul Tuckfield talked about the YouTube replication prefetcher – it would take write statements yet to be executed in the relay logs, convert them to SELECTs, and run them before the replication thread needs that data. He still says that was one of his most satisfying quick hacks :-)
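
The core rewrite is tiny; here is a hedged illustration (my sketch, not Paul’s code – it only handles simple single-table UPDATE … SET … WHERE … statements):

# Illustration of the prefetch rewrite: turn a pending UPDATE into a SELECT
# that drags the same rows (and the index used for the lookup) into the buffer pool.
import re

UPDATE_RE = re.compile(
    r"^\s*UPDATE\s+(?P<table>\S+)\s+SET\s+.+?\s+WHERE\s+(?P<where>.+?)\s*(?:LIMIT\s+\d+\s*)?$",
    re.IGNORECASE | re.DOTALL)

def update_to_select(statement):
    m = UPDATE_RE.match(statement)
    if not m:
        return None            # not a simple single-table UPDATE - skip it
    return "SELECT * FROM %s WHERE %s" % (m.group("table"), m.group("where"))

print(update_to_select("UPDATE profile SET hits = hits + 1 WHERE user_id = 12345"))
# -> SELECT * FROM profile WHERE user_id = 12345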

Maatkit (now Percona Toolkit) introduced mk-slave-prefetch (I played with it back in 2008, though I didn’t put it into operation at that time), and eventually it looked like a reasonable option for prefetching statements on our database cluster.

5000 lines of Perl is not the easiest code to work with (or to debug), so the journey was quite bumpy. We got it working in some shape eventually, but Baron, the original author, has something to say about it:

Please don’t use mk-slave-prefetch on MySQL unless you are Facebook. Or at least don’t tell your friends, so they won’t use it.

Anyway, our update rate would saturate mksp.pl if we ran anything fancier on it, so it was a constant balancing act, in which looking at the code was something nobody wanted to do ;-) Still, it was (and is) helping us, so getting rid of it wasn’t possible either.

At some point we decided to run an experiment – what if we executed the statements, then rolled them back? So I did a quick implementation of that method from scratch in Python – the resulting piece of code was relatively small and fun to experiment with.
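
A hedged reconstruction of that idea (not the actual code we ran; it assumes a DB-API driver such as MySQLdb, a reasonably recent InnoDB for the session-level lock wait timeout, and a hypothetical upcoming_statements() helper that yields write statements the SQL thread hasn’t executed yet):

# Hedged reconstruction of the execute-then-rollback prefetcher idea.
# upcoming_statements is a hypothetical callable yielding not-yet-applied writes
# (e.g. parsed out of the relay log).
import MySQLdb

def prefetch(upcoming_statements, **conn_args):
    conn = MySQLdb.connect(**conn_args)
    cur = conn.cursor()
    # Give up quickly if we collide with the real replication thread
    # (session-level setting - assumes a reasonably recent InnoDB).
    cur.execute("SET SESSION innodb_lock_wait_timeout = 1")
    for statement in upcoming_statements():
        try:
            cur.execute("BEGIN")
            cur.execute(statement)      # pages are read in as a side effect
        except MySQLdb.Error:
            pass                        # lock wait timeout, duplicate key, ...
        finally:
            conn.rollback()             # throw the change away, keep the pages warm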

There were multiple problems with this approach – one complication was that queries grabbed locks for the duration of the statement, and some of those locks would collide with what the actual replication thread was doing. Fixing that would require an immediate lock wait timeout or a transaction kill for the prefetcher thread – so, a relatively deep dive into InnoDB. Another problem was internal InnoDB lock contention on rollbacks – rollback was an expensive operation, and the benefit of pages being read in was negated by rollback segment lock contention. Fixing that is even more extensive InnoDB work (though probably some people would like their rollbacks to be efficient ;-)

At that point we came up with the idea that the InnoDB codebase could be instrumented not to do any real work on updates – just page the data in and return to the caller, and if any change accidentally slips in, the commit can fail. That looked like a feasible project for the future.

At some point we were rolling out a new database tier for one product, which was supposed to have a really high volume of changes, all coming in a uniform format. It took less than an hour (as most of the work had already been done for the rollback-based one) to come up with a prototype that would efficiently extract literals from the uniform statements, then use them for prefetching.
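
Roughly, it looked like this (a hedged sketch with made-up table, column and index names – the real templates came from the product’s uniform query patterns, and the resulting SELECTs were spread over a pool of worker threads):

# Hedged sketch of template-based prefetching: the write workload is uniform,
# so each known statement shape maps to the SELECTs that warm up the pages
# (and secondary indexes) it will touch. Names here are made up for illustration.
import re

TEMPLATES = [
    (re.compile(r"UPDATE likes SET count = count \+ 1 WHERE object_id = (\d+)", re.I),
     [r"SELECT * FROM likes WHERE object_id = \1"]),
    (re.compile(r"DELETE FROM feed WHERE user_id = (\d+) AND story_id = (\d+)", re.I),
     [r"SELECT * FROM feed WHERE user_id = \1 AND story_id = \2",
      r"SELECT * FROM feed FORCE INDEX (story_id) WHERE story_id = \2"]),
]

def prefetch_selects(statement):
    for pattern, selects in TEMPLATES:
        m = pattern.search(statement)
        if m:
            return [m.expand(s) for s in selects]
    return []     # unknown shape - leave it to the SQL thread

print(prefetch_selects("DELETE FROM feed WHERE user_id = 7 AND story_id = 42"))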

This method worked fine – at a tiny fraction of the resources used by mk-slave-prefetch we were preloading secondary indexes and could run with relatively extensive parallelism.

Meanwhile, our main database cluster was getting a more and more uniform query workload, thanks to various libraries, abstractions and middleware – so a day of work on the lowest hanging fruit provided relatively good coverage of the workload.

We didn’t stop mksp.pl – it still provided some coverage for various odd cases, which were time-consuming to work on manually.

There were a few other problems with the new method – apparently we were targeting our SELECTs too accurately – UPDATEs were spending plenty of time in records_in_range. Additionally, the optimistic update path was reading in pages that the SELECTs wouldn’t (due to an inefficiency in the B-tree locking code). There were also some odd reads done for INSERTs.

Also, SELECTs use indexes less efficiently – InnoDB can pinpoint entries in secondary indexes by using PK values, yet that ability is not exposed to the SQL layer, so prefetching on indexes that don’t have all their fields explicitly defined is not that easy.

In theory, all these issues are supposed to be ‘fixed’ by the fake changes concept. Percona recently implemented it in their releases, and we started experimenting with those changes. It is still not that mature a concept, so we will be revisiting how things are or should be done, but for now the test results are quite positive (we made some changes to reduce locking and avoid a deadlock in REPLACE INTO, among other things).

I still observe I/Os done by the main replication thread, so we’re not in perfect shape yet, but the method seems to be working relatively well (at least it definitely speeds up replication). We still have to do lots of testing to qualify this for large-scale production, but it may allow way more write workload on our machines until we get parallel replication all around.

Our code for the custom query, fake changes, or rollback prefetcher can be checked out from a public repo together with other tools (oops, Bazaar doesn’t give easy access to subdirectories):

bzr co lp:mysqlatfacebook/tools; cd prefetch

Or browse it online.

P.S. There’s also Tungsten Replicator for those who don’t want to wait for 5.6 parallel replication.

On database write workload profiling

I always have difficulties with complex analysis schemes, so I fall back to something somewhat easier. Or much easier. Here I will explain a super-powerful method of database write workload analysis.

Doing any analysis on master servers is already too complicated, as instead of analyzing write costs one can become too obsessed with locking, and there is sometimes an uncontrollable amount of workload hitting the server besides writes. Fortunately, slaves are much better targets – not only because writes there are single-threaded, thus exposing every costly I/O as a time component, but also because one can drain traffic from slaves, or send more to them to create a more natural workload.

Also, there can be multiple states of slave load:

  • Healthy, always at 0-1s lag, write statements are always immediate
  • Spiky, usually at 0s lag, but with jumps due to occasionally occurring slow statements
  • Lagging, because of read load stealing I/O capacity
  • Lagging (or not catching up fast enough), because it can’t keep up with writes anymore, even with no read load

Each of these states is interesting in itself, and may have slightly different properties, but pretty much all of them are quite easy to look at using replication profiling.

The code for it is quite straightforward:

(while true; do
echo 'SELECT info FROM information_schema.processlist
WHERE db IS NOT NULL AND user="system user"; '
sleep 0.1; done) | mysql -BN | head -n 100000 > replication-sample

There are multiple ways to analyze it; e.g., finding the slowest statements is as easy as:

uniq -c replication-sample | sort -nr | head

More advanced methods may group statements by statement type, table, user ID, or any other metadata embedded in query comments – and a lot of value can be obtained from ad-hoc analysis with a simple ‘grep -c keyword replication-sample’ – to understand what share of your workload a certain feature has.
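
For a bit more structure than grep, a few lines of Python over the same sample file go a long way (a sketch – extend the regex for whatever comment metadata your queries carry):

# Group the processlist sample by statement type and table name.
# A sketch - the regex only covers the common write statement shapes.
import collections, re, sys

counts = collections.Counter()
pattern = re.compile(
    r"^\s*(INSERT|UPDATE|DELETE|REPLACE)\s+(?:IGNORE\s+)?(?:INTO\s+|FROM\s+)?([`\w.]+)",
    re.IGNORECASE)

for line in open(sys.argv[1] if len(sys.argv) > 1 else "replication-sample"):
    m = pattern.match(line)
    if m:
        counts[(m.group(1).upper(), m.group(2).strip("`"))] += 1

for (verb, table), count in counts.most_common(20):
    print("%8d  %-7s %s" % (count, verb, table))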

I already mentioned that there are different shapes of slave performance, and it is easy to test a slave in each of those shapes. One method is actually stopping a slave for a day, then running the sampler while it is trying to catch up. It will probably have much more buffer pool space usable for write operations, so keep that in mind – certain operations that depend on larger buffer pools will be much faster.

This is a really simple yet remarkably powerful method that allows quite deep workload analysis without spending too much time on statistics features. As there’s no EXPLAIN for UPDATE or DELETE statements, longer, coarser samples also allow detecting deviations from good query plans.

Systematic use of it has revealed quite a few important issues that had to be fixed – issues which were not that obvious from the general statistics view. I like.

On MySQL replication, again…

There are a few things one is supposed to know about MySQL replication in production, as the manual doesn’t always discuss them openly.

This is a small set of rules and advice I compiled (some apply to statement-based replication only, but row-based replication may benefit from one or two of them too):

  • Don’t use MyISAM. Or in fact, don’t use any non-transactional engine if you care about your data. Using non-transactional engines on either side – master, slave, or both – will cause data drift, as partially executed statements on the master would be fully executed on the slave, or will simply stop replication. Of course, every crash has the capacity to desync your tables from each other, and there are absolutely no guarantees.
    This “don’t” can easily be transformed into a “do” – just use InnoDB. Bonus point – one doesn’t need to take down the server to clone a slave from a master :)
  • Don’t use temporary tables. The MySQL manual is very funny about temporary tables in replication; it says “do not shut down the slave while it has temporary tables open.” That of course means you’re not supposed to crash either – and the more slaves there are, the more of them will crash for various reasons (e.g. solar flares).
    The operational overhead temporary tables add is huge – even though it may not show up in benchmarks.
  • Prefer simple, idempotent statements. If one can replay the same statements multiple times without causing database drift, it doesn’t matter much if the replication position is somewhat outdated. Updating rows by PK to fixed values and avoiding multi-table updates/deletes allows recovering from a crash much faster.
  • Set sync_binlog=1. This will introduce the biggest bottleneck for transactions, but losing 30s of data may be worse (as it will force a full slave resync in most cases). On really busy servers one can go for higher values (e.g. sync every 20 transactions), but 0 is asking for disaster.
  • Avoid long-running updates. Though all a long statement causes on a master is a slightly longer locking window and some performance pressure, once it gets replicated to the slave, all subsequent updates have to wait for the giant one to finish, in many cases rendering the slave useless.
    If something big has to be replicated, either split it into smaller chunks or run it directly against the slaves (with binary logging for it disabled on the master).
    Splitting into smaller chunks also allows wait-for-slave logic to be implemented, thus avoiding any major impact on production environments (see the sketch after this list).
  • Don’t use replicate-do-db. Or replicate-ignore-db. They both rely on the current database context, and statements like ‘INSERT INTO database.table’ will fail.
    If you need filtering, use replicate-wild-do-table=db.% – but even then, be careful with cross-database statements that involve tables from multiple databases – they may be filtered out…
  • Note the multiversioning. Some statements may become replication performance hogs because of long-running transactions (backups? reporting? ETL?) running on slaves – a statement may not need to rescan all the row versions on the master, but they would still be there on a slave. Such statements may need to be rewritten to avoid scanning gaps with too many invisible rows, or the long transactions have to be split up.
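
To make the “split it into smaller chunks” advice concrete, here is a hedged sketch of a PK-bounded chunked DELETE with a crude wait-for-slaves step between chunks (the DB-API/MySQLdb connections, the ‘id’ primary key and the lag threshold are assumptions – adapt to your environment):

# Hedged sketch: run a big DELETE as PK-ordered chunks, pausing when slaves lag.
# master is a DB-API connection to the master, slaves a list of slave connections.
import time

CHUNK = 10000
MAX_LAG = 5      # seconds of slave lag tolerated before pausing

def slaves_caught_up(slaves):
    for conn in slaves:
        cur = conn.cursor()
        cur.execute("SHOW SLAVE STATUS")
        row = cur.fetchone()
        columns = [d[0] for d in cur.description]
        lag = row[columns.index("Seconds_Behind_Master")]
        if lag is None or lag > MAX_LAG:
            return False
    return True

def chunked_delete(master, slaves, table, condition):
    # Assumes an integer PK named `id`, so each chunk is deterministic
    # and replays identically on the slaves.
    cur = master.cursor()
    while True:
        deleted = cur.execute(
            "DELETE FROM %s WHERE %s ORDER BY id LIMIT %d" % (table, condition, CHUNK))
        master.commit()
        if deleted == 0:
            break
        while not slaves_caught_up(slaves):
            time.sleep(1)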

Though probably the best advice I can give now is “call your MySQL vendor and ask for transactional replication”. Server, rack, and datacenter crashes will not cause excessive work on fixing replication – it will always be consistent. One can even disable log syncing to disk then \o/

Evil replication management

When one wants to script automated replication chain building, certain things are quite annoying, like immutable replication configuration variables. For example, at certain moments log_slave_updates is sorely needed, and this is what the server says:

mysql> show variables like 'log_slave_updates';
+-------------------+-------+
| Variable_name     | Value |
+-------------------+-------+
| log_slave_updates | OFF   |
+-------------------+-------+
1 row in set (0.00 sec)

mysql> set global log_slave_updates=1;
ERROR 1238 (HY000): Variable 'log_slave_updates' is a read only variable

Of course, there are a few options: roll an in-house fork (heheeeee!), restart your server and keep warming up your tens of gigabytes of cache arenas, or wait for MySQL to ship a feature change in the next major release. Then there are evil tactics:

mysql> system gdb -p $(pidof mysqld)
                       -ex "set opt_log_slave_updates=1" -batch
mysql> show variables like 'log_slave_updates';
+-------------------+-------+
| Variable_name     | Value |
+-------------------+-------+
| log_slave_updates | ON    |
+-------------------+-------+
1 row in set (0.00 sec)

I don’t guarantee the safety of this while the slave is running, but… stopping and starting slave threads is somewhat cheaper than stopping and starting a big database instance, right?

What else can we do?

mysql> show slave status \G
...
     Replicate_Do_DB: test
...
mysql> system gdb -p $(pidof mysqld)
          -ex 'call rpl_filter->add_do_db(strdup("hehehe"))' -batch
mysql> show slave status \G
...
      Replicate_Do_DB: test,hehehe
...

It is actually possible to add all sorts of filters this way; rpl_filter.h can be a good reference :) So now, when you want to throw out some data from your slaves, a restart isn’t needed. Unfortunately, deleting entries isn’t possible via rpl_filter methods, but you can always edit the base_ilists, can’t you?

P.S. Having this functionality inside the server would definitely be best.

5.0 journal: various issues, replication prefetching, our branch

First of all, I have to apologize for some of my previous remarks on 5.0 performance. I passed ‘-g’ in CFLAGS to my build, and that replaced the default ‘-O2’. Compiling MySQL without -O2 or -O3 makes it slower. Apparently, much slower.

A few migration notes – once I loaded the schema with the character set set to binary (because we treat it as such), all VARCHAR fields were converted to VARBINARY, which I expected, but more annoying was CHAR being converted to BINARY – which pads data with bytes. The solution was converting everything to VARBINARY – it doesn’t actually have much overhead. TRIM('' FROM field) eventually helped too.

The other problem I hit was a paramy operation issue. One table definition failed, so paramy exited immediately – though it had a few more queries remaining in the queue – so the most recent data from some tables was not inserted. The cheap workaround was adding the -f option, which just ignores errors. I had to reload all the data though…

I had real fun experimenting with auto-inc locking. As it was a major problem for the initial paramy tests, I hacked InnoDB not to acquire the auto-inc table-level lock (that was just commenting out a few lines in ha_innodb.cc). After that change CPU use went to >300% instead of ~100% – so I felt I had nearly done the right thing. Interestingly though, the profile showed that quite a lot of CPU time was spent in synchronization – mutexes and such – so I hit SMP contention at just 4 cores. Still, the import was faster (or at least felt faster), and I already have a few cheap tricks in mind to make it faster still (like disabling the mempool). The easiest way to make this manageable is simply to provide a global variable for auto-inc behavior, though a more elegant solution would attach to ‘ALTER TABLE … ENABLE KEYS’ or something similar.

Once the data was loaded, catching up on replication was another task worth a few experiments. As the data image was already quite a few days old, I had at least a few hours to try to speed up replication. Apparently, Jay Janssen’s prefetcher has disappeared from the internets, so the only one left was Maatkit’s mk-slave-prefetch. It rewrites UPDATEs into simple SELECTs, but executes them on just a single thread, so the prefetcher was only a few seconds ahead of the SQL thread – and the speedup was less than 50%. I made a quick hack that parallelized the task, and it managed to double replication speed.

Still, there are a few problems with the concept – it preheats just the one index used for the lookup, and doesn’t touch secondary indexes. Actually analyzing the query, identifying what changes and where, and sending a SELECT with UNIONs that preheats every index affected by the write query could be more efficient. Additionally, it would make the adaptive hash index and insert buffer unnecessary for that work – as all required buffer pool pages would already be in memory – thus leading to fewer spots of mutex contention.
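
To illustrate the idea, here is a hand-written sketch (made-up table and index definitions, values assumed to be already-escaped SQL literals) of building one statement that preheats every index a write would touch:

# Illustration: build one UNION ALL statement that preheats every index
# touched by a write to a row. Table and index definitions are made up.
INDEXES = {
    "users": [
        ("PRIMARY", ["user_id"]),
        ("email_idx", ["email"]),
        ("geo_idx", ["country", "city"]),
    ],
}

def preheat_statement(table, row):
    # row maps column name -> already-escaped SQL literal used by the write.
    selects = []
    for index_name, columns in INDEXES[table]:
        conditions = ["%s = %s" % (c, row[c]) for c in columns if c in row]
        if conditions:
            selects.append(
                "(SELECT * FROM %s FORCE INDEX (%s) WHERE %s LIMIT 1)"
                % (table, index_name, " AND ".join(conditions)))
    return " UNION ALL ".join(selects)

print(preheat_statement("users",
    {"user_id": "7", "email": "'a@example.com'", "country": "'LT'", "city": "'Vilnius'"}))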

We also managed to hit a few optimizer bugs, related to casting changes in 5.0. Back in 4.0 it was safe to pass all constants as strings, but 5.0 started making poor decisions then (like filesorting instead of using an existing ref lookup index, etc). I will have to review why this happens, whether it makes sense, and if not – file a bug. For now, we have some workarounds, and don’t seem to be bitten too much by the behavior.

Anyway, in the end I directed half of this site’s core database off-peak load to this machine, and it was still keeping up with replication at ~8000 queries per second. The odd thing is that though 5.0 eats ~30% more CPU, it shows up in profiling as the faster-responding box. I guess we’re just doing something wrong.

I’ve published our MySQL branch at Launchpad. Do note, the release process is somewhat ad-hoc (or non-existent), and the engineer doing it is a clueless newbie. :)

I had plans to do some more scalability tests today, but apparently the available server is just a two-core machine, so there’s not much I can do on it. I guess another option is grabbing some 8-core application server and playing with it. :)

Trainwreck: external MySQL replication agent

I wanted to work more on the actual project before writing about it, but I’m lazy, and the dear community may not be.

At Wikimedia we have one database server which replicates from multiple (like 15!) masters. It even splits the replication streams by database, applying changes in parallel.

All this is done by an external replication agent, Trainwreck. It is public-domain software written by River; it doesn’t have much documentation and works only on Solaris (River likes Solaris), unless you comment out all the process management blocks, which use doors and other Solaris-specific APIs.

It lives in Wikimedia SVN, and can be checked out using:

svn co http://svn.wikimedia.org/svnroot/mediawiki/trunk/tools/trainwreck/

It sits there, maintained just for the needs of that specific single server (ok, there might be two or three), so if anyone wants to make it available to a broader audience, feel free to fork the project to some community-oriented place and add all the nice features you need. :)