On MySQL replication, again…

There are few things one is supposed to know about MySQL replication in production, as manual doesn’t always discuss things openly.

This is small set of rules and advices I compiled (some apply to statement based replication only, but row based replication may benefit from one or two too):

  • Don’t use MyISAM. Or in fact, don’t use any non-transactional engine, if you care about your data. On either side, master or slave, or both – using non-transactional engines will cause data drift, as partially executed statements on master would be fully executed on slave, or simply stop replication. Of course, every crash has the capacity of getting your tables desynced from each other and there are absolutely no guarantees.
    This “don’t” can be easily transformed into “do” – just use InnoDB. Bonus point – one doesn’t need to take down the server, to clone a slave from a master :)
  • Don’t use temporary tables. MySQL manual is very funny about temporary tables in replication, it says “do not shut down the slave while it has temporary tables open.” That of course means that you’re not supposed to crash either – and more slaves there are, more of them will crash because of various reasons (e.g. solar flares).
    The operational overhead temporary tables add is huge – even though it may not show up in the benchmark.
  • Prefer simple, idempotent statements. If one can replay same statements multiple times without having database drift it doesn’t matter much if replication position is somewhat outdated. Updating rows by PK to fixed values, avoiding multiple table updates/deletes can allow to recover after crash much faster.
  • Set sync_binlog=1. This will introduce biggest bottleneck for transactions, but losing 30s of data may be worse (as this will force to do full slave resync in most cases). On really busy servers one can go for higher values (e.g. sync every 20 transactions), but 0 is asking for disaster.
  • Avoid long running updates. Though all long statement would cause on a master is slightly longer locking window and some performance pressure, once it gets replicated to the slave, all the updates will have to wait for the giant one to finish, in many cases rendering the slave useless.
    If something big has to be replicated, either split it into smaller chunks or run it directly against slaves (with binary logging on the master disabled for it).
    Splitting into smaller chunks can allow wait-for-slave logic to be implemented, thus not having any major impact on production environments.
  • Don’t use replicate-do-db. Or replicate-ignore-db. They both rely on database context, and statements like ‘INSERT INTO database.table’ will fail.
    If you need it, use replicate-wild-do-table=db.% – but even then, be careful with cross-database statements, that involve tables from multiple databases – as they may be filtered out…
  • Note the multiversioning. Some statements may become replication performance hogs because of long-running transactions (backups? reporting? ETL?) running on slaves – it may not need to rescan all the row versions on master, but they’d be still there on a slave. Such statements may need to be rewritten to avoid scanning gaps with too many invisible rows, or long transactions have to be split.

Though probably the best advice I can give now is “call your mysql vendor and ask for transactional replication“. Server, rack, datacenter crashes will not cause excessive work on fixing replication – it will be always consistent. One can even disable log syncing to disk then \o/

MySQL processlist phrase book

For your convenience, a short phrase book, starting with explanation of process states where MySQL is mostly working to look up data from tables:

  • “Sending data” – reading data from tables (or looking it up)
  • “Copying to tmp table” – reading data from tables (or looking it up)
  • “Copying to tmp table on disk” – query needs a rewrite
  • “statistics” – looking up data from tables
  • “Sorting result” – reading data from tables (or looking it up)

Locking is fun:

  • “update” – There is big chance that it is waiting for a lock, can be updating as well, on busy systems
  • “Locked” – Even bigger chance that it is MyISAM table waiting to be converted to InnoDB

Table opening is even funnier:

  • “Opening tables” – some other idiot thread is holding the LOCK_open or similar mutex
  • “Closing tables” – some other idiot thread is holding the LOCK_open or similar mutex
  • “Opening table” – the idiot thread (or not that smart DBA who thinks that table_cache should be set to current value)
  • “Flushing tables” – someone decided that flushing tables is good idea, but forgot to kill long running queries
  • “Waiting for tables” – same as “Flushing tables”

If you have replication:

  • “Connecting to server” – could not connect to server, waiting to retry
  • “Reconnecting after a failed master event read” – master and slave have same @server_id
  • “Registering slave on master” – master and slave have same @server_id
  • “Waiting to reconnect after a failed master event read” – master and slave have same @server_id
  • “Sending binlog event to slave” – actually, not sending binlog event – it is waiting for binlog mutex.

Few more tricky ones:

  • “Sleep” – last millisecond it was probably not sleeping, due to how process scheduling and kernel level buffering works probably it has finished right before kernel switched to the ‘SHOW PROCESSLIST’ thread.
  • “Writing to net” – usually, not writing to net (client has blocked) – actual writing to net is usually nearly non-blocking operation, as long as client doesn’t sleep.
  • “Killed” – actually, someone tried to kill it, but it doesn’t care, as it doesn’t check for thd->killed flag

And in the user column:

  • “unauthenticated user” – are you sure your DNS is working right?
  • “system user” – it isn’t user, these are (hopefully plural) replication threads, that don’t have any “user” credentials
  • “root” – either that is you running the PROCESSLIST, another session you forgot you are running or you forgot to create separate user for your application

I probably didn’t write quite a few important ones, but some of them are self-evident (such as “init” and “end”), and others probably will never show up :)

iostat -x

My favorite Linux tool in DB work is ‘iostat -x’ (and I really really want to see whenever I’m doing any kind of performance analysis), yet I had to learn its limitations and properties. For example, I took 1s snapshot from a slightly overloaded 16-disk database box:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           8.12    0.00    2.57   21.65    0.00   67.66

Device:  rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s \
sda     7684.00    19.00 2420.00  498.00 81848.00  5287.00 \

        avgrq-sz avgqu-sz   await  svctm  %util
           29.86    32.99   11.17   0.34 100.00

I pasted this somewhere on IRC, and got “doesn’t look too healthy” and that it is disk-bound. Now, to understand if it really is, one has to understand what iostat tells here.

First line of numbers shows that we’ve got plenty of CPU resources (thats because nowadays it is quite difficult to get a box with not enough CPU power, and I/O still seems to be bottleneck) – and we have more threads waiting for I/O than we have CPU execution (that sounds normal).

Now the actual per-disk statistics are where one should look. I used to prefer %util over general %iowait (I couldn’t really explain what %iostat is, and I can say what %util is). I don’t know why, but iostat has most interesting bits at the end, and not so interesting at the start:

  • %util: how much time did the storage device have outstanding work (was busy). In proper RAID environments it is more like “how much time did at least one disk in RAID array have something to do”. I’m deliberately excluding any kind of cache here – if request can be served from cache, the chance is quite negligible it will show up in %util, unlike in other values. What this also means – the RAID subsystem can be loaded from 6.25% (one disk doing the work) to 100% (all of them busy). Thats quite a lot of insight in single value of ‘100%’, isn’t it?
  • svctm: Though manual says “The average service time (in milliseconds) for I/O requests that were issued to the device.”, it isn’t exactly that when you look at multiple-disk systems. What it says is, “when your I/O subsystem is busy, how fast does it respond requests overall”. Actually, less you load your system, higher svctm is (as there’re less outstanding requests, and average time to serve them goes up). Of course, at some certain moment, when I/O becomes really overloaded, you can see svctm going up. One can tweak /sys/block/sda/queue/nr_requests based on this – to avoid overloading I/O controller, though that is really rarely needed.
  • await. One of my favorites – how fast do requests go through. It is just an average, how long it takes to serve a request for a device, once it gets into device queue, to final “OK”. Low = good, high = bad. There’re few gotchas here – even though different reads can have different performance properties (middle of disk, outer areas of disk, etc), the biggest difference is between reads and writes. Reads take time, writes can be instant (write caching at underlying layers..). As 80% of requests were reads, we can try to account for that by doing 11.17/0.8 math, to get 14ms figure. Thats quite high – systems that aren’t loaded can show ~5ms times (which isn’t that far away from 4ms rotation time of 15krpm disk).
  • avgqu-sz: Very very very important value – how many requests are there in a request queue. Low = either your system is not loaded, or has serialized I/O and cannot utilize underlying storage properly. High = your software stack is scalable enough to load properly underlying I/O. Queue size equal to amount of disks means (in best case of request distribution) that all your disks are busy. Queue size higher than amount of disks means that you are already trading I/O response time for better throughput (disks can optimize order of operations if they know them beforehand, thats what NCQ – Native Command Queueing does). If one complains about I/O performance issues when avgqu-sz is lower, then it is application specific stuff, that can be resolved with more aggressive read-ahead, less fsyncs, etc. One interesting part – avqu-sz, await, svctm and %util are iterdependent ( await = avgqu-sz * svctm / (%util/100)
  • avgrq-sz: Just an average request size. Quite often will look like a block size of some kind – can indicate what kind of workload happens. This is already post-merging, so lots of adjacent block operations will bump this up. Also, if database page is 16k, though filesystem or volume manager block is 32k, this will be seen in avgrq-sz. Large requests indicate there’s some big batch/stream task going on.
  • wsec/s & rsec/s: Sectors read and written per second. Divide by 2048, and you’ll get megabytes per second. I wanted to write this isn’t important, but remembered all the non-database people who store videos on filesystems :) So, if megabytes per second matter, these values are important (and can be seen in ‘vmstat’ output too). If not, for various database people there are other ones:
  • r/s & w/s: Read and write requests per second. This is already post-merging, and in proper I/O setups reads will mean blocking random read (serial reads are quite often merged), and writes will mean non-blocking random write (as underlying cache can allow to serve the OS instantly). These numbers are the ones that are the I/O capacity figures, though of course, depending on how much pressure underlying I/O subsystem gets (queue size!), they can vary. And as mentioned above, on rotational media it is possible to trade response time (which is not that important in parallel workloads) for better throughput.
  • rrqm/s & wrqm/s: How many requests were merged by block layer. In ideal world, there should be no merges at I/O level, because applications would have done it ages ago. Ideals differ though, for others it is good to have kernel doing this job, so they don’t have to do it inside application. Quite often there will be way less merges, because applications which tend to write adjacent blocks, also tend to wait after every write (see my rant on I/O schedulers). Reads however can be merged way easier – especially if application does “read ahead” block by block. Another reason for merges is simple block size mismatch – 16k database pages on top of 8k database pages will cause adjacent block reads, which would be merged by block layer. On some systems read of two adjacent pages would result in 1MB reads, but thats another rant :)
  • Device: – just to make sure, that you’re looking at the right device. :-)

So, after all this, the iostat output above tells us something like:

  • System has healthy high load (request queue has two-requests-per-disk)
  • Average request time is double the value one would expect from idle system, it isn’t too harmful, but one can do better
  • It is reading 80 40MB/s from disks, at 2420 requests/s. Thats quite high performance from inexpensive 2u database server (shameless plug: X4240 :)
  • High amount of merges comes from LVM snapshots, can be ignored
  • System is alive, healthy and kicking, no matter what anyone says :)

Memcached for small objects

Memcached quite often ends up as a store for very small objects (small key and some integer value), though it isn’t really designed to do this kind of work by default. Current memory management is based on slabs (200 of them), where objects are grouped by similar size – though actual sizes are pre-defined at startup based on few configuration parameters.

By default memcached would have slabs based on assumption, that smallest object size will have 48 bytes of data (thats without item header), and will increase the slab sizes in +25% steps:

slab class   1: chunk size    104 perslab 10082
slab class   2: chunk size    136 perslab  7710
slab class   3: chunk size    176 perslab  5957
slab class   4: chunk size    224 perslab  4681

So, in this case, it allocates at least 104 bytes per object, and next steps are way behind. Fortunately, there’re some quick steps to have better efficiency: Continue reading “Memcached for small objects”

Five minutes of MediaWiki performance tuning

MediaWiki is quite complex package, and some even trivial features are not the ones that should be enabled on sites having more load. Though it is quite modular, still lots of code has to be executed, and some of it requires additional steps to be done. Continue reading “Five minutes of MediaWiki performance tuning”

TCP tuning for your database

Lately lots of new fascinating technologies are used to build even more fascinating new solutions, and solutions nowadays even run on distributed environments, not just on single server. These servers usually communicate using TCP – standard, that has been here long before gigabit (or ten megabit) ethernet or dawn of LAMP, and needs to be kicked a bit, to work properly. In our environment we have over hundred of application servers which handle quite a number of requests, and put quite demanding stress on database servers. For that we had to change some of default kernel settings – and it did improve situation a bit. Here I’ll try to overview some of them…
Continue reading “TCP tuning for your database”