On primary keys

5.1.46 has this change:

Performance: While looking for the shortest index for a covering index scan, the optimizer did not consider the full row length for a clustered primary key, as in InnoDB. Secondary covering indexes will now be preferred, making full table scans less likely.

In other words, if a secondary index covers all the columns (which is quite common on m:n mapping tables), the optimizer will now use it rather than the PK. As I have spent my time getting indexing right, basing PKs on the primary access pattern and secondary keys on the secondary access pattern, I hereby do not welcome a change that suddenly reverses this behavior in a late GA version.

Not good, when mysqldump queries end up taking six days instead of the previous half an hour. Not good at all.
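
To make this concrete, here is a minimal sketch – table and index names are made up – of the kind of m:n mapping table where the change bites: the secondary index covers every column, so after 5.1.46 a full scan such as the one mysqldump issues may walk the secondary index instead of the clustered PK.

-- hypothetical m:n mapping table, in the spirit of the ones described above
CREATE TABLE page_links (
    from_page INT UNSIGNED NOT NULL,
    to_page   INT UNSIGNED NOT NULL,
    PRIMARY KEY (from_page, to_page),   -- clustered, matches the primary access pattern
    KEY to_from (to_page, from_page)    -- secondary, covers every column of the table
) ENGINE=InnoDB;

-- a full "covering" scan; the optimizer used to walk the clustered PK here,
-- after 5.1.46 it may prefer the secondary index, which it now considers shorter
SELECT * FROM page_links;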

Update: Oh, MariaDB has reverted this; from their changelog:

mybug:39653: reverted as invalid

If only upstream MySQL would take note ;-)

MySQL versions at Wikipedia

More information about how we handle database stuff can be found in some of my talks.

Lately I hear people questioning the database software choices we made at Wikipedia, and I’d like to point out that…

Wikipedia database infrastructure needs are remarkably boring.

We have worked a lot on having the majority of the site workload handled by edge HTTP caches, and some of the most database-intensive code (our parsing pipeline) is well absorbed by just 160G of memcached arena residing on our web servers.

Also, the major issue with our databases is finding the right balance between storage space (even though text is stored in an ‘external store’, which is just a set of machines with lots of large slow disks) – we store information about every revision, every link, every edit – and the available I/O performance per dollar for that kind of space.

As a platform of choice we use X4240s (I advertised them before) – 16 SAS disks in a compact 2U package. There’s a relatively small hot area (we have a 1:10 RAM/DB ratio) and quite a long tail of various stuff we have to serve.

The whole database is just six shards, each getting up to 20k read queries a second (a single server can handle that) and a few hundred writes (binlog is under 100k/s – nothing too fancy). We have overprovisioned some hardware for slightly higher availability – we don’t have always-available on-site resources – and the slightly humorous logic is:

we need four servers: in case one goes down, another will be accidentally brought down by the person fixing it; then you’ve got one to use as a source of recovery and the remaining one to run the site.

The application doesn’t have too many really expensive queries, and those aren’t the biggest share of our workload. The database by itself is a minor part of where application code spends time (looking at profiling now – only 6% of busy application time is inside the database; memcached is even less, Lucene is way up with 11%). This is a remarkably good shape to be in, and it is much better than what we used to have when we had to deal with insane (“explosive”) growth. I am sure pretty much anything deployed now (even SQLite!) would work just fine, but what we use was created during bad times.

Bad times didn’t mean that everything was absolutely overloaded; it was more that it could get overloaded very soon if we didn’t take appropriate measures, and our fundraisers were much tinier back then. We were using 6-disk RAID-0 boxes to be able to sustain good performance and have the required disk space at the same time (the alternative, of course, being the expensive SAN route).

While mainstream MySQL development, with its leadership back then, was headed towards implementing all sorts of features that didn’t mean anything to our environment (and, judging from various discussions I had with lots of people, to many many other web environments):

  • utf8 support that didn’t support Unicode
  • Prepared Statements that don’t really make much sense in PHP/C environments
  • unoptimized subqueries that allow people to write shitty-performing queries
  • later in 5.0 – views, stored routines, triggers
  • etc…

… nobody was really looking at MySQL performance at that time, and it could have insane performance regressions (“nobody runs these things anyway”, like ‘SHOW STATUS’) and a forest full of low-hanging fruit.
From an operations perspective it wasn’t perfect either – replication didn’t survive crashes, crash recovery took forever, etc.

That’s when Google released their set of patches for 4.0, which immediately provided an incredible amount of fixes (that’s what I wrote about it back then). To highlight some of the introduced changes:

  • Crash-safe replication (replication position is stored inside InnoDB along with transaction state) – this allowed running slaves with InnoDB log flushing turned off while still having consistent recovery; vanilla MySQL doesn’t have that yet, Percona added this to XtraDB at some point
  • Guaranteed InnoDB concurrency slot for the replication thread – however loaded the server is, replication does not get queued outside and can proceed. This allowed us to point way more load towards MySQL. This is now part of 5.1
  • Multiple read-ahead and write-behind threads – again, this allowed bypassing certain bottlenecks, such as read-ahead slots (though apparently it is wiser just to turn off read-ahead entirely) – now part of the InnoDB Plugin
  • Multiple reserved SUPER connections – during overloads, systems were way more manageable

Running these changes live has been especially successful (and that was way before Mark/Google released their 5.0 patch set, which was then taken in parts by OurDelta/Percona/etc) – and I spent quite some time trying to evangelize these changes to MySQL developers (as I would have loved to see them deployed at our customers – way less work then!). Unfortunately, nobody cared, so running reliable and fast replication environments with mainline MySQL didn’t happen (now one has to use either XtraDB or the FB build).

So, I did some merging work, added a few other small fixes and ended up with our 4.0.40 build (also known as four-oh-forever), which still runs half of the shards today. It has sufficient in-memory performance for us, it can utilize our disk capacity fully, and it doesn’t have a crash history (I used to tell of two 4.0 servers, both whitebox RAID-0 machines, having an unbroken replication session for two years). By today’s standards it already misses a few things (I miss fast crash recovery mostly, after the last full power outage in a datacenter ;-), and developers would love to abuse SQL features (hehe, recently a read-only subquery locked all rows because of a bug :-). I’m way more conservative when it comes to using certain features live, as when working at MySQL Support I could see all the ways those features break for people, and we used to joke (this one was about RBR :):

Which is the stable build for feature X? Next one!

Anyway, even knowing that stuff breaks in one way or another, I was running a 5.1 upgrade project, mostly because of peer pressure (“4.0 haha!” – even though that 4.0 is more modern from an operations standpoint).

As MediaWiki is an open-source project used by many, we already engineer it for a wide range of databases – we support MySQL 4.0, we support MySQL 6.0-whatever-is-in-the-future, and there’s some support for different vendors’ DBMSes (at various stages – PG, Oracle, MS SQL, etc.) – so we can be sure that it works relatively fine on newer versions.

Upgrade in short:

  • Dump schema/data
  • Load schema on 5.1 instance
  • Adjust the schema while we can – set all varchar columns to varbinary to maintain 4.0 behavior (see the sketch after this list)
  • Load data on 5.1 instance
  • Fix MySQL to replicate from 4.0 (stupid breakage for nothing)
  • Switch master to 5.1 instance
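
A hedged sketch of the schema adjustment step – the table and column here are just an example; the real adjustment covered every varchar column:

-- convert character columns to their binary counterparts, so that 5.1
-- compares and sorts them byte-wise, the way our 4.0 setup effectively did
ALTER TABLE page
    MODIFY page_title VARBINARY(255) NOT NULL;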

We had some 5.0 and 5.1 replicas running for a while to detect any issues, and as there weren’t too many, the switch could be nearly immediate (English Wikipedia was converted 4.0->5.1 over a weekend).

I had made an engineering effort before to merge the Google 5.0 patches into a tree later than 5.0.37, but eventually Mark left Google for Facebook and the “Google patch” was abandoned – long live the Facebook patch! :)

At first the FB-internal efforts went towards getting the 5.0 environment working properly, so 5.1 development was a bit on hold. At that time I cherry-picked some of Percona’s patch work (mostly to get transactional replication for 5.1, as well as fast crash recovery) and started deploying this new branch. Of course, now that the Facebook development focus has switched to 5.1, maintaining a separate branch is becoming less needed – my plan for the future is getting the FB build deployed across all shards.

The beauty of the FB build is that the development team is remarkably close to operations (and the operations team is close to development), and there is lots of focus on making it do the right stuff (make sure you follow the mysql@facebook page). The visibility into systems (PMP!) we have at Facebook can be transformed into code fixes nearly instantly, especially when compared with development cycles outside. I’m sure some of those changes will trickle into other trees eventually, but we already have them in FB builds, and they are the state of the art of MySQL performance/operations engineering, while maintaining great code quality.

Yes, at Wikipedia we run a mix of really fresh and quite old/frozen software (there will be unification, of course), but… it runs fine. It isn’t as fascinating as it was years ago, but it allows us not to pay any attention for years. Which is good, right? Oh, and of course, there’s full on-disk data compatibility with standard InnoDB builds, in case anyone really wants to roll back or switch to the next best fork.

On MySQL replication, again…

There are a few things one is supposed to know about MySQL replication in production, as the manual doesn’t always discuss things openly.

This is a small set of rules and advice I compiled (some apply to statement-based replication only, but row-based replication may benefit from one or two as well):

  • Don’t use MyISAM. Or in fact, don’t use any non-transactional engine if you care about your data. On either side – master, slave, or both – using non-transactional engines will cause data drift, as partially executed statements on the master would be fully executed on the slave, or will simply stop replication. Of course, every crash has the capacity to get your tables desynced from each other, and there are absolutely no guarantees.
    This “don’t” can be easily transformed into a “do” – just use InnoDB. Bonus point – one doesn’t need to take down the server to clone a slave from a master :)
  • Don’t use temporary tables. The MySQL manual is very funny about temporary tables in replication; it says “do not shut down the slave while it has temporary tables open.” That of course means that you’re not supposed to crash either – and the more slaves there are, the more of them will crash for various reasons (e.g. solar flares).
    The operational overhead temporary tables add is huge – even though it may not show up in a benchmark.
  • Prefer simple, idempotent statements. If one can replay the same statements multiple times without the database drifting, it doesn’t matter much if the replication position is somewhat outdated. Updating rows by PK to fixed values and avoiding multi-table updates/deletes allows recovering after a crash much faster.
  • Set sync_binlog=1. This will introduce the biggest bottleneck for transactions, but losing 30s of data may be worse (as that will force a full slave resync in most cases). On really busy servers one can go for higher values (e.g. sync every 20 transactions), but 0 is asking for disaster.
  • Avoid long-running updates. Though all a long statement causes on the master is a slightly longer locking window and some performance pressure, once it gets replicated to the slave, all the other updates will have to wait for the giant one to finish, in many cases rendering the slave useless.
    If something big has to be replicated, either split it into smaller chunks (see the sketch after this list) or run it directly against the slaves (with binary logging on the master disabled for it).
    Splitting into smaller chunks also allows wait-for-slave logic to be implemented, thus avoiding any major impact on production environments.
  • Don’t use replicate-do-db. Or replicate-ignore-db. They both rely on the current database context, and statements like ‘INSERT INTO database.table’ will not be filtered the way you expect.
    If you need filtering, use replicate-wild-do-table=db.% – but even then, be careful with cross-database statements that involve tables from multiple databases, as they may be filtered out…
  • Note the multiversioning. Some statements may become replication performance hogs because of long-running transactions (backups? reporting? ETL?) running on slaves – a statement may not need to rescan all the row versions on the master, but they’d still be there on a slave. Such statements may need to be rewritten to avoid scanning gaps with too many invisible rows, or the long transactions have to be split up.
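
A minimal sketch of the chunking pattern referred to above (table and column names are made up): instead of one giant statement, walk the primary key in fixed-size ranges, so each replicated statement stays small and replaying a chunk twice leaves the data unchanged.

-- one giant, replication-unfriendly statement:
--   DELETE FROM archive WHERE created < '2009-01-01';
-- the same work split into small, PK-bounded, idempotent chunks,
-- driven by a loop in the application or a maintenance script:
DELETE FROM archive
  WHERE id BETWEEN 1 AND 10000        -- advance this PK window each iteration
    AND created < '2009-01-01';
-- ...check slave lag here (the wait-for-slave logic), then take the next window:
DELETE FROM archive
  WHERE id BETWEEN 10001 AND 20000
    AND created < '2009-01-01';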

Though probably the best advice I can give now is “call your MySQL vendor and ask for transactional replication”. Server, rack, or datacenter crashes will not cause excessive work fixing replication – it will always be consistent. One can even disable log syncing to disk then \o/

On hot backups

A few years ago I was looking at the crash recovery code and realized that InnoDB had removed all the comments related to transaction log replay [this assumption is debunked by Heikki in the comments section]. Judging by the high quality of the comments in the remaining codebase, I concluded that it was all done to obscure any efforts to build another InnoDB hot backup solution – a competitor to Innobase’s first standalone offering.

I was enjoying the moment when Percona launched their own implementation of the tool. Since its inception, it has become more and more robust and feature-rich.

We have used xtrabackup in our environment a lot – just… not for backups. The major use case right now is cloning server instances – either for building new replicas or shadow servers, or for replacing masters – and it allows us to do that without interrupting any operations.

Now, what makes the whole situation way more interesting is that Oracle/MySQL announced at the conference that InnoDB Hot Backup will be part of the Enterprise offering – which makes it way more available to the MySQL customer community than when it required quite expensive per-server licenses.

Of course, the open-source xtrabackup is way easier to tweak for our environment (O_DIRECT support, posix_fadvise(), flashcache hints, etc. were all added after release), and it is interesting how the Oracle-provided tool will evolve. Right now xtrabackup already supports streaming operation, which makes it much more usable in large-database-on-small-hardware (read: sharded) environments, and provides flexibility to the users. Oracle of course owns much more in-house expertise on both the current internals and all the future changes that will happen, so we may see leadership in the field coming from their side too.

One of our reasons for not using a physical backup solution is simply that it is not space-efficient. There may be multiple ways to approach that – from robust incremental backups to partial backups that wouldn’t include secondary indexes or would only take a limited set of tables.

Some changes may actually require extended MySQL/InnoDB support – on multi-terabyte instances one may not want to rescan the dataset for each incremental backup run, as the resulting diff would be just a hundred gigabytes or less. This would require support for an always-running backup agent that aggregates information about block changes and allows for more efficient backup operation.

Discarding secondary indexes is a way more attractive option with the 5.1/Plugin ability to do fast independent index builds, which don’t require row-at-a-time B-tree builds for all indexes at once (builds that, of course, hit severe memory penalties on large tables or in parallel workloads).
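
For illustration, a hedged sketch (hypothetical table and index names) of what discarding secondary indexes could look like on restore: load the table data with only the clustered PK in place, then rebuild the secondary indexes afterwards – with the InnoDB Plugin this ALTER uses fast index creation, a sort-based build per index rather than inserting into every B-tree one row at a time.

-- after the table data has been restored without secondary indexes:
ALTER TABLE revision
    ADD INDEX rev_page_id (rev_page, rev_id),
    ADD INDEX rev_timestamp (rev_timestamp);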

Having always-ready backups is important not only for the ability to rebuild a box (we have replicas for machine failures) – the real value is when backups can be used at massive scale, across thousands of machines, to extract subsets of table rows. For that, one cannot just ship full instance data around from backup storage – so recovery tools will have to be way more flexible.

Probably the core feature for that kind of operation would be the ability to import tables directly from a hot backup into online instances – unfortunately, restarting a database instance is still costly (though we’re doing quite some work in that direction too).

I’m extremely happy that InnoDB has started fixing operational issues like crash recovery performance, but there’s still a wide area of problems not touched properly yet – especially in the disaster recovery space – and I’m eager to see developments in this field, both from Oracle and from community members.

Opening tables v2!

PMP on demand revealed one of the reasons why we’ve been seeing ‘Opening tables’ during normal operation, not just during startup (see my previous post on this topic).

We had a thousand or so threads waiting on LOCK_open, and the only thread holding the mutex was this darling:

Thread 902 (Thread 1637624128 (LWP 22113)):
#3  mutex_spin_wait (mutex=0x2aaaac3232b8,
        file_name=0x8b3bf7 "trx0trx.c", line=220) at sync0sync.c:565
#4  trx_allocate_for_mysql
#5  ha_innobase::allocate_trx
#6  check_trx_exists
#7  ha_innobase::info
#8  ha_innobase::open
#9  handler::ha_open
#10 openfrm
#11 open_unireg_entry
#12 open_table
#13 open_tables
#14 open_and_lock_tables
#15 mysql_insert

So the kernel mutex, which is quite contended, ends up being waited on while LOCK_open is held during table open, which blocks all queries from happening on the server. The fix? Nearly the same as the previous “Opening tables” fix – don’t call ha_innobase::info() as part of open(), or move the transaction establishment code outside of info().

Read ahead…

Mark wrote about how to find situations where InnoDB read-ahead is a bottleneck. What he didn’t disclose, though, is his trick to disable read-ahead without restarting or recompiling MySQL. See, there’s no internal “disable read-ahead” knob. But there is…

buf_read_ahead_random(...) { ...
        if (srv_startup_is_before_trx_rollback_phase) {
                /* No read-ahead to avoid thread deadlocks */
                return(0);
        }

This variable is tested in two functions – buf_read_ahead_linear() and buf_read_ahead_random() – and nowhere else. So yeah, “server startup is before the transaction rollback phase” is another way of saying “don’t do read-ahead, please, please”.

gdb -ex "set  srv_startup_is_before_trx_rollback_phase=1" \
    --batch -p $(pidof mysqld)

And many servers bottlenecked on this became much much much faster (and 2000 concurrently running threads dropped to 10). Of course, this is most visible in high-latency, high-throughput I/O situations, but we’re hitting this contention spot on local disk setups too.
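
One way to sanity-check the effect from the SQL side – the exact counter names differ between the built-in InnoDB and the Plugin, so treat the pattern below as an approximation – is to watch the read-ahead status counters before and after flipping the flag:

SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_read_ahead%';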

Don’t forget to have the fix handy if gdb decides to be nasty and lock up your server :)

Opening tables!

There’s one bottleneck in MySQL/InnoDB that ultimately sucks. It sucked in 4.0, sucked in 5.0, and still sucks in 5.1 with the newest InnoDB plugin. Opening tables has always been a bottleneck on machines that have thousands of tables (as LOCK_open is held during the process), and while a table was being opened, everything else on the machine would stall.

It can simply take hours on such systems just to open the tables – and the major portion of that time is spent randomly diving into InnoDB tables to populate index statistics. It obviously sounds like low-hanging fruit – statistics aren’t needed while you are opening a table, they’re needed only when querying it.

So, I threw a few thousand tables onto my machine and tried opening them with ten connections. Standard InnoDB code was opening 13.5 tables a second. After spending a few minutes moving statistics collection to after ha_innodb::open() (this is a pure prototype, not suitable for production), I noticed a performance increase.

Tables were opened at 105-a-second speed. A bit better, ~8x better.

Merry Christmas, MySQL!

On deadlock detection

InnoDB detects deadlocks. Deadlocks are those nasty situations when transaction 1 tries to acquire locks A and B, while transaction 2 tries to acquire locks B and A at the same time. As both are stubborn, InnoDB will simply decide to terminate one of them. If it didn’t do that, both transactions would have to wait for lock_wait_timeout to expire. The longer a transaction is, the more likely it is to cause deadlocks. Deadlock detection kind of helps, then, but… at a certain cost.
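
To make the A/B ordering concrete, here is a minimal sketch – a hypothetical two-column InnoDB table and two client sessions – of the classic case InnoDB resolves by killing one of the transactions:

-- setup: CREATE TABLE t (id INT PRIMARY KEY, v INT) ENGINE=InnoDB;
--        INSERT INTO t VALUES (1, 0), (2, 0);

-- session 1:
BEGIN;
UPDATE t SET v = v + 1 WHERE id = 1;   -- takes row lock A

-- session 2:
BEGIN;
UPDATE t SET v = v + 1 WHERE id = 2;   -- takes row lock B

-- session 1:
UPDATE t SET v = v + 1 WHERE id = 2;   -- now waits for lock B

-- session 2:
UPDATE t SET v = v + 1 WHERE id = 1;   -- would wait for lock A; InnoDB detects the
                                       -- cycle and terminates one of the transactions
                                       -- (ERROR 1213: Deadlock found when trying to get lock)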

The transaction 1 and 2 case is way too easy – try adding a few hundred transactions that contend over the same set of locks. To handle that, the InnoDB deadlock monitor will recursively brute-force the lock graph until it hits a 200-transaction-long chain (at which point it declares a deadlock) or runs out of paths to check. Still, with the power of modern hardware, that will only take milliseconds.

Unfortunately, InnoDB also holds kernel_mutex during that time, so lots and lots of InnoDB operations cannot happen. To be exact, InnoDB will rarely do anything else while a deadlock check is happening.

To illustrate that, I have a very simple testcase (one that in certain conditions stalls the server for half an hour, even when it is not being run):

UPDATE t1 SET b=b+1 WHERE a=1;

With a few threads it executes nearly 20000 times a second on my desktop machine. With ten threads it executes 14000/s. With 50 threads it is only 3000/s. With 100 threads it falls to 639 operations a second. At 140 threads it is already down to just 266.
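
For reference, a sketch of the kind of single-row hot spot assumed here (the exact table definition is not part of the testcase above):

-- assumed schema for the testcase
CREATE TABLE t1 (
    a INT NOT NULL PRIMARY KEY,
    b BIGINT NOT NULL
) ENGINE=InnoDB;
INSERT INTO t1 VALUES (1, 0);

-- each of the N client threads runs this in a tight loop; every thread contends
-- on the same row lock, so the lock wait queue (and the deadlock-detection work
-- done under kernel_mutex) grows with the thread count
UPDATE t1 SET b=b+1 WHERE a=1;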

I built InnoDB without deadlock detection (a tiny, tiny patch) and tried the same test. Similar performance with 10 threads, and still doing 10000 operations a second at 100 threads.

Though I illustrated an edge case here, its purity actually doesn’t show how bad this can get – the situation can arise not only because of high contention on a single row, but simply because someone holds a row lock for a bit too long (there’s always that sleep between UPDATE and COMMIT, too). A single transaction can cause a lock convoy, and once transactions queue up and the update rate falls below 100/s, all MySQL will be doing is checking for deadlocks, even if they never happen.

On many systems deadlock detection causes way more issues than the lack of it would. Most deadlocks happen on transactions that are somewhere in the middle of their lock wait anyway :)

There’s some discussion about it at MySQL Bug#49047