InnoDB locking makes me sad

Vadim and others have pointed at the index->lock problems before, but I think they didn’t good job enough at pointing out how bad it can get (the actual problematic was hidden somewhere as some odd edge case). What ‘index lock’ means is generally the fact that InnoDB has table-level locking which will kill performance on big tables miserably.

InnoDB is a huge pie of layers, that have various locking behaviors, and are layered on top of each other, and are structured nicely as subdirectories in your innodb_plugin directory. Low level storage interfaces are done via os/ routines, then on top of that there’s some file space manager, fsp/, which allocates space for btr/ to live in, where individual page/ entities live, with multiple row/ pieces. There’re few other subsystems around, that got quite some attention lately – e.g. buf/ pool, transaction log/, and large trx/ transactions are composed of micro transactions living in mtr/.

If you live in memory, you care about buffer pool and transaction log performance, if you write insane amounts of data to in-memory buffers you hit mtr/ problems and depend o how fast you can write out log/ or flush out buf/. If you are in I/O-heavy land most of stuff you care about happens in btr/.

Generally InnoDB is quite good about read scalability in I/O bound environments – nowadays one can saturate really fast I/O devices and there will be plenty of parallel reads done. Major scalability problem in this field was read-ahead which was funneling all read-ahead activity into a small set of threads, but other than that there can be hundreds of parallel reads issued to underlying devices. Situation changes when writes are added to the mix, though again, there’re few different scenarios.

There’re two ways for InnoDB to write out updates to pages, “optimistic” and “pessimistic”. Optimism here means that only in-page (page/row) operation will be needed without changing the tree structure. In one case you can expect quite high parallelism – multiple pages can be read for that operation at a time, multiple of them can be edited at a time, then some serialization will happen while writing out changes to redo log and undo segments. Expect good performance.

The much worse case is when B-Tree is supposed to be reorganized and multiple page operations can happen; thats pessimism. In this case whole index gets locked (via a read-write lock obtained from dict/),
then B-Tree path is latched, then changes are done, then it is all unlocked until next row operation needs to hit the tree. Unfortunately, both ‘path is latched’ and ‘changes are done’ are expensive operations, and not only in-core, but are doing sync page read-ins, one at a time, which on busy systems serving lots of read load are supposed to be slow. Ironically, as no other operations can happen on the table at that time, you may find out you have spare I/O capacity.. ;-)

What gets quite interesting though is the actual operation needed to latch b-tree path. Usual wisdom would say that if you want to change a row (read-modify-write), you probably looked up the page already, so there won’t be I/O. Unfortunately, InnoDB uses an slightly more complicated binary tree version, where pages have links to neighbors, and tree latching does this (a bit simplified for reading clarity):

/* x-latch also brothers from left to right */
get_block = btr_block_get(space, zip_size, left_page_no, RW_X_LATCH, mtr);
get_block = btr_block_get(space, zip_size, page_no, RW_X_LATCH, mtr);
get_block = btr_block_get(space, zip_size, right_page_no, RW_X_LATCH, mtr);

So, essentially in this case, just because InnoDB is being pessimistic, it reads neighboring blocks to lock them, even if they may not be touched/accessed in any way – and bloats buffer pool at that time with tripple reads. It doesn’t cost much if whole tree fits in memory, but it is doing three I/Os in here, if we’re pessimistic about InnoDB being pessimistic (and I am). So, this isn’t just locking problem – it is also resource consumption problem at this stage.

Now, as the dictionary lock is hold in write mode, not only updates to this table stop, but reads too – think MyISAM kind of stop. Of course, this ‘table locking’ happens at entirely different layer than MyISAM. In MyISAM it is statement-length locking whereas in InnoDB this lock is held just for row operation on single index, but if statement is doing multiple row operations it can be acquired multiple times.

Probably there exist decent workarounds if anyone wants to tackle this – grabbing read locks on the tree while reading pages into buffer pool, then escalating lock to exclusive. A bit bigger architectural change would be allowing to grab locks on neighbors (if they are needed) without bringing in page data into memory – but that needs InnoDB overlords to look at it. Talk to your closest MySQL vendor and ask for a fix!

How do regular workloads hit this? Larger your records are, more likely you are to have tree changes, lower your performance will be. In my edge case I was inserting 7k sized rows – even though my machine had multiple disks, once the dataset fell out of buffer pool, it couldn’t insert more than 50 rows a second, even though there were many disks idle and capacity gods cried. It gets worse with out-of-page blobs – then every operation is pessimistic.

Of course, there’re ways to work around this – usually by taking the hit of sharding/partitioning (this is where common wisdom of “large tables need to be partitioned” mostly comes from). Then, like with MyISAM, one will have multiple table locks and there may be some scalability then.

TL;DR: InnoDB index lock is major architectural performance flaw, and that is why you hear that large tables are slower. There’s a big chance that there’re more scalable engines for on-disk writes out there, and all the large InnoDB write/insert benchmarks were severely hit by this.

Update: Filed bugs #61735 and #61736 with MySQL

Opening tables v2!

PMP on demand revealed one of reasons why we’ve been seeing ‘Opening tables’ during proper operations, not just during startup (see my previous post on this topic).

We had a thousand or so threads waiting on LOCK_open, and the only thread holding the mutex was this darling:

Thread 902 (Thread 1637624128 (LWP 22113)):
#3  mutex_spin_wait (mutex=0x2aaaac3232b8,
        file_name=0x8b3bf7 "trx0trx.c", line=220) at sync0sync.c:565
#4  trx_allocate_for_mysql
#5  ha_innobase::allocate_trx
#6  check_trx_exists
#7  ha_innobase::info
#8  ha_innobase::open
#9  handler::ha_open
#10 openfrm
#11 open_unireg_entry
#12 open_table
#13 open_tables
#14 open_and_lock_tables
#15 mysql_insert

So, kernel mutex, which is quite contended, will escalate to LOCK_open on table open, which will block all queries from happening on the server. Fix? Nearly same code as previous “Opening tables” fix – don’t call ha_innobase::info() as part of open(), or move the transaction establishment code outside of info().

Opening tables!

There’s one bottleneck in MySQL/InnoDB that ultimately sucks. It sucked in 4.0, sucked in 5.0, sucks in 5.1 with newest InnoDB plugin. Opening tables has been a bottleneck on machines that have thousands of tables all the time (as LOCK_open is being held during the process), and while there was a table being opened, everything else would stall on the machine.

It can simply take hours on such systems just to open tables – and the major portion of time spent is randomly diving into InnoDB tables to populate index statistics. It obviously sounds like low hanging fruit – as statistics aren’t needed while you are opening a table, they’re needed just for querying the table.

So, I threw in few thousand tables to my machine, and tried opening them with ten connections. Standard InnoDB code was opening 13.5 tables a second. After spending few minutes and moving (this is pure prototype, not suitable for production) statistic collection post ha_innodb::open(), I noticed performance increase.

Tables were opened at 105-a-second speed. A bit better, ~8x better.

Merry Christmas, MySQL!

On deadlock detection

InnoDB detects deadlocks. Deadlocks are those nasty situations, when transaction 1 tries to acquire locks A and B, whereas transaction 2 tries to acquire locks B and A at the same time. As both are stubborn, InnoDB will decide simply to terminate one of them. If it wouldn’t do that, both transactions would have to wait until lock_wait_timeout to expire otherwise. There is a big chance that longer the transaction is, more likely it is to cause deadlocks. Deadlock detection kind of helps, then, but… at certain costs.

Transaction 1 and 2 case is way too easy, try adding few hundred transactions that contend over same set of locks. To do that, InnoDB deadlock monitor will recursively brute-force lock graph, until it hits a 200-transaction-long chain (it will say it is a deadlock), or until it runs out of paths to check. Still, with the power of modern hardware that will still be milliseconds.

Unfortunately, InnoDB will also hold kernel_mutex at that time, so lots and lots of InnoDB operations will not happen at that time. To be exact, InnoDB will rarely do anything else, while deadlock check is happening.

To illustrate that, I have a very simple testcase (that in certain conditions stalls the server for half an hour, even if it is not being ran):

UPDATE t1 SET b=b+1 WHERE a=1;

With few threads it executes nearly 20000 times a second on my desktop machine. With ten threads it executes 14000/s. With 50 threads it is only 3000/s. With 100 threads it falls down to 639 operations a second. At 140 threads it is already just 266.

I built InnoDB without deadlock detection (tiny tiny patch), and tried same test. Similar performance with 10 threads, still doing 10000 operations a second at 100 threads:

Though I illustrated edge case here, its purity actually didn’t show how bad this can go – this situation can happen not only because of high contention on single row, but simply because someone holds up the row lock for a bit too long (there’s always that sleep between UPDATE and COMMIT, too). It can take a single transaction to cause a lock convoy, and once transactions queue up, and update rate falls down below 100/s, all MySQL will be doing is checking for deadlocks, even if they never happen.

On many systems deadlock detection is causing way more issues, than lack of it would. Most deadlocks happen on transactions that are somewhere in the middle of their lock wait anyway :)

There’s some discussion about it at MySQL Bug#49047

Eyecandy mutexes!

In my quest of making MySQL usable, I managed to hit contention that wasn’t spotted by performance masters before. Meet most useless mutex ever (this is actual contention event, not just a hold):

Count     nsec Lock
 1451   511364 mysqld`ut_list_mutex

      nsec ---- Distribution --- count    Stack
      2048 |@@                 |   132`mutex_lock_impl
      4096 |@@@                |   186`mutex_lock
      8192 |                   |     3    ut_malloc_low
     16384 |@@                 |   137    mem_heap_create_block
     32768 |@                  |   105    row_sel_store_mysql_rec
     65536 |@                  |    89    row_search_for_mysql
    131072 |@                  |    99    ha_innobase::general_fetch
    262144 |@@                 |   131    ha_innobase::rnd_next
    524288 |@@@                |   195    rr_sequential
   1048576 |@@@                |   223
   2097152 |@@                 |   137
   4194304 |                   |    14

ut_list_mutex guards a memory structure which has all memory blocks allocated by InnoDB (via ut_malloc/ut_free) in it.
It has two uses:

  1. Printing “Total memory allocated” in SHOW INNODB STATUS (though this can still be implemented lock-free)
  2. Deallocating all memory on shutdown (though, all modern operating systems do that anyway, so this is purely just to shut up valgrind)

If you have any BLOB/TEXT data in your tables, you’re definitely hitting this contention spot (it is #1 contention in such cases).

Fix? Kill the eyecandy, replace ut_malloc and ut_free with direct calls to malloc() and free(), oh and of course, use scalable allocators like tcmalloc or Hoard.

plockstat fail!

Solaris has this beautiful tool ‘plockstat’ that can report application lock contention and hold events (hold events turn into contention events at parallelism.. ;-) It is just a frontend to a set of dtrace rules, that monitor mutexes and rwlocks.

Today I was testing an edge case (what happens, when multiple threads are scanning lots of same data) – and plockstat/dtrace indicated that there were zero (0!!!) lock waits. I tried using ‘prstat’ with microstate accounting, and it indeed pointed out that there’s lots of LCK% activity going on (half of CPU usage…). The dtrace profiling oneliner (dtrace -n 'profile-997 {@a[ustack()]=count()}') immediately revealed the culprit:


So, plenty of CPU time was spent when trying to unlock mutex (what seemed strange), but didn’t seem that strange once I noticed the code:

do {
	old = *lockword64;
	new = old & ~LOCKMASK64;
} while (atomic_cas_64(lockword64, old, new) != old);

So, there’s unaccounted busy loop (it is just part of hold event in dtrace). What is odd, is that nobody expects this place to loop too much, what happens here – it gives away mutex to other thread, which wants it. So, instead of having the new owner spin-lock (where it accounts properly), it has old owner spin-locking. I’m not convinced this kind of behavior is one that should scale on large machines, but I’m not much of a locking expert.

Without proper instrumentation plockstat failed to provide information about locks that were consuming half of CPU time. I hope that really was just an edge case – more real testing will follow soon, will see if plockstat will fail as much. Oh well, will find the information I need anyway :) Lesson learned – treat pretty much everything with grain of salt, especially when OS tells you mysql has no lock contention, haha.