InnoDB locking makes me sad

Vadim and others have pointed at the index->lock problems before, but I think they didn’t do a good enough job of pointing out how bad it can get (the actual problem was hidden away as some odd edge case). What ‘index lock’ means, in practice, is that InnoDB has table-level locking which will miserably kill performance on big tables.

InnoDB is a huge pie of layers that have various locking behaviors, are layered on top of each other, and are structured nicely as subdirectories in your innodb_plugin directory. Low-level storage interfaces are done via os/ routines; on top of that there’s the file space manager, fsp/, which allocates space for btr/ to live in, where individual page/ entities live, with multiple row/ pieces. There’re a few other subsystems around that got quite some attention lately – e.g. the buf/ pool, the transaction log/, and large trx/ transactions that are composed of micro-transactions living in mtr/.

If you live in memory, you care about buffer pool and transaction log performance; if you write insane amounts of data to in-memory buffers, you hit mtr/ problems and depend on how fast you can write out log/ or flush out buf/. If you are in I/O-heavy land, most of the stuff you care about happens in btr/.

Generally InnoDB is quite good about read scalability in I/O-bound environments – nowadays one can saturate really fast I/O devices and there will be plenty of parallel reads done. The major scalability problem in this field was read-ahead, which funneled all read-ahead activity into a small set of threads, but other than that there can be hundreds of parallel reads issued to underlying devices. The situation changes when writes are added to the mix, though again, there’re a few different scenarios.

There’re two ways for InnoDB to write out updates to pages, “optimistic” and “pessimistic”. Optimism here means that only an in-page (page/row) operation will be needed, without changing the tree structure. In that case you can expect quite high parallelism – multiple pages can be read for the operation at a time, multiple pages can be edited at a time, and then some serialization happens while writing out changes to the redo log and undo segments. Expect good performance.

The much worse case is when the B-tree has to be reorganized and multiple page operations can happen; that’s pessimism. In this case the whole index gets locked (via a read-write lock obtained from dict/),
then the B-tree path is latched, then changes are done, then it is all unlocked until the next row operation needs to hit the tree. Unfortunately, both ‘path is latched’ and ‘changes are done’ are expensive operations, and not only in-core – they do sync page read-ins, one at a time, which on busy systems serving lots of read load are bound to be slow. Ironically, as no other operations can happen on the table at that time, you may find out you have spare I/O capacity.. ;-)

What gets quite interesting though is the actual operation needed to latch the B-tree path. The usual wisdom would say that if you want to change a row (read-modify-write), you probably looked up the page already, so there won’t be I/O. Unfortunately, InnoDB uses a slightly more complicated B-tree variant, where pages have links to their neighbors, and tree latching does this (a bit simplified for reading clarity):


/* x-latch also brothers from left to right */
get_block = btr_block_get(space, zip_size, left_page_no, RW_X_LATCH, mtr);
get_block = btr_block_get(space, zip_size, page_no, RW_X_LATCH, mtr);
get_block = btr_block_get(space, zip_size, right_page_no, RW_X_LATCH, mtr);

So, essentially in this case, just because InnoDB is being pessimistic, it reads the neighboring blocks just to lock them, even if they may not be touched or accessed in any way – and bloats the buffer pool with triple reads while at it. It doesn’t cost much if the whole tree fits in memory, but otherwise it is doing three I/Os here, if we’re pessimistic about InnoDB being pessimistic (and I am). So, this isn’t just a locking problem – it is also a resource consumption problem.

Now, as the dictionary lock is held in write mode, not only updates to this table stop, but reads too – think a MyISAM kind of stop. Of course, this ‘table locking’ happens at an entirely different layer than in MyISAM. In MyISAM it is statement-length locking, whereas in InnoDB this lock is held just for a row operation on a single index – but if a statement does multiple row operations, the lock can be acquired multiple times.

There probably exist decent workarounds if anyone wants to tackle this – grabbing read locks on the tree while reading pages into the buffer pool, then escalating the lock to exclusive. A somewhat bigger architectural change would be allowing locks to be grabbed on neighbors (if they are needed at all) without bringing the page data into memory – but that needs the InnoDB overlords to look at it. Talk to your closest MySQL vendor and ask for a fix!

How do regular workloads hit this? The larger your records are, the more likely you are to have tree changes, and the lower your performance will be. In my edge case I was inserting 7k-sized rows – even though my machine had multiple disks, once the dataset fell out of the buffer pool it couldn’t insert more than 50 rows a second, even though there were many disks idle and the capacity gods cried. It gets worse with out-of-page blobs – then every operation is pessimistic.

Of course, there’re ways to work around this – usually by taking the hit of sharding/partitioning (this is where the common wisdom of “large tables need to be partitioned” mostly comes from). Then, like with MyISAM, one will have multiple table locks and may get some scalability back.

TL;DR: the InnoDB index lock is a major architectural performance flaw, and that is why you hear that large tables are slower. There’s a big chance that there’re more scalable engines for on-disk writes out there, and that all the large InnoDB write/insert benchmarks were severely hit by this.

Update: Filed bugs #61735 and #61736 with MySQL

19 thoughts on “InnoDB locking makes me sad”

  1. Forgot to ask: presumably your inserts dropped to 50 rows per second because the rows being inserted are out of order with regard to the PK, correct?

    If the rows had an auto-increment PK, would InnoDB still need the index lock you mentioned?

  2. Andy, didn’t test PBXT. Yes, the drop is because rows are out of order. If the rows had an auto-increment PK I’d need another index on the field that is the PK now, and that index would still have the same problem.

  3. Dom, following is what MySQL Internals (InnoDB) has to say:
    “Similarly, InnoDB does not want to insert new rows according to the B-tree’s key order, so it inserts new rows right after the end of the existing rows or wherever there’s space left by a deleted row. But by definition the records of a B-tree must be accessible in order by key value, so there is a record pointer in each record which points to the next record in key order.”

    So, technically speaking, inserts and updates (which are actually delete+insert) should not be an issue, as there is no reshuffling required in the tree and the tree structure does not change (only the record pointers need to be updated).

    Is my understanding correct?

  4. Domas,

    In your case, if you used an auto-increment PK plus a secondary index on what is your primary key now, wouldn’t the performance be better because InnoDB uses the insert buffer?

  5. @Andy, the insert buffer is only utilized for inserts into non-unique secondary indexes; maybe Dom had to do inserts into unique indexes.

  6. Ovais, I’m not a big believer in the insert buffer – at large enough datasets, merging it in a single thread will become a bottleneck for what could otherwise be a parallel operation. You can’t grow the insert buffer beyond a certain threshold.

    UPDATE can be pessimistic too – if size is changed, or out-of-page blobs are touched.

    You pasted the page organization quote, not what kind of pages are edited – whenever there’s a chance of touching multiple pages, you hit this kind of locking.

  7. Dom, I got what you are saying about the locking thing. Regarding the insert buffer, I remember seeing Mark’s post with benchmark results with and without the insert buffer, but again, whether the insert buffer can help or not depends on the type of workload.

  8. @Ovais – AFAIK, secondary index updates are delete/insert, clustered (primary) index updates are in place.

    @Domas – insert buffer merges issue read requests for the background read threads, so concurrency is possible there

  9. @Domas – code that issues the prefetch requests is from one thread (srv_master_thread). But it submits async read requests and the insert buffer merge work is then done in the IO completion callback.

  10. Domas,

    yes, the problem is not new.. – the workaround I’ve found in the past is to use partitions on tables where the index mutex contention is observed. You may find it here:
    http://dimitrik.free.fr/db_STRESS_XtraDB-8_Performance_Dec2009.html (2009)
    http://dimitrik.free.fr/blog/archives/2010/05/mysql-performance-improving-stability.html (2010)

    But if I remember well, there is a limitation on partitions related to foreign keys, so this cannot be applied in all cases. Maybe once this limitation is removed, partitions will provide a simple solution (as it involves a native split of a single mutex due to a split of the index) – otherwise the only solutions are: rewrite the index code, or split the index mutex..

    Rgds,
    -Dimitri

  11. @Mark, okay, so primary index updates then of course mean expensive shifting within the tree. Shouldn’t it be delete+insert as well, with the insert happening where free space is available and the row pointers (that point to rows in primary key order) being updated? I thought it happened like that.

  12. @Dimitri,

    partitions work, but they increase the read costs of ranges that span multiple partitions. I think the low-hanging fruit would be preloading pages from disk before X-latching them, as well as better treatment of neighbors.

  13. It is unfortunate that BLOBs (or long CHAR or VARCHAR columns) are allocated from the same file segment as the clustered index B-tree leaf pages. This forces all BLOB allocation to X-lock the index->lock of the clustered index tree. If InnoDB allocated the off-page columns from another file segment, this should reduce the contention on index->lock.

    1. Marko, though that is another problem that probably asks for a separate bug, large in-page storage is pushing towards more pessimistic behavior too.

  14. Domas,

    I think if pages are not loaded, you’ll get contention on the btr_search_latch mutex, and only once you’re involving changes on already-buffered pages will you see index mutex contention..

    (BTW, it’s a pity your site is not sending reply notifications via e-mail)

    Rgds,
    -Dimitri

  15. Domas,

    This is a fundamental and inescapable limitation of B-tree-organized tables. It’s actually a problem with B-trees in general, but if your indexes are completely separate from the heap (Postgres), you don’t need to lock the root node as often, because you can use various tricks to do leaf-page-only updates.

    You might check out Tokutek for this; it is a problem they claim to solve.

Comments are closed.