Vadim and others have pointed at the index->lock problems before, but I think they didn’t good job enough at pointing out how bad it can get (the actual problematic was hidden somewhere as some odd edge case). What ‘index lock’ means is generally the fact that InnoDB has table-level locking which will kill performance on big tables miserably.
InnoDB is a huge pie of layers, that have various locking behaviors, and are layered on top of each other, and are structured nicely as subdirectories in your innodb_plugin directory. Low level storage interfaces are done via os/ routines, then on top of that there’s some file space manager, fsp/, which allocates space for btr/ to live in, where individual page/ entities live, with multiple row/ pieces. There’re few other subsystems around, that got quite some attention lately – e.g. buf/ pool, transaction log/, and large trx/ transactions are composed of micro transactions living in mtr/.
If you live in memory, you care about buffer pool and transaction log performance, if you write insane amounts of data to in-memory buffers you hit mtr/ problems and depend o how fast you can write out log/ or flush out buf/. If you are in I/O-heavy land most of stuff you care about happens in btr/.
Generally InnoDB is quite good about read scalability in I/O bound environments – nowadays one can saturate really fast I/O devices and there will be plenty of parallel reads done. Major scalability problem in this field was read-ahead which was funneling all read-ahead activity into a small set of threads, but other than that there can be hundreds of parallel reads issued to underlying devices. Situation changes when writes are added to the mix, though again, there’re few different scenarios.
There’re two ways for InnoDB to write out updates to pages, “optimistic” and “pessimistic”. Optimism here means that only in-page (page/row) operation will be needed without changing the tree structure. In one case you can expect quite high parallelism – multiple pages can be read for that operation at a time, multiple of them can be edited at a time, then some serialization will happen while writing out changes to redo log and undo segments. Expect good performance.
The much worse case is when B-Tree is supposed to be reorganized and multiple page operations can happen; thats pessimism. In this case whole index gets locked (via a read-write lock obtained from dict/),
then B-Tree path is latched, then changes are done, then it is all unlocked until next row operation needs to hit the tree. Unfortunately, both ‘path is latched’ and ‘changes are done’ are expensive operations, and not only in-core, but are doing sync page read-ins, one at a time, which on busy systems serving lots of read load are supposed to be slow. Ironically, as no other operations can happen on the table at that time, you may find out you have spare I/O capacity.. ;-)
What gets quite interesting though is the actual operation needed to latch b-tree path. Usual wisdom would say that if you want to change a row (read-modify-write), you probably looked up the page already, so there won’t be I/O. Unfortunately, InnoDB uses an slightly more complicated binary tree version, where pages have links to neighbors, and tree latching does this (a bit simplified for reading clarity):
/* x-latch also brothers from left to right */
get_block = btr_block_get(space, zip_size, left_page_no, RW_X_LATCH, mtr);
get_block = btr_block_get(space, zip_size, page_no, RW_X_LATCH, mtr);
get_block = btr_block_get(space, zip_size, right_page_no, RW_X_LATCH, mtr);
So, essentially in this case, just because InnoDB is being pessimistic, it reads neighboring blocks to lock them, even if they may not be touched/accessed in any way – and bloats buffer pool at that time with tripple reads. It doesn’t cost much if whole tree fits in memory, but it is doing three I/Os in here, if we’re pessimistic about InnoDB being pessimistic (and I am). So, this isn’t just locking problem – it is also resource consumption problem at this stage.
Now, as the dictionary lock is hold in write mode, not only updates to this table stop, but reads too – think MyISAM kind of stop. Of course, this ‘table locking’ happens at entirely different layer than MyISAM. In MyISAM it is statement-length locking whereas in InnoDB this lock is held just for row operation on single index, but if statement is doing multiple row operations it can be acquired multiple times.
Probably there exist decent workarounds if anyone wants to tackle this – grabbing read locks on the tree while reading pages into buffer pool, then escalating lock to exclusive. A bit bigger architectural change would be allowing to grab locks on neighbors (if they are needed) without bringing in page data into memory – but that needs InnoDB overlords to look at it. Talk to your closest MySQL vendor and ask for a fix!
How do regular workloads hit this? Larger your records are, more likely you are to have tree changes, lower your performance will be. In my edge case I was inserting 7k sized rows – even though my machine had multiple disks, once the dataset fell out of buffer pool, it couldn’t insert more than 50 rows a second, even though there were many disks idle and capacity gods cried. It gets worse with out-of-page blobs – then every operation is pessimistic.
Of course, there’re ways to work around this – usually by taking the hit of sharding/partitioning (this is where common wisdom of “large tables need to be partitioned” mostly comes from). Then, like with MyISAM, one will have multiple table locks and there may be some scalability then.
TL;DR: InnoDB index lock is major architectural performance flaw, and that is why you hear that large tables are slower. There’s a big chance that there’re more scalable engines for on-disk writes out there, and all the large InnoDB write/insert benchmarks were severely hit by this.