on performance stalls

We quite often say that benchmark performance is usually different from real-world performance, so performance engineering has to cover both: benchmarks help us understand sustained performance bottlenecks, while real-world analysis usually concentrates on things that would be considered ‘exceptional’ and unimportant in benchmarks – stalls of various kinds. They are extremely important, as the state when our performance is at its lowest is the level of performance we actually provide to our platform users.

On a machine doing 5000 qps, stalling for 100ms means that 500 queries were not served as fast as they could have been, or even hit application timeouts or exceptional MySQL conditions (like the 1023-transaction limit). Of course, stalling for a second means 5000 queries were not served in time…

We have multiple methods to approach this. One is our ‘dogpiled’ framework – an agent doing status polling every second and reporting information about I/O state, MySQL/InnoDB statuses, processlists, etc. – so we see the scope of stalls in our environment. We try to keep the threshold between complete information overload and something that still reveals problems – it is always a balancing act, especially with the great work done by the engineering team :)

Another approach, usually prompted by dogpile information, is auto-PMP – high-frequency status polling combined with gdb invocations that allow us to jump into the process whenever we notice something weird going on. We have some extensions to how we use PMP – but that’s worth another post.

The issues we find harming us most in production environments are the ones that are quite often dismissed as either “this never happens”, “get better hardware”, or “your application is wrong”. Unfortunately, it does happen, the thousands of machines we have aren’t free, and our application demands are our application demands :)

A few examples:

  • TRUNCATE stalls the server (oh well, DROP TABLE too) – in this case, truncating a table grabs the dictionary mutex, another transaction blocks while holding LOCK_open, and everything else stops. Though truncating is supposed to be a fast operation, it has to unlink (delete) a file, and with large files such an operation isn’t really instant on any filesystem. Even if one deletes all the data before truncating, the file is still on the filesystem.
  • Extending data files stalls the server – when a data file is being extended, a global mutex is held, which blocks all I/Os (with limited concurrency that is a full server stall). It is somewhat more impressive with file-per-table. This is the major reason for mini-stalls at the moment – on machines that grow at a gigabytes-a-day rate this is hit quite often.
  • Updating table statistics stalls the server – we hit this with high-performance task-tracking machines; the row churn there is quite amazing, and dictionary statistics are reread more often than one would expect. Updating statistics means locking the table while doing random reads from disk. Once the major workload hits that table, it quickly escalates to a full server stall.
  • Fuzzy checkpoint stalls the server – this is one of the biggest issues outstanding in stock MySQL. Though one would expect a “fuzzy checkpoint” that uses async background threads to be non-blocking, in fact all writes during it will stall, taking up all concurrency slots and leading to a server stall. Mark’s fix was simply doing this work in a background thread.
  • (no bug filed on this yet) Purge stalls the server – purge holds the dictionary lock while doing random reads from disk, with the table stall escalating to a server stall.

There are more issues (mostly related to heavier in-memory activities of the server), but these are the most obvious ones – where a single I/O request escalates into a table or instance lockup, during which no other work is done. Our machines have multiple disks and multiple CPUs and can support multiple SQL queries being executed at once, so any of these lockups effectively limits our available performance or damages the quality of service we can provide.
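
All of these share the same shape, sketched below in a minimal, purely illustrative C example (not InnoDB code – the global mutex and slow_io() are made up for the illustration): one slow operation performed under a global mutex stalls every other thread, and the usual fix is to move the slow work out of the critical section, or hand it to a background thread.

#include <pthread.h>
#include <unistd.h>

static pthread_mutex_t global_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Stand-in for unlink(), a dictionary statistics read, a file extension... */
static void slow_io(void)
{
        usleep(100000);         /* 100ms of "disk" work */
}

/* The anti-pattern: every thread that needs global_mutex now waits
   for the full duration of slow_io(). */
void bad_path(void)
{
        pthread_mutex_lock(&global_mutex);
        slow_io();              /* I/O done under a global lock */
        /* ... short shared-metadata update ... */
        pthread_mutex_unlock(&global_mutex);
}

/* The fix: do the slow work outside the critical section and hold
   the global mutex only for the short metadata update. */
void better_path(void)
{
        slow_io();              /* I/O done outside the lock */
        pthread_mutex_lock(&global_mutex);
        /* ... short shared-metadata update ... */
        pthread_mutex_unlock(&global_mutex);
}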

On the upside, my colleagues are absolutely amazing, and I’m sure we will have all these issues fixed in our deployment in the near future – and everyone will be able to pick that up via the mysqlatfacebook branch.

Read ahead…

Mark wrote about how to find situations where InnoDB read-ahead is a bottleneck. What he didn’t disclose, though, is his trick for disabling read-ahead without a restart or recompile of MySQL. See, there’s no internal “disable read-ahead” knob. But there is…

buf_read_ahead_random(...) {
        ...
        if (srv_startup_is_before_trx_rollback_phase) {
                /* No read-ahead to avoid thread deadlocks */
                return(0);
        }

This variable is tested in two functions – buf_read_ahead_linear() and buf_read_ahead_random() – and nowhere else. So yeah, “server startup is before transaction rollback phase” is another way of saying “don’t do read-ahead, please please”.

gdb -ex "set  srv_startup_is_before_trx_rollback_phase=1" \
    --batch -p $(pidof mysqld)

And many servers bottlenecked on this became much much much faster (and 2000 concurrent threads running dropped to 10). Of course, this is most visible in high-latency-high-throughput I/O situations, but we’re hitting this contention spot on local disk setups too.

Don’t forget to have the fix at hand in case gdb decides to be nasty and locks up your server :)

Opening tables!

There’s one bottleneck in MySQL/InnoDB that ultimately sucks. It sucked in 4.0, it sucked in 5.0, and it still sucks in 5.1 with the newest InnoDB plugin. Opening tables has always been a bottleneck on machines that have thousands of tables (as LOCK_open is held during the process), and while a table is being opened, everything else on the machine stalls.

It can simply take hours on such systems just to open the tables – and the major portion of that time is spent randomly diving into InnoDB tables to populate index statistics. It obviously sounds like low-hanging fruit – statistics aren’t needed while you are opening a table, they’re needed only when querying it.

So, I threw a few thousand tables onto my machine and tried opening them with ten connections. Standard InnoDB code was opening 13.5 tables a second. After spending a few minutes moving statistics collection to after ha_innodb::open() (this is a pure prototype, not suitable for production), I noticed a performance increase.

Tables were opened at 105 a second. A bit better – roughly 8x better.
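
Conceptually the prototype is nothing more than lazy initialization – don’t pay for statistics at open time, pay on first use. A minimal sketch of that pattern (illustrative C only; the table_t struct and the function names are made up, this is not the actual InnoDB change):

#include <pthread.h>
#include <stdbool.h>

typedef struct {
        pthread_mutex_t stats_mutex;    /* per-table, so tables don't serialize */
        bool            stats_ready;
        /* ... index cardinality estimates, etc. ... */
} table_t;

/* The expensive part: random dives into index pages on disk. */
static void stats_populate(table_t *table)
{
        (void) table;   /* ... */
}

/* Opening a table no longer pays for statistics, so any global
   open lock is held only briefly. */
void table_open(table_t *table)
{
        pthread_mutex_init(&table->stats_mutex, NULL);
        table->stats_ready = false;
}

/* The first query against the table triggers the expensive work. */
void table_stats_lazy(table_t *table)
{
        pthread_mutex_lock(&table->stats_mutex);
        if (!table->stats_ready) {
                stats_populate(table);
                table->stats_ready = true;
        }
        pthread_mutex_unlock(&table->stats_mutex);
}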

Merry Christmas, MySQL!

On deadlock detection

InnoDB detects deadlocks. Deadlocks are those nasty situations when transaction 1 tries to acquire locks A and B, while transaction 2 tries to acquire locks B and A at the same time. As both are stubborn, InnoDB will simply decide to terminate one of them; if it didn’t, both transactions would have to wait for lock_wait_timeout to expire. The longer a transaction is, the more likely it is to cause deadlocks. Deadlock detection kind of helps, then, but… at a certain cost.

The transaction-1-and-2 case is way too easy; try adding a few hundred transactions contending over the same set of locks. To handle that, the InnoDB deadlock monitor will recursively brute-force the lock graph until it hits a 200-transaction-long chain (which it declares a deadlock), or until it runs out of paths to check. Still, with the power of modern hardware, that only takes milliseconds.
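
To get a feel for what that search looks like, here is a minimal sketch of a depth-limited wait-for-graph walk (illustrative C, not InnoDB’s lock0lock code; the graph representation and the hard-coded 200 limit are my simplifications):

#include <stdbool.h>
#include <stddef.h>

#define MAX_SEARCH_DEPTH 200    /* give up (and call it a deadlock) here */

typedef struct trx {
        struct trx      **waits_for;    /* holders of locks this trx wants */
        size_t          n_waits_for;
        bool            on_path;        /* currently on the search path */
} trx_t;

/* Depth-first walk of the wait-for graph starting from `waiter`.
   Returns true if a cycle is found or the depth cap is hit. With
   hundreds of transactions contending on the same locks, the cost
   explodes, since every path through the graph gets explored. */
static bool deadlock_search(trx_t *waiter, trx_t *trx, int depth)
{
        size_t  i;
        bool    found = false;

        if (depth > MAX_SEARCH_DEPTH) {
                return(true);
        }

        trx->on_path = true;

        for (i = 0; i < trx->n_waits_for && !found; i++) {
                trx_t*  next = trx->waits_for[i];

                if (next == waiter || next->on_path) {
                        found = true;           /* cycle: deadlock */
                } else {
                        found = deadlock_search(waiter, next, depth + 1);
                }
        }

        trx->on_path = false;

        return(found);
}

/* Usage: deadlock_search(waiter, waiter, 0) whenever a lock wait starts. */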

Unfortunately, InnoDB also holds kernel_mutex at that time, so lots and lots of InnoDB operations cannot happen. To be exact, InnoDB will rarely do anything else while a deadlock check is happening.

To illustrate that, I have a very simple testcase (which under certain conditions stalls the server for half an hour, even when it isn’t being run):

UPDATE t1 SET b=b+1 WHERE a=1;

With a few threads it executes nearly 20000 times a second on my desktop machine. With ten threads it executes 14000/s. With 50 threads it is only 3000/s. With 100 threads it falls to 639 operations a second. At 140 threads it is down to just 266.

I built InnoDB without deadlock detection (a tiny, tiny patch) and tried the same test. Similar performance with 10 threads, and it was still doing 10000 operations a second at 100 threads.

Though I illustrated an edge case here, its purity actually hides how bad this can get – the situation can arise not only from high contention on a single row, but simply because someone holds a row lock for a bit too long (there’s always that sleep between UPDATE and COMMIT, too). It can take a single transaction to cause a lock convoy, and once transactions queue up and the update rate falls below 100/s, all MySQL will be doing is checking for deadlocks, even if they never happen.

On many systems deadlock detection causes way more issues than the lack of it would. Most deadlocks happen on transactions that are somewhere in the middle of their lock wait anyway :)

There’s some discussion about it at MySQL Bug#49047.

best free() is exit()

Whenever any maintenance needs a server restart, there’s a whole list of unsolved bottlenecks and inefficient code paths that get exercised at that time. I can understand that warming up the server can take lots of time (though there is plenty of low-hanging fruit there), but the way the actual shutdown is done, even when there isn’t much dirty data to flush, sucks.

See, developers are perfectionists, and their perfectionism includes the crazy idea that all memory has to be deallocated at server shutdown, as otherwise Valgrind and other tools will complain that someone leaked memory. So developers write expensive code in shutdown routines that traverses every memory structure and deallocates/free()s it.

Now, guess what would happen if they didn’t write all this expensive memory deallocation code.

Still guessing?

The OS would do it for them, much much much faster, without blocking the shutdown for minutes or using excessive amounts of CPU. \o/
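
A toy demonstration (illustrative C only, nothing to do with mysqld):

#include <stdlib.h>

#define N_ALLOCS (5 * 1000 * 1000)

int main(int argc, char **argv)
{
        void    **blocks = malloc(N_ALLOCS * sizeof(void *));
        long    i;

        /* Build a large in-memory structure, the way a server builds caches. */
        for (i = 0; i < N_ALLOCS; i++) {
                blocks[i] = malloc(64);
        }

        if (argc > 1) {
                /* "Perfectionist" shutdown: walk everything and free() it.
                   This is the part that keeps the process busy for ages. */
                for (i = 0; i < N_ALLOCS; i++) {
                        free(blocks[i]);
                }
                free(blocks);
        }

        /* Either way, the kernel tears down the whole address space here. */
        return 0;
}

Run it as "time ./a.out" versus "time ./a.out free" – the difference is the entire cost of the perfectionist shutdown.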

Crash recovery, again

There’s one stage in InnoDB crash recovery where it reads the log file, you know, this one:

InnoDB: Doing recovery: scanned up to log sequence number 354164119040
InnoDB: Doing recovery: scanned up to log sequence number 354169361920

On a machine with bigger logs it will start spending nearly 100% CPU somewhere in recv_scan_log_recs. Guess what it does… -fno-inline builds to the rescue:

#0  mem_block_get_len at ./include/mem0mem.ic:86
#1  mem_heap_get_size at ./include/mem0mem.ic:591
#2  recv_scan_log_recs at log/log0recv.c:2727

And:

samples  %        symbol name
8467     72.9222  mem_heap_get_size
291       2.5062  recv_add_to_hash_table
95        0.8182  mem_block_get_len

To speak in layman’s terms, InnoDB does SUM(LENGTH(allocation)) over its rather large memory arena (tens or hundreds of thousands of entries), FOR EVERY LOG SEGMENT, just to make sure it didn’t run out of the available 32GB. Hehehe, lame.
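
The shape of a proper fix is obvious – maintain a running total instead of re-walking the whole block list on every check. A minimal sketch of the difference (illustrative C; these structures are made-up simplifications, not the real mem0mem ones):

#include <stddef.h>

typedef struct mem_block {
        size_t                  len;
        struct mem_block        *next;
} mem_block_t;

typedef struct {
        mem_block_t     *blocks;
        size_t          total_size;    /* maintained on every alloc/free */
} heap_t;

/* What the recovery scan effectively does now: a full walk of the
   block list (tens or hundreds of thousands of entries) per call. */
size_t heap_get_size_slow(const heap_t *heap)
{
        const mem_block_t       *block;
        size_t                  sum = 0;

        for (block = heap->blocks; block != NULL; block = block->next) {
                sum += block->len;
        }

        return(sum);
}

/* The cheap alternative: keep total_size up to date when blocks are
   added or removed, and the "did we run out of memory" check is O(1). */
size_t heap_get_size_fast(const heap_t *heap)
{
        return(heap->total_size);
}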

As for now, I’ve just killed the check and my recovery is much, much faster – it finished in 3 minutes what it wasn’t able to do in 30 before.

P.S. This is different from what I wrote about before (and from Yasufumi’s magic patch).
P.P.S. Now I have to learn to reuse the LOG_DUMMY table during the recovery process, as that is the next low-hanging fruit there…

Checksums again, some I/O too

When I was doing data loading tests, I realized that checksum calculation – usually just a low CPU percentage – was actually the blocking factor. See, usually when background writers do the flushing, it gets parallelized, but if an active query forces a checkpoint, it all happens in the ‘foreground’ thread, checksum computation included. This is where more Sun-ish wisdom (these people tune the kernel with a debugger all the time) comes in:

gdb -p $(pidof mysqld) -ex "set srv_use_checksums=0" --batch

Puff. Everything becomes much faster. Of course, one could restart the server with --skip-innodb-checksums, but that would interrupt the whole process, etc. Of course, proper people would implement a tunable parameter (5 lines of code or so), but anyone with Solaris experience knows how to tune stuff with debuggers, hahaha.

Odd, though – I was used to -O3 builds optimizing checksums out of profiles, so either that doesn’t work here, or checksums simply got enough relative weight to actually matter. This is why I told Heikki at the Users Conference that checksums ‘must go’ or be fixed one way or another. Even if they’re a small part of the operation and cost nothing when they don’t block primary operations (e.g. when they happen in parallel), people keep moving to SSDs, and then the relative cost of the checksum is insane.

It is quite easy to check: just run InnoDB with a tiny buffer pool and OS caching, and try some benchmarking with checksums enabled and disabled – there will be a huge difference, which may go a long way towards explaining the performance gap between caching in the buffer pool and in OS buffers. Turn checksums off, and OS caching may even be tolerable for your workloads.
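
To get a feel for the per-page cost without touching MySQL at all, a trivial standalone micro-benchmark does the job too (illustrative C; this is a naive byte-sum, not InnoDB’s actual page checksum algorithm):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define PAGE_SIZE (16 * 1024)   /* InnoDB default page size */
#define N_PAGES   (200 * 1000)  /* ~3GB worth of page accesses */

/* Naive stand-in for a page checksum: one full pass over 16KB. */
static unsigned long page_checksum(const unsigned char *page)
{
        unsigned long   sum = 0;
        size_t          i;

        for (i = 0; i < PAGE_SIZE; i++) {
                sum = sum * 31 + page[i];
        }
        return sum;
}

int main(void)
{
        unsigned char   *page = malloc(PAGE_SIZE);
        unsigned long   sink = 0;
        size_t          i;
        long            p;
        clock_t         start;
        double          secs;

        for (i = 0; i < PAGE_SIZE; i++) page[i] = (unsigned char) i;

        start = clock();
        for (p = 0; p < N_PAGES; p++) {
                sink ^= page_checksum(page);
        }
        secs = (double) (clock() - start) / CLOCKS_PER_SEC;

        /* When pages come from the OS cache or an SSD in microseconds,
           microseconds of checksum per page start to dominate. */
        printf("%d pages, %.2f s, %.2f us/page (sink=%lx)\n",
               N_PAGES, secs, secs * 1e6 / N_PAGES, sink);
        free(page);
        return 0;
}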

Another interesting thing here is that MarkC has lots of I/O path optimization in his fourth (and four is a good number!) patch. Now… maybe everyone will soon be caching pages in the OS, once some more work is done on the I/O access path costs?

stop messing with the tablespace

People keep loving and endorsing --innodb-file-per-table. Then poor new users read about that, get confused, start using --innodb-file-per-table, and tell others to. Others read that, get confused even more, start using --innodb-file-per-table, then write about it. Then…

Oh well. Here are some of those endorsements, and the FUD against one-tablespace-to-unite-them-all:

This same nice property also translates to a not so nice one: data can be greatly fragmented across the tablespace.

Of course, having file-per-table means that only one table lives in a file, so, kind of, it will not be ‘mixed’… inside the file. But when data grows organically (not when you restore a few-hundred-gigabyte dump sequentially), all those files grow and start getting fragmented (at ratios depending on how smart the filesystem is – and how many people choose smart filesystems nowadays?). So the disks end up with the same soup of data; instead of ‘fragmentation’ inside a tablespace that is laid out sequentially on a disk/filesystem, you get fragmentation at the filesystem level. Now, InnoDB has extents and such, and can afford new storage formats more often than filesystems do, so… which one is more modern, with proper data allocation strategies?

So, some more criticism of InnoDB tablespaces:

An annoying property of InnoDB’s tablespaces is that they never shrink.

Annoying are the people who find this annoying. Look, it is simple: if your data grows to X, and then you do something amazing and shrink your dataset, there are two outstanding questions:

  • How soon will your dataset grow back to X?
  • What temporary data will be placed there until the dataset grows back to X?

It is very simple: database servers house data. Run OPTIMIZE, and the data will eventually get fragmented again (quite fast, actually, under usual access patterns – once you have lots of full pages around, a simple insertion will split pages). The ‘free space’ achieved does not bring much value; it will be gone, and data will be placed there. Oh, and it _will_ be used by the database, by _any_ table, not just the owner-table of a file.

It does not release storage to the file system.

*facepalm*. The file system doesn’t want that storage. It will give it back to InnoDB as soon as it asks. :)

I’ve seen more than once how certain tables are left unwatched, growing until disk space reaches 90% and SMS notifications start beeping all around.

Where is the issue of data storage strategies there? I have tablespaces taking 95%. When they reach 99%, I will kill those database servers and cycle them out to other tasks dealing with smaller datasets. There is nothing bad about data growing, as long as you know the reason. There is nothing wrong with a tablespace growing – that’s its job, to house the data you feed in.

But having a file which consumes some 80-90% of disk space is a performance catastrophe. It means the disk needle needs to move large distances. Overall disk performance runs very low.

[citation needed]. Which filesystem will not place data at those ‘long distances’ for the needle when it gets full? Which filesystem will relocate the data back to the start of the disk? This is a very poorly sourced statement that spreads FUD, even though the same data placement would happen under pretty much every storage strategy. If the start area gets full, you have to place data elsewhere. End of tablespace, end of filesystem, end of disk – it is the same thing.

Also, one can never be sure how the underlying storage is mapped to the block device. Writing to the end can be the fastest – you never know. Did you try? :)

mysqldump, mk-parallel-dump

Where is mydumper? ;-D

One other nice thing about innodb_file_per_table is that it is possible to monitor table size on the file system level.

Oh sure, that’s just what you need – give people who deal with data at the application level access to MySQL data directories, and let them run awk, sort, du, etc., even though INFORMATION_SCHEMA gives you that data for any storage engine you use. Oh, and by the way, both I_S and SHOW TABLE STATUS will tell you about free space, whereas the filesystem has no idea what data is inside the file. You choose your tools… ;-)

Oh, wait, this was given as an argument:

You don’t need access to MySQL, to use SHOW TABLE STATUS or to query the INFORMATION_SCHEMA.

Right…

Though actually I do know where file-per-table solves something. I’ve written already that most Linux filesystems serialize O_DIRECT writes. Using separate files means that O_DIRECT writes get placed in separate queues, one per inode. I’d suggest just using a proper filesystem :-)

Now I should probably tell why having a single tablespace is better. Besides the obvious (freed space shared by all tables), I don’t know. Well, there are lots of different issues that have different impact on different operating systems, filesystems, storage, etc.

For example, when data lives inside a single tablespace, filesystem metadata operations are nearly non-existent. That means no writes to the FS journal, less need for multiple points of synchronization, etc.

Multiple file descriptors could also be mentioned, though nowadays that probably has nearly zero significance. I don’t know – statements like that need more than just assumptions. I don’t have good numbers on how much FS fragmentation affects a proper database workload – and the results can differ depending on data access patterns, size composition, and of course the filesystems used.

Still, from the OS administrator’s perspective the data is not fragmented in any way, and moving files around will hit bandwidth limits in the network, memory, and I/O, but will not do any seeks. I have probably dealt with too many systems housing hundreds of millions of files not to love the single-file approach. :)

On throttling

Seems like nowadays InnoDB crash recovery is much funnier. Please allow me to show you something:

while (buf_pool->n_pend_reads >=
       recv_n_pool_free_frames / 2) {
        os_aio_simulated_wake_handler_threads();
        os_thread_sleep(500000);
}

Translation – if there are more than 128 outstanding I/O requests, sleep for half a second. Hehehe. Optimized for floppies!
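
For comparison, here is a sketch of what a less floppy-era wait could look like – wake up the moment an I/O completes instead of napping for half a second (illustrative pthread C only, not a patch against the actual code):

#include <pthread.h>

/* Hypothetical bookkeeping: read threads decrement n_pend_reads and
   signal the condition variable whenever an I/O completes. */
static pthread_mutex_t  pend_mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t   pend_cond  = PTHREAD_COND_INITIALIZER;
static unsigned long    n_pend_reads;
static unsigned long    max_pend_reads = 128;

/* The caller blocks only while the queue is actually full, and wakes
   the moment a slot frees up, not 500ms later. */
void wait_for_read_slot(void)
{
        pthread_mutex_lock(&pend_mutex);
        while (n_pend_reads >= max_pend_reads) {
                pthread_cond_wait(&pend_cond, &pend_mutex);
        }
        pthread_mutex_unlock(&pend_mutex);
}

/* I/O completion side. */
void read_completed(void)
{
        pthread_mutex_lock(&pend_mutex);
        n_pend_reads--;
        pthread_cond_signal(&pend_cond);
        pthread_mutex_unlock(&pend_mutex);
}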