We quite often say that benchmark performance is usually different from real-world performance, so performance engineering has to cover both. Benchmarks let us understand sustained performance bottlenecks, while real-world analysis usually concentrates on something that would be considered ‘exceptional’ and unimportant in benchmarks: stalls of various kinds. They are extremely important, as the state in which our performance is lowest is the performance we actually provide to our platform users.
On a machine that is doing 5000 qps, stalling for 100 ms means that 500 queries were not served as fast as they could have been, or even hit application timeouts or exceptional MySQL conditions (like the 1023 transaction limit). Of course, stalling for a second means 5000 queries were not served in time…
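The arithmetic above is easy to reproduce. This is a minimal sketch that assumes queries arrive at a steady rate; the function name `queries_affected` is made up for illustration.

```python
def queries_affected(qps: float, stall_seconds: float) -> int:
    """Estimate how many queries arrive during a stall.

    Assumes a steady arrival rate, which real traffic only approximates.
    """
    return round(qps * stall_seconds)


# The two cases from the text: a 100 ms stall and a one-second stall at 5000 qps.
print(queries_affected(5000, 0.1))
print(queries_affected(5000, 1.0))
```

The estimate is a lower bound on the damage: queries that pile up during the stall also slow down the ones arriving right after it.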
We have multiple methods to approach this. One is our ‘dogpiled’ framework, an agent that polls status every second and reports information about I/O state, MySQL/InnoDB statuses, processlists, etc., so we see the scope of stalls in our environment. We try to maintain the threshold between complete information overload and something that reveals problems, so it is always a balancing act, especially with the great work done by the engineering team :)
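To make the idea of once-a-second polling concrete, here is a rough sketch of how stalls could be picked out of polled counters. It is not how dogpiled works internally. The function `find_stalls`, the `drop_ratio` parameter and the sample format (timestamp plus a cumulative query counter) are all assumptions made for illustration.

```python
from statistics import median


def find_stalls(samples, drop_ratio=0.5):
    """Flag polling intervals whose query rate fell well below the typical rate.

    samples: list of (unix_time, cumulative_query_count), polled about once a second.
    drop_ratio: hypothetical threshold; an interval is flagged when its rate is
                below drop_ratio times the median rate of all intervals.
    Returns a list of (start_time, end_time, qps) tuples.
    """
    rates = []
    for (t0, c0), (t1, c1) in zip(samples, samples[1:]):
        if t1 > t0:  # skip duplicate or out-of-order samples
            rates.append((t0, t1, (c1 - c0) / (t1 - t0)))
    if not rates:
        return []
    baseline = median(rate for _, _, rate in rates)
    return [(t0, t1, rate) for t0, t1, rate in rates if rate < drop_ratio * baseline]


# Made-up counters: steady 5000 qps with one interval that served only 500 queries.
polled = [(0, 0), (1, 5000), (2, 10000), (3, 10500), (4, 15500)]
print(find_stalls(polled))
```

The threshold is where the balancing act shows up: with one-second polling and a steady arrival rate, a 100 ms stall appears only as a dip of about 10% in one interval, so a loose threshold misses it and a tight one floods you with noise.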
The other approach, usually prompted by information from dogpiled, is auto-PMP: high-frequency status polling combined with gdb invocations, which allows us to jump into the process whenever we notice something weird going on. We have some extensions to how we use PMP, but that’s worth another post.
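For readers who have not used PMP (the "poor man's profiler"), the core of it is: dump the stacks of all threads with gdb, collapse each stack into a single signature, and count identical signatures. The sketch below shows that aggregation step in Python. It is not our auto-PMP, and the function names `collapse_stacks` and `snapshot` are made up for illustration. The parsing assumes gdb's usual `thread apply all bt` output layout.

```python
import subprocess
from collections import Counter


def collapse_stacks(gdb_output: str) -> Counter:
    """Count identical thread stacks in `thread apply all bt` output."""
    stacks = Counter()
    frames = []
    for line in gdb_output.splitlines():
        if line.startswith("Thread "):
            # A new thread header closes the previous thread's stack.
            if frames:
                stacks[",".join(frames)] += 1
            frames = []
        elif line.startswith("#"):
            tokens = line.split()
            # Frames look like "#1 0xADDR in func (...)" or "#0 func (...)".
            if len(tokens) >= 4 and tokens[1].startswith("0x"):
                frames.append(tokens[3])
            elif len(tokens) >= 2:
                frames.append(tokens[1])
    if frames:
        stacks[",".join(frames)] += 1
    return stacks


def snapshot(pid: int) -> Counter:
    """Attach gdb to a running process once and aggregate its thread stacks.

    Attaching pauses the process briefly, so use with care on a busy server.
    """
    result = subprocess.run(
        ["gdb", "-batch", "-ex", "set pagination 0",
         "-ex", "thread apply all bt", "-p", str(pid)],
        capture_output=True, text=True, check=False,
    )
    return collapse_stacks(result.stdout)
```

Calling `snapshot(pid).most_common(5)` on a stalled server would show which few stacks most threads are sitting in, which is usually enough to see what they are all waiting on.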
The issues we find that harm us most in production environments are the ones quite often dismissed as “this never happens”, “get better hardware” or “your application is wrong”. Unfortunately, they do happen, we do have thousands of machines that aren’t free, and our application demands are our application demands :)
A few examples:
- TRUNCATE stalls the server (oh well, DROP TABLE too) – in this case, truncating a table grabs the dictionary mutex, another transaction blocks on it while holding LOCK_open, and everything else stops. Though truncating is supposed to be a fast operation, it has to unlink (delete) a file, and with large files such an operation isn’t really instant on any filesystem. Even if one deletes all the data before truncating, the file is still on the filesystem.
- Extending data files stalls the server – when a data file is being extended, a global mutex is held, which blocks all I/O (with limited concurrency, that is a full server stall). Somewhat more impressive with file-per-table. This is the major reason for mini-stalls at the moment – on machines that grow at a gigabytes-a-day rate, this is hit quite often.
- Updating table statistics stalls the server – we hit this on high-performance task-tracking machines, where row churn is quite amazing and dictionary statistics are reread more often than one would expect. Updating statistics means locking the table while doing random reads from disk. Once a major workload is hitting that table, it quickly escalates to a full server stall.
- Fuzzy checkpoint stalls the server – this is one of the biggest outstanding issues in stock MySQL. Though one would expect a “fuzzy checkpoint” that uses async background threads to be nonblocking, all writes during it actually stall, taking all the concurrency slots and leading to a server stall. Mark’s fix was simply to do this work in a background thread.
- (no bug filed on this yet) – Purge stalls the server – purge holds the dictionary lock while doing random reads from disk, with a table stall leading to a server stall.
There are more issues (mostly related to heavier in-memory activities of the server), but these are the most obvious ones, where a single I/O request escalates into a table or instance lockup during which no other work is done. Our machines have multiple disks and multiple CPUs and can execute multiple SQL queries at once, so any of these lockups effectively limits our available performance or damages the quality of service we can provide.
On the upside, my colleagues are absolutely amazing and I’m sure that we will have all these issues fixed in our deployment in the near future, and everyone will be able to pick the fixes up via the mysqlatfacebook branch.
http://bugs.mysql.com/bug.php?id=54538 is open for the purge problem. I am not sure if that covers everything.
I prefer to wait for InnoDB to fix TRUNCATE and use DROP instead of TRUNCATE until that is done.
I think that another fix from me to prevent LOCK_open pileups (open one table instance at a time) that occur at database start made the update stats stall worse. Alas, I think I have fixed that with a recent change.
This is also another drop-table/truncate stall: http://bugs.mysql.com/bug.php?id=51325
When using a 50GB buffer pool, it takes about 3 seconds in our environment.