Opening tables v2!

PMP on demand revealed one of the reasons why we’ve been seeing ‘Opening tables’ during normal operation, not just at startup (see my previous post on this topic).

We had a thousand or so threads waiting on LOCK_open, and the only thread holding the mutex was this darling:

Thread 902 (Thread 1637624128 (LWP 22113)):
#3  mutex_spin_wait (mutex=0x2aaaac3232b8,
        file_name=0x8b3bf7 "trx0trx.c", line=220) at sync0sync.c:565
#4  trx_allocate_for_mysql
#5  ha_innobase::allocate_trx
#6  check_trx_exists
#7  ha_innobase::info
#8  ha_innobase::open
#9  handler::ha_open
#10 openfrm
#11 open_unireg_entry
#12 open_table
#13 open_tables
#14 open_and_lock_tables
#15 mysql_insert

So the kernel mutex, which is quite contended, escalates into LOCK_open on table open, which blocks all queries on the server. The fix? Nearly the same as the previous “Opening tables” fix – don’t call ha_innobase::info() as part of open(), or move the transaction establishment code outside of info().
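Here’s a toy illustration of the escalation in plain C (not MySQL code – the names are made up for the demo): a thread that blocks on a hot inner mutex while holding the table-open lock turns inner-mutex contention into a stall for every table open on the server.

#include <pthread.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

static pthread_mutex_t LOCK_open_toy    = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t kernel_mutex_toy = PTHREAD_MUTEX_INITIALIZER;

/* a thread doing long work under the "kernel mutex" */
static void *kernel_holder(void *arg) {
    (void)arg;
    pthread_mutex_lock(&kernel_mutex_toy);
    sleep(3);
    pthread_mutex_unlock(&kernel_mutex_toy);
    return NULL;
}

/* a table open: grabs LOCK_open, then needs the kernel mutex
   (as in open() -> info() -> trx_allocate_for_mysql()) */
static void *opener(void *arg) {
    (void)arg;
    pthread_mutex_lock(&LOCK_open_toy);
    pthread_mutex_lock(&kernel_mutex_toy);   /* blocks, holding LOCK_open */
    pthread_mutex_unlock(&kernel_mutex_toy);
    pthread_mutex_unlock(&LOCK_open_toy);
    return NULL;
}

int main(void) {
    pthread_t h, o;
    pthread_create(&h, NULL, kernel_holder, NULL);
    sleep(1);                   /* let it take the kernel mutex */
    pthread_create(&o, NULL, opener, NULL);
    sleep(1);                   /* let it take LOCK_open */

    time_t start = time(NULL);
    /* any other thread opening any table now queues here, even if
       it would never touch the kernel mutex itself */
    pthread_mutex_lock(&LOCK_open_toy);
    printf("stalled %lds on LOCK_open\n", (long)(time(NULL) - start));
    pthread_mutex_unlock(&LOCK_open_toy);

    pthread_join(h, NULL);
    pthread_join(o, NULL);
    return 0;
}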

Read ahead…

Mark wrote about how to find situations where InnoDB read-ahead is a bottleneck. What he didn’t disclose, though, is his trick for disabling read-ahead without restarting or recompiling MySQL. See, there’s no internal “disable read-ahead” knob. But there is…

buf_read_ahead_random(...) { ...
        if (srv_startup_is_before_trx_rollback_phase) {
                /* No read-ahead to avoid thread deadlocks */
                return(0);
        }

This variable is tested in exactly two functions – buf_read_ahead_linear() and buf_read_ahead_random() – and nowhere else. So yeah, “server startup is before transaction rollback phase” is another way of saying “don’t do read-ahead, please please”.

gdb -ex "set  srv_startup_is_before_trx_rollback_phase=1" \
    --batch -p $(pidof mysqld)
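Judging by the code above, setting the variable back to 0 the same way should re-enable read-ahead – both functions consult it on every call, so the change takes effect immediately.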

And many servers bottlenecked on this became much, much faster (the 2000 concurrent running threads dropped to 10). Of course, this is most visible in high-latency, high-throughput I/O situations, but we’re hitting this contention spot on local-disk setups too.

Don’t forget to keep the fix below handy in case gdb decides to be nasty and locks up your server :)

When bad things happen!

When bad things happen, like… ‘strace -f’ or ‘gdb’ or any other process-inspection tool deciding to hang your precious processes (they show up in state T in process lists), there’s always help:

#include <sys/ptrace.h>
#include <sys/wait.h>
#include <signal.h>
#include <stdlib.h>

int main(int ac, char **av) {
    if (ac < 2) return 1;
    int pid = atoi(av[1]);              /* PID of the stuck process */
    ptrace(PTRACE_ATTACH, pid, 0, 0);   /* become its tracer */
    waitpid(pid, 0, 0);                 /* wait for it to stop for us */
    ptrace(PTRACE_DETACH, pid, 0, 0);   /* let go cleanly */
    kill(pid, SIGCONT);                 /* kick it out of state T */
    return 0;
}
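Compile it, point it at the stuck PID (the mysqld one, say), and the process should come back – the attach/detach cycle clears any stale tracing state, and SIGCONT pulls it out of state T.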

Opening tables!

There’s one bottleneck in MySQL/InnoDB that ultimately sucks. It sucked in 4.0, sucked in 5.0, and still sucks in 5.1 with the newest InnoDB plugin. Opening tables has always been a bottleneck on machines with thousands of tables (LOCK_open is held for the whole process), and while a table was being opened, everything else on the machine would stall.

On such systems it can simply take hours just to open the tables – and the major portion of that time is spent randomly diving into InnoDB tables to populate index statistics. It obviously sounds like low-hanging fruit – statistics aren’t needed to open a table, only to query it.

So, I threw a few thousand tables at my machine and tried opening them from ten connections. Standard InnoDB code was opening 13.5 tables a second. After spending a few minutes moving statistics collection to after ha_innobase::open() (a pure prototype, not suitable for production), I noticed a performance increase.

Tables were now being opened at 105 a second. A bit better – about 8x better.
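The prototype boils down to this pattern – a minimal sketch in plain C, not the actual patch (names here are made up):

#include <pthread.h>

static pthread_mutex_t LOCK_open_demo = PTHREAD_MUTEX_INITIALIZER;

struct toy_table { int stats_populated; };

/* expensive: random dives into index trees */
static void update_statistics(struct toy_table *t) {
    t->stats_populated = 1;
}

/* before: statistics populated inside open(), under LOCK_open */
static void open_table_before(struct toy_table *t) {
    pthread_mutex_lock(&LOCK_open_demo);
    update_statistics(t);           /* every other open waits on this */
    pthread_mutex_unlock(&LOCK_open_demo);
}

/* after: open() stays cheap; statistics get populated lazily,
   at first query, outside LOCK_open */
static void open_table_after(struct toy_table *t) {
    pthread_mutex_lock(&LOCK_open_demo);
    /* just register the table */
    pthread_mutex_unlock(&LOCK_open_demo);
}

static void query_table(struct toy_table *t) {
    if (!t->stats_populated)
        update_statistics(t);
}

int main(void) {
    struct toy_table t1 = { 0 }, t2 = { 0 };
    open_table_before(&t1);         /* the 13.5/s regime */
    open_table_after(&t2);          /* the 105/s regime */
    query_table(&t2);
    return 0;
}

In real code the lazy path would of course need its own synchronization, so two first queries don’t race to collect the same statistics.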

Merry Christmas, MySQL!

On deadlock detection

InnoDB detects deadlocks. Deadlocks are those nasty situations where transaction 1 tries to acquire locks A and B while transaction 2 tries to acquire locks B and A at the same time. As both are stubborn, InnoDB simply decides to terminate one of them – otherwise both transactions would have to wait for lock_wait_timeout to expire. The longer a transaction is, the more likely it is to cause deadlocks, so deadlock detection kind of helps… at a certain cost.

The transaction 1 and 2 case is way too easy – try adding a few hundred transactions contending over the same set of locks. To handle that, the InnoDB deadlock monitor recursively brute-forces the lock graph until it hits a 200-transaction-long chain (which it declares a deadlock) or runs out of paths to check. Still, with the power of modern hardware, that takes just milliseconds.
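To get a feel for the brute force involved, here’s a toy model of a recursive wait-for-graph search in plain C (illustrative only – this is not InnoDB’s code): with a mere twenty transactions and no cycle at all, the path count already explodes.

#include <stdio.h>

#define MAX_TRX   64
#define MAX_DEPTH 200   /* the 200-transaction chain cap */

static int waits_for[MAX_TRX][MAX_TRX];
static int n_trx;
static unsigned long steps;

/* walk every path from `start` until a cycle back to it is found,
   the depth cap is hit, or we run out of paths */
static int find_cycle(int start, int cur, int depth) {
    if (depth > MAX_DEPTH)
        return 1;                        /* too deep: call it a deadlock */
    for (int next = 0; next < n_trx; next++) {
        if (!waits_for[cur][next])
            continue;
        steps++;
        if (next == start)
            return 1;                    /* real cycle */
        if (find_cycle(start, next, depth + 1))
            return 1;
    }
    return 0;
}

int main(void) {
    n_trx = 20;
    /* heavy contention, but acyclic: no deadlock actually exists */
    for (int i = 0; i < n_trx; i++)
        for (int j = i + 1; j < n_trx; j++)
            waits_for[i][j] = 1;

    find_cycle(0, 0, 0);
    printf("%lu edges walked for %d transactions, no deadlock found\n",
           steps, n_trx);
    return 0;
}

Each extra contending transaction roughly doubles the work here.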

Unfortunately, InnoDB also holds kernel_mutex the whole time, so lots and lots of InnoDB operations cannot happen meanwhile. To be exact, InnoDB will rarely manage to do anything else while a deadlock check is running.

To illustrate this, I have a very simple test case (one that, under certain conditions, stalls the server for half an hour – even when it is not being run):

UPDATE t1 SET b=b+1 WHERE a=1;

With a few threads it executes nearly 20000 times a second on my desktop machine. With ten threads it executes 14000/s. With 50 threads it is only 3000/s. With 100 threads it falls down to 639 operations a second. At 140 threads it is already just 266.

I built InnoDB without deadlock detection (a tiny, tiny patch) and ran the same test: similar performance at 10 threads, and still 10000 operations a second at 100 threads.

Though I illustrated an edge case here, its purity actually understates how bad this can get – the situation arises not only from high contention on a single row, but simply from someone holding a row lock a bit too long (there’s always that sleep between UPDATE and COMMIT, too). A single transaction can set off a lock convoy, and once transactions queue up and the update rate falls below 100/s, all MySQL will be doing is checking for deadlocks, even if they never happen.

On many systems deadlock detection causes way more issues than the lack of it would. Most deadlocks happen on transactions that are somewhere in the middle of their lock wait anyway :)

There’s some discussion about it at MySQL Bug#49047.

On replication, some more

Dear MySQL,

I feel ashamed that I ever wanted you to support 4.0->5.1 replication, and I apologize for that. I really understand that it was egoistic of me to even consider that you should be involved in this.

I even understand that 5.0 is running out of active support (I’m not questioning that you’ll stop supporting 4.1 entirely too), and that you’ll stop doing pretty much anything to 5.0 except “critical security fixes” (w00t, I managed to get one into 4.1 – an 8-year-old MITM flaw :).

I really understand that supporting more than one release is very, very, very difficult, and that people should only do adjacent-version upgrades.

I’m not asking you for much – but maybe you could then support 5.1 to 5.1 replication? I don’t want much, just:

  • Gracefully recover after slave crashes.
  • Don’t make single-threaded, serial reading of pages for the replication stream a bottleneck – either read ahead properly (you can do that with RBR!), or apply events in parallel (you can do that with RBR too!)
  • Allow editing replication filters without restarting servers.
  • Allow enabling and disabling binary logging of events received from the master, as well as turning binary logging on and off altogether, without restarting the instance.

I hope it isn’t too much to ask! It is just supported replication between two same-version instances.

Thanks!

best free() is exit()

Whenever maintenance calls for a server restart, there’s a list of unsolved bottlenecks and inefficient code paths that get exercised at that point. I can understand that warming the server back up can take lots of time (though there are lots of low-hanging fruits there too), but the way the actual shutdown is done sucks, even when there isn’t much dirty data to flush.

See, developers are perfectionists, and their perfectionism includes the crazy idea that all memory has to be deallocated at server shutdown, as otherwise Valgrind and other tools will complain that someone leaked memory. So developers write expensive shutdown code that traverses every memory structure and deallocates/free()s it.

Now, guess what would happen if they didn’t write all this expensive memory deallocation code.

Still guessing?

The OS would do it for them – much, much, much faster, without blocking the shutdown for minutes or burning excessive amounts of CPU. \o/
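A minimal sketch of the point in plain C (nothing MySQL-specific here, names made up):

#include <stdlib.h>

struct node { struct node *next; char payload[240]; };

/* the perfectionist shutdown: touch every allocation one free() at
   a time, paging long-cold memory back in along the way */
static void shutdown_slow(struct node *head) {
    while (head) {
        struct node *next = head->next;
        free(head);
        head = next;
    }
}

int main(int ac, char **av) {
    (void)av;
    struct node *head = NULL;
    for (long i = 0; i < 1000 * 1000; i++) {
        struct node *n = malloc(sizeof(*n));
        if (!n) break;
        n->next = head;
        head = n;
    }
    if (ac > 1)
        shutdown_slow(head);    /* ages if the heap is cold or swapped out */
    exit(0);                    /* otherwise: one sweep by the kernel */
}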

Crash recovery, again

There’s one stage in InnoDB crash recovery where it reads the log file – you know, this one:

InnoDB: Doing recovery: scanned up to log sequence number 354164119040
InnoDB: Doing recovery: scanned up to log sequence number 354169361920

On a machine with bigger logs it will start spending nearly 100% CPU somewhere in recv_scan_log_recs. Guess what it does… -fno-inline builds to the rescue:

#0  mem_block_get_len at ./include/mem0mem.ic:86
#1  mem_heap_get_size at ./include/mem0mem.ic:591
#2  recv_scan_log_recs at log/log0recv.c:2727

And:

samples  %        symbol name
8467     72.9222  mem_heap_get_size
291       2.5062  recv_add_to_hash_table
95        0.8182  mem_block_get_len

To put it in layman’s terms: InnoDB does SUM(LENGTH(allocation)) over its relatively wide memory arena (tens or hundreds of thousands of entries), FOR EVERY LOG SEGMENT, to make sure it didn’t run out of the available 32GB. Hehehe, lame.
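In sketch form (plain C, not InnoDB’s actual code) – the profile above is what you get when every size check walks every block; keeping a running total, or dropping the check entirely, avoids the walk:

#include <stddef.h>
#include <stdio.h>

struct mem_block { size_t len; struct mem_block *next; };

struct mem_heap {
    struct mem_block *first;
    size_t total_size;          /* maintained on every alloc/free */
};

/* what the profile shows: O(blocks) walk on every call */
static size_t heap_get_size_slow(const struct mem_heap *heap) {
    size_t size = 0;
    for (const struct mem_block *b = heap->first; b; b = b->next)
        size += b->len;         /* SUM(LENGTH(allocation)) */
    return size;
}

/* the cheap check: O(1), same answer */
static size_t heap_get_size_fast(const struct mem_heap *heap) {
    return heap->total_size;
}

int main(void) {
    struct mem_block b2 = { 64, NULL };
    struct mem_block b1 = { 128, &b2 };
    struct mem_heap heap = { &b1, 192 };
    printf("%zu == %zu\n", heap_get_size_slow(&heap),
           heap_get_size_fast(&heap));
    return 0;
}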

For now, I’ve just killed the check and got my recovery to be much, much faster – it finished in 3 minutes what it wasn’t able to do in 30 before.

P.S. This is different from what I wrote about before (and from Yasufumi’s magic patch).
P.P.S. Now I’ve got to learn to reuse the LOG_DUMMY table during the recovery process, as that’s the next low-hanging fruit there…