MySQL is needlessly slow at accepting new connections. People usually work around that by having various sorts of connection pools, but there’s always a scale at which connection pools are not feasible. Sometimes connection avalanches come unexpectedly, and even if MySQL has no trouble dealing with the queries, it will have problems letting clients in. Something has to be done about it.
Lots of these problems have been low-hanging fruit for years – they ‘were not detected’ by benchmarks because everyone who benchmarks MySQL knows that persistent connections are much faster and therefore doesn’t look at connection speeds anymore.
Usually people attribute most of the slowness to the LOCK_thread_count mutex – and they are only partially right. This mutex does not just guard the counter of active running connections; pretty much every operation that increases or decreases the number of threads (thread cache, active thread lists, etc.) has to hold it for a while.
Also, it is common wisdom to use the thread cache, but what people quite often miss is that the thread cache was created back when OS threads were extremely expensive to create, and all it does is cache pthreads. It does not do any MySQL-specific thread caching magic – everything gets completely reinitialized for each incoming connection.
I decided to attack this problem based on a very simple hypothesis – whatever the ‘accept thread’ is doing is a bottleneck for the whole process. It is very simple to analyze everything from this perspective (and I had some success looking at replication threads from this perspective).
All we need is gdb and two loops – gdb attaches to the accept thread, one loop does ‘breakpoint; continue’, another sends signals at a certain sampling rate (I picked 10Hz in order to avoid profiling bias). I posted those scripts on the PMP page. After a lunch break I had 50k stacks (long lunch ;-) that I fed into graphviz for full data visualisation and could look at individually:
A picture is worth a thousand words (well, it is easier than looking at thousands of lines of stack aggregations), and I immediately noticed a few things worth looking at:
- Initializing the THD (MySQL thread) structure is a CPU-heavy task that runs in the choke-point thread
- There is way too much time spent in syscalls, whatever they do
- Too much memory allocation is done by the master thread
- There’s mutex contention when the thread cache wakes up worker threads
- There’s needless mutex contention in a few other places
I didn’t want to look at the mutex contention issues first, so I ended up with something as simple as looking at syscall costs:
- 15% went into the actual accept()
- 8.5% went into poll()
- 8% went into fcntl()
- 7% went into setsockopt()
- 1.2% went into getsockname()
An strace on mysqld gives a picture that explains quite a bit:
poll([{fd=12, ...}, {fd=13, ...}], 2, -1) = 1
fcntl(12, F_GETFL) = 0x2 (flags O_RDWR)
fcntl(12, F_SETFL, O_RDWR|O_NONBLOCK) = 0
accept(12, {... sin_port=htons(59183), ...) = 32
fcntl(12, F_SETFL, O_RDWR)
getsockname(32, {... sin_port=htons(3306), ...) = 0
fcntl(32, F_SETFL, O_RDONLY)
fcntl(32, F_GETFL) = 0x2 (flags O_RDWR)
setsockopt(32, SOL_SOCKET, SO_RCVTIMEO, ...)
setsockopt(32, SOL_SOCKET, SO_SNDTIMEO, ...)
fcntl(32, F_SETFL, O_RDWR|O_NONBLOCK)
setsockopt(32, SOL_IP, IP_TOS, [8], 4)
setsockopt(32, SOL_TCP, TCP_NODELAY, [1], 4)
I’ll skip walking through the code, but essentially what it does here is this (12 is the accept socket, 32 is the connection socket):
- poll() checks whether there are pending connections. If the server is busy, trying accept() first and polling only on failure is a better approach. That idea has side effects though – other sockets may starve a bit – but it is solvable by injecting an occasional poll.
- What happens next is a bit sad. Instead of storing per-socket flags (nobody is touching them for now anyway), it gets the socket flags, figures out it is a blocking socket, sets it to non-blocking mode, accepts the connection, and sets it back to blocking mode. Just setting it to non-blocking at the start and using it that way forever is much cheaper and constipates way less.
- accept() itself can be scaled only by having parallel accept() threads. Maybe most of this post would be unnecessary if there were multiple accept threads, but I’m not eager to go into that kind of refactoring for now.
- getsockname() is used just to verify that the socket is valid (probably catching EINVAL later seemed too complicated); it is a very pessimistic code path for a case that nearly never happens (it was probably added for some random Unix back in the nineties).
- The next fcntl ‘get flags’ call is quite unnecessary – this is a fresh socket and one shouldn’t expect anything special in it. Non-blocking mode is set later anyway, so that overrides whatever was obtained here.
- Three of the four setsockopt()s are a necessary evil (one turns off Nagle’s algorithm, the other two set socket timeouts), so they have to be done before any network I/O happens on the socket. The fourth setsockopt() is usually completely useless – not every network observes the IP_TOS header, and one has to talk to the network administrator first about decent values. I’d say it could be an optional parameter (yay, more tuning options).
Pretty much every connection-socket operation can be done later, in a worker thread, without consuming expensive accept thread time, and pretty much every syscall except accept() itself can be removed from a busy accept thread – which is what I did in my testing build.
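For illustration, here is a minimal sketch of what such a leaner accept loop could look like (this is not the actual patch; hand_off_to_worker() is a hypothetical stand-in for the thread handoff, and error handling is elided). The listening socket is made non-blocking once, accept() is tried first, and poll() only runs when there is nothing to accept; all per-connection setsockopt()/fcntl() calls are left to the worker:

#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <stddef.h>
#include <sys/socket.h>

void hand_off_to_worker(int fd);   /* hypothetical: passes the socket to a worker thread */

void accept_loop(int listen_fd)
{
  fcntl(listen_fd, F_SETFL, O_RDWR | O_NONBLOCK);   /* once, instead of around every accept() */

  for (;;)
  {
    int client_fd = accept(listen_fd, NULL, NULL);
    if (client_fd >= 0)
    {
      hand_off_to_worker(client_fd);                /* worker sets socket options later */
      continue;
    }
    if (errno == EAGAIN || errno == EWOULDBLOCK)
    {
      struct pollfd pfd = { listen_fd, POLLIN, 0 };
      poll(&pfd, 1, -1);                            /* nothing pending – wait for a connection */
    }
  }
}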
Once I got rid of the syscalls I started looking at other low-hanging fruit. The most obvious one was a sprintf() called inside vio_new(). Though it accounted for only 4% of thread time, the uselessness of it was depressing. Here it is:
sprintf(vio->desc, (vio->type == VIO_TYPE_SOCKET ? "socket (%d)" : "TCP/IP (%d)"), vio->sd);
It formats a string that is not used at all by production builds (only a few DBUG messages call vio_description()). Though I removed this code in the non-debug build, as I was moving network initialization over to worker threads, the whole my_net_init() and vio initialization ended up outside the accept thread anyway ;-)
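A minimal sketch of the kind of change this implies (not the exact patch) – compile the description formatting only into debug builds, since nothing else ever reads it:

#ifndef DBUG_OFF                      /* debug builds only – production never reads vio->desc */
  sprintf(vio->desc,
          (vio->type == VIO_TYPE_SOCKET ? "socket (%d)" : "TCP/IP (%d)"),
          vio->sd);
#endif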
The overall thread cache design is centered around LOCK_thread_count – the lock is held while signaling threads, and threads that wake up need the lock too – so there’s lots of overhead involved in the coordination: 13% of time is spent just passing the task to a worker thread.
Allowing multiple threads to wake up and multiple entries to be placed into the thread cache before it is all drained (more of an InnoDB concurrency-queue with a FLIFO approach) would be somewhat better – and so would worker threads accepting connections directly (I already said that, I guess). There’s simply too much time wasted waking threads up and sending them to sleep, and quite a bit of that time is spent on a choke point.
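To make the difference concrete, here is a rough sketch (names and structure hypothetical, plain pthread calls, not MySQL code) of a handoff queue where the accept thread can keep queuing sockets and signals after dropping the lock, so a woken worker doesn’t immediately block on the same mutex:

#include <pthread.h>

enum { MAX_PENDING = 4096 };                 /* arbitrary bound for the sketch */

static pthread_mutex_t queue_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  queue_cond = PTHREAD_COND_INITIALIZER;
static int pending[MAX_PENDING];
static int n_pending;

void enqueue_connection(int fd)              /* called by the accept thread */
{
  pthread_mutex_lock(&queue_lock);
  if (n_pending < MAX_PENDING)               /* a real queue needs backpressure here */
    pending[n_pending++] = fd;
  pthread_mutex_unlock(&queue_lock);
  pthread_cond_signal(&queue_cond);          /* signal outside the critical section */
}

int dequeue_connection(void)                 /* called by idle worker threads */
{
  pthread_mutex_lock(&queue_lock);
  while (n_pending == 0)
    pthread_cond_wait(&queue_cond, &queue_lock);
  int fd = pending[--n_pending];
  pthread_mutex_unlock(&queue_lock);
  return fd;
}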
THD initializations are somewhat simpler, as they don’t include SMP madness.
There are some low-hanging fruit there as well, of course. For example, the THD initializer calls sql_rnd_with_mutex(), which locks the thread count mutex. The simplest fix would be using another mutex, though a lockless random function or on-demand variable initialization would help too.
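The separate-mutex variant is about as small as a fix gets; a sketch (LOCK_thd_rand is a hypothetical name, the body is only an approximation of the real function, and MySQL’s own mutex wrappers are replaced with raw pthread calls for brevity):

#include <pthread.h>

static pthread_mutex_t LOCK_thd_rand = PTHREAD_MUTEX_INITIALIZER;  /* hypothetical */

static ulong sql_rnd_with_mutex(void)
{
  pthread_mutex_lock(&LOCK_thd_rand);        /* was: LOCK_thread_count */
  ulong tmp= (ulong) (my_rnd(&sql_rand) * 0xffffffff);
  pthread_mutex_unlock(&LOCK_thd_rand);
  return tmp;
}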
Some of the initializers there are quite expensive too – e.g. the Warning_info class could initialize its dynamic storage only when actually used, rather than at the THD initialization chokepoint. THD::init can be moved to a worker thread, and lots of other THD initialization could be moved over with it.
Quite a lot of time (12%) is spent in malloc() – and much of that goes to allocating lots of various fixed-size structures – a slab allocator (or just a more efficient malloc implementation) could cut the CPU time there. Of course, the more drastic alternative is not dealing with THD at all during the accept phase – one can pass a stub structure to build upon later, or (oh, am I writing this again) move the accept() part to individual workers.
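The slab idea boils down to recycling those fixed-size objects instead of going through malloc()/free() each time; a toy sketch of the principle (not MySQL code – a real allocator would add locking or per-thread caches and bulk refills):

#include <stdlib.h>

struct free_node { struct free_node *next; };
static struct free_node *free_list;          /* one list per object type/size */

void *obj_alloc(size_t obj_size)             /* obj_size must be >= sizeof(struct free_node) */
{
  if (free_list)
  {
    void *p = free_list;
    free_list = free_list->next;             /* reuse a previously freed object */
    return p;
  }
  return malloc(obj_size);                   /* slow path: fall back to malloc() */
}

void obj_free(void *p)
{
  struct free_node *n = (struct free_node *) p;
  n->next = free_list;                       /* push back onto the freelist */
  free_list = n;
}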
So far I have tested just a few optimizations – moving vio/net initialization over to worker threads, reducing the number of syscalls, adding a new mutex for rand initialization – and that alone got me an additional 50% increase in connection accepts. Think how much more one could get from fixing this problem properly ;-)
TL;DR: MySQL sucks at accepting new connections, but there’re lots of low hanging fruit there. Ask your MySQL provider for a fix.
MySQL bug entries:
Dude, you would be so helpful in optimizing RDF triple stores ;)
http://en.wikipedia.org/wiki/Triplestore
What’s your take on RDF and NoSQL in general? (I have to admit I’m totally pro-RDF.)
MySQL is a perfect triple store! ;-)
Too bad that
:) I’ve tried running Jena SDB on top of MySQL but it was too slow, so now we use TDB, which is native.

The code in handle_connections_sockets() (and mysqld.cc as a whole) is a good example of how not to write code. Network-related code in MySQL is an absolute mess. For starters, one needs to take this function apart, move it to a more appropriate layer (no more poll/select mess), have a separate thread for each connection queue (or just have worker threads parked on accept()), etc. The non-blocking flag switch could probably be eliminated with accept4(). Just rewriting this stuff properly would eliminate a good chunk of this crap.
I agree with everything! Though accept4 is somewhat too new.. ;-)
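For reference, the accept4() variant mentioned above would look roughly like this (Linux-specific; the helper name is made up for the sketch) – the accepted socket comes back already non-blocking, so the later fcntl() round-trips on it disappear:

#define _GNU_SOURCE
#include <stddef.h>
#include <sys/socket.h>

static int accept_nonblocking(int listen_fd)
{
  /* SOCK_NONBLOCK sets the flag atomically on the new socket */
  return accept4(listen_fd, NULL, NULL, SOCK_NONBLOCK);
}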
Domas
It’s this kind of qualitative analysis that, as far as I’ve seen, is not only neglected in this industry, but actively suppressed because it’s not “simple”; in fact, it’s fast becoming a lost art. Thank you for breaking that trend and putting some real effort into discovering this.
Domas, can you please elaborate some on how you use connection pools and for how long you expect a connection to last before you have to circulate it?
Kristofer, what connection pools do you have in mind?
‘Thread cache’ threads could handle new connections directly without coordination from a central thread, the question then is how to kill idle threads…
I was a bit wrong in saying that LOCK_thread_count is less of a problem – it is at the very core of this, as the “let’s pass job to other thread” design isn’t scalable.
I was hoping you would recommend a client thread pool solution which helps amortize the authentication times. There are a few but it is still an open question to me on how they behave when some pressure is applied.
The “let’s pass job to other thread” design is a very common pattern though, and it is also recommended by Intel. I suppose part of the trick is how you deliver the job, and obviously some ideas are worse than others. ;-)
In this case I think it was intended for those cases when load was beyond the scaling capabilities which usually led to a fast decrease in performance. With the “pass job to other thread” design the performance remained constant close to the max performance.
Kristofer, client-side solutions don’t help much at a certain scale (you really don’t want to have open connections from every web box ;-) Then the rate of connection accepts becomes much more important, unless you really have elaborate pooling system where x clients talk to y proxies to z servers ;-)
The “passing to another thread” pattern relies on zero thread wakeup cost – which unfortunately is not the case – there’s a relatively tiny window during which the optimistic codepath can trigger, and there’s too much locking around to allow hitting it (probably worth instrumenting though, to see how many threads actually grab a new task without going to sleep).
I don’t understand what you mean by scaling capabilities etc – in a design where every thread is accepting (ala apache/multiple worker lighttpd/etc) you just eliminate a major chokepoint (as long as mutex holds are not too long overall).
The current connection accepting code is so bad (it has way too high a CPU cost component besides the job passing) that it is difficult to hit ‘scalability’ issues at all.
Once CPU cost gets eliminated, job passing becomes major bottleneck, especially if other cores are doing anything else than busylooping on thread condition variable :)
“scaling capabilities”: I mean that if you don’t limit the number of worker threads you will eventually get punished by their overhead and performance will decrease. If you limit the worker threads you still want to put incoming requests on a queue instead of discarding them to limbo. The passing of jobs in and out of this queue was considered a viable pattern to preserve throughput performance on spikes (i.e. when your system isn’t able to scale). It is still unclear to me whether the penalty you pay per thread cancels the benefit of this strategy, or where the limit goes. Certainly if you have enough capacity for more threads than the max possible number of connections you shouldn’t put anything on hold.
Kristofer, technically now we don’t limit number of worker threads anyway (max_connections=10000 ;-) – though there has been some work with admission control in our branch.
MySQL’s problem is not about not having enough worker threads to deal with requests, MySQL’s problem is that it cannot pass more jobs.
Excellent analysis, and I’m definitely gonna steal the long-running version of PMP. I never tried to track down the root cause of this problem, but a few months ago I finally discovered why some of our requests were taking 3 seconds to complete. The default accept(2) backlog for MySQL (back_log in the config) is 50, which is way too low. This forces the kernel to discard incoming connection requests. Unfortunately, initRTO is hard-coded to 3 seconds (compile-time constant) – although I recently got the kernel to lower initRTO to 1 second (the first release with the change was v3.1-rc1). So at the end of the day, for those who are looking for a quick workaround without touching what version of MySQL they run, increasing back_log is the only way to avoid the dreaded 3s penalty. Unfortunately changing this setting requires restarting MySQLd.

Benoit, there’s not just that – kernels by default limit the max backlog to 128 too (the somaxconn parameter) – so you have to bump that as well. But yes, backlog is one of the first things to change on busy systems.
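(For context, a sketch of the interplay – back_log ends up as the backlog argument of the listen() call inside mysqld, and the kernel caps the effective queue at net.core.somaxconn, so both need raising together; the helper name and parameters here are placeholders:)

#include <sys/socket.h>

static int start_listening(int listen_fd, int back_log)
{
  /* effective accept queue is roughly min(back_log, somaxconn) */
  return listen(listen_fd, back_log);
}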
Domas,
Thanks for the great post!
“Ask your MySQL provider for a fix.”
I will look into it more thoroughly and see if we can come up with something concrete.
-ivan