Google :-(

For past few days (days!!!) I’m not able to log into my account at Google (except GMail, thanks for that! – Voice/Blogger/Plus/Docs/… are all broken though). I’m either getting this message:

Sorry, there seems to be a problem. The service you're looking for is temporarily unavailable. We're working hard to restore your access as soon as possible. Please try again in a few hours. Thanks for your patience.

Sorry, there seems to be a problem.

Or I’m sent to infinite redirect loop. None of their self-help pages are useful or have this error mentioned.

Apparently if you get such message, you should read this message in opposite way:

“Sorry (or you should be sorry), there seems to be a problem. This service is looking for you and you seem to be temporarily unavailable. We’re not working on this at all, it may show up on some report at the end of the year though as lost revenue, sad, we see you spent few hundred dollars on our services, so you trust us, but of course we have enough money from everyone else. You may try again in few minutes, few hours or few days, it won’t help you though. Though we appreciate your patience, you should start researching and debugging this yourself instead, thank you.”

I cleared all Google cookies on my browser and it started working. Apparently this has happened to others too.

Update: again…

Posted in misc | Tagged , | 3 Comments

on MySQL replication prefetching

For the impatient ones, or ones that prefer code to narrative, go here. This is long overdue anyway, and Yoshinori already beat me, hehe…

Our database environment is quite busy – there’re millions of row changes a second, millions of I/O operations a second and impact of that can be felt at each shard. Especially, as we also have to replicate to other datacenters, single threaded replication on MySQL becomes a real bottleneck.

We use multiple methods to understand and analyze replication lag composition – a simple replication thread state sampling via MySQL processlist helps to understand logical workload components (and work in that field yields great results), and pstack/GDB based replication thread sampling shows server internal behavior quite well too (a similar technique was used for accept thread visualisation).

The biggest problem with single replication thread is that it has to read data to execute queries (rather than applying physical page deltas, like PG or just appending to files like HBase, it does logical edits to page data) – we can observe 95% of process time at that state. As generally there’s just one outstanding data read per replication thread, other workload hitting the machine will also make replication reads slower.

Generally, the obvious way to deal with slow I/O is issue more outstanding parallel requests, and the only way to do that apart from parallel replication, is to predict what will be needed in future and try to fetch that.

Many many moons ago Paul Tuckfield discussed about the Youtube replication prefetcher – it would take write statements yet to be executed in relay logs,  convert them to SELECTs and run them before replication thread needs that data. He still says that was one of most satisfying quick hacks :-)

Maatkit (now Percona Toolkit) introduced mk-slave-prefetch (I played with it back in 2008, didn’t put it into operation at that time though), and eventually that looked like a reasonable option for prefetching statements on our database cluster.

5000 lines of Perl is not the easiest code to work with (or to debug), so the journey was quite bumpy. We got it working in some shape, eventually, but Baron, original author, has something to say about it:

Please don’t use mk-slave-prefetch on MySQL unless you are Facebook. Or at least don’t tell your friends, so they won’t use it.

Anyway, our updates rate would saturate mksp.pl if we used anything fancier on it, so it was a constant balancing act, in which looking at the code was something nobody wanted to do ;-) Still, it was (and is) helping us, so getting rid of it wasn’t possible either.

At some point in time we decided to make an experiment – what if we executed statements, then rolled them back – so I did a quick implementation of that method from scratch in Python – resulting piece of code was relatively small and fun to experiment with.

There were multiple problems with such approach – one complication was that queries were grabbing locks for the duration of the statement, and some of those locks would collide with what actual replication thread is doing. Fixing that would require immediate lock wait timeout or transaction kill for prefetcher thread – so, relatively deep dive into InnoDB. Another problem was internal InnoDB lock contention on rollbacks – that was expensive operation, and benefits of pages read in were negated by rollback segments lock contention. Fixing that is even more extensive InnoDB work (though probably some people would like their rollbacks to be efficient ;-)

At that moment we came up with the idea, that InnoDB codebase could be instrumented to not do any real work on updates – just page data in and return to the caller, and if any change accidentally slips in, commits can fail. That looked like a feasible project for the future.

At some point in time we were rolling out a new database tier for one product, which was supposed to have really high volume of changes, but all coming in a uniform format. It took less than hour (as most of the work has been done to create rollback-based one) to come up with a prototype that would efficiently extract literals from uniform statements, then use them for prefetching.

This method worked fine – at tiny fraction of resources used by mk-slave-prefetch we were preloading secondary indexes and could have relatively extensive parallelism.

Meanwhile, our main database cluster was having more and more uniform query workload, thanks to various libraries, abstractions and middleware – so a day of work on lowest hanging fruits provided relatively good coverage of the workload.

We didn’t stop mksp.pl – it still provided some coverage for various odd cases, which were time-consuming to work on manually.

There were few other problems with the new method – apparently we were targeting our SELECTs too accurately – UPDATEs were spending plenty of time in records_in_range. Additionally, optimistic update path was reading in pages that selects wouldn’t (due to inefficiency in B-Tree locking code). There were some odd reads done for INSERTs.

Also, SELECTs are using indexing less efficiently – InnoDB can pinpoint entries in secondary indexes by using PK values, yet that ability is not exposed to SQL layer, so prefetching on indexes that don’t have explicitly defined all fields within them is not that easy.

In theory, all these issues are supposed to be ‘fixed’ by fake changes concept. Percona recently implemented it in their releases, and we started experimenting with those changes. It is still not that mature concept, so we will be revisiting how things are or should be done, but for now test results are quite positive (we did some changes to reduce locking and avoid deadlock in REPLACE INTO, among other things).

I still observe I/Os done by main replication thread, so we’re not in perfect shape yet, but method seems to be working relatively well (at least it definitely speeds up replication). We still have to do lots of testing to qualify this for large-scale production, but this may allow way more write workload on our machines until we get parallel replication all around.

Our code for custom query, fake changes or rollback prefetcher can be checked out from a public repo together with other tools (oops, Bazaar doesn’t give easy access to subdirectories:

bzr co lp:mysqlatfacebook/tools; cd prefetch

Or browse it online.

P.S. There’s also Tungsten Replicator for ones who don’t want to wait for 5.6 parallel replication.

Posted in facebook, mysql | Tagged , | 3 Comments

… in numbers

Posted in wikipedia, wikitech | Tagged | 9 Comments

Blowing up in memory

MySQL isn’t too concerned about table handler memory usage – it will allocate row size buffer thrice per each table invocation. There’s a few year old bug discussing UNION memory usage – for each mention in an union one can allocate nearly 200k of unaccounted memory – so a megabyte sized query can consume 7GB of RAM already.

Partitioning though adds even more pain here – it will allocate those three buffers per each partition, so opening a table with 1000 partitions looks like this on memory profile:

Click to enlarge, and you will see 191MB sent to execute a simple single-row fetching query from a table (I filed a bug on this).

There’re multiple real life situations when this is painful (e.g. any kind of server stall may lead to multiple concurrent threads reading from same table, consuming additional gigabytes or tens of gigabytes of memory). It gets even more painful when combined with UNION bug – a megabyte query on an empty table can now consume 7TB of memory and I doubt anyone has that much on their MySQL servers :-)

P.S. Also, check out how much memory can be wasted for malloc overhead, once discussed here.
P.P.S. And here you can see why innodb_max_dirty_pages_pct=0 doesn’t do what you’d expect.

Posted in mysql | Tagged , , , | 7 Comments

On connections

MySQL is needlessly slow at accepting new connections. People usually work around that by having various sorts of connection pools, but there’s always a scale at which connection pools are not feasible. Sometimes connection avalanches come unexpected, and even if MySQL would have no trouble dealing with queries, it will have problems letting clients in. Something has to be done about it.

Lots of these problems have been low hanging fruits for years – it ‘was not detected’ by benchmarks because everyone who benchmarks MySQL would know that persistent connections are much faster and therefore wouldn’t look at connection speeds anymore.

Usually people attribute most of slowness to the LOCK_thread_count mutex – they are only partially right. This mutex does not just handle the counter of active running connections, but pretty much every operation that deals with increase or decrease of threads (thread cache, active thread lists, etc) has to hold it for a while.

Also, it is common wisdom to use thread cache, but what people quite often miss is that thread cache is something that was created back when OS threads were extremely expensive to create, and all it does is caching pthreads. It does not do any of MySQL specific thread caching magic – everything gets completely reinitialized for each incoming structure.

I decided to attack this problem based on very simple hypothesis – whatever ‘accept thread’ is doing, is bottleneck for whole process. It is very simple to analyze everything from this perspective (and I had some success looking at replication threads from this perspective).

All we need is gdb and two loops – gdb attaches to accept thread, one loop does ‘breakpoint; continue’, another sends signals at a certain sampling rate (I picked 10Hz in order to avoid profiling bias). I posted those scripts on PMP page. After a lunch break I had 50k stacks (long lunch ;-) that I fed into graphviz for full data visualisation and could look at individually:

A picture is worth thousand words (well, is easier than looking at thousands of lines in stack aggregations), and I immediately noticed few things worth looking at:

  • Initializing THD (MySQL thread) structure is CPU-heavy task that resides in choke-point thread
  • There is way too much time spent in syscalls, whatever they do
  • Too much memory allocation done by the master thread
  • There’s mutex contention on thread cache waking up worker threads
  • There’s needless mutex contention in few other places

I didn’t want to look at mutex contention issues first so I ended up with something as simple as looking at syscall costs.

  • 15% was going into actual accept()
  • 8.5% was going into poll()
  • 8% went into fcntl()
  • 7% went into setsockopt()
  • 1.2% went into getsockname()

An strace on mysqld gives a picture that explains quite a bit:

poll([{fd=12, ...}, {fd=13, ...}], 2, -1) = 1
fcntl(12, F_GETFL) = 0x2 (flags O_RDWR)
fcntl(12, F_SETFL, O_RDWR|O_NONBLOCK) = 0
accept(12, {... sin_port=htons(59183), ...) = 32
fcntl(12, F_SETFL, O_RDWR)
getsockname(32, {... sin_port=htons(3306), ...) = 0
fcntl(32, F_SETFL, O_RDONLY) fcntl(32, F_GETFL) = 0x2 (flags O_RDWR)
setsockopt(32, SOL_SOCKET, SO_RCVTIMEO, ...)
setsockopt(32, SOL_SOCKET, SO_SNDTIMEO, ...)
fcntl(32, F_SETFL, O_RDWR|O_NONBLOCK)
setsockopt(32, SOL_IP, IP_TOS, [8], 4)
setsockopt(32, SOL_TCP, TCP_NODELAY, [1], 4)

I’ll skip walking through the code, but essentially what it does here is (12 is accept socket, 32 is connection socket):

  • poll() checks whether there are pending connections. If server is busy, trying to accept first, poll on failure is a better approach. There are side effects with that idea though – other sockets may starve a bit, but it is solvable by injecting occasional poll.
  • What happens next is a bit sad. Instead of storing per-socket flags (nobody is touching that for now anyway), it gets the socket flags, figures out it is a blocking socket, sets it to nonblocking mode, accepts the connection, sets it back to blocking mode. Just setting to nonblocking at the start and using it forever that way is much cheaper and constipates way less.
  • accept() itself can be scaled only by having parallel accept() threads. Maybe most of this post would be not necessary if there were multiple accept threads, but I’m not eager to go into that kind of refactoring for now.
  • getsockname() is used just to verify if socket is correct (probably catching EINVAL later seems to be too complicated), it is a very pessimistic code path for a case that nearly never happens (it probably was added for some random Unix back from nineties)
  • Next fcntl “get flags” call is quite unnecessary – this is a fresh socket and one shouldn’t expect anything special within it. Later non-blocking mode is set, so that overrides whatever was obtained here.
  • Three out of four setsockopt()s are necessary evil (one turns of Nagle’s algorithm, two other set socket timeouts), so they have to be done before network I/O is done on the socket. Fourth setsockopt() is usually completely useless – not every network observes IP_TOS header, and one has to talk to network administrator first about decent values. I’d say it can be optional parameter (yay, more tuning options).

Pretty much every connection socket operation can be done later, in a worker thread, without consuming expensive accept thread time, and pretty much every syscall except accept() can be removed from a busy accept thread(), which is what I did in my testing build.

Once I got rid of syscalls I started looking at other low hanging fruits. The most obvious one was sprintf() called inside vio_new(). Though it accounted only for 4% of thread time, the uselessness of it was depressing. Here it is:

sprintf(vio->desc,
   (vio->type == VIO_TYPE_SOCKET ? "socket (%d)" : "TCP/IP (%d)"),
   vio->sd);

It formats a string that is not used at all by production builds (only few DBUG messages are calling vio_description()). Though I removed this code in non-debug build, as I was moving over network initialization to worker threads, whole my_net_init() and vio() ended up outside of accept thread anyway ;-)

The overall thread cache design is centered around LOCK_thread_count – lock is held while signaling threads, and threads that wake up need the lock too – so there’s lots of overhead involved in the coordination – 13% of time is spent just to pass the task to a worker thread.

Allowing multiple threads to wake up and multiple entries to be placed into thread cache before it is all drained (more of an InnoDB concurrency-queue with FLIFO approach) could be somewhat better – so would be worker threads accepting connections directly (I already said that, I guess). There’s simply too much time wasted waking up and sending threads to sleep, and quite some of that time is on a choke point.

THD initializations are somewhat simpler, as they don’t include SMP madness.

There’re some low hanging fruits of course there as well. For example THD initializer calls sql_rnd_with_mutex(), which locks thread count mutex. Simplest fix could be using another mutex, though lockless random function or on-demand variable initialization would help too.

Some initializers there are quite expensive too – e.g. Warning_info class could initialize dynamic storage only when actually used, and not at THD initialization chokepoint. THD::init can be moved to a worker thread, and lots of THD initialization could be moved over to it.

Quite a lot of time (12%) is spent on malloc() – and lots of that is for allocating lots of various fixed-size structures – slab allocator (or just more efficient malloc implementation) could cut on CPU time there. Of course, more drastic alternative is not dealing with THD at all during accept phase – one can pass stub structure to build upon later, or (oh, am I writing this again) moving accept() part to individual workers.

So far I tested just few optimizations – moved over vio/net initialization to worker threads, reduced number of syscalls, added a new mutex for rand initialization, and that alone got me additional 50% increase in connection accepts. Think how much more one could get from fixing this problem properly ;-)

TL;DR: MySQL sucks at accepting new connections, but there’re lots of low hanging fruit there. Ask your MySQL provider for a fix.

MySQL bug entries:

Posted in facebook, mysql | Tagged , , | 16 Comments

InnoDB subsystems in color

I tried to put every subdirectory of InnoDB codebase into a chart that would explain some of relations between subsystems and modules inside the source. This is what I got (click to enlarge):

Update: Check Vadim’s diagram for a more operational view of InnoDB
Another update: There’s a vector PDF version

Posted in mysql | Tagged , , | 5 Comments

Stonebraker trapped in Stonebraker 'fate worse than death'

Oh well, I know I shouldn’t poke directly at people, but they deserve that sometimes (at least in my very personal opinion). Heck, I even gave 12h window for this not to be hot-headed opinion.

Those who followed MySQL at facebook development probably know how much we focus on actual performance on top of mixed-composition I/O devices (flashcache, etc) – not just retreating to comfortable zone of in-memory (or in-pure-flash) data.

I feel somewhat sad that I have to put this truism out here – disks are way more cost efficient, and if used properly can be used to facilitate way more long-term products, not just real time data. Think Wikipedia without history, think comments that disappear on old posts, together with old posts, think all 404s you hit on various articles you remember from the past and want to read.

Building the web that lasts is completely different task from what academia people imagine building the web is.

I already had this issue with other RDBMS pioneer (there’s something in common among top database luminaries) – he also suggested that disks are things of the past and now everything has to be in memory, because memory is cheap. And data can be whatever unordered clutter, because CPUs can sort it, because CPUs are cheap.

They probably missed Al Gore message. Throwing more and more hardware without fine tuning for actual operational efficiency requirements is wasteful and harms our planet. Yes, we do lots of in-memory efficiency work, so that we reduce our I/O, but at the same time we balance the workload so that I/O subsystem provides as efficient as possible delivery of the long tail.

What happens in real world if one gets 2x efficiency gain? Twice more data can be stored, twice more data intensive products can be launched.
What happens in academia of in-memory databases, if one gets 2x efficiency gain? A paper.
What happens when real world doesn’t read your papers anymore? You troll everyone via GigaOM.

Though sure, there’s some operational overhead in handling sharding and availability of MySQL deployments, at large scale it becomes somewhat constant cost, whereas operational efficiency gains are linear.

Update: Quite a few people pointed out that I was dissing a person who has done incredible amount of contributions, or that I’m anti-academia. I’m not, and I extremely value any work that people do wherever they are, albeit I do apply critical thinking to whatever they speak.

In my text above (I don’t want to edit and hide what I said) I don’t mean that “a paper” is useless. Me and my colleagues do read papers and try to understand the direction of computer science and how it applies to our work (there are indeed various problems yet to solve). I’d love to come up with something worth a paper (and quite a few of my colleagues did).

Still, if someone does not find that direction useful, there’s no way to portray them the way the original GigaOM article did.

Posted in facebook, mysql | Tagged , , , , | 30 Comments