On replication, some more

Dear MySQL,

I feel ashamed that I ever wanted you to support 4.0->5.1 replication, and apologize for that. I really understand that it was egoistic of me to even consider that you should be involved in this.

I even understand that 5.0 is running out of active support (I’m not questioning that you’ll stop supporting 4.1 entirely too), and that you’ll stop doing pretty much anything to 5.0, except “critical security fixes” (w00t, I managed to get one into 4.1 – an 8-year-old MITM flaw :).

I really understand that supporting more than one release is very very very difficult, and that people should only do adjacent-version upgrades.

I’m not asking you for much, but maybe you could then support 5.1 to 5.1 replication? I don’t want much, just:

  • Gracefully recover after slave crashes.
  • Don’t make single-threaded serial reading of pages for the replication stream a bottleneck – either read ahead properly (you can do that with RBR!!!), or apply events in parallel (you can do that with RBR too!)
  • Allow editing replication filters without restarting servers.
  • Allow enabling and disabling binary logging of events received from the master, as well as enabling and disabling binary logging entirely, without restarting the instance.

I hope it isn’t too much to ask! It is just supported replication between two same-version instances.

Thanks!

best free() is exit()

Whenever any maintenance needs a server restart, there’s a list of unsolved bottlenecks and inefficient code paths that get touched a lot at that time. I can understand that warming up the server can take lots of time (though there are lots of low-hanging fruits there), but the way the actual shutdown is done sucks, even if there’s not much dirty data to flush.

See, developers are perfectionists, and their perfectionism includes the crazy idea that all memory has to be deallocated at server shutdown, as otherwise Valgrind and other tools will complain that someone leaked memory. So developers write expensive shutdown routines that traverse every memory structure and deallocate/free() it.

Now, guess what would happen if they didn’t write all this expensive memory deallocation code.

Still guessing?

The OS would do it for them, much much much faster, without blocking the shutdown for minutes or using excessive amounts of CPU. \o/

Crash recovery, again

There’s one stage in InnoDB crash recovery where it reads the log file, you know, this one:

InnoDB: Doing recovery: scanned up to log sequence number 354164119040
InnoDB: Doing recovery: scanned up to log sequence number 354169361920

On a machine with bigger logs it will start spending nearly 100% CPU somewhere in recv_scan_log_recs. Guess what it does… -fno-inline builds to the rescue:

#0  mem_block_get_len at ./include/mem0mem.ic:86
#1  mem_heap_get_size at ./include/mem0mem.ic:591
#2  recv_scan_log_recs at log/log0recv.c:2727

And:

samples  %        symbol name
8467     72.9222  mem_heap_get_size
291       2.5062  recv_add_to_hash_table
95        0.8182  mem_block_get_len

To speak in layman’s terms, InnoDB does SUM(LENGTH(allocation)) over its rather wide memory arena (tens or hundreds of thousands of entries) FOR EVERY LOG SEGMENT, just to make sure it didn’t run out of the available 32GB. Hehehe, lame.

For now, I’ve just killed the check and made my recovery much much faster – it finished in 3 minutes, something it wasn’t able to do in 30 before.

P.S. This is different from what I wrote before (and magic Yasufumi’s patch)
P.P.S. Now I have to learn to reuse the LOG_DUMMY table during the recovery process, as it is the next low-hanging fruit there…

again, on benchmarks

Dear interweb, if you have no idea what you’re writing about, keep it to yourself; don’t litter the tubes. Some people may not notice they’re eating absolute crap and get diarrhea.

This particular benchmark has two favorite parts that go together really well:

I didnt change absolutely any parameters for the servers, eg didn’t change the innodb_buffer_pool_size or key_buffer_size.

And..

If you need speed just to fetch a data for a given combination or key, Redis is a solution that you need to look at. MySQL can no way compare to Redis and Memcache. …

Seriously, how does one repay all the damage done by such idiotic benchmarks?

P.S. I’ve ranted at benchmarks before, and will continue doing so.

update

In the past few months I’ve had lots of changes going on – I left the Sun/MySQL job, my term on the Wikimedia Board of Trustees ended, I joined Facebook, and now I’ve been appointed to the Wikimedia Advisory Board. This probably also means that I will have slightly less hands-on work on Wikipedia technology (I’ll be mostly in “relaxed maintenance mode“), though I don’t know yet how much less – time will show :)

P.S. I also quit World of Warcraft. ;-)

A formal objection to formal objection!

Dear EC, I have an opinion on Sun/Oracle merger.

I don’t like objections, therefore I object to them. Of course, I understand that some people want to play major parts in dramas and soap operas, and others spectate and enjoy it, but most people I know and respect simply want to be efficient and get their work done.

These people, many of whom were my colleagues back at Sun/MySQL, can sometimes be labeled as ones who don’t care about the greater picture, or the common good. Unfortunately, what counts as the common interest varies from religion to religion, from society to society, from person to person.

My approach to the common good is doing what I want to do for the community, not what the community wants me to do. If that relation is not acceptable to either side, there is no relation at all. What is happening now is that there is some user community (and I include all the people who are protecting their free-software-usage interests) which has demands for vendors.

When PG zealots keep telling us that “PG is open, MySQL is closed development by a company”, you will often see the MySQL camp not listening at all. But there is some truth in that: every MySQL user was using a product that was developed inside a citadel (some builds were even called ‘Citadel’), and the company had a say over lots of issues.

Whether it was Windows licensing at the beginning, GPLv2 versus GPLv3, or any other issue affecting our user community at large, it has always been a top-down decision; nobody ever asked me as a user what kind of license I would want – and all those decisions were made by MySQL, the semi-independent company.

I never actually craved too much control of the project – I’m always free to fork (I have forked at least 3 releases of MySQL, and now have to deal with yet another fork at work ;-). I have gone through various stages of relations with MySQL – and at every one of them I had to understand that open-source software carries its own set of risks, but also its own set of mitigations. I’m free to support my organization’s needs whatever the outcome; I don’t have any data lock-in, I don’t have any vendor lock-in, and the only impact is a risk that I should have calculated years ago, not today.

So even if the argument starts with “MySQL is being acquired by a competitor, and it is a risky situation”, there are a few real messages out there: “we need to protect people who don’t care about risks” – which sounds very much like bailout money for all the people with crazy spending habits – as well as “if your entry campaign is about doing lots of common good, you may not be allowed to profit in the future”.

This doesn’t make open source a good investment (would Sun have paid $1B had it foreseen today’s EC position?), and it not being a good investment means there won’t be opportunities for people to balance the common good and business opportunities in the future.

If a community was supposed to have a voice in this situation, it should’ve had a fair share of board members in project management. If a community didn’t get its fair share of management at the organization, it should’ve found another organization, or accepted the risks.

In my opinion, if you are right now in the camp of supporting objections, it is not because you’re seeing a lot now, it is mostly because you didn’t see anything before.

GDB 7!

I wasn’t prepared for this. After spending months playing with GDB development trees I somehow entirely missed that the 7.0 release was getting close, and it took me more than an hour to spot it.

My favorite features are Python scripting and non-stop debugging. I’ve been toying around with Python scripting for a while, and was planning to make backtraces make sense. Having things that open means one can see PHP backtraces when gdb’ing Apache, see table names and states when a MySQL thread accesses handler interfaces, or remote IPs and users when it is writing to the network. Process inspection can simply rock if the right tools are created using these new capabilities, and I’m way too excited when I think about them. “Always have debugging symbols” gets way more meaning now.

Another issue I’ve been trying to resolve lately is avoiding long locking periods for running processes (attaching directly to a process can freeze its work for a second or so, which isn’t that tolerable in production environments). GDB is getting closer to async debugging capabilities – where one can run a debugger without actually stopping anything.

So, congratulations, GDB team; now it is our job to find all the uses of the tool. It has been invaluable so far, but this is much much more.

MySQL processlist phrase book

For your convenience, a short phrase book, starting with the process states in which MySQL is mostly working to look up data from tables:

  • “Sending data” – reading data from tables (or looking it up)
  • “Copying to tmp table” – reading data from tables (or looking it up)
  • “Copying to tmp table on disk” – query needs a rewrite
  • “statistics” – looking up data from tables
  • “Sorting result” – reading data from tables (or looking it up)

Locking is fun:

  • “update” – there is a big chance that it is waiting for a lock; on busy systems it can be updating as well
  • “Locked” – an even bigger chance that it is a MyISAM table waiting to be converted to InnoDB

Table opening is even funnier:

  • “Opening tables” – some other idiot thread is holding the LOCK_open or similar mutex
  • “Closing tables” – some other idiot thread is holding the LOCK_open or similar mutex
  • “Opening table” – this is the idiot thread itself (or a not-that-smart DBA who thinks that table_cache should be kept at its current value)
  • “Flushing tables” – someone decided that flushing tables is a good idea, but forgot to kill long-running queries first
  • “Waiting for tables” – same as “Flushing tables”

If you have replication:

  • “Connecting to server” – could not connect to server, waiting to retry
  • “Reconnecting after a failed master event read” – master and slave have same @server_id
  • “Registering slave on master” – master and slave have same @server_id
  • “Waiting to reconnect after a failed master event read” – master and slave have same @server_id
  • “Sending binlog event to slave” – actually, not sending a binlog event – it is waiting for the binlog mutex.

Few more tricky ones:

  • “Sleep” – a millisecond ago it was probably not sleeping; due to how process scheduling and kernel-level buffering work, it probably finished right before the kernel switched to the ‘SHOW PROCESSLIST’ thread.
  • “Writing to net” – usually not writing to net (the client has blocked); actually writing to the net is usually a nearly non-blocking operation, as long as the client doesn’t sleep.
  • “Killed” – actually, someone tried to kill it, but it doesn’t care, as it doesn’t check the thd->killed flag

And in the user column:

  • “unauthenticated user” – are you sure your DNS is working right?
  • “system user” – it isn’t a user; these are (hopefully plural) replication threads, which don’t have any “user” credentials
  • “root” – either that is you running SHOW PROCESSLIST, another session you forgot you are running, or you forgot to create a separate user for your application

I probably left out quite a few important ones, but some are self-evident (such as “init” and “end”), and others will probably never show up :)

Getting apache core dumps in Linux

If you want to get core dumps for intermittent Apache/mod_php crashes in Linux, you will probably need this module (otherwise the Linux kernel will refuse to dump core, no matter what you put into your OS or Apache configuration):


/*
 * Author: Domas Mituzas
 * Released to public domain
 *
 * apxs -c -i mod_dumpcore.c
 * and...
 * LoadModule dumpcore_module .../path/to/mod_dumpcore.so
 * CoreDumpDirectory /tmp/cores/
 * and...
 * sysctl -w kernel.core_pattern=/tmp/cores/core.%p.%t
 */

#include "httpd.h"
#include "http_config.h"
#include <sys/prctl.h>

/* Mark the process dumpable again on every request; the kernel clears
 * the flag for processes that switched credentials, as Apache does. */
static int dumpcore_handler(request_rec *r)
{
    prctl(PR_SET_DUMPABLE, 1, 0, 0, 0);
    return DECLINED;
}

static void dumpcore_register_hooks(apr_pool_t *p)
{
    ap_hook_handler(dumpcore_handler, NULL, NULL, APR_HOOK_MIDDLE);
}

module AP_MODULE_DECLARE_DATA dumpcore_module = {
    STANDARD20_MODULE_STUFF, NULL, NULL, NULL, NULL, NULL,
    dumpcore_register_hooks
};

P.S. I was quite astonished to find out that nobody ever needed this before – I remember quite a few discussions after which we fixed this in MySQL two years ago.

Update: there’s also ‘echo 1 > /proc/sys/fs/suid_dumpable’ or ‘sysctl -w fs.suid_dumpable=1’ – now I recall the whole story: RHEL3 and RHEL4 didn’t have this, so we had to do the prctl() hack, whereas later Linux kernel versions allow this workaround.