Progress in percents: 0 1 2 3 …

Well, servers usually don’t crash ( our English Wikipedia master is running for 800 days, on white-box hardware, RAID0, 4.0 ;-), but when they do (like some kernel bugs on our big big boxes), one of most painful experiences is InnoDB log recovery.

Usually people will reduce the innodb-log-file-size to ease up with the recovery (it helps, in a way :), but the real problem is somewhere else.

See, when InnoDB does crash recovery, it applies the log changes in memory, and builds a flush list. It doesn’t flush any pages during the recovery process, so the flush list grows big, thousands, tens of thousands, maybe millions kind of big, anyway, big-number big.

Oh, did I mention? The flush list is actually a linked list, not some kind of hippy tree stuff. Every time a log record is read from a log and something gets updated, the flush list will be traversed, thousands, tens of thousands, maybe millions of entries.

The expensive code looks something like this:

while (b && (ut_dulint_cmp(b->oldest_modification,
             block->oldest_modification) > 0)) {
       prev_b = b;
       b = UT_LIST_GET_NEXT(flush_list, b);

Then your profile starts looking like this, and you wish your systems didn’t crash:

%        symbol name
87.6978  buf_flush_insert_sorted_into_flush_list
 5.8571  -kernel
 1.9793  recv_apply_hashed_log_recs
 0.8059  buf_calc_page_new_checksum

So, the recovery process cost is exponential, and people work around it by reducing the log file size, by reducing performance of their system, while the actual fix is right there, in optimizing the data structure. Current model is outdated for anything built in last 5 years anyway.

Oh, and of course, I’d like systems not to crash at all, like that database master on whitebox raid0 running for 800 days.

Update: this is old stuff. Peter wrote about it, Heikki opened a bug, then thought it would need more than five minutes to fix it and classified it as a feature request, so Peter could write more about it. That makes it even more sad. We’d probably change the synopsis for the feature request, “make crash recovery work”.

Update 2: get the patch at Percona (Yasufumi is god :)


At the MySQL developer conference I accidently showed up some of things we’ve been doing with dtrace (I used it in few cases and realized the power it has), and saw some jaws drop. Then I ended up doing small demos around the event. What most people know about dtrace, is that there’re some probes and you can trace them. What people don’t know is that you can actually create lots of probes dynamically, and use them with lots of flexibility.

One of major things not really grasped by many is that dtrace is a combination of a tracing tool, debugger, programming language and a database, having minor, but very valuable functionality for each. It can attach to any place in code, it can get stacks, function arguments, traverse structures, do some evaluations, aggregate data, and in the end – thats all compiled code executed by kernel (or programs). 

Sometimes a probe may look not that useful (strace would provide file writes too?), but once combined with ability to get immediate stack, as well as set or read context variables (a previous probe on any other event could have saved some important information, e.g. host,user,table names, etc) – so final result may tell statistics correlated to many other activities. 

One developer (a traitor who has left support for easier life in engineering dept) listened to all this, and I asked what his current project was – apparently he was adding static dtrace probes to MySQL. It ended up being quite interesting discussion, as static probes provide two value points. First of all, it provides an interface – whereas dynamic probes can change with code changes (though, that doesn’t happen too often :) Second value – one can do additional calculations on a specific probe, which would be done only on-demand (when the probe is attached). 

So, having a static probe that directly maps to easy-mode dynamic one (it is straightforward to attach to a function, and quite easy to read its arguments), is a bit of waste (both in development time, as well as few instructions are actually written there). Dynamic tracing generally modifies binaries on fly – so it does not carry additional costs otherwise. Though an example where static probe would be awesome – having “query start” event, which would have query string canonized with all literals removed – this would allow on-demand query profiling for query groups, rather than stand-alone queries.

The other major value is ability to set thread-specific context variables in different probes, so they can read each other data. At the type of incoming packet one can tag the thread with whichever information needed – then any subsequent actions can reuse such information to filter out important events. That also removes the need of static probes providing multiple-layer information – it all can be achieved by chaining the events – without too much complexity. 

I took a bit of trollish stance when approached a developer implementing internal performance statistics. We were playing a game – he’d tell me what kind of performance information he’d like to extract, and I’d show a method to do that with dtrace. More people from monitoring field joined, and we ended up discussing what is the perfect performance monitoring and analysis system. It is quite easy to understand, that different people will need different kinds of metrics. For MySQL development work performance engineer will need mutex contention information, someone fixing a leak will need heap profiling, someone writing a feature will want an easy way to trace how server executes their code – and all that is way far from any needs actual user or DBA has. Someone who writes a query just wants to see the query plan with some easy-to-understand costs (just need to pump more steroids into EXPLAIN). DBAs may want to see resource consumption per-user, per-table, etc (something Google patch  provides). It is interesting to find a balance, between external tools and what should be supported out-of-the-box internally – and it is way easier to force internal crowd to have proper tools, and it is always nice to provide a much as possible instrumentation for anyone externally. 

Of course, there’s poor guy in the middle of two camps – a support engineer – who needs easy performance metrics to be accessible from clients, but needs way more depth than standard tools provide. In ideal case dtrace would be everywhere (someone recently said, thats one of coolest things Sun has ever brought) – then we’d be able to retrieve on-demand performance metrics from everywhere, and would be tempted to write DTraceToolkit  (a suite of programs that give lots and lots of information based on dtrace) like bunch of stuff for MySQL internals analysis.

I already made one very very simple tool  which visualizes dtrace output, so we can have graphviz based SVG callgraph for pretty much any type of probe (like, who in application does expensive file reads) – all from a single dtrace oneliner. It seems I can sell the tool to Sun’s performance engineering team – they liked it. :) 

Some people even installed Solaris afterwards for their performance tests. Great, I won’t have to (haha!).

Though lack of dtrace in Linux is currently a blocker for the technology, lots of engineers already have it on their laptops – MacOSX 10.5 ships it. It even has visual toolkit, that allows building some dtrace stuff in a GUI. 

I’m pretty sure now, any engineer would love dtrace (or dtrace based tools), they just don’t know that yet.

On blocking

If a process has two blocking operations, each blocking other (like, I/O and networking), theoretical performance decrease will be 50%. Solution is very easy – convert one operation (quite often the one that blocks less, but I guess it doesn’t matter that much) into a nonblocking one.

Though MySQL has network-write buffer, which provides some async network behavior, it still has to get context switch into a thread to write stuff.

rsync and other file transfer protocols are even worse in this regard. On a regular Linux machine rsync even on gigabit network will keep kernel’s send-queue saturated (it is 128K by default anyway).

How to make MySQL’s or rsync networking snappier? If in ‘netstat’ sendq column is maxed out – just increase kernel buffers, instead of process buffers:

# increase TCP max buffer size
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
# increase Linux autotuning TCP buffer limits
# min, default, and max number of bytes to use
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216

This can add additional 10-20% of file transfer throughput (and sendq goes up to 500k – so it seems to be really worth it).

On checksums

InnoDB maintains two checksums per buffer pool block. Old formula of checksum, and new formula of checksum. Both are read, both are written. I guess this had to be some kind of transition period, but it obviously took too long (or was forgotten). Anyway, disabling checksums code entirely makes single-thread data load 7% faster – though in parallel activity locking contention provides with some extra CPU resources for checksum calculation.

Leaving just single version of checksum would cut this fat in half, without abandoning the feature entirely – probably worth trying.

Update: Benchmarked InnoDB checksum against Fletcher. Results were interesting (milliseconds for 10000 iterations):

Algorithm: InnoDB Fletcher
826 453
-O2: 316 133
-O3: 42 75

So, though using Fletcher doubles the performance, -O3 optimizes InnoDB checksumming much better. How many folks do run -O3 compiled mysqld?

Performance of status commands

Though some may think status command performance is overrated, our code heavily relies on SHOW STATUS LIKE "Thread%", and I like to spam SHOW INNODB STATUS. Add monitoring systems, which rely on ‘SHOW GLOBAL STATUS’ – and performance starts to matter.

And things are a bit abandoned. See, in performance engineering people always look for low hanging fruit to pick, and there’re quite a few:

  • Peter wrote a year ago about performance regression in ‘SHOW STATUS’. Of course, now that Mark has filed a bug – patch is already in there. This bug would byte us very very severely, if we’d run live site against unpatched 5.0 – as it slows down from 0.2ms to 20ms. There’s a small ha-ha in this issue – guess where these two lines come from. ;-)
  • SHOW INNODB STATUS was taking 0.3s just because there were two rarely-looked at variables that needed traversing quite some structures. I filed bugs #36941 and #36942, to remind developers of this problem. Disabling these two counters allowed to run few thousand ‘SHOW INNODB STATUS’ a second, instead of just three.
  • SHOW GLOBAL STATUS traverses statistics for each thread. Doesn’t matter on single-thread development box too much, but is very very unhappy at higher thread concurrency – walking every variable on every thread is quite a task even for modern hardware. Generally some SMP scaling has been added by reducing locking on global variables this way – at the cost of making the command unusable.

INSERT speed, paramy, auto-inc

One of first things I noticed, when I tried to run some 5.0 tests, was that INSERT speed was much slower (no benchmarking at that time, purely subjective speed that loading the dump was somewhat too slow).

After discussing at evening, and some coding in the morning I built paramy – a simple C++ utility, that reads .sql stream, and spreads execution of statements over multiple worker threads.

For benchmarking I took a simple 13m row, 1.6G data, 1.7G index table. Loading it with standard mysql client took 30min (7200r/s). Paramy unfortunately hit auto-inc lock, and didn’t provide too much of performance – data load took 25min (8700r/s). The win here was mostly because everything until InnoDB being parallel, as instead of 100% core use, mysqld went up to ~130%, so it obviously managed to get additional cycles from SMP.

Dropping AUTO-INC from the field description changed everything. Standard mysql load wasn’t much faster, but under paramy load mysqld managed to use ~320% of CPU core, and did the load in 10min (~21500r/s).

The interesting part – no INSERT asks for AUTO-INC values, so in theory one can just disable the locking during the load, and re-enable it same way as unique or foreign key constraint checks are re-enabled after import of data – that way loading of data into InnoDB would be much much faster, especially with many more cores – though then contention moves away to transaction log mutex.

Additional way to reduce contention inside InnoDB may be removing both reading and writing checksums- they have quite big CPU load share. Disabling this at least during bulk imports can be quite useful.

Oh, I mentioned the subjective feeling that 5.0 was slower. I finally benchmarked – 4.0 did the data load in five minutes, and went over 40000 rows/s. I hope this is the last test where 4.0 is twice faster than 5.0.

I/O schedulers seriously revisited

The I/O scheduler problems have drawn my attention, and besides trusting empirical results, I tried to do more of benchmarking and analysis, why the heck strange things happen at Linux block layer. So, here is the story, which I found myself quite fascinating…
Continue reading “I/O schedulers seriously revisited”

Where covering indexes rock

One of MySQL features I usually like to exploit is covering index. Simply, if all fields required for a query are contained in an index, there’s no need to access table row somewhere else. In MyISAM that would mean not going to MYD file. Inside InnoDB data is clustered together with PK (another lovely feature) and secondary indexes contain PK values, so it is one B-Tree less to traverse too.

I’ll take a real-life example where it really mattered. In Wikipedia for English language only we have a ‘revisions’ table with 100m records. The PK is (page_id, revision_id), so any per-page operations (say, viewing complete history of changes) would read data clustered together in single block. Now the often used operation is to check user contributions, which would of course have an index (user_id, timestamp) and use it for traversing all revisions made by particular user.

In such case, if user is not centered on single page (of course, there’re many of these), going after entries in primary index will have to read (in worst case) 16k page from disk (and maybe a bigger stripe from RAID) for every 150-byte row that may be in there. In order to read information about 10000 revisions one would end up reading 200MB of data (and having 10000 seeks). Now simply by adding a covering index it reduces number of seeks required to 100 seeks (and reads just 2MB of data).

In our case we simply added an index on (user, timestamp, page, variousmetadata, comments, etc) – simply, all fields that were in table. It made a complete copy of 16GB table, but on the other hand, RAM costs per gigabyte are bigger than disk storage.

It may be not an issue in environments, where primary key (and table data) is already all cached in memory. Accessing a record in a page, once it is already in memory, is very cheap, compared to reading the page from disk. Of course, this makes a copy of table data, but really removes I/O saturation, especially if data is very cold.

Of course, still being on tight budget, we saved some disk space here too. Simply putting such wide index on single slave and not on others, then sending all queries requiring such index to particular slave did solve many resource issues. Maintaining secondary indexes inside InnoDB is not that expensive – all writes are delayed and go to ‘adaptive hash index’, and transaction logs contain just information about a single copy of a row.

Once we will go over 50 (or 20 :) DB servers, we may probably want to think about making such index changes and load balancing more automatic, but for now solving edge cases this way seems proper.

And here is comparison of two DB servers, one with covering and one with simple (just for WHERE conditions) index:

db4 mysql> select sum(rev_minor_edit) from revision
                        where rev_user_text='Test user';
1 row in set (21.73 sec)

db6 mysql> select sum(rev_minor_edit) from revision
                        where rev_user_text='Test user';
1 row in set (2.30 sec)

Both servers did read completely cold data (though they were in working state with warmed up buffers), and repeating of queries did provide not that different results (both servers executed under a second). Of course, 200MB of pages in buffer displaced possibly useful information…

Covering indexes are really powerful tool in environments where cold non-buffered data exists, and yet again prove the point, that access patterns of applications are very important when designing any schema or indexes. It is really one of major issues I expect from a storage engine (or DBMS in general) for any bigger data environments.

Five minutes of MediaWiki performance tuning

MediaWiki is quite complex package, and some even trivial features are not the ones that should be enabled on sites having more load. Though it is quite modular, still lots of code has to be executed, and some of it requires additional steps to be done. Continue reading “Five minutes of MediaWiki performance tuning”

TCP tuning for your database

Lately lots of new fascinating technologies are used to build even more fascinating new solutions, and solutions nowadays even run on distributed environments, not just on single server. These servers usually communicate using TCP – standard, that has been here long before gigabit (or ten megabit) ethernet or dawn of LAMP, and needs to be kicked a bit, to work properly. In our environment we have over hundred of application servers which handle quite a number of requests, and put quite demanding stress on database servers. For that we had to change some of default kernel settings – and it did improve situation a bit. Here I’ll try to overview some of them…
Continue reading “TCP tuning for your database”