On checksums

InnoDB maintains two checksums per buffer pool block. Old formula of checksum, and new formula of checksum. Both are read, both are written. I guess this had to be some kind of transition period, but it obviously took too long (or was forgotten). Anyway, disabling checksums code entirely makes single-thread data load 7% faster – though in parallel activity locking contention provides with some extra CPU resources for checksum calculation.

Leaving just single version of checksum would cut this fat in half, without abandoning the feature entirely – probably worth trying.

Update: Benchmarked InnoDB checksum against Fletcher. Results were interesting (milliseconds for 10000 iterations):

Algorithm: InnoDB Fletcher
826 453
-O2: 316 133
-O3: 42 75

So, though using Fletcher doubles the performance, -O3 optimizes InnoDB checksumming much better. How many folks do run -O3 compiled mysqld?

Wasting InnoDB memory

I usually get strange looks when I complain about memory handling inside InnoDB. It seems as if terabytes of RAM are so common and cheap, that nobody should really care about memory efficiency. Unfortunately for me, I do.


  • The infamous Bug#15815 – buffer pool mutex contention. The patch for the bug added lots of small mutexes, and by ‘lots’ I mean really really lots – two mutexes (and rwlock structure) for each buffer pool page. That makes two million mutexes for 16GB buffer pool, um, four million mutexes for 32GB buffer pool, and I guess more for larger buffer pools. Result – 16GB buffer pool gets 625MB locking tax to solve a 8-core locking problem. Solution? Between giant lock and armies of page mutexes there lives a land of mutex pools, where locks are shared happily by multiple entities. I even made a patch, unfortunately it gets some ibuf assertion after server restart though at first everything works great :)
  • InnoDB data dictionary always grows, never shrinks. It is not considered a bug, as it isn’t memory leak – all memory is accounted by (hidden) dict_sys->size, and valgrind doesn’t print errors. 1-column table takes 2k of memory in InnoDB data dictionary, a table with few more columns and indexes takes already 10k. 100000 tables, and 1GB of memory is wasted. Who needs 100000 tables? People running application farms do. Actually, there even is a code for cleaning up data dictionary, just wasn’t finished, and is commented out at the moment. Even worse, the fix for #20877 was a joke – reducing the in-memory structure size, still not caring about structure count. And of course, do note that every InnoDB partition of a table takes space there too…

So generally if you’re running bigger InnoDB deployment, you may be hitting various hidden memory taxes – in hundreds of megabytes, or gigabytes – that don’t provide too much value anyway. Well, memory is cheap, our next database boxes will be 32GB-class instead of those ‘amnesia’ 16GB types, and I can probably stop ranting :)

Performance of status commands

Though some may think status command performance is overrated, our code heavily relies on SHOW STATUS LIKE "Thread%", and I like to spam SHOW INNODB STATUS. Add monitoring systems, which rely on ‘SHOW GLOBAL STATUS’ – and performance starts to matter.

And things are a bit abandoned. See, in performance engineering people always look for low hanging fruit to pick, and there’re quite a few:

  • Peter wrote a year ago about performance regression in ‘SHOW STATUS’. Of course, now that Mark has filed a bug – patch is already in there. This bug would byte us very very severely, if we’d run live site against unpatched 5.0 – as it slows down from 0.2ms to 20ms. There’s a small ha-ha in this issue – guess where these two lines come from. ;-)
  • SHOW INNODB STATUS was taking 0.3s just because there were two rarely-looked at variables that needed traversing quite some structures. I filed bugs #36941 and #36942, to remind developers of this problem. Disabling these two counters allowed to run few thousand ‘SHOW INNODB STATUS’ a second, instead of just three.
  • SHOW GLOBAL STATUS traverses statistics for each thread. Doesn’t matter on single-thread development box too much, but is very very unhappy at higher thread concurrency – walking every variable on every thread is quite a task even for modern hardware. Generally some SMP scaling has been added by reducing locking on global variables this way – at the cost of making the command unusable.

INSERT speed, paramy, auto-inc

One of first things I noticed, when I tried to run some 5.0 tests, was that INSERT speed was much slower (no benchmarking at that time, purely subjective speed that loading the dump was somewhat too slow).

After discussing at evening, and some coding in the morning I built paramy – a simple C++ utility, that reads .sql stream, and spreads execution of statements over multiple worker threads.

For benchmarking I took a simple 13m row, 1.6G data, 1.7G index table. Loading it with standard mysql client took 30min (7200r/s). Paramy unfortunately hit auto-inc lock, and didn’t provide too much of performance – data load took 25min (8700r/s). The win here was mostly because everything until InnoDB being parallel, as instead of 100% core use, mysqld went up to ~130%, so it obviously managed to get additional cycles from SMP.

Dropping AUTO-INC from the field description changed everything. Standard mysql load wasn’t much faster, but under paramy load mysqld managed to use ~320% of CPU core, and did the load in 10min (~21500r/s).

The interesting part – no INSERT asks for AUTO-INC values, so in theory one can just disable the locking during the load, and re-enable it same way as unique or foreign key constraint checks are re-enabled after import of data – that way loading of data into InnoDB would be much much faster, especially with many more cores – though then contention moves away to transaction log mutex.

Additional way to reduce contention inside InnoDB may be removing both reading and writing checksums- they have quite big CPU load share. Disabling this at least during bulk imports can be quite useful.

Oh, I mentioned the subjective feeling that 5.0 was slower. I finally benchmarked – 4.0 did the data load in five minutes, and went over 40000 rows/s. I hope this is the last test where 4.0 is twice faster than 5.0.

Where covering indexes rock

One of MySQL features I usually like to exploit is covering index. Simply, if all fields required for a query are contained in an index, there’s no need to access table row somewhere else. In MyISAM that would mean not going to MYD file. Inside InnoDB data is clustered together with PK (another lovely feature) and secondary indexes contain PK values, so it is one B-Tree less to traverse too.

I’ll take a real-life example where it really mattered. In Wikipedia for English language only we have a ‘revisions’ table with 100m records. The PK is (page_id, revision_id), so any per-page operations (say, viewing complete history of changes) would read data clustered together in single block. Now the often used operation is to check user contributions, which would of course have an index (user_id, timestamp) and use it for traversing all revisions made by particular user.

In such case, if user is not centered on single page (of course, there’re many of these), going after entries in primary index will have to read (in worst case) 16k page from disk (and maybe a bigger stripe from RAID) for every 150-byte row that may be in there. In order to read information about 10000 revisions one would end up reading 200MB of data (and having 10000 seeks). Now simply by adding a covering index it reduces number of seeks required to 100 seeks (and reads just 2MB of data).

In our case we simply added an index on (user, timestamp, page, variousmetadata, comments, etc) – simply, all fields that were in table. It made a complete copy of 16GB table, but on the other hand, RAM costs per gigabyte are bigger than disk storage.

It may be not an issue in environments, where primary key (and table data) is already all cached in memory. Accessing a record in a page, once it is already in memory, is very cheap, compared to reading the page from disk. Of course, this makes a copy of table data, but really removes I/O saturation, especially if data is very cold.

Of course, still being on tight budget, we saved some disk space here too. Simply putting such wide index on single slave and not on others, then sending all queries requiring such index to particular slave did solve many resource issues. Maintaining secondary indexes inside InnoDB is not that expensive – all writes are delayed and go to ‘adaptive hash index’, and transaction logs contain just information about a single copy of a row.

Once we will go over 50 (or 20 :) DB servers, we may probably want to think about making such index changes and load balancing more automatic, but for now solving edge cases this way seems proper.

And here is comparison of two DB servers, one with covering and one with simple (just for WHERE conditions) index:

db4 mysql> select sum(rev_minor_edit) from revision
                        where rev_user_text='Test user';
1 row in set (21.73 sec)

db6 mysql> select sum(rev_minor_edit) from revision
                        where rev_user_text='Test user';
1 row in set (2.30 sec)

Both servers did read completely cold data (though they were in working state with warmed up buffers), and repeating of queries did provide not that different results (both servers executed under a second). Of course, 200MB of pages in buffer displaced possibly useful information…

Covering indexes are really powerful tool in environments where cold non-buffered data exists, and yet again prove the point, that access patterns of applications are very important when designing any schema or indexes. It is really one of major issues I expect from a storage engine (or DBMS in general) for any bigger data environments.

Open transactions are bad

Two identical queries have somewhat different execution times, though data layout, table format, everything… is quite same:

mysql> select count(*) from transtest;
| count(*) |
|    10000 |
1 row in set (0.00 sec)

mysql> select count(*) from transtest;
| count(*) |
|    10000 |
1 row in set (2.88 sec) < ---- oh noes!!!

The difference is very simple, though often forgotten. Multiversioned (MVCC) tables may have multiple versions (hence the name) of rows.
Even more, the engine will keep all versions of rows, that would be needed for oldest active transaction, just in case it requests some snapshot from the past.
Think of it as of some sort of time machine, just somewhat less efficient :)

If some old transaction is running, heavily edited tables will have more stored row versions.
In above case, second query had to read a million of rows to produce a count, when first one was working on clean environment.
Once the transaction holding old versions was rolled back, the query took 0.00sec again.

Databases (or DataACIDs) will take additional measures, often unexposed, to provide consistent data, so we may treat this as a feature, not a bug.

Conclusion: avoid long running transactions, especially if doing nothing. Oh well, avoid long running transactions with locks too, but that is already different question. If you spot anything sleeping – kill the slacker. At application level. :-)