Postgres kernel regressions

While Mark was looking at MongoDB, I was playing around, comparing various aspects of MySQL and Postgres performance. Certain PG performance numbers I saw (40kqps vs 110kqps from MySQL) made me really upset, so I ended up discussing them with people at #postgresql – and started comparing various versions/configurations/machines/etc.

Apparently the 2.6.32 kernel, which is in Ubuntu 10.04 LTS (Lucid Lynx) and is also the basis for the future RHEL6 kernel, has nearly 20% performance degradation for PG (though not for MySQL, phew) – and apparently it was news to their community (I have started an email thread; we’ll see where it goes).

Warning: some quotes below have been deliberately taken out of context for more snark.

While I was doing my research, I could of course observe plenty of gems of wisdom:

<davidfetter> domas, well, as tom lane once said, there’s no limit to how quickly you can get an answer if it doesn’t have to be correct

And an hour later:

<rawtaz> are you one of those mysql lovers?

<dim> nobody wants performance at the price of correctness, right?
<dim> as Tom said, I can be as quick as you want as providing an answer if you don’t want it to be correct (hint: 42.)

Apparently chasing these performance variations is…

<Snow-Man> tbh, I’m of the opinion that we’re losing the forest for the trees.

Can’t say that all of them were encouraging:

<sterncera> If you can’t be bothered to subscribe to -hackers, maybe you shouldn’t be mailing it

Special thanks go to mastermind, who not only didn’t lose his temper, but stayed focused on the topic and resisted my trolling :-) I really want PG to be a greatly performing database (and I’ve seen some great numbers from it), but the amazing amount of ignorance of and animosity towards MySQL makes things somewhat complicated over there :(

P.S. Now all the benchmarks I did are tainted and will need a full rerun…

Opening tables v2!

PMP on demand revealed one of the reasons why we’ve been seeing ‘Opening tables’ during proper operation, not just during startup (see my previous post on this topic).
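
For the uninitiated: PMP (the Poor Man’s Profiler) is just gdb dumping every thread’s backtrace and aggregating identical stacks. Here’s a rough Python rendition of the idea – a sketch, not the real tool (which is a shell one-liner), assuming gdb is installed and allowed to attach:

import collections, subprocess, sys

def pmp(pid):
    # one gdb pass: dump backtraces of all threads in the target process
    out = subprocess.run(
        ["gdb", "-batch", "-ex", "thread apply all bt", "-p", str(pid)],
        capture_output=True, text=True).stdout
    stacks, frames = collections.Counter(), []
    for line in out.splitlines():
        if line.startswith("Thread"):
            if frames:
                stacks[" <- ".join(frames)] += 1
            frames = []
        elif line.startswith("#"):
            # frames look like '#4  0xdeadbeef in func (...)' or '#3  func (...)'
            parts = line.split()
            frames.append(parts[parts.index("in") + 1] if "in" in parts else parts[1])
    if frames:
        stacks[" <- ".join(frames)] += 1
    for stack, count in stacks.most_common():
        print(count, stack)

pmp(int(sys.argv[1]))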

We had a thousand or so threads waiting on LOCK_open, and the only thread holding the mutex was this darling:

Thread 902 (Thread 1637624128 (LWP 22113)):
#3  mutex_spin_wait (mutex=0x2aaaac3232b8,
        file_name=0x8b3bf7 "trx0trx.c", line=220) at sync0sync.c:565
#4  trx_allocate_for_mysql
#5  ha_innobase::allocate_trx
#6  check_trx_exists
#7  ha_innobase::info
#8  ha_innobase::open
#9  handler::ha_open
#10 openfrm
#11 open_unireg_entry
#12 open_table
#13 open_tables
#14 open_and_lock_tables
#15 mysql_insert

So the kernel mutex, which is quite contended, escalates to LOCK_open on table open, which blocks all queries from happening on the server. The fix? Nearly the same code as the previous “Opening tables” fix – don’t call ha_innobase::info() as part of open(), or move the transaction establishment code outside of info().
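
To illustrate the pattern with a toy Python model (this is not MySQL code – the two locks just play the roles of the mutexes above, and the sleep stands in for the contended wait in trx_allocate_for_mysql()):

import threading, time

LOCK_open = threading.Lock()      # global mutex serializing all table opens
kernel_mutex = threading.Lock()   # hot, contended storage engine mutex

def open_table_before_fix():
    with LOCK_open:
        with kernel_mutex:        # contended wait happens under LOCK_open,
            time.sleep(0.01)      # stalling every other table open

def open_table_after_fix():
    with kernel_mutex:            # transaction establishment done first...
        time.sleep(0.01)
    with LOCK_open:
        pass                      # ...so LOCK_open is held only briefly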

Crash recovery, again

There’s one stage in InnoDB crash recovery where it reads the log file – you know, this one:

InnoDB: Doing recovery: scanned up to log sequence number 354164119040
InnoDB: Doing recovery: scanned up to log sequence number 354169361920

On a machine with bigger logs it will start spending nearly 100% CPU somewhere in recv_scan_log_recs. Guess what it does… -fno-inline builds to the rescue:

#0  mem_block_get_len at ./include/mem0mem.ic:86
#1  mem_heap_get_size at ./include/mem0mem.ic:591
#2  recv_scan_log_recs at log/log0recv.c:2727

And:

samples  %        symbol name
8467     72.9222  mem_heap_get_size
291       2.5062  recv_add_to_hash_table
95        0.8182  mem_block_get_len

To speak in layman’s terms: InnoDB does SUM(LENGTH(allocation)) over its relatively wide memory arena (tens or hundreds of thousands of entries), FOR EVERY LOG SEGMENT, to make sure it didn’t run out of the available 32GB. Hehehe, lame.

For now, I’ve just killed the check and made my recovery much, much faster – it finished in 3 minutes, something it wasn’t able to do in 30 before.
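
The proper fix is just as simple in principle – maintain a running total as blocks are added, instead of walking the whole block list on every call. A toy Python model of the idea (not InnoDB code, names are illustrative):

class MemHeap:
    def __init__(self):
        self.blocks = []          # lengths of allocated blocks
        self.total = 0            # running sum, updated on allocation

    def add_block(self, length):
        self.blocks.append(length)
        self.total += length

    def get_size_slow(self):      # what the log scan is paying for now:
        return sum(self.blocks)   # walks every block, per log segment

    def get_size_fast(self):      # what it could do instead:
        return self.total         # O(1)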

P.S. This is different from what I wrote about before (and from Yasufumi’s magic patch).
P.P.S. Now I’ve got to learn to reuse the LOG_DUMMY table during the recovery process, as that is the next low-hanging fruit there…

again, on benchmarks

Dear interweb, if you have no idea what you’re writing about, keep it to yourself; don’t litter the tubes. Some people may not notice they’re eating absolute crap, and get diarrhea.

This particular benchmark has two favorite parts that go together really well:

I didnt change absolutely any parameters for the servers, eg didn’t change the innodb_buffer_pool_size or key_buffer_size.

And…

If you need speed just to fetch a data for a given combination or key, Redis is a solution that you need to look at. MySQL can no way compare to Redis and Memcache. …

Seriously, how does one make up for all the damage of such idiotic benchmarks?

P.S. I’ve ranted at benchmarks before, and will continue doing so.

On file system benchmarks

I see this benchmark being quoted in multiple places, and in it I see stuff like:

When carrying out more database benchmarking, but this time with PostgreSQL, XFS and Btrfs were too slow to even complete this test, even when it had been running for more than an hour for a single run. Between EXT3, EXT4, and NILFS2, the fastest file-system was EXT3 and then its successor, EXT4, was slightly behind that. Far behind the position of EXT4 were NILFS2 and then Btrfs and XFS.

There were a few other benchmarks, e.g. SQLite showing ‘bad performance’ on XFS and Btrfs.

*clears throat*

Dear benchmarkers, don’t compare apples and oranges. If you see differences between benchmarks, do some very, very tiny research, and use some of the intellect that you, as primates, do have. If database tests are slowest on filesystems created by Oracle (who know some stuff about systems in general) or SGI (who, despite giving away their campus to Google, still have lots of expertise in the field), that can indicate that your tests are probably flawed somewhere, at least for that test domain.

Now, you’ve probably heard of such a thing as ‘data consistency’. That is something the database stack tries to ensure, sometimes at a higher cost: not trusting volatile caches, enforcing certain write orders, depending on acknowledgements from the underlying hardware.

So in this case it wasn’t “benchmarking file systems” – it was simply benchmarking “consistency” against “no consistency”. But don’t worry, most benchmarks have such flaws – getting numbers without understanding them makes the results much more interesting, right?
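
The gap between the two is easy to feel for yourself – here’s a little Python sketch (my own illustration, nothing from the benchmark in question) that times buffered writes against writes forced to stable storage, which is roughly the difference those filesystem numbers were measuring:

import os, tempfile, time

def timed_writes(count, durable):
    fd, path = tempfile.mkstemp()
    start = time.time()
    for _ in range(count):
        os.write(fd, b"x" * 512)
        if durable:
            os.fsync(fd)          # wait for stable storage, like a durable commit
    elapsed = time.time() - start
    os.close(fd)
    os.unlink(path)
    return elapsed

print("buffered:", timed_writes(1000, False))
print("fsync'ed:", timed_writes(1000, True))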

Oh, and… thanks for a few more misguided people.

Linux 2.6.29

2.6.29 was released. I don’t usually write about Linux kernel releases, that’s what Slashdot is for :), but this one introduces write barriers in LVM, as well as ext4 with write barriers enabled by default. If you run this kernel and forget to turn off barrier support on your filesystems (like XFS’s nobarrier), you will see nasty performance slowdowns (see my recent post about it). Beware.