uncache!

This is the source code for a tiny program I just wrote that traverses specified directories and evicts their files from the file system cache.
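
The actual source went with the original post; for the idea itself, here is a minimal sketch – assuming Linux, posix_fadvise() as the eviction mechanism, nftw() for the directory walk, and with error handling trimmed:

#define _XOPEN_SOURCE 600
#include <fcntl.h>
#include <ftw.h>
#include <sys/stat.h>
#include <unistd.h>

/* nftw() callback: evict each regular file from the page cache */
static int drop_caches(const char *path, const struct stat *sb,
                       int type, struct FTW *ftwbuf)
{
  if (type == FTW_F) {
    int fd = open(path, O_RDONLY);
    if (fd >= 0) {
      fdatasync(fd);  /* the kernel won't drop dirty pages, flush them first */
      posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
      close(fd);
    }
  }
  return 0;  /* keep walking */
}

int main(int argc, char **argv)
{
  int i;
  for (i = 1; i < argc; i++)
    nftw(argv[i], drop_caches, 64, FTW_PHYS);
  return 0;
}

Compile with ‘cc -o uncache uncache.c’ and point it at your directories.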

There are a few use cases for it. One is for all the people who benchmark stuff and want selective OS cache purges; another is for those who run high-performance databases. Remember the O_DIRECT serialization everywhere? Well, XFS does direct I/O in parallel, unless there are cached pages (and those can appear because of any random outside-of-database activity, like the ‘file’ command). Once you ‘uncache’ the files, XFS will be very much parallel again \o/ \o/

Checksums again, some I/O too

When I was doing data loading tests, I realized that checksum calculation, despite its usually low CPU percentage, is actually the blocking factor. See, when background writers do the flushing, it gets parallelized, but if an active query is forcing a checkpoint, it all happens in the ‘foreground’ thread, checksum computation included. This is where more Sun-ish wisdom (these people tune the kernel with a debugger all the time) comes in:

gdb -p $(pidof mysqld) -ex "set srv_use_checksums=0" --batch

Poof. Everything becomes much faster. Of course, one could restart the server with --skip-innodb-checksums, but that would interrupt the whole process, etc. Of course, proper people would implement a tunable parameter (5 lines of code, or so), but anyone with Solaris experience knows how to tune stuff with debuggers, hahaha.
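
For the record, the sort of tunable I mean would look something like this in the plugin sysvar API – a hedged sketch, not actual InnoDB code, and it glosses over the my_bool vs. ibool plumbing a real patch would need:

static MYSQL_SYSVAR_BOOL(checksums, srv_use_checksums,
  PLUGIN_VAR_NOCMDARG,
  "Enable InnoDB checksum validation of disk pages (can be toggled at runtime)",
  NULL, NULL, TRUE);

Plus one line registering it in the plugin’s system variables array, and you’re done – no debugger required.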

Odd, though – I was used to -O3 builds optimizing checksums enough to make them disappear from profiles, so either that doesn’t work here, or they simply carry enough weight to actually matter. This is why I told Heikki at the Users Conference that checksums ‘must go’ or be fixed one way or another. They may be a small part of the operation, and cost nothing as long as they don’t block primary operations (e.g. happen in parallel) – but people keep moving to SSDs, and there the relative cost of a checksum is insane.

It is quite easy to check: just run InnoDB with a tiny buffer pool on top of OS caching, and benchmark with checksums enabled and disabled – there will be a huge difference, which may by itself explain the performance gap between caching in the buffer pool and in OS buffers. Turn checksums off, and OS caching may even be tolerable for your workloads.
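
Something along these lines – two runs, a tiny pool forcing everything through the OS cache, option names from the 5.0/5.1 era:

# run 1: checksums on (the default)
mysqld --innodb-buffer-pool-size=32M
# run 2: same setup, checksums off
mysqld --innodb-buffer-pool-size=32M --skip-innodb-checksums

Then run your favorite read-heavy benchmark against each and compare.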

Another interesting thing here is that MarkC has lots of I/O path optimizations in his fourth (and four is a good number!) patch. Now… maybe everyone will soon be caching pages in the OS, once some more work is done on the I/O access path costs?

stop messing with the tablespace

People keep loving and endorsing --innodb-file-per-table. Then poor new users read about that, get confused, start using --innodb-file-per-table, and tell others to. Others read that, get confused even more, start using --innodb-file-per-table, then write about it. Then…

Oh well. Here, some endorsements and FUD against one-tablespace-to-unite-them-all:

This same nice property also translates to a not so nice one: data can be greatly fragmented across the tablespace.

Of course, having file-per-table means that only one table lives in each file, so, kind of, the data will not be ‘mixed’… inside a file. Now, when data grows organically (not when you restore a few-hundred-gigabyte dump sequentially), all those files grow and start getting fragmented (at ratios depending on how smart the filesystem is, and… how many people choose smart filesystems nowadays?). So the disks will hold the same soup of data – just instead of ‘fragmentation’ inside a tablespace, which is laid out sequentially on the disk/filesystem, you get fragmentation at the file system level. Now, InnoDB has extents and such, and can afford new storage formats more often than filesystems can, so… which one is more modern, with proper data allocation strategies?

So, some more criticism of InnoDB tablespaces:

An annoying property of InnoDB’s tablespaces is that they never shrink.

Annoying are the people who find this annoying. Look, it is simple: if your data grows to X, and then you do something amazing and shrink your dataset, there are two outstanding questions:

  • How soon will your dataset grow back to X?
  • What temporary data will be placed there, until the dataset grows back to X?

It is very simple: database servers house data. Run OPTIMIZE, and the data will eventually get fragmented again (quite fast, actually, under usual access patterns – once you have lots of full pages around, a simple insertion will split pages). The ‘free space’ achieved does not bring much value: it will be gone, and data will be placed there. Oh, well, and it _will_ be used by the database, by _any_ table, not just the owner-table of a file.

It does not release storage to the file system.

*facepalm*. The file system doesn’t want that storage. It will give it back to InnoDB as soon as it asks. :)

I’ve seen more than once how certain tables are left unwatched, growing until disk space reaches 90% and SMS notifications start beeping all around.

Where is the issue of data storage strategies here? I have tablespaces taking 95%. When they reach 99% I will kill those database servers and cycle them out to other tasks with smaller datasets. There is nothing bad about data growing, as long as you know the reason. There is nothing wrong with a tablespace growing – that’s its job, to house the data you feed it.

But having a file which consumes some 80-90% of disk space is a performance catastrophe. It means the disk needle needs to move large distances. Overall disk performance runs very low.

[citation needed]. Which filesystem will not place data at those ‘long distances’ for the needle once it gets full? Which filesystem will relocate data to the start of the disk? This is a most poorly sourced statement, and it spreads FUD, even though the same data placement would happen with pretty much every storage strategy. If the start area gets full, you have to place data elsewhere. End of tablespace, end of filesystem, end of disk – it is the same thing.

Also, one can never be sure how the underlying storage is mapped to the block device. Writing to the end can be the fastest – you never know. Did you try? :)

mysqldump, mk-parallel-dump

Where is mydumper? ;-D

One other nice thing about innodb_file_per_table is that it is possible to monitor table size on the file system level.

Oh sure, that’s just what you need: give people who deal with data at the application level access to MySQL data directories, and let them run awk, sort, du, etc. – even though INFORMATION_SCHEMA gives you that data for any table storage type you use. Oh, by the way, both I_S and ‘SHOW TABLE STATUS’ will tell you about free space, whereas the filesystem has no idea what data lives inside a file. You choose your tools… ;-)
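
For example (‘mydb’ here is a placeholder schema name), sizes and free space are one query away, whatever the storage layout:

mysql -e "SELECT TABLE_NAME, DATA_LENGTH + INDEX_LENGTH AS bytes, DATA_FREE FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_SCHEMA = 'mydb'"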

Oh, wait, this was given as an argument:

You don’t need access to MySQL, to use SHOW TABLE STATUS or to query the INFORMATION_SCHEMA.

Right…

Though actually I do know one thing file-per-table solves. I’ve written already that most Linux file systems serialize O_DIRECT writes. Using separate files means O_DIRECT writes land in separate queues, one per inode. I’d suggest just using a proper file system instead :-)

Now I should probably tell why having a single tablespace is better. Besides the obvious (freed space is shared by all tables), I don’t know. Well, there are lots of different issues that have different impact on different operating systems, file systems, storage, etc.

For example, when data lives inside a single tablespace, filesystem metadata operations are nearly nonexistent. That means no writes to the FS journal, less need for multiple points of synchronization, etc.

Multiple file descriptors could also be mentioned, though nowadays that probably has nearly zero significance. I don’t know – statements like that need more than just assumptions. I don’t have good numbers on how much FS fragmentation affects a proper database workload – and results can differ depending on data access patterns, size composition and, of course, the filesystems used.

Still, from an OS administrator’s perspective the data is not fragmented in any way: moving files around will hit bandwidth limits in network, memory and I/O, but will not do any seeks. I have probably dealt with too many systems housing hundreds of millions of files not to love the single-file approach. :)

On throttling

Seems like nowadays InnoDB crash recovery is much funnier. Please allow me to show you something:

while (buf_pool->n_pend_reads >=
        recv_n_pool_free_frames / 2) {
  /* wake the simulated AIO handler threads, then
     wait for pending reads to drain */
  os_aio_simulated_wake_handler_threads();
  os_thread_sleep(500000); /* 500000 us = half a second */
}

Translation – if there are 128 or more outstanding read requests, sleep for half a second. Hehehe. Optimized for floppies!

after the conference, mydumper, parallelism, etc

Though the slides for my MySQL Conference talks were on the O’Reilly website, I placed them on my talks page too, for both the dtrace and security presentations.

I also gave a lightning talk about mydumper. Since my original announcement mydumper has changed a bit: it supports writing compressed files, detecting and killing slow queries that could block table flushes, and regular expressions for table names, and trunk is slowly moving towards understanding that storage engines differ :)
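
A typical invocation looks something like this (check --help on your build, as flags keep moving in trunk; the host, schema and paths here are made up):

mydumper --host=db1 --database=mydb --compress --regex '^mydb\.' --kill-long-queries --outputdir=/backups/mydb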

I’ve been using mydumper quite a lot in my deployments (and observing 10x faster dumps). Now, the sad part is how to do faster recovery. It is quite easy to parallelize loading the data (apparently, xargs supports running parallel processes):

ls *.sql.gz | xargs -P 16 -I % sh -c 'zcat % | mysql dbname'

Still, that doesn’t scale much – it only doubles the load speed compared to a single-threaded load, even on quite a powerful machine. The problem lives in the log_sys mutex – it is acquired for every InnoDB row operation, to grab Logical Sequence Numbers (LSNs), so neither batching nor differentiation strategies really help, and LOAD DATA hits the same problem too. In certain cases I saw quite some spinning on other mutexes as well, and it seems that InnoDB currently doesn’t scale that well with lots of small row operations. Maybe someone some day will pick this up and fix it – that’s why we go to conferences and share our findings :)

oracle?

oracle!

While everyone is sleeping and preparing for four busy days of the MySQL Conference here in Santa Clara, I started getting SMSes asking if I have already learnt PL/SQL – and here I am, jetlagged, finding out that I work for another company.

If they don’t kill MySQL, InnoDB and MySQL will finally be together.

If they kill MySQL, I’ll have to look for a job. Will anyone use MySQL then, or will I have to fall back to the more generic non-MySQL work I’ve been doing for my hobby projects, teeeheeee.

And for now, I see 6AM faces showing up and greeting Oracle buddies – some jetlagged, some just early birds.

april1

Today I enjoyed YouTube’s (upside-down movies of upside-down actions, brilliant!) and Wikipedia’s (always real stories) front-page appearances. I also enjoyed this chat in #postgresql:

[12:25]  <TommyG> hehe, nice april 1st at http://www.postgresql.fr/
[12:25]  <johto> :D
[12:27]  <davidfetter> heh
[12:29]  <AlexB> Cute.
[12:29]  <domas> omg, so funny!
[12:29]  <CaT[t2]> best april fools so far tbh :)
[12:31]  <mst> we just had somebody on irc.perl.org miss it entirely
[12:31]  <mst> I'm currently trying to sell him a bridge.
[12:31]  <CaT[t2]> go for it. you could gett rich :)

OK, my “so funny” there was well-hidden sarcasm. That site had a big warning, along the lines of “could not connect to mysql socket, could not find mysql socket, mysql fail”. See, for PG people it is fun to say “mysql can fail!”. Now, what made it much funnier to people outside the PG community: for the past few hours the http://www.postgresql.fr server did not respond, and later it was saying just “Maintenance – Des travaux sont en cours sur la plateforme postgresql.fr” (“Maintenance – work is in progress on the postgresql.fr platform”). Ironic :-)

I made a similar joke, not too funny, last year, probably in a bit more polite fashion, in this email.

on tools and operating systems

Sometimes people ask why I use MacOSX as my main work platform (isn’t that something to do with beliefs?). My answer is “a good foundation with a great user interface”. Though that can be read as “he must like the unix kernel and the look&feel!”, it is not exactly that.

What I like is that I can have a good, stable graphical environment with some mandatory tools (yes, I use the OS-supplied browser, mail, etc.), but beside that maintain a bleeding-edge open-source space (provided by MacPorts).

Also what I like are the OS-supplied development and performance tools. The bundled DTrace is awesome, yes, but Apple put some special touch on it too – a whole visualization environment for dtrace probes and other profiling/debugging tools.

Even the web browser (well, I upgraded to Safari 4.0 ;-) provides some impressive debugging and profiling capabilities.

Of course, I end up running a plethora of virtual machines (switching from Parallels to VirtualBox lately), and I even got a KDE/Aqua build (mostly for kcachegrind). I don’t really need Windows apps, I can run ‘Linux’ ones natively on MacOSX, and I can run MacOSX ones on MacOSX.

There’s a full web stack for my MediaWiki work, there’re dozens of MySQL builds around, there’re photo albums, dtrace tools, World of Warcraft, a bunch of toy projects, a few different office suites, Skype, NetBeans, Eclipse, Xcode, integrated address books and calendars, all major scripting languages, and revision control systems – git, svn, mercurial, bzr, bitkeeper, cvs, etc.

All that on a single machine, running for three years, without too much clutter, and with nearly zero effort to make it all work. That’s what I want from a desktop operating system – extreme productivity without too much tinkering.

And if anyone blames me for using non-open-source software, my reply is very simple – my work output is open-sourced.

twitter!

Quite often fast databases, super-duper backend caching layers and other fancy stuff don’t help if you don’t serve your customer right. Take, for example, Twitter. This service gets lots and lots of clicks – people following each other in endless loops and trees – and probably serves the occasional page-view too.

I noticed that every click seemed somewhat sluggish, and started looking into it (sometimes this gets me a free lunch or so ;-)

Indeed, every click seemed to reload quite a bit of static content (like CSS and JavaScript from their ‘assets’ service). Each pageview carrying actual information took 2s to serve, but the static content delayed the page presentation by three to six additional seconds.

Now, I can’t say Twitter didn’t try to optimize this. Their images are loaded from S3 and have decent caching (even though the datacenter is far away from Europe), but something they completely control and own, and which should cost them the least, ends up being the major slow-down.

What did they do right? They put timestamp markers into the URLs of all included javascript and stylesheet files, so it is really easy to switch to new files (those URLs are all dynamically generated by their application for every pageview).

What did they do wrong? Let’s look at the response headers for the slow content:

Accept-Ranges:bytes
Cache-Control:max-age=315360000
Connection:close
Content-Encoding:gzip
Content-Length:2385
Content-Type:text/css
Date:Wed, 25 Mar 2009 21:12:21 GMT
Expires:Sat, 23 Mar 2019 21:12:21 GMT
Last-Modified:Tue, 24 Mar 2009 21:21:04 GMT
Server:Apache
Vary:Accept-Encoding

It probably looks perfectly valid (expires in ten years, cache control present), but…

  • Cache-Control simply forgot to say this is “public” data.
  • An ETag header could help too, especially since no ‘public’ is specified (both fixes are sketched right after this list).
  • Update: different pages embed different timestamp values for the same included files – so all those caching headers don’t have much purpose anyway ;-)
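
With those two fixes, the interesting part of the response would look something like this (the ETag value here is made up – any stable validator would do):

Cache-Control:public, max-age=315360000
ETag:"20090324-1"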

And of course, if those files were any closer to Europe (right now they seem to travel a long, long way to San Jose, California), I’d forgive the lack of keep-alive. Just serve those few files off a CDN, dammit.

Linux 2.6.29

2.6.29 was released. I don’t usually write about Linux kernel releases – that’s what Slashdot is for :) – but this one introduces write barrier support in LVM, as well as ext4 with write barriers enabled by default. If you run this kernel and forget to turn off barrier support on your filesystems (e.g. nobarrier on XFS), you will see nasty performance slowdowns (see my recent post about it). Beware.
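
A minimal sketch, assuming an XFS volume already mounted at /data (the device and mountpoint are made up; ext4 users would use barrier=0 instead):

mount -o remount,nobarrier /data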