embarrassment

So, we had a major embarrassment last night. Multiple factors came together:

  • We don’t have a parallelism coordinator for our most CPU-intensive task at Wikipedia, so the same job can be worked on in ten, a hundred, a thousand threads across the cluster at the same time.
  • Some parts of our parsing process ended up extremely CPU-intensive, and that happened not in our code, but in ‘templates’, which live in user space. We don’t have profiling for templates, so we can only guess which ones are slow and which are fast, let alone see their overall aggregates.
  • Some parts of pages are extremely template-heavy, making page rendering cost a lot (e.g. citations – see this discussion).
  • In order to avoid content integrity race conditions, the editing process releases locks and invalidates objects early, separately from the ‘virgin parse’ that populates the caches.
  • It takes quite some time to refill the cache, as rendering is CPU-bound for a long while in certain cases.
  • During that short window when caches are empty, a stampede of users on a single article causes lots of redundant work across the cluster/grid/cloud.
  • The Michael Jackson article on the English Wikipedia alone had a million views in one hour.

So, in summary, we had havoc in our cluster because a stampede of heavy requests between cache purge and cache population was consuming all available CPU resources, mostly rendering the references section of the Michael Jackson article.

Oh well, the quick operations hack looked like this:

Index: ParserCache.php
===================================================================
--- ParserCache.php	(revision 52088)
+++ ParserCache.php	(working copy)
@@ -63,6 +63,7 @@
  if ( is_object( $value ) ) {
    wfDebug( "Found.\n" );
    # Delete if article has changed since the cache was made
    // temp hack!
+   if( $article->mTitle->getPrefixedText() != 'Michael Jackson' ) {
    $canCache = $article->checkTouched();
    $cacheTime = $value->getCacheTime();
    $touched = $article->mTouched;

It is embarrassing, as the actual pageview count was way below our usual capacity; whenever we have problems, it is because of some narrow, expensive problem, not because of an overall unavoidable resource shortage. We can afford many more edits, many more pageviews. We could have handled this load way better if our users weren’t creating complex logic in articles. We could have handled it way better if we had more aggressive redundant job elimination.
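Just to illustrate what I mean by redundant job elimination – a minimal single-flight sketch, where at most one process per article gets to do the expensive parse and everyone else backs off and serves the stale copy. This is only an illustration of the idea, not what MediaWiki does; render_article and the paths are made up.

#!/bin/sh
# per-article rendering guard using flock(1); render_article and the paths are hypothetical
title="$1"
lock="/var/lock/render-$(echo "$title" | tr ' /' '__').lock"
(
  if ! flock -n 9; then
    # another worker is already rendering this article; serve the stale cached copy
    exit 0
  fi
  render_article "$title" > "/var/cache/render/$title.html"
) 9> "$lock"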

That’s the real story of operations: though headlines like “High profile event brought down Wikipedia” may sound nice, the real story is “shit happens”.

Checksums again, some I/O too

When I was doing data loading tests, I realized that checksum calculation, which usually shows up as a low CPU percentage, is actually the blocking factor. See, usually when background writers do the flushing, it gets parallelized, but if an active query is forcing a checkpoint, it all happens in the ‘foreground’ thread, checksum computation included. This is where more Sun-ish wisdom (these people tune the kernel with a debugger all the time) comes in:

gdb -p $(pidof mysqld) -ex "set srv_use_checksums=0" --batch

Puff. Everything becomes much faster. Of course, one could restart the server with --skip-innodb-checksums, but that would interrupt the whole process, etc. Of course, proper people would implement a tunable parameter (5 lines of code, or so), but anyone with Solaris experience knows how to tune stuff with debuggers, hahaha.
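If you go the debugger route, it is worth checking the value first and flipping it back once the load is done – a minimal sketch of the same trick, assuming the symbol is visible in your mysqld build:

# check the current value before touching anything
gdb -p $(pidof mysqld) -ex "print srv_use_checksums" --batch
# ... run the bulk load with checksums disabled ...
# re-enable checksums afterwards
gdb -p $(pidof mysqld) -ex "set srv_use_checksums=1" --batch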

Odd, though – I was used to -O3 compiled builds optimizing checksums enough to disappear from profiles, so either that doesn’t work here, or they simply got so much emphasis that they actually matter. This is why I told Heikki at the Users Conference that checksums ‘must go’ or be fixed in one way or another. Even if they are a small part of an operation and cost nothing as long as they don’t block primary operations (e.g. happen in parallel), people keep moving to SSDs, and then the relative cost of a checksum is insane.

It is quite easy to check that: just run InnoDB with a tiny buffer pool and OS caching, and try some benchmarking with checksums enabled and disabled – there will be a huge difference, which may by itself explain the performance difference between caching in the buffer pool and in OS buffers. Turn checksums off, and OS caching may even be tolerable for your workloads.
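For what it’s worth, a rough sketch of such a comparison with sysbench – option values and table size are made up, connection options are omitted, and the server has to be shut down between the two runs; the only point is tiny buffer pool plus checksums on versus off:

# run 1: tiny buffer pool, default buffered I/O, checksums on (default)
mysqld_safe --innodb-buffer-pool-size=32M &
sysbench --test=oltp --mysql-db=test --oltp-table-size=10000000 prepare
sysbench --test=oltp --mysql-db=test --oltp-read-only=on run

# run 2: restart with page checksums disabled, repeat the read-only run
mysqld_safe --innodb-buffer-pool-size=32M --skip-innodb-checksums &
sysbench --test=oltp --mysql-db=test --oltp-read-only=on run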

Another interesting issue here is that MarkC has lots of I/O path optimization in his fourth (and four is a good number!) patch. Now… maybe everyone will soon be caching pages in the OS, once some more work is done on the I/O access path costs?

stop messing with the tablespace

People keep loving and endorsing --innodb-file-per-table. Then poor new users read about it, get confused, start using --innodb-file-per-table, and tell others to. Others then read that, get even more confused, start using --innodb-file-per-table, then write about it. Then…

Oh well. Here, some endorsements and FUD against one-tablespace-to-unite-them-all:

This same nice property also translates to a not so nice one: data can be greatly fragmented across the tablespace.

Of course, file-per-table means that only one table will be in a file, so, kind of, it will not be ‘mixed’… inside that file. Now, when data grows organically (not when you restore a few-hundred-gigabyte dump sequentially), all those files grow and start getting fragmented (at ratios depending on how smart the filesystem is, and… how many people choose smart filesystems nowadays?). So disks will end up with the same soup of data; just instead of ‘fragmentation’ inside a tablespace that is laid out sequentially on a disk/filesystem, you get fragmentation at the filesystem level. Now, InnoDB has extents and such, and can afford new storage formats more often than filesystems do, so… which one is more modern when it comes to proper data allocation strategies?

So, some more criticism of InnoDB tablespaces:

An annoying property of InnoDB’s tablespaces is that they never shrink.

Annoying are the people who find this annoying. Look, it is simple: if your data grows to X, and then you do something amazing and shrink your dataset, there are two outstanding questions:

  • How soon will your dataset grow back to X?
  • What temporary data will be placed there, until the moment the dataset grows back to X?

It is very simple: database servers house data. Run OPTIMIZE, and the data will eventually get fragmented again (quite fast, actually, under usual access patterns – once you have lots of full pages around, a simple insertion will split pages). The ‘free space’ achieved does not bring too much value; it will be gone, and data will be placed there. Oh well, and it _will_ be used by the database, by _any_ table, not just the owner-table of a file.

It does not release storage to the file system.

*facepalm*. File system doesn’t want that storage. It will give it back to InnoDB as soon as it asks. :)

I’ve seen more than once how certain tables are left unwatched, growing until disk space reaches 90% and SMS notifications start beeping all around.

Where is that issue of data storage strategies here? I have tablespaces taking 95%. When they reach 99%, I will kill those database servers and cycle them out to other tasks that deal with smaller datasets. There is nothing bad about data growing, as long as you know the reason. There is nothing wrong with a tablespace growing; that’s its job – to house the data you feed in.

But having a file which consumes some 80-90% of disk space is a performance catastrophe. It means the disk needle needs to move large distances. Overall disk performance runs very low.

[citation needed]. Which filesystem will not place data at those ‘long distances’ for the needle when it gets full? Which filesystem will relocate the data to the start of the disk? This is a most poorly sourced statement, which spreads FUD, even though the same data placement would happen with pretty much every storage strategy. If the start area gets full, you have to place data elsewhere. End of tablespace, end of filesystem, end of disk – it is the same thing.

Also, one can never be sure how the underlying storage will be mapped to the block device. Writing to the end can be the fastest; you never know. Did you try? :)

mysqldump, mk-parallel-dump

Where is mydumper? ;-D

One other nice thing about innodb_file_per_table is that it is possible to monitor table size on the file system level.

Oh sure, that’s just what you need: give people who deal with data at the application level access to the MySQL data directories, and let them run awk, sort, du, etc., even though INFORMATION_SCHEMA gives you that data for any table storage engine you use. Oh, by the way, both I_S and ‘SHOW TABLE STATUS’ will tell you about free space, whereas the filesystem has no idea what data is inside the file. You choose your tools… ;-)
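For illustration, a minimal sketch of getting per-table sizes and free space without ever touching the data directory (the schema name mydb is made up):

mysql -e "SELECT table_name,
                 data_length/1024/1024  AS data_mb,
                 index_length/1024/1024 AS index_mb,
                 data_free/1024/1024    AS free_mb
          FROM information_schema.TABLES
          WHERE table_schema='mydb'
          ORDER BY data_length+index_length DESC"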

Oh, wait, this was given as an argument:

You don’t need access to MySQL, to use SHOW TABLE STATUS or to query the INFORMATION_SCHEMA.

Right…

Though actually I do know where file-per-table solves something. I’ve written already that most Linux file systems serialize O_DIRECT writes. Using separate files means that O_DIRECT writes get placed in separate queues, one per inode. I’d suggest just using a proper file system :-)

Now I should probably tell you why having a single tablespace is better. Besides the obvious (freed space is shared by all tables), I don’t know. Well, there are lots of different issues that have different impact on different operating systems, file systems, storage, etc.

For example, when data lives inside a single tablespace, filesystem metadata operations are nearly nonexistent. That means no writes to the FS journal, less need for multiple points of synchronization, etc.

Multiple file descriptors could also be mentioned, though nowadays that probably has nearly zero significance. I don’t know – statements like that need more than just assumptions. I don’t have good numbers on how much FS fragmentation affects a proper database workload – and results can differ depending on data access patterns, size composition, and of course, the filesystems used.

Still, from the OS administrator’s perspective the data is not fragmented in any way; moving files around will hit bandwidth limits in network, memory and I/O, but will not do any seeks. I have probably dealt with too many systems that had hundreds of millions of files not to love the single-file approach. :)

On throttling

Seems like nowadays InnoDB crash recovery is much funnier. Please allow me to show you something:

while (buf_pool->n_pend_reads >=
        recv_n_pool_free_frames / 2) {
  os_aio_simulated_wake_handler_threads();
  /* 500000 microseconds = half a second */
  os_thread_sleep(500000);
}

Translation – if there’re more than 128 outstanding I/O requests, sleep for half a second. Hehehe. Optimized for floppies!

after the conference, mydumper, parallelism, etc

Though the slides for my MySQL Conference talks were on the O’Reilly website, I placed them on my talks page too, for both the dtrace and the security presentations.

I also gave a lightning talk about mydumper. Since my original announcement, mydumper has changed a bit: it supports writing compressed files, can detect and kill slow queries that could block table flushes, supports regular expressions for table names, and trunk is slowly moving towards understanding that storage engines differ :)

I’ve been using mydumper quite a lot in my deployments (and observing 10x faster dumps). Now, the sad part is how to do faster recovery. It is quite easy to parallelize the data load (apparently, xargs supports running parallel processes):

ls *.sql.gz | xargs -P 16 -I % sh -c 'zcat % | mysql dbname'

Still, that doesn’t scale much – it only doubles the load speed compared to a single-threaded load, even on a quite powerful machine. The problem lives in the log_sys mutex – it is acquired for every InnoDB row operation to grab Logical Sequence Numbers (LSNs), so neither batching nor differentiation strategies really help, and LOAD DATA hits the same problem too. In certain cases I saw quite some spinning on other mutexes, and it seems that InnoDB currently doesn’t scale that well with lots of small row operations. Maybe someone some day will pick this up and fix it – that’s why we go to conferences and share our findings :)
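If you want to see which mutexes are hot while such a load is running, one rough way (a sketch, assuming a server that supports SHOW ENGINE INNODB MUTEX) is to watch the counters during the import:

# which InnoDB mutexes are being contended; log_sys is created in log/log0log.c
mysql -e "SHOW ENGINE INNODB MUTEX" | grep log0log
# the SEMAPHORES section of InnoDB status shows spin/wait totals too
mysql -e "SHOW ENGINE INNODB STATUS\G" | grep -A 20 SEMAPHORES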

flying…

so, I’m cruising now at 38,000 feet, 530 mph, flying directly over I-5 in California. In-flight wireless internet works, the in-flight entertainment has some albums by my favorite artists I hadn’t heard before (Jazzanova!) and movies I hadn’t seen. The guy in front of me is playing Doom (and he knows the secrets). Service is awesome, from check-in to the flight attendants. The airplane interior has a violet glow, and when I boarded, Kruder & Dorfmeister were creating the lounge atmosphere. Ticket cost – $60. Guess which airline it is…

oracle?

oracle!

While everyone is sleeping and preparing for four busy days of the MySQL Conference here in Santa Clara, I started getting SMSes asking if I have already learnt PL/SQL – and here I am, jetlagged, finding out that I work for another company.

If they don’t kill MySQL, InnoDB and MySQL will finally be together.

If they kill MySQL, I’ll have to look for a job. Will anyone use MySQL then, or will I have to fall back to more generic non-MySQL work I’ve been doing for my hobby projects, teeeheeee.

And for now, I see 6AM faces showing up, and greeting Oracle buddies – some jetlagged, some just early birds.

april1

Today I enjoyed YouTube (upside down movies of upside-down actions, brilliant!) and Wikipedia (always real stories) front page appearances. I also enjoyed this chat in #postgresql:

[12:25]  <TommyG> hehe, nice april 1st at http://www.postgresql.fr/
[12:25]  <johto> :D
[12:27]  <davidfetter> heh
[12:29]  <AlexB> Cute.
[12:29]  <domas> omg, so funny!
[12:29]  <CaT[t2]> best april fools so far tbh :)
[12:31]  <mst> we just had somebody on irc.perl.org miss it entirely
[12:31]  <mst> I'm currently trying to sell him a bridge.
[12:31]  <CaT[t2]> go for it. you could gett rich :)

OK, my “so funny” there was well-hidden sarcasm. That site had a big warning, kind of “could not connect to mysql socket, could not find mysql socket, mysql fail”. See, for PG people it is fun to say “mysql can fail!”. Now, what made it much funnier to people outside of the PG community: for the past few hours the http://www.postgresql.fr server did not respond, and later was showing just “Maintenance – Des travaux sont en cours sur la plateforme postgresql.fr” (“Maintenance – work is in progress on the postgresql.fr platform”). Ironic :-)

I did a similar joke, not too funny, last year, probably in a bit more polite fashion, in this email.

on tools and operating systems

Sometimes people ask why I use MacOSX as my main work platform (isn’t that something to do with beliefs?). My answer is “good foundation with a great user interface”. Though that can be read as “he must like the unix kernel and the look&feel!”, it is not exactly that.

What I like is that I can have a good, stable graphical environment with some mandatory tools (yes, I use the OS-supplied browser, mail, etc.), but besides that maintain a bleeding-edge open-source space (provided by MacPorts).

What I also like are the OS-supplied development and performance tools. The bundled DTrace is awesome, yes, but Apple put some special touch on it too, shipping a visualization environment for dtrace probes and other profiling/debugging tools.

Even the web browser (well, I upgraded to Safari 4.0 ;-) provides some impressive debugging and profiling capabilities.

Of course, I end up running a plethora of virtual machines (switching from Parallels to VirtualBox lately), and I even got a KDE/Aqua build (mostly for kcachegrind). I don’t really need Windows apps, I can run ‘Linux’ ones natively on MacOSX, and I can run MacOSX ones on MacOSX.

There’s a full web stack for my MediaWiki work, there are dozens of MySQL builds around, there are photo albums, dtrace tools, World of Warcraft, a bunch of toy projects, a few different office suites, Skype, NetBeans, Eclipse, Xcode, integrated address books and calendars, all major scripting languages, and revision control systems – git, svn, mercurial, bzr, bitkeeper, cvs, etc.

All that on a single machine, running for three years, without too much clutter, and with nearly zero effort to make it all work. That’s what I want from a desktop operating system – extreme productivity without too much tinkering.

And if anyone blames me for using non-open-source software, my reply is very simple – my work output is open-sourced.

I loved Encarta

That happened long before Wikipedia. I loved Encarta. Well, before Encarta, I used to read this thing a lot:

But then Encarta arrived, and I loved it. It fit onto a single CD and didn’t take too much space on disk. I could look up all those articles in it quickly, without having to use expensive dialup. I remember my school buddies coming over and watching those tiny movies in it. I could rip it off for my school work and look incredibly smart (now people rip off Wikipedia and don’t get much credit for that :).

It is dead.

People on the interwebs suggest that employees at Wikipedia and Encyclopaedia Britannica will be throwing parties tonight. Oh well, Wikipedia is already up to date about this. Every encyclopedia out there was an inspiration for Wikipedia, more so than any technology or “web-two-oh” hype. There’s not much joy seeing good things die.

Ten years ago I imagined that once I had my own home, I’d have a place to put a full set of dead-tree Britannica, like my parents had the “Lithuanian soviet encyclopaedia”. Wikipedia changed my plans (now there are two flat panels staring at Wiki, inside and outside), but it seems it is already changing the world around it way more. RIP Encarta. You were inspiring, and really too young to die. If it was us, we didn’t mean it, really. By the way, that content of yours, I’d be glad to see it free. *wink*