more on PMP

Lately we have been especially enjoying the opportunities that Poor Man's Profiler provides us – and the technology has improved a lot too, as there have been a few really useful mutations.

One mutation (hyper-pmp) was Ryan Mack's approach of somewhat more efficient sampling – instead of firing gdb each time, he instructed gdb to get backtraces every time the monitored process gets a signal (SIGUSR2, for example). This allows maintaining a persistent debugger attachment, and then signaling the process periodically to get stacks analyzed.
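
Here is a rough sketch of the idea in Python, mostly to show the moving parts: one long-lived gdb attachment with a command script that treats SIGUSR2 as "take a sample now", plus a loop that sends the signal. The pid, sample count and output path are made up for the illustration – this is not the actual hpmp code:

#!/usr/bin/env python
"""Sketch of the hyper-pmp idea: keep one gdb attached and use SIGUSR2 as a
"sample the stacks now" trigger, instead of forking a fresh gdb per sample."""

import os
import signal
import subprocess
import tempfile
import time

PID = int(os.environ.get("TARGET_PID", "12345"))   # pid of monitored mysqld (example)
SAMPLES = 10                                       # how many stack samples to take
INTERVAL = 1.0                                     # seconds between samples

GDB_SCRIPT = """\
set pagination 0
handle SIGUSR2 stop noprint nopass
while 1
  continue
  thread apply all bt
end
"""

with tempfile.NamedTemporaryFile("w", suffix=".gdb", delete=False) as f:
    f.write(GDB_SCRIPT)
    script = f.name

# One persistent debugger attachment; sampled stacks end up in stacks.txt.
gdb = subprocess.Popen(
    ["gdb", "-batch", "-x", script, "-p", str(PID)],
    stdout=open("stacks.txt", "w"),
    stderr=subprocess.DEVNULL,
)
time.sleep(2)  # crude: give gdb time to attach and reach the first 'continue'

for _ in range(SAMPLES):
    os.kill(PID, signal.SIGUSR2)  # the attached gdb stops, dumps stacks, continues
    time.sleep(INTERVAL)

gdb.terminate()  # in real use you would detach cleanly instead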

The other mutation was auto-pmp – high-frequency polling of process state (e.g. how many threads are running), and when a certain threshold is exceeded, obtaining stacks for further analysis (this combines really well with the hpmp approach – one process is the stack reader, and the other signals on thresholds). My major problem with this approach was that the polling methods we chose were biased towards showing me the end of overload events (because the poll wouldn't return process state while internal process locks were held).
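
The trigger side can be as small as a polling loop that watches Threads_running and pokes the stack reader when things pile up. A sketch – the threshold, credentials, pid and the use of pymysql are all assumptions for illustration:

#!/usr/bin/env python
"""Sketch of the auto-pmp trigger: poll server state at high frequency and,
when a threshold is crossed, tell the stack reader (e.g. the attached
hyper-pmp gdb above) to grab backtraces."""

import os
import signal
import time

import pymysql  # any MySQL client library would do

MYSQLD_PID = 12345      # pid of the monitored mysqld (example)
THRESHOLD = 32          # "too many threads running" (example value)
POLL_INTERVAL = 0.05    # 20 polls per second

conn = pymysql.connect(host="127.0.0.1", user="root", password="")

def threads_running():
    with conn.cursor() as cur:
        cur.execute("SHOW GLOBAL STATUS LIKE 'Threads_running'")
        return int(cur.fetchone()[1])

while True:
    # Note: polling through SQL is exactly the method that gets blocked by
    # internal locking during the worst moments, hence the bias mentioned above.
    if threads_running() > THRESHOLD:
        os.kill(MYSQLD_PID, signal.SIGUSR2)  # ask the attached gdb for a sample
        time.sleep(1)                        # back off so we don't flood the reader
    time.sleep(POLL_INTERVAL)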

At one point in time I had an epiphany that was quickly melted by reality – in theory we could use gdb watchpoints to replace my external process polling. Watchpoints allow breaking into a process when a variable inside the program changes (and conditions can be applied), so essentially we would be able to instruct gdb to get stacks exactly at the moment stalls and spikes happen. Unfortunately, even though that worked fine in single-threaded or lightly loaded environments, the monitored process crashed horribly under more realistic workloads – we have yet to figure out whether it is a fundamental issue with the approach or a bug that may have been fixed in later gdb versions.
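
For the record, the watchpoint experiment looked roughly like this (the watched variable is whatever global in your build reflects running threads – srv_conc_n_threads is just picked for illustration – and as noted above, this setup fell over under real load):

# Same driver pattern as the hyper-pmp sketch above, only the gdb script differs.
WATCHPOINT_SCRIPT = """\
set pagination 0
watch srv_conc_n_threads
condition $bpnum srv_conc_n_threads > 16
commands
  silent
  thread apply all bt
  continue
end
continue
"""
# Run it with: gdb -batch -x <script file> -p <mysqld pid>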

Of course, there's a workaround that we're considering for high-performance system analysis – simply instrumenting the process to fire a signal or do a conditional jump whenever there's an overload condition. Essentially that would be implementing in-process watchpoint-to-breakpoint translation, giving us just-in-time analytics – we'd see pretty much every situation where running threads pile up (unless there's a bottleneck that simply doesn't allow the workload to arrive :)

On-demand PMP allowed us to uncover various issues inside MySQL that are overlooked in most benchmarking as insignificant, but are critical for us in providing better quality of service for each query, not just the 99th percentile (I wrote about that recently). We keep thinking about how to provide instrumentation for some of the views we get into MySQL (e.g. the ability to export a pthread lock graph without using external tools), as well as better visibility of I/O blocking…

But for now we have what we have, poor man’s profiler :-)

I need a new keyboard

Thanks to Jayant, I had to clean coffee off my screen, but the keyboard is a bit more tricky. This is Postgres advocacy at its best:

The base timings are almost 5 times more in mysql as compared to pgsql.

Then in the comments, Harrison points out:

From the numbers, it is hard to believe that PostgreSQL is really fsync’ing properly there. I doubt a laptop HDD can do 1000 writes per second. Even assuming we have perfect group commit going on, with two threads it is still 500 per second for the first test.

InnoDB’s numbers are about right on for writes per second for a cheap commodity hdd.

Jayant’s answer is:

You may be right. But postgres is a more advanced and reliable database as compared to mysql.

on performance stalls

We quite often say that benchmark performance is usually different from real-world performance, so performance engineering usually has to cover both – benchmarks help understand sustained performance bottlenecks, while real-world analysis usually concentrates on something that would be considered 'exceptional' and unimportant in benchmarks – stalls of various kinds. They are extremely important, as the state when our performance is lowest is the level of performance we provide to our platform users.

On a machine doing 5000 qps, stalling for 100 ms means that 500 queries were not served as fast as they could have been, or even hit application timeouts or exceptional MySQL conditions (like the 1023-transaction limit). Of course, stalling for a second means 5000 queries were not served in time…

We have multiple methods to approach this. One is our 'dogpiled' framework – an agent doing status polling every second and reporting information about I/O state, MySQL/InnoDB statuses, processlists, etc. – so we see the scope of stalls in our environment. We try to keep the threshold between complete information overload and something that reveals problems – it is always a balancing act, especially with the great work done by the engineering team :)
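
The agent itself is conceptually simple; here is a stripped-down sketch (what exactly gets collected and where it gets shipped is an internal detail, so the specifics below – pymysql, the output file, /proc/diskstats – are just illustrative):

#!/usr/bin/env python
"""Sketch of a dogpile-style agent: once a second, snapshot server and OS
state so that stalls can be examined after the fact."""

import time

import pymysql

conn = pymysql.connect(host="127.0.0.1", user="root", password="")

def snapshot():
    data = {"ts": time.time()}
    with conn.cursor() as cur:
        cur.execute("SHOW GLOBAL STATUS")
        data["status"] = dict(cur.fetchall())
        cur.execute("SHOW ENGINE INNODB STATUS")
        data["innodb"] = cur.fetchone()[2]         # the Status column
        cur.execute("SHOW FULL PROCESSLIST")
        data["processlist"] = cur.fetchall()
    with open("/proc/diskstats") as f:             # crude view of I/O state
        data["diskstats"] = f.read()
    return data

while True:
    snap = snapshot()
    # In a real deployment this would be shipped to central storage;
    # here we just append it locally.
    with open("dogpile.log", "a") as out:
        out.write(repr(snap) + "\n")
    time.sleep(1)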

The other approach, usually led to by dogpile information, is auto-PMP – high-frequency status polling combined with gdb invocations, which allows us to jump into the process whenever we notice something weird is going on. We have some extensions to how we use PMP – but that's worth another post.

The issues we find that harm us most in production environments are the ones quite often dismissed as either “this never happens”, “get better hardware” or “your application is wrong”. Unfortunately, they do happen, the thousands of machines we have aren't free, and our application demands are our application demands :)

A few examples:

  • TRUNCATE stalls the server (oh well, DROP TABLE too) – in this case, truncating a table grabs the dictionary mutex, another transaction blocks while holding LOCK_open, and everything else stops. Though truncating is supposed to be a fast operation, it has to unlink (delete) a file, and with large files such an operation isn't really instant on any filesystem. Even if one deletes all the data before truncating, the file is still on the filesystem.
  • Extending data files stalls the server – when a data file is being extended, a global mutex is held, which blocks all I/O (with limited concurrency that is a full server stall). Somewhat more impressive with file-per-table. This is the major reason for mini-stalls at the moment – on machines that grow at a gigabytes-a-day rate this is hit quite often.
  • Updating table statistics stalls the server – we hit this with high-performance task-tracking machines; row churn there is quite amazing, and dictionary statistics are reread more often than one would expect. Updating statistics means locking the table while doing random reads from disk. Once the major workload is hitting that table, it quickly escalates to a full server stall.
  • Fuzzy checkpoint stalls the server – this is one of the biggest issues outstanding in stock MySQL. Though one would expect a “fuzzy checkpoint” that uses async background threads to be non-blocking, in fact all writes during it will stall, taking all concurrency slots and leading to a server stall. Mark's fix was simply doing this work in a background thread.
  • (no bug filed on this yet) – Purge stalls the server – purge holds the dictionary lock while doing random reads from disk, and the table stall escalates to a server stall.

There are more issues (mostly related to heavier in-memory activities of the server), but these are the most obvious ones – where a single I/O request escalates to a table or instance lockup during which no other work is done. Our machines have multiple disks and multiple CPUs and can support many SQL queries being executed at once, so any of these lockups effectively limits our available performance or damages the quality of service we can provide.

On the upside, my colleagues are absolutely amazing and I'm sure that we will have all these issues fixed in our deployment in the near future, and everyone will be able to pick the fixes up via the mysqlatfacebook branch.

Postgres kernel regressions

While Mark was looking at MongoDB, I was playing around comparing various aspects of MySQL and Postgres performance. Certain PG performance numbers I saw (40k qps vs 110k qps from MySQL) made me really upset, so I ended up discussing them with people at #postgresql – and started comparing various versions/configurations/machines/etc.

Apparently the 2.6.32 kernel, which ships in Ubuntu 10.04 LTS (Lucid Lynx) and is also the basis for the future RHEL6 kernel, has nearly 20% performance degradation for PG (though not for MySQL, phew) – and apparently it was news to their community (I have started an email thread, we'll see where it goes).

Warning: some quotes have been deliberately taken out of context for more snark.

While I was doing my research, of course I could observe plenty of gems of wisdom:

<davidfetter> domas, well, as tom lane once said, there’s no limit to how quickly you can get an answer if it doesn’t have to be correct

And an hour later:

<rawtaz> are you one of those mysql lovers?

<dim> nobody wants performance at the price of correctness, right?
<dim> as Tom said, I can be as quick as you want as providing an answer if you don’t want it to be correct (hint: 42.)

Apparently chasing these performance variations is…

<Snow-Man> tbh, I’m of the opinion that we’re losing the forest for the trees.

Can’t say that all of them were encouraging:

<sterncera> If you can’t be bothered to subscribe to -hackers, maybe you shouldn’t be mailing it

Special thanks go to mastermind, who not only didn't lose his temper, but stayed focused on the topic and resisted my trolling :-) I really want PG to be a great-performing database (and I've seen some great numbers from it), but the amazing amount of ignorance and animosity they have towards MySQL makes things somewhat complicated there :(

P.S. Now all the benchmarks I did are tainted and will need a full rerun…

pmysql (multi-server mysql client)

To have more visibility into our systems I needed a method to quickly query large sets of MySQL servers. Though there are a few methods like 'xargs -P' and dsh, they are overkill for an SQL-only job, and simply not fast enough.

So I hacked up pmysql (274 lines of C) – a tool that runs queries on multiple servers in parallel. It can run an SQL query on 1000 servers in under 0.1s, and scales well to larger sets of servers.
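
pmysql itself is C, but the concept fits in a few lines of (much slower) Python, shown here only to illustrate what it does: fan one statement out to a list of servers and collect the results. The host list, credentials and pymysql are assumptions for the illustration:

#!/usr/bin/env python
"""Illustration of the pmysql concept: run one statement against many
servers in parallel.  Not the actual tool."""

import sys
from concurrent.futures import ThreadPoolExecutor

import pymysql

QUERY = sys.argv[1] if len(sys.argv) > 1 else "SELECT @@hostname, @@version"
HOSTS = [line.strip() for line in open("servers.txt")]   # one host per line

def run(host):
    try:
        conn = pymysql.connect(host=host, user="monitor", password="",
                               connect_timeout=1)
        with conn.cursor() as cur:
            cur.execute(QUERY)
            return host, cur.fetchall()
    except Exception as exc:
        return host, exc            # report unreachable servers too

with ThreadPoolExecutor(max_workers=200) as pool:
    for host, result in pool.map(run, HOSTS):
        print(host, result)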

Use cases vary – from configuration/schema checks (I really enjoyed using information_schema) to ‘global processlist’ analysis.
The ability to crunch hundreds of megabytes of processlist information in a few seconds is priceless, and so is schema verification across millions of tables.

It is fast enough to use as the backend for some real-time health-check CGI scripts :-)

To get the source:
bzr co lp:mysqlatfacebook/tools/pmysql

P.S. Oh, I know that nearly nobody needs it, but that's not a reason for keeping it closed :)

on primary keys

5.1.46 has this change:

Performance: While looking for the shortest index for a covering index scan, the optimizer did not consider the full row length for a clustered primary key, as in InnoDB. Secondary covering indexes will now be preferred, making full table scans less likely.

In other words, if you have a covering index on * (which is quite common on m:n mapping tables), the optimizer will now use it for scans rather than the PK. As I have spent my time getting indexing right – PKs based on the primary access pattern and secondary keys on the secondary access pattern – I hereby do not welcome a change that suddenly reverses this behavior in a late GA version.

Not good when mysqldump queries end up taking 6 days instead of the previous half an hour. Not good at all.

Update: Oh, MariaDB has this reverted; from their changelog:

mybug:39653: reverted as invalid

If only upstream MySQL would take note ;-)

MySQL versions at Wikipedia

More information about how we handle database stuff can be found in some of my talks.

Lately I hear people questioning the database software choices we made at Wikipedia, and I'd like to point out that…

Wikipedia database infrastructure needs are remarkably boring.

We have worked a lot on having the majority of the site workload handled by edge HTTP caches, and some of the most database-intensive code (our parsing pipeline) is well absorbed by just 160G of memcached arena residing on our web servers.

Also, the major issue with our databases is finding the right balance between storage space (even though text is stored in an ‘external store’, which is just a set of machines with lots of large slow disks) – we store information about every revision, every link, every edit – and the I/O performance per dollar available for that kind of space.

As the platform of choice we use X4240s (I advertised them before) – 16 SAS disks in a compact 2U package. There's a relatively small hot area (we have a 1:10 RAM/DB ratio), and quite a long tail of various stuff we have to serve.

The whole database is just six shards, each getting up to 20k read queries a second (a single server can handle that) and a few hundred writes (binlog is under 100k/s – nothing too fancy). We have overprovisioned some hardware for slightly higher availability – we don't always have on-site resources available – and the slightly humorous logic is:

we need four servers: in case one goes down, another will be accidentally brought down by the person fixing it, then you've got one to use as a source of recovery and the remaining one to run the site.

The application doesn't have too many really expensive queries, and those aren't the biggest share of our workload. The database by itself is a minor part of where application code spends time (looking at profiling now – only 6% of busy application time is inside the database, memcached is even less, Lucene is way up at 11%). This is remarkably good shape to be in, and much better than what we used to have when we dealt with insane (“explosive”) growth. I am sure pretty much anything deployed now (even sqlite!) would work just fine, but what we use was created during the bad times.

Bad times didn't mean that everything was absolutely overloaded; it was more that it could get overloaded very soon if we didn't take appropriate measures, and our fundraisers were much tinier back then. We were using 6-disk RAID-0 boxes to be able to sustain good performance and have the required disk space at the same time (instead of, of course, going the expensive SAN route).

While mainstream MySQL development, with its leadership back then, was headed towards implementing all sorts of features that didn't mean anything to our environment (and, from various discussions I had with lots of people, to many many other web environments):

  • utf8 support that didn’t support Unicode
  • Prepared Statements that don’t really make much sense in PHP/C environments
  • unoptimized subqueries, which allow people to write shitty-performing queries
  • later in 5.0 – views, stored routines, triggers
  • etc…

… nobody was really looking at MySQL performance at that time, and it could have insane performance regressions (“nobody runs these things anyway”, like ‘SHOW STATUS’) and a forest full of low-hanging fruit.
From an operations perspective it wasn't perfect either – replication didn't survive crashes, crash recovery took forever, etc.

That's when Google released their set of patches for 4.0, which immediately provided an incredible amount of fixes (that's what I wrote about it back then). To highlight some of the introduced changes:

  • Crash-safe replication (replication position is stored inside InnoDB along with transaction state) – this allowed running slaves with InnoDB log flushing turned off while still having consistent recovery; vanilla MySQL doesn't have that yet, Percona added this to XtraDB at some point.
  • Guaranteed InnoDB concurrency slot for the replication thread – however loaded the server is, replication does not get queued outside and can proceed. This allowed us to point way more load at MySQL. This is now part of 5.1.
  • Multiple read-ahead and write-behind threads – again, this allowed bypassing certain bottlenecks, such as read-ahead slots (though apparently it is wiser just to turn off read-ahead entirely) – now part of the InnoDB Plugin.
  • Multiple reserved SUPER connections – during overloads systems were way more manageable

Running these changes live has been especially successful (and that was way before Mark/Google released their 5.0 patch set, which was then taken in parts by OurDelta/Percona/etc.) – and I spent quite some time trying to evangelize these changes to MySQL developers (as I would have loved to see them deployed at our customers, way less work then!). Unfortunately, nobody cared, so running reliable and fast replication environments with mainline MySQL didn't happen (now one has to use either XtraDB or the FB build).

So I did some merging work, added a few other small fixes, and ended up with our 4.0.40 build (also known as four-oh-forever), which still runs half of the shards today. It has sufficient in-memory performance for us, it can utilize our disk capacity fully, and it doesn't have a crash history (I used to tell about two 4.0 servers, both whitebox RAID-0 machines, keeping an unbroken replication session for two years). By today's standards it already misses a few things (I mostly miss fast crash recovery, after the last full power outage in a datacenter ;-) – and developers would love to abuse SQL features (hehe, recently a read-only subquery locked all rows because of a bug :-) I'm way more conservative when it comes to using certain features live, as when working at MySQL Support I could see all the ways those features break for people, and we used to joke (this one was about RBR :):

Which is the stable build for feature X? Next one!

Anyway, even knowing that stuff breaks in one way or another, I was running a 5.1 upgrade project, mostly because of peer pressure (“4.0 haha!”, even though that 4.0 is more modern from an operations standpoint).

As MediaWiki is an open-source project used by many, we already engineer it for a wide range of databases – we support MySQL 4.0, we support MySQL 6.0-whatever-is-in-the-future, and there's some support for other vendors' DBMSes (at various stages – PG, Oracle, MS SQL, etc.) – so we can be sure that it works relatively fine on newer versions.

Upgrade in short:

  • Dump schema/data
  • Load schema on 5.1 instance
  • Adjust the schema while we are at it – set all varchar to varbinary to maintain 4.0 behavior
  • Load data on 5.1 instance
  • Fix MySQL to replicate from 4.0 (stupid breakage for nothing)
  • Switch master to 5.1 instance

We had some 5.0 and 5.1 replicas running for a while to detect any issues, and as there weren’t too many, the switch could be nearly immediate (English Wikipedia was converted 4.0->5.1 over a weekend).

I had made an earlier engineering effort to merge the Google 5.0 patches into a tree later than 5.0.37, but eventually Mark left Google for Facebook and the “Google patch” was abandoned – long live the Facebook patch! :)

At first the FB-internal efforts were about getting the 5.0 environment working properly, so 5.1 development was a bit on hold. At that time I cherry-picked some of Percona's patch work (mostly to get transactional replication for 5.1, as well as fast crash recovery) and started deploying this new branch. Of course, once the Facebook development focus switched to 5.1, maintaining a separate branch became less needed – my plan for the future is getting the FB build deployed across all shards.

The beauty of the FB build is that the development team is remarkably close to operations (and the operations team is close to development), and there is lots of focus on making it do the right stuff (make sure you follow the mysql@facebook page). The visibility into systems (PMP!) we have at Facebook can be transformed into code fixes nearly instantly, especially when compared with development cycles outside. I'm sure some of those changes will trickle into other trees eventually, but we have them in FB builds already, and they are the state of the art of MySQL performance/operations engineering, while maintaining great code quality.

Yes, at Wikipedia we run a mix of really fresh and quite old/frozen software (there will be unification, of course), but… it runs fine. It isn't as fascinating as it was years ago, but it allows not paying any attention for years. Which is good, right? Oh, and of course, there's full on-disk data compatibility with standard InnoDB builds, in case anyone really wants to roll back or switch to the next best fork.

On MySQL replication, again…

There are a few things one is supposed to know about MySQL replication in production, as the manual doesn't always discuss things openly.

This is a small set of rules and advice I compiled (some apply to statement-based replication only, but row-based replication may benefit from one or two as well):

  • Don't use MyISAM. Or in fact, don't use any non-transactional engine if you care about your data. On either side – master, slave, or both – using non-transactional engines will cause data drift, as partially executed statements on the master would be fully executed on the slave, or will simply stop replication. Of course, every crash has the capacity to desync your tables from each other, and there are absolutely no guarantees.
    This “don't” can easily be transformed into a “do” – just use InnoDB. Bonus point – one doesn't need to take down the server to clone a slave from a master :)
  • Don't use temporary tables. The MySQL manual is very funny about temporary tables in replication – it says “do not shut down the slave while it has temporary tables open.” That of course means you're not supposed to crash either – and the more slaves there are, the more of them will crash for various reasons (e.g. solar flares).
    The operational overhead temporary tables add is huge – even though it may not show up in a benchmark.
  • Prefer simple, idempotent statements. If one can replay the same statements multiple times without causing database drift, it doesn't matter much if the replication position is somewhat outdated. Updating rows by PK to fixed values and avoiding multi-table updates/deletes allow recovering from a crash much faster.
  • Set sync_binlog=1. This will introduce the biggest bottleneck for transactions, but losing 30s of data may be worse (as it will force a full slave resync in most cases). On really busy servers one can go for higher values (e.g. sync every 20 transactions), but 0 is asking for disaster.
  • Avoid long-running updates. Though all a long statement causes on the master is a slightly longer locking window and some performance pressure, once it gets replicated to the slave, all other updates will have to wait for the giant one to finish, in many cases rendering the slave useless.
    If something big has to be replicated, either split it into smaller chunks or run it directly against the slaves (with binary logging on the master disabled for it).
    Splitting into smaller chunks allows wait-for-slave logic to be implemented, thus avoiding any major impact on production environments (there is a rough sketch of this chunk-and-wait approach after the list).
  • Don't use replicate-do-db. Or replicate-ignore-db. They both rely on the current database context, and statements like ‘INSERT INTO database.table’ will fail to replicate properly.
    If you need filtering, use replicate-wild-do-table=db.% – but even then, be careful with cross-database statements that involve tables from multiple databases – they may be filtered out…
  • Note the multiversioning. Some statements may become replication performance hogs because of long-running transactions (backups? reporting? ETL?) running on slaves – a statement may not need to rescan all the row versions on the master, but they'd still be there on a slave. Such statements may need to be rewritten to avoid scanning gaps with too many invisible rows, or the long transactions have to be split up.
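
The chunk-and-wait idea from the long-running-updates item above, as a rough sketch (the table, lag threshold, host names and pymysql usage are all made up for the example):

#!/usr/bin/env python
"""Sketch: purge old rows in small chunks and wait for slaves between chunks,
so neither the master's locking window nor slave lag gets out of hand."""

import time

import pymysql

master = pymysql.connect(host="master", user="app", password="", autocommit=True)
slaves = [pymysql.connect(host=h, user="monitor", password="")
          for h in ("slave1", "slave2")]

CHUNK = 10000     # rows per statement
MAX_LAG = 5       # seconds of replication delay we tolerate

def slave_lag(conn):
    with conn.cursor(pymysql.cursors.DictCursor) as cur:
        cur.execute("SHOW SLAVE STATUS")
        row = cur.fetchone()
        return (row or {}).get("Seconds_Behind_Master") or 0

def wait_for_slaves():
    while any(slave_lag(s) > MAX_LAG for s in slaves):
        time.sleep(1)

while True:
    with master.cursor() as cur:
        # ORDER BY the PK keeps the statement deterministic for statement-based
        # replication; each chunk is a short, replayable statement on the slaves.
        deleted = cur.execute(
            "DELETE FROM events WHERE created < NOW() - INTERVAL 90 DAY "
            "ORDER BY id LIMIT %s", (CHUNK,))
    if deleted == 0:
        break
    wait_for_slaves()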

Though probably the best advice I can give now is “call your MySQL vendor and ask for transactional replication”. Server, rack, and datacenter crashes will not cause excessive work fixing replication – it will always be consistent. One can even disable log syncing to disk then \o/

On hot backups

A few years ago I was looking at the crash recovery code and realized that InnoDB had removed all the comments related to replay of the transaction log [this assumption is debunked by Heikki in the comments section]. Judging by the high quality of comments in the remaining codebase, I realized that it was all done to obscure any efforts to build another InnoDB hot backup solution – a competitor to the first Innobase standalone offering.

I was enjoying the moment when Percona launched their own implementation of the tool. Since its inception, it has become more and more robust and feature-rich.

We have used xtrabackup in our environment a lot – just… not for backups. The major use case right now is cloning server instances – either for building new replicas, shadow servers, or replacing masters – and it allows us to do that without interrupting any operation.

Now what makes the whole situation way more interesting is that Oracle/MySQL announced at the conference that InnoDB Hot Backup will be part of the Enterprise offering – which makes it way more available to the MySQL customer community than when it required quite expensive per-server licenses.

Of course, open-source xtrabackup is way easier to tweak for our environment (O_DIRECT support, posix_fadvise(), flashcache hints, etc. were all added after release) – and it will be interesting to see how the Oracle-provided tool evolves. Right now xtrabackup already supports streaming operation, which makes it much more usable in large-database-on-small-hardware (read: sharded) environments, and provides flexibility to users. Oracle of course owns much more in-house expertise on both current internals operation and all the future changes that will happen, so we may see leadership in the field coming from their side too.

One of our reasons for not using a physical backup solution is simply that it is not space-efficient. There may be multiple ways to approach that – from robust incremental backups to partial backups that wouldn't include secondary indexes or would take only a limited set of tables.

Some changes may actually require extended MySQL/InnoDB support – on multi-terabyte instances one may not want to rescan the dataset for each incremental backup run, as the resulting diff would be just a hundred gigabytes or less. This would require support for an always-running backup agent that aggregates information about block changes and allows for more efficient backup operation.

Discarding secondary indexes is a much more attractive option with the 5.1/Plugin ability to do fast independent index builds, which don't require one-row-at-a-time B-tree builds for all indexes at once (builds which, of course, hit severe memory penalties on large tables or in parallel workloads).

Having always-ready backups is important not only for the ability to rebuild a box (we have replicas for machine failures) – the real value is when backups can be used for massive-scale extraction of subsets of table rows across thousands of machines. For that, one cannot just ship full instance data around from backup storage – so recovery tools will have to be very flexible.

Probably the core feature for that kind of operation would be the ability to import tables directly from a hot backup into online instances – unfortunately, restarting a database instance is still costly (though we're doing quite some work in that direction too).

I'm extremely happy that InnoDB has started fixing operational issues like crash recovery performance, but there's still a wide area of problems not touched properly yet – especially in the disaster recovery space – and I'm eager to see developments in this field, both from Oracle and from community members.