The saddest bug of them all (SQL is dead?)

From time to time I will observe servers wasting lots of CPU when doing batch row operations. In perf top it will look like this:

8.24% mysqld [.] Arg_comparator::compare_int_unsigned()
7.17% mysqld [.] Item_cond_and::val_int()
4.37% mysqld [.] Item_field::val_int()
4.37% mysqld [.] Item_cond_or::val_int()
2.90% mysqld [.] MYSQLparse(void*)
2.64% mysqld [.] Item::val_bool()

Essentially if you construct queries like (a=1 AND b=2) OR (a=3 AND b=4) ..., at large enough batch size evaluating the WHERE will become far more expensive than anything else (yes, more expensive than decompressing rows or doing all the InnoDB magic and what not).

MySQL has awesome syntax that makes certain batch lookups much faster: WHERE a IN (1,2,3). It constructs a tree that then each row can be compared against and one does not have to iterate through lists of predicates to check whether the row returned by batch index lookups needs to be filtered away. One would think that the composite syntax that server has (WHERE (a,b) IN ((1,2),(3,4))) may help here.

Unfortunately, if you run a query using this syntax, a full table scan will be done, even if there’s a great index for that – even the syntax exists, none of sane ways to execute apply. Of course, there have to be bugs about that:

Compound (cola,colb) IN ((x,y)) statements do not use indexes. (filed by yours truly).

Well, it was closed by optimization lead (now at Maria), who decided, that this bug is a duplicate of another problem, (a,b,c)=(A,B,C). I probably should’ve contested there, but there might have been other people who’d do that? Sure there were, in 2007 Mark filed another bug:

Optimizer does not do index range scans using predicates in IN lists

If one would look more closely, there’s this small exchange in 2010:

[7 Jan 2010 21:35] Mark Callaghan
This is the same as, 
but the author of this feature request is much more convincing.
[7 Jan 2010 21:36] Domas Mituzas

He wrote a Facebook note back then too.

Apparently Mark’s bug became the master bug for this problem, even if it arrived bit later, but, I can’t argue, he is far more convincing :-)
There’s a bit of odd banter there, Harrison (ohi!) points out that it is documented (albeit incorrectly, Sergey Petrunia – ohi – notices) in the manual. No, it isn’t this exact issue that is noted in the manual.
Duplicates are pouring in:
Multi column IN does not use index, says Peter Zaitsev (ohi!) in 2008
MySQL not using an index when using tuples in IN clause, says Shlomo (ohi!) in 2010
There are few more.

Sinisa (ohi!) pronounces:

In my humble opinion, this is still not a bug, but feature request. This feature request, however, should be very high on TODO list, as it would make a feature that is sorely missing, particularly since tuple equality expressions have been already optimized.

Fast forward into 2013, even in 5.6 we still have to use expressions with exponential cost, (a,b) IN (( is causing full table scans and people are switching to NoSQLs, NewSQLs and OtherSQLs, because there’s no way to efficiently express batch predicates in MySQL (besides other reasons). Back in the day MySQL was ahead of everyone else with ease of use for simple expressions, but there have been nearly no changes since back in the day to make developer lives easier. Did you know that it may actually be faster to create a temporary table with all the values, then join against it, than actually to use any of existing SQL methods?

Where is MySQL innovation nowadays? In this exact case I guess you’re supposed to be memcached API (forget authorization, common interfaces, transactions and what not), solving this in SQL is too complicated.

on wikipedia and mariadb

There’s some media coverage about Wikipedia switching to MariaDB, I just wanted to point out that performance figures cited are somewhat incorrect and don’t attribute gains to correct authors.

Proper performance evaluation should include not just MariaDB 5.5 but Oracle’s MySQL 5.5 version too, because thats where most of performance development happened (multiple buffer pools, rollback segments, change buffering et al).

5.5 is faster for some workloads, 5.1-fb can outperform 5.5 in other workloads (ones with lots of IO), it is good to know that there’s beneficial impact from upgrading (though I’d wait for 5.6), but it is important to state that it is an effort from Oracle as well, not just MariaDB developers.

P.S. As far as I understand, decision to switch is political, and with 5.6 momentum right now it may not be the best one, 5.6 is going to rock :-)

on seconds_behind_master sleuthing

With large enough infrastructure it gets a bit more and more complicated to detect whether an incident or a problem is real systems problem or a monitoring glitch. This is a story of one such investigation.

With a sufficiently large set of machines, there’re multiple graphs/sets of data to look at to understand what is going on:

  • Group average – the easiest to look at for long term trends or system-wide problems
  • Top-N chart – our favorite for dive-ins, looking only at top offender lines (quite often part of the offender lines end up matching tier baseline)
  • Percentiles chart – knowing best performers, medians, 99% or 99.9% behavior is incredibly useful to understand medium-term dynamics and can give lots of insight where to look for problems. Is way more complicated to collect though.

Another very important property of data one looks at is collection frequency – though thousands of servers aggregated together into single chart will already provide smoothened view of what is going on, individual charts will always look way choppier, without any intermediate dots to show progression of metric. You will see that there’re lots of threads running, but it won’t explain how that built up – only longer term incidents averaged out over lots of machines will show that.

In our deployment, replication lag is very important metric of how well we’re doing – with faker in production it already means that machine (or machines) are in trouble, and usually that is either hardware trouble, or way too much workload from applications, and both problems need immediate actions.

It is even more interesting on pure flash machines – those are not supposed to lag at all, once they’re in production – unless something is really really wrong.

This is why I once I saw a group of flash-only machines reporting higher replication lag values, jumped to investigate. There could’ve been various reasons – e.g. long transactions increasing cost of write queries, and what not.

Looking at charts I saw that there’s some lag, that some machines spike to 1000+s lag values, and percentiles showed that only small amount of machines were hitting this. It was relatively interesting, that each of those machines would hit a lag spike and then behave normally afterwards again.

I tried to find a machine that was in high lag condition and check what is going on there. Listing machines with their lag values was relatively easy:

pmysql "SHOW SLAVE STATUS" | cut -f 1,34 | sort -nr -k 2,2 | head

What was odd, that each time I was running this, machines on the list were different. First thought in my mind was that some old transactions with old timestamps were showing up (that is usually a reason for odd Seconds_behind_master spikes), figuring out statements would’ve been piece of cake:

pmysql "SELECT * FROM information_schema.processlist where 
    user='system user' and info is not null and time>100"

Unfortunately, even though “SHOW SLAVE STATUS” was always showing me machines with 100s+ lag, none of that came via processlist query.

The aha! moment came when I tried to look at binlog/relaylog values – every “lagging” machine was at one of two relay log positions – 298 or 308. So, I looked at timestamps of events in relay log positions:

#120921 13:21:42 server id 22222  end_log_pos 106 
    Start: binlog v 4, server v 5.1.53-log created 120921 13:21:42
#120921 13:21:42 server id 4294967295  end_log_pos 151 
    Rotate to binary-logs.017880  pos: 81259576
#691231 16:00:00 server id 11111  end_log_pos 0 
    Rotate to binary-logs.017880  pos: 81259576
#120921 13:17:39 server id 11111  end_log_pos 0 
    Start: binlog v 4, server v 5.1.53-log created 120921 13:17:39
#120921 13:21:42 server id 11111  end_log_pos 81259647 
    Query ...

First record was relay log header (format description event), next two were rotation events with bogus (-1 and 0) values for timestamp and server_id – and binary log positions. Then we suddenly have master’s binary log header which is taken from the beginning of the binary log, with the binlog creation timestamp in it, then we have a query from the middle of binary log, with proper timestamp on it.

For the whole duration of SQL thread acting on the first query server thought it was executing events from 13:17:29 and not 13:21:42. Seconds_behind_master was telling that the gross offense of 253s replication lag was being committed, and database users were suffering.

Fortunately, it was just a MySQL status variable glitch and data integrity was completely fine, but this reminds us how important is quality monitoring and internal metrics for large environments – and how easy it is to draw wrong conclusions otherwise.

P.S. Of course, a bug was filed :)

replication prefetching revisited

Been a while since I wrote about replication work we did. Fake changes based approach was huge success, and now our prefetching has lots of coverage, where standard SELECTs cannot reach. We’re running our systems at replication pressure, where not running faker immediately results in replication lag. On busier machines Python implementation started using quite some CPU and ended up occasionally hitting GIL issues.

So, here’s the straightforward rewrite of fake changes replication prefetcher, faker. It can run 100k statements a sec, if needed. To get it, you can run:

bzr co lp:mysqlatfacebook/tools; cd faker

MySQL is bazillion times faster than MemSQL

I don’t like stupid benchmarks, as they waste my time. I don’t like stupid marketing, as it wastes my time too. Sometimes I succumb to those things, and now in return I want to waste your time a bit.

So, this MemSQL thing, written by some smart guys has been making rounds in press and technical community. Centerpiece of all the communication was:

“MemSQL, the database they have developed over the past year, is thirty times faster than conventional disk-based databases”

Though I usually understand that those claims don’t make any sense, I was wondering what did they do wrong. Apparently they got MySQL with default settings running and MemSQL with default settings running, then compared the two. They say it is a good benchmark, as it compares what users get just by installing standard packages.

That is already cheating, because systems are forced to work in completely different profiles. For example, memory used for data buffering, is essentially unbound on MemSQL, yet InnoDB has it limited to 128MB on 5.5 (and that is 16x the default setting used on 5.1).

For write benchmarks MemSQL will write out snapshot at 2G log mark, InnoDB is configured with 10MB transaction log, so it will start checkpointing pretty much immediately.

Still, for any benchmark, most important thing is durability. See, MemSQL claims that they support ACID, and durability is core part of that. MySQL’s InnoDB (I don’t assume other engines are usable) is durable by default, making sure that if it says that transaction returned ‘ok’, it is on disk and will be there after a crash. MemSQL is also “durable by default”, which means that it will write a transaction log, but it doesn’t really mean that it will hit the disk.

See, MemSQL also has “transaction-buffer” setting, which, in default “full durability mode” will asynchronously return “ok” until 128M buffer is full (or background log flusher thread writes it out). Essentially this is something similar to innodb_flush_log_at_trx_commit=2. In my opinion not durable.

What happens if you really enable full durability on MemSQL? Absolute sadness does. Apparently each commit will wait for background thread to wake up and write out transaction log. How often does background thread wake up? Every 50ms. Well, it actually does time accounting magic, to flush every 50ms, and calls very exact sleep.

Claim #1: MemSQL is 500x slower at durable transactions a second than InnoDB.

It is relatively easy to back that up – with decent RAID controller that has write-behind caching, InnoDB can easily sustain 10k transactions a second from a single thread, as it doesn’t sleep for 50ms between fsyncs. There is some commit grouping there, two threads will have 40tps, ten threads will have 200tps, but as I get to choose my own benchmark I claim that MemSQL is 500x slower at single-thread durable transaction rate.

Now that we established MySQL superiority (ha ha), let’s look at read performance. I sure agree, that it is where MemSQL should shine. I do agree, that its execution speeds for table scans are not bad – 8M rows scanned a second (for SELECT COUNT(*) query) from single thread is sure a great achievement.

To be honest, I didn’t want to spend my time in benchmarking what an in-memory database should excel at (I’m sure it does random point reads on skiplist just fine). Instead I decided to test my favorite query:


You know, the query that is all around the web – showing you heads of various lists. MySQL does that by pointing a cursor at an index position then walking record by record in index order. Not MemSQL, it will actually have to traverse whole table and sort it to return you the answer. Even “SELECT MAX(id)” does crawl whole table.

Claim #2: MemSQL is thousand times slower than MySQL. Or million times slower. At simple read queries. (I have been corrected on this – apparently indexes in MemSQL are unidirectional, so you have to define separate index for each direction you are going to read the table in).

Well, once we establish that MemSQL will have O(N) performance on some common operation, all we need is just find an N that is large enough ;-)

I don’t know how much we should be blaming MemSQL guys and how much that should be directed at journalists that were hyping on the technology. If we get back to ACID, we’d see that A for atomicity is done only at statement level and BEGIN/COMMIT are ignored. Isolation is only READ COMMITTED (difficult to have REPEATABLE READ with no real transactions). Durability is flawed, and I didn’t check C part. I got to admit, MemSQL FAQ states that “Yes, MemSQL supports fully ACID transactions”. This is on them, then.

The 80000 queries a second on MemSQL number isn’t anything impressive, compared to what modern MySQL can do (especially with HandlerSocket) – people are approaching million queries a sec there :-)

Though, definitely, for things that it is doing well, it is fastest MySQL protocol speaking thing at the moment, though it isn’t that far ahead of MySQL Cluster, and people talking to NDB are having also quite good performance (really amazing performance, that is).

I’m sure, that my both claims can be fixed with some engineering work. Write performance needs proper real time synching with group commit (there has been some great development in MySQL world about that lately too – though when binlog is involved things are way more complicated).

Read performance needs proper optimizations for most common patterns – index order reads, for example. Memory is fast, but not fast enough if high concurrency environment would need to do this over and over again. Even for what it does well, I’m somewhat sure that it wouldn’t overperform InnoDB 30x at in-memory workloads. I’m too lazy to benchmark today, but this ‘Claim #3’ is not that difficult to prove :-)

Anyway, we wouldn’t need this post if there was a decent disclosure of behaviors and proper benchmarking. Now we get multiple conflicting claims that are way too easy to spot within few minutes of testing. Way too easy.

P.S. Harrison also has discussed this on Quora

On binlogs and datacenters

Once MySQL is deployed inside a datacenter environment (i.e. forms a cloud ;-), major feature in it becomes replication. It is used to maintain hot copies, standby copies, read-only copies, invalidate external systems, replicate to external systems, etc. If this functionality is broken, datacenter is broken – components are not synchronized anymore, invalidations not done, data not consistent.

From performance perspective, replication not working properly results in unusable slaves so load cannot be spread. This results in higher load on other machines, including master (especially on master, if environment needs stronger consistency guarantees).

Judging on replication importance in MySQL deployments, it should attract performance engineering as much as InnoDB and other critical pieces. Though slave replication performance is being increased in 5.6, master side is not (well, group commit may help a bit, but not as much).

The major problem here is LOCK_log. It is a mutex that guards all MySQL log files (Update: that sentence is wrong, it is per-log-type  mutex, is not shared among different types, at is class private variable) (general query log, binary log, etc) – only one read or write can happen to any of them. Other logs aside (their impact can be minimized), binary log finds this mutex extremely problematic, because:

  • It stalls COMMITs
  • It stalls slave binary log reads

Though one can argue, that once there’re less commits, there will be less binary log and that will result in some sort of equilibrium. Unfortunately, one doesn’t get anywhere close to decent efficiency, as there’re too many situations, when LOCK_log is kept for way too long:

  • Slow binary log read (if they already got purged out of cache, highly likely)
  • Slow binary log write (if controller cache is saturated by page flush workload)
  • Slow memory allocation for larger events (on systems with malloc scalability issues)

Saturating RAID controller cache isn’t that difficult – and even if one dedicates separate disks to a separate log disk array (what a waste!), it doesn’t mean that out of the shelf hardware will support supplying separate cache for that disk array. Without a separate dedicated cache or high priority log processing one ends up blocking until disks are available for a write.

Read efficiency is a concern here too – MySQL uses statically sized buffer of 8k for all binary log reads, which means that on highly pressured system read-ahead may not work, and there will be way more I/O operations/seeks, while holding a precious mutex. Even if the operation to read the 8k page from memory buffers is very fast, slave threads will be forced to wait for other, much slower operations (a lagging slave reading yesterday’s data? high write workload?) to finish.

If binary log writes are on a separate device one can inspect the utilization on iostat. Usually there will be only one write or read outstanding, so the busy percentage shows really well how contended this resource is. A double-digit number already says that operations aren’t instant, and replication slaves can be subjected to log starvation, that is – not replicating.

From operation side solution to this is somehow obtaining a dedicated cache or fast device for binary log writes. One can simply disable write cache for the InnoDB data partition and use lots of parallel InnoDB page write threads, so that they win the fight against read threads. Using flashcache or similar technology to absorb writes can be beneficial in this case.

On the other hand, MySQL can be changed not to have giant log mutex – binary logs are append-only structures, and just guarding tail and rotation events could be enough, with additional one or two atomic variables to make sure that readers are fine.

Also, having compressed binlogs on disk would reduce the I/O pressure caused by writes or reads, as well as there’s a possible gain from not having to recompress logs for each remote reader.

Then datacenter will be happy, and datacenter users even happier.

MySQL is doomed?

Percona’s version of MySQL Conference this year was awesome, and there were some great keynotes there, I’ll high-light two of them.

One was called “Future Perfect: The Road Ahead for MySQL” and had a vendor panel of “Industry leaders from HP, Amazon Web Services, McAfee, Clustrix, and Akiban discuss the future of MySQL”.

It went like this:

BaronTheModerator: “so, what is the future for MySQL?”
HP Cloud: “Clouds!”
Amazon Web Services: “Clouds and DynamoDB/RDS!”
McAffee: “More Security!”
Clustrix: “More Scaling!”
Akiban: “More document stores!”

BaronTheModerator, “what do your customers ask for?”
HP Cloud: “Clouds!”
Amazon Web Services: “Clouds and DynamoDB/RDS!”

After an hour of that I already knew a lot about road ahead for MySQL, much more than after pragmatic “What Comes Next”, which covered boring things like optimizing for future hardware, compression, bla bla.

Probably the most significant part of keynotes was Brian Aker’s disclosure of his impact on MySQL ecosystem in his “The New MySQL Cloud Ecosystem”. Actually, people should know that Brian actually coined the term “MySQL Ecosystem”. Brian remembered his days as MySQL’s Director of Architecture (05:16 in the video):

So, what I think we’ve now been entering for last couple years really is actually the cloud era. And we have glued some pieces of things… It’s funny, we’re talking about… you know… I’m looking at MySQL 5.5 and look at 5.6 and I’m like yep, I remember, thats the roadmaps I wrote, ha, glad to know, I’m curious what they gonna do after 5.6. Didn’t leave many.. eh.. Cliff Notes to what goes behi… after that.

This reveals that MySQL Community and Oracle customer base are actually locked in into a stagnant software product with no future. I think you should be worried. Heh.