Hash of shame

I may retract this post, I may have been way too incorrect here, not sure yet.

Hash maps were invented sixty years ago, apparently. MySQL reinvented them.

Original idea was too fancy and too good, I guess. It allowed very cheap access to data, if you knew a key, and it achieved that by having a hashing function, which is used to pick a slot, then going directly to that slot. It is used in your computer all the time. ALL THE TIME.

They are so fast and useful, that they are always treated as building blocks. There have been various iterations later, to support concurrency, hashing functions evolved, etc, but the premise was the same.

If we look at the dictionary, it is obvious that “hash” is:

a mixture of jumbled incongruous things; a mess.

Yes, MySQL, the whole concept is to have as messy as possible data distribution, so that there are no natural collisions. Because if there is a collision, then there is contention over the list sitting in a hash bucket, or a worker, ready to do the work, or some fancy shmancy high performance concurrent workflow.

But no, MySQL wants order, it does not want to have the hashing function messy, it will redefine the term and sixty years of industry practice. For example, given these inputs, it will calculate hash for them as close as possible (tested with (gdb) print my_calc_hash(&mapping_db_to_worker, "db100002", 8)):

  • db100002 - 0xace3c8bc
  • db100003 - 0xace3c8f3
  • db100004 - 0xace3c836

It will decide, that for each change in byte it will change only one byte in the hashing function (pretty much every other hashing function like CRC32, Adler, Fletcher, HeikkiFolding, MD5, SHA1 will have all bytes different ) (this was debunked by a commenter, I was completely wrong here).

Then MySQL will decide that resulting hash is still messy and will not look at different bytes, just map everything to exactly same slot.

Mark saw this on other place, his performance of 5.6 was not too good, and he investigated hash problems in this bug report. MySQL developers looked at it and filed a separate bug which still does not catch the whole essence – MDL lock is blah blah blah.

No, MySQL, it is not related to any use case of your hashing implementation. This is a bug in your critical data structure code. You should be just happy that it is not used too much, only in two places, one that is on Mark’s radar, but other is in new flashy beautiful feature of MySQL 5.6 – Multi Threaded Slave. Or should I call it, Not-So-Multi-Threaded Slave. NSMTS.

P.S. On the upside, I can skip talking about MTS in my “high performance replication” talk at MySQL Conference next week \o/

Update: It may be that MTS is not working because of other issues, checksum function is getting empty inputs.

Posted in mysql | 8 Comments

MySQL 5.6 @ Facebook development tree

Steaphan is a hero (well, everyone else on database engineering team are too) and he is driving efforts to publish MySQL 5.6 changes we’re making to the open. Now they’re on the github (yet not in production, we’re in active testing though with our workloads).

Link | Posted on by | Tagged , | Leave a comment

On performance schemas

You should probably read Marc Alff’s post about configuring Performance Schema in MySQL 5.6.

We wrote another guide, how to start using user/table/index statistics first introduced in Google patch, now part of various MySQL branches and forks:

  1. Start MySQL
Posted in mysql | 3 Comments

The saddest bug of them all (SQL is dead?)

From time to time I will observe servers wasting lots of CPU when doing batch row operations. In perf top it will look like this:

8.24% mysqld [.] Arg_comparator::compare_int_unsigned()
7.17% mysqld [.] Item_cond_and::val_int()
4.37% mysqld [.] Item_field::val_int()
4.37% mysqld [.] Item_cond_or::val_int()
2.90% mysqld [.] MYSQLparse(void*)
2.64% mysqld [.] Item::val_bool()

Essentially if you construct queries like (a=1 AND b=2) OR (a=3 AND b=4) ..., at large enough batch size evaluating the WHERE will become far more expensive than anything else (yes, more expensive than decompressing rows or doing all the InnoDB magic and what not).

MySQL has awesome syntax that makes certain batch lookups much faster: WHERE a IN (1,2,3). It constructs a tree that then each row can be compared against and one does not have to iterate through lists of predicates to check whether the row returned by batch index lookups needs to be filtered away. One would think that the composite syntax that server has (WHERE (a,b) IN ((1,2),(3,4))) may help here.

Unfortunately, if you run a query using this syntax, a full table scan will be done, even if there’s a great index for that – even the syntax exists, none of sane ways to execute apply. Of course, there have to be bugs about that:

Compound (cola,colb) IN ((x,y)) statements do not use indexes. (filed by yours truly).

Well, it was closed by optimization lead (now at Maria), who decided, that this bug is a duplicate of another problem, (a,b,c)=(A,B,C). I probably should’ve contested there, but there might have been other people who’d do that? Sure there were, in 2007 Mark filed another bug:

Optimizer does not do index range scans using predicates in IN lists

If one would look more closely, there’s this small exchange in 2010:

[7 Jan 2010 21:35] Mark Callaghan
This is the same as http://bugs.mysql.com/bug.php?id=16247, 
but the author of this feature request is much more convincing.
[7 Jan 2010 21:36] Domas Mituzas
hehe.

He wrote a Facebook note back then too.

Apparently Mark’s bug became the master bug for this problem, even if it arrived bit later, but, I can’t argue, he is far more convincing :-)
There’s a bit of odd banter there, Harrison (ohi!) points out that it is documented (albeit incorrectly, Sergey Petrunia – ohi – notices) in the manual. No, it isn’t this exact issue that is noted in the manual.
Duplicates are pouring in:
Multi column IN does not use index, says Peter Zaitsev (ohi!) in 2008
MySQL not using an index when using tuples in IN clause, says Shlomo (ohi!) in 2010
There are few more.

Sinisa (ohi!) pronounces:

In my humble opinion, this is still not a bug, but feature request. This feature request, however, should be very high on TODO list, as it would make a feature that is sorely missing, particularly since tuple equality expressions have been already optimized.

Fast forward into 2013, even in 5.6 we still have to use expressions with exponential cost, (a,b) IN (( is causing full table scans and people are switching to NoSQLs, NewSQLs and OtherSQLs, because there’s no way to efficiently express batch predicates in MySQL (besides other reasons). Back in the day MySQL was ahead of everyone else with ease of use for simple expressions, but there have been nearly no changes since back in the day to make developer lives easier. Did you know that it may actually be faster to create a temporary table with all the values, then join against it, than actually to use any of existing SQL methods?

Where is MySQL innovation nowadays? In this exact case I guess you’re supposed to be memcached API (forget authorization, common interfaces, transactions and what not), solving this in SQL is too complicated.

Posted in mysql | Tagged | 15 Comments

on wikipedia and mariadb

There’s some media coverage about Wikipedia switching to MariaDB, I just wanted to point out that performance figures cited are somewhat incorrect and don’t attribute gains to correct authors.

Proper performance evaluation should include not just MariaDB 5.5 but Oracle’s MySQL 5.5 version too, because thats where most of performance development happened (multiple buffer pools, rollback segments, change buffering et al).

5.5 is faster for some workloads, 5.1-fb can outperform 5.5 in other workloads (ones with lots of IO), it is good to know that there’s beneficial impact from upgrading (though I’d wait for 5.6), but it is important to state that it is an effort from Oracle as well, not just MariaDB developers.

P.S. As far as I understand, decision to switch is political, and with 5.6 momentum right now it may not be the best one, 5.6 is going to rock :-)

Posted in mysql, wikitech | 20 Comments

on seconds_behind_master sleuthing

With large enough infrastructure it gets a bit more and more complicated to detect whether an incident or a problem is real systems problem or a monitoring glitch. This is a story of one such investigation.

With a sufficiently large set of machines, there’re multiple graphs/sets of data to look at to understand what is going on:

  • Group average – the easiest to look at for long term trends or system-wide problems
  • Top-N chart – our favorite for dive-ins, looking only at top offender lines (quite often part of the offender lines end up matching tier baseline)
  • Percentiles chart – knowing best performers, medians, 99% or 99.9% behavior is incredibly useful to understand medium-term dynamics and can give lots of insight where to look for problems. Is way more complicated to collect though.

Another very important property of data one looks at is collection frequency – though thousands of servers aggregated together into single chart will already provide smoothened view of what is going on, individual charts will always look way choppier, without any intermediate dots to show progression of metric. You will see that there’re lots of threads running, but it won’t explain how that built up – only longer term incidents averaged out over lots of machines will show that.

In our deployment, replication lag is very important metric of how well we’re doing – with faker in production it already means that machine (or machines) are in trouble, and usually that is either hardware trouble, or way too much workload from applications, and both problems need immediate actions.

It is even more interesting on pure flash machines – those are not supposed to lag at all, once they’re in production – unless something is really really wrong.

This is why I once I saw a group of flash-only machines reporting higher replication lag values, jumped to investigate. There could’ve been various reasons – e.g. long transactions increasing cost of write queries, and what not.

Looking at charts I saw that there’s some lag, that some machines spike to 1000+s lag values, and percentiles showed that only small amount of machines were hitting this. It was relatively interesting, that each of those machines would hit a lag spike and then behave normally afterwards again.

I tried to find a machine that was in high lag condition and check what is going on there. Listing machines with their lag values was relatively easy:

pmysql "SHOW SLAVE STATUS" | cut -f 1,34 | sort -nr -k 2,2 | head

What was odd, that each time I was running this, machines on the list were different. First thought in my mind was that some old transactions with old timestamps were showing up (that is usually a reason for odd Seconds_behind_master spikes), figuring out statements would’ve been piece of cake:

pmysql "SELECT * FROM information_schema.processlist where 
    user='system user' and info is not null and time>100"

Unfortunately, even though “SHOW SLAVE STATUS” was always showing me machines with 100s+ lag, none of that came via processlist query.

The aha! moment came when I tried to look at binlog/relaylog values – every “lagging” machine was at one of two relay log positions – 298 or 308. So, I looked at timestamps of events in relay log positions:

#120921 13:21:42 server id 22222  end_log_pos 106 
    Start: binlog v 4, server v 5.1.53-log created 120921 13:21:42
#120921 13:21:42 server id 4294967295  end_log_pos 151 
    Rotate to binary-logs.017880  pos: 81259576
#691231 16:00:00 server id 11111  end_log_pos 0 
    Rotate to binary-logs.017880  pos: 81259576
#120921 13:17:39 server id 11111  end_log_pos 0 
    Start: binlog v 4, server v 5.1.53-log created 120921 13:17:39
#120921 13:21:42 server id 11111  end_log_pos 81259647 
    Query ...

First record was relay log header (format description event), next two were rotation events with bogus (-1 and 0) values for timestamp and server_id – and binary log positions. Then we suddenly have master’s binary log header which is taken from the beginning of the binary log, with the binlog creation timestamp in it, then we have a query from the middle of binary log, with proper timestamp on it.

For the whole duration of SQL thread acting on the first query server thought it was executing events from 13:17:29 and not 13:21:42. Seconds_behind_master was telling that the gross offense of 253s replication lag was being committed, and database users were suffering.

Fortunately, it was just a MySQL status variable glitch and data integrity was completely fine, but this reminds us how important is quality monitoring and internal metrics for large environments – and how easy it is to draw wrong conclusions otherwise.

P.S. Of course, a bug was filed :)

Posted in mysql | 1 Comment

replication prefetching revisited

Been a while since I wrote about replication work we did. Fake changes based approach was huge success, and now our prefetching has lots of coverage, where standard SELECTs cannot reach. We’re running our systems at replication pressure, where not running faker immediately results in replication lag. On busier machines Python implementation started using quite some CPU and ended up occasionally hitting GIL issues.

So, here’s the straightforward rewrite of fake changes replication prefetcher, faker. It can run 100k statements a sec, if needed. To get it, you can run:

bzr co lp:mysqlatfacebook/tools; cd faker
Posted in facebook, mysql