Hash of shame

I may retract this post, I may have been way too incorrect here, not sure yet.

Hash maps were invented sixty years ago, apparently. MySQL reinvented them.

Original idea was too fancy and too good, I guess. It allowed very cheap access to data, if you knew a key, and it achieved that by having a hashing function, which is used to pick a slot, then going directly to that slot. It is used in your computer all the time. ALL THE TIME.

They are so fast and useful, that they are always treated as building blocks. There have been various iterations later, to support concurrency, hashing functions evolved, etc, but the premise was the same.

If we look at the dictionary, it is obvious that “hash” is:

a mixture of jumbled incongruous things; a mess.

Yes, MySQL, the whole concept is to have as messy as possible data distribution, so that there are no natural collisions. Because if there is a collision, then there is contention over the list sitting in a hash bucket, or a worker, ready to do the work, or some fancy shmancy high performance concurrent workflow.

But no, MySQL wants order, it does not want to have the hashing function messy, it will redefine the term and sixty years of industry practice. For example, given these inputs, it will calculate hash for them as close as possible (tested with (gdb) print my_calc_hash(&mapping_db_to_worker, "db100002", 8)):

  • db100002 – 0xace3c8bc
  • db100003 – 0xace3c8f3
  • db100004 – 0xace3c836

It will decide, that for each change in byte it will change only one byte in the hashing function (pretty much every other hashing function like CRC32, Adler, Fletcher, HeikkiFolding, MD5, SHA1 will have all bytes different ) (this was debunked by a commenter, I was completely wrong here).

Then MySQL will decide that resulting hash is still messy and will not look at different bytes, just map everything to exactly same slot.

Mark saw this on other place, his performance of 5.6 was not too good, and he investigated hash problems in this bug report. MySQL developers looked at it and filed a separate bug which still does not catch the whole essence – MDL lock is blah blah blah.

No, MySQL, it is not related to any use case of your hashing implementation. This is a bug in your critical data structure code. You should be just happy that it is not used too much, only in two places, one that is on Mark’s radar, but other is in new flashy beautiful feature of MySQL 5.6 – Multi Threaded Slave. Or should I call it, Not-So-Multi-Threaded Slave. NSMTS.

P.S. On the upside, I can skip talking about MTS in my “high performance replication” talk at MySQL Conference next week \o/

Update: It may be that MTS is not working because of other issues, checksum function is getting empty inputs.

9 thoughts on “Hash of shame”

      1. my point is that this problem has been here a long, long time. Lots of that code is missing nice things like unit tests, comments, etc.

  1. > CRC32, Adler, Fletcher, HeikkiFolding, MD5, SHA1 will have all bytes different

    python> import zlib
    python> hex(zlib.adler32(‘db100002’)), hex(zlib.adler32(‘db100003’))
    (‘0x9ce01ea’, ‘0x9cf01eb’)

  2. Close but not overlapping is good. There are many implementation contexts where close but not overlapping is better than dispersed (and obviously not overlapping). Having said that, I have no idea how “not overlapping” the MySQL’s “close” implementation is.

  3. Domas,

    You say “This is a bug in your critical data structure code.”. What do you mean by “data structure code”. Do you know how imprecise is it ??? Please, explain …

    BTW, the original hashing functions, that I used so frequently while developing server, e.g. in privileges code, is written with speed in mind. Both speed of calculation and speed of the search. Monty expounded the entire theory on the subject in Bordeaux meeting, which you have missed for obvious reasons. You were still in kindergarten … hehe …

    There are so many hashing algorithms there. Even those, like Ruposov’s hash, aiming for almost-uniqueness of the hash result.

    So, please be more precise about “data structure code”. This is as vague as it could be.

Comments are closed.

%d bloggers like this: