algorithm – domas mituzas

Brute force!

More than three years ago one prominent Wikipedia community member, who didn’t have enough time to work on our technology, but had enough of it to criticize it, had a problem. He was working for a super-computing lab that boasted one of fastest world’s clusters (IBM blue-something), and to relieve the stress of some of his hard work, was trying to solve a computer science mystery:

Given a dictionary, find words that have most anagrams. With few-hundred-thousand word dictionary, on 8-way machine this should complete in less than 12 hours.

The supercomputing lab inspired solution was something like this (I may be wrong, years pass though):

Put all the words into an array
Start generating letter sequences: aaaa, aaab, aaac, …
For every generated word, traverse the array, find how many matches it hits
Get sequences with most matches

See, the problem was not how to make this algorithm efficient, but how to parallelize it, or maybe rewrite in C (thats what they do in supercomputing labs). It took few minutes to write following code:

wordlist=open("/usr/share/dict/words")
anagrams={}
for line in wordlist:
  word=line.strip()
  sorted=list(word)
  sorted.sort()
  sorted="".join(sorted)
  if sorted not in anagrams:
    anagrams[sorted]={'count':1,'words':[word,]}
  else:
    anagrams[sorted]['count']+=1
    anagrams[sorted]['words'].append(word)

(I just canonized words into base-anagram-form – sorted letters, and ran collision test)

It took maybe half a second to run it on my 1.2ghz ibook back then. The end of the story back then was that we get one critic less.

Now, why I remember this story today – Yasufumi Kinoshita just did exactly the same to us, MySQL users. Simply, brute force is not the answer, if you can spare the minute and rethink stuff. Thanks, for reminding this.

Many needles in a haystack

This is probably quite useless experiment I’ve been doing, but it all started when someone in #mysql@freenode gave a very simple task:

Table A has two million rows
Table B has two thousand rows
Find all rows in A, which have any B row as substring in some field.

So, my colleague Scott was there, and gave the answer which satisfied the guy:

SELECT * FROM a JOIN b ON a.field LIKE CONCAT('%',b.field,'%');

I started telling Scott, that this will result in too many nested loops, and that better combined pattern matching should be used. Well, my actual words were something like “use RLIKE”. So, folks replaced LIKE in above query with RLIKE, didn’t see too much of improvement and made some fun of me.

So, I thought I should provide some decent response to mocking :-) I downloaded ‘most common American last names of 19th century’ from some website out there, took ~2000 of them, also built a list of all Wikipedia article titles, that have space in them (just to reduce dataset a bit, and make more humans show up there).

My initial poking showed around double speed increase when using combined pattern of RLIKE, and using PCRE UDFs provided double speed over RLIKE. I have no idea what I did wrong back then (or doing wrong now), but simple LIKE with nested row lookup is faster on my current test. Still, there’s something else I wanted to show :)

GNU grep has ‘-F’ functionality, which Interprets PATTERN as a list of fixed strings, separated by newlines, any of which is to be matched. Actually, it is quite well optimized, and uses nice algorithm, located in file kwset.c. This is what some comment in that file tells:

The algorithm implemented by these routines bears a startling resemblence to one discovered by Beate Commentz-Walter, although it is not identical.

See “A String Matching Algorithm Fast on the Average,” Technical Report, IBM-Germany, Scientific Center Heidelberg, Tiergartenstrasse 15, D-6900 Heidelberg, Germany. See also Aho, A.V., and M. Corasick, “Efficient String Matching: An Aid to Bibliographic Search,” CACM June 1975, Vol. 18, No. 6, which describes the failure function used below.

So, let’s try standing on shoulders of giants. Actually, I’m not even that smart to find this, it was actually Tim who researched and implemented this as PHP extension to make some of Wikipedia’s code faster.

So I shamelessly took few files from grep, and wrote some crappy MySQL UDF glue (it is just for demonstration, would need proper engineering to make it usable for general purposes).

So, what kind of performance would a custom-tailored algorithm for the task give…

Simple LIKE matching:

select straight_join * from names join lasts on
binary name like concat("%",last,"%") limit 1000;
1000 rows in set (3.80 sec)

Combined RLIKE matching (I don’t know why it is slower – it was much better on some previous test):

SELECT CONCAT("(",GROUP_CONCAT(last SEPARATOR "|"),")")
from lasts into @re;

SELECT * FROM names WHERE BINARY name RLIKE @re;
1000 rows in set (25.99 sec)

Combined PECL UDF matching:

SELECT CONCAT("/(",GROUP_CONCAT(last SEPARATOR "|"),")/")
from lasts into @re;

select * from names where preg_rlike(@re,name) limit 1000;
1000 rows in set (8.10 sec)

Algorithm theft:

SELECT fss_prepare(last) FROM lasts;
SELECT fss_search(name) FROM names LIMIT 1000;
1000 rows in set (0.02 sec)

Damn it, too fast, this must be some mistake, let’s try more rows:

10000 rows in set (0.07 sec)
100000 rows in set (0.62 sec)
551971 rows in set (3.50 sec)

So, by using customized algorithm we got 600x performance. What does Scott say about this?

domas: my answer took about 10 minutes.
       yours has taken what, week and a half?

.. Bastard… Oh well, priorities priorities :)

On checksums

InnoDB maintains two checksums per buffer pool block. Old formula of checksum, and new formula of checksum. Both are read, both are written. I guess this had to be some kind of transition period, but it obviously took too long (or was forgotten). Anyway, disabling checksums code entirely makes single-thread data load 7% faster – though in parallel activity locking contention provides with some extra CPU resources for checksum calculation.

Leaving just single version of checksum would cut this fat in half, without abandoning the feature entirely – probably worth trying.

Update: Benchmarked InnoDB checksum against Fletcher. Results were interesting (milliseconds for 10000 iterations):

Algorithm:	InnoDB	Fletcher
–	826	453
-O2:	316	133
-O3:	42	75

So, though using Fletcher doubles the performance, -O3 optimizes InnoDB checksumming much better. How many folks do run -O3 compiled mysqld?

	markcallaghan (@mark… on MySQL does not need SQL
	markcallaghan (@mark… on MySQL does not need SQL
	Domas Mituzas on MySQL does not need SQL
	Marc on MySQL does not need SQL
	Nils Meyer on linux memory management for…