Many needles in a haystack

This is probably a quite useless experiment I’ve been doing, but it all started when someone in #mysql@freenode posed a very simple task:

  • Table A has two million rows
  • Table B has two thousand rows
  • Find all rows in A that have any B row as a substring in some field.

So, my colleague Scott was there, and gave an answer that satisfied the guy:

SELECT * FROM a JOIN b ON a.field LIKE CONCAT('%',b.field,'%');

I started telling Scott that this would result in too many nested loops, and that a combined pattern match would work better. Well, my actual words were something like “use RLIKE”. So, folks replaced LIKE in the above query with RLIKE, didn’t see much of an improvement, and made some fun of me.

So, I thought I should provide a decent response to the mocking :-) I downloaded ‘most common American last names of the 19th century’ from some website out there, took ~2000 of them, and also built a list of all Wikipedia article titles that have a space in them (just to reduce the dataset a bit, and make more humans show up there).

My initial poking showed around a 2x speed increase when using a combined RLIKE pattern, and the PCRE UDFs in turn doubled the speed over RLIKE. I have no idea what I did wrong back then (or am doing wrong now), but simple LIKE with a nested row lookup is faster in my current test. Still, there’s something else I wanted to show :)

GNU grep has ‘-F’ functionality, which interprets PATTERN as a list of fixed strings, separated by newlines, any of which is to be matched. It is actually quite well optimized, and uses a nice algorithm, located in the file kwset.c. This is what a comment in that file says:

The algorithm implemented by these routines bears a startling resemblence to one discovered by Beate Commentz-Walter, although it is not identical.

See “A String Matching Algorithm Fast on the Average,” Technical Report, IBM-Germany, Scientific Center Heidelberg, Tiergartenstrasse 15, D-6900 Heidelberg, Germany. See also Aho, A.V., and M. Corasick, “Efficient String Matching: An Aid to Bibliographic Search,” CACM June 1975, Vol. 18, No. 6, which describes the failure function used below.
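
For the curious, here is a minimal self-contained sketch of the Aho-Corasick construction those references describe: put all the needles in a trie, add failure links with a breadth-first pass, and a single scan of the haystack then checks every needle at once. This is only an illustration of the idea (grep’s kwset.c is far more optimized), and the sample names are made up:

/* aho_corasick_sketch.c -- illustration only, not grep's kwset.c.
   Build: cc -o acdemo aho_corasick_sketch.c */
#include <stdio.h>
#include <string.h>

#define ALPHA    256
#define MAXNODES 1024

static int trie[MAXNODES][ALPHA]; /* goto function, -1 = no edge */
static int fail[MAXNODES];        /* failure links (Aho & Corasick, CACM 1975) */
static int out[MAXNODES];         /* pattern id ending at this state, -1 = none */
static int nnodes = 1;

static void add_pattern(const char *p, int id)
{
    int s = 0;
    for (; *p; p++) {
        unsigned char c = (unsigned char)*p;
        if (trie[s][c] < 0) {
            memset(trie[nnodes], -1, sizeof trie[nnodes]);
            out[nnodes] = -1;
            trie[s][c] = nnodes++;
        }
        s = trie[s][c];
    }
    out[s] = id;
}

/* Breadth-first pass: fill in failure links and complete missing edges,
   turning the trie into a DFA that never backs up in the text. */
static void build(void)
{
    int queue[MAXNODES], head = 0, tail = 0, c;
    for (c = 0; c < ALPHA; c++) {
        if (trie[0][c] < 0) trie[0][c] = 0;
        else { fail[trie[0][c]] = 0; queue[tail++] = trie[0][c]; }
    }
    while (head < tail) {
        int s = queue[head++];
        for (c = 0; c < ALPHA; c++) {
            int t = trie[s][c];
            if (t < 0)
                trie[s][c] = trie[fail[s]][c];
            else {
                fail[t] = trie[fail[s]][c];
                if (out[t] < 0) out[t] = out[fail[t]]; /* inherit shorter match */
                queue[tail++] = t;
            }
        }
    }
}

/* One pass over the haystack, all needles checked simultaneously. */
static int search(const char *text)
{
    int s = 0;
    for (; *text; text++) {
        s = trie[s][(unsigned char)*text];
        if (out[s] >= 0) return out[s];
    }
    return -1;
}

int main(void)
{
    const char *needles[] = { "Smith", "Johnson", "Miller" };
    int i, hit;
    memset(trie[0], -1, sizeof trie[0]);
    out[0] = -1;
    for (i = 0; i < 3; i++) add_pattern(needles[i], i);
    build();
    hit = search("Mary Johnson (singer)");
    printf("%s\n", hit >= 0 ? needles[hit] : "no match");
    return 0;
}

The point is the scan cost: it depends on the haystack length, not on the number of needles, which is exactly what the nested LIKE loop cannot offer.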

So, let’s try standing on the shoulders of giants. Actually, I’m not even smart enough to have found this myself; it was Tim who researched it and implemented it as a PHP extension to make some of Wikipedia’s code faster.

So I shamelessly took a few files from grep, and wrote some crappy MySQL UDF glue (it is just for demonstration; it would need proper engineering to be usable for general purposes).
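
To give an idea of what that glue looks like, here is a rough sketch, with big caveats: the kwsalloc/kwsincr/kwsprep/kwsexec calls are the entry points exported by grep’s kwset.c, but their exact signatures vary between grep versions, and a real UDF would need locking and proper error handling that this toy skips. Treat it as wiring pseudocode in the real MySQL UDF calling convention, not the code behind the numbers below:

/* fss_udf_sketch.c -- a sketch of the UDF glue, not the demo's real code.
   Assumes kwset.h/kwset.c lifted from GNU grep (kwsalloc, kwsincr,
   kwsprep, kwsexec -- exact signatures vary by grep version). */
#include <mysql.h>
#include <string.h>
#include "kwset.h"

static kwset_t kwset;    /* global needle set -- single-threaded demo only */
static int prepared;     /* has the automaton been built yet? */

/* fss_prepare(needle): add one fixed string to the set. */
long long fss_prepare(UDF_INIT *initid, UDF_ARGS *args,
                      char *is_null, char *error)
{
    if (!kwset)
        kwset = kwsalloc(NULL);
    kwsincr(kwset, args->args[0], args->lengths[0]);
    prepared = 0;        /* automaton must be rebuilt before searching */
    return 1;
}

/* fss_search(haystack): 1 if any prepared needle occurs in the string. */
long long fss_search(UDF_INIT *initid, UDF_ARGS *args,
                     char *is_null, char *error)
{
    struct kwsmatch match;
    if (!kwset || !args->args[0]) {
        *is_null = 1;
        return 0;
    }
    if (!prepared) {
        kwsprep(kwset);  /* build the Commentz-Walter-style automaton once */
        prepared = 1;
    }
    return kwsexec(kwset, args->args[0], args->lengths[0], &match)
           != (size_t)-1;
}

my_bool fss_search_init(UDF_INIT *initid, UDF_ARGS *args, char *message)
{
    if (args->arg_count != 1 || args->arg_type[0] != STRING_RESULT) {
        strcpy(message, "fss_search() expects one string argument");
        return 1;
    }
    return 0;
}

With something like that loaded via CREATE FUNCTION ... SONAME, the two fss_* queries below are the whole SQL-side interface.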

So, what kind of performance would an algorithm custom-tailored to the task give…

Simple LIKE matching:

select straight_join * from names join lasts on
binary name like concat("%",last,"%") limit 1000;
1000 rows in set (3.80 sec)

Combined RLIKE matching (I don’t know why it is slower; it was much better in some previous test):

SELECT CONCAT("(",GROUP_CONCAT(last SEPARATOR "|"),")")
FROM lasts INTO @re;

SELECT * FROM names WHERE BINARY name RLIKE @re;
1000 rows in set (25.99 sec)

Combined PECL UDF matching:

SELECT CONCAT("/(",GROUP_CONCAT(last SEPARATOR "|"),")/")
FROM lasts INTO @re;

select * from names where preg_rlike(@re,name) limit 1000;
1000 rows in set (8.10 sec)

Algorithm theft:

SELECT fss_prepare(last) FROM lasts;
SELECT fss_search(name) FROM names LIMIT 1000;
1000 rows in set (0.02 sec)

Damn it, too fast, this must be some mistake, let’s try more rows:

10000 rows in set (0.07 sec)
100000 rows in set (0.62 sec)
551971 rows in set (3.50 sec)

So, by using a customized algorithm we got roughly 600x the performance (LIKE needed 3.80s for 1000 rows; this handles 551971 rows in 3.50s, which works out to about 600x per row). What does Scott say about this?

domas: my answer took about 10 minutes.
       yours has taken what, week and a half?

.. Bastard… Oh well, priorities, priorities :)
