The saddest bug of them all (SQL is dead?)

From time to time I will observe servers wasting lots of CPU when doing batch row operations. In perf top it will look like this:

8.24% mysqld [.] Arg_comparator::compare_int_unsigned()
7.17% mysqld [.] Item_cond_and::val_int()
4.37% mysqld [.] Item_field::val_int()
4.37% mysqld [.] Item_cond_or::val_int()
2.90% mysqld [.] MYSQLparse(void*)
2.64% mysqld [.] Item::val_bool()

Essentially, if you construct queries like (a=1 AND b=2) OR (a=3 AND b=4) ..., at a large enough batch size evaluating the WHERE clause becomes far more expensive than anything else (yes, more expensive than decompressing rows or doing all the InnoDB magic and whatnot).

MySQL has awesome syntax that makes certain batch lookups much faster: WHERE a IN (1,2,3). It constructs a tree that each row can be compared against, so one does not have to iterate through a list of predicates to check whether a row returned by the batch index lookup needs to be filtered away. One would think that the composite syntax the server has (WHERE (a,b) IN ((1,2),(3,4))) might help here.
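To make the contrast concrete, here is a minimal sketch, assuming a hypothetical table t with a composite index on (a, b) (the table and index names are illustrative, not from the original queries):

```sql
-- hypothetical schema, for illustration only
CREATE TABLE t (a INT, b INT, KEY ab (a, b));

-- batch lookup spelled out as ORed predicate pairs: every fetched row is
-- re-checked against the whole list, which is where the CPU time goes
SELECT * FROM t WHERE (a = 1 AND b = 2) OR (a = 3 AND b = 4);

-- the composite IN spelling of the same predicate; sadly, at the time of
-- writing MySQL answers this with a full table scan instead of using the index
SELECT * FROM t WHERE (a, b) IN ((1, 2), (3, 4));
```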

Unfortunately, if you run a query using this syntax, a full table scan will be done even if there’s a great index for it – the syntax exists, but none of the sane ways to execute it apply. Of course, there have to be bugs about that:

Compound (cola,colb) IN ((x,y)) statements do not use indexes. (filed by yours truly).

Well, it was closed by the optimization lead (now at Maria), who decided that this bug was a duplicate of another problem, (a,b,c)=(A,B,C). I probably should’ve contested it there, but there might have been other people who’d do that? Sure there were; in 2007 Mark filed another bug:

Optimizer does not do index range scans using predicates in IN lists

If one looks more closely, there’s this small exchange in 2010:

[7 Jan 2010 21:35] Mark Callaghan
This is the same as http://bugs.mysql.com/bug.php?id=16247, 
but the author of this feature request is much more convincing.
[7 Jan 2010 21:36] Domas Mituzas
hehe.

He wrote a Facebook note back then too.

Apparently Mark’s bug became the master bug for this problem, even though it arrived a bit later – but I can’t argue, he is far more convincing :-)
There’s a bit of odd banter there: Harrison (ohi!) points out that the limitation is documented in the manual (albeit incorrectly, as Sergey Petrunia – ohi – notices). No, it isn’t this exact issue that is noted in the manual.
Duplicates are pouring in:
Multi column IN does not use index, says Peter Zaitsev (ohi!) in 2008
MySQL not using an index when using tuples in IN clause, says Shlomo (ohi!) in 2010
There are a few more.

Sinisa (ohi!) pronounces:

In my humble opinion, this is still not a bug, but feature request. This feature request, however, should be very high on TODO list, as it would make a feature that is sorely missing, particularly since tuple equality expressions have been already optimized.

Fast forward to 2013: even in 5.6 we still have to use expressions with exponential cost, (a,b) IN (( is still causing full table scans, and people are switching to NoSQLs, NewSQLs and OtherSQLs because there’s no way to efficiently express batch predicates in MySQL (among other reasons). Back in the day MySQL was ahead of everyone else in ease of use for simple expressions, but there have been nearly no changes since then to make developers’ lives easier. Did you know that it may actually be faster to create a temporary table with all the values and join against it than to use any of the existing SQL methods?
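The temporary-table trick can be sketched like this (table and column names are invented for illustration; t is assumed to have an index on (a, b)):

```sql
-- load the batch of keys into a session-local table
CREATE TEMPORARY TABLE lookup (a INT, b INT);
INSERT INTO lookup VALUES (1, 2), (3, 4);

-- the join lets the optimizer do one index lookup on t per pair,
-- which the (a, b) IN ((1,2),(3,4)) form cannot do today
SELECT t.* FROM t JOIN lookup USING (a, b);

DROP TEMPORARY TABLE lookup;
```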

Where is MySQL innovation nowadays? In this exact case, I guess you’re supposed to use the memcached API (forget authorization, common interfaces, transactions and whatnot) – solving this in SQL is too complicated.

16 thoughts on “The saddest bug of them all (SQL is dead?)”

  1. I don’t believe that SQL, PL/SQL, or D/SQL is dead or dying, but I do believe it’ll need to find a new place, or at least evolve a bit more, for DB engines (regardless of the engine) to allow more crossover and interconnection between what they’re needed to do and what they’re interacting with (more cross scripting/programming).

    1. I don’t think that Domas believes SQL is dead. Just unhappy that this hasn’t been fixed given more awareness of the problems it causes in production.

  2. I have been guilty of using derived tables in joins to perform this. Seeing machine generated large (SELECT … UNION … UNION) foo LEFT JOIN bar in the FROM clause makes me sad but, *sigh* oh well.

  3. “If you’re calling about PHP, please select $ now. If you’re calling about SQL, please SELECT * FROM…”

    Okay, silliness aside, I agree that SQL has its places, and one of its “not-places” may well be dealing with peta-data like Facebook does. It’s convenient to design, which makes it great for low-scale operations. Once you’ve got something of this size, though, you’re going to have to go with a NoSQL solution. Have you considered RDF triple storage? (Just tossing an idea out there for example.)

  4. I reported this bug some time ago at the MySQL bug tracker;
    the suggested workaround was to create a temporary table with these values.
    OK, it’s not the best workaround, but it’s better than bad performance. I’m using BIG inserts, like “insert into () values ((1 list),(2 list),(3 list))….” and then a join…
    Well, maybe some MariaDB version will run this faster; MySQL seems stopped in time since the Oracle purchase… any news is welcome :)

  5. Thanks for the information about high CPU usage in MySQL queries. In shared hosting, high CPU usage by MySQL queries is often punished. However, it would be better if MySQL addressed this issue itself than for developers to have to adapt.

  6. SQL is not dead; this is just what you get when trying to use MySQL as a normal SQL server.

  7. The subquery enhancements in MySQL 5.6 provide a good workaround here. Create a temporary table with an index, insert the list of keys to search for, and convert the IN predicate to use the temporary table:

    create temporary table tmp(i int, j int, key kk(i, j));
    insert into tmp values (0,0), (1,1);
    select count(*) from c where (i,j) in (select i, j from tmp);

    If everything goes right, the query is converted into a semi-join, the optimizer chooses LooseScan access method, and the result may even appear faster than with range scan.
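    One way to check whether the semi-join transformation actually kicked in is to inspect the plan first (a sketch reusing the tmp/c names from the example above; the exact EXPLAIN output differs between versions):

    ```sql
    -- if the rewrite worked, the plan should mention a semi-join strategy
    -- (e.g. LooseScan) rather than a full scan of c re-evaluating the IN list
    EXPLAIN SELECT COUNT(*) FROM c WHERE (i, j) IN (SELECT i, j FROM tmp);
    ```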

    1. Would be much easier to not have to use temp tables to solve this problem.

      I would rather not read through the manual and run a few tests to confirm that “create temporary table” is crash-safe with replication. The manual states that it doesn’t commit an open transaction, which is a good thing. But I don’t always trust the manual and lots of us have had many problems with temp tables and replication in the past (maybe only when they spanned transactions). See:
      http://scale-out-blog.blogspot.com/2012/04/replication-is-bad-for-mysql-temp.html
      http://www.mysqlperformanceblog.com/2008/05/26/mysql-temporary-tables-safe-for-statement-based-replication/
      http://www.xaprb.com/blog/2007/05/11/how-to-eliminate-temporary-tables-in-mysql/
      http://mysqlha.blogspot.com/2007/11/how-to-keep-mysql-replication-in-sync.html

      And this advice from http://dev.mysql.com/doc/refman/5.6/en/replication-features-temptables.html is either awesome or incredibly unrealistic:
      >>>
      To avoid this problem, do not shut down the slave while it has temporary tables open. Instead, use the following procedure:
      Issue a STOP SLAVE SQL_THREAD statement.
      Use SHOW STATUS to check the value of the Slave_open_temp_tables variable.
      If the value is not 0, restart the slave SQL thread with START SLAVE SQL_THREAD and repeat the procedure later.
      When the value is 0, issue a mysqladmin shutdown command to stop the slave.
      >>>

      1. I am aware that using a temporary table is not a perfect solution, and I agree that it should rarely be necessary to rewrite an SQL query for performance reasons.

        Are the problems reported on temporary tables and replication also applicable when using the MEMORY engine? We use these table types a lot inside the optimizer, but the use is a bit different since those tables are scoped to a single query and do not have any external visibility.

        I think the concept is still interesting: We might convert the IN list to a real table internally, and use the described technique. This is an alternative to extending the range optimizer, but is probably less general.

        It is possible to eliminate the temporary table by using a series of UNION statements, like:

        select count(*) from c where (i,j) in
        (select i, j from (select 0 as i, 0 as j union select 1, 1) as dt);

        AFAIK, the derived table is currently needed to force the optimizer to use semi-join. I have not tested the performance of this solution though.

  8. I also reported this problem to Oracle/MySQL a long time ago, and it’s quite frustrating that it has not been addressed. Just looking in a random server’s MySQL 5.6 performance_schema.events_statements_summary_by_digest table shows lots of this type of query, which has to be written long-hand in the form … WHERE ( col1 = ‘X’ AND col2 = ‘Y’) OR ( col1 = ‘Z’ AND col2 = ‘A’ ) OR …
    which is more easily / naturally written as … WHERE ( col1, col2 ) IN ( (‘X’,’Y’),(‘Z’,’A’), …. ) and should perform the same. (I also wonder how well the digestifying works on data like this.) In fact, due to this sort of issue I have several tables which have an extra single-column primary key (auto_increment) even though there’s already a unique (col1,col2) “natural” primary key defined; joining against the natural key gets too complicated otherwise. A waste of space, and a waste of application logic and developer time, when it is harder to use the natural primary key than it needs to be.

    Temporary tables are not ideal either for 2 reasons that I can think of:
    (1) if you have a connection pool to the database you need to be aware of previous temporary tables, or ensure you create and drop the tables after use, and
    (2) to create the temporary table and later drop it you need an extra one or two round trips for the setup, plus the INSERT, plus the final join. That adds latency to the query stream and extra complexity that should not be needed. On a busy server that extra overhead may not be acceptable.
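    For point (1), a defensive pattern when sessions are pooled is to assume a leftover table may already exist (a sketch with invented names):

    ```sql
    -- a pooled connection may still carry state from its previous user
    DROP TEMPORARY TABLE IF EXISTS lookup;
    CREATE TEMPORARY TABLE lookup (col1 VARCHAR(10), col2 VARCHAR(10));
    -- ...then INSERT the batch, JOIN, and DROP: the extra round trips from (2)
    ```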

    1. OK, you made the optimizer team unhappy but you made the PS team very happy…

      “Just looking in a random server’s MySQL 5.6′s performance_schema.events_statements_summary_by_digest table shows “
