A case for FORCE INDEX

I remember various discussions in different mediums where people were building cases against the use of FORCE INDEX in SQL queries. I’ll hereby suggest using it way more often, but first, a small explanation.

For ages, the concept of index statistics affecting query plans has been clogging the minds of DBAs, supported by long explanations in the MyISAM and InnoDB manuals. Actually, statistics are used just for determining which index to use for a joined table, as the predicate is not known at the time of ‘optimization’.

What happens if you do a simple query like:

SELECT * FROM table WHERE a=5 AND b=6

? If there’s an index that enforces uniqueness on (a,b), it will be used – this is the short path for PRIMARY KEY lookups. Otherwise, the optimizer will look at every index, composite or not, that can satisfy either a or b (or both), and evaluate how many rows each one would fetch under the provided criteria.
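To make this concrete, here is a sketch of that situation (the table and index names are hypothetical, not from the post):

```sql
-- No unique index on (a,b), so the optimizer has to pick between
-- two secondary indexes that each cover part of the predicate.
CREATE TABLE t (
  id INT PRIMARY KEY,
  a INT,
  b INT,
  KEY idx_a (a),
  KEY idx_b (b)
) ENGINE=InnoDB;

-- Both idx_a and idx_b appear in possible_keys; the optimizer
-- estimates the matching row count for each before choosing one.
EXPLAIN SELECT * FROM t WHERE a = 5 AND b = 6;
```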

Now, contrary to what people usually think, this row count evaluation has little to do with cardinality statistics – instead, the optimizer builds the range that the known predicate defines on each candidate index and does two full B-Tree dives into that index – one at the start of the range, and one at the end of it. For each possible index.
This simply means that even if an index is not used to execute the query, two of its leaf pages (and all the tree branches needed to reach them) will end up being fetched from disk into the cache – wasting both I/O cycles and memory.
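On later MySQL versions (5.6 and up) these per-index estimates can be observed directly via the optimizer trace; a sketch, assuming a table t with indexes that can satisfy the predicate:

```sql
-- The range_analysis section of the trace shows a rows estimate
-- for every considered index -- each one produced by the two
-- B-Tree dives described above, not by cardinality statistics.
SET optimizer_trace = 'enabled=on';
SELECT * FROM t WHERE a = 5 AND b = 6;
SELECT trace FROM INFORMATION_SCHEMA.OPTIMIZER_TRACE;
SET optimizer_trace = 'enabled=off';
```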

There’s also quite an interesting paradox here – in some cases, the more similar other indexes are, the more waste they create because of rows-in-range checks. If a table has indexes on (a,b,c) and (a,b,d), a query on (a,b,d) will be best satisfied by the (a,b,d) index, but the optimizer will still evaluate the (a,b) range size on the (a,b,c) index. If that first index were (a,c,b) instead, the optimizer could only check the head and tail of (a) – so far fewer B-Tree positions would be cached in memory for the check. This can make better indexing fare worse than it’s worth in benchmarks (assuming that people do I/O-heavy benchmarking :)
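A sketch of that paradox (the table and index names are hypothetical):

```sql
-- Two near-identical composite indexes: the query executes via
-- idx_abd, but the optimizer still dives into idx_abc to size
-- the (a,b) range there.
CREATE TABLE t2 (
  id INT PRIMARY KEY,
  a INT, b INT, c INT, d INT,
  KEY idx_abc (a, b, c),
  KEY idx_abd (a, b, d)
) ENGINE=InnoDB;

EXPLAIN SELECT * FROM t2 WHERE a = 1 AND b = 2 AND d = 3;
-- possible_keys lists both indexes: both were dived into.
-- Were idx_abc defined as (a, c, b) instead, only the (a) prefix
-- could be range-checked on it, touching fewer distinct leaf pages.
```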

The easy way out is FORCE INDEX. With it, the optimizer skips the evaluation of other indexes – and does no B-Tree dives into indexes that won’t be used.
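The syntax, with the same hypothetical table and index names as above:

```sql
-- Only idx_abd is considered; no rows-in-range dives are done
-- on any other index of the table.
SELECT * FROM t2 FORCE INDEX (idx_abd)
WHERE a = 1 AND b = 2 AND d = 3;
```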

In my edge-case testing with real data and a skewed access pattern, hitting a second index during the ‘statistics’ phase increased execution time by 70% and the number of I/Os done by 75%, raised the number of entrances into the buffer pool by 31%, and bloated the buffer pool with data I didn’t need for the read workload.

For some queries like “newest 10 entries” this will actually waste space preheating blocks from the other end of the range that will never be shown – there will definitely be a B-Tree leaf page sitting in the buffer pool with edits from a few years ago, purely because of rows-in-range checks. Unfortunately, the only MySQL-side solution for this is the HANDLER interface (or probably HandlerSocket) – but that doesn’t make FORCE INDEX not worth using – it just argues for making FORCE INDEX much more forceful.
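For reference, a “newest 10 entries” read via the HANDLER interface might look like this (table and index names hypothetical) – it walks the index directly, with no optimizer phase and hence no range dives at all:

```sql
HANDLER entries OPEN;
-- Position at the newest entry, then walk backwards in index order:
HANDLER entries READ created_idx LAST LIMIT 1;
HANDLER entries READ created_idx PREV LIMIT 9;
HANDLER entries CLOSE;
```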

So, use the FORCE, Luke :)

This entry was posted in facebook, mysql.

7 Responses to A case for FORCE INDEX

  1. Sergei Golubchik says:

That’s really a case for USE INDEX or IGNORE INDEX. They also make the optimizer ignore the given (or not given) indexes for a query. But unlike FORCE INDEX, they don’t artificially increase the cost of the table scan (which FORCE INDEX does, to make it look much more expensive than index access).

2. Sergei, at certain table sizes, and especially under web workloads, increasing the cost of the table scan is a minor issue, as table scans should never be happening anyway :)

  3. Where are the graphs and charts?!

  4. Yingkuan says:

    Good to see your post, Domas!
You disappeared for two months after your last post, “Death to Mituzas!”
    That got everyone worried :)

    >>does two full B-Tree dives to the index – one at the start of the range,
    >> and one at the end of it.
    >> This simply means that even if you are not using the index to execute query,
    >> two leaf pages (and all the tree branches to reach them) will end up being
    >> fetched from disk into the cache

Does this mean that even with a=5 it will do two dives? In that case, are the start and end the same?

    • Hehe, nah, I’m alive and well :-)

It will scan every index that can satisfy the ref/range conditions – twice each time. The “start and end are the same” may be true in some cases – when there’s just one row with that value of (a) – but as (a) is not unique, records matching such a predicate can span multiple pages, and there’d be multiple dives then.

  5. Pingback: MySQL metrics for read workloads | domas mituzas

Comments are closed.