The I/O scheduler problems have drawn my attention, and rather than just trusting empirical results, I tried to do more benchmarking and analysis of why the heck strange things happen at the Linux block layer. So, here is the story, which I myself found quite fascinating…
The synonym for ‘I/O scheduler’ inside the Linux kernel is ‘elevator’, and it helps to explain many things. The physical movement needs of disk heads are similar to how elevators move people in tall buildings:
- It tries to keep going in a single direction, just ‘up’ or ‘down’, grabbing all the people on its way.
- At every floor where it stops, it waits for more people to show up instead of closing the doors immediately.
- The fuller the elevator cabin is, the more efficient it is in terms of people transported.
- The fuller the elevator cabin is, the more annoying it is for the people in it: it stops at every floor, then waits, and people start nervously hitting the ‘close door’ button.
- Buildings solve this by having more elevators, or sophisticated queueing systems for getting into them.
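The ‘single direction’ idea from the list above can be sketched in a few lines. This is only an illustration of a one-way sweep over pending requests, not what any particular kernel scheduler does; the function name `elevator_order` and the sector numbers are made up for the example.

```python
def elevator_order(pending, head, direction=1):
    """Order pending sector numbers the way a one-way sweep would serve them.

    pending   -- iterable of requested sector numbers (hypothetical values)
    head      -- current head position
    direction -- 1 to sweep 'up' first, -1 to sweep 'down' first
    """
    # Requests lying in the current direction of travel are served first...
    ahead = sorted(s for s in pending if (s - head) * direction >= 0)
    # ...the rest are picked up on the way back.
    behind = sorted(s for s in pending if (s - head) * direction < 0)
    if direction > 0:
        return ahead + behind[::-1]
    return ahead[::-1] + behind


# Head at sector 50, moving up: higher sectors first, then back down.
print(elevator_order([10, 95, 60, 20, 70], head=50))
```

The point of the ordering is that the head never reverses in the middle of a sweep, which is exactly the behaviour that stops paying off once requests arrive one at a time.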
Now, imagine a huge hotel building hosting a huge convention of privacy-worshippers. Or just misanthropes. They will never get into an elevator until they know that it is empty and that the human who was traveling before them got out of it safely. Essentially, that's how database transaction serialization works.
That's where smart elevators fail: they immediately notice that all writes are going to the same location and prefer to wait for more requests and merge them. But whenever an elevator decides to wait, nothing happens, because there is a global lock inside the database engine which says not to write until the first write finishes.
The scheduler waits, decides that it has waited too long, flushes the write, gets another request, notices that the write goes to the same location and that there might be a chance to merge subsequent requests, which… do not come in, again.
And the solution is using teleports. Well, at least treating database writes as instant accesses: not caring about order, not waiting, just doing everything as soon as possible.
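A back-of-the-envelope model shows why that waiting hurts so much when writes are serialized. Everything here is an assumption made for illustration: the function name, the 0.25 ms service time and the 4 ms wait are invented numbers, not measurements from the test box or values taken from the kernel.

```python
def serialized_tps(service_ms, idle_wait_ms=0.0):
    """Commits per second when each write must finish before the next is issued.

    service_ms   -- time for one write to be acknowledged (hypothetical)
    idle_wait_ms -- time the scheduler sits waiting for a request to merge (hypothetical)
    """
    # With full serialization nothing overlaps, so the per-commit costs simply add up.
    per_commit_ms = service_ms + idle_wait_ms
    return 1000.0 / per_commit_ms


# Invented figures: 0.25 ms to land a write in the controller cache,
# and a 4 ms wait spent hoping for a neighbouring request that never comes.
print(f"waiting scheduler: {serialized_tps(0.25, idle_wait_ms=4.0):.0f} commits/s")
print(f"no waiting:        {serialized_tps(0.25):.0f} commits/s")
```

With these made-up numbers the waiting scheduler manages roughly 235 commits per second against 4000 without the wait: the wait, not the write, dominates each transaction.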
To demonstrate this I went to the world of edge cases and made a performance test, which maybe shouldn’t be called a ‘benchmark’. I created a very simple table and started spamming rows into it, each as a separate transaction. The hardware for the test was a ‘regular’ DB box (8 disks, write-behind cache on the RAID controller, 16GB of memory, 8 cores), not my desktop or laptop. The sole idea of the test was to find out how different I/O modes affect the ability to write to the I/O controller as fast as possible, which shows up as transaction throughput.
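For anyone who wants to try something similar, here is a minimal sketch of a row-per-transaction insert loop. It is not the script used for the test: SQLite is swapped in only because it ships with Python, the table and function names are made up, and it covers the single-threaded case only.

```python
import sqlite3
import time


def insert_rows(db_path, rows):
    """Insert `rows` rows, committing each as its own transaction; return rows per second."""
    conn = sqlite3.connect(db_path)
    # Ask SQLite to sync to disk on every commit, so each row is a synchronous write.
    conn.execute("PRAGMA synchronous = FULL")
    conn.execute("CREATE TABLE IF NOT EXISTS t (id INTEGER PRIMARY KEY, payload TEXT)")
    conn.commit()

    start = time.perf_counter()
    for i in range(rows):
        conn.execute("INSERT INTO t (payload) VALUES (?)", (f"row-{i}",))
        conn.commit()  # one transaction per row
    elapsed = time.perf_counter() - start

    conn.close()
    return rows / elapsed if elapsed > 0 else float("inf")
```

Running `insert_rows("/path/on/the/disk/under/test.db", 10000)` once per scheduler gives a rough commits-per-second figure to compare; the path is a placeholder and has to live on the device whose scheduler is being changed.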
Here are some results:
[Table: results by I/O scheduler, with columns for 1 thread and 8 threads]
Here, CFQ had a very large regression at higher concurrency, and Anticipatory showed similar, slightly better results. NOOP showed results similar to deadline, much faster at a single thread and slightly slower at multiple.
So, whichever decisions CFQ takes during this test, they must all be wrong. With multiple disks and a RAID controller handling the flushing of the write cache, there is no need for elevation or request merging: multiple tagged commands can be sent to the I/O subsystem, and they will be executed swiftly. CFQ is supposed to be a scheduler that is good for most workloads, and it probably is. But high-performance databases rely on storage being fast, and to enforce ACID requirements, synchronous operations should not wait forever (or wait at all when it is not necessary).
Of course, CFQ may provide better performance on systems that do extensive I/O scanning, so for folks with slow queries it may end up providing more throughput, as it will tolerate delaying small things for big things to get through, whereas deadline would not care about fairness and would try to get everything done as quickly as possible.
The worst part is that Deadline is enabled by default only on community distributions, like Ubuntu Server (though the Ubuntu desktop kernel has CFQ by default). So, most people will end up with an anti-database scheduler for ages, and will rarely get into the internals of the whole stack or analyze the performance profile. Switching to another scheduler is a matter of a single command (though it managed to crash my system once :)
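For reference, the scheduler of a block device is exposed through the sysfs file `/sys/block/<device>/queue/scheduler`, which lists the available schedulers with the active one in square brackets. Below is a small sketch for reading and switching it; the helper names are made up, the device name `sda` used in the comment is an assumption, and writing the file needs root.

```python
from pathlib import Path


def parse_schedulers(text):
    """Split sysfs content like 'noop anticipatory deadline [cfq]' into (active, available)."""
    names = text.split()
    # The active scheduler is the entry wrapped in square brackets.
    active = next((n[1:-1] for n in names if n.startswith("[") and n.endswith("]")), None)
    available = [n.strip("[]") for n in names]
    return active, available


def scheduler_path(device):
    """Path of the sysfs scheduler file for a block device name such as 'sda'."""
    return Path("/sys/block") / device / "queue" / "scheduler"


def get_scheduler(device):
    """Return (active, available) schedulers for the given device."""
    return parse_schedulers(scheduler_path(device).read_text())


def set_scheduler(device, name):
    """Switch the device to another scheduler (requires root).

    Shell equivalent, assuming the device is sda:
        echo deadline > /sys/block/sda/queue/scheduler
    """
    scheduler_path(device).write_text(name)
```

Calling `get_scheduler("sda")` before and after `set_scheduler("sda", "deadline")` is a cheap way to confirm the switch took effect. The change does not survive a reboot unless it is also set in the boot configuration.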