A discussion on IRC sparked some interest in how various I/O things work in Linux. I wrote a small microbenchmarking program (where all configuration is in the source file, and I/O modes can be changed by editing various places in the code ;-), and started playing with performance.
The machine for this testing was a 16-disk RAID10 box running a 2.6.24 kernel. I tried to understand how O_DIRECT works and how fsync() works, and ended up digging into some other stuff.
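The program itself isn't included here, but a minimal sketch of what such a benchmark might look like follows (file name, block size, thread count and write count are illustrative assumptions – the real program had all of this editable in the source); several threads issue aligned random O_DIRECT writes against one preallocated file, so you can see whether the filesystem serializes them. Compile with -lpthread and time it externally.

/* Hypothetical sketch, not the original benchmark: THREADS threads do
 * aligned random O_DIRECT pwrite()s to a single preallocated file. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define FILE_SIZE   (4ULL << 30)   /* 4 GiB preallocated test file */
#define BLOCK_SIZE  16384          /* 16 KiB aligned writes */
#define THREADS     16
#define WRITES      2000           /* writes per thread */

static int fd;

static void *writer(void *arg)
{
    unsigned seed = (unsigned)(long)arg;
    void *buf;
    /* O_DIRECT requires sector/page aligned buffers and offsets. */
    if (posix_memalign(&buf, 4096, BLOCK_SIZE))
        return NULL;
    memset(buf, 'x', BLOCK_SIZE);

    for (int i = 0; i < WRITES; i++) {
        /* pick a random block-aligned offset within the file */
        off_t off = ((off_t)(rand_r(&seed) % (FILE_SIZE / BLOCK_SIZE))) * BLOCK_SIZE;
        if (pwrite(fd, buf, BLOCK_SIZE, off) != BLOCK_SIZE)
            perror("pwrite");
    }
    free(buf);
    return NULL;
}

int main(void)
{
    fd = open("testfile", O_RDWR | O_DIRECT);
    if (fd < 0) { perror("open"); return 1; }

    pthread_t t[THREADS];
    for (long i = 0; i < THREADS; i++)
        pthread_create(&t[i], NULL, writer, (void *)i);
    for (int i = 0; i < THREADS; i++)
        pthread_join(t[i], NULL);

    close(fd);
    return 0;
}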
My notes for now are:
- O_DIRECT serializes writes to a file on ext2, ext3, jfs, so I got at most 200-250w/s.
- xfs allows parallel (and out-of-order, if that matters) DIO, so I got 1500-2700w/s (depending on file size – seek time changes.. :) of random I/O without write-behind caching. There are a few outstanding bugs that drop this back down to 250w/s (#xfs@freenode: “yeah, we drop back to taking the i_mutex in the case where we are writing beyond EOF or we have cached pages”), so
posix_fadvise(fd, 0, filesize, POSIX_FADV_DONTNEED)
helps (a sketch of this workaround follows after these notes).
- fsync(), sync() and fdatasync() wait while there are any outstanding writes; the bad part is that they can wait forever. Filesystem people say that's a bug – a sync shouldn't wait for I/O issued after it was called. I tend to believe them, as this causes things like InnoDB semaphore waits and such.
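Here is a minimal sketch of how the POSIX_FADV_DONTNEED workaround might be applied before switching to O_DIRECT writes (the helper name and the fdatasync() step are my assumptions, not part of the benchmark itself):

#define _XOPEN_SOURCE 600
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: write back and drop any cached pages for the file,
 * so XFS does not fall back to taking i_mutex on O_DIRECT writes. */
static int drop_cached_pages(int fd)
{
    struct stat st;
    if (fstat(fd, &st) < 0)
        return -1;
    /* Flush dirty pages first – POSIX_FADV_DONTNEED only drops clean ones. */
    if (fdatasync(fd) < 0)
        return -1;
    return posix_fadvise(fd, 0, st.st_size, POSIX_FADV_DONTNEED);
}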
Of course, having write-behind caching at the controller (or disk, *shudder*) level allows filesystems to be lazy (and benchmarks are no longer that different), but having the upper layers work efficiently is quite important too, to avoid bottlenecks.
It is interesting that write-behind caching isn't needed that much anymore for random writes once the filesystem parallelizes I/O, even direct, unbuffered I/O.
Anyway, now that I've found some of these I/O properties and issues, I should probably start thinking about how they apply to the upper layers like InnoDB.. :)
Domas,
So if you use O_DIRECT with ext2 and InnoDB, do you want to use file-per-table to reduce the problem of not getting concurrent I/O requests to a single file?
Mark,
Indeed, using file-per-table helps a lot, but you also need to keep in mind UNDO (in the main tablespace), and you may have a couple of tables being hot even with a file_per_table configuration.
In general this is a big gotcha of O_DIRECT for write-intensive workloads.
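For reference, the relevant my.cnf switches would look something like this (both are standard InnoDB options; whether they help enough depends on the workload):

[mysqld]
innodb_file_per_table = 1
innodb_flush_method   = O_DIRECT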
Mark,
I didn't test this with multiple files, but it may be one of the workarounds. Of course, if you have write-behind caching at the RAID controller level, it mitigates much of the issue.
I didn't try AIO though – that may be another possible workaround (see the sketch below). It can all also be kernel dependent; there are lots and lots of variables around, and it is better to test ;-)
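As an illustration only (I haven't benchmarked this variant), keeping several O_DIRECT writes in flight from a single thread with Linux native AIO (libaio, link with -laio) could look roughly like this; queue depth, block size and file name are assumptions:

#define _GNU_SOURCE
#include <fcntl.h>
#include <libaio.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BLOCK_SIZE 16384
#define DEPTH      32          /* requests kept in flight */

int main(void)
{
    int fd = open("testfile", O_RDWR | O_DIRECT);
    if (fd < 0) { perror("open"); return 1; }

    io_context_t ctx = 0;
    if (io_setup(DEPTH, &ctx) != 0) { fprintf(stderr, "io_setup failed\n"); return 1; }

    struct iocb cbs[DEPTH], *cbp[DEPTH];
    void *bufs[DEPTH];

    for (int i = 0; i < DEPTH; i++) {
        posix_memalign(&bufs[i], 4096, BLOCK_SIZE);
        memset(bufs[i], 'x', BLOCK_SIZE);
        /* random-ish block-aligned offsets, just for illustration */
        long long off = (long long)(rand() % 65536) * BLOCK_SIZE;
        io_prep_pwrite(&cbs[i], fd, bufs[i], BLOCK_SIZE, off);
        cbp[i] = &cbs[i];
    }

    /* submit the whole batch at once, then wait for all completions */
    if (io_submit(ctx, DEPTH, cbp) != DEPTH) { fprintf(stderr, "io_submit failed\n"); return 1; }

    struct io_event events[DEPTH];
    io_getevents(ctx, DEPTH, DEPTH, events, NULL);

    io_destroy(ctx);
    close(fd);
    return 0;
}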
I reproduced the same performance (serialized writes when one file is used) for ext2, O_DIRECT and SW RAID 0.
It will be fun to reproduce it on ext4.
An xfs vs ext4 match!
Domas,
Did you ever find a kernel/xfs version that doesn’t serialize requests when using XFS/MySQL/O_DIRECT?
As far as I can tell, the only workaround (if you want O_DIRECT) is to have a battery backed write cache to minimize the serialization impact, since it seems that XFS, ext3, etc. all have this issue.
None of them serialize if there are no dirty pages.
All of them do if there are.
Just make sure you don't have dirty pages (don't open ibdata with other programs), and you'll be fine.
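If you want to check whether a file has pages sitting in the page cache at all (this shows cached pages, not specifically dirty ones – the approach is just my illustration, not something from the benchmark), a small mmap()/mincore() tool works:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc < 2) { fprintf(stderr, "usage: %s file\n", argv[0]); return 1; }

    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    fstat(fd, &st);

    long pagesize = sysconf(_SC_PAGESIZE);
    size_t pages = (st.st_size + pagesize - 1) / pagesize;

    void *map = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED) { perror("mmap"); return 1; }

    /* mincore() fills one byte per page; bit 0 says "resident in cache" */
    unsigned char *vec = malloc(pages);
    mincore(map, st.st_size, vec);

    size_t cached = 0;
    for (size_t i = 0; i < pages; i++)
        cached += vec[i] & 1;

    printf("%zu of %zu pages cached\n", cached, pages);

    munmap(map, st.st_size);
    free(vec);
    close(fd);
    return 0;
}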