Notes from the land of I/O

A discussion on IRC sparked some interest in how various I/O things work in Linux. I wrote a small microbenchmarking program (where all configuration is in the source file, and I/O modes can be changed by editing various places in the code ;-), and started playing with performance.

The machine for this testing was a 16-disk RAID10 box running a 2.6.24 kernel. I tried to understand how O_DIRECT and fsync() work, and ended up digging into some other stuff.

My notes for now are:

  • O_DIRECT serializes writes to a file on ext2, ext3 and jfs, so I got at most 200-250w/s.
  • xfs allows parallel (and out-of-order, if that matters) DIO, so I got 1500-2700w/s (depending on file size – seek time changes.. :) of random I/O without write-behind caching. There are a few outstanding bugs that lock this back down to 250w/s (#xfs@freenode: “yeah, we drop back to taking the i_mutex in teh case where we are writing beyond EOF or we have cached pages”, so

    posix_fadvise(fd, 0, filesize, POSIX_FADV_DONTNEED)

    helps – see the sketch right after these notes).

  • fsync(), sync() and fdatasync() wait if there are any outstanding writes; the bad part – they can wait forever. Filesystem people say that's a bug – a sync shouldn't wait for I/O issued after the sync was called. I tend to believe them, as it causes stuff like InnoDB semaphore waits and such.
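
For reference, here is a minimal sketch of the kind of write loop behind these numbers – not the actual program; the file name, file size and write count are made up, the test file is assumed to already exist at full size, and the defines toggle between the modes discussed above (the w/s rates come from watching iostat while it runs):

/* random-write microbenchmark sketch: pre-allocated "testfile", 4 KiB writes */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

#define BLOCK       4096                      /* write size; O_DIRECT needs alignment */
#define FILE_SIZE   (1024LL * 1024 * 1024)    /* 1 GiB test file (made up) */
#define WRITES      10000
#define USE_ODIRECT 1                         /* 0 = buffered writes */
#define USE_FSYNC   0                         /* 1 = fsync() after every write */

int main(void)
{
    int flags = O_WRONLY;
#if USE_ODIRECT
    flags |= O_DIRECT;
#endif
    int fd = open("testfile", flags);
    if (fd < 0) { perror("open"); return 1; }

    /* drop cached pages, so xfs does not fall back to i_mutex serialization for DIO */
    posix_fadvise(fd, 0, FILE_SIZE, POSIX_FADV_DONTNEED);

    void *buf;
    if (posix_memalign(&buf, BLOCK, BLOCK)) { perror("posix_memalign"); return 1; }
    memset(buf, 'x', BLOCK);

    srand(time(NULL));
    for (int i = 0; i < WRITES; i++) {
        /* random block-aligned offset inside the file (stays below EOF) */
        off_t off = (off_t)(rand() % (FILE_SIZE / BLOCK)) * BLOCK;
        if (pwrite(fd, buf, BLOCK, off) != BLOCK) { perror("pwrite"); return 1; }
#if USE_FSYNC
        fsync(fd);    /* or fdatasync(fd) */
#endif
    }
    free(buf);
    close(fd);
    return 0;
}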

Of course, having write-behind caching at the controller (or disk, *shudder*) level allows filesystems to be lazy (and benchmarks are no longer that different), but having the upper layers work efficiently is quite important too, to avoid bottlenecks.

It is interesting that write-behind caching isn't needed that much anymore for random writes once the filesystem parallelizes I/O – even direct, unbuffered I/O.

Anyway, now that I've found some of these I/O properties and issues, I should probably start thinking about how they apply to the upper layers like InnoDB.. :)

On blocking

If a process has two blocking operations, each blocking the other (like disk I/O and networking), the theoretical performance decrease is 50%. The solution is very easy – convert one of the operations (quite often the one that blocks less, but I guess it doesn't matter that much) into a nonblocking one.
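
To illustrate (this is not MySQL's or rsync's actual code – the socketpair just stands in for a real client connection), switching a socket to nonblocking mode means a full send queue no longer stalls the thread; the write returns EAGAIN and the thread can go do its other work:

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* switch an already-open socket to nonblocking mode */
static int make_nonblocking(int fd)
{
    int fl = fcntl(fd, F_GETFL, 0);
    return fl < 0 ? -1 : fcntl(fd, F_SETFL, fl | O_NONBLOCK);
}

int main(void)
{
    int sv[2];    /* sv[0] plays the server side, sv[1] a client that never reads */
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0) { perror("socketpair"); return 1; }
    if (make_nonblocking(sv[0]) < 0) { perror("fcntl"); return 1; }

    char chunk[4096];
    memset(chunk, 'x', sizeof(chunk));

    /* keep writing until the kernel send buffer is full; a blocking socket
       would hang here, a nonblocking one returns EAGAIN so the caller can
       do other work and retry once poll()/select() says the fd is writable */
    for (;;) {
        ssize_t n = write(sv[0], chunk, sizeof(chunk));
        if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
            printf("send queue full -- would block, doing other work instead\n");
            break;
        }
        if (n < 0) { perror("write"); return 1; }
    }
    close(sv[0]);
    close(sv[1]);
    return 0;
}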

Though MySQL has a network-write buffer, which provides some async network behavior, it still has to get a context switch into a thread to write the data out.

rsync and other file-transfer protocols are even worse in this regard. On a regular Linux machine rsync, even on a gigabit network, will keep the kernel's send queue saturated (it is 128K by default anyway).

How to make MySQL's or rsync's networking snappier? If the Send-Q column in 'netstat' is maxed out, just increase the kernel buffers instead of the process buffers:

# increase TCP max buffer size
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
# increase Linux autotuning TCP buffer limits
# min, default, and max number of bytes to use
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
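
These go into /etc/sysctl.conf and get loaded with 'sysctl -p' (or can be set on the fly with 'sysctl -w') – no reboot needed.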

This can add an additional 10-20% of file-transfer throughput (and Send-Q goes up to 500k – so it seems to be really worth it).

Shameless ad

“The Sun Fire X4240, powered by the AMD Opteron 2200 and 2300 processor series, is a two-socket, 8-core, 2RU system with up to twice the memory and storage capacity of any system in its class. It’s the first and only two-socket AMD Opteron system with sixteen hard drive slots in a 2RU form factor.”

Well, now that I work for Sun, it ends up being a shameless ad and boasting :) But back when I first saw information about this product (I wasn't at Sun yet), my first thought was "wow, that's the best machine for scaling up scaled-out environments!".

In the web database world people agree that the number of spindles (disks!) matters – remember YouTube's "think disks, not servers" mantra from the scaling panel at the MySQL Conference. Before, getting that number of spindles would've required external arrays, taking up space and sucking power (TCO! ;-)

And for us… it probably means we can finally start doing RAID10, instead of RAID0. :-)

By the way, that box even has a quad-core service processor. Way to go! :)

I/O schedulers seriously revisited

The I/O scheduler problems have drawn my attention, and rather than just trusting empirical results, I tried to do more benchmarking and analysis of why the heck strange things happen at the Linux block layer. So, here is the story, which I found quite fascinating…

On guts and I/O schedulers

Benchmarks and guts sometimes contradict each other. Like, a benchmark says "the performance difference is not big", but the guts tell otherwise (something like "OH YEAH URGHH"). I was wondering why some servers were much faster than others, and apparently different kernels had different I/O schedulers. Setting 'deadline' (the Ubuntu Server default) works miracles compared to 'cfq' (the Fedora, and probably standard Ubuntu kernel, default) on our traditional workload.
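
In case anyone wants to try this at home: on these kernels the scheduler can be switched per device at runtime by writing 'deadline' into /sys/block/sda/queue/scheduler (with sda being whatever device holds the data), or made the system-wide default with the elevator=deadline kernel boot parameter.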

Now all we need is to show some numbers, to please gut-based thinking (though it is always pleased anyway):

Deadline:

avg-cpu:  %user   %nice    %sys %iowait   %idle
           4.72    0.00    7.95   18.18   69.15

Device:    rrqm/s wrqm/s   r/s   w/s  rsec/s  wsec/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
sda          0.00   0.10 91.30 31.30 3147.20 1796.00  1573.60   898.00    40.32     0.98    7.98   3.65  44.80

CFQ:

avg-cpu:  %user   %nice    %sys %iowait   %idle
           4.65    0.00    7.62   38.26   49.48

Device:    rrqm/s wrqm/s   r/s   w/s  rsec/s  wsec/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
sda          0.00   0.10 141.26 38.86 4563.44 2571.03  2281.72  1285.51    39.61     7.61   42.52   5.38  96.98

Though the load slightly rises and drops, the await/svctm numbers are always better on deadline. The box does a high-concurrency (multiple background InnoDB readers), high-volume (>3000 SELECT/s), read-only (aka slave) workload on a ~200GB dataset, on top of a 6-disk RAID0 with write-behind cache. Whatever the next benchmarks say, my guts will still fanatically believe that deadline rocks.