<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	xmlns:georss="http://www.georss.org/georss" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#" xmlns:media="http://search.yahoo.com/mrss/"
	>

<channel>
	<title>domas mituzas &#187; directio</title>
	<atom:link href="http://dom.as/tag/directio/feed/" rel="self" type="application/rss+xml" />
	<link>http://dom.as</link>
	<description></description>
	<lastBuildDate>Thu, 02 Feb 2012 21:29:05 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.com/</generator>
<cloud domain='dom.as' port='80' path='/?rsscloud=notify' registerProcedure='' protocol='http-post' />
<image>
		<url>http://0.gravatar.com/blavatar/6e344c6e0cd7462eb056f8b98eb2cbcd?s=96&#038;d=http%3A%2F%2Fs2.wp.com%2Fi%2Fbuttonw-com.png</url>
		<title>domas mituzas &#187; directio</title>
		<link>http://dom.as</link>
	</image>
	<atom:link rel="search" type="application/opensearchdescription+xml" href="http://dom.as/osd.xml" title="domas mituzas" />
	<atom:link rel='hub' href='http://dom.as/?pushpress=hub'/>
		<item>
		<title>Logs memory pressure</title>
		<link>http://dom.as/2010/11/18/logs-memory-pressure/</link>
		<comments>http://dom.as/2010/11/18/logs-memory-pressure/#comments</comments>
		<pubDate>Thu, 18 Nov 2010 14:59:33 +0000</pubDate>
		<dc:creator>Domas Mituzas</dc:creator>
				<category><![CDATA[facebook]]></category>
		<category><![CDATA[mysql]]></category>
		<category><![CDATA[directio]]></category>
		<category><![CDATA[innodb]]></category>
		<category><![CDATA[io]]></category>
		<category><![CDATA[memory]]></category>

		<guid isPermaLink="false">http://dom.as/?p=818</guid>
		<description><![CDATA[Warning, this may be kernel version specific, albeit this kernel is used by many database systems Lately I&#8217;ve been working on getting more memory used by InnoDB buffer pool &#8211; besides obvious things like InnoDB memory tax there were seemingly &#8230; <a href="http://dom.as/2010/11/18/logs-memory-pressure/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p><i>Warning: this may be kernel-version specific, although this kernel is used by many database systems.</i></p>
<p>Lately I&#8217;ve been working on getting more memory used by the InnoDB buffer pool &#8211; besides obvious things like the InnoDB <a href='http://dom.as/2008/05/29/wasting-innodb-memory/'>memory tax</a>, there were seemingly external factors pushing MySQL into swap (even with swappiness=0). We worked a lot on picking low-hanging fruit, such as scripts that use too much memory, and those seem to be mostly gone, yet MySQL still gets way too much memory pressure from outside.</p>
<p>I grabbed my <a href='http://dom.as/2009/06/26/uncache/'>uncache</a> utility to assist with the investigation and started uncaching various bits on two systems: one with a larger buffer pool (60G), which was already being sent to swap, and one conservatively allocated (55G) machine, both 72G boxes. The initial findings were somewhat surprising &#8211; apparently on both machines most of the external-to-mysqld memory was consumed by two sets of items:</p>
<ul>
<li><b>binary logs</b> &#8211; write once, read only the tail (sometimes, if the MySQL I/O cache cannot satisfy the read) &#8211; we saw nearly 10G consumed by binlogs on conservatively allocated machines</li>
<li><b>transaction logs</b> &#8211; write many, read never (by MySQL), buffered I/O &#8211; the full set of transaction logs was found in memory</li>
</ul>
<p>It was remarkably easy to get rid of binlogs from cache, either by calling &#8216;uncache&#8217; from scripts or by using this tiny Python class:</p>
<pre>
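import ctypes
import io
libc = ctypes.CDLL("libc.so.6")
class cachedfile(io.FileIO):  # Python 3 has no "file" builtin; io.FileIO stands in
    FADV_DONTNEED = 4  # value of POSIX_FADV_DONTNEED on Linux
    def uncache(self):
        # offset 0 with length 0 covers the whole file; dirty pages are not dropped
        libc.posix_fadvise(self.fileno(), 0, 0, self.FADV_DONTNEED)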
</pre>
<p>As it was a major source of memory stress, it was a no-brainer that binlogs have to be removed from cache &#8211; something that can be serially re-read is taking space away from a buffer pool, which avoids random reads. It may even make sense to call posix_fadvise() right after writing to them.</p>
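<p>A minimal sketch of the &#8220;uncache right after the write&#8221; idea, assuming Python 3.3+ on Linux, where <code>os.posix_fadvise()</code> is available. The helper name <code>append_and_uncache</code> is made up for illustration, and this is not how mysqld itself writes binlogs. The fsync() is there because POSIX_FADV_DONTNEED only drops clean pages.</p>

```python
import os


def append_and_uncache(path, data):
    """Append data to a log file, then ask the kernel to drop its cached pages."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        written = os.write(fd, data)
        # Dirty pages are skipped by DONTNEED, so flush them to storage first.
        os.fsync(fd)
        # offset=0, length=0 means "the whole file".
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
    finally:
        os.close(fd)
    return written
```

<p>The advice is only a hint: the kernel is free to keep pages around, so checking the effect with a tool like uncache or fincore is still worthwhile.</p>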
<p>Transaction logs, on the other hand, are an entirely different beast. From the MySQL perspective they should be uncached immediately, as nobody ever reads them (crash recovery aside, but re-reading them then is relatively cheap, as no writes or random reads are done during the log read phase). Unfortunately, the problem lies way below MySQL; thanks to PeterZ for reminding me (we had a small chat about this at Jeremy&#8217;s <a href='http://www.meetup.com/mysql-silicon-valley/'>Silicon Valley MySQL Meetup</a>).</p>
<p>MySQL transaction records are stored in multiple log groups per transaction, then written out as per-log-group writes (each is a multiple of 512 bytes), followed by fsync(). This allows the FS to do a transaction log write as a single I/O operation. It also means partial page writes to buffered files &#8211; a write overwrites existing data in only part of a page, so the rest of that page has to be read from storage first, unless it is already cached.</p>
<p>So, if all transaction log pages are removed from cache, quite a few of them will have to be read back in (depending on the sizes of transactions, probably all of them in some cases). Oddly enough, when I tried to hit the edge case, single-thread transactions-per-second remained the same, but I saw consistent read I/O traffic on disks. So this would probably work on systems that have spare I/O capacity (e.g. flash-based ones).</p>
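<p>To make the partial page write concrete, here is a small worked sketch. It assumes a 4096-byte page (typical on x86, but an assumption here) and a hypothetical helper name, <code>pages_needing_read</code>. It lists the pages a buffered write covers only partly, which are the ones the kernel has to read in if they are not cached.</p>

```python
PAGE_SIZE = 4096  # assumed page size; check with os.sysconf("SC_PAGE_SIZE")


def pages_needing_read(offset, length, page_size=PAGE_SIZE):
    """Return indexes of pages only partly covered by a write of `length` bytes at `offset`."""
    if length == 0:
        return []
    end = offset + length
    first = offset // page_size
    last = (end - 1) // page_size
    partial = []
    for page in range(first, last + 1):
        page_start = page * page_size
        page_end = page_start + page_size
        # Number of bytes of this page that the write overwrites.
        covered = min(end, page_end) - max(offset, page_start)
        if covered != page_size:
            partial.append(page)
    return partial


# A 1024-byte log write at offset 512 touches only part of page 0.
print(pages_needing_read(512, 1024))
# A write that straddles a page boundary leaves two partial pages.
print(pages_needing_read(3584, 1024))
```

<p>Only writes that start and end exactly on page boundaries avoid the read-in, which 512-byte-multiple log writes rarely do.</p>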
<p>Of course, as writes are already in multiples of 512 bytes (and it appears that the memory was allocated just fine too), I could try out direct I/O &#8211; it should avoid the page read-in problem and not cause any memory pressure by itself. In this case switching InnoDB to use O_DIRECT was a bit dirtier &#8211; one needs to edit the source code, rebuild the server, restart, etc., or&#8230;<br />
<code><br />
# lsof ib_logfile*<br />
# gdb -p $(pidof mysqld)<br />
(gdb) call os_file_set_nocache(9, "test", "test")<br />
(gdb) call os_file_set_nocache(10, "test", "test")<br />
</code><br />
I did not remove the fsync() call; as it is somewhat of a no-op on O_DIRECT files, I left it there. Removing it would probably change benchmark results, but not by much.</p>
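<p>For readers who want to poke at direct I/O without attaching gdb to a server, a rough Python sketch follows. It assumes Linux and Python 3.3+; <code>open_log</code> and <code>write_block</code> are hypothetical names, the 4096-byte block size is an assumption (the post&#8217;s logs used 512-byte multiples, and the real requirement depends on the device and filesystem), and the fallback to buffered I/O exists only because some filesystems reject O_DIRECT.</p>

```python
import errno
import mmap
import os


def open_log(path, direct=True):
    """Open a file for writing, with O_DIRECT if the filesystem accepts it.

    Returns (fd, is_direct).
    """
    flags = os.O_WRONLY | os.O_CREAT
    if direct:
        try:
            return os.open(path, flags | os.O_DIRECT, 0o644), True
        except OSError as exc:
            # Filesystems without direct I/O support report EINVAL on open.
            if exc.errno != errno.EINVAL:
                raise
    return os.open(path, flags, 0o644), False


def write_block(fd, offset, payload, block_size=4096):
    """Write payload at offset; both must be multiples of block_size."""
    if not payload or offset % block_size or len(payload) % block_size:
        raise ValueError("offset and length must be non-zero multiples of block_size")
    # An anonymous mmap is page-aligned, which satisfies O_DIRECT's
    # memory alignment requirement; a plain bytes object may not be aligned.
    buf = mmap.mmap(-1, len(payload))
    try:
        buf.write(payload)
        return os.pwrite(fd, buf, offset)
    finally:
        buf.close()
```

<p>With O_DIRECT the data bypasses the page cache, so there is neither a read-in of the old page nor cached log pages competing with the buffer pool.</p>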
<p>Some observations:</p>
<ul>
<li>O_DIRECT was ~10% faster in the best-case scenario &#8211; lots of tiny transactions in a single thread</li>
<li>If group commit is used (without binlogs), InnoDB can do way more transactions with multiple threads using buffered I/O, as it does multiple writes per fsync</li>
<li>Enabling sync_binlog makes the difference not that big &#8211; even with many parallel writes, direct writes are 10-20% slower than buffered ones</li>
<li>Same for innodb_flush_log_at_trx_commit=0 &#8211; multiple writes per fsync are much more efficient with buffered I/O</li>
<li>One would need to do a log group merge to have more efficient O_DIRECT for larger transactions</li>
<li>O_DIRECT does not have a theoretical disadvantage; the current deficiencies are just the implementation being oriented at buffered I/O &#8211; and can be resolved by (in some areas &#8211; extensive) engineering</li>
<li>YMMV. In certain cases it definitely makes sense even right now, in some others &#8211; not so much</li>
</ul>
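<p>The &#8220;multiple writes per fsync&#8221; point in the observations above can be sketched roughly like this. It is a toy illustration, not InnoDB&#8217;s group commit: <code>write_records</code> and <code>group_size</code> are made-up names, and the function only counts how many fsync() calls a given grouping needs.</p>

```python
import os


def write_records(fd, records, group_size=1):
    """Write records to fd, calling fsync() once per group_size records.

    Returns the number of fsync() calls issued.
    """
    if group_size < 1:
        raise ValueError("group_size must be at least 1")
    fsyncs = 0
    pending = 0
    for record in records:
        os.write(fd, record)
        pending += 1
        if pending == group_size:
            os.fsync(fd)
            fsyncs += 1
            pending = 0
    if pending:
        # Flush the last, incomplete group.
        os.fsync(fd)
        fsyncs += 1
    return fsyncs
```

<p>With buffered I/O the writes inside a group are cheap memory copies and the single fsync() pays for all of them; with O_DIRECT every write is its own trip to storage, which is why grouping helps buffered I/O more.</p>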
<p>So, the outcome here depends on many variables &#8211; with flash, read-on-write is not as expensive, especially if read-ahead works. With disks one has to see what is the better use for the memory &#8211; using it for the buffer pool reduces the amount of data reads, but causes log reads. And of course, O_DIRECT wins in the long run :-)</p>
<p>With this data moved out of cache and the InnoDB memory tax reduced, one could switch from using 75% of memory to 90% or even 95% for InnoDB buffer pools. Yay?</p>
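<p>As a quick back-of-the-envelope check of those percentages on a 72G box like the ones above (the helper name <code>split_memory</code> is hypothetical, and the split ignores any per-page overhead):</p>

```python
def split_memory(total_gb, percent):
    """Return (buffer pool size, memory left for everything else) in GB."""
    pool = total_gb * percent / 100
    rest = total_gb * (100 - percent) / 100
    return pool, rest


for percent in (75, 90, 95):
    pool, rest = split_memory(72, percent)
    print(percent, pool, rest)
```

<p>That works out to 54G, 64.8G and 68.4G for the buffer pool, leaving 18G, 7.2G and 3.6G for everything else. The nearly 10G of cached binlogs seen earlier would not fit in what is left at 90%, which is why they have to leave the cache first.</p>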
]]></content:encoded>
			<wfw:commentRss>http://dom.as/2010/11/18/logs-memory-pressure/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/c660a6eb3a4005232acb111303bef12c?s=96&#38;d=http%3A%2F%2Fs0.wp.com%2Fi%2Fmu.gif&#38;r=G" medium="image">
			<media:title type="html">domasmituzas</media:title>
		</media:content>
	</item>
		<item>
		<title>Notes from land of I/O</title>
		<link>http://dom.as/2008/08/11/notes-from-land-of-io/</link>
		<comments>http://dom.as/2008/08/11/notes-from-land-of-io/#comments</comments>
		<pubDate>Mon, 11 Aug 2008 12:52:00 +0000</pubDate>
		<dc:creator>Domas Mituzas</dc:creator>
				<category><![CDATA[mysql]]></category>
		<category><![CDATA[directio]]></category>
		<category><![CDATA[innodb]]></category>
		<category><![CDATA[io]]></category>
		<category><![CDATA[linux]]></category>
		<category><![CDATA[xfs]]></category>

		<guid isPermaLink="false">http://dammit.lt/?p=184</guid>
		<description><![CDATA[A discussion on IRC sparked some interest in how various I/O things work in Linux. I wrote a small microbenchmarking program (where all configuration is in the source file, and I/O modes can be changed by editing various places in the code ;-), &#8230; <a href="http://dom.as/2008/08/11/notes-from-land-of-io/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>A discussion on IRC sparked some interest in how various I/O things work in Linux. I wrote a small microbenchmarking <a href='http://noc.wikimedia.org/~midom/raidbench.c.txt'>program</a> (where all configuration is in the source file, and I/O modes can be changed by editing various places in the code ;-), and started playing with performance.</p>
<p>The machine for this testing was a 16-disk RAID10 box with a 2.6.24 kernel. I tried to understand how O_DIRECT works and how fsync() works, and ended up digging into some other stuff.</p>
<p>My notes for now are:</p>
<ul>
<li>O_DIRECT serializes writes to a file on ext2, ext3 and jfs, so I got at most 200-250w/s.</li>
<li>xfs allows parallel (and out-of-order, if that matters) DIO, so I got 1500-2700w/s (depending on file size &#8211; seek time changes.. :) of random I/O without write-behind caching. There are a few outstanding bugs that lock this back down to 250w/s (<i>#xfs@freenode: &#8220;yeah, we drop back to taking the i_mutex in teh case where we are writing beyond EOF or we have cached pages&#8221;</i>, so
<pre>posix_fadvise(fd, 0, filesize, POSIX_FADV_DONTNEED)</pre>
helps).</li>
<li>fsync(), sync() and fdatasync() wait if there are any writes; the bad part &#8211; they can wait forever. Filesystem people say that&#8217;s a bug &#8211; a sync shouldn&#8217;t wait for I/O issued after it was called. I tend to believe them, as it causes stuff like InnoDB semaphore waits and such.</li>
</ul>
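<p>The notes above can be turned into a rough experiment. The sketch below is a much reduced take on the idea, not a port of the linked raidbench program: it assumes Linux and Python 3.3+, uses made-up names (<code>drop_cached_pages</code>, <code>random_writes</code>), a 4096-byte block, and a sparse file created with ftruncate(), where a serious benchmark would preallocate real blocks. Pass <code>extra_flags=os.O_DIRECT</code> on a filesystem that supports it to exercise direct I/O.</p>

```python
import mmap
import os
import random
import time
from concurrent.futures import ThreadPoolExecutor

BLOCK = 4096  # assumed write size and alignment


def drop_cached_pages(fd):
    """Flush the file and ask the kernel to drop its cached pages."""
    os.fsync(fd)
    size = os.fstat(fd).st_size
    os.posix_fadvise(fd, 0, size, os.POSIX_FADV_DONTNEED)


def random_writes(path, file_size, writes, threads=4, extra_flags=0):
    """Issue random block writes inside the file from several threads.

    Returns (writes done, elapsed seconds). Any remainder of
    writes / threads is dropped to keep the sketch short.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT | extra_flags, 0o644)
    try:
        os.ftruncate(fd, file_size)  # stay inside EOF while writing
        drop_cached_pages(fd)        # no cached pages left behind
        blocks = file_size // BLOCK

        def worker(count):
            buf = mmap.mmap(-1, BLOCK)  # page-aligned buffer, usable with O_DIRECT
            rng = random.Random()
            done = 0
            try:
                for _ in range(count):
                    os.pwrite(fd, buf, rng.randrange(blocks) * BLOCK)
                    done += 1
            finally:
                buf.close()
            return done

        started = time.monotonic()
        with ThreadPoolExecutor(max_workers=threads) as pool:
            total = sum(pool.map(worker, [writes // threads] * threads))
        elapsed = time.monotonic() - started
    finally:
        os.close(fd)
    return total, elapsed
```

<p>Dividing the write count by the elapsed time gives a w/s figure comparable in spirit to the ones above, though Python overhead and the sparse file make absolute numbers unreliable.</p>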
<p>Of course, having write-behind caching at the controller (or disk, *shudder*) level allows filesystems to be lazy (and benchmarks are no longer that different), but having the upper layers work efficiently is quite important too, to avoid bottlenecks.</p>
<p>It is interesting that write-behind caching isn&#8217;t needed that much anymore for random writes once the filesystem parallelizes I/O, even direct, non-buffered I/O.</p>
<p>Anyway, now that I have found some I/O properties and issues, I should probably start thinking about how they apply to upper layers like InnoDB.. :)</p>
]]></content:encoded>
			<wfw:commentRss>http://dom.as/2008/08/11/notes-from-land-of-io/feed/</wfw:commentRss>
		<slash:comments>8</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/c660a6eb3a4005232acb111303bef12c?s=96&#38;d=http%3A%2F%2Fs0.wp.com%2Fi%2Fmu.gif&#38;r=G" medium="image">
			<media:title type="html">domasmituzas</media:title>
		</media:content>
	</item>
	</channel>
</rss>
