Warning, this may be kernel version specific, albeit this kernel is used by many database systems
Lately I’ve been working on getting more memory used by InnoDB buffer pool – besides obvious things like InnoDB memory tax there were seemingly external factors that were pushing out MySQL into swap (even with swappiness=0). We were working a lot on getting low hanging fruits like scripts that use too much memory, but they seem to be all somewhat gone, but MySQL has way too much memory pressure from outside.
I grabbed my uncache utility to assist with the investigation and started uncaching various bits on two systems, one that had larger buffer pool (60G), which was already being sent to swap, and a conservatively allocated (55G) machine, both 72G boxes. Initial finds were somewhat surprising – apparently on both machines most of external-to-mysqld memory was conserved by two sets of items:
- binary logs – write once, read only tail (sometimes, if MySQL I/O cache cannot satisfy) – we saw nearly 10G consumed by binlogs on conservatively allocated machines
- transaction logs – write many, read never (by MySQL), buffered I/O – full set of transaction logs was found in memory
It was remarkably easy to get rid of binlogs from cache, both by calling out ‘uncache’ from scripts, or using this tiny Python class:
libc = ctypes.CDLL("libc.so.6") class cachedfile (file): FADV_DONTNEED = 4 def uncache(self): libc.posix_fadvise(self.fileno(), 0, 0, self.FADV_DONTNEED)
As it was major memory stress source, it was somewhat a no brainer that binlogs have to be removed from cache – something that can be serially re-read is taking space away from a buffer pool which avoids random reads. It may make sense to call posix_fadvise() right after writes to them, even.
Transaction logs, on the other hand, are entirely different beast. From MySQL perspective they should be uncached immediately, as nobody ever ever reads them (crash recovery aside, but re-reading then is relatively cheap, as no writes or random reads are done during log read phase). Unfortunately, the problem lies way below MySQL, and thanks to PeterZ for reminding me (we had a small chat about this at Jeremy’s Silicon Valley MySQL Meetup).
MySQL transaction records are stored in multiple log groups per transaction, then written out as per-log-group writes (each is in multiple of 512 bytes), followed by fsync(). This allows FS to do transaction log write as single I/O operation. This also means that it will be doing partial page writes to buffered files – overwriting existing data in part of the page, so it has to be read from storage.
So, if all transaction log pages are removed from cache, quite some of them will have to be read back in (depending on sizes of transactions, probably all of them in some cases). Oddly enough, when I tried to hit the edge case, single thread transactions-per-second remained same, but I saw consistent read I/O traffic on disks. So, this would probably work on systems, that have spare I/O (e.g. flash based ones).
Of course, as writes are already in multiples of 512 (and appears that memory got allocated just fine), I could try out direct I/O – it should avoid page read-in problem and not cause any memory pressure by itself. In this case switching InnoDB to use O_DIRECT was a bit dirtier – one needs to edit source code and rebuild the server, restart, etc, or…
# lsof ib_logfile*
# gdb -p $(pidof mysqld)
(gdb) call os_file_set_nocache(9, "test", "test")
(gdb) call os_file_set_nocache(10, "test", "test")
I did not remove fsync() call, but as it is somewhat noop on O_DIRECT files, I left it there, probably it would change benchmark results, but not much.
- O_DIRECT was ~10% faster at best case scenario – lots of tiny transactions in single thread
- If group commit is used (without binlogs), InnoDB can have way more transactions with multiple threads using buffered I/O, as it does multiple writes per fsync
- Enabling sync_binlog makes the difference not that big – even with many parallel writes direct writes are 10-20% slower than buffered ones
- Same for innodb_flush_log_on_trx_commit0 – multiple writes per fsync are much more efficient with buffered I/O
- One would need to do log group merge to have more efficient O_DIRECT for larger transactions
- O_DIRECT does not have theoretical disadvantage, current deficiencies are just implementation oriented at buffered I/O – and can be resolved by (in same areas – extensive) engineering
- YMMV. In certain cases it definitely makes sense even right now, in some other – not so much
So, the outcome here depends on many variables – with flash read-on-write is not as expensive, especially if read-ahead works. With disks one has to see what is better use for the memory – using it for buffer pool reduces amount of data reads, but causes log reads. And of course, O_DIRECT wins in the long run :-)
With this data moved away from cache and InnoDB memory tax reduced one could switch from using 75 % of memory to 90% or even 95% for InnoDB buffer pools. Yay?