on copying files

After hitting a stupid kernel bug/behavior with rsync (um, memory pressure on a socket caused it to do some expensive cleanups on every packet) I ended up using ‘nc’ for reimaging slaves and copying files from server X to server Y:

dbX# nc -l -p 8888 | tar xf -
dbY# tar cf - . | nc dbX 8888

This would happily sustain gigabit traffic, and I guess it could easily quadruple on 10GE without too much effort.

The only remaining problem with this method is copying from a single source to multiple destinations. There are some tools that use multicast for sending out data streams, but due to its lossy nature corruption apparently does happen. I figured there should be some easy way – chaining ‘nc’ commands. The only problem – I didn’t know how to do that in practice, until helpful people in #bash@freenode assisted:

dbX# nc -l -p 8888 | tar xf -
# cat >/dev/null serves as a fallback in
# case some intermittent I/O errors happen
dbY# nc -l -p 8888 | tee >( tar xf -; cat >/dev/null ) | nc dbX 8888
dbZ# tar cf - . | nc dbY 8888

End result – with networks being full-duplex, one can daisy-chain as many servers as needed, and a single stream will be extracted on all of them :-) And for the curious, >(command) does this (from the bash manual):

Process substitution is supported on systems that support named pipes (FIFOs) or the /dev/fd method of naming open files. It takes the form of <(list) or >(list). The process list is run with its input or output connected to a FIFO or some file in /dev/fd. The name of this file is passed as an argument to the current command as the result of the expansion. If the >(list) form is used, writing to the file will provide input for list.
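
Just to illustrate how the chain extends (dbW here is a made-up extra hop), every intermediate box extracts its own copy and forwards the stream further down:

dbX# nc -l -p 8888 | tar xf -
dbY# nc -l -p 8888 | tee >( tar xf -; cat >/dev/null ) | nc dbX 8888
dbW# nc -l -p 8888 | tee >( tar xf -; cat >/dev/null ) | nc dbY 8888
dbZ# tar cf - . | nc dbW 8888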

Memcached for small objects

Memcached quite often ends up as a store for very small objects (a small key and some integer value), though it isn’t really designed for this kind of work by default. Current memory management is based on slabs (200 of them), where objects are grouped by similar size – though the actual sizes are pre-defined at startup based on a few configuration parameters.

By default memcached builds its slabs on the assumption that the smallest object will have 48 bytes of data (that’s without the item header), and increases the slab sizes in +25% steps:

slab class   1: chunk size    104 perslab 10082
slab class   2: chunk size    136 perslab  7710
slab class   3: chunk size    176 perslab  5957
slab class   4: chunk size    224 perslab  4681
...

So, in this case, it allocates at least 104 bytes per object, and the next steps are quite far apart. Fortunately, there are some quick steps to get better efficiency:
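
For illustration only (the exact values depend on your object sizes), the knobs involved are the -n (minimum item size) and -f (growth factor) startup options – lowering them gives much denser slab classes for tiny items:

# -n: minimum space allocated for key+data+flags, default 48
# -f: chunk size growth factor, default 1.25
memcached -m 1024 -n 5 -f 1.05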

Packing for MySQL Conference 2009

Yay, coming to Santa Clara again (4th conference in a row! :). I can’t imagine my year without the MySQL Conference trip anymore. To get a free ticket I’ll present on two topics: MySQL Security (lately I have a related role, and have prepared a bunch of information already) and deep-inspecting MySQL with DTrace (a voodoo session for all happy Solaris and MacOSX users :). See you there?

Tim is now vocal

Tim at the datacenter
Tim is one of the most humble and intelligent developers I’ve ever met – and we’re extremely happy to have him at Wikimedia. Now he has a blog, where the first entry is already epic by any standards. I mentioned the IE bug, and Tim has done a thorough analysis of this one and similar problems.

I hope he continues to disclose the complexity of real web applications – that will always be a worthy read.

ZFS and MySQL … not yet

Today I attended a kick-ass ZFS talk (3 hours of incredibly detailed material presented by someone who knows the stuff and knows how to talk about it) at CEC (a Sun internal training event/conference), so now I know way more about ZFS than I used to. Probably I know way more about ZFS than the average Joe DBA \o/

And now I think ZFS has lots of brilliant design and implementation bits, except it doesn’t match database access pattern needs. 

See, ZFS is not a regular POSIX-API -> HDD bridge, unlike pretty much everything out there. It is a transactional object store that allows multiple access semantics and APIs, and the standard ZFS POSIX Layer (ZPL) is just one of them. In MySQL terms, think of all other filesystems as MyISAM, and of ZFS as InnoDB :-)

So, putting InnoDB on top of ZFS, after some high-school-like variable replacement, ends up as “putting InnoDB on top of InnoDB”. Let’s go a bit deeper here:

  • ZFS has checksums, so does InnoDB (though ZFS checksums are faster, Fletcher-derived, etc ;-)
  • ZFS has atomicity, so does InnoDB
  • ZFS has ZIL (Intent Log), so does InnoDB (Transaction Log)
  • ZFS has background intelligent flushing of data, so does InnoDB (maybe not that intelligent though)
  • ZFS has Adaptive Replacement Cache, so does InnoDB (calls it Buffer Pool, instead of three replacement queues uses just one – LRU, doesn’t account for MFU)
  • ZFS has copy-on-write snapshotting, so does InnoDB (MVCC!)
  • ZFS has compression, so does InnoDB (in plugin, though)
  • ZFS has intelligent mirroring/striping/etc, this is why InnoDB people use RAID controllers.
  • ZFS has bit-rot recovery and self healing and such, InnoDB has assertions and crashes :-)

So, we have two intelligent layers on top of each other, and lots of work gets duplicated. Of course, we can try to eliminate some bits:

  • Disable checksums at the InnoDB level (see the sketch after this list)
  • Unfortunately, there’s nothing to be done about the two transaction logs
  • Dirty pages can be flushed immediately by InnoDB; this is probably tunable at the ZFS level too
  • The InnoDB buffer pool could probably be reduced to favor the ARC, or the opposite…
  • Double copy-on-write is inevitable (and a copy-on-write transaction log does not really make sense…)
  • Compression can be done at either level
  • Using ZFS for volume management would be the major real win, as well as all the self-healing capacity
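
A rough sketch of what that trimming could look like (pool/filesystem names are made up, and this is hand-waving rather than a tested recommendation):

# ZFS side: match the InnoDB page size, compress here instead of in InnoDB
zfs set recordsize=16k tank/mysql
zfs set compression=on tank/mysql

# my.cnf side: let ZFS do the checksumming
[mysqld]
innodb_checksums = 0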

So, I’m not too convinced at this moment about using this combo, but there’s another idea that has been circulating around for quite a while – what if MyISAM suddenly started using all the ZFS capabilities? Currently the ZPL and the actual ZFS object store management are mutually exclusive – you have to pick one way – but if the ZPL were extended to support a few simple operations (create/drop snapshots on a single file, wrap multiple write() calls into a transaction), MyISAM could get a different life:

  • Non-blocking SELECTs could be implemented using snapshots
  • Writes would be atomic and non-corrupting
  • MyISAM would get checksummed, compressed, consistent data, that is flushed by intelligent background threads, and would have immediate crash recovery
  • For replication slaves write concurrency would not be that necessary (single thread is updating data anyway)
  • “Zettabyte” (was told not to use this ;-) File System would actually allow Zettabyte-MyISAM-Tables o/
  • All the Linux people (including me :) would complain about Sun doing something just for [Open]Solaris, instead of working on [insert favorite storage engine here]. 

Unfortunately, to implement that now one would have to either tap directly into the object management API (which would mean quite a bit of rewriting), or wait for the ZFS people to extend the ZPL calls. So for now, I’d say, “not yet”.

Disclaimer: the opinion of the author does not represent the opinion of his employer (especially the Marketing people), and may be affected by the fact that the author was enjoying free wireless and who knows what else at Las Vegas McCarran International Airport.

On SSDs, rotations and I/O

Every time anyone mentions SSDs, I get a feeling of futility and of becoming useless in the near future. I have spent way too much time working around the limitations of rotational media, and understanding the implications across the whole vertical data stack on top.

The most interesting upcoming shift is not just the speed difference, but simply the different cost balance between reads and writes. With modern RAID controllers, modern disks and modern filesystems, reads are a way more expensive operation from the application perspective than writes are.

Writes can be merged at the application and OS levels, buffered at the I/O controller level, and even sped up by the on-disk volatile cache (NCQ write reordering can give +30% faster random write performance).

Reads have none of that. Of course, there are caches, but they don’t speed up actual read operations, they just help to avoid them. This leads to a very disproportionate amount of cache needed for reads, compared to writes.

Simply put, a 32GB system with MySQL/InnoDB will be wasting 4GB on mutexes (argh!!..), a few more gigs on the data dictionary (arghhh #2), and everything else on read caching inside the buffer pool. There may be some dirty pages and adaptive hash or insert buffer entries, but they are all there not because the system lacks write output capacity, but simply because of the braindead InnoDB page flushing policy.

Also, database write performance is mostly impacted not by the actual underlying write speed, but simply because every write has to read from multiple scattered places to actually find what needs to be changed.

This is where SSDs matter – they will have the same satisfactory write performance (and fixes for InnoDB are out there ;-) – but the read performance will be satisfactory (uhm, much much better) too.

What this would mean for MySQL use:

  • Buffer pool, key cache, read-ahead buffers – all gone (or drastically reduced).
  • Data locality wouldn’t matter that much anymore either; covering indexes would provide just double the performance, rather than up to a 100x speed increase (see the example after this list).
  • Re-reading data may be cheaper than including it in various temporary sorting and grouping structures.
  • RAIDs are no longer needed (?), though RAM-backed write-behind caching would still be necessary.
  • Log-based storage designs like PBXT will make much more sense.
  • Complex data flushing logic like InnoDB’s will not be useful anymore (one can say it is useless already ;-) – and straightforward methods such as Maria’s are welcome again.
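
For the curious, the covering index case mentioned above is just this kind of query (table and index names are made up) – on rotating media skipping the row lookups can be a huge win, on SSDs much less so:

-- assuming an index on (user_id, created_at, status), the query below is
-- answered from the index alone, without touching the actual row data
SELECT status FROM orders WHERE user_id = 42 AND created_at > '2009-01-01';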

Probably the happiest camp out there is the PostgreSQL crowd – data locality issues were plaguing their performance the most, while it is a strong side of InnoDB. On the other hand, MySQL has pluggable engine support, so it may be way easier to produce SSD-oriented versions of anything we have now, or just use new ones (hello, Maria!).

Still, there is quite some work needed to adapt to the new storage model, and judging by how InnoDB works with modern rotational media, we will need a very strong push to adapt it to the new stuff.

You can sense the futility of any work done to optimize for rotation – all the “make reads fast” techniques will end up being resolved at the hardware layer, and the human isn’t needed anymore (nor are all these servers with lots of memory and lots of spindles).

On XFS write barriers

I’m very naive when I have to trust software. I just can’t believe that a filesystem may have a tunable that makes it 20x faster (or rather, doesn’t make it 20x slower) – I expect it to work out of the box. So I was pondering why, in my testing, XFS on LVM flushes data ~20x faster than on a box where it talks directly to the device. Though I had noticed some warnings before, people on #xfs pointed out that LVM doesn’t support write barriers.

So, as I had no idea what write barriers are, I had to read up a bit on that. There is a very nice phrase in there regarding battery-backed write-behind caching:

Using write barriers in this instance is not warranted and will in fact lower performance. Therefore, it is recommended to turn off the barrier support and mount the filesystem with “nobarrier”.

No shit, 20x lower performance :) As usual, I was not the only one to spot that…

So, I just ran this:

mount -o remount,nobarrier /a

And InnoDB flushed pages at 80MB/s instead of 4MB/s.
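
To make it stick across remounts, the option can also go into /etc/fstab (the device name here is a made-up placeholder; /a is the mount point from above):

/dev/sdb1  /a  xfs  defaults,nobarrier  0  0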

Update (2009/03): the 2.6.29 kernel will support write barriers for LVM too – so the XFS performance degradation is to be expected at a very, very wide scope. Also, ext4 uses write barriers by default too. This thing is getting huge.

How did the Sunday go?

Yesterday Europe switched daylight saving time. Though for most people that’s a great change (you can sleep one hour longer, you’re not late for one day of the whole year, etc.), it’s a data nightmare…

  • The same time happens twice
  • Time is not a sequential value
  • Criteria for time get fuzzy

So, if you are using local time instead of UTC, your data may hit unique constraints (if anyone has unique constraints on datetimes, of course ;-), or you simply won’t be able to detect which moment of time a value refers to (bad for audit trails!).

The fun part in MySQL is that DATETIME is time-zone agnostic (it just stores presentation values), whereas TIMESTAMP is not (it stores data in UTC and presents it based on the session timezone), so the behaviors of the two will be different.

How different? DATETIME will just have non-sequential, lossy information, and TIMESTAMP will have… it depends. If implicit TIMESTAMP default values are used, it will have correct UTC-based data. If NOW() or literals are inserted, that will provide lossy, presentation-time based values. In order for the database to actually understand what a literal value means, it would need an offset included in the string (“hh:mm:ss+hhmm”) – though MySQL currently does not support this notation. The NOW() behavior is probably a bug.
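
A quick way to see the difference (a made-up table, using numeric offsets so no timezone tables are needed):

CREATE TABLE t (dt DATETIME, ts TIMESTAMP);
SET time_zone = '+03:00';
INSERT INTO t VALUES ('2009-03-29 03:30:00', '2009-03-29 03:30:00');

SET time_zone = '+00:00';
SELECT * FROM t;
-- dt comes back exactly as written: 2009-03-29 03:30:00
-- ts was converted to UTC on insert and now reads 2009-03-29 00:30:00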

Storing time correctly opens another can of worms – user-provided time criteria cannot be converted into UTC, and therefore a lossy match is done. To complicate things even more, since indexes are in sequential time but data matching is done based on non-sequential time, the data returned will differ depending on the data access method (see Bug#38455).

So, the elegant fix for timezone support inside MySQL would be extending the presentation/literal format to support specified offsets and fixing the datetime/timestamp code in multiple places – and that would be quite some work to fix one hour per year.

Timezone support is an old interest of mine – there’s lots of complexity and different implications, but most people just don’t get that far – and may lose their data consistency.

So, how did your Sunday go?