MySQL on Debian and Ubuntu

Good news: the mysql-server package doesn’t suck miserably on Debian 5.0 or Ubuntu 8.10, unlike previous versions (there were quite a few rants about CHECK TABLE being run on huge healthy InnoDB tables, mysqld_safe suddenly using 100% CPU, binlogs being placed on /var/run/ and cleaned up on restart, etc).

Now, Ubuntu provides you with an AppArmor profile (my new favorite security thing :) that doesn’t suck either, while Debian is a bit rude:
There is a ‘tasksel’ tool, which allows you to install groups of packages. Debian has a ‘Database server’ task, which will install PostgreSQL. Ubuntu is more politically correct and has renamed that task to “PostgreSQL server”.

MySQL has quite some work to do to repair its image in OS-engineer-geek mindsets; these people tend to love PG more, and once a project is community-led, the most active participants get the decisive power.

It is quite important to look at versions – Ubuntu Intrepid carries 5.0.67, and Debian carries 5.0.51. As I’m far more exposed to everyday issues within MySQL, I value distributions that don’t just backport security fixes, but use MySQL bugfix releases much more proactively. The difference between those two versions is nearly a full year of performance, stability and feature maturity work (and that’s really LOTS of real hard intensive work by my teammates). Debian will probably stay at that exact version, unless one uses ‘testing’ or ‘unstable’ packages – so over time the gap will just widen.

Oh, and why I ended up writing this today – Ubuntu/Canonical people were really nice and open, and tried to get as much feedback as possible to make MySQL shine on their distribution. Gotta say, they’re already leading :)

RESET SLAVE, bash edition

Nearly every job advertisement for MySQL DBA positions asks for ‘shell scripting’, so I decided to investigate what it is. I remembered some performance training where I was told that forking is bad, and that one should use shell builtins and features as much as possible (e.g. don’t call /usr/bin/[, just use the builtin [ instead).
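
A trivial illustration of the idea (the file name below is just an example I picked):

f=/var/lib/mysql/ib_logfile0

# builtin test and parameter expansion, zero forks:
[ -f "$f" ] && echo "${f##*/}"

# the same result, with two forks:
/usr/bin/test -f "$f" && /usr/bin/basename "$f"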

I tried to automate one MySQL DBA task (reinitializing a slave after relay log corruption, or after copying in a cloned dataset from another server) using just bash – and it kind of worked. From now on I can proudly put ‘Shell scripting’ on my resume :-)

Next step – learn JCL (some people think this is funny :)

What was your most complicated task solved with shell scripts? :)

#!/bin/bash

HOST=$1                 # slave host to reinitialize
RPASS=$(<passwordfile)  # replication password, read via redirection (no 'cat' fork)

IFS="
"

mysql -h "$HOST" -e "STOP SLAVE"

for line in $(mysql -e "SHOW SLAVE STATUS" -E -h "$HOST")
 do
	key=${line%:*}  # Split off what goes before :
	key=${key// }   # And trim spaces
	data=${line#*:} # Split off what goes after :
	data=${data// } # And trim spaces yet again!!!

	case $key in
		Exec_Master_Log_Pos) LOGPOS=$data;;
		Relay_Master_Log_File) LOGFILE=$data;;
		Master_Host) RHOST=$data;;
		Master_User) RUSER=$data;;
	esac

 done

if [ -z "$LOGPOS" -o -z "$LOGFILE" ]; then
        echo "OMG FAIL NO POSITIONS KNOWN"; exit 1;
fi

mysql -f -h "$HOST" -e "
        RESET SLAVE;
        CHANGE MASTER TO
                MASTER_HOST='$RHOST',
                MASTER_USER='$RUSER',
                MASTER_PASSWORD='$RPASS',
                MASTER_LOG_FILE='$LOGFILE',
                MASTER_LOG_POS=$LOGPOS ;
        START SLAVE
        "

Update: rewrote the last ‘mysql’ command to avoid multiple unnecessary forks! (thanks, Aidai :)

stupid dtrace trick

So, if there is no way to flush status variables of a server, one could use dtrace (and run SHOW GLOBAL STATUS ;-)

#!/usr/sbin/dtrace -Fws
/* four bytes of zero, copied over the counters below */
int *zero;
BEGIN { zero=alloca(4); *zero=0; }
/* at offset 1b of add_to_status(), ECX and EDX hold addresses of counter values;
   copyout() overwrites those 4-byte values with zero (build-specific, see P.S.) */
pid$target:mysqld:add_to_status*:1b {
        copyout(zero,uregs[R_ECX],4);
        copyout(zero,uregs[R_EDX],4);
}
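
Assuming the script above is saved as zero_status.d (the name is mine), attaching it to a running server could look something like this – it needs DTrace privileges and assumes a single mysqld instance:

/usr/sbin/dtrace -F -w -s zero_status.d -p "$(pgrep -x mysqld)"

Then run SHOW GLOBAL STATUS from another session – the values passing through add_to_status() get zeroed out on the way.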

P.S. This is just a simple proof of concept that can warp counters back, ignore some variables, and eat babies. Code position and registers are architecture, version and build dependent.
P.P.S. I’m giving dtrace talk at mysql conference as well (yay plugs ;-)

stupid innodb tricks

When it comes to extreme tuning, this is how I got +5% performance in one of my tests on an 8-CPU machine (contended at mtr_commit()):

gdb -p $(pidof mysqld) -ex "set srv_spin_wait_delay=50" -batch
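
A rough sketch for verifying that the poke actually took effect (printing the value before and after; this assumes gdb can see symbols for the InnoDB globals):

gdb -p $(pidof mysqld) -batch \
    -ex "print srv_spin_wait_delay" \
    -ex "set variable srv_spin_wait_delay=50" \
    -ex "print srv_spin_wait_delay"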

P.S. It was probably worth posting this just to point out that there are ways to fill the gap between recompiling and my.cnf settings :)

iostat -x

My favorite Linux tool in DB work is ‘iostat -x’ (and I really, really want to see it whenever I’m doing any kind of performance analysis), yet I had to learn its limitations and properties. For example, here is a 1s snapshot from a slightly overloaded 16-disk database box:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           8.12    0.00    2.57   21.65    0.00   67.66

Device:  rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s \
sda     7684.00    19.00 2420.00  498.00 81848.00  5287.00 \

        avgrq-sz avgqu-sz   await  svctm  %util
           29.86    32.99   11.17   0.34 100.00

I pasted this somewhere on IRC, and got back “doesn’t look too healthy” and that it is disk-bound. Now, to understand whether it really is, one has to understand what iostat is telling us here.

The first line of numbers shows that we’ve got plenty of CPU resources (that’s because nowadays it is quite difficult to get a box without enough CPU power, and I/O still tends to be the bottleneck) – and we spend more time waiting for I/O than executing on the CPU (which sounds normal).

Now, the actual per-disk statistics are where one should look. I used to prefer %util over the overall %iowait (I couldn’t really explain what %iowait is, but I can say what %util is). I don’t know why, but iostat puts the most interesting bits at the end, and the less interesting ones at the start:

  • %util: how much of the time the storage device had outstanding work (was busy). In proper RAID environments it is more like “how much of the time at least one disk in the RAID array had something to do”. I’m deliberately excluding any kind of cache here – if a request can be served from cache, the chance is quite negligible that it will show up in %util, unlike in other values. What this also means – the RAID subsystem can be loaded anywhere from 6.25% (one disk doing the work) to 100% (all of them busy). That’s quite a lot of insight packed into a single value of ‘100%’, isn’t it?
  • svctm: Though the manual says “The average service time (in milliseconds) for I/O requests that were issued to the device.”, it isn’t exactly that when you look at multiple-disk systems. What it really tells you is “when your I/O subsystem is busy, how fast does it respond to requests overall”. Actually, the less you load your system, the higher svctm is (as there are fewer outstanding requests, and the average time to serve them goes up). Of course, at a certain moment, when I/O becomes really overloaded, you will see svctm going up again. One can tweak /sys/block/sda/queue/nr_requests based on this – to avoid overloading the I/O controller – though that is really rarely needed.
  • await: One of my favorites – how fast requests go through. It is just an average of how long it takes to serve a request on the device, from the moment it enters the device queue until the final “OK”. Low = good, high = bad. There are a few gotchas here – even though different reads can have different performance properties (middle of the disk, outer areas of the disk, etc), the biggest difference is between reads and writes. Reads take time, writes can be nearly instant (write caching at underlying layers..). As 80% of the requests were reads, we can try to account for that by doing the 11.17/0.8 math, to get a ~14ms figure. That’s quite high – systems that aren’t loaded can show ~5ms times (which isn’t that far from the 4ms rotation time of a 15krpm disk).
  • avgqu-sz: A very, very, very important value – how many requests are sitting in the request queue. Low = either your system is not loaded, or it has serialized I/O and cannot utilize the underlying storage properly. High = your software stack is scalable enough to load the underlying I/O properly. Queue size equal to the number of disks means (in the best case of request distribution) that all your disks are busy. Queue size higher than the number of disks means that you are already trading I/O response time for better throughput (disks can optimize the order of operations if they know them beforehand, that’s what NCQ – Native Command Queueing – does). If one complains about I/O performance issues when avgqu-sz is lower, then it is application-specific stuff, which can be resolved with more aggressive read-ahead, fewer fsyncs, etc. One interesting part – avgqu-sz, await, svctm and %util are interdependent: await = avgqu-sz * svctm / (%util/100). See the quick arithmetic check right after this list.
  • avgrq-sz: Just the average request size. Quite often it will look like a block size of some kind – and can indicate what kind of workload is happening. This is already post-merging, so lots of adjacent block operations will bump it up. Also, if the database page is 16k while the filesystem or volume manager block is 32k, this will be visible in avgrq-sz. Large requests indicate there’s some big batch/stream task going on.
  • wsec/s & rsec/s: Sectors read and written per second. Divide by 2048, and you’ll get megabytes per second. I wanted to write this isn’t important, but remembered all the non-database people who store videos on filesystems :) So, if megabytes per second matter, these values are important (and can be seen in ‘vmstat’ output too). If not, for various database people there are other ones:
  • r/s & w/s: Read and write requests per second. This is already post-merging, and in proper I/O setups reads will mean blocking random read (serial reads are quite often merged), and writes will mean non-blocking random write (as underlying cache can allow to serve the OS instantly). These numbers are the ones that are the I/O capacity figures, though of course, depending on how much pressure underlying I/O subsystem gets (queue size!), they can vary. And as mentioned above, on rotational media it is possible to trade response time (which is not that important in parallel workloads) for better throughput.
  • rrqm/s & wrqm/s: How many requests were merged by block layer. In ideal world, there should be no merges at I/O level, because applications would have done it ages ago. Ideals differ though, for others it is good to have kernel doing this job, so they don’t have to do it inside application. Quite often there will be way less merges, because applications which tend to write adjacent blocks, also tend to wait after every write (see my rant on I/O schedulers). Reads however can be merged way easier – especially if application does “read ahead” block by block. Another reason for merges is simple block size mismatch – 16k database pages on top of 8k database pages will cause adjacent block reads, which would be merged by block layer. On some systems read of two adjacent pages would result in 1MB reads, but thats another rant :)
  • Device: – just to make sure, that you’re looking at the right device. :-)
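
To make the relationships above concrete, here is a quick sanity check against the sample output at the top of this post (awk is only doing the arithmetic):

# await ≈ avgqu-sz * svctm / (%util/100)
awk 'BEGIN { printf "await ≈ %.2f ms\n", 32.99 * 0.34 / (100/100) }'   # ~11.22, vs the 11.17 reported
# rsec/s to MB/s: divide by 2048 (512-byte sectors)
awk 'BEGIN { printf "reads ≈ %.1f MB/s\n", 81848 / 2048 }'             # ~40 MB/s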

So, after all this, the iostat output above tells us something like:

  • System has healthy high load (request queue has two-requests-per-disk)
  • Average request time is double the value one would expect from an idle system; it isn’t too harmful, but one could do better
  • It is reading ~40MB/s from disks, at 2420 read requests/s. That’s quite high performance from an inexpensive 2U database server (shameless plug: X4240 :)
  • The high amount of merges comes from LVM snapshots, and can be ignored
  • System is alive, healthy and kicking, no matter what anyone says :)

MySQL password security

Simple password authentication schemes usually guard against one of two evils – either leaked password tables or sniffed network traffic. In 4.1 MySQL introduced a challenge-response scheme that guards against both, just not both at the same time. How does one obtain the token required to log into the server? There are a few methods:

  • Use gdb, dtrace or any other deep-inspection method to grab ‘buf’ in check_scramble()
  • Grab mysql.user table, sniff network traffic, calculate the hash_stage1 value out of public_seed (initial server packet), client’s reply and actual password hash
  • Intercept the password client-side at libmysqlclient level (again, gdb, dtrace, etc ;-)
  • Mix ethyl alcohol with the carbohydrate-based bipedal DBA, until it becomes quadrupedal and tells the password (might not be able to tell anything else at that moment).


MySQL Conference & Expo 2009

P.S. I was asked by the MySQL Conference organizers to do some shameless plugs, so… yeah, I’m going to talk about the first three methods in my talk on MySQL security, and do live trials of the last method during the conference evening program.

Poor man’s contention profiling

I already wrote about poor man’s query profiling techniques last summer. Today I’m going to share a poor man’s contention profiling tool, for all those poor souls who do not have Solaris with dtrace or pstack, don’t want to run the tcmalloc contention profiler, and simply need some easy hack to see ‘what the heck is going on in my server’. Here it is:

gdb \
    -ex "set pagination 0" \
    -ex "thread apply all bt" \
    --batch -p $(pidof mysqld)

Run it a few times, and you will have enough samples to start judging. Do note, this may stop the process execution for a second, so do not spam it in too tight a loop.
Once you have the results, it is just a matter of a 20-line script to extract any useful calculations :)
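
For the record, a rough sketch of such a script (the number of samples, the sleep, and the field plucked out of each frame are all a matter of taste):

#!/bin/bash
# sample the server a few times, then count identical stack signatures
for i in 1 2 3 4 5; do
    gdb -ex "set pagination 0" -ex "thread apply all bt" \
        --batch -p $(pidof mysqld)
    sleep 1
done | \
awk 'BEGIN { s = "" }
     /^Thread/ { if (s != "") print s; s = "" }
     /^#/      { f = $4; s = (s == "" ? f : s "," f) }
     END       { if (s != "") print s }' | \
sort | uniq -c | sort -rn | head -20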

P.S. I’d love to see efficient pstack implementation for 64-bit Linux :)

update: this now lives at http://poormansprofiler.org

Eyecandy mutexes!

In my quest to make MySQL usable, I managed to hit contention that wasn’t spotted by performance masters before. Meet the most useless mutex ever (these are actual contention events, not just holds):

Count     nsec Lock
 1451   511364 mysqld`ut_list_mutex

      nsec ---- Distribution --- count    Stack
      2048 |@@                 |   132    libc.so.1`mutex_lock_impl
      4096 |@@@                |   186    libc.so.1`mutex_lock
      8192 |                   |     3    ut_malloc_low
     16384 |@@                 |   137    mem_heap_create_block
     32768 |@                  |   105    row_sel_store_mysql_rec
     65536 |@                  |    89    row_search_for_mysql
    131072 |@                  |    99    ha_innobase::general_fetch
    262144 |@@                 |   131    ha_innobase::rnd_next
    524288 |@@@                |   195    rr_sequential
   1048576 |@@@                |   223
   2097152 |@@                 |   137
   4194304 |                   |    14
----------------------------------------------------------------
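
(Histograms like the one above can be gathered on Solaris with plockstat; something along these lines, with the sampling window and stack depth picked arbitrarily:)

plockstat -C -s 10 -e 30 -p $(pgrep -x mysqld)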

ut_list_mutex guards a memory structure which has all memory blocks allocated by InnoDB (via ut_malloc/ut_free) in it.
It has two uses:

  1. Printing “Total memory allocated” in SHOW INNODB STATUS (though this can still be implemented lock-free)
  2. Deallocating all memory on shutdown (though, all modern operating systems do that anyway, so this is purely just to shut up valgrind)

If you have any BLOB/TEXT data in your tables, you’re definitely hitting this contention spot (it is #1 contention in such cases).

Fix? Kill the eyecandy, replace ut_malloc and ut_free with direct calls to malloc() and free(), oh and of course, use scalable allocators like tcmalloc or Hoard.
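
At least the allocator part doesn’t need a rebuild – preloading a scalable allocator at startup is usually enough. A minimal sketch (the library path is just an example, and the exact mechanics depend on how mysqld gets started on your system):

export LD_PRELOAD=/usr/lib/libtcmalloc_minimal.so
mysqld_safe &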

Rasmus vs me

Rasmus (of PHP fame) and I exchanged these nice words on Freenode’s #php (while discussing some PHP execution efficiency issues):

 
<Rasmus_> if that is your bottleneck, you are the world's best
          PHP developer
<Rasmus_> domas: then you are writing some very odd code.
          I think you will find you spend most of your time in
          syscalls and db stuff

<domas> Rasmus_: I can tell you're the best database developer, if
        you spend most of your time in db stuff :)
 

You can immediately see different application engineering perspectives :)