On valgrind and tcmalloc

I already wrote about tcmalloc and how it helped with memory fragmentation. This time I had some experience with its extended features – memory profiling and the leak checker. With tcmalloc it is possible to get an overview, as well as detailed reports, of which areas of memory are allocated for which uses. Even more, it can detect and report any memory that leaked. Of course, valgrind does that too. With one slight difference:

  • valgrind slows down applications 20-40 times (my tests ran 4s instead of 14ms)
  • tcmalloc does not. Same 14ms.

I wrote some MySQL UDFs for profiling and heap-checking management, so I can extract per-thread, single-test data. I will try to clean it up and release it. It would be a shame not to.

First days at Sun, Honeycomb opensourced

So now that MySQL is part of Sun, I can be a real Sun technologies fanboy :)
So besides all the ‘it is done’ news, I noticed on Sun’s site that they’ve (oh, we’ve) opensourced Honeycomb. That’s the technology I was drooling over quite a few years ago, and always wanted to see open. Now it is.
Honeycomb (aka StorageTek 5800) is a distributed content-addressable storage (CAS) system that allows putting in objects along with their metadata, having them indexed, and presenting virtual directories based on that metadata. It is accessible via WebDAV, API, CLI, etc… Previous solutions used to be expensive; now put it on top of Thumpers, and it becomes an easy-to-manage storage archive. All I need now is to find out whether it really suits our needs, and start playing with it more.
Oh, and MySQL storage engine access to it would be quite cool too – it would mean Brian’s archiving engine is history :)

Update: Damn, apparently only emulators and clients have been released so far, and according to Sun folks, the HADB bits might be complicated to release – though the rest of the stack might appear public and open soonish.

I/O schedulers seriously revisited

The I/O scheduler problems have drawn my attention, and besides trusting empirical results, I tried to do more benchmarking and analysis of why the heck strange things happen at the Linux block layer. So, here is the story, which I found quite fascinating…

Speaking at MySQL Conference again, twice

Yay, I’m coming to the MySQL conference again this year. This time with two different talks (the second got approved just a few days ago) on two distinct, quite generic topics:

  • Practical MySQL for web applications
  • Practical character sets

The abstracts were submitted weeks apart, so ‘practical’ appearing in both is completely accidental :) Still, I’ll try to cover problems met and solutions used in various environments and practices – both as a support engineer at MySQL and as an engineer working on Wikipedia bits.

Coming to the US and talking about character sets should be an interesting experience. Though most English-speaking people can stick to ASCII and be happy, current attempts to produce multilingual applications lead to various unexpected performance, security and usability problems.

And of course, web applications end up introducing quite a new model of managing data environments, bringing a new set of rules and throwing away traditional OLTP approaches. It is easy to slap another label on these – call it OLRP, on-line response processing. It needs data prepared for reads more than for writes (though balance has to be maintained). It needs data digested for immediate responses. It needs lightweight (and lightning) accesses that do the minimum work. That’s where MySQL fits nicely, if used properly.

My own database abstraction class

Back in July 2006, I decided that all other database classes were not worth it, and created my own, incorporating the best features of the MySQL and PHP worlds. It resulted in this brilliant code, which I showed to a few colleagues, earning this quote:

i like your nonframework. it gives a fuzzy feeling to the poor souls that think they need an abstraction layer. — Kristian Köhntopp

This was written using the TIC pattern and can be used in a variety of applications:

class MyDB {
  var $conn = null;

  function MyDB($database=null,$user='root',
                $password='',$host='localhost') {
    $this->conn=mysql_connect($host,$user,$password) and
    $database?mysql_select_db($database,$this->conn):null;
  }

  function _escape($s) {
    return mysql_real_escape_string($s,$this->conn);
  }

  function _quote($s) {
    return "'" . $this->_escape($s) . "'";
  }

  function __call($method,$arguments) {
    // selectEverythingFromMytableWhereIdIn ->
    //   "select everything from mytable where id in"
    $query=preg_replace_callback('/([A-Z])/',
      create_function('$s','return " ".strtolower($s[0]);'),
      $method);
    $query=str_replace(' everything ',' * ',$query);
    $first=array_shift($arguments);
    if ($first) {
      if (is_array($first)) {
        $query .= ' (' . implode(',',
          array_map(array($this,'_quote'),$first)) . ') ';
      } else {
        // replace each ? placeholder with the next quoted argument
        while($argument = array_shift($arguments)) {
          $first = preg_replace('/\?/',
            $this->_quote($argument), $first, 1);
        }
        $query .= $first;
      }
    }
    $res = mysql_query($query,$this->conn);
    if (!$res) { print mysql_error(); exit(); }
    $ret = array();
    while($row=mysql_fetch_assoc($res)) { $ret[] = $row; }
    return $ret;
  }
}


$x = new MyDB('test');
$x->selectEverythingFromMytableWhereIdIn(" (?,?,?) ","a'b",2,3);

Now I wonder where I should build the community for this – Google Code or Sourceforge? Or should that be the darling MySQL Forge?

As I started digging my stuff out of history, I also managed to upload my past six months of pictures to flickr – including Vienna, Taipei and a few other trips.

On guts and I/O schedulers

Benchmarks and guts sometimes contradict each other. Like, a benchmark tells you that “the performance difference is not big”, but the guts tell otherwise (something like “OH YEAH URGHH”). I was wondering why some servers were much faster than others, and apparently different kernels had different I/O schedulers. Setting ‘deadline’ (the Ubuntu Server default) works miracles compared to ‘cfq’ (the Fedora, and probably Ubuntu standard kernel, default) on our traditional workload.

Now all we need is to show some numbers, to please gut-based thinking (though it is always pleased anyway):


deadline:

avg-cpu:  %user   %nice    %sys %iowait   %idle
           4.72    0.00    7.95   18.18   69.15

Device:    rrqm/s wrqm/s   r/s   w/s  rsec/s  wsec/s
sda          0.00   0.10 91.30 31.30 3147.20 1796.00

    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
  1573.60   898.00    40.32     0.98    7.98   3.65  44.80

cfq:

avg-cpu:  %user   %nice    %sys %iowait   %idle
           4.65    0.00    7.62   38.26   49.48

Device:    rrqm/s wrqm/s   r/s   w/s  rsec/s  wsec/s
sda          0.00   0.10 141.26 38.86 4563.44 2571.03

    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
  2281.72  1285.51    39.61     7.61   42.52   5.38  96.98

Though the load slightly rises and drops, the await/svctm parameters are always better on deadline. The box does a high-concurrency (multiple background InnoDB readers), high-volume (>3000 SELECTs/s), read-only (aka slave) workload on a ~200GB dataset, on top of a 6-disk RAID0 with write-behind cache. Whatever the next benchmarks say, my guts will still fanatically believe that deadline rocks.
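For the record, flipping the scheduler does not even need a reboot – it is a /sys knob. A minimal sketch (Python purely for illustration; the helper names and the device name ‘sda’ are my own, and writing the knob needs root):

```python
from pathlib import Path

def active_scheduler(line):
    # /sys/block/<dev>/queue/scheduler lists all schedulers and marks
    # the active one in brackets, e.g. "noop anticipatory deadline [cfq]"
    return line[line.index('[') + 1:line.index(']')]

def set_scheduler(dev, name):
    # write the new scheduler name into the queue knob (root required),
    # then read it back to confirm the switch took effect
    knob = Path('/sys/block') / dev / 'queue' / 'scheduler'
    knob.write_text(name)
    return active_scheduler(knob.read_text())

print(active_scheduler('noop anticipatory deadline [cfq]'))  # cfq
# on a live box: set_scheduler('sda', 'deadline')
```

The same switch from a shell is just an echo into that file, which is what makes per-box experiments so cheap.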

Dynamo by Amazon

Amazon has published a nice paper on Dynamo, their distributed hash storage. This kind of solution is really needed by the big shops, and the only reason we’re not using anything like it at Wikipedia is that we don’t actually have it.

I saw a comment in there by “Max”:

Looks like a lot of thought has been put into this. I stopped reading about halfway through but how much of this was inspired by memcached?

I guess the answer would be ‘none’. Key-value databases will soon have been mainstream for 30 years (the first ‘dbm’ was released by Ken Thompson at AT&T back in 1979). They weren’t distributed back then, but you can slap the ‘distributed’ label onto pretty much anything nowadays. How you do it is usually the problem, and in this case Dynamo chooses something similar to Chord-like rings (though with simplified routing).

Still, there are no open libraries that provide easy, reliable DHT building, and I’d love every paper like this to come with open-source code rather than patents attached.
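To be fair, the core ring trick itself is tiny – the hard parts are membership, replication and failure handling. A toy consistent-hashing sketch (my own illustration, not Dynamo’s code; real systems add virtual nodes and replicas):

```python
import hashlib
from bisect import bisect

class HashRing:
    """Toy consistent-hash ring: nodes and keys are hashed onto the same
    circle, and a key belongs to the first node clockwise from it."""

    def __init__(self, nodes):
        self._ring = sorted((self._hash(n), n) for n in nodes)

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def node_for(self, key):
        positions = [pos for pos, _ in self._ring]
        # wrap around to the first node when the key hashes past the last one
        return self._ring[bisect(positions, self._hash(key)) % len(self._ring)][1]

ring = HashRing(['storage1', 'storage2', 'storage3'])
print(ring.node_for('some-object-key'))
# removing a node only remaps the keys that lived on it; everything
# else stays put - that's the whole appeal over plain mod-N hashing
```

Dynamo layers vector clocks, hinted handoff and gossip-based membership on top of that ring, which is exactly the part nobody ships as a reusable library.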

P.S. My first serious web-app coding was on Sybase PowerDynamo ;-)

MySQL 4.0 Google Edition

At the conference I realized how much Wikipedia’s database operation has in common with Google’s – many rules and ideas of operation, problems faced, solutions imagined. Ah, there is one huge difference – they have brilliant engineers resolving quite a few of the issues, whereas we slack and live with as-is software. One very nice coincidence is that our base MySQL version is 4.0.26 – surprise, surprise, the same one Google has released patches for :) So, here the story of running a MySQL fork begins…

Someone please run REPAIR TABLE, I am hungry!

Today I hit some irony. I went to my favorite ‘order home delivery food online’ place, and the shopping cart failed to grow after my ordering clicks. Then I noticed a small line at the bottom of the page:

 (F) [3] mysql. Invalid query (“Neteisinga užklausa”): SELECT * FROM krepselis WHERE uzsakymas_id=220479 ORDER BY laikas. Error (“Klaida”): Can’t open file: ‘krepselis.MYI’ (errno: 145). File:core.php Line:1824 

‘krepselis’ being ‘shopping cart’ and ‘uzsakymas’ being ‘order’, I realized that error 145 would keep me hungry. Ah, and error 145 is ‘Table was marked as crashed and should be repaired’. So please, run REPAIR TABLE, I am hungry!!!

Citizendium revisited

Just spotted an amazing article about how Citizendium built better infrastructure than Wikipedia’s. There are lots of fascinating details there, like…

They went with PostgreSQL for a number of reasons, including better scalability. PostgreSQL is an MVCC database. Unlike Wikipedia, Citizendium never has to lock the database for reads and writes. MySQL can do a lot of things quick and replicate them to slave servers, but PostgreSQL excels at complex functions and full features like JOINs and can do complicated categories and full text searches faster than Wikipedia.

If PG can function without locks, it must definitely be more scalable. InnoDB uses mutexes, spinlocks, etc. – and that internal locking can be a bottleneck in many cases. Additionally, if a row is updated, a lock on the record is acquired. It is still a question how PG maintains ACID without any locks; I’ve got to research that more.
I’m aware that MySQL isn’t the best at full-text search out there – but Wikipedia uses Lucene for full-text search, so it is somewhat strange to hear that the Citizendium platform is faster in that regard. And… I’m not sure whether JOIN performance is really faster there – especially when we do lots of covering-index-based joins. Probably the key word there is ‘complex’, though I’m not sure what that means :-)
The first reason not to use MySQL was:

First, to be different from Wikipedia.

Indeed, I always support critical thinking! Though this one:

Finally, we felt from reading various mailing lists over mediawiki development that mediawiki was hitting the ceiling of the features MySQL can provide as a backend.

IIRC that came from a single post on a single mailing list, from someone who is not running the Wikipedia backend. Mhm.
Of course, their monthly traffic equals our single-minute traffic, so some views might differ…
