LAMPS on steroids

I’m not sure if I’m the first coining in ‘LAMPS’ – scaled out LAMP environment with Squid in front, but it sounds cool. Squid is major component in content distribution systems, reducing the load from all the backend systems dramatically (especially with proper caching rules). We had various issues in past, where we used code nobody else seemed to be using – cache coordination, purges and of course, load.

Quite a few problems resulted in memory leaks, but one was particularly nasty: Squid processes under high load started leaking CPU cycles somewhere. After deploying profiling for squid we actually ended up seeing that the problem is inside libc. Once we started profiling libc, one of initial assumptions appeared to be true – our heap was awfully fragmented, slowing down malloc().

Here comes our steroids part: Google has developed a drop-in malloc replacement, tcmalloc, that is really efficient. Space efficient, cpu efficient, lock efficient. This is probably mostly used (and sophisticated) libc function, that was suffering performance issues not that many people wanted to actually tackle. The description sounded really nice, so we ended up using it for our suffering Squids.

The results were what we expected – awesome :) Now the nice part is that the library is optimized for multi-threaded applications, doing lots of allocations for small objects without too much of lock contention, and uses spinlocks for large allocations. MySQL exactly fits the definition, so just by using simple drop-in replacement you may achieve increased performance over standard libc implementations.

For any developers working on high-performance applications, Google performance tools provide easy ways to access information that was PITA to work on before. Another interesting toy they have is embedded http server providing run-time profiling info. I’m already wondering if we’d should combine that with our profiling framework. Yummy. Steroids.


It is somewhat too late to post about Wikimania, so I won’t keep it long. Finally my lightning talk bits are on internet. Please note, that the idea of presentation came just few minutes before giving it. Pity there’s no video recording.

Anyway, the trip was pretty cool. Besides meeting all usual suspects at Wikimania, the whole world of amazing academia (this time, in a good sense) was revealed by my hosts. MIT rocks (especially my place of stay, TEP). Boston was pretty nice too – I quite enjoyed long walks there.

The conference itself had technical stuff extracted into ‘Hacking days’, so I ended up at quite tense technical track rather than hacking sessions. It wasn’t that bad, so I could create ‘socializing track’ at the main conference, and just hang around without any real agenda.

We had there quite less of startup spirit, that was so vivid last year. The current theme is “What we’re doing ” instead of “What we’re going to do”. For some it is maturity, for me it is boring :) I hope that was wrong impression.

I’ll sure try going there next year too. Just to see the usual suspects.

P.S. Wikipedia tremendous growth is back again.

my slides from users conference

Today I was pointed to my previous post on scaling, and I remembered, that I didn’t put my slides from MySQL Users conference online. Maybe those are not giving that much of detail, but still, can disclose some of facts from my talk: “Wikipedia: cheap and explosive scaling with LAMP”.

There’s also Brion‘s presentation at Google on wider aspects of project technology, with slides and video made public too.

Scaling talks at mysql users conference

I am already a bit late to write about my MySQL Users Conference impressions or input, but better later than never. My pet topic is scalability, or rather, how to build big cheap systems, and I’ve had many mixed thoughts after the event, which of course had many scalability gurus from nice companies. The biggest impression was that we all scale different applications and have different demands (some have many datacenters with applications distributed, some had two power failures in whole datacenter in single week and went down for few hours..).

And as I also had a presentation on Wikipedia scaling, I’ll try to mention some of issues discussed there.

Different techniques

Main thing is that rules do not matter, application (or rather a purpose) does. All techniques should be taken with grain of salt, MMORPG is different from e-banking, though both may require synchronized states. A blog is not a wiki, as you won’t have clashes or lock conflicts on same resources. And sure, in some cases high availability (percentage of uptime) is less important than general availability – percentage of reach.

Distributing the load

Second major idea is that load has to be split. Of course, it is mandatory in case of ‘scale out’, but there may be different paths to acquire different kinds of needs – efficiency, availability, redundancy, accuracy, yadda yadda. Like…

  • If you know how to manage desynched slaves (what is really tied to application process), you can allow possibility of desynch (and hence not flush logs to disk after every transaction on any of your boxes).
  • Queries for different data domains can be sent to different slaves. Let it be per user, per language or any other fragmentation. Even if data is all there, just touching specific parts of it may improve cache locality. One of main things to consider then, is that data is clustered together, and a single row fetch will read 16KB block from disk and store it in memory. “Every third article/product/user/…” should be replaced by “every third thousand of articles/products/users”, or even some semi-dynamic bucket allocation. For Wikipedia it’s rather easy, we may just direct different languages to different servers, as per-language projects are quite self-contained.
  • Different types of queries can be sent to different slaves. Even if same data is there, you can still hit it with different patterns, and keep different indexes in hottest caches.
  • Not having the data will be always faster than just not reading it. If there’s enough of redundancy, data from different domains/types can be simply purged. Of course, purging the data that is not needed completely from the system is even more efficient approach.
  • You’ve already got RAIS – Redundant Array of Inexpensive Servers, so you can take out R from RAID and use just stripes of disks, forming AID, for performance of course.

Weakest slave will be slowing down capability of whole system, so doing less work on it not only in terms of requests per second, but also of how much of data it has to handle, may revive it for a new life (and, hehe, that way our poor 4GB old DB servers do have lots of juice :).


Caching is essential in high-performance environments (unless the service is random number generator two-point-oh). It is a common practice to add big nice caching layer (in memcacheds or squids or wherever else), but to leave data in core databases as it is. If efficient caching allows not to access data inside database too often, there’s no need to keep it on core database systems, as those are designed to work with data that needs work. Any box that has some idle resource like storage (most application servers usually do), may handle piece of rarely accessed but heavily cached elsewhere content.

Tools for the task

Different tasks may require different tools for the job. Lots of semi-static data can often be stored on application servers, usually as lightweight hash databases, just a proper method of migrating dynamic changes from core databases is required. It may be a simple rsync after a change was made, but it will save a roundtrip afterwards. Instead of updating full text indexes inside database, streams of changes may go to Lucene-based search application. And of course, sometimes just putting changes into background queues or off-peak schedules may improve responsiveness.

Speed vs power – both important

In scaled out environments adding more hardware often helps, but shouldn’t always be the main solution of the problem. Micro-optimizations have the purpose – besides obvious “saves resources” they also increase efficiency of individual nodes. Having the query served faster means also less locking or occupation of common resources (such as DB threads, waits on network), as well as far more improved user experience. This is where you might want to use high-power cores as in Opteron instead of lots of Niagara or Celeron ones (even if that may look much cheaper). Having 100 requests per second at 0.1s each rather than 100 requests per second at 1s each is quite a difference, and it counts.

Slow tasks

It is critical to avoid slow tasks on high performance systems. If there’re queries that are too expensive, just… kill them. Once you become overloaded you might want to start killing queries that run too long. Just KILLing the thread is not enough, either it has to be optimized (indexes, summary tables, etc), or eliminated. One cheap way is to delete outdated data, but if it is not possible, just having another table with hottest aggregated data may double or triple the performance. Here again, once data is aggregated into commonly used format, main source can be retired from hot memory to disks or to other cheaper services.

My common illustration is – you don’t want to see elephant walking in a highway. Elephant will have absolutely different access pattern, occupy too much space and move too slowly, where usually lots of throughput exists, not only blocking few lanes, but also attracting attention by drivers in opposite direction. Kill the elephant! Or rather, leave in natural habitat.


One of the magic weapons is compression. Gzip is fast, bzip2 is not, and many common perceptions of compression is that it is slow. No, it’s bzip2 that is slow, gzip is fast. Additionally, it may pack your data into a single block on file system instead of two or three. In some cases that may mean three times less seeks – milliseconds saved at a tiny fraction of CPU costs. In cases where there’s lots of similar text – like comments quoting other comments, different revisions for entry – concatenate it all and then compress. Gzip will love the similarity and produce ten or twenty times smaller BSOB (Binary Small Object).


There’re various profiling advices around, but what I hit multiple times, is that one hundred requests profiled separately one by one may provide more insights than a generic collection of min/max/avg/deviation for a million requests. Moreover, profile from production system may give lots of issues unspotted in development. It doesn’t mean though, that generic or development profiling should not be done. There should be no prejudices in process of profiling – worse than that is just optimization without profiling. Instead of “it has to be so” there should always a question if specific task can be improved.

Conclusions (or bragging (or whining))

Site handles now (it rises quite fast) over 12000 HTTP requests per second (out of which around 4000 are pageviews), on a cluster that could be built with ~500k$. At one talk in UC it was told that our platform isn’t very good. Sorry, with few volunteers working on that, we have to choose priorities for development. And it is a pity, that most of scaling management software is usually closed asset of the big players. Um, I’d love it to be open, really, pretty pretty please!

db abstraction: zend framework vs mediawiki

So now as Zend have released their framework I couldn’t resist to look at it’s database classes (there’s much more other stuff in it) and compare with our own bits in mediawiki.

Thumbs up for (what is a must in abstraction layer) plain query building functions. Those are easy to understand and seem to work :)

One of issues with SQL programming is that people queries like text instead of ‘structured queries’. This is where Zend’s $db->quoteInto() is overused – even where there’s no need to people are forced to embed SQL bits, instead of thinking just about data flow and relations.

In order to add additional criteria in MW, you’d do:


Zend Framework’s query builder offers slightly different approach. There’s Select object, which still eats text as parameters:

$select->where('field = ?',$value);

What raised my eyebrows – $select->from('foo', '*');. First of all, SELECT * is evil. On the other hand, it is interesting design decision to put it together with ‘FROM’. How would one issue ‘SELECT 1’ query this way? :)

The $db->insert() calls appeared to be identical in both libraries, you just pass an array of key=>value pairs and a table to put all that into. The $db->update though had same arguments, but different evaluations of those. For Zend you have to pass just a plaintext query in form of $db->quoteInto(‘field=’,$value). MW though allows to pass arrays of conditions:

$where = array ( 'group' => $groupid, 'parent' => array ( 1,2,3,4));

Of course you may pass a plain string to MW’s DB abstraction, type detection helps a lot there :)

Same differences applies to delete method. People used to tell, that you don’t need any query building once you’ve got placeholder replacement in SQL queries. Usually what is missed is that abstractions are not just for portability or security, but also for easier way to do things. If plain text is to be replaced into array passing, doing it completely would help more (I’m not advocating to split strings into arrays of chars though ;-) It still looks better when one appends an item to an array than string concatenation.

‘SELECT’ builder has much more differences. We packed handling of ORDER BY, GROUP BY, HAVING, LIMIT, etc into $options, while Zend provides separate methods for that. I immediately missed ‘SELECT … FOR UPDATE’ in documentation – this is one of most important options in proper web applications.

The OFFSET/LIMIT support (and even recommendation for paging) requires serious attitude readjustment. Scanning result sets _is_ expensive and numeric offsets shouldn’t be shown as a ‘nice way out’.

Lack of outer JOINs (as well as lack of workarounds in documentation) was surprising too.

Of course, providing table as an object was a nice thing to do, especially with possibility to add additional logic to a table object:

class RoundTable extends Zend_Db_Table {
    public function insert($data) {
        if (empty($data['created_on'])) { $data['created_on'] = time(); }
        return parent::insert($data); }

It is nice to see such library pushed – as it is in it’s 0.1.1 state, there’s lots of room for improvement in future :)

There’re other libraries out there, but those are undocumented, unpackaged and not that well maintained. And of course, there’s Ruby On Rails, which was poorly mimicked in Zend’s Active Record implementation. I don’t like ORM myself, but with hype around it, Zend could try to make their ORM better (or rather, existing, as there’re no relations in their objects). Even Django looks much better in this field. So now I’m just waiting for 1.0 ;-)

spread: bad example of open source

The Spread toolkit is one of examples, where opensource project should better not exist. It is reliable multicast, it has APIs in multiple programming languages, and can provide message queueing facility you can run and forget. There’s even MySQL Message API based on it – you can use sync and async messaging between bunch of MySQL servers. Using Spread may give you lots of possibilities in deploying distributed system.

At Wikipedia’s content cluster we could use lots of synchronization based on Spread, but…

3. All advertising materials (including web pages) mentioning features or use
   of this software, or software that uses this software, must display the following

   "This product uses software developed by Spread Concepts LLC for use in the Spread toolkit.
    For more information about Spread see"

That would mean that if we used Spread somewhere in cluster, we’d be showing adds for university project on every page (or at least that is what ‘must display’ sounds like). Of course, as some university project, it might want some advertisement, but I think it would get far more of it, if it was without viral advertisement clause – it is still the only framework of a kind out there.

Additional problem in such situation is that being half-free (or.. adware) it half-fills the need of proper messaging toolkit for community. Starting similar project when there’s Spread might not look attractive.. Of course, there’s always bunch of IRC servers – you would find lots of systems messaging needs efficiently implemented there, just without reliability and guarantees. But probably the best way would be simply asking Spread authors to release it under GPL or any other proper open source license? :)

mediawiki graphic profile

So I used amazing tool named KCacheGrind to parse XDebug output of MediaWiki. Of course, I immediately started refactoring some places in code. It is a random web hit, I don’t know even what it is, but it looks nasty and waiting for a gentle touch. Kill remote interwiki caches! :)

Here is what I got (triple click to zoom):

Here is what I used:


%d bloggers like this: