TCP tuning for your database

Lately lots of new fascinating technologies are used to build even more fascinating new solutions, and solutions nowadays even run on distributed environments, not just on single server. These servers usually communicate using TCP – standard, that has been here long before gigabit (or ten megabit) ethernet or dawn of LAMP, and needs to be kicked a bit, to work properly. In our environment we have over hundred of application servers which handle quite a number of requests, and put quite demanding stress on database servers. For that we had to change some of default kernel settings – and it did improve situation a bit. Here I’ll try to overview some of them…
Continue reading “TCP tuning for your database”

Open transactions are bad

Two identical queries have somewhat different execution times, though data layout, table format, everything… is quite same:

mysql> select count(*) from transtest;
| count(*) |
|    10000 |
1 row in set (0.00 sec)

mysql> select count(*) from transtest;
| count(*) |
|    10000 |
1 row in set (2.88 sec) < ---- oh noes!!!

The difference is very simple, though often forgotten. Multiversioned (MVCC) tables may have multiple versions (hence the name) of rows.
Even more, the engine will keep all versions of rows, that would be needed for oldest active transaction, just in case it requests some snapshot from the past.
Think of it as of some sort of time machine, just somewhat less efficient :)

If some old transaction is running, heavily edited tables will have more stored row versions.
In above case, second query had to read a million of rows to produce a count, when first one was working on clean environment.
Once the transaction holding old versions was rolled back, the query took 0.00sec again.

Databases (or DataACIDs) will take additional measures, often unexposed, to provide consistent data, so we may treat this as a feature, not a bug.

Conclusion: avoid long running transactions, especially if doing nothing. Oh well, avoid long running transactions with locks too, but that is already different question. If you spot anything sleeping – kill the slacker. At application level. :-)

I knew it – Metaverse runs on MySQL!

I was enjoying a fabulous read, Neal Stephenson’s novel Snow Crash. The book was written back in 1992, but it emphasized virtual reality and collaboration in a way that sounds modern still now.

What was fun (and made me write this spoiling post), is that it is easy to spot MySQL usage patterns in science fiction:

  • Metaverse – the virtual reality internet, has its counterparts already in our current non-fictional world. “Second Life” is a nice [MySQL powered] platform for virtual life and creativity, and it is now in state of early Metaverse evolution.
  • Central Intelligence Corporation (created after the merger of Library of Congress and CIA, as they were duplicating each other anyway) operates huge resources of user-provided content, like Flickr, YouTube or Wikipedia [MySQL powered again]. Though it had somewhat better interfaces and APIs, I guess that is what Web 5.0 is about ;-)

I really liked the idea how asynchronous replication and lack of locking effects virtual reality worlds – even these effects were taken into account by the author. Add neurolinguistics, civilization history, brilliant satire, and you really get the best geek fiction. Thumbs up for the book. The Diamond Age seems to be the next read.

On books, examples and wizards

I really like Flickr. I believe it is one of greatest services of the Web (my album is there :), and it runs MySQL. But I don’t understand why their database folks think nobody else knows how to build stuff. I have seen nice books and presentations from Yahoo! (oh wait!) and other guys, who have been building big systems and engineered their solutions properly, so they survived.

Technical literature was noticed:

Then there are whole books on the subject of capacity and scalability for the database layer.

Yes, they are good reads.


Then there are novels from developers that in many cases really don’t know the tricks of the DBMS they are working with, and create elaborate abstraction layers that automatically generate SQL for the DB in question from objects and such.

Some of these layers are made with efficient SQL in mind, or always allow to override their functionality with efficient SQL.

So, it is easy to answer this question:

But, with all these people who tell you how to do it, actually can they prove that it works under a constant high workload for many people all at the same time.

I believe these people can.

Now there’re parts of Flickr operation described:

You may be thinking to yourself yea right say you can do 20K + transactions per seconds that must be a crap load of expensive hardware all running, where all the data is served out of memory.

With proper data layout single 10000$ system may handle 10000 queries per second. Of course, hitting disk may decrease efficiency, so one may end up with 2-5$/query. I’m not sure Flickr would consider 100k$ database setup as expensive hardware. Here again, “all data served from memory” may sound expensive, but mostly systems serve just “most data from memory”. Like ours, which is running on those 10k$-class (disclosure: ~10 of them) machines and serving >30000 queries per second at the moment. And that is efficient.

This is blowing away minds and wiping stuff we know away:

All of our database connections are real time. Our load balancer for the database is written in 13 lines of PHP code.

There are lots of posts detailing how fast MySQL connections are and how database pooling isn’t necessary anymore. Our load balancer is actually 651 lines of PHP code, but it still connects to database at each request. And it takes less than millisecond to connect – quite affordable cost.

I am sure interested in all Flickr design specifics – it is nice application, perfect example and it seems to work. Though I don’t believe that we should deny any other knowledge, or we should be blindly following wizards and their examples. Every application differs, every community has different patterns and wishes, so we should rather follow what people need, and create good products for them, like Flickr. Sometimes even one-man (or all-volunteer) engineering team may do miracles, especially when there’re open platforms to build on.

It is hard to swallow the endless possibilities, that are provided by new type of services. I’m not sure wizardry these times is that difficult to swallow. In modern software world there’re no orthodox or unorthodox designs. There’re just ones which work and which don’t.

my slides from users conference

Today I was pointed to my previous post on scaling, and I remembered, that I didn’t put my slides from MySQL Users conference online. Maybe those are not giving that much of detail, but still, can disclose some of facts from my talk: “Wikipedia: cheap and explosive scaling with LAMP”.

There’s also Brion‘s presentation at Google on wider aspects of project technology, with slides and video made public too.

Scaling talks at mysql users conference

I am already a bit late to write about my MySQL Users Conference impressions or input, but better later than never. My pet topic is scalability, or rather, how to build big cheap systems, and I’ve had many mixed thoughts after the event, which of course had many scalability gurus from nice companies. The biggest impression was that we all scale different applications and have different demands (some have many datacenters with applications distributed, some had two power failures in whole datacenter in single week and went down for few hours..).

And as I also had a presentation on Wikipedia scaling, I’ll try to mention some of issues discussed there.

Different techniques

Main thing is that rules do not matter, application (or rather a purpose) does. All techniques should be taken with grain of salt, MMORPG is different from e-banking, though both may require synchronized states. A blog is not a wiki, as you won’t have clashes or lock conflicts on same resources. And sure, in some cases high availability (percentage of uptime) is less important than general availability – percentage of reach.

Distributing the load

Second major idea is that load has to be split. Of course, it is mandatory in case of ‘scale out’, but there may be different paths to acquire different kinds of needs – efficiency, availability, redundancy, accuracy, yadda yadda. Like…

  • If you know how to manage desynched slaves (what is really tied to application process), you can allow possibility of desynch (and hence not flush logs to disk after every transaction on any of your boxes).
  • Queries for different data domains can be sent to different slaves. Let it be per user, per language or any other fragmentation. Even if data is all there, just touching specific parts of it may improve cache locality. One of main things to consider then, is that data is clustered together, and a single row fetch will read 16KB block from disk and store it in memory. “Every third article/product/user/…” should be replaced by “every third thousand of articles/products/users”, or even some semi-dynamic bucket allocation. For Wikipedia it’s rather easy, we may just direct different languages to different servers, as per-language projects are quite self-contained.
  • Different types of queries can be sent to different slaves. Even if same data is there, you can still hit it with different patterns, and keep different indexes in hottest caches.
  • Not having the data will be always faster than just not reading it. If there’s enough of redundancy, data from different domains/types can be simply purged. Of course, purging the data that is not needed completely from the system is even more efficient approach.
  • You’ve already got RAIS – Redundant Array of Inexpensive Servers, so you can take out R from RAID and use just stripes of disks, forming AID, for performance of course.

Weakest slave will be slowing down capability of whole system, so doing less work on it not only in terms of requests per second, but also of how much of data it has to handle, may revive it for a new life (and, hehe, that way our poor 4GB old DB servers do have lots of juice :).


Caching is essential in high-performance environments (unless the service is random number generator two-point-oh). It is a common practice to add big nice caching layer (in memcacheds or squids or wherever else), but to leave data in core databases as it is. If efficient caching allows not to access data inside database too often, there’s no need to keep it on core database systems, as those are designed to work with data that needs work. Any box that has some idle resource like storage (most application servers usually do), may handle piece of rarely accessed but heavily cached elsewhere content.

Tools for the task

Different tasks may require different tools for the job. Lots of semi-static data can often be stored on application servers, usually as lightweight hash databases, just a proper method of migrating dynamic changes from core databases is required. It may be a simple rsync after a change was made, but it will save a roundtrip afterwards. Instead of updating full text indexes inside database, streams of changes may go to Lucene-based search application. And of course, sometimes just putting changes into background queues or off-peak schedules may improve responsiveness.

Speed vs power – both important

In scaled out environments adding more hardware often helps, but shouldn’t always be the main solution of the problem. Micro-optimizations have the purpose – besides obvious “saves resources” they also increase efficiency of individual nodes. Having the query served faster means also less locking or occupation of common resources (such as DB threads, waits on network), as well as far more improved user experience. This is where you might want to use high-power cores as in Opteron instead of lots of Niagara or Celeron ones (even if that may look much cheaper). Having 100 requests per second at 0.1s each rather than 100 requests per second at 1s each is quite a difference, and it counts.

Slow tasks

It is critical to avoid slow tasks on high performance systems. If there’re queries that are too expensive, just… kill them. Once you become overloaded you might want to start killing queries that run too long. Just KILLing the thread is not enough, either it has to be optimized (indexes, summary tables, etc), or eliminated. One cheap way is to delete outdated data, but if it is not possible, just having another table with hottest aggregated data may double or triple the performance. Here again, once data is aggregated into commonly used format, main source can be retired from hot memory to disks or to other cheaper services.

My common illustration is – you don’t want to see elephant walking in a highway. Elephant will have absolutely different access pattern, occupy too much space and move too slowly, where usually lots of throughput exists, not only blocking few lanes, but also attracting attention by drivers in opposite direction. Kill the elephant! Or rather, leave in natural habitat.


One of the magic weapons is compression. Gzip is fast, bzip2 is not, and many common perceptions of compression is that it is slow. No, it’s bzip2 that is slow, gzip is fast. Additionally, it may pack your data into a single block on file system instead of two or three. In some cases that may mean three times less seeks – milliseconds saved at a tiny fraction of CPU costs. In cases where there’s lots of similar text – like comments quoting other comments, different revisions for entry – concatenate it all and then compress. Gzip will love the similarity and produce ten or twenty times smaller BSOB (Binary Small Object).


There’re various profiling advices around, but what I hit multiple times, is that one hundred requests profiled separately one by one may provide more insights than a generic collection of min/max/avg/deviation for a million requests. Moreover, profile from production system may give lots of issues unspotted in development. It doesn’t mean though, that generic or development profiling should not be done. There should be no prejudices in process of profiling – worse than that is just optimization without profiling. Instead of “it has to be so” there should always a question if specific task can be improved.

Conclusions (or bragging (or whining))

Site handles now (it rises quite fast) over 12000 HTTP requests per second (out of which around 4000 are pageviews), on a cluster that could be built with ~500k$. At one talk in UC it was told that our platform isn’t very good. Sorry, with few volunteers working on that, we have to choose priorities for development. And it is a pity, that most of scaling management software is usually closed asset of the big players. Um, I’d love it to be open, really, pretty pretty please!