I knew it – Metaverse runs on MySQL!

I was enjoying a fabulous read, Neal Stephenson’s novel Snow Crash. The book was written back in 1992, but it emphasized virtual reality and collaboration in a way that still sounds modern today.

What was fun (and what made me write this slightly spoilerish post) is that it is easy to spot MySQL usage patterns in science fiction:

  • The Metaverse – the virtual reality internet – already has counterparts in our non-fictional world. “Second Life” is a nice [MySQL-powered] platform for virtual life and creativity, and it is now in a state of early Metaverse evolution.
  • The Central Intelligence Corporation (created after the merger of the Library of Congress and the CIA, as they were duplicating each other anyway) operates huge resources of user-provided content, like Flickr, YouTube or Wikipedia [MySQL-powered again]. Though it has somewhat better interfaces and APIs – I guess that is what Web 5.0 is about ;-)

I really liked how asynchronous replication and the lack of locking affect virtual reality worlds – even these effects were taken into account by the author. Add neurolinguistics, civilization history and brilliant satire, and you get the best geek fiction. Thumbs up for the book. The Diamond Age seems to be the next read.

On books, examples and wizards

I really like Flickr. I believe it is one of the greatest services on the Web (my album is there :), and it runs MySQL. But I don’t understand why their database folks think nobody else knows how to build stuff. I have seen nice books and presentations from Yahoo! (oh wait!) and other people who have been building big systems, engineered their solutions properly, and survived.

The technical literature gets noticed:

Then there are whole books on the subject of capacity and scalability for the database layer.

Yes, they are good reads.

Or…

Then there are novels from developers that in many cases really don’t know the tricks of the DBMS they are working with, and create elaborate abstraction layers that automatically generate SQL for the DB in question from objects and such.

Some of these layers are made with efficient SQL in mind, or at least always allow overriding their functionality with efficient SQL.

So, it is easy to answer this question:

But, with all these people who tell you how to do it, actually can they prove that it works under a constant high workload for many people all at the same time.

I believe these people can.

Now there’re parts of Flickr operation described:

You may be thinking to yourself yea right say you can do 20K + transactions per seconds that must be a crap load of expensive hardware all running, where all the data is served out of memory.

With a proper data layout, a single 10000$ system may handle 10000 queries per second. Of course, hitting disk may decrease efficiency, so one may end up at 2-5$ per query-per-second of capacity. I’m not sure Flickr would consider a 100k$ database setup expensive hardware. Here again, “all data served from memory” may sound expensive, but most systems really serve just “most data from memory”. Like ours, which runs on those 10k$-class machines (disclosure: ~10 of them) and serves >30000 queries per second at the moment. And that is efficient.

This one is supposed to blow our minds and wipe away the stuff we know:

All of our database connections are real time. Our load balancer for the database is written in 13 lines of PHP code.

There are lots of posts detailing how fast MySQL connections are and how database pooling isn’t necessary anymore. Our load balancer is actually 651 lines of PHP code, but it still connects to the database on each request. And it takes less than a millisecond to connect – quite an affordable cost.

I am certainly interested in all of Flickr’s design specifics – it is a nice application, a good example, and it seems to work. But I don’t believe we should dismiss all other knowledge, or blindly follow wizards and their examples. Every application differs, every community has different patterns and wishes, so we should follow what people need and create good products for them, like Flickr does. Sometimes even a one-man (or all-volunteer) engineering team can do miracles, especially when there are open platforms to build on.

It may be hard to swallow the endless possibilities provided by this new type of services. I’m not sure the wizardry of these times is that difficult to swallow, though. In the modern software world there are no orthodox or unorthodox designs. There are just ones that work and ones that don’t.

my slides from users conference

Today I was pointed to my previous post on scaling, and I remembered that I hadn’t put my slides from the MySQL Users Conference online. Maybe they don’t give that much detail, but they can still disclose some facts from my talk: “Wikipedia: cheap and explosive scaling with LAMP”.

There’s also Brion‘s presentation at Google on wider aspects of the project’s technology, with slides and video made public too.

Scaling talks at mysql users conference

I am already a bit late writing about my MySQL Users Conference impressions, but better late than never. My pet topic is scalability, or rather, how to build big cheap systems, and I’ve had many mixed thoughts after the event, which of course featured many scalability gurus from nice companies. The biggest impression was that we all scale different applications and have different demands (some have applications distributed across many datacenters, some had two power failures in their whole datacenter in a single week and went down for a few hours..).

And as I also had a presentation on Wikipedia scaling, I’ll try to mention some of the issues discussed there.

Different techniques

The main thing is that rules do not matter; the application (or rather its purpose) does. All techniques should be taken with a grain of salt: an MMORPG is different from e-banking, though both may require synchronized state. A blog is not a wiki, as you won’t have clashes or lock conflicts on the same resources. And sure, in some cases high availability (percentage of uptime) is less important than general availability – percentage of reach.

Distributing the load

The second major idea is that load has to be split. Of course, that is mandatory in a ‘scale out’ approach, but there may be different paths to satisfy different kinds of needs – efficiency, availability, redundancy, accuracy, yadda yadda. Like…

  • If you know how to manage desynced slaves (which is really tied to the application’s processes), you can allow the possibility of desync (and hence avoid flushing logs to disk after every transaction on any of your boxes).
  • Queries for different data domains can be sent to different slaves. Let it be per user, per language or any other fragmentation. Even if all the data is there, touching only specific parts of it may improve cache locality. One of the main things to consider is that data is clustered together, and a single row fetch will read a 16KB block from disk and store it in memory. “Every third article/product/user/…” should be replaced by “every third thousand of articles/products/users”, or even some semi-dynamic bucket allocation. For Wikipedia it’s rather easy: we can just direct different languages to different servers, as per-language projects are quite self-contained.
  • Different types of queries can be sent to different slaves. Even if the same data is there, you can still hit it with different patterns and keep different indexes in the hottest caches.
  • Not having the data will always be faster than just not reading it. If there’s enough redundancy, data for different domains/types can simply be purged from some of the boxes. Of course, purging data that is not needed at all from the whole system is an even more efficient approach.
  • You’ve already got RAIS – a Redundant Array of Inexpensive Servers – so you can take the R out of RAID and use just stripes of disks, forming AID, for performance of course.

The weakest slave will slow down the capability of the whole system, so giving it less work – not only in terms of requests per second, but also in how much data it has to handle – may revive it for a new life (and, hehe, that way our poor old 4GB DB servers still have lots of juice :).

Caching

Caching is essential in high-performance environments (unless the service is Random Number Generator 2.0). It is common practice to add a big, nice caching layer (in memcached or Squid or wherever else) but to leave the data in the core databases as it is. If efficient caching means the data inside the database is rarely accessed, there’s no need to keep it on the core database systems, as those are designed to work with data that needs work. Any box that has some idle resource like storage (most application servers usually do) can handle a piece of content that is rarely accessed but heavily cached elsewhere.

Tools for the task

Different tasks may require different tools for the job. Lots of semi-static data can often be stored on application servers, usually as lightweight hash databases; all that is required is a proper method of migrating dynamic changes out of the core databases. It may be a simple rsync after a change is made, but it saves a roundtrip afterwards. Instead of updating full-text indexes inside the database, streams of changes can go to a Lucene-based search application. And of course, sometimes just putting changes into background queues or off-peak schedules can improve responsiveness.

Speed vs power – both important

In scaled-out environments adding more hardware often helps, but it shouldn’t always be the main solution to the problem. Micro-optimizations have their purpose – besides the obvious “saves resources” they also increase the efficiency of individual nodes. Serving a query faster also means less locking and less occupation of common resources (such as DB threads or waits on the network), as well as a far better user experience. This is where you might want to use high-power cores as in an Opteron instead of lots of Niagara or Celeron ones (even if those may look much cheaper). Serving 100 requests per second at 0.1s each rather than 100 requests per second at 1s each is quite a difference, and it counts.

Slow tasks

It is critical to avoid slow tasks on high-performance systems. If there are queries that are too expensive, just… kill them. Once you become overloaded you might want to start killing queries that run too long. Just KILLing the thread is not enough, though: the query either has to be optimized (indexes, summary tables, etc.) or eliminated. One cheap way is to delete outdated data; if that is not possible, just having another table with the hottest aggregated data may double or triple performance (see the sketch below). Here again, once data is aggregated into a commonly used format, the main source can be retired from hot memory to disks or to other, cheaper services.
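As a minimal sketch of such a summary table – the table and column names here are purely hypothetical, just to illustrate the shape of it:

-- Per-user totals that would otherwise require scanning a large
-- events table on every page view
CREATE TABLE user_stats (
    user_id     INT UNSIGNED NOT NULL PRIMARY KEY,
    event_count INT UNSIGNED NOT NULL DEFAULT 0,
    last_event  DATETIME     NOT NULL
) ENGINE=InnoDB;

-- Rebuilt periodically (or maintained incrementally by the application)
REPLACE INTO user_stats (user_id, event_count, last_event)
SELECT user_id, COUNT(*), MAX(created_at)
FROM events
GROUP BY user_id;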

My usual illustration is this: you don’t want to see an elephant walking on a highway. The elephant has a completely different access pattern, occupies too much space and moves too slowly where lots of throughput usually flows, not only blocking a few lanes but also attracting the attention of drivers in the opposite direction. Kill the elephant! Or rather, leave it in its natural habitat.

Compression

One of the magic weapons is compression. Gzip is fast, bzip2 is not – the common perception that compression is slow comes from bzip2. Additionally, compression may pack your data into a single block on the file system instead of two or three. In some cases that may mean three times fewer seeks – milliseconds saved at a tiny fraction of the CPU cost. In cases where there’s lots of similar text – like comments quoting other comments, or different revisions of an entry – concatenate it all and then compress. Gzip will love the similarity and produce a ten or twenty times smaller BSOB (Binary Small Object).
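A rough way to see the effect from the MySQL prompt – COMPRESS() uses zlib, close enough to gzip for this comparison; the revisions table and its columns are hypothetical:

-- Compare compressing each revision on its own vs. compressing the
-- concatenation of all revisions of one page. Similar text compresses
-- far better when packed together.
-- (GROUP_CONCAT is capped by group_concat_max_len; raise it for real tests.)
SELECT
    SUM(LENGTH(COMPRESS(rev_text)))                         AS separately,
    LENGTH(COMPRESS(GROUP_CONCAT(rev_text SEPARATOR '\n'))) AS concatenated
FROM revisions
WHERE page_id = 12345;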

Profiling

There is plenty of profiling advice around, but something I have hit multiple times is that one hundred requests profiled separately, one by one, may provide more insight than a generic collection of min/max/avg/deviation over a million requests. Moreover, a profile from the production system may reveal lots of issues unspotted in development. That doesn’t mean generic or development profiling should not be done. There should be no prejudices in the profiling process – the only thing worse is optimization without profiling. Instead of “it has to be this way” there should always be the question of whether a specific task can be improved.

Conclusions (or bragging (or whining))

The site now handles (and it is rising quite fast) over 12000 HTTP requests per second (of which around 4000 are pageviews), on a cluster that could be built for ~500k$. At one talk at the UC it was said that our platform isn’t very good. Sorry – with few volunteers working on it, we have to choose our development priorities. And it is a pity that most scaling-management software is usually the closed asset of the big players. Um, I’d love it to be open, really, pretty pretty please!

wordpress: friendly look at query log

WordPress is no doubt one of the most popular web packages out there (besides MediaWiki, ha ha ha). It is easy to install and easy to use. And it doesn’t run huge sites – it is not designed to. Anyway, even though it is not designed to, I’m running it for my blog; an old version of course (2.0 is scary for me).

I have the general query log enabled (for now it’s easier to truncate it than to disable it on a running server), so of course I spot all sorts of funny queries running on my system (people hit the site just when I want to test stuff…). So I looked at what WordPress does in order to greet my fellow visitors. Some of my comments might be rubbish, as I did not do any profiling, but some of them might provide insights for anyone developing a web application (or at least a blog).

First of all, it wants to know who the privileged users on the site are. I don’t know why, but it does this:

SELECT * FROM wp_users WHERE user_level > 0

Of course, there’s no index on user_level, so if you have enabled registration (or even force people to register to post comments, eh), the server will scan the whole table even if you do not have any privileged users. You might want to add that index.
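Adding it is a one-liner:

-- Index the column used in the privileged-user lookup
ALTER TABLE wp_users ADD INDEX user_level (user_level);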

Next step: options.

SELECT option_name, option_value FROM wp_options WHERE autoload = 'yes'

Here it reads all the options one would need. It’s fun – there are 60 options that are autoloaded, and their values weigh 500 bytes in total. The ones that are not autoloaded (hey, no index here either) – 11 rows, 150k of data. Even more – it’s just 4 RSS feeds serialized and put into the database. There are multiple lessons here:

  • No need to put options in the database when you can have them serialized on disk.
  • No need to put bulk RSS feed data together with lots of small options – use another table for that.
  • Here again, even if you do put bulk data together, use indexes for getting the proper subset (see the index sketch below). Anyway, the points above are somewhat more important.
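The index sketch for the last point:

-- Let the autoloaded-options query use an index instead of a full scan
ALTER TABLE wp_options ADD INDEX autoload (autoload);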

Further: categories

SELECT * FROM wp_categories

Um, small dataset, let it live. Of course, one could build a self-join to get a recursive tree. Doesn’t matter. Sorry for the noise!

The fun begins. Posts!!!

SELECT DISTINCT * FROM wp_posts
WHERE 1=1 AND
post_date_gmt <= '2006-03-25 19:15:59' AND
(post_status = "publish" OR post_author = 2
AND post_status != 'draft' AND
    post_status != 'static')
GROUP BY wp_posts.ID
ORDER BY post_date DESC
LIMIT 0, 10

First of all, we SELECT DISTINCT * (this table has a primary key, it can’t have duplicate rows), then we GROUP BY the unique ID (which again doesn’t make sense), then we eliminate every chance of ordering by an index – oh wait, there’s no index on post_date anyway. I’d really like it to just use the “publish” status for the front page and not care about anything else. This is a publishing system with workflows! :)
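A sketch of what the front-page fetch could look like instead, assuming the front page really only needs published posts:

-- Equality on post_status plus index order on post_date lets MySQL read
-- just the ten newest published rows; the post_date_gmt check is then
-- applied only to those rows
ALTER TABLE wp_posts ADD INDEX status_date (post_status, post_date);

SELECT *
FROM wp_posts
WHERE post_status = 'publish'
AND post_date_gmt <= '2006-03-25 19:15:59'
ORDER BY post_date DESC
LIMIT 0, 10;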

Mapping categories:

 SELECT DISTINCT
post_id, cat_ID FROM wp_categories, wp_post2cat
WHERE category_id = cat_ID AND post_id IN
(31,30,29,28,27,26,25,24,23,22)

It could be done by… joining with the wp_posts query above, scanning an index, then using GROUP BY and GROUP_CONCAT() for the category names. Or even by having that denormalized (aggregated) inside the posts table, but that’s already not that nice. Forget it :)
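For the record, the GROUP_CONCAT() variant would look roughly like this (cat_name is assumed from the old WordPress schema):

-- One row per post, category names folded into a comma-separated list
SELECT p2c.post_id,
    GROUP_CONCAT(c.cat_name) AS categories
FROM wp_post2cat p2c
JOIN wp_categories c ON c.cat_ID = p2c.category_id
WHERE p2c.post_id IN (31,30,29,28,27,26,25,24,23,22)
GROUP BY p2c.post_id;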

Counting comments… I know this is always very important – everyone needs to know the exact count:

SELECT ID, COUNT( comment_ID ) AS ccount
FROM wp_posts
LEFT JOIN wp_comments ON (
comment_post_ID = ID  AND
comment_approved =  '1')
WHERE ID IN (31,30,29,28,27,26,25,24,23,22)
GROUP BY ID

What happens more often, reading the comment count or adding a comment? If the latter, skip this section; otherwise, put the comment count in the posts table and increment it once a comment becomes visible. Incrementing and decrementing is slightly cheaper than scanning a table after you’re flooded with hate/love messages.
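A minimal sketch of that denormalization – the comment_count column here is added by hand, purely for illustration:

-- Keep the count next to the post instead of recomputing it on every view
ALTER TABLE wp_posts ADD COLUMN comment_count INT UNSIGNED NOT NULL DEFAULT 0;

-- One-off backfill from the existing comments
UPDATE wp_posts p
SET comment_count = (
    SELECT COUNT(*) FROM wp_comments c
    WHERE c.comment_post_ID = p.ID AND c.comment_approved = '1');

-- Then, whenever a comment becomes visible:
UPDATE wp_posts SET comment_count = comment_count + 1 WHERE ID = 31;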

Um, then… meta!
SELECT post_id, meta_key, meta_value FROM wp_postmeta
WHERE post_id IN(31,30,29,28,27,26,25,24,23,22)
ORDER BY post_id, meta_key

Maybe I’m not one of those semantic-web folks, but I’d rather skip querying an empty table – less IPC, faster site! And I’ve seen multiple queries against this table for every article displayed. It’s empty! No need to check it again!

Dates.. months.. years… counts…

SELECT DISTINCT
   YEAR(post_date) AS `year`,
   MONTH(post_date) AS `month`,
   count(ID) as posts
FROM wp_posts
WHERE post_date < '2006-03-25 21:15:11' AND
post_status = 'publish'
GROUP BY YEAR(post_date), MONTH(post_date)
ORDER BY post_date DESC

This is where aggregation may help. Once a new month starts, statistics for the old month (and even the current one) can be written into a separate table. Otherwise the server has to scan all posts, order them, sum them up and present that to you. Every time.
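A sketch of such an archive table (the names are made up):

-- Per-month post counts, maintained once instead of recomputed per view
CREATE TABLE wp_archive_counts (
    `year`  SMALLINT UNSIGNED NOT NULL,
    `month` TINYINT UNSIGNED NOT NULL,
    posts   INT UNSIGNED NOT NULL,
    PRIMARY KEY (`year`, `month`)
);

-- Refresh once per month (or maintain incrementally on publish)
REPLACE INTO wp_archive_counts (`year`, `month`, posts)
SELECT YEAR(post_date), MONTH(post_date), COUNT(*)
FROM wp_posts
WHERE post_status = 'publish'
GROUP BY YEAR(post_date), MONTH(post_date);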

Here we can count articles once again:

SELECT COUNT(ID) FROM wp_posts
WHERE 1=1 AND post_date_gmt <= '2006-03-25 19:15:59' AND
(post_status = "publish" OR post_author = 2 AND
post_status != 'draft' AND post_status != 'static')

Some of the above comments apply :)

Static pages… This is a content management system – it has to allow creating pages! For that we do:

SELECT * FROM wp_posts WHERE post_status = 'static'
ORDER BY post_title ASC

So, to show links to internal content (ten or twenty bytes in your HTML) it sends all the text of all internal content to the application. Well, WordPress keeps all text in the posts table, so… * includes that. Of course, there’s no composite index on (status, title), so a traditional ORDER BY happens. And no filesort optimizations apply (rows are too wide to fit into the filesort buffer), so the table is hit multiple times during the query.
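A sketch of a lighter alternative:

-- The bigger win is not dragging the full post text across the wire
SELECT ID, post_title
FROM wp_posts
WHERE post_status = 'static'
ORDER BY post_title ASC;

-- A composite index helps the filter; post_title is a TEXT column,
-- so the index needs a prefix length
ALTER TABLE wp_posts ADD INDEX status_title (post_status, post_title(64));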

Afterwards there is some ‘links’ magic, which I skipped. I don’t have jumbo blogrolls (though I’d love jumbo rolls in a sushi bar :), so those queries look evil but cause no harm :) Eventually there’s some additional counting:

SELECT DISTINCT
YEAR(post_date) AS `year`,
MONTH(post_date) AS `month`, count(ID) as posts
FROM wp_posts
WHERE post_date < '2006-03-25 21:15:11' AND
post_status = 'publish'
GROUP BY YEAR(post_date), MONTH(post_date)
ORDER BY post_date DESC

Um, not again… déjà vu? :)

And then…

 Quit \o/ 

This was just a look at the load of a single page view. There are more issues of various kinds once other types of pageviews are involved.
Anyway, major issues:

  • Do not use DISTINCT or GROUP BY if you do not have to.
  • Use indexes for sorting, row selection and offset retrieval. Create multiple indexes.
  • Avoid putting bulk data together with metadata.
  • Count and save. Reuse.

What would my favorite front-page query be?

SELECT id,name,content,comment_count,
    group_concat(category) as categories
FROM posts
    LEFT JOIN postcategories USING (postid)
WHERE date < 'offset' AND status = 'published'
ORDER BY date DESC LIMIT 10

It would be absolutely efficient with a (status, date) index, it would not need to scan the comments table to get a count, it would not have to scan the whole posts table and then group or order it, and it would be faster than three slow queries on tables that may grow.
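The index that query relies on, in the simplified schema above:

-- Equality on status plus range/order on date: one index serves both,
-- so the LIMIT 10 touches only ten rows
ALTER TABLE posts ADD INDEX status_date (status, `date`);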

An inner voice (from my hosting-business past) kept saying: but MySQL has a query cache! Yes, it has, but putting timestamps into queries eliminates any possibility of caching them. And one should not rely on query caching anyway; one should rely on data and index caching. This is where databases shine.

Final question: will upgrading to WordPress 2.0 solve these issues? Or will I just be trapped in yet another case of featuritis?

db abstraction: zend framework vs mediawiki

So, now that Zend has released their framework, I couldn’t resist looking at its database classes (there’s much more other stuff in it) and comparing them with our own bits in MediaWiki.

Thumbs up for plain query-building functions (a must in an abstraction layer). Those are easy to understand and seem to work :)

One of the issues with SQL programming is that people treat queries as text instead of ‘structured queries’. This is where Zend’s $db->quoteInto() is overused – even where there’s no need to, people are forced to embed SQL bits instead of thinking just about data flow and relations.

In order to add additional criteria in MW, you’d do:

$where[]=array('field'=>$value);

Zend Framework’s query builder offers a slightly different approach. There’s a Select object, which still eats text as parameters:

$select->where('field = ?',$value);

What raised my eyebrows was $select->from('foo', '*');. First of all, SELECT * is evil. On the other hand, it is an interesting design decision to put the column list together with ‘FROM’. How would one issue a ‘SELECT 1’ query this way? :)

The $db->insert() calls appeared to be identical in both libraries: you just pass an array of key=>value pairs and a table to put all that into. $db->update(), though, has the same arguments but evaluates them differently. For Zend you have to pass a plain-text condition in the form of $db->quoteInto('field = ?', $value). MW allows passing arrays of conditions:

$where = array ( 'group' => $groupid, 'parent' => array ( 1,2,3,4));

Of course, you may also pass a plain string to MW’s DB abstraction – type detection helps a lot there :)

The same differences apply to the delete method. People used to say that you don’t need any query building once you’ve got placeholder replacement in SQL queries. What is usually missed is that abstractions are not just for portability or security, but also for an easier way of doing things. If plain text is to be replaced with array passing, doing it completely would help more (I’m not advocating splitting strings into arrays of chars, though ;-). Appending an item to an array still looks better than string concatenation.

The ‘SELECT’ builder has many more differences. We packed the handling of ORDER BY, GROUP BY, HAVING, LIMIT, etc. into $options, while Zend provides separate methods for each. I immediately missed ‘SELECT … FOR UPDATE’ in the documentation – this is one of the most important options in proper web applications.

The OFFSET/LIMIT support (and even its recommendation for paging) requires a serious attitude readjustment. Scanning result sets _is_ expensive, and numeric offsets shouldn’t be presented as a ‘nice way out’.

The lack of outer JOINs (as well as the lack of workarounds in the documentation) was surprising too.

Of course, providing a table as an object was a nice thing to do, especially with the possibility of adding additional logic to a table object:

class RoundTable extends Zend_Db_Table {
    public function insert($data) {
        // Default the creation timestamp if the caller did not supply one
        if (empty($data['created_on'])) { $data['created_on'] = time(); }
        return parent::insert($data);
    }
}

It is nice to see such a library being pushed out – as it is in its 0.1.1 state, there’s lots of room for future improvement :)

There are other libraries out there, but those are undocumented, unpackaged and not that well maintained. And of course, there’s Ruby on Rails, which Zend’s Active Record implementation mimics poorly. I don’t like ORM myself, but with all the hype around it, Zend could try to make their ORM better (or rather, existent, as there are no relations in their objects). Even Django looks much better in this field. So now I’m just waiting for 1.0 ;-)

printing envelopes in macosx

I had to print address labels on an envelope (people still need to use snail mail sometimes). I used to be spoiled by MS Office, so I started OpenOffice and tried to do it there. I failed:

  • It didn’t print in landscape whatever I tried, and it didn’t save settings. Barf.
  • It complained that locale “” did not exist, so I couldn’t have my street name spelled correctly.
  • Ah, sure, it crashed several times – it’s an experimental 2.0 Mac Intel build, so I guess it has to do that, to show how difficult QA is. :)

So I opened a simple text editor and tried to do it from there. I failed too, because I didn’t understand how to transform a simple text file into a rotated envelope print. It came out either rescaled or simply wrong in some other way. That counts as a failure too. I tried from TextMate as well, but it didn’t like my attempts either.

So I searched the web for hints. It wasn’t much of a search, because the title of the first hit told me exactly what to do: Address Book. The solution is simple: select the recipient, print. Sometimes I am still too power-user-oriented :)