Packing for MySQL Conference 2009

Yay, coming to Santa Clara again (4th conference in a row!:). I can’t imagine my year without MySQL Conference trip anymore. To get a free ticket I’ll present on two topics, MySQL Security (lately I have related role, and have prepared bunch of information already) and deep-inspecting MySQL with DTrace (a voodoo session for all happy Solaris and MacOSX users :). See you there?

Velocity: Clouds!

This early morning I’ll start making a betting pool, if there will be a Velocity presentation that won’t mention ‘clouds’. While most of people enjoy the idea of clouds, thats actually where snow, hail, thunderstorms, and acid rain comes from. This industry needs better metaphors.

Update: Though I failed to mention a word cloud on my talk (I guess I was entirely alone in whole conference in that regard), it still made it to Slashdot.

Also, we already replaced ‘Wikimedia Grid’ with ‘Wikimedia Cloud’ on Ganglia.

Wikipedia at Velocity conference

Next Monday I’ll be presenting (if jetlag doesn’t kill me) at Velocity 2008 – webops and performance conference. It won’t be my first time talking about Wikipedia infrastructure, but this time people will know the technology and scaling methods anyway.

As I see it, in such context Wikipedia is more interesting as a case of operations underdog – non-profit lean budgets, brave approaches in infrastructure, conservative feature development, and lots of cheating and cheap tricks (caching! caching! caching!).

Also, I’ll be able to share (making audience jealous) how it is great to be on non-profit ops team (and one of example perks – we can be cheap about getting conference passes too ;-)

The best part (for audience, not for me) – I will be forced to be honest. Nearly whole tech team will be at the event, and if I fail to attribute any developments, or start talking crap – not only they can throw rotten tomatoes, but also disable my login access and claim they never knew me, without me being able to fight back :) I didn’t publicly present in front of these guys since 2005 – will be tough.

Speaking at MySQL Conference again, twice

Yay, coming this year to the MySQL conference again. This time with two different talks (second got approved just few days ago) on two distinct quite generic topics:

  • Practical MySQL for web applications
  • Practical character sets

The abstracts were submitted weeks apart, so the ‘practical’ being in both is something completely accidental :) Still, I’ll try to cover problems met and solutions used in various environments and practices – both as support engineer in MySQL, as well as engineer working on wikipedia bits.

Coming to US and talking about character sets should be interesting experience. Though most English-speaking people can stick to ASCII and be happy, current attempts to produce multilingual applications lead to various unexpected performance, security and usability problems.

And of course, web applications end up introducing quite new model of managing data environments, by introducing new set of rules, and throwing away traditional OLTP approaches. It is easy to slap another label on these, call it OLRP – on-line response processing. It needs preparing data for reads more than for writes (though balance has to be maintained). It needs digesting data for immediate responses. It needs lightweight (and lightning) accesses to do the minimum work. Thats where MySQL fits nicely, if used properly.

Conference notes

Cool ideas from the conference:

  1. Having same binary logs with same positions on slaves as on masters. This allows switching the database master a DNS-change (or IP takeover) operation. — Google
  2. Reading replication logs and prefetching data in parallel on slaves before replication events get executed. This way data is preheated for upcoming changes and serial updating thread works much faster. Especially useful on slightly delayed slaves. — Youtube
  3. For generating unique IDs for distributed environment use separate set of servers, without any replication, but with auto-increment offsets. This can be separate from core database and quite efficient. — Flickr
  4. Let’s get some beer. — Colleagues at MySQL AB

One of core messages I was trying to spread was “Relax. World is not going to end in case you lose a transaction.” I’m not sure how cool it was, but some nice folks out there said it was expiring. In many cases running a project has to be fun first, and most motivating targets should be the priority.
There still were ideas that had counter-arguments (of course, every situation may have different needs). One of discussions I bumped into was about using big services Out There (such as Amazon S3) instead of building your own datacenters – I didn’t end up convinced, but of course it is interesting to investigate if really costs can be lower.
Some more notes:

  • At the Flickr presentation there was someone asking why is Flickr using Yahoo’s search backend instead of rolling out anything Lucene-based. Lucene seemed to be the buzzword always in the air – though I had nice conversations with Sphinx developer too. Still, if I’d be commercial entity owned by Yahoo, I’d really reuse top-notch technology developed in other Y! departments, rather than use open source project, created by… Yahoo’s employee in free time.
  • Every time I heard someone was using NetApp Filers, I cried. Probably it is very good technology, but last time I checked prices, it was quite expensive… And someone mentioned it becomes slightly pain to ensure all the business continuity.
  • Open Source Systems were demonstrating pretty cool (literally too) dual-motherboard servers – promised lower prices and much lower energy consumption. Interesting.
  • Dolphin interconnect people said 10g-ethernet is not solving the network delay problem, and they’ll live for a while.
  • HP people were showcasing blade technology, though they avoided filling the enclosure with blades and loading them – would be quite a noise generator in the expo hall. We did discuss the unfair listprices by major hardware vendors – industry sucks at this place.
  • Heikki joked, people laughed, even few times. At the ‘clash of DB gurus’ Heikki mentioned he’d like to see Wikipedia still running on InnoDB after 20 years. It made me wonder what kind of data architectures will people use in 20 years.
  • Nobody solved the speed of light problem yet.
  • .org pavillion showcased the core products out there, especially in web publishing (WordPress, Drupal, Joomla), and workflow (Eventum, OTRS, Bugzilla, dotproject).
  • All scaling talks generally choose standard solutions and spend engineering on them. Didn’t see anything breathtaking, but of course, that may be still hidden part of technology.
  • I’d like to come back next year again. Again.

Writing a book (or preparing for MySQL Conference)

I already announced about coming to MySQL Conference, but I didn’t realize preparing for it will take that much time. Last year I had just regular session about Wikipedia’s scaling and did feel that it is somewhat difficult to squeeze that much information into less than one hour. This year I opted in for 3h session (with short break in the middle), and instead of few slides with buzzwords on them I worked on workbook-like material to talk and discuss about.
Presentations are always easy, I have to admit I’ve made quite a lot of my slides an hour before actual talks. Now I realized that writing a workbook ends up to be a book, and books are not written in single day… Full disclosure: I looked at last year’s presentation files and blog posts for preparation of the talk, but still, things have changed, both in technology and in numbers. We have far more visitors (ha, >30kreq/s instead of 12kreq/s!), more content, slightly more servers and less troubles :-)
Today I’ve delivered the paper for printing (dead tree handouts for session attendees!), but there are many ideas already what to append or to extend, so this will end up being perpetual process of improving. Let’s hope tutorial attendees will bring their laptops for updated digital handouts.
Of course, the good part is that the real work will be over after first day and I’ll be able to enjoy other sessions & social activities. If only I survive the staff party..

MySQL Conference 2007: Piggyback riding Wikipedia again. \o/

This year I’m coming to MySQL Conference again. Last year it was marvelous experience, with customers, community and colleagues (CCC!) gathering together, so I didn’t want to miss it this year at any cost :-)

This year instead of describing Wikipedia internals I’ll be disclosing them – all important bits, configuration files, code, ideas, problems, bugs and work being done through whole stack – starting with distributed caches in front, distributed middle-ware somewhere in the middle and distributed data storage in the back end. It will take three hours or so – bring your pillows. :)

Scaling talks at mysql users conference

I am already a bit late to write about my MySQL Users Conference impressions or input, but better later than never. My pet topic is scalability, or rather, how to build big cheap systems, and I’ve had many mixed thoughts after the event, which of course had many scalability gurus from nice companies. The biggest impression was that we all scale different applications and have different demands (some have many datacenters with applications distributed, some had two power failures in whole datacenter in single week and went down for few hours..).

And as I also had a presentation on Wikipedia scaling, I’ll try to mention some of issues discussed there.

Different techniques

Main thing is that rules do not matter, application (or rather a purpose) does. All techniques should be taken with grain of salt, MMORPG is different from e-banking, though both may require synchronized states. A blog is not a wiki, as you won’t have clashes or lock conflicts on same resources. And sure, in some cases high availability (percentage of uptime) is less important than general availability – percentage of reach.

Distributing the load

Second major idea is that load has to be split. Of course, it is mandatory in case of ‘scale out’, but there may be different paths to acquire different kinds of needs – efficiency, availability, redundancy, accuracy, yadda yadda. Like…

  • If you know how to manage desynched slaves (what is really tied to application process), you can allow possibility of desynch (and hence not flush logs to disk after every transaction on any of your boxes).
  • Queries for different data domains can be sent to different slaves. Let it be per user, per language or any other fragmentation. Even if data is all there, just touching specific parts of it may improve cache locality. One of main things to consider then, is that data is clustered together, and a single row fetch will read 16KB block from disk and store it in memory. “Every third article/product/user/…” should be replaced by “every third thousand of articles/products/users”, or even some semi-dynamic bucket allocation. For Wikipedia it’s rather easy, we may just direct different languages to different servers, as per-language projects are quite self-contained.
  • Different types of queries can be sent to different slaves. Even if same data is there, you can still hit it with different patterns, and keep different indexes in hottest caches.
  • Not having the data will be always faster than just not reading it. If there’s enough of redundancy, data from different domains/types can be simply purged. Of course, purging the data that is not needed completely from the system is even more efficient approach.
  • You’ve already got RAIS – Redundant Array of Inexpensive Servers, so you can take out R from RAID and use just stripes of disks, forming AID, for performance of course.

Weakest slave will be slowing down capability of whole system, so doing less work on it not only in terms of requests per second, but also of how much of data it has to handle, may revive it for a new life (and, hehe, that way our poor 4GB old DB servers do have lots of juice :).

Caching

Caching is essential in high-performance environments (unless the service is random number generator two-point-oh). It is a common practice to add big nice caching layer (in memcacheds or squids or wherever else), but to leave data in core databases as it is. If efficient caching allows not to access data inside database too often, there’s no need to keep it on core database systems, as those are designed to work with data that needs work. Any box that has some idle resource like storage (most application servers usually do), may handle piece of rarely accessed but heavily cached elsewhere content.

Tools for the task

Different tasks may require different tools for the job. Lots of semi-static data can often be stored on application servers, usually as lightweight hash databases, just a proper method of migrating dynamic changes from core databases is required. It may be a simple rsync after a change was made, but it will save a roundtrip afterwards. Instead of updating full text indexes inside database, streams of changes may go to Lucene-based search application. And of course, sometimes just putting changes into background queues or off-peak schedules may improve responsiveness.

Speed vs power – both important

In scaled out environments adding more hardware often helps, but shouldn’t always be the main solution of the problem. Micro-optimizations have the purpose – besides obvious “saves resources” they also increase efficiency of individual nodes. Having the query served faster means also less locking or occupation of common resources (such as DB threads, waits on network), as well as far more improved user experience. This is where you might want to use high-power cores as in Opteron instead of lots of Niagara or Celeron ones (even if that may look much cheaper). Having 100 requests per second at 0.1s each rather than 100 requests per second at 1s each is quite a difference, and it counts.

Slow tasks

It is critical to avoid slow tasks on high performance systems. If there’re queries that are too expensive, just… kill them. Once you become overloaded you might want to start killing queries that run too long. Just KILLing the thread is not enough, either it has to be optimized (indexes, summary tables, etc), or eliminated. One cheap way is to delete outdated data, but if it is not possible, just having another table with hottest aggregated data may double or triple the performance. Here again, once data is aggregated into commonly used format, main source can be retired from hot memory to disks or to other cheaper services.

My common illustration is – you don’t want to see elephant walking in a highway. Elephant will have absolutely different access pattern, occupy too much space and move too slowly, where usually lots of throughput exists, not only blocking few lanes, but also attracting attention by drivers in opposite direction. Kill the elephant! Or rather, leave in natural habitat.

Compression

One of the magic weapons is compression. Gzip is fast, bzip2 is not, and many common perceptions of compression is that it is slow. No, it’s bzip2 that is slow, gzip is fast. Additionally, it may pack your data into a single block on file system instead of two or three. In some cases that may mean three times less seeks – milliseconds saved at a tiny fraction of CPU costs. In cases where there’s lots of similar text – like comments quoting other comments, different revisions for entry – concatenate it all and then compress. Gzip will love the similarity and produce ten or twenty times smaller BSOB (Binary Small Object).

Profiling

There’re various profiling advices around, but what I hit multiple times, is that one hundred requests profiled separately one by one may provide more insights than a generic collection of min/max/avg/deviation for a million requests. Moreover, profile from production system may give lots of issues unspotted in development. It doesn’t mean though, that generic or development profiling should not be done. There should be no prejudices in process of profiling – worse than that is just optimization without profiling. Instead of “it has to be so” there should always a question if specific task can be improved.

Conclusions (or bragging (or whining))

Site handles now (it rises quite fast) over 12000 HTTP requests per second (out of which around 4000 are pageviews), on a cluster that could be built with ~500k$. At one talk in UC it was told that our platform isn’t very good. Sorry, with few volunteers working on that, we have to choose priorities for development. And it is a pity, that most of scaling management software is usually closed asset of the big players. Um, I’d love it to be open, really, pretty pretty please!