Conference notes

Cool ideas from the conference:

  1. Having same binary logs with same positions on slaves as on masters. This allows switching the database master a DNS-change (or IP takeover) operation. — Google
  2. Reading replication logs and prefetching data in parallel on slaves before replication events get executed. This way data is preheated for upcoming changes and serial updating thread works much faster. Especially useful on slightly delayed slaves. — Youtube
  3. For generating unique IDs for distributed environment use separate set of servers, without any replication, but with auto-increment offsets. This can be separate from core database and quite efficient. — Flickr
  4. Let’s get some beer. — Colleagues at MySQL AB

One of core messages I was trying to spread was “Relax. World is not going to end in case you lose a transaction.” I’m not sure how cool it was, but some nice folks out there said it was expiring. In many cases running a project has to be fun first, and most motivating targets should be the priority.
There still were ideas that had counter-arguments (of course, every situation may have different needs). One of discussions I bumped into was about using big services Out There (such as Amazon S3) instead of building your own datacenters – I didn’t end up convinced, but of course it is interesting to investigate if really costs can be lower.
Some more notes:

  • At the Flickr presentation there was someone asking why is Flickr using Yahoo’s search backend instead of rolling out anything Lucene-based. Lucene seemed to be the buzzword always in the air – though I had nice conversations with Sphinx developer too. Still, if I’d be commercial entity owned by Yahoo, I’d really reuse top-notch technology developed in other Y! departments, rather than use open source project, created by… Yahoo’s employee in free time.
  • Every time I heard someone was using NetApp Filers, I cried. Probably it is very good technology, but last time I checked prices, it was quite expensive… And someone mentioned it becomes slightly pain to ensure all the business continuity.
  • Open Source Systems were demonstrating pretty cool (literally too) dual-motherboard servers – promised lower prices and much lower energy consumption. Interesting.
  • Dolphin interconnect people said 10g-ethernet is not solving the network delay problem, and they’ll live for a while.
  • HP people were showcasing blade technology, though they avoided filling the enclosure with blades and loading them – would be quite a noise generator in the expo hall. We did discuss the unfair listprices by major hardware vendors – industry sucks at this place.
  • Heikki joked, people laughed, even few times. At the ‘clash of DB gurus’ Heikki mentioned he’d like to see Wikipedia still running on InnoDB after 20 years. It made me wonder what kind of data architectures will people use in 20 years.
  • Nobody solved the speed of light problem yet.
  • .org pavillion showcased the core products out there, especially in web publishing (WordPress, Drupal, Joomla), and workflow (Eventum, OTRS, Bugzilla, dotproject).
  • All scaling talks generally choose standard solutions and spend engineering on them. Didn’t see anything breathtaking, but of course, that may be still hidden part of technology.
  • I’d like to come back next year again. Again.

Wikipedia: site internals, etc (the workbook)

There still are details (and even complete subsystems) to be documented more thoroughly, but this is the workbook I presented at MySQL Conference. I’ve heard comments that the book has ‘hypothetical’ situations, but generally every bit is there represents real practice we’re having.

The talk did drift to some bits of information that may not be there (or did go much deeper), as well as the book covers some parts of operations that were not discussed at the tutorial talk. Anyway, I hope both session and tutorial has some value.

Here’s the file: Wikipedia: Site internals, configuration, code examples and management issues (the workbook). For now it is PDF, but I hope to transform it into properly evolving public document.

Update: Another good presentation on Wikipedia technology by Mark: Wikimedia Architecture

Writing a book (or preparing for MySQL Conference)

I already announced about coming to MySQL Conference, but I didn’t realize preparing for it will take that much time. Last year I had just regular session about Wikipedia’s scaling and did feel that it is somewhat difficult to squeeze that much information into less than one hour. This year I opted in for 3h session (with short break in the middle), and instead of few slides with buzzwords on them I worked on workbook-like material to talk and discuss about.
Presentations are always easy, I have to admit I’ve made quite a lot of my slides an hour before actual talks. Now I realized that writing a workbook ends up to be a book, and books are not written in single day… Full disclosure: I looked at last year’s presentation files and blog posts for preparation of the talk, but still, things have changed, both in technology and in numbers. We have far more visitors (ha, >30kreq/s instead of 12kreq/s!), more content, slightly more servers and less troubles :-)
Today I’ve delivered the paper for printing (dead tree handouts for session attendees!), but there are many ideas already what to append or to extend, so this will end up being perpetual process of improving. Let’s hope tutorial attendees will bring their laptops for updated digital handouts.
Of course, the good part is that the real work will be over after first day and I’ll be able to enjoy other sessions & social activities. If only I survive the staff party..

Where covering indexes rock

One of MySQL features I usually like to exploit is covering index. Simply, if all fields required for a query are contained in an index, there’s no need to access table row somewhere else. In MyISAM that would mean not going to MYD file. Inside InnoDB data is clustered together with PK (another lovely feature) and secondary indexes contain PK values, so it is one B-Tree less to traverse too.

I’ll take a real-life example where it really mattered. In Wikipedia for English language only we have a ‘revisions’ table with 100m records. The PK is (page_id, revision_id), so any per-page operations (say, viewing complete history of changes) would read data clustered together in single block. Now the often used operation is to check user contributions, which would of course have an index (user_id, timestamp) and use it for traversing all revisions made by particular user.

In such case, if user is not centered on single page (of course, there’re many of these), going after entries in primary index will have to read (in worst case) 16k page from disk (and maybe a bigger stripe from RAID) for every 150-byte row that may be in there. In order to read information about 10000 revisions one would end up reading 200MB of data (and having 10000 seeks). Now simply by adding a covering index it reduces number of seeks required to 100 seeks (and reads just 2MB of data).

In our case we simply added an index on (user, timestamp, page, variousmetadata, comments, etc) – simply, all fields that were in table. It made a complete copy of 16GB table, but on the other hand, RAM costs per gigabyte are bigger than disk storage.

It may be not an issue in environments, where primary key (and table data) is already all cached in memory. Accessing a record in a page, once it is already in memory, is very cheap, compared to reading the page from disk. Of course, this makes a copy of table data, but really removes I/O saturation, especially if data is very cold.

Of course, still being on tight budget, we saved some disk space here too. Simply putting such wide index on single slave and not on others, then sending all queries requiring such index to particular slave did solve many resource issues. Maintaining secondary indexes inside InnoDB is not that expensive – all writes are delayed and go to ‘adaptive hash index’, and transaction logs contain just information about a single copy of a row.

Once we will go over 50 (or 20 :) DB servers, we may probably want to think about making such index changes and load balancing more automatic, but for now solving edge cases this way seems proper.

And here is comparison of two DB servers, one with covering and one with simple (just for WHERE conditions) index:

db4 mysql> select sum(rev_minor_edit) from revision
                        where rev_user_text='Test user';
1 row in set (21.73 sec)

db6 mysql> select sum(rev_minor_edit) from revision
                        where rev_user_text='Test user';
1 row in set (2.30 sec)

Both servers did read completely cold data (though they were in working state with warmed up buffers), and repeating of queries did provide not that different results (both servers executed under a second). Of course, 200MB of pages in buffer displaced possibly useful information…

Covering indexes are really powerful tool in environments where cold non-buffered data exists, and yet again prove the point, that access patterns of applications are very important when designing any schema or indexes. It is really one of major issues I expect from a storage engine (or DBMS in general) for any bigger data environments.

MySQL Conference 2007: Piggyback riding Wikipedia again. \o/

This year I’m coming to MySQL Conference again. Last year it was marvelous experience, with customers, community and colleagues (CCC!) gathering together, so I didn’t want to miss it this year at any cost :-)

This year instead of describing Wikipedia internals I’ll be disclosing them – all important bits, configuration files, code, ideas, problems, bugs and work being done through whole stack – starting with distributed caches in front, distributed middle-ware somewhere in the middle and distributed data storage in the back end. It will take three hours or so – bring your pillows. :)

Not reservation, order code!

Italians sometimes may be damn frustrating. This time I ordered an international train ticket online, then went to train station (Milano Centrale) to get it (may sound stupid, but they seem to require all those reservations and e-tickets still not available).

Of course, the guy at “international tickets” counter didn’t actually speak English (I sure know Italian is most important language, at least in operas). Then… it appears I didn’t read the email properly.

The email their automated system sent me did have two sections. One of them was Order details, which contained ‘Booking code (DOSSIER)’. Another was ‘Ticket details’, which contained a reservation number. Of course, I had wrong code. I know it was frustrating for him too, as the intranet application responded to any clicks in about a minute, so he had a really important reason not to try to help me. I already mentioned he did not speak English.

So, after one hour and two metro trips I presented booking/order/dossier code instead of reservation number, and got my tickets. Oh, and by the way, international ticket office starts working at 7:30, train leaves at 7:10. I am all happy I had this story this afternoon, not the morning of departure.

Update: Well, now I feel stupid and outdated. I could have opened that email in my mobile phone. :-)

LAMPS on steroids

I’m not sure if I’m the first coining in ‘LAMPS’ – scaled out LAMP environment with Squid in front, but it sounds cool. Squid is major component in content distribution systems, reducing the load from all the backend systems dramatically (especially with proper caching rules). We had various issues in past, where we used code nobody else seemed to be using – cache coordination, purges and of course, load.

Quite a few problems resulted in memory leaks, but one was particularly nasty: Squid processes under high load started leaking CPU cycles somewhere. After deploying profiling for squid we actually ended up seeing that the problem is inside libc. Once we started profiling libc, one of initial assumptions appeared to be true – our heap was awfully fragmented, slowing down malloc().

Here comes our steroids part: Google has developed a drop-in malloc replacement, tcmalloc, that is really efficient. Space efficient, cpu efficient, lock efficient. This is probably mostly used (and sophisticated) libc function, that was suffering performance issues not that many people wanted to actually tackle. The description sounded really nice, so we ended up using it for our suffering Squids.

The results were what we expected – awesome :) Now the nice part is that the library is optimized for multi-threaded applications, doing lots of allocations for small objects without too much of lock contention, and uses spinlocks for large allocations. MySQL exactly fits the definition, so just by using simple drop-in replacement you may achieve increased performance over standard libc implementations.

For any developers working on high-performance applications, Google performance tools provide easy ways to access information that was PITA to work on before. Another interesting toy they have is embedded http server providing run-time profiling info. I’m already wondering if we’d should combine that with our profiling framework. Yummy. Steroids.

TCP tuning for your database

Lately lots of new fascinating technologies are used to build even more fascinating new solutions, and solutions nowadays even run on distributed environments, not just on single server. These servers usually communicate using TCP – standard, that has been here long before gigabit (or ten megabit) ethernet or dawn of LAMP, and needs to be kicked a bit, to work properly. In our environment we have over hundred of application servers which handle quite a number of requests, and put quite demanding stress on database servers. For that we had to change some of default kernel settings – and it did improve situation a bit. Here I’ll try to overview some of them…
Continue reading “TCP tuning for your database”

Open transactions are bad

Two identical queries have somewhat different execution times, though data layout, table format, everything… is quite same:

mysql> select count(*) from transtest;
| count(*) |
|    10000 |
1 row in set (0.00 sec)

mysql> select count(*) from transtest;
| count(*) |
|    10000 |
1 row in set (2.88 sec) < ---- oh noes!!!

The difference is very simple, though often forgotten. Multiversioned (MVCC) tables may have multiple versions (hence the name) of rows.
Even more, the engine will keep all versions of rows, that would be needed for oldest active transaction, just in case it requests some snapshot from the past.
Think of it as of some sort of time machine, just somewhat less efficient :)

If some old transaction is running, heavily edited tables will have more stored row versions.
In above case, second query had to read a million of rows to produce a count, when first one was working on clean environment.
Once the transaction holding old versions was rolled back, the query took 0.00sec again.

Databases (or DataACIDs) will take additional measures, often unexposed, to provide consistent data, so we may treat this as a feature, not a bug.

Conclusion: avoid long running transactions, especially if doing nothing. Oh well, avoid long running transactions with locks too, but that is already different question. If you spot anything sleeping – kill the slacker. At application level. :-)