on political correctness

Wikipedia administrators received this letter (“Midom” is my username on Wikipedias):

Hi, I’ve nothing to do with any of this but passing through oc.wikipedia.org I have noticed someone who I presume to be some kind of admin, one Midom who seems to be rather lacking in social skills, judging by what’s going on here: https://oc.wikipedia.org/wiki/Discussion_Utilizaire:Midom

I think I appreciate the technical issues being dealt with in there, but his behaviour is way out of line and clearly oversteps what is considered acceptable today in any functional online community.

Especially when this behaviour is directed towards a group who are small and lacking in resources, but very enthusiastic, such as the Occitan Wikipedia lot, this is just plain bullying.

He has, very much without discussion or consultation, decided on the deletion of a significant amount of data–while the reasons appear legitimate, the way in which this was approached by Midom is lamentable (and this is a different discussion, but one could argue that if the templates under discussion lend themselves to be misused in the way they allegedly were, that doesn’t say much about the competence of the programmers involved so perhaps they, being a handsomely paid bunch these days, unlike the oc.wikipedia.org editors, should step in and find a solution to the problem. Just saying.)

So, for what little is left of Wikipedia’s credibility, I urge you to take action and:

  • Reprimand Midom for his reprehensible actions and attitude.
  • Admonish him to present his apologies to the Occitan Wikipedia community for his rude, aggressive, and unhelpful behaviour.

As I said, I personally have no axe to grind here, but I do not condone bullying.

I might as well add, having made a note of the information volunteered by this user in his user page, I do reserve the right to contact his employer and make them aware of his highly irresponsible behaviour and questionable social and technical competence. Midom, it is up to you to take this as a learning experience and make amends with the users you have inconvenienced and offended. Providing some assistance to the OC guys in migrating their data into a form that doesn’t clog up the servers wouldn’t go amiss either. — Preceding unsigned comment added by 83.47.182.89 (talk) 00:24, 23 April 2016 (UTC)

To this person who decided that my operational intervention (and the resulting soap opera) back in 2012 was heavy-handed: I appreciate your communication skills and eloquence. Extreme political correctness was not needed to operate Wikipedias back in the day. What I remember from that time is that there’d always be some crazies like you, and we had to deal with them in one way or another. That’s what being open and transparent means.

On the other hand, you can always blame me for everything; that’s what Wikipedia’s Blame Wheel was invented for.

on swapping and kernels

There is much more to write about all the work we do at Facebook on memory management efficiency, but there was this one detour investigation in the middle of 2012 that I had to revisit recently, courtesy of Wikipedia.

There are lots of factors that make machines page memory out to disk, thus slowing everything down and locking software up – from file system cache pressure to runaway memory leaks to greedy kernel drivers. But certain swap-out scenarios are confusing – the system seems to have lots of memory available, with proper settings the file system cache should not cause swapping, and obviously in a production environment all the memory leaks have been ironed out.

And yet in mid-2012 we noticed that machines running our new kernel were swapping out for no obvious reason. When it comes to swapping, the MySQL community will always point to Jeremy’s post on “swap insanity” – it has something to do with NUMA and whatnot. But what we observed was odd – there was free memory available on multiple NUMA nodes when the swapping happened. Of course, one of our kernel engineers wrote a NUMA rebalancing tool that attaches to running processes and evens out memory allocations without any downtime (not that we ended up using it…) – just in case the issue Jeremy described was actually an issue for us.

In some cases systems threw warning messages into the kernel logs that immediately helped us get closer to the problem – the network device driver was failing to allocate 16k memory pages.

Inside the Linux kernel there are two ways to allocate memory, kmalloc and vmalloc. Generally, vmalloc goes through standard memory management, and if you ask for 16k, it will glue together 4k pages and the allocation will succeed without any problems.

kmalloc, though, is used by device drivers when hardware is doing direct memory access (DMA) – those address ranges have to be physically contiguous, and therefore to satisfy the allocation one has to find adjacent empty pages. Unfortunately, the easiest way to free up memory is to drop pages from the tail of the LRU list – and that does not produce contiguous ranges.

The actual solution, for ages, has been to organize the free memory into power-of-2 sized buckets (4k pages, 8k, 16k, …) – the buddy allocator (interesting trivia: it was first implemented by Harry Markowitz, later a Nobel Prize winner in Economics, back in 1964). A request of any size can be satisfied by splitting a larger bucket, and once there’s nothing left in the larger buckets one has to compact the free memory by shuffling pages around.
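
For intuition, here is a toy sketch of that bookkeeping in Python – nothing like the real kernel code, just the splitting and buddy-merging idea with made-up block sizes and a simplified API:

class BuddyAllocator:
    """Toy buddy allocator: free memory kept in power-of-2 buckets,
    allocations split bigger blocks, frees merge 'buddies' back together."""

    def __init__(self, total, min_block=4096):
        self.min_block = min_block
        self.max_order = (total // min_block).bit_length() - 1
        # free_lists[order] holds start offsets of free blocks of size min_block << order
        self.free_lists = {o: [] for o in range(self.max_order + 1)}
        self.free_lists[self.max_order].append(0)

    def alloc(self, size):
        blocks = -(-size // self.min_block)           # ceil(size / min_block)
        order = (blocks - 1).bit_length()             # smallest power-of-2 order that fits
        for o in range(order, self.max_order + 1):    # smallest non-empty bucket big enough
            if self.free_lists[o]:
                start = self.free_lists[o].pop()
                while o > order:                      # split down, keeping the upper halves free
                    o -= 1
                    self.free_lists[o].append(start + (self.min_block << o))
                return start
        return None                                   # nothing contiguous left: reclaim or compact

    def free(self, start, order):
        while order < self.max_order:                 # merge with the buddy while it is also free
            buddy = start ^ (self.min_block << order)
            if buddy not in self.free_lists[order]:
                break
            self.free_lists[order].remove(buddy)
            start = min(start, buddy)
            order += 1
        self.free_lists[order].append(start)

arena = BuddyAllocator(total=1 << 20)                 # 1MB arena of 4k pages
offset = arena.alloc(16 * 1024)                       # wants a contiguous 16k block (order 2)
arena.free(offset, order=2)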

One can see the details of the buddy allocator in /proc/buddyinfo:

Node 0, zone      DMA      0      0      1      0      2      1
Node 0, zone    DMA32    229    434    689    472    364    197
Node 0, zone   Normal  11093   1193    415    182     38     12
Node 1, zone   Normal  10417     53    139    159     47      0

(Columns on the left indicate the number of small free memory blocks available; columns further to the right count progressively larger ones.)
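
To make those columns easier to read, a small Python sketch (my own illustration, assuming 4k base pages) that prints each zone’s free block counts per order together with the block sizes:

PAGE_SIZE_KB = 4   # assumption: 4k base pages, as on x86

with open("/proc/buddyinfo") as f:
    for line in f:
        parts = line.split()
        node, zone = parts[1].rstrip(","), parts[3]   # "Node 0, zone Normal 11093 1193 ..."
        counts = [int(c) for c in parts[4:]]
        free_kb = 0
        for order, count in enumerate(counts):
            block_kb = PAGE_SIZE_KB << order          # order 0 = 4k, order 1 = 8k, ...
            free_kb += count * block_kb
            print("node %s zone %-7s: %6d free blocks of %6dk" % (node, zone, count, block_kb))
        print("node %s zone %-7s: %d MB free in total" % (node, zone, free_kb // 1024))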

It is actually the pursuit of performance that leads to device drivers dynamically allocating memory all the time (e.g. to avoid copying data from static device buffers to userland memory). On a machine doing lots of, say, network traffic, the network interface will be grouping packets on a stream into large segments and writing them into these freshly allocated areas, then dropping all of that right after the application has consumed the network bits – so this technique is really useful.

On the other side of the Linux device driver spectrum there are latency-sensitive operations, such as gaming and music listening and production. This millennium being the Millennium of the Linux Desktop, Advanced Linux Sound Architecture users (alsa-users) complained that such memory management sometimes made their sound drivers complain as well. That would not be much of an issue on well-tuned multi-core servers with hardware interrupt handling spread across multiple threads, but Linux kernel engineers prefer the desktop, and compaction was disabled altogether in 2011.

If memory is not fragmented at all, nobody notices. But on busy servers one may need to flush gigabytes or tens of gigabytes of pages (dropping them if they are file system cache, swapping them out if they are memory allocated to programs) to find a single contiguous region (though I’m not sure how exactly the kernel chooses when to stop flushing).

Fortunately, there is a manual trigger to force a compaction, which my fellow kernel engineers were glad to tell me about (otherwise we’d have had to engineer a kernel fix or go for some other workaround). Immediately a script was deployed that would trigger compaction whenever needed, and I got to forget about the problem.

Until now, that is – I just saw this problem confusing engineers at Wikipedia: servers with 192GB of memory were constantly losing their filesystem cache and showing all sorts of other weird memory behaviors. Those servers were running Varnish, which assumes that kernels are awesome and perfect, and if one is unhappy, he can use FreeBSD :)

There were multiple ways to deal with the issue – one was simply disabling the hardware features that use such allocations (e.g. no more TCP offloading), another is writing 1 into /proc/sys/vm/compact_memory – and maybe some newer kernels alleviate the problem somewhat.
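
For illustration, a minimal cron-able sketch of such a trigger – the zone check and thresholds are made up, and this is not the script mentioned in the update below:

MIN_ORDER = 2      # we care about 16k blocks (order 2 with 4k pages)
MIN_BLOCKS = 100   # made-up per-zone threshold

def lowest_high_order_count(min_order):
    """Smallest number of free blocks of at least min_order in any Normal zone."""
    lowest = None
    with open("/proc/buddyinfo") as f:
        for line in f:
            parts = line.split()
            if parts[3] != "Normal":
                continue
            counts = [int(c) for c in parts[4:]]
            available = sum(counts[min_order:])
            if lowest is None or available < lowest:
                lowest = available
    return 0 if lowest is None else lowest

if lowest_high_order_count(MIN_ORDER) < MIN_BLOCKS:
    # writing 1 here (as root) asks the kernel to compact memory on all nodes
    with open("/proc/sys/vm/compact_memory", "w") as f:
        f.write("1")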

Update: by popular demand I published the script that can be used in cron.

on wikipedia and mariadb

There’s some media coverage about Wikipedia switching to MariaDB; I just want to point out that the performance figures cited are somewhat incorrect and don’t attribute the gains to the correct authors.

A proper performance evaluation should include not just MariaDB 5.5 but Oracle’s MySQL 5.5 as well, because that’s where most of the performance development happened (multiple buffer pools, rollback segments, change buffering, et al.).

5.5 is faster for some workloads, and 5.1-fb can outperform 5.5 in others (ones with lots of I/O). It is good to know that there’s a beneficial impact from upgrading (though I’d wait for 5.6), but it is important to state that this is an effort from Oracle as well, not just from MariaDB developers.

P.S. As far as I understand, the decision to switch is political, and with the 5.6 momentum right now it may not be the best one – 5.6 is going to rock :-)

MySQL versions at Wikipedia

More information about how we handle database stuff can be found in some of my talks.

Lately I hear people questioning the database software choices we made at Wikipedia, and I’d like to point out that…

Wikipedia database infrastructure needs are remarkably boring.

We have worked a lot on having the majority of the site workload handled by edge HTTP caches, and some of the most database-intensive code (our parsing pipeline) is well absorbed by just 160G of memcached arena residing on our web servers.
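
The caching pattern itself is plain cache-aside; a rough Python sketch of the idea, with pymemcache and a hypothetical render_page() standing in for the real parser and cache layout:

from pymemcache.client.base import Client

mc = Client(("127.0.0.1", 11211))

def render_page(title, revision_id):
    """Stand-in for the expensive parsing pipeline."""
    return "<html>... rendered %s @ %s ...</html>" % (title, revision_id)

def get_parsed(title, revision_id, ttl=86400):
    # keying by title and revision means edits naturally miss the old entry
    key = "parsercache:%s:%s" % (title.replace(" ", "_"), revision_id)
    html = mc.get(key)
    if html is None:
        html = render_page(title, revision_id)
        mc.set(key, html, expire=ttl)
    return html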

Also, the major issue with our databases is finding the right balance between storage space (even though text is stored in the ‘external store’, which is just a set of machines with lots of large slow disks) – we store information about every revision, every link, every edit – and the available I/O performance per dollar for that kind of space.

As the platform of choice we use X4240s (I have advertised them before) – 16 SAS disks in a compact 2U package. There’s a relatively small hot area (we have a 1:10 RAM/DB ratio), and quite a long tail of various stuff we have to serve.

The whole database is just six shards, each getting up to 20k read queries a second (a single server can handle that) and a few hundred writes (the binlog is under 100k/s – nothing too fancy). We have overprovisioned some hardware for slightly higher availability – we don’t always have on-site resources available – and the slightly humorous logic is:

we need four servers: in case one goes down, another will be accidentally brought down by the person fixing it, then you’ve got one to use as a source of recovery and the remaining one to run the site.

The application doesn’t have too many really expensive queries, and those aren’t the biggest share of our workload. The database by itself is a minor part of where application code spends its time (looking at profiling right now – only 6% of busy application time is inside the database, memcached is even less, Lucene is way up at 11%). This is a remarkably good shape to be in, and it is much better than what we used to have when we were dealing with insane (“explosive”) growth. I am sure pretty much anything deployed now (even sqlite!) would work just fine, but what we use was created during the bad times.

The bad times didn’t mean that everything was absolutely overloaded; it was more that it could get overloaded very soon if we didn’t take appropriate measures, and our fundraisers were much tinier back then. We were using 6-disk RAID-0 boxes to sustain good performance and have the required disk space at the same time (or, of course, go the expensive SAN route).

While mainstream MySQL development, with its leadership back then, was headed towards implementing all sorts of features that didn’t mean anything to our environment (nor, judging from various discussions I had with lots of people, to many many other web environments):

  • utf8 support that didn’t support Unicode
  • Prepared Statements that don’t really make much sense in PHP/C environments
  • unoptimized subqueries that allow people to write shitty-performing queries
  • later in 5.0 – views, stored routines, triggers
  • etc…

… nobody was really looking at MySQL performance at that time, so it could have insane performance regressions (“nobody runs these things anyway” – like ‘SHOW STATUS’) and a forest full of low-hanging fruit.
From an operations perspective it wasn’t perfect either – replication didn’t survive crashes, crash recovery took forever, etc.

That’s when Google released their set of patches for 4.0, which immediately provided an incredible amount of fixes (that’s what I wrote about back then). To highlight some of the introduced changes:

  • Crash-safe replication (the replication position is stored inside InnoDB along with the transaction state) – this allowed running slaves with InnoDB log flushing turned off while still getting consistent recovery; vanilla MySQL doesn’t have that yet, Percona added it to XtraDB at some point
  • A guaranteed InnoDB concurrency slot for the replication thread – however loaded the server is, replication does not get queued outside and can proceed. This allowed us to point way more load at MySQL. This is now part of 5.1
  • Multiple read-ahead and write-behind threads – again, this allowed bypassing certain bottlenecks, such as read-ahead slots (though apparently it is wiser to just turn read-ahead off entirely) – now part of the InnoDB Plugin
  • Multiple reserved SUPER connections – during overloads, systems were way more manageable

Running these changes live was especially successful (and that was way before Mark/Google released their 5.0 patch set, which was then taken up in parts by OurDelta/Percona/etc.) – and I spent quite some time trying to evangelize these changes to MySQL developers (I would have loved to see them deployed at our customers – way less work for me!). Unfortunately, nobody cared, so running reliable and fast replication environments on mainline MySQL didn’t happen (now one has to use either XtraDB or the FB build).

So I did some merging work, added a few other small fixes, and ended up with our 4.0.40 build (also known as four-oh-forever), which still runs half of the shards today. It has sufficient in-memory performance for us, it can utilize our disk capacity fully, and it doesn’t have a crash history (I used to tell the story of two 4.0 servers, both whitebox RAID-0 machines, keeping an unbroken replication session for two years). By today’s standards it is already missing a few things (I mostly miss fast crash recovery, especially after the last full power outage in a datacenter ;-) – and developers would love to abuse SQL features (hehe, recently a read-only subquery locked all rows because of a bug :-). I’m way more conservative when it comes to using certain features live, as while working in MySQL Support I saw all the ways those features break for people, and we used to joke (this one was about RBR :):

Which is the stable build for feature X? Next one!

Anyway, even knowing that stuff breaks in one way or another, I was running a 5.1 upgrade project, mostly because of peer pressure (“4.0, haha!” – even though that 4.0 is more modern from an operations standpoint).

As MediaWiki is an open-source project used by many, we already engineer it for a wide range of databases – we support MySQL 4.0, we support MySQL 6.0-or-whatever-comes-in-the-future, and there’s some support for other vendors’ DBMSes (at various stages – PG, Oracle, MS SQL, etc.) – so we can be sure it works relatively fine on newer versions.

Upgrade in short:

  • Dump the schema and data
  • Load the schema on a 5.1 instance
  • Adjust the schema while we’re at it: set all varchar columns to varbinary to maintain 4.0 behavior (see the sketch after this list)
  • Load the data on the 5.1 instance
  • Fix MySQL to replicate from 4.0 (stupid breakage for nothing)
  • Switch the master to the 5.1 instance
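
The schema adjustment step is mostly mechanical; here is a hypothetical Python sketch of the kind of rewrite involved, run as a filter over a dumped schema (not our actual migration tooling):

import re
import sys

# VARCHAR(255) -> VARBINARY(255), and drop per-column charset clauses,
# which make no sense on binary columns; keeps 4.0's byte-wise comparisons.
varchar = re.compile(r"\bVARCHAR\((\d+)\)", re.IGNORECASE)
charset = re.compile(r"\s+CHARACTER SET \w+(\s+COLLATE \w+)?", re.IGNORECASE)

for line in sys.stdin:
    line = varchar.sub(r"VARBINARY(\1)", line)
    line = charset.sub("", line)
    sys.stdout.write(line)

Run it as a filter (python varchar_to_varbinary.py < schema-4.0.sql > schema-5.1.sql) and review the output before loading.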

We had some 5.0 and 5.1 replicas running for a while to detect any issues, and as there weren’t too many, the switch could be nearly immediate (English Wikipedia was converted from 4.0 to 5.1 over a weekend).

I had an earlier engineering effort to merge the Google 5.0 patches into trees later than 5.0.37, but eventually Mark left Google for Facebook and the “Google patch” was abandoned – long live the Facebook patch! :)

At first the FB-internal efforts went into getting the 5.0 environment working properly, so 5.1 development was a bit on hold. At that time I cherry-picked some of Percona’s patch work (mostly to get transactional replication for 5.1, as well as fast crash recovery) and started deploying this new branch. Of course, now that Facebook’s development focus has switched to 5.1, maintaining a separate branch is becoming less necessary – my plan for the future is getting the FB build deployed across all shards.

The beauty of the FB build is that the development team is remarkably close to operations (and the operations team is close to development), and there is lots of focus on making it do the right stuff (make sure you follow the mysql@facebook page). The visibility into our systems (PMP!) that we have at Facebook can be transformed into code fixes nearly instantly, especially when compared with development cycles outside. I’m sure some of those changes will trickle into other trees eventually, but we have them in FB builds already, and they are the state of the art of MySQL performance/operations engineering, while maintaining great code quality.

Yes, at Wikipedia we run a mix of really fresh and quite old/frozen software (there will be unification, of course), but… it runs fine. It isn’t as fascinating as it was years ago, but it allows not paying any attention for years. Which is good, right? Oh, and of course, there’s full on-disk data compatibility with standard InnoDB builds, in case anyone really wants to roll back or switch to the next-best fork.

update

In the past few months I have had lots of changes going on – I left the Sun/MySQL job, my term on the Wikimedia Board of Trustees ended, I joined Facebook, and now I have been appointed to the Wikimedia Advisory Board. This also probably means that I will have slightly less hands-on work on Wikipedia technology (I’ll be mostly in “relaxed maintenance mode”), though I don’t know yet how much less – time will show :)

P.S. I also quit World of Warcraft. ;-)

Spikes are not fun anymore

English Wikipedia just scored “three million articles”, so I thought I’d give some more numbers and perspective :) Four years ago we observed an impressive +50% traffic spike on Wikipedia – people came in to read about the new pope. Back then it was probably twenty additional page views a second, and we were quite happy to sustain that additional load :)

Nowadays big media events can cause some trouble, but generally they don’t bring huge traffic spikes anymore. Say, Michael Jackson’s English Wikipedia article had a peak hour of one million page views (2009-06-25 23:00-24:00) – and that was merely a 10% increase on one of our projects (English Wikipedia got 10.4m pageviews that hour). Our problems back then were caused by the complexity of the page content – and the costs got inflated because of the lack of concurrency control in the rendering farm.

Other interesting sources of attention are custom Google logos leading to search results leading to Wikipedia (of course!). The last ones, for the Perseids and for Hans Christian Ørsted, sent over 1.5m daily visitors each – but that’s a mere 20 article views a second or so.

What makes those spikes boring nowadays is simply the length of the long tail. Our projects serve over five million different articles over the course of an hour (and 20m article views) – around 3.5m of those articles are opened just once. If our job were serving just hot news, our cluster setup and software infrastructure would be very, very different – but instead we have to accommodate millions of articles that aren’t just sitting in archives, but are constantly read, even if only once an hour (and the daily hot set is much larger, too).
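
Those long-tail numbers are easy to reproduce from raw pageview logs; a toy Python sketch, assuming a log with one article title per line for the hour in question:

from collections import Counter
import sys

# one line per pageview, containing the article title
views = Counter(line.strip() for line in sys.stdin if line.strip())

total = sum(views.values())
distinct = len(views)
singletons = sum(1 for c in views.values() if c == 1)

print("views: %d, distinct articles: %d, opened exactly once: %d"
      % (total, distinct, singletons))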

All this viewership data is available in raw form, as well as in nice visualizations at trendingtopics, wikirank and stats.grok.se. It is amazing to hear about all the research being built on this kind of data, and I guess it already needs improved interfaces and APIs for all the future uses ;-)

Board again (perhaps)

Tomorrow voting starts for the Wikimedia Foundation Board of Trustees election – and yours truly is a candidate.

You can find most of my views on various issues on our question pages (I was somewhat boiling when answering the “What will you do about the WMF mishandling it’s funding?” one – it probably takes great effort to phrase such a bad question, and it is so easy to answer it :), as well as in the Wikipedia Signpost ‘interview’.

I was appointed to the Board back in January 2008, after holding various other volunteer (at some point, ‘officer’) positions within the organization since 2004 – and brought the core technology and operational efficiency skill set with me. The appointment was supposed to be somewhat temporary, but the board restructuring turned out to be a much longer process than we expected – both the chapters part and the nomination committee work. As a community member, after the restructuring I held a ‘community-elected’ seat, though I never participated in any election – so that wasn’t too fair to the actual community; need to fix that :)

So, even though I haven’t been too visible to the actual community (people would notice me mostly when things go wrong, and I’m usually not in my best mood then :-), I feel that the values I’ve worked on, evangelized and supported for all these years – efficiency and general availability of our projects – can win the mindshare not only of the read-only users I mostly work for, but also of eligible voters.

And I do think that internal technology expertise has to be represented on the board, as the things we’ve been doing, and the methods we’ve been using, are very much unique in the technology world. Oh, and as I mentioned somewhere, our technology spending is close to 50% – that has to be represented too :-)

embarrassment

So, we had a major embarrassment last night. It consisted of multiple factors:

  • We don’t have a parallelism coordinator for our most CPU-intensive task at Wikipedia, so the cluster can work on the same job in ten, a hundred, a thousand threads at the same time.
  • Some parts of our parsing process ended up extremely CPU-intensive, and that happened not in our code but in ‘templates’, which live in user space. We don’t have profiling for templates, so we can only guess which one is slow and which one is fast, with no overall aggregates either.
  • Some pages are extremely template-heavy, making page rendering cost a lot (e.g. citations – see this discussion).
  • In order to avoid content-integrity race conditions, the editing process releases locks and invalidates objects early, separately from the ‘virgin parse’ that populates the caches.
  • It takes quite some time to refill the cache, as rendering is CPU-bound for quite a while in certain cases.
  • During that short window when the caches are empty, a stampede of users on a single article causes lots of redundant work across the cluster/grid/cloud.
  • The Michael Jackson article on English Wikipedia alone had a million views in one hour.

So, in summary: we had havoc in our cluster because a stampede of heavy requests between cache purge and cache repopulation was consuming all available CPU resources, mostly rendering the references section of the Michael Jackson article.

Oh well, the quick operations hack looked like this:

Index: ParserCache.php
===================================================================
--- ParserCache.php	(revision 52088)
+++ ParserCache.php	(working copy)
@@ -63,6 +63,7 @@
 		if ( is_object( $value ) ) {
 			wfDebug( "Found.\n" );
 			# Delete if article has changed since the cache was made
+			if( $article->mTitle->getPrefixedText() != 'Michael Jackson' ) { // temp hack!
 			$canCache = $article->checkTouched();
 			$cacheTime = $value->getCacheTime();
 			$touched = $article->mTouched;

It is embarrassing because the actual pageview count was way below our usual capacity – whenever we have problems, it is because of some narrow, expensive problem, not because of an overall unavoidable resource shortage. We can afford many more edits and many more pageviews. We could have handled this load way better if our users weren’t creating complex logic in articles. We could have handled it way better if we had more aggressive redundant-job elimination – see the sketch below.
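
The redundant-job elimination part is conceptually simple; here is a toy Python sketch of the idea, using memcached add() as a poor man’s lock (an illustration of the concept only – not what MediaWiki actually deploys):

import time
from pymemcache.client.base import Client

mc = Client(("127.0.0.1", 11211))

def render(title):
    """Stand-in for the expensive parse/render step."""
    return "<html>%s</html>" % title

def get_rendered(title, ttl=3600, lock_ttl=60):
    key = "render:" + title.replace(" ", "_")
    cached = mc.get(key)
    if cached is not None:
        return cached
    # add() succeeds only for the first client to ask: that one does the
    # expensive parse, everyone else waits on the cache instead of re-parsing
    if mc.add(key + ":lock", "1", expire=lock_ttl, noreply=False):
        html = render(title)
        mc.set(key, html, expire=ttl)
        mc.delete(key + ":lock")
        return html
    deadline = time.time() + lock_ttl
    while time.time() < deadline:
        time.sleep(0.1)
        cached = mc.get(key)
        if cached is not None:
            return cached
    return render(title)   # lock holder died or is too slow: fall back to rendering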

That’s the real story of operations: though headlines like “High profile event brought down Wikipedia” may sound nice, the real story is “shit happens”.

on tools and operating systems

Sometimes people ask why I use MacOSX as my main work platform (isn’t that something to do with beliefs?). My answer is “a good foundation with a great user interface”. Though that can be read as “he must like the unix kernel and the look&feel!”, it is not exactly that.

What I like is that I get a good, stable graphical environment with some mandatory tools (yes, I use the OS-supplied browser, mail, etc.), but besides that I can maintain a bleeding-edge open-source space (provided by MacPorts).

I also like the OS-supplied development and performance tools. The bundled DTrace is awesome, yes, but Apple put a special touch on it too. This is the visualization environment for dtrace probes and other profiling/debugging tools:

Even the web browser (well, I upgraded to Safari 4.0 ;-) provides some impressive debugging and profiling capabilities:

Of course, I end up running a plethora of virtual machines (switching from Parallels to VirtualBox lately), and I even got a KDE/Aqua build (mostly for kcachegrind). I don’t really need Windows apps, I can run ‘Linux’ ones natively on MacOSX, and I can run MacOSX ones on MacOSX.

There’s a full web stack for my MediaWiki work, there are dozens of MySQL builds around, there are photo albums, dtrace tools, World of Warcraft, a bunch of toy projects, a few different office suites, Skype, NetBeans, Eclipse, Xcode, integrated address books and calendars, all major scripting languages, and revision control systems – git, svn, mercurial, bzr, bitkeeper, cvs, etc.

All that on a single machine, running for three years, without too much clutter, and with nearly zero effort to make it all work. That’s what I want from a desktop operating system – extreme productivity without too much tinkering.

And if anyone blames me for using non-open-source software, my reply is very simple – my work output is open-sourced.
