So, we had a major embarrassment last night. Multiple factors contributed to it:
- We don’t have a parallelism coordinator for our most CPU-intensive task at Wikipedia, so the cluster can end up working on the same job in ten, a hundred, or a thousand threads at the same time (a sketch of what such a coordinator might look like follows this list).
- Some parts of our parsing process ended up extremely CPU-intensive, and not in our code but in ‘templates’, which live in user space. We don’t have profiling for templates, so we can only guess which ones are slow and which are fast, let alone see their overall aggregates.
- Some parts of pages are extremely template-heavy, which makes rendering them cost a lot (e.g. citations; see this discussion).
- In order to avoid content-integrity race conditions, the editing process releases locks and invalidates cached objects early, separately from the ‘virgin parse’ that repopulates the caches.
- Refilling the cache takes quite some time, as rendering stays CPU-bound for quite a while in certain cases.
- During that short window when the caches are empty, a stampede of users on a single article causes lots of redundant work across the cluster/grid/cloud.
- The Michael Jackson article on the English Wikipedia alone had a million views in a single hour.
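Eliminating that redundancy doesn’t need anything fancy. Here is a minimal sketch of what such a parallelism coordinator could look like, assuming a memcached-style atomic add() as the lock; the function and key names are invented for this example and this is not MediaWiki code:

```php
<?php
// Hypothetical sketch of the missing "parallelism coordinator": only one
// request per article gets to do the expensive parse; everyone else waits
// briefly for that result instead of repeating the same job across the cluster.
// Function and key names are illustrative, not MediaWiki's actual API.

function renderWithCoordinator( Memcached $mc, string $title, callable $expensiveParse ) {
	$cacheKey = "parsed:$title";
	$lockKey  = "parselock:$title";

	// Fast path: the usual cache hit.
	$cached = $mc->get( $cacheKey );
	if ( $cached !== false ) {
		return $cached;
	}

	// Try to become the single renderer. add() is atomic: it fails if the
	// lock already exists, so at most one thread across the cluster wins.
	if ( $mc->add( $lockKey, 1, 60 ) ) {
		$html = $expensiveParse( $title );   // the CPU-heavy part
		$mc->set( $cacheKey, $html, 3600 );  // repopulate the cache
		$mc->delete( $lockKey );
		return $html;
	}

	// Somebody else is already rendering: poll briefly for their result
	// instead of burning another CPU on the same job.
	for ( $i = 0; $i < 50; $i++ ) {
		usleep( 100000 ); // 100 ms
		$cached = $mc->get( $cacheKey );
		if ( $cached !== false ) {
			return $cached;
		}
	}

	// Gave up waiting; render redundantly as a last resort.
	return $expensiveParse( $title );
}
```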
So, in summary: we had havoc in our cluster because a stampede of heavy requests, arriving between cache purge and cache repopulation, was consuming all available CPU resources, mostly rendering the references section of the Michael Jackson article.
Oh well, the quick operations hack looked like this:
```diff
Index: ParserCache.php
===================================================================
--- ParserCache.php	(revision 52088)
+++ ParserCache.php	(working copy)
@@ -63,6 +63,7 @@
 		if ( is_object( $value ) ) {
 			wfDebug( "Found.\n" );
 			# Delete if article has changed since the cache was made
 			// temp hack!
+			if( $article->mTitle->getPrefixedText() != 'Michael Jackson' ) {
 			$canCache = $article->checkTouched();
 			$cacheTime = $value->getCacheTime();
 			$touched = $article->mTouched;
```
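In plain words: for that one title the “has the article changed since the cache was made” check is skipped, so every request keeps getting the already-cached rendering instead of triggering yet another re-parse.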
It is embarrassing because the actual pageview count was way below our usual capacity; whenever we have problems, it is because of some narrow, expensive problem, not an overall, unavoidable resource shortage. We can afford many more edits and many more pageviews. We could have handled this load much better if our users weren’t building complex logic into articles, and much better still if we had more aggressive redundant-job elimination.
That’s the real story of operations: headlines like “High-profile event brought down Wikipedia” may sound nice, but the real story is “shit happens”.
techblog is dead too. another million sysadmins looking for the cause? :D
haha :) nope, though… blogs had some hidden spam links…
are you saying that every time a page is viewed it’s generated dynamically? why wouldn’t you do some caching and serve the pages more statically?
Mike, we do, and that’s why cache stampedes are the most painful: everything is tuned for cached content, and those small windows when content isn’t cached can be extremely painful ;-)
Twitter went down too for the same event. Historically, the Obama inauguration took down cellular networks. It just takes too much time and money to handle those unexpected outlying events. At least they are limited to those few-and-far-between occasions!
Would there be a way to put a given article at “absolutely critical priority”, and even freeze one given version of it (e.g. the latest, refreshed every five minutes), so that millions of users wanting to access that article do not break the overall site?
SyG,
Tim and I are working on a wider-scope solution to these issues. We’ve figured out a way to serve the latest-possible last-cached version of articles without compromising too much of the overall content delivery.
Some of the core work is at:
https://code.launchpad.net/~domas-mituzas/+junk/poolcounter/
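Roughly, the idea is to cap how many processes may render the same article at once and to hand everyone beyond that cap the last cached copy, however stale, instead of queueing more CPU work. A very rough sketch of that policy follows; the real PoolCounter linked above is a separate daemon with its own protocol, and the memcached-backed counter plus all names below are illustrative assumptions only:

```php
<?php
// Rough sketch of the pool-counter idea: allow at most $maxWorkers concurrent
// renders of one article; everyone beyond that gets the last cached version,
// even if stale, instead of piling more parse work onto the cluster.
// Names and the memcached-backed counter are illustrative, not the real API.

function poolRender( Memcached $mc, string $title, callable $expensiveParse, int $maxWorkers = 2 ) {
	$cacheKey   = "parsed:$title";
	$counterKey = "pool:$title";

	$entry = $mc->get( $cacheKey );
	if ( is_array( $entry ) && time() - $entry['ts'] < 300 ) {
		return $entry['html'];                // normal fresh cache hit
	}

	$mc->add( $counterKey, 0, 120 );          // make sure the counter exists
	$workers = $mc->increment( $counterKey ); // renderers for this title, us included

	if ( $workers !== false && $workers > $maxWorkers && is_array( $entry ) ) {
		// Pool is full and we have *something* cached: serve the
		// latest-possible last-cached version instead of re-rendering.
		$mc->decrement( $counterKey );
		return $entry['html'];
	}

	// Either we are within the pool limit or there is nothing to fall back on.
	try {
		$html = $expensiveParse( $title );    // the CPU-heavy part
		$mc->set( $cacheKey, [ 'html' => $html, 'ts' => time() ], 3600 );
		return $html;
	} finally {
		if ( $workers !== false ) {
			$mc->decrement( $counterKey );
		}
	}
}
```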