wikipedia – domas mituzas

on political correctness

Wikipedia administrators received this letter (“Midom” is my username on Wikipedias):

Hi, I’ve nothing to do with any of this but passing through oc.wikipedia.org I have noticed someone who I presume to be some kind of admin, one Midom who seems to be rather lacking in social skills, judging by what’s going on here: https://oc.wikipedia.org/wiki/Discussion_Utilizaire:Midom

I think I appreciate the technical issues being dealt with in there, but his behaviour is way out of line and clearly oversteps what is considered acceptable today in any functional online community.

Especially when this behaviour is directed towards a group who are small and lacking in resources, but very enthusiastic, such as the Occitan Wikipedia lot, this is just plain bullying.

He has, very much without discussion or consultation, decided on the deletion of a significant amount of data–while the reasons appear legitimate, the way in which this was approached by Midom is lamentable (and this is a different discussion, but one could argue that if the templates under discussion lend themselves to be misused in the way they allegedly were, that doesn’t say much about the competence of the programmers involved so perhaps they, being a handsomely paid bunch these days, unlike the oc.wikipedia.org editors, should step in and find a solution to the problem. Just saying.)

So, for what little is left of Wikipedia’s credibility, I urge you to take action and:

Reprimand Midom for his reprehensible actions and attitude.

Admonish him to present his apologies to the Occitan Wikipedia community for his rude, aggressive, and unhelpful behaviour.

As I said, I personally have no axe to grind here, but I do not condone bullying.

I might as well add, having made a note of the information volunteered by this user in his user page, I do reserve the right to contact his employer and make them aware of his highly irresponsible behaviour and questionable social and technical competence. Midom, it is up to you to take this as a learning experience and make amends with the users you have inconvenienced and offended. Providing some assistance to the OC guys in migrating their data into a form that doesn’t clog up the servers wouldn’t go amiss either. — Preceding unsigned comment added by 83.47.182.89 (talk) 00:24, 23 April 2016 (UTC)

To this person who decided that my operational intervention (and resulting soap opera) back in 2012 was heavy handed, I appreciate your communication skills and eloquence. Extreme political correctness was not needed to operate Wikipedias back in the day. What I remember from that time is that there would always be some crazies like you, and we had to deal with them in one way or another. That’s what being open and transparent means.

On the other hand, you can always blame me for everything, that’s what Wikipedia’s Blame Wheel was invented for: blamewheel

… in numbers

Spikes are not fun anymore

English Wikipedia just scored “three million articles”, so I thought I’d give some more numbers and perspectives :) Four years ago we observed an impressive +50% traffic spike on Wikipedia – people came in to read about the new pope. Back then it was probably twenty additional page views a second, and we were quite happy to sustain that additional load :)

Nowadays big media events can cause some trouble, but generally they don’t bring huge traffic spikes anymore. Say, Michael Jackson’s English Wikipedia article had a peak hour of one million page views (2009-06-25 23:00-24:00) – and that was merely a 10% increase on one of our projects (English Wikipedia got 10.4m pageviews that hour). Our problems back then were caused by the complexity of page content – and costs got inflated because of the lack of rendering farm concurrency control.

Other interesting sources of attention are custom Google logos leading to search results leading to Wikipedia (of course!). The latest ones, for the Perseids and Hans Christian Ørsted, sent over 1.5m daily visitors each – but that’s a mere 20 article views a second or so.
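To put these figures on one scale, here is a back-of-the-envelope conversion into per-second rates. It is only a sketch in Python: the variable and function names are made up for illustration, and the inputs are the rounded numbers quoted above.

```python
# Convert the traffic figures quoted in the text into average per-second rates.
# All names are illustrative; the inputs are the rounded numbers from the post.
SECONDS_PER_HOUR = 60 * 60
SECONDS_PER_DAY = 24 * SECONDS_PER_HOUR


def per_second(events, period_seconds):
    """Average number of events per second over the given period."""
    return events / period_seconds


# 1.5m daily visitors sent by one custom Google logo
google_logo_rate = per_second(1_500_000, SECONDS_PER_DAY)
# 1m views of the Michael Jackson article in its peak hour
jackson_peak_rate = per_second(1_000_000, SECONDS_PER_HOUR)
# 10.4m English Wikipedia pageviews in that same hour
enwiki_hour_rate = per_second(10_400_000, SECONDS_PER_HOUR)
# Fraction of that hour's English Wikipedia views that went to the one article
jackson_share = 1_000_000 / 10_400_000

print(f"Google logo traffic:  {google_logo_rate:.1f} views/s")
print(f"Jackson peak hour:    {jackson_peak_rate:.1f} views/s")
print(f"English Wikipedia:    {enwiki_hour_rate:.1f} views/s")
print(f"Jackson share of hour: {jackson_share:.1%}")
```

The Google logo traffic comes out a little under the quoted 20 views a second, and the single hottest article is about a tenth of one project's hourly traffic.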

What makes those spikes boring nowadays is simply the length of the long tail. Our projects serve over five million different articles over the course of an hour (and 20m article views) – around 3.5m articles are opened just once. If our job were serving just hot news, our cluster setup and software infrastructure would be very very very different – and now we have to accommodate millions of articles that aren’t just stored in archives, but are also constantly read, even if once an hour (and the daily hot set is much larger too).
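The shape of that long tail can be read straight off the three numbers above. The sketch below, in Python, treats "over five million" and "around 3.5m" as exact, which is an assumption, so the results are only rough proportions; the variable names are illustrative.

```python
# Rough long-tail proportions for one hour of traffic, using the rounded
# figures from the text as if they were exact (an assumption).
distinct_articles = 5_000_000   # different articles served in the hour
article_views = 20_000_000      # total article views in the hour
opened_once = 3_500_000         # articles that were opened exactly once

# Share of the hour's distinct articles that were read exactly once
once_article_share = opened_once / distinct_articles
# Share of the hour's views that those one-off reads account for
once_view_share = opened_once / article_views

# Everything else: articles read more than once, and the views they received
repeat_articles = distinct_articles - opened_once
repeat_views = article_views - opened_once
avg_views_per_repeat_article = repeat_views / repeat_articles

print(f"Articles read exactly once: {once_article_share:.0%} of distinct articles")
print(f"Views spent on them:        {once_view_share:.1%} of all views")
print(f"Average views per remaining article: {avg_views_per_repeat_article:.1f}")
```

Under these rounded inputs, most of the articles touched in an hour are read only once, while the remaining 1.5m articles average about eleven views each. That is why a cache tuned only for hot pages would not be enough.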

All this viewership data is available in raw form, as well as in nice visualizations at trendingtopics, wikirank and stats.grok.se. It is amazing to hear about all the research that is built on this kind of data, and I guess it already needs some improved interfaces and APIs for all the future uses ;-)

Board again (perhaps)

Tomorrow voting in the Wikimedia Foundation Board of Trustees election starts – and yours truly is a candidate.

You can find most of my views on various issues in our question pages (I was somewhat boiling when answering the “What will you do about the WMF mishandling it’s funding?” one – it probably takes great effort to phrase such a bad question, and it is so easy to answer :), as well as in the Wikipedia Signpost ‘interview’.

I was appointed to the Board back in January 2008, after holding various other volunteer (at some point in time – ‘officer’) positions within the organization since 2004 – and brought in the core technology and operational efficiency skill set there. The appointment was supposed to be somewhat temporary, but the board restructure turned out to be a much longer process than we expected – both the chapters part and the nomination committee work. As a community member, after the restructure I was in a ‘community-elected’ seat, though I never participated in any election – so that wasn’t too fair to the actual community, need to fix that :)

So, even though I wasn’t too visible to the actual community (people would notice me mostly when things go wrong, and I’m usually not in the best mood then :-), I feel that the values I’ve worked on, evangelized and supported for all these years – efficiency and general availability of our projects – can win the mindshare not only of our read-only users, whom I mostly work for, but also of eligible voters.

And I do think that internal technology expertise has to be represented on the board, as the things we’ve been doing, and the methods we’ve been using, are very much unique in the technology world. Oh, and as I mentioned somewhere, our technology spending is close to 50%, and that has to be represented too :-)

embarrassment

So, we had a major embarrassment last night. It consisted of multiple factors:

We don’t have a parallelism coordinator for our most CPU-intensive task at Wikipedia, so it can work on the same job in ten, a hundred, a thousand threads across the cluster at the same time.
Some parts of our parsing process ended up extremely CPU-intensive, and that happened not in our code, but in ‘templates’, which are in user-space. We don’t have profiling for templates, so we can only guess which one is slow and which one is fast, and we don’t know their overall aggregates.
Some parts of pages are extremely template-heavy, making page rendering cost a lot (e.g. citations – see this discussion).
In order to avoid content integrity race conditions, the editing process releases locks and invalidates objects early, separately from the ‘virgin parse’ which populates caches.
It takes quite some time to refill the cache, as rendering is CPU-bound for quite a while in certain cases.
During that short time when caches are empty, a stampede of users on a single article causes lots of redundant work across the cluster/grid/cloud.
The Michael Jackson article on English Wikipedia alone had a million views in one hour.

So, in summary, we had havoc in our cluster because a stampede of heavy requests between cache purge and cache population was consuming all available CPU resources, mostly working on rendering the references section of the Michael Jackson article.

Oh well, the quick operations hack looked like this:

Index: ParserCache.php
===================================================================
--- ParserCache.php	(revision 52088)
+++ ParserCache.php	(working copy)
@@ -63,6 +63,7 @@
  if ( is_object( $value ) ) {
    wfDebug( "Found.\n" );
    # Delete if article has changed since the cache was made
    // temp hack!
+   if( $article->mTitle->getPrefixedText() != 'Michael Jackson' ) {
    $canCache = $article->checkTouched();
    $cacheTime = $value->getCacheTime();
    $touched = $article->mTouched;

It is embarrassing, as the actual pageview count was way below our usual capacity. Whenever we have problems, it is because of some narrow expensive problem, not because of an overall unavoidable resource shortage. We can afford many more edits, many more pageviews. We could have handled this load way better if our users weren’t creating complex logic in articles. We could have handled this way better if we had more aggressive redundant job elimination.
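For illustration, here is what redundant job elimination can look like in its simplest form: when many requests ask for the same expensive render at once, one of them does the work and the rest wait for its result. This is a minimal single-process sketch in Python, not MediaWiki code. The names `SingleFlight`, `_Call` and `demo` are hypothetical, and a real cluster would need the coordination to happen across machines (for example through a shared lock service), which this does not attempt.

```python
import threading
import time


class _Call:
    """Bookkeeping for one in-progress computation (hypothetical helper)."""

    def __init__(self):
        self.done = threading.Event()
        self.result = None
        self.error = None


class SingleFlight:
    """Run at most one computation per key at a time.

    Concurrent callers asking for the same key wait for the caller that got
    there first and share its result instead of repeating the work.
    """

    def __init__(self):
        self._lock = threading.Lock()
        self._in_flight = {}  # key -> _Call

    def do(self, key, compute):
        with self._lock:
            call = self._in_flight.get(key)
            leader = call is None
            if leader:
                call = _Call()
                self._in_flight[key] = call

        if leader:
            try:
                call.result = compute()
            except BaseException as exc:  # hand the failure to waiters too
                call.error = exc
            finally:
                with self._lock:
                    del self._in_flight[key]
                call.done.set()
        else:
            call.done.wait()

        if call.error is not None:
            raise call.error
        return call.result


def demo(n_readers=20, render_seconds=0.2):
    """Simulate a stampede of readers hitting one article with an empty cache."""
    flight = SingleFlight()
    render_count = 0
    count_lock = threading.Lock()
    start = threading.Barrier(n_readers)  # release all readers together
    results = []

    def render():
        nonlocal render_count
        with count_lock:
            render_count += 1
        time.sleep(render_seconds)  # stand-in for CPU-heavy template rendering
        return "rendered: Michael Jackson"

    def reader():
        start.wait()
        results.append(flight.do("Michael Jackson", render))

    threads = [threading.Thread(target=reader) for _ in range(n_readers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return render_count, results


if __name__ == "__main__":
    count, results = demo()
    print(f"{len(results)} readers served by {count} render(s)")
```

Note that this only removes work that overlaps in time. The result still has to be written to the cache afterwards, otherwise the next wave of readers triggers another render.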

That’s the real story of operations: though headlines like “High profile event brought down Wikipedia” may sound nice, the real story is “shit happens”.

I loved Encarta

That happened long before Wikipedia. I loved Encarta. Well, before Encarta, I used to read this thing a lot:

But then Encarta arrived and I loved it. It fit on a single CD and didn’t take too much space on disk. I could look up all these articles in it fast, without having to use expensive dialup. I remember my school buddies coming over and watching those tiny movies in it. I could rip it off for my schoolwork and look incredibly smart (now people rip off Wikipedia and don’t get too much credit for that :).

It is dead.

People on the interwebs suggest that employees at Wikipedia and Encyclopaedia Britannica will be throwing parties tonight. Oh well, Wikipedia is already up to date about this. Every encyclopedia out there was an inspiration for Wikipedia, more so than any technology or “web-two-oh” hype. There’s not much joy seeing good things die.

Ten years ago I imagined that once I have my own home, I’ll have a place to put a full set of dead-tree Britannica, like my parents had the “Lithuanian Soviet Encyclopaedia”. Wikipedia changed my plans (now there’re two flat panels staring at Wiki, inside and outside), but it seems it is already changing the world around it far more. RIP Encarta. You were inspiring, and really too young to die. If it was us, we didn’t mean it, really. By the way, that content of yours, I’d be glad to see it free. *wink*

I'm a creative commoner

Lately Creative Commons is becoming a very dominant topic in my life. First of all, I see all the people in the free culture world holding their breath and waiting for Wikipedia to switch to a CC license. I’m waiting for that too – and personally I really endorse it. Though people usually do not notice licenses on web content, they do once they see something they really want to reuse. Wikipedia ends up being an isolated island if it doesn’t go after sharing and exchanging information with other projects.

It takes time to understand one is a ‘creative commoner’. I do have a t-shirt with such a caption, but it is much more comfortable once you start feeling the real power of use and reuse of information. A few anecdotes…

Tim is now vocal

Tim is one of the most humble and intelligent developers I’ve ever met – and we’re extremely happy to have him at Wikimedia. Now he has a blog, where the first entry is already epic by any standards. I mentioned the IE bug, and Tim has done a thorough analysis of this one and of similar problems.

I hope he continues to disclose the complexity of real web applications – and that will always be a worthy read.

Knol

There isn’t much to say about Knol technology – it is either nicely engineered or missing (they probably thought that search is the main tool for collaboration). Of course, many issues are already covered by others, but…

My first look was at the featured articles. What was wrong?

It features ‘closed collaboration’. Actually, that’s no different from a blog, then…
It doesn’t care much about the licensing – featured articles had images marked “all rights reserved”, or images taken from Wikipedia with attribution but without the share-alike clause. Also, the lack of a share-alike license forbids importing content from many other places, but as we see, nobody cares. ;-)
It doesn’t care about linking. Google search was based on the web links. Wikipedia was built on top of lots of broken links (oh, and working ones too). And nobody is going to type a Knol URL.
It doesn’t seem to have community tools. It just doesn’t.
WYSIWYG editing leads to articles without structure, just some text parts bolder than others.

So for now, it seems to be a pure-engineering approach to the problem, without looking at actual work done, social implications or properly respecting copyrights.

One needs a community for that. A community helps not only with content, but with style, metadata, organizing, and most of all – ensures that the project maintains its values and spirit.

Wikipedia at Velocity conference

Next Monday I’ll be presenting (if jetlag doesn’t kill me) at Velocity 2008 – a webops and performance conference. It won’t be my first time talking about Wikipedia infrastructure, but this time people will know the technology and scaling methods anyway.

As I see it, in such a context Wikipedia is more interesting as a case of an operations underdog – non-profit lean budgets, brave approaches in infrastructure, conservative feature development, and lots of cheating and cheap tricks (caching! caching! caching!).

Also, I’ll be able to share (making the audience jealous) how great it is to be on a non-profit ops team (one example perk – we can be cheap about getting conference passes too ;-)

The best part (for the audience, not for me) – I will be forced to be honest. Nearly the whole tech team will be at the event, and if I fail to attribute any developments, or start talking crap – not only can they throw rotten tomatoes, but they can also disable my login access and claim they never knew me, without me being able to fight back :) I haven’t publicly presented in front of these guys since 2005 – it will be tough.
