Spikes are not fun anymore

English Wikipedia just scored “three million articles”, so I thought I’d give some more numbers and perspectives :) Four years ago we observed an impressive +50% traffic spike on Wikipedia – people came in to read about the new pope. Back then it was probably twenty additional page views a second, and we were quite happy to sustain that additional load :)

Nowadays big media events can cause some trouble, but generally they don’t bring huge traffic spikes anymore. Say, Michael Jackson’s English Wikipedia article had a peak hour of one million page views (2009-06-25 23:00-24:00) – and that was merely a 10% increase on one of our projects (English Wikipedia got 10.4m pageviews that hour). Our problems back then were caused by the complexity of the page content – and costs got inflated because of the lack of concurrency control in the rendering farm.

Other interesting sources of attention are custom Google logos leading to search results leading to Wikipedia (of course!). The latest ones, for the Perseids and Hans Christian Ørsted, sent over 1.5m daily visitors each – but that’s a mere 20 article views a second or so.
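For anyone who wants to check the back-of-the-envelope rates, here is a minimal sketch of the arithmetic. It uses only the round figures quoted above; the variable and function names are made up for illustration, and it assumes traffic is spread evenly over the period, which real traffic never is.

```python
# Rough rate arithmetic using the round numbers quoted in this post.
# Assumption: traffic is spread evenly across the period (it is not in practice).

SECONDS_PER_HOUR = 60 * 60
SECONDS_PER_DAY = 24 * SECONDS_PER_HOUR


def per_second(count, seconds):
    """Average events per second for `count` events over `seconds` seconds."""
    return count / seconds


# A Google logo sending ~1.5m visitors in a day.
doodle_rate = per_second(1_500_000, SECONDS_PER_DAY)

# Michael Jackson's article: ~1m views in its peak hour,
# against ~10.4m English Wikipedia pageviews in that same hour.
mj_rate = per_second(1_000_000, SECONDS_PER_HOUR)
mj_share = 1_000_000 / 10_400_000

print(f"Google logo: {doodle_rate:.1f} views/s")
print(f"Peak-hour article: {mj_rate:.1f} views/s, {mj_share:.1%} of that hour's pageviews")
```

The daily figure works out to a bit over 17 views a second, which is where the “20 or so” comes from.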

What makes those spikes boring nowadays is simply the length of the long tail. Our projects serve over five million different articles over the course of an hour (and 20m article views) – around 3.5m articles are opened just once. If our job were serving just hot news, our cluster setup and software infrastructure would be very very very different – but we have to accommodate millions of articles that aren’t just stored in archives, but are also constantly read, even if only once an hour (and the daily hot set is much larger too).
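To make the shape of that tail a bit more concrete, here is a small sketch that plays with the hourly numbers above. The inputs are the approximate figures from this post (“over five million”, “around 3.5m”), so the outputs are ballpark values at best, and the variable names are just illustrative.

```python
# Long-tail arithmetic for one hour of traffic, using the approximate
# figures quoted in the post. All results are rough estimates.

distinct_articles = 5_000_000      # "over five million different articles"
total_views = 20_000_000           # "20m article views"
single_view_articles = 3_500_000   # "around 3.5m articles are opened just once"

# Share of the distinct articles that were opened exactly once.
once_share_of_articles = single_view_articles / distinct_articles

# Share of all views that went to those once-opened articles.
once_share_of_views = single_view_articles / total_views

# Everything else: articles opened at least twice in the hour.
repeat_articles = distinct_articles - single_view_articles
repeat_views = total_views - single_view_articles
avg_views_per_repeat_article = repeat_views / repeat_articles

print(f"{once_share_of_articles:.0%} of articles served were opened once")
print(f"{once_share_of_views:.1%} of views went to those articles")
print(f"the other {repeat_articles:,} articles averaged "
      f"{avg_views_per_repeat_article:.0f} views each")
```

So roughly 70% of the articles touched in an hour account for under a fifth of the views, and none of them can be treated as cold.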

All this viewership data is available in raw form, as well as in nice visualizations at trendingtopics, wikirank and stats.grok.se. It is amazing to hear about all the research that is built on this kind of data, and I guess it already needs some improved interfaces and APIs for all the future uses ;-)

3 thoughts on “Spikes are not fun anymore”

  1. Could you point me to a web page somewhere that describes the raw data? Thank you.

  2. Domas,

    Thanks again for making these insights and datasets available. I just noticed that the wikistats hourly log process at http://dammit.lt/wikistats/ seems stalled as of last night (10/14/09). The last file was projectcounts-20091014-220001 2009-Oct-15 01:00:01

    -Pete

  3. Hi Domas,

    Yes, thank you for making this data available!! It is magnificent to have this data from a knowledge engineering point of view.

    However, data that would also be invaluable is how many users on a given page click on which links, i.e. the percentage of users that click on each of the links on a page they are visiting. Even given just as percentages, this would be incredibly valuable information. No identifying information, such as IPs, is necessary.

    Is there any way in which one can obtain THAT data?
    Please reply to my email as I might not see a reply here.

    Stephan.
