Wikipedia page counters

The original announcement is here – we now have publicly available hourly snapshots of per-page view counts. In very rough form for now; dumps are at dammit.lt/wikistats.

This entry was posted in wikitech. Bookmark the permalink.

27 Responses to Wikipedia page counters

  1. Hashar says:

    And now you want to tune some RRD archives and generate some graphs :)

  2. Connel says:

    Very nice! Really, very good stuff. I can’t wait to see how other projects’ traffic compares… guessing at relative traffic is always annoying.

  3. LA2 says:

    This is great! I have wanted this for so long.

    1. Is it only Wikipedia? I only see “sv” (Swedish), but no reference to Wikipedia/Wikisource/Wikiquote/Wikinews, etc.

    2. Could you perhaps URL-decode the names so the files contain UTF-8 rather than %C3%96? (I can do this myself, no big deal, but it’s such a basic operation that it should really be done at the source. Unless there is some good reason not to do this.)

  4. Lars – for now it’s just Wikipedias; will probably add some other projects too, though that would mean lots more data with small numbers :)

    And URL-decoding was not done, to ease both the export and the import (whitespace, various evil Unicode bits, etc. ;-)
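
Since decoding is left to the consumer, LA2’s request above is a one-liner client-side. A minimal sketch in Python, assuming the percent-escapes are UTF-8 bytes as in the “%C3%96” example (the page title here is hypothetical):

```python
from urllib.parse import unquote

# "%C3%96" is the UTF-8 encoding of "Ö", so unquote() recovers the title.
title = unquote("%C3%96land")  # hypothetical Swedish page title
print(title)  # Öland
```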

  5. LA2 says:

    For “ru” (Russian WP) I get a lot of non-UTF-8 characters. Do they encode URLs in KOI-8, or what?

    For “no” (Norwegian WP – bokmål) I get a lot of Special:Export/PAGENAME. What’s that, a live mirror?

    Other languages (da,et,fi,fo,fy,is,lt,lv,nn,pl,se,sv) are more normal.

  6. Hi,

    I just found your page-count project, and it solves a big problem of mine. I’m maintaining scripts for converting Wikipedia into mobile-readable eBooks (see http://fbo.no-ip.org/wpmp ). However, some cell phones/PDAs/… just can’t take big memory cards, which means some articles have to be removed – this is where your page count statistics will be very helpful :-) . I’m currently downloading some of them, but I do have a request: would it be possible to publish e.g. daily or monthly statistics, too?
    It would greatly reduce the amount of data I have to download and parse through.

    Thank you for your efforts – I’ve been waiting a long time for this kind of data.
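
Until daily dumps exist, daily totals can be derived client-side by summing the 24 hourly files. A rough sketch, assuming files downloaded locally with the naming seen later in this thread and the four-field line format (projectcode, pagename, views, bytes) described in the comments below; the glob pattern is hypothetical:

```python
import gzip
from collections import Counter
from glob import glob

def daily_counts(pattern):
    """Sum per-page view counts over a set of hourly dump files.

    Each line is expected to look like "en Main_Page 42 123456"
    (projectcode, pagename, hourly views, bytes transferred).
    """
    totals = Counter()
    for path in sorted(glob(pattern)):
        with gzip.open(path, "rt", encoding="utf-8", errors="replace") as f:
            for line in f:
                fields = line.rstrip("\n").split(" ")
                if len(fields) != 4:
                    continue  # skip malformed lines
                project, page, views, _bytes = fields
                try:
                    totals[(project, page)] += int(views)
                except ValueError:
                    continue
    return totals

# Hypothetical usage: daily_counts("pagecounts-20090511-*.gz")
```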

  7. Frank, awesome use for counts. Thanks for letting me know.

  8. Salix alba says:

    Nice one for the stats – I’d been looking for these for a long time, and I’d been led to believe that producing them posed huge technical problems.

    This does pose some questions as to the accuracy of the numbers: are they from just one squid or from the whole network? Is caching taken into account?

    Anyway, for my purposes it’s not a big deal. I’ve long been curious as to what the most-read mathematics articles are, and to that end took 12 hours’ worth of data to build a list of the most popular: http://en.wikipedia.org/wiki/User:Salix_alba/One_day_of_mathematics_page_views
    – which proved somewhat interesting.

  9. that’s the whole network, taken from the squids, unsampled – the full stream.

  10. daniel says:

    Are the times in GMT? For the English WP, is the peak between 17 and 22h and off-peak between 8 and 13h?

  11. daniel says:

    Used the WP stats for a cyberaction. See link under name.

  12. Domas,

    This is great stuff. Can I mirror these stats on infochimps.org and other public dataset sites? Are these statistics under the same license as Wikipedia content? I’ve been running some further analysis on the data from November that would make for a great code demo, but I don’t want to hit any licensing issues if I post the data itself.

    -Pete

  13. this is public domain data.

  14. Domas,

    Thanks – I’m going to pull together a Hadoop code example using the data and set up a mirror.

    -Pete

  15. Chris Fraser says:

    What does it mean when http://dammit.lt/wikistats holds multiple files of pagecounts for a given hour? For example, the current snapshot for 2009/05/11 includes the slice

    pagecounts-20090511-120000.gz 2009-May-11 14:00:07 61.8M application/octet-stream
    pagecounts-20090511-130000.gz 2009-May-11 15:00:08 65.3M application/octet-stream
    pagecounts-20090511-130001.gz 2009-May-11 15:00:08 65.2M application/octet-stream
    pagecounts-20090511-140000.gz 2009-May-11 16:00:09 68.5M application/octet-stream

    which has one file for noon and 2PM but two for 1PM. In total, there are currently 30 files for 2009/05/11. I skimmed http://dammit.lt/wikistats/archive, and it appears to have exactly 24 files per day (I might have missed some), so I’m wondering if the recent files are groomed before archiving.

    When there is more than one file per hour, are the complete counts for that hour formed by summing their corresponding counts? By using only the last file? Or something else?

    Thanks for your help and for providing the excellent statistics.

  16. argh, they are dupes – I was switching from one collection host to another; you can disregard one of the files if there are two.
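
Given that answer, cleaning up Chris’s listing is a matter of keeping a single file per hour, keyed on the date and hour embedded in the filename. A sketch, assuming the naming pattern shown in the directory listing above:

```python
import os
import re
from glob import glob

def one_file_per_hour(pattern):
    """Keep a single dump file per hour; duplicate snapshots for the same
    hour (e.g. ...-130000.gz and ...-130001.gz) are redundant copies."""
    by_hour = {}
    for path in sorted(glob(pattern)):
        m = re.search(r"pagecounts-(\d{8})-(\d{2})", os.path.basename(path))
        if m:
            by_hour[m.group(1, 2)] = path  # later duplicate overwrites earlier
    return sorted(by_hour.values())
```

Since the duplicates hold the same counts, it does not matter which copy wins; the dictionary key (date, hour) simply collapses them.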

  17. Kapil Dalwani says:

    Does the count only indicate the hourly page view count?
    Don’t we have a running total of page counts as well?

    And what about the ’08 data? Is there an archive for it?

  18. Kapil Dalwani says:

    Where can I find the dumps for 2008?

  19. well, if they’re gone, they’re gone :)

  20. Kapil Dalwani says:

    Hey Domas,

    they are gone? :(
    Is there a way to get them back? I really need them for some research…

    Is there any other source for this?

  21. if you find one, tell me

  22. Domas,

    I finally posted that AWS Public Dataset. It contains the last 3 months of 2008, if that helps the other commenters:

    http://developer.amazonwebservices.com/connect/entry.jspa?externalID=2596

    Having the data mirrored on Amazon might also take some load off your servers ;)

    I also posted source code for a Trending Topics site I built using the data:

    http://www.trendingtopics.org

    Thanks again for making this available. Good to know about the duplicates – I noticed that as well and will have to delete the dupes.

    -Pete

  23. Peter Bodik says:

    Domas,
    is there any way to obtain these files with 1-minute granularity (instead of one hour)? At UC Berkeley we’re working on a storage system, and we’d like to evaluate how well it handles changes in data popularity. Using the Wikipedia page counts with 1-minute granularity would be a great way to test it.

    If the amount of data would be a problem, we could download it to our servers and you could then delete it. Also, we don’t need a whole year of data – several days or weeks would be enough.

    Thanks!
    Peter

  24. Alvin says:

    Hi,

    Thanks for putting this together. It’s very useful. I was wondering if anybody could help me clarify the meaning of the projectcode?

    From what I understand, each line has four fields – projectcode, pagename, pageviews, bytes – so a projectcode of “en” refers to the English Wikipedia collection. However, I’m having trouble figuring out what en.b, en.d, en.n, en.q, en.s, en.v, and en2 stand for, and how they relate to the en projectcode.

    Thanks!
    Alvin

    • hamm says:

      I also want to know why there is en.b, en.d, en.n, en.q, en.s, en.v?

      • Jane says:

        My guess is that these are the sister projects Wikibooks, Wiktionary, Wikinews, Wikiquotes, Wikisource, and Wikiversity.
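
If Jane’s guess is right (it matches the obvious abbreviations), the suffixes can be expanded mechanically. A sketch; note the mapping is an assumption based on her comment, not an official key, and codes like en2 are simply treated as Wikipedia here:

```python
# Suffix mapping per Jane's comment above – an educated guess, not an
# official key. Codes without a known suffix fall through to "Wikipedia".
SISTER_PROJECTS = {
    ".b": "Wikibooks",
    ".d": "Wiktionary",
    ".n": "Wikinews",
    ".q": "Wikiquote",
    ".s": "Wikisource",
    ".v": "Wikiversity",
}

def describe_project(code):
    """Expand a dump projectcode such as 'en.q' into readable form."""
    for suffix, name in SISTER_PROJECTS.items():
        if code.endswith(suffix):
            return f"{code[:-len(suffix)]} {name}"
    return f"{code} Wikipedia"

print(describe_project("en.q"))  # en Wikiquote
print(describe_project("sv"))    # sv Wikipedia
```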

  25. Tomukas says:

    Hello Domas, thank you very much for this great service. Do you know anyone who holds copies of the files that have already been deleted here? And: why don’t you publish them on a daily basis (and keep them available for longer)? Best wishes.

Comments are closed.