Summaries of Wikipedia usage data


This set of slides illustrates the growing interest people have in Wikipedia, changes in relative interest between languages, and how much interest in Wikipedia there is in different language zones.

  1. Summaries of Wikipedia Usage Data. Paul Houle, Ontology2
  2. The x-axis is months since Jan 2008; the y-axis is the total number of hits to all Wikipedia pages. There are some violent variations that are probably caused by data quality problems: in particular, around index 30 (2010-06 and 2010-07) we see a drop in hits, then a very high number of hits in 2010-11. I think there may be a few weeks of data missing sometime in that time range.
  3. The y-axis here is the fraction of hits to the English Wikipedia. At the beginning, more than 50% of the traffic went to the “en” Wikipedia, but that has fallen off and now “en” represents a bit more than 1/3 of the traffic. “en” is still dominant, but others are catching up.
  4. The y-axis here is the fraction of traffic to the German Wikipedia. Like “en”, the fraction falls over time. Note that there is a high spike in Dec 2008.
  5. The y-axis here is hits to the Japanese Wikipedia, and the story is similar to “de” except that the crazy spike happens around March 2013.
  6. The fraction of traffic in the francophone region, “fr”, actually looks stable over time.
  7. The fraction of hits to the Korean-language Wikipedia has actually been increasing (something has to if “en”, “de” and “ja” are declining).
  8. The fraction of hits to the Chinese Wikipedia has grown over time, but there is a drop in the same time frame that looks unstable on the summary graph at the beginning of these slides, and there is another crazy spike.
  9. The fraction of traffic in the “es” cultural zone seems to have a strong seasonal variation.
  10. Top 15 Wikimedia Sites ordered by fraction of all-time hits. Note that “ja” is Japanese, “zh” is Chinese, and “tr” is Turkish. and both come up with a single URI, so these probably represent a redirect somewhere.
  11. Notes on data sources • Original source: • Hourly files were aggregated at the month level; a few invalid (empty or full of HTML) files were removed, as were a few lines that did not parse. Content sizes were removed. • URIs that got fewer than 10 hits a month were removed from the monthlies (this reduces the number of URIs roughly tenfold!)
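
The processing described in the notes on data sources (monthly aggregation, dropping unparseable lines and content sizes, removing URIs with fewer than 10 hits) and the per-language fractions plotted on the earlier slides can be sketched roughly as follows. This is a minimal illustration, not the code used to produce the slides. It assumes each hourly line has four whitespace-separated fields (project code, page title, hit count, content size); the function names `aggregate_month` and `language_shares` and the sample lines are hypothetical.

```python
from collections import Counter


def aggregate_month(hourly_lines, min_hits=10):
    """Sum hits per (project, title) over one month of hourly lines,
    then drop URIs that got fewer than min_hits hits."""
    totals = Counter()
    for line in hourly_lines:
        parts = line.split()
        # Assumed layout: project, page title, hit count, content size.
        if len(parts) != 4:
            continue  # line does not parse
        project, title, hits, _size = parts  # content size is discarded
        try:
            totals[(project, title)] += int(hits)
        except ValueError:
            continue  # hit count is not a number, e.g. stray HTML
    return {uri: n for uri, n in totals.items() if n >= min_hits}


def language_shares(monthly):
    """Return each project's fraction of the month's total hits."""
    total = sum(monthly.values())
    by_project = Counter()
    for (project, _title), n in monthly.items():
        by_project[project] += n
    return {project: n / total for project, n in by_project.items()}


# Made-up sample lines standing in for a month of hourly files.
sample = [
    "en Main_Page 12 4096",
    "en Main_Page 8 4096",
    "de Hauptseite 10 2048",
    "en Rare_Page 3 512",
    "<html>not a data line</html>",
]

monthly = aggregate_month(sample)
shares = language_shares(monthly)
print(monthly)
print(shares)
```

In this sketch the shares are computed from the filtered monthlies, so rare URIs do not contribute to the totals; whether the slides computed fractions before or after that filter is not stated.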