Summaries of Wikipedia
Usage Data
Paul Houle, Ontology2
The x-axis is months since Jan 2008; the y-axis is the total number of hits to all
Wikipedia pages.
There are some violent variations that are
probably caused by data quality problems. In
particular, around index 30 (2010-06 and
2010-07) we see a drop in hits, followed by a very
high number of hits in 2010-11. I think
there may be a few weeks of data missing
somewhere in that time range.
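One way to check the suspicion about missing data is to count how many hourly files exist for each month and compare that with the number of hours in the month. The sketch below is an illustration, not the method used for these slides. It assumes hourly file names of the form `pagecounts-YYYYMMDD-HHMMSS.gz`, and the function name `hourly_coverage` is hypothetical.

```python
import calendar
import re
from collections import Counter

# Assumed file name pattern, e.g. "pagecounts-20100601-000000.gz".
NAME_RE = re.compile(r"pagecounts-(\d{4})(\d{2})\d{2}-\d{6}\.gz$")


def hourly_coverage(filenames):
    """Return {(year, month): fraction of expected hourly files present}."""
    seen = Counter()
    # A set drops accidental duplicates in the listing.
    for name in set(filenames):
        match = NAME_RE.search(name)
        if match:
            seen[(int(match.group(1)), int(match.group(2)))] += 1
    coverage = {}
    for (year, month), count in seen.items():
        days_in_month = calendar.monthrange(year, month)[1]
        coverage[(year, month)] = count / (days_in_month * 24)
    return coverage
```

A month whose coverage is well below 1.0 would be a candidate explanation for a dip in the total hit count.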
The y-axis here is the fraction of hits to the English
Wikipedia. At the beginning, more than 50% of
the traffic went to the “en” Wikipedia, but that
has fallen off and now “en” represents a bit more
than 1/3 of the traffic.
“en” is still dominant, but others are catching up.
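The per-language fractions plotted on these slides are each site's hits divided by the total hits for that month. A minimal sketch of that arithmetic follows, assuming the monthly totals are already available as a nested dictionary. The function name `project_fractions` and the input layout are assumptions made for illustration.

```python
def project_fractions(monthly_hits):
    """Convert {month: {project: hits}} into {month: {project: fraction}}."""
    fractions = {}
    for month, by_project in monthly_hits.items():
        total = sum(by_project.values())
        if total == 0:
            # Skip months with no data rather than divide by zero.
            continue
        fractions[month] = {
            project: hits / total for project, hits in by_project.items()
        }
    return fractions
```

Because the fractions for a month sum to 1, a decline in one language's share has to be matched by growth in the others.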
The y-axis here is the fraction of traffic to the
German Wikipedia. Like “en”, the fraction falls
over time. Note that there is a high spike in Dec
2008.
The y-axis here is hits to the Japanese Wikipedia,
and the story is similar to “de”, except that the crazy
spike happens around March 2013.
The fraction of traffic in the francophone region,
“fr”, actually looks stable over time
The fraction of hits to the Korean-language
Wikipedia has actually been increasing
(something has to if “en”, “de” and “ja” are
declining).
The fraction of hits to the Chinese Wikipedia has
grown over time, but there is a drop in the same time frame
that looks unstable on the summary graph at the
beginning, and there is another crazy spike.
The fraction of traffic in the “es” cultural zone seems to
have a strong seasonal variation
Top 15 Wikimedia sites ordered by fraction of all-time hits.
Note that “ja” is Japanese, “zh” is Chinese, and “tr” is Turkish.
en.mw and ja.mw both come up with a single URI, so these probably represent a
redirect somewhere.
Notes on data sources
• Original source: http://dumps.wikimedia.org/other/pagecounts-raw/
• Hourly files were aggregated at the month level; a few invalid (empty
or full of HTML) files were removed, as were a few lines that did not
parse. Content sizes were removed.
• URIs that got fewer than 10 hits a month were removed from the
monthlies (this reduces the number of URIs roughly tenfold!)
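The processing steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code used to produce these slides. It assumes each line of an hourly file has four space-separated fields (project code, page title, hit count, content size), and the function name `aggregate_month` and the `min_hits` parameter are hypothetical.

```python
from collections import Counter


def aggregate_month(lines, min_hits=10):
    """Sum hits per (project, title) and drop URIs below min_hits."""
    totals = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) != 4:
            # Skip lines that do not parse (for example, stray HTML).
            continue
        project, title, count, _size = fields  # content size is discarded
        try:
            hits = int(count)
        except ValueError:
            continue
        totals[(project, title)] += hits
    # Keep only URIs with at least min_hits hits in the month.
    return {key: hits for key, hits in totals.items() if hits >= min_hits}
```

In practice the `lines` argument would be the concatenation of every hourly file for one month, read lazily so that the whole month never has to sit in memory at once.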

