Dissecting Wikipedia


Talk on "Dissecting Wikipedia" given at CRASSH, Cambridge, on 6th March 2013.


Andrew Gray, the British Library's Wikipedian in Residence, has been working on an AHRC-supported program to help more academics and researchers engage with Wikipedia. In this talk, he will give a brief history of the Wikipedia project, looking at its origins and the way it has developed over time. The talk will also cover the growing body of research on Wikipedia itself. Well over 2,000 peer-reviewed papers have examined Wikipedia in some way, either studying the project's content and community or using Wikipedia data to study broader questions of collaboration and interaction.

Published in: Technology

  1. Dissecting Wikipedia
     Andrew Gray
     Wikipedian in Residence, British Library
     andrew.gray@bl.uk // @generalising
  2. Wikipedia & Wikimedia
     Wikimedia
       • Movement and charitable body
       • 80-100,000 contributors in 280 languages and eleven core projects
       • Image repository, dictionary, news site…
       • …used by almost 500,000,000 people
     Wikipedia
       • 25,000,000 articles, 4,000,000 in English
       • representing 8-9,000,000 topics & entities
       • 6,500 articles and 235,000 edits per day
     (…and twelve years ago, this was all fields…)
  3. …so what is Wikipedia?
     …an encyclopedia (more or less)
     …written neutrally
     …and verifiably
     …using previously published information
     …free to use, distribute, or reuse
     …a collaborative community
     …with no firm rules
  4. A developing internal infrastructure
     All edits are visible through watchlists and page histories
       • About 7% are vandalism or malicious; processes exist to detect these
       • Median time to correction < 2 minutes… but some stay much longer
     Individual discussion pages for all articles – “talk”
     Quality review and assessment process
     Specialised working groups and central noticeboards
       • e.g. content topics; style; dispute resolution; copyright; etc.
  5. Quality of Wikipedia as a source
     On average… it’s not bad
       • In 2005 four errors per article, versus three in Britannica
       • In 2011, in English, Spanish & Arabic: “…the Wikipedia articles in this sample scored higher overall than the comparison articles with respect to accuracy, references, style/readability and overall judgment…”
     Millions of articles – so many are, individually, problematic
       • Various ways of identifying “signs” of quality
       • Markers for quality are both obvious and subtle
     Very effective “springboard” tool
  6. Moving to other content
     Other languages – not translations, and may have more content
     Mousing over footnote markers
     Within the references:
       • Links through DOIs and other identifiers
       • ISBNs go to a special landing page
       • …and then out to libraries, booksellers, etc
       • ISSNs go to WorldCat
     If an author, look for authority control links:
  7. Other research tools
     Some tools available – “toolserver” allows live DB queries
       • Complex to use, but rewarding
     CatScan: look for intersection of categories
       • “all physicists born in 1912” – 53 in English, 35 in German
     Full dumps of all data available – http://dumps.wikipedia.org/
     Reusers – Freebase, DBpedia, Wolfram Alpha
  8. Wikidata
     Wikidata: our new linked data repository
       • Phase I: cross-language links
       • Phase II: structured data elements
       • Phase III: dynamic lists
     Very loosely defined schema
     Currently harvesting structured data from WP
     Public API, open to reusers
     CC-0 licensed data – fully open
  9. Research about Wikipedia
     Thriving research around Wikimedia communities & content
       • by mid-2011, 2100 peer-reviewed articles and 38 PhD theses
       • Active research committee and WMF support
     Regular community-produced monthly newsletter
       • http://enwp.org/meta:Research // @wikiresearch
     Topics include:
       • Community and content creation
       • Reading and researching by users
       • Quality of content
       • Technical research
       • Large-scale content examination
  10. Research on communities
      Research on the Wikipedia communities:
        • Dynamics of community conflict, discussions, collaboration, voting, contribution, mentoring…
        • Demographics, motivation and specialisms of contributors
        • Patterns of growth and content creation/deletion
        • Effect of central programs on volunteer activity
        • Cross-cultural interaction
  11. Visualisation: discussion dynamics
      http://notabilia.net/
  12. Editor activity and motivation
      http://commons.wikimedia.org/wiki/File:Effect_of_barnstars_on_productivity.png
  13. Research on users
      Research on usage of Wikipedia:
        • Specific searching behaviour
        • Patterns of usage (yearly, daily)
        • Tracking external events through Wikipedia
        • Search engine rankings
        • Change in usage by students
        • Effect of Wikipedia publication on wider literature
  14. Visualising editing patterns
      http://commons.wikimedia.org/wiki/File:WikiTrip_egyptian_revolution_screenshot.png
  15. Research on content
      Research on the content of Wikipedia:
        • Evolution of content
        • Accuracy, coverage and quality
        • Biases – geographic, cultural, gender
        • Linguistic analysis
        • Effect of external publications on Wikipedia
  16. Quality assessment comparisons
      http://commons.wikimedia.org/wiki/File:Boxplot_of_Average_Article_Feedback_ratings_by_project_rated_quality.svg
  17. Research on technical aspects
      Research on the technical side of Wikipedia:
        • Extensive work on scaling open-content services
        • Tools for detecting and handling vandalism
        • Algorithmic detection and identification of bias, spam
        • Practical research on uses of wikis
  18. Research using content
      Research using content from Wikipedia
      Hard to distinguish from “conventional” research, but some examples:
        • Geographical analysis
        • Visualisations of content
        • Source for extracted datasets
        • ...and Wikidata still to come!
  19. Visualising art history
      http://commons.wikimedia.org/wiki/File:Wikiarthistory.png
  20. Visualising place
      https://commons.wikimedia.org/wiki/File:Imageworld-artphp3.png
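
Slide 6 describes how identifiers in Wikipedia references lead out to other resources: DOIs to a resolver, ISBNs to a special landing page, ISSNs to WorldCat. The sketch below shows roughly what those links look like. The URL patterns (the doi.org resolver, the Special:BookSources page, and a WorldCat ISSN path) and all function names are assumptions for illustration and are not taken from the slides. The sample identifiers are placeholders.

```python
from urllib.parse import quote


def doi_url(doi: str) -> str:
    """Build a resolver link for a DOI (assumes the public doi.org resolver)."""
    return "https://doi.org/" + quote(doi.strip(), safe="/")


def isbn_landing_url(isbn: str, lang: str = "en") -> str:
    """Build a link to the ISBN landing page (assumed to be Special:BookSources)."""
    # Keep only digits and a possible ISBN-10 check character "X".
    cleaned = "".join(ch for ch in isbn if ch.isdigit() or ch in "xX")
    return f"https://{lang}.wikipedia.org/wiki/Special:BookSources/{cleaned}"


def issn_worldcat_url(issn: str) -> str:
    """Build a WorldCat link for an ISSN (URL pattern is an assumption)."""
    return "https://www.worldcat.org/issn/" + issn.strip()


if __name__ == "__main__":
    print(doi_url("10.1000/182"))
    print(isbn_landing_url("978-0-306-40615-7"))
    print(issn_worldcat_url("0028-0836"))
```

The landing page is the useful design choice here: a single ISBN link fans out to libraries and booksellers, so the article itself does not have to favour any one supplier.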
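
Slide 7 mentions CatScan, which looks for the intersection of categories (“all physicists born in 1912”). A minimal sketch of the same idea follows, assuming the MediaWiki web API's `list=categorymembers` query as the data source. This is not how CatScan itself is implemented, and unlike CatScan it does not descend into subcategories. The helper names and the sample titles are hypothetical.

```python
from typing import Set
from urllib.parse import urlencode

API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def category_members_url(category: str, limit: int = 500) -> str:
    """Build a MediaWiki API URL listing the direct members of one category."""
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": f"Category:{category}",
        "cmlimit": limit,
        "format": "json",
    }
    return f"{API_ENDPOINT}?{urlencode(params)}"


def titles_from_response(response: dict) -> Set[str]:
    """Extract page titles from a parsed categorymembers JSON response."""
    return {member["title"] for member in response["query"]["categorymembers"]}


def intersect_categories(*member_sets: Set[str]) -> Set[str]:
    """Return the titles present in every category, as CatScan-style output."""
    if not member_sets:
        return set()
    return set.intersection(*member_sets)


if __name__ == "__main__":
    # Hand-written placeholder data standing in for two API responses.
    physicists = {"Person A", "Person B", "Person C"}
    born_1912 = {"Person B", "Person C", "Person D"}
    print(sorted(intersect_categories(physicists, born_1912)))
```

Running the same intersection against different language editions is what produces differing counts such as the 53 (English) versus 35 (German) quoted on the slide.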
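
Slide 8 notes that Wikidata has a public API open to reusers, and that Phase I covers cross-language links. The sketch below shows one way a reuser might read those links, assuming the `wbgetentities` action with `props=sitelinks`. The function names are hypothetical, and the sample response is hand-written in the shape the API returns rather than captured from a live call.

```python
from typing import Dict, Iterable
from urllib.parse import urlencode

WIKIDATA_API = "https://www.wikidata.org/w/api.php"


def entity_url(entity_ids: Iterable[str]) -> str:
    """Build a wbgetentities URL requesting only the sitelinks of each item."""
    params = {
        "action": "wbgetentities",
        "ids": "|".join(entity_ids),
        "props": "sitelinks",
        "format": "json",
    }
    return f"{WIKIDATA_API}?{urlencode(params)}"


def sitelink_titles(response: dict, entity_id: str) -> Dict[str, str]:
    """Map each wiki (e.g. 'enwiki') to the article title linked from the item."""
    sitelinks = response["entities"][entity_id].get("sitelinks", {})
    return {site: link["title"] for site, link in sitelinks.items()}


# Hand-written sample in the assumed response shape (not real API output).
SAMPLE_RESPONSE = {
    "entities": {
        "Q42": {
            "type": "item",
            "id": "Q42",
            "sitelinks": {
                "enwiki": {"site": "enwiki", "title": "Douglas Adams"},
                "dewiki": {"site": "dewiki", "title": "Douglas Adams"},
            },
        }
    }
}

if __name__ == "__main__":
    print(entity_url(["Q42"]))
    print(sitelink_titles(SAMPLE_RESPONSE, "Q42"))
```

Because each item carries the links for every language edition in one place, a client needs a single request per topic rather than one per wiki.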