Talk on "Dissecting Wikipedia" given at CRASSH, Cambridge, on 6th March 2013.
Abstract:
Andrew Gray, the British Library's Wikipedian in Residence, has been working on an AHRC-supported program to help more academics and researchers engage with Wikipedia. In this talk, he will give a brief history of the Wikipedia project, looking at its origins and the way it has developed over time. The talk will also cover the growing amount of research done around Wikipedia itself. Well over 2,000 peer-reviewed papers have been published that look at Wikipedia in some way, either examining the project's content and community or using its data to study broader questions of collaboration and interaction.
1. Dissecting Wikipedia
Andrew Gray
Wikipedian in Residence, British Library
andrew.gray@bl.uk // @generalising
2. Wikipedia & Wikimedia
Wikimedia
Movement and charitable body
80,000-100,000 contributors in 280 languages
and eleven core projects
Image repository, dictionary, news site…
…used by almost 500,000,000 people
Wikipedia
25,000,000 articles, 4,000,000 in English
representing 8,000,000-9,000,000 topics & entities
6,500 articles and 235,000 edits per day
(…and twelve years ago, this was all fields…)
3. …so what is Wikipedia?
…an encyclopedia (more or less)
…written neutrally
…and verifiably
…using previously published information
…free to use, distribute, or reuse
…a collaborative community
…with no firm rules
4. A developing internal infrastructure
All edits are visible through watchlists and page histories
About 7% are vandalism or malicious; processes to detect these
Median time to correction < 2 minutes… but some stay much longer
Individual discussion pages for all articles – “talk”
Quality review and assessment process
Specialised working groups and central noticeboards
e.g. content topics; style; dispute resolution; copyright; etc.
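The point about median time to correction can be illustrated with a short sketch. The repair times below are made-up numbers chosen only to show the shape of the argument, not measurements from Wikipedia, and the names `repair_times` and `summarise` are hypothetical. The sketch shows how a median can sit under two minutes while a few long-lived cases dominate the mean.

```python
from statistics import mean, median

# Hypothetical repair times in seconds (illustrative only, NOT real data):
# most bad edits are fixed quickly, but two linger for an hour and a day.
repair_times = [30, 45, 60, 75, 90, 100, 110, 3600, 86400]


def summarise(times):
    """Return the median, mean and slowest repair time for a list of seconds."""
    return {
        "median_s": median(times),
        "mean_s": mean(times),
        "slowest_s": max(times),
    }


summary = summarise(repair_times)
print(summary)
```

The median describes the typical case, which is why it is the figure usually quoted; the long tail is the part a reader should stay alert to.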
5. Quality of Wikipedia as a source
On average… it’s not bad
In 2005 four errors per article, versus three in Britannica
In 2011, in English, Spanish & Arabic:
“…the Wikipedia articles in this sample scored higher overall than the comparison articles with respect to accuracy, references, style/readability and overall judgment…”
Millions of articles – so many are, individually, problematic
Various ways of identifying “signs” of quality
Markers for quality are both obvious and subtle
Very effective “springboard” tool
6. Moving to other content
Other languages – not translations, and may have more content
Mousing over footnote markers
Within the references:
Links through DOIs and other identifiers
ISBNs go to a special landing page
…and then out to libraries, booksellers, etc
ISSNs go to WorldCat
If the article is about an author, look for authority control links
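The identifier links in the references on this slide can be sketched as simple URL builders. The URL patterns below (the doi.org resolver, the `Special:BookSources` page and the WorldCat ISSN path) are assumptions about how those sites are addressed, not something stated in the slides, and the function names and sample identifiers are purely illustrative.

```python
from urllib.parse import quote


def doi_link(doi: str) -> str:
    """Build a resolver link for a DOI (assumed pattern: https://doi.org/<doi>)."""
    return "https://doi.org/" + quote(doi, safe="/")


def isbn_link(isbn: str, wiki: str = "en.wikipedia.org") -> str:
    """Build a link to the wiki's book-sources landing page for an ISBN."""
    return f"https://{wiki}/wiki/Special:BookSources/{quote(isbn)}"


def issn_link(issn: str) -> str:
    """Build a WorldCat link for an ISSN (assumed pattern: /issn/<issn>)."""
    return f"https://www.worldcat.org/issn/{quote(issn)}"


# Sample identifiers, used only to show the shape of the output.
print(doi_link("10.1000/182"))
print(isbn_link("978-0-306-40615-7"))
print(issn_link("0028-0836"))
```

From the ISBN landing page a reader can then continue out to libraries and booksellers, as described above.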
7. Other research tools
Some tools available – “toolserver” allows live DB queries
Complex to use, but rewarding
CatScan: look for intersection of categories
“all physicists born in 1912” – 53 in English, 35 in German
Full dumps of all data available – http://dumps.wikipedia.org/
Reusers – Freebase, DBpedia, Wolfram Alpha
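The kind of category intersection that CatScan performs can be sketched as a set operation. This is a minimal illustration over a toy in-memory mapping, not CatScan's actual implementation: the category names, the placeholder article titles and the function `intersect_categories` are all hypothetical, and a real tool would also have to walk subcategories, which is not modelled here.

```python
def intersect_categories(members, *categories):
    """Return the articles that appear in every one of the named categories.

    `members` maps a category name to an iterable of article titles.
    Unknown categories are treated as empty.
    """
    sets = [set(members.get(name, ())) for name in categories]
    if not sets:
        return set()
    return set.intersection(*sets)


# Toy data with placeholder titles (illustrative only).
toy_members = {
    "Physicists": {"Person A", "Person B", "Person C"},
    "1912 births": {"Person B", "Person C", "Person D"},
}

matches = sorted(intersect_categories(toy_members, "Physicists", "1912 births"))
print(matches)
```

Running the same query against different language editions would use a different membership mapping for each, which is why the counts quoted above differ between English and German.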
8. Wikidata
Wikidata: our new linked data repository
Phase I: cross-language links
Phase II: structured data elements
Phase III: dynamic lists
Very loosely defined schema
Currently harvesting structured data from WP
Public API, open to reusers
CC-0 licensed data – fully open
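As a rough sketch of what reusing the public API might look like, the snippet below builds a `wbgetentities` request URL and reads the cross-language links (Phase I) out of a response. The sample response is hand-written in the shape the API is understood to return, not captured from a live call, and the helper names `entity_url` and `sitelinks` are hypothetical.

```python
from urllib.parse import urlencode

API = "https://www.wikidata.org/w/api.php"


def entity_url(entity_id: str) -> str:
    """Build a wbgetentities request URL asking only for an item's sitelinks."""
    params = {
        "action": "wbgetentities",
        "ids": entity_id,
        "props": "sitelinks",
        "format": "json",
    }
    return API + "?" + urlencode(params)


def sitelinks(response: dict, entity_id: str) -> dict:
    """Map each linked site (e.g. 'enwiki') to the article title on that site."""
    links = response["entities"][entity_id].get("sitelinks", {})
    return {site: info["title"] for site, info in links.items()}


# Hand-written sample in the assumed response shape (illustrative only).
SAMPLE = {
    "entities": {
        "Q42": {
            "sitelinks": {
                "enwiki": {"site": "enwiki", "title": "Douglas Adams"},
                "dewiki": {"site": "dewiki", "title": "Douglas Adams"},
            }
        }
    }
}

print(entity_url("Q42"))
print(sitelinks(SAMPLE, "Q42"))
```

No network call is made here; fetching the URL with any HTTP client and passing the decoded JSON to `sitelinks` would be the next step.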
9. Research about Wikipedia
Thriving research around Wikimedia communities & content
by mid-2011, 2100 peer-reviewed articles and 38 PhD theses
Active research committee and WMF support
Regular community-produced monthly newsletter
http://enwp.org/meta:Research // @wikiresearch
Topics include:
Community and content creation
Reading and researching by users
Quality of content
Technical research
Large-scale content examination
10. Research on communities
Research on the Wikipedia communities:
Dynamics of community conflict, discussions, collaboration, voting, contribution, mentoring…
Demographics, motivation and specialisms of contributors
Patterns of growth and content creation/deletion
Effect of central programs on volunteer activity
Cross-cultural interaction
12. Editor activity and motivation
http://commons.wikimedia.org/wiki/File:Effect_of_barnstars_on_productivity.png
13. Research on users
Research on usage of Wikipedia:
Specific searching behaviour
Patterns of usage (yearly, daily)
Tracking external events through Wikipedia
Search engine rankings
Change in usage by students
Effect of Wikipedia publication on wider literature
15. Research on content
Research on the content of Wikipedia:
Evolution of content
Accuracy, coverage and quality
Biases – geographic, cultural, gender
Linguistic analysis
Effect of external publications on Wikipedia
17. Research on technical aspects
Research on the technical side of Wikipedia:
Extensive work on scaling open-content services
Tools for detecting and handling vandalism
Algorithmic detection and identification of bias, spam
Practical research on uses of wikis
18. Research using content
Research using content from Wikipedia
Hard to distinguish from “conventional” research, but some examples:
Geographical analysis
Visualisations of content
Source for extracted datasets
...and Wikidata still to come!