Wikipedia articles are by definition never finished: at any moment their content can be edited, or discussed in the associated talk pages. In this study we analyse the evolution of these discussions to unveil patterns of collective participation along the temporal dimension, and to shed light on the process of content creation on
different topics. At a micro-scale, we investigate peaks in the discussion activity and we observe a non-trivial relationship with edit activity. At a larger scale, we introduce a measure to account for how fast discussions grow in complexity, and we find speeds that span three orders of magnitude for different articles. Our analysis should help the community in tasks such as early detection of controversies and assessment of discussion maturity.
There is No Deadline - Time Evolution of Wikipedia Discussions
1. Introduction Peaks Growth measure Conclusions
There is No Deadline - Time Evolution of
Wikipedia Discussions
Andreas Kaltenbrunner David Laniado
Social Media Research Group,
Barcelona Media,
Barcelona, Spain
August 28th, 2012
WikiSym ’12, Linz, Austria
Andreas Kaltenbrunner & David Laniado Time Evolution of Wikipedia Discussions
2. Introduction Peaks Growth measure Conclusions
Outline
1 Introduction
Motivation
Dataset
2 Peaks
Peak Detection Algorithm
Peak Statistics
3 Growth measure
Discussion Complexity
Growth in Complexity
4 Conclusions
Andreas Kaltenbrunner & David Laniado Time Evolution of Wikipedia Discussions
3. Introduction Peaks Growth measure Conclusions Motivation Dataset
Outline
1 Introduction
Motivation
Dataset
2 Peaks
Peak Detection Algorithm
Peak Statistics
3 Growth measure
Discussion Complexity
Growth in Complexity
4 Conclusions
Andreas Kaltenbrunner & David Laniado Time Evolution of Wikipedia Discussions
4. Introduction Peaks Growth measure Conclusions Motivation Dataset
Motivation
Wiki means quick in Hawaiian
How to study the speed with which an article changes?
First choice would the number of edits per time unit.
But the larger an article becomes ...
more of its generative process happens in talk pages.
⇒ Looking at the associated discussion is often the most
effective way to understand the editing process.
Research questions
What is the relationship between discussion and edits?
How frequent are spikes of activity?
How fast do discussions grow, and for how long?
Which are the fastest discussions?
Andreas Kaltenbrunner & David Laniado Time Evolution of Wikipedia Discussions
5. Introduction Peaks Growth measure Conclusions Motivation Dataset
Outline
1 Introduction
Motivation
Dataset
2 Peaks
Peak Detection Algorithm
Peak Statistics
3 Growth measure
Discussion Complexity
Growth in Complexity
4 Conclusions
Andreas Kaltenbrunner & David Laniado Time Evolution of Wikipedia Discussions
6. Introduction Peaks Growth measure Conclusions Motivation Dataset
Dataset Dump of March 12th, 2010
Co-evolution of comments and edits
4 number of comments and edits per day
x 10
15 9000
edits
number of comments
12.5 comments 7500
number of edits
10 6000
7.5 4500
5 3000
2.5 1500
0 0
2003 2004 2005 2006 2007 2008 2009 2010
5 zoom on the year 2007
x 10
1.5 9000
number of comments
number of edits
1.25 7500
1 6000
0.75 4500
0.5 3000
Jan−1 Feb Mar Apr May Jun Jul Ago Sep Oct Nov Dec−1 Dec−31
2007
6 comments per 100 edits
Andreas Kaltenbrunner & David Laniado Time Evolution of Wikipedia Discussions
7. Introduction Peaks Growth measure Conclusions Motivation Dataset
Example for a single article
Activity is less synchronised
Peaks in the discussion and edit activity of the article "Barack Obama"
100 100
0 2010 0
300 300
200
100 2009 200
100
#comments, #edits per day.
0 0
300 300
200
100
0
2008 200
100
0
300 300
200
100
0
2007 #comments per day
#edits per day
200
100
0
Jan−01 Feb Mar Apr May Jun Jul Ago Sep Oct Nov Dec−01 Dec−31
How to detect & David Laniado
Andreas Kaltenbrunner
peaks? Time Evolution of Wikipedia Discussions
8. Introduction Peaks Growth measure Conclusions Peak Detection Algorithm Peak Statistics
Outline
1 Introduction
Motivation
Dataset
2 Peaks
Peak Detection Algorithm
Peak Statistics
3 Growth measure
Discussion Complexity
Growth in Complexity
4 Conclusions
Andreas Kaltenbrunner & David Laniado Time Evolution of Wikipedia Discussions
9. Introduction Peaks Growth measure Conclusions Peak Detection Algorithm Peak Statistics
How to detect peaks?
Compare with median activity
Peaks in the discussion and edit activity of the article "Barack Obama"
100 100
0 2010 0
300 300
200
100 2009 200
100
#comments, #edits per day.
0 0
300 300
200
100
0
2008 200
100
0
300 300
200
100
0
2007 #comments per day
median #comments during ± 2 weeks 200
#edits per day
median #edits during ± 2 weeks 100
0
Jan−01 Feb Mar Apr May Jun Jul Ago Sep Oct Nov Dec−01 Dec−31
Andreas Kaltenbrunner & David Laniado Time Evolution of Wikipedia Discussions
10. Introduction Peaks Growth measure Conclusions Peak Detection Algorithm Peak Statistics
Peak if activity > c · max(m(t), nmin ) adapted from [Lehmann 2012]
m(t) . . . 4 weeks median, nmin . . . activity minimum, c . . . peak factor
Peaks in the discussion and edit activity of the article "Barack Obama"
100 100
0 2010 0
300 300
200
100 2009 200
100
#comments, #edits per day.
0 0
300 300
200
100
0
2008 200
100
0
#comments per day
300 median #comments during ± 2 weeks 300
200
100
0
2007 comment peaks
#edits per day
median #edits during ± 2 weeks
edit peaks
200
100
0
Jan−01 Feb Mar Apr May Jun Jul Ago Sep Oct Nov Dec−01 Dec−31
Andreas Kaltenbrunner & David Laniado Time Evolution of Wikipedia Discussions
11. Introduction Peaks Growth measure Conclusions Peak Detection Algorithm Peak Statistics
Edit and comment peaks do not always coincide ...
and can be caused be endogenous or exogenous events
Peaks in the discussion and edit activity of the article "Barack Obama"
100 400 05−Nov−2008
100
0 2010 09 and 10−Mar−2009
350 Endogenous peak
300 300
Pres. Elections
0
20−Jan−2009 250 09−Oct−2009
300 Pres. Inaguaration Nobel Price Win 300
2009
200 200 200 200
150
200 200
100 100 100 100
50
100 100
#comments, #edits per day.
0 0 0 0
0 04−Jun−2008 0
200 17−Mar−2008 200 Nomination Win 200 29−Aug−2008 200 10−Oct−2008
Endogenous peak Official Nomiation Endogenous peak
300 300
2008
150 150 150 150
100 100 100 100
200 200
50 50 50 50
0 0 0 0
100 100
0 0
15−Feb−2007 11 and 12−Mar−2007
Endogenous peak Endogenous peak #comments per day
300 200 200
median #comments during ± 2 weeks 300
200
100
0
2007 150
100
50
0
150
100
50
0
comment peaks
#edits per day
median #edits during ± 2 weeks
edit peaks
200
100
0
Jan−01 Feb Mar Apr May Jun Jul Ago Sep Oct Nov Dec−01 Dec−31
Andreas Kaltenbrunner & David Laniado Time Evolution of Wikipedia Discussions
12. Introduction Peaks Growth measure Conclusions Peak Detection Algorithm Peak Statistics
Outline
1 Introduction
Motivation
Dataset
2 Peaks
Peak Detection Algorithm
Peak Statistics
3 Growth measure
Discussion Complexity
Growth in Complexity
4 Conclusions
Andreas Kaltenbrunner & David Laniado Time Evolution of Wikipedia Discussions
13. Introduction Peaks Growth measure Conclusions Peak Detection Algorithm Peak Statistics
Peak Statistics with c = 5, n min = 10, 2 580 discussion and 32 853 edit peaks
5
1198 articles with comment peaks 10
2580 comment peaks comment peaks
y~x−2.57 y~x−1.41
4
10 y~x−3.52 3
20681 articles with edit peaks 10 edit peaks
4
32853 edit peaks
y~x−3.50 10 y~x−3.97 y~x−1.36
number of time intervalls
3
10
number of articles
number of peaks
3 2
10 10
2
10
2
10
1
10
1
10 1
10
0 0
10 0
10 10 0
1 2 3 4 5 6 7 8 9 10 15 20 1 2 3 4 5 6 7 8 9 1 2 3
number of peaks per article 10 10 10 10
peak length (days) time between consecutive peaks (in days)
In the entire dataset
only 12% of all comment peaks coincide with an edit peak
27% when allowing one day of difference
33.8% when allowing two.
Peaks in the discussion activity do not have to lead to peaks in
the editing activity as well.
Andreas Kaltenbrunner & David Laniado Time Evolution of Wikipedia Discussions
14. Introduction Peaks Growth measure Conclusions Peak Detection Algorithm Peak Statistics
Top 10 articles with most comment and edit peaks
Title #comment-peaks #edit-peaks
Intelligent design 15 2
September 11 attacks 15 3
Race and intelligence 14 5
British Isles 11 0
Main page 11 0
Anarchism 10 12
Catholic church 10 0
Canada 10 0
Transnistria 9 3
New Anti-Semitism 9 0
Title #edit-peaks #com.-peaks
Uxbridge, Massachusetts 19 0
Voodoo (D’Angelo album) 17 0
List of World Wrestling Entertainment employees 16 3
Super Smash Bros. Brawl 16 2
Michael Jackson 16 1
The Biggest Loser: Couples 2 16 0
Roger Federer 15 0
Rafael Nadal 15 0
List of Barney & Friends episodes and videos 15 0
Total Drama Action 15 0
Andreas Kaltenbrunner & David Laniado Time Evolution of Wikipedia Discussions
15. Introduction Peaks Growth measure Conclusions Discussion Complexity Growth in Complexity
Outline
1 Introduction
Motivation
Dataset
2 Peaks
Peak Detection Algorithm
Peak Statistics
3 Growth measure
Discussion Complexity
Growth in Complexity
4 Conclusions
Andreas Kaltenbrunner & David Laniado Time Evolution of Wikipedia Discussions
16. Introduction Peaks Growth measure Conclusions Discussion Complexity Growth in Complexity
How to measure the complexity of a Discussion?
Discussion tree for article “Presidency of Barack Obama”
red → root (the article)
blue → structural nodes
green → anonymous
comments
grey → registered
comments
Andreas Kaltenbrunner & David Laniado Time Evolution of Wikipedia Discussions
17. Introduction Peaks Growth measure Conclusions Discussion Complexity Growth in Complexity
Using the h-index of a discussion introduced in [Gómez 2008]
The h-index ... Example
is a balanced depth
measure.
is the maximal number
h such that there are at
least h comments at
level (depth) h, but not
h + 1 comments at
level h + 1.
h-index=3
In other words there
are h sub-threads of
depth at least h.
Andreas Kaltenbrunner & David Laniado Time Evolution of Wikipedia Discussions
18. Introduction Peaks Growth measure Conclusions Discussion Complexity Growth in Complexity
Outline
1 Introduction
Motivation
Dataset
2 Peaks
Peak Detection Algorithm
Peak Statistics
3 Growth measure
Discussion Complexity
Growth in Complexity
4 Conclusions
Andreas Kaltenbrunner & David Laniado Time Evolution of Wikipedia Discussions
19. Introduction Peaks Growth measure Conclusions Discussion Complexity Growth in Complexity
Example for growth of discussions
5
10
George W. Bush George W. Bush
smoothed trend Bush Barack Obama
Barack Obama Bill Clinton
3 smoothed trend Obama
10 Bill Clinton 4
10
smoothed trend Clinton
number of comments
number of comments
3
10
2
10
2
10
1
10
1
10
0
10 10
0
Jan−2002 Jan−2003 Jan−2004 Jan−2005 Jan−2006 Jan−2007 Jan−2008 Jan−2009 Jan−2010 Jan−2002 Jan−2003 Jan−2004 Jan−2005 Jan−2006 Jan−2007 Jan−2008 Jan−2009 Jan−2010
Can we use the h-index to measure this growth?
Andreas Kaltenbrunner & David Laniado Time Evolution of Wikipedia Discussions
20. Introduction Peaks Growth measure Conclusions Discussion Complexity Growth in Complexity
Example for growth rate ∆h
14
George W. Bush
Barack Obama
12 Bill Clinton
George W. Bush
10
∆h =70.7 days
8
h−index
Barack Obama
6
∆h =90.2 days
4
Bill Clinton
2
∆h =331.9 days
0
Jan−2003 Jan−2004 Jan−2005 Jan−2006 Jan−2007 Jan−2008 Jan−2009
We define the growth rate ∆h as
the average time a discussion increases its h-index by one
th − t1
∆h =
h−1
related to the inverse of the m-index proposed in [Hirsch 2005]
Andreas Kaltenbrunner & David Laniado Time Evolution of Wikipedia Discussions
21. Introduction Peaks Growth measure Conclusions Discussion Complexity Growth in Complexity
Distribution of growth rates ∆h
of all discussions with more than 1000 comments
∆h
60
50
40
# discussions
30
20
10
0
1 10 100 1000
days
Different growth rates
We find several orders of magnitude of different rates of
increase in complexity of the discussions.
Andreas Kaltenbrunner & David Laniado Time Evolution of Wikipedia Discussions
22. Introduction Peaks Growth measure Conclusions Discussion Complexity Growth in Complexity
The 15 fastest and slowest discussions (∆h and duration in days)
Title ∆h start date end date duration final h-index
Virginia Tech massacre 0.5 15-Apr-2007 20-Apr-2007 5 9
2009 flu pandemic 0.9 25-Apr-2009 30-Apr-2009 5 7
Bronze Soldier of Tallinn 0.9 26-Apr-2007 02-May-2007 6 7
2009 Honduran constitutional crisis 1.0 27-Jun-2009 05-Jul-2009 8 8
Seung-Hui Cho 1.0 16-Apr-2007 24-Apr-2007 8 8
2008 Mumbai attacks 1.0 26-Nov-2008 01-Dec-2008 5 6
Israeli-occupied territories 1.2 22-Sep-2005 03-Oct-2005 11 10
International status of Abkhazia and South 1.3 25-Aug-2008 04-Sep-2008 10 8
Air France Flight 447 1.4 01-Jun-2009 08-Jun-2009 7 6
7 July 2005 London bombings 1.7 10-Jul-2005 15-Jul-2005 5 5
State terrorism and the United States 1.7 15-Feb-2008 06-Mar-2008 20 13
July 2009 Ürümqi riots 1.9 06-Jul-2009 21-Jul-2009 15 9
Henry Louis Gates arrest controversy 2.0 24-Jul-2009 09-Aug-2009 16 9
Teach the Controversy 2.6 11-Apr-2005 29-Apr-2005 18 8
Shakespeare authorship question 485.6 02-Jun-2003 24-Jan-2010 2428 7
Karl Marx 487.6 19-Sep-2004 21-Jan-2010 1950 6
Led Zeppelin 511.9 31-Jan-2003 03-Feb-2010 2560 6
Vampire 517.1 19-Nov-2002 18-Jul-2008 2068 6
World War II casualties 523.0 13-Sep-2004 29-Dec-2008 1568 4
War on Terrorism 533.1 07-Oct-2005 22-Feb-2010 1599 6
Fathers’ rights movement 546.0 07-Mar-2004 01-Sep-2008 1639 4
Instant-runoff voting 546.3 09-Jul-2003 03-Jan-2008 1639 5
Scientific method 553.5 15-Jun-2003 08-Jul-2009 2215 6
France 566.5 13-Nov-2003 26-Jan-2010 2266 6
Harry Potter 589.9 27-Nov-2002 02-Oct-2007 1770 6
Anna Anderson 604.5 17-Mar-2004 09-Jul-2007 1209 3
New York City 617.3 09-Dec-2003 03-Jan-2009 1852 5
Pi 627.3 07-Dec-2002 20-Oct-2009 2509 6
Christopher Columbus 1159.0 24-Oct-2003 27-Feb-2010 2318 5
Andreas Kaltenbrunner & David Laniado Time Evolution of Wikipedia Discussions
23. Introduction Peaks Growth measure Conclusions
Conclusions and future work
Conclusions
Discussion and edit peaks occur mostly independently of
each other.
Both endogenous (Wikipedia internal) and exogenous
(offline world) events can be the cause of such peaks.
We have introduced a simple growth measure.
Some discussions need only a few days to evolve, while
the slowest go on over years.
Future work
Use metrics for early detection of controversies.
Apply metrics on sub-threads to detect hot spots.
Assess discussion maturity.
Andreas Kaltenbrunner & David Laniado Time Evolution of Wikipedia Discussions
24. Introduction Peaks Growth measure Conclusions
Questions?
Andreas Kaltenbrunner & David Laniado Time Evolution of Wikipedia Discussions
25. Introduction Peaks Growth measure Conclusions
Bibliography I
Vicenç Gómez, Andreas Kaltenbrunner & Vicente López.
Statistical analysis of the social network and discussion threads in Slashdot.
In WWW ’08: Proceeding of the 17th international conference on World Wide Web, pages 645–654, New
York, NY, USA, 2008. ACM.
J. E. Hirsch.
An index to quantify an individual’s scientific research output.
PNAS, vol. 102, no. 46, pages 16569–16572, 2005.
J. Lehmann, B. Gonçalves, J.J. Ramasco & C. Cattuto.
Dynamical Classes of Collective Attention in Twitter.
In Proc. of WWW, 2012.
Andreas Kaltenbrunner & David Laniado Time Evolution of Wikipedia Discussions