The paper is at http://www.gnuband.org/papers/social_networks_of_wikipedia/
Wikipedia, the free online encyclopedia anyone can edit, is a live social experiment: millions of individuals volunteer their knowledge and time to collectively create it. It is hence interesting to try to understand how they do it. While most attention has concentrated on article pages, a lesser-known share of activity happens on user talk pages, Wikipedia pages where a message can be left for a specific user. These public conversations can be studied from a Social Network Analysis perspective in order to highlight the structure of the “talk” network. In this paper we focus on this preliminary extraction step by proposing different algorithms. We then empirically validate the differences in the networks they generate on the Venetian Wikipedia against the real network of conversations, extracted manually by coding every message left on all user talk pages. The comparisons show that both the algorithms and the manual process contain inaccuracies that are intrinsic to the freedom and unpredictability of Wikipedia's growth. Nevertheless, a precise description of the issues involved allows one to make informed decisions and to base empirical findings on reproducible evidence. Our goal is to lay the foundation for a solid computational sociology of wikis. For this reason we release the scripts encoding our algorithms as open source, along with some datasets extracted from Wikipedia conversations, so that other researchers can replicate and improve on our initial effort.
The scripts (Python) have been released as open source, and the network datasets (in GraphML format) too. See http://sonetlab.fbk.eu/data/social_networks_of_wikipedia/
Social networks of Wikipedia - Paolo Massa - Presentation at (2011). ACM Hypertext 2011: 22nd ACM Conference on Hypertext and Hypermedia
1. Social Networks of Wikipedia
Paolo Massa
SoNet @ Bruno Kessler Foundation, Trento, Italy
http://www.gnuband.org
2. Contributions
Methodological paper on
Algorithms for extracting a network of
Who talks to whom on Wikipedia
+
Validation of quality by manual coding
Code is open source and reusable
=
Basic step for Social Network Analysis
3. Outline
● Statistics on Wikipedia/wiki
● Algorithms for Extracting a
Social Network
● Manual Validation of Algorithms
4. English Wikipedia
Started in 2001
3,500,000+ articles
440,000,000+ edits
14,000,000+ registered users
3,500,000+ at-least-1-edit users
10. How to extract a network of
who talks to whom from User
talk pages?
11-15. User talk page http://en.wikipedia.org/wiki/User_talk:Phauly
[Figure, built up over slides 11 to 15: the user talk page of Phauly next to the network extracted from it. Recoverable labels: nodes Shell, Phauly and Martin; edge labels 1.]
17. Broader scope
We (SoNet) work on
● How UTPs are used (coordination)
● Characterize users of Wikipedia (based
on gender, interests, religion, ...)
● Formation of Collective memories of
events in Wikipedia
● Goal: understand/model what users do
in Wikipedia → Wikisociology
18. We're hiring! ;)
Call for researcher at
https://risorseumane.fbk.eu/it/node/234
Info about SoNet group
at http://sonet.fbk.eu
If interested, come to talk
to me!
19. Other Wikipedia networks
● Few papers on User talk pages
● Node=User
● Edge=Coediting x articles
● Edge=Editing article after user A
● Edge=Reverted edit of user A
● Edge=Vote in elections for admins
● Node=Page / Edge=Link
● Node=Category / Edge=Inclusion
20. How to extract who talks to
whom?
3 ways:
(1) Signatures (automated)
(2) History of edits (automated)
(3) Manual coding
21. Input: Wikipedia dumps
XML dump of every edit that occurred to every
page over time (10 years!)
English Wikipedia dump =
5,600 Gigabytes!
(our scripts work on every wiki: 280+
language Wikipedias, but also 50,000+
wikia.com wikis ...)
22. How to extract who talks to
whom?
3 ways:
(1) Signatures in text (automated)
(2) History of edits (automated)
(3) Manual coding
25. (1) Signature algorithm
Input: pages-meta-current XML
<page>
<title>User talk:Phauly</title>
<revision>
<text xml:space="preserve">
== '''Welcome!''' ==
Hello, {{BASEPAGENAME}}, and [[Wikipedia:Welcome, newcomers|welcome]] t
your contributions. I hope you like the place and decide to stay. Here
might find helpful:
*[[Wikipedia:Five pillars|The five pillars of Wikipedia]]
*[[Wikipedia:How to edit a page|How to edit a page]]
*[[Help:Contents|Help pages]]
*[[Wikipedia:Tutorial|Tutorial]]
*[[Wikipedia:Article development|How to write a great article]]
*[[Wikipedia:Manual of Style|Manual of Style]]
I hope you enjoy editing here and being a [[Wikipedia:Wikipedians|Wikip
[[Wikipedia:Sign your posts on talk pages|sign your name]] on talk page
(<nowiki>~~~~</nowiki>); this will automatically produce your name and
check out [[Wikipedia:Questions]], ask me on my talk page, or place
<code><nowiki>{{helpme}}</nowiki></code> on your talk page and someone
answer your questions. Again, welcome! . [[User:Shell_Kinney|Shell
<sup>[[User_talk:Shell_Kinney|babelfish]]</sup> 15:29, 7 November 2006
== "Wikipedia endnote assisstant" ==
Hi, sorry to take so long to reply to your message. It's convention at
messages at the bottom of the page, and as I was moving country at the
see your message until now! Have you tried the updated URL,
http://toolserver.org/~verisimilus/Scholar ? Let me know if you continu
Glad you find the tool useful! Best wishes,
[[User:Smith609|Martin]] '''<small>([[User:Smith609|S
[[User_talk:Smith609|Talk]])</small>''' 01:19, 7 October 2008
== Test anonymous edit ==
Just a test done by myself on signature formatting. [[Special:Contrib
217.77.80.29]] ([[User talk:217.77.80.29|talk]]) 12:08, 8 February 2010
</text>
</revision>
</page>
26. (1) Signature algorithm
● Consider pages with title User talk:T (or equivalent in other languages)
● Search for signatures of user S in text
● Consider them as message from S to T
Signature of XXX if [[User:XXX|
Signature of 217.77.80.29 if [[Special:Contributions/217.77.80.29|
Robust on spaces, HTML tags, non balanced parentheses, ...
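The signature rule on this slide can be sketched in Python, the language of the released scripts. This is only an illustration under assumptions, not the code from the wiki-network repository: the two regular expressions, the helper names `normalize` and `signature_edges`, and the `prefix` parameter are all hypothetical, and only IPv4 addresses are recognized for anonymous users.

```python
import re

# Assumed patterns, modelled on the two signature forms named on the slide:
#   [[User:XXX|                        -> registered user XXX
#   [[Special:Contributions/1.2.3.4|   -> anonymous user (IPv4 address)
# The released scripts may use different, more permissive expressions.
USER_SIG = re.compile(r"\[\[\s*User\s*:\s*([^|\]#/]+?)\s*[|\]]", re.IGNORECASE)
ANON_SIG = re.compile(
    r"\[\[\s*Special\s*:\s*Contributions\s*/\s*([0-9.]+)\s*[|\]]", re.IGNORECASE
)


def normalize(name):
    """MediaWiki treats underscores and spaces in titles as equivalent."""
    return " ".join(name.replace("_", " ").split())


def signature_edges(page_title, wikitext, prefix="User talk:"):
    """Return (sender, receiver) pairs for signatures found on one user talk page."""
    if not page_title.startswith(prefix):
        return []
    receiver = normalize(page_title[len(prefix):])
    hits = []
    for pattern in (USER_SIG, ANON_SIG):
        for match in pattern.finditer(wikitext):
            hits.append((match.start(), normalize(match.group(1))))
    # Sort by position so that edges follow the order of the text.
    return [(sender, receiver) for _position, sender in sorted(hits)]
```

Note that this sketch counts every matching link, so a personalized signature that links to the same user page twice yields two hits; a real extractor needs some way to merge them.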
30. (2) History algorithm
Input: stub-meta-history XML
● Consider pages with title User talk:T (or equivalent in other languages)
● Consider revision by user S as a message from S to T
<page>
<title>User talk:Phauly</title>
<revision>
<timestamp>2006-11-07T15:29:48Z</timestamp>
<contributor>
<username>Shell Kinney</username>
</contributor>
</revision>
<revision>
<timestamp>2008-10-07T01:19:54Z</timestamp>
<contributor>
<username>Smith609</username>
</contributor>
</revision>
<revision>
<timestamp>2010-02-08T12:08:19Z</timestamp>
<contributor>
<ip>217.77.80.29</ip>
</contributor>
</revision>
</page>
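A minimal sketch of the history rule, assuming a dump laid out like the sample on this slide (page, title, revision, timestamp, contributor with username or ip). The function names `local` and `history_edges` and the `prefix` parameter are hypothetical; the released scripts may be organized differently. Only the standard library is used, and pages are cleared after processing because full-history dumps are very large.

```python
import xml.etree.ElementTree as ET


def local(tag):
    """Strip an XML namespace prefix, e.g. '{http://...}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def history_edges(xml_source, prefix="User talk:"):
    """Yield (sender, receiver, timestamp) for every revision of a user talk page.

    xml_source is a file name or a binary file object containing the dump.
    """
    for _event, elem in ET.iterparse(xml_source, events=("end",)):
        if local(elem.tag) != "page":
            continue
        title = next((c.text or "" for c in elem if local(c.tag) == "title"), "")
        if title.startswith(prefix):
            receiver = title[len(prefix):]
            for rev in (c for c in elem if local(c.tag) == "revision"):
                # Flatten the revision: timestamp, username or ip end up as keys.
                fields = {local(n.tag): (n.text or "") for n in rev.iter()}
                sender = fields.get("username") or fields.get("ip")
                if sender:
                    yield sender, receiver, fields.get("timestamp", "")
        elem.clear()  # release the page once it has been processed
```

In this sketch an edit by the owner of the page shows up as a self-edge (S equals T), and bots and IP addresses are returned like any other sender; filtering them is left to a later step.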
31. They produce different
networks
But
Which is more correct?
Which is more meaningful?
(1) Signatures in text (automated)
(2) History of edits (automated)
32. (3) Manual coding
Validation on Venetian Wikipedia by
manually visiting every user talk page
and manually extracting every
“message“
#users (active in writing or receiving) = 918
(out of 6255 registered users)
#messages = 1786
(paper about “content of messages“ on
UTPs: most are coordination)
34. Goal of Manual Coding
Manual coding = opportunity to notice
patterns and regularities as well as
exceptions to them.
Goal: providing empirical evidence of the
reliability of the extraction algorithms.
35. Which is correct? Best?
(1) Signatures in text (automated)
(2) History of edits (automated)
(3) Manual coding
NONE is correct. Not even Manual coding.
They are different.
Most important issues and strategies to
cope with them are in next slides.
(comparison on data at December 30, 2009)
36. (A) Number of nodes
(3) Manual coding 918
(1) Signatures 906
(2) History 981
Why? See next slides
37. (B) Renamed users
Small issue but relevant impact
Venetian Wikipedia = 15 renamings
English Wikipedia = 17,096 renamings
38. (B) Renamed users
Vec.wiki: “Maximillion Pegasus” user wrote msgs on User talk pages
Then a person requested username “Maximillion Pegasus” and got it.
Bureaucrats renamed “Maximillion Pegasus” into
“Usurped12032009”.
UTP of “Usurped12032009” contains messages received when he
was “Maximillion Pegasus”.
The new “Maximillion Pegasus” never received msg
Existing signatures not affected by rename.
So
Usurped12032009 has high indegree and 0 outdegree
“Maximillion Pegasus” has 0 indegree and high outdegree.
It took time to find this user, understand the issue, and figure out it was not
a bug in our code!
The Signature algorithm gets this case wrong! Manual coding too!
History works because the XML file contains the username of the “real“
user, such as Usurped12032009
39. (B) Renamed users
This issue is NOT marginal.
17,000+ renamings in the English
Wikipedia
and usually involving very active and
peculiar users!
This issue affects the most basic element
of social networks, number of nodes!
40. (C) Number of edges
#pairs of users (unweighted) between
which at least 1 msg was written
(3) Manual coding 1073
(1) Signatures 1087
(2) History 1869
Why? See next slides
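One way to see where two extractions disagree is to reduce each list of messages to a set of user pairs and compare the sets. The sketch below is an assumption about how such a comparison could be done, not the procedure used for the numbers on this slide: it treats pairs as directed (sender, receiver), and the names `directed_pairs` and `compare` are hypothetical.

```python
def directed_pairs(messages):
    """Collapse (sender, receiver) messages into unweighted directed pairs."""
    return {(sender, receiver) for sender, receiver in messages}


def compare(reference, candidate):
    """Count the pairs shared by both extractions and the pairs unique to each."""
    ref = directed_pairs(reference)
    cand = directed_pairs(candidate)
    return {
        "shared": len(ref & cand),  # found by both
        "missed": len(ref - cand),  # only in the reference (e.g. manual coding)
        "extra": len(cand - ref),   # only in the candidate (e.g. history)
    }
```

For example, with invented toy data, `compare([("A", "B"), ("B", "C")], [("A", "B"), ("D", "B")])` reports one shared pair, one missed and one extra.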
41. (D) Information messages and
redirects
“I don't check this vec.wiki often, please write
to User:X on en.wiki [Signature of User:X]“ →
user X in en.wiki might be different from user X
in vec.wiki: only users in one wiki are
considered
(bot) “This is a bot, please write to User:X“
Information messages 60/1786
Redirects 27/1786
Manual coding = OK
Signature = ~KO
History = ~OK (but … A edits UTP of A...)
42. (E) Messages to oneself
A writes on UTP of A
56/1786 messages were self-edges
Wikipedia recommendation: A replies
to B on UTP of B
Small evidence but it seems to
happen: self-edges are rare and
mainly information messages
43. (F) Non human users writing
messages
Each bot has its own “logic“. 1 example:
Marco27Bot is a welcome bot
44. Many messages are templates!
Welcome templates {{benvegnu}}
Out of 1786 msgs, 774 (43.33%) are welcome templates.
In vec.wiki they are written by a bot (Marco27Bot) but signed with usernames of volunteers
Manual coding and Signature algo: find signers (appearance)
History finds bot (reality)
Suggestion: don't consider bots because of their automated nature
45. (G) Anonymous users, vandalism
and deleted messages
Anon users (IP address) have UTPs
They received 33 messages from bots about
possible vandalism
Many of their edits got deleted
Coding and Signature don't find deleted edits
History finds them
Suggestion: remove anonymous users (IP
addresses don't map 1-to-1 to persons anyway)
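The two suggestions (leave out bots, leave out anonymous users) can be sketched as a filter over extracted edges. This assumes that anonymous users are exactly those whose name parses as an IP address and that the list of bot names is supplied by the analyst; the names `is_anonymous` and `drop_bots_and_anons` are hypothetical.

```python
import ipaddress


def is_anonymous(username):
    """True if the username is an IPv4 or IPv6 address, i.e. an anonymous user."""
    try:
        ipaddress.ip_address(username)
    except ValueError:
        return False
    return True


def drop_bots_and_anons(edges, bots):
    """Keep only (sender, receiver) edges where both ends are registered humans.

    bots is an analyst-provided list of bot usernames (an assumption here;
    the list has to be compiled separately for each wiki).
    """
    skip = {name.casefold() for name in bots}

    def keep(user):
        return user.casefold() not in skip and not is_anonymous(user)

    return [(s, r) for s, r in edges if keep(s) and keep(r)]
```

The removed edges can be kept in a separate list if bots and anonymous users are to be analyzed on their own.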
46. (H) Many edits per message
I edit the UTP of X,
I discover a typo,
I re-edit the UTP of X
These are not 2 messages but history
algorithm detects 2 edits.
Possible heuristic: collapse edits
occurring within a short time window
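The collapsing heuristic could look like the sketch below. The 10-minute window is an arbitrary assumption chosen for illustration, and `collapse_edits` is a hypothetical name; the slide does not say which threshold, if any, the released scripts use.

```python
from datetime import datetime, timedelta


def collapse_edits(edits, window=timedelta(minutes=10)):
    """Merge bursts of edits by the same sender on the same talk page.

    edits is an iterable of (sender, receiver, datetime). An edit counts as a
    new message only if more than `window` has passed since the previous edit
    of the same (sender, receiver) pair; the first edit of each burst is kept.
    """
    last_seen = {}
    messages = []
    for sender, receiver, when in sorted(edits, key=lambda edit: edit[2]):
        key = (sender, receiver)
        previous = last_seen.get(key)
        if previous is None or when - previous > window:
            messages.append((sender, receiver, when))
        last_seen[key] = when  # bursts chain: the window restarts at every edit
    return messages
```

A wider window merges more typo fixes but also risks merging two genuinely separate messages written close together.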
47. (I) Personalized, missing or
incorrectly formatted signatures
Large variety in personalized signatures
Hard to reliably detect all signatures,
especially for very active users! And each
language Wikipedia has different
practices.
Most active vec.wiki user used a template
for signature! {{Utente:Nick1915/firma}}
Biggest drawback of signature algorithm
48. (I) Personalized, missing or
incorrectly formatted signatures
Users forget to sign (not automatic).
A bot (SineBot in EnWiki and Marco27Bot
in VecWiki) edits the page and adds the
signature. → It seems the bot “talks“ a
lot.
Some users make errors in the syntax for
signing
Signature = KO
History = OK (forgot to sign is not a
problem, but discard bots)
49. (J) Date of message
Messages are (often) dated → possible
longitudinal analysis!
Signature algo = KO: must detect syntax
of date, different over time (in vec.wiki)
and different in each language Wikipedia
History algo = OK: has the info formally
coded in XML dump
<timestamp>2006-11-07T15:29:48Z</timestamp>
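Because the dump timestamp is machine-readable, a longitudinal view needs very little code. The sketch below assumes the ISO 8601 UTC form used in MediaWiki dumps, for example 2006-11-07T15:29:48Z; the function names `parse_timestamp` and `messages_per_year` are hypothetical.

```python
from collections import Counter
from datetime import datetime, timezone


def parse_timestamp(value):
    """Parse a dump timestamp such as '2006-11-07T15:29:48Z' as UTC."""
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


def messages_per_year(timestamps):
    """Count messages by calendar year, a first step for longitudinal analysis."""
    return Counter(parse_timestamp(value).year for value in timestamps)
```

The same parsed values can be used to cut the network into yearly or monthly slices before computing any network measure.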
50. (K) Archived messages
When UTPs become long, they get archived (by
a bot).
Current content is copied to a newly created
page such as User_talk:Phauly/Archive3
But NOT all subpages of UTP are archives!
Coding and Signature = KO: must decide whether to look for
signatures in subpages based on heuristics on
page title (what is this in Chinese Wikipedia?)
History = OK: edits are done to “main“ UTP
Issue very relevant for “active“ users!
51. Our scripts are open source!
You can run them and extract networks (in order to
analyze them). Python code at
https://github.com/phauly/wiki-network
Networks already available as extracted by 2
algorithms for German, Spanish, Italian,
Chinese and Venetian Wikipedia
http://sonetlab.fbk.eu/data/social_networks_of_wikipedia/
GraphML format: play with them using Gephi!
(http://www.gephi.org)
Social Network Analysis of who talks to whom on
Wikipedia is possible without caring about all these
details of extraction!
52. Size=Indegree
(#received msgs)
Color=Role
2005-2010 Cumulative
Weighted
Directed
Social network
(who talks to whom)
Nodes=Users (918)
(out of 6255 registered users)
Edges=#Messages
53. Nodes=Users (918)
Most users just
received messages
(receivers, passive)
Only 196 users wrote
at least one msg!
(senders, active)
54. Discussion
No algo is “correct“, not even manual
coding!
Bots and anonymous users should be
removed and analyzed ad hoc
Interested in
(1) the network users see (with its
variability in signatures and formats)
Signature algorithm ok but works only on one
language Wikipedia and needs tweaking
(2) the network of what really happened
History algorithm more robust, also across
wikis (cross-wiki comparison) and with
dates (longitudinal analysis).
55. Conclusions
Small change in algorithm/assumption =
big change in “what you extract“ and
hence in “what you find“!!
Proposed 2 algorithms
Empirical Validation by manual coding
1) Bots and anonymous users to be excluded
and treated separately and ad hoc
2) History algorithm = more robust
Open source scripts: first step towards
sociology of wikis
56. Credits
I would like to thank
Davide Setti
Marco Frassoni
For writing the code and for manual
coding
Don't forget
Call for Postdoc at SoNet
https://risorseumane.fbk.eu/it/node/234