Social networks of Wikipedia - Paolo Massa - Presentation at (2011). ACM Hypertext 2011: 22nd ACM Conference on Hypertext and Hypermedia

The paper is at http://www.gnuband.org/papers/social_networks_of_wikipedia/

Wikipedia, the free online encyclopedia anyone can edit, is a live social experiment: millions of individuals volunteer their knowledge and time to collectively create it. It is hence interesting to try to understand how they do it. While most attention has concentrated on article pages, a lesser-known share of activity happens on user talk pages, Wikipedia pages where a message can be left for a specific user. These public conversations can be studied from a Social Network Analysis perspective in order to highlight the structure of the “talk” network. In this paper we focus on this preliminary extraction step by proposing different algorithms. We then empirically validate the differences in the networks they generate on the Venetian Wikipedia against the real network of conversations, extracted manually by coding every message left on all user talk pages. The comparisons show that both the algorithms and the manual process contain inaccuracies that are intrinsic to the freedom and unpredictability of Wikipedia's growth. Nevertheless, a precise description of the issues involved allows researchers to make informed decisions and to base empirical findings on reproducible evidence. Our goal is to lay the foundation for a solid computational sociology of wikis. For this reason we release the scripts encoding our algorithms as open source, along with some datasets extracted from Wikipedia conversations, in order to let other researchers replicate and improve our initial effort.

The scripts (Python) have been released as open source, along with network datasets (in GraphML format). See http://sonetlab.fbk.eu/data/social_networks_of_wikipedia/

  1. Social Networks of Wikipedia. Paolo Massa, SoNet @ Bruno Kessler Foundation, Trento, Italy. http://www.gnuband.org
  2. Contributions: Methodological paper on algorithms for extracting a network of who talks to whom on Wikipedia + validation of quality by manual coding. Code is open source and reusable = basic step for Social Network Analysis.
  3. Outline
     ● Statistics on Wikipedia/wiki
     ● Algorithms for Extracting a Social Network
     ● Manual Validation of Algorithms
  4. English Wikipedia: started in 2001. 3,500,000+ articles; 440,000,000+ edits; 14,000,000+ registered users; 3,500,000+ users with at least 1 edit.
  5. Multi-lingual: 280+ Wikipedias. 50,000+ wikis on Wikia.com, some with 1,000,000+ edits.
  6. Article Page
  7. Article Page / Article Talk Page
  8. User Page
  9. User Page – User Talk Page (UTP)
  10. How to extract a network of who talks to whom from User talk pages?
  11. User talk page http://en.wikipedia.org/wiki/User_talk:Phauly
  12. User talk page http://en.wikipedia.org/wiki/User_talk:Phauly
  13. User talk page http://en.wikipedia.org/wiki/User_talk:Phauly (diagram: Shell → Phauly, weight 1)
  14. User talk page http://en.wikipedia.org/wiki/User_talk:Phauly (diagram: Shell → Phauly, weight 1)
  15. User talk page http://en.wikipedia.org/wiki/User_talk:Phauly (diagram: Shell → Phauly, weight 1; Martin → Phauly, weight 1)
  16. Broader scope. We (SoNet) work on:
     ● How UTPs are used (coordination)
     ● Characterizing users of Wikipedia (based on gender, interests, religion, ...)
     ● Formation of collective memories of events in Wikipedia
     ● Goal: understand/model what users do in Wikipedia → Wikisociology
  17. We're hiring! ;)
     Call for researcher at https://risorseumane.fbk.eu/it/node/234
     Info about the SoNet group at http://sonet.fbk.eu
     If interested, come and talk to me!
  18. Other Wikipedia networks
     ● Few papers on User talk pages
     ● Node=User
        ● Edge=Coediting x articles
        ● Edge=Editing article after user A
        ● Edge=Reverted edit of user A
        ● Edge=Vote in elections for admins
     ● Node=Page / Edge=Link
     ● Node=Category / Edge=Inclusion
  19. How to extract who talks to whom? 3 ways:
     (1) Signatures (automated)
     (2) History of edits (automated)
     (3) Manual coding
  20. Input: Wikipedia dumps. XML dump of every edit that occurred to every page over time (10 years!). English Wikipedia dump = 5,600 Gigabytes! (Our scripts work on every wiki: 280+ language Wikipedias, but also 50,000+ wikia.com wikis ...)
  21. How to extract who talks to whom? 3 ways:
     (1) Signatures in text (automated)
     (2) History of edits (automated)
     (3) Manual coding
  22. (1) Signature algorithm
  23. (1) Signature algorithm
  24. (1) Signature algorithm: pages-meta-current XML (excerpt; long lines are cut off at the right edge of the slide)
      <page>
        <title>User talk:Phauly</title>
        <revision>
          <text xml:space="preserve">
      == Welcome! ==
      Hello, {{BASEPAGENAME}}, and [[Wikipedia:Welcome, newcomers|welcome]] t
      your contributions. I hope you like the place and decide to stay. Here
      might find helpful:
      *[[Wikipedia:Five pillars|The five pillars of Wikipedia]]
      *[[Wikipedia:How to edit a page|How to edit a page]]
      *[[Help:Contents|Help pages]]
      *[[Wikipedia:Tutorial|Tutorial]]
      *[[Wikipedia:Article development|How to write a great article]]
      *[[Wikipedia:Manual of Style|Manual of Style]]
      I hope you enjoy editing here and being a [[Wikipedia:Wikipedians|Wikip
      [[Wikipedia:Sign your posts on talk pages|sign your name]] on talk page
      (<nowiki>~~~~</nowiki>); this will automatically produce your name and
      check out [[Wikipedia:Questions]], ask me on my talk page, or place
      <code><nowiki>{{helpme}}</nowiki></code> on your talk page and someone
      answer your questions. Again, welcome!&nbsp;. [[User:Shell_Kinney|Shell
      <sup>[[User_talk:Shell_Kinney|babelfish]]</sup> 15:29, 7 November 2006
      == "Wikipedia endnote assisstant" ==
      Hi, sorry to take so long to reply to your message. Its convention at
      messages at the bottom of the page, and as I was moving country at the
      see your message until now! Have you tried the updated URL,
      http://toolserver.org/~verisimilus/Scholar ? Let me know if you continu
      Glad you find the tool useful! Best wishes,
      [[User:Smith609|Martin]]&nbsp;<small>([[User:Smith609|S
      [[User_talk:Smith609|Talk]])</small> 01:19, 7 October 2008
      == Test anonymous edit ==
      Just a test done by myself on signature formatting. --[[Special:Contrib
      217.77.80.29]] ([[User talk:217.77.80.29|talk]]) 12:08, 8 February 2010
          </text>
        </revision>
      </page>
  25. (1) Signature algorithm (annotations over the same pages-meta-current XML excerpt as slide 24)
     ● Consider pages with title User talk:T (or equivalent in other languages)
     ● Search for signatures of user S in text
     ● Consider them as a message from S to T
     Signature of XXX if [[User:XXX|
     Signature of 217.77.80.29 if [[Special:Contributions/217.77.80.29|
     Robust on spaces, HTML tags, unbalanced parentheses, ...
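The signature rules on the slide above can be sketched in a few lines of Python. This is only an illustrative sketch, not the released wiki-network code: the two regular expressions, the `signature_edges` name, the underscore-to-space normalization, and the choice to count each signer at most once per `== section ==` are all assumptions made here for the example.

```python
import re
from collections import Counter

# Assumed patterns, following the slide: "[[User:XXX|" marks a registered
# user's signature, "[[Special:Contributions/IP|" an anonymous one.
USER_SIG = re.compile(r"\[\[\s*User\s*:\s*([^|\]/#]+?)\s*(?:\||\]\])", re.IGNORECASE)
IP_SIG = re.compile(
    r"\[\[\s*Special\s*:\s*Contributions\s*/\s*([^|\]]+?)\s*(?:\||\]\])",
    re.IGNORECASE,
)


def signature_edges(title, wikitext, prefix="User talk:"):
    """Return a Counter {(sender, receiver): n} for one user talk page."""
    if not title.startswith(prefix):
        return Counter()
    target = title[len(prefix):]
    edges = Counter()
    # Assumption: one "== heading ==" section is one message thread, and a
    # signer is counted once per section even if linked several times.
    for section in re.split(r"^==.*==\s*$", wikitext, flags=re.MULTILINE):
        signers = set()
        for pattern in (USER_SIG, IP_SIG):
            for match in pattern.finditer(section):
                signers.add(match.group(1).replace("_", " ").strip())
        for signer in signers:
            edges[(signer, target)] += 1
    return edges


SAMPLE = """== Welcome! ==
Again, welcome! [[User:Shell_Kinney|Shell]] <sup>[[User_talk:Shell_Kinney|babelfish]]</sup> 15:29, 7 November 2006 (UTC)
== Tool ==
Best wishes, [[User:Smith609|Martin]] <small>([[User:Smith609|Smith609]] [[User_talk:Smith609|Talk]])</small> 01:19, 7 October 2008 (UTC)
== Test anonymous edit ==
Just a test. --[[Special:Contributions/217.77.80.29|217.77.80.29]] ([[User talk:217.77.80.29|talk]]) 12:08, 8 February 2010 (UTC)
"""

edges = signature_edges("User talk:Phauly", SAMPLE)
for (sender, receiver), weight in sorted(edges.items()):
    print(sender, "->", receiver, weight)
```

Links to `User_talk:` pages are deliberately not matched, since the slide defines a signature only through the `[[User:XXX|` and `[[Special:Contributions/...|` forms.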
  26. (2) History algorithm
  27. (2) History algorithm
  28. (2) History algorithm: stub-meta-history XML
      <page>
        <title>User talk:Phauly</title>
        <revision>
          <timestamp>2006-11-07T15:29:48Z</timestamp>
          <contributor>
            <username>Shell Kinney</username>
          </contributor>
        </revision>
        <revision>
          <timestamp>2008-10-07T01:19:54Z</timestamp>
          <contributor>
            <username>Smith609</username>
          </contributor>
        </revision>
        <revision>
          <timestamp>2010-02-08T12:08:19Z</timestamp>
          <contributor>
            <ip>217.77.80.29</ip>
          </contributor>
        </revision>
      </page>
  29. (2) History algorithm (annotations over the same stub-meta-history XML excerpt as slide 28)
     ● Consider pages with title User talk:T (or equivalent in other languages)
     ● Consider a revision by user S as a message from S to T
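The history rule above (every revision by S on "User talk:T" is a message from S to T) can be sketched with the standard library's streaming XML parser. This is a hedged illustration, not the authors' script: the function names are hypothetical, the sample wraps the slide's page in an assumed `<mediawiki>` root, and real dumps carry an XML namespace, which the sketch strips from tag names.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter


def local(tag):
    """Drop a '{namespace}' prefix so namespaced dump tags compare equal."""
    return tag.rsplit("}", 1)[-1]


def history_edges(source, prefix="User talk:"):
    """Return a Counter {(sender, receiver): n_revisions} from a stub dump."""
    edges = Counter()
    for _event, page in ET.iterparse(source, events=("end",)):
        if local(page.tag) != "page":
            continue
        title, senders = None, []
        for element in page.iter():
            name = local(element.tag)
            if name == "title":
                title = element.text
            elif name in ("username", "ip"):
                senders.append(element.text)
        if title and title.startswith(prefix):
            target = title[len(prefix):]
            for sender in senders:
                edges[(sender, target)] += 1
        page.clear()  # free the revisions of a finished page
    return edges


STUB = b"""<mediawiki>
 <page>
  <title>User talk:Phauly</title>
  <revision>
   <timestamp>2006-11-07T15:29:48Z</timestamp>
   <contributor><username>Shell Kinney</username></contributor>
  </revision>
  <revision>
   <timestamp>2008-10-07T01:19:54Z</timestamp>
   <contributor><username>Smith609</username></contributor>
  </revision>
  <revision>
   <timestamp>2010-02-08T12:08:19Z</timestamp>
   <contributor><ip>217.77.80.29</ip></contributor>
  </revision>
 </page>
</mediawiki>"""

edges = history_edges(io.BytesIO(STUB))
for (sender, receiver), weight in sorted(edges.items()):
    print(sender, "->", receiver, weight)
```

Streaming with `iterparse` is chosen here because the slides quote dump sizes in the thousands of gigabytes, which rules out loading a whole file into memory.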
  30. They produce different networks. But which is more correct? Which is more meaningful?
     (1) Signatures in text (automated)
     (2) History of edits (automated)
  31. (3) Manual coding. Validation on Venetian Wikipedia by manually visiting every user talk page and manually extracting every “message”. #users (active in writing or receiving) = 918 (out of 6255 registered users). #messages = 1786. (Paper about “content of messages” on UTPs: most are coordination.)
  32. Why Venetian Wikipedia? Small, so complete manual coding is possible. http://en.wikipedia.org http://vec.wikipedia.org
  33. Goal of Manual Coding. Manual coding = opportunity to notice patterns and regularities as well as exceptions to them. Goal: providing empirical evidence of the reliability of the extraction algorithms.
  34. Which is correct? Best?
     (1) Signatures in text (automated)
     (2) History of edits (automated)
     (3) Manual coding
     NONE is correct. Not even manual coding. They are different. The most important issues, and strategies to cope with them, are in the next slides. (Comparison on data as of December 30, 2009.)
  35. (A) Number of nodes
     (3) Manual coding: 918
     (1) Signatures: 906
     (2) History: 981
     Why? See next slides.
  36. (B) Renamed users. Small issue but relevant impact. Venetian Wikipedia = 15 renamings. English Wikipedia = 17,096 renamings.
  37. (B) Renamed users. Vec.wiki: user “Maximillion Pegasus” wrote msgs on User talk pages. Then another person requested the username “Maximillion Pegasus” and got it. Bureaucrats renamed the original “Maximillion Pegasus” to “Usurped12032009”. The UTP of “Usurped12032009” contains messages received when he was “Maximillion Pegasus”. The new “Maximillion Pegasus” never received a msg. Existing signatures are not affected by the rename. So “Usurped12032009” has high indegree and 0 outdegree, while “Maximillion Pegasus” has 0 indegree and high outdegree. It took time to find this user, understand the issue, and figure out it was not a bug in our code! The Signature algorithm makes an error in this case! Manual coding too! History works because the XML file contains the username of the “real” user, i.e. Usurped12032009.
  38. (B) Renamed users. This issue is NOT marginal: 17,000+ renamings in the English Wikipedia, usually involving very active and peculiar users! This issue affects the most basic element of social networks, the number of nodes!
  39. (C) Number of edges. #pairs of users (unweighted) between which at least 1 msg was written:
     (3) Manual coding: 1073
     (1) Signatures: 1087
     (2) History: 1869
     Why? See next slides.
  40. (D) Information messages and redirects. “I don't check this vec.wiki often, please write to User:X on en.wiki [Signature of User:X]” → user X in en.wiki might be different from user X in vec.wiki: only users in one wiki are considered. (bot) “This is a bot, please write User:X”. Information messages: 60/1786. Redirects: 27/1786. Manual coding = OK. Signature = ~KO. History = ~OK (but … A edits UTP of A...).
  41. (E) Messages to oneself. A writes on UTP of A. 56/1786 messages were self-edges. Wikipedia recommendation: A replies to B on UTP of B. Limited evidence, but the recommendation seems to be followed: self-edges are rare and mainly information messages.
  42. (F) Non-human users writing messages. Each bot has its own “logic”. 1 example: Marco27Bot is a welcome bot.
  43. Many messages are templates! Welcome templates {{benvegnu}}. Out of 1786 msgs, 774 (43.33%) are welcome templates. In vec.wiki they are written by a bot, Marco27Bot, but signed with usernames of volunteers. Manual coding and Signature algo: find signers (appearance). History finds the bot (reality). Suggestion: don't consider bots because of their automated nature.
  44. (G) Anonymous users, vandalism and deleted messages. Anon users (IP address) have UTPs. They received 33 messages from bots about possible vandalism. Many of their edits got deleted. Coding and Signature don't find deleted edits; History finds them. Suggestion: remove anonymous users (IP addresses don't map 1-to-1 to persons anyway).
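The two suggestions above (leave out bots, leave out anonymous users) amount to a filter over the extracted edge list. The sketch below assumes edges are stored as a dict keyed by `(sender, receiver)`, that anonymous users are exactly those whose name parses as an IP address, and that the set of bot names is supplied by the caller; the function names are hypothetical.

```python
import ipaddress


def is_anonymous(username):
    """True if the 'username' is really an IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(username)
    except ValueError:
        return False
    return True


def drop_bots_and_anonymous(edges, bots):
    """Keep only edges whose two endpoints are registered, non-bot users."""
    def excluded(user):
        return user in bots or is_anonymous(user)

    return {
        (sender, receiver): weight
        for (sender, receiver), weight in edges.items()
        if not excluded(sender) and not excluded(receiver)
    }


raw_edges = {
    ("Marco27Bot", "Phauly"): 1,
    ("217.77.80.29", "Phauly"): 1,
    ("Shell Kinney", "Phauly"): 2,
    ("Marco27Bot", "217.77.80.29"): 3,
}
human_edges = drop_bots_and_anonymous(raw_edges, bots={"Marco27Bot"})
print(human_edges)
```

The removed edges can be kept in a separate structure if bots and anonymous users are to be analyzed on their own.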
  45. (H) Many edits per message. I edit the UTP of X, I discover a typo, I re-edit the UTP of X. This is 1 message, not 2, but the history algorithm detects 2 edits. Possible heuristic: collapse edits occurring within a short time.
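One way to read the collapsing heuristic is sketched below for the revisions of a single user talk page. The 10-minute window, the per-sender chaining rule, and the `collapse_edits` name are assumptions chosen for illustration; the slide only says "short time".

```python
from datetime import datetime, timedelta

DUMP_TIME_FORMAT = "%Y-%m-%dT%H:%M:%SZ"  # timestamp layout shown in the XML dump


def collapse_edits(revisions, window=timedelta(minutes=10)):
    """Merge quick successive edits by the same sender into one message.

    revisions: iterable of (timestamp_string, sender) for ONE talk page.
    Returns a list of (datetime, sender), one entry per inferred message.
    """
    parsed = sorted(
        (datetime.strptime(stamp, DUMP_TIME_FORMAT), sender)
        for stamp, sender in revisions
    )
    last_seen = {}  # sender -> time of that sender's previous revision
    messages = []
    for when, sender in parsed:
        previous = last_seen.get(sender)
        if previous is None or when - previous > window:
            messages.append((when, sender))
        last_seen[sender] = when  # a chain of quick fixes keeps extending
    return messages


revisions = [
    ("2010-02-08T12:08:19Z", "A"),
    ("2010-02-08T12:10:02Z", "A"),  # typo fix two minutes later
    ("2010-02-08T12:30:00Z", "B"),
    ("2010-02-09T09:00:00Z", "A"),
]
messages = collapse_edits(revisions)
print([sender for _when, sender in messages])
```

A shorter window undercounts less but lets more typo fixes through as separate messages, so the value is a trade-off rather than a fixed constant.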
  46. (I) Personalized, missing or incorrectly formatted signatures. Large variety in personalized signatures. Hard to reliably detect all signatures, especially for very active users! And each language Wikipedia has different practices. The most active vec.wiki user used a template for the signature! {{Utente:Nick1915/firma}} Biggest drawback of the signature algorithm.
  47. (I) Personalized, missing or incorrectly formatted signatures. Users forget to sign (signing is not automatic). A bot (SineBot in EnWiki and Marco27Bot in VecWiki) edits the page and adds the signature. → It seems the bot “talks” a lot. Some users make errors in the syntax for signing. Signature = KO. History = OK (forgetting to sign is not a problem, but discard bots).
  48. (J) Date of message. Messages are (often) dated → possible longitudinal analysis! Signature algo = KO: must detect the syntax of the date, which differs over time (in vec.wiki) and in each language Wikipedia. History algo = OK: has the info formally coded in the XML dump: <timestamp>2006-11-07T15:29:48Z</timestamp>
  49. (K) Archived messages. When UTPs become long, they get archived (by a bot). Current content is copied to a newly created page such as User_talk:Phauly/Archive3. But NOT all subpages of a UTP are archives! Coding and Signature = KO: must decide whether to look for signatures in subpages based on heuristics on the page title (what is this in Chinese Wikipedia?). History = OK: edits are done to the “main” UTP. Issue very relevant for “active” users!
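A title heuristic of the kind mentioned above might look like the following sketch. It assumes that usernames contain no "/" so that everything after the first slash is a subpage, and it uses a hypothetical, language-specific hint (`archiv`) that would have to be replaced for each wiki; the function names are illustrative.

```python
import re

# Hypothetical hint covering titles such as "Archive3" or "Archivio 2";
# other language editions would need their own pattern.
ARCHIVE_HINT = re.compile(r"^archiv", re.IGNORECASE)


def split_talk_title(title, prefix="User talk:"):
    """Return (owner, subpage) for a user talk title, or None otherwise."""
    if not title.startswith(prefix):
        return None
    owner, _slash, subpage = title[len(prefix):].partition("/")
    return owner, subpage


def looks_like_archive(subpage):
    """Guess from the subpage name alone whether it holds archived messages."""
    return bool(ARCHIVE_HINT.match(subpage))


owner, subpage = split_talk_title("User talk:Phauly/Archive3")
print(owner, subpage, looks_like_archive(subpage))
```

Messages found on a subpage judged to be an archive would then be attributed to the owner of the main talk page.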
  50. Our scripts are open source! You can run them and extract networks (in order to analyze them). Python code at https://github.com/phauly/wiki-network
     Networks already available as extracted by the 2 algorithms for German, Spanish, Italian, Chinese and Venetian Wikipedia: http://sonetlab.fbk.eu/data/social_networks_of_wikipedia/
     GraphML format: play with them using Gephi! (http://www.gephi.org)
     Social Network Analysis of who talks to whom on Wikipedia is possible without caring about all these details of extraction!
  51. Figure: 2005-2010 cumulative, weighted, directed social network (who talks to whom). Nodes = users (918, out of 6255 registered users). Edges = #messages. Size = indegree (#received msgs). Color = role.
  52. Nodes = users (918). Most users just received messages (receivers, passive). Only 196 users wrote at least one msg! (senders, active)
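The quantities quoted on these slides (number of nodes, number of unweighted edges, indegree, senders versus passive receivers) follow directly from a weighted edge list. The sketch below uses a made-up four-edge example, not the Venetian data, and the `degrees` helper is a hypothetical name.

```python
from collections import Counter


def degrees(edges):
    """Weighted in/out degree per user from {(sender, receiver): n_msgs}."""
    indegree, outdegree = Counter(), Counter()
    for (sender, receiver), weight in edges.items():
        outdegree[sender] += weight
        indegree[receiver] += weight
    return indegree, outdegree


# Toy data for illustration only.
edges = Counter({
    ("Shell Kinney", "Phauly"): 1,
    ("Smith609", "Phauly"): 2,
    ("Phauly", "Smith609"): 1,
    ("Shell Kinney", "Newbie"): 1,
})

indegree, outdegree = degrees(edges)
nodes = set(indegree) | set(outdegree)
senders = {user for user in nodes if outdegree[user] > 0}
passive = nodes - senders  # users who only received messages

print("nodes:", len(nodes))
print("unweighted edges:", len(edges))
print("senders:", sorted(senders))
print("passive receivers:", sorted(passive))
```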
  53. Discussion. No algo is “correct”, not even manual coding! Bots and anonymous users should be removed and analyzed ad hoc. Are you interested in:
     (1) the network users see (with its variability in signatures and formats)? The Signature algorithm is OK, but works only on one language Wikipedia and needs tweaking.
     (2) the network of what really happened? The History algorithm is more robust, also across wikis (cross-wiki comparison) and with dates (longitudinal analysis).
  54. Conclusions. Small change in algorithm/assumption = big change in “what you extract” and hence in “what you find”!! Proposed 2 algorithms. Empirical validation by manual coding.
     1) Bots and anonymous users to be excluded and treated separately and ad hoc
     2) History algorithm = more robust
     Open source scripts: first step towards a sociology of wikis.
  55. Credits. I would like to thank Davide Setti and Marco Frassoni for writing the code and for the manual coding. Don't forget: call for Postdoc at SoNet https://risorseumane.fbk.eu/it/node/234
  56. ? Thanks
