
Large Scale Social Analytics on Wikipedia, Delicious, and Twitter (presented at IBM NPUC 2010)


Ed H. Chi, Palo Alto Research Center

Large-Scale Social Analytics in Wikipedia, Delicious, and Twitter

We will illustrate an analytical research approach in social computing. Our research in Augmented Social Cognition is aimed at enhancing the ability of a group of people to remember, think, and reason. The drive to build models and theories for social computing research should further our understanding of how network science, behavioral economics, and evolutionary theories could explain how social systems work. Here we will summarize the published research we conducted on large-scale social analytics in Wikipedia, Delicious, and Twitter, and point out how social analytics can help us understand the intricacies of large social systems.

About the Speaker

Ed H. Chi is area manager and principal scientist at Palo Alto Research Center's Augmented Social Cognition Group. He leads the group in understanding how Web 2.0 and social computing systems help groups of people to remember, think, and reason. Ed completed his three degrees (B.S., M.S., and Ph.D.) in 6.5 years at the University of Minnesota, and has been doing research on user interface software systems since 1993. He has been featured and quoted in the press, including the Economist, Time Magazine, the LA Times, and the Associated Press. With 20 patents and over 70 research articles, he has won awards for both teaching and research. In his spare time, Ed is an avid Taekwondo martial artist, photographer, and snowboarder.

Published in: Business, Technology
  • 2015: Any update on conflict/coordination effects? For example, did adding talk pages to reach consensus on Wikipedia articles lead to more dissent, or have editors continued to leave Wikipedia? Moreover, does growth in reversions by editors signal users to leave online social systems for new systems? Only Facebook appears to have escaped this effect, by adding alternative social properties design to attract the prime dating demographic.

  1. Ed H. Chi, Principal Scientist and Area Manager. Peter Pirolli, Lichan Hong, Bongwon Suh, Les Nelson, Gregorio Convertino, Sharoda Paul. Interns: Sanjay Kairam, Jilin Chen, Brent Hecht, Michael Bernstein. Alumni: Raluca Budiu, Bryan Pendleton, Niki Kittur, Todd Mytkowicz, Terrell Russell, Brynn Evans, Bryan Chan, KMRC students. Augmented Social Cognition Area, Palo Alto Research Center.
  2. 2010-10-22 IBM NPUC 2010. To: From: Brad Barrish <brad@…removed.for.privacy….com> Subject: Pancreatic cancer Date: Thu, 1 Feb 2007 21:37:55 PST. Hey Ed. I'm a fellow user and noticed you bookmark a lot of pancreatic cancer stuff. I'm at home with my dad who was diagnosed a little over a year ago and is now at the tale end of things. I've learned a lot through his treatments and about what's out there. I dunno if it's something you or a family member has, but just wanted to drop you an email. Be well. Brad
  3. Cognition: the ability to remember, think, and reason; the faculty of knowing. Social Cognition: the ability of a group to remember, think, and reason; the construction of knowledge structures by a group (not quite the same as in the branch of psychology that studies the cognitive processes involved in social interaction, though included). Augmented Social Cognition: supported by systems, the enhancement of the ability of a group to remember, think, and reason; the system-supported construction of knowledge structures by a group. Citation: Chi, IEEE Computer, Sept 2008.
  4. Kudos to Todd Mytkowicz and Rowan Nairn
  5. Diagram labels: Topics; Concepts; Users; Documents; Tags (T1…Tn); Encoding; Decoding; Noise.
  6. H(Tag) shows tag saturation; H(Doc | Tag) shows browsability.
  7. I(Doc; Tag): mutual information. Rise in average tags per bookmark.
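The three quantities named on slides 6 and 7 can be computed directly from a list of tag assignments. The following is a minimal sketch, assuming the data is available as (document, tag) pairs with one pair per tag assignment; the function names and the toy data are illustrative and not taken from the talk.

```python
import math
from collections import Counter

def entropy(counts):
    """Shannon entropy in bits of a distribution given as raw counts."""
    counts = list(counts)
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def tagging_information(bookmarks):
    """Return (H(Tag), H(Doc|Tag), I(Doc;Tag)) in bits.

    bookmarks: iterable of (doc, tag) pairs, one per tag assignment
    (an assumed input format).
    """
    pairs = Counter(bookmarks)
    tags, docs = Counter(), Counter()
    for (doc, tag), n in pairs.items():
        tags[tag] += n
        docs[doc] += n
    h_tag = entropy(tags.values())
    h_doc = entropy(docs.values())
    h_joint = entropy(pairs.values())
    h_doc_given_tag = h_joint - h_tag        # chain rule: H(D,T) = H(T) + H(D|T)
    mutual_info = h_doc - h_doc_given_tag    # I(D;T) = H(D) - H(D|T)
    return h_tag, h_doc_given_tag, mutual_info

# Toy data (made up): four documents, two tags used equally often.
toy = [("d1", "python"), ("d2", "python"), ("d3", "howto"), ("d4", "howto")]
h_tag, h_doc_given_tag, mutual_info = tagging_information(toy)
```

A low H(Doc | Tag) means that knowing a tag narrows the set of documents considerably, which is one way to read "browsability" on slide 6.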
  8. Semantic Similarity Graph. Node labels: Guide, Web, Howto, Tips, Help, Tools, Tip, Tricks, Tutorial, Tutorials, Reference.
  9. Spreading activation in a bi-graph. Computation over a very large data set (150 million+ bookmarks). Diagram labels: Tags; URLs; P(URL|Tag); P(Tag|URL).
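Slide 9 names spreading activation over a tag/URL bi-graph with the transition probabilities P(URL|Tag) and P(Tag|URL). The sketch below shows one plausible reading of that idea on in-memory data: activation starts on a seed tag, flows to URLs, then flows back to tags. Estimating the probabilities from raw co-occurrence counts, the function names, and the toy bookmarks are all assumptions; the talk's actual computation ran at a far larger scale and may differ.

```python
from collections import defaultdict

def build_transitions(bookmarks):
    """Estimate P(URL|Tag) and P(Tag|URL) from (url, tag) co-occurrence counts."""
    tag_url = defaultdict(lambda: defaultdict(int))
    url_tag = defaultdict(lambda: defaultdict(int))
    for url, tag in bookmarks:
        tag_url[tag][url] += 1
        url_tag[url][tag] += 1

    def normalize(table):
        # Turn each row of counts into a probability distribution.
        return {
            src: {dst: n / sum(row.values()) for dst, n in row.items()}
            for src, row in table.items()
        }

    return normalize(tag_url), normalize(url_tag)

def spread(activation, transitions):
    """Push activation one hop across the bipartite graph."""
    out = defaultdict(float)
    for src, mass in activation.items():
        for dst, p in transitions.get(src, {}).items():
            out[dst] += mass * p
    return dict(out)

def related_tags(seed_tag, bookmarks, rounds=1):
    """Activate one tag and return the activation over tags after `rounds` round trips."""
    p_url_given_tag, p_tag_given_url = build_transitions(bookmarks)
    tags = {seed_tag: 1.0}
    for _ in range(rounds):
        urls = spread(tags, p_url_given_tag)   # tags -> URLs
        tags = spread(urls, p_tag_given_url)   # URLs -> tags
    return tags

# Toy bookmarks (made up): (url, tag) pairs.
toy = [("u1", "howto"), ("u1", "tutorial"), ("u2", "howto"),
       ("u2", "tips"), ("u3", "music")]
scores = related_tags("howto", toy)
```

Tags that share URLs with the seed end up with activation, while unrelated tags such as "music" in the toy data receive none.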
  11. Kudos to Bongwon Suh, Niki Kittur
  12. What drives contributions to Wikipedia? Conflict drives most of the contributions to Wikipedia (how do we measure conflict?). Conflict causes coordination costs to go up (measuring coordination costs).
  14. Figure labels: Mediators; Sympathetic to parents; Sympathetic to husband; Anonymous (vandals/spammers).
  15. Counting ‘Controversial’ labels. 5x cross-validation, R² = 0.897. Plot axes (each 0 to 10,000): predicted controversial revisions; actual controversial revisions.
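Slide 15 reports a cross-validated R² for predicted versus actual counts of controversial revisions. The sketch below illustrates how such a score can be computed with k-fold cross-validation. The slide does not state the regression model or the features, so a one-feature ordinary least squares fit is swapped in as a stand-in, and the data is synthetic.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b

def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

def cross_validated_r2(xs, ys, k=5, seed=0):
    """Predict every point from a model trained without its fold, then score once."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    predicted = [0.0] * len(xs)
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        a, b = fit_line([xs[i] for i in train], [ys[i] for i in train])
        for i in held_out:
            predicted[i] = a + b * xs[i]
    return r_squared(ys, predicted)

# Synthetic data (not from the talk): a noisy linear relationship.
rng = random.Random(1)
feature = [rng.uniform(0, 100) for _ in range(200)]
target = [3 * x + rng.gauss(0, 10) for x in feature]
score = cross_validated_r2(feature, target, k=5)
```

Pooling the held-out predictions before scoring is one common convention; averaging per-fold R² values is another, and the slide does not say which was used.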
  16. Number of Articles (Log Scale)
  17. Monthly Edits
  18. Monthly Edits
  19. Monthly Active Editors (in thousands)
  20. Monthly Active Editors (in thousands)
  21. Preferential attachment: edits beget edits (the more previous edits, the more new edits). The growth rate of the population, $\frac{dN}{dt}$, depends on N, the current population, and r, the growth rate of the population: $\frac{dN}{dt} = r \cdot N$, which gives $N(t) = N_0 \cdot e^{rt}$.
  22. Ecological population growth model: growth also depends on environmental conditions, through K, the carrying capacity (due to resource limitation): $\frac{dN}{dt} = rN\left(1 - \frac{N}{K}\right)$.
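The two growth models on slides 21 and 22 can be compared numerically. The sketch below evaluates the closed-form exponential solution, the standard closed-form logistic solution, and a forward Euler integration of the logistic equation as a cross-check. The parameter values are arbitrary illustrations, not values fitted to Wikipedia data.

```python
import math

def exponential(n0, r, t):
    """Closed-form solution of dN/dt = r*N."""
    return n0 * math.exp(r * t)

def logistic(n0, r, k, t):
    """Closed-form solution of dN/dt = r*N*(1 - N/K)."""
    return k / (1 + (k - n0) / n0 * math.exp(-r * t))

def logistic_euler(n0, r, k, t_end, dt=0.001):
    """Integrate the logistic equation with forward Euler steps."""
    n = n0
    for _ in range(round(t_end / dt)):
        n += r * n * (1 - n / k) * dt
    return n

# Arbitrary illustrative parameters (assumptions, not fitted values).
N0, R, K = 10.0, 0.5, 1000.0
for t in (0, 5, 10, 20, 40):
    print(t, round(exponential(N0, R, t), 1), round(logistic(N0, R, K, t), 1))
```

Early on the two curves are nearly identical; the logistic curve then flattens as N approaches K, while the exponential curve keeps growing without bound.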
  23. Follows a logistic growth curve. Plot label: New Article.
  24. Biological system: competition increases as the population hits the limits of the ecology, and advantage goes to members of the population that have competitive dominance over others. Analogy: limited opportunities to make novel contributions; increased patterns of conflict and dominance.
  25. Monthly Ratio of Reverted Edits
  27. Kudos to Brent Hecht, Jilin Chen, Bongwon Suh, Lichan Hong
  28. All Users Who Manually Specified Location (n = 10,000 users with 5 or more tweets)
  29. Users w/ No Useful Location Information Manually Entered (n = 3,311 users with 5 or more tweets)
  30. Schrute Farms (User ID 39111154); NONE YA BISNESS!! (User ID 75135928); in jail...smh (User ID 57987417); not tellin you (User ID 130681147)
  31. wherever justin wants me to be (User ID 71097545); Justin Biebers heart! (User ID 77503970); Jonasbieberland3 (User ID 134222427); Bieber Island (User ID 91705969)
  32. All Twitter Users (n = 10,000 users with 5 or more tweets)
  33. Users w/ Informative Location in the United States (n = 2,965 users with 5 or more tweets)
  34. California (User ID 125271323); Skinny Jeans City, IL (User ID 92455577); Bieberville, California (User ID 92455577); East Jesus Nowhere, Indiana (User ID 26526957)
  35. All 1,698 Fake Locations. Yahoo! Geocoder. Example input: Justin Biebers heart!
  36. All 1,698 Fake Locations. Yahoo! Geocoder. Example input: Justin Biebers heart! Output: Lat = 36.328785, Lon = -91.700189
  37. Location of Justin Bieber’s Heart (Don’t Tell Your Teenage Daughters)
  38. Country-scale: multinomial naive Bayes classifier, 10-fold cross-validation, 2.4x better than random.
  39. State-scale: multinomial naive Bayes classifier, 20% test set, 2.2x better than random.
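Slides 38 and 39 name a multinomial naive Bayes classifier for predicting a user's country or state. The sketch below is a small, self-contained version of that classifier family with Laplace smoothing. It assumes the input is the tokenized text of a user's tweets, which the slides do not state explicitly, and the training data is a made-up toy set, so it illustrates the technique rather than the study's setup.

```python
import math
from collections import Counter, defaultdict

class MultinomialNB:
    """Multinomial naive Bayes over bags of words with add-alpha smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        self.vocab = set()
        for words, label in zip(docs, labels):
            self.word_counts[label].update(words)
            self.vocab.update(words)
        return self

    def predict(self, words):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + self.alpha * len(self.vocab)
            score = math.log(n_docs / total_docs)   # log prior
            for w in words:
                if w in self.vocab:                 # skip words unseen in training
                    score += math.log((counts[w] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Made-up toy training set: token lists with region labels.
train_docs = [
    ["go", "lakers", "tonight"],
    ["traffic", "on", "the", "freeway"],
    ["watching", "the", "lakers"],
    ["cheers", "mate", "football"],
    ["queue", "for", "the", "tube"],
    ["football", "on", "the", "telly"],
]
train_labels = ["US", "US", "US", "UK", "UK", "UK"]
model = MultinomialNB().fit(train_docs, train_labels)
guess = model.predict(["lakers", "freeway"])
```

"Better than random" on the slides would then be the held-out accuracy of such a model divided by the accuracy of a random baseline.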
  40. Which tweet features are associated with retweets? Retweet model: # Retweets ~ function(f1, f2, …, fn), where the fi are simple features extracted from a tweet. Data: 74M tweets from the Twitter Stream API (characterization; 2~3% sample; Hadoop / HBase / MapReduce).
  41. Two Types of Features. Contextual features: # followees: 395; # followers: 1,400; # favorites: 1,657; # days (since June 17, 2008); # past tweets: 21,000. Content features: URL, hashtag, mention.
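The content and contextual features listed on slide 41 can be turned into the feature vector f1, …, fn used by the retweet model on slide 40. The sketch below is one simple way to do that. The regular expressions, the dictionary keys for the author, the example tweet, and the use of the talk date as the reference day are all assumptions made for illustration.

```python
import re
from datetime import date

URL_RE = re.compile(r"https?://\S+")
HASHTAG_RE = re.compile(r"(?<!\w)#\w+")
MENTION_RE = re.compile(r"(?<!\w)@\w+")

def content_features(text):
    """Content features: presence of a URL, number of hashtags and mentions."""
    # Remove URLs first so that a '#fragment' in a link is not counted as a hashtag.
    without_urls = URL_RE.sub(" ", text)
    return {
        "has_url": bool(URL_RE.search(text)),
        "n_hashtags": len(HASHTAG_RE.findall(without_urls)),
        "n_mentions": len(MENTION_RE.findall(without_urls)),
    }

def feature_vector(text, author):
    """Concatenate content and contextual features into one list.

    `author` is a hypothetical dict; its key names are assumptions.
    """
    c = content_features(text)
    return [
        float(c["has_url"]),
        float(c["n_hashtags"]),
        float(c["n_mentions"]),
        float(author["followers"]),
        float(author["followees"]),
        float(author["favorites"]),
        float(author["account_age_days"]),
        float(author["past_tweets"]),
    ]

# Contextual values from slide 41; the reference date is an assumption.
example_author = {
    "followees": 395,
    "followers": 1400,
    "favorites": 1657,
    "account_age_days": (date(2010, 10, 22) - date(2008, 6, 17)).days,
    "past_tweets": 21000,
}
vec = feature_vector("Reading @some_user on #wikipedia http://example.com/#top",
                     example_author)
```

A regression or classification model fitted on such vectors would then indicate which features are associated with retweeting.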
  42. Plot labels: Content Factor; Contextual Factor.
  43. Information Streams => Information Overload. ASC Social Recommender Engine.
  44. Recommendation Algorithm: Combining Sources and Models. Diagram labels: My Friends’ URLs; Popular URLs; Recommendations; My Friends’ Network and Tweeting Pattern; Social Ranking Model; My Tweets; My Friends’ Tweets; Topic Relevance Model.
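Slide 44 names two models, topic relevance and social ranking, whose outputs are combined into recommendations. The slide does not give the scoring formulas, so the sketch below is only a guess at the general shape: topic relevance is approximated by cosine similarity between a term profile built from the user's tweets and the terms attached to each candidate URL, social ranking by a damped count of friends who posted the URL, and the two are multiplied. All names, the data layout, and the combination rule are assumptions.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two term-count mappings."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def rank_urls(my_tweets, candidates):
    """Rank candidate URLs by topic relevance times social score.

    candidates: {url: {"terms": [...], "friend_votes": int}} (hypothetical layout).
    """
    profile = Counter(w for tweet in my_tweets for w in tweet.lower().split())
    scored = []
    for url, info in candidates.items():
        topic = cosine(profile, Counter(info["terms"]))
        social = math.log1p(info["friend_votes"])   # damp very popular URLs
        scored.append((topic * social, url))
    return [url for _, url in sorted(scored, reverse=True)]

# Made-up example data with placeholder URL names.
my_tweets = ["wikipedia conflict research", "social analytics on wikipedia"]
candidates = {
    "url_a": {"terms": ["wikipedia", "analytics"], "friend_votes": 3},
    "url_b": {"terms": ["celebrity", "gossip"], "friend_votes": 10},
    "url_c": {"terms": ["wikipedia", "analytics"], "friend_votes": 1},
}
ranking = rank_urls(my_tweets, candidates)
```

Multiplying the two scores means a URL needs both topical fit and social support to rank highly; a weighted sum would be an equally reasonable alternative.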
  45. Hadoop compute cluster: 50 nodes, depending on project requirement; ~40TB storage capacity; experience with HBase, Pig, and interaction with Lucene and MySQL. Large-scale crawling and analytics experience with Wikipedia (all edits up to 2009), the Delicious data set (200M bookmarks), and Twitter (70M+ tweets). Experience with large-scale social analytics: Example 1, visual analytics in Wikipedia; Example 2, search engines for social bookmarks; Example 3, recommenders for Twitter news.
  47. Research Vision: Understand how social computing systems can enhance the ability of a group of people to remember, think, and reason. Understand and support Collective Intelligence by modeling social group behaviors and testing prototype tools in Living Labs. http://asc-