Mining and Analyzing Social Media: HICSS 45 Tutorial – Part 2

HICSS 45 Tutorial on Mining and Analyzing Social Media Part 2. David King. Jan 4, 2012

Published in: Technology, Business

  1. Mining and Analyzing Social Media, HICSS 45 Tutorial – Part 2. Dave King, January 4, 2012
  2. Agenda: how the slides are organized
     • Part 1
       – Introduction: Bio, Resources, Social Media
       – Data Mining: Processes and Example
       – Text Mining: General Processes and Example
       – Predicting the Future: The Portmanteaus
     • Part 2
       – Sentiment Analysis
       – Social Network Analysis: Introduction
     Copyright 2011 JDA Software Group, Inc. - CONFIDENTIAL
  3. Sentiment Analysis: What are your customers thinking? Every hour of every day they share their opinions, issues, thoughts and sentiments about brands, products, services and companies (online).
  4. Sentiment Analysis: Some Survey Data. Cone Communications: http://www.coneinc.com/2011coneonlineinfluencetrendtracker
  5. Sentiment Analysis: Some Payoffs. Marketing: Message. Service: Response. Products: Issues and Focus. A form of Automated Text Categorization (ATC).
  6. Sentiment Analysis: Some Examples
  7. Sentiment Analysis: Some Examples. GM runs ad on 10/17/11. Cycling community responds: @BicyclingMag, @BikePortland, @clevercycle, @cyclingreporter
  8. Sentiment Analysis: Some Examples. Key areas of concern:
     • Break in online link to Mint.com
     • Actionable service breaks
     • Outrage over “$50 limit on debit card transactions”
  9. Sentiment Analysis: Defined. Text mining to classify subjective opinions in text into categories like "positive" or "negative", extracting various forms of attitudinal information: sentiment, opinion, mood, and emotion. Also called Voice of the Customer (VOC) or Opinion Mining.
  10. Sentiment Analysis: Sample Software Vendors. Alterian, Attensity, Brandwatch, Buzzdetector, Clarabridge, Crimson Hexagon, Digimind, DigitalPebble, EffectCheck, Etuma, Evolve24, General Sentiment, IBM Cognos, IBM SPSS, InfiniGraph, Kontagent, Lexalytics, Lithium, Lymbix, Medallia, Meltwater, Meshlabs, Netbase Solutions, OpenAmplify, Overtone, PostRank (Google), Quantivo, Radian6 (SalesForce.com), SAS, Sentiment Metrics, SentMetrix, Traackr, Visible Technologies, Wise Window
  11. Sentiment Analysis: Types
      • Sentiment classification – document level, classified as positive or negative
      • Feature-based opinion – sentence level, determines which aspects of an object people like or dislike
      • Comparative sentence and relationship mining – sentence-level comparisons of one object against another (to determine which is better than the other)
  12. Sentiment Analysis: Types
      • From one type to the next (classification, features, comparisons), it becomes more complex to identify and extract the information.
      • Once extracted, standard text mining techniques can be used to classify and compare the opinions.
      • Simple techniques (like naïve Bayes) often produce strong results (e.g. 80+% accuracy).
  13. Sentiment Analysis: Assumption. An opinion lexicon: polar, opinion-bearing, and sentiment words and phrases that express state.
  14. Sentiment Analysis: How do you know if it is “+” or “-”? Two sample movie reviews:
      First review: plot : two teen couples go to a church party , drink and then drive . they get into an accident . one of the guys dies , but his girlfriend continues to see him in her life , and has nightmares . whats the deal ? watch the movie and " sorta " find out . . . critique : a mind-xxx movie for the teen generation that touches on a very cool idea , but presents it in a very bad package . which is what makes this review an even harder one to write , since i generally applaud films which attempt to break the mold , mess with your head and such ( lost highway & memento ) , but there are good and bad ways of making all types of films , and these folks just didnt snag this one correctly . they seem to have taken this pretty neat concept , but executed it terribly . so what are the problems with the movie ? well , its main problem is that its simply too jumbled .
      Second review: having not seen , " who framed roger rabbit " in over 10 years , and not remembering much besides that i liked it then , i decided to rent it recently . watching it i was struck by just how brilliant a film it is . aside from the fact that its a milestone in animation in movies ( its the first film to combine real actors and cartoon characters , have them interact , and make it convincingly real ) and a great entertainment its also quite an effective comedy/mystery . while the plot may be somewhat familiar the characters are original , especially baby herman , and watching them together is a lot of fun . … `who framed roger rabbit is a rare film . one that not only presented a great challenge to the filmmakers but one that can be enjoyed by the whole family ( although some very young viewers may be a little scared by judge doom ) . do yourself a favor and rent it , `p-p-p-p-please . "
  15. Sentiment Analysis: Other interests in Sentiment
  16. Sentiment Analysis: Other interests in Sentiment
  17. Sentiment Analysis: Doing Simple Sentiment Analysis. General Problem: Collection of Text Documents → Automated Process for Classifying (???) → Small Set of Predetermined Categories (1, 2, 3, …, n)
  18. Sentiment Analysis: Doing Simple Sentiment Analysis. General Answer: Collection of Text Documents → Automated Process for Classifying → Small Set of Predetermined Categories (1, 2, 3, …, n). The automated process is an inductive, supervised machine learning classification process and algorithm.
  19. Sentiment Analysis: Doing Simple Sentiment Analysis. Training Process: Real-World Text Data (documents with known classification) → Document Consolidation (establish the corpus) → Corpus Refinement (Token, Stem, Stop…) → Feature Selection & Weighting (Term-Doc-Matrix*) → Classification Algorithm (Train, Test, Validate) → Predetermined Categories (1, 2, 3, …, n)
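The corpus refinement and term-document matrix steps named on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the tutorial's own code: the stop-word list and the `crude_stem` suffix stripper are hypothetical stand-ins for a real stop list and a real stemmer (such as Porter's), and the four sample messages are borrowed from the training set shown later in the deck.

```python
import re
from collections import Counter

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"i", "is", "to", "be", "the", "a", "now", "going"}

def crude_stem(token):
    """Toy suffix stripper standing in for a real stemmer."""
    for suffix in ("ing", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def refine(text):
    """Tokenize on alphabetic runs, drop stop words, then stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]

def term_document_matrix(docs):
    """Return (vocabulary, rows) where rows[i][j] counts term j in doc i."""
    refined = [refine(d) for d in docs]
    vocab = sorted({t for doc in refined for t in doc})
    rows = []
    for doc in refined:
        counts = Counter(doc)
        rows.append([counts[t] for t in vocab])
    return vocab, rows

docs = ["Love is great", "I feel great now", "I feel sick today", "Great, today sucks"]
vocab, matrix = term_document_matrix(docs)
for doc, row in zip(docs, matrix):
    print(row, doc)
```

Each row of `matrix` is one document and each column one term in `vocab`; a classifier would be trained on these rows together with the known category of each document.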
  20. Sentiment Analysis: Doing Simple Sentiment Analysis. Classification Algorithms:
      • Naïve Bayes
      • Decision Trees
      • Nearest Neighbor (k-NN)
      • Support Vector Machine
      • Neural Nets (e.g. SOM)
  21. Sentiment Analysis: Doing Simple Sentiment Analysis. Twitter Statistics:
      • ~200M registered users
      • ~50M users log in every day
      • Over 400K new users per day
      • 400 million unique visitors per month
      • 55% use their phone to tweet
      • Average 200 million tweets a day
      • 600 million search queries per day
      • 75% of traffic from 3rd-party apps
      • 60% of tweets from 3rd-party apps
  22. Sentiment Analysis: Doing Simple Sentiment Analysis. Problem Features:
      • Each tweet <= 140 characters (avg. 10-15 words/message)
      • Heavy presence of non-alpha symbols, abbreviations, misspellings and slang
      • Tweets often include retweets (original tweet repeated)
      • In spite of this, tweets have proven to be an interesting text mining source (warts and all)
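One way to cope with the features listed above is to normalize each tweet before tokenizing it. The sketch below is an assumption about what such a cleaning step could look like, not the procedure used in the tutorial: it drops the `RT` marker, @-mentions, links, and every non-alphabetic character, then splits on whitespace.

```python
import re

def clean_tweet(text):
    """Strip retweet markers, mentions, links and non-alpha symbols."""
    text = re.sub(r"\bRT\b", " ", text)            # retweet marker (case-sensitive)
    text = re.sub(r"https?://\S+", " ", text)      # links
    text = re.sub(r"@\w+:?", " ", text)            # user mentions
    text = re.sub(r"[^a-z\s]", " ", text.lower())  # anything that is not a letter
    return text.split()

print(clean_tweet("Lol i feel ya!!RT @Sweet_Sun_Shine: everythings up :)"))
```

Emoticons are discarded here along with other symbols; a pipeline that uses emoticons as class labels would need to read them off before this step.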
  23. Sentiment Analysis: Doing Simple Sentiment Analysis. References:
      • Zambonini, D. “Self-Improving Bayesian Sentiment Analysis for Twitter.” August 27, 2010. danzambonini.com/self-improving-bayesian-sentiment-analysis-for-twitter
      • Kalafatis, T. “The Sentiment on US Economy from Twitter.” October 2009. lifeanalytics.blogspot.com/2009/10/sentiment-on-us-economy-from-twitter.html
      • Pak, A. and P. Paroubek. “Twitter as a Corpus for Sentiment Analysis and Opinion Mining.” In Proceedings of the Seventh International Conference on Language Resources and Evaluation. May 2010. lrec-conf.org/proceedings/lrec2010/slides/385.pdf
      • Sood, S. and L. Vasserman. “ESSE: Exploring Mood on the Web.” August 2009. lcs.pomona.edu/people/files/SoodCV.pdf
      • Go, A. et al. “Twitter Sentiment Classification using Distant Supervision.” 2009. stanford.edu/~alecmgo/papers/TwitterDistantSupervision09.pdf
      • Agarwal, A. et al. “Sentiment Analysis of Twitter Data.” 2011. www1.ccls.columbia.edu/~beck/pubs/lsm2011_full.pdf
  24. Sentiment Analysis: Doing Simple Sentiment Analysis
  25. Sentiment Analysis: Doing Simple Sentiment Analysis
      • Twitter used to get a total of 3 billion requests a day via its API
      • API calls for public tweets:
        – http://search.twitter.com/search.json?q=%3A)+feel+feeling&rpp=100&page=1
        – http://api.twitter.com/1/trends/current.json?exclude=hashtags
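The search URL on this slide is an ordinary query string, so it can be assembled rather than typed by hand. The sketch below only builds the string and makes no network request; the endpoint is the 2011-era one shown on the slide, which Twitter has since retired, so treat it purely as an illustration of the encoding. The helper name `search_url` is hypothetical.

```python
from urllib.parse import urlencode

# 2011-era endpoint copied from the slide; it is no longer served.
SEARCH_ENDPOINT = "http://search.twitter.com/search.json"

def search_url(query, per_page=100, page=1):
    """Build the search URL for a free-text query such as ':) feel feeling'."""
    params = {"q": query, "rpp": per_page, "page": page}
    # safe=")" leaves the closing parenthesis unescaped, as on the slide.
    return SEARCH_ENDPOINT + "?" + urlencode(params, safe=")")

print(search_url(":) feel feeling"))
```

The colon is percent-encoded as `%3A` and spaces become `+`, which reproduces the first URL on the slide.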
  26. Sentiment Analysis: Doing Simple Sentiment Analysis. Sample tweet from API call:
          {u'iso_language_code': u'en',
           u'to_user_name': None,
           u'to_user_id_str': None,
           u'from_user_id_str': u'59862385',
           u'text': u"Lol i feel ya!!RT @Sweet_Sun_Shine: @joshaustin13 everythings up, its the weekend baby!!!! :) and I plan on enjoying; how are you feeling?",
           u'from_user_name': u'B.Resilient\ue50c\ue50c\ue50c',
           u'profile_image_url': u'http://a3.twimg.com/profile_images/1650184586/joshaustin13_normal.jpg',
           u'id': 145274459127955456L,
           u'to_user': None,
           u'source': u'&lt;a href=&quot;http://www.echofon.com/&quot; rel=&quot;nofollow&quot;&gt;Echofon&lt;/a&gt;',
           u'id_str': u'145274459127955456',
           u'from_user': u'joshaustin13',
           u'from_user_id': 59862385,
           u'to_user_id': None,
           u'geo': None,
           u'created_at': u'Fri, 09 Dec 2011 22:51:44 +0000',
           u'metadata': {u'result_type': u'recent'}}
  27. Sentiment Analysis: Doing Simple Sentiment Analysis. “Twitter Sentiment Classification using Distant Supervision” (2009)
      • Utilizes presence of emoticons “:)” and “:(” to serve as surrogates for classification as positive and negative sentiment statements
      • To construct the term-document matrix, relies on a list of positive and negative key words from Twittratr, counting the number of key words that appear in each tweet
      • 180K tweets collected for training purposes between April and June 2009
      • 80%+ accuracy in classification
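The emoticon-as-label idea can be sketched as a small labeling function. This is a hedged illustration rather than the paper's code: it uses only the two emoticons named on the slide, skips tweets with no emoticon or with both, and removes the emoticon from the text it returns (a design choice assumed here so that a classifier trained on the output has to rely on the words).

```python
POSITIVE = (":)",)
NEGATIVE = (":(",)

def label_by_emoticon(tweet):
    """Return (text_without_emoticons, label), or None if the tweet is unusable."""
    has_pos = any(e in tweet for e in POSITIVE)
    has_neg = any(e in tweet for e in NEGATIVE)
    if has_pos == has_neg:
        return None  # no emoticon at all, or mixed signals
    label = "positive" if has_pos else "negative"
    text = tweet
    for emoticon in POSITIVE + NEGATIVE:
        text = text.replace(emoticon, " ")
    return " ".join(text.split()), label

print(label_by_emoticon("I feel great now :)"))
print(label_by_emoticon("today sucks :("))
```

Tweets labeled this way form a noisy but free training set, which is the sense of "distant supervision" in the paper's title.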
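  28. Sentiment Analysis: Doing Simple Sentiment Analysis. Token counts by refinement stage (HF = Happy Face, SF = Sad Face):
      | Stage | Type | List | Set |
      |---|---|---|---|
      | Words | HF | 8354 | 2169 |
      | Words | SF | 7702 | 1996 |
      | Words | Total | 16056 | 3469 |
      | Alpha | HF | 5917 | 1094 |
      | Alpha | SF | 5433 | 1055 |
      | Alpha | Total | 11350 | 1169 |
      | Stop | HF | 3425 | 992 |
      | Stop | SF | 3325 | 953 |
      | Stop | Total | 6750 | 1563 |
      | Stem | HF | 3425 | 895 |
      | Stem | SF | 3325 | 850 |
      | Stem | Total | 6750 | 1375 |
      | Stem w/o | HF | 2618 | 894 |
      | Stem w/o | SF | 2516 | 849 |
      | Stem w/o | Total | 5134 | 1374 |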
  29. Sentiment Analysis: Doing Simple Sentiment Analysis. Happy Face; Sad Face
  30. Sentiment Analysis: Doing Simple Sentiment Analysis. Bayes' theorem (pictured: Thomas Bayes): P(H | D) = P(D | H) * P(H) / P(D), where H is the hypothesis and D is the data.
      • P(H) is the prior probability of H: the probability that H is correct before the data D are seen.
      • P(D | H) is the conditional probability of seeing the data D given that the hypothesis H is true. This conditional probability is called the likelihood.
      • P(D) is the marginal probability of D.
      • P(H | D) is the posterior probability: the probability that the hypothesis is true, given the data and the previous state of belief about the hypothesis.
  31. Sentiment Analysis: Doing Simple Sentiment Analysis. Compare P(Positive | Tweet) to P(Negative | Tweet).
      Training set:
      | Message | Category |
      |---|---|
      | Love is great | Positive |
      | I feel great now | Positive |
      | I feel sick today | Negative |
      | Great, today sucks | Negative |
      | Today is going to be good | Positive |
      For a tweet whose only word is W1 = "great" (the denominator P(Tweet) is the same for both classes, so it is dropped when comparing them):
      P(Pos | Tweet) = P(Pos) * P(W1 | Pos) / P(Tweet)
      P(Pos | Tweet) ∝ P(Pos) * P(great | Pos) = (3/5) * (2/3) = .4
      P(Neg | Tweet) = P(Neg) * P(W1 | Neg) / P(Tweet)
      P(Neg | Tweet) ∝ P(Neg) * P(great | Neg) = (2/5) * (1/2) = .2
  32. Spam Detection: Naïve Bayesian Classifier. Compare P(Positive | Tweet) to P(Negative | Tweet).
      Training set:
      | Message | Category |
      |---|---|
      | Love is great | Positive |
      | I feel great now | Positive |
      | I feel sick today | Negative |
      | Great, today sucks | Negative |
      | Today is going to be good | Positive |
      For the tweet "today sucks" (again dropping the common denominator P(Tweet)):
      P(Neg | Tweet) ∝ P(Neg) * P(W1 | Neg) * P(W2 | Neg) * ...
      P(Neg | Tweet) ∝ P(Neg) * P(today | Neg) * P(sucks | Neg) = .4 * 1 * .5 = .2
      P(Pos | Tweet) ∝ P(Pos) * P(W1 | Pos) * P(W2 | Pos) * ...
      P(Pos | Tweet) ∝ P(Pos) * P(today | Pos) * P(sucks | Pos) = .6 * .33 * 0 = 0
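The arithmetic on the two worked slides above can be reproduced directly. The sketch below is an illustration written for this transcript, not the presenter's code; it estimates P(word | category) as the fraction of that category's training messages containing the word, which is the reading that matches the slide's numbers, and it uses exact fractions so the results can be compared digit for digit.

```python
from fractions import Fraction

# Training set from the slides, lower-cased with punctuation removed.
TRAINING = [
    ("love is great", "Positive"),
    ("i feel great now", "Positive"),
    ("i feel sick today", "Negative"),
    ("great today sucks", "Negative"),
    ("today is going to be good", "Positive"),
]

def score(tweet, category):
    """Unnormalized P(category | tweet) = P(category) * product of P(word | category)."""
    docs = [set(text.split()) for text, cat in TRAINING if cat == category]
    result = Fraction(len(docs), len(TRAINING))  # prior P(category)
    for word in tweet.lower().split():
        containing = sum(1 for doc in docs if word in doc)
        result *= Fraction(containing, len(docs))  # P(word | category)
    return result

for tweet in ("great", "today sucks"):
    print(tweet, {cat: float(score(tweet, cat)) for cat in ("Positive", "Negative")})
```

The zero for "today sucks" under Positive comes from a single unseen word; practical classifiers usually smooth the counts to avoid this, which the slides do not cover.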
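  33. Spam Detection: Naïve Bayesian Classifier. The training set feeds the naïve Bayesian classifier, which produces the estimated probabilities P(H) and P(Wi | H).
      Training set:
      | Tweet | Token1 | Token2 | Token3 | Token4 | … | Class |
      |---|---|---|---|---|---|---|
      | Tweet1 | 1 | 0 | 0 | 1 | … | Happy |
      | Tweet2 | 1 | 0 | 1 | 0 | … | Sad |
      | Tweet3 | 0 | 0 | 0 | 1 | … | Happy |
      | Tweet4 | 0 | 0 | 1 | 0 | … | Sad |
      | … | … | … | … | … | … | … |
      New tweet:
      | Tweet | Token1 | Token2 | Token3 | Token4 | … |
      |---|---|---|---|---|---|
      | Tweet | 0 | 0 | 1 | 0 | … |
      Decision rule: is ln[ P(H | Tweet) / P(S | Tweet) ] > 0, where ln[ P(H | Tweet) / P(S | Tweet) ] = ln[ P(H) / P(S) ] + Σ ln[ P(Wi | H) / P(Wi | S) ]. Proof left to reader.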
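The log-odds decision rule can be tried on the four training rows and the new tweet shown on this slide. The sketch below makes two assumptions that are not on the slide: the sum runs over the tokens present in the new tweet, and add-one (Laplace) smoothing is applied to each estimate so that a token never seen in one class does not produce ln(0).

```python
import math

# Token presence rows and classes copied from the slide.
TRAIN = [
    ([1, 0, 0, 1], "Happy"),
    ([1, 0, 1, 0], "Sad"),
    ([0, 0, 0, 1], "Happy"),
    ([0, 0, 1, 0], "Sad"),
]

def log_odds(tokens):
    """ln P(H|tweet)/P(S|tweet) = ln P(H)/P(S) + sum of ln P(Wi|H)/P(Wi|S)."""
    happy = [row for row, cls in TRAIN if cls == "Happy"]
    sad = [row for row, cls in TRAIN if cls == "Sad"]
    total = math.log(len(happy) / len(sad))  # log prior ratio
    for i, present in enumerate(tokens):
        if not present:
            continue
        # Add-one smoothing (an assumption, not shown on the slide).
        p_happy = (sum(row[i] for row in happy) + 1) / (len(happy) + 2)
        p_sad = (sum(row[i] for row in sad) + 1) / (len(sad) + 2)
        total += math.log(p_happy / p_sad)
    return total

def classify(tokens):
    return "Happy" if log_odds(tokens) > 0 else "Sad"

new_tweet = [0, 0, 1, 0]
print(classify(new_tweet), log_odds(new_tweet))
```

Working in logs turns the product of many small probabilities into a sum, which avoids numerical underflow on longer messages.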
  34. What is this number? 4.74
  35. Does this help? 6. Frigyes Karinthy, Stanley Milgram, John Guare, Duncan Watts
  36. Six Degrees of Separation. “A fascinating game grew out of this discussion. One of us suggested performing the following experiment to prove that the population of the Earth is closer together now than they have ever been before. We should select any person from the 1.5 billion inhabitants of the Earth – anyone, anywhere at all. He bet us that, using no more than five individuals, one of whom is a personal acquaintance, he could contact the selected individual using nothing except the network of personal acquaintances.” Frigyes Karinthy, Chains, 1929. (Diagram: a chain from A through intermediaries 1 to 5.)
  37. Sample Metric From Social Network Analysis: 4.74, the average distance between Facebook members
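  38. Social Network Analysis: Another Type of Analysis. Which blogs are similar?
      Blog-by-term matrix, input to Cluster Analysis (e.g. K-Means):
      | | Term1 | Term2 | Term3 | … | TermM |
      |---|---|---|---|---|---|
      | Blog1 | 1 | 0 | 0 | … | 1 |
      | Blog2 | 0 | 0 | 1 | … | 0 |
      | Blog3 | 0 | 1 | 0 | … | 1 |
      | … | … | … | … | … | … |
      | BlogN | 0 | 0 | 0 | … | 1 |
      Blog-by-blog matrix, input to Graph/Network Analysis:
      | | Blog1 | Blog2 | Blog3 | … | BlogN |
      |---|---|---|---|---|---|
      | Blog1 | - | 1 | 0 | … | 1 |
      | Blog2 | 0 | - | 1 | … | 0 |
      | Blog3 | 1 | 1 | - | … | 0 |
      | … | … | … | … | - | … |
      | BlogN | 1 | 0 | 1 | … | - |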
  39. Social Network Analysis: Another Type of Analysis. Which blogs are similar? Cluster Analysis (e.g. K-Means) on the blog-by-word matrix:
      | | Word1 | Word2 | Word3 | … | WordM |
      |---|---|---|---|---|---|
      | Blog1 | 1 | 0 | 0 | … | 1 |
      | Blog2 | 0 | 0 | 1 | … | 0 |
      | Blog3 | 0 | 1 | 0 | … | 1 |
      | … | … | … | … | … | … |
      | BlogN | 0 | 0 | 0 | … | 1 |
      For a detailed description: http://www.slideshare.net/daveking63/text-mining-and-analytics-v6-p2
  40. Social Network Analysis: Definitions
      • Network – Collection of things and their linkages to one another.
      • Social Network – Collection of humans, roles, groups, and/or institutions and their social relationships with one another.
      • Social Network Analysis (SNA) – Application of Graph Theory or Network Science to the study of social relationships and connections.
  41. Social Network Analysis: Early Efforts. Jacob Moreno: Sociometry and the Sociogram
  42. Social Network Analysis: Definitions. “Ten years ago, the field of Social Network Analysis was a scientific backwater. We were the misfits, rejected from both mainstream sociology and mainstream computer science.”
  43. Social Network Analysis: Exploding Commercial Interest
  44. Social Network Analysis: So What Happened?
      • Small Data: manually collected; flat files / in-memory computation
      • Medium Data: data snapshots from APIs; SQL databases
      • Big Data: real-time social media data; Big Data approaches
  45. Social Network Analysis: When things were simpler … (2005). Example networks: N=26, N=80
  46. Social Network Analysis: and then … (2011). Growth in social media; access to SM network data; availability of open source tools. Example networks: N~1400, N~3.5K, N~90K
  47. Social Network Analysis: and now … Example networks: N=80K, N=20M, N=721M
  48. Social Network Analysis: Key Elements
      • Graph or Network – The set [V, E, f] of vertices/nodes, edges/links and the relationship/function connecting them.
      • Vertices or Nodes – The “things”
      • Edges or Links – The “relationships”
      (Diagram: a graph with vertices A, B, C and D, with one vertex (node) and one edge (link) labeled.)
  49. Social Network Analysis: Types of Edges or Links
      • Undirected, unweighted: Facebook friends
      • Directed, unweighted: Twitter followers
      • Undirected, weighted: Facebook friends (with a weight on each link)
      • Directed, weighted: email network (with a weight on each link)
      (Diagrams: each type drawn on nodes A, B and C; the weighted examples carry numeric edge weights.)
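The four edge types above map naturally onto one small data structure. The sketch below is one possible representation (a dictionary of neighbor-to-weight dictionaries), offered as an illustration; the class name `Graph`, its methods, and the example weights are all hypothetical and not taken from the slide.

```python
from collections import defaultdict

class Graph:
    """Adjacency map: adj[u][v] is the weight of the link from u to v."""

    def __init__(self, directed=False):
        self.directed = directed
        self.adj = defaultdict(dict)

    def add_edge(self, u, v, weight=1):
        self.adj[u][v] = weight
        if self.directed:
            self.adj.setdefault(v, {})  # make sure v exists as a node
        else:
            self.adj[v][u] = weight     # undirected: store both directions

    def out_degree(self, u):
        return len(self.adj[u])

    def in_degree(self, u):
        return sum(1 for nbrs in self.adj.values() if u in nbrs)

# Undirected, unweighted: friendship is mutual.
friends = Graph()
friends.add_edge("A", "B")
friends.add_edge("B", "C")

# Directed, unweighted: A and C follow B.
followers = Graph(directed=True)
followers.add_edge("A", "B")
followers.add_edge("C", "B")

# Directed, weighted: number of emails sent (illustrative weights).
email = Graph(directed=True)
email.add_edge("A", "B", weight=5)
email.add_edge("B", "A", weight=2)

print(dict(friends.adj), followers.in_degree("B"), email.adj["A"]["B"])
```

An unweighted graph is simply the special case in which every weight is 1, so the same structure serves all four cases.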
  50. Social Network Analysis: Types of Networks. Unimodal (nodes P1, P2, P3); Bimodal (nodes P1, P2 and E1); Multiplex (nodes P1, P2, P3 linked by Follows, Replies To and Mentions)
  51. Social Network Analysis: Types of Network Analysis. “Whole” network (nodes P1 to P6) versus “Ego-Centric” network (an Ego and its Alters)
  52. Social Network Analysis: Node Metrics (Centrality)
      • Degree
        – Definition: Number of edges or links. In-degree: links in. Out-degree: links out.
        – Interpretation: How connected is a node? How many people can this person reach directly?
        – Reasoning: Higher probability of receiving and transmitting information flows in the network. Nodes considered to have influence over a larger number of nodes and/or are capable of communicating quickly with the nodes in their neighborhood.
      • Betweenness
        – Definition: Number of times a node or vertex lies on the shortest path between 2 nodes, divided by the number of all the shortest paths.
        – Interpretation: How important is a node in terms of connecting other nodes? How likely is this person to be the most direct route between two people in the network?
        – Reasoning: Degree to which a node controls the flow of information in the network. Those with high betweenness function as brokers. Useful where a network is vulnerable.
      • Closeness
        – Definition: 1 over the average distance between a node and every other node in the network.
        – Interpretation: How easily can a node reach other nodes? How fast can this person reach everyone in the network?
        – Reasoning: Measure of reach. Importance based on how close a node is located with respect to every other node in the network. Nodes able to reach, or be reached by, most other nodes in the network through geodesic paths.
      • Eigenvector
        – Definition: Proportional to the sum of the eigenvector centralities of all the nodes directly connected to it.
        – Interpretation: How important, central, or influential are a node’s neighbors? How well is this person connected to other well-connected people?
        – Reasoning: Evaluates a player's popularity. Identifies centers of large cliques. A node with more connections to higher-scoring nodes is more important.
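Two of the definitions above, degree and closeness, are short enough to compute by hand with a breadth-first search. The sketch below is a minimal illustration on a made-up five-node undirected graph (the edge list is hypothetical); it assumes the graph is connected, and it leaves out betweenness and eigenvector centrality, which need more machinery.

```python
from collections import deque

def build(edges):
    """Undirected adjacency sets from a list of (u, v) pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def distances(adj, source):
    """Shortest-path length (in links) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def degree(adj, node):
    return len(adj[node])

def normed_degree(adj, node):
    """Degree divided by the maximum possible degree, n - 1."""
    return len(adj[node]) / (len(adj) - 1)

def closeness(adj, node):
    """1 over the average distance to every other node (connected graph assumed)."""
    others = [d for n, d in distances(adj, node).items() if n != node]
    return len(others) / sum(others)

# Hypothetical example graph.
adj = build([("A", "B"), ("B", "C"), ("C", "D"), ("C", "E")])
for node in sorted(adj):
    print(node, degree(adj, node), normed_degree(adj, node), round(closeness(adj, node), 2))
```

Node C has the highest degree and the highest closeness here; in larger networks the two rankings can differ, which is why several centrality measures are reported side by side.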
  53. Social Network Analysis: Centrality – Who is most important? (Diagram: a 19-node network, A to S, with the top node on each measure labeled: Deg = J, Betw = H, Close = I, Eigen = D.)
      | Node | Degree | Normed Degree | Betweenness | Closeness | Eigen Vector |
      |---|---|---|---|---|---|
      | A | 3 | 0.17 | 0.00 | 0.29 | 0.29 |
      | B | 4 | 0.22 | 0.01 | 0.30 | 0.36 |
      | C | 2 | 0.11 | 0.03 | 0.35 | 0.18 |
      | D | 6 | 0.33 | 0.04 | 0.31 | 0.46 |
      | E | 3 | 0.17 | 0.00 | 0.29 | 0.30 |
      | F | 4 | 0.22 | 0.11 | 0.36 | 0.35 |
      | G | 5 | 0.28 | 0.19 | 0.37 | 0.43 |
      | H | 5 | 0.28 | 0.58 | 0.45 | 0.28 |
      | I | 4 | 0.22 | 0.53 | 0.46 | 0.13 |
      | J | 7 | 0.39 | 0.43 | 0.43 | 0.12 |
      | K | 3 | 0.17 | 0.00 | 0.32 | 0.06 |
      | L | 3 | 0.17 | 0.01 | 0.33 | 0.05 |
      | M | 3 | 0.17 | 0.21 | 0.33 | 0.04 |
      | N | 3 | 0.17 | 0.03 | 0.38 | 0.07 |
      | O | 2 | 0.11 | 0.00 | 0.31 | 0.05 |
      | P | 3 | 0.17 | 0.03 | 0.38 | 0.08 |
      | Q | 2 | 0.11 | 0.11 | 0.26 | 0.01 |
      | R | 1 | 0.06 | 0.00 | 0.32 | 0.07 |
      | S | 1 | 0.06 | 0.00 | 0.21 | 0.00 |
  54. Social Network Analysis: Cohesion – How well connected?
      • Density
        – Definition: Ratio of the number of edges in the network over the total number of possible edges between all pairs of nodes.
        – Interpretation: How well connected is the overall network?
        – Reasoning: A perfectly connected network is called a "clique" and has a density of 1.
      • Average Degree
        – Definition: Average number of links each node or vertex has.
        – Interpretation: How well connected are the nodes on average?
        – Reasoning: The higher the average, the better connected the members are.
      • Average Path Length (Distance)
        – Definition: Average number of edges or links between any two nodes (along the shortest path).
        – Interpretation: On average, how far apart are any two nodes?
        – Reasoning: This is synonymous with the "degrees of separation" in a network.
      • Diameter
        – Definition: Longest shortest path between any two nodes.
        – Interpretation: At most, how long will it take to reach any node in the network?
        – Reasoning: Measure of the reach of the network. Sparse networks usually have greater diameters.
      • Clustering
        – Definition: A node's clustering coefficient is the density of its 1.5-degree egocentric network (ratio of connections among ego's alters). For the entire network it is the average of all the coefficients for the individual nodes.
        – Interpretation: What proportion of ego's alters are connected? More technically, how many nodes form triangular subgraphs with their adjacent nodes?
        – Reasoning: Measures certain aspects of "cliquishness": the proportion of your friends that are also friends with each other. Another way to measure it is to determine (in an undirected graph) the ratio of the number of times that two links emanating from the same node are also linked.
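The cohesion measures defined above can all be computed from an adjacency map and a breadth-first search. The sketch below is a minimal illustration on a hypothetical four-node graph (a triangle A-B-C with D attached to C); it assumes an undirected, connected graph, and it returns `None` for the clustering coefficient of a node with fewer than two neighbors, leaving such nodes out of the network average (one common convention, assumed here).

```python
from collections import deque
from itertools import combinations

def build(edges):
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def bfs(adj, source):
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def density(adj):
    n = len(adj)
    m = sum(len(nbrs) for nbrs in adj.values()) / 2  # each edge is stored twice
    return m / (n * (n - 1) / 2)

def average_degree(adj):
    return sum(len(nbrs) for nbrs in adj.values()) / len(adj)

def path_lengths(adj):
    """Shortest-path lengths for every ordered pair of distinct nodes."""
    return [d for s in adj for t, d in bfs(adj, s).items() if t != s]

def average_path_length(adj):
    lengths = path_lengths(adj)
    return sum(lengths) / len(lengths)

def diameter(adj):
    return max(path_lengths(adj))

def clustering(adj, node):
    """Share of pairs of the node's neighbors that are themselves linked."""
    nbrs = adj[node]
    if len(nbrs) < 2:
        return None  # undefined for nodes with fewer than two neighbors
    pairs = list(combinations(nbrs, 2))
    return sum(1 for u, v in pairs if v in adj[u]) / len(pairs)

def average_clustering(adj):
    values = [c for c in (clustering(adj, n) for n in adj) if c is not None]
    return sum(values) / len(values)

adj = build([("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")])
print(density(adj), average_degree(adj), average_path_length(adj),
      diameter(adj), average_clustering(adj))
```

Running one search per node is fine for small examples like this one; for networks with millions of nodes, sampling or approximation is typically used instead.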
  55. Social Network Analysis: Network Metrics (Centralization)
      • Degree Centralization: Variation in the degrees of vertices divided by the maximum degree variation that is possible in a network of the same size.
      • Betweenness Centralization: Variation in the betweenness centrality of vertices divided by the maximum variation in betweenness centrality scores possible in a network of the same size.
      • Closeness Centralization: Variation in the closeness centrality of vertices divided by the maximum variation in closeness centrality scores possible in a network of the same size.
      • Eigenvector Centralization: Variation in the eigenvector centrality of vertices divided by the maximum variation in eigenvector centrality scores possible in a network of the same size.
      Notes: (1) Variation is the summed absolute differences between the centrality scores of the vertices and the maximum centrality score among them. (2) A network is more centralized if the vertices vary more with respect to their centrality.
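For degree, the definition above reduces to a one-line formula. The sketch below assumes an undirected simple graph, for which the maximum possible variation is (n - 1)(n - 2), reached by a star; that denominator is the standard Freeman normalization and is an addition here, since the slide does not spell it out. The last example reuses the Degree column from the 19-node table on slide 53.

```python
def degree_centralization(degrees):
    """Summed differences from the maximum degree, over the star-graph maximum."""
    n = len(degrees)
    top = max(degrees)
    variation = sum(top - d for d in degrees)
    return variation / ((n - 1) * (n - 2))

star = [4, 1, 1, 1, 1]   # one hub linked to four leaves
ring = [2, 2, 2, 2, 2]   # every node has the same degree
slide_53 = [3, 4, 2, 6, 3, 4, 5, 5, 4, 7, 3, 3, 3, 3, 2, 3, 2, 1, 1]

print(degree_centralization(star))
print(degree_centralization(ring))
print(degree_centralization(slide_53))
```

The star scores 1 and the ring scores 0, the two extremes. The slide 53 degrees give 69/306, roughly 0.225, which is close to the 0.22 reported for that network on slide 56.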
  56. Social Network Analysis: Cohesion – How well connected? (Diagram: the same 19-node network, A to S.)
      | Measure | Value |
      |---|---|
      | Average Degree | 3.37 |
      | Density | 0.19 |
      | Average Distance | 3.06 |
      | Diameter | 8 |
      | Degree Centralization | 0.22 |
      | Betweenness Centralization | 0.48 |
      | Closeness Centralization | 0.27 |
      | Eigenvector Centralization | 0.56 |
      | Clustering Coefficient | 0.43 |
      Clustering by node: A 0.67, B 0.67, C 0.00, D 0.40, E 1.00, F 0.50, G 0.50, H 0.10, I 0.33, J 0.29, K 0.67, L 0.67, M 0.33, N 0.67, O 1.00, P 0.67, Q 0.00, R NA, S NA
  57. Social Network Analysis: Some General Tendencies
      • Small diameters and small average path lengths
      • High clustering coefficients relative to random processes
      • Rate of clustering among the higher-degree nodes decreases with degree
      • Fat-tailed degree distributions relative to random processes
      • Hard to find networks that actually follow a strict power law
      • Positive assortativity and high degrees of homophily, at least in social networks
  58. Social Network Analysis: Is it really a small world? http://www.touchgraph.com/navigator
  59. Social Network Analysis: Is it really a small world? The naïve view:
      | Level | Steps | 50 friends each | 100 friends each |
      |---|---|---|---|
      | Ego | | 1 | 1 |
      | Friends | 1 | 50 | 100 |
      | FoF | 2 | 2,500 | 10,000 |
      | FoFoF | 3 | 125,000 | 1,000,000 |
      | FoFoFoF | 4 | 6,250,000 | 100,000,000 |
      | FoFoFoFoF | 5 | 312,500,000 | 10,000,000,000 |
      | FoFoFoFoFoF | 6 | 15,625,000,000 | 1,000,000,000,000 |
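The table above is plain exponentiation: if everyone has k friends and no two friend circles overlap, the number of people exactly s steps away is k to the power s. A short sketch (the function name `naive_reach` is hypothetical):

```python
def naive_reach(friends_per_person, steps):
    """People reached at exactly `steps` hops if no friend circles overlap."""
    return friends_per_person ** steps

for steps in range(7):
    print(steps, f"{naive_reach(50, steps):,}", f"{naive_reach(100, steps):,}")
```

The view is naïve because real friend circles overlap heavily (your friends tend to know each other), so the true reach at each step is smaller than these figures suggest.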
  60. Social Network Analysis: Is it really a small world?
  61. Social Network Analysis: Is it really a small world? “The emergence of online social networking services over the past decade has revolutionized how social scientists study the structure of human relationships [1]. As individuals bring their social relations online, the focal point of the internet is evolving from being a network of documents to being a network of people, and previously invisible social structures are being captured at tremendous scale and with unprecedented detail.”
  62. Social Network Analysis: Study Population
      | Active* | Global | US |
      |---|---|---|
      | Members | 721M | 149M |
      | Friends | 68.7B | 15.9B |
      | Aver. Friends | 190 | 214 |
      | Total Pop | 6.9B | 260M |
      *Active: accessed within 28 days of May ’11, at least one friend, over 13 years of age.
  63. Social Network Analysis: Degree Distribution. N = 721M, F = 69B. Chart annotations: “Encouraged up to 20”; median ~ 99
  64. Social Network Analysis: Distance Distribution. World: average 4.7; chart annotations 92% and 99.6%. US: average 4.3; chart annotations 96% and 99.7%.
  65. Social Network Analysis: Connected Components. Chart annotations: 99.91% of members; 2000 members. Connected component: a set of individuals for which each pair of individuals is connected by at least one path through the network.
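The definition above translates directly into a graph traversal: start from any unvisited node, collect everything reachable from it, and repeat. The sketch below is a minimal illustration on a hypothetical six-node network; the adjacency map and node names are made up for the example.

```python
def connected_components(adj):
    """Return the components as sets of nodes, largest first."""
    seen = set()
    components = []
    for start in adj:
        if start in seen:
            continue
        component = {start}
        stack = [start]
        while stack:
            u = stack.pop()
            for v in adj[u]:
                if v not in component:
                    component.add(v)
                    stack.append(v)
        seen |= component
        components.append(component)
    return sorted(components, key=len, reverse=True)

# Hypothetical undirected network: a chain of three, a pair, and an isolate.
adj = {
    "A": {"B"}, "B": {"A", "C"}, "C": {"B"},
    "D": {"E"}, "E": {"D"},
    "F": set(),
}
components = connected_components(adj)
largest_share = len(components[0]) / len(adj)
print([sorted(c) for c in components], largest_share)
```

The share of nodes in the largest component is the quantity behind figures such as the 99.91% on the slide.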
  66. Social Network Analysis: Cohesion
      • You’re friends with a significant fraction of your friends’ friends (chart annotation: 14% for 100).
      • Feld: your friends have more friends than you (same with sex partners). 100 friends: 28K unique FoFs; 40K non-unique FoFs.
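Feld's observation that your friends have more friends than you can be checked on any graph by comparing two averages: the mean degree of a node, and the mean degree of a friend (a node reached by following a link). The sketch below is an illustration on a hypothetical star-shaped network, where the effect is easy to see; it is not drawn from the Facebook data.

```python
def mean_degree(adj):
    """Average number of friends per person."""
    return sum(len(nbrs) for nbrs in adj.values()) / len(adj)

def mean_friend_degree(adj):
    """Average number of friends that a friend has, taken over all friendship ends."""
    total = sum(len(adj[v]) for u in adj for v in adj[u])
    ends = sum(len(nbrs) for nbrs in adj.values())
    return total / ends

# Hypothetical network: one popular hub with four friends who know only the hub.
adj = {
    "hub": {"a", "b", "c", "d"},
    "a": {"hub"}, "b": {"hub"}, "c": {"hub"}, "d": {"hub"},
}
print(mean_degree(adj), mean_friend_degree(adj))
```

Popular people appear in many friend lists, so they are over-represented when averaging over friends; the second number is never smaller than the first, and the two are equal only when everyone has the same degree.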
  67. Social Network Analysis: Correlation and Assortativity
  68. Social Network Analysis: Birds of a Feather – Homophily. 84% in the same country
  69. Social Network Analysis: “Revolution 2.0 will not be Televised”. It will be Tweeted & Retweeted. Tweet rate, Feb. 24-25, 2011, Tahrir Square
  70. Social Network Analysis: “Revolution 2.0 will not be Televised”. 1% feed from the two-day period. Nodes = 25178, Links = 32471
  71. Social Network Analysis: Network Features. Average Degree = 2.58, Min Degree = 0, Max Degree = 729. Average Distance = 5.40, Min Distance = 1, Diameter = 22.
      | Measure | Value |
      |---|---|
      | Density | 0.0001 |
      | Degree Centralization | 0.029 |
      | Betweenness Centralization | 0.076 |
      | Closeness Centralization | WC |
      | Eigenvector Centralization | 0.724 |
      | Clustering Coefficient | 0.0045 |
      | Number of Components | 3122 |
      | Size of Largest Component | 17762 |
      | % in Largest Component | 70.50% |
  72. Social Network Analysis: Node and Network Metrics
      | Node | Degree | NormDeg | Betw | Close | Eigen | Clust |
      |---|---|---|---|---|---|---|
      | Ghonim | 729 | 0.029 | 0.076 | 0.214 | 0.509 | 0.001 |
      | Dima_Khatib | 506 | 0.020 | 0.050 | 0.208 | 0.216 | 0.001 |
      | ShababLibya | 493 | 0.020 | 0.048 | 0.206 | 0.219 | 0.003 |
      | monaeltahawy | 436 | 0.017 | 0.038 | 0.204 | 0.198 | 0.002 |
      | AJEnglish | 359 | 0.014 | 0.030 | 0.195 | 0.085 | 0.001 |
      | bencnn | 306 | 0.012 | 0.021 | 0.193 | 0.090 | 0.001 |
      | AJELive | 283 | 0.011 | 0.017 | 0.191 | 0.065 | 0.001 |
      | 3arabawy | 273 | 0.011 | 0.033 | 0.200 | 0.092 | 0.002 |
      | cnnbrk | 256 | 0.010 | 0.015 | 0.182 | 0.031 | 0.000 |
      | AJArabic | 238 | 0.009 | 0.018 | 0.192 | 0.050 | 0.002 |
      | Sandmonkey | 227 | 0.009 | 0.020 | 0.198 | 0.096 | 0.003 |
      | SultanAlQassemi | 216 | 0.009 | 0.014 | 0.189 | 0.045 | 0.001 |
      | alaa | 204 | 0.008 | 0.020 | 0.202 | 0.088 | 0.007 |
      | alarabiya_ar | 169 | 0.007 | 0.009 | 0.180 | 0.035 | 0.001 |
      | yoanisanchez | 161 | 0.006 | 0.012 | 0.149 | 0.001 | 0.001 |
      | AymanM | 160 | 0.006 | 0.009 | 0.190 | 0.050 | 0.003 |
      | acarvin | 159 | 0.006 | 0.014 | 0.200 | 0.092 | 0.008 |
      | iyad_elbaghdadi | 146 | 0.006 | 0.008 | 0.182 | 0.043 | 0.002 |
      | monasosh | 140 | 0.006 | 0.009 | 0.192 | 0.050 | 0.004 |
      | ChangeInLibya | 134 | 0.005 | 0.010 | 0.192 | 0.063 | 0.011 |
  73. Social Network Analysis: Node and Network Metrics
  74. Social Network Analysis: Egocentric Analysis
      | Measure | Ghonim | Bieber |
      |---|---|---|
      | Vertices | 730 | 14 |
      | Edges | 970 | 13 |
      | Density | 0.004 | 0.140 |
      | Average Degree | 2.660 | 1.860 |
      | Average Distance | 1.990 | 1.860 |
      | Diameter | 2.000 | 2.000 |
      | Degree Centralization | 0.999 | 1.000 |
      | Betweenness Centralization | 0.995 | 1.000 |
      | Closeness Centralization | 0.999 | 1.000 |
      | EigenVector Centralization | 0.990 | 1.740 |
      | Cluster Coefficient | 0.003 | 0.000 |
      | Number of Components | 1 | 1 |
      | Size of Largest Component | 730 | 14 |
      | % in Largest Component | 100% | 100% |
  75. Social Network Analysis: Subcommunities of US Political Blogs
      • Single-day snapshot of a snowball sample of political blogs (N=1490)
      • Manually assigned as Liberal or Conservative
      • Focus on blogrolls and front-page citations
      • A primary question: Cyberbalkanization?
  76. Social Network Analysis: Subcommunities of Political Blogs
  77. Social Network Analysis: Political Blogs. All blogs: N=1490, Edges = 16715. Liberals: N=758, Edges = 7301. Conservatives: N=732, Edges = 7839.
  78. Social Network Analysis: Political Blogs – Cyberbalkanization?
      | Measure | Liberal | Conservative |
      |---|---|---|
      | N | 758 | 732 |
      | Out Links | 74% | 84% |
      | In Links | 67% | 82% |

      | Viewpoint | Lib In Links | Cons In Links | Total In Links | %Lib | %Cons |
      |---|---|---|---|---|---|
      | dailykos.com | 292 | 46 | 338 | 86% | 14% |
      | www.talkingpointsmemo.com | 242 | 22 | 264 | 92% | 8% |
      | atrios.blogspot.com | 230 | 39 | 269 | 86% | 14% |
      | www.washingtonmonthly.com | 165 | 36 | 201 | 82% | 18% |
      | www.wonkette.com | 83 | 30 | 113 | 73% | 27% |
      | www.juancole.com | 149 | 16 | 165 | 90% | 10% |
      | yglesias.typepad.com/matthew | 104 | 24 | 128 | 81% | 19% |
      | www.crookedtimber.org | 81 | 19 | 100 | 81% | 19% |
      | www.mydd.com | 107 | 8 | 115 | 93% | 7% |
      | www.oliverwillis.com | 97 | 20 | 117 | 83% | 17% |
      | blog.johnkerry.com | 21 | 2 | 23 | 91% | 9% |
      | www.pandagon.net | 118 | 5 | 123 | 96% | 4% |
      | www.talkleft.com | 126 | 15 | 141 | 89% | 11% |
      | digbysblog.blogspot.com | 115 | 3 | 118 | 97% | 3% |
      | www.politicalwire.com | 87 | 16 | 103 | 84% | 16% |
      | www.j-bradford-delong.net | 98 | 11 | 109 | 90% | 10% |
      | www.prospect.org/weblog | 102 | 11 | 113 | 90% | 10% |
      | americablog.blogspot.com | 64 | 5 | 69 | 93% | 7% |
      | www.theleftcoaster.com | 78 | 4 | 82 | 95% | 5% |
      | www.jameswolcott.com | 74 | 6 | 80 | 93% | 8% |
      | Total Liberal | 2433 | 338 | 2771 | 88% | 12% |
      | www.powerlineblog.com | 26 | 195 | 221 | 12% | 88% |
      | instapundit.com | 43 | 234 | 277 | 16% | 84% |
      | www.littlegreenfootballs.com/weblog | 10 | 171 | 181 | 6% | 94% |
      | www.hughhewitt.com | 11 | 146 | 157 | 7% | 93% |
      | www.andrewsullivan.com/index.php | 59 | 86 | 145 | 41% | 59% |
      | www.captainsquartersblog.com/mt | 5 | 117 | 122 | 4% | 96% |
      | www.wizbangblog.com | 14 | 125 | 139 | 10% | 90% |
      | www.indcjournal.com | 6 | 60 | 66 | 9% | 91% |
      | www.michellemalkin.com | 10 | 191 | 201 | 5% | 95% |
      | blogsforbush.com | 4 | 208 | 212 | 2% | 98% |
      | www.allahpundit.com | 2 | 37 | 39 | 5% | 95% |
      | belmontclub.blogspot.com | 3 | 93 | 96 | 3% | 97% |
      | realclearpolitics.com | 13 | 104 | 117 | 11% | 89% |
      | volokh.com | 27 | 80 | 107 | 25% | 75% |
      | timblair.spleenville.com | 7 | 80 | 87 | 8% | 92% |
      | windsofchange.net | 16 | 65 | 81 | 20% | 80% |
      | www.vodkapundit.com | 9 | 97 | 106 | 8% | 92% |
      | www.rogerlsimon.com | 6 | 74 | 80 | 8% | 93% |
      | www.deanesmay.com | 8 | 79 | 87 | 9% | 91% |
      | mypetjawa.mu.nu | 0 | 51 | 51 | 0% | 100% |
      | Total Conservative | 279 | 2293 | 2572 | 11% | 89% |
  79. 79. Social Networks: Political Blogs - Metrics

    Measure           Liberal   Conservative    Total
    Density              0.02           0.03     0.01
    No. Components        188            107      268
    Largest Comp          569            569     1222
    Largest Comp %      75.10          84.97    82.01
    Min Deg                 0              0        0
    Max Deg               305            296      351
    Aver Deg            19.26          21.42    22.44
    Deg Central          0.38           0.38     0.22
    Diameter                6              7        8
    Aver Dist            2.51           2.51     2.74
    Betw Cent            0.10           0.16     0.06
    EigVect Cent         0.23           0.26     0.22
    Clust Coeff          0.31           0.20     0.22
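The "Aver Deg" row can be reproduced from the node and edge counts on slide 77, assuming each link is treated as undirected so that it adds one to the degree of both endpoints. A short sketch under that assumption (the helper name is illustrative):

```python
# (nodes, edges) for each network, taken from slide 77.
networks = {
    "Liberal": (758, 7301),
    "Conservative": (732, 7839),
    "Total": (1490, 16715),
}

def average_degree(nodes, edges):
    """Average degree of an undirected graph: every edge has two endpoints."""
    return 2 * edges / nodes

avg = {name: round(average_degree(n, e), 2) for name, (n, e) in networks.items()}
print(avg)
```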
  80. 80. Social Network Analysis: Political Blogs – Empirical Comparisons

    Network type                             Blog       Citation                                Twitter
    Measure                                  Political  Biology    Economics  Math     Physics  UK Journalists  Egypt Twitter
    Number of Nodes                          1,490      1,520,521  81,217     253,339  52,909   523             25,178
    Average Degree                           22.4       15.5       1.7        3.9      9.3      88              3
    Average Path Length                      2.74       4.9        9.5        7.6      6.2      1.88            5.4
    Diameter of the Largest Component        8          24         29         27       20       4               22
    Overall Clustering                       0.22       0.09       0.16       0.15     0.45     0.41            0.004
    Fraction of Nodes in Largest Component   0.82       0.92       0.41       0.82     0.85     0.99            0.7
  81. 81. Social Network Analysis: Political Blogs – Model Comparisons

    Measure                                  Political  Bernoulli       Deg Conditional Small World     Pref Attachment
                                             Blogs      2.5%    97.5%   2.5%    97.5%   2.5%    97.5%   2.5%    97.5%
    Number of Components                     268        1       1       1       1       1       1       96      134
    Fraction of Nodes in Largest Component   0.82       1.00    1.00    1.00    1.00    1.00    1.00    0.91    0.94
    Diameter of the Largest Component        8          4       4       7       9       4       5       7       9
    Average Path Length                      2.74       2.61    2.63    3.29    3.36    2.98    3.01    3.07    3.14
    Overall Clustering                       0.226      0.017   0.018   0.029   0.031   0.355   0.372   0.095   0.109
    Betweenness                              0.065      0.002   0.003   0.010   0.021   0.003   0.004   0.038   0.064
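The slide does not say how the 2.5% and 97.5% columns were produced. One plausible reading is that they are empirical quantiles over many simulated networks from each model. The sketch below illustrates that idea for the Bernoulli model only, on a deliberately small graph; the sizes, seed, helper names, and the use of average local clustering are all assumptions made for illustration, not the tutorial's settings.

```python
import random

def bernoulli_graph(n, p, rng):
    """Link each pair of nodes independently with probability p."""
    adj = {v: set() for v in range(n)}
    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < p:
                adj[u].add(v)
                adj[v].add(u)
    return adj

def count_components(adj):
    """Count connected components with an iterative depth-first search."""
    seen, components = set(), 0
    for start in adj:
        if start in seen:
            continue
        components += 1
        seen.add(start)
        stack = [start]
        while stack:
            node = stack.pop()
            for nbr in adj[node]:
                if nbr not in seen:
                    seen.add(nbr)
                    stack.append(nbr)
    return components

def average_clustering(adj):
    """Mean over vertices of the share of neighbor pairs that are linked."""
    total = 0.0
    for nbrs in adj.values():
        k = len(nbrs)
        if k < 2:
            continue  # vertices with fewer than two neighbors contribute 0
        ordered = sorted(nbrs)
        links = sum(1 for i, a in enumerate(ordered)
                    for b in ordered[i + 1:] if b in adj[a])
        total += 2 * links / (k * (k - 1))
    return total / len(adj)

def quantile_interval(values, lower=0.025, upper=0.975):
    """Nearest-rank empirical quantiles of the simulated values."""
    ordered = sorted(values)
    last = len(ordered) - 1
    return ordered[round(lower * last)], ordered[round(upper * last)]

rng = random.Random(45)  # arbitrary seed for repeatability
samples = [bernoulli_graph(60, 0.15, rng) for _ in range(200)]
component_interval = quantile_interval([count_components(g) for g in samples])
clustering_interval = quantile_interval([average_clustering(g) for g in samples])
print(component_interval, clustering_interval)
```

An observed value that falls outside a model's interval, as the blog network's clustering of 0.226 does for the Bernoulli columns, suggests that model does not account for that feature.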
  82. 82. Social Network Analysis: Bernoulli
    • Fixed number of nodes and lines
      – Sets density and average degree = density * (n-1)
    • Assigns lines to each pair of nodes independently with fixed probabilities
      – Each line a random binary variable
    • Produces Poisson degree distribution
    • Small diameter
      – ln(#nodes) / ln(aver. degree)
    • Low clustering
      – Average degree / (#nodes - 1)
    • Large component
      – E.g. at aver. degree of 1.5, ~50% in largest component

    Measure                       Bernoulli
    Vertices                      1000
    Edges                         11149
    Density                       0.022
    Average Degree                22.900
    Average Distance              2.570
    Diameter                      4
    Degree Centralization         0.016
    Betweenness Centralization    0.003
    Closeness Centralization      0.066
    EigenVector Centralization    0.034
    Cluster Coefficient           0.021
    Number of Components          1
    Size of Largest Component     1000
    % in Largest Component        100%
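The rules of thumb on this slide can be evaluated directly for the example network. The sketch below plugs the slide's 1000 vertices and 11149 edges into those formulas; the function and key names are illustrative, and the distance formula is only the rough approximation quoted on the slide.

```python
import math

def bernoulli_expectations(n, m):
    """Evaluate the slide's approximations for a random graph with n nodes, m lines."""
    density = 2 * m / (n * (n - 1))          # share of possible lines present
    avg_degree = density * (n - 1)           # same as 2m / n
    return {
        "density": density,
        "avg_degree": avg_degree,
        "approx_distance": math.log(n) / math.log(avg_degree),
        "approx_clustering": avg_degree / (n - 1),  # equals the density
    }

result = bernoulli_expectations(1000, 11149)
for name, value in result.items():
    print(f"{name}: {value:.3f}")
```

The computed density of about 0.022 matches the table and sits close to the reported cluster coefficient of 0.021. The formula gives an average degree of about 22.3 rather than the 22.900 listed, so the slide's figure may rest on a different count or contain a typo.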
  83. 83. Social Network Analysis: Small World
    • Fixed number of nodes and the number (r) of nearby neighbors to which each vertex is linked
      – Implies strong transitive ties
      – Ensures higher clustering ~ (3r-3)/(4r-2)
    • Probabilistically rewires selected lines from one vertex to another (each line and vertex has equal probability of being selected)
      – Ensures low average path length

    Measure                       Small World
    Vertices                      1000
    Edges                         11000
    Density                       0.022
    Average Degree                22.000
    Average Distance              3.070
    Diameter                      5.000
    Degree Centralization         0.007
    Betweenness Centralization    0.007
    Closeness Centralization      0.079
    EigenVector Centralization    0.031
    Cluster Coefficient           0.514
    Number of Components          1
    Size of Largest Component     1000
    % in Largest Component        100%
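The clustering expression (3r-3)/(4r-2) can be checked against the starting ring lattice, before any rewiring. The sketch assumes r counts the neighbors on each side of a vertex, which is consistent with the table (1000 vertices times r = 11 gives 11000 lines); helper names are illustrative.

```python
from itertools import combinations

def ring_lattice(n, r):
    """Adjacency sets for a ring in which each vertex links to r neighbors per side."""
    adj = {v: set() for v in range(n)}
    for v in range(n):
        for step in range(1, r + 1):
            w = (v + step) % n
            adj[v].add(w)
            adj[w].add(v)
    return adj

def average_clustering(adj):
    """Mean over vertices of the share of neighbor pairs that are linked."""
    total = 0.0
    for nbrs in adj.values():
        k = len(nbrs)
        if k < 2:
            continue
        links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
        total += 2 * links / (k * (k - 1))
    return total / len(adj)

def lattice_clustering_formula(r):
    """Closed form quoted on the slide."""
    return (3 * r - 3) / (4 * r - 2)

adj = ring_lattice(1000, 11)
edge_count = sum(len(nbrs) for nbrs in adj.values()) // 2
measured = average_clustering(adj)
predicted = lattice_clustering_formula(11)
print(edge_count, round(measured, 3), round(predicted, 3))
```

Both values come out at about 0.714 for the unrewired lattice. The table's lower cluster coefficient of 0.514 is what one would expect once some lines have been rewired.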
  84. 84. Social Network Analysis: Preferential Attachment
    • Vast majority of new nodes link to nodes with proportionately higher degrees
      – “Rich get richer”
    • Probabilistic part involves the selection of vertices for new lines (e.g. the probability of picking an end vertex for a new line is proportional to the degree of that vertex)
    • Tend to have a small world structure
    • Exhibit long-tailed degree distributions
      – Right-hand tail of the distribution follows a “scale-free” power-law distribution
      – Log-log graph is a straight line
      – Ensures low average path length

    Measure                       Preferential
    Vertices                      1000
    Edges                         11242
    Density                       0.019
    Average Degree                18.770
    Average Distance              2.690
    Diameter                      7
    Degree Centralization         0.154
    Betweenness Centralization    0.037
    Closeness Centralization      -
    EigenVector Centralization    0.188
    Cluster Coefficient           0.192
    Number of Components          92
    Size of Largest Component     908
    % in Largest Component        91%
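The "rich get richer" rule can be sketched as a simple growth process: each new vertex attaches to m existing vertices, picked with probability proportional to their current degree. This is a generic illustration, not the generator behind the table; the starting core, the value m = 11, and the seed are assumptions. Unlike the slide's network with 92 components, this sketch always yields a single connected graph.

```python
import random
from collections import Counter

def preferential_attachment(n, m, seed=0):
    """Grow a graph to n vertices; each newcomer links to m degree-weighted targets."""
    rng = random.Random(seed)
    # Assumed starting point: a fully linked core of m + 1 vertices.
    edges = [(u, v) for u in range(m + 1) for v in range(u)]
    # Each vertex appears once per incident line, so a uniform draw from this
    # list selects a vertex with probability proportional to its degree.
    endpoints = [v for edge in edges for v in edge]
    for new in range(m + 1, n):
        targets = set()
        while len(targets) < m:
            targets.add(rng.choice(endpoints))
        for target in targets:
            edges.append((new, target))
            endpoints.extend((new, target))
    return edges

edges = preferential_attachment(1000, 11, seed=45)
degree = Counter(v for edge in edges for v in edge)
average_degree = 2 * len(edges) / len(degree)
max_degree = max(degree.values())
print(len(edges), round(average_degree, 2), max_degree)
```

The early vertices accumulate far more lines than the average, which is the long right-hand tail described above.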
