• Save
Mining and Analyzing Online Social Graph Data
Upcoming SlideShare
Loading in...5
×
 

Mining and Analyzing Online Social Graph Data

on

  • 8,524 views

Video with synched slides are now available here: http://bit.ly/bljh5m ...

Video with synched slides are now available here: http://bit.ly/bljh5m

Presentation by Drew Conway (@drewconway) as presented to NYC Predictive Analytics meetup group, May 13, 2010.

Available for download on Drew’s blog Zero Intelligence Agents http://www.drewconway.com/zia/?p=2151

To download the sample code and the pdf, please go to our Meetup.com page here: http://bit.ly/aXDqzg

Drew Conway is a PhD student at New York University, Dept. of Politics, an expert on social networks, and a great presenter. Drew also writes for his blog Zero Intelligence Agents, runs the NY R Statistical Programming Meetup (http://www.meetup.com/nyhackr/) and is a member of NYC Predictive Analytics http://www.meetup.com/NYC-Predictive-Analytics/

Statistics

Views

Total Views
8,524
Views on SlideShare
8,521
Embed Views
3

Actions

Likes
28
Downloads
0
Comments
2

3 Embeds 3

http://przegladmarketingowy.com 1
http://twitter.com 1
http://pinterest.com 1

Accessibility

Categories

Upload Details

Uploaded via as Adobe PDF

Usage Rights

© All Rights Reserved

Report content

Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
  • Full Name Full Name Comment goes here.
    Are you sure you want to
    Your message goes here
    Processing…
Post Comment
Edit your comment

Mining and Analyzing Online Social Graph Data Mining and Analyzing Online Social Graph Data Presentation Transcript

  • Mining and Analyzing Online Social Graph Data Drew Conway New York University, Dept. of Politics May 13, 2010
  • Agenda Network basics Unit of analysis Data representation Analysis & Visualization Thoughts on research design Edge contexts A bit on social graph data and ethics Digging into the data Where to get it Our first scrape and build! Real-time social graph analysis Build the network of Twitter users in the room Hold our breath...
  • Live demo setup Last week Bruno asked how many of you use Twitter 15 (54%) members said yes, which may be enough to do some interesting stuff Please tweet the following hash-tag: #analyticsnyc Some example tweets: Many thanks to the @GiltGroupe for hosting tonight’s #analyticsnyc meetup Hanging out with @DKALab and @brunocm at #analyticsnyc I cannot wait to watch @drewconway crash and burn with his #analyticsnyc live demo
  • Network basics The study of relationships Research design Data representation Digging into data Analysis & Visualization Live demonstration Networks and the study of relationships Network theory use the language of graph theory G = {V , E } ↑ In the abstract this is very powerful, we have a general purpose way to represent any number of relations G={Routers,Packet Traffic} Edges have non-binary values G={U.S. Airports, Commercial Routes} Edges can represent distance, cost, frequency, etc. G={New York City nerds, Co-membership in Meetups} Edges are implied! While both nodes and edges are needed to have a network of any substance, it is the edge (relationship) that will always be the primary focus of our analyses Drew Conway Mining and Analyzing Online Social Graph Data
  • Network basics The study of relationships Research design Data representation Digging into data Analysis & Visualization Live demonstration Why focus on the edge? Consider a very simple example of two people meeting for the first time... In reality, the meeting may reveal a large If we focus on the nodes, we might degree of shared structure: consider this meeting creates a dyad: We know, however, that people do not exist as isolates and this assumption is ignoring all of By expressing this meeting in terms of the exogenous social structure these nodes as a function of their edges we may individuals brings with them. gain a much richer understanding of the structural dynamics Drew Conway Mining and Analyzing Online Social Graph Data
  • Network basics The study of relationships Research design Data representation Digging into data Analysis & Visualization Live demonstration Representing network data: the canonical Perhaps the most natural way to represent the relationships between N actors is with an NxN matrix, often referred to as a “sociomatrix” X1 X2 ... XN X1 0 1 ... 0 Very intuitive representation X2 1 0 ... 1 Can support directional and weighted edges . . . . . . .. . . . . . . . Application of matrix algebraic operation for XN 0 1 ... 0 analysis As an introduction to representing relationship, a matrix provides insight to a network by itself. In practice, however, this representation has many limitations Unwieldy as network sizes scale up Most networks are sparse; too many zeroes Difficult to publish and share Fortunately, there are many other options for representing network data! Drew Conway Mining and Analyzing Online Social Graph Data
  • Network basics The study of relationships Research design Data representation Digging into data Analysis & Visualization Live demonstration Representing network data: the practical Remember, it is all about the edges; therefore, more efficient representations will be limited to edge data Edge list Adjacency list A text file with two columns (usually Also a text file, however, here the first delimited by a space or tab), where first column is the source, and all column in source, and second is target subsequent entries are “adjacent” nodes 1 2 1 3 2 1 6 7 6 8 1 2 3 10 12 2 1 10 13 6 7 8 10 12 13 While these are the most universal data formats, network data representation is a bit of a cottage industry Pajek (.net) GraphML (.gml) GraphViz (.dot) ..and domain specific formats (eek!) Drew Conway Mining and Analyzing Online Social Graph Data
  • Network basics The study of relationships Research design Data representation Digging into data Analysis & Visualization Live demonstration A bit on tools The number of software suites and packages available for conducting social network analysis has exploded over the past ten years In general, this software can be categorized in two ways: Type - many SNA tools are developed to be standalone applications, while others are language specific packages Intent - consumers and producer of SNA come from a wide range of technical expertise and/or need, therefore, there exist simple tools for data collection and basic analysis, as well as complex suites for advanced research Standalone Apps Modules & Packages - ORA (Windows) - libSNA (Python) Basic - Analyst Notebook (Windows) - UrlNet (Python) - KrakPlot (Windows) - NodeXL (MS Excel) - UCINet (Windows) - NetworkX (Python) Advanced - Pajek (Multi) - JUNG (Java) - Network Workbench (Multi) - igraph (Python, R & Ruby) Many of the above tools have visualization components, but several tools are designed specifically for visualization: Graphviz, NetDraw, Tom Sawyer, Gephi, etc. What I use Drew Conway Mining and Analyzing Online Social Graph Data
  • Network basics The study of relationships Research design Data representation Digging into data Analysis & Visualization Live demonstration Comparing two network metrics to find key actors Often network analysis is used to identify key actors within a social group. To identify these actors, various centrality metrics can be computed based on a network’s structure Degree (number of connections) Betweenness (number of shortest paths an actor is on) Closeness (relative distance to all other actors) Eigenvector centrality (leading eigenvector of sociomatrix) One method for using these metrics to identify key actors is to plot actors’ scores for Eigenvector centrality versus Betweenness. Theoretically, these metrics should be approximately linear; therefore, any non-linear outliers will be of note. An actor with very high betweenness but low EC may be a critical gatekeeper to a central actor Likewise, an actor with low betweenness but high EC may have unique access to central actors Drew Conway Mining and Analyzing Online Social Graph Data
  • Network basics The study of relationships Research design Data representation Digging into data Analysis & Visualization Live demonstration First, visualize the data Visualization can be the best first step in the analytical process This will give you a good feel for what is going on with your relationships. For large networks this is often not possible For this example, we will use the main component of the social network collected on drug users in Hartford, CT. The network has 194 nodes and 273 edges. Here I am using the GUESS visualizer in NWB with a Kamada-Kawai layout Drew Conway Mining and Analyzing Online Social Graph Data
  • Network basics The study of relationships Research design Data representation Digging into data Analysis & Visualization Live demonstration Finding Key Actors with R The first steps are to load the data into memory, and perform some basic centrality analysis1 Load the data into igraph library(igraph) G<-read.graph("drug_main.txt",format="edgelist") G<-as.undirected(G) # By default, igraph inputs edgelist data as a directed graph. # In this step, we undo this and assume that all relationships are reciprocal. Store metrics in new data frame cent<-data.frame(bet=betweenness(G),eig=evcent(G)$vector) # evcent returns lots of data associated with the EC, but we only need the # leading eigenvector res<-lm(eig~bet,data=cent)$residuals cent<-transform(cent,res=res) # We will use the residuals in the next step 1 Weeks, et al (2002) http://dx.doi.org/10.1023/A:1015457400897 Drew Conway Mining and Analyzing Online Social Graph Data
  • Network basics The study of relationships Research design Data representation Digging into data Analysis & Visualization Live demonstration Finding Key Actors with R Key Actor Analysis for Hartford Drug Users 1.0 44 Plot the data res library(ggplot2) 0.8 −0.2 # We use ggplot2 to make things a 0 # bit prettier 0.2 p<-ggplot(cent,aes(x=bet,y=eig, 28 53 0.4 label=rownames(cent),colour=res, 0.6 0.6 size=abs(res)))+xlab("Betweenness Centrality Eigenvector Centrality")+ylab("Eigenvector 50 abs(res) Centrality") 0.1 # We use the residuals to color and 0.4 0.2 # shape the points of our plot, 141 47 # making it easier to spot outliers. 20 56 0.3 155 58 43 p+geom_text()+opts(title="Key Actor 100 22 126 0.4 57 147 Analysis for Hartford Drug Users") 23 140 18 0.5 104 19 # We use the geom_text function to plot 0.2 75 65 84 139 41 9 0.6 13685 # the actors’ ID’s rather than points 48 17 87 55 54 63 89105 94 77 184 67 0.7 # so we know who is who 101 64 165 93 3349 162 86 91 177 170 21 36168 174 52 111 27 16 42 68 6 12 69 148 169 149 78 183 181 110 172 25 185 182 150 61 11 176 130 45 79 138 66 62 0.0 124 26 164 121 163 189 179 152 119 127 38 88 151 133 161 166 107276114399 46 5 97 608 118 4 120 240 137 109 107159 157144 73 116 51 95 90 135 186 190 123 187 156 173 192 146 96 122 82 125 143 134 37132142 72 13 112 71 167 31 153 145 128 129 188 193 191 15 171 92 39 13 180 158 175 178 106 108 194 113 98 81 59 83 80 74103 154 24 117 70 160 115 34 131 14 30 29 35 102 0 1000 2000 3000 4000 5000 6000 Betweenness Centrality Drew Conway Mining and Analyzing Online Social Graph Data
  • Network basics The study of relationships Research design Data representation Digging into data Analysis & Visualization Live demonstration Key Actor Plot q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q 141 q q q q q 155 Network plot q q q q q q q 58 47 44 q q q # Create positions for all of q 28 q q q q q q q q q q q q q q # the nodes w/ force directed l<-layout.fruchterman.reingold(G, 53 q q q q q q q q q q q niter=500) 20 50 q q q q q # Set the nodes’ size relative to q q q q q q q q q q q q # their residual value q q q q q q q V(G)$size<-abs(res)*10 q q q q q q q q # Only display the labels of key q q q # players q q q q q q nodes<-as.vector(V(G)+1) q q q q q q q q # Key players defined as have a q q q q # residual value >.25 nodes[which(abs(res)<.25)]<-NA q q q q 79 q # Save plot as PDF q q q pdf(‘actor_plot.pdf’,pointsize=7) q q q q plot(G,layout=l,vertex.label=nodes, q q 102 q vertex.label.dist=0.25, q q q q q q q q q q q q q q q vertex.label.color=‘red’,edge.width=1) q q q q q q q q q dev.off() q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q Drew Conway Mining and Analyzing Online Social Graph Data
  • Network basics Research design Edge context Digging into data Network data and ethics Live demonstration Not all edges are created equally I have spent a lot of time tonight describing network analysis as a way to understand relationships Depending on the source and context of the data, these relationships can be interpreted as many different things. Must consider the data generation process by which the edge was created With respect to online social graph data, ask: how do people use this service? Facebook Google SocialGraph Twitter Ties here are driven by Connecting all of Google’s Recent study shows personal contact social sites together Twitter is used as news me → offline → you me → anything → you aggregation service me → interests → you Offline relationships The combining of are driving multiple platforms Geography and “friending” into a single history less important Considerable amount “network” makes Networks may cluster analysis and of meta-data already around communities interpretation in FB, what does this of interest difficult add? Drew Conway Mining and Analyzing Online Social Graph Data
  • Network basics Research design Edge context Digging into data Network data and ethics Live demonstration Ethics and network data Why should we be discusses ethics at a data analytics seminar? “People have really gotten comfortable not only sharing more information and “Money is a terrible master but an different kinds, but more openly and excellent servant” with more people. That social norm is just something that has evolved over “There’s a sucker born every minute” time.” P.T. Barnum Mark Zuckerburg, CEO, Facebook In isolation, the data we provide online about our relationships and preferences are fairly innocuous. Their summation, however, can illuminate aspects of our lives that can be exploited for any number of ways. Social networking services are moving previously private data to the public Simply because data is available does not mean that individuals want or realize that it can be used to make inferences about who they are or their lifestyles Analysts are caught in the middle. There is no IRB on the Internet! Drew Conway Mining and Analyzing Online Social Graph Data
  • Network basics Research design Edge context Digging into data Network data and ethics Live demonstration Examples of ethically questionable network analyses MIT study predicts sexual orientation from Facebook friends Two undergraduate students start a project ‘Gaydar’ Claim they can predict which men are homosexual simply based on their friend structure Problem: Extremely private information, questionable methods Pleaserobme.com Combined information from fousquare.com and Twitter to publicize when people were clearly not at home Used as a proof of concept to show the danger of publicizing localization information Problem: Useful for raising awareness, hurtful to anyone who was actually robbed Project Grey Goose Large group of former and current DoD/IC analysts collaborated to study the identity of hackers I was personally involved in this project Using public web forum data, built up user profiles and social networks to attempt to identify hackers affiliated with Russian government Problem: While no names were published, bordering on vigilantism Drew Conway Mining and Analyzing Online Social Graph Data
  • Network basics Research design Getting data Digging into data First scrape and build Live demonstration Where to get social graph data Recently, there has been an explosion of resources for scraping social graph Service Data API Docs Following(ers), @-replies, date/time/geo http://apiwiki.twitter.com/ Friends, Wall Posts, date/time http://developers.facebook.com/docs/api All SocialGraph relationships http://code.google.com/apis/socialgraph/ Friends, Check-ins http://foursquare.com/developers/ “Taste graph”, recommendations http://hunch.com/developers/ Congressional votes, campaign finance http://developer.nytimes.com/docs There is clearly no shortage of data Each service provides different relational context Data formats are generally JSON, Atom, XML, or some combination For a more extensive list of API resources, see HackNY wiki of local startups Drew Conway Mining and Analyzing Online Social Graph Data
  • Network basics Research design Getting data Digging into data First scrape and build Live demonstration Building the social network among LiveJournal users Using a “seed” user, we will build out a network In Python, use NetworkX, cjson and a other standard scientific libraries, parse the SocialGraph data Through a process called “k-snowball searching” seed → friend → · · · → friendk Seed: imichaeldotorg.livejournal.com k =3 Note the low value of k Drew Conway Mining and Analyzing Online Social Graph Data
  • Network basics Research design Getting data Digging into data First scrape and build Live demonstration The code, part 1 Loading the libraries and setting things up from cjson import * from urllib import * from networkx import * Name: [‘http://imichaeldotorg.livejournal.com/’] from time import * Type: DiGraph from scipy import array,unique Number of nodes: 5 ... Number of edges: 5 if __name__ == "__main__": Average in degree: 1.0 seed_url=‘‘http://imichaeldotorg.livejournal.com" Average out degree: 1.0 sg=get_sg(seed_url) net,newnodes=create_egonet(sg) info(net) Get the JSON from SocialGraph def get_sg(seed_url): sgapi_url="http://socialgraph.apis.google.com/lookup?q="+seed_url+"&edo=1&edi=1&fme=1&pretty=0" try: furl=urlopen(sgapi_url) fr=furl.read() furl.close() return fr except IOError: print "Could not connect to website" print sgapi_url return Drew Conway Mining and Analyzing Online Social Graph Data
  • Network basics Research design Getting data Digging into data First scrape and build Live demonstration Build egonet and snowball Creating the egonet Rolling the snowball def create_egonet(s): def snowball_round(G,seeds,myspace=False): try: t0=time() raw=decode(s) if myspace: G=DiGraph() seeds=get_myspace_url(seeds) pendants=[] sb_data=[] n=raw[’nodes’] for s in range(0,len(seeds)): nk=n.keys() s_sg=get_sg(seeds[s]) G.name=str(nk) new_ego,pen=create_egonet(s_sg) pendants=[] for p in pen: for a in range(0,len(nk)): sb_data.append(p) for b in range(0,len(nk)): if s<1: if a!=b: sb_net=compose(G,new_ego) G.add_edge(nk[a],nk[b]) else: for k in nk: sb_net=compose(new_ego,sb_net) ego=n[k] del new_ego ego_out=ego[’nodes_referenced’] if s==round(len(seeds)*0.2): for o in ego_out: sb_net.name=’20% complete’ G.add_edge(k,o) sb_net.info() pendants.append(o) print ’AT: ’+strftime(’%m/%d/%Y, %H:%M:%S’, gmtime()) ego_in=ego[’nodes_referenced_by’] print ’’ for i in ego_in: ... G.add_edge(i,k) # More time keeping, probably a MUCH better way to do this pendants.append(i) sb_data=array(sb_data) pendants=array(pendants,dtype=str) sb_data.flatten() pendants.flatten() sb_data=unique(sb_data) pendants=unique(pendants) sb_net.info() return G,pendants return sb_net,sb_data except DecodeError: ... except KeyError: Drew Conway Mining and Analyzing Online Social Graph Data
  • Network basics Research design Getting data Digging into data First scrape and build Live demonstration Build the whole network Step Nodes Edges Mean Degree Density Our seed is abnormally isolated, with only four Seed 5 5 2.0 0.25 neighbors k =2 75 115 3.0 0.02 Large jump after first snowball k =3 4,938 8,659 3.5 3.6(10−4 ) Massive structural leap at k = 3 Drew Conway Mining and Analyzing Online Social Graph Data
  • Network basics Research design Getting data Digging into data First scrape and build Live demonstration The full network To get a feeling for the size of the full network... Drew Conway Mining and Analyzing Online Social Graph Data
  • Network basics Research design Digging into data Live demonstration Live demo Live demonstration time! Drew Conway Mining and Analyzing Online Social Graph Data