Video with synched slides are now available here: http://bit.ly/bljh5m
Presentation by Drew Conway (@drewconway) as presented to NYC Predictive Analytics meetup group, May 13, 2010.
Available for download on Drew’s blog Zero Intelligence Agents http://www.drewconway.com/zia/?p=2151
To download the sample code and the pdf, please go to our Meetup.com page here: http://bit.ly/aXDqzg
Drew Conway is a PhD student at New York University, Dept. of Politics, an expert on social networks, and a great presenter. Drew also writes for his blog Zero Intelligence Agents, runs the NY R Statistical Programming Meetup (http://www.meetup.com/nyhackr/) and is a member of NYC Predictive Analytics http://www.meetup.com/NYC-Predictive-Analytics/
1. Mining and Analyzing Online Social Graph Data
Drew Conway
New York University, Dept. of Politics
May 13, 2010
2. Agenda
Network basics
Unit of analysis
Data representation
Analysis & Visualization
Thoughts on research design
Edge contexts
A bit on social graph data and ethics
Digging into the data
Where to get it
Our first scrape and build!
Real-time social graph analysis
Build the network of Twitter users in the room
Hold our breath...
3. Live demo setup
Last week Bruno asked how many of you use Twitter
15 (54%) members said yes, which may be enough to do some interesting
stuff
Please tweet the following hash-tag:
#analyticsnyc
Some example tweets:
Many thanks to the @GiltGroupe for hosting tonight’s #analyticsnyc
meetup
Hanging out with @DKALab and @brunocm at #analyticsnyc
I cannot wait to watch @drewconway crash and burn with his
#analyticsnyc live demo
4. Network basics
The study of relationships
Research design
Data representation
Digging into data
Analysis & Visualization
Live demonstration
Networks and the study of relationships
Network theory use the language of graph theory
G = {V , E }
↑
In the abstract this is very powerful, we have a general purpose way to
represent any number of relations
G={Routers,Packet Traffic}
Edges have non-binary values
G={U.S. Airports, Commercial Routes}
Edges can represent distance, cost, frequency, etc.
G={New York City nerds, Co-membership in Meetups}
Edges are implied!
While both nodes and edges are needed to have a network of any substance, it
is the edge (relationship) that will always be the primary focus of our analyses
Drew Conway Mining and Analyzing Online Social Graph Data
5. Network basics
The study of relationships
Research design
Data representation
Digging into data
Analysis & Visualization
Live demonstration
Why focus on the edge?
Consider a very simple example of two people meeting for the first time...
In reality, the meeting may reveal a large
If we focus on the nodes, we might degree of shared structure:
consider this meeting creates a
dyad:
We know, however, that people do
not exist as isolates and this
assumption is ignoring all of By expressing this meeting in terms of the
exogenous social structure these nodes as a function of their edges we may
individuals brings with them. gain a much richer understanding of the
structural dynamics
Drew Conway Mining and Analyzing Online Social Graph Data
6. Network basics
The study of relationships
Research design
Data representation
Digging into data
Analysis & Visualization
Live demonstration
Representing network data: the canonical
Perhaps the most natural way to represent the relationships between N actors
is with an NxN matrix, often referred to as a “sociomatrix”
X1 X2 ... XN
X1 0 1 ... 0 Very intuitive representation
X2 1 0 ... 1 Can support directional and weighted edges
.
. .
. .
. .. .
.
. . . . . Application of matrix algebraic operation for
XN 0 1 ... 0 analysis
As an introduction to representing relationship, a matrix provides insight to a
network by itself. In practice, however, this representation has many limitations
Unwieldy as network sizes scale up
Most networks are sparse; too many zeroes
Difficult to publish and share
Fortunately, there are many other options for representing network data!
Drew Conway Mining and Analyzing Online Social Graph Data
7. Network basics
The study of relationships
Research design
Data representation
Digging into data
Analysis & Visualization
Live demonstration
Representing network data: the practical
Remember, it is all about the edges; therefore, more efficient representations
will be limited to edge data
Edge list Adjacency list
A text file with two columns (usually Also a text file, however, here the first
delimited by a space or tab), where first column is the source, and all
column in source, and second is target subsequent entries are “adjacent”
nodes
1 2
1 3
2 1
6 7
6 8 1 2 3
10 12 2 1
10 13 6 7 8
10 12 13
While these are the most universal data formats, network data representation is
a bit of a cottage industry
Pajek (.net)
GraphML (.gml)
GraphViz (.dot)
..and domain specific formats (eek!)
Drew Conway Mining and Analyzing Online Social Graph Data
8. Network basics
The study of relationships
Research design
Data representation
Digging into data
Analysis & Visualization
Live demonstration
A bit on tools
The number of software suites and packages available for conducting social
network analysis has exploded over the past ten years
In general, this software can be categorized in two ways:
Type - many SNA tools are developed to be standalone applications, while
others are language specific packages
Intent - consumers and producer of SNA come from a wide range of
technical expertise and/or need, therefore, there exist simple tools for data
collection and basic analysis, as well as complex suites for advanced research
Standalone Apps Modules & Packages
- ORA (Windows) - libSNA (Python)
Basic - Analyst Notebook (Windows) - UrlNet (Python)
- KrakPlot (Windows) - NodeXL (MS Excel)
- UCINet (Windows) - NetworkX (Python)
Advanced - Pajek (Multi) - JUNG (Java)
- Network Workbench (Multi) - igraph (Python, R & Ruby)
Many of the above tools have visualization components, but several tools are
designed specifically for visualization: Graphviz, NetDraw, Tom Sawyer, Gephi,
etc.
What I use
Drew Conway Mining and Analyzing Online Social Graph Data
9. Network basics
The study of relationships
Research design
Data representation
Digging into data
Analysis & Visualization
Live demonstration
Comparing two network metrics to find key actors
Often network analysis is used to identify key actors within a social group. To
identify these actors, various centrality metrics can be computed based on a
network’s structure
Degree (number of connections)
Betweenness (number of shortest paths an actor is on)
Closeness (relative distance to all other actors)
Eigenvector centrality (leading eigenvector of sociomatrix)
One method for using these metrics to identify key actors is to plot actors’
scores for Eigenvector centrality versus Betweenness. Theoretically, these
metrics should be approximately linear; therefore, any non-linear outliers will be
of note.
An actor with very high betweenness but low EC may be a critical
gatekeeper to a central actor
Likewise, an actor with low betweenness but high EC may have unique
access to central actors
Drew Conway Mining and Analyzing Online Social Graph Data
10. Network basics
The study of relationships
Research design
Data representation
Digging into data
Analysis & Visualization
Live demonstration
First, visualize the data
Visualization can be the best first step in the analytical process
This will give you a good
feel for what is going on
with your relationships.
For large networks this is
often not possible
For this example, we will
use the main component
of the social network
collected on drug users in
Hartford, CT. The network
has 194 nodes and 273
edges.
Here I am using the
GUESS visualizer in NWB
with a Kamada-Kawai
layout
Drew Conway Mining and Analyzing Online Social Graph Data
11. Network basics
The study of relationships
Research design
Data representation
Digging into data
Analysis & Visualization
Live demonstration
Finding Key Actors with R
The first steps are to load the data into memory, and perform some basic
centrality analysis1
Load the data into igraph
library(igraph)
G<-read.graph("drug_main.txt",format="edgelist")
G<-as.undirected(G)
# By default, igraph inputs edgelist data as a directed graph.
# In this step, we undo this and assume that all relationships are reciprocal.
Store metrics in new data frame
cent<-data.frame(bet=betweenness(G),eig=evcent(G)$vector)
# evcent returns lots of data associated with the EC, but we only need the
# leading eigenvector
res<-lm(eig~bet,data=cent)$residuals
cent<-transform(cent,res=res)
# We will use the residuals in the next step
1
Weeks, et al (2002) http://dx.doi.org/10.1023/A:1015457400897
Drew Conway Mining and Analyzing Online Social Graph Data
12. Network basics
The study of relationships
Research design
Data representation
Digging into data
Analysis & Visualization
Live demonstration
Finding Key Actors with R
Key Actor Analysis for Hartford Drug Users
1.0 44
Plot the data res
library(ggplot2) 0.8 −0.2
# We use ggplot2 to make things a 0
# bit prettier 0.2
p<-ggplot(cent,aes(x=bet,y=eig, 28
53 0.4
label=rownames(cent),colour=res, 0.6 0.6
size=abs(res)))+xlab("Betweenness
Centrality
Eigenvector
Centrality")+ylab("Eigenvector 50
abs(res)
Centrality")
0.1
# We use the residuals to color and
0.4 0.2
# shape the points of our plot, 141
47
# making it easier to spot outliers. 20 56
0.3
155
58 43
p+geom_text()+opts(title="Key Actor 100
22 126 0.4
57
147
Analysis for Hartford Drug Users") 23
140 18 0.5
104 19
# We use the geom_text function to plot 0.2 75 65
84
139
41 9
0.6
13685
# the actors’ ID’s rather than points 48
17 87
55 54
63 89105
94 77
184
67
0.7
# so we know who is who 101
64 165
93
3349 162
86 91
177
170 21
36168
174
52 111 27 16
42 68
6 12 69
148
169 149 78
183
181
110 172
25
185
182
150 61 11 176
130
45 79
138 66 62
0.0
124
26
164 121
163
189
179
152
119 127 38 88 151
133
161
166
107276114399 46 5 97 608 118 4
120 240 137
109 107159 157144 73 116
51 95 90
135
186 190 123
187
156 173 192
146
96
122 82
125 143
134 37132142 72
13
112 71
167
31
153
145
128
129
188 193
191
15
171
92
39
13
180
158
175 178 106
108
194
113
98
81
59
83
80
74103 154 24 117 70
160 115
34
131 14
30 29 35 102
0 1000 2000 3000 4000 5000 6000
Betweenness Centrality
Drew Conway Mining and Analyzing Online Social Graph Data
13. Network basics
The study of relationships
Research design
Data representation
Digging into data
Analysis & Visualization
Live demonstration
Key Actor Plot
q
q
q
q
q
q
q
q
q q
q
q
q
q
q
q
q
q
q q
q
q q
q
q q q q q
q
q
141
q
q
q q
q 155 Network plot
q
q
q
q
q
q q
58 47 44 q
q q # Create positions for all of
q 28
q q
q
q q
q
q q
q
q
q q
q
q
# the nodes w/ force directed
l<-layout.fruchterman.reingold(G,
53
q
q q
q
q
q
q q q q q
niter=500)
20 50
q
q
q
q q # Set the nodes’ size relative to
q
q q
q
q
q
q
q
q
q
q
q
# their residual value
q
q
q
q
q q
q
V(G)$size<-abs(res)*10
q q
q q q
q q
q
# Only display the labels of key
q
q
q
# players
q
q
q
q q
q
nodes<-as.vector(V(G)+1)
q
q
q
q
q
q
q
q
# Key players defined as have a
q
q
q
q
# residual value >.25
nodes[which(abs(res)<.25)]<-NA
q
q
q
q
79
q
# Save plot as PDF
q q
q
pdf(‘actor_plot.pdf’,pointsize=7)
q
q q
q plot(G,layout=l,vertex.label=nodes,
q q
102 q
vertex.label.dist=0.25,
q
q q q
q q
q
q
q
q
q
q
q
q
q
vertex.label.color=‘red’,edge.width=1)
q
q
q
q q
q q
q
q
dev.off()
q q
q
q q q
q q
q
q
q
q
q
q q
q
q q
q
q
q
q
q
q q q
q
q
q
q
Drew Conway Mining and Analyzing Online Social Graph Data
14. Network basics
Research design Edge context
Digging into data Network data and ethics
Live demonstration
Not all edges are created equally
I have spent a lot of time tonight describing network analysis as a way to
understand relationships
Depending on the source and context of the data, these relationships can be interpreted as many different
things.
Must consider the data generation process by which the edge was created
With respect to online social graph data, ask: how do people use this service?
Facebook Google SocialGraph
Twitter
Ties here are driven by Connecting all of Google’s
Recent study shows
personal contact social sites together
Twitter is used as news
me → offline → you me → anything → you
aggregation service
me → interests → you Offline relationships The combining of
are driving multiple platforms
Geography and
“friending” into a single
history less important
Considerable amount “network” makes
Networks may cluster analysis and
of meta-data already
around communities interpretation
in FB, what does this
of interest difficult
add?
Drew Conway Mining and Analyzing Online Social Graph Data
15. Network basics
Research design Edge context
Digging into data Network data and ethics
Live demonstration
Ethics and network data
Why should we be discusses ethics at a data analytics seminar?
“People have really gotten comfortable
not only sharing more information and “Money is a terrible master but an
different kinds, but more openly and excellent servant”
with more people. That social norm is
just something that has evolved over “There’s a sucker born every minute”
time.”
P.T. Barnum
Mark Zuckerburg, CEO, Facebook
In isolation, the data we provide online about our relationships and preferences
are fairly innocuous. Their summation, however, can illuminate aspects of
our lives that can be exploited for any number of ways.
Social networking services are moving previously private data to the public
Simply because data is available does not mean that individuals want or
realize that it can be used to make inferences about who they are or their
lifestyles
Analysts are caught in the middle. There is no IRB on the Internet!
Drew Conway Mining and Analyzing Online Social Graph Data
16. Network basics
Research design Edge context
Digging into data Network data and ethics
Live demonstration
Examples of ethically questionable network analyses
MIT study predicts sexual orientation from Facebook friends
Two undergraduate students start a project ‘Gaydar’
Claim they can predict which men are homosexual simply based on their
friend structure
Problem: Extremely private information, questionable methods
Pleaserobme.com
Combined information from fousquare.com and Twitter to publicize when
people were clearly not at home
Used as a proof of concept to show the danger of publicizing localization
information
Problem: Useful for raising awareness, hurtful to anyone who was actually
robbed
Project Grey Goose
Large group of former and current DoD/IC analysts collaborated to study
the identity of hackers
I was personally involved in this project
Using public web forum data, built up user profiles and social networks to
attempt to identify hackers affiliated with Russian government
Problem: While no names were published, bordering on vigilantism
Drew Conway Mining and Analyzing Online Social Graph Data
17. Network basics
Research design Getting data
Digging into data First scrape and build
Live demonstration
Where to get social graph data
Recently, there has been an explosion of resources for scraping social graph
Service Data API Docs
Following(ers), @-replies, date/time/geo http://apiwiki.twitter.com/
Friends, Wall Posts, date/time http://developers.facebook.com/docs/api
All SocialGraph relationships http://code.google.com/apis/socialgraph/
Friends, Check-ins http://foursquare.com/developers/
“Taste graph”, recommendations http://hunch.com/developers/
Congressional votes, campaign finance http://developer.nytimes.com/docs
There is clearly no shortage of data
Each service provides different relational context
Data formats are generally JSON, Atom, XML, or some combination
For a more extensive list of API resources, see HackNY wiki of local
startups
Drew Conway Mining and Analyzing Online Social Graph Data
18. Network basics
Research design Getting data
Digging into data First scrape and build
Live demonstration
Building the social network among LiveJournal users
Using a “seed” user, we will build out a
network
In Python, use NetworkX, cjson
and a other standard scientific
libraries, parse the SocialGraph
data
Through a process called
“k-snowball searching”
seed → friend → · · · → friendk
Seed: imichaeldotorg.livejournal.com
k =3
Note the low value of k
Drew Conway Mining and Analyzing Online Social Graph Data
19. Network basics
Research design Getting data
Digging into data First scrape and build
Live demonstration
The code, part 1
Loading the libraries and setting things up
from cjson import *
from urllib import *
from networkx import *
Name: [‘http://imichaeldotorg.livejournal.com/’]
from time import *
Type: DiGraph
from scipy import array,unique
Number of nodes: 5
...
Number of edges: 5
if __name__ == "__main__":
Average in degree: 1.0
seed_url=‘‘http://imichaeldotorg.livejournal.com"
Average out degree: 1.0
sg=get_sg(seed_url)
net,newnodes=create_egonet(sg)
info(net)
Get the JSON from SocialGraph
def get_sg(seed_url):
sgapi_url="http://socialgraph.apis.google.com/lookup?q="+seed_url+"&edo=1&edi=1&fme=1&pretty=0"
try:
furl=urlopen(sgapi_url)
fr=furl.read()
furl.close()
return fr
except IOError:
print "Could not connect to website"
print sgapi_url
return
Drew Conway Mining and Analyzing Online Social Graph Data
20. Network basics
Research design Getting data
Digging into data First scrape and build
Live demonstration
Build egonet and snowball
Creating the egonet Rolling the snowball
def create_egonet(s): def snowball_round(G,seeds,myspace=False):
try: t0=time()
raw=decode(s) if myspace:
G=DiGraph() seeds=get_myspace_url(seeds)
pendants=[] sb_data=[]
n=raw[’nodes’] for s in range(0,len(seeds)):
nk=n.keys() s_sg=get_sg(seeds[s])
G.name=str(nk) new_ego,pen=create_egonet(s_sg)
pendants=[] for p in pen:
for a in range(0,len(nk)): sb_data.append(p)
for b in range(0,len(nk)): if s<1:
if a!=b: sb_net=compose(G,new_ego)
G.add_edge(nk[a],nk[b]) else:
for k in nk: sb_net=compose(new_ego,sb_net)
ego=n[k] del new_ego
ego_out=ego[’nodes_referenced’] if s==round(len(seeds)*0.2):
for o in ego_out: sb_net.name=’20% complete’
G.add_edge(k,o) sb_net.info()
pendants.append(o) print ’AT: ’+strftime(’%m/%d/%Y, %H:%M:%S’, gmtime())
ego_in=ego[’nodes_referenced_by’] print ’’
for i in ego_in: ...
G.add_edge(i,k) # More time keeping, probably a MUCH better way to do this
pendants.append(i) sb_data=array(sb_data)
pendants=array(pendants,dtype=str) sb_data.flatten()
pendants.flatten() sb_data=unique(sb_data)
pendants=unique(pendants) sb_net.info()
return G,pendants return sb_net,sb_data
except DecodeError:
...
except KeyError:
Drew Conway Mining and Analyzing Online Social Graph Data
21. Network basics
Research design Getting data
Digging into data First scrape and build
Live demonstration
Build the whole network
Step Nodes Edges Mean Degree Density Our seed is abnormally isolated, with only four
Seed 5 5 2.0 0.25 neighbors
k =2 75 115 3.0 0.02 Large jump after first snowball
k =3 4,938 8,659 3.5 3.6(10−4 )
Massive structural leap at k = 3
Drew Conway Mining and Analyzing Online Social Graph Data
22. Network basics
Research design Getting data
Digging into data First scrape and build
Live demonstration
The full network
To get a feeling for the size of the full network...
Drew Conway Mining and Analyzing Online Social Graph Data
23. Network basics
Research design
Digging into data
Live demonstration
Live demo
Live demonstration time!
Drew Conway Mining and Analyzing Online Social Graph Data