Social Network Analysis Introduction including Data Structure Graph overview.
Social Network Analysis
Presentation by @dougneedham
Data Guy - Started as a DBA in the Marine Corps, evolved to Architect,
now Data Scientist.
Oracle, SQL Server, Cassandra, Hadoop, MySQL, Spark.
I have a strong relational/traditional background.
Learning new things challenges our assumptions. It forces us to take a
new perspective on “old” problems. Eventually it may even show us
that there is a better way to solve a problem.
Why study social networks?
It is cool.
The concepts around Social Network Analysis can be applied to many
interesting problems in a variety of business verticals.
The foundation of Social Network Analysis is Graph theory.
Some examples: Introduction to Graph_Theory
What is Social Network Analysis?
“Social network analysis (SNA) is a strategy for investigating social
structures through the use of network and graph theories. It
characterizes networked structures in terms of nodes (individual actors,
people, or things within the network) and the ties or edges
(relationships or interactions) that connect them. Examples of social
structures commonly visualized through social network analysis include
social media networks, friendship and acquaintance networks, kinship,
disease transmission, and sexual relationships. These networks are often
visualized through sociograms in which nodes are represented as points
and ties are represented as lines.” – Wikipedia
Example from Wikipedia:
Image: Kencf0618, own work. Licensed under
CC BY-SA 3.0 via Wikimedia Commons.
A little History
The 7 Bridges of Königsberg
Every tome on Graph theory or Network analysis devotes a small
portion of its pages to the 7 Bridges of Königsberg.
If I don’t cover this with you, the gods of mathematics will strike me
down, and never allow me to do analysis again in the future.
Folks enjoyed their Sunday afternoon strolls across the bridges, but
occasionally people would wonder if one particular route was more
efficient than another.
Eventually Leonhard Euler was brought into the debate about the
Euler used Vertices to represent the land masses and edges (or arcs, at
the time) to represent bridges. The question was whether a walk could
cross every bridge exactly once. He realized that the odd number of edges
at every vertex made the problem unsolvable.
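Euler's degree argument can be checked in a few lines. This is a minimal sketch, not a historical reconstruction: the vertex names ("north", "south", "island", "east") are labels chosen here for the four land masses, and the check assumes the graph is connected.

```python
from collections import Counter

# The seven bridges as edges of a multigraph.
# Vertex names are illustrative labels for the four land masses.
bridges = [
    ("north", "island"), ("north", "island"),
    ("south", "island"), ("south", "island"),
    ("north", "east"), ("south", "east"),
    ("island", "east"),
]

def degrees(edges):
    """Count how many edge ends touch each vertex."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return deg

def has_euler_walk(edges):
    """For a connected multigraph: a walk using every edge exactly once
    exists only if zero or two vertices have odd degree."""
    odd = sum(1 for d in degrees(edges).values() if d % 2 == 1)
    return odd in (0, 2)

deg = degrees(bridges)
print(dict(deg))                # degree of each land mass
print(has_euler_walk(bridges))  # False: all four vertices have odd degree
```

Removing or adding a single bridge leaves exactly two odd vertices, and the same function then returns True.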
Sarada Herke provides one of the best explanations of the solution:
Solution to Königsberg
And here is the cool thing about mathematicians. If we tell you
something is impossible, we have to tell you why in a way you can
understand. In doing so, Euler also invented the branch of mathematics
we now call Graph Theory.
Why analyze Facebook data?
Facebook is something that most people use.
It is easy to see the relationships and the concepts of the
Graph/Network are intuitive to people who are looking at their “own”
The main idea is, if you can understand your own friend data, you can
learn the concepts quickly, then apply these same concepts to more
We will talk a little about some complicated topics at the end.
A few terms
Stand back, we are going to talk about math!
Basically we are talking about a bunch of dots joined together by lines
Vertex – Dot on a graph
Edge – Line connecting two vertices
Edge_Label – this is a term I coined originally related to Data Structure Graphs that
helps trace a path. If you label your edges, and you have multiple edges with the same
label in a Graph you can quite easily identify walks, paths, and cycles through your
Triangle – 3 Vertices, 3 Edges
Square – 4 Vertices, 4 edges
Open Triangle – 3 Vertices, 2 Edges
A lot of things are networks if you look at them the right way.
Mark Newman has given a number of excellent presentations on network analysis,
available on YouTube.
Transitivity – The friend of my friend is my friend. Really?
Homophily – how things are similar
Directed Graphs – or Digraphs
Contagion – How do things “spread” through a network?
Let’s rearrange things, how does the layout affect understanding?
Order of a graph – number of vertices
Size of the graph – number of edges
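To make these terms concrete, here is a small sketch that computes the order, the size, and the number of triangles and open triangles of a toy friendship graph. The names and edges are made up for illustration. The transitivity ratio used here (three times the triangle count divided by the number of connected triples) is one common definition, not necessarily the one a given tool reports.

```python
from itertools import combinations

# Hypothetical undirected friendship edges.
edges = [("Ann", "Bob"), ("Bob", "Cat"), ("Ann", "Cat"), ("Cat", "Dan")]

# Build an adjacency map: vertex -> set of neighbours.
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

order = len(adj)    # number of vertices
size = len(edges)   # number of edges

triangles = 0
open_triangles = 0
for a, b, c in combinations(sorted(adj), 3):
    links = (b in adj[a]) + (c in adj[a]) + (c in adj[b])
    if links == 3:
        triangles += 1        # 3 vertices, 3 edges
    elif links == 2:
        open_triangles += 1   # 3 vertices, 2 edges

# Each triangle contains three connected triples, each open triangle one.
transitivity = 3 * triangles / (3 * triangles + open_triangles)
print(order, size, triangles, open_triangles, transitivity)
```

A transitivity of 1.0 would mean every friend of a friend is also a friend. Here it is lower because Dan only knows Cat.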
This is not just data visualization, it can also be used for prediction.
Centrality – Hub and Authority
This is almost a whole topic by itself, since there are different types of
Degree Centrality – Simple, the Vertex with the most degrees is the most
Eigenvector Centrality – How important a particular Vertex is to a given
PageRank – similar to Eigenvector Centrality, only scaled, and if a given
vertex is closely connected to very high PageRank vertex, it is itself given a
Serious nutshell definitions.
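As a rough illustration of two of these measures, the sketch below computes degree centrality and a basic PageRank by repeated averaging on a tiny undirected graph. The graph, the damping factor of 0.85, and the fixed iteration count are assumptions made for this example. Gephi's own implementation has more options and may differ in detail.

```python
# Hypothetical undirected graph as an adjacency map.
adj = {
    "hub": {"a", "b", "c"},
    "a": {"hub", "b"},
    "b": {"hub", "a"},
    "c": {"hub"},
}

def degree_centrality(adj):
    """Share of the other vertices that each vertex is directly connected to."""
    n = len(adj)
    return {v: len(neighbours) / (n - 1) for v, neighbours in adj.items()}

def pagerank(adj, damping=0.85, iterations=100):
    """Basic power iteration. Assumes every vertex has at least one neighbour."""
    n = len(adj)
    rank = {v: 1.0 / n for v in adj}
    for _ in range(iterations):
        new_rank = {}
        for v in adj:
            # Each neighbour passes on an equal share of its own rank.
            incoming = sum(rank[u] / len(adj[u]) for u in adj[v])
            new_rank[v] = (1 - damping) / n + damping * incoming
        rank = new_rank
    return rank

dc = degree_centrality(adj)
pr = pagerank(adj)
print(max(dc, key=dc.get), max(pr, key=pr.get))
```

On a graph this small both measures pick the same vertex. They tend to diverge on larger graphs, where being connected to well-connected vertices matters more than raw degree.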
Shortest path – How are two vertices connected?
Longest Path – Tracing the flow of an interesting item through a large
collection of applications.
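For an unweighted graph, the shortest path between two vertices can be found with a breadth-first search. The sketch below uses a made-up friendship graph, and the function name is chosen here for illustration.

```python
from collections import deque

# Hypothetical friendship graph: vertex -> list of neighbours.
friends = {
    "Ann": ["Bob"],
    "Bob": ["Ann", "Cat"],
    "Cat": ["Bob", "Dan"],
    "Dan": ["Cat"],
    "Eve": [],
}

def shortest_path(adj, start, goal):
    """Breadth-first search. Returns the list of vertices on a shortest
    path from start to goal, or None if they are not connected."""
    previous = {start: None}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        if current == goal:
            path = []
            while current is not None:
                path.append(current)
                current = previous[current]
            return path[::-1]
        for neighbour in adj.get(current, ()):
            if neighbour not in previous:
                previous[neighbour] = current
                queue.append(neighbour)
    return None

print(shortest_path(friends, "Ann", "Dan"))
print(shortest_path(friends, "Ann", "Eve"))
```

The number of edges on the returned path is the "degrees of separation" between the two people.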
Why is a path important? More on this
The Original Joke This is me in different stores
The Math doesn’t change.
One thing I like about Graphs –
The Math does not change.
The math behind Graph theory can be a little intense, but it does not
change regardless of the scale of the graph.
Once you understand how to “do the math” on a small graph, that
same math applies to any Graph, whether it is a graph of the people in this
room or a graph of the people on this planet.
Now, let me introduce you to a tool that does much of the
Mathematics for you…
But first, Netvizz…
Netvizz is a tool that extracts data from different sections of the Facebook Platform.
It provides an interface to the Facebook Graph API
For the version of data we will be looking at, I was able to extract friendship connections.
Facebook has since changed their permissions such that you can no longer extract this
However, there are some other interesting things you can do with Netvizz.
If you manage a Facebook Group, this might be interesting.
For this particular talk we are going to focus on Gephi interpretation. If we want to have a
more in-depth talk on Facebook and the Graph API that Facebook has opened, we can
discuss that at another time.
To get this yourself, go into Facebook and search for: Netvizz. (You have to authorize it. You
can de-authorize it later.)
You will have a number of options: group data, page data, page like network, search, and
Click “group data”
Select a group. If you need a sample ID, use: 39462256584
It runs for a bit, then dumps to a zip file.
Save the file, then extract it.
Open Gephi, and use Gephi to import your GDF file.
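If you want to peek inside a GDF file before loading it into Gephi, a few lines of Python are enough. This is a hedged sketch: it assumes the usual GDF layout of a `nodedef>` header followed by node rows and an `edgedef>` header followed by edge rows. The sample text and its columns are invented here, and the columns Netvizz actually writes may differ.

```python
import csv
import io

# Invented sample in the assumed GDF layout.
sample_gdf = """nodedef>name VARCHAR,label VARCHAR
n1,Alice
n2,Bob
n3,Carol
edgedef>node1 VARCHAR,node2 VARCHAR
n1,n2
n2,n3
"""

def parse_gdf(text):
    """Split GDF text into a list of node records and a list of edge records."""
    nodes, edges = [], []
    section, columns = None, []
    for row in csv.reader(io.StringIO(text)):
        if not row:
            continue
        if row[0].startswith(("nodedef>", "edgedef>")):
            section = row[0][:7]                # "nodedef" or "edgedef"
            header = [row[0][8:]] + row[1:]     # drop the "xxxdef>" prefix
            columns = [cell.split()[0] for cell in header]  # drop the types
            continue
        if section is None:
            continue  # ignore anything before the first header
        record = dict(zip(columns, row))
        (nodes if section == "nodedef" else edges).append(record)
    return nodes, edges

nodes, edges = parse_gdf(sample_gdf)
print(len(nodes), "nodes,", len(edges), "edges")
```

For a real export, read the extracted file's text and pass it to `parse_gdf` in place of the sample.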
From the website: “Gephi is an
interactive visualization and exploration
platform for all kinds of networks and
complex systems, dynamic and
Java 1.7 required, you may have to set
this in Gephi.conf
Depending on the size of the network
you are studying you may need to
increase the memory available to Java
How do we use this?
You have to ignore the fact that everyone on this graph is connected
to you for a moment.
How would someone get a message to another given person?
They would have to pass it either to someone they both know, or
to someone who is more likely to be connected to the
target of the message.
This was the heart of Milgram’s experiment that gave us the concept of
6 degrees of separation.
What else can be done with Social Network Analysis?
How about risk exposure to banks?
Application to Business Intelligence
What if the Vertices are not people?
What if the Edges are not mutual connections?
Jonathan and others over the past few meetings have done a great
job at explaining the underpinnings of how a particular BI framework is
Within a Data Architecture there are lots of moving pieces. ETL, FTP,
SFTP, Web-Services, External data feeds. Data moving into Data Marts,
and Data Warehouses. Data Moving between applications.
Let’s imagine how to visualize this using the information we just gained.
Data Structure Graph
A Data Structure Graph is a group of atomic entities that are related to
each other, stored in a repository, then moved from one persistence
layer to another, rendered as a Graph.
A group of atomic entities.
Related to each other.
Stored in a repository.
Moved from one persistence layer to another.
Rendered as a Graph.
Introducing Data Structure Graphs
Data Structure Graph Level 1 (DSG-L1) – This is roughly like an Entity
Relationship Diagram (ERD). Tables are Vertices, Foreign Keys are Edges.
Data Structure Graph Level 2 (DSG-L2) – Each Vertex in this graph is an
application. Each Edge is data transfer. Roughly equivalent to what we
used to call Data Flow diagrams.
Data Structure Graph Dependency (DSG-D) – Each Vertex is a
job, script, program, or process that depends on something else
happening in sequence before it can do its work.
A DSG-L1 can show you which of your tables are likely to have the most
interesting query performance.
A DSG-L2 can show you where the most amount of work is going on in
A DSG-D can show you the sequence of events that need to take
place in order for something to be completed.
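The sequencing idea behind a DSG-D can be sketched as a topological sort. The job names below are hypothetical, and the function uses Kahn's algorithm, a standard technique swapped in here for illustration rather than anything specific to Data Structure Graphs. It assumes every dependency also appears as a key.

```python
from collections import deque

# Hypothetical jobs: each key depends on the jobs in its list.
depends_on = {
    "extract_orders": [],
    "extract_customers": [],
    "load_staging": ["extract_orders", "extract_customers"],
    "build_warehouse": ["load_staging"],
    "refresh_reports": ["build_warehouse"],
}

def run_order(depends_on):
    """Return the jobs in an order that respects every dependency."""
    remaining = {job: len(deps) for job, deps in depends_on.items()}
    dependents = {job: [] for job in depends_on}
    for job, deps in depends_on.items():
        for dep in deps:
            dependents[dep].append(job)
    # Start with the jobs that wait on nothing.
    ready = deque(sorted(job for job, n in remaining.items() if n == 0))
    order = []
    while ready:
        job = ready.popleft()
        order.append(job)
        for nxt in dependents[job]:
            remaining[nxt] -= 1
            if remaining[nxt] == 0:
                ready.append(nxt)
    if len(order) != len(depends_on):
        raise ValueError("cycle detected: some jobs can never start")
    return order

order = run_order(depends_on)
print(order)
```

A cycle in the dependency graph means no valid sequence exists, which is why the function raises instead of returning a partial order.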
Some of you may have heard of Dijkstra’s algorithm.
It is a method for finding the shortest path between two nodes on a
This is a great optimization technique, but what if you need to find the
What “edge_label” has the most influence on my organization?
Iterate through each Edge_Label, create a subgraph that consists of
only the nodes this Edge_Label touches, then calculate the diameter of
The data point whose Edge_Label subgraph has the longest
path has the most “value” to your organization.
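The Edge_Label procedure can be sketched as follows. The rows are invented, and several choices here are assumptions: each subgraph keeps only the edges carrying that label, edges are treated as undirected, and the diameter is taken as the longest shortest path between vertices that can reach each other.

```python
from collections import defaultdict, deque

# Hypothetical (source, target, edge_label) rows.
flows = [
    ("CRM", "Staging", "customer"),
    ("Staging", "Warehouse", "customer"),
    ("Warehouse", "Reports", "customer"),
    ("Billing", "Staging", "invoice"),
    ("Staging", "Warehouse", "invoice"),
    ("HR", "Warehouse", "employee"),
]

def subgraphs_by_label(rows):
    """One undirected adjacency map per Edge_Label."""
    graphs = defaultdict(lambda: defaultdict(set))
    for source, target, label in rows:
        graphs[label][source].add(target)
        graphs[label][target].add(source)
    return graphs

def eccentricity(adj, start):
    """Distance from start to the farthest vertex it can reach."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for neighbour in adj[current]:
            if neighbour not in distance:
                distance[neighbour] = distance[current] + 1
                queue.append(neighbour)
    return max(distance.values())

def diameter(adj):
    """Longest shortest path over all reachable pairs of vertices."""
    return max(eccentricity(adj, v) for v in adj)

diameters = {label: diameter(adj)
             for label, adj in subgraphs_by_label(flows).items()}
most_valuable = max(diameters, key=diameters.get)
print(diameters, most_valuable)
```

In this toy data the "customer" entity travels the farthest, so it comes out on top.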
Hard to see, I know, but the top diagram is the “master graph”, the bottom image is a single Edge_Label. You
can see how an individual data entity flows through an organization.
Goes through a number of examples of doing a Graph analysis of a fictional organization.
Consider the following:
If you need assistance, send a message to the group, or contact me
directly (I am easy to find @dougneedham)
Network/Graph Analysis is cool.
It can show you some interesting things about your data that you may
not have considered.
Due thought should be put towards a network analysis project.
Organizing the data requires a bit of thought. (From -> To vertices is just
Directed graph, undirected, bigraph? Setup work needs to be done.
Tools help with the detailed calculations, and show the paths, walks,
What did I leave out?
Graphs that change over time – What happens when you remove a single
Edge or Vertex?
Growth of a Network – Erdős–Rényi versus Barabási–Albert models (Random
versus Preferential Attachment)
Scale Free networks – Graphs that conform to Power laws. (These are
intrinsically Social Networks, but I didn’t give much detail)
Comparing two networks – If two graphs have the same number of edges and
nodes, are they the same? Is one graph isomorphic to the other?
Contagion – Ceteris paribus, how will things (information, viruses,
data, disease…) spread through the network? (Since a DSG represents
different types of Edges based on Edge_Label, Contagion should not affect
this type of network entirely.)
Large Graphs – GraphX, a part of Apache Spark, is best used for this.
The strength of Weak Ties Paradox
Finally… Want to do Data Science?
Challenge for members of the audience.
1. Download Gephi.
2. Put together a simple CSV: Source, Target, Edge_Label that describes
your own data environment.
3. Load it in Gephi and have Gephi run the metrics, and perform the auto
4. Answer this question: Did you get what you expected?
5. Get a colleague to do the same thing, compare the images. How similar
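For step 2, the CSV can be produced by hand or with a short script. The sketch below builds one in memory with the Source, Target, Edge_Label header, then counts how many edges touch each application as a first look at the degree distribution. The application names and flows are made up.

```python
import csv
import io
from collections import Counter

# Hypothetical data flows: (source application, target application, data entity).
rows = [
    ("CRM", "Staging", "customer"),
    ("Staging", "Warehouse", "customer"),
    ("Warehouse", "Reports", "customer"),
    ("Billing", "Staging", "invoice"),
    ("Staging", "Warehouse", "invoice"),
    ("HR", "Warehouse", "employee"),
]

# Write the edge list as CSV text.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["Source", "Target", "Edge_Label"])
writer.writerows(rows)
csv_text = buffer.getvalue()

# Read it back and count the edges touching each application.
degree = Counter()
for record in csv.DictReader(io.StringIO(csv_text)):
    degree[record["Source"]] += 1
    degree[record["Target"]] += 1

# How many applications have each degree.
distribution = Counter(degree.values())
print(degree.most_common())
print(dict(distribution))
```

To load the result in Gephi, save `csv_text` to a file. A few heavily connected applications alongside many lightly connected ones is the pattern preferential attachment would produce.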
Here is my hypothesis: If you have more than 5 data applications, including
Hadoop, and Data Warehouse infrastructure, your Graph will follow the
rules of preferential attachment. (To<->From ETL tools don’t count in the
Tweet me @dougneedham #DataStructureGraph (anonymized, of course.)
What does your Graph look like?