Overview of GraphX
Presentation by @dougneedham
Data Guy - Started as a DBA in the Marine Corps, evolved to Architect,
now aspiring Data Scientist.
Oracle, SQL Server, Cassandra, Hadoop, MySQL, Spark.
I have a strong relational/traditional background.
Learning new things challenges our assumptions and forces us to take a
new perspective on “old” problems. Eventually it may even show us
that there is a better way to solve a problem.
Graphs: What problems do they solve?
Some examples: Introduction to Graph_Theory
There are many ways of constructing networks, and how exactly you construct them
depends on the questions you are posing.
Economics: You don’t participate in an economy by yourself; you make purchases
from others. Record enough transactions and you have a graph.
Almost anything can be modeled as a graph. However, it does require a slight shift in perspective.
One of the most used examples is a citation network for academic publications.
I publish a paper, then you cite my paper in your publication.
Tracing the citations back through the tree shows which paper ultimately had the largest influence.
A little History
The 7 Bridges of Königsberg
Every tome on Graph theory or Network analysis devotes a small
portion of its time to the 7 Bridges of Königsberg.
If I don’t cover this with you, the gods of mathematics will strike me
down, and never allow me to do analysis again in the future.
Folks enjoyed their Sunday afternoon strolls across the bridges, and
people would wonder whether there was a route that crossed every
bridge exactly once.
Eventually Leonhard Euler was brought into the debate about the
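Euler used vertices to represent the land masses and edges (or arcs, at
the time) to represent bridges. He realized that every land mass had an
odd number of bridges, and that this made the problem unsolvable: a walk
that uses every edge exactly once can have at most two vertices with an
odd number of edges.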
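Euler's degree argument is small enough to check by hand, but here is a minimal plain-Scala sketch of it. The land-mass names are labels I made up for this example, and the check assumes the graph is connected (which the Königsberg graph is).

```scala
object Konigsberg {
  // The seven bridges as undirected edges between four land masses.
  // The names are illustrative labels, not historical ones.
  val bridges: Seq[(String, String)] = Seq(
    ("North", "Island"), ("North", "Island"),
    ("South", "Island"), ("South", "Island"),
    ("North", "East"), ("South", "East"), ("Island", "East"))

  // Degree of a vertex = number of bridge ends that touch it.
  val degree: Map[String, Int] =
    bridges.flatMap { case (a, b) => Seq(a, b) }
      .groupBy(identity)
      .map { case (land, ends) => land -> ends.size }

  val oddVertices: Int = degree.values.count(_ % 2 == 1)

  // For a connected graph, a walk using every edge exactly once
  // exists only if zero or two vertices have odd degree.
  val walkExists: Boolean = oddVertices == 0 || oddVertices == 2

  def main(args: Array[String]): Unit = {
    degree.toSeq.sortBy(_._1).foreach { case (land, d) => println(s"$land: $d") }
    println(s"odd vertices = $oddVertices, walk exists = $walkExists")
  }
}
```

All four land masses come out with an odd degree, so no such walk exists.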
Sarada Herke provides one of the best explanations of the solution
Solution to Königsberg
And here is the cool thing about mathematicians: if we tell you
something is impossible, we have to tell you why in a way you can
understand. In doing so, Euler also invented the branch of mathematics
we now call Graph Theory.
A few terms
Stand back, we are going to talk about math!
Basically we are talking about a bunch of dots joined together by lines
Vertex – Dot on a graph
Edge – Line connecting two vertices
Edge_Label – this is a term I coined originally related to Data Structure Graphs that
helps trace a path. If you label your edges, and you have multiple edges with the same
label in a Graph you can quite easily identify walks, paths, and cycles through your
Triangle – 3 Vertices, 3 Edges
Square – 4 Vertices, 4 edges
Open Triangle - 3 Vertices, 2 edges
A lot of things are networks if you look at them the right way.
Mark Newman has done a number of really cool presentations about
Network analysis, available on YouTube.
Shortest path – How are two vertices connected?
Longest Path – Tracing the flow of an interesting item through a large
collection of applications.
What is a path?
Centrality – Hub and Authority
This is almost a whole topic by itself, since there are different types of Centrality:
Degree Centrality, Eigenvector Centrality, PageRank, etc…
Homophily – how things are similar
Directed Graphs – or Digraphs
Contagion – How do things “spread” through a network?
Let’s rearrange things, how does the layout affect understanding?
Order of a graph – number of vertices
Size of the graph – number of edges
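To make a few of these terms concrete, here is a small plain-Scala sketch on a four-vertex undirected graph that I made up for the purpose. It computes the order, the size, the degree of each vertex, and which triples of vertices form triangles or open triangles. It is only an illustration of the definitions, not how you would do this at scale.

```scala
object GraphTerms {
  // A made-up undirected graph: a triangle 1-2-3 with a tail 3-4.
  val edges: Seq[(Int, Int)] = Seq((1, 2), (2, 3), (3, 1), (3, 4))

  val vertices: Set[Int] = edges.flatMap { case (a, b) => Seq(a, b) }.toSet

  val order: Int = vertices.size // number of vertices
  val size: Int = edges.size     // number of edges

  // Degree = number of edges touching a vertex.
  val degree: Map[Int, Int] =
    vertices.map(v => v -> edges.count { case (a, b) => a == v || b == v }).toMap

  // Store both directions so adjacency lookups ignore edge orientation.
  private val adjacent: Set[(Int, Int)] =
    edges.flatMap { case (a, b) => Seq((a, b), (b, a)) }.toSet

  private def edgesAmong(triple: Seq[Int]): Int =
    triple.combinations(2).count { case Seq(a, b) => adjacent((a, b)) }

  private val triples: Seq[Seq[Int]] = vertices.toSeq.sorted.combinations(3).toSeq

  val triangles: Seq[Seq[Int]] = triples.filter(edgesAmong(_) == 3)     // 3 vertices, 3 edges
  val openTriangles: Seq[Seq[Int]] = triples.filter(edgesAmong(_) == 2) // 3 vertices, 2 edges

  def main(args: Array[String]): Unit = {
    println(s"order = $order, size = $size")
    println(s"degree = $degree")
    println(s"triangles = $triangles, open triangles = $openTriangles")
  }
}
```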
This is not just data visualization, it can also be used for prediction.
Some Samples from Wiki.
On the right, a basic graph; on the left, the languages used in Wikipedia
Little sidebar - Paths
Now that we have some terms under our belt.
What is the difference between shortest path, and longest path?
The Math doesn’t change.
One thing I like about Graphs –
The Math does not change.
The math behind Graph theory can be a little intense, but it does not
change regardless of the scale of the graph.
Once you understand how to “do the math” on a small graph, those
same Maths apply to a Graph whether it is a graph of the people in this
room, or a graph of the people on this planet.
What is a small graph?
Friends on Facebook, or LinkedIn.
Usually this can be displayed and analyzed rather easily.
If the Graph continues to grow, you need better tools.
Let’s do a quick demo of a small graph visualization.
From the website: “Gephi is an interactive visualization and exploration
platform for all kinds of networks and complex systems, dynamic and
To get this yourself go into Facebook and search for: Netvizz. (You have
to authorize it. You can de-authorize it later.)
Click the application.
Click “personal network”
Download your gdf file
Quick Demo – ( Vote time: If everyone is comfortable with general
graphs we can come back to this.)
What is a large graph?
To me a large graph is one that cannot be easily visualized by software
such as Gephi.
You have to use large tools to calculate the important statistics, such as
centrality, diameter, average degree, etc…
Breaking a large graph down to a small graph is actually not as simple
as it sounds.
This can be done reasonably easily with tools such as GraphX
Now what we all came for:
GraphX is Apache Spark's API for graphs and graph-parallel computation.
While GraphX is “just a library,” it is a library that exists within the Spark
environment, which provides a whole host of benefits like scaling,
clustering, storage, and other things that you don’t have to dwell on.
As of right now, GraphX is Scala only.
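For anyone who has not seen GraphX before, here is a minimal sketch of building a property graph and running PageRank on it. It assumes Spark with the GraphX module on the classpath and a local master. The four users and their follow edges are made up for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph, VertexId}

object GraphXSketch {
  // Made-up follow graph: an edge src -> dst means "src follows dst".
  def build(sc: SparkContext): Graph[String, Int] = {
    val users = sc.parallelize(Seq[(VertexId, String)](
      (1L, "user1"), (2L, "user2"), (3L, "user3"), (4L, "user4")))
    val follows = sc.parallelize(Seq(
      Edge(1L, 2L, 1), Edge(3L, 2L, 1), Edge(4L, 2L, 1), Edge(3L, 1L, 1)))
    Graph(users, follows)
  }

  // Vertex ids ordered from highest to lowest PageRank.
  def rankedIds(graph: Graph[String, Int]): Seq[VertexId] =
    graph.pageRank(0.0001).vertices.collect()
      .sortBy { case (_, rank) => -rank }
      .map(_._1).toSeq

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("graphx-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      val graph = build(sc)
      println(s"order = ${graph.numVertices}, size = ${graph.numEdges}")
      println(s"in-degrees = ${graph.inDegrees.collect().toMap}")
      println(s"most central vertex = ${rankedIds(graph).head}")
    } finally sc.stop()
  }
}
```

The same calls work unchanged whether the graph has four vertices or four hundred million; only the cluster behind the SparkContext changes.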
Data Science Challenge
Who should Follow whom?
Winklr is a curiously popular social network for fans of the sitcom Happy
Days. Users can post photos, write messages, and most importantly,
follow each other’s posts and content. This helps users keep up with
new content from their favorite users on the site.
Problem 3 of the data science challenge was a graph analysis problem.
Derive the top 70,000 connections that should be recommended.
Type of problem: Graph Analysis
Create a Master Graph.
Run Page Rank to identify centrality.
Create many small graphs for individual users.
Mask the Master Graph, and PageRank Graph.
Multiply together the centrality, the in-degree of each possible candidate,
and the inverse of the length of the path from this particular user
to the candidate vertex to be followed.
This code takes over 48 hours to run.
Code: Problem3.sh, and AnalyzeGraph.scala
Now we will review GitHub
Snapshot of code:
// Restrict the shortest-path graph to the vertices and edges that also exist in ClickPairGraph
val PathGraph = ShortestPathFromSource.mask(ClickPairGraph)
// Weight each candidate by its in-degree, discounted by its distance from the source user
val Influence = PathGraph.joinVertices(MasterGraph.inDegrees)((id, pathlength, indeg) => (1 / pathlength) * indeg)
// Scale by PageRank so that more central candidates score higher
val central_influence = Influence.joinVertices(MaskedMasterGraphPR.vertices)((id, dist, pagerank) => dist * pagerank)
// We want to eliminate the infinite scores and only follow someone there is in fact a path to
val finite_scores = central_influence.vertices.filter(_._2 < Double.PositiveInfinity)
println("Processing " + finite_scores.count())
//finite_scores.collect().foreach(record_to_list(SourceID, _))
val save_file_name = base_path + "/problem3/OutGraph/" + SourceID.trim() + ".data"
finite_scores.saveAsTextFile(save_file_name)
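To see what that score does on a graph small enough to check by hand, here is a toy plain-Scala version of the same arithmetic: centrality times in-degree divided by path length. This is a sketch of the idea only, not the code in AnalyzeGraph.scala. The follow edges are made up, and the centrality values are passed in as a parameter rather than computed.

```scala
import scala.annotation.tailrec

object FollowScore {
  // Made-up follow edges: (follower, followed).
  val follows: Seq[(Int, Int)] = Seq((1, 2), (2, 3), (2, 4), (5, 4), (6, 4))

  // In-degree = number of followers a user already has.
  val inDegree: Map[Int, Int] =
    follows.groupBy(_._2).map { case (user, es) => user -> es.size }

  // Breadth-first search: hop count from source to every reachable user.
  def distancesFrom(source: Int): Map[Int, Int] = {
    val out = follows.groupBy(_._1).map { case (user, es) => user -> es.map(_._2) }
    @tailrec
    def loop(frontier: Set[Int], seen: Map[Int, Int], hops: Int): Map[Int, Int] = {
      val next = frontier.flatMap(v => out.getOrElse(v, Nil)).filterNot(seen.contains)
      if (next.isEmpty) seen
      else loop(next, seen ++ next.map(_ -> (hops + 1)), hops + 1)
    }
    loop(Set(source), Map(source -> 0), 0)
  }

  // score = centrality * in-degree / path length, for reachable users other
  // than the source. Unreachable users never appear in the result.
  def score(source: Int, centrality: Map[Int, Double]): Map[Int, Double] =
    distancesFrom(source).collect {
      case (user, hops) if hops > 0 =>
        user -> centrality.getOrElse(user, 0.0) * inDegree.getOrElse(user, 0) / hops
    }

  def main(args: Array[String]): Unit = {
    // Uniform centrality, so only in-degree and distance matter here.
    val uniform = Map(2 -> 1.0, 3 -> 1.0, 4 -> 1.0)
    score(1, uniform).toSeq.sortBy(-_._2).foreach(println)
  }
}
```

With uniform centrality, user 4 scores highest for user 1: it is two hops away but already has three followers. A real recommender would also drop the users the source already follows.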
This is where we tie together the “small graphs” versus “big graphs”
Creating a Sub-graph of a larger graph is not obvious.
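One way to carve a small graph out of a large one in GraphX is `subgraph` with a vertex predicate. The sketch below keeps a user and everyone directly connected to that user. It assumes Spark with GraphX on the classpath, the sample edges are made up, and it collects the neighbour ids to the driver, which is only reasonable when a single user's neighbourhood is small.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Graph, VertexId}

object EgoNetwork {
  // Keep `user` plus every vertex that shares an edge with it.
  def around(graph: Graph[String, Int], user: VertexId): Graph[String, Int] = {
    val keep: Set[VertexId] = graph.edges
      .filter(e => e.srcId == user || e.dstId == user)
      .flatMap(e => Iterator(e.srcId, e.dstId))
      .collect().toSet + user
    // subgraph drops every vertex that fails vpred, and with it any edge
    // that has lost one of its endpoints.
    graph.subgraph(vpred = (id, _) => keep.contains(id))
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("ego-network").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      // Made-up follow edges; every vertex gets the attribute "user".
      val raw = sc.parallelize(Seq((1L, 2L), (3L, 2L), (4L, 2L), (3L, 1L), (4L, 5L)))
      val graph = Graph.fromEdgeTuples(raw, "user")
      val ego = around(graph, 1L)
      println(s"order = ${ego.numVertices}, size = ${ego.numEdges}")
    } finally sc.stop()
  }
}
```

A graph this size can be exported and opened in Gephi, which is the point of breaking the big graph down.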
I was expecting to see one big clump of nodes tightly connected. This
would be the “Target” to follow.
I was also expecting to see two smaller clumps of nodes, loosely
connected to the larger clump. These are the “followers.” As we
recommend that they follow the more popular node, they will
become more closely connected to this user.
Here is the output from Gephi that shows whether the code worked or not.
Where do I get data?
How you construct the network depends on the question(s) you are posing.
Chances are you have lots of data already; it is simply a
matter of perspective.
Apply Graphs to your own
Public social network data
The example mentioned from
Data Structure Graphs
A DSG Level 1 can show you where you are going to have the most
interesting query performance of your tables.
A DSG Level 2 can show you where the most work is going
on in your Enterprise.
Data Structure Graph Level 1 – This is roughly like an Entity Relationship
Diagram (ERD): Tables are Vertices, Foreign Keys are Edges.
Data Structure Graph Level 2 – Each Vertex in this graph is an
application. Each Edge is data transfer. Roughly equivalent to what we
used to call Data Flow diagrams.
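As a rough sketch of a Level 1 Data Structure Graph, the plain-Scala example below treats tables as vertices and foreign keys as edges, then ranks tables by degree. The schema is hypothetical and invented for this example. In practice you would pull the foreign keys from your database's catalog views.

```scala
object DataStructureGraph {
  // Hypothetical schema: (child table, parent table) for each foreign key.
  val foreignKeys: Seq[(String, String)] = Seq(
    ("order_line", "orders"), ("order_line", "product"),
    ("orders", "customer"), ("payment", "orders"), ("shipment", "orders"))

  // Degree of a table = number of foreign keys it takes part in,
  // whether as child or as parent.
  val degree: Map[String, Int] =
    foreignKeys.flatMap { case (child, parent) => Seq(child, parent) }
      .groupBy(identity)
      .map { case (table, ends) => table -> ends.size }

  // Highest degree first; ties broken alphabetically.
  val busiest: Seq[(String, Int)] =
    degree.toSeq.sortBy { case (table, d) => (-d, table) }

  def main(args: Array[String]): Unit =
    busiest.foreach { case (table, d) => println(s"$table: $d") }
}
```

The tables at the top of that list are the ones joined to most often, which is a reasonable first place to look for interesting query performance.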
SNAP – Stanford Network Analysis Project.
If you want to learn about how to do Network Analysis and you can’t
find any data, go here.
Consider the following:
Network/Graph Analysis is cool.
It can show you some interesting things about your data that you may
not have considered.
Due thought should be put towards a network analysis project.
Organizing the data requires a bit of thought. (From -> To vertices is just
Directed graph, undirected, bigraph? Some up front setup work needs
to be done.
Tools help with the detailed calculations, and show the paths, walks,
If you need assistance, send a message to the group, or contact me
directly (I am easy to find @dougneedham)