Massive Graph Mining 
Apache Spark’s GraphX and Data Mining
Who we are 
Andy 
@Noootsab 
@NextLab_be 
@Wajug co-driver 
@Devoxx4Kids organizer 
Maths & CS 
Data lover: geo, open, massive 
Fool 
Rand 
@randhindi 
@snips 
Entrepreneur 
PhD bioinformatics, etc.
Love data & ML
Graph 101 
A graph is a mathematical representation of
linked data.
It’s defined in terms of its vertices and edges,
G(V,E).
A vertex is an entity that can carry a bag of
data (generally small).
An edge connects two vertices, and can also carry a
bag of data.
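
As a rough illustration of G(V,E) with bags of data, here is a minimal plain-Python sketch. The class names `Vertex` and `Edge`, the helper `neighbors`, and the sample attributes are hypothetical and are not part of GraphX.

```python
from dataclasses import dataclass, field


@dataclass
class Vertex:
    id: int
    data: dict = field(default_factory=dict)  # the vertex's (small) bag of data


@dataclass
class Edge:
    src: int
    dst: int
    data: dict = field(default_factory=dict)  # the edge's bag of data


# Hypothetical sample graph: three vertices chained by two directed edges.
vertices = [Vertex(1, {"name": "a"}), Vertex(2, {"name": "b"}), Vertex(3, {"name": "c"})]
edges = [Edge(1, 2, {"weight": 0.5}), Edge(2, 3, {"weight": 2.0})]


def neighbors(vertex_id, edges):
    """Ids reachable from vertex_id by following one outgoing edge."""
    return [e.dst for e in edges if e.src == vertex_id]
```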
Graph 101 
A graph represents data in a way that is less
convenient for classical processing frameworks,
because the burden is not on the
observations themselves (rows) but on their
linkage, and specifically its density.
Thus, the problem often translates into a
self-join.
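
To make the self-join idea concrete, here is a small sketch that finds all two-hop paths by joining an edge table with itself (the destination of one row matches the source of another). The function name `two_hop_paths` and the sample edges are assumptions made for illustration.

```python
def two_hop_paths(edges):
    """Join the edge list with itself on left.dst == right.src."""
    # Index the "right" side of the join by source vertex.
    by_src = {}
    for src, dst in edges:
        by_src.setdefault(src, []).append(dst)
    # For every edge (src, mid), look up the edges leaving mid.
    return sorted(
        (src, mid, dst)
        for src, mid in edges
        for dst in by_src.get(mid, [])
    )


# Hypothetical directed edge table.
edges = [("a", "b"), ("b", "c"), ("b", "d"), ("c", "d")]
paths = two_hop_paths(edges)
```

The denser the linkage, the more rows this join produces, which is where the cost goes.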
Graph 101 
A graph G(V,E) has a reverse representation,
its Dual.
The Dual is nothing other than the graph G’(V’,
E’), where
● a vertex is an edge in G, and
● an edge is a vertex in G that is shared by at
least two edges.
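
A minimal sketch of this construction, assuming an undirected graph without self-loops: every edge of G becomes a vertex of G’, and two of those are linked whenever the original edges share a vertex. The function name `dual` and the index-based naming of the new vertices are choices made for this example only.

```python
from itertools import combinations


def dual(edges):
    """Return (dual_vertices, dual_edges) for an undirected edge list.

    Dual vertices are the indices of the original edges. Each dual edge
    is (i, j, shared_vertex): original edges i and j meet at shared_vertex.
    """
    incident = {}
    for idx, (u, v) in enumerate(edges):
        incident.setdefault(u, []).append(idx)
        incident.setdefault(v, []).append(idx)

    dual_edges = set()
    for vertex, edge_ids in incident.items():
        # A vertex with k incident edges yields k*(k-1)/2 dual edges.
        for i, j in combinations(edge_ids, 2):
            dual_edges.add((i, j, vertex))
    return list(range(len(edges))), sorted(dual_edges)


# Hypothetical star graph: three edges meeting at "b".
star = [("a", "b"), ("b", "c"), ("b", "d")]
dual_vertices, dual_edges = dual(star)
```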
Graph 101 
The classical way to store or share the 
connectivity of a graph is using its tabular 
version, that is, its Adjacency Matrix. 
ref: http://en.wikipedia.org/wiki/Adjacency_matrix
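
For illustration, a small sketch that builds the adjacency matrix from an edge list, assuming vertices are numbered 0 to n-1. The function name and parameters are hypothetical.

```python
def adjacency_matrix(n, edges, directed=False):
    """Build an n x n 0/1 matrix where entry [i][j] is 1 if i links to j."""
    matrix = [[0] * n for _ in range(n)]
    for i, j in edges:
        matrix[i][j] = 1
        if not directed:
            matrix[j][i] = 1  # undirected graphs give a symmetric matrix
    return matrix


# Hypothetical path graph 0 - 1 - 2.
path = adjacency_matrix(3, [(0, 1), (1, 2)])
```

Note that the matrix has n x n cells regardless of how few edges exist, which is why sparse graphs are usually kept as edge lists instead.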
GraphX (Apache Spark) 
Spark 101
GraphX (Apache Spark) 
Offers a Graph API on top of Spark. 
Enabling cross-world manipulations
GraphX (Apache Spark) 
How it differs from other classical systems...
GraphX (Apache Spark) 
Plenty of operators on both RDDs, but
GraphX (Apache Spark) 
1. Sends messages to neighbors 
2. Returns an RDD of aggregated messages
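
The two steps above can be mimicked in plain Python. This is only a sketch of the idea, not the GraphX operator itself: the function `aggregate_messages` and its `send` and `merge` parameters are hypothetical names, and the result is a dict rather than an RDD.

```python
def aggregate_messages(edges, send, merge):
    """Send messages along edges, then merge them per receiving vertex.

    send(src, dst) returns a list of (target_vertex, message) pairs.
    merge(a, b) combines two messages addressed to the same vertex.
    """
    inbox = {}
    for src, dst in edges:
        for target, message in send(src, dst):
            if target in inbox:
                inbox[target] = merge(inbox[target], message)
            else:
                inbox[target] = message
    return inbox


# Hypothetical example: compute in-degrees by sending 1 to each destination.
edges = [(1, 2), (1, 3), (2, 3)]
in_degree = aggregate_messages(
    edges,
    send=lambda src, dst: [(dst, 1)],
    merge=lambda a, b: a + b,
)
```

Vertices that receive no message (vertex 1 here) simply do not appear in the result.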
GraphX (Apache Spark) 
Offers higher-level operators and algorithms, like
GraphX (Apache Spark) 
This one rules them all (and more) 
More later...
PageRank and Pregel 
Everybody knows PageRank, right?
If not: it’s our oil, our friend, our preferred black
box…
It’s why Google Search works so well!
PageRank and Pregel 
Essentially, PageRank is all about the importance
of a node in a graph → Link Analysis.
The bottom line is:
● In-links are votes
● In-links from important nodes count
more → recursion
PageRank and Pregel 
https://d396qusza40orc.cloudfront.net/mmds/lecture_slides/week1_pagerank_the_flow_formulation.pdf
PageRank and Pregel 
TL;DR 
The importance of a node is the probability that
a random (drunk) walker lands on that node.
So, it depends on:
1. the probability that he lands on one of its
neighbors
2. the probability that he crosses a link from
that neighbor to it
3. an arbitrary probability of teleportation
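
One common way to write these three ingredients as a single equation is given below. The symbols are assumptions introduced here, not notation from the slides: r_j is the rank of node j, d_i the number of out-links of node i, N the number of nodes, and β the probability of following a link (so 1 - β is the teleportation probability). It also assumes every node has at least one out-link.

```latex
r_j \;=\; \underbrace{\frac{1-\beta}{N}}_{\text{3. teleportation}}
\;+\; \beta \sum_{i \to j}
\underbrace{r_i}_{\text{1. being on neighbor } i}
\cdot
\underbrace{\frac{1}{d_i}}_{\text{2. crossing the link } i \to j}
```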
PageRank and Pregel 
Solution: Power Method/Iteration (recursive) 
r_new = A x r_old 
Matrix algebra is a pain in a distributed
environment…
But wait, the process is rather graph-oriented!
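
A minimal sketch of the power iteration, written over adjacency lists instead of an explicit matrix. The damping value 0.85, the fixed iteration count, and the even redistribution of rank from nodes without out-links are assumptions of this example, not details taken from the slides. Every link target is assumed to appear as a key of `out_links`.

```python
def pagerank(out_links, damping=0.85, iterations=50):
    """Repeat r_new = A x r_old, with teleportation, a fixed number of times."""
    nodes = list(out_links)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Teleportation gives every node the same base share.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node, targets in out_links.items():
            if targets:
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Node without out-links: spread its rank evenly (assumption).
                share = damping * rank[node] / n
                for target in nodes:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical graph: "a" and "b" both vote for "c", and "c" votes for "a".
ranks = pagerank({"a": ["c"], "b": ["c"], "c": ["a"]})
```

Each pass only moves rank along edges, which is why the computation maps naturally onto a graph-oriented, message-passing model.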
PageRank and Pregel 
Pregel (Google again)
Based on BSP, Bulk Synchronous Parallel
BSP works in a message-passing style
PageRank and Pregel 
During Superstep i, a vertex can: 
● use messages received from Superstep i-1 
● execute a function 
● send messages 
● vote to halt
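
The superstep loop can be sketched in plain Python with the classic "propagate the maximum value" example. This is a single-machine toy under stated assumptions, not Pregel or GraphX code: the function name `pregel_max`, the dict-based inboxes, and the superstep cap are all hypothetical.

```python
def pregel_max(values, edges, max_supersteps=50):
    """Spread the largest vertex value along directed edges, BSP style."""
    values = dict(values)
    out_neighbors = {v: [] for v in values}
    for src, dst in edges:
        out_neighbors[src].append(dst)

    inbox = {v: [] for v in values}
    for superstep in range(max_supersteps):
        outbox = {v: [] for v in values}
        for v in values:
            # Use the messages received from the previous superstep.
            incoming = max(inbox[v], default=values[v])
            # Execute the vertex function: keep the largest value seen.
            changed = incoming > values[v]
            values[v] = max(values[v], incoming)
            # Send messages: everyone in superstep 0, later only on change.
            if superstep == 0 or changed:
                for neighbor in out_neighbors[v]:
                    outbox[neighbor].append(values[v])
            # Otherwise the vertex votes to halt by staying silent.
        if not any(outbox.values()):
            break  # no messages in flight: every vertex has halted
        inbox = outbox  # synchronization barrier between supersteps
    return values


# Hypothetical chain 1 - 2 - 3 - 4 with edges in both directions.
chain = [(1, 2), (2, 1), (2, 3), (3, 2), (3, 4), (4, 3)]
result = pregel_max({1: 3, 2: 6, 3: 2, 4: 1}, chain)
```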
PageRank and Pregel 
In GraphX, as usual with Spark, it’s simple: 
mapReduceTriplets
PageRank and Pregel 
PageRank with Pregel:
PageRank and Pregel 
Applying it to our USA.csv file:
OpenStreetMap 
Founded by Steve Coast (UK, 2004) 
Aims to take geodata out of governments’ hands and
give it to the crowd
Actually, the crowd has to create it...
OSM 
So it’s a Graph! 
Node = Vertex 
single point in space defined by its latitude, longitude and node id 
Way = Edge 
A way can have between 2 and 2,000 nodes
OSM 
The network is over-complex for what we need, 
thus: 
● collapsing cyclic ways, like roundabouts, into a
single one
● transforming the nodes into sections, i.e. 
pieces of streets between 2 intersections
OSM 
Hence, OSM ~ G(Node, Way) 
Even if the match is not exact, we can still manipulate them
In our case, we don’t need the connectivity of 
an intersection, but the connectivity of a 
section. 
This is given by G’ (dual of G)
Dataset 
● 80 cities 
● 3M edges in total 
● smallest city 200 edges (Tempe) 
● largest city 200,000 edges (Los Angeles)
Comparing Cities 
● Hypothesis: Cities with similar connectivity 
have similar PageRank distribution 
NYC Chicago
Fort Worth = Philadelphia? 
Looks the same!
Smells like Spurious Correlation
Normalizing PageRank distributions 
● Problem: PageRank is correlated with the 
size of the city 
● size of city = number of sections (edges) in 
the graph 
● Normalized PageRank = PageRank / 
size_of_city 
● Now we can compare cities of different 
sizes!
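
The normalization is a one-liner. In this sketch the function name and the sample values are hypothetical, and `size_of_city` is the number of sections in the city graph, as defined above.

```python
def normalize_pagerank(pageranks, size_of_city):
    """Divide every section's PageRank by the number of sections in the city."""
    return [pagerank / size_of_city for pagerank in pageranks]


# Hypothetical city with 4 sections.
normalized = normalize_pagerank([2.0, 4.0, 1.0, 1.0], size_of_city=4)
```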
Fort Worth != Philadelphia! 
Totally different!
Fort Worth before and after 
Note that the range of PageRank is preserved
Distance between PageRank Distributions
● How to compare PageRank distributions? 
● It’s not always a normal distribution! 
● Can use the Kullback-Leibler divergence 
from information theory 
● the Kullback–Leibler divergence of Q from 
P, denoted DKL(P||Q), is a measure of the 
information lost when Q is used to 
approximate P
KL Divergence 
● Easy to compute 
● Units are nats (can be bits if using log2
instead of ln)
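
A minimal sketch of the computation for two discrete distributions given as equal-length lists of probabilities (for example, histograms over the same bins). The function name and the `base` parameter are choices made for this example. It assumes Q is non-zero wherever P is non-zero, otherwise the divergence is infinite and this code raises an error.

```python
import math


def kl_divergence(p, q, base=math.e):
    """D_KL(P || Q) = sum over i of p_i * log(p_i / q_i).

    base=math.e gives nats, base=2 gives bits.
    """
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i > 0:  # terms with p_i == 0 contribute nothing
            total += p_i * math.log(p_i / q_i, base)
    return total


# Hypothetical distributions over two bins.
same = kl_divergence([0.5, 0.5], [0.5, 0.5])
one_bit = kl_divergence([1.0, 0.0], [0.5, 0.5], base=2)
```

The divergence is not symmetric: swapping P and Q generally gives a different value, so the order of the two cities matters when comparing them.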
Very different cities: Dallas & Seattle 
● KL divergence = 18.407 
● Dallas is irregular, Seattle is a perfect grid
Very similar cities: Atlanta & Boston 
● KL divergence = 0.36 
● Both are very irregular
Next steps 
● Using multiple street topology indicators to
measure the risk of car accidents
Q.E.D 
Thanks for keeping up! 
Question => 
Future[(Option[Response], Future[Question])]

Machine Learning and GraphX
