Machine Learning and GraphX

Massive Graph Mining
Apache Spark’s GraphX and Data Mining

Who we are
Andy
@Noootsab
@NextLab_be
@Wajug co-driver
@Devoxx4Kids organizer
Maths & CS
Data lover: geo, open, massive
Fool
Rand
@randhindi
@snips
Entrepreneur
PhD bioinformatics, etc..
Love data & ML

Graph 101
A graph is a mathematical representation of
linked data.
It’s defined in terms of its Vertices and Edges,
G(V,E).
A vertex is an entity that can carry a bag of
data (generally small).
An edge connects vertices, and can also own a
bag of data.
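
A minimal sketch of G(V,E) in plain Python (not GraphX), assuming vertices are keyed by an integer id and each edge is a (source, destination, data) triple. The names, attributes and the `neighbors` helper are hypothetical and only illustrate the definition.

```python
# Hypothetical toy data: ids, names and weights are illustrative only.
vertices = {
    1: {"name": "a"},
    2: {"name": "b"},
    3: {"name": "c"},
}
# Each edge is (source id, destination id, bag of data).
edges = [
    (1, 2, {"weight": 1.0}),
    (2, 3, {"weight": 2.5}),
    (1, 3, {"weight": 0.5}),
]

def neighbors(vertex_id, edges):
    """Return the ids linked to vertex_id by one edge, ignoring direction."""
    linked = set()
    for src, dst, _data in edges:
        if src == vertex_id:
            linked.add(dst)
        elif dst == vertex_id:
            linked.add(src)
    return linked
```

For instance, `neighbors(1, edges)` collects every vertex that shares an edge with vertex 1.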

Graph 101
A graph represents data in a way that is less
convenient for classical processing frameworks,
because the burden is put not on the
observations themselves (rows) but on their
linkage, and specifically its density.
Thus, the problem often translates into a
self-join.
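
To make the self-join idea concrete, here is a small sketch in plain Python, assuming the graph is stored as a table of (src, dst) rows. Finding two-hop paths means joining that table with itself on left.dst == right.src. The function name `two_hop_pairs` and the sample rows are hypothetical.

```python
from collections import defaultdict

def two_hop_pairs(edges):
    """Join the edge table with itself on left.dst == right.src."""
    by_src = defaultdict(list)
    for src, dst in edges:
        by_src[src].append(dst)
    # For every edge a -> b, follow each edge b -> c to get the pair (a, c).
    return sorted((a, c) for a, b in edges for c in by_src.get(b, []))

# Hypothetical edge table.
edges = [(1, 2), (2, 3), (2, 4), (3, 4)]
pairs = two_hop_pairs(edges)
```

The denser the linkage, the more rows each join produces, which is where tabular frameworks start to struggle.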

Graph 101
A Graph, G(V,E) has a reverse representation,
its Dual.
The Dual is nothing other than the graph G’(V’,
E’), where
● a vertex is an edge in G, and
● an edge is a vertex in G that is shared by at
least two edges.
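
A minimal sketch of this construction in plain Python (not GraphX), assuming an undirected graph given as a list of vertex pairs: each edge of G becomes a vertex of G’, and two such vertices are linked when the original edges share an endpoint. The function name `dual` and the sample graph are hypothetical.

```python
from itertools import combinations

def dual(edges):
    """Return (vertices, edges) of G', where each vertex of G' is an edge of G."""
    dual_vertices = list(edges)
    dual_edges = [
        (e, f)
        for e, f in combinations(edges, 2)
        if set(e) & set(f)  # e and f share at least one endpoint in G
    ]
    return dual_vertices, dual_edges

# Hypothetical path graph a - b - c - d.
g_edges = [("a", "b"), ("b", "c"), ("c", "d")]
dual_vertices, dual_edges = dual(g_edges)
```

The pairwise comparison is quadratic in the number of edges, so it only suits small examples; a larger graph would group edges by endpoint first.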

Graph 101
The classical way to store or share the
connectivity of a graph is using its tabular
version, that is, its Adjacency Matrix.
ref: http://en.wikipedia.org/wiki/Adjacency_matrix
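
As an illustration, an adjacency matrix can be built from an edge list as follows. This is a sketch in plain Python, assuming vertices are numbered 0 to n-1; the function name and the `directed` flag are hypothetical.

```python
def adjacency_matrix(n, edges, directed=False):
    """Build an n x n matrix with 1 where an edge links two vertices."""
    matrix = [[0] * n for _ in range(n)]
    for src, dst in edges:
        matrix[src][dst] = 1
        if not directed:
            matrix[dst][src] = 1  # undirected graphs give a symmetric matrix
    return matrix

# Hypothetical path graph 0 - 1 - 2.
matrix = adjacency_matrix(3, [(0, 1), (1, 2)])
```

The matrix needs n² cells whatever the number of edges, which is why sparse graphs are usually shared as edge lists instead.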

GraphX (Apache Spark)
Spark 101

Offers a Graph API on top of Spark.
Enabling cross-world manipulations

How it differs from other classical systems...

Plenty of operators on both RDDs, but

1. Sends messages to neighbors
2. Returns an RDD of aggregated messages
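
The two steps above can be emulated in plain Python to show the idea. This is not the GraphX API: the function `aggregate_messages` and its `send` and `merge` parameters are hypothetical stand-ins, and a dict plays the role of the resulting RDD.

```python
def aggregate_messages(edges, send, merge):
    """Emulate 'send messages, then aggregate them per vertex'.

    send(src, dst) returns a list of (target vertex, message) pairs;
    merge(a, b) combines two messages addressed to the same vertex.
    """
    inbox = {}
    for src, dst in edges:
        for target, message in send(src, dst):
            if target in inbox:
                inbox[target] = merge(inbox[target], message)
            else:
                inbox[target] = message
    return inbox

# Hypothetical example: every edge sends 1 to its destination,
# so the aggregated message is the in-degree of each vertex.
edges = [(1, 2), (1, 3), (2, 3)]
in_degree = aggregate_messages(edges, lambda s, d: [(d, 1)], lambda a, b: a + b)
```

Vertices that receive no message are simply absent from the result.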

Offers higher level operators and algo, like

This one rules them all (and more)
More later...

PageRank and Pregel
Everybody knows PageRank, right?
If not: it’s our oil, our friend, our preferred black
box…
It’s why Google Search works so well!

PageRank and Pregel
Essentially, PageRank is all about importance
of a node in a Graph → Link Analysis.
The bottom line is:
● In-Links are votes
● In-Links from important nodes are more
important → recursion

PageRank and Pregel
https://d396qusza40orc.cloudfront.net/mmds/lecture_slides/week1_pagerank_the_flow_formulation.pdf

PageRank and Pregel
TL;DR
The importance of a node is the probability that
a random (drunk) walker lands on that node.
So, it depends on:
1. the probability that he lands on one of its
neighbors
2. the probability that he crosses a link from
that neighbor to it
3. an arbitrary probability of teleportation

PageRank and Pregel
Solution: Power Method/Iteration (recursive)
r_new = A x r_old
Matrix algebra is a pain in a distributed
environment…
But wait, the process is rather graph oriented!
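
A minimal single-machine sketch of the power iteration, written per node rather than as a matrix product. It assumes the common damping value 0.85 for the teleportation term, a fixed number of iterations, and that every link target also appears as a key of `out_links`; none of these choices come from the slides.

```python
def pagerank(out_links, damping=0.85, iterations=50):
    """out_links maps each node to the list of nodes it links to."""
    nodes = list(out_links)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Teleportation: every node gets a base share of (1 - damping) / n.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node, targets in out_links.items():
            if targets:
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling node: spread its rank evenly over all nodes.
                for target in nodes:
                    new_rank[target] += damping * rank[node] / n
        rank = new_rank  # r_new becomes r_old for the next iteration
    return rank

# Hypothetical three-page web.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(links)
```

Each node only pushes a share of its rank along its out-links, which is the graph-oriented view of r_new = A x r_old.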

PageRank and Pregel
Pregel (Google again)
Based on BSP, Bulk Synchronous Parallel
BSP works in a message-passing style

PageRank and Pregel
During Superstep i, a vertex can:
● use messages received from Superstep i-1
● execute a function
● send messages
● vote to halt
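
The four actions above can be sketched with a toy, single-machine loop. The example below propagates the maximum value through a graph, a classic Pregel illustration; it is plain Python, not the Pregel or GraphX API, and the function name `pregel_max` and its parameters are hypothetical.

```python
def pregel_max(values, neighbors, max_supersteps=20):
    """Propagate the maximum value through the graph, Pregel style.

    values: vertex id -> initial value; neighbors: vertex id -> list of ids.
    """
    values = dict(values)
    # Superstep 0: every vertex sends its value to its neighbors.
    inbox = {v: [] for v in values}
    for v, value in values.items():
        for n in neighbors[v]:
            inbox[n].append(value)
    supersteps = 1
    while any(inbox.values()) and supersteps < max_supersteps:
        outbox = {v: [] for v in values}
        for v, messages in inbox.items():
            if not messages:
                continue  # no mail: the vertex stays halted
            best = max(messages)          # use messages from superstep i-1
            if best > values[v]:          # execute a function
                values[v] = best
                for n in neighbors[v]:    # send messages
                    outbox[n].append(best)
            # otherwise the vertex votes to halt by sending nothing
        inbox = outbox                    # barrier between supersteps
        supersteps += 1
    return values, supersteps

# Hypothetical chain 1 - 2 - 3 - 4.
chain = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3]}
result, steps = pregel_max({1: 3, 2: 6, 3: 2, 4: 1}, chain)
```

The computation stops once no vertex has anything left to send, that is, when every vertex has voted to halt.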

PageRank and Pregel
In GraphX, as usual with Spark, it’s simple:
mapReduceTriplets

PageRank and Pregel
PageRank with Pregel:

PageRank and Pregel
Applying on our USA.csv file:

OpenStreetMap
Founded by Steve Coast (UK, 2004)
Aims to take geodata out of governments’ hands
and give it to the crowd
In practice, the crowd has to create it...

OSM
So it’s a Graph!
Node = Vertex
single point in space defined by its latitude, longitude and node id
Way = Edge
A way can have between 2 and 2,000 nodes

OSM
The network is over-complex for what we need,
thus:
● reducing cyclic ways like roundabouts to a
single node
● transforming the nodes into sections, i.e.
pieces of streets between 2 intersections

OSM
Hence, OSM ~ G(Node, Way)
Even if the match is not exact, we can still manipulate it as a graph
In our case, we don’t need the connectivity of
an intersection, but the connectivity of a
section.
This is given by G’ (dual of G)

Dataset
● 80 cities
● 3M edges in total
● smallest city 200 edges (Tempe)
● largest city 200,000 edges (Los Angeles)

Comparing Cities
● Hypothesis: Cities with similar connectivity
have similar PageRank distribution
[Figure panels: NYC, Chicago]

Fort Worth = Philadelphia?
Looks the same!

Smells like Spurious Correlation

Normalizing PageRank distributions
● Problem: PageRank is correlated with the
size of the city
● size of city = number of sections (edges) in
the graph
● Normalized PageRank = PageRank /
size_of_city
● Now we can compare cities of different
sizes!
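
The normalization stated above is a one-liner; the sketch below applies it to a list of per-section scores. The function name and the sample values are hypothetical, and the formula is taken literally from the bullet: PageRank / size_of_city.

```python
def normalize_pagerank(pageranks, size_of_city):
    """Divide each PageRank score by the number of sections in the city."""
    if size_of_city <= 0:
        raise ValueError("size_of_city must be positive")
    return [score / size_of_city for score in pageranks]

# Hypothetical city with three sections.
normalized = normalize_pagerank([0.5, 1.0, 1.5], size_of_city=3)
```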

Fort Worth != Philadelphia!
Totally different!

Fort Worth before and after
Note that the range of PageRank is preserved

Distance between PageRank Distributions
● How to compare PageRank distributions?
● It’s not always a normal distribution!
● Can use the Kullback-Leibler divergence
from information theory
● the Kullback–Leibler divergence of Q from
P, denoted D_KL(P||Q), is a measure of the
information lost when Q is used to
approximate P

KL Divergence
● Easy to compute
● The unit is nats (or bits if using log2
instead of ln)
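
A minimal sketch of the computation for discrete distributions, D_KL(P||Q) = Σ p_i · log(p_i / q_i), assuming P and Q are given as two histograms over the same bins, each summing to 1. How the PageRank values are binned is not specified here, so the inputs below are hypothetical.

```python
import math

def kl_divergence(p, q, base=math.e):
    """D_KL(P||Q) in nats by default, or in bits with base=2."""
    if len(p) != len(q):
        raise ValueError("p and q must have the same number of bins")
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i == 0.0:
            continue  # a bin with p_i = 0 contributes nothing
        if q_i == 0.0:
            return math.inf  # Q gives zero probability where P does not
        total += p_i * math.log(p_i / q_i, base)
    return total

# Hypothetical two-bin histograms.
p = [0.5, 0.5]
q = [0.25, 0.75]
divergence_bits = kl_divergence(p, q, base=2)
```

The divergence is not symmetric: `kl_divergence(p, q)` and `kl_divergence(q, p)` generally differ, so the order of the two cities matters.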

Very different cities: Dallas & Seattle
● KL divergence = 18.407
● Dallas is irregular, Seattle is a perfect grid

Very similar cities: Atlanta & Boston
● KL divergence = 0.36
● Both are very irregular

Next steps
● Using multiple street topology indicators to
measure the risk of car accidents

Q.E.D
Thanks for keeping up!
Question =>
Future[(Option[Response], Future[Question])]
