2. Who we are
Andy
@Noootsab
@NextLab_be
@Wajug co-driver
@Devoxx4Kids organizer
Maths & CS
Data lover: geo, open, massive
Fool
Rand
@randhindi
@snips
Entrepreneur
PhD in bioinformatics, etc.
Love data & ML
3. Graph 101
A graph is a mathematical representation of
linked data.
It’s defined in terms of its Vertices and Edges,
G(V,E).
A vertex is an entity that can carry a bag of
data (generally small).
An edge connects vertices, and can also own a
bag of data.
4. Graph 101
A graph represents data in a way that is less
convenient for classical processing frameworks,
because the burden is not put on the
observations themselves (rows) but on their
linkage, and specifically on its density.
Thus, the problem is often translated as a self-join
one.
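To make the self-join remark concrete, here is a minimal sketch in plain Python. It assumes the graph is stored as an edge table of (src, dst) rows, and the function name `two_hop_pairs` is made up for illustration. Finding all paths of length 2 amounts to joining the edge table with itself on "dst of the first row = src of the second row".

```python
def two_hop_pairs(edges):
    """Self-join the edge table: match e1.dst with e2.src."""
    # Index the table by source so the join does not scan every pair of rows.
    by_source = {}
    for src, dst in edges:
        by_source.setdefault(src, []).append(dst)
    # For each row (src, mid), look up the rows starting at mid.
    joined = {(src, far) for src, mid in edges for far in by_source.get(mid, [])}
    return sorted(joined)


edges = [("a", "b"), ("b", "c"), ("b", "d")]
pairs = two_hop_pairs(edges)
print(pairs)
```

The denser the linkage, the more rows each lookup returns, which is why density rather than row count drives the cost.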
5. Graph 101
A Graph, G(V,E), has a reverse representation,
its Dual.
A Dual is nothing other than the graph, G’(V’,
E’), where
● a vertex is an edge in G, and
● an edge is a vertex in G, which has at least
one edge.
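One possible reading of this definition, sketched in plain Python: every edge of G becomes a vertex of G’, and two of those vertices are linked whenever the original edges meet at a vertex of G. The function name `dual_graph` and the edge-list input are assumptions made for illustration. In this sketch a vertex of G only produces a link when it joins two or more edges.

```python
from itertools import combinations


def dual_graph(edges):
    """Build G'(V', E') from the edge list of G."""
    dual_vertices = list(edges)  # one vertex of G' per edge of G
    incident = {}  # vertex of G -> indices of the edges of G touching it
    for i, (u, v) in enumerate(edges):
        for endpoint in {u, v}:  # a set, so a self-loop is counted once
            incident.setdefault(endpoint, []).append(i)
    dual_edges = set()
    for touching in incident.values():
        # Every pair of edges meeting at this vertex is linked in G'.
        for i, j in combinations(touching, 2):
            dual_edges.add((i, j))
    return dual_vertices, sorted(dual_edges)


# Path a - b - c - d: three edges in G, so three vertices in G'.
dual_vertices, dual_edges = dual_graph([("a", "b"), ("b", "c"), ("c", "d")])
print(dual_vertices, dual_edges)
```

Edges of G’ are given as pairs of indices into `dual_vertices`.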
6. Graph 101
The classical way to store or share the
connectivity of a graph is using its tabular
version, that is, its Adjacency Matrix.
ref: http://en.wikipedia.org/wiki/Adjacency_matrix
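As a small illustration, the tabular version can be built from a vertex list and an edge list as follows. This is a sketch in plain Python; the function name `adjacency_matrix` and the `directed` flag are assumptions, not something taken from the slides.

```python
def adjacency_matrix(vertices, edges, directed=False):
    """Return a |V| x |V| matrix where entry [i][j] is 1 if i is linked to j."""
    index = {v: i for i, v in enumerate(vertices)}
    matrix = [[0] * len(vertices) for _ in vertices]
    for src, dst in edges:
        matrix[index[src]][index[dst]] = 1
        if not directed:
            # An undirected edge is stored in both directions.
            matrix[index[dst]][index[src]] = 1
    return matrix


vertices = ["a", "b", "c"]
edges = [("a", "b"), ("b", "c")]
matrix = adjacency_matrix(vertices, edges)
for row in matrix:
    print(row)
```

For sparse graphs such as street networks, most entries are 0, which is one reason edge lists are often preferred in practice.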
17. PageRank and Pregel
Everybody knows PageRank, right?
If not: it’s our oil, our friend, our preferred black
box…
It’s why Google Search works so well!
18. PageRank and Pregel
Essentially, PageRank is all about the importance
of a node in a graph → Link Analysis.
The bottom line is:
● In-links are votes
● In-links from important nodes are more
important → recursion
19. PageRank and Pregel
ref: https://d396qusza40orc.cloudfront.net/mmds/lecture_slides/week1_pagerank_the_flow_formulation.pdf
20. PageRank and Pregel
TL;DR
The importance of a node is the probability that
a random (drunk) walker lands on that node.
So, it depends on:
1. the probability that he lands on one of its
neighbors
2. the probability that he crosses a link from
that neighbor to it
3. an arbitrary probability of teleportation
21. PageRank and Pregel
Solution: Power Method/Iteration (recursive)
r_new = A x r_old
Matrix algebra is a pain in a distributed
environment…
But wait, the process is rather graph oriented!
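A minimal sketch of the power iteration in plain Python. It never builds the matrix A: each node pushes its rank along its out-links, which is the same update as r_new = A x r_old plus the teleportation term. The damping value 0.85, the tolerance, the even spreading of rank from nodes without out-links, and the function name are assumptions chosen for illustration. Every node must appear as a key of `out_links`.

```python
def pagerank_power_iteration(out_links, damping=0.85, tol=1e-10, max_iter=200):
    """Iterate the rank update until the ranks stop changing."""
    nodes = sorted(out_links)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(max_iter):
        # Teleportation: every node receives the same base share.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node, targets in out_links.items():
            if targets:
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Assumption: a node without out-links spreads its rank evenly.
                share = damping * rank[node] / n
                for target in nodes:
                    new_rank[target] += share
        delta = sum(abs(new_rank[node] - rank[node]) for node in nodes)
        rank = new_rank
        if delta < tol:
            break
    return rank


links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank_power_iteration(links)
print(ranks)
```

In this toy graph, "c" collects votes from both "a" and "b", so it ends up with the highest rank.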
22. PageRank and Pregel
Pregel (Google again)
Based on BSP, Bulk Synchronous Parallel
BSP works in a message-passing style
23. PageRank and Pregel
During Superstep i, a vertex can:
● use messages received from Superstep i-1
● execute a function
● send messages
● vote to halt
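The superstep loop can be sketched as a single-process simulation in plain Python. This is not the Pregel API: `run_supersteps` and `propagate_max` are hypothetical names, and the vertex program shown is a maximum-value propagation, picked because it is shorter than PageRank. A vertex that voted to halt is woken up again when a message reaches it.

```python
def run_supersteps(out_links, values, compute, max_supersteps=50):
    """Run `compute` on every awake vertex, one superstep at a time."""
    values = dict(values)
    inbox = {v: [] for v in out_links}
    active = set(out_links)
    superstep = 0
    while superstep < max_supersteps and (active or any(inbox.values())):
        outbox = {v: [] for v in out_links}
        for v in out_links:
            if v not in active and not inbox[v]:
                continue  # halted and nothing received: stay asleep
            values[v], outgoing, halt = compute(values[v], inbox[v], out_links[v], superstep)
            for target, message in outgoing:
                outbox[target].append(message)
            if halt:
                active.discard(v)
            else:
                active.add(v)
        inbox = outbox  # barrier: messages are delivered in the next superstep
        superstep += 1
    return values


def propagate_max(value, messages, neighbors, superstep):
    """Vertex program: keep the largest value seen, forward it when it changes."""
    new_value = max([value] + messages)
    if superstep == 0 or new_value > value:
        return new_value, [(n, new_value) for n in neighbors], False
    return new_value, [], True  # nothing new: vote to halt


ring = {"a": ["b"], "b": ["c"], "c": ["a"]}
result = run_supersteps(ring, {"a": 3, "b": 6, "c": 2}, propagate_max)
print(result)
```

The run stops once every vertex has voted to halt and no message is in flight.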
28. OpenStreetMap
Founded by Steve Coast (UK, 2004)
Aims to take geodata out of the governments’ hands
and give it to the crowd
Actually, the crowd has to create it...
31. OSM
So it’s a Graph!
Node = Vertex
single point in space defined by its latitude, longitude and node id
Way = Edge
A way can have between 2 and 2,000 nodes
32. OSM
The network is over-complex for what we need,
thus:
● reducing cyclic ways like roundabouts to a
single node
● transforming the nodes into sections, i.e.
pieces of streets between 2 intersections
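The second step can be sketched in plain Python, assuming each way is given as an ordered list of node ids and that a node used more than once across all ways is an intersection. Both assumptions, and the name `split_into_sections`, are illustrative only. The roundabout reduction is not covered here.

```python
from collections import Counter


def split_into_sections(ways):
    """Cut each way (an ordered list of node ids) at intersection nodes."""
    usage = Counter(node for way in ways for node in way)
    sections = []
    for way in ways:
        current = [way[0]]
        for node in way[1:]:
            current.append(node)
            # Assumption: a node seen more than once overall is an intersection.
            if usage[node] > 1:
                sections.append(current)
                current = [node]
        if len(current) > 1:  # remaining piece after the last intersection
            sections.append(current)
    return sections


# Two ways crossing at node 3.
ways = [[1, 2, 3, 4], [5, 3, 6]]
sections = split_into_sections(ways)
print(sections)
```

Each resulting section starts or ends at the shared node, so the crossing yields four pieces of street.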
33. OSM
Hence, OSM ~ G(Node, Way)
Even if it’s not exactly a graph, we can still manipulate it
In our case, we don’t need the connectivity of
an intersection, but the connectivity of a
section.
This is given by G’ (dual of G)
34. Dataset
● 80 cities
● 3M edges in total
● smallest city 200 edges (Tempe)
● largest city 200,000 edges (Los Angeles)
35. Comparing Cities
● Hypothesis: Cities with similar connectivity
have similar PageRank distribution
[Figure panels: NYC, Chicago]
38. Normalizing PageRank distributions
● Problem: PageRank is correlated with the
size of the city
● size of city = number of sections (edges) in
the graph
● Normalized PageRank = PageRank /
size_of_city
● Now we can compare cities of different
sizes!
40. Fort Worth before and after
Note that the range of PageRank is preserved
41. Distance between PageRank Distributions
● How to compare PageRank distributions?
● It’s not always a normal distribution!
● Can use the Kullback-Leibler divergence
from information theory
● the Kullback–Leibler divergence of Q from
P, denoted DKL(P||Q), is a measure of the
information lost when Q is used to
approximate P
42. KL Divergence
● Easy to compute
● The unit is nats (or bits if using log2
instead of ln)
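A minimal sketch of the computation in plain Python, assuming P and Q are discrete distributions over the same bins (for example, two PageRank histograms with identical bin edges). How the distributions were binned for the talk is not stated, so the inputs below are toy values, and the function name and `base` parameter are illustrative.

```python
import math


def kl_divergence(p, q, base=math.e):
    """D_KL(P||Q) = sum_i p_i * log(p_i / q_i), in nats by default."""
    if len(p) != len(q):
        raise ValueError("P and Q must be defined over the same bins")
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i == 0:
            continue  # by convention, 0 * log(0 / q) contributes nothing
        if q_i == 0:
            return math.inf  # Q gives zero probability to something P observes
        total += p_i * math.log(p_i / q_i)
    return total / math.log(base)  # base=2 converts nats to bits


p = [0.5, 0.5]
q = [0.9, 0.1]
nats = kl_divergence(p, q)
bits = kl_divergence(p, q, base=2)
print(nats, bits)
```

Note that the divergence is not symmetric: `kl_divergence(p, q)` and `kl_divergence(q, p)` generally differ, so the order of the two cities matters.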
43. Very different cities: Dallas & Seattle
● KL divergence = 18.407
● Dallas is irregular, Seattle is a perfect grid
44. Very similar cities: Atlanta & Boston
● KL divergence = 0.36
● Both are very irregular
45. Next steps
● Using multiple street topology indicators to
measure the risk of car accident
46. Q.E.D
Thanks for keeping up!
Question =>
Future[(Option[Response], Future[Question])]