GraphX and Pregel - Apache Spark
1. Spark GraphX & Pregel
Challenges and Best Practices
Ashutosh Trivedi (IIIT Bangalore)
Kaushik Ranjan (IIIT Bangalore)
Sigmoid-Meetup Bangalore
https://github.com/anantasty/SparkAlgorithms
2. Ashutosh & Kaushik, Sigmoid-Meetup
Bangalore Dec-2014
Agenda
• Introduction to GraphX
– How to describe a graph
– RDDs to store a graph
– Algorithms available
• Application in graph algorithms
– Feedback Vertex Set of a graph
– Identifying parallel parts of the solution
• Challenges we faced
• Best practices
Graph Representation
class Graph[V, E] {
  def Graph(vertices: Table[(Id, V)],
            edges: Table[(Id, Id, E)])
}
• VertexRDD[A] extends RDD[(VertexId, A)] and adds the constraint that each VertexId occurs only once.
• A VertexRDD[A] therefore represents a set of vertices, each with an attribute of type A.
• EdgeRDD[ED] extends RDD[Edge[ED]].
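As a rough illustration of the representation above, the following sketch builds a small property graph from a vertex RDD and an edge RDD. It assumes Spark with GraphX on the classpath; the object name, the vertex labels and the edge weights are made up for the example and do not come from the slides.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD

object GraphRepresentationSketch {
  /** Builds a four-vertex graph; labels and weights are hypothetical. */
  def build(sc: SparkContext): Graph[String, Int] = {
    // Each vertex is an (id, attribute) pair; ids must be unique.
    val vertices: RDD[(VertexId, String)] = sc.parallelize(Seq(
      (1L, "A"), (2L, "B"), (3L, "C"), (4L, "D")))
    // Each edge carries a source id, a destination id and an attribute.
    val edges: RDD[Edge[Int]] = sc.parallelize(Seq(
      Edge(1L, 2L, 1), Edge(1L, 3L, 1), Edge(2L, 3L, 1), Edge(3L, 4L, 1)))
    Graph(vertices, edges)
  }
}
```

Given a local context, for example `new SparkContext(new SparkConf().setMaster("local[2]").setAppName("demo"))`, `build(sc).vertices` is a `VertexRDD[String]` and `build(sc).edges` is an `EdgeRDD[Int]`.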
Vertex and Edges
[Figure: two panels labeled "Vertex" and "Edge", showing a single vertex A and an edge joining A and B]
Triplets Join Vertices and Edges
• The triplets operator joins vertices and edges:
[Figure: three tables]
Vertices: A, B, C, D
Edges: A B, A C, B C, C D
Triplets: A B, A C, B C, C D (one triplet per edge, joined with both endpoint vertices)
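The join performed by the triplets operator can be sketched as follows, reusing the four vertices and four edges from the figure. This assumes Spark with GraphX on the classpath; the numeric ids, the string edge attributes and the object name are illustrative choices, not part of the talk.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.graphx.{Edge, Graph}

object TripletsSketch {
  /** Returns one "src -> dst" string per triplet, sorted for stable output. */
  def describe(sc: SparkContext): Seq[String] = {
    val vertices = sc.parallelize(Seq((1L, "A"), (2L, "B"), (3L, "C"), (4L, "D")))
    val edges = sc.parallelize(Seq(
      Edge(1L, 2L, "AB"), Edge(1L, 3L, "AC"), Edge(2L, 3L, "BC"), Edge(3L, 4L, "CD")))
    val graph = Graph(vertices, edges)
    // Every triplet exposes the edge attribute plus both endpoint attributes.
    graph.triplets
      .map(t => s"${t.srcAttr} -> ${t.dstAttr}")
      .collect()
      .sorted
      .toSeq
  }
}
```

Each element of `graph.triplets` is an `EdgeTriplet`, so `srcAttr`, `dstAttr` and `attr` are all available without a manual join.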
Feedback Vertex Set
• A feedback vertex set of a graph is a set of vertices whose removal leaves a graph without cycles.
• Equivalently, every feedback vertex set contains at least one vertex of every cycle in the graph.
• The decision version of the feedback vertex set problem is NP-complete.
• A direct approach: enumerate each simple cycle.
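To make the definition concrete, here is a minimal sketch in plain Scala (no Spark) that checks whether a candidate set is a feedback vertex set of a directed graph. It assumes the graph is directed and given as a vertex set plus an edge list; the names `isAcyclic` and `isFeedbackVertexSet` are hypothetical helpers, and acyclicity is tested with Kahn's topological-sort algorithm rather than by enumerating cycles.

```scala
import scala.collection.mutable

object FeedbackVertexSetCheck {
  type Edge = (Int, Int) // (source, destination)

  /** True if the directed graph has no cycle (Kahn's algorithm). */
  def isAcyclic(vertices: Set[Int], edges: Seq[Edge]): Boolean = {
    val inDegree = mutable.Map(vertices.toSeq.map(_ -> 0): _*)
    edges.foreach { case (_, dst) => inDegree(dst) += 1 }
    val successors = edges.groupBy(_._1).map { case (src, es) => src -> es.map(_._2) }
    val ready = mutable.Queue(vertices.filter(inDegree(_) == 0).toSeq: _*)
    var removed = 0
    while (ready.nonEmpty) {
      val v = ready.dequeue()
      removed += 1
      for (w <- successors.getOrElse(v, Seq.empty)) {
        inDegree(w) -= 1
        if (inDegree(w) == 0) ready.enqueue(w)
      }
    }
    // Vertices on a cycle never reach in-degree zero, so they are never removed.
    removed == vertices.size
  }

  /** True if deleting `candidate` leaves the graph without cycles. */
  def isFeedbackVertexSet(vertices: Set[Int], edges: Seq[Edge], candidate: Set[Int]): Boolean = {
    val kept = vertices -- candidate
    val keptEdges = edges.filter { case (s, d) => kept(s) && kept(d) }
    isAcyclic(kept, keptEdges)
  }
}
```

For the cycle 1 → 2 → 3 → 1 with an extra edge 3 → 4, the set {3} passes the check while {4} does not, since vertex 4 lies on no cycle.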
Strongly Connected Components
[Figure: directed graph on vertices 1 to 10]
Strongly connected components can be processed in parallel, since no cycle spans two components.
SC1 – (1), SC2 – (5), SC3 – (8), SC4 – (9)
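A sketch of how the components might be obtained with GraphX's built-in `stronglyConnectedComponents`, assuming Spark with GraphX on the classpath. The example graph (two cycles joined by a one-way bridge) is invented for illustration and is not the graph from the slide; `numIter = 10` is an arbitrary bound that is ample for a graph this small.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.graphx.{Edge, Graph, VertexId}

object SccSketch {
  /** Groups vertex ids by the component id that GraphX assigns to them. */
  def components(sc: SparkContext): Map[VertexId, Set[VertexId]] = {
    // Hypothetical graph: cycle 1->2->3->1, cycle 4->5->4, one-way bridge 3->4.
    val edges = sc.parallelize(Seq(
      Edge(1L, 2L, 0), Edge(2L, 3L, 0), Edge(3L, 1L, 0),
      Edge(3L, 4L, 0),
      Edge(4L, 5L, 0), Edge(5L, 4L, 0)))
    val graph = Graph.fromEdges(edges, defaultValue = 0)
    // The vertex attribute of the result is the id of the vertex's component.
    val sccGraph = graph.stronglyConnectedComponents(numIter = 10)
    sccGraph.vertices
      .map { case (vertexId, componentId) => (componentId, vertexId) }
      .groupByKey()
      .mapValues(_.toSet)
      .collect()
      .toMap
  }
}
```

Because the bridge 3 → 4 has no return path, the two cycles land in separate components, and each component can then be handled independently.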
FVS Algorithm
# Greedy recursive solution
FVS(G)
  sccGraph = scc(G)
  for each graph in sccGraph
    for each vertex
      remove the vertex and recompute scc
    vertex V = vertex that gives the maximum number of SCCs
    # which means it kills the maximum number of cycles
    subGraph = subgraph(remove V)
    FVS(subGraph)
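The greedy recursion above can be sketched in plain Scala, without Spark, to show the control flow on a small in-memory graph. This is an illustration under assumptions, not the speakers' implementation: the graph is an adjacency map in which every vertex appears as a key, components are found with Kosaraju's two-pass DFS, and all names (`GreedyFvs`, `scc`, `remove`, `induced`, `fvs`) are hypothetical. Being a heuristic, it returns a valid feedback vertex set but not necessarily a minimum one.

```scala
import scala.collection.mutable

object GreedyFvs {
  type Graph = Map[Int, Set[Int]] // vertex -> successors; every vertex is a key

  /** Strongly connected components via Kosaraju's two-pass DFS. */
  def scc(g: Graph): List[Set[Int]] = {
    val visited = mutable.Set[Int]()
    var order = List.empty[Int] // vertices by decreasing finish time
    def dfs(v: Int): Unit = if (visited.add(v)) {
      g(v).foreach(dfs)
      order = v :: order
    }
    g.keys.foreach(dfs)
    // Transposed graph: predecessors of each vertex.
    val reversed: Graph = g.keys.map(v => v -> g.keys.filter(u => g(u)(v)).toSet).toMap
    val assigned = mutable.Set[Int]()
    def collect(v: Int): Set[Int] =
      if (!assigned.add(v)) Set.empty
      else reversed(v).foldLeft(Set(v))((acc, u) => acc ++ collect(u))
    order.map(collect).filter(_.nonEmpty)
  }

  /** The graph with vertex `v` and its incident edges deleted. */
  def remove(g: Graph, v: Int): Graph =
    (g - v).map { case (u, succ) => u -> (succ - v) }

  /** The subgraph induced by the vertex set `keep`. */
  def induced(g: Graph, keep: Set[Int]): Graph =
    g.collect { case (u, succ) if keep(u) => u -> succ.intersect(keep) }

  /** Greedy step per component: delete the vertex that leaves the most SCCs, then recurse. */
  def fvs(g: Graph): Set[Int] =
    scc(g).flatMap { component =>
      val sub = induced(g, component)
      val hasCycle = component.size > 1 || sub.exists { case (u, succ) => succ(u) }
      if (!hasCycle) Set.empty[Int]
      else {
        val best = component.maxBy(v => scc(remove(sub, v)).size)
        fvs(remove(sub, best)) + best
      }
    }.toSet
}
```

For two triangles that share vertex 3 (1 → 2 → 3 → 1 and 3 → 4 → 5 → 3), removing vertex 3 leaves four singleton components while removing any other vertex leaves two, so the greedy step picks 3 and the recursion stops there.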
FVS – Spark Implementation
sccGraph carries one extra property, sccID, on each vertex; extract it.
FVS – Spark Implementation
sccGraph = scc(G)
For each graph in sccGraph
FVS – Spark Implementation
For each vertex
  remove the vertex and recompute scc
# Z is the list of SCC counts after removing each vertex
FVS – Spark Implementation
vertex V = vertex that gives the maximum number of SCCs
# which means it kills the maximum number of cycles
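The code for these implementation slides did not survive extraction, so what follows is only a guess at the shape of one greedy step in GraphX, not the speakers' code. It assumes Spark with GraphX on the classpath and a graph with `Int` vertex and edge attributes; `sccCountWithout` and `bestVertex` are hypothetical names, and `numIter = 10` is an arbitrary bound.

```scala
import org.apache.spark.graphx.{Graph, VertexId}

object FvsStepSketch {
  /** Number of strongly connected components left after deleting `removed`. */
  def sccCountWithout(g: Graph[Int, Int], removed: VertexId): Long =
    g.subgraph(vpred = (id, _) => id != removed)
      .stronglyConnectedComponents(numIter = 10)
      .vertices
      .map { case (_, componentId) => componentId }
      .distinct()
      .count()

  /** One greedy step: the vertex whose removal yields the most components. */
  def bestVertex(g: Graph[Int, Int]): VertexId = {
    val candidates = g.vertices.map(_._1).collect()
    // Z in the slides: one SCC count per removed vertex.
    val z = candidates.map(v => v -> sccCountWithout(g, v))
    z.maxBy(_._2)._1
  }
}
```

This version launches one SCC computation per candidate vertex from the driver, which is simple to read but costly on large components.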
Pregel
• Graph DB
– Data Storage
– Data Mining
• Advantages
– Large-scale distributed computations
– Parallel algorithms for graphs on multiple machines
– Fault tolerance and distributability
Oldest Follower
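What is the age of the oldest follower of each user?
val oldestFollowerAge = graph
  .aggregateMessages[Int](
    ctx => ctx.sendToDst(ctx.srcAttr.age),  // map: send each follower's age to the followed user
    (a, b) => math.max(a, b)                // reduce: keep the larger age
  )
// aggregateMessages already returns a VertexRDD, so no .vertices call is needed
mapReduceTriplets is now aggregateMessages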
New in GraphX
In aggregateMessages:
• An EdgeContext exposes the triplet fields.
• It provides functions to explicitly send messages to the source and destination vertices.
• The user indicates which fields of the triplet are actually required (the tripletFields argument).
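The three points can be seen together in a small sketch of the oldest-follower query, assuming Spark with GraphX on the classpath. The `User` case class, the sample users and the convention that an edge src → dst means "src follows dst" are assumptions made for the example.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.graphx.{Edge, Graph, TripletFields, VertexId}

object OldestFollowerSketch {
  case class User(name: String, age: Int) // hypothetical vertex attribute

  /** Maps each followed user's id to the age of their oldest follower. */
  def oldestFollowerAge(sc: SparkContext): Map[VertexId, Int] = {
    val users = sc.parallelize(Seq(
      (1L, User("alice", 28)), (2L, User("bob", 45)), (3L, User("carol", 33))))
    // An edge src -> dst means "src follows dst".
    val follows = sc.parallelize(Seq(Edge(2L, 1L, 1), Edge(3L, 1L, 1), Edge(1L, 3L, 1)))
    val graph = Graph(users, follows)
    graph.aggregateMessages[Int](
      ctx => ctx.sendToDst(ctx.srcAttr.age), // EdgeContext exposes srcAttr and sendToDst
      (a, b) => math.max(a, b),              // merge messages arriving at the same vertex
      TripletFields.Src                      // only the source attribute is read
    ).collect().toMap
  }
}
```

Vertices that receive no message, such as a user nobody follows, are simply absent from the resulting `VertexRDD`.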
Theory – it’s Good
How it works – that’s awesome
Graphs are recursive data structures: the property of a vertex depends on the properties of its neighbors, which in turn depend on the properties of their neighbors.
Applications - GIS
• Algorithm: compute all vertices in a directed graph that can reach a given vertex.
• Can be used for watershed delineation in Geographic Information Systems.
[Figure: example graph] The vertices that can reach E are A and B.
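One way to express this reachability query with GraphX's Pregel operator is sketched below, assuming Spark with GraphX on the classpath. The idea is to reverse every edge and then flood a "reached" flag outward from the target. The graph type (`String` vertices, `Int` edges) and the name `canReach` are illustrative, and the sketch makes no claim to match the speakers' implementation.

```scala
import org.apache.spark.graphx.{Graph, VertexId}

object ReachabilitySketch {
  /** Ids of all vertices that have a directed path to `target` (target excluded). */
  def canReach(graph: Graph[String, Int], target: VertexId): Set[VertexId] = {
    // Reverse every edge so messages flow from the target back to its ancestors.
    val start: Graph[Boolean, Int] =
      graph.reverse.mapVertices((id, _) => id == target)
    val marked = start.pregel(initialMsg = false)(
      // A vertex stays reached once any message marks it.
      vprog = (_, reached, msg) => reached || msg,
      // Only reached vertices notify neighbors that are not reached yet.
      sendMsg = t =>
        if (t.srcAttr && !t.dstAttr) Iterator((t.dstId, true)) else Iterator.empty,
      mergeMsg = (a, b) => a || b)
    marked.vertices
      .filter { case (id, reached) => reached && id != target }
      .map(_._1)
      .collect()
      .toSet
  }
}
```

On a hypothetical chain A → B → E → C → D, calling `canReach` with E's id returns the ids of A and B, matching the slide's example; in a watershed setting the vertices would be terrain cells and the edges flow directions.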