SlideShare a Scribd company logo
1 of 31
Download to read offline
Spark GraphX & Pregel
Challenges and Best Practices
Ashutosh Trivedi (IIIT Bangalore)
Kaushik Ranjan (IIIT Bangalore)
Sigmoid-Meetup Bangalore
https://github.com/anantasty/SparkAlgorithms
Ashutosh & Kaushik, Sigmoid-Meetup
Bangalore Dec-2014
Agenda
•Introduction to GraphX
– How to describe a graph
– RDDs to store Graph
– Algorithms available
•Application in graph algorithms
– Feedback Vertex Set of a Graph
– Identifying parallel parts of the solution.
•Challenges we faced
•Best practices
2
Ashutosh & Kaushik, Sigmoid-Meetup
Bangalore Dec-2014
33
GraphX - Representation
Ashutosh & Kaushik, Sigmoid-Meetup
Bangalore Dec-2014
Graph Representation
4
class Graph [ V, E ] {
def Graph(vertices: Table[ (Id, V) ],
edges: Table[ (Id, Id, E) ])
• The VertexRDD[A] extends RDD[(VertexID, A)] and adds the additional
constraint that each VertexID occurs only once.
• Moreover, VertexRDD[A] represents a set of vertices each with an
attribute of type A
• The EdgeRDD[ED], extends RDD[Edge[ED]]
Ashutosh & Kaushik, Sigmoid-Meetup
Bangalore Dec-2014 5
GraphX - Representation
Ashutosh & Kaushik, Sigmoid-Meetup
Bangalore Dec-2014 6
A BA
Vertex and Edges
Vertex Edge
Ashutosh & Kaushik, Sigmoid-Meetup
Bangalore Dec-2014
Triplets Join Vertices and Edges
• The triplets operator joins vertices and edges:
TripletsVertices
B
A
C
D
Edges
A B
A C
B C
C D
A BA
B A C
B C
C D
7
Ashutosh & Kaushik, Sigmoid-Meetup
Bangalore Dec-2014
88
Triplets elements
Ashutosh & Kaushik, Sigmoid-Meetup
Bangalore Dec-2014 9
Subgraphs
Predicates vpred and epred
Ashutosh & Kaushik, Sigmoid-Meetup
Bangalore Dec-2014 10
Feedback Vertex Set
• A feedback vertex set of a graph is a set of vertices
whose removal leaves a graph without cycles.
• Each feedback vertex set contains at least one vertex of
any cycle in the graph.
• The feedback vertex set problem is an NP-
complete problem in computational complexity theory
• Enumerate each simple cycle.
Ashutosh & Kaushik, Sigmoid-Meetup
Bangalore Dec-2014 11
1 2
34
5
6
7
8
9
10
Strongly Connected Components
Each strongly connected component can be
considered in parallel since they do not share
any cycle
SC1 – (1) SC2 – (5) SC3 – (8) SC4 – (9)
Ashutosh & Kaushik, Sigmoid-Meetup
Bangalore Dec-2014 12
FVS Algorithm
#Greedy recursive solution
FVS(G)
sccGraph = scc(G)
For each graph in sccGraph
For each vertex
remove vertex and again calculate scc,
vertex V = vertex which give max number of scc
#which means it kills maximum cycles
subGraph = subgraph(remove V )
FVS (subGraph )
Ashutosh & Kaushik, Sigmoid-Meetup
Bangalore Dec-2014 13
1 2
4 3
2
4 3
Graph Iteration SCC count
3
1
4 3
1
1 2
4
3
1 2
4 3
1 2
4 3
Remove 2
Remove 1
Remove 3
Ashutosh & Kaushik, Sigmoid-Meetup
Bangalore Dec-2014 14
1
5
8 9
1 5 8 9Feedback Vertex Set
Ashutosh & Kaushik, Sigmoid-Meetup
Bangalore Dec-2014 15
FVS – Spark Implementation
sccGraph has one more property sccID on each vertices, extract it
Ashutosh & Kaushik, Sigmoid-Meetup
Bangalore Dec-2014 16
FVS – Spark Implementation
sccGraph = scc(G)
For each graph in sccGraph
Ashutosh & Kaushik, Sigmoid-Meetup
Bangalore Dec-2014 17
FVS – Spark Implementation
#Greedy recursive function
Ashutosh & Kaushik, Sigmoid-Meetup
Bangalore Dec-2014 18
FVS – Spark Implementation
For each vertex
remove vertex and again calculate scc,
# Z is a list of scc count after removing each vertex
Ashutosh & Kaushik, Sigmoid-Meetup
Bangalore Dec-2014 19
vertex V = vertex which give max number of scc
#which means it kills maximum cycles
FVS – Spark Implementation
Ashutosh & Kaushik, Sigmoid-Meetup
Bangalore Dec-2014 20
subGraph = subgraph(remove V )
FVS (subGraph )
FVS – Spark Implementation
Ashutosh & Kaushik, Sigmoid-Meetup
Bangalore Dec-2014 21
Pregel
• Graph DB
– Data Storage
– Data Mining
• Advantages
– Large-scale distributed computations
– Parallel-algorithms for graphs on multiple machines
– Fault tolerance and distributability
Ashutosh & Kaushik, Sigmoid-Meetup
Bangalore Dec-2014 22
Oldest Follower
What is the age of oldest follower of each user ?
Val oldestFollowerAge = graph
.aggregateMessages(
#map word => (word.dst.id, word.src.age),
#reduce (a,b) => max(a, b)
)
.vertices
mapReduceTriplets is now aggregateMessages
Ashutosh & Kaushik, Sigmoid-Meetup
Bangalore Dec-2014 23
In aggregateMessages :
• EdgeContext which exposes the triplet fields .
• functions to explicitly send messages to the source and
destination vertex.
• It require the user to indicate what fields in the triplet are
actually required.
New in GraphX
Ashutosh & Kaushik, Sigmoid-Meetup
Bangalore Dec-2014
Theory – it’s Good
How it works – that’s awesome
24
Graph’s are recursive data-structures, where the
property of a vertex is dependent on the properties of
it’s neighbors, which in turn are dependent on the
properties of their neighbors.
Ashutosh & Kaushik, Sigmoid-Meetup
Bangalore Dec-2014
Graph.Pregel ( initialMessage ) (
#message consumption
( vertexID, initialProperty, message ) → compute new property
,
#message generation
triplet → .. code ..
Iterator( vertexID, message )
Iterator.empty
,
#message aggregation
( existing message set, new message ) → NEW message set
)
25
Architecture
Ashutosh & Kaushik, Sigmoid-Meetup
Bangalore Dec-2014 26
1 2
4 3
1030
30 20
1 2
4 3
10
30
30 20
max [30,10,20]
max [20] max [10]
1 2
4 3
100
10 10
1 2
4 3
10
0
10 10
max [10] max [10]
Ashutosh & Kaushik, Sigmoid-Meetup
Bangalore Dec-2014 27
Example - output
1 2
4 3
100
0 0
Ashutosh & Kaushik, Sigmoid-Meetup
Bangalore Dec-2014
Applications - GIS
• Algorithm – to compute all vertices in a directed graph, that can
reach out to a given vertex.
• Can be used for watershed delineation in Geographic Information
Systems
28
Vertices that can reach out to E are A and B
Ashutosh & Kaushik, Sigmoid-Meetup
Bangalore Dec-2014
Algorithm
Graph.Pregel( Seq[vertexID’s] ) (
#message consumption
if vertex.state == 1
vertex.state → 2
else if vertex.state == 0
if ( vertex.adjacentVertices ∩ Seq[ vertexID’s ] ) isNotEmpty
vertex.state → 2
#message aggregator
Seq[existing vertex ID’s] U Seq[new vertex ID]
)
29
Ashutosh & Kaushik, Sigmoid-Meetup
Bangalore Dec-2014 30
#message generation
for each triplet
if destinationVertex.state == 1
message( sourceVertexID, Seq[destinationVertexID] )
message( destinationVertexID, Seq[destinationVertexID] )
else if sourceVertex.state == 1 and destinationVertex.state == 2
message( sourceVertexID, Seq[destinationVertexID] )
else message( empty )
Algorithm
References
• Fork our repository at
• https://github.com/anantasty/SparkAlgorithms
• Follow us at
• https://github.com/codeAshu
• https://github.com/kaushikranjan
• https://spark.apache.org/docs/latest/graphx-programming-guide.html
31

More Related Content

What's hot

Solving PostgreSQL wicked problems
Solving PostgreSQL wicked problemsSolving PostgreSQL wicked problems
Solving PostgreSQL wicked problemsAlexander Korotkov
 
Apache Spark on K8S Best Practice and Performance in the Cloud
Apache Spark on K8S Best Practice and Performance in the CloudApache Spark on K8S Best Practice and Performance in the Cloud
Apache Spark on K8S Best Practice and Performance in the CloudDatabricks
 
Pyspark Tutorial | Introduction to Apache Spark with Python | PySpark Trainin...
Pyspark Tutorial | Introduction to Apache Spark with Python | PySpark Trainin...Pyspark Tutorial | Introduction to Apache Spark with Python | PySpark Trainin...
Pyspark Tutorial | Introduction to Apache Spark with Python | PySpark Trainin...Edureka!
 
Comment l’architecture événementielle révolutionne la communication dans le S...
Comment l’architecture événementielle révolutionne la communication dans le S...Comment l’architecture événementielle révolutionne la communication dans le S...
Comment l’architecture événementielle révolutionne la communication dans le S...Vincent Lepot
 
Ibis: Seamless Transition Between Pandas and Apache Spark
Ibis: Seamless Transition Between Pandas and Apache SparkIbis: Seamless Transition Between Pandas and Apache Spark
Ibis: Seamless Transition Between Pandas and Apache SparkDatabricks
 
My first 90 days with ClickHouse.pdf
My first 90 days with ClickHouse.pdfMy first 90 days with ClickHouse.pdf
My first 90 days with ClickHouse.pdfAlkin Tezuysal
 
Dr. Elephant for Monitoring and Tuning Apache Spark Jobs on Hadoop with Carl ...
Dr. Elephant for Monitoring and Tuning Apache Spark Jobs on Hadoop with Carl ...Dr. Elephant for Monitoring and Tuning Apache Spark Jobs on Hadoop with Carl ...
Dr. Elephant for Monitoring and Tuning Apache Spark Jobs on Hadoop with Carl ...Databricks
 
Data Warehouse vs. Data Lake vs. Data Streaming – Friends, Enemies, Frenemies?
Data Warehouse vs. Data Lake vs. Data Streaming – Friends, Enemies, Frenemies?Data Warehouse vs. Data Lake vs. Data Streaming – Friends, Enemies, Frenemies?
Data Warehouse vs. Data Lake vs. Data Streaming – Friends, Enemies, Frenemies?Kai Wähner
 
On-boarding with JanusGraph Performance
On-boarding with JanusGraph PerformanceOn-boarding with JanusGraph Performance
On-boarding with JanusGraph PerformanceChin Huang
 
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang WangApache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang WangDatabricks
 
Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guideRyan Blue
 
Achieve Blazing-Fast Ingest Speeds with Apache Arrow
Achieve Blazing-Fast Ingest Speeds with Apache ArrowAchieve Blazing-Fast Ingest Speeds with Apache Arrow
Achieve Blazing-Fast Ingest Speeds with Apache ArrowNeo4j
 
Learn Apache Spark: A Comprehensive Guide
Learn Apache Spark: A Comprehensive GuideLearn Apache Spark: A Comprehensive Guide
Learn Apache Spark: A Comprehensive GuideWhizlabs
 
Securing data in hybrid environments using Apache Ranger
Securing data in hybrid environments using Apache RangerSecuring data in hybrid environments using Apache Ranger
Securing data in hybrid environments using Apache RangerDataWorks Summit
 
Blue-green deploys with Pulsar & Envoy in an event-driven microservice ecosys...
Blue-green deploys with Pulsar & Envoy in an event-driven microservice ecosys...Blue-green deploys with Pulsar & Envoy in an event-driven microservice ecosys...
Blue-green deploys with Pulsar & Envoy in an event-driven microservice ecosys...StreamNative
 
Producer Performance Tuning for Apache Kafka
Producer Performance Tuning for Apache KafkaProducer Performance Tuning for Apache Kafka
Producer Performance Tuning for Apache KafkaJiangjie Qin
 

What's hot (20)

Solving PostgreSQL wicked problems
Solving PostgreSQL wicked problemsSolving PostgreSQL wicked problems
Solving PostgreSQL wicked problems
 
Apache Spark on K8S Best Practice and Performance in the Cloud
Apache Spark on K8S Best Practice and Performance in the CloudApache Spark on K8S Best Practice and Performance in the Cloud
Apache Spark on K8S Best Practice and Performance in the Cloud
 
Pyspark Tutorial | Introduction to Apache Spark with Python | PySpark Trainin...
Pyspark Tutorial | Introduction to Apache Spark with Python | PySpark Trainin...Pyspark Tutorial | Introduction to Apache Spark with Python | PySpark Trainin...
Pyspark Tutorial | Introduction to Apache Spark with Python | PySpark Trainin...
 
Comment l’architecture événementielle révolutionne la communication dans le S...
Comment l’architecture événementielle révolutionne la communication dans le S...Comment l’architecture événementielle révolutionne la communication dans le S...
Comment l’architecture événementielle révolutionne la communication dans le S...
 
Ibis: Seamless Transition Between Pandas and Apache Spark
Ibis: Seamless Transition Between Pandas and Apache SparkIbis: Seamless Transition Between Pandas and Apache Spark
Ibis: Seamless Transition Between Pandas and Apache Spark
 
My first 90 days with ClickHouse.pdf
My first 90 days with ClickHouse.pdfMy first 90 days with ClickHouse.pdf
My first 90 days with ClickHouse.pdf
 
Dr. Elephant for Monitoring and Tuning Apache Spark Jobs on Hadoop with Carl ...
Dr. Elephant for Monitoring and Tuning Apache Spark Jobs on Hadoop with Carl ...Dr. Elephant for Monitoring and Tuning Apache Spark Jobs on Hadoop with Carl ...
Dr. Elephant for Monitoring and Tuning Apache Spark Jobs on Hadoop with Carl ...
 
Data Warehouse vs. Data Lake vs. Data Streaming – Friends, Enemies, Frenemies?
Data Warehouse vs. Data Lake vs. Data Streaming – Friends, Enemies, Frenemies?Data Warehouse vs. Data Lake vs. Data Streaming – Friends, Enemies, Frenemies?
Data Warehouse vs. Data Lake vs. Data Streaming – Friends, Enemies, Frenemies?
 
On-boarding with JanusGraph Performance
On-boarding with JanusGraph PerformanceOn-boarding with JanusGraph Performance
On-boarding with JanusGraph Performance
 
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang WangApache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
 
Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guide
 
FLiP Into Trino
FLiP Into TrinoFLiP Into Trino
FLiP Into Trino
 
Catalyst optimizer
Catalyst optimizerCatalyst optimizer
Catalyst optimizer
 
Achieve Blazing-Fast Ingest Speeds with Apache Arrow
Achieve Blazing-Fast Ingest Speeds with Apache ArrowAchieve Blazing-Fast Ingest Speeds with Apache Arrow
Achieve Blazing-Fast Ingest Speeds with Apache Arrow
 
Apache ZooKeeper
Apache ZooKeeperApache ZooKeeper
Apache ZooKeeper
 
Spark SQL - The internal -
Spark SQL - The internal -Spark SQL - The internal -
Spark SQL - The internal -
 
Learn Apache Spark: A Comprehensive Guide
Learn Apache Spark: A Comprehensive GuideLearn Apache Spark: A Comprehensive Guide
Learn Apache Spark: A Comprehensive Guide
 
Securing data in hybrid environments using Apache Ranger
Securing data in hybrid environments using Apache RangerSecuring data in hybrid environments using Apache Ranger
Securing data in hybrid environments using Apache Ranger
 
Blue-green deploys with Pulsar & Envoy in an event-driven microservice ecosys...
Blue-green deploys with Pulsar & Envoy in an event-driven microservice ecosys...Blue-green deploys with Pulsar & Envoy in an event-driven microservice ecosys...
Blue-green deploys with Pulsar & Envoy in an event-driven microservice ecosys...
 
Producer Performance Tuning for Apache Kafka
Producer Performance Tuning for Apache KafkaProducer Performance Tuning for Apache Kafka
Producer Performance Tuning for Apache Kafka
 

Viewers also liked

GraphX: Graph Analytics in Apache Spark (AMPCamp 5, 2014-11-20)
GraphX: Graph Analytics in Apache Spark (AMPCamp 5, 2014-11-20)GraphX: Graph Analytics in Apache Spark (AMPCamp 5, 2014-11-20)
GraphX: Graph Analytics in Apache Spark (AMPCamp 5, 2014-11-20)Ankur Dave
 
Building a Graph of all US Businesses Using Spark Technologies by Alexis Roos
Building a Graph of all US Businesses Using Spark Technologies by Alexis RoosBuilding a Graph of all US Businesses Using Spark Technologies by Alexis Roos
Building a Graph of all US Businesses Using Spark Technologies by Alexis RoosSpark Summit
 
Machine Learning and GraphX
Machine Learning and GraphXMachine Learning and GraphX
Machine Learning and GraphXAndy Petrella
 
Implementing Near-Realtime Datacenter Health Analytics using Model-driven Ver...
Implementing Near-Realtime Datacenter Health Analytics using Model-driven Ver...Implementing Near-Realtime Datacenter Health Analytics using Model-driven Ver...
Implementing Near-Realtime Datacenter Health Analytics using Model-driven Ver...Spark Summit
 
GraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communitiesGraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communitiesPaco Nathan
 
An excursion into Graph Analytics with Apache Spark GraphX
An excursion into Graph Analytics with Apache Spark GraphXAn excursion into Graph Analytics with Apache Spark GraphX
An excursion into Graph Analytics with Apache Spark GraphXKrishna Sankar
 
Relationship Extraction from Unstructured Text-Based on Stanford NLP with Spa...
Relationship Extraction from Unstructured Text-Based on Stanford NLP with Spa...Relationship Extraction from Unstructured Text-Based on Stanford NLP with Spa...
Relationship Extraction from Unstructured Text-Based on Stanford NLP with Spa...Spark Summit
 
Graph x pregel
Graph x pregelGraph x pregel
Graph x pregelSigmoid
 
5 Reasons Enterprise Adoption of Spark is Unstoppable by Mike Gualtieri
 5 Reasons Enterprise Adoption of Spark is Unstoppable by Mike Gualtieri 5 Reasons Enterprise Adoption of Spark is Unstoppable by Mike Gualtieri
5 Reasons Enterprise Adoption of Spark is Unstoppable by Mike GualtieriSpark Summit
 
Fighting financial crime with graph analysis at BIWA Summit 2017
Fighting financial crime with graph analysis at BIWA Summit 2017Fighting financial crime with graph analysis at BIWA Summit 2017
Fighting financial crime with graph analysis at BIWA Summit 2017Linkurious
 
Using graphs technologies for intelligence analysis.
Using graphs technologies for intelligence analysis. Using graphs technologies for intelligence analysis.
Using graphs technologies for intelligence analysis. Linkurious
 
Xia Zhu – Intel at MLconf ATL
Xia Zhu – Intel at MLconf ATLXia Zhu – Intel at MLconf ATL
Xia Zhu – Intel at MLconf ATLMLconf
 
Using GraphX/Pregel on Browsing History to Discover Purchase Intent by Lisa Z...
Using GraphX/Pregel on Browsing History to Discover Purchase Intent by Lisa Z...Using GraphX/Pregel on Browsing History to Discover Purchase Intent by Lisa Z...
Using GraphX/Pregel on Browsing History to Discover Purchase Intent by Lisa Z...Spark Summit
 
ggplot2.SparkR: Rebooting ggplot2 for Scalable Big Data Visualization by Jong...
ggplot2.SparkR: Rebooting ggplot2 for Scalable Big Data Visualization by Jong...ggplot2.SparkR: Rebooting ggplot2 for Scalable Big Data Visualization by Jong...
ggplot2.SparkR: Rebooting ggplot2 for Scalable Big Data Visualization by Jong...Spark Summit
 
An excursion into Text Analytics with Apache Spark
An excursion into Text Analytics with Apache SparkAn excursion into Text Analytics with Apache Spark
An excursion into Text Analytics with Apache SparkKrishna Sankar
 
Lambda at Weather Scale by Robbie Strickland
Lambda at Weather Scale by Robbie StricklandLambda at Weather Scale by Robbie Strickland
Lambda at Weather Scale by Robbie StricklandSpark Summit
 
Social Network Analysis with Spark
Social Network Analysis with SparkSocial Network Analysis with Spark
Social Network Analysis with SparkGhulam Imaduddin
 
Using spark for timeseries graph analytics
Using spark for timeseries graph analyticsUsing spark for timeseries graph analytics
Using spark for timeseries graph analyticsSigmoid
 

Viewers also liked (20)

GraphX: Graph Analytics in Apache Spark (AMPCamp 5, 2014-11-20)
GraphX: Graph Analytics in Apache Spark (AMPCamp 5, 2014-11-20)GraphX: Graph Analytics in Apache Spark (AMPCamp 5, 2014-11-20)
GraphX: Graph Analytics in Apache Spark (AMPCamp 5, 2014-11-20)
 
Building a Graph of all US Businesses Using Spark Technologies by Alexis Roos
Building a Graph of all US Businesses Using Spark Technologies by Alexis RoosBuilding a Graph of all US Businesses Using Spark Technologies by Alexis Roos
Building a Graph of all US Businesses Using Spark Technologies by Alexis Roos
 
Machine Learning and GraphX
Machine Learning and GraphXMachine Learning and GraphX
Machine Learning and GraphX
 
Implementing Near-Realtime Datacenter Health Analytics using Model-driven Ver...
Implementing Near-Realtime Datacenter Health Analytics using Model-driven Ver...Implementing Near-Realtime Datacenter Health Analytics using Model-driven Ver...
Implementing Near-Realtime Datacenter Health Analytics using Model-driven Ver...
 
GraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communitiesGraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communities
 
An excursion into Graph Analytics with Apache Spark GraphX
An excursion into Graph Analytics with Apache Spark GraphXAn excursion into Graph Analytics with Apache Spark GraphX
An excursion into Graph Analytics with Apache Spark GraphX
 
Relationship Extraction from Unstructured Text-Based on Stanford NLP with Spa...
Relationship Extraction from Unstructured Text-Based on Stanford NLP with Spa...Relationship Extraction from Unstructured Text-Based on Stanford NLP with Spa...
Relationship Extraction from Unstructured Text-Based on Stanford NLP with Spa...
 
Ashutosh pycon
Ashutosh pyconAshutosh pycon
Ashutosh pycon
 
Spark algorithms
Spark algorithmsSpark algorithms
Spark algorithms
 
Graph x pregel
Graph x pregelGraph x pregel
Graph x pregel
 
5 Reasons Enterprise Adoption of Spark is Unstoppable by Mike Gualtieri
 5 Reasons Enterprise Adoption of Spark is Unstoppable by Mike Gualtieri 5 Reasons Enterprise Adoption of Spark is Unstoppable by Mike Gualtieri
5 Reasons Enterprise Adoption of Spark is Unstoppable by Mike Gualtieri
 
Fighting financial crime with graph analysis at BIWA Summit 2017
Fighting financial crime with graph analysis at BIWA Summit 2017Fighting financial crime with graph analysis at BIWA Summit 2017
Fighting financial crime with graph analysis at BIWA Summit 2017
 
Using graphs technologies for intelligence analysis.
Using graphs technologies for intelligence analysis. Using graphs technologies for intelligence analysis.
Using graphs technologies for intelligence analysis.
 
Xia Zhu – Intel at MLconf ATL
Xia Zhu – Intel at MLconf ATLXia Zhu – Intel at MLconf ATL
Xia Zhu – Intel at MLconf ATL
 
Using GraphX/Pregel on Browsing History to Discover Purchase Intent by Lisa Z...
Using GraphX/Pregel on Browsing History to Discover Purchase Intent by Lisa Z...Using GraphX/Pregel on Browsing History to Discover Purchase Intent by Lisa Z...
Using GraphX/Pregel on Browsing History to Discover Purchase Intent by Lisa Z...
 
ggplot2.SparkR: Rebooting ggplot2 for Scalable Big Data Visualization by Jong...
ggplot2.SparkR: Rebooting ggplot2 for Scalable Big Data Visualization by Jong...ggplot2.SparkR: Rebooting ggplot2 for Scalable Big Data Visualization by Jong...
ggplot2.SparkR: Rebooting ggplot2 for Scalable Big Data Visualization by Jong...
 
An excursion into Text Analytics with Apache Spark
An excursion into Text Analytics with Apache SparkAn excursion into Text Analytics with Apache Spark
An excursion into Text Analytics with Apache Spark
 
Lambda at Weather Scale by Robbie Strickland
Lambda at Weather Scale by Robbie StricklandLambda at Weather Scale by Robbie Strickland
Lambda at Weather Scale by Robbie Strickland
 
Social Network Analysis with Spark
Social Network Analysis with SparkSocial Network Analysis with Spark
Social Network Analysis with Spark
 
Using spark for timeseries graph analytics
Using spark for timeseries graph analyticsUsing spark for timeseries graph analytics
Using spark for timeseries graph analytics
 

Similar to GraphX and Pregel - Apache Spark

Intro to Apache Spark and Scala, Austin ACM SIGKDD, 7/9/2014
Intro to Apache Spark and Scala, Austin ACM SIGKDD, 7/9/2014Intro to Apache Spark and Scala, Austin ACM SIGKDD, 7/9/2014
Intro to Apache Spark and Scala, Austin ACM SIGKDD, 7/9/2014Roger Huang
 
Multiple Graphs: Updatable Views
Multiple Graphs: Updatable ViewsMultiple Graphs: Updatable Views
Multiple Graphs: Updatable ViewsopenCypher
 
Stockage, manipulation et analyse de données matricielles avec PostGIS Raster
Stockage, manipulation et analyse de données matricielles avec PostGIS RasterStockage, manipulation et analyse de données matricielles avec PostGIS Raster
Stockage, manipulation et analyse de données matricielles avec PostGIS RasterACSG Section Montréal
 
Learn basics of Clojure/script and Reagent
Learn basics of Clojure/script and ReagentLearn basics of Clojure/script and Reagent
Learn basics of Clojure/script and ReagentMaty Fedak
 
Introduction to Mahout
Introduction to MahoutIntroduction to Mahout
Introduction to MahoutTed Dunning
 
Introduction to Mahout given at Twin Cities HUG
Introduction to Mahout given at Twin Cities HUGIntroduction to Mahout given at Twin Cities HUG
Introduction to Mahout given at Twin Cities HUGMapR Technologies
 
Roadmap y Novedades de producto
Roadmap y Novedades de productoRoadmap y Novedades de producto
Roadmap y Novedades de productoNeo4j
 
Shape Safety in Tensor Programming is Easy for a Theorem Prover -SBTB 2021
Shape Safety in Tensor Programming is Easy for a Theorem Prover -SBTB 2021Shape Safety in Tensor Programming is Easy for a Theorem Prover -SBTB 2021
Shape Safety in Tensor Programming is Easy for a Theorem Prover -SBTB 2021Peng Cheng
 
Grill at bigdata-cloud conf
Grill at bigdata-cloud confGrill at bigdata-cloud conf
Grill at bigdata-cloud confamarsri
 
GraphQL & DGraph with Go
GraphQL & DGraph with GoGraphQL & DGraph with Go
GraphQL & DGraph with GoJames Tan
 
GraphQL Meetup Bangkok 3.0
GraphQL Meetup Bangkok 3.0GraphQL Meetup Bangkok 3.0
GraphQL Meetup Bangkok 3.0Tobias Meixner
 
RR & Docker @ MuensteR Meetup (Sep 2017)
RR & Docker @ MuensteR Meetup (Sep 2017)RR & Docker @ MuensteR Meetup (Sep 2017)
RR & Docker @ MuensteR Meetup (Sep 2017)Daniel Nüst
 
ABSTRACT GRAPH MACHINE: MODELING ORDERINGS IN ASYNCHRONOUS DISTRIBUTED-MEMORY...
ABSTRACT GRAPH MACHINE: MODELING ORDERINGS IN ASYNCHRONOUS DISTRIBUTED-MEMORY...ABSTRACT GRAPH MACHINE: MODELING ORDERINGS IN ASYNCHRONOUS DISTRIBUTED-MEMORY...
ABSTRACT GRAPH MACHINE: MODELING ORDERINGS IN ASYNCHRONOUS DISTRIBUTED-MEMORY...Thejaka Amila Kanewala, Ph.D.
 
Multiple graphs in openCypher
Multiple graphs in openCypherMultiple graphs in openCypher
Multiple graphs in openCypheropenCypher
 
PG-4034, Using OpenGL and DirectX for Heterogeneous Compute, by Karl Hillesland
PG-4034, Using OpenGL and DirectX for Heterogeneous Compute, by Karl HilleslandPG-4034, Using OpenGL and DirectX for Heterogeneous Compute, by Karl Hillesland
PG-4034, Using OpenGL and DirectX for Heterogeneous Compute, by Karl HilleslandAMD Developer Central
 
Intro to Apache Spark by Marco Vasquez
Intro to Apache Spark by Marco VasquezIntro to Apache Spark by Marco Vasquez
Intro to Apache Spark by Marco VasquezMapR Technologies
 
2017 RM-URISA Track: Spatial SQL - The Best Kept Secret in the Geospatial World
2017 RM-URISA Track:  Spatial SQL - The Best Kept Secret in the Geospatial World2017 RM-URISA Track:  Spatial SQL - The Best Kept Secret in the Geospatial World
2017 RM-URISA Track: Spatial SQL - The Best Kept Secret in the Geospatial WorldGIS in the Rockies
 
Loom & Functional Graphs in Clojure @ LambdaConf 2015
Loom & Functional Graphs in Clojure @ LambdaConf 2015Loom & Functional Graphs in Clojure @ LambdaConf 2015
Loom & Functional Graphs in Clojure @ LambdaConf 2015Aysylu Greenberg
 

Similar to GraphX and Pregel - Apache Spark (20)

Intro to Apache Spark and Scala, Austin ACM SIGKDD, 7/9/2014
Intro to Apache Spark and Scala, Austin ACM SIGKDD, 7/9/2014Intro to Apache Spark and Scala, Austin ACM SIGKDD, 7/9/2014
Intro to Apache Spark and Scala, Austin ACM SIGKDD, 7/9/2014
 
Scala 20140715
Scala 20140715Scala 20140715
Scala 20140715
 
Multiple Graphs: Updatable Views
Multiple Graphs: Updatable ViewsMultiple Graphs: Updatable Views
Multiple Graphs: Updatable Views
 
Stockage, manipulation et analyse de données matricielles avec PostGIS Raster
Stockage, manipulation et analyse de données matricielles avec PostGIS RasterStockage, manipulation et analyse de données matricielles avec PostGIS Raster
Stockage, manipulation et analyse de données matricielles avec PostGIS Raster
 
Learn basics of Clojure/script and Reagent
Learn basics of Clojure/script and ReagentLearn basics of Clojure/script and Reagent
Learn basics of Clojure/script and Reagent
 
Introduction to Mahout
Introduction to MahoutIntroduction to Mahout
Introduction to Mahout
 
Introduction to Mahout given at Twin Cities HUG
Introduction to Mahout given at Twin Cities HUGIntroduction to Mahout given at Twin Cities HUG
Introduction to Mahout given at Twin Cities HUG
 
Roadmap y Novedades de producto
Roadmap y Novedades de productoRoadmap y Novedades de producto
Roadmap y Novedades de producto
 
Shape Safety in Tensor Programming is Easy for a Theorem Prover -SBTB 2021
Shape Safety in Tensor Programming is Easy for a Theorem Prover -SBTB 2021Shape Safety in Tensor Programming is Easy for a Theorem Prover -SBTB 2021
Shape Safety in Tensor Programming is Easy for a Theorem Prover -SBTB 2021
 
Grill at bigdata-cloud conf
Grill at bigdata-cloud confGrill at bigdata-cloud conf
Grill at bigdata-cloud conf
 
GraphQL & DGraph with Go
GraphQL & DGraph with GoGraphQL & DGraph with Go
GraphQL & DGraph with Go
 
GraphQL Meetup Bangkok 3.0
GraphQL Meetup Bangkok 3.0GraphQL Meetup Bangkok 3.0
GraphQL Meetup Bangkok 3.0
 
RR & Docker @ MuensteR Meetup (Sep 2017)
RR & Docker @ MuensteR Meetup (Sep 2017)RR & Docker @ MuensteR Meetup (Sep 2017)
RR & Docker @ MuensteR Meetup (Sep 2017)
 
ABSTRACT GRAPH MACHINE: MODELING ORDERINGS IN ASYNCHRONOUS DISTRIBUTED-MEMORY...
ABSTRACT GRAPH MACHINE: MODELING ORDERINGS IN ASYNCHRONOUS DISTRIBUTED-MEMORY...ABSTRACT GRAPH MACHINE: MODELING ORDERINGS IN ASYNCHRONOUS DISTRIBUTED-MEMORY...
ABSTRACT GRAPH MACHINE: MODELING ORDERINGS IN ASYNCHRONOUS DISTRIBUTED-MEMORY...
 
Multiple graphs in openCypher
Multiple graphs in openCypherMultiple graphs in openCypher
Multiple graphs in openCypher
 
PG-4034, Using OpenGL and DirectX for Heterogeneous Compute, by Karl Hillesland
PG-4034, Using OpenGL and DirectX for Heterogeneous Compute, by Karl HilleslandPG-4034, Using OpenGL and DirectX for Heterogeneous Compute, by Karl Hillesland
PG-4034, Using OpenGL and DirectX for Heterogeneous Compute, by Karl Hillesland
 
Intro to Apache Spark by Marco Vasquez
Intro to Apache Spark by Marco VasquezIntro to Apache Spark by Marco Vasquez
Intro to Apache Spark by Marco Vasquez
 
2017 RM-URISA Track: Spatial SQL - The Best Kept Secret in the Geospatial World
2017 RM-URISA Track:  Spatial SQL - The Best Kept Secret in the Geospatial World2017 RM-URISA Track:  Spatial SQL - The Best Kept Secret in the Geospatial World
2017 RM-URISA Track: Spatial SQL - The Best Kept Secret in the Geospatial World
 
Graph.pptx
Graph.pptxGraph.pptx
Graph.pptx
 
Loom & Functional Graphs in Clojure @ LambdaConf 2015
Loom & Functional Graphs in Clojure @ LambdaConf 2015Loom & Functional Graphs in Clojure @ LambdaConf 2015
Loom & Functional Graphs in Clojure @ LambdaConf 2015
 

Recently uploaded

Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Seán Kennedy
 
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesConf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesTimothy Spann
 
Networking Case Study prepared by teacher.pptx
Networking Case Study prepared by teacher.pptxNetworking Case Study prepared by teacher.pptx
Networking Case Study prepared by teacher.pptxHimangsuNath
 
Unveiling the Role of Social Media Suspect Investigators in Preventing Online...
Unveiling the Role of Social Media Suspect Investigators in Preventing Online...Unveiling the Role of Social Media Suspect Investigators in Preventing Online...
Unveiling the Role of Social Media Suspect Investigators in Preventing Online...Milind Agarwal
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 217djon017
 
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxmodul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxaleedritatuxx
 
Real-Time AI Streaming - AI Max Princeton
Real-Time AI  Streaming - AI Max PrincetonReal-Time AI  Streaming - AI Max Princeton
Real-Time AI Streaming - AI Max PrincetonTimothy Spann
 
Cyber awareness ppt on the recorded data
Cyber awareness ppt on the recorded dataCyber awareness ppt on the recorded data
Cyber awareness ppt on the recorded dataTecnoIncentive
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsVICTOR MAESTRE RAMIREZ
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...Amil Baba Dawood bangali
 
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptxThe Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptxTasha Penwell
 
Principles and Practices of Data Visualization
Principles and Practices of Data VisualizationPrinciples and Practices of Data Visualization
Principles and Practices of Data VisualizationKianJazayeri1
 
Decoding Patterns: Customer Churn Prediction Data Analysis Project
Decoding Patterns: Customer Churn Prediction Data Analysis ProjectDecoding Patterns: Customer Churn Prediction Data Analysis Project
Decoding Patterns: Customer Churn Prediction Data Analysis ProjectBoston Institute of Analytics
 
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Boston Institute of Analytics
 
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024Susanna-Assunta Sansone
 
Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxMike Bennett
 
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Boston Institute of Analytics
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryJeremy Anderson
 
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...Dr Arash Najmaei ( Phd., MBA, BSc)
 

Recently uploaded (20)

Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...
 
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesConf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
 
Data Analysis Project: Stroke Prediction
Data Analysis Project: Stroke PredictionData Analysis Project: Stroke Prediction
Data Analysis Project: Stroke Prediction
 
Networking Case Study prepared by teacher.pptx
Networking Case Study prepared by teacher.pptxNetworking Case Study prepared by teacher.pptx
Networking Case Study prepared by teacher.pptx
 
Unveiling the Role of Social Media Suspect Investigators in Preventing Online...
Unveiling the Role of Social Media Suspect Investigators in Preventing Online...Unveiling the Role of Social Media Suspect Investigators in Preventing Online...
Unveiling the Role of Social Media Suspect Investigators in Preventing Online...
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2
 
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxmodul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
 
Real-Time AI Streaming - AI Max Princeton
Real-Time AI  Streaming - AI Max PrincetonReal-Time AI  Streaming - AI Max Princeton
Real-Time AI Streaming - AI Max Princeton
 
Cyber awareness ppt on the recorded data
Cyber awareness ppt on the recorded dataCyber awareness ppt on the recorded data
Cyber awareness ppt on the recorded data
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business Professionals
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
 
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptxThe Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
 
Principles and Practices of Data Visualization
Principles and Practices of Data VisualizationPrinciples and Practices of Data Visualization
Principles and Practices of Data Visualization
 
Decoding Patterns: Customer Churn Prediction Data Analysis Project
Decoding Patterns: Customer Churn Prediction Data Analysis ProjectDecoding Patterns: Customer Churn Prediction Data Analysis Project
Decoding Patterns: Customer Churn Prediction Data Analysis Project
 
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
 
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
 
Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptx
 
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data Story
 
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
 

GraphX and Pregel - Apache Spark

  • 1. Spark GraphX & Pregel Challenges and Best Practices Ashutosh Trivedi (IIIT Bangalore) Kaushik Ranjan (IIIT Bangalore) Sigmoid-Meetup Bangalore https://github.com/anantasty/SparkAlgorithms
  • 2. Ashutosh & Kaushik, Sigmoid-Meetup Bangalore Dec-2014 Agenda •Introduction to GraphX – How to describe a graph – RDDs to store Graph – Algorithms available •Application in graph algorithms – Feedback Vertex Set of a Graph – Identifying parallel parts of the solution. •Challenges we faced •Best practices 2
  • 3. Ashutosh & Kaushik, Sigmoid-Meetup Bangalore Dec-2014 33 GraphX - Representation
  • 4. Ashutosh & Kaushik, Sigmoid-Meetup Bangalore Dec-2014 Graph Representation 4 class Graph [ V, E ] { def Graph(vertices: Table[ (Id, V) ], edges: Table[ (Id, Id, E) ]) • The VertexRDD[A] extends RDD[(VertexID, A)] and adds the additional constraint that each VertexID occurs only once. • Moreover, VertexRDD[A] represents a set of vertices each with an attribute of type A • The EdgeRDD[ED], extends RDD[Edge[ED]]
  • 5. Ashutosh & Kaushik, Sigmoid-Meetup Bangalore Dec-2014 5 GraphX - Representation
  • 6. Ashutosh & Kaushik, Sigmoid-Meetup Bangalore Dec-2014 6 A BA Vertex and Edges Vertex Edge
  • 7. Ashutosh & Kaushik, Sigmoid-Meetup Bangalore Dec-2014 Triplets Join Vertices and Edges • The triplets operator joins vertices and edges: TripletsVertices B A C D Edges A B A C B C C D A BA B A C B C C D 7
  • 8. Ashutosh & Kaushik, Sigmoid-Meetup Bangalore Dec-2014 88 Triplets elements
  • 9. Ashutosh & Kaushik, Sigmoid-Meetup Bangalore Dec-2014 9 Subgraphs Predicates vpred and epred
  • 10. Ashutosh & Kaushik, Sigmoid-Meetup Bangalore Dec-2014 10 Feedback Vertex Set • A feedback vertex set of a graph is a set of vertices whose removal leaves a graph without cycles. • Each feedback vertex set contains at least one vertex of any cycle in the graph. • The feedback vertex set problem is an NP- complete problem in computational complexity theory • Enumerate each simple cycle.
  • 11. Ashutosh & Kaushik, Sigmoid-Meetup Bangalore Dec-2014 11 1 2 34 5 6 7 8 9 10 Strongly Connected Components Each strongly connected component can be considered in parallel since they do not share any cycle SC1 – (1) SC2 – (5) SC3 – (8) SC4 – (9)
  • 12. Ashutosh & Kaushik, Sigmoid-Meetup Bangalore Dec-2014 12 FVS Algorithm #Greedy recursive solution FVS(G) sccGraph = scc(G) For each graph in sccGraph For each vertex remove vertex and again calculate scc, vertex V = vertex which give max number of scc #which means it kills maximum cycles subGraph = subgraph(remove V ) FVS (subGraph )
  • 13. Ashutosh & Kaushik, Sigmoid-Meetup Bangalore Dec-2014 13 1 2 4 3 2 4 3 Graph Iteration SCC count 3 1 4 3 1 1 2 4 3 1 2 4 3 1 2 4 3 Remove 2 Remove 1 Remove 3
  • 14. Ashutosh & Kaushik, Sigmoid-Meetup Bangalore Dec-2014 14 1 5 8 9 1 5 8 9Feedback Vertex Set
  • 15. Ashutosh & Kaushik, Sigmoid-Meetup Bangalore Dec-2014 15 FVS – Spark Implementation sccGraph has one more property sccID on each vertices, extract it
  • 16. Ashutosh & Kaushik, Sigmoid-Meetup Bangalore Dec-2014 16 FVS – Spark Implementation sccGraph = scc(G) For each graph in sccGraph
  • 17. Ashutosh & Kaushik, Sigmoid-Meetup Bangalore Dec-2014 17 FVS – Spark Implementation #Greedy recursive function
  • 18. Ashutosh & Kaushik, Sigmoid-Meetup Bangalore Dec-2014 18 FVS – Spark Implementation For each vertex remove vertex and again calculate scc, # Z is a list of scc count after removing each vertex
  • 19. Ashutosh & Kaushik, Sigmoid-Meetup Bangalore Dec-2014 19 vertex V = vertex which give max number of scc #which means it kills maximum cycles FVS – Spark Implementation
  • 20. Ashutosh & Kaushik, Sigmoid-Meetup Bangalore Dec-2014 20 subGraph = subgraph(remove V ) FVS (subGraph ) FVS – Spark Implementation
  • 21. Ashutosh & Kaushik, Sigmoid-Meetup Bangalore Dec-2014 21 Pregel • Graph DB – Data Storage – Data Mining • Advantages – Large-scale distributed computations – Parallel-algorithms for graphs on multiple machines – Fault tolerance and distributability
  • 22. Ashutosh & Kaushik, Sigmoid-Meetup Bangalore Dec-2014 22 Oldest Follower What is the age of oldest follower of each user ? Val oldestFollowerAge = graph .aggregateMessages( #map word => (word.dst.id, word.src.age), #reduce (a,b) => max(a, b) ) .vertices mapReduceTriplets is now aggregateMessages
  • 23. Ashutosh & Kaushik, Sigmoid-Meetup Bangalore Dec-2014 23 In aggregateMessages : • EdgeContext which exposes the triplet fields . • functions to explicitly send messages to the source and destination vertex. • It require the user to indicate what fields in the triplet are actually required. New in GraphX
  • 24. Ashutosh & Kaushik, Sigmoid-Meetup Bangalore Dec-2014 Theory – it’s Good How it works – that’s awesome 24 Graph’s are recursive data-structures, where the property of a vertex is dependent on the properties of it’s neighbors, which in turn are dependent on the properties of their neighbors.
  • 25. Ashutosh & Kaushik, Sigmoid-Meetup Bangalore Dec-2014 Graph.Pregel ( initialMessage ) ( #message consumption ( vertexID, initialProperty, message ) → compute new property , #message generation triplet → .. code .. Iterator( vertexID, message ) Iterator.empty , #message aggregation ( existing message set, new message ) → NEW message set ) 25 Architecture
  • 26. Ashutosh & Kaushik, Sigmoid-Meetup Bangalore Dec-2014 26 1 2 4 3 1030 30 20 1 2 4 3 10 30 30 20 max [30,10,20] max [20] max [10] 1 2 4 3 100 10 10 1 2 4 3 10 0 10 10 max [10] max [10]
  • 27. Ashutosh & Kaushik, Sigmoid-Meetup Bangalore Dec-2014 27 Example - output 1 2 4 3 100 0 0
  • 28. Ashutosh & Kaushik, Sigmoid-Meetup Bangalore Dec-2014 Applications - GIS • Algorithm – to compute all vertices in a directed graph, that can reach out to a given vertex. • Can be used for watershed delineation in Geographic Information Systems 28 Vertices that can reach out to E are A and B
  • 29. Ashutosh & Kaushik, Sigmoid-Meetup Bangalore Dec-2014 Algorithm Graph.Pregel( Seq[vertexID’s] ) ( #message consumption if vertex.state == 1 vertex.state → 2 else if vertex.state == 0 if ( vertex.adjacentVertices ∩ Seq[ vertexID’s ] ) isNotEmpty vertex.state → 2 #message aggregator Seq[existing vertex ID’s] U Seq[new vertex ID] ) 29
  • 30. Ashutosh & Kaushik, Sigmoid-Meetup Bangalore Dec-2014 30 #message generation for each triplet if destinationVertex.state == 1 message( sourceVertexID, Seq[destinationVertexID] ) message( destinationVertexID, Seq[destinationVertexID] ) else if sourceVertex.state == 1 and destinationVertex.state == 2 message( sourceVertexID, Seq[destinationVertexID] ) else message( empty ) Algorithm
  • 31. References • Fork our repository at • https://github.com/anantasty/SparkAlgorithms • Follow us at • https://github.com/codeAshu • https://github.com/kaushikranjan • https://spark.apache.org/docs/latest/graphx-programming-guide.html 31