GraphX: Unifying Data-Parallel and Graph-Parallel Analytics
*These slides are best viewed in PowerPoint with anima
Graphs are Central to Analytics
[Figure: analytics pipeline built from raw Wikipedia XML, combining tables and graphs: Text Table (Title, Body), Hyperlinks, Term-Doc Graph, Editor Graph, and Discussion Table (User, Disc.) feed PageRank, Topic Model (LDA), and Community Detection, producing Top 20 Pages (Title, PR), Word Topics (Word, Topic), User Community (User, Com.), and Community Topic (Topic, Com.)]
PageRank: Identifying Leaders
[Equation: rank of user i, computed from a weighted sum of neighbors' ranks]
Update ranks in parallel
Iterate until convergence
The Graph-Parallel Pattern
Model / Alg. State
Computation depends only on the neighbors
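The rank update described above (each rank recomputed from its neighbors' ranks, all vertices updated together, repeated until convergence) can be sketched on plain Scala collections. This is a minimal illustration, not the code from the talk: the reset value 0.15, the out-degree weighting, the tolerance, and the names `PageRankSketch`, `step`, and `run` are all assumptions, and it assumes Scala 2.13 or later for `groupMapReduce`.

```scala
object PageRankSketch {
  type Id = Int

  // One synchronous sweep: every vertex recomputes its rank from in-neighbors only.
  // Assumption: each edge is weighted by 1 / outDegree(src); reset = 0.15.
  def step(ranks: Map[Id, Double], edges: Seq[(Id, Id)], reset: Double = 0.15): Map[Id, Double] = {
    val outDeg: Map[Id, Int] = edges.groupBy(_._1).map { case (src, es) => src -> es.size }
    val incoming: Map[Id, Double] = edges
      .map { case (src, dst) => dst -> ranks(src) / outDeg(src) }
      .groupMapReduce(_._1)(_._2)(_ + _)
    ranks.map { case (id, _) => id -> (reset + (1.0 - reset) * incoming.getOrElse(id, 0.0)) }
  }

  // Iterate until the largest per-vertex change drops below tol (edges must be non-empty).
  def run(edges: Seq[(Id, Id)], tol: Double = 1e-6, maxIter: Int = 100): Map[Id, Double] = {
    val ids = edges.flatMap { case (s, d) => Seq(s, d) }.distinct
    var ranks: Map[Id, Double] = ids.map(i => i -> 1.0).toMap
    var delta = Double.MaxValue
    var iter = 0
    while (delta > tol && iter < maxIter) {
      val next = step(ranks, edges)
      delta = ids.map(i => math.abs(next(i) - ranks(i))).max
      ranks = next
      iter += 1
    }
    ranks
  }
}
```

Because `step` reads only the previous iteration's ranks, every vertex update is independent of the others within a sweep, which is what makes the pattern graph-parallel.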
Many Graph-Parallel Algorithms
• Collaborative Filtering
– Alternating Least Squares
– Stochastic Gradient Descent
– Tensor Factorization
• Structured Prediction
– Loopy Belief Propagation
– Max-Product Linear Programs
– Gibbs Sampling
• Semi-supervised ML
– Graph SSL
– CoEM
• Community Detection
– Triangle-Counting
– K-core Decomposition
– K-Truss
• Graph Analytics
– PageRank
– Personalized PageRank
– Shortest Path
– Graph Coloring
• Classification
– Neural Networks
Graph-Parallel Systems
Expose specialized APIs to simplify graph programming.
Exploit graph structure to achieve orders-of-magnitude performance gains over more general data-parallel systems.
PageRank on the Live-Journal Graph
Runtime (in seconds, PageRank for 10 iterations):
  GraphLab         22
  Naïve Spark     354
  Mahout/Hadoop  1340
GraphLab is 60x faster than Hadoop
GraphLab is 16x faster than Spark
Separate Systems to Support Each View
[Figure: a Table View (rows feeding a result table) and a Graph View (dependency graph), each handled by its own system]
Having separate systems for each view is difficult to use and inefficient
Difficult to Program and Use
Users must learn, deploy, and manage multiple systems
Leads to brittle and often complex interfaces
Inefficient
Extensive data movement and duplication across the network and file system
[Figure: XML input passed from stage to stage through HDFS]
Limited reuse of internal data structures across stages
Solution: The GraphX Unified Approach
Enabling users to easily and efficiently express the entire graph analytics pipeline
New API: blurs the distinction between Tables and Graphs
New System: combines Data-Parallel and Graph-Parallel Systems
Tables and Graphs are composable views of the same physical data
[Figure: GraphX Unified Representation, exposed as a Graph View and a Table View]
Each view has its own operators that exploit the semantics of the view to achieve efficient execution
View a Graph as a Table
[Figure: Property Graph with vertices R, J, F, I]

Vertex Property Table
Id        Property (V)
rxin      (Stu., Berk.)
jegonzal  (PstDoc, Berk.)
franklin  (Prof., Berk.)
istoica   (Prof., Berk.)

Edge Property Table
SrcId     DstId     Property (E)
rxin      jegonzal  Friend
franklin  rxin      Advisor
istoica   franklin  Coworker
franklin  jegonzal  PI
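To make the two-table view concrete, the example property graph can be written as a pair of ordinary Scala collections and queried from either side. This is only a sketch on in-memory sequences rather than Spark RDDs; the object name `PropertyGraphTables` and the helpers `rolesWithEdge` and `outDegree` are hypothetical names introduced for illustration.

```scala
object PropertyGraphTables {
  type Id = String

  // Vertex property table: (Id, (role, affiliation)).
  val vertices: Seq[(Id, (String, String))] = Seq(
    ("rxin", ("Stu.", "Berk.")),
    ("jegonzal", ("PstDoc", "Berk.")),
    ("franklin", ("Prof.", "Berk.")),
    ("istoica", ("Prof.", "Berk."))
  )

  // Edge property table: (SrcId, DstId, property).
  val edges: Seq[(Id, Id, String)] = Seq(
    ("rxin", "jegonzal", "Friend"),
    ("franklin", "rxin", "Advisor"),
    ("istoica", "franklin", "Coworker"),
    ("franklin", "jegonzal", "PI")
  )

  private val vertexIndex: Map[Id, (String, String)] = vertices.toMap

  // Table-style query: join edges carrying a given label with the source vertex's role.
  def rolesWithEdge(label: String): Seq[(Id, String)] =
    edges.filter(_._3 == label).map(e => e._1 -> vertexIndex(e._1)._1)

  // Graph-style query: out-degree of every vertex that has outgoing edges.
  def outDegree: Map[Id, Int] =
    edges.groupBy(_._1).map { case (src, es) => src -> es.size }
}
```

Both queries read the same two collections, which is the point of the slide: the graph is not a separate copy of the data.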
Table Operators
Table (RDD) operators are inherited from
Spark:
map
filter
groupBy
sort
union
join
leftOuterJoin
rightOuterJoin
reduce
count
fold
reduceByKey
groupByKey
cogroup
cross
zip
sample
take
first
partitionBy
mapWith
pipe
save
...
Graph Operators
class Graph[V, E] {
  // Constructor (pseudocode) ---------------------
  def Graph(vertices: Table[(Id, V)],
            edges: Table[(Id, Id, E)])
  // Table Views ----------------------------------
  def vertices: Table[(Id, V)]
  def edges: Table[(Id, Id, E)]
  def triplets: Table[((Id, V), (Id, V), E)]
  // Transformations ------------------------------
  def reverse: Graph[V, E]
  def subgraph(pV: (Id, V) => Boolean,
               pE: Edge[V, E] => Boolean): Graph[V, E]
  def mapV[T](m: (Id, V) => T): Graph[T, E]
  def mapE[T](m: Edge[V, E] => T): Graph[V, T]
  // Joins ----------------------------------------
  def joinV[T](tbl: Table[(Id, T)]): Graph[(V, T), E]
  def joinE[T](tbl: Table[(Id, Id, T)]): Graph[V, (E, T)]
  // Computation ----------------------------------
  def mrTriplets[T](mapF: Edge[V, E] => List[(Id, T)],
                    reduceF: (T, T) => T): Graph[T, E]
}
Triplets Join Vertices and Edges
The triplets operator joins vertices and edges:
[Figure: Vertices (A, B, C, D) joined with Edges (A-B, A-C, B-C, C-D) yield Triplets, one per edge with both endpoint properties attached]
The mrTriplets operator sums adjacent triplets.
SELECT t.dstId, reduceUDF( mapUDF(t) ) AS sum
FROM triplets AS t GROUP BY t.dstId
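The triplets join and the SQL-style aggregation over it can be sketched on plain Scala collections as follows. This is an illustration of the idea, not GraphX's implementation: `Triplet`, `triplets`, and `groupByDst` are hypothetical names, vertices are held in an in-memory `Map`, and Scala 2.13 or later is assumed for `groupMapReduce`.

```scala
object TripletsSketch {
  type Id = String

  // One row of the triplets view: an edge with both endpoint properties attached.
  final case class Triplet[V, E](src: (Id, V), dst: (Id, V), attr: E)

  // Join every edge with the properties of its source and destination vertices.
  def triplets[V, E](vertices: Map[Id, V], edges: Seq[(Id, Id, E)]): Seq[Triplet[V, E]] =
    edges.map { case (s, d, attr) => Triplet((s, vertices(s)), (d, vertices(d)), attr) }

  // SELECT t.dstId, reduceUDF(mapUDF(t)) FROM triplets AS t GROUP BY t.dstId
  def groupByDst[V, E, T](ts: Seq[Triplet[V, E]])(mapUDF: Triplet[V, E] => T)(
      reduceUDF: (T, T) => T): Map[Id, T] =
    ts.groupMapReduce(_.dst._1)(mapUDF)(reduceUDF)
}
```

For example, with vertex values A=1, B=2, C=3, D=4 and edges A-B, A-C, B-C, C-D, mapping each triplet to its source value and reducing with `+` gives C the sum of A and B.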
Map Reduce Triplets
Map-Reduce for each vertex
[Figure: mapF is applied to the triplets A-B and A-C, producing messages A1 and A2 for vertex A; reduceF(A1, A2) combines them into a single value for A]
Example: Oldest Follower
What is the age of the oldest follower for each user?
val oldestFollowerAge = graph
  .mrTriplets(
    e => (e.dst.id, e.src.age), // Map
    (a, b) => max(a, b)         // Reduce
  )
  .vertices
[Figure: example graph over vertices A to F, annotated with ages 23, 42, 30, 19, 75, 16]
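A self-contained version of the oldest-follower query, runnable without Spark, might look like the sketch below. The `mrTriplets` here is a small stand-in written for this example, not the GraphX operator, and the people and follow edges are hypothetical data (the slide's figure does not say which age belongs to which vertex). Scala 2.13 or later is assumed.

```scala
object OldestFollower {
  type Id = String
  final case class Person(id: Id, age: Int)
  final case class EdgeTriplet(src: Person, dst: Person)

  // Stand-in for mrTriplets: map each triplet to (target id, message),
  // then reduce all messages addressed to the same target.
  def mrTriplets[T](ts: Seq[EdgeTriplet])(mapF: EdgeTriplet => (Id, T))(
      reduceF: (T, T) => T): Map[Id, T] =
    ts.map(mapF).groupMapReduce(_._1)(_._2)(reduceF)

  // Hypothetical data: an edge (src, dst) means "src follows dst".
  val people: Map[Id, Person] =
    Seq(Person("A", 23), Person("B", 42), Person("C", 30), Person("D", 19))
      .map(p => p.id -> p).toMap
  val follows: Seq[(Id, Id)] = Seq(("B", "A"), ("C", "A"), ("D", "C"))
  val triplets: Seq[EdgeTriplet] =
    follows.map { case (s, d) => EdgeTriplet(people(s), people(d)) }

  // Map: send the follower's age to the followed user. Reduce: keep the maximum.
  val oldestFollowerAge: Map[Id, Int] =
    mrTriplets(triplets)(e => (e.dst.id, e.src.age))((a, b) => math.max(a, b))
}
```

Users with no followers (B and D in this data) receive no message and therefore do not appear in the result.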
We express the Pregel and GraphLab abstractions using the GraphX operators in fewer than 50 lines of code!
By composing these operators we can construct entire graph-analytics pipelines.
DIY Demo this Afternoon
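As a rough illustration of how a Pregel-style loop can be layered on a map-reduce-over-triplets step, here is a sketch on plain Scala collections. It is not the GraphX Pregel implementation: the signature of `pregel`, the parameter names `vprog`, `sendMsg`, and `mergeMsg`, and the connected-components example are assumptions made for this sketch, which targets Scala 2.13 or later.

```scala
object PregelSketch {
  type Id = Int

  // Repeat: (1) map over triplets and reduce messages per target vertex,
  //         (2) apply vprog to every vertex that received a message,
  // until an iteration produces no messages or maxIter is reached.
  def pregel[V, M](vertices: Map[Id, V], edges: Seq[(Id, Id)], maxIter: Int = 50)(
      vprog: (V, M) => V,
      sendMsg: ((Id, V), (Id, V)) => Seq[(Id, M)],
      mergeMsg: (M, M) => M): Map[Id, V] = {
    var state = vertices
    var iter = 0
    var active = true
    while (active && iter < maxIter) {
      val msgs: Map[Id, M] = edges
        .flatMap { case (s, d) => sendMsg((s, state(s)), (d, state(d))) }
        .groupMapReduce(_._1)(_._2)(mergeMsg)
      state = state.map { case (id, v) => id -> msgs.get(id).fold(v)(m => vprog(v, m)) }
      active = msgs.nonEmpty
      iter += 1
    }
    state
  }

  // Example vertex program: label every vertex with the smallest id in its component.
  def connectedComponents(ids: Seq[Id], edges: Seq[(Id, Id)]): Map[Id, Id] =
    pregel[Id, Id](ids.map(i => i -> i).toMap, edges)(
      (label, msg) => math.min(label, msg),
      (src, dst) =>
        if (src._2 < dst._2) Seq(dst._1 -> src._2)      // push the smaller label forward
        else if (dst._2 < src._2) Seq(src._1 -> dst._2) // or backward along the edge
        else Seq.empty,
      (a, b) => math.min(a, b))
}
```

The loop body is the same map, reduce, and join sequence that mrTriplets exposes, which is why the abstraction fits in a few dozen lines.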
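GraphX System Design
Distributed Graphs as Tables (RDDs)
[Figure: a Property Graph over vertices A to F is stored as a Vertex Table (RDD), a Routing Table (RDD), and an Edge Table (RDD). The Edge Table is split by the 2D Vertex Cut Heuristic into Part. 1 (edges A-B, A-C, C-D, B-C) and Part. 2 (edges A-E, A-F, E-F, E-D); the Routing Table lists, for each vertex, the edge partitions (1, 2) that reference it.]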
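One way to picture a 2D vertex cut and the routing table it induces is the sketch below. It is a simplified stand-in, not GraphX's partitioner: placing an edge by `src mod side` and `dst mod side` on a `side x side` grid is an assumption chosen for readability, and `partition2D`, `partitionEdges`, and `routingTable` are hypothetical names. Scala 2.13 or later is assumed.

```scala
object VertexCutSketch {
  type Id = Long

  // Assumed 2D heuristic: the source id picks the grid row, the destination id the column.
  def partition2D(src: Id, dst: Id, side: Int): Int = {
    val row = (math.abs(src) % side).toInt
    val col = (math.abs(dst) % side).toInt
    row * side + col
  }

  // Edge table split into partitions keyed by partition id.
  def partitionEdges(edges: Seq[(Id, Id)], side: Int): Map[Int, Seq[(Id, Id)]] =
    edges.groupBy { case (s, d) => partition2D(s, d, side) }

  // Routing table: for each vertex, the set of edge partitions that reference it.
  def routingTable(parts: Map[Int, Seq[(Id, Id)]]): Map[Id, Set[Int]] =
    parts.toSeq
      .flatMap { case (pid, es) => es.flatMap { case (s, d) => Seq(s -> pid, d -> pid) } }
      .groupMapReduce(_._1)(p => Set(p._2))(_ ++ _)
}
```

Under this placement a vertex can only appear in one grid row (as a source) and one grid column (as a destination), so it is mirrored to at most `2 * side - 1` of the `side * side` partitions, which bounds how far a high-degree vertex must be replicated.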
Caching for Iterative mrTriplets
[Figure: the Vertex Table (RDD) ships vertex properties to a Mirror Cache kept alongside each Edge Table (RDD) partition. The partition holding edges A-B, A-C, C-D, B-C caches A, B, C, D; the partition holding edges A-E, A-F, E-F, E-D caches A, D, E, F.]
Incremental Updates for Iterative mrTriplets
[Figure: only the changed vertices (here A and E) are re-sent from the Vertex Table (RDD) to the Mirror Caches before the Edge Table (RDD) partitions are scanned.]
Aggregation for Iterative mrTriplets
[Figure: changed vertices are sent to the Mirror Caches, the edges are scanned, and each edge partition computes a Local Aggregate before results are returned to the Vertex Table (RDD).]
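The effect of aggregating inside each edge partition before sending results back can be sketched as follows. This is a toy model on in-memory collections, not GraphX code: `localAggregate` and `aggregate` are hypothetical names, messages are simply the source vertex's integer value summed per destination, and the count of "shipped" records stands in for network communication. Scala 2.13 or later is assumed.

```scala
object LocalAggregationSketch {
  type Id = String

  // Inside one edge partition: send each source value to the edge's destination
  // and pre-reduce, so the partition emits at most one record per destination.
  def localAggregate(part: Seq[(Id, Id)], values: Map[Id, Int]): Map[Id, Int] =
    part.map { case (s, d) => d -> values(s) }.groupMapReduce(_._1)(_._2)(_ + _)

  // Merge the per-partition aggregates; also report how many records left the partitions.
  def aggregate(parts: Seq[Seq[(Id, Id)]], values: Map[Id, Int]): (Map[Id, Int], Int) = {
    val locals = parts.map(p => localAggregate(p, values))
    val shipped = locals.map(_.size).sum
    val merged = locals.flatten.groupMapReduce(_._1)(_._2)(_ + _)
    (merged, shipped)
  }
}
```

With the two partitions from the figure (A-B, A-C, C-D, B-C and A-E, A-F, E-F, E-D), eight edge messages collapse to six shipped records, because C and F each receive two messages within a single partition.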
Reduction in Communication Due to Cached Updates
[Figure: Network Comm. (MB, log scale from 0.1 to 10000) versus Iteration (0 to 16), Connected Components on Twitter Graph]
Most vertices are within 8 hops of all vertices in their comp.
Benefit of Indexing Active Edges
[Figure: Runtime (Seconds, 0 to 30) versus Iteration (0 to 16), Connected Components on Twitter Graph, comparing Scan (scan all edges) with Indexed (index of “active” edges)]
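The difference between scanning all edges and consulting an index of active edges can be illustrated with a small sketch. This assumes that an edge is "active" when its source vertex changed in the previous iteration and that the index is keyed by source id; both are assumptions for illustration, as are the names `buildIndex`, `scan`, and `indexed`. Each function returns the matching edges together with the number of edges it had to touch.

```scala
object ActiveEdgeIndex {
  type Id = Int

  // Index of edges keyed by source vertex, built once and reused across iterations.
  def buildIndex(edges: Seq[(Id, Id)]): Map[Id, Seq[(Id, Id)]] =
    edges.groupBy(_._1)

  // Full scan: every edge is visited, whether or not its source is active.
  def scan(edges: Seq[(Id, Id)], active: Set[Id]): (Seq[(Id, Id)], Int) = {
    val hits = edges.filter { case (s, _) => active(s) }
    (hits, edges.size)
  }

  // Indexed lookup: only the edges of active sources are visited.
  def indexed(index: Map[Id, Seq[(Id, Id)]], active: Set[Id]): (Seq[(Id, Id)], Int) = {
    val hits = active.toSeq.sorted.flatMap(s => index.getOrElse(s, Seq.empty))
    (hits, hits.size)
  }
}
```

When few vertices remain active, as in the later iterations of connected components, the indexed path touches far fewer edges than the scan.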
Join Elimination
Identify and bypass joins for unused triplet fields
»Example: PageRank only accesses source attribute
[Figure: Communication (MB, 0 to 14000) versus Iteration (0 to 20), PageRank on Twitter, comparing Three Way Join with Join Elimination]
Factor of 2 reduction in communication
Additional Query Optimizations
Indexing and Bitmaps:
»To accelerate joins across graphs
»To efficiently construct sub-graphs
Substantial Index and Data Reuse:
»Reuse routing tables across graphs and sub-graphs
»Reuse edge adjacency information and indices
Performance Comparisons
Live-Journal: 69 Million Edges
Runtime (in seconds, PageRank for 10 iterations):
  GraphLab         22
  GraphX           68
  Giraph          207
  Naïve Spark     354
  Mahout/Hadoop  1340
GraphX is roughly 3x slower than GraphLab
GraphX scales to larger graphs
Twitter Graph: 1.5 Billion Edges
Runtime (in seconds, PageRank for 10 iterations):
  GraphLab  203
  GraphX    451
  Giraph    749
GraphX is roughly 2x slower than GraphLab
»Scala + Java overhead: Lambdas, GC time, …
»No shared memory parallelism: 2x increase in comm.
PageRank is just one stage….
What about a pipeline?
[Figure: Spark Preprocess, HDFS, Compute, HDFS, Spark Post.]
A Small Pipeline in GraphX
Timed end-to-end GraphX is faster than
[Figure: pipeline from Raw Wikipedia XML through Hyperlinks and PageRank to Top 20 Pages; bar chart of Total Runtime (in Seconds, axis 0 to 1600) for GraphLab + Spark, GraphX, Giraph + Spark, and Spark, with bar labels 342, 1492, 605, and 375]
The GraphX Stack (Lines of Code)
GraphX (3575)
Spark
Pregel (28) + GraphLab (50)
PageRank (5)
Connected Comp. (10)
Shortest Path (10)
ALS (40)
LDA (120)
K-core (51)
Triangle Count (45)
SVD (40)
Status
Alpha release as part of Spark 0.9
Seeking collaborators and feedback
Conclusion and Observations
Domain specific views: Tables and Graphs
»tables and graphs are first-class composable objects
»specialized operators which exploit view semantics
Single system that efficiently spans the pipeline
»minimize data movement and duplication
»eliminates need to learn and manage multiple systems
Graphs through the lens of database
Active Research
Static Data → Dynamic Data
»Apply GraphX unified approach to time evolving data
»Model and analyze relationships over time
Serving Graph Structured Data
»Allow external systems to interact with GraphX
»Unify distributed graph databases with relational database technology
Thanks!
ankurd@eecs.berkeley.edu
crankshaw@eecs.berkeley.edu
rxin@eecs.berkeley.edu
jegonzal@eecs.berkeley.edu
http://amplab.github.io/graphx/
Graph Property 1: Real-World Graphs
Power-Law Degree Distribution
[Figure: log-log plot of Number of Vertices (count) versus Degree for the AltaVista WebGraph, 1.4B Vertices, 6.6B Edges]
More than 10^8 vertices have one neighbor.
Top 1% of vertices are adjacent to 50% of the edges!
Edges >> Vertices
[Figure: Ratio of Edges to Vertices (0 to 200) by Year (2008 to 2012) for Facebook]
Graph Property 2: Active Vertices
[Figure: Num-Vertices (log scale, 1 to 100000000) versus Number of Updates (0 to 70), PageRank on Web Graph]
51% updated only once!
Graphs are Essential to Data Mining and Machine Learning
Identify influential people and information
Find communities
Understand people’s shared interests
Model complex data dependencies
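Recommending Products
[Figure: bipartite graph of Users and Items connected by Ratings]
Low-Rank Matrix Factorization:
[Figure: the Netflix Users x Movies ratings matrix is approximated by the product of User Factors (U) and Movie Factors (M); in the graph view, ratings r13, r14, r24, r25 label edges between factor vertices f(1) to f(5)]
Iterate:
[Equation: update of f(i) from neighboring factors f(j)]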
Predicting User Behavior
[Figure: a graph of users and their Posts in which a few users are labeled Liberal or Conservative and the remaining labels (?) are to be predicted]
Conditional Random Field
Belief Propagation
Finding Communities
Count triangles passing through each vertex:
[Figure: example graph on vertices 1, 2, 3, 4]
Measures “cohesiveness” of local community
More Triangles: Stronger Community
Fewer Triangles: Weaker Community
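Counting the triangles that pass through each vertex can be sketched with neighbor-set intersections. This is a single-machine illustration, not the GraphX triangle-count implementation: edges are treated as undirected, self-loops are dropped, and the name `trianglesPerVertex` is hypothetical. Scala 2.13 or later is assumed.

```scala
object TriangleCountSketch {
  type Id = Int

  def trianglesPerVertex(edges: Seq[(Id, Id)]): Map[Id, Int] = {
    // Undirected neighbor sets, ignoring self-loops.
    val nbrs: Map[Id, Set[Id]] = edges
      .filter { case (a, b) => a != b }
      .flatMap { case (a, b) => Seq(a -> b, b -> a) }
      .groupMapReduce(_._1)(p => Set(p._2))(_ ++ _)

    nbrs.map { case (v, ns) =>
      // For each neighbor u, count the neighbors that u and v share.
      // Every triangle through v is found twice, once from each of its other two corners.
      val twice = ns.toSeq.map(u => (nbrs(u) & ns).size).sum
      v -> (twice / 2)
    }
  }
}
```

On the edges 1-2, 1-3, 2-3, 3-4, vertices 1, 2, and 3 each lie on one triangle while vertex 4 lies on none, matching the intuition that 4 is only loosely attached to the community.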
Example Graph Analytics Pipeline
[Figure: pipeline in three phases (Preprocessing, Compute, Post Proc.): Raw Data (XML), ETL, Initial Graph, Slice, Subgraph, Compute, PageRank, Analyze, Top Users, with a Repeat loop]
 
the ladakh protest in leh ladakh 2024 sonam wangchuk.pptx
the ladakh protest in leh ladakh 2024 sonam wangchuk.pptxthe ladakh protest in leh ladakh 2024 sonam wangchuk.pptx
the ladakh protest in leh ladakh 2024 sonam wangchuk.pptx
 
Introduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptxIntroduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptx
 
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
 
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
 
SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )
 
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLSMANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
 
Extrusion Processes and Their Limitations
Extrusion Processes and Their LimitationsExtrusion Processes and Their Limitations
Extrusion Processes and Their Limitations
 
Call Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCR
Call Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCRCall Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCR
Call Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCR
 
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur EscortsCall Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
 
Introduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptxIntroduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptx
 
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICSAPPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
 
College Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service NashikCollege Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
 

Graphs are non-linear data structures made up of a finite number of nodes (vertices) and the edges that connect them.
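
As a rough illustration of that definition, the sketch below stores a graph as a vertex table plus an edge table, the layout the deck calls the Vertex Property Table and Edge Property Table on slide 15. The names `GraphAsTables`, `PropertyGraph`, `Person`, `outNeighbors`, and `degree` are made up for this example and are not GraphX API; the sample rows are the ones listed on slide 15.

```scala
object GraphAsTables {
  type Id = String

  // Hypothetical vertex property: (role, organization), as on slide 15.
  final case class Person(role: String, org: String)

  // A graph is a finite vertex table plus an edge table of (srcId, dstId, property).
  final case class PropertyGraph[V, E](vertices: Map[Id, V], edges: Seq[(Id, Id, E)]) {
    // Destinations of the edges that start at `id`, in edge-table order.
    def outNeighbors(id: Id): Seq[Id] = edges.filter(_._1 == id).map(_._2)

    // Number of edges touching each vertex (incoming plus outgoing).
    def degree: Map[Id, Int] =
      edges
        .flatMap { case (src, dst, _) => Seq(src, dst) }
        .groupBy(identity)
        .map { case (id, hits) => id -> hits.size }
  }

  val g: PropertyGraph[Person, String] = PropertyGraph(
    vertices = Map(
      "rxin"     -> Person("Stu.", "Berk."),
      "jegonzal" -> Person("PstDoc", "Berk."),
      "franklin" -> Person("Prof.", "Berk"),
      "istoica"  -> Person("Prof.", "Berk")
    ),
    edges = Seq(
      ("rxin", "jegonzal", "Friend"),
      ("franklin", "rxin", "Advisor"),
      ("istoica", "franklin", "Coworker"),
      ("franklin", "jegonzal", "PI")
    )
  )
}
```

Keeping the two tables separate is what lets the same data be read either as rows (filter, map, groupBy) or as a graph (neighbors, degree) without conversion.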

  • 1. GraphX: Unifying Data-Parallel and Graph-Parallel Analytics *These slides are best viewed in PowerPoint with anima
  • 2. Graphs are Central to Analytics. [Diagram: raw Wikipedia XML is turned into several tables and graphs: Hyperlinks, PageRank, Top 20 Pages (Title, PR); Text Table (Title, Body), Term-Doc Graph, Topic Model (LDA), Word Topics (Word, Topic); Discussion Table (User, Disc.), Editor Graph, Community Detection, User Community (User, Com.); Community Topic (Topic, Com.).]
  • 3. PageRank: Identifying Leaders. Rank of user i = weighted sum of neighbors’ ranks. Update ranks in parallel. Iterate until convergence.
  • 4. The Graph-Parallel Pattern. Model / Alg. State. Computation depends only on the neighbors.
  • 5. Many Graph-Parallel Algorithms. Collaborative Filtering: Alternating Least Squares, Stochastic Gradient Descent, Tensor Factorization. Structured Prediction: Loopy Belief Propagation, Max-Product Linear Programs, Gibbs Sampling. Semi-supervised ML: Graph SSL, CoEM. Community Detection: Triangle-Counting, K-core Decomposition, K-Truss. Graph Analytics: PageRank, Personalized PageRank, Shortest Path, Graph Coloring. Classification: Neural Networks.
  • 6. Graph-Parallel Systems. Expose specialized APIs to simplify graph programming. Exploit graph structure to achieve orders-of-magnitude performance gains over more general
  • 7. PageRank on the Live-Journal Graph. Runtime in seconds, PageRank for 10 iterations: GraphLab 22, Naïve Spark 354, Mahout/Hadoop 1340. GraphLab is 60x faster than Hadoop. GraphLab is 16x faster than Spark.
  • 8. Graphs are Central to Analytics. [Same diagram as slide 2: raw Wikipedia XML feeding the Hyperlinks/PageRank, Text Table/Topic Model (LDA), and Editor Graph/Community Detection stages.]
  • 9. Separate Systems to Support Each View. [Diagram: Table View (a table of rows producing a result) next to Graph View (a dependency graph).]
  • 10. Having separate systems for each view is difficult to use and inefficient.
  • 11. Difficult to Program and Use. Users must learn, deploy, and manage multiple systems. Leads to brittle and often complex interfaces.
  • 12. Inefficient. Extensive data movement and duplication across the network and file system. Limited reuse of internal data structures across stages. [Diagram labels: XML, HDFS (four times).]
  • 13. Solution: The GraphX Unified Approach. Enabling users to easily and efficiently express the entire graph analytics pipeline. New API: blurs the distinction between tables and graphs. New system: combines data-parallel and graph-parallel systems.
  • 14. Tables and graphs are composable views of the same physical data. [Diagram: GraphX Unified Representation with a Graph View and a Table View.] Each view has its own operators that exploit the semantics of the view to achieve efficient execution.
  • 15. View a Graph as a Table. A Property Graph (vertices R, J, F, I) is stored as two tables. Vertex Property Table (Id: Property (V)): Rxin: (Stu., Berk.); Jegonzal: (PstDoc, Berk.); Franklin: (Prof., Berk); Istoica: (Prof., Berk). Edge Property Table (SrcId, DstId: Property (E)): rxin, jegonzal: Friend; franklin, rxin: Advisor; istoica, franklin: Coworker; franklin, jegonzal: PI.
  • 16. Table Operators. Table (RDD) operators are inherited from Spark: map, filter, groupBy, sort, union, join, leftOuterJoin, rightOuterJoin, reduce, count, fold, reduceByKey, groupByKey, cogroup, cross, zip, sample, take, first, partitionBy, mapWith, pipe, save, ...
  • 17. Graph Operators
        class Graph [ V, E ] {
          def Graph(vertices: Table[ (Id, V) ],
                    edges: Table[ (Id, Id, E) ])
          // Table Views -----------------
          def vertices: Table[ (Id, V) ]
          def edges: Table[ (Id, Id, E) ]
          def triplets: Table [ ((Id, V), (Id, V), E) ]
          // Transformations ------------------------------
          def reverse: Graph[V, E]
          def subgraph(pV: (Id, V) => Boolean,
                       pE: Edge[V,E] => Boolean): Graph[V,E]
          def mapV(m: (Id, V) => T ): Graph[T,E]
          def mapE(m: Edge[V,E] => T ): Graph[V,T]
          // Joins ----------------------------------------
          def joinV(tbl: Table [(Id, T)]): Graph[(V, T), E ]
          def joinE(tbl: Table [(Id, Id, T)]): Graph[V, (E, T)]
          // Computation ----------------------------------
          def mrTriplets(mapF: (Edge[V,E]) => List[(Id, T)],
                         reduceF: (T, T) => T): Graph[T, E]
        }
  • 18. Triplets Join Vertices and Edges. The triplets operator joins vertices and edges. The mrTriplets operator sums adjacent triplets: SELECT t.dstId, reduceUDF(mapUDF(t)) AS sum FROM triplets AS t GROUP BY t.dstId. [Diagram: Vertices A, B, C, D joined with Edges (A, B), (A, C), (B, C), (C, D) to form Triplets.]
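  • 19. Map Reduce Triplets. Map-Reduce for each vertex. [Diagram: on a graph with vertices A to F, mapF applied to triplet (A, B) produces A1, mapF applied to triplet (A, C) produces A2, and reduceF(A1, A2) produces the value for A.]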
  • 20. Example: Oldest Follower. What is the age of the oldest follower for each user?
        val oldestFollowerAge = graph
          .mrTriplets(
            e => (e.dst.id, e.src.age), // Map
            (a, b) => max(a, b)         // Reduce
          )
          .vertices
      [Diagram: vertices A to F annotated with the ages 23, 42, 30, 19, 75, 16.]
  • 21. We express the Pregel and GraphLab abstractions using the GraphX operators in fewer than 50 lines of code! By composing these operators we can construct entire graph-analytics pipelines.
  • 24. Distributed Graphs as Tables (RDDs). [Diagram: a Property Graph on vertices A to F is split by a 2D Vertex Cut Heuristic into Part. 1 and Part. 2 and stored as a Vertex Table (RDD), an Edge Table (RDD) holding edges (A, B), (A, C), (C, D), (B, C), (A, E), (A, F), (E, F), (E, D), and a Routing Table (RDD) that maps vertices to partitions 1 and 2.]
  • 25. Caching for Iterative mrTriplets. [Diagram: the Vertex Table (RDD) on vertices A to F feeds two Mirror Caches, one holding A, B, C, D and one holding A, D, E, F, next to the partitions of the Edge Table (RDD).]
  • 26. Incremental Updates for Iterative mrTriplets. [Diagram: same Vertex Table (RDD), Mirror Caches, and Edge Table (RDD) as slide 25; changes to A and E are sent to the Mirror Caches, followed by a Scan of the edges.]
  • 27. Aggregation for Iterative mrTriplets. [Diagram: same Vertex Table (RDD), Mirror Caches, and Edge Table (RDD) as slide 25; after the Scan, each edge partition performs a Local Aggregate and sends the resulting changes for B, C, D, F back to the Vertex Table.]
  • 28. Reduction in Communication Due to Cached Updates. [Chart: Network Comm. (MB), log scale from 0.1 to 10000, vs Iteration (0 to 16); Connected Components on Twitter Graph.] Most vertices are within 8 hops of all vertices in their component.
  • 29. Benefit of Indexing Active Edges. [Chart: Runtime (Seconds), 0 to 30, vs Iteration (0 to 16); Connected Components on Twitter Graph; series Scan (Scan All Edges) and Indexed (Index of “Active” Edges).]
  • 30. Join Elimination. Identify and bypass joins for unused triplets fields. » Example: PageRank only accesses source attribute. [Chart: Communication (MB), 0 to 14000, vs Iteration (0 to 20); PageRank on Twitter; series Three Way Join and Join Elimination.] Factor of 2 reduction in communication.
  • 31. Additional Query Optimizations. Indexing and Bitmaps: » To accelerate joins across graphs » To efficiently construct subgraphs. Substantial Index and Data Reuse: » Reuse routing tables across graphs and subgraphs » Reuse edge adjacency information and indices.
  • 32. Performance Comparisons. Live-Journal: 69 Million Edges. Runtime in seconds, PageRank for 10 iterations: GraphLab 22, GraphX 68, Giraph 207, Naïve Spark 354, Mahout/Hadoop 1340. GraphX is roughly 3x slower than GraphLab.
  • 33. GraphX scales to larger graphs. Twitter Graph: 1.5 Billion Edges. Runtime in seconds, PageRank for 10 iterations: GraphLab 203, GraphX 451, Giraph 749. GraphX is roughly 2x slower than GraphLab: » Scala + Java overhead: lambdas, GC time, … » No shared memory parallelism: 2x increase in comm.
  • 34. PageRank is just one stage…. What about a pipeline?
  • 35. A Small Pipeline in GraphX. [Diagram labels: HDFS, Spark Preprocess, Compute, Spark Post., HDFS; Raw Wikipedia XML, Hyperlinks, PageRank, Top 20 Pages.] [Chart: Total Runtime (in Seconds) for GraphLab + Spark, GraphX, Giraph + Spark, and Spark; bar values as extracted: 342, 1492, 605, 375.] Timed end-to-end GraphX is faster than
  • 36. The GraphX Stack (Lines of Code). [Stack diagram: Spark; GraphX (3575); Pregel (28) + GraphLab (50); PageRank (5), Connected Comp. (10), Shortest Path (10), ALS (40), LDA (120), K-core (51), Triangle Count (45), SVD (40).]
  • 37. Status. Alpha release as part of Spark 0.9. Seeking collaborators and feedback.
  • 38. Conclusion and Observations. Domain specific views: Tables and Graphs » tables and graphs are first-class composable objects » specialized operators which exploit view semantics. Single system that efficiently spans the pipeline » minimize data movement and duplication » eliminates need to learn and manage multiple systems. Graphs through the lens of database
  • 39. Active Research. Static Data → Dynamic Data » Apply GraphX unified approach to time evolving data » Model and analyze relationships over time. Serving Graph Structured Data » Allow external systems to interact with GraphX » Unify distributed graph databases with relational database technology.
  • 41. Graph Property 1: Real-World Graphs. Power-Law Degree Distribution: [Chart: Number of Vertices (count) vs Degree on log-log axes; AltaVista WebGraph, 1.4B Vertices, 6.6B Edges.] Top 1% of vertices are adjacent to 50% of the edges! More than 10^8 vertices have one neighbor. Edges >> Vertices: [Chart: Ratio of Edges to Vertices (0 to 200) by Year (2008 to 2012); Facebook.]
  • 42. Graph Property 2: Active Vertices. [Chart: Num-Vertices, log scale from 1 to 100,000,000, vs Number of Updates (0 to 70); PageRank on Web Graph.] 51% updated only once!
  • 43. Graphs are Essential to Data Mining and Machine Learning. Identify influential people and information. Find communities. Understand people’s shared interests. Model complex data dependencies.
  • 46. Predicting User Behavior. [Diagram: a few posts labeled Liberal or Conservative are linked to many posts whose labels are unknown (?).] Conditional Random Field. Belief Propagation.
  • 47. Finding Communities. Count triangles passing through each vertex: measures “cohesiveness” of local community. More triangles: stronger community. Fewer triangles: weaker community.
  • 48. Example Graph Analytics Pipeline. [Diagram: stages Preprocessing, Compute, Post Proc.; steps Raw Data (XML), ETL, Initial Graph, Slice, Subgraph, Compute, PageRank, Top Users, Analyze, Repeat.]
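
The mrTriplets operator from slides 18 to 20 (join vertices with edges, map over each triplet, reduce per destination vertex) can be sketched on plain Scala collections. This is only an illustration under stated assumptions, not the GraphX implementation: `MrTripletsSketch`, `Triplet`, and the helper signatures are hypothetical, the sample ages are borrowed from the slide but their assignment to vertices is made up, and the follower edges are invented.

```scala
object MrTripletsSketch {
  type Id = String

  // A triplet pairs an edge with the properties of both endpoints (slide 18).
  final case class Triplet[V, E](srcId: Id, srcAttr: V, dstId: Id, dstAttr: V, attr: E)

  // Join the vertex table with the edge table to form the triplets view.
  def triplets[V, E](vertices: Map[Id, V], edges: Seq[(Id, Id, E)]): Seq[Triplet[V, E]] =
    for {
      (src, dst, attr) <- edges
      srcAttr <- vertices.get(src).toSeq
      dstAttr <- vertices.get(dst).toSeq
    } yield Triplet(src, srcAttr, dst, dstAttr, attr)

  // mapF emits (target vertex, message) pairs; reduceF combines messages per vertex.
  // Vertices that receive no message are absent from the result.
  def mrTriplets[V, E, T](vertices: Map[Id, V], edges: Seq[(Id, Id, E)])(
      mapF: Triplet[V, E] => Seq[(Id, T)],
      reduceF: (T, T) => T
  ): Map[Id, T] =
    triplets(vertices, edges)
      .flatMap(mapF)
      .groupBy(_._1)
      .map { case (id, msgs) => id -> msgs.map(_._2).reduce(reduceF) }

  // Made-up data: vertex property = age; edge (src, dst) means "src follows dst".
  val ages: Map[Id, Int] = Map("A" -> 23, "B" -> 42, "C" -> 30, "D" -> 19)
  val follows: Seq[(Id, Id, Unit)] = Seq(("B", "A", ()), ("C", "A", ()), ("D", "C", ()))

  val oldestFollowerAge: Map[Id, Int] =
    mrTriplets[Int, Unit, Int](ages, follows)(
      t => Seq(t.dstId -> t.srcAttr),     // Map: send the follower's age to the followed user
      (a: Int, b: Int) => math.max(a, b)  // Reduce: keep the largest age seen
    )
}
```

With this data, A is followed by B and C and C is followed by D, so the result holds entries only for A and C. The group-by on the destination id plays the role of the `GROUP BY t.dstId` clause on slide 18.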

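Slide 3 describes PageRank as repeatedly setting each rank to a weighted sum of the neighbors' ranks until the values stop changing. A minimal sketch of that loop follows. The slides do not give the weights or constants, so two assumptions are made here: each vertex splits its rank evenly across its out-edges, and a reset term of 0.15 is added. `PageRankSketch`, `step`, and `pageRank` are names invented for this example, and the code expects a non-empty vertex set that contains every edge endpoint.

```scala
object PageRankSketch {
  type Id = String

  // One synchronous update: every vertex recomputes its rank from the previous ranks.
  def step(ranks: Map[Id, Double], edges: Seq[(Id, Id)], reset: Double): Map[Id, Double] = {
    val outDegree: Map[Id, Int] =
      edges.groupBy(_._1).map { case (id, es) => id -> es.size }
    // Weighted sum of in-neighbors' ranks (assumed weight: 1 / out-degree of the source).
    val incoming: Map[Id, Double] = edges
      .map { case (src, dst) => dst -> ranks(src) / outDegree(src) }
      .groupBy(_._1)
      .map { case (id, msgs) => id -> msgs.map(_._2).sum }
    ranks.map { case (id, _) => id -> (reset + (1.0 - reset) * incoming.getOrElse(id, 0.0)) }
  }

  // Iterate until no rank moves by more than `tol`, or `maxIter` updates have run.
  def pageRank(vertices: Set[Id], edges: Seq[(Id, Id)], reset: Double = 0.15,
               tol: Double = 1e-9, maxIter: Int = 100): Map[Id, Double] = {
    var ranks: Map[Id, Double] = vertices.map(_ -> 1.0).toMap
    var iter = 0
    var delta = Double.MaxValue
    while (iter < maxIter && delta > tol) {
      val next = step(ranks, edges, reset)
      delta = vertices.map(v => math.abs(next(v) - ranks(v))).max
      ranks = next
      iter += 1
    }
    ranks
  }
}
```

For example, `PageRankSketch.pageRank(Set("A", "B", "C"), Seq(("B", "A"), ("C", "A")))` gives the two source vertices the bare reset value and concentrates the remaining rank on A. The per-edge map followed by a per-destination sum has the same shape as the map and reduce functions passed to mrTriplets on slide 20.
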
Editor's Notes

  1. ----- Meeting Notes (1/5/14 14:58) ----- Spark too.
  2. and users can move between views without copying or moving data
  3. Make mrTriplets more visible.
  4. w
  5. Split Hash Table: reuse key sets across hash tables
  6. Also faster in human time but it is hard to measure.