A System for Large-Scale Graph
Processing
Pregel, GoldenOrb, Giraph
2012-07-18
Andrew Yongjoon Kong
sstrato.open@gmail.com
Contents
• Introduction
• Model of Computation
• Pregel Architecture
• Goldenorb
• Implementation
• Future work
Introduction
• Today, many practical computing problems concern large graphs
• Applied algorithms
- Shortest-path computations
- PageRank
- Clustering techniques
• MapReduce is ill-suited for graph processing
- Parallel graph processing needs many iterations
- Materializing intermediate results at every MapReduce iteration harms performance
Introduction
• Hadoop is well-suited for non-iterative, data-parallel processing
- Smith-Waterman is a non-iterative case and of course runs fine
Introduction
Iterative?
[Figure: an iterative MapReduce job driven by a user program. map: compute the distance to each data point from each cluster center and assign points to cluster centers. reduce: compute new cluster centers.]
• Should handle iterative processing like PDE (Partial Differential Equation)
• http://www.iterativemapreduce.org/
Graph-based Computation
• Pregel
– Google’s large-scale graph processing system
• GoldenOrb
• Giraph
– Yahoo’s platform
• Hama
– Apache Hama
• Pegasus
– Carnegie Mellon University
Single Source Shortest Path (SSSP)
• Problem
– Find the shortest path from a source node to all target nodes
• Solution
– MapReduce
– Pregel
Example: SSSP—using MapReduce
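• A Map task receives
– Key: node n
– Value: D (distance from the start node), points-to (list of nodes reachable from n)
– For each node m in points-to, it emits the candidate distance D(n) + dist(n, m)
• The Reduce task gathers the candidate distances for each node and selects the minimum one
– D(m) = min over all predecessors n of (D(n) + dist(n, m))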
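The map and reduce roles described above can be sketched in a few lines of plain Java. This is only an in-memory illustration under stated assumptions: it uses no Hadoop API, and the names MapReduceSsspSketch, mapPhase, reducePhase, run and exampleGraph are hypothetical. The graph is the five-node example used in the walkthrough.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory sketch of iterated MapReduce SSSP; no Hadoop API is used.
final class MapReduceSsspSketch {
    static final int INF = Integer.MAX_VALUE;

    // Map phase: each node emits (neighbor, D(node) + edge weight) for every out-edge.
    static Map<String, List<Integer>> mapPhase(Map<String, Integer> dist,
                                               Map<String, Map<String, Integer>> graph) {
        Map<String, List<Integer>> emitted = new LinkedHashMap<>();
        for (Map.Entry<String, Map<String, Integer>> node : graph.entrySet()) {
            int d = dist.get(node.getKey());
            for (Map.Entry<String, Integer> edge : node.getValue().entrySet()) {
                int candidate = (d == INF) ? INF : d + edge.getValue();
                emitted.computeIfAbsent(edge.getKey(), k -> new ArrayList<>()).add(candidate);
            }
        }
        return emitted;
    }

    // Reduce phase: each node keeps the minimum of its old distance and all candidates.
    static Map<String, Integer> reducePhase(Map<String, Integer> dist,
                                            Map<String, List<Integer>> emitted) {
        Map<String, Integer> next = new LinkedHashMap<>(dist);
        for (Map.Entry<String, List<Integer>> e : emitted.entrySet()) {
            for (int candidate : e.getValue()) {
                next.merge(e.getKey(), candidate, Integer::min);
            }
        }
        return next;
    }

    // Driver: chain map + reduce rounds until no distance changes.
    static Map<String, Integer> run(Map<String, Map<String, Integer>> graph, String source) {
        Map<String, Integer> dist = new LinkedHashMap<>();
        for (String node : graph.keySet()) dist.put(node, node.equals(source) ? 0 : INF);
        while (true) {
            Map<String, Integer> next = reducePhase(dist, mapPhase(dist, graph));
            if (next.equals(dist)) return dist;
            dist = next;
        }
    }

    // The five-node example graph (every node must appear as a key).
    static Map<String, Map<String, Integer>> exampleGraph() {
        Map<String, Map<String, Integer>> g = new LinkedHashMap<>();
        g.put("A", Map.of("B", 10, "D", 5));
        g.put("B", Map.of("C", 1, "D", 2));
        g.put("C", Map.of("E", 4));
        g.put("D", Map.of("B", 3, "C", 9, "E", 2));
        g.put("E", Map.of("A", 7, "C", 6));
        return g;
    }

    public static void main(String[] args) {
        System.out.println(run(exampleGraph(), "A")); // {A=0, B=8, C=9, D=5, E=7}
    }
}
```

In a real Hadoop job each round of the while loop would be a separate MapReduce job whose output is written out and read back, which is the overhead the deck points at.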
Example: SSSP—using MapReduce
• Adjacency list
A: (B, 10), (D, 5)
B: (C, 1), (D, 2)
C: (E, 4)
D: (B, 3), (C, 9), (E, 2)
E: (A, 7), (C, 6)
• Adjacency matrix (rows: source node, columns: target node, 0: no edge)
    A   B   C   D   E
A   0  10   0   5   0
B   0   0   1   2   0
C   0   0   0   0   4
D   0   3   9   0   2
E   7   0   6   0   0
[Figure: the example graph with nodes A to E and the edge weights above; distances: A = 0, all others ∞]
Example: SSSP—using MapReduce
• Map input: <node ID, <dist, adj list>>
<A, <0, <(B, 10), (D, 5)>>>
<B, <inf, <(C, 1), (D, 2)>>>
<C, <inf, <(E, 4)>>>
<D, <inf, <(B, 3), (C, 9), (E, 2)>>>
<E, <inf, <(A, 7), (C, 6)>>>
• Map output: <dest node ID, dist>
<B, 10> <D, 5>
<C, inf> <D, inf>
<E, inf>
<B, inf> <C, inf> <E, inf>
<A, inf> <C, inf>
<A, <0, <(B, 10), (D, 5)>>>
<B, <inf, <(C, 1), (D, 2)>>>
<C, <inf, <(E, 4)>>>
<D, <inf, <(B, 3), (C, 9), (E, 2)>>>
<E, <inf, <(A, 7), (C, 6)>>>
Flush to local disk!
[Figure: example graph; distances: A = 0, all others ∞]
Example: SSSP—using MapReduce
• Reduce input: <node ID, dist>
<A, <0, <(B, 10), (D, 5)>>>
<A, inf>
<B, <inf, <(C, 1), (D, 2)>>>
<B, 10> <B, inf>
<C, <inf, <(E, 4)>>>
<C, inf> <C, inf> <C, inf>
<D, <inf, <(B, 3), (C, 9), (E, 2)>>>
<D, 5> <D, inf>
<E, <inf, <(A, 7), (C, 6)>>>
<E, inf> <E, inf>
Select the minimum candidate value for each node and update the result of the former iteration.
[Figure: example graph; distances: A = 0, all others ∞]
Example: SSSP—using MapReduce
• Reduce output: <node ID, <dist, adj list>> (= Map input for the next iteration)
<A, <0, <(B, 10), (D, 5)>>>
<B, <10, <(C, 1), (D, 2)>>>
<C, <inf, <(E, 4)>>>
<D, <5, <(B, 3), (C, 9), (E, 2)>>>
<E, <inf, <(A, 7), (C, 6)>>>
• Map output: <dest node ID, dist>
<B, 10> <D, 5>
<C, 11> <D, 12>
<E, inf>
<B, 8> <C, 14> <E, 7>
<A, inf> <C, inf>
<A, <0, <(B, 10), (D, 5)>>>
<B, <10, <(C, 1), (D, 2)>>>
<C, <inf, <(E, 4)>>>
<D, <5, <(B, 3), (C, 9), (E, 2)>>>
<E, <inf, <(A, 7), (C, 6)>>>
Flush to local disk!
[Figure: example graph; distances: A = 0, B = 10, D = 5, C = E = ∞]
Example: SSSP—using MapReduce
• Reduce input: <node ID, dist>
<A, <0, <(B, 10), (D, 5)>>>
<A, inf>
<B, <10, <(C, 1), (D, 2)>>>
<B, 10> <B, 8>
<C, <inf, <(E, 4)>>>
<C, 11> <C, 14> <C, inf>
<D, <5, <(B, 3), (C, 9), (E, 2)>>>
<D, 5> <D,12>
<E, <inf, <(A, 7), (C, 6)>>>
<E, inf> <E, 7>
Select the minimum candidate value for each node and update the result of the former iteration.
[Figure: example graph with nodes A to E and edge weights]
Example: SSSP—using MapReduce
• Map input: <node ID, <dist, adj list>>
<A, <0, <(B, 10), (D, 5)>>>
<B, <8, <(C, 1), (D, 2)>>>
<C, <11, <(E, 4)>>>
<D, <5, <(B, 3), (C, 9), (E, 2)>>>
<E, <7, <(A, 7), (C, 6)>>>
• Map output: <dest node ID, dist>
<B, 10> <D, 5>
<C, 9> <D, 10>
<E, 15>
<B, 8> <C, 14> <E, 7>
<A, 14> <C, 13>
<A, <0, <(B, 10), (D, 5)>>>
<B, <8, <(C, 1), (D, 2)>>>
<C, <11, <(E, 4)>>>
<D, <5, <(B, 3), (C, 9), (E, 2)>>>
<E, <7, <(A, 7), (C, 6)>>>
Flush to local disk!
[Figure: example graph; distances: A = 0, B = 8, C = 11, D = 5, E = 7]
Example: SSSP—using MapReduce
• Reduce input: <node ID, dist>
<A, <0, <(B, 10), (D, 5)>>>
<A, 14>
<B, <8, <(C, 1), (D, 2)>>>
<B, 10> <B, 8>
<C, <11, <(E, 4)>>>
<C, 9> <C, 14> <C, 13>
<D, <5, <(B, 3), (C, 9), (E, 2)>>>
<D, 5> <D, 10>
<E, <7, <(A, 7), (C, 6)>>>
<E, 15> <E, 7>
Select the minimum candidate value for each node and update the result of the former iteration.
[Figure: example graph; distances: A = 0, B = 8, C = 9, D = 5, E = 7]
Example: SSSP—using MapReduce
• Map input: <node ID, <dist, adj list>>
<A, <0, <(B, 10), (D, 5)>>>
<B, <8, <(C, 1), (D, 2)>>>
<C, <9, <(E, 4)>>>
<D, <5, <(B, 3), (C, 9), (E, 2)>>>
<E, <7, <(A, 7), (C, 6)>>>
• Map output: <dest node ID, dist>
<B, 10> <D, 5>
<C, 9> <D, 10>
<E, 13>
<B, 8> <C, 14> <E, 7>
<A, 14> <C, 13>
<A, <0, <(B, 10), (D, 5)>>>
<B, <8, <(C, 1), (D, 2)>>>
<C, <9, <(E, 4)>>>
<D, <5, <(B, 3), (C, 9), (E, 2)>>>
<E, <7, <(A, 7), (C, 6)>>>
Flush to local disk!
[Figure: example graph; distances: A = 0, B = 8, C = 9, D = 5, E = 7]
Example: SSSP—using MapReduce
• Reduce input: <node ID, dist>
<A, <0, <(B, 10), (D, 5)>>>
<A, 14>
<B, <8, <(C, 1), (D, 2)>>>
<B, 10> <B, 8>
<C, <9, <(E, 4)>>>
<C, 9> <C, 14> <C, 13>
<D, <5, <(B, 3), (C, 9), (E, 2)>>>
<D, 5><D, 10>
<E, <7, <(A, 7), (C, 6)>>>
<E, 13> <E, 7>
No changes. Quit the process!
[Figure: example graph; final distances: A = 0, B = 8, C = 9, D = 5, E = 7]
• MapReduce uses key/value pairs to store the nodes and their adjacent distances; it is better suited to processing huge datasets than to large-scale graphs
Here, we introduce the new system: Pregel!
Model of Computation
Model of Pregel Computation
Input → Supersteps → Output
Input: a directed graph
• Vertex: a vertex ID and a modifiable value
• Edges: associated with their source vertices; each has a target vertex and a modifiable value
Supersteps:
• A sequence of iterations
• Vertices compute in parallel
Output: a directed graph
• The set of values explicitly output by the vertices
• Vertices and edges can be added and removed
Maximum Value Example
• propagate the largest value to every vertex
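A minimal single-process sketch of this example, written as Pregel-style supersteps. It is not GoldenOrb or Giraph code: the class MaxValueSketch, its methods, the four-vertex graph and the initial values are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical single-process sketch of Pregel-style supersteps; not framework code.
final class MaxValueSketch {
    static Map<String, Integer> propagateMax(Map<String, Integer> initial,
                                             Map<String, List<String>> outEdges) {
        Map<String, Integer> value = new HashMap<>(initial);

        // Superstep 0: every vertex sends its own value along its out-edges.
        Map<String, List<Integer>> inbox = new HashMap<>();
        for (String v : value.keySet()) send(v, value.get(v), outEdges, inbox);

        // Later supersteps: only vertices that learned a larger value send again.
        while (!inbox.isEmpty()) {
            Map<String, List<Integer>> nextInbox = new HashMap<>();
            for (Map.Entry<String, List<Integer>> e : inbox.entrySet()) {
                String v = e.getKey();
                int best = value.get(v);
                for (int m : e.getValue()) best = Math.max(best, m);
                if (best > value.get(v)) {
                    value.put(v, best);
                    send(v, best, outEdges, nextInbox);
                }
                // Otherwise the vertex sends nothing, i.e. it votes to halt.
            }
            inbox = nextInbox;
        }
        return value;
    }

    // Deliver one payload to every out-neighbor of the sending vertex.
    private static void send(String from, int payload, Map<String, List<String>> outEdges,
                             Map<String, List<Integer>> inbox) {
        for (String to : outEdges.getOrDefault(from, List.of())) {
            inbox.computeIfAbsent(to, k -> new ArrayList<>()).add(payload);
        }
    }

    // Assumed example topology: a chain A <-> B <-> C <-> D.
    static Map<String, List<String>> exampleEdges() {
        Map<String, List<String>> g = new HashMap<>();
        g.put("A", List.of("B"));
        g.put("B", List.of("A", "C"));
        g.put("C", List.of("B", "D"));
        g.put("D", List.of("C"));
        return g;
    }

    public static void main(String[] args) {
        Map<String, Integer> start = Map.of("A", 3, "B", 6, "C", 2, "D", 1);
        System.out.println(propagateMax(start, exampleEdges()));
    }
}
```

The loop ends when a superstep produces no messages, which corresponds to every vertex having voted to halt.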
[Figure: four vertices A, B, C, D]
Single Source Shortest Path (SSSP)
• Problem
– Find the shortest path from a source node to all target nodes
• Solution
– MapReduce
– Pregel
Example: SSSP—using Pregel
[Figures, slides 25 to 33: the example graph (nodes A to E, same edge weights as in the MapReduce example) evolving over the supersteps. Recoverable vertex values and messages:]
• Slide 25: A = 0; B, C, D, E = ∞
• Slide 27: A = 0, B = 10, D = 5; C, E = ∞
• Slide 29: A = 0, B = 8, C = 11, D = 5, E = 7
• Slide 30: same values; messages in flight: 9, 10, 13, 14, 15
• Slide 31: A = 0, B = 8, C = 9, D = 5, E = 7
• Slide 32: same values; message in flight: 13
• Slide 33: A = 0, B = 8, C = 9, D = 5, E = 7 (final)
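The same shortest-path computation can be sketched vertex-centrically. This is a single-process illustration under stated assumptions, not the Pregel, GoldenOrb or Giraph API: the class PregelSsspSketch and its methods are hypothetical, and the inbox keeps only the smallest message per vertex, which is what a min-combiner would do.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical single-process sketch of vertex-centric SSSP; not framework code.
final class PregelSsspSketch {
    static final int INF = Integer.MAX_VALUE;

    static Map<String, Integer> run(Map<String, Map<String, Integer>> graph, String source) {
        Map<String, Integer> dist = new TreeMap<>();
        for (String v : graph.keySet()) dist.put(v, INF);

        // Inbox for the coming superstep: smallest candidate distance per vertex.
        Map<String, Integer> inbox = new HashMap<>();
        inbox.put(source, 0);

        while (!inbox.isEmpty()) { // one loop iteration = one superstep
            Map<String, Integer> nextInbox = new HashMap<>();
            for (Map.Entry<String, Integer> msg : inbox.entrySet()) {
                String v = msg.getKey();
                int candidate = msg.getValue();
                if (candidate < dist.get(v)) {
                    // Improved: store the new distance and notify the out-neighbors.
                    dist.put(v, candidate);
                    for (Map.Entry<String, Integer> edge : graph.get(v).entrySet()) {
                        nextInbox.merge(edge.getKey(), candidate + edge.getValue(), Integer::min);
                    }
                }
                // A vertex that did not improve sends nothing, i.e. it votes to halt.
            }
            inbox = nextInbox;
        }
        return dist;
    }

    // The five-node example graph (every node must appear as a key).
    static Map<String, Map<String, Integer>> exampleGraph() {
        Map<String, Map<String, Integer>> g = new HashMap<>();
        g.put("A", Map.of("B", 10, "D", 5));
        g.put("B", Map.of("C", 1, "D", 2));
        g.put("C", Map.of("E", 4));
        g.put("D", Map.of("B", 3, "C", 9, "E", 2));
        g.put("E", Map.of("A", 7, "C", 6));
        return g;
    }

    public static void main(String[] args) {
        System.out.println(run(exampleGraph(), "A")); // {A=0, B=8, C=9, D=5, E=7}
    }
}
```

Unlike the MapReduce version, the graph structure stays in place between supersteps and only the distance messages move.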
Pregel vs MapReduce
• Pregel
– Keeps vertices and edges on the machine that performs the computation
– Uses network transfers only for messages
– Sufficiently expressive; no need for remote reads
• MapReduce
– Requires much more communication and associated overhead
– Coordinating the steps of a chained MapReduce adds programming complexity
Pregel Architecture
System Architecture
• The Pregel system uses the master/worker model
– Master
- Coordinates worker activity
- Determines the number of partitions and assigns them to workers
- Recovers from worker faults (detected with “ping” messages)
- Maintains statistics about the progress of the computation and the state of the graph
– Worker
- Maintains the state of its portion of the graph in memory
- Executes the Compute() method
- Communicates with the other workers
[Figure: master/worker execution flow, with the following labels]
• Assign portions of the input
• Instruct each worker to perform a superstep
• Call Compute() for each vertex
• Update the data structures
• Receive/send messages
• Respond to the master when finished
• Control the number of partitions in the graph
• Notify the master to start processing
Fault Tolerance
• Checkpointing
– The master periodically instructs the workers to save the state of their partitions to persistent storage
- e.g., vertex values, edge values, incoming messages
• Failure detection
– Using regular “ping” messages
• Recovery
– The master reassigns graph partitions to the currently available workers
– All workers reload their partition state from the most recent available checkpoint
Goldenorb
Goldenorb
• Open Source Version of Google’s Pregel
• Implemented in Java
• Version 0.1.1
• Requirements
- hadoop file system
- zookeeper for communication
[Figure: GoldenOrb architecture, with the following labels]
• ZK-TREE: Orbcluster, JobsInProgress, Jobid, Messages, heartbeat, OrbTracker, LeaderGroup, JobQue, OrbTrackters
• Orb-Tracker(L), Orb-Tracker(S), Orb-Tracker(S): job manager, partition manager, watcher/event, partition request
• Partition (master), …, Partition (slave), Partition (slave): each with an inbound msg queue, outbound msg queue, and current msg queue
• Barriers: startLoadVerticesBarrier, Superstep start Barrier, doneComputingVerticesBarr, doneSendingMessageBarrier, …
• Events: LeaderShipChange, LostMember, NewMember, JobStart/Death/Complete…
• HDFS: read/write
Message Exchange
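• Messages are exchanged between supersteps
• The outbound messages of superstep [S-1] are the inbound messages of superstep [S]
• When the outbound queue is full, its messages are sent and queuing starts again
• A partition that receives messages in the middle of a superstep stores them in its inbound queue and keeps them until the next superstep
• The messages to be used in the current superstep are copied to the current message queue
• At this point, an overflow occurs if the inbound queue exceeds the memory size of the system or the JVM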
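The queue behaviour described above can be modelled roughly as follows. This is a sketch under stated assumptions, not GoldenOrb source: the class PartitionQueuesSketch and its methods are hypothetical, messages are plain strings, and each partition has a single peer.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of one partition's three message queues; not GoldenOrb classes.
final class PartitionQueuesSketch {
    private final int outboundCapacity;
    private final List<String> outbound = new ArrayList<>();
    private final List<String> inbound = new ArrayList<>();
    private List<String> current = new ArrayList<>();
    private PartitionQueuesSketch peer;

    PartitionQueuesSketch(int outboundCapacity) {
        this.outboundCapacity = outboundCapacity;
    }

    void connect(PartitionQueuesSketch peer) {
        this.peer = peer;
    }

    // Queue a message; a full outbound queue is sent immediately.
    void send(String message) {
        outbound.add(message);
        if (outbound.size() >= outboundCapacity) flush();
    }

    // Deliver everything queued so far into the peer's inbound queue.
    void flush() {
        peer.inbound.addAll(outbound);
        outbound.clear();
    }

    // Superstep barrier: copy inbound into current, then empty inbound.
    // During the copy both queues hold the messages, which is the memory risk noted above.
    void startSuperstep() {
        current = new ArrayList<>(inbound);
        inbound.clear();
    }

    List<String> currentMessages() {
        return current;
    }

    int inboundSize() {
        return inbound.size();
    }

    public static void main(String[] args) {
        PartitionQueuesSketch p1 = new PartitionQueuesSketch(2);
        PartitionQueuesSketch p2 = new PartitionQueuesSketch(2);
        p1.connect(p2);
        p1.send("m1");
        p1.send("m2"); // outbound queue is full, so both messages are delivered
        p2.startSuperstep();
        System.out.println(p2.currentMessages());
    }
}
```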
Memory management
• Outbound Message Queue
- Fixed size; messages are sent as soon as it is full
• Inbound Message Queue
- Used in the next superstep
- May overflow when the volume of messages grows
• Current Message Queue
- Same size as the inbound queue
- In the current superstep the inbound queue is copied into the current queue, so memory use is currentQueue + inboundQueue, which can overflow
→ The inbound queue needs to be implemented on file-based local storage
API
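• Sub-classing the predefined classes
– Reader/writer/vertex/message
// API outline only; method bodies are omitted on the slide.
// The type parameters VV, EV, MV are inferred from the constructor arguments.
abstract class Vertex<VV, EV, MV> {
    public Vertex(Class<VV> vertexValue, Class<EV> edgeValue, Class<MV> messageValue);
    String vertexID();                               // ID of this vertex
    abstract void compute(Collection<MV> messages);  // user logic, run in each superstep
    long superStep();                                // current superstep number
    void setValue(VV value);
    VV getValue();
    Collection<Edge<EV>> getEdges();                 // edges of this vertex
    void sendMessage(MV message);
    void voteToHalt();                               // signal that this vertex is done
}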
Not yet implemented
• Aggregator
– A mechanism for global communication, monitoring, and data
• Combiner
– Reduces the number of messages
– Example: if compute() sums the message values, a combiner can compute the sum and transmit it as a single message
• Topology mutation
– Remove or add vertices and edges
• Fault recovery
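As an illustration of the combiner idea, the following sketch collapses all messages addressed to the same vertex into one sum before they would be sent. It is not a GoldenOrb or Giraph interface; the class SumCombinerSketch and the (target, value) message representation are assumptions.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sum combiner: one combined message per target vertex.
final class SumCombinerSketch {
    // Each outgoing message is a (target vertex ID, value) pair.
    static Map<String, Double> combine(List<Map.Entry<String, Double>> outgoing) {
        Map<String, Double> combined = new LinkedHashMap<>();
        for (Map.Entry<String, Double> message : outgoing) {
            combined.merge(message.getKey(), message.getValue(), Double::sum);
        }
        return combined;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Double>> outgoing = List.of(
                Map.entry("B", 1.0), Map.entry("C", 4.0), Map.entry("B", 2.0));
        // Three messages go in, two come out: B receives 3.0 and C receives 4.0.
        System.out.println(combine(outgoing));
    }
}
```

This only works when compute() depends on the sum alone, since the individual message values are lost.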
Implementation
Implemented algorithms
• Maximum Value
• Single Source Shortest Path
• PageRank
• K-means
• Mean-shift
MaximumValue
PageRank
• PageRank is Google’s way of deciding a page’s importance
• An important page is linked to by many pages with high PageRank
• PR(A) = PR(v1)/L(v1) + … + PR(vn)/L(vn), where v1, …, vn are the pages linking to A and L(v) is the number of outbound links on page v
• Add a damping factor d
• PR(A) = (1-d) + d ∑ PR(v)/L(v)
• Repeat until converged
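The damped formula above can be iterated directly. The sketch below is a plain in-memory version, not the GoldenOrb or Giraph implementation; the class PageRankSketch, the tolerance and maxIterations parameters, and the three-page graph are assumptions, and every page is assumed to appear as a key of the link map.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the damped PageRank iteration; assumes every page is a key.
final class PageRankSketch {
    static Map<String, Double> pageRank(Map<String, List<String>> outLinks,
                                        double d, double tolerance, int maxIterations) {
        Map<String, Double> rank = new HashMap<>();
        for (String page : outLinks.keySet()) rank.put(page, 1.0);

        for (int i = 0; i < maxIterations; i++) {
            // Every page starts the round with the (1 - d) term.
            Map<String, Double> next = new HashMap<>();
            for (String page : outLinks.keySet()) next.put(page, 1.0 - d);

            // Each page v adds d * PR(v) / L(v) to every page it links to.
            for (Map.Entry<String, List<String>> e : outLinks.entrySet()) {
                List<String> targets = e.getValue();
                if (targets.isEmpty()) continue; // pages without out-links pass nothing on here
                double share = d * rank.get(e.getKey()) / targets.size();
                for (String target : targets) next.merge(target, share, Double::sum);
            }

            // Stop once no rank moved by more than the tolerance.
            double delta = 0.0;
            for (String page : outLinks.keySet()) {
                delta = Math.max(delta, Math.abs(next.get(page) - rank.get(page)));
            }
            rank = next;
            if (delta < tolerance) break;
        }
        return rank;
    }

    public static void main(String[] args) {
        Map<String, List<String>> links = new HashMap<>();
        links.put("A", List.of("B", "C"));
        links.put("B", List.of("C"));
        links.put("C", List.of("A"));
        System.out.println(pageRank(links, 0.85, 1e-9, 1000));
    }
}
```

In this small graph C collects links from both A and B, so it ends with the highest rank, followed by A and then B. The experiments in the deck use a fixed number of iterations instead of a tolerance.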
[Figures, slides 50 and 51: PageRank example graph with vertices A to F, with its <input file> and <output file>]
K-means
• N observations are partitioned into k clusters
• Each observation belongs to the cluster with the nearest mean
[Figure: K-means flowchart. Start → number of clusters K → calculate centroids → distance from objects to centroids → grouping based on minimum distance → no object moved group? NO: back to calculating centroids. YES: end.]
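The loop in the flowchart can be sketched for one-dimensional values as follows. This is a plain sequential illustration, not the message-passing implementation from the deck; the class KMeansSketch is hypothetical, and the point values (A = 2, B = 1, C = 3, D = 100, E = 101, F = 102, with A and B as seeds) are an assumption chosen to use the six values shown in the K-means figures.

```java
import java.util.Arrays;

// Hypothetical 1-D k-means sketch; sequential, no message passing.
final class KMeansSketch {
    // Returns the cluster index of each point once no point changes group.
    static int[] cluster(double[] points, double[] seeds) {
        double[] centroids = seeds.clone();
        int[] assignment = new int[points.length];
        Arrays.fill(assignment, -1);

        boolean moved = true;
        while (moved) {
            moved = false;
            // Assignment step: each point joins the cluster with the nearest centroid.
            for (int i = 0; i < points.length; i++) {
                int best = 0;
                for (int c = 1; c < centroids.length; c++) {
                    if (Math.abs(points[i] - centroids[c]) < Math.abs(points[i] - centroids[best])) {
                        best = c;
                    }
                }
                if (assignment[i] != best) {
                    assignment[i] = best;
                    moved = true;
                }
            }
            // Update step: each centroid becomes the mean of its members.
            for (int c = 0; c < centroids.length; c++) {
                double sum = 0.0;
                int count = 0;
                for (int i = 0; i < points.length; i++) {
                    if (assignment[i] == c) {
                        sum += points[i];
                        count++;
                    }
                }
                if (count > 0) centroids[c] = sum / count;
            }
        }
        return assignment;
    }

    public static void main(String[] args) {
        double[] points = {2, 1, 3, 100, 101, 102}; // assumed values for A..F
        double[] seeds = {2, 1};                    // seeds: Value(A), Value(B)
        System.out.println(Arrays.toString(cluster(points, seeds))); // [1, 1, 1, 0, 0, 0]
    }
}
```

With these assumed values, index 0 plays the role of C1 and index 1 the role of C2, so A, B, C end in C2 and D, E, F in C1.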
K-means
• Each message includes a cluster ID and a value
• In every superstep, each vertex sends a message to all vertices
[Figure: six vertices A to F with the values 1, 2, 3, 100, 101, 102; seed1 and seed2 mark the two initial centroids]
Step  A   B   C   D   E   F
S0    C1  C2  -   -   -   -
K-means
Centroid1 = Value(A)
Centroid2 = Value(B)
[Figure: vertices A to F (values 1, 2, 3, 100, 101, 102) with seed1 and seed2, before and after the first assignment]
Step  A   B   C   D   E   F
S0    C1  C2  -   -   -   -
S1    C1  C2  C1  C1  C1  C1
K-means
Centroid1 = Mean(A,C,D,E,F)
Centroid2 = Mean(B)
[Figure: vertices A to F (values 1, 2, 3, 100, 101, 102), before and after the second assignment]
Step  A   B   C   D   E   F
S0    C1  C2  -   -   -   -
S1    C1  C2  C1  C1  C1  C1
S2    C2  C2  C2  C1  C1  C1
K-means
Centroid1 = Mean(D,E,F)
Centroid2 = Mean(A,B,C)
[Figure: vertices A to F (values 1, 2, 3, 100, 101, 102) in their final clusters]
Step  A   B   C   D   E   F
S0    C1  C2  -   -   -   -
S1    C1  C2  C1  C1  C1  C1
S2    C2  C2  C2  C1  C1  C1
S3    C2  C2  C2  C1  C1  C1
If the centroids have converged, quit the process!
K-means
[Figure: K-means <input file> and <output file>]
N: number of vertices
In each superstep, N×N messages are exchanged.
→ O(N²): needs too much memory!
Giraph
Giraph
• The Apache Software Foundation (ASF)'s open-source version of Google’s Pregel
• Implemented in Java
• Apache Incubator project
• Requirements
- Hadoop 0.20.203 or higher: runs as a map-only job in Hadoop
- ZooKeeper: if it does not exist, the Hadoop file system is used instead of ZooKeeper
Giraph – vertex distribution
Giraph - usages
• Users can set the checkpoint frequency
– GiraphJob.getConfiguration().setInt("giraph.checkpointFrequency", 0); // 0 means no checkpoints
• Users should set the ZooKeeper configuration
– GiraphJob.setZookeeperConfiguration("zk-server-list");
Giraph - Characteristics
• Fault tolerance
– If the master dies, a new one automatically takes over
– If a worker dies, the application is rolled back to a previously checkpointed superstep
– If a ZooKeeper server dies, the application can proceed as long as a quorum remains
– However, the Hadoop single point of failure (SPOF) still exists
• Combiner/Aggregator
• JSON input/output format
• Easy job status monitoring (HTTP)
Experiments
Experiments
• 3 servers
• nPartition = nMapper = 9
• MapReduce vs GoldenOrb vs Giraph
– PageRank
– K-means (Mahout)
– Elapsed time, CPU, memory, disk, network
Experiments - PageRank
• Number of vertices ≈ 220,000
• Fixed number of iterations = 100
System     Elapsed time  CPU (%)  Memory (kb)  Network rcv. (bytes)  Network trans. (bytes)  Disk read (sec/s)  Disk write (sec/s)
GoldenOrb  1m 56s        14.53    3,745,376    19,437                12,845                  777                606
Giraph     3m 31s        8.77     1,244,000    11,374                914                     0                  326
MapReduce  34m 51s       3.75     3,091,239    13,514                867                     0                  4101
Experiments - Kmeans
• Number of vertices = 100,000
• Number of clusters (K) = 10
System     Elapsed time  CPU (%)  Memory (kb)  Network rcv. (bytes)  Network trans. (bytes)  Disk read (sec/s)  Disk write (sec/s)
GoldenOrb  3m 19s        13.32    3,857,892    11,634                27,086                  128                1151
Giraph     1m 49s        6.36     1,245,000    7,810                 1,999                   0                  536
MapReduce  11m 28s       1.48     2,645,517    13,528                1,005                   0                  7104
Experiments
[Figure: chart of elapsed time (s)]
Installation
Install Goldenorb (1)
• Requirements
- hadoop-0.20.2
- zookeeper-3.3.3
• Download & unzip
– org.goldenorb.core-0.1.1-SNAPSHOT-distribution.zip
Install Goldenorb(2)
• Set configuration
① ORB_HOME environment variable
> export ORB_HOME=/usr/local/goldenorb
② conf/orbServers
> localhost:/usr/local/goldenorb
③ conf/orb-site.xml
> cp orb-site.sample.xml orb-site.xml
> vi orb-site.xml
④ In distributed mode, copy to all servers
<property>
<name>goldenOrb.zookeeper.quorum</name>
<!-- target ZooKeeper server IP -->
<value>localhost</value>
<description>The server running zookeeper</description>
</property>
……
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
Install Goldenorb(3)
• Set running environment
① Start Hadoop
> $HADOOP_HOME/bin/start-dfs.sh
② Start ZooKeeper
> $ZK_HOME/bin/zkServer.sh start
③ Start the Orb-tracker
> $ORB_HOME/bin/orb-tracker.sh start
④ Check the logs
> cat $ORB_HOME/logs/xxx.log
Install Goldenorb(4)
• Make input
- ex) maximum value
<vertex-id> <value> <outgoing-edge-list>
A 0 B D
B 8 C D
C 11 E
D 5 B C E
E 7 A C
[Figure: the example graph with vertices A to E and the vertex values listed above]
Install Goldenorb(5)
• Upload input files
> hadoop fs -put maxvalue.txt /test/
• Run
> java -cp conf/.:org.goldenorb.core-0.1.1-SNAPSHOT.jar:lib/*:yourjar.jar
org.goldenOrb.algorithms.YourAlgorithm
goldenOrb.orb.localFilesToDistribute=/home/user/yourjar.jar
mapred.input.dir=/test/maxvaluetxt/ mapred.output.dir=/test/output
goldenOrb.orb.requestedPartitions=3 goldenOrb.orb.reservedPartitions=0
goldenOrb.orb.classpaths=yourjar.jar
• Result
> hadoop fs -ls /test/output
> hadoop fs -cat /test/output/*
Install Giraph(1)
• Requirements
- hadoop-0.20.203
- zookeeper-3.3.3
- maven 3.0.3
• Download
> svn checkout http://svn.apache.org/repos/asf/incubator/giraph/trunk giraph
• Compile
> mvn compile
– Generate target/giraph-{version}-jar-with-dependencies.jar
Install Giraph(2)
• Set running environment
① > $HADOOP_HOME/bin/start-all.sh
② > $ZK_HOME/bin/zkServer.sh start
• Upload input file to HDFS
> hadoop fs -put test.grf /giraph/test/input/
* Input format (JSON)
[0,0,[[1,0]]]
[1,0,[[2,100]]]
[2,100,[[3,200]]]
[3,300,[[4,300]]]
[4,600,[[5,400]]]
Install Giraph(3)
• Run
- ex) shortest path algorithm
> hadoop jar giraph-{version}-jar-with-dependencies.jar
org.apache.giraph.examples.SimpleShortestPathsVertex <input-path>
<output-path> <source-node> <number of workers>
• Running status
– http://localhost:50030
• Result
> hadoop fs -cat <output-path>/part-*
Orb vs Giraph
                                          GoldenOrb   Giraph
Status check                              Log files   Easy
Fault tolerance                           X           O
Vertex mutation, Combiner, Aggregator…    X           O
Development environment                   X           O
I/O format                                O           X
Update                                    X           O
Thank you !
79

More Related Content

Similar to Graph analysis platform comparison, pregel/goldenorb/giraph

2010-Pregel
2010-Pregel2010-Pregel
2010-Pregelbinzhao
 
My mapreduce1 presentation
My mapreduce1 presentationMy mapreduce1 presentation
My mapreduce1 presentationNoha Elprince
 
Scalable and Adaptive Graph Querying with MapReduce
Scalable and Adaptive Graph Querying with MapReduceScalable and Adaptive Graph Querying with MapReduce
Scalable and Adaptive Graph Querying with MapReduceKyong-Ha Lee
 
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop ClustersHDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop ClustersXiao Qin
 
Intermachine Parallelism
Intermachine ParallelismIntermachine Parallelism
Intermachine ParallelismSri Prasanna
 
クラウドDWHとしても進化を続けるPivotal Greenplumご紹介
クラウドDWHとしても進化を続けるPivotal Greenplumご紹介クラウドDWHとしても進化を続けるPivotal Greenplumご紹介
クラウドDWHとしても進化を続けるPivotal Greenplumご紹介Masayuki Matsushita
 
Data Analytics and Simulation in Parallel with MATLAB*
Data Analytics and Simulation in Parallel with MATLAB*Data Analytics and Simulation in Parallel with MATLAB*
Data Analytics and Simulation in Parallel with MATLAB*Intel® Software
 
Sparse matrix computations in MapReduce
Sparse matrix computations in MapReduceSparse matrix computations in MapReduce
Sparse matrix computations in MapReduceDavid Gleich
 
NVIDIA HPC ソフトウエア斜め読み
NVIDIA HPC ソフトウエア斜め読みNVIDIA HPC ソフトウエア斜め読み
NVIDIA HPC ソフトウエア斜め読みNVIDIA Japan
 
PL/CUDA - Fusion of HPC Grade Power with In-Database Analytics
PL/CUDA - Fusion of HPC Grade Power with In-Database AnalyticsPL/CUDA - Fusion of HPC Grade Power with In-Database Analytics
PL/CUDA - Fusion of HPC Grade Power with In-Database AnalyticsKohei KaiGai
 
20160407_GTC2016_PgSQL_In_Place
20160407_GTC2016_PgSQL_In_Place20160407_GTC2016_PgSQL_In_Place
20160407_GTC2016_PgSQL_In_PlaceKohei KaiGai
 
Graphs made easy with SAS ODS Graphics Designer (PAPER)
Graphs made easy with SAS ODS Graphics Designer (PAPER)Graphs made easy with SAS ODS Graphics Designer (PAPER)
Graphs made easy with SAS ODS Graphics Designer (PAPER)Kevin Lee
 
pgconfasia2016 plcuda en
pgconfasia2016 plcuda enpgconfasia2016 plcuda en
pgconfasia2016 plcuda enKohei KaiGai
 
Hadoop Network Performance profile
Hadoop Network Performance profileHadoop Network Performance profile
Hadoop Network Performance profilepramodbiligiri
 
ExtraV - Boosting Graph Processing Near Storage with a Coherent Accelerator
ExtraV - Boosting Graph Processing Near Storage with a Coherent AcceleratorExtraV - Boosting Graph Processing Near Storage with a Coherent Accelerator
ExtraV - Boosting Graph Processing Near Storage with a Coherent AcceleratorJinho Lee
 
NYAI - Scaling Machine Learning Applications by Braxton McKee
NYAI - Scaling Machine Learning Applications by Braxton McKeeNYAI - Scaling Machine Learning Applications by Braxton McKee
NYAI - Scaling Machine Learning Applications by Braxton McKeeRizwan Habib
 

Similar to Graph analysis platform comparison, pregel/goldenorb/giraph (20)

LalitBDA2015V3
LalitBDA2015V3LalitBDA2015V3
LalitBDA2015V3
 
2010-Pregel
2010-Pregel2010-Pregel
2010-Pregel
 
My mapreduce1 presentation
My mapreduce1 presentationMy mapreduce1 presentation
My mapreduce1 presentation
 
Scalable and Adaptive Graph Querying with MapReduce
Scalable and Adaptive Graph Querying with MapReduceScalable and Adaptive Graph Querying with MapReduce
Scalable and Adaptive Graph Querying with MapReduce
 
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop ClustersHDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
 
Intermachine Parallelism
Intermachine ParallelismIntermachine Parallelism
Intermachine Parallelism
 
LMmanual.pdf
LMmanual.pdfLMmanual.pdf
LMmanual.pdf
 
クラウドDWHとしても進化を続けるPivotal Greenplumご紹介
クラウドDWHとしても進化を続けるPivotal Greenplumご紹介クラウドDWHとしても進化を続けるPivotal Greenplumご紹介
クラウドDWHとしても進化を続けるPivotal Greenplumご紹介
 
Data Analytics and Simulation in Parallel with MATLAB*
Data Analytics and Simulation in Parallel with MATLAB*Data Analytics and Simulation in Parallel with MATLAB*
Data Analytics and Simulation in Parallel with MATLAB*
 
Sparse matrix computations in MapReduce
Sparse matrix computations in MapReduceSparse matrix computations in MapReduce
Sparse matrix computations in MapReduce
 
NVIDIA HPC ソフトウエア斜め読み
NVIDIA HPC ソフトウエア斜め読みNVIDIA HPC ソフトウエア斜め読み
NVIDIA HPC ソフトウエア斜め読み
 
PL/CUDA - Fusion of HPC Grade Power with In-Database Analytics
PL/CUDA - Fusion of HPC Grade Power with In-Database AnalyticsPL/CUDA - Fusion of HPC Grade Power with In-Database Analytics
PL/CUDA - Fusion of HPC Grade Power with In-Database Analytics
 
20160407_GTC2016_PgSQL_In_Place
20160407_GTC2016_PgSQL_In_Place20160407_GTC2016_PgSQL_In_Place
20160407_GTC2016_PgSQL_In_Place
 
Graphs made easy with SAS ODS Graphics Designer (PAPER)
Graphs made easy with SAS ODS Graphics Designer (PAPER)Graphs made easy with SAS ODS Graphics Designer (PAPER)
Graphs made easy with SAS ODS Graphics Designer (PAPER)
 
pgconfasia2016 plcuda en
pgconfasia2016 plcuda enpgconfasia2016 plcuda en
pgconfasia2016 plcuda en
 
Dsp lab manual 15 11-2016
Dsp lab manual 15 11-2016Dsp lab manual 15 11-2016
Dsp lab manual 15 11-2016
 
Hadoop Network Performance profile
Hadoop Network Performance profileHadoop Network Performance profile
Hadoop Network Performance profile
 
MapReduce
MapReduceMapReduce
MapReduce
 
ExtraV - Boosting Graph Processing Near Storage with a Coherent Accelerator
ExtraV - Boosting Graph Processing Near Storage with a Coherent AcceleratorExtraV - Boosting Graph Processing Near Storage with a Coherent Accelerator
ExtraV - Boosting Graph Processing Near Storage with a Coherent Accelerator
 
NYAI - Scaling Machine Learning Applications by Braxton McKee
NYAI - Scaling Machine Learning Applications by Braxton McKeeNYAI - Scaling Machine Learning Applications by Braxton McKee
NYAI - Scaling Machine Learning Applications by Braxton McKee
 

More from Andrew Yongjoon Kong

Nightmare with ceph : Recovery from ceph cluster total failure
Nightmare with ceph : Recovery from ceph cluster total failureNightmare with ceph : Recovery from ceph cluster total failure
Nightmare with ceph : Recovery from ceph cluster total failureAndrew Yongjoon Kong
 
Stream analysis with kafka native way and considerations about monitoring as ...
Stream analysis with kafka native way and considerations about monitoring as ...Stream analysis with kafka native way and considerations about monitoring as ...
Stream analysis with kafka native way and considerations about monitoring as ...Andrew Yongjoon Kong
 
Automating auto-scaled load balancer based on linux and vm orchestrator
Automating auto-scaled load balancer based on linux and vm orchestratorAutomating auto-scaled load balancer based on linux and vm orchestrator
Automating auto-scaled load balancer based on linux and vm orchestratorAndrew Yongjoon Kong
 
GPU cloud with Job scheduler and Container
GPU cloud with Job scheduler and ContainerGPU cloud with Job scheduler and Container
GPU cloud with Job scheduler and ContainerAndrew Yongjoon Kong
 
Cloud: From Unmanned Data Center to Algorithmic Economy using Openstack
Cloud: From Unmanned Data Center to Algorithmic Economy using OpenstackCloud: From Unmanned Data Center to Algorithmic Economy using Openstack
Cloud: From Unmanned Data Center to Algorithmic Economy using OpenstackAndrew Yongjoon Kong
 

More from Andrew Yongjoon Kong (12)

Tunnel without tunnel
Tunnel without tunnelTunnel without tunnel
Tunnel without tunnel
 
Nightmare with ceph : Recovery from ceph cluster total failure
Nightmare with ceph : Recovery from ceph cluster total failureNightmare with ceph : Recovery from ceph cluster total failure
Nightmare with ceph : Recovery from ceph cluster total failure
 
Stream analysis with kafka native way and considerations about monitoring as ...
Stream analysis with kafka native way and considerations about monitoring as ...Stream analysis with kafka native way and considerations about monitoring as ...
Stream analysis with kafka native way and considerations about monitoring as ...
 
Automating auto-scaled load balancer based on linux and vm orchestrator
Automating auto-scaled load balancer based on linux and vm orchestratorAutomating auto-scaled load balancer based on linux and vm orchestrator
Automating auto-scaled load balancer based on linux and vm orchestrator
 
GPU cloud with Job scheduler and Container
GPU cloud with Job scheduler and ContainerGPU cloud with Job scheduler and Container
GPU cloud with Job scheduler and Container
 
Embracing clouds
Embracing cloudsEmbracing clouds
Embracing clouds
 
openstack, devops and people
openstack, devops and peopleopenstack, devops and people
openstack, devops and people
 
Cloud data center and openstack
Cloud data center and openstackCloud data center and openstack
Cloud data center and openstack
 
Openstack summit 2015
Openstack summit 2015Openstack summit 2015
Openstack summit 2015
 
Cloud: From Unmanned Data Center to Algorithmic Economy using Openstack
Cloud: From Unmanned Data Center to Algorithmic Economy using OpenstackCloud: From Unmanned Data Center to Algorithmic Economy using Openstack
Cloud: From Unmanned Data Center to Algorithmic Economy using Openstack
 
Way to cloud
Way to cloudWay to cloud
Way to cloud
 
Openstack dev on
Openstack dev onOpenstack dev on
Openstack dev on
 

Recently uploaded

Call Girls In The Ocean Pearl Retreat Hotel New Delhi 9873777170
Call Girls In The Ocean Pearl Retreat Hotel New Delhi 9873777170Call Girls In The Ocean Pearl Retreat Hotel New Delhi 9873777170
Call Girls In The Ocean Pearl Retreat Hotel New Delhi 9873777170Sonam Pathan
 
Git and Github workshop GDSC MLRITM
Git and Github  workshop GDSC MLRITMGit and Github  workshop GDSC MLRITM
Git and Github workshop GDSC MLRITMgdsc13
 
Call Girls in Uttam Nagar Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Uttam Nagar Delhi 💯Call Us 🔝8264348440🔝Call Girls in Uttam Nagar Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Uttam Nagar Delhi 💯Call Us 🔝8264348440🔝soniya singh
 
Call Girls South Delhi Delhi reach out to us at ☎ 9711199012
Call Girls South Delhi Delhi reach out to us at ☎ 9711199012Call Girls South Delhi Delhi reach out to us at ☎ 9711199012
Call Girls South Delhi Delhi reach out to us at ☎ 9711199012rehmti665
 
Top 10 Interactive Website Design Trends in 2024.pptx
Top 10 Interactive Website Design Trends in 2024.pptxTop 10 Interactive Website Design Trends in 2024.pptx
Top 10 Interactive Website Design Trends in 2024.pptxDyna Gilbert
 
办理多伦多大学毕业证成绩单|购买加拿大UTSG文凭证书
办理多伦多大学毕业证成绩单|购买加拿大UTSG文凭证书办理多伦多大学毕业证成绩单|购买加拿大UTSG文凭证书
办理多伦多大学毕业证成绩单|购买加拿大UTSG文凭证书zdzoqco
 
定制(AUT毕业证书)新西兰奥克兰理工大学毕业证成绩单原版一比一
定制(AUT毕业证书)新西兰奥克兰理工大学毕业证成绩单原版一比一定制(AUT毕业证书)新西兰奥克兰理工大学毕业证成绩单原版一比一
定制(AUT毕业证书)新西兰奥克兰理工大学毕业证成绩单原版一比一Fs
 
Contact Rya Baby for Call Girls New Delhi
Contact Rya Baby for Call Girls New DelhiContact Rya Baby for Call Girls New Delhi
Contact Rya Baby for Call Girls New Delhimiss dipika
 
Font Performance - NYC WebPerf Meetup April '24
Font Performance - NYC WebPerf Meetup April '24Font Performance - NYC WebPerf Meetup April '24
Font Performance - NYC WebPerf Meetup April '24Paul Calvano
 
A Good Girl's Guide to Murder (A Good Girl's Guide to Murder, #1)
A Good Girl's Guide to Murder (A Good Girl's Guide to Murder, #1)A Good Girl's Guide to Murder (A Good Girl's Guide to Murder, #1)
A Good Girl's Guide to Murder (A Good Girl's Guide to Murder, #1)Christopher H Felton
 
定制(Lincoln毕业证书)新西兰林肯大学毕业证成绩单原版一比一
定制(Lincoln毕业证书)新西兰林肯大学毕业证成绩单原版一比一定制(Lincoln毕业证书)新西兰林肯大学毕业证成绩单原版一比一
定制(Lincoln毕业证书)新西兰林肯大学毕业证成绩单原版一比一Fs
 
办理(UofR毕业证书)罗切斯特大学毕业证成绩单原版一比一
办理(UofR毕业证书)罗切斯特大学毕业证成绩单原版一比一办理(UofR毕业证书)罗切斯特大学毕业证成绩单原版一比一
办理(UofR毕业证书)罗切斯特大学毕业证成绩单原版一比一z xss
 
Call Girls Service Adil Nagar 7001305949 Need escorts Service Pooja Vip
Call Girls Service Adil Nagar 7001305949 Need escorts Service Pooja VipCall Girls Service Adil Nagar 7001305949 Need escorts Service Pooja Vip
Call Girls Service Adil Nagar 7001305949 Need escorts Service Pooja VipCall Girls Lucknow
 
定制(UAL学位证)英国伦敦艺术大学毕业证成绩单原版一比一
定制(UAL学位证)英国伦敦艺术大学毕业证成绩单原版一比一定制(UAL学位证)英国伦敦艺术大学毕业证成绩单原版一比一
定制(UAL学位证)英国伦敦艺术大学毕业证成绩单原版一比一Fs
 
Blepharitis inflammation of eyelid symptoms cause everything included along w...
Blepharitis inflammation of eyelid symptoms cause everything included along w...Blepharitis inflammation of eyelid symptoms cause everything included along w...
Blepharitis inflammation of eyelid symptoms cause everything included along w...Excelmac1
 
Potsdam FH学位证,波茨坦应用技术大学毕业证书1:1制作
Potsdam FH学位证,波茨坦应用技术大学毕业证书1:1制作Potsdam FH学位证,波茨坦应用技术大学毕业证书1:1制作
Potsdam FH学位证,波茨坦应用技术大学毕业证书1:1制作ys8omjxb
 
定制(Management毕业证书)新加坡管理大学毕业证成绩单原版一比一
定制(Management毕业证书)新加坡管理大学毕业证成绩单原版一比一定制(Management毕业证书)新加坡管理大学毕业证成绩单原版一比一
定制(Management毕业证书)新加坡管理大学毕业证成绩单原版一比一Fs
 

Recently uploaded (20)

Call Girls In The Ocean Pearl Retreat Hotel New Delhi 9873777170
Call Girls In The Ocean Pearl Retreat Hotel New Delhi 9873777170Call Girls In The Ocean Pearl Retreat Hotel New Delhi 9873777170

Graph analysis platform comparison, pregel/goldenorb/giraph

  • 1. A System for Large-Scale Graph Processing: Pregel, GoldenOrb, Giraph • 2012-07-18 • Andrew Yongjoon Kong • sstrato.open@gmail.com
  • 2. Contents • Introduction • Model of Computation • Pregel Architecture • GoldenOrb • Implementation • Future work
  • 4. Introduction • Today, many practical computing problems concern large graphs • Applied algorithms - Shortest-path computations - PageRank - Clustering techniques • MapReduce is ill-suited for graph processing - Parallel graph processing needs many iterations - Materializing intermediate results at every MapReduce iteration harms performance
  • 5. Introduction • Hadoop is well-suited for non-iterative, data-parallel processing • Smith-Waterman is a non-iterative case and of course runs fine
  • 6. Introduction • [Figure: map, map, reduce stages driven by a user program] • Compute the distance to each data point from each cluster center and assign points to cluster centers • Compute new cluster centers
  • 7. Iterative? • Should handle iterative processing such as PDEs (partial differential equations) • http://www.iterativemapreduce.org/
  • 8. Graph-based Computation • Pregel – Google's large-scale graph system • GoldenOrb • Giraph – Yahoo's platform • Hama – Apache Hama • Pegasus – Carnegie Mellon University
  • 9. Single Source Shortest Path (SSSP) • Problem – Find the shortest path from a source node to all target nodes • Solution – MapReduce – Pregel
  • 10. Example: SSSP using MapReduce • A Map task receives – Key: node n – Value: D (distance from start), points-to (list of nodes reachable from n) – For each node m in points-to, it emits the candidate distance D(n) + w(n, m), where w(n, m) is the edge weight • The Reduce task gathers the candidate distances for each node and selects the minimum one
  • 11. Example: SSSP using MapReduce • Adjacency list: A: (B, 10), (D, 5); B: (C, 1), (D, 2); C: (E, 4); D: (B, 3), (C, 9), (E, 2); E: (A, 7), (C, 6) • Adjacency matrix (rows = source, columns = destination, in the order A B C D E): A: 0 10 0 5 0; B: 0 0 1 2 0; C: 0 0 0 0 4; D: 0 3 9 0 2; E: 7 0 6 0 0
  • 12. Example: SSSP using MapReduce • Map input: <node ID, <dist, adj list>> <A, <0, <(B, 10), (D, 5)>>> <B, <inf, <(C, 1), (D, 2)>>> <C, <inf, <(E, 4)>>> <D, <inf, <(B, 3), (C, 9), (E, 2)>>> <E, <inf, <(A, 7), (C, 6)>>> • Map output: <dest node ID, dist> <B, 10> <D, 5> <C, inf> <D, inf> <E, inf> <B, inf> <C, inf> <E, inf> <A, inf> <C, inf>, plus the node records <A, <0, <(B, 10), (D, 5)>>> <B, <inf, <(C, 1), (D, 2)>>> <C, <inf, <(E, 4)>>> <D, <inf, <(B, 3), (C, 9), (E, 2)>>> <E, <inf, <(A, 7), (C, 6)>>> • Flush to local disk!
  • 13. Example: SSSP using MapReduce • Reduce input: <node ID, dist> <A, <0, <(B, 10), (D, 5)>>> <A, inf> <B, <inf, <(C, 1), (D, 2)>>> <B, 10> <B, inf> <C, <inf, <(E, 4)>>> <C, inf> <C, inf> <C, inf> <D, <inf, <(B, 3), (C, 9), (E, 2)>>> <D, 5> <D, inf> <E, <inf, <(A, 7), (C, 6)>>> <E, inf> <E, inf> • For each node, select the minimum of the candidate values and update the result of the previous iteration.
  • 14. Example: SSSP using MapReduce • Reduce output: <node ID, <dist, adj list>> = Map input for the next iteration <A, <0, <(B, 10), (D, 5)>>> <B, <10, <(C, 1), (D, 2)>>> <C, <inf, <(E, 4)>>> <D, <5, <(B, 3), (C, 9), (E, 2)>>> <E, <inf, <(A, 7), (C, 6)>>> • Map output: <dest node ID, dist> <B, 10> <D, 5> <C, 11> <D, 12> <E, inf> <B, 8> <C, 14> <E, 7> <A, inf> <C, inf>, plus the node records <A, <0, <(B, 10), (D, 5)>>> <B, <10, <(C, 1), (D, 2)>>> <C, <inf, <(E, 4)>>> <D, <5, <(B, 3), (C, 9), (E, 2)>>> <E, <inf, <(A, 7), (C, 6)>>> • Flush to local disk!
  • 15. Example: SSSP using MapReduce • Reduce input: <node ID, dist> <A, <0, <(B, 10), (D, 5)>>> <A, inf> <B, <10, <(C, 1), (D, 2)>>> <B, 10> <B, 8> <C, <inf, <(E, 4)>>> <C, 11> <C, 14> <C, inf> <D, <5, <(B, 3), (C, 9), (E, 2)>>> <D, 5> <D, 12> <E, <inf, <(A, 7), (C, 6)>>> <E, inf> <E, 7> • For each node, select the minimum of the candidate values and update the result of the previous iteration.
  • 16. Example: SSSP using MapReduce • Map input: <node ID, <dist, adj list>> <A, <0, <(B, 10), (D, 5)>>> <B, <8, <(C, 1), (D, 2)>>> <C, <11, <(E, 4)>>> <D, <5, <(B, 3), (C, 9), (E, 2)>>> <E, <7, <(A, 7), (C, 6)>>> • Map output: <dest node ID, dist> <B, 10> <D, 5> <C, 9> <D, 10> <E, 15> <B, 8> <C, 14> <E, 7> <A, 14> <C, 13>, plus the node records <A, <0, <(B, 10), (D, 5)>>> <B, <8, <(C, 1), (D, 2)>>> <C, <11, <(E, 4)>>> <D, <5, <(B, 3), (C, 9), (E, 2)>>> <E, <7, <(A, 7), (C, 6)>>> • Flush to local disk!
  • 17. Example: SSSP using MapReduce • Reduce input: <node ID, dist> <A, <0, <(B, 10), (D, 5)>>> <A, 14> <B, <8, <(C, 1), (D, 2)>>> <B, 10> <B, 8> <C, <11, <(E, 4)>>> <C, 9> <C, 14> <C, 13> <D, <5, <(B, 3), (C, 9), (E, 2)>>> <D, 5> <D, 10> <E, <7, <(A, 7), (C, 6)>>> <E, 15> <E, 7> • For each node, select the minimum of the candidate values and update the result of the previous iteration.
  • 18. Example: SSSP using MapReduce • Map input: <node ID, <dist, adj list>> <A, <0, <(B, 10), (D, 5)>>> <B, <8, <(C, 1), (D, 2)>>> <C, <9, <(E, 4)>>> <D, <5, <(B, 3), (C, 9), (E, 2)>>> <E, <7, <(A, 7), (C, 6)>>> • Map output: <dest node ID, dist> <B, 10> <D, 5> <C, 9> <D, 10> <E, 13> <B, 8> <C, 14> <E, 7> <A, 14> <C, 13>, plus the node records <A, <0, <(B, 10), (D, 5)>>> <B, <8, <(C, 1), (D, 2)>>> <C, <9, <(E, 4)>>> <D, <5, <(B, 3), (C, 9), (E, 2)>>> <E, <7, <(A, 7), (C, 6)>>> • Flush to local disk!
  • 19. Example: SSSP using MapReduce • Reduce input: <node ID, dist> <A, <0, <(B, 10), (D, 5)>>> <A, 14> <B, <8, <(C, 1), (D, 2)>>> <B, 10> <B, 8> <C, <9, <(E, 4)>>> <C, 9> <C, 14> <C, 13> <D, <5, <(B, 3), (C, 9), (E, 2)>>> <D, 5> <D, 10> <E, <7, <(A, 7), (C, 6)>>> <E, 13> <E, 7> • No changes. Quit the process! Final distances: A = 0, B = 8, C = 9, D = 5, E = 7
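The walkthrough on slides 12 to 19 can be reproduced with a small in-memory simulation. This is only a sketch of the map and reduce steps, not Hadoop code: the function names (`map_phase`, `reduce_phase`, `sssp`) and the `"NODE"`/`"DIST"` record tags are assumptions made for illustration, and nothing is flushed to disk.

```python
import math

# Adjacency list of the example graph: node -> [(neighbor, edge weight), ...]
GRAPH = {
    "A": [("B", 10), ("D", 5)],
    "B": [("C", 1), ("D", 2)],
    "C": [("E", 4)],
    "D": [("B", 3), ("C", 9), ("E", 2)],
    "E": [("A", 7), ("C", 6)],
}

def map_phase(records):
    """Emit the node record itself plus one candidate distance per outgoing edge."""
    for node, (dist, adj) in records.items():
        yield node, ("NODE", dist, adj)
        for neighbor, weight in adj:
            yield neighbor, ("DIST", dist + weight)

def reduce_phase(pairs):
    """Group by node ID and keep the minimum of the old and candidate distances."""
    grouped = {}
    for key, value in pairs:
        grouped.setdefault(key, []).append(value)
    result = {}
    for node, values in grouped.items():
        adj = next(v[2] for v in values if v[0] == "NODE")
        best = min(v[1] for v in values)  # index 1 holds a distance in both record kinds
        result[node] = (best, adj)
    return result

def sssp(graph, source):
    """Repeat map + reduce until a round changes nothing (hypothetical driver loop)."""
    records = {n: (0 if n == source else math.inf, adj) for n, adj in graph.items()}
    rounds = 0
    while True:
        updated = reduce_phase(map_phase(records))
        rounds += 1
        if updated == records:
            return {n: d for n, (d, _) in records.items()}, rounds
        records = updated

distances, rounds = sssp(GRAPH, "A")
print(distances, rounds)
```

Each pass through the loop corresponds to one full MapReduce job, which is why the number of rounds matters so much for running time on a real cluster.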
  • 20. • MapReduce uses key/value pairs to store the nodes and adjacent distances, so it is better suited to processing huge datasets than large-scale graphs. Here, we introduce the new system: Pregel!
  • 22. Model of Pregel Computation • Input → Supersteps → Output • Supersteps: a sequence of iterations; vertices compute in parallel • Input: a directed graph – Vertex: a vertex ID and a modifiable value – Edge: a target vertex and a modifiable value, associated with its source vertex • Output: a directed graph – The set of values explicitly output by the vertices – Vertices and edges can be added and removed
  • 23. Maximum Value Example • Propagate the largest value to every vertex • [Figure: vertices A, B, C, D]
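A minimal sketch of the maximum-value example in a superstep style follows. The vertex values and edges are hypothetical (the slide's figure did not survive extraction), and `max_value` is an illustrative name, not part of any Pregel API. A vertex that learns nothing new sends no messages, which stands in for voting to halt.

```python
def max_value(values, edges):
    """Propagate the largest value to every vertex, one superstep at a time."""
    values = dict(values)
    # Superstep 0: every vertex sends its own value along its outgoing edges.
    inbox = {v: [] for v in values}
    for v, val in values.items():
        for target in edges[v]:
            inbox[target].append(val)
    # Later supersteps: run until no messages are in flight.
    while any(inbox.values()):
        outbox = {v: [] for v in values}
        for v, messages in inbox.items():
            if messages and max(messages) > values[v]:
                values[v] = max(messages)
                for target in edges[v]:
                    outbox[target].append(values[v])
            # Otherwise the vertex stays silent (it has effectively voted to halt).
        inbox = outbox
    return values

# Hypothetical 4-vertex graph; these values and edges are assumptions.
VALUES = {"A": 3, "B": 6, "C": 2, "D": 1}
EDGES = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
result = max_value(VALUES, EDGES)
print(result)
```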
  • 24. Single Source Shortest Path (SSSP) • Problem – Find the shortest path from a source node to all target nodes • Solution – MapReduce – Pregel
  • 30. Example: SSSP using Pregel • [Figure: the A to E example graph partway through the computation, with vertex distances 0, 8, 5, 11, 7 and messages 13, 14, 15, 9, 10 in flight]
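The same shortest-path problem can be sketched in a vertex-centric style. This is a single-process illustration under stated assumptions, not the Pregel API: `pregel_sssp` is a hypothetical name, and seeding the source with a message of 0 is a simplification of checking "am I the source?" in the first superstep.

```python
import math

GRAPH = {
    "A": [("B", 10), ("D", 5)],
    "B": [("C", 1), ("D", 2)],
    "C": [("E", 4)],
    "D": [("B", 3), ("C", 9), ("E", 2)],
    "E": [("A", 7), ("C", 6)],
}

def pregel_sssp(graph, source):
    dist = {v: math.inf for v in graph}
    inbox = {v: [] for v in graph}
    inbox[source].append(0)  # assumption: seed the source with a zero-distance message
    supersteps = 0
    while any(inbox.values()):
        outbox = {v: [] for v in graph}
        for v, messages in inbox.items():
            if not messages:
                continue  # a halted vertex stays inactive until a message arrives
            candidate = min(messages)
            if candidate < dist[v]:
                dist[v] = candidate
                # Only an improved vertex sends messages to its neighbors.
                for target, weight in graph[v]:
                    outbox[target].append(candidate + weight)
        inbox = outbox
        supersteps += 1
    return dist, supersteps

dist, supersteps = pregel_sssp(GRAPH, "A")
print(dist, supersteps)
```

Unlike the MapReduce version, only the messages move between supersteps; the adjacency lists stay where the vertex lives.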
  • 34. Pregel vs MapReduce • Pregel – Keeps vertices and edges on the machine that performs the computation – Uses network transfers only for messages – Sufficiently expressive; no need for remote reads • MapReduce – Requires much more communication and associated overhead – Coordinating the steps of a chained MapReduce adds programming complexity
  • 36. System Architecture • The Pregel system uses the master/worker model – Master: coordinates worker activity; determines the number of partitions and assigns them to workers; recovers from worker faults (“ping” messages); maintains statistics about the progress of the computation and the state of the graph – Worker: maintains the state of its portion of the graph in memory; executes the Compute() method; communicates with the other workers
  • 37. [Figure: master/worker interaction] • Assign a portion of the input • Instruct each worker to perform a superstep • Call Compute() for each vertex • Update the data structure • Receive/send messages • Respond to the master when finished • Control the number of partitions in the graph • Notify the master to start the processing
  • 38. Fault Tolerance • Checkpointing – The master periodically instructs the workers to save the state of their partitions to persistent storage (e.g., vertex values, edge values, incoming messages) • Failure detection – Uses regular “ping” messages • Recovery – The master reassigns graph partitions to the currently available workers – All workers reload their partition state from the most recent available checkpoint
  • 40. GoldenOrb • Open-source version of Google's Pregel • Implemented in Java • Version 0.1.1 • Requirements - Hadoop file system - ZooKeeper for communication
  • 41. [Figure: GoldenOrb architecture] • ZK-TREE: OrbCluster, JobsInProgress, JobId, Messages, heartbeat, OrbTracker, LeaderGroup, JobQue, OrbTrackters • Orb-Tracker (L), Orb-Tracker (S), Orb-Tracker (S): job manager, partition manager, watcher/event, partition request • Partition (master), …, Partition (slave), Partition (slave): inbound msg queue, outbound msg queue, current msg queue • Barriers: startLoadVerticesBarrier, superstep start barrier, doneComputingVerticesBarr, doneSendingMessageBarrier, … • Events: LeaderShipChange, LostMember, NewMember, JobStart/Death/Complete, … • HDFS read/write
  • 42. Message Exchange • Messages are exchanged between supersteps • The outbound messages of superstep [S-1] are the inbound messages of superstep [S] • When the outbound queue fills up, its messages are sent and queuing resumes • A partition that receives messages in the middle of a superstep stores them in its inbound queue and holds them until the next superstep • The messages to be used in the current superstep are copied to the current message queue • At that point, if the inbound queue exceeds the system or JVM memory size, an overflow occurs
  • 43. Memory management • Outbound message queue - Fixed size; messages are sent as soon as it fills up • Inbound message queue - Used in the next superstep - May overflow as the message volume grows • Current message queue - Same size as the inbound queue - The current superstep works on a copy of the inbound queue held in the current queue, so memory use is currentQueue + inboundQueue, which can overflow → The inbound queue needs to be implemented on file-based local storage
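The three queues described above can be sketched as follows. This is an in-memory illustration only: the class name `PartitionQueues`, its methods, and the `send` callback are hypothetical and do not come from the GoldenOrb code base.

```python
class PartitionQueues:
    """Outbound, inbound and current message queues of one partition (illustrative)."""

    def __init__(self, outbound_capacity, send):
        self.outbound_capacity = outbound_capacity
        self.send = send      # callback that delivers a batch to the receiving partition
        self.outbound = []
        self.inbound = []
        self.current = []

    def enqueue_outbound(self, message):
        # Fixed-size outbound queue: send the batch as soon as it is full.
        self.outbound.append(message)
        if len(self.outbound) >= self.outbound_capacity:
            self.flush()

    def flush(self):
        # Called when the queue is full and once more at the end of a superstep.
        if self.outbound:
            self.send(list(self.outbound))
            self.outbound.clear()

    def receive(self, batch):
        # Messages that arrive mid-superstep wait in the inbound queue.
        self.inbound.extend(batch)

    def start_superstep(self):
        # While the copy is made, both queues hold the messages: this is the
        # currentQueue + inboundQueue memory peak the slide warns about.
        self.current = list(self.inbound)
        self.inbound.clear()

p2 = PartitionQueues(2, send=lambda batch: p1.receive(batch))
p1 = PartitionQueues(2, send=lambda batch: p2.receive(batch))

p1.enqueue_outbound("m1")            # queue not full yet, nothing is sent
pending_after_first = list(p2.inbound)
p1.enqueue_outbound("m2")            # queue full, batch ["m1", "m2"] is sent
p1.enqueue_outbound("m3")
p1.flush()                           # end of superstep: send the remainder
p2.start_superstep()
print(p2.current)
```

A file-backed inbound queue, as the slide proposes, would replace the `inbound` list with spill-to-disk storage while keeping the same interface.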
  • 44. API • Subclass the predefined classes – Reader/writer/vertex/message • class Vertex { public Vertex(Class<VV> vertexValue, Class<EV> edgeValue, Class<MV> messageValue); String vertexID(); abstract void compute(Collection<MV> messages); long superStep(); void setValue(VV value); VV getValue(); Collection<Edge<EV>> getEdges(); void sendMessage(MV message); void voteToHalt(); }
  • 45. Not yet implemented • Aggregator – A mechanism for global communication, monitoring, and data • Combiner – Reduces the number of messages – E.g., if compute() sums the message values, a combiner can compute the sum and transmit a single message • Topology mutation – Remove or add vertices/edges • Fault recovery
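The combiner idea can be illustrated with a small sketch: when compute() only needs the sum of its incoming messages, messages bound for the same vertex can be merged before they are transmitted. The function name `combine_sum` and the sample messages are assumptions for illustration.

```python
from collections import defaultdict

def combine_sum(outgoing):
    """Collapse all messages bound for the same target vertex into a single sum."""
    combined = defaultdict(int)
    for target, value in outgoing:
        combined[target] += value
    return sorted(combined.items())

# Five queued messages (target vertex, value) become two after combining.
outgoing = [("B", 3), ("C", 1), ("B", 4), ("B", 5), ("C", 2)]
combined = combine_sum(outgoing)
print(combined)
```

This only works because addition is commutative and associative; a combiner for a min-based algorithm such as SSSP would keep the minimum instead.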
  • 47. Implement • Maximum Value • Single Source Shortest Path • PageRank • K-means • Mean-shift
  • 49. PageRank • PageRank is Google's way of deciding a page's importance • An important page is linked to by many pages with high PageRank • PR(A) = PR(v1)/L(v1) + … + PR(vn)/L(vn), where v1, …, vn are the pages that link to A and L(v) is the number of outbound links of v • Add a damping factor d • PR(A) = (1-d) + d ∑ PR(v)/L(v) • Repeat until converged
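A minimal sketch of the damped update PR(A) = (1-d) + d ∑ PR(v)/L(v), iterated until the ranks stop changing. The link graph, the tolerance, and the function name `pagerank` are assumptions; the sketch also assumes every page has at least one outbound link, since a page with none would cause a division by zero.

```python
def pagerank(links, d=0.85, tol=1e-9, max_iter=1000):
    """links: page -> list of pages it links to (each list must be non-empty)."""
    ranks = {page: 1.0 for page in links}
    for _ in range(max_iter):
        new_ranks = {page: 1.0 - d for page in links}
        for page, targets in links.items():
            share = ranks[page] / len(targets)   # PR(v) / L(v)
            for target in targets:
                new_ranks[target] += d * share
        if max(abs(new_ranks[p] - ranks[p]) for p in links) < tol:
            return new_ranks
        ranks = new_ranks
    return ranks

# Hypothetical three-page web: A links to B and C, B links to C, C links to A.
LINKS = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(LINKS)
print(ranks)
```

In a Pregel-style implementation each vertex would send PR(v)/L(v) along its outgoing edges and sum what it receives in the next superstep.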
  • 52. K-means • N observations are partitioned into k clusters • Each observation belongs to the cluster with the nearest mean • [Flowchart: start → number of clusters K → calculate centroids → distance of objects to centroids → grouping based on minimum distance → no object moved group? YES: end; NO: back to calculating centroids]
  • 53. K-means • Message includes cluster ID and value • Every superstep, a vertex sends a message to all vertices • [Figure: vertices A to F; point values 1, 2, 3, 100, 101, 102; seed1 and seed2 marked] • Cluster assignment per step (A B C D E F): S0: C1 C2 - - - -
  • 54. K-means • Centroid1 = Value(A), Centroid2 = Value(B) • [Figure: vertices A to F; point values 1, 2, 3, 100, 101, 102; seed1 and seed2 marked] • Cluster assignment per step (A B C D E F): S0: C1 C2 - - - -; S1: C1 C2 C1 C1 C1 C1
  • 55. K-means • Centroid1 = Mean(A,C,D,E,F), Centroid2 = Mean(B) • [Figure: vertices A to F; point values 1, 2, 3, 100, 101, 102] • Cluster assignment per step (A B C D E F): S0: C1 C2 - - - -; S1: C1 C2 C1 C1 C1 C1; S2: C2 C2 C2 C1 C1 C1
  • 56. K-means • Centroid1 = Mean(D,E,F), Centroid2 = Mean(A,B,C) • [Figure: vertices A to F; point values 1, 2, 3, 100, 101, 102] • Cluster assignment per step (A B C D E F): S0: C1 C2 - - - -; S1: C1 C2 C1 C1 C1 C1; S2: C2 C2 C2 C1 C1 C1; S3: C2 C2 C2 C1 C1 C1 • If the centroids have converged, quit the process!
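The assign-then-recompute loop can be sketched on the six point values from the figure. This is a plain single-process k-means, not the vertex-messaging implementation; the choice of seeds (the first two values) and the function name `kmeans_1d` are assumptions, so the intermediate assignments and cluster labels need not match the table on the slides.

```python
def kmeans_1d(points, seeds, max_iter=100):
    """Cluster 1-D points; returns (cluster index per point, final centroids)."""
    centroids = list(seeds)
    assignment = None
    for _ in range(max_iter):
        # Grouping based on minimum distance to the current centroids.
        new_assignment = [
            min(range(len(centroids)), key=lambda c: abs(p - centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no object moved group: converged
        assignment = new_assignment
        # Recalculate each centroid as the mean of its members.
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = sum(members) / len(members)
    return assignment, centroids

POINTS = [1, 2, 3, 100, 101, 102]
assignment, centroids = kmeans_1d(POINTS, seeds=[1.0, 2.0])
print(assignment, centroids)
```

As on the slides, the low values end up in one cluster and the high values in the other, even though both seeds start at the low end.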
  • 58. K-means • [Figure: <input file> and <output file>] • N: number of vertices • Each superstep, N×N messages are exchanged → O(N²): needs too much memory!
  • 60. Giraph • The ASF (Apache Software Foundation)'s open-source version of Google's Pregel • Implemented in Java • Apache Incubator • Requirements - Hadoop 0.20.203 or higher: runs as a map-only job in Hadoop - ZooKeeper: if none exists, the Hadoop file system is used instead of ZooKeeper
  • 61. Giraph – vertex distribution
  • 62. Giraph - usage • Users can set the checkpoint frequency – GiraphJob.getConfiguration().setInt("giraph.checkpointFrequency", 0) // 0 means no checkpoints • Users should set the ZooKeeper configuration – GiraphJob.setZookeeperConfiguration("zk-server-list");
  • 63. Giraph - Characteristics • Fault tolerance – If the master dies, a new one automatically takes over – If a worker dies, the app is rolled back to a previously checkpointed superstep – If a ZooKeeper server dies, the app can proceed as long as a quorum remains – But the Hadoop SPOF (single point of failure) still exists • Combiner/Aggregator • JSON in/out format • Easy job status monitoring (HTTP)
  • 65. Experiments • 3 servers • nPartition = nMapper = 9 • MR vs GoldenOrb vs Giraph – PageRank – K-means (Mahout) – Elapsed time, CPU, memory, disk, network
  • 66. Experiments - PageRank • Number of vertices ≈ 220,000 • Fixed iterations = 100 • Columns: elapsed time | CPU (%) | memory (KB) | network received (bytes) | network transmitted (bytes) | disk read (sec/s) | disk write (sec/s) • GoldenOrb: 1m 56s | 14.53 | 3,745,376 | 19,437 | 12,845 | 777 | 606 • Giraph: 3m 31s | 8.77 | 1,244,000 | 11,374 | 914 | 0 | 326 • MapReduce: 34m 51s | 3.75 | 3,091,239 | 13,514 | 867 | 0 | 4101
  • 67. Experiments - K-means • Number of vertices = 100,000 • Number of clusters (K) = 10 • Columns: elapsed time | CPU (%) | memory (KB) | network received (bytes) | network transmitted (bytes) | disk read (sec/s) | disk write (sec/s) • GoldenOrb: 3m 19s | 13.32 | 3,857,892 | 11,634 | 27,086 | 128 | 1151 • Giraph: 1m 49s | 6.36 | 1,245,000 | 7,810 | 1,999 | 0 | 536 • MapReduce: 11m 28s | 1.48 | 2,645,517 | 13,528 | 1,005 | 0 | 7104
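To make the elapsed-time columns easier to compare, the ratios against MapReduce can be worked out directly from the reported figures. The helper names below are illustrative; only the elapsed times come from the two tables.

```python
def to_seconds(elapsed):
    """Convert a string such as '34m 51s' to seconds."""
    minutes, seconds = elapsed.split()
    return int(minutes.rstrip("m")) * 60 + int(seconds.rstrip("s"))

# Elapsed times as reported on the PageRank and K-means slides.
ELAPSED = {
    "PageRank": {"GoldenOrb": "1m 56s", "Giraph": "3m 31s", "MapReduce": "34m 51s"},
    "K-means": {"GoldenOrb": "3m 19s", "Giraph": "1m 49s", "MapReduce": "11m 28s"},
}

def speedup_over_mapreduce(times):
    """How many times faster each graph platform was than MapReduce."""
    baseline = to_seconds(times["MapReduce"])
    return {
        name: round(baseline / to_seconds(value), 1)
        for name, value in times.items()
        if name != "MapReduce"
    }

speedups = {job: speedup_over_mapreduce(times) for job, times in ELAPSED.items()}
print(speedups)
```

On these numbers GoldenOrb is the faster of the two for PageRank and Giraph the faster for K-means, with both well ahead of MapReduce.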
  • 70. Install GoldenOrb (1) • Requirements - hadoop-0.20.2 - zookeeper-3.3.3 • Download & unzip – org.goldenorb.core-0.1.1-SNAPSHOT-distribution.zip
  • 71. Install GoldenOrb (2) • Set configuration ① ORB_HOME environment variable > export ORB_HOME=/usr/local/goldenorb ② conf/orbServers > localhost:/usr/local/goldenorb ③ conf/orb-site.xml > cp orb-site.sample.xml orb-site.xml > vi orb-site.xml ④ In distributed mode, copy to all servers • orb-site.xml: <property> <name>goldenOrb.zookeeper.quorum</name> <value>localhost</value> <description>The server running zookeeper</description> </property> …… <property> <name>fs.default.name</name> <value>hdfs://localhost:9000</value> </property> • The goldenOrb.zookeeper.quorum value is the target ZooKeeper server IP
  • 72. Install GoldenOrb (3) • Set running environment ① Start Hadoop > $HADOOP_HOME/bin/start-dfs.sh ② Start ZooKeeper > $ZK_HOME/bin/zkServer.sh start ③ Start Orb-tracker > $ORB_HOME/bin/orb-tracker.sh start ④ Check the log > cat $ORB_HOME/logs/xxx.log
  • 73. Install GoldenOrb (4) • Make input - ex) maximum value • Line format: <vertex-id> <value> <outgoing-edge-list> • [Figure: vertices A, B, C, D, E with values 0, 8, 5, 11, 7] • Input lines: A 0 B D; B 8 C D; C 11 E; D 5 B C E; E 7 A C
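A small parser for the input line format shown above may help when preparing test files. It assumes the fields are separated by whitespace and that values are integers; the slide does not state the delimiter, and `parse_vertex_line` is a hypothetical helper, not part of GoldenOrb.

```python
def parse_vertex_line(line):
    """Split '<vertex-id> <value> <outgoing-edge-list>' into its three parts."""
    fields = line.split()
    vertex_id, value, edges = fields[0], int(fields[1]), fields[2:]
    return vertex_id, value, edges

SAMPLE = """\
A 0 B D
B 8 C D
C 11 E
D 5 B C E
E 7 A C
"""

vertices = [parse_vertex_line(line) for line in SAMPLE.splitlines()]
largest = max(value for _, value, _ in vertices)
print(vertices, largest)
```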
  • 74. Install GoldenOrb (5) • Upload input files > hadoop fs -put maxvalue.txt /test/ • Run > java -cp conf/.:org.goldenorb.core-0.1.1-SNAPSHOT.jar:lib/*:yourjar.jar org.goldenOrb.algorithms.YourAlgorithm goldenOrb.orb.localFilesToDistribute=/home/user/yourjar.jar mapred.input.dir=/test/maxvaluetxt/ mapred.output.dir=/test/output goldenOrb.orb.requestedPartitions=3 goldenOrb.orb.reservedPartitions=0 goldenOrb.orb.classpaths=yourjar.jar • Result > hadoop fs -ls /test/output > hadoop fs -cat /test/output/*
  • 75. Install Giraph (1) • Requirements - hadoop-0.20.203 - zookeeper-3.3.3 - maven 3.0.3 • Download > svn checkout http://svn.apache.org/repos/asf/incubator/giraph/trunk giraph • Compile > mvn compile – Generates target/giraph-{version}-jar-with-dependencies.jar
  • 76. Install Giraph (2) • Set running environment ① > $HADOOP_HOME/bin/start-all.sh ② > $ZK_HOME/bin/zkServer.sh start • Upload input file to HDFS > hadoop fs -put test.grf /giraph/test/input/ • Input format (JSON), one vertex per line: [0,0,[[1,0]]] [1,0,[[2,100]]] [2,100,[[3,200]]] [3,300,[[4,300]]] [4,600,[[5,400]]]
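The sample lines above can be read with a standard JSON parser. The field meaning used here, [vertex ID, vertex value, [[target ID, edge value], ...]], is inferred from the sample rather than stated on the slide, and `parse_json_vertex` is a hypothetical helper.

```python
import json

def parse_json_vertex(line):
    """Parse one '[id, value, [[target, edge value], ...]]' input line."""
    vertex_id, value, edges = json.loads(line)
    return vertex_id, value, [(target, weight) for target, weight in edges]

SAMPLE_LINES = [
    "[0,0,[[1,0]]]",
    "[1,0,[[2,100]]]",
    "[2,100,[[3,200]]]",
    "[3,300,[[4,300]]]",
    "[4,600,[[5,400]]]",
]

parsed = [parse_json_vertex(line) for line in SAMPLE_LINES]
print(parsed)
```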
  • 77. Install Giraph (3) • Run - ex) shortest path algorithm > hadoop jar giraph-{version}-jar-with-dependencies.jar org.apache.giraph.examples.SimpleShortestPathsVertex <input-path> <output-path> <source-node> <number of workers> • Running status – http://localhost:50030 • Result > hadoop fs -cat /<output-path>/part-*
  • 78. Orb vs Giraph (O = supported, X = not supported) • Status monitoring: GoldenOrb = log files, Giraph = easy • Fault tolerance: GoldenOrb = X, Giraph = O • Vertex mutation, Combiner, Aggregator…: GoldenOrb = X, Giraph = O • Development environment: GoldenOrb = X, Giraph = O • I/O format: GoldenOrb = O, Giraph = X • Update: GoldenOrb = X, Giraph = O