1.
GraphLab under the hood
Zuhair Khayyat, 12/10/12
2.
GraphLab overview: GraphLab 1.0
● GraphLab: A New Framework For Parallel Machine Learning
  – High-level abstractions for machine learning problems
  – Shared-memory multiprocessor
  – Assumes no fault tolerance is needed
  – Concurrent-access processing models with sequential-consistency guarantees
3.
GraphLab overview: GraphLab 1.0
● How does GraphLab 1.0 work?
  – Represents the user's data as a directed graph
  – Each block of data is represented by a vertex and a directed edge
  – Shared data table
  – User functions:
    ● Update: modifies the vertex and edge state; read-only access to the shared table
    ● Fold: sequential aggregation into a key entry in the shared table; can modify vertex data
    ● Merge: parallelizes the Fold function
    ● Apply: finalizes the key entry in the shared table
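The Fold/Merge/Apply pattern above can be sketched as follows. This is a minimal illustration of the aggregation semantics, not GraphLab's real C++ API; all names here are assumptions for the example (computing the average vertex degree into a shared-table entry):

```python
from functools import reduce

def fold(acc, vertex_degree):
    # Fold: sequential accumulation within one partition -> (sum, count)
    return (acc[0] + vertex_degree, acc[1] + 1)

def merge(a, b):
    # Merge: combine two partial accumulators, enabling parallel folds
    return (a[0] + b[0], a[1] + b[1])

def apply_final(acc):
    # Apply: finalize the shared-table entry (here, the average degree)
    return acc[0] / acc[1]

degrees = [3, 1, 4, 1, 5, 9, 2, 6]
partitions = [degrees[:4], degrees[4:]]          # two parallel partitions
partials = [reduce(fold, part, (0, 0)) for part in partitions]
merged = reduce(merge, partials)
print(apply_final(merged))                       # 3.875
```

Because Merge combines partial results from independent folds, the runtime is free to run the folds concurrently and still produce the same final value.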
5.
GraphLab overview: Distributed GraphLab 1.0
● Distributed GraphLab: A Framework for Machine Learning and Data Mining in the Cloud
  – Fault tolerance using a snapshot algorithm
  – Improved distributed parallel processing
  – Two-stage partitioning:
    ● Atoms generated by ParMETIS
    ● Ghosts generated by the intersection of the atoms
  – Finalize() function for vertex synchronization
8.
PowerGraph: Introduction
● GraphLab 2.1
● Problems with highly skewed power-law graphs:
  – Workload imbalance ==> performance degradation
  – Limited scalability
  – Hard to partition if the graph is too large
  – Storage overhead
  – Non-parallel computation
9.
PowerGraph: New Abstraction
● Original functions:
  – Update
  – Finalize
  – Fold
  – Merge
  – Apply: the synchronization apply
● Introduces the GAS model:
  – Gather: over in-, out-, or all neighbors
  – Apply: the GAS-model apply
  – Scatter
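One GAS step can be sketched with a PageRank-style update on a single vertex. This is a simplified illustration with assumed names, not PowerGraph's real C++ interface; Gather sums contributions from in-neighbors, Apply commits the new vertex value, and Scatter (omitted here) would signal the neighbors:

```python
def gather(in_neighbors, rank, out_degree):
    # Gather phase: sum rank/out_degree over the in-neighbors
    return sum(rank[u] / out_degree[u] for u in in_neighbors)

def apply_vertex(acc, damping=0.85):
    # Apply phase: commit the new rank for this vertex
    return (1 - damping) + damping * acc

# toy graph: vertices 1 and 2 each have a single edge pointing at vertex 0
rank = {0: 1.0, 1: 1.0, 2: 1.0}
out_degree = {1: 1, 2: 1}

acc = gather([1, 2], rank, out_degree)   # 1.0/1 + 1.0/1 = 2.0
rank[0] = apply_vertex(acc)              # (1 - 0.85) + 0.85 * 2.0
print(rank[0])
```

Splitting the vertex program this way is what lets PowerGraph parallelize the Gather over the (possibly huge) edge list of a high-degree vertex.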
18.
PowerGraph: Discussion
● Isn't it similar to Pregel's model?
  – Pregel partially processes a vertex only if a message exists
● Gather, Apply and Scatter are assumed to be commutative and associative operations. What if the computation is not commutative?
  – Sum the message values in a fixed order so that the floating-point rounding error is the same on every run
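The rounding-error point above comes from the fact that floating-point addition is not associative, so summing messages in a different order can change the result. A minimal demonstration:

```python
# Floating-point addition is not associative: grouping the same three
# values differently yields different rounding errors. Fixing the
# summation order (e.g. by sender id) makes results reproducible.
a, b, c = 0.1, 0.2, 0.3
print((a + b) + c)   # 0.6000000000000001
print(a + (b + c))   # 0.6
print((a + b) + c == a + (b + c))   # False
```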
19.
PowerGraph and Mizan
● In Mizan we use partial replication
(Figure: vertices a–g partially replicated across workers W0 and W1, shown for the compute phase and the communication phase)
20.
GraphChi: Introduction
● Asynchronous disk-based version of GraphLab
● Utilizes parallel sliding windows
  – Very small number of non-sequential disk accesses
● Support for graph updates
  – Based on Kineograph, a distributed system for processing a continuous in-flow of graph updates while simultaneously running advanced graph-mining algorithms
21.
GraphChi: Graph Constraints
● The graph does not fit in memory
● A vertex, its edges, and their values fit in memory
22.
GraphChi: Disk Storage
● Compressed Sparse Row (CSR):
  – Compressed adjacency list with indexes to the edges
  – Fast access to a vertex's out-edges
● Compressed Sparse Column (CSC):
  – CSR of the transpose graph
  – Fast access to a vertex's in-edges
● Shard: stores the edge data
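The CSR layout can be sketched in a few lines. This is an in-memory toy illustration of the format, not GraphChi's on-disk code: `offsets[v]:offsets[v+1]` indexes the slice of `targets` holding vertex v's out-neighbors, so reading a vertex's out-edges is one contiguous scan:

```python
edges = [(0, 1), (0, 2), (1, 2), (2, 0)]   # toy directed graph, 3 vertices
num_vertices = 3

# build CSR: count out-degrees, prefix-sum into offsets, fill targets
offsets = [0] * (num_vertices + 1)
for src, _ in edges:
    offsets[src + 1] += 1
for v in range(num_vertices):
    offsets[v + 1] += offsets[v]

targets = [0] * len(edges)
cursor = offsets[:]                        # next free slot per source vertex
for src, dst in sorted(edges):
    targets[cursor[src]] = dst
    cursor[src] += 1

def out_neighbors(v):
    # contiguous slice -> fast sequential access to v's out-edges
    return targets[offsets[v]:offsets[v + 1]]

print(out_neighbors(0))   # [1, 2]
```

Building the same structure over the transpose graph (swap `src` and `dst`) gives the CSC view, with the analogous contiguous access to in-edges.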
23.
GraphChi: Loading the Graph
● The input graph is split into P disjoint intervals, chosen to balance the number of edges; each interval is associated with a shard
● A shard contains the data of the edges of its interval
● The subgraph is constructed as its interval is read
24.
GraphChi: Parallel Sliding Windows
● Each interval is processed in parallel
● P sequential disk accesses are required to process one interval
● The lengths of the intervals vary with the graph's degree distribution
● P × P disk accesses are required for one superstep
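The access pattern behind the P × P count can be sketched as follows. This is a simplified simulation of the bookkeeping, not GraphChi's actual I/O code: processing interval i reads its own shard in full plus one sequential window from each of the other P − 1 shards, so each interval costs P sequential reads and a full superstep costs P × P:

```python
P = 3
disk_reads = []

for interval in range(P):
    # one full, sequential read of the interval's own shard
    disk_reads.append((interval, interval, "full shard"))
    # one sliding-window read from each of the other P - 1 shards
    for shard in range(P):
        if shard != interval:
            disk_reads.append((interval, shard, "window"))

print(len(disk_reads))   # P * P = 9 sequential reads per superstep
```

Because every read is sequential, the total I/O cost stays low even though the count grows quadratically in P, and P is typically small.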
25.
GraphChi: Example
(Figure: executing interval (1,2) over the intervals (1,2), (3,4), (5,6))
26.
GraphChi: Example
(Figure: executing interval (3,4) over the intervals (1,2), (3,4), (5,6))
28.
GraphChi: Evolving Graphs
● Adding an edge is reflected in the intervals and shards when they are read
● Deleting an edge causes that edge to be ignored
● Edge additions and deletions are applied after the current interval has been processed