GraphChi big graph processing

GraphChi (Michael Leznik, Head of BI - London, King)

GraphChi, a disk-based system for computing efficiently on graphs with billions of edges. By using a well-known method to break large graphs into small parts, and a novel parallel sliding windows method, GraphChi is able to execute several advanced data mining, graph mining, and machine learning algorithms on very large graphs, using just a single consumer-level computer.


Transcript of "GraphChi big graph processing"

  1. Big Graph on a Small Computer. Hadoop meeting, Michael Leznik, King.Com. Inspired by Aapo Kyrola
  2. What Is GraphChi? • GraphChi is a disk-based system for computing efficiently on graphs with billions of edges. • GraphChi can compute on the full Twitter follow-graph with just a standard laptop, roughly as fast as a very large Hadoop cluster!
  3. Why Does One Need It? • To use existing graph frameworks, one is faced with the challenge of partitioning the graph across cluster nodes. Finding efficient graph cuts that minimize communication between nodes, and that are also balanced, is a hard problem. More generally, distributed systems and their users must deal with cluster management, fault tolerance, and often unpredictable performance. From the programmer's perspective, debugging and optimizing distributed algorithms is hard.
  4. Would it be possible to do advanced graph computation on just a personal computer?
  5. Aapo Kyrola, PhD Candidate at Carnegie Mellon University
  6. How Does It Work • By using a well-known method to break large graphs into small parts, and a novel parallel sliding windows method, GraphChi is able to execute several advanced data mining, graph mining, and machine learning algorithms on very large graphs, using just a single consumer-level computer.
  7. What Is the Parallel Sliding Window? • Graph G = (V, E), where V is the set of vertices and E the set of edges. • Each edge and vertex is associated with a user-defined value of some type (Integer, Double, etc.). • An edge e = (source, destination) ∈ E; for an edge e from vertex A to vertex B, e is an out-edge of A and an in-edge of B.
  8. Vertex-Centric Computation • Algorithm: vertex update function:
     Update(vertex) begin
       x ← read values of in- and out-edges of vertex
       vertex.value ← f(x)
       foreach edge of vertex do
         edge.value ← g(vertex.value, edge.value)
       end
     end
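The update function above can be sketched in plain Python. The `Vertex` and `Edge` classes and the aggregation functions `f` and `g` here are illustrative stand-ins, not GraphChi's actual API: `f` turns the neighboring edge values into a new vertex value, and `g` pushes the vertex value back onto each edge.

```python
# Minimal sketch of a vertex-centric update (assumed toy classes,
# not GraphChi's API). f aggregates edge values into a new vertex
# value; g writes the vertex value back onto each edge.

class Edge:
    def __init__(self, value=0.0):
        self.value = value

class Vertex:
    def __init__(self, edges):
        self.value = 0.0
        self.edges = edges  # both in- and out-edges

def update_vertex(vertex, f, g):
    x = [e.value for e in vertex.edges]   # read in- and out-edge values
    vertex.value = f(x)                   # recompute the vertex value
    for e in vertex.edges:                # write back to every edge
        e.value = g(vertex.value, e.value)

# Example: vertex value = sum of edge values; each edge gets half of it
v = Vertex([Edge(1.0), Edge(3.0)])
update_vertex(v, sum, lambda vv, ev: vv / 2)
print(v.value)           # 4.0
print(v.edges[0].value)  # 2.0
```

In GraphChi this function is the only code the user writes; the framework decides when and where each vertex is updated.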
  9. Graphs and Subgraphs • Under the PSW method, the vertices V of graph G = (V, E) are split into P disjoint intervals. With each interval we associate a shard, which stores all the edges that have their destination in the interval. Edges are stored in order of their source. Intervals are chosen to balance the number of edges in each shard; the number of intervals, P, is chosen so that any one shard can be loaded completely into memory.
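The interval-selection rule above (balance the number of edges per shard) can be sketched as a single greedy pass over the per-vertex in-degree counts. This is a simplified illustration under assumed inputs, not GraphChi's sharder.

```python
# Greedily split vertices 0..n-1 into P contiguous intervals with
# roughly equal in-edge counts per shard. in_degrees[i] is the number
# of in-edges of vertex i. Illustrative only.

def split_intervals(in_degrees, P):
    total = sum(in_degrees)
    target = total / P                 # desired edges per shard
    intervals, start, acc = [], 0, 0
    for v, d in enumerate(in_degrees):
        acc += d
        # close the current interval once it holds ~target edges
        if acc >= target and len(intervals) < P - 1:
            intervals.append((start, v))
            start, acc = v + 1, 0
    intervals.append((start, len(in_degrees) - 1))  # last interval
    return intervals

print(split_intervals([4, 1, 1, 2, 4, 2, 2], 3))
# → [(0, 2), (3, 4), (5, 6)]  (shards of 6, 6, and 4 in-edges)
```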
  10. Graphs and Subgraphs [Figure: the vertex set is split into interval(1), interval(2), …, interval(P), each associated with shard(1), shard(2), …, shard(P)]
  11. Graphs and Subgraphs [Figure: four vertex intervals, 1..100, 101..700, 701..1000, and 1001..10000, mapped to Shards 1-4] A shard stores the in-edges for its interval of vertices, sorted by source id. Shards are small enough to fit in memory, and shard sizes are balanced.
  12. Parallel Sliding Windows Visualization of the stages of one iteration of the Parallel Sliding Windows method. In this example, vertices are divided into four intervals, each associated with a shard. The computation proceeds by constructing a subgraph of vertices one interval at a time. In-edges for the vertices are read from the memory-shard (in dark color), while out-edges are read from each of the sliding shards. The current sliding window is pictured on top of each shard.
  13. Parallel Sliding Windows
     foreach iteration do
       shards ← InitializeShards(P)
       for interval ← 1 to P do
         subgraph ← LoadSubGraph(interval)
         parallel foreach vertex ∈ subgraph.vertex do
           UDF_updateVertex(vertex)
         end
         for s ∈ {1, …, P}, s ≠ interval do
           shards[s].UpdateLastWindowToDisk()
         end
       end
     end
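The loop above can be sketched as a skeleton with the disk machinery abstracted behind three callbacks. In GraphChi each shard is a file and "flush" sequentially rewrites only the modified sliding window; all names here are illustrative, not GraphChi's API.

```python
# Skeleton of one PSW iteration over P intervals. load_subgraph reads
# the memory shard plus one window from each other shard; flush_window
# writes a modified window back to disk. Illustrative only.

def run_iteration(P, load_subgraph, update_vertex, flush_window):
    for interval in range(P):
        subgraph = load_subgraph(interval)  # memory shard + P-1 windows
        for vertex in subgraph:             # parallel in GraphChi
            update_vertex(vertex)
        for s in range(P):
            if s != interval:
                flush_window(s)             # write back modified window

# Trivial usage: record the order of window flushes for P = 3
flushes = []
run_iteration(3, lambda i: [], lambda v: None, flushes.append)
print(flushes)  # [1, 2, 0, 2, 0, 1]
```

Note how every shard other than the current memory-shard is flushed once per interval, which is what keeps the disk access pattern sequential.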
  14. Parallel Sliding Windows (LoadSubGraph)
     Input: interval index number p
     Output: subgraph of the vertices in interval p
       a ← interval(p).start
       b ← interval(p).end
       G ← InitializeSubgraph(a, b)
  15. Graph Data Format
• Adjacency List Format (cannot carry edge values):
  src1 3 dst1 dst2 dst3
  src2 2 dst4 dst6
• Edge List Format:
  src1 dst1 value1
  src2 dst2 value2
  src3 dst3 value3
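A minimal reader for the edge-list format shown above (whitespace-separated `src dst value` triples) might look like this. This is an illustrative parser, not GraphChi's loader, which builds shards directly during preprocessing.

```python
# Parse edge-list lines of the form "src dst value" into tuples.
# Illustrative only; GraphChi's own sharder handles this on disk.

def parse_edge_list(lines):
    edges = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        src, dst, value = line.split()
        edges.append((int(src), int(dst), float(value)))
    return edges

sample = ["1 2 0.5", "2 3 1.0", "", "# comment", "3 1 0.25"]
print(parse_edge_list(sample))
# → [(1, 2, 0.5), (2, 3, 1.0), (3, 1, 0.25)]
```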
  16. Performance • Laptop: 16 GB RAM, Intel Core i7 @ 2.7 GHz (4 cores), 160 GB SSD • Algorithm: Connected Components • Graph size: 7,294,535 vertices (1.4 GB) • Creating shards: 61.45 sec • Analysing graph: 177.24 sec
  17. Experiment Setting • Mac Mini (Apple Inc.): 8 GB RAM, 256 GB SSD + 1 TB hard drive, Intel Core i5 @ 2.5 GHz • Experiment graphs:
     Graph         Vertices  Edges  P (shards)  Preprocessing
     live-journal  4.8M      69M     3          0.5 min
     netflix       0.5M      99M    20          1 min
     twitter-2010  42M       1.5B   20          2 min
     uk-2007-05    106M      3.7B   40          31 min
     uk-union      133M      5.4B   50          33 min
     yahoo-web     1.4B      6.6B   50          37 min
  18. Comparison to Existing Systems Notes: comparison results do not include the time to transfer the data to the cluster, preprocessing, or the time to load the graph from disk. GraphChi computes asynchronously, while all systems but GraphLab compute synchronously. [Charts, runtimes in minutes: Matrix Factorization (Alternating Least Squares) on Netflix (99M edges), GraphLab v1 (8 cores) vs. GraphChi (Mac Mini); PageRank on twitter-2010 (1.5B edges), Spark (50 machines) vs. GraphChi; WebGraph Belief Propagation (U Kang et al.) on yahoo-web (6.7B edges), Pegasus / Hadoop (100 machines) vs. GraphChi; Triangle Counting on twitter-2010 (1.5B edges), Hadoop (1636 machines) vs. GraphChi]
  19. PowerGraph Comparison • PowerGraph / GraphLab 2 outperforms previous systems by a wide margin on natural graphs. • With 64 machines (512 CPUs) vs. GraphChi: – PageRank: 40x faster than GraphChi – Triangle counting: 30x faster than GraphChi • GraphChi nevertheless has state-of-the-art performance per CPU.
  20. Conclusion • The Parallel Sliding Windows algorithm enables processing of large graphs with very few non-sequential disk accesses. • For systems researchers, GraphChi is a solid baseline for system evaluation: it can solve problems as large as those distributed systems solve. • Takeaway: appropriate data structures as an alternative to scaling up. Source code and examples: http://graphchi.org License: Apache 2.0