2013.09.10 Giraph at London Hadoop Users Group

Speaker notes
  • No internal FB repo; everyone is a committer. BSP: a global epoch, in which components do concurrent computation and send messages, followed by a global barrier. Graphs are sparse.
  • Giraph runs as a map-only Hadoop job.
  • The code is real, checked into Giraph. All vertices find the maximum value in a strongly connected graph.
  • One active master, with spare masters taking over in the event of an active master failure. All active master state is stored in ZooKeeper so that a spare master can immediately step in when an active master fails. The “active” master is implemented as a queue in ZooKeeper. A single worker failure causes the superstep to fail, and the application reverts to the last committed superstep automatically. The master detects worker failure during any superstep with a ZooKeeper “health” znode, chooses the last committed superstep, and sends a command through ZooKeeper for all workers to restart from that superstep.
  • Primitive collections are primitive. Lots of boxing/unboxing of types. An object and a reference for each instance.
  • There are also other implementations, like Map<I,E> for edges, which use more space but are better for lots of mutations. Realistically, FB-sized graphs need even bigger. Edges are not uniform in reality; some vertices are much larger than others.
  • Dangerous, non-portable, volatile. Oracle JVM only. No formal API. Can allocate non-GC memory, inherit from String (a final class), and directly access memory (C pointer casts).
  • Cluster open source projects. Histograms. Job metrics.
  • Start sending messages early and overlap sending with computation. Tune message buffer sizes to reduce wait time.
  • First things first – what’s going on with the system? We want a debugger, but don’t have one. Use YourKit’s API to create granular snapshots within the application. jmap has binding errors – spawn it from within the process.
  • With byte[], any mutation requires full deserialization / re-serialization.
  • Transcript

    • 1. Scaling Apache Giraph Nitay Joffe, Data Infrastructure Engineer nitay@apache.org @nitayj September 10, 2013
    • 2. Agenda 1 Background 2 Scaling 3 Results 4 Questions
    • 3. Background
    • 4. What is Giraph? • Apache open source graph computation engine based on Google’s Pregel. • Support for Hadoop, Hive, HBase, and Accumulo. • BSP model with a simple “think like a vertex” API. • Combiners, Aggregators, Mutability, and more. • Configurable Graph<I,V,E,M>, where each type implements Writable: – I: Vertex ID – V: Vertex Value – E: Edge Value – M: Message data What is Giraph NOT? • A graph database. See Neo4j. • A completely asynchronous generic MPI system. • A slow tool.
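
    The Graph<I,V,E,M> parameterization maps directly onto a computation class. A minimal sketch, assuming the Giraph 1.x BasicComputation API; the concrete Writable types and the class name are illustrative choices, not mandated by Giraph:

        import java.io.IOException;
        import org.apache.giraph.graph.BasicComputation;
        import org.apache.giraph.graph.Vertex;
        import org.apache.hadoop.io.DoubleWritable;
        import org.apache.hadoop.io.FloatWritable;
        import org.apache.hadoop.io.LongWritable;

        public class NoOpComputation extends BasicComputation<
            LongWritable,    // I: vertex ID
            DoubleWritable,  // V: vertex value
            FloatWritable,   // E: edge value
            DoubleWritable> {// M: message data
          @Override
          public void compute(Vertex<LongWritable, DoubleWritable, FloatWritable> vertex,
              Iterable<DoubleWritable> messages) throws IOException {
            vertex.voteToHalt(); // no-op: this class only shows the type plumbing
          }
        }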
    • 5. Why not Hive? [Diagram: each iteration reads from the input format, runs map tasks, spills intermediate files, runs reduce tasks, and writes through the output format; the next iteration starts over from that output] • Too much disk. Limited in-memory caching. • Each iteration becomes a MapReduce job!
    • 6. Giraph components Master – Application coordinator • Synchronizes supersteps • Assigns partitions to workers before superstep begins Workers – Computation & messaging • Handle I/O – reading and writing the graph • Computation/messaging of assigned partitions ZooKeeper • Maintains global application state
    • 7. Giraph Dataflow [Diagram, three stages: (1) Loading the graph: the master assigns input splits, and workers load them through the input format and send graph partitions to their owners. (2) Compute/Iterate: workers compute their partitions of the in-memory graph and send messages, then send stats to the master and iterate. (3) Storing the graph: workers write their partitions through the output format]
    • 8. Giraph Job Lifetime [Diagram: input, then compute supersteps repeat until all vertices have halted and the master halts, then output. Vertex lifecycle: a vertex stays active until it votes to halt, and becomes active again when it receives a message]
    • 9. Simple Example – Compute the maximum value [Diagram: two processors exchange vertex values over successive supersteps until every vertex holds the global maximum, 5] Related: Connected Components, e.g. finding communities
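
    Per the speaker notes, the real version of this example is checked into Giraph. A hedged sketch of what it looks like against the 1.x API (class name illustrative): every vertex announces its value, adopts any larger value it hears, and re-broadcasts only on change, so the computation quiesces once the maximum has spread.

        import java.io.IOException;
        import org.apache.giraph.graph.BasicComputation;
        import org.apache.giraph.graph.Vertex;
        import org.apache.hadoop.io.DoubleWritable;
        import org.apache.hadoop.io.LongWritable;
        import org.apache.hadoop.io.NullWritable;

        public class MaxValueComputation extends BasicComputation<
            LongWritable, DoubleWritable, NullWritable, DoubleWritable> {
          @Override
          public void compute(Vertex<LongWritable, DoubleWritable, NullWritable> vertex,
              Iterable<DoubleWritable> messages) throws IOException {
            boolean changed = getSuperstep() == 0; // everyone announces in superstep 0
            for (DoubleWritable message : messages) {
              if (message.get() > vertex.getValue().get()) {
                vertex.setValue(new DoubleWritable(message.get()));
                changed = true;
              }
            }
            if (changed) {
              sendMessageToAllEdges(vertex, vertex.getValue()); // propagate new max
            }
            vertex.voteToHalt(); // reactivated automatically if a message arrives
          }
        }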
    • 10. PageRank – ranking websites Mahout (Hadoop) 854 lines Giraph < 30 lines • Send neighbors an equal fraction of your page rank • New page rank = 0.15 / (# of vertices) + 0.85 * (messages sum)
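
    The two bullets above are essentially the whole algorithm. A sketch of what the sub-30-line Giraph version can look like, assuming the 1.x API; MAX_SUPERSTEPS and the class name are illustrative:

        import java.io.IOException;
        import org.apache.giraph.graph.BasicComputation;
        import org.apache.giraph.graph.Vertex;
        import org.apache.hadoop.io.DoubleWritable;
        import org.apache.hadoop.io.LongWritable;
        import org.apache.hadoop.io.NullWritable;

        public class PageRankComputation extends BasicComputation<
            LongWritable, DoubleWritable, NullWritable, DoubleWritable> {
          public static final int MAX_SUPERSTEPS = 30; // illustrative cutoff

          @Override
          public void compute(Vertex<LongWritable, DoubleWritable, NullWritable> vertex,
              Iterable<DoubleWritable> messages) throws IOException {
            if (getSuperstep() >= 1) {
              double sum = 0;
              for (DoubleWritable message : messages) {
                sum += message.get();
              }
              // New page rank = 0.15 / (# of vertices) + 0.85 * (messages sum)
              vertex.setValue(new DoubleWritable(
                  0.15 / getTotalNumVertices() + 0.85 * sum));
            }
            if (getSuperstep() < MAX_SUPERSTEPS) {
              // Send neighbors an equal fraction of your page rank
              sendMessageToAllEdges(vertex, new DoubleWritable(
                  vertex.getValue().get() / vertex.getNumEdges()));
            } else {
              vertex.voteToHalt();
            }
          }
        }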
    • 11. Scaling
    • 12. Problem: Worker Crash. Solution: Checkpointing. [Timeline diagram: supersteps are checkpointed periodically; when a worker fails in an uncheckpointed superstep, the application rolls back to the last checkpoint (e.g. superstep i+1) and replays forward until the application completes]
    • 13. Problem: Master Crash. Solution: ZooKeeper Master Queue. [Diagram: all active master state lives in ZooKeeper. Before failure, “active” Master 0 runs with “spare” Masters 1 and 2 standing by; after Master 0 fails, Master 1 becomes the active master]
    • 14. Problem: Primitive Collections. • Graphs are often parameterized with primitive types (long IDs, double values). • Boxing/unboxing costs; objects carry internal overhead. Solution: Use fastutil, e.g. Long2DoubleOpenHashMap. fastutil extends the Java Collections Framework by providing type-specific maps, sets, lists and queues with a small memory footprint and fast access and insertion. [Diagrams: small example graphs for Single Source Shortest Path, Network Flow, and Count In-Degree, with long vertex IDs and double edge values]
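
    A small self-contained example of the fastutil map named on the slide; the values and accumulation pattern here are illustrative:

        import it.unimi.dsi.fastutil.longs.Long2DoubleOpenHashMap;

        public class PrimitiveMapDemo {
          public static void main(String[] args) {
            // One flat hash table of longs and doubles -- no Long/Double boxes
            // and no per-entry node objects, unlike HashMap<Long, Double>.
            Long2DoubleOpenHashMap ranks = new Long2DoubleOpenHashMap();
            ranks.defaultReturnValue(0.0); // value returned for missing keys

            ranks.put(42L, 1.2);     // primitive overloads: no autoboxing
            ranks.addTo(42L, 0.5);   // in-place accumulate, handy for rank sums
            double r = ranks.get(42L);
            System.out.println(r);   // 1.7
          }
        }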
    • 15. Problem: Too many objects. Lots of time spent in GC. Graph: 1B vertices, 200B edges, 200 workers. • 1B edges per worker, 1 object per edge value: List<Edge<I, E>> → ~10B objects. • 5M vertices per worker, 10 objects per vertex value: Map<I, Vertex<I, V, E>> → ~50M objects. • 1 message per edge, 10 objects per message data: Map<I, List<M>> → ~10B objects. • Objects used ~= O(E*e + V*v + M*m) => O(E*e) [Diagram: label propagation example, “Who’s sleeping?”, with vertex labels Boring / Amazing / Confusing and edge weights]
    • 16. Problem: Too many objects. Lots of time spent in GC. Solution: byte[] • Serialize messages, edges, and vertices. • Iterable interface with a representative object: each next() deserializes the next item into the same reused instance. Objects per worker ~= O(V) [Diagram: the same label propagation example as the previous slide]
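
    A sketch of the representative-object pattern under simple assumptions: edges packed as fixed-width (long, double) pairs, with ByteArrayEdges and EdgeView as hypothetical names rather than Giraph's own classes. Iteration allocates one view object total instead of one object per edge.

        import java.io.ByteArrayInputStream;
        import java.io.DataInputStream;
        import java.io.IOException;
        import java.util.Iterator;

        public class ByteArrayEdges implements Iterable<ByteArrayEdges.EdgeView> {
          /** Mutable view reused for every edge -- the representative object. */
          public static class EdgeView {
            public long targetId;
            public double value;
          }

          private final byte[] data;   // serialized (long, double) pairs
          private final int edgeCount;

          public ByteArrayEdges(byte[] data, int edgeCount) {
            this.data = data;
            this.edgeCount = edgeCount;
          }

          @Override
          public Iterator<EdgeView> iterator() {
            final DataInputStream in =
                new DataInputStream(new ByteArrayInputStream(data));
            final EdgeView view = new EdgeView(); // single shared instance
            return new Iterator<EdgeView>() {
              private int read = 0;
              @Override public boolean hasNext() { return read < edgeCount; }
              @Override public EdgeView next() {
                try {
                  view.targetId = in.readLong(); // deserialize in place...
                  view.value = in.readDouble();  // ...into the shared view
                } catch (IOException e) {
                  throw new IllegalStateException(e);
                }
                read++;
                return view; // same object every call -- copy it to keep it
              }
              @Override public void remove() {
                throw new UnsupportedOperationException();
              }
            };
          }
        }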
    • 17. Problem: Serialization of byte[] • DataInput? Kryo? Custom? Solution: Unsafe • Dangerous. No formal API. Volatile. Non-portable (Oracle JVM only). • AWESOME. As fast as it gets. • True native. Essentially C: *(long*)(data+offset);
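
    A minimal demonstration of the *(long*)(data+offset) idea via sun.misc.Unsafe, assuming a HotSpot/Oracle JVM of that era; newer JDKs restrict or deprecate this access:

        import java.lang.reflect.Field;
        import sun.misc.Unsafe;

        public class UnsafeRead {
          public static void main(String[] args) throws Exception {
            // Grab the Unsafe singleton via reflection -- there is no
            // supported public accessor, which is part of the danger.
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            Unsafe unsafe = (Unsafe) f.get(null);

            byte[] data = new byte[16];
            long base = Unsafe.ARRAY_BYTE_BASE_OFFSET;

            // Write and read a long directly inside the byte[]:
            // effectively *(long*)(data + offset), with no stream
            // abstraction and no bounds checks.
            unsafe.putLong(data, base + 8, 42L);
            long value = unsafe.getLong(data, base + 8);
            System.out.println(value); // 42
          }
        }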
    • 18. Problem: Large Aggregations. Solution: Sharded Aggregators. [Diagram: instead of the master owning every aggregator, each aggregator is owned by one worker; workers send partial values to the owner, owners exchange the combined values with the master and then distribute the final values back to all workers] Example: K-Means Clustering, e.g. finding similar emails
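
    Sharding changes where aggregation happens, not how applications use it. A hedged sketch of registration and use, assuming the Giraph 1.x aggregator API; SUM_AGG and the class name are illustrative:

        import org.apache.giraph.aggregators.LongSumAggregator;
        import org.apache.giraph.master.DefaultMasterCompute;

        /** Master that registers a sum aggregator all workers write into. */
        public class SumMasterCompute extends DefaultMasterCompute {
          public static final String SUM_AGG = "sum"; // illustrative name

          @Override
          public void initialize()
              throws InstantiationException, IllegalAccessException {
            registerAggregator(SUM_AGG, LongSumAggregator.class);
          }
        }

        // Inside any Computation#compute(...):
        //   aggregate(SUM_AGG, new LongWritable(1));       // vertex contributes
        // In the next superstep, read the combined total:
        //   LongWritable total = getAggregatedValue(SUM_AGG);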
    • 19. Problem: Network Wait. • RPC doesn’t fit the model. • Synchronous calls are no good. Solution: Netty. Tune queue sizes & threads. [Diagram: superstep timelines between barriers; overlapping network traffic with computation and reducing the time to first message shrinks the wait between end of compute and end of superstep]
    • 20. Results
    • 21. Scalability Graphs [Two plots of iteration time (sec), 0–450: one vs. number of workers, 50–300 (2B vertices, 200B edges, 20 compute threads, “Increasing Workers”); one vs. number of edges, 1e9 to ~1e11 (50 workers, 20 compute threads, “Increasing Data Size”)]
    • 22. Lessons Learned • Coordinating is a zoo. Be resilient with ZooKeeper. • Efficient networking is hard. Let Netty help. • Primitive collections, primitive performance. Use fastutil. • byte[] is simple yet powerful. • Being Unsafe can be a good thing. • Have a graph? Use Giraph.
    • 23. What’s the final result? Comparison with Hive: • 20x CPU speedup • 100x Elapsed time speedup. 15 hours => 9 minutes. Computations on entire Facebook graph no longer “weekend jobs”. Now they’re coffee breaks.
    • 24. Questions?
    • 25. Problem: Measurements. • Need tools to gain visibility into the system. • Problems with connecting to Hadoop sub-processes. Solution: Do it all. • YourKit – see YourKitProfiler • jmap – see JMapHistoDumper • VisualVM – with jstatd & ssh socks proxy • Yammer Metrics • Hadoop Counters • Logging & GC prints
    • 26. Problem: Mutations • Synchronization. • Load balancing. Solution: Reshuffle resources • Mutations handled at barrier between supersteps. • Master rebalances vertex assignments to optimize distribution. • Handle mutations in batches. • Avoid if using byte[]. • Favor algorithms which don’t mutate graph.
