Scaling Apache Giraph
Nitay Joffe, Data Infrastructure Engineer
nitay@apache.org
June 3, 2013

Agenda
1 Background
2 Scaling
3 Results
4 Questions

Background
What is Giraph?
• Apache open source graph computation engine based on Google's Pregel.
• Support for Hadoop, Hive, HBase, and Accumulo.
• BSP model with a simple "think like a vertex" API.
• Combiners, Aggregators, Mutability, and more.
• Configurable Graph<I, V, E, M>, where each type implements Writable:
  – I: Vertex ID
  – V: Vertex Value
  – E: Edge Value
  – M: Message data
What is Giraph NOT?
• A graph database. See Neo4j.
• A completely asynchronous generic MPI system.
• A slow tool.

Why not Hive?
[Diagram: MapReduce dataflow — input format, map tasks, intermediate files, reduce tasks, output format; each iteration loops back through the inputs.]
• Too much disk. Limited in-memory caching.
• Each iteration becomes a MapReduce job!

Giraph components
Master – Application coordinator
• Synchronizes supersteps
• Assigns partitions to workers before a superstep begins
Workers – Computation & messaging
• Handle I/O – reading and writing the graph
• Computation/messaging of assigned partitions
ZooKeeper
• Maintains global application state

Giraph Dataflow
[Diagram: three phases.]
1 Loading the graph – input format splits are loaded and sent to the workers.
2 Compute/Iterate – workers compute on their in-memory graph partitions and send messages, then send stats to the master and iterate.
3 Storing the graph – partitions are written out through the output format.

Giraph Job Lifetime
[Diagram: Input → Compute Superstep → (all vertices halted? master halted?) → Output.
Vertex lifecycle: Active ↔ Inactive — a vertex votes to halt to become inactive and is reactivated by a received message.]

Simple Example – Compute the maximum value
[Diagram: two processors exchange vertex values over supersteps until every vertex holds the maximum value, 5.]
Connected Components, e.g. Finding Communities

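The max-value example above can be sketched as a plain-Java simulation of the vertex-centric model (illustrative only; this is not Giraph's actual API):

```java
import java.util.Arrays;

// Sketch of synchronous "think like a vertex" rounds computing the maximum
// value in a connected graph. Each vertex reads the values its neighbors sent
// in the previous superstep and adopts the largest; the loop stops when no
// vertex changes (everyone has effectively voted to halt).
public class MaxValueDemo {
    public static long[] run(long[] values, int[][] neighbors) {
        long[] current = values.clone();
        boolean changed = true;
        while (changed) {                       // one loop iteration = one superstep
            changed = false;
            long[] next = current.clone();
            for (int v = 0; v < current.length; v++) {
                for (int n : neighbors[v]) {    // "message" from neighbor n
                    if (current[n] > next[v]) {
                        next[v] = current[n];
                        changed = true;
                    }
                }
            }
            current = next;
        }
        return current;
    }

    public static void main(String[] args) {
        // Chain 5 - 1 - 2: every vertex converges to 5.
        long[] result = run(new long[]{5, 1, 2}, new int[][]{{1}, {0, 2}, {1}});
        System.out.println(Arrays.toString(result)); // [5, 5, 5]
    }
}
```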
PageRank – ranking websites
Mahout (Hadoop): 854 lines
Giraph: < 30 lines
• Send neighbors an equal fraction of your page rank
• New page rank = 0.15 / (# of vertices) + 0.85 * (sum of messages)

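The update rule on the slide, as a one-line helper (the names `newRank`, `numVertices`, and `messageSum` are illustrative, not Giraph's API):

```java
// Sketch of the per-vertex PageRank update from the slide:
// new rank = 0.15 / numVertices + 0.85 * (sum of messages),
// where each neighbor previously sent (its rank / its out-degree).
public class PageRankStep {
    public static double newRank(long numVertices, double messageSum) {
        return 0.15 / numVertices + 0.85 * messageSum;
    }

    public static void main(String[] args) {
        // 4 vertices, uniform initial rank 0.25; a vertex receiving the full
        // rank of a single out-degree-1 neighbor keeps rank 0.25:
        System.out.println(newRank(4, 0.25)); // 0.25
    }
}
```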
Scaling
Problem: Worker Crash.
[Diagram: supersteps i through i+3, with a checkpoint written every other superstep. On a worker failure, the application restarts from the last completed checkpoint; a failure after a checkpoint completes loses only the uncheckpointed supersteps.]
Solution: Checkpointing.

Problem: Master Crash.
[Diagram: "active" master 0 with "spare" masters 1 and 2; all active-master state lives in ZooKeeper. After master 0 fails, spare master 1 becomes the active master.]
Solution: ZooKeeper Master Queue.

Problem: Primitive Collections.
• Graphs often parameterized with primitive types.
• Boxing/unboxing. Objects have internal overhead.
Solution: Use fastutil, e.g. Long2DoubleOpenHashMap.
"fastutil extends the Java™ Collections Framework by providing type-specific maps, sets, lists and queues with a small memory footprint and fast access and insertion."
[Diagrams: example graphs for Single Source Shortest Paths, Network Flow, and Count In-Degree.]

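To see why type-specific collections win, here is a minimal hand-rolled long→double open hash map in the spirit of fastutil's Long2DoubleOpenHashMap (a sketch, not fastutil's actual code; it assumes the table never fills and omits resizing and removal):

```java
// Keys and values live in flat primitive arrays, so there is no Long/Double
// boxing and no per-entry node object — the idea behind fastutil's maps.
public class LongDoubleMap {
    private final long[] keys;
    private final double[] values;
    private final boolean[] used;

    public LongDoubleMap(int capacity) {   // assumption: capacity > number of puts
        keys = new long[capacity];
        values = new double[capacity];
        used = new boolean[capacity];
    }

    private int slot(long key) {
        int i = (int) ((key ^ (key >>> 32)) & 0x7fffffff) % keys.length;
        while (used[i] && keys[i] != key) i = (i + 1) % keys.length; // linear probing
        return i;
    }

    public void put(long key, double value) {
        int i = slot(key);
        used[i] = true;
        keys[i] = key;
        values[i] = value;
    }

    public double get(long key, double defaultValue) {
        int i = slot(key);
        return used[i] ? values[i] : defaultValue;
    }

    public static void main(String[] args) {
        LongDoubleMap distances = new LongDoubleMap(16);
        distances.put(42L, 1.2);
        System.out.println(distances.get(42L, Double.MAX_VALUE)); // 1.2
    }
}
```

In real use you would simply depend on fastutil; the point is only that a `Map<Long, Double>` allocates a boxed key, a boxed value, and an entry node per mapping, while this layout allocates nothing per entry.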
Problem: Too many objects. Lots of time spent in GC.
Graph: 1B Vertices, 200B Edges, 200 Workers.
• 1B Edges per Worker; 1 object per edge value.
  List<Edge<I, E>> ~ 10B objects
• 5M Vertices per Worker; 10 objects per vertex value.
  Map<I, Vertex<I, V, E>> ~ 50M objects
• 1 Message per Edge; 10 objects per message data.
  Map<I, List<M>> ~ 10B objects
• Objects used ~= O(E*e + V*v + M*m) => O(E*e)
[Diagram: Label Propagation, e.g. "Who's sleeping?"]

Problem: Too many objects. Lots of time spent in GC.
Solution: byte[]
• Serialize messages, edges, and vertices.
• Iterable interface with a representative object.
[Diagram: each next() call deserializes the next input into a single reused object. Objects per worker ~= O(V).]

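A sketch of the byte[] technique (illustrative class names, not Giraph's): edges stay serialized in a single byte[], and iteration deserializes each edge into one reused "representative" object instead of allocating an object per edge:

```java
import java.nio.ByteBuffer;
import java.util.Iterator;

// Edges serialized as [targetId (8 bytes)][value (8 bytes)]... in one byte[].
public class ByteArrayEdges implements Iterable<ByteArrayEdges.Edge> {
    public static class Edge {           // representative object, reused across next() calls
        public long targetId;
        public double value;
    }

    private final byte[] data;

    public ByteArrayEdges(long[] targets, double[] values) {
        ByteBuffer buf = ByteBuffer.allocate(targets.length * 16);
        for (int i = 0; i < targets.length; i++) {
            buf.putLong(targets[i]).putDouble(values[i]);
        }
        data = buf.array();
    }

    @Override
    public Iterator<Edge> iterator() {
        final ByteBuffer buf = ByteBuffer.wrap(data);
        final Edge representative = new Edge();   // one object for the whole iteration
        return new Iterator<Edge>() {
            public boolean hasNext() { return buf.hasRemaining(); }
            public Edge next() {
                representative.targetId = buf.getLong();
                representative.value = buf.getDouble();
                return representative;            // caller must not hold on to it
            }
        };
    }

    public static void main(String[] args) {
        ByteArrayEdges edges = new ByteArrayEdges(new long[]{1, 2}, new double[]{0.5, 0.8});
        for (Edge e : edges) {
            System.out.println(e.targetId + " -> " + e.value);
        }
    }
}
```

The trade-off is the one the mutations slide mentions: callers may only read the representative object between next() calls, and any in-place mutation means re-serializing.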
Problem: Serialization of byte[]
• DataInput? Kryo? Custom?
Solution: Unsafe
• Dangerous. No formal API. Volatile. Non-portable (Oracle JVM only).
• AWESOME. As fast as it gets.
• True native. Essentially C: *(long*)(data+offset);

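A hedged sketch of the Unsafe approach on a HotSpot/OpenJDK VM — reading a long straight out of a byte[] at an offset, like `*(long*)(data+offset)` in C. As the slide warns, this is not a formal API and may break on other VMs or future releases:

```java
import java.lang.reflect.Field;
import sun.misc.Unsafe;

public class UnsafeReader {
    private static final Unsafe UNSAFE;
    private static final long BASE;   // offset of element 0 inside a byte[]

    static {
        try {
            // sun.misc.Unsafe has no public constructor; grab the singleton.
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            UNSAFE = (Unsafe) f.get(null);
            BASE = UNSAFE.arrayBaseOffset(byte[].class);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    public static long readLong(byte[] data, int offset) {
        return UNSAFE.getLong(data, BASE + offset);   // no bounds check, no copying
    }

    public static void writeLong(byte[] data, int offset, long value) {
        UNSAFE.putLong(data, BASE + offset, value);
    }

    public static void main(String[] args) {
        byte[] data = new byte[16];
        writeLong(data, 8, 123456789L);
        System.out.println(readLong(data, 8)); // 123456789
    }
}
```

Unlike DataInput, there is no stream object, no per-read bounds checking, and no byte-by-byte assembly — which is where the speed comes from, and also why a bad offset corrupts memory instead of throwing.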
Problem: Large Aggregations.
[Diagram: every worker sends its aggregator values directly to the master.]
Solution: Sharded Aggregators.
[Diagram: each worker owns a subset of aggregators; aggregator owners communicate with the master and distribute values to the other workers.]
K-Means Clustering, e.g. Similar Emails

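A toy model of the sharding idea (all names illustrative): each aggregator gets an owning worker chosen by hashing its name; workers send partial values to the owner, and only owners talk to the master, so no single node absorbs the whole fan-in:

```java
import java.util.*;

public class ShardedAggregators {
    // Deterministic owner assignment, e.g. by hashing the aggregator name.
    static int owner(String aggregator, int numWorkers) {
        return Math.floorMod(aggregator.hashCode(), numWorkers);
    }

    // partials: one map per worker of (aggregator name -> partial sum).
    static Map<String, Double> aggregate(List<Map<String, Double>> partials, int numWorkers) {
        // Owner-side combine: each aggregator is reduced at its owning worker.
        List<Map<String, Double>> perOwner = new ArrayList<>();
        for (int w = 0; w < numWorkers; w++) perOwner.add(new HashMap<>());
        for (Map<String, Double> workerPartials : partials) {
            for (Map.Entry<String, Double> e : workerPartials.entrySet()) {
                perOwner.get(owner(e.getKey(), numWorkers))
                        .merge(e.getKey(), e.getValue(), Double::sum);
            }
        }
        // Owners then report their combined values to the master.
        Map<String, Double> master = new HashMap<>();
        for (Map<String, Double> owned : perOwner) master.putAll(owned);
        return master;
    }

    public static void main(String[] args) {
        List<Map<String, Double>> partials = Arrays.asList(
                Map.of("sumError", 1.0, "count", 2.0),
                Map.of("sumError", 0.5, "count", 3.0));
        System.out.println(aggregate(partials, 4)); // sumError=1.5, count=5.0
    }
}
```

Here aggregation is summation for simplicity; real aggregators carry their own reduce operation, but the ownership-by-hashing structure is the same.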
Problem: Network Wait.
• RPC doesn't fit the model.
• Synchronous calls are no good.
Solution: Netty
• Tune queue sizes & threads.
[Diagram: superstep timelines. Before: compute finishes, then workers wait at the barrier while the network flushes. After: messages go out during computation ("time to first message"), shrinking the wait between end of compute and end of superstep.]

Results
Scalability Graphs
[Chart: iteration time (sec, 0–450) vs. workers (50–300), with 2B vertices, 200B edges, 20 compute threads — increasing workers.]
[Chart: iteration time (sec, 0–450) vs. edges (1E+09 to 1.01E+11), with 50 workers, 20 compute threads — increasing data size.]

Lessons Learned
• Coordinating is a zoo. Be resilient with ZooKeeper.
• Efficient networking is hard. Let Netty help.
• Primitive collections, primitive performance. Use fastutil.
• byte[] is simple yet powerful.
• Being Unsafe can be a good thing.
• Have a graph? Use Giraph.

What's the final result?
Comparison with Hive:
• 20x CPU speedup
• 100x elapsed-time speedup: 15 hours => 9 minutes.
Computations on the entire Facebook graph are no longer "weekend jobs". Now they're coffee breaks.

Questions?
Problem: Measurements.
• Need tools to gain visibility into the system.
• Problems with connecting to Hadoop sub-processes.
Solution: Do it all.
• YourKit – see YourKitProfiler
• jmap – see JMapHistoDumper
• VisualVM – with jstatd & an SSH SOCKS proxy
• Yammer Metrics
• Hadoop Counters
• Logging & GC prints

Problem: Mutations
• Synchronization.
• Load balancing.
Solution: Reshuffle resources
• Mutations handled at the barrier between supersteps.
• Master rebalances vertex assignments to optimize distribution.
• Handle mutations in batches.
• Avoid if using byte[].
• Favor algorithms which don't mutate the graph.

Presented at Berlin Buzzwords, June 3, 2013.
Speaker Notes
  • No internal FB repo; everyone is a committer. A global epoch is followed by a global barrier, where components do concurrent computation and send messages. Graphs are sparse.
  • Giraph is a map-only job
  • Code is real, checked into Giraph. All vertices find the maximum value in a strongly connected graph.
  • One active master, with spare masters taking over in the event of an active master failure. All active master state is stored in ZooKeeper so that a spare master can immediately step in when an active master fails; the "active" master is implemented as a queue in ZooKeeper. A single worker failure causes the superstep to fail, and the application reverts to the last committed superstep automatically. The master detects worker failure during any superstep via a ZooKeeper "health" znode, then chooses the last committed superstep and sends a command through ZooKeeper for all workers to restart from it.
  • One active master, with spare masters taking over in the event of an active master failure. All active master state is stored in ZooKeeper so that a spare master can immediately step in when an active master fails; the "active" master is implemented as a queue in ZooKeeper.
  • Primitive collections are primitive. Lots of boxing/unboxing of types; an object and a reference for each instance.
  • There are also other implementations, like Map<I, E> for edges, which use more space but are better for lots of mutations. Realistically, FB-sized graphs need even bigger. Edges are not uniform in reality; some vertices are much larger.
  • Dangerous, non-portable, volatile. Oracle JVM only; no formal API. Allocate non-GC memory. Inherit from String (a final class). Direct memory access (C pointer casts).
  • Cluster open source projects.Histograms. Job metrics.
  • Start sending messages early, and send while computing. Tune message buffer sizes to reduce wait time.
  • First things first – what's going on with the system? We want a debugger, but don't have one. Use YourKit's API to create granular snapshots within the application. JMap binding errors – spawn it from within the process.
  • With byte[], any mutation requires full deserialization / re-serialization.