
Hadoop Meetup Jan 2019 - Dynamometer and a Case Study in NameNode GC


Erik Krogen of LinkedIn presents Dynamometer, a system open sourced by LinkedIn for scale- and performance-testing HDFS. He covers one major use case for Dynamometer, tuning NameNode GC, and discusses characteristics of NameNode GC: why it is important, and how it interacts with various current and future GC algorithms.

This is taken from the Apache Hadoop Contributors Meetup on January 30, hosted by LinkedIn in Mountain View.


  1. Dynamometer and a Case Study in NameNode GC (Erik Krogen, Senior Software Engineer, Hadoop & HDFS)
  2. Dynamometer • Realistic performance benchmark & stress test for HDFS • Open sourced on LinkedIn GitHub, contributing to Apache • Evaluate scalability limits • Provide confidence before new feature/config deployment
  3. What’s a Dynamometer? “A dynamometer, or ‘dyno’ for short, is a device for measuring force, torque, or power. For example, the power produced by an engine…” - Wikipedia. Image redistributed under the CC BY-SA 2.0 license.
  4. Main Goals
     High Fidelity:
     • Accurate namespace: namespace characteristics have a big impact
     • Accurate client workload: request types and the timing of requests both have a big impact
     • Accurate system workload: load imposed by system management (block reports, etc.) has a big impact
     Efficiency:
     • Low cost: offline infra has high utilization; can’t afford to keep unused machines around for testing
     • Low developer effort: deploying to a large number of machines can be cumbersome; make it easy
     • Fast iteration cycle: should be able to iterate quickly
  5. Simplify the Problem: the NameNode is the central component and the most frequent bottleneck, so focus here.
  6. Dynamometer: Simulated HDFS Cluster Runs in YARN Containers
     • How to schedule and coordinate? Use YARN!
     • Real NameNode, fake DataNodes, to run on ~1% of the hardware
     [Architecture diagram: the Dynamometer Driver launches an application on the host YARN cluster; the Dynamometer AM runs the real NameNode and many simulated DataNodes in YARN containers, fed by the FsImage and block listings from the host HDFS cluster]
  7. Dynamometer: Simulated HDFS Clients Run in YARN Containers
     • Clients can run on YARN too!
     • Replay real traces from production cluster audit logs
     [Architecture diagram: alongside the Dynamometer infrastructure application, a workload MapReduce job runs simulated clients in YARN containers against the simulated NameNode; audit logs from the host HDFS cluster drive the replay]
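The audit logs replayed above record one line per client request. As a quick illustration of what such a trace contains, the sketch below counts request types in a couple of fabricated lines; the file name and entries are made up, but they follow the standard HDFS FSNamesystem audit-log format.

```shell
# Fabricated sample in the standard FSNamesystem audit format; in practice
# you would point the grep at a real hdfs-audit.log from production.
cat > sample-audit.log <<'EOF'
2019-01-30 12:00:00,001 INFO FSNamesystem.audit: allowed=true ugi=alice (auth:SIMPLE) ip=/10.0.0.1 cmd=open src=/data/a dst=null perm=null proto=rpc
2019-01-30 12:00:00,002 INFO FSNamesystem.audit: allowed=true ugi=bob (auth:SIMPLE) ip=/10.0.0.2 cmd=listStatus src=/data dst=null perm=null proto=rpc
2019-01-30 12:00:00,003 INFO FSNamesystem.audit: allowed=true ugi=alice (auth:SIMPLE) ip=/10.0.0.1 cmd=open src=/data/b dst=null perm=null proto=rpc
EOF

# Tally request types; the cmd= field is what determines replay behavior.
grep -o 'cmd=[^ ]*' sample-audit.log | sort | uniq -c | sort -rn
```

The distribution of `cmd=` values is one of the "accurate client workload" characteristics the earlier goals slide calls out.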
  8. Contributing to Apache • Working to put Dynamometer into hadoop-tools • An easier place for the community to access and contribute • Increased chance of others helping to maintain it • Follow HDFS-12345 (the actual ticket number, not a placeholder)
  9. NameNode GC: A Dynamometer Case Study
  10. NameNode GC Primer • Why do we care? • NameNode heaps are huge (multi-hundred GB) • GC is a big factor in performance • What’s special about NameNode GC? • Huge working set: can have over 100GB of long-lived objects • Massive young-gen churn (from RPC requests)
  11. Question: Can we use a new GC algorithm to squeeze more performance out of the NameNode?
  12. Experimental Setup
     • 16-hour production trace: long enough to experience 2 rounds of mixed GC
     • Measure performance via standard metrics (client latency, RPC queue time)
     • Measure GC pauses during startup and normal workloads
     • Let’s try G1GC even though we know we’re pushing the limits:
       “The region sizes can vary from 1 MB to 32 MB depending on the heap size. The goal is to have no more than 2048 regions.” - Oracle, “Garbage First Garbage Collector Tuning”
       This implies that the heap should be 64 GB and under, but at this scale (a 150 GB heap) we are well past that.
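A quick back-of-the-envelope check of that guidance, using the heap size from this experiment and G1's largest region size:

```shell
# Even at G1's maximum region size (32 MB), a 150 GB heap blows well past
# the "no more than 2048 regions" guidance quoted above.
heap_gb=150      # NameNode heap size used in this experiment
region_mb=32     # largest region size G1 supports
regions=$(( heap_gb * 1024 / region_mb ))
echo "${regions} regions"   # prints "4800 regions"
```

So even in the best case the NameNode heap has more than double the suggested region count, which is why the remembered-set behavior below matters so much.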
  13. Can You Spot the Issue?
      [Parallel Time: 17676.0 ms, GC Workers: 16]
        [GC Worker Start (ms): Min: 883574.6, Avg: 883574.8, Max: 883575.0, Diff: 0.3]
        [Ext Root Scanning (ms): Min: 1.0, Avg: 1.2, Max: 2.1, Diff: 1.1, Sum: 18.8]
        [Update RS (ms): Min: 31.7, Avg: 32.2, Max: 32.8, Diff: 1.1, Sum: 514.7]
          [Processed Buffers: Min: 25, Avg: 30.2, Max: 38, Diff: 13, Sum: 484]
        [Scan RS (ms): Min: 17011.1, Avg: 17052.9, Max: 17400.5, Diff: 389.4, Sum: 272846.4]
        [Code Root Scanning (ms): Min: 0.0, Avg: 0.1, Max: 0.6, Diff: 0.6, Sum: 1.0]
        [Object Copy (ms): Min: 169.8, Avg: 500.5, Max: 534.3, Diff: 364.5, Sum: 8007.3]
        [Termination (ms): Min: 0.0, Avg: 88.8, Max: 96.5, Diff: 96.5, Sum: 1421.5]
        [GC Worker Other (ms): Min: 0.0, Avg: 0.0, Max: 0.1, Diff: 0.1, Sum: 0.4]
        [GC Worker Total (ms): Min: 17675.5, Avg: 17675.6, Max: 17675.8, Diff: 0.3, Sum: 282810.3]
        [GC Worker End (ms): Min: 901250.4, Avg: 901250.4, Max: 901250.5, Diff: 0.0]
      [Code Root Fixup: 0.7 ms]
      [Code Root Migration: 2.3 ms]
      [Code Root Purge: 0.0 ms]
      [Clear CT: 6.7 ms]
      [Other: 1194.8 ms]
        [Choose CSet: 0.0 ms]
        [Ref Proc: 2.8 ms]
        [Ref Enq: 0.4 ms]
        [Redirty Cards: 468.4 ms]
        [Free CSet: 4.0 ms]
      [Eden: 7360.0M(7360.0M)->0.0B(6720.0M) Survivors: 320.0M->960.0M Heap: 92.6G(150.0G)->87.2G(150.0G)]
      [Times: user=223.20 sys=0.22, real=18.88 secs]
      902.102: Total time for which application threads were stopped: 18.8815330 seconds

      Highlights from the slide:
      • Total stop time of 18.88 seconds: huge pause!
      • [Eden: 7360.0M -> 0.0B]: a few GB of Eden cleared, big but not huge
      • [Object Copy]: ~500 ms of pause due to object copy
      • [Scan RS]: 17.5 s of pause due to “Scan RS”!
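Spotting the dominant phase by eye gets tedious in a long log; the per-phase Min/Avg/Max lines can be ranked mechanically instead. A minimal sketch follows; the gc.log contents here are a fabricated three-line excerpt mirroring the slide, while a real log would come from running the JVM with `-Xloggc:gc.log -XX:+PrintGCDetails`.

```shell
# Fabricated excerpt of a G1 log, mirroring the phase lines on the slide.
cat > gc.log <<'EOF'
[Update RS (ms): Min: 31.7, Avg: 32.2, Max: 32.8, Diff: 1.1, Sum: 514.7]
[Scan RS (ms): Min: 17011.1, Avg: 17052.9, Max: 17400.5, Diff: 389.4, Sum: 272846.4]
[Object Copy (ms): Min: 169.8, Avg: 500.5, Max: 534.3, Diff: 364.5, Sum: 8007.3]
EOF

# Extract each phase's Min/Avg/Max and sort descending by the Max value
# (the 5th colon-separated field); the worst offender rises to the top.
grep -oE '\[[A-Za-z ]+ \(ms\): Min: [0-9.]+, Avg: [0-9.]+, Max: [0-9.]+' gc.log \
  | sort -t: -k5 -rn | head -3
# the Scan RS line, with Max ~17400 ms, sorts first
```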
  14. Tuning G1GC with Dynamometer • G1GC has lots of tunables; how do we optimize all of them without hurting our production system? • Dynamometer to the rescue • Easily set up experiments sweeping over different values for a parameter • Fire-and-forget: test with many combinations and analyze later • The main parameters needing significant tuning were for the remembered sets (details to follow in the appendix)
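A sweep like that can be scripted as a batch of fire-and-forget launches. The sketch below is purely illustrative: the driver script name and its flags are hypothetical placeholders, not Dynamometer's actual command-line interface, and `echo` just prints the commands that would be launched.

```shell
# Hypothetical sweep over remembered-set sizes; ./start-dynamometer-cluster.sh
# and its flags are placeholders, not Dynamometer's real CLI. Remove the
# "echo" to actually launch the runs and compare results/rset-*/ afterwards.
for entries in 1536 2048 4096 8192; do
  echo ./start-dynamometer-cluster.sh \
      --namenode-jvm-args "-XX:+UseG1GC -XX:G1RSetRegionEntries=${entries}" \
      --results-dir "results/rset-${entries}"
done
```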
  15. How Much Does G1GC Help?
                                   Startup          Normal Operation
      Metric                       CMS     G1GC     CMS      G1GC
      Avg Client Latency (ms)      -       -        19       18
      Total Pause Time (s)         200     180      550      160
      Median Pause Time (s)        1.1     0.5      0.12     0.06
      Max Pause Time (s)           13.4    3.3      1.4      0.6
      * Values are approximate and provided primarily to give a sense of scale.
      Takeaways: excellent reduction in pause times; not much impact on throughput.
  16. Looking towards the future… • Question: how does G1GC fare extrapolating to future workloads? • 600GB+ heap size, 1 billion blocks, 1 billion files • Answer: not so well • The RSet entry count has to be increased even further to obtain reasonable performance • Off-heap overheads in the hundreds of gigabytes • Wouldn’t recommend it
  17. Looking towards the future… • Anything we can do besides G1GC? • Extensive testing with Azul’s C4 GC, available in the Zing® JVM • Good performance with no tuning • Results in a test environment: • 99th-percentile pause time ~1ms, max in the tens of ms • Average client latency dropped ~30% • Continued to see good performance up to 600GB heap size
  18. Looking towards the future… • Anything we can do that isn’t proprietary? • Wait for OpenJDK’s next-generation GC algorithms to mature: • Shenandoah • ZGC
  19. Appendix: Detailed G1GC Tuning Tips
     • -XX:G1RSetRegionEntries: solves the “Scan RS” problem shown earlier. 4096 worked well (default of 1536)
       • Comes with high off-heap memory overheads
     • -XX:G1RSetUpdatingPauseTimePercent: reduce this to shrink the “Update RS” pause time and push more work to concurrent threads (the NameNode is not really that concurrent, so extra cores are better used by the GC algorithm)
     • -XX:G1NewSizePercent: the default of 5% is unreasonably large for heaps > 100GB; reducing it will help shorten pauses during high-churn periods (startup, failover)
     • -XX:MaxTenuringThreshold, -XX:ParallelGCThreads, -XX:ConcGCThreads: set empirically based on experiments sweeping over values. This is where Dynamometer really shines
     • MaxTenuringThreshold is particularly interesting: based on the NameNode usage pattern (objects are either very long-lived or very short-lived), you would expect low values (1 or 2) to be best, but in practice values closer to the default of 8 perform better
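Pulled together, these flags might look like the hadoop-env.sh fragment below. This is a sketch, not a recommendation: only G1RSetRegionEntries=4096 comes from the tips above, while the other values are illustrative starting points that should be swept with Dynamometer for your own workload. Note that some G1 flags (e.g. G1NewSizePercent) are experimental and require -XX:+UnlockExperimentalVMOptions.

```shell
# Sketch of a NameNode G1 option set based on the tuning tips above.
# Only G1RSetRegionEntries=4096 is taken from the slides; the remaining
# values are illustrative, not tuned recommendations.
export HADOOP_NAMENODE_OPTS="${HADOOP_NAMENODE_OPTS} \
  -XX:+UseG1GC \
  -XX:+UnlockExperimentalVMOptions \
  -XX:G1RSetRegionEntries=4096 \
  -XX:G1RSetUpdatingPauseTimePercent=5 \
  -XX:G1NewSizePercent=2 \
  -XX:MaxTenuringThreshold=8 \
  -XX:+PrintGCDetails -Xloggc:/var/log/hadoop/namenode-gc.log"
```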
  20. Thank you!