A Tool for Practical Garbage Collection Analysis In the Cloud
Published in: Technology

Transcript

  • 1. A Tool for Practical Garbage Collection Analysis In the Cloud
      Arun Kejariwal, March 2013
      International Conference on Cloud Engineering 2013 © Arun Kejariwal
  • 2. Overview
      - Cloud computing becoming ubiquitous
          o SaaS, PaaS, IaaS
          o Market size of 65 to 85 billion by 2015 [McKinsey]
      - IaaS
          o Large adoption
              - Higher scalability, lower cost, reduced time-to-market
          o Examples
              - Zynga, Netflix, PBS, Foursquare, …
          o Growing vendors
              - AWS, Google Compute Engine, Azure, Rackspace
      - Java-based web applications
          o GC impacts application performance in a significant way
              - For example: [Zhao et al., OOPSLA’09]
              - 100s of papers published on memory management in languages such as Java [“The Garbage Collection Bibliography,” http://www.cs.kent.ac.uk/people/staff/rej/gcbib/gcbib.pdf]
  • 3. GC Analysis in the Cloud: Why Bother?
      - User experience
          o Latency, throughput
      - Application-driven selection of GC type
      - Performance evaluation of new JVM
          o JVM 7
              - G1 collector, new optimizations such as escape analysis
      - Capacity planning
          o Operational efficiency
          o For example, on AWS
  • 4. Key Contributions
      - Tool, called Shrek, for GC analysis in the cloud
          o Cluster with over 100 nodes
      - Features
          o Driven by actual needs of the various application teams
          o Focus on simplicity
              - Deployed in production
              - By contrast, the winning solution of the Netflix Prize was very academic and not deployable in production
          o Outlier detection
              - Detecting “bad” nodes via unsupervised learning
          o Detect performance regressions via time series analysis
              - Performance impact of new features
              - Red/Black deployments
          o Characterize performance during A/B (bucket) testing
          o Detect memory “leaks”
  • 5. GC: Quick review
      - Generational garbage collector
          o Objects are first allocated in the Young Gen (YG)
          o Objects whose age exceeds a given threshold are promoted to the Old Gen (OG)
      - GC type
          o Parallel
          o CMS
          o Recent: G1
  • 6. What About Using Existing Tools?
      - AppDynamics
      - GCHisto, GCViewer, Printgcstats, Jconsole
      - Common limitations
          o Absence of support for analyzing GC performance of a cluster of nodes
              - Tailored for a single Java process
          o Lack of statistical analysis
              - Mean k-Nearest Neighbor for outlier detection
              - Standard deviation
              - Trend analysis
          o Lack of support for G1 GC
          o Most tools are no longer maintained
  • 7. Shrek: Analyzing Heap Usage
      - Why bother?
          o High performance variability in the cloud [Iosup et al., CCG, 2011]
          o Potential reasons
              - Nodes going bad [Hoelzle and Barroso 2009], [Dai et al.], [Vishwanath and Nagappan, SoCC, 2010]
              - Multi-tenancy
              - Load balancer issues, e.g. AWS ELB issues on Dec 24, 2012 [http://aws.amazon.com/message/680587/]
              - A/B testing
              - Cascading effects in a SOA
              - Failover from another availability zone
  • 8. Shrek: Analyzing Heap Usage (contd.)
      - Detect “bad”/outlier nodes
          o Terminate them and spin up new ones
          o Early detection results in minimum customer impact
          o [Figure: example total heap usage time series output obtained via Shrek]
  • 9. Shrek: Analyzing Heap Usage (contd.)
      - Detect outliers
          o k-NN unsupervised learning
          o [Figure: per-node bar chart; x-axis: Node (0 to 30); y-axis: 10^-4 / (Avg Young Generation Use * Std Dev), ranging from 0 to about 4]
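The k-NN outlier detection named on this slide can be illustrated with a minimal sketch. It is not Shrek's implementation. It assumes one scalar metric per node, scores each node by the mean distance to its k nearest neighbours, and flags nodes whose score exceeds a multiple of the median score. The function names, the `factor` cutoff rule, and the sample values are all hypothetical.

```python
from statistics import mean, median


def mean_knn_distance(values, k=3):
    """Return, for each value, the mean distance to its k nearest neighbours."""
    scores = []
    for i, v in enumerate(values):
        # Distances from this node's metric to every other node's metric.
        dists = sorted(abs(v - w) for j, w in enumerate(values) if j != i)
        scores.append(mean(dists[:k]))
    return scores


def find_outliers(values, k=3, factor=3.0):
    """Flag indices whose score exceeds `factor` times the median score.

    The cutoff rule is an assumption made for this sketch.
    """
    scores = mean_knn_distance(values, k)
    cutoff = factor * median(scores)
    return [i for i, s in enumerate(scores) if s > cutoff]


# Hypothetical per-node metric values; node 6 sits far from the rest.
node_metric = [3.9, 3.8, 3.7, 3.6, 3.5, 3.4, 0.3, 3.3]
outliers = find_outliers(node_metric)
print("outlier node indices:", outliers)
```

Because the score depends only on distances between nodes, no labelled "good" or "bad" examples are needed, which is what makes the approach unsupervised.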
  • 10. Shrek: Analyzing Heap Usage (contd.)
      - Old Gen usage
          o Driven by promotion rate
          o Promotion rate may vary across nodes
              - A/B testing
      - Shrek also reports the YG usage time series
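How promotion drives Old Gen usage can be made concrete with a small worked sketch. It assumes that, for each minor GC, the Young Gen and total heap occupancy before and after the collection are known. Whatever left the Young Gen without leaving the heap must have been promoted to the Old Gen. The `MinorGC` record and the sample numbers are hypothetical, and this is not necessarily how Shrek derives the figure.

```python
from dataclasses import dataclass


@dataclass
class MinorGC:
    """Occupancy in MB around one minor collection (hypothetical record)."""
    young_before_mb: float
    young_after_mb: float
    heap_before_mb: float
    heap_after_mb: float


def promoted_mb(gc):
    """Promotion = Young Gen reduction minus total heap reduction."""
    young_freed = gc.young_before_mb - gc.young_after_mb
    heap_freed = gc.heap_before_mb - gc.heap_after_mb
    return young_freed - heap_freed


# Two made-up minor collections on one node.
events = [
    MinorGC(512, 32, 1500, 1040),  # 480 MB left YG, 460 MB left heap
    MinorGC(512, 40, 1520, 1060),  # 472 MB left YG, 460 MB left heap
]
per_gc = [promoted_mb(e) for e in events]
average_promotion = sum(per_gc) / len(per_gc)
print(per_gc, average_promotion)
```

Comparing this average across nodes is one way to see the per-node variation in promotion rate that the slide attributes to A/B testing.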
  • 11. Shrek: Analyzing Pause Times
      - Pause time analysis
          o Data distribution of GC pause times
          o Histogram plots supported by Shrek
              - Initial Mark
              - Remark
              - Full GC times
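A pause-time histogram of the kind listed above can be sketched as follows. The sketch assumes the pause durations have already been extracted from the GC logs as milliseconds. The bin width, function name, and sample pauses are illustrative assumptions, and the text rendering stands in for Shrek's actual plots.

```python
from collections import Counter


def pause_histogram(pauses_ms, bin_width_ms=50):
    """Count pauses per fixed-width bin, keyed by the bin's lower edge in ms."""
    counts = Counter(int(p // bin_width_ms) for p in pauses_ms)
    return {i * bin_width_ms: counts[i] for i in sorted(counts)}


# Hypothetical Remark pause times in milliseconds.
remark_pauses_ms = [12, 18, 47, 55, 61, 130]
hist = pause_histogram(remark_pauses_ms)

# Render one text bar per bin.
for lower, n in hist.items():
    print(f"{lower:>4}-{lower + 50:<4} ms  {'#' * n}")
```

Keeping a separate histogram per phase (Initial Mark, Remark, Full GC) avoids mixing distributions whose typical magnitudes differ.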
  • 12. Shrek: Summary Report
      - Metrics reported for each node
          o Minor GC count
          o # Failures (concurrent mode failures) and Failure Time
              - Not reported by any existing tool
          o Initial Mark and Remark
          o Average and Max YG (s)
          o Average and Max Full GC (s)
          o Average Promotion (MB)
              - Not reported by any existing tool
      - Summary report integrated with the in-house alerting system
          o Assist in triaging production issues
      - Recap
          o Existing tools do not support GC analysis across an entire cluster
  • 13. Shrek: Detecting Memory “Leaks”
      - Time series analysis of heap usage
          o Upward slope over multiple days
              - Potential memory “leak”
          o Predict heap usage trend
              - Holt-Winters method for prediction
      - Example from production
          o Upward slope
          o Verified “leak” with the application team
          o Orange region
              - 80% prediction level
          o Yellow region
              - 95% prediction level
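The trend prediction step can be illustrated with a simplified sketch. It uses Holt's linear method (double exponential smoothing), which is the non-seasonal core of Holt-Winters. It leaves out the seasonal component and the 80%/95% prediction regions shown on the slide. The smoothing constants, function name, and sample data are assumptions for illustration, not Shrek's settings.

```python
def holt_linear_forecast(series, alpha=0.5, beta=0.5, horizon=3):
    """Forecast `horizon` steps ahead with Holt's linear (trend) method.

    Returns (forecasts, trend), where trend is the final smoothed slope
    per time step. No seasonal term and no prediction intervals.
    """
    if len(series) < 2:
        raise ValueError("need at least two observations")
    level = series[0]
    trend = series[1] - series[0]
    for y in series[1:]:
        prev_level = level
        # Blend the new observation with the previous one-step forecast.
        level = alpha * y + (1 - alpha) * (level + trend)
        # Blend the latest change in level with the previous slope.
        trend = beta * (level - prev_level) + (1 - beta) * trend
    forecasts = [level + h * trend for h in range(1, horizon + 1)]
    return forecasts, trend


# Hypothetical daily post-GC heap usage in MB, rising steadily.
heap_mb = [100, 110, 120, 130, 140]
forecasts, trend = holt_linear_forecast(heap_mb)
print(forecasts, trend)
```

A smoothed trend that stays positive over several days is only a hint of a potential "leak". As on the slide, it still needs to be confirmed with the application team.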
  • 14. Wrapping up …
      - Shrek: tool for GC analysis in the cloud
          o Statistical analysis
          o Detect performance regression
          o “Bad”/outlier nodes detection
          o Characterize performance of Red/Black deployments
          o Memory “leak” detection
      - Future work
          o Integrate with Hive/… to limit pulling GC logs from production nodes to once only
          o Support advanced analytics to guide tuning of GC parameters
  • 15. Q&A
