A Tool for Practical Garbage Collection Analysis In the Cloud
- 1. A Tool for Practical Garbage Collection Analysis In the Cloud
Arun Kejariwal
March 2013
1 International Conference on Cloud Engineering 2013 © Arun Kejariwal
- 2. Overview
Cloud computing becoming ubiquitous
o SaaS, PaaS, IaaS
o Market size of 65 to 85 billion by 2015 [McKinsey]
IaaS
o Large adoption
Higher scalability, Lower cost, Reduced time-to-market
o Examples
Zynga, Netflix, PBS, Foursquare, …
o Growing vendors
AWS, Google Compute Engine, Azure, Rackspace
Java-based web applications
o GC impacts application performance in a significant way
For example: [Zhao et al., OOPSLA’09]
100s of papers published on memory management in languages such as Java
[“The Garbage Collection Bibliography,” http://www.cs.kent.ac.uk/people/staff/rej/gcbib/gcbib.pdf]
- 3. GC Analysis in the Cloud: Why Bother?
User Experience
o Latency, Throughput
Application-driven selection of GC Type
Performance evaluation of new JVM
o JVM 7
G1 collector, New optimizations such as escape analysis
Capacity Planning
o Operational Efficiency
o For example, on AWS
- 4. Key Contributions
Tool, called Shrek, for GC analysis in the cloud
o Cluster with over 100 nodes
Features
o Driven by actual needs of the various application teams
o Focus on simplicity
Deployed in production
Solution of the winner of the Netflix Prize was very academic and not deployable in production
o Outlier detection
Detecting “bad” nodes via unsupervised learning
o Detect performance regressions via time series analysis
Performance impact of new features
Red/Black deployments
o Characterize performance during A/B (bucket) testing
o Detect memory “leaks”
- 5. GC: Quick review
Generational garbage collector
o Objects are first allocated to Young Gen (YG)
o Objects whose age exceeds a given threshold are promoted to Old Gen (OG)
GC Type
o Parallel
o CMS
o Recent: G1
- 6. What About Using Existing Tools?
AppDynamics
GCHisto, GCViewer, Printgcstats, Jconsole
Common limitations
o Absence of support for analyzing GC performance of a cluster of nodes
Tailored for a single Java process
o Lack of statistical analysis
Mean
k-Nearest Neighbor for outlier detection
Standard deviation
Trend analysis
o Lack of support for G1 GC
o Most tools are no longer maintained
- 7. Shrek: Analyzing Heap Usage
Why bother?
o High performance variability in the cloud [Iosup et al., CCG, 2011]
o Potential reasons
o Nodes going bad [Hoelzle and Barroso 2009], [Dai et al.], [Vishwanath and Nagappan, SoCC, 2010]
o Multi-tenancy
o Load balancer issues
AWS ELB issues on Dec 24, 2012 [http://aws.amazon.com/message/680587/]
o A/B Testing
o Cascading effects in a SOA
o Failover from another availability zone
- 8. Shrek: Analyzing Heap Usage (contd.)
Detect “bad”/outlier nodes
o Terminate them and spin up new ones
o Early detection minimizes customer impact
o Example total heap usage time series output obtained via Shrek
- 9. Shrek: Analyzing Heap Usage (contd.)
Detect outliers
o k-NN unsupervised learning
[Figure: per-node bar chart. y-axis: 10^-4/(Avg Young Generation Use * Std Dev), ticks 0 to 4; x-axis: Node, ticks 0 to 30]
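The slide names k-NN as the outlier detection technique but does not give Shrek's implementation. A minimal sketch of the general idea, assuming one scalar metric per node, a score equal to the mean distance to the k nearest neighbours, and a cutoff of a fixed multiple of the median score (the cutoff rule, `k`, `factor`, and the sample numbers are all assumptions, not taken from the talk):

```python
from statistics import median

def knn_scores(values, k=3):
    """Return, for each value, the mean distance to its k nearest neighbours."""
    scores = []
    for i, v in enumerate(values):
        dists = sorted(abs(v - w) for j, w in enumerate(values) if j != i)
        scores.append(sum(dists[:k]) / k)
    return scores

def find_outliers(values, k=3, factor=3.0):
    """Flag indices whose k-NN score exceeds `factor` times the median score.

    The median-based cutoff is an assumption made for this sketch.
    """
    scores = knn_scores(values, k)
    cutoff = factor * median(scores)
    return [i for i, s in enumerate(scores) if s > cutoff]

# Hypothetical per-node metric (made-up numbers, one value per node).
node_metric = [3.5, 3.6, 3.4, 3.7, 3.5, 0.3, 3.6, 3.4]
print(find_outliers(node_metric))
```

A node whose metric sits far from every other node gets a large score and is flagged. No labels are needed, which is what makes the approach unsupervised.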
- 10. Shrek: Analyzing Heap Usage (contd.)
Old Gen usage
o Driven by promotion rate
o Promotion rate may vary across nodes
A/B testing
Shrek also reports the YG usage time series
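The amount promoted in one minor GC can be estimated from a GC log line: whatever left the Young Gen but did not leave the heap must have moved to the Old Gen. A sketch under stated assumptions: the line shape below is what older HotSpot JVMs print for ParNew with -XX:+PrintGCDetails, and the regex, function name, and sample values are illustrative rather than Shrek's actual parser.

```python
import re

# Assumed line shape (ParNew, -XX:+PrintGCDetails on older HotSpot JVMs):
# [GC [ParNew: 17024K->2112K(19136K), 0.0125 secs] 40000K->26000K(83008K), 0.0126 secs]
MINOR_GC = re.compile(
    r"\[ParNew: (\d+)K->(\d+)K\(\d+K\), [\d.]+ secs\] "
    r"(\d+)K->(\d+)K\(\d+K\), ([\d.]+) secs\]"
)

def parse_minor_gc(line):
    """Return promoted KB and pause seconds for a minor GC line, or None."""
    m = MINOR_GC.search(line)
    if m is None:
        return None
    yg_before, yg_after, heap_before, heap_after = (int(g) for g in m.groups()[:4])
    # Freed from YG minus freed from the whole heap = moved into Old Gen.
    promoted_kb = (yg_before - yg_after) - (heap_before - heap_after)
    return {"promoted_kb": promoted_kb, "pause_s": float(m.group(5))}

# Made-up sample line in the assumed format.
sample = ("[GC [ParNew: 17024K->2112K(19136K), 0.0125 secs] "
          "40000K->26000K(83008K), 0.0126 secs]")
print(parse_minor_gc(sample))
```

Summing `promoted_kb` over a time window and dividing by its length gives a per-node promotion rate that can be compared across nodes, for instance between A/B buckets.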
- 11. Shrek: Analyzing Pause Times
Pause time analysis
o Data distribution of GC pause times
o Histogram plots supported by Shrek
Initial Mark
Remark
Full GC Times
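The distribution of pause times can be summarised by bucketing them into fixed-width bins. A minimal sketch, assuming pauses are already extracted in seconds; the 50 ms bin width and the sample values are arbitrary choices for illustration:

```python
from collections import Counter

def pause_histogram(pauses_s, bin_ms=50):
    """Count pauses per bin; keys are the lower bin edge in milliseconds."""
    counts = Counter((round(p * 1000) // bin_ms) * bin_ms for p in pauses_s)
    return dict(sorted(counts.items()))

# Hypothetical Remark pause times in seconds (made-up numbers).
remark_pauses = [0.012, 0.048, 0.051, 0.120, 0.149, 0.310]
for lower_ms, count in pause_histogram(remark_pauses).items():
    print(f"{lower_ms:>4} ms+  {'#' * count}")
```

Working in integer milliseconds avoids bin-edge surprises from floating-point division. A long right tail in such a histogram is usually more informative than the mean pause alone.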
- 12. Shrek: Summary Report
Metrics reported for each node
o Minor GC count
o # Failures (concurrent mode failures) and Failure Time
Not reported by any existing tool
o Initial Mark and Remark
o Average and Max YG (s)
o Average and Max Full GC (s)
o Average Promotion (MB)
Not reported by any existing tool
Summary report integrated with the in-house alerting system
o Assist in triaging production issues
Recap
o Existing tools do not support GC analysis across an entire cluster
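The per-node metrics listed above can be derived by aggregating parsed GC events. A sketch assuming each event is a dict with a hypothetical `kind` field ("minor", "full", or "cmf" for a concurrent mode failure), a `pause_s` field, and, for minor GCs, a `promoted_mb` field; this event schema is invented for illustration and is not Shrek's data model:

```python
from statistics import mean

def summarize_node(events):
    """Aggregate one node's GC events into summary metrics."""
    def pauses(kind):
        return [e["pause_s"] for e in events if e["kind"] == kind]

    def avg_max(values):
        return (mean(values), max(values)) if values else (0.0, 0.0)

    minor, full, failures = pauses("minor"), pauses("full"), pauses("cmf")
    promoted = [e["promoted_mb"] for e in events if e["kind"] == "minor"]
    avg_yg, max_yg = avg_max(minor)
    avg_full, max_full = avg_max(full)
    return {
        "minor_gc_count": len(minor),
        "failures": len(failures),
        "failure_time_s": sum(failures),
        "avg_yg_s": avg_yg, "max_yg_s": max_yg,
        "avg_full_gc_s": avg_full, "max_full_gc_s": max_full,
        "avg_promotion_mb": mean(promoted) if promoted else 0.0,
    }

# Hypothetical events for one node (made-up numbers).
events = [
    {"kind": "minor", "pause_s": 0.02, "promoted_mb": 1.0},
    {"kind": "minor", "pause_s": 0.04, "promoted_mb": 3.0},
    {"kind": "full", "pause_s": 1.5},
    {"kind": "cmf", "pause_s": 2.0},
]
print(summarize_node(events))
```

Running this per node and collecting the resulting rows gives a cluster-wide table that an alerting system could threshold on.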
- 13. Shrek: Detecting Memory “Leaks”
Time series analysis of heap usage
o Upward sloping over multiple days
Potential memory “leak”
o Predict heap usage trend
Holt-Winters method for prediction
Example from production
o Upward sloping
o Verified “leak” with the application team
o Orange region
80% prediction level
o Yellow region
95% prediction level
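To illustrate trend-based prediction of heap usage, here is a sketch of Holt's linear method (double exponential smoothing), which is Holt-Winters without the seasonal component. It is a simplification, not the method as used in Shrek: it produces point forecasts only, with no 80% or 95% prediction bands, and the smoothing parameters, the capacity check, and the sample data are assumptions.

```python
def holt_forecast(series, alpha=0.5, beta=0.3, horizon=24):
    """Forecast `horizon` steps ahead with Holt's linear (level + trend) smoothing."""
    if len(series) < 2:
        raise ValueError("need at least two observations")
    level, trend = series[0], series[1] - series[0]
    for y in series[1:]:
        prev_level = level
        level = alpha * y + (1 - alpha) * (prev_level + trend)
        trend = beta * (level - prev_level) + (1 - beta) * trend
    return [level + h * trend for h in range(1, horizon + 1)]

def steps_until_full(series, capacity, **kwargs):
    """Return the first forecast step at which usage reaches `capacity`, else None."""
    for step, value in enumerate(holt_forecast(series, **kwargs), start=1):
        if value >= capacity:
            return step
    return None

# Hypothetical daily heap usage in MB (made-up, steadily rising numbers).
usage_mb = [100 + 2 * day for day in range(10)]
print(holt_forecast(usage_mb, horizon=3))
print(steps_until_full(usage_mb, capacity=131))
```

A persistently positive trend over multiple days, or a forecast that reaches heap capacity within the horizon, marks a node as a candidate for a memory "leak" to be confirmed with the application team.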
- 14. Wrapping up …
Shrek – Tool for GC analysis in the cloud
o Statistical analysis
o Detect performance regression
o “Bad”/outlier nodes detection
o Characterize performance of Red/Black deployments
o Memory “leak” detection
Future work
o Integrate with Hive/… so that GC logs are pulled from production nodes only once
o Support advanced analytics to guide tuning of GC parameters
- 15. Q&A