A Tool for Practical Garbage Collection Analysis In the Cloud

Arun Kejariwal, Statistical Learning Principal at Machine Zone, Inc.
A Tool for Practical Garbage Collection Analysis
                       In the Cloud


                            Arun Kejariwal

                                March 2013




1                 International Conference on Cloud Engineering 2013   © Arun Kejariwal
Overview

      Cloud computing becoming ubiquitous
       o  SaaS, PaaS, IaaS
       o  Market size of 65 to 85 billion by 2015 [McKinsey]


      IaaS
       o  Large adoption
                Higher scalability, Lower cost, Reduced time-to-market
       o  Examples
                Zynga, Netflix, PBS, Foursquare, …
       o  Growing vendors
                AWS, Google Compute Engine, Azure, Rackspace



      Java-based web applications
       o  GC impacts application performance in a significant way
                For example: [Zhao et al., OOPSLA’09]
                100s of papers published on memory management in languages such as Java
                  [“The Garbage Collection Bibliography,” http://www.cs.kent.ac.uk/people/staff/rej/gcbib/gcbib.pdf]

GC Analysis in the Cloud: Why Bother?

      User Experience
        o  Latency, Throughput

      Application-driven selection of GC Type

      Performance evaluation of new JVM
        o  JVM 7
              G1 collector, New optimizations such as escape analysis


      Capacity Planning
        o  Operational Efficiency
        o  For example, on AWS




Key Contributions

      Tool – called Shrek – for GC analysis in the cloud

        o  Cluster with over 100 nodes
      Features
        o  Driven by actual needs of the various application teams
        o  Focus on simplicity
               Deployed in production
               Solution of the winner of the Netflix Prize was very academic and not deployable in production
        o  Outlier detection
               Detecting “bad” nodes via unsupervised learning
        o  Detect performance regressions via time series analysis
               Performance impact of new features
               Red/Black deployments
        o  Characterize performance during A/B (bucket) testing
        o  Detect memory “leaks”

GC: Quick review

      Generational garbage collector




        o  Objects are first allocated to Young Gen (YG)
        o  Objects whose age exceeds a given threshold are promoted to Old Gen (OG)
      GC Type
        o  Parallel
        o  CMS
        o  Recent: G1
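
The promotion rule above can be sketched as a toy simulation. This is a simplified illustration, not how any particular JVM implements tenuring: the names `Obj`, `minor_gc`, and `TENURING_THRESHOLD`, and the threshold value of 3, are hypothetical, and real collectors also promote early when survivor space overflows.

```python
from dataclasses import dataclass

# Hypothetical threshold: minor GCs an object may survive while staying in Young Gen.
TENURING_THRESHOLD = 3


@dataclass
class Obj:
    name: str
    age: int = 0  # number of minor collections survived so far


def minor_gc(young, old, live_names, threshold=TENURING_THRESHOLD):
    """Run one simulated minor GC and return the new Young Gen list."""
    survivors = []
    for obj in young:
        if obj.name not in live_names:
            continue  # unreachable object: reclaimed
        obj.age += 1
        if obj.age > threshold:
            old.append(obj)  # age exceeds the threshold: promoted to Old Gen
        else:
            survivors.append(obj)
    return survivors


young = [Obj("cache"), Obj("request")]
old = []
for _ in range(4):
    # Only "cache" stays reachable; "request" dies in the first collection.
    young = minor_gc(young, old, live_names={"cache"})

print([o.name for o in old])  # long-lived object ends up in Old Gen
```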

What About Using Existing Tools?

      AppDynamics
      GCHisto, GCViewer, PrintGCStats, JConsole

      Common limitations
       o  Absence of support for analyzing GC performance of a cluster of nodes
              Tailored for a single Java process
       o  Lack of statistical analysis
              Mean
              Standard deviation
              Trend analysis
              k-Nearest Neighbor for outlier detection
       o  Lack of support for G1 GC
       o  Most tools are no longer maintained




Shrek: Analyzing Heap Usage

      Why bother?
       o  High performance variability in the cloud [Iosup et al., CCG, 2011]

       o  Potential reasons
            o  Nodes going bad [Hoelzle and Barroso 2009], [Dai et al.], [Vishwanath and Nagappan, SoCC, 2010]

            o  Multi-tenancy

            o  Load balancer issues
                    AWS ELB issues on Dec 24, 2012 [http://aws.amazon.com/message/680587/]


            o  A/B Testing

            o  Cascading effects in a SOA

            o  Failover from another availability zone



Shrek: Analyzing Heap Usage (contd.)

      Detect “bad”/outlier nodes
        o  Terminate them and spin up replacements
        o  Early detection minimizes customer impact
        o  Example total heap usage time series output obtained via Shrek




Shrek: Analyzing Heap Usage (contd.)

      Detect outliers
        o  k-NN unsupervised learning
        [Figure: per-node score, 10^-4 / (Avg Young Generation Use * Std Dev), plotted against node index (x-axis "Node", 0 to 30); plotted values range from 0.294 to 3.953]
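
A minimal sketch of k-NN based outlier detection on a per-node score follows. It is not Shrek's implementation: the function names, the choice of k = 2, the cutoff of 3 times the median k-NN distance, and the sample scores are all assumptions made for illustration.

```python
from statistics import median


def knn_distance(values, k):
    """Distance from each value to its k-th nearest neighbour (1-D case)."""
    out = []
    for i, v in enumerate(values):
        dists = sorted(abs(v - w) for j, w in enumerate(values) if j != i)
        out.append(dists[k - 1])
    return out


def knn_outliers(values, k=2, factor=3.0):
    """Indices whose k-NN distance exceeds factor * median k-NN distance."""
    kd = knn_distance(values, k)
    cutoff = factor * median(kd)  # hypothetical cutoff rule
    return [i for i, d in enumerate(kd) if d > cutoff]


# Illustrative per-node scores (not taken from the talk's data).
scores = [3.9, 3.8, 3.7, 3.6, 3.5, 3.4, 0.3, 0.35]
print(knn_outliers(scores))  # indices of nodes far from their neighbours
```

Using k = 2 rather than k = 1 matters here: the two low-scoring nodes are close to each other, so their nearest-neighbour distance alone would look normal.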
Shrek: Analyzing Heap Usage (contd.)

       Old Gen usage
         o  Driven by promotion rate
         o  Promotion rate may vary across nodes
               A/B testing




       Shrek also reports the YG usage time series
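
One common way to estimate how much a minor GC promoted is to compare the space freed in Young Gen with the space freed in the whole heap; the difference moved into Old Gen. The sketch below assumes the before/after occupancy figures have already been parsed from GC logs. The function name and the sample numbers are hypothetical, and this is not necessarily how Shrek computes the metric.

```python
def promoted_kb(young_before, young_after, heap_before, heap_after):
    """Estimate KB promoted to Old Gen by one minor GC."""
    young_freed = young_before - young_after
    heap_freed = heap_before - heap_after
    # Space that left Young Gen without leaving the heap was promoted.
    return young_freed - heap_freed


# Hypothetical minor GC records: (YG before, YG after, heap before, heap after) in KB.
events = [
    (1000, 100, 5000, 4200),
    (1000, 120, 5200, 4400),
]
per_gc = [promoted_kb(*e) for e in events]
avg_promotion_kb = sum(per_gc) / len(per_gc)
print(per_gc, avg_promotion_kb)
```

Comparing `avg_promotion_kb` across nodes is one way to spot the per-node variation in promotion rate mentioned above.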
Shrek: Analyzing Pause Times

       Pause time analysis
        o  Data distribution of GC pause times
        o  Histogram plots supported by Shrek
              Initial Mark
              Remark
              Full GC Times
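
A pause-time histogram of the kind described above can be sketched with fixed-width bins. The bin width, the function name, and the sample pause times are assumptions for illustration only.

```python
from collections import Counter


def pause_histogram(pauses_ms, bin_width_ms=50):
    """Count pauses per fixed-width bin; keys are bin lower bounds in ms."""
    counts = Counter(int(p // bin_width_ms) * bin_width_ms for p in pauses_ms)
    return dict(sorted(counts.items()))


# Hypothetical Remark pause times in milliseconds.
remark_pauses_ms = [12, 48, 51, 75, 99, 130, 420]
hist = pause_histogram(remark_pauses_ms)

# Text rendering: one '#' per pause in the bin.
for lower, n in hist.items():
    print(f"{lower:>4}-{lower + 50:<4} ms  {'#' * n}")
```

A long tail, such as the single pause in the 400 ms bin here, is the kind of feature a mean alone would hide.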




Shrek: Summary Report

       Metrics reported for each node
         o  Minor GC count
         o  # Failures (concurrent mode failures) and Failure Time
                 Not reported by any existing tool
         o    Initial Mark and Remark
         o    Average and Max YG (s)
         o    Average and Max Full GC (s)
         o    Average Promotion (MB)
                 Not reported by any existing tool

       Summary report integrated with the in-house alerting system
         o  Assist in triaging production issues
       Recap
         o  Existing tools do not support GC analysis across an entire cluster
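
A per-node summary across a cluster might be assembled along these lines. The event representation, field names, and node names are hypothetical; the sketch covers only a few of the metrics listed above and does not reproduce Shrek's report format.

```python
from statistics import mean


def summarize(node_events):
    """Summarize one node; node_events is a list of (kind, pause_seconds)."""
    minor = [p for kind, p in node_events if kind == "minor"]
    full = [p for kind, p in node_events if kind == "full"]
    return {
        "minor_gc_count": len(minor),
        "avg_yg_s": mean(minor) if minor else 0.0,
        "max_yg_s": max(minor, default=0.0),
        "avg_full_gc_s": mean(full) if full else 0.0,
        "max_full_gc_s": max(full, default=0.0),
    }


# Hypothetical GC events already parsed from each node's log.
cluster = {
    "node-a": [("minor", 0.02), ("minor", 0.04), ("full", 1.5)],
    "node-b": [("minor", 0.03)],
}
report = {node: summarize(events) for node, events in cluster.items()}

for node, row in report.items():
    print(node, row)
```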



Shrek: Detecting Memory “Leaks”

       Time series analysis of heap usage
        o  Upward sloping over multiple days
              Potential memory “leak”
        o  Predict heap usage trend
              Holt-Winters method for prediction


       Example from production
        o  Upward sloping
            o  Verified “leak” with the application team
        o  Orange region
              80% prediction interval
        o  Yellow region
              95% prediction interval
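
The trend-prediction idea can be illustrated with Holt's linear (double exponential) smoothing, which is the Holt-Winters method without its seasonal component. This is a simplified stand-in, not the talk's implementation: the smoothing parameters, function name, and sample series are assumptions, and prediction intervals are not computed.

```python
def holt_forecast(series, alpha=0.5, beta=0.5, horizon=3):
    """Holt's linear smoothing; returns (final trend, list of forecasts)."""
    # Common initialization: first value as level, first difference as trend.
    level, trend = series[0], series[1] - series[0]
    for y in series[1:]:
        prev_level = level
        level = alpha * y + (1 - alpha) * (level + trend)
        trend = beta * (level - prev_level) + (1 - beta) * trend
    return trend, [level + h * trend for h in range(1, horizon + 1)]


# Hypothetical daily post-GC heap usage in MB, rising steadily.
heap_mb = [100, 110, 120, 130, 140]
trend, forecasts = holt_forecast(heap_mb)

# A persistently positive trend over several days is a hint, not proof, of a leak.
possible_leak = trend > 0
print(trend, forecasts, possible_leak)
```

A flagged trend still needs confirmation from the application team, as in the production example above.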




Wrapping up …

       Shrek – Tool for GC analysis in the cloud
         o    Statistical analysis
         o    Detect performance regression
         o    “Bad”/outlier nodes detection
         o    Characterize performance of Red/Black deployments
         o    Memory “leak” detection




       Future work
         o  Integrate with Hive/… to limit pulling GC logs from production nodes to once only
         o  Support advanced analytics to guide tuning of GC parameters




Q&A




Powerful Google developer tools for immediate impact! (2023-24)Powerful Google developer tools for immediate impact! (2023-24)
Powerful Google developer tools for immediate impact! (2023-24)
wesley chun10 views
Special_edition_innovator_2023.pdf by WillDavies22
Special_edition_innovator_2023.pdfSpecial_edition_innovator_2023.pdf
Special_edition_innovator_2023.pdf
WillDavies2218 views
"Running students' code in isolation. The hard way", Yurii Holiuk by Fwdays
"Running students' code in isolation. The hard way", Yurii Holiuk "Running students' code in isolation. The hard way", Yurii Holiuk
"Running students' code in isolation. The hard way", Yurii Holiuk
Fwdays17 views
Serverless computing with Google Cloud (2023-24) by wesley chun
Serverless computing with Google Cloud (2023-24)Serverless computing with Google Cloud (2023-24)
Serverless computing with Google Cloud (2023-24)
wesley chun11 views

A Tool for Practical Garbage Collection Analysis In the Cloud

  • 1. A Tool for Practical Garbage Collection Analysis In the Cloud
      Arun Kejariwal
      March 2013
      International Conference on Cloud Engineering 2013, © Arun Kejariwal
  • 2. Overview
      Cloud computing becoming ubiquitous
       o  SaaS, PaaS, IaaS
       o  Market size of 65 to 85 billion by 2015 [McKinsey]
      IaaS
       o  Large adoption
                Higher scalability, lower cost, reduced time-to-market
       o  Examples
                Zynga, Netflix, PBS, Foursquare, …
       o  Growing vendors
                AWS, Google Compute Engine, Azure, Rackspace
      Java-based web applications
       o  GC impacts application performance in a significant way
                For example: [Zhao et al., OOPSLA’09]
                100s of papers published on memory management in languages such as Java
                  [“The Garbage Collection Bibliography,” http://www.cs.kent.ac.uk/people/staff/rej/gcbib/gcbib.pdf]
  • 3. GC Analysis in the Cloud: Why Bother?
      User Experience
       o  Latency, Throughput
      Application-driven selection of GC Type
      Performance evaluation of new JVM
       o  JVM 7
                G1 collector, new optimizations such as escape analysis
      Capacity Planning
       o  Operational Efficiency
       o  For example, on AWS
  • 4. Key Contributions
      Tool – called Shrek – for GC analysis in the cloud
       o  Cluster with over 100 nodes
      Features
       o  Driven by actual needs of the various application teams
       o  Focus on simplicity
                Deployed in production
                Solution of the winner of the Netflix Prize was very academic and not deployable in production
       o  Outlier detection
                Detecting “bad” nodes via unsupervised learning
       o  Detect performance regressions via time series analysis
                Performance impact of new features
                Red/Black deployments
       o  Characterize performance during A/B (bucket) testing
       o  Detect memory “leaks”
  • 5. GC: Quick review
      Generational garbage collector
       o  Objects are first allocated in Young Gen (YG)
       o  Objects whose age exceeds a given threshold are promoted to Old Gen (OG)
      GC Type
       o  Parallel
       o  CMS
       o  Recent: G1
  • 6. What About Using Existing Tools?
      AppDynamics
      GCHisto, GCViewer, PrintGCStats, JConsole
      Common limitations
       o  Absence of support for analyzing GC performance of a cluster of nodes
                Tailored for a single Java process
       o  Lack of statistical analysis
                Mean k-Nearest Neighbor for outlier detection
                Standard deviation
                Trend analysis
       o  Lack of support for G1 GC
       o  Most tools are no longer maintained
  • 7. Shrek: Analyzing Heap Usage   Why bother? o  High performance variability in the cloud [Iosup et. al, CCG, 2011] o  Potential reasons o  Nodes going bad [Hoelzle and Barroso 2009], [Dai et al.], [Vishwanath and Nagappan, SoCC, 2010] o  Multi-tenancy o  Load balancer issues   AWS ELB issues on Dec 24, 2012 [http://aws.amazon.com/message/680587/] o  A/B Testing o  Cascading effects in a SOA o  Failover from another availability zone 7 International Conference on Cloud Engineering 2013 © Arun Kejariwal
  • 8. Shrek: Analyzing Heap Usage (contd.)   Detect “bad”/outlier nodes o  Terminate and spring up new ones o  Early detection results in minimum customer impact o  Example total heap usage time series output obtained via Shrek 8 International Conference on Cloud Engineering 2013 © Arun Kejariwal
  • 9. Shrek: Analyzing Heap Usage (contd.)   Detect outliers o  k-NN unsupervised learning 3.9513.953 4 3.764 3.772 3.731 3.697 3.581 3.574 3.563 3.539 3.528 3.467 3.419 3.396 3.394 3.372 3.36 3.225 3.247 10−4/(Avg Young Generation Use * Std Dev) 3.131 3 2.204 2.09 1.97 2 1.885 1.893 1.829 1.705 1.696 1.649 1.561 1.395 1 0.332 0.294 0 0 5 10 15 20 25 30 Node 9 International Conference on Cloud Engineering 2013 © Arun Kejariwal
  • 10. Shrek: Analyzing Heap Usage (contd.)
      Old Gen usage
       o  Driven by promotion rate
       o  Promotion rate may vary across nodes
                A/B testing
      Shrek also reports the YG usage time series
  • 11. Shrek: Analyzing Pause Times
      Pause time analysis
       o  Data distribution of GC pause times
       o  Histogram plots supported by Shrek
                Initial Mark
                Remark
                Full GC Times
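A pause-time histogram of the kind slide 11 mentions amounts to bucketing pause durations. The sketch below assumes the pauses have already been extracted from the GC logs as seconds; the function name, the 50 ms bucket width, and the sample values are hypothetical and do not come from the slides.

```python
from collections import Counter


def pause_histogram(pauses_s, bucket_ms=50):
    """Count pauses per bucket; keys are the lower bucket bound in milliseconds."""
    counts = Counter((int(p * 1000) // bucket_ms) * bucket_ms for p in pauses_s)
    return dict(sorted(counts.items()))


# Hypothetical remark pause times in seconds, for illustration only.
sample_pauses = [0.012, 0.031, 0.048, 0.075, 0.120, 0.430]

if __name__ == "__main__":
    for lower, count in pause_histogram(sample_pauses).items():
        print(f"{lower:>4}-{lower + 50:<4} ms  {'#' * count}")
```

Running one histogram per pause type (Initial Mark, Remark, Full GC) keeps short and long phases from hiding each other on a shared scale.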
  • 12. Shrek: Summary Report
      Metrics reported for each node
       o  Minor GC count
       o  # Failures (concurrent mode failures) and Failure Time
                Not reported by any existing tool
       o  Initial Mark and Remark
       o  Average and Max YG (s)
       o  Average and Max Full GC (s)
       o  Average Promotion (MB)
                Not reported by any existing tool
      Summary report integrated with the in-house alerting system
       o  Assist in triaging production issues
      Recap
       o  Existing tools do not support GC analysis across an entire cluster
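The slides do not say how Shrek derives Average Promotion. One common way to estimate it from minor-GC occupancy figures is sketched below: during a minor GC the young generation shrinks by the reclaimed plus the promoted bytes, while the total heap shrinks only by the reclaimed bytes, so the difference is the promoted volume. The function names and the sample events are hypothetical.

```python
from statistics import mean


def promoted_kb(young_before, young_after, heap_before, heap_after):
    """Estimate KB promoted to Old Gen by one minor GC.

    promoted = (young gen decrease) - (total heap decrease)
    """
    return (young_before - young_after) - (heap_before - heap_after)


def average_promotion_mb(events):
    """Average promotion per minor GC in MB over (yb, ya, hb, ha) tuples in KB."""
    return mean(promoted_kb(*e) for e in events) / 1024


# Hypothetical minor-GC events: (young before, young after, heap before, heap after), in KB.
sample_events = [
    (8000, 1000, 20000, 14000),
    (8192, 512, 22192, 15536),
]

if __name__ == "__main__":
    print(f"{average_promotion_mb(sample_events):.3f} MB per minor GC")
```

In the first sample event the young generation drops by 7000 KB and the heap by 6000 KB, so an estimated 1000 KB moved to Old Gen.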
  • 13. Shrek: Detecting Memory “Leaks”
      Time series analysis of heap usage
       o  Upward sloping over multiple days
                Potential memory “leak”
       o  Predict heap usage trend
                Holt Winters method for prediction
      Example from production
       o  Upward sloping
       o  Verified “leak” with the application team
       o  Orange region
                80% prediction level
       o  Yellow region
                95% prediction level
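The trend prediction can be illustrated with Holt's linear method, which is the Holt Winters method without its seasonal component. This is a simplified stand-in for what the slide names, not Shrek's code: the smoothing constants, function names, and daily heap figures are assumptions for the example, and the 80% and 95% prediction regions shown on the slide are not computed here.

```python
def holt_linear_forecast(series, alpha=0.5, beta=0.3, horizon=3):
    """Forecast `horizon` steps ahead using Holt's linear (double) smoothing."""
    if len(series) < 2:
        raise ValueError("need at least two observations")
    level = series[0]
    trend = series[1] - series[0]
    for y in series[1:]:
        prev_level = level
        # Blend the new observation with the previous one-step-ahead forecast.
        level = alpha * y + (1 - alpha) * (level + trend)
        # Blend the latest level change with the previous trend estimate.
        trend = beta * (level - prev_level) + (1 - beta) * trend
    return [level + h * trend for h in range(1, horizon + 1)]


def first_breach(forecast, limit):
    """Return the 1-based forecast step that first exceeds `limit`, or None."""
    for step, value in enumerate(forecast, start=1):
        if value > limit:
            return step
    return None


# Hypothetical daily post-GC heap usage in MB, sloping upward.
daily_heap_mb = [100, 110, 120, 130, 140]

if __name__ == "__main__":
    forecast = holt_linear_forecast(daily_heap_mb)
    print(forecast, first_breach(forecast, limit=155))
```

A forecast that keeps rising toward the configured heap limit across several days is the kind of signal that would be taken to the application team for confirmation, as the production example on the slide describes.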
  • 14. Wrapping up …
      Shrek – Tool for GC analysis in the cloud
       o  Statistical analysis
       o  Detect performance regression
       o  “Bad”/outlier nodes detection
       o  Characterize performance of Red/Black deployments
       o  Memory “leak” detection
      Future work
       o  Integrate with Hive/… to limit pulling GC logs from production nodes to once only
       o  Support advanced analytics to guide tuning of GC parameters
  • 15. Q&A