Machine Learning for automated diagnosis of distributed ...AE

Machine Learning for Automated Diagnosis of Distributed Systems Performance
Ira Cohen, HP-Labs, June 2006
http://www.hpl.hp.com/personal/Ira_Cohen

Intersection of systems and ML/Data mining: Growing (research) area

SLIC project at HP-Labs: Statistical Learning, Inference and Control
I’ll focus today on performance diagnosis

Intuition: Why is performance diagnosis hard?

Why care about performance?

Challenges today in diagnosing/forecasting IT performance problems

Translation to Machine Learning Challenges

Outline

Example: A real distributed HP application architecture
Geographically distributed 3-tier application
Results shown today are from the last 19+ months of data collected from this service

Application performance “management”: Service Level Objectives (SLO)
Unhealthy = SLO violation

Detection is not enough…

Challenge 1: Transforming data to information…
Where is the relevant information?

ML Approach: Model using Classifiers
F(M, SLO)

But we need an explanation, not just classification accuracy...
Our approach: Learn the joint probability distribution P(M,SLO) (Bayesian network classifiers)
Inferences (“metric attribution”) using P(M|SLO):
Normal: Metric has a value associated with healthy behavior
Abnormal: Metric has a value associated with unhealthy behavior
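The attribution rule on this slide can be sketched in a few lines. This is a minimal sketch under stated assumptions: it uses a naive Bayes model with one Gaussian per metric and SLO state, whereas the slides only say that Bayesian network classifiers are used. The function names, metric names, and sample values are all hypothetical.

```python
import math
from statistics import mean, pstdev


def fit_gaussians(samples):
    """Fit one (mean, std) pair per metric from a list of metric dicts."""
    params = {}
    for name in samples[0]:
        values = [s[name] for s in samples]
        # Floor the std so a constant metric does not cause division by zero.
        params[name] = (mean(values), max(pstdev(values), 1e-6))
    return params


def log_pdf(x, mu, sigma):
    """Log density of a Gaussian N(mu, sigma^2) at x."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)


def attribute(epoch, healthy, unhealthy):
    """Mark a metric 'abnormal' when its value is more likely under the
    unhealthy (SLO violation) model than under the healthy one."""
    labels = {}
    for name, x in epoch.items():
        ll_bad = log_pdf(x, *unhealthy[name])
        ll_ok = log_pdf(x, *healthy[name])
        labels[name] = "abnormal" if ll_bad > ll_ok else "normal"
    return labels


# Hypothetical training epochs, split by SLO state.
healthy_epochs = [{"db_cpu": 20, "app_cpu": 30}, {"db_cpu": 25, "app_cpu": 35},
                  {"db_cpu": 22, "app_cpu": 32}]
violation_epochs = [{"db_cpu": 90, "app_cpu": 31}, {"db_cpu": 85, "app_cpu": 36},
                    {"db_cpu": 95, "app_cpu": 33}]

healthy_model = fit_gaussians(healthy_epochs)
unhealthy_model = fit_gaussians(violation_epochs)
result = attribute({"db_cpu": 88, "app_cpu": 31}, healthy_model, unhealthy_model)
print(result)
```

In this toy data only the database CPU metric separates the two SLO states, so it is the one flagged as abnormal, which is the kind of explanation a bare classifier output does not give.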

Bayesian network classifiers: Results
[Figure: Bayesian network over the nodes SLO State, M3, M30, M32, M5, M8]

Additional issues

Challenge 2: Adaptation
Learning with “concept drift”
Different? Same problem?

Adaptation: Possible approaches

Our approach: Managing an ensemble of models for our classification approach
Construction
Inference: Use Brier score for selection of models
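Model selection by Brier score can be illustrated as follows. This is a sketch, not the project's implementation: the Brier score itself is the standard mean squared difference between predicted probabilities and 0/1 outcomes, but the window of recent epochs, the toy models, and all names are assumptions made for the example.

```python
def brier_score(probs, outcomes):
    """Mean squared difference between predicted probabilities and 0/1 outcomes."""
    if len(probs) != len(outcomes) or not probs:
        raise ValueError("need equal-length, non-empty sequences")
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


def select_model(ensemble, recent_metrics, recent_outcomes):
    """Score every model on the recent epochs and return the best one
    (lowest Brier score) together with all scores."""
    scores = {
        name: brier_score([predict(m) for m in recent_metrics], recent_outcomes)
        for name, predict in ensemble.items()
    }
    best = min(scores, key=scores.get)
    return best, scores


# Hypothetical ensemble: each model maps a CPU utilization to P(SLO violation).
ensemble = {
    "model_cpu_low": lambda cpu: 0.9 if cpu > 50 else 0.1,
    "model_cpu_high": lambda cpu: 0.9 if cpu > 80 else 0.1,
}
recent_metrics = [40, 60, 85, 90]  # one value per recent epoch
recent_outcomes = [0, 0, 1, 1]     # 1 = SLO violation observed

best, scores = select_model(ensemble, recent_metrics, recent_outcomes)
print(best, scores)
```

A lower score is better, so the model whose probabilities track the recent SLO states most closely is the one used for inference.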

Adaptation: Results
Approach                                                Accuracy (%)   Total Processing Time (mins)
Single model: No Adaptation                             61.4           0.2
Single model trained with all history (no forgetting)   82.4           71.5
Ensemble of Models                                      90.7           7.1
Single model with sliding window                        84.2           0.9

Adaptation: Result

Additional issues

Challenge 3: Leveraging history
Diagnosis: Stuck thread due to insufficient database connections
Repair: Increase connections to +6
Periods: ...
Severity: SLO time increases up to 10 secs
Location: Americas. Not seen in Asia/Pacific

Leveraging history

Our approach to defining signatures
1) Learn probabilistic classifiers: models P(SLO,M)
2) Inferences: Metric attribution
3) Define these as signatures of the problems
[Figure: for an unhealthy epoch, the models yield abnormal metrics such as DB cpu util, app active proc, app alive proc, and app cpu util (marked "high" in the figure)]
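The three steps above can be sketched as a small encoding function. The slide does not spell out the vector encoding, so the +1 / -1 / 0 scheme below is an assumption, as are the function name and the two extra metrics (`net_in`, `disk_io`) added for illustration.

```python
def build_signature(all_metrics, model_metrics, abnormal_metrics):
    """Encode one epoch as a vector over all_metrics:
    +1 = used by the model and attributed (abnormal),
    -1 = used by the model but not attributed (normal),
     0 = not used by the model."""
    signature = []
    for name in all_metrics:
        if name not in model_metrics:
            signature.append(0)
        elif name in abnormal_metrics:
            signature.append(1)
        else:
            signature.append(-1)
    return signature


# Metric names loosely follow the slide; net_in and disk_io are hypothetical.
all_metrics = ["db_cpu_util", "app_active_proc", "app_alive_proc",
               "app_cpu_util", "net_in", "disk_io"]
model_metrics = {"db_cpu_util", "app_active_proc", "app_alive_proc",
                 "app_cpu_util", "net_in"}
abnormal_metrics = {"db_cpu_util", "app_active_proc", "app_alive_proc",
                    "app_cpu_util"}

signature = build_signature(all_metrics, model_metrics, abnormal_metrics)
print(signature)
```

A fixed-length vector like this is easy to store, compare, and cluster, which is what makes it usable as an index into past problems.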

Example: Defining a signature
[Table with an Attribution column]

Results: With signatures…
Diagnosis: Stuck thread due to insufficient database connections
Repair: Increase connections to +6
Periods: ...
Severity: SLO time increases up to 10 secs
Location: Americas. Not seen in Asia/Pacific

Results: Retrieval accuracy
Retrieval of "Stuck Thread" problem
Top 100: 92 vs 51
[Figure: precision-recall curves compared against the ideal P-R curve]
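A "top k" figure like the one on this slide can be read as precision at k: of the k stored signatures closest to a query, how many carry the right annotation. The sketch below shows that computation, assuming Euclidean distance over signature vectors; the distance measure, the tiny database, and the labels are hypothetical and are not taken from the slides.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def retrieve(query, database, k):
    """Return the labels of the k stored signatures closest to the query.
    sorted() is stable, so ties keep their database order."""
    ranked = sorted(database, key=lambda item: euclidean(query, item[0]))
    return [label for _, label in ranked[:k]]


def precision_at_k(retrieved_labels, true_label):
    """Fraction of retrieved items whose annotation matches the true problem."""
    if not retrieved_labels:
        return 0.0
    hits = sum(1 for label in retrieved_labels if label == true_label)
    return hits / len(retrieved_labels)


# Hypothetical annotated signatures from past incidents.
database = [
    ([1, 1, -1, 0], "stuck_thread"),
    ([1, 1, -1, -1], "stuck_thread"),
    ([-1, -1, 1, 1], "db_overload"),
    ([1, -1, -1, 0], "stuck_thread"),
    ([-1, 1, 1, 1], "db_overload"),
]
query = [1, 1, -1, 0]

p3 = precision_at_k(retrieve(query, database, 3), "stuck_thread")
p4 = precision_at_k(retrieve(query, database, 4), "stuck_thread")
print(p3, p4)
```

Sweeping k from 1 to the database size and recording precision and recall at each step gives a precision-recall curve of the kind the figure compares against the ideal.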

Results: With signatures we can also…

Challenge 4: Combining multiple data sources

Properties of logs

Our approach: Processing application error-logs
2006-02-26T00:00:06.461 ES_Domain:ES_hpat615_01:2257913:Thread43.ES82|commandchain.BaseErrorHandler.logException()|FUNCTIONAL|0||FatalException occurred type=com.hp.es.service.productEntitlement.knight.logic.access.KnightIOException, message=Connection timed out, class=com.hp.es.service.productEntitlement.knight.logic.RequestKnightResultMENUCommand
2006-02-26T00:00:06.465 ES_Domain:ES_hpat615_01:22579163:Thread-43.ES82|com.hp.es.service.productEntitlement.combined.errorhandling.DefaultAlwaysEIAErrorHandlerRed.handleException()|FATAL|2706||KNIGHT system unavailable: java.io.IOException
2006-02-26T00:00:06.465 ES_Domain:ES_hpat615_01:22579163:Thread-43.ES82|com.hp.es.service.productEntitlement.combined.errorhandling.DefaultAlwaysEIAErrorHandlerRed.handleException()|FATAL|0||com.hp.es.service.productEntitlement.knight.logic.RequestKnightResultMENUCommand message: Connection timed out causing exception type: java.io.IOException KNIGHT URL accessed: http://vccekntpro.cce.hp.com/knight/knightwarrantyservice.asmx
2006-02-26T00:00:06.466 ES_Domain:ES_hpat615_01:22579163:Thread-43.ES82|com.hp.es.service.productEntitlement.combined.errorhandling.DefaultAlwaysEIAErrorHandlerRed.handleException()|FATAL|0||com.hp.es.service.productEntitlement.knight.logic.access.KnightIOException: Connection timed out
2006-02-26T00:00:08.279 ES_Domain:ES_hpat615_01:22579163:ExecuteThread: '16' for 'weblogic.kernel.Default'.ES82|com.hp.es.service.productEntitlement.combined.MergeAllStartedThreadsCommand.setWaitingFinished()|WARNING|3709||
2006-02-26T00:00:08.279 ES_Domain:ES_hpat615_01:22579163:ExecuteThread: '16' for
2006-02-26T00:00:06.465 ES_Domain:ES_hpat615_01:22579163:Thread-43.ES82|com.hp.es.service.productEntitlement.combined.errorhandling.DefaultAlwaysEIAErrorHandlerRed.handleException()|FATAL|0||com.hp.es.service.productEntitlem
Over 4,000,000 error log entries
200,000+ distinct error messages
Similarity-based Sequential Clustering
190 “feature messages”
Use count of appearances over 5-minute intervals of the feature messages as metrics for learning
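The log pipeline above (cluster similar messages, then count each cluster per 5-minute interval) can be sketched as follows. The slides do not say which similarity measure or threshold is used, so `difflib.SequenceMatcher` and the 0.8 threshold are stand-in assumptions, and the log messages are simplified, hypothetical examples rather than real entries.

```python
from collections import Counter
from datetime import datetime
from difflib import SequenceMatcher


def assign_cluster(message, representatives, threshold=0.8):
    """Sequential clustering step: join the first cluster whose representative
    is similar enough, otherwise start a new cluster with this message."""
    for i, rep in enumerate(representatives):
        if SequenceMatcher(None, rep, message).ratio() >= threshold:
            return i
    representatives.append(message)
    return len(representatives) - 1


def bucket_start(ts, minutes=5):
    """Round a timestamp down to the start of its 5-minute interval."""
    return ts.replace(minute=ts.minute - ts.minute % minutes,
                      second=0, microsecond=0)


def count_features(entries):
    """Count appearances of each message cluster per interval.
    entries: iterable of (timestamp string, message text)."""
    representatives = []
    counts = Counter()
    for stamp, message in entries:
        ts = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%S.%f")
        cluster = assign_cluster(message, representatives)
        counts[(bucket_start(ts), cluster)] += 1
    return counts, representatives


# Hypothetical, simplified log entries.
entries = [
    ("2006-02-26T00:00:06.461", "Connection timed out to host A"),
    ("2006-02-26T00:03:10.000", "Connection timed out to host B"),
    ("2006-02-26T00:04:59.999", "Disk usage reached 100%"),
    ("2006-02-26T00:05:00.000", "Connection timed out to host A"),
]
counts, representatives = count_features(entries)
for key, n in sorted(counts.items()):
    print(key, n)
```

Each (interval, cluster) count then acts as one more metric column alongside the system metrics, so the same classifiers can learn from both sources.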

Learning Probabilistic Models
[Figure: PDF over # of appearances]

Results: Adding log-based metrics
From Operator Incident Report: Diagnosis and Solution: Unable to start SWAT wrapper. Disk usage reached 100%. Cleaned up disk and restarted the wrapper…
From Application Error Log: CORBA access failure: IDL:hpsewrapper/SystemNotAvailableException:… com.hp.es.wrapper.corba.hpsewrapper.SystemNotAvailableException

Challenge 5: Scaling up Machine Learning techniques

Challenge 5: Possible approaches

Example: Diagnosis with Multiple Instances
[Figure: instances labeled A through H]

Metric Exchange: Does it help?
[Figure: online prediction vs. time epoch for Instance 1 and Instance 2, marking violation detection with and without model exchange and a false alarm]

Model Exchange: Does it help?
[Figure: online prediction vs. time epoch, marking violation detection and false alarms with and without model exchange]
Models imported from other instances improve accuracy

Providing diagnosis as a web service: SLIC’s IT-Rover
[Figure: architecture with Monitored Services, Metrics/SLO Monitoring, Signature construction engine, Signature DB, Clustering engine, Retrieval engine, Admin]

Discussion: Additional issues, opportunities, and challenges

Summary

