PAD: Performance Anomaly
Detection in Multi-Server
Distributed Systems
Manjula Peiris, James H. Hill,
Jorgen Thelin, Gabriel Kliot, Sergey Bykov & Christian Konig
The 7th IEEE International Conference on Cloud
Computing (IEEE CLOUD 2014) 28TH JUN – 2ND JUL,
ALASKA, USA
Multi-Server Distributed Systems
[Figure: two data centers, each with servers, and clients]
Identifying performance anomalies and their
root causes is very important
• Data centers contain thousands
of machines
• Highly Scalable
• Need to achieve high
throughputs with low latencies
Manual analysis
• Time consuming
• Thousands of logs to inspect
• Error prone
The performance anomaly detection process needs to be
automated, or at least semi-automated!
Goals of Performance Anomaly Detection
(PAD)
• For Distributed System developers and users:
– Get insights from performance data about performance
issues and optimizations of the system
– Minimize developer time required to analyze large
amounts of performance data generated across hundreds
to thousands of servers
– Assist in troubleshooting performance related issues and
finding root causes
– Production Service Intelligence (Use of statistical
techniques in the context of performance analysis)
Application Domain: Orleans Project
• Distributed Actor
Programming Model
• Provides a programming
model and a runtime to
construct highly scalable
distributed systems
• Need to achieve high
throughputs and low latencies
• Difficult to detect
performance anomalies (typical
deployment contains hundreds
of servers)
http://research.microsoft.com/en-us/projects/orleans/
Some example performance issues
from Orleans Project
• Stuck random number generator
– Lower throughput and a higher number of failed requests
– Diagnosis requires comparing performance data in different
server logs at different time points
– Key: the time point at which the most requests time out
– Root cause: a thread-unsafe random number generator
• Leaking buffer pool
– Lower throughput with higher response times
– Key: correlation between performance degradation and a
memory issue
– Root cause: a memory leak in a custom memory allocator
Performance Counter Data
• Key value pairs stored in logs
• Tracks specific system states, system resource usages (e.g. CPU, memory)
• Hundreds of performance counters
– Hundreds of servers recording values periodically
• Example: different classes of performance counters in Orleans
Type | Examples
Orleans Runtime | CPU usage, Percentage of time in garbage collection
Message Queues | Lengths of the send and receive message queues
Messaging | Number of total messages sent and received
Actors | Number of actors on a server
Requirements for PAD
• Quickly finding the deviations of performance
data using visualizations
• Automatically finding the performance
counters that exhibit large deviations
• Ability to compare performance counters of
different logs of different system executions
• Ability to compare logs of different servers in
the same execution
Challenges for PAD
• Large data volumes to look at
– Which set of performance counters to consider
– Whether to consider performance counters of all servers at
a particular time or particular server across time or both
• Insufficient training data
– Performance data is available but not a labeled data set
– Hard to apply machine learning based classification
techniques
• Time correlation
– Large number of physical machines; clocks are not
synchronized
– Some performance counters are sensitive to time
PAD-Assisted Investigation
• Step 1: Performance data collection
• Step 2: Data visualization
• Step 3: Threshold analysis
• Step 4: Correlation analysis
• Step 5: Comparative analysis
[Figure: PAD workflow diagram. Components: Orleans logs, Azure Storage, Data Collector, in-memory performance counter data, Analyzers, Anomalous Performance Counters, Visualizations. Arrow labels: Parse, Query, Creates, Analyze, Reports.]
Step 2: Data Visualization
[Figure: Time view, Server view, and Detail view panels]
• The Detail view provides the overall trend
• The Server view shows the anomalous servers
• The Time view shows the anomalous time points
• The Time and Server views are based on
summary statistics
• Helps developers reduce the
problem space
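The Time and Server views can be read as two groupings of the same counter samples. The following is a minimal sketch, not PAD's actual implementation: it assumes samples arrive as (server, time point, value) tuples and uses the median as the summary statistic (the slides only say "summary statistics"). The sample data and the `summarize` helper are hypothetical.

```python
from collections import defaultdict
from statistics import median

# Hypothetical samples of one performance counter: (server, time_point, value).
samples = [
    ("s1", 0, 10.0), ("s2", 0, 12.0), ("s3", 0, 11.0),
    ("s1", 1, 11.0), ("s2", 1, 95.0), ("s3", 1, 12.0),
    ("s1", 2, 10.0), ("s2", 2, 97.0), ("s3", 2, 13.0),
]

def summarize(samples, key_index, statistic=median):
    """Group values by server (key_index=0) or time point (key_index=1)
    and reduce each group to a single summary statistic."""
    groups = defaultdict(list)
    for sample in samples:
        groups[sample[key_index]].append(sample[2])
    return {key: statistic(values) for key, values in groups.items()}

server_view = summarize(samples, key_index=0)  # one value per server
time_view = summarize(samples, key_index=1)    # one value per time point

print(server_view)  # {'s1': 10.0, 's2': 95.0, 's3': 12.0}
print(time_view)    # {0: 11.0, 1: 12.0, 2: 13.0}
```

In this toy data the server grouping makes s2 stand out, while the time grouping hides it because the median is taken across servers; this is one reason to look at both views.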
Step 3 : Threshold Analysis
<StatConfiguration>
  <!-- A global rule, irrespective of time and server -->
  <PerformanceCounter Name="Runtime.GC.PercentOfTimeInGC">
    <Rules>
      <Rule AppliesTo="Detail">
        <Name>1</Name>
        <Statistic>Any</Statistic>
        <ExpectedValue>30</ExpectedValue>
        <ComparisonOperator>GreaterThan</ComparisonOperator>
      </Rule>
    </Rules>
  </PerformanceCounter>
  <!-- A time rule, irrespective of server -->
  <PerformanceCounter Name="Scheduler.PendingWorkItems">
    <Rules>
      <Rule AppliesTo="Time">
        <Name>1</Name>
        <Statistic>Average</Statistic>
        <ExpectedValue>5</ExpectedValue>
        <ComparisonOperator>GreaterThan</ComparisonOperator>
      </Rule>
    </Rules>
  </PerformanceCounter>
  <!-- A server rule, irrespective of time -->
  <PerformanceCounter Name="Messaging.Sent.Messages.Delta">
    <Rules>
      <Rule AppliesTo="Server">
        <Name>1</Name>
        <Statistic>Median</Statistic>
        <ExpectedValue>100</ExpectedValue>
        <ComparisonOperator>GreaterThan</ComparisonOperator>
      </Rule>
    </Rules>
  </PerformanceCounter>
</StatConfiguration>
Goals:
• Filter the performance counters
based on developer expertise
• Identify abnormal servers or time
points for the performance counters
the developer suspects
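To make the rule semantics concrete, here is a rough sketch of how such rules might be evaluated. It is an illustration under stated assumptions, not PAD's evaluator: rules are represented as plain dictionaries (a hypothetical in-memory form of the XML), a Detail rule with statistic "Any" is assumed to test every individual sample, and only the GreaterThan operator shown in the slides is supported. The sample values and thresholds are invented.

```python
from collections import defaultdict
from statistics import mean, median

# Assumed mapping from Statistic / ComparisonOperator names to callables.
STATISTICS = {"Average": mean, "Median": median}
OPERATORS = {"GreaterThan": lambda actual, expected: actual > expected}

def violations(rule, samples):
    """Return the servers, time points, or (server, time) pairs that
    violate a rule. samples are (server, time_point, value) tuples."""
    compare = OPERATORS[rule["operator"]]
    expected = rule["expected"]
    if rule["applies_to"] == "Detail":
        # Assumption: statistic "Any" means each sample is checked on its own.
        return [(server, time) for server, time, value in samples
                if compare(value, expected)]
    key_index = 0 if rule["applies_to"] == "Server" else 1
    groups = defaultdict(list)
    for sample in samples:
        groups[sample[key_index]].append(sample[2])
    statistic = STATISTICS[rule["statistic"]]
    return [key for key, values in groups.items()
            if compare(statistic(values), expected)]

# Hypothetical samples of one counter.
samples = [("s1", 0, 2), ("s2", 0, 3), ("s1", 1, 9),
           ("s2", 1, 7), ("s1", 2, 4), ("s2", 2, 8)]

time_rule = {"applies_to": "Time", "statistic": "Average",
             "expected": 5, "operator": "GreaterThan"}
server_rule = {"applies_to": "Server", "statistic": "Median",
               "expected": 5, "operator": "GreaterThan"}
detail_rule = {"applies_to": "Detail", "statistic": "Any",
               "expected": 8, "operator": "GreaterThan"}

print(violations(time_rule, samples))    # [1, 2]
print(violations(server_rule, samples))  # ['s2']
print(violations(detail_rule, samples))  # [('s1', 1)]
```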
Step 4 : Correlation Analysis
• Previous steps help developers find abnormal counters
• What has happened in the system to cause this undesired behavior ? (Root
cause analysis)
• Statistical correlation techniques
– Pearson coefficient
– Spearman coefficient
• Explanatory performance counters
• E.g., the number of queued requests on a particular server is positively correlated
with the time spent in garbage collection on that server.
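Both coefficients named above can be computed with the standard library alone. The sketch below uses the textbook definitions (Pearson on the raw values, Spearman as Pearson on average ranks); the two series are invented for illustration and do not come from Orleans.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series.
    Raises ZeroDivisionError if either series is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman(xs, ys):
    """Spearman rank correlation: Pearson applied to the ranks."""
    return pearson(ranks(xs), ranks(ys))

# Hypothetical per-interval values for two counters on one server.
queued_requests = [1, 2, 4, 8, 16]
gc_time_percent = [3, 5, 6, 20, 21]

r = pearson(queued_requests, gc_time_percent)
rho = spearman(queued_requests, gc_time_percent)
```

Here the relationship is monotonic but not linear, so the Spearman coefficient is 1 while the Pearson coefficient is lower; computing both helps when counters grow together at different rates.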
Step 5 : Comparative Analysis
• Which set of performance counters to analyze ?
• Hundreds of performance counters - visualizing every
performance counter is not scalable
• Finding the suspicious counters using comparisons
Comparative analysis within a dataset
• The statistical properties of certain
servers or time points can be abnormal
compared to the others
$$X = \frac{|\mathrm{GlobalMedian} - \mathrm{LocalMedian}|}{\mathrm{GlobalStandardDeviation}}$$
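As an illustration of this score, the sketch below computes X for each server of a single counter. It assumes "global" means all samples of the counter across servers and "local" means one server's samples, and it uses the population standard deviation; the slides do not specify either choice. The data is hypothetical.

```python
from statistics import median, pstdev

def deviation_score(local_values, global_values):
    """X = |GlobalMedian - LocalMedian| / GlobalStandardDeviation."""
    spread = pstdev(global_values)  # assumption: population standard deviation
    if spread == 0:
        return 0.0  # all values identical, so nothing deviates
    return abs(median(global_values) - median(local_values)) / spread

# Hypothetical values of one counter, grouped by server.
per_server = {
    "s1": [10, 11, 10],
    "s2": [12, 95, 97],
    "s3": [11, 12, 13],
}
all_values = [v for values in per_server.values() for v in values]

scores = {server: deviation_score(values, all_values)
          for server, values in per_server.items()}
most_suspicious = max(scores, key=scores.get)
print(most_suspicious)  # s2
```

The same function applies to time points by grouping the samples per time point instead of per server.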
Comparative analysis between
datasets
• Compare statistical properties of
two different executions
$$X = \frac{|\mathrm{RefDataSetMedian} - \mathrm{DataSetMedian}|}{\mathrm{RefDataSetStdDev}}$$
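One way to use this score for finding suspicious counters is to compute it for every counter present in both executions and sort the results. This is a sketch under assumptions: each dataset is a dictionary from counter name to a list of values, the population standard deviation is used, and counters with zero reference spread are skipped. The counter names come from the earlier slides; the values are invented.

```python
from statistics import median, pstdev

def rank_counters(reference, candidate):
    """Score each shared counter with
    X = |RefDataSetMedian - DataSetMedian| / RefDataSetStdDev
    and return (name, X) pairs, largest X first."""
    scores = {}
    for name, ref_values in reference.items():
        if name not in candidate:
            continue
        spread = pstdev(ref_values)  # assumption: population standard deviation
        if spread == 0:
            continue  # score is undefined for a constant reference counter
        difference = abs(median(ref_values) - median(candidate[name]))
        scores[name] = difference / spread
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical values from a reference execution and a new execution.
reference = {
    "Runtime.GC.PercentOfTimeInGC": [4, 5, 6, 5],
    "Scheduler.PendingWorkItems": [2, 3, 2, 3],
}
candidate = {
    "Runtime.GC.PercentOfTimeInGC": [5, 6, 5, 6],
    "Scheduler.PendingWorkItems": [9, 12, 10, 11],
}

ranked = rank_counters(reference, candidate)
print(ranked[0][0])  # Scheduler.PendingWorkItems
```

Only the top few counters in the ranking then need to be visualized, which addresses the scalability concern raised for Step 5.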
Applications of PAD to Orleans
• Unbalanced Distributed Hash Table (DHT) problem
– Similar to the stuck random number generator problem
– One server was serving many requests
– Eventually degraded the throughput
– Comparative analysis and visualization were applied
• Performance bottleneck and tuning analysis
– Used PAD to analyze the impact of different performance
optimization techniques
– Inspected the impact of certain batching algorithms
– PAD helped assess the effectiveness of various
optimization techniques
Related Work
• Approaches that rely on historical performance data and known
performance problems
– Consider older versions of a system as baselines and evaluate newer
versions against them
– Only provide comparative analysis
– Use historical performance crises to create fingerprints of
performance data
• Approaches that do not require historical performance data
– Use Principal Component Analysis (PCA)
– Performance counters with high variance --> anomalies
– Root cause analysis using Dynamic Binary Instrumentation (DBI)
– Detecting performance anti-patterns
Lessons learned
• PAD sets the stage for developers to do
deeper analysis of performance counter data
• Visualization and summary statistics are a key
part of performance anomaly detection
• Reducing the number of performance
counters is important
• Fully automated root cause analysis for
performance anomalies is hard
Questions
Thank You !

PAD: Performance Anomaly Detection in Multi-Server Distributed Systems

  • 1. PAD: Performance Anomaly Detection in Multi-Server Distributed Systems Manjula Peiris, James H. Hill, Jorgen Thelin, Gabriel Kliot, Sergey Bykov & Christian Konig The 7th IEEE International Conference on Cloud Computing (IEEE CLOUD 2014) 28TH JUN – 2ND JUL, ALASKA, USA
  • 2. Multi-Server Distributed Systems [Figure: clients and two data centers, each containing servers] Identifying performance anomalies and their root causes is very important • Data centers contain thousands of machines • Highly scalable • Need to achieve high throughput with low latency Manual analysis • Time consuming • Thousands of logs to inspect • Error prone The performance anomaly detection process needs to be automated, or at least semi-automated!
  • 3. Goals of Performance Anomaly Detection (PAD) • For distributed system developers and users: – Gain insight from performance data into performance issues and possible optimizations of the system – Minimize the developer time required to analyze large amounts of performance data generated across hundreds to thousands of servers – Assist in troubleshooting performance-related issues and finding root causes – Production Service Intelligence (use of statistical techniques in the context of performance analysis)
  • 4. Application Domain: Orleans Project • Distributed Actor Programming Model • Provides a programming model and a runtime to construct highly scalable distributed systems • Need to achieve high throughputs and low latencies • Difficult to detect performance anomalies (typical deployment contains hundreds of servers) http://research.microsoft.com/en-us/projects/orleans/
  • 5. Some example performance issues from the Orleans Project • Stuck Random Number Generator – Lower throughput and a higher number of failed requests – Diagnosis requires comparing performance data in different server logs at different time points – Key: the time point at which the largest number of requests time out – Root cause: a thread-unsafe random number generator • Leaking Buffer Pool – Lower throughput with higher response times – Key: correlation between the performance degradation and a memory issue – Root cause: a memory leak in a custom memory allocator
  • 6. Performance Counter Data • Key-value pairs stored in logs • Track specific system states and system resource usage (e.g. CPU, memory) • Hundreds of performance counters – Hundreds of servers recording values periodically • Example: different classes of performance counters in Orleans
      Type             Examples
      Orleans Runtime  CPU usage, percentage of time in garbage collection
      Message Queues   Lengths of the send and receive message queues
      Messaging        Total number of messages sent and received
      Actors           Number of actors on a server
  • 7. Requirements for PAD • Quickly finding deviations in performance data using visualizations • Automatically finding the performance counters that exhibit large deviations • Ability to compare performance counters across logs of different system executions • Ability to compare logs of different servers in the same execution
  • 8. Challenges for PAD • Large data volumes to look at – Which set of performance counters to consider – Whether to consider the performance counters of all servers at a particular time, of a particular server across time, or both • Insufficient training data – Performance data is available, but not as a labeled data set – Hard to apply machine-learning-based classification techniques • Time correlation – Large number of physical machines whose clocks are not synchronized – Some performance counters are sensitive to time
  • 9. PAD-Assisted Investigation • Step 1: Performance data collection • Step 2: Data visualization • Step 3: Threshold analysis • Step 4: Correlation analysis • Step 5: Comparative analysis [Figure: PAD architecture diagram. Recoverable labels: Data Collector, In-memory performance counter data, Orleans logs, Azure Storage, Analyzers, Parse, Query, Creates, Analyze, Reports, Anomalous Performance Counters, Visualizations]
  • 10. Step 2: Data Visualization [Figure panels: Time view, Server view, Detail view] • The Detail view provides the overall trend • The Server view shows the anomalous servers • The Time view shows the anomalous time points • The Time and Server views are based on summary statistics • Helps developers reduce the problem space
  • 11. Step 3: Threshold Analysis
      <StatConfiguration>
        <!-- A global rule, irrespective of time and server -->
        <PerformanceCounter Name="Runtime.GC.PercentOfTimeInGC">
          <Rules>
            <Rule AppliesTo="Detail">
              <Name>1</Name>
              <Statistic>Any</Statistic>
              <ExpectedValue>30</ExpectedValue>
              <ComparisonOperator>GreaterThan</ComparisonOperator>
            </Rule>
          </Rules>
        </PerformanceCounter>
        <!-- A time rule, irrespective of server -->
        <PerformanceCounter Name="Scheduler.PendingWorkItems">
          <Rules>
            <Rule AppliesTo="Time">
              <Name>1</Name>
              <Statistic>Average</Statistic>
              <ExpectedValue>5</ExpectedValue>
              <ComparisonOperator>GreaterThan</ComparisonOperator>
            </Rule>
          </Rules>
        </PerformanceCounter>
        <!-- A server rule, irrespective of time -->
        <PerformanceCounter Name="Messaging.Sent.Messages.Delta">
          <Rules>
            <Rule AppliesTo="Server">
              <Name>1</Name>
              <Statistic>Median</Statistic>
              <ExpectedValue>100</ExpectedValue>
              <ComparisonOperator>GreaterThan</ComparisonOperator>
            </Rule>
          </Rules>
        </PerformanceCounter>
      </StatConfiguration>
      Goals: • Filter the performance counters based on the developer's expert knowledge • Identify abnormal servers or time points for the performance counters the developer suspects
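
The rule format on this slide can be illustrated with a minimal Python sketch. The slides do not show how PAD evaluates rules, so everything below is an assumption made for illustration: the function names `load_rules` and `violations`, the `samples` layout (a dict keyed by `(server, time)`), and the reading of `AppliesTo` ("Detail" checks every raw value, "Time" aggregates across servers per time point, "Server" aggregates across time per server). Only the `GreaterThan` operator seen on the slide is handled.

```python
import statistics
import xml.etree.ElementTree as ET

# One rule copied from the slide, wrapped in the root element it closes with.
RULES_XML = """
<StatConfiguration>
  <PerformanceCounter Name="Scheduler.PendingWorkItems">
    <Rules>
      <Rule AppliesTo="Time">
        <Name>1</Name>
        <Statistic>Average</Statistic>
        <ExpectedValue>5</ExpectedValue>
        <ComparisonOperator>GreaterThan</ComparisonOperator>
      </Rule>
    </Rules>
  </PerformanceCounter>
</StatConfiguration>
"""

# Assumed mapping from the Statistic element to an aggregation function.
STATISTICS = {"Average": statistics.mean, "Median": statistics.median}


def load_rules(xml_text):
    """Flatten the XML into one dict per rule (hypothetical helper)."""
    root = ET.fromstring(xml_text.strip())
    rules = []
    for counter in root.findall("PerformanceCounter"):
        for rule in counter.findall("./Rules/Rule"):
            rules.append({
                "counter": counter.get("Name"),
                "applies_to": rule.get("AppliesTo"),
                "statistic": rule.findtext("Statistic"),
                "expected": float(rule.findtext("ExpectedValue")),
                "operator": rule.findtext("ComparisonOperator"),
            })
    return rules


def violations(rule, samples):
    """Return the keys that break the rule.

    samples maps (server, time) -> value for the rule's counter.
    The result holds (server, time) pairs for a Detail rule, time points
    for a Time rule, and server names for a Server rule.
    """
    if rule["operator"] != "GreaterThan":
        raise ValueError("only GreaterThan is sketched here")
    if rule["applies_to"] == "Detail":
        # "Any": every individual sample is checked on its own.
        return sorted(k for k, v in samples.items() if v > rule["expected"])
    # Group by time (index 1) or by server (index 0), then aggregate.
    index = 1 if rule["applies_to"] == "Time" else 0
    groups = {}
    for key, value in samples.items():
        groups.setdefault(key[index], []).append(value)
    stat = STATISTICS[rule["statistic"]]
    return sorted(g for g, vals in groups.items() if stat(vals) > rule["expected"])


# Hypothetical readings of Scheduler.PendingWorkItems on two servers.
samples = {("s1", 0): 2, ("s2", 0): 4, ("s1", 1): 6, ("s2", 1): 10}
for rule in load_rules(RULES_XML):
    print(rule["counter"], "violated at", violations(rule, samples))
```

Keeping the rules in data rather than code matches the slide's stated goal of letting developers encode their own expectations for the counters they suspect.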
  • 12. Step 4: Correlation Analysis • Previous steps help developers find abnormal counters • What happened in the system to cause this undesired behavior? (Root cause analysis) • Statistical correlation techniques – Pearson coefficient – Spearman coefficient • Explanatory performance counters • E.g., the number of queued requests on a particular server is positively correlated with the time spent in garbage collection on that server.
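
The two coefficients named on this slide can be computed with the standard library alone. The sketch below is not PAD's code: the function names and the two sample series (queued requests and percentage of time in garbage collection on one server) are made up for illustration. Pearson measures linear association; Spearman is Pearson applied to ranks, so it also captures monotonic but non-linear relationships.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two series of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman rank correlation: Pearson computed on the ranks."""
    return pearson(ranks(xs), ranks(ys))


# Hypothetical per-interval samples from a single server.
queued_requests = [3, 5, 9, 20, 45]
percent_time_in_gc = [2, 4, 7, 15, 30]
print("pearson:", pearson(queued_requests, percent_time_in_gc))
print("spearman:", spearman(queued_requests, percent_time_in_gc))
```

A high coefficient only flags a candidate explanatory counter; deciding whether it is the root cause is still left to the developer.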
  • 13. Step 5: Comparative Analysis • Which set of performance counters to analyze? • Hundreds of performance counters: visualizing every performance counter is not scalable • Finding the suspicious counters using comparisons
      Comparative analysis within a dataset • Statistical properties of certain servers or time points can be abnormal compared to the others
        $X = \frac{|\mathrm{GlobalMedian} - \mathrm{LocalMedian}|}{\mathrm{GlobalStandardDeviation}}$
      Comparative analysis between datasets • Compare statistical properties of two different executions
        $X = \frac{|\mathrm{RefDataSetMedian} - \mathrm{DataSetMedian}|}{\mathrm{RefDataSetStdDev}}$
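
The score X on this slide, the absolute difference of two medians divided by the reference standard deviation, is simple to compute. The Python sketch below is an illustration under stated assumptions, not PAD's implementation: the function names, the use of the population standard deviation (`statistics.pstdev`), and the sample data are all hypothetical. The same `deviation_score` covers both cases: within a dataset the reference is all servers pooled together, and between datasets it is the reference execution.

```python
import statistics


def deviation_score(reference_values, local_values):
    """X = |median(reference) - median(local)| / stdev(reference)."""
    spread = statistics.pstdev(reference_values)  # assumption: population stdev
    if spread == 0:
        raise ValueError("reference data has zero standard deviation")
    gap = abs(statistics.median(reference_values) - statistics.median(local_values))
    return gap / spread


def rank_servers(per_server):
    """Score each server against the pooled data, most suspicious first.

    per_server maps a server name to that server's values for one counter.
    """
    pooled = [v for values in per_server.values() for v in values]
    scores = {
        server: deviation_score(pooled, values)
        for server, values in per_server.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical request counts per interval; s3 serves far more than its peers.
per_server = {
    "s1": [10, 11, 9],
    "s2": [10, 12, 10],
    "s3": [50, 52, 48],
}
for server, score in rank_servers(per_server):
    print(server, round(score, 2))
```

Sorting counters or servers by this score gives developers a short list to visualize first, which is the point of the step: hundreds of counters cannot all be inspected by eye.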
  • 14. Applications of PAD to Orleans • Unbalanced Distributed Hash Table (DHT) problem – Similar to the Stuck Random Number Generator problem – One server was serving many requests – Eventually degraded the throughput – Comparative analysis and visualization were applied • Performance bottleneck and tuning analysis – Used PAD to analyze the impact of different performance optimization techniques – Inspected the impact of certain batching algorithms – PAD helped to assess the effectiveness of various optimization techniques
  • 15. Related Work • Approaches that rely on historical performance data and known performance problems – Consider older versions of a system as baselines and evaluate newer versions against them – Only provide comparative analysis – Use historical performance crises to create fingerprints of performance data • Approaches that do not require historical performance data – Use Principal Component Analysis (PCA) – Performance counters with high variance → anomalies – Root cause analysis using Dynamic Binary Instrumentation (DBI) – Detecting performance anti-patterns
  • 16. Lessons learned • PAD sets the stage for developers to do deeper analysis of performance counter data • Visualization and summary statistics are a key part of performance anomaly detection • Reducing the number of performance counters is important • Fully automated root cause analysis for performance anomalies is hard

Editor's Notes

  1. Remove the word