Cloud Failure Prediction with Hierarchical Temporal Memory
An Empirical Assessment
Oliviero Riganelli
University of Milano - Bicocca
joint work with Alessandro Tundo, Leonardo Mariani, Paolo Saltarel, Marco Mobilio

Runtime Failures are unavoidable
Average downtime per year: 10 hours [IWGCR]
… and expensive
Cost of unplanned downtime per year: $1.25 billion to $2.5 billion [Fortune]
The top three costs organizations face due to downtime [Forrester Consulting]: lost revenue, lost productivity, lost brand equity or trust

Proactive Failure Management … our goal
Timeline (time axis): Error, then Failure
Phases: Failure Prediction (t_prediction), Diagnosis (t_diagnosis), Healing (t_healing)

Online Anomaly-based Failure Prediction in the Cloud
[Architecture diagram]
Per Cloud Resource: KPI values → Anomaly Detectors → anomalies → Local Failure Predictor → local failure prediction
Across resources: local failure predictions → Global Failure Predictor → failure alert

Online Anomaly-based Failure Prediction in the Cloud
Anomaly Detection with Hierarchical Temporal Memory (HTM)
[Diagram] KPI values → Anomaly Detector (HTM): x_t → a(x_t) → π(x_t) → Prediction Error S_t → Anomaly likelihood L_t
An anomaly is reported if L_t >= 1 - ε
S. Ahmad, A. Lavin, S. Purdy, and Z. Agha, “Unsupervised real-time anomaly detection for streaming data”, Neurocomputing 2017
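
The thresholding step can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the Gaussian-tail formulation of the anomaly likelihood described in the cited Neurocomputing paper, and the function names and the `short_window` setting are hypothetical. The HTM network that produces the prediction errors S_t is not modelled here.

```python
import math
from statistics import mean, pstdev


def anomaly_likelihood(scores, short_window):
    """Estimate L_t from a history of prediction errors S_t (most recent last).

    Assumption: L_t = 1 - Q(z), where z compares the mean of the last
    `short_window` errors against the mean and standard deviation of the
    whole history, and Q is the Gaussian tail probability.
    """
    long_mean = mean(scores)
    long_std = pstdev(scores)
    short_mean = mean(scores[-short_window:])
    if long_std == 0:
        return 0.5  # flat history: recent errors are neither high nor low
    z = (short_mean - long_mean) / long_std
    q = 0.5 * math.erfc(z / math.sqrt(2))  # Q(z) for a standard normal
    return 1.0 - q


def is_anomaly(likelihood, epsilon):
    """Rule from the slide: report an anomaly if L_t >= 1 - epsilon."""
    return likelihood >= 1.0 - epsilon
```

For example, a long run of low prediction errors followed by a short burst of high ones pushes the likelihood close to 1, while a flat history stays at 0.5 and is not reported.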

Online Anomaly-based Failure Prediction in the Cloud
Local Failure Predictor with one-class SVM
[Figure] Normal executions and failure-prone executions, divided by the class boundary (separating hyperplane)
A failure is reported after n consecutive failure predictions
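
The "n consecutive failure predictions" rule can be sketched as a small streak counter. This is an illustrative sketch only: the function name is hypothetical, and the boolean inputs stand in for the per-sample output of a one-class SVM, which is not trained here.

```python
def local_failure_predictions(svm_outputs, n):
    """Yield True at each step where the last n classifier outputs were all failures.

    svm_outputs: iterable of booleans, True meaning the one-class classifier
                 placed the current sample outside the normal-execution class.
    n: number of consecutive failure predictions required before reporting.
    """
    streak = 0
    for flagged in svm_outputs:
        # A single normal sample resets the streak.
        streak = streak + 1 if flagged else 0
        yield streak >= n
```

With n = 1 every failure prediction is reported immediately; with n = 2 an isolated prediction is suppressed and only a run of two or more is reported.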

Online Anomaly-based Failure Prediction in the Cloud
Global Failure Predictor
Input: one local failure prediction (Failure: Yes / No) per Local Failure Predictor
Output: No Failure or Failure alert
Single resource: x consecutive failure predictions to raise an alert
Vote-based: y consecutive failure predictions to raise an alert
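
The two global strategies can be sketched as below. The slide only names the strategies and their thresholds x and y, so the exact semantics are assumptions made for illustration: "single resource" is read here as any one resource predicting a failure for x steps in a row, and "vote-based" as a strict majority of resources predicting a failure for y steps in a row. Function names and the data layout are hypothetical.

```python
from typing import Dict, List


def single_resource_alert(history: List[Dict[str, bool]], x: int) -> bool:
    """Alert if some resource predicted a failure in each of the last x steps.

    history: one dict per time step, mapping resource name to its local
             failure prediction (True = failure predicted).
    """
    if len(history) < x:
        return False
    recent = history[-x:]
    return any(all(step[name] for step in recent) for name in recent[0])


def vote_based_alert(history: List[Dict[str, bool]], y: int) -> bool:
    """Alert if a strict majority of resources predicted a failure
    in each of the last y steps (assumed voting rule)."""
    if len(history) < y:
        return False
    recent = history[-y:]
    return all(sum(step.values()) * 2 > len(step) for step in recent)
```

Under these assumptions, the single-resource strategy reacts to one persistently failing resource, while the vote-based strategy requires agreement among resources before raising an alert.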

Experimental Setting
Testbed
1 Cloud-native IP Multimedia SubSystem
6 VMs with 2 vCPUs, 2GB RAM, 20GB HD
150 KPIs
Workload patterns
Daily variations: higher traffic on working days
Hourly variations: heavier during the day with peaks at 9am and 7pm
Fault Injection
Types: CPU Hog, Memory leak, Packet loss, Excessive workload
Activation Patterns: Linear, Exponential, Random
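
The three activation patterns can be pictured as different curves for how strongly a fault is applied over time. The slide does not define them, so the following is a hypothetical sketch: the function name, the normalisation to the range 0 to 1, and the reading of "random" as a uniformly drawn intensity are all assumptions.

```python
import math
import random


def fault_intensity(pattern, t, duration, rng=None):
    """Fraction (0..1) of the maximum fault intensity at time t, 0 <= t <= duration."""
    progress = t / duration
    if pattern == "linear":
        return progress
    if pattern == "exponential":
        # Normalised so the curve starts at 0 and reaches 1 at t == duration.
        return (math.exp(progress) - 1) / (math.e - 1)
    if pattern == "random":
        # Assumed: intensity drawn uniformly at each step.
        return (rng or random).random()
    raise ValueError(f"unknown pattern: {pattern}")
```

In this sketch the exponential pattern stays below the linear one for most of the run and catches up at the end, which gives a fault that is harder to notice early on.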
Tested Parameters
Anomaly Detector: ε = 0.8, 0.85, 0.9, 0.95
Local Failure Predictor: n = 1, 2
Single Resource Global Failure Predictor: x = 1, 2, 3, 4, 5, 6
Vote-based Global Failure Predictor: y = 1, 2, 3
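
The tested parameter values can be enumerated as a configuration grid. This sketch assumes that every ε and n value is combined with every threshold of each global strategy; the slide lists the values but does not say how they were combined, and the dictionary keys are hypothetical names.

```python
from itertools import product

EPSILONS = [0.8, 0.85, 0.9, 0.95]      # Anomaly Detector
N_VALUES = [1, 2]                      # Local Failure Predictor
GLOBAL_PREDICTORS = {
    "single_resource": [1, 2, 3, 4, 5, 6],  # x
    "vote_based": [1, 2, 3],                # y
}


def configurations():
    """Yield one dict per combination of ε, n, global strategy and its threshold."""
    for strategy, thresholds in GLOBAL_PREDICTORS.items():
        for eps, n, k in product(EPSILONS, N_VALUES, thresholds):
            yield {"epsilon": eps, "n": n, "global": strategy, "consecutive": k}
```

Under the full-cross-product assumption this gives 4 × 2 × 6 = 48 single-resource configurations and 4 × 2 × 3 = 24 vote-based ones, 72 in total.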
Research Questions
RQ1 (Prediction Effectiveness): Can an HTM-based anomaly detector support a failure prediction system in accurately predicting failures?
RQ2 (Prediction Timeliness): How early can failures be predicted?

Prediction Effectiveness (RQ1)
[Figures: prediction effectiveness plots; axis values 0.8, 0.85, 0.9, 0.95]

Prediction Timeliness (RQ2)
[Figures: prediction timeliness plots]