Cloud Failure Prediction with Hierarchical Temporal Memory: An Empirical Assessment


The document proposes a failure prediction system for cloud resources that uses hierarchical temporal memory (HTM) for anomaly detection. It evaluates the system's ability to accurately predict failures (effectiveness) and how early it can predict them (timeliness). The system detects anomalies in key performance indicators using HTM and local predictors. A global predictor then analyzes local predictions and alerts for failures. The study tests the system under different workloads and fault injections to answer its research questions about prediction effectiveness and timeliness.




1. Cloud Failure Prediction with Hierarchical Temporal Memory: An Empirical Assessment
Oliviero Riganelli, University of Milano - Bicocca
Joint work with Alessandro Tundo, Leonardo Mariani, Paolo Saltarel, Marco Mobilio

2. Runtime Failures are unavoidable

3. Average downtime per year: 10 hours [IWGCR]

4. Runtime Failures are unavoidable … and expensive
Cost of unplanned downtime per year: $1.25 billion to $2.5 billion [Fortune]
The top three costs organizations face due to downtime [Forrester Consulting]: lost revenue; lost productivity; lost brand equity or trust

5. Proactive Failure Management
Timeline: Error, then Failure (time axis)

6. Proactive Failure Management
Timeline between Error and Failure: Failure Prediction (t_prediction), Diagnosis (t_diagnosis), Healing (t_healing)

7. Proactive Failure Management … our goal
Timeline between Error and Failure: Failure Prediction (t_prediction), Diagnosis (t_diagnosis), Healing (t_healing)

8. Online Anomaly-based Failure Prediction in the Cloud
Architecture diagram: each Cloud Resource sends KPI values to Anomaly Detectors; the Anomaly Detectors send anomalies to a Local Failure Predictor; each Local Failure Predictor sends its local failure prediction to the Global Failure Predictor, which raises the failure alert.

9. Online Anomaly-based Failure Prediction in the Cloud: Anomaly Detection with Hierarchical Temporal Memory (HTM)
Each Anomaly Detector feeds KPI values to HTM: x_t, a(x_t), π(x_t), then the prediction error S_t, then the anomaly likelihood L_t.
An anomaly is reported if L_t >= 1 - ε.
Reference: S. Ahmad, A. Lavin, S. Purdy, and Z. Agha, “Unsupervised real-time anomaly detection for streaming data”, Neurocomputing, 2017.
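The thresholding rule on this slide can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: HTM itself is not modeled (the prediction errors S_t are simply passed in), the likelihood formula follows the general approach of the cited Ahmad et al. paper as an assumption (short-term mean of the error compared against a longer history through a Gaussian tail), and the class name, window sizes, and ε value used below are hypothetical.

```python
import math
import statistics
from collections import deque


def gaussian_tail(z: float) -> float:
    """Q-function: probability that a standard normal variable exceeds z."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


class LikelihoodDetector:
    """Turns a stream of prediction errors S_t into likelihoods L_t and anomaly flags.

    Assumption: L_t = 1 - Q((short-term mean - long-term mean) / long-term std).
    The window sizes are illustrative defaults, not values from the slides.
    """

    def __init__(self, epsilon: float, long_window: int = 100, short_window: int = 10):
        self.epsilon = epsilon
        self.long = deque(maxlen=long_window)    # history modeling the usual error level
        self.short = deque(maxlen=short_window)  # most recent errors

    def update(self, prediction_error: float):
        """Consume one S_t; return (L_t, anomaly_reported)."""
        self.long.append(prediction_error)
        self.short.append(prediction_error)
        mu = statistics.fmean(self.long)
        sigma = max(statistics.pstdev(self.long), 1e-6)  # guard against division by zero
        recent = statistics.fmean(self.short)
        likelihood = 1.0 - gaussian_tail((recent - mu) / sigma)
        # Rule from the slide: an anomaly is reported if L_t >= 1 - epsilon.
        return likelihood, likelihood >= 1.0 - self.epsilon
```

A detector created as `LikelihoodDetector(epsilon=0.1)` would be fed one prediction error per KPI sample through `update`; a sustained jump in the error raises L_t toward 1 and eventually crosses the `1 - epsilon` threshold.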

10. Online Anomaly-based Failure Prediction in the Cloud: Local Failure Predictor with one-class SVM
The Local Failure Predictor receives the anomalies reported by the Anomaly Detectors. A class boundary (separating hyperplane) separates normal executions from failure-prone executions.
A failure is reported after n consecutive failure predictions.
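The "n consecutive failure predictions" rule can be illustrated independently of the classifier. The sketch below assumes the one-class SVM output has already been reduced to one boolean per time step (True for a failure-prone classification); the function name is hypothetical and the SVM itself is not modeled.

```python
from typing import Iterable, Iterator


def report_failures(predictions: Iterable[bool], n: int) -> Iterator[bool]:
    """Yield True at each step where the last n predictions were all failures."""
    streak = 0  # length of the current run of failure predictions
    for predicted_failure in predictions:
        streak = streak + 1 if predicted_failure else 0
        yield streak >= n
```

With n = 1 every failure prediction is reported immediately; with n = 2 an isolated prediction is suppressed, trading some lead time for fewer spurious reports.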

11. Online Anomaly-based Failure Prediction in the Cloud: Global Failure Predictor
The Global Failure Predictor receives the local failure predictions (Failure: Yes/No) and outputs either no failure or a failure alert, using one of two strategies:
Single resource: x consecutive failure predictions to raise an alert.
Vote-based: y consecutive failure predictions to raise an alert.
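The two global strategies can be sketched as follows. The slide does not spell out how local predictions are aggregated, so both readings here are assumptions: "single resource" is taken to mean that any one resource reaching x consecutive local failure predictions raises the alert, and "vote-based" is taken to mean that a strict majority of resources must predict failure for y consecutive rounds. Function names are hypothetical.

```python
from typing import List, Sequence


def single_resource_alerts(rounds: Sequence[Sequence[bool]], x: int) -> List[bool]:
    """Alert when any single resource has x consecutive local failure predictions.

    rounds[t][i] is the local failure prediction of resource i at step t.
    """
    if not rounds:
        return []
    streaks = [0] * len(rounds[0])  # one running streak per resource
    alerts = []
    for local in rounds:
        streaks = [s + 1 if p else 0 for s, p in zip(streaks, local)]
        alerts.append(max(streaks) >= x)
    return alerts


def vote_based_alerts(rounds: Sequence[Sequence[bool]], y: int) -> List[bool]:
    """Alert when a strict majority of resources predicts failure y rounds in a row."""
    streak = 0
    alerts = []
    for local in rounds:
        majority = sum(local) * 2 > len(local)  # assumed voting rule
        streak = streak + 1 if majority else 0
        alerts.append(streak >= y)
    return alerts
```

Under these assumptions the single-resource strategy reacts to a fault confined to one VM, while the vote-based strategy waits for agreement across resources.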

12. Experimental Setting
Testbed: 1 cloud-native IP Multimedia Subsystem; 6 VMs with 2 vCPUs, 2 GB RAM, 20 GB HD; 150 KPIs
Workload patterns: daily variations (higher traffic on working days); hourly variations (heavier during the day, with peaks at 9am and 7pm)
Fault injection, types: CPU hog; memory leak; packet loss; excessive workload
Fault injection, activation patterns: linear; exponential; random
Tested parameters:
Anomaly Detector: ε = 0.8, 0.85, 0.9, 0.95
Local Failure Predictor: n = 1, 2
Single Resource Global Failure Predictor: x = 1, 2, 3, 4, 5, 6
Vote-based Global Failure Predictor: y = 1, 2, 3
Research questions:
RQ1 (Prediction Effectiveness): Can an HTM-based anomaly detector support a failure prediction system in accurately predicting failures?
RQ2 (Prediction Timeliness): How early can failures be predicted?
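To get a sense of the size of the parameter space listed on this slide, the values can be enumerated directly. The sketch below assumes a full cross product of the tested parameters and of fault types with activation patterns; the slides do not state that every combination was actually run, and the dictionary keys and function names are hypothetical.

```python
from itertools import product

# Values copied from the slide.
EPSILONS = (0.8, 0.85, 0.9, 0.95)
LOCAL_N = (1, 2)
SINGLE_RESOURCE_X = (1, 2, 3, 4, 5, 6)
VOTE_BASED_Y = (1, 2, 3)
FAULT_TYPES = ("CPU hog", "memory leak", "packet loss", "excessive workload")
ACTIVATION_PATTERNS = ("linear", "exponential", "random")


def predictor_configurations():
    """All (epsilon, n, global strategy, threshold) combinations, assuming a full grid."""
    configs = []
    for eps, n, x in product(EPSILONS, LOCAL_N, SINGLE_RESOURCE_X):
        configs.append({"epsilon": eps, "n": n, "global": "single-resource", "threshold": x})
    for eps, n, y in product(EPSILONS, LOCAL_N, VOTE_BASED_Y):
        configs.append({"epsilon": eps, "n": n, "global": "vote-based", "threshold": y})
    return configs


def fault_scenarios():
    """All (fault type, activation pattern) pairs, assuming every pairing is meaningful."""
    return list(product(FAULT_TYPES, ACTIVATION_PATTERNS))
```

Under the full-grid assumption this gives 4 x 2 x 6 = 48 single-resource configurations plus 4 x 2 x 3 = 24 vote-based ones, and 4 x 3 = 12 fault scenarios.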

13-15. Prediction Effectiveness (RQ1) [plots not reproduced; axis ticks: 0.8, 0.85, 0.9, 0.95]

16-17. Prediction Timeliness (RQ2) [plots not reproduced]
