On Challenges of Cloud Monitoring
William Pourmajidi (Dept. of Computer Science, Ryerson University)
John Steinbacher (IBM Canada Lab)
Tony Erwin (IBM Watson and Cloud Platform)
Andriy Miranskyy (Dept. of Computer Science, Ryerson University)
CASCON 2017 - Toronto
This work would not have been possible without the
support of the IBM Centre for Advanced Studies
Contents
• Abstract
• Introduction
• Position
• Challenges and Solutions
• Summary
• References
• Q&A
Abstract
Cloud Popularity
• 60% of IT spending in 2016 was Cloud-based [23]
• Public Cloud Market size will reach $236B by 2020 [4]
Monitoring of Cloud
• Several challenges have been addressed; we will explore the following:
1. Defining health states of Cloud systems
2. Creating unified monitoring environments
3. Establishing high availability strategies
[23] Mahowald et al. , Worldwide Cloud 2016 Predictions — Mastering the Raw Material of Digital Transformation (2015)
[4] Bartel et al., PUBLIC CLOUD MARKET WILL GROW TO $236 BILLION IN 2020. Technical Report. Forrester Research (2016)
Introduction
Cloud Monitoring and its challenges
Constant data collection: billions of records per day, a big data problem!
Elasticity: Tremendous challenge for conventional monitoring tools
Cloud Networks: Large-scale networks
Cloud Delivery options: IaaS, PaaS, SaaS
Autonomous Control: Cloud providers have full control over monitoring
solutions and collected data
Quality dimensions: Availability, reliability, and performance require
different types of monitoring practices
Introduction
Examples of solved challenges
Data-efficient Log system: A log system that avoids storage of repetitive records,
80% reduction of storage size [3]
Elastic Monitoring solution: A multi-tier tool that adjusts its scale based on the size of
elastic platforms [39]
[3] Anwar et al., Anatomy of Cloud Monitoring and Metering (2015)
[39] Ward et al., Self Managing Monitoring for Highly Elastic Large Scale Cloud Deployments (2014)
Introduction
Examples of solved challenges
Cloud Networks: A clustered, fault-tolerant, network monitoring tool [31]
Autonomous Control: An agent-based, role-based system that provides a trustworthy and
holistic monitoring solution [30]
Log Analysis: A tool based on recurrent neural networks that detects up to 98.3% of
anomalies [5]
[31] Pongpaibool et al., A Robust and Scalable Service-Oriented Platform for Distributed Monitoring (2014)
[30] Nguyen et al., Role-Based Templates for Cloud Monitoring (2014)
[5] Bhattacharyya et al., Semantic Aware Online Detection of Resource Anomalies on the Cloud (2016)
Position
Unresolved Challenges
Our position is that further research is required in the realm of Cloud
monitoring in the following areas:
1. Defining health states of Cloud systems
2. Creating unified monitoring environments
3. Establishing high availability strategies
Challenges and Solutions
1- Defining health states
Health States
• Typically:
• Binary value (healthy, unhealthy)
• Calculated from a set of attributes and thresholds that make up “health”
Example
• For a VM, let’s say healthy is CPU < 10% and available storage > 1 GB
• VM disk space goes below 1 GB
• Operations team is alerted and must react before disk space reaches 0 KB!
Question
• How unhealthy? E.g., Is it worth getting someone up at 2:00 AM?
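The binary check in the example above can be sketched as follows. The two thresholds come from the slide; the `VmMetrics` type and the `is_healthy` function are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass

# Thresholds taken from the example above; the names are illustrative assumptions.
CPU_MAX_PERCENT = 10.0
MIN_FREE_STORAGE_GB = 1.0


@dataclass
class VmMetrics:
    cpu_percent: float
    free_storage_gb: float


def is_healthy(m: VmMetrics) -> bool:
    """Binary health: every attribute must be inside its threshold."""
    return m.cpu_percent < CPU_MAX_PERCENT and m.free_storage_gb > MIN_FREE_STORAGE_GB


if __name__ == "__main__":
    # Free storage has dropped below 1 GB, so the VM is reported as unhealthy.
    print(is_healthy(VmMetrics(cpu_percent=4.0, free_storage_gb=0.8)))
```

A check of this kind returns the same answer for 0.99 GB and 0.01 GB of free space, which is the ambiguity the question above points at.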
Challenges and Solutions
1- Defining health states
• Potential Solution
• States can be extended to a ternary state classification (e.g.,
“green/good”, “yellow/warning”, “red/bad”) [10]
• Warning state might mean a “wait till 9am” decision, while the “red”
state might mean “look at it right now”
• Furthermore, time-series forecasting [33] could use historical behavior to
predict future behavior
• Drawback: The predicted rate of data storage could be wrong because of a
totally unforeseen event (e.g., a cyber attack)
[33] Shumway et al., Time Series Analysis and Its Applications (2017)
[10] Datadog, Modern monitoring & analytics. https://www.datadoghq.com/ (2017)
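A minimal sketch of how the ternary states and a forecast might be combined for the disk-space example. A straight-line least-squares fit stands in here for the time-series methods of [33]; it is not the authors' method. The `warn_gb` and `urgent_hours` parameters and both function names are assumptions made for illustration.

```python
from typing import Sequence


def hours_until_empty(hours: Sequence[float], free_gb: Sequence[float]) -> float:
    """Fit a least-squares line to free space and extrapolate to 0 GB.

    Returns float('inf') when free space is flat or growing.
    """
    n = len(hours)
    mean_t = sum(hours) / n
    mean_y = sum(free_gb) / n
    var_t = sum((t - mean_t) ** 2 for t in hours)
    if var_t == 0:
        return float("inf")
    slope = sum((t - mean_t) * (y - mean_y) for t, y in zip(hours, free_gb)) / var_t
    if slope >= 0:
        return float("inf")
    intercept = mean_y - slope * mean_t
    t_zero = -intercept / slope          # time at which the fitted line hits 0 GB
    return max(0.0, t_zero - hours[-1])  # measured from the latest sample


def classify(hours: Sequence[float], free_gb: Sequence[float],
             warn_gb: float = 1.0, urgent_hours: float = 8.0) -> str:
    """Green above the threshold; below it, red only if exhaustion looks near."""
    if free_gb[-1] > warn_gb:
        return "green"
    if hours_until_empty(hours, free_gb) <= urgent_hours:
        return "red"      # "look at it right now"
    return "yellow"       # "wait till 9am"
```

As the drawback above notes, the extrapolation only reflects past behavior; a sudden burst of writes would invalidate the estimate.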
Challenges and Solutions
1- Defining health states - size and complexity challenges
• Cloud deployments may have thousands of components (containers, VMs,
switches, bare-metal servers, etc.)
• The number of individual components that require monitoring can grow rapidly
• Components have multiple attributes, and capturing all possible permutations
of all components in a system becomes a formidable task
• Microservice architectures add further monitoring challenges
• Challenge: Site reliability engineers still need to set proper alarms
Challenges and Solutions
1- Defining health states - size and complexity challenges
• Potential Solution
• Define templates loosely tailored to different groups of components (e.g.,
Docker containers processing user authentication)
• Machine learning techniques such as cognitive computing [28] and deep
neural networks [5,17] can be used to “customize” the templates for each
component
• Feedback from operations can be used to further train the model with
reinforcement learning schemes [40]
• Drawback: Large volumes of collected data are required
[28] Modha et al., Cognitive Computing (2011)
[5] Bhattacharyya et al., Semantic Aware Online Detection of Resource Anomalies on the Cloud (2016)
[17] Guo et al., Robust Online Time Series Prediction with Recurrent Neural Networks (2016)
[40] Wiering et al., Reinforcement Learning: State-of-the-Art (2012)
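The template idea above can be illustrated with a deliberately simple stand-in. Instead of the machine learning techniques cited, this sketch customizes a group template per component with a mean-plus-k-standard-deviations rule over that component's own history. The `Template` type, the `customize` function, and the `k` and `min_samples` parameters are all hypothetical.

```python
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import Sequence


@dataclass(frozen=True)
class Template:
    """Default alarm thresholds shared by a group of similar components."""
    group: str
    cpu_alarm_percent: float


def customize(template: Template, cpu_history: Sequence[float],
              k: float = 3.0, min_samples: int = 30) -> Template:
    """Return a per-component copy of the group template.

    With too little history the group default is kept, mirroring the drawback
    that large volumes of collected data are required.
    """
    if len(cpu_history) < min_samples:
        return template
    learned = mean(cpu_history) + k * pstdev(cpu_history)
    return Template(template.group, min(100.0, learned))


if __name__ == "__main__":
    auth = Template(group="auth-containers", cpu_alarm_percent=80.0)
    history = [10.0, 30.0] * 20  # 40 samples from one container (made-up data)
    print(customize(auth, history))
```

Feedback from operations, as suggested above, could then adjust `k` per component; that loop is left out of the sketch.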
Challenges and Solutions
1- Defining health states - overall system state
• Users are concerned with the overall system health and not the state of
individual components
• Different components have different impacts
• E.g., a load balancer has a greater impact on the user’s experience
than a persistent storage component that may be a little slow
• Potential Solution: Use statistical process control [29] to compute the overall
health state
[29] Montgomery, Introduction to Statistical Quality Control (2012)
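One way to read the statistical process control suggestion is sketched below: per-component health scores are combined into a weighted overall score, and that score is then checked against control limits in the spirit of a Shewhart chart. The weights, the score range of 0 to 1, and all function names are assumptions for illustration, not part of the original proposal.

```python
from statistics import mean, pstdev
from typing import Dict, List, Sequence, Tuple


def overall_score(component_health: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted average of per-component health scores in [0, 1]."""
    total = sum(weights[name] for name in component_health)
    return sum(component_health[name] * weights[name] for name in component_health) / total


def control_limits(baseline: Sequence[float], k: float = 3.0) -> Tuple[float, float]:
    """Lower and upper control limits: baseline mean +/- k standard deviations."""
    center, sigma = mean(baseline), pstdev(baseline)
    return center - k * sigma, center + k * sigma


def out_of_control(scores: Sequence[float], baseline: Sequence[float]) -> List[int]:
    """Indices of overall scores that fall outside the control limits."""
    lower, upper = control_limits(baseline)
    return [i for i, s in enumerate(scores) if s < lower or s > upper]


if __name__ == "__main__":
    # Hypothetical weights: the load balancer matters three times as much as storage.
    weights = {"load_balancer": 3.0, "storage": 1.0}
    print(overall_score({"load_balancer": 1.0, "storage": 0.0}, weights))
```

With these weights, a failed storage component lowers the overall score far less than a failed load balancer, matching the example above.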
Challenges and Solutions
2- Creating unified monitoring environments
Major Challenges
• Monitoring solutions are often discipline-specific
• Redundant logs, lack of unified view, imperfect decision support systems,
different logs for cloud users and cloud providers
Potential Solution
• Design and implement a unified monitoring system
Drawbacks
• Multiple semi-independent teams make it difficult to settle on a unified
solution
• Cost of data migration, adjusting dashboards based on the newly defined
system, and adopting the new framework can be high
Challenges and Solutions
2- Creating unified monitoring environments
Major Challenges
• Hierarchical structure of cloud causes monitoring challenges
• Lowest layer consists of data center and its hardware components, which
support many software-defined layers sitting on top to fulfill user
requirements
• An issue on a lower layer can easily affect upper layers
• Tracing such issues is a challenging task that requires a unified monitoring
system.
• Without knowing how layers are inter-related, operations may wonder if an
error was caused in their layer or in another
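To illustrate why knowledge of inter-layer relationships helps, here is a minimal sketch that separates likely causes from likely symptoms. It assumes a unified view is available as a `depends_on` map from each component to the lower-layer components it runs on; that map, the component names, and the function names are hypothetical.

```python
from typing import Dict, List, Set


def _all_dependencies(depends_on: Dict[str, List[str]], component: str) -> Set[str]:
    """Every component reachable by following dependencies downward."""
    seen: Set[str] = set()
    stack = list(depends_on.get(component, []))
    while stack:
        dep = stack.pop()
        if dep not in seen:
            seen.add(dep)
            stack.extend(depends_on.get(dep, []))
    return seen


def probable_root_causes(depends_on: Dict[str, List[str]], unhealthy: Set[str]) -> Set[str]:
    """Unhealthy components with no unhealthy component anywhere beneath them.

    An unhealthy component that sits on top of another unhealthy component is
    treated as a likely symptom rather than a cause.
    """
    return {
        c for c in unhealthy
        if not (_all_dependencies(depends_on, c) & unhealthy)
    }


if __name__ == "__main__":
    depends_on = {
        "web-app": ["container-runtime"],
        "container-runtime": ["vm-17"],
        "vm-17": ["rack-3-switch"],
        "rack-3-switch": [],
    }
    print(probable_root_causes(depends_on, {"web-app", "container-runtime", "vm-17"}))
```

Without the dependency map, each team sees only its own unhealthy component and cannot tell whether the error originated in its layer.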
Challenges and Solutions
3- Establishing high availability (HA) failover strategies
• Defining HA and failover strategies for large, distributed, cloud-based systems
is not trivial and requires extensive monitoring
• Monitoring tools should report the holistic health of the main site as well as all
other backup sites (hot and warm)
• Such broad coverage brings challenges such as increased latency in real-time
processing of logs, which delays the decisions that the HA module should make
• Failovers are computationally and commercially expensive and should be used
only when necessary
Challenges and Solutions
3- Establishing high availability (HA) failover strategies
• Deciding whether to fail over the entire system to a backup site or just a
portion of it can make a significant difference
• Critical to understand what has gone wrong and which components are
affected
• To the best of our knowledge, no general framework for this task
exists at the time of writing; the problem calls for more research
Challenges and Solutions
3- Establishing high availability (HA) failover strategies
• Designing and implementing HA strategies for monolithic services is easier than
for microservices
• If one node fails, a typical failover may bring back the service, but the state of
the transactions may become unknown and/or data may become inconsistent
• Potential Solution
• Create a monitoring service that satisfies the principles of Atomicity,
Consistency, Isolation, Durability (ACID) and will not allow failovers that
would result in data inconsistency
• In case of a microservice failure, the monitoring system will make a
logical decision about failing over one, a group, or all of the microservices
based on transactional boundaries.
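The scope decision described above (one, a group, or all microservices) can be sketched under a simple assumption: each transaction is known as the set of microservices that take part in it. The `failover_group` function and the service names are hypothetical, and the sketch covers only the choice of scope, not the consistency checks an ACID-compliant monitoring service would also need.

```python
from typing import Iterable, Set


def failover_group(failed: str, transactions: Iterable[Set[str]]) -> Set[str]:
    """Services that must move together with `failed`.

    Two services are coupled when they take part in the same transaction;
    coupling is followed transitively so no transaction is split across sites.
    """
    txs = [set(t) for t in transactions]
    group = {failed}
    changed = True
    while changed:
        changed = False
        for tx in txs:
            # Pull in a transaction's services once any of them is in the group.
            if tx & group and not tx <= group:
                group |= tx
                changed = True
    return group


if __name__ == "__main__":
    transactions = [{"orders", "payments"}, {"payments", "ledger"}, {"search"}]
    print(sorted(failover_group("orders", transactions)))
    print(sorted(failover_group("search", transactions)))
```

A service that shares no transaction with others fails over alone, while a failure inside a chain of shared transactions moves the whole chain, which keeps the failover as small as the transactional boundaries allow.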
Summary
• While some challenges have already been solved, our position is that further
research is required in the realm of cloud monitoring
• We focused on three areas
1. Defining health states of cloud systems
2. Creating unified monitoring environments
3. Establishing high availability strategies
• We have shown that these areas are interconnected
• To make HA decisions (area 3), one needs to understand the health states
(area 1) of all components of the system (area 2)
• Cloud monitoring is a fertile area for novel research and practice
References
Thank you
Q & A

More Related Content

Similar to Challenges of Cloud Monitoring

Production Monitoring Platform
Production Monitoring PlatformProduction Monitoring Platform
Production Monitoring PlatformAriel Smoliar
 
Self-adaptation Challenges for Cloud-based Applications (Feedback Computing 2...
Self-adaptation Challenges for Cloud-based Applications (Feedback Computing 2...Self-adaptation Challenges for Cloud-based Applications (Feedback Computing 2...
Self-adaptation Challenges for Cloud-based Applications (Feedback Computing 2...Soodeh Farokhi
 
Public integrity auditing for shared dynamic cloud data with group user revoc...
Public integrity auditing for shared dynamic cloud data with group user revoc...Public integrity auditing for shared dynamic cloud data with group user revoc...
Public integrity auditing for shared dynamic cloud data with group user revoc...Pvrtechnologies Nellore
 
Public Integrity Auditing for Shared Dynamic Cloud Data with Group User Revoc...
Public Integrity Auditing for Shared Dynamic Cloud Data with Group User Revoc...Public Integrity Auditing for Shared Dynamic Cloud Data with Group User Revoc...
Public Integrity Auditing for Shared Dynamic Cloud Data with Group User Revoc...1crore projects
 
Extending the McCumber Cube to Model Software System Maintenance Tasks
Extending the McCumber Cube to Model Software System Maintenance TasksExtending the McCumber Cube to Model Software System Maintenance Tasks
Extending the McCumber Cube to Model Software System Maintenance TasksVorachet Jaroensawas
 
Chapeter 2 introduction to cloud computing
Chapeter 2   introduction to cloud computingChapeter 2   introduction to cloud computing
Chapeter 2 introduction to cloud computingeShikshak
 
Observability in highly distributed systems
Observability in highly distributed systemsObservability in highly distributed systems
Observability in highly distributed systemsDevOps Indonesia
 
Activity Monitoring Using Wearable Sensors and Smart Phone
Activity Monitoring Using Wearable Sensors and Smart PhoneActivity Monitoring Using Wearable Sensors and Smart Phone
Activity Monitoring Using Wearable Sensors and Smart PhoneDrAhmedZoha
 
102.12.25 中正大學資管系古政元教授 屏東科技大學演講(2013-12-25)
102.12.25 中正大學資管系古政元教授 屏東科技大學演講(2013-12-25)102.12.25 中正大學資管系古政元教授 屏東科技大學演講(2013-12-25)
102.12.25 中正大學資管系古政元教授 屏東科技大學演講(2013-12-25)平原 謝
 
Software Engineering Important Short Question for Exams
Software Engineering Important Short Question for ExamsSoftware Engineering Important Short Question for Exams
Software Engineering Important Short Question for ExamsMuhammadTalha436
 
The most trusted, proven enterprise-class Cloud:Closer than you think
The most trusted, proven enterprise-class Cloud:Closer than you think The most trusted, proven enterprise-class Cloud:Closer than you think
The most trusted, proven enterprise-class Cloud:Closer than you think Uni Systems S.M.S.A.
 
CCSK Certificate of Cloud Computing Knowledge - overview
CCSK Certificate of Cloud Computing Knowledge - overviewCCSK Certificate of Cloud Computing Knowledge - overview
CCSK Certificate of Cloud Computing Knowledge - overviewPeter HJ van Eijk
 
RightScale Webinar - Coping With Cloud Migration Challenges: Best Practices a...
RightScale Webinar - Coping With Cloud Migration Challenges: Best Practices a...RightScale Webinar - Coping With Cloud Migration Challenges: Best Practices a...
RightScale Webinar - Coping With Cloud Migration Challenges: Best Practices a...RightScale
 
Webinar compiled powerpoint
Webinar compiled powerpointWebinar compiled powerpoint
Webinar compiled powerpointCloudPassage
 
Supporting operations personnel a software engineers perspective
Supporting operations personnel a software engineers perspectiveSupporting operations personnel a software engineers perspective
Supporting operations personnel a software engineers perspectiveLen Bass
 
VTU 5TH SEM CSE SOFTWARE ENGINEERING SOLVED PAPERS - JUN13 DEC13 JUN14 DEC14 ...
VTU 5TH SEM CSE SOFTWARE ENGINEERING SOLVED PAPERS - JUN13 DEC13 JUN14 DEC14 ...VTU 5TH SEM CSE SOFTWARE ENGINEERING SOLVED PAPERS - JUN13 DEC13 JUN14 DEC14 ...
VTU 5TH SEM CSE SOFTWARE ENGINEERING SOLVED PAPERS - JUN13 DEC13 JUN14 DEC14 ...vtunotesbysree
 

Similar to Challenges of Cloud Monitoring (20)

Production Monitoring Platform
Production Monitoring PlatformProduction Monitoring Platform
Production Monitoring Platform
 
Self-adaptation Challenges for Cloud-based Applications (Feedback Computing 2...
Self-adaptation Challenges for Cloud-based Applications (Feedback Computing 2...Self-adaptation Challenges for Cloud-based Applications (Feedback Computing 2...
Self-adaptation Challenges for Cloud-based Applications (Feedback Computing 2...
 
system development life cycle
system development life cyclesystem development life cycle
system development life cycle
 
Public integrity auditing for shared dynamic cloud data with group user revoc...
Public integrity auditing for shared dynamic cloud data with group user revoc...Public integrity auditing for shared dynamic cloud data with group user revoc...
Public integrity auditing for shared dynamic cloud data with group user revoc...
 
Public Integrity Auditing for Shared Dynamic Cloud Data with Group User Revoc...
Public Integrity Auditing for Shared Dynamic Cloud Data with Group User Revoc...Public Integrity Auditing for Shared Dynamic Cloud Data with Group User Revoc...
Public Integrity Auditing for Shared Dynamic Cloud Data with Group User Revoc...
 
Extending the McCumber Cube to Model Software System Maintenance Tasks
Extending the McCumber Cube to Model Software System Maintenance TasksExtending the McCumber Cube to Model Software System Maintenance Tasks
Extending the McCumber Cube to Model Software System Maintenance Tasks
 
Chapeter 2 introduction to cloud computing
Chapeter 2   introduction to cloud computingChapeter 2   introduction to cloud computing
Chapeter 2 introduction to cloud computing
 
Observability in highly distributed systems
Observability in highly distributed systemsObservability in highly distributed systems
Observability in highly distributed systems
 
Activity Monitoring Using Wearable Sensors and Smart Phone
Activity Monitoring Using Wearable Sensors and Smart PhoneActivity Monitoring Using Wearable Sensors and Smart Phone
Activity Monitoring Using Wearable Sensors and Smart Phone
 
9fcfd50a69d9647585
9fcfd50a69d96475859fcfd50a69d9647585
9fcfd50a69d9647585
 
102.12.25 中正大學資管系古政元教授 屏東科技大學演講(2013-12-25)
102.12.25 中正大學資管系古政元教授 屏東科技大學演講(2013-12-25)102.12.25 中正大學資管系古政元教授 屏東科技大學演講(2013-12-25)
102.12.25 中正大學資管系古政元教授 屏東科技大學演講(2013-12-25)
 
Software Engineering Important Short Question for Exams
Software Engineering Important Short Question for ExamsSoftware Engineering Important Short Question for Exams
Software Engineering Important Short Question for Exams
 
The most trusted, proven enterprise-class Cloud:Closer than you think
The most trusted, proven enterprise-class Cloud:Closer than you think The most trusted, proven enterprise-class Cloud:Closer than you think
The most trusted, proven enterprise-class Cloud:Closer than you think
 
CCSK Certificate of Cloud Computing Knowledge - overview
CCSK Certificate of Cloud Computing Knowledge - overviewCCSK Certificate of Cloud Computing Knowledge - overview
CCSK Certificate of Cloud Computing Knowledge - overview
 
RightScale Webinar - Coping With Cloud Migration Challenges: Best Practices a...
RightScale Webinar - Coping With Cloud Migration Challenges: Best Practices a...RightScale Webinar - Coping With Cloud Migration Challenges: Best Practices a...
RightScale Webinar - Coping With Cloud Migration Challenges: Best Practices a...
 
Webinar compiled powerpoint
Webinar compiled powerpointWebinar compiled powerpoint
Webinar compiled powerpoint
 
cloud
cloudcloud
cloud
 
Supporting operations personnel a software engineers perspective
Supporting operations personnel a software engineers perspectiveSupporting operations personnel a software engineers perspective
Supporting operations personnel a software engineers perspective
 
Edge computing system for large scale distributed sensing systems
Edge computing system for large scale distributed sensing systemsEdge computing system for large scale distributed sensing systems
Edge computing system for large scale distributed sensing systems
 
VTU 5TH SEM CSE SOFTWARE ENGINEERING SOLVED PAPERS - JUN13 DEC13 JUN14 DEC14 ...
VTU 5TH SEM CSE SOFTWARE ENGINEERING SOLVED PAPERS - JUN13 DEC13 JUN14 DEC14 ...VTU 5TH SEM CSE SOFTWARE ENGINEERING SOLVED PAPERS - JUN13 DEC13 JUN14 DEC14 ...
VTU 5TH SEM CSE SOFTWARE ENGINEERING SOLVED PAPERS - JUN13 DEC13 JUN14 DEC14 ...
 

Recently uploaded

Orientation, design and principles of polyhouse
Orientation, design and principles of polyhouseOrientation, design and principles of polyhouse
Orientation, design and principles of polyhousejana861314
 
TOPIC 8 Temperature and Heat.pdf physics
TOPIC 8 Temperature and Heat.pdf physicsTOPIC 8 Temperature and Heat.pdf physics
TOPIC 8 Temperature and Heat.pdf physicsssuserddc89b
 
Bentham & Hooker's Classification. along with the merits and demerits of the ...
Bentham & Hooker's Classification. along with the merits and demerits of the ...Bentham & Hooker's Classification. along with the merits and demerits of the ...
Bentham & Hooker's Classification. along with the merits and demerits of the ...Nistarini College, Purulia (W.B) India
 
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...Sérgio Sacani
 
Call Us ≽ 9953322196 ≼ Call Girls In Mukherjee Nagar(Delhi) |
Call Us ≽ 9953322196 ≼ Call Girls In Mukherjee Nagar(Delhi) |Call Us ≽ 9953322196 ≼ Call Girls In Mukherjee Nagar(Delhi) |
Call Us ≽ 9953322196 ≼ Call Girls In Mukherjee Nagar(Delhi) |aasikanpl
 
CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service 🪡
CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service  🪡CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service  🪡
CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service 🪡anilsa9823
 
zoogeography of pakistan.pptx fauna of Pakistan
zoogeography of pakistan.pptx fauna of Pakistanzoogeography of pakistan.pptx fauna of Pakistan
zoogeography of pakistan.pptx fauna of Pakistanzohaibmir069
 
Nanoparticles synthesis and characterization​ ​
Nanoparticles synthesis and characterization​  ​Nanoparticles synthesis and characterization​  ​
Nanoparticles synthesis and characterization​ ​kaibalyasahoo82800
 
Biopesticide (2).pptx .This slides helps to know the different types of biop...
Biopesticide (2).pptx  .This slides helps to know the different types of biop...Biopesticide (2).pptx  .This slides helps to know the different types of biop...
Biopesticide (2).pptx .This slides helps to know the different types of biop...RohitNehra6
 
Physiochemical properties of nanomaterials and its nanotoxicity.pptx
Physiochemical properties of nanomaterials and its nanotoxicity.pptxPhysiochemical properties of nanomaterials and its nanotoxicity.pptx
Physiochemical properties of nanomaterials and its nanotoxicity.pptxAArockiyaNisha
 
The Black hole shadow in Modified Gravity
The Black hole shadow in Modified GravityThe Black hole shadow in Modified Gravity
The Black hole shadow in Modified GravitySubhadipsau21168
 
Module 4: Mendelian Genetics and Punnett Square
Module 4:  Mendelian Genetics and Punnett SquareModule 4:  Mendelian Genetics and Punnett Square
Module 4: Mendelian Genetics and Punnett SquareIsiahStephanRadaza
 
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...Sérgio Sacani
 
GFP in rDNA Technology (Biotechnology).pptx
GFP in rDNA Technology (Biotechnology).pptxGFP in rDNA Technology (Biotechnology).pptx
GFP in rDNA Technology (Biotechnology).pptxAleenaTreesaSaji
 
Analytical Profile of Coleus Forskohlii | Forskolin .pdf
Analytical Profile of Coleus Forskohlii | Forskolin .pdfAnalytical Profile of Coleus Forskohlii | Forskolin .pdf
Analytical Profile of Coleus Forskohlii | Forskolin .pdfSwapnil Therkar
 
Work, Energy and Power for class 10 ICSE Physics
Work, Energy and Power for class 10 ICSE PhysicsWork, Energy and Power for class 10 ICSE Physics
Work, Energy and Power for class 10 ICSE Physicsvishikhakeshava1
 
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptxSOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptxkessiyaTpeter
 
Behavioral Disorder: Schizophrenia & it's Case Study.pdf
Behavioral Disorder: Schizophrenia & it's Case Study.pdfBehavioral Disorder: Schizophrenia & it's Case Study.pdf
Behavioral Disorder: Schizophrenia & it's Case Study.pdfSELF-EXPLANATORY
 
Neurodevelopmental disorders according to the dsm 5 tr
Neurodevelopmental disorders according to the dsm 5 trNeurodevelopmental disorders according to the dsm 5 tr
Neurodevelopmental disorders according to the dsm 5 trssuser06f238
 
Call Girls in Mayapuri Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Mayapuri Delhi 💯Call Us 🔝9953322196🔝 💯Escort.Call Girls in Mayapuri Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Mayapuri Delhi 💯Call Us 🔝9953322196🔝 💯Escort.aasikanpl
 

Recently uploaded (20)

Orientation, design and principles of polyhouse
Orientation, design and principles of polyhouseOrientation, design and principles of polyhouse
Orientation, design and principles of polyhouse
 
TOPIC 8 Temperature and Heat.pdf physics
TOPIC 8 Temperature and Heat.pdf physicsTOPIC 8 Temperature and Heat.pdf physics
TOPIC 8 Temperature and Heat.pdf physics
 
Bentham & Hooker's Classification. along with the merits and demerits of the ...
Bentham & Hooker's Classification. along with the merits and demerits of the ...Bentham & Hooker's Classification. along with the merits and demerits of the ...
Bentham & Hooker's Classification. along with the merits and demerits of the ...
 
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
 
Call Us ≽ 9953322196 ≼ Call Girls In Mukherjee Nagar(Delhi) |
Call Us ≽ 9953322196 ≼ Call Girls In Mukherjee Nagar(Delhi) |Call Us ≽ 9953322196 ≼ Call Girls In Mukherjee Nagar(Delhi) |
Call Us ≽ 9953322196 ≼ Call Girls In Mukherjee Nagar(Delhi) |
 
CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service 🪡
CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service  🪡CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service  🪡
CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service 🪡
 
zoogeography of pakistan.pptx fauna of Pakistan
zoogeography of pakistan.pptx fauna of Pakistanzoogeography of pakistan.pptx fauna of Pakistan
zoogeography of pakistan.pptx fauna of Pakistan
 
Nanoparticles synthesis and characterization​ ​
Nanoparticles synthesis and characterization​  ​Nanoparticles synthesis and characterization​  ​
Nanoparticles synthesis and characterization​ ​
 
Biopesticide (2).pptx .This slides helps to know the different types of biop...
Biopesticide (2).pptx  .This slides helps to know the different types of biop...Biopesticide (2).pptx  .This slides helps to know the different types of biop...
Biopesticide (2).pptx .This slides helps to know the different types of biop...
 
Physiochemical properties of nanomaterials and its nanotoxicity.pptx
Physiochemical properties of nanomaterials and its nanotoxicity.pptxPhysiochemical properties of nanomaterials and its nanotoxicity.pptx
Physiochemical properties of nanomaterials and its nanotoxicity.pptx
 
The Black hole shadow in Modified Gravity
The Black hole shadow in Modified GravityThe Black hole shadow in Modified Gravity
The Black hole shadow in Modified Gravity
 
Module 4: Mendelian Genetics and Punnett Square
Module 4:  Mendelian Genetics and Punnett SquareModule 4:  Mendelian Genetics and Punnett Square
Module 4: Mendelian Genetics and Punnett Square
 
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...
 
GFP in rDNA Technology (Biotechnology).pptx
GFP in rDNA Technology (Biotechnology).pptxGFP in rDNA Technology (Biotechnology).pptx
GFP in rDNA Technology (Biotechnology).pptx
 
Analytical Profile of Coleus Forskohlii | Forskolin .pdf
Analytical Profile of Coleus Forskohlii | Forskolin .pdfAnalytical Profile of Coleus Forskohlii | Forskolin .pdf
Analytical Profile of Coleus Forskohlii | Forskolin .pdf
 
Work, Energy and Power for class 10 ICSE Physics
Work, Energy and Power for class 10 ICSE PhysicsWork, Energy and Power for class 10 ICSE Physics
Work, Energy and Power for class 10 ICSE Physics
 
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptxSOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
 
Behavioral Disorder: Schizophrenia & it's Case Study.pdf
Behavioral Disorder: Schizophrenia & it's Case Study.pdfBehavioral Disorder: Schizophrenia & it's Case Study.pdf
Behavioral Disorder: Schizophrenia & it's Case Study.pdf
 
Neurodevelopmental disorders according to the dsm 5 tr
Neurodevelopmental disorders according to the dsm 5 trNeurodevelopmental disorders according to the dsm 5 tr
Neurodevelopmental disorders according to the dsm 5 tr
 
Call Girls in Mayapuri Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Mayapuri Delhi 💯Call Us 🔝9953322196🔝 💯Escort.Call Girls in Mayapuri Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Mayapuri Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
 

Challenges of Cloud Monitoring

  • 1. On Challenges of Cloud Monitoring William Pourmajidi John Steinbacher Tony Erwin Andriy Miranskyy Dept. of Computer Science IBM Canada Lab IBM Watson and Cloud Platform Dept. of Computer Science Ryerson University Ryerson University CASCON 2017 - Toronto This work would not have been possible without the support of the IBM Centre for Advanced Studies
  • 2. Contents • Abstract • Introduction • Position • Challenges and Solutions • Summary • References • Q&A 1
  • 3. Abstract Cloud Popularity • 60% of IT spending in 2016 was Cloud-based [23] • Public Cloud Market size will reach $236B by 2020 [4] Monitoring of Cloud • Several challenges are addressed, we will explore the following: 1. Defining health states of Cloud systems 2. Creating unified monitoring environments 3. Establishing high availability strategies 2 [23] Mahowald et al. , Worldwide Cloud 2016 Predictions — Mastering the Raw Material of Digital Transformation (2015) [4] Bartel et al., PUBLIC CLOUD MARKET WILL GROW TO $236 BILLION IN 2020. Technical Report. Forrester Research (2016)
  • 4. Introduction Cloud Monitoring and its challenges Constant data collection: billions of records per day, a big data problem! Elasticity: Tremendous challenge for conventional monitoring tools Cloud Networks: Large-scale networks Cloud Delivery options: IaaS, PaaS, SaaS Autonomous Control: Cloud providers have full control over monitoring solutions and collected data Quality dimensions: Availability, reliability, and performance require different types of monitoring practices 3
  • 5. Introduction Examples of solved challenges Data-efficient Log system: A log system that avoids storage of repetitive records, 80% reduction of storage size [3] Elastic Monitoring solution: A multi-tier tool that adjusts its scale based on the size of elastic platforms [39] 4 [3] Anwar et al. , Anatomy of Cloud Monitoring and Metering (2015) [39] Ward et al. , Self Managing Monitoring for Highly Elastic Large Scale Cloud Deployments ng and Metering (2014)
  • 6. Introduction Examples of solved challenges Cloud Networks: A clustered, fault-tolerant, network monitoring tool [31] Autonomous Control: An agent-based, role-based system that provides a trustworthy and holistic monitoring solution [30] Log Analysis: A tool based on recurrent neural networks that detects up to 98.3% of anomalies [5] 4 [31] Pongpaibool et al. , A Robust and Scalable Service-Oriented Platform for Distributed Monitoring (2014) [30] Nguyen et al. , Role-Based Templates for Cloud Monitoring (2014) [5] Jandaghi et al. , Semantic Aware Online Detection of Resource Anomalies on the Cloud (2016)
  • 7. Position Unresolved Challenges Our position is that further research is required in the realm of Cloud monitoring in the following areas: 1. Defining health states of Cloud systems 2. Creating unified monitoring environments 3. Establishing high availability strategies 5
  • 8. Challenges and Solutions 1- Defining health states Health States • Typically: • Binary value (healthy, unhealthy) • Calculated from set of attributes and thresholds making up “health” Example • For a VM, let’s say healthy is CPU < 10% and available storage > 1 GB • VM diskspace goes below 1 GB • Operations team alerted and must react before disk space becomes 0 KB! Question • How unhealthy? E.g., Is it worth getting someone up at 2:00 AM? 6
  • 9. Challenges and Solutions 1- Defining health states • Potential Solution • States can be extended to a ternary state classification (e.g., “green/good”, “yellow/warning”, “red/bad”) [10] • Warning state might mean a “wait till 9am” decision, while the “red” state might mean “look at it right now” • Furthermore, time-series forecasting [33], could use historical behavior to predict future behavior • Drawback: Predicted rate of data storage could be wrong based on totally unforeseen event (e.g., cyber attack) 7 [33] Shumway et al. , Time Series Analysis and Its Applications (2017) [10] Datadog , Modern monitoring & analytics. https://www.datadoghq.com/ (2017)
  • 10. Challenges and Solutions 1- Defining health state - size and complexity challenges • Cloud deployments may have thousands of components (containers, VMs, switches, bare-metal servers, etc.) • Number of individual components that requires monitoring can grow rapidly • Components have multiple attributes and capturing all possible permutations of all components in a system becomes formidable • Microservice architectures add additional monitoring challenges • Challenge: Site reliability engineers still need to set proper alarms 8
  • 11. Challenges and Solutions 1- Defining health state - size and complexity challenges • Potential Solution • Define templates loosely tailored to different groups of components (e.g., Docker containers processing user authentication) • Machine learning techniques such as cognitive computing [28] and deep neural networks [5,17] can be used to “customize” the templates for each component • Feedback from operations can be used to further train the model by using reinforcement learning schemes [40] • Drawback: Large volumes of collected data are required [28] Modha et al., Cognitive Computing (2011) [5] Bhattacharyya et al., Semantic Aware Online Detection of Resource Anomalies on the Cloud (2016) [17] Guo et al., Robust Online Time Series Prediction with Recurrent Neural Networks (2016) [40] Wiering et al., Reinforcement Learning: State-of-the-Art (2012)
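
The template idea on this slide can be illustrated with a deliberately simple stand-in. The sketch below is an assumption-laden example, not the approach from the talk: instead of cognitive computing or neural networks, it tailors a group-level alarm threshold to one component using the mean and standard deviation of that component's own history. `AlarmTemplate`, `customize`, and the multiplier `k` are hypothetical names.

```python
from dataclasses import dataclass, replace
from statistics import mean, stdev


@dataclass(frozen=True)
class AlarmTemplate:
    group: str        # e.g. "auth-containers" (hypothetical group name)
    metric: str       # e.g. "cpu_percent"
    threshold: float  # alarm fires when the metric exceeds this value


def customize(template, history, k=3.0):
    """Return a copy of the template tuned to one component's history.

    The new threshold is mean + k * sample standard deviation. With fewer
    than two samples the group-level template is returned unchanged.
    """
    if len(history) < 2:
        return template
    return replace(template, threshold=mean(history) + k * stdev(history))
```

A learned model would replace the `mean + k * stdev` rule, but the shape stays the same: a shared template per group, plus a per-component adjustment driven by collected data.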
  • 12. Challenges and Solutions 1- Defining health states - overall system state • Users are concerned with the overall system health, not the state of individual components • Different components have different impacts • E.g., a load balancer has a greater impact on the user’s experience than a persistent storage component that is a little slow • Potential Solution: Use statistical process control [29] to compute the overall health state [29] Montgomery, Introduction to Statistical Quality Control (2012)
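
One way to read the statistical process control suggestion is as a control chart over a weighted system-level score. The sketch below is a minimal illustration under stated assumptions, not the method from the talk: per-component scores lie in [0, 1], the weights encode user impact and are chosen by hand, and the limits are the baseline mean plus or minus three sample standard deviations (a simplification of the charts described in [29]). All function names are hypothetical.

```python
from statistics import mean, stdev


def composite_score(component_scores, weights):
    """Weighted average of per-component scores in [0, 1]; higher is healthier."""
    total_weight = sum(weights[name] for name in component_scores)
    weighted_sum = sum(score * weights[name] for name, score in component_scores.items())
    return weighted_sum / total_weight


def control_limits(baseline, k=3.0):
    """Shewhart-style limits: baseline mean plus/minus k sample standard deviations."""
    center = mean(baseline)
    spread = stdev(baseline)
    return center - k * spread, center + k * spread


def out_of_control(score, limits):
    """True when the system-level score falls outside the control limits."""
    lower, upper = limits
    return score < lower or score > upper
```

Because the load balancer carries a larger weight than storage in such a scheme, the same drop in a component score moves the composite further for the load balancer, which matches the slide's point that components differ in impact.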
  • 13. Challenges and Solutions 2- Creating unified monitoring environments Major Challenges • Monitoring solutions are often discipline-specific • Redundant logs, lack of a unified view, imperfect decision support systems, different logs for cloud users and cloud providers Potential Solution • Design and implement a unified monitoring system Drawbacks • Multiple semi-independent teams make it difficult to settle on a unified solution • The cost of migrating data, adjusting dashboards to the newly defined system, and adopting the new framework can be high
  • 14. Challenges and Solutions 2- Creating unified monitoring environments Major Challenges • The hierarchical structure of the cloud causes monitoring challenges • The lowest layer consists of the data center and its hardware components, which support the many software-defined layers that sit on top to fulfill user requirements • An issue on a lower layer can easily affect upper layers • Tracing such issues is a challenging task that requires a unified monitoring system • Without knowing how layers are inter-related, operations teams may wonder whether an error originated in their layer or in another
  • 15. Challenges and Solutions 3- Establishing high availability (HA) failover strategies • Defining HA and failover strategies for large, distributed, cloud-based systems is not trivial and requires extensive monitoring • Monitoring tools should provide a holistic view of the health of the main site as well as all backup sites (hot and warm) • Such broad coverage causes challenges such as increased latency in real-time processing of logs, which delays the decisions that the HA module should make • Failovers are computationally and commercially expensive and should be used only when necessary
  • 16. Challenges and Solutions 3- Establishing high availability (HA) failover strategies • Deciding whether to fail over the entire system to a backup site or just a portion of it can make a significant difference • It is critical to understand what has gone wrong and which components are affected • To the best of our knowledge, no general framework that addresses this task exists at the time of writing; more research is needed
  • 17. Challenges and Solutions 3- Establishing high availability (HA) failover strategies • Designing and implementing HA strategies for monolithic services is easier than for microservices • If one node fails, a typical failover may bring back the service, but the state of the transactions may become unknown and/or data may become inconsistent • Potential Solution • Create a monitoring service that satisfies the principles of Atomicity, Consistency, Isolation, Durability (ACID) and does not allow failovers that would result in data inconsistency • In case of a microservice failure, the monitoring system makes a logical decision about failing over one, a group, or all of the microservices based on transactional boundaries
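
The decision about failing over one, a group, or all microservices can be sketched as a closure over transactional boundaries. The example below is a hypothetical illustration, not a framework from the talk (which states that no general framework exists): it assumes each transactional boundary is known in advance and given as a set of service names, and it only computes which services should move together. It does not itself provide ACID guarantees. `failover_scope` and the service names in the usage note are assumptions.

```python
def failover_scope(failed, transaction_groups):
    """Return the set of services that should fail over together.

    Any group that shares a transaction with an affected service is pulled
    in, repeating until no further group is touched.
    """
    scope = set(failed)
    changed = True
    while changed:
        changed = False
        for group in transaction_groups:
            # A group partially inside the scope would be split by a failover,
            # so the whole group is added.
            if scope & group and not group <= scope:
                scope |= group
                changed = True
    return scope
```

For instance, with boundaries `{"cart", "payment"}`, `{"payment", "ledger"}`, and `{"search"}`, a failure in `cart` pulls in `payment` and `ledger`, while a failure in `search` fails over that service alone.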
  • 18. Summary • While some challenges have already been solved, our position is that further research is required in the realm of cloud monitoring • We focused on three areas 1. Defining health states of cloud systems 2. Creating unified monitoring environments 3. Establishing high availability strategies • We have shown that these areas are interconnected • To make HA decisions (area 3), one needs to understand the health states (area 1) of all components of the system, which requires a unified view (area 2) • Cloud monitoring is a fertile area for novel research and practice