Challenges of Cloud Monitoring

On Challenges of Cloud Monitoring
William Pourmajidi, Dept. of Computer Science, Ryerson University
John Steinbacher, IBM Canada Lab
Tony Erwin, IBM Watson and Cloud Platform
Andriy Miranskyy, Dept. of Computer Science, Ryerson University
CASCON 2017 - Toronto
This work would not have been possible without the
support of the IBM Centre for Advanced Studies

Contents
• Abstract
• Introduction
• Position
• Challenges and Solutions
• Summary
• References
• Q&A

Abstract
Cloud Popularity
• 60% of IT spending in 2016 was Cloud-based [23]
• Public Cloud Market size will reach $236B by 2020 [4]
Monitoring of Cloud
• Of the several challenges that exist, we will explore the following:
1. Defining health states of Cloud systems
2. Creating unified monitoring environments
3. Establishing high availability strategies
[23] Mahowald et al. , Worldwide Cloud 2016 Predictions — Mastering the Raw Material of Digital Transformation (2015)
[4] Bartel et al., Public Cloud Market Will Grow to $236 Billion in 2020. Technical Report. Forrester Research (2016)

Introduction
Cloud Monitoring and its challenges
Constant data collection: billions of records per day, a big data problem!
Elasticity: Tremendous challenge for conventional monitoring tools
Cloud Networks: Large-scale networks
Cloud Delivery options: IaaS, PaaS, SaaS
Autonomous Control: Cloud providers have full control over monitoring
solutions and collected data
Quality dimensions: Availability, reliability, and performance require
different types of monitoring practices

Introduction
Examples of solved challenges
Data-efficient Log system: A log system that avoids storage of repetitive records,
80% reduction of storage size [3]
Elastic Monitoring solution: A multi-tier tool that adjusts its scale based on the size of
elastic platforms [39]
[3] Anwar et al., Anatomy of Cloud Monitoring and Metering (2015)
[39] Ward et al., Self Managing Monitoring for Highly Elastic Large Scale Cloud Deployments (2014)

Introduction
Examples of solved challenges
Cloud Networks: A clustered, fault-tolerant, network monitoring tool [31]
Autonomous Control: An agent-based, role-based system that provides a trustworthy and
holistic monitoring solution [30]
Log Analysis: A tool based on recurrent neural networks that detects up to 98.3% of
anomalies [5]
[31] Pongpaibool et al., A Robust and Scalable Service-Oriented Platform for Distributed Monitoring (2014)
[30] Nguyen et al., Role-Based Templates for Cloud Monitoring (2014)
[5] Bhattacharyya et al., Semantic Aware Online Detection of Resource Anomalies on the Cloud (2016)

Position
Unresolved Challenges
Our position is that further research is required in the realm of Cloud
monitoring in the following areas:
1. Defining health states of Cloud systems
2. Creating unified monitoring environments
3. Establishing high availability strategies

Challenges and Solutions
1- Defining health states
Health States
• Typically:
• Binary value (healthy, unhealthy)
• Calculated from set of attributes and thresholds making up “health”
Example
• For a VM, suppose “healthy” means CPU < 10% and available storage > 1 GB
• The VM’s available storage drops below 1 GB
• The operations team is alerted and must react before the available storage reaches 0 KB!
Question
• How unhealthy? E.g., Is it worth getting someone up at 2:00 AM?

1- Defining health states
• Potential Solution
• States can be extended to a ternary state classification (e.g.,
“green/good”, “yellow/warning”, “red/bad”) [10]
• Warning state might mean a “wait till 9am” decision, while the “red”
state might mean “look at it right now”
• Furthermore, time-series forecasting [33] could use historical behavior to
predict future behavior
• Drawback: The predicted rate of data storage could be wrong because of a
totally unforeseen event (e.g., a cyber attack)
[33] Shumway et al., Time Series Analysis and Its Applications (2017)
[10] Datadog, Modern monitoring & analytics. https://www.datadoghq.com/ (2017)
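
The ternary classification and the forecasting idea can be sketched as follows. This is a minimal illustration, not the authors' implementation: the 1 GB limit mirrors the VM example, while the 5 GB warning limit, the function names, and the use of a plain least-squares trend line (standing in for a proper time-series model [33]) are all assumptions.

```python
from __future__ import annotations

from enum import Enum


class Health(Enum):
    GREEN = "green/good"
    YELLOW = "yellow/warning"
    RED = "red/bad"


# Hypothetical thresholds: 1 GB comes from the VM example,
# the 5 GB warning limit is an assumption for illustration.
RED_BELOW_GB = 1.0
YELLOW_BELOW_GB = 5.0


def classify_disk(free_gb: float) -> Health:
    """Map available storage to a ternary health state."""
    if free_gb < RED_BELOW_GB:
        return Health.RED
    if free_gb < YELLOW_BELOW_GB:
        return Health.YELLOW
    return Health.GREEN


def hours_until_full(samples: list[tuple[float, float]]) -> float | None:
    """Fit a straight line to (hour, free_gb) samples.

    Returns the estimated hours after the last sample until free space
    reaches zero, or None if free space is not shrinking.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_t = sum(t for t, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return None
    slope = sum((t - mean_t) * (y - mean_y) for t, y in samples) / var_t
    if slope >= 0:
        return None
    intercept = mean_y - slope * mean_t
    t_zero = -intercept / slope  # time at which the fitted line hits 0 GB
    return max(0.0, t_zero - samples[-1][0])
```

A "yellow" state combined with a long forecast horizon could support a "wait till 9am" decision, while a short horizon would argue for escalating. As the drawback above notes, a trend line like this cannot anticipate a sudden, unforeseen change in the storage rate.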

1- Defining health state - size and complexity challenges
• Cloud deployments may have thousands of components (containers, VMs,
switches, bare-metal servers, etc.)
• The number of individual components that require monitoring can grow rapidly
• Components have multiple attributes and capturing all possible permutations
of all components in a system becomes formidable
• Microservice architectures add additional monitoring challenges
• Challenge: Site reliability engineers still need to set proper alarms

1- Defining health state - size and complexity challenges
• Define templates loosely tailored to different groups of components (e.g.,
Docker containers processing user authentication)
• Machine learning techniques such as cognitive computing [28] and deep
neural networks [5,17] can be used to “customize” the templates for each
component
• Feedback from operations can be used to further train the model by
using reinforcement learning schemes [40]
• Drawback: Large volumes of collected data are required
[28] Modha et al., Cognitive Computing (2011)
[5] Bhattacharyya et al., Semantic Aware Online Detection of Resource Anomalies on the Cloud (2016)
[17] Guo et al., Robust Online Time Series Prediction with Recurrent Neural Networks (2016)
[40] Wiering et al., Reinforcement Learning: State-of-the-Art (2012)
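
The template idea can be illustrated with a small sketch. To keep it short, the learning step is replaced by a plain percentile computed over a component's own history; this is a deliberate simplification, not the neural-network or reinforcement-learning approach cited above. The `Template` fields, the group name, and the `min_samples` cut-off are hypothetical.

```python
from __future__ import annotations

import statistics
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Template:
    """Loose alarm template shared by a group of similar components."""
    group: str
    cpu_warn_pct: float
    cpu_alarm_pct: float


def customize(template: Template, cpu_history: list[float],
              min_samples: int = 20) -> Template:
    """Tailor a group template to one component's observed behavior.

    With too little data the group defaults are kept, which reflects the
    drawback that customization needs a large volume of collected data.
    """
    if len(cpu_history) < min_samples:
        return template
    cuts = statistics.quantiles(cpu_history, n=100)  # 99 percentile cut points
    warn = min(100.0, cuts[94])   # 95th percentile of this component's CPU
    alarm = min(100.0, cuts[98])  # 99th percentile of this component's CPU
    return replace(template, cpu_warn_pct=warn, cpu_alarm_pct=alarm)
```

For example, `customize(Template("auth-containers", 70.0, 90.0), history)` returns the group defaults for a newly started container and component-specific thresholds once enough samples exist.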

1- Defining health states - overall system state
• Users are concerned with the overall system health and not the state of
individual components
• Different components have different impacts
• E.g., a load balancer has a greater impact on the user’s experience
than a persistent storage component that may be a little slow
• Potential Solution: Use statistical process control [29] to compute the overall
health state
[29] Montgomery, Introduction to Statistical Quality Control (2012)
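
One possible reading of this proposal is sketched below: component scores are combined into a weighted overall score, and that score is checked against Shewhart-style control limits (mean ± 3 standard deviations) computed from a healthy baseline. The weights, the 0-to-1 score scale, and the function names are assumptions made for illustration.

```python
from __future__ import annotations

import statistics


def overall_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of per-component health scores (0 = bad, 1 = good)."""
    total_weight = sum(weights[name] for name in scores)
    return sum(scores[name] * weights[name] for name in scores) / total_weight


def control_limits(baseline: list[float], k: float = 3.0) -> tuple[float, float]:
    """Lower and upper control limits from baseline overall scores."""
    mean = statistics.fmean(baseline)
    sigma = statistics.stdev(baseline)
    return mean - k * sigma, mean + k * sigma


def in_control(score: float, limits: tuple[float, float]) -> bool:
    """True if the overall score lies within the control limits."""
    lower, upper = limits
    return lower <= score <= upper
```

Giving the load balancer a weight of 3.0 and the storage component a weight of 1.0, for instance, makes a load-balancer problem pull the overall score down three times as strongly as an equally severe storage problem.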

2- Creating unified monitoring environments
Major Challenges
• Monitoring solutions are often discipline-specific
• Redundant logs, lack of unified view, imperfect decision support systems,
different logs for cloud users and cloud providers
Potential Solution
• Design and implement a unified monitoring system
Drawbacks
• Multiple semi-independent teams make it difficult to settle on a unified
solution
• Cost of data migration, adjusting dashboards based on the newly defined
system, and adopting the new framework can be high

2- Creating unified monitoring environments
Major Challenges
• Hierarchical structure of cloud causes monitoring challenges
• Lowest layer consists of data center and its hardware components, which
support many software-defined layers sitting on top to fulfill user
requirements
• An issue on a lower layer can easily affect upper layers
• Tracing such issues is a challenging task that requires a unified monitoring
system.
• Without knowing how layers are inter-related, operations may wonder if an
error was caused in their layer or in another
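
If the relationships between layers are known, tracing can be sketched as a walk over a dependency graph: an unhealthy component is a probable root cause only if nothing it depends on, directly or indirectly, is also unhealthy. The graph shape, component names, and function names below are hypothetical, and the sketch assumes the dependency graph has no cycles.

```python
from __future__ import annotations


def transitive_deps(component: str, depends_on: dict[str, list[str]]) -> set[str]:
    """All lower-layer components that `component` relies on."""
    seen: set[str] = set()
    stack = list(depends_on.get(component, []))
    while stack:
        dep = stack.pop()
        if dep not in seen:
            seen.add(dep)
            stack.extend(depends_on.get(dep, []))
    return seen


def probable_root_causes(depends_on: dict[str, list[str]],
                         unhealthy: set[str]) -> set[str]:
    """Unhealthy components with no unhealthy dependency beneath them."""
    return {
        component
        for component in unhealthy
        if not (transitive_deps(component, depends_on) & unhealthy)
    }


# Hypothetical stack: application -> container -> VM -> physical host.
layers = {
    "app": ["container"],
    "container": ["vm"],
    "vm": ["host"],
    "host": [],
}
```

With `layers` as above, `probable_root_causes(layers, {"app", "vm"})` points at the VM layer, telling the application team that the error most likely did not originate in their layer.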

3- Establishing high availability (HA) failover strategies
• Defining HA and failover strategies for large distributed, cloud-based systems
is not trivial and requires extensive monitoring
• Monitoring tools SHOULD provide holistic health of the main site as well as all
other backup sites (hot and warm)
• Such broad coverage causes challenges such as increased latency in the real-time
processing of logs, which delays decisions that the HA module should make
• Failovers are computationally and commercially expensive and should be used
only when necessary

• Deciding whether to fail over the entire system to a backup site, or just a
portion of it, can make a significant difference
• Critical to understand what has gone wrong and which components are
affected
• To the best of our knowledge, no general framework to address this task
exists at the time of writing; more research is needed

• Designing and implementing HA strategies for monolithic services is easier than
for microservices
• If one node fails, a typical fail-over may bring back the service but the state of
the transactions may become unknown and/or data becomes inconsistent.
• Create a monitoring service that satisfies principles of Atomicity,
Consistency, Isolation, Durability (ACID) and will not allow failovers that
would result in data inconsistency
• In case of a microservice failure, the monitoring system will make a
logical decision about failing over one, a group, or all of the microservices
based on transactional boundaries.
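
The scoping decision can be sketched as follows, assuming each transactional boundary is known as a named set of microservices (a hypothetical representation; the service names are illustrative). Every service that shares a boundary with a failed one is failed over with it, and the scope grows transitively when a service belongs to more than one boundary.

```python
from __future__ import annotations


def failover_scope(boundaries: dict[str, set[str]], failed: set[str]) -> set[str]:
    """Return the set of microservices that should fail over together.

    Starts from the failed services and repeatedly pulls in every
    transactional boundary that overlaps the current scope, so that no
    transaction is split between the main site and the backup site.
    """
    scope = set(failed)
    changed = True
    while changed:
        changed = False
        for members in boundaries.values():
            if members & scope and not members <= scope:
                scope |= members
                changed = True
    return scope


# Hypothetical boundaries; "orders" participates in two of them.
boundaries = {
    "checkout": {"cart", "payment", "orders"},
    "fulfilment": {"orders", "shipping"},
    "search": {"catalog", "search"},
}
```

Here a failure of `payment` yields the scope `cart`, `payment`, `orders`, and `shipping`, while a failure of `search` only moves `catalog` and `search`. Depending on how the boundaries overlap, the result is one service, a group, or all of them.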

Summary
• While some challenges have already been solved, our position is that further
research is required in the realm of cloud monitoring
• We focused on three areas
1. Defining health states of cloud systems
2. Creating unified monitoring environments
3. Establishing high availability strategies
• We have shown that these areas are interconnected
• To make HA decisions (area 3), one needs to understand health states
(area 1) of all components of the system (area 2)
• Cloud monitoring is a fertile area for novel research and practice
