Cloud monitoring - An essential Platform Service

Cloud Monitoring
Soumitra Bhattacharyya
Director Engineering, Akamai Technologies
www.linkedin.com/in/soumitra001
An Essential for any Platform Service

Myths and Mistakes When Migrating to Cloud
Antithesis to cloud adoption (myths)
- Cloud providers are impenetrable
- Cloud infrastructure is infallible
- All performance problems will be addressed once we move to cloud
- Capacity of cloud providers is infinite and scalability manages itself
- The provider's SLA is guaranteed
Mistakes in cloud adoption
- Not having a process and crisis plan
- Focusing on technology and less on business need
- Relying on the provider's dashboard, tools and utilities

Monitoring Workflow
Cycle of Safety (diagram): Monitor, Impact, Bucket, Mitigation, RCA, Safety, Policies/Procedures
Health Monitoring
- Monitoring is a process, not a standalone activity
- Involves tiered escalation
- Latency of detection and timeliness of mitigation are key
- Learn from every event and incorporate the lessons into policies (e.g. moratoriums)

“Monitoring should be followed through with mitigation actions”
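
Tiered escalation and the two timing measures named on this slide (latency of detection, timeliness of mitigation) can be sketched in a few lines. This is only an illustration: the tier thresholds, role names and the `EventTimeline` class are assumptions made for the example, not values from the talk.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical escalation tiers: (minutes without acknowledgement, who is notified).
ESCALATION_TIERS = [
    (0, "on-call engineer"),
    (15, "component owner (SME)"),
    (30, "incident coordinator"),
    (60, "business owner"),
]


def escalation_targets(minutes_unacknowledged: float) -> List[str]:
    """Return everyone who should have been notified by this point."""
    return [who for threshold, who in ESCALATION_TIERS
            if minutes_unacknowledged >= threshold]


@dataclass
class EventTimeline:
    started_at: float    # when the fault began (epoch seconds)
    detected_at: float   # when monitoring raised the alert
    mitigated_at: float  # when the mitigation took effect

    @property
    def detection_latency(self) -> float:
        """Seconds between the fault starting and monitoring noticing it."""
        return self.detected_at - self.started_at

    @property
    def time_to_mitigate(self) -> float:
        """Seconds between detection and effective mitigation."""
        return self.mitigated_at - self.detected_at
```

Tracking both numbers per event gives the "learn from every event" step something concrete to review.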
Monitoring Workflow – Incident Management
What constitutes an incident:
- Outage impacting availability or causing disruption
- Performance degradation impacting users
- Problem interfering with service administration
Ownership, Responsibility and Lifecycle
- Component owner (SME)
- Business owner
- Incident coordination
- Severity of incident
- Phases of incident
- Resolution time
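
The ownership and lifecycle items above map naturally onto a small record type. The sketch below is one possible shape, not the speaker's schema: the severity levels, their mapping to the three incident kinds, and the phase names are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from enum import Enum
from typing import Optional


class Severity(Enum):
    # Assumed mapping of severity to the three incident kinds on the slide.
    SEV1 = 1  # outage impacting availability
    SEV2 = 2  # performance degradation impacting users
    SEV3 = 3  # problem interfering with service administration


class Phase(Enum):
    # Hypothetical phase names; the slide only says incidents have phases.
    DETECTED = "detected"
    MITIGATING = "mitigating"
    RESOLVED = "resolved"


@dataclass
class Incident:
    summary: str
    severity: Severity
    component_owner: str   # the SME who knows the mitigation steps
    business_owner: str
    coordinator: str       # runs incident coordination
    opened_at: datetime
    phase: Phase = Phase.DETECTED
    resolved_at: Optional[datetime] = None

    def resolve(self, when: datetime) -> None:
        """Move the incident to its final phase and record the time."""
        self.phase = Phase.RESOLVED
        self.resolved_at = when

    def resolution_time(self) -> Optional[timedelta]:
        """Elapsed time from opening to resolution, or None if still open."""
        if self.resolved_at is None:
            return None
        return self.resolved_at - self.opened_at
```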

Monitoring Considerations
Depending on your business and level of control, one or more of the following is the priority:
• Infrastructure
• Application
• Business Process (SLA)
Identifying what to monitor should begin early in the lifecycle:
• Starts during product/system architecture
• The product architect or performance organization defines what should be monitored
• Component owners know and write the mitigation steps
Monitoring decisions:
• Reactive vs. proactive
• Real-time vs. non-real-time
• Snapshot vs. trend
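
The snapshot versus trend decision can be made concrete with a small example. A snapshot check looks only at the latest value, while a trend check fits a line through recent samples and projects when a threshold would be crossed, which is what makes proactive alerting possible. The function names and the least-squares approach are assumptions for this sketch, not something the slides prescribe.

```python
from typing import List, Optional


def snapshot_alert(value: float, threshold: float) -> bool:
    """Reactive check: is the latest reading already over the threshold?"""
    return value >= threshold


def trend_slope(samples: List[float]) -> float:
    """Least-squares slope of evenly spaced samples (units per interval)."""
    n = len(samples)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    num = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(samples))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den


def intervals_until(samples: List[float], threshold: float) -> Optional[float]:
    """Proactive check: projected intervals until the threshold is reached.

    Returns None when the series is flat or falling.
    """
    slope = trend_slope(samples)
    if slope <= 0:
        return None
    return max(0.0, (threshold - samples[-1]) / slope)
```

For a disk filling at a steady rate, the snapshot check stays quiet until the threshold is hit, while the trend check reports how many sampling intervals remain.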

What to Monitor
Categories: Infrastructure, Software Services, Business Process/SLA
• Hardware – Physical health of servers (CPU, memory, disk)
• Data Centre – Group of machines, regions
• Network – Bandwidth, devices, connection, performance
• Virtualization – Hosts, number of VMs per machine
• Network Storage – Capacity, disk wear, volume
• Rolling of service
• DB performance, queues
• Data transfer, file size, volume, data backlog
• Network utilization (capacity and cost)
• HTTP errors, application errors
• Traffic volume, periodic security scans
• Domain-specific monitoring: Web, Security, Media
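
As a small illustration of the hardware items in this list (CPU and disk), the sketch below reads a few host-level values using only the Python standard library. The function name and the returned field names are assumptions for the example; a production collector would normally rely on an agent such as Telegraf.

```python
import os
import shutil
import time


def collect_host_metrics(path: str = "/") -> dict:
    """Take one snapshot of basic host health for the filesystem at `path`."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    metrics = {
        "timestamp": time.time(),
        "cpu_count": os.cpu_count(),
        "disk_used_pct": 100.0 * usage.used / usage.total if usage.total else 0.0,
    }
    # Load average is only available on Unix-like systems.
    if hasattr(os, "getloadavg"):
        try:
            metrics["load_1m"] = os.getloadavg()[0]
        except OSError:
            pass  # load average could not be obtained on this host
    return metrics
```

Calling `collect_host_metrics()` on a schedule and storing the results gives the raw series that trend analysis needs.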

Monitoring Architecture
Monitoring interfaces:
• Telemetry data publishing
• Data collection and transport
• Data analysis through visualization
• Detection of alerting conditions
Quality goals of monitoring data:
• Synchronization
• Completeness
• Latency
• Consistency
Component owners' goals:
• Identification of correct metrics to publish
• Condition for alerting
• Data sampling, aggregation interval, etc.
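
The component owner's choices (aggregation interval, condition for alerting) and the completeness quality goal can be sketched as follows. The function names, the fixed-window averaging and the simple threshold condition are assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean


def window_start(ts, interval_s):
    """Start of the fixed window that timestamp `ts` falls into."""
    return int(ts // interval_s) * interval_s


def aggregate(samples, interval_s):
    """Average (timestamp, value) samples per fixed aggregation interval."""
    buckets = defaultdict(list)
    for ts, value in samples:
        buckets[window_start(ts, interval_s)].append(value)
    return {start: mean(vals) for start, vals in sorted(buckets.items())}


def completeness(samples, interval_s, expected_per_interval):
    """Fraction of the expected samples that actually arrived in each window."""
    counts = defaultdict(int)
    for ts, _ in samples:
        counts[window_start(ts, interval_s)] += 1
    return {start: n / expected_per_interval
            for start, n in sorted(counts.items())}


def alerting_windows(aggregated, threshold):
    """Condition for alerting: windows whose average exceeds the threshold."""
    return [start for start, avg in aggregated.items() if avg > threshold]
```

A window that looks healthy but has low completeness deserves suspicion, since missing samples can hide the very spike the alert condition is meant to catch.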

Adoption of Cloud Platform Monitoring Tools
1. Amazon CloudWatch
2. Microsoft Cloud Monitoring
3. AppDynamics
4. Datadog
5. Sumo Logic
6. Prometheus
7. Telegraf
Challenges:
- Multiple dashboards and alerting mechanisms
- Cannot monitor business SLAs
- Each component is monitored in isolation
- Limitations of individual tools
- Not free

Public Cloud Monitoring Tools (Integration)
Approach
• Identify sources and create plugins to collect data
• Single dashboard for all trends and alerts
• Collect data in a self-defined format and send it to your own monitoring/alerting system
[Diagram: data sources (system resources such as CPU/RAM/memory, database, NGINX, and others) feed Telegraf, which collects time-series data from a variety of sources and writes it to InfluxDB; Grafana/Chronograf provide visualisation and graphs of the time-series data. The diagram also shows other sources like Azure Monitor, an aggregator, and a REST API to the monitoring/alerting system.]
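
The approach of writing one plugin per source and emitting data in a single agreed format can be sketched as below. The plugin registry, the `nginx_stats` plugin and its placeholder values are hypothetical. The output imitates the general shape of InfluxDB line protocol (`measurement,tags fields timestamp`) but is a simplification: it does not escape special characters and writes every field as a float.

```python
import time
from typing import Callable, Dict, List, Optional

# Hypothetical plugin registry: measurement name -> function returning fields.
PLUGINS: Dict[str, Callable[[], Dict[str, float]]] = {}


def plugin(measurement: str):
    """Decorator that registers a collector under a measurement name."""
    def register(fn):
        PLUGINS[measurement] = fn
        return fn
    return register


@plugin("nginx")
def nginx_stats() -> Dict[str, float]:
    # Placeholder values; a real plugin would query the service itself.
    return {"active_connections": 12.0, "requests_per_s": 340.5}


def to_line(measurement: str, tags: Dict[str, str],
            fields: Dict[str, float], ts_ns: int) -> str:
    """Render one point as 'measurement,tag=v field=v timestamp'."""
    tag_part = "".join(f",{k}={v}" for k, v in sorted(tags.items()))
    field_part = ",".join(f"{k}={float(v)}" for k, v in sorted(fields.items()))
    return f"{measurement}{tag_part} {field_part} {ts_ns}"


def collect_all(host: str, ts_ns: Optional[int] = None) -> List[str]:
    """Run every registered plugin and return the points in the common format."""
    ts_ns = time.time_ns() if ts_ns is None else ts_ns
    return [to_line(name, {"host": host}, fn(), ts_ns)
            for name, fn in PLUGINS.items()]
```

Because every source ends up in the same format, one dashboard and one alerting system can serve all of them, which is the point of the integration approach.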

Challenges and Evolution
Challenges
• Monitoring and identifying problem areas is an evolving process
• Correlating an observed issue with its cause is complex
• Identifying automatable steps for mitigation
• False positives and triggers – hierarchical view
• Deploying a fix in a complex environment (phased)
Evolution
• Started with a few servers and evolved from a single script to a scalable process
• Engineers work on trends and carry out predictive analysis for early mitigation
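
One way to read "hierarchical view" as a remedy for false positives and noisy triggers is to suppress an alert whenever something above it in the hierarchy is already alerting, so that a failed host does not also page for every VM on it. The hierarchy, the node names and this suppression rule are assumptions for the sketch, not a description of the speaker's system.

```python
from typing import Dict, Set

# Hypothetical hierarchy: child -> parent (VM -> host -> data centre).
PARENT: Dict[str, str] = {
    "vm-1": "host-a",
    "vm-2": "host-a",
    "host-a": "dc-east",
    "host-b": "dc-east",
}


def root_alerts(alerting: Set[str]) -> Set[str]:
    """Keep only alerts that have no alerting ancestor in the hierarchy."""
    def has_alerting_ancestor(node: str) -> bool:
        parent = PARENT.get(node)
        while parent is not None:
            if parent in alerting:
                return True
            parent = PARENT.get(parent)
        return False

    return {node for node in alerting if not has_alerting_ancestor(node)}
```

With `vm-1`, `vm-2` and `host-a` all alerting, only `host-a` is surfaced, which points the responder at the likely cause instead of its symptoms.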

Thank You
Email : soumitra001@hotmail.com
