The document discusses monitoring strategies for cloud infrastructure and applications. It notes that effective monitoring involves more than just collecting data and requires tiered escalation processes and incorporating lessons learned into policies. The document outlines key considerations for what to monitor including infrastructure, software services, and business processes. It also discusses challenges in monitoring cloud environments and strategies for adopting cloud-native monitoring tools.
Cloud monitoring - An essential Platform Service
1. Grow revenue opportunities with fast, personalized
web experiences and manage complexity from peak
demand, mobile devices and data collection.
Cloud Monitoring
Soumitra Bhattacharyya
Director Engineering, Akamai Technologies
www.linkedin.com/in/soumitra001
An Essential for any Platform Service
2. Avoid data theft and downtime by extending the
security perimeter outside the data-center and
protect from increasing frequency, scale and
sophistication of web attacks.
Myths and Mistakes migrating to cloud
Anti-thesis to Cloud adoption
- Cloud providers are impenetrable
- Cloud infrastructure is infallible
- All performance problems will be addressed once we move to cloud
- Capacity of cloud providers is infinite and scalability is managed by itself
- Guaranteed SLA from provider
Mistakes in Cloud adoption
- Not having a process and crisis plan
- Focus on technology and less on business need
- Relying on Provider’s dashboard, tools and utilities
Monitoring Workflow
Cycle of Safety (diagram): Monitor, Impact, Bucket, Mitigation, RCA, Safety Policies/Procedures
Health Monitoring
- Monitoring is a process and not a standalone activity
- Involves tiered escalation
- Latency of detection and timeliness of mitigation are the key
- Learn from every event and incorporate the lessons in policies (e.g. Moratoriums)
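The two timings called out above (latency of detection and timeliness of mitigation) can be measured per event once three timestamps are recorded. The sketch below is only an illustration: the `EventTimeline` class and its field names are hypothetical and do not come from the slides.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class EventTimeline:
    """Hypothetical timestamps recorded for one monitoring event."""
    started: datetime    # when the problem actually began
    detected: datetime   # when monitoring raised the alert
    mitigated: datetime  # when the mitigation took effect

    def detection_latency(self) -> timedelta:
        # How long the problem existed before monitoring noticed it.
        return self.detected - self.started

    def time_to_mitigate(self) -> timedelta:
        # How long it took to act once the problem was known.
        return self.mitigated - self.detected


event = EventTimeline(
    started=datetime(2020, 1, 1, 10, 0),
    detected=datetime(2020, 1, 1, 10, 4),
    mitigated=datetime(2020, 1, 1, 10, 19),
)
print(event.detection_latency())  # 0:04:00
print(event.time_to_mitigate())   # 0:15:00
```

Tracking both numbers separately shows whether an improvement should target the monitoring itself or the escalation and mitigation steps that follow it.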
“Monitoring should be followed through with mitigation actions”
Monitoring Workflow – Incident Management
What constitutes an Incident:
- Outage Impacting availability and disruption
- Performance degradation impacting users
- Problem interfering with service administration
Ownership, Responsibility and LifeCycle
- Component owner (SME)
- Business owner
- Incident coordination
- Severity of incident
- Phases of incident
- Resolution time
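One possible way to capture the ownership and lifecycle items above in a single record is sketched below. The severity levels, the phase names, and the mapping of severities to the three incident types are assumptions made for illustration; the slides do not define them.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from enum import Enum
from typing import List, Optional, Tuple


class Severity(Enum):
    # Assumed mapping to the three incident types listed in the slide.
    SEV1 = 1  # outage impacting availability
    SEV2 = 2  # performance degradation impacting users
    SEV3 = 3  # problem interfering with service administration


class Phase(Enum):
    # Hypothetical phase names.
    OPEN = "open"
    MITIGATING = "mitigating"
    RESOLVED = "resolved"


@dataclass
class Incident:
    title: str
    severity: Severity
    component_owner: str  # SME for the affected component
    business_owner: str
    coordinator: str      # person handling incident coordination
    history: List[Tuple[Phase, datetime]] = field(default_factory=list)

    def move_to(self, phase: Phase, at: datetime) -> None:
        self.history.append((phase, at))

    @property
    def phase(self) -> Optional[Phase]:
        return self.history[-1][0] if self.history else None

    def resolution_time(self) -> Optional[timedelta]:
        # Only defined once the incident has reached RESOLVED.
        if self.phase is not Phase.RESOLVED:
            return None
        return self.history[-1][1] - self.history[0][1]


incident = Incident("Checkout latency spike", Severity.SEV2,
                    "db-team", "payments", "on-call lead")
incident.move_to(Phase.OPEN, datetime(2020, 1, 1, 9, 0))
incident.move_to(Phase.MITIGATING, datetime(2020, 1, 1, 9, 10))
incident.move_to(Phase.RESOLVED, datetime(2020, 1, 1, 9, 45))
print(incident.resolution_time())  # 0:45:00
```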
Monitoring Considerations
Depending on your business and control, one or more of the following is a priority:
• Infrastructure
• Application
• Business Process (SLA)
Identifying what to monitor should begin early in the lifecycle
• Starts during Product/System Architecture
• Product architect/Performance Org defines what should be monitored
• Component owners know and write the mitigation steps
Monitoring Decisions:
• Reactive vs Proactive
• Real-time vs Non-real-time
• Snapshot vs Trend
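The snapshot versus trend decision can be made concrete with a small sketch. A snapshot check looks only at the latest value, while a trend check fits a line through recent samples and estimates when a threshold would be crossed, which supports the proactive style of monitoring. The function names and the disk-usage numbers are hypothetical, and a least-squares line is just one simple way to estimate a trend.

```python
from typing import Optional, Sequence


def snapshot_breach(latest: float, threshold: float) -> bool:
    """Snapshot check: is the latest value at or over the threshold?"""
    return latest >= threshold


def trend_slope(samples: Sequence[float]) -> float:
    """Least-squares slope of evenly spaced samples (units per interval)."""
    n = len(samples)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    num = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(samples))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den


def intervals_until_breach(samples: Sequence[float],
                           threshold: float) -> Optional[float]:
    """Trend check: estimated intervals until the threshold is reached.

    Returns None when the metric is flat or falling.
    """
    slope = trend_slope(samples)
    if slope <= 0:
        return None
    return max(0.0, (threshold - samples[-1]) / slope)


disk_used_percent = [70.0, 72.0, 74.0, 76.0, 78.0]
print(snapshot_breach(disk_used_percent[-1], 90.0))     # False
print(intervals_until_breach(disk_used_percent, 90.0))  # 6.0
```

Here the snapshot check stays silent, while the trend check reports that the disk would reach 90% in about six sampling intervals if the growth continues.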
What to Monitor
Infrastructure
• Hardware – Physical health of servers (CPU, Memory, Disk)
• Data Centre – Group of machines, Regions
• Network – Bandwidth, Devices, Connection, Performance
• Virtualization – Hosts, number of VMs per machine
• Network Storage – Capacity, Disk wear, Volume
Software Services and Business Process/SLA
• Rolling of service
• DB performance, Queues
• Data transfer, file size, volume, data backlog
• Network utilization (Capacity and Cost)
• HTTP errors, Application errors
• Traffic volume, periodic security scans
• Domain-specific monitoring: Web, Security, Media
Monitoring Architecture
Monitoring Interfaces:
• Telemetry data publishing
• Data collection and transport
• Data analysis through visualization
• Detection of alerting conditions
Quality goals of Monitoring data:
• Synchronization
• Completeness
• Latency
• Consistency
Component owners' goals:
• Identification of correct metrics to publish
• Condition for alerting
• Data sampling, aggregation interval, etc.
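The component owner's choices (sampling, aggregation interval, condition for alerting) interact, so a short sketch may help. It averages raw samples into fixed intervals and alerts only when several consecutive intervals exceed a threshold. The function names, the 60-second interval, and the "consecutive intervals" rule are all assumptions for illustration, not something the slides prescribe.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, List, Tuple

Sample = Tuple[int, float]  # (unix timestamp in seconds, value)


def aggregate(samples: Iterable[Sample], interval_s: int) -> Dict[int, float]:
    """Average samples into fixed intervals keyed by interval start time."""
    buckets: Dict[int, List[float]] = defaultdict(list)
    for ts, value in samples:
        buckets[ts - ts % interval_s].append(value)
    return {start: mean(vals) for start, vals in sorted(buckets.items())}


def should_alert(aggregated: Dict[int, float], threshold: float,
                 consecutive: int) -> bool:
    """True when the last `consecutive` intervals all exceed the threshold."""
    recent = list(aggregated.values())[-consecutive:]
    return len(recent) == consecutive and all(v > threshold for v in recent)


# CPU usage sampled every 30 seconds, aggregated per minute.
cpu_samples = [(0, 50.0), (30, 60.0), (60, 92.0),
               (90, 96.0), (120, 95.0), (150, 99.0)]
per_minute = aggregate(cpu_samples, interval_s=60)
print(per_minute)                       # {0: 55.0, 60: 94.0, 120: 97.0}
print(should_alert(per_minute, 90.0, consecutive=2))  # True
print(should_alert(per_minute, 90.0, consecutive=3))  # False
```

A longer aggregation interval or a higher `consecutive` count reduces noisy alerts at the cost of slower detection, which ties back to the latency goal for monitoring data.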
Adoption of Cloud Platform Monitoring Tools
1. Amazon CloudWatch
2. Microsoft Cloud Monitoring
3. AppDynamics
4. Datadog
5. Sumo Logic
6. Prometheus
7. Telegraf
Challenges:
- Multiple dashboards and alerting mechanisms
- Cannot monitor business SLAs
- Each component is monitored in isolation
- Limitation of individual tools
- Not free
Public Cloud Monitoring Tools (Integration)
Approach
Identify sources and create plugins to collect data.
Single dashboard for all trends and alerts
Collect data in own defined format and send it to own monitoring/Alerting system
Diagram: Data sources (system resources such as CPU/RAM/Memory, Database, NGINX, ...) feed Telegraf, which collects time series data from a variety of sources and stores it in InfluxDB; Grafana/Chronograf provide visualisation and graphs. Other sources like Azure Monitor go through an Aggregator and a REST API to the Monitoring/Alerting system.
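As an illustration of "collect data in own defined format", the sketch below serialises a metric as InfluxDB line protocol, the text format InfluxDB accepts for writes. The slides do not say which format was actually used, so this is a stand-in example. The `to_line_protocol` helper is hypothetical and simplified: it handles float fields only and does not escape backslashes.

```python
from typing import Dict


def _escape(text: str, chars: str) -> str:
    """Backslash-escape each character in `chars`."""
    for ch in chars:
        text = text.replace(ch, "\\" + ch)
    return text


def to_line_protocol(measurement: str, tags: Dict[str, str],
                     fields: Dict[str, float], timestamp_ns: int) -> str:
    """Build one line: measurement,tag=value field=value timestamp."""
    if not fields:
        raise ValueError("at least one field is required")
    head = _escape(measurement, ", ")
    # Tags are sorted by key so identical series serialise identically.
    tag_part = "".join(
        f",{_escape(k, ',= ')}={_escape(v, ',= ')}"
        for k, v in sorted(tags.items())
    )
    field_part = ",".join(
        f"{_escape(k, ',= ')}={float(v)}" for k, v in fields.items()
    )
    return f"{head}{tag_part} {field_part} {timestamp_ns}"


line = to_line_protocol(
    "cpu",
    {"region": "us east", "host": "web-01"},
    {"usage_percent": 42.5},
    1700000000000000000,
)
print(line)
# cpu,host=web-01,region=us\ east usage_percent=42.5 1700000000000000000
```

A collector plugin for each data source could emit lines like this, so that every source ends up in the same store and the same dashboard.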
Challenges and Evolution
Challenges
• Monitoring and identifying problem areas is an evolving process
• Correlating the issue seen with its cause is complex
• Identification of automatable steps for mitigation
• False positives and triggers – Hierarchical view
• Deploying a fix in a complex environment (Phased)
Evolution
• Started with a few servers and has evolved from a single script to a scalable process
• Engineers working on trends and carrying out predictive analysis for early mitigation
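One reading of the "hierarchical view" for false positives and triggers is that an alert on a parent component should suppress the alerts of everything beneath it, so that engineers see the likely root rather than every symptom. The sketch below assumes that interpretation. The component names and the `root_alerts` helper are hypothetical, and the hierarchy is assumed to be a tree with no cycles.

```python
from typing import Dict, Optional, Set


def root_alerts(alerting: Set[str],
                parent_of: Dict[str, Optional[str]]) -> Set[str]:
    """Keep only alerting components that have no alerting ancestor."""
    roots = set()
    for component in alerting:
        ancestor = parent_of.get(component)
        suppressed = False
        # Walk up the hierarchy looking for an ancestor that is alerting.
        while ancestor is not None:
            if ancestor in alerting:
                suppressed = True
                break
            ancestor = parent_of.get(ancestor)
        if not suppressed:
            roots.add(component)
    return roots


parent_of = {
    "region-a": None,
    "rack-1": "region-a",
    "rack-7": "region-a",
    "host-1": "rack-1",
    "host-2": "rack-1",
    "host-9": "rack-7",
}
alerting = {"rack-1", "host-1", "host-2", "host-9"}
print(sorted(root_alerts(alerting, parent_of)))  # ['host-9', 'rack-1']
```

Here the two hosts under `rack-1` are folded into the rack-level alert, while `host-9` is still reported because nothing above it is alerting.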
Thank You
Email : soumitra001@hotmail.com
Editor's Notes
Read and get some inputs from Service incident
Website monitoring: Tracking the processes, traffic, availability and resource utilization of cloud-hosted websites
Virtual machine monitoring: Monitoring the virtualization infrastructure and individual virtual machines
Database monitoring: Monitoring processes, queries, availability, and consumption of cloud database resources
Virtual network monitoring: Monitoring virtual network resources, devices, connections, and performance
Cloud storage monitoring: Monitoring storage resources and their processes provisioned to virtual machines, services, databases, and applications
https://docs.google.com/spreadsheets/d/1QJy0dNeAvKqI4Z5WpN5PDHi17WDuiczuCGkfuxQ_ZqQ/edit#gid=1486764721
https://collaborate.akamai.com/confluence/display/MediaAnalytics/BOCC+Alert+Scenarios
Write each one of them well
https://phoenixnap.com/blog/cloud-monitoring-tools
Telegraf is the open source server agent to help you collect metrics from your stacks, sensors and systems.
Correlating the issue seen with the underlying problem in software/infrastructure