Autonomous Cloud Operations
Asim Razzaq, CEO
BACKGROUND
AUTOMATED VS AUTONOMOUS SYSTEM
Automated System: A system that implements a repeatable, predetermined set of rules
Human Driven, Machine Assisted
Example: Automobile Assembly Line
Autonomous System: A system that has the ability to act independently
Machine Driven, Human Assisted
Example: Boeing 787 Cockpit
AUTOMATED VS AUTONOMOUS CLOUD OPERATIONS
Automated System: A system that implements a repeatable, predetermined set of rules
Human Driven, Machine Assisted
● Monitoring & Alerting
● Reporting & Analytics
● Continuous Deployment
● Orchestration
● Troubleshooting & Tracing
Autonomous System: A system that has the ability to act independently
Machine Driven, Human Assisted
● ?
KEY PILLARS OF MODERN CLOUD OPERATIONS
Governance: Financial and operational policy management
● Budget enforcement
● Shut down untagged resources
● Operating hours
● Compliance, e.g. GDPR
Optimization: Consumption to value
● Right-size resources
● Reserved Instance purchases
Lifecycle Management: Standardized provisioning & de-provisioning
● Configure resource size and dependencies
Monitoring: Troubleshoot & resolve issues
● Monitor machines & applications
● Log and analyze machine data
● Root cause & remediation
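The governance items above (budget enforcement, shutting down untagged resources, operating hours) can be read as rules evaluated against a resource inventory. The following is a minimal sketch under stated assumptions, not a description of any particular product: the `Resource` record, its fields, the 08:00 to 20:00 window, and the action strings are all hypothetical, and the function only proposes actions rather than executing them.

```python
from dataclasses import dataclass
from datetime import time


@dataclass
class Resource:
    # Hypothetical inventory record; field names are illustrative only.
    name: str
    tags: dict
    daily_cost: float


def within_operating_hours(now: time, start: time, end: time) -> bool:
    """True if `now` falls inside the [start, end) operating window."""
    return start <= now < end


def governance_actions(resources, now, daily_budget,
                       start=time(8, 0), end=time(20, 0)):
    """Return proposed (resource name, action) pairs; nothing is executed."""
    actions = []
    for r in resources:
        if not r.tags:
            # Rule: shut down untagged resources.
            actions.append((r.name, "shutdown: untagged"))
        elif not within_operating_hours(now, start, end):
            # Rule: tagged resources run only during operating hours.
            actions.append((r.name, "stop: outside operating hours"))
    # Rule: budget enforcement across the whole inventory.
    total = sum(r.daily_cost for r in resources)
    if total > daily_budget:
        actions.append(
            ("*", f"alert: daily cost {total:.2f} exceeds budget {daily_budget:.2f}")
        )
    return actions
```

Returning proposals instead of acting directly keeps a human in the loop, which matches the "Human Driven, Machine Assisted" end of the spectrum; an autonomous system would go one step further and apply them.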
CLOUD OPERATIONS GETTING SIGNIFICANTLY COMPLEX
40%: Wastage due to OpEx Cost Model
On average, enterprises waste 40% of their cloud infrastructure
30%: Time on Manual Cloud Ops
On average, enterprises spend 30% of their Cloud Ops time on manual tasks that should be automated
>50%: Cloud Ops Decentralizing
Over 50% of enterprises are looking to decentralize cloud operations
25+: Tools Causing Fatigue
On average, enterprises have 25+ tools to manage cost, performance and availability SLAs for cloud workloads
95%: IT Workforce Lacking Expertise
Over 95% of the IT workforce lacks expertise in large-scale, distributed and fault-tolerant public cloud computing
WHAT MAKES IT POSSIBLE NOW?
Autonomous Vehicles
● Scale and variety of data
● Maturity of ML models
● Network bandwidth
● Cloud & local processing power
● Mapping technology
● Vision Detection
WHAT MAKES IT POSSIBLE NOW?
Autonomous Cloud Operations
● Scale and variety of data
● Compute capacity
● Machine Learning models
● Standardization of Infrastructure as a Service
● Standardization of telemetry, log, machine data
● Programmability and APIs to take action
AUTONOMOUS CLOUD OPERATIONS
Level 1: Deterministic deployments
● 12-factor app
● Fully automated CI/CD
● Testing automation
● Automated provisioning
● Automated rollbacks
Services
● Jenkins, Travis
● Chef, Ansible, Puppet
● Vagrant
● Spinnaker
Level 2: Metrics focused
● Full monitoring, logging and alerting
● Well-defined and measured availability, latency and throughput
● SLOs and SLAs tied to business KPIs
Services
● Prometheus, Datadog
● Sumo Logic, Loggly
● PagerDuty, VictorOps
● New Relic, AppDynamics
Level 3: Cloud Native Microservices
● Manageable loosely coupled systems
● Containerized microservices + service meshes
● Declarative, immutable Infrastructure-as-code
Services
● AWS, Azure, GCP
● Docker, rkt
● Terraform, CloudFormation, BOSH
● Amazon API Gateway, Envoy, Kong
Level 4: Efficient Auto-scaling
● Container orchestration
● Maximized resource utilization & density
● Provably resilient
Services
● Kubernetes
● Chaos Monkey
Level 5: Fully self-managed
● Machine learning
● System + Business Context
Services
● Autonomous CloudOps
ASPIRATIONAL USE CASE
PaymentServ Cluster, November 26 - January 2nd
● Fault Tolerance: Less than 0.01% failure rate
● Business KPI: At least 10,000 payment transactions per day
● Performance: 1 sec, 95th percentile latency
● Cost: At most $2,500 per day
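The four objectives of this use case can be written down as machine-checkable targets, which is the starting point for any system expected to manage them on its own. The sketch below is illustrative only: the `DailyStats` record and its field names are hypothetical, and it assumes the performance target means the 95th percentile latency should be at most 1 second.

```python
import math
from dataclasses import dataclass


@dataclass
class DailyStats:
    # Hypothetical daily rollup for the cluster; field names are illustrative.
    transactions: int
    failures: int
    latencies_sec: list
    cost_usd: float


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list (pct between 1 and 100)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(len(ordered) * pct / 100))
    return ordered[rank - 1]


def evaluate(stats):
    """Check one day of stats against the four targets from the slide."""
    failure_rate = (
        stats.failures / stats.transactions if stats.transactions else 1.0
    )
    return {
        "fault_tolerance": failure_rate < 0.0001,        # less than 0.01%
        "business_kpi": stats.transactions >= 10_000,    # at least 10,000 per day
        "performance": percentile(stats.latencies_sec, 95) <= 1.0,  # assumed bound
        "cost": stats.cost_usd <= 2_500,                 # at most $2,500 per day
    }
```

A day passes only if every value in the returned dictionary is true; keeping the four checks separate shows which objective was traded off against the others.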
DETECT THE ANOMALIES
YotaScale Confidential & Proprietary
IDENTIFY CONTRIBUTING REASONS
ROOT CAUSE ANALYSIS
PLATFORM CONCEPTS
Cloud Provider Data
● Cost
● Utilization
● Inventory
● Logs
● Containers
● DETECT: Discover trends and identify incidents
● DIAGNOSE: Identify root cause
● OPTIMIZE: Suggestions to remediate issues
● PREDICT: Forecast the future
Third Party Data
● Performance
● Memory
● Configuration
APIs
AUTONOMOUS
Governance Policies
● Mandatory Tags
● RegEx Formats
● Resource Whitelist
● Purchase Preferences
Enterprise Integrations
MANUAL
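Governance policies such as mandatory tags, RegEx formats and a resource whitelist lend themselves to a declarative definition that a platform can evaluate per resource. This is a minimal sketch under assumptions: the policy layout, tag names, patterns and instance types below are invented for illustration and do not come from the slides.

```python
import re

# Hypothetical policy; tag names, patterns, and types are illustrative only.
POLICY = {
    "mandatory_tags": {
        "owner": r"[a-z]+\.[a-z]+@example\.com",
        "env": r"(dev|qa|prod)",
    },
    "resource_whitelist": {"t3.medium", "m5.large"},
}


def violations(resource_type, tags, policy=POLICY):
    """Return a list of human-readable policy violations for one resource."""
    found = []
    if resource_type not in policy["resource_whitelist"]:
        found.append(f"resource type not whitelisted: {resource_type}")
    for tag, pattern in policy["mandatory_tags"].items():
        value = tags.get(tag)
        if value is None:
            found.append(f"missing mandatory tag: {tag}")
        elif not re.fullmatch(pattern, value):
            # fullmatch requires the whole value to match the format.
            found.append(f"tag {tag} has invalid format: {value}")
    return found
```

An empty result means the resource is compliant; any entries could feed either a report (manual end of the spectrum) or an automatic remediation step (autonomous end).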
Proprietary AI Models Trained on Cloud Data
[Chart: Accuracy vs. Data; accuracy labels 10%, 50%, 85% (TODAY); data label 502 PB]
                 6 MONTHS   1.5 YEARS   TODAY
Data Ingested    5 PB       68 PB       502 PB
Resources        4.2 K      81 K        476 K
Apps             1 K        4 K         21 K
Cost Processed   $4 M       $47 M       $202 M
There is no compression algorithm for experience.
SUGGESTIONS & RECOMMENDATIONS
● Invest in classic machine learning
● Start with Dev/QA environments
● Pick a use case or a specific type of workload
● Initially focus on autonomous insights and behavior analysis
● Make sure you have scale and variety of data
● In a single customer environment, the best you can do is leverage historical data
● Give YotaScale a try!
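As one illustration of starting with classic techniques and historical data, the sketch below flags anomalies in a single metric with a trailing-window z-score. It is an assumption-laden example, not the method described in this talk: the window size, the threshold, and the sample daily cost series are all hypothetical.

```python
from statistics import mean, stdev


def detect_anomalies(series, window=7, threshold=3.0):
    """Flag indices whose value is more than `threshold` standard deviations
    away from the mean of the preceding `window` values."""
    flagged = []
    for i in range(window, len(series)):
        history = series[i - window:i]
        mu, sigma = mean(history), stdev(history)
        if sigma == 0:
            # A constant history gives no scale to compare against; skip it.
            continue
        if abs(series[i] - mu) / sigma > threshold:
            flagged.append(i)
    return flagged


# Hypothetical daily cost in dollars; index 7 is a deliberate spike.
daily_cost = [100, 102, 98, 101, 99, 100, 103, 250, 101]
print(detect_anomalies(daily_cost))
```

With only one environment's history, a detector like this can say that a value is unusual for that environment, but not whether it is unusual compared with similar workloads elsewhere, which is why scale and variety of data matter.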
Autonomous Cloud Operations
Questions?

