How to use Istio/Anthos to build Enterprise SRE

Building Enterprise SRE with Istio
Hybrid Specialist: Shawn Ho
shawnho@google.com

Product Lifecycle
Concept → Business → Development → Operations → Market
Agile solves the Business-to-Development gap
DevOps solves the Development-to-Operations gap

Developers: Agility
Operators: Stability
Dev & Ops' KPIs aren't aligned

What is the relationship between DevOps and SRE?
● DevOps is an abstract concept: guidelines and disciplines for breaking down silos between development and operations
● SRE is Google's concrete practice of DevOps
“class SRE implements DevOps”
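
The "class SRE implements DevOps" metaphor can be spelled out literally. The sketch below is only an illustration: the method names (`reduce_silos`, `measure_everything`) are hypothetical and do not come from the slides.

```python
from abc import ABC, abstractmethod


class DevOps(ABC):
    """Abstract principles: what to achieve, not how (method names are illustrative)."""

    @abstractmethod
    def reduce_silos(self) -> str: ...

    @abstractmethod
    def measure_everything(self) -> str: ...


class SRE(DevOps):
    """One concrete way to satisfy the DevOps interface."""

    def reduce_silos(self) -> str:
        return "share ownership of production between developers and operators"

    def measure_everything(self) -> str:
        return "define SLIs and SLOs, then track the error budget"


sre = SRE()
print(isinstance(sre, DevOps))  # True
```

The abstract class cannot be instantiated on its own, which mirrors the point of the slide: DevOps states the goals, and a practice such as SRE supplies the implementation.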

Self-Service Platform
● Monitoring
● Automation
● CI/CD
SRE, Developers
Class SRE = REAL PERSON

#1. Decision based on data
Every decision is grounded in data

#2. Be user centric
Even if every monitoring metric looks normal,
if customers feel the system is unstable, then the system is unstable

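#3. Blameless culture & Shared responsibility
Breaking down departmental silos starts with sharing responsibility across teams (Developers, Operators, Leaders)
A system failure is not solely the operators' responsibility; code quality, technical debt, and other factors are all possible causes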

2. How to Implement SRE with Istio/Anthos?

Istio in 2 minutes
[Diagram: Istio architecture
● Data plane: Service A and Service B each sit behind a sidecar proxy; proxy-to-proxy traffic (HTTP, gRPC, TCP) is secured with mTLS, and the proxies perform local authorization (Local Authz)
● Gateways: an Ingress Gateway and an Egress Gateway apply perimeter security policies; Internal App 1 and External App 1 connect with JWT + TLS
● Istio Control Plane (control plane API on the K8s API server): Galley; Pilot (routing + secure naming); policy enforcement + reporting, with logging and monitoring plugins; Citadel (cert issuance, with a CertAuthority plugin)
● Legend: data flow; control + metrics flow]

What does SRE implement on the Platform?
● Metrics & monitoring
○ SLO
○ Dashboard
○ Analytics
● Capacity planning
○ Forecasting
○ Demand-driven
○ Performance
● Change management
○ Release process
○ Consulting design
○ Automations
● Emergency response
○ Oncall
○ Incident analysis
○ Postmortems
● Culture
○ Toil management
○ Blamelessness
○ Shared responsibility

Monitoring and Incident Management
● Understand system architecture
○ Understand the system architecture and deployed topology
● System monitoring
○ Monitor the system by gathering blackbox & whitebox metrics
○ SLIs & SLOs are extracted from the metrics and logs
○ The information is visualized through dashboards
● Log handling
○ Manage planned events (releases, maintenance)
● Incident handling
○ Create an incident ticket
○ Roll back the change to resolve the incident
○ Investigate the root cause with logging, monitoring metrics, and debugging
● Postmortem
○ Review the incident and prepare a plan to prevent recurrence
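
As a rough illustration of extracting SLIs from logs, the sketch below computes an availability SLI and a latency SLI from request records. The log line format (`<method> <path> <status> <latency_ms>`) and all function names are assumptions made for this example; real access logs and Anthos/Istio telemetry look different.

```python
from dataclasses import dataclass


@dataclass
class Request:
    status: int
    latency_ms: float


def parse_line(line: str) -> Request:
    """Parse a line in the hypothetical format '<method> <path> <status> <latency_ms>'."""
    _method, _path, status, latency = line.split()
    return Request(status=int(status), latency_ms=float(latency))


def availability_sli(requests: list[Request]) -> float:
    """Fraction of requests that did not fail with a 5xx status (list must be non-empty)."""
    good = sum(1 for r in requests if r.status < 500)
    return good / len(requests)


def latency_sli(requests: list[Request], threshold_ms: float) -> float:
    """Fraction of requests served faster than the threshold (list must be non-empty)."""
    fast = sum(1 for r in requests if r.latency_ms < threshold_ms)
    return fast / len(requests)


log = [
    "GET /api/items 200 42",
    "GET /api/items 200 87",
    "POST /api/orders 503 1200",
    "GET /api/items 200 130",
]
requests = [parse_line(line) for line in log]
print(availability_sli(requests))  # 0.75
print(latency_sli(requests, 100))  # 0.5
```

Both SLIs are ratios of good events to total events, which is what makes them straightforward to compare against an SLO target.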

What to Monitor?
SLO = SLI + Target
“99% of REST API calls will complete in less than 100 ms every week”
(SLI: REST API calls completing in less than 100 ms; Target: 99% every week)
● SLI (service level indicator): a well-defined measure of "good enough"
○ Used to specify an SLO/SLA
● SLO (service level objective): a top-line target for the fraction of good interactions
○ Specifies goals (SLI + Target)
● SLA (service level agreement): consequences
○ SLA = (SLO + margin) + consequences = SLI + Target + consequences
● Error Budget: product management and SRE define an availability target
○ 100% minus the availability target is a "budget of unreliability" (the error budget)
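
The relation SLO = SLI + Target, and the error budget derived from it, can be sketched in a few lines. This is a minimal illustration; the function names and the sample numbers (one million calls in a week) are made up for the example.

```python
def slo_met(good_events: int, total_events: int, target: float) -> bool:
    """SLO = SLI + target: the SLI (good / total) must reach the target."""
    sli = good_events / total_events
    return sli >= target


def error_budget_remaining(good_events: int, total_events: int, target: float) -> float:
    """Fraction of the error budget still unspent (negative when overspent).

    A target of 1.0 leaves no budget at all, so it is rejected here.
    """
    if target >= 1.0:
        raise ValueError("a 100% target leaves no error budget")
    allowed_bad = (1 - target) * total_events
    actual_bad = total_events - good_events
    return 1 - actual_bad / allowed_bad


# Hypothetical week: 1,000,000 REST calls, 994,000 finished in under 100 ms.
print(slo_met(994_000, 1_000_000, target=0.99))                         # True
print(f"{error_budget_remaining(994_000, 1_000_000, target=0.99):.0%}")  # 40%
```

In this example 10,000 slow calls were allowed and 6,000 occurred, so 40% of the week's budget is left.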

Error Budget (Availability)
Allowed unavailability window for each availability SLO. The last column is the error budget remaining when the error rate is 1%.

Availability SLO   Per year       Per quarter      Per 30 days     Budget remaining at 1% error rate
90%                36.5 days      9 days           3 days          90%
95%                18.25 days     4.5 days         1.5 days        80%
99%                3.65 days      21.6 hours       7.2 hours       0%
99.5%              1.83 days      10.8 hours       3.6 hours       -100%
99.9%              8.76 hours     2.16 hours       43.2 minutes    -900%
99.95%             4.38 hours     1.08 hours       21.6 minutes    -1900%
99.99%             52.6 minutes   12.96 minutes    4.32 minutes    -9900%
99.999%            5.26 minutes   1.30 minutes     25.9 seconds    -99900%
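
The figures in this table follow from two small formulas, sketched below. The function names are illustrative, and the 30-day window is just one of the windows the table uses.

```python
from datetime import timedelta


def allowed_downtime(slo_percent: float, window: timedelta) -> timedelta:
    """Downtime that the availability SLO tolerates within the window."""
    return window * (1 - slo_percent / 100)


def budget_remaining_percent(slo_percent: float, error_rate_percent: float) -> float:
    """Error budget left, as a percentage, at a given steady error rate."""
    budget = 100 - slo_percent
    return 100 * (1 - error_rate_percent / budget)


print(allowed_downtime(99.9, timedelta(days=30)))   # 0:43:12
print(round(budget_remaining_percent(90, 1.0)))     # 90
print(round(budget_remaining_percent(99.9, 1.0)))   # -900
```

A 1% error rate is comfortable against a 90% SLO but overspends a 99.9% budget ten times over, which is why the last column turns negative so quickly.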

Demo with Anthos:
Monitoring + Incident Management
● Topology
● SLO/SLI Metrics
● Blackbox/Whitebox
● Log Viewer
● Tracing/Tracing Report

Demo with Anthos:
Topology, Blackbox, Whitebox

Demo with Anthos:
Logging, Tracing

Demo with Anthos:
Proactively Reduce Error Budget Burn
● Alert Setting
● Canary Deployment
● Cross-Region Deployment
[Diagram: Clients → Cloud Load Balancing → two Kubernetes clusters on Kubernetes Engine (Taiwan-1 and Singapore), with traffic split 10/90]

[Diagram: Clients → Cloud Load Balancing → two Kubernetes clusters on Kubernetes Engine (Taiwan-1 and Singapore), with the traffic split now 50/50]
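
The 10/90 and 50/50 traffic splits above can be read as stages of a gradual rollout. The sketch below shows one way such a progression might be gated on the observed error rate. It is an assumption-laden illustration: the function name, the final 100% step, and the rollback-to-zero rule are not from the slides, and in practice the weights would be applied through the load balancer or service mesh configuration.

```python
def next_canary_weight(current: int, error_rate: float, max_error_rate: float,
                       steps: tuple[int, ...] = (10, 50, 100)) -> int:
    """Return the next traffic percentage for the new deployment.

    Advance to the next step while the observed error rate stays within the
    limit; otherwise roll back to 0 to stop spending error budget.
    """
    if error_rate > max_error_rate:
        return 0
    for step in steps:
        if step > current:
            return step
    return current  # already at the final step


print(next_canary_weight(10, error_rate=0.002, max_error_rate=0.01))  # 50
print(next_canary_weight(50, error_rate=0.030, max_error_rate=0.01))  # 0
print(next_canary_weight(100, error_rate=0.0, max_error_rate=0.01))   # 100
```

Limiting the first step to a small share of traffic caps how much error budget a bad release can consume before it is rolled back.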

Capacity planning
● Plan for organic growth: increased product adoption and usage by customers
● Determine inorganic growth: sudden jumps in demand due to feature launches, marketing campaigns, etc.
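
One simple way to combine the two kinds of growth is to compound the organic rate and add the inorganic jump on top. The sketch below assumes a constant monthly growth rate, a one-off launch spike, and a 30% headroom; all names and numbers are hypothetical.

```python
def forecast_demand(current_peak: float, monthly_growth: float, months: int,
                    launch_spike: float = 0.0) -> float:
    """Organic growth compounds monthly; an inorganic spike is added on top."""
    organic = current_peak * (1 + monthly_growth) ** months
    return organic + launch_spike


def required_capacity(demand: float, headroom: float = 0.3) -> float:
    """Provision above the forecast so that a surge does not exhaust capacity."""
    return demand * (1 + headroom)


# Hypothetical service: 1,000 requests/s at peak today, 5% monthly growth,
# and a marketing campaign in six months expected to add 500 requests/s.
demand = forecast_demand(1000, monthly_growth=0.05, months=6, launch_spike=500)
print(round(demand))                     # 1840
print(round(required_capacity(demand)))  # 2392
```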

Change Management
Roughly 70% of outages are due to changes in a live system
[Diagram: Kubernetes Configuration Service, Continuous Deployment. Components shown: Clients; Cloud Source Repositories; Anthos Hub Service; Kubernetes clusters on Kubernetes Engine, on GCP and on-premises (On-Prem1); multiple instances; NAT]

Demo with Anthos:
The Power of GitOps
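
The core idea of GitOps is that the desired state lives in Git and an agent keeps each cluster converged on it. The sketch below shows only the comparison step, using plain dictionaries. The function name and the workload data are hypothetical, and this is not how Anthos or any specific tool implements reconciliation.

```python
def plan_changes(desired: dict[str, dict], actual: dict[str, dict]) -> dict[str, list[str]]:
    """Compare the state declared in Git with the state observed in a cluster."""
    return {
        "create": sorted(desired.keys() - actual.keys()),
        "update": sorted(k for k in desired.keys() & actual.keys()
                         if desired[k] != actual[k]),
        "delete": sorted(actual.keys() - desired.keys()),
    }


# Hypothetical workloads: what the repository declares vs. what is running.
desired = {
    "frontend": {"image": "web:v2", "replicas": 3},
    "cart": {"image": "cart:v1", "replicas": 2},
}
actual = {
    "frontend": {"image": "web:v1", "replicas": 3},
    "legacy": {"image": "old:v9", "replicas": 1},
}
print(plan_changes(desired, actual))
# {'create': ['cart'], 'update': ['frontend'], 'delete': ['legacy']}
```

Because every change arrives as a commit, a rollback is a revert of that commit, and the same comparison brings the cluster back to the previous state.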

Summary + Call to Action
● SRE has 3 key principles:
○ Decisions Based on Data (meaningful monitoring)
○ Be User Centric (black-box testing)
○ Blameless Culture & Shared Responsibility (share the responsibility and work together)
● Kubernetes is a perfect platform to implement SRE
○ SLI + SLO + Error Budget
○ Watch the Budget Burn Rate
○ Establish CI+CD with GitOps
● Pick a System and Build your SRE Practices
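
To make "watch the budget burn rate" concrete: the burn rate is the observed error rate divided by the error rate the SLO allows. The sketch below is a minimal illustration; the alert threshold of 5 and the function names are assumptions, not values from the slides.

```python
def burn_rate(error_rate: float, slo_target: float) -> float:
    """How fast the error budget is spent; 1.0 uses it up exactly over the SLO window."""
    return error_rate / (1 - slo_target)


def hours_until_exhausted(window_hours: float, rate: float) -> float:
    """Time until a full budget is gone if the current burn rate persists."""
    return window_hours / rate


def should_alert(error_rate: float, slo_target: float, threshold: float) -> bool:
    """Alert when the budget is burning at or above the chosen multiple."""
    return burn_rate(error_rate, slo_target) >= threshold


rate = burn_rate(error_rate=0.01, slo_target=0.999)
print(round(rate))                                  # 10
print(round(hours_until_exhausted(30 * 24, rate)))  # 72
print(should_alert(0.01, 0.999, threshold=5))       # True
```

With a 99.9% SLO, a sustained 1% error rate would exhaust a 30-day budget in about three days, which is the kind of signal worth alerting on well before the SLO is breached.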

Cover images used with permission. These books can be found on shop.oreilly.com.
