This talk shares Istio's capabilities with the audience. Many people think of Istio as simply a cool networking product, but it can do much more than network control. Let's see how its capabilities can help build an "EASIER" SRE organization.
5. What is the relationship between DevOps and SRE?
● DevOps is more of an abstract concept: a set of guidelines and disciplines for breaking down silos between development and operations.
● SRE is Google's concrete realization of DevOps practice.
“class SRE implements DevOps”
11. Istio in 2 minutes
[Diagram: Istio architecture. Labels: Service A and Service B, each with a proxy, exchanging HTTP, gRPC, and TCP traffic over mTLS; Ingress Gateway and Egress Gateway with perimeter security policies; External App 1 and Internal App 1 connecting with JWT + TLS; Local Authz; Istio Control Plane (Control Plane API on the K8S API Server) with Pilot (routing + secure naming), Citadel (cert issuance, CertAuthority plugin), and Policy Enforcement + Reporting (logging plugin, monitoring plugin). Legend: data flow; control + metrics flow.]
14. Monitoring and Incident Management
● Understand system architecture
○ Understand the system architecture and deployed topology
● System monitoring
○ Monitor the system by gathering black-box and white-box metrics
○ SLIs and SLOs are extracted from the metrics and logs
○ The information is visualized through dashboards
● Log handling
○ Manage planned events (releases, maintenance)
● Incident handling
○ Create an incident ticket
○ Roll back the change to resolve the incident
○ Investigate the root cause with logs, monitoring metrics, and debugging
● Postmortem
○ Review the incident and prepare a plan to prevent recurrence
15. What to Monitor?
SLO = SLI + Target
“99% of REST API calls will complete in less than 100 ms every week”
(SLI: REST API calls completing in less than 100 ms. Target: 99% every week.)
● SLI (service level indicator): a well-defined measure of 'good enough'
○ Used to specify SLOs and SLAs
● SLO (service level objective): a top-line target for the fraction of good interactions
○ Specifies goals (SLI + Target)
● SLA (service level agreement): consequences
○ SLA = (SLO + margin) + consequences = SLI + Target + consequences
● Error Budget
○ Product management and SRE define an availability target.
○ 100% minus the availability target is a “budget of unreliability” (the error budget).
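The relationship SLO = SLI + Target can be sketched in a few lines of Python. This is only an illustration: the names `SLO`, `sli`, `slo_met`, and `error_budget`, and the sample latencies, are hypothetical and not taken from the talk or from any monitoring tool.

```python
from dataclasses import dataclass


@dataclass
class SLO:
    """An SLO expressed as an SLI definition plus a target (hypothetical type)."""
    threshold_ms: float  # a call is "good" if it completes faster than this
    target: float        # required fraction of good calls, e.g. 0.99


def sli(latencies_ms, threshold_ms):
    """SLI: fraction of observed calls that completed under the threshold."""
    if not latencies_ms:
        raise ValueError("no requests observed in the window")
    good = sum(1 for ms in latencies_ms if ms < threshold_ms)
    return good / len(latencies_ms)


def slo_met(latencies_ms, slo):
    """The SLO holds when the measured SLI reaches the target."""
    return sli(latencies_ms, slo.threshold_ms) >= slo.target


def error_budget(target):
    """Error budget: the fraction of calls allowed to be bad (1 - target)."""
    return 1.0 - target


# Assumed sample: one week of calls, 99 fast and 1 slow.
week = [50.0] * 99 + [250.0]
api_slo = SLO(threshold_ms=100.0, target=0.99)
print(sli(week, api_slo.threshold_ms), slo_met(week, api_slo))
```

With the assumed sample, 99 of 100 calls are good, so the "99% under 100 ms" objective is just met and the whole 1% error budget has been spent.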
16. Availability
Error Budget (Availability)

SLO        Allowed unavailability window                     Error budget remaining (%)
           per year        per quarter      per 30 days      at a 1% error rate
90%        36.5 days       9 days           3 days           90
95%        18.25 days      4.5 days         1.5 days         80
99%        3.65 days       21.6 hours       7.2 hours        0
99.5%      1.83 days       10.8 hours       3.6 hours        -100
99.9%      8.76 hours      2.16 hours       43.2 minutes     -900
99.95%     4.38 hours      1.08 hours       21.6 minutes     -1900
99.99%     52.6 minutes    12.96 minutes    4.32 minutes     -9900
99.999%    5.26 minutes    1.30 minutes     25.9 seconds     -99900
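The figures in the availability table can be reproduced with simple arithmetic. The sketch below assumes a 365-day year and a 90-day quarter (which match the numbers shown), and reads the last column as the percentage of error budget left when the observed error rate is 1%. The function names are hypothetical.

```python
def allowed_downtime_hours(slo_percent, window_days):
    """Hours of unavailability permitted by the SLO over the window."""
    return (100.0 - slo_percent) / 100.0 * window_days * 24.0


def budget_remaining_percent(slo_percent, error_rate_percent):
    """Share of the error budget still unspent, in percent.

    A negative value means the budget is overspent.
    """
    budget = 100.0 - slo_percent
    if budget <= 0:
        raise ValueError("a 100% SLO leaves no error budget")
    return (budget - error_rate_percent) / budget * 100.0


# Reproduce a few rows, assuming a 1% observed error rate.
for slo in (90.0, 99.0, 99.9):
    print(
        slo,
        round(allowed_downtime_hours(slo, 365), 2),
        round(allowed_downtime_hours(slo, 30), 2),
        round(budget_remaining_percent(slo, 1.0)),
    )
```

For a 99.9% SLO this gives about 8.76 hours per year and 0.72 hours (43.2 minutes) per 30 days, and a 1% error rate overspends the budget by roughly nine times.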
24. Capacity planning
● Plan for organic growth: increased product adoption and usage by customers.
● Determine inorganic growth: sudden jumps in demand due to feature launches, marketing campaigns, etc.
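One way to combine the two kinds of growth is a simple month-by-month projection. This is a minimal sketch under stated assumptions: organic growth is modeled as a constant compound monthly rate, inorganic growth as a one-off jump in the month of a launch, and every number and name (`project_demand`, `launches`) is made up for illustration.

```python
def project_demand(baseline, monthly_growth, months, launches=None):
    """Project demand (e.g. requests per second) month by month.

    baseline       -- current demand
    monthly_growth -- organic growth rate per month, e.g. 0.10 for 10%
    launches       -- {month_number: extra_demand} for inorganic jumps
    """
    launches = launches or {}
    demand = baseline
    forecast = []
    for month in range(1, months + 1):
        demand *= 1.0 + monthly_growth      # organic: steady adoption
        demand += launches.get(month, 0.0)  # inorganic: launch or campaign
        forecast.append(demand)
    return forecast


# Assumed example: 1000 rps today, 10% organic growth, a launch in month 2.
for month, rps in enumerate(project_demand(1000.0, 0.10, 3, {2: 500.0}), 1):
    print(month, round(rps))
```

Comparing the projected peak with current provisioned capacity shows how much headroom needs to be added before the launch month.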
25. Change Management
Roughly 70% of outages are due to changes in a live system.
[Diagram: Kubernetes Configuration Service, Continuous Deployment. Labels: Clients; Kubernetes Cluster (Kubernetes Engine, Multiple Instances); Cloud Source Repositories; OnPremise Kubernetes Cluster (Kubernetes Engine, On-Prem1); GCP Kubernetes Cluster (Kubernetes Engine); Anthos Hub; Service; NAT.]
27. Summary + Call for Action
● SRE has 3 key principles:
○ Decisions Based on Data (meaningful monitoring)
○ Be User Centric (black-box testing)
○ Blameless Culture & Shared Responsibility (share the responsibility, work together)
● Kubernetes is a perfect platform to implement SRE
○ SLI + SLO + Error Budget
○ Watch for the Budget Burn Rate
○ Establish CI+CD with GitOps
● Pick a System and Build your SRE Practices
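Watching the budget burn rate can be sketched as follows. The sketch uses the common definition of burn rate as the observed error rate divided by the error budget (1 minus the SLO target), so a rate of 1 spends exactly the whole budget over the SLO window. The function names and sample numbers are hypothetical.

```python
def burn_rate(bad_events, total_events, slo_target):
    """Observed error rate divided by the error budget (1 - SLO target)."""
    if total_events <= 0:
        raise ValueError("no events observed")
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("a 100% SLO leaves no error budget")
    return (bad_events / total_events) / budget


def hours_until_exhausted(rate, window_hours):
    """Time to spend a full budget if the current burn rate stays constant."""
    if rate <= 0:
        return float("inf")  # nothing is being burned
    return window_hours / rate


# Assumed sample: 20 failed calls out of 1000 against a 99% SLO,
# measured against a 30-day (720-hour) SLO window.
rate = burn_rate(20, 1000, 0.99)
print(round(rate, 2), round(hours_until_exhausted(rate, 720.0)))
```

In the assumed sample the service burns its budget about twice as fast as the SLO allows, so a 30-day budget would be gone in roughly 15 days. That is the kind of signal worth alerting on before the budget is actually exhausted.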
28. Cover images used with permission. These books can be found on shop.oreilly.com.