Performance Engineering Masterclass
Efficient Automation with
SRE Concepts
Henrik Rexed
Henrik Rexed
● Cloud Native Advocate
● 15+ years of Performance engineering
● Owner of
Producer of
What will you learn in this session?
● The notion of SLI/SLO
● The importance of observability
● Introduction to Keptn
● Demo
Continuous Performance Testing
Dev: Test a Component
Integration Testing: Test a System
Application Testing: Test Real World
Continuous Testing Embedded in CI/CD Pipelines
Performance testing is often falsely automated
Build → Deploy to "Test" → Run Test in "Test" → Manual Approval (several hours, days) → Promote to the next stage
Performance is not only about Response time!
Response time
Resources
Network
Cost
Scalability
Errors
Solution: Use SRE methodology
Why do we need SRE?
■ Developers were focused on innovation and agility
■ Operations were focused on stability
■ SRE was created to make sure that we build reliable services and avoid conflict between Developers and Operations
Developer Operation
SLI/SLO help to build reliability targets
Developer Operation
■ Product owners define the objectives for each service at a very early stage
■ SLIs/SLOs help to set targets for:
■ Availability
■ Performance
■ More
■ SLIs/SLOs help to detect issues before our end users do
Production
speed
availability
SRE Mantra
SLI (Service Level Indicator)
A key metric for understanding the health of a service
SLI = (Good Events / Valid Events) × 100 %
Example: HTTP Request Latency
SLI = (# of HTTP requests with <= 5 sec response time / Total # of requests) × 100 %
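The latency example above can be sketched in a few lines of Python. This is only an illustration of the good events / valid events ratio; the function name `latency_sli` and the sample response times are assumptions, not part of the talk.

```python
def latency_sli(response_times_s, threshold_s=5.0):
    """Return the SLI in percent: good events / valid events * 100.

    A "good" event is a request answered within threshold_s seconds.
    Every measured request counts as a "valid" event in this sketch.
    """
    if not response_times_s:
        raise ValueError("at least one request is needed to compute an SLI")
    good = sum(1 for t in response_times_s if t <= threshold_s)
    return 100.0 * good / len(response_times_s)


# Hypothetical response times (seconds) collected during a test run.
samples = [0.2, 1.5, 4.9, 5.0, 7.3]
sli = latency_sli(samples)
print(f"SLI: {sli:.1f} % of requests answered within 5 s")
```

Four of the five sample requests stay within the 5 second threshold, so the sketch reports an SLI of 80 %.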
SLO (Service Level Objective)
An objective/target we set against an SLI
Example: Request latency will be <= 5 secs for 99% of requests
[Gauge from 0 % to 100 %: # of HTTP requests with <= 5 sec response time, with the SLO marked at 99 %]
Error budget
An SLO of 99% is equal to an error budget of 1%
Error budget: one minus the availability target. SREs and Devs work within the error budget.
Example: with 1 000 000 requests over 30 days, a 1% error budget is 10 000 requests in 30 days
Error Budget + Burn Down: how fast are we using our budget?
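The error budget arithmetic and the burn-down question can be made concrete with a small sketch. The function names and the figure of 2 500 failed requests after 10 days are assumptions chosen for illustration; only the 99% target, the 1 000 000 requests and the 30-day window come from the slide.

```python
def error_budget(slo_target_pct, total_requests):
    """Number of requests that may fail while the SLO is still met."""
    return total_requests * (100 - slo_target_pct) / 100


def budget_consumed(failed_requests, budget):
    """Fraction of the error budget already used (1.0 means fully used)."""
    return failed_requests / budget


def burn_rate(failed_requests, budget, elapsed_days, window_days=30):
    """Budget consumption relative to elapsed time.

    1.0 means the budget lasts exactly until the end of the window;
    above 1.0 the budget runs out before the window ends.
    """
    return budget_consumed(failed_requests, budget) / (elapsed_days / window_days)


budget = error_budget(99, 1_000_000)      # 1% of 1 000 000 requests
used = budget_consumed(2_500, budget)     # hypothetical: 2 500 failures so far
rate = burn_rate(2_500, budget, elapsed_days=10)
print(f"budget: {budget:.0f} requests, used: {used:.0%}, burn rate: {rate:.2f}")
```

A burn rate below 1 suggests the team still has room for risk (for example a release), while a value above 1 suggests reliability work should take priority.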
Remove toil
Typical SRE day
Operations: 50%
Dev: 50%
How can we take advantage of the SRE mantra in Performance Engineering?
To build SLIs we need measurements
Observability pillars
Logs, Events, Metrics, Traces
The CNCF Landscape
https://landscape.cncf.io/
The reality…
https://twitter.com/dastbe/status/1303858170155081728
Open Observability
Prometheus
Metric provider
Prometheus architecture
kube-state-metrics
Node exporter
cAdvisor
Alertmanager
Scrape
Prometheus Server
PromQL
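Once Prometheus has scraped the exporters, an SLI can be read back with a PromQL instant query over the HTTP API (`/api/v1/query`). The sketch below only builds the request URL and parses a response; the server address `http://prometheus:9090`, the `carts` job label and the sample response are assumptions made for illustration.

```python
import json
from urllib.parse import urlencode


def instant_query_url(base_url, promql):
    """Build the URL of a Prometheus instant query for a PromQL expression."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def parse_vector(payload):
    """Extract (labels, value) pairs from an instant-vector query response."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    data = payload["data"]
    if data["resultType"] != "vector":
        raise ValueError(f"expected a vector, got {data['resultType']}")
    # Each sample is {"metric": {...labels...}, "value": [timestamp, "value as string"]}.
    return [(s["metric"], float(s["value"][1])) for s in data["result"]]


# Hypothetical response body, shaped like a successful instant query result.
SAMPLE_RESPONSE = json.loads("""
{"status": "success",
 "data": {"resultType": "vector",
          "result": [{"metric": {"__name__": "go_goroutines", "job": "carts"},
                      "value": [1600000000.0, "42"]}]}}
""")

print(instant_query_url("http://prometheus:9090", 'go_goroutines{job="carts"}'))
print(parse_vector(SAMPLE_RESPONSE))
```

In a real setup the URL would be fetched with an HTTP client and the decoded JSON body passed to `parse_vector`.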
Prometheus is a standard
Database
• CouchDB
• MySQL
• Oracle
• PostgreSQL
• MongoDB
• …
Hardware
• Netgear
• Windows
• IBM Z
• Nvidia
• …etc
Broker
• MQ
• Kafka
• MQTT
• RabbitMQ
• …etc
Storage
• Tivoli
• Hadoop
• NetApp
• ScaleIO
Other
• Jira
• Jenkins
• GitHub
• Fluentd
• Nagios
• …etc
Automate
SRE Mantra
Modern DevOps
● CI/CD
● Production
● Staging
Operations at scale
• Complex configuration
• Repeated tasks
• Manual integrations
Expensive maintenance
Open, event- & data-driven automation for DevOps & SREs
● Makes data-driven decisions based on SLOs (Service Level Objectives)
● Event-driven task orchestration for Multi-Stage Delivery … and Day 1 & Day 2 Operations (Canaries, Remediation …)
● Connects to any existing delivery, test, notification, ticketing, config mgmt … tool, through open event standard subscriptions
● Connects to any Observability Platform to query metrics (SLIs)
Tasks shown: Deploy, Test, Validate, Assess, Rollback, Validate, Release, Scale, Escalate
Keptn: SLO-Driven Automation for DevOps & SREs
You (Dev/Ops/SRE):
● bring your configuration: shipyard, SLI/SLO, runbook, workload
● pick your use case: SLO-Quality Gates, Progressive Delivery, Auto-Remediation
● connect your tools
SRE Automation
automates configuration and provides self-service for Monitoring, Delivery, Reliability, Remediation
through event-driven process orchestration based on Declaration, GitOps, SLOs, Standards
SLOs for Data Driven Decisions at the Core of Keptn Orchestration
sli.yaml (Dynatrace)
indicators:
  error_rate: "builtin:service.errors.total.count:merge(0):avg"
  count_dbcalls: "calc:service.toptestdbcalls:merge(0):sum"
  jvm_memory: "builtin:tech.jvm.memory.pool.committed:merge(0):sum"
slo.yaml (SLI Provider independent)
objectives:
  - sli: error_rate
    pass:
      - criteria:
          - "<=1" # We expect a max error rate of 1%
  - sli: jvm_memory
  - sli: count_dbcalls
    pass:
      - criteria:
          - "<=+2%" # We allow a 2% increase in DB Calls between builds
    warning:
      - criteria:
          - "<=10" # We expect no more than 10 DB Calls per TX
total_score:
  pass: "90%"
  warning: "75%"
sli.yaml (Prometheus)
indicators:
  http_requests_total_sucess: http_requests_total{status="success"}
  go_routines: go_goroutines{job="$SERVICE-$PROJECT-$STAGE"}
SLI Providers (get-sli): Query SLIs based on sli.yaml and return individual values
Lighthouse Service (evaluation): Retrieves SLIs and compares them against SLOs
Example SLI values:
error_rate: 4.3%
jvm_memory: 360MB
count_dbcalls: 5
sli_x: value for X
sli_y: value for Y
...
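To make the evaluation step more tangible, here is a minimal sketch of how SLI values could be compared against SLO criteria and turned into a total score. It is a simplification and not Keptn's implementation: it assumes each objective has the same weight, gives full points for a pass and half points for a warning, skips objectives without criteria as informational, and supports only absolute criteria such as `<=1` (relative criteria such as `<=+2%` would need the previous build's values). The `<=5` criterion for `count_dbcalls` is a hypothetical absolute stand-in.

```python
import operator
import re

OPS = {"<=": operator.le, ">=": operator.ge, "<": operator.lt,
       ">": operator.gt, "=": operator.eq}
CRITERION = re.compile(r"^(<=|>=|<|>|=)\s*(-?\d+(?:\.\d+)?)$")


def meets(value, criterion):
    """Check one absolute criterion such as '<=1' against an SLI value."""
    match = CRITERION.match(criterion.strip())
    if match is None:
        raise ValueError(f"unsupported criterion in this sketch: {criterion!r}")
    op, threshold = match.groups()
    return OPS[op](value, float(threshold))


def evaluate(objectives, sli_values, pass_pct=90.0, warning_pct=75.0):
    """Return (score in percent, 'pass' | 'warning' | 'fail')."""
    earned = possible = 0.0
    for objective in objectives:
        pass_criteria = objective.get("pass")
        if not pass_criteria:
            continue  # informational SLI, not scored
        value = sli_values[objective["sli"]]
        possible += 1
        if all(meets(value, c) for c in pass_criteria):
            earned += 1
        elif objective.get("warning") and all(
                meets(value, c) for c in objective["warning"]):
            earned += 0.5
    score = 100.0 * earned / possible if possible else 100.0
    if score >= pass_pct:
        return score, "pass"
    return score, "warning" if score >= warning_pct else "fail"


objectives = [
    {"sli": "error_rate", "pass": ["<=1"]},
    {"sli": "jvm_memory"},  # no criteria: informational only
    {"sli": "count_dbcalls", "pass": ["<=5"], "warning": ["<=10"]},
]
sli_values = {"error_rate": 4.3, "jvm_memory": 360, "count_dbcalls": 5}
print(evaluate(objectives, sli_values))
```

With the example SLI values, the error rate of 4.3% misses its objective while the DB call count passes, so this sketch scores 50% and the evaluation fails.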
Automate Approval through SLI/SLO-based Quality Gates
Before: Build → Deploy to "Test" → Run Test in "Test" → Manual Approval (~30-60min) → Promote to "Staging"
After: Build → Deploy to "Test" → Run Test in "Test" w/ Tagging → Trigger Quality Gate → Wait for Result (~1min) → Promote to "Staging"
Quality Gate: pull SLIs for the testing time frame, validate SLOs
Rt(p95) < 500ms
#ofSQLs <= 5
cpu(max) < 80%
Java GC < 2%
...
SLI & SLO Result: success, Score: 85/100
Keptn is extensible
https://artifacthub.io/packages/search?ts_query_web=Keptn
Keptn integrates with other solutions
CLI / REST API
Demo
Quality Gate: Keptn, Prometheus and k6
SLO Evaluation & Monitoring
Prometheus
Integration
Service
Hipster-shop
Run load test and evaluate
Is It Observable
■ If you are looking for educational content on Observability, check out: Is It Observable
Keep in touch!
Henrik Rexed
Cloud Native Advocate
Dynatrace
Henrik.rexed@dynatrace.com
@hrexed

Performance Engineering Masterclass: Efficient Automation with the Help of SRE Concepts