This presentation was delivered at the XP Days Ukraine conference in November 2019. It demonstrates how putting both developers and operations on call and adopting DevOps methodologies can transform software development organizations.
Once the silos are torn down, a new challenge emerges: observing and monitoring the onslaught of new services and features the organization is now prepared to deliver as a high- or elite-performing development team.
How Game Theory and Observability Can Tear Down Silos and Improve DevOps
1. Changing the Game: How Game Theory can break down silos
Kevin Crawley – Developer Relations // Instana
Principal SRE Architect & Co-Owner // Single
Twitter: @notsureifkevin
2. ▫ Docker Captain
▫ GitLab Hero
▫ DevOpsDays Nashville Organizer
▫ 20 years in software development
▫ 5+ years DevOps/SRE experience
About Me
3. Discussion Points
▪ How Game Theory tears down silos
▪ Characteristics of High Performance Organizations
▪ DevOps and Site Reliability Engineers
▪ What SREs need to be effective
4. Let’s talk about Game Theory
(Disclaimer: I’m bad at math)
source: Nirmal Mehta (Docker Captain)
5. What is Bad Equilibrium?
It’s a strategy that all players in the game can adopt and converge on, but it won’t produce a desirable outcome for anyone.
https://pdfs.semanticscholar.org/30d1/a03db196384a17fed3247407fb5859f7c76b.pdf
7. Where do silos come from?
Silos can be defined as the contention that exists between functional units within an organization. This contention usually manifests between teams where change-management policy requirements and risks are high.
8. Nash Equilibrium (Prisoner’s Dilemma)
A concept in game theory where the optimal outcome of a game is one in which no player has an incentive to deviate from her chosen strategy after considering an opponent’s choice.
https://en.wikipedia.org/wiki/Nash_equilibrium
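To make the Prisoner’s Dilemma concrete, here is a minimal sketch (my own illustration, not from the deck) that brute-forces the Nash equilibrium of the classic payoff matrix; the only equilibrium is mutual defection, even though mutual cooperation pays both players better:

```python
from itertools import product

# Classic prisoner's dilemma payoffs (years in prison, negated so that
# higher is better). Keys are (row player's strategy, column player's).
PAYOFFS = {
    ("cooperate", "cooperate"): (-1, -1),
    ("cooperate", "defect"):    (-3,  0),
    ("defect",    "cooperate"): ( 0, -3),
    ("defect",    "defect"):    (-2, -2),
}
STRATEGIES = ("cooperate", "defect")

def is_nash(row, col):
    """True if neither player can do better by unilaterally deviating."""
    row_payoff, col_payoff = PAYOFFS[(row, col)]
    best_row = max(PAYOFFS[(r, col)][0] for r in STRATEGIES)
    best_col = max(PAYOFFS[(row, c)][1] for c in STRATEGIES)
    return row_payoff == best_row and col_payoff == best_col

for row, col in product(STRATEGIES, STRATEGIES):
    if is_nash(row, col):
        print("Nash equilibrium:", (row, col), "payoffs:", PAYOFFS[(row, col)])
# Prints only ('defect', 'defect') – the bad equilibrium both players
# converge on, even though ('cooperate', 'cooperate') is better for both.
```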
9. Split / Steal – Example 1
▪ The video has been removed to save bandwidth; you may view it on YouTube
https://www.youtube.com/watch?v=p3Uos2fzIJ0
11. Pareto Efficiency
A state of resource allocation in which it is impossible to make any one individual better off without making at least one other individual worse off.
… aka ZERO SUM
https://en.wikipedia.org/wiki/Pareto_efficiency
21. Percentage of Work Done Manually

                            ELITE       HIGH        LOW
                            PERFORMERS  PERFORMERS  PERFORMERS
Configuration Management    5%          10%         30%
Testing                     10%         20%         30%
Deployments                 5%          10%         30%
Change approval process     10%         30%         40%

https://devops-research.com/
22. High Performance vs Low Performance Organizations

High Performers
▪ Deployments: > 1 hour and < 1 day
▪ Lead Time for Changes: > 1 day and < 1 week
▪ MTTR: < 1 day
▪ Change Failure Rate: 0-15%

Low Performers
▪ Deployments: once per week/month
▪ Lead Time for Changes: > 1 month and < 6 months
▪ MTTR: > 1 week and < 1 month
▪ Change Failure Rate:

https://devops-research.com/
23. What happens when we tear down the silos and become a DevOps organization?
▪ We ship more software more often; complexity increases and reliability starts to decline
▪ We naturally shift our focus to solving the scalability and reliability issues (alternatively, we give up and readopt the monolith)
▪ Rise of the Site Reliability Engineers
25. What tools and processes can organizations put in place to change our equilibrium and communicate?
▪ Communication & Collaboration Tools
▫ Slack, Git, PagerDuty, OpsGenie
▪ Observability (SRE) Tooling
▫ Custom Dashboards / Metrics /
Alerting
▫ Log Analytics
▫ Distributed Tracing
26. What do SREs care about?
▪ Reliability (this one is obvious)
▪ Performance (is the customer happy?)
▪ Costs (is the business happy?)
SREs are in the business of measurement: they define objectives (SLOs) and track them by measuring indicators (SLIs).
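As a hedged illustration (the numbers here are invented, not from the talk), an SLI is simply a measurement and an SLO is the target you hold it to; the gap between them is the error budget:

```python
# Hypothetical availability SLI measured against a 99.9% SLO.
total_requests = 1_250_000
failed_requests = 980

sli = 1 - failed_requests / total_requests   # the measured indicator
slo = 0.999                                  # the objective we committed to

error_budget = (1 - slo) * total_requests    # failures we are allowed to "spend"
budget_remaining = error_budget - failed_requests

print(f"SLI: {sli:.4%} (SLO: {slo:.1%})")
print(f"Error budget remaining: {budget_remaining:.0f} requests")
# SLI above the SLO -> budget left to spend on shipping changes;
# SLI below the SLO -> slow down risky deploys and invest in reliability.
```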
27. What do SREs typically measure?
▪ Error Rates
▪ Latency
▪ Throughput
▪ Saturation
“The Four Golden Signals” – https://landing.google.com/sre/sre-book/chapters/monitoring-distributed-systems/
28. What is Observability?
Kalman’s 1961 paper, “On the General Theory of Control Systems”
▪ A system is observable if the behavior of the entire system can be determined by only looking at its inputs and outputs.
▪ Lesson: control theory is a well-documented approach that people can learn from rather than trying to reinvent
29. Can we get some pillars?
The four pillars of observability were originally described in a blog article from Twitter:
▪ Monitoring
▪ Log Aggregation / Analytics
▪ Distributed systems tracing infrastructure
▪ Alerting / Visualization
https://blog.twitter.com/engineering/en_us/a/2016/observability-at-twitter-technical-overview-part-i.html
30. More than just pillars…
“While plainly having access to logs, metrics, and traces doesn’t necessarily make systems more observable, these are powerful tools that, if understood well, can unlock the ability to build better systems.”
– Cindy Sridharan
https://www.oreilly.com/library/view/distributed-systems-observability/9781492033431/
31. Observability gives us the means to understand all of the behavior in our systems
▪ Not just tooling; it’s how we model and analyze data
▪ Similar to how DevOps is a mindset / culture
▪ No longer treating services like Schrödinger's cat
▪ (A lot) more context around events and transactions
https://peter.bourgon.org/blog/2017/02/21/metrics-tracing-and-logging.html
33. How many of you are running staging environments?
34. How many of you actually trust your staging environments?
36. In order to observe a system, we must emit signals and analyze the aggregates. Those aggregates can answer the following questions (and more; a sketch of the aggregation follows the list):
▪ Number of requests / retries / backoffs (throughput)
▪ Request parameters / query statements (details)
▪ Latency / outliers (performance)
▪ Top-level exceptions / log messages (error analysis)
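Here is a minimal sketch of that aggregation over a handful of hypothetical request records (endpoints, latencies, and statuses are made up); a real observability backend does the same thing continuously and at scale:

```python
import math
import statistics

# Hypothetical request records: (endpoint, latency in ms, HTTP status).
requests = [
    ("/checkout", 120, 200), ("/checkout", 95, 200),
    ("/checkout", 2300, 500), ("/search", 45, 200),
    ("/search", 60, 200), ("/search", 52, 429),
]

latencies = sorted(ms for _, ms, _ in requests)
errors = [r for r in requests if r[2] >= 500]     # top-level failures
backoffs = [r for r in requests if r[2] == 429]   # throttled -> client retries

print("throughput:", len(requests), "requests in window")
print("p50 latency:", statistics.median(latencies), "ms")
print("p95 latency:", latencies[math.ceil(0.95 * len(latencies)) - 1], "ms")
print("error rate:", len(errors) / len(requests))
print("retries/backoffs:", len(backoffs))
```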
37. How can we collect this data?
Distributed Tracing
▪ Also known as Distributed Structured Logging
▪ Larger Payloads
▪ Rich Contextual Data
https://w3c.github.io/trace-context
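A minimal sketch of how context propagates under the W3C spec linked above: the traceparent header layout (version-traceid-parentid-flags) follows the spec, while the helper functions are my own illustration, not a real tracer’s API:

```python
import secrets

def new_traceparent():
    """Start a new trace: version 00, random trace-id and span-id, sampled."""
    trace_id = secrets.token_hex(16)   # 16 bytes -> 32 hex characters
    span_id = secrets.token_hex(8)     # 8 bytes  -> 16 hex characters
    return f"00-{trace_id}-{span_id}-01"

def propagate(traceparent):
    """Forward the same trace-id downstream with this hop's own span-id."""
    version, trace_id, _parent_id, flags = traceparent.split("-")
    return f"{version}-{trace_id}-{secrets.token_hex(8)}-{flags}"

incoming = new_traceparent()
outgoing = propagate(incoming)
print("incoming:", incoming)   # 00-<trace-id>-<span-id>-01
print("outgoing:", outgoing)   # same trace-id, new parent span-id
# Every service tags its spans with the shared trace-id, which is what
# lets a tracing backend stitch one request back together across services.
```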
38. Sampling vs. No Sampling
▪ Sampling traces may cause important outliers (P95/P99) to be missed
▪ Extremely high-volume systems must sample due to massive overhead
▪ Start without sampling, adopt it as needed, and incorporate solutions that sample adaptively (sketched below)
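A toy sketch of that idea (rates, thresholds, and the latency distribution below are all assumptions): decide at trace completion, keeping every error and slow outlier while probabilistically dropping the uninteresting bulk:

```python
import random

SAMPLE_RATE = 0.10        # keep ~10% of ordinary traces (assumption)
LATENCY_GUARD_MS = 500    # always keep traces slower than this (assumption)

def keep_trace(latency_ms, is_error):
    """Tail-based decision: outliers and errors are never dropped."""
    if is_error or latency_ms > LATENCY_GUARD_MS:
        return True
    return random.random() < SAMPLE_RATE

# Synthetic workload: exponential latencies (mean 120 ms) with 1% errors.
traces = [(random.expovariate(1 / 120), random.random() < 0.01)
          for _ in range(10_000)]
kept = [t for t in traces if keep_trace(*t)]
print(f"kept {len(kept)} of {len(traces)} traces")
# Uniform 10% sampling would also drop ~90% of the P95/P99 outliers;
# the error/latency guard preserves exactly the signal SREs care about.
```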
39. How has Observability helped enable a DevOps culture?
Let’s take a look at a production microservice application which has been instrumented by a distributed tracing solution
40. ▪ Operated by 3 engineers (1 FE / 1 BE / 1 SRE)
▪ Over 20k transactions/hour, 20+ integrations, 150k LOC, with less than 15% test coverage
▪ Launched in 2018 with 15 microservices on Docker Swarm – has since expanded to over 35 microservices with zero additional engineering personnel
▪ One-touch deployment and provisioning for new and existing services
50. Rise in Latency + Processing Time
▪ A DBO (Hibernate query) caused an O(n log n) rise in latency and processing time
▪ The application dashboard indicated an issue with overall latency increasing
▪ A fix was deployed and improvement was observed immediately
53. Caching Solved One Problem… but Caused Another
▪ We implemented Redis for caching, and processing time went down
▪ However, we didn’t account for token policies changing, and tokens suddenly began to expire after 30 seconds
▪ Alerting around error rates for this endpoint raised our awareness of the issue
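The shape of the fix can be sketched like this (the class, names, and numbers are illustrative, not the production code): derive the cache TTL from the token’s actual lifetime rather than hard-coding it, so an upstream policy change cannot leave expired tokens in the cache:

```python
import time

SAFETY_MARGIN_S = 5   # evict a little before the token can expire (assumption)

class TokenCache:
    """Tiny in-memory stand-in for the Redis cache described above."""

    def __init__(self):
        self._entries = {}   # key -> (token, eviction deadline)

    def put(self, key, token, expires_in_s):
        # TTL follows the token's real lifetime, whatever policy sets it.
        ttl = max(0, expires_in_s - SAFETY_MARGIN_S)
        self._entries[key] = (token, time.monotonic() + ttl)

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None or time.monotonic() >= entry[1]:
            self._entries.pop(key, None)
            return None      # cache miss -> caller fetches a fresh token
        return entry[0]

cache = TokenCache()
cache.put("svc-a", "token-123", expires_in_s=30)   # the surprise 30s policy
assert cache.get("svc-a") == "token-123"           # still valid within TTL
```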
64. Focusing on Observability
▪ Enables your organization to understand the behavior of your system
▪ Empowers your engineers to find and fix problems
▪ Enables you to build more reliable systems and ship software faster
▪ Promotes empathy through understanding, transparency, and communication
65. Want to learn more about monitoring production microservice apps?
▪ Follow me on Twitter for upcoming workshops: @notsureifkevin & @InstanaHQ
▪ Get a free trial of Instana @ https://instana.com
Editor's Notes
My name is Kevin.
I’ve been using Docker and maintaining distributed application systems in production since 2014. I help organize events in my local area and speak on topics such as DevOps, automation, culture, and observability.
This is what happens when orgs try to:
Speed up delivery
Reduce MTTR
Reduce lead times
We all understand the game, but we don’t know how to change the rules to gain an advantage
Time-sharing computers
Computer-guided missiles
Air Defense Network goes online
2. Computational complexity and bandwidth requirements of distributed tracing (Lyft, Netflix, Google, etc.)
3. These solutions work around inefficient consumers and processing systems (they’re typically not stream-based)
4. Unless, of course, you’re trying to do this yourself, in which case the complexity of running these systems is extremely high; the other case is that you truly are a behemoth, in which case you probably already know most of this stuff
Over 150 containers
Spread across multiple hosts / AZs
Two separate environments
High-level overview of all the services in production
Single Music has over 30 services in production; we can’t possibly monitor 30 dashboards at a time… or can we?