This presentation was delivered at the XP Days Ukraine conference in November 2019. It demonstrates how putting both developers and operations on call and adopting DevOps methodologies can transform software development organizations.
Once the silos are torn down, a new challenge emerges: observing and monitoring the onslaught of new services and features the organization is now prepared to deliver as a high- or elite-performing development team.
How Game Theory and Observability Can Tear Down Silos and Improve DevOps
1. Changing the Game: How Game Theory can break down silos
Kevin Crawley – Developer Relations // Instana
Principal SRE Architect & Co-Owner // Single
Twitter: @notsureifkevin
2. ▫ Docker Captain
▫ GitLab Hero
▫ DevOpsDays Nashville Organizer
▫ 20 years in software development
▫ 5+ years DevOps/SRE experience
About Me
3. Discussion Points
▪ How Game Theory tears down silos
▪ Characteristics of High Performance Organizations
▪ DevOps and Site Reliability Engineers
▪ What SREs need to be effective
4. Let’s talk about Game Theory
(Disclaimer: I’m bad at math)
source: Nirmal Mehta (Docker Captain)
5. What is Bad Equilibrium?
It’s a strategy that all players in the game can adopt and converge on, but it won’t produce a desirable outcome for anyone.
https://pdfs.semanticscholar.org/30d1/a03db196384a17fed3247407fb5859f7c76b.pdf
7. Where do silos come from?
Silos can be defined as the contention that exists between functional units within an organization. This contention usually manifests between teams where change-management policy requirements and risks are high.
8. Nash Equilibrium (Prisoner’s Dilemma)
A concept in game theory where the optimal outcome of a game is one in which no player has an incentive to deviate from her chosen strategy after considering an opponent’s choice.
https://en.wikipedia.org/wiki/Nash_equilibrium
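To make the Prisoner’s Dilemma concrete, here is a minimal sketch (my own illustration, not from the deck) that brute-forces the Nash equilibrium of the classic payoff matrix; the only equilibrium is mutual defection, even though mutual cooperation pays both players better:

```python
from itertools import product

# Classic prisoner's dilemma payoffs (years in prison, negated so that
# higher is better). Keys are (row player's strategy, column player's).
PAYOFFS = {
    ("cooperate", "cooperate"): (-1, -1),
    ("cooperate", "defect"):    (-3,  0),
    ("defect",    "cooperate"): ( 0, -3),
    ("defect",    "defect"):    (-2, -2),
}
STRATEGIES = ("cooperate", "defect")

def is_nash(row, col):
    """True if neither player can do better by unilaterally deviating."""
    row_payoff, col_payoff = PAYOFFS[(row, col)]
    best_row = max(PAYOFFS[(r, col)][0] for r in STRATEGIES)
    best_col = max(PAYOFFS[(row, c)][1] for c in STRATEGIES)
    return row_payoff == best_row and col_payoff == best_col

for row, col in product(STRATEGIES, STRATEGIES):
    if is_nash(row, col):
        print("Nash equilibrium:", (row, col), "payoffs:", PAYOFFS[(row, col)])
# Prints only ('defect', 'defect') – the bad equilibrium both players
# converge on, even though ('cooperate', 'cooperate') is better for both.
```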
9. Split / Steal – Example 1
▪ The video has been removed to save bandwidth; you may view it on YouTube
https://www.youtube.com/watch?v=p3Uos2fzIJ0
11. Pareto Efficiency
A state of resource allocation in which it is impossible to make any one individual better off without making at least one other individual worse off.
… aka ZERO SUM
https://en.wikipedia.org/wiki/Pareto_efficiency
21. Percentage of Work Done Manually

                            ELITE       HIGH        LOW
                            PERFORMERS  PERFORMERS  PERFORMERS
Configuration Management    5%          10%         30%
Testing                     10%         20%         30%
Deployments                 5%          10%         30%
Change approval process     10%         30%         40%

https://devops-research.com/
22. High Performance vs Low Performance Organizations

High Performers
▪ Deployments: > 1 hour and < 1 day
▪ Lead Time for Changes: > 1 day and < 1 week
▪ MTTR: < 1 day
▪ Change Failure Rate: 0-15%

Low Performers
▪ Deployments: once per week/month
▪ Lead Time for Changes: > 1 month and < 6 months
▪ MTTR: > 1 week and < 1 month
▪ Change Failure Rate:

https://devops-research.com/
23. What happens when we tear down the silos and become a DevOps organization?
▪ We ship more software more often; complexity increases and reliability starts to decline
▪ We naturally shift our focus to solving the scalability and reliability issues (alternatively, we give up and readopt the monolith)
▪ Rise of the Site Reliability Engineers
25. What tools and processes can organizations put in place to change our equilibrium and communicate?
▪ Communication & Collaboration Tools
▫ Slack, Git, PagerDuty, OpsGenie
▪ Observability (SRE) Tooling
▫ Custom Dashboards / Metrics /
Alerting
▫ Log Analytics
▫ Distributed Tracing
26. What do SREs care about?
▪ Reliability (this one is obvious)
▪ Performance (is the customer happy?)
▪ Costs (is the business happy?)
SREs are in the business of measurement: they define objectives (SLOs) and track them by measuring indicators (SLIs).
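As a hedged illustration (the numbers here are invented, not from the talk), an SLI is simply a measurement and an SLO is the target you hold it to; the gap between them is the error budget:

```python
# Hypothetical availability SLI measured against a 99.9% SLO.
total_requests = 1_250_000
failed_requests = 980

sli = 1 - failed_requests / total_requests   # the measured indicator
slo = 0.999                                  # the objective we committed to

error_budget = (1 - slo) * total_requests    # failures we are allowed to "spend"
budget_remaining = error_budget - failed_requests

print(f"SLI: {sli:.4%} (SLO: {slo:.1%})")
print(f"Error budget remaining: {budget_remaining:.0f} requests")
# SLI above the SLO -> budget left to spend on shipping changes;
# SLI below the SLO -> slow down risky deploys and invest in reliability.
```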
27. What do SREs typically measure?
▪ Error Rates
▪ Latency
▪ Throughput
▪ Saturation
“The Four Golden Signals” – https://landing.google.com/sre/sre-book/chapters/monitoring-distributed-systems/
28. What is Observability?
Kalman’s 1961 paper, “On the General Theory of Control Systems”
▪ A system is observable if the behavior of the entire system can be determined by only looking at its inputs and outputs.
▪ Lesson: control theory is a well-documented approach that people can learn from rather than trying to reinvent
29. Can we get some pillars?
The four pillars of observability were originally described in a blog article from Twitter:
▪ Monitoring
▪ Log Aggregation / Analytics
▪ Distributed systems tracing infrastructure
▪ Alerting / Visualization
https://blog.twitter.com/engineering/en_us/a/2016/observability-at-twitter-technical-overview-part-i.html
30. More than just pillars…
“While plainly having access to logs, metrics, and traces doesn’t necessarily make systems more observable, these are powerful tools that, if understood well, can unlock the ability to build better systems.”
– Cindy Sridharan
https://www.oreilly.com/library/view/distributed-systems-observability/9781492033431/
31. Observability gives us the means to understand all of the behavior in our systems
▪ Not just tooling; it’s how we model and analyze data
▪ Similar to how DevOps is a mindset / culture
▪ No longer treating services like Schrödinger's cat
▪ (A lot) more context around events and transactions
https://peter.bourgon.org/blog/2017/02/21/metrics-tracing-and-logging.html
33. How many of you are running staging environments?
34. How many of you actually trust your staging environments?
36. In order to observe a system, we must emit signals and analyze the aggregates. Those aggregates can answer the following questions (and more; a sketch of the aggregation follows the list):
▪ Number of requests / retries / backoffs (throughput)
▪ Request parameters / query statements (details)
▪ Latency / outliers (performance)
▪ Top-level exceptions / log messages (error analysis)
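Here is a minimal sketch of that aggregation over a handful of hypothetical request records (endpoints, latencies, and statuses are made up); a real observability backend does the same thing continuously and at scale:

```python
import math
import statistics

# Hypothetical request records: (endpoint, latency in ms, HTTP status).
requests = [
    ("/checkout", 120, 200), ("/checkout", 95, 200),
    ("/checkout", 2300, 500), ("/search", 45, 200),
    ("/search", 60, 200), ("/search", 52, 429),
]

latencies = sorted(ms for _, ms, _ in requests)
errors = [r for r in requests if r[2] >= 500]     # top-level failures
backoffs = [r for r in requests if r[2] == 429]   # throttled -> client retries

print("throughput:", len(requests), "requests in window")
print("p50 latency:", statistics.median(latencies), "ms")
print("p95 latency:", latencies[math.ceil(0.95 * len(latencies)) - 1], "ms")
print("error rate:", len(errors) / len(requests))
print("retries/backoffs:", len(backoffs))
```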
37. How can we collect this data?
Distributed Tracing
▪ Also known as Distributed Structured Logging
▪ Larger Payloads
▪ Rich Contextual Data
https://w3c.github.io/trace-context
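A minimal sketch of how context propagates under the W3C spec linked above: the traceparent header layout (version-traceid-parentid-flags) follows the spec, while the helper functions are my own illustration, not a real tracer’s API:

```python
import secrets

def new_traceparent():
    """Start a new trace: version 00, random trace-id and span-id, sampled."""
    trace_id = secrets.token_hex(16)   # 16 bytes -> 32 hex characters
    span_id = secrets.token_hex(8)     # 8 bytes  -> 16 hex characters
    return f"00-{trace_id}-{span_id}-01"

def propagate(traceparent):
    """Forward the same trace-id downstream with this hop's own span-id."""
    version, trace_id, _parent_id, flags = traceparent.split("-")
    return f"{version}-{trace_id}-{secrets.token_hex(8)}-{flags}"

incoming = new_traceparent()
outgoing = propagate(incoming)
print("incoming:", incoming)   # 00-<trace-id>-<span-id>-01
print("outgoing:", outgoing)   # same trace-id, new parent span-id
# Every service tags its spans with the shared trace-id, which is what
# lets a tracing backend stitch one request back together across services.
```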
38. Sampling vs. No Sampling
▪ Sampling traces may cause important outliers (P95/P99) to be missed
▪ Extremely high-volume systems must sample due to massive overhead
▪ Start without sampling, adopt it as needed, and incorporate solutions that sample adaptively (sketched below)
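A toy sketch of that idea (rates, thresholds, and the latency distribution below are all assumptions): decide at trace completion, keeping every error and slow outlier while probabilistically dropping the uninteresting bulk:

```python
import random

SAMPLE_RATE = 0.10        # keep ~10% of ordinary traces (assumption)
LATENCY_GUARD_MS = 500    # always keep traces slower than this (assumption)

def keep_trace(latency_ms, is_error):
    """Tail-based decision: outliers and errors are never dropped."""
    if is_error or latency_ms > LATENCY_GUARD_MS:
        return True
    return random.random() < SAMPLE_RATE

# Synthetic workload: exponential latencies (mean 120 ms) with 1% errors.
traces = [(random.expovariate(1 / 120), random.random() < 0.01)
          for _ in range(10_000)]
kept = [t for t in traces if keep_trace(*t)]
print(f"kept {len(kept)} of {len(traces)} traces")
# Uniform 10% sampling would also drop ~90% of the P95/P99 outliers;
# the error/latency guard preserves exactly the signal SREs care about.
```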
39. How has Observability helped enable a DevOps culture?
Let’s take a look at a production microservice application which has been instrumented by a distributed tracing solution
40. ▪ Operated by 3 engineers (1 FE / 1 BE / 1 SRE)
▪ Over 20k transactions/hour, 20+ integrations, 150k LOC, with less than 15% test coverage
▪ Launched in 2018 with 15 microservices on Docker Swarm – has since expanded to over 35 microservices with zero additional engineering personnel
▪ One-touch deployment and provisioning for new and existing services
50. Rise in Latency + Processing Time
▪ A DBO (Hibernate query) caused an O(n log n) rise in latency and processing time
▪ The application dashboard indicated an issue with overall latency increasing
▪ A fix was deployed and improvement was observed immediately
53. Caching Solved One Problem… but Caused Another
▪ We implemented Redis for caching, and processing time went down
▪ However, we didn’t account for token policies changing, and tokens suddenly began to expire after 30 seconds
▪ Alerting around error rates for this endpoint raised our awareness of the issue
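The shape of the fix can be sketched like this (the class, names, and numbers are illustrative, not the production code): derive the cache TTL from the token’s actual lifetime rather than hard-coding it, so an upstream policy change cannot leave expired tokens in the cache:

```python
import time

SAFETY_MARGIN_S = 5   # evict a little before the token can expire (assumption)

class TokenCache:
    """Tiny in-memory stand-in for the Redis cache described above."""

    def __init__(self):
        self._entries = {}   # key -> (token, eviction deadline)

    def put(self, key, token, expires_in_s):
        # TTL follows the token's real lifetime, whatever policy sets it.
        ttl = max(0, expires_in_s - SAFETY_MARGIN_S)
        self._entries[key] = (token, time.monotonic() + ttl)

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None or time.monotonic() >= entry[1]:
            self._entries.pop(key, None)
            return None      # cache miss -> caller fetches a fresh token
        return entry[0]

cache = TokenCache()
cache.put("svc-a", "token-123", expires_in_s=30)   # the surprise 30s policy
assert cache.get("svc-a") == "token-123"           # still valid within TTL
```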
64. Focusing on Observability
▪ Enables your organization to understand the behavior of your system
▪ Empowers your engineers to find and fix problems
▪ Enables you to build more reliable systems and ship software faster
▪ Promotes empathy through understanding, transparency, and communication
65. Want to learn more about monitoring production microservice apps?
▪ Follow me on Twitter for upcoming workshops: @notsureifkevin & @InstanaHQ
▪ Get a free trial of Instana @ https://instana.com
Editor's Notes
My name is Kevin.
I’ve been using Docker and maintaining distributed application systems in production since 2014. I help organize events in my local area and speak on topics such as DevOps, automation, culture, and observability.
This is what happens when orgs try to:
Speed up delivery
Reduce MTTR
Reduce lead times
We all understand the game, but we don’t know how to change the rules to gain an advantage
Time-sharing computers
Computer-guided missiles
Air Defense Network goes online
2. Computational complexity and bandwidth requirements of distributed tracing (Lyft, Netflix, Google, etc.)
3. These solutions work around inefficient consumers and processing systems (they’re typically not stream-based)
4. Unless, of course, you’re trying to do this yourself, in which case the complexity of running these systems is extremely high; the other case is that you truly are a behemoth, in which case you probably already know most of this stuff
Over 150 containers
Spread across multiple hosts / AZs
Two separate environments
High-level overview of all the services in production
Single Music has over 30 services in production; we can’t possibly monitor 30 dashboards at a time… or can we?