2. Introduction
Kurt Andersen
Sr. Staff Site Reliability
Joined LinkedIn in January 2013
Background in managed services and anti-abuse security
Currently senior technical lead for Product-SRE (all member & customer facing services)
5. Growing Global Network
546M+ Members
100K Articles published weekly
40% yr/yr increase in engaged feed sessions weekly
2+ New sign-ups per second
50% Active members use LinkedIn Messaging weekly
100M+ Monthly Unique Visitors
9. Vision to Values
VISION
MISSION
VALUE PROPOSITION
TARGET AUDIENCES
STRATEGY
PRIORITIES
OBJECTIVES
CULTURE Transformation - Integrity - Collaboration - Humor - Results
VALUES Members First - Relationships Matter - Be Open, Honest, and Constructive - Demand Excellence - Take Intelligent Risks - Act Like an Owner
10. Values
• Members First
• Relationships Matter
• Be Open, Honest, and Constructive
• Demand Excellence
• Take Intelligent Risks
• Act Like an Owner
11. Kevin Scott’s Hierarchy of Engineering Needs
Foundation
Site Reliability Engineering
Site Up & Secure
Technology at scale
Development at scale
Solid APIs and building blocks
Efficient
Magic
14. LinkedIn Operations 2010
● Classical, stratified model: Systems, Networks, Applications, DBA
● Heavy-weight processes driven by tickets and heroes
● Culture of not trusting developers in any deployed environments
● Huge wall and growing frustration between Dev and Ops teams (and in ops itself)
● 7 engineers in total made up NOC, SRE, Release Operations: “Site Operations”
● On-call was horrible
15. Is the Site Up? 2010
● Peak traffic periods Mon-Wed ~ 6-10am
● Regular capacity related outages Mon-Wed ~ 6-10am
● Zero tolerance for failure in the application stack
● Near zero instrumentation
● Bi-weekly downtime maintenances
16. Let’s make a few changes
change software development model
active/active serving model
cheaper datacenters
remove monolithic databases
graceful degradation
remove hardware load balancers
more data centers
move to service oriented architecture
24/7 deployments
dev driven deployments
replace Java serialized objects over RPC with REST APIs
modernize our application stack
move faster
self service everything
code contributions to the main application stack
3x3 deployments
auto escalation
auto remediation
automated datacenter buildout
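One item in the list above, graceful degradation, can be illustrated with a minimal sketch. It assumes a page component that falls back to a cheaper result when its primary dependency fails. The function names (`with_fallback`, `personalized_feed`, `cached_feed`) are hypothetical and are not taken from LinkedIn's stack.

```python
from typing import Callable, List


def with_fallback(primary: Callable[[], List[str]],
                  fallback: Callable[[], List[str]]) -> List[str]:
    """Return the primary result, or a degraded result if the primary fails."""
    try:
        return primary()
    except Exception:
        # Degrade this one component instead of failing the whole request.
        return fallback()


def personalized_feed() -> List[str]:
    # Hypothetical downstream call that is currently unavailable.
    raise TimeoutError("recommendation service timed out")


def cached_feed() -> List[str]:
    # Hypothetical cheaper result, e.g. a cached or generic list.
    return ["popular-article-1", "popular-article-2"]


if __name__ == "__main__":
    print(with_fallback(personalized_feed, cached_feed))
```

The caller still gets a usable response when the dependency is down, which is the opposite of the "zero tolerance for failure in the application stack" situation described for 2010.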
17. Development Practices 2010
1 Merge hell
Branch and isolate engineers from each other
2 Poor understanding of change impact
Monolithic codebase
Unspecified dependencies
3 Poor testing practices
Unmaintained, brittle tests
18. Development Practices 2010: Effects
Best Case: Two weeks lag from commit to production deployment of a feature
Production Deployment: Heroic efforts, released only part of planned changes
19. Speed, Safety and Stability
1 Development
2 Code Release
3 Feature Release
Developer Satisfaction & Happiness
20. Development Practices 2018
1 Stable shared code base
Trunk based development
2 Versioned dependencies
Modular logical code components
3 Automatic detection and rollback to reduce MTTR
Basic code coverage
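The slide above ties automatic detection and rollback to a lower MTTR. As a worked illustration of that arithmetic, the sketch below reads MTTR as mean time to restore and averages the detection-to-restoration time over a set of incidents. The incident timestamps are invented for illustration only and are not LinkedIn data.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

# (detected_at, restored_at) for one incident
Incident = Tuple[datetime, datetime]


def mean_time_to_restore(incidents: List[Incident]) -> timedelta:
    """Average time from detection to restoration across incidents."""
    if not incidents:
        raise ValueError("need at least one incident")
    total = sum(
        (restored - detected for detected, restored in incidents),
        timedelta(),
    )
    return total / len(incidents)


# Hypothetical incidents resolved by hand (50 and 30 minutes).
manual = [
    (datetime(2018, 1, 1, 9, 0), datetime(2018, 1, 1, 9, 50)),
    (datetime(2018, 1, 2, 9, 0), datetime(2018, 1, 2, 9, 30)),
]
# Hypothetical incidents resolved by automatic rollback (4 and 6 minutes).
automated = [
    (datetime(2018, 1, 1, 9, 0), datetime(2018, 1, 1, 9, 4)),
    (datetime(2018, 1, 2, 9, 0), datetime(2018, 1, 2, 9, 6)),
]

if __name__ == "__main__":
    print(mean_time_to_restore(manual))     # 0:40:00
    print(mean_time_to_restore(automated))  # 0:05:00
```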
21. Development Practices 2018
Rapid, incremental, small changes to production throughout the day
Automated tooling gives Go/No-Go signal at each stage
15K+ Successful commits/day
35K Build Test Jobs/Day
28 Mins for Code Review
22. Core SRE Principles
1 Site Up
2 Empower Developer Ownership
3 Operations is an Engineering Problem
24. Self-service Deployments
Promote to a single production data center
“Canary” to a single production instance
EKG: automated metrics-based validation
Ramp features slowly to the member base
Promote to remaining production data centers
15K+ Successful commits/day
200+ Code promotions/day
600+ Feature ramps/day
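The idea of metrics-based validation gating each promotion step can be sketched roughly as follows. This is not the EKG tool itself. It assumes every metric is "lower is better" and that a stage passes only if no metric regresses by more than a tolerance against a baseline. The stage names, metric names, and the `validate` and `promote` functions are all hypothetical.

```python
from typing import Callable, Dict, List

Metrics = Dict[str, float]


def validate(baseline: Metrics, candidate: Metrics,
             tolerance: float = 0.10) -> bool:
    """Go/No-Go: pass only if no metric exceeds baseline by more than tolerance.

    Assumes every metric is "lower is better" (error rate, latency, ...).
    """
    for name, base in baseline.items():
        value = candidate.get(name)
        if value is None:
            return False  # a missing metric is treated as No-Go
        if value > base * (1 + tolerance):
            return False
    return True


def promote(stages: List[str], baseline: Metrics,
            measure: Callable[[str], Metrics]) -> List[str]:
    """Walk the stages in order and stop at the first failed validation.

    Returns the stages that completed successfully.
    """
    completed: List[str] = []
    for stage in stages:
        if not validate(baseline, measure(stage)):
            break  # a real pipeline would trigger a rollback here
        completed.append(stage)
    return completed


if __name__ == "__main__":
    baseline = {"error_rate": 0.01, "p99_latency_ms": 200.0}

    def fake_measure(stage: str) -> Metrics:
        # Hypothetical readings: the second stage shows an error spike.
        if stage == "single-dc":
            return {"error_rate": 0.05, "p99_latency_ms": 210.0}
        return {"error_rate": 0.01, "p99_latency_ms": 205.0}

    print(promote(["canary", "single-dc", "remaining-dcs"],
                  baseline, fake_measure))
```

Passing the stage list in as data keeps the gate logic independent of any particular promotion order.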
25. Create a culture of operational metrics
“What gets measured gets fixed”
26. Self-service Instrumentation and Monitoring
Diagram components: Java applications, non-Java applications, REST API, metrics collectors, metrics API, alerting, visualization, IRIS
23K Graph dashboards
10M Metrics ingested/sec
340K Alerts processed/min
600M+ Total metrics
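The flow on this slide (applications emit metrics, collectors store them, alerting reads them) can be sketched in miniature. The `MetricsStore` class and `should_alert` function below are hypothetical stand-ins, not the API of LinkedIn's metrics pipeline or of IRIS. The alert rule is an assumption chosen for illustration: fire only when the last few samples all exceed a threshold.

```python
from collections import defaultdict
from typing import Dict, List


class MetricsStore:
    """In-memory stand-in for a metrics collector."""

    def __init__(self) -> None:
        self._series: Dict[str, List[float]] = defaultdict(list)

    def emit(self, name: str, value: float) -> None:
        """Record one sample for the named metric."""
        self._series[name].append(value)

    def latest(self, name: str, n: int) -> List[float]:
        """Return up to the last n samples for the named metric."""
        return self._series[name][-n:]


def should_alert(store: MetricsStore, name: str,
                 threshold: float, window: int = 3) -> bool:
    """Alert only if the last `window` samples all exceed the threshold.

    Requiring several consecutive breaches avoids paging on a single spike.
    """
    recent = store.latest(name, window)
    return len(recent) == window and all(v > threshold for v in recent)


if __name__ == "__main__":
    store = MetricsStore()
    for latency_ms in (120.0, 480.0, 510.0, 530.0):
        store.emit("frontend.p99_latency_ms", latency_ms)
    print(should_alert(store, "frontend.p99_latency_ms", threshold=400.0))
```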
29. SRE Globally Today
400+ SREs across four global offices
SF San Francisco
SNV Sunnyvale
BLR Bangalore
NYC New York City
Composed of Software, Database, Security, and Infrastructure Engineering generalists that make LinkedIn work
30. Embedded SRE Engagement Model
Partner with application development teams leveraging metrics, SLOs, and KPIs
Involved from software inception to decommission
Participate in sprints, attend regular staff meetings and sit with the development teams
Contribute to code base: bug fixes, instrumentation, logging, improve efficiency, resilience and scaling
Participate in on-call rotation for critical issues along with development team
Define production-readiness and overall operability requirements
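Partnering on SLOs usually comes down to simple arithmetic that both teams can check. The sketch below assumes an availability-style SLO expressed as a fraction of successful requests and computes how many failures that allows and how much of that allowance has been used. The function names and the numbers in the example are hypothetical.

```python
def error_budget(slo: float, total_requests: int) -> int:
    """Number of failed requests allowed under the SLO for this period."""
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be in (0, 1]")
    return int(round((1.0 - slo) * total_requests))


def budget_consumed(slo: float, total_requests: int, failed: int) -> float:
    """Fraction of the error budget already used (1.0 means exhausted)."""
    budget = error_budget(slo, total_requests)
    if budget == 0:
        raise ValueError("SLO leaves no error budget for this request count")
    return failed / budget


if __name__ == "__main__":
    # Hypothetical period: 99.9% SLO, one million requests, 250 failures.
    print(error_budget(0.999, 1_000_000))          # 1000
    print(budget_consumed(0.999, 1_000_000, 250))  # 0.25
```

A team that has used a quarter of its budget has room for further releases. A team near 1.0 has a shared, measurable reason to slow down.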
31. Engineering Culture
Act like an owner
Build Leverage
Reduce MTTR
Automate Everything
Measure Everything
Protect Member Data