Making Sites Reliable (How to Make a System Reliable) (Pavel Uvarov, Andrey Tatarinov)


  1. Making Sites Reliable
     Human processes to increase reliability
     Andrey Tatarinov, Pavel Uvarov
     Site Reliability Engineering, Google
  2. Engineering aspects and roles
     ● Software Development (SWE)
       ○ Designing
       ○ Coding
     ● Provisioning: long-term outage prevention (SWE+SRE)
       ○ Capacity planning
       ○ Automation
       ○ Operations feedback
     ● Operations: short-term outage prevention (SRE)
       ○ Manual response (oncall)
       ○ Site (system) administration
  3. Spreading knowledge
     ● Mad genius problem
     ● Sociopath problem
     ● Introducing a new team member
     ● Shuffling teams
     ● Low "bus factor"
     ● Medium and large teams (30+ people)
     ● Startups have other problems
  4. Sweet spot (or somewhere around)
     ● TL;DR: Talk to your teammates more
     ● Peer-to-peer review in every process
     ● Data-driven decision making
     ● Group decision making
     ● Priorities
       ○ Widespread knowledge
       ○ Predictable quality
       ○ Leveling extremes
     ● Key point
       ○ Human processes that scale
       ○ Principles are omnipresent
       ○ Not educational or disciplinary, but part of day-to-day life
  5. Design review
     ● Each significant change triggers a design review
     ● Document that captures
       ○ Problem statement
       ○ Requirements
       ○ Proposed solution
       ○ Costs and benefits
         ■ Development
         ■ Resources
       ○ Risk factors and solutions
     ● Design review meeting
       ○ Different experts: PM, SWEs, SREs
       ○ Different priorities: new features, architecture sanity, stability
     ● Result
       ○ Widespread knowledge
       ○ Balanced compromise between new functionality, sanity and reliability
  6. Code review
     ● Before commit, not after
     ● Style guide
       ○ "Code is good when you can't tell who wrote it"
       ○ Readability
     ● Peer-to-peer
       ○ No individual ownership
       ○ Each change should be reviewed by another engineer with expertise in the area
     ● Result
       ○ Widespread knowledge
       ○ Reliable changes
       ○ Consistent code that is easy to read and modify
  7. Knowledge externalization
     ● Oncall engineer wakes SWE up at 2am
     ● External knowledge database
     ● Long-term memory
       ○ Playbook
     ● Short-term memory
       ○ Alert history
       ○ Hand-off
  8. Manual response
     ● No L0, L1, L2 operations levels
     ● Anyone with basic knowledge can react to 80-90% of alerts
       ○ Knowledge is included
       ○ The human is an intelligent executor
         ■ Prevents feedback loops
         ■ Can identify anomalies
         ■ Gives feedback for provisioning and automation
     ● Weekly/monthly/quarterly oncall review
       ○ More people are aware
       ○ Top issues identified
       ○ Automation/rearchitecting planned
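
The "top issues identified" step of an oncall review can be sketched as a simple tally over the alert history. This is a minimal illustration, not the authors' tooling: the record layout, the alert names and the `top_issues` helper are all hypothetical.

```python
from collections import Counter


def top_issues(alert_history, n=3):
    """Return the n most frequent alert names as (name, count) pairs."""
    counts = Counter(record["alert"] for record in alert_history)
    return counts.most_common(n)


# Hypothetical alert history for one review period (made-up data).
history = [
    {"alert": "DiskFull", "handled_by": "alice"},
    {"alert": "HighLatency", "handled_by": "bob"},
    {"alert": "DiskFull", "handled_by": "bob"},
    {"alert": "DiskFull", "handled_by": "carol"},
    {"alert": "HighLatency", "handled_by": "alice"},
    {"alert": "CertExpiring", "handled_by": "carol"},
]

for name, count in top_issues(history, n=2):
    print(f"{name}: {count}")
```

Alerts that dominate such a tally are natural candidates for the automation or rearchitecting work planned at the review.
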
  9. Productionization
     ● To get into production, every service has to comply with a checklist derived from experience
       ○ Continuous builds/testing
       ○ Load testing
       ○ Capacity planning
       ○ General health and user-facing monitoring/alerting
       ○ Identified potential issues
         ■ Monitoring and failover scenarios
       ○ Playbook entries
     ● Result: a service that behaves predictably; any SRE can react to an outage
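
One way to make such a checklist enforceable is to keep it as data and compute what a service still lacks. The sketch below is only an assumption about how that might look: the item strings paraphrase the slide, and `missing_items` and `ready_for_production` are hypothetical helpers.

```python
# Checklist items paraphrased from the slide; the exact wording is illustrative.
CHECKLIST = [
    "continuous builds/testing",
    "load testing",
    "capacity planning",
    "monitoring/alerting",
    "failover scenarios",
    "playbook entries",
]


def missing_items(completed):
    """Return the checklist items not yet completed, in checklist order."""
    done = set(completed)
    return [item for item in CHECKLIST if item not in done]


def ready_for_production(completed):
    """A service is ready only when nothing on the checklist is missing."""
    return not missing_items(completed)


# Hypothetical service that has not yet written playbook entries.
completed = CHECKLIST[:-1]
print(missing_items(completed))
print(ready_for_production(completed))
```
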
  10. Post-mortem
     ● Each large outage triggers a post-mortem and a post-mortem meeting
     ● Document with
       ○ Impact
       ○ Timeline
       ○ Root cause
       ○ Action items to prevent the root cause from happening again
     ● Post-mortem review
     ● Result
       ○ Widespread knowledge
       ○ This class of root causes is unlikely to happen again
  11. Site Reliability Engineering
  12. Reliability
     ● "The ability of a person or system to perform and maintain its functions in routine circumstances, as well as hostile or unexpected circumstances." (Wikipedia)
     ● "Unexpected situations cannot be handled by a computer program. Otherwise it's not unexpected." (Common Logic)
     We will talk about human processes
  13. Site
     ● Complicated distributed system
       ○ Many machines, switches, routers, racks, etc.
       ○ Lots of data
       ○ Many services
       ○ But not so many people (machines:admins > 4000:1)
     ● Everything breaks all the time
       ○ Hardware...
         ■ Fan stopped, bad memory, disk died, ...
       ○ Software...
         ■ Server not running, wrong version, slow response, ...
       ○ Network...
  14. Site Reliability
     ● Suppose we have the software to run on the site
     ● It must just work
     ● How to ensure it?
       ○ Engineer the production environment for reliability
         ■ Automate whatever possible
       ○ Engineer systems and tools that increase reliability
         ■ Monitoring and alerting
         ■ Databases that keep track of machines, tasks and even traffic
       ○ Handle unexpected situations manually
         ■ Oncall
  15. Automate...
     ● If a person doesn't make any decisions, he/she can be replaced with a script, which is much more scalable and reliable
     ● Automate
       ○ failovers
       ○ load balancing
       ○ response to failures
       ○ routine repairs (reinstalling, draining)
     ● Determine physical configuration automatically
     ● Expect the unexpected
       ○ rare events are completely normal
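
A routine repair is a good example of a decision-free procedure that a script can run. The following is a toy sketch under stated assumptions: the `Machine` class is hypothetical, and the reinstall step is a stand-in that just flips a flag rather than touching real hardware.

```python
from dataclasses import dataclass, field


@dataclass
class Machine:
    """Hypothetical in-memory model of a machine; not a real inventory API."""
    name: str
    healthy: bool
    serving: bool = True
    log: list = field(default_factory=list)


def repair(machine):
    """Run the fixed repair sequence: drain, reinstall, return to service."""
    machine.serving = False
    machine.log.append("drained")
    machine.healthy = True  # stand-in for a real reinstall
    machine.log.append("reinstalled")
    machine.serving = True
    machine.log.append("restored")


def repair_unhealthy(fleet):
    """Repair every unhealthy machine and return the names that were repaired."""
    repaired = []
    for machine in fleet:
        if not machine.healthy:
            repair(machine)
            repaired.append(machine.name)
    return repaired


fleet = [Machine("m1", healthy=True), Machine("m2", healthy=False)]
print(repair_unhealthy(fleet))
```

Because the sequence involves no judgment, the same loop works for two machines or for thousands, which is the scalability argument the slide makes.
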
  16. Preventing outages
     ● In reality the software is always being upgraded
     ● The external world is changing too
     ● Preventing problems
       ○ Capacity planning & management
       ○ Review design docs
       ○ Prelaunch (onboarding) review
       ○ Adapt production environment to (external) changes
  17. Oncall
     ● Monitoring
       ○ Every program/server is universally monitored
     ● Alerting
       ○ Alert rules
       ○ Alert escalation
       ○ Pager
     ● Oncall "memory"
       ○ Playbook (runbook)
       ○ Oncall hand-off
       ○ Tracking of issues
     ● Oncall review weekly/monthly -> feedback
     ● Shifts
  18. Monitoring and Alerting
     [Diagram. Labels: services, monitoring variables, monitoring system, variables history, alert rules, alert (page), escalation, primary oncall, secondary oncall, the team]
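
The pieces named on this slide (monitoring variables, alert rules, pages, escalation) can be tied together in a short sketch. It assumes that escalation runs from the primary oncall to the secondary oncall and then to the team, which is a reading of the diagram labels rather than something the slide states. The rule names, thresholds and the `firing_alerts` and `page` helpers are hypothetical.

```python
# Assumed escalation order, inferred from the diagram labels.
ESCALATION_CHAIN = ["primary oncall", "secondary oncall", "the team"]


def firing_alerts(variables, rules):
    """Return the names of rules whose predicate holds for the current variables."""
    return [name for name, predicate in rules.items() if predicate(variables)]


def page(alert, acknowledges):
    """Walk the escalation chain; return who acknowledged the alert, or None."""
    for recipient in ESCALATION_CHAIN:
        if acknowledges(recipient, alert):
            return recipient
    return None


# Hypothetical alert rules over hypothetical monitoring variables.
rules = {
    "HighErrorRate": lambda v: v["errors_per_sec"] > 10,
    "LowDisk": lambda v: v["disk_free_pct"] < 5,
}
variables = {"errors_per_sec": 25, "disk_free_pct": 40}

for alert in firing_alerts(variables, rules):
    # Simulate a primary oncall who misses the page.
    who = page(alert, lambda recipient, _alert: recipient == "secondary oncall")
    print(f"{alert} acknowledged by {who}")
```

A real system would also wait for an acknowledgement timeout before moving to the next recipient; the sketch skips timing to stay small.
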
  19. Preserving oncall "memory"
     ● Long-term "memory"
       ○ Playbook (useful hints to handle alerts)
     ● Short-term "memory"
       ○ Hand-off (description of the shift for the next oncall)
       ○ Tracking recent issues
  20. Carrying the pager. Shifts are geographically distributed
  21. SWE vs SRE
  22. SWE: I want to believe!
     Software engineers are very devout believers. They believe that:
     ● Their software does not contain bugs
     ● The datacenters and machines are always "up" and gratis
     ● The network has infinite capacity
     ● The speed of light does not apply to their system
     ● Disks never fail and seek times are close to zero msec
     ● Configuration files do not contain (syntax) errors
  23. SREs are cynics!
     Site reliability engineers know that:
     ● All software sucks
     ● Everything always fails, preferably in the most inconvenient order
     ● Complexity is the enemy of reliability
     ● The laws of physics are real, even inside a virtual machine
     ● All processes that contain a manual component are highly failure-prone
     ● Traffic forecasts are bogus
  24. Ouch
  25. Questions?
  26. Thanks!
