Making Sites Reliable (how to make a system reliable) (Pavel Uvarov, Andrey Tatarinov)
Published in Technology, Business
  • 1. Making Sites Reliable
    Human processes to increase reliability
    Andrey Tatarinov, Pavel Uvarov
    Site Reliability Engineering, Google
  • 2. Engineering aspects and roles
    ● Software Development (SWE)
      ○ Designing
      ○ Coding
    ● Provisioning (long-term prevention of outages) (SWE+SRE)
      ○ Capacity planning
      ○ Automation
      ○ Operations feedback
    ● Operations (short-term prevention of outages) (SRE)
      ○ Manual response (oncall)
      ○ Site (system) administration
  • 3. Spreading knowledge
    ● Mad genius problem
    ● Sociopath problem
    ● Introducing new team member
    ● Shuffling teams
    ● Low "bus factor"
    ● Medium and large teams (30+ people)
    ● Startups have other problems
  • 4. Sweet spot (or somewhere around)
    ● TL;DR: Speak more to teammates
    ● Peer-to-peer review in every process
    ● Data-driven decision making
    ● Group decision making
    ● Priorities
      ○ Widespread knowledge
      ○ Predictable quality
      ○ Leveling extremes
    ● Key point
      ○ Human processes that scale
      ○ Principles are omnipresent
      ○ Not educational or disciplinary, but part of day-to-day life
  • 5. Design review
    ● Each significant change triggers a design review
    ● Document that captures
      ○ Problem statement
      ○ Requirements
      ○ Proposed solution
      ○ Costs and benefits
        ■ Development
        ■ Resources
      ○ Risk factors and solutions
    ● Design review meeting
      ○ Different experts: PM, SWEs, SREs
      ○ Different priorities: new features, architecture sanity, stability
    ● Result:
      ○ Widespread knowledge
      ○ Balanced compromise between new functionality, sanity and reliability
  • 6. Code review
    ● Before commit, not after
    ● Style guide
      ○ "Code is good when you can't tell who wrote it"
      ○ Readability
    ● Peer-to-peer
      ○ No individual ownership
      ○ Each change should be reviewed by another engineer with expertise in this area
    ● Result
      ○ Widespread knowledge
      ○ Reliable changes
      ○ Easy to read and modify, consistent code
  • 7. Knowledge externalization
    ● Oncall engineer wakes SWE up at 2am
    ● External knowledge database
    ● Long-term memory
      ○ Playbook
    ● Short-term memory
      ○ Alert history
      ○ Hand-off
  • 8. Manual response
    ● No L0, L1, L2 operations levels
    ● Anyone with basic knowledge can react to 80-90% of alerts
      ○ Knowledge is included
      ○ A human is an intelligent executor
        ■ Prevents feedback loops
        ■ Can identify anomalies
        ■ Feedback for provisioning and automation
    ● Weekly/monthly/quarterly oncall review
      ○ More people are aware
      ○ Top issues identified
      ○ Automation/re-architecting planned
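The "top issues identified" step of the oncall review can be pictured as a frequency count over the alert history. The following is a minimal sketch, not the process described in the talk: the alert names, the `(name, actionable)` record shape and the `top_issues` helper are all hypothetical.

```python
from collections import Counter

def top_issues(alert_history, n=3):
    """Return the n most frequent alert names in a review period."""
    counts = Counter(name for name, _actionable in alert_history)
    return counts.most_common(n)

# Hypothetical alert history for one review period: (alert name, was it actionable).
history = [
    ("disk_full", True),
    ("high_latency", True),
    ("disk_full", True),
    ("cert_expiry", False),
    ("disk_full", True),
    ("high_latency", False),
]

for name, count in top_issues(history, n=2):
    print(f"{name}: {count} pages")
```

Alerts that dominate such a count are natural candidates for the automation or re-architecting work the slide mentions.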
  • 9. Productionization
    ● To get into production, every service has to comply with a checklist
    ● Checklist derived from experience
      ○ Continuous builds/testing
      ○ Load testing
      ○ Capacity planning
      ○ General health and user-facing monitoring/alerting
      ○ Identified potential issues
        ■ Monitoring and failover scenarios
      ○ Playbook entries
    ● Result: a service that behaves predictably; any SRE can react to an outage
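A launch checklist like this one can be treated as a simple gate. The sketch below assumes a hypothetical status dictionary per service; the key names paraphrase the slide's checklist items and are not an actual tool or schema from the talk.

```python
# Checklist items paraphrased from the slide; the key names are hypothetical.
CHECKLIST = [
    "continuous_builds_and_testing",
    "load_testing",
    "capacity_planning",
    "monitoring_and_alerting",
    "failover_scenarios",
    "playbook_entries",
]

def missing_items(service_status):
    """Return the checklist items a service has not completed, in checklist order."""
    return [item for item in CHECKLIST if not service_status.get(item, False)]

def ready_for_production(service_status):
    """A service passes the gate only when nothing is missing."""
    return not missing_items(service_status)

# Hypothetical service that has not finished capacity planning or its playbook.
status = {
    "continuous_builds_and_testing": True,
    "load_testing": True,
    "capacity_planning": False,
    "monitoring_and_alerting": True,
    "failover_scenarios": True,
}

print(ready_for_production(status), missing_items(status))
```

Items absent from the dictionary count as not done, so a new checklist entry blocks launches until it is explicitly addressed.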
  • 10. Post-mortem
    ● Each large outage triggers a post-mortem and a post-mortem meeting
    ● Document with
      ○ Impact
      ○ Timeline
      ○ Root cause
      ○ Action items to prevent root cause from happening
    ● Post-mortem review
    ● Result
      ○ Widespread knowledge
      ○ This class of root causes is unlikely to happen again
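The four parts of the post-mortem document map naturally onto a small record type. This is only an illustrative sketch: the class names, fields beyond the four listed on the slide (such as `owner` and `done`) and the example outage are assumptions, not the template used by the authors.

```python
from dataclasses import dataclass, field

@dataclass
class ActionItem:
    description: str
    owner: str          # hypothetical field: who follows up
    done: bool = False  # hypothetical field: tracked until closed

@dataclass
class PostMortem:
    impact: str
    timeline: list      # list of (time, event) tuples
    root_cause: str
    action_items: list = field(default_factory=list)

    def open_action_items(self):
        """Action items that still need work."""
        return [item for item in self.action_items if not item.done]

    def is_closed(self):
        """Closed only when at least one action item exists and all are done."""
        return bool(self.action_items) and not self.open_action_items()

# Hypothetical example outage.
pm = PostMortem(
    impact="Search unavailable for 12 minutes",
    timeline=[("02:00", "alert fired"), ("02:12", "traffic drained, service restored")],
    root_cause="Bad config pushed without syntax check",
    action_items=[
        ActionItem("Validate config syntax before push", owner="swe-team", done=True),
        ActionItem("Add playbook entry for config rollback", owner="sre-team"),
    ],
)

print(pm.is_closed(), [item.description for item in pm.open_action_items()])
```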
  • 11. Site Reliability Engineering
  • 12. Reliability
    ● The ability of a person or system to perform and maintain its functions in routine circumstances, as well as hostile or unexpected circumstances. (Wikipedia)
    ● Unexpected situations cannot be handled by a computer program. Otherwise it's not unexpected. (Common logic)
    We will talk about human processes
  • 13. Site
    ● Distributed complicated system
      ○ Many machines, switches, routers, racks, etc.
      ○ Lots of data
      ○ Many services
      ○ But not so many people (machines:admins > 4000:1)
    ● Everything breaks all the time
      ○ Hardware...
        ■ Fan stopped, bad memory, disk died, ...
      ○ Software...
        ■ Server not running, wrong version, slow response, ...
      ○ Network...
  • 14. Site Reliability
    ● Suppose we have the software to run on the site
    ● It must just work
    ● How to ensure it?
      ○ Engineer the production environment for reliability
        ■ Automate whatever possible
      ○ Engineer systems and tools that increase reliability
        ■ Monitoring and alerting
        ■ Databases that keep track of machines, tasks and even traffic
      ○ Handle unexpected situations manually
        ■ Oncall
  • 15. Automate...
    ● If a person doesn't make any decisions, he/she can be replaced with a script, which is much more scalable and reliable
    ● Automate
      ○ failovers
      ○ load balancing
      ○ response to failures
      ○ routine repairs (reinstalling, draining)
    ● Determine physical configuration automatically
    ● Expect the unexpected
      ○ rare events are completely normal
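Failover is a good example of a response that involves no decision and can therefore be scripted. The sketch below assumes a priority-ordered list of backends and a caller-supplied health check; `pick_backend`, the backend names and the `down` set are hypothetical stand-ins for a real health probe.

```python
def pick_backend(backends, is_healthy):
    """Return the first healthy backend in priority order, or None if all are down."""
    for backend in backends:
        if is_healthy(backend):
            return backend
    return None

# Hypothetical backends in priority order, and a stubbed health check.
backends = ["cell-a", "cell-b", "cell-c"]
down = {"cell-a"}

def is_healthy(backend):
    return backend not in down

print(pick_backend(backends, is_healthy))
```

Returning `None` when nothing is healthy leaves the truly unexpected case to the oncall engineer instead of hiding it inside the script.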
  • 16. Preventing outages
    ● In reality the software is always being upgraded
    ● External world is changing too
    ● Preventing problems
      ○ Capacity planning & management
      ○ Review design docs
      ○ Prelaunch (onboarding) review
      ○ Adapt production environment to (external) changes
  • 17. Oncall
    ● Monitoring
      ○ Every program/server is universally monitored
    ● Alerting
      ○ Alert rules
      ○ Alert escalation
      ○ Pager
    ● Oncall "memory"
      ○ Playbook (runbook)
      ○ Oncall hand-off
      ○ Tracking of issues
    ● Oncall review weekly/monthly -> feedback
    ● Shifts
    
  • 18. Monitoring and Alerting
    [Diagram; labels: services, monitoring variables, monitoring system, variables history, alert rules, alert (page), primary oncall, escalation, secondary oncall, the team]
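The alert rule and escalation path named in this slide can be sketched as follows. The ordering primary oncall, secondary oncall, the team is read off the slide's labels; the threshold rule, the 5-minute acknowledgement timeout and the function names are assumptions made purely for illustration.

```python
# Escalation order as suggested by the slide's labels.
ESCALATION_CHAIN = ["primary oncall", "secondary oncall", "the team"]

def alert_fires(value, threshold):
    """A hypothetical alert rule: fire when a monitored variable exceeds a threshold."""
    return value > threshold

def who_to_page(minutes_unacknowledged, ack_timeout_minutes=5):
    """Return who is paged once an alert has gone unacknowledged for the given time.

    Each full timeout without acknowledgement moves one step down the chain;
    the last entry is the final stop. The 5-minute default is an assumption.
    """
    level = minutes_unacknowledged // ack_timeout_minutes
    level = min(level, len(ESCALATION_CHAIN) - 1)
    return ESCALATION_CHAIN[level]

# Hypothetical monitored variable: error ratio of a service.
if alert_fires(value=0.07, threshold=0.05):
    for minutes in (0, 5, 12):
        print(minutes, "min ->", who_to_page(minutes))
```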
  • 19. Preserving oncall "memory"
    ● Long-term "memory"
      ○ Playbook (useful hints to handle alerts)
    ● Short-term "memory"
      ○ Hand-off (description of the shift for the next oncall)
      ○ Tracking recent issues
  • 20. Carrying the pager. Shifts are geographically distributed
  • 21. SWE vs SRE
  • 22. SWE: I want to believe!
    Software Engineers are very devout believers. They believe that:
    ● Their software does not contain bugs
    ● The datacenters and machines are always "up" and gratis
    ● The network has infinite capacity
    ● The speed of light does not apply to their system
    ● Disks never fail and seek times are close to zero msec
    ● Configuration files do not contain (syntax) errors
  • 23. SREs are cynics!
    Site reliability engineers know that:
    ● All software sucks
    ● Everything always fails, preferably in the most inconvenient order
    ● Complexity is the enemy of reliability
    ● The laws of physics are real, even inside a virtual machine
    ● All processes that contain a manual component are highly failure prone
    ● Traffic forecasts are bogus
  • 24. Ouch
  • 25. Questions?
  • 26. Thanks!