SRE From Scratch

How to bootstrap an SRE team in your company: how to hire SREs, what to have them work on, and how to interact with them as a team. It closes with some thoughts on general practices to consider before your SREs arrive. There are also kitten pictures.

Published in: Technology, Business

SRE From Scratch

  1. SRE FROM SCRATCH
  2. SITE RELIABILITY ENGINEERING
  3. PRODUCTION ENGINEERING
  4. DEVOPS?
  5. WHAT DO SRE DO?
  6. KEEP THE SITE UP
  7. KNOW THE PRODUCTION ENVIRONMENT
  8. KNOW THEIR PRODUCT
  9. LIAISON, ADVISOR, CONSULTANT
  10. TOOLING AND AUTOMATION
  11. TRIAGE
  12. SO? WHY DO I NEED THEM?
  13. UPTIME
  14. THE ENVIRONMENT IS A PRODUCT
  15. THEY’VE DONE THIS BEFORE
  16. OK... LET’S HIRE SOME
  17. WHAT TO LOOK FOR...
  18. SRES!
  19. SYSADMINS THAT PROGRAM
  20. PROGRAMMERS THAT DO SYSADMIN
  21. EXPERIENCE WITH SCALE
  22. HOW DO I INTERVIEW THEM?
  23. FUNDAMENTALS
  24. HARDWARE
  25. SYSTEM INTERNALS
  26. UNIX ENVIRONMENT
  27. NETWORKING
  28. APPLICATION SUPPORT
  29. OPERATING AT SCALE
  30. PROGRAMMING
  31. DON’T HIRE HEROES
  32. OK, I’VE HIRED SOME, WHAT SHOULD THEY DO?
  33. DESIGN REVIEW
  34. DATA FLOWS
  35. DEPENDENCIES
  36. FAILURE CONDITIONS
  37. SCALING
  38. LAUNCH PREPAREDNESS
  39. DOCUMENTATION
  40. BUILD INFRASTRUCTURE
  41. MONITORING
  42. DEPLOYMENT
  43. OPERATOR TOOLS
  44. CONFIGURATION MANAGEMENT
  45. SELF-SERVICE
  46. HOW SHOULD THE TEAMS INTERACT...
  47. DON’T GIVE ALL THE DAY-TO-DAY TASKS TO THE SRES
  48. SHARE THE LOAD
  49. HAVE YOUR SRES SIT WITH YOU
  50. INCLUDE THEM IN DISCUSSIONS THAT AFFECT THE PRODUCTION ENVIRONMENT
  51. SOFTWARE IS NEVER THROWN OVER THE WALL
  52. HAND-OFFS
  53. SRES SHOULD BLOCK DANGEROUS CHANGES
  54. IF YOUR SRES ARE FIGHTING FIRES, THEY’RE NOT BUILDING INFRASTRUCTURE
  55. IF YOUR SOFTWARE IS CAUSING FIRES, FIX IT
  56. ASK YOUR SRE TO HELP MAKE FLAME-PROOF SOFTWARE
  57. DON’T HIDE YOUR PROBLEMS FROM SRE
  58. SRE SHOULD BE INVOLVED TO UNDERSTAND THE PROBLEM
  59. EVERYONE SHOULD BE WRITING CODE OR MAKING HARD DECISIONS
  60. OF COURSE THERE ARE OPTIONS...
  61. SRE CAN DO ALL THE SUPPORT
  62. SRES ARE A LIMITED RESOURCE
  63. SWE CAN SUPPORT PRODUCTS...
  64. APP SUPPORT BY SWE, INFRASTRUCTURE SUPPORT BY SRE
  65. OR JUST ROTATE AROUND
  66. ANY PRODUCTION ADVICE?
  67. SELF-SERVICE
  68. ALL TOOLS SHOULD BE WRITTEN WITH THE IDEA THAT ROBOTS CAN RUN THEM
  69. BEFORE ROBOTS RUN THEM, ANYONE IN THE COMPANY SHOULD BE ABLE TO
  70. PEOPLE SHOULD MAKE HARD DECISIONS, NOT PUSH BUTTONS
  71. GIVE PEOPLE ACCESS
  72. SWE SHOULD HAVE AS MUCH ACCESS AS THEY NEED.
  73. SWE ALREADY WRITES CODE THAT HAS ACCESS TO SENSITIVE DATA
  74. PRODUCTION DATA STAYS IN PRODUCTION
  75. MAKE GOOD SYNTHETIC DATA
  76. MAKE GOOD WAYS TO TEST IN PROD
  77. CANARY, A/B TEST, ETC.
  78. LEARN TO TRIAGE
  79. THINGS BREAK, YOU MUST FIX THEM
  80. MONITORING, METRICS, OPERATOR TOOLS, FAST BUILD AND DEPLOY
  81. TO FIX, YOU NEED TO KNOW IT’S BROKEN
  82. MONITORING
  83. MONITOR APPLICATIONS
  84. MONITOR BEHAVIOR
  85. STANDARDIZE YOUR METRICS
  86. PUSH METRICS OUT
  87. DECOUPLE YOUR SYSTEMS
  88. WATCH SYSTEMS AS A FUNCTION OF CAPACITY
  89. ONLY ALERT ON SYSTEM METRICS KNOWN TO HURT YOU
  90. DATA STORES
  91. BEWARE THE RDBMS
  92. LEARN TO SHARD
  93. DITCH THE DURABILITY WHERE YOU CAN
  94. BUT FIGURE OUT HOW TO BOOTSTRAP NON-DURABLE STORES
  95. MEMCACHE IS A BLESSING AND A CURSE
  96. ALWAYS CONSIDER A SITE-WIDE POWER OUTAGE
  97. USE DURABLE AND NON-DURABLE STORES TOGETHER
  98. ASK YOUR SRE FOR MORE INFO
  99. DESPITE ALL THIS, YOU CAN STILL FAIL...
  100. OBVIOUS FAILURE
  101. DOWNTIME
  102. DOWNTIME WITHOUT KNOWING
  103. NON-OBVIOUS FAILURES
  104. HEROIC ACTS
  105. WERE YOU UP ALL NIGHT?
  106. DID YOU DO THAT SAME TASK ALL DAY?
  107. DID A WHOLE TEAM STOP WHAT THEY WERE DOING?
  108. THESE ARE HEROIC ACTS, THEY ARE POISON
  109. HEROISM = FAILURE
  110. COMES FROM LEGACY SYSTEMS, PROCEDURES
  111. ALSO FROM PERSONALITY TRAITS...
  112. QUESTIONS? • Grier Johnson • @grierj • grierj@gmail.com
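
Slides 88 and 89 advise watching systems as a function of capacity and alerting only on system metrics known to hurt you. The slides give titles only and no implementation, so what follows is just a minimal Python sketch of one possible reading of that advice. The metric names, the capacities, the thresholds, and the `Reading`, `utilization`, and `alerts` names are all hypothetical and do not come from the talk.

```python
from dataclasses import dataclass


@dataclass
class Reading:
    """One sample of a system metric plus the capacity it is measured against."""
    name: str
    value: float
    capacity: float


# Hypothetical allow-list: only metrics that have actually hurt us get a
# threshold, expressed as a fraction of capacity. Everything else is
# still collected but never pages anyone.
ALERT_THRESHOLDS = {
    "disk_used_gb": 0.85,
    "db_connections": 0.90,
}


def utilization(reading: Reading) -> float:
    """Return the metric as a fraction of its capacity."""
    if reading.capacity <= 0:
        raise ValueError(f"capacity for {reading.name} must be positive")
    return reading.value / reading.capacity


def alerts(readings: list[Reading]) -> list[str]:
    """Return alert messages for allow-listed metrics at or above threshold."""
    messages = []
    for reading in readings:
        threshold = ALERT_THRESHOLDS.get(reading.name)
        if threshold is None:
            continue  # not known to hurt us, so do not alert on it
        fraction = utilization(reading)
        if fraction >= threshold:
            messages.append(
                f"{reading.name} at {fraction:.0%} of capacity "
                f"(threshold {threshold:.0%})"
            )
    return messages


if __name__ == "__main__":
    sample = [
        Reading("disk_used_gb", 900, 1000),   # allow-listed and over threshold
        Reading("cpu_load", 99, 100),         # high, but not allow-listed
        Reading("db_connections", 50, 100),   # allow-listed, under threshold
    ]
    for message in alerts(sample):
        print(message)
```

Expressing each reading as a fraction of capacity keeps one threshold meaningful across machines of different sizes, and the allow-list keeps a busy but harmless metric (the hypothetical `cpu_load` above) from paging anyone.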
