
SRE Topics with Charity Majors and Liz Fong-Jones of Honeycomb


Charity's words make you think while Liz's words make you act, so when you combine them, you get one of the best meetups on Elite DevOps Performance, SRE and Observability topics – ever!

Google Meet's recording stopped working, so this *noisy* DIY copy is the best we have: https://youtu.be/geqoOg4WXcQ. Still, the video is worth your time, because you will see how empathy and a simple shift of focus

1) from Dev and Ops to your Users,
2) from APM tools to Observability,

can make your workdays more productive, enjoyable and meaningful.

To learn how to define your first SLO, go to Honeycomb's 3-part SRE Crash Course https://go.hny.co/serverlessToronto.

SRE Topics with Charity Majors and Liz Fong-Jones of Honeycomb

  1. 1. Welcome to ServerlessToronto.org 1 Introduce Yourself ☺ - Why are you here? Where from? - Looking or Offering work? Fill the survey to win prizes! “SRE Topics with Charity Majors & Liz Fong-Jones of Honeycomb.io” will start at 6:10pm…
  2. 2. Knowledge Sponsor Fill the survey to win monthly Manning Giveaways: https://forms.gle/ANEa6j3GZnsUm7jZ7 https://www.manning.com/books/software-telemetry GOOD LUCK!
  3. 3. Upcoming ServerlessToronto.org Meetups 3 1) AWS re:Invent 2020 Recap – James Beswick of AWS 2) Serverless Cloud Native Java with Spring Cloud GCP (No platform needed) – Ray Tsang, Java Champion of GCP 3) Community Lightning Talks & Updates for our 3rd Birthday 4) ??? We need you ☺ Please rate us online ☺
  4. 4. Serverless is not just about the Tech: 4 Serverless is New Agile & Mindset #1 Serverless Dev (Back-end FaaS dev, but turned into gluing APIs and Managed Services) #2 We're obsessed with creating business value (meaningful MVPs, Products), to empower Business users #3 We build bridges between the Serverless Community (“Dev leg”), and Front-end & Voice-First developers & User Experience designers (“UX leg”) #4 Achieve agility NOT by “sprinting” faster (like in Scrum), but by working smarter (using bigger building blocks and less Ops)
  5. 5. “Toss It Over the Fence” SDLC principle ;) Why DevOps topics at Serverless event? 5 My reality was: The idea of Dev – Ops “line” became, not 2 but 3 silos: Dev | DevOps | Ops
  6. 6. Looking back, it appears that my obsession with Startups/Serverless was (in a way) an escape from the Corporate IT reality Why DevOps topics at Serverless event? 6
  7. 7. 7 SLI tells when users are happy/unhappy SLIs, SLOs, SLAs, oh my! (class SRE implements DevOps) videos “When the student is ready, the teacher will appear” to help us resolve Dev vs. Ops conflict… SRE shifts focus from Dev & Ops to Users!
  8. 8. 8 Feature Presentations with Charity & Liz, Honeycomb.io
  9. 9. @mipsytipsy The Socio-Technical Path to ✨High-Performing✨ Teams Observability and the Glorious Future @mipsytipsy
  10. 10. @mipsytipsy engineer/cofounder/CTO https://charity.wtf
  11. 11. the fundamental building block by which we organize ourselves and coordinate and scale our labor. Teams.
  12. 12. The teams you join will define your career more than any other single factor.
  13. 13. bad jobs can be bad in so, so many different ways… Bad Job One: • predatory product • praised for pulling all-nighters • alienated from coworkers • long commute • neglectful manager • the worst Silicon Valley cliches Bad Job Two: • aging, obsolete tech • high operational toil • fragile, flappy systems • complacency • low engineering skill level • command-and-control leadership
  14. 14. autonomy, learning, high-achieving, learned from our mistakes, curious, responsibility, ownership, inspiring, camaraderie, pride, collaboration, career growth, rewarding, motivating vs. manual labor, sacred cows, wasted effort, stale tech, ass-covering, fear, fiefdoms, excessive toil, command-and-control, cargo culting, enervating, discouraging, lethargy, indifference
  15. 15. Motivation
  16. 16. Sociotechnical Systems: Team, Tools, and Production
  17. 17. sociotechnical (n) “Technology is the sum of ways in which social groups construct the material objects of their civilizations. The things made are socially constructed just as much as technically constructed. The merging of these two things, construction and insight, is sociotechnology” — wikipedia if you change the tools people use, you can change how they behave and even who they are. 😱
  18. 18. sociotechnical (n) Values Practices Tools
  19. 19. sociotechnical (n): Your system is a snowflake. Hire growth-oriented people; build emotional safety; invest in good tooling, pay down tech debt; practice observability-driven development; instrument, observe, iterate, and repeat; construct and tweak virtuous feedback loops.
  20. 20. You can’t fix what you can’t measure. How well does YOUR team perform? https://services.google.com/fh/files/misc/state-of-devops-2019.pdf 4 🔥key🔥 metrics.
  21. 21. 1 — How frequently do you deploy? 2 — How long does it take for code to go live? 3 — How many of your deploys fail? 4 — How long does it take to recover from an outage? 5 — How often are you alerted after hours? Every team lead should watch these. Know where you stand.
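Those five questions are the four key metrics (plus on-call load) from the Accelerate/State of DevOps research. As a minimal sketch, assuming a team keeps simple deploy records (the field names and values below are hypothetical, not from the talk), the metrics could be computed roughly like this:

```python
# Minimal sketch: estimating the four key metrics from made-up deploy records.
from datetime import timedelta

deploys = [
    {"lead_time": timedelta(hours=2), "failed": False, "restore_time": None},
    {"lead_time": timedelta(hours=3), "failed": True,  "restore_time": timedelta(minutes=15)},
    {"lead_time": timedelta(hours=1), "failed": False, "restore_time": None},
]
days_observed = 7

deploy_frequency = len(deploys) / days_observed                      # deploys per day
median_lead_time = sorted(d["lead_time"] for d in deploys)[len(deploys) // 2]
change_fail_rate = sum(d["failed"] for d in deploys) / len(deploys)  # fraction of deploys that failed
restores = [d["restore_time"] for d in deploys if d["restore_time"]]
mean_time_to_restore = sum(restores, timedelta()) / len(restores)

print(deploy_frequency, median_lead_time, change_fail_rate, mean_time_to_restore)
```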
  22. 22. We waste a LOT of time. https://stripe.com/reports/developer-coefficient-2018 42%!!!
  23. 23. There is a wide gap between elite teams and the bottom 50%.
  24. 24. It really, really, really, really, really pays off to be on a high performing team.
  25. 25. Who is going to be the better engineer two years from now? 3000 deploys/year 9 outages/year 6 hours firefighting 5 deploys/year 65 outages/year firefighting: constant
  26. 26. Q: What happens when an engineer from the elite yellow bubble joins a team in the medium blue bubble? A: Your productivity tends to rise (or fall) to match that of the team you join.
  27. 27. “Build great teams by only hiring the best people” Great teams make great engineers. ❤
  28. 28. Every engineering org has a dual mandate. The work consists of cultivating sociotechnical feedback loops so engineers can own the full lifecycle of their code and that begins with observability. Happier customers, happier teams.
  29. 29. This brings us to tools (Technical). I don’t know exactly what you need, but I can guess :) Observability, Progressive Deployments, Continuous Delivery, SLOs, and your weakest points in the OMM
  30. 30. observability(n): “In control theory, observability is a measure of how well internal states of a system can be inferred from knowledge of its external outputs. The observability and controllability of a system are mathematical duals." — wikipedia
  31. 31. Can you understand what’s happening inside your systems, just by asking questions from the outside? Can you debug your code and any behavior using its output? Can you answer new questions without shipping new code? o11y for software:
  32. 32. technical nitpick: Yes Observability: • Arbitrarily-wide structured data blobs (“canonical logs”) • Visualize by time (“trace”) • Gathers up full context in a single blob per req per service • Propagates unique request IDs and span IDs Not Observability: • Unstructured, ad hoc logs • Metrics-based tools (Prometheus, DataDog, SignalFX, WaveFront, etc) • Anything that requires indexes or schemas
  33. 33. o11y technical dependencies (long): https://www.honeycomb.io/blog/so-you-want-to-build-an-observability-tool/ • High cardinality • High dimensionality • Exploratory, open-ended investigation based on raw events • Service Level Objectives. No preaggregation. • Based on arbitrarily-wide structured events with span support • No indexes, schemas, or predefined structure • Bundling the full context of the request across network hops • Metrics != observability. Unstructured logs != observability.
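To make "arbitrarily-wide structured events" concrete, here is a minimal sketch of emitting one canonical log line per request with a propagated trace ID. It is an illustration under stated assumptions, not Honeycomb's SDK; the field names, the X-Trace-Id header, and the do_work handler are all hypothetical:

```python
# Minimal sketch: one wide, structured event per request per service.
import json, time, uuid

def handle_request(request, upstream_headers):
    event = {
        # propagate the caller's trace ID, or start a new trace (hypothetical header name)
        "trace.trace_id": upstream_headers.get("X-Trace-Id") or str(uuid.uuid4()),
        "trace.span_id": str(uuid.uuid4()),
        "service": "photo-api",
        "request.path": request["path"],
        "user.id": request["user_id"],   # high-cardinality fields are welcome
        "build.id": "abc123",
    }
    start = time.monotonic()
    try:
        event["response.status"] = do_work(request)   # hypothetical handler
    except Exception as exc:
        event["error"] = repr(exc)
        event["response.status"] = 500
        raise
    finally:
        event["duration_ms"] = (time.monotonic() - start) * 1000
        # one blob per request, shipped to whatever observability backend you use
        print(json.dumps(event))
```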
  34. 34. "I don't have time to invest in observability right now. Maybe later” You can't afford not to.
  35. 35. 1. Resiliency to failure 2. High-quality code 3. Manage complexity and technical debt 4. Predictable releases 5. Understand user behavior https://www.honeycomb.io/wp-content/uploads/2019/06/Framework-for-an-Observability-Maturity-Model.pdf Observability Maturity Model … find your weakest category, and tackle that first
  36. 36. Observability is the key to making the leap from known-unknowns to unknown-unknowns, and software lifecycle ownership. (known-unknowns → monitoring; unknown-unknowns → observability)
  37. 37. • Ephemeral and dynamic • Far-flung and loosely coupled • Partitioned, sharded • Distributed and replicated • Containers, schedulers • Service registries • Polyglot persistence strategies • Autoscaled, multiple failover • Emergent behaviors • ... etc Complexity🔥 Why now?
  38. 38. Without observability, your team will resort to guessing and iterating blindly, without evidence, and you will struggle to connect or debug feedback loops. Observability enables you to inspect cause and effect at a granular level. Observability enables engineers to have ownership over the lifecycle of their software. Example? Sure!
  39. 39. The app tier capacity is exceeded. Maybe we rolled out a build with a perf regression, or maybe some app instances are down. DB queries are slower than normal. Maybe we deployed a bad new query, or there is lock contention. Errors or latency are high. We will look at several dashboards that reflect common root causes, and one of them will show us why. “Photos are loading slowly for some people. Why?” (LAMP stack) we should monitor these things!
  40. 40. “Photos are loading slowly for some people. Why?” (microservices) Any microservice running on c2.4xlarge instances and PIOPS storage in us-east-1b has a 1/20 chance of running on degraded hardware, and will take 20x longer to complete for requests that hit the disk with a blocking call. This disproportionately impacts people looking at older archives due to our fanout model. Canadian users who are using the French language pack on the iPad running iOS 9 are hitting a firmware condition which makes it fail saving to local cache … which is why it FEELS like photos are loading slowly. Our newest SDK makes db queries sequentially if the developer has enabled an optional feature flag. Working as intended; the reporters all had debug mode enabled. But the flag should be renamed for clarity's sake. eerrrr … wtf do I monitor for?
  41. 41. More Problems "I have twenty microservices and a sharded db and three other data stores across three regions, and everything seems to be getting a little bit slower over the past two weeks but nothing has changed that we know of, and oddly, latency is usually back to the historical norm on Tuesdays.” “All twenty app micro services have 10% of available nodes enter a simultaneous crash loop cycle, about five times a day, at unpredictable intervals. They have nothing in common afaik and it doesn’t seem to impact the stateful services. It clears up before we can debug it, every time.” “Our users can compose their own queries that we execute server-side, and we don’t surface it to them when they are accidentally doing full table scans or even multiple full table scans, so they blame us when they get timeouts.” (microservices)
  42. 42. Still More Problems “Several users in Romania and Eastern Europe are complaining that all push notifications are down for them … for days. But that’s impossible, because pushes are in a single shared queue, and I’m receiving pushes right now!” “Disney is complaining that once in a while, but not always, they don’t see the photo they expected to see — they see someone else’s photo! When they refresh, it’s fixed.” “Sometimes a bot takes off, or an app is featured on the iTunes store, and it takes us a long long time to track down which app or user is generating disproportionate pressure on shared components of our system (esp databases). Everything gets slower at once, and we can’t tell which of the slow apps caused it (because EVERYTHING’s slow).” “We run a platform, and it’s hard to programmatically distinguish between problems that users are inflicting on themselves and problems in our own code, since they all manifest as the same error codes or timeouts.” (microservices)
  43. 43. You have an observable system when your team can quickly and reliably diagnose any new behavior with no prior knowledge. observability begins with rich instrumentation, putting you in constant conversation with your code well-understood systems require minimal time spent firefighting
  44. 44. kick-start the virtuous cycle of you build it, you own it: instrument two steps in front of you as you build; never accept a PR unless you’ll know when it breaks; watch your code go out. Muscle memory. Working as intended? Anything else look weird? Thru the lens of your instrumentation. O.D.D.: that’s observability-driven development
  45. 45. on call: where shit gets real. on call is for everyone who writes code. on call must not be terrible. (software ownership is the only way to make it better.) democratize production. production is for everyone; build guard rails. encourage curiosity, emphasize ownership. don't punish. get up to your elbows in prod EVERY DAY Tools
  46. 46. invest in your deploys, instrument everything fulfill the promise of Continuous Delivery (don’t be scared by regulations) feature flags, feature flags, feature flags. progressive deploy $ALLTHETHINGS ✨Production First✨ test in prod let staging take the leftovers SLOs!
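As an illustration of "feature flags, feature flags, feature flags" and progressive deploys, here is a minimal sketch (not any particular flag vendor's API) that buckets users deterministically so a new code path can be ramped up by percentage, or switched off in seconds:

```python
# Minimal sketch: decouple deploy from release with a percentage-based feature flag.
import hashlib

FLAGS = {"new-photo-loader": 5}   # rollout percentage, e.g. 5% of users

def flag_enabled(flag_name, user_id):
    """Deterministically bucket users so the same user always gets the same variant."""
    pct = FLAGS.get(flag_name, 0)
    bucket = int(hashlib.sha256(f"{flag_name}:{user_id}".encode()).hexdigest(), 16) % 100
    return bucket < pct

def load_photos(user_id):
    if flag_enabled("new-photo-loader", user_id):
        return new_loader(user_id)    # hypothetical new path; set pct to 0 to "roll back" instantly
    return old_loader(user_id)        # hypothetical existing path
```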
  47. 47. insidious loop: engineer merges diff. Hours pass, multiple other engineers merge too. Someone triggers a deploy with a few days' worth of merges. The deploy fails, takes down the site, and pages on call, who manually rolls back, then begins git bisecting. This eats up her day and multiple other engineers'. Everybody bitches about how on call sucks. Time elapsed: several engineer-days 🥺💀
  48. 48. virtuous loop: engineer merges diff, which kicks off an automatic CI/CD run and deploy. Deploy fails; notifies the engineer who merged and reverts to safety. She knows exactly what she just did, swiftly fixes it, then adds tests & instrumentation to better detect it, and promptly commits a fix. Total time elapsed: 10 engineering minutes
  49. 49. It really, really, really, really, really pays off to be on a high performing team. Like REALLY.
  50. 50. In order to spend more of your time on productive activities, instrument, observe, and iterate on the tools and processes that gather, validate and ship your collective output as a team. Join teams that honor and value this work and are committed to consistently improving internal tooling — not just shipping features. Speed is safety. Get code into production within minutes as the best preventive medicine.
  51. 51. Build or buy? Focus relentlessly on your core business differentiators. Engineering cycles are the scarcest currency in your universe. Build reluctantly. All code is legacy. Kill your darlings. Senior engineers: you must amplify hidden costs
  52. 52. Promote people for successfully offloading and outsourcing. Whatever you praise and promote people for, you will see more of. Practice vendor engineering. Good vendors act like they’re on your team. Build libraries, modules, shared interfaces, examples, docs, relationships; drive consistent internal use cases. Tools
  53. 53. sociotechnical (n): Your system is a snowflake: a unique puzzle, unlike any other. High-performing teams both contribute to and are a consequence of a well-running sociotechnical system.
  54. 54. for extra fun … let’s examine the sociotechnical implications of the predominant architecture models of the past two decades: monoliths and microservices
  55. 55. Monolith sociotechnical causes & effects. “Technical” causes: • THE database • THE application • Known-unknowns and mostly predictable failures • Many monitoring checks and noisy paging alerts • "Flip a switch" to deploy, changes are big bang and binary (all on/all off). “Socio” effects: • Failures to be prevented • Production is to be feared • Debug by intuition and scar tissue of past outages, page bombs • Canned dashboards, runbooks, playbooks • Deploys are scary • Masochistic on-call culture
  56. 56. Monolith • We built our systems like glass castles — a fragile, forbidding edifice that we could tightly control access to. • Very hostile to exploration or experimentation
  57. 57. Microservices sociotechnical causes & effects. “Technical” causes: • Many storage systems, many services, many polyglot technologies • Unknown-unknowns dominate • Every alert is a novel question • Rich, flexible instrumentation • Few paging alerts, tied to SLOs and keying off user pain. “Socio” effects: • Failures are your friend • A deploy is just the beginning of gaining confidence in your code • Debug methodically by examining the evidence and following the crumbs • Deploys are opportunities • On-call must be sustainable, humane
  58. 58. • Software ownership -- you build it, you run it • Robust, resilient, built for experimentation and testing in prod • Human scale, with guard rails for safety Microservices
  59. 59. where are we going?
  60. 60. Here's the dirty little secret. The next generation of systems won't be built and run by burned out, exhausted people, or command-and-control teams just following orders. It can't be done. They've become too complicated. Too hard.
  61. 61. You can no longer model these systems in your head and leap to the solution -- you will be readily outcompeted by teams with better tools or sharper focus. Our systems are emergent and unpredictable. Runbooks and canned playbooks be damned; we need your full creative self.
  62. 62. Your labor is a scarce and precious resource. Lend it to those who are worthy of it. You only get one career; seek out high-performing teams to spend more time learning and building, not mired in tech debt and shitty, wasteful processes
  63. 63. we have an opportunity here to make things better let's do it <3
  64. 64. Charity Majors @mipsytipsy
  65. 65. Liz Fong-Jones Principal Developer Advocate, Honeycomb @lizthegrey November 16, 2020 Detangling complex systems with compassion & production excellence 1 w/ illustrations by @emilywithcurls!
  66. 66. @lizthegrey 2 We write code to solve problems.
  67. 67. @lizthegrey But our job isn't done upon commit. 3
  68. 68. @lizthegrey Production is increasingly complex. 4
  69. 69. @lizthegrey Especially with microservices. 5
  70. 70. @lizthegrey And with big data. 6
  71. 71. @lizthegrey We're adding complexity all the time. 7
  72. 72. @lizthegrey but can't understand or tame it. 8
  73. 73. @lizthegrey What does uptime mean? 9
  74. 74. @lizthegrey Is it measured in servers? 10
  75. 75. @lizthegrey Is it measured in complaints? 11
  76. 76. @lizthegrey How about juggling everything else? 12
  77. 77. @lizthegrey We're adding even more complexity. 13
  78. 78. @lizthegrey Our heroes are exhausted. 14
  79. 79. @lizthegrey Our strategies need to evolve. 15
  80. 80. @lizthegrey Don't "buy" DevOps. 16
  81. 81. @lizthegrey When we order the alphabet soup... 17
  82. 82. @lizthegrey Noisy alerts. Grumpy engineers. 18
  83. 83. @lizthegrey Walls of meaningless dashboards. 19
  84. 84. @lizthegrey Incidents take forever to fix. 20
  85. 85. @lizthegrey Everyone bugs the "expert". 21
  86. 86. @lizthegrey Deploys are unpredictable. 22
  87. 87. @lizthegrey There's no time to do projects... 23
  88. 88. @lizthegrey and when there's time, there's no plan. 24
  89. 89. @lizthegrey The team is struggling to hold on. 25
  90. 90. @lizthegrey What are we missing? 26
  91. 91. @lizthegrey We forgot who operates systems. 27
  92. 92. @lizthegrey Tools aren't magical. 28
  93. 93. @lizthegrey Invest in people, culture, & process. 29
  94. 94. @lizthegrey Enter the art of Production Excellence. 30
  95. 95. @lizthegrey Make systems more reliable & friendly. 31
  96. 96. @lizthegrey ProdEx takes planning. 32
  97. 97. @lizthegrey Measure and act on what matters. 33
  98. 98. @lizthegrey Involve everyone. 34
  99. 99. @lizthegrey Build everyone's confidence. Encourage asking questions. 35
  100. 100. @lizthegrey How do we get started? 36
  101. 101. @lizthegrey Know when it's too broken. 37
  102. 102. @lizthegrey & be able to debug, together when it is. 38
  103. 103. @lizthegrey Eliminate (unnecessary) complexity. 39
  104. 104. @lizthegrey Our systems are always failing. 40
  105. 105. @lizthegrey What if we measure too broken? 41
  106. 106. @lizthegrey We need Service Level Indicators 42
  107. 107. @lizthegrey SLIs and SLOs are common language. 43
  108. 108. @lizthegrey Think in terms of events in context. 44
  109. 109. @lizthegrey 45
  110. 110. @lizthegrey Is this event good or bad? 46
  111. 111. @lizthegrey Are users grumpy? Ask your PM. 47
  112. 112. @lizthegrey Or do chaos engineering experiments. 48
  113. 113. @lizthegrey What threshold buckets events? 49
  114. 114. @lizthegrey HTTP Code 200? Latency < 100ms? 50
  115. 115. @lizthegrey 51
  116. 116. @lizthegrey How many eligible events did we see? 52
  117. 117. @lizthegrey 53
  118. 118. @lizthegrey Availability: Good / Eligible Events 54
  119. 119. @lizthegrey Set a target Service Level Objective. 55
  120. 120. @lizthegrey Use a window and target percentage. 56
  121. 121. @lizthegrey 99.9% of events good in past 30 days. 57
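Putting the last few slides together, here is a minimal sketch of the SLI/SLO arithmetic, using the example predicate from slide 50 (HTTP 200 and latency < 100 ms); the events and counts below are made up:

```python
# Minimal sketch: classify eligible events as good/bad, then check a 30-day, 99.9% SLO.
events = [
    {"status": 200, "latency_ms": 42},
    {"status": 200, "latency_ms": 180},
    {"status": 500, "latency_ms": 30},
]

def is_good(event):
    # SLI predicate from the slides: HTTP 200 and latency under 100 ms
    return event["status"] == 200 and event["latency_ms"] < 100

eligible = len(events)
good = sum(is_good(e) for e in events)
availability = good / eligible                 # Availability: Good / Eligible Events

SLO_TARGET = 0.999                             # 99.9% of events good over a 30-day window
print(f"availability={availability:.3%}, meeting SLO: {availability >= SLO_TARGET}")
```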
  122. 122. @lizthegrey 58
  123. 123. @lizthegrey A good SLO barely keeps users happy. 59
  124. 124. @lizthegrey Drive alerting with SLOs. 60
  125. 125. @lizthegrey Error budget: allowed unavailability 61
  126. 126. @lizthegrey How long until I run out? 62
  127. 127. @lizthegrey Page if it's hours. 63 Ticket if it's days.
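The "page if it's hours, ticket if it's days" rule is just error-budget burn math. A minimal sketch with illustrative numbers:

```python
# Minimal sketch: how long until the error budget runs out at the current burn rate?
WINDOW_DAYS = 30
SLO_TARGET = 0.999
MONTHLY_EVENTS = 10_000_000

error_budget = (1 - SLO_TARGET) * MONTHLY_EVENTS   # bad events allowed per window (10,000 here)
bad_events_so_far = 6_000
current_bad_per_hour = 500

remaining = error_budget - bad_events_so_far
hours_to_exhaustion = remaining / current_bad_per_hour

if hours_to_exhaustion < 24:
    print(f"page: budget gone in {hours_to_exhaustion:.1f} hours")
elif hours_to_exhaustion < WINDOW_DAYS * 24:
    print(f"ticket: budget gone in {hours_to_exhaustion / 24:.1f} days")
else:
    print("within budget; no alert")
```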
  128. 128. @lizthegrey 64 Uh oh, better wake someone up.
  129. 129. @lizthegrey Data-driven business decisions. 65
  130. 130. @lizthegrey Is it safe to do this risky experiment? 66
  131. 131. @lizthegrey 67
  132. 132. @lizthegrey Should we invest in more reliability? 68
  133. 133. @lizthegrey Perfect SLO > Good SLO >>> No SLO 69
  134. 134. @lizthegrey Measure what you can today. 70
  135. 135. @lizthegrey Iterate to meet user needs. 71
  136. 136. @lizthegrey Only alert on what matters. 72
  137. 137. @lizthegrey SLIs & SLOs are only half the picture... 73
  138. 138. @lizthegrey Our outages are never identical. 74
  139. 139. @lizthegrey Failure modes can't be predicted. 75
  140. 140. @lizthegrey Support debugging novel cases. In production. 76
  141. 141. @lizthegrey Avoid siloed data. 77
  142. 142. @lizthegrey Allow forming & testing hypotheses. 78
  143. 143. @lizthegrey 79
  144. 144. @lizthegrey Dive into data to ask new questions. 80
  145. 145. @lizthegrey 81
  146. 146. @lizthegrey Our services must be observable. 82
  147. 147. @lizthegrey Can you examine events in context? 83
  148. 148. @lizthegrey 84
  149. 149. @lizthegrey Can you explain the variance? 85
  150. 150. @lizthegrey using relevant dimensions? 86
  151. 151. @lizthegrey Can you mitigate impact & debug later? 87
  152. 152. @lizthegrey Observability goes beyond break/fix. 88 (Operational resilience, managed tech debt, quality code, predictable release, user insight)
  153. 153. @lizthegrey Observability isn't just the data. 89 (Data, instrument, query)
  154. 154. @lizthegrey SLOs and Observability go together. 90
  155. 155. @lizthegrey But they alone don't create collaboration. 91
  156. 156. @lizthegrey Heroism isn't sustainable. 92
  157. 157. @lizthegrey Debugging is not a solo activity. 93
  158. 158. @lizthegrey Debugging is for everyone. 94
  159. 159. @lizthegrey Train teams together. 95
  160. 160. @lizthegrey Collaboration is interpersonal. 96
  161. 161. @lizthegrey 97
  162. 162. @lizthegrey Lean on your team. 98
  163. 163. @lizthegrey We learn better when we document. 99
  164. 164. @lizthegrey Fix hero culture. Share knowledge. 100
  165. 165. @lizthegrey Use the same platforms & tools. 101 (e.g. )
  166. 166. @lizthegrey Reward curiosity and teamwork. 102
  167. 167. @lizthegrey Learn from the past. Reward your future self. 103
  168. 168. @lizthegrey 104
  169. 169. @lizthegrey Outages don't repeat, but they rhyme. 105
  170. 170. @lizthegrey Risk analysis helps us plan. 106
  171. 171. @lizthegrey Quantify risks by frequency & impact. 107
  172. 172. @lizthegrey Which risks are most significant? 108
  173. 173. @lizthegrey Address risks that threaten the SLO. 109
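A minimal sketch of that risk analysis: quantify each risk as frequency × impact in expected bad minutes per year, and flag the ones that by themselves would exceed the SLO's annual error budget (the risks and numbers are invented for illustration):

```python
# Minimal sketch: rank risks by expected cost and compare against the annual error budget.
SLO_TARGET = 0.999
BUDGET_MINUTES_PER_YEAR = (1 - SLO_TARGET) * 365 * 24 * 60   # ~525 bad minutes/year at 99.9%

risks = [
    {"name": "bad deploy",        "per_year": 20,  "minutes_down": 45,  "pct_users": 1.0},
    {"name": "degraded hardware", "per_year": 50,  "minutes_down": 20,  "pct_users": 0.05},
    {"name": "region outage",     "per_year": 0.3, "minutes_down": 240, "pct_users": 1.0},
]

for r in risks:
    expected = r["per_year"] * r["minutes_down"] * r["pct_users"]   # expected bad minutes/year
    threatens_slo = expected > BUDGET_MINUTES_PER_YEAR
    print(f"{r['name']}: {expected:.0f} bad min/yr, threatens SLO: {threatens_slo}")
```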
  174. 174. @lizthegrey Make the business case to fix them. 110
  175. 175. @lizthegrey And prioritize completing the work. 111
  176. 176. @lizthegrey Don't waste time chrome polishing. 112
  177. 177. @lizthegrey Lack of observability is systemic risk. 113
  178. 178. @lizthegrey So is lack of collaboration. 114
  179. 179. @lizthegrey 115 Success doesn't demand heroism.
  180. 180. @lizthegrey A dozen engineers build Honeycomb. 116
  181. 181. @lizthegrey We make systems humane to run, 117
  182. 182. @lizthegrey by ingesting telemetry, 118
  183. 183. @lizthegrey enabling data exploration, 119
  184. 184. @lizthegrey and empowering engineers. 120
  185. 185. @lizthegrey Yes, we deploy on Fridays. 121
  186. 186. @lizthegrey
  187. 187. @lizthegrey What's our recipe?
  188. 188. @lizthegrey Observability-driven Development.
  189. 189. @lizthegrey High productivity product engineering: start with lead time (<3 hours); deploy frequency goes up (hourly, >12x/day); change fail rate goes down (<0.1%); time to restore goes down (seconds to minutes).
  190. 190. @lizthegrey start with lead time. 10 min builds (x3 at worst), 1h for peer review, hourly push train = 3 hours to deploy a change.
  191. 191. @lizthegrey start with lead time. deploy frequency goes up. Builds go out every hour if there's a change. 1-2 new commits per build artifact.
  192. 192. @lizthegrey start with lead time. deploy frequency goes up. change fail rate goes down. Increased confidence via testing. Flag-flip or fix-forward, not emergency rollback. 0.1% fail rate.
  193. 193. @lizthegrey start with lead time. deploy frequency goes up. change fail rate goes down. time to restore goes down. Flag flip takes 30 seconds. Rollback to previous build takes <10 min. Fix-forward takes 20 min.
  194. 194. @lizthegrey Instrument as we code.
  195. 195. @lizthegrey Functional and visual testing.
  196. 196. @lizthegrey Design for feature flag deployment.
  197. 197. @lizthegrey Automated integration.
  198. 198. @lizthegrey Honeycomb’s trace of Honeycomb build & deploy
  199. 199. @lizthegrey Human PR review.
  200. 200. @lizthegrey Automated integration.
  201. 201. @lizthegrey Green button merge.
  202. 202. @lizthegrey Auto-updates, rollbacks, & pins.
  203. 203. @lizthegrey Observe behavior in prod.
  204. 204. @lizthegrey Prod: customers observe data.
  205. 205. @lizthegrey Dogfood observes prod.
  206. 206. @lizthegrey [add observe in prod] image: Adoption curve on SLOs
  207. 207. @lizthegrey Kibble observes dogfood.
  208. 208. @lizthegrey That's how 12 eng deploy 12x/day!
  209. 209. @lizthegrey Buy the right tools... 145
  210. 210. @lizthegrey but season your soup with ProdEx. 146
  211. 211. @lizthegrey Measure. Debug. Collaborate. Fix. 147 lizthegrey.com; @lizthegrey /liz Production Excellence & bring teams closer together.
  212. 212. Join www.ServerlessToronto.org Home of “Less IT Mess”
