The Service ScoreCard - Gamifying Operational Excellence - SRECON

What makes a “good” service is a moving target. Technologies and requirements change over time. It can be impossible to ensure that none of your services have been left behind.

The Service ScoreCard approach is to have a small check for each service initiative we have. A check can cover anything measurable: deployment frequency, whether everyone on the on-call team has a phone, or whether the service runs the latest version of the JVM.

The Service ScoreCard gives each service a grade from 'F' to 'A+', based on which checks in the list it passes or fails. As soon as anyone sees a service's grade slipping, everyone rallies to improve it.

We can then set up rules based on the grades: “Only B and above services can deploy 24/7”, “moratorium on services without an A+” or “No SRE support for services below a C grade”.

Published in: Technology

  1. 1. Gamifying Operational Excellence: The Service ScoreCard
  2. 2. Agenda: 1 The Problem; 2 A Solution idea; 3 A Solution tour; 4 The results; 5 Take aways & lessons Learnt & Questions
  3. 3. About me: Danny ☃ Lawrence. “If it's not broken, I’ll fix it.” From Australia, on loan as Staff SRE @ LinkedIn (jobs, companies, recruiter) & Finder of encoding bugs
  8. 8. Good news SRECON. You passed the ☃ test.
  9. 9. Some terms (before we really get started)
  10. 10. Operational Excellence: effective and efficient delivery of information, technology, and services required by end users that add measurable value.
  11. 11. Operational Excellence: doing everything required to make sure all of your services are as fast and as reliable as possible.
  12. 12. Gamification: application of game-design elements and game principles in non-game contexts.
  13. 13. Some background (LinkedIn SRE crash course)
  14. 14. LinkedIn SRE crash course: mostly Java; multitudes of services; doing lots of things; service-oriented architecture; everything talks to everything; my direct team looks after 80+ services; we have 200+ SREs.
  15. 15. The Problem (What started this whole thing)
  16. 16. Problem 1: The GOOD & The BAD
  17. 17. BAD services wake me up
  18. 18. GOOD services let me sleep
  19. 19. What makes a GOOD service at LinkedIn is a moving target.
  20. 20. Technologies and dependencies change over time.
  21. 21. Some examples: upgrading dependencies & libraries; Java / Jetty / Play / Tomcat; correct usage of TLS; switching databases / caches; migrating from SVN to Git; reducing application startup time; setting up error budgeting; truing up the number of metrics.
  22. 22. A GOOD service can turn into a BAD service if you are not checking it.
  23. 23. Unfortunately, BAD services do not magically turn into GOOD services.
  24. 24. Problem 2: Knowing what is BAD
  25. 25. Problem 3: Knowing why it’s BAD
  26. 26. Problem 4: Tribal knowledge about how to get to GOOD
  27. 27. The only thing SREs hate more than not having documentation is writing documentation.
  28. 28. The Problem summary
  29. 29. BAD services wake me up; time will cause GOOD to turn BAD; hard to know what is BAD; hard to know why it is BAD; not sure how to fix the BAD.
  30. 30. The Service ScoreCard (A solution)
  31. 31. In order to determine the health of the services we support, we define a list of production requirements.
  32. 32. Apply a weight to each requirement.
  33. 33. Codify each requirement into a check.
  34. 34. Execute these checks for each service.
  35. 35. Tally up the results for each service.
  36. 36. Grade the service from “F” to “A+”.
  37. 37. Add all the services into a highscore system.
  38. 38. Then
  39. 39. Publish those scores to the company.
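
The steps on slides 31 to 39 (weight each requirement, run the checks, tally, grade, rank) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the talk does not say how scores map to letters, so the `GRADE_BOUNDARIES` table, the `CheckResult` type and the function names below are hypothetical, and checks whose state is `None` are assumed to be left out of the tally.

```python
from dataclasses import dataclass
from typing import Union

State = Union[bool, float, None]


@dataclass
class CheckResult:
    name: str
    weight: int
    state: State  # True, False, None (could not run) or a float from 0.0 to 1.0


# Hypothetical boundaries: (minimum percentage, letter). Not taken from the talk.
GRADE_BOUNDARIES = [(97, "A+"), (90, "A"), (80, "B"), (70, "C"), (60, "D")]


def service_score(results):
    """Weighted percentage of passed checks; results with state None are skipped."""
    earned = 0.0
    possible = 0
    for result in results:
        if result.state is None:
            continue
        possible += result.weight
        earned += result.weight * float(result.state)
    return 100.0 * earned / possible if possible else 0.0


def grade(score):
    """Map a percentage onto a letter using the assumed boundaries."""
    for minimum, letter in GRADE_BOUNDARIES:
        if score >= minimum:
            return letter
    return "F"


def leaderboard(services):
    """services maps a service name to its list of CheckResult objects."""
    rows = [(name, service_score(results)) for name, results in services.items()]
    rows.sort(key=lambda row: row[1], reverse=True)
    return [(name, round(score, 1), grade(score)) for name, score in rows]
```

Weighting means a failed heavy check (for example, ownership) costs more than a failed light one, which is one way to express which initiatives matter most.
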
  40. 40. This is great, but how do I improve the score? How can I add check X into the system?
  41. 41. What makes a check?
  42. 42. Checks are one type of plugin: fetch plugins gather data; check plugins check the data.
  43. 43. We use the fetch plugins to gather remote data from: SVN, Git, configuration DBs, host databases, monitoring systems, build systems, deployment systems.
  44. 44. Basically, if we can fetch it, then we do so.
  45. 45. We build a giant context object.
  46. 46. The check plugin will look at our context object.
  47. 47. All plugins are small Python scripts, where small is 10~30 LOC.
  48. 48. Simply return 2 or 3 things. state*: True, False, None or 0.0 - 1.0; message*: short string; data: Python dict of interesting things.
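
How the fetch and check plugins described on slides 42 to 48 might fit together can be sketched as a tiny runner. Everything here is an assumption made for illustration, not the tool's actual code: the registry shape (a dict of context key to fetch function), the `normalise` helper, and the choice to report `None` when a check touches data that was never fetched.

```python
from types import SimpleNamespace


def normalise(result):
    """Pad a (state, message) return value to (state, message, data)."""
    if len(result) == 2:
        state, message = result
        return state, message, {}
    state, message, data = result
    return state, message, data or {}


def run_plugins(service_name, fetch_plugins, check_plugins):
    """Build the context from the fetch plugins, then run every check against it.

    fetch_plugins: dict mapping a context key to a fetch function (assumed shape).
    check_plugins: list of check functions that take the context.
    """
    ctx = SimpleNamespace(service_name=service_name)
    for key, fetch in fetch_plugins.items():
        state, _message, data = normalise(fetch(service_name))
        # A failed fetch leaves the key present but empty.
        setattr(ctx, key, SimpleNamespace(**data) if state else None)

    report = {}
    for check in check_plugins:
        try:
            state, message, _data = normalise(check(ctx))
        except AttributeError:
            # The check needed data that no fetch plugin managed to gather.
            state, message = None, "required data was not fetched"
        report[check.__name__] = (state, message)
    return report
```

Keeping fetching separate from checking means each check stays a pure function of the context, which is what keeps plugins in the 10 to 30 line range.
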
  49. 49. Example fetch plugin
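  50. 50.

          import requests as r  # assumption: "r" on the slide is the requests library

          @ssc.tags("ownership")
          def fetch_ownership(service_name):
              "Fetch all the ownership data of a service"
              # Ask the owners service for this service's ownership record.
              o = r.get("http://owners/" + service_name)
              # Return state, message and the data to merge into the context.
              return True, "gathered data", o.json()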
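
The slide's fetch plugin always returns `True`. A variant that reports `None` when the owners service cannot be reached might look like the sketch below. It is an assumption-heavy illustration: it uses the standard library instead of the slide's `r.get`, leaves out the `@ssc.tags` decorator so that it runs on its own, and the `get_json` parameter is a hypothetical hook added so the function can be exercised without a network.

```python
import json
from urllib.request import urlopen

OWNERS_URL = "http://owners/"  # host as shown on the slide


def http_get_json(url, timeout=5):
    """Fetch a URL and decode the JSON body."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


def fetch_ownership(service_name, get_json=http_get_json):
    "Fetch all the ownership data of a service"
    try:
        data = get_json(OWNERS_URL + service_name)
    except (OSError, ValueError) as exc:
        # Network errors are OSError subclasses; bad JSON raises ValueError.
        return None, "could not fetch ownership: %s" % exc
    return True, "gathered owner data", data
```

Returning `None` rather than `False` keeps "we could not find out" distinct from "the service failed", matching the state values listed on slide 48.
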
  55. 55. Example check plugin
  56. 56.

          @ssc.weight(5)
          @ssc.tags("ownership")
          @ssc.wiki("http://wiki/ssc_eng_owner")
          def check_eng_team(ctx):
              "ensure ENG ownership of a service"
              # Pass, reporting the team name, when an engineering team is recorded.
              if ctx.ownership.eng_team:
                  return True, ctx.ownership.eng_team
              # Otherwise fail with a short explanation.
              return False, "missing eng_team"
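
The talk does not show how `@ssc.weight`, `@ssc.tags` and `@ssc.wiki` are implemented. One plausible approach, sketched here purely as an assumption, is for each decorator to attach metadata to the function so the runner and the reports can read it later. The `ssc_*` attribute names and the `describe` helper are hypothetical.

```python
def weight(value):
    """Record how much the check counts towards the score."""
    def decorate(func):
        func.ssc_weight = value
        return func
    return decorate


def tags(*names):
    """Record which groups of fetched data the plugin relates to."""
    def decorate(func):
        func.ssc_tags = set(names)
        return func
    return decorate


def wiki(url):
    """Record the wiki page that explains how to fix a failing check."""
    def decorate(func):
        func.ssc_wiki = url
        return func
    return decorate


@weight(5)
@tags("ownership")
@wiki("http://wiki/ssc_eng_owner")
def check_eng_team(ctx):
    "ensure ENG ownership of a service"
    if ctx.ownership.eng_team:
        return True, ctx.ownership.eng_team
    return False, "missing eng_team"


def describe(check):
    """Collect what a report needs to know about a check."""
    return {
        "name": check.__name__,
        "doc": check.__doc__,
        "weight": getattr(check, "ssc_weight", 1),
        "tags": sorted(getattr(check, "ssc_tags", ())),
        "wiki": getattr(check, "ssc_wiki", None),
    }
```

Because the decorators return the original function unchanged, the check can still be called and tested directly.
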
  63. 63. Putting it all together
  64. 64. Problems: understanding what is BAD; knowing why it is BAD; not sure how to fix the BAD
  65. 65. Problems: understanding what is BAD
  80. 80. Problems: understanding what is BAD; knowing why it is BAD
  94. 94. Problems: understanding what is BAD; knowing why it is BAD; not sure how to fix the BAD
  100. 100. What is the check? Why is it important? How long will it take to fix? How will it be fixed?
  102. 102. AngularJS image: CC BY 4.0 https://angular.io/presskit.html (2017)
  103. 103. {{service_name}} becomes jobs-server
  105. 105. {{context.ownership.eng_owner}} becomes jobs-team
  106. 106. Using our fetched data in the wiki
  107. 107. {{service_name}}
  108. 108. {html} <script src="https://cdn/angularjs.js"></script> {html}
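  109. 109.

          // Read the service name from the wiki page's query string.
          var query = $location.search();
          var service_name = query['service_name'];
          var url = 'http://ssc/api/' + service_name;
          // Fetch the service's context and expose it to the page templates.
          $http.get(url).success(function(ctx) {
            $scope.ctx = ctx;
          });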
  115. 115. {{ctx.ownership.owner_eng}}
  116. 116. {{ctx.ownership.owner_eng}} {{ctx.number_of_hosts}} {{ctx.product.lib.jetty.version}} {{ctx.hosts.hostnames}} {{ctx.is_deployed_in_prod}} {{ctx.commits.last_commit}}
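
In the talk the `{{...}}` placeholders are filled in by AngularJS in the browser. The substitution itself, where `{{service_name}}` becomes `jobs-server`, can be illustrated in Python with a small sketch. The `render` and `lookup` helpers are hypothetical, and leaving unknown placeholders untouched is an assumed design choice rather than AngularJS behaviour.

```python
import re

# Matches {{ dotted.path }} with optional surrounding spaces.
PLACEHOLDER = re.compile(r"\{\{\s*([A-Za-z_][\w.]*)\s*\}\}")


def lookup(data, dotted_path):
    """Follow a dotted path such as ctx.ownership.owner_eng through nested dicts."""
    value = data
    for part in dotted_path.split("."):
        value = value[part]
    return value


def render(template, data):
    """Replace every known placeholder; leave unknown ones as they are."""
    def substitute(match):
        try:
            return str(lookup(data, match.group(1)))
        except (KeyError, TypeError):
            return match.group(0)
    return PLACEHOLDER.sub(substitute, template)
```

A single wiki page per check can then serve every service, with the fetched context supplying the service-specific details.
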
  117. 117. Problems: understanding what is BAD; knowing why it is BAD; not sure how to fix the BAD
  118. 118. Now: reports show what is BAD; checks validate why it is BAD; the wiki shows how to fix the BAD
  119. 119. No more of these emails: “If you use lib-core, then upgrade it, we found a bug”
  120. 120. How many of my 80 services use this lib? How do I check? How do I upgrade?
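
Once every service has a fetched context, the question "how many of my 80 services use this lib?" becomes a query over that data. The sketch below assumes the context nests library versions the way slide 116 suggests (`ctx.product.lib.<name>.version`) and that versions are plain dotted numbers. The function names are hypothetical.

```python
def parse_version(text):
    """Turn '1.10.1' into (1, 10, 1) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def services_needing_upgrade(contexts, lib, fixed_in):
    """Return (service, version) pairs for services running lib older than fixed_in.

    contexts: dict mapping a service name to its context as a nested dict
    (assumed shape: ctx['product']['lib'][lib]['version']).
    """
    stale = []
    for name, ctx in sorted(contexts.items()):
        info = ctx.get("product", {}).get("lib", {}).get(lib)
        if info is None:
            continue  # this service does not use the library
        if parse_version(info["version"]) < parse_version(fixed_in):
            stale.append((name, info["version"]))
    return stale
```

The same test could be packaged as a check plugin, so the affected services see their grade drop instead of receiving an email.
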
  124. 124. Where does this tool fit?
  125. 125. pre-commit, Build, Deployment, Monitoring
  126. 126. pre-commit, Build, Deployment, Monitoring, Service Scorecard
  127. 127. pre-commit, Build, Deployment, Monitoring, Service Scorecard API
  128. 128. Service Scorecard API, hack-days, Reporting, Deployment, Monitoring
  129. 129. Results & Outcomes
  130. 130. What do we do with the scores?
  131. 131. Priority #1: Getting the grades better
  132. 132. Results, when we started versus now:

          Metric                          When we started   Now
          Average grade for my team       40%               80%
          Average score across SRE        35%               60%
          Checks in 24 hours              15,560            89,859
          Number of checks per service    15                31
  133. 133. We can now explore new ways to use the scores.
  134. 134. Carrot & Stick
  135. 135. Carrot / GOOD, Stick / BAD
  136. 136. No SRE support for F Grade services.
  137. 137. F Grade services generally cause the most problems.
  138. 138. No deploy moratorium for A+ services
  139. 139. A+ services generally cause the least problems.
  140. 140. A services are allowed to deploy 24/7
  141. 141. Premium SRE support for A+ services
  142. 142. Priority build queues for GOOD Services.
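
Rules like the ones on slides 136 to 141 could be expressed as a small policy table that deployment or support tooling consults through the Service Scorecard API. This is a hypothetical sketch: the intermediate letters in `GRADE_ORDER` are assumed (the talk only names F, C, B, A and A+), and the `policy_for` function and its key names are invented for illustration.

```python
# Lowest to highest. "D" is an assumption; the talk only says grades run from F to A+.
GRADE_ORDER = ["F", "D", "C", "B", "A", "A+"]


def at_least(grade, minimum):
    """True when grade is the same as or better than minimum."""
    return GRADE_ORDER.index(grade) >= GRADE_ORDER.index(minimum)


def policy_for(grade):
    """Translate a grade into the carrots and sticks listed on the slides."""
    return {
        "sre_support": grade != "F",               # no SRE support for F grade services
        "premium_sre_support": grade == "A+",      # premium SRE support for A+ services
        "deploy_24x7": at_least(grade, "A"),       # A services are allowed to deploy 24/7
        "exempt_from_moratorium": grade == "A+",   # no deploy moratorium for A+ services
    }
```

Keeping the thresholds in one place makes it cheap to tighten or relax a rule, for instance moving the 24/7 deploy bar from A to B.
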
  143. 143. Tiger teams to raise the scores on F Grade services
  144. 144. Hack Days
  145. 145. FREE BEER
  146. 146. Basically any problem can be solved with FREE BEER
  147. 147. OR T-Shirts
  149. 149. Influence where we allocate open headcount
  150. 150. Simple way to get things done
  151. 151. Take aways & Lessons Learnt
  152. 152. Everyone cares about Reliability.
  153. 153. Everyone cares about Reliability. Everyone is a Site Reliability Engineer.
  154. 154. Everyone cares about Reliability. You just need to empower them.
  155. 155. Hack Days are important: this POC was built in an afternoon.
  156. 156. Getting the data was easy; finding interesting ways to use it is hard.
  157. 157. Make it as easy as possible to do the right thing.
  158. 158. Cheers !
  159. 159. Q & A
