
The Blameless Cloud: Bringing Actionable Retrospectives to Salesforce


DevOps Enterprise Summit 2015 presentation with Kevina Finn-Braun, Director of SRE Management at Salesforce. This is the story of my months-long journey with Kevina and her team to identify what made reliability retrospectives difficult to hold, why actionable takeaways were often lacking, and how the feedback loops within the company’s operations organization weren’t serving Salesforce’s needs.

We then ran a series of experiments together, putting the SRE team on a road to improving their ability to respond, react, remediate, and reincorporate learnings from failure into the organization.



  1. KEVINA FINN-BRAUN, SALESFORCE
     J. PAUL REED, RELEASE ENGINEERING APPROACHES
     DEVOPS ENTERPRISE SUMMIT, 2015
     THE BLAMELESS CLOUD: BRINGING ACTIONABLE RETROSPECTIVES TO SALESFORCE
  2. KEVINA FINN-BRAUN
     • Director of Site Reliability Service Management at Salesforce
     • Business Continuity at Yahoo
     • Geeks out on Group Dynamics and Behavior
     • @kfinnbraun
     • Prepping for the zombie apocalypse
     @kfinnbraun @jpaulreed #DOES15
  3. J. PAUL REED
     • @jpaulreed
     • Host of The Ship Show, @shipshowpodcast
     • Principal Consultant, Release Engineering Approaches
     • Spend my days talking to organizations about “The DevOps™”
  4. “SITE RELIABILITY” AT SALESFORCE
     • Primary operational team supporting availability
     • Acceptance and validation activities
     • Develop and implement operational improvements for SFDC
     • “Game days”
  5. SERVICE RELIABILITY HURDLES AT SFDC
     • Inconsistent application of process, leading to inconsistent information collection
     • Incident handling/remediation crossing silo boundaries
     • Confusion over service ownership, due to restructured responsibilities
     • Disjointed, “heavyweight” meetings
     • Postmortems centered around “The Old View” of human error
  6. LANGUAGE OF THE “OLD VIEW”
     • “5 whys”
     • “Root cause” analysis
     • “Why didn’t you[r team]…”
     • “You[r team] should have…”
     • “Best practices”
  7. [Image-only slide]
  8. THE TIMELINE
     • October 2014: First Meeting
     • January 2015: “Blow up” HA Forum
     • April 2015: Status Check, including assessment shared with senior leaders
     • May 2015: Service ownership roles shift
  9. THE TIMELINE
     • October 2014: First Meeting
     • January 2015: “Blow up” HA Forum
     • April 2015: Status Check, including assessment shared with senior leaders
     • May 2015: Service ownership roles shift
     • July 2015: Initial Workshop on “The New View”
     • August 2015: Identified first group for coaching
     • August 2015 to today: Continued focus and deep-dive on WSRR
     • August 2015 to today: Weekly sessions with the initial group
  10. ROOT CAUSE ANALYSIS WORKFLOW
      • Designed & implemented two years ago
      • Anchored the process around the weekly “HA Forum”
      • Intended to apply to all incidents…
      • In practice, focused on high profile incidents
      [Workflow diagram; connections between steps are not recoverable from the transcript. Node labels:]
      • Incident, Event, Bug
      • Initial Analysis
      • RC Known?
      • Facilitator opens investigations and schedules post mortem meeting
      • Request RCA/Failure Analysis
      • RC Identified?
      • Identify corrective actions and implementation plans; assign actions to scrum teams
      • RCM Needed?
      • RCM Process
      • Unable to ascertain root cause; update record with “KE Status”
      • Engage scrum teams as required
      • HA Forum
      • Corrective Actions complete?
      • Weekly meetings to follow up with scrum master on progress
      • Review @HA?
      • Additional work items from HA are assigned
      • Update record and set status to “resolved”
      • Impact to Customer/Production or ability to release?
      • Tier 3 support communicate RCM to customer(s)
      • END
      [Diagram sidebar: “HA? Incident Guidelines”]
      • Severity 0, 1: YES
      • Severity 2: Maybe (instance & incident length?)
      • Functional Regression: Maybe
      • Incorrect/Incomplete Release: YES
      • Deployment Delayed or Rolled Back: Maybe
  11. [Root cause analysis workflow diagram from slide 10, repeated with the added annotation “from incident resolution.”]
  12. ROOT CAUSE ANALYSIS WORKFLOW IN REALITY
      [Root cause analysis workflow diagram from slide 10, repeated with the added annotation “from incident resolution.”]
      • Silo transition boundaries evident in the workflow
      • Some had little/no contact, via the process, with other teams required to perform their job
      • Sampling of incident reports uncovered consistent inconsistencies
      • The “Bermuda Blob”
  13. GETTING A FEEL FOR THE WEATHER
  14. [Image-only slide]
  15. HEAD FIRST INTO THE STORM
  16. LANGUAGE: MATTERS
      • “HA Forum” ➡ “WSRR”
      • “WAR” (What is it good for?)
      • Postmortem versus Retrospective
      • Problem Team versus Solution Team
      • Root Cause versus Proximate Cause
  17. BEHAVIOR: MATTERS
      • Intra-team behavior
      • Inter-team behavior
      • This is not “#NAFB”
      • “People in complex systems create safety. … The occasional human contribution to failure occurs because complex systems need an overwhelming human contribution for safety.” (Sidney Dekker)
  18. STRUCTURE: MATTERS
  19. STRUCTURE: MATTERS
  20. “BLAMELESS” “POSTMORTEMS”?
      • Brené Brown, research sociologist, on vulnerability
      • “Blame is a way to discharge pain and discomfort”
      • Postmortem has a heavy connotation
      • “Awesome postmortems?” Really?!
  21. LANGUAGE AND BEHAVIORS BY STAGE: Novice, Beginner, Competent, Proficient, Expert [empty grid]
  22. [Build of slide 26: Novice entries added]
  23. [Build of slide 26: Beginner entries added]
  24. [Build of slide 26: Competent entries added]
  25. [Build of slide 26: Proficient entries added]
  26. LANGUAGE AND BEHAVIORS BY STAGE [complete grid; entries are assigned to stages in the order the builds added them]
      Novice
      • Language: “Incidents are bad; my job is on the line”
      • Language: “I’m getting sent to the principal’s office because of this outage”
      • Behavior: Completes the post-incident “paperwork”
      • Behavior: No formal retrospective/hallway retrospectives
      Beginner
      • Language: “Let’s fix this as fast as possible”
      • Language: “What’s the correct fix to avoid this specific issue in the future?”
      • Behavior: Some information (inconsistently) recorded
      • Behavior: Jump to a focus on why
      Competent
      • Language: “Let’s review the timeline/incident report to answer that”
      • Language: “We need to find the root cause of this incident”
      • Behavior: Follows the prescribed format for retrospectives
      • Behavior: Have and incorporate complete dataset for the incident into the retrospective
      Proficient
      • Language: “Now that we’ve established what happened, how did it happen?”
      • Language: “How did these multiple factors influence our complex system?”
      • Behavior: Identifies inherent bias in self and others
      • Behavior: Perspectives solicited from all involved team members/functional groups
      Expert
      • Language: “How does our team/system contribute to our successes?”
      • Language: “What can we incorporate from this incident to better respond next time?”
      • Behavior: Able to facilitate retrospectives by healthily helping others address tendency to blame/personal & systemic bias
      • Behavior: Retrospective outcomes are fed back into the system and prioritized
  27. RETROSPECTIVES FACILITATE THE SERVICE (AND DEVELOPMENT!) IMPROVEMENT PROCESS
  28. BEING “TOO BUSY” TO LEARN OR IMPROVE MEANS YOU ARE IN A DOWNWARD SPIRAL, BY DEFINITION
  29. IT’S NOT ABOUT THE OUTCOME. IT’S ABOUT THE RESPONSE.
  30. WHY + HOW IS MORE IMPORTANT THAN WHAT
  31. YOU ARE NEVER DONE.
  32. YOU. ARE. NEVER. DONE.
  33. OUR FORECAST FOR THE FUTURE
      • Evolving the concept of Service Ownership
      • Salesforce-specific Retrospective Guides
      • Global “live-site” coaching
      • Refocus on getting the business what it wants
  34. AVENUES FOR COLLABORATION
      • How does the described Dreyfus model apply in other organizations?
      • Would love to hear stories from other enterprises about their retrospective process, who does them, and where they live within the organization
  35. Kevina Finn-Braun
      kevina.finnbraun@salesforce.com
      http://lnkdin.me/kevinafinnbraun
      J. Paul Reed
      preed@release-approaches.com
      http://jpaulreed.com
  36. PHOTO CREDITS
      • Slide 1: https://en.wikipedia.org/wiki/File:Golden_Fog,_San_Francisco.jpg
      • Slide 4: Courtesy Kevina Finn-Braun/Salesforce
      • Slide 6: https://www.flickr.com/photos/hannaneh/6464986121
      • Slide 7: https://www.youtube.com/watch?v=_DEToXsgrPc#t=1h5m50s
      • Slide 13: http://kathmajp.weebly.com/all-movie-reviews/movie-review-twister
      • Slide 14: http://thevane.gawker.com/heres-everything-they-got-wrong-and-right-in-the-movi-1609968202
      • Slide 15: https://www.flickr.com/photos/ravedelay/17761863929
  37. PHOTO CREDITS
      • Slide 16: Screenshot of aviationweather.gov
      • Slide 17: https://www.flickr.com/photos/ravedelay/17534032771/
      • Slide 18: https://www.youtube.com/watch?v=8veT5QspylE#t=15m30s
      • Slide 19: https://www.flickr.com/photos/jkirkhart35/4984385396
      • Slide 20: https://www.youtube.com/watch?v=iCvmsMzlF7o
      • Slide 33: https://commons.wikimedia.org/wiki/File:Rainbow_background.jpg
      • Slide 35: https://en.wikipedia.org/wiki/File:Clouds_spilling_over_San_Francisco.jpg
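
The “HA? Incident Guidelines” sidebar in the root cause analysis workflow on slide 10 is effectively a small lookup table: some incident categories always go to the HA Forum, others only “maybe”. As a minimal sketch of that table, the Python below transcribes the five guideline rows from the slide. The key names, the `Review` enum, the `ha_review` function, and the “strongest verdict wins” rule for combining categories are all assumptions made for illustration; the slides do not say how Salesforce combined or automated these guidelines.

```python
from enum import Enum


class Review(Enum):
    """Verdict on whether an incident is reviewed at the HA Forum."""
    YES = "yes"
    MAYBE = "maybe"
    UNSPECIFIED = "unspecified"  # category not covered by the slide


# Guideline rows transcribed from the slide-10 sidebar.
# The dictionary keys are hypothetical labels invented for this sketch.
HA_GUIDELINES = {
    "severity_0": Review.YES,
    "severity_1": Review.YES,
    "severity_2": Review.MAYBE,  # slide note: instance & incident length?
    "functional_regression": Review.MAYBE,
    "incorrect_or_incomplete_release": Review.YES,
    "deployment_delayed_or_rolled_back": Review.MAYBE,
}


def ha_review(categories):
    """Return the strongest verdict among an incident's categories.

    Assumption (not from the slides): YES outranks MAYBE, and
    categories missing from the table count as UNSPECIFIED.
    """
    verdicts = [HA_GUIDELINES.get(c, Review.UNSPECIFIED) for c in categories]
    for level in (Review.YES, Review.MAYBE):
        if level in verdicts:
            return level
    return Review.UNSPECIFIED


if __name__ == "__main__":
    print(ha_review(["severity_2", "incorrect_or_incomplete_release"]))
    print(ha_review(["functional_regression"]))
```

Writing the guidelines out this way makes the number of “maybe” rows visible, which fits the talk's observation that a process intended for all incidents was, in practice, applied mainly to high profile ones.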
