DevDay 2016: Artur Speth - DevOps - Microsoft Developer Division's Path into the Next Agile Era

Markets are more dynamic than ever, and business models are changing. Engineering often no longer keeps up with this pace, which can lead to significant competitive disadvantages. Shorter cycles and an agile culture are key elements of better value creation, but they are not trivial to achieve in large organizations. Using Visual Studio Team Services as an example, the talk described the agile transformation of the Microsoft Developer Division toward a DevOps culture and offered a look behind the scenes at how the Developer Division works today.

Published in: Technology

  1. 4,307
  2. 467, spread out across 35 feature teams
  3. Production, Development, Backlog, Requirements
  4. Visual Studio & TFS Update 1, Visual Studio & TFS Update 2, Visual Studio & TFS Update n; VS Team Services
  5. Code, Test & Stabilize, Code, Test & Stabilize, Beta, RTM (2 years)
  6. Planning. Customer feedback: "we should change the way a feature works. We didn't get it quite right… … but we're booked solid already." (2 years)
  7. S1 S2 S3 S4 S5 Stabilization S6 A B S7 S8
  8. 2 years, 3 weeks
  9. https://flic.kr/p/arXUyP
  10. Alignment, Autonomy. "Let's try to give our teams three things… Autonomy, Mastery, Purpose"
  11. Scenarios, Features, Stories, Tasks
  12. Sprint (3 weeks), Plan (3 sprints), Season (6 months), Scenario (18 months). Timeline: Spring, Fall, Spring, Fall. Aspirational: 60%
  13. Sprint (3 weeks), Plan (3 sprints), Season (6 months), Scenario (18 months). Timeline: Spring, Fall, Spring, Fall. Hopeful: 80%. What Epics are we lighting up?
  14. Sprint (3 weeks), Plan (3 sprints), Season (6 months), Scenario (18 months). Timeline: Spring, Fall, Spring, Fall. Thoughtful: 90%. What features are planned?
  15. Sprint (3 weeks), Plan (3 sprints), Season (6 months), Scenario (18 months). Timeline: Spring, Fall, Spring, Fall. Confident: 95%. What stories are complete? What features are shipping?
  16. Week 1 Week 2 Week 3 Week 1 Week 2 Week 3 Week 2 Week 3; Sprint 98, Sprint 97, Sprint 99; The sprint plan; What we accomplished
  17. Updates were large; months apart; lots of problems! Timeline (4/1/2010 to 4/23/2012): 5/3/2010 TFS 2010 RTM; 4/23/2011 Service Deployment; 8/5/2011 Service Update; 9/26/2011 //BUILD 2011; 12/7/2011 Service Update; 1/30/2012 Service Update; 2/20/2012 Service Update; 3/12/2012 Service Update; 4/2/2012 Service Update
  18. Program Management, Development, Testing, Operations
  19. Program Management, Engineering, Operations Engineering
  20. Program Management, Engineering
  21. Week 1 Week 2 Week 3 Week 1 Week 2 Week 3 Week 2 Week 3; Sprint 98, Sprint 97, Sprint 99; Deployment, Sprint Planning, Done
  22. Week 1 Week 2 Week 3
  23. Week 1 Week 2 Week 3
  24. Week 1 Week 2 Week 3
  25. Week 1 Week 2 Week 3
  26. ONE
  27. Code, Test & Stabilize, Code, Test & Stabilize, Beta, RTM; Planning, Code Complete
  28. ON OFF
  29. ON OFF
  30. ON OFF
  31. ON OFF
  32. ON OFF
  33. ON OFF
  34. VSO SU1 (Chicago), VSO SU0 (San Antonio), VSO SU4 (Amsterdam), Shared Platform Services (San Antonio)
  35. Existing experience Baseline: 36% conversion to project 50% to 100% customers conversion to project (+18%)
  36. There's no place like production!
  37. Telemetry everywhere: Customer Intelligence, Business Intelligence, Operational Intelligence; Dashboard, DevOps, Debug, Experiments
  38. Getting the availability model right [Chart: LSI on Sept 25th 2013, 2:24 PM to 10:48 PM. Series: FailedExecutionCount, SlowExecutionCount, Availability (ID4 - Activity Only), Availability (Current); LSI Start and End marked. Axes: availability 0.8 to 1; count -200 to 1600]
  39. Alerting is key to fast detection. Every alert must be actionable and represent a real issue with the system. Alerts should create a sense of urgency; false alerts dilute that. Redundant alerts for the same issue. Needed to set the right thresholds and tune often. Stateless alerts contributed to further noise.
  40. Health model in action: 3 errors for memory and performance; all 3 related to the same code defect; APM component mapped to feature team; auto-dialer engaged Global DRI. Eliminated alert noise, from ~928 alerts per week to ~22, and reduced DRI escalations by ~56%.
  41. Live Site Issues (LSIs)
  42. Service Availability & Health Metrics (DRAFT, Microsoft Confidential). Chart panels: Time to Detect, Time to Mitigate, Error By Source, Incidents by Severity, User Impact Minutes During Incidents [TFS Only]. Notes: (1) TFS availability is on an improving trend. No Sev0/Sev1 LSIs for July. (2) App Insights switched from synthetic availability to real-user experience in the Ibiza portal. A high volume of SEV-2 LSIs (72) contributed to customer impact, in addition to intermittent UX errors. (UX fixes applied on 8/11 improve availability.) (3) App Insights was impacted by 3 long-running LSIs related to ES maintenance, Ibiza updates, and an Azure Storage outage. (4) TFS Service attainment (SLO) improved significantly MoM, with a focus on minimizing failed/slow commands and reviewing them in weekly LiveSite reviews.
  43. Service status
  44. © 2015 Microsoft Corporation. All rights reserved.
