Driving DevOps at GS Shop using Mesos

Learn how GS Shop is changing its monolithic architecture and engineering culture one application at a time using the Mesos stack. A year ago, we were running dedicated virtual machines for each project and using a broken pipeline to move our software to production. Today, we have built a production stack that runs our mission-critical systems on Mesos, and we will soon move all our systems to this stack. We use Mesos as a fabric that runs different classes of applications, including legacy monolithic apps, and have integrated our continuous delivery infrastructure around it. We practice zero-downtime deployments and also run continuous delivery for all our infrastructure pieces, which gives our Mesos infrastructure automated upgradability. Mesos acts as a central nervous system for our container-technology-agnostic service delivery platform. This is the story of our journey.

https://mesosconasia2016.sched.org/event/8Tus/using-mesos-to-drive-devops-adoption-at-scale-at-gsshop-vivek-juneja-gsshop?iframe=no

Published in: Technology

  1. 1. IT Innovation Center *Dev*Ops*, Cloud Infrastructure, Microservices, Containers 2015 - Current Founding Member, Container Platform Team vivekjuneja
  2. 2. The DevOps Enabler
  3. 3. Happy families are all alike; every unhappy family is unhappy in its own way. 幸福的家庭大抵相同,而不幸的家庭却各有其不幸。
  4. 4. Productive teams are all alike; every unproductive team is unhappy in its own way.
  5. 5. Productivity = Happy Teams
  6. 6. AGENDA
  7. 7. NOT LONG AGO (1995 - 2015), BEGINNING OF THE CHANGE (2015 - 2016), ADOPTING THE CHANGE (2016 - 2017), THE ROAD AHEAD (2017 - 2019)
  8. 8. NOT LONG AGO BEGINNING OF THE CHANGE ADOPTING THE CHANGE THE ROAD AHEAD
  9. 9. Source Repo NEXUS BUILDER & DEPLOYER DEV TEST STAGE PROD Build & Deploy Maintenance Development
  10. 10. Build & Deploy 7 days 10 changes 10 days 3 changes Deploy Frequency Lead Time for Change Per developer per week Changes per Deploy
  11. 11. Introducing…
  12. 12. OPERATIONS* DEVELOPER*
  13. 13. OPERATIONS*: Stable system; Minimal Changes; Control on changes. DEVELOPER*: New Features; Fast Changes; Quick rollout to Prod
  14. 14. Service Management, Monolithic App: Simple, well-understood Management Primitives; Minimal Moving Parts
  15. 15. Service Management, Multi-Apps / Microservices: New and Complicated Management Primitives; Too many Moving Parts
  16. 16. Yawn Driven Deployment (打哈欠)
  17. 17. Yawn Driven Deployment (打哈欠) Deploy Code at 3 AM to Production
  18. 18. NOT LONG AGO BEGINNING OF THE CHANGE THE ROAD AHEAD ADOPTING THE CHANGE
  19. 19. Know thy Issues
  20. 20. We manage our machines like pets. Pets get old, die and it is sad
  21. 21. Each team invents their own tools and processes
  22. 22. It takes a long time for developers to get feedback
  23. 23. Big Bang releases that are treated like religious events
  24. 24. Not-my-problem syndrome: Lack of Empathy between roles
  25. 25. Inspiration
  26. 26. Inverse Conway Maneuver
  27. 27. Inverse Conway Maneuver: Who is Melvin Conway?
  28. 28. Conway’s Law: "organizations which design systems are constrained to produce designs which are copies of the communication structures of these organizations"
  29. 29. Conway’s Law: "organizations which design systems are constrained to produce designs which are copies of the communication structures of these organizations"
  30. 30. Inverse Conway Maneuver: Design systems that impose constructive constraints on the teams to change the way they communicate and manage
  31. 31. Inverse Conway Maneuver: Design systems that impose constructive constraints on the teams to change the way they communicate and manage
  32. 32. O-ring theory of economics: Tasks of production must be executed proficiently together in order for any of them to be of any value (Michael Kremer)
  33. 33. Adrian Colyer
  34. 34. O-ring theory of economics devops
  35. 35. E D B A C O-ring theory of devops
  36. 36. E D B A C O-ring theory of devops
  37. 37. Tenets
  38. 38. Tenets: Disposable Apps; Developer Productivity; Shared Multi-tenant Infrastructure and Tooling; Automate Service management primitives; Measure and Log everything
  39. 39. Master Plan: Inverse Conway Maneuver, O-ring theory of devops, Apply!
  40. 40. Service Delivery Platform The building blocks for building reliable software at scale
  41. 41. End to End Workflow: CODE, Build Process, Docker Image, Deploy Preparation, Deployment Manifest, ZDD, Load Balancer Reload, DNS Update. DEPLOY PHASE (for all environments)
  42. 42. APM Logging Dashboard Notification End to End Workflow
  43. 43. One Deployment Manifest to rule them all
  44. 44. DEV TEST STAGE PROD
  45. 45. Deployment Manifest Template
  46. 46. DEV TEST STAGE PROD
  47. 47. Blue Green Deployment NEWOLD OLD OLD OLD OLD OLD HAProxy ZDD Control Marathon Mesos Control disabled
  48. 48. Blue Green Deployment Zero Downtime Deployment Ideally
  49. 49. Blue Green Deployment Deploy when awake !
  50. 50. Notifications
  51. 51. Custom Dashboard; Metadata based Rollback; Comprehensive Health Check; uses Service Discovery; Supports Multiple Data Centers / Platform Regions; Common API for Developers for integration with CI Server
  52. 52. Devops Metrics: Marathon /events (failed_health_check_event, deployment_success, deployment_failed, deployment_step_failure), Metrics Collector, Timeseries DB
  53. 53. Devops Metrics
  54. 54. Monitoring and Alerts: APM; Hardware Monitoring (VM, Physical Machine); Service Monitoring (container, non-container); Container Platform Stack Monitoring; Service Latency; Service Tracing; Audit Trail
  55. 55. Monitoring and Alerts: APM; Hardware Monitoring (VM, Physical Machine); Service Monitoring (container, non-container); Container Platform Stack Monitoring; Service Latency; Service Tracing; Audit Trail
  56. 56. Monitoring and Alerts
  57. 57. Monitoring and Alerts
  58. 58. Fault Identification: Notification, APM, Log, Dashboard; identifiers: TRACEID, MESOS_TASK_ID, MARATHON_APP; Marathon: Kill / Kill and Scale
  59. 59. Fault Identification
  60. 60. Fair Share Usage Gravity Platform DEV Environment #1 TEST Environment #1 DEV Environment #2 TEST Environment #1 DEV Environment #2 Disposable Transient Dev/Test Environments Hours
  61. 61. Platform Provisioning (Standardization). Worker Node: Mesos Agent, Docker, Monit, Log Forwarder, cAdvisor. Master Node: Mesos Master, Marathon, Monit, Log Forwarder. Also: Prometheus, cAdvisor, HAProxy, Marathon-LB
  62. 62. Platform Provisioning FLEET PKG REPOSITORY WORKER WORKER WORKER MASTER MASTER
  63. 63. Platform Provisioning FLEET WORKER WORKER WORKER MASTER …... WORKER WORKER WORKER MASTER …... WORKER WORKER WORKER MASTER …...
  64. 64. Platform Provisioning FLEET WORKER WORKER WORKER MASTER …... WORKER WORKER WORKER MASTER …... APPLICATIONS SYSTEM MGMT WORKER WORKER WORKER MASTER …...
  65. 65. & many more
  66. 66. NOT LONG AGO BEGINNING OF THE CHANGE THE ROAD AHEAD ADOPTING THE CHANGE
  67. 67. Change is hard !
  68. 68. Evaluation Experience in Production Confidence in Adoption Tipping Point Maturity with Devops Timescale
  69. 69. Our Adoption Playbook: Confidence in Technology; Compare and Contrast; Create new Roles
  70. 70. Present Present Future Future Unified Deployment Centralized Logging Consolidated Monitoring Common Notifications Evaluation Experience in Production Confidence in Adoption
  71. 71. Compare and Contrast: L4 (load balancer) in front of both stacks. Dedicated VM App Servers: 100%, 50%, 0%. Marathon LB, Mesos Agents, App Containers: 0%, 50%, 100%. Stages: Launch, Stability, Confidence
  72. 72. New Roles
  73. 73. OPERATIONS: Create Self Service; Automate Primitives; Shared Goal with Dev. DEVELOPER: Use Self Service; Ops friendly code; Shared Goal with Ops
  74. 74. NEW SHARED GOALS: Reduce time from code check-in to production release; Ensure releases can be performed during normal business hours; Reduce unplanned work and increase productivity
  75. 75. Reality Check 1: Allocate VM to a Service. Let software decide. Less upfront capacity allocation meetings, and more work done!
  76. 76. Reality Check 2: Availability and Tolerance = manual mgmt. Let software decide. Less manual intervention, and more time to spend on improving quality
  77. 77. Reality Check 3: Time to Production influenced by a lot of manual, monotonous work. Minimal manual work and increased self-service. Ops work made more accessible to Devs via Self Service
  78. 78. Reality Check 4: Limited reusability and lack of standards across teams. Standardize through Containers and Deployment primitives. Reusability across teams, and more time to focus on innovation
  79. 79. Reality Check: (1) Less upfront capacity allocation meetings, and more work done! (2) Less manual intervention, and more time to spend on improving quality. (3) Ops work made more accessible to Devs. Ops spend more time improving quality. (4) Reusability across teams, and more time to focus on innovation
  80. 80. But DevOps is also about Architecture
  81. 81. Architecture Constraints: Ops friendly development; Metrics friendly development; Build systems which are failure aware; Everything is distributed
  82. 82. NOT LONG AGO BEGINNING OF THE CHANGE THE ROAD AHEAD ADOPTING THE CHANGE
  83. 83. Multi-Tenant Cluster: Mix workloads for increased efficiency; Share common platform primitives; Performance and Isolation guarantees; Avoid noisy neighbours; Mesos Agent Selection; Isolated Load balancer and Discovery; Isolated Container Registry; Resource Reservation
  84. 84. Global Workload Allocation: Not all workloads are the same! C = Cost, P = Performance, I = Isolation
  85. 85. Mesos Cluster MULTI-TENANT Physical Machines ( Mesos Agents ) SINGLE-TENANT Physical Machines ( Mesos Agents ) MULTI-TENANT VMs ( Mesos Agents ) SINGLE-TENANT VMs ( Mesos Agents ) CP IP CP CI Global Workload Allocation
  86. 86. Mesos Cluster Custom Framework Fenzo Mesos Master Physical Machine ( Mesos Agent ) Physical Machine ( Mesos Agent ) VM ( Mesos Agent ) VM ( Mesos Agent ) Global Workload Allocation
  87. 87. Custom Framework ● Recommendation System for Resource allocation ● Integrate with Billing system to provide cost-efficient allocation scheme Global Workload Allocation
  88. 88. Custom Framework ● Recommendation System for Resource allocation ● Integrate with Billing system to provide cost-efficient allocation scheme Global Workload Allocation ● Support more advanced bin-packing with Fenzo #2430 (mesosphere/marathon)
  89. 89. Application aware scheduling Tenant Allocation Framework Mesos Cluster Mesos Agent Mesos Agent Mesos Agent Mesos Agent Tenant Tenant Tenant Mesos Master
  90. 90. Application aware scheduling Cost Reduction
  91. 91. Canary Release and Architecture A/B (Mesos Master, Marathon, Mesos Agents): Auto Scaling based on automated workflows; Automated Workflows for Testing new versions; Self Healing and automated rollback to previous versions
  92. 92. THE EARLY YEARS BEGINNING OF THE CHANGE THE ROAD AHEAD ADOPTING THE CHANGE
  93. 93. 1 DAY ONE
  94. 94. Change is possible.
  95. 95. Change is possible. Microsoft joins Linux Foundation - November 16, 2016
  96. 96. We Mesos community ! Thanks 고맙습니다 谢谢 Questions ?
  97. 97. We Mesos community ! Thanks Want to help us build this further ? We are hiring ! 고맙습니다 谢谢
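Slides 43 to 46 describe one deployment manifest template shared by the DEV, TEST, STAGE and PROD environments. The deck does not show the template itself, so the following is only a minimal sketch of the idea. The field names follow Marathon's app definition format; the override values, the registry host and the `render_manifest` helper are hypothetical.

```python
import copy

# Hypothetical shared template. Field names (id, instances, cpus, mem,
# container.docker.image) follow Marathon's app definition format.
BASE_MANIFEST = {
    "id": "/{env}/{service}",
    "instances": 1,
    "cpus": 0.5,
    "mem": 512,
    "container": {
        "type": "DOCKER",
        "docker": {"image": "{registry}/{service}:{version}"},
    },
}

# Hypothetical per-environment overrides; the real values are not in the deck.
ENV_OVERRIDES = {
    "dev": {"instances": 1},
    "test": {"instances": 1},
    "stage": {"instances": 2},
    "prod": {"instances": 4, "cpus": 1.0, "mem": 1024},
}


def render_manifest(env, service, version, registry="registry.example.com"):
    """Render the single template into a manifest for one environment."""
    if env not in ENV_OVERRIDES:
        raise ValueError(f"unknown environment: {env}")
    manifest = copy.deepcopy(BASE_MANIFEST)  # never mutate the shared template
    manifest.update(ENV_OVERRIDES[env])
    manifest["id"] = manifest["id"].format(env=env, service=service)
    docker = manifest["container"]["docker"]
    docker["image"] = docker["image"].format(
        registry=registry, service=service, version=version
    )
    return manifest
```

Calling `render_manifest("prod", "cart", "1.2.3")` yields a manifest that differs from the DEV one only in the overridden fields, which is the point of keeping a single template.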
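Slides 47 to 49 show a blue-green, zero-downtime deployment in which new tasks start next to the old ones and old tasks are disabled in HAProxy before they go away. The deck does not give the exact procedure, so this is a rough simulation under the assumption that old tasks are replaced in fixed-size batches; `blue_green_steps` and its parameters are hypothetical names.

```python
def blue_green_steps(instances, batch=1):
    """Return the (old_live, new_live) task counts after each switch-over step."""
    if instances < 0 or batch < 1:
        raise ValueError("instances must be >= 0 and batch must be >= 1")
    old, new = instances, 0
    steps = [(old, new)]
    while old > 0:
        moved = min(batch, old)
        # 1. Start `moved` new tasks and wait until their health checks pass.
        new += moved
        # 2. Disable the same number of old tasks in the load balancer, let
        #    in-flight requests drain, then kill them.
        old -= moved
        steps.append((old, new))
    return steps
```

Because new tasks are healthy before old ones are removed, the number of tasks able to serve traffic never drops below the original instance count at any recorded step.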
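Slide 52 feeds Marathon events (failed_health_check_event, deployment_success, deployment_failed, deployment_step_failure) through a metrics collector into a time-series database. The collector is not shown in the deck; the sketch below only illustrates the counting step, assuming each event arrives as a JSON document with an `eventType` field. The `count_events` helper is hypothetical, and writing the counts to a time-series database is left out.

```python
import json
from collections import Counter

# Event types named on slide 52.
TRACKED_EVENTS = {
    "failed_health_check_event",
    "deployment_success",
    "deployment_failed",
    "deployment_step_failure",
}


def count_events(raw_events):
    """Count tracked Marathon events from an iterable of JSON strings."""
    counts = Counter()
    for raw in raw_events:
        try:
            event = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip keep-alives and malformed payloads
        if not isinstance(event, dict):
            continue
        event_type = event.get("eventType")
        if event_type in TRACKED_EVENTS:
            counts[event_type] += 1
    return counts
```

A ratio such as deployment_failed to deployment_success over a time window is one example of a DevOps metric that could be derived from these counts.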
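Slide 58 ties fault identification to three identifiers (TRACEID, MESOS_TASK_ID, MARATHON_APP) and to Marathon's "Kill" and "Kill and Scale" actions. As an illustration only, the sketch below assumes log lines carry these identifiers as `KEY=value` pairs, which the deck does not specify, and builds the URL for Marathon's task-kill endpoint (`DELETE /v2/apps/{appId}/tasks/{taskId}`, with `scale=true` to also scale the app down). The helper names and the Marathon host are hypothetical.

```python
import re
from urllib.parse import quote

# Assumed log convention: identifiers appear as KEY=value tokens.
FIELD_PATTERN = re.compile(r"\b(TRACEID|MESOS_TASK_ID|MARATHON_APP)=(\S+)")


def extract_fault_context(log_line):
    """Pull the correlation identifiers out of one log line."""
    return dict(FIELD_PATTERN.findall(log_line))


def kill_task_url(marathon_base, context, scale=False):
    """Build the URL to DELETE in order to kill (and optionally scale) a task."""
    app = context["MARATHON_APP"].strip("/")
    task = context["MESOS_TASK_ID"]
    url = (
        f"{marathon_base.rstrip('/')}/v2/apps/"
        f"{quote(app)}/tasks/{quote(task, safe='')}"
    )
    return url + ("?scale=true" if scale else "")
```

The sketch only constructs the URL; sending the request and handling authentication depend on the cluster setup.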
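Slide 88 mentions more advanced bin-packing with Fenzo as a goal for the custom framework. To make the term concrete, here is a sketch of first-fit decreasing, a textbook bin-packing heuristic. It is not Fenzo's algorithm or API, and it assumes a single resource dimension (CPU units) and identical agents; `first_fit_decreasing` is a hypothetical helper.

```python
def first_fit_decreasing(task_cpus, agent_capacity):
    """Pack tasks onto as few identical agents as the heuristic manages.

    Returns one list of task sizes per agent.
    """
    if agent_capacity <= 0:
        raise ValueError("agent_capacity must be positive")
    agents = []  # task sizes placed on each agent
    free = []    # remaining capacity of each agent
    for cpus in sorted(task_cpus, reverse=True):
        if cpus > agent_capacity:
            raise ValueError(f"task of size {cpus} fits on no agent")
        for i, remaining in enumerate(free):
            if cpus <= remaining:
                agents[i].append(cpus)
                free[i] -= cpus
                break
        else:
            # No existing agent has room, so open a new one.
            agents.append([cpus])
            free.append(agent_capacity - cpus)
    return agents
```

Packing tasks tightly leaves whole agents idle, which is what makes cost-oriented allocation possible; a real scheduler would also weigh memory, ports and isolation constraints.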
