Scaling a Start-up DevOps team to 10x while scaling the system 50x

Sumo Logic has built a strong DevOps culture. All our back-end developers are on the on-call rotation, taking responsibility for running the system 24/7 and rolling out new code multiple times a week. This talk is about how we successfully grew that culture: the lessons learned, what worked, what didn’t, and how our culture has evolved.

Published in: Technology, Business

  1. Scaling a Start-up DevOps Team to 10x While Scaling the System 50x. Christian Beedgen (Co-Founder & CTO), Stefan Zier (Lead Architect). DevOpsDays Austin 2014.
  2. Intro. Christian Beedgen: Co-Founder, CTO; ArcSight, Amazon, …; no prior experience running production systems. Stefan Zier: Lead Architect, first engineer; ArcSight, Amazon, …; no prior experience running production systems.
  3. Scaling: “Spreading constructive beliefs and behavior from the few to the many.” Robert I. Sutton, Scaling Up Excellence: Getting to More Without Settling for Less.
  4. (no transcribed text)
  5. The Sumo Logic Service. Petabyte-scale log management platform; Big Data™, high velocity, human real time; distributed; 100% in AWS; service-oriented architecture; 99% in Scala; run by engineers.
  6. Data Ingest
  7. Code Commits, Services
  8. Engineering Head Count (chart, axis from 0 to 60). Sumo Logic Confidential.
  9. The Challenge: Scaling Sumo Logic. More confidence and uptime; more operators; more change; more services.
  10. (no transcribed text)
  11. How We Scaled: DevOps Culture; Spreading Knowledge; Control Surfaces.
  12. Culture: a shared, learned system of values, beliefs and attitudes that shapes and influences perception and behavior; an abstract “mental blueprint” or “mental code.”
  13. Being On Call. One week, 24/7 responsibility for: operational decision making; alert response; deploying the bits; configuration changes. Pair of people (primary, secondary): social schedules & travel; training; relief after a noisy night.
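
The weekly primary/secondary pairing on slide 13 could be scheduled along the following lines. This is a minimal sketch, not Sumo Logic's actual tooling: `OnCallRotation`, `Shift`, `schedule`, and the round-robin policy (each week's secondary becomes the next week's primary) are all assumptions made for illustration.

```scala
import java.time.LocalDate

object OnCallRotation {
  // One week of on-call duty, covered by a pair of engineers.
  final case class Shift(weekStart: LocalDate, primary: String, secondary: String)

  // Hypothetical policy: walk the roster round-robin, so that each week's
  // secondary becomes the following week's primary.
  def schedule(roster: Vector[String], firstWeek: LocalDate, weeks: Int): Vector[Shift] = {
    require(roster.size >= 2, "a pair needs at least two engineers")
    Vector.tabulate(weeks) { w =>
      Shift(
        weekStart = firstWeek.plusWeeks(w.toLong),
        primary = roster(w % roster.size),
        secondary = roster((w + 1) % roster.size)
      )
    }
  }
}
```

Rotating the secondary into the primary slot is one way to give each engineer a lower-pressure week before taking the pager, which fits the slide's "training" bullet; other policies would work equally well.
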
  14. Feedback Loops. Sumo on Sumo: perfect dogfooding use case. Post mortems: drive improvements from incidents. Alerting: code I wrote yesterday just woke me up at 4am.
  15. Change Management. Mandated for PCI compliance: Change Management Board = channel on Slack; Change Request = JIRA ticket; audit trail = paste Slack conversation into JIRA. Actually helpful: good documentation; starts good discussions; makes change mindful.
  16. Spreading Knowledge
  17. Spreading Knowledge. Tactical: daily standups; chat; playbooks. Strategic: mentoring; “How the sausage is made” sessions; checklists.
  18. (no transcribed text)
  19. Playbooks. Linked to alert: GitHub wikis; URL in alert. Focused on MTTR: steps to restore service; list of subject matter experts to call. Continuously improved: Boy Scout rule.
  20. Three Pillars: Culture; Knowledge; Control Surfaces.
  21. Checklists. Improve outcomes: ensure experts don’t miss any critical steps; prevent repeating mistakes. Well designed: coherent; living documents; concise, clear and require specific actions; need to be short and well-organized; are NOT step-by-step instructions.
  22. (no transcribed text)
  23. (no transcribed text)
  24. DevOps Friendly. Control surfaces matter for scale: simplify complex operations; consistent view; built-in safety. Natural to use: easy to learn, discover. Natural to extend: every developer.
  25. (no transcribed text)
  26. dsh: CLI; full stack; fast; safe; secure; proactive; discoverable.
  27. Model Driven. Creates consistency; provides guard rails. Deployment: Cluster (Instance); Assembly. Configured at all levels.
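
One way to read slide 27's model, with configuration attached at every level, is sketched below. The slide gives only the level names (Deployment, Cluster, Instance, Assembly), so everything else is assumed: the field names, the `Config` alias, `effectiveConfig`, and in particular the rule that the most specific level wins are hypothetical, not a description of dsh itself.

```scala
object DeploymentModel {
  type Config = Map[String, String]

  // Level names follow the slide; every field here is an assumption.
  final case class Instance(id: String, config: Config = Map.empty)
  final case class Cluster(name: String, instances: Vector[Instance], config: Config = Map.empty)
  final case class Assembly(name: String, config: Config = Map.empty)
  final case class Deployment(
      name: String,
      clusters: Vector[Cluster],
      assemblies: Vector[Assembly] = Vector.empty,
      config: Config = Map.empty
  )

  // Assumed resolution order: instance settings override cluster settings,
  // which override deployment settings. Returns None for unknown names.
  def effectiveConfig(d: Deployment, clusterName: String, instanceId: String): Option[Config] =
    for {
      cluster  <- d.clusters.find(_.name == clusterName)
      instance <- cluster.instances.find(_.id == instanceId)
    } yield d.config ++ cluster.config ++ instance.config
}
```

Layering the maps with `++` keeps defaults in one place (the deployment) while still allowing a per-cluster or per-instance override, which is one plausible route to the "meaningful defaults" and "guard rails" the slides mention.
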
  28. daemon restart api:p:25,receiver:p:10
  29. (no transcribed text)
  30. dsh: Scala; model based; trivial to extend; specific to OUR needs; meaningful defaults; prevents mistakes.
  31.
        val filter = FilterBuilder.withCluster("zk").
          withOnlyRunningInstances.build()
        val instances = deployment.connect.describeInstances(filter)
        instances.par.foreach { instance =>
          val ssh = instance.connectSSH
          ssh.execute("sudo service api restart")
        }
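
The slide's snippet relies on dsh internals that are not shown. The sketch below is a self-contained guess at how such a filter builder might work against an in-memory instance list. Only the names `FilterBuilder`, `withCluster`, `withOnlyRunningInstances`, `build`, and `describeInstances` come from the slide; the `Instance` and `Filter` types, the signatures, and the in-memory lookup are assumptions.

```scala
object DshFilterSketch {
  // Hypothetical stand-in for an instance record; not the real dsh type.
  final case class Instance(id: String, cluster: String, running: Boolean)

  // A filter is simply a predicate over instances.
  final case class Filter(matches: Instance => Boolean)

  // Immutable builder: each with... call returns a new builder that
  // carries one more predicate.
  final class FilterBuilder private (predicates: Vector[Instance => Boolean]) {
    def withCluster(name: String): FilterBuilder =
      new FilterBuilder(predicates :+ ((i: Instance) => i.cluster == name))

    def withOnlyRunningInstances: FilterBuilder =
      new FilterBuilder(predicates :+ ((i: Instance) => i.running))

    // An instance matches only if every collected predicate accepts it.
    def build(): Filter = Filter(i => predicates.forall(p => p(i)))
  }

  object FilterBuilder {
    def withCluster(name: String): FilterBuilder =
      new FilterBuilder(Vector.empty).withCluster(name)
  }

  // In-memory replacement for deployment.connect.describeInstances(filter).
  def describeInstances(all: Seq[Instance], filter: Filter): Seq[Instance] =
    all.filter(filter.matches)
}
```

With this sketch, `FilterBuilder.withCluster("zk").withOnlyRunningInstances.build()` reads exactly as on the slide. The slide's `instances.par` is left out here: on Scala 2.13 and later, parallel collections live in the separate scala-parallel-collections module.
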
  32. What would we do differently next time? Upgrade the system less monolithically. Don’t ask UI developers to do operations. Set clearer guidelines on managers and operations.
  33. Next Experiments. Divide up the big rotation; bring the India development team into the rotation; switch from 24/7 shifts to 12/7; deploy smaller parts of the system more often; bring full-time operations people into the mix.
  34. Thank You! Christian Beedgen (@raychaser), Stefan Zier (@stefanzier). We’re hiring! go.sumologic.com/jobs
