DevOpsDays Silicon Valley 2014 - The Game of Operations


Operating online games is fun and challenging. Games are some of the spikiest workloads around, and real-time really means *real-time*. Randy Shoup shares many of the DevOps techniques his team has put into practice at KIXEYE, grouped under three themes: cloud infrastructure, service teams, and DevOps culture. He covers elastic workloads, micro-services, configuration automation, and a common service "chassis". He also discusses the organizational and technical disciplines of team autonomy, internal vendor-customer relationships, and, of course, "you build it, you run it"!

Published in: Technology, Business

  1. The Game of Operations and The Operation of Games
     Randy Shoup @randyshoup
     DevOpsDays Silicon Valley, June 28 2014
  2. Background
     CTO at KIXEYE
       • Real-time strategy games for web and mobile
     Director of Engineering for Google App Engine
       • World’s largest Platform-as-a-Service
     Chief Engineer at eBay
       • Multiple generations of eBay’s real-time search infrastructure
  3. 1973: Xerox PARC and SuperPaint
  4. 40 Years Later …
  5. Real-Time Strategy Games are …
       • Real-time
       • Spiky
       • Computationally-intensive
       • Constantly evolving
       • Constantly pushing boundaries
     -> Technically and operationally demanding
  6. Operating Games: Goals
     Player Fun
       • If players aren’t playing, we don’t have a business
       • If players aren’t having fun, we don’t have a business for long
       • Fun includes game mechanics, feature set, uptime, performance
     Developer Productivity and Satisfaction
       • We are a vendor; the studios are our customers
       • Must be *strictly better* than the alternatives of build, buy, borrow
     Cost Efficiency
       • More output for less
  7. The Game of Operations
     Cloud
       • All studios and services moving to AWS
       • Strong focus on automation
     Services
       • Small, focused teams
       • Clean, well-defined interface to customers
     DevOps Culture
       • One team across development and ops
  8. The Game of Operations
     Cloud
     Services
     DevOps Culture
  9. Why Cloud? (The Obvious)
     Provisioning Speed
       • Minutes, not weeks
       • Autoscaling in response to load
     Near-Infinite Capacity
       • No need to predict and plan for growth
       • No need to defensively overprovision
     Pay For What You Use
       • No “utilization risk” from owning / renting
       • If it’s not in use, spin it down
  10. Why Cloud? (The Less Obvious)
      Instance Shaping
        • Instance shapes to fit most parts of the solution space (compute-intensive, IO-intensive, etc.)
        • If one shape does not fit, try another
      Service Quality
        • Amazon and Google know how to run data centers
        • Battle-tested and highly automated
        • World-class networking, both cluster fabric and external peering
  11. Why Cloud? (Fundamental Forces)
      Economics
        • Nearly impossible to beat Google / Amazon buying power or operating efficiencies
        • 2010s in computing are like 1910s in electric power
      Developer Adoption
        • It Just Works ™
        • Makes it easy to fall in love with infrastructure
  12. “Soon it will be just as common to run your own data center as it is to run your own electric power generation” -- me
  13. Autoscaling
      Games are very spiky
        • Very unpredictable
        • Huge variability between peak and trough
      Hits are self-reinforcing
  14. Automation Work at KIXEYE
      Resilient Clients
        • Clients back off in response to latency
        • Clients continue gameplay despite network disruption
      Elastic Services
        • Services grow / shrink based on load
        • Service Cluster == AWS Auto Scale Group
  15. Automation Work at KIXEYE
      Build / Deploy Pipeline
        • One button
        • Puppet -> Packer -> AMI -> Asgard
        • Zero-downtime red-black deployment
        • Futures: canarying, auto-rollback
      Manageability
        • Puppet for configuration management
        • Flume -> ElasticSearch / Kibana for logging
        • Shinken -> PagerDuty for monitoring and alerting
  16. The Game of Operations
      Cloud
      Services
      DevOps Culture
  17. Service Teams
        • Give teams autonomy
        • Freedom to choose technology, methodology, working environment
        • Responsibility for the results of those choices
        • Hold them accountable for *results*
        • Give a team a goal, not a solution
        • Let team own the best way to achieve the goal
  18. KIXEYE Service Chassis
        • Goal: “chassis” for building scalable game services
        • Minimal resources, minimal direction
        • 3 people x 1 month
        • Consider building on NetflixOSS
      Team exceeded expectations
        • Co-developed chassis, transport layer, service template, build pipeline, red-black deployment, etc.
        • Operability and manageability from the beginning
        • 15 minutes from no code to running service in AWS (!)
        • Open-sourced at
  19. Micro-Services
      Single-purpose
      Simple, well-defined interface
      Modular and independent
      Small teams
      Autonomy and responsibility
      [Diagram: boxes labeled A, B, C, D, E]
  20. Transition to Service Relationships
      Vendor – Customer Relationship
        • Friendly and cooperative, but structured
        • Clear ownership and division of responsibility
        • Customer can choose to use service or not (!)
      Service-Level Agreement (SLA)
        • Promise of service levels by the provider
        • Customer needs to be able to rely on the service, like a utility
  21. Transition to Service Relationships
      Charging and Cost Allocation
        • Charge customers for *usage* of the service
        • Aligns economic incentives of customer and provider
        • Motivates both sides to optimize
  22. The Game of Operations
      Cloud
      Services
      DevOps Culture
  23. One Team (!)
        • Act as one team across development, product, operations, etc.
        • Solve problems instead of blaming and pointing fingers
        • Political games are not as fun as real-time strategy games
  24. Everyone Is Responsible for Prod
      Everyone’s incentives are aligned
      Everyone is strongly motivated to have solid instrumentation and monitoring
  25. “DevOps is a reorg” – Adrian Cockcroft
  26. Blame-Free Post-Mortems
      Learn from mistakes and improve
        • What did you do -> What did you learn
        • Take emotion and personalization out of it
      Post-mortem After Every Incident
        • Document exactly what happened
        • What went right
        • What went wrong
  27. Blame-Free Post-Mortems
      Open and Honest Discussion
        • What contributed to the incident?
        • What could we have done better?
      Engineers compete to take responsibility (!)
  28. “Failure is not falling down but refusing to get back up” – Theodore Roosevelt
  29. Transition to DevOps
      Organization
        • Studios make user-visible games
        • Services provide common endpoints
      Training / Retraining
        • Common bootcamp
        • Train devs as Ops, Ops as devs
      Transition On-call
        • Use primary / secondary on-call as apprenticeship
  30. “You Build It, You Run It” – Everyone
  31. Recap: The Game of Operations
      Cloud
      Services
      DevOps
  32. Come Join Us!
      DevOps Whiskey Tasting, July 22
      333 Bush St., San Francisco
      Hiring in SF, Seattle, Victoria, Brisbane, Amsterdam
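
The "Resilient Clients" slide says clients back off in response to latency, but the deck does not show how. One possible shape for that behavior is sketched below. It is an illustration only, not KIXEYE's client code: the class name `LatencyBackoff`, its fields, and all thresholds are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class LatencyBackoff:
    """Hypothetical client-side pacing: slow down when the server looks slow."""
    latency_threshold_s: float = 0.5  # assumed cutoff for a "slow" response
    base_delay_s: float = 0.1         # first pause after a slow response
    max_delay_s: float = 5.0          # cap so the client never stalls indefinitely
    delay_s: float = 0.0              # current pause before the next request

    def record(self, observed_latency_s: float) -> float:
        """Update the pause from one observed response time and return it."""
        if observed_latency_s > self.latency_threshold_s:
            # Slow response: start at the base delay, then double up to the cap.
            self.delay_s = min(self.max_delay_s,
                               max(self.base_delay_s, self.delay_s * 2))
        else:
            # Healthy response: halve the pause; drop to zero below the base delay.
            self.delay_s = self.delay_s / 2
            if self.delay_s < self.base_delay_s:
                self.delay_s = 0.0
        return self.delay_s
```

A client would call `record()` with each measured response time and sleep for the returned number of seconds before its next request. Doubling under pressure and halving on recovery is one common choice; a real client would likely add random jitter so that many clients do not retry in lockstep.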
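
The "Elastic Services" bullets describe services that grow and shrink based on load, with each service cluster mapped to an AWS Auto Scale Group. The sizing rule itself is not given in the slides. The sketch below shows one simple proportional rule under stated assumptions: the function name `desired_capacity`, its parameters, and the per-instance load metric are hypothetical, and nothing here calls or reproduces the AWS API.

```python
import math


def desired_capacity(current: int, observed_load: float, target_load: float,
                     min_size: int, max_size: int) -> int:
    """Return an instance count that would bring per-instance load near target.

    observed_load and target_load are per-instance values of the same metric
    (for example requests per second); both are assumptions of this sketch.
    """
    if current < 1 or target_load <= 0:
        raise ValueError("current must be >= 1 and target_load must be > 0")
    # Scale the fleet in proportion to how far the metric is from its target,
    # rounding up so the result never leaves the fleet under-provisioned.
    raw = math.ceil(current * observed_load / target_load)
    # Clamp to the group's configured bounds.
    return max(min_size, min(max_size, raw))
```

For example, four instances running at 90 units of load each against a target of 60 would be resized to six. The `min_size` and `max_size` clamps matter for spiky workloads: the floor keeps headroom for a sudden peak, and the ceiling bounds cost if the metric misbehaves.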
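
The "Build / Deploy Pipeline" slide lists zero-downtime red-black deployment without describing the mechanics. The general idea is to start the new version alongside the old one, switch traffic only once the new version is healthy, and keep the old version available for a fast rollback. The sketch below models that control flow in memory. `Cluster`, `Router`, `red_black_deploy`, and `rollback` are hypothetical names for illustration; this is not the Asgard API or KIXEYE's pipeline.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Cluster:
    """Hypothetical stand-in for one versioned group of servers."""
    version: str
    healthy: bool = True    # result of an assumed health check
    serving: bool = False   # whether it currently receives traffic


@dataclass
class Router:
    """Hypothetical traffic switch: one live cluster, one kept for rollback."""
    live: Optional[Cluster] = None
    standby: Optional[Cluster] = None


def red_black_deploy(router: Router, new: Cluster) -> bool:
    """Cut traffic over to `new` only if it is healthy; keep the old cluster."""
    if not new.healthy:
        # New version failed its checks: the old cluster keeps serving untouched.
        return False
    old = router.live
    new.serving = True         # the new cluster takes traffic first
    router.live = new
    if old is not None:
        old.serving = False    # the old cluster is drained, not destroyed
    router.standby = old
    return True


def rollback(router: Router) -> bool:
    """Swap back to the standby cluster, if one was kept."""
    if router.standby is None or router.live is None:
        return False
    router.live.serving = False
    router.standby.serving = True
    router.live, router.standby = router.standby, router.live
    return True
```

Because the previous cluster is only taken out of rotation, rolling back is a traffic switch rather than a rebuild, which is what makes the auto-rollback listed under "Futures" a natural next step.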