DevOps Chicago - The Game Of Operations and the Operation of Games

The Game of Operations and the Operation of Games
Randy Shoup
@randyshoup
linkedin.com/in/randyshoup
DevOps Chicago Meetup, May 19 2014

Background
CTO at KIXEYE
• Real-time strategy games for web and mobile
Director of Engineering for Google App Engine
• World’s largest Platform-as-a-Service
Chief Engineer at eBay
• Multiple generations of eBay’s real-time search infrastructure

Real-Time Strategy Games are …
• Real-time
• Spiky
• Computationally intensive
• Constantly evolving
• Constantly pushing boundaries
-> Technically and operationally demanding

Operating Games: Goals
Player Fun
• If players aren’t playing, we don’t have a business
• If players aren’t having fun, we don’t have a business for long
• Fun includes game mechanics, feature set, quality, performance
Studio Velocity
• 8 *highly independent* game studios
• Different tech stacks, tool chains, phases of development
Developer Productivity and Satisfaction
• We are a vendor; the studios are our customers
• Must be *strictly better* than the alternatives of build, buy, borrow
Cost Efficiency
• More output for less cost

Cloud
• All studios and services moving to AWS
• Strong focus on automation
Services
• Small, focused teams
• Clean, well-defined interface to customers
DevOps
• Developers behave like Ops
• Ops behaves like Developers

Cloud
Services
DevOps

Why Cloud? (The Obvious)
Provisioning Speed
• Minutes, not weeks
• Autoscaling in response to load
Near-Infinite Capacity
• No need to predict and plan for growth
• No need to defensively overprovision
Pay For What You Use
• No “utilization risk” from owning / renting
• If it’s not in use, spin it down
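
A rough worked example of "utilization risk", as a sketch only. The demand curve and the price per instance-hour are hypothetical numbers chosen for illustration, and the same hourly rate is assumed for both models to keep the arithmetic simple.

```python
def fixed_capacity_cost(peak_instances, hours, hourly_rate):
    """Owned or rented capacity is sized for the peak and paid for every hour."""
    return peak_instances * hours * hourly_rate


def pay_per_use_cost(instances_per_hour, hourly_rate):
    """Pay-per-use capacity is billed only for instances running in each hour."""
    return sum(instances_per_hour) * hourly_rate


# Hypothetical day: 20 quiet hours at 10 instances, 4 peak hours at 50 instances.
demand = [10] * 20 + [50] * 4
rate = 0.5  # hypothetical price per instance-hour

fixed = fixed_capacity_cost(max(demand), len(demand), rate)
elastic = pay_per_use_cost(demand, rate)

# Fraction of the peak-sized fleet that is actually used over the day.
utilization = sum(demand) / (max(demand) * len(demand))
```

With these made-up numbers the peak-sized fleet is about one-third utilized, which is the gap that spinning idle instances down is meant to close.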

Why Cloud? (The Less Obvious)
Instance Optimization Opportunities
• Instance shapes to fit most parts of the solution space (compute-intensive, IO-intensive, etc.)
• If the shape does not fit, try another
Service Quality
• Amazon and Google know how to run data centers
• Battle-tested and highly automated
• World-class networking, both cluster fabric and external peering

Why Cloud? (The Fundamentals)
Right Side of History
• Almost impossible to beat Google / Amazon buying power or operating efficiencies
• 2010s in computing are like 1910s in electric power
• Soon it will be just as common to run your own data center as it is to run your own electric power generation (!)
Easy and Fun
• It Just Works ™
• Makes it easy to fall in love with infrastructure

Autoscaling
Games are very spiky
• Very unpredictable
• Huge variability between peak and trough
• Hits are self-reinforcing
Services and clients have to “flex”
• Clients back off in response to latency
• Services grow / shrink based on load
Service Cluster == AWS Auto-Scale Group
• Scale up or down based on predefined metrics, thresholds
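
The "flex" on both sides can be sketched roughly as follows. This is a minimal illustration, not KIXEYE's implementation and not the AWS Auto Scaling API: `ScalingPolicy`, `desired_size`, and `client_backoff_delay` are hypothetical names, and the metric is assumed to be something like average CPU utilization expressed as a fraction.

```python
from dataclasses import dataclass


@dataclass
class ScalingPolicy:
    """Hypothetical threshold policy for one service cluster."""
    min_size: int
    max_size: int
    scale_up_above: float    # grow when the metric exceeds this
    scale_down_below: float  # shrink when the metric falls below this
    step: int = 1            # instances added or removed per decision


def desired_size(current, metric, policy):
    """Return the cluster size the policy asks for, clamped to its bounds."""
    if metric > policy.scale_up_above:
        target = current + policy.step
    elif metric < policy.scale_down_below:
        target = current - policy.step
    else:
        target = current
    return max(policy.min_size, min(policy.max_size, target))


def client_backoff_delay(observed_latency_s, target_latency_s,
                         base_delay_s=0.1, max_delay_s=5.0):
    """Return how long a client should wait before its next request.

    No delay while latency is at or under target; otherwise the delay grows
    in proportion to how far latency exceeds the target, up to a cap.
    """
    if observed_latency_s <= target_latency_s:
        return 0.0
    overload = observed_latency_s / target_latency_s
    return min(max_delay_s, base_delay_s * overload)
```

For example, with `ScalingPolicy(min_size=2, max_size=10, scale_up_above=0.7, scale_down_below=0.3, step=2)`, a cluster of 4 instances at a metric of 0.9 would be asked to grow to 6, while a cluster already at 10 stays at 10.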

Automation Work at KIXEYE
Build / Deploy Pipeline
• One button
• Puppet -> Packer -> AMI -> Asgard
• No-downtime red-black deployment
• Futures: canarying, auto-rollback
Manageability
• Flume -> ElasticSearch / Kibana for logging
• Shinken -> PagerDuty for monitoring and alerting
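
The idea behind no-downtime red-black deployment in the pipeline above can be sketched in a few lines. This is a toy model under stated assumptions: `Router`, `red_black_deploy`, and `rollback` are hypothetical names, clusters are plain strings, and nothing here calls Asgard or AWS.

```python
class Router:
    """Hypothetical stand-in for a load balancer that serves one live cluster."""

    def __init__(self, live):
        self.live = live      # cluster currently receiving traffic
        self.standby = None   # previous cluster, kept around for rollback


def red_black_deploy(router, new_cluster, is_healthy):
    """Switch traffic to new_cluster only if it passes the health check.

    The old cluster keeps serving until the switch, so there is no downtime.
    Returns True if traffic was switched, False if the deploy was abandoned.
    """
    if not is_healthy(new_cluster):
        return False
    router.standby = router.live
    router.live = new_cluster
    return True


def rollback(router):
    """Send traffic back to the previous cluster."""
    if router.standby is None:
        raise RuntimeError("no previous cluster to roll back to")
    router.live, router.standby = router.standby, router.live
```

Keeping the previous cluster on standby is what makes a later auto-rollback a traffic switch rather than a rebuild.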

Service Teams
• Give teams autonomy
• Freedom to choose technology, methodology, working environment
• Responsibility for the results of those choices
• Hold them accountable for *results*
• Give a team a goal, not a solution
• Let team own the best way to achieve the goal

KIXEYE Service Chassis
• Goal: Produce a “chassis” for building scalable game services
• Minimal resources, minimal direction
• 3 people x 1 month
• Consider building on open source projects
-> Team exceeded expectations
• Co-developed chassis, transport layer, service template, build pipeline, red-black deployment, etc.
• Operability and manageability from the beginning
• Heavy use of Netflix open source projects
• 15 minutes from no code to running service in AWS (!)
• Plan to open-source several parts of this work

Micro-Services
Simple
Well-defined interface
Single-purpose
Modular and independent
Small teams
Autonomy and responsibility
[Diagram: services labeled A, B, C, D, E]

Transition to Building Services
Common Chassis
• Make it trivially easy to build and maintain a service
Define Service Interface (Formally!)
• Propose, Discuss, Agree
Prototype Implementation
• Simplest thing that could possibly work
• Client can integrate with prototype
• Implementor can learn what works and what does not
Real Implementation
• Throw away the prototype (!)
-> Rinse and Repeat
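
One way to read "define the interface formally, then prototype" is sketched below. The leaderboard service is a hypothetical example that does not come from the talk: the abstract class plays the role of the agreed interface, and the in-memory class is the simplest thing that could possibly work, meant to be thrown away.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Tuple


class LeaderboardService(ABC):
    """Hypothetical agreed interface; clients code against this, not the prototype."""

    @abstractmethod
    def submit_score(self, player_id: str, score: int) -> None:
        """Record a score; only a player's best score is kept."""

    @abstractmethod
    def top(self, n: int) -> List[Tuple[str, int]]:
        """Return up to n (player_id, best_score) pairs, highest score first."""


class InMemoryLeaderboard(LeaderboardService):
    """Throwaway prototype: no persistence, no scaling, just the contract."""

    def __init__(self) -> None:
        self._best: Dict[str, int] = {}

    def submit_score(self, player_id: str, score: int) -> None:
        previous = self._best.get(player_id, score)
        self._best[player_id] = max(previous, score)

    def top(self, n: int) -> List[Tuple[str, int]]:
        # Sort by score descending, then by player id for a stable order.
        ranked = sorted(self._best.items(), key=lambda item: (-item[1], item[0]))
        return ranked[:n]
```

Because clients integrate against `LeaderboardService`, the real implementation can replace the prototype without any client changes.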

Transition to Service Relationships
Vendor – Customer Relationship
• Friendly and cooperative, but structured
• Clear ownership and division of responsibility
• Customer can choose to use service or not (!)
Service-Level Agreement (SLA)
• Promise of service levels by the service provider
• Customer needs to be able to rely on the service, like a utility
Charging and Cost Allocation
• Charge customers for *usage* of the service
• Aligns economic incentives of customer and provider
• Motivates both sides to optimize
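
Charging for usage can be as simple as splitting a service's cost in proportion to each customer's consumption. The sketch below assumes a single usage measure (for example, request count); `allocate_cost` and the studio names are hypothetical.

```python
def allocate_cost(total_cost, usage_by_customer):
    """Split total_cost across customers in proportion to their usage.

    usage_by_customer maps a customer name to a non-negative usage figure.
    If nobody used the service, nobody is charged.
    """
    total_usage = sum(usage_by_customer.values())
    if total_usage == 0:
        return {customer: 0.0 for customer in usage_by_customer}
    return {
        customer: total_cost * usage / total_usage
        for customer, usage in usage_by_customer.items()
    }


# Hypothetical month: three studios share a service that cost 1000.0 to run.
charges = allocate_cost(1000.0, {"studio_a": 600, "studio_b": 300, "studio_c": 100})
```

A studio that halves its usage sees its share fall, and a provider that lowers the total cost lowers everyone's bill, which is the incentive alignment described above.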

Instrumentation and Measurement
Instrument Everything
• Machine / instance stats: CPU, memory, I/O
• Software infrastructure stats: database, message queue
• Application stats: game client, game server, services
Make It Easy to Do the Right Thing ™
• Easy, reliable, low-latency
• Auto-tagged and searchable
Why?
• Measurement beats intuition every time; my own intuition is usually wrong
• If you need to ssh into a box, instrumentation failed you
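
"Auto-tagged and searchable" might look something like the sketch below: every metric is emitted as one structured JSON line, with host and service tags attached automatically. The function names, tag names, and the `example-service` value are assumptions for illustration; the default sink is `print`, standing in for whatever log shipper is in use.

```python
import json
import socket
import time

# Tags added to every event so callers never have to remember them.
DEFAULT_TAGS = {"host": socket.gethostname(), "service": "example-service"}


def metric_event(name, value, tags=None, now=None):
    """Build one metric event, merging caller tags over the default tags."""
    return {
        "name": name,
        "value": value,
        "timestamp": time.time() if now is None else now,
        "tags": {**DEFAULT_TAGS, **(tags or {})},
    }


def emit(name, value, tags=None, sink=print):
    """Serialize the event as a single JSON line and hand it to the sink."""
    sink(json.dumps(metric_event(name, value, tags), sort_keys=True))
```

A call such as `emit("login.latency_ms", 42, {"studio": "a"})` is a one-liner for the developer, which is the point of making the right thing easy.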

One Team (!)
• Act as one team across development, product, operations, etc.
• Solve problems instead of blaming and pointing fingers
• Political games are not as fun as real-time strategy games

Everyone Is Responsible for Prod
Everyone’s incentives are aligned
Everyone is strongly motivated to have solid instrumentation and monitoring

Organization: Learning Culture
Learn from mistakes and improve
• What did you do -> What did you learn
• Take emotion and personalization out of it
Encourage iteration and velocity
• “Failure is not falling down but refusing to get back up” – Theodore Roosevelt

Google Blame-Free Post-Mortems
Post-mortem After Every Incident
• Document exactly what happened
• What went right
• What went wrong
Open and Honest Discussion
• What contributed to the incident?
• What could we have done better?
Engineers compete to take personal responsibility (!)

Transition to DevOps Organization
• Studios make user-visible games
• Services provide common endpoints
Training / Retraining
• Common bootcamp
• Train devs as Ops, Ops as devs
You Build It, You Run It
• Transition on-call
• Use primary / secondary on-call as apprenticeship

Recap: The Game of Operations
Cloud
Services
DevOps

Come Join Us!
KIXEYE is hiring in
SF, Seattle, Victoria, Brisbane, Amsterdam
@randyshoup
rshoup@kixeye.com
linkedin.com/in/randyshoup
slideshare.net/randyshoup
