AWS re:Invent 2013 - MBL303 Gaming Ops - Running High-performance Ops for Mobile Gaming

Gaming Ops
Running High-Performance Ops for Mobile Gaming
Eduardo Saito – Director, Engineering
Nick Dor – Sr. Director, Engineering
GREE International Friday, November 15, 2013
© 2013 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.

Agenda
• Part 1 – Lessons Learned
– Incident Management
– Change Management
– Auto-scale
– Cloud Optimization Tools and Capacity Planning
• Part 2 – Game Architecture, Analytics & Monetization
– Game Architecture
– Moving a live game
– Analytics & Monetization
– Cloud Insights

Incident Management
[Diagram labels: NOC; Ops; SME (Network, DBA, …); Dev; other monitoring tools; Triage, Escalation, Communication]

NOC, automated
[Diagram labels: Ops; Dev; Critical; Critical; Non-Critical; other monitoring tools]
Application-level issue?
• “Who’s the dev of this game? Phone #?”
• “I can’t find the dev… who’s his manager?”
• “Oh, the problem is in the backend service, who’s the dev for that service?”

Alert Workflow - DevOps way
Ops
Dev, Game X, Server
Dev, Game Y, Client/iOS
Dev, Service A
Dev, Service B
Analytics
Each alert goes directly to the team that can resolve it!

Alerts go to the person that can resolve it
Type | Scope | Checked by | Who to page?
ELB | Load balancer health-check | ELB | No one – email alert only
System-level | Check cpu / disk / memory / network | Pingdom / Nagios | Ops team
App-level | Application issues / bugs | Pingdom | Dev and Ops teams
App-level alerts can be triggered by issues in:
• Server-side
• Client-side
– iOS
– Android
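The routing table above can be sketched as a small lookup. This is a minimal illustration, not GREE's actual paging setup: the `Alert` type, the `route` function, the `dev:<source>` target format, and the fallback to Ops for unknown alert types are all assumptions made for the example.

```python
from dataclasses import dataclass

# Routing rules copied from the table above.
# "page" lists who gets paged; "email" marks email-only alerts.
ROUTES = {
    "elb": {"page": [], "email": True},                # no one is paged
    "system": {"page": ["ops"], "email": False},       # Ops team
    "app": {"page": ["dev", "ops"], "email": False},   # Dev and Ops teams
}


@dataclass
class Alert:
    kind: str     # "elb", "system", or "app"
    source: str   # hypothetical owner tag, e.g. "game-x/server"
    message: str


def route(alert: Alert) -> dict:
    """Return who to page for an alert, following the table above."""
    rule = ROUTES.get(alert.kind)
    if rule is None:
        # Assumption: unknown alert types go to Ops rather than being dropped.
        return {"page": ["ops"], "email": False}
    targets = []
    for team in rule["page"]:
        # Dev pages go to the specific team that owns the game or service.
        targets.append(f"dev:{alert.source}" if team == "dev" else team)
    return {"page": targets, "email": rule["email"]}
```

For example, `route(Alert("app", "game-x/server", "HTTP 500 rate high"))` pages both the Game X server developers and Ops, while an ELB health-check alert pages no one.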

Dev and Ops are responsible
Team | In pager duty
Ops | 8
Dev | 32, from ~20 games (server-side or client-side, Android or iOS developers)
Analytics | 5

Big dashboard = meta monitoring

Use IM Bot for status
• IM Bot informs the game channel that an alert was triggered
IM Bot = collaboration
• Both Ops and Dev receive the alert and troubleshoot
IM Bot = transparency
• IM Bot detects that the issue is resolved and sends the all-clear
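The bot behaviour described above (announce when an alert triggers, announce the all-clear when it recovers) amounts to tracking state transitions per check. The sketch below assumes a hypothetical `post(channel, text)` callable for the chat system and uses the game name as the channel; neither comes from the slides.

```python
class StatusBot:
    """Posts one message when a check starts failing and one when it recovers."""

    def __init__(self, post):
        self.post = post          # hypothetical callable: post(channel, text)
        self.open_alerts = set()  # (game, check) pairs currently failing

    def on_check(self, game, check, ok):
        key = (game, check)
        if not ok and key not in self.open_alerts:
            # First failure: announce in the game channel.
            self.open_alerts.add(key)
            self.post(game, f"ALERT: {check} triggered, Ops and Dev paged")
        elif ok and key in self.open_alerts:
            # Recovery after a failure: send the all-clear.
            self.open_alerts.remove(key)
            self.post(game, f"ALL CLEAR: {check} recovered")
        # Repeated failures or repeated successes post nothing.
```

Posting only on transitions keeps the channel readable during a long outage, since a check that fails every minute produces one message rather than dozens.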

Review your incidents and alerts
• Monday morning incident review meeting
– Weekly on-call hand-over
– Address false-positives / fine-tune your monitoring
– Heads-up for events / major releases
• Problem management
– Any major or recurrent incident = Problem
– Problem = requires post-mortem
– Remediation items from post-mortem also tracked weekly until closure

Incident Management
Lessons Learned
• Use automatic paging/escalation tools
• Make the alerts go to the right team directly
• Use big display dashboard
• Use IM-bots to communicate outages
• Do weekly reviews of the incidents / alerts
• Do post-mortems, follow-up on remediation items

Change Management
Type | Content | Owner | Tool
Configuration Management | 3rd-party packages and configuration | Ops | Puppet
Release – code deploy | 1st-party code | Dev | Jenkins + in-house scripts
Release – asset deploy | 1st-party images / new game content / new missions | Dev | Jenkins + in-house scripts

Configuration Management
[Diagram labels: Ops do changes / test locally; push; syntax validation (not good); peer review; pull changes to prod; puppet; puppet clients (prod servers) pull changes]

Configuration Management Benefits
• Automate and speed-up deployment
• Repeatable
• Declarative modules/manifests = documentation
• All prod changes:
– peer-reviewed via pull-requests in Git
– validated by Puppet lint
– locally tested via Vagrant (every component has a Vagrant VM)
– communicated through email and IM

Release Management – Code deploy
[Diagram labels: dev; dev host; push; S3; Deploy to QA / Beta / Prod]
Deploys are announced in the QA/dev channel of that project; Prod deploys are announced in the Ops channel of that project.

Release Management – Asset deploy
[Flowchart labels: Dev kick off new asset deploy job; Run validation; Warns? (Yes / No); Override? (Yes); Code Review; Ops approval; Deploy to prod]

Change Management Lessons Learned
• Changes are made directly by the team that is
responsible for that code
– 3rd. party code is configuration management = owned by Ops
– 1st. party code is release management = owned by Dev
• Changes are made through tools
– Configuration management through Puppet
– Release management through Jenkins + internal tool
• No change is done manually
• All changes are communicated and tracked

Auto-scale use-cases
• On-demand
– for the daily traffic fluctuations and organic growth
• Scheduled
– for in-game events

Auto-scale on-demand and scheduled
[Chart series: CPU; # instances in ELB; # auto-scale instances]

Scheduled Auto-scale
1 - Scheduled pre-provisioning config enabled
Scheduled action:
    as-put-scheduled-update-group-action ccios-app-ScheduledUpFriday \
        --auto-scaling-group ccios-app-asg \
        --recurrence "00 17 * * 5" \
        --min-size 16

2 - Spare capacity in place, ready for event

3 - Event starts, 4x spike

On-demand Auto-scale
4 - On-demand auto-scale reacts to CPU above 60% and adds more servers
On-demand policy:
    as-put-scaling-policy ccios-app-ScaleUpPolicy60 \
        --adjustment=8 --type ChangeInCapacity

5 - Scheduled pre-provisioning config is removed
Scheduled action:
    as-put-scheduled-update-group-action ccios-app-ScheduledDownFriday \
        --recurrence "0 21 * * 5" --min-size 2

6 - On-demand auto-scale terminates some instances as CPU drops below 40%
On-demand policy:
    as-put-scaling-policy ccios-app-ScaleDownPolicy40 \
        --adjustment=-2 --type ChangeInCapacity
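How the scheduled floor and the on-demand policies interact across steps 1 to 6 can be sketched as a pure function. This is a simplified model, not the Auto Scaling service's algorithm: it ignores cooldowns, alarm evaluation periods, and any max-size (none is given in the slides), and it assumes the cron expressions are evaluated in the same time zone as `now`. The function names are hypothetical.

```python
from datetime import datetime

SCALE_UP_CPU = 60.0    # ScaleUpPolicy60 fires above this CPU %
SCALE_DOWN_CPU = 40.0  # ScaleDownPolicy40 fires below this CPU %
SCALE_UP_STEP = 8      # --adjustment=8
SCALE_DOWN_STEP = -2   # --adjustment=-2


def scheduled_min_size(now: datetime) -> int:
    """Minimum group size implied by the two scheduled actions.

    "00 17 * * 5" raises min-size to 16 on Fridays at 17:00;
    "0 21 * * 5" drops it back to 2 on Fridays at 21:00.
    """
    is_friday = now.weekday() == 4  # Monday is 0, so Friday is 4
    if is_friday and 17 <= now.hour < 21:
        return 16
    return 2


def next_capacity(current: int, cpu: float, now: datetime) -> int:
    """Apply one on-demand adjustment, never going below the scheduled floor."""
    if cpu > SCALE_UP_CPU:
        desired = current + SCALE_UP_STEP
    elif cpu < SCALE_DOWN_CPU:
        desired = current + SCALE_DOWN_STEP
    else:
        desired = current
    return max(desired, scheduled_min_size(now))
```

With a group of 4 instances at 50% CPU, the model returns 4 on Friday at 16:00 and 16 at 17:00 (pre-provisioning). At 18:00 with 75% CPU and 16 instances it returns 24, and after 21:00 a drop to 30% CPU shrinks the group by 2 per evaluation.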

Auto-scale bootstrap workflow
Event | Description | Duration
CloudWatch alarm is triggered | e.g. CPU > 60% for 5 minutes | 5 minutes
Auto-scale policy is executed | Launches n new instances | 2 minutes
User-data script is executed | This script is defined on the autoscale launch config. Installs base packages, gets instance_id, IP and hostgroup | 1 minute
Bootstrap script is executed | This script is loaded from S3. It renames the host, runs puppet, deploys code, starts the web service | 11 minutes
Health-check passes and servers start to get traffic | Health-check must pass before ELB starts to send traffic to the new host | 1 minute
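Adding up the durations in the table gives the end-to-end delay between a load spike and new servers taking traffic. The short calculation below uses only the numbers from the table; the function names are illustrative.

```python
# Durations in minutes, copied from the bootstrap workflow table above.
BOOTSTRAP_STEPS = [
    ("CloudWatch alarm is triggered", 5),
    ("Auto-scale policy is executed", 2),
    ("User-data script is executed", 1),
    ("Bootstrap script is executed", 11),
    ("Health-check passes", 1),
]


def total_minutes(steps):
    """End-to-end time from alarm condition to serving traffic."""
    return sum(minutes for _, minutes in steps)


def slowest_step(steps):
    """The step that dominates the spin-up time."""
    return max(steps, key=lambda step: step[1])


if __name__ == "__main__":
    print(total_minutes(BOOTSTRAP_STEPS))  # 20
    print(slowest_step(BOOTSTRAP_STEPS))   # ('Bootstrap script is executed', 11)
```

The total is 20 minutes, of which the bootstrap script accounts for 11, so it is the natural first target when reducing spin-up time.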

Auto-scale external dependencies
Dependency | How to resolve
Configuration Management (Puppet/Chef) | Pre-load all necessary packages in the AMI / architect HA for config management
External Repo | Pre-load all necessary packages in the AMI / set up internal HA repo
Code deploy | Same as above, or put in S3
Monitoring registration | Make it asynchronous
Server registration | Make it asynchronous

Auto-scale Lessons Learned
• Reduce time to spin-up new instances:
– Pre-install all base packages into AMI
• Address those risks:
– on-demand and scheduled AS conflicts
– bootstrap validation and graceful termination
– health-checks: keep it simple
– keep some servers out of auto-scale pool, just in case
– map and resolve/monitor external dependencies for auto-scale
– consider using 2 different thresholds, for quicker ramp-up

Cloud Optimization areas
• Usage
– under-utilized hosts
– overloaded hosts
– EBS/ELB not in use
• Security
– exposed DBs
– EC2 behind ELB exposed directly
• Availability
– AZ / region distribution
– backup audit
– un-healthy instances in ELB
– ELB misconfigs
• Cost
– optimal # of RI
– hosts outside RI
– cost break-down using tags
– estimate on-demand costs

Cloud Optimization tools
• AWS Trusted Advisor
• 3rd-party commercial tools
• Open Source tools (e.g. Netflix Ice)
• In-house tools
• Excel!

Cloud Optimization Lessons Learned
• Try Trusted Advisor
• Pilot 3rd.-party solutions
• Evaluate what metrics are important for each component of your architecture
• Do in-house development for other optimizations you need that are not covered by TA or 3rd-party solutions
• Tag all assets! Automate tagging!

GREE Games
• All Mobile, all Free-to-Play
– iOS & Android smart phones
– Big focus on tablets
• Role Playing Games (RPG)
– Multi-million dollar franchise, top-grossing titles
– Some of the oldest games on the App Store
• Hardcore
– Deeper more intense gameplay mechanics
• Real-Time Strategy (RTS)
– Fast action, small unit management
• Casino & Casual Games
– Familiar games, wider audience, casual play

Example Game Architecture – RPG
• Application Servers
– PHP
– Game events → Analytics
• Cache Layer
– Memcached → Elasticache
• Batch Processing Servers
– Node.js (moving to Go)
– Batches database writes
• Database
– MySQL → RDS
[Diagram labels: ELB; App (x4); Cache (x4); Batch (x2); RDS (x3); Failover DB]

Caching Strategy - Current
• Game architecture predates stable NoSQL
– We wanted similar performance at scale
– Keep combined average internal response times below 300ms
• Memcache Authoritative
– Still use an RDBMS; potential data loss is limited
• Allows for cheaper/simpler DB layer
– Always do full row replacements (ie: no current_row_value +1)

Data Flow
• Reads
– ELB → App → Cache
• Writes (Synchronous)
– ELB → App → Cache → DB
– ELB → App → Cache → Batch → DB
– Standard write-through
– No blind writes; always fetch current version
• Writes (Asynchronous)
– Batch → DB
– Batch writes to DB every 30 seconds
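The synchronous write path above (write-through, no blind writes, full row replacement) can be illustrated with a version check on every write. This is a sketch under assumptions: plain dictionaries stand in for the cache and the database, and the `WriteThroughStore` and `VersionConflict` names are invented for the example rather than taken from the talk.

```python
class VersionConflict(Exception):
    """Raised when a writer did not base its change on the current version."""


class WriteThroughStore:
    def __init__(self):
        self.cache = {}  # key -> (version, row); stands in for the cache layer
        self.db = {}     # key -> (version, row); stands in for the database

    def read(self, key):
        # Reads are served from the cache only (ELB -> App -> Cache).
        return self.cache.get(key)

    def write(self, key, expected_version, new_row):
        """Replace the full row, but only if the caller fetched the current version."""
        current = self.cache.get(key)
        current_version = current[0] if current else 0
        if expected_version != current_version:
            raise VersionConflict(f"{key}: expected {expected_version}, found {current_version}")
        record = (current_version + 1, dict(new_row))  # full row, never "value + 1"
        self.cache[key] = record
        self.db[key] = record  # synchronous write-through to the database
        return record[0]
```

A caller reads the row, modifies its copy, and writes the whole row back with the version it read. A second writer holding a stale version gets `VersionConflict` and must fetch again before retrying.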

Batch Processor
• 80% of game write traffic is Async
– Each write is versioned
• Example: Player items (loot) after multiple quests
– 10 items in 30 sec; app server sends 10 writes downstream
– Batch processor sends last record with final item count to DB
• Greatly reduced writes on DB
– Shard at table and DB server level for larger games
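The loot example above, where 10 versioned writes in a 30-second window become a single database write, can be sketched as follows. The `BatchProcessor` class and the `flush_to_db` callback are hypothetical; in this sketch the caller is responsible for invoking `flush()` on a 30-second timer.

```python
class BatchProcessor:
    """Keeps only the newest version of each row between flushes."""

    def __init__(self, flush_to_db):
        self.flush_to_db = flush_to_db  # hypothetical callable: (key, version, row)
        self.pending = {}               # key -> (version, row)

    def enqueue(self, key, version, row):
        held = self.pending.get(key)
        # Versioned writes: an older or duplicate version never overwrites a newer one.
        if held is None or version > held[0]:
            self.pending[key] = (version, row)

    def flush(self):
        """Write the latest record per key to the DB; returns the number of DB writes."""
        batch, self.pending = self.pending, {}
        for key, (version, row) in batch.items():
            self.flush_to_db(key, version, row)
        return len(batch)
```

Because each write is a full row replacement, dropping the intermediate versions loses nothing: the last record already carries the final item count.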

Near Future Trends for GREE OPS
• Multi-region games
– Latency-sensitive games and the shift towards real-time
– Geographic data replication challenges
• Continuous Delivery
• Automation of Game Studio tasks
– Game design, art, data/asset deploy
– Tighter event pre-provisioning and scale-down

More Performance – Lower Costs
• Facebook HipHop Virtual Machine
– JIT compilation & execution of PHP
– 5x faster vs. Zend PHP 5.2
– Achieved 3x to 4x reduction in application server count
– https://github.com/facebook/hhvm
• Google GO
– Used for high-concurrency applications
– Achieved 2x reduction in batch processing servers vs. Node.js
– http://golang.org

Moving a Game – Why?
• Physical datacenter to AWS
– West coast → East coast
– Faster access to EU markets & players
• Reduce necessary attention to infrastructure
– Caching & DB layer; custom high-availability middleware
• Take advantage of cloud provisioning
– Scripted instance spin-ups, auto-scaling for events/load
• Save money
– Reduce stand-by server pool
– Provision for average load, not peak

Moving a Live Game – Whaaaaat?
• Live game, two platforms (iOS, Android)
– Several million $$$ in combined monthly revenue
– More than one million unique players/month
• ~ 30GB Dataset
• Minimal downtime (< 5 minutes)
– Mostly to allow for change to reverse proxy config
• Debian → CentOS
• Physical machines → AWS

Moving a Live Game - How
• Develop timeline
• R&D & architecture review
• Data migration & sync
• Game server/client updates
• Load testing
• D-Day steps & checklist

Moving a Live Game - Timeline
• 3 months overall
• DB dataset transfer validation
– Set up direct MySQL to RDS replication
– Initial DB transfer time: approx. 8 hours
• Functional & performance testing
– Load & capacity profile for application, DB servers
– Heavy use of APM metrics – New Relic

Moving a Live Game - Architecture
• Changes required
– Caching – discrete memcached to Elasticache nodes
– Database – physical MySQL DB servers to RDS
  • Decided to drop internally developed MySQL proxy (bittersweet: great automatic failover; limited internal knowledge)
  • RDS failover mechanics added to possible game downtime
– Load balancers – LVS to ELB
• Processes
– Code asset deployment

Moving a Live Game – D-Day
• Put game into maintenance (shutdown)
• Break DB replication (west → east)
• Set up reverse proxy in datacenter
– Forward traffic from west → east AWS ELB
• Bring game back online
– Reverse proxy sends traffic to AWS
• Update DNS to point to ELB
– Wait for DNS propagation
– Slow DNS updates hit the reverse proxy in datacenter

Moving a Live Game – Before & After

Analytics & Monetization
• Specialize in “Live Events”
– Higher player engagement (fun!) = more revenue
• Single-player events
– “Epic Boss”
– Limited-time quests
• Player organization events
– Guild vs. Guild battles (World Domination, Syndicate Wars)
– Raid Bosses – members help to take down a tough NPC
– Tap into social “meta-gaming”

Modern War World Domination Results (August 2013)

Analytics for Player Engagement
• Player retention
– 1st week and beyond
– Tutorial completion rates
• Balancing mechanics
– Player vs. Environment (PvE), Player vs. Player (PvP)
– Encourage interaction with other players
• When too much good can be bad
– Analytics needs to be paired with player feedback
– Fun for all players, payers AND non-payers

Analytics for Decision-making
• Devices & Markets
– Understand most popular devices (esp. Android)
– Focus efforts on the top devices for your market
• Launching a game
– “Soft-launch” – only launch in certain markets, tune game
– “Hard-launch” – money down (marketing), marquee live events
• When to sunset & decommission
– Depends on strategic goals, infra/engineering costs, etc.

Analytics – Some Scale
• Over 5000 transactions/sec sent to Analytics
• Several billion game events per day
– Attacking, winning, losing, buying, clicking, swiping, etc.
• Anticipating 10x increase in next two years
• Building petabyte-scale data warehouse capacity

Analytics Pipeline
• Working towards “zero-latency” pipeline
– Latency = ETL, summarization, reporting & dashboard
– Already reduced from 24 hours to 1 hour in last year

Cloud Insights
• Agility (Time to Deliver)
• Elasticity – scale up/down quickly
– Auto Scaling is critical
• Service simplification (RDS/Elasticache/ELB)
• Professional development for OPS Team
– Physical (Datacenter/Network focus) vs. Virtual (DevOps focus)

Cloud Insights – Lessons Learned
• Reliability & performance consistency varies
• Stuff breaks often
– Develop an “anti-fragile” mindset; build to anticipate failure
• Cost-predictability still elusive
• Orphaned servers
– Easy to create; must constantly clean up
• Large-scale monitoring is hard
– No silver bullet yet

Thank You
• Thanks to the GREE OPS & Engineering Teams!
eduardo.saito@gree.net
nick.dor@gree.net
• We’re Hiring DevOps Team Members!!
http://gree-corp.com/jobs
