Gaming Ops 
Running High-Performance Ops for Mobile Gaming 
Eduardo Saito – Director, Engineering 
Nick Dor – Sr. Director, Engineering 
GREE International Friday, November 15, 2013 
© 2013 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.
Agenda 
• Part 1 – Lessons Learned 
– Incident Management 
– Change Management 
– Auto-scale 
– Cloud Optimization Tools and Capacity Planning 
• Part 2 – Game Architecture, Analytics & Monetization 
– Game Architecture 
– Moving a live game 
– Analytics & Monetization 
– Cloud Insights
Incident Management
[Diagram labels: Other monitoring tools…; NOC (Triage, Escalation, Communication); Ops; SME (Network, DBA, …); Dev]
NOC, automated
[Diagram labels: Other monitoring tools…; Ops; Dev; arrows marked Critical, Critical, Non-Critical]
Application-level issue?
• Who’s the dev of this game? Phone #?
• I can’t find the dev… who’s his manager?
• Oh, the problem is in the backend service, who’s the dev for that service?
Alert Workflow - DevOps way
Each alert goes directly to the team that can resolve it!
[Diagram labels: Ops; Dev, Game X, Server; Dev, Game Y, Client/iOS; Dev, Service A; Dev, Service B; Analytics]
Alerts go to the person that can resolve it
Type | Scope | Checked by | Who to page?
ELB | Load balancer health-check | ELB | No one – email alert only
System-level | Check CPU / disk / memory / network | Pingdom / Nagios | Ops team
App-level | Application issues / bugs | Pingdom | Dev and Ops teams
App-level alerts can be triggered by issues in:
• Server-side
• Client-side
– iOS
– Android
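The routing rules in the table above can be written as a small lookup. This is a minimal sketch, not GREE's actual paging setup: the names `Alert`, `ROUTING`, and `route`, the `source` label, and the fallback for unknown alert types are all assumptions made for illustration.

```python
from dataclasses import dataclass

# Routing rules copied from the table: alert type -> (action, teams to notify).
ROUTING = {
    "elb": ("email", []),             # email alert only, nobody is paged
    "system": ("page", ["ops"]),      # CPU / disk / memory / network checks
    "app": ("page", ["dev", "ops"]),  # application issues / bugs
}


@dataclass(frozen=True)
class Alert:
    alert_type: str  # "elb", "system" or "app"
    source: str      # hypothetical label such as "game-x/server"


def route(alert: Alert) -> tuple[str, list[str]]:
    """Return (action, teams) for an alert.

    Assumption: unknown alert types fall back to paging Ops so nothing is dropped.
    """
    action, teams = ROUTING.get(alert.alert_type, ("page", ["ops"]))
    return action, list(teams)


if __name__ == "__main__":
    print(route(Alert("app", "game-x/server")))
```

Keeping the rules in one table means a new game or service only needs a new entry, which is the point of sending each alert straight to the team that can resolve it.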
Dev and Ops are responsible
Team | In pager duty
Ops | 8
Dev | 32, from ~20 games (server-side or client-side, Android or iOS developers)
Analytics | 5
Big, Simple Status Dashboard
Big dashboard = quick status
Big dashboard = meta monitoring
Use IM Bot for status
• IM Bot informs the game channel that an alert was triggered
IM Bot = collaboration
• Both Ops and Dev receive the alert and troubleshoot
IM Bot = transparency
• IM Bot detects that the issue is resolved and sends the all-clear
Review your incidents and alerts 
• Monday morning incident review meeting 
– Weekly on-call hand-over 
– Address false-positives / fine-tune your monitoring 
– Heads-up for events / major releases 
• Problem management 
– Any major or recurrent incident = Problem 
– Problem = requires post-mortem 
– Remediation items from post-mortem also tracked weekly till closure
Incident Management 
Lessons Learned 
• Use automatic paging/escalation tools 
• Make the alerts go to the right team directly 
• Use big display dashboard 
• Use IM-bots to communicate outages 
• Do weekly reviews of the incidents / alerts 
• Do post-mortems, follow-up on remediation items
Change Management
Type | Content | Owner | Tool
Configuration Management | 3rd-party packages and configuration | Ops | Puppet
Release – code deploy | 1st-party code | Dev | Jenkins + in-house scripts
Release – asset deploy | 1st-party images / new game content / new missions | Dev | Jenkins + in-house scripts
Configuration Management
[Diagram labels: Ops do changes / test locally; push; syntax validation; peer review; not good; pull changes to prod puppet; puppet clients (prod servers) pull changes]
Configuration Management Benefits 
• Automate and speed-up deployment 
• Repeatable 
• Declarative modules/manifests = documentation 
• All prod changes: 
– peer-reviewed via pull-requests in Git 
– validated by Puppet lint 
– locally tested via Vagrant (every component has a Vagrant VM) 
– communicated through email and IM
Release Management – Code deploy
[Diagram labels: dev; dev host; push; S3; Deploy; QA; Beta; Prod]
In QA/dev channel of that project: [IM screenshot not preserved]
If Prod deploy, in Ops channel of that project: [IM screenshot not preserved]
Release Management – Asset deploy
[Flowchart nodes: Dev kicks off new asset deploy job; Run validation; Warns? (Yes / No); Override? (Yes); Code review; Ops approval; Deploy to prod]
Change Management Lessons Learned 
• Changes are made directly by the team that is responsible for that code
– 3rd. party code is configuration management = owned by Ops 
– 1st. party code is release management = owned by Dev 
• Changes are made through tools 
– Configuration management through Puppet 
– Release management through Jenkins + internal tool 
• No change is done manually 
• All changes are communicated and tracked
Auto-scale use-cases 
–On-demand 
• for the daily traffic fluctuations and 
organic growth 
–Scheduled 
• for in-game events
Auto-scale on-demand and scheduled
[Chart series: CPU, # instances in ELB, # auto-scale instances]
Scheduled Auto-scale
1 - Scheduled pre-provisioning config enabled
[Chart series: CPU, # instances in ELB, # auto-scale instances]
Scheduled action
    as-put-scheduled-update-group-action ccios-app-ScheduledUpFriday \
        --auto-scaling-group ccios-app-asg \
        --recurrence "00 17 * * 5" \
        --min-size 16
Scheduled Auto-scale
2 - Spare capacity in place, ready for event
[Chart series: CPU, # instances in ELB, # auto-scale instances]
Scheduled Auto-scale
3 - Event starts, 4x spike
[Chart series: CPU, # instances in ELB, # auto-scale instances]
On-demand Auto-scale
4 – On-demand auto-scale reacts to CPU above 60% and adds more servers
[Chart series: CPU, # instances in ELB, # auto-scale instances]
On-demand policy
    as-put-scaling-policy ccios-app-ScaleUpPolicy60 \
        --auto-scaling-group ccios-app-asg \
        --adjustment=8 --type ChangeInCapacity
On-demand Auto-scale
5 - Scheduled pre-provisioning config is removed
[Chart series: CPU, # instances in ELB, # auto-scale instances]
Scheduled action
    as-put-scheduled-update-group-action ccios-app-ScheduledDownFriday \
        --auto-scaling-group ccios-app-asg \
        --recurrence "0 21 * * 5" --min-size 2
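Taken together, the two scheduled actions define a weekly window in which the group's minimum size is raised. The sketch below models that window only. It assumes both recurrences are evaluated on the same clock (Auto Scaling scheduled actions use UTC by default), and the function and constant names are illustrative.

```python
from datetime import datetime

# Values from the slides:
#   ScheduledUpFriday   "00 17 * * 5" -> --min-size 16
#   ScheduledDownFriday "0 21 * * 5"  -> --min-size 2
FRIDAY = 4                  # datetime.weekday(): Monday is 0, so Friday is 4
UP_HOUR, DOWN_HOUR = 17, 21
EVENT_MIN, BASE_MIN = 16, 2


def scheduled_min_size(now: datetime) -> int:
    """Minimum group size in effect at `now` (same clock as the recurrences)."""
    in_window = now.weekday() == FRIDAY and UP_HOUR <= now.hour < DOWN_HOUR
    return EVENT_MIN if in_window else BASE_MIN


if __name__ == "__main__":
    # Friday, November 15, 2013, 18:00 falls inside the pre-provisioned window.
    print(scheduled_min_size(datetime(2013, 11, 15, 18, 0)))
```

The value is only a floor: on-demand policies can still push the group above 16 during the event, and removing the floor at 21:00 does not terminate instances while CPU stays high.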
On-demand Auto-scale
6 – On-demand auto-scale terminates some instances as CPU drops below 40%
[Chart series: CPU, # instances in ELB, # auto-scale instances]
On-demand policy
    as-put-scaling-policy ccios-app-ScaleDownPolicy40 \
        --auto-scaling-group ccios-app-asg \
        --adjustment=-2 --type ChangeInCapacity
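The two on-demand policies amount to an asymmetric rule: add 8 instances above 60% CPU, remove 2 below 40%. A minimal sketch of that decision follows. The `max_size` parameter is a hypothetical upper bound (the slides do not give one), and alarm duration and cooldowns are deliberately left out.

```python
def desired_capacity(current: int, cpu_percent: float,
                     min_size: int, max_size: int) -> int:
    """Apply the two ChangeInCapacity policies from the slides to one CPU reading."""
    if cpu_percent > 60:    # ccios-app-ScaleUpPolicy60: --adjustment=8
        change = 8
    elif cpu_percent < 40:  # ccios-app-ScaleDownPolicy40: --adjustment=-2
        change = -2
    else:
        change = 0          # between the thresholds: leave the group alone
    # The group always stays inside its min/max bounds.
    return max(min_size, min(max_size, current + change))


if __name__ == "__main__":
    print(desired_capacity(current=16, cpu_percent=75, min_size=2, max_size=64))
```

Scaling up in large steps and down in small ones favors availability during a spike over cost, which fits the 4x event spike shown earlier.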
Auto-scale bootstrap workflow
Event | Description | Duration
CloudWatch alarm is triggered | E.g. CPU > 60% for 5 minutes | 5 minutes
Auto-scale policy is executed | Launches n new instances | 2 minutes
User-data script is executed | This script is defined in the auto-scale launch config. Installs base packages, gets instance_id, IP and hostgroup | 1 minute
Bootstrap script is executed | This script is loaded from S3. It renames the host, runs Puppet, deploys code, starts the web service | 11 minutes
Health-check passes and servers start to get traffic | Health-check must pass before ELB starts to send traffic to the new host | 1 minute
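Adding up the durations in the bootstrap workflow above shows how long a new instance takes to serve traffic and which step dominates. The sketch below uses only the figures from the slide; the variable and function names are illustrative.

```python
# (step, minutes) pairs copied from the bootstrap workflow slide.
STEPS = [
    ("CloudWatch alarm is triggered", 5),
    ("Auto-scale policy is executed", 2),
    ("User-data script is executed", 1),
    ("Bootstrap script is executed", 11),
    ("Health-check passes", 1),
]


def total_minutes(steps):
    """Minutes from the start of the CPU breach to the first request served."""
    return sum(minutes for _, minutes in steps)


def slowest(steps):
    """The (step, minutes) pair that contributes the most delay."""
    return max(steps, key=lambda step: step[1])


if __name__ == "__main__":
    total = total_minutes(STEPS)
    name, minutes = slowest(STEPS)
    print(f"time to first traffic: {total} min")
    print(f"slowest step: {name} ({minutes} min, {minutes / total:.0%} of total)")
```

With these numbers the total is 20 minutes, and the bootstrap script accounts for 11 of them, so shortening it is the most direct way to cut spin-up time.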
Auto-scale external dependencies
Dependency | How to resolve
Configuration Management (Puppet/Chef) | Pre-load all necessary packages in the AMI / architect HA for config management
External repo | Pre-load all necessary packages in the AMI / set up an internal HA repo
Code deploy | Same as above, or put in S3
Monitoring registration | Make it asynchronous
Server registration | Make it asynchronous
Auto-scale Lessons Learned 
• Reduce time to spin-up new instances: 
– Pre-install all base packages into AMI 
• Address these risks:
– on-demand and scheduled AS conflicts 
– bootstrap validation and graceful termination 
– health-checks: keep it simple 
– keep some servers out of auto-scale pool, just in case 
– map and resolve/monitor external dependencies for auto-scale 
– consider using 2 different thresholds, for quicker ramp-up
Cloud Optimization areas
Cost
• optimal # of RI
• hosts outside RI
• cost break-down using tags
• estimate on-demand costs
Usage
• under-utilized hosts
• overloaded hosts
• EBS/ELB not in use
Availability
• AZ / region distribution
• backup audit
• un-healthy instances in ELB
• ELB misconfigs
Security
• exposed DBs
• EC2 behind ELB exposed directly
Cloud Optimization tools
• AWS Trusted Advisor
• 3rd-party commercial tools
• Open Source tools (e.g. Netflix Ice)
• In-house tools
• Excel!
Cloud Optimization Lessons Learned 
• Try Trusted Advisor 
• Pilot 3rd.-party solutions 
• Evaluate what metrics are important for each component of your architecture
• Do in-house development for other optimizations you need that are not covered by TA or 3rd-party solutions
• Tag all assets! Automate tagging!
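Tagging pays off when costs are broken down per tag and untagged resources are surfaced. The sketch below shows the idea on in-memory records. The record shape (`id`, `tags`, `monthly_cost`) and the sample values are hypothetical and are not an AWS API response.

```python
from collections import defaultdict


def cost_by_tag(resources, tag_key):
    """Sum monthly cost per value of `tag_key`; untagged resources land under None."""
    totals = defaultdict(float)
    for res in resources:
        totals[res.get("tags", {}).get(tag_key)] += res["monthly_cost"]
    return dict(totals)


def untagged(resources, tag_key):
    """IDs of resources that are missing `tag_key` and need to be fixed."""
    return [res["id"] for res in resources if tag_key not in res.get("tags", {})]


if __name__ == "__main__":
    # Hypothetical inventory, for illustration only.
    inventory = [
        {"id": "i-1", "tags": {"game": "game-x"}, "monthly_cost": 100.0},
        {"id": "i-2", "tags": {"game": "game-x"}, "monthly_cost": 200.0},
        {"id": "i-3", "tags": {}, "monthly_cost": 50.0},
    ]
    print(cost_by_tag(inventory, "game"))
    print(untagged(inventory, "game"))
```

Anything reported under `None` is cost that cannot be attributed to a game or service, which is why automated tagging appears among the lessons learned.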
GREE Games 
• All Mobile, all Free-to-Play 
– iOS & Android smart phones 
– Big focus on tablets 
• Role Playing Games (RPG) 
– Multi-million dollar franchise, top-grossing titles 
– Some of the oldest games on the App Store 
• Hardcore 
– Deeper more intense gameplay mechanics 
• Real-Time Strategy (RTS) 
– Fast action, small unit management 
• Casino & Casual Games 
– Familiar games, wider audience, casual play
Example Game Architecture – RPG
• Application Servers
– PHP
– Game events → Analytics
• Cache Layer
– Memcached → Elasticache
• Batch Processing Servers
– Node.js (moving to Go)
– Batches database writes
• Database
– MySQL → RDS
[Diagram: ELB; 4× App; 4× Cache; 2× Batch; 3× RDS; Failover DB]
Caching Strategy - Current 
• Game architecture predates stable NoSQL 
– We wanted similar performance at scale 
– Keep combined average internal response times below 300ms 
• Memcache Authoritative 
– Still use an RDBMS; potential data loss is limited 
• Allows for cheaper/simpler DB layer 
– Always do full row replacements (i.e., no current_row_value + 1)
Data Flow
• Reads
– ELB → App → Cache
• Writes (Synchronous)
– ELB → App → Cache → DB
– ELB → App → Cache → Batch → DB
– Standard write-through
– No blind writes; always fetch current version
• Writes (Asynchronous)
– Batch → DB
– Batch writes to DB every 30 seconds
[Diagram: ELB; 4× App; 4× Cache; 2× Batch; 3× RDS]
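The synchronous path combines three rules from the slides: the cache is authoritative for reads, writes replace the full row, and no write is issued without first fetching the current version. The sketch below is one way to express those rules. Plain dictionaries stand in for memcached and MySQL, and the version-compare mechanism and every name are assumptions, since the talk does not show the implementation.

```python
class StaleWrite(Exception):
    """Raised when a writer's copy of the row is older than the cached one."""


class WriteThroughStore:
    def __init__(self):
        self.cache = {}  # key -> (version, row); stands in for memcached
        self.db = {}     # key -> (version, row); stands in for MySQL

    def read(self, key):
        """Reads are served from the cache only (ELB -> App -> Cache)."""
        return self.cache.get(key, (0, None))

    def write(self, key, expected_version, new_row):
        """Replace the full row; reject the write if the caller's version is stale."""
        current_version, _ = self.read(key)  # no blind writes: fetch current first
        if expected_version != current_version:
            raise StaleWrite(f"{key}: have v{expected_version}, cache is v{current_version}")
        entry = (current_version + 1, dict(new_row))
        self.cache[key] = entry  # the cache is authoritative
        self.db[key] = entry     # synchronous write-through to the database
        return entry[0]


if __name__ == "__main__":
    store = WriteThroughStore()
    version, _ = store.read("player:1")
    store.write("player:1", version, {"gold": 10})
    print(store.read("player:1"))
```

Because every write carries the whole row and a version, a retried or reordered write cannot silently apply an increment twice.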
Batch Processor 
• 80% of game write traffic is Async 
– Each write is versioned 
• Example: Player items (loot) after multiple quests 
– 10 items in 30 sec; app server sends 10 writes downstream 
– Batch processor sends last record with final item count to DB 
• Greatly reduced writes on DB 
– Shard at table and DB server level for larger games
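The loot example above (10 writes in 30 seconds collapsing into one database write) can be sketched as a coalescing buffer keyed by row. This is an illustration of the idea, not the Node.js or Go service from the talk: the class and method names are hypothetical and a dictionary stands in for the database.

```python
class BatchProcessor:
    def __init__(self):
        self.pending = {}   # key -> (version, row): newest version seen per key
        self.db = {}        # stands in for the database
        self.db_writes = 0  # counts writes that actually reach the database

    def enqueue(self, key, version, row):
        """Keep only the newest version of each row until the next flush."""
        held = self.pending.get(key)
        if held is None or version > held[0]:
            self.pending[key] = (version, row)

    def flush(self):
        """Run on a timer (every 30 seconds in the talk): one DB write per key."""
        for key, entry in self.pending.items():
            self.db[key] = entry
            self.db_writes += 1
        self.pending.clear()


if __name__ == "__main__":
    batch = BatchProcessor()
    # A player picks up 10 items within one 30-second window.
    for count in range(1, 11):
        batch.enqueue("player:1:items", version=count, row={"item_count": count})
    batch.flush()
    print(batch.db_writes, batch.db["player:1:items"])
```

Coalescing is safe here only because each write is a versioned full-row replacement, so the last record carries the final state.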
Near Future Trends for GREE OPS 
• Multi-region games 
– Latency-sensitive games and the shift towards real-time 
– Geographic data replication challenges 
• Continuous Delivery 
• Automation of Game Studio tasks 
– Game design, art, data/asset deploy 
– Tighter event pre-provisioning and scale-down
More Performance – Lower Costs 
• Facebook HipHop Virtual Machine 
– JIT compilation & execution of PHP 
– 5x faster vs. Zend PHP 5.2 
– Achieved 3x to 4x reduction in application server count 
– https://github.com/facebook/hhvm 
• Google Go
– Used for high-concurrency applications 
– Achieved 2x reduction in batch processing servers vs. Node.js 
– http://golang.org
Moving a Game – Why? 
• Physical datacenter to AWS 
– West coast → East coast
– Faster access to EU markets & players 
• Reduce necessary attention to infrastructure 
– Caching & DB layer; custom high-availability middleware 
• Take advantage of cloud provisioning 
– Scripted instance spin-ups, auto-scaling for events/load 
• Save money 
– Reduce stand-by server pool 
– Provision for average load, not peak
Moving a Live Game – Whaaaaat? 
• Live game, two platforms (iOS, Android) 
– Several million $$$ in combined monthly revenue 
– More than one million unique players/month 
• ~ 30GB Dataset 
• Minimal downtime (< 5 minutes) 
– Mostly to allow for change to reverse proxy config 
• Debian → CentOS
• Physical machines → AWS
Moving a Live Game - How 
• Develop timeline 
• R&D & architecture review 
• Data migration & sync 
• Game server/client updates 
• Load testing 
• D-Day steps & checklist
Moving a Live Game - Timeline 
• 3 months overall 
• DB dataset transfer validation 
– Setup direct MySQL to RDS replication 
– Initial DB transfer time: approx. 8 hours 
• Functional & performance testing 
– Load & capacity profile for application, DB servers 
– Heavy use of APM metrics – New Relic
Moving a Live Game - Architecture 
• Changes required 
– Caching – discrete memcached to Elasticache nodes
– Database – physical MySQL DB servers to RDS 
• Decided to drop internally developed MySQL proxy 
– Bittersweet: great automatic failover; limited internal knowledge 
• RDS failover mechanics added to possible game downtime 
– Load balancers 
• LVS to ELB 
• Processes 
– Code asset deployment
Moving a Live Game – D-Day 
• Put game into maintenance (shutdown) 
• Break DB replication (west → east)
• Setup reverse proxy in datacenter
– Forward traffic from west → east AWS ELB
• Bring game back online 
– Reverse proxy sends traffic to AWS 
• Update DNS to point to ELB 
– Wait for DNS propagation 
– Slow DNS updates hit the reverse proxy in datacenter
Moving a Live Game – Before & After
Analytics & Monetization 
• Specialize in “Live Events” 
– Higher player engagement (fun!) = more revenue 
• Single-player events 
– “Epic Boss” 
– Limited-time quests 
• Player organization events 
– Guild vs. Guild battles (World Domination, Syndicate Wars) 
– Raid Bosses – members help to take down a tough NPC 
– Tap into social “meta-gaming”
Modern War World Domination Results (August 2013)
Analytics for Player Engagement 
• Player retention 
– 1st week and beyond 
– Tutorial completion rates 
• Balancing mechanics 
– Player vs. Environment (PvE), Player vs. Player (PvP) 
– Encourage interaction with other players 
• When too much good can be bad 
– Analytics needs to be paired with player feedback 
– Fun for all players, payers AND non-payers
Analytics for Decision-making 
• Devices & Markets 
– Understand most popular devices (esp. Android) 
– Focus efforts on the top devices for your market 
• Launching a game 
– “Soft-launch” – only launch in certain markets, tune game 
– “Hard-launch” – money down (marketing), marquee live events 
• When to sunset & decommission 
– Depends on strategic goals, infra/engineering costs, etc.
Analytics – Some Scale 
• Over 5000 transactions/sec sent to Analytics 
• Several billion game events per day 
– Attacking, winning, losing, buying, clicking, swiping, etc. 
• Anticipating 10x increase in next two years 
• Building petabyte-scale data warehouse capacity
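To put the rate in daily terms, the arithmetic below converts 5000 transactions/sec into a per-day figure and applies the anticipated 10x increase. It assumes the rate is sustained around the clock, so the results are rough bounds. The slides do not say how many game events one transaction carries, so these numbers are not directly comparable with the events-per-day figure.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def per_day(rate_per_second: int) -> int:
    """Daily volume for a rate that is sustained for a full day."""
    return rate_per_second * SECONDS_PER_DAY


current = per_day(5_000)         # rate quoted on the slide ("over 5000")
projected = per_day(5_000 * 10)  # anticipated 10x increase

if __name__ == "__main__":
    print(f"current:   {current:,} transactions/day")
    print(f"projected: {projected:,} transactions/day")
```

At the quoted rate this is 432 million transactions per day today and about 4.3 billion per day after a 10x increase.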
Analytics Pipeline 
• Working towards “zero-latency” pipeline 
– Latency = ETL, summarization, reporting & dashboard 
– Already reduced from 24 hours to 1 hour in last year
Cloud Insights 
• Agility (Time to Deliver) 
• Elasticity – scale up/down quickly 
– Auto Scaling is critical 
• Service simplification (RDS/Elasticache/ELB) 
• Professional development for OPS Team 
– Physical (Datacenter/Network focus) vs. Virtual (DevOps focus)
Cloud Insights – Lessons Learned 
• Reliability & performance consistency vary
• Stuff breaks often 
– Develop an “anti-fragile” mindset; build to anticipate failure 
• Cost-predictability still elusive 
• Orphaned servers 
– Easy to create; must constantly clean up 
• Large-scale monitoring is hard 
– No silver bullet yet
Thank You 
• Thanks to the GREE OPS & Engineering Teams!
eduardo.saito@gree.net 
nick.dor@gree.net 
• We’re Hiring DevOps Team Members!! 
http://gree-corp.com/jobs
Please give us your feedback on this 
presentation 
MBL303 
As a thank you, we will select prize 
winners daily for completed surveys!

 
History+of+E-commerce+Development+in+China-www.cfye-commerce.shop
History+of+E-commerce+Development+in+China-www.cfye-commerce.shopHistory+of+E-commerce+Development+in+China-www.cfye-commerce.shop
History+of+E-commerce+Development+in+China-www.cfye-commerce.shop
 
Latest trends in computer networking.pptx
Latest trends in computer networking.pptxLatest trends in computer networking.pptx
Latest trends in computer networking.pptx
 
The+Prospects+of+E-Commerce+in+China.pptx
The+Prospects+of+E-Commerce+in+China.pptxThe+Prospects+of+E-Commerce+in+China.pptx
The+Prospects+of+E-Commerce+in+China.pptx
 
JAVIER LASA-EXPERIENCIA digital 1986-2024.pdf
JAVIER LASA-EXPERIENCIA digital 1986-2024.pdfJAVIER LASA-EXPERIENCIA digital 1986-2024.pdf
JAVIER LASA-EXPERIENCIA digital 1986-2024.pdf
 
This 7-second Brain Wave Ritual Attracts Money To You.!
This 7-second Brain Wave Ritual Attracts Money To You.!This 7-second Brain Wave Ritual Attracts Money To You.!
This 7-second Brain Wave Ritual Attracts Money To You.!
 
test test test test testtest test testtest test testtest test testtest test ...
test test  test test testtest test testtest test testtest test testtest test ...test test  test test testtest test testtest test testtest test testtest test ...
test test test test testtest test testtest test testtest test testtest test ...
 
Internet-Security-Safeguarding-Your-Digital-World (1).pptx
Internet-Security-Safeguarding-Your-Digital-World (1).pptxInternet-Security-Safeguarding-Your-Digital-World (1).pptx
Internet-Security-Safeguarding-Your-Digital-World (1).pptx
 
1.Wireless Communication System_Wireless communication is a broad term that i...
1.Wireless Communication System_Wireless communication is a broad term that i...1.Wireless Communication System_Wireless communication is a broad term that i...
1.Wireless Communication System_Wireless communication is a broad term that i...
 
一比一原版(SLU毕业证)圣路易斯大学毕业证成绩单专业办理
一比一原版(SLU毕业证)圣路易斯大学毕业证成绩单专业办理一比一原版(SLU毕业证)圣路易斯大学毕业证成绩单专业办理
一比一原版(SLU毕业证)圣路易斯大学毕业证成绩单专业办理
 
Comptia N+ Standard Networking lesson guide
Comptia N+ Standard Networking lesson guideComptia N+ Standard Networking lesson guide
Comptia N+ Standard Networking lesson guide
 
Multi-cluster Kubernetes Networking- Patterns, Projects and Guidelines
Multi-cluster Kubernetes Networking- Patterns, Projects and GuidelinesMulti-cluster Kubernetes Networking- Patterns, Projects and Guidelines
Multi-cluster Kubernetes Networking- Patterns, Projects and Guidelines
 
How to Use Contact Form 7 Like a Pro.pptx
How to Use Contact Form 7 Like a Pro.pptxHow to Use Contact Form 7 Like a Pro.pptx
How to Use Contact Form 7 Like a Pro.pptx
 
BASIC C++ lecture NOTE C++ lecture 3.pptx
BASIC C++ lecture NOTE C++ lecture 3.pptxBASIC C++ lecture NOTE C++ lecture 3.pptx
BASIC C++ lecture NOTE C++ lecture 3.pptx
 
一比一原版(CSU毕业证)加利福尼亚州立大学毕业证成绩单专业办理
一比一原版(CSU毕业证)加利福尼亚州立大学毕业证成绩单专业办理一比一原版(CSU毕业证)加利福尼亚州立大学毕业证成绩单专业办理
一比一原版(CSU毕业证)加利福尼亚州立大学毕业证成绩单专业办理
 

AWS re:Invent 2013 - MBL303 Gaming Ops - Running High-performance Ops for Mobile Gaming

  • 1. Gaming Ops Running High-Performance Ops for Mobile Gaming Eduardo Saito – Director, Engineering Nick Dor – Sr. Director, Engineering GREE International Friday, November 15, 2013 © 2013 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.
  • 2.
  • 3. Agenda • Part 1 – Lessons Learned – Incident Management – Change Management – Auto-scale – Cloud Optimization Tools and Capacity Planning • Part 2 – Game Architecture, Analytics & Monetization – Game Architecture – Moving a live game – Analytics & Monetization – Cloud Insights
  • 4. Incident Management NOC Ops SME (Network, DBA,…) Dev Other monitoring tools… Triage Escalation Communication
  • 5. NOC, automated. Diagram labels: Ops; Dev; Critical; Critical; Non-Critical; Other monitoring tools… Questions Ops has to ask on an application-level issue: “Who’s the dev of this game? Phone #?” “I can’t find the dev… who’s his manager?” “Oh, the problem is in the backend service, who’s the dev for that service?”
  • 6. Alert Workflow - DevOps way: each alert goes directly to the team that can resolve it. Teams: Ops; Dev, Game X, Server; Dev, Game Y, Client/iOS; Dev, Service A; Dev, Service B; Analytics
  • 7. Alerts go to the person that can resolve it
      Type | Scope | Checked by | Who to page?
      ELB | Load balancer health-check | ELB | No one – email alert only
      System-level | Check cpu / disk / memory / network | Pingdom / Nagios | Ops team
      App-level | Application issues / bugs | Pingdom | Dev and Ops teams
      App-level alerts can be triggered by issues in: Server-side; Client-side; iOS; Android
  • 8. Dev and Ops are responsible
      Team | In pager duty
      Ops | 8
      Dev | 32, from ~20 games (server-side or client-side, Android or iOS developers)
      Analytics | 5
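The routing table on slide 7 can be sketched as a small lookup. This is only an illustration under stated assumptions, not GREE's actual tooling: the `Alert` type, the `route` function, and the target strings such as `dev:game-y:ios` are hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Alert:
    kind: str                        # "elb", "system", or "app"
    game: str                        # hypothetical game identifier
    component: Optional[str] = None  # for app alerts: "server", "ios", "android"


def route(alert: Alert) -> List[str]:
    """Return the pager targets for an alert; an empty list means email only."""
    if alert.kind == "elb":
        return []            # ELB health-check: nobody is paged, email alert only
    if alert.kind == "system":
        return ["ops"]       # system-level: Ops team only
    if alert.kind == "app":
        if alert.component is None:
            raise ValueError("app-level alerts need a component")
        # app-level: the Dev team that owns this part of this game, plus Ops
        return [f"dev:{alert.game}:{alert.component}", "ops"]
    raise ValueError(f"unknown alert kind: {alert.kind}")
```

For example, `route(Alert("app", "game-y", "ios"))` pages the iOS developers of that game together with Ops, while an ELB health-check failure pages no one.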
  • 9. Big, Simple Status Dashboard
  • 10. Big dashboard = quick status
  • 11. Big dashboard = meta monitoring
  • 12. Use IM Bot for status: the IM Bot informs the game channel that an alert was triggered; both Ops and Dev receive the alert and troubleshoot (IM Bot = collaboration); the IM Bot detects that the issue is resolved and sends an all-clear (IM Bot = transparency)
  • 13. Review your incidents and alerts • Monday morning incident review meeting – Weekly on-call hand-over – Address false-positives / fine-tune your monitoring – Heads-up for events / major releases • Problem management – Any major or recurrent incident = Problem – Problem = requires post-mortem – Remediation items from post-mortem also tracked weekly till closure
  • 14. Incident Management Lessons Learned • Use automatic paging/escalation tools • Make the alerts go to the right team directly • Use big display dashboard • Use IM-bots to communicate outages • Do weekly reviews of the incidents / alerts • Do post-mortems, follow-up on remediation items
  • 15. Agenda • Part 1 – Lessons Learned – Incident Management – Change Management – Auto-scale – Cloud Optimization Tools and Capacity Planning • Part 2 – Game Architecture, Analytics & Monetization – Game Architecture – Moving a live game – Analytics & Monetization – Cloud Insights
  • 16. Change Management
      Type | Content | Owner | Tool
      Configuration Management | 3rd-party packages and configuration | Ops | Puppet
      Release – code deploy | 1st-party code | Dev | Jenkins + in-house scripts
      Release – asset deploy | 1st-party images / new game content / new missions | Dev | Jenkins + in-house scripts
  • 17. Configuration Management (workflow diagram): Ops make changes / test locally; push; peer review; syntax validation (if not good, back to Ops); pull changes to prod Puppet; Puppet clients (prod servers) pull changes
  • 18. Configuration Management Benefits • Automate and speed-up deployment • Repeatable • Declarative modules/manifests = documentation • All prod changes: – peer-reviewed via pull-requests in Git – validated by Puppet lint – locally tested via Vagrant (every component has a Vagrant VM) – communicated through email and IM
  • 19. Change Management
      Type | Content | Owner | Tool
      Configuration Management | 3rd-party packages and configuration | Ops | Puppet
      Release – code deploy | 1st-party code | Dev | Jenkins + in-house scripts
      Release – asset deploy | 1st-party images / new game content / new missions | Dev | Jenkins + in-house scripts
  • 20. Release Management – Code deploy. Diagram labels: dev, push, dev host, S3, Deploy, QA, Beta, Prod. Deploys are announced in the QA/dev channel of that project and, if Prod deploy, in the Ops channel of that project.
  • 21. Change Management
      Type | Content | Owner | Tool
      Configuration Management | 3rd-party packages and configuration | Ops | Puppet
      Release – code deploy | 1st-party code | Dev | Jenkins + in-house scripts
      Release – asset deploy | 1st-party images / new game content / new missions | Dev | Jenkins + in-house scripts
  • 22. Release Management – Asset deploy (flowchart nodes): Dev kicks off new asset deploy job; Run validation; Warns?; Override?; Code Review; Ops approval; Deploy to prod (branch labels: Yes, Yes, No)
  • 23. Change Management Lessons Learned • Changes are made directly by the team that is responsible for that code – 3rd. party code is configuration management = owned by Ops – 1st. party code is release management = owned by Dev • Changes are made through tools – Configuration management through Puppet – Release management through Jenkins + internal tool • No change is done manually • All changes are communicated and tracked
  • 24. Agenda • Part 1 – Lessons Learned – Incident Management – Change Management – Auto-scale – Cloud Optimization Tools and Capacity Planning • Part 2 – Game Architecture, Analytics & Monetization – Game Architecture – Moving a live game – Analytics & Monetization – Cloud Insights
  • 25. Auto-scale use-cases –On-demand • for the daily traffic fluctuations and organic growth –Scheduled • for in-game events
  • 26. Auto-scale on-demand and scheduled (chart series: CPU, # instances in ELB, # auto-scale instances)
  • 27. Scheduled Auto-scale 1 - Scheduled pre-provisioning config enabled. Scheduled action: as-put-scheduled-update-group-action ccios-app-ScheduledUpFriday --auto-scaling-group ccios-app-asg --recurrence "00 17 * * 5" --min-size 16
  • 28. Scheduled Auto-scale 2 - Spare capacity in place, ready for event
  • 29. Scheduled Auto-scale 3 - Event starts, 4x spike
  • 30. On-demand Auto-scale 4 – On-demand auto-scale reacts to CPU above 60% and adds more servers. On-demand policy: as-put-scaling-policy ccios-app-ScaleUpPolicy60 --auto-scaling-group ccios-app-asg --adjustment=8 --type ChangeInCapacity
  • 31. On-demand Auto-scale 5 - Scheduled pre-provisioning config is removed. Scheduled action: as-put-scheduled-update-group-action ccios-app-ScheduledDownFriday --auto-scaling-group ccios-app-asg --recurrence "0 21 * * 5" --min-size 2
  • 32. On-demand Auto-scale 6 – On-demand auto-scale terminates some instances as CPU drops below 40%. On-demand policy: as-put-scaling-policy ccios-app-ScaleDownPolicy40 --auto-scaling-group ccios-app-asg --adjustment=-2 --type ChangeInCapacity
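The interaction between the scheduled floor and the on-demand policies in slides 27 to 32 can be sketched in a few lines. This is a simplified model, not how the Auto Scaling service is implemented: it ignores alarm evaluation periods and cooldowns, `MAX_SIZE` is a hypothetical ceiling that does not appear in the slides, and the function names are made up for illustration. The thresholds, step sizes, and the Friday 17:00 to 21:00 window are taken from the commands above.

```python
from datetime import datetime

SCALE_UP_CPU = 60.0     # slide 30: add servers when CPU is above 60%
SCALE_DOWN_CPU = 40.0   # slide 32: remove servers when CPU drops below 40%
SCALE_UP_STEP = 8       # --adjustment=8
SCALE_DOWN_STEP = -2    # --adjustment=-2
MAX_SIZE = 64           # hypothetical ceiling, not from the slides


def scheduled_min_size(now: datetime) -> int:
    """Minimum group size set by the two scheduled actions.

    "00 17 * * 5" raises the floor to 16 on Fridays at 17:00 and
    "0 21 * * 5" lowers it back to 2 on Fridays at 21:00.
    """
    # datetime.weekday(): Monday is 0, so Friday is 4 (cron day-of-week 5).
    in_event_window = now.weekday() == 4 and 17 <= now.hour < 21
    return 16 if in_event_window else 2


def desired_capacity(current: int, cpu_percent: float, now: datetime) -> int:
    """Apply the on-demand step, then clamp to the scheduled floor and ceiling."""
    if cpu_percent > SCALE_UP_CPU:
        change = SCALE_UP_STEP
    elif cpu_percent < SCALE_DOWN_CPU:
        change = SCALE_DOWN_STEP
    else:
        change = 0
    return max(scheduled_min_size(now), min(MAX_SIZE, current + change))
```

During the event window the floor of 16 wins even when CPU is low, which is why the scale-down policy only starts removing servers after the scheduled floor has been lowered again.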
  • 33. Auto-scale bootstrap workflow
      Event | Description | Duration
      Cloudwatch alarm is triggered | E.g. CPU > 60% for 5 minutes | 5 minutes
      Auto-scale policy is executed | Launches n new instances | 2 minutes
      User-data script is executed | Defined in the auto-scale launch config. Installs base packages, gets instance_id, IP and hostgroup | 1 minute
      Bootstrap script is executed | Loaded from S3. Renames host, runs Puppet, deploys code, starts web service | 11 minutes
      Health-check passes and servers start to get traffic | Health-check must pass before ELB starts to send traffic to the new host | 1 minute
  • 34. Auto-scale external dependencies
      Dependency | How to resolve
      Configuration Management (Puppet/Chef) | Pre-load all necessary packages in the AMI / architect HA for config management
      External Repo | Pre-load all necessary packages in the AMI / set up internal HA repo
      Code deploy | Same as above, or put in S3
      Monitoring registration | Make it asynchronous
      Server registration | Make it asynchronous
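Adding up the durations on slide 33 gives the time from a CPU spike to new servers taking traffic. The short sketch below only does that arithmetic; the stage names and minutes come from the slide, while the variable names are illustrative.

```python
# (stage, minutes) pairs as listed on slide 33
stages = [
    ("Cloudwatch alarm is triggered", 5),
    ("Auto-scale policy is executed", 2),
    ("User-data script is executed", 1),
    ("Bootstrap script is executed", 11),
    ("Health-check passes", 1),
]

total_minutes = sum(minutes for _, minutes in stages)
slowest_stage, slowest_minutes = max(stages, key=lambda stage: stage[1])

print(f"time to first traffic: {total_minutes} minutes")
print(f"slowest stage: {slowest_stage} ({slowest_minutes} minutes)")
```

The bootstrap script accounts for 11 of the 20 minutes, which is consistent with the later advice to pre-install base packages into the AMI.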
  • 35. Auto-scale Lessons Learned • Reduce time to spin-up new instances: – Pre-install all base packages into AMI • Address those risks: – on-demand and scheduled AS conflicts – bootstrap validation and graceful termination – health-checks: keep it simple – keep some servers out of auto-scale pool, just in case – map and resolve/monitor external dependencies for auto-scale – consider using 2 different thresholds, for quicker ramp-up
  • 36. Agenda • Part 1 – Lessons Learned – Incident Management – Change Management – Auto-scale – Cloud Optimization Tools and Capacity Planning • Part 2 – Game Architecture, Analytics & Monetization – Game Architecture – Moving a live game – Analytics & Monetization – Cloud Insights
  • 37. Cloud Optimization areas
      Usage: under-utilized hosts; overloaded hosts; EBS/ELB not in use
      Security: exposed DBs; EC2 behind ELB exposed directly
      Availability: AZ / region distribution; backup audit; un-healthy instances in ELB; ELB misconfigs
      Cost: optimal # of RI; hosts outside RI; cost break-down using tags; estimate on-demand costs
  • 38. Cloud Optimization tools AWS Trusted Advisor 3rd. Party commercial tools Open Source tools (eg. Netflix Ice) In-house tools Excel !
  • 39. Cloud Optimization Lessons Learned • Try Trusted Advisor • Pilot 3rd.-party solutions • Evaluate what metrics are important for each component of your architecture • Do in-house development for other optimizations you need that are not covered by TA or 3rd. party solutions • Tag all assets! Automate tagging!
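One reason to tag every asset is that a cost break-down then becomes a simple group-by. The sketch below assumes a hypothetical inventory format (a list of dicts with `tags` and `monthly_cost`); the data, the `game` tag key, and the `cost_by_tag` helper are made up for illustration and do not come from the talk.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, Mapping


def cost_by_tag(resources: Iterable[Mapping[str, Any]], tag_key: str) -> Dict[str, float]:
    """Sum monthly cost per value of one tag; untagged assets get their own bucket."""
    totals: Dict[str, float] = defaultdict(float)
    for resource in resources:
        label = resource.get("tags", {}).get(tag_key, "untagged")
        totals[label] += resource["monthly_cost"]
    return dict(totals)


# Hypothetical inventory, for illustration only.
inventory = [
    {"id": "i-aaa", "tags": {"game": "game-x"}, "monthly_cost": 100.0},
    {"id": "i-bbb", "tags": {"game": "game-x"}, "monthly_cost": 50.0},
    {"id": "i-ccc", "tags": {"game": "game-y"}, "monthly_cost": 75.0},
    {"id": "vol-ddd", "tags": {}, "monthly_cost": 25.0},
]

print(cost_by_tag(inventory, "game"))
```

The `untagged` bucket is the useful part of the design: it shows how much spend cannot be attributed yet, which is a direct measure of how well tagging is automated.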
  • 40. Agenda • Part 1 – Lessons Learned – Incident Management – Change Management – Auto-scale – Cloud Optimization Tools and Capacity Planning • Part 2 – Game Architecture, Analytics & Monetization – Game Architecture – Moving a live game – Analytics & Monetization – Cloud Insights
  • 41. GREE Games • All Mobile, all Free-to-Play – iOS & Android smart phones – Big focus on tablets • Role Playing Games (RPG) – Multi-million dollar franchise, top-grossing titles – Some of the oldest games on the App Store • Hardcore – Deeper more intense gameplay mechanics • Real-Time Strategy (RTS) – Fast action, small unit management • Casino & Casual Games – Familiar games, wider audience, casual play
  • 42. Example Game Architecture – RPG • Application Servers – PHP – Game events -> Analytics • Cache Layer – Memcached -> Elasticache • Batch Processing Servers – Node.js (moving to Go) – Batches database writes • Database – MySQL -> RDS (diagram: ELB; App x4; Cache x4; Batch x2; RDS x3; Failover DB)
  • 43. Caching Strategy - Current • Game architecture predates stable NoSQL – We wanted similar performance at scale – Keep combined average internal response times below 300ms • Memcache Authoritative – Still use an RDBMS; potential data loss is limited • Allows for cheaper/simpler DB layer – Always do full row replacements (ie: no current_row_value +1)
  • 44. Data Flow • Reads – ELB -> App -> Cache • Writes (Synchronous) – ELB -> App -> Cache -> DB – ELB -> App -> Cache -> Batch -> DB – Standard write-through – No blind writes; always fetch current ver. • Writes (Asynchronous) – Batch -> DB – Batch writes to DB every 30 seconds (diagram: ELB; App x4; Cache x4; Batch x2; RDS x3)
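The "no blind writes" rule combined with full row replacement can be illustrated with a versioned in-memory cache. This is a sketch under assumptions, not the game's code: `VersionedCache` is a hypothetical stand-in for the Memcached layer, and the `pending` list stands in for whatever is handed to the batch tier.

```python
from typing import Any, Dict, List, Tuple

Row = Dict[str, Any]


class VersionedCache:
    """In-memory stand-in for an authoritative cache with versioned rows."""

    def __init__(self) -> None:
        self.rows: Dict[str, Tuple[int, Row]] = {}     # key -> (version, full row)
        self.pending: List[Tuple[str, int, Row]] = []  # writes queued for the batch tier

    def read(self, key: str) -> Tuple[int, Row]:
        """Return (version, row); version 0 means the key does not exist yet."""
        version, row = self.rows.get(key, (0, {}))
        return version, dict(row)

    def write(self, key: str, seen_version: int, new_row: Row) -> bool:
        """Replace the full row only if the caller saw the current version."""
        current_version, _ = self.rows.get(key, (0, {}))
        if seen_version != current_version:
            return False  # stale read: the caller must re-read and retry
        new_version = current_version + 1
        self.rows[key] = (new_version, dict(new_row))
        self.pending.append((key, new_version, dict(new_row)))
        return True
```

A writer always calls `read` first and passes the version it saw to `write`; a second writer holding an older version is rejected instead of silently overwriting the newer row.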
  • 45. Batch Processor • 80% of game write traffic is Async – Each write is versioned • Example: Player items (loot) after multiple quests – 10 items in 30 sec; app server sends 10 writes downstream – Batch processor sends last record with final item count to DB • Greatly reduced writes on DB – Shard at table and DB server level for larger games
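The loot example on slide 45 amounts to keeping only the highest-versioned record per key within each batch window. A minimal sketch, assuming each queued write is a `(key, version, row)` tuple; the tuple layout and the `coalesce` name are illustrative and not taken from the talk.

```python
from typing import Any, Dict, Iterable, Tuple

Row = Dict[str, Any]
Write = Tuple[str, int, Row]  # (key, version, full row)


def coalesce(writes: Iterable[Write]) -> Dict[str, Tuple[int, Row]]:
    """Keep only the newest version of each key from one batch window."""
    latest: Dict[str, Tuple[int, Row]] = {}
    for key, version, row in writes:
        if key not in latest or version > latest[key][0]:
            latest[key] = (version, row)
    return latest


# Ten item pickups within one 30-second window become a single DB write.
window = [("player:1:items", v, {"count": v}) for v in range(1, 11)]
to_flush = coalesce(window)
print(to_flush)
```

Because every write is a full row replacement, dropping the intermediate versions loses nothing: the last record already contains the final item count.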
  • 46. Near Future Trends for GREE OPS • Multi-region games – Latency-sensitive games and the shift towards real-time – Geographic data replication challenges • Continuous Delivery • Automation of Game Studio tasks – Game design, art, data/asset deploy – Tighter event pre-provisioning and scale-down
  • 47. More Performance – Lower Costs • Facebook HipHop Virtual Machine – JIT compilation & execution of PHP – 5x faster vs. Zend PHP 5.2 – Achieved 3x to 4x reduction in application server count – https://github.com/facebook/hhvm • Google GO – Used for high-concurrency applications – Achieved 2x reduction in batch processing servers vs. Node.js – http://golang.org
  • 48. Agenda • Part 1 – Lessons Learned – Incident Management – Change Management – Auto-scale – Cloud Optimization Tools and Capacity Planning • Part 2 – Game Architecture, Analytics & Monetization – Game Architecture – Moving a live game – Analytics & Monetization – Cloud Insights
  • 49. Moving a Game – Why? • Physical datacenter to AWS – West coast -> East coast – Faster access to EU markets & players • Reduce necessary attention to infrastructure – Caching & DB layer; custom high-availability middleware • Take advantage of cloud provisioning – Scripted instance spin-ups, auto-scaling for events/load • Save money – Reduce stand-by server pool – Provision for average load, not peak
  • 50. Moving a Live Game – Whaaaaat? • Live game, two platforms (iOS, Android) – Several million $$$ in combined monthly revenue – More than one million unique players/month • ~ 30GB Dataset • Minimal downtime (< 5 minutes) – Mostly to allow for change to reverse proxy config • Debian -> CentOS • Physical machines -> AWS
  • 51. Moving a Live Game - How • Develop timeline • R&D & architecture review • Data migration & sync • Game server/client updates • Load testing • D-Day steps & checklist
  • 52. Moving a Live Game - Timeline • 3 months overall • DB dataset transfer validation – Set up direct MySQL to RDS replication – Initial DB transfer time: approx. 8 hours • Functional & performance testing – Load & capacity profile for application, DB servers – Heavy use of APM metrics – New Relic
  • 53. Moving a Live Game - Architecture • Changes required – Caching – discrete memcached to Elasticache nodes – Database – physical MySQL DB servers to RDS • Decided to drop internally developed MySQL proxy – Bittersweet: great automatic failover; limited internal knowledge • RDS failover mechanics added to possible game downtime – Load balancers • LVS to ELB • Processes – Code asset deployment
  • 54. Moving a Live Game – D-Day • Put game into maintenance (shutdown) • Break DB replication (west -> east) • Set up reverse proxy in datacenter – Forward traffic from west -> east AWS ELB • Bring game back online – Reverse proxy sends traffic to AWS • Update DNS to point to ELB – Wait for DNS propagation – Slow DNS updates hit the reverse proxy in datacenter
  • 55. Moving a Live Game – Before & After
  • 56. Agenda • Part 1 – Lessons Learned – Incident Management – Change Management – Auto-scale – Cloud Optimization Tools and Capacity Planning • Part 2 – Game Architecture, Analytics & Monetization – Game Architecture – Moving a live game – Analytics & Monetization – Cloud Insights
  • 57. Analytics & Monetization • Specialize in “Live Events” – Higher player engagement (fun!) = more revenue • Single-player events – “Epic Boss” – Limited-time quests • Player organization events – Guild vs. Guild battles (World Domination, Syndicate Wars) – Raid Bosses – members help to take down a tough NPC – Tap into social “meta-gaming”
  • 58. Modern War World Domination Results (August 2013)
  • 59. Analytics for Player Engagement • Player retention – 1st week and beyond – Tutorial completion rates • Balancing mechanics – Player vs. Environment (PvE), Player vs. Player (PvP) – Encourage interaction with other players • When too much good can be bad – Analytics needs to be paired with player feedback – Fun for all players, payers AND non-payers
  • 60. Analytics for Decision-making • Devices & Markets – Understand most popular devices (esp. Android) – Focus efforts on the top devices for your market • Launching a game – “Soft-launch” – only launch in certain markets, tune game – “Hard-launch” – money down (marketing), marquee live events • When to sunset & decommission – Depends on strategic goals, infra/engineering costs, etc.
  • 61. Analytics – Some Scale • Over 5000 transactions/sec sent to Analytics • Several billion game events per day – Attacking, winning, losing, buying, clicking, swiping, etc. • Anticipating 10x increase in next two years • Building petabyte scale data warehouse capacity
  • 62. Analytics Pipeline • Working towards “zero-latency” pipeline – Latency = ETL, summarization, reporting & dashboard – Already reduced from 24 hours to 1 hour in last year
  • 63. Agenda • Part 1 – Lessons Learned – Incident Management – Change Management – Auto-scale – Cloud Optimization Tools and Capacity Planning • Part 2 – Game Architecture, Analytics & Monetization – Game Architecture – Moving a live game – Analytics & Monetization – Cloud Insights
  • 64. Cloud Insights • Agility (Time to Deliver) • Elasticity – scale up/down quickly – Auto Scaling is critical • Service simplification (RDS/Elasticache/ELB) • Professional development for OPS Team – Physical (Datacenter/Network focus) vs. Virtual (DevOps focus)
  • 65. Cloud Insights – Lessons Learned • Reliability & performance consistency varies • Stuff breaks often – Develop an “anti-fragile” mindset; build to anticipate failure • Cost-predictability still elusive • Orphaned servers – Easy to create; must constantly clean up • Large-scale monitoring is hard – No silver bullet yet
  • 66. Thank You • Thanks to the GREE OPS & Engineering Teams! eduardo.saito@gree.net nick.dor@gree.net • We’re Hiring DevOps Team Members!! http://gree-corp.com/jobs
  • 67. Please give us your feedback on this presentation MBL303 As a thank you, we will select prize winners daily for completed surveys!

Editor's Notes

  1. Tip #1 is to automate the role of the NOC. Traditionally, some companies have a NOC team that is responsible for: - collecting all the alerts from the different monitoring systems - doing the triage - assessing the severity of the alert - following run-books - eventually escalating the issue to the Ops team or Dev teams. This model has 2 problems: - by the time the right person is involved to troubleshoot, the incident has already been going on for 10-15 minutes - of course you need to maintain a NOC team. - show human requests in the same diagram, as above - show % of alerts in each bucket: app / system / email
  2. We learned that a NOC function can be mostly automated using services like PagerDuty. But for that to be automated, you need to do some of the triage at the source of the alerts: critical alerts should go to PagerDuty; non-critical alerts can be sent by email and handled later during business hours. Tip #2 is to add the developers on-call in parallel to Ops. This workflow is common in many companies: Ops gets all the pages, tries to resolve them, and may eventually escalate the issue to Dev if it’s a problem in the application code. The problem here is similar to the one with a NOC: Ops is still doing triage to decide which system-level alerts they can resolve themselves, and escalating manually when it’s an application issue that requires developer involvement. Effectively, Ops would be doing part of the NOC role of triage and escalation. The Ops engineer would have to make that awkward call to Dev at 4am: “sorry man to call at this hour so late in the night… your game is down, and I need you to get online and help me fix it…” Following the generic DevOps recommendation, some companies started to put Dev on the Ops on-call rotation, so that Dev can also “feel the pain” and help fix the root cause of the alerts, but that doesn’t work well when you have multiple projects (in our case, games) and each dev knows only 1 of the games.
  3. Instead of putting Dev in the Ops on-call rotation, we found that we had to have the Dev teams receive pages at the same time as Ops, with each Dev team receiving only the alerts from the application it developed. GREE currently has about 20 games, counting versions for Android and iOS. On the development side, each game has a server-side development team and usually 2 client-side teams: 1 for Android and 1 for iOS. An application-level alert, for instance an issue of the Android client crashing, would page the Android client-side developers of that game.
  4. And that leads us to tip #3: you need to standardize and clearly separate which team is the right one to handle each alert. At GREE, we ask each game team to provide us with 3 different health-checks. The first one is a simple check for the Elastic Load Balancer. It is used to automatically remove a server from rotation if it breaks. No one is paged if that happens, as a new server is created automatically by auto-scale. The second health-check detects system-level issues, like disk / network, and checks backends like the database and Memcache, including whether replication is in sync. If it fails, it alerts *ONLY* the Ops team, as Ops is able to resolve this kind of issue without Dev assistance. The last type of health-check is for application-level issues, usually bugs in the application. Those bugs generate warnings and errors that, if above a certain threshold, alert *BOTH* the Dev and Ops teams, as Ops will likely need Dev assistance to troubleshoot. The application-level alerts can be further broken down into server-side or client-side, and client-side into the iOS or Android platform. That way, the alert goes only to the exact team responsible for the part of the code that is malfunctioning.
  5. Replace with photos. Add a *many red* dashboard to illustrate what a major outage looks like. 5) - Service Desk / Monitoring - show before / after - before: traditional escalation model - Ops do triage, escalate to engineering manually, need to look up list of contacts per game - after: DevOps, page engineering directly for app-level alerts / page correct engineer for each game - urgent: page via email + PagerDuty integration - non-urgent: open ticket via email + Jira integration - show diagram with workflow of email/Pingdom/Zabbix alerts for email (non-urgent) / PagerDuty (urgent) - show human requests in the same diagram, as above - show % of alerts in each bucket: app / system / email - Incident Management + Problem Management - System-level monitoring - alerts go to Ops - Application-level monitoring - alerts go to Ops and Devs - Availability Management - Pingdom (external) + Zabbix (internal) + PagerDuty - Skypebot (mention hackathon origin) - Status dashboard screenshot (mention hackathon) - add screenshots of status dashboard green / one-red / many-red - Monday meetings to discuss previous week's incidents - chart: number of email alerts decreases over time / warns/errors decrease - RCA meetings for major incidents (GLUE? for transition between topics - put Release Mgmt first, and glue with Incident Management when things go awry / or other way - incident caused by a release, then transition to Release)
  6. Tip #4 is to put a Big Status Dashboard near your Ops team. A Big Dashboard allows us to very quickly view the status of all of our games.