Everything I Learned About Scaling Online Games I Learned at Google and eBay [Part 2, QConBeijing 2014]



While the worlds of ecommerce, search, and application platforms might seem as far from the gaming industry as one might imagine, lessons learned in those environments are surprisingly applicable to online games. Real-time games in particular face many of the same challenges faced -- and solved -- by companies like eBay and Google. They are extremely latency-sensitive, are subject to unpredictable growth and scalability curves, and exhibit extremely spiky load profiles. The real-time player experience is critical to the success of a game -- if a game is down or slow, players will leave and never come back. This session will discuss how experiences with large-scale websites like eBay and Google have informed our approach to building, testing, and operating real-time games at KIXEYE.

This session tells several war stories from eBay and Google around Scaling Code, Scaling Infrastructure, Scaling Performance, and Scaling DevOps. It further puts it all together by connecting those experiences with what we are now doing in our next-generation gaming platform at KIXEYE.

See also Part 1 of this topic, presented at QCon San Francisco 2013.

Published in: Internet, Technology

  1. Everything I Learned About Scaling Online Games I Learned at Google and eBay
     Randy Shoup
     @randyshoup
     linkedin.com/in/randyshoup
  2. Background
     CTO at KIXEYE
     • Real-time strategy games for web and mobile
     Director of Engineering for Google App Engine
     • World’s largest Platform-as-a-Service
     Chief Engineer at eBay
     • Multiple generations of eBay’s real-time search infrastructure
  3. Real-Time Strategy Games are …
     • Real-time
     • Spiky
     • Computationally intensive
     • Constantly evolving
     • Constantly pushing boundaries
     -> Technically and operationally demanding
  4. How to Scale
     • Scaling Code
     • Scaling Infrastructure
     • Scaling Performance
     • Scaling DevOps
  5. How to Scale
     • Scaling Code
     • Scaling Infrastructure
     • Scaling Performance
     • Scaling DevOps
  6. Embrace Open Source
     Try someone else’s code first
     • Faster to get started, lower development cost
     • Open source projects are often higher quality, more extensible, better tested
     • Take advantage of talent outside your company
     Avoid “Not-Invented-Here” Attitude
     • (-) Google and eBay “exceptionalism”
     • Default has been to write it in-house instead of reuse and contribute
  7. Embrace Standard Data Formats
     Use standard formats
     • Well-tested and widely-used
     • Internationalization from the beginning
     Time in UTC
     • (-) eBay and Google use local US-Pacific time
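The "Time in UTC" advice on slide 7 can be sketched roughly as follows: store timezone-aware UTC timestamps and convert to a local zone only for display. This is a minimal illustration, not code from the talk; `to_utc` and the fixed UTC-8 offset standing in for US Pacific standard time are assumptions made for the example.

```python
from datetime import datetime, timedelta, timezone

# Fixed UTC-8 offset as a stand-in for US Pacific standard time (assumption);
# a real system would use a tz database zone so daylight saving is handled.
PACIFIC_STANDARD = timezone(timedelta(hours=-8))


def to_utc(local_dt: datetime) -> datetime:
    """Normalize a timezone-aware datetime to UTC before storing it."""
    if local_dt.tzinfo is None:
        raise ValueError("naive datetime: attach a timezone before storing")
    return local_dt.astimezone(timezone.utc)


# A hypothetical event recorded in local Pacific time.
event_local = datetime(2014, 1, 15, 9, 30, tzinfo=PACIFIC_STANDARD)
event_utc = to_utc(event_local)

print(event_utc.isoformat())  # 2014-01-15T17:30:00+00:00
```

Rejecting naive datetimes at the storage boundary is one way to keep local-time values from leaking into persisted data.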
  8. Embrace Standard Data Formats
     Character set in UTF-8
     • (-) 5+ years to convert eBay site from ISO-8859-1 (Western European only) to Unicode
     Structured data format
     • Explicit structure with associated schema
     • (+) Google uses protocol buffers for schema, serialization, storage
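To make the character-set point on slide 8 concrete, here is a small sketch showing why ISO-8859-1 is "Western European only" while UTF-8 covers all of Unicode. The helper `can_encode` and the sample strings are illustrative assumptions, not part of the talk.

```python
def can_encode(text: str, encoding: str) -> bool:
    """Return True if every character in text is representable in encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


western = "café"   # fits in ISO-8859-1
chinese = "游戏"    # "game" in Chinese; outside ISO-8859-1

print(can_encode(western, "iso-8859-1"))  # True
print(can_encode(chinese, "iso-8859-1"))  # False
print(can_encode(chinese, "utf-8"))       # True
print(len(chinese.encode("utf-8")))       # 6 (three bytes per character here)
```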
  9. Development Discipline
     Quality, Reliability, Scalability are “Priority-0 features”
     • Equally important to users as product features and engaging user experience
     Developers responsible for
     • Features
     • Quality
     • Performance
     • Reliability
     • Manageability
  10. Development Discipline
      Developers write tests and code together
      • Continuous testing of features, performance, load
      • Confidence to make risky changes
      • Catch bugs earlier, fail faster
      “Don’t have time to do it right”?
      • WRONG
        – Don’t have time to do it twice (!)
      • The more constrained you are on time and resources, the more important it is to do it solidly the first time
  11. How to Scale
      • Scaling Code
      • Scaling Infrastructure
      • Scaling Performance
      • Scaling DevOps
  12. Layering
      Multiple layers
      • Client
      • Game server
      • Services
      • Persistence
  13. Micro-Services
      • Simple
      • Well-defined interface
      • Single-purpose
      • Modular and independent
      • Small teams
      • Autonomy and responsibility
  14. Google Cloud Datastore
      Cloud Datastore: NoSQL service
      • Highly scalable and resilient
      • Strong transactional consistency
      • SQL-like rich query capabilities
      Megastore: geo-scale structured database
      • Multi-row transactions
      • Synchronous cross-datacenter replication
      Bigtable: cluster-level structured storage
      • (row, column, timestamp) -> cell contents
      Colossus: next-generation clustered file system
      • Block distribution and replication
      Cluster management infrastructure
      • Task scheduling, machine assignment
  15. Reactive Servers
      Minimize request latency
      • Respond as rapidly as possible to client
      Functional Reactive + Actor model
      • Highly asynchronous, never block (!)
      • Queue events / messages for complex work
      • Heavy use of Scala / Akka and RxJava at KIXEYE
      • (-) eBay uses highly synchronous model
      • (-) Google uses complicated callback-based asynchronous model
  16. Client Liveness
      Default to background processing
      • Refresh assets
      • Save client state
      Client continues seamlessly if disconnected
      • Parallel simulation on client and server
      • Gameplay more important than constant synchronization
  17. How to Scale
      • Scaling Code
      • Scaling Infrastructure
      • Scaling Performance
      • Scaling DevOps
  18. Scalability and Performance
      Measure, Measure, Measure
      • Instrument everything: client, services, network, DB
      • Measurement beats intuition every time
      • My own intuition is usually wrong
      Attack the first bottleneck
      • Theory of Constraints: attacking *any* other problem does not improve throughput of the system
      Repeat until performance is good enough
      • “When you solve problem one, problem two gets a promotion”
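The Theory of Constraints point on slide 18 can be illustrated with a toy model in which a request passes through several stages in series, so end-to-end throughput is capped by the slowest stage. The stage names echo the layers named in the talk, but the capacity numbers and function names are invented for the example.

```python
def pipeline_throughput(stage_capacity: dict) -> float:
    """Throughput of stages in series is limited by the slowest stage."""
    return min(stage_capacity.values())


def bottleneck(stage_capacity: dict) -> str:
    """Name of the stage currently limiting throughput."""
    return min(stage_capacity, key=stage_capacity.get)


# Hypothetical capacities in requests per second.
stages = {"client": 5000, "game_server": 1200, "services": 3000, "db": 800}
print(bottleneck(stages), pipeline_throughput(stages))   # db 800

# Doubling a stage that is not the bottleneck changes nothing.
tuned_wrong = {**stages, "game_server": 2400}
print(pipeline_throughput(tuned_wrong))                  # 800

# Doubling the bottleneck helps, and the next problem "gets a promotion".
tuned_right = {**stages, "db": 1600}
print(bottleneck(tuned_right), pipeline_throughput(tuned_right))  # game_server 1200
```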
  19. Small Details Matter
      In the very large, the very small matters a *lot*
      • Subatomic physics and cosmology are interrelated
      • Particles and forces at the subatomic level controlled formation and evolution of the entire universe
      Discipline is deciding *which* details matter (!)
  20. eBay Search Index Compression
      Search Engine constrained by index size
      • Smaller index size reduces memory, CPU, I/O
      • Smaller index means fewer nodes, fewer shards
      Inverted Index
      • “Posting List”: all occurrences of [term] in documents
      • Monotonically-increasing series of integers, traversed in order
      -> Delta compression + Variable-byte encoding
      • Store deltas, not absolute numbers
      • Encode deltas so smaller numbers use fewer bits
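The delta compression plus variable-byte encoding described on slide 20 can be sketched as below. The talk does not specify eBay's actual byte layout, so this example assumes one common scheme: 7 payload bits per byte, least-significant group first, with the high bit set on every byte except the last of each number (the same idea as Protocol Buffers varints). Function names and the sample posting list are made up for illustration.

```python
def encode_postings(doc_ids):
    """Delta-encode an ascending list of doc IDs, then variable-byte encode each delta."""
    out = bytearray()
    previous = 0
    for doc_id in doc_ids:
        delta = doc_id - previous
        if delta < 0:
            raise ValueError("doc IDs must be sorted in ascending order")
        previous = doc_id
        # Emit 7 bits at a time; the high bit marks "more bytes follow".
        while delta >= 0x80:
            out.append((delta & 0x7F) | 0x80)
            delta >>= 7
        out.append(delta)
    return bytes(out)


def decode_postings(data):
    """Inverse of encode_postings: rebuild absolute doc IDs from the byte stream."""
    doc_ids = []
    current = 0   # running sum of deltas = last absolute doc ID
    delta = 0
    shift = 0
    for byte in data:
        delta |= (byte & 0x7F) << shift
        if byte & 0x80:          # continuation bit set: keep accumulating
            shift += 7
        else:                    # last byte of this delta
            current += delta
            doc_ids.append(current)
            delta = 0
            shift = 0
    return doc_ids


postings = [1000, 1003, 1010, 1500, 100000]
encoded = encode_postings(postings)
print(len(encoded), "bytes instead of", 4 * len(postings), "as fixed 32-bit integers")
print(decode_postings(encoded) == postings)
```

Because the list is traversed in order, the decoder only ever needs the running sum, so storing deltas costs nothing at read time while keeping most values small enough to fit in one or two bytes.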
  21. TOME Combat Server
      Scalability limits in TOME combat server
      • Unable to push single server beyond several hundred simultaneous players
      • All system and OS-level measurements OK
      • CPU, memory usage, I/O, threads, locking
      • Needed to use CPU-level analyzer (Intel VTune)
      Bottleneck: memory cache contention
      • Multiple cores contending on L2 cache memory
      • 40% scalability increase from six characters …
      • static Foo; -> const static Foo;
  22. Measurement and Distributions
      Gaussian (“Normal”) distribution is *not* normal
      • Applies only to quantities constrained on both sides, clustered around a mean
      • E.g., adult height and weight
      • Applies only to near-homogeneous populations
      • E.g., adult male height in North America, vs. female, vs. China, etc.
  23. Measurement and Distributions
      Power Law (“Long Tail”) distribution *much* more common
      • Latency and performance measurements
      • Popularity, income, human connections, etc.
      • Minimum is 0; maximum is infinite
      • The more you have, the more you get
  24. Measurement and Distributions
      Mean and Standard Deviation often misleading
      • Encourages you to remove outliers, even though outliers represent the real problems (!)
      • Encourages you to concentrate on the average case, not the worst case
      • “Mean is meaningless”
      Use percentiles instead (!)
      • Can reasonably characterize any distribution
      • Measure 90%ile, 99%ile, 99.9%ile
      • Highlight and focus on the *real* problems
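As a rough illustration of slide 24's advice to use percentiles rather than the mean, the sketch below computes both for a long-tailed latency sample. The sample data is invented, and `percentile` uses the simple nearest-rank definition, which is only one of several percentile conventions.

```python
import math
import statistics


def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(pct / 100 * len(ordered))
    rank = min(len(ordered), max(1, rank))
    return ordered[rank - 1]


# Made-up latencies in milliseconds: mostly fast, with a long tail.
latencies_ms = [20] * 980 + [200] * 15 + [5000] * 5

mean_ms = statistics.mean(latencies_ms)
p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)
p999 = percentile(latencies_ms, 99.9)

print(f"mean={mean_ms:.1f} p50={p50} p99={p99} p99.9={p999}")
```

In this sample the mean (47.6 ms) matches no request anyone actually experienced: the typical request takes 20 ms, while the 99.9th percentile exposes the 5-second stalls that the average hides.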
  25. How to Scale
      • Scaling Code
      • Scaling Infrastructure
      • Scaling Performance
      • Scaling DevOps
  26. Automate Everything
      Humans are always at a premium
      • Humans are too valuable for repetitive tasks
      • Machines will happily do things over and over
      Automated operations
      • Provisioning
      • Deployment
      • Alerting
      • Self-healing
  27. Autoscaling
      Games are very spiky
      • Very unpredictable
      • Huge variability between peak and trough
      • Hits are self-reinforcing
      Services and clients have to “flex”
      • Clients back off in response to latency
      • Services grow / shrink based on load
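One way a client might "back off in response to latency", as slide 27 puts it, is exponential backoff with jitter. The talk does not describe KIXEYE's actual client policy, so the function name, thresholds, and jitter scheme below are all assumptions chosen for illustration.

```python
import random


def next_request_delay(current_delay, observed_latency,
                       latency_budget=0.25, base_delay=0.1, max_delay=30.0,
                       jitter=random.random):
    """Return how many seconds the client should wait before its next request.

    All thresholds are hypothetical defaults, in seconds.
    """
    if observed_latency <= latency_budget:
        # Server looks healthy again: return to the normal request rate.
        return base_delay
    # Server is slow: double the delay, up to a cap.
    ceiling = min(max_delay, max(current_delay, base_delay) * 2)
    # Pick a point in [ceiling/2, ceiling] so clients do not retry in lockstep.
    return ceiling / 2 + jitter() * ceiling / 2


# Simulate a client that sees five slow responses, then a fast one.
delay = 0.1
for latency in [1.2, 1.5, 2.0, 1.8, 0.9, 0.05]:
    delay = next_request_delay(delay, latency)
    print(f"latency={latency:.2f}s -> wait {delay:.2f}s")
```

The `jitter` parameter is injectable so the policy can be tested deterministically; with many clients, the randomization spreads retries out instead of letting them arrive in synchronized waves.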
  28. App Engine Autoscaling
      Autoscaling as part of the Platform
      • Gracefully handle spiky application load
      • Maximize utilization of the infrastructure
      World-class application scheduler
      • Consider request rate, processing time, max wait time
      • Also instance startup time, application budget
      • Predictive model pre-provisions and proactively scales
      • Reactive autoscaling in response to load
      • Instantaneous autoscaling on request: spin up new instance(s) *while a request is coming in*
  29. Google and DevOps
      Ops Support is a privilege, not a right
      • Developers carry pager for first 6+ months
      • Service “graduates” to SRE after intensive review of monitoring, reliability, resilience, etc.
      • SRE collaborates with service to move forward
      Everyone’s incentives are aligned
      • Everyone is responsible for production
      • Everyone strongly motivated to have solid instrumentation and monitoring
  30. Recap: How to Scale
      • Scaling Code
      • Scaling Infrastructure
      • Scaling Performance
      • Scaling DevOps
  31. Thank you!
      rshoup@kixeye.com
      @randyshoup
      linkedin.com/in/randyshoup
