Everything I Learned About Scaling Online Games I Learned at Google and eBay [Part 2, QConBeijing 2014]



While the worlds of ecommerce, search, and application platforms might seem as far from the gaming industry as one might imagine, lessons learned in those environments are surprisingly applicable to online games. Real-time games in particular face many of the same challenges faced -- and solved -- by companies like eBay and Google. They are extremely latency-sensitive, are subject to unpredictable growth and scalability curves, and exhibit extremely spiky load profiles. The real-time player experience is critical to the success of a game -- if a game is down or slow, players will leave and never come back. This session will discuss how experiences with large-scale websites like eBay and Google have informed our approach to building, testing, and operating real-time games at KIXEYE.

This session tells several war stories from eBay and Google around Scaling Code, Scaling Infrastructure, Scaling Performance, and Scaling DevOps. It then ties them together by connecting those experiences with what we are now doing in our next-generation gaming platform at KIXEYE.

See also Part 1 of this topic, presented at QCon San Francisco 2013.

Published in: Internet, Technology

  1. 1. Everything I Learned About Scaling Online Games I Learned at Google and eBay Randy Shoup @randyshoup linkedin.com/in/randyshoup
  2. 2. Background CTO at KIXEYE • Real-time strategy games for web and mobile Director of Engineering for Google App Engine • World’s largest Platform-as-a-Service Chief Engineer at eBay • Multiple generations of eBay’s real-time search infrastructure
  3. 3. Real-Time Strategy Games are … • Real-time • Spiky • Computationally-intensive • Constantly evolving • Constantly pushing boundaries → Technically and operationally demanding
  4. 4. How to Scale Scaling Code Scaling Infrastructure Scaling Performance Scaling DevOps
  5. 5. How to Scale Scaling Code Scaling Infrastructure Scaling Performance Scaling DevOps
  6. 6. Embrace Open Source Try someone else’s code first • Faster to get started, lower development cost • Open source projects are often higher quality, more extensible, better tested • Take advantage of talent outside your company Avoid “Not-Invented-Here” Attitude • (-) Google and eBay “exceptionalism” • Default has been to write it in-house instead of reuse and contribute
  7. 7. Embrace Standard Data Formats Use standard formats • Well-tested and widely-used • Internationalization from the beginning Time in UTC • (-) eBay and Google use local US-Pacific time 
  8. 8. Embrace Standard Data Formats Character set in UTF-8 • (-) 5+ years to convert eBay site from ISO-8859-1 (Western European only) to Unicode Structured data format • Explicit structure with associated schema • (+) Google uses protocol buffers for schema, serialization, storage
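
As an illustration of the two format rules on the slides above (time in UTC, text in UTF-8), here is a minimal Python sketch. The function name `to_wire`, the sample date, and the fixed UTC-8 offset standing in for US Pacific standard time are all assumptions made for the example, not anything from the talk.

```python
from datetime import datetime, timedelta, timezone
from typing import Tuple

def to_wire(event_time: datetime, text: str) -> Tuple[str, bytes]:
    """Normalize a timestamp to UTC and encode text as UTF-8 before storing it."""
    if event_time.tzinfo is None:
        # A naive timestamp has no defined meaning outside the machine that made it.
        raise ValueError("refusing naive timestamp: attach a timezone first")
    utc_time = event_time.astimezone(timezone.utc)
    return utc_time.isoformat(), text.encode("utf-8")

# Hypothetical event recorded in US Pacific standard time, modeled as a fixed UTC-8 offset.
pacific_standard = timezone(timedelta(hours=-8))
local_time = datetime(2014, 1, 15, 9, 30, tzinfo=pacific_standard)

stamp, payload = to_wire(local_time, "玩家 joined")
print(stamp)    # the same instant, expressed in UTC
print(payload)  # UTF-8 bytes, safe for non-Western text
```

Converting at the boundary means everything downstream compares and sorts timestamps without knowing where they were produced.
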
  9. 9. Development Discipline Quality, Reliability, Scalability are “Priority-0 features” • Equally important to users as product features and engaging user experience Developers responsible for • Features • Quality • Performance • Reliability • Manageability
  10. 10. Development Discipline Developers write tests and code together • Continuous testing of features, performance, load • Confidence to make risky changes • Catch bugs earlier, fail faster “Don’t have time to do it right”? • WRONG: don’t have time to do it twice (!) • The more constrained you are on time and resources, the more important it is to do it solidly the first time
  11. 11. How to Scale Scaling Code Scaling Infrastructure Scaling Performance Scaling DevOps
  12. 12. Layering Multiple layers • Client • Game server • Services • Persistence [Diagram: stacked layers labeled Game client, Game server, Services, Persistence]
  13. 13. Micro-Services Simple Well-defined interface Single-purpose Modular and independent Small teams Autonomy and responsibility [Diagram: services labeled A, B, C, D, E]
  14. 14. Google Cloud Datastore Cloud Datastore: NoSQL service • Highly scalable and resilient • Strong transactional consistency • SQL-like rich query capabilities Megastore: geo-scale structured database • Multi-row transactions • Synchronous cross-datacenter replication Bigtable: cluster-level structured storage • (row, column, timestamp) -> cell contents Colossus: next-generation clustered file system • Block distribution and replication Cluster management infrastructure • Task scheduling, machine assignment [Diagram: stack labeled Cloud Datastore, Megastore, Bigtable, Colossus, Cluster manager]
  15. 15. Reactive Servers Minimize request latency • Respond as rapidly as possible to client Functional Reactive + Actor model • Highly asynchronous, never block (!) • Queue events / messages for complex work • Heavy use of Scala / Akka and RxJava at KIXEYE • (-) eBay uses highly synchronous model • (-) Google uses complicated callback-based asynchronous model
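
The slide above names Scala / Akka and RxJava; the sketch below swaps in Python's `asyncio` purely to illustrate the pattern of acknowledging a request at once and queuing the complex work. The handler, worker, and event names are hypothetical.

```python
import asyncio

async def worker(queue: asyncio.Queue, results: list) -> None:
    """Drain queued events and do the slow work off the request path."""
    while True:
        event = await queue.get()
        await asyncio.sleep(0.01)            # stand-in for expensive game logic
        results.append(f"processed:{event}")
        queue.task_done()

async def handle_request(queue: asyncio.Queue, event: str) -> str:
    """Acknowledge immediately; never wait for the heavy work to finish."""
    queue.put_nowait(event)                  # enqueue without blocking
    return f"accepted:{event}"

async def main():
    queue: asyncio.Queue = asyncio.Queue()
    results: list = []
    task = asyncio.create_task(worker(queue, results))
    acks = [await handle_request(queue, e) for e in ("attack", "build", "move")]
    done_at_ack = len(results)               # work completed when the acks returned
    await queue.join()                       # wait only so the demo can inspect results
    task.cancel()
    return acks, done_at_ack, results

acks, done_at_ack, results = asyncio.run(main())
print(acks)
print(done_at_ack, results)
```

All three acknowledgements come back before any event has been processed, which is the latency property the slide is after.
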
  16. 16. Client Liveness Default to background processing • Refresh assets • Save client state Client continues seamlessly if disconnected • Parallel simulation on client and server • Gameplay more important than constant synchronization
  17. 17. How to Scale Scaling Code Scaling Infrastructure Scaling Performance Scaling DevOps
  18. 18. Scalability and Performance Measure, Measure, Measure • Instrument everything: client, services, network, DB • Measurement beats intuition every time • My own intuition is usually wrong Attack the first bottleneck • Theory of Constraints: attacking *any* other problem does not improve throughput of the system Repeat until performance is good enough • “When you solve problem one, problem two gets a promotion”
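
One way to picture "instrument everything, then attack the first bottleneck" is a timing decorator that records every call per stage. This is a minimal sketch; the stage names, the two sample functions, and the `first_bottleneck` helper are assumptions for illustration only.

```python
import time
from collections import defaultdict
from functools import wraps

timings = defaultdict(list)   # stage name -> list of elapsed seconds

def measured(stage):
    """Record wall-clock time for every call of the decorated function."""
    def decorate(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            finally:
                timings[stage].append(time.perf_counter() - start)
        return wrapper
    return decorate

@measured("db")
def load_player(player_id):
    time.sleep(0.02)          # hypothetical slow persistence call
    return {"id": player_id}

@measured("service")
def score(player):
    return len(player)        # hypothetical cheap computation

for pid in range(5):
    score(load_player(pid))

def first_bottleneck():
    """Return the stage with the largest total measured time."""
    return max(timings, key=lambda stage: sum(timings[stage]))

print(first_bottleneck())
```

The measurements, not a guess about which function "looks" expensive, decide where to spend effort next.
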
  19. 19. Small Details Matter In the very large, the very small matters a *lot* • Subatomic physics and cosmology are interrelated • Particles and forces at the subatomic level controlled formation and evolution of the entire universe Discipline is deciding *which* details matter (!)
  20. 20. eBay Search Index Compression Search Engine constrained by index size • Smaller index size reduces memory, CPU, I/O • Smaller index means fewer nodes, fewer shards Inverted Index • “Posting List”: all occurrences of [term] in documents • Monotonically-increasing series of integers, traversed in order Delta compression + Variable-byte encoding • Store deltas, not absolute numbers • Encode deltas so smaller numbers use fewer bits
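
The two techniques named on the slide above can be sketched in a few lines of Python. This is a generic textbook form of delta plus variable-byte encoding (7 payload bits per byte, high bit as a continuation flag), not eBay's actual index format; the function names and the sample posting list are made up.

```python
def encode_postings(doc_ids):
    """Delta-encode an increasing list of doc IDs, then variable-byte encode each gap."""
    out = bytearray()
    previous = 0
    for doc_id in doc_ids:
        delta = doc_id - previous
        if delta < 0:
            raise ValueError("posting list must be monotonically increasing")
        previous = doc_id
        while delta >= 0x80:
            out.append((delta & 0x7F) | 0x80)   # low 7 bits, continuation bit set
            delta >>= 7
        out.append(delta)                        # final byte, continuation bit clear
    return bytes(out)

def decode_postings(data):
    """Reverse encode_postings: rebuild gaps, then running-sum them into doc IDs."""
    doc_ids, current, delta, shift = [], 0, 0, 0
    for byte in data:
        delta |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7                           # more bytes belong to this gap
        else:
            current += delta
            doc_ids.append(current)
            delta, shift = 0, 0
    return doc_ids

postings = [3, 7, 130, 131, 100000]              # hypothetical posting list
encoded = encode_postings(postings)
print(len(encoded), decode_postings(encoded))
```

Because the list is traversed in order, the decoder only ever needs the running sum, and small gaps between neighboring documents cost a single byte each.
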
  21. 21. TOME Combat Server Scalability limits in TOME combat server • Unable to push single server beyond several hundred simultaneous players • All system and OS-level measurements OK • CPU, memory usage, I/O, threads, locking • Needed to use CPU-level analyzer (Intel VTune) Bottleneck: memory cache contention • Multiple cores contending on L2 cache memory • 40% scalability increase from six characters … • static Foo; → const static Foo;
  22. 22. Measurement and Distributions Gaussian (“Normal”) distribution is *not* normal • Applies only to quantities constrained on both sides, clustered around a mean • E.g., adult height and weight • Applies only to near-homogeneous populations • E.g., adult male height in North America, vs. female, vs. China, etc.
  23. 23. Measurement and Distributions Power Law (“Long Tail”) distribution *much* more common • Latency and performance measurements • Popularity, income, human connections, etc. • Minimum is 0; maximum is infinite • The more you have, the more you get
  24. 24. Measurement and Distributions Mean and Standard Deviation often misleading • Encourages you to remove outliers, even though outliers represent the real problems (!) • Encourages you to concentrate on the average case, not the worst case • “Mean is meaningless” Use percentiles instead (!) • Can reasonably characterize any distribution • Measure 90%ile, 99%ile, 99.9%ile • Highlight and focus on the *real* problems
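
To make the "mean is meaningless" point concrete, the following sketch compares the mean with nearest-rank percentiles on a synthetic long-tailed sample. The latency values are invented for the example, and the `percentile` helper is one common definition (nearest rank) among several.

```python
import math
import statistics

def percentile(values, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    ordered = sorted(values)
    # round() guards against floating-point noise before taking the ceiling
    rank = max(1, math.ceil(round(p * len(ordered) / 100, 6)))
    return ordered[rank - 1]

# Synthetic latencies in milliseconds: most requests are fast, a few are very slow.
latencies_ms = [10] * 900 + [50] * 90 + [2000] * 10

mean_ms = statistics.mean(latencies_ms)
p90, p99, p999 = (percentile(latencies_ms, p) for p in (90, 99, 99.9))
print(mean_ms, p90, p99, p999)
```

Here the mean (33.5 ms) describes no request that actually happened, while the 99.9th percentile surfaces the 2-second responses that players would notice.
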
  25. 25. How to Scale Scaling Code Scaling Infrastructure Scaling Performance Scaling DevOps
  26. 26. Automate Everything Humans are always at a premium • Humans are too valuable for repetitive tasks • Machines will happily do things over and over Automated operations • Provisioning • Deployment • Alerting • Self-healing
  27. 27. Autoscaling Games are very spiky • Very unpredictable • Huge variability between peak and trough • Hits are self-reinforcing Services and clients have to “flex” • Clients back off in response to latency • Services grow / shrink based on load
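
"Clients back off in response to latency" could look something like the sketch below: the client doubles its delay between requests while the server is slow and shrinks it again once latency recovers. The thresholds, bounds, and the function name `next_delay` are assumptions, not values from the talk.

```python
def next_delay(current_delay, observed_latency, target_latency=0.2,
               min_delay=0.1, max_delay=30.0):
    """Double the delay between requests when the server is slow; halve it when healthy."""
    if observed_latency > target_latency:
        return min(current_delay * 2, max_delay)
    return max(current_delay / 2, min_delay)

# Hypothetical sequence of observed response times, in seconds.
delay = 0.1
history = []
for latency in [0.05, 0.5, 0.8, 1.2, 0.1, 0.1]:
    delay = next_delay(delay, latency)
    history.append(delay)

print(history)
```

Bounding the delay on both sides keeps a healthy client responsive and stops a struggling server from being ignored for too long.
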
  28. 28. App Engine Autoscaling Autoscaling as part of the Platform • Gracefully handle spiky application load • Maximize utilization of the infrastructure World-class application scheduler • Consider request rate, processing time, max wait time • Also instance startup time, application budget • Predictive model pre-provisions and proactively scales • Reactive autoscaling in response to load • Instantaneous autoscaling on request: spin up new instance(s) *while a request is coming in*
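
The slide lists request rate and processing time among the scheduler's inputs. As a rough illustration of how those two numbers relate to instance count, the sketch below applies Little's law (requests in flight = arrival rate × time in system). It is not App Engine's scheduler; the per-instance concurrency, the utilization target, and the function name are all assumed for the example.

```python
import math

def instances_needed(request_rate_per_s, processing_time_ms,
                     per_instance_concurrency=8, target_utilization=0.7):
    """Estimate instance count from load, leaving headroom below full utilization."""
    in_flight = request_rate_per_s * processing_time_ms / 1000   # Little's law
    usable_slots = per_instance_concurrency * target_utilization
    return max(1, math.ceil(in_flight / usable_slots))

peak = instances_needed(1000, 50)     # hypothetical peak: 1000 req/s at 50 ms each
trough = instances_needed(40, 50)     # hypothetical trough: 40 req/s at 50 ms each
print(peak, trough)
```

A reactive autoscaler would re-evaluate something like this as load changes; a predictive one would feed it a forecast rate instead of the current one.
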
  29. 29. Google and DevOps Ops Support is a privilege, not a right • Developers carry pager for first 6+ months • Service “graduates” to SRE after intensive review of monitoring, reliability, resilience, etc. • SRE collaborates with service to move forward Everyone’s incentives are aligned • Everyone is responsible for production • Everyone strongly motivated to have solid instrumentation and monitoring
  30. 30. Recap: How to Scale Scaling Code Scaling Infrastructure Scaling Performance Scaling DevOps
  31. 31. Thank you! rshoup@kixeye.com @randyshoup linkedin.com/in/randyshoup