Zero to ten million daily users in fourweeks: sustainable speed is king        Jodi Moran, CTO, Plumbee                   ...
Who am I?• Mass market web entertainment for  more than 8 years• Small businesses, large businesses• Small teams, large te...
About social games • Free-to-play games on Facebook,   monetized with microtransactions • Highly interactive • Cost per us...
Case study: The Sims Social • Released mid-August 2011 • By mid-September 2011   – 10 million daily active users   – 65 mi...
About Plumbee• Social casino games• Development started October 2011 with 3  engineers• 5 engineers Dec 2011, 8 engineers ...
What is sustainable speed? • Speed measured by end-to-end time for   each change • Sustainability measured by maintaining ...
Why sustainable speed? • Responsiveness   – To fickle audience   – To changing competition   – To changing platform • Retu...
Achieving sustainable speed •   Iterate and automate •   Use commodity technology •   Analyse and improve •   Build servic...
Achieving sustainable speed •   Iterate and automate •   Use commodity technology •   Analyse and improve •   Build servic...
Be agile • Framework for incremental delivery • Incremental delivery (small batches) by   definition improves end-to-end t...
Automate routine work    Code build and                               Deployment of code                          Provisio...
Isolate changes • Makes problem causes easy to identify • Place high-risk parts of the system on   different release track...
Make it minimally viable first • Launch with minimal product       … and minimal process       … and minimal tech • “If yo...
Prepare for technical debt • Too much slows you down • But it’s not possible to avoid • So you will need a way to keep it ...
Case study: content tools • Special case of change isolation / automation • Games have a lot of “configuration” or   conte...
Content editing tools                        16
Iterate and automate • Small batches reduce end-to-end time • Small batches help you find problems   faster • Automation m...
Achieving sustainable speed •   Iterate and automate •   Use commodity technology •   Analyse and improve •   Build servic...
Use commodity languages• Large developer communities• Many open source components• Aids and encourages componentization  a...
Use third-party services• World-class features                         Internal• Low opportunity                    Compan...
Virtualized infrastructure with AWS •   Flexibility & agility •   Small operations team •   Infrastructure, not platform •...
Case study: Highly-scalable storage       with commodity tech                                      22
Plumbee data access patterns • Thick client means data is cached client   side • High ratio of writes to reads • User prim...
Plumbee data storage: micro view • Data stored in key-value form, key is user id • Multiple values stored against the user...
Plumbee data storage: macro view • MySQL on multi-AZ RDS • Read slaves handle e.g. reads of friend   data • Users are spre...
The results • By using commodity tech + services:   MySQL/InnoDB, RDS, Java, Spring/AspectJ, GPB • We have:    –   Fast ac...
Commodity technology• Easy and cheap to acquire• Easy to hire people who know it• Quick assembly of product from many  par...
Achieving sustainable speed •   Iterate and automate •   Use commodity technology •   Analyse and improve •   Build servic...
Collect user data • Never too much data: collect everything   and store it forever • Collect data through events • At Plum...
Collect system data • Just another kind of analytics • Instead of “what is user doing”, “what is   system doing” • Collect...
Analytics with commodity tech                                31
What can you do with data? •   Reporting •   Monitoring •   Data mining •   Predictive analytics •   Personalization •   S...
Split testing • Run controlled experiments to determine   how changes affect users • To do this: assign users randomly to ...
Simple random sampling variants = empty; foreach (test in currently running tests) {    selectedVariant = test.getStoredVa...
Simple significance testing • Conditions   – Metric to be improved is a proportion: e.g.     percent of users converting t...
Analyse and improve • Analysing your system tells you how to   improve it • The more accessible and timely your data,   th...
Achieving sustainable speed •   Iterate and automate •   Use commodity technology •   Analyse and improve •   Build servic...
What are services? • Essential quality: data & functions on that   data combined into one component • Data only accessible...
Technical benefits of SOA • Scalability improvements: data is   partitioned • Performance improvements: data storage   opt...
Consider the round-trip                    Concept      Live system                                Development        anal...
Apply distributed systems design • Minimize communication – especially   long-distance communication • Make local progress...
Match architecture and organization • Create little startups   – small cross-functional teams   – completely accountable f...
A social game team                             Business, product                               management   Production, pr...
Build services • Small, independent, cross-functional   teams can iterate quickly • Forming little startups creates   orga...
Create a high-speed culture •   Iterate and automate •   Use commodity technology •   Analyse and improve •   Build servic...
Post-modern programming• Glue code is beautiful• “Functionality is an asset, code is a  liability”• Embrace heterogeneity,...
Test is dead • Alberto Savoias keynote at GTAC 2011 • What are tests for?    – Is the functionality what the developer    ...
Corollary: load test is dead • (Good) load tests are very difficult • To test ability to add capacity?   – Design for hori...
And also: operations is dead • All engineers need mission-critical   mentality • When all engineers operate the   system, ...
High-speed culture •   Hire the best people •   Get them passionate about the product •   Provide easy access to informati...
Sustainable speedFrom this…               Get this! • Iterate & automate     • Get ahead of your • Use commodity          ...
Come and help us!www.plumbeegames.com                       52
Upcoming SlideShare
Loading in …5
×

Zero to ten million daily users in four weeks: sustainable speed is king

1,197 views

Published on

A successful social game can achieve very rapid growth to large scale during a short period of time, with extreme cases (CityVille, The Sims Social, Gardens of Time) achieving millions of daily active users and tens of millions of monthly active users within a few weeks of launch. In this presentation I will argue that achieving the fastest possible sustainable speed of development is the most important technical factor in launching and operating a successful social game, or indeed any large-scale web business -- whether at startup phase, during rapid growth, or established and mature. I will propose some techniques, both technical and non-technical, to optimize rapid development at large scale. Finally, I will describe some of the typical architecture, processes, and tools used in the development of both the transactional and analytical systems of social games.

Published in: Technology
0 Comments
2 Likes
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total views
1,197
On SlideShare
0
From Embeds
0
Number of Embeds
13
Actions
Shares
0
Downloads
0
Comments
0
Likes
2
Embeds 0
No embeds

No notes for slide
  • Presentation: "Zero to ten million daily users in four weeks: sustainable speed is king"Track: Highly-available systems / Time: Wednesday 15:20 - 16:20 / Location: FlemingA successful social game can achieve very rapid growth to large scale during a short period of time, with extreme cases (CityVille, The Sims Social, Gardens of Time) achieving millions of daily active users and tens of millions of monthly active users within a few weeks of launch. In this presentation I will argue that achieving the fastest possible sustainable speed of development is the most important technical factor in launching and operating a successful social game, or indeed any large-scale web business -- whether at startup phase, during rapid growth, or established and mature. I will propose some techniques, both technical and non-technical, to optimize rapid development at large scale. Finally, I will describe some of the typical architecture, processes, and tools used in the development of both the transactional and analytical systems of social games.
  • Variety of target audiences, different problems Habbo, Betfair,NCSoft, Atari, Playfish, PlumbeeI am an “organizer”, always looking for the connections between lots of different piecesSo this talk is a bit different, it’s about the big picture and touches just a bit on many different topics
  • I’m trying to apply the lessons I’ve learned from massive, high-growth systems to the startup space= on a budget!!
  • Not enough to crunch to get one feature doneWeeks, months, and years
  • There’s a lot of thinking about speed these days, it goes under different names agile, lean, radical managementReturns greater => you see them earlier, so they accumulate over more timeInvestments are less => there’s a bunch of things you won’t have to do
  • My first lesson: be agileStarting off with agile, because agile is sustainable speed 101I include in agile any framework for incremental deliveryThings like deliver frequently, embrace change, maximize the work not doneThe best places don’t “do” agile, they “are” agile.Process is a tool, a means to an end -- find what works for your business and don’t be afraid to change it
  • Automate everything, but especially the routine work of build, test, deployMachines are faster than people at routine workMachines make fewer mistakes than people when doing routine work, which also makes things faster
  • Don’t try to make process perfect from the startDon’t try to make a perfect system from the startDo what needs doing
  • Has a cumulative effect over development iterationsBecause you can incur it retroactively (you realize later there was a better way)
  • 1. User edits config in googlespreadsheet2. User requests deployment to stable3. Transformer pulls data, transforms to XML4. Transformer commits to Git5. Transformer tells Jenkins to deploy6. Jenkins fetches XML and packages7. Jenkins uploads new config package to S38. Jenkins updates environment config in S3 with new version number9. Jenkins bounces server in stable10. Server gets new config location from S311. Sever reloads config12. Jenkins emails userGoogle docs spreadsheet (CT)TransformerGit (CT)Jenkins (CT)S3 (CT)
  • Dev communities => easy and quick to hire fluent developers, easy access to informationLots of open source components => you can use to make things fasterEncourages componentization => your devs will make things that can be recombined and re-leveraged
  • You don’t have to buildYou don’t have to maintain
  • Push launch dates forward and backward without concern for hardware lead times; never be caught without capacity100 to 1 ratio of servers to opsLower knowledge barrier, lower risk if we choose to migrateThe amount of engineering that goes on! Availability zones, highly reliable distributed storage, block store migrated between machinesDesign for failure tolerance and horizontal server scaling
  • Splitting values into multiple fields means don’t read/write whole thing every time (better performance)Secondary indexes have to be maintained manually, if needed
  • - Reads esp. of friend data offloaded to slave- Shards have to be manually rebalanced with a custom tool
  • You can now get even highly redundant storage at ~10 cents per GB-month = Storing hundreds of GB a day is quite reasonableSo what excuse to not collect everything and keep it forever?Better to collect data through events – can capture non-persistent things, done real time instead of in bulk = more timely, doesn’t overload databasesAt plumbee, we capture the content of every server call and every database write = can recreate any session
  • Personalization: price points,payment methods, discounts,bonuses,make recommendations,
  • Each user who visits while a test is running is permanently assigned to a variant in that test according to the variant weights (may be the control variant)Each combination of variants for all running tests is potentially directed to a different server group => this allows serving totally different functionality to that user
  • Collect everythingSegue: how we do this @ PlumbeeWhat you can do with dataSegue: how we do split testing @ PlumbeeWhy you need this for speed
  • Motivated by object-oriented programmingWe encapsulate data & functions on that data into one componentAn extrapolation of OOP to distributed systemsOrganize system according to functionalityEach block of functionality is called a serviceEach service is developed and deployed independentlyServices interact only through remote calls
  • … and in order to motivate those, I want to talk a bit philosophically about building a software-as-a-service organization.
  • Think about your organization as a system designed for two thingsthe constant delivery of softwarethe operation of constantly changing software. This means you should organize to optimize flow: the flow of software from development to operations the flow of information from operations back into developmentAll of this creates a positive feedback loop that will continuously improve the service you are operating.
  • When designing distd system, Minimize comm + long distance commSo how do you deto optimize flow? Any technical people in the audience will recognize that when designing distributed systems you try to minimize communication across the network because it’s the most expensive and slow thing to do, especially the further you have to send the message. It’s the same thing with an organization – the more people have to communicate in order to get something done, and the more widely spread those people are, the slower things are going to happen. Seems obvious -- the best way to organize is so that the people who have to communicate the most are in the same team.
  • The best way to optimize for flow and minimizecross-team communications is to create small, cross-functional teams who are completely dedicated to a particular product or slice of functionality. These teams should contain all of the skill sets necessary to deliver and operate a service, relying as little as possible on other teams. Small teams can operate very rapidly because fewer people are communicating to make a decision and get things done, which also means that less process is needed to manage that communication. When all the needed skill sets are available to the team, there are no dependencies or gates that can slow things down. Also, restricting the size of the team also means you are forced to restrict the size of the product that the team is tackling. This means that everyone on the team can fully understand what they are trying to achieve, and there’s a greater sense of responsibility for each team member to contribute to making their product better. So how small should your team be? Evidence suggests that the optimal team size is what Amazon calls a two-pizza team – if you can’t feed the team with two pizzas, the team is too large. When this happens the team should be split into smaller teams, each with its own separate full slice of functionality.
  • These are roles, not people – some people may cover more than oneThis is not exhaustive, more roles might be requiredEngineers that cover every required skill set (actionscript, javascript, java)Yes I’m oversimplifying
  • Internal “commodity tech” Faster developmentTeam independenceReduced scope for each teamFlexibilityDevelopment scalabilityGrow through division
  • To encourage use of commodity tech… a postmoden programming mentalityJames Noble and Robert Biddle
  • Google Test Automation ConfWhat’s really dead: taking away the responsibility for quality from the people who create the workThe best teams have the least QA.The last one is the most important, but it can’t be determined by test => product needs to be in the hands of the user!Testing less is actually LESS risky =>(1) you get to the most important test of all (with the user) quicker and so minimize your greatest risk (2) your bugs are less costly because they can do less damage because you can fix them quicker
  • It should involve experimental design (stating hypothesis, removing confounding factors) and proper statistical analysis -- this is rareSimulating actual user behaviour is hard, predicting user behaviour even harderIn fact, you already have gradual rollout – you need same infrastructure for split-testing!Every new product version should be a split test
  • More reliabilityBetter monitoringBetter loggingSelf-managing systems
  • Good developers by nature are carefulHire the best => then need to encourage them to lean forward all the time, “move fast and break things”The best way to get sustainable speed is to create a high-speed culture.
  • Zero to ten million daily users in four weeks: sustainable speed is king

    1. 1. Zero to ten million daily users in fourweeks: sustainable speed is king Jodi Moran, CTO, Plumbee 1
    2. 2. Who am I?• Mass market web entertainment for more than 8 years• Small businesses, large businesses• Small teams, large teams• Variety of products• I’m all about the big picture 2
    3. 3. About social games • Free-to-play games on Facebook, monetized with microtransactions • Highly interactive • Cost per user matters • Can grow very quickly 3
    4. 4. Case study: The Sims Social • Released mid-August 2011 • By mid-September 2011 – 10 million daily active users – 65 million monthly active users – 1 TB of analytics data collected daily 4
    5. 5. About Plumbee• Social casino games• Development started October 2011 with 3 engineers• 5 engineers Dec 2011, 8 engineers today• Launching first product in just a few weeks 5
    6. 6. What is sustainable speed? • Speed measured by end-to-end time for each change • Sustainability measured by maintaining speed over long time periods 6
    7. 7. Why sustainable speed? • Responsiveness – To fickle audience – To changing competition – To changing platform • Returns are greater • Investments are less 7
    8. 8. Achieving sustainable speed • Iterate and automate • Use commodity technology • Analyse and improve • Build services • Create a high-speed culture 8
    9. 9. Achieving sustainable speed • Iterate and automate • Use commodity technology • Analyse and improve • Build services • Create a high-speed culture 9
    10. 10. Be agile • Framework for incremental delivery • Incremental delivery (small batches) by definition improves end-to-end time • Framework for reflecting on process • Focus on principles, not practices: process is a means to an end 10
    11. 11. Automate routine work Code build and Deployment of code Provisioning of test execution of unit and configuration to environment tests test environment Execution of Provisioning of Promotion of build to automated end-to- production production end regression test environment suite Deployment of code and configuration to production environment 11
    12. 12. Isolate changes • Makes problem causes easy to identify • Place high-risk parts of the system on different release tracks from low-risk parts • Each release track can have a different cadence • At Plumbee: Client, server, application configuration / content, environment configuration each separately versioned and independently releaseable 12
    13. 13. Make it minimally viable first • Launch with minimal product … and minimal process … and minimal tech • “If you aren’t embarrassed by your first launch, you didn’t launch early enough” 13
    14. 14. Prepare for technical debt • Too much slows you down • But it’s not possible to avoid • So you will need a way to keep it under control • Take it on intentionally when needed 14
    15. 15. Case study: content tools • Special case of change isolation / automation • Games have a lot of “configuration” or content that needs to be tweaked and balanced • Content tools usually end at build stage • At Plumbee: – Edit with familiar interface – Button click to deploy to playtesting – Button click to deploy to live 15
    16. 16. Content editing tools 16
    17. 17. Iterate and automate • Small batches reduce end-to-end time • Small batches help you find problems faster • Automation makes things faster • Automation reduces errors 17
    18. 18. Achieving sustainable speed • Iterate and automate • Use commodity technology • Analyse and improve • Build services • Create a high-speed culture 18
    19. 19. Use commodity languages• Large developer communities• Many open source components• Aids and encourages componentization and reuse• For example: Java, Javascript, Actionscript 19
    20. 20. Use third-party services• World-class features Internal• Low opportunity Company email, calendars, documents, accounting, HR, cost bug-tracking• Maintain business focus Technical Version control, build systems, monitoring, analytics, infrastructure User- facing Customer support, bulk and transactional email 20
    21. 21. Virtualized infrastructure with AWS • Flexibility & agility • Small operations team • Infrastructure, not platform • Advanced features • Forces good software practice 21
    22. 22. Case study: Highly-scalable storage with commodity tech 22
    23. 23. Plumbee data access patterns • Thick client means data is cached client side • High ratio of writes to reads • User primarily reads and writes their own data • Secondarily, reads and rare writes of friends’ data 23
    24. 24. Plumbee data storage: micro view • Data stored in key-value form, key is user id • Multiple values stored against the user id • Each value is a data structure serialized to binary format with Google Protocol Buffers • {userid, valueid, value} tuples stored in single table in InnoDB/MySQL: {int, int, blob} • Transactions managed with (modified) Spring / AspectJ 24
    25. 25. Plumbee data storage: macro view • MySQL on multi-AZ RDS • Read slaves handle e.g. reads of friend data • Users are spread across many shards • Shards are managed with custom library • Users allocated using simple round-robin to shards, shard mapping persisted 25
    26. 26. The results • By using commodity tech + services: MySQL/InnoDB, RDS, Java, Spring/AspectJ, GPB • We have: – Fast access for use cases – Easy to understand and use – No downtime for schema changes – Easy monitoring and tuning – Horizontal scaling – Highly reliability – Automatic failover (with replica reassignment) – Easy snapshot backups • All with just a few man-weeks of effort! 26
    27. 27. Commodity technology• Easy and cheap to acquire• Easy to hire people who know it• Quick assembly of product from many parts• Easy to change 27
    28. 28. Achieving sustainable speed • Iterate and automate • Use commodity technology • Analyse and improve • Build services • Create a high-speed culture 28
    29. 29. Collect user data • Never too much data: collect everything and store it forever • Collect data through events • At Plumbee: we collect the entire content of every request and every database write 29
    30. 30. Collect system data • Just another kind of analytics • Instead of “what is user doing”, “what is system doing” • Collect and use system data alongside user data • Report on and monitor both user and system metrics 30
    31. 31. Analytics with commodity tech 31
    32. 32. What can you do with data? • Reporting • Monitoring • Data mining • Predictive analytics • Personalization • Split-testing 32
    33. 33. Split testing • Run controlled experiments to determine how changes affect users • To do this: assign users randomly to one of several product versions, called “variants” • Tag all collected events with variant • Calculate metrics are separately for each variant • Perform statistical tests to determine whether the difference in metric is significant 33
    34. 34. Simple random sampling variants = empty; foreach (test in currently running tests) { selectedVariant = test.getStoredVariantForUser(user); if (selectedVariant == null) { if (shard in test) { selectedVariant = test.chooseRandomWeightedVariant(); user.storeVariantForTest(test, selectedVariant); } } variants.addTestAndVariantPair(test, selectedVariant); } serverGroup = getServerGroupForVariants(variants); serverGroup.forwardRequest(request, user, shard, variants); 34
    35. 35. Simple significance testing • Conditions – Metric to be improved is a proportion: e.g. percent of users converting to spender. – Proportions are not too close to 0 or 1 – Independent samples – Random sampling • Result: super-simple test for confidence (z-test) that runs in linear time wrt size of test 35
    36. 36. Analyse and improve • Analysing your system tells you how to improve it • The more accessible and timely your data, the quicker your decision-making • And the greater your responsiveness to changes • Good analysis and split-testing means you do less work! 36
    37. 37. Achieving sustainable speed • Iterate and automate • Use commodity technology • Analyse and improve • Build services • Create a high-speed culture 37
    38. 38. What are services? • Essential quality: data & functions on that data combined into one component • Data only accessible through remote API • Each service is developed, deployed, and operated independently of other services • “Service-oriented architecture” is an extrapolation of object-oriented programming to distributed systems 38
    39. 39. Technical benefits of SOA • Scalability improvements: data is partitioned • Performance improvements: data storage optimized for specific use cases • System availability improvements: system can fail in parts • But there’s other benefits too… 39
    40. 40. Consider the round-trip Concept Live system Development analysis Operation 40
    41. 41. Apply distributed systems design • Minimize communication – especially long-distance communication • Make local progress • Optimize the 90% case • Place people who need to communicate the most in the same team 41
    42. 42. Match architecture and organization • Create little startups – small cross-functional teams – completely accountable for a specific service – act independently of other teams • “Two-pizza teams” – Small teams can act fast – Scope is restricted – Everyone can understand team purpose – Greater sense of shared responsibility 42
    43. 43. A social game team Business, product management Production, process Art, design Data analysis Marketing, community Automated Engineering infrastructure 43
    44. 44. Build services • Small, independent, cross-functional teams can iterate quickly • Forming little startups creates organizational scalability • Services act as internal “commodity tech” that can be easily reused 44
    45. 45. Create a high-speed culture • Iterate and automate • Use commodity technology • Analyse and improve • Build services • Create a high-speed culture 45
    46. 46. Post-modern programming• Glue code is beautiful• “Functionality is an asset, code is a liability”• Embrace heterogeneity, don’t try to standardize• There is no big picture 46
    47. 47. Test is dead • Alberto Savoias keynote at GTAC 2011 • What are tests for? – Is the functionality what the developer intended? – Is the functionality what the product owner intended? – Is the functionality what the user wants? • Testing less is less risky 47
    48. 48. Corollary: load test is dead • (Good) load tests are very difficult • To test ability to add capacity? – Design for horizontal scalability • To predict capacity needed for more users? – If you have IaaS, good design, good monitoring, and speed, you can scale just-in-time • To predict capacity needed for new features? – Invest in ability to do dark launches and gradual rollouts 48
    49. 49. And also: operations is dead • All engineers need mission-critical mentality • When all engineers operate the system, the constraints and requirements of the operational environment will always be taken into account • At Plumbee: all engineers have (audited) access to production, and all engineers take turns on call 49
    50. 50. High-speed culture • Hire the best people • Get them passionate about the product • Provide easy access to information • Give them freedom & responsibility • Trust them to make decisions • Encourage them to take risks 50
    51. 51. Sustainable speedFrom this… Get this! • Iterate & automate • Get ahead of your • Use commodity customers, competition technology , and platform • Analyse & improve • Achieve greater returns • Build services • Do less work! • Create a high-speed culture 51
    52. 52. Come and help us!www.plumbeegames.com 52

    ×