For those who don’t know what Draw Something is: it is a “social” game like Pictionary. Two players play. A player is presented with a list of three words, from which they pick one to draw. The other player then sees the drawing and has to guess the word. And it goes back and forth like that. It is OMGPOP’s second version of this type of game; the first was a real-time one called Draw My Thing.
Well, the game launched on February 6, 2012. As with most new games, social media integration (Facebook in this case) makes it easy both to invite your friends to play and to highlight that you are playing the game. This “social component” helps build popularity. A few weeks into its life, Draw Something began to get a lot of attention, including from celebrities who also used social media (Facebook, Twitter and Pinterest) to talk about their experience. One of the stars of Jersey Shore tweeted about the game in early March, kicking off the initial round of growth to 1 million daily active users. Miley Cyrus tweeted about her Draw Something “addiction” on March 8 and growth accelerated, from over 4 million daily users. Two weeks later, at over 14 million daily active users, the company behind Draw Something was acquired by Zynga for a purported $200 million plus.
Unfortunately, not everyone prepares. On March 1, as Vinny and Pauly D of the Jersey Shore were tweeting about Draw Something, EA launched a game called The Simpsons: Tapped Out. Almost immediately the game charged to #2 on the iPad and #3 on the iPhone top free app lists. Growth started to follow the same trajectory as Draw Something! But the outcome couldn’t have been more different. While Draw Something continued to grow, EA was unable to keep up with the success of the game. Games were reportedly being “lost,” there was huge lag, and users were beginning to complain, loudly. Rather than praise on Twitter, there was a flood of negative reaction. Just 4 days later, EA was forced to pull the game from the App Store. As of the end of March 2012, it had still not returned. What a contrast.
Marmalade framework for the mobile client app (C/C++).
Based on scaling experience with past games.
30k downloads on the first day = surprise (a pleasant one).
Celebrities start tweeting about it.
“Crazy” success: a million downloads!
Problem: the app layer started to back up.
Goliath from PostRank.
Asynchronous, but made easier than most event-driven frameworks.
From 115 app instances to 15 app instances.
Still only SIX servers in the app tier.
App server changes helped.
Then noticed performance issues.
But then noticed 90% of S3 requests were throwing errors.
Data layer is the next bottleneck.
Amazon said too much S3 usage, so rate limited them.
Already had experience with Couchbase Server from other projects.
Implemented in one night!
Resolved the S3 issues.
With the new data layer they got the ability to scale the database under load without the app going offline.
Easy scalability, without app changes or taking the workload offline, is key.
How does it work?
Also point out replication within the database.
Growth took off at even crazier proportions.
Reached number 3 in the AppStore at the end of February and then number 1 by early March!
MILLIONS of players per day.
DOUBLED users from 2M -> 4M within 2 days!
Architecture is right, but load increases are hard to manage.
Around this time they called Couchbase.
Before that they were using the Community Edition; they then switched to the Enterprise Edition to get hotfixes, support and solution architects.
Linear scalability is as good as it gets, BUT twice the users then means twice the servers. Could you double your server count within 48h (including provisioning etc.)?
Just scaling out isn’t the answer: bigger and bigger clusters add their own manageability problems, plus rack space, cooling, etc.
Use scaling/rebalance to remove nodes from the cluster, upgrade RAM and disks, and balance them back in => Presto!
SQLite in the 1.7 version had fragmentation issues over time; seek times become meaningful (deletes!) => SSD has no seek time.
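The "twice the users means twice the servers" point can be put into numbers. The following is only a rough sketch of linear capacity planning: the per-node throughput, the headroom factor and the function name `nodes_needed` are hypothetical values chosen for illustration, not figures from OMGPOP or Couchbase.

```python
import math


def nodes_needed(peak_ops_per_sec, ops_per_node, headroom=0.25):
    """Nodes required under linear scaling, with spare capacity for spikes and failures."""
    return math.ceil(peak_ops_per_sec * (1 + headroom) / ops_per_node)


OPS_PER_NODE = 2_000  # hypothetical per-node throughput, for illustration only

before = nodes_needed(50_000, OPS_PER_NODE)
after_doubling = nodes_needed(100_000, OPS_PER_NODE)      # load doubles, same hardware
after_upgrade = nodes_needed(100_000, OPS_PER_NODE * 2)   # load doubles, nodes twice as capable

print(before, after_doubling, after_upgrade)
```

Under these assumptions, doubling the load roughly doubles the node count, while upgrading each node (more RAM, SSDs) brings the count back down. That is the argument for rebalancing nodes out, upgrading them and rebalancing them back in.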
More and more nodes => some will fail.
Crucial that the architecture is designed to accommodate failure and not have the entire game go down!
App layer is stateless => easy.
Data layer has all the state => hard.
Autofailover will activate replicas => add new nodes to replenish capacity.
Application stays available through node failures.
Actually had HW issues on a bunch of servers, leading to a lot more failures than normally anticipated.
----- Meeting Notes (10/24/12 08:30) -----
Shared Nothing! Isolate the impact of a node failure.
Growth beyond belief.
Remember when initially 30 drawings per second seemed like a lot and 1 million downloads were a huge deal to them. Growth is at 1M a day!
All about monitoring at this point.
Find a bottleneck and then add capacity to resolve it.
Monitoring was key, at all levels.
This is the Couchbase console to keep track of the data layer at all levels (per cluster, or drill down into machines). [No, not an actual shot of the DrawSomething cluster]
Only possible because they chose the right technologies and had thought about scalability from the beginning!
You can’t retrofit scalability once an app takes off.
Over 90 Couchbase Server nodes by now.
----- Meeting Notes (10/24/12 08:30) -----
Try to overprovision to have some breathing room.
In the end, within 50 days from launch, 50M downloads were reached and over 3k drawings per second.
BTW, this translated to over 100k ops on the Couchbase Server tier.
Of course it ended great for OMGPOP, with the acquisition by Zynga for a rumoured $200m. Hopefully for them it wasn’t all in stock ;)
----- Meeting Notes (10/24/12 08:30) -----
Big audience, not really Big Data.
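As a back-of-the-envelope check on the figures in these notes, the arithmetic can be written out. All inputs are the "over" values quoted above, so the results are rough lower-bound ratios, and the variable names are only illustrative.

```python
# Figures quoted in the notes; each is an "over" value, so treat results as rough.
downloads = 50_000_000
days_since_launch = 50
drawings_per_sec = 3_000
couchbase_ops_per_sec = 100_000

avg_downloads_per_day = downloads / days_since_launch
ops_per_drawing = couchbase_ops_per_sec / drawings_per_sec
drawings_per_day = drawings_per_sec * 86_400  # if the quoted rate were sustained all day

print(f"{avg_downloads_per_day:,.0f} downloads/day on average")
print(f"about {ops_per_drawing:.0f} data-layer ops per drawing")
print(f"{drawings_per_day:,} drawings/day at a sustained rate")
```

The average works out to the "1M a day" growth mentioned earlier. The ops-per-drawing ratio presumably also covers reads and other game traffic, not just the write of the drawing itself.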
Of course, scaling problems and the lessons OMGPOP learned aren’t just applicable to social games.
A scalable architecture is key to any web app; you never know which service will take off.
Ease of scalability is especially important, and it needs to be always-on: there isn’t an off-peak on the Internet anymore…
Scaling the app layer is not crazy hard, as app servers don’t have permanent state.
So think about whether your database layer could have done that: added capacity to grow to this size and absorbed node failures, all without application downtime?
CCB 12: How Draw Something Grew to 50 Million Downloads in 50 Days
How Draw Something Grew to 50 Million Downloads in 50 Days
Frank Weigel, VP, Products, Couchbase
@FrankWeigel
NoSQL to the rescue
• Couchbase Server NoSQL database
• Easy on-line scalability
• Lazy migration for existing data
[Architecture diagram: HAProxy (load balancer), NGINX & Goliath Ruby app server, Couchbase Server, Amazon S3]
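The slide does not say how the lazy migration was implemented. One common way to do it is a read-through wrapper: look in the new store first, fall back to the old store on a miss, and copy the value across on first access. The sketch below assumes that approach and uses plain Python dicts as stand-ins for Couchbase Server and Amazon S3; the class and key names are hypothetical.

```python
class LazyMigratingStore:
    """Read-through wrapper: serve from the new store, fall back to the legacy store on a miss."""

    def __init__(self, new_store, legacy_store):
        self.new_store = new_store        # stand-in for the new data layer
        self.legacy_store = legacy_store  # stand-in for the old data layer

    def get(self, key):
        value = self.new_store.get(key)
        if value is not None:
            return value
        value = self.legacy_store.get(key)
        if value is not None:
            self.new_store[key] = value   # migrate the item on first access
        return value

    def set(self, key, value):
        self.new_store[key] = value       # new writes only ever go to the new store


legacy = {"drawing:1": "strokes-a", "drawing:2": "strokes-b"}
new = {}
store = LazyMigratingStore(new, legacy)

first = store.get("drawing:1")     # miss in `new`, found in `legacy`, copied across
store.set("drawing:3", "strokes-c")
```

With this pattern no bulk copy is needed up front: data that is still being used moves over as it is touched, and data that is never read again never has to be moved.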
Scaling Out Under Load
[Diagram: two app servers, each with a Couchbase client library holding a cluster map, send read/write/update requests to a Couchbase Server cluster of five servers. Each server holds a set of active documents and a set of replica documents. User-configured replica count = 1.]
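To make the cluster-map idea in the diagram concrete, here is a simplified sketch. It assumes keys are hashed into a fixed number of partitions and that the map records an active and a replica server per partition. The partition count, the plain CRC32-modulo hash, the round-robin placement and all function names are illustrative assumptions, not Couchbase's actual algorithm or API.

```python
import zlib

NUM_PARTITIONS = 64  # illustrative value, chosen small for the example


def partition_for(key):
    """Map a document key to a partition; this never changes when servers are added."""
    return zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS


def build_cluster_map(servers):
    """Give each partition an active server and one replica on a different server."""
    n = len(servers)
    return {
        p: {"active": servers[p % n], "replica": servers[(p + 1) % n]}
        for p in range(NUM_PARTITIONS)
    }


def server_for(cluster_map, key):
    """What a client library would do: look up the active server for a key locally."""
    return cluster_map[partition_for(key)]["active"]


servers = [f"server{i}" for i in range(1, 6)]
cluster_map = build_cluster_map(servers)
target = server_for(cluster_map, "drawing:42")

# Scaling out: add a sixth server and publish a new map to the clients.
bigger_map = build_cluster_map(servers + ["server6"])
moved = sum(
    1 for p in range(NUM_PARTITIONS)
    if cluster_map[p]["active"] != bigger_map[p]["active"]
)
```

The application only ever asks for a key. Which server answers is decided by the map, so growing the cluster means moving partitions and handing clients an updated map, with no application changes. The naive round-robin placement here moves far more partitions than a real rebalance would need to.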
2M -> 4M Users in Two Days
Weeks in:
• Number 1 in AppStore
• Millions of new players per day
• 1,000 drawings/second
[Chart: Draw Something by OMGPOP, daily active users (millions, axis from 2 to 16) by date from 2/6 to 3/21, annotated at 30,000 downloads, 1,000,000 downloads and 2,000,000 downloads]
Bigger is Still Better
• Linear Scalability! Oh no…
• Size matters
• SSD to fight fragmentation impact
[Architecture diagram: HAProxy (load balancer), NGINX & Goliath Ruby app server, Couchbase Server]
Nodes Fail, Game Must Go On
[Diagram: two app servers, each with a Couchbase client library holding a cluster map, in front of a Couchbase Server cluster of five servers. Each server holds a set of active documents and a set of replica documents. User-configured replica count = 1.]
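The failover behaviour shown here can be sketched in a few lines. The example assumes a cluster map with one active and one replica server per partition (replica count = 1, as on the slide). The map layout and the `fail_over` function are hypothetical illustrations, not Couchbase's implementation.

```python
def fail_over(cluster_map, failed):
    """Return a new map in which partitions active on `failed` are served by their replicas."""
    new_map = {}
    for partition, roles in cluster_map.items():
        active, replica = roles["active"], roles["replica"]
        if active == failed:
            active, replica = replica, None   # promote the replica to active
        elif replica == failed:
            replica = None                    # replica copy lost, active unaffected
        new_map[partition] = {"active": active, "replica": replica}
    return new_map


cluster_map = {
    0: {"active": "server1", "replica": "server2"},
    1: {"active": "server2", "replica": "server3"},
    2: {"active": "server3", "replica": "server1"},
}

after = fail_over(cluster_map, "server3")

# Partitions now running without a replica until new nodes are added and rebalanced in.
degraded = [p for p, roles in after.items() if roles["replica"] is None]
```

Every partition still has an active copy after the failure, so the application stays available. The `degraded` list is the capacity that has to be replenished by adding nodes, which matches the "add new nodes to replenish capacity" point in the notes.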