Emily Greer at GDC 2018: Data-Driven or Data-Blinded?


In the last decade of data analysis, A/B testing and predictive modeling have transitioned from an afterthought to a given in the game industry. Data can be invaluable in understanding the player and making decisions, but it can just as easily lead the industry astray, or worse, narrow the way the industry thinks. When should you be driven by data, and when should you let your imagination roam free? This session will expose common mistakes and pitfalls, both technical and emotional, as well as provide practical guidance on how to improve the rigorousness of your tests and the quality of your data, and how to make sure you don't lose the forest for the trees.

Published in: Data & Analytics

  1. 1. It’s the cornerstone of many of the biggest businesses in the US, including Google & Amazon, and the backbone of most scientific undertakings.
  2. 2. But data is just a tool, and like almost every tool it has both uses and abuses, not to mention just straight up errors. How many conflicting health studies have you seen?
  3. 3. As a company Kongregate uses a lot of data, and some of you have probably seen talks I’ve given before where I share a lot of that data. But a lot of the time I’ve been unsure whether we’re this ship, charting a clean course to treasure, or this ship, towards disaster. Both have happened! And since I think that’s a pretty common phenomenon, I thought it would be a good talk for GDC.
  4. 4. I love using numbers & testing to understand the world. I still probably spend at least an hour a day poking around dashboards and spreadsheets because it’s so much more fun for me than meetings.
  5. 5. I’m mostly self-taught, majored in Eastern European Studies, not math or econ. Stumbled into direct marketing, specifically catalogs, after college, and fell in love with data. Taught myself SQL because I hated to wait for IT to pull my data, took math & econ classes to understand more theory. After 10 years in catalogs & e-commerce and a near-miss with econ grad school I co-founded Kongregate partly to do something completely different. But it hasn’t turned out to be that different after all. User acquisition in particular is fundamentally similar between catalogs & games.
  6. 6. Part of the reason I’m telling you this is to make my first point: And for an organization to do data right you can’t toss analysis back and forth over a wall to quants. It takes intimate knowledge of a game (and the development) to do good analysis and multiple perspectives and theories are good.
  7. 7. Sometimes it’s immediately obvious. One of the first games we launched on mobile was an endless runner. It wasn’t filtering purchases from jailbroken phones and was showing an average revenue per player of $500. That’s not very plausible and easily caught. But most issues are much more subtle – tracking pixels not firing correctly for a particular game on a particular browser, tutorial steps being completed twice by some players but not by others, clients reporting strange timestamps, etc. For this reason I recommend never relying on any analytic system where you can’t go in and inspect individual records. If you can’t check the detail there are some problems you’ll never find and fix.
  8. 8. Even when your data is accurate it can still be deceiving. This looks like 4 separate pictures photoshopped together to create an appealing color grid, right?
  9. 9. Wrong. So much of data is like these pictures – a set-up that appears straightforwardly to be one thing from one angle, turns out to be completely different from another.
  10. 10. Except of course you know I’m setting you up
  11. 11. People are playing game 1 longer than game 2, and buying repeatedly. But if you just concentrated on daily monetization stats you could miss that entirely.
  12. 12. The witnesses may be lying or confused. The crime scene may have been tampered with. You can’t trust any one piece of evidence but by cross-checking them against each other you can figure out what’s true and false.
  13. 13. Client data (our SDK, Adjust) vs. server data; app stores; benchmarking against other games; benchmarking deltas.
  14. 14. Your goal should be to create a 3-dimensional view of your players and your game. How people move through and interact with different parts. It’s a living, changing system and flat views are not enough.
  15. 15. We tend to think of playerbases as monolithic but really they are aggregations of all sorts of subgroups created by time in game, platform, device, browser, demographics, source – and these subgroups are shifting around. Changes in key KPIs are more often the result of changes in the audience than they are of changes in the game.
  16. 16. These examples show dramatic changes, but more subtle audience changes are happening all the time. Tracking cohorts by date of install/registration is a good way to track metrics independent of certain types of mix issues, but then it’s easy to lose track of events and changes in the game. So as ever, it’s about building a true picture across multiple sources.
  17. 17. 75% ARPDAU decline, then a modest recovery to ~50% of previous high.
  18. 18. When you break out ARPDAU by player age you can see that the decline isn’t nearly as dramatic. There’s some decline after a big holiday sale, and then again some as we expanded UA aggressively. But most of it is from fewer el
  19. 19. This is for a collectible card game where the player who goes first has a substantial advantage.
  20. 20. On this chart of player win rates for Tyrant it looks like Mission 24 is very difficult (50% win rate) and Mission 25 is easy (95% win rate). It’s sort of true: Mission 25 is relatively easy for those who attempt it. But by deck strength it’s harder than Mission 22, which has a 70% win rate. Mission 25 is easy for the players who are strong enough & skilled enough to beat Mission 24, a selected subgroup of those who attempted 24.
  21. 21. So for the last 10 minutes I’ve been ranting about how important it is to look at audience mix split
  22. 22. The most important metrics (revenue, sessions, battles, etc.) in games all follow power-law distributions. Your business (especially in free-to-play games) is driven by outliers, and their presence or absence distorts almost any data you look at.
  23. 23. Your outliers are your best players so it’s a good idea to do individual analysis on them to understand who they are, what drives them, and what they’re most likely to distort. Binary “yes/no” metrics like % buyer, D7 retention, tutorial completion are a lot more stable than averages involving revenue and engagement like ARPPU, $/DAU, Avg Sessions, and can be looked at in much smaller samples.
  24. 24. Sometimes we do it consciously, but more often it’s unconscious. I’ll look at a group of cohorts and the best one is ALWAYS the most memorable. If you’re in test market and hoping to hit 50% D1 retention, the days you hit that number will imprint on your brain that your game has 50% D1 retention, even if the average is 45%.
  25. 25. Cherry Picking’s great and good friend! Part of building a mental model of your game is having theories about behavior, and if you have a theory you should test it. But it’s really easy to look for the data that supports your theory and miss the data that contradicts it, or even just muddies the picture. [Can I find an example]
  26. 26. How you visualize data has a big impact on how you perceive it.
  27. 27. Ice cream consumption and drowning are correlated, because they’re both more likely to happen in hot weather. But “ice cream kills” would be a terrible conclusion. We’ve all heard this a thousand times but we need to keep hearing it like a mantra every day because we all make this same mistake over and over and over. We’re humans, we’re wired to search for causation. It’s our superpower and a curse.
  28. 28. Almost every metric you look at will be positively correlated with engagement because the most engaged users do everything more. Maybe Facebook is increasing engagement. Maybe only engaged players were willing to hit the button and potentially spam their friends.
  29. 29. This is the real way to separate correlation from causation and understand what’s really going on. But it’s not a magic bullet, because nothing is that easy. Testing has real costs in engineering time & overhead, complexity, and divisions/confusions for the players, and the more you’re running the worse that gets.
  30. 30. There are also a lot of ways to screw up A/B testing even though it seems so foolproof. Most A/B test traps are variations on themes I’ve mentioned but some are new, particularly issues around how people get assigned to tests.
  31. 31. For example, if you’re A/B testing your store, don’t assign people to the test unless they interact with the store. It’s often easier to split people as they arrive in your game, or at some other point, but a) there’s a chance you would end up with an unequal distribution of interaction with the tested feature and b) any signal from the test group would get lost in the noise of a larger sample.
  32. 32. Tests can have unintended consequences, so you should look at additional metrics beyond the one being tested to make sure that you get the full picture. Commercial A/B products often make you choose one metric for a test to prevent you from fishing for a good result to decide the test on. I think it’s more important to understand the full effects of the change that you made (though fishing is bad, too).
  33. 33. Early results tend to be both volatile and fascinating – differences are exaggerated or totally change direction. People tend to remember the early, interesting results rather than the actual results. People also often want to end the test early if they see a big swing, which is a bad idea. So I recommend that you don’t look at early test results except to make sure the test isn’t totally broken. How big should your test sample be? In my opinion the bigger the better.
  34. 34. When people talk about A/B tests you’ll often hear things like “we’ve got a statistically significant 5% lift!” And most people hear that and think it means the lift is definitely 5%. But that’s not how statistical significance tests work.
  35. 35. Statistical significance tests assume that there is some true lift, and that if you ran the same test repeatedly there would be a bell curve distribution of results, with the true lift as the average. Your 5% result could be right on the mean, or it could be an outlier on either end. If it’s statistically significant, then a result this large would be unlikely (usually a 5% chance or less) if there were no lift at all. But the true lift could be 1% or 10%. Conversely, if you do a test that doesn’t show a lift, or doesn’t pass the significance test for a small lift, that doesn’t mean there ISN’T a lift. This is why I like to run A/B tests with larger sample sizes. It’s like running the test again and averaging the results. It’s possible you’d get two outlier results in the same direction, but that becomes less and less likely, and it becomes more likely that your test results represent the true mean.
  36. 36. Often 70-80% of a free-to-play game’s revenue will come from a small % of buyers who spend more than $500.
  37. 37. Large sample sizes help here, too.
  38. 38. This can be really frustrating, even demoralizing for a team. When you’re going through the effort to make and test changes, you want them to mean something! You want to make progress. And then you get another non-result on a test. But finding out what doesn’t matter can actually be really powerful.
  39. 39. Here’s an extreme example of this from the team at Butterscotch Shenanigans, who made the game Crashlands. They had written up an elaborate, detailed description and decided to test how much impact it had using Google’s store testing system on Android against the most extreme possible variant, no description at all. Just the accolades the game has received.
  40. 40. They were kind enough to share the results and after 4 full months the test shows absolutely no difference, and that actually tells you a lot: specifically that the description has very little impact, and this is consistent with the testing we’ve done on our own games, as well. Time and resources are a constraint for virtually everybody, and knowing what is not important allows you to concentrate more on things that do matter. We used to argue endlessly over game names, but after doing test after test and not seeing much difference we’re all much more relaxed about it.
  41. 41. But it’s important not to extrapolate too much. Just because you get a particular
  42. 42. Specifically, late-game content is often very difficult to test, as is anything involving late-game players. Daniel Cook from Spry Fox tweeted this recently. He was talking about YouTube and algorithms, but I think it helps frame some of the limitations of testing. As a player plays a game, the game is shaping their expectations and experience, and training them to behave in certain ways. So the same player might react very differently based on how long they had been playing the game. And when engaged players start talking to each other in chat and forums they affect each other, too. Plus you run into small sample sizes with lots of outliers and other fun problems I’ve already talked about.
  43. 43. Tyrant was successful with a small core audience, but difficult to market.
  44. 44. CPIs for live version of Castaway Cove are okay, but much higher than we’d been targeting. Lots of ways we probably went wrong
  45. 45. So far data has helped us iterate on existing games, pointing us in the direction that helped get us from Tyrant to Animation Throwdown. But in
  46. 46. But what we don’t know is as important as what we do know
  47. 47. Data is always going to tell you to make an existing successful game, but better. It’s not going to tell you to make a game unlike anything people have played before.
  48. 48. But what we don’t know is as important as what we do know
  49. 49. Detectives, CSIs, Astronomers, Cartographers, Explorers:
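
The audience-mix effect from slides 15 to 18 can be sketched with a few lines of arithmetic. This is a minimal illustration, not Kongregate's analysis: the segment names, DAU counts, and ARPDAU values are all made up, and `blended_arpdau` is a hypothetical helper. ARPDAU inside each player-age segment is held constant, and only the mix of new versus elder players changes.

```python
def blended_arpdau(segments):
    """Overall ARPDAU = total revenue / total DAU across all segments.

    `segments` maps a segment name to a (dau, arpdau_in_dollars) pair.
    """
    total_dau = sum(dau for dau, _ in segments.values())
    total_revenue = sum(dau * arpdau for dau, arpdau in segments.values())
    return total_revenue / total_dau


# Hypothetical numbers: per-segment ARPDAU is identical in both periods.
# The only change is a large influx of new players (e.g. aggressive UA).
before = {"new (0-7 days)": (1_000, 0.05), "elder (30+ days)": (1_000, 0.75)}
after = {"new (0-7 days)": (9_000, 0.05), "elder (30+ days)": (1_000, 0.75)}

print(f"before: ${blended_arpdau(before):.2f}")
print(f"after:  ${blended_arpdau(after):.2f}")
```

With these invented inputs the blended figure falls from $0.40 to $0.12, a 70% drop, even though no segment monetizes any worse. Breaking the metric out by player age is what reveals that.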
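
Slide 35's point that a measured lift is only an estimate, and that larger samples narrow the range of plausible true lifts, can be illustrated with a confidence interval. The sketch below uses the standard normal-approximation interval for a difference between two conversion rates. It is one common textbook method, not necessarily what the talk's tests used, and the function name and all counts are hypothetical.

```python
import math


def conversion_diff_ci(buyers_a, n_a, buyers_b, n_b, z=1.96):
    """Approximate 95% confidence interval for (rate_b - rate_a).

    Uses the normal approximation for two independent proportions;
    z=1.96 corresponds to a two-sided 95% interval.
    Returns (low, observed_difference, high) as fractions.
    """
    p_a = buyers_a / n_a
    p_b = buyers_b / n_b
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    diff = p_b - p_a
    return diff - z * se, diff, diff + z * se


# Hypothetical test: control converts at 10.0%, variant at 10.5%
# (a 5% relative lift), observed at three different sample sizes per arm.
trials = [
    (20_000, 2_000, 2_100),
    (200_000, 20_000, 21_000),
    (2_000_000, 200_000, 210_000),
]
for n, buyers_a, buyers_b in trials:
    low, diff, high = conversion_diff_ci(buyers_a, n, buyers_b, n)
    print(f"n={n:>9,} per arm: diff={diff * 100:.2f} pts, "
          f"95% CI=({low * 100:.2f}, {high * 100:.2f}) pts")
```

With these invented counts, the same observed 0.5-point difference has an interval that includes zero at 20,000 players per arm, and one that excludes zero but still spans roughly 0.3 to 0.7 points at 200,000 per arm. The observed lift is the same each time. Only the certainty about the true lift changes.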