Data Gone Wrong - GDCNext 2013


In the last five years, data analysis, A/B testing and predictive modeling have transitioned from an afterthought to a given in the game industry. Data can be invaluable in understanding the player and making decisions, but it can just as easily lead us astray. This talk exposes common mistakes and pitfalls, both technical and emotional, and provides practical guidance on how to improve the rigor of your tests and the quality of your data.


  1. 1. 1
  2. 2. It’s the cornerstone of many of the biggest businesses in the US, including Google & Amazon, and the backbone of most scientific undertakings.
  3. 3. There’s plenty of cheerleading for data, so I want to spend today on cautionary tales & advice, in the hope of helping keep data more on the Iron Man side of the Robert Downey Jr. spectrum.
  4. 4. I love using numbers & testing to understand the world; I’m not a data hater by any means. If you want to know about medieval Transylvania or the Ottoman invasion of Hungary, I’m your woman. That didn’t get me far on the job market. Partly to do something different, partly to do games. But despite that intention it hasn’t been as different as I expected – user acquisition for games and catalogs is fundamentally the same. But Kongregate, because we’re an open platform with over 70k games and now a mobile game publisher, has provided some incredibly rich opportunities for data mining.
  5. 5. Part of the reason I’m telling you this is to make my first point. For an organization to do data right, you can’t toss analysis back and forth over a wall to quants. It takes intimate knowledge of a game (and its development) to do good analysis, and multiple perspectives and theories are good.
  6. 6. Sometimes it’s immediately obvious – we had a mobile game we launched recently, an endless runner, that wasn’t filtering purchases from jailbroken phones and was showing an ARPPU of $500, which was not very plausible and was easily caught. But most issues are much more subtle – tracking pixels not firing correctly for a particular game on a particular browser, tutorial steps being completed twice by some players but not by others, clients reporting strange timestamps, etc. For this reason you should never rely on any analytics system where you can’t go in and inspect individual records. If you can’t check the detail you’ll never be able to find and fix problems. We use Google Analytics for reference and corroboration but for nothing very crucial, and are using it less and less because of this.
  7. 7. This looks like 4 separate pictures photoshopped together to create an appealing color grid, right?
  8. 8. Wrong. So much of data is like these pictures – a set-up that appears straightforwardly to be one thing from one angle turns out to be completely different from another.
  9. 9. Except of course you know I’m setting you up.
  10. 10. I mentioned lifetime conversion and showed daily ARPPU. Lifetime conversion may be similar between the two games, but daily conversion is 40% higher for game 1. This is why $/DAU is not a very interesting stat on its own. If someone quotes just D1 retention and $/DAU, that’s not enough information to judge how a game monetizes.
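To make slide 10 concrete, here is a minimal sketch of how $/DAU breaks down into daily conversion times daily ARPPU. The figures are hypothetical, not the numbers from the talk; they are chosen only so that game 1's daily conversion is 40% higher while both games land on the same $/DAU.

```python
def revenue_per_dau(daily_conversion, daily_arppu):
    """$/DAU = share of daily actives who pay that day * average spend per payer that day."""
    return daily_conversion * daily_arppu


# Hypothetical figures (assumptions for illustration only).
games = {
    "game 1": {"daily_conversion": 0.028, "daily_arppu": 5.00},
    "game 2": {"daily_conversion": 0.020, "daily_arppu": 7.00},
}

if __name__ == "__main__":
    for name, stats in games.items():
        value = revenue_per_dau(stats["daily_conversion"], stats["daily_arppu"])
        print(f"{name}: conversion {stats['daily_conversion']:.1%}, "
              f"ARPPU ${stats['daily_arppu']:.2f}, $/DAU ${value:.3f}")
```

Both hypothetical games come out at about $0.14 per DAU, even though one gets there through more frequent, smaller purchases and the other through fewer, larger ones. The single number hides that difference.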
  11. 11. It’s a living, changing system. Flat views are not enough.
  12. 12. So here are a series of likely traps analysts can fall into. I know I have. They’re not in any particular order of importance because they’re all important. We tend to think of playerbases as monolithic, but really they are aggregations of all sorts of subgroups. It’s sort of like watching a meal go through a snake. Though with time cohorts it’s easy to lose track of events and changes in the game, so you can’t rely on those, either.
  13. 13. 13
  14. 14. 14
  15. 15. You may have noticed that win rates got a bit wacky toward the later missions in the last chart – this is a sample size issue. Even games with very substantial overall playerbases, like Tyrant, may end up with small sample sizes when you’re looking at uncommon behavior in subgroups. Early test market data is often tantalizing & fascinating, but it’s often the most unreliable because you’re combining small sample sizes with a non-representative subgroup – the people who discover you first are the most hard-core.
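One way to see why win rates on sparsely played missions look wacky is to put an interval around each rate. The sketch below uses the Wilson score interval, which is a standard choice for proportions and not something the talk prescribes; the counts are made up for illustration.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(successes, trials, confidence=0.95):
    """Wilson score interval for a proportion such as a mission win rate."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half_width = z * sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return center - half_width, center + half_width


if __name__ == "__main__":
    # Hypothetical counts: the same 60% observed win rate on a
    # popular early mission and on a rarely reached late mission.
    for label, wins, attempts in [("early mission", 6000, 10000),
                                  ("late mission", 12, 20)]:
        low, high = wilson_interval(wins, attempts)
        print(f"{label}: {wins}/{attempts} wins, 95% interval {low:.1%} to {high:.1%}")
```

With 20 attempts the interval spans tens of percentage points, so a jump or dip in that part of a chart says very little on its own.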
  16. 16. not normal (bell-curve) distributions, which affects everything. Theoretically, a power-law distribution may not even have a finite average value, since the infrequent top values can be arbitrarily large. The sample size you need depends on the frequency of the event – tutorial completion & D1 retention should be fine with just a few hundred users, % buyer with 500+, but I don’t like to look at ARPPU with much less than 5,000. These are just my rules of thumb based on experience and probably have no mathematical basis.
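The rule of thumb on slide 16 can be illustrated with a small simulation. This is only a sketch: it assumes spend per buyer follows a Pareto distribution with shape 1.5 (a convenient heavy-tailed stand-in, not a fit to any Kongregate data) and shows how much the sample average moves around at different sample sizes.

```python
import random
from statistics import quantiles


def spread_of_sample_means(sample_size, shape=1.5, repeats=200, seed=0):
    """Return the 5th and 95th percentile of the sample mean across repeated draws.

    Each draw simulates `sample_size` buyers whose spend is Pareto distributed
    (an assumption made purely for illustration).
    """
    rng = random.Random(seed)
    means = []
    for _ in range(repeats):
        total = sum(rng.paretovariate(shape) for _ in range(sample_size))
        means.append(total / sample_size)
    cuts = quantiles(means, n=20)  # 19 cut points: 5%, 10%, ..., 95%
    return cuts[0], cuts[-1]


if __name__ == "__main__":
    for n in (100, 500, 5000):
        low, high = spread_of_sample_means(n)
        print(f"{n:>5} buyers: 90% of simulated averages fall between {low:.2f} and {high:.2f}")
```

The band narrows as the sample grows, but much more slowly than it would for a bell-curve metric, which is consistent with wanting thousands of buyers before trusting an ARPPU figure.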
  17. 17. 17
  18. 18. 18
  19. 19. If you ask small questions you’ll usually get small answers. And the dirty secret of testing is that most tests are inconclusive anyway. It’s hard to move important metrics. So prioritize tests that significantly affect the game, like energy limits and regeneration, over button colors.
  20. 20. Your existing players are used to things working a certain way – a change in UI that makes things clearer for a new player may annoy/confuse an elder player. Where possible I like to test disruptive changes on new players only, and then roll out the test to other players if the test proves successful. A pricing change that increases non-buyer conversion might reduce repeat buyer revenue. For example, if you’re A/B testing your store, don’t assign people to the test unless they interact with the store. It’s often easier to split people as they arrive in your game, or at some other point, but a) there’s a chance you would end up with a non-equal distribution of interaction with the tested feature and b) any signal from the test group would get lost in the noise of a larger sample.
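The store example on slide 20 could be sketched as below. The names (`StoreTest`, `on_store_opened`, `variant_for`) are hypothetical, and hashing the player id is just one common way to get a stable split; the point is only that assignment happens at the first store interaction rather than on arrival.

```python
import hashlib


def variant_for(player_id, test_name, variants=("control", "treatment")):
    """Deterministically map a player to a variant by hashing test name and player id."""
    digest = hashlib.sha256(f"{test_name}:{player_id}".encode("utf-8")).hexdigest()
    return variants[int(digest, 16) % len(variants)]


class StoreTest:
    """Assigns players to a store test only when they first open the store."""

    def __init__(self, test_name):
        self.test_name = test_name
        self.assignments = {}  # player_id -> variant, only for players who opened the store

    def on_store_opened(self, player_id):
        # Players who never reach this call never enter the test population.
        if player_id not in self.assignments:
            self.assignments[player_id] = variant_for(player_id, self.test_name)
        return self.assignments[player_id]


if __name__ == "__main__":
    test = StoreTest("store_layout_v2")  # hypothetical test name
    for player in ["p1", "p2", "p3"]:
        print(player, test.on_store_opened(player))
    print("players in test:", len(test.assignments))
```

Because only store visitors are recorded, the analysis population contains exactly the players who could have seen the change, and a returning player always gets the same variant.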
  21. 21. Early results tend to be both volatile and fascinating – differences are exaggerated or totally change direction. People tend to remember the early, interesting results rather than the actual results. People also often want to end the test early if they see a big swing, which is a bad idea. We tested to see what gain we were getting from bonusing larger currency packages, which had to be judged on total revenue to make sure we were capturing both transaction size and repeat purchase factors. To make sure the 15% lift was real we broke buyers into cohorts by how much they’d spent ($0-$10, $100-$200, $200-$500, etc) and checked the distribution in each test group. On the bonus side of the test we saw fewer buyers under $20 and 30%+ gains on all the cohorts above $100, so we were confident that the gain was not being driven by a few big spenders. Again, the sample size should be worked backward from the frequency and distribution of the metric you’re judging the test on. There are online calculators to help you figure out what you need to get to statistical significance given an expected lift. My advice (if you have the playerbase and patience) is to then double or triple that. Why do I want my sample sizes so much bigger than the minimum?
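The online calculators mentioned on slide 21 typically implement something like the following. This is a sketch of the common normal-approximation formula for comparing two conversion rates, with assumed defaults of 5% significance and 80% power; it applies to rate metrics such as % buyer, not directly to a total-revenue test like the one described above.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_group(baseline_rate, relative_lift, alpha=0.05, power=0.8):
    """Approximate players needed per group to detect a relative lift in a rate."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    if not (0 < p1 < 1 and 0 < p2 < 1) or p1 == p2:
        raise ValueError("rates must be between 0 and 1 and the lift must be non-zero")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance threshold
    z_beta = NormalDist().inv_cdf(power)           # desired power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


if __name__ == "__main__":
    # Hypothetical baseline: 2% of players buy; we hope for a 15% relative lift.
    minimum = sample_size_per_group(0.02, 0.15)
    print(f"minimum per group: {minimum:,}")
    print(f"doubled, per the advice on the slide: {2 * minimum:,}")
```

Note how quickly the requirement grows as the expected lift shrinks: halving the lift roughly quadruples the sample needed.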
  22. 22. It comes down to some of the issues with judging results by statistical significance itself. It doesn’t mean what you probably think it means. Statistical significance tests assume that there is some true difference in lift, and that if you test there will be a bell curve distribution of results, with the true lift as the average. Your 5% result could be right on the mean, or it could be an outlier on either end. If it’s statistically significant then the chance is low (usually 5% or less) that there’s no lift at all. But the true lift could be 1% or 10%. It’s possible you’d get two outlier results in the same direction, but that becomes less and less likely, and it becomes more likely that your test results represent the true mean. And the size of the effect you are testing does matter, as it helps you understand the relative importance of different factors and what to prioritize testing next.
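Slide 22's point that a significant result can still leave the true lift very uncertain shows up clearly if you compute an interval rather than a yes/no verdict. The sketch below uses a simple normal-approximation (Wald) interval for the difference between two conversion rates, with hypothetical counts.

```python
from math import sqrt
from statistics import NormalDist


def difference_interval(control_buyers, control_players,
                        test_buyers, test_players, confidence=0.95):
    """Normal-approximation interval for (test rate - control rate)."""
    p_control = control_buyers / control_players
    p_test = test_buyers / test_players
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    std_err = sqrt(p_control * (1 - p_control) / control_players
                   + p_test * (1 - p_test) / test_players)
    diff = p_test - p_control
    return diff - z * std_err, diff + z * std_err


if __name__ == "__main__":
    # Hypothetical test: 5.0% vs 5.5% conversion, 20,000 players per side.
    low, high = difference_interval(1000, 20000, 1100, 20000)
    baseline = 1000 / 20000
    print(f"absolute difference: {low:.4f} to {high:.4f}")
    # Rough conversion to relative lift; it ignores uncertainty in the baseline itself.
    print(f"relative lift: roughly {low / baseline:.0%} to {high / baseline:.0%}")
```

In this made-up case the interval excludes zero, so the result counts as significant, yet the plausible relative lift ranges from barely above zero to nearly double the observed 10%. A larger sample is what narrows that range.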
  23. 23. For example we’ve had a lot of tests that increased registration and reduced retention, so much so that we now judge tests based on % retained registrations because that’s what we really care about. That’s not always possible, though.
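A small worked example of the registration-versus-retention trap on slide 23. The counts are invented, and "retained" stands for whatever return window you use (an assumption here, since the talk does not define it); the metric is retained registrations divided by visitors.

```python
def retained_registration_rate(visitors, retained_registrations):
    """Share of all visitors who both registered and were later retained."""
    if visitors <= 0:
        raise ValueError("visitors must be positive")
    return retained_registrations / visitors


# Hypothetical test results (assumptions for illustration only).
variants = {
    "control": {"visitors": 10_000, "registrations": 2_000, "retained": 800},
    "test": {"visitors": 10_000, "registrations": 2_500, "retained": 750},
}

if __name__ == "__main__":
    for name, v in variants.items():
        registration_rate = v["registrations"] / v["visitors"]
        kept = retained_registration_rate(v["visitors"], v["retained"])
        print(f"{name}: registration {registration_rate:.1%}, "
              f"retained registrations {kept:.1%}")
```

Judged on registration alone the test variant wins (25.0% against 20.0%), but on retained registrations it loses (7.5% against 8.0%).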
  24. 24. A good example of this is adding a Facebook login button on our website. If a player comes back on a different browser they need to be able to log in.
  25. 25. 25
  26. 26. This is about how you think about your business.