In the last five years, data analysis, A/B testing, and predictive modeling have transitioned from an afterthought to a given in the game industry. Data can be invaluable in understanding the player and making decisions, but it can just as easily lead us astray. This talk exposes common mistakes and pitfalls, both technical and emotional, and provides practical guidance on how to improve the rigor of your tests and the quality of your data.
It’s the cornerstone of many of the biggest businesses in the US, including
Google & Amazon, and the backbone of most scientific undertakings.
There’s plenty of cheerleading for data so I want to spend today on
cautionary tales & advice, in the hope of helping keep data more on the Iron
Man side of the Robert Downey Jr spectrum.
I love using numbers & testing to understand the world; I'm not a data hater by any means.
If you want to know about medieval Transylvania or the Ottoman invasion of
Hungary, I'm your woman. That didn't get me far on the job market.
Partly to do something different, partly to do games. But despite that intention it
hasn’t been as different as I expected – user acquisition for games and catalogs is
fundamentally the same. But Kongregate, because we’re an open platform with over
70k games and now a mobile game publisher, has provided some incredibly rich
opportunities for data mining.
Part of the reason I’m telling you this is to make my first point:
And for an organization to do data right you can't toss analysis back and forth
over a wall to quants. It takes intimate knowledge of a game (and the
development) to do good analysis, and multiple perspectives and theories are essential.
Sometimes it's immediately obvious: we had a mobile game we launched
recently, an endless runner, that wasn't filtering purchases from jailbroken
phones and was showing an ARPPU of $500, which was not very plausible and
easily caught. But most issues are much more subtle: tracking pixels not firing
correctly for a particular game on a particular browser, tutorial steps being
completed twice by some players but not by others, clients reporting strange values.
For this reason you should never rely on any analytic system where you can't
go in and inspect individual records. If you can't check the detail you'll never
be able to find and fix problems. We use Google Analytics for reference and
corroboration but nothing very crucial, and are using it less and less for exactly this reason.
This looks like 4 separate pictures photoshopped together to create an
appealing color grid, right?
So much of data is like these pictures: a set-up that appears
straightforwardly to be one thing from one angle turns out to be completely
different from another.
I mentioned lifetime conversion and showed daily ARPPU. Lifetime
conversion may be similar between the two games, but daily conversion is
40% higher for game 1.
This is why $/DAU is not a very interesting stat on its own. If someone
quotes just D1 retention and $/DAU that’s not enough information to judge
how a game monetizes.
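One way to see why $/DAU on its own underdetermines monetization: it factors into daily conversion times daily ARPPU. The sketch below uses entirely hypothetical numbers; two games can share the same $/DAU while converting and monetizing paying users very differently.

```python
def dollars_per_dau(daily_conversion, daily_arppu):
    """Revenue per daily active user.

    Identity: $/DAU = daily conversion (fraction of DAU who pay that day)
                      * daily ARPPU (revenue per paying user that day).
    """
    return daily_conversion * daily_arppu

# Hypothetical games with identical $/DAU but different profiles:
game_1 = dollars_per_dau(0.028, 10.0)  # 2.8% daily payers, $10 daily ARPPU
game_2 = dollars_per_dau(0.020, 14.0)  # 2.0% daily payers, $14 daily ARPPU

print(game_1, game_2)  # same $/DAU, very different monetization dynamics
```

Both come out to $0.28/DAU, which is exactly why quoting $/DAU (plus D1 retention) isn't enough to judge how a game monetizes.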
It’s a living, changing system. Flat views are not enough.
So here are a series of likely traps analysts can fall into. I know I have.
They’re not in a particular order of importance because they’re all important.
We tend to think of playerbases as monolithic but really they are
aggregations of all sorts of subgroups.
It’s sort of like watching a meal go through a snake.
Though with time cohorts it’s easy to lose track of events and changes in the
game, so you can’t rely on those, either.
You may have noticed that win rates got a bit wacky toward the later missions in the
last chart: this is a sample size issue.
Even games that overall have very substantial playerbases like Tyrant may end up with
small sample sizes when you’re looking at uncommon behavior in subgroups.
Early test market data is often tantalizing & fascinating, but it's also the most unreliable,
because you're combining small sample sizes with a non-representative subgroup: the
people who discover you first are the most hard-core.
Player spending follows power-law distributions,
not normal (bell-curve) distributions, which affects everything.
Theoretically it's not even possible to calculate a stable average value for some power
distributions, since the infrequent top values can push the mean toward infinity.
The sample size depends on the frequency of the event – tutorial completion & D1
retention should be fine with just a few hundred users, % buyer with 500+, but I don’t like
to look at ARPPU with much less than 5,000. These are just my rules of thumb based on
experience and probably have no mathematical basis.
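A quick simulation illustrates why ARPPU needs so much bigger samples than retention or tutorial completion. This is a minimal sketch with made-up numbers: spends are drawn from a Pareto distribution (a heavy-tailed stand-in for real spend data; the alpha and scale are purely illustrative), and ARPPU estimates from small buyer samples swing wildly because rare big spenders dominate the mean.

```python
import random
import statistics

random.seed(42)

def simulate_arppu(n_buyers):
    """Draw n_buyers spends from a heavy-tailed (Pareto-like) distribution
    and return the sample ARPPU (mean spend per buyer)."""
    # paretovariate(1.16) gives a classic "80/20"-style tail; the $5 scale
    # just makes the numbers look like dollars. Illustrative only.
    spends = [5.0 * random.paretovariate(1.16) for _ in range(n_buyers)]
    return statistics.mean(spends)

for n in (500, 5000):
    estimates = [simulate_arppu(n) for _ in range(200)]
    spread = max(estimates) - min(estimates)
    print(f"n={n}: 200 ARPPU estimates range over ${spread:,.0f}")
```

The spread of estimates typically shrinks as the sample grows, but far more slowly than it would for a bell-curve metric, which is consistent with the rule of thumb of not trusting ARPPU below a few thousand buyers.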
If you ask small questions you'll usually get small answers. And the dirty
secret of testing is that most tests are inconclusive anyway. It's hard to move
important metrics. So prioritize tests that significantly affect the game, like
energy limits and regeneration, over button colors.
Your existing players are used to things working a certain way: a change in
UI that makes things clearer for a new player may annoy or confuse an elder
player. Where possible I like to test disruptive changes on new players only,
and then roll out the test to other players if it proves successful. A
pricing change that increases non-buyer conversion might reduce repeat purchases.
For example, if you're A/B testing your store, don't assign people to the test
unless they interact with the store. It's often easier to split people as they
arrive in your game, or at some other point, but a) there's a chance you'd
end up with a non-equal distribution of interaction with the tested feature, and
b) any signal from the test group would get lost in the noise of a larger population.
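One common way to implement "assign at the point of interaction" is deterministic hash-based bucketing, called only the first time a player opens the store. This is a sketch under that assumption, not a description of any particular system; the experiment name and function names are invented for illustration.

```python
import hashlib

def store_bucket(user_id: str, experiment: str = "store_layout_v2") -> str:
    """Deterministically assign a user to a test arm from a hash of their id.

    Crucially, this is only ever called when the user actually opens the
    store, so players who never interact with the store are never enrolled.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return "variant" if int(digest, 16) % 2 else "control"

def on_store_opened(user_id, enrolled):
    """Enroll at the moment of interaction; assignment is stable thereafter."""
    if user_id not in enrolled:
        enrolled[user_id] = store_bucket(user_id)
    return enrolled[user_id]
```

Hashing on (experiment, user_id) keeps the assignment stable across sessions without storing anything until enrollment, and re-salting by experiment name avoids the same users always landing in the same arm across different tests.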
Early results tend to be both volatile and fascinating – differences are
exaggerated or totally change direction. People tend to remember the early,
interesting results rather than the actual results. People also often want to
end the test early if they see a big swing, which is a bad idea.
We tested to see what gain we were getting from bonusing purchases of larger
currency packages, which had to be judged on total revenue to make sure
we were capturing both transaction size and repeat purchase effects. To
make sure the 15% lift was real we broke buyers into cohorts by how much
they'd spent ($0-$10, $100-$200, $200-$500, etc.) and checked the
distribution in each test arm. On the bonus side of the test we saw fewer buyers
under $20 and 30%+ gains in all the cohorts above $100, so we were confident
that the gain was not being driven by a few big spenders.
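The cohort check described above can be sketched as grouping each arm's buyers by lifetime spend bucket and comparing revenue bucket by bucket rather than trusting the overall total. The bucket boundaries mirror the ones in the talk; the data below is invented for illustration.

```python
from collections import defaultdict

# Spend buckets roughly mirroring the talk's cohorts (illustrative).
BUCKETS = [(0, 10), (10, 100), (100, 200), (200, 500), (500, float("inf"))]

def bucket_label(spend):
    for lo, hi in BUCKETS:
        if lo <= spend < hi:
            return f"${lo}+" if hi == float("inf") else f"${lo}-{hi}"
    return "other"  # fallback for unexpected values (e.g. negative refunds)

def revenue_by_bucket(buyer_spends):
    """buyer_spends: lifetime spend per buyer in one test arm.
    Returns total revenue per spend bucket, so a lift can be checked
    cohort by cohort instead of only on the grand total."""
    totals = defaultdict(float)
    for spend in buyer_spends:
        totals[bucket_label(spend)] += spend
    return dict(totals)

control = [5, 15, 120, 250, 800]   # made-up spends, control arm
variant = [4, 18, 160, 310, 900]   # made-up spends, bonus arm
print(revenue_by_bucket(control))
print(revenue_by_bucket(variant))
```

If a lift only shows up in the top bucket, it's probably a few big spenders; gains spread across cohorts (as in the talk's 15% result) are much more trustworthy.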
Again this should be worked backward from the frequency and distribution of
the metric you're judging the test on. There are online calculators to help you
figure out what you need to reach statistical significance given an expected
lift. My advice (if you have the playerbase and patience) is to then double or
triple that. Why do I want my sample sizes so much bigger than the calculators suggest?
It comes down to some of the issues with judging results by statistical
significance itself. It doesn't mean what you probably think it means.
Statistical significance tests assume that there is some true lift,
and that if you ran the test repeatedly there would be a bell-curve distribution of
results, with the true lift as the average. Your 5% result could be right on the mean, or it could
be an outlier on either end. If it's statistically significant then the chance is
low (usually 5% or less) that there's no lift at all. But the true lift could be 1%,
not the 5% you measured.
It's possible you'd get two outlier results in the same direction, but it becomes less and less
likely, and more likely that your test results represent the true mean. And the size of the
effect you are testing does matter, as it helps you understand the relative importance of
different factors, and what to prioritize testing next.
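For reference, the computation the online calculators run is roughly the standard normal-approximation sample size formula for comparing two proportions. This sketch assumes 95% confidence and 80% power (the usual calculator defaults); the function name and example rates are mine, and the talk's advice is then to double or triple the result.

```python
from math import ceil

def n_per_arm(base_rate, relative_lift, z_alpha=1.96, z_beta=0.84):
    """Rough sample size per arm to detect a relative lift in a conversion
    rate, via the normal approximation for two proportions.
    z_alpha=1.96 -> 95% confidence (two-sided); z_beta=0.84 -> 80% power."""
    p1 = base_rate
    p2 = base_rate * (1 + relative_lift)
    p_bar = (p1 + p2) / 2
    n = 2 * (z_alpha + z_beta) ** 2 * p_bar * (1 - p_bar) / (p1 - p2) ** 2
    return ceil(n)

# e.g. a 5% buyer rate, hoping to detect a 10% relative lift:
n = n_per_arm(0.05, 0.10)
print(n, "per arm before doubling or tripling")
```

Note how fast the requirement grows as the expected lift shrinks: halving the detectable lift roughly quadruples the sample needed, which is why small-question tests so often come back inconclusive.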
For example we've had a lot of tests that increased registration but reduced
retention, so much so that we now judge tests based on % retained
registrations, because that's what we really care about, but that's not always so simple.
A good example of this is adding a Facebook login button on our website. If a
player comes back on a different browser they need to be able to login.