My data analysis project is on the Cookie Cats game, using Python (pandas, Seaborn, NumPy, Matplotlib), with a Tableau report on the number of players who installed each version.
2. Mobile Game In-game Data Analysis
This presentation analyzes the data from an A/B test
conducted on Cookie Cats, a mobile connect-three-style
puzzle game.
In this test, the first gate in Cookie Cats was moved from level 30 to
level 40.
In particular, we will analyze the impact on player
retention.
The analysis is written in Python and the libraries
used are pandas, NumPy, Matplotlib, and Seaborn.
3. Game description
Cookie Cats is a hugely popular
mobile puzzle game developed by
Tactile Entertainment.
It's a classic "connect three" style
puzzle game where the player
must connect tiles of the same
color in order to clear the board
and win the level.
4. Problem Definition
Did changing the position of
the gate affect the
player retention rate?
5. Let's Detect and Fix Potential Problems
• There appear to be no missing values in the dataset.
• Since there are only 2 unique values in the version column, we can convert the data
type of this column to "category".
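The two checks above can be sketched in pandas as follows. The small DataFrame is hypothetical sample data standing in for the real dataset, and the column names other than version and sum_gamerounds are assumptions about how the file is laid out.

```python
import pandas as pd

# Hypothetical sample rows; the real dataset would be loaded from file instead.
df = pd.DataFrame({
    "userid": [116, 337, 377, 483],
    "version": ["gate_30", "gate_30", "gate_40", "gate_40"],
    "sum_gamerounds": [3, 38, 165, 1],
    "retention_1": [False, True, True, False],   # assumed column name
    "retention_7": [False, False, False, False], # assumed column name
})

# Count missing values in every column.
missing_per_column = df.isnull().sum()
print(missing_per_column)

# Confirm that the version column holds only two distinct values.
n_versions = df["version"].nunique()
print(n_versions)

# Store the version column as a categorical type.
df["version"] = df["version"].astype("category")
print(df.dtypes)
```

A categorical column uses less memory than plain strings and makes the two-group structure of the test explicit.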
6. Outliers in Game Rounds
Summary statistics (describe) of the sum_gamerounds column in the dataset
7. There seems to be an outlier as high as 49854, a value so large that it
dwarfs all the others. This could be caused by:
1. an error in a manual data-entry process.
2. cheating by the player.
3. the hard work of a relentlessly hardcore player.
or something else.
8. We need to deal with this problem.
For now, the simplest option is to delete this record.
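A minimal sketch of inspecting the column and deleting the extreme record is shown below. The data is a hypothetical four-row sample that includes the 49854 value, and dropping the row with the maximum is only one possible way to remove it.

```python
import pandas as pd

# Hypothetical sample containing the extreme value mentioned above.
df = pd.DataFrame({
    "userid": [1, 2, 3, 4],
    "sum_gamerounds": [3, 38, 49854, 165],
})

# Summary statistics make the extreme maximum easy to spot.
print(df["sum_gamerounds"].describe())

# Find the row holding the largest value and drop it.
outlier_idx = df["sum_gamerounds"].idxmax()
df_clean = df.drop(index=outlier_idx).reset_index(drop=True)
print(df_clean)
```

This approach assumes there is exactly one such record; with several extreme values, a threshold filter would be a more suitable choice.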
9. Conduct Exploratory Data Analysis (EDA)
Let's start by examining the graph of the
empirical cumulative distribution function of
"sum_gamerounds" to get a glimpse of
player behavior. We start by defining several
functions that will be used repeatedly later.
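One such helper could look like the sketch below, which computes the points of an empirical cumulative distribution function with NumPy. The function name ecdf and the sample values are illustrative and not taken from the original notebook.

```python
import numpy as np

def ecdf(values):
    """Return sorted values x and cumulative proportions y for an ECDF plot."""
    x = np.sort(np.asarray(values))
    # The i-th sorted point (1-based) has i / n of the data at or below it.
    y = np.arange(1, len(x) + 1) / len(x)
    return x, y

# Hypothetical game-round counts for five players.
x, y = ecdf([10, 0, 3, 500, 3])
print(x)
print(y)
```

The resulting arrays can be passed to Matplotlib, for example plt.plot(x, y, marker=".", linestyle="none"), to draw the chart.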
10. Each data point indicates the proportion of players who played no more than
the number of rounds at that point. For example, at game rounds = 500, we can
see the corresponding y value is around 0.99. This means that 99% of the
players in the dataset played 500 rounds or fewer.
11.
12. From the above chart, we identify several issues worth investigating:
• Around 4.43% of the players did not finish even one game round, so we might
ask: did they encounter a problem while playing the game?
• Around 20% of the players played no more than 3 rounds and around 40%
stopped before level 11. If a round takes 3 minutes to finish on
average, we lose 20% of the players after about 9 minutes of play
and 40% of them after about 30 minutes.
• Over 63% of the players did not reach level 30, so we would like to
ask:
What made them stop before reaching the first gate (at least in the "gate_30"
version)?
Is this churn rate what we are expecting? If not, what can we do to improve?
How does this affect our A/B tests?
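The shares quoted above are proportions of players at or below a threshold, and the minutes follow from multiplying rounds by the assumed 3 minutes per round. The sketch below shows that arithmetic on a hypothetical ten-player sample, so its numbers differ from the real dataset.

```python
import pandas as pd

# Hypothetical round counts for ten players (not the real data).
rounds = pd.Series([0, 1, 2, 3, 5, 8, 12, 29, 30, 45], name="sum_gamerounds")

# The mean of a boolean mask is the proportion of players meeting the condition.
share_zero = (rounds == 0).mean()      # never finished a round
share_up_to_3 = (rounds <= 3).mean()   # played no more than 3 rounds
share_below_30 = (rounds < 30).mean()  # stopped before round 30

# Convert rounds to play time using the assumed 3 minutes per round.
minutes_per_round = 3
minutes_for_3_rounds = 3 * minutes_per_round
minutes_for_10_rounds = 10 * minutes_per_round

print(share_zero, share_up_to_3, share_below_30)
print(minutes_for_3_rounds, minutes_for_10_rounds)
```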
13.
14. From this chart, we observe that before level 30, the proportion of players for
the gate_30 version is lower than the one for the gate_40 version. Between
level 30 and level 40, the gap closes and the former surpasses the latter, as
annotated in the chart. This is quite interesting, and we may suspect that
setting the gate at level 30 is better at retaining players because, compared
to the gate_40 data, a larger proportion of players reached higher
levels after the gate. However, we need to be cautious about drawing any
conclusion based on this information alone.
15. We need to look at the retention rates to better formulate ideas about which
version is better.
The day-1 retention rate, which means the proportion of players who came back to play the game one
day after the installation, was around 45% for both versions, and the day-7 retention rate was around 19%.
The day-1 figure is quite alarming because it means that, on average, we
lost over half of the players one day after they installed the game.
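Retention rates like these can be computed as the mean of a boolean column, overall and per version. The sketch below uses hypothetical data, and the column names retention_1 and retention_7 are assumptions about the dataset layout.

```python
import pandas as pd

# Hypothetical sample: four players per version.
df = pd.DataFrame({
    "version": ["gate_30"] * 4 + ["gate_40"] * 4,
    "retention_1": [True, True, False, False, True, False, False, False],
    "retention_7": [True, False, False, False, False, False, False, False],
})

retention_cols = ["retention_1", "retention_7"]

# Overall retention: the share of True values in each column.
overall = df[retention_cols].mean()
print(overall)

# Retention per version: the same share computed within each group.
by_version = df.groupby("version")[retention_cols].mean()
print(by_version)
```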
16. The p-value is around 0.038, and this tells us:
The p-value is relatively small, so we might reasonably believe that the difference
between the retention rates of the 2 versions is real. Since the empirical difference is
positive, we can reasonably conclude that the gate_30 version has a better day-1 retention
rate than the gate_40 version does.
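The slides do not state which test produced this p-value. As an illustration only, the sketch below runs a one-sided permutation test on the difference in retention rates, using simulated data rather than the real dataset. The function name permutation_p_value and all parameter values are hypothetical.

```python
import numpy as np

def permutation_p_value(a, b, n_permutations=2000, seed=0):
    """One-sided permutation test for mean(a) - mean(b) > 0."""
    rng = np.random.default_rng(seed)
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    observed = a.mean() - b.mean()
    pooled = np.concatenate([a, b])
    count = 0
    for _ in range(n_permutations):
        # Shuffle group labels by permuting the pooled data.
        shuffled = rng.permutation(pooled)
        diff = shuffled[:len(a)].mean() - shuffled[len(a):].mean()
        if diff >= observed:
            count += 1
    # Share of shuffles with a difference at least as large as observed.
    return observed, count / n_permutations

# Simulated retention flags (assumed rates, not the real data).
rng = np.random.default_rng(42)
gate_30 = rng.random(2000) < 0.45
gate_40 = rng.random(2000) < 0.44

observed_diff, p_value = permutation_p_value(gate_30, gate_40)
print(observed_diff, p_value)
```

A small p-value here means that a difference as large as the observed one rarely arises when the version labels are assigned at random.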
17. Prediction:
For day-7 retention, the p-value is lower than 0.001, which is extremely small.
We can be very confident that the difference between the retention rates is real
and that the gate_30 version does indeed have a better day-7 retention rate than the gate_40 version
does.