IPL Data Analysis using Data Science

IPL Data Analysis
Kaushal Sanadhya
Indraprastha Institute of Information Technology, Delhi
Okhla Industrial Estate, Phase III
New Delhi, India
Email: kaushal19133@iiitd.ac.in
Abstract—Cricket is one of the most celebrated games in
the world. With the introduction of data science and machine
learning techniques to cricket, forecasting the score of a
match has become one of the most challenging problems.
In the shortest format, T20, score forecasting and the
analysis of other statistics matter even more, because any
moment can take the game away from the opposing team. Our
work makes several such predictions using machine learning
models such as RandomForestRegressor, linear regression, and
Radius Nearest Neighbors.
Index Terms—score forecasting, modeling Indian Premier
League data, regression, machine learning
I. INTRODUCTION
Several factors can strongly affect predictions, such as the
current score, wickets in hand, weather conditions, dew
factor, and pitch condition. We use a data set of 179,079
records covering every single ball in IPL matches from 2009
to 2019. The significant contributions of this project are as
follows:
a) Feature construction: We create new attributes
[balls remaining, current score, wickets in hand] that
capture the critical information in the data set
(deliveries.csv) much more efficiently than the original
attributes.
b) Final score prediction: We predict the eventual score
in the first innings.
II. FEATURE CONSTRUCTION
The existing features in deliveries.csv, such as over,
ball, is_super_over, wide_runs, and non_striker, are not
good enough to make confident, reliable predictions of the
final score. Therefore, the following new features are
created:
• balls remaining: number of balls remaining in the first
innings of the match.
• current score: current score of the team.
• wickets remaining: wickets in hand for the team.
• final score: final score of that team in that match. This
is the target variable that we are trying to forecast.
When these newly created features are used to predict the
final score, the machine learning models used in the project
achieve high R square values.
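The feature construction described in this section can be sketched as follows. This is a minimal illustration, not the project's actual code: it assumes pandas and assumes the ball-by-ball table has columns named match_id, inning, total_runs, and player_dismissed. The helper name build_features and the sample rows are hypothetical.

```python
import pandas as pd

BALLS_PER_INNINGS = 120  # 20 overs of 6 balls in a T20 innings


def build_features(deliveries: pd.DataFrame) -> pd.DataFrame:
    """Derive per-ball features for the first innings of every match."""
    # Column names below are assumptions about the ball-by-ball table.
    first = deliveries[deliveries["inning"] == 1].copy()
    match = first["match_id"]

    # Runs scored so far in the innings, including the current ball.
    first["current_score"] = first["total_runs"].groupby(match).cumsum()

    # A non-null player_dismissed is treated as one fallen wicket.
    fallen = first["player_dismissed"].notna().astype(int)
    first["wickets_remaining"] = 10 - fallen.groupby(match).cumsum()

    # Simplification: every row counts as one ball, so extras such as
    # wides are not treated specially; the count is floored at zero.
    balls_bowled = first.groupby("match_id").cumcount() + 1
    first["balls_remaining"] = (BALLS_PER_INNINGS - balls_bowled).clip(lower=0)

    # Target variable: total first-innings runs for the match.
    first["final_score"] = first["total_runs"].groupby(match).transform("sum")

    return first[["match_id", "balls_remaining", "current_score",
                  "wickets_remaining", "final_score"]]


# Tiny hand-made sample (not real IPL data); the last row is a
# second-innings ball and is filtered out.
sample = pd.DataFrame({
    "match_id": [1, 1, 1, 1],
    "inning": [1, 1, 1, 2],
    "total_runs": [1, 4, 0, 6],
    "player_dismissed": [None, None, "Batsman A", None],
})
features = build_features(sample)
print(features)
```

Each output row describes the state of the innings after one ball, so a single match contributes many training rows that share the same final score.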
III. PREDICTING FIRST INNINGS SCORES
Score prediction for the first innings is a typical multiple
regression problem: several input features are used to
predict a continuous output, the forecasted score.
A data set of 10,315 records is used to train our models.
A. Training Data
Match data for the following teams is used to train our
machine learning models:
• Chennai Super Kings
• Sunrisers Hyderabad
• Mumbai Indians
The training data consists of 10,316 records.
B. Test Data
Match data for Kolkata Knight Riders is used for testing.
In each evaluation run, ten Kolkata matches are selected at
random, and their actual final scores are compared with the
scores predicted by our machine learning models.
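The team-based split and the random selection of ten test matches might look like the sketch below. It assumes pandas, a feature table with match_id and batting_team columns, and team labels that match the strings in the data; the function names and the demo rows are hypothetical.

```python
import pandas as pd

# Team labels are assumptions; they must match the strings in the data.
TRAIN_TEAMS = ["Chennai Super Kings", "Sunrisers Hyderabad", "Mumbai Indians"]
TEST_TEAM = "Kolkata Knight Riders"


def split_by_team(features: pd.DataFrame):
    """Return (train, test) rows selected by batting team."""
    train = features[features["batting_team"].isin(TRAIN_TEAMS)]
    test = features[features["batting_team"] == TEST_TEAM]
    return train, test


def sample_test_matches(test: pd.DataFrame, n_matches: int = 10, seed=None):
    """Keep the rows of n_matches randomly chosen test matches."""
    match_ids = test["match_id"].drop_duplicates()
    n = min(n_matches, len(match_ids))
    chosen = match_ids.sample(n=n, random_state=seed)
    return test[test["match_id"].isin(chosen)]


# Hand-made rows for illustration only (not real IPL data).
demo = pd.DataFrame({
    "match_id": [1, 2, 3, 4, 5],
    "batting_team": ["Chennai Super Kings", "Mumbai Indians",
                     "Kolkata Knight Riders", "Kolkata Knight Riders",
                     "Some Other Team"],
    "final_score": [160, 175, 150, 182, 140],
})
train, test = split_by_team(demo)
picked = sample_test_matches(test, n_matches=1, seed=0)
print(len(train), len(test), len(picked))
```

Splitting by team rather than by random rows keeps every ball of a test match out of the training data.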
IV. REGRESSION MODELS USED
A. Multivariate Linear Regression
Most cricket problems involve more than two variables.
Therefore, multivariate linear regression is used to fit a
linear model in our multidimensional feature space. The
coefficients of this model also show the impact of each
feature on the predicted score, following the well-known
equation for linear regression:
Y = a + b*X1 + c*X2 + d*X3
For our analysis, the fitted equation is:
Final Score = 27.915 + 0.989121 * current score
              + 1.183421 * balls remaining - 3.576307 * wickets
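As a worked example, the fitted equation can be evaluated directly. The sketch below only plugs numbers into the coefficients quoted above; the function name and the sample match state are hypothetical, and the wickets argument is passed exactly as the wickets feature used in the equation, whose precise definition the text does not spell out.

```python
# Coefficients copied from the fitted equation in the text.
INTERCEPT = 27.915
COEF_CURRENT_SCORE = 0.989121
COEF_BALLS_REMAINING = 1.183421
COEF_WICKETS = -3.576307


def predict_final_score(current_score, balls_remaining, wickets):
    """Evaluate the fitted linear equation for one match state."""
    return (INTERCEPT
            + COEF_CURRENT_SCORE * current_score
            + COEF_BALLS_REMAINING * balls_remaining
            + COEF_WICKETS * wickets)


# Hypothetical state: 100 runs on the board, 30 balls left, wickets = 5.
prediction = predict_final_score(100, 30, 5)
print(round(prediction, 1))
```

Because the model is linear, each additional ball remaining adds about 1.18 runs to the forecast and each unit of the wickets feature subtracts about 3.58 runs, regardless of the match state.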
Actual Score    Predicted Score
150             154
119             114
204             210
130             131
155             152
160             169
222             232
163             160
223             231
148             149

1) Performance Evaluation: Mean Absolute Error and Root
Mean Squared Error are the two metrics used to evaluate
performance. The following values are achieved:
Mean Absolute Error: 5.13
Root Mean Squared Error: 6.02
These errors are moderately high: the actual and predicted
scores differ by as much as 10 runs. Performance can be
improved by using more advanced regression models such as
AdaBoost or Random Forest regression.
B. Random Forest Regression
Random Forest Regression is an ensemble technique that
makes use of multiple prediction models (decision trees). It
combines the results of these models to give more accurate
predictions.
Actual Score    Predicted Score
150             150
119             121
204             201
130             132
155             155
160             160
222             220
163             163
223             221
148             148
1) Importance Associated With Various Features: The
feature importance values indicate how much each feature
contributes to the prediction.
Feature            Importance
current score      0.47
balls remaining    0.30
wickets            0.23
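A Random Forest fit and its feature importances can be obtained along the following lines. This is a sketch that assumes scikit-learn and NumPy are available. The training data here is synthetic and generated by a made-up rule, and the hyperparameters are illustrative, so the printed importances will not match the values reported above.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

feature_names = ["current score", "balls remaining", "wickets"]

# Synthetic data for illustration only (not the IPL data set).
rng = np.random.default_rng(0)
n = 500
current_score = rng.integers(0, 200, size=n)
balls_remaining = rng.integers(0, 120, size=n)
wickets = rng.integers(0, 11, size=n)
X = np.column_stack([current_score, balls_remaining, wickets])

# Made-up target rule plus noise, used only to have something to fit.
y = (current_score + 1.2 * balls_remaining + 3.0 * wickets
     + rng.normal(0.0, 5.0, size=n))

# Hyperparameters are assumptions, not the settings used in the paper.
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X, y)

# Impurity-based importances; scikit-learn normalizes them to sum to 1.
importances = dict(zip(feature_names, model.feature_importances_))
for name, value in importances.items():
    print(f"{name}: {value:.2f}")

predictions = model.predict(X[:5])
```

The forest's prediction is the average of the predictions of its individual trees, which is what makes it less sensitive to noise than a single tree.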
2) Performance Evaluation: The Mean Absolute Error and
Root Mean Squared Error of this model are far lower than
those achieved with multivariate linear regression, which
highlights the power of ensemble regression models. The same
can be seen from the difference between the predicted and
actual scores, which is at most 3 runs.
C. Radius Neighbors Regression
This regression model is based on the concept of K nearest
neighbors. Whereas K nearest neighbors regression uses a
fixed number of neighbors, Radius Neighbors Regression uses
all training points within a specified distance of the query
point (here, the Manhattan distance).
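The idea behind radius-based neighbor regression can be illustrated in a few lines of plain Python. This is a simplified sketch with uniform weighting, not the library implementation used in the project. scikit-learn offers a RadiusNeighborsRegressor class for this purpose, which this sketch does not use. The training points, radius, and function names are hypothetical.

```python
def manhattan(a, b):
    """Manhattan (L1) distance between two equal-length points."""
    return sum(abs(x - y) for x, y in zip(a, b))


def radius_neighbors_predict(train_X, train_y, query, radius):
    """Average the targets of all training points within `radius`.

    Returns None when no training point lies inside the radius.
    """
    neighbors = [y for x, y in zip(train_X, train_y)
                 if manhattan(x, query) <= radius]
    if not neighbors:
        return None
    return sum(neighbors) / len(neighbors)


# Hand-made points: (current score, balls remaining, wickets).
train_X = [(100, 30, 6), (102, 28, 6), (60, 60, 9)]
train_y = [150, 156, 140]

# The first two points are at distance 2 from the query, the third at 75.
prediction = radius_neighbors_predict(train_X, train_y, (101, 29, 6), radius=5)
print(prediction)
```

Unlike K nearest neighbors, the number of points averaged varies from query to query, and a query in a sparse region may have no neighbors at all.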
Actual Score    Predicted Score
150             150
119             121
204             204
130             132
155             157
160             163
222             221
163             164
223             222
148             149
1) Performance Evaluation: The Mean Absolute Error and
Root Mean Squared Error of this model are better than those
of the multivariate linear regression model and comparable to
those of the Random Forest ensemble.
D. Comparison of R square Values
The R square values of the regression models are shown in
the graph below.
The high R square values for Random Forest and Radius
Nearest Neighbors indicate that these regression models
explain the variability of the response data around its mean
more effectively than the Multivariate Linear Regression
model.
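For reference, R square can be computed as one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below applies this standard definition to the ten actual and predicted scores tabulated for the linear and Random Forest models. The resulting numbers cover only those ten matches and rounded predictions, so they are illustrative and need not equal the values in the graph.

```python
def r_squared(actual, predicted):
    """R^2 = 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((y - p) ** 2 for y, p in zip(actual, predicted))
    ss_tot = sum((y - mean_actual) ** 2 for y in actual)
    return 1.0 - ss_res / ss_tot


# Scores copied from the result tables in Section IV.
actual = [150, 119, 204, 130, 155, 160, 222, 163, 223, 148]
linear_pred = [154, 114, 210, 131, 152, 169, 232, 160, 231, 149]
forest_pred = [150, 121, 201, 132, 155, 160, 220, 163, 221, 148]

r2_linear = r_squared(actual, linear_pred)
r2_forest = r_squared(actual, forest_pred)
print(f"linear: {r2_linear:.4f}, random forest: {r2_forest:.4f}")
```

A value of 1 means the predictions reproduce the actual scores exactly, while a value of 0 means the model does no better than always predicting the mean score.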
E. Comparison of Mean Absolute Error
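The Mean Absolute Error (MAE) is the mean of the absolute
differences between the actual and predicted scores over all
matches. The mathematical formula is given below:
\[ \mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right| \]
where MAE stands for Mean Absolute Error, n is the number of
matches, y_i is the actual score, and \hat{y}_i is the
predicted score for match i.
This difference is 6-7 runs for Multivariate Linear
Regression and 1-5 runs for Random Forest and Radius Nearest
Neighbor.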
The graph depicting these error values for all three models
is shown below:
F. Comparison of Root Mean Squared Error
The square root of the mean of the squared errors is called
the Root Mean Squared Error (RMSE). The mathematical formula
is given below:
\[ \mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2} \]
where n is the number of matches and y_i and \hat{y}_i are
the actual and predicted scores for match i.
The following graph compares these Root Mean Squared Error
values.
The Root Mean Squared Error and Mean Absolute Error graphs
are similar to each other. Both graphs show that Random
Forest and Radius Nearest Neighbor Regression perform better
than the Multivariate Linear Regression model.
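Both error metrics can be recomputed from the actual and predicted scores tabulated in Section IV, as in the sketch below. The tabulated predictions are rounded to whole runs, so the results are approximate and may differ slightly from the reported values (for example, 5.13 and 6.02 for linear regression). The function and variable names are illustrative.

```python
import math


def mean_absolute_error(actual, predicted):
    """Mean of |actual - predicted| over all matches."""
    return sum(abs(y - p) for y, p in zip(actual, predicted)) / len(actual)


def root_mean_squared_error(actual, predicted):
    """Square root of the mean squared difference."""
    mse = sum((y - p) ** 2 for y, p in zip(actual, predicted)) / len(actual)
    return math.sqrt(mse)


# Scores copied from the three result tables in Section IV.
actual = [150, 119, 204, 130, 155, 160, 222, 163, 223, 148]
predictions = {
    "linear regression": [154, 114, 210, 131, 152, 169, 232, 160, 231, 149],
    "random forest": [150, 121, 201, 132, 155, 160, 220, 163, 221, 148],
    "radius neighbors": [150, 121, 204, 132, 157, 163, 221, 164, 222, 149],
}

results = {
    name: (mean_absolute_error(actual, pred),
           root_mean_squared_error(actual, pred))
    for name, pred in predictions.items()
}
for name, (mae, rmse) in results.items():
    print(f"{name}: MAE={mae:.2f}, RMSE={rmse:.2f}")
```

RMSE is never smaller than MAE on the same data, and the gap widens when a few matches have much larger errors than the rest, because squaring penalizes large misses more heavily.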
