1. CRICKET MATCH WIN PREDICTOR USING LOGISTIC
REGRESSION
Under the Supervision of :
Mrs. Teressa Longjam
Team Members:
V. Aravind Reddy
V. Yaswanth Reddy
K. Praveen
B. Satyanarayana
Department of Computer Science
and Engineering
2. Contents
● Abstract
● Introduction
● Flow chart
● Outline of the Project and Software tools
● Data and Features
● Sigmoid Function
● Intuition
● Logistic Regression
● Exploratory Data Analysis(Part1,Part2)
● Model Fitting
● Performance Metrics
● References
● Conclusion and Future Work
3. Abstract
This project aims to find the features that most accurately
predict the probability of a team winning or losing. It also shows
how the stochastic gradient descent optimization technique is
used to update the weights and obtain the best linear
combination of features. We fit the model with a scikit-learn
Pipeline, using a ColumnTransformer inside the pipeline so that
columns of different data types are preprocessed in a single
step.
4. Introduction:
● We find the important features by merging the two data
frames, performing Exploratory Data Analysis on the result, and
fitting a Logistic Regression model to the data to obtain the
winning probability of either team.
● Given the first-innings score and the current state of the
second innings, the model predicts the winning probability of both teams.
❖ Objectives of the Project
6. ❖ Outline of the project and Software Tools:
● This project shows how Exploratory Data Analysis can be used to derive
important features, and how a suitable machine learning algorithm can then be
applied to build an application that predicts the winning probability of a cricket team.
● Feature (variable) importance indicates how much each feature contributes to the
model's predictions. Essentially, it measures how useful a specific variable is to
the current model and its predictions.
● The data used in this project consists of two CSV files collected from
Kaggle: one describing the matches and one containing ball-by-ball delivery data.
● For software, we use the Google Colab environment together with the
numpy, pandas, and sklearn Python libraries.
7. Data
● The code below prints the shapes of the two data frames, i.e. the matches
data frame and the deliveries data frame.
● The matches data frame contains data on 756 matches across 18 features.
● The deliveries data frame contains almost 180,000 (1.8 lakh) deliveries across 21 features.
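The shape check described above can be sketched as follows. The toy frames below stand in for the Kaggle files (the real ones have shapes (756, 18) and roughly (179000, 21)); the column names are illustrative assumptions.

```python
import pandas as pd

# Toy stand-ins for the two Kaggle CSVs; in the actual project these
# would be loaded with pd.read_csv on the matches and deliveries files.
matches = pd.DataFrame({"id": [1, 2], "city": ["Mumbai", "Pune"]})
deliveries = pd.DataFrame({"match_id": [1, 1, 2],
                           "inning": [1, 2, 1],
                           "total_runs": [4, 1, 6]})

# .shape reports (rows, columns) for each frame
print(matches.shape)     # (2, 2)
print(deliveries.shape)  # (3, 3)
```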
12. Logistic Regression
● Logistic Regression is a classifier that can be applied in single-label or
multi-label classification setups.
● Logistic Regression is a discriminative classifier.
● It obtains the probability of a sample belonging to a specific class by computing
the sigmoid (aka logistic function) of a linear combination of the features.
● The weight vector for the linear combination is learnt via model training.
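The computation described above (sigmoid of a linear combination of features) can be sketched as below; the weight, feature, and bias values are purely illustrative, not learnt values from this project.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real z to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative values: w would be learnt during training,
# x is a feature vector, b is the bias term.
w = np.array([0.8, -0.5, 0.3])
x = np.array([1.2, 0.4, 2.0])
b = -0.1

# Probability of the sample belonging to the positive class (e.g. a win)
p = sigmoid(w @ x + b)
```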
13. Exploratory Data Analysis (Part-1)
● The main goal of the EDA is to extract a single data frame with the
important features from the two given data frames.
● First, in the deliveries data frame, we group the runs by
(match_id, inning) to get the total for each innings of every match.
● Then we set the target as the number of runs in the first innings + 1.
● Since the two data frames share the match_id column, we merge
them on that id.
● After merging, we drop matches decided by the D/L method, matches
abandoned due to rain, and rows with missing data.
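The steps above (group, compute target, merge, filter) can be sketched on a toy frame; the column names `match_id`, `inning`, `total_runs`, and `dl_applied` follow the Kaggle data, while the values are made up for illustration.

```python
import pandas as pd

# Toy deliveries frame: one match, two innings
deliveries = pd.DataFrame({
    "match_id":   [1, 1, 1, 1],
    "inning":     [1, 1, 2, 2],
    "total_runs": [4, 6, 1, 2],
})
matches = pd.DataFrame({"id": [1], "dl_applied": [0], "city": ["Mumbai"]})

# Group runs by (match_id, inning) to get per-innings totals
totals = (deliveries.groupby(["match_id", "inning"])["total_runs"]
          .sum().reset_index())

# Target = first-innings runs + 1
first_inn = totals[totals["inning"] == 1].copy()
first_inn["target"] = first_inn["total_runs"] + 1

# Merge on the shared match id, then drop D/L-affected matches
merged = matches.merge(first_inn, left_on="id", right_on="match_id")
merged = merged[merged["dl_applied"] == 0]
print(merged[["match_id", "target"]])  # target = 4 + 6 + 1 = 11
```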
14. Exploratory Data Analysis(Part-2)
● In part 2 of the analysis, we construct the cumulative score after
every ball.
● From this cumulative score we can calculate the required runs, the required
run rate, and other important features useful for predicting the
win probability.
● We then compute the key features cur_run_rate, req_run_rate,
balls_left, and wickets_left using the formulae below (over and ball are 1-indexed):
● balls_left = 126 - (over*6 + current_ball)
● cur_run_rate = (current_score*6) / (120 - balls_left)
● req_run_rate = (runs_left*6) / balls_left
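The formulae above can be checked with a small worked example; the match situation below (10th over, 3rd ball, 78/targeting 161) is invented for illustration.

```python
# 1-indexed over and ball, as in the slide formulae:
# over=10, current_ball=3 means 9 full overs + 3 balls = 57 balls bowled.
over, current_ball = 10, 3
current_score, target = 78, 161

balls_left = 126 - (over * 6 + current_ball)   # 126 - 63 = 63
runs_left = target - current_score             # 161 - 78 = 83

# Current run rate uses balls already bowled (120 - balls_left = 57)
cur_run_rate = (current_score * 6) / (120 - balls_left)
# Required run rate uses balls remaining
req_run_rate = (runs_left * 6) / balls_left
```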
15. Features currently derived through the EDA
● batting_team
● bowling_team
● city
● runs_left
● balls_left
● wickets
● total_runs
● req_run_rate
● cur_run_rate
● result
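Fitting the model on these features with the Pipeline and ColumnTransformer mentioned in the abstract can be sketched as below. The training rows are made-up toy values, and the exact preprocessing choices (OneHotEncoder on the categorical columns, numeric columns passed through) are an assumption about the setup, not the project's exact code.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy training frame with the derived features (values are invented)
X = pd.DataFrame({
    "batting_team": ["A", "B", "A", "B"],
    "bowling_team": ["B", "A", "B", "A"],
    "city":         ["Mumbai", "Pune", "Mumbai", "Pune"],
    "runs_left":    [40, 80, 10, 60],
    "balls_left":   [30, 60, 12, 36],
    "wickets":      [7, 4, 9, 3],
    "total_runs":   [160, 170, 150, 180],
    "cur_run_rate": [8.0, 7.5, 9.0, 7.0],
    "req_run_rate": [8.0, 8.0, 5.0, 10.0],
})
y = [1, 0, 1, 0]  # result: 1 = batting team won

# ColumnTransformer encodes the categorical columns and passes the
# numeric columns through unchanged, all in one preprocessing step.
pre = ColumnTransformer(
    [("cat", OneHotEncoder(handle_unknown="ignore"),
      ["batting_team", "bowling_team", "city"])],
    remainder="passthrough",
)
pipe = Pipeline([("pre", pre), ("clf", LogisticRegression())])
pipe.fit(X, y)

# predict_proba gives [P(lose), P(win)] for the batting team
proba = pipe.predict_proba(X.iloc[[0]])
print(proba)
```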
17. Performance Analysis
● To evaluate the performance of the model, we use accuracy_score from the
sklearn library as the metric.
● accuracy_score = (number of correct predictions) / (total number of samples).
● The accuracy score of the model was 86%.
● This score means the model predicts the correct match outcome
in 86% of the test cases.
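The metric defined above is a one-liner in scikit-learn; the labels below are toy values, not the project's actual test set.

```python
from sklearn.metrics import accuracy_score

# Toy ground-truth labels and model predictions
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 0]

# 4 correct predictions out of 5 samples -> 0.8
acc = accuracy_score(y_true, y_pred)
print(acc)  # 0.8
```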
18. Overfitting and Underfitting
● Bias: the assumptions a model makes to make the target function easier to
learn. In practice it shows up as the error rate on the training data: a high
training error is called high bias, and a low training error is called low
bias.
● Variance: the difference between the error rate on the training data and on
the test data. A large gap is called high variance and a small gap is called
low variance. We usually want low variance so that the model generalizes well.
19. Underfitting
● A statistical model or a machine learning algorithm is said to underfit
when it cannot capture the underlying trend of the data: it performs poorly
on the training data and, consequently, on the test data as well (it's just
like trying to fit into undersized pants).
● Reasons for Underfitting
1. High bias and low variance.
2. The training dataset is too small.
3. The model is too simple.
4. The training data is not cleaned and contains noise.
20. Techniques to reduce underfitting
● Increase model complexity
● Increase the number of features using feature engineering.
● Remove noise from the data.
● Increase the number of epochs, or train for longer, to get better results.
22. Overfitting
● A statistical model is said to be overfitted when it fits the training data
well but fails to make accurate predictions on test data.
● When a model is trained too long or is too flexible, it starts learning the
noise and inaccurate entries in the data set.
● The model then fails to categorize new data correctly, because it has
memorized too many details and too much noise.
● Reasons for overfitting:
● High variance and low bias; the model is too complex; the training data is too small.
23. Techniques for reducing overfitting
● Increase the training data
● Reduce the model complexity
● Early stopping during the training phase.
● Using Regularization
● Other techniques for reducing overfitting.
24. Conclusion and Future Work
● From this project we can conclude how important feature extraction is,
and how a machine learning model built on those features can power
useful applications.
● In the future, this project can be extended to predict the win probability
from the first innings itself.
● Datasets of previous matches can be used to make such first-innings
predictions.
● To accept custom input and predict the result, we are designing a front
end where all the derived feature values can be entered to obtain the
probability.
25. References
● Ananda Bandulasiri, "Predicting the Winner in One Day International Cricket",
Journal of Mathematical Sciences & Mathematics Education.
● Tejinder Singh, Vishal Singla and Parteek Bhatia, "Score and Winning
Prediction in Cricket through Data Mining", 8 October 2015.