SlideShare a Scribd company logo
Group 4
Team members
Ravi
Richa
Sabarish
Vijay
Problem Statement
Flight Delay Prediction
Dataset understanding and description
No. of features =29, shape of data set = 484551 rows x 28 columns,
Target Variable = ArrDelay
Dataset understanding and description
 Missing Values
 Org_Airport -1177
 Dest_Airport -1479
 Duplicate Data
 2 rows are duplicate
 Duplicate columns
 Columns with same information i.e Org_Airport and Dest_Airport these are repeated
information in data set Origin represented by three letter code for Org_Airport and Dest
represents three letter code for Dest_Airport
 Categorical Variables
 1.UniqueCarrier
 2.FlightNum
 3.TailNum
 4.Origin
 5.Dest
Outliers
# -Columns having outliers :-
 arrdelay
 deepdelay
 taxiout
 carrier delay
 security delay
 late aircraft delay
Data Visualization Techniques used
 Box Plots
 Heat Maps
 Histogram
 Line graphs
 Pie chart
 PairPlot
 Used sweetviz library for more visualization
Screenshots of different visualizations
When we took threshold as .9 then following
features are correlated{'AirTime',
'CRSElapsedTime', 'DepDelay', 'Distance'}
EDA
flight_eda_report.html
Data Processing and FeatureEngg.
 Imputation – Check for Null values in data set and handle it by diff techniques
 Categorical Encoding – target encoding is used as high cardinality
 Handling Outliers – Box plots to analyse outliers in data
 Scaling - Min Max scaler is used
 Feature Selection- Correlation helped in getting feature correlation with each
other and target
 Feature Split – Derived features from Date and time features
Data set post FE – 24 features for Modelling
Our Research
 Analysed the data deeply from domain perspective which gave very interesting insights.
 Different types of categorical handling we have researched on and then came up with target encoding
 As we know deletion of features are always most impactful decision we used both visualization and
domain knowledge to do this part
 Missing Value handling we tried different approached and then finalized one
 We have derived attributes from given features which we really felt will be helpful for further analysing
Future Tasks
 More Feature Engineering
 Training the model on the selected features
 Model development
 Model assessment
Take away from last Meet
 Group Dynamics
 Elements of Data
 Dynamic data
Elements of Data
Feature Engg.
Logistic Regression – SFS for Feature Engineering
Logistic Regression – SFS for Feature
Engineering
Splitting of data
The test_size=0.2 It is split of test and training data as 80/20percent .
X_train data shape after splitting (387639, 24)
X_test data shape after splitting (96910, 24)
y_train data shape after splitting (387639,)
y_test data shape after splitting (96910,)
Linear Regression Model
Interpretation - The R² represents how much variance of the data is explained by
the model, the R2=0.90 means that 0.10 of the variance can not explain by the
model, the logical case when R2=1 the model completely fit and explained all
variance.
 Y = a + bX
 b = slope
 a = intercept
 X= coefficients or features
Ridge Regression Model
Ridge regression is a model tuning method that is used to analyse any data
that suffers from multicollinearity. This method performs L2 regularization.
When the issue of multicollinearity occurs, least-squares are unbiased, and
variances are large, this results in predicted values to be far away from the
actual values.
mean_squared_error with Ridge Regression with train data
0.0033528868720357806
R2 square with Ridge Regression with train data 0.9999999965474273
mean_squared_error with Ridge Regression with test data
0.0027463365607708775
R2 square with Ridge Regression with test data 0.9999999976478994
SVC
Random Forest Regressor
Neural Network
Future Recommendation
1. Regression vs Classification Problem
2. Dataset can have more records for delay = 0
3. Dataset can have more relevant features according to the domain
knowledge/experience
Comparison of Models
Interpretation
 Simple linear regression led to overfitting giving an unrealistic accuracy of 100%. This problem caused by
overfitting is well addressed by applying regularization on the regression model.We have used L2 Regularization
that is Ridge Regression to overcome this issue.
 SVM model is extremely unsuitable for this problem as it takes an unreasonable amount of time(near about 3
hours) to run the model and also gives subpar accuracy. It is computationally expensive and inappropriate for
problems with large datasets such as the one given.
 Random forest is also giving us good accuracy 98 %
 ANN is giving 98 %
Dynamic data
Dynamic data out using linear regression and Ridge Regression

More Related Content

Similar to casestudy_important.pptx

LIDAR- Light Detection and Ranging.
LIDAR- Light Detection and Ranging.LIDAR- Light Detection and Ranging.
LIDAR- Light Detection and Ranging.
Gaurav Agarwal
 
Human Activity Recognition Using AccelerometerData
Human Activity Recognition Using AccelerometerDataHuman Activity Recognition Using AccelerometerData
Human Activity Recognition Using AccelerometerData
IRJET Journal
 
Svd filtered temporal usage clustering
Svd filtered temporal usage clusteringSvd filtered temporal usage clustering
Svd filtered temporal usage clusteringLiang Xie, PhD
 
Parallel KNN for Big Data using Adaptive Indexing
Parallel KNN for Big Data using Adaptive IndexingParallel KNN for Big Data using Adaptive Indexing
Parallel KNN for Big Data using Adaptive Indexing
IRJET Journal
 
Application of combined support vector machines in process fault diagnosis
Application of combined support vector machines in process fault diagnosisApplication of combined support vector machines in process fault diagnosis
Application of combined support vector machines in process fault diagnosis
Dr.Pooja Jain
 
IRJET- Performance Evaluation of Various Classification Algorithms
IRJET- Performance Evaluation of Various Classification AlgorithmsIRJET- Performance Evaluation of Various Classification Algorithms
IRJET- Performance Evaluation of Various Classification Algorithms
IRJET Journal
 
IRJET- Performance Evaluation of Various Classification Algorithms
IRJET- Performance Evaluation of Various Classification AlgorithmsIRJET- Performance Evaluation of Various Classification Algorithms
IRJET- Performance Evaluation of Various Classification Algorithms
IRJET Journal
 
EE660_Report_YaxinLiu_8448347171
EE660_Report_YaxinLiu_8448347171EE660_Report_YaxinLiu_8448347171
EE660_Report_YaxinLiu_8448347171Yaxin Liu
 
Software Analytics In Action: A Hands-on Tutorial on Mining, Analyzing, Model...
Software Analytics In Action: A Hands-on Tutorial on Mining, Analyzing, Model...Software Analytics In Action: A Hands-on Tutorial on Mining, Analyzing, Model...
Software Analytics In Action: A Hands-on Tutorial on Mining, Analyzing, Model...
Chakkrit (Kla) Tantithamthavorn
 
Data Assessment and Analysis for Model Evaluation
Data Assessment and Analysis for Model Evaluation Data Assessment and Analysis for Model Evaluation
Data Assessment and Analysis for Model Evaluation
SaravanakumarSekar4
 
Apache Lens at Hadoop meetup
Apache Lens at Hadoop meetupApache Lens at Hadoop meetup
Apache Lens at Hadoop meetup
amarsri
 
Data science with R - Clustering and Classification
Data science with R - Clustering and ClassificationData science with R - Clustering and Classification
Data science with R - Clustering and Classification
Brigitte Mueller
 
Empirical Analysis of Radix Sort using Curve Fitting Technique in Personal Co...
Empirical Analysis of Radix Sort using Curve Fitting Technique in Personal Co...Empirical Analysis of Radix Sort using Curve Fitting Technique in Personal Co...
Empirical Analysis of Radix Sort using Curve Fitting Technique in Personal Co...
IRJET Journal
 
Backtracking based integer factorisation, primality testing and square root c...
Backtracking based integer factorisation, primality testing and square root c...Backtracking based integer factorisation, primality testing and square root c...
Backtracking based integer factorisation, primality testing and square root c...
csandit
 
IRJET - Stock Market Prediction using Machine Learning Algorithm
IRJET - Stock Market Prediction using Machine Learning AlgorithmIRJET - Stock Market Prediction using Machine Learning Algorithm
IRJET - Stock Market Prediction using Machine Learning Algorithm
IRJET Journal
 
A FLOATING POINT DIVISION UNIT BASED ON TAYLOR-SERIES EXPANSION ALGORITHM AND...
A FLOATING POINT DIVISION UNIT BASED ON TAYLOR-SERIES EXPANSION ALGORITHM AND...A FLOATING POINT DIVISION UNIT BASED ON TAYLOR-SERIES EXPANSION ALGORITHM AND...
A FLOATING POINT DIVISION UNIT BASED ON TAYLOR-SERIES EXPANSION ALGORITHM AND...
csandit
 
My Postdoctoral Research
My Postdoctoral ResearchMy Postdoctoral Research
My Postdoctoral Research
Po-Ting Wu
 
Data mining with caret package
Data mining with caret packageData mining with caret package
Data mining with caret package
Vivian S. Zhang
 
Predicting Employee Attrition
Predicting Employee AttritionPredicting Employee Attrition
Predicting Employee Attrition
Shruti Mohan
 

Similar to casestudy_important.pptx (20)

LIDAR- Light Detection and Ranging.
LIDAR- Light Detection and Ranging.LIDAR- Light Detection and Ranging.
LIDAR- Light Detection and Ranging.
 
Human Activity Recognition Using AccelerometerData
Human Activity Recognition Using AccelerometerDataHuman Activity Recognition Using AccelerometerData
Human Activity Recognition Using AccelerometerData
 
Svd filtered temporal usage clustering
Svd filtered temporal usage clusteringSvd filtered temporal usage clustering
Svd filtered temporal usage clustering
 
Parallel KNN for Big Data using Adaptive Indexing
Parallel KNN for Big Data using Adaptive IndexingParallel KNN for Big Data using Adaptive Indexing
Parallel KNN for Big Data using Adaptive Indexing
 
Application of combined support vector machines in process fault diagnosis
Application of combined support vector machines in process fault diagnosisApplication of combined support vector machines in process fault diagnosis
Application of combined support vector machines in process fault diagnosis
 
IRJET- Performance Evaluation of Various Classification Algorithms
IRJET- Performance Evaluation of Various Classification AlgorithmsIRJET- Performance Evaluation of Various Classification Algorithms
IRJET- Performance Evaluation of Various Classification Algorithms
 
IRJET- Performance Evaluation of Various Classification Algorithms
IRJET- Performance Evaluation of Various Classification AlgorithmsIRJET- Performance Evaluation of Various Classification Algorithms
IRJET- Performance Evaluation of Various Classification Algorithms
 
EE660_Report_YaxinLiu_8448347171
EE660_Report_YaxinLiu_8448347171EE660_Report_YaxinLiu_8448347171
EE660_Report_YaxinLiu_8448347171
 
Software Analytics In Action: A Hands-on Tutorial on Mining, Analyzing, Model...
Software Analytics In Action: A Hands-on Tutorial on Mining, Analyzing, Model...Software Analytics In Action: A Hands-on Tutorial on Mining, Analyzing, Model...
Software Analytics In Action: A Hands-on Tutorial on Mining, Analyzing, Model...
 
Data Assessment and Analysis for Model Evaluation
Data Assessment and Analysis for Model Evaluation Data Assessment and Analysis for Model Evaluation
Data Assessment and Analysis for Model Evaluation
 
Apache Lens at Hadoop meetup
Apache Lens at Hadoop meetupApache Lens at Hadoop meetup
Apache Lens at Hadoop meetup
 
Data science with R - Clustering and Classification
Data science with R - Clustering and ClassificationData science with R - Clustering and Classification
Data science with R - Clustering and Classification
 
Empirical Analysis of Radix Sort using Curve Fitting Technique in Personal Co...
Empirical Analysis of Radix Sort using Curve Fitting Technique in Personal Co...Empirical Analysis of Radix Sort using Curve Fitting Technique in Personal Co...
Empirical Analysis of Radix Sort using Curve Fitting Technique in Personal Co...
 
Backtracking based integer factorisation, primality testing and square root c...
Backtracking based integer factorisation, primality testing and square root c...Backtracking based integer factorisation, primality testing and square root c...
Backtracking based integer factorisation, primality testing and square root c...
 
IRJET - Stock Market Prediction using Machine Learning Algorithm
IRJET - Stock Market Prediction using Machine Learning AlgorithmIRJET - Stock Market Prediction using Machine Learning Algorithm
IRJET - Stock Market Prediction using Machine Learning Algorithm
 
A FLOATING POINT DIVISION UNIT BASED ON TAYLOR-SERIES EXPANSION ALGORITHM AND...
A FLOATING POINT DIVISION UNIT BASED ON TAYLOR-SERIES EXPANSION ALGORITHM AND...A FLOATING POINT DIVISION UNIT BASED ON TAYLOR-SERIES EXPANSION ALGORITHM AND...
A FLOATING POINT DIVISION UNIT BASED ON TAYLOR-SERIES EXPANSION ALGORITHM AND...
 
My Postdoctoral Research
My Postdoctoral ResearchMy Postdoctoral Research
My Postdoctoral Research
 
Data mining with caret package
Data mining with caret packageData mining with caret package
Data mining with caret package
 
Oct.22nd.Presentation.Final
Oct.22nd.Presentation.FinalOct.22nd.Presentation.Final
Oct.22nd.Presentation.Final
 
Predicting Employee Attrition
Predicting Employee AttritionPredicting Employee Attrition
Predicting Employee Attrition
 

Recently uploaded

Guidance_and_Counselling.pdf B.Ed. 4th Semester
Guidance_and_Counselling.pdf B.Ed. 4th SemesterGuidance_and_Counselling.pdf B.Ed. 4th Semester
Guidance_and_Counselling.pdf B.Ed. 4th Semester
Atul Kumar Singh
 
Honest Reviews of Tim Han LMA Course Program.pptx
Honest Reviews of Tim Han LMA Course Program.pptxHonest Reviews of Tim Han LMA Course Program.pptx
Honest Reviews of Tim Han LMA Course Program.pptx
timhan337
 
Language Across the Curriculm LAC B.Ed.
Language Across the  Curriculm LAC B.Ed.Language Across the  Curriculm LAC B.Ed.
Language Across the Curriculm LAC B.Ed.
Atul Kumar Singh
 
Chapter -12, Antibiotics (One Page Notes).pdf
Chapter -12, Antibiotics (One Page Notes).pdfChapter -12, Antibiotics (One Page Notes).pdf
Chapter -12, Antibiotics (One Page Notes).pdf
Kartik Tiwari
 
The Challenger.pdf DNHS Official Publication
The Challenger.pdf DNHS Official PublicationThe Challenger.pdf DNHS Official Publication
The Challenger.pdf DNHS Official Publication
Delapenabediema
 
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
siemaillard
 
Home assignment II on Spectroscopy 2024 Answers.pdf
Home assignment II on Spectroscopy 2024 Answers.pdfHome assignment II on Spectroscopy 2024 Answers.pdf
Home assignment II on Spectroscopy 2024 Answers.pdf
Tamralipta Mahavidyalaya
 
Acetabularia Information For Class 9 .docx
Acetabularia Information For Class 9  .docxAcetabularia Information For Class 9  .docx
Acetabularia Information For Class 9 .docx
vaibhavrinwa19
 
Biological Screening of Herbal Drugs in detailed.
Biological Screening of Herbal Drugs in detailed.Biological Screening of Herbal Drugs in detailed.
Biological Screening of Herbal Drugs in detailed.
Ashokrao Mane college of Pharmacy Peth-Vadgaon
 
S1-Introduction-Biopesticides in ICM.pptx
S1-Introduction-Biopesticides in ICM.pptxS1-Introduction-Biopesticides in ICM.pptx
S1-Introduction-Biopesticides in ICM.pptx
tarandeep35
 
How libraries can support authors with open access requirements for UKRI fund...
How libraries can support authors with open access requirements for UKRI fund...How libraries can support authors with open access requirements for UKRI fund...
How libraries can support authors with open access requirements for UKRI fund...
Jisc
 
How to Make a Field invisible in Odoo 17
How to Make a Field invisible in Odoo 17How to Make a Field invisible in Odoo 17
How to Make a Field invisible in Odoo 17
Celine George
 
Multithreading_in_C++ - std::thread, race condition
Multithreading_in_C++ - std::thread, race conditionMultithreading_in_C++ - std::thread, race condition
Multithreading_in_C++ - std::thread, race condition
Mohammed Sikander
 
Lapbook sobre os Regimes Totalitários.pdf
Lapbook sobre os Regimes Totalitários.pdfLapbook sobre os Regimes Totalitários.pdf
Lapbook sobre os Regimes Totalitários.pdf
Jean Carlos Nunes Paixão
 
2024.06.01 Introducing a competency framework for languag learning materials ...
2024.06.01 Introducing a competency framework for languag learning materials ...2024.06.01 Introducing a competency framework for languag learning materials ...
2024.06.01 Introducing a competency framework for languag learning materials ...
Sandy Millin
 
Normal Labour/ Stages of Labour/ Mechanism of Labour
Normal Labour/ Stages of Labour/ Mechanism of LabourNormal Labour/ Stages of Labour/ Mechanism of Labour
Normal Labour/ Stages of Labour/ Mechanism of Labour
Wasim Ak
 
Unit 2- Research Aptitude (UGC NET Paper I).pdf
Unit 2- Research Aptitude (UGC NET Paper I).pdfUnit 2- Research Aptitude (UGC NET Paper I).pdf
Unit 2- Research Aptitude (UGC NET Paper I).pdf
Thiyagu K
 
The basics of sentences session 5pptx.pptx
The basics of sentences session 5pptx.pptxThe basics of sentences session 5pptx.pptx
The basics of sentences session 5pptx.pptx
heathfieldcps1
 
Unit 8 - Information and Communication Technology (Paper I).pdf
Unit 8 - Information and Communication Technology (Paper I).pdfUnit 8 - Information and Communication Technology (Paper I).pdf
Unit 8 - Information and Communication Technology (Paper I).pdf
Thiyagu K
 
TESDA TM1 REVIEWER FOR NATIONAL ASSESSMENT WRITTEN AND ORAL QUESTIONS WITH A...
TESDA TM1 REVIEWER  FOR NATIONAL ASSESSMENT WRITTEN AND ORAL QUESTIONS WITH A...TESDA TM1 REVIEWER  FOR NATIONAL ASSESSMENT WRITTEN AND ORAL QUESTIONS WITH A...
TESDA TM1 REVIEWER FOR NATIONAL ASSESSMENT WRITTEN AND ORAL QUESTIONS WITH A...
EugeneSaldivar
 

Recently uploaded (20)

Guidance_and_Counselling.pdf B.Ed. 4th Semester
Guidance_and_Counselling.pdf B.Ed. 4th SemesterGuidance_and_Counselling.pdf B.Ed. 4th Semester
Guidance_and_Counselling.pdf B.Ed. 4th Semester
 
Honest Reviews of Tim Han LMA Course Program.pptx
Honest Reviews of Tim Han LMA Course Program.pptxHonest Reviews of Tim Han LMA Course Program.pptx
Honest Reviews of Tim Han LMA Course Program.pptx
 
Language Across the Curriculm LAC B.Ed.
Language Across the  Curriculm LAC B.Ed.Language Across the  Curriculm LAC B.Ed.
Language Across the Curriculm LAC B.Ed.
 
Chapter -12, Antibiotics (One Page Notes).pdf
Chapter -12, Antibiotics (One Page Notes).pdfChapter -12, Antibiotics (One Page Notes).pdf
Chapter -12, Antibiotics (One Page Notes).pdf
 
The Challenger.pdf DNHS Official Publication
The Challenger.pdf DNHS Official PublicationThe Challenger.pdf DNHS Official Publication
The Challenger.pdf DNHS Official Publication
 
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
 
Home assignment II on Spectroscopy 2024 Answers.pdf
Home assignment II on Spectroscopy 2024 Answers.pdfHome assignment II on Spectroscopy 2024 Answers.pdf
Home assignment II on Spectroscopy 2024 Answers.pdf
 
Acetabularia Information For Class 9 .docx
Acetabularia Information For Class 9  .docxAcetabularia Information For Class 9  .docx
Acetabularia Information For Class 9 .docx
 
Biological Screening of Herbal Drugs in detailed.
Biological Screening of Herbal Drugs in detailed.Biological Screening of Herbal Drugs in detailed.
Biological Screening of Herbal Drugs in detailed.
 
S1-Introduction-Biopesticides in ICM.pptx
S1-Introduction-Biopesticides in ICM.pptxS1-Introduction-Biopesticides in ICM.pptx
S1-Introduction-Biopesticides in ICM.pptx
 
How libraries can support authors with open access requirements for UKRI fund...
How libraries can support authors with open access requirements for UKRI fund...How libraries can support authors with open access requirements for UKRI fund...
How libraries can support authors with open access requirements for UKRI fund...
 
How to Make a Field invisible in Odoo 17
How to Make a Field invisible in Odoo 17How to Make a Field invisible in Odoo 17
How to Make a Field invisible in Odoo 17
 
Multithreading_in_C++ - std::thread, race condition
Multithreading_in_C++ - std::thread, race conditionMultithreading_in_C++ - std::thread, race condition
Multithreading_in_C++ - std::thread, race condition
 
Lapbook sobre os Regimes Totalitários.pdf
Lapbook sobre os Regimes Totalitários.pdfLapbook sobre os Regimes Totalitários.pdf
Lapbook sobre os Regimes Totalitários.pdf
 
2024.06.01 Introducing a competency framework for languag learning materials ...
2024.06.01 Introducing a competency framework for languag learning materials ...2024.06.01 Introducing a competency framework for languag learning materials ...
2024.06.01 Introducing a competency framework for languag learning materials ...
 
Normal Labour/ Stages of Labour/ Mechanism of Labour
Normal Labour/ Stages of Labour/ Mechanism of LabourNormal Labour/ Stages of Labour/ Mechanism of Labour
Normal Labour/ Stages of Labour/ Mechanism of Labour
 
Unit 2- Research Aptitude (UGC NET Paper I).pdf
Unit 2- Research Aptitude (UGC NET Paper I).pdfUnit 2- Research Aptitude (UGC NET Paper I).pdf
Unit 2- Research Aptitude (UGC NET Paper I).pdf
 
The basics of sentences session 5pptx.pptx
The basics of sentences session 5pptx.pptxThe basics of sentences session 5pptx.pptx
The basics of sentences session 5pptx.pptx
 
Unit 8 - Information and Communication Technology (Paper I).pdf
Unit 8 - Information and Communication Technology (Paper I).pdfUnit 8 - Information and Communication Technology (Paper I).pdf
Unit 8 - Information and Communication Technology (Paper I).pdf
 
TESDA TM1 REVIEWER FOR NATIONAL ASSESSMENT WRITTEN AND ORAL QUESTIONS WITH A...
TESDA TM1 REVIEWER  FOR NATIONAL ASSESSMENT WRITTEN AND ORAL QUESTIONS WITH A...TESDA TM1 REVIEWER  FOR NATIONAL ASSESSMENT WRITTEN AND ORAL QUESTIONS WITH A...
TESDA TM1 REVIEWER FOR NATIONAL ASSESSMENT WRITTEN AND ORAL QUESTIONS WITH A...
 

casestudy_important.pptx

  • 3. Dataset understanding and description No. of features =29, shape of data set = 484551 rows x 28 columns, Target Variable = ArrDelay
  • 4. Dataset understanding and description  Missing Values  Org_Airport -1177  Dest_Airport -1479  Duplicate Data  2 rows are duplicate  Duplicate columns  Columns with same information i.e Org_Airport and Dest_Airport these are repeated information in data set Origin represented by three letter code for Org_Airport and Dest represents three letter code for Dest_Airport  Categorical Variables  1.UniqueCarrier  2.FlightNum  3.TailNum  4.Origin  5.Dest
  • 5. Outliers # -Columns having outliers :-  arrdelay  deepdelay  taxiout  carrier delay  security delay  late aircraft delay
  • 6. Data Visualization Techniques used  Box Plots  Heat Maps  Histogram  Line graphs  Pie chart  PairPlot  Used sweetviz library for more visualization
  • 7. Screenshots of different visualizations When we took threshold as .9 then following features are correlated{'AirTime', 'CRSElapsedTime', 'DepDelay', 'Distance'}
  • 9. Data Processing and FeatureEngg.  Imputation – Check for Null values in data set and handle it by diff techniques  Categorical Encoding – target encoding is used as high cardinality  Handling Outliers – Box plots to analyse outliers in data  Scaling - Min Max scaler is used  Feature Selection- Correlation helped in getting feature correlation with each other and target  Feature Split – Derived features from Date and time features Data set post FE – 24 features for Modelling
  • 10. Our Research  Analysed the data deeply from domain perspective which gave very interesting insights.  Different types of categorical handling we have researched on and then came up with target encoding  As we know deletion of features are always most impactful decision we used both visualization and domain knowledge to do this part  Missing Value handling we tried different approached and then finalized one  We have derived attributes from given features which we really felt will be helpful for further analysing
  • 11. Future Tasks  More Feature Engineering  Training the model on the selected features  Model development  Model assessment Take away from last Meet  Group Dynamics  Elements of Data  Dynamic data
  • 13.
  • 14. Feature Engg. Logistic Regression – SFS for Feature Engineering
  • 15. Logistic Regression – SFS for Feature Engineering
  • 16. Splitting of data The test_size=0.2 It is split of test and training data as 80/20percent . X_train data shape after splitting (387639, 24) X_test data shape after splitting (96910, 24) y_train data shape after splitting (387639,) y_test data shape after splitting (96910,)
  • 17. Linear Regression Model Interpretation - The R² represents how much variance of the data is explained by the model, the R2=0.90 means that 0.10 of the variance can not explain by the model, the logical case when R2=1 the model completely fit and explained all variance.
  • 18.  Y = a + bX  b = slope  a = intercept  X= coefficients or features
  • 19. Ridge Regression Model Ridge regression is a model tuning method that is used to analyse any data that suffers from multicollinearity. This method performs L2 regularization. When the issue of multicollinearity occurs, least-squares are unbiased, and variances are large, this results in predicted values to be far away from the actual values. mean_squared_error with Ridge Regression with train data 0.0033528868720357806 R2 square with Ridge Regression with train data 0.9999999965474273 mean_squared_error with Ridge Regression with test data 0.0027463365607708775 R2 square with Ridge Regression with test data 0.9999999976478994
  • 20. SVC
  • 23. Future Recommendation 1. Regression vs Classification Problem 2. Dataset can have more records for delay = 0 3. Dataset can have more relevant features according to the domain knowledge/experience
  • 25. Interpretation  Simple linear regression led to overfitting giving an unrealistic accuracy of 100%. This problem caused by overfitting is well addressed by applying regularization on the regression model.We have used L2 Regularization that is Ridge Regression to overcome this issue.  SVM model is extremely unsuitable for this problem as it takes an unreasonable amount of time(near about 3 hours) to run the model and also gives subpar accuracy. It is computationally expensive and inappropriate for problems with large datasets such as the one given.  Random forest is also giving us good accuracy 98 %  ANN is giving 98 %
  • 26. Dynamic data Dynamic data out using linear regression and Ridge Regression