JanataHack: Mobility Analytics
Quick Surge Price Prediction for a Cab Aggregator
Introduction
With the rise of cab aggregators and the growing demand for mobility solutions,
the past decade has seen immense growth in data collected from commercial
vehicles, with major contributors such as Uber, Lyft and Ola, to name a few.
Many innovative data science and machine learning solutions are being built on
such data, and they have created tremendous business value for these
organizations.
This is a hackathon on the mobility business conducted by Analytics Vidhya.
This presentation describes my attempt at tackling the challenge.
Problem Statement
Welcome to Sigma Cab Private Limited, a cab aggregator service. Their customers
can download their app on smartphones and book a cab from anywhere in the cities
they operate in. They, in turn, search for cabs from various service providers and
provide the best of the available options to their client. They have been in
operation for a little less than a year now. During this period, they have captured
surge_pricing_type from the service providers.
You have been hired by Sigma Cabs as a Data Scientist and have been asked to build
a predictive model that could help them predict the surge_pricing_type
proactively. This would in turn help them match the right cabs with the right
customers quickly and efficiently.
Preliminary Understanding
• Train and Test data show similar patterns in their means and quartiles. This is
great: we can assume that the test data is distributed like the train data, so a model
fitted on Train should carry over to Test.
• Train and Test have no empty Train_ID or Train_Distance values.
• There are a few NaN values in Type_of_Cab for both Train and Test. Let's create a new
category ‘F’ for all the NaN values.
• Customer_Since_Months has a few NaN values; replace them with 0, treating these
customers as newcomers to the cab service.
• Life_Style_Index, Confidence_Life_Style_Index: these are proprietary values from the cab
company, and we have no idea how they are derived. We could omit the NaN rows, since
replacing them with 0 might mean something different, or perform EDA and decide later.
• Destination_Type, Customer_Rating and Cancellation_Last_1Month have no missing values.
• Var1 is masked by the company and is very sparse. We definitely can't remove all records
with NaN values, nor can we assume they are 0. We can take a call on this after EDA.
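The two simple imputations from the list above (a new ‘F’ category for Type_of_Cab and 0 for Customer_Since_Months) could be sketched roughly as follows. The toy DataFrame and the helper name fill_basic_gaps are illustrative assumptions, not taken from the original notebook; only the column names come from the slides.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real Train/Test data (values are made up).
df = pd.DataFrame({
    "Type_of_Cab": ["A", np.nan, "B", np.nan],
    "Customer_Since_Months": [3.0, np.nan, 10.0, 0.0],
})

def fill_basic_gaps(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with the two simple imputations applied (hypothetical helper)."""
    out = frame.copy()
    # Missing cab type becomes its own category 'F'.
    out["Type_of_Cab"] = out["Type_of_Cab"].fillna("F")
    # Missing tenure is read as "new customer", i.e. 0 months.
    out["Customer_Since_Months"] = out["Customer_Since_Months"].fillna(0)
    return out

clean = fill_basic_gaps(df)
print(clean)
```

Applying the same helper to both Train and Test keeps the two sets encoded consistently.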
Correlation check
Var2 and Var3 are correlated in both Test and Train, so Var2 has been removed.
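A check of this kind might look like the sketch below. The data is synthetic and deliberately built so that Var2 tracks Var3, and the 0.8 threshold is an assumed cut-off for illustration; the slides do not state which threshold, if any, was used.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
var3 = rng.normal(70, 10, size=200)
toy = pd.DataFrame({
    "Var2": 0.5 * var3 + rng.normal(0, 1, size=200),  # built to track Var3
    "Var3": var3,
    "Customer_Rating": rng.uniform(1, 5, size=200),
})

THRESHOLD = 0.8  # assumed cut-off, not from the slides

# Pearson correlation between the two masked variables.
r = toy["Var2"].corr(toy["Var3"])

# Drop one column of the pair only when the correlation is strong.
reduced = toy.drop(columns=["Var2"]) if abs(r) > THRESHOLD else toy
print(f"corr(Var2, Var3) = {r:.3f}; remaining columns: {list(reduced.columns)}")
```

Running the same check on Train and Test separately is a cheap way to confirm that the relationship holds in both sets before dropping the column from each.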
EDA to understand Life_Style_Index
From the 3 scatter plots, we can see that most values of Life_Style_Index lie
between 2 and 3.5.
For simplicity, we fill the NaN values with the mode (2.7) in both Train and Test.
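A minimal sketch of mode imputation is shown here on made-up values. Rounding to one decimal before taking the mode is an assumption added so that a continuous column has a meaningful most frequent value; computing the mode on Train and reusing it for Test is likewise one possible choice, not necessarily what the original notebook did.

```python
import numpy as np
import pandas as pd

# Made-up values; only the column name comes from the slides.
train = pd.DataFrame({"Life_Style_Index": [2.7, 2.7, 3.1, np.nan, 2.4]})
test = pd.DataFrame({"Life_Style_Index": [2.9, np.nan, 2.7]})

# mode() ignores NaN by default and returns a Series; take its first entry.
mode_value = train["Life_Style_Index"].round(1).mode().iloc[0]

# Reuse the Train mode for both sets so they are filled consistently.
train["Life_Style_Index"] = train["Life_Style_Index"].fillna(mode_value)
test["Life_Style_Index"] = test["Life_Style_Index"].fillna(mode_value)
print(mode_value)
```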
EDA to understand Confidence_Life_Style_Index
It looks like Confidence_Life_Style_Index is randomly assigned with an equal
distribution. For simplicity, let's assign A, B and C in equal proportions to the NaNs in the field.
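One way to spread A, B and C evenly over the missing entries is to cycle through the labels and shuffle them, so the three counts differ by at most one. The helper fill_equally and its seed parameter are hypothetical names used for illustration; the slides do not say how the assignment was implemented.

```python
import numpy as np
import pandas as pd

def fill_equally(series: pd.Series, labels=("A", "B", "C"), seed=0) -> pd.Series:
    """Fill NaNs with the labels in (near) equal counts, in random order."""
    out = series.copy()
    missing = out.isna()
    n_missing = int(missing.sum())
    # Cycle through the labels until every missing slot has one.
    fill = np.resize(np.array(labels, dtype=object), n_missing)
    # Shuffle so the assignment does not follow row order.
    np.random.default_rng(seed).shuffle(fill)
    out.loc[missing] = fill
    return out

s = pd.Series(["A", np.nan, "B", np.nan, np.nan, "C", np.nan, np.nan, np.nan])
filled = fill_equally(s)
print(filled.value_counts())
```

Fixing the seed keeps the fill reproducible between runs, which matters when comparing model scores.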
EDA on Surge_Pricing_Type
There is no large difference between the target class counts, so sampling isn't required.
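A quick way to back up a statement like this is to look at the class shares and the ratio between the largest and smallest class. The labels and counts below are made up for illustration; no distribution figures are given in the slides.

```python
import pandas as pd

# Made-up target values; the real column is Surge_Pricing_Type.
y = pd.Series([1, 2, 2, 3, 3, 2, 1, 3, 2, 3], name="Surge_Pricing_Type")

# Share of each class, ordered by label.
shares = y.value_counts(normalize=True).sort_index()

# Ratio of the most to the least frequent class; values near 1 mean balanced.
imbalance_ratio = shares.max() / shares.min()
print(shares)
print(f"imbalance ratio: {imbalance_ratio:.2f}")
```

What counts as "not a large difference" is a judgment call; a ratio in the low single digits is often handled without resampling.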
Modeling: RandomForestClassifier
1. Simple RandomForest gives Accuracy 0.685
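A baseline of this kind could be set up roughly as below. The data is synthetic (make_classification), and the split size and hyperparameters are assumptions, so the printed accuracy is unrelated to the 0.685 reported above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic 3-class problem standing in for the prepared cab data.
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=5, n_classes=3, random_state=0
)

# Hold out a stratified validation set to estimate accuracy.
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_tr, y_tr)

val_accuracy = accuracy_score(y_val, model.predict(X_val))
print(f"validation accuracy: {val_accuracy:.3f}")
```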
Modeling: XGBoost
1. Simple XGBoost gives Accuracy 0.6835
Modeling: XGBoost with GridSearchCV
1. XGBoost with GridSearchCV tuning gives Accuracy 0.696616 on Train data and 0.7015
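A grid search over a boosted-tree classifier might be sketched as follows. The parameter grid, the synthetic data and the 3-fold setting are all assumptions; the slides do not list the grid that was searched. If xgboost is not installed, the sketch falls back to scikit-learn's GradientBoostingClassifier, which is a different implementation and is swapped in only so the example runs anywhere.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split

try:
    from xgboost import XGBClassifier as Booster
except ImportError:
    # Fallback: scikit-learn's gradient boosting, not XGBoost itself.
    from sklearn.ensemble import GradientBoostingClassifier as Booster

# Synthetic 3-class data with labels 0, 1, 2. Recent XGBoost versions expect
# labels starting at 0, so other label sets would need re-encoding first.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4, n_classes=3, random_state=0
)
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Small illustrative grid; all three names are valid for both estimators.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [2, 3],
    "learning_rate": [0.1],
}

search = GridSearchCV(Booster(random_state=0), param_grid, scoring="accuracy", cv=3)
search.fit(X_tr, y_tr)

cv_accuracy = search.best_score_                 # mean CV accuracy of best params
holdout_accuracy = search.score(X_val, y_val)    # accuracy on held-out rows
print(search.best_params_, round(cv_accuracy, 3), round(holdout_accuracy, 3))
```

Reporting both the cross-validated score and a held-out score, as here, makes it easier to see whether tuning has overfit the training folds.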
