Competition ‘16
Machine Learning Project
Data Mining Stages
Objective
• To predict the policy number, and the price quoted for that policy, that a customer is most likely to purchase.
• The data provided is historical data from an insurance company, covering both the session history and the purchase history of its customers.
Datasets
• Train.csv
• Train_Short.csv
Data Understanding
• Class imbalance between Policy 4 and the other classes is the major problem with the dataset.
• The dataset heavily features Policy 1 and Policy 3.
• The imbalance is stark: the largest class (Policy 3, with 25,294 records) dwarfs the smallest (Policy 4, with 3,925 records).
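As a quick illustration, the class distribution can be checked directly in pandas. The sketch below assumes the target column is named policy, which is a placeholder; the actual column name in Train.csv may differ.
```python
import pandas as pd

# Load the provided training data (file name from the slides).
df = pd.read_csv("Train.csv")

# Inspect the class distribution of the target.
# "policy" is an assumed column name, not confirmed by the slides.
counts = df["policy"].value_counts().sort_index()
print(counts)
print("imbalance ratio:", counts.max() / counts.min())
```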
Approach
• Analyzed the shopping patterns of customers by examining the Train.csv dataset.
• Duplicates and outliers were removed (we computed the standard deviation for each attribute and excluded data points lying outside the standard-deviation band around the mean); a sketch of this step follows below.
• Data was normalized using Python.
• The problem statement consists of two parts:
– predicting the policy (Classification)
– predicting the cost of the policy (Regression)
• Two models were trained and tested, using two different algorithms, in the Microsoft Azure Machine Learning suite.
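A minimal sketch of the cleaning step described above, assuming numeric attributes and a cutoff of k standard deviations from the mean (the slides do not state the exact threshold, so k = 3 is an assumption):
```python
import pandas as pd

def remove_duplicates_and_outliers(df: pd.DataFrame, k: float = 3.0) -> pd.DataFrame:
    """Drop exact duplicate rows, then drop rows where any numeric
    attribute lies more than k standard deviations from its mean.
    The k = 3 cutoff is an assumption; the slides only say points
    'out of standard deviation' were excluded."""
    df = df.drop_duplicates()
    numeric = df.select_dtypes(include="number")
    z = (numeric - numeric.mean()) / numeric.std()
    mask = (z.abs() <= k).all(axis=1)
    return df[mask]
```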
Data Preparation
Findings
– Each unique customer_id (67,663 in total) has at least three unique shopping_pt values, i.e., 1, 2, 3; this pattern was extracted from the Train.csv file.
– This information was combined with each customer's session history up to three shopping points, and anomalies such as duplication and non-uniformity were removed.
Data Normalization
– Attributes with a large value range, such as location, were normalized for better results.
– We used a normalize_features(feature_set) function in Python for normalization (a possible implementation is sketched after this slide).
Feature Selection
– Using the Pearson correlation as the ranking metric, the “Filter-Based Feature Selection” module in Azure was employed to cut down irrelevant features.
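The slides reference a normalize_features(feature_set) helper but do not show its body; a plausible min-max implementation is sketched below (min-max scaling is an assumption — it could equally be z-score scaling):
```python
import pandas as pd

def normalize_features(feature_set: pd.DataFrame) -> pd.DataFrame:
    """Rescale every numeric column to the [0, 1] range.
    Min-max scaling is an assumption; the slides only say that
    high-range attributes such as location were normalized."""
    normalized = feature_set.copy()
    for col in normalized.select_dtypes(include="number").columns:
        col_min, col_max = normalized[col].min(), normalized[col].max()
        if col_max > col_min:
            normalized[col] = (normalized[col] - col_min) / (col_max - col_min)
    return normalized
```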
Feature Selection
• We considered the Pearson correlation coefficient (the ‘r’ value), which indicates the strength of the linear relationship between any two features.
• The top 14 features were projected to train and test the models. The features not used are “record_type”, “homeowner”, “group_size”, “married_couple”, and “C_previous”, which have the lowest Pearson correlation values.
• We also tweaked the feature set with different combinations and retrained the model, but after evaluating the results, the Pearson-correlation-based selection gave the best performance (a pandas sketch follows below).
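The Azure “Filter-Based Feature Selection” module is not reproduced here, but an equivalent ranking by absolute Pearson correlation with the target can be sketched in pandas (the target column name is again a placeholder):
```python
import pandas as pd

def top_k_by_pearson(df: pd.DataFrame, target: str, k: int = 14) -> list:
    """Rank numeric features by |Pearson r| against the target and
    keep the top k, mirroring the filter-based selection in Azure."""
    numeric = df.select_dtypes(include="number")
    r = numeric.corrwith(numeric[target]).drop(target)
    return r.abs().sort_values(ascending=False).head(k).index.tolist()

# Example (assumed target column name):
# selected = top_k_by_pearson(train_df, target="policy", k=14)
```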
Synthetic Minority Over-Sampling Technique
(SMOTE)
• SMOTE is a technique employed to oversample the minority class in our multi-class classification problem.
• Through this, the large gap between the record counts of the four policy classes was reduced.
• SMOTE is a common data manipulation technique that creates synthetic minority-class cases to produce a more balanced dataset.
• Since Policy 4 has almost seven times fewer instances than Policy 3, we increased the SMOTE sampling to 300%, which improved the accuracy of the classification model by 15% (a sketch follows below).
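Azure ML's SMOTE module takes a percentage (300% here); a comparable effect can be sketched with the imbalanced-learn library, where the target counts are given explicitly. This is an analogue of the Azure module, not the module itself, and X, y stand for the feature matrix and policy labels prepared earlier.
```python
from collections import Counter
from imblearn.over_sampling import SMOTE

def oversample_policy_4(X, y):
    """Quadruple the Policy 4 count (a 300% increase in synthetic
    samples), leaving the other classes unchanged."""
    counts = Counter(y)
    target_counts = dict(counts)
    target_counts[4] = counts[4] * 4  # +300% minority samples (assumed label 4)
    smote = SMOTE(sampling_strategy=target_counts, random_state=42)
    return smote.fit_resample(X, y)
```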
Policy Prediction(Classification)
Building Model
• Implemented two different algorithms on our training set; after evaluating their performance, Multiclass Decision Forest returned the better results of the two.
• The Decision Forest delivered better performance and coped better with the class imbalance in the data.
• “Tune Model Hyperparameters” helped evaluate the performance of our model for different combinations of parameter values.
• Through this we concluded that our model works best when the decision trees are few in number but high in depth (a scikit-learn sketch follows below).
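Outside Azure, the same experiment can be approximated with scikit-learn: a random forest (the closest open analogue to the Multiclass Decision Forest module, an assumption) tuned over a small grid that favours few-but-deep trees, as the slide concludes.
```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Grid values are illustrative, not the exact Azure sweep.
param_grid = {
    "n_estimators": [8, 16, 32],   # fewer trees...
    "max_depth": [16, 32, None],   # ...but deeper, per the slide's conclusion
}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    scoring="accuracy",
    cv=5,
)
# search.fit(X_resampled, y_resampled)   # data from the SMOTE step
# print(search.best_params_, search.best_score_)
```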
Classification (Multi-Class Decision Forest Model)
Parameter Values
Performance Metrics
Cost Prediction (Regression)
• Model-1
– Used the Boosted Decision Tree Regression module to create an ensemble of regression trees using boosting.
– Boosting means that every tree depends on the trees that precede it and learns by fitting their residuals.
• Model-2
– Used the Neural Network Regression module, a customizable neural network algorithm, to create a second regression model (scikit-learn analogues of both are sketched after this slide).
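scikit-learn analogues of the two Azure modules could be set up as follows. These are approximations, not the Azure modules themselves, and the specific hyperparameters shown are placeholders.
```python
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.neural_network import MLPRegressor

# Model-1: boosted regression trees
# (analogue of the Boosted Decision Tree Regression module)
boosted_tree = GradientBoostingRegressor(random_state=42)

# Model-2: a small fully connected network
# (analogue of the Neural Network Regression module)
neural_net = MLPRegressor(hidden_layer_sizes=(100,), max_iter=500, random_state=42)
```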
Cost Prediction (Regression)
Building Model
– The root mean squared error (RMSE) for Neural Network Regression came out to 36.85, while for Boosted Decision Tree Regression it was 30, which clearly shows that Boosted Decision Tree Regression works better for our dataset.
– With the help of “Tune Model Hyperparameters”, the coefficient of determination reached approximately 0.50 and the RMSE approximately 23.46.
– We found the best parameter values for Boosted Decision Tree Regression to be a maximum of 20 leaf nodes and a maximum of 20 trees, with a learning rate of 0.2 (an evaluation sketch follows below).
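With the tuned values from the slide (20 leaf nodes, 20 trees, learning rate 0.2), the evaluation could be reproduced in scikit-learn roughly as sketched below; the metric values obtained will of course depend on the actual train/test split, and the target column name is assumed.
```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, r2_score

# Tuned parameters reported on the slide.
model = GradientBoostingRegressor(
    max_leaf_nodes=20, n_estimators=20, learning_rate=0.2, random_state=42
)
# model.fit(X_train, y_train)                        # quoted cost as the target
# pred = model.predict(X_test)
# rmse = np.sqrt(mean_squared_error(y_test, pred))   # slide reports ~23.46 after tuning
# r2 = r2_score(y_test, pred)                        # slide reports ~0.50
# print(f"RMSE={rmse:.2f}  R^2={r2:.2f}")
```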
Cost Prediction (Regression)
Algorithm Properties
Performance Metrics
THANK YOU