Competition ‘16
Machine Learning Project
Data Mining Stages
Objective
• To predict which policy (policy number) a customer is most likely to
purchase, along with the price quoted for that policy.
• The data provided is historical data from an insurance
company, covering both the session history and the
purchase history of its customers.
Datasets
• Train.csv
• Train_Short.csv
Data Understanding
• The class imbalance of Policy 4 relative
to the other classes is
the major problem with the dataset.
• The dataset heavily features Policy
1 and Policy 3.
• The imbalance is stark: the most
frequent class (Policy 3, with 25,294
records) has more than six times as
many records as the least frequent
(Policy 4, with 3,925 records); see the sketch below.
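A minimal pandas sketch of how this imbalance can be quantified; the label column name "policy" is our assumption, not given in the deck:

```python
import pandas as pd

df = pd.read_csv("Train.csv")
# "policy" is a hypothetical name for the label column; substitute the
# actual target column in Train.csv.
counts = df["policy"].value_counts().sort_index()
print(counts)                        # deck: Policy 3 -> 25294, Policy 4 -> 3925
print(counts.max() / counts.min())   # imbalance ratio (~6.4x on those numbers)
```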
Approach
• Analyzed the customers' shopping patterns using the Train.csv dataset.
• Removed duplicates and outliers (computed the standard deviation for each
attribute and excluded the data points lying outside it; see the sketch
after this list).
• Normalized the data using Python.
• The problem statement consists of two parts:
– predicting the policy (Classification)
– predicting the cost of the policy (Regression)
• Two models were trained and tested using two different algorithms in the
Microsoft Azure Machine Learning suite.
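A sketch of the duplicate and outlier removal described above, assuming a plain z-score filter; the deck does not give an exact threshold, and k=1 is the literal reading of "out of standard deviation":

```python
import pandas as pd

def remove_outliers(df: pd.DataFrame, k: float = 1.0) -> pd.DataFrame:
    """Drop rows where any numeric attribute lies more than k standard
    deviations from that attribute's mean (a z-score filter)."""
    numeric = df.select_dtypes("number")
    z = (numeric - numeric.mean()) / numeric.std()
    return df[(z.abs() <= k).all(axis=1)]

df = pd.read_csv("Train.csv").drop_duplicates()  # duplicates first
df = remove_outliers(df, k=1.0)                  # then the std-dev filter
```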
Data Preparation
Findings
– Every unique customer_id (67,663 in total) has at least three unique
shopping points (shopping_pt 1, 2, 3); this pattern was extracted from
the Train.csv file.
– This information was combined with each customer's session history up to
three shopping points, and anomalies such as duplicates and non-uniform
records were removed (see the sketch below).
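A short sketch of extracting this per-customer pattern; customer_id and shopping_pt are column names taken from the deck:

```python
import pandas as pd

df = pd.read_csv("Train.csv").drop_duplicates()
# Keep the first three shopping points per customer, mirroring the finding
# that every customer_id has at least shopping_pt 1, 2 and 3.
session = df[df["shopping_pt"].isin([1, 2, 3])]
print(session["customer_id"].nunique())  # deck reports 67,663 unique customers
```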
Data Normalization
– For high-range attributes such as location, the data was normalized for
better results.
– We used a normalize_features(feature_set) function in Python for
normalization (a sketch follows below).
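The deck names a normalize_features(feature_set) helper but not its formula; min-max scaling is a plausible reading, sketched here against the df loaded earlier:

```python
import pandas as pd

def normalize_features(feature_set: pd.DataFrame) -> pd.DataFrame:
    # Min-max scale each column into [0, 1]; the exact formula used in the
    # project is not stated, so this choice is an assumption.
    return (feature_set - feature_set.min()) / (feature_set.max() - feature_set.min())

df[["location"]] = normalize_features(df[["location"]])  # high-range attribute
```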
Feature Selection
– Using the Pearson correlation, the “Filter-Based Feature Selection”
module in Azure was employed to cut down irrelevant features.
Feature Selection
• The Pearson correlation coefficient (the ‘r’ value) indicates the
strength of the linear relationship between any two features.
• The top 14 features were kept to train and test the models. The features
not used, “record_type”, “homeowner”, “group_size”, “married_couple” and
“C_previous”, had the lowest Pearson correlation values.
• We also tweaked the features using different combinations and retrained
the model, but after evaluating the results, the Pearson-correlation-based
selection gave the best performance (see the sketch below).
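A sketch of the Pearson-based ranking done locally with pandas rather than the Azure module; the target column name "policy" is again our assumption:

```python
import pandas as pd

df = pd.read_csv("Train.csv")
# Correlate every numeric feature with the (hypothetical) "policy" target,
# rank by absolute r, and keep the 14 strongest, as on the slide.
r = df.corr(numeric_only=True)["policy"].drop("policy")
top14 = r.abs().sort_values(ascending=False).head(14).index.tolist()
print(top14)  # per the deck, record_type, homeowner, group_size,
              # married_couple and C_previous fall outside this list
```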
Synthetic Minority Over-Sampling Technique
(SMOTE)
• SMOTE is a technique for over-sampling the minority class in our
multi-class classification problem.
• Through this, the immense gap between the record counts of the four
policy classes was reduced.
• SMOTE is a common data manipulation technique that synthesizes new
minority-class cases to create a more balanced dataset.
• Since Policy 4 has almost seven times fewer instances than Policy 3, we
increased the SMOTE sampling to 300%, which improved the accuracy of the
classification model by 15% (see the sketch below).
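A sketch of the 300% over-sampling using imbalanced-learn's SMOTE; the deck used Azure's SMOTE module, so the mapping of “300%” onto sampling_strategy (three synthetic cases per original, i.e. four times the class size) is our interpretation, and top14/"policy" carry over from the previous sketch:

```python
import pandas as pd
from imblearn.over_sampling import SMOTE

X, y = df[top14], df["policy"]   # features and target from the earlier sketch
n4 = int((y == 4).sum())
# Azure's "SMOTE percentage = 300" adds three synthetic cases per original,
# so Policy 4 ends up at four times its original count.
X_res, y_res = SMOTE(sampling_strategy={4: n4 * 4}).fit_resample(X, y)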
Policy Prediction(Classification)
Building Model
• Implemented two different algorithms on our training set; after
evaluating their performance, Multiclass Decision Forest produced the
better results of the two.
• The decision forest performed better and was more robust to the class
imbalance in the data.
• “Tune Model Hyperparameters” helped evaluate the model's performance
across different combinations of parameter values.
• From this we concluded that our model works best when the decision trees
are few in number but high in depth (see the sketch below).
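A rough local stand-in for this step, assuming scikit-learn's RandomForestClassifier as an analogue of Azure's Multiclass Decision Forest and GridSearchCV in place of “Tune Model Hyperparameters”; the parameter grid is illustrative:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [8, 32, 128],   # few vs. many trees
                "max_depth": [4, 16, 64]},      # shallow vs. deep trees
    scoring="accuracy",
    cv=3,
)
search.fit(X_res, y_res)    # SMOTE-balanced data from the earlier sketch
print(search.best_params_)  # deck's conclusion: few trees, high depth
```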
Classification (Multi-Class Decision Forest Model)
Parameter Values
Performance Metrics
Cost Prediction (Regression)
• Model-1
– Used the Boosted Decision Tree Regression module to create an ensemble
of regression trees built with boosting.
– Boosting means that every tree depends on the trees that precede it and
learns by fitting their residuals.
• Model-2
– Used Neural Network Regression, a customizable neural-network algorithm,
to create a regression model (sketches of both follow).
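The two Azure modules have close scikit-learn analogues, sketched here for orientation; these are stand-ins, not the Azure implementations:

```python
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.neural_network import MLPRegressor

# Gradient boosting fits each new tree to the residuals of the ensemble
# built so far, which is the boosting behaviour the slide describes.
boosted = GradientBoostingRegressor()   # Model-1: boosted regression trees
neural  = MLPRegressor(max_iter=1000)   # Model-2: neural-network regression
```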
Cost Prediction (Regression)
Building Model
– The root mean squared error (RMSE) for Neural Network Regression came
out to 36.85, versus 30 for Boosted Decision Tree Regression, which
clearly shows that Boosted Decision Tree Regression works better for our
dataset.
– With the help of “Tune Model Hyperparameters”, the coefficient of
determination reached roughly 0.50 and the RMSE dropped to approximately
23.46.
– We found the best parameter values for Boosted Decision Tree Regression
to be a maximum of 20 leaf nodes and a maximum of 20 trees, with a
learning rate of 0.2 (see the sketch below).
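A sketch of the tuned model and its evaluation, mapping the slide's parameters onto scikit-learn's GradientBoostingRegressor; X_cost and y_cost are hypothetical names for the feature matrix and quoted-cost target:

```python
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# X_cost / y_cost: prepared features and quoted cost (names assumed).
X_tr, X_te, y_tr, y_te = train_test_split(X_cost, y_cost, random_state=0)
model = GradientBoostingRegressor(
    n_estimators=20,      # "maximum number of trees" on the slide
    max_leaf_nodes=20,    # "maximum number of leaf nodes"
    learning_rate=0.2,
).fit(X_tr, y_tr)
pred = model.predict(X_te)
print("RMSE:", mean_squared_error(y_te, pred) ** 0.5)  # deck: ~23.46
print("R^2 :", r2_score(y_te, pred))                   # deck: ~0.50
```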
Cost Prediction (Regression)
Algorithm Properties
Performance Metrics
THANK YOU
