Machine Learning Project
Team members: Jack, Harry & Abhishek
The Homesite problem: predicting quote conversion
- Homesite sells home insurance to home buyers in the United States
- Insurance quotes are offered to customers based on several factors
What Homesite knows:
- Each customer's geographical, personal, and financial information, plus home-ownership details
- The quote offered to every customer
What Homesite doesn't know:
- The customer's likelihood of buying that insurance contract
Data shared: training & test sets
Task: binary classification
Training set: 261k rows, 298 predictors, 1 binary response
Test set: 200k rows, 298 columns
Predictors: customer activity, geography, personal, property & coverage
Response: customer conversion
What's good about the Homesite data:
- 296 of the variables have no NAs or bad data-entry points
- Not many levels in the nominal variables
- Plenty of binary variables
- Plenty of ordinal variables
- No unbalanced variables
- No missing values
- No textual columns
Data cleaning steps (a pandas sketch follows the list):
1. Removing constant columns
2. Removing identifier fields
3. Synthesizing features from the date column
4. Treating NA variables
5. Treating bad levels (-1)
6. Treating false categoricals
7. Converting categoricals to dummy variables
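A minimal pandas sketch of this pipeline, assuming hypothetical column names (QuoteNumber for the identifier, Original_Quote_Date for the date) and that NA and -1 are the bad-entry markers described above; the real column names may differ:

```python
import pandas as pd

def clean(df: pd.DataFrame) -> pd.DataFrame:
    # 1. Remove constant columns (a single unique value carries no signal)
    df = df.loc[:, df.nunique(dropna=False) > 1]
    # 2. Drop identifier fields (hypothetical name: QuoteNumber)
    df = df.drop(columns=["QuoteNumber"], errors="ignore")
    # 3. Synthesize features from the date column (hypothetical name)
    if "Original_Quote_Date" in df.columns:
        d = pd.to_datetime(df["Original_Quote_Date"])
        df = df.assign(q_year=d.dt.year, q_month=d.dt.month, q_weekday=d.dt.weekday)
        df = df.drop(columns=["Original_Quote_Date"])
    # 4./5. Treat NAs and the bad level -1 with one sentinel value
    df = df.fillna(-999).replace(-1, -999)
    # 6. False categoricals (numeric codes that are really labels) would be
    #    cast to object dtype here; the sketch assumes dtypes are already correct.
    # 7. Expand categoricals into dummy (one-hot) columns
    cat_cols = df.select_dtypes(include="object").columns
    return pd.get_dummies(df, columns=cat_cols)
```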
Gradient Boosting (iterative corrections)
- Learns from past mistakes: each new tree fits the errors of the ensemble so far
- Can drive training error to nearly zero
- Final prediction is a weighted score over many trees
- Hard to tune, as there are many parameters to adjust
- Often overfits, and the stopping point is hard to decide
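A hedged scikit-learn sketch of such a boosted model; X and y stand for the cleaned predictor matrix and binary response (names assumed, not from the slides). Early stopping via n_iter_no_change is one standard answer to the stopping-point problem:

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# X, y: cleaned predictors and the 0/1 conversion response (assumed names)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

gbm = GradientBoostingClassifier(
    n_estimators=500,         # many shallow trees, each correcting the last
    learning_rate=0.05,       # shrinkage slows overfitting
    max_depth=4,
    subsample=0.8,            # stochastic boosting
    validation_fraction=0.1,  # held out internally for early stopping
    n_iter_no_change=20,      # stop when the validation score stalls
    random_state=0,
)
gbm.fit(X_tr, y_tr)
print("validation AUC:", roc_auc_score(y_val, gbm.predict_proba(X_val)[:, 1]))
```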
Random Forests (majority wins)
- Handles missing data
- Handles redundancy easily
- Reduces variance in the results
- Produces an out-of-bag (OOB) error rate
- Produces de-correlated trees via random subspaces and random splits
- Bias sometimes increases, as the trees are shallower
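A companion sketch for the forest, reusing the assumed X_tr, y_tr names from the boosting example; oob_score=True produces the out-of-bag estimate the slide mentions:

```python
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=500,
    max_features="sqrt",  # random subspace at each split de-correlates the trees
    oob_score=True,       # out-of-bag estimate, no extra validation set needed
    n_jobs=-1,
    random_state=0,
)
rf.fit(X_tr, y_tr)
print("OOB accuracy:", rf.oob_score_)
```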
Gradient Boosting + Random Forest
- Handles missing data
- Handles redundancy easily
- Reduces variance in the results
- Produces an out-of-bag error rate
- Produces de-correlated trees via random subspaces and random splits
- Quite slow and computationally expensive; optimizing these constraints could be an excellent area for research
Our score: AUC = 0.95
- Does not overfit
- Little bias, thanks to the boosting corrections
- Easy to tune
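The slides do not spell out how the two models were combined; one common approach, sketched here under the assumption that the gbm and rf objects above are already fitted, is a simple average of their predicted probabilities:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

p_gbm = gbm.predict_proba(X_val)[:, 1]
p_rf = rf.predict_proba(X_val)[:, 1]
p_blend = 0.5 * (p_gbm + p_rf)  # equal weights; tune the mix on validation AUC
print("blend AUC:", roc_auc_score(y_val, p_blend))
```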
Calculating AUC

ID   True class   Predicted probability
1    1            0.8612
2    0            0.2134
3    0            0.1791
4    0            0.1134
5    1            0.7898
6    0            0.0612

AUC:
- Pick a threshold
- Calculate the true positive rate (y) and the false positive rate (x)
- Plot the point (x, y)
- Repeat for each threshold value in [0, 1]
- The resulting curve is the ROC curve
- The area under that curve is the AUC
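A worked check of this recipe on the six rows above (in this toy table both positives outrank every negative, so the area comes out as 1.0):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([1, 0, 0, 0, 1, 0])  # true classes from the table
y_prob = np.array([0.8612, 0.2134, 0.1791, 0.1134, 0.7898, 0.0612])

for t in np.linspace(0, 1, 5):         # sweep thresholds in [0, 1]
    pred = (y_prob >= t).astype(int)
    tpr = (pred[y_true == 1] == 1).mean()  # true positive rate (y)
    fpr = (pred[y_true == 0] == 1).mean()  # false positive rate (x)
    print(f"threshold={t:.2f}  (x, y)=({fpr:.2f}, {tpr:.2f})")

print("AUC:", roc_auc_score(y_true, y_prob))  # 1.0 for this toy table
```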
War for the highest AUC
What we have already employed:
- Categorical-to-continuous conversion
- Continuous-to-ordinal conversion
- Variable bucketing
- SVM / logistic regression
- Random forests / decision trees
- Lasso / ridge / elastic net
- Gradient boosting
- Multicollinearity elimination
- Outlier treatment
- K-fold cross-validation (see the tuning sketch below)
- Imputation for NAs
- Model tuning
- Variable transformation
What we look forward to using:
- Most importantly, your suggestions
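As one concrete instance of the K-fold cross-validation and model-tuning items above, a sketch that tunes the boosting model with a 5-fold search; the parameter grid is illustrative, not the team's actual search space:

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

param_grid = {
    "learning_rate": [0.03, 0.05, 0.1],
    "max_depth": [3, 4, 6],
}
search = GridSearchCV(
    GradientBoostingClassifier(n_estimators=300, random_state=0),
    param_grid,
    scoring="roc_auc",  # the competition metric
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    n_jobs=-1,
)
search.fit(X_tr, y_tr)  # reuses the assumed training split from earlier sketches
print(search.best_params_, round(search.best_score_, 4))
```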
THANK YOU