The document describes building a machine learning model using a boosted decision tree to predict satisfied and unsatisfied customers in a Santander customer satisfaction dataset from Kaggle. The model is developed using Azure ML, including splitting the training data into training and validation sets, training the model, tuning hyperparameters, and using the model to predict scores for the test set. Key steps are loading data, selecting features, training and evaluating the model, and reviewing learnings around important features and opportunities to improve predictions.
2. Objective
• To build a machine learning model using the training data set and predict the
satisfied and unsatisfied customers in the Santander test data set
• Here a “Two Class Boosted Decision Tree” is used to build a machine learning
model
• The decision tree is one of the simpler classification techniques used to
solve real-world problems
3. Training data set characteristics
• As per Kaggle, the data set contains anonymized columns with numeric data
• The data set has many unnamed columns, which makes feature extraction tedious
• Exploratory data analysis on the training set can reveal latent features which
actually contribute to the Target column
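Outside Azure ML, the same exploratory checks can be sketched in Python with pandas. The data set, column names (`var3`, `var15`, `var38`, `TARGET`), and values below are synthetic stand-ins for the anonymized Santander columns, used only to illustrate the idea of surfacing latent features:

```python
# Sketch: exploratory checks on an anonymized numeric data set, assuming a
# pandas DataFrame `train` with a binary TARGET column (illustrative names).
import numpy as np
import pandas as pd

# Synthetic stand-in for the Santander training data
rng = np.random.default_rng(0)
train = pd.DataFrame({
    "ID": np.arange(1000),
    "var3": np.zeros(1000),                # constant column carries no signal
    "var15": rng.integers(20, 60, 1000),
    "var38": rng.lognormal(11, 1, 1000),
})
train["TARGET"] = (train["var15"] > 50).astype(int)

# Drop columns with zero variance -- they cannot affect TARGET
constant_cols = [c for c in train.columns if train[c].nunique() == 1]
train = train.drop(columns=constant_cols)

# Rank the remaining numeric features by absolute correlation with TARGET
corr = train.drop(columns=["ID", "TARGET"]).corrwith(train["TARGET"]).abs()
print(corr.sort_values(ascending=False))
```

Even with unnamed columns, variance and correlation screens like this narrow the feature set before any model is trained.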
4. Steps for developing the model
• Load the training data to Azure ML
• Use Edit Metadata to mark the Target column as the label
• Use Split data to randomly split data (75% training set and 25% validation set)
• A Two-Class Boosted Decision Tree with a single parameter set is used to train
the model
• Train Model is used with Score Model to score the validation set (25% of the data)
• The Score Model output feeds Evaluate Model to compare candidate models
• The test data is loaded into Azure after adding a Target column with every
value set to 1
• The trained model is then used to score the test data
• The output of the score model is used to obtain a Kaggle score
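The steps above map onto a short local sketch using scikit-learn's `GradientBoostingClassifier` as a stand-in for the Azure ML modules (Split Data, Boosted Decision Tree, Train Model, Score Model, Evaluate Model). The data here is synthetic, since the real Kaggle files are not bundled with this document:

```python
# Sketch of the workflow with scikit-learn in place of Azure ML modules
# (Split Data -> Boosted Decision Tree -> Train Model -> Score Model ->
# Evaluate Model). All data is synthetic; names are illustrative.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 10))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=2000) > 0).astype(int)

# Split Data module: 75% training set, 25% validation set
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

# Two-Class Boosted Decision Tree with a single fixed parameter set
model = GradientBoostingClassifier(
    n_estimators=100, learning_rate=0.1, max_depth=3, random_state=0)
model.fit(X_tr, y_tr)

# Score Model + Evaluate Model: AUC on the 25% validation split
val_auc = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
print(f"validation AUC: {val_auc:.3f}")
```

Scoring the held-out test set would then be a single `model.predict_proba` call on the test features, mirroring the final Score Model step.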
5. Azure (ML) Model
[Experiment graph: Training data and Test data each pass through Edit Metadata; a Boosted Decision Tree feeds Train Model and Tune Model Hyperparameters; Score Model output goes to Evaluate Model and Convert to CSV]
7. Learnings
• The Kaggle score from the Train Model path was 0.519, and from the Tune Model
Hyperparameters path was 0.541
• Redundant columns such as ID should be excluded when building the model
• Var30 and Var38 are two important features which affect the Target column
• Var3 can be excluded as it adds little value for predicting the Target values
• The data set is not very informative, and better feature extraction techniques
should be used to predict the Target column
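The gap between the Train Model score (0.519) and the Tune Model Hyperparameters score (0.541) can be reproduced in spirit with a grid search over the boosted-tree settings. The grid and data below are illustrative assumptions, not the sweep Azure ML actually ran:

```python
# Sketch of the "Tune Model Hyperparameters" step as a scikit-learn grid
# search over boosted-tree settings (synthetic data, illustrative grid).
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(2)
X = rng.normal(size=(600, 8))
y = (X[:, 0] - X[:, 2] > 0).astype(int)

# Sweep depth and learning rate, scoring each candidate by cross-validated AUC
grid = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid={"max_depth": [2, 3], "learning_rate": [0.05, 0.1]},
    scoring="roc_auc", cv=3)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```

Selecting the best candidate by validation AUC rather than accepting default parameters is what lifted the Kaggle score in the write-up above.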