2015 Analytic Challenge
Karan Sarao
TEAM
• Karan Sarao
ANALYTIC SOFTWARE USED
• Data Preparation – SAS
• Model Building – R
• Hardware
– Acer Aspire 5750
– 6 GB RAM
SOLUTION OVERVIEW
Data Preparation
• Missing Value Treatment
– Nominal – New Category
– Numeric/Ordinal – Replace with 0 (Value)
• New Variable Creation
– Multiple derived variables
Model Tuning and Stacking
• Training / Blending / Testing split
• Caret function to tune multiple model parameters
• Stacking and testing to optimize sequence
Final Modeling
• 2-stage modeling process adopted
• Initial set of optimized models created in Stage 1
• Scores incorporated into final blended model in Stage 2
Scoring
• 2-stage scoring process followed
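The missing-value treatment listed in this overview can be sketched as follows. The original data preparation was done in SAS; this is only an illustrative Python sketch. The function name `treat_missing`, the `"MISSING"` category label, and the choice of which columns are nominal or numeric are assumptions, not taken from the deck.

```python
def treat_missing(rows, nominal_cols, numeric_cols, new_category="MISSING"):
    """Return copies of the rows with missing values (None) filled in."""
    treated = []
    for row in rows:
        fixed = dict(row)  # copy so the input rows stay untouched
        for col in nominal_cols:
            if fixed.get(col) is None:
                fixed[col] = new_category  # nominal: missing becomes its own category
        for col in numeric_cols:
            if fixed.get(col) is None:
                fixed[col] = 0  # numeric/ordinal: missing replaced with 0
        treated.append(fixed)
    return treated


# Hypothetical two-row example; the column types are assumed for illustration.
rows = [
    {"TXN_CHANNEL_CD": "WEB", "PAYMENT_QTY": 3},
    {"TXN_CHANNEL_CD": None, "PAYMENT_QTY": None},
]
clean = treat_missing(rows, ["TXN_CHANNEL_CD"], ["PAYMENT_QTY"])
```

Keeping missing nominal values as a separate category lets a model learn from the fact that a value was absent, rather than discarding the row.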
SOLUTION OVERVIEW – Continued (Model Tuning)
Model Tuning Process
Data Splitting
• Modeling data set, random assignment: 50% / 30% / 20% of observations
Stage 1 Modeling (50% of observations)
• Stage 1 Models: Model 1 to Model 5
• Score all 5 models on Stage 2 data, append scores as new variables
Stage 2 Modeling (30% of observations)
• Stage 2 Models: Model 1 to Model 5
Evaluation Phase (20% of observations)
• Run Stage 1 models
• Run Stage 2 models
• Compare performance of all Stage 2 models
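A minimal sketch of the random 50% / 30% / 20% assignment described in the tuning process. The deck does not show how the split was coded; the function name `three_way_split` and the seed value are assumptions for illustration.

```python
import random


def three_way_split(n_rows, fractions=(0.5, 0.3, 0.2), seed=2015):
    """Randomly assign row indices to Stage 1, Stage 2 and evaluation sets."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # seeded so the split is reproducible
    n_stage1 = round(n_rows * fractions[0])
    n_stage2 = round(n_rows * fractions[1])
    stage1 = indices[:n_stage1]
    stage2 = indices[n_stage1:n_stage1 + n_stage2]
    evaluation = indices[n_stage1 + n_stage2:]  # remainder goes to evaluation
    return stage1, stage2, evaluation


stage1_idx, stage2_idx, eval_idx = three_way_split(1000)
```

The three sets are disjoint, so Stage 2 models only ever see Stage 1 scores produced on rows the Stage 1 models were not trained on.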
DATA TRANSFORMATIONS
• Mix of linear and non-linear (tree-based) models
‒ Cover each other's weaknesses
‒ Tree-based models are invariant to order-preserving transformations (no need for log/exponent etc.)
• More focus on feature engineering; new variables created as below:
‒ SHIP_RATIO=(ORDER_SH_AMT+ORDER_ADDL_SH_AMT)/ORDER_GROSS_AMT (Does shipping cost as a ratio of the initial order have any influence?)
‒ PAYMT_RATIO=(ORDER_SH_AMT+ORDER_ADDL_SH_AMT+ORDER_GROSS_AMT)/PAYMENT_QTY (Amount of each payment)
‒ REV_RATIO=TOTAL_REV_PRIOR_TO_A/TENURE (Revenue per unit of tenure)
‒ REV_PER_ORDER=TOTAL_REV_PRIOR_TO_A/TOTAL_ORDERS_PRIOR_TO_A (Revenue per order)
‒ FIRST_ORDER_RATIO=ORDER_GROSS_AMT/ITEM_QTY
‒ FIRST_PAYMENT_RATIO=ORDER_GROSS_AMT/PAYMENT_QTY
‒ ORDER_FREQ=TENURE/TOTAL_ORDERS_PRIOR_TO_A
‒ ORDER_DUE_RATIO=RECENCY/ORDER_FREQ
‒ ORDER_DUE_RATIO_2=(RECENCY-ORDER_FREQ)/ORDER_FREQ
‒ ORDER_DUE_RATIO_3=(RECENCY-ORDER_FREQ)/RECENCY
‒ All divide-by-zero exceptions set to 0
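The derived ratios above, including the divide-by-zero rule, could be computed along these lines. The formulas and column names are from the slide; the helper names `safe_div` and `derive_features` and the example values are hypothetical, and the original work was done in SAS rather than Python.

```python
def safe_div(numerator, denominator):
    """Divide, returning 0 when the denominator is zero (the slide's rule)."""
    return numerator / denominator if denominator != 0 else 0.0


def derive_features(r):
    """Compute the derived ratio variables for one customer record (a dict)."""
    shipping = r["ORDER_SH_AMT"] + r["ORDER_ADDL_SH_AMT"]
    order_freq = safe_div(r["TENURE"], r["TOTAL_ORDERS_PRIOR_TO_A"])
    return {
        "SHIP_RATIO": safe_div(shipping, r["ORDER_GROSS_AMT"]),
        "PAYMT_RATIO": safe_div(shipping + r["ORDER_GROSS_AMT"], r["PAYMENT_QTY"]),
        "REV_RATIO": safe_div(r["TOTAL_REV_PRIOR_TO_A"], r["TENURE"]),
        "REV_PER_ORDER": safe_div(r["TOTAL_REV_PRIOR_TO_A"], r["TOTAL_ORDERS_PRIOR_TO_A"]),
        "FIRST_ORDER_RATIO": safe_div(r["ORDER_GROSS_AMT"], r["ITEM_QTY"]),
        "FIRST_PAYMENT_RATIO": safe_div(r["ORDER_GROSS_AMT"], r["PAYMENT_QTY"]),
        "ORDER_FREQ": order_freq,
        "ORDER_DUE_RATIO": safe_div(r["RECENCY"], order_freq),
        "ORDER_DUE_RATIO_2": safe_div(r["RECENCY"] - order_freq, order_freq),
        "ORDER_DUE_RATIO_3": safe_div(r["RECENCY"] - order_freq, r["RECENCY"]),
    }


# Made-up record for illustration only.
example = {
    "ORDER_SH_AMT": 8.0, "ORDER_ADDL_SH_AMT": 2.0, "ORDER_GROSS_AMT": 100.0,
    "PAYMENT_QTY": 4, "ITEM_QTY": 5, "TOTAL_REV_PRIOR_TO_A": 600.0,
    "TENURE": 24, "TOTAL_ORDERS_PRIOR_TO_A": 6, "RECENCY": 8,
}
features = derive_features(example)
```

A customer with no prior orders gets ORDER_FREQ = 0, and every ratio that divides by it is then also set to 0 rather than raising an error.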
Stage 1 Models
Multiple models trained on 50% of the data
• Random Forests (randomForest)
• AdaBoost (ada)
• Gradient Boosting Machines (gbm)
• eXtreme Gradient Boost (xgboost)
• Logistic Regression (variables selected by studying glmnet output)
• Regularized Logistic Regression (glmnet)
Several of the above models have tunable parameters
• Caret package in R used to cycle through various combinations of input parameters using multiple folds
• Problem statement specifies rank-order primacy, hence ROC metric maximized
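Because the objective is rank ordering, the ROC metric (AUC) depends only on the order of the scores, not their scale. A minimal rank-based sketch of the calculation is given below; it is not the caret or R code used in the original work, and the function name `roc_auc` is an assumption.

```python
def roc_auc(labels, scores):
    """AUC as the chance a random positive outranks a random negative (ties count half)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of tied scores starting at position i.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # average 1-based rank of the tied group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    rank_sum_pos = sum(r for r, y in zip(ranks, labels) if y == 1)
    # Mann-Whitney U statistic for the positives, normalized to [0, 1].
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc_value = roc_auc(labels, scores)  # 0.75: three of four positive/negative pairs are ordered correctly
```

Rescaling the scores by any order-preserving transformation leaves the result unchanged, which is why AUC suits a problem judged on rank order.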
Stage 2 Models
• All 5 models built in Stage 1 used to score both Stage 2 and evaluation data
• 5 score columns added back to the data set (Stage 2 and evaluation)
• 4 models created again on Stage 2 dataset
• Stage 1 and Stage 2 models are scored on evaluation dataset
• ROC (AUC) calculated for the models on evaluation dataset
• Best model identified – xgboost (Stage 2)

Model           Stage 1 AUC         Stage 2 AUC
                (evaluation set)    (evaluation set)
xgboost         0.646               0.647
logit           0.641               0.646
gbm             0.636               0.644
glmnet          0.641               0.642
ada             0.637               0.642
random forest   0.617               NA
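The stacking step, in which Stage 1 scores are appended as new columns before Stage 2 models are built, can be sketched as below. The two lambda "models" are hypothetical stand-ins for fitted Stage 1 models, and the `SCORE_` column prefix and function name are assumptions.

```python
def append_stage1_scores(rows, stage1_models):
    """Return copies of rows with one extra score column per Stage 1 model."""
    augmented = []
    for row in rows:
        new_row = dict(row)
        for name, predict in stage1_models.items():
            # Each score is computed from the original features only.
            new_row["SCORE_" + name] = predict(row)
        augmented.append(new_row)
    return augmented


# Hypothetical stand-ins for fitted Stage 1 models: each maps a row to a score.
stage1_models = {
    "model_a": lambda r: 0.1 * r["PAYMENT_QTY"],
    "model_b": lambda r: 1.0 if r["RUSH_ORD_FLAG"] else 0.0,
}
stage2_rows = [
    {"PAYMENT_QTY": 3, "RUSH_ORD_FLAG": 1},
    {"PAYMENT_QTY": 1, "RUSH_ORD_FLAG": 0},
]
stage2_augmented = append_stage1_scores(stage2_rows, stage1_models)
```

The same function would be applied to the evaluation rows, so that Stage 2 models see identical columns at training and scoring time.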
Final Model Building
• Data split 50-50 between Stage 1 modeling and Stage 2 blending
• Xgboost used to blend in Stage 2
• Initial 5 models score the submission dataset; the scores are merged back to create the dataset for the sixth (blend) model
• Blend model used to generate the final submission score
TOP VARIABLES
Important Variables
TXN_CHANNEL_CD
PAYMENT_QTY
RUSH_ORD_FLAG
SHIP_RATIO
FIRST_ORDER_RATIO
DEMOGRAPHIC_SEGMENT
ORDER_GROSS_AMT
RETAIL/CATALOG_SPENDING_QUINTILE
REV_PER_ORDER
HH_INCOME
PAYMT_RATIO
ETHNICITY
LANGUAGE
• Mix of ready-made and derived variables
• Ranking of top variables can be difficult to quantify across multiple modeling techniques/blends
• Plain logistic regression with these variables can create a model with comparable performance (~0.64 AUC)
KEYS TO SUCCESS
• Derived Variables
‒ Create as many behavioral/pattern variables as possible
‒ Ratios such as revenue/order, order frequency, shipping cost to total cost etc.
• Cross-validation for controlling overfitting
‒ K-fold validation runs (as many as possible)
‒ Tune parameters (control depth and boosting rounds to maximize test ROC)
‒ Use grid search for the optimum parameters or employ the Caret package
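A minimal sketch of grid search with k-fold cross-validation, selecting the parameter value with the highest mean held-out AUC. The deck used the Caret package in R to tune tree depth and boosting rounds; here a toy nearest-neighbour scorer with a single parameter stands in for those models, and the data, grid values and function names are all made up for illustration.

```python
def auc(labels, scores):
    """Pairwise AUC: share of positive/negative pairs ranked correctly (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs both classes in the fold")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def knn_score(train, x, k):
    """Toy model: mean label of the k training points nearest to x."""
    nearest = sorted(train, key=lambda row: abs(row[0] - x))[:k]
    return sum(y for _, y in nearest) / len(nearest)


def cross_validated_auc(data, k_neighbors, n_folds=5):
    """Mean held-out AUC over n_folds interleaved folds."""
    fold_aucs = []
    for fold in range(n_folds):
        train = [row for i, row in enumerate(data) if i % n_folds != fold]
        test = [row for i, row in enumerate(data) if i % n_folds == fold]
        scores = [knn_score(train, x, k_neighbors) for x, _ in test]
        fold_aucs.append(auc([y for _, y in test], scores))
    return sum(fold_aucs) / n_folds


# Deterministic synthetic data: label is 1 when x > 0.5, with a few labels flipped as noise.
data = []
for i in range(60):
    x = (37 * i % 60) / 60
    y = int(x > 0.5)
    if i % 7 == 0:
        y = 1 - y
    data.append((x, y))

grid = [1, 5, 15]  # candidate parameter values (hypothetical)
results = {k: cross_validated_auc(data, k) for k in grid}
best_k = max(results, key=results.get)
```

Selecting on held-out AUC rather than training AUC is what keeps the tuned parameters from simply memorizing the training rows.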

DMA Analytics Challenge 2015 (Winner - First Position)
