1. MIS 6334 – Advanced Business Analytics with SAS
Team 7
Aravind Vasu Murugan
Charmi Katira
Prasanna Rao
Rohith Muruganandam
Sriram Murali
Expedia Data Analysis
2. Introduction
• Goal: The main objective of this project is to predict whether a user will book on the Expedia website in the remainder of the session.
• Selection Criteria: Based on the misclassification rate of the model.
• Champion Model: Bagging and Boosting of Decision Trees (Ensemble)
3. Data Preprocessing
• Creating a new target variable "new_bookfut"
– Booklc: dummy variable indicating whether the user has booked at this site up to this point in the current session
– Altered the target variable to 'TRUE' wherever Booklc = 1, to capture all possible scenarios
• Rejecting a redundant variable: SEgc vs. SErate
– SEgc: indicates whether the session came from a search engine
– SErate: number of sessions coming from search engines / total sessions for this site
– Rejected SEgc based on variable worth from the StatExplore node
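The altered-target step above can be sketched in a few lines. This is a hedged, illustrative version in plain Python (the project itself used SAS Enterprise Miner); the record schema and the original target field name `bookfut` are assumptions:

```python
# Sketch: derive the altered target "new_bookfut" from "booklc".
# A session record is modeled as a dict; the exact schema is an assumption.
def derive_target(record):
    """Force the target to TRUE if the user has already booked in this
    session (booklc = 1); otherwise keep the original future-booking flag."""
    if record["booklc"] == 1:
        return True
    return bool(record["bookfut"])

sessions = [
    {"booklc": 0, "bookfut": 0},
    {"booklc": 1, "bookfut": 0},  # already booked -> forced TRUE
    {"booklc": 0, "bookfut": 1},
]
targets = [derive_target(r) for r in sessions]
print(targets)  # [False, True, True]
```

The point of the override is that a session with an earlier booking should count as a positive case even if no further booking follows.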
4. Methods Used
• Data + Models
• Data + Impute + Transform + Models
• Data + Impute + Transform + Variable Selection + Models
• Data + Impute + Transform + Chi-Square Stat Variables + Models
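The "Impute + Transform" stage in the pipelines above can be illustrated with a minimal sketch. This pure-Python version (mean imputation of missing values, then a log transform for right-skewed inputs) is only an assumption about what those Enterprise Miner nodes did, not the actual SAS configuration:

```python
import math

# Sketch of the Impute + Transform preprocessing stage:
# mean imputation of missing values (None), then log(1 + x)
# to reduce right skew while keeping zeros well-defined.
def impute_mean(values):
    present = [v for v in values if v is not None]
    mean = sum(present) / len(present)
    return [mean if v is None else v for v in values]

def log_transform(values):
    return [math.log1p(v) for v in values]

raw = [2.0, None, 6.0]
imputed = impute_mean(raw)          # [2.0, 4.0, 6.0]
transformed = log_transform(imputed)
```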
5. Models Used
Regression
Principal Component Analysis
Decision Tree
Dmine Regression
Partial Least Squares
Neural Network
HP Neural
Support Vector Machine
BN Classifier – Bayesian Network
Bagging - Boosting
Ensemble
HP Random Forest
6. Top Model Comparison
Models | Method | Misclassification Rate
Bagging - Boosting - Decision Tree (Series/Parallel) | Raw Data | 7.6%
HP Random Forest | Raw Data | 7.8%
Ensemble - DST, Bayesian, Dmine | Imputed | 9.25%
Bagging - Boosting - Decision Tree - Ensemble (Parallel) | Raw Data | 9.39%
HP SVM | Imputed | 9.89%
Bagging - Boosting - Dmine Ensemble | Raw Data | 10.32%
Dmine Regression | Raw Data | 10.81%
Optimal DST | Imputed | 11.6%
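The selection criterion used throughout the comparison above is straightforward to compute. A minimal sketch, with made-up label vectors for illustration:

```python
# Misclassification rate = share of predictions that differ from
# the actual labels; lower is better.
def misclassification_rate(actual, predicted):
    wrong = sum(1 for a, p in zip(actual, predicted) if a != p)
    return wrong / len(actual)

actual    = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
predicted = [1, 0, 1, 0, 0, 0, 1, 0, 1, 0]
print(misclassification_rate(actual, predicted))  # 0.1
```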
7. Champion Model
• The champion model is Bagging - Boosting with an Optimal Decision Tree
• Bagging - Boosting in a series connection and Bagging - Boosting in a parallel connection
• The results of both are ensembled
• Misclassification rate: 7.6%
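The final ensembling step above can be sketched as follows, assuming the SAS Ensemble node's posterior-probability-averaging behavior; the two score lists are made-up placeholders for the real model outputs of the series and parallel paths:

```python
# Sketch of the final Ensemble step: average the posterior
# probabilities from the series path (bagging -> boosting) and the
# parallel path, then threshold at 0.5 to get the class decision.
def ensemble_average(prob_series, prob_parallel, threshold=0.5):
    out = []
    for ps, pp in zip(prob_series, prob_parallel):
        out.append(1 if (ps + pp) / 2 >= threshold else 0)
    return out

series_scores   = [0.9, 0.2, 0.6]
parallel_scores = [0.8, 0.3, 0.3]
print(ensemble_average(series_scores, parallel_scores))  # [1, 0, 0]
```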
8. Learnings
• Variable Selection, Interactive Binning, and PCA increase the misclassification rate for this dataset.
• The Quasi-Newton optimization technique used in the Neural Network gives better performance (found by trial and error).
• HP Random Forest and SVM can't be used with Bagging/Boosting because their output is not in SAS DATA step code format.
9. Learnings - Contd
• Dmine regression is better than ordinary regression: it calculates R² for all variables, categorizes each into 16 bins (AOV16), and then calculates R² for the AOV16 variables.
• Contrasting models perform well in an Ensemble model.
– Dmine Regression, Bayesian Network, Optimal DST
– Misclassification rate: 9.2%
• Bagging and boosting connected in series outperforms the parallel combination (misclassification rate 7.6% vs. 9.3%).
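The AOV16 idea mentioned above can be sketched in a few lines: bucket a continuous input into (up to) 16 equal-width bins and measure how much of the target's variance the bins explain. This is a hedged approximation; SAS's exact binning and R² computation are not reproduced here:

```python
# AOV16-style variable screening sketch: R^2 = SS_between / SS_total
# of the target across 16 equal-width bins of the input.
def aov16_r2(x, y, n_bins=16):
    lo, hi = min(x), max(x)
    width = (hi - lo) / n_bins or 1.0   # guard against constant x
    groups = {}
    for xi, yi in zip(x, y):
        b = min(int((xi - lo) / width), n_bins - 1)
        groups.setdefault(b, []).append(yi)
    grand = sum(y) / len(y)
    ss_total = sum((yi - grand) ** 2 for yi in y)
    ss_between = sum(
        len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups.values()
    )
    return ss_between / ss_total if ss_total else 0.0
```

A relationship the bins capture perfectly scores near 1, while a constant target scores 0, which is why binning can surface nonlinear predictors that a plain linear R² would miss.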
10. Challenges
• Reducing the misclassification rate of the models to single digits.
– SVM model (Imputed data): 9.89%
– Ensemble model of Dmine, Bayesian, HP Neural (Imputed data): 9.25%
– Random Forest (Original data): 7.8%
– Bagging - Boosting of Decision Tree (Original data): 7.6%
• Combining bagging and boosting in an ensemble model.
• Developing a model that performs better than Random Forest (misclassification rate 7.8%).
• Finding input models that work well with the Ensemble to achieve good performance.
11. Challenges - Contd
• Manipulating the target variable to include more TRUE values using the booklc attribute
• Renaming the variables from x1-x41 to their actual names
• Finding similar user-centric variables to avoid unnecessary redundant classification
• Using models other than the decision tree with Bagging/Boosting, e.g., Dmine
12. Surprising Findings
• The raw dataset performs better than imputed/transformed/chi-square variable data.
• Bagging/Boosting gives better results than HP Random Forest.
• Series vs. parallel connection
– Using bagging and boosting in a series connection, i.e., the output of bagging as the input to boosting, yields better results than processing bagging and boosting in parallel
13. Surprising Findings - Contd
• SVM performs very poorly on raw data (misclassification rate 26%) but performs well on imputed data (misclassification rate 9.8%).
• The Neural Network has a higher misclassification rate than the optimal decision tree and the Bayesian network.
• Transforming the skewness of variables does not yield the desired results for this dataset.