1. MIS 6334 – Advanced Business Analytics with SAS
Team 7
Aravind Vasu Murugan
Charmi Katira
Prasanna Rao
Rohith Muruganandam
Sriram Murali
Expedia Data Analysis
2. Introduction
• Goal: The main objective of this project is to predict whether a user will book on the Expedia website in the remainder of the session.
• Selection Criteria: Based on the misclassification rate of the model.
• Champion Model: Bagging and Boosting of Decision Trees (Ensemble)
3. Data Preprocessing
• Creating a new target variable "new_bookfut"
– Booklc: dummy variable indicating whether the user has booked at this site up to this point in the current session
– Altered the target variable to 'TRUE' wherever Booklc = 1, to capture all possible scenarios
• Rejecting a redundant variable: SEgc vs. SErate
– SEgc: indicates whether the session came from a search engine
– SErate: number of sessions coming from search engines / total sessions for this site
– Rejected SEgc based on variable worth from the StatExplore node
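The altered-target step above can be sketched in a few lines. This is a hedged, illustrative version in plain Python (the project itself used SAS Enterprise Miner); the record schema and the original target field name `bookfut` are assumptions:

```python
# Sketch: derive the altered target "new_bookfut" from "booklc".
# A session record is modeled as a dict; the exact schema is an assumption.
def derive_target(record):
    """Force the target to TRUE if the user has already booked in this
    session (booklc = 1); otherwise keep the original future-booking flag."""
    if record["booklc"] == 1:
        return True
    return bool(record["bookfut"])

sessions = [
    {"booklc": 0, "bookfut": 0},
    {"booklc": 1, "bookfut": 0},  # already booked -> forced TRUE
    {"booklc": 0, "bookfut": 1},
]
targets = [derive_target(r) for r in sessions]
print(targets)  # [False, True, True]
```

The point of the override is that a session with an earlier booking should count as a positive case even if no further booking follows.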
4. Methods Used
• Data + Models
• Data + Impute + Transform + Models
• Data + Impute + Transform + Variable Selection + Models
• Data + Impute + Transform + Chi-Square Stat Variables + Models
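The "Impute + Transform" stage in the pipelines above can be illustrated with a minimal sketch. This pure-Python version (mean imputation of missing values, then a log transform for right-skewed inputs) is only an assumption about what those Enterprise Miner nodes did, not the actual SAS configuration:

```python
import math

# Sketch of the Impute + Transform preprocessing stage:
# mean imputation of missing values (None), then log(1 + x)
# to reduce right skew while keeping zeros well-defined.
def impute_mean(values):
    present = [v for v in values if v is not None]
    mean = sum(present) / len(present)
    return [mean if v is None else v for v in values]

def log_transform(values):
    return [math.log1p(v) for v in values]

raw = [2.0, None, 6.0]
imputed = impute_mean(raw)          # [2.0, 4.0, 6.0]
transformed = log_transform(imputed)
```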
5. Models Used
Regression
Principal Component Analysis
Decision Tree
Dmine Regression
Partial Least Squares
Neural Network
HP Neural
Support Vector Machine
BN Classifier – Bayesian Network
Bagging - Boosting
Ensemble
HP Random Forest
6. Top Model Comparison
Models | Method | Misclassification Rate
Bagging - Boosting - Decision Tree (Series/Parallel) | Raw Data | 7.6%
HP Random Forest | Raw Data | 7.8%
Ensemble - DST, Bayesian, Dmine | Imputed | 9.25%
Bagging - Boosting - Decision Tree - Ensemble (Parallel) | Raw Data | 9.39%
HP SVM | Imputed | 9.89%
Bagging - Boosting - Dmine Ensemble | Raw Data | 10.32%
Dmine Regression | Raw Data | 10.81%
Optimal DST | Imputed | 11.6%
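The selection criterion used throughout the comparison above is straightforward to compute. A minimal sketch, with made-up label vectors for illustration:

```python
# Misclassification rate = share of predictions that differ from
# the actual labels; lower is better.
def misclassification_rate(actual, predicted):
    wrong = sum(1 for a, p in zip(actual, predicted) if a != p)
    return wrong / len(actual)

actual    = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
predicted = [1, 0, 1, 0, 0, 0, 1, 0, 1, 0]
print(misclassification_rate(actual, predicted))  # 0.1
```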
7. Champion Model
• The champion model is Bagging - Boosting with an Optimal Decision Tree
• Bagging - Boosting in a series connection and Bagging - Boosting in a parallel connection
• The results of both are ensembled
• Misclassification rate: 7.6%
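The final ensembling step above can be sketched as follows, assuming the SAS Ensemble node's posterior-probability-averaging behavior; the two score lists are made-up placeholders for the real model outputs of the series and parallel paths:

```python
# Sketch of the final Ensemble step: average the posterior
# probabilities from the series path (bagging -> boosting) and the
# parallel path, then threshold at 0.5 to get the class decision.
def ensemble_average(prob_series, prob_parallel, threshold=0.5):
    out = []
    for ps, pp in zip(prob_series, prob_parallel):
        out.append(1 if (ps + pp) / 2 >= threshold else 0)
    return out

series_scores   = [0.9, 0.2, 0.6]
parallel_scores = [0.8, 0.3, 0.3]
print(ensemble_average(series_scores, parallel_scores))  # [1, 0, 0]
```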
8. Learnings
• Variable Selection, Interactive Binning, and PCA increase the misclassification rate for this dataset.
• The Quasi-Newton optimization technique used in the Neural Network gives better performance (found by trial and error).
• HP Random Forest and SVM can't be used with Bagging/Boosting because their output is not in SAS DATA step code format.
9. Learnings - Contd
• Dmine regression is better than ordinary regression: it calculates R² for all variables, categorizes each into 16 bins (AOV16), and then calculates R² for the AOV16 variables.
• Contrasting models perform well in an Ensemble model.
– Dmine Regression, Bayesian Network, Optimal DST
– Misclassification rate: 9.2%
• Bagging and boosting connected in series outperforms the parallel combination (misclassification rate 7.6% vs. 9.3%).
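The AOV16 idea mentioned above can be sketched in a few lines: bucket a continuous input into (up to) 16 equal-width bins and measure how much of the target's variance the bins explain. This is a hedged approximation; SAS's exact binning and R² computation are not reproduced here:

```python
# AOV16-style variable screening sketch: R^2 = SS_between / SS_total
# of the target across 16 equal-width bins of the input.
def aov16_r2(x, y, n_bins=16):
    lo, hi = min(x), max(x)
    width = (hi - lo) / n_bins or 1.0   # guard against constant x
    groups = {}
    for xi, yi in zip(x, y):
        b = min(int((xi - lo) / width), n_bins - 1)
        groups.setdefault(b, []).append(yi)
    grand = sum(y) / len(y)
    ss_total = sum((yi - grand) ** 2 for yi in y)
    ss_between = sum(
        len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups.values()
    )
    return ss_between / ss_total if ss_total else 0.0
```

A relationship the bins capture perfectly scores near 1, while a constant target scores 0, which is why binning can surface nonlinear predictors that a plain linear R² would miss.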
10. Challenges
• Reducing the misclassification rate of the models to single digits.
– SVM model (Imputed data): 9.89%
– Ensemble model of Dmine, Bayesian, HP Neural (Imputed data): 9.25%
– Random Forest (Original data): 7.8%
– Bagging - Boosting of Decision Tree (Original data): 7.6%
• Combining bagging and boosting in an ensemble model.
• Developing a model that performs better than Random Forest (misclassification rate 7.8%).
• Finding input models that work well with the Ensemble to achieve good performance.
11. Challenges - Contd
• Manipulating the target variable to include more TRUE values using the booklc attribute
• Renaming the variables from x1-x41 to their actual names
• Finding similar user-centric variables to avoid unnecessary redundant classification
• Using models other than the decision tree with Bagging/Boosting, e.g., Dmine
12. Surprising Findings
• The raw dataset performs better than imputed/transformed/chi-square variable data.
• Bagging/Boosting gives better results than HP Random Forest.
• Series vs. parallel connection
– Using bagging and boosting in a series connection, i.e., the output of bagging as the input to boosting, yields better results than processing bagging and boosting in parallel
13. Surprising Findings - Contd
• SVM performs very poorly on raw data (misclassification rate 26%) but performs well on imputed data (misclassification rate 9.8%).
• The Neural Network has a higher misclassification rate than the optimal decision tree and the Bayesian network.
• Transforming the skewness of variables does not yield the desired results for this dataset.