MIS 6334 – Advanced Business Analytics with SAS
Team 7
Aravind Vasu Murugan
Charmi Katira
Prasanna Rao
Rohith Muruganandam
Sriram Murali
Expedia Data Analysis
Introduction
• Goal: Predict whether a user will book on the Expedia website during the
remainder of the current session.
• Selection Criterion: Models are compared by misclassification rate.
• Champion Model: Bagging and Boosting of a Decision Tree
(Ensemble)
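The selection criterion above can be illustrated with a minimal sketch. The labels below are hypothetical and are not taken from the project data; the function simply counts the share of cases where the predicted class differs from the actual class.

```python
def misclassification_rate(actual, predicted):
    """Fraction of cases whose predicted class differs from the actual class."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("need at least one case")
    wrong = sum(1 for a, p in zip(actual, predicted) if a != p)
    return wrong / len(actual)


# Hypothetical validation labels (1 = booked later in the session, 0 = did not).
actual = [1, 0, 0, 1, 0, 0, 1, 0, 0, 0]
predicted = [1, 0, 0, 0, 0, 0, 1, 0, 1, 0]

rate = misclassification_rate(actual, predicted)
print(f"{rate:.1%}")  # two of ten cases are wrong, so this prints 20.0%
```

A lower rate is better, so the champion is the model with the smallest value on the validation data.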
Data Preprocessing
• Creating a new target variable, “new_bookfut”
– Booklc: dummy variable indicating whether the user has booked at this
site up to this point in the current session
– Set the target to ‘TRUE’ wherever Booklc = 1, to capture all possible
scenarios
• Rejecting a redundant variable: SEgc vs SErate
– SEgc: indicates whether this session uses search engines
– SErate: number of sessions coming from search engines / total sessions
of this site
– Rejected SEgc based on variable worth from the StatExplore node
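The target alteration described above can be sketched as follows. The name of the original target, bookfut, is an assumption inferred from “new_bookfut”, and the session records are made up for illustration; the project itself did this step in SAS.

```python
def derive_new_bookfut(bookfut, booklc):
    """Altered target: TRUE if the original target is TRUE or booklc == 1."""
    return bool(bookfut) or booklc == 1


# Hypothetical sessions; "bookfut" is an assumed name for the original target.
sessions = [
    {"bookfut": False, "booklc": 0},
    {"bookfut": False, "booklc": 1},  # already booked, so the target becomes TRUE
    {"bookfut": True, "booklc": 0},
    {"bookfut": True, "booklc": 1},
]

for s in sessions:
    s["new_bookfut"] = derive_new_bookfut(s["bookfut"], s["booklc"])

new_targets = [s["new_bookfut"] for s in sessions]
print(new_targets)
```

Only sessions with neither an earlier booking nor a later one keep a FALSE target under this rule.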
Methods Used
• Data + Models
• Data + Impute + Transform + Models
• Data + Impute + Transform + Variable Selection + Models
• Data + Impute + Transform + Chi Square Stat Variables + Models
Models Used
Regression
Principal Component Analysis
Decision Tree
Dmine Regression
Partial Least Squares
Neural Network
HP Neural
Support Vector Machine
BN Classifier – Bayesian Network
Bagging - Boosting
Ensemble
HP Random Forest
Top Model Comparison
Model                                                      Method     Misclassification Rate
Bagging - Boosting - Decision Tree (Series/Parallel)       Raw Data   7.6%
HP Random Forest                                           Raw Data   7.8%
Ensemble - DST, Bayesian, Dmine                            Imputed    9.25%
Bagging - Boosting - Decision Tree - Ensemble (Parallel)   Raw Data   9.39%
HP SVM                                                     Imputed    9.89%
Bagging - Boosting - Dmine Ensemble                        Raw Data   10.32%
Dmine Regression                                           Raw Data   10.81%
Optimal DST                                                Imputed    11.6%
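As a small illustration of how the champion follows from this comparison, the sketch below re-enters the rates from the table and picks the smallest one. The tuple layout and variable names are choices made for this example only.

```python
# (model, method, misclassification rate in %), copied from the comparison table.
results = [
    ("Bagging - Boosting - Decision Tree (Series/Parallel)", "Raw Data", 7.6),
    ("HP Random Forest", "Raw Data", 7.8),
    ("Ensemble - DST, Bayesian, Dmine", "Imputed", 9.25),
    ("Bagging - Boosting - Decision Tree - Ensemble (Parallel)", "Raw Data", 9.39),
    ("HP SVM", "Imputed", 9.89),
    ("Bagging - Boosting - Dmine Ensemble", "Raw Data", 10.32),
    ("Dmine Regression", "Raw Data", 10.81),
    ("Optimal DST", "Imputed", 11.6),
]

# The champion is the model with the lowest misclassification rate.
champion = min(results, key=lambda row: row[2])

# Models that reached a single-digit misclassification rate.
single_digit = [name for name, _, rate in results if rate < 10.0]

print(champion[0], champion[2])
print(len(single_digit))
```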
Champion Model
• The champion model is Bagging - Boosting with the Optimal Decision
Tree
• It uses both a Bagging - Boosting series connection and a Bagging -
Boosting parallel connection
• The results are combined with an Ensemble
• Misclassification rate: 7.6%
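One common way to "ensemble the results" is to average the posterior probabilities of the input models case by case and then apply a cutoff. The slides do not say which combination rule was used, so the averaging rule, the 0.5 cutoff, and the probabilities below are all assumptions for illustration.

```python
def average_posteriors(model_posteriors):
    """Average P(target = 1) across models, case by case."""
    n_models = len(model_posteriors)
    n_cases = len(model_posteriors[0])
    if any(len(p) != n_cases for p in model_posteriors):
        raise ValueError("every model must score the same cases")
    return [
        sum(p[i] for p in model_posteriors) / n_models
        for i in range(n_cases)
    ]


def classify(posteriors, cutoff=0.5):
    """Turn posterior probabilities into 0/1 class labels."""
    return [1 if p >= cutoff else 0 for p in posteriors]


# Hypothetical posteriors for four sessions from the two flows.
series_flow = [0.9, 0.2, 0.6, 0.1]
parallel_flow = [0.7, 0.4, 0.3, 0.2]

combined = average_posteriors([series_flow, parallel_flow])
labels = classify(combined)
print(labels)
```

In the third session the two flows disagree, and the averaged probability falls below the cutoff, so the combined model predicts no booking.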
Learnings
• Variable Selection, Interactive Binning and PCA increase the
misclassification rate for this dataset.
• The Quasi-Newton optimization technique in the Neural Network gives
better performance than the alternatives tried (found by trial and error).
• HP Random Forest and SVM can’t be used with Bagging/Boosting
because their output is not in SAS DATA step code format.
Learnings - Contd
• Dmine regression performs better than ordinary regression here: it
calculates R² for all variables, groups them into 16 bins (AOV16), and
then calculates R² for the AOV16 variables.
• Contrasting models perform well together in an Ensemble model.
– Dmine Regression, Bayesian Network, Optimal DST
– Misclassification rate: 9.2%
• Bagging and boosting connected in series outperform the parallel
combination (misclassification rate 7.6% vs 9.3%).
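The Dmine point above can be illustrated with a rough sketch of why a 16-bin (AOV16) version of a variable can have a higher R² than the raw variable. This is not SAS code and not the Dmine Regression node's actual implementation: equal-width binning, the helper names, and the U-shaped toy data are assumptions made for this example.

```python
def r_squared_linear(x, y):
    """R^2 of a simple linear fit of y on x (squared Pearson correlation)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy * sxy / (sxx * syy)


def aov16_bins(x, n_bins=16):
    """Assign each value to one of n_bins equal-width bins (assumed scheme)."""
    lo, hi = min(x), max(x)
    width = (hi - lo) / n_bins or 1.0
    return [min(int((v - lo) / width), n_bins - 1) for v in x]


def r_squared_groups(groups, y):
    """R^2 of a one-way ANOVA: between-group sum of squares / total."""
    n = len(y)
    my = sum(y) / n
    sst = sum((b - my) ** 2 for b in y)
    if sst == 0:
        return 0.0
    sums, counts = {}, {}
    for g, b in zip(groups, y):
        sums[g] = sums.get(g, 0.0) + b
        counts[g] = counts.get(g, 0) + 1
    ssb = sum(counts[g] * (sums[g] / counts[g] - my) ** 2 for g in sums)
    return ssb / sst


# Hypothetical U-shaped relationship that a straight line cannot capture.
x = [i / 10 for i in range(-50, 51)]
y = [v * v for v in x]

linear = r_squared_linear(x, y)
binned = r_squared_groups(aov16_bins(x), y)
print(linear < binned)
```

On this toy data the raw variable explains almost nothing linearly, while the binned version explains most of the variation, which is the kind of nonlinear signal the binned R² is meant to pick up.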
Challenges
• Reducing the misclassification rate of the models to a single digit.
– SVM Model (imputed data): 9.89%
– Ensemble Model of Dmine, Bayesian, HP Neural (imputed data): 9.25%
– Random Forest (original data): 7.8%
– Bagging - Boosting of Decision Tree (original data): 7.6%
• Combining bagging and boosting in an ensemble model.
• Developing a model that performs better than Random Forest
(misclassification rate: 7.8%).
• Finding input models that work well with Ensemble to achieve good
performance.
Challenges - Contd
• Altering the target variable to add more TRUE values using the booklc
attribute.
• Renaming the variables from x1-x41 to their actual names.
• Finding similar user-centric variables to avoid unnecessary redundant
classification.
• Using models other than the decision tree with Bagging/Boosting,
e.g. Dmine.
Surprising Findings
• The raw dataset performs better than the imputed, transformed, or
Chi-square-selected variable data.
• Bagging/Boosting gives a better result than HP Random Forest.
• Series vs parallel connection
– Connecting bagging and boosting in series, i.e. feeding the output of
bagging into boosting, yields a better result than running bagging and
boosting in parallel.
Surprising Findings - Contd
• SVM performs very poorly with raw data (misclassification rate: 26%) but
performs well with imputed data (misclassification rate: 9.8%).
• The Neural Network has a higher misclassification rate than the optimal
decision tree and the Bayesian network.
• Transforming skewed variables does not yield the desired results for
this dataset.
Thank You
Questions?



Editor's Notes

  1. Source: https://www.expedia.com/