Extension and Validation of Moro et al.
By: Tapan Oza

Goals
• Validation
  • Repeatable results
  • Use same data
  • Use same protocol
• Extension
  • Same data, new protocols
  • Averaged one-dependence estimators (AODE)
  • Random Forest
• Tools used: Weka

Original Paper
• "Using data mining for bank direct marketing: An application of the CRISP-DM methodology." Moro et al.
• CRISP-DM: CRoss-Industry Standard Process for Data Mining
• Paper uses data from a Portuguese bank
  • Acquired via call center in 17 different campaigns
  • Large number of features
  • Large number of cases
• Classification methodologies:
  • Naïve Bayes
  • Decision Tree
  • Support Vector Machine

CRISP-DM

Classification Methodologies
• Naïve Bayes
  • Assumes features are conditionally independent given the class
  • Classification using Bayes' rule
  • Apply a decision rule on the probability function
• Decision Tree
  • Many ways to build a tree
  • Common method splits on information gain
• Support Vector Machine
  • Identifies separating hyperplanes
  • Hard-margin form requires linearly separable data; soft margins and kernels relax this

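The "splits on information gain" bullet can be illustrated with a minimal Python sketch. This is not the tree builder used in Weka or in the paper. The toy rows and the feature meanings (contacted before, has loan) are hypothetical and exist only to show the arithmetic: the gain of a feature is the entropy of the labels minus the weighted entropy of the groups produced by splitting on that feature.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, labels, feature_index):
    """Label entropy minus the weighted entropy after splitting on one feature."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[feature_index], []).append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Hypothetical toy data: (contacted_before, has_loan) -> subscribed?
rows = [("yes", "no"), ("yes", "yes"), ("no", "no"), ("no", "yes")]
labels = ["yes", "yes", "no", "no"]

gains = [information_gain(rows, labels, i) for i in range(2)]
# A tree builder of this kind would split first on the feature with the largest gain.
best_feature = max(range(2), key=lambda i: gains[i])
```

In this toy set the first feature separates the classes completely, so it would be chosen for the first split, while the second feature carries no information about the label.
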
Performance: Accuracy vs Speed
• Why Accuracy?
  • Data mining is strategic
  • Computation costs are falling (Amazon EC2)
  • Without accuracy, the model is useless
• What do we use to measure Accuracy?
  • Area under the receiver operating characteristic curve (AUROC)
  • Higher AUROC = more confidence that positive cases are ranked above negative ones (0.5 = chance, 1.0 = perfect)

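As a rough illustration of what AUROC measures, the sketch below computes it directly from its pairwise definition: the fraction of (positive, negative) pairs in which the positive case gets the higher score, with ties counted as half. The scores and labels are made up for the example, and this is not how Weka computes the value internally; it is only a small check of the idea.

```python
def auroc(scores, labels):
    """Fraction of positive/negative pairs ranked correctly (ties count half)."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical classifier scores and true labels (1 = subscribed, 0 = did not)
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]
result = auroc(scores, labels)
```

Here 8 of the 9 positive/negative pairs are ordered correctly, so the value is 8/9. The pairwise loop is quadratic in the number of cases, which is fine for a sketch but not for a data set of the size used in the paper.
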
Extensions
• AODE
  • Modified Naïve Bayes
  • Weaker independence assumption than Naïve Bayes
  • Higher computational cost
  • Computation is cheap
• Random Forest
  • Many trees, one classification
  • Every tree “votes” on classification
  • Class with most “votes” is chosen
  • Impressive accuracy

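The voting step of a random forest can be sketched in a few lines of Python. The three "trees" below are hypothetical hand-written rules standing in for trained trees, and the feature names (`duration`, `previous_success`, `age`) are assumptions made for the example. A real forest also trains each tree on a bootstrap sample with random feature subsets, which this sketch leaves out.

```python
from collections import Counter


def majority_vote(trees, case):
    """Return the class predicted by the largest number of trees."""
    votes = Counter(tree(case) for tree in trees)
    return votes.most_common(1)[0][0]


# Hypothetical stand-ins for trained decision trees
trees = [
    lambda c: "yes" if c["duration"] > 300 else "no",
    lambda c: "yes" if c["previous_success"] else "no",
    lambda c: "yes" if c["age"] < 30 else "no",
]

case = {"duration": 420, "previous_success": True, "age": 45}
prediction = majority_vote(trees, case)
```

Two of the three stand-in trees vote "yes" for this case, so the forest's single classification is "yes".
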
Results: Validation
• Paper doesn’t specify tree type
  • Average two different tree results
• 2 out of 3 validated
  • SVM not validated

AUROC        SVM     NB      Decision Tree
Original     0.938   0.870   0.868
Validation   0.583   0.861   0.863

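The "2 out of 3 validated" count can be reproduced from the table with a short Python check. The tolerance of 0.02 is an assumption made for this sketch; the slides do not state what gap was treated as acceptable.

```python
# AUROC values taken from the table above
original = {"SVM": 0.938, "NB": 0.870, "Decision Tree": 0.868}
validation = {"SVM": 0.583, "NB": 0.861, "Decision Tree": 0.863}

TOLERANCE = 0.02  # assumed threshold, not stated in the slides

# Absolute gap between the published and the reproduced AUROC per model
gaps = {model: abs(original[model] - validation[model]) for model in original}
validated = [model for model, gap in gaps.items() if gap <= TOLERANCE]
```

Under that assumed tolerance, Naïve Bayes and the decision tree fall within range of the published values, while the SVM gap of roughly 0.355 does not.
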
Results: Extension
• Extension was to have two models
  • AODE
  • Random forest
• Weka output for AODE was incomplete
  • Cause unknown
  • Could be Weka
• Random forest AUROC is 0.9
  • Best result out of all the algorithms

Lessons Learned
• Random forest has impressive accuracy
• Naïve Bayes, Decision Tree, Random Forest are accurate enough for deployment
• Make sure you have the same tools when validating
• Make sure you use multiple tools when testing extensions

References:
• Moro, Sérgio, Raul Laureano, and Paulo Cortez. "Using data mining for bank direct marketing: An application of the CRISP-DM methodology." (2011).
• Breiman, Leo. "Random forests." Machine Learning 45.1 (2001): 5-32.
• Webb, Geoffrey I., Janice R. Boughton, and Zhihai Wang. "Not so naive Bayes: Aggregating one-dependence estimators." Machine Learning 58.1 (2005): 5-24.

Questions?
