2. Goals
• Validation
• Repeatable results
• Use same data
• Use same protocol
• Extension
• Same data, new protocols
• Averaged one-dependence estimators (AODE)
• Random Forest
• Tools used: Weka
Tapan Oza
3. Original Paper
• "Using data mining for bank direct marketing: An application of the CRISP-DM methodology." Moro et al.
• CRISP-DM: CRoss-Industry Standard Process for Data Mining
• Paper uses data from a Portuguese bank
• Acquired via call center in 17 different campaigns
• Large number of features
• Large number of cases
• Classification methodologies:
• Naïve Bayes
• Decision Tree
• Support Vector Machine
5. Classification Methodologies
• Naïve Bayes
• Assumes features are independent given the class
• Classification using Bayes' rule
• Applies a decision rule to the resulting class probabilities
• Decision Tree
• Many ways to build a tree
• Common method splits on information gain
• Support Vector Machine
• Identifies a maximum-margin separating hyperplane
• Basic (hard-margin, linear) form requires linearly separable data
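The information-gain split criterion named in the Decision Tree bullets can be sketched as follows. This is a minimal illustration for a categorical feature, not the tree learner Weka uses; the function names and the toy `contact` / `subscribed` values are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(feature_values, labels):
    """Entropy reduction obtained by partitioning labels on a categorical feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    # Weighted average entropy of the partitions created by the split.
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Hypothetical toy data: the feature separates the classes perfectly.
contact = ["cellular", "cellular", "telephone", "telephone"]
subscribed = ["yes", "yes", "no", "no"]
print(information_gain(contact, subscribed))
```

A tree builder of this kind would evaluate every candidate feature and split on the one with the largest gain.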
6. Performance: Accuracy vs Speed
• Why accuracy?
• Data mining is strategic
• Computation costs are falling (Amazon EC2)
• Without accuracy, a model is useless
• What do we use to measure accuracy?
• Area under the receiver operating characteristic curve (AUROC)
• Higher AUROC = more confidence in classification (0.5 is chance, 1.0 is perfect)
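One way to read AUROC is as the probability that a randomly chosen positive case is scored above a randomly chosen negative case. The sketch below computes it directly from that definition; it is an illustration with hypothetical labels and scores, not how Weka computes its reported value.

```python
def auroc(labels, scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    labels: 1 for the positive class, 0 for the negative class.
    scores: classifier scores, higher meaning "more likely positive".
    Tied scores count as half a correct pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pos) * len(neg))


# Hypothetical scores: three of the four positive/negative pairs are ordered correctly.
print(auroc([1, 0, 1, 0], [0.9, 0.8, 0.4, 0.1]))
```

The pairwise loop is quadratic in the number of cases, so it suits small illustrations; rank-based formulas give the same value faster on large datasets.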
7. Extensions
• AODE
• Modified Naïve Bayes
• Weaker independence assumption than Naïve Bayes
• Higher computational cost
• Computation is cheap
• Random Forest
• Many trees, one classification
• Every tree “votes” on classification
• Class with most “votes” is chosen
• Impressive accuracy
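The voting step described for Random Forest can be sketched as below. The three "trees" are hypothetical one-rule stand-ins written as plain functions, and the feature names are assumptions; a real forest trains each tree on a bootstrap sample with random feature subsets.

```python
from collections import Counter


def forest_predict(trees, sample):
    """Return the class that receives the most votes from the trees."""
    votes = [tree(sample) for tree in trees]
    # most_common(1) yields the (class, count) pair with the highest count;
    # on a tie, the class that was voted for first wins.
    return Counter(votes).most_common(1)[0][0]


# Hypothetical stand-ins for trained trees, each voting "yes" or "no".
trees = [
    lambda s: "yes" if s["duration"] > 200 else "no",
    lambda s: "yes" if s["previous"] > 0 else "no",
    lambda s: "yes" if s["duration"] > 500 else "no",
]

sample = {"duration": 300, "previous": 0}
print(forest_predict(trees, sample))
```

Here one tree votes "yes" and two vote "no", so the forest returns "no".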
8. Results: Validation
• Paper doesn’t specify tree type
• Averaged the results of two different tree types
• 2 out of 3 validated
• SVM not validated

AUROC        SVM     NB      Decision Tree
Original     0.938   0.870   0.868
Validation   0.583   0.861   0.863
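The "2 out of 3 validated" count can be reproduced from the AUROC values above by comparing each validation result with the original. The slides do not state the criterion used, so the 0.02 tolerance below is a hypothetical threshold chosen only for illustration.

```python
# AUROC values as reported on this slide.
original = {"SVM": 0.938, "NB": 0.870, "Decision Tree": 0.868}
validation = {"SVM": 0.583, "NB": 0.861, "Decision Tree": 0.863}

TOLERANCE = 0.02  # hypothetical cutoff; not taken from the slides


def validated_models(original, validation, tolerance):
    """List the models whose validation AUROC is within tolerance of the original."""
    return [model for model in original
            if abs(original[model] - validation[model]) <= tolerance]


for model in original:
    gap = round(original[model] - validation[model], 3)
    print(model, gap)

print(validated_models(original, validation, TOLERANCE))
```

Under this assumed tolerance, NB and Decision Tree count as validated and SVM does not, which matches the slide.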
9. Results: Extension
• Extension was to have two models
• AODE
• Random forest
• Weka output for AODE was incomplete
• Cause unknown
• Could be Weka
• Random forest AUROC is 0.9
• Best result of all the algorithms we ran
10. Lessons Learned
• Random forest has impressive accuracy
• Naïve Bayes, Decision Tree, and Random Forest are accurate enough for deployment
• Use the same tools as the original study when validating
• Use multiple tools when testing extensions
11. References
• Moro, Sérgio, Raul Laureano, and Paulo Cortez. "Using data mining for bank direct marketing: An application of the CRISP-DM methodology." (2011).
• Breiman, Leo. "Random forests." Machine Learning 45.1 (2001): 5-32.
• Webb, Geoffrey I., Janice R. Boughton, and Zhihai Wang. "Not so naive Bayes: Aggregating one-dependence estimators." Machine Learning 58.1 (2005): 5-24.