Deriving Knowledge from Data at Scale
Lecture 8 Agenda
Opening Discussion (45 min)
• Course Project Check-In
• Thought Exercise
Data Transformation (60 min)
• Attribute Selection
• SVM Considerations
Ensembling: review and deeper dive (60 min)
Being proficient at data science requires intuition and judgement.
Intuition and judgement come with experience.
There is no compression algorithm for experience…
But you can hack this…
Three Steps (every 3–4 months)
1. Become proficient in using one tool;
2. Select one algorithm for a deep dive;
3. Focus on one data type.
Hands-on practice…
What tools to use?
• Weka – explorer…
• KNIME – experimentation…
Get proficient in at least two (2) tools…
http://www.slideshare.net/DataRobot/final-10-r-xc-36610234
This is how you learn…
Performing Experiments
Copy on Catalyst…
Course Project
What is the data telling you?
Attribute Selection
(feature selection)
Problem: Where to focus attention?
What is Evaluated?

Evaluation Method     Attributes   Subsets of Attributes
Independent           Filters      Filters
Learning Algorithm    –            Wrappers
Tab for selecting attributes in a data set…
Interface for classes that evaluate attributes…
Interface for ranking or searching for a subset of attributes…
Select CorrelationAttributeEval for Pearson correlation…
False: doesn’t return the R scores; True: returns the R scores.
Ranker ranks attributes by their individual evaluations; it is used in conjunction with GainRatio, Entropy, Pearson, etc.
• Number of attributes to return; -1 returns all ranked attributes.
• Attributes to ignore (skip) in the evaluation, format: [1, 3-5, 10].
• Cutoff below which attributes are discarded; -1 means no cutoff.
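The same ranking can be driven from Weka’s Java API. A minimal sketch — the file name train.arff is a placeholder for your own ARFF file:

```java
// Sketch: ranking attributes by Pearson correlation with Weka's Java API.
// Assumes "train.arff" (placeholder) with the class as the last attribute.
import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.CorrelationAttributeEval;
import weka.attributeSelection.Ranker;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class RankByCorrelation {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("train.arff");
        data.setClassIndex(data.numAttributes() - 1);

        Ranker ranker = new Ranker();
        ranker.setNumToSelect(-1);                        // -1 = return all ranked attributes

        AttributeSelection sel = new AttributeSelection();
        sel.setEvaluator(new CorrelationAttributeEval()); // Pearson's R against the class
        sel.setSearch(ranker);
        sel.SelectAttributes(data);

        System.out.println(sel.toResultsString());        // ranked list with R scores
    }
}
```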
Next, evaluating subsets of attributes with an independent method — filters such as CfsSubsetEval…
CfsSubsetEval
True: adds features that are correlated with the class and NOT intercorrelated with features already in the selection; False: eliminates redundant features.
Precompute the correlation matrix in advance (useful for fast backtracking) or compute it lazily; when given a large number of attributes, compute lazily…
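A minimal Java sketch of the same subset selection, assuming a placeholder train.arff and the default BestFirst search (described under wrappers below):

```java
// Sketch: correlation-based feature subset selection (CfsSubsetEval + BestFirst),
// mirroring the GUI settings above. "train.arff" is a placeholder file name.
import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.BestFirst;
import weka.attributeSelection.CfsSubsetEval;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CfsSelect {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("train.arff");
        data.setClassIndex(data.numAttributes() - 1);

        AttributeSelection sel = new AttributeSelection();
        sel.setEvaluator(new CfsSubsetEval()); // favors class-correlated, non-redundant features
        sel.setSearch(new BestFirst());        // greedy hill-climbing with backtracking
        sel.SelectAttributes(data);

        for (int idx : sel.selectedAttributes())  // note: includes the class index last
            System.out.println(data.attribute(idx).name());
    }
}
```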
Next, evaluating subsets of attributes with a learning algorithm — wrappers.
The wrapper loop:
1. Select a subset of attributes;
2. Induce the learning algorithm on this subset;
3. Evaluate the resulting model (e.g., accuracy);
4. Stop? If yes, done; if no, return to step 1.
WrapperSubsetEval
• Select and configure the ML algorithm to wrap…
• Evaluation measure: accuracy (default for discrete classes), RMSE (default for numeric), AUC, AUPRC, F-measure (discrete class);
• Number of cross-validation folds used to estimate subset accuracy.
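A sketch of the wrapper from the Java API. J48 is only a stand-in base learner and train.arff a placeholder; any classifier can be wrapped:

```java
// Sketch: a wrapper around J48, scoring candidate subsets by 5-fold CV accuracy.
import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.BestFirst;
import weka.attributeSelection.WrapperSubsetEval;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class WrapperSelect {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("train.arff");
        data.setClassIndex(data.numAttributes() - 1);

        WrapperSubsetEval wrapper = new WrapperSubsetEval();
        wrapper.setClassifier(new J48()); // the learning algorithm inside the loop
        wrapper.setFolds(5);              // folds used to estimate subset accuracy

        AttributeSelection sel = new AttributeSelection();
        sel.setEvaluator(wrapper);
        sel.setSearch(new BestFirst());   // the default search over subsets
        sel.SelectAttributes(data);

        System.out.println(sel.toResultsString());
    }
}
```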
Search Method
BestFirst: the default search method. It searches the space of attribute subsets by greedy hill-climbing augmented with a backtracking facility. BestFirst may start with the empty set of attributes and search forward (the default), start with the full set and search backward, or start at any point and search in both directions (considering all single-attribute additions and deletions at a given point).
Other options include:
• GreedyStepwise;
• EvolutionarySearch;
• ExhaustiveSearch;
• LinearForwardSearch;
• GeneticSearch (could take hours).
Feature ranking
Forward feature selection
Backward feature elimination
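Feature ranking was sketched above with Ranker; the other two strategies map onto Weka’s GreedyStepwise search, as in this sketch (placeholder file name again):

```java
// Sketch: forward feature selection vs. backward feature elimination
// with GreedyStepwise; searchBackwards=false grows from the empty set,
// true shrinks from the full set.
import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.CfsSubsetEval;
import weka.attributeSelection.GreedyStepwise;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class ForwardBackward {
    static void run(Instances data, boolean backwards) throws Exception {
        GreedyStepwise search = new GreedyStepwise();
        search.setSearchBackwards(backwards);  // direction of the greedy search

        AttributeSelection sel = new AttributeSelection();
        sel.setEvaluator(new CfsSubsetEval());
        sel.setSearch(search);
        sel.SelectAttributes(data);
        System.out.println((backwards ? "backward" : "forward")
                + ":\n" + sel.toResultsString());
    }

    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("train.arff");
        data.setClassIndex(data.numAttributes() - 1);
        run(data, false);  // forward feature selection
        run(data, true);   // backward feature elimination
    }
}
```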
10 Minute Break…
Practical SVM
Goal: find the discriminator that maximizes the margin.
SMO and its complexity parameter ("-C")
• Load your dataset in the Explorer;
• choose weka.classifiers.meta.CVParameterSelection as the classifier;
• select weka.classifiers.functions.SMO as the base classifier within CVParameterSelection and modify its setup if necessary, e.g., an RBF kernel;
• open the ArrayEditor for CVParameters and enter the following string (and click Add):
C 2 8 4
This will test the complexity parameters 2, 4, 6 and 8 (= 4 steps);
• close the dialogs and start the classifier;
• the output will report the best parameters found.
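The same sweep can be scripted against the Java API; a minimal sketch with a placeholder file name:

```java
// Sketch: CVParameterSelection around SMO from the Java API.
// The string "C 2 8 4" sweeps SMO's -C over 2, 4, 6, 8 via internal CV.
import weka.classifiers.functions.SMO;
import weka.classifiers.meta.CVParameterSelection;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class TuneSMO {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("train.arff");
        data.setClassIndex(data.numAttributes() - 1);

        CVParameterSelection tuner = new CVParameterSelection();
        tuner.setClassifier(new SMO());   // optionally configure an RBF kernel here
        tuner.addCVParameter("C 2 8 4");  // min 2, max 8, 4 steps
        tuner.buildClassifier(data);      // picks the best -C by cross-validation

        System.out.println(tuner.toSummaryString());
    }
}
```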
LibSVM
• Load your dataset in the Explorer;
• choose weka.classifiers.meta.CVParameterSelection as the classifier;
• select weka.classifiers.functions.LibSVM as the base classifier within CVParameterSelection and modify its setup if necessary, e.g., an RBF kernel;
• open the ArrayEditor for CVParameters and enter the following string (and click Add):
G 0.01 0.1 10
This will test gamma values from 0.01 to 0.1 in 10 steps.
GridSearch
weka.classifiers.meta.GridSearch is a meta-classifier for exploring 2 parameters, hence the grid in the name. Instead of just using a classifier, one can specify a base classifier and a filter, both of which can be optimized (one parameter each).
For each of the two axes, X and Y, one can specify the following parameters:
• min, the minimum value to start from;
• max, the maximum value;
• step, the step size used to get from min to max.
GridSearch can also optimize based on the following measures:
• Correlation coefficient (= CC)
• Root mean squared error (= RMSE)
• Root relative squared error (= RRSE)
• Mean absolute error (= MAE)
• Relative absolute error (= RAE)
• Combined: (1-abs(CC)) + RRSE + RAE
• Accuracy (= ACC)
Missing Values
(revisited)
[Figure: a data table of instances × attributes, with “?” marking missing values.]
Missing values: in the UCI machine learning repository, 31 of 68 data sets are reported to have missing values. “Missing” can mean many things…
MAR, "Missing at Random":
– usually the best case;
– usually not true.
Non-randomly missing.
Presumed normal, so not measured.
Causally missing:
– the attribute value is missing because of other attribute values (or because of the outcome value!).
[Figure: an example with 30% missing values, where accuracy ranges from 88% to 95% depending on how the missing values are handled.]
[Figure: imputation on an X–Y plot — for a data point with missing Y, filling in the value from the trend at its x makes sense!]
Imputation with k-Nearest Neighbor
K-means Clustering Imputation
Imputation via Regression/Classification
Weka’s ReplaceMissingValues filter, for example, fills in the missing values for an instance with the expected values (mean for numeric attributes, mode for nominal ones).
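A minimal sketch of that filter from the Java API (placeholder file name):

```java
// Sketch: mean/mode imputation with Weka's ReplaceMissingValues filter.
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.ReplaceMissingValues;

public class Impute {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("train.arff");
        data.setClassIndex(data.numAttributes() - 1);

        ReplaceMissingValues filter = new ReplaceMissingValues();
        filter.setInputFormat(data);                       // learn means/modes from the data
        Instances complete = Filter.useFilter(data, filter);

        System.out.println(complete.toSummaryString());    // no missing values remain
    }
}
```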
This can make a difference!
Ensemble Learning
Review & Out of Class Exercises
Step 1: Create multiple data sets D1, D2, …, Dt-1, Dt from the original training data D;
Step 2: Build multiple classifiers C1, C2, …, Ct-1, Ct, one per data set;
Step 3: Combine the classifiers into a single ensemble classifier C*.
Why does it work?
Suppose there are 25 independent base classifiers, each with error rate $\varepsilon = 0.35$. The majority-vote ensemble errs only if at least 13 of them err:

$\sum_{i=13}^{25} \binom{25}{i}\,\varepsilon^{i}(1-\varepsilon)^{25-i} \approx 0.06$
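A quick numeric check of that sum, using the values above:

```java
// Sketch: numerically checking the majority-vote error
// (25 independent base classifiers, each with error rate 0.35).
public class EnsembleError {
    public static void main(String[] args) {
        int n = 25;
        double eps = 0.35, error = 0.0;
        for (int i = 13; i <= n; i++) {       // ensemble errs iff >= 13 members err
            double c = 1.0;
            for (int k = 0; k < i; k++)       // binomial coefficient C(n, i)
                c = c * (n - k) / (k + 1);
            error += c * Math.pow(eps, i) * Math.pow(1 - eps, n - i);
        }
        System.out.printf("ensemble error = %.3f%n", error);  // prints ~0.060
    }
}
```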
Ensemble vs. Base Classifier Error
As long as base classifier is better than random (error < 0.5),
ensemble will be superior to base classifier
Bagging, Boosting, and DECORATE are meta-learners: each wraps a base learner to build the ensemble.
• Bagging
• Boosting
• DECORATE
[Diagram: a training set — a compounds/descriptor matrix C1…Cn × D1…Dm — is perturbed into Matrix 1, Matrix 2, Matrix 3, …; a learning algorithm builds a model (M1, M2, …, Me) from each perturbed set, and the models are combined into an ensemble consensus model.]
Mixture of Experts

$y = \sum_{j=1}^{L} w_j d_j$

(the output y is a weighted combination of the L experts’ predictions d_j)
Stacking
Cascading
Bagging
Leo Breiman
(1928-2005)
Leo Breiman (1996). Bagging predictors. Machine Learning. 24(2):123-140.
Bagging = Bootstrap Aggregation
A bootstrap sample Si is drawn from training set S (compounds C1…Cn over descriptors D1…Dm):
• All compounds have the same probability of being selected;
• each compound can be selected several times or not at all (i.e., compounds are sampled randomly with replacement).
Efron, B., & Tibshirani, R. J. (1993). An Introduction to the Bootstrap. New York: Chapman & Hall.
Bagging — bootstrap rounds over training data IDs:

Original Data      1  2  3  4  5  6  7  8  9  10
Bagging (Round 1)  7  8  10 8  2  5  10 10 5  9
Bagging (Round 2)  1  4  9  1  2  3  2  7  3  2
Bagging (Round 3)  1  8  5  10 5  5  9  6  3  7
The 0.632 bootstrap
• A particular training instance has a probability of 1 − 1/n of not being picked in each draw;
• thus its probability of ending up in the test data (never selected) is:

$(1 - \tfrac{1}{n})^{n} \approx e^{-1} \approx 0.368$

This means the training data will contain approximately 63.2% of the distinct instances.
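A small simulation confirming the 63.2% figure empirically:

```java
// Sketch: a bootstrap sample of size n contains ~63.2% of the distinct instances.
import java.util.HashSet;
import java.util.Random;
import java.util.Set;

public class Bootstrap632 {
    public static void main(String[] args) {
        int n = 100000;
        Random rng = new Random(1);
        Set<Integer> picked = new HashSet<>();
        for (int i = 0; i < n; i++)
            picked.add(rng.nextInt(n));  // sample n times with replacement
        System.out.printf("fraction of distinct instances picked: %.3f%n",
                picked.size() / (double) n);  // prints ~0.632
    }
}
```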
Bagging
[Diagram: bootstrap samples S1, S2, …, Se — data with perturbed sets of compounds (e.g., C4, C2, C8, …) — each feed a learning algorithm, producing models M1, M2, …, Me; the ensemble consensus model combines them by voting (classification) or averaging (regression).]
Classification – Files
• train-ache.sdf / test-ache.sdf
• train-ache-t3ABl2u3.arff / test-ache-t3ABl2u3.arff
• ache-t3ABl2u3.hdr
Exercise 1
Development of one individual rules-based model (JRip method in WEKA).
Load train-ache-t3ABl2u3.arff
Load test-ache-t3ABl2u3.arff
Set up one JRip model.
187. (C*C),(C*C*C),(C*C-C),(C*N),(C*N*C),(C-C),(C-C-C),xC*
81. (C-N),(C-N-C),(C-N-C),(C-N-C),xC
12. (C*C),(C*C),(C*C*C),(C*C*C),(C*C*N),xC
What happens if we randomize the data and rebuild a JRip model?
Changing the data ordering changes the induced rules.
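A sketch that reproduces this instability from the Java API — three shuffles of the training data, three (typically different) rule sets:

```java
// Sketch: demonstrating JRip's instability by shuffling the training data
// and rebuilding; the induced rule set usually changes with the ordering.
import java.util.Random;
import weka.classifiers.rules.JRip;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class JRipInstability {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("train-ache-t3ABl2u3.arff");
        data.setClassIndex(data.numAttributes() - 1);

        for (int seed = 1; seed <= 3; seed++) {
            Instances shuffled = new Instances(data);  // copy, then reorder
            shuffled.randomize(new Random(seed));
            JRip rules = new JRip();
            rules.buildClassifier(shuffled);
            System.out.println("=== seed " + seed + " ===\n" + rules);
        }
    }
}
```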
Exercise 3a: Bagging
Reinitialize the dataset. In the Classify tab, choose the meta classifier Bagging.
Set the base classifier to JRip and build an ensemble of 1 model.
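The same exercise scripted against the Java API; a sketch using the exercise’s file names:

```java
// Sketch: bagged JRip evaluated on the held-out test set.
// Vary setNumIterations (1, 3, 8, ...) to reproduce the AUC curve below.
import weka.classifiers.Evaluation;
import weka.classifiers.meta.Bagging;
import weka.classifiers.rules.JRip;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class BagJRip {
    public static void main(String[] args) throws Exception {
        Instances train = DataSource.read("train-ache-t3ABl2u3.arff");
        Instances test  = DataSource.read("test-ache-t3ABl2u3.arff");
        train.setClassIndex(train.numAttributes() - 1);
        test.setClassIndex(test.numAttributes() - 1);

        Bagging bagger = new Bagging();
        bagger.setClassifier(new JRip());  // base classifier
        bagger.setNumIterations(1);        // size of the ensemble
        bagger.buildClassifier(train);

        Evaluation eval = new Evaluation(train);
        eval.evaluateModel(bagger, test);
        System.out.printf("ROC AUC = %.3f%n", eval.areaUnderROC(0));
    }
}
```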
JRipBag1.out
JRipBag3.out
JRipBag8.out
[Figure: ROC AUC of the consensus model as a function of the number of bagging iterations (classification, AChE); AUC axis 0.74–0.88, iterations 0–10.]
Boosting works by training a set of classifiers sequentially and combining them for prediction, where each later classifier focuses on the mistakes of the earlier ones.
Yoav Freund, Robert Schapire, Jerome Friedman
Yoav Freund, Robert E. Schapire (1996). Experiments with a new boosting algorithm. In: Thirteenth International Conference on Machine Learning, San Francisco, 148-156.
J.H. Friedman (1999). Stochastic Gradient Boosting. Computational Statistics and Data Analysis. 38:367-378.
AdaBoost (classification); stochastic gradient boosting (regression).
[Diagram: boosting. Starting from the training set C1…Cn with uniform instance weights w, each round draws a weighted sample (S1, S2, …, Se), fits a model (M1, M2, …, Mb), measures the errors e, and increases the weights of misclassified compounds; the ensemble consensus model combines the models by weighted averaging and thresholding.]
Load train-ache-t3ABl2u3.arff; in the Classify tab, load test-ache-t3ABl2u3.arff as the supplied test set.
Choose the meta classifier AdaBoostM1 and set up an ensemble of one JRip model.
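A matching sketch for boosting; only the meta classifier changes relative to the bagging version:

```java
// Sketch: boosting JRip with AdaBoostM1, matching the GUI setup.
import weka.classifiers.Evaluation;
import weka.classifiers.meta.AdaBoostM1;
import weka.classifiers.rules.JRip;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class BoostJRip {
    public static void main(String[] args) throws Exception {
        Instances train = DataSource.read("train-ache-t3ABl2u3.arff");
        Instances test  = DataSource.read("test-ache-t3ABl2u3.arff");
        train.setClassIndex(train.numAttributes() - 1);
        test.setClassIndex(test.numAttributes() - 1);

        AdaBoostM1 booster = new AdaBoostM1();
        booster.setClassifier(new JRip());
        booster.setNumIterations(1);       // try 1, 3, 8 as in the exercise
        booster.buildClassifier(train);

        Evaluation eval = new Evaluation(train);
        eval.evaluateModel(booster, test);
        System.out.printf("ROC AUC = %.3f%n", eval.areaUnderROC(0));
    }
}
```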
JRipBoost1.out
JRipBoost3.out JRipBoost8.out
[Figure: ROC AUC as a function of the log(number of boosting iterations) (classification, AChE); AUC axis 0.76–0.83.]
[Figure: accuracy (0.7–1.0) of bagging vs. boosting as the number of ensemble iterations grows (log scale). Left panel: base learner DecisionStump (1–1000 iterations). Right panel: base learner JRip (1–100 iterations).]
Conjecture: Bagging vs. Boosting
Bagging leverages unstable base learners that are weak because of overfitting (JRip, MLR).
Boosting leverages stable base learners that are weak because of underfitting (DecisionStump, SLR).
Random Subspace Method
Tin Kam Ho
Tin Kam Ho (1998). The Random Subspace Method for Constructing Decision Forests. IEEE Transactions on
Pattern Analysis and Machine Intelligence. 20(8):832-844.
From the training set with the initial pool of descriptors (D1 D2 D3 D4 … Dm), build training sets with randomly selected descriptors (e.g., D3 D2 Dm D4):
• All descriptors have the same probability of being selected;
• each descriptor can be selected only once;
• only a certain fraction of the descriptors is selected in each run.
Random Subspace Method
[Diagram: data sets S1, S2, …, Se with randomly selected descriptors (e.g., D4 D2 D3; D1 D2 D3; D4 D2 D1) each feed a learning algorithm, producing models M1, M2, …, Me; the ensemble consensus model combines them by voting (classification) or averaging (regression).]
Load train-logs-t1ABl2u4.arff; in the Classify tab, load test-logs-t1ABl2u4.arff as the supplied test set.
Choose the meta method RandomSubSpace.
Base classifier: Multi-Linear Regression without descriptor selection.
Build an ensemble of 1 model… then build an ensemble of 10 models.
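A sketch of this exercise from the Java API; turning off LinearRegression’s built-in attribute selection matches “without descriptor selection”:

```java
// Sketch: RandomSubSpace over multi-linear regression, evaluated on the
// held-out LogS test set. Compare numIterations = 1 vs. 10.
import weka.classifiers.Evaluation;
import weka.classifiers.functions.LinearRegression;
import weka.classifiers.meta.RandomSubSpace;
import weka.core.Instances;
import weka.core.SelectedTag;
import weka.core.converters.ConverterUtils.DataSource;

public class SubspaceMLR {
    public static void main(String[] args) throws Exception {
        Instances train = DataSource.read("train-logs-t1ABl2u4.arff");
        Instances test  = DataSource.read("test-logs-t1ABl2u4.arff");
        train.setClassIndex(train.numAttributes() - 1);
        test.setClassIndex(test.numAttributes() - 1);

        LinearRegression mlr = new LinearRegression();
        mlr.setAttributeSelectionMethod(new SelectedTag(   // no descriptor selection
                LinearRegression.SELECTION_NONE, LinearRegression.TAGS_SELECTION));

        RandomSubSpace ensemble = new RandomSubSpace();
        ensemble.setClassifier(mlr);
        ensemble.setNumIterations(10);   // 1 model vs. 10 models
        ensemble.buildClassifier(train);

        Evaluation eval = new Evaluation(train);
        eval.evaluateModel(ensemble, test);
        System.out.printf("R = %.4f, RMSE = %.4f%n",
                eval.correlationCoefficient(), eval.rootMeanSquaredError());
    }
}
```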
[Screenshots: results for the ensemble of 1 model vs. 10 models.]
Random Forest
random tree
Leo Breiman
(1928-2005)
Leo Breiman (2001). Random Forests. Machine Learning. 45(1):5-32.
Random Forest = Bagging + Random Subspace
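For completeness, a sketch running Weka’s RandomForest with default settings under 10-fold cross-validation (the option name for the number of trees varies across Weka versions, so the sketch keeps the defaults):

```java
// Sketch: Random Forest = bagged random trees with per-split attribute sampling.
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.RandomForest;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class Forest {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("train-ache-t3ABl2u3.arff");
        data.setClassIndex(data.numAttributes() - 1);

        RandomForest rf = new RandomForest();  // defaults; tree-count option is version-specific
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(rf, data, 10, new Random(1));
        System.out.printf("ROC AUC (10-fold CV) = %.3f%n", eval.areaUnderROC(0));
    }
}
```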
David H. Wolpert
Wolpert, D. (1992). Stacked Generalization. Neural Networks, 5(2), 241-259.
Breiman, L. (1996). Stacked Regressions. Machine Learning, 24, 49-64.
[Diagram: stacking. The same data set S (C1…Cn × D1…Dm) is given to different learning algorithms L1, L2, …, Le, producing models M1, M2, …, Me; a machine-learning meta-method (e.g., MLR) combines their predictions into the ensemble consensus model.]
Choose the meta method Stacking.
• Delete the classifier ZeroR;
• add the PLS classifier (default parameters);
• add the regression tree M5P (default parameters);
• add Multi-Linear Regression without descriptor selection.
Select Multi-Linear Regression as the meta-method.
Exercise 5
Rebuild the stacked model using (see the sketch below):
• kNN (default parameters);
• Multi-Linear Regression without descriptor selection;
• PLS classifier (default parameters);
• regression tree M5P.
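A sketch of this stack from the Java API. Note the PLS classifier may ship as a separate Weka package, so the sketch stacks the three core learners (IBk as kNN, MLR, M5P) under an MLR meta-model:

```java
// Sketch: stacking kNN, MLR, and M5P with MLR as the meta-method.
import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.functions.LinearRegression;
import weka.classifiers.lazy.IBk;
import weka.classifiers.meta.Stacking;
import weka.classifiers.trees.M5P;
import weka.core.Instances;
import weka.core.SelectedTag;
import weka.core.converters.ConverterUtils.DataSource;

public class StackLogS {
    public static void main(String[] args) throws Exception {
        Instances train = DataSource.read("train-logs-t1ABl2u4.arff");
        Instances test  = DataSource.read("test-logs-t1ABl2u4.arff");
        train.setClassIndex(train.numAttributes() - 1);
        test.setClassIndex(test.numAttributes() - 1);

        LinearRegression mlr = new LinearRegression();
        mlr.setAttributeSelectionMethod(new SelectedTag(   // no descriptor selection
                LinearRegression.SELECTION_NONE, LinearRegression.TAGS_SELECTION));

        Stacking stack = new Stacking();
        stack.setClassifiers(new Classifier[] { new IBk(), mlr, new M5P() });
        stack.setMetaClassifier(new LinearRegression());   // MLR as the meta-method
        stack.buildClassifier(train);

        Evaluation eval = new Evaluation(train);
        eval.evaluateModel(stack, test);
        System.out.printf("R = %.4f, RMSE = %.4f%n",
                eval.correlationCoefficient(), eval.rootMeanSquaredError());
    }
}
```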
Exercise 5 – Stacking: regression models for LogS

Learning algorithm                R (correlation coeff.)   RMSE
MLR                               0.8910                   1.0068
PLS                               0.9171                   0.8518
M5P (regression trees)            0.9176                   0.8461
1-NN (one nearest neighbour)      0.8455                   1.1889
Stacking of MLR, PLS, M5P         0.9366                   0.7460
Stacking of MLR, PLS, M5P, 1-NN   0.9392                   0.7301
That’s all for tonight….