Deriving Knowledge from Data at Scale
Lecture 8 Agenda
Opening Discussion (45 min)
• Course Project Check-In
• Thought Exercise
Data Transformation (60 min)
• Attribute Selection
• SVM Considerations
Ensembling: review and deeper dive (60 min)
Being proficient at data science requires intuition and judgement.
Intuition and judgement come with experience.
There is no compression algorithm for experience…
But you can hack this…
Three Steps (every 3–4 months)
1. Become proficient in using one tool;
2. Select one algorithm for a deep dive;
3. Focus on one data type.
Hands-on practice…
What tools to use?
• Weka – explorer…
• KNIME – experimentation…
Get proficient in at least two (2) tools…
http://www.slideshare.net/DataRobot/final-10-r-xc-36610234
This is how you learn…
Performing Experiments
Copy on Catalyst…
Course Project
What is the data telling you?
Attribute Selection
(feature selection)
Problem: Where to focus attention?
What is Evaluated?

Evaluation Method     Attributes   Subsets of Attributes
Independent           Filters      Filters
Learning Algorithm    –            Wrappers
Tab for selecting attributes in a data set…
Interface for classes that evaluate attributes…
Interface for ranking or searching for a subset of attributes…
Select CorrelationAttributeEval for Pearson correlation…
False: doesn’t return the R scores; True: returns the R scores.
Ranker ranks attributes by their individual evaluations; it is used in conjunction with GainRatio, Entropy, Pearson, etc.
• Number of attributes to return; -1 returns all ranked attributes.
• Attributes to ignore (skip) in the evaluation, format: [1, 3-5, 10].
• Cutoff below which attributes are discarded; -1 means no cutoff.
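The same ranking can be driven from Weka’s Java API. A minimal sketch — the file name train.arff is a placeholder for your own ARFF file:

```java
// Sketch: ranking attributes by Pearson correlation with Weka's Java API.
// Assumes "train.arff" (placeholder) with the class as the last attribute.
import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.CorrelationAttributeEval;
import weka.attributeSelection.Ranker;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class RankByCorrelation {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("train.arff");
        data.setClassIndex(data.numAttributes() - 1);

        Ranker ranker = new Ranker();
        ranker.setNumToSelect(-1);                        // -1 = return all ranked attributes

        AttributeSelection sel = new AttributeSelection();
        sel.setEvaluator(new CorrelationAttributeEval()); // Pearson's R against the class
        sel.setSearch(ranker);
        sel.SelectAttributes(data);

        System.out.println(sel.toResultsString());        // ranked list with R scores
    }
}
```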
Next, evaluating subsets of attributes with an independent method — filters such as CfsSubsetEval…
CfsSubsetEval
True: adds features that are correlated with the class and NOT intercorrelated with features already in the selection; False: eliminates redundant features.
Precompute the correlation matrix in advance (useful for fast backtracking) or compute it lazily; when given a large number of attributes, compute lazily…
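A minimal Java sketch of the same subset selection, assuming a placeholder train.arff and the default BestFirst search (described under wrappers below):

```java
// Sketch: correlation-based feature subset selection (CfsSubsetEval + BestFirst),
// mirroring the GUI settings above. "train.arff" is a placeholder file name.
import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.BestFirst;
import weka.attributeSelection.CfsSubsetEval;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CfsSelect {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("train.arff");
        data.setClassIndex(data.numAttributes() - 1);

        AttributeSelection sel = new AttributeSelection();
        sel.setEvaluator(new CfsSubsetEval()); // favors class-correlated, non-redundant features
        sel.setSearch(new BestFirst());        // greedy hill-climbing with backtracking
        sel.SelectAttributes(data);

        for (int idx : sel.selectedAttributes())  // note: includes the class index last
            System.out.println(data.attribute(idx).name());
    }
}
```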
Next, evaluating subsets of attributes with a learning algorithm — wrappers.
The wrapper loop:
1. Select a subset of attributes;
2. Induce the learning algorithm on this subset;
3. Evaluate the resulting model (e.g., accuracy);
4. Stop? If yes, done; if no, return to step 1.
WrapperSubsetEval
• Select and configure the ML algorithm to wrap…
• Evaluation measure: accuracy (default for discrete classes), RMSE (default for numeric), AUC, AUPRC, F-measure (discrete class);
• Number of cross-validation folds used to estimate subset accuracy.
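A sketch of the wrapper from the Java API. J48 is only a stand-in base learner and train.arff a placeholder; any classifier can be wrapped:

```java
// Sketch: a wrapper around J48, scoring candidate subsets by 5-fold CV accuracy.
import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.BestFirst;
import weka.attributeSelection.WrapperSubsetEval;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class WrapperSelect {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("train.arff");
        data.setClassIndex(data.numAttributes() - 1);

        WrapperSubsetEval wrapper = new WrapperSubsetEval();
        wrapper.setClassifier(new J48()); // the learning algorithm inside the loop
        wrapper.setFolds(5);              // folds used to estimate subset accuracy

        AttributeSelection sel = new AttributeSelection();
        sel.setEvaluator(wrapper);
        sel.setSearch(new BestFirst());   // the default search over subsets
        sel.SelectAttributes(data);

        System.out.println(sel.toResultsString());
    }
}
```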
Search Method
BestFirst: the default search method. It searches the space of attribute subsets by greedy hill-climbing augmented with a backtracking facility. BestFirst may start with the empty set of attributes and search forward (the default), start with the full set and search backward, or start at any point and search in both directions (considering all single-attribute additions and deletions at a given point).
Other options include:
• GreedyStepwise;
• EvolutionarySearch;
• ExhaustiveSearch;
• LinearForwardSearch;
• GeneticSearch (could take hours).
Feature ranking
Forward feature selection
Backward feature elimination
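Feature ranking was sketched above with Ranker; the other two strategies map onto Weka’s GreedyStepwise search, as in this sketch (placeholder file name again):

```java
// Sketch: forward feature selection vs. backward feature elimination
// with GreedyStepwise; searchBackwards=false grows from the empty set,
// true shrinks from the full set.
import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.CfsSubsetEval;
import weka.attributeSelection.GreedyStepwise;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class ForwardBackward {
    static void run(Instances data, boolean backwards) throws Exception {
        GreedyStepwise search = new GreedyStepwise();
        search.setSearchBackwards(backwards);  // direction of the greedy search

        AttributeSelection sel = new AttributeSelection();
        sel.setEvaluator(new CfsSubsetEval());
        sel.setSearch(search);
        sel.SelectAttributes(data);
        System.out.println((backwards ? "backward" : "forward")
                + ":\n" + sel.toResultsString());
    }

    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("train.arff");
        data.setClassIndex(data.numAttributes() - 1);
        run(data, false);  // forward feature selection
        run(data, true);   // backward feature elimination
    }
}
```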
10 Minute Break…
Practical SVM
Goal: find the discriminator that maximizes the margin.
SMO and its complexity parameter ("-C")
• Load your dataset in the Explorer;
• choose weka.classifiers.meta.CVParameterSelection as the classifier;
• select weka.classifiers.functions.SMO as the base classifier within CVParameterSelection and modify its setup if necessary, e.g., an RBF kernel;
• open the ArrayEditor for CVParameters and enter the following string (and click Add):
C 2 8 4
This will test the complexity parameters 2, 4, 6 and 8 (= 4 steps);
• close the dialogs and start the classifier;
• the output will report the best parameters found.
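The same sweep can be scripted against the Java API; a minimal sketch with a placeholder file name:

```java
// Sketch: CVParameterSelection around SMO from the Java API.
// The string "C 2 8 4" sweeps SMO's -C over 2, 4, 6, 8 via internal CV.
import weka.classifiers.functions.SMO;
import weka.classifiers.meta.CVParameterSelection;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class TuneSMO {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("train.arff");
        data.setClassIndex(data.numAttributes() - 1);

        CVParameterSelection tuner = new CVParameterSelection();
        tuner.setClassifier(new SMO());   // optionally configure an RBF kernel here
        tuner.addCVParameter("C 2 8 4");  // min 2, max 8, 4 steps
        tuner.buildClassifier(data);      // picks the best -C by cross-validation

        System.out.println(tuner.toSummaryString());
    }
}
```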
LibSVM
• Load your dataset in the Explorer;
• choose weka.classifiers.meta.CVParameterSelection as the classifier;
• select weka.classifiers.functions.LibSVM as the base classifier within CVParameterSelection and modify its setup if necessary, e.g., an RBF kernel;
• open the ArrayEditor for CVParameters and enter the following string (and click Add):
G 0.01 0.1 10
This will test gamma values from 0.01 to 0.1 in 10 steps.
GridSearch
weka.classifiers.meta.GridSearch is a meta-classifier for exploring 2 parameters, hence the grid in the name. Instead of just using a classifier, one can specify a base classifier and a filter, both of which can be optimized (one parameter each).
For each of the two axes, X and Y, one can specify the following parameters:
• min, the minimum value to start from;
• max, the maximum value;
• step, the step size used to get from min to max.
GridSearch can also optimize based on the following measures:
• Correlation coefficient (= CC)
• Root mean squared error (= RMSE)
• Root relative squared error (= RRSE)
• Mean absolute error (= MAE)
• Relative absolute error (= RAE)
• Combined: (1-abs(CC)) + RRSE + RAE
• Accuracy (= ACC)
Missing Values
(revisited)
[Figure: a data table of instances × attributes, with “?” marking missing values.]
Missing values: in the UCI machine learning repository, 31 of 68 data sets are reported to have missing values. “Missing” can mean many things…
MAR, "Missing at Random":
– usually the best case;
– usually not true.
Non-randomly missing.
Presumed normal, so not measured.
Causally missing:
– the attribute value is missing because of other attribute values (or because of the outcome value!).
[Figure: an example with 30% missing values, where accuracy ranges from 88% to 95% depending on how the missing values are handled.]
[Figure: imputation on an X–Y plot — for a data point with missing Y, filling in the value from the trend at its x makes sense!]
Imputation with k-Nearest Neighbor
K-means Clustering Imputation
Imputation via Regression/Classification
Weka’s ReplaceMissingValues filter, for example, fills in the missing values for an instance with the expected values (mean for numeric attributes, mode for nominal ones).
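A minimal sketch of that filter from the Java API (placeholder file name):

```java
// Sketch: mean/mode imputation with Weka's ReplaceMissingValues filter.
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.ReplaceMissingValues;

public class Impute {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("train.arff");
        data.setClassIndex(data.numAttributes() - 1);

        ReplaceMissingValues filter = new ReplaceMissingValues();
        filter.setInputFormat(data);                       // learn means/modes from the data
        Instances complete = Filter.useFilter(data, filter);

        System.out.println(complete.toSummaryString());    // no missing values remain
    }
}
```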
This can make a difference!
Ensemble Learning
Review & Out of Class Exercises
Step 1: Create multiple data sets D1, D2, …, Dt-1, Dt from the original training data D;
Step 2: Build multiple classifiers C1, C2, …, Ct-1, Ct, one per data set;
Step 3: Combine the classifiers into a single ensemble classifier C*.
Why does it work?
Suppose there are 25 independent base classifiers, each with error rate $\varepsilon = 0.35$. The majority-vote ensemble errs only if at least 13 of them err:

$\sum_{i=13}^{25} \binom{25}{i}\,\varepsilon^{i}(1-\varepsilon)^{25-i} \approx 0.06$
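A quick numeric check of that sum, using the values above:

```java
// Sketch: numerically checking the majority-vote error
// (25 independent base classifiers, each with error rate 0.35).
public class EnsembleError {
    public static void main(String[] args) {
        int n = 25;
        double eps = 0.35, error = 0.0;
        for (int i = 13; i <= n; i++) {       // ensemble errs iff >= 13 members err
            double c = 1.0;
            for (int k = 0; k < i; k++)       // binomial coefficient C(n, i)
                c = c * (n - k) / (k + 1);
            error += c * Math.pow(eps, i) * Math.pow(1 - eps, n - i);
        }
        System.out.printf("ensemble error = %.3f%n", error);  // prints ~0.060
    }
}
```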
Ensemble vs. Base Classifier Error
As long as base classifier is better than random (error < 0.5),
ensemble will be superior to base classifier
Bagging, Boosting, and DECORATE are meta-learners: each wraps a base learner to build the ensemble.
• Bagging
• Boosting
• DECORATE
[Diagram: a training set — a compounds/descriptor matrix C1…Cn × D1…Dm — is perturbed into Matrix 1, Matrix 2, Matrix 3, …; a learning algorithm builds a model (M1, M2, …, Me) from each perturbed set, and the models are combined into an ensemble consensus model.]
Mixture of Experts

$y = \sum_{j=1}^{L} w_j d_j$

(the output y is a weighted combination of the L experts’ predictions d_j)
Stacking
Cascading
Bagging
Leo Breiman
(1928-2005)
Leo Breiman (1996). Bagging predictors. Machine Learning. 24(2):123-140.
Bagging = Bootstrap Aggregation
A bootstrap sample Si is drawn from training set S (compounds C1…Cn over descriptors D1…Dm):
• All compounds have the same probability of being selected;
• each compound can be selected several times or not at all (i.e., compounds are sampled randomly with replacement).
Efron, B., & Tibshirani, R. J. (1993). An Introduction to the Bootstrap. New York: Chapman & Hall.
Bagging — bootstrap rounds over training data IDs:

Original Data      1  2  3  4  5  6  7  8  9  10
Bagging (Round 1)  7  8  10 8  2  5  10 10 5  9
Bagging (Round 2)  1  4  9  1  2  3  2  7  3  2
Bagging (Round 3)  1  8  5  10 5  5  9  6  3  7
The 0.632 bootstrap
• A particular training instance has a probability of 1 − 1/n of not being picked in each draw;
• thus its probability of ending up in the test data (never selected) is:

$(1 - \tfrac{1}{n})^{n} \approx e^{-1} \approx 0.368$

This means the training data will contain approximately 63.2% of the distinct instances.
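A small simulation confirming the 63.2% figure empirically:

```java
// Sketch: a bootstrap sample of size n contains ~63.2% of the distinct instances.
import java.util.HashSet;
import java.util.Random;
import java.util.Set;

public class Bootstrap632 {
    public static void main(String[] args) {
        int n = 100000;
        Random rng = new Random(1);
        Set<Integer> picked = new HashSet<>();
        for (int i = 0; i < n; i++)
            picked.add(rng.nextInt(n));  // sample n times with replacement
        System.out.printf("fraction of distinct instances picked: %.3f%n",
                picked.size() / (double) n);  // prints ~0.632
    }
}
```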
Bagging
[Diagram: bootstrap samples S1, S2, …, Se — data with perturbed sets of compounds (e.g., C4, C2, C8, …) — each feed a learning algorithm, producing models M1, M2, …, Me; the ensemble consensus model combines them by voting (classification) or averaging (regression).]
Classification – Files
• train-ache.sdf / test-ache.sdf
• train-ache-t3ABl2u3.arff / test-ache-t3ABl2u3.arff
• ache-t3ABl2u3.hdr
Exercise 1
Development of one individual rules-based model (JRip method in WEKA).
Load train-ache-t3ABl2u3.arff
Load test-ache-t3ABl2u3.arff
Set up one JRip model.
187. (C*C),(C*C*C),(C*C-C),(C*N),(C*N*C),(C-C),(C-C-C),xC*
81. (C-N),(C-N-C),(C-N-C),(C-N-C),xC
12. (C*C),(C*C),(C*C*C),(C*C*C),(C*C*N),xC
What happens if we randomize the data and rebuild a JRip model?
Changing the data ordering changes the induced rules.
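A sketch that reproduces this instability from the Java API — three shuffles of the training data, three (typically different) rule sets:

```java
// Sketch: demonstrating JRip's instability by shuffling the training data
// and rebuilding; the induced rule set usually changes with the ordering.
import java.util.Random;
import weka.classifiers.rules.JRip;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class JRipInstability {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("train-ache-t3ABl2u3.arff");
        data.setClassIndex(data.numAttributes() - 1);

        for (int seed = 1; seed <= 3; seed++) {
            Instances shuffled = new Instances(data);  // copy, then reorder
            shuffled.randomize(new Random(seed));
            JRip rules = new JRip();
            rules.buildClassifier(shuffled);
            System.out.println("=== seed " + seed + " ===\n" + rules);
        }
    }
}
```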
Exercise 3a: Bagging
Reinitialize the dataset. In the Classify tab, choose the meta classifier Bagging.
Set the base classifier to JRip and build an ensemble of 1 model.
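The same exercise scripted against the Java API; a sketch using the exercise’s file names:

```java
// Sketch: bagged JRip evaluated on the held-out test set.
// Vary setNumIterations (1, 3, 8, ...) to reproduce the AUC curve below.
import weka.classifiers.Evaluation;
import weka.classifiers.meta.Bagging;
import weka.classifiers.rules.JRip;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class BagJRip {
    public static void main(String[] args) throws Exception {
        Instances train = DataSource.read("train-ache-t3ABl2u3.arff");
        Instances test  = DataSource.read("test-ache-t3ABl2u3.arff");
        train.setClassIndex(train.numAttributes() - 1);
        test.setClassIndex(test.numAttributes() - 1);

        Bagging bagger = new Bagging();
        bagger.setClassifier(new JRip());  // base classifier
        bagger.setNumIterations(1);        // size of the ensemble
        bagger.buildClassifier(train);

        Evaluation eval = new Evaluation(train);
        eval.evaluateModel(bagger, test);
        System.out.printf("ROC AUC = %.3f%n", eval.areaUnderROC(0));
    }
}
```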
JRipBag1.out
JRipBag3.out
JRipBag8.out
[Figure: ROC AUC of the consensus model as a function of the number of bagging iterations (classification, AChE); AUC axis 0.74–0.88, iterations 0–10.]
Boosting works by training a set of classifiers sequentially and combining them for prediction, where each later classifier focuses on the mistakes of the earlier ones.
Yoav Freund, Robert Schapire, Jerome Friedman
Yoav Freund, Robert E. Schapire (1996). Experiments with a new boosting algorithm. In: Thirteenth International Conference on Machine Learning, San Francisco, 148-156.
J.H. Friedman (1999). Stochastic Gradient Boosting. Computational Statistics and Data Analysis. 38:367-378.
AdaBoost (classification); stochastic gradient boosting (regression).
[Diagram: boosting. Starting from the training set C1…Cn with uniform instance weights w, each round draws a weighted sample (S1, S2, …, Se), fits a model (M1, M2, …, Mb), measures the errors e, and increases the weights of misclassified compounds; the ensemble consensus model combines the models by weighted averaging and thresholding.]
Load train-ache-t3ABl2u3.arff; in the Classify tab, load test-ache-t3ABl2u3.arff as the supplied test set.
Choose the meta classifier AdaBoostM1 and set up an ensemble of one JRip model.
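A matching sketch for boosting; only the meta classifier changes relative to the bagging version:

```java
// Sketch: boosting JRip with AdaBoostM1, matching the GUI setup.
import weka.classifiers.Evaluation;
import weka.classifiers.meta.AdaBoostM1;
import weka.classifiers.rules.JRip;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class BoostJRip {
    public static void main(String[] args) throws Exception {
        Instances train = DataSource.read("train-ache-t3ABl2u3.arff");
        Instances test  = DataSource.read("test-ache-t3ABl2u3.arff");
        train.setClassIndex(train.numAttributes() - 1);
        test.setClassIndex(test.numAttributes() - 1);

        AdaBoostM1 booster = new AdaBoostM1();
        booster.setClassifier(new JRip());
        booster.setNumIterations(1);       // try 1, 3, 8 as in the exercise
        booster.buildClassifier(train);

        Evaluation eval = new Evaluation(train);
        eval.evaluateModel(booster, test);
        System.out.printf("ROC AUC = %.3f%n", eval.areaUnderROC(0));
    }
}
```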
JRipBoost1.out
JRipBoost3.out JRipBoost8.out
[Figure: ROC AUC as a function of the log(number of boosting iterations) (classification, AChE); AUC axis 0.76–0.83.]
[Figure: accuracy (0.7–1.0) of bagging vs. boosting as the number of ensemble iterations grows (log scale). Left panel: base learner DecisionStump (1–1000 iterations). Right panel: base learner JRip (1–100 iterations).]
Conjecture: Bagging vs. Boosting
Bagging leverages unstable base learners that are weak because of overfitting (JRip, MLR).
Boosting leverages stable base learners that are weak because of underfitting (DecisionStump, SLR).
Random Subspace Method
Tin Kam Ho
Tin Kam Ho (1998). The Random Subspace Method for Constructing Decision Forests. IEEE Transactions on
Pattern Analysis and Machine Intelligence. 20(8):832-844.
From the training set with the initial pool of descriptors (D1 D2 D3 D4 … Dm), build training sets with randomly selected descriptors (e.g., D3 D2 Dm D4):
• All descriptors have the same probability of being selected;
• each descriptor can be selected only once;
• only a certain fraction of the descriptors is selected in each run.
Random Subspace Method
[Diagram: data sets S1, S2, …, Se with randomly selected descriptors (e.g., D4 D2 D3; D1 D2 D3; D4 D2 D1) each feed a learning algorithm, producing models M1, M2, …, Me; the ensemble consensus model combines them by voting (classification) or averaging (regression).]
Load train-logs-t1ABl2u4.arff; in the Classify tab, load test-logs-t1ABl2u4.arff as the supplied test set.
Choose the meta method RandomSubSpace.
Base classifier: Multi-Linear Regression without descriptor selection.
Build an ensemble of 1 model… then build an ensemble of 10 models.
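A sketch of this exercise from the Java API; turning off LinearRegression’s built-in attribute selection matches “without descriptor selection”:

```java
// Sketch: RandomSubSpace over multi-linear regression, evaluated on the
// held-out LogS test set. Compare numIterations = 1 vs. 10.
import weka.classifiers.Evaluation;
import weka.classifiers.functions.LinearRegression;
import weka.classifiers.meta.RandomSubSpace;
import weka.core.Instances;
import weka.core.SelectedTag;
import weka.core.converters.ConverterUtils.DataSource;

public class SubspaceMLR {
    public static void main(String[] args) throws Exception {
        Instances train = DataSource.read("train-logs-t1ABl2u4.arff");
        Instances test  = DataSource.read("test-logs-t1ABl2u4.arff");
        train.setClassIndex(train.numAttributes() - 1);
        test.setClassIndex(test.numAttributes() - 1);

        LinearRegression mlr = new LinearRegression();
        mlr.setAttributeSelectionMethod(new SelectedTag(   // no descriptor selection
                LinearRegression.SELECTION_NONE, LinearRegression.TAGS_SELECTION));

        RandomSubSpace ensemble = new RandomSubSpace();
        ensemble.setClassifier(mlr);
        ensemble.setNumIterations(10);   // 1 model vs. 10 models
        ensemble.buildClassifier(train);

        Evaluation eval = new Evaluation(train);
        eval.evaluateModel(ensemble, test);
        System.out.printf("R = %.4f, RMSE = %.4f%n",
                eval.correlationCoefficient(), eval.rootMeanSquaredError());
    }
}
```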
[Screenshots: results for the ensemble of 1 model vs. 10 models.]
Random Forest
random tree
Leo Breiman
(1928-2005)
Leo Breiman (2001). Random Forests. Machine Learning. 45(1):5-32.
Random Forest = Bagging + Random Subspace
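For completeness, a sketch running Weka’s RandomForest with default settings under 10-fold cross-validation (the option name for the number of trees varies across Weka versions, so the sketch keeps the defaults):

```java
// Sketch: Random Forest = bagged random trees with per-split attribute sampling.
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.RandomForest;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class Forest {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("train-ache-t3ABl2u3.arff");
        data.setClassIndex(data.numAttributes() - 1);

        RandomForest rf = new RandomForest();  // defaults; tree-count option is version-specific
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(rf, data, 10, new Random(1));
        System.out.printf("ROC AUC (10-fold CV) = %.3f%n", eval.areaUnderROC(0));
    }
}
```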
David H. Wolpert
Wolpert, D. (1992). Stacked Generalization. Neural Networks, 5(2), 241-259.
Breiman, L. (1996). Stacked Regressions. Machine Learning, 24, 49-64.
[Diagram: stacking. The same data set S (C1…Cn × D1…Dm) is given to different learning algorithms L1, L2, …, Le, producing models M1, M2, …, Me; a machine-learning meta-method (e.g., MLR) combines their predictions into the ensemble consensus model.]
Choose the meta method Stacking.
• Delete the classifier ZeroR;
• add the PLS classifier (default parameters);
• add the regression tree M5P (default parameters);
• add Multi-Linear Regression without descriptor selection.
Select Multi-Linear Regression as the meta-method.
Exercise 5
Rebuild the stacked model using (see the sketch below):
• kNN (default parameters);
• Multi-Linear Regression without descriptor selection;
• PLS classifier (default parameters);
• regression tree M5P.
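A sketch of this stack from the Java API. Note the PLS classifier may ship as a separate Weka package, so the sketch stacks the three core learners (IBk as kNN, MLR, M5P) under an MLR meta-model:

```java
// Sketch: stacking kNN, MLR, and M5P with MLR as the meta-method.
import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.functions.LinearRegression;
import weka.classifiers.lazy.IBk;
import weka.classifiers.meta.Stacking;
import weka.classifiers.trees.M5P;
import weka.core.Instances;
import weka.core.SelectedTag;
import weka.core.converters.ConverterUtils.DataSource;

public class StackLogS {
    public static void main(String[] args) throws Exception {
        Instances train = DataSource.read("train-logs-t1ABl2u4.arff");
        Instances test  = DataSource.read("test-logs-t1ABl2u4.arff");
        train.setClassIndex(train.numAttributes() - 1);
        test.setClassIndex(test.numAttributes() - 1);

        LinearRegression mlr = new LinearRegression();
        mlr.setAttributeSelectionMethod(new SelectedTag(   // no descriptor selection
                LinearRegression.SELECTION_NONE, LinearRegression.TAGS_SELECTION));

        Stacking stack = new Stacking();
        stack.setClassifiers(new Classifier[] { new IBk(), mlr, new M5P() });
        stack.setMetaClassifier(new LinearRegression());   // MLR as the meta-method
        stack.buildClassifier(train);

        Evaluation eval = new Evaluation(train);
        eval.evaluateModel(stack, test);
        System.out.printf("R = %.4f, RMSE = %.4f%n",
                eval.correlationCoefficient(), eval.rootMeanSquaredError());
    }
}
```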
Exercise 5 – Stacking: regression models for LogS

Learning algorithm                R (correlation coeff.)   RMSE
MLR                               0.8910                   1.0068
PLS                               0.9171                   0.8518
M5P (regression trees)            0.9176                   0.8461
1-NN (one nearest neighbour)      0.8455                   1.1889
Stacking of MLR, PLS, M5P         0.9366                   0.7460
Stacking of MLR, PLS, M5P, 1-NN   0.9392                   0.7301
That’s all for tonight….