4. Approaches for ML
Maximum Likelihood

Input:
features $x_1, x_2, \dots, x_N$ and states $k_1, k_2, \dots, k_N$, where $k \in \{1, 2, \dots, K\}$ is the domain for $k$
$p(x \mid k) = f(x, \theta_k)$; the model is parametrized by $\theta_1, \theta_2, \dots, \theta_K$

Output:
$(\theta_1^*, \theta_2^*, \dots, \theta_K^*) = \arg\max_{\theta_1, \theta_2, \dots, \theta_K} \prod_{i=1}^{N} p(x_i \mid k_i)$
5. Approaches for ML
Maximum Likelihood

$(\theta_1^*, \theta_2^*, \dots, \theta_K^*) = \arg\max_{\theta_1, \theta_2, \dots, \theta_K} \prod_{i=1}^{N} p(x_i \mid k_i) = \arg\max \prod_{i=1}^{N} p(k_i)\, p(x_i \mid k_i)$
(the priors $p(k_i)$ do not depend on $\theta$, so they do not change the argmax)
$= \arg\max \prod_{x \in X_1} p(x \mid 1) \prod_{x \in X_2} p(x \mid 2) \cdots \prod_{x \in X_K} p(x \mid K)$,
where $X_k$ is the set of training samples whose state is $k$.

The product factorizes over the classes, so each parameter can be estimated independently:
$\theta_k^* = \arg\max_{\theta_k} \prod_{x \in X_k} f(x, \theta_k)$
7. Approaches for ML
Maximum Likelihood: example

[Figure: two groups of 1-D training samples, class 1 around 45 and 46, class 2 around 36 and 37]

For the class-1 samples $X_1 = \{45, 46, 45\}$ and a Gaussian model with known variance:
$\theta_1^* = \arg\max_{\theta_1} \exp\!\left(-\tfrac{1}{2\sigma^2}\left[(45-\theta_1)^2 + (46-\theta_1)^2 + (45-\theta_1)^2\right]\right)$
$= \arg\min_{\theta_1} \left[(45-\theta_1)^2 + (46-\theta_1)^2 + (45-\theta_1)^2\right] = \tfrac{1}{3}(45 + 46 + 45)$
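A minimal numpy sketch of this example, assuming a Gaussian class-conditional density with known variance: maximizing the likelihood over the class-1 samples reduces to the sample mean.

```python
import numpy as np

# Class-1 training samples as read from the slide; sigma is an assumption
X1 = np.array([45.0, 46.0, 45.0])
sigma = 1.0

# Brute-force ML estimate: scan candidate means and keep the most likely one
candidates = np.linspace(40, 50, 1001)
log_lik = [np.sum(-(X1 - mu) ** 2 / (2 * sigma ** 2)) for mu in candidates]
mu_ml = candidates[np.argmax(log_lik)]

print(mu_ml)       # ~45.33
print(X1.mean())   # the closed-form answer: (45 + 46 + 45) / 3
```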
8. 8
,1minmaxarg
1
*
1
kxp
Xx
Approaches for ML
Bias in training data
Nxxx ,...,, 21
Nkkk ,...,, 21
kxfkxp ,
Kk ,...,2,1
The model is parametrized by
K ,...,, 21
input
output
features and
states
domain for k
9. Approaches for ML
Bias in training data: example

$p(x \mid k = 1) = N(\mu_1, 1)$
$p(x \mid k = 2) = N(\mu_2, 1)$

$\theta_1^* = \arg\max_{\theta_1} \min_{x \in X_1} p(x \mid 1, \theta_1)$
10. Approaches for ML
Bias in training data: example

$p(x \mid k = 1) = N(\mu_1, 1)$
$p(x \mid k = 2) = N(\mu_2, 1)$

$\theta_1^* = \arg\max_{\theta_1} \min_{x \in X_1} p(x \mid 1, \theta_1)$

The solution is the center of the circle that holds all the points from $X_1$ (the smallest enclosing circle).
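A small numpy check of the max-min criterion; the 1-D sample values below are assumptions. For a unit-variance Gaussian, maximizing the worst-case likelihood over $X_1$ picks the midpoint between the extreme samples, i.e. the center of the smallest interval (a circle in higher dimensions) containing $X_1$.

```python
import numpy as np

# Biased sample of class 1 (hypothetical values)
X1 = np.array([1.0, 1.5, 2.0, 8.0])

# Max-min criterion: mu* = argmax_mu min_x N(x; mu, 1)
candidates = np.linspace(X1.min(), X1.max(), 10001)
worst_log_lik = [np.min(-(X1 - mu) ** 2 / 2) for mu in candidates]
mu_star = candidates[np.argmax(worst_log_lik)]

print(mu_star)                    # ~4.5
print((X1.min() + X1.max()) / 2)  # midpoint of the extremes: 4.5
print(X1.mean())                  # the plain ML answer would be 3.125
```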
11. Approaches for ML
ERM: Empirical risk minimization

features $x_1, x_2, \dots, x_N$ and states $k_1, k_2, \dots, k_N$, where $k \in \{1, 2, \dots, K\}$ is the domain for $k$
$W(k, k')$: penalty function
decision strategy $q(x, \theta): X \to \{1, \dots, K\}$, parametrized by $\theta$
12. Approaches for ML
ERM: Empirical risk minimization

$\mathrm{Risk}(q) = \sum_{x \in X} \sum_{k \in K} p(x, k)\, W(k, q(x, \theta)) \approx \frac{1}{N} \sum_{i=1}^{N} W(k_i, q(x_i, \theta))$

$\theta^* = \arg\min_{\theta} \mathrm{Risk}(q)$

The decision strategy $q(x, \theta): X \to \{1, \dots, K\}$ is parametrized by $\theta$.
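A toy sketch of empirical risk minimization; the threshold strategy, the 0/1 penalty, and the data below are assumptions, not from the slides. The risk of a strategy $q(x, \theta)$ is approximated by the average penalty over the training set, and $\theta$ is chosen to minimize it.

```python
import numpy as np

# Toy 1-D training data: features and true states (1 or 2)
x = np.array([1.0, 2.0, 3.0, 6.0, 7.0, 8.0])
k = np.array([1, 1, 1, 2, 2, 2])

# 0/1 penalty: W(k, k') = 0 if k == k', else 1
def W(k_true, k_pred):
    return (k_true != k_pred).astype(float)

# Strategy parametrized by a threshold theta: q(x, theta) = 1 if x < theta else 2
def q(x, theta):
    return np.where(x < theta, 1, 2)

def empirical_risk(theta):
    return W(k, q(x, theta)).mean()

thetas = np.linspace(0, 10, 1001)
theta_star = thetas[np.argmin([empirical_risk(t) for t in thetas])]
print(theta_star, empirical_risk(theta_star))  # thresholds in (3, 6] give zero empirical risk
```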
16. Bayesian approach
The problem

Input:
$x \in X$: feature
$k$: state (hidden)
$p(x, k)$: joint probability
$W: K \times K \to R$: penalty function

Output:
$q: X \to K$: decision strategy
$q^*(x) = \arg\min_{q(x)} \mathrm{Risk}(q(x)) = \arg\min_{q(x)} \sum_{k} p(x, k)\, W(k, q(x))$
17. Bayesian approach
Some basics

Bayes' theorem:
$p(k \mid x) = \dfrac{p(k)\, p(x \mid k)}{\sum_{k'} p(k')\, p(x \mid k')}$

Law of total probability:
$p(x) = \sum_{k} p(k)\, p(x \mid k)$

Joint probability:
$p(x, k) = p(k)\, p(x \mid k) = p(x)\, p(k \mid x)$

$p(k)$: prior probability
$p(k \mid x)$: posterior probability
$p(x \mid k)$: conditional probability of $x$, assuming $k$
$p(x)$: probability of $x$
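A minimal sketch of these identities; the priors and likelihoods below are made-up numbers for illustration.

```python
import numpy as np

prior = np.array([0.7, 0.3])         # p(k) for k = 1, 2
likelihood = np.array([0.10, 0.40])  # p(x | k) for the observed x

p_x = np.sum(prior * likelihood)      # law of total probability: p(x)
posterior = prior * likelihood / p_x  # Bayes' theorem: p(k | x)
joint = prior * likelihood            # p(x, k) = p(k) p(x | k)

print(p_x, posterior, joint)
assert np.isclose(posterior.sum(), 1.0)
```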
20. Bayesian approach
Flexibility to not make a decision

Compute the conditional risk $\mathrm{Risk}(k') = \sum_{k} p(k \mid x)\, W(k, k')$.

if $\min_{k'} \mathrm{Risk}(k') \le \varepsilon$
then $k^* = \arg\min_{k'} \sum_{k} p(k \mid x)\, W(k, k')$
else reject

where $\varepsilon$ is a threshold: the cost of rejecting, i.e. the largest conditional risk we accept before refusing to decide.
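A short sketch of the reject rule; the posterior, penalty matrix, and threshold $\varepsilon$ below are assumptions.

```python
import numpy as np

posterior = np.array([0.55, 0.45])  # p(k | x) for k = 1, 2
W = np.array([[0.0, 1.0],           # W[k, k']: penalty for deciding k' when the truth is k
              [1.0, 0.0]])
eps = 0.3                           # maximal acceptable conditional risk

risk = posterior @ W                # R(k' | x) = sum_k p(k | x) W(k, k')
if risk.min() <= eps:
    decision = risk.argmin() + 1    # states are numbered from 1
else:
    decision = "reject"
print(risk, decision)               # risks [0.45, 0.55] -> reject
```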
21. Bayesian approach
Decision strategies for different penalty functions

$w(k, k') = (k - k')^2$: the decision is the posterior average (mean)
$w(k, k') = |k - k'|$: the decision is the posterior median
$w(k, k') = 0$ if $k = k'$, $1$ otherwise: the decision is the most probable value
$w(k, k') = 0$ if $|k - k'| \le \Delta$, $1$ otherwise: the decision is the center of the most probable section $\Delta$
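A numpy sketch illustrating this table; the discrete posterior below is an assumption. Minimizing the expected penalty gives the mean for the squared penalty, the median for the absolute penalty, and the most probable value for the 0/1 penalty.

```python
import numpy as np

k_values = np.array([1, 2, 3, 4, 5])
posterior = np.array([0.3, 0.1, 0.1, 0.1, 0.4])  # p(k | x)

def bayes_decision(penalty):
    # Expected penalty of each possible decision d; pick the smallest
    risks = [np.sum(posterior * penalty(k_values, d)) for d in k_values]
    return k_values[int(np.argmin(risks))]

print(bayes_decision(lambda k, d: (k - d) ** 2))            # 3, close to the posterior mean 3.2
print(bayes_decision(lambda k, d: np.abs(k - d)))           # 3, the posterior median
print(bayes_decision(lambda k, d: (k != d).astype(float)))  # 5, the most probable value
```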
23. Classifying the independent binary features
The problem

$x = (x_1, x_2, \dots, x_N)$: feature, a set of independent binary values
$p(x \mid k) = p(x_1 \mid k)\, p(x_2 \mid k) \cdots p(x_N \mid k)$
Few states are possible: $k \in \{1, 2\}$

Decision strategy: based on the likelihood ratio $\dfrac{p(x \mid k = 1)}{p(x \mid k = 2)}$
24. Classifying the independent binary features
The problem

$\log \dfrac{p(x \mid k=1)}{p(x \mid k=2)} = \sum_{n} \log \dfrac{p(x_n \mid k=1)}{p(x_n \mid k=2)}$

Since each $x_n \in \{0, 1\}$,
$\log \dfrac{p(x_n \mid 1)}{p(x_n \mid 2)} = x_n \log \dfrac{p(x_n{=}1 \mid 1)}{p(x_n{=}1 \mid 2)} + (1 - x_n) \log \dfrac{p(x_n{=}0 \mid 1)}{p(x_n{=}0 \mid 2)}$,
which rearranges to
$\log \dfrac{p(x \mid 1)}{p(x \mid 2)} = \sum_{n} \left[ x_n \log \dfrac{p(x_n{=}1 \mid 1)\, p(x_n{=}0 \mid 2)}{p(x_n{=}1 \mid 2)\, p(x_n{=}0 \mid 1)} + \log \dfrac{p(x_n{=}0 \mid 1)}{p(x_n{=}0 \mid 2)} \right]$
25. Classifying the independent binary features
Decision rule

$\log \dfrac{p(x \mid 1)}{p(x \mid 2)} = \sum_{i=1}^{N} x_i \log \dfrac{p(x_i{=}1 \mid 1)\, p(x_i{=}0 \mid 2)}{p(x_i{=}1 \mid 2)\, p(x_i{=}0 \mid 1)} + \sum_{i=1}^{N} \log \dfrac{p(x_i{=}0 \mid 1)}{p(x_i{=}0 \mid 2)}$

The rule is linear in $x$: decide between $k = 1$ and $k = 2$ by comparing $\sum_{i=1}^{N} a_i x_i$ with a threshold, where $a_i = \log \dfrac{p(x_i{=}1 \mid 1)\, p(x_i{=}0 \mid 2)}{p(x_i{=}1 \mid 2)\, p(x_i{=}0 \mid 1)}$.
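A sketch of the resulting linear rule; the per-feature probabilities below are assumptions. The coefficients $a_i$ come from the log-ratio above, and the decision compares $\sum_i a_i x_i$ plus the constant term with zero.

```python
import numpy as np

# p(x_i = 1 | k) for each binary feature, for k = 1 and k = 2 (assumed values)
p1 = np.array([0.9, 0.2, 0.7])  # class 1
p2 = np.array([0.4, 0.6, 0.3])  # class 2

# Coefficients of the linear decision rule derived above
a = np.log(p1 * (1 - p2) / (p2 * (1 - p1)))
bias = np.sum(np.log((1 - p1) / (1 - p2)))

def log_likelihood_ratio(x):
    return a @ x + bias  # log p(x|1) - log p(x|2)

x = np.array([1, 0, 1])
direct = (np.sum(np.log(np.where(x == 1, p1, 1 - p1)))
          - np.sum(np.log(np.where(x == 1, p2, 1 - p2))))
print(log_likelihood_ratio(x), direct)  # the two computations agree
print("decide k=1" if log_likelihood_ratio(x) > 0 else "decide k=2")
```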
43. Gaussian classifier
Definitions

$x = (x_1, x_2, \dots, x_n)$: feature

$p(x \mid k) \propto \exp\!\left(-0.5 \sum_{i=1}^{n} \sum_{j=1}^{n} a^{[k]}_{ij}\, x^{[k]}_i\, x^{[k]}_j\right)$

$x^{[k]}_i = x_i - \mathbb{E}[x_i \mid k]$ (the feature centered on its class-$k$ expectation)
$b^{[k]}_{ij} = \mathbb{E}\!\left[x^{[k]}_i\, x^{[k]}_j\right]$ (the class-$k$ covariance matrix $B^{[k]}$)
$A^{[k]} = \left(B^{[k]}\right)^{-1}$
44. Gaussian classifier
Example

$p(x) \propto \exp\!\left(-0.5 \sum_{i=1}^{n} \sum_{j=1}^{n} a_{ij}\,(x_i - \mu_i)(x_j - \mu_j)\right)$

[Figure: contour plot of a 2-D Gaussian density, both axes ranging over roughly 2 to 5]
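A numpy sketch of these definitions; the random 2-D data is generated only for illustration. It estimates $\mu^{[k]}$ and $B^{[k]}$ from class-$k$ samples, inverts $B$ to get $A$, and evaluates the exponent of the Gaussian density.

```python
import numpy as np

rng = np.random.default_rng(0)
Xk = rng.multivariate_normal(mean=[3.0, 4.0], cov=[[1.0, 0.5], [0.5, 2.0]], size=500)

mu = Xk.mean(axis=0)                     # mu_i = E[x_i | k]
B = np.cov(Xk, rowvar=False, bias=True)  # b_ij = E[(x_i - mu_i)(x_j - mu_j)]
A = np.linalg.inv(B)                     # A = B^-1

def log_density_unnormalized(x):
    d = x - mu
    return -0.5 * d @ A @ d              # exponent of the Gaussian density

print(mu, B)
print(log_density_unnormalized(np.array([3.0, 4.0])))  # highest near the mean
print(log_density_unnormalized(np.array([6.0, 1.0])))  # much lower far away
```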
63. Linear discrimination
SVM: support vector machine

Separate by the widest band:
$\sum_{i=1}^{n} \alpha_i x_i \ge 1$ for $x \in X_1$
$\sum_{i=1}^{n} \alpha_i x_i \le -1$ for $x \in X_2$
$\langle \alpha, \alpha \rangle \to \min$
64. Linear discrimination
SVM: support vector machine

Separate by the widest band, allowing violations with a penalty (slack) $\xi_x$:
$\sum_{i=1}^{n} \alpha_i x_i \ge 1 - \xi_x$ for $x \in X_1$
$\sum_{i=1}^{n} \alpha_i x_i \le -1 + \xi_x$ for $x \in X_2$
$\langle \alpha, \alpha \rangle + \mathrm{const} \cdot \sum_{x \in X_1 \cup X_2} \xi_x \to \min$
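A short sketch using scikit-learn's SVC with a linear kernel on toy data; the parameter C plays the role of the const multiplying the slack penalties.

```python
import numpy as np
from sklearn.svm import SVC

# Toy 2-D data: X1 around (0, 0), X2 around (3, 3)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, size=(20, 2)), rng.normal(3, 0.5, size=(20, 2))])
y = np.array([1] * 20 + [-1] * 20)

clf = SVC(kernel="linear", C=1.0)  # larger C -> heavier penalty on slack variables
clf.fit(X, y)

print(clf.coef_, clf.intercept_)   # the separating hyperplane
print(clf.support_vectors_)        # points that define the widest band
```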
66. Logistic regression
Target function: empirical risk = number of errors

$\#\,\mathrm{errors} = \sum_{i=1}^{n} \left[\, \mathrm{margin}(x_i, y_i) < 0 \,\right] = \sum_{i=1}^{n} \left[\, y_i \langle a, x_i \rangle < 0 \,\right] \to \min$

[Figure: a separating line with the misclassified points, i.e. those with margin < 0, marked on both sides]
67. Logistic regression
Target function: upper bound

The indicator $[\mathrm{margin} < 0]$ is bounded above by the smooth function $\log(1 + \exp(-\mathrm{margin}))$, so

$\sum_{i=1}^{n} \left[\, y_i \langle a, x_i \rangle < 0 \,\right] \le \sum_{i=1}^{n} \log\!\left(1 + \exp(-y_i \langle a, x_i \rangle)\right) \to \min$

[Figure: the step function [margin < 0] and its upper bound log(1 + exp(-margin)) plotted against the margin]
68. Logistic regression
Relationship with P(y|x)

$\sum_{i=1}^{n} \log\!\left(1 + \exp(-y_i \langle a, x_i \rangle)\right) \to \min$
$\Leftrightarrow \sum_{i=1}^{n} \log \dfrac{1}{1 + \exp(-y_i \langle a, x_i \rangle)} \to \max$
$\Leftrightarrow \sum_{i=1}^{n} \log\, \mathrm{sigmoid}\!\left(y_i \langle a, x_i \rangle\right) \to \max$
$\Leftrightarrow \sum_{i=1}^{n} \log p(y_i \mid x_i) \to \max$

with the model $p(y_i \mid x_i) = \mathrm{sigmoid}\!\left(y_i \langle a, x_i \rangle\right)$.
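A minimal gradient-descent sketch of this objective on toy data with a fixed step size; minimizing $\sum_i \log(1 + \exp(-y_i \langle a, x_i \rangle))$ is the same as maximizing $\sum_i \log\,\mathrm{sigmoid}(y_i \langle a, x_i \rangle)$.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, size=(50, 2)), rng.normal(1, 1, size=(50, 2))])
y = np.array([-1] * 50 + [1] * 50)

a = np.zeros(2)
lr = 0.1
for _ in range(200):
    margins = y * (X @ a)
    # gradient of sum_i log(1 + exp(-margin_i)) with respect to a
    grad = -(X * (y * (1 / (1 + np.exp(margins))))[:, None]).sum(axis=0)
    a -= lr * grad / len(y)

p = 1 / (1 + np.exp(-(y * (X @ a))))  # p(y_i | x_i) = sigmoid(y_i <a, x_i>)
print(a, p.mean())                    # most training points get high probability
```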
71. Bias vs Variance

Bias is an error from wrong assumptions in the learning algorithm.
High bias can cause underfitting: the algorithm can miss relations
between features and target.
Variance is an error from sensitivity to small fluctuations in the
training set. High variance can cause overfitting: the algorithm can
model the random noise in the training data rather than the
intended outputs.
72. Bias vs Variance
How to avoid overfitting (high variance); a short sketch follows this list:
o Keep the model simpler
o Do cross-validation (e.g. k-fold)
o Use regularization techniques (e.g. LASSO, Ridge)
o Use the top n features from a variable importance chart
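A quick scikit-learn sketch of two items from the list above, k-fold cross-validation and Ridge regularization; the synthetic data is only for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
y = X[:, 0] * 2.0 + rng.normal(scale=0.5, size=100)  # only one informative feature

for alpha in [0.01, 1.0, 10.0]:  # stronger regularization -> simpler model
    scores = cross_val_score(Ridge(alpha=alpha), X, y, cv=5)  # 5-fold CV
    print(alpha, scores.mean())
```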
86. Decision tree
F1 F2 F3 Target
A A A 0
A A B 0
B B A 1
A B B 1
H(0,0,1,1) = -1/2 log2(1/2) - 1/2 log2(1/2) = 1
87. Decision tree
Attempt split on F1
F1 F2 F3 Target Group
A A A 0 0
A A B 0 0
B B A 1 1
A B B 1 0
Information Gain = H(0011) - 3/4 H(001) - 1/4 H(1) = 1 - 3/4 x 0.918 - 0 ≈ 0.31
88. Decision tree
Attempt split on F2
F1 F2 F3 Target Group
A A A 0 0
A A B 0 0
B B A 1 1
A B B 1 1
Information Gain = H(0011) - 1/2 H(00) - 1/2 H(11) = 1
89. Decision tree
Attempt split on F3
F1 F2 F3 Target Group
A A A 0 0
A A B 0 1
B B A 1 0
A B B 1 1
Information Gain = H(0011) - 1/2 H(01) - 1/2 H(01) = 0
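A small sketch that reproduces these three information-gain computations for the toy table (entropy in bits):

```python
import numpy as np
from collections import Counter

def entropy(labels):
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(feature, target):
    ig = entropy(target)
    for value in set(feature):
        subset = [t for f, t in zip(feature, target) if f == value]
        ig -= len(subset) / len(target) * entropy(subset)
    return ig

target = [0, 0, 1, 1]
F1 = ["A", "A", "B", "A"]
F2 = ["A", "A", "B", "B"]
F3 = ["A", "B", "A", "B"]
print(information_gain(F1, target))  # ~0.31
print(information_gain(F2, target))  # 1.0
print(information_gain(F3, target))  # 0.0
```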
93. Decision tree: pruning

[Figure: a subtree whose nodes are labeled with accuracies 58%, 58%, 55.5%, and 62.5%]

Expected accuracy if the subtree is kept:
50% x 58% + 50% x (60% x 55.5% + 40% x 62.5%) = 50% x 58% + 50% x 58.3% = 58.15%

58.15% > 58%, i.e. only a small improvement over the unsplit node, so this subtree is a candidate for pruning.
97. Random forest
Original dataset
F1 F2 F3 Target
A A A 0
A A B 0
B B A 1
A B B 1
Bootstrapped dataset
F1 F2 F3 Target
A B B 1
A A B 0
A A A 0
A B B 1
98. Random forest
Bootstrapped dataset
F1 F2 F3 Target
A B B 1
A A B 0
A A A 0
A B B 1
Determine the best root node out of a randomly selected subset of candidate features.
100. Random forest
Bootstrapped dataset
F1 F2 F3 Target
A B B 1
A A B 0
A A A 0
A B B 1
Build tree as usual, but considering a random subset of features at each step
102. Random forest
Validation: run aggregated decisions against out-of-bag dataset
Original dataset
F1 F2 F3 Target
A A A 0
A A B 0
B B A 1
A B B 1
Bootstrapped dataset
F1 F2 F3 Target
A B B 1
A A B 0
A A A 0
A B B 1
Out-of-bag: the rows of the original dataset that were not drawn into the bootstrapped dataset.
103. Random forest
The process (a scikit-learn sketch follows):
o Build the random forest
o Estimate the train / validation error
o Adjust the RF parameters (# of variables per step, regularization params, etc.) and repeat
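A scikit-learn sketch that mirrors this process, with bootstrapping, a random feature subset per split, and out-of-bag validation; the synthetic dataset is only for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=10, n_informative=4, random_state=0)

clf = RandomForestClassifier(
    n_estimators=200,
    max_features="sqrt",  # random subset of features considered at each split
    bootstrap=True,       # each tree is trained on a bootstrapped dataset
    oob_score=True,       # validate on the out-of-bag samples
    random_state=0,
)
clf.fit(X, y)
print(clf.oob_score_)     # out-of-bag accuracy estimate
```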
105. Adaboost
Definitions

$x = (x_1, x_2, \dots, x_i, \dots, x_N)$: training features
$k = (k_1, k_2, \dots, k_i, \dots, k_N) \in \{-1, 1\}$: training targets
$t = 1, 2, 3, \dots$: iteration number
$W_t(x_i)$: weight of the element on iteration $t$
$h_t(x)$: weak classifier on iteration $t$ (e.g. a threshold rule: $h(x) = 1$ if $x > x_0$, $-1$ otherwise)
$\alpha_t$: weight of the weak classifier on iteration $t$
$k(x) = \sum_{t} \alpha_t h_t(x)$: the resulting strong classifier
106. Adaboost
Algorithm

Weak classifier to minimize the weighted error:
$h_t = \arg\min_{h} \sum_{i} W_t(i)\, [\, h(x_i) \ne k_i \,]$

Calculate the weight of the classifier:
$r_t = \dfrac{\sum_{i} W_t(i)\, k_i\, h_t(x_i)}{\sum_{i} W_t(i)}$
$\alpha_t = \dfrac{1}{2} \log \dfrac{1 + r_t}{1 - r_t}$

Update the weights:
$W_{t+1}(i) = W_t(i)\, \exp\!\left(-\alpha_t k_i h_t(x_i)\right)$
Equivalently,
$W_{t+1}(i) = W_t(i) \sqrt{\dfrac{1 - r_t}{1 + r_t}}$ if $h_t(x_i) = k_i$, and $W_{t+1}(i) = W_t(i) \sqrt{\dfrac{1 + r_t}{1 - r_t}}$ if $h_t(x_i) \ne k_i$.
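A compact numpy sketch of these update rules using threshold stumps as the weak classifiers; the 1-D toy data and the brute-force stump search are simplifications for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-1, 1, 50), rng.normal(1, 1, 50)])
k = np.array([-1] * 50 + [1] * 50)       # targets in {-1, 1}

W = np.ones(len(x)) / len(x)             # initial weights
classifiers = []                         # list of (threshold, sign, alpha)

for t in range(10):
    # Weak classifier: stump h(x) = sign * (+1 if x > threshold else -1),
    # chosen to minimize the weighted error
    best = None
    for thr in np.sort(x):
        for sign in (+1, -1):
            h = sign * np.where(x > thr, 1, -1)
            err = np.sum(W * (h != k))
            if best is None or err < best[0]:
                best = (err, thr, sign, h)
    _, thr, sign, h = best
    r = np.sum(W * k * h) / np.sum(W)          # weighted margin r_t
    alpha = 0.5 * np.log((1 + r) / (1 - r))    # weight of the weak classifier
    W = W * np.exp(-alpha * k * h)             # update the sample weights
    W /= W.sum()
    classifiers.append((thr, sign, alpha))

# Final strong classifier: sign of sum_t alpha_t h_t(x)
def predict(xs):
    s = sum(a * sg * np.where(xs > th, 1, -1) for th, sg, a in classifiers)
    return np.sign(s)

print((predict(x) == k).mean())                # training accuracy
```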
162. Benchmarking the Classifiers
o Regression
o Naive Bayes
o Gaussian
o SVM
o Nearest Neighbors
o Decision Tree
o Random Forest
o AdaBoost
o xgboost
o Neural Net
195. Neyman-Pearson approach
Limitations of the Bayesian approach

$x \in X$: feature
$k$: hidden state
$p(k)$: prior probability of the state
$p(x \mid k)$: conditional probability of observing $x$ in state $k$
$W: K \times K \to R$: penalty function
197. Neyman-Pearson approach
Limitations of the Bayesian approach: example

$x \in X$: heartbeat
$k$: state, normal or sick
$p(x \mid k)$: distribution of the heartbeat for normal and for sick
$p(k)$: ??? (the prior is unknown)
$W(\text{normal}, \text{sick})$, $W(\text{sick}, \text{normal})$: ??? (the penalties are hard to define)
210. Missing values
Common options (a short pandas sketch follows this list):
o Delete missing rows
o Replace with mean/median
o Encode as another level of a categorical variable
o Create a new engineered feature
o Hot deck
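A brief pandas sketch of a few of the options above; the small DataFrame is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "age": [25, None, 40, 35],
    "city": ["A", "B", None, "A"],
})

dropped = df.dropna()                                   # delete rows with missing values
filled = df.fillna({"age": df["age"].median()})         # replace with the median
encoded = df.assign(city=df["city"].fillna("Missing"))  # missing as another category level
flagged = df.assign(age_missing=df["age"].isna())       # new engineered feature: missing flag

print(dropped, filled, encoded, flagged, sep="\n\n")
```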