4. Approaches for ML
Maximum Likelihood

Input:
features $x_1, x_2, \dots, x_N$ and states $k_1, k_2, \dots, k_N$, where $k \in \{1, 2, \dots, K\}$ is the domain for $k$
$p(x \mid k) = f(x, \theta_k)$; the model is parametrized by $\theta_1, \theta_2, \dots, \theta_K$

Output:
$(\theta_1^*, \theta_2^*, \dots, \theta_K^*) = \arg\max_{\theta_1, \theta_2, \dots, \theta_K} \prod_{i=1}^{N} p(x_i \mid k_i)$
5. Approaches for ML
Maximum Likelihood

$(\theta_1^*, \theta_2^*, \dots, \theta_K^*) = \arg\max_{\theta_1, \theta_2, \dots, \theta_K} \prod_{i=1}^{N} p(x_i \mid k_i) = \arg\max \prod_{i=1}^{N} p(k_i)\, p(x_i \mid k_i)$
(the priors $p(k_i)$ do not depend on $\theta$, so they do not change the argmax)
$= \arg\max \prod_{x \in X_1} p(x \mid 1) \prod_{x \in X_2} p(x \mid 2) \cdots \prod_{x \in X_K} p(x \mid K)$,
where $X_k$ is the set of training samples whose state is $k$.

The product factorizes over the classes, so each parameter can be estimated independently:
$\theta_k^* = \arg\max_{\theta_k} \prod_{x \in X_k} f(x, \theta_k)$
7. Approaches for ML
Maximum Likelihood: example

[Figure: two groups of 1-D training samples, class 1 around 45 and 46, class 2 around 36 and 37]

For the class-1 samples $X_1 = \{45, 46, 45\}$ and a Gaussian model with known variance:
$\theta_1^* = \arg\max_{\theta_1} \exp\!\left(-\tfrac{1}{2\sigma^2}\left[(45-\theta_1)^2 + (46-\theta_1)^2 + (45-\theta_1)^2\right]\right)$
$= \arg\min_{\theta_1} \left[(45-\theta_1)^2 + (46-\theta_1)^2 + (45-\theta_1)^2\right] = \tfrac{1}{3}(45 + 46 + 45)$
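A minimal numpy sketch of this example, assuming a Gaussian class-conditional density with known variance: maximizing the likelihood over the class-1 samples reduces to the sample mean.

```python
import numpy as np

# Class-1 training samples as read from the slide; sigma is an assumption
X1 = np.array([45.0, 46.0, 45.0])
sigma = 1.0

# Brute-force ML estimate: scan candidate means and keep the most likely one
candidates = np.linspace(40, 50, 1001)
log_lik = [np.sum(-(X1 - mu) ** 2 / (2 * sigma ** 2)) for mu in candidates]
mu_ml = candidates[np.argmax(log_lik)]

print(mu_ml)       # ~45.33
print(X1.mean())   # the closed-form answer: (45 + 46 + 45) / 3
```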
8. 8
,1minmaxarg
1
*
1
kxp
Xx
Approaches for ML
Bias in training data
Nxxx ,...,, 21
Nkkk ,...,, 21
kxfkxp ,
Kk ,...,2,1
The model is parametrized by
K ,...,, 21
input
output
features and
states
domain for k
9. Approaches for ML
Bias in training data: example

$p(x \mid k = 1) = N(\mu_1, 1)$
$p(x \mid k = 2) = N(\mu_2, 1)$

$\theta_1^* = \arg\max_{\theta_1} \min_{x \in X_1} p(x \mid 1, \theta_1)$
10. Approaches for ML
Bias in training data: example

$p(x \mid k = 1) = N(\mu_1, 1)$
$p(x \mid k = 2) = N(\mu_2, 1)$

$\theta_1^* = \arg\max_{\theta_1} \min_{x \in X_1} p(x \mid 1, \theta_1)$

The solution is the center of the circle that holds all the points from $X_1$ (the smallest enclosing circle).
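A small numpy check of the max-min criterion; the 1-D sample values below are assumptions. For a unit-variance Gaussian, maximizing the worst-case likelihood over $X_1$ picks the midpoint between the extreme samples, i.e. the center of the smallest interval (a circle in higher dimensions) containing $X_1$.

```python
import numpy as np

# Biased sample of class 1 (hypothetical values)
X1 = np.array([1.0, 1.5, 2.0, 8.0])

# Max-min criterion: mu* = argmax_mu min_x N(x; mu, 1)
candidates = np.linspace(X1.min(), X1.max(), 10001)
worst_log_lik = [np.min(-(X1 - mu) ** 2 / 2) for mu in candidates]
mu_star = candidates[np.argmax(worst_log_lik)]

print(mu_star)                    # ~4.5
print((X1.min() + X1.max()) / 2)  # midpoint of the extremes: 4.5
print(X1.mean())                  # the plain ML answer would be 3.125
```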
11. Approaches for ML
ERM: Empirical risk minimization

features $x_1, x_2, \dots, x_N$ and states $k_1, k_2, \dots, k_N$, where $k \in \{1, 2, \dots, K\}$ is the domain for $k$
$W(k, k')$: penalty function
decision strategy $q(x, \theta): X \to \{1, \dots, K\}$, parametrized by $\theta$
12. Approaches for ML
ERM: Empirical risk minimization

$\mathrm{Risk}(q) = \sum_{x \in X} \sum_{k \in K} p(x, k)\, W(k, q(x, \theta)) \approx \frac{1}{N} \sum_{i=1}^{N} W(k_i, q(x_i, \theta))$

$\theta^* = \arg\min_{\theta} \mathrm{Risk}(q)$

The decision strategy $q(x, \theta): X \to \{1, \dots, K\}$ is parametrized by $\theta$.
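A toy sketch of empirical risk minimization; the threshold strategy, the 0/1 penalty, and the data below are assumptions, not from the slides. The risk of a strategy $q(x, \theta)$ is approximated by the average penalty over the training set, and $\theta$ is chosen to minimize it.

```python
import numpy as np

# Toy 1-D training data: features and true states (1 or 2)
x = np.array([1.0, 2.0, 3.0, 6.0, 7.0, 8.0])
k = np.array([1, 1, 1, 2, 2, 2])

# 0/1 penalty: W(k, k') = 0 if k == k', else 1
def W(k_true, k_pred):
    return (k_true != k_pred).astype(float)

# Strategy parametrized by a threshold theta: q(x, theta) = 1 if x < theta else 2
def q(x, theta):
    return np.where(x < theta, 1, 2)

def empirical_risk(theta):
    return W(k, q(x, theta)).mean()

thetas = np.linspace(0, 10, 1001)
theta_star = thetas[np.argmin([empirical_risk(t) for t in thetas])]
print(theta_star, empirical_risk(theta_star))  # thresholds in (3, 6] give zero empirical risk
```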
16. Bayesian approach
The problem

Input:
$x \in X$: feature
$k$: state (hidden)
$p(x, k)$: joint probability
$W: K \times K \to R$: penalty function

Output:
$q: X \to K$: decision strategy
$q^*(x) = \arg\min_{q(x)} \mathrm{Risk}(q(x)) = \arg\min_{q(x)} \sum_{k} p(x, k)\, W(k, q(x))$
17. Bayesian approach
Some basics

Bayes' theorem:
$p(k \mid x) = \dfrac{p(k)\, p(x \mid k)}{\sum_{k'} p(k')\, p(x \mid k')}$

Law of total probability:
$p(x) = \sum_{k} p(k)\, p(x \mid k)$

Joint probability:
$p(x, k) = p(k)\, p(x \mid k) = p(x)\, p(k \mid x)$

$p(k)$: prior probability
$p(k \mid x)$: posterior probability
$p(x \mid k)$: conditional probability of $x$, assuming $k$
$p(x)$: probability of $x$
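A minimal sketch of these identities; the priors and likelihoods below are made-up numbers for illustration.

```python
import numpy as np

prior = np.array([0.7, 0.3])         # p(k) for k = 1, 2
likelihood = np.array([0.10, 0.40])  # p(x | k) for the observed x

p_x = np.sum(prior * likelihood)      # law of total probability: p(x)
posterior = prior * likelihood / p_x  # Bayes' theorem: p(k | x)
joint = prior * likelihood            # p(x, k) = p(k) p(x | k)

print(p_x, posterior, joint)
assert np.isclose(posterior.sum(), 1.0)
```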
20. Bayesian approach
Flexibility to not make a decision

Compute the conditional risk $\mathrm{Risk}(k') = \sum_{k} p(k \mid x)\, W(k, k')$.

if $\min_{k'} \mathrm{Risk}(k') \le \varepsilon$
then $k^* = \arg\min_{k'} \sum_{k} p(k \mid x)\, W(k, k')$
else reject

where $\varepsilon$ is a threshold: the cost of rejecting, i.e. the largest conditional risk we accept before refusing to decide.
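A short sketch of the reject rule; the posterior, penalty matrix, and threshold $\varepsilon$ below are assumptions.

```python
import numpy as np

posterior = np.array([0.55, 0.45])  # p(k | x) for k = 1, 2
W = np.array([[0.0, 1.0],           # W[k, k']: penalty for deciding k' when the truth is k
              [1.0, 0.0]])
eps = 0.3                           # maximal acceptable conditional risk

risk = posterior @ W                # R(k' | x) = sum_k p(k | x) W(k, k')
if risk.min() <= eps:
    decision = risk.argmin() + 1    # states are numbered from 1
else:
    decision = "reject"
print(risk, decision)               # risks [0.45, 0.55] -> reject
```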
21. Bayesian approach
Decision strategies for different penalty functions

$w(k, k') = (k - k')^2$: the decision is the posterior average (mean)
$w(k, k') = |k - k'|$: the decision is the posterior median
$w(k, k') = 0$ if $k = k'$, $1$ otherwise: the decision is the most probable value
$w(k, k') = 0$ if $|k - k'| \le \Delta$, $1$ otherwise: the decision is the center of the most probable section $\Delta$
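A numpy sketch illustrating this table; the discrete posterior below is an assumption. Minimizing the expected penalty gives the mean for the squared penalty, the median for the absolute penalty, and the most probable value for the 0/1 penalty.

```python
import numpy as np

k_values = np.array([1, 2, 3, 4, 5])
posterior = np.array([0.3, 0.1, 0.1, 0.1, 0.4])  # p(k | x)

def bayes_decision(penalty):
    # Expected penalty of each possible decision d; pick the smallest
    risks = [np.sum(posterior * penalty(k_values, d)) for d in k_values]
    return k_values[int(np.argmin(risks))]

print(bayes_decision(lambda k, d: (k - d) ** 2))            # 3, close to the posterior mean 3.2
print(bayes_decision(lambda k, d: np.abs(k - d)))           # 3, the posterior median
print(bayes_decision(lambda k, d: (k != d).astype(float)))  # 5, the most probable value
```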
23. Classifying the independent binary features
The problem

$x = (x_1, x_2, \dots, x_N)$: feature, a set of independent binary values
$p(x \mid k) = p(x_1 \mid k)\, p(x_2 \mid k) \cdots p(x_N \mid k)$
Few states are possible: $k \in \{1, 2\}$

Decision strategy: based on the likelihood ratio $\dfrac{p(x \mid k = 1)}{p(x \mid k = 2)}$
24. Classifying the independent binary features
The problem

$\log \dfrac{p(x \mid k=1)}{p(x \mid k=2)} = \sum_{n} \log \dfrac{p(x_n \mid k=1)}{p(x_n \mid k=2)}$

Since each $x_n \in \{0, 1\}$,
$\log \dfrac{p(x_n \mid 1)}{p(x_n \mid 2)} = x_n \log \dfrac{p(x_n{=}1 \mid 1)}{p(x_n{=}1 \mid 2)} + (1 - x_n) \log \dfrac{p(x_n{=}0 \mid 1)}{p(x_n{=}0 \mid 2)}$,
which rearranges to
$\log \dfrac{p(x \mid 1)}{p(x \mid 2)} = \sum_{n} \left[ x_n \log \dfrac{p(x_n{=}1 \mid 1)\, p(x_n{=}0 \mid 2)}{p(x_n{=}1 \mid 2)\, p(x_n{=}0 \mid 1)} + \log \dfrac{p(x_n{=}0 \mid 1)}{p(x_n{=}0 \mid 2)} \right]$
25. Classifying the independent binary features
Decision rule

$\log \dfrac{p(x \mid 1)}{p(x \mid 2)} = \sum_{i=1}^{N} x_i \log \dfrac{p(x_i{=}1 \mid 1)\, p(x_i{=}0 \mid 2)}{p(x_i{=}1 \mid 2)\, p(x_i{=}0 \mid 1)} + \sum_{i=1}^{N} \log \dfrac{p(x_i{=}0 \mid 1)}{p(x_i{=}0 \mid 2)}$

The rule is linear in $x$: decide between $k = 1$ and $k = 2$ by comparing $\sum_{i=1}^{N} a_i x_i$ with a threshold, where $a_i = \log \dfrac{p(x_i{=}1 \mid 1)\, p(x_i{=}0 \mid 2)}{p(x_i{=}1 \mid 2)\, p(x_i{=}0 \mid 1)}$.
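A sketch of the resulting linear rule; the per-feature probabilities below are assumptions. The coefficients $a_i$ come from the log-ratio above, and the decision compares $\sum_i a_i x_i$ plus the constant term with zero.

```python
import numpy as np

# p(x_i = 1 | k) for each binary feature, for k = 1 and k = 2 (assumed values)
p1 = np.array([0.9, 0.2, 0.7])  # class 1
p2 = np.array([0.4, 0.6, 0.3])  # class 2

# Coefficients of the linear decision rule derived above
a = np.log(p1 * (1 - p2) / (p2 * (1 - p1)))
bias = np.sum(np.log((1 - p1) / (1 - p2)))

def log_likelihood_ratio(x):
    return a @ x + bias  # log p(x|1) - log p(x|2)

x = np.array([1, 0, 1])
direct = (np.sum(np.log(np.where(x == 1, p1, 1 - p1)))
          - np.sum(np.log(np.where(x == 1, p2, 1 - p2))))
print(log_likelihood_ratio(x), direct)  # the two computations agree
print("decide k=1" if log_likelihood_ratio(x) > 0 else "decide k=2")
```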
43. Gaussian classifier
Definitions

$x = (x_1, x_2, \dots, x_n)$: feature

$p(x \mid k) \propto \exp\!\left(-0.5 \sum_{i=1}^{n} \sum_{j=1}^{n} a^{[k]}_{ij}\, x^{[k]}_i\, x^{[k]}_j\right)$

$x^{[k]}_i = x_i - \mathbb{E}[x_i \mid k]$ (the feature centered on its class-$k$ expectation)
$b^{[k]}_{ij} = \mathbb{E}\!\left[x^{[k]}_i\, x^{[k]}_j\right]$ (the class-$k$ covariance matrix $B^{[k]}$)
$A^{[k]} = \left(B^{[k]}\right)^{-1}$
44. Gaussian classifier
Example

$p(x) \propto \exp\!\left(-0.5 \sum_{i=1}^{n} \sum_{j=1}^{n} a_{ij}\,(x_i - \mu_i)(x_j - \mu_j)\right)$

[Figure: contour plot of a 2-D Gaussian density, both axes ranging over roughly 2 to 5]
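A numpy sketch of these definitions; the random 2-D data is generated only for illustration. It estimates $\mu^{[k]}$ and $B^{[k]}$ from class-$k$ samples, inverts $B$ to get $A$, and evaluates the exponent of the Gaussian density.

```python
import numpy as np

rng = np.random.default_rng(0)
Xk = rng.multivariate_normal(mean=[3.0, 4.0], cov=[[1.0, 0.5], [0.5, 2.0]], size=500)

mu = Xk.mean(axis=0)                     # mu_i = E[x_i | k]
B = np.cov(Xk, rowvar=False, bias=True)  # b_ij = E[(x_i - mu_i)(x_j - mu_j)]
A = np.linalg.inv(B)                     # A = B^-1

def log_density_unnormalized(x):
    d = x - mu
    return -0.5 * d @ A @ d              # exponent of the Gaussian density

print(mu, B)
print(log_density_unnormalized(np.array([3.0, 4.0])))  # highest near the mean
print(log_density_unnormalized(np.array([6.0, 1.0])))  # much lower far away
```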
63. Linear discrimination
SVM: support vector machine

Separate by the widest band:
$\sum_{i=1}^{n} \alpha_i x_i \ge 1$ for $x \in X_1$
$\sum_{i=1}^{n} \alpha_i x_i \le -1$ for $x \in X_2$
$\langle \alpha, \alpha \rangle \to \min$
64. Linear discrimination
SVM: support vector machine

Separate by the widest band, allowing violations with a penalty (slack) $\xi_x$:
$\sum_{i=1}^{n} \alpha_i x_i \ge 1 - \xi_x$ for $x \in X_1$
$\sum_{i=1}^{n} \alpha_i x_i \le -1 + \xi_x$ for $x \in X_2$
$\langle \alpha, \alpha \rangle + \mathrm{const} \cdot \sum_{x \in X_1 \cup X_2} \xi_x \to \min$
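A short sketch using scikit-learn's SVC with a linear kernel on toy data; the parameter C plays the role of the const multiplying the slack penalties.

```python
import numpy as np
from sklearn.svm import SVC

# Toy 2-D data: X1 around (0, 0), X2 around (3, 3)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, size=(20, 2)), rng.normal(3, 0.5, size=(20, 2))])
y = np.array([1] * 20 + [-1] * 20)

clf = SVC(kernel="linear", C=1.0)  # larger C -> heavier penalty on slack variables
clf.fit(X, y)

print(clf.coef_, clf.intercept_)   # the separating hyperplane
print(clf.support_vectors_)        # points that define the widest band
```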
66. Logistic regression
Target function: empirical risk = number of errors

$\#\,\mathrm{errors} = \sum_{i=1}^{n} \left[\, \mathrm{margin}(x_i, y_i) < 0 \,\right] = \sum_{i=1}^{n} \left[\, y_i \langle a, x_i \rangle < 0 \,\right] \to \min$

[Figure: a separating line with the misclassified points, i.e. those with margin < 0, marked on both sides]
67. Logistic regression
Target function: upper bound

The indicator $[\mathrm{margin} < 0]$ is bounded above by the smooth function $\log(1 + \exp(-\mathrm{margin}))$, so

$\sum_{i=1}^{n} \left[\, y_i \langle a, x_i \rangle < 0 \,\right] \le \sum_{i=1}^{n} \log\!\left(1 + \exp(-y_i \langle a, x_i \rangle)\right) \to \min$

[Figure: the step function [margin < 0] and its upper bound log(1 + exp(-margin)) plotted against the margin]
68. Logistic regression
Relationship with P(y|x)

$\sum_{i=1}^{n} \log\!\left(1 + \exp(-y_i \langle a, x_i \rangle)\right) \to \min$
$\Leftrightarrow \sum_{i=1}^{n} \log \dfrac{1}{1 + \exp(-y_i \langle a, x_i \rangle)} \to \max$
$\Leftrightarrow \sum_{i=1}^{n} \log\, \mathrm{sigmoid}\!\left(y_i \langle a, x_i \rangle\right) \to \max$
$\Leftrightarrow \sum_{i=1}^{n} \log p(y_i \mid x_i) \to \max$

with the model $p(y_i \mid x_i) = \mathrm{sigmoid}\!\left(y_i \langle a, x_i \rangle\right)$.
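A minimal gradient-descent sketch of this objective on toy data with a fixed step size; minimizing $\sum_i \log(1 + \exp(-y_i \langle a, x_i \rangle))$ is the same as maximizing $\sum_i \log\,\mathrm{sigmoid}(y_i \langle a, x_i \rangle)$.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, size=(50, 2)), rng.normal(1, 1, size=(50, 2))])
y = np.array([-1] * 50 + [1] * 50)

a = np.zeros(2)
lr = 0.1
for _ in range(200):
    margins = y * (X @ a)
    # gradient of sum_i log(1 + exp(-margin_i)) with respect to a
    grad = -(X * (y * (1 / (1 + np.exp(margins))))[:, None]).sum(axis=0)
    a -= lr * grad / len(y)

p = 1 / (1 + np.exp(-(y * (X @ a))))  # p(y_i | x_i) = sigmoid(y_i <a, x_i>)
print(a, p.mean())                    # most training points get high probability
```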
71. Bias vs Variance

Bias is an error from wrong assumptions in the learning algorithm.
High bias can cause underfitting: the algorithm can miss relations
between features and target.
Variance is an error from sensitivity to small fluctuations in the
training set. High variance can cause overfitting: the algorithm can
model the random noise in the training data rather than the
intended outputs.
72. Bias vs Variance
How to avoid overfitting (high variance); a short sketch follows this list:
o Keep the model simpler
o Do cross-validation (e.g. k-fold)
o Use regularization techniques (e.g. LASSO, Ridge)
o Use the top n features from a variable importance chart
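A quick scikit-learn sketch of two items from the list above, k-fold cross-validation and Ridge regularization; the synthetic data is only for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
y = X[:, 0] * 2.0 + rng.normal(scale=0.5, size=100)  # only one informative feature

for alpha in [0.01, 1.0, 10.0]:  # stronger regularization -> simpler model
    scores = cross_val_score(Ridge(alpha=alpha), X, y, cv=5)  # 5-fold CV
    print(alpha, scores.mean())
```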
86. Decision tree
F1 F2 F3 Target
A A A 0
A A B 0
B B A 1
A B B 1
H(0,0,1,1) = -1/2 log2(1/2) - 1/2 log2(1/2) = 1
87. Decision tree
Attempt split on F1
F1 F2 F3 Target Group
A A A 0 0
A A B 0 0
B B A 1 1
A B B 1 0
Information Gain = H(0011) - 3/4 H(001) - 1/4 H(1) = 1 - 3/4 x 0.918 - 0 ≈ 0.31
88. Decision tree
Attempt split on F2
F1 F2 F3 Target Group
A A A 0 0
A A B 0 0
B B A 1 1
A B B 1 1
Information Gain = H(0011) - 1/2 H(00) - 1/2 H(11) = 1
89. Decision tree
Attempt split on F3
F1 F2 F3 Target Group
A A A 0 0
A A B 0 1
B B A 1 0
A B B 1 1
Information Gain = H(0011) - 1/2 H(01) - 1/2 H(01) = 0
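A small sketch that reproduces these three information-gain computations for the toy table (entropy in bits):

```python
import numpy as np
from collections import Counter

def entropy(labels):
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(feature, target):
    ig = entropy(target)
    for value in set(feature):
        subset = [t for f, t in zip(feature, target) if f == value]
        ig -= len(subset) / len(target) * entropy(subset)
    return ig

target = [0, 0, 1, 1]
F1 = ["A", "A", "B", "A"]
F2 = ["A", "A", "B", "B"]
F3 = ["A", "B", "A", "B"]
print(information_gain(F1, target))  # ~0.31
print(information_gain(F2, target))  # 1.0
print(information_gain(F3, target))  # 0.0
```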
93. Decision tree: pruning

[Figure: a subtree whose nodes are labeled with accuracies 58%, 58%, 55.5%, and 62.5%]

Expected accuracy if the subtree is kept:
50% x 58% + 50% x (60% x 55.5% + 40% x 62.5%) = 50% x 58% + 50% x 58.3% = 58.15%

58.15% > 58%, i.e. only a small improvement over the unsplit node, so this subtree is a candidate for pruning.
97. Random forest
Original dataset
F1 F2 F3 Target
A A A 0
A A B 0
B B A 1
A B B 1
Bootstrapped dataset
F1 F2 F3 Target
A B B 1
A A B 0
A A A 0
A B B 1
98. Random forest
Bootstrapped dataset
F1 F2 F3 Target
A B B 1
A A B 0
A A A 0
A B B 1
Determine the best root node out of a randomly selected subset of candidate features.
100. Random forest
Bootstrapped dataset
F1 F2 F3 Target
A B B 1
A A B 0
A A A 0
A B B 1
Build tree as usual, but considering a random subset of features at each step
102. Random forest
Validation: run aggregated decisions against out-of-bag dataset
Original dataset
F1 F2 F3 Target
A A A 0
A A B 0
B B A 1
A B B 1
Bootstrapped dataset
F1 F2 F3 Target
A B B 1
A A B 0
A A A 0
A B B 1
Out-of-bag: the rows of the original dataset that were not drawn into the bootstrapped dataset.
103. Random forest
The process (a scikit-learn sketch follows):
o Build the random forest
o Estimate the train / validation error
o Adjust the RF parameters (# of variables per step, regularization params, etc.) and repeat
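A scikit-learn sketch that mirrors this process, with bootstrapping, a random feature subset per split, and out-of-bag validation; the synthetic dataset is only for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=10, n_informative=4, random_state=0)

clf = RandomForestClassifier(
    n_estimators=200,
    max_features="sqrt",  # random subset of features considered at each split
    bootstrap=True,       # each tree is trained on a bootstrapped dataset
    oob_score=True,       # validate on the out-of-bag samples
    random_state=0,
)
clf.fit(X, y)
print(clf.oob_score_)     # out-of-bag accuracy estimate
```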
105. Adaboost
Definitions

$x = (x_1, x_2, \dots, x_i, \dots, x_N)$: training features
$k = (k_1, k_2, \dots, k_i, \dots, k_N) \in \{-1, 1\}$: training targets
$t = 1, 2, 3, \dots$: iteration number
$W_t(x_i)$: weight of the element on iteration $t$
$h_t(x)$: weak classifier on iteration $t$ (e.g. a threshold rule: $h(x) = 1$ if $x > x_0$, $-1$ otherwise)
$\alpha_t$: weight of the weak classifier on iteration $t$
$k(x) = \sum_{t} \alpha_t h_t(x)$: the resulting strong classifier
106. Adaboost
Algorithm

Weak classifier to minimize the weighted error:
$h_t = \arg\min_{h} \sum_{i} W_t(i)\, [\, h(x_i) \ne k_i \,]$

Calculate the weight of the classifier:
$r_t = \dfrac{\sum_{i} W_t(i)\, k_i\, h_t(x_i)}{\sum_{i} W_t(i)}$
$\alpha_t = \dfrac{1}{2} \log \dfrac{1 + r_t}{1 - r_t}$

Update the weights:
$W_{t+1}(i) = W_t(i)\, \exp\!\left(-\alpha_t k_i h_t(x_i)\right)$
Equivalently,
$W_{t+1}(i) = W_t(i) \sqrt{\dfrac{1 - r_t}{1 + r_t}}$ if $h_t(x_i) = k_i$, and $W_{t+1}(i) = W_t(i) \sqrt{\dfrac{1 + r_t}{1 - r_t}}$ if $h_t(x_i) \ne k_i$.
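A compact numpy sketch of these update rules using threshold stumps as the weak classifiers; the 1-D toy data and the brute-force stump search are simplifications for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-1, 1, 50), rng.normal(1, 1, 50)])
k = np.array([-1] * 50 + [1] * 50)       # targets in {-1, 1}

W = np.ones(len(x)) / len(x)             # initial weights
classifiers = []                         # list of (threshold, sign, alpha)

for t in range(10):
    # Weak classifier: stump h(x) = sign * (+1 if x > threshold else -1),
    # chosen to minimize the weighted error
    best = None
    for thr in np.sort(x):
        for sign in (+1, -1):
            h = sign * np.where(x > thr, 1, -1)
            err = np.sum(W * (h != k))
            if best is None or err < best[0]:
                best = (err, thr, sign, h)
    _, thr, sign, h = best
    r = np.sum(W * k * h) / np.sum(W)          # weighted margin r_t
    alpha = 0.5 * np.log((1 + r) / (1 - r))    # weight of the weak classifier
    W = W * np.exp(-alpha * k * h)             # update the sample weights
    W /= W.sum()
    classifiers.append((thr, sign, alpha))

# Final strong classifier: sign of sum_t alpha_t h_t(x)
def predict(xs):
    s = sum(a * sg * np.where(xs > th, 1, -1) for th, sg, a in classifiers)
    return np.sign(s)

print((predict(x) == k).mean())                # training accuracy
```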
162. Benchmarking the Classifiers
o Regression
o Naive Bayes
o Gaussian
o SVM
o Nearest Neighbors
o Decision Tree
o Random Forest
o AdaBoost
o xgboost
o Neural Net
195. Neyman-Pearson approach
Limitations of the Bayesian approach

$x \in X$: feature
$k$: hidden state
$p(k)$: prior probability of the state
$p(x \mid k)$: conditional probability of observing $x$ in state $k$
$W: K \times K \to R$: penalty function
197. Neyman-Pearson approach
Limitations of the Bayesian approach: example

$x \in X$: heartbeat
$k$: state, normal or sick
$p(x \mid k)$: distribution of the heartbeat for normal and for sick
$p(k)$: ??? (the prior is unknown)
$W(\text{normal}, \text{sick})$, $W(\text{sick}, \text{normal})$: ??? (the penalties are hard to define)
210. Missing values
Common options (a short pandas sketch follows this list):
o Delete missing rows
o Replace with mean/median
o Encode as another level of a categorical variable
o Create a new engineered feature
o Hot deck
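A brief pandas sketch of a few of the options above; the small DataFrame is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "age": [25, None, 40, 35],
    "city": ["A", "B", None, "A"],
})

dropped = df.dropna()                                   # delete rows with missing values
filled = df.fillna({"age": df["age"].median()})         # replace with the median
encoded = df.assign(city=df["city"].fillna("Missing"))  # missing as another category level
flagged = df.assign(age_missing=df["age"].isna())       # new engineered feature: missing flag

print(dropped, filled, encoded, flagged, sep="\n\n")
```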