This deck covers the following algorithms at a high level: Supervised Learning and Classification, Training Discriminative Classifiers, the Representer Theorem, Support Vector Machines, Logistic Regression, Generative Classifiers (Naive Bayes), Deep Learning, and Tree Ensembles.
5. Training Discriminative Classifiers
f̂ = argmin_f ∑_{i=1}^{n} ℓ(f(𝒙_i), y_i) + ℊ(∥f∥)
• The second term is the “regularization” term
• A common form for 𝑓(𝒙) is 𝐰′𝒙 (linear classifier)
• ℓ(w′x, y) is a “convexified” loss
• Besides discriminative classifiers, generative classifiers also exist
• e.g., naïve Bayes
Classifier                Loss function (y ∈ {±1})
support vector machine    max(0, 1 − y w′x)
logistic regression       log[1 + exp(−y w′x)]
adaboost                  exp(−y w′x)
square loss               (1 − y w′x)²
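As a concrete companion to the table, here is a small Python/NumPy sketch (illustrative only, not part of the deck) that evaluates each surrogate loss as a function of the margin m = y·w′x:

```python
import numpy as np

def zero_one_loss(m):
    # 1 if the example is misclassified (margin <= 0), else 0
    return (m <= 0).astype(float)

def hinge_loss(m):
    # SVM surrogate: max(0, 1 - m)
    return np.maximum(0.0, 1.0 - m)

def log_loss(m):
    # logistic regression surrogate: log(1 + exp(-m))
    return np.log1p(np.exp(-m))

def exp_loss(m):
    # AdaBoost surrogate: exp(-m)
    return np.exp(-m)

def squared_loss(m):
    # square loss: (1 - m)^2
    return (1.0 - m) ** 2

# margins over the same range the loss-vs-margin figure uses
m = np.linspace(-1.5, 2.0, 8)
for name, fn in [("0-1", zero_one_loss), ("hinge", hinge_loss),
                 ("log", log_loss), ("exp", exp_loss), ("squared", squared_loss)]:
    print(name, np.round(fn(m), 3))
```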
Algorithms for Direct 0–1 Loss Optimization
[Figure: loss value vs. margin m = y w′x for the 0−1, hinge, log, and squared losses]
6. Representer Theorem*
• If ℊ is a real-valued, monotonically increasing function defined on [0, ∞)
• and if ℓ takes values in ℝ ∪ {∞}, then the minimizer admits the form
f(𝒙) = ∑_{i=1}^{n} α_i 𝒙′𝒙_i
• In particular, neither convexity nor differentiability is necessary
• But they do help with the optimization
• Especially when using gradient-based methods
*“A Generalized Representer Theorem” by Scholkopf, Herbrich and Smola in COLT 2001
☨“When is there a Representer Theorem?” by Argyriou, Micchelli and Pontil in JMLR 2009
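To make the conclusion concrete, prediction under the representer theorem only needs kernel evaluations against the training points. The Python sketch below is a minimal illustration with placeholder data and dual coefficients (the α_i would normally come from solving the regularized problem); the linear kernel matches the 𝒙′𝒙_i form above.

```python
import numpy as np

def linear_kernel(x, z):
    # k(x, z) = x'z, the inner product appearing on this slide
    return x @ z

def predict(x, X_train, alpha, kernel=linear_kernel):
    # f(x) = sum_i alpha_i * k(x, x_i), the form guaranteed by the representer theorem
    return sum(a * kernel(x, xi) for a, xi in zip(alpha, X_train))

# toy data; in practice the alpha_i come from solving the regularized problem
X_train = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
alpha = np.array([0.5, -0.25, 0.1])               # placeholder dual coefficients
print(predict(np.array([2.0, 3.0]), X_train, alpha))
```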
7. Binary Class Support Vector Machines
min_w ∑_{i=1}^{n} max(0, 1 − y_i 𝒘′𝒙_i) + (λ/2) 𝒘′𝒘
• Expressed in standard form:
min_{w,ξ} ∑_{i=1}^{n} ξ_i + (λ/2) 𝒘′𝒘
s.t. y_i 𝒘′𝒙_i ≥ 1 − ξ_i ∀i
     ξ_i ≥ 0 ∀i
• Lagrangian (𝛼𝑖, 𝛽𝑖 ≥ 0):
ℒ = ∑_i ξ_i + (λ/2) 𝒘′𝒘 + ∑_i α_i (1 − y_i 𝒘′𝒙_i − ξ_i) − ∑_i β_i ξ_i
∂ℒ/∂𝒘 = 0:  𝒘 = (1/λ) ∑_i α_i y_i 𝒙_i
∂ℒ/∂ξ_i = 0:  1 = α_i + β_i  ∀i
8. Binary SVM: Dual Formulation
max_α ∑_i α_i − (1/(2λ)) ∑_i ∑_j α_i α_j y_i y_j 𝒙_i′𝒙_j
s.t. 0 ≤ α_i ≤ 1 ∀i
• Convex Quadratic Program
• Optimization algorithms such as Platt’s SMO* exist
• Also possible to optimize the primal directly (l2-svm.dml, next slide)
• Kernel trick:
• Redefine inner product K(xi, xj)
• Projects data into a space 𝜙(𝒙) where classes may be separable
• Well-known kernels: radial basis functions, polynomial kernel (a small sketch follows below)
*“Sequential Minimal Optimization: A Fast Algorithm for Training Support Vector Machines” by Platt, Tech Report 1998.
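As a minimal illustration of the kernel trick (not taken from the deck), the two well-known kernels mentioned above can be computed as Gram matrices and substituted for 𝒙_i′𝒙_j in the dual; the gamma, degree, and coef0 values are placeholder parameters.

```python
import numpy as np

def rbf_kernel(X, Z, gamma=1.0):
    # K[i, j] = exp(-gamma * ||x_i - z_j||^2)  (radial basis function kernel)
    sq_dists = (np.sum(X ** 2, axis=1)[:, None]
                + np.sum(Z ** 2, axis=1)[None, :]
                - 2.0 * X @ Z.T)
    return np.exp(-gamma * sq_dists)

def polynomial_kernel(X, Z, degree=3, coef0=1.0):
    # K[i, j] = (x_i'z_j + coef0)^degree
    return (X @ Z.T + coef0) ** degree

# the resulting Gram matrix replaces x_i'x_j in the dual objective
X = np.random.randn(5, 3)
print(rbf_kernel(X, X).shape)          # (5, 5)
print(polynomial_kernel(X, X).shape)   # (5, 5)
```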
9. Binary SVM in DML
min_w λ 𝒘′𝒘 + ∑_{i=1}^{n} max(0, 1 − y_i 𝒘′𝒙_i)²
• Solve for 𝒘 directly using:
• Non-linear conjugate gradient descent
• Newton’s method to determine step size
• Most complex operation in the script
• Matrix-vector product
• Incremental maintenance using vector-vector operations
[DML script excerpt, annotated: matrix-vector products, the Fletcher-Reeves formula, and a 1D Newton method to determine the step size; a rough re-creation is sketched below]
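The sketch below is a rough Python/NumPy re-creation of the strategy the slide attributes to l2-svm.dml, not the script itself: nonlinear conjugate gradient on the squared-hinge primal, the Fletcher-Reeves update, and a single 1D Newton step to pick the step size. Data, hyperparameters, and stopping rules are illustrative; a production script would add more careful line search and convergence checks.

```python
import numpy as np

def l2svm_cg(X, y, lam=1.0, max_iter=100, tol=1e-6):
    """Squared-hinge (L2) SVM primal
         min_w  lam * w'w + sum_i max(0, 1 - y_i * w'x_i)^2
       via nonlinear conjugate gradient with a 1D Newton step size."""
    def grad(w):
        m = 1.0 - y * (X @ w)                  # margins (matrix-vector product)
        active = m > 0                         # margin-violating examples
        return 2.0 * lam * w - 2.0 * X[active].T @ (m[active] * y[active])

    w = np.zeros(X.shape[1])
    g = grad(w)
    direction = -g
    for _ in range(max_iter):
        # 1D Newton step along `direction`:
        #   phi'(0) = direction'g,  phi''(0) ~ 2*lam*d'd + 2*sum_active (d'x_i)^2
        Xd = X @ direction                     # matrix-vector product
        active = (1.0 - y * (X @ w)) > 0
        curvature = 2.0 * lam * (direction @ direction) + 2.0 * np.sum(Xd[active] ** 2)
        step = -(direction @ g) / max(curvature, 1e-12)
        w = w + step * direction

        g_new = grad(w)
        if np.linalg.norm(g_new) < tol:
            break
        beta = (g_new @ g_new) / (g @ g)       # Fletcher-Reeves formula
        direction = -g_new + beta * direction
        g = g_new
    return w

# toy usage with labels in {-1, +1}
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = np.where(X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) >= 0, 1.0, -1.0)
w = l2svm_cg(X, y, lam=0.1)
print("training accuracy:", np.mean(np.sign(X @ w) == y))
```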
10. Multi-Class SVM in DML
• At least 3 different ways to define multi-class SVMs
• One-against-the-rest* (OvA)
• Pairwise (or one-against-one)
• Crammer-Singer SVM☨
• OvA multi-class SVM
• Each binary class SVM learnt in parallel
• Inner body uses l2-svm’s approach
*“In Defense of One-vs-All Classification” by Rifkin and Klautau in JMLR 2004
☨“On the Algorithmic Implementation of Multiclass Kernel-Based Vector Machines” by Crammer and Singer in JMLR 2002
[DML script excerpt: parallel for loop over the classes; a rough sketch follows below]
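The one-against-the-rest scheme itself is simple: relabel one class as +1 and the rest as −1, and train the binary problems independently, which is what the parallel for loop exploits. The sketch below is illustrative Python, not the DML script; the binary trainer is a plain gradient-descent placeholder standing in for the l2-svm approach of the previous slide.

```python
import numpy as np

def train_binary(X, y, lam=0.1, lr=0.01, epochs=300):
    # placeholder binary trainer: plain gradient descent on the squared hinge loss
    # (the DML script instead reuses the conjugate-gradient approach of l2-svm.dml)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        m = 1.0 - y * (X @ w)
        active = m > 0
        g = 2.0 * lam * w - 2.0 * X[active].T @ (m[active] * y[active])
        w -= lr * g / len(y)
    return w

def train_one_vs_all(X, labels):
    # one independent "class k vs. rest" problem per class; these are the
    # iterations the parallel for loop in the DML script runs concurrently
    classes = np.unique(labels)
    W = np.column_stack([train_binary(X, np.where(labels == k, 1.0, -1.0))
                         for k in classes])
    return classes, W

def predict_one_vs_all(X, classes, W):
    # predict the class whose binary model scores highest
    return classes[np.argmax(X @ W, axis=1)]

# toy usage with three synthetic classes
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
labels = np.argmax(X[:, :3], axis=1)
classes, W = train_one_vs_all(X, labels)
print("training accuracy:", np.mean(predict_one_vs_all(X, classes, W) == labels))
```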
11. Logistic Regression
max_w −∑_{i=1}^{n} log(1 + exp(−y_i 𝒘′𝒙_i)) − (λ/2) 𝒘′𝒘
• To derive the dual form use the following bound*:
log[1 / (1 + exp(−y 𝒘′𝒙))] ≤ min_α [α y 𝒘′𝒙 − H(α)]
where 0 ≤ α ≤ 1 and H(α) = −α log(α) − (1 − α) log(1 − α)
• Substituting:
max_w min_α ∑_{i=1}^{n} [α_i y_i 𝒘′𝒙_i − H(α_i)] − (λ/2) 𝒘′𝒘   s.t. 0 ≤ α_i ≤ 1 ∀i
∂ℒ/∂𝒘 = 0:  𝒘 = (1/λ) ∑_i α_i y_i 𝒙_i
• Dual form:
min_α (1/(2λ)) ∑_i ∑_j α_i α_j y_i y_j 𝒙_i′𝒙_j − ∑_i H(α_i)   s.t. 0 ≤ α_i ≤ 1 ∀i
• Apply kernel trick to obtain kernelized logistic regression
*“Probabilistic Kernel Regression Models” by Jaakkola and Haussler in AISTATS 1999
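A quick numerical sanity check (illustrative only) of the bound the dual derivation above rests on: for any z = y·w′x, log[1/(1 + exp(−z))] coincides with the minimum over α ∈ [0, 1] of α·z − H(α), which the grid search below confirms to several decimals.

```python
import numpy as np

def binary_entropy(a):
    # H(a) = -a*log(a) - (1 - a)*log(1 - a), with H(0) = H(1) = 0
    a = np.clip(a, 1e-12, 1.0 - 1e-12)
    return -a * np.log(a) - (1.0 - a) * np.log(1.0 - a)

def log_sigmoid(z):
    # log[1 / (1 + exp(-z))], the left-hand side of the bound
    return -np.log1p(np.exp(-z))

alphas = np.linspace(0.0, 1.0, 100001)
for z in [-2.0, -0.5, 0.0, 1.0, 3.0]:
    rhs = np.min(alphas * z - binary_entropy(alphas))   # min_alpha [alpha*z - H(alpha)]
    print(f"z = {z:5.1f}   lhs = {log_sigmoid(z):.6f}   rhs = {rhs:.6f}")
```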
12. Multiclass Logistic Regression
• Also called softmax regression or multinomial logistic regression
• W is now a matrix of weights; the jth column contains the jth class's weights
Pr(y = j | 𝒙) = exp(𝒘_j′𝒙) / ∑_k exp(𝒘_k′𝒙)
min_W (λ/2) ∥W∥² + ∑_{i=1}^{n} [log(Z_i) − 𝒙_i′𝒘_{y_i}]   where Z_i = 1′ exp(W′𝒙_i) = ∑_j exp(𝒘_j′𝒙_i)
• The DML script is called MultiLogReg.dml
• Uses trust-region Newton method to learn the weights*
• Care needs to be taken because softmax is an over-parameterized function
*See regression class’s slides on ibm.biz/AlmadenML
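A small Python sketch (placeholder data, not MultiLogReg.dml) of the softmax probabilities and the regularized objective above; subtracting the per-row maximum before exponentiating guards against overflow, and pinning one class's weight column to zero is one common way of handling the over-parameterization.

```python
import numpy as np

def softmax_probs(X, W):
    # Pr(y = j | x) = exp(w_j'x) / sum_k exp(w_k'x); column j of W holds class j's weights
    scores = X @ W
    scores = scores - scores.max(axis=1, keepdims=True)   # numerical stability
    e = np.exp(scores)
    return e / e.sum(axis=1, keepdims=True)

def objective(X, y, W, lam=1.0):
    # (lam/2) * ||W||^2 + sum_i [log(Z_i) - x_i' w_{y_i}],  Z_i = sum_j exp(w_j'x_i)
    scores = X @ W
    shift = scores.max(axis=1)
    log_Z = shift + np.log(np.exp(scores - shift[:, None]).sum(axis=1))
    correct = scores[np.arange(len(y)), y]
    return 0.5 * lam * np.sum(W ** 2) + np.sum(log_Z - correct)

# toy usage: 3 classes; the last class's weights are pinned to zero to remove
# the redundant degree of freedom of the over-parameterized softmax
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 4))
y = rng.integers(0, 3, size=10)
W = np.column_stack([rng.normal(size=4), rng.normal(size=4), np.zeros(4)])
print(softmax_probs(X, W)[0])
print(objective(X, y, W))
```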
15. Deep Learning: Autoencoders
• Designed to discover the hidden subspace in which the data “lives”
• Layer-wise pretraining helps*
• Many of these can be stacked together
• Final layer is usually softmax (for classification)
• Weights may be tied or not, the output layer may or may not have a non-linear activation function; many options☨
*“A fast learning algorithm for deep belief nets” by Hinton, Osindero and Teh in Neural Computation 2006
☨“On Optimization Methods for Deep Learning” by V. Le et al. in ICML 2011
[Figure: autoencoder network with an input layer, a hidden layer, and an output layer]
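The sketch below is a toy single-hidden-layer autoencoder with tied weights, trained by plain gradient descent on the squared reconstruction error. It is purely illustrative: the architecture, activation choices, and hyperparameters are assumptions, and it does no layer-wise pretraining.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_tied_autoencoder(X, k=2, lr=0.05, epochs=1000, seed=0):
    """Hidden layer h = sigmoid(x W + b1); linear output x_hat = h W' + b2.
       Encoder and decoder share ('tie') the same weight matrix W."""
    n, d = X.shape
    rng = np.random.default_rng(seed)
    W = 0.1 * rng.normal(size=(d, k))
    b1, b2 = np.zeros(k), np.zeros(d)
    for _ in range(epochs):
        H = sigmoid(X @ W + b1)                # encode
        X_hat = H @ W.T + b2                   # decode with the tied weights
        dX_hat = (X_hat - X) / n               # gradient of the mean squared error
        dH = dX_hat @ W
        dZ = dH * H * (1.0 - H)                # sigmoid derivative
        W -= lr * (dX_hat.T @ H + X.T @ dZ)    # decoder + encoder contributions
        b1 -= lr * dZ.sum(axis=0)
        b2 -= lr * dX_hat.sum(axis=0)
    return W, b1, b2

# toy usage: 5-dimensional data that actually lives on a 2-dimensional subspace
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 2)) @ rng.normal(size=(2, 5))
W, b1, b2 = train_tied_autoencoder(X, k=2)
X_hat = sigmoid(X @ W + b1) @ W.T + b2
print("reconstruction MSE:", np.mean((X_hat - X) ** 2))
```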
17. Decision Tree (Classification)
• Simple and easy to understand model for classification
• More interpretable results than other classifiers
• Recursively partitions training data until examples in each partition belong to one class or the partition becomes small enough
• Splitting tests for choosing feature j (x_j: jth feature value of 𝒙)
• Numerical: x_j < σ
• Categorical: x_j ∈ S where S ⊆ domain of feature j
• Measuring node impurity 𝒥:
• Entropy: ∑_i −f_i log(f_i)
• Gini: ∑_i f_i (1 − f_i)
• To find the best split across features use information gain:
argmax over candidate splits of  𝒥(X) − (|X_left| / |X|) 𝒥(X_left) − (|X_right| / |X|) 𝒥(X_right)
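The impurity measures and the information-gain criterion above, written out as a short Python sketch (illustrative, not the DML implementation); it scores the numerical test x_j < σ for every candidate threshold and returns the best one.

```python
import numpy as np

def impurity(labels, measure="entropy"):
    # impurity J of a node from its class frequencies f_i
    _, counts = np.unique(labels, return_counts=True)
    f = counts / counts.sum()
    if measure == "entropy":
        return float(np.sum(-f * np.log(f)))      # sum_i -f_i log(f_i)
    return float(np.sum(f * (1.0 - f)))           # Gini: sum_i f_i (1 - f_i)

def information_gain(labels, left_mask, measure="entropy"):
    # J(X) - |X_left|/|X| * J(X_left) - |X_right|/|X| * J(X_right)
    n, n_left = len(labels), int(left_mask.sum())
    n_right = n - n_left
    if n_left == 0 or n_right == 0:
        return 0.0
    return (impurity(labels, measure)
            - (n_left / n) * impurity(labels[left_mask], measure)
            - (n_right / n) * impurity(labels[~left_mask], measure))

def best_numerical_split(x_j, labels, measure="entropy"):
    # score the test x_j < sigma for every candidate threshold sigma
    best_gain, best_sigma = 0.0, None
    for sigma in np.unique(x_j):
        gain = information_gain(labels, x_j < sigma, measure)
        if gain > best_gain:
            best_gain, best_sigma = gain, sigma
    return best_gain, best_sigma

# toy usage: the label changes at x = 0.5, so the best threshold is sigma = 0.6
x_j = np.array([0.1, 0.2, 0.4, 0.6, 0.8, 0.9])
labels = np.array([0, 0, 0, 1, 1, 1])
print(best_numerical_split(x_j, labels))
```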
18. Decision Tree in DML
• Tree construction*:
• Breadth-first expansion for nodes in top level
• Depth-first expansion for nodes in lower levels
• Input data needs to be transformed (dummy coded)
• Can control complexity of the tree (pruning, early stopping)
*“PLANET: Massively Parallel Learning of Tree Ensembles with MapReduce” by Panda, Herbach, Basu and Bayardo in VLDB 2009
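As a small illustration of the dummy-coding transform mentioned above (not the actual transform used with the DML script), a categorical column becomes one 0/1 indicator column per category:

```python
import numpy as np

def dummy_code(column):
    # turn a categorical column into one 0/1 indicator column per category
    categories = np.unique(column)
    return (column[:, None] == categories[None, :]).astype(float), categories

colors = np.array(["red", "green", "blue", "green", "red"])
encoded, categories = dummy_code(colors)
print(categories)   # ['blue' 'green' 'red']
print(encoded)      # 5 x 3 indicator matrix
```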