In this talk we discuss the connections between (Supervised) Learning and Mathematical Optimization. Topics include iterative algorithms, search directions and stepsizes. The talk was held at the Computer Science, Machine Learning and Statistics Meetup Hamburg.
1. The Foundations of (Machine) Learning
Stefan Kühn
codecentric AG
CSMLS Meetup Hamburg - February 26th, 2015
4. Setting
Supervised Learning Approach
Use labeled training data in order to fit a given model to the data, i.e. to
learn from the given data.
Typical Problems:
Classification - discrete output
Logistic Regression
Neural Networks
Support Vector Machines
Regression - continuous output
Linear Regression
Support Vector Regression
Generalized Linear/Additive Models
5. Training and Learning
Ingredients:
Training Data Set
Model, e.g. Logistic Regression
Error Measure, e.g. Mean Squared Error
Learning Procedure:
Derive objective function from Model and Error Measure
Initialize Model parameters
Find a good fit!
Iterate with other initial parameters
What is Learning in this context?
Learning is nothing but the application of an algorithm for unconstrained
optimization to the given objective function.
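To make the recipe concrete, here is a minimal Python sketch (my own illustration, not from the slides): linear regression as the model, mean squared error as the error measure, and plain gradient descent as the optimization algorithm. The data and all names are invented for the example.

```python
import numpy as np

# Training data set: a synthetic regression problem (assumption for illustration)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

# Model: linear regression with parameters w
def model(w, X):
    return X @ w

# Error measure: mean squared error, and its gradient
def mse(w):
    return np.mean((model(w, X) - y) ** 2)

def mse_gradient(w):
    return 2.0 / len(y) * X.T @ (model(w, X) - y)

# Learning = unconstrained optimization of the derived objective
w = np.zeros(3)                     # initialize model parameters
for _ in range(500):                # iterate: plain gradient descent
    w -= 0.1 * mse_gradient(w)      # fixed stepsize; see the later slides for line search
print(w)                            # close to [1.5, -2.0, 0.5]
```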
7. Unconstrained Optimization
Higher-order methods
Newton’s method (fast local convergence)
Gradient-based methods
Gradient Descent / Steepest Descent (globally convergent)
Conjugate Gradient (globally convergent)
Gauß-Newton, Levenberg-Marquardt, Quasi-Newton
Krylov subspace methods
Derivative-free methods, direct search
Secant method (locally convergent)
Regula Falsi and successors (global convergence, typically slow)
Nelder-Mead / Downhill-Simplex
unconventional method, creates a moving simplex
driven by reflection/contraction/expansion of the corner points
globally convergent for differentiable functions f ∈ C¹
8. General Iterative Algorithmic Scheme
Goal: Minimize a given function f :
min f(x), x ∈ Rⁿ
Iterative Algorithms
Starting from a given point an iterative algorithm tries to minimize the
objective function step by step.
Preparation: k = 0
Initialization: Choose initial points and parameters
Iterate until convergence: k = 1, 2, 3, . . .
Termination criterion: Check optimality of the current iterate
Descent Direction: Find reasonable search direction
Stepsize: Determine length of the step in the given direction
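A generic version of this scheme as a Python sketch (my own illustration; the direction and stepsize choices here are simple placeholders for the methods discussed on the following slides):

```python
import numpy as np

def iterative_minimize(f, grad, x0, tol=1e-6, maxiter=1000):
    """Generic iterative scheme: check optimality, pick a direction, pick a stepsize."""
    x = x0.astype(float)                     # initialization, k = 0
    for k in range(1, maxiter + 1):          # iterate: k = 1, 2, 3, ...
        g = grad(x)
        if np.linalg.norm(g) < tol:          # termination criterion
            break
        d = -g                               # descent direction (here: steepest descent)
        t = 1.0                              # stepsize (placeholder; see Armijo's rule later)
        for _ in range(50):                  # crude halving until the step decreases f
            if f(x + t * d) < f(x):
                break
            t *= 0.5
        x = x + t * d
    return x

# Usage on a simple quadratic
print(iterative_minimize(lambda x: np.sum(x ** 2), lambda x: 2 * x, np.array([3.0, -4.0])))
```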
9. Termination criteria
Critical points x∗: ∇f(x∗) = 0
Gradient: Should converge to zero
‖∇f(x∗)‖ < tol
Iterates: Distance between xk and xk+1 should converge to zero
‖xk − xk+1‖ < tol
Function Values: Difference between f(xk) and f(xk+1) should converge to zero
|f(xk) − f(xk+1)| < tol
Number of iterations: Terminate after maxiter iterations
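These criteria translate directly into a small helper, sketched here in Python (my own illustration; names and tolerances are placeholders):

```python
import numpy as np

def converged(grad_xk, xk, xk_next, f_xk, f_xk_next, k, tol=1e-6, maxiter=1000):
    """Return True if any of the four termination criteria is met."""
    return (np.linalg.norm(grad_xk) < tol            # gradient close to zero
            or np.linalg.norm(xk - xk_next) < tol    # iterates barely move
            or abs(f_xk - f_xk_next) < tol           # function values stagnate
            or k >= maxiter)                         # iteration budget exhausted
```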
11. Descent direction
Geometric interpretation
d is a descent direction if and only if the angle α between the gradient ∇f(x) and d is in a certain range:
π/2 = 90◦ < α < 270◦ = 3π/2
Algebraic equivalent
The sign of the scalar product between two vectors a and b is determined by the cosine of the angle α between a and b:
⟨a, b⟩ = aᵀb = ‖a‖ ‖b‖ cos α(a, b)
d is a descent direction if and only if:
dᵀ∇f(x) < 0
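The algebraic criterion is a one-line check; a Python sketch (my own illustration):

```python
import numpy as np

def is_descent_direction(grad_x, d):
    """d is a descent direction at x iff the directional derivative dᵀ∇f(x) is negative."""
    return float(np.dot(d, grad_x)) < 0.0

# Example: the negative gradient is always a descent direction (as long as it is nonzero)
g = np.array([2.0, -1.0])
print(is_descent_direction(g, -g))  # True
```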
13. Stepsize
Armijo’s rule
Takes two parameters 0 < σ < 1 and 0 < ρ < 0.5
For ℓ = 0, 1, 2, . . . test the Armijo condition:
f(p + σ^ℓ d) < f(p) + ρ σ^ℓ dᵀ∇f(p)
Accepted stepsize
The first ℓ that passes this test determines the accepted stepsize
t = σ^ℓ
Standard Armijo implies that the accepted stepsize always satisfies t ≤ 1, which is only semi-efficient.
Technical detail: Widening
Test whether some t > 1 satisfies the Armijo condition, i.e. check ℓ = −1, −2, . . . as well; this ensures efficiency.
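A minimal Python sketch of Armijo's rule with widening (my own illustration; the defaults σ = 0.5 and ρ = 0.01 are common choices, not taken from the slides):

```python
import numpy as np

def armijo_stepsize(f, grad_p, p, d, sigma=0.5, rho=1e-2, max_tries=30):
    """Armijo backtracking with widening: find t = sigma**l satisfying the Armijo condition."""
    fp = f(p)
    slope = rho * np.dot(d, grad_p)          # rho * dᵀ∇f(p), negative for a descent direction

    def accepted(t):
        return f(p + t * d) < fp + t * slope

    if accepted(1.0):                        # l = 0 already works:
        t = 1.0
        for _ in range(max_tries):           # widening: l = -1, -2, ..., i.e. stepsizes t > 1
            if not accepted(t / sigma):
                break
            t /= sigma
        return t
    t = sigma                                # otherwise backtrack: l = 1, 2, ...
    for _ in range(max_tries):
        if accepted(t):
            return t
        t *= sigma
    return t                                 # fall back to the last (tiny) stepsize

# Example: quadratic f, steepest-descent direction at p
f = lambda x: np.sum(x ** 2)
p = np.array([2.0, 0.0]); g = 2 * p
print(armijo_stepsize(f, g, p, -g))          # 0.5
```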
15. Gradient Descent
Descent direction
Direction of Steepest Descent, the negative gradient:
d = −∇f(x)
Motivation:
corresponds to α = 180◦ = π
obvious choice, always a descent direction, no test needed
guarantees the quickest win locally
works with inexact line search, e.g. Armijo's rule
works for functions f ∈ C¹
always solves auxiliary optimization problem
min sᵀ∇f(x), s ∈ Rⁿ, ‖s‖ = 1
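Putting direction and stepsize together, a self-contained steepest-descent sketch in Python (my own illustration, using simple Armijo-style backtracking instead of a full line-search implementation):

```python
import numpy as np

def gradient_descent(f, grad, x0, tol=1e-6, maxiter=1000, sigma=0.5, rho=1e-2):
    """Steepest descent with Armijo backtracking."""
    x = x0.astype(float)
    for _ in range(maxiter):
        g = grad(x)
        if np.linalg.norm(g) < tol:          # termination: gradient close to zero
            break
        d = -g                               # direction of steepest descent
        t, fx = 1.0, f(x)
        while f(x + t * d) >= fx + rho * t * np.dot(d, g) and t > 1e-12:
            t *= sigma                       # Armijo backtracking
        x = x + t * d
    return x

# Usage: an ill-conditioned quadratic
f = lambda x: x[0] ** 2 + 10 * x[1] ** 2
grad = lambda x: np.array([2 * x[0], 20 * x[1]])
print(gradient_descent(f, grad, np.array([3.0, 1.0])))  # close to [0, 0]
```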
16. Conjugate Gradient
Motivation: Quadratic Model Problem, minimize
f(x) = ‖Ax − b‖²
Optimality condition:
∇f(x∗) = 2Aᵀ(Ax∗ − b) = 0
Obvious approach: Solve system of linear equations
AᵀAx = Aᵀb
Descent direction
Consecutive directions di, . . . , di+k satisfy certain orthogonality or conjugacy conditions, M = AᵀA symmetric positive definite:
diᵀ M dj = 0, i ≠ j
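A numerical check of the conjugacy condition using the classical linear CG iteration on the normal equations (my own Python sketch; the random test matrix is invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(20, 5))
b = rng.normal(size=20)
M, rhs = A.T @ A, A.T @ b                # normal equations AᵀAx = Aᵀb, M symmetric positive definite

x = np.zeros(5)
r = rhs - M @ x                          # residual; equals −½ ∇f(x) for f(x) = ‖Ax − b‖²
d = r.copy()
directions = []
for _ in range(5):                       # at most n = 5 steps in exact arithmetic
    directions.append(d.copy())
    Md = M @ d
    t = (r @ r) / (d @ Md)               # exact stepsize along d for the quadratic
    x += t * d
    r_new = r - t * Md
    beta = (r_new @ r_new) / (r @ r)     # update factor for the next conjugate direction
    d = r_new + beta * d
    r = r_new

D = np.array(directions)
print(np.round(D @ M @ D.T, 6))          # off-diagonal entries ≈ 0: diᵀ M dj = 0 for i ≠ j
print(np.allclose(M @ x, rhs))           # True: x solves AᵀAx = Aᵀb
```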
17. Nonlinear Conjugate Gradient
Initial Steps:
start at point x0 with d0 = −∇f(x0)
perform exact line search, find
t0 = arg min f(x0 + t d0), t > 0
set x1 = x0 + t0 d0.
Iteration:
set ∆k = −∇f(xk)
compute βk via one of the available formulas (next slide)
update conjugate search direction dk = ∆k + βk dk−1
perform exact line search, find
tk = arg min f(xk + t dk), t > 0
set xk+1 = xk + tk dk
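The exact line search can be approximated by any scalar minimizer; a possible sketch using SciPy (an assumption on tooling, the slides do not prescribe an implementation, and the bound t_max is invented):

```python
import numpy as np
from scipy.optimize import minimize_scalar

def exact_line_search(f, x, d, t_max=10.0):
    """Approximate tk = arg min over t in (0, t_max) of f(xk + t dk) by bounded scalar minimization."""
    res = minimize_scalar(lambda t: f(x + t * d), bounds=(0.0, t_max), method="bounded")
    return res.x
```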
18. Nonlinear Conjugate Gradient
Formulas for βk:
Fletcher-Reeves: βk^FR = (∆kᵀ ∆k) / (∆k−1ᵀ ∆k−1)
Polak-Ribière: βk^PR = (∆kᵀ (∆k − ∆k−1)) / (∆k−1ᵀ ∆k−1)
Hestenes-Stiefel: βk^HS = − (∆kᵀ (∆k − ∆k−1)) / (sk−1ᵀ (∆k − ∆k−1))
Dai-Yuan: βk^DY = − (∆kᵀ ∆k) / (sk−1ᵀ (∆k − ∆k−1))
Reasonable choice with automatic direction reset:
β = max(0, β^PR)
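A compact nonlinear CG sketch using the Polak-Ribière formula with this reset (my own illustration; it replaces the exact line search by simple backtracking and adds a steepest-descent safeguard, both implementation choices not taken from the slides):

```python
import numpy as np

def nonlinear_cg(f, grad, x0, tol=1e-8, maxiter=2000):
    """Nonlinear CG with the Polak-Ribière formula and automatic reset β = max(0, β^PR)."""
    x = x0.astype(float)
    delta = -grad(x)                         # ∆0 = −∇f(x0)
    d = delta.copy()                         # d0 = ∆0
    for _ in range(maxiter):
        if np.linalg.norm(delta) < tol:      # termination: gradient close to zero
            break
        if np.dot(d, delta) <= 0:            # safeguard: fall back to steepest descent
            d = delta.copy()
        t, fx = 1.0, f(x)                    # backtracking (Armijo-type) line search
        while f(x + t * d) >= fx - 1e-4 * t * np.dot(d, delta) and t > 1e-12:
            t *= 0.5
        x = x + t * d
        delta_new = -grad(x)
        beta = max(0.0, delta_new @ (delta_new - delta) / (delta @ delta))  # max(0, β^PR)
        d = delta_new + beta * d
        delta = delta_new
    return x

# Sanity check on the quadratic model problem f(x) = ‖Ax − b‖² from the previous slides
rng = np.random.default_rng(1)
A = rng.normal(size=(20, 5)); b = rng.normal(size=20)
f = lambda x: np.sum((A @ x - b) ** 2)
grad = lambda x: 2 * A.T @ (A @ x - b)
x_cg = nonlinear_cg(f, grad, np.zeros(5))
print(np.allclose(A.T @ A @ x_cg, A.T @ b))  # True: solves the normal equations
```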