1. Incremental Learning-to-Learn with Statistical Guarantees
Massimiliano Pontil
Istituto Italiano di Tecnologia and University College London
Joint work with Giulia Denevi, Carlo Ciliberto, Dimitris Stamos
Workshop on Operator Splitting Methods in Data Analysis
SAMSI, Raleigh, NC, USA
March 21–23, 2018
3. Supervised Learning
Supervised learning problem (task): a probability distribution µ on Z = X × Y, with X ⊆ R^d and Y ⊆ R
A learning algorithm is a mapping A : ∪_{n∈N} Z^n → R^d, z ↦ A(z)
Risk: R_µ(w) = E_{(x,y)∼µ} (⟨w, x⟩ − y)²
Example (Ridge Regression):
A(z) = argmin_{w∈R^d} (1/n) Σ_{i=1}^n (y_i − ⟨w, x_i⟩)² + λ‖w‖²
where the first term is the empirical risk R_z(w)
How to choose A?
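The ridge regression algorithm above has the closed-form solution w = (XᵀX + nλI)⁻¹Xᵀy; a minimal sketch (toy data and sizes of my own choosing) that checks it against the first-order optimality condition:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 50, 5, 0.1

# Toy dataset z = {(x_i, y_i)}_{i=1}^n
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

# Ridge regression A(z): minimize (1/n) sum_i (y_i - <w, x_i>)^2 + lam*||w||^2
# Closed form: w = (X^T X + n*lam*I)^{-1} X^T y
w = np.linalg.solve(X.T @ X + n * lam * np.eye(d), X.T @ y)

# The gradient of the objective at w should vanish
grad = (2 / n) * X.T @ (X @ w - y) + 2 * lam * w
print(np.linalg.norm(grad))  # essentially zero
```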
4. Supervised Learning
Supervised learning problem (task): a probability distribution µ on Z = X × Y, with X ⊆ R^d and Y ⊆ R
A learning algorithm is a mapping A : ∪_{n∈N} Z^n → R^d, z ↦ A(z)
Risk: R_µ(w) = E_{(x,y)∼µ} (⟨w, x⟩ − y)²
Example (Ridge Regression with a feature map):
A(z) = argmin_{w∈R^d} (1/n) Σ_{i=1}^n (y_i − ⟨w, Φx_i⟩)² + λ‖w‖²
where the first term is the empirical risk R_z(w)
How to choose A (the feature map Φ)?
5. Learning-to-Learn (LTL) Problem
[Baxter, 2000; Maurer 2009]
We wish to find a learning algorithm which works well on an
environment of tasks, captured by a meta-distribution ρ (a
“distribution over distributions”)
The performance of algorithm A is measured by the transfer risk:
E_ρ(A) = E_{µ∼ρ} E_{z∼µ^n} R_µ(A(z))
– draw a task µ ∼ ρ
– draw a sample z ∼ µ^n (the n-fold product distribution µ^⊗n)
– run the algorithm to obtain A(z)
– compute the risk of A(z) on task µ
ρ is unknown; we only observe a sequence of datasets z_1, z_2, . . .
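The four steps defining the transfer risk can be estimated by plain Monte Carlo. A sketch under an assumed toy environment (Gaussian task vectors and inputs; all names and parameters here are illustrative, not from the talk):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, lam = 5, 30, 0.1

def ridge(X, y):
    # The algorithm A(z): ridge regression on dataset z = (X, y)
    return np.linalg.solve(X.T @ X + n * lam * np.eye(d), X.T @ y)

def risk(w, w_task, n_test=2000):
    # Monte Carlo estimate of R_mu(w) = E (<w,x> - y)^2 for a linear task
    X = rng.standard_normal((n_test, d))
    y = X @ w_task + 0.1 * rng.standard_normal(n_test)
    return np.mean((X @ w - y) ** 2)

# Transfer risk E_rho(A): average over tasks mu ~ rho and datasets z ~ mu^n
vals = []
for _ in range(200):
    w_task = rng.standard_normal(d)              # draw a task mu ~ rho
    X = rng.standard_normal((n, d))              # draw a sample z ~ mu^n
    y = X @ w_task + 0.1 * rng.standard_normal(n)
    vals.append(risk(ridge(X, y), w_task))       # risk of A(z) on the task
print(np.mean(vals))  # Monte Carlo estimate of the transfer risk
```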
6. Online LTL
We wish to design a meta-algorithm which improves the underlying
algorithm over time as new datasets are observed
Need for memory efficiency: we cannot keep the datasets in memory!
We propose to minimize – via a suitable stochastic strategy – the
future empirical risk ˆE as a proxy for the transfer risk E_ρ:
ˆE(A) = E_{µ∼ρ} E_{z∼µ^n} R_z(A(z))
This is justified by statistical learning bounds [e.g. Maurer, 2009]:
E_{z∼µ^n} |R_µ(A(z)) − R_z(A(z))| ≤ G(A, n)
with G(A, ·) a measure of the complexity of A and lim_{n→∞} G(·, n) = 0
7. Linear Feature Learning
Learning algorithm: Ridge Regression with a feature map
A_Φ(z) = argmin_{w∈R^d} (1/n) Σ_{i=1}^n (y_i − ⟨w, Φx_i⟩)² + λ‖w‖²
Setting D = (1/λ) ΦᵀΦ ∈ R^{d×d}, the above problem can be equivalently formulated as
A_D(z) = argmin_{w∈ran(D)} (1/n) Σ_{i=1}^n (y_i − ⟨w, x_i⟩)² + ⟨w, D†w⟩
We wish to find a matrix with small transfer risk in the set
D_λ = {D ∈ S^d_+ : tr(D) ≤ 1/λ}
Encourages low rank solutions [Argyriou et al., 2008]
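The reformulation A_Φ ↔ A_D can be verified numerically when Φ is invertible, since the two problems then represent the same predictor via v = Φᵀw. A sketch with illustrative sizes:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 40, 4, 0.5

X = rng.standard_normal((n, d))
y = rng.standard_normal(n)
Phi = rng.standard_normal((d, d))  # an invertible feature map

# A_Phi(z): ridge regression on the transformed inputs Phi x_i
XPhi = X @ Phi.T                   # rows are (Phi x_i)^T
w_phi = np.linalg.solve(XPhi.T @ XPhi + n * lam * np.eye(d), XPhi.T @ y)

# A_D(z) with D = (1/lam) Phi^T Phi: minimize (1/n)||y - Xv||^2 + v^T D^+ v
D = Phi.T @ Phi / lam
v = np.linalg.solve(X.T @ X + n * np.linalg.pinv(D), X.T @ y)

# The two solutions give the same predictor: v = Phi^T w_phi
print(np.allclose(v, Phi.T @ w_phi))  # True
```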
8. Online LTL for Linear Feature Learning
Let L_z(D) = R_z(A_D(z))
Recall we propose to minimize the future empirical risk ˆE:
ˆE(D) = E_{µ∼ρ} E_{z∼µ^n} L_z(D)
A direct computation gives L_z(D) = n ‖(XDXᵀ + nI)⁻¹ y‖²
L_z is convex on the set of PSD matrices. In addition, if X ⊆ B₁ and Y ⊆ [0, 1], then L_z is 2-Lipschitz w.r.t. the Frobenius norm
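The direct computation rests on the closed form A_D(z) = DXᵀ(XDXᵀ + nI)⁻¹y, a standard ridge identity; a quick numerical check (toy data of my own choosing, including a low-rank D):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 20, 6, 0.5

X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

# A (possibly low-rank) PSD matrix D with tr(D) <= 1/lam
B = rng.standard_normal((d, 3))
D = B @ B.T
D *= 1 / (lam * np.trace(D))

# Closed form A_D(z) = D X^T (X D X^T + n I)^{-1} y
M = X @ D @ X.T + n * np.eye(n)
w = D @ X.T @ np.linalg.solve(M, y)

# Empirical risk of A_D(z) matches L_z(D) = n ||(X D X^T + n I)^{-1} y||^2
emp_risk = np.mean((y - X @ w) ** 2)
Lz = n * np.linalg.norm(np.linalg.solve(M, y)) ** 2
print(np.isclose(emp_risk, Lz))  # True
```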
9. Online LTL for Linear Feature Learning (cont.)
We address min_{D∈D_λ} ˆE(D) by projected stochastic gradient descent:
Input: number of tasks T, parameter λ > 0, step sizes (γ_t = 1/(λ√(2t)))_{t=1}^T
Initialization: choose D^(1) ∈ D_λ
For t = 1 to T:
    Sample µ_t ∼ ρ, z_t ∼ µ_t^n
    Update D^(t+1) = proj_{D_λ}(D^(t) − γ_t ∇L_{z_t}(D^(t)))
Return D̄_T = (1/T) Σ_{t=1}^T D^(t)
The projection can be computed in a finite number of steps in O(d³) time
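A compact sketch of the meta-algorithm on a toy environment. The gradient formula and the eigenvalue projection onto D_λ are my own derivations from the closed form L_z(D) = n‖(XDXᵀ + nI)⁻¹y‖²; the environment and all sizes are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, lam, T = 5, 30, 0.5, 100

def sample_dataset():
    # Toy environment rho: tasks share a 2-dim relevant subspace (illustrative)
    w_task = np.zeros(d)
    w_task[:2] = rng.standard_normal(2)
    X = rng.standard_normal((n, d)) / np.sqrt(d)
    y = X @ w_task + 0.1 * rng.standard_normal(n)
    return X, y

def grad_Lz(D, X, y):
    # Gradient of L_z(D) = n ||(X D X^T + n I)^{-1} y||^2 w.r.t. D
    M = X @ D @ X.T + n * np.eye(n)
    a = np.linalg.solve(M, y)       # M^{-1} y
    b = np.linalg.solve(M, a)       # M^{-2} y
    return -n * X.T @ (np.outer(a, b) + np.outer(b, a)) @ X

def proj(D):
    # Projection onto D_lam = {D PSD, tr(D) <= 1/lam} via eigen-decomposition
    s, U = np.linalg.eigh((D + D.T) / 2)
    s = np.maximum(s, 0)
    if s.sum() > 1 / lam:           # project eigenvalues onto the capped simplex
        u = np.sort(s)[::-1]
        css = np.cumsum(u) - 1 / lam
        rho = np.nonzero(u * np.arange(1, d + 1) > css)[0][-1]
        s = np.maximum(s - css[rho] / (rho + 1), 0)
    return (U * s) @ U.T

D = np.eye(d) / (lam * d)           # D^(1) in D_lam
D_bar = np.zeros((d, d))
for t in range(1, T + 1):
    D_bar += D / T                  # running average of the iterates
    X, y = sample_dataset()
    D = proj(D - grad_Lz(D, X, y) / (lam * np.sqrt(2 * t)))

print(np.trace(D_bar) <= 1 / lam + 1e-9)  # returned matrix stays in D_lam
```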
10. Statistical Analysis
Theorem 1. Let δ ∈ (0, 1]. If X ⊆ B₁, Y ⊆ [0, 1] and D̄_T is the output of the online LTL algorithm with step sizes γ_t = (λ√(2t))⁻¹, then, with probability at least 1 − δ w.r.t. the sampling of the datasets z_1, . . . , z_T,
E(D̄_T) − E(D_∗) ≤ (4√(2π) ‖C_ρ‖_∞^{1/2} / √n) · (1 + √λ)/λ + 4√2/(λ√T) + √(8 log(2/δ)/T)
where C_ρ is the total covariance of the inputs, C_ρ = E_{µ∼ρ} E_{(x,y)∼µ} [xxᵀ]
The bound is equivalent (up to constants) to a previous bound by [Maurer, 2009] for the batch case, which optimizes Σ_{t=1}^T L_{z_t}(D)
The bound improves over independent task learning when C_ρ is small and T is large
11. Statistical Analysis (cont.)
The proof uses the decomposition
E(D̄_T) − E(D_∗) = [E(D̄_T) − ˆE(D̄_T)] (A) + [ˆE(D̄_T) − ˆE(D_∗)] (B) + [ˆE(D_∗) − E(D_∗)] (C)
where D_∗ ∈ argmin_{D∈D_λ} E(D) and D̂_∗ ∈ argmin_{D∈D_λ} ˆE(D)
We control terms A and C with a uniform bound from [Maurer, 2009]
We bound term B via a regret analysis, followed by an online-to-batch conversion step [Cesa-Bianchi et al., 2004; Hazan, 2016]
12. Link to Multitask Learning (MTL)
[Argyriou et al., 2008]
Our approach is related to the MTL problem with trace norm regularization
min_{W∈R^{d×T}} (1/T) Σ_{t=1}^T R_{z_t}(w_t) + (λ/T) ‖σ(W)‖₁²   (∗)
Using ‖σ(W)‖₁² = (1/λ) inf_{D∈Int(D_λ)} Σ_{t=1}^T ⟨w_t, D⁻¹w_t⟩, we rewrite (∗) as
min_{D∈D_λ} (1/T) Σ_{t=1}^T min_{w∈ran(D)} R_{z_t}(w) + ⟨w, D†w⟩
Encourages low rank solutions!
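The variational form of the squared trace norm can be sanity-checked numerically: for full-rank W the infimum is attained at D = (WWᵀ)^{1/2}/(λ‖σ(W)‖₁), a standard argument, sketched here with toy sizes of my own choosing:

```python
import numpy as np

rng = np.random.default_rng(0)
d, T, lam = 4, 6, 0.5

W = rng.standard_normal((d, T))         # columns are the task vectors w_t

# Squared trace norm ||sigma(W)||_1^2
tn2 = np.linalg.norm(W, ord="nuc") ** 2

# Minimizer of sum_t <w_t, D^{-1} w_t> over {D PSD, tr(D) <= 1/lam}:
# D = (W W^T)^{1/2} / (lam * ||sigma(W)||_1)
s, U = np.linalg.eigh(W @ W.T)
sqrtWWT = (U * np.sqrt(np.maximum(s, 0))) @ U.T
D = sqrtWWT / (lam * np.sqrt(tn2))

# (1/lam) * sum_t <w_t, D^{-1} w_t> recovers ||sigma(W)||_1^2
val = np.trace(W.T @ np.linalg.solve(D, W)) / lam
print(np.isclose(val, tn2))  # True
```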
15. Ongoing Work / Open Questions
Explore other stochastic approaches / more efficient meta-algorithms (projecting onto D_λ requires an eigen-decomposition)
Extend to other loss functions:
min_{w∈R^d} (1/n) Σ_{i=1}^n ℓ(⟨w, x_i⟩, y_i) + ⟨w, D⁻¹w⟩
Let L_z(D) be the empirical error evaluated at the minimizer. Under which conditions is L_z(D) convex in D?
Extend to "richer" learning algorithms: Banach space setting (e.g. kernel methods) or non-convex learning algorithms (e.g. neural nets)
16. References
A. Argyriou, T. Evgeniou, and M. Pontil. Convex multi-task feature learning. Machine Learning,
73(3):243–272, 2008.
J. Baxter. A model of inductive bias learning. Journal of Artificial Intelligence Research,
12:149–198, 2000.
N. Cesa-Bianchi, A. Conconi, and C. Gentile. On the generalization ability of on-line learning
algorithms. IEEE Transactions on Information Theory, 50(9):2050–2057, 2004.
E. Hazan. Introduction to online convex optimization. Foundations and Trends in Optimization,
2016.
A. Maurer. Transfer bounds for linear feature learning. Machine Learning, 75(3):327–350, 2009.
A. Maurer, M. Pontil, and B. Romera-Paredes. The benefit of multitask representation learning.
The Journal of Machine Learning Research, 17(1):2853–2884, 2016.