1. Incremental Learning-to-Learn with Statistical Guarantees
Massimiliano Pontil
Istituto Italiano di Tecnologia and University College London
Joint work with Giulia Denevi, Carlo Ciliberto, Dimitris Stamos
Workshop on Operator Splitting Methods in Data Analysis
SAMSI, Raleigh, NC, USA
March 21–23, 2018
3. Supervised Learning
Supervised learning problem (task): a probability distribution µ on Z = X × Y, with X ⊆ R^d and Y ⊆ R
A learning algorithm is a mapping A : ∪_{n∈N} Z^n → R^d, z ↦ A(z)
Risk: R_µ(w) = E_{(x,y)∼µ} (⟨w, x⟩ − y)²
Example (Ridge Regression):
A(z) = argmin_{w∈R^d} (1/n) Σ_{i=1}^n (y_i − ⟨w, x_i⟩)² + λ‖w‖²
where the first term is the empirical risk R_z(w)
How to choose A?
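The ridge regression algorithm above has the closed-form solution w = (XᵀX + nλI)⁻¹Xᵀy; a minimal sketch (toy data and sizes of my own choosing) that checks it against the first-order optimality condition:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 50, 5, 0.1

# Toy dataset z = {(x_i, y_i)}_{i=1}^n
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

# Ridge regression A(z): minimize (1/n) sum_i (y_i - <w, x_i>)^2 + lam*||w||^2
# Closed form: w = (X^T X + n*lam*I)^{-1} X^T y
w = np.linalg.solve(X.T @ X + n * lam * np.eye(d), X.T @ y)

# The gradient of the objective at w should vanish
grad = (2 / n) * X.T @ (X @ w - y) + 2 * lam * w
print(np.linalg.norm(grad))  # essentially zero
```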
4. Supervised Learning
Supervised learning problem (task): a probability distribution µ on Z = X × Y, with X ⊆ R^d and Y ⊆ R
A learning algorithm is a mapping A : ∪_{n∈N} Z^n → R^d, z ↦ A(z)
Risk: R_µ(w) = E_{(x,y)∼µ} (⟨w, x⟩ − y)²
Example (Ridge Regression with a feature map):
A(z) = argmin_{w∈R^d} (1/n) Σ_{i=1}^n (y_i − ⟨w, Φx_i⟩)² + λ‖w‖²
where the first term is the empirical risk R_z(w)
How to choose A (the feature map Φ)?
5. Learning-to-Learn (LTL) Problem
[Baxter, 2000; Maurer 2009]
We wish to find a learning algorithm which works well on an
environment of tasks, captured by a meta-distribution ρ (a
“distribution over distributions”)
The performance of algorithm A is measured by the transfer risk:
E_ρ(A) = E_{µ∼ρ} E_{z∼µ^n} R_µ(A(z))
– draw a task µ ∼ ρ
– draw a sample z ∼ µ^n (the n-fold product distribution µ^⊗n)
– run the algorithm to obtain A(z)
– compute the risk of A(z) on task µ
ρ is unknown; we only observe a sequence of datasets z_1, z_2, . . .
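The four steps defining the transfer risk can be estimated by plain Monte Carlo. A sketch under an assumed toy environment (Gaussian task vectors and inputs; all names and parameters here are illustrative, not from the talk):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, lam = 5, 30, 0.1

def ridge(X, y):
    # The algorithm A(z): ridge regression on dataset z = (X, y)
    return np.linalg.solve(X.T @ X + n * lam * np.eye(d), X.T @ y)

def risk(w, w_task, n_test=2000):
    # Monte Carlo estimate of R_mu(w) = E (<w,x> - y)^2 for a linear task
    X = rng.standard_normal((n_test, d))
    y = X @ w_task + 0.1 * rng.standard_normal(n_test)
    return np.mean((X @ w - y) ** 2)

# Transfer risk E_rho(A): average over tasks mu ~ rho and datasets z ~ mu^n
vals = []
for _ in range(200):
    w_task = rng.standard_normal(d)              # draw a task mu ~ rho
    X = rng.standard_normal((n, d))              # draw a sample z ~ mu^n
    y = X @ w_task + 0.1 * rng.standard_normal(n)
    vals.append(risk(ridge(X, y), w_task))       # risk of A(z) on the task
print(np.mean(vals))  # Monte Carlo estimate of the transfer risk
```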
6. Online LTL
We wish to design a meta-algorithm which improves the underlying
algorithm over time as new datasets are observed
Need for memory efficiency: we cannot keep the datasets in memory!
We propose to minimize – via a suitable stochastic strategy – the
future empirical risk ˆE as a proxy for the transfer risk E_ρ:
ˆE(A) = E_{µ∼ρ} E_{z∼µ^n} R_z(A(z))
This is justified by statistical learning bounds [e.g. Maurer, 2009]:
E_{z∼µ^n} |R_µ(A(z)) − R_z(A(z))| ≤ G(A, n)
with G(A, ·) a measure of the complexity of A and lim_{n→∞} G(·, n) = 0
7. Linear Feature Learning
Learning algorithm: Ridge Regression with a feature map
A_Φ(z) = argmin_{w∈R^d} (1/n) Σ_{i=1}^n (y_i − ⟨w, Φx_i⟩)² + λ‖w‖²
Setting D = (1/λ) ΦᵀΦ ∈ R^{d×d}, the above problem can be equivalently formulated as
A_D(z) = argmin_{w∈ran(D)} (1/n) Σ_{i=1}^n (y_i − ⟨w, x_i⟩)² + ⟨w, D†w⟩
We wish to find a matrix with small transfer risk in the set
D_λ = {D ∈ S^d_+ : tr(D) ≤ 1/λ}
Encourages low rank solutions [Argyriou et al., 2008]
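The reformulation A_Φ ↔ A_D can be verified numerically when Φ is invertible, since the two problems then represent the same predictor via v = Φᵀw. A sketch with illustrative sizes:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 40, 4, 0.5

X = rng.standard_normal((n, d))
y = rng.standard_normal(n)
Phi = rng.standard_normal((d, d))  # an invertible feature map

# A_Phi(z): ridge regression on the transformed inputs Phi x_i
XPhi = X @ Phi.T                   # rows are (Phi x_i)^T
w_phi = np.linalg.solve(XPhi.T @ XPhi + n * lam * np.eye(d), XPhi.T @ y)

# A_D(z) with D = (1/lam) Phi^T Phi: minimize (1/n)||y - Xv||^2 + v^T D^+ v
D = Phi.T @ Phi / lam
v = np.linalg.solve(X.T @ X + n * np.linalg.pinv(D), X.T @ y)

# The two solutions give the same predictor: v = Phi^T w_phi
print(np.allclose(v, Phi.T @ w_phi))  # True
```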
8. Online LTL for Linear Feature Learning
Let L_z(D) = R_z(A_D(z))
Recall we propose to minimize the future empirical risk ˆE:
ˆE(D) = E_{µ∼ρ} E_{z∼µ^n} L_z(D)
A direct computation gives L_z(D) = n ‖(XDXᵀ + nI)⁻¹ y‖²
L_z is convex on the set of PSD matrices. In addition, if X ⊆ B₁ and Y ⊆ [0, 1], then L_z is 2-Lipschitz w.r.t. the Frobenius norm
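The direct computation rests on the closed form A_D(z) = DXᵀ(XDXᵀ + nI)⁻¹y, a standard ridge identity; a quick numerical check (toy data of my own choosing, including a low-rank D):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 20, 6, 0.5

X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

# A (possibly low-rank) PSD matrix D with tr(D) <= 1/lam
B = rng.standard_normal((d, 3))
D = B @ B.T
D *= 1 / (lam * np.trace(D))

# Closed form A_D(z) = D X^T (X D X^T + n I)^{-1} y
M = X @ D @ X.T + n * np.eye(n)
w = D @ X.T @ np.linalg.solve(M, y)

# Empirical risk of A_D(z) matches L_z(D) = n ||(X D X^T + n I)^{-1} y||^2
emp_risk = np.mean((y - X @ w) ** 2)
Lz = n * np.linalg.norm(np.linalg.solve(M, y)) ** 2
print(np.isclose(emp_risk, Lz))  # True
```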
9. Online LTL for Linear Feature Learning (cont.)
We address min_{D∈D_λ} ˆE(D) by projected stochastic gradient descent:
Input: number of tasks T, parameter λ > 0, step sizes (γ_t = 1/(λ√(2t)))_{t=1}^T
Initialization: choose D^(1) ∈ D_λ
For t = 1 to T:
    Sample µ_t ∼ ρ, z_t ∼ µ_t^n
    Update D^(t+1) = proj_{D_λ}(D^(t) − γ_t ∇L_{z_t}(D^(t)))
Return D̄_T = (1/T) Σ_{t=1}^T D^(t)
The projection can be computed in a finite number of steps in O(d³) time
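A compact sketch of the meta-algorithm on a toy environment. The gradient formula and the eigenvalue projection onto D_λ are my own derivations from the closed form L_z(D) = n‖(XDXᵀ + nI)⁻¹y‖²; the environment and all sizes are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, lam, T = 5, 30, 0.5, 100

def sample_dataset():
    # Toy environment rho: tasks share a 2-dim relevant subspace (illustrative)
    w_task = np.zeros(d)
    w_task[:2] = rng.standard_normal(2)
    X = rng.standard_normal((n, d)) / np.sqrt(d)
    y = X @ w_task + 0.1 * rng.standard_normal(n)
    return X, y

def grad_Lz(D, X, y):
    # Gradient of L_z(D) = n ||(X D X^T + n I)^{-1} y||^2 w.r.t. D
    M = X @ D @ X.T + n * np.eye(n)
    a = np.linalg.solve(M, y)       # M^{-1} y
    b = np.linalg.solve(M, a)       # M^{-2} y
    return -n * X.T @ (np.outer(a, b) + np.outer(b, a)) @ X

def proj(D):
    # Projection onto D_lam = {D PSD, tr(D) <= 1/lam} via eigen-decomposition
    s, U = np.linalg.eigh((D + D.T) / 2)
    s = np.maximum(s, 0)
    if s.sum() > 1 / lam:           # project eigenvalues onto the capped simplex
        u = np.sort(s)[::-1]
        css = np.cumsum(u) - 1 / lam
        rho = np.nonzero(u * np.arange(1, d + 1) > css)[0][-1]
        s = np.maximum(s - css[rho] / (rho + 1), 0)
    return (U * s) @ U.T

D = np.eye(d) / (lam * d)           # D^(1) in D_lam
D_bar = np.zeros((d, d))
for t in range(1, T + 1):
    D_bar += D / T                  # running average of the iterates
    X, y = sample_dataset()
    D = proj(D - grad_Lz(D, X, y) / (lam * np.sqrt(2 * t)))

print(np.trace(D_bar) <= 1 / lam + 1e-9)  # returned matrix stays in D_lam
```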
10. Statistical Analysis
Theorem 1. Let δ ∈ (0, 1]. If X ⊆ B₁, Y ⊆ [0, 1] and D̄_T is the output of the online LTL algorithm with step sizes γ_t = (λ√(2t))⁻¹, then, with probability at least 1 − δ w.r.t. the sampling of the datasets z_1, . . . , z_T,
E(D̄_T) − E(D_∗) ≤ (4√(2π) ‖C_ρ‖_∞^{1/2} / √n) · (1 + √λ)/λ + 4√2/(λ√T) + √(8 log(2/δ)/T)
where C_ρ is the total covariance of the inputs, C_ρ = E_{µ∼ρ} E_{(x,y)∼µ} [xxᵀ]
The bound is equivalent (up to constants) to a previous bound by [Maurer, 2009] for the batch case, which optimizes Σ_{t=1}^T L_{z_t}(D)
The bound improves over independent task learning when C_ρ is small and T is large
11. Statistical Analysis (cont.)
The proof uses the decomposition
E(D̄_T) − E(D_∗) = [E(D̄_T) − ˆE(D̄_T)] (A) + [ˆE(D̄_T) − ˆE(D_∗)] (B) + [ˆE(D_∗) − E(D_∗)] (C)
where D_∗ ∈ argmin_{D∈D_λ} E(D) and D̂_∗ ∈ argmin_{D∈D_λ} ˆE(D)
We control terms A and C with a uniform bound from [Maurer, 2009]
We bound term B via a regret analysis, followed by an online-to-batch conversion step [Cesa-Bianchi et al., 2004; Hazan, 2016]
12. Link to Multitask Learning (MTL)
[Argyriou et al., 2008]
Our approach is related to the MTL problem with trace norm regularization
min_{W∈R^{d×T}} (1/T) Σ_{t=1}^T R_{z_t}(w_t) + (λ/T) ‖σ(W)‖₁²   (∗)
Using ‖σ(W)‖₁² = (1/λ) inf_{D∈Int(D_λ)} Σ_{t=1}^T ⟨w_t, D⁻¹w_t⟩, we rewrite (∗) as
min_{D∈D_λ} (1/T) Σ_{t=1}^T min_{w∈ran(D)} R_{z_t}(w) + ⟨w, D†w⟩
Encourages low rank solutions!
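The variational form of the squared trace norm can be sanity-checked numerically: for full-rank W the infimum is attained at D = (WWᵀ)^{1/2}/(λ‖σ(W)‖₁), a standard argument, sketched here with toy sizes of my own choosing:

```python
import numpy as np

rng = np.random.default_rng(0)
d, T, lam = 4, 6, 0.5

W = rng.standard_normal((d, T))         # columns are the task vectors w_t

# Squared trace norm ||sigma(W)||_1^2
tn2 = np.linalg.norm(W, ord="nuc") ** 2

# Minimizer of sum_t <w_t, D^{-1} w_t> over {D PSD, tr(D) <= 1/lam}:
# D = (W W^T)^{1/2} / (lam * ||sigma(W)||_1)
s, U = np.linalg.eigh(W @ W.T)
sqrtWWT = (U * np.sqrt(np.maximum(s, 0))) @ U.T
D = sqrtWWT / (lam * np.sqrt(tn2))

# (1/lam) * sum_t <w_t, D^{-1} w_t> recovers ||sigma(W)||_1^2
val = np.trace(W.T @ np.linalg.solve(D, W)) / lam
print(np.isclose(val, tn2))  # True
```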
15. Ongoing Work / Open Questions
Explore other stochastic approaches / more efficient meta-algorithms (projecting onto D_λ requires an eigen-decomposition)
Extend to other loss functions:
min_{w∈R^d} (1/n) Σ_{i=1}^n ℓ(⟨w, x_i⟩, y_i) + ⟨w, D⁻¹w⟩
Let L_z(D) be the empirical error evaluated at the minimizer. Under which conditions is L_z(D) convex in D?
Extend to "richer" learning algorithms: Banach space setting (e.g. kernel methods) or non-convex learning algorithms (e.g. neural nets)
16. References
A. Argyriou, T. Evgeniou, and M. Pontil. Convex multi-task feature learning. Machine Learning,
73(3):243–272, 2008.
J. Baxter. A model of inductive bias learning. Journal of Artificial Intelligence Research,
12:149–198, 2000.
N. Cesa-Bianchi, A. Conconi, and C. Gentile. On the generalization ability of on-line learning
algorithms. IEEE Transactions on Information Theory, 50(9):2050–2057, 2004.
E. Hazan. Introduction to online convex optimization. Foundations and Trends in Optimization,
2016.
A. Maurer. Transfer bounds for linear feature learning. Machine Learning, 75(3):327–350, 2009.
A. Maurer, M. Pontil, and B. Romera-Paredes. The benefit of multitask representation learning.
The Journal of Machine Learning Research, 17(1):2853–2884, 2016.