Incremental Learning-to-Learn with
Statistical Guarantees
Massimiliano Pontil
Istituto Italiano di Tecnologia
and
University College London
Joint work with Giulia Denevi, Carlo Ciliberto, Dimitris Stamos
Workshop on Operator Splitting Methods in Data Analysis
SAMSI, Raleigh, NC, USA
March 21–23, 2018
Plan
Learning-to-learn
Online approach
Linear feature learning
Analysis
Link to multitask learning
Open problems
2 / 16
Supervised Learning
Supervised learning problem (task): a probability distribution µ on Z = X × Y, with X ⊆ Rᵈ and Y ⊆ R
A learning algorithm is a mapping A : ∪_{n∈N} Zⁿ → Rᵈ, z ↦ A(z)
Risk: R_µ(w) = E_{(x,y)∼µ} (⟨w, x⟩ − y)²
Example (Ridge Regression):
  A(z) = argmin_{w∈Rᵈ} (1/n) Σᵢ₌₁ⁿ (yᵢ − ⟨w, xᵢ⟩)² + λ‖w‖²   (the first term is the empirical risk R_z(w))
How to choose A?
3 / 16
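The ridge regression estimator above has a closed form, obtained by setting the gradient of the regularized empirical risk to zero. A minimal sketch in Python (synthetic data; all names are illustrative):

```python
import numpy as np

def ridge(z, lam):
    """Ridge regression A(z): minimize (1/n)·Σ (y_i - <w, x_i>)² + lam·||w||²."""
    X, y = z                      # X: n×d input matrix, y: n targets
    n, d = X.shape
    # First-order condition: (XᵀX/n + lam·I) w = Xᵀy/n
    return np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.01 * rng.standard_normal(50)

w_hat = ridge((X, y), lam=1e-6)
print(w_hat)  # close to w_true for small lam and low noise
```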
Supervised Learning (cont.)
Same task and risk as before, but Ridge Regression is now run on top of a feature map Φ:
  A(z) = argmin_{w∈Rᵈ} (1/n) Σᵢ₌₁ⁿ (yᵢ − ⟨w, Φxᵢ⟩)² + λ‖w‖²   (the first term is the empirical risk R_z(w))
How to choose A (i.e. the feature map Φ)?
4 / 16
Learning-to-Learn (LTL) Problem
[Baxter, 2000; Maurer, 2009]
We wish to find a learning algorithm which works well on an environment of tasks, captured by a meta-distribution ρ (a “distribution over distributions”)
The performance of algorithm A is measured by the transfer risk:
  E_ρ(A) = E_{µ∼ρ} E_{z∼µⁿ} R_µ(A(z))
– draw a task µ ∼ ρ
– draw a sample z ∼ µⁿ = µ ⊗ · · · ⊗ µ (n i.i.d. points)
– run the algorithm to obtain A(z)
– compute the risk of A(z) on task µ
ρ is unknown; we only observe a sequence of datasets z₁, z₂, . . .
5 / 16
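The transfer risk can be estimated by Monte Carlo, mirroring the four bullet points above. A sketch under an assumed synthetic environment ρ (tasks are weight vectors drawn around a common mean; all parameters are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, lam = 5, 30, 0.1

def ridge(X, y):
    return np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)

def transfer_risk(n_tasks=200, n_test=500):
    """Monte Carlo estimate of E_ρ(A) = E_{µ∼ρ} E_{z∼µⁿ} R_µ(A(z)).
    Hypothetical environment ρ: each task µ has weight vector w_µ ∼ N(w̄, 0.01·I)
    around a shared mean w̄, standard Gaussian inputs, noise level 0.1."""
    w_bar = np.ones(d)
    risks = []
    for _ in range(n_tasks):
        w_mu = w_bar + 0.1 * rng.standard_normal(d)      # draw a task µ ∼ ρ
        X = rng.standard_normal((n, d))                   # draw a sample z ∼ µⁿ
        y = X @ w_mu + 0.1 * rng.standard_normal(n)
        w_hat = ridge(X, y)                               # run the algorithm
        Xt = rng.standard_normal((n_test, d))             # estimate R_µ(A(z))
        yt = Xt @ w_mu + 0.1 * rng.standard_normal(n_test)
        risks.append(np.mean((Xt @ w_hat - yt) ** 2))
    return float(np.mean(risks))

print(transfer_risk())
```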
Online LTL
We wish to design a meta-algorithm which improves the underlying algorithm over time as new datasets are observed
Need for memory efficiency: we cannot keep the datasets in memory!
We propose to minimize – via a suitable stochastic strategy – the future empirical risk Ê as a proxy for the transfer risk E_ρ:
  Ê(A) = E_{µ∼ρ} E_{z∼µⁿ} R_z(A(z))
This is justified by statistical learning bounds [e.g. Maurer, 2009]:
  E_{z∼µⁿ} |R_µ(A(z)) − R_z(A(z))| ≤ G(A, n)
with G(A, ·) a measure of the complexity of A and lim_{n→∞} G(·, n) = 0
6 / 16
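The vanishing gap G(A, n) between empirical and true risk can be observed empirically. A sketch on a single assumed synthetic task µ (standard Gaussian inputs, noise 0.1, so that R_µ(w) = ‖w − w_µ‖² + 0.01 in closed form):

```python
import numpy as np

rng = np.random.default_rng(3)
d, lam = 5, 0.1
w_mu = rng.standard_normal(d)

def gap(n, reps=300):
    """Monte Carlo estimate of E_{z∼µⁿ} |R_µ(A(z)) − R_z(A(z))| for ridge
    regression on one synthetic task µ."""
    gaps = []
    for _ in range(reps):
        X = rng.standard_normal((n, d))
        y = X @ w_mu + 0.1 * rng.standard_normal(n)
        w = np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)
        R_z = np.mean((X @ w - y) ** 2)           # empirical risk
        R_mu = np.sum((w - w_mu) ** 2) + 0.01     # true risk (closed form here)
        gaps.append(abs(R_mu - R_z))
    return float(np.mean(gaps))

print(gap(10) > gap(1000))  # True: the gap G(A, n) shrinks as n grows
```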
Linear Feature Learning
Learning algorithm: Ridge Regression with a feature map
  A_Φ(z) = argmin_{w∈Rᵈ} (1/n) Σᵢ₌₁ⁿ (yᵢ − ⟨w, Φxᵢ⟩)² + λ‖w‖²
Setting D = (1/λ) ΦᵀΦ ∈ Rᵈˣᵈ, the above problem can be equivalently formulated as
  A_D(z) = argmin_{w∈ran(D)} (1/n) Σᵢ₌₁ⁿ (yᵢ − ⟨w, xᵢ⟩)² + ⟨w, D†w⟩
We wish to find a matrix with small transfer risk in the set
  D_λ = {D ∈ Sᵈ₊ : tr(D) ≤ 1/λ}
This choice encourages low-rank solutions [Argyriou et al., 2008]
7 / 16
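The equivalence between A_Φ and A_D can be checked numerically: the feature-space ridge solution, mapped back through Φᵀ, coincides with the minimizer of the D-formulation, whose closed form is w = DXᵀ(XDXᵀ + nI)⁻¹y (a standard derivation; dimensions below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 40, 6, 0.5
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)
Phi = rng.standard_normal((4, d))          # some feature map Φ : Rᵈ → R⁴

# A_Φ(z): ridge regression on the features Φx_i, mapped back to input space
F = X @ Phi.T                              # rows are (Φx_i)ᵀ
w_phi = F.T @ np.linalg.solve(F @ F.T + n * lam * np.eye(n), y)
w_input = Phi.T @ w_phi                    # predictor ⟨w_phi, Φx⟩ = ⟨Φᵀw_phi, x⟩

# A_D(z) with D = (1/λ)·ΦᵀΦ, closed form DXᵀ(XDXᵀ + nI)⁻¹y
D = Phi.T @ Phi / lam
w_D = D @ X.T @ np.linalg.solve(X @ D @ X.T + n * np.eye(n), y)

print(np.allclose(w_input, w_D))  # True: the two formulations coincide
```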
Online LTL for Linear Feature Learning
Let L_z(D) = R_z(A_D(z))
Recall we propose to minimize the future empirical risk Ê:
  Ê(D) = E_{µ∼ρ} E_{z∼µⁿ} L_z(D)
A direct computation gives L_z(D) = n ‖(XDXᵀ + nI)⁻¹ y‖²
L_z is convex on the set of PSD matrices. In addition, if X ⊆ B₁ and Y ⊆ [0, 1], then L_z is 2-Lipschitz w.r.t. the Frobenius norm
8 / 16
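The closed form L_z(D) = n‖(XDXᵀ + nI)⁻¹y‖² can be verified directly against the empirical risk of the minimizer A_D(z) = DXᵀ(XDXᵀ + nI)⁻¹y. A quick numeric check (random PSD matrix D; dimensions are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 30, 5
X = rng.standard_normal((n, d))
y = rng.uniform(0, 1, n)                      # targets in [0, 1], as on the slide
B = rng.standard_normal((d, d))
D = B @ B.T                                   # an arbitrary PSD matrix

# Minimizer A_D(z) and its empirical risk R_z
w = D @ X.T @ np.linalg.solve(X @ D @ X.T + n * np.eye(n), y)
R_z = np.mean((X @ w - y) ** 2)

# Closed form L_z(D) = n·‖(XDXᵀ + nI)⁻¹ y‖²
L = n * np.sum(np.linalg.solve(X @ D @ X.T + n * np.eye(n), y) ** 2)

print(np.isclose(R_z, L))  # True
```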
Online LTL for Linear Feature Learning (cont.)
We address min_{D∈D_λ} Ê(D) by projected stochastic gradient descent:
Input: number of tasks T, parameter λ > 0, step sizes (γₜ = 1/(λ√(2t)))ₜ₌₁ᵀ
Initialization: choose D⁽¹⁾ ∈ D_λ
For t = 1 to T:
  Sample µₜ ∼ ρ, zₜ ∼ µₜⁿ
  Update D⁽ᵗ⁺¹⁾ = proj_{D_λ}(D⁽ᵗ⁾ − γₜ ∇L_{zₜ}(D⁽ᵗ⁾))
Return D̄_T = (1/T) Σₜ₌₁ᵀ D⁽ᵗ⁾
The projection can be computed in a finite number of steps in O(d³) time
9 / 16
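Since D_λ is unitarily invariant, the Frobenius projection onto it reduces to a spectral problem: eigendecompose (the O(d³) step), then project the eigenvalues onto the capped simplex {s ≥ 0, Σᵢ sᵢ ≤ 1/λ}. A sketch assuming this spectral reduction (the paper's exact routine may differ in detail):

```python
import numpy as np

def proj_Dlam(D, lam):
    """Frobenius projection of a square matrix onto D_λ = {D ⪰ 0, tr(D) ≤ 1/λ}."""
    evals, U = np.linalg.eigh((D + D.T) / 2)  # symmetrize, then eigendecompose
    s = np.clip(evals, 0.0, None)             # PSD part of the spectrum
    r = 1.0 / lam
    if s.sum() > r:
        # Trace cap active: shift-and-clip onto the simplex {s ≥ 0, Σs = r},
        # i.e. find theta with Σ max(s_i − theta, 0) = r (standard algorithm)
        u = np.sort(s)[::-1]
        css = np.cumsum(u)
        k = np.nonzero(u - (css - r) / np.arange(1, len(u) + 1) > 0)[0][-1]
        theta = (css[k] - r) / (k + 1)
        s = np.clip(s - theta, 0.0, None)
    return (U * s) @ U.T                      # rebuild with projected spectrum

lam = 0.5
P = proj_Dlam(np.array([[3.0, 1.0], [1.0, -1.0]]), lam)
print(round(np.trace(P), 6))  # 2.0: the trace hits the boundary 1/λ
```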
Statistical Analysis
Theorem 1. Let δ ∈ (0, 1]. If X ⊆ B₁, Y ⊆ [0, 1] and D̄_T is the output of the online LTL algorithm with step sizes γₜ = (λ√(2t))⁻¹, then with probability at least 1 − δ w.r.t. the sampling of the datasets z₁, . . . , z_T,
  E(D̄_T) − E(D∗) ≤ (4√(2π) ‖C_ρ‖∞^{1/2} / √n) · (1 + √λ)/λ + 4√2 / (λ√T) + √(8 log(2/δ) / T)
where C_ρ is the total covariance of the inputs, C_ρ = E_{µ∼ρ} E_{(x,y)∼µ} xxᵀ
The bound is equivalent (up to constants) to a previous bound by [Maurer, 2009] for the batch case, which optimizes Σₜ₌₁ᵀ L_{zₜ}(D)
The bound improves over independent task learning when ‖C_ρ‖∞ is small and T is large
10 / 16
Statistical Analysis (cont.)
The proof uses the decomposition
  E(D̄_T) − E(D∗) = [E(D̄_T) − Ê(D̄_T)] (A) + [Ê(D̄_T) − Ê(D̂∗)] (B) + [Ê(D̂∗) − E(D∗)] (C)
where D∗ ∈ argmin_{D∈D_λ} E(D) and D̂∗ ∈ argmin_{D∈D_λ} Ê(D)
We control terms A and C with a uniform bound from [Maurer, 2009]
We bound term B via a regret analysis, followed by an online-to-batch conversion step [Cesa-Bianchi et al., 2004; Hazan, 2016]
11 / 16
Link to Multitask Learning (MTL)
[Argyriou et al., 2008]
Our approach is related to the MTL problem with trace norm regularization:
  min_{W∈Rᵈˣᵀ} (1/T) Σₜ₌₁ᵀ R_{zₜ}(wₜ) + (λ/T) σ₁(W)²   (∗)
where σ₁(W) denotes the trace norm (sum of the singular values) of W
Using σ₁(W)² = (1/λ) inf_{D∈Int(D_λ)} Σₜ₌₁ᵀ ⟨wₜ, D⁻¹wₜ⟩, we rewrite (∗) as
  min_{D∈D_λ} (1/T) Σₜ₌₁ᵀ min_{w∈ran(D)} R_{zₜ}(w) + ⟨w, D†w⟩
Encourages low-rank solutions!
12 / 16
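When W has full row rank, the infimum in the variational form of the trace norm is attained (in the closure of Int(D_λ)) at D ∝ (WWᵀ)^{1/2}, scaled so that tr(D) = 1/λ. A numeric check of the identity, with illustrative dimensions:

```python
import numpy as np

rng = np.random.default_rng(2)
d, T, lam = 4, 6, 0.3
W = rng.standard_normal((d, T))                       # full row rank a.s.

sigma1 = np.linalg.svd(W, compute_uv=False).sum()     # trace norm σ₁(W)

# Candidate minimizer D* = (WWᵀ)^{1/2} / (λ·σ₁(W)), so that tr(D*) = 1/λ
S = W @ W.T
evals, U = np.linalg.eigh(S)
sqrtS = (U * np.sqrt(np.clip(evals, 0, None))) @ U.T
D_star = sqrtS / (lam * sigma1)

# (1/λ)·Σₜ ⟨wₜ, D*⁻¹wₜ⟩ = (1/λ)·tr(Wᵀ D*⁻¹ W) should equal σ₁(W)²
val = np.trace(W.T @ np.linalg.solve(D_star, W)) / lam
print(np.isclose(val, sigma1 ** 2))  # True: the infimum is attained at D*
```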
Experiment
[Figure: mean test MSE vs. number of training tasks (0–50) for batch LTL, online LTL, MTL, and ITL]
13 / 16
Experiment (cont.)
[Figure: heat map over the number of training points (25–150) and the number of training tasks (20–140), with a 0%–6% color scale]
14 / 16
Ongoing Work / Open Questions
Explore other stochastic approaches / more efficient meta-algorithms (projecting onto D_λ requires an eigendecomposition)
Extend to other loss functions:
  min_{w∈Rᵈ} (1/n) Σᵢ₌₁ⁿ ℓ(⟨w, xᵢ⟩, yᵢ) + ⟨w, D⁻¹w⟩
Let L_z(D) be the empirical error evaluated at the minimizer. Under which conditions is L_z(D) convex in D?
Extend to “richer” learning algorithms: Banach space setting (e.g. kernel methods) or non-convex learning algorithms (e.g. neural nets)
15 / 16
References
A. Argyriou, T. Evgeniou, and M. Pontil. Convex multi-task feature learning. Machine Learning,
73(3):243–272, 2008.
J. Baxter. A model of inductive bias learning. Journal of Artificial Intelligence Research,
12:149–198, 2000.
N. Cesa-Bianchi, A. Conconi, and C. Gentile. On the generalization ability of on-line learning
algorithms. IEEE Transactions on Information Theory, 50(9):2050–2057, 2004.
E. Hazan. Introduction to online convex optimization. Foundations and Trends in Optimization,
2016.
A. Maurer. Transfer bounds for linear feature learning. Machine Learning, 75(3):327–350, 2009.
A. Maurer, M. Pontil, and B. Romera-Paredes. The benefit of multitask representation learning.
The Journal of Machine Learning Research, 17(1):2853–2884, 2016.
16 / 16
Roles & Responsibilities in PharmacovigilanceRoles & Responsibilities in Pharmacovigilance
Roles & Responsibilities in Pharmacovigilance
 
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
 
Historical philosophical, theoretical, and legal foundations of special and i...
Historical philosophical, theoretical, and legal foundations of special and i...Historical philosophical, theoretical, and legal foundations of special and i...
Historical philosophical, theoretical, and legal foundations of special and i...
 
Hierarchy of management that covers different levels of management
Hierarchy of management that covers different levels of managementHierarchy of management that covers different levels of management
Hierarchy of management that covers different levels of management
 
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptxECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
 
How to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptxHow to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptx
 
MARGINALIZATION (Different learners in Marginalized Group
MARGINALIZATION (Different learners in Marginalized GroupMARGINALIZATION (Different learners in Marginalized Group
MARGINALIZATION (Different learners in Marginalized Group
 
Proudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptxProudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptx
 
Meghan Sutherland In Media Res Media Component
Meghan Sutherland In Media Res Media ComponentMeghan Sutherland In Media Res Media Component
Meghan Sutherland In Media Res Media Component
 
OS-operating systems- ch04 (Threads) ...
OS-operating systems- ch04 (Threads) ...OS-operating systems- ch04 (Threads) ...
OS-operating systems- ch04 (Threads) ...
 
MICROBIOLOGY biochemical test detailed.pptx
MICROBIOLOGY biochemical test detailed.pptxMICROBIOLOGY biochemical test detailed.pptx
MICROBIOLOGY biochemical test detailed.pptx
 

QMC: Operator Splitting Workshop, Incremental Learning-to-Learn with Statistical Guarantees - Massimiliano Pontil, Mar 21, 2018

  • 1. Incremental Learning-to-Learn with Statistical Guarantees
Massimiliano Pontil
Istituto Italiano di Tecnologia and University College London
Joint work with Giulia Denevi, Carlo Ciliberto, Dimitris Stamos
Workshop on Operator Splitting Methods in Data Analysis
SAMSI, Raleigh, NC, USA, March 21–23, 2018
  • 2. Plan
– Learning-to-learn
– Online approach
– Linear feature learning
– Analysis
– Link to multitask learning
– Open problems
2 / 16
  • 3. Supervised Learning
Supervised learning problem (task): a probability distribution µ on Z = X × Y, with X ⊆ R^d and Y ⊆ R
A learning algorithm is a mapping A : ∪_{n∈N} Z^n → R^d, z ↦ A(z)
Risk: R_µ(w) = E_{(x,y)∼µ} (⟨w, x⟩ − y)²
Example (Ridge Regression):
A(z) = argmin_{w∈R^d} (1/n) Σ_{i=1}^n (y_i − ⟨w, x_i⟩)²  [= emp. risk R_z(w)]  + λ‖w‖²
How to choose A?
3 / 16
  • 4. Supervised Learning
Supervised learning problem (task): a probability distribution µ on Z = X × Y, with X ⊆ R^d and Y ⊆ R
A learning algorithm is a mapping A : ∪_{n∈N} Z^n → R^d, z ↦ A(z)
Risk: R_µ(w) = E_{(x,y)∼µ} (⟨w, x⟩ − y)²
Example (Ridge Regression with a feature map):
A(z) = argmin_{w∈R^d} (1/n) Σ_{i=1}^n (y_i − ⟨w, Φx_i⟩)²  [= emp. risk R_z(w)]  + λ‖w‖²
How to choose A (i.e., the feature map Φ)?
4 / 16
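As a numerical illustration (not part of the original slides), the ridge estimator A(z) above has a closed form given by the normal equations; the sketch below, assuming NumPy and a linear feature map supplied as a matrix `Phi`, computes it:

```python
import numpy as np

def ridge(X, y, lam, Phi=None):
    """A(z): minimizer of (1/n)·Σ(y_i − ⟨w, Φx_i⟩)² + λ‖w‖²."""
    if Phi is not None:
        X = X @ Phi.T          # apply the linear feature map to each input row
    n, d = X.shape
    # setting the gradient to zero gives (XᵀX/n + λI) w = Xᵀy/n
    return np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)
```

The names `ridge` and `Phi` are ours; the slide only specifies the minimization problem.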
  • 5. Learning-to-Learn (LTL) Problem [Baxter, 2000; Maurer, 2009]
We wish to find a learning algorithm which works well on an environment of tasks, captured by a meta-distribution ρ (a “distribution over distributions”)
The performance of algorithm A is measured by the transfer risk:
E_ρ(A) = E_{µ∼ρ} E_{z∼µⁿ} R_µ(A(z))
– draw a task µ ∼ ρ
– draw a sample z ∼ µⁿ (with µⁿ = µ^{⊗n}, the n-fold product)
– run the algorithm to obtain A(z)
– compute the risk of A(z) on task µ
ρ is unknown; we only observe a sequence of datasets z₁, z₂, . . .
5 / 16
  • 6. Online LTL
We wish to design a meta-algorithm which improves the underlying algorithm over time as new datasets are observed
Need for memory efficiency: we cannot keep the datasets in memory!
We propose to minimize – via a suitable stochastic strategy – the future empirical risk Ê as a proxy for the transfer risk E_ρ:
Ê(A) = E_{µ∼ρ} E_{z∼µⁿ} R_z(A(z))
This is justified by statistical learning bounds [e.g. Maurer, 2009]:
E_{z∼µⁿ} |R_µ(A(z)) − R_z(A(z))| ≤ G(A, n)
with G(A, ·) a measure of the complexity of A and lim_{n→∞} G(·, n) = 0
6 / 16
  • 7. Linear Feature Learning
Learning algorithm: Ridge Regression with a feature map
A_Φ(z) = argmin_{w∈R^d} (1/n) Σ_{i=1}^n (y_i − ⟨w, Φx_i⟩)² + λ‖w‖²
Setting D = (1/λ) ΦᵀΦ ∈ R^{d×d}, the above problem can be equivalently formulated as
A_D(z) = argmin_{w∈ran(D)} (1/n) Σ_{i=1}^n (y_i − ⟨w, x_i⟩)² + ⟨w, D†w⟩
We wish to find a matrix with small transfer risk in the set
D_λ = {D ∈ S^d_+ : tr(D) ≤ 1/λ}
Encourages low-rank solutions [Argyriou et al., 2008]
7 / 16
  • 8. Online LTL for Linear Feature Learning
Let L_z(D) = R_z(A_D(z))
Recall we propose to minimize the future empirical risk Ê:
Ê(D) = E_{µ∼ρ} E_{z∼µⁿ} L_z(D)
A direct computation gives L_z(D) = n ‖(XDXᵀ + nI)⁻¹ y‖²
L_z is convex on the set of PSD matrices. In addition, if X ⊆ B₁ and Y ⊆ [0, 1], then L_z is 2-Lipschitz w.r.t. the Frobenius norm
8 / 16
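The “direct computation” on this slide can be checked numerically (an illustration of ours, not from the talk): the D-regularized problem has minimizer w = DXᵀ(XDXᵀ + nI)⁻¹y, and plugging it into the empirical risk recovers n‖(XDXᵀ + nI)⁻¹y‖². A sketch, assuming NumPy:

```python
import numpy as np

def L_z(D, X, y):
    """Closed form L_z(D) = n ‖(X D Xᵀ + n I)⁻¹ y‖²."""
    n = len(y)
    u = np.linalg.solve(X @ D @ X.T + n * np.eye(n), y)   # (XDXᵀ + nI)⁻¹ y
    return n * u @ u

def empirical_risk_at_minimizer(D, X, y):
    """R_z(A_D(z)) with A_D(z) = D Xᵀ (X D Xᵀ + n I)⁻¹ y."""
    n = len(y)
    w = D @ X.T @ np.linalg.solve(X @ D @ X.T + n * np.eye(n), y)
    return np.mean((y - X @ w) ** 2)
```

On any PSD matrix D the two functions agree, since y − Xw = n(XDXᵀ + nI)⁻¹y.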
  • 9. Online LTL for Linear Feature Learning (cont.)
We address min_{D∈D_λ} Ê(D) by projected stochastic gradient descent:
Input: number of tasks T, parameter λ > 0, step sizes (γ_t = 1/(λ√(2t)))_{t=1}^T
Initialization: choose D(1) ∈ D_λ
For t = 1 to T:
  Sample µ_t ∼ ρ, z_t ∼ µ_tⁿ
  Update D(t+1) = proj_{D_λ}(D(t) − γ_t ∇L_{z_t}(D(t)))
Return D̄_T = (1/T) Σ_{t=1}^T D(t)
The projection can be computed in a finite number of steps in O(d³) time
9 / 16
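The meta-algorithm above can be sketched as follows (our illustration, not code from the talk). The gradient formula comes from differentiating L_z(D) = n‖(XDXᵀ + nI)⁻¹y‖², and the projection onto D_λ = {D PSD, tr(D) ≤ 1/λ} is computed, as the slide suggests, via an eigendecomposition followed by a standard simplex-style projection of the eigenvalues; the function names are ours:

```python
import numpy as np

def grad_L(D, X, y):
    """∇_D L_z(D) for L_z(D) = n‖(X D Xᵀ + n I)⁻¹ y‖² (symmetrized)."""
    n = len(y)
    M = X @ D @ X.T + n * np.eye(n)
    u = np.linalg.solve(M, y)          # M⁻¹ y
    v = np.linalg.solve(M, u)          # M⁻² y
    a, b = X.T @ v, X.T @ u
    return -n * (np.outer(a, b) + np.outer(b, a))

def _project_capped(v, c):
    """Euclidean projection of v ≥ 0 onto {x ≥ 0, Σx ≤ c}."""
    if v.sum() <= c:
        return v
    s = np.sort(v)[::-1]
    css = np.cumsum(s)
    k = np.arange(1, len(s) + 1)
    rho = k[s - (css - c) / k > 0].max()
    theta = (css[rho - 1] - c) / rho
    return np.maximum(v - theta, 0.0)

def project_D(D, lam):
    """Frobenius projection onto D_λ = {D PSD, tr(D) ≤ 1/λ}."""
    D = (D + D.T) / 2
    evals, V = np.linalg.eigh(D)
    evals = _project_capped(np.maximum(evals, 0.0), 1.0 / lam)
    return (V * evals) @ V.T

def online_ltl(datasets, lam):
    """Projected SGD over a stream of task datasets; returns D̄_T."""
    d = datasets[0][0].shape[1]
    D = np.eye(d) / (lam * d)          # a feasible D(1): tr(D) = 1/λ
    avg = np.zeros((d, d))
    for t, (X, y) in enumerate(datasets, 1):
        step = 1.0 / (lam * np.sqrt(2 * t))
        D = project_D(D - step * grad_L(D, X, y), lam)
        avg += D
    return avg / len(datasets)
```

Because D_λ is unitarily invariant, projecting the eigenvalues onto {x ≥ 0, Σx ≤ 1/λ} gives the Frobenius projection of the matrix, consistent with the O(d³) cost stated on the slide.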
  • 10. Statistical Analysis
Theorem 1. Let δ ∈ (0, 1]. If X ⊆ B₁, Y ⊆ [0, 1] and D̄_T is the output of the online LTL algorithm with step sizes γ_t = (λ√(2t))⁻¹, then with probability at least 1 − δ w.r.t. the sampling of the datasets z₁, . . . , z_T,
E(D̄_T) − E(D∗) ≤ 4√(2π) ‖C_ρ‖_∞^{1/2} (1 + √λ) / (λ√n) + 4√2 / (λ√T) + 8 √(log(2/δ)/T)
where C_ρ is the total covariance of the input, C_ρ = E_{µ∼ρ} E_{(x,y)∼µ} [xxᵀ]
The bound is equivalent (up to constants) to a previous bound by [Maurer, 2009] for the batch case, which optimizes Σ_{t=1}^T L_{z_t}(D)
The bound improves over independent task learning when C_ρ is small and T is large
10 / 16
  • 11. Statistical Analysis (cont.)
The proof uses the decomposition
E(D̄_T) − E(D∗) = [E(D̄_T) − Ê(D̄_T)] (A) + [Ê(D̄_T) − Ê(D∗)] (B) + [Ê(D∗) − E(D∗)] (C)
where D∗ ∈ argmin_{D∈D_λ} E(D) and D̂∗ ∈ argmin_{D∈D_λ} Ê(D)
We control terms A and C with a uniform bound from [Maurer, 2009]
We bound term B via a regret analysis, followed by an online-to-batch conversion step [Cesa-Bianchi et al., 2004; Hazan, 2016]
11 / 16
  • 12. Link to Multitask Learning (MTL) [Argyriou et al., 2008]
Our approach is related to the MTL problem with trace norm regularization:
min_{W∈R^{d×T}} (1/T) Σ_{t=1}^T R_{z_t}(w_t) + (λ/T) ‖σ(W)‖₁²   (∗)
Using ‖σ(W)‖₁² = (1/λ) inf_{D∈Int(D_λ)} Σ_{t=1}^T ⟨w_t, D⁻¹w_t⟩, we rewrite (∗) as
min_{D∈D_λ} (1/T) Σ_{t=1}^T min_{w∈ran(D)} R_{z_t}(w) + ⟨w, D†w⟩
Encourages low-rank solutions!
12 / 16
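The variational identity on this slide can be verified numerically (our illustration, not from the talk): when WWᵀ is invertible, the infimum is attained at D = (1/λ)·S/tr(S) with S = (WWᵀ)^{1/2}, and evaluating there recovers the squared trace norm. A sketch, assuming NumPy and full-rank W with d ≤ T:

```python
import numpy as np

def trace_norm_sq(W):
    """‖σ(W)‖₁²: squared sum of the singular values of W."""
    return np.linalg.svd(W, compute_uv=False).sum() ** 2

def variational_value(W, lam):
    """(1/λ)·Σ_t ⟨w_t, D⁻¹w_t⟩ at the minimizer D = (1/λ)·S/tr(S), S = (WWᵀ)^{1/2}.

    Assumes WWᵀ is invertible (full-rank W with d ≤ T)."""
    evals, V = np.linalg.eigh(W @ W.T)
    S = (V * np.sqrt(np.maximum(evals, 0))) @ V.T    # matrix square root of WWᵀ
    D = S / (lam * np.trace(S))                      # candidate minimizer, tr(D) = 1/λ
    return np.trace(W.T @ np.linalg.solve(D, W)) / lam
```

Since tr(S) equals the sum of the singular values of W, both functions return ‖σ(W)‖₁².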
  • 13. Experiment
[Figure: mean test MSE vs. number of training tasks (0–50), comparing batch LTL, online LTL, MTL and ITL]
13 / 16
  • 14. Experiment (cont.)
[Figure: heat map over the number of training points (25–150) and the number of training tasks (20–140), scale 0%–6%]
14 / 16
  • 15. Ongoing Work / Open Questions
Explore other stochastic approaches / more efficient meta-algorithms (projecting onto D_λ requires an eigen-decomposition)
Extend to other loss functions:
min_{w∈R^d} (1/n) Σ_{i=1}^n ℓ(⟨w, x_i⟩, y_i) + ⟨w, D⁻¹w⟩
Let L_z(D) be the empirical error evaluated at the minimizer. Under which conditions is L_z(D) convex in D?
Extend to “richer” learning algorithms: Banach space setting (e.g. kernel methods) or non-convex learning algorithms (e.g. neural nets)
15 / 16
  • 16. References
A. Argyriou, T. Evgeniou, and M. Pontil. Convex multi-task feature learning. Machine Learning, 73(3):243–272, 2008.
J. Baxter. A model of inductive bias learning. Journal of Artificial Intelligence Research, 12:149–198, 2000.
N. Cesa-Bianchi, A. Conconi, and C. Gentile. On the generalization ability of on-line learning algorithms. IEEE Transactions on Information Theory, 50(9):2050–2057, 2004.
E. Hazan. Introduction to online convex optimization. Foundations and Trends in Optimization, 2016.
A. Maurer. Transfer bounds for linear feature learning. Machine Learning, 75(3):327–350, 2009.
A. Maurer, M. Pontil, and B. Romera-Paredes. The benefit of multitask representation learning. Journal of Machine Learning Research, 17(1):2853–2884, 2016.
16 / 16