Random Matrix Theory and Machine Learning - Part 3, by Fabian Pedregosa
ICML 2021 tutorial on random matrix theory and machine learning.
Part 3 covers: 1. Motivation: Average-case versus worst-case in high dimensions 2. Algorithm halting times (runtimes) 3. Outlook
Random Matrix Theory and Machine Learning - Part 1, by Fabian Pedregosa
ICML 2021 tutorial on random matrix theory and machine learning. Part 1 covers: 1. A brief history of Random Matrix Theory, 2. Classical Random Matrix Ensembles (basic building blocks)
Tutorial on Belief Propagation in Bayesian Networks, by Anmol Dwivedi
The goal of this mini-project is to implement belief propagation algorithms for posterior probability inference and most probable explanation (MPE) inference for the Bayesian Network with binary values in which the Conditional Probability Table for each random-variable/node is given.
Individualized treatment rules (ITR) assign treatments according to different patients' characteristics. Despite recent advances in the estimation of ITRs, much less attention has been given to uncertainty assessments for the estimated rules. We propose a hypothesis testing procedure for the estimated ITRs from a general framework that directly optimizes overall treatment benefit equipped with sparse penalties. Specifically, we construct a local test for testing low-dimensional components of high-dimensional linear decision rules. The procedure can apply to observational studies by taking into account the additional variability from the estimation of the propensity score. Theoretically, our test extends the decorrelated score test proposed in Ning and Liu (2017, Ann. Stat.) and is valid whether or not model selection consistency for the true parameters holds. The proposed methodology is illustrated with numerical studies and a real data example on electronic health records of patients with Type-II Diabetes.
Linear Discriminant Analysis (LDA) Under f-Divergence Measures, by Anmol Dwivedi
For more details, please have a look at:
1. https://www.mdpi.com/1099-4300/24/2/188
2. https://ieeexplore.ieee.org/document/9518004
Abstract:
In statistical inference, the information-theoretic performance limits can often be expressed in terms of a notion of divergence between the underlying statistical models (e.g., in binary hypothesis testing, the total error probability is equal to the total variation between the models). As the data dimension grows, computing the statistics involved in decision-making and the attendant performance limits (divergence measures) face complexity and stability challenges. Dimensionality reduction addresses these challenges at the expense of compromising the performance (divergence reduces due to the data processing inequality for divergence). This paper considers linear dimensionality reduction such that the divergence between the models is maximally preserved. Specifically, the paper focuses on the Gaussian models and characterizes an optimal projection of the data onto a lower-dimensional subspace with respect to four $f$-divergence measures (Kullback-Leibler, $\chi^2$, Hellinger, and total variation). There are two key observations. First, projections are not necessarily along the dominant modes of the covariance matrix of the data, and even in some situations, they can be along the least dominant modes. Secondly, under specific regimes, the optimal design of subspace projection is identical under all the $f$-divergence measures considered, rendering a degree of universality to the design independent of the inference problem of interest.
The granting process of all credit institutions rejects applicants who seem risky regarding the repayment of their debt. A credit score is calculated and associated with a cut-off value beneath which an applicant is rejected. Developing a new scorecard, i.e. a correspondence table between a client's characteristics and his/her score, implies having a learning dataset in which the response variable good/bad borrower is known, so that rejects are de facto excluded from the learning process.
This may have deep consequences for the scorecard's relevance, as the learning population has been financed and considered good by the previous model. Previous works from the literature on this matter consisted mostly of empirical methods to exploit rejected applicants' data in the scorecard development process and experiments. We propose a rational criterion to evaluate the quality of a scoring model. We review each existing method in light of our criterion and dig out their implicit mathematical hypotheses.
We show that, up to now, no Reject Inference method can guarantee an improved credit scorecard. To support these theoretical findings, we added experiments on simulated and real data from the French branch of Crédit Agricole Consumer Finance.
These slides were presented as part of a talk I gave at the 49èmes Journées de Statistique, 29th of May, in Avignon, France.
Presentation on stochastic control problem with financial applications (Merto..., by Asma Ben Slimene
This is an introduction to optimal stochastic control theory with two applications in finance: the Merton portfolio problem and an investment/consumption problem, with numerical results using a finite-difference approach.
Abstract : For many years, Machine Learning has focused on a key issue: the design of input features to solve prediction tasks. In this presentation, we show that many learning tasks from structured output prediction to zero-shot learning can benefit from an appropriate design of output features, broadening the scope of regression. As an illustration, I will briefly review different examples and recent results obtained in my team.
The Internet has become the most extensive communication medium in all of history. It constitutes a source of information resources and knowledge shared on a worldwide scale. The Internet makes it possible to establish cooperation and collaboration among a large number of communities and general or special-interest groups distributed across the planet, whether via twitter, facebook, instagram, or websites (change.org).
Data fusion is the process of combining data from different sources to enhance the utility of the combined product. In remote sensing, input data sources are typically massive, noisy, and have different spatial supports and sampling characteristics. We take an inferential approach to this data fusion problem: we seek to infer a true but not directly observed spatial (or spatio-temporal) field from heterogeneous inputs. We use a statistical model to make these inferences, but like all models it is at least somewhat uncertain. In this talk, we will discuss our experiences with the impacts of these uncertainties and some potential ways of addressing them.
Bayesian inference for mixed-effects models driven by SDEs and other stochast..., by Umberto Picchini
An important, and well studied, class of stochastic models is given by stochastic differential equations (SDEs). In this talk, we consider Bayesian inference based on measurements from several individuals, to provide inference at the "population level" using mixed-effects modelling. We consider the case where dynamics are expressed via SDEs or other stochastic (Markovian) models. Stochastic differential equation mixed-effects models (SDEMEMs) are flexible hierarchical models that account for (i) the intrinsic random variability in the latent state dynamics, (ii) the variability between individuals, and (iii) measurement error. This flexibility gives rise to methodological and computational difficulties.
Fully Bayesian inference for nonlinear SDEMEMs is complicated by the typical intractability of the observed data likelihood, which motivates the use of sampling-based approaches such as Markov chain Monte Carlo. A Gibbs sampler is proposed to target the marginal posterior of all parameters of interest. The algorithm is made computationally efficient through careful use of blocking strategies, particle filters (sequential Monte Carlo) and correlated pseudo-marginal approaches. The resulting methodology is flexible and general, and is able to deal with a large class of nonlinear SDEMEMs [1]. In a more recent work [2], we also explored ways to make inference even more scalable to an increasing number of individuals, while also dealing with state-space models driven by stochastic dynamic models other than SDEs, e.g. Markov jump processes and nonlinear solvers typically used in systems biology.
[1] S. Wiqvist, A. Golightly, AT McLean, U. Picchini (2020). Efficient inference for stochastic differential mixed-effects models using correlated particle pseudo-marginal algorithms, CSDA, https://doi.org/10.1016/j.csda.2020.107151
[2] S. Persson, N. Welkenhuysen, S. Shashkova, S. Wiqvist, P. Reith, G. W. Schmidt, U. Picchini, M. Cvijovic (2021). PEPSDI: Scalable and flexible inference framework for stochastic dynamic single-cell models, bioRxiv doi:10.1101/2021.07.01.450748.
Possible applications of low-rank tensors in statistics and UQ (my talk in Bo..., by Alexander Litvinenko
Just some ideas on how low-rank matrices/tensors can be useful in spatial and environmental statistics, where one usually has to deal with very large data.
Slides: On the Chi Square and Higher-Order Chi Distances for Approximating f-..., by Frank Nielsen
Slides for the paper:
On the Chi Square and Higher-Order Chi Distances for Approximating f-Divergences
published in IEEE SPL:
http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=6654274
Integration with kernel methods, Transported meshfree methods, by Mercier Jean-Marc
I have made public for discussion a first version (subject to changes) of a talk that will be given at the Particles 2019 conference.
The main points of this presentation are the following:
1) We present new (to my knowledge) sharp estimates for Monte Carlo-type methods.
2) These estimates can be used in a wide variety of contexts to perform a sharp error analysis.
3) We present a class of numerical methods that we refer to as Transported Meshfree Methods. This class of methods can be used for a wide variety of problems based on Partial Differential Equations, among which artificial-intelligence problems belong.
4) Thanks to the error analysis, we can guarantee a worst-case error while computing with transported meshfree methods. We can also check that this error matches the optimal convergence rate.
Rao-Blackwellisation schemes for accelerating Metropolis-Hastings algorithms, by Christian Robert
Aggregate of three different papers on Rao-Blackwellisation, from Casella & Robert (1996), to Douc & Robert (2010), to Banterle et al. (2015), presented during an OxWaSP workshop on MCMC methods, Warwick, Nov 20, 2015
Seminar Talk: Multilevel Hybrid Split Step Implicit Tau-Leap for Stochastic R..., by Chiheb Ben Hammouda
In biochemically reactive systems with small copy numbers of one or more reactant molecules, the dynamics are dominated by stochastic effects. To approximate those systems, discrete state-space and stochastic simulation approaches have been shown to be more relevant than continuous state-space and deterministic ones. These stochastic models constitute the theory of Stochastic Reaction Networks (SRNs). In systems characterized by having simultaneously fast and slow timescales, existing discrete state-space stochastic path simulation methods, such as the stochastic simulation algorithm (SSA) and the explicit tau-leap (explicit-TL) method, can be very slow. In this talk, we propose a novel implicit scheme, split-step implicit tau-leap (SSI-TL), to improve numerical stability and provide efficient simulation algorithms for those systems. Furthermore, to estimate statistical quantities related to SRNs, we propose a novel hybrid Multilevel Monte Carlo (MLMC) estimator in the spirit of the work by Anderson and Higham (SIAM Multiscale Model. Simul. 10(1), 2012). This estimator uses the SSI-TL scheme at levels where the explicit-TL method is not applicable due to numerical stability issues, and then, starting from a certain interface level, it switches to the explicit scheme. We present numerical examples that illustrate the achieved gains of our proposed approach in this context.
IVR - Chapter 2 - Basics of filtering I: Spatial filters (25Mb), by Charles Deledalle
Moving averages. Finite differences and edge detectors. Gradient, Sobel and Laplacian. Linear translations invariant filters, cross-correlation and convolution. Adaptive and non-linear filters. Median filters. Morphological filters. Local versus global filters. Sigma filter. Bilateral filter. Patches and non-local means. Applications to image denoising.
On learning statistical mixtures maximizing the complete likelihood
1. Distributed algorithms for convex optimization
David Mateos-Núñez, Jorge Cortés
University of California, San Diego
{dmateosn,cortes}@ucsd.edu
Thesis Defense
Committee: Tara Javidi, Philip E. Gill, Maurício de Oliveira, and Miroslav Krstić
UC San Diego, August 20, 2015
3. Distributed paradigm
Users Tara, Philip, Mauricio, and Miroslav rate movies (Toy Story, Jurassic Park, ...); the ratings form the columns of
$W = [\, w_1 \;\; w_2 \;\; w_3 \;\; w_4 \,]$
$\min_{w_i \in \mathbb{R}^d} \sum_{i=1}^{4} f^i(w_i) + \gamma\, \| [\, w_1 \cdots w_4 \,] \|_*$
Ratings $\{w_i\}$ are private
4. Distributed algorithms for four classes of problems
Noisy channels:
$\min_{x \in \mathbb{R}^d} \sum_{i=1}^{N} f^i(x)$
Online optimization:
$R^j_T := \sum_{t=1}^{T} \sum_{i=1}^{N} f^i_t(x^j_t) - \min_{x \in \mathbb{R}^d} \sum_{t=1}^{T} \sum_{i=1}^{N} f^i_t(x)$
Multi-task learning:
$\min_{w_i \in \mathbb{R}^d} \sum_{i=1}^{N} f^i(w_i) + \gamma\, \|W\|_*$
Constraints:
$\min_{w_i \in \mathcal{W}_i,\; D \in \mathcal{D}} \sum_{i=1}^{N} f^i(w_i, D)$ s.t. $\sum_{i=1}^{N} g^i(w_i, D) \leq 0$
5. Part I: Noise-to-state stable coordination
Distributed optimization with noisy communication
6. Review of (consensus-based) distributed convex optimization
Local information
Laplacian matrix
Disagreement feedback
Pioneering works
Lagrangian approach
7. Using local information
$x^* \in \arg\min_{x \in \mathbb{R}^d} \sum_{i=1}^{N} f^i(x)$
Agent $i$ has access to $f^i$
Agent $i$ can share its estimate $x^i_t$ with "neighboring" agents
Example: five agents with objectives $f^1, \dots, f^5$ and adjacency matrix $A$ with nonzero weights $a_{13}, a_{14}, a_{21}, a_{25}, a_{31}, a_{32}, a_{41}, a_{52}$
Parallel computations: Tsitsiklis 84, Bertsekas and Tsitsiklis 95
Consensus: Jadbabaie et al. 03, Olfati-Saber, Murray 04, Boyd et al. 05
8. The Laplacian matrix
$L = \operatorname{diag}(A\mathbf{1}) - A = \begin{pmatrix} a_{13}+a_{14} & 0 & -a_{13} & -a_{14} & 0 \\ -a_{21} & a_{21}+a_{25} & 0 & 0 & -a_{25} \\ -a_{31} & -a_{32} & a_{31}+a_{32} & 0 & 0 \\ -a_{41} & 0 & 0 & a_{41} & 0 \\ 0 & -a_{52} & 0 & 0 & a_{52} \end{pmatrix}$
Nullspace is agreement ⇔ spanning tree
The mirror graph has Laplacian $\tfrac{1}{2}(L + L^\top)$
Consensus via feedback on disagreement: proportional, or proportional-integral feedback
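As a minimal sketch (not from the slides; unit weights are an assumption), the Laplacian construction above can be checked numerically: build $L = \operatorname{diag}(A\mathbf{1}) - A$ for the 5-agent digraph and verify that the agreement direction spans its nullspace.

```python
import numpy as np

N = 5
A = np.zeros((N, N))
# nonzero weights a13, a14, a21, a25, a31, a32, a41, a52 (0-indexed below),
# all set to 1.0 for illustration
A[0, 2] = A[0, 3] = 1.0
A[1, 0] = A[1, 4] = 1.0
A[2, 0] = A[2, 1] = 1.0
A[3, 0] = 1.0
A[4, 1] = 1.0

L = np.diag(A @ np.ones(N)) - A

# L @ 1 = 0: the agreement subspace always lies in the nullspace
assert np.allclose(L @ np.ones(N), 0.0)

# since this digraph contains a spanning tree, the nullspace is exactly
# the agreement subspace, i.e. rank(L) = N - 1
print(np.linalg.matrix_rank(L))  # 4
```

Changing the weights only rescales the rows; the rank property depends on the graph's connectivity, not on the particular positive weights.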
9. Disagreement feedback
"Honeybee navigation en route to the goal: visual flight control and odometry," M. Srinivasan, S. Zhang, M. Lehrer, and T. Collett, J. Exp. Biol., 1996
Relative positions and angles
$g_1(w_1) + \cdots + g_N(w_N) = 0$
Agreement on the multiplier
10. Pioneering works
Local minimization versus local consensus
A. Nedić and A. Ozdaglar, TAC, 09:
$x^i_{k+1} = -\underbrace{\eta_k}_{\text{decreasing}}\, g^i_k + \sum_{j=1}^{N} a_{ij}\, x^j_k$,
where $g^i_k \in \partial f^i(x^i_k)$ and $A = (a_{ij})$ is doubly stochastic
J. C. Duchi, A. Agarwal, and M. J. Wainwright, TAC, 12
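A toy sketch of the Nedić-Ozdaglar update above (my assumptions, not from the slides: 4 agents on a ring with doubly stochastic weights, scalar quadratics $f^i(x) = \tfrac{1}{2}(x - c_i)^2$): each agent mixes its neighbors' estimates and takes a local gradient step with decreasing step size $\eta_k = 1/k$.

```python
import numpy as np

c = np.array([1.0, 2.0, 3.0, 6.0])        # local targets; optimum = mean(c)
A = np.array([[0.5, 0.25, 0.0, 0.25],     # doubly stochastic mixing weights
              [0.25, 0.5, 0.25, 0.0],
              [0.0, 0.25, 0.5, 0.25],
              [0.25, 0.0, 0.25, 0.5]])

x = np.zeros(4)
for k in range(1, 5001):
    g = x - c                 # gradient of each local objective at x^i_k
    x = A @ x - g / k         # consensus step + decreasing step size 1/k
print(x)  # all entries approach mean(c) = 3.0
```

Double stochasticity is what makes the network-wide average of the estimates follow a centralized gradient step, while the mixing matrix contracts the disagreement between agents.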
11. The Lagrangian approach
$\min_{\substack{x^1, \dots, x^N \in \mathbb{R}^d \\ x^1 = \dots = x^N}} \sum_{i=1}^{N} f^i(x^i) = \min_{\substack{x \in (\mathbb{R}^d)^N \\ L x = 0}} \tilde{f}(x)$
Convex-concave augmented Lagrangian
$F(x, z) := \underbrace{\tilde{f}(x)}_{\text{network objective}} + \underbrace{z^\top L x}_{\text{multiplier} \times \text{constraint}} + \underbrace{\tfrac{\tilde{\gamma}}{2}\, x^\top L x}_{\text{augmented Lagrangian}}$
Saddle-point dynamics for symmetric Laplacian $L$:
$\dot{x} = -\frac{\partial F(x,z)}{\partial x} = -\nabla\tilde{f}(x) - \tilde{\gamma}\, L x - L z$
$\dot{z} = \frac{\partial F(x,z)}{\partial z} = L x$
J. Wang and N. Elia, Allerton, 10. And when $L \neq L^\top$?
12. Lagrangian approach for directed graphs
Lagrangian $F(x, z) := \tilde{f}(x) + z^\top L x + \tfrac{\tilde{\gamma}}{2}\, x^\top (L + L^\top)\, x$
still convex-concave because
weight-balanced ⇔ $\mathbf{1}^\top L = 0$ ⇔ $L + L^\top \succeq 0$
$\dot{x} = -\frac{\partial F(x,z)}{\partial x} = -\nabla\tilde{f}(x) - \tilde{\gamma}\, \tfrac{1}{2}(L + L^\top)\, x - L^\top z$ (Not distributed!)
$\dot{z} = \frac{\partial F(x,z)}{\partial z} = L x$
B. Gharesifard and J. Cortés, CDC, 12
13. Lagrangian approach for directed graphs
Lagrangian $F(x, z) := \tilde{f}(x) + z^\top L x + \tfrac{\tilde{\gamma}}{2}\, x^\top (L + L^\top)\, x$
still convex-concave because
weight-balanced ⇔ $\mathbf{1}^\top L = 0$ ⇔ $L + L^\top \succeq 0$
$\dot{x} = -\frac{\partial F(x,z)}{\partial x} = -\nabla\tilde{f}(x) - \tilde{\gamma}\, \tfrac{1}{2}(L + L^\top)\, x - L^\top z$ (Not distributed!)
changed to $-\nabla\tilde{f}(x) - \tilde{\gamma}\, L x - L z$
$\dot{z} = \frac{\partial F(x,z)}{\partial z} = L x$
B. Gharesifard and J. Cortés, CDC, 12
14. Now... what happens when there is noise?
Previous work and limitations
Our coordination algorithm under noise
Theorem of exponential noise-to-state stability
Simulations
Outline of the proof
15. Previous work with noisy communication channels
S. S. Ram, A. Nedić, and V. V. Veeravalli, 10
K. Srivastava and A. Nedić, 11
$x^i_{k+1} = x^i_k - \eta^i_k\, (g^i_k + \text{noise}) + \sigma^i_k \sum_{j=1}^{N} a_{ij,k}\, (x^j_k + \text{noise} - x^i_k)$
where $g^i_k \in \partial f^i(x^i_k)$ and $A_k = (a_{ij,k})$ is an adjacency matrix
Limitations:
vanishing $\{\eta^i_k\}_{k \geq 1}$ and $\{\sigma^i_k\}_{k \geq 1}$
sublinear convergence even in the strongly convex case
noise of fixed variance
require a projection step or bounded subgradients
16. Our algorithm: proportional-integral feedback + noise
$\dot{x}^i(t) = -\nabla f_i(x^i(t)) + \text{noise} + \tilde{\gamma} \sum_{j=1, j\neq i}^{N} a_{ij}\, \big(x^j(t) + \text{noise} - x^i(t)\big) + \sum_{j=1, j\neq i}^{N} a_{ij}\, \big(z^j(t) + \text{noise} - z^i(t)\big)$
$\dot{z}^i(t) = -\sum_{j=1, j\neq i}^{N} a_{ij}\, \big(x^j(t) + \text{noise} - x^i(t)\big)$
Stochastic differential equation in compact form:
$dx = -\nabla\tilde{f}(x)\, dt - \tilde{\gamma}\, L x\, dt - \underbrace{L z\, dt}_{\text{structured integral feedback}} + \underbrace{G_1(x, z, t)\, \Sigma_1(t)\, dB(t)}_{\text{noise}}$
$dz = L x\, dt + \underbrace{G_2(x, z, t)\, \Sigma_2(t)\, dB(t)}_{\text{noise}}$
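A rough Euler-Maruyama sketch of the noisy proportional-integral dynamics above (illustration only, not the authors' code; my assumptions: scalar agent states, quadratic objectives $f_i(x) = \tfrac{1}{2}(x - c_i)^2$, a weight-balanced directed ring, and additive noise of constant size `sig` on both $x$ and $z$). Despite the persistent noise, the states hover near the consensus optimizer, consistent with noise-to-state stability.

```python
import numpy as np

rng = np.random.default_rng(0)
c = np.array([4.0, -1.0, 2.0, 3.0])     # x* = 1 * mean(c) = 1 * 2.0
L = np.array([[ 1., -1.,  0.,  0.],     # directed ring, weight-balanced
              [ 0.,  1., -1.,  0.],
              [ 0.,  0.,  1., -1.],
              [-1.,  0.,  0.,  1.]])
gamma, sig, dt, T = 2.0, 0.02, 1e-3, 20.0

x = np.zeros(4)
z = np.zeros(4)
for _ in range(int(T / dt)):
    grad = x - c                                  # gradient of f~ at x
    dB = rng.normal(size=(2, 4)) * np.sqrt(dt)    # Brownian increments
    x = x + (-grad - gamma * (L @ x) - L @ z) * dt + sig * dB[0]
    z = z + (L @ x) * dt + sig * dB[1]

# the states stay near the optimizer 2.0, up to a residual set by sig
print(x)
```

Note that $z$ along the agreement direction performs a random walk, but it never affects $x$ because it enters only through $Lz$.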
17. Theorem (DMN-JC, 2nd moment noise-to-state exponential stability)
If
the digraph is strongly connected and weight-balanced,
each $f_i$ is convex and twice differentiable with bounded Hessian, and
at least one $f_{i_0}$ is strongly convex,
then
$\mathbb{E}\Big[ \underbrace{\|x(t) - \overbrace{x^*}^{\mathbf{1} \otimes x_{\min}}\|_2^2}_{\text{distance to optimizer}} + \underbrace{\|z(t) - z^*\|_{L_K}^2}_{\text{distance modulo agreement}} \Big] \leq \underbrace{C_\mu \big( \|x_0 - x^*\|_2^2 + \|z_0 - z^*\|_{L_K}^2 \big)}_{\text{initial conditions}}\, e^{-D_\mu (t - t_0)} + \underbrace{C_\theta \sup_{t_0 \leq \tau \leq t} \Big\| \begin{bmatrix} \Sigma_1(\tau) \\ \Sigma_2(\tau) \end{bmatrix} \Big\|_F^2}_{\text{ultimate bound}}$
where $\|\Sigma(t)\|_F := \sqrt{\operatorname{trace}\big(\Sigma(t)\, \Sigma(t)^\top\big)}$ gives the size of the noise
18. Simulations
[Figures: evolution of the agents' estimates $\{[x^i_1(t), x^i_2(t)]\}$; evolution of the second moment $\mathbb{E}\, \|x(t) - \mathbf{1} \otimes x_{\min}\|_2^2$ for $\tilde{\gamma} = 2, 3, 4$; empirical noise gain (ultimate bound for $\mathbb{E}\, \|x(t) - \mathbf{1} \otimes x_{\min}\|_2^2$ at $t = 17$) versus the size of the noise $s$]
$N = 4$ agents communicating over a directed ring with local objectives
$f_1(x_1, x_2) = \tfrac{1}{2}\big((x_1 - 4)^2 + (x_2 - 3)^2\big)$
$f_2(x_1, x_2) = x_1 + 3 x_2 - 2$
$f_3(x_1, x_2) = \log\big(e^{x_1 + 3} + e^{x_2 + 1}\big)$
$f_4(x_1, x_2) = (x_1 + 2 x_2 + 5)^2 + (x_1 - x_2 - 4)^2$
19. Outline of the proof
KKT conditions versus equilibria of the underlying modified saddle-point ODE
Existence & uniqueness of solutions of the SDEs
Lyapunov analysis of SDEs
pth moment NSS Lyapunov functions
"pth moment noise-to-state stability of stochastic differential equations with persistent noise," DMN-JC
Strategy for directed graphs
"Distributed continuous-time convex optimization on weight-balanced digraphs," B. Gharesifard and J. Cortés
$\delta$-cocoercivity of $F := (I_{Nd} + P)\big(\nabla\tilde{f}(x) + L x\big)$:
$(x - x')^\top \big(F(x) - F(x')\big) \geq \delta\, \|F(x) - F(x')\|_2^2$
20. End of Part I
Contributions:
2nd moment noise-to-state exponentially stable coordination
First proof of exponential convergence with only one objective strongly convex
Novel general Lyapunov stability analysis for SDEs
Open problems:
Coordination algorithms with communication delays
21. Part II: Distributed online optimization
Distributed optimization meets online optimization
Case study: medical diagnosis
22. Review of online convex optimization
Motivation: medical diagnosis
Sequential decision making
Regret
Classical results
23. Medical diagnosis
Medical findings, symptoms:
age factor
amnesia before impact
deterioration in GCS score
open skull fracture
loss of consciousness
vomiting
Does the patient need a computerized tomography (CT) scan?
"The Canadian CT Head Rule for patients with minor head injury"
24. A different kind of optimization: sequential decision making
In the diagnosis example, each round $t \in \{1, \dots, T\}$:
question (features, medical findings): $w_t$
decision (about using CT): $h(x_t, w_t)$
outcome (by CT findings / follow-up of patient): $y_t$
loss: $l(y_t\, h(x_t, w_t))$
Choose $x_t$ & incur loss $f_t(x_t) := l(y_t\, h(x_t, w_t))$
Goal: sublinear regret,
$R(u, T) := \sum_{t=1}^{T} f_t(x_t) - \sum_{t=1}^{T} f_t(u) \in o(T)$,
using "historical observations"
25. Significance of sublinear regret
If the regret is sublinear,
$\sum_{t=1}^{T} f_t(x_t) \leq \sum_{t=1}^{T} f_t(u) + o(T)$,
then
$\lim_{T \to \infty} \underbrace{\frac{1}{T} \sum_{t=1}^{T} f_t(x_t)}_{\text{loss of decisions "on the fly"}} \leq \lim_{T \to \infty} \underbrace{\frac{1}{T} \sum_{t=1}^{T} f_t(u)}_{\text{loss of any decision in hindsight}}$
In temporal average, the online decisions $\{x_t\}_{t=1}^{T}$ perform as well as the best decision in hindsight
"No regrets, my friend"
26. Some classical results
Projected gradient descent:
$x_{t+1} = \Pi_S\big(x_t - \eta_t\, \nabla f_t(x_t)\big)$,  (1)
where $\Pi_S$ is a projection onto a compact set $S \subseteq \mathbb{R}^d$, and $\|\nabla f\|_2 \leq H$
Follow-the-Regularized-Leader:
$x_{t+1} = \arg\min_{y \in S} \Big\{ \sum_{s=1}^{t} f_s(y) + \psi(y) \Big\}$
Martin Zinkevich, 03: Alg. (1) achieves $O(\sqrt{T})$ regret under convexity with $\eta_t = \frac{1}{\sqrt{t}}$
Elad Hazan, Amit Agarwal, and Satyen Kale, 07: Alg. (1) achieves $O(\log T)$ regret under $p$-strong convexity with $\eta_t = \frac{1}{p t}$
Others: Online Newton Step, Follow the Regularized Leader, etc.
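A small sketch of Zinkevich's projected online gradient descent (1) with $\eta_t = 1/\sqrt{t}$, on made-up scalar losses $f_t(x) = |x - y_t|$ (my choice of losses and decision set, for illustration): the average regret against the best fixed decision in hindsight shrinks as $O(1/\sqrt{T})$.

```python
import numpy as np

rng = np.random.default_rng(1)
S = (-1.0, 1.0)                          # compact decision set
T = 2000
y = rng.choice([-0.5, 1.0], size=T)      # adversary's targets

x, xs = 0.0, []
for t in range(1, T + 1):
    xs.append(x)
    g = np.sign(x - y[t - 1])            # subgradient of |x - y_t|
    x = np.clip(x - g / np.sqrt(t), *S)  # gradient step + projection onto S

losses = np.abs(np.array(xs) - y)
# best fixed decision in hindsight, approximated over a grid of S
grid = np.linspace(S[0], S[1], 201)
best = min(np.sum(np.abs(u - y)) for u in grid)
regret = losses.sum() - best
print(regret / T)  # small: average regret decays like O(1/sqrt(T))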
27. Combination of the distributed and online frameworks
Motivation
Previous work and limitations
Our coordination algorithm in the online case
Theorems of $O(\sqrt{T})$ & $O(\log T)$ agent regrets
Simulations
Outline of the proof
28. Resuming the diagnosis example
Health center $i \in \{1, \dots, N\}$ takes care of a set of patients at time $t$
Goal: sublinear agent regret,
$R^j(u, T) := \sum_{t=1}^{T} \sum_{i=1}^{N} f^i_t(x^j_t) - \sum_{t=1}^{T} \sum_{i=1}^{N} f^i_t(u) \in o(T)$,
using "local information" & "historical observations"
29. Challenge: coordinate hospitals
Five health centers with time-varying objectives $f^1_t, \dots, f^5_t$, $t \in \{1, \dots, T\}$
Need to design distributed online algorithms
Private patient data; share only prediction models
30. Previous work on consensus-based online algorithms
F. Yan, S. Sundaram, S. V. N. Vishwanathan, and Y. Qi, TAC: projected subgradient descent
$\log(T)$ regret (local strong convexity & bounded subgradients)
$\sqrt{T}$ regret (convexity & bounded subgradients)
Both analyses require a projection onto a compact set
S. Hosseini, A. Chapman, and M. Mesbahi, CDC, 13: dual averaging
$\sqrt{T}$ regret (convexity & bounded subgradients)
General regularized projection onto a closed convex set
K. I. Tsianos and M. G. Rabbat, arXiv, 12: projected subgradient descent
Empirical risk as opposed to regret analysis
Limitation: the communication digraph in all cases is fixed and strongly connected
31. Our contributions (informally)
time-varying communication digraphs under $B$-joint connectivity & weight-balance
unconstrained optimization (no projection step onto a bounded set)
$\log T$ regret (local strong convexity & bounded subgradients)
$\sqrt{T}$ regret (convexity & $\beta$-centrality with $\beta \in (0, 1]$ & bounded subgradients)
Definition: $f^i_t$ is $\beta$-central in $Z \subseteq \mathbb{R}^d \setminus X^*$, where $X^* = \{x^* : 0 \in \partial f^i_t(x^*)\}$, if $\forall x \in Z$, $\exists x^* \in X^*$ such that
$\frac{-\xi_x^\top (x^* - x)}{\|\xi_x\|_2\, \|x^* - x\|_2} \geq \beta$, for each $\xi_x \in \partial f^i_t(x)$
32. Our algorithm under jointly connected topologies
$x^i_{t+1} = x^i_t - \eta_t\, g_{x^i_t}$
Subgradient descent on previous local objectives, $g_{x^i_t} \in \partial f^i_t(x^i_t)$
33. Our algorithm under jointly connected topologies
$x^i_{t+1} = x^i_t - \eta_t\, g_{x^i_t} + \sigma\, \tilde{\gamma} \sum_{j=1, j\neq i}^{N} a_{ij,t}\, (x^j_t - x^i_t)$
Proportional feedback on disagreement
34. Our algorithm under jointly connected topologies
$x^i_{t+1} = x^i_t - \eta_t\, g_{x^i_t} + \sigma \Big( \tilde{\gamma} \sum_{j=1, j\neq i}^{N} a_{ij,t}\, (x^j_t - x^i_t) + \sum_{j=1, j\neq i}^{N} a_{ij,t}\, (z^j_t - z^i_t) \Big)$
$z^i_{t+1} = z^i_t - \sigma \sum_{j=1, j\neq i}^{N} a_{ij,t}\, (x^j_t - x^i_t)$
Integral feedback on disagreement
35. Our algorithm under jointly connected topologies
$x^i_{t+1} = x^i_t - \eta_t\, g_{x^i_t} + \sigma \Big( \tilde{\gamma} \sum_{j=1, j\neq i}^{N} a_{ij,t}\, (x^j_t - x^i_t) + \sum_{j=1, j\neq i}^{N} a_{ij,t}\, (z^j_t - z^i_t) \Big)$
$z^i_{t+1} = z^i_t - \sigma \sum_{j=1, j\neq i}^{N} a_{ij,t}\, (x^j_t - x^i_t)$
The union of the graphs over intervals of length $B$ is strongly connected
[Figure: communication graphs over agents $f^1_t, \dots, f^5_t$ at times $t$, $t+1$ (with edges $a_{14,t+1}$, $a_{41,t+1}$), and $t+2$]
36. Our algorithm under jointly connected topologies
$x^i_{t+1} = x^i_t - \eta_t\, g_{x^i_t} + \sigma \Big( \tilde{\gamma} \sum_{j=1, j\neq i}^{N} a_{ij,t}\, (x^j_t - x^i_t) + \sum_{j=1, j\neq i}^{N} a_{ij,t}\, (z^j_t - z^i_t) \Big)$
$z^i_{t+1} = z^i_t - \sigma \sum_{j=1, j\neq i}^{N} a_{ij,t}\, (x^j_t - x^i_t)$
Compact representation:
$\begin{bmatrix} x_{t+1} \\ z_{t+1} \end{bmatrix} = \begin{bmatrix} x_t \\ z_t \end{bmatrix} - \sigma \begin{bmatrix} \tilde{\gamma} L_t & L_t \\ -L_t & 0 \end{bmatrix} \begin{bmatrix} x_t \\ z_t \end{bmatrix} - \eta_t \begin{bmatrix} \tilde{g}_{x_t} \\ 0 \end{bmatrix}$
37. Our algorithm under jointly connected topologies
$x^i_{t+1} = x^i_t - \eta_t\, g_{x^i_t} + \sigma \Big( \tilde{\gamma} \sum_{j=1, j\neq i}^{N} a_{ij,t}\, (x^j_t - x^i_t) + \sum_{j=1, j\neq i}^{N} a_{ij,t}\, (z^j_t - z^i_t) \Big)$
$z^i_{t+1} = z^i_t - \sigma \sum_{j=1, j\neq i}^{N} a_{ij,t}\, (x^j_t - x^i_t)$
Compact representation & generalization:
$\begin{bmatrix} x_{t+1} \\ z_{t+1} \end{bmatrix} = \begin{bmatrix} x_t \\ z_t \end{bmatrix} - \sigma \begin{bmatrix} \tilde{\gamma} L_t & L_t \\ -L_t & 0 \end{bmatrix} \begin{bmatrix} x_t \\ z_t \end{bmatrix} - \eta_t \begin{bmatrix} \tilde{g}_{x_t} \\ 0 \end{bmatrix}$
$v_{t+1} = (I - \sigma\, E \otimes L_t)\, v_t - \eta_t\, g_t$
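A simplified numerical sketch of the proportional-integral online update above (illustration only, with my own assumptions: a fixed weight-balanced digraph instead of time-varying $B$-jointly-connected ones, scalar states, made-up strongly convex quadratic losses, and ad hoc constants $\sigma$, $\tilde{\gamma}$). With $\eta_t = 1/(pt)$, each agent tracks the minimizer of the network-wide time-varying objective using only neighbor information.

```python
import numpy as np

rng = np.random.default_rng(2)
N, T = 4, 3000
L = np.array([[ 1., -1.,  0.,  0.],
              [ 0.,  1., -1.,  0.],
              [ 0.,  0.,  1., -1.],
              [-1.,  0.,  0.,  1.]])   # weight-balanced directed ring
# gamma >= 2 keeps the eigenvalues of E = [[gamma, 1], [-1, 0]] real and
# positive (one of the theorem's assumptions); sigma chosen small enough
gamma, sigma = 3.0, 0.2
p = 1.0                                # strong convexity modulus

x = rng.normal(size=N)
z = np.zeros(N)
for t in range(1, T + 1):
    # time-varying local losses f^i_t(x) = p (x - c^i_t)^2 / 2,
    # with c^i_t noisy around i, so the network optimum is near mean(0..3)
    c = rng.normal(loc=np.arange(N, dtype=float), scale=0.1)
    g = p * (x - c)                                    # local (sub)gradients
    x_new = x - g / (p * t) - sigma * (gamma * L @ x + L @ z)
    z_new = z + sigma * (L @ x)
    x, z = x_new, z_new

print(x)  # each agent's estimate is near the average target, mean(0..3) = 1.5
```

Note the PI structure: the $-\sigma(\tilde{\gamma} Lx + Lz)$ term is exactly $\sigma$ times the proportional and integral disagreement feedbacks written agent-wise on the slide.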
38. Theorem (DMN-JC, sublinear agent regret)
Assume that
$\{f^1_t, \dots, f^N_t\}_{t=1}^{T}$ are convex functions in $\mathbb{R}^d$ with $H$-bounded subgradient sets, nonempty and uniformly bounded sets of minimizers, and $p$-strongly convex in a sufficiently large neighborhood of their minimizers;
the sequence of weight-balanced communication digraphs is $\delta$-nondegenerate and $B$-jointly-connected;
$E \in \mathbb{R}^{K \times K}$ is diagonalizable with positive real eigenvalues.
For $\tilde{\delta}' \in (0, 1)$ and $\tilde{\delta} := \min\Big\{ \tilde{\delta}',\; (1 - \tilde{\delta}')\, \frac{\lambda_{\min}(E)\, \delta}{\lambda_{\max}(E)\, d_{\max}} \Big\}$,
choosing $\sigma \in \Big[ \frac{\tilde{\delta}}{\lambda_{\min}(E)\, \delta},\; \frac{1 - \tilde{\delta}}{\lambda_{\max}(E)\, d_{\max}} \Big]$ and $\eta_t = \frac{1}{p\, t}$,
$R^j(u, T) \leq C\, \big( \|u\|_2^2 + 1 + \log T \big)$,
for any $j \in \{1, \dots, N\}$ and $u \in \mathbb{R}^d$
39. Theorem (DMN-JC, sublinear agent regret)
Assume that
$\{f^1_t, \dots, f^N_t\}_{t=1}^{T}$ are convex functions in $\mathbb{R}^d$ with $H$-bounded subgradient sets and nonempty, uniformly bounded sets of minimizers;
the sequence of weight-balanced communication digraphs is $\delta$-nondegenerate and $B$-jointly-connected;
$E \in \mathbb{R}^{K \times K}$ is diagonalizable with positive real eigenvalues.
Substituting strong convexity by $\beta$-centrality, for $\beta \in (0, 1]$,
$R^j(u, T) \leq C\, \|u\|_2^2\, \sqrt{T}$,
for any $j \in \{1, \dots, N\}$ and $u \in \mathbb{R}^d$
41. Simulations
[Figures: agents' estimates $\{x^i_{t,7}\}_{i=1}^{N}$ and $\{x^i_{t,8}\}_{i=1}^{N}$ versus the centralized solution; average regret $\max_j \frac{1}{T} R^j\big(x^*_T, \{\sum_i f^i_t\}_{t=1}^{T}\big)$ against the time horizon $T$ for proportional-integral disagreement feedback, proportional disagreement feedback, and centralized]
$f^i_t(x) = \sum_{s \in P^i_t} l\big(y_s\, h(x, w_s)\big) + \tfrac{1}{10}\, \|x\|_2^2$,
where $l(m) = \log\big(1 + e^{-2m}\big)$
42. Outline of the proof
Analysis of the network regret
$R_N(u, T) := \sum_{t=1}^{T} \sum_{i=1}^{N} f^i_t(x^i_t) - \sum_{t=1}^{T} \sum_{i=1}^{N} f^i_t(u)$
Input-to-state stability with respect to agreement:
$\|\hat{L}_K v_t\|_2 \leq C_I\, \|v_1\|_2\, \Big(1 - \frac{\tilde{\delta}}{4 N^2}\Big)^{\frac{t-1}{B}} + C_U \max_{1 \leq s \leq t-1} \|d_s\|_2$
Agent regret bound in terms of $\{\eta_t\}_{t \geq 1}$ and initial conditions
Boundedness of online estimates under $\beta$-centrality ($\beta \in (0, 1]$)
Doubling trick to obtain $O(\sqrt{T})$ agent regret
Local strong convexity ⇒ $\beta$-centrality; $\eta_t = \frac{1}{p t}$ to obtain $O(\log T)$ agent regret
43. End of Part II
Contributions
Extension of classical $\sqrt{T}$- and $\log(T)$-regret bounds to the multi-agent scenario
B-joint connectivity of communication topologies
Projections onto compact sets shown unnecessary
Proportional-integral feedback on disagreement
Relevant for regression & classification
Future work
Refine guarantees under a model for the evolution of the objective functions
44. Part III: Distributed optimization with nuclear norm
Distributed optimization + nuclear norm regularization
45. Review of nuclear norm regularization
Definition
Application to matrix completion
46. Coupling through nuclear norm
Given a matrix $W = [\, w_1 \ \cdots \ w_N \,] \in \mathbb{R}^{d \times N}$,

    $\|W\|_* = $ sum of the singular values of $W$ $ = \operatorname{trace}\sqrt{WW^\top} = \operatorname{trace}\sqrt{\textstyle\sum_{i=1}^N w_i w_i^\top}$

The nuclear norm regularization

    $\min_{w_i \in \mathcal{W}_i} \ \sum_{i=1}^N f^i(w_i) + \gamma\,\|W\|_*^2$

favors vectors $\{w_i\}_{i=1}^N$ belonging to a low-dimensional subspace
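The identity between the sum of singular values and the trace of the matrix square root can be checked on random data; the sketch below computes $\operatorname{trace}\sqrt{WW^\top}$ through the eigenvalues of $WW^\top$ (which are the squared singular values of $W$).

```python
import numpy as np

# Check that sum of singular values of W equals trace sqrt(W W^T)
rng = np.random.default_rng(0)
d, N = 5, 8
W = rng.standard_normal((d, N))

nuclear = np.linalg.svd(W, compute_uv=False).sum()

eigs = np.linalg.eigvalsh(W @ W.T)               # eigenvalues of W W^T = sigma_i^2
trace_sqrt = np.sqrt(np.clip(eigs, 0.0, None)).sum()

print(abs(nuclear - trace_sqrt))  # numerically zero
```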
48. Low-rank matrix completion
Figure: incomplete rating matrix $Z$ (movies such as Toy Story and Jurassic Park rated by users Tara, Philip, Mauricio, Miroslav), assumed low-rank, with columns $W = [\, W_{:,1} \ W_{:,2} \ W_{:,3} \ W_{:,4} \,]$
To estimate the missing entries,

    $\min_{W \in \mathbb{R}^{d \times N}} \ \sum_{(i,j) \in \Gamma} (W_{ij} - Z_{ij})^2 + \gamma \|W\|_*$

where $\Gamma$ indexes the revealed entries of the rating matrix $Z$
$\gamma$ depends on the application, dimensions... Regularization, not penalty
Netflix: users $\sim 10^7$, movies $\sim 10^5$
Why make it distributed? Because users may not want to share their ratings
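For reference, this centralized estimator is commonly computed by proximal gradient iterations, where the prox of the nuclear norm soft-thresholds the singular values (singular value thresholding). A minimal sketch on synthetic data — the sizes, $\gamma$, step size, and iteration count are illustrative assumptions, not values from the talk:

```python
import numpy as np

rng = np.random.default_rng(1)
d, N, r = 20, 15, 2
Z = rng.standard_normal((d, r)) @ rng.standard_normal((r, N))  # low-rank "ratings"
mask = rng.random((d, N)) < 0.6            # Gamma: the revealed entries

def svt(A, tau):
    """Prox of tau * nuclear norm: soft-threshold the singular values."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

gamma, eta = 1.0, 0.5                      # eta = 1/L for the smooth part (L = 2)
W = np.zeros((d, N))
for _ in range(300):
    grad = 2.0 * mask * (W - Z)            # gradient of the masked squared error
    W = svt(W - eta * grad, eta * gamma)   # proximal gradient step

rel_err = np.linalg.norm(mask * (W - Z)) / np.linalg.norm(mask * Z)
print(rel_err)  # small residual on the revealed entries
```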
49. Distributed formulation of nuclear norm
Useful characterizations and challenges
Our dynamics for distributed optimization with nuclear norm
Theorem of convergence
Simulations
50. 1st characterization of the nuclear norm
For any $W \in \mathbb{R}^{d \times N}$,

    $2\,\|W\|_* \ \big(= 2\operatorname{trace}\sqrt{WW^\top}\big)$
    $= \min_{D \in \mathbb{S}^d_{\succeq 0},\ C(D) \supseteq C(W)} \ \operatorname{trace}\big(D^\dagger\, WW^\top\big) + \operatorname{trace}(D)$
    $= \min_{D \in \mathbb{S}^d_{\succeq 0},\ w_i \in C(D)} \ \sum_{i=1}^N w_i^\top D^\dagger w_i + \operatorname{trace}(D)$

($C(\cdot)$ denotes the column space, so the constraint relates the column spaces of $D$ and $W$; $D^\dagger$ is the pseudoinverse)
Used in A. Argyriou, T. Evgeniou, and M. Pontil, NIPS 06
Minimizer $D^* = \sqrt{WW^\top}$
For $\|W\|_*^2$, use the hard constraint $\operatorname{trace}(D) \le 1$. Then $D^* = \frac{\sqrt{WW^\top}}{\operatorname{trace}\sqrt{WW^\top}}$
Challenges related to the pseudoinverse:
Optimizer $D^* = \sqrt{WW^\top}$ might be rank deficient
Pseudoinverse $D^\dagger$ might be discontinuous when the rank of $D$ changes
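A quick numerical check of this characterization: form the candidate minimizer $D^* = \sqrt{WW^\top}$ from an eigendecomposition, verify the optimal value is $2\|W\|_*$, and verify a feasible perturbation does no better.

```python
import numpy as np

rng = np.random.default_rng(2)
d, N = 4, 6
W = rng.standard_normal((d, N))

def objective(D):
    # trace(D^dagger W W^T) + trace(D)
    return np.trace(np.linalg.pinv(D) @ (W @ W.T)) + np.trace(D)

# Candidate minimizer D* = sqrt(W W^T), via the eigendecomposition of W W^T
eigval, U = np.linalg.eigh(W @ W.T)
D_star = (U * np.sqrt(np.clip(eigval, 0.0, None))) @ U.T

two_nuclear = 2.0 * np.linalg.svd(W, compute_uv=False).sum()
gap = abs(objective(D_star) - two_nuclear)     # ~0 at the minimizer
worse = objective(D_star + 0.1 * np.eye(d))    # a feasible PSD perturbation
print(gap, worse >= two_nuclear)
```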
51. Convexity property of $w^\top D^\dagger w$
Point-wise supremum of a function linear in $w$ and $D$:

    $\sup_{x \in \mathbb{R}^d} -(x^\top D x + 2 w^\top x) = \begin{cases} w^\top D^\dagger w & \text{if } D \in \mathbb{S}^d_{\succeq 0},\ w \in C(D), \\ \infty & \text{otherwise} \end{cases}$

Hence $w^\top D^\dagger w$ is jointly convex on $\{w \in \mathbb{R}^d,\ D \in \mathbb{S}^d_{\succeq 0} : w \in C(D)\}$
Here the pseudoinverse is “hidden” in the optimizer
Good for iterative algorithms
Curiosity: explicit Fenchel conjugate of $q(x) := x^\top D x$, for a fixed $D$ (up to a factor, $q^*(-2w)$)
Proof in the appendix of “Convex Optimization”, by Boyd and Vandenberghe
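The variational identity can be checked directly for a positive definite $D$, where the pseudoinverse is the ordinary inverse and the concave quadratic is maximized at $x^* = -D^{-1}w$.

```python
import numpy as np

rng = np.random.default_rng(3)
d = 4
A = rng.standard_normal((d, d))
D = A @ A.T + 0.5 * np.eye(d)      # positive definite, so C(D) = R^d
w = rng.standard_normal(d)

# x -> -(x^T D x + 2 w^T x) is maximized at x* = -D^{-1} w,
# where it attains the value w^T D^{-1} w
x_star = -np.linalg.solve(D, w)
sup_val = -(x_star @ D @ x_star + 2.0 * w @ x_star)
print(abs(sup_val - w @ np.linalg.solve(D, w)))   # numerically zero

# Any other x attains a smaller value
x_other = rng.standard_normal(d)
print(-(x_other @ D @ x_other + 2.0 * w @ x_other) <= sup_val)
```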
52. 2nd characterization of the nuclear norm
The latter expression allows us to write

    $2\,\|W\|_* = \min_{D \in \mathbb{S}^d_{\succeq 0},\ w_i \in C(D)} \ \underbrace{\sum_{i=1}^N \sup_{x_i \in \mathbb{R}^d} -(x_i^\top D x_i + 2 w_i^\top x_i)}_{\sum_{i=1}^N w_i^\top D^\dagger w_i} + \operatorname{trace}(D)$

Sum of convex functions in the vectors $\{w_i\}_{i=1}^N$ (as well as in the auxiliary vectors $\{x_i\}_{i=1}^N$)
Common auxiliary matrix $D \in \mathbb{R}^{d \times d}$, ideally $d \ll N$
Can be “distributed” as a min-max problem with local copies of $D$ subject to agreement
53. Approximate regularization to guarantee $D^* \succeq \sqrt{\varepsilon}\, I_d$
Challenges related to the constraint $w \in C(D)$:
The set $\{w \in \mathbb{R}^d,\ D \in \mathbb{S}^d_{\succeq 0} : w \in C(D)\}$ is convex but not closed
(also, $\mathbb{S}^d_{\succ 0}$ is open)
$DD^\dagger w$ projects $w_i$ onto $C(D)$ but is state dependent
Instead, pad $W$ with a scaled identity, for a scalar $\varepsilon > 0$:

    $\big\|[\, W \mid \sqrt{\varepsilon}\, I_d \,]\big\|_* = \operatorname{trace}\sqrt{WW^\top + \varepsilon I_d}$
    $= \min_{D \in \mathbb{S}^d_{\succeq 0},\ C(W) \subseteq C(D)} \ \operatorname{trace}\big(D^\dagger (WW^\top + \varepsilon I_d)\big) + \operatorname{trace}(D)$

Minimizer $D^* = \sqrt{WW^\top + \varepsilon I_d} \succeq \sqrt{\varepsilon}\, I_d$
Replace the constraint $w \in C(D)$ by the dummy constraint $D \succeq \sqrt{\varepsilon}\, I_d$
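The padded-matrix identity can be checked numerically. The scalar ε below stands in for the regularization parameter that appears under the square roots on this slide (its value here is an illustrative assumption): the singular values of $[\,W \mid \sqrt{\varepsilon}\,I_d\,]$ are the square roots of the eigenvalues of $WW^\top + \varepsilon I_d$.

```python
import numpy as np

rng = np.random.default_rng(4)
d, N, eps = 4, 6, 0.1
W = rng.standard_normal((d, N))

# Nuclear norm of the padded matrix [ W | sqrt(eps) I_d ]
padded = np.hstack([W, np.sqrt(eps) * np.eye(d)])
lhs = np.linalg.svd(padded, compute_uv=False).sum()

# trace sqrt(W W^T + eps I), via the eigenvalues of the PSD matrix
rhs = np.sqrt(np.linalg.eigvalsh(W @ W.T + eps * np.eye(d))).sum()

print(abs(lhs - rhs))  # numerically zero
```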
54. Min-max problem with explicit agreement

    $\min_{w_i \in \mathcal{W}_i} \ \sum_{i=1}^N f^i(w_i) + \gamma \big\|[\, W \mid \sqrt{\varepsilon}\, I_d \,]\big\|_*$
    $= \min_{\substack{w_i \in \mathcal{W}_i,\ D_i \in \mathbb{R}^{d \times d} \\ D_i = D_j\ \forall i,j}} \ \sup_{\substack{x_i \in \mathbb{R}^d \\ Y_i \in \mathbb{R}^{d \times d}}} \ \sum_{i=1}^N F_i(\underbrace{w_i, D_i}_{\text{convex}}, \underbrace{x_i, Y_i}_{\text{concave}})$

with

    $F_i(w, D, x, Y) := f_i(w) + \gamma \operatorname{trace}\big(D\,(\underbrace{-xx^\top - N\, YY^\top}_{\text{approx. reg.}})\big) - 2\gamma\, w^\top x - \underbrace{\tfrac{2\gamma}{N} \operatorname{trace}(Y)}_{\text{approx. reg.}} + \tfrac{1}{N} \operatorname{trace}(D)$
55. Distributed saddle-point dynamics for nuclear optimization

    $w_i(k+1) = P_{\mathcal{W}}\big[\, w_i(k) - \eta_k \big( g_i(k) - 2\gamma\, x_i(k) \big) \,\big]$
    $D_i(k+1) = P_{\mathcal{D}}\big[\, D_i(k) - \eta_k \gamma \big( -x_i x_i^\top - N\, Y_i Y_i^\top + \tfrac{1}{N} I_d \big) + \sigma \sum_{j=1}^N a_{ij,k} \big( D_j(k) - D_i(k) \big) \,\big]$
    $x_i(k+1) = x_i(k) + \eta_k \gamma \big( -2 D_i(k) x_i(k) - 2 w_i(k) \big)$
    $Y_i(k+1) = Y_i(k) + \eta_k \gamma \big( -\tfrac{2}{N} D_i(k) Y_i(k) - \tfrac{2}{N} I_d \big)$

User $i$ does not need to share $w_i$ with its neighbors!
and $D_i \to \sqrt{\sum_{i=1}^N w_i w_i^\top + \varepsilon I_d}$
Complexity: $1/\sqrt{k}$ saddle-point error × projection onto $\{D \succeq c\, I_d\}$
PD(D) := (D − cId )2 + cId (e.g., using Newton)
48 / 68
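The projection onto $\{D \succeq c I_d\}$ amounts to clipping eigenvalues at $c$. The slide suggests Newton iterations (for the matrix square root behind the positive part); the sketch below simply uses an eigendecomposition, which gives the same map.

```python
import numpy as np

def proj_cone(D, c):
    """Project a symmetric D onto {X : X >= c I} by clipping eigenvalues at c.
    This equals (D - c I)_+ + c I, with (A)_+ the PSD part of A."""
    eigval, U = np.linalg.eigh(D)
    return (U * np.maximum(eigval, c)) @ U.T

rng = np.random.default_rng(5)
A = rng.standard_normal((4, 4))
D = (A + A.T) / 2.0                 # a symmetric, generally indefinite matrix
P = proj_cone(D, c=0.2)

print(np.linalg.eigvalsh(P).min())            # >= 0.2: the result is feasible
print(np.allclose(proj_cone(P, 0.2), P))      # projecting again changes nothing
```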
56. Theorem (DMN-JC Distributed saddle-point nuclear optimization)
Assume that $f_i$ is convex and the set $\mathcal{W}_i$ is convex and compact.
Also, let the sequence of weight-balanced communication digraphs be
nondegenerate, and
B-jointly-connected
Then, for a suitable choice of the consensus stepsize $\sigma$ and the learning rates $\{\eta_k\}$,
the saddle-point evaluation error of the running-time averages decreases as $1/\sqrt{k}$
Proof follows from developments in the next section.
57. Privacy preserving matrix completion
Figure: distributed saddle-point dynamics versus centralized subgradient descent over iterations $k$, tracking
the matrix fitting error $\frac{\|W(k) - Z\|_F}{\|Z\|_F}$,
the cost with nuclear regularization $\sum_{i=1}^N \sum_{j \in \Upsilon_i} (W_{ij}(k) - Z_{ij})^2 + \gamma \big\|[\, W(k) \mid \sqrt{\varepsilon}\, I_d \,]\big\|_*$, and
the disagreement of the local matrices $\big( \sum_{i=1}^N \big\| D_i(k) - \tfrac{1}{N} \sum_{i=1}^N D_i(k) \big\|_F^2 \big)^{1/2}$
58. End of Part III
Contributions
First multi-agent treatment of nuclear norm regularization
Privacy preserving matrix completion
Future work
Inexact projections
Compressed matrix communication
59. Part IV (and last): Saddle-point problems
Distributed optimization + constraints
60. Organization
Distributed constrained optimization
Previous work and limitations
The magic of agreement on the multiplier
Saddle-point problems with explicit agreement
Nuclear norm min-max characterization
Lagrangians
Our distributed saddle-point dynamics with Laplacian averaging
Theorem for our dynamics: $1/\sqrt{t}$ saddle-point evaluation error
Outline of the proof
61. Distributed constrained optimization

    $\min_{w_i \in \mathcal{W}_i\ \forall i,\ D \in \mathcal{D}} \ \sum_{i=1}^N f^i(w_i, D)$
    $\text{s.t.}\ \ g^1(w_1, D) + \dots + g^N(w_N, D) \le 0$

$\{w_i\}_{i=1}^N$ are local decision vectors, $D$ is a global decision vector
Applications:
Traffic and routing
Resource allocation
Optimal control
Network formation
62. Previous work by type of constraint
All agents know the constraint:

    $\min \ \sum_{i=1}^N f^i(x) \quad \text{s.t.}\ g(x) \le 0$

2011 D. Yuan, S. Xu, and H. Zhao
2012 M. Zhu and S. Martínez
D. Yuan, D. W. C. Ho, and S. Xu (to appear)
Agent $i$ knows only $g^i$:

    $\min_{w_i \in \mathcal{W}_i} \ \sum_{i=1}^N f^i(w_i) \quad \text{s.t.}\ \sum_{i=1}^N g^i(w_i) \le 0$

2010 D. Mosk-Aoyama, T. Roughgarden and D. Shah (only linear constraints)
2014 T.-H. Chang, A. Nedić and A. Scaglione (primal-dual perturbation methods)
63. Distributing the constraint via agreement on multipliers

    $\min_{w_i \in \mathcal{W}_i,\ D \in \mathcal{D}} \ \sum_{i=1}^N f^i(w_i, D) \quad \text{s.t.}\ g^1(w_1, D) + \dots + g^N(w_N, D) \le 0$

Dual decomposition, if strong duality holds:

    $\min_{w_i \in \mathcal{W}_i,\ D \in \mathcal{D}} \ \max_{z \in \mathbb{R}^m_{\ge 0}} \ \sum_{i=1}^N f^i(w_i, D) + z^\top \sum_{i=1}^N g^i(w_i, D)$
    $= \min_{w_i \in \mathcal{W}_i,\ D \in \mathcal{D}} \ \max_{\substack{z_i \in \mathbb{R}^m_{\ge 0} \\ z_i = z_j\ \forall i,j}} \ \sum_{i=1}^N \big( f^i(w_i, D) + z_i^\top g^i(w_i, D) \big)$
    $= \min_{\substack{w_i \in \mathcal{W}_i,\ D_i \in \mathcal{D} \\ D_i = D_j\ \forall i,j}} \ \max_{\substack{z_i \in \mathbb{R}^m_{\ge 0} \\ z_i = z_j\ \forall i,j}} \ \sum_{i=1}^N \big( f^i(w_i, D_i) + z_i^\top g^i(w_i, D_i) \big)$
64. Saddle-point problems with explicit agreement
A more general framework:

    $\min_{\substack{w \in \mathcal{W} \\ (D_1, \dots, D_N) \in \mathcal{D}^N \\ D_i = D_j\ \forall i,j}} \ \max_{\substack{\mu \in \mathcal{M} \\ (z_1, \dots, z_N) \in \mathcal{Z}^N \\ z_i = z_j\ \forall i,j}} \ \phi\big( \underbrace{w, (D_1, \dots, D_N)}_{\text{convex}},\ \underbrace{\mu, (z_1, \dots, z_N)}_{\text{concave}} \big)$

Problem unstudied in the literature
Fits the min-max formulation of nuclear norm regularization (the concave part is quadratic)
Fits convex-concave functions arising from Lagrangians (the concave part is linear)
65. Our dynamics
Subgradient saddle-point dynamics with Laplacian averaging
(provided a saddle point exists under agreement)

    $\hat{w}_{t+1} = w_t - \eta_t\, g_{w_t}$
    $\hat{D}_{t+1} = D_t \underbrace{- \sigma L_t D_t}_{\text{Lap. aver.}} - \eta_t\, g_{D_t}$
    $\hat{\mu}_{t+1} = \mu_t + \eta_t\, g_{\mu_t}$
    $\hat{z}_{t+1} = z_t \underbrace{- \sigma L_t z_t}_{\text{Lap. aver.}} + \eta_t\, \hat{g}_{z_t}$
    $(w_{t+1}, D_{t+1}, \mu_{t+1}, z_{t+1}) = P_{\mathcal{S}}\big( \hat{w}_{t+1}, \hat{D}_{t+1}, \hat{\mu}_{t+1}, \hat{z}_{t+1} \big)$  (orthogonal projection)

$g_{w_t}, g_{D_t}, g_{\mu_t}, \hat{g}_{z_t}$ are subgradients and supergradients of $\phi(w, D, \mu, z)$,
and $\mathcal{S} = \mathcal{W} \times \mathcal{D}^N \times \mathcal{M} \times \mathcal{Z}^N$
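To make the dynamics concrete, here is a toy two-agent instance of saddle-point subgradient dynamics with Laplacian averaging on the local multipliers: minimize $\sum_i (w_i - 1)^2$ subject to $w_1 + w_2 \le 1$, whose KKT point is $w_i = 1/2$, $z = 1$. The objective, graph, and constant stepsizes are illustrative assumptions (the theorem below uses the Doubling Trick schedule), and this sketch omits the $D$ and $\mu$ blocks.

```python
import numpy as np

# phi(w, z) = sum_i [ (w_i - 1)^2 + z_i * (w_i - 1/2) ], with agreement z_1 = z_2
L = np.array([[1.0, -1.0], [-1.0, 1.0]])   # Laplacian of the 2-agent graph
w = np.zeros(2)
z = np.zeros(2)
eta, sigma = 0.05, 0.3

for _ in range(5000):
    grad_w = 2.0 * (w - 1.0) + z           # subgradient in the primal variables
    grad_z = w - 0.5                       # supergradient in the multipliers
    w = w - eta * grad_w                   # descent step
    # ascent step + Laplacian averaging, projected onto z >= 0
    z = np.maximum(z - sigma * (L @ z) + eta * grad_z, 0.0)

print(w, z)  # -> approximately [0.5, 0.5] and [1.0, 1.0]
```

Linearizing around the saddle point gives the contraction factor $1 - \eta$ per step, so the iterates converge here even without averaging.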
66. Theorem (DMN-JC Distributed saddle-point approximation)
Assume that
$\phi(w, D, \mu, z)$ is convex in $(w, D) \in \mathcal{W} \times \mathcal{D}^N$ and concave in $(\mu, z) \in \mathcal{M} \times \mathcal{Z}^N$
The dynamics is bounded
The sequence of weight-balanced communication digraphs is
nondegenerate, and
B-jointly-connected
For $\tilde\delta' \in (0, 1)$ and $\tilde\delta := \min\big\{ \tilde\delta',\ (1 - \tilde\delta')\,\tfrac{\delta}{d_{\max}} \big\}$,
choosing $\sigma \in \big[ \tfrac{\tilde\delta}{\delta},\ \tfrac{1 - \tilde\delta}{d_{\max}} \big]$ and $\{\eta_t\}$ based on the Doubling Trick scheme,
then for any saddle point $(w^*, D^*, \mu^*, z^*)$ of $\phi$ within agreement,

    $-\frac{\alpha}{\sqrt{t-1}} \ \le\ \phi(w^{\mathrm{av}}_t, D^{\mathrm{av}}_t, \mu^{\mathrm{av}}_t, z^{\mathrm{av}}_t) - \phi(w^*, D^*, \mu^*, z^*) \ \le\ \frac{\alpha}{\sqrt{t-1}}$

where $w^{\mathrm{av}}_t := \frac{1}{t} \sum_{s=1}^t w_s$, etc.
67. Outline of the proof
Inequality techniques from A. Nedić and A. Ozdaglar, 2009
Saddle-point evaluation error

    $t\,\phi(w^{\mathrm{av}}_{t+1}, D^{\mathrm{av}}_{t+1}, \mu^{\mathrm{av}}_{t+1}, z^{\mathrm{av}}_{t+1}) - t\,\phi(w^*, D^*, \mu^*, z^*)$  (2)

at running-time averages, $w^{\mathrm{av}}_{t+1} := \frac{1}{t} \sum_{s=1}^t w_s$, etc.
Bound for (2) in terms of
initial conditions
bounds on the subgradients and states of the dynamics
disagreement
sum of learning rates
Input-to-state stability with respect to agreement

    $\|L_K D_t\|_2 \le C_I \|D_1\|_2 \big(1 - \tfrac{\tilde\delta}{4N^2}\big)^{(t-1)/B} + C_U \max_{1 \le s \le t-1} \|d_s\|_2$

Doubling Trick scheme: for $m = 0, 1, 2, \dots, \lceil \log_2 t \rceil$, take $\eta_s = \frac{1}{\sqrt{2^m}}$ in each period of $2^m$ rounds $s = 2^m, \dots, 2^{m+1} - 1$
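The Doubling Trick schedule can be written down directly; below is a small helper (the function name and the decision to truncate the final partial period are my own choices).

```python
import math

def doubling_trick_stepsizes(T):
    """eta_s = 1/sqrt(2^m) for rounds s = 2^m, ..., 2^(m+1)-1, m = 0, 1, 2, ..."""
    etas = []
    m = 0
    while 2 ** m <= T:
        etas.extend([1.0 / math.sqrt(2 ** m)] * (2 ** m))  # constant within a period
        m += 1
    return etas[:T]   # truncate the last (possibly partial) period

etas = doubling_trick_stepsizes(10)
print(etas)  # 1 step at 1, 2 steps at 1/sqrt(2), 4 at 1/2, then 1/sqrt(8)
```

Within each period the stepsize is constant, so the standard constant-stepsize bound applies per period, and summing the geometric series over periods yields the $O(\sqrt{T})$ total.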
68. End of Part IV
Contributions
Distributed subgradient methods for saddle-point problems
Particularize to
saddle points of Lagrangians coming from separable constraints
the min-max characterization of the nuclear norm
Future work
Bounds on Lagrange vectors and matrices in a distributed way
Boundedness properties of the dynamics without truncated projections
Semidefinite constraints with chordal sparsity, where agents update the entries corresponding to maximal cliques subject to agreement on the intersections
71. “Noise-to-state exponentially stable distributed convex optimization on weight-balanced digraphs,” SIAM Journal on Control and Optimization, under revision
“pth moment noise-to-state stability of stochastic differential equations with persistent noise,” SIAM Journal on Control and Optimization 52 (4) (2014), 2399-2421
“Distributed online convex optimization over jointly connected digraphs,” IEEE Transactions on Network Science and Engineering 1 (1) (2014), 23-37
And conference versions presented in ACC 13, CDC 13, MTNS 14
72. “Distributed optimization for multi-task learning via nuclear-norm approximation,” IFAC Workshop on Distributed Estimation and Control in Networked Systems
“Distributed subgradient methods for saddle-point problems,” CDC 15 (in preparation for Transactions on Automatic Control)
73. Next steps
Algorithmic principles in biology
Homeostasis in neocortex:
noisy computational units
distribution of firing rates
recovery after disruptions
Differentiation in populations of cells:
distribution of phenotypes
recovery after injury or disease
cancer, stem cells
74. Beyond biology
Control and security of cyber-physical systems:
cooperative optimization + distributed trust
heterogeneous agents' dynamics & incentives
malicious minority