Minimax statistical learning with Wasserstein distances
by Jaeho Lee and Maxim Raginsky
January 26, 2019
Presenter: Kenta Oono @ NeurIPS 2018 Reading Club
Kenta Oono (@delta2323 )
Profile
• 2011.3: MSc. (Mathematics)
• 2011.4-2014.10: Preferred Infrastructure (PFI)
• 2014.10-current: Preferred Networks (PFN)
• 2018.4-current: Ph.D. student @U.Tokyo
Interests
• Mathematics
• Bioinformatics
• Theory of Deep Learning
2/18
Summary
What this paper does.
• Develop a distributionally-robust risk minimization problem.
• Derive the excess-risk rate O(n^{-1/2}), the same as in the non-robust case.
• Application to domain adaptation.
Why I chose this paper
• Spotlight talk
• Wanted to learn statistical learning theory
• Especially minimax optimality of deep learning, though this paper turned out not to be about that.
• Wanted to learn Wasserstein distance
3/18
Problem Setting (Expected Risk)
Given
• Z: sample space
• P: (unknown) distribution over Z
• Dataset: D = (z1, . . . , zn) ∼ P i.i.d.
For a hypothesis f : Z → R, we evaluate its expected risk by
• Expected Risk: R(P, f ) = EZ∼P[f (Z)]
• Hypothesis space: F ⊂ {Z → R}
4/18
Problem Setting (Estimator)
Goal:
• Devise an algorithm A : D → ˆf = ˆf (D)
• We treat D as a random variable, so ˆf is random too.
• If A is a randomized algorithm (e.g., SGD), the randomness of ˆf (D) also comes from A.
• Evaluate excess risk: R(P, ˆf ) − inff ∈F R(P, f )
Typical forms of theorems:
• EA,D[R(P, ˆf ) − inff ∈F R(P, f )] = O(g(n))
• R(P, ˆf ) − inff ∈F R(P, f ) = O(g(n, δ)) with probability 1 − δ over the choice of D (and A)
5/18
Problem Setting (ERM Estimator)
Since we cannot compute the expected risk R, we compute the empirical risk instead:
ˆRD(f ) = (1/n) ∑_{i=1}^{n} f (zi ) = R(Pn, f )   (Pn: empirical distribution).
The ERM (Empirical Risk Minimization) estimator for the hypothesis space F is (sketched below)
ˆf = ˆf (D) ∈ argmin_{f ∈F} R(Pn, f )
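To make ERM concrete, here is a minimal sketch, assuming a finite hypothesis space and a toy mean-estimation problem (none of these choices come from the slides):

```python
import numpy as np

def empirical_risk(f, data):
    """R(Pn, f) = (1/n) * sum_i f(zi)."""
    return np.mean([f(z) for z in data])

def erm(hypotheses, data):
    """Pick any minimizer of the empirical risk over a finite F."""
    return min(hypotheses, key=lambda f: empirical_risk(f, data))

# Toy example: Z = R, f_c(z) = (z - c)^2, so ERM estimates the mean of P.
rng = np.random.default_rng(0)
data = rng.normal(loc=1.0, scale=1.0, size=100)
hypotheses = [lambda z, c=c: (z - c) ** 2 for c in np.linspace(-3, 3, 61)]
f_hat = erm(hypotheses, data)
```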
6/18
Relation
7/18
Assumptions
(Assumption statements shown as images from the paper.)
Ref. Lee and Raginsky (2018)
8/18
Example
Supervised learning
• Z = X × Y , X = R^D: input space, Y = R: label space
• ℓ : Y × Y → R: loss function
• H ⊂ {X → Y }: set of models
• F = {fh(x, y) = ℓ(h(x), y) | h ∈ H}
Regression
• X = R^D, Y = R, ℓ(y, y′) = (y − y′)^2
• H = (functions realizable by neural networks with a fixed architecture)
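A hedged sketch of the construction F = {fh(x, y) = ℓ(h(x), y) | h ∈ H} for the regression example; linear models stand in for the fixed neural architecture, and all names are illustrative:

```python
import numpy as np

def squared_loss(y_pred, y):
    return (y_pred - y) ** 2

def make_f(h, loss=squared_loss):
    """Lift a model h: X -> Y to an element f_h of F acting on z = (x, y)."""
    return lambda z: loss(h(z[0]), z[1])

def make_h(w):
    """Linear model h_w(x) = <w, x>, standing in for a fixed architecture."""
    return lambda x: np.dot(w, x)

f_h = make_f(make_h(np.array([1.0, -2.0])))
print(f_h((np.array([0.5, 0.5]), 1.0)))  # loss of h at the sample z = (x, y)
```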
9/18
Classical Result
Typically, we have
R(P, ˆf ) − inf_{f ∈F} R(P, f ) = OP( (complexity of F) / √n )
Here "complexity of F" is a model complexity measure (intuitively, how "large" F is).
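The √n comes from uniform deviations sup_{f∈F} |R(Pn, f ) − R(P, f )|. A quick Monte Carlo sketch for the class of threshold indicators F = {z ↦ 1[z ≤ t]}, where this supremum is the Kolmogorov–Smirnov statistic and is known to scale like n^{-1/2} (the setup is mine, not the paper's):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
grid = np.linspace(-3, 3, 201)

def sup_deviation(n):
    """sup_t |Pn(Z <= t) - P(Z <= t)| for P = N(0, 1), over a t-grid."""
    z = rng.normal(size=n)
    emp = (z[None, :] <= grid[:, None]).mean(axis=1)  # empirical CDF on grid
    return np.abs(emp - norm.cdf(grid)).max()

for n in [100, 1000, 10000]:
    devs = [sup_deviation(n) for _ in range(50)]
    print(n, np.sqrt(n) * np.mean(devs))  # roughly constant in n
```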
10/18
Covering number
Definition (Covering Number)
For F ⊂ F0 := {f : [−1, 1]^D → R} and ε > 0, the (external) covering number of F is
N(F, ε) := inf{ N ∈ N : ∃ f1, . . . , fN ∈ F0 s.t. ∀ f ∈ F, ∃ n ∈ [N] s.t. ‖f − fn‖∞ ≤ ε }.
• Intuition: the minimum number of balls (of radius ε) needed to cover F.
• Entropy integral: C(F) := ∫_0^∞ √(log N(F, u)) du.
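A hedged sketch of upper-bounding N(F, ε) by greedy covering, with each function represented by its values on a finite grid so the sup norm is approximated there (the grid and the toy class are illustrative):

```python
import numpy as np

def greedy_cover_size(values, eps):
    """Greedy upper bound on the eps-covering number.

    values: (m, k) array; row i holds function f_i sampled at k grid points.
    Distances are the sup norm restricted to the grid.
    """
    centers = []
    for v in values:
        if not any(np.max(np.abs(v - c)) <= eps for c in centers):
            centers.append(v)  # v is not yet eps-covered; make it a center
    return len(centers)

# Toy class: f_c(x) = sin(c * x), c in [0, 2], on a grid of [-1, 1].
xs = np.linspace(-1, 1, 101)
values = np.array([np.sin(c * xs) for c in np.linspace(0, 2, 200)])
print(greedy_cover_size(values, eps=0.1))
```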
11/18
Distributionally Robust Framework
Minimize the worst-case risk over distributions close to the true distribution P:
minimize R(P, f )
↓
minimize Rρ,p(P, f ) := sup_{Q∈Aρ,p(P)} R(Q, f )
We take the ambiguity set to be a p-Wasserstein ball:
Aρ,p(P) = {Q : Wp(P, Q) ≤ ρ}
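When both P and the candidate support of Q live on a fixed finite grid, the worst-case risk for p = 1 can be computed exactly as a linear program over couplings π. A sketch under those assumptions (the grid restriction and function names are mine, not the paper's):

```python
import numpy as np
from scipy.optimize import linprog

def worst_case_risk(f_vals, p_weights, cost, rho):
    """sup { E_Q[f] : W_1(P, Q) <= rho }, Q restricted to a fixed grid.

    f_vals: (k,) values of f on the candidate support points.
    p_weights: (m,) probability weights of P on its support.
    cost: (m, k) transport costs d(z_i, z'_j).
    """
    m, k = cost.shape
    # Decision variables: coupling pi[i, j] >= 0, flattened row-major.
    c = -np.tile(f_vals, m)                 # maximize sum_ij pi_ij f(z'_j)
    A_eq = np.kron(np.eye(m), np.ones(k))   # sum_j pi[i, j] = p_weights[i]
    A_ub = cost.reshape(1, -1)              # total transport cost <= rho
    res = linprog(c, A_ub=A_ub, b_ub=[rho], A_eq=A_eq, b_eq=p_weights,
                  bounds=(0, None))
    return -res.fun
```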
Applications
• Adversarial attacks: ρ = noise level
• Domain adaptation: ρ = discrepancy between train and test distributions
12/18
Estimator
Correspondingly, we change the estimator to the robust ERM
ˆf ∈ argmin_{f ∈F} Rρ,p(Pn, f )
and we want to evaluate the robust excess risk
Rρ,p(P, ˆf ) − inf_{f ∈F} Rρ,p(P, f )
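The analysis goes through a dual form of the robust risk (see the Key Lemmas and Impression slides). For p = 1, one common statement is Rρ,1(P, f ) = inf_{λ≥0} { λρ + E_P[ sup_{z'} ( f (z') − λ d(z, z') ) ] }; a hedged finite-grid sketch (the λ-grid and names are mine, and the exact bookkeeping of constants differs between references):

```python
import numpy as np

def dual_robust_risk(f_vals, p_weights, cost, rho, lambdas):
    """Evaluate the p = 1 dual on a finite grid of candidate lambdas.

    psi_{lam,f}(z_i) = max_j ( f(z'_j) - lam * d(z_i, z'_j) ).
    """
    best = np.inf
    for lam in lambdas:
        psi = (f_vals[None, :] - lam * cost).max(axis=1)
        best = min(best, lam * rho + p_weights @ psi)
    return best
```

Up to the two grid discretizations, this should match the primal LP value from the earlier worst_case_risk sketch, by LP duality.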
13/18
Main Theorems
Same excess-risk rate as the non-robust setting.
Ref. Lee and Raginsky (2018)
14/18
Strategy
From the authors' slides.
Ref: https://nips.cc/media/Slides/nips/2018/517cd(05-09-45)-05-10-20-12649-Minimax_Statist.pdf
15/18
Key Lemmas
Ref. Lee and Raginsky (2018)
16/18
Why are these lemmas important?
(Complexity of ΨΛ,F ) ≈ (Complexity of F) × (Complexity of Λ)
17/18
Impression
• The dual form of the risk (Rρ(P, f ) = inf_{λ≥0} E[ψλ,f (Z)]) may be useful in its own right.
• Assumption 4 is mysterious (a surprisingly local property of F).
• Is there special structure in the p = 1 Wasserstein distance?
18/18