Learning bounds for Risk-sensitive learning
… or, “Robust and Fair ML with Vapnik & Chervonenkis”

Jaeho Lee, Sejun Park, Jinwoo Shin
Korea Advanced Institute of Science and Technology (KAIST)
Contact: jaeho-lee@kaist.ac.kr
Code: https://github.com/jaeho-lee/oce
Motivation: Robust and fair learning
Truth. Empirical risk minimization (ERM) is a theoretical foundation for ML:
$\hat{f}_{\mathsf{erm}} \triangleq \operatorname{argmin}_{f \in \mathcal{F}} \sum_{i=1}^{n} \frac{1}{n} \cdot f(Z_i)$
Also Truth. Modern-day ML is more than just ERM. We weigh samples differently, based on their loss values:
$\hat{f} \triangleq \operatorname{argmin}_{f \in \mathcal{F}} \sum_{i=1}^{n} w_i \cdot f(Z_i)$
Here $w_i$ depends on $f(Z_i)$, relative to $f(Z_1), f(Z_2), \dots, f(Z_n)$.
Examples.
- Robust learning with outliers / noisy labels (high-loss samples are ignored) [1]
- Curriculum learning (low-loss samples are prioritized) [2]
- Fair ML with individual fairness criteria (low-loss samples are ignored) [3]
[1] e.g., Han et al., “Co-teaching: Robust training of deep neural networks with extremely noisy labels,” NeurIPS 2018.

[2] e.g., Pawan Kumar et al., “Self-paced learning for latent variable models,” NeurIPS 2010.

[3] e.g., Williamson et al., “Fairness risk measures,” ICML 2019.
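The three weighting schemes listed above can be sketched as simple loss-dependent weight rules. This is an illustrative toy (the sample losses, cutoffs, and function names are made up for this sketch, not from the paper):

```python
import numpy as np

losses = np.array([0.2, 0.5, 3.0, 0.1, 0.4, 2.5])  # per-sample losses f(Z_i)

def robust_weights(l, k=2):
    # robust learning: ignore the k highest-loss samples (outliers / noisy labels)
    w = np.ones_like(l)
    w[np.argsort(l)[-k:]] = 0.0
    return w / w.sum()

def curriculum_weights(l, temp=1.0):
    # curriculum learning: prioritize low-loss ("easy") samples
    w = np.exp(-l / temp)
    return w / w.sum()

def cvar_weights(l, alpha=0.5):
    # CVaR-style fairness: keep only the ceil(alpha * n) highest-loss samples
    k = int(np.ceil(alpha * len(l)))
    w = np.zeros_like(l)
    w[np.argsort(l)[-k:]] = 1.0
    return w / w.sum()

print(robust_weights(losses))   # zero weight on the two high-loss samples
print(cvar_weights(losses))     # weight only on the worst half
```

In each case the weight $w_i$ is a function of where $f(Z_i)$ ranks among all the losses, which is exactly what places these methods outside plain ERM.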
Question. Can we give convergence guarantees for algorithms with loss-dependent weights?
Challenge. What theoretical framework should we use?
Framework: Optimized Certainty Equivalents (OCE)
History. Invented by Ben-Tal and Teboulle (1986) to characterize risk-aversion; it extends the utility-theoretic perspective of von Neumann and Morgenstern.
[Figure: a concave utility curve (diminishing marginal utility), mapping objective income to subjective utility.]
Definition. Capture risk-averse behavior using a convex disutility function $\phi$ (i.e., a negative utility):
$\mathsf{oce}(f, P) \triangleq \inf_{\lambda \in \mathbb{R}} \left\{ \lambda + E_P[\phi(f(Z) - \lambda)] \right\}$
Here $\lambda$ is the certain present loss, and $E_P[\phi(f(Z) - \lambda)]$ is the uncertain future disutility.
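As a quick sanity check of the definition (a minimal numerical sketch, not the authors' code): with the CVaR disutility $\phi(t) = \max(t, 0)/\alpha$, the OCE of an empirical distribution should equal the average of the worst $\alpha$-fraction of losses.

```python
import numpy as np

def empirical_oce(losses, phi, lam_grid):
    # oce(f, P_n) = min over lambda of:  lambda + mean(phi(loss - lambda))
    return min(lam + np.mean(phi(losses - lam)) for lam in lam_grid)

alpha = 0.5
phi_cvar = lambda t: np.maximum(t, 0.0) / alpha  # CVaR disutility

losses = np.array([1.0, 2.0, 3.0, 4.0])
grid = np.linspace(0.0, 5.0, 5001)
cvar = empirical_oce(losses, phi_cvar, grid)
print(cvar)  # ≈ 3.5, the average of the worst 50% of losses
```

The minimizing $\lambda$ is the value-at-risk (the $(1-\alpha)$-quantile of the losses), recovering Rockafellar and Uryasev's variational formula for CVaR.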
ML view. We are penalizing the average loss plus a deviation penalty:
$\mathsf{oce}(f, P) = E_P[f(Z)] + \inf_{\lambda \in \mathbb{R}} \left\{ E_P[\varphi(f(Z) - \lambda)] \right\}$
for the convex function $\varphi(t) = \phi(t) - t$. The second term is a “deviation penalty” measuring how far the losses, from $f(Z_{\mathsf{low\text{-}loss}})$ to $f(Z_{\mathsf{high\text{-}loss}})$, spread around the optimized anchor $\lambda^*$.
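The identity above can be verified numerically; the following sketch (illustrative, with CVaR as one concrete choice of $\phi$) computes the OCE both directly and through the mean-plus-deviation decomposition:

```python
import numpy as np

# Decomposition check: oce = E[f] + inf_lambda E[varphi(f - lambda)],
# with varphi(t) = phi(t) - t, here for the CVaR disutility phi(t) = max(t,0)/alpha.
alpha = 0.5
phi = lambda t: np.maximum(t, 0.0) / alpha
varphi = lambda t: phi(t) - t

losses = np.array([1.0, 2.0, 3.0, 4.0])
grid = np.linspace(-2.0, 6.0, 8001)

oce_direct = min(lam + np.mean(phi(losses - lam)) for lam in grid)
oce_decomp = np.mean(losses) + min(np.mean(varphi(losses - lam)) for lam in grid)
print(oce_direct, oce_decomp)  # identical: the decomposition is an algebraic identity
```

The two agree because $\lambda + E[\phi(X - \lambda)] = E[X] + E[\varphi(X - \lambda)]$ holds pointwise in $\lambda$, so taking the infimum on either side gives the same value.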
Examples. The framework covers a wide range of “risk-averse” measures of loss:
- Average + variance penalty [1]
- Conditional value-at-risk (i.e., ignore low-loss samples) [2]
- Entropic risk measure (i.e., exponentially tilted loss) [3]
Note: OCE is complementary to rank-based approaches (come to our poster session for details!).
[1] e.g., Maurer and Pontil, “Empirical Bernstein bounds and sample variance penalization,” COLT 2009.

[2] e.g., Curi et al., “Adaptive sampling for stochastic risk-averse learning,” NeurIPS 2020.

[3] e.g., Li et al., “Tilted empirical risk minimization,” arXiv 2020.
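As a concrete instance (a hedged numerical sketch, not the paper's code): the entropic risk measure arises from the disutility $\phi(t) = e^t - 1$, for which the inner infimum has the closed form $\log E_P[\exp(f(Z))]$, i.e., the exponentially tilted loss.

```python
import numpy as np

# Entropic risk via OCE: phi(t) = exp(t) - 1 gives oce(f, P) = log E[exp(f(Z))].
losses = np.array([0.1, 0.5, 1.0, 2.0])

grid = np.linspace(-5.0, 5.0, 20001)
oce_val = min(lam + np.mean(np.exp(losses - lam) - 1.0) for lam in grid)
closed_form = np.log(np.mean(np.exp(losses)))
print(oce_val, closed_form)  # agree up to the grid resolution
```

Setting the derivative of $\lambda + E[e^{f(Z)-\lambda}] - 1$ to zero gives $\lambda^* = \log E[e^{f(Z)}]$, and plugging it back in yields exactly the closed form, which is what the grid search confirms.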
Inverted OCE. A new notion to address “risk-seeking” algorithms (e.g., ones that ignore high-loss samples):
$\overline{\mathsf{oce}}(f, P) \triangleq E_P[f(Z)] - \inf_{\lambda \in \mathbb{R}} \left\{ E_P[\varphi(\lambda - f(Z))] \right\}$
Results: Two learning bounds
What we do. We analyze the empirical OCE minimization procedure, just as Vapnik & Chervonenkis studied empirical risk minimization (we also give an inverted-OCE version):
$\hat{f}_{\mathsf{eom}} \triangleq \operatorname{argmin}_{f \in \mathcal{F}} \mathsf{oce}(f, P_n)$
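A toy sketch of what such a procedure can look like in practice (the data, step sizes, and CVaR instantiation are all illustrative assumptions, not the paper's experiments): jointly run subgradient descent on the model parameter and the anchor $\lambda$.

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.normal(1.0, 0.5, size=200)   # toy data
alpha = 0.3                          # CVaR level: focus on the worst 30% of losses

def loss(theta):
    return (z - theta) ** 2          # per-sample loss f_theta(Z_i)

def empirical_cvar(theta):
    k = int(np.ceil(alpha * len(z)))
    return np.sort(loss(theta))[-k:].mean()  # average of the worst alpha-fraction

theta, lam, lr = 0.0, 0.0, 0.01
for _ in range(5000):
    l = loss(theta)
    active = (l > lam).astype(float) / alpha           # subgradient of phi(t)=max(t,0)/alpha
    theta -= lr * np.mean(active * 2.0 * (theta - z))  # chain rule through the loss
    lam -= lr * (1.0 - np.mean(active))
print(theta, empirical_cvar(theta))  # empirical CVaR near its minimum over theta
```

The key point is that the OCE objective is jointly convex in $(\theta, \lambda)$ whenever the per-sample loss is convex in $\theta$, so plain (sub)gradient descent on the pair is a sound minimization strategy.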
In a nutshell. We give learning bounds of two different types.

Theorem 3 (excess OCE bound):
$\mathsf{oce}(\hat{f}_{\mathsf{eom}}, P) - \inf_{f \in \mathcal{F}} \mathsf{oce}(f, P) \approx \mathcal{O}\!\left( \frac{\mathsf{Lip}(\phi) \cdot \mathsf{comp}(\mathcal{F})}{\sqrt{n}} \right)$

Theorem 6 (excess expected loss bound):
$E_P[\hat{f}_{\mathsf{eom}}(Z)] - \inf_{f \in \mathcal{F}} E_P[f(Z)] \approx \mathcal{O}\!\left( \frac{\mathsf{comp}(\mathcal{F})}{\sqrt{n}} \right)$

(come to our poster session for details!)
Also… We discover a relationship to the sample variance penalization (SVP) procedure, and find that SVP is an effective baseline strategy for batch-based OCE minimization (come to our poster session for details!).
TL;DR.
- We give an OCE-based theoretical framework to address robust/fair ML.
- We give excess risk bounds for empirical OCE minimizers.

Come to our Zoom session for more details, including:
- Further implications of our theoretical results
- Proof ideas
- Experiment details
- Comparisons with alternative frameworks