1. A Statistical Perspective on Retrieval-Based Models
A Statistical Perspective
on Retrieval-Based Models
ICML, 2023
Soumya Basu, Ankit Singh Rawat, Manzil Zaheer
Speaker: Po-Chuan Chen
Oct 12, 2023
1 / 41
2. A Statistical Perspective on Retrieval-Based Models
Table of contents
1 Abstract
2 Introduction
3 Problem setup
4 Local empirical risk minimization
5 Classification in extended feature space
6 Experiments
7 Conclusion and future direction
2 / 41
4. A Statistical Perspective on Retrieval-Based Models
Abstract
Abstract
This paper provides a formal treatment of retrieval-based models and
characterizes their performance via a novel statistical perspective.
They study two different methods:
Analyzing an explicit local learning framework
Learning a global model using kernel methods
4 / 41
6. A Statistical Perspective on Retrieval-Based Models
Introduction
Introduction
A popular way to increase the expressiveness of an ML model is to
homogeneously scale up the size of a parametric model.
Such large models, however, have their own limitations
High computation cost
Catastrophic forgetting
Lack of provenance
Poor explainability
6 / 41
7. A Statistical Perspective on Retrieval-Based Models
Introduction
Introduction
Figure 1: An illustration of a retrieval-based classification model.
7 / 41
8. A Statistical Perspective on Retrieval-Based Models
Introduction
Contribution
1 Setting up a formal framework for classification via
retrieval-based models under local structure
2 Finite sample analysis of explicit local learning framework
3 Extending the analysis to a globally learnt model
4 Providing the first rigorous treatment of an end-to-end
retrieval-based model to study its generalization by using
kernel-based learning
8 / 41
9. A Statistical Perspective on Retrieval-Based Models
Problem setup
Table of contents I
1 Abstract
2 Introduction
3 Problem setup
Multiclass classification
Classification with local structure
Retrieval-based classification model
4 Local empirical risk minimization
5 Classification in extended feature space
9 / 41
10. A Statistical Perspective on Retrieval-Based Models
Problem setup
Table of contents II
6 Experiments
7 Conclusion and future direction
10 / 41
11. A Statistical Perspective on Retrieval-Based Models
Problem setup
Multiclass classification
Multiclass classification
The learner has access to n training examples S = {(xi, yi)}i∈[n] ⊂ X × Y,
sampled i.i.d. from the data distribution D := DX,Y.
For a scorer f, the classifier takes the form:
hf(x) = arg max_{y∈Y} fy(x)
Given a class of scorers Fglobal ⊆ {f : X → ℝ^|Y|}, learning a model amounts
to finding a scorer in Fglobal that minimizes the misclassification error, or
expected 0/1 loss:
f*_{0/1} = arg min_{f∈Fglobal} ℙD(hf(X) ≠ Y)
11 / 41
12. A Statistical Perspective on Retrieval-Based Models
Problem setup
Multiclass classification
Multiclass classification
Here, a surrogate loss [1] 𝓁 is used for the misclassification error, and the
aim is to minimize the associated population risk:
R𝓁(f) = 𝔼(X,Y)∼D[𝓁(f(X), Y)]
A good scorer can be learned by minimizing the (global) empirical risk over
the function class Fglobal:
f̂ = arg min_{f∈Fglobal} (1/n) Σ_{i∈[n]} 𝓁(f(xi), yi)
Accordingly, R̂𝓁(f) := (1/n) Σ_{i∈[n]} 𝓁(f(xi), yi).
12 / 41
13. A Statistical Perspective on Retrieval-Based Models
Problem setup
Classification with local structure
Classification with local structure
They define the data in each local neighborhood as
Bx,r := {x′ ∈ X : 𝕕(x, x′) ≤ r}, where x ∈ X and r > 0.
Dx,r is the data distribution restricted to Bx,r:
Dx,r(A) = D(A) / D(Bx,r × Y),  for A ⊆ Bx,r × Y
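As a concrete sketch, retrieving the training points that fall in the ball
Bx,r can be implemented as a linear scan (a minimal illustration assuming a
Euclidean metric 𝕕; the function name and data layout are hypothetical):

```python
import math

def retrieve_neighbors(x, train_set, r):
    """Return the retrieved set Rx: all training pairs (x', y') whose
    input lies in the ball B_{x,r} around x (Euclidean metric assumed)."""
    def dist(a, b):
        return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
    return [(xp, yp) for (xp, yp) in train_set if dist(x, xp) <= r]
```

In practice, retrieval-based systems replace this linear scan with an
approximate nearest-neighbor index; the definition of Bx,r is unchanged.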
13 / 41
14. A Statistical Perspective on Retrieval-Based Models
Problem setup
Classification with local structure
Classification with local structure
This yields a local structure condition under which the local function class
approximates the Bayes optimal for the local classification problem.
That is, for a given 𝜀X > 0 and all x ∈ X, we have
min_{f∈Fx} Rx𝓁(f) ≤ min_{f∈Fglobal} Rx𝓁(f) + 𝜀X
And the local population risk is defined as
Rx𝓁(f) = 𝔼(X′,Y′)∼Dx,r[𝓁(f(X′), Y′)]
14 / 41
15. A Statistical Perspective on Retrieval-Based Models
Problem setup
Retrieval-based classification model
Retrieval-based classification model
In this paper, they focus on retrieval-based methods.
Given an instance x, the local empirical risk minimization (ERM) approach
first retrieves a neighboring set Rx = {(x′j, y′j)} ⊆ S.
It then identifies a scorer f̂x from a function class Floc ⊂ {f : X → ℝ^|Y|}:
f̂x = arg min_{f∈Floc} (1/|Rx|) Σ_{(x′,y′)∈Rx} 𝓁(f(x′), y′)
If |Rx| = 0, f̂x ∈ Floc is chosen arbitrarily.
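A minimal sketch of this procedure, assuming a Euclidean metric and taking
Floc to be the class of constant scorers: under 0/1 loss, local ERM over
constants reduces to a majority vote on Rx. The paper analyzes richer classes
(linear models, FC-DNNs); this is purely illustrative.

```python
from collections import Counter

def local_erm_predict(x, train_set, r):
    """Local ERM sketch with Floc = constant scorers: ERM under 0/1 loss
    on the retrieved set Rx reduces to a majority vote over its labels."""
    # Retrieve Rx = {(x', y') in S : d(x, x') <= r} (Euclidean metric assumed).
    def dist2(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    R_x = [(xp, yp) for (xp, yp) in train_set if dist2(x, xp) <= r * r]
    if not R_x:
        return 0  # |Rx| = 0: an arbitrary choice, as in the setup
    return Counter(y for _, y in R_x).most_common(1)[0][0]
```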
15 / 41
16. A Statistical Perspective on Retrieval-Based Models
Problem setup
Retrieval-based classification model
Retrieval-based classification model
Another approach is classification with an extended feature space, in which
the scorer directly maps the augmented input (x, Rx) ∈ X × (X × Y)* to
per-class scores.
A scorer can be learned over the extended feature space X × (X × Y)* as
follows:
f̂ex = arg min_{f∈Fex} R̂ex𝓁(f)
where R̂ex𝓁(f) := (1/n) Σ_{i∈[n]} 𝓁(f(xi, Rxi), yi) and the function class of
interest over the extended space is denoted Fex ⊂ {f : X × (X × Y)* → ℝ^|Y|}.
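One hypothetical way to feed the augmented input (x, Rx) to a scorer is to
flatten it into a fixed-length vector; the histogram featurization below is
an illustrative assumption, not the paper's construction:

```python
def extended_features(x, R_x, num_classes):
    """Flatten the augmented input (x, Rx): concatenate x with the
    normalized label histogram of the retrieved set. A scorer in Fex
    could then map this fixed-length vector to |Y| class scores."""
    hist = [0.0] * num_classes
    for _, y in R_x:
        hist[y] += 1.0
    total = sum(hist) or 1.0  # guard against an empty retrieved set
    return list(x) + [h / total for h in hist]
```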
16 / 41
17. A Statistical Perspective on Retrieval-Based Models
Local empirical risk minimization
Table of contents I
1 Abstract
2 Introduction
3 Problem setup
4 Local empirical risk minimization
Assumptions
Excess risk bound for local ERM
Illustrative examples
Endowing local ERM with global representations
17 / 41
18. A Statistical Perspective on Retrieval-Based Models
Local empirical risk minimization
Table of contents II
5 Classification in extended feature space
6 Experiments
7 Conclusion and future direction
18 / 41
19. A Statistical Perspective on Retrieval-Based Models
Local empirical risk minimization
Local empirical risk minimization
The goal is to characterize the excess risk of local ERM, i.e., to bound
𝔼(X,Y)∼D[𝓁(f̂X(X), Y) − 𝓁(f*(X), Y)]
Here f̂X (X) in the above equation is a function of RX.
19 / 41
20. A Statistical Perspective on Retrieval-Based Models
Local empirical risk minimization
Assumptions
Assumptions
First, they define the margin of a scorer f at a given label y ∈ Y as
𝛾f(x, y) = fy(x) − max_{y′≠y} fy′(x)
To ensure the margin of f deviates smoothly as x varies, a scorer f is called
L-coordinate Lipschitz iff for all y ∈ Y and x, x′ ∈ X,
|fy(x) − fy(x′)| ≤ L∥x − x′∥2
They also define a weak margin condition: given a distribution D, a scorer f
satisfies the (𝛼, c)-weak margin condition iff, for all t ≥ 0,
ℙ(X,Y)∼D(|𝛾f(X, Y)| ≤ t) ≤ ct^𝛼
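The margin definition translates directly into code (a minimal sketch, where
`scores` plays the role of the score vector (fy(x)) over all y ∈ Y):

```python
def margin(scores, y):
    """Margin of a scorer at label y: score of y minus the best other
    score. A positive margin means the classifier hf predicts y."""
    return scores[y] - max(s for i, s in enumerate(scores) if i != y)
```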
20 / 41
21. A Statistical Perspective on Retrieval-Based Models
Local empirical risk minimization
Assumptions
Assumption 3.1 (True scorer function)
There is a scorer ftrue that, for all (x, y) ∈ X × Y, generates the true
label, i.e., 𝛾ftrue(x, y) > 0. Moreover, ftrue is Ltrue-coordinate Lipschitz
and satisfies the (𝛼true, ctrue)-weak margin condition.
21 / 41
22. A Statistical Perspective on Retrieval-Based Models
Local empirical risk minimization
Assumptions
Assumption 3.2 (Margin-based Lipschitz loss)
For any given example (x, y) and any scorer f, we have
𝓁(f(x), y) = 𝓁(𝛾f(x, y)), where 𝓁 is a decreasing function of the margin.
Moreover, the loss 𝓁 is an L𝓁-Lipschitz function, i.e.,
|𝓁(𝛾) − 𝓁(𝛾′)| ≤ L𝓁|𝛾 − 𝛾′|, ∀𝛾 ≥ 𝛾′.
22 / 41
23. A Statistical Perspective on Retrieval-Based Models
Local empirical risk minimization
Assumptions
Assumption 3.3 (Data regularity condition)
Weak density condition.
There exist constants cwdc > 0 and 𝛿wdc > 0 such that, for all x ∈ X and all
r with 𝜌D(x)r^d ≤ 𝛿wdc^d,
ℙX′∼D[𝕕(X′, x) ≤ r] ≥ cwdc^d 𝜌D(x)r^d
Density level-set.
There exists a function f𝜌(𝛿) with f𝜌(𝛿) → 0 as 𝛿 → 0, such that for
any 𝛿 > 0,
ℙX∼D[𝜌D(X) ≤ f𝜌(𝛿)] ≤ 𝛿
23 / 41
24. A Statistical Perspective on Retrieval-Based Models
Local empirical risk minimization
Assumptions
Assumption 3.4 (Weak + density condition)
There exist constants cwdc+ ≥ 0 and 𝛼wdc+ > 0 such that, for all x ∈ X and
r ∈ [0, rmax],
ℙX′∼D[𝕕(X′, x) ≤ r] / (𝜌D(x) vold(r)) − 1 ≤ cwdc+ r^𝛼wdc+
Under this assumption the local ERM error bounds can be tightened
further.
24 / 41
25. A Statistical Perspective on Retrieval-Based Models
Local empirical risk minimization
Excess risk bound for local ERM
Excess risk bound for local ERM
We now proceed to the main results on the excess risk bound of local ERM.
At x ∈ X, fx,∗ denotes the minimizer of the population version of the local
loss, and f∗ that of the global loss:
fx,∗ = arg min_{f∈Floc} Rx𝓁(f);  f∗ = arg min_{f∈Fglobal} R𝓁(f)
The next slide shows how the expected excess risk of the local ERM solution
f̂X is bounded, via a risk decomposition.
25 / 41
26. A Statistical Perspective on Retrieval-Based Models
Local empirical risk minimization
Excess risk bound for local ERM
𝔼(X,Y)∼D[𝓁(f̂X(X), Y) − 𝓁(f∗(X), Y)]
≤ 𝔼(X,Y)∼D[RX𝓁(fX,∗) − RX𝓁(f∗)]   (Local vs Global Population Optimal Risk)
+ Σ_{F∈{Fglobal, Floc}} 𝔼(X,Y)∼D[sup_{f∈F} RX𝓁(f) − 𝓁(f(X), Y)]   (Global and Local: Sample vs Retrieved Set Risk)
+ 𝔼(X,Y)∼D[sup_{f∈Floc} RX𝓁(f) − R̂X𝓁(f)]   (Generalization of Local ERM)
+ 𝔼(X,Y)∼D[RX𝓁(fX,∗) − R̂X𝓁(fX,∗)]   (Central Absolute Moment of fX,∗)
26 / 41
27. A Statistical Perspective on Retrieval-Based Models
Local empirical risk minimization
Excess risk bound for local ERM
Excess risk bound for local ERM
Here, a tighter bound can be obtained by utilizing the local structure of the
distribution Dx,r. For any L > 0, define
Mr(L; 𝓁, ftrue, F) := 2L𝓁(Lr + (2∥F∥∞ − Lr) ctrue (2Ltrue r)^𝛼true)
For any x ∈ X, the weak density condition provides a high-probability lower
bound on the size of the retrieved set Rx.
27 / 41
28. A Statistical Perspective on Retrieval-Based Models
Local empirical risk minimization
Excess risk bound for local ERM
Proposition 3.6.
Under Assumption 3.3, for any x ∈ X, r > 0, and 𝛿 > 0,
ℙD[|Rx| < N(r, 𝛿)] ≤ 𝛿
for N(r, 𝛿) = n(cwdc^d min{f𝜌(𝛿/2) r^d, 𝛿wdc^d} − √(log(2/𝛿)/(2n)))
The next slide shows the resulting bound on the expected excess risk of the
local ERM solution f̂X, the excess risk bound.
28 / 41
29. A Statistical Perspective on Retrieval-Based Models
Local empirical risk minimization
Excess risk bound for local ERM
Theorem 3.7 (Excess risk bound)
𝔼(X,Y)∼D[𝓁(f̂X(X), Y) − 𝓁(f∗(X), Y)]
≤ (𝜀X + 𝜀loc)   (I: Local vs Global Optimal loss)
+ Mr(Lloc; 𝓁, ftrue, Floc) + Mr(Lglobal; 𝓁, ftrue, Fglobal)   (II: Global and Local: Sample vs Retrieved Set Risk)
+ 𝔼(X,Y)∼D[ℜRX(G(X, Y)) | |RX| ≥ N(r, 𝛿)] + 5Mr(Lloc; 𝓁, ftrue, Floc) √(2 ln(4/𝛿)/N(r, 𝛿)) + 8𝛿L𝓁∥Floc∥∞   (III: Generalization of Local ERM and Central Absolute Moment of fX,∗)
29 / 41
30. A Statistical Perspective on Retrieval-Based Models
Local empirical risk minimization
Excess risk bound for local ERM
The result shows a trade-off in approximation vs. generalization error
as retrieval radius r varies.
Approximation error.
It comprises two components, defined by (I) and (II) in Thm. 3.7.
Generalization error.
The generalization error (III) depends on the size of the retrieved set RX
and the Rademacher complexity of G(X, Y), which is induced by Floc.
Under the local ERM setting, the total approximation error increases with the
radius r for a fixed Floc, while the generalization error decreases.
30 / 41
31. A Statistical Perspective on Retrieval-Based Models
Local empirical risk minimization
Illustrative examples
Illustrative examples
Local linear models.
Consider the setting where Floc is the class of linear classifiers in d
dimensions.
Excess Risk ≤ O(r^2) [(I)] + O(r^min{𝛼true,1}) [(II)]
+ O(d/(n^((2d−1)/2d) r^(d/2)) + r^min{𝛼true,1}/(n^((2d−1)/4d) r^(d/2)) + 1/n^(1/2d)) [(III)]
31 / 41
32. A Statistical Perspective on Retrieval-Based Models
Local empirical risk minimization
Illustrative examples
Illustrative examples
Feed-forward classifiers.
As another example, they study the setting where Floc is the class of fully
connected deep neural networks (FC-DNNs).
Excess Risk ≤ O(r^(qmax+1)) [(I)] + O(r^min{𝛼true,1}) [(II)]
+ O(qmax^(3/4) ln(d qmax/r)^(3/4) ln(n)^(3/2)/(n^((2d−1)/2d) r^(d/2)) + r^min{𝛼true,1}/(n^((2d−1)/4d) r^(d/2)) + 1/n^(1/2d)) [(III)]
32 / 41
33. A Statistical Perspective on Retrieval-Based Models
Local empirical risk minimization
Endowing local ERM with global representations
Endowing local ERM with global representations
The local ERM method takes a myopic view and does not aim to learn
a global hypothesis that explains the entire data distribution.
This approach may result in poor performance in regions of input
domains that are not well represented in the training set.
A two-stage approach enables local learning to benefit from good-quality
global representations, especially in sparse data regions.
33 / 41
34. A Statistical Perspective on Retrieval-Based Models
Local empirical risk minimization
Endowing local ERM with global representations
Endowing local ERM with global representations
Here they discuss a two-stage approach to address this potential shortcoming
of local empirical risk minimization (ERM) in retrieval-based models.
In the first stage, a global representation is learned using the entire
dataset.
In the second stage, the learned global representation is utilized at test
time while solving the local ERM as previously defined.
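The two stages above can be sketched as follows, assuming retrieval and a
simple majority-vote local ERM are both performed in the learned
representation space; `embed` stands in for the stage-one global
representation and is an assumption of this sketch:

```python
from collections import Counter

def two_stage_predict(x, train_set, r, embed):
    """Stage 1 supplies a globally learned representation `embed`;
    stage 2 retrieves neighbors and solves a (majority-vote) local ERM
    in that representation space at test time."""
    def dist2(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    z = embed(x)
    R_x = [(xp, yp) for (xp, yp) in train_set if dist2(z, embed(xp)) <= r * r]
    if not R_x:
        return 0  # arbitrary fallback when the retrieved set is empty
    return Counter(y for _, y in R_x).most_common(1)[0][0]
```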
34 / 41
36. A Statistical Perspective on Retrieval-Based Models
Classification in extended feature space
Classification in extended feature space
The scorer function can implicitly solve the local empirical risk
minimization (ERM) using the retrieved neighboring labeled instances to make
its classification prediction.
The objective is to learn a function f : X × (X × Y)* → ℝ^|Y|.
Here, they also discuss a kernel-based approach to classification in the
extended feature space, where the scorer function is represented as a
linear combination of kernel functions evaluated on the extended
feature space.
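A hedged sketch of such a kernel scorer: with an (assumed) RBF kernel on
flattened extended features, a score is a linear combination of kernel
evaluations against support examples. All names and the featurization are
illustrative, not the paper's exact construction:

```python
import math

def rbf(u, v, gamma=1.0):
    """Gaussian kernel on flattened extended-feature vectors."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))

def kernel_score(z_feat, support_feats, alphas, gamma=1.0):
    """Score as a kernel expansion f(z) = sum_j alpha_j K(phi(z), phi(z_j)),
    where phi flattens the augmented input (x, Rx) into a vector."""
    return sum(a * rbf(z_feat, f, gamma) for a, f in zip(alphas, support_feats))
```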
36 / 41
38. A Statistical Perspective on Retrieval-Based Models
Experiments
Experiments
This paper performs experiments on both synthetic and real datasets
to demonstrate the benefits of retrieval-based models in classification
tasks.
Synthetic: binary classification
CIFAR-10: binary classification
ImageNet: 1000-way classification
The experiments show that retrieval-based models can achieve good
performance with much simpler function classes compared to
traditional parametric and nonparametric models.
38 / 41
39. A Statistical Perspective on Retrieval-Based Models
Experiments
Experiments
Figure 2: Performance of local ERM with size of retrieved set across models
of different complexity.
39 / 41
40. A Statistical Perspective on Retrieval-Based Models
Conclusion and future direction
Conclusion and future direction
The main contributions of the paper include
A formal framework for retrieval-based models
Analysis of local and global learning frameworks
Empirical results that support the theoretical findings
For future work, one could explore the use of retrieval-based models in other
machine learning tasks beyond classification.
40 / 41
41. A Statistical Perspective on Retrieval-Based Models
Conclusion and future direction
References I
[1] Peter L. Bartlett, Michael I. Jordan, and Jon D. McAuliffe.
“Convexity, Classification, and Risk Bounds”. In: Journal of the American
Statistical Association 101.473 (2006), pp. 138–156. ISSN: 0162-1459.
URL: http://www.jstor.org/stable/30047445.
41 / 41