This paper presents a statistical perspective on retrieval-based models for classification, analyzing them through two frameworks: local empirical risk minimization and classification in an extended feature space. For local empirical risk minimization, the paper states its assumptions and derives an excess risk bound that decomposes the local model's error into terms for the local versus global optimal risk, the sample versus retrieved-set risk, the generalization error of the local model, and a central absolute moment of the local model. It also shows how the bound can be tightened by leveraging the local structure of the data distribution.
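The local empirical risk minimization view has a simple operational reading: retrieve the neighbors of a query, then fit and apply a model on that retrieved set alone. Below is a minimal sketch of that idea, assuming scikit-learn; the choice of k and of a logistic local model is illustrative, not the paper's exact setup.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

def local_erm_predict(X_train, y_train, x_query, k=50):
    """Classify a query by retrieving its k nearest training examples
    and fitting a small model on the retrieved set only."""
    retriever = NearestNeighbors(n_neighbors=k).fit(X_train)
    _, idx = retriever.kneighbors(x_query.reshape(1, -1))
    X_local, y_local = X_train[idx[0]], y_train[idx[0]]
    if len(np.unique(y_local)) == 1:      # retrieved set is pure:
        return y_local[0]                 # no local fit is needed
    local_model = LogisticRegression().fit(X_local, y_local)
    return local_model.predict(x_query.reshape(1, -1))[0]
```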
The document discusses autonomous vehicles and their key features. It describes how autonomous cars use sensors and advanced systems like intelligent cruise control, collision avoidance, lane support and night vision to drive themselves. These systems allow the cars to maintain speed and distance from other vehicles, detect lanes and traffic lights, and park without human assistance. While autonomous vehicles could reduce accidents and increase road capacity, challenges remain around system failures, hacking risks and high costs.
The document discusses the requirements for developing applications for intelligent vehicles. It defines an intelligent vehicle as one equipped with devices that enable automation of driving tasks like lane following, obstacle avoidance, and route determination. It describes several control systems used in intelligent vehicles, like collision warning systems. The key requirements for intelligent vehicle applications are knowledge of the vehicle state, environment state, and driver/passenger state. Sensors are necessary to gain knowledge of the surrounding environment and interpret the situation. The goal of intelligent vehicles is to one day be fully autonomous through improving existing driver assistance technologies.
Manifold regularization is an approach that exploits the geometry of the marginal distribution. The main goal of this paper is to analyze the convergence of such regularization algorithms in learning theory. We propose a more general multi-penalty framework and establish optimal convergence rates under a general smoothness assumption. We present a theoretical analysis of the performance of multi-penalty regularization over a reproducing kernel Hilbert space, and discuss error estimates of the regularization schemes under prior assumptions on the joint probability measure on the sample space. We analyze the convergence rates of the learning algorithms measured both in the reproducing kernel Hilbert space norm and in the norm of the Hilbert space of square-integrable functions; convergence is treated in a probabilistic sense via exponential tail inequalities. In order to optimize the regularization functional, a crucial issue is selecting regularization parameters that ensure good performance of the solution. We propose a new parameter choice rule, the "penalty balancing principle", based on augmented Tikhonov regularization. The superiority of multi-penalty regularization over single-penalty regularization is shown using an academic example and the moon data set.
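As a concrete instance of the two-penalty case, the manifold-regularization functional commonly takes the following standard form from the literature (shown for illustration; the parameters λ₁, λ₂ are what a rule like the penalty balancing principle must choose):

$$f_{\lambda} = \operatorname*{arg\,min}_{f \in \mathcal{H}_K} \frac{1}{n} \sum_{i=1}^{n} \bigl(f(x_i) - y_i\bigr)^2 + \lambda_1 \lVert f \rVert_{K}^{2} + \lambda_2 \lVert f \rVert_{I}^{2},$$

where ‖·‖_K is the RKHS norm and ‖·‖_I is a data-dependent smoothness penalty (e.g. a graph-Laplacian term); single-penalty regularization is recovered at λ₂ = 0.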
This document provides a summary of supervised learning techniques including linear regression, logistic regression, support vector machines, naive Bayes classification, and decision trees. It defines key concepts such as hypothesis, loss functions, cost functions, and gradient descent. It also covers generative models like Gaussian discriminant analysis, and ensemble methods such as random forests and boosting. Finally, it discusses learning theory concepts such as the VC dimension, PAC learning, and generalization error bounds.
This document provides an introduction to machine learning concepts including loss functions, empirical risk, and two basic learning methods: least squares and nearest neighbor. It describes how machine learning aims to find a function that minimizes empirical risk under a given loss function. Least squares learning is discussed as minimizing the squared differences between predictions and labels, and nearest neighbor is introduced as an alternative method. The document serves as a high-level overview of fundamental machine learning principles.
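A tiny sketch of the two ingredients just described (empirical risk under squared loss, and a nearest-neighbor predictor), assuming NumPy arrays; purely illustrative:

```python
import numpy as np

def empirical_risk(predict, X, y, loss=lambda p, t: (p - t) ** 2):
    """Average loss of a predictor over a labeled sample (squared loss by default)."""
    return float(np.mean([loss(predict(x), t) for x, t in zip(X, y)]))

def nearest_neighbor_predictor(X_train, y_train):
    """1-nearest-neighbor: predict the label of the closest training point."""
    def predict(x):
        i = np.argmin(np.linalg.norm(X_train - x, axis=1))
        return y_train[i]
    return predict
```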
This document summarizes a semi-supervised regression method that combines graph Laplacian regularization with cluster ensemble methodology. It proposes using a weighted averaged co-association matrix from the cluster ensemble as the similarity matrix in graph Laplacian regularization. The method (SSR-LRCM) finds a low-rank approximation of the co-association matrix to efficiently solve the regression problem. Experimental results on synthetic and real-world datasets show SSR-LRCM achieves significantly better prediction accuracy than an alternative method, while also having lower computational costs for large datasets. Future work will explore using a hierarchical matrix approximation instead of low-rank.
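A minimal sketch of the graph-Laplacian regularized regression at the core of such methods, assuming a precomputed similarity matrix W (standing in for the weighted co-association matrix) and omitting the low-rank machinery:

```python
import numpy as np

def laplacian_regularized_fit(W, y_labeled, labeled, lam=1.0):
    """Solve min_f sum_{i labeled} (f_i - y_i)^2 + lam * f^T L f,
    where L = D - W is the graph Laplacian of similarity matrix W."""
    n = W.shape[0]
    L = np.diag(W.sum(axis=1)) - W
    J = np.zeros((n, n))
    J[labeled, labeled] = 1.0        # diagonal selector for labeled vertices
    b = np.zeros(n)
    b[labeled] = y_labeled           # labels extended by zeros elsewhere
    # Stationarity of the objective gives the linear system (J + lam*L) f = b.
    return np.linalg.solve(J + lam * L, b)
```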
Accelerating Metropolis Hastings with Lightweight Inference Compilation, by Feynman Liang
This document summarizes research on accelerating Metropolis-Hastings sampling with lightweight inference compilation. It discusses background on probabilistic programming languages and Bayesian inference techniques like variational inference and sequential importance sampling. It introduces the concept of inference compilation, where a neural network is trained to construct proposals for MCMC that better match the posterior. The paper proposes a lightweight approach to inference compilation for imperative probabilistic programs that trains proposals conditioned on execution prefixes to address issues with sequential importance sampling.
This document provides an overview of key calculus concepts and formulas taught in a Calculus I course at Miami Dade College - Hialeah Campus. The topics covered include limits and derivatives, integration, optimization techniques, and applications of calculus to economics, business, physics, and other fields. The document is intended as a study guide for students in the Calculus I class taught by Professor Mohammad Shakil.
The document summarizes Approximate Bayesian Computation (ABC). It discusses how ABC provides a way to approximate Bayesian inference when the likelihood function is intractable or too computationally expensive to evaluate directly. ABC works by simulating data under different parameter values and accepting simulations that are close to the observed data according to a distance measure and tolerance level. Key points discussed include:
- ABC provides an approximation to the posterior distribution by sampling from simulations that fall within a tolerance of the observed data.
- Summary statistics are often used to reduce the dimension of the data and improve the signal-to-noise ratio when applying the tolerance criterion.
- Random forests can help select informative summary statistics and provide semi-automated ABC (a minimal rejection-ABC sketch follows this list).
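As a concrete illustration of the basic mechanism described above, here is a minimal rejection-ABC sketch in Python; the Gaussian example, prior range, and tolerance are illustrative choices, not from the original document.

```python
import numpy as np

def abc_rejection(observed, simulate, prior_sample, distance,
                  n_sims=10_000, tolerance=0.1):
    """Basic rejection ABC: keep parameter draws whose simulated data
    fall within `tolerance` of the observed data."""
    accepted = []
    for _ in range(n_sims):
        theta = prior_sample()          # draw a parameter from the prior
        data = simulate(theta)          # simulate data under that parameter
        if distance(data, observed) < tolerance:
            accepted.append(theta)      # approximate posterior draw
    return np.array(accepted)

# Example: infer the mean of a Gaussian with known unit variance,
# using the sample mean as a summary statistic.
rng = np.random.default_rng(0)
obs_summary = rng.normal(2.0, 1.0, size=100).mean()
posterior = abc_rejection(
    observed=obs_summary,
    simulate=lambda t: rng.normal(t, 1.0, size=100).mean(),
    prior_sample=lambda: rng.uniform(-5.0, 5.0),
    distance=lambda a, b: abs(a - b),
)
```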
Conditional random fields (CRFs) are probabilistic models for segmenting and labeling sequence data. CRFs address limitations of previous models like hidden Markov models (HMMs) and maximum entropy Markov models (MEMMs). CRFs allow incorporation of arbitrary, overlapping features of the observation sequence and label dependencies. Parameters are estimated to maximize the conditional log-likelihood using iterative scaling or tracking partial feature expectations. Experiments show CRFs outperform HMMs and MEMMs on synthetic and real-world tasks by addressing label bias problems and modeling dependencies beyond the previous label.
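For reference, the linear-chain CRF defines the conditional distribution below (standard notation following Lafferty et al.; the f_k are feature functions with weights λ_k and Z(x) is the per-sequence normalizer):

$$p_{\lambda}(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})} \exp\Bigl( \sum_{t} \sum_{k} \lambda_k \, f_k(y_{t-1}, y_t, \mathbf{x}, t) \Bigr), \qquad Z(\mathbf{x}) = \sum_{\mathbf{y}'} \exp\Bigl( \sum_{t} \sum_{k} \lambda_k \, f_k(y'_{t-1}, y'_t, \mathbf{x}, t) \Bigr).$$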
Statement of stochastic programming problems, by SSA KPI
AACIMP 2010 Summer School lecture by Leonidas Sakalauskas. "Applied Mathematics" stream. "Stochastic Programming and Applications" course. Part 1.
More info at http://summerschool.ssa.org.ua
Elementary Probability and Information Theory, by KhalidSaghiri2
This document provides an overview of foundational probability and statistics concepts relevant to statistical natural language processing (NLP). It discusses topics like probability theory, random variables, expectation, variance, Bayesian and frequentist statistics, maximum likelihood estimation, entropy, mutual information, and how these concepts can be applied to language modeling tasks in NLP. The document aims to motivate these mathematical foundations and illustrate their use for statistical inference and modeling of language.
1. The document discusses approximate Bayesian computation (ABC), a technique used when the likelihood function is intractable. ABC works by simulating parameters from the prior and simulating data, rejecting simulations that are not close to the observed data based on a tolerance level.
2. Random forests can be used in ABC to select informative summary statistics from a large set of possibilities and estimate parameters. The random forests classify simulations as accepted or rejected based on the summaries, implicitly selecting important summaries.
3. Calibrating the tolerance level in ABC is important but difficult, as it determines how close simulations must be to the observed data. Methods discussed include using quantiles of prior predictive simulations or asymptotic convergence properties.
The proposed method uses an online weighted ensemble of one-class SVMs for feature selection in background/foreground separation, automatically selecting the best features for different image regions. Multiple base classifiers are generated using weighted random subspaces; the best base classifiers are selected and combined based on their error rates, and feature importance is computed adaptively from classifier responses. The background model is updated incrementally using a heuristic approach. Experimental results on the MSVS dataset show the proposed method achieves higher precision, recall, and F-score than the compared methods.
Maximum likelihood estimation of regularisation parameters in inverse problem..., by Valentin De Bortoli
This document discusses an empirical Bayesian approach for estimating regularization parameters in inverse problems using maximum likelihood estimation. It proposes the Stochastic Optimization with Unadjusted Langevin (SOUL) algorithm, which uses Markov chain sampling to approximate gradients in a stochastic projected gradient descent scheme for optimizing the regularization parameter. The algorithm is shown to converge to the maximum likelihood estimate under certain conditions on the log-likelihood and prior distributions.
This document discusses Monte Carlo methods for numerical integration and simulation. It introduces the challenge of sampling from probability distributions and several Monte Carlo techniques to address this, including importance sampling, rejection sampling, and Metropolis-Hastings. It provides pseudocode for rejection sampling and discusses its application to estimating pi. Finally, it outlines using Metropolis-Hastings to simulate the Ising model of magnetization.
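In the spirit of the π-estimation example mentioned above, a hit-or-miss Monte Carlo sketch (the sample count is arbitrary):

```python
import numpy as np

def estimate_pi(n_samples=1_000_000, seed=0):
    """Estimate pi by sampling points uniformly in the unit square
    and counting the fraction that land inside the quarter circle."""
    rng = np.random.default_rng(seed)
    xy = rng.uniform(0.0, 1.0, size=(n_samples, 2))
    inside = (xy ** 2).sum(axis=1) <= 1.0
    return 4.0 * inside.mean()

print(estimate_pi())  # ~3.1416 for large n_samples
```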
We consider stochastic optimization problems arising in deep learning and other areas of statistical and machine learning from a statistical decision theory perspective. In particular, we investigate the admissibility (in the sense of decision theory) of the sample average solution estimator. We show that this estimator can be inadmissible in very simple settings, a phenomenon that is derived from the classical James-Stein estimator. However, for many problems of interest, the sample average estimator is indeed admissible. We will end with several open questions in this research direction.
Error Estimates for Multi-Penalty Regularization under General Source Condition, by csandit
In learning theory, the convergence issues of the regression problem are investigated with the least-squares Tikhonov regularization schemes in both the RKHS-norm and the L2-norm. We consider the multi-penalized least-squares regularization scheme under a general source condition with polynomial decay of the eigenvalues of the integral operator. One motivation for this work is to discuss the convergence issues of the widely considered manifold regularization scheme. The optimal convergence rates of the multi-penalty regularizer are achieved in the interpolation norm using the concept of effective dimension. Further, we propose the penalty balancing principle, based on augmented Tikhonov regularization, for the choice of regularization parameters. The superiority of multi-penalty regularization over single-penalty regularization is shown using an academic example and the moon data set.
The document discusses applications of machine learning for robot navigation and control. It describes how surrogate models can be used for predictive modeling in engineering applications like aircraft design. Dimension reduction techniques are used to reduce high-dimensional design parameters to a lower-dimensional space for faster surrogate model evaluation. For robot navigation, regression models on image manifolds are used for visual localization by mapping images to robot positions. Manifold learning is also applied to find low-dimensional representations of valid human hand poses from images to enable easier robot control.
This document provides an introduction to Bayesian analysis and probabilistic modeling. It begins with an overview of Bayes' theorem and common probability distributions used in Bayesian modeling like the Bernoulli, binomial, beta, Dirichlet, and multinomial distributions. It then discusses how these distributions can be used in Bayesian modeling for problems like estimating probabilities based on observed data. Specifically, it explains how conjugate prior distributions allow the posterior distribution to be of the same family as the prior. The document concludes by discussing how neural networks can quantify classification uncertainty by outputting evidence for different classes modeled with a Dirichlet distribution.
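The conjugacy mentioned above can be made concrete with the standard Beta-Bernoulli pair: with a Beta(α, β) prior and n Bernoulli observations, the posterior stays in the Beta family, with the counts of successes and failures added to the prior parameters:

$$\theta \sim \mathrm{Beta}(\alpha, \beta), \quad x_1, \dots, x_n \mid \theta \sim \mathrm{Bernoulli}(\theta) \;\Longrightarrow\; \theta \mid x_{1:n} \sim \mathrm{Beta}\Bigl(\alpha + \textstyle\sum_i x_i,\; \beta + n - \textstyle\sum_i x_i\Bigr).$$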
An Introduction To Basic Statistics And Probability, by Maria Perkins
This document provides an introduction to basic statistics and probability concepts. It outlines topics including probability distributions, random variables, the central limit theorem, and sampling. Key concepts are defined, such as probability mass functions, expected value, and variance. Common probability distributions like the binomial and normal distributions are also introduced. Examples are provided to illustrate concepts like finding probabilities and distribution parameters.
Slides by Alexander März:
The language of statistics is of a probabilistic nature. Any model that falls short of quantifying the uncertainty attached to its outcome is likely to provide an incomplete and potentially misleading picture. While this is an irrevocable consensus in statistics, machine learning approaches usually lack proper ways of quantifying uncertainty. In fact, a possible distinction between the two modelling cultures can be attributed to the (non-)existence of uncertainty estimates that allow for, e.g., hypothesis testing or the construction of estimation/prediction intervals. Quantification of uncertainty in general, and probabilistic forecasting in particular, does not just provide an average point forecast; rather, it equips the user with a range of outcomes and the probability of each of those occurring.

In an effort to bring both disciplines closer together, the audience is introduced to a new framework for XGBoost that predicts the entire conditional distribution of a univariate response variable. In particular, XGBoostLSS models all moments of a parametric distribution (i.e., mean, location, scale and shape [LSS]) instead of the conditional mean only. By choosing from a wide range of continuous, discrete and mixed discrete-continuous distributions, modelling and predicting the entire conditional distribution greatly enhances the flexibility of XGBoost, as it allows one to gain additional insight into the data generating process, as well as to create probabilistic forecasts from which prediction intervals and quantiles of interest can be derived. As such, XGBoostLSS contributes to the growing literature on statistical machine learning that aims at weakening the separation between Breiman's "Data Modelling Culture" and "Algorithmic Modelling Culture", so that models designed mainly for prediction can also be used to describe and explain the underlying data generating process of the response of interest.
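To illustrate the kind of output such a distributional model enables, here is a sketch assuming the model has already produced per-observation Gaussian location and scale estimates (the parameter values below are made up for illustration):

```python
import numpy as np
from scipy import stats

# Hypothetical per-observation parameters predicted by a distributional
# model such as XGBoostLSS: location mu and scale sigma of a Gaussian.
mu = np.array([10.2, 12.8, 9.5])
sigma = np.array([1.1, 2.0, 0.8])

# 90% prediction intervals and the median follow directly from the
# predicted conditional distribution, not from a single point forecast.
lower = stats.norm.ppf(0.05, loc=mu, scale=sigma)
upper = stats.norm.ppf(0.95, loc=mu, scale=sigma)
median = stats.norm.ppf(0.50, loc=mu, scale=sigma)
```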
Transfer Learning for the Detection and Classification of traditional pneumon..., by Yusuf Brima
A presentation of my MSc in Mathematical Sciences thesis at the African Institute of Mathematical Sciences (AIMS), Rwanda. This presentation explores the application of Deep Transfer Learning towards the diagnosis and classification of traditional pneumonia and pneumonia induced from COVID-19 using chest X-ray images.
When Classifier Selection meets Information Theory: A Unifying View, by Mohamed Farouk
Classifier selection aims to reduce the size of an ensemble of classifiers in order to improve its efficiency and classification accuracy. Recently, an information-theoretic view was presented for feature selection: it derives a space of possible selection criteria and shows that several feature selection criteria in the literature are points within this continuous space. The contribution of this paper is to export this information-theoretic view to an open issue in ensemble learning, namely classifier selection. We investigate a couple of information-theoretic selection criteria that are used to rank classifiers.
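To make the ranking idea concrete, here is a hypothetical sketch that scores each ensemble member by relevance (mutual information between its predictions and the labels) minus redundancy with the other members; both the criterion and the beta trade-off are illustrative stand-ins in the spirit of information-theoretic selection, not the paper's exact formulas.

```python
import numpy as np
from sklearn.metrics import mutual_info_score

def rank_classifiers(predictions, y, beta=0.5):
    """Rank classifiers by I(pred; labels) minus beta-weighted average
    redundancy I(pred_i; pred_j) with the other ensemble members."""
    m = len(predictions)
    scores = []
    for i in range(m):
        relevance = mutual_info_score(predictions[i], y)
        redundancy = (np.mean([mutual_info_score(predictions[i], predictions[j])
                               for j in range(m) if j != i]) if m > 1 else 0.0)
        scores.append(relevance - beta * redundancy)
    return np.argsort(scores)[::-1]   # indices, best classifier first
```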
An Importance Sampling Approach to Integrate Expert Knowledge When Learning B..., by NTNU
The introduction of expert knowledge when learning Bayesian networks from data is known to be an excellent approach to boost the performance of automatic learning methods, especially when data is scarce. Previous Bayesian approaches to this problem introduce the expert knowledge by modifying the prior probability distributions. In this study, we propose a new methodology based on Monte Carlo simulation which starts with non-informative priors and elicits knowledge from the expert a posteriori, when the simulation ends. We also explore a new importance sampling method for Monte Carlo simulation and the definition of new non-informative priors for the structure of the network. All these approaches are experimentally validated with five standard Bayesian networks.
Read more:
http://link.springer.com/chapter/10.1007%2F978-3-642-14049-5_70
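A generic self-normalized importance-sampling sketch of the kind of Monte Carlo machinery the abstract refers to (the densities and proposal below are illustrative placeholders, not the paper's construction):

```python
import numpy as np
from scipy import stats

def importance_sampling_mean(target_pdf, proposal_sample, proposal_pdf,
                             n=100_000, seed=0):
    """Estimate E_target[X] with self-normalized importance weights
    w = target / proposal, using draws from the proposal."""
    rng = np.random.default_rng(seed)
    xs = proposal_sample(rng, n)
    w = target_pdf(xs) / proposal_pdf(xs)
    w /= w.sum()
    return float(np.sum(w * xs))

# Example: mean of N(1, 1), estimated with a deliberately wide N(0, 3) proposal.
est = importance_sampling_mean(
    target_pdf=lambda x: stats.norm.pdf(x, loc=1.0, scale=1.0),
    proposal_sample=lambda rng, n: rng.normal(0.0, 3.0, n),
    proposal_pdf=lambda x: stats.norm.pdf(x, loc=0.0, scale=3.0),
)
```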
Similar to A Statistical Perspective on Retrieval-Based Models.pdf (20)
Effective Structured Prompting by Meta-Learning and Representative Verbalizer..., by Po-Chuan Chen
This paper proposes MetaPrompter, which utilizes meta-learning to learn a prompt pool that can generate effective prompts for complex tasks. It also introduces a new soft verbalizer called Representative Verbalizer (RepVerb) that constructs label embeddings from feature embeddings. In experiments on few-shot classification tasks, MetaPrompter outperforms prior meta-prompt tuning methods while requiring significantly fewer parameters.
Quark: Controllable Text Generation with Reinforced [Un]learning.pdf, by Po-Chuan Chen
This document summarizes a research paper titled "Quark: Controllable Text Generation with Reinforced [Un]learning". The paper introduces Quantized Reward Konditioning (Quark), an algorithm for optimizing a reward function to (un)learn unwanted properties from large language models. Quark iteratively collects samples, sorts them into quantiles based on reward, and maximizes the likelihood of high-reward samples while regularizing the model to remain close to the original. Experiments show Quark can effectively reduce toxicity, unwanted sentiment, and repetition in generated text.
Empathetic Dialogue Generation via Sensitive Emotion Recognition and Sensible..., by Po-Chuan Chen
The SEEK paper proposes a new method for empathetic dialogue generation that models the emotion flow between utterances in a conversation. It introduces two tasks - fine-grained emotion recognition of each utterance and predicting the emotion of the response. It also models the bi-directional interaction between emotional context and commonsense knowledge selection to generate appropriate responses. Experiments on the EmpatheticDialogues dataset show the SEEK method outperforms baselines in automatic and human evaluations.
On the Effectiveness of Offline RL for Dialogue Response Generation.pdf, by Po-Chuan Chen
This paper evaluates the effectiveness of offline reinforcement learning methods for dialogue response generation. It finds that decision transformers and implicit Q-learning show improvements over teacher forcing, generating responses that are similar in meaning to the target while not requiring exact matching. Evaluation on several datasets demonstrates these offline RL methods achieve better performance than teacher forcing according to automated metrics and human evaluations, while avoiding issues with online reinforcement learning.
GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transfor..., by Po-Chuan Chen
The document summarizes a paper titled "GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers". It proposes GPTQ, a new one-shot quantization method that can quantize large generative pre-trained models like GPT-3 with 175 billion parameters to 3-4 bits within a few GPU hours with minimal accuracy loss. GPTQ improves upon existing quantization methods by employing arbitrary weight order, lazy batch updates of the Hessian matrix, and a Cholesky reformulation to scale efficiently to huge models, achieving over 2x higher compression than prior work. Experimental results show GPTQ outperforms baseline quantization and enables extremely accurate models to fit in a single GPU.
A Neural Corpus Indexer for Document Retrieval.pdf, by Po-Chuan Chen
The document describes Neural Corpus Indexer (NCI), a sequence-to-sequence neural network that indexes documents by generating relevant document identifiers directly from input queries. NCI represents documents with hierarchical semantic identifiers generated via k-means clustering. It uses a prefix-aware weight-adaptive decoder and consistency-based regularization during training. Experiments on Natural Questions and TriviaQA datasets show NCI outperforms existing retrieval methods by significantly improving recall.
AdaMix: Mixture-of-Adaptations for Parameter-efficient Model Tuning.pdf, by Po-Chuan Chen
This document summarizes the AdaMix paper, which proposes a new parameter-efficient fine-tuning method called AdaMix. AdaMix uses a mixture of adaptation modules, where it trains multiple views of the task by randomly routing inputs to different adaptation modules. By tuning only 0.1-0.2% of the model parameters, AdaMix outperforms both full model fine-tuning and other state-of-the-art PEFT methods on various NLU and NLG tasks according to experiments on datasets like GLUE, E2E, WebNLG and DART. AdaMix works by introducing a set of adaptation modules in each transformer layer and applying a stochastic routing policy during training, along with consistency regularization and adaptation
LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attent..., by Po-Chuan Chen
This paper proposes LLaMA-Adapter, a lightweight method to efficiently fine-tune the LLaMA language model into an instruction-following model. It uses learnable adaption prompts prepended to word tokens in higher transformer layers. Additionally, it introduces zero-initialized attention with a gating mechanism that incorporates instructional signals while preserving pre-trained knowledge. Experiments show LLaMA-Adapter can generate high-quality responses comparable to fully fine-tuned models, and it can be extended to multi-modal reasoning tasks.
Active Retrieval Augmented Generation.pdf, by Po-Chuan Chen
The paper proposes Forward-Looking Active REtrieval augmented generation (FLARE), which iteratively retrieves information during text generation based on the predicted upcoming sentence. FLARE uses the predicted next sentence as a query to retrieve documents if it contains low-confidence tokens, then regenerates the sentence. Experiments show FLARE outperforms baselines on multiple knowledge-intensive tasks. However, FLARE did not significantly improve performance on a short-text dataset where continual retrieval of disparate information may not be needed.
Offline Reinforcement Learning for Informal Summarization in Online Domains.pdf, by Po-Chuan Chen
The document proposes an approach to generate natural language summaries for online content using offline reinforcement learning. It involves crawling Twitter data, fine-tuning models like RoBERTa and GPT-2, and using a reinforcement learning algorithm (PPO) to further train the text generation model using a reward function. The methodology, planned experiment, related work and conclusion are discussed over multiple sections and figures.
This document summarizes a paper on Cold-Start Reinforcement Learning with Softmax Policy Gradient. It introduces the limitations of existing sequence learning methods like maximum likelihood estimation and reward augmented maximum likelihood. It then describes the softmax policy gradient method which uses a softmax value function to overcome issues with warm starts and sample variance. The method achieves better performance on text summarization and image captioning tasks.
This document describes a Kaggle competition called Image to Prompts that aims to predict the text prompt for a generated image using a generative text-to-image model. The method uses an ensemble of a Vision Transformer, CLIP Interrogator, and OFA models. Analysis shows the CLIP Interrogator and OFA models generate higher quality prompts than the ViT model. Future work to improve methods includes generating a larger dataset of image-prompt pairs and training customized models on this data.
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.pdf, by Po-Chuan Chen
The document describes the RAG (Retrieval-Augmented Generation) model for knowledge-intensive NLP tasks. RAG combines a pre-trained language generator (BART) with a dense passage retriever (DPR) to retrieve and incorporate relevant knowledge from Wikipedia. RAG achieves state-of-the-art results on open-domain question answering, abstractive question answering, and fact verification by leveraging both parametric knowledge from the generator and non-parametric knowledge retrieved from Wikipedia. The retrieved knowledge can also be updated without retraining the model.
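For readers who want to try such a model, the snippet below follows the usage documented for the RAG classes in the Hugging Face transformers library; the model name and flags come from the library's documented example and may differ across versions.

```python
from transformers import RagTokenizer, RagRetriever, RagSequenceForGeneration

# Load the pretrained RAG components (dummy index for a quick local test).
tokenizer = RagTokenizer.from_pretrained("facebook/rag-sequence-nq")
retriever = RagRetriever.from_pretrained(
    "facebook/rag-sequence-nq", index_name="exact", use_dummy_dataset=True
)
model = RagSequenceForGeneration.from_pretrained(
    "facebook/rag-sequence-nq", retriever=retriever
)

# Generate an answer: retrieval and generation happen inside generate().
inputs = tokenizer("who wrote the origin of species", return_tensors="pt")
generated = model.generate(input_ids=inputs["input_ids"])
print(tokenizer.batch_decode(generated, skip_special_tokens=True))
```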
Evaluating Parameter Efficient Learning for Generation.pdf, by Po-Chuan Chen
This document summarizes a research paper that evaluated parameter efficient learning methods (PERMs) for natural language generation tasks. The researchers compared PERMs like adapter tuning, prefix tuning, and prompt tuning to finetuning large pre-trained language models on several metrics. Their results showed that PERMs can outperform finetuning with fewer training samples or larger models, and that adapter tuning generalizes best across domains while prefix tuning produces the most faithful generations. The study provides insights into how PERMs can help adapt models with limited data.
Off-Policy Deep Reinforcement Learning without Exploration.pdf, by Po-Chuan Chen
BCQ is an algorithm for off-policy reinforcement learning that combines deep Q-learning with a state-conditioned generative model to produce only previously seen actions from a batch of data. BCQ uses the generative model to propose actions similar to the batch, then selects the highest valued action via a Q-network. It addresses overestimation bias through importance sampling and clipped double Q-learning. Experiments show BCQ achieves state-of-the-art performance on benchmark continuous control and discrete action tasks by constraining behavior to the batch data.
A Mixture-of-Expert Approach to RL-based Dialogue Management.pdf, by Po-Chuan Chen
This document discusses a mixture of experts (MoE) approach for reinforcement learning-based dialogue management. It introduces a MoE language model consisting of: (1) a primitive language model capable of generating diverse utterances, (2) several specialized expert models trained for different intents, and (3) a dialogue manager that selects utterances from the experts. The experts are constructed by training on labeled conversation data. Reinforcement learning is used to train the dialogue manager to optimize long-term dialogue quality by selecting among the expert utterances. Experiments demonstrate the MoE approach can generate more coherent and engaging conversations than single language models.
Is Reinforcement Learning (Not) for Natural Language Processing.pdf, by Po-Chuan Chen
The document presents RL4LMs, a library for training language models with reinforcement learning, which enables generative models to be optimized with RL algorithms. It also presents the GRUE benchmark for evaluating models, which pairs NLP tasks with reward functions capturing human preferences. Additionally, it introduces the NLPO algorithm, which dynamically learns task-specific constraints to reduce the large action space in language generation. The goal is to facilitate research in building RL methods that better align language models with human preferences.
1. A Statistical Perspective on Retrieval-Based Models
ICML, 2023
Soumya Basu, Ankit Singh Rawat, Manzil Zaheer
Speaker: Po-Chuan Chen
Oct 12, 2023
2. Table of contents
1 Abstract
2 Introduction
3 Problem setup
4 Local empirical risk minimization
5 Classification in extended feature space
6 Experiments
7 Conclusion and future direction
4. Abstract
This paper uses a formal treatment of retrieval-based models to characterize their performance via a novel statistical perspective. They study the problem from two different perspectives:
- Analyzing a local learning framework
- Learning a global model using kernel methods
6. Introduction
To increase the expressiveness of an ML model, a popular way is to homogeneously scale the size of a parametric model. Such large models, however, have their own limitations:
- High computation cost
- Catastrophic forgetting
- Lack of provenance
- Poor explainability
7. Introduction
Figure 1: An illustration of a retrieval-based classification model.
8. Contributions
1 Setting up a formal framework for classification via retrieval-based models under local structure
2 Finite-sample analysis of an explicit local learning framework
3 Extending the analysis to a globally learnt model
4 Providing the first rigorous treatment of an end-to-end retrieval-based model to study its generalization by using kernel-based learning
11. Problem setup: Multiclass classification
The learner has access to $n$ training examples $S = \{(x_i, y_i)\}_{i \in [n]} \subset \mathcal{X} \times \mathcal{Y}$, sampled i.i.d. from the data distribution $\mathcal{D} := \mathcal{D}_{X,Y}$.
For a scorer $f$, the classifier takes the form
$$h_f(x) = \arg\max_{y \in \mathcal{Y}} f_y(x).$$
Given a class of scorers $\mathcal{F}^{\mathrm{global}} \subseteq \{f : \mathcal{X} \to \mathbb{R}^{|\mathcal{Y}|}\}$, learning a model amounts to finding a scorer in $\mathcal{F}^{\mathrm{global}}$ that minimizes the misclassification error, i.e., the expected 0/1 loss:
$$f^*_{0/1} = \arg\min_{f \in \mathcal{F}^{\mathrm{global}}} \mathbb{P}_{\mathcal{D}}\big(h_f(X) \neq Y\big).$$
12. Problem setup: Multiclass classification
Here, a surrogate loss [1] $\ell$ is used in place of the misclassification error, and the aim is to minimize the associated population risk
$$R_\ell(f) = \mathbb{E}_{(X,Y) \sim \mathcal{D}}[\ell(f(X), Y)].$$
A good scorer can be learned by minimizing the (global) empirical risk over the function class $\mathcal{F}^{\mathrm{global}}$:
$$\hat{f} = \arg\min_{f \in \mathcal{F}^{\mathrm{global}}} \frac{1}{n} \sum_{i \in [n]} \ell(f(x_i), y_i),$$
with $\hat{R}_\ell(f) := \frac{1}{n} \sum_{i \in [n]} \ell(f(x_i), y_i)$.
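As a concrete, hedged instance of minimizing the empirical surrogate risk over a simple global class, the sketch below fits a linear scorer with the logistic surrogate loss; the synthetic dataset and the use of scikit-learn's `LogisticRegression` as the ERM solver are assumptions made for illustration, not the paper's setup.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Toy 3-class sample standing in for S = {(x_i, y_i)}; purely illustrative.
X = rng.normal(size=(300, 2)) + np.repeat(np.eye(3, 2) * 3.0, 100, axis=0)
y = np.repeat([0, 1, 2], 100)

# ERM with the logistic surrogate over linear scorers (F_global).
f_hat = LogisticRegression().fit(X, y)

# h_f(x) = argmax_y f_y(x): per-class scores, then the argmax classifier.
scores = f_hat.predict_proba(X)
h = scores.argmax(axis=1)
print("empirical 0/1 risk:", np.mean(h != y))
```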
13. Problem setup: Classification with local structure
The data in each local neighborhood is defined via $B_{x,r} := \{x' \in \mathcal{X} : \mathsf{d}(x, x') \leq r\}$, where $x \in \mathcal{X}$ and $r > 0$. $\mathcal{D}_{x,r}$ denotes the data distribution restricted to $B_{x,r}$:
$$\mathcal{D}_{x,r}(A) = \frac{\mathcal{D}(A)}{\mathcal{D}(B_{x,r} \times \mathcal{Y})}, \qquad A \subseteq B_{x,r} \times \mathcal{Y}.$$
14. Problem setup: Classification with local structure
This yields a local structure condition under which the local classification problem approximates the Bayes optimal: for a given $\varepsilon_{\mathcal{X}} > 0$ and all $x \in \mathcal{X}$,
$$\min_{f \in \mathcal{F}_x} R^x_\ell(f) \leq \min_{f \in \mathcal{F}^{\mathrm{global}}} R^x_\ell(f) + \varepsilon_{\mathcal{X}},$$
where the local population risk is defined as
$$R^x_\ell(f) = \mathbb{E}_{(X',Y') \sim \mathcal{D}_{x,r}}[\ell(f(X'), Y')].$$
15. Problem setup: Retrieval-based classification model
This paper focuses on retrieval-based methods. In local empirical risk minimization, given an instance $x$, the local ERM approach first retrieves a neighboring set $R_x = \{(x'_j, y'_j)\} \subseteq S$. It then identifies a scorer $\hat{f}^x$ from a function class $\mathcal{F}^{\mathrm{loc}} \subset \{f : \mathcal{X} \to \mathbb{R}^{|\mathcal{Y}|}\}$:
$$\hat{f}^x = \arg\min_{f \in \mathcal{F}^{\mathrm{loc}}} \frac{1}{|R_x|} \sum_{(x',y') \in R_x} \ell(f(x'), y').$$
If $|R_x| = 0$, $\hat{f}^x \in \mathcal{F}^{\mathrm{loc}}$ is chosen arbitrarily.
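A minimal sketch of this local ERM procedure, assuming a Euclidean metric for $\mathsf{d}$, radius-$r$ retrieval over the training sample, and a linear (logistic) class standing in for $\mathcal{F}^{\mathrm{loc}}$; all three concrete choices are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def local_erm_predict(x, X_train, y_train, r=1.0):
    """Predict at x by local ERM: retrieve R_x, then fit a local scorer.

    R_x = {(x', y') in S : ||x' - x||_2 <= r}. Integer labels assumed.
    When R_x is empty or single-class, fall back to a majority label
    (the paper allows an arbitrary choice when |R_x| = 0)."""
    mask = np.linalg.norm(X_train - x, axis=1) <= r
    X_loc, y_loc = X_train[mask], y_train[mask]
    if len(np.unique(y_loc)) < 2:
        labels = y_loc if len(y_loc) > 0 else y_train
        return int(np.bincount(labels).argmax())
    f_x = LogisticRegression().fit(X_loc, y_loc)  # local ERM over F_loc
    return int(f_x.predict(x.reshape(1, -1))[0])
```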
16. Problem setup: Retrieval-based classification model
Another approach is classification with an extended feature space, where the scorer directly maps the augmented input $(x, R_x) \in \mathcal{X} \times (\mathcal{X} \times \mathcal{Y})^*$ to per-class scores. A scorer can be learned over the extended feature space $\mathcal{X} \times (\mathcal{X} \times \mathcal{Y})^*$ as follows:
$$\hat{f}^{\mathrm{ex}} = \arg\min_{f \in \mathcal{F}^{\mathrm{ex}}} \hat{R}^{\mathrm{ex}}_\ell(f),$$
where $\hat{R}^{\mathrm{ex}}_\ell(f) := \frac{1}{n} \sum_{i \in [n]} \ell(f(x_i, R_{x_i}), y_i)$ and the function class of interest over the extended space is $\mathcal{F}^{\mathrm{ex}} \subset \{f : \mathcal{X} \times (\mathcal{X} \times \mathcal{Y})^* \to \mathbb{R}^{|\mathcal{Y}|}\}$.
19. Local empirical risk minimization
The goal is to characterize the excess risk of local ERM, i.e., to bound
$$\mathbb{E}_{(X,Y) \sim \mathcal{D}}\big[\ell(\hat{f}^X(X), Y) - \ell(f^*(X), Y)\big].$$
Note that $\hat{f}^X(X)$ in the above expression is a function of the retrieved set $R_X$.
20. Assumptions
First, they define the margin of scorer $f$ at a given label $y \in \mathcal{Y}$ as
$$\gamma_f(x, y) = f_y(x) - \max_{y' \neq y} f_{y'}(x).$$
To ensure the margin of the scorer $f$ deviates smoothly as $x$ varies, a scorer $f$ is $L$-coordinate Lipschitz iff for all $y \in \mathcal{Y}$ and $x, x' \in \mathcal{X}$,
$$|f_y(x) - f_y(x')| \leq L \|x - x'\|_2.$$
They also define a weak margin condition: given a distribution $\mathcal{D}$, a scorer $f$ satisfies the $(\alpha, c)$-weak margin condition iff, for all $t \geq 0$,
$$\mathbb{P}_{(X,Y) \sim \mathcal{D}}\big(|\gamma_f(X, Y)| \leq t\big) \leq c t^\alpha.$$
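The margin definition translates directly into code. This small sketch (the per-class scores are made up) computes $\gamma_f(x, y)$ and hints at how one could empirically probe the $(\alpha, c)$-weak margin condition on a sample.

```python
import numpy as np

def margin(scores, y):
    """gamma_f(x, y) = f_y(x) - max_{y' != y} f_{y'}(x), for one example."""
    others = np.delete(scores, y)
    return scores[y] - others.max()

# Illustrative per-class scores f(x) for a 4-class problem.
scores = np.array([0.1, 2.3, 0.7, -0.5])
print(margin(scores, y=1))   #  2.3 - 0.7 =  1.6 (correctly classified)
print(margin(scores, y=2))   #  0.7 - 2.3 = -1.6 (misclassified)

# Empirical probe of the (alpha, c)-weak margin condition at level t:
# estimate P(|gamma_f(X, Y)| <= t) by the fraction of such examples.
```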
21. Assumption 3.1 (True scorer function)
There exists a scorer $f_{\mathrm{true}}$ that generates the true label for every $(x, y) \in \mathcal{X} \times \mathcal{Y}$, i.e., $\gamma_{f_{\mathrm{true}}}(x, y) > 0$; moreover, $f_{\mathrm{true}}$ is $L_{\mathrm{true}}$-coordinate Lipschitz and satisfies the $(\alpha_{\mathrm{true}}, c_{\mathrm{true}})$-weak margin condition.
22. Assumption 3.2 (Margin-based Lipschitz loss)
For any example $(x, y)$ and any scorer $f$, $\ell(f(x), y) = \ell(\gamma_f(x, y))$, and $\ell$ is a decreasing function of the margin. Moreover, the loss $\ell$ is an $L_\ell$-Lipschitz function, i.e.,
$$|\ell(\gamma) - \ell(\gamma')| \leq L_\ell |\gamma - \gamma'|, \qquad \forall \gamma \geq \gamma'.$$
23. Assumption 3.3 (Data regularity condition)
Weak density condition: there exist constants $c_{\mathrm{wdc}} > 0$ and $\delta_{\mathrm{wdc}} > 0$ such that for all $x \in \mathcal{X}$ and all $r$ with $\rho_{\mathcal{D}}(x) r^d \leq \delta_{\mathrm{wdc}}^d$,
$$\mathbb{P}_{X' \sim \mathcal{D}}\big[\mathsf{d}(X', x) \leq r\big] \geq c_{\mathrm{wdc}}^d \, \rho_{\mathcal{D}}(x) r^d.$$
Density level-set: there exists a function $f_\rho(\delta)$ with $f_\rho(\delta) \to 0$ as $\delta \to 0$, such that for any $\delta > 0$,
$$\mathbb{P}_{X \sim \mathcal{D}}\big[\rho_{\mathcal{D}}(X) \leq f_\rho(\delta)\big] \leq \delta.$$
24. Assumption 3.4 (Weak+ density condition)
There exist constants $c_{\mathrm{wdc+}} \geq 0$ and $\alpha_{\mathrm{wdc+}} > 0$ such that for all $x \in \mathcal{X}$ and $r \in [0, r_{\max}]$,
$$\frac{\mathbb{P}_{X' \sim \mathcal{D}}\big[\mathsf{d}(X', x) \leq r\big]}{\rho_{\mathcal{D}}(x)\, \mathrm{vol}_d(r)} - 1 \leq c_{\mathrm{wdc+}} r^{\alpha_{\mathrm{wdc+}}}.$$
Under this assumption the local ERM error bounds can be tightened further.
25. Excess risk bound for local ERM
We now proceed to the main results on the excess risk bound of local ERM. At $x \in \mathcal{X}$, $f^{x,*}$ denotes the minimizer of the population version of the local loss, and $f^*$ that of the global loss:
$$f^{x,*} = \arg\min_{f \in \mathcal{F}^{\mathrm{loc}}} R^x_\ell(f); \qquad f^* = \arg\min_{f \in \mathcal{F}^{\mathrm{global}}} R_\ell(f).$$
The next slide shows how the expected excess risk of the local ERM solution $\hat{f}^X$ is bounded; this is called the risk decomposition.
26. Excess risk bound for local ERM: Risk decomposition
$$
\begin{aligned}
\mathbb{E}_{(X,Y) \sim \mathcal{D}}\big[\ell(\hat{f}^X(X), Y) - \ell(f^*(X), Y)\big]
&\leq \underbrace{\mathbb{E}_{(X,Y) \sim \mathcal{D}}\big[R^X_\ell(f^{X,*}) - R^X_\ell(f^*)\big]}_{\text{Local vs Global Population Optimal Risk}} \\
&\quad + \underbrace{\sum_{\mathcal{F} \in \{\mathcal{F}^{\mathrm{global}},\, \mathcal{F}^{\mathrm{loc}}\}} \mathbb{E}_{(X,Y) \sim \mathcal{D}}\bigg[\sup_{f \in \mathcal{F}} \big(R^X_\ell(f) - \ell(f(X), Y)\big)\bigg]}_{\text{Global and Local: Sample vs Retrieved Set Risk}} \\
&\quad + \underbrace{\mathbb{E}_{(X,Y) \sim \mathcal{D}}\bigg[\sup_{f \in \mathcal{F}^{\mathrm{loc}}} \big(R^X_\ell(f) - \hat{R}^X_\ell(f)\big)\bigg]}_{\text{Generalization of Local ERM}}
+ \underbrace{\mathbb{E}_{(X,Y) \sim \mathcal{D}}\big[R^X_\ell(f^{X,*}) - \hat{R}^X_\ell(f^{X,*})\big]}_{\text{Central Absolute Moment of } f^{X,*}}.
\end{aligned}
$$
27. Excess risk bound for local ERM
A tighter bound can be obtained by utilizing the local structure of the distribution $\mathcal{D}_{X,r}$. For any $L > 0$, define
$$M_r(L; \ell, f_{\mathrm{true}}, \mathcal{F}) := 2 L_\ell \big( L r + (2\|\mathcal{F}\|_\infty - L r)\, c_{\mathrm{true}} (2 L_{\mathrm{true}} r)^{\alpha_{\mathrm{true}}} \big).$$
For any $x \in \mathcal{X}$, the weak density condition provides a high-probability lower bound on the size of the retrieved set $R_x$.
28. Proposition 3.6
Under Assumption 3.3, for any $x \in \mathcal{X}$, $r > 0$, and $\delta > 0$,
$$\mathbb{P}_{\mathcal{D}}\big[|R_x| < N(r, \delta)\big] \leq \delta, \qquad \text{for } N(r, \delta) = n \left( c_{\mathrm{wdc}}^d \min\big\{ f_\rho(\delta/2)\, r^d,\, \delta_{\mathrm{wdc}}^d \big\} - \sqrt{\tfrac{\log(2/\delta)}{2n}} \right).$$
The next slide shows how the expected excess risk of the local ERM solution $\hat{f}^X$ is bounded; this is the excess risk bound.
29. Theorem 3.7 (Excess risk bound)
$$
\begin{aligned}
\mathbb{E}_{(X,Y) \sim \mathcal{D}}\big[\ell(\hat{f}^X(X), Y) - \ell(f^*(X), Y)\big]
&\leq \underbrace{(\varepsilon_{\mathcal{X}} + \varepsilon_{\mathrm{loc}})}_{\text{Local vs Global Optimal loss (I)}}
+ \underbrace{M_r\big(L_{\mathrm{loc}}; \ell, f_{\mathrm{true}}, \mathcal{F}^{\mathrm{loc}}\big) + M_r\big(L_{\mathrm{global}}; \ell, f_{\mathrm{true}}, \mathcal{F}^{\mathrm{global}}\big)}_{\text{Global and Local: Sample vs Retrieved Set Risk (II)}} \\
&\quad + \underbrace{\mathbb{E}_{(X,Y) \sim \mathcal{D}}\Big[\mathfrak{R}_{R_X}\big(\mathcal{G}(X, Y)\big) \,\Big|\, |R_X| \geq N(r, \delta)\Big]
+ 5 M_r\big(L_{\mathrm{loc}}; \ell, f_{\mathrm{true}}, \mathcal{F}^{\mathrm{loc}}\big) \sqrt{\tfrac{2 \ln(4/\delta)}{N(r, \delta)}}
+ 8 \delta L_\ell \big\|\mathcal{F}^{\mathrm{loc}}\big\|_\infty}_{\text{Generalization of Local ERM and Central Absolute Moment of } f^{X,*} \text{ (III)}}.
\end{aligned}
$$
30. Excess risk bound for local ERM
The result shows a trade-off between approximation and generalization error as the retrieval radius $r$ varies.
Approximation error: comprises the two components (I) and (II) in Theorem 3.7.
Generalization error: (III) depends on the size of the retrieved set $R_X$ and the Rademacher complexity of $\mathcal{G}(X, Y)$, which is induced by $\mathcal{F}^{\mathrm{loc}}$.
Under the local ERM setting, for a fixed $\mathcal{F}^{\mathrm{loc}}$ the total approximation error increases with the radius $r$, while the generalization error decreases.
31. Illustrative examples: Local linear models
Consider the setting where $\mathcal{F}^{\mathrm{loc}}$ is the class of linear classifiers in $d$ dimensions:
$$\text{Excess Risk} \leq \underbrace{O\big(r^2\big)}_{\text{(I)}} + \underbrace{O\big(r^{\min\{\alpha_{\mathrm{true}}, 1\}}\big)}_{\text{(II)}} + \underbrace{O\left( \frac{d}{n^{(2d-1)/2d}\, r^{d/2}} + \frac{r^{\min\{\alpha_{\mathrm{true}}, 1\}}}{n^{(2d-1)/4d}\, r^{d/2}} + \frac{1}{n^{1/2d}} \right)}_{\text{(III)}}.$$
32. Illustrative examples: Feed-forward classifiers
As another example, they study the setting where $\mathcal{F}^{\mathrm{loc}}$ is the class of fully connected deep neural networks (FC-DNN):
$$\text{Excess Risk} \leq \underbrace{O\big(r^{q_{\max}+1}\big)}_{\text{(I)}} + \underbrace{O\big(r^{\min\{\alpha_{\mathrm{true}}, 1\}}\big)}_{\text{(II)}} + \underbrace{O\left( \frac{q_{\max}^{3/4} \ln(d q_{\max}/r)^{3/4} \ln(n)^{3/2}}{n^{(2d-1)/2d}\, r^{d/2}} + \frac{r^{\min\{\alpha_{\mathrm{true}}, 1\}}}{n^{(2d-1)/4d}\, r^{d/2}} + \frac{1}{n^{1/2d}} \right)}_{\text{(III)}}.$$
33. Endowing local ERM with global representations
The local ERM method takes a myopic view and does not aim to learn a global hypothesis that explains the entire data distribution. This may result in poor performance in regions of the input domain that are not well represented in the training set. A two-stage approach enables local learning to benefit from good-quality global representations, especially in sparse data regions.
34. Endowing local ERM with global representations
To address this potential shortcoming of local ERM in retrieval-based models, they discuss a two-stage approach. In the first stage, a global representation is learned using the entire dataset. In the second stage, the learned global representation is utilized at test time while solving the local ERM as previously defined.
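A hedged sketch of the two-stage idea: stage one learns a global representation on the full dataset (a truncated PCA stands in for any globally learned embedding), and stage two runs the radius-$r$ local ERM of the earlier slides in that representation space. The embedding choice, the synthetic data, and the reuse of `local_erm_predict` from the earlier sketch are all assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
# Illustrative training sample standing in for S.
X_train = rng.normal(size=(200, 10))
y_train = (X_train[:, 0] + X_train[:, 1] > 0).astype(int)

# Stage 1: learn a global representation phi on the full training set.
# PCA is an illustrative stand-in for any learned global embedding.
phi = PCA(n_components=2).fit(X_train)
Z_train = phi.transform(X_train)

# Stage 2: at test time, embed the query and solve the local ERM in the
# representation space, reusing local_erm_predict from the earlier sketch.
def two_stage_predict(x, r=1.0):
    z = phi.transform(x.reshape(1, -1))[0]
    return local_erm_predict(z, Z_train, y_train, r=r)
```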
36. Classification in extended feature space
The scorer function can implicitly solve the local empirical risk minimization using retrieved neighboring labeled instances to make the classification prediction. The objective is to learn a function $f : \mathcal{X} \times (\mathcal{X} \times \mathcal{Y})^* \to \mathbb{R}^{|\mathcal{Y}|}$. They also discuss a kernel-based approach to classification in the extended feature space, where the scorer function is represented as a linear combination of kernel functions evaluated on the extended feature space.
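To make the kernel construction concrete, here is one plausible kernel on the extended feature space (an assumption for illustration, not the paper's exact kernel): an RBF kernel on the query combined additively with a mean set kernel over the retrieved (neighbor vector, numeric label) pairs. A scorer could then be fit as a linear combination of such kernel evaluations, e.g. via kernel ridge regression on the resulting Gram matrix.

```python
import numpy as np

def rbf(a, b, gamma=1.0):
    """Gaussian RBF kernel between two vectors."""
    return np.exp(-gamma * np.sum((np.asarray(a) - np.asarray(b)) ** 2))

def extended_kernel(x, R_x, xp, R_xp, gamma=1.0):
    """Kernel on the extended space X x (X x Y)*.

    Adds a kernel on the queries to a mean kernel over the retrieved
    (neighbor vector, numeric label) pairs; averaging handles retrieved
    sets of different sizes."""
    k_query = rbf(x, xp, gamma)
    if not R_x or not R_xp:
        return k_query
    k_set = np.mean([
        rbf(np.append(u, yu), np.append(v, yv), gamma)
        for (u, yu) in R_x for (v, yv) in R_xp
    ])
    return k_query + k_set

# Illustrative call: queries in R^2, retrieved sets of (point, label) pairs.
k = extended_kernel(
    x=[0.0, 0.0], R_x=[([0.1, 0.0], 0), ([0.0, 0.2], 1)],
    xp=[0.5, 0.5], R_xp=[([0.4, 0.6], 1)],
)
print(k)
```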
38. Experiments
The paper performs experiments on both synthetic and real datasets to demonstrate the benefits of retrieval-based models in classification tasks:
- Synthetic: binary classification
- CIFAR-10: binary classification
- ImageNet: 1000-way classification
The experiments show that retrieval-based models can achieve good performance with much simpler function classes compared to traditional parametric and nonparametric models.
39. Experiments
Figure 2: Performance of local ERM with size of retrieved set across models of different complexity.
40. Conclusion and future direction
The main contributions of the paper include:
- A formal framework for retrieval-based models
- Analysis of local and global learning frameworks
- Empirical results that support the theoretical findings
Future work could explore the use of retrieval-based models in other machine learning tasks beyond classification.
41. References
[1] Peter L. Bartlett, Michael I. Jordan, and Jon D. McAuliffe. "Convexity, Classification, and Risk Bounds". Journal of the American Statistical Association 101.473 (2006), pp. 138–156. ISSN: 0162-1459. URL: http://www.jstor.org/stable/30047445.