Neural Processes
Sangwoo Mo
KAIST ALIN Lab.
August 28, 2018
Table of Contents
Overview
Conditional Neural Processes (ICML 2018)
Neural Processes (ICML Workshop 2018)
Overview
Motivation: Can we learn a distribution over functions f ∼ P(f),
rather than a single function f, with a neural network?
Traditional NN: the dataset D is fixed, and we infer with a single f
Neural Process: given observations O, sample f ∼ P(f | O)
In modern language, NP performs meta-learning
Gaussian processes (GPs) are a classical solution to this problem
However, GP inference suffers from high complexity, O(n³)
NP combines the advantages of both NNs and GPs
Conditional Neural Process
Let O = {(x_i, y_i)} be the observations and T = {x_i} be the targets
We aim to learn a predictive distribution Q_θ(f(T) | O, T)
Idea: learn an encoder h_θ(x_i, y_i) and a decoder g_θ(x_i, r)
Concretely, the model follows 4 steps (a code sketch follows below):
1. r_i = h_θ(x_i, y_i) for all (x_i, y_i) ∈ O
2. r = r_1 ⊕ r_2 ⊕ · · · ⊕ r_n
3. φ_i = g_θ(x_i, r) for all x_i ∈ T
4. Q_θ(f(T) | O, T) = ∏_{x_i ∈ T} Q(f(x_i) | φ_i)
Steps 2 and 4 enforce permutation invariance over O and T
The complexity is O(n) (much faster than GP!)
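To make the four steps concrete, here is a minimal PyTorch sketch of the forward pass (the layer sizes, the mean as the commutative aggregation ⊕, and the Gaussian output head are my assumptions, not the paper's exact setup):

import torch
import torch.nn as nn

class CNP(nn.Module):
    def __init__(self, x_dim=1, y_dim=1, r_dim=128):
        super().__init__()
        # Step 1: encoder h_theta maps each pair (x_i, y_i) to a representation r_i
        self.encoder = nn.Sequential(
            nn.Linear(x_dim + y_dim, 128), nn.ReLU(), nn.Linear(128, r_dim))
        # Step 3: decoder g_theta maps (x_target, r) to phi = (mu, log_sigma)
        self.decoder = nn.Sequential(
            nn.Linear(x_dim + r_dim, 128), nn.ReLU(), nn.Linear(128, 2 * y_dim))

    def forward(self, x_obs, y_obs, x_tgt):
        # Step 1: r_i = h_theta(x_i, y_i) for every observation
        r_i = self.encoder(torch.cat([x_obs, y_obs], dim=-1))        # (n, r_dim)
        # Step 2: mean aggregation -> permutation invariance over O
        r = r_i.mean(dim=0, keepdim=True).expand(x_tgt.size(0), -1)  # (m, r_dim)
        # Step 3: phi_i = g_theta(x_i, r) for every target
        mu, log_sigma = self.decoder(torch.cat([x_tgt, r], dim=-1)).chunk(2, dim=-1)
        # Step 4: factorized predictive distribution over the targets
        return torch.distributions.Normal(mu, log_sigma.exp())

The aggregation keeps the representation a fixed-size vector no matter how many observations are given, which is where the O(n) cost (vs. the GP's O(n³)) comes from.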
Overall Architecture
To summarize, the overall architecture (encoder h → aggregation ⊕ → decoder g) is shown in the figure below
Training CNP
For training, we randomly sample¹ a context subset O_N from O
and minimize the NLL over both observed and unobserved data
Formally, let O = {(x_i, y_i)}_{i=1}^n; then the loss is
L(θ) = −E_{f∼P} [ E_N [ log Q_θ(y_{1:n} | O_N, x_{1:n}) ] ]
¹ The paper uses the "first" N samples, but random sampling should be fine as well
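One possible training step, under the same assumptions as the CNP sketch above (the random context split and the sample_function() data helper are hypothetical; E_N is taken as a single random draw of the context size N):

import torch

def cnp_loss(model, x, y):
    # x: (n, x_dim), y: (n, y_dim) -- one function drawn from the data distribution P
    n = x.size(0)
    N = torch.randint(1, n, (1,)).item()   # random context size N (1 <= N < n)
    idx = torch.randperm(n)[:N]            # random context subset O_N
    pred = model(x[idx], y[idx], x)        # condition on O_N, predict all n points
    return -pred.log_prob(y).mean()        # NLL over observed and unobserved points

# usage sketch:
# model = CNP(); opt = torch.optim.Adam(model.parameters(), lr=1e-4)
# x, y = sample_function()                 # hypothetical data-sampling helper
# loss = cnp_loss(model, x, y)
# opt.zero_grad(); loss.backward(); opt.step()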
Experiments 1. Function Regression
CNP learns uncertainty well (red: GP, blue: CNP)
Experiments 2. Image Completion
Complete an image from a few observed pixels
Uncertainty helps active learning (query the highest-variance pixels)
Experiments 3. One-shot Classification
Competitive results with lower computational complexity
Neural Process
CNP only outputs a pointwise mean and variance
However, we may want to sample coherent (smooth) functions
To this end, we introduce a global latent variable z
NP Inference
Unlike CNP, NP inference is not trivial (due to the latent z)
The generative model of NP is
p(z, y_{1:n} | x_{1:n}) = p(z) ∏_{i=1}^{n} N(y_i | g(x_i, z), σ²)
To estimate the ELBO, introduce a variational posterior q(z | x_{1:n}, y_{1:n}):
log p(y_{1:n} | x_{1:n}) ≥ E_{q(z | x_{1:n}, y_{1:n})} [ ∑_{i=1}^{n} log p(y_i | x_i, z) + log ( p(z) / q(z | x_{1:n}, y_{1:n}) ) ]
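This is the standard variational lower bound; for completeness, a one-line derivation via Jensen's inequality (with p(y_i | x_i, z) = N(y_i | g(x_i, z), σ²)):

\begin{aligned}
\log p(y_{1:n} \mid x_{1:n})
  &= \log \mathbb{E}_{q(z \mid x_{1:n}, y_{1:n})}\left[ \frac{p(z) \prod_{i=1}^{n} p(y_i \mid x_i, z)}{q(z \mid x_{1:n}, y_{1:n})} \right] \\
  &\ge \mathbb{E}_{q(z \mid x_{1:n}, y_{1:n})}\left[ \sum_{i=1}^{n} \log p(y_i \mid x_i, z) + \log \frac{p(z)}{q(z \mid x_{1:n}, y_{1:n})} \right]
\end{aligned}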
NP Inference
Let (x_{1:m}, y_{1:m}) be the observations O and (x_{m+1:n}, y_{m+1:n}) be the targets T
Now we maximize the test-time likelihood:
log p(y_{m+1:n} | x_{1:n}, y_{1:m}) ≥ E_{q(z | x_{1:n}, y_{1:n})} [ ∑_{i=m+1}^{n} log p(y_i | x_i, z) + log ( p(z | x_{1:m}, y_{1:m}) / q(z | x_{1:n}, y_{1:n}) ) ]
Since the true posterior p(z | x_{1:m}, y_{1:m}) is intractable, we approximate it with the variational posterior q(z | x_{1:m}, y_{1:m}) as well
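A minimal sketch of the resulting training objective (maximize the target likelihood while keeping q over the full set close to q over the context); it assumes a hypothetical NPEncoder that maps a set {(x_i, y_i)} to the mean and scale of q(z | ·), and a decoder g(x, z) that returns a predictive Normal over y:

import torch
from torch.distributions import Normal, kl_divergence

def np_loss(encoder, decoder, x_ctx, y_ctx, x_tgt, y_tgt):
    # q(z | x_{1:n}, y_{1:n}): posterior from context AND target (training time only)
    x_all = torch.cat([x_ctx, x_tgt], dim=0)
    y_all = torch.cat([y_ctx, y_tgt], dim=0)
    q_all = Normal(*encoder(x_all, y_all))
    # q(z | x_{1:m}, y_{1:m}): stands in for the intractable p(z | x_{1:m}, y_{1:m})
    q_ctx = Normal(*encoder(x_ctx, y_ctx))
    # reparameterized sample of the global latent z
    z = q_all.rsample()
    # expected target log-likelihood plus the KL term of the bound (negated to minimize)
    log_lik = decoder(x_tgt, z).log_prob(y_tgt).sum()
    kl = kl_divergence(q_all, q_ctx).sum()
    return -(log_lik - kl)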
Overall Architecture
Unlike CNP, NP samples a global latent z ∼ N(µ(r), σ(r)I)
Also, unlike CNP, the representation fed to the (variational) posterior is computed from the target (x_T, y_T) as well during training
The overall architecture is shown in the figure below
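As a sketch of the latent path z ∼ N(µ(r), σ(r)I), here is one possible encoder; it could play the role of the hypothetical NPEncoder in the loss sketch above (the bounded-σ parameterization and layer sizes are assumptions):

import torch
import torch.nn as nn

class NPEncoder(nn.Module):
    def __init__(self, x_dim=1, y_dim=1, r_dim=128, z_dim=64):
        super().__init__()
        self.point_enc = nn.Sequential(
            nn.Linear(x_dim + y_dim, 128), nn.ReLU(), nn.Linear(128, r_dim))
        self.to_mu = nn.Linear(r_dim, z_dim)
        self.to_sigma = nn.Linear(r_dim, z_dim)

    def forward(self, x, y):
        # aggregate per-point representations into a single r (permutation invariant)
        r = self.point_enc(torch.cat([x, y], dim=-1)).mean(dim=0)
        mu = self.to_mu(r)
        sigma = 0.1 + 0.9 * torch.sigmoid(self.to_sigma(r))  # keep sigma positive and bounded
        return mu, sigma  # parameters of q(z | x, y) = N(mu, diag(sigma²))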
Experiments 1. Function Regression
NP can sample functions (1-D case)
Experiments 2. Image Completion
NP can sample functions (2-D case)
Experiments 3. Contextual Bandit
Applied Thompson sampling with NP and achieved state-of-the-art results
Summary
Introduced the Neural Process, a combination of NNs and GPs
Introduced two variants: CNP and NP
CNP models uncertainty locally, and outputs a per-point mean & variance
NP models uncertainty globally, and can draw coherent function samples
Both have their own advantages, and one can choose depending on the situation