2. Table of Contents
1. Conditional Neural Processes [Garnelo+ (ICML2018)]
2. Neural Processes [Garnelo+ (ICML2018WS)]
3. Attentive Neural Processes [Kim+ (ICLR2019)]
4. Meta-Learning surrogate models for sequential decision making [Galashov+ (ICLR2019WS)]
5. On the Connection between Neural Processes and Gaussian Processes with Deep Kernels [Rudner+ (NeurIPS2018WS)]
6. Conclusion
3. Conditional Neural Processes [Garnelo+ (ICML2018)]
4. Motivation
Neural Nets vs. Gaussian Processes
• Neural Net (NN)
  • Function approximation ability.
  • New functions are learned from scratch each time.
  • Uncertainty over functions cannot be modeled.
• Gaussian Processes (GP)
  • Can use prior knowledge to quickly estimate the shape of a new function.
  • Can model uncertainty over functions.
  • Computationally expensive.
  • Hard to design the prior distribution.
Aim: combine the benefits of NNs and GPs.
5. Conditional Neural Processes (CNPs)
• A conditional distribution over functions, trained to model the empirical conditional distributions of functions.
• Permutation invariant in the training/test data.
• Scalable: running time complexity of O(n + m).
6. Stochastic Processes i
• observations: $O = \{(x_i, y_i)\}_{i=0}^{n-1} \subset X \times Y$
• targets: $T = \{x_i\}_{i=n}^{n+m-1}$
• generative model (stochastic process):
  • $y_i = f(x_i)$, $f : X \to Y$ (noiseless case)
  • $f \sim P$ (prior process)
  • $P \to P(f(T) \mid O, T)$ (predictive distribution)
Task: predict the output values $f(x)$ for all $x \in T$, given $O$.
Example 1 (Gaussian Processes): $P = \mathcal{GP}(\mu(x), k(x, x'))$, which gives the predictive distribution $f(x) \sim \mathcal{N}(\mu_n(x), \sigma_n^2(x))$ with
$\mu_n(x) = \mu(x) + k(x)^\top (K + \sigma^2 I)^{-1} (y - m)$
$\sigma_n^2(x) = k(x, x) - k(x)^\top (K + \sigma^2 I)^{-1} k(x)$
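A minimal NumPy sketch of these GP predictive equations, assuming a zero prior mean ($\mu = 0$, $m = 0$) and a squared-exponential kernel; the function names (`rbf`, `gp_predict`) and all values are illustrative, not from the papers:

```python
import numpy as np

def rbf(A, B, ell=1.0):
    # squared-exponential kernel k(x, x') = exp(-||x - x'||^2 / (2 ell^2))
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * ell ** 2))

def gp_predict(X, y, Xs, noise=1e-2):
    # posterior mean mu_n and variance sigma_n^2 at test points Xs
    K = rbf(X, X) + noise * np.eye(len(X))   # K + sigma^2 I
    ks = rbf(X, Xs)                          # k(x) for each test point
    alpha = np.linalg.solve(K, y)            # (K + sigma^2 I)^{-1} y
    mu = ks.T @ alpha                        # mu_n(x), zero prior mean
    var = rbf(Xs, Xs).diagonal() - np.sum(ks * np.linalg.solve(K, ks), axis=0)
    return mu, var                           # sigma_n^2(x)

X = np.random.rand(10, 1)
y = np.sin(6 * X).ravel()
mu, var = gp_predict(X, y, np.linspace(0, 1, 50)[:, None])
```

The $O(n^3)$ cost discussed on the next slide is visible here as the linear solves against the $n \times n$ matrix $K$.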
7. Stochastic Processes ii
(Figure: 1D Gaussian process regression.)
Difficulties of ordinary SP approaches:
1. It is difficult to design appropriate priors.
2. GPs (the typical example) do not scale with the number of data points: $O((n + m)^3)$ computational cost is required.
8. Conditional Neural Processes i
Conditional stochastic process $Q_\theta(f(\cdot) \mid O, T)$:
the predictive ability of NNs + the uncertainty modeling of SPs.
Assumption 1
1. (permutation invariance)
$Q_\theta(f(T) \mid O, T) = Q_\theta(f(T') \mid O, T') = Q_\theta(f(T) \mid O', T)$
where $O'$ and $T'$ are permutations of $O$ and $T$, respectively.
2. (factorizability)
$Q_\theta(f(T) \mid O, T) = \prod_{x \in T} Q_\theta(f(x) \mid O, x)$
9. Conditional Neural Processes ii: Architecture
(Figure: Observe → Aggregate → Predict pipeline with encoder h, aggregator a, and decoder g.)
$r_i = h_\theta(x_i, y_i)$ for all $(x_i, y_i) \in O$
$r = r_1 \oplus r_2 \oplus \cdots \oplus r_{n-1} \oplus r_n$
$\phi_i = g_\theta(x_i, r)$ for all $x_i \in T$
Here $h_\theta$ and $g_\theta$ are neural networks and $\oplus$ is a commutative aggregation operator; concretely, the mean $r_1 \oplus \cdots \oplus r_n = \frac{1}{n} \sum_{i=1}^{n} r_i$.
The decoder output parameterizes the predictive distribution, $Q_\theta(f(x_i) \mid O, x_i) = Q(f(x_i) \mid \phi_i)$, with $\phi_i = (\mu_i, \sigma_i^2)$ defining $\mathcal{N}(\mu_i, \sigma_i^2)$. (A PyTorch sketch of this forward pass follows.)
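A minimal PyTorch sketch of this forward pass; the layer widths and the softplus-based variance floor are illustrative choices, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class CNP(nn.Module):
    def __init__(self, x_dim=1, y_dim=1, r_dim=128):
        super().__init__()
        # encoder h_theta: (x_i, y_i) -> r_i   (widths are illustrative)
        self.h = nn.Sequential(
            nn.Linear(x_dim + y_dim, 128), nn.ReLU(),
            nn.Linear(128, r_dim))
        # decoder g_theta: (x*, r) -> phi = (mu, sigma)
        self.g = nn.Sequential(
            nn.Linear(x_dim + r_dim, 128), nn.ReLU(),
            nn.Linear(128, 2 * y_dim))

    def forward(self, xc, yc, xt):
        r_i = self.h(torch.cat([xc, yc], dim=-1))   # one r_i per context pair
        r = r_i.mean(dim=0, keepdim=True)           # commutative aggregation
        r = r.expand(xt.shape[0], -1)               # same r for every target
        mu, raw = self.g(torch.cat([xt, r], dim=-1)).chunk(2, dim=-1)
        sigma = 0.1 + 0.9 * torch.nn.functional.softplus(raw)  # keep sigma > 0
        return mu, sigma
```

Because the aggregation is a mean, permuting the context pairs leaves the prediction unchanged, matching Assumption 1.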
10. Conditional Neural Processes ii: Architecture
• Structurally very close to a VAE → h and a correspond to the VAE encoder, extracting a latent representation r from the input data.
• Difference from a VAE, #1: the latent representation is learned from the outputs y as well as the inputs x.
• Difference from a VAE, #2: the latent representation r is not a random variable; it is determined as the sum of the per-datum representations r_1, ..., r_n.
• Because of difference #2, the latent representation is computed independently for each datum, which causes the "completions are not coherent across the whole image" issue explained later.
11. Conditional Neural Processes iii: Training
Optimization problem: minimize the negative conditional log probability,
$\theta^* = \arg\min_\theta L(\theta)$
$L(\theta) = -\mathbb{E}_{f \sim P}\left[\mathbb{E}_N\left[\log Q_\theta\left(\{y_i\}_{i=0}^{n-1} \mid O_N, \{x_i\}_{i=0}^{n-1}\right)\right]\right]$
• $f \sim P$: prior process
• $N \sim \mathrm{Unif}(0, n-1)$
• $O_N = \{(x_i, y_i)\}_{i=0}^{N} \subset O$
Practical implementation: gradient descent (a sketch of one step follows).
1. Sample $f$ and $N$.
2. Form a Monte Carlo estimate of the gradient of $L(\theta)$.
3. Take a gradient descent step with the estimated gradient.
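A sketch of one training step, reusing the hypothetical `CNP` class from the earlier sketch (assumed to be in scope); the optimizer settings are illustrative:

```python
import torch

model = CNP()                                     # from the earlier sketch
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

def train_step(x, y):
    # x, y: one function f ~ P, sampled as (x_i, y_i) pairs of shape (n, 1)
    n = x.shape[0]
    N = torch.randint(1, n, (1,)).item()          # N ~ Unif (kept >= 1 here)
    mu, sigma = model(x[:N], y[:N], x)            # condition on O_N, predict all n
    nll = -torch.distributions.Normal(mu, sigma).log_prob(y).mean()
    opt.zero_grad(); nll.backward(); opt.step()   # MC-estimated gradient step
    return nll.item()
```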
12. Function Regression i: Setting
Dataset
1. Random samples from a GP with a fixed kernel and fixed hyperparameters.
2. Random samples from a GP that switches between two kernels.
Network architectures
• $h_\theta$: 3-layer MLP with 128-dimensional outputs $r_i$.
• $r = \frac{1}{128} \sum_{i=1}^{128} r_i$: aggregation.
• $g_\theta$: 5-layer MLP, $g_\theta(x_i, r) = (\mu_i, \sigma_i^2)$ (mean and variance of a Gaussian).
• Adam (optimizer).
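As a rough illustration of dataset 1, one training task can be drawn from a GP prior; this sketch reuses the `rbf` kernel from the earlier GP example (assumed in scope) and is an assumption about the setup, not the paper's exact data pipeline:

```python
import numpy as np

def sample_gp_task(n=50, ell=0.4):
    # One task = one function drawn from a GP prior with an RBF kernel.
    x = np.sort(np.random.uniform(-2, 2, size=(n, 1)), axis=0)
    K = rbf(x, x, ell) + 1e-6 * np.eye(n)     # jitter for numerical stability
    y = np.linalg.cholesky(K) @ np.random.randn(n, 1)
    return x, y
```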
13. Function Regression ii: Results
(Results figure.)
14. Image Completion i: Setting
Dataset
1. MNIST ($f : [0, 1]^2 \to [0, 1]$): complete the entire image from a small number of observations.
2. CelebA ($f : [0, 1]^2 \to [0, 1]^3$): complete the entire image from a small number of observations.
Network architectures
• The same model architecture as for 1D function regression, except:
  • input layer: 2D pixel coordinates normalized to $[0, 1]^2$
  • output layer: color intensity of the corresponding pixel
15. Image Completion ii: Results
• With a single (non-informative) observation point, the prediction corresponds to the average over all digits.
17. Image Completion iii: Latent Variable Model
Original CNPs
• The model returns factored outputs (sample-wise independent modeling) → the best prediction with limited data points is to average over all possible predictions.
• They cannot sample different coherent images of all the possible digits conditioned on the observations, whereas GPs can, thanks to the kernel function.
• By adding latent variables, CNPs can recover this property.
The latent variable model of CNPs is the same as the Neural Processes described later.
18. Image Completion iv: Latent Variable Model
$z \sim \mathcal{N}(\mu, \sigma^2)$
$r = (\mu, \sigma^2) = h_\theta(X, Y)$
$\phi_i = (\mu_i, \sigma_i^2) = g_\theta(x_i, z)$
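A sketch of this latent path with the usual reparameterization trick; `h` and `g` are assumed to be MLPs of appropriate widths (with `h` outputting twice the latent width so it splits into a mean and a log-variance):

```python
import torch

def latent_path(h, g, xc, yc, xt):
    # aggregate as before, then read off r = (mu, sigma^2)
    s = h(torch.cat([xc, yc], dim=-1)).mean(dim=0)
    mu_z, log_var = s.chunk(2, dim=-1)
    eps = torch.randn_like(mu_z)
    z = mu_z + (0.5 * log_var).exp() * eps          # z ~ N(mu, sigma^2), reparameterized
    z = z.unsqueeze(0).expand(xt.shape[0], -1)      # one shared z for all targets
    return g(torch.cat([xt, z], dim=-1))            # phi_i = g(x_i, z)
```

Because one sample of $z$ is shared across all targets, the completions it produces are globally coherent, unlike the factored CNP outputs.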
19. Classification i: Settings
Dataset
• Omniglot
  • 1,623 classes of characters from 50 different alphabets
  • suitable for few-shot learning
• N-way classification task: N classes are randomly chosen at each training step.
Network architectures
• encoder h: includes convolutional layers
• aggregation r: class-wise aggregation, then concatenation
20. Classification ii: Results
(Figure: Observe → Aggregate → Predict pipeline with one aggregated representation per class A–E.)

Model | 5-way 1-shot | 5-way 5-shot | 20-way 1-shot | 20-way 5-shot | Runtime
MANN  | 82.8%        | 94.9%        | -             | -             | O(nm)
MN    | 98.1%        | 98.9%        | 93.8%         | 98.5%         | O(nm)
CNP   | 95.3%        | 98.5%        | 89.9%         | 96.8%         | O(n + m)
21. Neural Processes [Garnelo+ (ICML2018WS)]
22. Generative Model i
Assumption 2
1. (Exchangeability) The distribution is invariant under permutations $\pi$ of the order of the inputs $x$ and outputs $y$:
$\rho_{x_{1:n}}(y_{1:n}) = \rho_{\pi(x_{1:n})}(\pi(y_{1:n}))$
2. (Consistency) The distribution of a subsequence $D_m = \{(x_i, y_i)\}_{i=1}^{m}$ coincides with the distribution of any containing sequence with everything outside $D_m$ marginalized out:
$\rho_{x_{1:m}}(y_{1:m}) = \int \rho_{x_{1:n}}(y_{1:n}) \, dy_{m+1:n}$
3. (Decomposability) The observation model factorizes independently:
$p(y_{1:n} \mid f, x_{1:n}) = \prod_{i=1}^{n} \mathcal{N}(y_i \mid f(x_i), \sigma^2)$
23. Generative Model ii
The posterior distribution of the observations when $f$ is a sample from some stochastic process:
$\rho_{x_{1:n}}(y_{1:n}) = \int p(y_{1:n} \mid f, x_{1:n}) \, p(f) \, df = \int \prod_{i=1}^{n} \mathcal{N}(y_i \mid f(x_i), \sigma^2) \, p(f) \, df$
When $f$ is modeled by a latent-variable NN $g(x, z)$, the generative model becomes
$p(z, y_{1:n} \mid x_{1:n}) = \prod_{i=1}^{n} \mathcal{N}(y_i \mid g(x_i, z), \sigma^2) \, p(z)$
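A small sketch of ancestral sampling from this generative model, assuming a standard normal prior $p(z) = \mathcal{N}(0, I)$ and an MLP decoder `g` that takes the concatenation of $x_i$ and $z$ (both are assumptions for illustration):

```python
import torch

def sample_outputs(g, x, sigma=0.1, z_dim=64):
    # z ~ p(z), then y_i ~ N(g(x_i, z), sigma^2) independently for each i
    z = torch.randn(z_dim)
    z_rep = z.unsqueeze(0).expand(x.shape[0], -1)   # same z for every x_i
    f = g(torch.cat([x, z_rep], dim=-1))            # g(x_i, z)
    return f + sigma * torch.randn_like(f)
```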
24. Evidence Lower-Bound (ELBO)
Writing the variational posterior of the latent variable $z$ as $q(z \mid x_{1:n}, y_{1:n})$, the ELBO is
$\log p(y_{1:n} \mid x_{1:n}) \ge \mathbb{E}_{q(z \mid x_{1:n}, y_{1:n})}\left[\sum_{i=1}^{n} \log p(y_i \mid z, x_i) + \log \frac{p(z)}{q(z \mid x_{1:n}, y_{1:n})}\right]$
In particular, at prediction time, splitting the data into observed and test parts gives
$\log p(y_{m+1:n} \mid x_{1:m}, x_{m+1:n}, y_{1:m})$
$\ge \mathbb{E}_{q(z \mid x_{m+1:n}, y_{m+1:n})}\left[\sum_{i=m+1}^{n} \log p(y_i \mid z, x_i) + \log \frac{p(z \mid x_{1:m}, y_{1:m})}{q(z \mid x_{m+1:n}, y_{m+1:n})}\right]$
$\approx \mathbb{E}_{q(z \mid x_{m+1:n}, y_{m+1:n})}\left[\sum_{i=m+1}^{n} \log p(y_i \mid z, x_i) + \log \frac{q(z \mid x_{1:m}, y_{1:m})}{q(z \mid x_{m+1:n}, y_{m+1:n})}\right]$
The conditional prior $p$ is approximated by $q$ because computing it exactly as a conditional distribution given the observed data would cost $O(m^3)$.
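A single-sample Monte Carlo sketch of the approximate ELBO in the last line, assuming `q_context` and `q_target` are diagonal Gaussian torch distributions produced by the encoder and `decoder` is an MLP over $(x_i, z)$; all names are illustrative:

```python
import torch
from torch.distributions import Normal, kl_divergence

def np_elbo(decoder, q_context, q_target, xt, yt):
    z = q_target.rsample()                            # z ~ q(z | x_{m+1:n}, y_{m+1:n})
    z_rep = z.unsqueeze(0).expand(xt.shape[0], -1)
    mu, raw = decoder(torch.cat([xt, z_rep], dim=-1)).chunk(2, dim=-1)
    sigma = torch.nn.functional.softplus(raw) + 1e-3
    log_lik = Normal(mu, sigma).log_prob(yt).sum()    # sum_i log p(y_i | z, x_i)
    # KL(q(z | target) || q(z | context)) replaces the intractable prior term
    return log_lik - kl_divergence(q_target, q_context).sum()
```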
25. Architectures
(Figure: NP encoder-decoder. Each pair $(x_i, y_i)$ is mapped by $h_\theta$ to $r_i$; the aggregator $a$ produces $r$; the decoder $g_\theta$ maps each target $x_i$ together with a sample of $z$ to a prediction $\hat{y}_i$.)
$z \sim \mathcal{N}(\mu(r), \sigma^2(r) I)$
$g_\theta(x_i) = P(y \mid z, x_i)$
26. Comparing Architectures : VAE, CNPs & NPs
X
qφ(z | X)
pθ(X | z)
ˆX
z ∼ N(0, I)
X Y
r =
n
i=1 ri
ˆY
gθ(Y | ˆX, r)
hθ(xi, yi)
X Y
r =
n
i=1 ri
ˆY
hθ(xi, yi)
z ∼ N(µ(r), σ2
(r)I)
gθ(Y | ˆX, z)
K. Matsui (RIKEN AIP) Neural Processes Family Neural Processes Model 25 / 60
27. Black-Box Optimization with Thompson Sampling
Neural process | Gaussian process | Random search
0.26           | 0.14             | 1.00
28. Attentive Neural Processes [Kim+ (ICLR2019)]
29. Recall: Neural Processes
(Figure: encoder-decoder diagram of the NP with a deterministic path and a latent path. Encoder MLPs map each $(x_i, y_i)$ to $r_i$ on the deterministic path and $s_i$ on the latent path; mean aggregation gives $r_C$ and $s_C$; $z$ is sampled on the latent path; the decoder MLP predicts $y^*$ from $(x^*, r_C, z)$.)
• The version that incorporates both the latent representation $r$ and the latent variable $z$ into the model.
• Trained with the ELBO as the objective:
$\log p(y_T \mid x_T, x_C, y_C) \ge \mathbb{E}_{q(z \mid s_T)}\left[\log p(y_T \mid x_T, r_C, z)\right] - D_{\mathrm{KL}}\left(q(z \mid s_T) \,\|\, q(z \mid s_C)\right)$
30. Motivation i
The original NP tends to underfit the context set.
38. Experiment 1: 1D Function Regression on Synthetic GP Data
(Figure 3 of the ANP paper [Kim+ (ICLR2019)]: moving averages of the context reconstruction error and the target negative log likelihood (NLL) given contexts, plotted against training iterations and against wall-clock time, together with the predictive mean and variance of different attention mechanisms given the same context.)
• (A)NPs are trained on data generated from a GP with a squared-exponential kernel and small likelihood noise; (A)NPs need not be trained on GP data, and this is just an illustrative example. Two settings are explored: kernel hyperparameters fixed throughout training, and hyperparameters varying randomly.
• $d$ denotes the bottleneck size, i.e., the hidden layer size of the MLPs and the dimensionality of $r$ and $z$.
• The numbers of context points $n$ and target points $m$ vary across iterations.
• ANPs use self-attention and cross-attention.
• The plotted quantities are the context and target log likelihoods
$\frac{1}{|C|} \sum_{i \in C} \mathbb{E}_{q(z \mid s_C)}\left[\log p\left(y_i \mid x_i, r(x_C, y_C, x_i), z\right)\right]$
$\frac{1}{|T|} \sum_{i \in T} \mathbb{E}_{q(z \mid s_C)}\left[\log p\left(y_i \mid x_i, r(x_C, y_C, x_i), z\right)\right]$
39. Experiment 1: 1D Function Regression on Synthetic GP Data
(Figure 1 of the ANP paper [Kim+ (ICLR2019)]: comparison of predictions given by a fully trained NP and ANP for 1D function regression; the ANP's predictions are noticeably more accurate than the NP's at the context points, which provide relevant information for a given target prediction.)
• NP: inaccurate predictive means, and overestimated variances at the input locations.
• ANP: addresses both issues → multihead attention (a sketch of the cross-attention follows).
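A minimal, illustrative sketch of the ANP's cross-attention on the deterministic path, using PyTorch's built-in multi-head attention; the embedding dimension, head count, and tensor shapes are assumptions, not the paper's configuration. Each target input attends over the context inputs, so every target gets its own representation $r(x_C, y_C, x^*)$ instead of one global mean $r$:

```python
import torch
import torch.nn as nn

# Cross-attention: queries are embedded target inputs, keys are embedded
# context inputs, values are the per-context representations r_i.
attn = nn.MultiheadAttention(embed_dim=128, num_heads=8, batch_first=True)

xc_emb = torch.randn(1, 10, 128)   # 10 embedded context inputs  (keys)
r_i    = torch.randn(1, 10, 128)   # per-context representations (values)
xt_emb = torch.randn(1, 25, 128)   # 25 embedded target inputs   (queries)

r_star, _ = attn(xt_emb, xc_emb, r_i)   # r*(x_t): one vector per target
```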
40. Experiment 1: 1D Function Regression on Synthetic GP Data
41. Experiment 2: 2D Function Regression on Image Data
• Predictive distribution: $p(y_T \mid x_T, r_C, z)$
42. Experiment 2: 2D Function Regression on Image Data
(Results figure.)
43. Meta-Learning Surrogate Models for Sequential Decision Making [Galashov+ (ICLR2019WS)]
45. BO from a Meta-Learning Viewpoint
Key observation: we can sample functions similar to the target function from a prior distribution (e.g., a GP).
Algorithm 1: Bayesian Optimisation
Input:
  $f^*$ : target function of interest ($= T^*$)
  $D_0 = \{(x_0, y_0)\}$ : observed evaluations of $f^*$
  $N$ : maximum number of function iterations
  $M_\theta$ : model pre-trained on evaluations of similar functions $f_1, \ldots, f_n \sim p(T)$
for $n = 1, \ldots, N$ do
  // Model adaptation
  Optimise $\theta$ to improve $M$'s predictions on $D_{n-1}$.
  Thompson sampling: draw $\hat{g}_n \sim M$, find $x_n = \arg\min_{x \in X} \mathbb{E}_{\hat{g}_n}[y \mid x]$.
  Evaluate the target function and save the result: $D_n \leftarrow D_{n-1} \cup \{(x_n, f^*(x_n))\}$.
end for
(A Python sketch of this loop follows.)
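A compact Python sketch of Algorithm 1 over a finite candidate set; `model.adapt` and `model.sample_function` are assumed interfaces standing in for the model-adaptation and Thompson-sampling steps, not an actual API from the paper:

```python
import numpy as np

def bayes_opt(f_star, model, X_cand, D0, N=50):
    # model.adapt(D): optimise theta on the data collected so far.
    # model.sample_function(): draw one Thompson sample g_hat ~ M as a
    # callable over candidate inputs. Both are hypothetical interfaces.
    D = list(D0)
    for _ in range(N):
        model.adapt(D)                          # model adaptation on D_{n-1}
        g_hat = model.sample_function()         # Thompson sample g_hat ~ M
        x_n = X_cand[np.argmin(g_hat(X_cand))]  # argmin_x E_{g_hat}[y | x]
        D.append((x_n, f_star(x_n)))            # evaluate f*, record result
    return min(D, key=lambda pair: pair[1])     # best point found
```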
46. NPs as the Meta-Learning Model
Use neural processes as the model M because of:
1. Statistical efficiency: accurate predictions of function values based on small numbers of evaluations.
2. Calibrated uncertainties: balance exploration and exploitation.
3. O(n + m) computational complexity.
4. Non-parametric modeling → no need to set hyperparameters such as the learning rate and update frequency required in MAML.
47. Experiments: Bayesian Optimization via NPs
Adversarial task search for RL agents [Ruderman+ (2018)]
• Search problem over adversarially designed 3D mazes:
  • trivially solvable by human players,
  • but RL agents fail catastrophically.
• Notation
  • $f_A$: the given agent's mapping from task parameters to its performance $r$
  • parameters of the task:
    • $M$: maze layout
    • $p_s, p_g$: start and goal positions
Problem setup
1. Position search: $(p_s^*, p_g^*) = \arg\min_{p_s, p_g} f_A(M, p_s, p_g)$
2. Full maze search: $(M^*, p_s^*, p_g^*) = \arg\min_{M, p_s, p_g} f_A(M, p_s, p_g)$
48. Experiments: Bayesian Optimization via NPs
(Figure 2: Bayesian Optimisation results. Left: position search; right: full maze search. The minimum up to iteration t (scaled to [0, 1]) is reported as a function of the number of iterations. Bold lines show the mean performance over 4 unseen agents on a set of held-out mazes, with 20% of the standard deviation. Baselines: GP: Gaussian process (with a linear and Matern 3/2 product kernel [Bonilla et al., 2008]); BBB: Bayes by Backprop [Blundell et al., 2015]; AlphaDiv: alpha-divergence [Hernández-Lobato et al., 2016]; DKL: Deep Kernel Learning [Wilson et al., 2016].)
49. On the Connection between Neural Processes and Gaussian Processes with Deep Kernels [Rudner+ (NeurIPS2018WS)]
50. Contributions
Establishes a theoretical relationship between neural processes (NPs) and Gaussian processes (GPs): under certain conditions, NPs are mathematically equivalent to GPs whose kernel function is a deep kernel.
• GP theory may thereby become applicable to NPs.
• It suggests a training scheme in which a deep kernel GP is trained once to obtain a covariance function that transfers to different prediction tasks.
Strategy
• The deep kernel GP and the NP yield the same ELBO.
• As generative models, the two coincide when the NP's decoder is written as the inner product of a deep-kernel NN and the latent variable.
51. Gaussian Processes with Deep Kernels i
Notation
• $x_{1:n}, y_{1:n}$: observations
• $f : \mathbb{R}^p \to \mathbb{R}$: true function
• GP model: $p(f \mid x_{1:n}) = \mathcal{N}(m, K)$, $p(y_{1:n} \mid f) = \mathcal{N}(f, \tau^{-1} I)$,
where $f = (f(x_1), \ldots, f(x_n))$, $m = (m(x_1), \ldots, m(x_n))$, and $K_{ij} = k(x_i, x_j)$.
52. Gaussian Processes with Deep Kernels ii
Definition 1 (deep kernel [Tsuda+ (2002)])
$k(x_i, x_j) := \frac{1}{d} \sum_{j, j'=1}^{d} \sigma\left(w_j^\top x_i + b_j\right) \Sigma_{jj'} \, \sigma\left(w_{j'}^\top x_j + b_{j'}\right)$
• $\sigma(w_j^\top x_i + b_j)$ is a one-layer NN, where $w, b$ are model parameters and $\sigma(\cdot)$ is an activation function.
• $\Sigma = (\Sigma_{jj'})_{j,j'=1}^{d}$ is a positive semidefinite matrix.
Matrix notation: setting
$\phi_i := \phi(x_i, W, b) = \sqrt{\tfrac{1}{d}} \, \sigma(W^\top x_i + b) \in \mathbb{R}^d$
and stacking the $\phi_i$ as the rows of $\Phi = [\phi_1, \ldots, \phi_n]^\top$, we get $k(X, X) = \Phi \Sigma \Phi^\top$. (A NumPy sketch follows.)
In what follows, the GP mean function is assumed to take the form $m(X) = \Phi \mu$, $\mu \in \mathbb{R}^d$.
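A NumPy sketch of this construction with random illustrative parameters: $\phi(x) = \sqrt{1/d}\,\sigma(W^\top x + b)$ with $\sigma = \tanh$, a PSD matrix $\Sigma$ built as $A A^\top$, and $k(X, X) = \Phi \Sigma \Phi^\top$; dimensions and values are assumptions for illustration:

```python
import numpy as np

d, p, n = 64, 3, 20
rng = np.random.default_rng(0)
W, b = rng.normal(size=(p, d)), rng.normal(size=d)   # one-layer NN parameters
A = rng.normal(size=(d, d))
Sigma = A @ A.T                                      # PSD matrix Sigma = A A^T

def phi(X):
    # feature map phi(x) = sqrt(1/d) * tanh(W^T x + b), one row per input
    return np.tanh(X @ W + b) / np.sqrt(d)

X = rng.normal(size=(n, p))
Phi = phi(X)                                         # Phi: n x d
K = Phi @ Sigma @ Phi.T                              # k(X, X) = Phi Sigma Phi^T
```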
53. Gaussian Processes with Deep Kernels iii
The evidence (marginal likelihood) obtained by integrating out the latent function:
$p(y \mid X) = \int p(y, f \mid X) \, df = \int p(y \mid f) \, p(f \mid X) \, df = \mathcal{N}\left(\Phi \mu, \; \Phi \Sigma \Phi^\top + \tau^{-1} I_n\right)$
To relate this to the NPs' generative model, introduce a latent variable $z \sim \mathcal{N}(\mu, \Sigma)$. The same evidence is then also obtained by marginalizing over $z$:
$p(y \mid X) = \int p(y \mid X, z) \, p(z) \, dz = \int \mathcal{N}\left(\Phi z, \tau^{-1} I_n\right) \mathcal{N}(\mu, \Sigma) \, dz = \mathcal{N}\left(\Phi \mu, \; \Phi \Sigma \Phi^\top + \tau^{-1} I_n\right)$
In particular, when $z \sim \mathcal{N}(0, I_d)$, $p(y \mid X) = \mathcal{N}(0, \Phi \Phi^\top + \tau^{-1} I_n)$.
54. Gaussian Processes with Deep Kernels iv
Computing the evidence $p(y \mid X)$ requires inverting the covariance matrix $\Phi \Sigma \Phi^\top + \tau^{-1} I_n$, at $O(n^3)$ computational cost.
→ Replace it with the evidence lower bound (ELBO) to reduce the cost:
$\log p(Y \mid X) \ge \mathbb{E}_{q(z \mid X)}[\log p(Y \mid z, X)] - \mathrm{KL}(q(z \mid X) \, \| \, p(z))$
55. Matching the ELBOs at Prediction Time i
Write the ELBO of the deep kernel GP with the observed and test data explicitly separated ($C = 1{:}m$ and $T = m{+}1{:}n$ denote the observed and test data, respectively):
$\log p(Y_T \mid X_T, X_C, Y_C) \ge \mathbb{E}_{q(z \mid X_T, Y_T)}\left[\log p(Y_T \mid z, X_T)\right] - \mathrm{KL}\left(q(z \mid X_T, Y_T) \, \| \, p(z \mid X_C, Y_C)\right)$
Here $p(z \mid X_C, Y_C)$ is a "data-driven" prior determined by the observed data $X_C, Y_C$:
$p(z \mid X_C, Y_C) = \mathcal{N}(\mu(X_C, Y_C), \Sigma(X_C, Y_C))$
As was done for NPs, this is approximated by the variational posterior:
$p(z \mid X_C, Y_C) \approx q(z \mid X_C, Y_C)$
56. Matching the ELBOs at Prediction Time ii
Under the above, the ELBO of the deep kernel GP is
$\mathbb{E}_{q(z \mid X_T, Y_T)}\left[\log p(Y_T \mid z, X_T)\right] - \mathrm{KL}\left(q(z \mid X_T, Y_T) \, \| \, q(z \mid X_C, Y_C)\right)$
On the other hand, the ELBO of NPs is
$\log p(y_{m+1:n} \mid x_{1:m}, x_{m+1:n}, y_{1:m})$
$\ge \mathbb{E}_{q(z \mid x_{m+1:n}, y_{m+1:n})}\left[\sum_{i=m+1}^{n} \log p(y_i \mid z, x_i) + \log \frac{p(z \mid x_{1:m}, y_{1:m})}{q(z \mid x_{m+1:n}, y_{m+1:n})}\right]$
$\approx \mathbb{E}_{q(z \mid x_{m+1:n}, y_{m+1:n})}\left[\sum_{i=m+1}^{n} \log p(y_i \mid z, x_i) + \log \frac{q(z \mid x_{1:m}, y_{1:m})}{q(z \mid x_{m+1:n}, y_{m+1:n})}\right]$
Hence, if the two generative models are the same, the ELBOs also coincide.
57. Generative Models
The NPs' generative model:
$p(Y \mid z, X) \, p(z) = \mathcal{N}\left(Y; \, g_\theta(z, X), \, \tau^{-1} I\right) \mathcal{N}(z; \, \mu, \Sigma)$
The generative model of deep kernel GPs with a latent variable:
$p(Y \mid z, X) \, p(z) = \mathcal{N}\left(Y; \, \Phi z, \, \tau^{-1} I\right) \mathcal{N}(z; \, \mu, \Sigma)$
Comparing the two, they coincide if we take $g_\theta(z, X) = \Phi z$. More generally, they coincide whenever we use an affine decoder of the form
$g_\theta(z, X) = \Phi_\Theta(X) \, z$,
where $\Phi_\Theta(\cdot)$ is an $L$-layer deep NN with parameters $\Theta = \{W^\ell, b^\ell\}_{\ell=1}^{L}$. (A sketch follows.)
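A PyTorch sketch of such an affine decoder, with an illustrative two-layer feature map standing in for $\Phi_\Theta$; all widths are assumptions:

```python
import torch
import torch.nn as nn

class AffineDecoder(nn.Module):
    def __init__(self, x_dim=1, d=64):
        super().__init__()
        # Phi_Theta(.): deep feature map (two layers here, for illustration)
        self.phi = nn.Sequential(
            nn.Linear(x_dim, 128), nn.Tanh(),
            nn.Linear(128, d))

    def forward(self, X, z):
        # g_theta(z, X) = Phi_Theta(X) z : affine in the latent variable z
        return self.phi(X) @ z

dec = AffineDecoder()
X, z = torch.randn(10, 1), torch.randn(64)
y_mean = dec(X, z)   # predictive mean for 10 inputs, shape (10,)
```

The decoder is an ordinary NN in $X$ but strictly linear in $z$, which is exactly the restriction that makes the NP coincide with a deep kernel GP.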
58. Conclusion
59. Summary
• The NPs family directly models the conditional distribution used to predict the output y.
• The $O((m + n)^3)$ computational cost of prediction in GP regression is reduced to $O(m + n)$.
• Applications to BO have already been explored (on some problems, with higher performance than GP-based BO).
• ANPs, which use attention to derive the latent representations and variables, return regression results closer to a GP's.
• NPs can be viewed as equivalent to using a deep kernel in GP regression.
61. References
[1] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 1126–1135. JMLR.org, 2017.
[2] Alexandre Galashov, Jonathan Schwarz, Hyunjik Kim, Marta Garnelo, David Saxton, Pushmeet Kohli, SM Eslami, and Yee Whye Teh. Meta-learning surrogate models for sequential decision making. arXiv preprint arXiv:1903.11907, 2019.
[3] Marta Garnelo, Dan Rosenbaum, Christopher Maddison, Tiago Ramalho, David Saxton, Murray Shanahan, Yee Whye Teh, Danilo Rezende, and SM Ali Eslami. Conditional neural processes. In International Conference on Machine Learning, pages 1690–1699, 2018.
[4] Marta Garnelo, Jonathan Schwarz, Dan Rosenbaum, Fabio Viola, Danilo J Rezende, SM Eslami, and Yee Whye Teh. Neural processes. arXiv preprint arXiv:1807.01622, 2018.
[5] Hyunjik Kim, Andriy Mnih, Jonathan Schwarz, Marta Garnelo, Ali Eslami, Dan Rosenbaum, Oriol Vinyals, and Yee Whye Teh. Attentive neural processes. arXiv preprint arXiv:1901.05761, 2019.
[6] Tim GJ Rudner, Vincent Fortuin, Yee Whye Teh, and Yarin Gal. On the connection between neural processes and gaussian processes with deep kernels. In Workshop on Bayesian Deep Learning, NeurIPS, 2018.