The document summarizes deep kernel learning (Wilson et al., 2015), which combines deep learning architectures with Gaussian processes (GPs). It briefly reviews the predictive equations and marginal likelihood for GPs and notes their computational requirements. A GP models a dataset of input vectors and targets by placing a joint Gaussian distribution over function values, determined by a mean function and covariance kernel; the predictive distribution at test points is again Gaussian. Deep kernel learning leverages recent work on efficiently representing kernel functions (KISS-GP) to produce scalable deep kernels, which outperform stand-alone deep learning architectures and GPs with advanced kernel learning procedures on a wide range of datasets.
2. Paper overview

¤ "Deep Kernel Learning", Wilson et al. (arXiv, 2015/11/6)
¤ Carnegie Mellon University (group including Salakhutdinov)
¤ Combines deep learning with Gaussian processes
3. Gaussian Processes (excerpt from Wilson et al., 2015)

... show that the proposed model outperforms state-of-the-art stand-alone deep learning architectures and Gaussian processes with advanced kernel learning procedures on a wide range of datasets, demonstrating its practical significance. We achieve scalability while retaining non-parametric model structure by leveraging the very recent KISS-GP approach (Wilson and Nickisch, 2015) and extensions in Wilson et al. (2015) for efficiently representing kernel functions, to produce scalable deep kernels.

We briefly review the predictive equations and marginal likelihood for Gaussian processes (GPs), and the associated computational requirements, following the notational conventions in Wilson et al. (2015). See, for example, Rasmussen and Williams (2006) for a comprehensive discussion of GPs.

We assume a dataset $\mathcal{D}$ of $n$ input (predictor) vectors $X = \{x_1, \dots, x_n\}$, each of dimension $D$, which index an $n \times 1$ vector of targets $y = (y(x_1), \dots, y(x_n))^\top$. If $f(x) \sim \mathcal{GP}(\mu, k_\gamma)$, then any collection of function values $f$ has a joint Gaussian distribution,

$$f = f(X) = [f(x_1), \dots, f(x_n)]^\top \sim \mathcal{N}(\mu, K_{X,X}), \qquad (1)$$

with a mean vector, $\mu_i = \mu(x_i)$, and covariance matrix, $(K_{X,X})_{ij} = k_\gamma(x_i, x_j)$, determined from the mean function and covariance kernel of the Gaussian process. The kernel, $k_\gamma$, is parametrized by $\gamma$. Assuming additive Gaussian noise, $y(x) \mid f(x) \sim \mathcal{N}(y(x); f(x), \sigma^2)$, the predictive distribution of the GP evaluated at the $n_*$ test points indexed by $X_*$ is given by

$$f_* \mid X_*, X, y, \gamma, \sigma^2 \sim \mathcal{N}(\mathbb{E}[f_*], \mathrm{cov}(f_*)), \qquad (2)$$
$$\mathbb{E}[f_*] = \mu_{X_*} + K_{X_*,X}[K_{X,X} + \sigma^2 I]^{-1} y,$$
$$\mathrm{cov}(f_*) = K_{X_*,X_*} - K_{X_*,X}[K_{X,X} + \sigma^2 I]^{-1} K_{X,X_*}.$$

$K_{X_*,X}$, for example, is an $n_* \times n$ matrix of covariances between the GP evaluated at $X_*$ and $X$. $\mu_{X_*}$ is the $n_* \times 1$ mean vector, and $K_{X,X}$ is the $n \times n$ covariance matrix evaluated at the training inputs $X$. All covariance (kernel) matrices implicitly depend on the kernel hyperparameters $\gamma$.
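To make Eqs. (1)-(2) concrete, here is a minimal NumPy sketch of exact GP prediction, assuming a zero mean function and an RBF kernel; the function names and the toy data are ours, not from the paper.

```python
# Minimal sketch of the GP predictive equations (1)-(2), zero mean, RBF kernel.
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0, variance=1.0):
    """k(a, b) = variance * exp(-||a - b||^2 / (2 * lengthscale^2))."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * sq_dists / lengthscale ** 2)

def gp_predict(X, y, X_star, noise_var=0.1, **kern_args):
    """Return E[f_*] and cov(f_*) from Eq. (2), with mu(x) = 0."""
    K_xx = rbf_kernel(X, X, **kern_args) + noise_var * np.eye(len(X))
    K_sx = rbf_kernel(X_star, X, **kern_args)
    K_ss = rbf_kernel(X_star, X_star, **kern_args)
    alpha = np.linalg.solve(K_xx, y)            # (K_{X,X} + sigma^2 I)^{-1} y
    mean = K_sx @ alpha
    cov = K_ss - K_sx @ np.linalg.solve(K_xx, K_sx.T)
    return mean, cov

# Toy usage: 20 noisy observations of sin(x), predictions on a dense grid.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(20, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(20)
mean, cov = gp_predict(X, y, np.linspace(-3, 3, 100)[:, None])
```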
[Figure 15.2: Left: some functions sampled from a GP prior with SE kernel. Right: some samples from a GP posterior, after conditioning on 5 noise-free observations. The shaded area represents E[f(x)] ± 2 std(f(x)).]
6. GP prediction

• Since the joint distribution of a new target $y_{\mathrm{new}}$ (at a new input $x_{\mathrm{new}}$) and the observed targets $y$ is again Gaussian,

$$p(y_{\mathrm{new}} \mid x_{\mathrm{new}}, X, y, \theta) \qquad (17)$$
$$= \frac{p((y, y_{\mathrm{new}}) \mid (X, x_{\mathrm{new}}), \theta)}{p(y \mid X, \theta)} \qquad (18)$$
$$\propto \exp\!\left(-\frac{1}{2}\left([y, y_{\mathrm{new}}]\begin{bmatrix} K & k \\ k^\top & k_{**} \end{bmatrix}^{-1}\begin{bmatrix} y \\ y_{\mathrm{new}} \end{bmatrix} - y^\top K^{-1} y\right)\right) \qquad (19)$$
$$\sim \mathcal{N}\!\left(k^\top K^{-1} y,\; k_{**} - k^\top K^{-1} k\right). \qquad (20)$$

where
− $K = [k(x, x')]$ is the $N \times N$ covariance matrix over the training inputs,
− $k = (k(x_{\mathrm{new}}, x_1), \dots, k(x_{\mathrm{new}}, x_N))$,
− $k_{**} = k(x_{\mathrm{new}}, x_{\mathrm{new}})$ (written simply as $k$ on the original slide).
7. Computational cost of exact GP regression (textbook excerpt)

... developed that have better scaling with training set size than the exact approach (Gibbs, 1997; Tresp, 2001; Smola and Bartlett, 2001; Williams and Seeger, 2001; Csató and Opper, 2002; Seeger et al., 2003). Practical issues in the application of Gaussian processes are discussed in Bishop and Nabney (2008).

We have introduced Gaussian process regression for the case of a single target variable. The extension of this formalism to multiple target variables, known as co-kriging (Cressie, 1993), is straightforward. Various other extensions of Gaussian ...

[Figure: Illustration of Gaussian process regression applied to the sinusoidal data set in Figure A.6 in which the three right-most data points have been omitted. The green curve shows the sinusoidal function from which the data points, shown in blue, are obtained by sampling and addition of Gaussian noise. The red line shows the mean of the Gaussian process predictive distribution, and the shaded region corresponds to plus and minus two standard deviations. Notice how the uncertainty increases in the region to the right of the data points.]
9. Gaussian processes and neural networks

¤ "How can Gaussian processes possibly replace neural networks? Have we thrown the baby out with the bathwater?" [MacKay, 1998]
¤ Bayesian Learning for Neural Networks [Neal, 1996]: Neal, a doctoral student of Hinton, showed that Bayesian neural networks with infinitely many hidden units converge to Gaussian processes, prompting MacKay's question above
¤ Expressive kernel learning for GPs [Wilson, 2014][Wilson and Adams, 2013][Lloyd et al., 2014][Yang et al., 2015]
10. Prior work combining GPs with deep models

¤ Gaussian process regression networks [Wilson et al., 2012]
¤ Deep Gaussian processes [Damianou and Lawrence, 2013]
¤ Salakhutdinov and Hinton (2008): GP kernels learned on features from an unsupervised DBN (deep belief network)
¤ Calandra et al. (2014): feeding NN features into a GP to model sharp discontinuities
11. Deep kernels

¤ Transform the kernel inputs x with a non-linear mapping g(x, w) given by a DNN
¤ Base kernel: RBF or spectral mixture (SM) [Wilson and Adams, 2013]
¤ The DNN transformation captures non-stationary and hierarchical structure

(Excerpt from Wilson et al., 2015, Section 4, Deep Kernel Learning)

In this section we show how we can construct kernels which encapsulate the expressive power of deep architectures, and how to learn the properties of these kernels as part of a scalable probabilistic Gaussian process framework.

Specifically, starting from a base kernel $k(x_i, x_j \mid \theta)$ with hyperparameters $\theta$, we transform the inputs (predictors) $x$ as

$$k(x_i, x_j \mid \theta) \;\to\; k(g(x_i, w), g(x_j, w) \mid \theta, w), \qquad (5)$$

where $g(x, w)$ is a non-linear mapping given by a deep architecture, such as a deep convolutional network, parametrized by weights $w$. The popular RBF kernel (Eq. (3)) is a sensible choice of base kernel $k(x_i, x_j \mid \theta)$. For added flexibility, we also propose to use spectral mixture base kernels (Wilson and Adams, 2013):

$$k_{\mathrm{SM}}(x, x' \mid \theta) = \sum_{q=1}^{Q} a_q \, \frac{|\Sigma_q|^{1/2}}{(2\pi)^{D/2}} \exp\!\left(-\frac{1}{2}\,\big\|\Sigma_q^{1/2}(x - x')\big\|^2\right) \cos\langle x - x',\, 2\pi\mu_q\rangle. \qquad (6)$$

The parameters of the spectral mixture kernel $\theta = \{a_q, \Sigma_q, \mu_q\}$ are mixture weights, bandwidths (inverse length-scales), and frequencies. The spectral mixture (SM) kernel, which forms an expressive basis for all stationary covariance functions, can discover quasi-periodic stationary structure with an interpretable and succinct representation, while the deep learning transformation $g(x, w)$ captures non-stationary and hierarchical structure.
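As an illustration of Eqs. (5)-(6), the following PyTorch sketch composes a small fully-connected network g(x, w) with an RBF or spectral-mixture base kernel. The class and function names are our own, and the tiny network stands in for the much larger architectures used in the paper.

```python
# Sketch of a deep kernel (Eq. 5) with RBF and spectral-mixture (Eq. 6) base kernels.
import torch
import torch.nn as nn

def rbf_base_kernel(z1, z2, log_lengthscale):
    """RBF base kernel on transformed inputs z = g(x, w)."""
    sq = torch.cdist(z1, z2) ** 2
    return torch.exp(-0.5 * sq / torch.exp(log_lengthscale) ** 2)

def sm_base_kernel(z1, z2, weights, bandwidths, frequencies):
    """Spectral mixture kernel, Eq. (6).
    weights: (Q,), bandwidths: (Q, D) diagonals of Sigma_q, frequencies: (Q, D)."""
    diff = z1[:, None, :] - z2[None, :, :]                  # (n1, n2, D)
    D = z1.shape[-1]
    k = 0.0
    for a_q, sig_q, mu_q in zip(weights, bandwidths, frequencies):
        norm = sig_q.prod().sqrt() / (2 * torch.pi) ** (D / 2)
        exp_term = torch.exp(-0.5 * (diff ** 2 * sig_q).sum(-1))
        cos_term = torch.cos(2 * torch.pi * (diff * mu_q).sum(-1))
        k = k + a_q * norm * exp_term * cos_term
    return k

class DeepKernel(nn.Module):
    """k(x_i, x_j | theta, w) = k_base(g(x_i, w), g(x_j, w) | theta)."""
    def __init__(self, d_in, d_feat=2):
        super().__init__()
        # g(x, w): a small fully-connected network; the paper uses e.g.
        # [d-1000-500-50-2], here a tiny stand-in for illustration.
        self.g = nn.Sequential(nn.Linear(d_in, 50), nn.ReLU(),
                               nn.Linear(50, d_feat))
        self.log_lengthscale = nn.Parameter(torch.zeros(()))  # theta

    def forward(self, x1, x2):
        return rbf_base_kernel(self.g(x1), self.g(x2), self.log_lengthscale)
```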
12. Model architecture

¤ A DNN maps the D-dimensional inputs through L parametric hidden layers
¤ The GP base kernel (RBF or SM) acts on the DNN outputs, effectively adding a final hidden layer with an infinite number of basis functions

[Figure 1: Deep Kernel Learning: A Gaussian process with a deep kernel maps D dimensional inputs x through L parametric hidden layers followed by a hidden layer with an infinite number of basis functions, with base kernel hyperparameters θ. Overall, a Gaussian process with a deep kernel ...]
13. Learning the deep kernel hyperparameters

¤ All hyperparameters γ = {w, θ} (DNN weights w and base kernel parameters θ) are learned jointly by maximizing the log marginal likelihood L
¤ Gradients follow from the chain rule; derivatives with respect to the DNN weights are obtained by standard backpropagation

(Excerpt from Wilson et al., 2015)

... is determined by the interpretable length-scale hyperparameter $\ell$. Shorter length-scales correspond to functions which vary more rapidly with the inputs $x$.

The structure of our data is discovered through learning interpretable kernel hyperparameters. The marginal likelihood of the targets $y$, the probability of the data conditioned only on kernel hyperparameters $\gamma$, provides a principled probabilistic framework for kernel learning:

$$\log p(y \mid \gamma, X) \propto -\left[ y^\top (K_\gamma + \sigma^2 I)^{-1} y + \log\left|K_\gamma + \sigma^2 I\right| \right], \qquad (4)$$

where we have used $K_\gamma$ as shorthand for $K_{X,X}$ given $\gamma$. Note that the expression for the log marginal likelihood in Eq. (4) pleasingly separates into automatically calibrated model fit and complexity terms (Rasmussen and Ghahramani, 2001). Kernel learning can be achieved by optimizing Eq. (4) with respect to $\gamma$.

The computational bottleneck for inference is solving the linear system $(K_{X,X} + \sigma^2 I)^{-1} y$, and for kernel learning is computing the log determinant $\log|K_{X,X} + \sigma^2 I|$ in the marginal likelihood. The standard approach is to compute the Cholesky decomposition of the $n \times n$ matrix $K_{X,X}$, which requires $\mathcal{O}(n^3)$ operations and $\mathcal{O}(n^2)$ storage. After inference is complete, the predictive mean costs $\mathcal{O}(n)$, and the predictive variance costs $\mathcal{O}(n^2)$, per test point $x_*$.

... function representation, our network effectively has a hidden layer with an infinite number of hidden units. The overall model is shown in Figure 1.

We emphasize, however, that we jointly learn all deep kernel hyperparameters, $\gamma = \{w, \theta\}$, which include $w$, the weights of the network, and $\theta$, the parameters of the base kernel, by maximizing the log marginal likelihood $\mathcal{L}$ of the Gaussian process (see Eq. (4)). Indeed, compartmentalizing our model into a base kernel and deep architecture is for pedagogical clarity. When applying a Gaussian process one can use our deep kernel, which operates as a single unit, as a drop-in replacement for e.g. standard ARD or Matérn kernels (Rasmussen and Williams, 2006), since learning and inference follow the same procedures.

For kernel learning, we use the chain rule to compute derivatives of the log marginal likelihood with respect to the deep kernel hyperparameters:

$$\frac{\partial \mathcal{L}}{\partial \theta} = \frac{\partial \mathcal{L}}{\partial K_\gamma}\frac{\partial K_\gamma}{\partial \theta}, \qquad \frac{\partial \mathcal{L}}{\partial w} = \frac{\partial \mathcal{L}}{\partial K_\gamma}\frac{\partial K_\gamma}{\partial g(x, w)}\frac{\partial g(x, w)}{\partial w}.$$

The implicit derivative of the log marginal likelihood with respect to our $n \times n$ data covariance matrix $K_\gamma$ is given by

$$\frac{\partial \mathcal{L}}{\partial K_\gamma} = \frac{1}{2}\left(K_\gamma^{-1} y y^\top K_\gamma^{-1} - K_\gamma^{-1}\right), \qquad (7)$$

where we have absorbed the noise covariance $\sigma^2 I$ into our covariance matrix, and treat it as part of the base kernel hyperparameters $\theta$. $\partial K_\gamma / \partial \theta$ are the derivatives of the deep kernel with respect to the base kernel hyperparameters (such as length-scale), conditioned on the fixed transformation of the inputs $g(x, w)$. Similarly, $\partial K_\gamma / \partial g(x, w)$ are the implicit derivatives of the deep kernel with respect to $g$, holding $\theta$ fixed. The derivatives with respect to the weight variables $\partial g(x, w) / \partial w$ are computed using standard backpropagation.
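The joint learning of γ = {w, θ} via Eq. (4) can be sketched with autograd, which supplies the chain-rule gradients of Eq. (7) and the backpropagation through g(x, w) automatically. The sketch below reuses the hypothetical DeepKernel class from the block above and performs exact O(n³) inference, i.e. without the KISS-GP approximation the paper relies on for scalability.

```python
# Minimal sketch: maximize the log marginal likelihood (Eq. 4) jointly over
# gamma = {w, theta}; autograd handles dL/dK and backprop through g(x, w).
import torch

def negative_log_marginal_likelihood(kernel, X, y, log_noise):
    n = X.shape[0]
    K = kernel(X, X) + torch.exp(log_noise) * torch.eye(n)
    L = torch.linalg.cholesky(K)
    alpha = torch.cholesky_solve(y.unsqueeze(-1), L).squeeze(-1)
    # -log p(y | gamma, X) up to constants: 0.5 y^T K^{-1} y + 0.5 log|K|
    return 0.5 * y @ alpha + torch.log(torch.diagonal(L)).sum()

# Hypothetical setup reusing the DeepKernel sketch above.
X = torch.randn(200, 8)
y = torch.sin(X[:, 0]) + 0.05 * torch.randn(200)
kernel = DeepKernel(d_in=8)
log_noise = torch.nn.Parameter(torch.tensor(-2.0))
opt = torch.optim.Adam(list(kernel.parameters()) + [log_noise], lr=1e-2)

for step in range(500):
    opt.zero_grad()
    loss = negative_log_marginal_likelihood(kernel, X, y, log_noise)
    loss.backward()   # gradients w.r.t. both theta and the DNN weights w
    opt.step()
```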
14. KISS-GP

¤ Replace all instances of K with the KISS-GP covariance approximation [Wilson and Nickisch, 2015][Wilson et al., 2015]
¤ M: sparse interpolation-weight matrix (only 4 non-zero entries per row, for local cubic interpolation)
¤ K_{U,U}: deep-kernel covariance over m inducing points on a regular lattice (Kronecker product of Toeplitz matrices, fast MVMs)
¤ Inference by linear conjugate gradients (LCG), which needs only matrix-vector multiplications
¤ Training cost O(n + h(m)), versus O(m²n + m³) for conventional inducing-point approximations [Quiñonero-Candela and Rasmussen, 2005]

(Excerpt from Wilson et al., 2015)

For scalability, we replace all instances of $K$ with the KISS-GP covariance matrix (Wilson and Nickisch, 2015; Wilson et al., 2015)

$$K \approx M K^{\mathrm{deep}}_{U,U} M^\top := K_{\mathrm{KISS}}, \qquad (8)$$

where $M$ is a sparse matrix of interpolation weights, containing only 4 non-zero entries per row for local cubic interpolation, and $K_{U,U}$ is a covariance matrix created from our deep kernel, evaluated over $m$ latent inducing points $U = [u_i]_{i=1\dots m}$. We place the inducing points over a regular multidimensional lattice, and exploit the resulting decomposition of $K_{U,U}$ into a Kronecker product of Toeplitz matrices for extremely fast matrix vector multiplications (MVMs), without requiring any grid structure in the data inputs $X$ or the transformed inputs $g(x, w)$. Because KISS-GP operates by creating an approximate kernel which admits fast computations, and is independent from a specific inference and learning procedure, we can view the KISS approximation applied to our deep kernels as a stand-alone kernel, $k(x, z) = m_x^\top K^{\mathrm{deep}}_{U,U} m_z$, which can be combined with Gaussian processes or other kernel machines for scalable learning.

For inference we solve $K_{\mathrm{KISS}}^{-1} y$ using linear conjugate gradients (LCG), an iterative procedure for solving linear systems which only involves matrix vector multiplications (MVMs). The number of iterations required for convergence to within machine precision is $j \ll n$, and in practice $j$ depends on the conditioning of the KISS-GP covariance matrix rather than the number of training points $n$. For estimating the log determinant in the marginal likelihood we follow the approach described in Wilson and Nickisch (2015) with extensions in Wilson et al. (2015).

KISS-GP training scales as $\mathcal{O}(n + h(m))$ (where $h(m)$ is typically close to linear in $m$), versus conventional scalable GP approaches which require $\mathcal{O}(m^2 n + m^3)$ (Quiñonero-Candela and Rasmussen, 2005) computations and need $m \ll n$ for tractability, which results in severe deteriorations in predictive performance. The ability to have large $m \approx n$ allows KISS-GP to achieve near-exact accuracy in its approximation (Wilson and Nickisch, 2015), retaining a non-parametric representation, while providing linear scaling in $n$ and $\mathcal{O}(1)$ time per test point prediction (Wilson et al., 2015).
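The LCG step can be illustrated with a generic conjugate-gradient solver that touches the covariance matrix only through matrix-vector products. The mvm closure below uses a dense toy matrix; in KISS-GP it would instead exploit the sparse interpolation and Kronecker/Toeplitz structure of K_KISS. This is a schematic sketch, not the authors' implementation.

```python
# Conjugate gradients for A x = b using only matrix-vector products v -> A v.
import numpy as np

def conjugate_gradients(mvm, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for a symmetric positive-definite A, given only mvm(v) = A v."""
    x = np.zeros_like(b)
    r = b - mvm(x)
    p = r.copy()
    rs_old = r @ r
    for _ in range(max_iter):
        Ap = mvm(p)
        alpha = rs_old / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x

# Toy check against a dense solve.
rng = np.random.default_rng(0)
A = rng.standard_normal((100, 100))
A = A @ A.T + 100 * np.eye(100)       # well-conditioned SPD matrix
b = rng.standard_normal(100)
x = conjugate_gradients(lambda v: A @ v, b)
assert np.allclose(x, np.linalg.solve(A, b), atol=1e-6)
```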
16. UCI repository experiments

¤ DNN architecture [d-1000-500-50-2] for datasets with n < 6,000; [d-1000-1000-500-50-2] for n > 6,000
¤ GP baselines (RBF, SM, best) use Fastfood finite basis function expansions

Table 1: Comparative RMSE performance and runtime on UCI regression datasets, with n training points and d the input dimensions. The results are averaged over 5 equal partitions (90% train, 10% test) of the data ± one standard deviation. "Best" denotes the best-performing kernel according to Yang et al. (2015) (note that often the best performing kernel is GP-SM). Following Yang et al. (2015), as exact Gaussian processes are intractable on the large data used here, the Fastfood finite basis function expansions are used for approximation in GP (RBF, SM, best). We verified on datasets with n < 10,000 that exact GPs with RBF kernels provide comparable performance to the Fastfood expansions. For datasets with n < 6,000 we used a fully-connected DNN with a [d-1000-500-50-2] architecture, and for n > 6,000 we used a [d-1000-1000-500-50-2] architecture. We consider scalable deep kernel learning (DKL) with RBF and SM base kernels. For the SM base kernel, we set Q = 4 for datasets with n < 10,000 training instances, and use Q = 6 for larger datasets.

Columns: dataset, n, d | RMSE for GP-RBF, GP-SM, GP-best, DNN, DKL-RBF, DKL-SM | runtime (s) for DNN, DKL-RBF, DKL-SM.

Gas        2,565      128 | 0.21±0.07   0.14±0.08   0.12±0.07   0.11±0.05   0.11±0.05   0.09±0.06   | 7.43  7.80  10.52
Skillcraft 3,338      19  | 1.26±3.14   0.25±0.02   0.25±0.02   0.25±0.00   0.25±0.00   0.25±0.00   | 15.79 15.91 17.08
SML        4,137      26  | 6.94±0.51   0.27±0.03   0.26±0.04   0.25±0.02   0.24±0.01   0.23±0.01   | 1.09  1.48  1.92
Parkinsons 5,875      20  | 3.94±1.31   0.00±0.00   0.00±0.00   0.31±0.04   0.29±0.04   0.29±0.04   | 3.21  3.44  6.49
Pumadyn    8,192      32  | 1.00±0.00   0.21±0.00   0.20±0.00   0.25±0.02   0.24±0.02   0.23±0.02   | 7.50  7.88  9.77
PoleTele   15,000     26  | 12.6±0.3    5.40±0.3    4.30±0.2    3.42±0.05   3.28±0.04   3.11±0.07   | 8.02  8.27  26.95
Elevators  16,599     18  | 0.12±0.00   0.090±0.001 0.089±0.002 0.099±0.001 0.084±0.002 0.084±0.002 | 8.91  9.16  11.77
Kin40k     40,000     8   | 0.34±0.01   0.19±0.02   0.06±0.00   0.11±0.01   0.05±0.00   0.03±0.01   | 19.82 20.73 24.99
Protein    45,730     9   | 1.64±1.66   0.50±0.02   0.47±0.01   0.49±0.01   0.46±0.01   0.43±0.01   | 142.8 154.8 144.2
KEGG       48,827     22  | 0.33±0.17   0.12±0.01   0.12±0.01   0.12±0.01   0.11±0.00   0.10±0.01   | 31.31 34.23 61.01
CTslice    53,500     385 | 7.13±0.11   2.21±0.06   0.59±0.07   0.41±0.06   0.36±0.01   0.34±0.02   | 36.38 44.28 80.44
KEGGU      63,608     27  | 0.29±0.12   0.12±0.00   0.12±0.00   0.12±0.00   0.11±0.00   0.11±0.00   | 39.54 42.97 41.05
3Droad     434,874    3   | 12.86±0.09  10.34±0.19  9.90±0.10   7.36±0.07   6.91±0.04   6.91±0.04   | 238.7 256.1 292.2
Song       515,345    90  | 0.55±0.00   0.46±0.00   0.45±0.00   0.45±0.02   0.44±0.00   0.43±0.01   | 517.7 538.5 589.8
Buzz       583,250    77  | 0.88±0.01   0.51±0.01   0.51±0.01   0.49±0.00   0.48±0.00   0.46±0.01   | 486.4 523.3 769.7
Electric   2,049,280  11  | 0.230±0.000 0.053±0.000 0.053±0.000 0.058±0.002 0.050±0.002 0.048±0.002 | 3458  3542  4881
17. Face orientation extraction (Olivetti)

¤ The Olivetti face data, 28×28 images [Salakhutdinov and Hinton (2008)]
¤ Images of 30 people used for training, the remaining 10 for testing
¤ The convolutional network produces 2-dimensional outputs; Q = 4 for the SM base kernel

[Figure 2: Left: Randomly sampled examples of the training and test data, with orientation labels (e.g. 36.15, -43.10, -3.49, 17.35, -19.81). Right: The two-dimensional outputs of the convolutional network on a set of test cases. Each point is shown with a line segment that has the same orientation as the input face.]

(Excerpt from Wilson et al., 2015, Section 5.1, UCI regression tasks)

We consider a large set of UCI regression problems of varying sizes and properties. Table 1 reports test root mean squared error (RMSE) for 1) many scalable Gaussian process kernel learning methods based on Fastfood (Yang et al., 2015); 2) stand-alone deep neural networks (DNNs); and 3) our proposed combined deep kernel learning (DKL) model using both RBF and SM base kernels.

For smaller datasets, where the number of training examples n < 6,000, we used a fully-connected neural network with a d-1000-500-50-2 architecture; for larger datasets we used a d-1000-1000-500-50-2 architecture.

A Appendix
A.1 Convolutional network architecture

Table 3 lists the architecture of the convolutional networks used in the tasks of face orientation extraction (section 5.2) and digit magnitude extraction (section 5.3). The CNN architecture originates from LeNet (LeCun et al., 1998, for digit classification) and is adapted to the above tasks with one or two more fully-connected layers for feature transformation.

Layer        conv1  pool1  conv2  pool2  full3  full4  full5  full6
kernel size  5×5    2×2    5×5    2×2    -      -      -      -
stride       1      2      1      2      -      -      -      -
channel      20     20     50     50     1000   500    50     2

Table 3: The architecture of the convolutional network used in face orientation extraction. The CNN used in the MNIST digit magnitude regression has a similar architecture except that the full3 layer is omitted. Both pool1 and pool2 are max pooling layers. A ReLU layer is placed after full3 and full4.
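For reference, a PyTorch sketch of the Table 3 architecture might look as follows, assuming 28×28 single-channel inputs as in the Olivetti experiment; the class name, padding choices, and input size are our assumptions, since the table only specifies kernel sizes, strides, and channel/unit counts.

```python
# Sketch of the Table 3 CNN, assuming 28x28 single-channel inputs (our assumption).
import torch
import torch.nn as nn

class FaceOrientationCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 20, kernel_size=5, stride=1),    # conv1
            nn.MaxPool2d(kernel_size=2, stride=2),        # pool1
            nn.Conv2d(20, 50, kernel_size=5, stride=1),   # conv2
            nn.MaxPool2d(kernel_size=2, stride=2),        # pool2
        )
        # After two conv/pool blocks a 28x28 input becomes 50 x 4 x 4 = 800.
        self.regressor = nn.Sequential(
            nn.Linear(800, 1000), nn.ReLU(),              # full3 + ReLU
            nn.Linear(1000, 500), nn.ReLU(),              # full4 + ReLU
            nn.Linear(500, 50),                           # full5
            nn.Linear(50, 2),                             # full6 (2-d features)
        )

    def forward(self, x):
        z = self.features(x)
        return self.regressor(z.flatten(1))

out = FaceOrientationCNN()(torch.randn(4, 1, 28, 28))     # -> shape (4, 2)
```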
18. Comparison with DBN-GP (Olivetti and MNIST)

¤ DKL is trained on the same 12,000 training instances as DBN-GP, but uses all labels; DBN-GP (like the plain GP) scales to only 1,000 labeled images
¤ Test RMSE comparison

Table 2: RMSE performance on the Olivetti and MNIST. For comparison, in the face orientation extraction, we trained DKL on the same amount (12,000) of training instances as with DBN-GP, but used all labels; whereas DBN-GP (as with GP) scaled to only 1,000 labeled images and modeled the remaining data through unsupervised pretraining of DBN. We used RBF base kernel within GPs.

Datasets    GP     DBN-GP   CNN    DKL
Olivetti    16.33  6.42     6.34   6.07
MNIST       1.25   1.03     0.59   0.53

Combining KISS-GP with DNNs for deep kernels introduces only negligible runtime costs: KISS-GP imposes an additional runtime of about 10% (one order of magnitude less than) the runtime a DNN typically requires. Overall, these results show the general applicability and practical significance of our scalable DKL approach.

5.2 Face orientation extraction

We now consider the task of predicting the orientation of a face extracted from a gray-scale image ...
[Figure 3: Left: RMSE vs. n, the number of training examples. Middle: Runtime vs. n. Right: Total training time vs. n. The dashed line in black indicates a slope of 1. Convolutional networks are used within DKL. We set Q = 4 for the SM kernel. Curves compare CNN, DKL-RBF, and DKL-SM.]
19. Spectral analysis of the learned kernels

¤ Log spectral densities of the learned spectral mixture (SM) and RBF base kernels within DKL

[Figure 4: The log spectral densities of the DKL-SM and DKL-SE base kernels are in black and red, respectively.]

We further see the benefit of an SM base kernel in Figure 5, where we show the learned covariance matrices constructed from the whole deep kernels (composition of base kernel and deep architecture) for RBF and SM base kernels. The covariance matrix is evaluated on a set of test inputs, where we randomly sample 400 instances from the test set and sort them in terms of the orientation angles of the input faces. We see that the deep kernels with both RBF and SM base kernels discover that faces with similar rotation angles are highly correlated.
20. Learned covariance matrices

¤ Induced covariance matrices on test cases, ordered by face orientation: DKL-SM, DKL-RBF, and a regular RBF kernel
¤ Both deep kernels (DKL-SM, DKL-RBF) capture the correlation between faces with similar orientations, unlike the regular RBF kernel → benefit of DKL

[Figure 5: Left: The induced covariance matrix using DKL-SM kernel on a set of test cases, where the test samples are ordered according to the orientations of the input faces. Middle: The respective covariance matrix using DKL-RBF kernel. Right: The respective covariance matrix using regular RBF kernel. The models are trained with n = 12,000. We set Q = 4 for the SM base kernel.]

5.3 Digit magnitude extraction
21. MNIST digit magnitude extraction

¤ Regression of the digit magnitude (the label value) from MNIST images
¤ Test RMSE: GP 1.25, DBN-GP 1.03, CNN 0.59, DKL 0.53 (see Table 2 above)
22. Recovering a step function

¤ DKL: scalable deep kernels combining deep learning architectures with Gaussian processes
¤ Comparison of GP (RBF), GP (SM), and DKL-SM on a step function (Figure 6)

[Figure 6: Recovering a step function. We show the predictive mean and 95% of the predictive probability mass for regular GPs with RBF and SM kernels, and DKL with SM base kernel, together with the training data. We set ...]