Expert Systems With Applications 213 (2023) 119057
Multi-view and Multi-level network for fault diagnosis accommodating
feature transferability
Na Lu *, Zhiyan Cui, Huiyang Hu, Tao Yin
Systems Engineering Institute, School of Automation Science and Engineering, Xi’an Jiaotong University, Xi’an, Shaanxi 710049, China
ARTICLE INFO
Keywords:
Transfer learning
Feature transferability
Fault diagnosis
Few shot learning
ABSTRACT
Various deep transfer learning solutions have been developed for machine fault diagnosis. The existing solutions mainly focus on domain adaptation by minimizing the data distribution discrepancy with a certain metric, which emphasizes the common features embedded in the data across domains and neglects the unique features relevant to health condition classification in one specific domain. In these solutions, all the training data are forced to align in a common feature space and all the features for domain adaptation are treated equally. However, there might exist domain specific features which are not appropriate for transfer but may contain essential information for classification in a specific domain. In addition, due to the difficulty of collecting machine fault data, the number of machine fault samples is usually quite small or even zero. Traditional deep network structures and training strategies are not the optimal choice in this situation. To address these problems, a novel multi-view and multi-level network (MMNet) for fault diagnosis is developed. In MMNet, two network channels are respectively constructed for cross domain common feature and domain specific feature learning to provide multi-view features. This architecture can implicitly differentiate the features common across domains from the features specific to one domain. In the domain specific feature channel, a domain classifier and a fault classifier are combined to learn the domain specific features. Multiple kernel maximum mean discrepancy (MK-MMD) is imposed on multiple layers of the common feature channel to implement domain adaptation and extract cross domain common features. The domain classification and fault classification together form a multi-level classification scheme. A classic few shot learning architecture with two modules, respectively for feature extraction and relation computation, is adopted as the backbone network. The relation score based classification mechanism enables zero shot fault classification in the target domain. An episode based few shot training strategy is employed to enhance the performance of MMNet with few labeled training data. Extensive experiments have demonstrated the state-of-the-art performance of MMNet on the involved transfer tasks.
1. Introduction
Machine faults in industry can bring catastrophic damage and enormous economic loss (Lei, Jia, Lin, Xing, & Ding, 2016). Therefore, fault diagnosis has long been a popular and important research field, involving multidisciplinary research areas like mechanical engineering, signal processing and machine learning. Machines usually work in a healthy state during most of their life cycle. The possible faults only occur on rare occasions. Due to the long time span of the normal condition and the sporadic occurrence of faults, it is commonly acknowledged that the fault data collected from one machine are quite limited, especially in practical applications. In a laboratory environment, by contrast, it is much easier to collect manually fabricated fault data. Therefore, how to learn efficient representations of fault data and transfer the knowledge learnt from data-abundant scenarios to data-scarce scenarios is crucial for fault diagnosis.
To this end, deep learning and transfer learning have been widely explored in fault diagnosis in recent decades. Various deep network models have been employed to automatically extract discriminant features from machine fault data (Lu & Yin, 2021).
The code (and data) in this article has been certified as Reproducible by CodeOcean: https://codeocean.com. More information on the Reproducibility Badge Initiative is available at https://www.elsevier.com/physicalsciencesandengineering/computerscience/journals.
* Corresponding author.
E-mail address: lvna2009@xjtu.edu.cn (N. Lu).
https://doi.org/10.1016/j.eswa.2022.119057
Received 16 June 2021; Received in revised form 23 April 2022; Accepted 13 October 2022
Network structures like AutoEncoder (Yu, Wang, Li, & Zhao, 2019), sparse AutoEncoder (Wen, Gao, & Li, 2017) and Convolutional Neural Network (CNN) (Jia, Lei, Lu, & Xing, 2018; Yang, Lei, Jia, Li, & Du, 2020) have been widely employed for fault representation learning. In addition, Generative Adversarial Network (GAN) based methods (Chen et al., 2020; Li et al., 2020; Zhang et al., 2020) have also been employed for fault diagnosis; they aim to generate more fault samples to balance the fault dataset and improve the classification performance. Except for the GAN based solutions, most of the fault classification network architectures and their training methods were borrowed directly from classic deep learning solutions in computer vision, which fit big data applications well. However, when the fault data are not abundant, and especially when no labeled data are available, a more appropriate network architecture and training method need to be developed.
Another important issue in fault diagnosis is how to transfer the
knowledge from the domain with relatively abundant labeled data
(source domain) to the domain with few or no labeled data (target
domain). Here the different domains could be understood as different
machines or one machine under different working conditions. To
address this issue, many solutions combining deep neural network and
transfer learning have been developed (Li et al., 2020; Li, Zhang, Ding, &
Sun, 2019; Shao, McAleer, Yan, & Baldi, 2019; Xu, Liu, Jiang, Shen, &
Huang, 2020; Yang et al., 2020; Yang, Lei, Jia, & Xing, 2019) which we
refer to as deep transfer learning methods for simplicity. These methods
mainly aimed at minimizing the distribution discrepancy between
different domains and improving the fault classification accuracy. To
fulfill domain adaptation, multiple metrics of data distribution have
been applied, including Maximum Mean Discrepancy (MMD) (Yang
et al., 2019), Multi-kernel Maximum Mean Discrepancy (MK-MMD)
(Che, Wang, Ni, & Fu, 2020), Polynomial-kernel Maximum Mean
Discrepancy (PK-MMD) (Yang et al., 2020) and so on. These metrics
evaluate the data distribution difference which is used as the domain
adaptation loss to train the fault diagnosis model. The training objective
functions of deep transfer learning models usually contain two parts,
classification loss and domain adaptation loss. By minimizing the overall
loss of these terms, the deep transfer learning models could be trained.
Long et al. (Long, Cao, Wang, & Jordan, 2015) developed a widely used
deep transfer learning method with domain adaptation. The MK-MMD loss was imposed on the last three fully connected layers, excluding the output layer, to enable domain adaptation. Lu et al. (Lu et al., 2017) adopted MMD as
the distribution discrepancy measure and developed a deep neural
network (DNN) model for fault diagnosis. The MMD loss was imposed on
the feature layer of a DNN. A gearbox dataset collected under different
working conditions was employed to evaluate the method. A deep
convolutional transfer learning network (DCTLN) was constructed by
Guo et al. (Guo, Lei, Xing, Yan, & Li, 2019) to implement fault diagnosis
knowledge transfer. One convolutional network module was used for
fault condition recognition and another convolutional network module
was used for domain distribution adaptation. Three datasets collected
from bearings were used for experiments to test the transferability of
DCTLN. Wen et al. (Wen, Gao, & Li, 2019) developed a sparse autoen­
coder for feature representation learning which used frequency spec­
trum of vibration sequences recorded from bearings as input. Domain
adaptation was implemented via MMD. Li et al. (Li, Zhang, & Ding,
2018) also proposed a domain adaptive deep convolutional neural
network for bearing fault diagnosis and fault knowledge transfer. The
fault dataset was collected under working environments with different
noise. Frequency spectrum was employed as the input to the CNN model.
The cross-domain feature discrepancy was also minimized based on
MMD. FTNN (feature-based transfer neural network) was developed by
Yang et al. (Yang et al., 2019) to diagnose the machine faults of real-case
machines by the knowledge learnt from the data recorded from labo­
ratory machines. MMD was also adopted for domain adaptation which
was imposed on multiple network layers. Four bearing fault datasets
were used to construct the transfer experiments and test the perfor­
mance of FTNN. Lu et al. (Lu & Yin, 2021) developed a combined
solution of convolutional autoencoder and convolutional network for
bearing fault diagnosis, where the convolutional autoencoder was
adopted to mine the common features cross domains. MMD was
employed for domain adaptation in the convolutional autoencoder.
From the above literature review, it can be seen that domain transfer is usually implemented by imposing a certain domain distribution metric on one or several network layers within the deep transfer model.
In these solutions, all the input data for training were treated equally for
domain adaptation. No matter what transfer learning solutions were
adopted, an intermediate data distribution space would be learnt where
the source domain and the target domain data were aligned with each
other. Therefore, an implicit assumption is actually made in the existing
deep transfer solutions that all the features learnt from the source
domain could be appropriately transferred into the intermediate feature
space, meanwhile maintaining discriminant power in both domains.
However, there is no guarantee that the features of the data belonging to
the same category from different domains could be transferred to the
same cluster in the intermediate feature space. Some original features might carry discriminant information for the source domain that gets lost after the transfer. The nonlinear feature mapping obtained by the deep transfer model is not a deterministic projection function for both domains, which means samples from the same class but different domains might be mapped to regions belonging to different classes. Mistakenly projected samples will deteriorate the performance of the transfer model and lead to false
classification. Therefore, to achieve high classification accuracy it is not
sufficient to transfer all the source samples to the common feature space
and only use the transferred common features cross domains for fault
diagnosis.
In order to keep the domain specific features and mine the features common to both domains simultaneously, a novel deep transfer solution termed the multi-view and multi-level network (MMNet) is developed in this paper. MMNet constructs a dual channel structure to learn the representations of common features across domains and discriminant features in a specific domain, which together form multi-view features for classification. Domain level classification and fault level classification are
combined to extract the domain specific features. The cross domain
common features are learnt by MK-MMD based domain adaptation and
fault level classification. In addition, to deal with the data deficiency problem, an efficient few shot learning mechanism is adopted which employs two modules, i.e., a feature extraction module and a feature comparison module, to perform fault diagnosis. Two weight shared branches are employed to extract multi-view features of both domains simultaneously, which form the feature extraction module. In the feature comparison module, the relation score between a template sample and a query sample is used to implement fault classification. In MMNet, no labeled
sample from the target domain is required. The test samples from the
target domain are compared with the template samples from the source
domain for fault diagnosis, which enables zero shot diagnosis in the
target domain. Episode based training strategy is adopted to train
MMNet.
There are three major contributions in this paper.
First, the property of the features before and after domain transfer
has been analyzed, based on which a multi-view feature extraction
mechanism incorporating domain specific features and cross domain
common features is proposed.
Second, a multi-view multi-level network MMNet is constructed
which combines fault level classification and domain level classification
to learn domain specific features, and meanwhile combines MK-MMD
based domain adaptation and fault level classification to learn com­
mon features cross domains.
Third, a FeatureNet module is used to extract sample features and a
RelationNet module is adopted to implement fault classification in
MMNet, which enables zero shot fault diagnosis in the target domain.
The paper is organized as follows. Section 1 is the introduction. Problem
formulation, transfer feature analysis and some preliminary knowledge
are discussed in Section 2. Section 3 describes the proposed solution MMNet in detail. Section 4 reports experiment and comparison results
to demonstrate the effectiveness of MMNet. Conclusions are made in
Section 5.
2. Motivation and preliminaries
2.1. Problem formulation and motivation
In a machine fault diagnosis task, data are collected from one machine under different working conditions or from different machines. The data from different working conditions or different machines follow different probability distributions, which are viewed as different domains.
Transfer learning aims at borrowing the knowledge learnt from one
domain to another domain. The former one is called source domain and
the latter one the target domain, denoted as $\mathcal{D}_s$ and $\mathcal{D}_t$ respectively. The sample spaces of the source domain and the target domain can be denoted as $X_s$ and $X_t$, which satisfy $X_s \subset \mathcal{D}_s$ and $X_t \subset \mathcal{D}_t$. The samples drawn from the source space can be represented as $\{x_i^s\}, i = 1, 2, \cdots, n_s$, and the samples from the target space as $\{x_i^t\}, i = 1, 2, \cdots, n_t$, where $n_s$ and $n_t$ are respectively the numbers of samples from the corresponding domains. The fault categories in the source and the target domain are assumed to be the same. The fault class space is denoted as $\mathcal{Y} = \{1, 2, \cdots, C\}$, where $C$ is the number of fault categories involved. Therefore, $\mathcal{Y}_s = \mathcal{Y}_t = \mathcal{Y}$. Accordingly, one labeled sample from the source and the target domain can be respectively represented as $\{x_i^s, y_i^s\}, i = 1, 2, \cdots, n_s$ and $\{x_i^t, y_i^t\}, i = 1, 2, \cdots, n_t$. In our study, the training set from the source domain is labeled and no label information from the target domain training set is used.
Transfer learning methods try to learn an intermediate feature space where the data from different domains can be aligned. When deep transfer learning methods are employed, an intermediate feature space can be constructed from the learnt features, denoted as $X_m$. At different layers of the deep model, multiple intermediate feature spaces will be learnt. For simplicity, we use $X_m$ as a general representation of all the intermediate feature spaces. The nonlinear mapping from an input sample to the intermediate feature space is represented as $\varphi: X_s, X_t \to X_m$. With an ideal nonlinear mapping, the input samples from the
source and the target domain belonging to one category should be
mapped to the same region within one class boundary in the feature
space. However, the nonlinear model learned by neural network
training is not a deterministic optimal solution. Some samples of the
same class from the source and the target domains will be mapped to
different class regions. Fig. 1 gives an illustration of the mistakenly
mapped samples. Fig. 1(a) depicts the samples within the source domain
and Fig. 1(b) shows the projected results in the intermediate feature
space from both the source and the target domain. The solid triangles
and circles in Fig. 1(a) and (b) are samples from two fault classes of the
source domain. The dotted triangles and circles in Fig. 1(b) represent the
samples from the target domain belonging to the corresponding two
classes as the source domain samples. Within the source domain, these
samples could be well classified by the classification boundary as shown
in Fig. 1(a). When the samples have been mapped to the intermediate
feature space, to correctly classify the target domain samples the ex­
pected target class boundary should be set as in Fig. 1(b). It could be seen
that some mapped source domain samples are not in agreement with the
correct class boundary. When all the mapped samples from the source
domain are treated as prior knowledge for the target domain, an actual
class boundary would be obtained as shown in Fig. 1(b). Obviously some
source domain samples have not been appropriately mapped and could
bring misleading information.
If a deep transfer learning model is employed, to alleviate the influence of the above phenomenon the weights corresponding to such misleading samples should be suppressed. Their contribution to the target domain fault classification should be minimized. However, in the source domain fault classification these samples might play an important role, and thus their corresponding weights cannot be diminished during the model training process. The existing deep transfer learning solutions treat all the samples indifferently in the domain adaptation procedure, which makes the above problem an issue to be addressed and forms one of the motivations of this study.
In addition, the widely used benchmarks for deep model training are
usually of very large scale. The popular image dataset ImageNet (Deng
et al., 2009) contains more than 10 million samples from more than 20
thousand categories. Sports-1M (Karpathy et al., 2014) is a famous
video dataset for action recognition which includes more than 1 million
videos. LaSOT (Fan et al., 2019) is a representative visual tracking
dataset which includes more than 3 million image frames. In contrast,
the fault diagnosis benchmarks like CWRU bearing dataset provided by
Case Western Reserve University (Center), IMS bearing dataset (Guo
et al., 2019) and RL bearing dataset (Lei, 2017) usually only contain
several hundred or several thousand samples.

Fig. 1. Illustration of mistakenly mapped samples from the source domain to the intermediate feature space. (a) Source domain samples and their class boundary. (b) Mapped source domain and target domain samples in the intermediate feature space and class boundaries.

Therefore, fault diagnosis is a relatively small data problem, and appropriate deep models which can deal well with few shot learning scenarios should be explored. Furthermore,
when zero labeled sample is provided in the target domain, how to
implement efficient fault knowledge transfer and fault classification
remains a challenge. This is another motivation of this work.
2.2. Multiple kernel maximum mean discrepancy
Multiple Kernel Maximum Mean Discrepancy (MK-MMD) is an
improved version of Maximum Mean Discrepancy (MMD). MMD is a
metric evaluating the data distribution distinction between the source
and the target domain. It is indicated in (Gretton, Borgwardt, Rasch,
Scholkopf, & Smola, 2012) that the probability distribution difference
between two domains could be estimated by their mean embedding in
the Reproducing Kernel Hilbert Space (RKHS) via the characteristic
kernel function. The Gaussian kernel is characteristic on $\mathbb{R}^d$ and is used to define MMD. Given i.i.d. samples from the source and the target domain, $X_s := \{x_1^s, x_2^s, \cdots, x_{n_s}^s\}$ and $X_t := \{x_1^t, x_2^t, \cdots, x_{n_t}^t\}$, which are respectively drawn from probability distributions $P_s$ and $P_t$, and supposing $\mathcal{H}_k$ is the RKHS endowed with the characteristic Gaussian kernel $k(\cdot)$, the MMD can be formulated as

$$d_{\mathcal{H}_k}(\mathcal{F}, P_s, P_t) := \sup_{f \in \mathcal{F}} \left( \frac{1}{n_s} \sum_{i=1}^{n_s} f(x_i^s) - \frac{1}{n_t} \sum_{i=1}^{n_t} f(x_i^t) \right), \tag{1}$$

where $\mathcal{F}$ is a class of functions performing a nonlinear mapping $f: X_s \to \mathbb{R}$ or $f: X_t \to \mathbb{R}$, and $\sup(\cdot)$ is the supremum of the input. The two terms in the bracket of Eq. (1) are respectively the empirical mean expectations of the source and the target domain calculated on the samples. It has been demonstrated in (Gretton et al., 2012) that the nonlinear function $f(\cdot)$ can be estimated by the endowed Gaussian kernel function. Therefore, MMD can be estimated from the data samples as

$$d^2_{\mathcal{H}_k}(X_s, X_t) = \left\| \frac{1}{n_s} \sum_{i=1}^{n_s} f(x_i^s) - \frac{1}{n_t} \sum_{i=1}^{n_t} f(x_i^t) \right\|^2_{\mathcal{H}_k} = \frac{1}{n_s^2} \sum_{i=1}^{n_s} \sum_{j=1}^{n_s} k(x_i^s, x_j^s) + \frac{1}{n_t^2} \sum_{i=1}^{n_t} \sum_{j=1}^{n_t} k(x_i^t, x_j^t) - \frac{2}{n_s n_t} \sum_{i=1}^{n_s} \sum_{j=1}^{n_t} k(x_i^s, x_j^t), \tag{2}$$

where $k(\cdot, \cdot)$ is the characteristic Gaussian kernel. Given two feature vectors $x_i$ and $x_j$, the Gaussian kernel function is defined as

$$k(x_i, x_j) = e^{-\|x_i - x_j\|^2 / \gamma}, \tag{3}$$

where $\gamma$ is the kernel width.
MMD uses a single Gaussian kernel to evaluate the distribution distinction between the source and the target domain, which suffers from suboptimal kernel selection and limited adaptation effectiveness. MK-MMD (Long et al., 2015) constructs a multiple-kernel variant of MMD, which employs a combination of multiple Gaussian kernels to measure the distribution discrepancy. The characteristic kernel used in MK-MMD is defined as

$$k = \sum_{u=1}^{m_u} \beta_u k_u, \quad \text{s.t.} \quad \sum_{u=1}^{m_u} \beta_u = 1, \; \beta_u \geq 0, \; \forall u, \tag{4}$$

where $m_u$ is the number of kernels used and $\beta_u$ is the weight of kernel $u$. In this research, Gaussian kernels are used as the base kernels. One Gaussian kernel can be written as $k_u(x_i, x_j) = e^{-\|x_i - x_j\|^2 / \gamma}$. By varying the kernel bandwidth $\gamma$ between $2^{-\lfloor m_u/2 \rfloor}\gamma$ and $2^{\lfloor m_u/2 \rfloor}\gamma$ with a scaling factor of 2, where $\lfloor \cdot \rfloor$ denotes integer division, the $m_u$ Gaussian kernels can be obtained.
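For concreteness, the following is a minimal PyTorch sketch of the (biased) MK-MMD estimate of Eqs. (2)-(4), using the median heuristic and the bandwidth family described above. The function names, the uniform kernel weights and the use of squared distances in the median heuristic are our own illustrative assumptions; the paper optimizes the weights $\beta_u$ by QP.

```python
import torch

def gaussian_kernel(x, y, gamma):
    # k(x_i, y_j) = exp(-||x_i - y_j||^2 / gamma), Eq. (3)
    return torch.exp(-torch.cdist(x, y) ** 2 / gamma)

def mk_mmd(xs, xt, n_kernels=5, betas=None):
    """Biased estimate of the squared MK-MMD of Eq. (2) with the
    multi-kernel of Eq. (4). xs, xt: (n, d) feature batches."""
    # Median heuristic for the base bandwidth (Section 4.2.2); taking the
    # median of squared pairwise distances is our reading of the paper
    z = torch.cat([xs, xt], dim=0)
    gamma = torch.median(torch.cdist(z, z) ** 2)
    # Bandwidth family 2^{-floor(m_u/2)} gamma, ..., 2^{floor(m_u/2)} gamma
    gammas = [gamma * 2.0 ** (u - n_kernels // 2) for u in range(n_kernels)]
    if betas is None:
        betas = [1.0 / n_kernels] * n_kernels   # uniform; the paper tunes beta_u by QP
    loss = xs.new_zeros(())
    for beta, g in zip(betas, gammas):
        k_ss = gaussian_kernel(xs, xs, g).mean()
        k_tt = gaussian_kernel(xt, xt, g).mean()
        k_st = gaussian_kernel(xs, xt, g).mean()
        loss = loss + beta * (k_ss + k_tt - 2.0 * k_st)
    return loss
```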
2.3. Few shot learning
Few shot learning has developed into an important direction in
machine learning research which aims at exploring effective solutions
for application scenarios with small training datasets. There are mainly two popular categories of few shot learning methods: metric based methods and optimization based methods. Matching network (Vinyals, Blundell, Lillicrap, Kavukcuoglu, & Wierstra, 2016), prototype network (Snell, Swersky, & Zemel, 2017) and relation network (Sung et al., 2018) are representative metric based few shot learning methods. Methods like model-agnostic meta-learning (MAML) (Finn, Abbeel, & Levine, 2017) and task-agnostic meta-learning (TAML) (Jamal & Qi, 2019) are optimization based methods. A common property of these few shot learning methods is that small mini-batches over multiple tasks are sampled to train the model iteratively. This cross-task training procedure enables fast fine-tuning of the model and improves its generalization performance, which assures the model's effectiveness in small data application scenarios. Among these few shot learning methods, relation
network (Sung et al., 2018) employs a network module to learn the
metric for sample difference evaluation, which is called relation module.
Before the relation module, a feature module is used to extract the
features of the input samples. Considering the excellent performance of
relation network, its two module architecture has been borrowed to
build MMNet in this study.
3. Multi-view and Multi-level network
As discussed in Section 2.1, taking all the samples into domain adaptation equally might lead to important information loss. The
domain specific information carried by the samples not appropriate for
transfer will get suppressed to fulfill domain adaptation between the
source and the target domain. In order to retain as much effective in­
formation as possible, both the common features cross domains and
domain specific features should be simultaneously extracted. In addi­
tion, few shot learning related mechanism should be incorporated to
deal with the data paucity issue in fault diagnosis. Therefore, a novel
solution MMNet is developed which could learn multi-view features
with multi-level classification.
3.1. Architecture of MMNet
Within a domain adaptation deep network, all the involved network
weights are adjusted toward improving the classification performance of
the network. Therefore, the contribution of the samples which are
inappropriate for domain adaptation will be diminished. Only the fea­
tures of the samples that could benefit the domain alignment between
the source and the target domain will be effectively extracted. To extract
both cross domain common features and domain specific features, two
isolated network channels for feature extraction are designed in MMNet.
Fig. 2 gives the detailed architecture of MMNet. The structure of MMNet is shown in Fig. 2(a), and Fig. 2(b) gives the notations of the different channels in the network.
The overall architecture of MMNet borrows the module arrangement
from relation network (Sung et al., 2018). As shown in Fig. 2(a), MMNet
has two modules which are denoted as FeatureNet and RelationNet
respectively. FeatureNet extracts the features of the input samples and
RelationNet computes the relation between the samples. Each module
contains two branches indicated as source branch and target branch,
which process the input samples from the source and the target domain
respectively. In the FeatureNet module, the upper two feature extraction
channels form the source branch which extracts the feature of the source
domain samples. The lower two feature extraction channels form the
target branch which extracts the feature of the target domain samples.
The source and target branches share the same weights. The cross
domain common feature channel aims at extracting the common fea­
tures cross domains via domain adaptation, while the domain specific
feature channel extracts the domain specific discriminant features
facilitating both fault classification and domain classification. The cor­
responding channel notations are given in Fig. 2(b). The two branches in
the RelationNet module are also weight shared.
To obtain common features cross the source and the target domains,
MK-MMD based domain adaptation is employed. It has been indicated in
(Long et al., 2015) that with the increase of the network depth the
features learned over the layers transition from general to specific. The specific features of one domain are more difficult to transfer to another domain than the general features. Therefore, the MK-MMD loss is imposed on three layers of MMNet as shown in Fig. 2(a). In the FeatureNet module, the MK-MMD loss is imposed on the highest convolutional layer. In the RelationNet module, it is imposed on the two highest fully connected layers, excluding the output layer. To obtain domain
specific features, domain level classification and fault level classification
have both been incorporated. Domain level classification is performed
based on the features extracted by the domain specific feature channels
in the FeatureNet module. The domain specific feature channel aims at
boosting both domain classification and fault classification, which could
thus learn the features benefiting classification in specific domain.
The details of the network channels are given in Fig. 3. The two
feature learning channels in both the source and the target branch of the
FeatureNet have the same structure settings. In each channel, there are
Fig. 2. Architecture of MMNet. (a) MMNet structure (b) Details of network branches in MMNet.
three convolutional layers each followed by an average pooling layer. In
all the three convolutional layers, 20 feature maps are adopted and the
kernel size of each feature map is 3 × 1. The pooling size of the average
pooling layer is 2. In the source branch, based on the features learned by
the domain specific feature channel, a flatten layer with dimension of
5120 and a fully connected layer are used for domain classification. Here
the domain classification is a binary classification problem. The samples
from the source domain are labeled with 1 and the samples from the
target domain are labeled with 0. The upper channel in the RelationNet
module calculates the similarity between the concatenated features and
implements fault classification as shown in Fig. 2(a). The lower channel
in the RelationNet module shares the same structure with the upper
channel which only participates in the domain adaptation calculation. In
both RelationNet channels, two convolutional layers, one flatten layer
and two fully connected layers have been employed. The convolutional
kernel width is 3 × 1 and the average pooling size is 4. The dimension of
the flatten layer and the two fully connected layers is 1280, 512 and 256
respectively. The computation and optimization details are given in the
following section.
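The per-channel structures in Fig. 3 can be sketched in PyTorch as follows. The layer counts, feature map numbers, kernel widths and pooling sizes follow the description above; the use of batch normalization, the padding, and a lazily sized first fully connected layer (to absorb the flatten dimension) are our assumptions rather than settings confirmed by the paper.

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch, pool):
    # 3 x 1 convolution followed by average pooling, per Fig. 3;
    # batch normalization and padding are our assumptions
    return nn.Sequential(
        nn.Conv1d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm1d(out_ch),
        nn.ReLU(),
        nn.AvgPool1d(pool),
    )

class FeatureChannel(nn.Module):
    """One feature extraction channel of FeatureNet: three convolutional
    layers with 20 feature maps each and average pooling of size 2."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(conv_block(1, 20, 2),
                                 conv_block(20, 20, 2),
                                 conv_block(20, 20, 2))

    def forward(self, x):            # x: (batch, 1, 1024) vibration samples
        return self.net(x)

class RelationChannel(nn.Module):
    """One RelationNet channel: two convolutional layers with average
    pooling of size 4, then fully connected layers of 512 and 256 units
    ending in a scalar relation score."""
    def __init__(self):
        super().__init__()
        # 40 input maps = concatenated template (20) + query (20) features
        self.conv = nn.Sequential(conv_block(40, 20, 4),
                                  conv_block(20, 20, 4))
        self.fc = nn.Sequential(nn.Flatten(),
                                nn.LazyLinear(512), nn.ReLU(),
                                nn.Linear(512, 256), nn.ReLU(),
                                nn.Linear(256, 1))

    def forward(self, pair):         # pair: concatenated feature maps
        return self.fc(self.conv(pair))
```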
3.2. Optimization of MMNet
The training of MMNet has adopted the episode based training
strategy in few shot learning methods. The training set is constructed by
the samples from both the source and the target domain. The part from
the source domain is labeled data which aims for fault classification
training. The part from the target domain is unlabeled data which aims
for domain adaptation. Both parts are used for domain classification
training. In episode based training, an experiment mechanism called
k-way m-shot setting is used. Here k-way means the number of classes
involved in each episode and m-shot indicates the number of labeled
samples as template for comparison from each category. Specifically, in
each episode a mini-batch is randomly selected from the source domain
dataset as the template set. The size of the template set is k × m in a
k-way m-shot experiment setting. A fraction of the remaining dataset is
used as the query set. In each episode, the features of the m template
samples from each category are extracted by the FeatureNet module
which can be denoted as $\{x_i^{s,t}\}, i = 1, 2, \cdots, m$, where $s$ means the samples come from the source domain dataset and $t$ indicates that the samples serve as templates. The query set samples are also fed to the FeatureNet module to extract their feature representations. The query set can be represented as $\{x_i^{s,q}\}, i = 1, 2, \cdots, n$, where $n$ is the number of query samples used for training from each class. These two parts of data are the input to the domain specific feature channel in the source branch of FeatureNet as shown in Fig. 2(a). For the lower target branch, the same set of template samples is used. The query set comes from the target domain and can be denoted as $\{x_i^{t,q}\}, i = 1, 2, \cdots, n$. The numbers of query samples from the source and the target domain are the same.
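As an illustration of this episode construction, a sketch of the k-way m-shot sampling is given below; the function and variable names are hypothetical, and the per-class query count is a parameter.

```python
import random
import torch

def sample_episode(src_x, src_y, tgt_x, k=4, m=5, n_query=25):
    """Build one k-way m-shot episode: a labeled template set and query
    set from the source domain plus an unlabeled query set from the
    target domain. All names here are illustrative."""
    classes = random.sample(sorted(set(src_y.tolist())), k)
    templates, s_query, s_labels = [], [], []
    for c in classes:
        idx = torch.nonzero(src_y == c).squeeze(1)
        idx = idx[torch.randperm(len(idx))]
        templates.append(src_x[idx[:m]])             # m templates per class
        s_query.append(src_x[idx[m:m + n_query]])    # labeled source queries
        s_labels += [c] * n_query
    t_idx = torch.randperm(len(tgt_x))[:k * n_query] # unlabeled target queries
    return (torch.stack(templates),                  # (k, m, 1, 1024)
            torch.cat(s_query), torch.tensor(s_labels), tgt_x[t_idx])
```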
For each branch in the FeatureNet module, all the template samples
from the source domain and two query samples respectively from the
source and the target domain are fed to the FeatureNet module sepa­
rately during each episode to obtain their corresponding feature vectors.
When the number of the template samples m is larger than 1, the sum of
their obtained feature vectors is used as the template feature vector. The
query feature vector is obtained from the query sample. Suppose the
corresponding feature vectors of $\{x_i^{s,t}\}, i = 1, 2, \cdots, m$, $x_i^{s,q}$ and $x_i^{t,q}$ extracted by the FeatureNet module in one episode are $\{f_i^{s,t}\}, i = 1, 2, \cdots, m$, $f_i^{s,q}$ and $f_i^{t,q}$ respectively, the final template feature vector can be obtained by summing up the feature vectors of all the template samples as

$$f^{s,t} = \sum_{i=1}^{m} f_i^{s,t}. \tag{5}$$
For each category of machine fault, a template vector will be
computed during each episode. After the FeatureNet module, the tem­
plate feature vector and the query feature vector are concatenated with
each other, which form the input to the following RelationNet module as
shown in Fig. 2(a). During the training stage, one source domain and one
target domain query sample will be fed to the MMNet each time along
Fig. 3. Network structure details of the network channels in MMNet.
with the template samples. With the RelationNet module, the similarity
between the query sample and the template of each category is calcu­
lated and a relation score for the source domain query sample will be
obtained as $r_c(f_i^{s,q}, f^{s,t})$, where $c$ is the class index. Based on this, the Softmax function is employed to implement the machine health condition classification as

$$p(y_i^{s,q} = c) = \frac{\exp\!\left(r_c(f_i^{s,q}, f^{s,t})\right)}{\sum_{c'=1}^{C} \exp\!\left(r_{c'}(f_i^{s,q}, f^{s,t})\right)}, \tag{6}$$

where $p(y_i^{s,q} = c)$ is the probability of the $i$th query sample from the source domain belonging to class $c$. The query samples from the target
domain are specifically used for domain adaptation and no labels are
provided for them, so the classification of the target domain query
sample is not conducted as shown in Fig. 2(a).
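A sketch of this relation-score classification (Eqs. (5) and (6)) is given below, assuming the FeatureNet and RelationNet channels from the earlier sketch; the tensor layout is our assumption.

```python
import torch
import torch.nn.functional as F

def classify_queries(feat_net, rel_net, templates, queries):
    """Relation-score classification, Eqs. (5)-(6). templates: (k, m, 1, L)
    source templates; queries: (n, 1, L). Returns predicted episode-class
    indices and class probabilities."""
    k, m = templates.shape[:2]
    tpl = feat_net(templates.flatten(0, 1))                  # (k*m, 20, L')
    tpl = tpl.view(k, m, *tpl.shape[1:]).sum(dim=1)          # Eq. (5): (k, 20, L')
    qry = feat_net(queries)                                  # (n, 20, L')
    # Concatenate every (template, query) pair along the channel axis
    pairs = torch.cat([tpl.unsqueeze(0).expand(len(qry), -1, -1, -1),
                       qry.unsqueeze(1).expand(-1, k, -1, -1)], dim=2)
    scores = rel_net(pairs.flatten(0, 1)).view(len(qry), k)  # relation scores r_c
    probs = F.softmax(scores, dim=1)                         # Eq. (6)
    return probs.argmax(dim=1), probs
```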
To optimize MMNet, three parts of loss are combined to train MMNet
including the machine fault classification loss, domain classification loss
and domain adaptation loss. The fault classification loss is calculated
based on the relation score, so it is termed the relation loss for simplicity
as shown in Fig. 2(a). The domain classification loss (domain loss for
short) further includes two parts, i.e. the domain classification loss for
the query sample from the source domain and the target domain
respectively. The relation loss is denoted as $\mathcal{L}_r$ and defined by the cross entropy loss as

$$\mathcal{L}_r = \sum_{i=1}^{n_{bs}} J(x_i^{s,q}, y_i^{s,q} \mid \theta) = -\sum_{i=1}^{n_{bs}} y_i^{true} \log y_i^{s,q}, \tag{7}$$

where $n_{bs}$ is the number of source domain query samples in one training episode, $\theta$ represents the parameters of the network, $y_i^{s,q}$ is the estimated fault label and $y_i^{true}$ is the true fault label.
The two parts of the domain loss are respectively denoted as $\mathcal{L}_{ds}$ and $\mathcal{L}_{dt}$ for the source and the target domain query samples; they also use the cross entropy loss and are formulated as

$$\mathcal{L}_{ds} = \sum_{i=1}^{n_{bs}} J(x_i^{s,q}, d_i^{s,q} \mid \theta) = -\sum_{i=1}^{n_{bs}} d_i^{true} \log d_i^{s,q} \tag{8}$$

and

$$\mathcal{L}_{dt} = \sum_{i=1}^{n_{bt}} J(x_i^{t,q}, d_i^{t,q} \mid \theta) = -\sum_{i=1}^{n_{bt}} d_i^{true} \log d_i^{t,q}, \tag{9}$$

where $n_{bs}$ and $n_{bt}$ are the numbers of query samples from the source and the target domain respectively, $d_i^{s,q}$ and $d_i^{t,q}$ are the estimated domain labels of the query samples, and $d_i^{true}$ is the true domain label. If the query sample comes from the source domain, $d_i^{true} = 1$; otherwise $d_i^{true} = 0$.
The domain adaptation loss is evaluated based on MK-MMD as dis­
cussed in Section 2.2, which is denoted as MK-MMD loss in Fig. 2(a) and
calculated as
$$\mathcal{L}_{MK\text{-}MMD} = d^2_{\mathcal{H}_k}(X_s, X_t), \tag{10}$$

where $X_s = \{x_i^{s,q}\}, i = 1, 2, \cdots, n_{bs}$ and $X_t = \{x_i^{t,q}\}, i = 1, 2, \cdots, n_{bt}$. An unbiased estimate of MK-MMD is adopted to calculate $d^2_{\mathcal{H}_k}(X_s, X_t)$ as in (Long et al., 2015), which is formulated as

$$d^2_{\mathcal{H}_k}(X_s, X_t) = \frac{2}{n_{bs}} \sum_{i=1}^{n_{bs}/2} g_k(z_i), \tag{11}$$

where $z_i$ is a quad-tuple defined as $z_i \triangleq (x_{2i-1}^{s,q}, x_{2i}^{s,q}, x_{2i-1}^{t,q}, x_{2i}^{t,q})$, and $g_k(z_i)$ is calculated as

$$g_k(z_i) \triangleq k(x_{2i-1}^{s,q}, x_{2i}^{s,q}) + k(x_{2i-1}^{t,q}, x_{2i}^{t,q}) - k(x_{2i-1}^{s,q}, x_{2i}^{t,q}) - k(x_{2i}^{s,q}, x_{2i-1}^{t,q}), \tag{12}$$
where the kernel function k is defined in Eq. (4) which is a weighted
combination of multiple Gaussian kernels. The weight of kernel u
denoted as βu was obtained by the same method as in (Long et al., 2015)
by reducing the kernel optimization to a quadratic program (QP). The
MK-MMD loss is calculated on three layers, i.e. the highest convolutional
layer in the FeatureNet module and two fully connected layers in the
RelationNet module.
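A sketch of this linear-time unbiased estimate (Eqs. (11)-(12)) follows; the `kernel` argument is assumed to evaluate the multi-kernel of Eq. (4) on row-aligned pairs, e.g. `lambda a, b: torch.exp(-((a - b) ** 2).sum(1) / gamma)` for one Gaussian kernel.

```python
import torch

def mk_mmd_linear(xs, xt, kernel):
    """Linear-time unbiased MK-MMD estimate, Eqs. (11)-(12). `kernel`
    computes the multi-kernel value for row-aligned pairs of samples."""
    n = min(len(xs), len(xt)) // 2 * 2            # even number of samples
    xs, xt = xs[:n], xt[:n]
    s1, s2 = xs[0::2], xs[1::2]                   # x^{s,q}_{2i-1}, x^{s,q}_{2i}
    t1, t2 = xt[0::2], xt[1::2]
    g = (kernel(s1, s2) + kernel(t1, t2)
         - kernel(s1, t2) - kernel(s2, t1))       # g_k(z_i), Eq. (12)
    return 2.0 * g.sum() / n                      # Eq. (11)
```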
Combining the relation loss, the MK-MMD loss and the domain loss,
the overall loss function can be formulated as

$$\mathcal{L} = \mathcal{L}_r + \mathcal{L}_{MK\text{-}MMD} + \mathcal{L}_{ds} + \mathcal{L}_{dt}. \tag{13}$$

In addition, to treat the loss terms in Eq. (13) with different importance, trade-off parameters can be incorporated. As discussed in Section 3.1, there are three parts of the MK-MMD loss, respectively imposed on three layers, which are denoted as $\mathcal{L}_{MK\text{-}MMD1}$, $\mathcal{L}_{MK\text{-}MMD2}$ and $\mathcal{L}_{MK\text{-}MMD3}$. Therefore, four trade-off parameters are incorporated and the weighted loss is written as

$$\mathcal{L} = \mathcal{L}_r + \lambda_1 \mathcal{L}_{MK\text{-}MMD1} + \lambda_2 \mathcal{L}_{MK\text{-}MMD2} + \lambda_3 \mathcal{L}_{MK\text{-}MMD3} + \lambda_4 (\mathcal{L}_{ds} + \mathcal{L}_{dt}), \tag{14}$$

where $\lambda_1$, $\lambda_2$, $\lambda_3$ and $\lambda_4$ are the tradeoff parameters. By minimizing the above loss as $\min_\theta \mathcal{L}$, MMNet can be trained. Adam has been adopted as the optimization method to train the network and optimize the network parameters $\theta$. The weights of the Gaussian kernels $\beta_u, u = 1, \cdots, m_u$ in MK-MMD are then optimized in an alternating way by QP. The details of the training process of MMNet are given in Table 1.
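The alternating optimization described above can be sketched as follows; the dictionary-style model output and the `episode` object are hypothetical interfaces, and the λ values shown are those of Table 3 for the 4-way 1-shot setting.

```python
import torch

def train_step(model, optimizer, episode, lambdas=(2.25, 1.25, 0.5, 0.1)):
    """One training episode of MMNet (sketch). `model` is assumed to
    return the relation loss, the three layer-wise MK-MMD losses and
    the two domain losses for the episode."""
    l1, l2, l3, l4 = lambdas
    out = model(episode)                       # forward pass over the episode
    loss = (out['relation']                    # L_r, Eq. (7)
            + l1 * out['mkmmd1'] + l2 * out['mkmmd2'] + l3 * out['mkmmd3']
            + l4 * (out['domain_s'] + out['domain_t']))      # Eq. (14)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                           # Adam with lr = 5e-4 in the paper
    return loss.item()

# optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)
```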
4. Experiment results and discussions
4.1. Datasets and experiment setting
Four datasets were employed to test the effectiveness of MMNet; the specifications of the datasets are given in Table 2. Among these four datasets, the first two were recorded in the laboratory with artificial faults, the third was collected in the laboratory with run-to-failure faults, and the last was collected from bearings used in practical applications. All the data are vibration signals collected by accelerometers from operating bearings. Four classes of health conditions are incorporated in these datasets, including normal condition (NC), inner race fault (IF), outer race fault (OF) and ball fault (BF). The test benches on which the four datasets were collected are illustrated in Fig. 4, where an illustration of the four types of health conditions is also given. The differences among the bearings lie in the bearing model, rotation speed, working load and sampling rate. Vibration signals from the same type of rotatory part with the same fault are expected to show similar characteristics, which makes it possible to transfer knowledge between different datasets.
Dataset A and B are from CWRU bearing dataset provided by Case
Western Reserve University (Center). The vibration data were collected
from a motor bearing experiment platform (Fig. 4(a)) with a sampling
frequency of 12 kHz. Artificial single point faults were made on bearings
and the corresponding vibration signals were collected in a laboratory environment. The diameter of the point fault was set as 0.0014 in. Datasets A and B were respectively collected under 0 HP and 3 HP motor loads. For each health condition, 101 samples are used in our study, each with 1024 data points. Therefore, there are 404 samples in total in each of datasets A and B.
Dataset C is from IMS bearing dataset, which is provided by the NSF
I/UCR Center for Intelligent Maintenance Systems (IMS) (Qiu, Lee, Lin,
& Yu, 2006). Four bearings were installed on a shaft rotating at a con­
stant speed of 2000 RPM. Accelerometers were installed on the bearing
housing to collect vibration signals. 6000 lbs of radial load was imposed
on the shaft. The sampling frequency was 20 kHz. There are also 404
samples used in dataset C in this study. The length of each sample is
1024 data points.
Dataset D comes from RL bearing dataset provided by Xi’an Jiaotong
University (Lei, 2017). Different from the previous three datasets, where the bearing faults were artificially produced in the laboratory, the RL bearing dataset was collected from railway locomotive (RL) rolling element bearings in practical use. An accelerometer was mounted on the outer race of the bearing to collect the vibration signal. A working load of 9800 N was adopted and the sampling rate was 12.8 kHz. The same four health conditions as in the previous three datasets are included. The number of samples and the sample length are also the same as in the other three datasets.
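As an illustration of how such samples can be prepared, the following sketch segments a raw vibration record into 1024-point samples; the segmentation scheme (non-overlapping windows) is our assumption, as the paper does not state it.

```python
import numpy as np

def segment_signal(signal, length=1024, n_samples=101):
    """Cut a 1-D vibration record into fixed-length samples
    (non-overlapping windows assumed)."""
    usable = (len(signal) // length) * length
    windows = signal[:usable].reshape(-1, length)
    return windows[:n_samples]          # 101 samples per health condition

# Example: 101 samples of 1024 points from a synthetic record
record = np.random.randn(200_000)
samples = segment_signal(record)        # shape (101, 1024)
```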
4.2. MMNet performance and comparisons
MMNet was implemented in Python with PyTorch. All the experi­
ments were performed on a PC equipped with a 3.2 GHz Intel I7 CPU and
a TITAN Xp GPU.
4.2.1. Experiment settings in MMNet
Based on the four datasets detailed in Section 4.1, three transfer tasks
have been used to validate the efficiency of MMNet, including transfer
task A → D, B → D and C → D. The bearing faults of datasets A, B and C
were generated in laboratory and those of dataset D were made during
practical application. Therefore, datasets A, B and C are used as the
source datasets and D is adopted as the target dataset to implement
knowledge transfer from laboratory data to practical data.
Episode based training in few shot learning is employed to efficiently
learn knowledge with small amount of samples. Specifically, three few
shot learning scenarios have been adopted, including 4-way 1-shot, 4-
way 5-shot and 4-way 10-shot. In each episode, one template set from
the source domain and two query sets respectively from the source and
the target domain are used for training. The query set from the source
domain has labels and is used for domain classification and fault clas­
sification. The query set from the target domain is not labeled which is
used for domain classification and domain adaptation. In the source
branch of MMNet, the category of the query sample is determined by the largest of the obtained relation scores.
In one episode of a k-way m-shot training, k classes each with m
samples randomly selected are used as the template set, and a fraction of
the remainder data are taken as the query set. In each episode of the 4-
way 1-shot experiments, one example from each class of the source
dataset is randomly selected to form the template set and 29 random
examples are respectively selected from the source and the target dataset
as the query set. For the upper source branch of MMNet, both the tem­
plate set and query set are selected from the source dataset. For the
bottom target branch, same template set as the source branch is adopted.
The query set is selected from the target dataset and no label information
is required. In the 4-way 1-shot experiments, the total number of ex­
amples used for training is 1 × 4 + 29 × 4 + 29 × 4 = 236. In the 4-way
5-shot experiments, 5 random examples from the source dataset form
the template set and 25 examples respectively from the source and the
target dataset construct the query set. The total number of examples in
each episode is 5 × 4 + 25 × 4 + 25 × 4 = 220. Similarly, in the 4-way
10-shot setting, the total number of examples in each episode is 10 × 4 +
20 × 4 + 20 × 4 = 200. All the labeled data from the source domain and
200 unlabeled examples from the target domain have been used to
generate the training set in each episode. The remaining 204 samples (51 × 4 = 204) from the target domain are used for testing.
Table 1
Training process of MMNet.
Table 2
Dataset specifications.

| Dataset | Bearing specs | Health conditions | Number of samples | Operation configuration |
|---------|---------------|-------------------|-------------------|--------------------------|
| A | SKF6205 | NC, IF, OF, BF | 4 × 101 | 0 HP, 1797 r/min |
| B | SKF6205 | NC, IF, OF, BF | 4 × 101 | 3 HP, 1730 r/min |
| C | ZA-2115 | NC, IF, OF, BF | 4 × 101 | 6000 lbs, 2000 r/min |
| D | 552732QT | NC, IF, OF, BF | 4 × 101 | 9800 N, 500 r/min |
4.2.2. Parameter settings in MMNet
Adam is adopted to optimize MMNet. The number of training episodes is set as 10,000 and the learning rate is $5 \times 10^{-4}$. The tradeoff parameters λ1, λ2 and λ3 of the three MK-MMD losses and the tradeoff parameter λ4 of the domain loss are given in Table 3. It has been discovered in previous research that from the shallower layers to the deeper layers of a convolutional neural network, the learned features turn from general to specific. General features are easier to transfer across domains than specific ones. Therefore, the transferability of the features decreases as the network depth increases. Larger MK-MMD tradeoff parameters should be selected for the lower layers and smaller ones for the higher layers to allow for task-specific tuning.
To verify the above statement, grid search experiments have been
conducted to search for the optimal tradeoff parameters in an exhaustive
manner. The details of the parameter selection procedures are given in
Table 4. The MK-MMD tradeoff parameters are selected within the range
of [0.1, 5] with an increment of 0.05. In each experiment scenario, 10
examples from the test set (query set) are randomly separated as a validation set for parameter selection. Considering the high computational cost, no cross validation procedure is used. Experiment results have shown that MMNet failed to obtain satisfactory performance when the three parameters took identical values. Fault classification accuracy
around 83 % was obtained in these experiments. In some experiments,
the network even failed to converge. Similar experiment results were
observed when the value of the parameters is in increasing order from λ1
to λ3. When the parameters take random order (neither monotone
increasing nor monotone decreasing), some good results have been ob­
tained. Better classification performance has been achieved when the
tradeoff parameters are in decreasing order. The optimal values of the three MK-MMD loss tradeoff parameters are selected based on the grid search results, as shown in Table 3. The parameter selection results also indicate that the model is quite robust to parameter variation, with a mean accuracy of 89.86 % and a standard deviation of 6.02 %.
During the search for the three MK-MMD loss parameters, the domain loss parameter is fixed as 0.1 to reduce computational cost, a value which has shown relatively good performance throughout the experiments. After the three MK-MMD loss tradeoff parameters have been selected, they are fixed to further select the domain loss tradeoff parameter λ4. Experiments with λ4 from {0.001, 0.01, 0.1, 1, 10, 100} have been performed. Based on the experiment results, 0.1 is selected.
In each domain adaptation operation with MK-MMD, 5 Gaussian kernels have been adopted. The base Gaussian kernel bandwidth $\gamma$ is set as the median of the pairwise distances of the training samples from both the source and the target domain. The bandwidths of the $m_u$ Gaussian kernels are obtained by varying the bandwidth between $2^{-\lfloor m_u/2 \rfloor}\gamma$ and $2^{\lfloor m_u/2 \rfloor}\gamma$ with a scaling factor of 2, where $\lfloor \cdot \rfloor$ denotes integer division.
4.2.3. Performance of MMNet and comparison with other methods
To verify the performance of MMNet, the three transfer tasks A → D, B → D and C → D discussed in Section 4.2.1 have been carried out. For each transfer task, three few shot learning experiment settings are tested. The results are reported in Table 5. It can be seen that excellent fault classification performance has been obtained on the three transfer tasks. As the number of examples used as the template set increases, the performance of MMNet improves. The average fault classification accuracy is above 99 %, which is superior transfer performance for bearing fault diagnosis.
To further validate the effectiveness of MMNet, extensive compari­
son experiments have been conducted. Multiple state-of-the-art transfer
learning methods have been included for comparison, including Trans­
fer Component Analysis (TCA) (Pan, Tsang, Kwok, & Yang, 2011), Deep
Domain Confusion (DDC) (Tzeng, Hoffman, Zhang, Saenko, & Darrell,
2014), modified Deep Adaptation Networks (DAN) (Long et al., 2015),
Feature-based transfer neural network (FTNN) (Yang et al., 2019), G-
ResNet (Yang et al., 2020), P-ResNet (Yang et al., 2020) and TrResNet
(Yang et al., 2020). In addition, Convolutional Neural Network (CNN)
has been incorporated as a baseline method for comparison.

Fig. 4. Test bench of the CWRU [23], IMS [31] and RL [24] bearing datasets and the corresponding health condition illustration [6].

Table 3
Tradeoff parameters of the MK-MMD loss and domain loss.

| Experiment setting | λ1 | λ2 | λ3 | λ4 |
|---|---|---|---|---|
| 4-way 1-shot | 2.25 | 1.25 | 0.5 | 0.1 |
| 4-way 5-shot | 1.0 | 0.5 | 0.2 | 0.1 |
| 4-way 10-shot | 2.0 | 1.0 | 0.75 | 0.1 |

To make fair comparisons, we use publicly available source code provided by the
authors of the above methods for experiments. When the code of the
method is not publicly available, the results are borrowed from the
original papers directly, given the same transfer task. When neither the source code nor the corresponding experiment results are available in the original publication, a "/" mark is used in Table 6, which reports the comparison results.
In the baseline CNN method, no transfer learning related tricks have
been applied. The labeled data from the source dataset form the training
set and the unlabeled data from the target dataset construct the testing set.

Table 4
Tradeoff parameter selection procedures.

Table 5
Classification accuracy (%) of MMNet on different transfer tasks.

| Experiment setting | A → D | B → D | C → D | Avg |
|---|---|---|---|---|
| 4-way 1-shot | 99.62 | 98.75 | 99.25 | 99.21 |
| 4-way 5-shot | 99.64 | 99.90 | 99.70 | 99.75 |
| 4-way 10-shot | 99.95 | 99.98 | 99.72 | 99.88 |

To achieve an optimal performance of CNN for comparison, various
architectures of CNNs have been evaluated. Specifically, CNNs with
different depth have been tested, including CNN of five convolutional
layers, three convolutional layers and two convolutional layers. In each
CNN, one flatten layer and one fully connected layer are added following
the convolutional layers. Cross-entropy is used as the loss function.
Softmax is applied at the output layer for classification. Meanwhile, our
experiments have shown that average pooling could obtain better per­
formance than max pooling. Therefore, average pooling has been
adopted in these baseline CNNs. In the other compared CNN based so­
lutions, average pooling is also adopted instead of max pooling to ensure
fair comparison. It has been shown that CNN with two convolutional
layers and two fully connected layers has obtained the best fault diag­
nosis performance, the results of which are given in Table 6.
TCA is a classic transfer learning method, which projects the source
data and the target data into a new subspace where their data distri­
butions are closer than in the original data distribution space. In the
implementation of TCA, the regularization tradeoff parameter is
selected from {0.01, 0.1, 1, 10, 100} and the subspace dimension is
selected from {2, 4, 8, 16, 32, 64, 128, 256} via experiments. Based on
the representations of all the samples in the transformed subspace, a
support vector machine (SVM) classifier is trained for fault
classification.
The baseline CNN architecture selected via experiments has been
adopted in DDC. Meanwhile, MK-MMD based domain adaptation is used
in the layer before the softmax classification layer. For the compared
DAN method, the same CNN structure is used and domain adaption with
MK-MMD is applied in the flatten layer and the last fully connected layer
before the output layer. The specifications of the adopted CNN structure
in the baseline CNN, DDC and DAN are given in Table 7, where “/”
means not applicable. In both DDC and DAN, all the labeled data of the
source dataset and part of the unlabeled data of the target dataset are
used for model training. Similar dataset partition setting is adopted as
MMNet. The experiment results of FTNN are borrowed from its original
publication (Yang et al., 2019).
In G-ResNet, P-ResNet and TrResNet, eight ResNet blocks are used
to construct the network backbone structure. In G-ResNet, Gaussian
kernel based MMD is adopted for domain adaptation. In P-ResNet and
TrResNet, polynomial kernel based MMD is used. In addition, pseudo
label learning is applied in TrResNet. The reported results of the above
three methods are borrowed from (Yang et al., 2020). The detailed
model configurations can be found in (Yang et al., 2020). In the exper­
iments of these three methods, both dataset A and B from our experi­
ment setting are used as the source domain, and dataset D is treated as
the target domain. Therefore, the results of transfer tasks A → D and B →
D are the same as reported in Table 6.
The raw vibration data are used as the input to CNN, DDC, DAN,
FTNN, G-ResNet, P-ResNet, TrResNet and MMNet. To obtain better fault
diagnosis performance of TCA, frequency spectrum instead of vibration
data is adopted as the input for TCA.
In Table 6, the best results have been highlighted in bold. It could be
seen from these results that neural network based solutions have ob­
tained significantly better performance than the traditional transfer
learning method TCA. The performance of the baseline CNN with no
transfer learning component involved is relatively poor. Its best per­
formance on the three transfer tasks is 57.67 %. TrResNet, published recently in 2020, ranked second best. Among all the compared methods, our MMNet has obtained the best fault classification accuracy. The classification accuracy on all three transfer tasks is above 99 %, which is quite excellent performance. The smallest accuracy increase over the second best result reaches 10.94 %.
The t-SNE (t-distributed stochastic neighbor embedding) method is employed to visualize the transfer features learned by the compared methods. The visualization results are given in Fig. 5. The intermediate feature representations of G-ResNet, P-ResNet and TrResNet are not available, so their corresponding visualization results are not provided in Fig. 5. The visualization is conducted on the transfer task A → D. In Fig. 5, the notation "S-" means the corresponding samples come from the source domain and "T-" means the samples come from the target domain. The visual illustration in Fig. 5 includes the frequency spectrum analysis, TCA, CNN, DDC, DAN and MMNet. The results in Fig. 5 show that the feature distribution difference between the source and target domains is quite obvious for the frequency spectrum, TCA, CNN and DDC. Among these methods, the features obtained by TCA are aggregated within one class from the same domain but still scattered for the same class from different domains, which explains the relatively better performance of CNN, DDC and DAN. The domain discrepancy of the features learned by DAN and MMNet is obviously reduced compared with the former four methods. For both DAN and MMNet, samples coming from the same class are well aggregated even though they are from different domains. Comparing MMNet with DAN, the distance among different classes obtained by MMNet is obviously larger than that of DAN. Meanwhile, the samples from the same class are more aggregated in MMNet and relatively more scattered in DAN. The well-formed sample distribution structure obtained by MMNet explains the excellent classification performance of the method.
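For reference, a minimal sketch of this t-SNE visualization (using scikit-learn and matplotlib) is shown below; the feature and label arrays are placeholders for the learned intermediate features.

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def plot_tsne(src_feat, tgt_feat, src_y, tgt_y):
    """2-D t-SNE embedding of source ("S-") and target ("T-") features."""
    emb = TSNE(n_components=2, init='pca',
               random_state=0).fit_transform(np.concatenate([src_feat, tgt_feat]))
    n_s = len(src_feat)
    plt.scatter(emb[:n_s, 0], emb[:n_s, 1], c=src_y, marker='o', label='S')
    plt.scatter(emb[n_s:, 0], emb[n_s:, 1], c=tgt_y, marker='^', label='T')
    plt.legend()
    plt.show()
```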
To take a further look into the classification performance compari­
son, the confusion matrices of the compared methods are visualized and
reported in Fig. 6. The confusion matrices of TCA, CNN, DDC, FTNN,
DAN and MMNet are illustrated. From the listed results, it could be seen
that a large quantity of samples have been mistakenly classified with
both TCA and CNN. The results of DDC, FTNN and DAN are better than
those of TCA and CNN. The performance of MMNet is obviously superior
to all the other compared methods, which has validated the efficiency of
MMNet.
Table 6
Accuracy comparison results (%) of different transfer learning methods for fault diagnosis.

| Method | Input | A → D | B → D | C → D |
|--------|-------|-------|-------|-------|
| CNN | Raw vibration | 57.67 | 53.17 | 53.96 |
| TCA | Frequency spectrum | 51.48 | 41.58 | 25.00 |
| DDC | Raw vibration | 80.84 | 77.80 | 81.22 |
| DAN | Raw vibration | 83.52 | 78.90 | 86.27 |
| FTNN | Raw vibration | 83.69 | 84.95 | / |
| G-ResNet | Raw vibration | 84.32 | 84.32 | / |
| P-ResNet | Raw vibration | 87.76 | 87.76 | / |
| TrResNet | Raw vibration | 88.27 | 88.27 | / |
| MMNet | Raw vibration | **99.21** | **99.75** | **99.88** |
Table 7
Specifications of the CNN structure in baseline CNN, DDC and DAN.

Layer    Operation         Convolutional kernel width   Number of channels   Output size
Input    /                 /                            /                    1024 × 1 × 1
C1       Convolution       3 × 1                        20                   1024 × 1 × 20
P1       AvgPooling        2 × 1                        /                    512 × 1 × 20
C2       Convolution       3 × 1                        20                   512 × 1 × 20
P2       AvgPooling        2 × 1                        /                    256 × 1 × 20
FC1      Flatten           /                            /                    5120 × 1
FC2      Fully connected   5120 × 256                   /                    256 × 1
Output   Fully connected   256 × 4                      /                    4 × 1
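Read as a 1-D PyTorch model, the structure in Table 7 might be sketched as follows. Table 7 fixes only the kernel sizes, channel counts and output sizes; the padding and the ReLU activations below are our assumptions, chosen to match the stated output sizes.

```python
# Sketch of the baseline CNN in Table 7 (PyTorch). padding=1 keeps the
# stated output lengths (1024 -> 1024 after C1); ReLU is an assumption.
import torch
import torch.nn as nn

class BaselineCNN(nn.Module):
    def __init__(self, num_classes: int = 4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, 20, kernel_size=3, padding=1),   # C1: 1024 x 20
            nn.ReLU(),
            nn.AvgPool1d(2),                              # P1: 512 x 20
            nn.Conv1d(20, 20, kernel_size=3, padding=1),  # C2: 512 x 20
            nn.ReLU(),
            nn.AvgPool1d(2),                              # P2: 256 x 20
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),                 # FC1: 256 * 20 = 5120
            nn.Linear(5120, 256),         # FC2
            nn.ReLU(),
            nn.Linear(256, num_classes),  # Output: 4 fault classes
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x))

# A vibration segment of length 1024 enters as (batch, 1, 1024).
logits = BaselineCNN()(torch.randn(8, 1, 1024))  # -> (8, 4)
```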
4.3. Ablation study

Several components contribute to the performance of MMNet; the three major ones are the double-channel feature extraction mechanism, the multiple-layer domain adaptation and the average pooling. An ablation study has been conducted to verify the effectiveness of each component.
Fig. 5. Visualization of the learned features with t-SNE. (a) Frequency spectrum feature (b) TCA (c) CNN (d) DDC (e) DAN (f) MMNet.
Fig. 6. Confusion matrices of the transfer results on dataset A → D. (a) TCA (b) CNN (c) DDC (d) FTNN (e) DAN (f) MMNet.

To test the necessity of the double-channel feature extraction mechanism, comparison experiments with only the common feature extraction channel were performed; the remaining components, such as multi-layer adaptation and average pooling, were kept the same.
The comparison results, averaged over three experimental settings (4-way 1-shot, 4-way 5-shot and 4-way 10-shot) on each transfer task, are reported in Fig. 7. Results obtained with only the cross domain common feature channel are indicated as “one channel” in Fig. 7, and results obtained with both the common feature channel and the domain specific feature channel are denoted as “double channel”. The highest accuracy of the one-channel setting is 98.15 % on transfer task A → D, whereas the corresponding double-channel result is 99.21 %. On all three transfer tasks, the double-channel setting of MMNet outperforms the one-channel setting, which verifies the effectiveness of the double-channel feature extraction mechanism.
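For illustration only, a minimal sketch of what the two settings differ in is given below. The channel internals and the concatenation-based fusion are our placeholders, not the exact layers of MMNet; the point is that the “one channel” variant drops the domain specific channel, while the same weight-shared module processes samples from both domains.

```python
# Sketch of the double-channel ablation (PyTorch). How MMNet fuses the two
# channel outputs is simplified to concatenation here; channel internals
# are placeholders, not the paper's exact layers.
import torch
import torch.nn as nn

def feature_channel() -> nn.Sequential:
    # Placeholder channel: Conv -> ReLU -> AvgPool, repeated twice.
    return nn.Sequential(
        nn.Conv1d(1, 20, 3, padding=1), nn.ReLU(), nn.AvgPool1d(2),
        nn.Conv1d(20, 20, 3, padding=1), nn.ReLU(), nn.AvgPool1d(2),
    )

class FeatureNet(nn.Module):
    def __init__(self, double_channel: bool = True):
        super().__init__()
        self.common = feature_channel()    # trained with MK-MMD adaptation
        self.specific = feature_channel() if double_channel else None

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats = [self.common(x)]
        if self.specific is not None:      # ablated in the "one channel" setting
            feats.append(self.specific(x))
        return torch.cat(feats, dim=1)     # multi-view feature

# Weight sharing across domains: the same module processes both batches.
net = FeatureNet(double_channel=True)
f_src, f_tgt = net(torch.randn(8, 1, 1024)), net(torch.randn(8, 1, 1024))
```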
Fig. 7. Comparison results with and without the domain discriminant feature extraction channel on three transfer tasks.
Fig. 8. Comparison of classification results with different numbers of Gaussian kernels used in MK-MMD domain adaptation on three transfer tasks. (a) Results on transfer task A → D (b) Results on transfer task B → D (c) Results on transfer task C → D.
One key factor influencing the performance of the multi-layer domain adaptation in MMNet is the number of Gaussian kernels used in MK-MMD. When this number reduces to 1, MK-MMD degenerates to MMD. To compare different numbers of kernels, experiments with 1, 3, 5, 7 and 9 kernels were performed on each transfer task for 10 runs; the results are illustrated in Fig. 8. When the number of kernels increases from 1 to 3 and from 3 to 5, a significant improvement in fault classification accuracy can be observed. When it changes from 5 to 7 and 9, the performance variation is relatively small, while the computational complexity of MMNet keeps growing with the number of kernels. Therefore, 5 kernels were used in our experiments.
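For reference, the MK-MMD loss varied in this ablation can be sketched as follows. The sketch assumes equal kernel weights (uniform βu) and bandwidths obtained by scaling a base bandwidth with powers of 2, as described in Section 2.2; it is a minimal estimator, not necessarily the exact implementation used in our experiments.

```python
# Minimal MK-MMD sketch (PyTorch): the squared MMD of Eq. (2), averaged
# over several Gaussian kernels whose bandwidths are a base bandwidth
# scaled by powers of 2. Equal kernel weights are assumed for simplicity.
import torch

def mk_mmd(xs: torch.Tensor, xt: torch.Tensor,
           num_kernels: int = 5, base_gamma: float = 1.0) -> torch.Tensor:
    x = torch.cat([xs, xt], dim=0)
    d2 = torch.cdist(x, x).pow(2)           # pairwise squared distances
    # Bandwidths gamma * 2^u centred around u = 0, e.g. gamma/4 .. 4*gamma
    # for 5 kernels, matching the 2^{±⌊ku/2⌋}γ scheme of Section 2.2.
    exps = range(-(num_kernels // 2), num_kernels // 2 + 1)
    k = sum(torch.exp(-d2 / (base_gamma * 2.0 ** u)) for u in exps) / num_kernels
    ns, nt = xs.size(0), xt.size(0)
    k_ss, k_tt, k_st = k[:ns, :ns], k[ns:, ns:], k[:ns, ns:]
    return k_ss.mean() + k_tt.mean() - 2.0 * k_st.mean()

# Usage: add the MK-MMD of source/target features to the training loss.
loss_da = mk_mmd(torch.randn(32, 256), torch.randn(32, 256), num_kernels=5)
```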
In addition, to test the efficiency of average pooling, comparison experiments against max pooling were conducted. The three convolutional layers in the FeatureNet module of MMNet use average pooling instead of max pooling to suppress noise within the vibration time sequence. We respectively replaced the average pooling in the first convolutional layer C1, in the first two layers C1 and C2, and in all three layers C1, C2 and C3 of the FeatureNet module. The experiments show that the advantage of average pooling is twofold: it accelerates convergence during training and improves classification accuracy. Compared with max pooling, the fault classification accuracy on the transfer tasks improved by up to more than 5 % in our experiments. Meanwhile, training MMNet with average pooling after all convolutional layers took about 2,000 episodes, whereas more than 30,000 episodes were required with max pooling. Average pooling thus greatly improves the training speed of MMNet.
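A sketch of how such an ablation can be set up is given below: the pooling operator of the first k convolutional blocks is swapped from average to max pooling. The channel widths and the ReLU activations are placeholders; only the pooling swap is the point.

```python
# Sketch of the pooling ablation: build FeatureNet variants in which the
# first k convolutional blocks use max pooling instead of average pooling.
import torch.nn as nn

def conv_block(in_ch: int, out_ch: int, use_max_pool: bool) -> nn.Sequential:
    pool = nn.MaxPool1d(2) if use_max_pool else nn.AvgPool1d(2)
    return nn.Sequential(
        nn.Conv1d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.ReLU(),
        pool,
    )

def feature_net(num_max_pool_blocks: int = 0) -> nn.Sequential:
    # num_max_pool_blocks = 1 swaps C1; = 2 swaps C1 and C2; = 3 swaps all.
    widths = [(1, 20), (20, 20), (20, 20)]   # placeholder channel widths
    return nn.Sequential(*[
        conv_block(i, o, use_max_pool=(b < num_max_pool_blocks))
        for b, (i, o) in enumerate(widths)
    ])
```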
4.4. Computational complexity comparison

Besides the above comparison of model performance, the computational complexity of the models has also been compared. Since training and inference times differ across hardware platforms, the model structure complexity and the number of trainable parameters are summarized and compared in Table 8, where models sharing the same backbone structure are listed in the same row. In MMNet, the weights are shared between the channels, so the complexity of only one channel needs to be considered. Table 8 shows that MMNet has the smallest number of trainable parameters among the compared models, only about one half to one quarter of the others. Although MMNet uses more convolutional layers than CNN/DDC/DAN and FTNN, its smaller convolutional kernels (compared with G-ResNet, P-ResNet and TrResNet) and narrower fully connected layers lead to a more concise structure. Therefore, MMNet has lower computational complexity than the other compared models.
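Trainable-parameter counts of the kind reported in Table 8 can be read directly off a framework's parameter list, as in the following sketch; note that the exact totals depend on details such as bias terms and layer widths, so this sketch will not necessarily reproduce the numbers in Table 8.

```python
# Counting trainable parameters of a PyTorch model (works for any nn.Module).
import torch.nn as nn

def count_trainable_parameters(model: nn.Module) -> int:
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# Example with the baseline CNN sketched after Table 7:
# print(count_trainable_parameters(BaselineCNN()))
```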
5. Conclusions

The existing deep transfer networks attempt to transfer all the features extracted from fault data across different domains. Considering that some features may only benefit classification within a specific domain without providing common information across domains, a neural network solution, MMNet, which separately treats the features appropriate and inappropriate for transfer, is developed. In MMNet, domain level classification and fault level classification are combined to extract domain specific discriminant features, while multi-layer MK-MMD based domain adaptation and fault level classification are combined to extract cross domain common features. A classic few shot learning structure, RelationNet, is employed as the backbone network, and a Siamese double-branch structure processes the samples from the source and the target domain simultaneously. The relation score based classification mechanism can perform fault diagnosis without labeled data from the target domain. Four datasets have been used to test MMNet, and the results verify its effectiveness: the transfer fault classification accuracy is significantly improved compared with other state-of-the-art transfer solutions for fault diagnosis, exceeding 99 % on all three transfer tasks in the experiments.
The outcome of this research has verified that the learned features differ in their competence across domains, and the multi-level classification mechanism enables an implicit discrimination between them. How to further, and even explicitly, evaluate the efficiency of different features for a specific domain remains a challenging problem. One promising direction is to incorporate a metric such as the Kullback-Leibler divergence to measure the similarity among features; it is also possible to learn a metric for feature evaluation and embed the metric learning module into the fault diagnosis scheme. Another promising direction is to include channel attention, self-attention and cross-attention mechanisms in the fault diagnosis network, so that the salient features of different domains can be treated separately. In addition, the core idea of MMNet can be directly applied to other classification applications, such as brain signal recognition across subjects, activity recognition across people, and image classification under different imaging conditions.
CRediT authorship contribution statement
Na Lu: Conceptualization, Funding acquisition, Methodology, Vali­
dation, Writing – review & editing. Zhiyan Cui: Investigation, Software.
Huiyang Hu: Data curation, Visualization. Tao Yin: Validation, Writing
– review & editing.
Declaration of Competing Interest
The authors declare that they have no known competing financial
interests or personal relationships that could have appeared to influence
the work reported in this paper.
Acknowledgement
This work is supported by the National Key R&D Program of China under grant 2018YFB1306100 and the National Natural Science Foundation of China under grant 61876147.
References
Case Western Reserve University Bearing Data Center. Retrieved from http://csegroups.case.edu/bearingdatacenter/home.
Che, C., Wang, H., Ni, X., & Fu, Q. (2020). Domain adaptive deep belief network for
rolling bearing fault diagnosis. Computers & Industrial Engineering, 143, Article
106427. https://doi.org/10.1016/j.cie.2020.106427
Table 8
Model computational complexity comparisons.

Model                        Number of convolution layers (size)   Number of fully connected layers (size)   Number of parameters
CNN/DDC/DAN                  2 × (3 × 1 × 20)                      2 (5120 × 256, 256 × 4)                   1,311,864
FTNN                         2 (5 × 1 × 20, 5 × 20 × 20)           2 (5941 × 256, 256 × 4)                   1,524,084
G-ResNet/P-ResNet/TrResNet   16 × (3 × 20 × 20)                    2 (6000 × 512, 512 × 4)                   3,093,248
MMNet                        5 × (3 × 1 × 20)                      3 (5120 × 2, 1280 × 512, 512 × 256)       796,972
Chen, Z., He, G., Li, J., Liao, Y., Gryllias, K., & Li, W. (2020). Domain adversarial transfer
network for cross-domain fault diagnosis of rotary machinery. IEEE Transactions on
Instrumentation and Measurement, 69(11), 8702–8712. https://doi.org/10.1109/
TIM.2020.2995441
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., & Li, F.-F. (2009, 20-25 June 2009).
ImageNet: A large-scale hierarchical image database. Paper presented at the IEEE
Conference on Computer Vision and Pattern Recognition.
Fan, H., Lin, L., Yang, F., Chu, P., Deng, G., Yu, S., . . . Ling, H. (2019, 15-20 June 2019).
LaSOT: A High-Quality Benchmark for Large-Scale Single Object Tracking. Paper
presented at the IEEE Conference on Computer Vision and Pattern Recognition.
Finn, C., Abbeel, P., & Levine, S. (2017). Model-agnostic meta-learning for fast adaptation of
deep networks. Paper presented at the International Conference on Machine Learning,
Sydney, NSW, Australia.
Gretton, A., Borgwardt, K. M., Rasch, M. J., Scholkopf, B., & Smola, A. (2012). A kernel
two-sample test. Journal of Machine Learning Research, 13, 723–773.
Guo, L., Lei, Y., Xing, S., Yan, T., & Li, N. (2019). Deep convolutional transfer learning
network: A new method for intelligent fault diagnosis of machines with unlabeled
data. IEEE Transactions on Industrial Electronics, 66(9), 7316–7325.
Jamal, M. A., & Qi, G.-J. (2019, 15-20 June 2019). Task Agnostic Meta-Learning for Few-
Shot Learning. Paper presented at the IEEE Conference on Computer Vision and
Pattern Recognition.
Jia, F., Lei, Y., Lu, N., & Xing, S. (2018). Deep normalized convolutional neural network
for imbalanced fault classification of machinery and its understanding via
visualization. Mechanical Systems and Signal Processing, 110, 349–367. https://doi.
org/10.1016/j.ymssp.2018.03.025
Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., & Li, F.-F. (2014, 23-28
June 2014). Large-Scale Video Classification with Convolutional Neural Networks. Paper
presented at the IEEE Conference on Computer Vision and Pattern Recognition.
Lei, Y. (2017). Intelligent fault diagnosis and remaining useful life prediction of rotating
machinery. Butterworth-Heinemann.
Lei, Y., Jia, F., Lin, J., Xing, S., & Ding, S. X. (2016). An intelligent fault diagnosis method
using unsupervised feature learning towards mechanical big data. IEEE Transactions
on Industrial Electronics, 63(5), 3137–3147. https://doi.org/10.1109/
TIE.2016.2519325
Li, J., Huang, R., He, G., Wang, S., Li, G., & Li, W. (2020). A deep adversarial transfer
learning network for machinery emerging fault detection. IEEE Sensors Journal, 20
(15), 8413–8422. https://doi.org/10.1109/JSEN.2020.2975286
Li, X., Zhang, W., & Ding, Q. (2018). A robust intelligent fault diagnosis method for
rolling element bearings based on deep distance metric learning. Neurocomputing,
310, 77–95.
Li, X., Zhang, W., Ding, Q., & Sun, J.-Q. (2019). Multi-Layer domain adaptation method
for rolling bearing fault diagnosis. Signal Processing, 157, 180–197. https://doi.org/
10.1016/j.sigpro.2018.12.005
Long, M., Cao, Y., Wang, J., & Jordan, M. (2015). Learning transferable features with
deep adaptation networks. Paper presented at the International Conference on Machine
Learning.
Lu, N., & Yin, T. (2021). Transferable common feature space mining for fault diagnosis
with imbalanced data. Mechanical Systems and Signal Processing, 156, Article 107645.
https://doi.org/10.1016/j.ymssp.2021.107645
Lu, W., Liang, B., Cheng, Y., Meng, D., Yang, J., & Zhang, T. (2017). Deep model based
domain adaptation for fault diagnosis. IEEE Transactions on Industrial Electronics, 64
(3), 2296–2305. https://doi.org/10.1109/TIE.2016.2627020
Pan, S. J., Tsang, I. W., Kwok, J. T., & Yang, Q. (2011). Domain adaptation via transfer
component analysis. IEEE Transactions on Neural Networks, 22(2), 199–210.
Qiu, H., Lee, J., Lin, J., & Yu, G. (2006). Wavelet filter-based weak signature detection
method and its application on rolling element bearing prognostics. Journal of Sound
and Vibration, 289(4), 1066–1090. https://doi.org/10.1016/j.jsv.2005.03.007
Shao, S., McAleer, S., Yan, R., & Baldi, P. (2019). Highly accurate machine fault diagnosis
using deep transfer learning. IEEE Transactions on Industrial Informatics, 15(4),
2446–2455. https://doi.org/10.1109/TII.2018.2864759
Snell, J., Swersky, K., & Zemel, R. (2017). Prototypical Networks for Few-shot Learning. Paper presented at the Advances in Neural Information Processing Systems.
Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P. H. S., & Hospedales, T. M. (2018, 18-23
June 2018). Learning to Compare: Relation Network for Few-Shot Learning. Paper
presented at the IEEE Conference on Computer Vision and Pattern Recognition.
Tzeng, E., Hoffman, J., Zhang, N., Saenko, K., & Darrell, T. (2014). Deep domain
confusion: Maximizing for domain invariance. arXiv preprint arXiv:1412.3474.
Vinyals, O., Blundell, C., Lillicrap, T., Kavukcuoglu, K., & Wierstra, D. (2016). Matching
networks for one shot learning. Paper presented at the International Conference on
Neural Information Processing Systems, Barcelona, Spain.
Wen, L., Gao, L., & Li, X. (2017). A new deep transfer learning based on sparse auto-
encoder for fault diagnosis. IEEE Transactions on Systems, Man, and Cybernetics:
Systems, 49(1), 136–144.
Wen, L., Gao, L., & Li, X. (2019). A new deep transfer learning based on sparse auto-
encoder for fault diagnosis. IEEE Transactions on Systems, Man, and Cybernetics:
Systems, 49(1), 136–144. https://doi.org/10.1109/TSMC.2017.2754287
Xu, G., Liu, M., Jiang, Z., Shen, W., & Huang, C. (2020). Online fault diagnosis method
based on transfer convolutional neural networks. IEEE Transactions on
Instrumentation and Measurement, 69(2), 509–520. https://doi.org/10.1109/
TIM.2019.2902003
Yang, B., Lei, Y., Jia, F., Li, N., & Du, Z. (2020). A polynomial kernel induced distance
metric to improve deep transfer learning for fault diagnosis of machines. IEEE
Transactions on Industrial Electronics, 67(11), 9747–9757. https://doi.org/10.1109/
TIE.2019.2953010
Yang, B., Lei, Y., Jia, F., & Xing, S. (2019). An intelligent fault diagnosis approach based
on transfer learning from laboratory bearings to locomotive bearings. Mechanical
Systems and Signal Processing, 122, 692–706. https://doi.org/10.1016/j.
ymssp.2018.12.051
Yu, H., Wang, K., Li, Y., & Zhao, W. (2019). Representation learning with class level
autoencoder for intelligent fault diagnosis. IEEE Signal Processing Letters, 26(10),
1476–1480. https://doi.org/10.1109/LSP.2019.2936310
Zhang, W., Li, X., Jia, X.-D., Ma, H., Luo, Z., & Li, X. (2020). Machinery fault diagnosis
with imbalanced data using deep generative adversarial networks. Measurement,
152, Article 107377. https://doi.org/10.1016/j.measurement.2019.107377
N. Lu et al.

More Related Content

Similar to 1-s2.0-S0957417422020759-main.pdf

Residual balanced attention network for real-time traffic scene semantic segm...
Residual balanced attention network for real-time traffic scene semantic segm...Residual balanced attention network for real-time traffic scene semantic segm...
Residual balanced attention network for real-time traffic scene semantic segm...
IJECEIAES
 
INVESTIGATING THE EFFECT OF BD-CRAFT TO TEXT DETECTION ALGORITHMS
INVESTIGATING THE EFFECT OF BD-CRAFT TO TEXT DETECTION ALGORITHMSINVESTIGATING THE EFFECT OF BD-CRAFT TO TEXT DETECTION ALGORITHMS
INVESTIGATING THE EFFECT OF BD-CRAFT TO TEXT DETECTION ALGORITHMS
ijaia
 
Investigating the Effect of BD-CRAFT to Text Detection Algorithms
Investigating the Effect of BD-CRAFT to Text Detection AlgorithmsInvestigating the Effect of BD-CRAFT to Text Detection Algorithms
Investigating the Effect of BD-CRAFT to Text Detection Algorithms
gerogepatton
 
FACE EXPRESSION RECOGNITION USING CONVOLUTION NEURAL NETWORK (CNN) MODELS
FACE EXPRESSION RECOGNITION USING CONVOLUTION NEURAL NETWORK (CNN) MODELS FACE EXPRESSION RECOGNITION USING CONVOLUTION NEURAL NETWORK (CNN) MODELS
FACE EXPRESSION RECOGNITION USING CONVOLUTION NEURAL NETWORK (CNN) MODELS
ijgca
 
A deep learning based stereo matching model for autonomous vehicle
A deep learning based stereo matching model for autonomous vehicleA deep learning based stereo matching model for autonomous vehicle
A deep learning based stereo matching model for autonomous vehicle
IAESIJAI
 
IRJET- Vanet Connection Performance Analysis using GPSR Protocol
IRJET- Vanet Connection Performance Analysis using GPSR ProtocolIRJET- Vanet Connection Performance Analysis using GPSR Protocol
IRJET- Vanet Connection Performance Analysis using GPSR Protocol
IRJET Journal
 
Intelligent black hole detection in mobile AdHoc networks
Intelligent black hole detection in mobile AdHoc networksIntelligent black hole detection in mobile AdHoc networks
Intelligent black hole detection in mobile AdHoc networks
IJECEIAES
 
Data Analysis In The Cloud
Data Analysis In The CloudData Analysis In The Cloud
Data Analysis In The Cloud
Monica Carter
 
An Integrated Inductive-Deductive Framework for Data Mapping in Wireless Sens...
An Integrated Inductive-Deductive Framework for Data Mapping in Wireless Sens...An Integrated Inductive-Deductive Framework for Data Mapping in Wireless Sens...
An Integrated Inductive-Deductive Framework for Data Mapping in Wireless Sens...
M H
 
IRJET- An Efficient VLSI Architecture for 3D-DWT using Lifting Scheme
IRJET- An Efficient VLSI Architecture for 3D-DWT using Lifting SchemeIRJET- An Efficient VLSI Architecture for 3D-DWT using Lifting Scheme
IRJET- An Efficient VLSI Architecture for 3D-DWT using Lifting Scheme
IRJET Journal
 
CAR DAMAGE DETECTION USING DEEP LEARNING
CAR DAMAGE DETECTION USING DEEP LEARNINGCAR DAMAGE DETECTION USING DEEP LEARNING
CAR DAMAGE DETECTION USING DEEP LEARNING
IRJET Journal
 
Semantic Concept Detection in Video Using Hybrid Model of CNN and SVM Classif...
Semantic Concept Detection in Video Using Hybrid Model of CNN and SVM Classif...Semantic Concept Detection in Video Using Hybrid Model of CNN and SVM Classif...
Semantic Concept Detection in Video Using Hybrid Model of CNN and SVM Classif...
CSCJournals
 
Comparison of convolutional neural network models for user’s facial recognition
Comparison of convolutional neural network models for user’s facial recognitionComparison of convolutional neural network models for user’s facial recognition
Comparison of convolutional neural network models for user’s facial recognition
IJECEIAES
 
Efficient Feature Selection for Fault Diagnosis of Aerospace System Using Syn...
Efficient Feature Selection for Fault Diagnosis of Aerospace System Using Syn...Efficient Feature Selection for Fault Diagnosis of Aerospace System Using Syn...
Efficient Feature Selection for Fault Diagnosis of Aerospace System Using Syn...
IRJET Journal
 
MACHINE LEARNING FOR QOE PREDICTION AND ANOMALY DETECTION IN SELF-ORGANIZING ...
MACHINE LEARNING FOR QOE PREDICTION AND ANOMALY DETECTION IN SELF-ORGANIZING ...MACHINE LEARNING FOR QOE PREDICTION AND ANOMALY DETECTION IN SELF-ORGANIZING ...
MACHINE LEARNING FOR QOE PREDICTION AND ANOMALY DETECTION IN SELF-ORGANIZING ...
ijwmn
 
Gender classification using custom convolutional neural networks architecture
Gender classification using custom convolutional neural networks architecture Gender classification using custom convolutional neural networks architecture
Gender classification using custom convolutional neural networks architecture
IJECEIAES
 
CLOUD BIOINFORMATICS Part1
 CLOUD BIOINFORMATICS Part1 CLOUD BIOINFORMATICS Part1
CLOUD BIOINFORMATICS Part1
ARPUTHA SELVARAJ A
 
Efficient Attack Detection in IoT Devices using Feature Engineering-Less Mach...
Efficient Attack Detection in IoT Devices using Feature Engineering-Less Mach...Efficient Attack Detection in IoT Devices using Feature Engineering-Less Mach...
Efficient Attack Detection in IoT Devices using Feature Engineering-Less Mach...
AIRCC Publishing Corporation
 
EFFICIENT ATTACK DETECTION IN IOT DEVICES USING FEATURE ENGINEERING-LESS MACH...
EFFICIENT ATTACK DETECTION IN IOT DEVICES USING FEATURE ENGINEERING-LESS MACH...EFFICIENT ATTACK DETECTION IN IOT DEVICES USING FEATURE ENGINEERING-LESS MACH...
EFFICIENT ATTACK DETECTION IN IOT DEVICES USING FEATURE ENGINEERING-LESS MACH...
ijcsit
 
IRJET- A Survey on Medical Image Interpretation for Predicting Pneumonia
IRJET- A Survey on Medical Image Interpretation for Predicting PneumoniaIRJET- A Survey on Medical Image Interpretation for Predicting Pneumonia
IRJET- A Survey on Medical Image Interpretation for Predicting Pneumonia
IRJET Journal
 

Similar to 1-s2.0-S0957417422020759-main.pdf (20)

Residual balanced attention network for real-time traffic scene semantic segm...
Residual balanced attention network for real-time traffic scene semantic segm...Residual balanced attention network for real-time traffic scene semantic segm...
Residual balanced attention network for real-time traffic scene semantic segm...
 
INVESTIGATING THE EFFECT OF BD-CRAFT TO TEXT DETECTION ALGORITHMS
INVESTIGATING THE EFFECT OF BD-CRAFT TO TEXT DETECTION ALGORITHMSINVESTIGATING THE EFFECT OF BD-CRAFT TO TEXT DETECTION ALGORITHMS
INVESTIGATING THE EFFECT OF BD-CRAFT TO TEXT DETECTION ALGORITHMS
 
Investigating the Effect of BD-CRAFT to Text Detection Algorithms
Investigating the Effect of BD-CRAFT to Text Detection AlgorithmsInvestigating the Effect of BD-CRAFT to Text Detection Algorithms
Investigating the Effect of BD-CRAFT to Text Detection Algorithms
 
FACE EXPRESSION RECOGNITION USING CONVOLUTION NEURAL NETWORK (CNN) MODELS
FACE EXPRESSION RECOGNITION USING CONVOLUTION NEURAL NETWORK (CNN) MODELS FACE EXPRESSION RECOGNITION USING CONVOLUTION NEURAL NETWORK (CNN) MODELS
FACE EXPRESSION RECOGNITION USING CONVOLUTION NEURAL NETWORK (CNN) MODELS
 
A deep learning based stereo matching model for autonomous vehicle
A deep learning based stereo matching model for autonomous vehicleA deep learning based stereo matching model for autonomous vehicle
A deep learning based stereo matching model for autonomous vehicle
 
IRJET- Vanet Connection Performance Analysis using GPSR Protocol
IRJET- Vanet Connection Performance Analysis using GPSR ProtocolIRJET- Vanet Connection Performance Analysis using GPSR Protocol
IRJET- Vanet Connection Performance Analysis using GPSR Protocol
 
Intelligent black hole detection in mobile AdHoc networks
Intelligent black hole detection in mobile AdHoc networksIntelligent black hole detection in mobile AdHoc networks
Intelligent black hole detection in mobile AdHoc networks
 
Data Analysis In The Cloud
Data Analysis In The CloudData Analysis In The Cloud
Data Analysis In The Cloud
 
An Integrated Inductive-Deductive Framework for Data Mapping in Wireless Sens...
An Integrated Inductive-Deductive Framework for Data Mapping in Wireless Sens...An Integrated Inductive-Deductive Framework for Data Mapping in Wireless Sens...
An Integrated Inductive-Deductive Framework for Data Mapping in Wireless Sens...
 
IRJET- An Efficient VLSI Architecture for 3D-DWT using Lifting Scheme
IRJET- An Efficient VLSI Architecture for 3D-DWT using Lifting SchemeIRJET- An Efficient VLSI Architecture for 3D-DWT using Lifting Scheme
IRJET- An Efficient VLSI Architecture for 3D-DWT using Lifting Scheme
 
CAR DAMAGE DETECTION USING DEEP LEARNING
CAR DAMAGE DETECTION USING DEEP LEARNINGCAR DAMAGE DETECTION USING DEEP LEARNING
CAR DAMAGE DETECTION USING DEEP LEARNING
 
Semantic Concept Detection in Video Using Hybrid Model of CNN and SVM Classif...
Semantic Concept Detection in Video Using Hybrid Model of CNN and SVM Classif...Semantic Concept Detection in Video Using Hybrid Model of CNN and SVM Classif...
Semantic Concept Detection in Video Using Hybrid Model of CNN and SVM Classif...
 
Comparison of convolutional neural network models for user’s facial recognition
Comparison of convolutional neural network models for user’s facial recognitionComparison of convolutional neural network models for user’s facial recognition
Comparison of convolutional neural network models for user’s facial recognition
 
Efficient Feature Selection for Fault Diagnosis of Aerospace System Using Syn...
Efficient Feature Selection for Fault Diagnosis of Aerospace System Using Syn...Efficient Feature Selection for Fault Diagnosis of Aerospace System Using Syn...
Efficient Feature Selection for Fault Diagnosis of Aerospace System Using Syn...
 
MACHINE LEARNING FOR QOE PREDICTION AND ANOMALY DETECTION IN SELF-ORGANIZING ...
MACHINE LEARNING FOR QOE PREDICTION AND ANOMALY DETECTION IN SELF-ORGANIZING ...MACHINE LEARNING FOR QOE PREDICTION AND ANOMALY DETECTION IN SELF-ORGANIZING ...
MACHINE LEARNING FOR QOE PREDICTION AND ANOMALY DETECTION IN SELF-ORGANIZING ...
 
Gender classification using custom convolutional neural networks architecture
Gender classification using custom convolutional neural networks architecture Gender classification using custom convolutional neural networks architecture
Gender classification using custom convolutional neural networks architecture
 
CLOUD BIOINFORMATICS Part1
 CLOUD BIOINFORMATICS Part1 CLOUD BIOINFORMATICS Part1
CLOUD BIOINFORMATICS Part1
 
Efficient Attack Detection in IoT Devices using Feature Engineering-Less Mach...
Efficient Attack Detection in IoT Devices using Feature Engineering-Less Mach...Efficient Attack Detection in IoT Devices using Feature Engineering-Less Mach...
Efficient Attack Detection in IoT Devices using Feature Engineering-Less Mach...
 
EFFICIENT ATTACK DETECTION IN IOT DEVICES USING FEATURE ENGINEERING-LESS MACH...
EFFICIENT ATTACK DETECTION IN IOT DEVICES USING FEATURE ENGINEERING-LESS MACH...EFFICIENT ATTACK DETECTION IN IOT DEVICES USING FEATURE ENGINEERING-LESS MACH...
EFFICIENT ATTACK DETECTION IN IOT DEVICES USING FEATURE ENGINEERING-LESS MACH...
 
IRJET- A Survey on Medical Image Interpretation for Predicting Pneumonia
IRJET- A Survey on Medical Image Interpretation for Predicting PneumoniaIRJET- A Survey on Medical Image Interpretation for Predicting Pneumonia
IRJET- A Survey on Medical Image Interpretation for Predicting Pneumonia
 

Recently uploaded

Introduction to Computer Networks & OSI MODEL.ppt
Introduction to Computer Networks & OSI MODEL.pptIntroduction to Computer Networks & OSI MODEL.ppt
Introduction to Computer Networks & OSI MODEL.ppt
Dwarkadas J Sanghvi College of Engineering
 
Call Girls Chennai +91-8824825030 Vip Call Girls Chennai
Call Girls Chennai +91-8824825030 Vip Call Girls ChennaiCall Girls Chennai +91-8824825030 Vip Call Girls Chennai
Call Girls Chennai +91-8824825030 Vip Call Girls Chennai
paraasingh12 #V08
 
一比一原版(osu毕业证书)美国俄勒冈州立大学毕业证如何办理
一比一原版(osu毕业证书)美国俄勒冈州立大学毕业证如何办理一比一原版(osu毕业证书)美国俄勒冈州立大学毕业证如何办理
一比一原版(osu毕业证书)美国俄勒冈州立大学毕业证如何办理
upoux
 
Impartiality as per ISO /IEC 17025:2017 Standard
Impartiality as per ISO /IEC 17025:2017 StandardImpartiality as per ISO /IEC 17025:2017 Standard
Impartiality as per ISO /IEC 17025:2017 Standard
MuhammadJazib15
 
原版制作(Humboldt毕业证书)柏林大学毕业证学位证一模一样
原版制作(Humboldt毕业证书)柏林大学毕业证学位证一模一样原版制作(Humboldt毕业证书)柏林大学毕业证学位证一模一样
原版制作(Humboldt毕业证书)柏林大学毕业证学位证一模一样
ydzowc
 
Sri Guru Hargobind Ji - Bandi Chor Guru.pdf
Sri Guru Hargobind Ji - Bandi Chor Guru.pdfSri Guru Hargobind Ji - Bandi Chor Guru.pdf
Sri Guru Hargobind Ji - Bandi Chor Guru.pdf
Balvir Singh
 
3rd International Conference on Artificial Intelligence Advances (AIAD 2024)
3rd International Conference on Artificial Intelligence Advances (AIAD 2024)3rd International Conference on Artificial Intelligence Advances (AIAD 2024)
3rd International Conference on Artificial Intelligence Advances (AIAD 2024)
GiselleginaGloria
 
A high-Speed Communication System is based on the Design of a Bi-NoC Router, ...
A high-Speed Communication System is based on the Design of a Bi-NoC Router, ...A high-Speed Communication System is based on the Design of a Bi-NoC Router, ...
A high-Speed Communication System is based on the Design of a Bi-NoC Router, ...
DharmaBanothu
 
SELENIUM CONF -PALLAVI SHARMA - 2024.pdf
SELENIUM CONF -PALLAVI SHARMA - 2024.pdfSELENIUM CONF -PALLAVI SHARMA - 2024.pdf
SELENIUM CONF -PALLAVI SHARMA - 2024.pdf
Pallavi Sharma
 
Tools & Techniques for Commissioning and Maintaining PV Systems W-Animations ...
Tools & Techniques for Commissioning and Maintaining PV Systems W-Animations ...Tools & Techniques for Commissioning and Maintaining PV Systems W-Animations ...
Tools & Techniques for Commissioning and Maintaining PV Systems W-Animations ...
Transcat
 
AN INTRODUCTION OF AI & SEARCHING TECHIQUES
AN INTRODUCTION OF AI & SEARCHING TECHIQUESAN INTRODUCTION OF AI & SEARCHING TECHIQUES
AN INTRODUCTION OF AI & SEARCHING TECHIQUES
drshikhapandey2022
 
INTRODUCTION TO ARTIFICIAL INTELLIGENCE BASIC
INTRODUCTION TO ARTIFICIAL INTELLIGENCE BASICINTRODUCTION TO ARTIFICIAL INTELLIGENCE BASIC
INTRODUCTION TO ARTIFICIAL INTELLIGENCE BASIC
GOKULKANNANMMECLECTC
 
This study Examines the Effectiveness of Talent Procurement through the Imple...
This study Examines the Effectiveness of Talent Procurement through the Imple...This study Examines the Effectiveness of Talent Procurement through the Imple...
This study Examines the Effectiveness of Talent Procurement through the Imple...
DharmaBanothu
 
一比一原版(uofo毕业证书)美国俄勒冈大学毕业证如何办理
一比一原版(uofo毕业证书)美国俄勒冈大学毕业证如何办理一比一原版(uofo毕业证书)美国俄勒冈大学毕业证如何办理
一比一原版(uofo毕业证书)美国俄勒冈大学毕业证如何办理
upoux
 
UNIT-III- DATA CONVERTERS ANALOG TO DIGITAL CONVERTER
UNIT-III- DATA CONVERTERS ANALOG TO DIGITAL CONVERTERUNIT-III- DATA CONVERTERS ANALOG TO DIGITAL CONVERTER
UNIT-III- DATA CONVERTERS ANALOG TO DIGITAL CONVERTER
vmspraneeth
 
FULL STACK PROGRAMMING - Both Front End and Back End
FULL STACK PROGRAMMING - Both Front End and Back EndFULL STACK PROGRAMMING - Both Front End and Back End
FULL STACK PROGRAMMING - Both Front End and Back End
PreethaV16
 
Beckhoff Programmable Logic Control Overview Presentation
Beckhoff Programmable Logic Control Overview PresentationBeckhoff Programmable Logic Control Overview Presentation
Beckhoff Programmable Logic Control Overview Presentation
VanTuDuong1
 
paper relate Chozhavendhan et al. 2020.pdf
paper relate Chozhavendhan et al. 2020.pdfpaper relate Chozhavendhan et al. 2020.pdf
paper relate Chozhavendhan et al. 2020.pdf
ShurooqTaib
 
Call For Paper -3rd International Conference on Artificial Intelligence Advan...
Call For Paper -3rd International Conference on Artificial Intelligence Advan...Call For Paper -3rd International Conference on Artificial Intelligence Advan...
Call For Paper -3rd International Conference on Artificial Intelligence Advan...
ijseajournal
 
Particle Swarm Optimization–Long Short-Term Memory based Channel Estimation w...
Particle Swarm Optimization–Long Short-Term Memory based Channel Estimation w...Particle Swarm Optimization–Long Short-Term Memory based Channel Estimation w...
Particle Swarm Optimization–Long Short-Term Memory based Channel Estimation w...
IJCNCJournal
 

Recently uploaded (20)

Introduction to Computer Networks & OSI MODEL.ppt
Introduction to Computer Networks & OSI MODEL.pptIntroduction to Computer Networks & OSI MODEL.ppt
Introduction to Computer Networks & OSI MODEL.ppt
 
Call Girls Chennai +91-8824825030 Vip Call Girls Chennai
Call Girls Chennai +91-8824825030 Vip Call Girls ChennaiCall Girls Chennai +91-8824825030 Vip Call Girls Chennai
Call Girls Chennai +91-8824825030 Vip Call Girls Chennai
 
一比一原版(osu毕业证书)美国俄勒冈州立大学毕业证如何办理
一比一原版(osu毕业证书)美国俄勒冈州立大学毕业证如何办理一比一原版(osu毕业证书)美国俄勒冈州立大学毕业证如何办理
一比一原版(osu毕业证书)美国俄勒冈州立大学毕业证如何办理
 
Impartiality as per ISO /IEC 17025:2017 Standard
Impartiality as per ISO /IEC 17025:2017 StandardImpartiality as per ISO /IEC 17025:2017 Standard
Impartiality as per ISO /IEC 17025:2017 Standard
 
原版制作(Humboldt毕业证书)柏林大学毕业证学位证一模一样
原版制作(Humboldt毕业证书)柏林大学毕业证学位证一模一样原版制作(Humboldt毕业证书)柏林大学毕业证学位证一模一样
原版制作(Humboldt毕业证书)柏林大学毕业证学位证一模一样
 
Sri Guru Hargobind Ji - Bandi Chor Guru.pdf
Sri Guru Hargobind Ji - Bandi Chor Guru.pdfSri Guru Hargobind Ji - Bandi Chor Guru.pdf
Sri Guru Hargobind Ji - Bandi Chor Guru.pdf
 
3rd International Conference on Artificial Intelligence Advances (AIAD 2024)
3rd International Conference on Artificial Intelligence Advances (AIAD 2024)3rd International Conference on Artificial Intelligence Advances (AIAD 2024)
3rd International Conference on Artificial Intelligence Advances (AIAD 2024)
 
A high-Speed Communication System is based on the Design of a Bi-NoC Router, ...
A high-Speed Communication System is based on the Design of a Bi-NoC Router, ...A high-Speed Communication System is based on the Design of a Bi-NoC Router, ...
A high-Speed Communication System is based on the Design of a Bi-NoC Router, ...
 
SELENIUM CONF -PALLAVI SHARMA - 2024.pdf
SELENIUM CONF -PALLAVI SHARMA - 2024.pdfSELENIUM CONF -PALLAVI SHARMA - 2024.pdf
SELENIUM CONF -PALLAVI SHARMA - 2024.pdf
 
Tools & Techniques for Commissioning and Maintaining PV Systems W-Animations ...
Tools & Techniques for Commissioning and Maintaining PV Systems W-Animations ...Tools & Techniques for Commissioning and Maintaining PV Systems W-Animations ...
Tools & Techniques for Commissioning and Maintaining PV Systems W-Animations ...
 
AN INTRODUCTION OF AI & SEARCHING TECHIQUES
AN INTRODUCTION OF AI & SEARCHING TECHIQUESAN INTRODUCTION OF AI & SEARCHING TECHIQUES
AN INTRODUCTION OF AI & SEARCHING TECHIQUES
 
INTRODUCTION TO ARTIFICIAL INTELLIGENCE BASIC
INTRODUCTION TO ARTIFICIAL INTELLIGENCE BASICINTRODUCTION TO ARTIFICIAL INTELLIGENCE BASIC
INTRODUCTION TO ARTIFICIAL INTELLIGENCE BASIC
 
This study Examines the Effectiveness of Talent Procurement through the Imple...
This study Examines the Effectiveness of Talent Procurement through the Imple...This study Examines the Effectiveness of Talent Procurement through the Imple...
This study Examines the Effectiveness of Talent Procurement through the Imple...
 
一比一原版(uofo毕业证书)美国俄勒冈大学毕业证如何办理
一比一原版(uofo毕业证书)美国俄勒冈大学毕业证如何办理一比一原版(uofo毕业证书)美国俄勒冈大学毕业证如何办理
一比一原版(uofo毕业证书)美国俄勒冈大学毕业证如何办理
 
UNIT-III- DATA CONVERTERS ANALOG TO DIGITAL CONVERTER
UNIT-III- DATA CONVERTERS ANALOG TO DIGITAL CONVERTERUNIT-III- DATA CONVERTERS ANALOG TO DIGITAL CONVERTER
UNIT-III- DATA CONVERTERS ANALOG TO DIGITAL CONVERTER
 
FULL STACK PROGRAMMING - Both Front End and Back End
FULL STACK PROGRAMMING - Both Front End and Back EndFULL STACK PROGRAMMING - Both Front End and Back End
FULL STACK PROGRAMMING - Both Front End and Back End
 
Beckhoff Programmable Logic Control Overview Presentation
Beckhoff Programmable Logic Control Overview PresentationBeckhoff Programmable Logic Control Overview Presentation
Beckhoff Programmable Logic Control Overview Presentation
 
paper relate Chozhavendhan et al. 2020.pdf
paper relate Chozhavendhan et al. 2020.pdfpaper relate Chozhavendhan et al. 2020.pdf
paper relate Chozhavendhan et al. 2020.pdf
 
Call For Paper -3rd International Conference on Artificial Intelligence Advan...
Call For Paper -3rd International Conference on Artificial Intelligence Advan...Call For Paper -3rd International Conference on Artificial Intelligence Advan...
Call For Paper -3rd International Conference on Artificial Intelligence Advan...
 
Particle Swarm Optimization–Long Short-Term Memory based Channel Estimation w...
Particle Swarm Optimization–Long Short-Term Memory based Channel Estimation w...Particle Swarm Optimization–Long Short-Term Memory based Channel Estimation w...
Particle Swarm Optimization–Long Short-Term Memory based Channel Estimation w...
 

1-s2.0-S0957417422020759-main.pdf

  • 1. Expert Systems With Applications 213 (2023) 119057 Available online 19 October 2022 0957-4174/© 2022 Elsevier Ltd. All rights reserved. Multi-view and Multi-level network for fault diagnosis accommodating feature transferability Na Lu * , Zhiyan Cui , Huiyang Hu , Tao Yin Systems Engineering Institute, School of Automation Science and Engineering, Xi’an Jiaotong University, Xi’an, Shaanxi 710049, China A R T I C L E I N F O Keywords: Transfer learning Feature transferability Fault diagnosis Few shot learning A B S T R A C T Various deep transfer learning solutions have been developed for machine fault diagnosis. The existing solutions mainly focus on domain adaptation by minimizing the data distribution discrepancy with certain metric, which emphasize the common features embedded in the data cross domains and neglect the unique features toward health condition classification in one specific domain. In these solutions, all the data for training have been forced to align in a common feature space and all the features for domain adaptation have been treated equally. However, there might exist domain specific features which are not appropriate for transfer but may contain essential information for classification in specific domain. In addition, due to the difficulty of collecting machine fault data, the number of machine fault samples is usually quite small or even zero. The traditional deep network structures and the training strategy are not the optimal choice in this occasion. To address these problems, a novel multi-view and multi-level network (MMNet) for fault diagnosis is developed. In MMNet, two network channels have been respectively constructed for cross domain common feature and domain specific feature learning to provide multi-view features. This architecture could implicitly differentiate the common features cross domains and the specific features only in one domain. In the channel of domain specific feature, a domain classifier and fault classifier are combined to learn the domain specific features. Multiple kernel maximum mean discrepancy (MK-MMD) is imposed on multiple layers of the common feature channel to implement domain adaptation and extract cross domain common features. The domain classification and fault classification together form a multi-level classification scheme. A classic few shot learning architecture with two modules respectively for feature extraction and relation computation is adopted as the backbone network. The relation score based classification mechanism enables zero shot fault classification in the target domain. Episode based few shot training strategy is employed to enhance the performance of MMNet with few labeled training data. Extensive experiments have demonstrated the state-of-the-art performance of MMNet on the involved transfer tasks. 1. Introduction Machine fault in industry could bring catastrophic damage and enormous economic loss (Lei, Jia, Lin, Xing, & Ding, 2016). Therefore, fault diagnosis has long been a popular and important research field which involves multidisciplinary researches like mechanical engineer­ ing, signal processing, and machine learning and so on. Machines usu­ ally work in health state during most time of their life circle. Different possible faults only occur in rare occasions. Due to the long time span of normal condition and sporadic occurrence of fault, it is commonly acknowledged that the fault data collected from one machine is quite limited especially in practical application. 
While in laboratory envi­ ronment, it is much easier to collect manual fabricated fault data. Therefore, how to learn efficient representation of fault data and transfer the knowledge learnt from data abundant scenarios to data lack sce­ narios are crucial for fault diagnosis. To this end, deep learning and transfer learning have been widely explored in recent decades in fault diagnosis. Various deep network models have been employed to automatically extract discriminant fea­ tures from machine fault data (Lu & Yin, 2021). Network structures like Peer review under responsibility of Submissions with the production note ‘Please add the Reproducibility Badge for this item’ the Badge and the following footnote to be added:The code (and data) in this article has been certified as Reproducible by the CodeOcean: https://codeocean.com. More information on the Reproduc­ ibility Badge Initiative is available at https://www.elsevier.com/physicalsciencesandengineering/computerscience/journals.. * Corresponding author. E-mail address: lvna2009@xjtu.edu.cn (N. Lu). Contents lists available at ScienceDirect Expert Systems With Applications journal homepage: www.elsevier.com/locate/eswa https://doi.org/10.1016/j.eswa.2022.119057 Received 16 June 2021; Received in revised form 23 April 2022; Accepted 13 October 2022
  • 2. Expert Systems With Applications 213 (2023) 119057 2 AutoEncoder (Yu, Wang, Li, & Zhao, 2019), sparse AutoEncoder (Wen, Gao, & Li, 2017), Convolutional Neural Network (CNN) (Jia, Lei, Lu, & Xing, 2018; Yang, Lei, Jia, Li, & Du, 2020) have been widely employed for fault representation learning. In addition, Generative Adversarial Network (GAN) (Chen et al., 2020; Li et al., 2020; Zhang et al., 2020) based methods have also been employed for fault diagnosis, which aim to generate more fault samples to balance the fault dataset and improve the classification performance. Except the GAN based solutions, most of the fault classification network architectures and their training methods were borrowed directly from the classic deep learning solutions of computer vision, which can well fit the big data applications. However, when the fault data are not abundant and especially when no labeled data are available, more appropriate network architecture and training method need be developed. Another important issue in fault diagnosis is how to transfer the knowledge from the domain with relatively abundant labeled data (source domain) to the domain with few or no labeled data (target domain). Here the different domains could be understood as different machines or one machine under different working conditions. To address this issue, many solutions combining deep neural network and transfer learning have been developed (Li et al., 2020; Li, Zhang, Ding, & Sun, 2019; Shao, McAleer, Yan, & Baldi, 2019; Xu, Liu, Jiang, Shen, & Huang, 2020; Yang et al., 2020; Yang, Lei, Jia, & Xing, 2019) which we refer to as deep transfer learning methods for simplicity. These methods mainly aimed at minimizing the distribution discrepancy between different domains and improving the fault classification accuracy. To fulfill domain adaptation, multiple metrics of data distribution have been applied, including Maximum Mean Discrepancy (MMD) (Yang et al., 2019), Multi-kernel Maximum Mean Discrepancy (MK-MMD) (Che, Wang, Ni, & Fu, 2020), Polynomial-kernel Maximum Mean Discrepancy (PK-MMD) (Yang et al., 2020) and so on. These metrics evaluate the data distribution difference which is used as the domain adaptation loss to train the fault diagnosis model. The training objective functions of deep transfer learning models usually contain two parts, classification loss and domain adaptation loss. By minimizing the overall loss of these terms, the deep transfer learning models could be trained. Long et al. (Long, Cao, Wang, & Jordan, 2015) developed a widely used deep transfer learning method with domain adaptation. MK-MMD loss was used on the last three fully connected layers but the output layer to enable domain adaptation. Lu et al. (Lu et al., 2017) adopted MMD as the distribution discrepancy measure and developed a deep neural network (DNN) model for fault diagnosis. The MMD loss was imposed on the feature layer of a DNN. A gearbox dataset collected under different working conditions was employed to evaluate the method. A deep convolutional transfer learning network (DCTLN) was constructed by Guo et al. (Guo, Lei, Xing, Yan, & Li, 2019) to implement fault diagnosis knowledge transfer. One convolutional network module was used for fault condition recognition and another convolutional network module was used for domain distribution adaptation. Three datasets collected from bearings were used for experiments to test the transferability of DCTLN. Wen et al. 
(Wen, Gao, & Li, 2019) developed a sparse autoen­ coder for feature representation learning which used frequency spec­ trum of vibration sequences recorded from bearings as input. Domain adaptation was implemented via MMD. Li et al. (Li, Zhang, & Ding, 2018) also proposed a domain adaptive deep convolutional neural network for bearing fault diagnosis and fault knowledge transfer. The fault dataset was collected under working environments with different noise. Frequency spectrum was employed as the input to the CNN model. The cross-domain feature discrepancy was also minimized based on MMD. FTNN (feature-based transfer neural network) was developed by Yang et al. (Yang et al., 2019) to diagnose the machine faults of real-case machines by the knowledge learnt from the data recorded from labo­ ratory machines. MMD was also adopted for domain adaptation which was imposed on multiple network layers. Four bearing fault datasets were used to construct the transfer experiments and test the perfor­ mance of FTNN. Lu et al. (Lu & Yin, 2021) developed a combined solution of convolutional autoencoder and convolutional network for bearing fault diagnosis, where the convolutional autoencoder was adopted to mine the common features cross domains. MMD was employed for domain adaptation in the convolutional autoencoder. From the above literature review, it could be seen that the domain transfer is usually implemented by imposing certain domain distribution metric on one or several network layers within the deep transfer model. In these solutions, all the input data for training were treated equally for domain adaptation. No matter what transfer learning solutions were adopted, an intermediate data distribution space would be learnt where the source domain and the target domain data were aligned with each other. Therefore, an implicit assumption is actually made in the existing deep transfer solutions that all the features learnt from the source domain could be appropriately transferred into the intermediate feature space, meanwhile maintaining discriminant power in both domains. However, there is no guarantee that the features of the data belonging to the same category from different domains could be transferred to the same cluster in the intermediate feature space. Some original features might carry discriminant information for the source domain, which might get lost after transferring for both domains. The nonlinear feature mapping obtained by the deep transfer model is not a deterministic projection function for both domains which means the samples from the same class but different domains might be mapped to regions belonging to different classes. The samples that are mistakenly projected will deteriorate the performance of the transfer model and lead to false classification. Therefore, to achieve high classification accuracy it is not sufficient to transfer all the source samples to the common feature space and only use the transferred common features cross domains for fault diagnosis. In order to keep the domain specific features and mine common features cross both domains simultaneously, a novel deep transfer so­ lution termed as multi-view multi-level network (MMNet) is developed in this paper. MMNet constructs a dual channel structure to learn the representations of common features cross domains and discriminant features in specific domain which form multi-view features for classifi­ cation. 
Domain level classification and fault level classification are combined to extract the domain specific features. The cross domain common features are learnt by MK-MMD based domain adaptation and fault level classification. In addition, to deal with the data deficiency problem, an efficient few shot learning mechanism is adopted which employs two modules i.e. feature extraction module and feature com­ parison module to perform fault diagnosis. Two weight shared branches are employed to extract multi-view features of both domains simulta­ neously, which form the feature extraction module. In the feature comparison module, relation score between template sample and query sample is used to implement fault classification. In MMNet, no labeled sample from the target domain is required. The test samples from the target domain are compared with the template samples from the source domain for fault diagnosis, which enables zero shot diagnosis in the target domain. Episode based training strategy is adopted to train MMNet. There are three major contributions in this paper. First, the property of the features before and after domain transfer has been analyzed, based on which a multi-view feature extraction mechanism incorporating domain specific features and cross domain common features is proposed. Second, a multi-view multi-level network MMNet is constructed which combines fault level classification and domain level classification to learn domain specific features, and meanwhile combines MK-MMD based domain adaptation and fault level classification to learn com­ mon features cross domains. Third, a FeatureNet module is used to extract sample features and a RelationNet module is adopted to implement fault classification in MMNet, which enables zero shot fault diagnosis in the target domain. The paper is organized as follows. Section 1 is introduction. Problem formulation, transfer feature analysis and some preliminary knowledge N. Lu et al.
  • 3. Expert Systems With Applications 213 (2023) 119057 3 are discussed in Section 2. Section 3 describes the proposed solution MMNet in details. Section 4 reports experiment and comparison results to demonstrate the effectiveness of MMNet. Conclusions are made in Section 5. 2. Motivation and preliminaries 2.1. Problem formulation and motivation In machine fault diagnosis task, data are collected from one machine under different working conditions or different machines. The data from different working conditions or different machines follow different probability distributions, which are viewed as different domains. Transfer learning aims at borrowing the knowledge learnt from one domain to another domain. The former one is called source domain and the latter one target domain, which could be denoted as D s and D t respectively. The sample space of the source domain and the target domain can be denoted as Xs and Xt which satisfy Xs ⊂D s and Xt ⊂D t . The samples drawn from the source space can be represented as { xs i } , i = 1, 2, ⋯, ns and the samples from the target space can be represented as { xt i } , i = 1, 2, ⋯, nt, where ns and nt are respectively the number of samples from the corresponding domain. The fault categories in the source and the target domain are assumed to be the same. The fault class space is denoted as Y = {1, 2, • • •, C }, where C is the number of fault categories involved. Therefore, there exists Ys = Yt = Y. Accordingly, one labeled sample from the source and the target domain could be respectively represented as { xs i , ys i } , i = 1, 2, ⋯, ns and { xt i , yt i } , i = 1, 2, ⋯,nt. In our study, the training set from the source domain are labeled and no label information from the target domain training set is used. Transfer learning methods try to learn an intermediate feature space where the data from different space could be aligned. When deep transfer learning methods are employed, an intermediate feature space can be constructed by the learnt features which can be denoted as Xm . At different layers of the deep model, multiple intermediate feature space will be learnt. For simplicity, we use Xm as a general representation for all the intermediate feature space. The nonlinear mapping from the input sample to the intermediate feature space is represented as φ : Xs , Xt →Xm . With an ideal nonlinear mapping, the input samples from the source and the target domain belonging to one category should be mapped to the same region within one class boundary in the feature space. However, the nonlinear model learned by neural network training is not a deterministic optimal solution. Some samples of the same class from the source and the target domains will be mapped to different class regions. Fig. 1 gives an illustration of the mistakenly mapped samples. Fig. 1(a) depicts the samples within the source domain and Fig. 1(b) shows the projected results in the intermediate feature space from both the source and the target domain. The solid triangles and circles in Fig. 1(a) and (b) are samples from two fault classes of the source domain. The dotted triangles and circles in Fig. 1(b) represent the samples from the target domain belonging to the corresponding two classes as the source domain samples. Within the source domain, these samples could be well classified by the classification boundary as shown in Fig. 1(a). 
When the samples have been mapped to the intermediate feature space, to correctly classify the target domain samples the ex­ pected target class boundary should be set as in Fig. 1(b). It could be seen that some mapped source domain samples are not in agreement with the correct class boundary. When all the mapped samples from the source domain are treated as prior knowledge for the target domain, an actual class boundary would be obtained as shown in Fig. 1(b). Obviously some source domain samples have not been appropriately mapped and could bring misleading information. If deep transfer learning model is employed, to alleviate the influence from the above discussed phenomenon, the weights in corresponding to such misleading samples should be suppressed. Their contribution to the target domain fault classification should be minimized. However, in the source domain fault classification, these samples might play important role and thus their corresponding weight could not be diminished during the model training progress. The existing deep transfer learning solu­ tions treat all the samples indifferently with the domain adaptation procedure, which makes the above discussed problem an issue to be addressed and forms one of the motivations of this study. In addition, the widely used benchmarks for deep model training are usually of very large scale. The popular image dataset ImageNet (Deng et al., 2009) contains more than 10 million samples from more than 20 thousand categories. Sports-1 M (Karpathy et al., 2014) is a famous video dataset for action recognition which includes more than 1 million videos. LaSOT (Fan et al., 2019) is a representative visual tracking dataset which includes more than 3 million image frames. In contrast, the fault diagnosis benchmarks like CWRU bearing dataset provided by Case Western Reserve University (Center), IMS bearing dataset (Guo et al., 2019) and RL bearing dataset (Lei, 2017) usually only contain several hundred or several thousand samples. Therefore, fault diagnosis Fig. 1. Illustration of mistakenly mapped samples from the source domain to the intermediate feature space. (a) Source domain samples and their class boundary (b) Mapped source domain and target domain samples in the intermediate feature space and class boundaries. N. Lu et al.
  • 4. Expert Systems With Applications 213 (2023) 119057 4 is a relatively small data problem. Appropriate deep models which could well deal few shot learning scenarios should be explored. Furthermore, when zero labeled sample is provided in the target domain, how to implement efficient fault knowledge transfer and fault classification remains a challenge. This is another motivation of this work. 2.2. Multiple kernel maximum mean discrepancy Multiple Kernel Maximum Mean Discrepancy (MK-MMD) is an improved version of Maximum Mean Discrepancy (MMD). MMD is a metric evaluating the data distribution distinction between the source and the target domain. It is indicated in (Gretton, Borgwardt, Rasch, Scholkopf, & Smola, 2012) that the probability distribution difference between two domains could be estimated by their mean embedding in the Reproducing Kernel Hilbert Space (RKHS) via the characteristic kernel function. Gaussian kernel is characteristic on Rd which is used to define MMD. Given i.i.d samples from the source and the target domain as Xs := { xs 1, xs 2, ⋯, xs ns } and Xt : = { xt 1, xt 2, ⋯, xt nt } , which are respec­ tively drawn from probability distribution Ps and Pt, and suppose H k is the RKHS endowed with characteristic Gaussian kernel k( • ), the MMD can be formulated as. dH k (F , Ps, Pt) := sup f ∈ F ( 1 ns ∑ ns i=1 f ( xs i ) − 1 nt ∑ nt i=1 f ( xt i ) ) , (1) where F is a class of functions which performs nonlinear mapping as f : Xs →R or f : Xt →R, sup ( • ) is the supremum of the input. The two terms in the bracket of Eq. (1) are respectively the empirical mean expecta­ tions of the source and the target domain calculated on the samples. It has been demonstrated in (Gretton et al., 2012) that the nonlinear function f( • ) could be estimated by the endowed Gaussian kernel function. Therefore, MMD could be estimated by the data samples as. where k(•, •) is the characteristic Gaussian kernel. Given two feature vectors xi and xj, the Gaussian kernel function is defined as. k ( xi, xj ) = e − ‖xi− xj‖2 γ (3) where γ is the kernel width. MMD uses single Gaussian kernel to evaluate the distribution distinction between the source and the target domain, which suffers from suboptimal kernel selection and limited adaptation effectiveness. MK-MMD (Long et al., 2015) constructs a multiple-kernel variant of MMD, which employs the combination of multiple Gaussian kernels to measure the distribution discrepancy. The characteristic kernel used in MK-MMD is defined as. k = ∑mu u=1 βuku, s.t. ∑mu u=1 βu = 1, βu ≥ 0, ∀u, (4) where mu is the number of used kernels and βu is the weight of kernel u. In this research, Gaussian kernels are used as the base kernels. One Gaussian kernel can be rewritten as ku ( xi, xj ) = e − ‖xi− xj‖2 γ . Through changing the kernel bandwidth γ between 2− ⌊ku/2⌋ γ and 2⌊ku/2⌋ γ with a scaling parameter of 2, where ⌊. • / • ⌋ is the integer division, the mu Gaussian kernels could be obtained. 2.3. Few shot learning Few shot learning has developed into an important direction in machine learning research which aims at exploring effective solutions for application scenarios with small dataset for training. There are mainly-two popular categories of few shot learning methods, metric based methods and optimization based methods. Matching network (Vinyals, Blundell, Lillicrap, Kavukcuoglu, & Wierstra, 2016), prototype network (Snell, Swersky, & Zemel, 2017) and relation network (Sung et al., 2018) are representative metric based few shot learning methods. 
2.3. Few shot learning

Few shot learning has developed into an important direction of machine learning research which aims at exploring effective solutions for application scenarios with only small datasets for training. There are two popular categories of few shot learning methods: metric based methods and optimization based methods. Matching network (Vinyals, Blundell, Lillicrap, Kavukcuoglu, & Wierstra, 2016), prototype network (Snell, Swersky, & Zemel, 2017) and relation network (Sung et al., 2018) are representative metric based few shot learning methods. Methods like model-agnostic meta-learning (MAML) (Finn, Abbeel, & Levine, 2017) and task-agnostic meta-learning (TAML) (Jamal & Qi, 2019) are optimization based methods. A common property of these few shot learning methods is that small mini-batches over multiple tasks are sampled to train the model iteratively. This cross-task training procedure enables fast fine tuning of the model and improves its generalization performance, which assures the model's effectiveness in small data application scenarios. Among these few shot learning methods, relation network (Sung et al., 2018) employs a network module, called the relation module, to learn the metric for evaluating sample differences. Before the relation module, a feature module is used to extract the features of the input samples. Considering the excellent performance of relation network, its two-module architecture has been borrowed to build MMNet in this study.

3. Multi-view and Multi-level network

As discussed in Section 2.1, taking all the samples into domain adaptation equally might lead to the loss of important information. The domain specific information carried by the samples inappropriate for transfer will be suppressed to fulfill domain adaptation between the source and the target domain. In order to retain as much effective information as possible, both the common features across domains and the domain specific features should be extracted simultaneously. In addition, few shot learning mechanisms should be incorporated to deal with the data paucity issue in fault diagnosis. Therefore, a novel solution, MMNet, is developed which learns multi-view features with multi-level classification.

3.1. Architecture of MMNet

Within a domain adaptation deep network, all the involved network weights are adjusted toward improving the classification performance of the network. Therefore, the contribution of the samples which are inappropriate for domain adaptation will be diminished. Only the features of the samples that benefit the domain alignment between the source and the target domain will be effectively extracted. To extract both cross domain common features and domain specific features, two isolated network channels for feature extraction are designed in MMNet. Fig. 2 gives the detailed architecture of MMNet: Fig. 2(a) shows the structure of MMNet and Fig. 2(b) gives the notations of the different channels in the network.
The overall architecture of MMNet borrows the module arrangement of relation network (Sung et al., 2018). As shown in Fig. 2(a), MMNet has two modules, denoted FeatureNet and RelationNet. FeatureNet extracts the features of the input samples and RelationNet computes the relation between samples. Each module contains two branches, indicated as the source branch and the target branch, which process the input samples from the source and the target domain respectively. In the FeatureNet module, the upper two feature extraction channels form the source branch, which extracts the features of the source domain samples; the lower two feature extraction channels form the target branch, which extracts the features of the target domain samples. The source and target branches share the same weights. The cross domain common feature channel aims at extracting the common features across domains via domain adaptation, while the domain specific feature channel extracts the domain specific discriminant features facilitating both fault classification and domain classification. The corresponding channel notations are given in Fig. 2(b). The two branches in the RelationNet module are also weight shared.

To obtain common features across the source and the target domains, MK-MMD based domain adaptation is employed. It has been indicated in (Long et al., 2015) that with increasing network depth the features learned over the layers transit from general to specific, and the specific features of one domain are more difficult to transfer to another domain than the general features. Therefore, MK-MMD loss is imposed on three layers of MMNet as shown in Fig. 2(a): in the FeatureNet module, on the highest convolutional layer; in the RelationNet module, on the two highest fully connected layers excluding the output layer.

To obtain domain specific features, domain level classification and fault level classification have both been incorporated. Domain level classification is performed based on the features extracted by the domain specific feature channels in the FeatureNet module. The domain specific feature channel aims at boosting both domain classification and fault classification, and could thus learn the features that benefit classification in a specific domain.

Fig. 2. Architecture of MMNet. (a) MMNet structure. (b) Details of network branches in MMNet.

The details of the network channels are given in Fig. 3. The two feature learning channels in both the source and the target branch of FeatureNet have the same structure settings.
In each channel, there are three convolutional layers, each followed by an average pooling layer. All three convolutional layers use 20 feature maps with a kernel size of 3 × 1, and the pooling size of the average pooling layers is 2. In the source branch, based on the features learned by the domain specific feature channel, a flatten layer with a dimension of 5120 and a fully connected layer are used for domain classification. Here domain classification is a binary classification problem: samples from the source domain are labeled 1 and samples from the target domain are labeled 0. The upper channel in the RelationNet module calculates the similarity between the concatenated features and implements fault classification, as shown in Fig. 2(a). The lower channel in the RelationNet module shares the same structure with the upper channel and only participates in the domain adaptation calculation. In both RelationNet channels, two convolutional layers, one flatten layer and two fully connected layers are employed. The convolutional kernel width is 3 × 1 and the average pooling size is 4. The dimensions of the flatten layer and the two fully connected layers are 1280, 512 and 256 respectively. The computation and optimization details are given in the following section.
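The channel structures described above can be summarized in a brief PyTorch sketch. This is a hypothetical reconstruction, not the released MMNet code: the activation functions and convolution padding are not specified in the text and are our assumptions, so the flatten widths may differ from the reported 5120 and 1280 (nn.LazyLinear is used to absorb this).

```python
import torch
import torch.nn as nn

class FeatureChannel(nn.Module):
    # One feature extraction channel of the FeatureNet module: three 3 x 1
    # convolutions with 20 feature maps, each followed by average pooling of size 2.
    def __init__(self):
        super().__init__()
        blocks = []
        for i in range(3):
            blocks += [nn.Conv1d(1 if i == 0 else 20, 20, kernel_size=3, padding=1),
                       nn.ReLU(),          # assumed activation
                       nn.AvgPool1d(2)]
        self.net = nn.Sequential(*blocks)

    def forward(self, x):                  # x: (batch, 1, 1024) vibration segment
        return self.net(x)

class RelationChannel(nn.Module):
    # One RelationNet channel: two convolutions (kernel 3 x 1, average pooling 4),
    # a flatten layer, fully connected layers of width 512 and 256, and a single
    # relation score per template-query pair, feeding the Softmax of Eq. (6).
    def __init__(self, in_channels=40):    # concatenated template + query feature maps
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(in_channels, 20, kernel_size=3, padding=1), nn.ReLU(), nn.AvgPool1d(4),
            nn.Conv1d(20, 20, kernel_size=3, padding=1), nn.ReLU(), nn.AvgPool1d(4),
            nn.Flatten(),
            nn.LazyLinear(512), nn.ReLU(),  # flatten width depends on padding; the paper reports 1280
            nn.Linear(512, 256), nn.ReLU(),
            nn.Linear(256, 1))

    def forward(self, z):
        return self.net(z)
```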
3.2. Optimization of MMNet

The training of MMNet adopts the episode based training strategy of few shot learning methods. The training set is constructed from the samples of both the source and the target domain: the labeled source domain part is used for fault classification training, the unlabeled target domain part is used for domain adaptation, and both parts are used for domain classification training. In episode based training, an experiment mechanism called the k-way m-shot setting is used, where k is the number of classes involved in each episode and m is the number of labeled samples per category that serve as templates for comparison. Specifically, in each episode a mini-batch is randomly selected from the source domain dataset as the template set, whose size is k × m in a k-way m-shot setting. A fraction of the remaining dataset is used as the query set.

In each episode, the features of the m template samples from each category are extracted by the FeatureNet module. The template samples can be denoted as $\{x_i^{s,t}\}, i = 1, 2, \cdots, m$, where s indicates that the samples come from the source domain dataset and t indicates that the samples serve as templates. The query samples are also fed to the FeatureNet module to extract their feature representations. The source query set can be represented as $\{x_i^{s,q}\}, i = 1, 2, \cdots, n$, where n is the number of query samples used for training from each class. These two parts of data are the input to the domain specific feature channel in the source branch of FeatureNet, as shown in Fig. 2(a). For the lower target branch, the same set of template samples is used, while the query set comes from the target domain and is denoted as $\{x_i^{t,q}\}, i = 1, 2, \cdots, n$. The numbers of query samples from the source and the target domain are the same. For each branch in the FeatureNet module, all the template samples from the source domain and two query samples, respectively from the source and the target domain, are fed to the FeatureNet module separately during each episode to obtain their corresponding feature vectors.

When the number of template samples m is larger than 1, the sum of their feature vectors is used as the template feature vector; the query feature vector is obtained from the query sample. Suppose the feature vectors of $\{x_i^{s,t}\}, i = 1, 2, \cdots, m$, $x_i^{s,q}$ and $x_i^{t,q}$ extracted by the FeatureNet module in one episode are $\{f_i^{s,t}\}, i = 1, 2, \cdots, m$, $f_i^{s,q}$ and $f_i^{t,q}$ respectively. The final template feature vector is obtained by summing up the feature vectors of all the template samples as

$$f^{s,t} = \sum_{i=1}^{m} f_i^{s,t}. \quad (5)$$

For each category of machine fault, a template vector is computed during each episode. After the FeatureNet module, the template feature vector and the query feature vector are concatenated with each other to form the input of the following RelationNet module, as shown in Fig. 2(a). During the training stage, one source domain query sample and one target domain query sample are fed to MMNet each time along with the template samples.

Fig. 3. Network structure details of the network channels in MMNet.
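As a concrete illustration of the episode construction and the template aggregation of Eq. (5), the sketch below samples one k-way m-shot episode. The container layout (a dict from class id to sample tensors) and the function names are our own assumptions for illustration only.

```python
import random
import torch

def sample_episode(source_by_class, target_pool, k=4, m=5, n_query=25):
    # One k-way m-shot training episode: m labeled templates and n_query labeled
    # queries per class from the source domain, plus unlabeled target queries.
    classes = random.sample(sorted(source_by_class), k)
    templates, s_queries, s_labels = [], [], []
    for label, c in enumerate(classes):
        picks = random.sample(source_by_class[c], m + n_query)
        templates.append(torch.stack(picks[:m]))           # (m, signal_len) per class
        s_queries.extend(picks[m:])
        s_labels.extend([label] * n_query)
    t_queries = random.sample(target_pool, k * n_query)    # no labels required
    return templates, torch.stack(s_queries), torch.tensor(s_labels), torch.stack(t_queries)

def template_vector(template_features):
    # Eq. (5): the class template is the sum of its m template feature vectors.
    return template_features.sum(dim=0)
```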
With the RelationNet module, the similarity between the query sample and the template of each category is calculated, producing a relation score $r_c(f_i^{s,q}, f^{s,t})$ for the source domain query sample, where c is the class index. Based on the relation scores, the Softmax function is employed to implement machine health condition classification as

$$p(y_i^{s,q} = c) = \frac{\exp\left(r_c(f_i^{s,q}, f^{s,t})\right)}{\sum_{c'=1}^{C} \exp\left(r_{c'}(f_i^{s,q}, f^{s,t})\right)}, \quad (6)$$

where $p(y_i^{s,q} = c)$ is the probability of the ith query sample from the source domain belonging to class c. The query samples from the target domain are used only for domain adaptation and no labels are provided for them, so classification of the target domain query samples is not conducted, as shown in Fig. 2(a).

To optimize MMNet, three parts of loss are combined: the machine fault classification loss, the domain classification loss and the domain adaptation loss. The fault classification loss is calculated based on the relation score, so it is termed the relation loss for simplicity, as shown in Fig. 2(a). The domain classification loss (domain loss for short) further includes two parts, i.e. the domain classification losses for the query samples from the source domain and the target domain respectively. The relation loss is denoted as $\mathcal{L}_r$ and defined by the cross entropy loss as

$$\mathcal{L}_r = \sum_{i=1}^{n_{bs}} J(x_i^{s,q}, y_i^{s,q} \,|\, \theta) = -\sum_{i=1}^{n_{bs}} y_i^{true} \log y_i^{s,q}, \quad (7)$$

where $n_{bs}$ is the number of source domain query samples in one training episode, $\theta$ represents the parameters of the network, $y_i^{s,q}$ is the estimated fault label and $y_i^{true}$ is the true fault label. The two parts of the domain loss are respectively denoted as $\mathcal{L}_{ds}$ and $\mathcal{L}_{dt}$ for the source and the target domain query samples. They also use the cross entropy loss and are formulated as

$$\mathcal{L}_{ds} = \sum_{i=1}^{n_{bs}} J(x_i^{s,q}, d_i^{s,q} \,|\, \theta) = -\sum_{i=1}^{n_{bs}} d_i^{true} \log d_i^{s,q} \quad (8)$$

and

$$\mathcal{L}_{dt} = \sum_{i=1}^{n_{bt}} J(x_i^{t,q}, d_i^{t,q} \,|\, \theta) = -\sum_{i=1}^{n_{bt}} d_i^{true} \log d_i^{t,q}, \quad (9)$$

where $n_{bs}$ and $n_{bt}$ are the numbers of query samples from the source and the target domain respectively, $d_i^{s,q}$ and $d_i^{t,q}$ are the estimated domain labels of the query samples, and $d_i^{true}$ is the true domain label, with $d_i^{true} = 1$ if the query sample comes from the source domain and $d_i^{true} = 0$ otherwise.

The domain adaptation loss is evaluated based on MK-MMD as discussed in Section 2.2, which is denoted as the MK-MMD loss in Fig. 2(a) and calculated as

$$\mathcal{L}_{MK\text{-}MMD} = d^2_{\mathcal{H}_k}(X^s, X^t), \quad (10)$$

where $X^s = \{x_i^{s,q}\}, i = 1, 2, \cdots, n_{bs}$ and $X^t = \{x_i^{t,q}\}, i = 1, 2, \cdots, n_{bt}$. An unbiased estimate of MK-MMD is adopted to calculate $d^2_{\mathcal{H}_k}(X^s, X^t)$ as in (Long et al., 2015), formulated as

$$d^2_{\mathcal{H}_k}(X^s, X^t) = \frac{2}{n_{bs}} \sum_{i=1}^{n_{bs}/2} g_k(z_i), \quad (11)$$

where $z_i$ is a quad-tuple defined as $z_i \triangleq (x_{2i-1}^{s,q}, x_{2i}^{s,q}, x_{2i-1}^{t,q}, x_{2i}^{t,q})$, and $g_k(z_i)$ is calculated as

$$g_k(z_i) \triangleq k(x_{2i-1}^{s,q}, x_{2i}^{s,q}) + k(x_{2i-1}^{t,q}, x_{2i}^{t,q}) - k(x_{2i-1}^{s,q}, x_{2i}^{t,q}) - k(x_{2i}^{s,q}, x_{2i-1}^{t,q}), \quad (12)$$

where the kernel function k is the weighted combination of multiple Gaussian kernels defined in Eq. (4). The weight $\beta_u$ of kernel u is obtained by the same method as in (Long et al., 2015), reducing the kernel optimization to a quadratic program (QP). The MK-MMD loss is calculated on three layers, i.e. the highest convolutional layer in the FeatureNet module and two fully connected layers in the RelationNet module.
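The quad-tuple estimator of Eqs. (11) and (12) admits a short linear-time implementation. The following sketch, under our own naming, assumes the combined kernel of Eq. (4) is evaluated elementwise on paired rows, with the kernel weights and bandwidths passed in as given.

```python
import torch

def combined_kernel(gammas, betas):
    # Elementwise weighted combination of Gaussian kernels, Eq. (4).
    def k(a, b):
        d2 = ((a - b) ** 2).sum(dim=1)
        return sum(beta * torch.exp(-d2 / gamma) for gamma, beta in zip(gammas, betas))
    return k

def mk_mmd2_unbiased(xs, xt, kernel):
    # Linear-time unbiased MK-MMD of Eqs. (11)-(12): features are grouped into
    # quad-tuples z_i = (xs_{2i-1}, xs_{2i}, xt_{2i-1}, xt_{2i}).
    n = min(xs.size(0), xt.size(0)) // 2 * 2   # keep an even number of samples
    s1, s2 = xs[0:n:2], xs[1:n:2]
    t1, t2 = xt[0:n:2], xt[1:n:2]
    g = kernel(s1, s2) + kernel(t1, t2) - kernel(s1, t2) - kernel(s2, t1)
    return g.mean()                            # equals (2/n) * sum over the n/2 tuples
```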
Combining the relation loss, the MK-MMD loss and the domain loss, the overall loss function can be formulated as

$$\mathcal{L} = \mathcal{L}_r + \mathcal{L}_{MK\text{-}MMD} + \mathcal{L}_{ds} + \mathcal{L}_{dt}. \quad (13)$$

In addition, trade-off parameters can be incorporated to weight the loss terms in Eq. (13) with different importance. As discussed in Section 3.1, the MK-MMD loss has three parts imposed on three layers, which can be denoted as $\mathcal{L}_{MK\text{-}MMD1}$, $\mathcal{L}_{MK\text{-}MMD2}$ and $\mathcal{L}_{MK\text{-}MMD3}$. Therefore, four trade-off parameters are incorporated and the weighted loss is written as

$$\mathcal{L} = \mathcal{L}_r + \lambda_1 \mathcal{L}_{MK\text{-}MMD1} + \lambda_2 \mathcal{L}_{MK\text{-}MMD2} + \lambda_3 \mathcal{L}_{MK\text{-}MMD3} + \lambda_4 (\mathcal{L}_{ds} + \mathcal{L}_{dt}), \quad (14)$$

where $\lambda_1$, $\lambda_2$, $\lambda_3$ and $\lambda_4$ are the trade-off parameters. MMNet is trained by minimizing the above loss as $\min_\theta \mathcal{L}$. Adam is adopted as the optimization method to train the network and optimize the network parameters $\theta$. The weights $\beta_u, u = 1, \cdots, m_u$ of the Gaussian kernels in MK-MMD are then optimized in an alternating way by QP. The details of the training process of MMNet are given in Table 1.

Table 1. Training process of MMNet.
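A minimal sketch of the weighted objective of Eq. (14) is given below. The function signature is our own illustrative choice, and the default trade-off values are the 4-way 1-shot settings reported later in Table 3.

```python
def mmnet_loss(relation_loss, mmd_losses, domain_losses,
               lambdas=(2.25, 1.25, 0.5, 0.1)):
    # Weighted objective of Eq. (14): relation loss, three layer-wise MK-MMD
    # losses, and the source/target domain classification losses.
    l1, l2, l3, l4 = lambdas
    m1, m2, m3 = mmd_losses
    return relation_loss + l1 * m1 + l2 * m2 + l3 * m3 + l4 * sum(domain_losses)
```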
4. Experiment results and discussions

4.1. Datasets and experiment setting

Four datasets were employed to test the effectiveness of MMNet; their specifications are given in Table 2. Among these four datasets, the first two were recorded in the laboratory with artificial faults, the third was collected in the laboratory with run-to-failure faults, and the last was collected from bearings used in practical application. All the data are vibration signals collected by accelerometers from operating bearings. Four classes of health condition are incorporated in these datasets: normal condition (NC), inner race fault (IF), outer race fault (OF) and ball fault (BF). The test benches that produced the four datasets are illustrated in Fig. 4, together with an illustration of the four types of health condition. The bearings differ in specification model, rotation speed, working load and sampling rate. Vibration signals from the same type of rotary part with the same fault are expected to show similar characteristics, which makes it possible to transfer knowledge between different datasets.

Table 2. Dataset specifications.

Dataset  Bearing specs  Health conditions  Number of samples  Operation configuration
A        SKF6205        NC, IF, OF, BF     4 × 101            0 HP, 1797 r/min
B        SKF6205        NC, IF, OF, BF     4 × 101            3 HP, 1730 r/min
C        ZA-2115        NC, IF, OF, BF     4 × 101            6000 lbs, 2000 r/min
D        552732QT       NC, IF, OF, BF     4 × 101            9800 N, 500 r/min

Fig. 4. Test bench of CWRU [23], IMS [31] and RL [24] bearing datasets and the corresponding health condition illustration [6].

Datasets A and B are from the CWRU bearing dataset provided by Case Western Reserve University (Center). The vibration data were collected from a motor bearing experiment platform (Fig. 4(a)) with a sampling frequency of 12 kHz. Artificial single point faults were made on the bearings and the corresponding vibration signals were collected in a laboratory environment. The diameter of the point fault was 0.0014 in. Datasets A and B were respectively collected under 0 HP and 3 HP motor loads. For each health condition, 101 samples are used in this study, each with 1024 data points, so there are 404 samples in total in each of datasets A and B.

Dataset C is from the IMS bearing dataset, provided by the NSF I/UCR Center for Intelligent Maintenance Systems (IMS) (Qiu, Lee, Lin, & Yu, 2006). Four bearings were installed on a shaft rotating at a constant speed of 2000 RPM. Accelerometers were installed on the bearing housing to collect vibration signals, a radial load of 6000 lbs was imposed on the shaft, and the sampling frequency was 20 kHz. Dataset C also comprises 404 samples in this study, each 1024 data points long.

Dataset D comes from the RL bearing dataset provided by Xi'an Jiaotong University (Lei, 2017). Different from the previous three datasets, where the bearing faults were artificially produced in the laboratory, the RL bearing dataset was collected from practically used railway locomotive (RL) rolling element bearings. An accelerometer was mounted on the outer race of the bearing to collect the vibration signal. A working load of 9800 N was adopted and the sampling rate was 12.8 kHz. The dataset includes the same four health conditions as the previous three datasets, with the same number of samples and the same sample length.

4.2. MMNet performance and comparisons

MMNet was implemented in Python with PyTorch. All the experiments were performed on a PC equipped with a 3.2 GHz Intel i7 CPU and a TITAN Xp GPU.

4.2.1. Experiment settings in MMNet

Based on the four datasets detailed in Section 4.1, three transfer tasks have been used to validate the efficiency of MMNet: A → D, B → D and C → D. The bearing faults of datasets A, B and C were generated in the laboratory and those of dataset D occurred during practical application. Therefore, datasets A, B and C are used as the source datasets and D is adopted as the target dataset to implement knowledge transfer from laboratory data to practical data.

Episode based training from few shot learning is employed to efficiently learn from a small number of samples. Specifically, three few shot learning scenarios have been adopted: 4-way 1-shot, 4-way 5-shot and 4-way 10-shot. In each episode, one template set from the source domain and two query sets, respectively from the source and the target domain, are used for training. The query set from the source domain is labeled and is used for domain classification and fault classification; the query set from the target domain is unlabeled and is used for domain classification and domain adaptation. In the source branch of MMNet, the category of a query sample is determined by the largest of the obtained relation scores.

In one episode of k-way m-shot training, k classes, each with m randomly selected samples, are used as the template set, and a fraction of the remaining data are taken as the query set. In each episode of the 4-way 1-shot experiments, one example from each class of the source dataset is randomly selected to form the template set, and 29 random examples per class are respectively selected from the source and the target dataset as the query sets. For the upper source branch of MMNet, both the template set and the query set are selected from the source dataset. For the bottom target branch, the same template set as the source branch is adopted, while the query set is selected from the target dataset and no label information is required. In the 4-way 1-shot experiments, the total number of examples used for training in each episode is 1 × 4 + 29 × 4 + 29 × 4 = 236. In the 4-way 5-shot experiments, 5 random examples per class from the source dataset form the template set and 25 examples per class, respectively from the source and the target dataset, form the query sets, so the total number of examples in each episode is 5 × 4 + 25 × 4 + 25 × 4 = 220. Similarly, in the 4-way 10-shot setting, the total number of examples in each episode is 10 × 4 + 20 × 4 + 20 × 4 = 200. All the labeled data from the source domain and 200 unlabeled examples from the target domain are used to generate the training set in each episode.
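The episode sizes quoted above follow directly from the k-way m-shot bookkeeping, as the small check below illustrates (a hypothetical helper, shown only to make the arithmetic explicit).

```python
def episode_size(k, m, n_query):
    # k*m templates + k*n_query source queries + k*n_query target queries.
    return k * m + 2 * k * n_query

assert episode_size(4, 1, 29) == 236   # 4-way 1-shot
assert episode_size(4, 5, 25) == 220   # 4-way 5-shot
assert episode_size(4, 10, 20) == 200  # 4-way 10-shot
```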
The remaining 204 samples (51 × 4 = 204) from the target domain are used for testing.
4.2.2. Parameter settings in MMNet

Adam is adopted to optimize MMNet. The number of training episodes is set to 10,000 and the learning rate is 5 × 10⁻⁴. The trade-off parameters λ1, λ2 and λ3 of the three MK-MMD losses and the trade-off parameter λ4 of the domain loss are given in Table 3.

Table 3. Trade-off parameters of the MK-MMD loss and domain loss.

Experiment setting   λ1     λ2     λ3     λ4
4-way 1-shot         2.25   1.25   0.50   0.1
4-way 5-shot         1.00   0.50   0.20   0.1
4-way 10-shot        2.00   1.00   0.75   0.1

It has been discovered in previous research that from the shallower to the deeper layers of a convolutional neural network, the learned features turn from general to specific. The general features shared across domains are easier to transfer than the specific ones, so the transferability of the features decreases with increasing network depth. Larger MK-MMD trade-off parameters should therefore be selected for the lower layers, and smaller ones for the higher layers, to allow for task-specific tuning. To verify this statement, grid search experiments have been conducted to search for the optimal trade-off parameters in an exhaustive manner. The details of the parameter selection procedure are given in Table 4.

Table 4. Trade-off parameter selection procedures.

The MK-MMD trade-off parameters are selected within the range [0.1, 5] with an increment of 0.05. In each experiment scenario, 10 examples from the test set (query set) are randomly separated as a validation set for parameter selection. Considering the high computational cost, no cross validation procedure is used. The experiments show that MMNet fails to obtain satisfactory performance when the three parameters take identical values: a fault classification accuracy of around 83 % was obtained in these experiments, and in some of them the network even failed to converge. Similar results were observed when the parameter values increase from λ1 to λ3. When the parameters take a random order (neither monotonically increasing nor decreasing), some good results were obtained, and better classification performance was achieved when the trade-off parameters are in decreasing order. The optimal values of the three MK-MMD loss trade-off parameters were selected based on the grid search results, as shown in Table 3. The parameter selection results also indicate that the model is quite robust to parameter variation, with a mean accuracy of 89.86 % and a standard deviation of 6.02 %.

During the search for the three MK-MMD loss parameters, the domain loss parameter was fixed at 0.1 to reduce computational cost, a value which showed relatively excellent performance throughout the experiments. After the three MK-MMD loss trade-off parameters were selected, they were fixed to further select the domain loss trade-off parameter λ4. Experiments with λ4 from {0.001, 0.01, 0.1, 1, 10, 100} were performed, and based on the results, 0.1 was selected.

In each domain adaptation operation with MK-MMD, 5 Gaussian kernels are adopted. The base bandwidth γ is set as the median of the pairwise distances of the training samples from both the source and the target domain, and the bandwidths of the $m_u$ Gaussian kernels are obtained by varying the bandwidth between $2^{-\lfloor m_u/2 \rfloor}\gamma$ and $2^{\lfloor m_u/2 \rfloor}\gamma$ with a scaling factor of 2, where $\lfloor \cdot/\cdot \rfloor$ denotes integer division.
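The bandwidth schedule described above can be sketched as follows; the handling of the median heuristic (computed over pooled source and target samples, including the zero self-distances) is our assumption about details the text leaves open.

```python
import torch

def mk_mmd_bandwidths(xs, xt, num_kernels=5):
    # Base bandwidth: median pairwise distance over the pooled source and
    # target training samples. The num_kernels bandwidths then span
    # 2^(-floor(n/2)) * gamma .. 2^(floor(n/2)) * gamma in factors of 2.
    x = torch.cat([xs, xt], dim=0)
    gamma = torch.cdist(x, x).median().item()
    half = num_kernels // 2
    return [gamma * 2.0 ** p for p in range(-half, half + 1)]
```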
4.2.3. Performance of MMNet and comparison with other methods

To verify the performance of MMNet, the three transfer tasks A → D, B → D and C → D discussed in Section 4.2.1 have been carried out. For each transfer task, the three few shot learning experiment settings were tested. The results are reported in Table 5. Excellent fault classification performance has been obtained on all three transfer tasks, and the performance of MMNet improves as the number of examples in the template set increases. The average fault classification accuracy is above 99 %, which is a superior transfer performance for bearing fault diagnosis.

Table 5. Classification accuracy (%) of MMNet on different transfer tasks.

Experiment setting   A → D   B → D   C → D   Avg
4-way 1-shot         99.62   98.75   99.25   99.21
4-way 5-shot         99.64   99.90   99.70   99.75
4-way 10-shot        99.95   99.98   99.72   99.88

To further validate the effectiveness of MMNet, extensive comparison experiments have been conducted. Multiple state-of-the-art transfer learning methods have been included for comparison, including Transfer Component Analysis (TCA) (Pan, Tsang, Kwok, & Yang, 2011), Deep Domain Confusion (DDC) (Tzeng, Hoffman, Zhang, Saenko, & Darrell, 2014), a modified Deep Adaptation Network (DAN) (Long et al., 2015), the feature-based transfer neural network (FTNN) (Yang et al., 2019), G-ResNet (Yang et al., 2020), P-ResNet (Yang et al., 2020) and TrResNet (Yang et al., 2020). In addition, a convolutional neural network (CNN) has been incorporated as a baseline method. To make fair comparisons, we use the publicly available source code provided by the authors of the above methods.
When the code of a method is not publicly available, its results are borrowed directly from the original paper for the same transfer task. When neither the source code nor the corresponding experiment results are available, a "/" mark is used in Table 6, which reports the comparison results. In the baseline CNN method, no transfer learning related techniques are applied: the labeled data from the source dataset form the training set and the unlabeled data from the target dataset construct the testing set.
To achieve an optimal CNN performance for comparison, various CNN architectures have been evaluated. Specifically, CNNs of different depths have been tested, including CNNs with five, three and two convolutional layers. In each CNN, one flatten layer and one fully connected layer follow the convolutional layers, cross-entropy is used as the loss function, and Softmax is applied at the output layer for classification. Meanwhile, our experiments have shown that average pooling obtains better performance than max pooling, so average pooling has been adopted in these baseline CNNs. In the other compared CNN based solutions, average pooling is also adopted instead of max pooling to ensure fair comparison. The CNN with two convolutional layers and two fully connected layers obtained the best fault diagnosis performance, and its results are given in Table 6.

TCA is a classic transfer learning method which projects the source data and the target data into a new subspace where their distributions are closer than in the original space. In the implementation of TCA, the regularization trade-off parameter is selected from {0.01, 0.1, 1, 10, 100} and the subspace dimension is selected from {2, 4, 8, 16, 32, 64, 128, 256} via experiments. Based on the representations of all the samples in the transformed subspace, a support vector machine (SVM) classifier is trained for fault classification.

The baseline CNN architecture selected via experiments has been adopted in DDC, with MK-MMD based domain adaptation used in the layer before the Softmax classification layer. For the compared DAN method, the same CNN structure is used and domain adaptation with MK-MMD is applied to the flatten layer and the last fully connected layer before the output layer. The specifications of the CNN structure adopted in the baseline CNN, DDC and DAN are given in Table 7, where "/" means not applicable. In both DDC and DAN, all the labeled data of the source dataset and part of the unlabeled data of the target dataset are used for model training, with a dataset partition similar to that of MMNet. The experiment results of FTNN are borrowed from its original publication (Yang et al., 2019).

In G-ResNet, P-ResNet and TrResNet, eight ResNet blocks are used to construct the backbone network. G-ResNet adopts Gaussian kernel based MMD for domain adaptation, while P-ResNet and TrResNet use polynomial kernel based MMD; in addition, pseudo label learning is applied in TrResNet. The reported results of these three methods are borrowed from (Yang et al., 2020), where the detailed model configurations can be found. In the experiments of these three methods, both datasets A and B from our experiment setting are used as the source domain and dataset D is treated as the target domain. Therefore, the results of transfer tasks A → D and B → D are the same, as reported in Table 6.

The raw vibration data are used as the input to CNN, DDC, DAN, FTNN, G-ResNet, P-ResNet, TrResNet and MMNet. To obtain better fault diagnosis performance for TCA, the frequency spectrum instead of the raw vibration data is adopted as its input. In Table 6, the best results are highlighted in bold.
From these results it can be seen that the neural network based solutions obtain significantly better performance than the traditional transfer learning method TCA. The performance of the baseline CNN, with no transfer learning component involved, is relatively poor; its best performance on the three transfer tasks is 57.67 %. TrResNet, published in 2020, ranks second best. Among all the compared methods, MMNet obtains the best fault classification accuracy: the accuracy on all three transfer tasks is above 99 %, and the smallest accuracy increase over the second best result reaches 10.94 %.

The t-SNE (t-distributed stochastic neighbor embedding) method is employed to visualize the transfer features learned by the compared methods; the visualization results are given in Fig. 5. The intermediate feature representations of G-ResNet, P-ResNet and TrResNet are not available, so their visualization results are not provided. The visualization is conducted on the transfer task A → D. In Fig. 5, the notation "S-" means the corresponding samples come from the source domain and "T-" means the samples come from the target domain. Fig. 5 covers the frequency spectrum features, TCA, CNN, DDC, DAN and MMNet. The results show that the feature distribution difference between the source and the target domain is quite obvious for the frequency spectrum, TCA, CNN and DDC. Among these methods, the features obtained by TCA are aggregated within one class from the same domain but still scattered for the same class across domains when compared with CNN, DDC and DAN, which explains the relatively better performance of the latter three methods. The domain discrepancy of the features learned by DAN and MMNet is obviously reduced compared with the former four methods: for both, samples from the same class are well aggregated even when they come from different domains. Comparing MMNet with DAN, the distance between different classes obtained by MMNet is obviously larger, and the samples from the same class are more tightly aggregated in MMNet than in DAN. This well-formed sample distribution structure explains the excellent classification performance of MMNet.

Fig. 5. Visualization of the learned features with t-SNE. (a) Frequency spectrum feature. (b) TCA. (c) CNN. (d) DDC. (e) DAN. (f) MMNet.

To take a closer look at the classification performance, the confusion matrices of TCA, CNN, DDC, FTNN, DAN and MMNet are visualized in Fig. 6. From the listed results, it can be seen that a large number of samples are misclassified by both TCA and CNN. The results of DDC, FTNN and DAN are better than those of TCA and CNN. The performance of MMNet is clearly superior to all the other compared methods, which validates its efficiency.

Fig. 6. Confusion matrices of the transfer results of dataset A → D. (a) TCA. (b) CNN. (c) DDC. (d) FTNN. (e) DAN. (f) MMNet.
Table 6. Accuracy comparison results (%) of different transfer learning methods for fault diagnosis.

Method     Input               A → D   B → D   C → D
CNN        Raw vibration       57.67   53.17   53.96
TCA        Frequency spectrum  51.48   41.58   25.00
DDC        Raw vibration       80.84   77.80   81.22
DAN        Raw vibration       83.52   78.90   86.27
FTNN       Raw vibration       83.69   84.95   /
G-ResNet   Raw vibration       84.32   84.32   /
P-ResNet   Raw vibration       87.76   87.76   /
TrResNet   Raw vibration       88.27   88.27   /
MMNet      Raw vibration       99.21   99.75   99.88

Table 7. Specifications of the CNN structure in the baseline CNN, DDC and DAN.

Layer   Operation        Convolutional kernel width   Number of channels   Output size
Input   /                /                            /                    1024 × 1 × 1
C1      Convolution      3 × 1                        20                   1024 × 1 × 20
P1      AvgPooling       2 × 1                        /                    512 × 1 × 20
C2      Convolution      3 × 1                        20                   512 × 1 × 20
P2      AvgPooling       2 × 1                        /                    256 × 1 × 20
FC1     Flatten          /                            /                    5120 × 1
FC2     Fully connected  5120 × 256                   /                    256 × 1
Output  Fully connected  256 × 4                      /                    4 × 1
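For reference, Table 7 translates into the following PyTorch sketch of the baseline backbone shared by CNN, DDC and DAN. The ReLU activations and the convolution padding of 1 (needed to reproduce the printed output sizes) are our assumptions.

```python
import torch
import torch.nn as nn

# Baseline backbone of Table 7. With padding 1, the printed output sizes are
# reproduced: 1024 -> 512 -> 256 -> flatten width 5120.
baseline_cnn = nn.Sequential(
    nn.Conv1d(1, 20, kernel_size=3, padding=1), nn.ReLU(),   # C1: 1024 x 20
    nn.AvgPool1d(2),                                         # P1: 512 x 20
    nn.Conv1d(20, 20, kernel_size=3, padding=1), nn.ReLU(),  # C2: 512 x 20
    nn.AvgPool1d(2),                                         # P2: 256 x 20
    nn.Flatten(),                                            # FC1: 5120
    nn.Linear(5120, 256), nn.ReLU(),                         # FC2: 256
    nn.Linear(256, 4),                                       # Output: 4 health conditions
)

logits = baseline_cnn(torch.randn(8, 1, 1024))               # -> (8, 4)
```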
4.3. Ablation study

Several components contribute to the performance of MMNet, among which the three major ones are the double channel feature extraction mechanism, the multiple-layer domain adaptation and the average pooling. In order to verify the effectiveness of each component, an ablation study has been conducted.
To test the necessity of the double channel feature extraction mechanism, comparison experiments with a network containing only the common feature extraction channel have been performed. The remaining components, such as the multi-layer adaptation and average pooling, are kept the same. The comparison results, averaged over the three experiment settings (4-way 1-shot, 4-way 5-shot and 4-way 10-shot) on each transfer task, are reported in Fig. 7. The results with only the cross domain common feature channel are indicated as "one channel", and the results with both the cross domain common feature channel and the domain specific feature channel are denoted as "double channel". The highest accuracy obtained by the one channel setting is 98.15 % on transfer task A → D, while the corresponding result of the double channel setting is 99.21 %. For all three transfer tasks, the double channel setting of MMNet obtains better performance than the one channel setting, which verifies the effectiveness of the double channel feature extraction mechanism in MMNet.

Fig. 7. Comparison results with and without the domain discriminant feature extraction channel on three transfer tasks.
One key factor influencing the performance of the multi-layer domain adaptation in MMNet is the number of Gaussian kernels used in MK-MMD. When the number of Gaussian kernels reduces to 1, MK-MMD degenerates to MMD. To compare the performance of different numbers of kernels, experiments with 1, 3, 5, 7 and 9 kernels have been performed on each transfer task for 10 runs; the comparison results are illustrated in Fig. 8. When the number of kernels increases from 1 to 3 and from 3 to 5, a significant improvement of fault classification accuracy can be observed. When the number of kernels changes from 5 to 7 and 9, the performance variation is relatively small, while the computational complexity of MMNet increases with the number of kernels. Therefore, the number of kernels in our experiments has been set to 5.

Fig. 8. Comparison of classification results with different numbers of Gaussian kernels used in MK-MMD domain adaptation on three transfer tasks. (a) Results on transfer task A → D. (b) Results on transfer task B → D. (c) Results on transfer task C → D.

In addition, to test the efficiency of average pooling, comparison experiments against max pooling were conducted. The three convolutional layers in the FeatureNet module of MMNet use average pooling instead of max pooling to suppress noise within the vibration time sequence. We respectively replaced the average pooling in the first convolutional layer C1, in the first two convolutional layers C1 and C2, and in all three convolutional layers C1, C2 and C3 of the FeatureNet module to test the effectiveness of average pooling. The experiment results show that the advantage of average pooling is reflected in two aspects: accelerating the convergence of the training stage and improving the classification accuracy. In comparison with max pooling, the fault classification accuracy on the transfer tasks was improved by up to more than 5 % in our experiments. Meanwhile, it took about 2,000 episodes to train MMNet with average pooling following all the convolutional layers, whereas more than 30,000 episodes were needed when max pooling was used instead. Average pooling has thus greatly improved the training speed of MMNet.

4.4. Computational complexity comparison

Besides the above model performance comparison, the computational complexity of the models has also been compared. Considering that training and operation times differ across hardware platforms, the model structure complexity and the number of trainable parameters are summarized and compared in Table 8, where models with the same backbone network structure are listed in the same row. In MMNet, the weights are shared across channels, so the complexity of only one channel needs to be considered. From Table 8, it can be seen that the total number of trainable parameters of MMNet is the smallest among the compared models, only about 1/2 to 1/4 of the others. More convolutional layers (vs CNN/DDC/DAN and FTNN), smaller convolutional kernels (vs G-ResNet, P-ResNet and TrResNet) and narrower fully connected layers lead to the more concise structure of MMNet. Therefore, MMNet has lower computational complexity than the other compared models.

Table 8. Model computational complexity comparisons.

Model                        Number of convolutional layers (size)   Number of fully connected layers (size)   Number of parameters
CNN/DDC/DAN                  2 × (3 × 1 × 20)                        2 (5120 × 256, 256 × 4)                   1,311,864
FTNN                         2 (5 × 1 × 20, 5 × 20 × 20)             2 (5941 × 256, 256 × 4)                   1,524,084
G-ResNet/P-ResNet/TrResNet   16 × (3 × 20 × 20)                      2 (6000 × 512, 512 × 4)                   3,093,248
MMNet                        5 × (3 × 1 × 20)                        3 (5120 × 2, 1280 × 512, 512 × 256)       796,972

5. Conclusions

Existing deep transfer networks try to transfer all the extracted features of fault data across different domains. Considering that there might be features which only benefit classification in a specific domain and cannot provide common information across domains, MMNet, a neural network solution which separately considers the features appropriate and inappropriate for transfer, is developed.
In MMNet, a domain level classification and a fault level classification are combined to extract domain specific discriminant features, while multi-layer MK-MMD based domain adaptation and fault level classification are combined to extract cross domain common features. A classic few shot learning network structure, relation network, is employed as the backbone, and a Siamese double branch structure is incorporated to process the samples from the source and the target domain simultaneously. The relation score based classification mechanism can perform fault diagnosis without labeled data from the target domain. Four datasets have been used to test the effectiveness of MMNet, and the results have verified its efficiency. The transfer fault classification accuracy is significantly improved compared with other state-of-the-art transfer solutions in fault diagnosis, with a fault classification accuracy over 99 % obtained on all three transfer tasks in the experiments.

The outcome of this research has verified the different competence of the learned features for different domains, and the multi-level classification mechanism enables implicit discrimination of these features. How to further, and even explicitly, evaluate the efficiency of different features for a specific domain remains a challenging problem. One promising direction is to incorporate a metric like the Kullback-Leibler divergence to measure the similarity among features; it is also possible to learn a metric for feature evaluation and embed the metric learning module into the fault diagnosis scheme. Another promising direction is to include channel attention, self attention and cross attention mechanisms in the fault diagnosis network, based on which the salient features for different domains could be treated separately. In addition, the main idea of MMNet can also be applied directly to other classification applications such as brain signal recognition across subjects, activity recognition across people, and image classification under different imaging conditions.

CRediT authorship contribution statement

Na Lu: Conceptualization, Funding acquisition, Methodology, Validation, Writing – review & editing. Zhiyan Cui: Investigation, Software. Huiyang Hu: Data curation, Visualization. Tao Yin: Validation, Writing – review & editing.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgement

This work is supported by National Key R&D Program of China 2018YFB1306100 and National Natural Science Foundation of China grant 61876147.

References

Center, C. W. R. U. B. D. Retrieved from http://csegroups.case.edu/bearingdatacenter/home.

Che, C., Wang, H., Ni, X., & Fu, Q. (2020). Domain adaptive deep belief network for rolling bearing fault diagnosis. Computers & Industrial Engineering, 143, Article 106427. https://doi.org/10.1016/j.cie.2020.106427
Chen, Z., He, G., Li, J., Liao, Y., Gryllias, K., & Li, W. (2020). Domain adversarial transfer network for cross-domain fault diagnosis of rotary machinery. IEEE Transactions on Instrumentation and Measurement, 69(11), 8702–8712. https://doi.org/10.1109/TIM.2020.2995441

Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., & Li, F.-F. (2009). ImageNet: A large-scale hierarchical image database. Paper presented at the IEEE Conference on Computer Vision and Pattern Recognition.

Fan, H., Lin, L., Yang, F., Chu, P., Deng, G., Yu, S., … Ling, H. (2019). LaSOT: A high-quality benchmark for large-scale single object tracking. Paper presented at the IEEE Conference on Computer Vision and Pattern Recognition.

Finn, C., Abbeel, P., & Levine, S. (2017). Model-agnostic meta-learning for fast adaptation of deep networks. Paper presented at the International Conference on Machine Learning, Sydney, NSW, Australia.

Gretton, A., Borgwardt, K. M., Rasch, M. J., Scholkopf, B., & Smola, A. (2012). A kernel two-sample test. Journal of Machine Learning Research, 13, 723–773.

Guo, L., Lei, Y., Xing, S., Yan, T., & Li, N. (2019). Deep convolutional transfer learning network: A new method for intelligent fault diagnosis of machines with unlabeled data. IEEE Transactions on Industrial Electronics, 66(9), 7316–7325.

Jamal, M. A., & Qi, G.-J. (2019). Task agnostic meta-learning for few-shot learning. Paper presented at the IEEE Conference on Computer Vision and Pattern Recognition.

Jia, F., Lei, Y., Lu, N., & Xing, S. (2018). Deep normalized convolutional neural network for imbalanced fault classification of machinery and its understanding via visualization. Mechanical Systems and Signal Processing, 110, 349–367. https://doi.org/10.1016/j.ymssp.2018.03.025

Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., & Li, F.-F. (2014). Large-scale video classification with convolutional neural networks. Paper presented at the IEEE Conference on Computer Vision and Pattern Recognition.

Lei, Y. (2017). Intelligent fault diagnosis and remaining useful life prediction of rotating machinery. Butterworth-Heinemann.

Lei, Y., Jia, F., Lin, J., Xing, S., & Ding, S. X. (2016). An intelligent fault diagnosis method using unsupervised feature learning towards mechanical big data. IEEE Transactions on Industrial Electronics, 63(5), 3137–3147. https://doi.org/10.1109/TIE.2016.2519325

Li, J., Huang, R., He, G., Wang, S., Li, G., & Li, W. (2020). A deep adversarial transfer learning network for machinery emerging fault detection. IEEE Sensors Journal, 20(15), 8413–8422. https://doi.org/10.1109/JSEN.2020.2975286

Li, X., Zhang, W., & Ding, Q. (2018). A robust intelligent fault diagnosis method for rolling element bearings based on deep distance metric learning. Neurocomputing, 310, 77–95.

Li, X., Zhang, W., Ding, Q., & Sun, J.-Q. (2019). Multi-layer domain adaptation method for rolling bearing fault diagnosis. Signal Processing, 157, 180–197. https://doi.org/10.1016/j.sigpro.2018.12.005

Long, M., Cao, Y., Wang, J., & Jordan, M. (2015). Learning transferable features with deep adaptation networks. Paper presented at the International Conference on Machine Learning.

Lu, N., & Yin, T. (2021). Transferable common feature space mining for fault diagnosis with imbalanced data. Mechanical Systems and Signal Processing, 156, Article 107645. https://doi.org/10.1016/j.ymssp.2021.107645
Lu, W., Liang, B., Cheng, Y., Meng, D., Yang, J., & Zhang, T. (2017). Deep model based domain adaptation for fault diagnosis. IEEE Transactions on Industrial Electronics, 64(3), 2296–2305. https://doi.org/10.1109/TIE.2016.2627020

Pan, S. J., Tsang, I. W., Kwok, J. T., & Yang, Q. (2011). Domain adaptation via transfer component analysis. IEEE Transactions on Neural Networks, 22(2), 199–210.

Qiu, H., Lee, J., Lin, J., & Yu, G. (2006). Wavelet filter-based weak signature detection method and its application on rolling element bearing prognostics. Journal of Sound and Vibration, 289(4), 1066–1090. https://doi.org/10.1016/j.jsv.2005.03.007

Shao, S., McAleer, S., Yan, R., & Baldi, P. (2019). Highly accurate machine fault diagnosis using deep transfer learning. IEEE Transactions on Industrial Informatics, 15(4), 2446–2455. https://doi.org/10.1109/TII.2018.2864759

Snell, J., Swersky, K., & Zemel, R. (2017). Prototypical networks for few-shot learning. Paper presented at Advances in Neural Information Processing Systems.

Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P. H. S., & Hospedales, T. M. (2018). Learning to compare: Relation network for few-shot learning. Paper presented at the IEEE Conference on Computer Vision and Pattern Recognition.

Tzeng, E., Hoffman, J., Zhang, N., Saenko, K., & Darrell, T. (2014). Deep domain confusion: Maximizing for domain invariance. arXiv preprint arXiv:1412.3474.

Vinyals, O., Blundell, C., Lillicrap, T., Kavukcuoglu, K., & Wierstra, D. (2016). Matching networks for one shot learning. Paper presented at the International Conference on Neural Information Processing Systems, Barcelona, Spain.

Wen, L., Gao, L., & Li, X. (2017). A new deep transfer learning based on sparse auto-encoder for fault diagnosis. IEEE Transactions on Systems, Man, and Cybernetics: Systems, 49(1), 136–144.

Wen, L., Gao, L., & Li, X. (2019). A new deep transfer learning based on sparse auto-encoder for fault diagnosis. IEEE Transactions on Systems, Man, and Cybernetics: Systems, 49(1), 136–144. https://doi.org/10.1109/TSMC.2017.2754287

Xu, G., Liu, M., Jiang, Z., Shen, W., & Huang, C. (2020). Online fault diagnosis method based on transfer convolutional neural networks. IEEE Transactions on Instrumentation and Measurement, 69(2), 509–520. https://doi.org/10.1109/TIM.2019.2902003

Yang, B., Lei, Y., Jia, F., Li, N., & Du, Z. (2020). A polynomial kernel induced distance metric to improve deep transfer learning for fault diagnosis of machines. IEEE Transactions on Industrial Electronics, 67(11), 9747–9757. https://doi.org/10.1109/TIE.2019.2953010

Yang, B., Lei, Y., Jia, F., & Xing, S. (2019). An intelligent fault diagnosis approach based on transfer learning from laboratory bearings to locomotive bearings. Mechanical Systems and Signal Processing, 122, 692–706. https://doi.org/10.1016/j.ymssp.2018.12.051

Yu, H., Wang, K., Li, Y., & Zhao, W. (2019). Representation learning with class level autoencoder for intelligent fault diagnosis. IEEE Signal Processing Letters, 26(10), 1476–1480. https://doi.org/10.1109/LSP.2019.2936310

Zhang, W., Li, X., Jia, X.-D., Ma, H., Luo, Z., & Li, X. (2020). Machinery fault diagnosis with imbalanced data using deep generative adversarial networks. Measurement, 152, Article 107377. https://doi.org/10.1016/j.measurement.2019.107377