Expert Systems With Applications 213 (2023) 119057
Multi-view and Multi-level network for fault diagnosis accommodating
feature transferability
Na Lu *, Zhiyan Cui, Huiyang Hu, Tao Yin
Systems Engineering Institute, School of Automation Science and Engineering, Xi’an Jiaotong University, Xi’an, Shaanxi 710049, China
ARTICLE INFO
Keywords:
Transfer learning
Feature transferability
Fault diagnosis
Few shot learning
ABSTRACT
Various deep transfer learning solutions have been developed for machine fault diagnosis. The existing solutions mainly focus on domain adaptation by minimizing the data distribution discrepancy with a certain metric, which emphasizes the common features embedded in the data across domains and neglects the unique features relevant to health condition classification in one specific domain. In these solutions, all the training data are forced to align in a common feature space and all the features for domain adaptation are treated equally. However, there might exist domain specific features which are not appropriate for transfer but may contain essential information for classification in a specific domain. In addition, due to the difficulty of collecting machine fault data, the number of machine fault samples is usually quite small or even zero. Traditional deep network structures and training strategies are not the optimal choice in this situation. To address these problems, a novel multi-view and multi-level network (MMNet) for fault diagnosis is developed. In MMNet, two network channels are respectively constructed for cross domain common feature and domain specific feature learning to provide multi-view features. This architecture can implicitly differentiate the features common across domains from the features specific to one domain. In the domain specific feature channel, a domain classifier and a fault classifier are combined to learn the domain specific features. Multiple kernel maximum mean discrepancy (MK-MMD) is imposed on multiple layers of the common feature channel to implement domain adaptation and extract cross domain common features. The domain classification and fault classification together form a multi-level classification scheme. A classic few shot learning architecture with two modules, respectively for feature extraction and relation computation, is adopted as the backbone network. The relation score based classification mechanism enables zero shot fault classification in the target domain. An episode based few shot training strategy is employed to enhance the performance of MMNet with few labeled training data. Extensive experiments have demonstrated the state-of-the-art performance of MMNet on the involved transfer tasks.
1. Introduction
Machine faults in industry can bring catastrophic damage and enormous economic loss (Lei, Jia, Lin, Xing, & Ding, 2016). Therefore, fault diagnosis has long been a popular and important research field, involving multidisciplinary research areas like mechanical engineering, signal processing and machine learning. Machines usually work in a healthy state during most of their life cycle. The possible faults only occur on rare occasions. Due to the long time span of the normal condition and the sporadic occurrence of faults, it is commonly acknowledged that the fault data collected from one machine are quite limited, especially in practical applications. In a laboratory environment, by contrast, it is much easier to collect manually fabricated fault data. Therefore, how to learn efficient representations of fault data and transfer the knowledge learnt from data-abundant scenarios to data-scarce scenarios is crucial for fault diagnosis.
To this end, deep learning and transfer learning have been widely explored in fault diagnosis in recent decades. Various deep network models have been employed to automatically extract discriminant features from machine fault data (Lu & Yin, 2021).
The code (and data) in this article has been certified as Reproducible by CodeOcean: https://codeocean.com. More information on the Reproducibility Badge Initiative is available at https://www.elsevier.com/physicalsciencesandengineering/computerscience/journals.
* Corresponding author.
E-mail address: lvna2009@xjtu.edu.cn (N. Lu).
https://doi.org/10.1016/j.eswa.2022.119057
Received 16 June 2021; Received in revised form 23 April 2022; Accepted 13 October 2022
Network structures like AutoEncoder (Yu, Wang, Li, & Zhao, 2019), sparse AutoEncoder (Wen, Gao, & Li, 2017) and Convolutional Neural Network (CNN) (Jia, Lei, Lu, & Xing, 2018; Yang, Lei, Jia, Li, & Du, 2020) have been widely employed for fault representation learning. In addition, Generative Adversarial Network (GAN) based methods (Chen et al., 2020; Li et al., 2020; Zhang et al., 2020) have also been employed for fault diagnosis; they aim to generate more fault samples to balance the fault dataset and improve the classification performance. Except for the GAN based solutions, most of the fault classification network architectures and their training methods were borrowed directly from classic deep learning solutions in computer vision, which fit big data applications well. However, when the fault data are not abundant, and especially when no labeled data are available, a more appropriate network architecture and training method need to be developed.
Another important issue in fault diagnosis is how to transfer the
knowledge from the domain with relatively abundant labeled data
(source domain) to the domain with few or no labeled data (target
domain). Here the different domains could be understood as different
machines or one machine under different working conditions. To
address this issue, many solutions combining deep neural network and
transfer learning have been developed (Li et al., 2020; Li, Zhang, Ding, &
Sun, 2019; Shao, McAleer, Yan, & Baldi, 2019; Xu, Liu, Jiang, Shen, &
Huang, 2020; Yang et al., 2020; Yang, Lei, Jia, & Xing, 2019) which we
refer to as deep transfer learning methods for simplicity. These methods
mainly aimed at minimizing the distribution discrepancy between
different domains and improving the fault classification accuracy. To
fulfill domain adaptation, multiple metrics of data distribution have
been applied, including Maximum Mean Discrepancy (MMD) (Yang
et al., 2019), Multi-kernel Maximum Mean Discrepancy (MK-MMD)
(Che, Wang, Ni, & Fu, 2020), Polynomial-kernel Maximum Mean
Discrepancy (PK-MMD) (Yang et al., 2020) and so on. These metrics
evaluate the data distribution difference which is used as the domain
adaptation loss to train the fault diagnosis model. The training objective
functions of deep transfer learning models usually contain two parts,
classification loss and domain adaptation loss. By minimizing the overall
loss of these terms, the deep transfer learning models could be trained.
Long et al. (Long, Cao, Wang, & Jordan, 2015) developed a widely used
deep transfer learning method with domain adaptation. The MK-MMD loss was imposed on the last three fully connected layers, excluding the output layer, to enable domain adaptation. Lu et al. (Lu et al., 2017) adopted MMD as
the distribution discrepancy measure and developed a deep neural
network (DNN) model for fault diagnosis. The MMD loss was imposed on
the feature layer of a DNN. A gearbox dataset collected under different
working conditions was employed to evaluate the method. A deep
convolutional transfer learning network (DCTLN) was constructed by
Guo et al. (Guo, Lei, Xing, Yan, & Li, 2019) to implement fault diagnosis
knowledge transfer. One convolutional network module was used for
fault condition recognition and another convolutional network module
was used for domain distribution adaptation. Three datasets collected
from bearings were used for experiments to test the transferability of
DCTLN. Wen et al. (Wen, Gao, & Li, 2019) developed a sparse autoen­
coder for feature representation learning which used frequency spec­
trum of vibration sequences recorded from bearings as input. Domain
adaptation was implemented via MMD. Li et al. (Li, Zhang, & Ding,
2018) also proposed a domain adaptive deep convolutional neural
network for bearing fault diagnosis and fault knowledge transfer. The
fault dataset was collected under working environments with different
noise. Frequency spectrum was employed as the input to the CNN model.
The cross-domain feature discrepancy was also minimized based on
MMD. FTNN (feature-based transfer neural network) was developed by
Yang et al. (Yang et al., 2019) to diagnose the machine faults of real-case
machines by the knowledge learnt from the data recorded from labo­
ratory machines. MMD was also adopted for domain adaptation which
was imposed on multiple network layers. Four bearing fault datasets
were used to construct the transfer experiments and test the perfor­
mance of FTNN. Lu et al. (Lu & Yin, 2021) developed a combined
solution of convolutional autoencoder and convolutional network for
bearing fault diagnosis, where the convolutional autoencoder was
adopted to mine the common features cross domains. MMD was
employed for domain adaptation in the convolutional autoencoder.
From the above literature review, it can be seen that domain transfer is usually implemented by imposing a certain domain distribution metric on one or several network layers within the deep transfer model.
In these solutions, all the input data for training were treated equally for
domain adaptation. No matter what transfer learning solutions were
adopted, an intermediate data distribution space would be learnt where
the source domain and the target domain data were aligned with each
other. Therefore, an implicit assumption is actually made in the existing
deep transfer solutions that all the features learnt from the source
domain could be appropriately transferred into the intermediate feature
space, meanwhile maintaining discriminant power in both domains.
However, there is no guarantee that the features of the data belonging to
the same category from different domains could be transferred to the
same cluster in the intermediate feature space. Some original features might carry discriminant information for the source domain that gets lost after the transfer. The nonlinear feature mapping obtained by the deep transfer model is not a deterministic projection function for both domains, which means samples from the same class but different domains might be mapped to regions belonging to different classes. Mistakenly projected samples will deteriorate the performance of the transfer model and lead to false
classification. Therefore, to achieve high classification accuracy it is not
sufficient to transfer all the source samples to the common feature space
and only use the transferred common features cross domains for fault
diagnosis.
In order to keep the domain specific features and mine the features common to both domains simultaneously, a novel deep transfer solution termed the multi-view and multi-level network (MMNet) is developed in this paper. MMNet constructs a dual channel structure to learn the representations of common features across domains and discriminant features in a specific domain, which together form multi-view features for classification. Domain level classification and fault level classification are
combined to extract the domain specific features. The cross domain
common features are learnt by MK-MMD based domain adaptation and
fault level classification. In addition, to deal with the data deficiency problem, an efficient few shot learning mechanism is adopted which employs two modules, i.e., a feature extraction module and a feature comparison module, to perform fault diagnosis. Two weight shared branches are employed to extract multi-view features of both domains simultaneously, which form the feature extraction module. In the feature comparison module, the relation score between a template sample and a query sample is used to implement fault classification. In MMNet, no labeled
sample from the target domain is required. The test samples from the
target domain are compared with the template samples from the source
domain for fault diagnosis, which enables zero shot diagnosis in the
target domain. Episode based training strategy is adopted to train
MMNet.
There are three major contributions in this paper.
First, the property of the features before and after domain transfer
has been analyzed, based on which a multi-view feature extraction
mechanism incorporating domain specific features and cross domain
common features is proposed.
Second, a multi-view multi-level network MMNet is constructed
which combines fault level classification and domain level classification
to learn domain specific features, and meanwhile combines MK-MMD
based domain adaptation and fault level classification to learn com­
mon features cross domains.
Third, a FeatureNet module is used to extract sample features and a
RelationNet module is adopted to implement fault classification in
MMNet, which enables zero shot fault diagnosis in the target domain.
The paper is organized as follows. Section 1 is the introduction. Problem
formulation, transfer feature analysis and some preliminary knowledge
are discussed in Section 2. Section 3 describes the proposed solution MMNet in detail. Section 4 reports experiment and comparison results
to demonstrate the effectiveness of MMNet. Conclusions are made in
Section 5.
2. Motivation and preliminaries
2.1. Problem formulation and motivation
In a machine fault diagnosis task, data are collected from one machine under different working conditions or from different machines. The data from different working conditions or different machines follow different probability distributions, which are viewed as different domains.
Transfer learning aims at borrowing the knowledge learnt from one
domain to another domain. The former one is called source domain and
the latter one the target domain, denoted as $\mathcal{D}_s$ and $\mathcal{D}_t$ respectively. The sample spaces of the source domain and the target domain can be denoted as $X_s$ and $X_t$, which satisfy $X_s \subset \mathcal{D}_s$ and $X_t \subset \mathcal{D}_t$. The samples drawn from the source space can be represented as $\{x_i^s\}, i = 1, 2, \cdots, n_s$, and the samples from the target space as $\{x_i^t\}, i = 1, 2, \cdots, n_t$, where $n_s$ and $n_t$ are respectively the numbers of samples from the corresponding domains. The fault categories in the source and the target domain are assumed to be the same. The fault class space is denoted as $\mathcal{Y} = \{1, 2, \cdots, C\}$, where $C$ is the number of fault categories involved. Therefore, $\mathcal{Y}_s = \mathcal{Y}_t = \mathcal{Y}$. Accordingly, one labeled sample from the source and the target domain can be respectively represented as $\{x_i^s, y_i^s\}, i = 1, 2, \cdots, n_s$ and $\{x_i^t, y_i^t\}, i = 1, 2, \cdots, n_t$. In our study, the training set from the source domain is labeled and no label information from the target domain training set is used.
Transfer learning methods try to learn an intermediate feature space where the data from different domains can be aligned. When deep transfer learning methods are employed, an intermediate feature space can be constructed from the learnt features, denoted as $X_m$. At different layers of the deep model, multiple intermediate feature spaces will be learnt. For simplicity, we use $X_m$ as a general representation of all the intermediate feature spaces. The nonlinear mapping from an input sample to the intermediate feature space is represented as $\varphi: X_s, X_t \to X_m$. With an ideal nonlinear mapping, the input samples from the
source and the target domain belonging to one category should be
mapped to the same region within one class boundary in the feature
space. However, the nonlinear model learned by neural network
training is not a deterministic optimal solution. Some samples of the
same class from the source and the target domains will be mapped to
different class regions. Fig. 1 gives an illustration of the mistakenly
mapped samples. Fig. 1(a) depicts the samples within the source domain
and Fig. 1(b) shows the projected results in the intermediate feature
space from both the source and the target domain. The solid triangles
and circles in Fig. 1(a) and (b) are samples from two fault classes of the
source domain. The dotted triangles and circles in Fig. 1(b) represent the
samples from the target domain belonging to the corresponding two
classes as the source domain samples. Within the source domain, these
samples could be well classified by the classification boundary as shown
in Fig. 1(a). When the samples have been mapped to the intermediate
feature space, to correctly classify the target domain samples the ex­
pected target class boundary should be set as in Fig. 1(b). It could be seen
that some mapped source domain samples are not in agreement with the
correct class boundary. When all the mapped samples from the source
domain are treated as prior knowledge for the target domain, an actual
class boundary would be obtained as shown in Fig. 1(b). Obviously some
source domain samples have not been appropriately mapped and could
bring misleading information.
If a deep transfer learning model is employed, to alleviate the influence of the above phenomenon the weights corresponding to such misleading samples should be suppressed. Their contribution to the target domain fault classification should be minimized. However, in the source domain fault classification these samples might play an important role, and thus their corresponding weights cannot be diminished during the model training process. The existing deep transfer learning solutions treat all the samples indifferently in the domain adaptation procedure, which makes the above problem an issue to be addressed and forms one of the motivations of this study.
In addition, the widely used benchmarks for deep model training are
usually of very large scale. The popular image dataset ImageNet (Deng
et al., 2009) contains more than 10 million samples from more than 20
thousand categories. Sports-1M (Karpathy et al., 2014) is a famous
video dataset for action recognition which includes more than 1 million
videos. LaSOT (Fan et al., 2019) is a representative visual tracking
dataset which includes more than 3 million image frames. In contrast,
the fault diagnosis benchmarks like CWRU bearing dataset provided by
Case Western Reserve University (Center), IMS bearing dataset (Guo
et al., 2019) and RL bearing dataset (Lei, 2017) usually only contain
several hundred or several thousand samples.

Fig. 1. Illustration of mistakenly mapped samples from the source domain to the intermediate feature space. (a) Source domain samples and their class boundary. (b) Mapped source domain and target domain samples in the intermediate feature space and class boundaries.

Therefore, fault diagnosis is a relatively small data problem, and appropriate deep models which can deal well with few shot learning scenarios should be explored. Furthermore,
when zero labeled sample is provided in the target domain, how to
implement efficient fault knowledge transfer and fault classification
remains a challenge. This is another motivation of this work.
2.2. Multiple kernel maximum mean discrepancy
Multiple Kernel Maximum Mean Discrepancy (MK-MMD) is an
improved version of Maximum Mean Discrepancy (MMD). MMD is a
metric evaluating the data distribution distinction between the source
and the target domain. It is indicated in (Gretton, Borgwardt, Rasch,
Scholkopf, & Smola, 2012) that the probability distribution difference
between two domains could be estimated by their mean embedding in
the Reproducing Kernel Hilbert Space (RKHS) via the characteristic
kernel function. The Gaussian kernel is characteristic on $\mathbb{R}^d$ and is used to define MMD. Given i.i.d. samples from the source and the target domain, $X_s := \{x_1^s, x_2^s, \cdots, x_{n_s}^s\}$ and $X_t := \{x_1^t, x_2^t, \cdots, x_{n_t}^t\}$, which are respectively drawn from probability distributions $P_s$ and $P_t$, and supposing $\mathcal{H}_k$ is the RKHS endowed with the characteristic Gaussian kernel $k(\cdot)$, the MMD can be formulated as

$$d_{\mathcal{H}_k}(\mathcal{F}, P_s, P_t) := \sup_{f \in \mathcal{F}} \left( \frac{1}{n_s} \sum_{i=1}^{n_s} f(x_i^s) - \frac{1}{n_t} \sum_{i=1}^{n_t} f(x_i^t) \right), \tag{1}$$

where $\mathcal{F}$ is a class of functions performing a nonlinear mapping $f: X_s \to \mathbb{R}$ or $f: X_t \to \mathbb{R}$, and $\sup(\cdot)$ is the supremum of the input. The two terms in the bracket of Eq. (1) are respectively the empirical mean expectations of the source and the target domain calculated on the samples. It has been demonstrated in (Gretton et al., 2012) that the nonlinear function $f(\cdot)$ can be estimated by the endowed Gaussian kernel function. Therefore, MMD can be estimated from the data samples as

$$d^2_{\mathcal{H}_k}(X_s, X_t) = \left\| \frac{1}{n_s} \sum_{i=1}^{n_s} f(x_i^s) - \frac{1}{n_t} \sum_{i=1}^{n_t} f(x_i^t) \right\|^2_{\mathcal{H}_k} = \frac{1}{n_s^2} \sum_{i=1}^{n_s} \sum_{j=1}^{n_s} k(x_i^s, x_j^s) + \frac{1}{n_t^2} \sum_{i=1}^{n_t} \sum_{j=1}^{n_t} k(x_i^t, x_j^t) - \frac{2}{n_s n_t} \sum_{i=1}^{n_s} \sum_{j=1}^{n_t} k(x_i^s, x_j^t), \tag{2}$$

where $k(\cdot, \cdot)$ is the characteristic Gaussian kernel. Given two feature vectors $x_i$ and $x_j$, the Gaussian kernel function is defined as

$$k(x_i, x_j) = e^{-\|x_i - x_j\|^2 / \gamma}, \tag{3}$$

where $\gamma$ is the kernel width.
MMD uses a single Gaussian kernel to evaluate the distribution distinction between the source and the target domain, which suffers from suboptimal kernel selection and limited adaptation effectiveness. MK-MMD (Long et al., 2015) constructs a multiple-kernel variant of MMD, which employs a combination of multiple Gaussian kernels to measure the distribution discrepancy. The characteristic kernel used in MK-MMD is defined as

$$k = \sum_{u=1}^{m_u} \beta_u k_u, \quad \text{s.t.} \quad \sum_{u=1}^{m_u} \beta_u = 1, \; \beta_u \geq 0, \; \forall u, \tag{4}$$

where $m_u$ is the number of kernels used and $\beta_u$ is the weight of kernel $u$. In this research, Gaussian kernels are used as the base kernels. One Gaussian kernel can be written as $k_u(x_i, x_j) = e^{-\|x_i - x_j\|^2 / \gamma}$. By varying the kernel bandwidth $\gamma$ between $2^{-\lfloor m_u/2 \rfloor}\gamma$ and $2^{\lfloor m_u/2 \rfloor}\gamma$ with a scaling factor of 2, where $\lfloor \cdot \rfloor$ denotes integer division, the $m_u$ Gaussian kernels can be obtained.
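For concreteness, the following is a minimal PyTorch sketch of the (biased) MK-MMD estimate of Eqs. (2)-(4), using the median heuristic and the bandwidth family described above. The function names, the uniform kernel weights and the use of squared distances in the median heuristic are our own illustrative assumptions; the paper optimizes the weights $\beta_u$ by QP.

```python
import torch

def gaussian_kernel(x, y, gamma):
    # k(x_i, y_j) = exp(-||x_i - y_j||^2 / gamma), Eq. (3)
    return torch.exp(-torch.cdist(x, y) ** 2 / gamma)

def mk_mmd(xs, xt, n_kernels=5, betas=None):
    """Biased estimate of the squared MK-MMD of Eq. (2) with the
    multi-kernel of Eq. (4). xs, xt: (n, d) feature batches."""
    # Median heuristic for the base bandwidth (Section 4.2.2); taking the
    # median of squared pairwise distances is our reading of the paper
    z = torch.cat([xs, xt], dim=0)
    gamma = torch.median(torch.cdist(z, z) ** 2)
    # Bandwidth family 2^{-floor(m_u/2)} gamma, ..., 2^{floor(m_u/2)} gamma
    gammas = [gamma * 2.0 ** (u - n_kernels // 2) for u in range(n_kernels)]
    if betas is None:
        betas = [1.0 / n_kernels] * n_kernels   # uniform; the paper tunes beta_u by QP
    loss = xs.new_zeros(())
    for beta, g in zip(betas, gammas):
        k_ss = gaussian_kernel(xs, xs, g).mean()
        k_tt = gaussian_kernel(xt, xt, g).mean()
        k_st = gaussian_kernel(xs, xt, g).mean()
        loss = loss + beta * (k_ss + k_tt - 2.0 * k_st)
    return loss
```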
2.3. Few shot learning
Few shot learning has developed into an important direction in
machine learning research which aims at exploring effective solutions
for application scenarios with small training datasets. There are mainly two popular categories of few shot learning methods: metric based methods and optimization based methods. Matching network (Vinyals, Blundell, Lillicrap, Kavukcuoglu, & Wierstra, 2016), prototype network (Snell, Swersky, & Zemel, 2017) and relation network (Sung et al., 2018) are representative metric based few shot learning methods. Methods like model-agnostic meta-learning (MAML) (Finn, Abbeel, & Levine, 2017) and task-agnostic meta-learning (TAML) (Jamal & Qi, 2019) are optimization based methods. A common property of these few shot learning methods is that small mini-batches over multiple tasks are sampled to train the model iteratively. This cross-task training procedure enables fast fine-tuning of the model and improves its generalization performance, which assures the model's effectiveness in small data application scenarios. Among these few shot learning methods, relation
network (Sung et al., 2018) employs a network module to learn the
metric for sample difference evaluation, which is called relation module.
Before the relation module, a feature module is used to extract the
features of the input samples. Considering the excellent performance of
relation network, its two module architecture has been borrowed to
build MMNet in this study.
3. Multi-view and Multi-level network
As discussed in Section 2.1, taking all the samples into domain adaptation equally might lead to important information loss. The
domain specific information carried by the samples not appropriate for
transfer will get suppressed to fulfill domain adaptation between the
source and the target domain. In order to retain as much effective in­
formation as possible, both the common features cross domains and
domain specific features should be simultaneously extracted. In addi­
tion, few shot learning related mechanism should be incorporated to
deal with the data paucity issue in fault diagnosis. Therefore, a novel
solution MMNet is developed which could learn multi-view features
with multi-level classification.
3.1. Architecture of MMNet
Within a domain adaptation deep network, all the involved network
weights are adjusted toward improving the classification performance of
the network. Therefore, the contribution of the samples which are
inappropriate for domain adaptation will be diminished. Only the fea­
tures of the samples that could benefit the domain alignment between
the source and the target domain will be effectively extracted. To extract
both cross domain common features and domain specific features, two
isolated network channels for feature extraction are designed in MMNet.
Fig. 2 gives the detailed architecture of MMNet. The structure of MMNet is shown in Fig. 2(a), and Fig. 2(b) gives the notations of the different channels in the network.
The overall architecture of MMNet borrows the module arrangement
from relation network (Sung et al., 2018). As shown in Fig. 2(a), MMNet
has two modules which are denoted as FeatureNet and RelationNet
respectively. FeatureNet extracts the features of the input samples and
RelationNet computes the relation between the samples. Each module
contains two branches indicated as source branch and target branch,
which process the input samples from the source and the target domain
respectively. In the FeatureNet module, the upper two feature extraction
channels form the source branch which extracts the feature of the source
domain samples. The lower two feature extraction channels form the
target branch which extracts the feature of the target domain samples.
The source and target branches share the same weights. The cross
domain common feature channel aims at extracting the common fea­
tures cross domains via domain adaptation, while the domain specific
feature channel extracts the domain specific discriminant features
facilitating both fault classification and domain classification. The cor­
responding channel notations are given in Fig. 2(b). The two branches in
the RelationNet module are also weight shared.
To obtain common features cross the source and the target domains,
MK-MMD based domain adaptation is employed. It has been indicated in
(Long et al., 2015) that with the increase of the network depth the
features learned over the layers transition from general to specific. The specific features of one domain are more difficult to transfer to another domain than the general features. Therefore, the MK-MMD loss is imposed on three layers of MMNet as shown in Fig. 2(a). In the FeatureNet module, the MK-MMD loss is imposed on the highest convolutional layer. In the RelationNet module, it is imposed on the two highest fully connected layers, excluding the output layer. To obtain domain
specific features, domain level classification and fault level classification
have both been incorporated. Domain level classification is performed
based on the features extracted by the domain specific feature channels
in the FeatureNet module. The domain specific feature channel aims at
boosting both domain classification and fault classification, which could
thus learn the features benefiting classification in specific domain.
The details of the network channels are given in Fig. 3. The two
feature learning channels in both the source and the target branch of the
FeatureNet have the same structure settings. In each channel, there are
Fig. 2. Architecture of MMNet. (a) MMNet structure (b) Details of network branches in MMNet.
three convolutional layers each followed by an average pooling layer. In
all the three convolutional layers, 20 feature maps are adopted and the
kernel size of each feature map is 3 × 1. The pooling size of the average
pooling layer is 2. In the source branch, based on the features learned by
the domain specific feature channel, a flatten layer with dimension of
5120 and a fully connected layer are used for domain classification. Here
the domain classification is a binary classification problem. The samples
from the source domain are labeled with 1 and the samples from the
target domain are labeled with 0. The upper channel in the RelationNet
module calculates the similarity between the concatenated features and
implements fault classification as shown in Fig. 2(a). The lower channel
in the RelationNet module shares the same structure with the upper
channel which only participates in the domain adaptation calculation. In
both RelationNet channels, two convolutional layers, one flatten layer
and two fully connected layers have been employed. The convolutional
kernel width is 3 × 1 and the average pooling size is 4. The dimension of
the flatten layer and the two fully connected layers is 1280, 512 and 256
respectively. The computation and optimization details are given in the
following section.
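The per-channel structures in Fig. 3 can be sketched in PyTorch as follows. The layer counts, feature map numbers, kernel widths and pooling sizes follow the description above; the use of batch normalization, the padding, and a lazily sized first fully connected layer (to absorb the flatten dimension) are our assumptions rather than settings confirmed by the paper.

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch, pool):
    # 3 x 1 convolution followed by average pooling, per Fig. 3;
    # batch normalization and padding are our assumptions
    return nn.Sequential(
        nn.Conv1d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm1d(out_ch),
        nn.ReLU(),
        nn.AvgPool1d(pool),
    )

class FeatureChannel(nn.Module):
    """One feature extraction channel of FeatureNet: three convolutional
    layers with 20 feature maps each and average pooling of size 2."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(conv_block(1, 20, 2),
                                 conv_block(20, 20, 2),
                                 conv_block(20, 20, 2))

    def forward(self, x):            # x: (batch, 1, 1024) vibration samples
        return self.net(x)

class RelationChannel(nn.Module):
    """One RelationNet channel: two convolutional layers with average
    pooling of size 4, then fully connected layers of 512 and 256 units
    ending in a scalar relation score."""
    def __init__(self):
        super().__init__()
        # 40 input maps = concatenated template (20) + query (20) features
        self.conv = nn.Sequential(conv_block(40, 20, 4),
                                  conv_block(20, 20, 4))
        self.fc = nn.Sequential(nn.Flatten(),
                                nn.LazyLinear(512), nn.ReLU(),
                                nn.Linear(512, 256), nn.ReLU(),
                                nn.Linear(256, 1))

    def forward(self, pair):         # pair: concatenated feature maps
        return self.fc(self.conv(pair))
```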
3.2. Optimization of MMNet
The training of MMNet has adopted the episode based training
strategy in few shot learning methods. The training set is constructed by
the samples from both the source and the target domain. The part from
the source domain is labeled data which aims for fault classification
training. The part from the target domain is unlabeled data which aims
for domain adaptation. Both parts are used for domain classification
training. In episode based training, an experiment mechanism called
k-way m-shot setting is used. Here k-way means the number of classes
involved in each episode and m-shot indicates the number of labeled
samples as template for comparison from each category. Specifically, in
each episode a mini-batch is randomly selected from the source domain
dataset as the template set. The size of the template set is k × m in a
k-way m-shot experiment setting. A fraction of the remaining dataset is
used as the query set. In each episode, the features of the m template
samples from each category are extracted by the FeatureNet module
which can be denoted as $\{x_i^{s,t}\}, i = 1, 2, \cdots, m$, where $s$ means the samples come from the source domain dataset and $t$ indicates that the samples serve as templates. The query set samples are also fed to the FeatureNet module to extract their feature representations. The query set can be represented as $\{x_i^{s,q}\}, i = 1, 2, \cdots, n$, where $n$ is the number of query samples used for training from each class. These two parts of data are the input to the domain specific feature channel in the source branch of FeatureNet as shown in Fig. 2(a). For the lower target branch, the same set of template samples is used. The query set comes from the target domain and can be denoted as $\{x_i^{t,q}\}, i = 1, 2, \cdots, n$. The numbers of query samples from the source and the target domain are the same.
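As an illustration of this episode construction, a sketch of the k-way m-shot sampling is given below; the function and variable names are hypothetical, and the per-class query count is a parameter.

```python
import random
import torch

def sample_episode(src_x, src_y, tgt_x, k=4, m=5, n_query=25):
    """Build one k-way m-shot episode: a labeled template set and query
    set from the source domain plus an unlabeled query set from the
    target domain. All names here are illustrative."""
    classes = random.sample(sorted(set(src_y.tolist())), k)
    templates, s_query, s_labels = [], [], []
    for c in classes:
        idx = torch.nonzero(src_y == c).squeeze(1)
        idx = idx[torch.randperm(len(idx))]
        templates.append(src_x[idx[:m]])             # m templates per class
        s_query.append(src_x[idx[m:m + n_query]])    # labeled source queries
        s_labels += [c] * n_query
    t_idx = torch.randperm(len(tgt_x))[:k * n_query] # unlabeled target queries
    return (torch.stack(templates),                  # (k, m, 1, 1024)
            torch.cat(s_query), torch.tensor(s_labels), tgt_x[t_idx])
```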
For each branch in the FeatureNet module, all the template samples
from the source domain and two query samples respectively from the
source and the target domain are fed to the FeatureNet module sepa­
rately during each episode to obtain their corresponding feature vectors.
When the number of the template samples m is larger than 1, the sum of
their obtained feature vectors is used as the template feature vector. The
query feature vector is obtained from the query sample. Suppose the
corresponding feature vectors of $\{x_i^{s,t}\}, i = 1, 2, \cdots, m$, $x_i^{s,q}$ and $x_i^{t,q}$ extracted by the FeatureNet module in one episode are $\{f_i^{s,t}\}, i = 1, 2, \cdots, m$, $f_i^{s,q}$ and $f_i^{t,q}$ respectively, the final template feature vector can be obtained by summing up the feature vectors of all the template samples as

$$f^{s,t} = \sum_{i=1}^{m} f_i^{s,t}. \tag{5}$$
For each category of machine fault, a template vector will be
computed during each episode. After the FeatureNet module, the tem­
plate feature vector and the query feature vector are concatenated with
each other, which form the input to the following RelationNet module as
shown in Fig. 2(a). During the training stage, one source domain and one
target domain query sample will be fed to the MMNet each time along
Fig. 3. Network structure details of the network channels in MMNet.
with the template samples. With the RelationNet module, the similarity
between the query sample and the template of each category is calcu­
lated and a relation score for the source domain query sample will be
obtained as $r_c(f_i^{s,q}, f^{s,t})$, where $c$ is the class index. Based on this, the Softmax function is employed to implement the machine health condition classification as

$$p(y_i^{s,q} = c) = \frac{\exp\!\left(r_c(f_i^{s,q}, f^{s,t})\right)}{\sum_{c'=1}^{C} \exp\!\left(r_{c'}(f_i^{s,q}, f^{s,t})\right)}, \tag{6}$$

where $p(y_i^{s,q} = c)$ is the probability of the $i$th query sample from the source domain belonging to class $c$. The query samples from the target
domain are specifically used for domain adaptation and no labels are
provided for them, so the classification of the target domain query
sample is not conducted as shown in Fig. 2(a).
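A sketch of this relation-score classification (Eqs. (5) and (6)) is given below, assuming the FeatureNet and RelationNet channels from the earlier sketch; the tensor layout is our assumption.

```python
import torch
import torch.nn.functional as F

def classify_queries(feat_net, rel_net, templates, queries):
    """Relation-score classification, Eqs. (5)-(6). templates: (k, m, 1, L)
    source templates; queries: (n, 1, L). Returns predicted episode-class
    indices and class probabilities."""
    k, m = templates.shape[:2]
    tpl = feat_net(templates.flatten(0, 1))                  # (k*m, 20, L')
    tpl = tpl.view(k, m, *tpl.shape[1:]).sum(dim=1)          # Eq. (5): (k, 20, L')
    qry = feat_net(queries)                                  # (n, 20, L')
    # Concatenate every (template, query) pair along the channel axis
    pairs = torch.cat([tpl.unsqueeze(0).expand(len(qry), -1, -1, -1),
                       qry.unsqueeze(1).expand(-1, k, -1, -1)], dim=2)
    scores = rel_net(pairs.flatten(0, 1)).view(len(qry), k)  # relation scores r_c
    probs = F.softmax(scores, dim=1)                         # Eq. (6)
    return probs.argmax(dim=1), probs
```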
To optimize MMNet, three parts of loss are combined to train MMNet
including the machine fault classification loss, domain classification loss
and domain adaptation loss. The fault classification loss is calculated
based on the relation score, so it is termed the relation loss for simplicity
as shown in Fig. 2(a). The domain classification loss (domain loss for
short) further includes two parts, i.e. the domain classification loss for
the query sample from the source domain and the target domain
respectively. The relation loss is denoted as $\mathcal{L}_r$ and defined by the cross entropy loss as

$$\mathcal{L}_r = \sum_{i=1}^{n_{bs}} J(x_i^{s,q}, y_i^{s,q} \mid \theta) = -\sum_{i=1}^{n_{bs}} y_i^{true} \log y_i^{s,q}, \tag{7}$$

where $n_{bs}$ is the number of source domain query samples in one training episode, $\theta$ represents the parameters of the network, $y_i^{s,q}$ is the estimated fault label and $y_i^{true}$ is the true fault label.
The two parts of the domain loss are respectively denoted as $\mathcal{L}_{ds}$ and $\mathcal{L}_{dt}$ for the source and the target domain query samples; they also use the cross entropy loss and are formulated as

$$\mathcal{L}_{ds} = \sum_{i=1}^{n_{bs}} J(x_i^{s,q}, d_i^{s,q} \mid \theta) = -\sum_{i=1}^{n_{bs}} d_i^{true} \log d_i^{s,q} \tag{8}$$

and

$$\mathcal{L}_{dt} = \sum_{i=1}^{n_{bt}} J(x_i^{t,q}, d_i^{t,q} \mid \theta) = -\sum_{i=1}^{n_{bt}} d_i^{true} \log d_i^{t,q}, \tag{9}$$

where $n_{bs}$ and $n_{bt}$ are the numbers of query samples from the source and the target domain respectively, $d_i^{s,q}$ and $d_i^{t,q}$ are the estimated domain labels of the query samples, and $d_i^{true}$ is the true domain label. If the query sample comes from the source domain, $d_i^{true} = 1$; otherwise $d_i^{true} = 0$.
The domain adaptation loss is evaluated based on MK-MMD as dis­
cussed in Section 2.2, which is denoted as MK-MMD loss in Fig. 2(a) and
calculated as
$$\mathcal{L}_{MK\text{-}MMD} = d^2_{\mathcal{H}_k}(X_s, X_t), \tag{10}$$

where $X_s = \{x_i^{s,q}\}, i = 1, 2, \cdots, n_{bs}$ and $X_t = \{x_i^{t,q}\}, i = 1, 2, \cdots, n_{bt}$. An unbiased estimate of MK-MMD is adopted to calculate $d^2_{\mathcal{H}_k}(X_s, X_t)$ as in (Long et al., 2015), which is formulated as

$$d^2_{\mathcal{H}_k}(X_s, X_t) = \frac{2}{n_{bs}} \sum_{i=1}^{n_{bs}/2} g_k(z_i), \tag{11}$$

where $z_i$ is a quad-tuple defined as $z_i \triangleq (x_{2i-1}^{s,q}, x_{2i}^{s,q}, x_{2i-1}^{t,q}, x_{2i}^{t,q})$, and $g_k(z_i)$ is calculated as

$$g_k(z_i) \triangleq k(x_{2i-1}^{s,q}, x_{2i}^{s,q}) + k(x_{2i-1}^{t,q}, x_{2i}^{t,q}) - k(x_{2i-1}^{s,q}, x_{2i}^{t,q}) - k(x_{2i}^{s,q}, x_{2i-1}^{t,q}), \tag{12}$$
where the kernel function k is defined in Eq. (4) which is a weighted
combination of multiple Gaussian kernels. The weight of kernel u
denoted as βu was obtained by the same method as in (Long et al., 2015)
by reducing the kernel optimization to a quadratic program (QP). The
MK-MMD loss is calculated on three layers, i.e. the highest convolutional
layer in the FeatureNet module and two fully connected layers in the
RelationNet module.
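A sketch of this linear-time unbiased estimate (Eqs. (11)-(12)) follows; the `kernel` argument is assumed to evaluate the multi-kernel of Eq. (4) on row-aligned pairs, e.g. `lambda a, b: torch.exp(-((a - b) ** 2).sum(1) / gamma)` for one Gaussian kernel.

```python
import torch

def mk_mmd_linear(xs, xt, kernel):
    """Linear-time unbiased MK-MMD estimate, Eqs. (11)-(12). `kernel`
    computes the multi-kernel value for row-aligned pairs of samples."""
    n = min(len(xs), len(xt)) // 2 * 2            # even number of samples
    xs, xt = xs[:n], xt[:n]
    s1, s2 = xs[0::2], xs[1::2]                   # x^{s,q}_{2i-1}, x^{s,q}_{2i}
    t1, t2 = xt[0::2], xt[1::2]
    g = (kernel(s1, s2) + kernel(t1, t2)
         - kernel(s1, t2) - kernel(s2, t1))       # g_k(z_i), Eq. (12)
    return 2.0 * g.sum() / n                      # Eq. (11)
```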
Combining the relation loss, the MK-MMD loss and the domain loss,
the overall loss function can be formulated as

$$\mathcal{L} = \mathcal{L}_r + \mathcal{L}_{MK\text{-}MMD} + \mathcal{L}_{ds} + \mathcal{L}_{dt}. \tag{13}$$

In addition, to treat the loss terms in Eq. (13) with different importance, trade-off parameters can be incorporated. As discussed in Section 3.1, there are three parts of the MK-MMD loss, respectively imposed on three layers, which are denoted as $\mathcal{L}_{MK\text{-}MMD1}$, $\mathcal{L}_{MK\text{-}MMD2}$ and $\mathcal{L}_{MK\text{-}MMD3}$. Therefore, four trade-off parameters are incorporated and the weighted loss is written as

$$\mathcal{L} = \mathcal{L}_r + \lambda_1 \mathcal{L}_{MK\text{-}MMD1} + \lambda_2 \mathcal{L}_{MK\text{-}MMD2} + \lambda_3 \mathcal{L}_{MK\text{-}MMD3} + \lambda_4 (\mathcal{L}_{ds} + \mathcal{L}_{dt}), \tag{14}$$

where $\lambda_1$, $\lambda_2$, $\lambda_3$ and $\lambda_4$ are the tradeoff parameters. By minimizing the above loss as $\min_\theta \mathcal{L}$, MMNet can be trained. Adam has been adopted as the optimization method to train the network and optimize the network parameters $\theta$. The weights of the Gaussian kernels $\beta_u, u = 1, \cdots, m_u$ in MK-MMD are then optimized in an alternating way by QP. The details of the training process of MMNet are given in Table 1.
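The alternating optimization described above can be sketched as follows; the dictionary-style model output and the `episode` object are hypothetical interfaces, and the λ values shown are those of Table 3 for the 4-way 1-shot setting.

```python
import torch

def train_step(model, optimizer, episode, lambdas=(2.25, 1.25, 0.5, 0.1)):
    """One training episode of MMNet (sketch). `model` is assumed to
    return the relation loss, the three layer-wise MK-MMD losses and
    the two domain losses for the episode."""
    l1, l2, l3, l4 = lambdas
    out = model(episode)                       # forward pass over the episode
    loss = (out['relation']                    # L_r, Eq. (7)
            + l1 * out['mkmmd1'] + l2 * out['mkmmd2'] + l3 * out['mkmmd3']
            + l4 * (out['domain_s'] + out['domain_t']))      # Eq. (14)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                           # Adam with lr = 5e-4 in the paper
    return loss.item()

# optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)
```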
4. Experiment results and discussions
4.1. Datasets and experiment setting
Four datasets were employed to test the effectiveness of MMNet; the specifications of the datasets are given in Table 2. Among these four datasets, the first two were recorded in the laboratory with artificial faults, the third was collected in the laboratory with run-to-failure faults, and the last was collected from bearings used in practical applications. All the data are vibration signals collected by accelerometers from operating bearings. Four classes of health conditions are incorporated in these datasets, including normal condition (NC), inner race fault (IF), outer race fault (OF) and ball fault (BF). The test benches on which the four datasets were collected are illustrated in Fig. 4, where an illustration of the four types of health conditions is also given. The differences among the bearings lie in the bearing model, rotation speed, working load and sampling rate. Vibration signals from the same type of rotatory part with the same fault are expected to show similar characteristics, which makes it possible to transfer knowledge between different datasets.
Dataset A and B are from CWRU bearing dataset provided by Case
Western Reserve University (Center). The vibration data were collected
from a motor bearing experiment platform (Fig. 4(a)) with a sampling
frequency of 12 kHz. Artificial single point faults were made on bearings
and the corresponding vibration signals were collected in a laboratory environment. The diameter of the point fault was set as 0.0014 in. Datasets A and B were respectively collected under 0 HP and 3 HP motor loads. For each health condition, 101 samples are used in our study, each with 1024 data points. Therefore, there are 404 samples in total in each of datasets A and B.
Dataset C is from IMS bearing dataset, which is provided by the NSF
I/UCR Center for Intelligent Maintenance Systems (IMS) (Qiu, Lee, Lin,
& Yu, 2006). Four bearings were installed on a shaft rotating at a con­
stant speed of 2000 RPM. Accelerometers were installed on the bearing
housing to collect vibration signals. 6000 lbs of radial load was imposed
on the shaft. The sampling frequency was 20 kHz. There are also 404
samples used in dataset C in this study. The length of each sample is
1024 data points.
Dataset D comes from RL bearing dataset provided by Xi’an Jiaotong
University (Lei, 2017). Different from the previous three datasets, where the bearing faults were artificially produced in the laboratory, the RL bearing dataset was collected from railway locomotive (RL) rolling element bearings in practical use. An accelerometer was mounted on the outer race of the bearing to collect the vibration signal. A working load of 9800 N was adopted and the sampling rate was 12.8 kHz. The same four health conditions as in the previous three datasets are included. The number of samples and the sample length are also the same as in the other three datasets.
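As an illustration of how such samples can be prepared, the following sketch segments a raw vibration record into 1024-point samples; the segmentation scheme (non-overlapping windows) is our assumption, as the paper does not state it.

```python
import numpy as np

def segment_signal(signal, length=1024, n_samples=101):
    """Cut a 1-D vibration record into fixed-length samples
    (non-overlapping windows assumed)."""
    usable = (len(signal) // length) * length
    windows = signal[:usable].reshape(-1, length)
    return windows[:n_samples]          # 101 samples per health condition

# Example: 101 samples of 1024 points from a synthetic record
record = np.random.randn(200_000)
samples = segment_signal(record)        # shape (101, 1024)
```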
4.2. MMNet performance and comparisons
MMNet was implemented in Python with PyTorch. All the experi­
ments were performed on a PC equipped with a 3.2 GHz Intel I7 CPU and
a TITAN Xp GPU.
4.2.1. Experiment settings in MMNet
Based on the four datasets detailed in Section 4.1, three transfer tasks
have been used to validate the efficiency of MMNet, including transfer
task A → D, B → D and C → D. The bearing faults of datasets A, B and C
were generated in laboratory and those of dataset D were made during
practical application. Therefore, datasets A, B and C are used as the
source datasets and D is adopted as the target dataset to implement
knowledge transfer from laboratory data to practical data.
Episode based training in few shot learning is employed to efficiently
learn knowledge with small amount of samples. Specifically, three few
shot learning scenarios have been adopted, including 4-way 1-shot, 4-
way 5-shot and 4-way 10-shot. In each episode, one template set from
the source domain and two query sets respectively from the source and
the target domain are used for training. The query set from the source
domain has labels and is used for domain classification and fault clas­
sification. The query set from the target domain is not labeled which is
used for domain classification and domain adaptation. In the source
branch of MMNet, the category of the query sample is determined by the largest of the obtained relation scores.
In one episode of a k-way m-shot training, k classes each with m
samples randomly selected are used as the template set, and a fraction of
the remainder data are taken as the query set. In each episode of the 4-
way 1-shot experiments, one example from each class of the source
dataset is randomly selected to form the template set and 29 random
examples are respectively selected from the source and the target dataset
as the query set. For the upper source branch of MMNet, both the tem­
plate set and query set are selected from the source dataset. For the
bottom target branch, same template set as the source branch is adopted.
The query set is selected from the target dataset and no label information
is required. In the 4-way 1-shot experiments, the total number of ex­
amples used for training is 1 × 4 + 29 × 4 + 29 × 4 = 236. In the 4-way
5-shot experiments, 5 random examples from the source dataset form
the template set and 25 examples respectively from the source and the
target dataset construct the query set. The total number of examples in
each episode is 5 × 4 + 25 × 4 + 25 × 4 = 220. Similarly, in the 4-way
10-shot setting, the total number of examples in each episode is 10 × 4 +
20 × 4 + 20 × 4 = 200. All the labeled data from the source domain and
200 unlabeled examples from the target domain have been used to
generate the training set in each episode. The remaining 204 samples (51 × 4 = 204) from the target domain are used for testing.
Table 1
Training process of MMNet.
Table 2
Dataset specifications.

| Dataset | Bearing specs | Health conditions | Number of samples | Operation configuration |
|---------|---------------|-------------------|-------------------|--------------------------|
| A | SKF6205 | NC, IF, OF, BF | 4 × 101 | 0 HP, 1797 r/min |
| B | SKF6205 | NC, IF, OF, BF | 4 × 101 | 3 HP, 1730 r/min |
| C | ZA-2115 | NC, IF, OF, BF | 4 × 101 | 6000 lbs, 2000 r/min |
| D | 552732QT | NC, IF, OF, BF | 4 × 101 | 9800 N, 500 r/min |
4.2.2. Parameter settings in MMNet
Adam is adopted to optimize MMNet. The number of training episodes is set as 10,000 and the learning rate is $5 \times 10^{-4}$. The tradeoff parameters λ1, λ2 and λ3 of the three MK-MMD losses and the tradeoff parameter λ4 of the domain loss are given in Table 3. It has been discovered in previous research that from the shallower layers to the deeper layers of a convolutional neural network, the learned features turn from general to specific. General features are easier to transfer across domains than specific ones. Therefore, the transferability of the features decreases as the network depth increases. Larger MK-MMD tradeoff parameters should be selected for the lower layers and smaller ones for the higher layers to allow for task-specific tuning.
To verify the above statement, grid search experiments have been
conducted to search for the optimal tradeoff parameters in an exhaustive
manner. The details of the parameter selection procedures are given in
Table 4. The MK-MMD tradeoff parameters are selected within the range
of [0.1, 5] with an increment of 0.05. In each experiment scenario, 10
examples from the test set (query set) are randomly separated as a validation set for parameter selection. Considering the high computational cost, no cross validation procedure is used. Experiment results have shown that MMNet failed to obtain satisfactory performance when the three parameters took identical values. Fault classification accuracy
around 83 % was obtained in these experiments. In some experiments,
the network even failed to converge. Similar experiment results were
observed when the value of the parameters is in increasing order from λ1
to λ3. When the parameters take random order (neither monotone
increasing nor monotone decreasing), some good results have been ob­
tained. Better classification performance has been achieved when the
tradeoff parameters are in decreasing order. The optimal values of the three MK-MMD loss tradeoff parameters are selected based on the grid search results, as shown in Table 3. The parameter selection results also indicate that the model is quite robust to parameter variation, with a mean accuracy of 89.86 % and a standard deviation of 6.02 %.
During the search for the three MK-MMD loss parameters, the domain loss parameter is fixed as 0.1 to reduce computational cost, a value which has shown relatively good performance throughout the experiments. After the three MK-MMD loss tradeoff parameters have been selected, they are fixed to further select the domain loss tradeoff parameter λ4. Experiments with λ4 from {0.001, 0.01, 0.1, 1, 10, 100} have been performed. Based on the experiment results, 0.1 is selected.
In each domain adaptation operation with MK-MMD, 5 Gaussian kernels have been adopted. The base Gaussian kernel bandwidth $\gamma$ is set as the median of the pairwise distances of the training samples from both the source and the target domain. The bandwidths of the $m_u$ Gaussian kernels are obtained by varying the bandwidth between $2^{-\lfloor m_u/2 \rfloor}\gamma$ and $2^{\lfloor m_u/2 \rfloor}\gamma$ with a scaling factor of 2, where $\lfloor \cdot \rfloor$ denotes integer division.
4.2.3. Performance of MMNet and comparison with other methods
To verify the performance of MMNet, the three transfer tasks A → D, B → D and C → D discussed in Section 4.2.1 have been carried out. For each transfer task, three few shot learning experiment settings are tested. The results are reported in Table 5. It can be seen that excellent fault classification performance has been obtained on the three transfer tasks. As the number of examples used as the template set increases, the performance of MMNet improves. The average fault classification accuracy is above 99 %, which is superior transfer performance for bearing fault diagnosis.
To further validate the effectiveness of MMNet, extensive compari­
son experiments have been conducted. Multiple state-of-the-art transfer
learning methods have been included for comparison, including Trans­
fer Component Analysis (TCA) (Pan, Tsang, Kwok, & Yang, 2011), Deep
Domain Confusion (DDC) (Tzeng, Hoffman, Zhang, Saenko, & Darrell,
2014), modified Deep Adaptation Networks (DAN) (Long et al., 2015),
Feature-based transfer neural network (FTNN) (Yang et al., 2019), G-
ResNet (Yang et al., 2020), P-ResNet (Yang et al., 2020) and TrResNet
(Yang et al., 2020). In addition, Convolutional Neural Network (CNN)
has been incorporated as a baseline method for comparison.

Fig. 4. Test bench of the CWRU [23], IMS [31] and RL [24] bearing datasets and the corresponding health condition illustration [6].

Table 3
Tradeoff parameters of the MK-MMD loss and domain loss.

| Experiment setting | λ1 | λ2 | λ3 | λ4 |
|---|---|---|---|---|
| 4-way 1-shot | 2.25 | 1.25 | 0.5 | 0.1 |
| 4-way 5-shot | 1.0 | 0.5 | 0.2 | 0.1 |
| 4-way 10-shot | 2.0 | 1.0 | 0.75 | 0.1 |

To make fair comparisons, we use publicly available source code provided by the
authors of the above methods for experiments. When the code of the
method is not publicly available, the results are borrowed from the
original papers directly, given the same transfer task. When neither the source code nor the corresponding experiment results are available in the original publication, a "/" mark is used in Table 6, which reports the comparison results.
In the baseline CNN method, no transfer learning related tricks have
been applied. The labeled data from the source dataset form the training
set and the unlabeled data from the target dataset construct the testing set.

Table 4
Tradeoff parameter selection procedures.

Table 5
Classification accuracy (%) of MMNet on different transfer tasks.

| Experiment setting | A → D | B → D | C → D | Avg |
|---|---|---|---|---|
| 4-way 1-shot | 99.62 | 98.75 | 99.25 | 99.21 |
| 4-way 5-shot | 99.64 | 99.90 | 99.70 | 99.75 |
| 4-way 10-shot | 99.95 | 99.98 | 99.72 | 99.88 |

To achieve an optimal performance of CNN for comparison, various
architectures of CNNs have been evaluated. Specifically, CNNs with
different depth have been tested, including CNN of five convolutional
layers, three convolutional layers and two convolutional layers. In each
CNN, one flatten layer and one fully connected layer are added following
the convolutional layers. Cross-entropy is used as the loss function.
Softmax is applied at the output layer for classification. Meanwhile, our
experiments have shown that average pooling could obtain better per­
formance than max pooling. Therefore, average pooling has been
adopted in these baseline CNNs. In the other compared CNN based so­
lutions, average pooling is also adopted instead of max pooling to ensure
fair comparison. It has been shown that CNN with two convolutional
layers and two fully connected layers has obtained the best fault diag­
nosis performance, the results of which are given in Table 6.
TCA is a classic transfer learning method, which projects the source
data and the target data into a new subspace where their data distri­
butions are closer than in the original data distribution space. In the
implementation of TCA, the regularization tradeoff parameter is
selected from {0.01, 0.1, 1, 10, 100} and the subspace dimension is
selected from {2, 4, 8, 16, 32, 64, 128, 256} via experiments. Based on
the representations of all the samples in the transformed subspace, a
support vector machine (SVM) classifier is trained for fault
classification.
The baseline CNN architecture selected via experiments has been
adopted in DDC. Meanwhile, MK-MMD based domain adaptation is used
in the layer before the softmax classification layer. For the compared
DAN method, the same CNN structure is used and domain adaption with
MK-MMD is applied in the flatten layer and the last fully connected layer
before the output layer. The specifications of the adopted CNN structure
in the baseline CNN, DDC and DAN are given in Table 7, where “/”
means not applicable. In both DDC and DAN, all the labeled data of the
source dataset and part of the unlabeled data of the target dataset are
used for model training. Similar dataset partition setting is adopted as
MMNet. The experiment results of FTNN are borrowed from its original
publication (Yang et al., 2019).
In G-ResNet, P-ResNet and TrResNet, eight ResNet blocks are used
to construct the network backbone structure. In G-ResNet, Gaussian
kernel based MMD is adopted for domain adaptation. In P-ResNet and
TrResNet, polynomial kernel based MMD is used. In addition, pseudo
label learning is applied in TrResNet. The reported results of the above
three methods are borrowed from (Yang et al., 2020). The detailed
model configurations can be found in (Yang et al., 2020). In the exper­
iments of these three methods, both dataset A and B from our experi­
ment setting are used as the source domain, and dataset D is treated as
the target domain. Therefore, the results of transfer tasks A → D and B →
D are the same as reported in Table 6.
The raw vibration data are used as the input to CNN, DDC, DAN,
FTNN, G-ResNet, P-ResNet, TrResNet and MMNet. To obtain better fault
diagnosis performance of TCA, frequency spectrum instead of vibration
data is adopted as the input for TCA.
In Table 6, the best results have been highlighted in bold. It could be
seen from these results that neural network based solutions have ob­
tained significantly better performance than the traditional transfer
learning method TCA. The performance of the baseline CNN with no
transfer learning component involved is relatively poor. Its best per­
formance on the three transfer tasks is 57.67 %. TrResNet, published recently in 2020, ranked second best. Among all the compared methods, our MMNet has obtained the best fault classification accuracy. The classification accuracy on all three transfer tasks is above 99 %, which is quite excellent performance. The smallest accuracy increase over the second best result reaches 10.94 %.
The t-SNE (t-distributed stochastic neighbor embedding) method is employed to visualize the transfer features learned by the compared methods. The visualization results are given in Fig. 5. The intermediate feature representations of G-ResNet, P-ResNet and TrResNet are not available, so their corresponding visualization results are not provided in Fig. 5. The visualization is conducted on the transfer task A → D. In Fig. 5, the notation "S-" means the corresponding samples come from the source domain and "T-" means the samples come from the target domain. The visual illustration in Fig. 5 includes the frequency spectrum analysis, TCA, CNN, DDC, DAN and MMNet. The results in Fig. 5 show that the feature distribution difference between the source and target domains is quite obvious for the frequency spectrum, TCA, CNN and DDC. Among these methods, the features obtained by TCA are aggregated within one class from the same domain but still scattered for the same class from different domains, which explains the relatively better performance of CNN, DDC and DAN. The domain discrepancy of the features learned by DAN and MMNet is obviously reduced compared with the former four methods. For both DAN and MMNet, samples coming from the same class are well aggregated even though they are from different domains. Comparing MMNet with DAN, the distance among different classes obtained by MMNet is obviously larger than that of DAN. Meanwhile, the samples from the same class are more aggregated in MMNet and relatively more scattered in DAN. The well-formed sample distribution structure obtained by MMNet explains the excellent classification performance of the method.
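For reference, a minimal sketch of this t-SNE visualization (using scikit-learn and matplotlib) is shown below; the feature and label arrays are placeholders for the learned intermediate features.

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def plot_tsne(src_feat, tgt_feat, src_y, tgt_y):
    """2-D t-SNE embedding of source ("S-") and target ("T-") features."""
    emb = TSNE(n_components=2, init='pca',
               random_state=0).fit_transform(np.concatenate([src_feat, tgt_feat]))
    n_s = len(src_feat)
    plt.scatter(emb[:n_s, 0], emb[:n_s, 1], c=src_y, marker='o', label='S')
    plt.scatter(emb[n_s:, 0], emb[n_s:, 1], c=tgt_y, marker='^', label='T')
    plt.legend()
    plt.show()
```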
To take a further look into the classification performance compari­
son, the confusion matrices of the compared methods are visualized and
reported in Fig. 6. The confusion matrices of TCA, CNN, DDC, FTNN,
DAN and MMNet are illustrated. From the listed results, it could be seen
that a large quantity of samples have been mistakenly classified with
both TCA and CNN. The results of DDC, FTNN and DAN are better than
those of TCA and CNN. The performance of MMNet is obviously superior
to all the other compared methods, which has validated the efficiency of
MMNet.
Table 6
Accuracy comparison results (%) of different transfer learning methods for fault diagnosis.

| Method | Input | A → D | B → D | C → D |
|--------|-------|-------|-------|-------|
| CNN | Raw vibration | 57.67 | 53.17 | 53.96 |
| TCA | Frequency spectrum | 51.48 | 41.58 | 25.00 |
| DDC | Raw vibration | 80.84 | 77.80 | 81.22 |
| DAN | Raw vibration | 83.52 | 78.90 | 86.27 |
| FTNN | Raw vibration | 83.69 | 84.95 | / |
| G-ResNet | Raw vibration | 84.32 | 84.32 | / |
| P-ResNet | Raw vibration | 87.76 | 87.76 | / |
| TrResNet | Raw vibration | 88.27 | 88.27 | / |
| MMNet | Raw vibration | **99.21** | **99.75** | **99.88** |
Table 7
Specifications of the CNN structure in baseline CNN, DDC and DAN.

Layer    Operation         Convolutional kernel width   Number of channels   Output size
Input    /                 /                            /                    1024 × 1 × 1
C1       Convolution       3 × 1                        20                   1024 × 1 × 20
P1       AvgPooling        2 × 1                        /                    512 × 1 × 20
C2       Convolution       3 × 1                        20                   512 × 1 × 20
P2       AvgPooling        2 × 1                        /                    256 × 1 × 20
FC1      Flatten           /                            /                    5120 × 1
FC2      Fully connected   5120 × 256                   /                    256 × 1
Output   Fully connected   256 × 4                      /                    4 × 1
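Read as a 1-D PyTorch model, the structure in Table 7 might be sketched as follows. Table 7 fixes only the kernel sizes, channel counts and output sizes; the padding and the ReLU activations below are our assumptions, chosen to match the stated output sizes.

```python
# Sketch of the baseline CNN in Table 7 (PyTorch). padding=1 keeps the
# stated output lengths (1024 -> 1024 after C1); ReLU is an assumption.
import torch
import torch.nn as nn

class BaselineCNN(nn.Module):
    def __init__(self, num_classes: int = 4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, 20, kernel_size=3, padding=1),   # C1: 1024 x 20
            nn.ReLU(),
            nn.AvgPool1d(2),                              # P1: 512 x 20
            nn.Conv1d(20, 20, kernel_size=3, padding=1),  # C2: 512 x 20
            nn.ReLU(),
            nn.AvgPool1d(2),                              # P2: 256 x 20
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),                 # FC1: 256 * 20 = 5120
            nn.Linear(5120, 256),         # FC2
            nn.ReLU(),
            nn.Linear(256, num_classes),  # Output: 4 fault classes
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x))

# A vibration segment of length 1024 enters as (batch, 1, 1024).
logits = BaselineCNN()(torch.randn(8, 1, 1024))  # -> (8, 4)
```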
4.3. Ablation study

Several components contribute to the performance of MMNet; the three major ones are the double-channel feature extraction mechanism, the multiple-layer domain adaptation and the average pooling. An ablation study has been conducted to verify the effectiveness of each component.
Fig. 5. Visualization of the learned features with t-SNE. (a) Frequency spectrum feature (b) TCA (c) CNN (d) DDC (e) DAN (f) MMNet.
Fig. 6. Confusion matrices of the transfer results on dataset A → D. (a) TCA (b) CNN (c) DDC (d) FTNN (e) DAN (f) MMNet.

To test the necessity of the double-channel feature extraction mechanism, comparison experiments with only the common feature extraction channel were performed; the remaining components, such as multi-layer adaptation and average pooling, were kept the same.
The comparison results, averaged over three experimental settings (4-way 1-shot, 4-way 5-shot and 4-way 10-shot) on each transfer task, are reported in Fig. 7. Results obtained with only the cross domain common feature channel are indicated as “one channel” in Fig. 7, and results obtained with both the common feature channel and the domain specific feature channel are denoted as “double channel”. The highest accuracy of the one-channel setting is 98.15 % on transfer task A → D, whereas the corresponding double-channel result is 99.21 %. On all three transfer tasks, the double-channel setting of MMNet outperforms the one-channel setting, which verifies the effectiveness of the double-channel feature extraction mechanism.
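For illustration only, a minimal sketch of what the two settings differ in is given below. The channel internals and the concatenation-based fusion are our placeholders, not the exact layers of MMNet; the point is that the “one channel” variant drops the domain specific channel, while the same weight-shared module processes samples from both domains.

```python
# Sketch of the double-channel ablation (PyTorch). How MMNet fuses the two
# channel outputs is simplified to concatenation here; channel internals
# are placeholders, not the paper's exact layers.
import torch
import torch.nn as nn

def feature_channel() -> nn.Sequential:
    # Placeholder channel: Conv -> ReLU -> AvgPool, repeated twice.
    return nn.Sequential(
        nn.Conv1d(1, 20, 3, padding=1), nn.ReLU(), nn.AvgPool1d(2),
        nn.Conv1d(20, 20, 3, padding=1), nn.ReLU(), nn.AvgPool1d(2),
    )

class FeatureNet(nn.Module):
    def __init__(self, double_channel: bool = True):
        super().__init__()
        self.common = feature_channel()    # trained with MK-MMD adaptation
        self.specific = feature_channel() if double_channel else None

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats = [self.common(x)]
        if self.specific is not None:      # ablated in the "one channel" setting
            feats.append(self.specific(x))
        return torch.cat(feats, dim=1)     # multi-view feature

# Weight sharing across domains: the same module processes both batches.
net = FeatureNet(double_channel=True)
f_src, f_tgt = net(torch.randn(8, 1, 1024)), net(torch.randn(8, 1, 1024))
```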
Fig. 7. Comparison results with and without the domain discriminant feature extraction channel on three transfer tasks.
Fig. 8. Comparison of classification results with different numbers of Gaussian kernels used in MK-MMD domain adaptation on three transfer tasks. (a) Results on transfer task A → D (b) Results on transfer task B → D (c) Results on transfer task C → D.
One key factor influencing the performance of the multi-layer domain adaptation in MMNet is the number of Gaussian kernels used in MK-MMD. When this number reduces to 1, MK-MMD degenerates to MMD. To compare different numbers of kernels, experiments with 1, 3, 5, 7 and 9 kernels were performed on each transfer task for 10 runs; the results are illustrated in Fig. 8. When the number of kernels increases from 1 to 3 and from 3 to 5, a significant improvement in fault classification accuracy can be observed. When it changes from 5 to 7 and 9, the performance variation is relatively small, while the computational complexity of MMNet keeps growing with the number of kernels. Therefore, 5 kernels were used in our experiments.
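For reference, the MK-MMD loss varied in this ablation can be sketched as follows. The sketch assumes equal kernel weights (uniform βu) and bandwidths obtained by scaling a base bandwidth with powers of 2, as described in Section 2.2; it is a minimal estimator, not necessarily the exact implementation used in our experiments.

```python
# Minimal MK-MMD sketch (PyTorch): the squared MMD of Eq. (2), averaged
# over several Gaussian kernels whose bandwidths are a base bandwidth
# scaled by powers of 2. Equal kernel weights are assumed for simplicity.
import torch

def mk_mmd(xs: torch.Tensor, xt: torch.Tensor,
           num_kernels: int = 5, base_gamma: float = 1.0) -> torch.Tensor:
    x = torch.cat([xs, xt], dim=0)
    d2 = torch.cdist(x, x).pow(2)           # pairwise squared distances
    # Bandwidths gamma * 2^u centred around u = 0, e.g. gamma/4 .. 4*gamma
    # for 5 kernels, matching the 2^{±⌊ku/2⌋}γ scheme of Section 2.2.
    exps = range(-(num_kernels // 2), num_kernels // 2 + 1)
    k = sum(torch.exp(-d2 / (base_gamma * 2.0 ** u)) for u in exps) / num_kernels
    ns, nt = xs.size(0), xt.size(0)
    k_ss, k_tt, k_st = k[:ns, :ns], k[ns:, ns:], k[:ns, ns:]
    return k_ss.mean() + k_tt.mean() - 2.0 * k_st.mean()

# Usage: add the MK-MMD of source/target features to the training loss.
loss_da = mk_mmd(torch.randn(32, 256), torch.randn(32, 256), num_kernels=5)
```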
In addition, to test the efficiency of average pooling, comparison experiments against max pooling were conducted. The three convolutional layers in the FeatureNet module of MMNet use average pooling instead of max pooling to suppress noise within the vibration time sequence. We respectively replaced the average pooling in the first convolutional layer C1, in the first two layers C1 and C2, and in all three layers C1, C2 and C3 of the FeatureNet module. The experiments show that the advantage of average pooling is twofold: it accelerates convergence during training and improves classification accuracy. Compared with max pooling, the fault classification accuracy on the transfer tasks improved by up to more than 5 % in our experiments. Meanwhile, training MMNet with average pooling after all convolutional layers took about 2,000 episodes, whereas more than 30,000 episodes were required with max pooling. Average pooling thus greatly improves the training speed of MMNet.
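A sketch of how such an ablation can be set up is given below: the pooling operator of the first k convolutional blocks is swapped from average to max pooling. The channel widths and the ReLU activations are placeholders; only the pooling swap is the point.

```python
# Sketch of the pooling ablation: build FeatureNet variants in which the
# first k convolutional blocks use max pooling instead of average pooling.
import torch.nn as nn

def conv_block(in_ch: int, out_ch: int, use_max_pool: bool) -> nn.Sequential:
    pool = nn.MaxPool1d(2) if use_max_pool else nn.AvgPool1d(2)
    return nn.Sequential(
        nn.Conv1d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.ReLU(),
        pool,
    )

def feature_net(num_max_pool_blocks: int = 0) -> nn.Sequential:
    # num_max_pool_blocks = 1 swaps C1; = 2 swaps C1 and C2; = 3 swaps all.
    widths = [(1, 20), (20, 20), (20, 20)]   # placeholder channel widths
    return nn.Sequential(*[
        conv_block(i, o, use_max_pool=(b < num_max_pool_blocks))
        for b, (i, o) in enumerate(widths)
    ])
```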
4.4. Computational complexity comparison

Besides the above comparison of model performance, the computational complexity of the models has also been compared. Since training and inference times differ across hardware platforms, the model structure complexity and the number of trainable parameters are summarized and compared in Table 8, where models sharing the same backbone structure are listed in the same row. In MMNet, the weights are shared between the channels, so the complexity of only one channel needs to be considered. Table 8 shows that MMNet has the smallest number of trainable parameters among the compared models, only about one half to one quarter of the others. Although MMNet uses more convolutional layers than CNN/DDC/DAN and FTNN, its smaller convolutional kernels (compared with G-ResNet, P-ResNet and TrResNet) and narrower fully connected layers lead to a more concise structure. Therefore, MMNet has lower computational complexity than the other compared models.
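Trainable-parameter counts of the kind reported in Table 8 can be read directly off a framework's parameter list, as in the following sketch; note that the exact totals depend on details such as bias terms and layer widths, so this sketch will not necessarily reproduce the numbers in Table 8.

```python
# Counting trainable parameters of a PyTorch model (works for any nn.Module).
import torch.nn as nn

def count_trainable_parameters(model: nn.Module) -> int:
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# Example with the baseline CNN sketched after Table 7:
# print(count_trainable_parameters(BaselineCNN()))
```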
5. Conclusions

The existing deep transfer networks attempt to transfer all the features extracted from fault data across different domains. Considering that some features may only benefit classification within a specific domain without providing common information across domains, a neural network solution, MMNet, which separately treats the features appropriate and inappropriate for transfer, is developed. In MMNet, domain level classification and fault level classification are combined to extract domain specific discriminant features, while multi-layer MK-MMD based domain adaptation and fault level classification are combined to extract cross domain common features. A classic few shot learning structure, RelationNet, is employed as the backbone network, and a Siamese double-branch structure processes the samples from the source and the target domain simultaneously. The relation score based classification mechanism can perform fault diagnosis without labeled data from the target domain. Four datasets have been used to test MMNet, and the results verify its effectiveness: the transfer fault classification accuracy is significantly improved compared with other state-of-the-art transfer solutions for fault diagnosis, exceeding 99 % on all three transfer tasks in the experiments.
The outcome of this research has verified that the learned features differ in their competence across domains, and the multi-level classification mechanism enables an implicit discrimination between them. How to further, and even explicitly, evaluate the efficiency of different features for a specific domain remains a challenging problem. One promising direction is to incorporate a metric such as the Kullback-Leibler divergence to measure the similarity among features; it is also possible to learn a metric for feature evaluation and embed the metric learning module into the fault diagnosis scheme. Another promising direction is to include channel attention, self-attention and cross-attention mechanisms in the fault diagnosis network, so that the salient features of different domains can be treated separately. In addition, the core idea of MMNet can be directly applied to other classification applications, such as brain signal recognition across subjects, activity recognition across people, and image classification under different imaging conditions.
CRediT authorship contribution statement
Na Lu: Conceptualization, Funding acquisition, Methodology, Vali­
dation, Writing – review & editing. Zhiyan Cui: Investigation, Software.
Huiyang Hu: Data curation, Visualization. Tao Yin: Validation, Writing
– review & editing.
Declaration of Competing Interest
The authors declare that they have no known competing financial
interests or personal relationships that could have appeared to influence
the work reported in this paper.
Acknowledgement
This work is supported by the National Key R&D Program of China under grant 2018YFB1306100 and the National Natural Science Foundation of China under grant 61876147.
References
Case Western Reserve University Bearing Data Center. Retrieved from http://csegroups.case.edu/bearingdatacenter/home.
Che, C., Wang, H., Ni, X., & Fu, Q. (2020). Domain adaptive deep belief network for
rolling bearing fault diagnosis. Computers & Industrial Engineering, 143, Article
106427. https://doi.org/10.1016/j.cie.2020.106427
Table 8
Model computational complexity comparisons.

Model                        Number of convolution layers (size)   Number of fully connected layers (size)   Number of parameters
CNN/DDC/DAN                  2 × (3 × 1 × 20)                      2 (5120 × 256, 256 × 4)                   1,311,864
FTNN                         2 (5 × 1 × 20, 5 × 20 × 20)           2 (5941 × 256, 256 × 4)                   1,524,084
G-ResNet/P-ResNet/TrResNet   16 × (3 × 20 × 20)                    2 (6000 × 512, 512 × 4)                   3,093,248
MMNet                        5 × (3 × 1 × 20)                      3 (5120 × 2, 1280 × 512, 512 × 256)       796,972
Chen, Z., He, G., Li, J., Liao, Y., Gryllias, K., & Li, W. (2020). Domain adversarial transfer
network for cross-domain fault diagnosis of rotary machinery. IEEE Transactions on
Instrumentation and Measurement, 69(11), 8702–8712. https://doi.org/10.1109/
TIM.2020.2995441
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., & Li, F.-F. (2009, 20-25 June 2009).
ImageNet: A large-scale hierarchical image database. Paper presented at the IEEE
Conference on Computer Vision and Pattern Recognition.
Fan, H., Lin, L., Yang, F., Chu, P., Deng, G., Yu, S., . . . Ling, H. (2019, 15-20 June 2019).
LaSOT: A High-Quality Benchmark for Large-Scale Single Object Tracking. Paper
presented at the IEEE Conference on Computer Vision and Pattern Recognition.
Finn, C., Abbeel, P., & Levine, S. (2017). Model-agnostic meta-learning for fast adaptation of
deep networks. Paper presented at the International Conference on Machine Learning,
Sydney, NSW, Australia.
Gretton, A., Borgwardt, K. M., Rasch, M. J., Scholkopf, B., & Smola, A. (2012). A kernel
two-sample test. Journal of Machine Learning Research, 13, 723–773.
Guo, L., Lei, Y., Xing, S., Yan, T., & Li, N. (2019). Deep convolutional transfer learning
network: A new method for intelligent fault diagnosis of machines with unlabeled
data. IEEE Transactions on Industrial Electronics, 66(9), 7316–7325.
Jamal, M. A., & Qi, G.-J. (2019, 15-20 June 2019). Task Agnostic Meta-Learning for Few-
Shot Learning. Paper presented at the IEEE Conference on Computer Vision and
Pattern Recognition.
Jia, F., Lei, Y., Lu, N., & Xing, S. (2018). Deep normalized convolutional neural network
for imbalanced fault classification of machinery and its understanding via
visualization. Mechanical Systems and Signal Processing, 110, 349–367. https://doi.
org/10.1016/j.ymssp.2018.03.025
Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., & Li, F.-F. (2014, 23-28
June 2014). Large-Scale Video Classification with Convolutional Neural Networks. Paper
presented at the IEEE Conference on Computer Vision and Pattern Recognition.
Lei, Y. (2017). Intelligent fault diagnosis and remaining useful life prediction of rotating
machinery. Butterworth-Heinemann.
Lei, Y., Jia, F., Lin, J., Xing, S., & Ding, S. X. (2016). An intelligent fault diagnosis method
using unsupervised feature learning towards mechanical big data. IEEE Transactions
on Industrial Electronics, 63(5), 3137–3147. https://doi.org/10.1109/
TIE.2016.2519325
Li, J., Huang, R., He, G., Wang, S., Li, G., & Li, W. (2020). A deep adversarial transfer
learning network for machinery emerging fault detection. IEEE Sensors Journal, 20
(15), 8413–8422. https://doi.org/10.1109/JSEN.2020.2975286
Li, X., Zhang, W., & Ding, Q. (2018). A robust intelligent fault diagnosis method for
rolling element bearings based on deep distance metric learning. Neurocomputing,
310, 77–95.
Li, X., Zhang, W., Ding, Q., & Sun, J.-Q. (2019). Multi-Layer domain adaptation method
for rolling bearing fault diagnosis. Signal Processing, 157, 180–197. https://doi.org/
10.1016/j.sigpro.2018.12.005
Long, M., Cao, Y., Wang, J., & Jordan, M. (2015). Learning transferable features with
deep adaptation networks. Paper presented at the International Conference on Machine
Learning.
Lu, N., & Yin, T. (2021). Transferable common feature space mining for fault diagnosis
with imbalanced data. Mechanical Systems and Signal Processing, 156, Article 107645.
https://doi.org/10.1016/j.ymssp.2021.107645
Lu, W., Liang, B., Cheng, Y., Meng, D., Yang, J., & Zhang, T. (2017). Deep model based
domain adaptation for fault diagnosis. IEEE Transactions on Industrial Electronics, 64
(3), 2296–2305. https://doi.org/10.1109/TIE.2016.2627020
Pan, S. J., Tsang, I. W., Kwok, J. T., & Yang, Q. (2011). Domain adaptation via transfer
component analysis. IEEE Transactions on Neural Networks, 22(2), 199–210.
Qiu, H., Lee, J., Lin, J., & Yu, G. (2006). Wavelet filter-based weak signature detection
method and its application on rolling element bearing prognostics. Journal of Sound
and Vibration, 289(4), 1066–1090. https://doi.org/10.1016/j.jsv.2005.03.007
Shao, S., McAleer, S., Yan, R., & Baldi, P. (2019). Highly accurate machine fault diagnosis
using deep transfer learning. IEEE Transactions on Industrial Informatics, 15(4),
2446–2455. https://doi.org/10.1109/TII.2018.2864759
Snell, J., Swersky, K., & Zemel, R. (2017). Prototypical Networks for Few-shot Learning. Paper presented at the Advances in Neural Information Processing Systems.
Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P. H. S., & Hospedales, T. M. (2018, 18-23
June 2018). Learning to Compare: Relation Network for Few-Shot Learning. Paper
presented at the IEEE Conference on Computer Vision and Pattern Recognition.
Tzeng, E., Hoffman, J., Zhang, N., Saenko, K., & Darrell, T. (2014). Deep domain
confusion: Maximizing for domain invariance. arXiv preprint arXiv:1412.3474.
Vinyals, O., Blundell, C., Lillicrap, T., Kavukcuoglu, K., & Wierstra, D. (2016). Matching
networks for one shot learning. Paper presented at the International Conference on
Neural Information Processing Systems, Barcelona, Spain.
Wen, L., Gao, L., & Li, X. (2017). A new deep transfer learning based on sparse auto-
encoder for fault diagnosis. IEEE Transactions on Systems, Man, and Cybernetics:
Systems, 49(1), 136–144.
Wen, L., Gao, L., & Li, X. (2019). A new deep transfer learning based on sparse auto-
encoder for fault diagnosis. IEEE Transactions on Systems, Man, and Cybernetics:
Systems, 49(1), 136–144. https://doi.org/10.1109/TSMC.2017.2754287
Xu, G., Liu, M., Jiang, Z., Shen, W., & Huang, C. (2020). Online fault diagnosis method
based on transfer convolutional neural networks. IEEE Transactions on
Instrumentation and Measurement, 69(2), 509–520. https://doi.org/10.1109/
TIM.2019.2902003
Yang, B., Lei, Y., Jia, F., Li, N., & Du, Z. (2020). A polynomial kernel induced distance
metric to improve deep transfer learning for fault diagnosis of machines. IEEE
Transactions on Industrial Electronics, 67(11), 9747–9757. https://doi.org/10.1109/
TIE.2019.2953010
Yang, B., Lei, Y., Jia, F., & Xing, S. (2019). An intelligent fault diagnosis approach based
on transfer learning from laboratory bearings to locomotive bearings. Mechanical
Systems and Signal Processing, 122, 692–706. https://doi.org/10.1016/j.
ymssp.2018.12.051
Yu, H., Wang, K., Li, Y., & Zhao, W. (2019). Representation learning with class level
autoencoder for intelligent fault diagnosis. IEEE Signal Processing Letters, 26(10),
1476–1480. https://doi.org/10.1109/LSP.2019.2936310
Zhang, W., Li, X., Jia, X.-D., Ma, H., Luo, Z., & Li, X. (2020). Machinery fault diagnosis
with imbalanced data using deep generative adversarial networks. Measurement,
152, Article 107377. https://doi.org/10.1016/j.measurement.2019.107377
N. Lu et al.

More Related Content

Similar to 1-s2.0-S0957417422020759-main.pdf

Residual balanced attention network for real-time traffic scene semantic segm...
Residual balanced attention network for real-time traffic scene semantic segm...Residual balanced attention network for real-time traffic scene semantic segm...
Residual balanced attention network for real-time traffic scene semantic segm...
IJECEIAES
 
INVESTIGATING THE EFFECT OF BD-CRAFT TO TEXT DETECTION ALGORITHMS
INVESTIGATING THE EFFECT OF BD-CRAFT TO TEXT DETECTION ALGORITHMSINVESTIGATING THE EFFECT OF BD-CRAFT TO TEXT DETECTION ALGORITHMS
INVESTIGATING THE EFFECT OF BD-CRAFT TO TEXT DETECTION ALGORITHMS
ijaia
 
Investigating the Effect of BD-CRAFT to Text Detection Algorithms
Investigating the Effect of BD-CRAFT to Text Detection AlgorithmsInvestigating the Effect of BD-CRAFT to Text Detection Algorithms
Investigating the Effect of BD-CRAFT to Text Detection Algorithms
gerogepatton
 
FACE EXPRESSION RECOGNITION USING CONVOLUTION NEURAL NETWORK (CNN) MODELS
FACE EXPRESSION RECOGNITION USING CONVOLUTION NEURAL NETWORK (CNN) MODELS FACE EXPRESSION RECOGNITION USING CONVOLUTION NEURAL NETWORK (CNN) MODELS
FACE EXPRESSION RECOGNITION USING CONVOLUTION NEURAL NETWORK (CNN) MODELS
ijgca
 
A deep learning based stereo matching model for autonomous vehicle
A deep learning based stereo matching model for autonomous vehicleA deep learning based stereo matching model for autonomous vehicle
A deep learning based stereo matching model for autonomous vehicle
IAESIJAI
 
IRJET- Vanet Connection Performance Analysis using GPSR Protocol
IRJET- Vanet Connection Performance Analysis using GPSR ProtocolIRJET- Vanet Connection Performance Analysis using GPSR Protocol
IRJET- Vanet Connection Performance Analysis using GPSR Protocol
IRJET Journal
 
Intelligent black hole detection in mobile AdHoc networks
Intelligent black hole detection in mobile AdHoc networksIntelligent black hole detection in mobile AdHoc networks
Intelligent black hole detection in mobile AdHoc networks
IJECEIAES
 
Data Analysis In The Cloud
Data Analysis In The CloudData Analysis In The Cloud
Data Analysis In The Cloud
Monica Carter
 
An Integrated Inductive-Deductive Framework for Data Mapping in Wireless Sens...
An Integrated Inductive-Deductive Framework for Data Mapping in Wireless Sens...An Integrated Inductive-Deductive Framework for Data Mapping in Wireless Sens...
An Integrated Inductive-Deductive Framework for Data Mapping in Wireless Sens...
M H
 
IRJET- An Efficient VLSI Architecture for 3D-DWT using Lifting Scheme
IRJET- An Efficient VLSI Architecture for 3D-DWT using Lifting SchemeIRJET- An Efficient VLSI Architecture for 3D-DWT using Lifting Scheme
IRJET- An Efficient VLSI Architecture for 3D-DWT using Lifting Scheme
IRJET Journal
 
CAR DAMAGE DETECTION USING DEEP LEARNING
CAR DAMAGE DETECTION USING DEEP LEARNINGCAR DAMAGE DETECTION USING DEEP LEARNING
CAR DAMAGE DETECTION USING DEEP LEARNING
IRJET Journal
 
Semantic Concept Detection in Video Using Hybrid Model of CNN and SVM Classif...
Semantic Concept Detection in Video Using Hybrid Model of CNN and SVM Classif...Semantic Concept Detection in Video Using Hybrid Model of CNN and SVM Classif...
Semantic Concept Detection in Video Using Hybrid Model of CNN and SVM Classif...
CSCJournals
 
Comparison of convolutional neural network models for user’s facial recognition
Comparison of convolutional neural network models for user’s facial recognitionComparison of convolutional neural network models for user’s facial recognition
Comparison of convolutional neural network models for user’s facial recognition
IJECEIAES
 
Efficient Feature Selection for Fault Diagnosis of Aerospace System Using Syn...
Efficient Feature Selection for Fault Diagnosis of Aerospace System Using Syn...Efficient Feature Selection for Fault Diagnosis of Aerospace System Using Syn...
Efficient Feature Selection for Fault Diagnosis of Aerospace System Using Syn...
IRJET Journal
 
MACHINE LEARNING FOR QOE PREDICTION AND ANOMALY DETECTION IN SELF-ORGANIZING ...
MACHINE LEARNING FOR QOE PREDICTION AND ANOMALY DETECTION IN SELF-ORGANIZING ...MACHINE LEARNING FOR QOE PREDICTION AND ANOMALY DETECTION IN SELF-ORGANIZING ...
MACHINE LEARNING FOR QOE PREDICTION AND ANOMALY DETECTION IN SELF-ORGANIZING ...
ijwmn
 
Gender classification using custom convolutional neural networks architecture
Gender classification using custom convolutional neural networks architecture Gender classification using custom convolutional neural networks architecture
Gender classification using custom convolutional neural networks architecture
IJECEIAES
 
CLOUD BIOINFORMATICS Part1
 CLOUD BIOINFORMATICS Part1 CLOUD BIOINFORMATICS Part1
CLOUD BIOINFORMATICS Part1
ARPUTHA SELVARAJ A
 
Efficient Attack Detection in IoT Devices using Feature Engineering-Less Mach...
Efficient Attack Detection in IoT Devices using Feature Engineering-Less Mach...Efficient Attack Detection in IoT Devices using Feature Engineering-Less Mach...
Efficient Attack Detection in IoT Devices using Feature Engineering-Less Mach...
AIRCC Publishing Corporation
 
EFFICIENT ATTACK DETECTION IN IOT DEVICES USING FEATURE ENGINEERING-LESS MACH...
EFFICIENT ATTACK DETECTION IN IOT DEVICES USING FEATURE ENGINEERING-LESS MACH...EFFICIENT ATTACK DETECTION IN IOT DEVICES USING FEATURE ENGINEERING-LESS MACH...
EFFICIENT ATTACK DETECTION IN IOT DEVICES USING FEATURE ENGINEERING-LESS MACH...
ijcsit
 
IRJET- A Survey on Medical Image Interpretation for Predicting Pneumonia
IRJET- A Survey on Medical Image Interpretation for Predicting PneumoniaIRJET- A Survey on Medical Image Interpretation for Predicting Pneumonia
IRJET- A Survey on Medical Image Interpretation for Predicting Pneumonia
IRJET Journal
 

Similar to 1-s2.0-S0957417422020759-main.pdf (20)

Residual balanced attention network for real-time traffic scene semantic segm...
Residual balanced attention network for real-time traffic scene semantic segm...Residual balanced attention network for real-time traffic scene semantic segm...
Residual balanced attention network for real-time traffic scene semantic segm...
 
INVESTIGATING THE EFFECT OF BD-CRAFT TO TEXT DETECTION ALGORITHMS
INVESTIGATING THE EFFECT OF BD-CRAFT TO TEXT DETECTION ALGORITHMSINVESTIGATING THE EFFECT OF BD-CRAFT TO TEXT DETECTION ALGORITHMS
INVESTIGATING THE EFFECT OF BD-CRAFT TO TEXT DETECTION ALGORITHMS
 
Investigating the Effect of BD-CRAFT to Text Detection Algorithms
Investigating the Effect of BD-CRAFT to Text Detection AlgorithmsInvestigating the Effect of BD-CRAFT to Text Detection Algorithms
Investigating the Effect of BD-CRAFT to Text Detection Algorithms
 
FACE EXPRESSION RECOGNITION USING CONVOLUTION NEURAL NETWORK (CNN) MODELS
FACE EXPRESSION RECOGNITION USING CONVOLUTION NEURAL NETWORK (CNN) MODELS FACE EXPRESSION RECOGNITION USING CONVOLUTION NEURAL NETWORK (CNN) MODELS
FACE EXPRESSION RECOGNITION USING CONVOLUTION NEURAL NETWORK (CNN) MODELS
 
A deep learning based stereo matching model for autonomous vehicle
A deep learning based stereo matching model for autonomous vehicleA deep learning based stereo matching model for autonomous vehicle
A deep learning based stereo matching model for autonomous vehicle
 
IRJET- Vanet Connection Performance Analysis using GPSR Protocol
IRJET- Vanet Connection Performance Analysis using GPSR ProtocolIRJET- Vanet Connection Performance Analysis using GPSR Protocol
IRJET- Vanet Connection Performance Analysis using GPSR Protocol
 
Intelligent black hole detection in mobile AdHoc networks
Intelligent black hole detection in mobile AdHoc networksIntelligent black hole detection in mobile AdHoc networks
Intelligent black hole detection in mobile AdHoc networks
 
Data Analysis In The Cloud
Data Analysis In The CloudData Analysis In The Cloud
Data Analysis In The Cloud
 
An Integrated Inductive-Deductive Framework for Data Mapping in Wireless Sens...
An Integrated Inductive-Deductive Framework for Data Mapping in Wireless Sens...An Integrated Inductive-Deductive Framework for Data Mapping in Wireless Sens...
An Integrated Inductive-Deductive Framework for Data Mapping in Wireless Sens...
 
IRJET- An Efficient VLSI Architecture for 3D-DWT using Lifting Scheme
IRJET- An Efficient VLSI Architecture for 3D-DWT using Lifting SchemeIRJET- An Efficient VLSI Architecture for 3D-DWT using Lifting Scheme
IRJET- An Efficient VLSI Architecture for 3D-DWT using Lifting Scheme
 
CAR DAMAGE DETECTION USING DEEP LEARNING
CAR DAMAGE DETECTION USING DEEP LEARNINGCAR DAMAGE DETECTION USING DEEP LEARNING
CAR DAMAGE DETECTION USING DEEP LEARNING
 
Semantic Concept Detection in Video Using Hybrid Model of CNN and SVM Classif...
Semantic Concept Detection in Video Using Hybrid Model of CNN and SVM Classif...Semantic Concept Detection in Video Using Hybrid Model of CNN and SVM Classif...
Semantic Concept Detection in Video Using Hybrid Model of CNN and SVM Classif...
 
Comparison of convolutional neural network models for user’s facial recognition
Comparison of convolutional neural network models for user’s facial recognitionComparison of convolutional neural network models for user’s facial recognition
Comparison of convolutional neural network models for user’s facial recognition
 
Efficient Feature Selection for Fault Diagnosis of Aerospace System Using Syn...
Efficient Feature Selection for Fault Diagnosis of Aerospace System Using Syn...Efficient Feature Selection for Fault Diagnosis of Aerospace System Using Syn...
Efficient Feature Selection for Fault Diagnosis of Aerospace System Using Syn...
 
MACHINE LEARNING FOR QOE PREDICTION AND ANOMALY DETECTION IN SELF-ORGANIZING ...
MACHINE LEARNING FOR QOE PREDICTION AND ANOMALY DETECTION IN SELF-ORGANIZING ...MACHINE LEARNING FOR QOE PREDICTION AND ANOMALY DETECTION IN SELF-ORGANIZING ...
MACHINE LEARNING FOR QOE PREDICTION AND ANOMALY DETECTION IN SELF-ORGANIZING ...
 
Gender classification using custom convolutional neural networks architecture
Gender classification using custom convolutional neural networks architecture Gender classification using custom convolutional neural networks architecture
Gender classification using custom convolutional neural networks architecture
 
CLOUD BIOINFORMATICS Part1
 CLOUD BIOINFORMATICS Part1 CLOUD BIOINFORMATICS Part1
CLOUD BIOINFORMATICS Part1
 
Efficient Attack Detection in IoT Devices using Feature Engineering-Less Mach...
Efficient Attack Detection in IoT Devices using Feature Engineering-Less Mach...Efficient Attack Detection in IoT Devices using Feature Engineering-Less Mach...
Efficient Attack Detection in IoT Devices using Feature Engineering-Less Mach...
 
EFFICIENT ATTACK DETECTION IN IOT DEVICES USING FEATURE ENGINEERING-LESS MACH...
EFFICIENT ATTACK DETECTION IN IOT DEVICES USING FEATURE ENGINEERING-LESS MACH...EFFICIENT ATTACK DETECTION IN IOT DEVICES USING FEATURE ENGINEERING-LESS MACH...
EFFICIENT ATTACK DETECTION IN IOT DEVICES USING FEATURE ENGINEERING-LESS MACH...
 
IRJET- A Survey on Medical Image Interpretation for Predicting Pneumonia
IRJET- A Survey on Medical Image Interpretation for Predicting PneumoniaIRJET- A Survey on Medical Image Interpretation for Predicting Pneumonia
IRJET- A Survey on Medical Image Interpretation for Predicting Pneumonia
 

Recently uploaded

Introduction to Computer Networks & OSI MODEL.ppt
Introduction to Computer Networks & OSI MODEL.pptIntroduction to Computer Networks & OSI MODEL.ppt
Introduction to Computer Networks & OSI MODEL.ppt
Dwarkadas J Sanghvi College of Engineering
 
Call Girls Chennai +91-8824825030 Vip Call Girls Chennai
Call Girls Chennai +91-8824825030 Vip Call Girls ChennaiCall Girls Chennai +91-8824825030 Vip Call Girls Chennai
Call Girls Chennai +91-8824825030 Vip Call Girls Chennai
paraasingh12 #V08
 
一比一原版(osu毕业证书)美国俄勒冈州立大学毕业证如何办理
一比一原版(osu毕业证书)美国俄勒冈州立大学毕业证如何办理一比一原版(osu毕业证书)美国俄勒冈州立大学毕业证如何办理
一比一原版(osu毕业证书)美国俄勒冈州立大学毕业证如何办理
upoux
 
Impartiality as per ISO /IEC 17025:2017 Standard
Impartiality as per ISO /IEC 17025:2017 StandardImpartiality as per ISO /IEC 17025:2017 Standard
Impartiality as per ISO /IEC 17025:2017 Standard
MuhammadJazib15
 
原版制作(Humboldt毕业证书)柏林大学毕业证学位证一模一样
原版制作(Humboldt毕业证书)柏林大学毕业证学位证一模一样原版制作(Humboldt毕业证书)柏林大学毕业证学位证一模一样
原版制作(Humboldt毕业证书)柏林大学毕业证学位证一模一样
ydzowc
 
Sri Guru Hargobind Ji - Bandi Chor Guru.pdf
Sri Guru Hargobind Ji - Bandi Chor Guru.pdfSri Guru Hargobind Ji - Bandi Chor Guru.pdf
Sri Guru Hargobind Ji - Bandi Chor Guru.pdf
Balvir Singh
 
3rd International Conference on Artificial Intelligence Advances (AIAD 2024)
3rd International Conference on Artificial Intelligence Advances (AIAD 2024)3rd International Conference on Artificial Intelligence Advances (AIAD 2024)
3rd International Conference on Artificial Intelligence Advances (AIAD 2024)
GiselleginaGloria
 
A high-Speed Communication System is based on the Design of a Bi-NoC Router, ...
A high-Speed Communication System is based on the Design of a Bi-NoC Router, ...A high-Speed Communication System is based on the Design of a Bi-NoC Router, ...
A high-Speed Communication System is based on the Design of a Bi-NoC Router, ...
DharmaBanothu
 
SELENIUM CONF -PALLAVI SHARMA - 2024.pdf
SELENIUM CONF -PALLAVI SHARMA - 2024.pdfSELENIUM CONF -PALLAVI SHARMA - 2024.pdf
SELENIUM CONF -PALLAVI SHARMA - 2024.pdf
Pallavi Sharma
 
Tools & Techniques for Commissioning and Maintaining PV Systems W-Animations ...
Tools & Techniques for Commissioning and Maintaining PV Systems W-Animations ...Tools & Techniques for Commissioning and Maintaining PV Systems W-Animations ...
Tools & Techniques for Commissioning and Maintaining PV Systems W-Animations ...
Transcat
 
AN INTRODUCTION OF AI & SEARCHING TECHIQUES
AN INTRODUCTION OF AI & SEARCHING TECHIQUESAN INTRODUCTION OF AI & SEARCHING TECHIQUES
AN INTRODUCTION OF AI & SEARCHING TECHIQUES
drshikhapandey2022
 
INTRODUCTION TO ARTIFICIAL INTELLIGENCE BASIC
INTRODUCTION TO ARTIFICIAL INTELLIGENCE BASICINTRODUCTION TO ARTIFICIAL INTELLIGENCE BASIC
INTRODUCTION TO ARTIFICIAL INTELLIGENCE BASIC
GOKULKANNANMMECLECTC
 
This study Examines the Effectiveness of Talent Procurement through the Imple...
This study Examines the Effectiveness of Talent Procurement through the Imple...This study Examines the Effectiveness of Talent Procurement through the Imple...
This study Examines the Effectiveness of Talent Procurement through the Imple...
DharmaBanothu
 
一比一原版(uofo毕业证书)美国俄勒冈大学毕业证如何办理
一比一原版(uofo毕业证书)美国俄勒冈大学毕业证如何办理一比一原版(uofo毕业证书)美国俄勒冈大学毕业证如何办理
一比一原版(uofo毕业证书)美国俄勒冈大学毕业证如何办理
upoux
 
UNIT-III- DATA CONVERTERS ANALOG TO DIGITAL CONVERTER
UNIT-III- DATA CONVERTERS ANALOG TO DIGITAL CONVERTERUNIT-III- DATA CONVERTERS ANALOG TO DIGITAL CONVERTER
UNIT-III- DATA CONVERTERS ANALOG TO DIGITAL CONVERTER
vmspraneeth
 
FULL STACK PROGRAMMING - Both Front End and Back End
FULL STACK PROGRAMMING - Both Front End and Back EndFULL STACK PROGRAMMING - Both Front End and Back End
FULL STACK PROGRAMMING - Both Front End and Back End
PreethaV16
 
Beckhoff Programmable Logic Control Overview Presentation
Beckhoff Programmable Logic Control Overview PresentationBeckhoff Programmable Logic Control Overview Presentation
Beckhoff Programmable Logic Control Overview Presentation
VanTuDuong1
 
paper relate Chozhavendhan et al. 2020.pdf
paper relate Chozhavendhan et al. 2020.pdfpaper relate Chozhavendhan et al. 2020.pdf
paper relate Chozhavendhan et al. 2020.pdf
ShurooqTaib
 
Call For Paper -3rd International Conference on Artificial Intelligence Advan...
Call For Paper -3rd International Conference on Artificial Intelligence Advan...Call For Paper -3rd International Conference on Artificial Intelligence Advan...
Call For Paper -3rd International Conference on Artificial Intelligence Advan...
ijseajournal
 
Particle Swarm Optimization–Long Short-Term Memory based Channel Estimation w...
Particle Swarm Optimization–Long Short-Term Memory based Channel Estimation w...Particle Swarm Optimization–Long Short-Term Memory based Channel Estimation w...
Particle Swarm Optimization–Long Short-Term Memory based Channel Estimation w...
IJCNCJournal
 

Recently uploaded (20)

Introduction to Computer Networks & OSI MODEL.ppt
Introduction to Computer Networks & OSI MODEL.pptIntroduction to Computer Networks & OSI MODEL.ppt
Introduction to Computer Networks & OSI MODEL.ppt
 
Call Girls Chennai +91-8824825030 Vip Call Girls Chennai
Call Girls Chennai +91-8824825030 Vip Call Girls ChennaiCall Girls Chennai +91-8824825030 Vip Call Girls Chennai
Call Girls Chennai +91-8824825030 Vip Call Girls Chennai
 
一比一原版(osu毕业证书)美国俄勒冈州立大学毕业证如何办理
一比一原版(osu毕业证书)美国俄勒冈州立大学毕业证如何办理一比一原版(osu毕业证书)美国俄勒冈州立大学毕业证如何办理
一比一原版(osu毕业证书)美国俄勒冈州立大学毕业证如何办理
 
Impartiality as per ISO /IEC 17025:2017 Standard
Impartiality as per ISO /IEC 17025:2017 StandardImpartiality as per ISO /IEC 17025:2017 Standard
Impartiality as per ISO /IEC 17025:2017 Standard
 
原版制作(Humboldt毕业证书)柏林大学毕业证学位证一模一样
原版制作(Humboldt毕业证书)柏林大学毕业证学位证一模一样原版制作(Humboldt毕业证书)柏林大学毕业证学位证一模一样
原版制作(Humboldt毕业证书)柏林大学毕业证学位证一模一样
 
Sri Guru Hargobind Ji - Bandi Chor Guru.pdf
Sri Guru Hargobind Ji - Bandi Chor Guru.pdfSri Guru Hargobind Ji - Bandi Chor Guru.pdf
Sri Guru Hargobind Ji - Bandi Chor Guru.pdf
 
3rd International Conference on Artificial Intelligence Advances (AIAD 2024)
3rd International Conference on Artificial Intelligence Advances (AIAD 2024)3rd International Conference on Artificial Intelligence Advances (AIAD 2024)
3rd International Conference on Artificial Intelligence Advances (AIAD 2024)
 
A high-Speed Communication System is based on the Design of a Bi-NoC Router, ...
A high-Speed Communication System is based on the Design of a Bi-NoC Router, ...A high-Speed Communication System is based on the Design of a Bi-NoC Router, ...
A high-Speed Communication System is based on the Design of a Bi-NoC Router, ...
 
SELENIUM CONF -PALLAVI SHARMA - 2024.pdf
SELENIUM CONF -PALLAVI SHARMA - 2024.pdfSELENIUM CONF -PALLAVI SHARMA - 2024.pdf
SELENIUM CONF -PALLAVI SHARMA - 2024.pdf
 
Tools & Techniques for Commissioning and Maintaining PV Systems W-Animations ...
Tools & Techniques for Commissioning and Maintaining PV Systems W-Animations ...Tools & Techniques for Commissioning and Maintaining PV Systems W-Animations ...
Tools & Techniques for Commissioning and Maintaining PV Systems W-Animations ...
 
AN INTRODUCTION OF AI & SEARCHING TECHIQUES
AN INTRODUCTION OF AI & SEARCHING TECHIQUESAN INTRODUCTION OF AI & SEARCHING TECHIQUES
AN INTRODUCTION OF AI & SEARCHING TECHIQUES
 
INTRODUCTION TO ARTIFICIAL INTELLIGENCE BASIC
INTRODUCTION TO ARTIFICIAL INTELLIGENCE BASICINTRODUCTION TO ARTIFICIAL INTELLIGENCE BASIC
INTRODUCTION TO ARTIFICIAL INTELLIGENCE BASIC
 
This study Examines the Effectiveness of Talent Procurement through the Imple...
This study Examines the Effectiveness of Talent Procurement through the Imple...This study Examines the Effectiveness of Talent Procurement through the Imple...
This study Examines the Effectiveness of Talent Procurement through the Imple...
 
一比一原版(uofo毕业证书)美国俄勒冈大学毕业证如何办理
一比一原版(uofo毕业证书)美国俄勒冈大学毕业证如何办理一比一原版(uofo毕业证书)美国俄勒冈大学毕业证如何办理
一比一原版(uofo毕业证书)美国俄勒冈大学毕业证如何办理
 
UNIT-III- DATA CONVERTERS ANALOG TO DIGITAL CONVERTER
UNIT-III- DATA CONVERTERS ANALOG TO DIGITAL CONVERTERUNIT-III- DATA CONVERTERS ANALOG TO DIGITAL CONVERTER
UNIT-III- DATA CONVERTERS ANALOG TO DIGITAL CONVERTER
 
FULL STACK PROGRAMMING - Both Front End and Back End
FULL STACK PROGRAMMING - Both Front End and Back EndFULL STACK PROGRAMMING - Both Front End and Back End
FULL STACK PROGRAMMING - Both Front End and Back End
 
Beckhoff Programmable Logic Control Overview Presentation
Beckhoff Programmable Logic Control Overview PresentationBeckhoff Programmable Logic Control Overview Presentation
Beckhoff Programmable Logic Control Overview Presentation
 
paper relate Chozhavendhan et al. 2020.pdf
paper relate Chozhavendhan et al. 2020.pdfpaper relate Chozhavendhan et al. 2020.pdf
paper relate Chozhavendhan et al. 2020.pdf
 
Call For Paper -3rd International Conference on Artificial Intelligence Advan...
Call For Paper -3rd International Conference on Artificial Intelligence Advan...Call For Paper -3rd International Conference on Artificial Intelligence Advan...
Call For Paper -3rd International Conference on Artificial Intelligence Advan...
 
Particle Swarm Optimization–Long Short-Term Memory based Channel Estimation w...
Particle Swarm Optimization–Long Short-Term Memory based Channel Estimation w...Particle Swarm Optimization–Long Short-Term Memory based Channel Estimation w...
Particle Swarm Optimization–Long Short-Term Memory based Channel Estimation w...
 

1-s2.0-S0957417422020759-main.pdf

  • 1. Expert Systems With Applications 213 (2023) 119057 Available online 19 October 2022 0957-4174/© 2022 Elsevier Ltd. All rights reserved. Multi-view and Multi-level network for fault diagnosis accommodating feature transferability Na Lu * , Zhiyan Cui , Huiyang Hu , Tao Yin Systems Engineering Institute, School of Automation Science and Engineering, Xi’an Jiaotong University, Xi’an, Shaanxi 710049, China A R T I C L E I N F O Keywords: Transfer learning Feature transferability Fault diagnosis Few shot learning A B S T R A C T Various deep transfer learning solutions have been developed for machine fault diagnosis. The existing solutions mainly focus on domain adaptation by minimizing the data distribution discrepancy with certain metric, which emphasize the common features embedded in the data cross domains and neglect the unique features toward health condition classification in one specific domain. In these solutions, all the data for training have been forced to align in a common feature space and all the features for domain adaptation have been treated equally. However, there might exist domain specific features which are not appropriate for transfer but may contain essential information for classification in specific domain. In addition, due to the difficulty of collecting machine fault data, the number of machine fault samples is usually quite small or even zero. The traditional deep network structures and the training strategy are not the optimal choice in this occasion. To address these problems, a novel multi-view and multi-level network (MMNet) for fault diagnosis is developed. In MMNet, two network channels have been respectively constructed for cross domain common feature and domain specific feature learning to provide multi-view features. This architecture could implicitly differentiate the common features cross domains and the specific features only in one domain. In the channel of domain specific feature, a domain classifier and fault classifier are combined to learn the domain specific features. Multiple kernel maximum mean discrepancy (MK-MMD) is imposed on multiple layers of the common feature channel to implement domain adaptation and extract cross domain common features. The domain classification and fault classification together form a multi-level classification scheme. A classic few shot learning architecture with two modules respectively for feature extraction and relation computation is adopted as the backbone network. The relation score based classification mechanism enables zero shot fault classification in the target domain. Episode based few shot training strategy is employed to enhance the performance of MMNet with few labeled training data. Extensive experiments have demonstrated the state-of-the-art performance of MMNet on the involved transfer tasks. 1. Introduction Machine fault in industry could bring catastrophic damage and enormous economic loss (Lei, Jia, Lin, Xing, & Ding, 2016). Therefore, fault diagnosis has long been a popular and important research field which involves multidisciplinary researches like mechanical engineer­ ing, signal processing, and machine learning and so on. Machines usu­ ally work in health state during most time of their life circle. Different possible faults only occur in rare occasions. Due to the long time span of normal condition and sporadic occurrence of fault, it is commonly acknowledged that the fault data collected from one machine is quite limited especially in practical application. 
While in laboratory envi­ ronment, it is much easier to collect manual fabricated fault data. Therefore, how to learn efficient representation of fault data and transfer the knowledge learnt from data abundant scenarios to data lack sce­ narios are crucial for fault diagnosis. To this end, deep learning and transfer learning have been widely explored in recent decades in fault diagnosis. Various deep network models have been employed to automatically extract discriminant fea­ tures from machine fault data (Lu & Yin, 2021). Network structures like Peer review under responsibility of Submissions with the production note ‘Please add the Reproducibility Badge for this item’ the Badge and the following footnote to be added:The code (and data) in this article has been certified as Reproducible by the CodeOcean: https://codeocean.com. More information on the Reproduc­ ibility Badge Initiative is available at https://www.elsevier.com/physicalsciencesandengineering/computerscience/journals.. * Corresponding author. E-mail address: lvna2009@xjtu.edu.cn (N. Lu). Contents lists available at ScienceDirect Expert Systems With Applications journal homepage: www.elsevier.com/locate/eswa https://doi.org/10.1016/j.eswa.2022.119057 Received 16 June 2021; Received in revised form 23 April 2022; Accepted 13 October 2022
  • 2. Expert Systems With Applications 213 (2023) 119057 2 AutoEncoder (Yu, Wang, Li, & Zhao, 2019), sparse AutoEncoder (Wen, Gao, & Li, 2017), Convolutional Neural Network (CNN) (Jia, Lei, Lu, & Xing, 2018; Yang, Lei, Jia, Li, & Du, 2020) have been widely employed for fault representation learning. In addition, Generative Adversarial Network (GAN) (Chen et al., 2020; Li et al., 2020; Zhang et al., 2020) based methods have also been employed for fault diagnosis, which aim to generate more fault samples to balance the fault dataset and improve the classification performance. Except the GAN based solutions, most of the fault classification network architectures and their training methods were borrowed directly from the classic deep learning solutions of computer vision, which can well fit the big data applications. However, when the fault data are not abundant and especially when no labeled data are available, more appropriate network architecture and training method need be developed. Another important issue in fault diagnosis is how to transfer the knowledge from the domain with relatively abundant labeled data (source domain) to the domain with few or no labeled data (target domain). Here the different domains could be understood as different machines or one machine under different working conditions. To address this issue, many solutions combining deep neural network and transfer learning have been developed (Li et al., 2020; Li, Zhang, Ding, & Sun, 2019; Shao, McAleer, Yan, & Baldi, 2019; Xu, Liu, Jiang, Shen, & Huang, 2020; Yang et al., 2020; Yang, Lei, Jia, & Xing, 2019) which we refer to as deep transfer learning methods for simplicity. These methods mainly aimed at minimizing the distribution discrepancy between different domains and improving the fault classification accuracy. To fulfill domain adaptation, multiple metrics of data distribution have been applied, including Maximum Mean Discrepancy (MMD) (Yang et al., 2019), Multi-kernel Maximum Mean Discrepancy (MK-MMD) (Che, Wang, Ni, & Fu, 2020), Polynomial-kernel Maximum Mean Discrepancy (PK-MMD) (Yang et al., 2020) and so on. These metrics evaluate the data distribution difference which is used as the domain adaptation loss to train the fault diagnosis model. The training objective functions of deep transfer learning models usually contain two parts, classification loss and domain adaptation loss. By minimizing the overall loss of these terms, the deep transfer learning models could be trained. Long et al. (Long, Cao, Wang, & Jordan, 2015) developed a widely used deep transfer learning method with domain adaptation. MK-MMD loss was used on the last three fully connected layers but the output layer to enable domain adaptation. Lu et al. (Lu et al., 2017) adopted MMD as the distribution discrepancy measure and developed a deep neural network (DNN) model for fault diagnosis. The MMD loss was imposed on the feature layer of a DNN. A gearbox dataset collected under different working conditions was employed to evaluate the method. A deep convolutional transfer learning network (DCTLN) was constructed by Guo et al. (Guo, Lei, Xing, Yan, & Li, 2019) to implement fault diagnosis knowledge transfer. One convolutional network module was used for fault condition recognition and another convolutional network module was used for domain distribution adaptation. Three datasets collected from bearings were used for experiments to test the transferability of DCTLN. Wen et al. 
(Wen, Gao, & Li, 2019) developed a sparse autoen­ coder for feature representation learning which used frequency spec­ trum of vibration sequences recorded from bearings as input. Domain adaptation was implemented via MMD. Li et al. (Li, Zhang, & Ding, 2018) also proposed a domain adaptive deep convolutional neural network for bearing fault diagnosis and fault knowledge transfer. The fault dataset was collected under working environments with different noise. Frequency spectrum was employed as the input to the CNN model. The cross-domain feature discrepancy was also minimized based on MMD. FTNN (feature-based transfer neural network) was developed by Yang et al. (Yang et al., 2019) to diagnose the machine faults of real-case machines by the knowledge learnt from the data recorded from labo­ ratory machines. MMD was also adopted for domain adaptation which was imposed on multiple network layers. Four bearing fault datasets were used to construct the transfer experiments and test the perfor­ mance of FTNN. Lu et al. (Lu & Yin, 2021) developed a combined solution of convolutional autoencoder and convolutional network for bearing fault diagnosis, where the convolutional autoencoder was adopted to mine the common features cross domains. MMD was employed for domain adaptation in the convolutional autoencoder. From the above literature review, it could be seen that the domain transfer is usually implemented by imposing certain domain distribution metric on one or several network layers within the deep transfer model. In these solutions, all the input data for training were treated equally for domain adaptation. No matter what transfer learning solutions were adopted, an intermediate data distribution space would be learnt where the source domain and the target domain data were aligned with each other. Therefore, an implicit assumption is actually made in the existing deep transfer solutions that all the features learnt from the source domain could be appropriately transferred into the intermediate feature space, meanwhile maintaining discriminant power in both domains. However, there is no guarantee that the features of the data belonging to the same category from different domains could be transferred to the same cluster in the intermediate feature space. Some original features might carry discriminant information for the source domain, which might get lost after transferring for both domains. The nonlinear feature mapping obtained by the deep transfer model is not a deterministic projection function for both domains which means the samples from the same class but different domains might be mapped to regions belonging to different classes. The samples that are mistakenly projected will deteriorate the performance of the transfer model and lead to false classification. Therefore, to achieve high classification accuracy it is not sufficient to transfer all the source samples to the common feature space and only use the transferred common features cross domains for fault diagnosis. In order to keep the domain specific features and mine common features cross both domains simultaneously, a novel deep transfer so­ lution termed as multi-view multi-level network (MMNet) is developed in this paper. MMNet constructs a dual channel structure to learn the representations of common features cross domains and discriminant features in specific domain which form multi-view features for classifi­ cation. 
Domain level classification and fault level classification are combined to extract the domain specific features. The cross domain common features are learnt by MK-MMD based domain adaptation and fault level classification. In addition, to deal with the data deficiency problem, an efficient few shot learning mechanism is adopted which employs two modules i.e. feature extraction module and feature com­ parison module to perform fault diagnosis. Two weight shared branches are employed to extract multi-view features of both domains simulta­ neously, which form the feature extraction module. In the feature comparison module, relation score between template sample and query sample is used to implement fault classification. In MMNet, no labeled sample from the target domain is required. The test samples from the target domain are compared with the template samples from the source domain for fault diagnosis, which enables zero shot diagnosis in the target domain. Episode based training strategy is adopted to train MMNet. There are three major contributions in this paper. First, the property of the features before and after domain transfer has been analyzed, based on which a multi-view feature extraction mechanism incorporating domain specific features and cross domain common features is proposed. Second, a multi-view multi-level network MMNet is constructed which combines fault level classification and domain level classification to learn domain specific features, and meanwhile combines MK-MMD based domain adaptation and fault level classification to learn com­ mon features cross domains. Third, a FeatureNet module is used to extract sample features and a RelationNet module is adopted to implement fault classification in MMNet, which enables zero shot fault diagnosis in the target domain. The paper is organized as follows. Section 1 is introduction. Problem formulation, transfer feature analysis and some preliminary knowledge N. Lu et al.
  • 3. Expert Systems With Applications 213 (2023) 119057 3 are discussed in Section 2. Section 3 describes the proposed solution MMNet in details. Section 4 reports experiment and comparison results to demonstrate the effectiveness of MMNet. Conclusions are made in Section 5. 2. Motivation and preliminaries 2.1. Problem formulation and motivation In machine fault diagnosis task, data are collected from one machine under different working conditions or different machines. The data from different working conditions or different machines follow different probability distributions, which are viewed as different domains. Transfer learning aims at borrowing the knowledge learnt from one domain to another domain. The former one is called source domain and the latter one target domain, which could be denoted as D s and D t respectively. The sample space of the source domain and the target domain can be denoted as Xs and Xt which satisfy Xs ⊂D s and Xt ⊂D t . The samples drawn from the source space can be represented as { xs i } , i = 1, 2, ⋯, ns and the samples from the target space can be represented as { xt i } , i = 1, 2, ⋯, nt, where ns and nt are respectively the number of samples from the corresponding domain. The fault categories in the source and the target domain are assumed to be the same. The fault class space is denoted as Y = {1, 2, • • •, C }, where C is the number of fault categories involved. Therefore, there exists Ys = Yt = Y. Accordingly, one labeled sample from the source and the target domain could be respectively represented as { xs i , ys i } , i = 1, 2, ⋯, ns and { xt i , yt i } , i = 1, 2, ⋯,nt. In our study, the training set from the source domain are labeled and no label information from the target domain training set is used. Transfer learning methods try to learn an intermediate feature space where the data from different space could be aligned. When deep transfer learning methods are employed, an intermediate feature space can be constructed by the learnt features which can be denoted as Xm . At different layers of the deep model, multiple intermediate feature space will be learnt. For simplicity, we use Xm as a general representation for all the intermediate feature space. The nonlinear mapping from the input sample to the intermediate feature space is represented as φ : Xs , Xt →Xm . With an ideal nonlinear mapping, the input samples from the source and the target domain belonging to one category should be mapped to the same region within one class boundary in the feature space. However, the nonlinear model learned by neural network training is not a deterministic optimal solution. Some samples of the same class from the source and the target domains will be mapped to different class regions. Fig. 1 gives an illustration of the mistakenly mapped samples. Fig. 1(a) depicts the samples within the source domain and Fig. 1(b) shows the projected results in the intermediate feature space from both the source and the target domain. The solid triangles and circles in Fig. 1(a) and (b) are samples from two fault classes of the source domain. The dotted triangles and circles in Fig. 1(b) represent the samples from the target domain belonging to the corresponding two classes as the source domain samples. Within the source domain, these samples could be well classified by the classification boundary as shown in Fig. 1(a). 
When the samples have been mapped to the intermediate feature space, to correctly classify the target domain samples the ex­ pected target class boundary should be set as in Fig. 1(b). It could be seen that some mapped source domain samples are not in agreement with the correct class boundary. When all the mapped samples from the source domain are treated as prior knowledge for the target domain, an actual class boundary would be obtained as shown in Fig. 1(b). Obviously some source domain samples have not been appropriately mapped and could bring misleading information. If deep transfer learning model is employed, to alleviate the influence from the above discussed phenomenon, the weights in corresponding to such misleading samples should be suppressed. Their contribution to the target domain fault classification should be minimized. However, in the source domain fault classification, these samples might play important role and thus their corresponding weight could not be diminished during the model training progress. The existing deep transfer learning solu­ tions treat all the samples indifferently with the domain adaptation procedure, which makes the above discussed problem an issue to be addressed and forms one of the motivations of this study. In addition, the widely used benchmarks for deep model training are usually of very large scale. The popular image dataset ImageNet (Deng et al., 2009) contains more than 10 million samples from more than 20 thousand categories. Sports-1 M (Karpathy et al., 2014) is a famous video dataset for action recognition which includes more than 1 million videos. LaSOT (Fan et al., 2019) is a representative visual tracking dataset which includes more than 3 million image frames. In contrast, the fault diagnosis benchmarks like CWRU bearing dataset provided by Case Western Reserve University (Center), IMS bearing dataset (Guo et al., 2019) and RL bearing dataset (Lei, 2017) usually only contain several hundred or several thousand samples. Therefore, fault diagnosis Fig. 1. Illustration of mistakenly mapped samples from the source domain to the intermediate feature space. (a) Source domain samples and their class boundary (b) Mapped source domain and target domain samples in the intermediate feature space and class boundaries. N. Lu et al.
  • 4. Expert Systems With Applications 213 (2023) 119057 4 is a relatively small data problem. Appropriate deep models which could well deal few shot learning scenarios should be explored. Furthermore, when zero labeled sample is provided in the target domain, how to implement efficient fault knowledge transfer and fault classification remains a challenge. This is another motivation of this work. 2.2. Multiple kernel maximum mean discrepancy Multiple Kernel Maximum Mean Discrepancy (MK-MMD) is an improved version of Maximum Mean Discrepancy (MMD). MMD is a metric evaluating the data distribution distinction between the source and the target domain. It is indicated in (Gretton, Borgwardt, Rasch, Scholkopf, & Smola, 2012) that the probability distribution difference between two domains could be estimated by their mean embedding in the Reproducing Kernel Hilbert Space (RKHS) via the characteristic kernel function. Gaussian kernel is characteristic on Rd which is used to define MMD. Given i.i.d samples from the source and the target domain as Xs := { xs 1, xs 2, ⋯, xs ns } and Xt : = { xt 1, xt 2, ⋯, xt nt } , which are respec­ tively drawn from probability distribution Ps and Pt, and suppose H k is the RKHS endowed with characteristic Gaussian kernel k( • ), the MMD can be formulated as. dH k (F , Ps, Pt) := sup f ∈ F ( 1 ns ∑ ns i=1 f ( xs i ) − 1 nt ∑ nt i=1 f ( xt i ) ) , (1) where F is a class of functions which performs nonlinear mapping as f : Xs →R or f : Xt →R, sup ( • ) is the supremum of the input. The two terms in the bracket of Eq. (1) are respectively the empirical mean expecta­ tions of the source and the target domain calculated on the samples. It has been demonstrated in (Gretton et al., 2012) that the nonlinear function f( • ) could be estimated by the endowed Gaussian kernel function. Therefore, MMD could be estimated by the data samples as. where k(•, •) is the characteristic Gaussian kernel. Given two feature vectors xi and xj, the Gaussian kernel function is defined as. k ( xi, xj ) = e − ‖xi− xj‖2 γ (3) where γ is the kernel width. MMD uses single Gaussian kernel to evaluate the distribution distinction between the source and the target domain, which suffers from suboptimal kernel selection and limited adaptation effectiveness. MK-MMD (Long et al., 2015) constructs a multiple-kernel variant of MMD, which employs the combination of multiple Gaussian kernels to measure the distribution discrepancy. The characteristic kernel used in MK-MMD is defined as. k = ∑mu u=1 βuku, s.t. ∑mu u=1 βu = 1, βu ≥ 0, ∀u, (4) where mu is the number of used kernels and βu is the weight of kernel u. In this research, Gaussian kernels are used as the base kernels. One Gaussian kernel can be rewritten as ku ( xi, xj ) = e − ‖xi− xj‖2 γ . Through changing the kernel bandwidth γ between 2− ⌊ku/2⌋ γ and 2⌊ku/2⌋ γ with a scaling parameter of 2, where ⌊. • / • ⌋ is the integer division, the mu Gaussian kernels could be obtained. 2.3. Few shot learning Few shot learning has developed into an important direction in machine learning research which aims at exploring effective solutions for application scenarios with small dataset for training. There are mainly-two popular categories of few shot learning methods, metric based methods and optimization based methods. Matching network (Vinyals, Blundell, Lillicrap, Kavukcuoglu, & Wierstra, 2016), prototype network (Snell, Swersky, & Zemel, 2017) and relation network (Sung et al., 2018) are representative metric based few shot learning methods. 
2.3. Few shot learning

Few shot learning has developed into an important direction of machine learning research which aims at exploring effective solutions for application scenarios with only small datasets for training. There are two popular categories of few shot learning methods: metric based methods and optimization based methods. Matching network (Vinyals, Blundell, Lillicrap, Kavukcuoglu, & Wierstra, 2016), prototype network (Snell, Swersky, & Zemel, 2017) and relation network (Sung et al., 2018) are representative metric based few shot learning methods. Methods like model-agnostic meta-learning (MAML) (Finn, Abbeel, & Levine, 2017) and task-agnostic meta-learning (TAML) (Jamal & Qi, 2019) are optimization based methods. A common property of these few shot learning methods is that small mini-batches over multiple tasks are sampled to train the model iteratively. This cross-task training procedure enables fast fine tuning of the model and improves its generalization performance, which assures the model's effectiveness in small data application scenarios. Among these few shot learning methods, relation network (Sung et al., 2018) employs a network module, called the relation module, to learn the metric for evaluating sample differences. Before the relation module, a feature module is used to extract the features of the input samples. Considering the excellent performance of relation network, its two-module architecture has been borrowed to build MMNet in this study.

3. Multi-view and Multi-level network

As discussed in Section 2.1, taking all the samples into domain adaptation equally might lead to the loss of important information. The domain specific information carried by the samples inappropriate for transfer will be suppressed to fulfill domain adaptation between the source and the target domain. In order to retain as much effective information as possible, both the common features across domains and the domain specific features should be extracted simultaneously. In addition, few shot learning mechanisms should be incorporated to deal with the data paucity issue in fault diagnosis. Therefore, a novel solution, MMNet, is developed which learns multi-view features with multi-level classification.

3.1. Architecture of MMNet

Within a domain adaptation deep network, all the involved network weights are adjusted toward improving the classification performance of the network. Therefore, the contribution of the samples which are inappropriate for domain adaptation will be diminished. Only the features of the samples that benefit the domain alignment between the source and the target domain will be effectively extracted. To extract both cross domain common features and domain specific features, two isolated network channels for feature extraction are designed in MMNet. Fig. 2 gives the detailed architecture of MMNet: Fig. 2(a) shows the structure of MMNet and Fig. 2(b) gives the notations of the different channels in the network.
The overall architecture of MMNet borrows the module arrangement of relation network (Sung et al., 2018). As shown in Fig. 2(a), MMNet has two modules, denoted FeatureNet and RelationNet. FeatureNet extracts the features of the input samples and RelationNet computes the relation between samples. Each module contains two branches, indicated as the source branch and the target branch, which process the input samples from the source and the target domain respectively. In the FeatureNet module, the upper two feature extraction channels form the source branch, which extracts the features of the source domain samples; the lower two feature extraction channels form the target branch, which extracts the features of the target domain samples. The source and target branches share the same weights. The cross domain common feature channel aims at extracting the common features across domains via domain adaptation, while the domain specific feature channel extracts the domain specific discriminant features facilitating both fault classification and domain classification. The corresponding channel notations are given in Fig. 2(b). The two branches in the RelationNet module are also weight shared.

To obtain common features across the source and the target domains, MK-MMD based domain adaptation is employed. It has been indicated in (Long et al., 2015) that with increasing network depth the features learned over the layers transit from general to specific, and the specific features of one domain are more difficult to transfer to another domain than the general features. Therefore, MK-MMD loss is imposed on three layers of MMNet as shown in Fig. 2(a): in the FeatureNet module, on the highest convolutional layer; in the RelationNet module, on the two highest fully connected layers excluding the output layer.

To obtain domain specific features, domain level classification and fault level classification have both been incorporated. Domain level classification is performed based on the features extracted by the domain specific feature channels in the FeatureNet module. The domain specific feature channel aims at boosting both domain classification and fault classification, and could thus learn the features that benefit classification in a specific domain.

Fig. 2. Architecture of MMNet. (a) MMNet structure. (b) Details of network branches in MMNet.

The details of the network channels are given in Fig. 3. The two feature learning channels in both the source and the target branch of FeatureNet have the same structure settings.
In each channel, there are three convolutional layers, each followed by an average pooling layer. All three convolutional layers use 20 feature maps with a kernel size of 3 × 1, and the pooling size of the average pooling layers is 2. In the source branch, based on the features learned by the domain specific feature channel, a flatten layer with a dimension of 5120 and a fully connected layer are used for domain classification. Here domain classification is a binary classification problem: samples from the source domain are labeled 1 and samples from the target domain are labeled 0. The upper channel in the RelationNet module calculates the similarity between the concatenated features and implements fault classification, as shown in Fig. 2(a). The lower channel in the RelationNet module shares the same structure with the upper channel and only participates in the domain adaptation calculation. In both RelationNet channels, two convolutional layers, one flatten layer and two fully connected layers are employed. The convolutional kernel width is 3 × 1 and the average pooling size is 4. The dimensions of the flatten layer and the two fully connected layers are 1280, 512 and 256 respectively. The computation and optimization details are given in the following section.
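The channel structures described above can be summarized in a brief PyTorch sketch. This is a hypothetical reconstruction, not the released MMNet code: the activation functions and convolution padding are not specified in the text and are our assumptions, so the flatten widths may differ from the reported 5120 and 1280 (nn.LazyLinear is used to absorb this).

```python
import torch
import torch.nn as nn

class FeatureChannel(nn.Module):
    # One feature extraction channel of the FeatureNet module: three 3 x 1
    # convolutions with 20 feature maps, each followed by average pooling of size 2.
    def __init__(self):
        super().__init__()
        blocks = []
        for i in range(3):
            blocks += [nn.Conv1d(1 if i == 0 else 20, 20, kernel_size=3, padding=1),
                       nn.ReLU(),          # assumed activation
                       nn.AvgPool1d(2)]
        self.net = nn.Sequential(*blocks)

    def forward(self, x):                  # x: (batch, 1, 1024) vibration segment
        return self.net(x)

class RelationChannel(nn.Module):
    # One RelationNet channel: two convolutions (kernel 3 x 1, average pooling 4),
    # a flatten layer, fully connected layers of width 512 and 256, and a single
    # relation score per template-query pair, feeding the Softmax of Eq. (6).
    def __init__(self, in_channels=40):    # concatenated template + query feature maps
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(in_channels, 20, kernel_size=3, padding=1), nn.ReLU(), nn.AvgPool1d(4),
            nn.Conv1d(20, 20, kernel_size=3, padding=1), nn.ReLU(), nn.AvgPool1d(4),
            nn.Flatten(),
            nn.LazyLinear(512), nn.ReLU(),  # flatten width depends on padding; the paper reports 1280
            nn.Linear(512, 256), nn.ReLU(),
            nn.Linear(256, 1))

    def forward(self, z):
        return self.net(z)
```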
3.2. Optimization of MMNet

The training of MMNet adopts the episode based training strategy of few shot learning methods. The training set is constructed from the samples of both the source and the target domain: the labeled source domain part is used for fault classification training, the unlabeled target domain part is used for domain adaptation, and both parts are used for domain classification training. In episode based training, an experiment mechanism called the k-way m-shot setting is used, where k is the number of classes involved in each episode and m is the number of labeled samples per category that serve as templates for comparison. Specifically, in each episode a mini-batch is randomly selected from the source domain dataset as the template set, whose size is k × m in a k-way m-shot setting. A fraction of the remaining dataset is used as the query set.

In each episode, the features of the m template samples from each category are extracted by the FeatureNet module. The template samples can be denoted as $\{x_i^{s,t}\}, i = 1, 2, \cdots, m$, where s indicates that the samples come from the source domain dataset and t indicates that the samples serve as templates. The query samples are also fed to the FeatureNet module to extract their feature representations. The source query set can be represented as $\{x_i^{s,q}\}, i = 1, 2, \cdots, n$, where n is the number of query samples used for training from each class. These two parts of data are the input to the domain specific feature channel in the source branch of FeatureNet, as shown in Fig. 2(a). For the lower target branch, the same set of template samples is used, while the query set comes from the target domain and is denoted as $\{x_i^{t,q}\}, i = 1, 2, \cdots, n$. The numbers of query samples from the source and the target domain are the same. For each branch in the FeatureNet module, all the template samples from the source domain and two query samples, respectively from the source and the target domain, are fed to the FeatureNet module separately during each episode to obtain their corresponding feature vectors.

When the number of template samples m is larger than 1, the sum of their feature vectors is used as the template feature vector; the query feature vector is obtained from the query sample. Suppose the feature vectors of $\{x_i^{s,t}\}, i = 1, 2, \cdots, m$, $x_i^{s,q}$ and $x_i^{t,q}$ extracted by the FeatureNet module in one episode are $\{f_i^{s,t}\}, i = 1, 2, \cdots, m$, $f_i^{s,q}$ and $f_i^{t,q}$ respectively. The final template feature vector is obtained by summing up the feature vectors of all the template samples as

$$f^{s,t} = \sum_{i=1}^{m} f_i^{s,t}. \quad (5)$$

For each category of machine fault, a template vector is computed during each episode. After the FeatureNet module, the template feature vector and the query feature vector are concatenated with each other to form the input of the following RelationNet module, as shown in Fig. 2(a). During the training stage, one source domain query sample and one target domain query sample are fed to MMNet each time along with the template samples.

Fig. 3. Network structure details of the network channels in MMNet.
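As a concrete illustration of the episode construction and the template aggregation of Eq. (5), the sketch below samples one k-way m-shot episode. The container layout (a dict from class id to sample tensors) and the function names are our own assumptions for illustration only.

```python
import random
import torch

def sample_episode(source_by_class, target_pool, k=4, m=5, n_query=25):
    # One k-way m-shot training episode: m labeled templates and n_query labeled
    # queries per class from the source domain, plus unlabeled target queries.
    classes = random.sample(sorted(source_by_class), k)
    templates, s_queries, s_labels = [], [], []
    for label, c in enumerate(classes):
        picks = random.sample(source_by_class[c], m + n_query)
        templates.append(torch.stack(picks[:m]))           # (m, signal_len) per class
        s_queries.extend(picks[m:])
        s_labels.extend([label] * n_query)
    t_queries = random.sample(target_pool, k * n_query)    # no labels required
    return templates, torch.stack(s_queries), torch.tensor(s_labels), torch.stack(t_queries)

def template_vector(template_features):
    # Eq. (5): the class template is the sum of its m template feature vectors.
    return template_features.sum(dim=0)
```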
With the RelationNet module, the similarity between the query sample and the template of each category is calculated, producing a relation score $r_c(f_i^{s,q}, f^{s,t})$ for the source domain query sample, where c is the class index. Based on the relation scores, the Softmax function is employed to implement machine health condition classification as

$$p(y_i^{s,q} = c) = \frac{\exp\left(r_c(f_i^{s,q}, f^{s,t})\right)}{\sum_{c'=1}^{C} \exp\left(r_{c'}(f_i^{s,q}, f^{s,t})\right)}, \quad (6)$$

where $p(y_i^{s,q} = c)$ is the probability of the ith query sample from the source domain belonging to class c. The query samples from the target domain are used only for domain adaptation and no labels are provided for them, so classification of the target domain query samples is not conducted, as shown in Fig. 2(a).

To optimize MMNet, three parts of loss are combined: the machine fault classification loss, the domain classification loss and the domain adaptation loss. The fault classification loss is calculated based on the relation score, so it is termed the relation loss for simplicity, as shown in Fig. 2(a). The domain classification loss (domain loss for short) further includes two parts, i.e. the domain classification losses for the query samples from the source domain and the target domain respectively. The relation loss is denoted as $\mathcal{L}_r$ and defined by the cross entropy loss as

$$\mathcal{L}_r = \sum_{i=1}^{n_{bs}} J(x_i^{s,q}, y_i^{s,q} \,|\, \theta) = -\sum_{i=1}^{n_{bs}} y_i^{true} \log y_i^{s,q}, \quad (7)$$

where $n_{bs}$ is the number of source domain query samples in one training episode, $\theta$ represents the parameters of the network, $y_i^{s,q}$ is the estimated fault label and $y_i^{true}$ is the true fault label. The two parts of the domain loss are respectively denoted as $\mathcal{L}_{ds}$ and $\mathcal{L}_{dt}$ for the source and the target domain query samples. They also use the cross entropy loss and are formulated as

$$\mathcal{L}_{ds} = \sum_{i=1}^{n_{bs}} J(x_i^{s,q}, d_i^{s,q} \,|\, \theta) = -\sum_{i=1}^{n_{bs}} d_i^{true} \log d_i^{s,q} \quad (8)$$

and

$$\mathcal{L}_{dt} = \sum_{i=1}^{n_{bt}} J(x_i^{t,q}, d_i^{t,q} \,|\, \theta) = -\sum_{i=1}^{n_{bt}} d_i^{true} \log d_i^{t,q}, \quad (9)$$

where $n_{bs}$ and $n_{bt}$ are the numbers of query samples from the source and the target domain respectively, $d_i^{s,q}$ and $d_i^{t,q}$ are the estimated domain labels of the query samples, and $d_i^{true}$ is the true domain label, with $d_i^{true} = 1$ if the query sample comes from the source domain and $d_i^{true} = 0$ otherwise.

The domain adaptation loss is evaluated based on MK-MMD as discussed in Section 2.2, which is denoted as the MK-MMD loss in Fig. 2(a) and calculated as

$$\mathcal{L}_{MK\text{-}MMD} = d^2_{\mathcal{H}_k}(X^s, X^t), \quad (10)$$

where $X^s = \{x_i^{s,q}\}, i = 1, 2, \cdots, n_{bs}$ and $X^t = \{x_i^{t,q}\}, i = 1, 2, \cdots, n_{bt}$. An unbiased estimate of MK-MMD is adopted to calculate $d^2_{\mathcal{H}_k}(X^s, X^t)$ as in (Long et al., 2015), formulated as

$$d^2_{\mathcal{H}_k}(X^s, X^t) = \frac{2}{n_{bs}} \sum_{i=1}^{n_{bs}/2} g_k(z_i), \quad (11)$$

where $z_i$ is a quad-tuple defined as $z_i \triangleq (x_{2i-1}^{s,q}, x_{2i}^{s,q}, x_{2i-1}^{t,q}, x_{2i}^{t,q})$, and $g_k(z_i)$ is calculated as

$$g_k(z_i) \triangleq k(x_{2i-1}^{s,q}, x_{2i}^{s,q}) + k(x_{2i-1}^{t,q}, x_{2i}^{t,q}) - k(x_{2i-1}^{s,q}, x_{2i}^{t,q}) - k(x_{2i}^{s,q}, x_{2i-1}^{t,q}), \quad (12)$$

where the kernel function k is the weighted combination of multiple Gaussian kernels defined in Eq. (4). The weight $\beta_u$ of kernel u is obtained by the same method as in (Long et al., 2015), reducing the kernel optimization to a quadratic program (QP). The MK-MMD loss is calculated on three layers, i.e. the highest convolutional layer in the FeatureNet module and two fully connected layers in the RelationNet module.
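The quad-tuple estimator of Eqs. (11) and (12) admits a short linear-time implementation. The following sketch, under our own naming, assumes the combined kernel of Eq. (4) is evaluated elementwise on paired rows, with the kernel weights and bandwidths passed in as given.

```python
import torch

def combined_kernel(gammas, betas):
    # Elementwise weighted combination of Gaussian kernels, Eq. (4).
    def k(a, b):
        d2 = ((a - b) ** 2).sum(dim=1)
        return sum(beta * torch.exp(-d2 / gamma) for gamma, beta in zip(gammas, betas))
    return k

def mk_mmd2_unbiased(xs, xt, kernel):
    # Linear-time unbiased MK-MMD of Eqs. (11)-(12): features are grouped into
    # quad-tuples z_i = (xs_{2i-1}, xs_{2i}, xt_{2i-1}, xt_{2i}).
    n = min(xs.size(0), xt.size(0)) // 2 * 2   # keep an even number of samples
    s1, s2 = xs[0:n:2], xs[1:n:2]
    t1, t2 = xt[0:n:2], xt[1:n:2]
    g = kernel(s1, s2) + kernel(t1, t2) - kernel(s1, t2) - kernel(s2, t1)
    return g.mean()                            # equals (2/n) * sum over the n/2 tuples
```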
Combining the relation loss, the MK-MMD loss and the domain loss, the overall loss function can be formulated as

$$\mathcal{L} = \mathcal{L}_r + \mathcal{L}_{MK\text{-}MMD} + \mathcal{L}_{ds} + \mathcal{L}_{dt}. \quad (13)$$

In addition, trade-off parameters can be incorporated to weight the loss terms in Eq. (13) with different importance. As discussed in Section 3.1, the MK-MMD loss has three parts imposed on three layers, which can be denoted as $\mathcal{L}_{MK\text{-}MMD1}$, $\mathcal{L}_{MK\text{-}MMD2}$ and $\mathcal{L}_{MK\text{-}MMD3}$. Therefore, four trade-off parameters are incorporated and the weighted loss is written as

$$\mathcal{L} = \mathcal{L}_r + \lambda_1 \mathcal{L}_{MK\text{-}MMD1} + \lambda_2 \mathcal{L}_{MK\text{-}MMD2} + \lambda_3 \mathcal{L}_{MK\text{-}MMD3} + \lambda_4 (\mathcal{L}_{ds} + \mathcal{L}_{dt}), \quad (14)$$

where $\lambda_1$, $\lambda_2$, $\lambda_3$ and $\lambda_4$ are the trade-off parameters. MMNet is trained by minimizing the above loss as $\min_\theta \mathcal{L}$. Adam is adopted as the optimization method to train the network and optimize the network parameters $\theta$. The weights $\beta_u, u = 1, \cdots, m_u$ of the Gaussian kernels in MK-MMD are then optimized in an alternating way by QP. The details of the training process of MMNet are given in Table 1.

Table 1. Training process of MMNet.
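A minimal sketch of the weighted objective of Eq. (14) is given below. The function signature is our own illustrative choice, and the default trade-off values are the 4-way 1-shot settings reported later in Table 3.

```python
def mmnet_loss(relation_loss, mmd_losses, domain_losses,
               lambdas=(2.25, 1.25, 0.5, 0.1)):
    # Weighted objective of Eq. (14): relation loss, three layer-wise MK-MMD
    # losses, and the source/target domain classification losses.
    l1, l2, l3, l4 = lambdas
    m1, m2, m3 = mmd_losses
    return relation_loss + l1 * m1 + l2 * m2 + l3 * m3 + l4 * sum(domain_losses)
```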
4. Experiment results and discussions

4.1. Datasets and experiment setting

Four datasets were employed to test the effectiveness of MMNet; their specifications are given in Table 2. Among these four datasets, the first two were recorded in the laboratory with artificial faults, the third was collected in the laboratory with run-to-failure faults, and the last was collected from bearings used in practical application. All the data are vibration signals collected by accelerometers from operating bearings. Four classes of health condition are incorporated in these datasets: normal condition (NC), inner race fault (IF), outer race fault (OF) and ball fault (BF). The test benches that produced the four datasets are illustrated in Fig. 4, together with an illustration of the four types of health condition. The bearings differ in specification model, rotation speed, working load and sampling rate. Vibration signals from the same type of rotary part with the same fault are expected to show similar characteristics, which makes it possible to transfer knowledge between different datasets.

Table 2. Dataset specifications.

Dataset  Bearing specs  Health conditions  Number of samples  Operation configuration
A        SKF6205        NC, IF, OF, BF     4 × 101            0 HP, 1797 r/min
B        SKF6205        NC, IF, OF, BF     4 × 101            3 HP, 1730 r/min
C        ZA-2115        NC, IF, OF, BF     4 × 101            6000 lbs, 2000 r/min
D        552732QT       NC, IF, OF, BF     4 × 101            9800 N, 500 r/min

Fig. 4. Test bench of CWRU [23], IMS [31] and RL [24] bearing datasets and the corresponding health condition illustration [6].

Datasets A and B are from the CWRU bearing dataset provided by Case Western Reserve University (Center). The vibration data were collected from a motor bearing experiment platform (Fig. 4(a)) with a sampling frequency of 12 kHz. Artificial single point faults were made on the bearings and the corresponding vibration signals were collected in a laboratory environment. The diameter of the point fault was 0.0014 in. Datasets A and B were respectively collected under 0 HP and 3 HP motor loads. For each health condition, 101 samples are used in this study, each with 1024 data points, so there are 404 samples in total in each of datasets A and B.

Dataset C is from the IMS bearing dataset, provided by the NSF I/UCR Center for Intelligent Maintenance Systems (IMS) (Qiu, Lee, Lin, & Yu, 2006). Four bearings were installed on a shaft rotating at a constant speed of 2000 RPM. Accelerometers were installed on the bearing housing to collect vibration signals, a radial load of 6000 lbs was imposed on the shaft, and the sampling frequency was 20 kHz. Dataset C also comprises 404 samples in this study, each 1024 data points long.

Dataset D comes from the RL bearing dataset provided by Xi'an Jiaotong University (Lei, 2017). Different from the previous three datasets, where the bearing faults were artificially produced in the laboratory, the RL bearing dataset was collected from practically used railway locomotive (RL) rolling element bearings. An accelerometer was mounted on the outer race of the bearing to collect the vibration signal. A working load of 9800 N was adopted and the sampling rate was 12.8 kHz. The dataset includes the same four health conditions as the previous three datasets, with the same number of samples and the same sample length.

4.2. MMNet performance and comparisons

MMNet was implemented in Python with PyTorch. All the experiments were performed on a PC equipped with a 3.2 GHz Intel i7 CPU and a TITAN Xp GPU.

4.2.1. Experiment settings in MMNet

Based on the four datasets detailed in Section 4.1, three transfer tasks have been used to validate the efficiency of MMNet: A → D, B → D and C → D. The bearing faults of datasets A, B and C were generated in the laboratory and those of dataset D occurred during practical application. Therefore, datasets A, B and C are used as the source datasets and D is adopted as the target dataset to implement knowledge transfer from laboratory data to practical data.

Episode based training from few shot learning is employed to efficiently learn from a small number of samples. Specifically, three few shot learning scenarios have been adopted: 4-way 1-shot, 4-way 5-shot and 4-way 10-shot. In each episode, one template set from the source domain and two query sets, respectively from the source and the target domain, are used for training. The query set from the source domain is labeled and is used for domain classification and fault classification; the query set from the target domain is unlabeled and is used for domain classification and domain adaptation. In the source branch of MMNet, the category of a query sample is determined by the largest of the obtained relation scores.

In one episode of k-way m-shot training, k classes, each with m randomly selected samples, are used as the template set, and a fraction of the remaining data are taken as the query set. In each episode of the 4-way 1-shot experiments, one example from each class of the source dataset is randomly selected to form the template set, and 29 random examples per class are respectively selected from the source and the target dataset as the query sets. For the upper source branch of MMNet, both the template set and the query set are selected from the source dataset. For the bottom target branch, the same template set as the source branch is adopted, while the query set is selected from the target dataset and no label information is required. In the 4-way 1-shot experiments, the total number of examples used for training in each episode is 1 × 4 + 29 × 4 + 29 × 4 = 236. In the 4-way 5-shot experiments, 5 random examples per class from the source dataset form the template set and 25 examples per class, respectively from the source and the target dataset, form the query sets, so the total number of examples in each episode is 5 × 4 + 25 × 4 + 25 × 4 = 220. Similarly, in the 4-way 10-shot setting, the total number of examples in each episode is 10 × 4 + 20 × 4 + 20 × 4 = 200. All the labeled data from the source domain and 200 unlabeled examples from the target domain are used to generate the training set in each episode.
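The episode sizes quoted above follow directly from the k-way m-shot bookkeeping, as the small check below illustrates (a hypothetical helper, shown only to make the arithmetic explicit).

```python
def episode_size(k, m, n_query):
    # k*m templates + k*n_query source queries + k*n_query target queries.
    return k * m + 2 * k * n_query

assert episode_size(4, 1, 29) == 236   # 4-way 1-shot
assert episode_size(4, 5, 25) == 220   # 4-way 5-shot
assert episode_size(4, 10, 20) == 200  # 4-way 10-shot
```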
The remaining 204 samples (51 × 4 = 204) from the target domain are used for testing.
4.2.2. Parameter settings in MMNet

Adam is adopted to optimize MMNet. The number of training episodes is set to 10,000 and the learning rate is 5 × 10⁻⁴. The trade-off parameters λ1, λ2 and λ3 of the three MK-MMD losses and the trade-off parameter λ4 of the domain loss are given in Table 3.

Table 3. Trade-off parameters of the MK-MMD loss and domain loss.

Experiment setting   λ1     λ2     λ3     λ4
4-way 1-shot         2.25   1.25   0.50   0.1
4-way 5-shot         1.00   0.50   0.20   0.1
4-way 10-shot        2.00   1.00   0.75   0.1

It has been discovered in previous research that from the shallower to the deeper layers of a convolutional neural network, the learned features turn from general to specific. The general features shared across domains are easier to transfer than the specific ones, so the transferability of the features decreases with increasing network depth. Larger MK-MMD trade-off parameters should therefore be selected for the lower layers, and smaller ones for the higher layers, to allow for task-specific tuning. To verify this statement, grid search experiments have been conducted to search for the optimal trade-off parameters in an exhaustive manner. The details of the parameter selection procedure are given in Table 4.

Table 4. Trade-off parameter selection procedures.

The MK-MMD trade-off parameters are selected within the range [0.1, 5] with an increment of 0.05. In each experiment scenario, 10 examples from the test set (query set) are randomly separated as a validation set for parameter selection. Considering the high computational cost, no cross validation procedure is used. The experiments show that MMNet fails to obtain satisfactory performance when the three parameters take identical values: a fault classification accuracy of around 83 % was obtained in these experiments, and in some of them the network even failed to converge. Similar results were observed when the parameter values increase from λ1 to λ3. When the parameters take a random order (neither monotonically increasing nor decreasing), some good results were obtained, and better classification performance was achieved when the trade-off parameters are in decreasing order. The optimal values of the three MK-MMD loss trade-off parameters were selected based on the grid search results, as shown in Table 3. The parameter selection results also indicate that the model is quite robust to parameter variation, with a mean accuracy of 89.86 % and a standard deviation of 6.02 %.

During the search for the three MK-MMD loss parameters, the domain loss parameter was fixed at 0.1 to reduce computational cost, a value which showed relatively excellent performance throughout the experiments. After the three MK-MMD loss trade-off parameters were selected, they were fixed to further select the domain loss trade-off parameter λ4. Experiments with λ4 from {0.001, 0.01, 0.1, 1, 10, 100} were performed, and based on the results, 0.1 was selected.

In each domain adaptation operation with MK-MMD, 5 Gaussian kernels are adopted. The base bandwidth γ is set as the median of the pairwise distances of the training samples from both the source and the target domain, and the bandwidths of the $m_u$ Gaussian kernels are obtained by varying the bandwidth between $2^{-\lfloor m_u/2 \rfloor}\gamma$ and $2^{\lfloor m_u/2 \rfloor}\gamma$ with a scaling factor of 2, where $\lfloor \cdot/\cdot \rfloor$ denotes integer division.
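The bandwidth schedule described above can be sketched as follows; the handling of the median heuristic (computed over pooled source and target samples, including the zero self-distances) is our assumption about details the text leaves open.

```python
import torch

def mk_mmd_bandwidths(xs, xt, num_kernels=5):
    # Base bandwidth: median pairwise distance over the pooled source and
    # target training samples. The num_kernels bandwidths then span
    # 2^(-floor(n/2)) * gamma .. 2^(floor(n/2)) * gamma in factors of 2.
    x = torch.cat([xs, xt], dim=0)
    gamma = torch.cdist(x, x).median().item()
    half = num_kernels // 2
    return [gamma * 2.0 ** p for p in range(-half, half + 1)]
```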
4.2.3. Performance of MMNet and comparison with other methods

To verify the performance of MMNet, the three transfer tasks A → D, B → D and C → D discussed in Section 4.2.1 have been carried out. For each transfer task, the three few shot learning experiment settings were tested. The results are reported in Table 5. Excellent fault classification performance has been obtained on all three transfer tasks, and the performance of MMNet improves as the number of examples in the template set increases. The average fault classification accuracy is above 99 %, which is a superior transfer performance for bearing fault diagnosis.

Table 5. Classification accuracy (%) of MMNet on different transfer tasks.

Experiment setting   A → D   B → D   C → D   Avg
4-way 1-shot         99.62   98.75   99.25   99.21
4-way 5-shot         99.64   99.90   99.70   99.75
4-way 10-shot        99.95   99.98   99.72   99.88

To further validate the effectiveness of MMNet, extensive comparison experiments have been conducted. Multiple state-of-the-art transfer learning methods have been included for comparison, including Transfer Component Analysis (TCA) (Pan, Tsang, Kwok, & Yang, 2011), Deep Domain Confusion (DDC) (Tzeng, Hoffman, Zhang, Saenko, & Darrell, 2014), a modified Deep Adaptation Network (DAN) (Long et al., 2015), the feature-based transfer neural network (FTNN) (Yang et al., 2019), G-ResNet (Yang et al., 2020), P-ResNet (Yang et al., 2020) and TrResNet (Yang et al., 2020). In addition, a convolutional neural network (CNN) has been incorporated as a baseline method. To make fair comparisons, we use the publicly available source code provided by the authors of the above methods.
When the code of a method is not publicly available, its results are borrowed directly from the original paper for the same transfer task. When neither the source code nor the corresponding experiment results are available, a "/" mark is used in Table 6, which reports the comparison results. In the baseline CNN method, no transfer learning related techniques are applied: the labeled data from the source dataset form the training set and the unlabeled data from the target dataset construct the testing set.
To achieve an optimal CNN performance for comparison, various CNN architectures have been evaluated. Specifically, CNNs of different depths have been tested, including CNNs with five, three and two convolutional layers. In each CNN, one flatten layer and one fully connected layer follow the convolutional layers, cross-entropy is used as the loss function, and Softmax is applied at the output layer for classification. Meanwhile, our experiments have shown that average pooling obtains better performance than max pooling, so average pooling has been adopted in these baseline CNNs. In the other compared CNN based solutions, average pooling is also adopted instead of max pooling to ensure fair comparison. The CNN with two convolutional layers and two fully connected layers obtained the best fault diagnosis performance, and its results are given in Table 6.

TCA is a classic transfer learning method which projects the source data and the target data into a new subspace where their distributions are closer than in the original space. In the implementation of TCA, the regularization trade-off parameter is selected from {0.01, 0.1, 1, 10, 100} and the subspace dimension is selected from {2, 4, 8, 16, 32, 64, 128, 256} via experiments. Based on the representations of all the samples in the transformed subspace, a support vector machine (SVM) classifier is trained for fault classification.

The baseline CNN architecture selected via experiments has been adopted in DDC, with MK-MMD based domain adaptation used in the layer before the Softmax classification layer. For the compared DAN method, the same CNN structure is used and domain adaptation with MK-MMD is applied to the flatten layer and the last fully connected layer before the output layer. The specifications of the CNN structure adopted in the baseline CNN, DDC and DAN are given in Table 7, where "/" means not applicable. In both DDC and DAN, all the labeled data of the source dataset and part of the unlabeled data of the target dataset are used for model training, with a dataset partition similar to that of MMNet. The experiment results of FTNN are borrowed from its original publication (Yang et al., 2019).

In G-ResNet, P-ResNet and TrResNet, eight ResNet blocks are used to construct the backbone network. G-ResNet adopts Gaussian kernel based MMD for domain adaptation, while P-ResNet and TrResNet use polynomial kernel based MMD; in addition, pseudo label learning is applied in TrResNet. The reported results of these three methods are borrowed from (Yang et al., 2020), where the detailed model configurations can be found. In the experiments of these three methods, both datasets A and B from our experiment setting are used as the source domain and dataset D is treated as the target domain. Therefore, the results of transfer tasks A → D and B → D are the same, as reported in Table 6.

The raw vibration data are used as the input to CNN, DDC, DAN, FTNN, G-ResNet, P-ResNet, TrResNet and MMNet. To obtain better fault diagnosis performance for TCA, the frequency spectrum instead of the raw vibration data is adopted as its input. In Table 6, the best results are highlighted in bold.
From these results it can be seen that the neural network based solutions obtain significantly better performance than the traditional transfer learning method TCA. The performance of the baseline CNN, with no transfer learning component involved, is relatively poor; its best performance on the three transfer tasks is 57.67 %. TrResNet, published in 2020, ranks second best. Among all the compared methods, MMNet obtains the best fault classification accuracy: the accuracy on all three transfer tasks is above 99 %, and the smallest accuracy increase over the second best result reaches 10.94 %.

The t-SNE (t-distributed stochastic neighbor embedding) method is employed to visualize the transfer features learned by the compared methods; the visualization results are given in Fig. 5. The intermediate feature representations of G-ResNet, P-ResNet and TrResNet are not available, so their visualization results are not provided. The visualization is conducted on the transfer task A → D. In Fig. 5, the notation "S-" means the corresponding samples come from the source domain and "T-" means the samples come from the target domain. Fig. 5 covers the frequency spectrum features, TCA, CNN, DDC, DAN and MMNet. The results show that the feature distribution difference between the source and the target domain is quite obvious for the frequency spectrum, TCA, CNN and DDC. Among these methods, the features obtained by TCA are aggregated within one class from the same domain but still scattered for the same class across domains when compared with CNN, DDC and DAN, which explains the relatively better performance of the latter three methods. The domain discrepancy of the features learned by DAN and MMNet is obviously reduced compared with the former four methods: for both, samples from the same class are well aggregated even when they come from different domains. Comparing MMNet with DAN, the distance between different classes obtained by MMNet is obviously larger, and the samples from the same class are more tightly aggregated in MMNet than in DAN. This well-formed sample distribution structure explains the excellent classification performance of MMNet.

Fig. 5. Visualization of the learned features with t-SNE. (a) Frequency spectrum feature. (b) TCA. (c) CNN. (d) DDC. (e) DAN. (f) MMNet.

To take a closer look at the classification performance, the confusion matrices of TCA, CNN, DDC, FTNN, DAN and MMNet are visualized in Fig. 6. From the listed results, it can be seen that a large number of samples are misclassified by both TCA and CNN. The results of DDC, FTNN and DAN are better than those of TCA and CNN. The performance of MMNet is clearly superior to all the other compared methods, which validates its efficiency.

Fig. 6. Confusion matrices of the transfer results of dataset A → D. (a) TCA. (b) CNN. (c) DDC. (d) FTNN. (e) DAN. (f) MMNet.
Table 6. Accuracy comparison results (%) of different transfer learning methods for fault diagnosis.

Method     Input               A → D   B → D   C → D
CNN        Raw vibration       57.67   53.17   53.96
TCA        Frequency spectrum  51.48   41.58   25.00
DDC        Raw vibration       80.84   77.80   81.22
DAN        Raw vibration       83.52   78.90   86.27
FTNN       Raw vibration       83.69   84.95   /
G-ResNet   Raw vibration       84.32   84.32   /
P-ResNet   Raw vibration       87.76   87.76   /
TrResNet   Raw vibration       88.27   88.27   /
MMNet      Raw vibration       99.21   99.75   99.88

Table 7. Specifications of the CNN structure in the baseline CNN, DDC and DAN.

Layer   Operation        Convolutional kernel width   Number of channels   Output size
Input   /                /                            /                    1024 × 1 × 1
C1      Convolution      3 × 1                        20                   1024 × 1 × 20
P1      AvgPooling       2 × 1                        /                    512 × 1 × 20
C2      Convolution      3 × 1                        20                   512 × 1 × 20
P2      AvgPooling       2 × 1                        /                    256 × 1 × 20
FC1     Flatten          /                            /                    5120 × 1
FC2     Fully connected  5120 × 256                   /                    256 × 1
Output  Fully connected  256 × 4                      /                    4 × 1
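For reference, Table 7 translates into the following PyTorch sketch of the baseline backbone shared by CNN, DDC and DAN. The ReLU activations and the convolution padding of 1 (needed to reproduce the printed output sizes) are our assumptions.

```python
import torch
import torch.nn as nn

# Baseline backbone of Table 7. With padding 1, the printed output sizes are
# reproduced: 1024 -> 512 -> 256 -> flatten width 5120.
baseline_cnn = nn.Sequential(
    nn.Conv1d(1, 20, kernel_size=3, padding=1), nn.ReLU(),   # C1: 1024 x 20
    nn.AvgPool1d(2),                                         # P1: 512 x 20
    nn.Conv1d(20, 20, kernel_size=3, padding=1), nn.ReLU(),  # C2: 512 x 20
    nn.AvgPool1d(2),                                         # P2: 256 x 20
    nn.Flatten(),                                            # FC1: 5120
    nn.Linear(5120, 256), nn.ReLU(),                         # FC2: 256
    nn.Linear(256, 4),                                       # Output: 4 health conditions
)

logits = baseline_cnn(torch.randn(8, 1, 1024))               # -> (8, 4)
```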
4.3. Ablation study

Several components contribute to the performance of MMNet, among which the three major ones are the double channel feature extraction mechanism, the multiple-layer domain adaptation and the average pooling. In order to verify the effectiveness of each component, an ablation study has been conducted.
To test the necessity of the double channel feature extraction mechanism, comparison experiments with a network containing only the common feature extraction channel have been performed. The remaining components, such as the multi-layer adaptation and average pooling, are kept the same. The comparison results, averaged over the three experiment settings (4-way 1-shot, 4-way 5-shot and 4-way 10-shot) on each transfer task, are reported in Fig. 7. The results with only the cross domain common feature channel are indicated as "one channel", and the results with both the cross domain common feature channel and the domain specific feature channel are denoted as "double channel". The highest accuracy obtained by the one channel setting is 98.15 % on transfer task A → D, while the corresponding result of the double channel setting is 99.21 %. For all three transfer tasks, the double channel setting of MMNet obtains better performance than the one channel setting, which verifies the effectiveness of the double channel feature extraction mechanism in MMNet.

Fig. 7. Comparison results with and without the domain discriminant feature extraction channel on three transfer tasks.
One key factor influencing the performance of the multi-layer domain adaptation in MMNet is the number of Gaussian kernels used in MK-MMD. When the number of Gaussian kernels reduces to 1, MK-MMD degenerates to MMD. To compare the performance of different numbers of kernels, experiments with 1, 3, 5, 7 and 9 kernels have been performed on each transfer task for 10 runs; the comparison results are illustrated in Fig. 8. When the number of kernels increases from 1 to 3 and from 3 to 5, a significant improvement of fault classification accuracy can be observed. When the number of kernels changes from 5 to 7 and 9, the performance variation is relatively small, while the computational complexity of MMNet increases with the number of kernels. Therefore, the number of kernels in our experiments has been set to 5.

Fig. 8. Comparison of classification results with different numbers of Gaussian kernels used in MK-MMD domain adaptation on three transfer tasks. (a) Results on transfer task A → D. (b) Results on transfer task B → D. (c) Results on transfer task C → D.

In addition, to test the efficiency of average pooling, comparison experiments against max pooling were conducted. The three convolutional layers in the FeatureNet module of MMNet use average pooling instead of max pooling to suppress noise within the vibration time sequence. We respectively replaced the average pooling in the first convolutional layer C1, in the first two convolutional layers C1 and C2, and in all three convolutional layers C1, C2 and C3 of the FeatureNet module to test the effectiveness of average pooling. The experiment results show that the advantage of average pooling is reflected in two aspects: accelerating the convergence of the training stage and improving the classification accuracy. In comparison with max pooling, the fault classification accuracy on the transfer tasks was improved by up to more than 5 % in our experiments. Meanwhile, it took about 2,000 episodes to train MMNet with average pooling following all the convolutional layers, whereas more than 30,000 episodes were needed when max pooling was used instead. Average pooling has thus greatly improved the training speed of MMNet.

4.4. Computational complexity comparison

Besides the above model performance comparison, the computational complexity of the models has also been compared. Considering that training and operation times differ across hardware platforms, the model structure complexity and the number of trainable parameters are summarized and compared in Table 8, where models with the same backbone network structure are listed in the same row. In MMNet, the weights are shared across channels, so the complexity of only one channel needs to be considered. From Table 8, it can be seen that the total number of trainable parameters of MMNet is the smallest among the compared models, only about 1/2 to 1/4 of the others. More convolutional layers (vs CNN/DDC/DAN and FTNN), smaller convolutional kernels (vs G-ResNet, P-ResNet and TrResNet) and narrower fully connected layers lead to the more concise structure of MMNet. Therefore, MMNet has lower computational complexity than the other compared models.

Table 8. Model computational complexity comparisons.

Model                        Number of convolutional layers (size)   Number of fully connected layers (size)   Number of parameters
CNN/DDC/DAN                  2 × (3 × 1 × 20)                        2 (5120 × 256, 256 × 4)                   1,311,864
FTNN                         2 (5 × 1 × 20, 5 × 20 × 20)             2 (5941 × 256, 256 × 4)                   1,524,084
G-ResNet/P-ResNet/TrResNet   16 × (3 × 20 × 20)                      2 (6000 × 512, 512 × 4)                   3,093,248
MMNet                        5 × (3 × 1 × 20)                        3 (5120 × 2, 1280 × 512, 512 × 256)       796,972

5. Conclusions

Existing deep transfer networks try to transfer all the extracted features of fault data across different domains. Considering that there might be features which only benefit classification in a specific domain and cannot provide common information across domains, MMNet, a neural network solution which separately considers the features appropriate and inappropriate for transfer, is developed.
In MMNet, a domain level classification and a fault level classification are combined to extract domain specific discriminant features, while multi-layer MK-MMD based domain adaptation and fault level classification are combined to extract cross domain common features. A classic few shot learning network structure, relation network, is employed as the backbone, and a Siamese double branch structure is incorporated to process the samples from the source and the target domain simultaneously. The relation score based classification mechanism can perform fault diagnosis without labeled data from the target domain. Four datasets have been used to test the effectiveness of MMNet, and the results have verified its efficiency. The transfer fault classification accuracy is significantly improved compared with other state-of-the-art transfer solutions in fault diagnosis, with a fault classification accuracy over 99 % obtained on all three transfer tasks in the experiments.

The outcome of this research has verified the different competence of the learned features for different domains, and the multi-level classification mechanism enables implicit discrimination of these features. How to further, and even explicitly, evaluate the efficiency of different features for a specific domain remains a challenging problem. One promising direction is to incorporate a metric like the Kullback-Leibler divergence to measure the similarity among features; it is also possible to learn a metric for feature evaluation and embed the metric learning module into the fault diagnosis scheme. Another promising direction is to include channel attention, self attention and cross attention mechanisms in the fault diagnosis network, based on which the salient features for different domains could be treated separately. In addition, the main idea of MMNet can also be applied directly to other classification applications such as brain signal recognition across subjects, activity recognition across people, and image classification under different imaging conditions.

CRediT authorship contribution statement

Na Lu: Conceptualization, Funding acquisition, Methodology, Validation, Writing – review & editing. Zhiyan Cui: Investigation, Software. Huiyang Hu: Data curation, Visualization. Tao Yin: Validation, Writing – review & editing.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgement

This work is supported by National Key R&D Program of China 2018YFB1306100 and National Natural Science Foundation of China grant 61876147.

References

Center, C. W. R. U. B. D. Retrieved from http://csegroups.case.edu/bearingdatacenter/home.

Che, C., Wang, H., Ni, X., & Fu, Q. (2020). Domain adaptive deep belief network for rolling bearing fault diagnosis. Computers & Industrial Engineering, 143, Article 106427. https://doi.org/10.1016/j.cie.2020.106427
Chen, Z., He, G., Li, J., Liao, Y., Gryllias, K., & Li, W. (2020). Domain adversarial transfer network for cross-domain fault diagnosis of rotary machinery. IEEE Transactions on Instrumentation and Measurement, 69(11), 8702–8712. https://doi.org/10.1109/TIM.2020.2995441

Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., & Li, F.-F. (2009). ImageNet: A large-scale hierarchical image database. Paper presented at the IEEE Conference on Computer Vision and Pattern Recognition.

Fan, H., Lin, L., Yang, F., Chu, P., Deng, G., Yu, S., … Ling, H. (2019). LaSOT: A high-quality benchmark for large-scale single object tracking. Paper presented at the IEEE Conference on Computer Vision and Pattern Recognition.

Finn, C., Abbeel, P., & Levine, S. (2017). Model-agnostic meta-learning for fast adaptation of deep networks. Paper presented at the International Conference on Machine Learning, Sydney, NSW, Australia.

Gretton, A., Borgwardt, K. M., Rasch, M. J., Scholkopf, B., & Smola, A. (2012). A kernel two-sample test. Journal of Machine Learning Research, 13, 723–773.

Guo, L., Lei, Y., Xing, S., Yan, T., & Li, N. (2019). Deep convolutional transfer learning network: A new method for intelligent fault diagnosis of machines with unlabeled data. IEEE Transactions on Industrial Electronics, 66(9), 7316–7325.

Jamal, M. A., & Qi, G.-J. (2019). Task agnostic meta-learning for few-shot learning. Paper presented at the IEEE Conference on Computer Vision and Pattern Recognition.

Jia, F., Lei, Y., Lu, N., & Xing, S. (2018). Deep normalized convolutional neural network for imbalanced fault classification of machinery and its understanding via visualization. Mechanical Systems and Signal Processing, 110, 349–367. https://doi.org/10.1016/j.ymssp.2018.03.025

Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., & Li, F.-F. (2014). Large-scale video classification with convolutional neural networks. Paper presented at the IEEE Conference on Computer Vision and Pattern Recognition.

Lei, Y. (2017). Intelligent fault diagnosis and remaining useful life prediction of rotating machinery. Butterworth-Heinemann.

Lei, Y., Jia, F., Lin, J., Xing, S., & Ding, S. X. (2016). An intelligent fault diagnosis method using unsupervised feature learning towards mechanical big data. IEEE Transactions on Industrial Electronics, 63(5), 3137–3147. https://doi.org/10.1109/TIE.2016.2519325

Li, J., Huang, R., He, G., Wang, S., Li, G., & Li, W. (2020). A deep adversarial transfer learning network for machinery emerging fault detection. IEEE Sensors Journal, 20(15), 8413–8422. https://doi.org/10.1109/JSEN.2020.2975286

Li, X., Zhang, W., & Ding, Q. (2018). A robust intelligent fault diagnosis method for rolling element bearings based on deep distance metric learning. Neurocomputing, 310, 77–95.

Li, X., Zhang, W., Ding, Q., & Sun, J.-Q. (2019). Multi-layer domain adaptation method for rolling bearing fault diagnosis. Signal Processing, 157, 180–197. https://doi.org/10.1016/j.sigpro.2018.12.005

Long, M., Cao, Y., Wang, J., & Jordan, M. (2015). Learning transferable features with deep adaptation networks. Paper presented at the International Conference on Machine Learning.

Lu, N., & Yin, T. (2021). Transferable common feature space mining for fault diagnosis with imbalanced data. Mechanical Systems and Signal Processing, 156, Article 107645. https://doi.org/10.1016/j.ymssp.2021.107645
Lu, W., Liang, B., Cheng, Y., Meng, D., Yang, J., & Zhang, T. (2017). Deep model based domain adaptation for fault diagnosis. IEEE Transactions on Industrial Electronics, 64(3), 2296–2305. https://doi.org/10.1109/TIE.2016.2627020

Pan, S. J., Tsang, I. W., Kwok, J. T., & Yang, Q. (2011). Domain adaptation via transfer component analysis. IEEE Transactions on Neural Networks, 22(2), 199–210.

Qiu, H., Lee, J., Lin, J., & Yu, G. (2006). Wavelet filter-based weak signature detection method and its application on rolling element bearing prognostics. Journal of Sound and Vibration, 289(4), 1066–1090. https://doi.org/10.1016/j.jsv.2005.03.007

Shao, S., McAleer, S., Yan, R., & Baldi, P. (2019). Highly accurate machine fault diagnosis using deep transfer learning. IEEE Transactions on Industrial Informatics, 15(4), 2446–2455. https://doi.org/10.1109/TII.2018.2864759

Snell, J., Swersky, K., & Zemel, R. (2017). Prototypical networks for few-shot learning. Paper presented at Advances in Neural Information Processing Systems.

Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P. H. S., & Hospedales, T. M. (2018). Learning to compare: Relation network for few-shot learning. Paper presented at the IEEE Conference on Computer Vision and Pattern Recognition.

Tzeng, E., Hoffman, J., Zhang, N., Saenko, K., & Darrell, T. (2014). Deep domain confusion: Maximizing for domain invariance. arXiv preprint arXiv:1412.3474.

Vinyals, O., Blundell, C., Lillicrap, T., Kavukcuoglu, K., & Wierstra, D. (2016). Matching networks for one shot learning. Paper presented at the International Conference on Neural Information Processing Systems, Barcelona, Spain.

Wen, L., Gao, L., & Li, X. (2017). A new deep transfer learning based on sparse auto-encoder for fault diagnosis. IEEE Transactions on Systems, Man, and Cybernetics: Systems, 49(1), 136–144.

Wen, L., Gao, L., & Li, X. (2019). A new deep transfer learning based on sparse auto-encoder for fault diagnosis. IEEE Transactions on Systems, Man, and Cybernetics: Systems, 49(1), 136–144. https://doi.org/10.1109/TSMC.2017.2754287

Xu, G., Liu, M., Jiang, Z., Shen, W., & Huang, C. (2020). Online fault diagnosis method based on transfer convolutional neural networks. IEEE Transactions on Instrumentation and Measurement, 69(2), 509–520. https://doi.org/10.1109/TIM.2019.2902003

Yang, B., Lei, Y., Jia, F., Li, N., & Du, Z. (2020). A polynomial kernel induced distance metric to improve deep transfer learning for fault diagnosis of machines. IEEE Transactions on Industrial Electronics, 67(11), 9747–9757. https://doi.org/10.1109/TIE.2019.2953010

Yang, B., Lei, Y., Jia, F., & Xing, S. (2019). An intelligent fault diagnosis approach based on transfer learning from laboratory bearings to locomotive bearings. Mechanical Systems and Signal Processing, 122, 692–706. https://doi.org/10.1016/j.ymssp.2018.12.051

Yu, H., Wang, K., Li, Y., & Zhao, W. (2019). Representation learning with class level autoencoder for intelligent fault diagnosis. IEEE Signal Processing Letters, 26(10), 1476–1480. https://doi.org/10.1109/LSP.2019.2936310

Zhang, W., Li, X., Jia, X.-D., Ma, H., Luo, Z., & Li, X. (2020). Machinery fault diagnosis with imbalanced data using deep generative adversarial networks. Measurement, 152, Article 107377. https://doi.org/10.1016/j.measurement.2019.107377