The Real and Complex Backpropagation Algorithm for Second
Order Feedforward Neural Networks
W. B. Yates
July 7 2013
Contents

1 Introduction
2 Neural Networks
  2.1 Multilayer Perceptrons
  2.2 Universal Approximation
3 The Backpropagation Learning Algorithm
  3.1 The Real Backpropagation Algorithm
  3.2 The Complex Backpropagation Algorithm
    3.2.1 First Order Weights
    3.2.2 Second Order Weights
    3.2.3 Error Terms
A Appendix
  A.1 The Chain Rule
    A.1.1 The Chain Rule for a Single Variable
    A.1.2 The Chain Rule for Multiple Variables I
    A.1.3 The Chain Rule for Multiple Variables II
  A.2 Complex Differentiation
1 Introduction
In this document we describe a general class of second order feedforward neural network and the associated
real and complex valued versions of the Backpropagation learning algorithm that can be employed to train
such networks. We have (slightly) extended the derivation of the Backpropagation algorithm presented in
[RHW-87] and [KA-01] to include second order connections.
2 Neural Networks
Artificial neural networks, as their name suggests, are models of computation based on principles abstracted
from biological nervous systems (see [MP-43] for example). A neural network consists of a finite number of
simple processing units which communicate via channels that are interconnected according to some pattern
or architecture. Each unit consists of a number of weighted input channels and a single output channel.
A unit combines its channel weights with the input signals on those channels and then computes a scalar
output signal or activation. This activation is then propagated to other units in the network according to
the pattern of interconnection. Thus a pattern of activation spreads across the network over time. Once the
pattern of activation has converged to a stable state we may interrogate the network’s output units.
The computational power of such a network is dictated by the choice of unit functions, the weights and
the architecture. Typically the unit functions and the architecture are fixed and the function computed by
the network as a whole is dictated by the choice of weights.
2.1 Multilayer Perceptrons
We shall concern ourselves with an important class of feedforward neural network: the Multilayer Perceptron
or MLP (see [RHW-87], [MP-88], and [KA-01]). In this section we present a specification of the class of
L-layer feedforward neural networks (see footnote 1) that we shall employ throughout this document.
Let A be the set of activation values and let W denote the set of weight values. Typically A will be
restricted to a closed interval such as [0, 1], [−1, 1] or [−π, π], in the real numbers R or the complex numbers C,
and we note that all these sets are compact in R and C. The set W will be equal to the reals R or the complex
numbers C as required.
Each hidden and output unit of the network i = 1, . . . , n and l = 1, . . . , L computes a function
$$ f^l_i : A^n \times W^n \times W^{n(n-1)/2} \times W \to A $$

defined by

$$ f^l_i(a, w, w', \theta) = act^l_i(net(a, w, w', \theta)) $$

where $act^l_i$ is the unit's activation function and $net$ is the unit's second order net-input function. We define the net input function

$$ net(a, w, w', \theta) = \sum_{i=1}^{n} a_i w_i + \sum_{j=1}^{n} \sum_{k>j} a_j a_k w'_{j,k} + \theta $$

where $a$ are the unit's inputs, $w$ are the first order connections, $w'$ are the second order connections, and
$\theta$ is the unit's bias. When $n = 1$ we shall assume that the network has no second order connections. The
products $a_j a_k$ are the multiplicative conjuncts of the unit. In this document we shall make use of the real
valued activation functions shown in Table 1 and the complex activation functions shown in Table 2.
Footnote 1: We shall adopt the convention that layer 0 denotes the network's inputs, that layer 1 is the first hidden layer, and that
layer L is the network's output layer. For the special case where L = 1 we note that there are no hidden layers.
Sigmoid: 1/(1 + e^{-x})
Sigmoid Complement: 1 - 1/(1 + e^{-x})
Hyperbolic Tangent: tanh(x)
Sine: sin(x)
Cosine: cos(x)
Gaussian: e^{-x^2}
Gaussian Complement: 1 - e^{-x^2}

Table 1: Real Activation Functions.

Tangent: tan(z)
Sine: sin(z)
Inverse Tangent: arctan(z)
Inverse Sine: arcsin(z)
Inverse Cosine: arccos(z)
Hyperbolic Tangent: tanh(z)
Hyperbolic Sine: sinh(z)
Inverse Hyperbolic Tangent: arctanh(z)
Inverse Hyperbolic Sine: arcsinh(z)

Table 2: Complex Activation Functions.
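As a concrete illustration of Table 1 (not part of the original specification), the following sketch pairs a few of the real activation functions with the derivatives that the Backpropagation algorithm of Section 3 will require; the dictionary layout and the finite difference check are purely illustrative assumptions.

```python
import numpy as np

# A few of the real activation functions of Table 1, each paired with the
# derivative that the Backpropagation algorithm of Section 3 requires.
REAL_ACTIVATIONS = {
    "sigmoid":  (lambda x: 1.0 / (1.0 + np.exp(-x)),
                 lambda x: np.exp(-x) / (1.0 + np.exp(-x)) ** 2),
    "tanh":     (np.tanh,
                 lambda x: 1.0 - np.tanh(x) ** 2),
    "sine":     (np.sin, np.cos),
    "gaussian": (lambda x: np.exp(-x ** 2),
                 lambda x: -2.0 * x * np.exp(-x ** 2)),
}

# Finite difference sanity check of one of the stated derivatives.
f, df = REAL_ACTIVATIONS["sigmoid"]
x, h = 0.3, 1e-6
print(df(x), (f(x + h) - f(x - h)) / (2.0 * h))
```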
For notational convenience we shall represent the bias θ by an extra weight, wi,n+1 from a unit whose
output is always 1. As a result our equations become
$$ f^l_i : A^{n+1} \times W^{n+1} \times W^{n(n-1)/2} \to A $$

$$ f^l_i(a, w, w') = act^l_i(net(a, w, w')) $$

$$ net(a, w, w') = \sum_{i=1}^{n+1} a_i w_i + \sum_{j=1}^{n} \sum_{k>j} a_j a_k w'_{j,k} $$
for i = 1, . . . , n and l = 1, . . . , L. The units of the first hidden layer i = 1, . . . , n and l = 1 are defined by
$$ f^l_i(a, w, w') = act^l_i\big(net(a_1, \ldots, a_n, 1,\; w^l_{i,1}, \ldots, w^l_{i,n}, w^l_{i,n+1},\; w'^l_{i,1,2}, w'^l_{i,1,3}, \ldots, w'^l_{i,1,n},\; w'^l_{i,2,3}, w'^l_{i,2,4}, \ldots, w'^l_{i,2,n},\; \ldots,\; w'^l_{i,n-1,n})\big). $$
For each unit in the subsequent hidden and output layers i = 1, . . . , n and l = 2, . . . L we have
$$ f^l_i(a, w, w') = act^l_i\big(net(f^{l-1}_1(a, w, w'), \ldots, f^{l-1}_n(a, w, w'), 1,\; w^l_{i,1}, \ldots, w^l_{i,n}, w^l_{i,n+1},\; w'^l_{i,1,2}, w'^l_{i,1,3}, \ldots, w'^l_{i,1,n},\; w'^l_{i,2,3}, w'^l_{i,2,4}, \ldots, w'^l_{i,2,n},\; \ldots,\; w'^l_{i,n-1,n})\big). $$
The network output is defined by
$$ f(a, w, w') = (f^L_1(a, w, w'), \ldots, f^L_n(a, w, w')). $$
From these unit specifications, an L layer feedforward network is defined by the function
$$ f : A^n \times W^{Ln(n+1)} \times W^{Ln^2(n-1)/2} \to A^n. $$
Given some countable set of non-constant, real or complex valued activation functions, denoted Ψ (see table
(1) or table (2) for example), we define the class of all such networks as
$$ MLP(\Psi) = \{\, f \mid \forall n > 0,\ \forall L > 0,\ \forall w \in W^{Ln(n+1)},\ \forall w' \in W^{Ln^2(n-1)/2} \,\}. $$
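To make the specification above concrete, here is a minimal sketch of the forward pass of an L layer second order network in Python with NumPy. The per-unit parameter triples, the dictionary of second order weights and the helper names (net_input, forward, random_unit) are illustrative assumptions rather than part of the formal definition.

```python
import numpy as np

def net_input(a, w, w2, theta):
    # Second order net input: sum_i a_i w_i + sum_{j<k} a_j a_k w'_{j,k} + theta.
    n = len(a)
    second = sum(a[j] * a[k] * w2[(j, k)] for j in range(n) for k in range(j + 1, n))
    return float(np.dot(a, w) + second + theta)

def forward(x, layers, act=np.tanh):
    # layers: one list per layer, holding a (w, w2, theta) triple for every unit.
    a = np.asarray(x, dtype=float)
    for layer in layers:
        a = np.array([act(net_input(a, w, w2, theta)) for (w, w2, theta) in layer])
    return a

def random_unit(n, rng):
    w = rng.normal(size=n)
    w2 = {(j, k): rng.normal() for j in range(n) for k in range(j + 1, n)}
    return (w, w2, rng.normal())

# A 2-input network with one hidden layer of 2 units and a single output unit.
rng = np.random.default_rng(0)
layers = [[random_unit(2, rng) for _ in range(2)], [random_unit(2, rng)]]
print(forward([0.5, -1.0], layers))
```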
2.2 Universal Approximation
A class of neural networks is said to be a universal approximator if for any given real (or complex) valued
Borel measurable target function g on A, there exists a network in our class, say f ∈ MLP(Ψ) that can
approximate g to any desired degree of accuracy. The class of real (or complex) valued first order multilayer
networks with a suitable activation function (see table (1) and table (2)) is a universal approximator (see
[HSW-89] and [KA-03]). Thus the inclusion of second order connections does not affect the computational
power of our multilayer networks in a theoretical sense (see footnote 2). In practice, however, such networks are better able
to approximate certain target functions.
The proof of universal approximation employs the Stone-Weierstrass theorem for real and complex algebras
of functions and is existential rather than constructive. In other words the proof asserts that such a
network exists in theory. It provides no information regarding the structure of the appropriate approximating
network which, in this case, corresponds to the number of hidden units required and the values of the
weights. It is important to emphasise that the class of networks MLP(Ψ) is capable of universal approxi-
mation only in theory. In practice, the finite precision of any particular implementation greatly reduces the
class of functions that can be represented by any particular network (see [WG-95]).
3 The Backpropagation Learning Algorithm
Most neural networks are not programmed explicitly, rather they learn from, or equivalently are trained on, a
set of patterns representing examples of the task to be performed. Essentially a learning algorithm iteratively
modifies the network’s weights until the network is a “correct implementation” of the “task” encoded in the
patterns. We shall employ the online Backpropagation algorithm (see [RHW-87] and [KA-01]) to train
our feedforward networks. The Backpropagation algorithm, given some initial, randomly generated set of
weights, processes a sequence of patterns and produces a sequence of weight updates that are intended to
minimise an error function defined over the network’s outputs and the training patterns. Training continues
until the network satisfies its correctness criterion or some fixed number of training cycles are performed.
The Backpropagation algorithm is a neural network implementation of the steepest gradient descent
optimisation method due originally to Fermat and may be applied to real or complex valued networks. In
each case we shall need to extend the algorithm slightly to accommodate the second order weights.
We note that the algorithm cannot be guaranteed to converge to a set of weights that result in a correct
neural network implementation of a general task. In practice the learning algorithm is very sensitive to the
precise choice of initial weights and learning algorithm parameters and, in common with all gradient descent
methods, can become trapped in local minima when the error function to be minimised is non-convex. In
addition, given the existential nature of the proof of universal approximation (see Section 2.2) it is, in general,
difficult to estimate an appropriate number of hidden layers and units.
Footnote 2: For the special case where L = 1 we note that second order single layer networks are more powerful than first order single
layer networks as they are capable of solving non-linearly separable problems such as the XOR problem (see [RHW-87], pages
319-321).
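The online training regime described above can be summarised by the following skeleton; network.forward, network.backprop_update and network.error are hypothetical interfaces standing in for the update equations derived in Sections 3.1 and 3.2, and the correctness criterion is modelled as a simple error tolerance.

```python
def train_online(network, patterns, epochs, learning_rate, tolerance):
    """Online Backpropagation: the weights are updated after every pattern."""
    for epoch in range(epochs):
        for inputs, targets in patterns:
            network.forward(inputs)
            network.backprop_update(inputs, targets, learning_rate)
        total_error = sum(network.error(x, t) for x, t in patterns)
        if total_error < tolerance:      # correctness criterion satisfied
            break
    return network
```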
3.1 The Real Backpropagation Algorithm
Backpropagation is intended to minimise the real valued error function
$$ E = \frac{1}{2} \sum_{p} \sum_{k} (t_{pk} - o_{pk})^2 $$

defined over the network's output units $o_{pk}$ and the target outputs $t_{pk}$ by calculating weight updates

$$ \Delta_p w_{ji} \propto -\frac{\partial E_p}{\partial w_{ji}} \quad\text{and}\quad \Delta_p w_{jlm} \propto -\frac{\partial E_p}{\partial w_{jlm}} $$

for each weight in the network and each training pattern $p$. In the notation of [RHW-87] the net-input and
activation function are defined to be

$$ net_{pj} = \sum_{i} w_{ji} o_{pi} + \sum_{l} \sum_{m>l} w_{jlm} o_{pl} o_{pm}, \qquad o_{pj} = f_j(net_{pj}). $$
We will derive formulae for $\partial E_p / \partial w_{ji}$ and $\partial E_p / \partial w_{jlm}$ by repeated application of the chain rule. Specifically, let

$$ \frac{\partial E_p}{\partial w_{ji}} = \frac{\partial E_p}{\partial net_{pj}} \frac{\partial net_{pj}}{\partial w_{ji}} \quad\text{and}\quad \frac{\partial E_p}{\partial w_{jlm}} = \frac{\partial E_p}{\partial net_{pj}} \frac{\partial net_{pj}}{\partial w_{jlm}}. $$
The second term on the right hand side of each equation yields
$$ \frac{\partial net_{pj}}{\partial w_{ji}} = \frac{\partial}{\partial w_{ji}} \Big( \sum_k w_{jk} o_{pk} + \sum_l \sum_{m>l} w_{jlm} o_{pl} o_{pm} \Big) = o_{pi} $$

and

$$ \frac{\partial net_{pj}}{\partial w_{jlm}} = \frac{\partial}{\partial w_{jlm}} \Big( \sum_k w_{jk} o_{pk} + \sum_l \sum_{m>l} w_{jlm} o_{pl} o_{pm} \Big) = o_{pl}\, o_{pm}. $$
Now define
$$ \delta_{pj} = -\frac{\partial E_p}{\partial net_{pj}}. $$
Again, applying the chain rule we have
$$ \frac{\partial E_p}{\partial net_{pj}} = \frac{\partial E_p}{\partial o_{pj}} \frac{\partial o_{pj}}{\partial net_{pj}} $$

and we note that

$$ \frac{\partial o_{pj}}{\partial net_{pj}} = f'_j(net_{pj}) $$

where $f'$ denotes the derivative of $f$. For the output units we have

$$ \frac{\partial E_p}{\partial o_{pj}} = -(t_{pj} - o_{pj}). $$
For the hidden units we again apply the chain rule
$$ \frac{\partial E_p}{\partial o_{pj}} = \sum_k \frac{\partial E_p}{\partial net_{pk}} \frac{\partial net_{pk}}{\partial o_{pj}} = \sum_k \frac{\partial E_p}{\partial net_{pk}} \frac{\partial}{\partial o_{pj}} \Big( \sum_i w_{ki} o_{pi} + \sum_l \sum_{m>l} w_{klm} o_{pl} o_{pm} \Big) = \sum_k \frac{\partial E_p}{\partial net_{pk}} \Big( w_{kj} + \sum_{l \ne j} w_{klj} o_{pl} \Big) = -\Big( \sum_k \delta_{pk} w_{kj} + \sum_k \delta_{pk} \sum_{l \ne j} w_{klj} o_{pl} \Big). $$

We note that the notationally convenient term $\sum_{l \ne j} w_{klj} o_{pl}$ is only correct if one makes the assumption that
$w_{klj}$ and $w_{kjl}$ identify the same unique second order weight. In the absence of this assumption we note
that, as $k > j$ for each second order weight $w_{ijk}$, the correct expansion of the term $\sum_{l \ne j} w_{klj} o_{pl}$ is

$$ w_{k,1,j}\, o_{p,1} + w_{k,2,j}\, o_{p,2} + \cdots + w_{k,j-1,j}\, o_{p,j-1} + w_{k,j,j+1}\, o_{p,j+1} + \cdots + w_{k,j,n}\, o_{p,n}. $$
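The indexing convention above can be made concrete with a small helper (an illustrative sketch, assuming the second order weights are stored only under ordered index pairs):

```python
def second_order_error_sum(k, j, w2, o):
    # sum_{l != j} w'_{k,l,j} o_{p,l}, looking each weight up under its
    # canonical ordering (k, min(l, j), max(l, j)).
    total = 0.0
    for l in range(len(o)):
        if l != j:
            total += w2[(k, min(l, j), max(l, j))] * o[l]
    return total
```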
Thus the algorithm is specified by four equations. The weight updates themselves are defined by

$$ \Delta_p w_{ji} = \eta\, \delta_{pj}\, o_{pi} \tag{1} $$

and

$$ \Delta_p w_{jlm} = \eta\, \delta_{pj}\, o_{pl}\, o_{pm}. \tag{2} $$

The error term for the output units is

$$ \delta_{pj} = (t_{pj} - o_{pj})\, f'_j(net_{pj}) \tag{3} $$

and for the hidden units we have

$$ \delta_{pj} = f'_j(net_{pj}) \Big( \sum_k \delta_{pk}\, w_{kj} + \sum_k \delta_{pk} \sum_{l \ne j} w_{klj}\, o_{pl} \Big). \tag{4} $$
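A compact sketch of equations (1)-(4) for a network with a single hidden layer is given below; the symmetric upper-triangular storage of the second order weights, the tanh activation, the explicit bias vector and the XOR training set are simplifying assumptions, not part of the algorithm's specification.

```python
import numpy as np

rng = np.random.default_rng(1)
f = np.tanh
fprime = lambda x: 1.0 - np.tanh(x) ** 2

def init_layer(n_in, n_units):
    # First order weights W, biases b, and upper triangular second order weights W2.
    return {"W": 0.1 * rng.normal(size=(n_units, n_in)),
            "b": np.zeros(n_units),
            "W2": np.stack([0.1 * np.triu(rng.normal(size=(n_in, n_in)), 1)
                            for _ in range(n_units)])}

def layer_net(layer, a):
    # net_j = sum_i w_ji a_i + theta_j + sum_{l<m} w'_jlm a_l a_m
    quad = np.array([np.sum(W2u * np.outer(a, a)) for W2u in layer["W2"]])
    return layer["W"] @ a + layer["b"] + quad

def train_pattern(hidden, output, x, t, eta=0.1):
    # Forward pass.
    net_h = layer_net(hidden, x); o_h = f(net_h)
    net_o = layer_net(output, o_h); o_o = f(net_o)
    # Equation (3): error terms for the output units.
    d_o = (t - o_o) * fprime(net_o)
    # Equation (4): error terms for the hidden units, including the second order term.
    sec = np.array([(S + S.T) @ o_h for S in output["W2"]])   # sum_{l != j} w'_klj o_l
    d_h = fprime(net_h) * (output["W"].T @ d_o + sec.T @ d_o)
    # Equations (1) and (2): first and second order weight (and bias) updates.
    for layer, d, a in ((output, d_o, o_h), (hidden, d_h, x)):
        layer["W"] += eta * np.outer(d, a)
        layer["b"] += eta * d
        outer_u = np.triu(np.outer(a, a), 1)
        for j in range(len(d)):
            layer["W2"][j] += eta * d[j] * outer_u
    return 0.5 * np.sum((t - o_o) ** 2)

hidden, output = init_layer(2, 3), init_layer(3, 1)
xor = [([0.0, 0.0], [0.0]), ([0.0, 1.0], [1.0]), ([1.0, 0.0], [1.0]), ([1.0, 1.0], [0.0])]
for epoch in range(2000):
    err = sum(train_pattern(hidden, output, np.array(x), np.array(t)) for x, t in xor)
print("final summed squared error:", err)
```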
3.2 The Complex Backpropagation Algorithm
The Backpropagation algorithm can also be extended to the complex numbers C. As in the real case we
shall extend the definition of the algorithm in order to accommodate the second order connections. In this
section, as we are unencumbered by the restrictions of space, we shall, for the sake of clarity, include several
of the intermediate steps omitted in [KA-01].
In the notation of [KA-01], consider the real valued error function
$$ E = \frac{1}{2} \sum_{i} |d_i - o_i|^2 $$

defined over the network's output units $o_i$ and some given set of target output values $d_i$. Here $|\cdot|$ denotes
the usual norm on the complex numbers

$$ |z| = |x + iy| = \sqrt{x^2 + y^2}. $$
Let the net-input $z_n$ to a network unit be defined by

$$ z_n = \sum_k W_{nk} X_{nk} + \sum_k \sum_{l>k} W_{nkl} X_{nk} X_{nl} = \sum_k (W_{nkR} + iW_{nkI})(X_{nkR} + iX_{nkI}) + \sum_k \sum_{l>k} (W_{nklR} + iW_{nklI})(X_{nkR} + iX_{nkI})(X_{nlR} + iX_{nlI}) = x_n + iy_n $$
where the subscripts R and I denote the real and imaginary components of a particular complex value. The
unit's output activation $o_n$ is defined by

$$ o_n = f_n(z_n) = u_n + iv_n. $$
We shall determine a formula for the first order weight updates by separating out the real and imaginary
components of $\partial E / \partial W_{nk}$ and repeated application of the chain rule (for two variables) and the Cauchy-Riemann equations. We shall
apply a similar argument in order to derive a formula for the second order weight updates and subsequently
the error terms. We note that the complex conjugate of a complex number $z = x + iy$ is denoted $\bar{z}$ and is
equal to $x - iy$.
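Before deriving the updates, the complex second order net input and activation can be illustrated directly with NumPy's complex arithmetic; the dictionary of second order weights and the choice of tanh are illustrative assumptions.

```python
import numpy as np

def complex_net_input(X, W, W2):
    # z_n = sum_k W_nk X_nk + sum_{k<l} W_nkl X_nk X_nl for a single unit.
    n = len(X)
    second = sum(W2[(k, l)] * X[k] * X[l] for k in range(n) for l in range(k + 1, n))
    return np.dot(W, X) + second        # np.dot does not conjugate its arguments

rng = np.random.default_rng(2)
n = 3
X = rng.normal(size=n) + 1j * rng.normal(size=n)
W = rng.normal(size=n) + 1j * rng.normal(size=n)
W2 = {(k, l): rng.normal() + 1j * rng.normal() for k in range(n) for l in range(k + 1, n)}

z = complex_net_input(X, W, W2)
o = np.tanh(z)                          # a fully complex activation from Table 2
print(z, o, np.conj(z))                 # x_n + i y_n, u_n + i v_n, and the conjugate
```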
3.2.1 First Order Weights
Specifically for the first order weights let
$$ \frac{\partial E}{\partial W_{nk}} = \frac{\partial E}{\partial W_{nkR}} + i\frac{\partial E}{\partial W_{nkI}}. $$
For the real component we have
$$ \frac{\partial E}{\partial W_{nkR}} = \frac{\partial E}{\partial u_n}\frac{\partial u_n}{\partial W_{nkR}} + \frac{\partial E}{\partial v_n}\frac{\partial v_n}{\partial W_{nkR}} \tag{5} $$

and for the imaginary component

$$ \frac{\partial E}{\partial W_{nkI}} = \frac{\partial E}{\partial u_n}\frac{\partial u_n}{\partial W_{nkI}} + \frac{\partial E}{\partial v_n}\frac{\partial v_n}{\partial W_{nkI}}. \tag{6} $$
By application of the chain rule the real component yields
$$ \frac{\partial u_n}{\partial W_{nkR}} = \frac{\partial u_n}{\partial x_n}\frac{\partial x_n}{\partial W_{nkR}} + \frac{\partial u_n}{\partial y_n}\frac{\partial y_n}{\partial W_{nkR}} \tag{7} $$

$$ \frac{\partial v_n}{\partial W_{nkR}} = \frac{\partial v_n}{\partial x_n}\frac{\partial x_n}{\partial W_{nkR}} + \frac{\partial v_n}{\partial y_n}\frac{\partial y_n}{\partial W_{nkR}} \tag{8} $$

and for the imaginary component

$$ \frac{\partial u_n}{\partial W_{nkI}} = \frac{\partial u_n}{\partial x_n}\frac{\partial x_n}{\partial W_{nkI}} + \frac{\partial u_n}{\partial y_n}\frac{\partial y_n}{\partial W_{nkI}} \tag{9} $$

$$ \frac{\partial v_n}{\partial W_{nkI}} = \frac{\partial v_n}{\partial x_n}\frac{\partial x_n}{\partial W_{nkI}} + \frac{\partial v_n}{\partial y_n}\frac{\partial y_n}{\partial W_{nkI}}. \tag{10} $$
Substituting equations (7) and (8) into (5), and equations (9) and (10) into (6), yields

$$ \frac{\partial E}{\partial W_{nkR}} = \frac{\partial E}{\partial u_n}\Big(\frac{\partial u_n}{\partial x_n}\frac{\partial x_n}{\partial W_{nkR}} + \frac{\partial u_n}{\partial y_n}\frac{\partial y_n}{\partial W_{nkR}}\Big) + \frac{\partial E}{\partial v_n}\Big(\frac{\partial v_n}{\partial x_n}\frac{\partial x_n}{\partial W_{nkR}} + \frac{\partial v_n}{\partial y_n}\frac{\partial y_n}{\partial W_{nkR}}\Big) \tag{11} $$

$$ \frac{\partial E}{\partial W_{nkI}} = \frac{\partial E}{\partial u_n}\Big(\frac{\partial u_n}{\partial x_n}\frac{\partial x_n}{\partial W_{nkI}} + \frac{\partial u_n}{\partial y_n}\frac{\partial y_n}{\partial W_{nkI}}\Big) + \frac{\partial E}{\partial v_n}\Big(\frac{\partial v_n}{\partial x_n}\frac{\partial x_n}{\partial W_{nkI}} + \frac{\partial v_n}{\partial y_n}\frac{\partial y_n}{\partial W_{nkI}}\Big). \tag{12} $$
Now define the error term δn by
$$ \delta_n = -\frac{\partial E}{\partial u_n} - i\frac{\partial E}{\partial v_n} $$

where

$$ \delta_{nR} = -\frac{\partial E}{\partial u_n} \quad\text{and}\quad \delta_{nI} = -\frac{\partial E}{\partial v_n} $$

and identify the following partial derivatives from the net-input function

$$ \frac{\partial x_n}{\partial W_{nkR}} = X_{nkR}, \qquad \frac{\partial y_n}{\partial W_{nkR}} = X_{nkI}, \qquad \frac{\partial x_n}{\partial W_{nkI}} = -X_{nkI}, \qquad \frac{\partial y_n}{\partial W_{nkI}} = X_{nkR}. $$
It is important to note that the δ terms defined here are not analogous to the δ terms used in the real
valued Backpropagation algorithm. Specifically, in the real valued case, δ represents the rate of change of
error for a unit with respect to its net-input. In symbols

$$ \delta_{pj} = -\frac{\partial E_p}{\partial net_{pj}} = -\frac{\partial E_p}{\partial o_{pj}}\frac{\partial o_{pj}}{\partial net_{pj}} \quad\text{where}\quad \frac{\partial o_{pj}}{\partial net_{pj}} = f'_j(net_{pj}). $$

In the complex case, δ represents the rate of change of error for a unit with respect to its output, which in
the notation of [RHW-87] corresponds to $-\partial E_p / \partial o_{pj}$.
Substituting the $\delta_n$ and the net-input partial derivatives into equations (11) and (12) we have for the real
components of the first order weights

$$ \frac{\partial E}{\partial W_{nkR}} = -\delta_{nR}\Big(\frac{\partial u_n}{\partial x_n} X_{nkR} + \frac{\partial u_n}{\partial y_n} X_{nkI}\Big) - \delta_{nI}\Big(\frac{\partial v_n}{\partial x_n} X_{nkR} + \frac{\partial v_n}{\partial y_n} X_{nkI}\Big) $$

and for the imaginary components we have

$$ \frac{\partial E}{\partial W_{nkI}} = -\delta_{nR}\Big(\frac{\partial u_n}{\partial x_n}(-X_{nkI}) + \frac{\partial u_n}{\partial y_n} X_{nkR}\Big) - \delta_{nI}\Big(\frac{\partial v_n}{\partial x_n}(-X_{nkI}) + \frac{\partial v_n}{\partial y_n} X_{nkR}\Big). $$
Combining the real and imaginary components
$$ \frac{\partial E}{\partial W_{nk}} = -\Big[ \delta_{nR}\Big(\frac{\partial u_n}{\partial x_n} X_{nkR} + \frac{\partial u_n}{\partial y_n} X_{nkI}\Big) + \delta_{nI}\Big(\frac{\partial v_n}{\partial x_n} X_{nkR} + \frac{\partial v_n}{\partial y_n} X_{nkI}\Big) + \delta_{nR}\Big(i\frac{\partial u_n}{\partial x_n}(-X_{nkI}) + i\frac{\partial u_n}{\partial y_n} X_{nkR}\Big) + \delta_{nI}\Big(i\frac{\partial v_n}{\partial x_n}(-X_{nkI}) + i\frac{\partial v_n}{\partial y_n} X_{nkR}\Big) \Big] $$

$$ \frac{\partial E}{\partial W_{nk}} = -\Big[ \delta_{nR}\Big\{\Big(\frac{\partial u_n}{\partial x_n} + i\frac{\partial u_n}{\partial y_n}\Big) X_{nkR} + \Big(-\frac{\partial u_n}{\partial y_n} + i\frac{\partial u_n}{\partial x_n}\Big)(-X_{nkI})\Big\} + \delta_{nI}\Big\{\Big(\frac{\partial v_n}{\partial x_n} + i\frac{\partial v_n}{\partial y_n}\Big) X_{nkR} + \Big(-\frac{\partial v_n}{\partial y_n} + i\frac{\partial v_n}{\partial x_n}\Big)(-X_{nkI})\Big\} \Big] \tag{13} $$
By the application of the Cauchy-Riemann equations we note that
$$ -\frac{\partial u_n}{\partial y_n} + i\frac{\partial u_n}{\partial x_n} = \frac{\partial v_n}{\partial x_n} + i\frac{\partial u_n}{\partial x_n} = i\frac{\partial \bar{f}_n}{\partial x_n} = i\Big(\frac{\partial u_n}{\partial x_n} - i\frac{\partial v_n}{\partial x_n}\Big) = i\Big(\frac{\partial u_n}{\partial x_n} + i\frac{\partial u_n}{\partial y_n}\Big) $$
and that
$$ -\frac{\partial v_n}{\partial y_n} + i\frac{\partial v_n}{\partial x_n} = -\frac{\partial v_n}{\partial y_n} - i\frac{\partial u_n}{\partial y_n} = -i\frac{\partial \bar{f}_n}{\partial y_n} = -i\Big(\frac{\partial u_n}{\partial y_n} - i\frac{\partial v_n}{\partial y_n}\Big) = i\Big(\frac{\partial v_n}{\partial x_n} + i\frac{\partial v_n}{\partial y_n}\Big). $$
Substituting these terms into equation (13) we have
$$ \frac{\partial E}{\partial W_{nk}} = -\Big[ \delta_{nR}\Big\{\Big(\frac{\partial u_n}{\partial x_n} + i\frac{\partial u_n}{\partial y_n}\Big)X_{nkR} + i\Big(\frac{\partial u_n}{\partial x_n} + i\frac{\partial u_n}{\partial y_n}\Big)(-X_{nkI})\Big\} + \delta_{nI}\Big\{\Big(\frac{\partial v_n}{\partial x_n} + i\frac{\partial v_n}{\partial y_n}\Big)X_{nkR} + i\Big(\frac{\partial v_n}{\partial x_n} + i\frac{\partial v_n}{\partial y_n}\Big)(-X_{nkI})\Big\} \Big] $$

$$ \frac{\partial E}{\partial W_{nk}} = -\Big[ \delta_{nR}\Big(\frac{\partial u_n}{\partial x_n} + i\frac{\partial u_n}{\partial y_n}\Big)(X_{nkR} - iX_{nkI}) + \delta_{nI}\Big(\frac{\partial v_n}{\partial x_n} + i\frac{\partial v_n}{\partial y_n}\Big)(X_{nkR} - iX_{nkI}) \Big] $$

$$ \frac{\partial E}{\partial W_{nk}} = -\bar{X}_{nk}\Big[ \delta_{nR}\Big(\frac{\partial u_n}{\partial x_n} + i\frac{\partial u_n}{\partial y_n}\Big) + \delta_{nI}\Big(\frac{\partial v_n}{\partial x_n} + i\frac{\partial v_n}{\partial y_n}\Big) \Big]. $$
By the application of the Cauchy-Riemann equations again we have
$$ \frac{\partial v_n}{\partial x_n} + i\frac{\partial v_n}{\partial y_n} = \frac{\partial v_n}{\partial x_n} + i\frac{\partial u_n}{\partial x_n} = i\Big(\frac{\partial u_n}{\partial x_n} - i\frac{\partial v_n}{\partial x_n}\Big) = i\frac{\partial \bar{f}_n}{\partial x_n} = i\Big(\frac{\partial u_n}{\partial x_n} + i\frac{\partial u_n}{\partial y_n}\Big). $$
Substituting we have
$$ \frac{\partial E}{\partial W_{nk}} = -\bar{X}_{nk}\Big(\frac{\partial u_n}{\partial x_n} + i\frac{\partial u_n}{\partial y_n}\Big)(\delta_{nR} + i\delta_{nI}) $$

$$ \frac{\partial E}{\partial W_{nk}} = -\bar{X}_{nk}\,\delta_n\,\frac{\partial \bar{f}_n}{\partial x_n} $$

$$ \frac{\partial E}{\partial W_{nk}} = -\bar{X}_{nk}\,\delta_n\,\bar{f}'_n(z_n). $$
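This closed form can be checked numerically by comparing it with finite differences of E with respect to the real and imaginary parts of a single first order weight. The sketch below uses one tanh output unit and no second order connections; the helper names and step size are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 3
X = rng.normal(size=n) + 1j * rng.normal(size=n)      # unit inputs X_nk
W = rng.normal(size=n) + 1j * rng.normal(size=n)      # first order weights W_nk
d = rng.normal() + 1j * rng.normal()                   # target output

f = np.tanh
fprime = lambda z: 1.0 - np.tanh(z) ** 2               # derivative of the analytic activation

def error(weights):
    return 0.5 * np.abs(d - f(np.dot(weights, X))) ** 2

# Analytic gradient dE/dW_R + i dE/dW_I from the formula derived above.
z = np.dot(W, X)
delta = d - f(z)                                       # output unit error term
grad = -np.conj(X) * np.conj(fprime(z)) * delta

# Central finite differences on the real and imaginary parts of weight k = 0.
h, k = 1e-6, 0
e_r = np.zeros(n, dtype=complex); e_r[k] = h
e_i = np.zeros(n, dtype=complex); e_i[k] = 1j * h
fd = ((error(W + e_r) - error(W - e_r)) / (2 * h)
      + 1j * (error(W + e_i) - error(W - e_i)) / (2 * h))
print(grad[k], fd)   # the two values should agree to within O(h^2)
```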
3.2.2 Second Order Weights
The preceding argument may also be applied mutatis mutandis to the second order weights
$$ \frac{\partial E}{\partial W_{nklR}} = \frac{\partial E}{\partial u_n}\Big(\frac{\partial u_n}{\partial x_n}\frac{\partial x_n}{\partial W_{nklR}} + \frac{\partial u_n}{\partial y_n}\frac{\partial y_n}{\partial W_{nklR}}\Big) + \frac{\partial E}{\partial v_n}\Big(\frac{\partial v_n}{\partial x_n}\frac{\partial x_n}{\partial W_{nklR}} + \frac{\partial v_n}{\partial y_n}\frac{\partial y_n}{\partial W_{nklR}}\Big) \tag{14} $$

$$ \frac{\partial E}{\partial W_{nklI}} = \frac{\partial E}{\partial u_n}\Big(\frac{\partial u_n}{\partial x_n}\frac{\partial x_n}{\partial W_{nklI}} + \frac{\partial u_n}{\partial y_n}\frac{\partial y_n}{\partial W_{nklI}}\Big) + \frac{\partial E}{\partial v_n}\Big(\frac{\partial v_n}{\partial x_n}\frac{\partial x_n}{\partial W_{nklI}} + \frac{\partial v_n}{\partial y_n}\frac{\partial y_n}{\partial W_{nklI}}\Big). \tag{15} $$
We identify the following partial derivatives from the net-input function

$$ \frac{\partial x_n}{\partial W_{nklR}} = X_{nkR}X_{nlR} - X_{nkI}X_{nlI} = a, \qquad \frac{\partial y_n}{\partial W_{nklR}} = X_{nkR}X_{nlI} + X_{nkI}X_{nlR} = b, $$

$$ \frac{\partial x_n}{\partial W_{nklI}} = -X_{nkR}X_{nlI} - X_{nkI}X_{nlR} = c, \qquad \frac{\partial y_n}{\partial W_{nklI}} = X_{nkR}X_{nlR} - X_{nkI}X_{nlI} = d $$
and we note that a = d and b = −c. Substituting the δn and the net-input partial derivatives into equations
(14) and (15) we have for the second order weights
$$ \frac{\partial E}{\partial W_{nklR}} = -\delta_{nR}\Big(\frac{\partial u_n}{\partial x_n} a + \frac{\partial u_n}{\partial y_n} b\Big) - \delta_{nI}\Big(\frac{\partial v_n}{\partial x_n} a + \frac{\partial v_n}{\partial y_n} b\Big) $$

$$ \frac{\partial E}{\partial W_{nklI}} = -\delta_{nR}\Big(\frac{\partial u_n}{\partial x_n}(-b) + \frac{\partial u_n}{\partial y_n} a\Big) - \delta_{nI}\Big(\frac{\partial v_n}{\partial x_n}(-b) + \frac{\partial v_n}{\partial y_n} a\Big). $$
Combining the real and imaginary components
$$ \frac{\partial E}{\partial W_{nkl}} = -\Big[ \delta_{nR}\Big(\frac{\partial u_n}{\partial x_n} a + \frac{\partial u_n}{\partial y_n} b\Big) + \delta_{nI}\Big(\frac{\partial v_n}{\partial x_n} a + \frac{\partial v_n}{\partial y_n} b\Big) + \delta_{nR}\Big(i\frac{\partial u_n}{\partial x_n}(-b) + i\frac{\partial u_n}{\partial y_n} a\Big) + \delta_{nI}\Big(i\frac{\partial v_n}{\partial x_n}(-b) + i\frac{\partial v_n}{\partial y_n} a\Big) \Big] $$

$$ \frac{\partial E}{\partial W_{nkl}} = -\Big[ \delta_{nR}\Big\{\Big(\frac{\partial u_n}{\partial x_n} + i\frac{\partial u_n}{\partial y_n}\Big) a + \Big(-\frac{\partial u_n}{\partial y_n} + i\frac{\partial u_n}{\partial x_n}\Big)(-b)\Big\} + \delta_{nI}\Big\{\Big(\frac{\partial v_n}{\partial x_n} + i\frac{\partial v_n}{\partial y_n}\Big) a + \Big(-\frac{\partial v_n}{\partial y_n} + i\frac{\partial v_n}{\partial x_n}\Big)(-b)\Big\} \Big] $$

$$ \frac{\partial E}{\partial W_{nkl}} = -\Big[ \delta_{nR}\Big(\frac{\partial u_n}{\partial x_n} + i\frac{\partial u_n}{\partial y_n}\Big)(a - ib) + \delta_{nI}\Big(\frac{\partial v_n}{\partial x_n} + i\frac{\partial v_n}{\partial y_n}\Big)(a - ib) \Big]. $$
As

$$ (a - ib) = X_{nkR}X_{nlR} - X_{nkI}X_{nlI} - i(X_{nkR}X_{nlI} + X_{nkI}X_{nlR}) = (X_{nkR} - iX_{nkI})(X_{nlR} - iX_{nlI}) = \bar{X}_{nk}\,\bar{X}_{nl} $$

we have

$$ \frac{\partial E}{\partial W_{nkl}} = -\bar{X}_{nk}\,\bar{X}_{nl}\,\delta_n\,\bar{f}'_n(z_n). $$
3.2.3 Error Terms
For each output unit n we have
δn = dn − on.
For each hidden unit m we apply the chain rule to δm. Thus for the real component we have
$$ \delta_{mR} = -\frac{\partial E}{\partial u_m} = -\sum_k \frac{\partial E}{\partial u_k}\frac{\partial u_k}{\partial u_m} - \sum_k \frac{\partial E}{\partial v_k}\frac{\partial v_k}{\partial u_m} $$

$$ \delta_{mR} = -\sum_k \frac{\partial E}{\partial u_k}\Big(\frac{\partial u_k}{\partial x_k}\frac{\partial x_k}{\partial u_m} + \frac{\partial u_k}{\partial y_k}\frac{\partial y_k}{\partial u_m}\Big) - \sum_k \frac{\partial E}{\partial v_k}\Big(\frac{\partial v_k}{\partial x_k}\frac{\partial x_k}{\partial u_m} + \frac{\partial v_k}{\partial y_k}\frac{\partial y_k}{\partial u_m}\Big) $$
and for the imaginary component we have
$$ \delta_{mI} = -\frac{\partial E}{\partial v_m} = -\sum_k \frac{\partial E}{\partial u_k}\frac{\partial u_k}{\partial v_m} - \sum_k \frac{\partial E}{\partial v_k}\frac{\partial v_k}{\partial v_m} $$

$$ \delta_{mI} = -\sum_k \frac{\partial E}{\partial u_k}\Big(\frac{\partial u_k}{\partial x_k}\frac{\partial x_k}{\partial v_m} + \frac{\partial u_k}{\partial y_k}\frac{\partial y_k}{\partial v_m}\Big) - \sum_k \frac{\partial E}{\partial v_k}\Big(\frac{\partial v_k}{\partial x_k}\frac{\partial x_k}{\partial v_m} + \frac{\partial v_k}{\partial y_k}\frac{\partial y_k}{\partial v_m}\Big) $$
where here k ranges over the units that receive input from unit m. From the net-input function for each
such unit
$$ z_k = x_k + iy_k = \sum_j (u_j + iv_j)(W_{kjR} + iW_{kjI}) + \sum_j \sum_{l>j} (u_j + iv_j)(u_l + iv_l)(W_{kjlR} + iW_{kjlI}) $$
we identify the following partial derivatives
$$ \frac{\partial x_k}{\partial u_m} = W_{kmR} + \sum_{j \ne m} (W_{kjmR}X_{kjR} - W_{kjmI}X_{kjI}) = a $$

$$ \frac{\partial y_k}{\partial u_m} = W_{kmI} + \sum_{j \ne m} (W_{kjmR}X_{kjI} + W_{kjmI}X_{kjR}) = b $$

$$ \frac{\partial x_k}{\partial v_m} = -W_{kmI} + \sum_{j \ne m} (-W_{kjmR}X_{kjI} - W_{kjmI}X_{kjR}) = c $$

$$ \frac{\partial y_k}{\partial v_m} = W_{kmR} + \sum_{j \ne m} (W_{kjmR}X_{kjR} - W_{kjmI}X_{kjI}) = d. $$
For the real component
$$ \delta_{mR} = \sum_k \delta_{kR}\Big(\frac{\partial u_k}{\partial x_k} a + \frac{\partial u_k}{\partial y_k} b\Big) + \sum_k \delta_{kI}\Big(\frac{\partial v_k}{\partial x_k} a + \frac{\partial v_k}{\partial y_k} b\Big) $$
and similarly for the imaginary component
$$ \delta_{mI} = \sum_k \delta_{kR}\Big(\frac{\partial u_k}{\partial x_k}(-b) + \frac{\partial u_k}{\partial y_k} a\Big) + \sum_k \delta_{kI}\Big(\frac{\partial v_k}{\partial x_k}(-b) + \frac{\partial v_k}{\partial y_k} a\Big). $$
Combining these equations we have
$$ \delta_m = \delta_{mR} + i\delta_{mI} = \sum_k \delta_{kR}\Big(\frac{\partial u_k}{\partial x_k} a + \frac{\partial u_k}{\partial y_k} b\Big) + \sum_k \delta_{kI}\Big(\frac{\partial v_k}{\partial x_k} a + \frac{\partial v_k}{\partial y_k} b\Big) + \sum_k \delta_{kR}\Big(i\frac{\partial u_k}{\partial x_k}(-b) + i\frac{\partial u_k}{\partial y_k} a\Big) + \sum_k \delta_{kI}\Big(i\frac{\partial v_k}{\partial x_k}(-b) + i\frac{\partial v_k}{\partial y_k} a\Big). $$
Using the results of Section 3.2.2 we have
$$ \delta_m = \sum_k \delta_{kR}\Big(\frac{\partial u_k}{\partial x_k} + i\frac{\partial u_k}{\partial y_k}\Big)(a - ib) + \sum_k \delta_{kI}\Big(\frac{\partial v_k}{\partial x_k} + i\frac{\partial v_k}{\partial y_k}\Big)(a - ib) $$

$$ \delta_m = \sum_k \bar{f}'_k(z_k)\,\delta_k\,\bar{W}_{km} + \sum_k \bar{f}'_k(z_k)\,\delta_k \sum_{j \ne m} \bar{W}_{kjm}\,\bar{X}_{kj}. $$
Again, note that the term $\sum_{j \ne m} \bar{W}_{kjm}\,\bar{X}_{kj}$ is only correct if one makes the assumption that $W_{kjm}$ and $W_{kmj}$
identify the same unique second order weight.
Thus the algorithm is specified by four equations. The weight updates themselves are defined by

$$ \Delta W_{nk} = \eta\, \bar{X}_{nk}\, \bar{f}'_n(z_n)\, \delta_n \tag{16} $$

and

$$ \Delta W_{nkl} = \eta\, \bar{X}_{nk}\, \bar{X}_{nl}\, \bar{f}'_n(z_n)\, \delta_n \tag{17} $$

where η is a real, positive learning rate. The error term for the output units is

$$ \delta_n = (d_n - o_n) \tag{18} $$

and for the hidden units we have

$$ \delta_m = \sum_k \bar{f}'_k(z_k)\,\delta_k\,\bar{W}_{km} + \sum_k \bar{f}'_k(z_k)\,\delta_k \sum_{j \ne m} \bar{W}_{kjm}\,\bar{X}_{kj}. \tag{19} $$

As $\bar{a}\,\bar{b} = \overline{ab}$ for complex numbers and $a = \bar{a}$ for real numbers, by making the substitution $\bar{f}'_i(z_i)\,\delta_i = \delta_{pi}$
we note that if we restrict our attention to real valued activations, weights and error signals, the real and
complex weight update equations are identical (see [KA-01], Section 2, page 1282).
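A minimal sketch of equations (16)-(19) for a fully complex network with one hidden layer is given below, assuming a tanh activation, symmetric upper-triangular storage of the second order weights, no biases and a single training pattern; it is illustrative only.

```python
import numpy as np

rng = np.random.default_rng(4)
f = np.tanh
fprime = lambda z: 1.0 - np.tanh(z) ** 2          # derivative of the fully complex activation

def crandn(*shape):
    return rng.normal(size=shape) + 1j * rng.normal(size=shape)

def init_layer(n_in, n_units):
    return {"W": 0.1 * crandn(n_units, n_in),
            "W2": np.stack([0.1 * np.triu(crandn(n_in, n_in), 1) for _ in range(n_units)])}

def net(layer, a):
    quad = np.array([np.sum(np.triu(S, 1) * np.outer(a, a)) for S in layer["W2"]])
    return layer["W"] @ a + quad

def update(hidden, output, x, d, eta=0.05):
    z_h = net(hidden, x);   o_h = f(z_h)
    z_o = net(output, o_h); o_o = f(z_o)
    delta_o = d - o_o                                                 # equation (18)
    # Equation (19): propagate the error terms to the hidden units.
    back = np.array([(S + S.T) @ o_h for S in output["W2"]])          # sum_{j!=m} W_kjm X_kj
    delta_h = (np.conj(fprime(z_o)) * delta_o) @ np.conj(output["W"] + back)
    for layer, delta, z, a in ((output, delta_o, z_o, o_h), (hidden, delta_h, z_h, x)):
        scale = eta * np.conj(fprime(z)) * delta
        layer["W"] += np.outer(scale, np.conj(a))                     # equation (16)
        outer_u = np.triu(np.outer(np.conj(a), np.conj(a)), 1)
        layer["W2"] += scale[:, None, None] * outer_u                 # equation (17)
    return 0.5 * np.sum(np.abs(d - o_o) ** 2)

hidden, output = init_layer(2, 3), init_layer(3, 1)
x, d = crandn(2), crandn(1)
for step in range(200):
    err = update(hidden, output, x, d)
print("final error:", err)
```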
A Appendix
A.1 The Chain Rule
A.1.1 The Chain Rule for a Single Variable
$$ y' = (f \circ g)'(x) = f'(g(x))\,g'(x) $$

$$ u = g(x), \qquad \frac{\partial y}{\partial x} = \frac{\partial y}{\partial u}\frac{\partial u}{\partial x}. $$
A.1.2 The Chain Rule for Multiple Variables I
$$ z = f(x, y), \qquad x = g(t), \qquad y = h(t) $$

$$ \frac{\partial z}{\partial t} = \frac{\partial z}{\partial x}\frac{\partial x}{\partial t} + \frac{\partial z}{\partial y}\frac{\partial y}{\partial t}. $$
A.1.3 The Chain Rule for Multiple Variables II
$$ z = f(u, v), \qquad u = g(x, y), \qquad v = h(x, y) $$

$$ \frac{\partial z}{\partial x} = \frac{\partial z}{\partial u}\frac{\partial u}{\partial x} + \frac{\partial z}{\partial v}\frac{\partial v}{\partial x}, \qquad \frac{\partial z}{\partial y} = \frac{\partial z}{\partial u}\frac{\partial u}{\partial y} + \frac{\partial z}{\partial v}\frac{\partial v}{\partial y} $$
A.2 Complex Differentiation
Consider a complex valued function
f : C → C
defined by
f(x + iy) = u(x, y) + iv(x, y)
where u and v are real valued functions. If u and v have first partial derivatives with respect to x and y,
and satisfy the Cauchy-Riemann equations
$$ \frac{\partial u}{\partial x} = \frac{\partial v}{\partial y} \tag{20} $$

and

$$ \frac{\partial u}{\partial y} = -\frac{\partial v}{\partial x} \tag{21} $$

then f is complex differentiable (see [A-79]). In symbols

$$ f'(z) = \frac{\partial f}{\partial x} = \frac{\partial u}{\partial x} + i\frac{\partial v}{\partial x} = -i\frac{\partial f}{\partial y} = -i\frac{\partial u}{\partial y} + \frac{\partial v}{\partial y}. $$
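Equations (20) and (21) can be checked numerically for the activation functions of Table 2 with a short finite difference sketch (an illustration, not part of the appendix):

```python
import numpy as np

def cauchy_riemann_residuals(f, z, h=1e-6):
    # Finite difference check of equations (20) and (21) for f at the point z.
    du_dx = (f(z + h).real - f(z - h).real) / (2 * h)
    du_dy = (f(z + 1j * h).real - f(z - 1j * h).real) / (2 * h)
    dv_dx = (f(z + h).imag - f(z - h).imag) / (2 * h)
    dv_dy = (f(z + 1j * h).imag - f(z - 1j * h).imag) / (2 * h)
    return du_dx - dv_dy, du_dy + dv_dx        # both residuals should be close to zero

print(cauchy_riemann_residuals(np.tanh, 0.3 + 0.2j))
print(cauchy_riemann_residuals(np.sin, -0.5 + 1.1j))
```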
References
[A-79] L. Ahlfors. Complex Analysis. International Series in Pure and Applied Mathematics. McGraw-Hill, 1979.

[HSW-89] K. Hornik, M. Stinchcombe, and H. White. Multilayer feedforward networks are universal approximators. Neural Networks, 2(5):359-366, 1989.

[KA-01] T. Kim and T. Adali. Complex backpropagation neural networks using elementary transcendental activation functions. ICASSP, 2:1281-1284, 2001.

[KA-03] T. Kim and T. Adali. Approximation by fully complex multilayer perceptrons. Neural Computation, 15(7):1641-1666, 2003.

[MP-43] W. S. McCulloch and W. Pitts. A logical calculus of the ideas immanent in nervous activity. Bulletin of Mathematical Biophysics, 5:115-133, 1943.

[MP-88] M. Minsky and S. Papert. Perceptrons: An Introduction to Computational Geometry. MIT Press, Cambridge, MA, expanded edition, 1988.

[RHW-87] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning internal representations by error propagation. In D. E. Rumelhart, J. L. McClelland, et al., editors, Parallel Distributed Processing: Volume 1: Foundations, pages 318-362. MIT Press, Cambridge, 1987.

[WG-95] J. Wray and G. G. R. Green. Neural networks, approximation theory, and finite precision computation. Neural Networks, 8(1):31-37, 1995.