1 Introduction
In this document we describe a general class of second order feedforward neural networks and the associated real and complex valued versions of the Backpropagation learning algorithm that can be employed to train such networks. We have (slightly) extended the derivation of the Backpropagation algorithm presented in [RHW-87] and [KA-01] to include second order connections.
2 Neural Networks
Artificial neural networks, as their name suggests, are models of computation based on principles abstracted from biological nervous systems (see [MP-43] for example). A neural network consists of a finite number of
simple processing units which communicate via channels that are interconnected according to some pattern
or architecture. Each unit consists of a number of weighted input channels and a single output channel.
A unit combines its channel weights with the input signals on those channels and then computes a scalar
output signal or activation. This activation is then propagated to other units in the network according to
the pattern of interconnection. Thus a pattern of activation spreads across the network over time. Once the
pattern of activation has converged to a stable state we may interrogate the network’s output units.
The computational power of such a network is dictated by the choice of unit functions, the weights and
the architecture. Typically the unit functions and the architecture are fixed and the function computed by
the network as a whole is dictated by the choice of weights.
2.1 Multilayer Perceptrons
We shall concern ourselves with an important class of feedforward neural network: the Multilayer Perceptron or MLP (see [RHW-87], [MP-88], and [KA-01]). In this section we present a specification of the class of L-layer feedforward neural networks1 that we shall employ throughout this document.
Let A be the set of activation values and let W denote the set of weight values. Typically A will be restricted to a closed interval such as [0, 1], [−1, 1] or [−π, π] in the reals R or the complex numbers C, and we note that all these sets are compact in R and C. The set W will be equal to the reals R or the complex numbers C as required.
Each hidden and output unit of the network, i = 1, ..., n and l = 1, ..., L, computes a function

f^l_i : A^n × W^n × W^{n(n−1)/2} × W → A

defined by

f^l_i(a, w, w', θ) = act^l_i(net(a, w, w', θ))

where act^l_i is the unit's activation function and net is the unit's second order net-input function. We define the net-input function

net(a, w, w', θ) = Σ_{i=1}^{n} a_i w_i + Σ_{j=1}^{n} Σ_{k>j} a_j a_k w'_{j,k} + θ
where a are the unit's inputs, w are the first order connections, w' are the second order connections, and θ is the unit's bias. When n = 1 we shall assume that the network has no second order connections. The products a_j a_k are the multiplicative conjuncts of the unit. In this document we shall make use of the real
valued activation functions shown in table (1) and the complex activation functions shown in table (2).
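As a concrete illustration, the second order net-input of a single unit can be evaluated directly from its weights. The Python sketch below uses our own variable names (the second order weights are held in a dictionary keyed by zero-based index pairs (j, k) with j < k) and is illustrative only, not part of the specification above.

import math

def net_input(a, w, w2, theta):
    """a: inputs a_1..a_n; w: first order weights w_1..w_n; theta: the bias;
    w2: {(j, k): second order weight} with zero-based indices and j < k."""
    first = sum(a_i * w_i for a_i, w_i in zip(a, w))
    second = sum(a[j] * a[k] * w2[(j, k)] for (j, k) in w2)
    return first + second + theta

def unit_output(a, w, w2, theta, act=math.tanh):   # act plays the role of act^l_i
    return act(net_input(a, w, w2, theta))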
1We shall adopt the convention that layer 0 denotes the network’s inputs, that layer 1 is the first hidden layer, and that
layer L is the network’s output layer. For the special case where L = 1 we note that there are no hidden layers.
Sigmoid                      1/(1 + e^{−x})
Sigmoid Complement           1 − 1/(1 + e^{−x})
Hyperbolic Tangent           tanh(x)
Sine                         sin(x)
Cosine                       cos(x)
Gaussian                     e^{−x^2}
Gaussian Complement          1 − e^{−x^2}

Table 1: Real Activation Functions.

Tangent                      tan(z)
Sine                         sin(z)
Inverse Tangent              arctan(z)
Inverse Sine                 arcsin(z)
Inverse Cosine               arccos(z)
Hyperbolic Tangent           tanh(z)
Hyperbolic Sine              sinh(z)
Inverse Hyperbolic Tangent   arctanh(z)
Inverse Hyperbolic Sine      arcsinh(z)

Table 2: Complex Activation Functions.
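The real activation functions of table (1), together with the derivatives that the Backpropagation algorithm of Section 3 will require, can be written down directly; the following Python definitions are purely illustrative and the function names are our own.

import math

def sigmoid(x):              return 1.0 / (1.0 + math.exp(-x))
def sigmoid_complement(x):   return 1.0 - sigmoid(x)
def gaussian(x):             return math.exp(-x * x)
def gaussian_complement(x):  return 1.0 - gaussian(x)
# tanh, sin and cos are available directly as math.tanh, math.sin and math.cos.

# Derivatives used by Backpropagation in Section 3 (standard calculus results):
def d_sigmoid(x):
    s = sigmoid(x)
    return s * (1.0 - s)
def d_tanh(x):               return 1.0 - math.tanh(x) ** 2
def d_gaussian(x):           return -2.0 * x * gaussian(x)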
For notational convenience we shall represent the bias θ by an extra weight, w_{i,n+1}, from a unit whose output is always 1. As a result our equations become

f^l_i : A^{n+1} × W^{n+1} × W^{n(n−1)/2} → A

f^l_i(a, w, w') = act^l_i(net(a, w, w'))

net(a, w, w') = Σ_{i=1}^{n+1} a_i w_i + Σ_{j=1}^{n} Σ_{k>j} a_j a_k w'_{j,k}
for i = 1, . . . , n and l = 1, . . . , L. The units of the first hidden layer i = 1, . . . , n and l = 1 are defined by
f^l_i(a, w, w') = act^l_i(net( a_1, ..., a_n, 1,
                               w^l_{i,1}, ..., w^l_{i,n}, w^l_{i,n+1},
                               w'^l_{i,1,2}, w'^l_{i,1,3}, ..., w'^l_{i,1,n},
                               w'^l_{i,2,3}, w'^l_{i,2,4}, ..., w'^l_{i,2,n},
                               ...,
                               w'^l_{i,n−1,n} )).

For each unit in the subsequent hidden and output layers, i = 1, ..., n and l = 2, ..., L, we have

f^l_i(a, w, w') = act^l_i(net( f^{l−1}_1(a, w, w'), ..., f^{l−1}_n(a, w, w'), 1,
                               w^l_{i,1}, ..., w^l_{i,n}, w^l_{i,n+1},
                               w'^l_{i,1,2}, w'^l_{i,1,3}, ..., w'^l_{i,1,n},
                               w'^l_{i,2,3}, w'^l_{i,2,4}, ..., w'^l_{i,2,n},
                               ...,
                               w'^l_{i,n−1,n} )).
The network output is defined by

f(a, w, w') = (f^L_1(a, w, w'), ..., f^L_n(a, w, w')).

From these unit specifications, an L layer feedforward network is defined by the function

f : A^n × W^{Ln(n+1)} × W^{Ln^2(n−1)/2} → A^n.
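Read operationally, f is simply the unit equation applied layer by layer. The following Python sketch of the forward pass (our own helper names, with one activation function per layer) is illustrative rather than a definitive implementation.

def forward(a, layers, act):
    """layers[l-1] is the list of units in layer l; each unit is a pair (w, w2)
    where w = [w_1, ..., w_n, w_{n+1}] (the last entry weights the bias input 1)
    and w2 = {(j, k): second order weight} with zero-based indices and j < k.
    act(l, net) is the activation function used in layer l."""
    outputs = list(a)
    for l, units in enumerate(layers, start=1):
        prev = outputs
        inputs = prev + [1.0]                # append the bias input
        outputs = []
        for (w, w2) in units:
            net = sum(x * wi for x, wi in zip(inputs, w))
            net += sum(prev[j] * prev[k] * w2[(j, k)] for (j, k) in w2)
            outputs.append(act(l, net))
    return outputs                           # (f^L_1, ..., f^L_n)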
Given some countable set of non-constant, real or complex valued activation functions, denoted Ψ (see table (1) or table (2) for example), we define the class of all such networks as

MLP(Ψ) = { f | ∀n > 0, ∀L > 0, ∀w ∈ W^{Ln(n+1)}, ∀w' ∈ W^{Ln^2(n−1)/2} }.
2.2 Universal Approximation
A class of neural networks is said to be a universal approximator if, for any given real (or complex) valued Borel measurable target function g on A, there exists a network in our class, say f ∈ MLP(Ψ), that can approximate g to any desired degree of accuracy. The class of real (or complex) valued first order multilayer
networks with a suitable activation function (see table (1) and table (2)) is a universal approximator (see
[HSW-89] and [KA-03]). Thus the inclusion of second order connections does not affect the computational
power of our multilayer networks in a theoretical sense2
. In practice, however, such networks are better able to approximate certain target functions.
The proof of universal approximation employs the Stone-Weierstrass theorem for real and complex algebras of functions and is existential rather than constructive. In other words the proof asserts that such a network exists in theory. It provides no information regarding the structure of the appropriate approximating network which, in this case, corresponds to the number of hidden units required and the values of the
weights. It is important to emphasise that the class of networks MLP(Ψ) is capable of universal approxi-
mation only in theory. In practice, the finite precision of any particular implementation greatly reduces the
class of functions that can be represented by any particular network (see [WG-95]).
3 The Backpropagation Learning Algorithm
Most neural networks are not programmed explicitly; rather, they learn from, or equivalently are trained on, a set of patterns representing examples of the task to be performed. Essentially a learning algorithm iteratively modifies the network’s weights until the network is a “correct implementation” of the “task” encoded in the patterns. We shall employ the online Backpropagation algorithm (see [RHW-87] and [KA-01]) to train our feedforward networks. The Backpropagation algorithm, given some initial, randomly generated set of weights, processes a sequence of patterns and produces a sequence of weight updates that are intended to minimise an error function defined over the network’s outputs and the training patterns. Training continues until the network satisfies its correctness criterion or some fixed number of training cycles has been performed.
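The training regime just described has the following overall shape. The four callables in this Python skeleton stand for the operations derived in the remainder of this section; they are assumptions of the sketch, not functions defined in this document.

import random

def train_online(weights, patterns, forward_pass, compute_updates, apply_updates,
                 meets_criterion, max_cycles, eta):
    for cycle in range(max_cycles):
        random.shuffle(patterns)                         # present patterns in a random order
        for inputs, targets in patterns:
            outputs = forward_pass(weights, inputs)      # Section 2.1
            updates = compute_updates(weights, inputs, outputs, targets, eta)
            weights = apply_updates(weights, updates)    # online: update after every pattern
        if meets_criterion(weights, patterns):           # the "correctness criterion"
            break
    return weights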
The Backpropagation algorithm is a neural network implementation of the method of steepest gradient descent, usually attributed to Cauchy, and may be applied to real or complex valued networks. In each case we shall need to extend the algorithm slightly to accommodate the second order weights.
We note that the algorithm cannot be guaranteed to converge to a set of weights that result in a correct
neural network implementation of a general task. In practice the learning algorithm is very sensitive to the
precise choice of initial weights and learning algorithm parameters and, in common with all gradient descent
methods, can become trapped in local minima when the error function to be minimised is non-convex. In
addition, given the existential nature of the proof of universal approximation (see Section 2.2) it is, in general,
difficult to estimate an appropriate number of hidden layers and units.
2For the special case where L = 1 we note that second order single layer networks are more powerful than first order single layer networks as they are capable of solving non-linearly separable problems such as the XOR problem (see [RHW-87], pages 319–321).
3.1 The Real Backpropagation Algorithm
Backpropagation is intended to minimise the real valued error function

E = (1/2) Σ_p Σ_k (t_{pk} − o_{pk})^2

defined over the network’s output units o_{pk} and the target outputs t_{pk} by calculating weight updates

Δ_p w_{ji} ∝ −∂E_p/∂w_{ji}   and   Δ_p w_{jlm} ∝ −∂E_p/∂w_{jlm}

for each weight in the network and each training pattern p. In the notation of [RHW-87] the net-input and activation function are defined to be

net_{pj} = Σ_i w_{ji} o_{pi} + Σ_l Σ_{m>l} w_{jlm} o_{pl} o_{pm}

o_{pj} = f_j(net_{pj}).

We will derive formulae for ∂E_p/∂w_{ji} and ∂E_p/∂w_{jlm} by repeated application of the chain rule. Specifically, let

∂E_p/∂w_{ji} = (∂E_p/∂net_{pj}) (∂net_{pj}/∂w_{ji})   and   ∂E_p/∂w_{jlm} = (∂E_p/∂net_{pj}) (∂net_{pj}/∂w_{jlm}).

The second term on the right hand side of each equation yields

∂net_{pj}/∂w_{ji} = ∂/∂w_{ji} ( Σ_k w_{jk} o_{pk} + Σ_l Σ_{m>l} w_{jlm} o_{pl} o_{pm} ) = o_{pi}

and

∂net_{pj}/∂w_{jlm} = ∂/∂w_{jlm} ( Σ_k w_{jk} o_{pk} + Σ_l Σ_{m>l} w_{jlm} o_{pl} o_{pm} ) = o_{pl} o_{pm}.
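Both partial derivatives are easy to confirm numerically. The following throwaway Python check (our own example values) perturbs one first order and one second order weight of a single unit and compares the finite differences against o_{pi} and o_{pl} o_{pm}.

def net(o, w, w2):
    # net_pj for one unit: first order term plus upper-triangular second order term
    return (sum(w[i] * o[i] for i in range(len(o)))
            + sum(w2[(l, m)] * o[l] * o[m] for (l, m) in w2))

o   = [0.3, -0.7, 0.5]
w   = [0.1, 0.2, -0.4]
w2  = {(0, 1): 0.05, (0, 2): -0.3, (1, 2): 0.25}
eps = 1e-6

w_plus = list(w); w_plus[1] += eps                    # perturb w_{j,1}
print((net(o, w_plus, w2) - net(o, w, w2)) / eps, o[1])        # should match o_{p1}

w2_plus = dict(w2); w2_plus[(0, 2)] += eps            # perturb w_{j,0,2}
print((net(o, w, w2_plus) - net(o, w, w2)) / eps, o[0] * o[2]) # should match o_{p0} o_{p2}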
Now define

δ_{pj} = −∂E_p/∂net_{pj}.

Again, applying the chain rule we have

∂E_p/∂net_{pj} = (∂E_p/∂o_{pj}) (∂o_{pj}/∂net_{pj})

and we note that

∂o_{pj}/∂net_{pj} = f'_j(net_{pj})

where f' denotes the derivative of f. For the output units we have

∂E_p/∂o_{pj} = −(t_{pj} − o_{pj}).

For the hidden units we again apply the chain rule

∂E_p/∂o_{pj} = Σ_k (∂E_p/∂net_{pk}) (∂net_{pk}/∂o_{pj})
            = Σ_k (∂E_p/∂net_{pk}) ∂/∂o_{pj} ( Σ_i w_{ki} o_{pi} + Σ_l Σ_{m>l} w_{klm} o_{pl} o_{pm} )
            = Σ_k (∂E_p/∂net_{pk}) ( w_{kj} + Σ_{l≠j} w_{klj} o_{pl} )
            = −( Σ_k δ_{pk} w_{kj} + Σ_k δ_{pk} Σ_{l≠j} w_{klj} o_{pl} ).
We note that the notationally convenient term Σ_{l≠j} w_{klj} o_{pl} is only correct if one makes the assumption that w_{klj} and w_{kjl} identify the same unique second order weight. In the absence of this assumption we note that, as k > j for each second order weight w_{ijk}, the correct expansion of the term Σ_{l≠j} w_{klj} o_{pl} is

w_{k,1,j} o_{p,1} + w_{k,2,j} o_{p,2} + ··· + w_{k,j−1,j} o_{p,j−1} + w_{k,j,j+1} o_{p,j+1} + ··· + w_{k,j,n} o_{p,n}.

Thus the algorithm is specified by four equations. The weight updates themselves are defined by

Δ_p w_{ji} = η δ_{pj} o_{pi}   (1)

and

Δ_p w_{jlm} = η δ_{pj} o_{pl} o_{pm}.   (2)

The error term for the output units is

δ_{pj} = (t_{pj} − o_{pj}) f'_j(net_{pj})   (3)

and for the hidden units we have

δ_{pj} = f'_j(net_{pj}) ( Σ_k δ_{pk} w_{kj} + Σ_k δ_{pk} Σ_{l≠j} w_{klj} o_{pl} ).   (4)
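Equations (1)–(4) translate directly into code. The Python sketch below of a single online update for a two layer network is intended only to make the index bookkeeping concrete; the variable names, the choice of tanh units and the storage of second order weights as a dictionary keyed by zero-based pairs (j, k) with j < k are our own illustrative choices, not part of the specification above.

import math

def net(inputs, w, w2):
    ext = inputs + [1.0]                      # bias input appended; w has one extra weight
    s = sum(xi * wi for xi, wi in zip(ext, w))
    s += sum(inputs[j] * inputs[k] * w2[(j, k)] for (j, k) in w2)
    return s

def w2_get(w2, a, b):
    # treat w_{klj} and w_{kjl} as the same unique second order weight
    return w2.get((a, b), w2.get((b, a), 0.0))

def backprop_step(x, t, hidden, output, eta=0.1):
    """hidden, output: lists of units, each unit a pair (w, w2) as in net()."""
    # forward pass
    h_net = [net(x, w, w2) for (w, w2) in hidden]
    h_out = [math.tanh(v) for v in h_net]
    o_net = [net(h_out, w, w2) for (w, w2) in output]
    o_out = [math.tanh(v) for v in o_net]

    # equation (3): output unit deltas
    d_out = [(tk - ok) * (1.0 - math.tanh(nk) ** 2)
             for tk, ok, nk in zip(t, o_out, o_net)]

    # equation (4): hidden unit deltas, including the second order term
    d_hid = []
    for j, nj in enumerate(h_net):
        s = 0.0
        for dk, (wk, wk2) in zip(d_out, output):
            s += dk * wk[j]
            s += dk * sum(w2_get(wk2, l, j) * h_out[l]
                          for l in range(len(h_out)) if l != j)
        d_hid.append((1.0 - math.tanh(nj) ** 2) * s)

    # equations (1) and (2): weight updates for one layer
    def updates(deltas, layer, inputs):
        new = []
        for dj, (w, w2) in zip(deltas, layer):
            ext = inputs + [1.0]
            w_new = [wi + eta * dj * oi for wi, oi in zip(w, ext)]
            w2_new = {jk: w2[jk] + eta * dj * inputs[jk[0]] * inputs[jk[1]] for jk in w2}
            new.append((w_new, w2_new))
        return new

    return updates(d_hid, hidden, x), updates(d_out, output, h_out)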
3.2 The Complex Backpropagation Algorithm
The Backpropagation algorithm can also be extended to the complex numbers C. As in the real case we
shall extend the definition of the algorithm in order to accommodate the second order connections. In this
section, as we are unencumbered by the restrictions of space, we shall, for the sake of clarity, include several
of the intermediate steps omitted in [KA-01].
In the notation of [KA-01], consider the real valued error function

E = (1/2) Σ_n |d_n − o_n|^2

defined over the network’s output units o_n and some given set of target output values d_n. Here |·| denotes the usual norm on the complex numbers

|z| = |x + iy| = √(x^2 + y^2).

Let the net-input z_n to a network unit be defined by

z_n = Σ_k W_{nk} X_{nk} + Σ_k Σ_{l>k} W_{nkl} X_{nk} X_{nl}
    = Σ_k (W_{nkR} + iW_{nkI})(X_{nkR} + iX_{nkI}) + Σ_k Σ_{l>k} (W_{nklR} + iW_{nklI})(X_{nkR} + iX_{nkI})(X_{nlR} + iX_{nlI})
    = x_n + iy_n
where the subscripts R and I denote the real and imaginary components of a particular complex value. The unit’s output activation o_n is defined by

o_n = f_n(z_n) = u_n + iv_n.

We shall determine a formula for the first order weight updates by separating out the real and imaginary components of ∂E/∂W_{nk} and repeated application of the chain rule (for two variables) and the Cauchy-Riemann equations. We shall apply a similar argument in order to derive a formula for the second order weight updates and subsequently the error terms. We note that the complex conjugate of a complex number z = x + iy is denoted z̄ and is equal to x − iy.
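The quantities above map directly onto a language with built-in complex arithmetic; the following Python sketch of z_n, o_n and E (our own variable names, with the second order weights again keyed by zero-based index pairs) is illustrative only.

import cmath

def z_n(X, W, W2):
    """X: complex inputs X_{n1}, X_{n2}, ...; W: complex first order weights;
    W2: {(k, l): W_{nkl}} with zero-based indices and k < l."""
    z = sum(Wk * Xk for Wk, Xk in zip(W, X))
    z += sum(W2[(k, l)] * X[k] * X[l] for (k, l) in W2)
    return z

def o_n(X, W, W2, f=cmath.tanh):
    return f(z_n(X, W, W2))                  # o_n = f_n(z_n) = u_n + i v_n

def E(d, o):
    return 0.5 * sum(abs(dk - ok) ** 2 for dk, ok in zip(d, o))   # E = (1/2) Σ |d_n − o_n|^2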
3.2.1 First Order Weights
Specifically for the first order weights let

∂E/∂W_{nk} = ∂E/∂W_{nkR} + i ∂E/∂W_{nkI}.

For the real component we have

∂E/∂W_{nkR} = (∂E/∂u_n)(∂u_n/∂W_{nkR}) + (∂E/∂v_n)(∂v_n/∂W_{nkR})   (5)

and for the imaginary component

∂E/∂W_{nkI} = (∂E/∂u_n)(∂u_n/∂W_{nkI}) + (∂E/∂v_n)(∂v_n/∂W_{nkI}).   (6)

By application of the chain rule the real component yields

∂u_n/∂W_{nkR} = (∂u_n/∂x_n)(∂x_n/∂W_{nkR}) + (∂u_n/∂y_n)(∂y_n/∂W_{nkR})   (7)

∂v_n/∂W_{nkR} = (∂v_n/∂x_n)(∂x_n/∂W_{nkR}) + (∂v_n/∂y_n)(∂y_n/∂W_{nkR})   (8)

and for the imaginary component

∂u_n/∂W_{nkI} = (∂u_n/∂x_n)(∂x_n/∂W_{nkI}) + (∂u_n/∂y_n)(∂y_n/∂W_{nkI})   (9)

∂v_n/∂W_{nkI} = (∂v_n/∂x_n)(∂x_n/∂W_{nkI}) + (∂v_n/∂y_n)(∂y_n/∂W_{nkI}).   (10)

Substituting equations (7) and (8) into (5), and equations (9) and (10) into (6), yields

∂E/∂W_{nkR} = (∂E/∂u_n)[ (∂u_n/∂x_n)(∂x_n/∂W_{nkR}) + (∂u_n/∂y_n)(∂y_n/∂W_{nkR}) ] + (∂E/∂v_n)[ (∂v_n/∂x_n)(∂x_n/∂W_{nkR}) + (∂v_n/∂y_n)(∂y_n/∂W_{nkR}) ]   (11)

∂E/∂W_{nkI} = (∂E/∂u_n)[ (∂u_n/∂x_n)(∂x_n/∂W_{nkI}) + (∂u_n/∂y_n)(∂y_n/∂W_{nkI}) ] + (∂E/∂v_n)[ (∂v_n/∂x_n)(∂x_n/∂W_{nkI}) + (∂v_n/∂y_n)(∂y_n/∂W_{nkI}) ].   (12)
Now define the error term δ_n by

δ_n = −∂E/∂u_n − i ∂E/∂v_n

where

δ_{nR} = −∂E/∂u_n   and   δ_{nI} = −∂E/∂v_n

and identify the following partial derivatives from the net-input function

∂x_n/∂W_{nkR} = X_{nkR},   ∂y_n/∂W_{nkR} = X_{nkI},   ∂x_n/∂W_{nkI} = −X_{nkI},   ∂y_n/∂W_{nkI} = X_{nkR}.
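These four partial derivatives can be confirmed numerically. The throwaway Python check below (our own example values; only the first order part of the net-input is used, since the second order terms do not involve W_{nk}) perturbs the real and imaginary parts of one weight.

X = [0.3 - 0.2j, -0.5 + 0.8j]
W = [0.1 + 0.4j, -0.7 - 0.1j]
eps = 1e-6

def z(W):                                   # first order part of the net-input
    return sum(Wk * Xk for Wk, Xk in zip(W, X))

base = z(W)
Wr = list(W); Wr[0] += eps                  # perturb the real part of W_{n0}
Wi = list(W); Wi[0] += 1j * eps             # perturb the imaginary part of W_{n0}

print((z(Wr) - base).real / eps, X[0].real)    # dx_n/dW_R  =  X_R
print((z(Wr) - base).imag / eps, X[0].imag)    # dy_n/dW_R  =  X_I
print((z(Wi) - base).real / eps, -X[0].imag)   # dx_n/dW_I  = -X_I
print((z(Wi) - base).imag / eps, X[0].real)    # dy_n/dW_I  =  X_R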
It is important to note that the δ terms defined here are not analogous to the δ terms used in the real valued Backpropagation algorithm. Specifically, in the real valued case, δ represents the rate of change of error for a unit with respect to its net-input. In symbols

δ_{pj} = −∂E_p/∂net_{pj} = −(∂E_p/∂o_{pj})(∂o_{pj}/∂net_{pj})

where

∂o_{pj}/∂net_{pj} = f'_j(net_{pj}).

In the complex case, δ represents the rate of change of error for a unit with respect to its output, which in the notation of [RHW-87] corresponds to −∂E_p/∂o_{pj}.
Substituting the δ_n and the net-input partial derivatives into equations (11) and (12) we have for the real components of the first order weights

∂E/∂W_{nkR} = −δ_{nR}[ (∂u_n/∂x_n) X_{nkR} + (∂u_n/∂y_n) X_{nkI} ] − δ_{nI}[ (∂v_n/∂x_n) X_{nkR} + (∂v_n/∂y_n) X_{nkI} ]

and for the imaginary components we have

∂E/∂W_{nkI} = −δ_{nR}[ (∂u_n/∂x_n)(−X_{nkI}) + (∂u_n/∂y_n) X_{nkR} ] − δ_{nI}[ (∂v_n/∂x_n)(−X_{nkI}) + (∂v_n/∂y_n) X_{nkR} ].

Combining the real and imaginary components

∂E/∂W_{nk} = −{ δ_{nR}[ (∂u_n/∂x_n) X_{nkR} + (∂u_n/∂y_n) X_{nkI} ] + δ_{nI}[ (∂v_n/∂x_n) X_{nkR} + (∂v_n/∂y_n) X_{nkI} ]
              + δ_{nR}[ i(∂u_n/∂x_n)(−X_{nkI}) + i(∂u_n/∂y_n) X_{nkR} ] + δ_{nI}[ i(∂v_n/∂x_n)(−X_{nkI}) + i(∂v_n/∂y_n) X_{nkR} ] }

∂E/∂W_{nk} = −{ δ_{nR}[ (∂u_n/∂x_n + i ∂u_n/∂y_n) X_{nkR} + (−∂u_n/∂y_n + i ∂u_n/∂x_n)(−X_{nkI}) ]
              + δ_{nI}[ (∂v_n/∂x_n + i ∂v_n/∂y_n) X_{nkR} + (−∂v_n/∂y_n + i ∂v_n/∂x_n)(−X_{nkI}) ] }.   (13)
By the application of the Cauchy-Riemann equations we note that

−∂u_n/∂y_n + i ∂u_n/∂x_n = ∂v_n/∂x_n + i ∂u_n/∂x_n = i ∂f̄_n/∂x_n = i( ∂u_n/∂x_n − i ∂v_n/∂x_n ) = i( ∂u_n/∂x_n + i ∂u_n/∂y_n )

and that

−∂v_n/∂y_n + i ∂v_n/∂x_n = −i[ i( −∂v_n/∂y_n + i ∂v_n/∂x_n ) ]
                         = −i[ i( −∂v_n/∂y_n − i ∂u_n/∂y_n ) ]
                         = −i[ −i( ∂v_n/∂y_n + i ∂u_n/∂y_n ) ]
                         = −i ∂f̄_n/∂y_n
                         = −i( ∂u_n/∂y_n − i ∂v_n/∂y_n )
                         = −i( −∂v_n/∂x_n − i ∂v_n/∂y_n )
                         = i( ∂v_n/∂x_n + i ∂v_n/∂y_n ).
Substituting these terms into equation (13) we have

∂E/∂W_{nk} = −{ δ_{nR}[ (∂u_n/∂x_n + i ∂u_n/∂y_n) X_{nkR} + i(∂u_n/∂x_n + i ∂u_n/∂y_n)(−X_{nkI}) ]
              + δ_{nI}[ (∂v_n/∂x_n + i ∂v_n/∂y_n) X_{nkR} + i(∂v_n/∂x_n + i ∂v_n/∂y_n)(−X_{nkI}) ] }

∂E/∂W_{nk} = −{ δ_{nR}(∂u_n/∂x_n + i ∂u_n/∂y_n)(X_{nkR} − iX_{nkI}) + δ_{nI}(∂v_n/∂x_n + i ∂v_n/∂y_n)(X_{nkR} − iX_{nkI}) }

∂E/∂W_{nk} = −X̄_{nk}[ δ_{nR}(∂u_n/∂x_n + i ∂u_n/∂y_n) + δ_{nI}(∂v_n/∂x_n + i ∂v_n/∂y_n) ].
By the application of the Cauchy-Riemann equations again we have

∂v_n/∂x_n + i ∂v_n/∂y_n = ∂v_n/∂x_n − i ∂u_n/∂x_n = −i ∂f̄_n/∂x_n = −i( ∂u_n/∂x_n − i ∂v_n/∂x_n ) = −i( ∂u_n/∂x_n + i ∂u_n/∂y_n ).

Substituting we have

∂E/∂W_{nk} = −X̄_{nk}( ∂u_n/∂x_n + i ∂u_n/∂y_n )( δ_{nR} − iδ_{nI} )

∂E/∂W_{nk} = −X̄_{nk} δ̄_n ∂f̄_n/∂x_n

∂E/∂W_{nk} = −X̄_{nk} δ̄_n f̄'_n(z_n).
3.2.2 Second Order Weights
The preceding argument may also be applied mutatis mutandis to the second order weights

∂E/∂W_{nklR} = (∂E/∂u_n)[ (∂u_n/∂x_n)(∂x_n/∂W_{nklR}) + (∂u_n/∂y_n)(∂y_n/∂W_{nklR}) ] + (∂E/∂v_n)[ (∂v_n/∂x_n)(∂x_n/∂W_{nklR}) + (∂v_n/∂y_n)(∂y_n/∂W_{nklR}) ]   (14)

∂E/∂W_{nklI} = (∂E/∂u_n)[ (∂u_n/∂x_n)(∂x_n/∂W_{nklI}) + (∂u_n/∂y_n)(∂y_n/∂W_{nklI}) ] + (∂E/∂v_n)[ (∂v_n/∂x_n)(∂x_n/∂W_{nklI}) + (∂v_n/∂y_n)(∂y_n/∂W_{nklI}) ].   (15)
We identify the following partial derivatives from the net-input function

∂x_n/∂W_{nklR} = X_{nkR} X_{nlR} − X_{nkI} X_{nlI} = a

∂y_n/∂W_{nklR} = X_{nkR} X_{nlI} + X_{nkI} X_{nlR} = b

∂x_n/∂W_{nklI} = −X_{nkR} X_{nlI} − X_{nkI} X_{nlR} = c

∂y_n/∂W_{nklI} = X_{nkR} X_{nlR} − X_{nkI} X_{nlI} = d

and we note that a = d and b = −c. Substituting the δ_n and the net-input partial derivatives into equations (14) and (15) we have for the second order weights

∂E/∂W_{nklR} = −δ_{nR}[ (∂u_n/∂x_n) a + (∂u_n/∂y_n) b ] − δ_{nI}[ (∂v_n/∂x_n) a + (∂v_n/∂y_n) b ]

∂E/∂W_{nklI} = −δ_{nR}[ (∂u_n/∂x_n)(−b) + (∂u_n/∂y_n) a ] − δ_{nI}[ (∂v_n/∂x_n)(−b) + (∂v_n/∂y_n) a ].
Combining the real and imaginary components

∂E/∂W_{nkl} = −{ δ_{nR}[ (∂u_n/∂x_n) a + (∂u_n/∂y_n) b ] + δ_{nI}[ (∂v_n/∂x_n) a + (∂v_n/∂y_n) b ]
              + δ_{nR}[ i(∂u_n/∂x_n)(−b) + i(∂u_n/∂y_n) a ] + δ_{nI}[ i(∂v_n/∂x_n)(−b) + i(∂v_n/∂y_n) a ] }

∂E/∂W_{nkl} = −{ δ_{nR}[ (∂u_n/∂x_n + i ∂u_n/∂y_n) a + (−∂u_n/∂y_n + i ∂u_n/∂x_n)(−b) ]
              + δ_{nI}[ (∂v_n/∂x_n + i ∂v_n/∂y_n) a + (−∂v_n/∂y_n + i ∂v_n/∂x_n)(−b) ] }

∂E/∂W_{nkl} = −{ δ_{nR}(∂u_n/∂x_n + i ∂u_n/∂y_n)(a − ib) + δ_{nI}(∂v_n/∂x_n + i ∂v_n/∂y_n)(a − ib) }.

As

a − ib = X_{nkR} X_{nlR} − X_{nkI} X_{nlI} − i( X_{nkR} X_{nlI} + X_{nkI} X_{nlR} ) = (X_{nkR} − iX_{nkI})(X_{nlR} − iX_{nlI}) = X̄_{nk} X̄_{nl}

we have

∂E/∂W_{nkl} = −X̄_{nk} X̄_{nl} δ̄_n f̄'_n(z_n).
3.2.3 Error Terms
For each output unit n we have

δ_n = d_n − o_n.
For each hidden unit m we apply the chain rule to δ_m. Thus for the real component we have

δ_{mR} = −∂E/∂u_m = −Σ_k (∂E/∂u_k)(∂u_k/∂u_m) − Σ_k (∂E/∂v_k)(∂v_k/∂u_m)

δ_{mR} = −Σ_k (∂E/∂u_k)[ (∂u_k/∂x_k)(∂x_k/∂u_m) + (∂u_k/∂y_k)(∂y_k/∂u_m) ] − Σ_k (∂E/∂v_k)[ (∂v_k/∂x_k)(∂x_k/∂u_m) + (∂v_k/∂y_k)(∂y_k/∂u_m) ]

and for the imaginary component we have

δ_{mI} = −∂E/∂v_m = −Σ_k (∂E/∂u_k)(∂u_k/∂v_m) − Σ_k (∂E/∂v_k)(∂v_k/∂v_m)

δ_{mI} = −Σ_k (∂E/∂u_k)[ (∂u_k/∂x_k)(∂x_k/∂v_m) + (∂u_k/∂y_k)(∂y_k/∂v_m) ] − Σ_k (∂E/∂v_k)[ (∂v_k/∂x_k)(∂x_k/∂v_m) + (∂v_k/∂y_k)(∂y_k/∂v_m) ]
where here k ranges over the units that receive input from unit m. From the net-input function for each such unit

z_k = x_k + iy_k = Σ_j (u_j + iv_j)(W_{kjR} + iW_{kjI}) + Σ_j Σ_{l>j} (u_j + iv_j)(u_l + iv_l)(W_{kjlR} + iW_{kjlI})

we identify the following partial derivatives

∂x_k/∂u_m = W_{kmR} + Σ_{j≠m} ( W_{kjmR} X_{kjR} − W_{kjmI} X_{kjI} ) = a

∂y_k/∂u_m = W_{kmI} + Σ_{j≠m} ( W_{kjmR} X_{kjI} + W_{kjmI} X_{kjR} ) = b

∂x_k/∂v_m = −W_{kmI} + Σ_{j≠m} ( −W_{kjmR} X_{kjI} − W_{kjmI} X_{kjR} ) = c

∂y_k/∂v_m = W_{kmR} + Σ_{j≠m} ( W_{kjmR} X_{kjR} − W_{kjmI} X_{kjI} ) = d.
For the real component

δ_{mR} = Σ_k δ_{kR}[ (∂u_k/∂x_k) a + (∂u_k/∂y_k) b ] + Σ_k δ_{kI}[ (∂v_k/∂x_k) a + (∂v_k/∂y_k) b ]

and similarly for the imaginary component

δ_{mI} = Σ_k δ_{kR}[ (∂u_k/∂x_k)(−b) + (∂u_k/∂y_k) a ] + Σ_k δ_{kI}[ (∂v_k/∂x_k)(−b) + (∂v_k/∂y_k) a ].
Combining these equations we have

δ_m = δ_{mR} + iδ_{mI} = Σ_k δ_{kR}[ (∂u_k/∂x_k) a + (∂u_k/∂y_k) b ] + Σ_k δ_{kI}[ (∂v_k/∂x_k) a + (∂v_k/∂y_k) b ]
                       + Σ_k δ_{kR}[ i(∂u_k/∂x_k)(−b) + i(∂u_k/∂y_k) a ] + Σ_k δ_{kI}[ i(∂v_k/∂x_k)(−b) + i(∂v_k/∂y_k) a ].

Using the results of Section 3.2.2 we have

δ_m = Σ_k δ_{kR}( ∂u_k/∂x_k + i ∂u_k/∂y_k )(a − ib) + Σ_k δ_{kI}( ∂v_k/∂x_k + i ∂v_k/∂y_k )(a − ib)

δ_m = Σ_k f̄'_k(z_k) δ̄_k W̄_{km} + Σ_k f̄'_k(z_k) δ̄_k Σ_{j≠m} W̄_{kjm} X̄_{kj}.
Again, note that the term Σ_{j≠m} W̄_{kjm} X̄_{kj} is only correct if one makes the assumption that W_{kjm} and W_{kmj} identify the same unique second order weight.

Thus the algorithm is specified by four equations. The weight updates themselves are defined by

ΔW_{nk} = η X̄_{nk} f̄'_n(z_n) δ̄_n   (16)

and

ΔW_{nkl} = η X̄_{nk} X̄_{nl} f̄'_n(z_n) δ̄_n   (17)

where η is a real, positive learning rate. The error term for the output units is

δ_n = (d_n − o_n)   (18)

and for the hidden units we have

δ_m = Σ_k f̄'_k(z_k) δ̄_k W̄_{km} + Σ_k f̄'_k(z_k) δ̄_k Σ_{j≠m} W̄_{kjm} X̄_{kj}.   (19)
As ā b̄ is the conjugate of the product ab for complex numbers, and a = ā for real numbers, by making the substitution f̄'_i(z_i) δ̄_i = δ̄_{pi} we note that if we restrict our attention to real valued activations, weights and error signals, the real and complex weight update equations are identical (see [KA-01], Section 2, page 1282).
A Appendix
A.1 The Chain Rule
A.1.1 The Chain Rule for a Single Variable
y' = (f ∘ g)' = f'(g(x)) g'(x) = (f' ∘ g) · g'

u = g(x)

∂y/∂x = (∂y/∂u)(∂u/∂x).
A.1.2 The Chain Rule for Multiple Variables I

z = f(x, y),   x = g(t),   y = h(t)

∂z/∂t = (∂z/∂x)(∂x/∂t) + (∂z/∂y)(∂y/∂t).

A.1.3 The Chain Rule for Multiple Variables II

z = f(u, v),   u = g(x, y),   v = h(x, y)

∂z/∂x = (∂z/∂u)(∂u/∂x) + (∂z/∂v)(∂v/∂x)

∂z/∂y = (∂z/∂u)(∂u/∂y) + (∂z/∂v)(∂v/∂y)
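As a quick numerical illustration of the second form, with the concrete and arbitrarily chosen functions f(u, v) = uv, g(x, y) = x^2 + y and h(x, y) = sin(xy) (our own example, not part of the text above):

import math

def f(u, v): return u * v
def g(x, y): return x * x + y
def h(x, y): return math.sin(x * y)

x, y, eps = 0.7, -1.3, 1e-6

# dz/dx via the chain rule: (dz/du)(du/dx) + (dz/dv)(dv/dx)
u, v = g(x, y), h(x, y)
chain = v * (2 * x) + u * (y * math.cos(x * y))

# dz/dx via a finite difference on the composite function
direct = (f(g(x + eps, y), h(x + eps, y)) - f(u, v)) / eps

print(chain, direct)      # the two values agree to roughly 1e-6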
A.2 Complex Differentiation
Consider a complex valued function

f : C → C

defined by

f(x + iy) = u(x, y) + iv(x, y)

where u and v are real valued functions. If u and v have first partial derivatives with respect to x and y, and satisfy the Cauchy-Riemann equations

∂u/∂x = ∂v/∂y   (20)

and

∂u/∂y = −∂v/∂x   (21)

then f is complex differentiable (see [A-79]). In symbols

∂f/∂x = ∂u/∂x + i ∂v/∂x,   −i ∂f/∂y = −i ∂u/∂y + ∂v/∂y.
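For a particular activation function, such as tanh(z), the Cauchy-Riemann equations can be checked numerically at any point where the function is analytic; the following Python sketch (our own) does so with finite differences.

import cmath

def u(x, y): return cmath.tanh(complex(x, y)).real
def v(x, y): return cmath.tanh(complex(x, y)).imag

x0, y0, eps = 0.4, -0.9, 1e-6

du_dx = (u(x0 + eps, y0) - u(x0, y0)) / eps
du_dy = (u(x0, y0 + eps) - u(x0, y0)) / eps
dv_dx = (v(x0 + eps, y0) - v(x0, y0)) / eps
dv_dy = (v(x0, y0 + eps) - v(x0, y0)) / eps

print(du_dx, dv_dy)     # equation (20): du/dx =  dv/dy
print(du_dy, -dv_dx)    # equation (21): du/dy = -dv/dx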
References
[A-79] L. Ahlfors. Complex Analysis. International Series in Pure and Applied Mathematics. McGraw-Hill, 1979.
[HSW-89] K. Hornik, M. Stinchcombe, and H. White. Multilayer feedforward networks are universal approximators. Neural Networks, 2(5):359–366, 1989.
[KA-01] T. Kim and T. Adali. Complex backpropagation neural networks using elementary transcendental
activation functions. ICASSP, 2:1281–1284, 2001.
[KA-03] T. Kim and T. Adali. Approximation by fully complex multilayer perceptrons. Neural Computation, 15(7):1641–1666, 2003.
[MP-43] W. S. McCulloch and W. Pitts. A logical calculus of the ideas immanent in nervous activity. Bulletin of Mathematical Biophysics, 5:115–133, 1943.
[MP-88] M. Minsky and S. Papert. Perceptrons: An Introduction to Computational Geometry. MIT Press,
Cambridge, MA, expanded edition, 1988.
[RHW-87] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning internal representations by error
propagation. In D. E. Rumelhart, J. L. McClelland, et al., editors, Parallel Distributed Processing:
Volume 1: Foundations, pages 318–362. MIT Press, Cambridge, 1987.
[WG-95] J. Wray and G. G. R. Green. Neural networks, approximation theory, and finite precision computation. Neural Networks, 8(1):31–37, 1995.