Artificial Neural Networks
Steven Walczak
University of Colorado, Denver
Narciso Cerpa
University of Talca, Chile
I. Introduction to Artificial Neural Networks
II. Need for Guidelines
III. Input Variable Selection
IV. Learning Method Selection
V. Architecture Design
VI. Training Samples Selection
VII. Conclusions
GLOSSARY
Architecture The topology into which an artificial neural
network's processing elements or neurons are organized;
the elements can be interconnected in different ways.
Artificial neural network Model that emulates a biolog-
ical neural network using a reduced set of concepts
from a biological neural system.
Learning method Algorithm for training the artificial
neural network.
Processing element An artificial neuron that receives in-
put(s), processes the input(s), and delivers a single
output.
Summation function Computes the internal stimulation,
or activation level, of the artificial neuron.
Training sample Training cases that are used to adjust
the weights.
Transformation function A linear or nonlinear rela-
tionship between the internal activation level and the
output.
Weight The relative importance of each input to a
processing element.
ARTIFICIAL NEURAL NETWORKS (ANNS) have
been used to support applications across a variety of busi-
ness and scientific disciplines in recent years. These
computational models of neuronal activity in the brain are
defined and illustrated through some brief examples. Neu-
ral network designers typically perform extensive knowl-
edge engineering and incorporate a significant amount of
domain knowledge into ANNs. Once the input variables
present in the neural network’s input vector have been
selected, training data for these variables with known out-
put values must be acquired. Recent research has shown
that smaller training set sizes produce better performing
neural networks, especially for time-series applications.
Summarizing, this article presents an introduction to arti-
ficial neural networks and also a general heuristic method-
ology for designing high-quality ANN solutions to various
domain problems.
I. INTRODUCTION TO ARTIFICIAL
NEURAL NETWORKS
Artificial neural networks (sometimes just called neural
networks or connectionist models) provide a means for
dealing with complex pattern-oriented problems of both
categorization and time-series (trend analysis) types. The
nonparametric nature of neural networks enables models
to be developed without having any prior knowledge of the
distribution of the data population or possible interaction
effects between variables as required by commonly used
parametric statistical methods.
As an example, multiple regression requires that the er-
ror term of the regression equation be distributed normally
(with mean µ = 0) and also be homoscedastic. Another
statistical technique that is frequently used for perform-
ing categorization is discriminant analysis, but discrimi-
nant analysis requires that the predictor variables be mul-
tivariate normally distributed. Because such assumptions
are removed from ANN models, the ease of develop-
ing a domain problem solution is increased with artifi-
cial neural networks. Another factor contributing to the
success of ANN applications is their ability to create non-
linear models as well as traditional linear models and,
hence, artificial neural network solutions are applicable
across a wider range of problem types (both linear and
nonlinear).
FIGURE 1 Sample artificial neural network architecture (not all weights are shown).
In the following sections, a brief history of artificial
neural networks is presented. Next, a detailed examination
of the components of an artificial neural network model is
given with respect to the design of artificial neural network
models of business and scientific domain problems.
A. Biological Basis of Artificial
Neural Networks
Artificial neural networks are a technology based on stud-
ies of the brain and nervous system as depicted in Fig. 1.
These networks emulate a biological neural network but
they use a reduced set of concepts from biological neural
systems. Specifically, ANN models simulate the electri-
cal activity of the brain and nervous system. Processing
elements (also known as either a neurode or perceptron)
are connected to other processing elements. Typically the
neurodes are arranged in a layer or vector, with the out-
put of one layer serving as the input to the next layer
and possibly other layers. A neurode may be connected
to all or a subset of the neurodes in the subsequent layer,
with these connections simulating the synaptic connec-
tions of the brain. Weighted data signals entering a neurode
simulate the electrical excitation of a nerve cell and conse-
quently the transference of information within the network
or brain. The input values to a processing element, i_n, are
multiplied by a connection weight, w_n,m, that simulates the
strengthening of neural pathways in the brain. It is through
the adjustment of the connection strengths or weights that
learning is emulated in ANNs.
All of the weight-adjusted input values to a process-
ing element are then aggregated using a vector to scalar
function such as summation (i.e., y = Σ_i w_ij x_i), averaging,
input maximum, or mode value to produce a single input
value to the neurode. Once the input value is calculated,
the processing element then uses a transfer function to pro-
duce its output (and consequently the input signals for the
next processing layer). The transfer function transforms
the neurode’s input value. Typically this transformation
involves the use of a sigmoid, hyperbolic-tangent, or other
nonlinear function. The process is repeated between lay-
ers of processing elements until a final output value, o_n,
or vector of values is produced by the neural network.
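To make the computation of a single processing element concrete, the
following short Python sketch (illustrative only; the variable names and
values are hypothetical) combines a summation function with a sigmoid or
hyperbolic-tangent transfer function.

import numpy as np

def processing_element(inputs, weights, transfer="sigmoid"):
    # Summation function: internal activation = sum of weight-adjusted inputs.
    activation = np.dot(weights, inputs)
    # Transfer (transformation) function maps the activation to the output.
    if transfer == "sigmoid":
        return 1.0 / (1.0 + np.exp(-activation))   # output in (0, 1)
    if transfer == "tanh":
        return np.tanh(activation)                 # output in (-1, 1)
    return activation                              # linear (identity) transfer

# Example: one neurode with three weighted input signals (hypothetical values).
x = np.array([0.5, -1.2, 0.3])
w = np.array([0.8, 0.1, -0.4])
print(processing_element(x, w))

The output of this element would, in turn, serve as one of the weighted
input signals to the neurodes of the next layer.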
Theoretically, to simulate the asynchronous activity of
the human nervous system, the processing elements of the
artificial neural network should also be activated with the
weighted input signal in an asynchronous manner. Most
software and hardware implementations of artificial neu-
ral networks, however, implement a more discretized ap-
proach that guarantees that each processing element is
activated once for each presentation of a vector of input
values.
B. History and Resurgence of Artificial
Neural Networks
The idea of combining multiple processing elements into
a network is attributed to McCulloch and Pitts in the early
1940s, and Hebb, in 1949, is credited with being the first to
define a learning rule to explain the behavior of networks
of neurons. In the late 1950s, Rosenblatt developed the first
perceptron learning algorithm. Soon after Rosenblatt’s
discovery, Widrow and Hoff developed a similar learn-
ing rule for electronic circuits. Artificial neural network
research continued strongly throughout the 1960s.
In 1969, Minsky and Papert published their book,
Perceptrons, in which they showed the computational lim-
its of single-layer neural networks, which were the type
of artificial neural networks being used at that time. The
theoretical limitations of perceptron-like networks led to a
decrease in funding and subsequently research on artificial
neural networks.
Finally in 1986, McClelland and Rumelhart and the
PDP research group published the Parallel Distributed
Processing texts. These new texts published the back-
propagation learning algorithm, which enabled multiple
layers of perceptrons to be trained [and thus introduced
the hidden layer(s) to artificial neural networks], and was
the birth of MLPs (multiple layered perceptrons). Follow-
ing the discovery of MLPs and the backpropagation algo-
rithm, a revitalization of research and development efforts
in artificial neural networks took place.
In recent years, ANNs have been used to support
applications across a diversity of business and scientific
disciplines (e.g., financial, manufacturing, marketing,
telecommunications, and biomedical). This proliferation
of neural network applications has been facilitated by
the emergence of neural network shells (e.g., Brain-
maker, Neuralyst, Neuroshell, and Professional II Plus)
and tool add-ins (for SAS, MATLAB, and Excel) that
provide developers with the means for specifying the
ANN architecture and training the neural network. These
shells and add-in tools enable ANN developers to build
ANN solutions without requiring an in-depth knowl-
edge of ANN theory or terminology. Please see either
of these World Wide Web sites (active on December 31,
2000): http://www.faqs.org/faqs/ai-faq/neural-nets/part6/
or http://www.emsl.pnl.gov:2080/proj/neuron/neural/systems/software.html
for additional links to neural network
shell software available commercially.
Neural networks may use different learning algorithms
and we can classify them into two major categories based
on the input format: binary-valued input (i.e., 0s and 1s) or
continuous-valued input. These two categories can be sub-
divided into supervised learning and unsupervised learn-
ing. As mentioned above, supervised learning algorithms
use the difference between the desired and actual output
to adjust and finally determine the appropriate weights
for the ANN. In a variation of this approach some super-
vised learning algorithms are informed only whether the
output for the input is correct, and the network adjusts its
weights with the aim of achieving correct results. Hopfield net-
work (binary) and backpropagation (continuous) are ex-
amples of supervised learning algorithms. Unsupervised
learning algorithms only receive input stimuli and the net-
work organizes itself with the aim of having hidden pro-
cessing elements that respond differently to each set of
input stimuli. The network does not require information
on the correctness of the output. ART I (binary) and Koho-
nen (continuous) are examples of unsupervised learning
algorithms.
Neural network applications are frequently viewed as
black boxes that mystically determine complex patterns
in data. However, ANN designers must perform exten-
sive knowledge engineering and incorporate a significant
amount of domain knowledge into artificial neural net-
works. Successful artificial neural network development
requires a deep understanding of the steps involved in de-
signing ANNs.
634 Artificial Neural Networks
ANN design requires the developer to make many deci-
sions such as input values, training and test data set sizes,
learning algorithm, network architecture or topology, and
transformation function. Several of these decisions are de-
pendent on each other. For example, the ANN architecture
and the learning algorithm will determine the type of input
value (i.e., binary or continuous). Therefore, it is essen-
tial to follow a methodology or a well-defined sequence
of steps when designing ANNs. These steps are listed
below (a minimal end-to-end sketch follows the list):

• Determine data to use.
• Determine input variables.
• Separate data into training and test sets.
• Define the network architecture.
• Select a learning algorithm.
• Transform variables to network inputs.
• Train (repeat until ANN error is below an acceptable value).
• Test (on a hold-out sample to validate generalization of the ANN).
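As a rough illustration of these steps, the sketch below strings them
together using scikit-learn's MLPClassifier as a stand-in for a neural
network shell; the data set, variables, and parameter choices are
hypothetical, and the article does not prescribe any particular tool.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.neural_network import MLPClassifier

# Hypothetical data set: rows are cases, columns are candidate input variables.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))                  # determine data to use / input variables
y = (X[:, 0] + 0.5 * X[:, 2] > 0).astype(int)  # known output values

# Separate data into training and test (hold-out) sets.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Transform variables to network inputs (scale into [0, 1]).
scaler = MinMaxScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# Define architecture and learning algorithm: one hidden layer,
# backpropagation-style supervised training with a logistic transfer function.
ann = MLPClassifier(hidden_layer_sizes=(5,), activation="logistic",
                    max_iter=2000, random_state=0)
ann.fit(X_train, y_train)                      # train until error is acceptable

# Test on the hold-out sample to validate generalization.
print("hold-out accuracy:", ann.score(X_test, y_test))

In practice, each step (especially input variable selection and training
sample selection) is guided by the heuristics discussed in the following
sections.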
In the following sections we discuss the need for guide-
lines, and discuss heuristics for input variable selection,
learning method selection, architecture design, and train-
ing sample selection. Finally we conclude and summarize
a set of guidelines for ANN design.
II. NEED FOR GUIDELINES
Artificial neural networks have been applied to a wide
variety of business, engineering, medical, and scientific
problems. Several research results have shown that ANNs
outperform traditional statistical techniques (e.g., regres-
sion or logit) as well as other standard machine learning
techniques (e.g., the ID3 algorithm) for a large class of
problem types.
Many of these ANN applications such as financial time
series, e.g., foreign exchange rate forecasts, are difficult to
model. Artificial neural networks provide a valuable tool
for building nonlinear models of data, especially when
the underlying laws governing the system are unknown.
Artificial neural network forecasting models have outper-
formed both statistical and other machine learning models
of financial time series, achieving forecast accuracies of
more than 60% and thus are being widely used to model
the behavior of financial time series. Other categorization-
based applications of ANNs are achieving success rates
of well over 90%.
Development of effective neural network models is dif-
ficult. Most artificial neural network designers develop
multiple neural network solutions with regard to the net-
work’s architecture—quantity of nodes and arrangement
in hidden layers. Two critical design issues are still a chal-
lenge for artificial neural networks developers: selection
of appropriate input variables and capturing a sufficient
quantity of training examples to permit the neural network
to adequately model the application.
Many different types of ANN applications have been
developed in the past several years and are continuing to
be developed. Industrial applications exist in the financial,
manufacturing, marketing, telecommunications, biomed-
ical, and many other domains. While business managers
are seeking to develop new applications using ANNs, a
basic misunderstanding of the source of intelligence in an
ANN exists. As mentioned above, the development of new
ANN applications has been facilitated by the emergence
of a variety of neural network shells that allow anyone
to produce neural network systems by simply specify-
ing the ANN architecture and providing a set of train-
ing data to be used by the shell to train the ANN. These
shell-based neural networks may fail or produce subopti-
mal results unless a deeper understanding of how to use
and incorporate domain knowledge in the ANN is ob-
tained by the designers of ANNs in business and industrial
domains.
The traditional view of an ANN is of a program that
emulates biological neural networks and “learns” to rec-
ognize patterns or categorize input data by being trained
on a set of sample data from the domain. These programs
learn through training and subsequently have the ability to
generalize broad categories from specific examples. This
is the unique perceived source of intelligence in an ANN.
However, experienced ANN application designers typi-
cally perform extensive knowledge engineering and in-
corporate a significant amount of domain knowledge into
the design of ANNs even before the learning through train-
ing process has begun. The selection of the input variables
to be used by the ANN is quite a complex task, due to the
misconception that the more inputs a network is fed, the
more successful the results it produces. This is only true
if the information fed is critical to making the decisions;
however, noisy input variables commonly result in very
poor generalization performance.
Design of optimal neural networks is problematic in
that there exist a large number of alternative ANN physi-
cal architectures and learning methods, all of which may
be applied to a given domain problem. Selecting the ap-
propriate size of the training data set presents another chal-
lenge, since it implies direct and indirect costs, and it can
also affect the generalization performance.
A general heuristic or rule of thumb for the design of
neural networks in time-series domains is that the more
knowledge that is available to the neural network for form-
ing its model, the better the ultimate performance of the
neural network. A minimum of 2 years of training data
is considered to be a nominal starting point for financial
time series. Time-series models are considered to im-
prove as more data are incorporated into the modeling
process. Research has indicated that currency exchange
rates have a long-term memory, implying that larger peri-
ods of time (data) will produce more comprehensive mod-
els and produce better generalization. However, this has
been challenged in recent research and will be discussed
in Section VI.
Neural network researchers have built forecasting and
trading systems with training data from 1 to 16 years,
including various training set sizes in between the two
extremes. However, researchers typically use all of the
data in building the neural network forecasting model,
with no attempt at comparing data quantity effects on the
quality of the produced forecasting models.
In this article, a set of guidelines for incorporating
knowledge into an ANN and using domain knowledge
to design optimal ANNs is described. The guidelines for
designing ANNs are made up of the following steps:
knowledge-based selection of input values, selection of a
learning method, architecture design, and training sample
selection. The majority of the ANN design steps described
will focus mainly on feed-forward supervised learning
(and more specifically backpropagation) ANN applica-
tions. Following these guidelines will enable developers
and researchers to take advantage of the power of ANNs
and will afford economic benefit by producing an ANN
that outperforms similar ANNs with improperly specified
design parameters.
Artificial neural network designers must determine the
optimal set of design criteria specified as follows:
• Appropriate input (independent) variables.
• Best learning method: Learning methods can be
  classified into either supervised or unsupervised
  learning methods. Within these learning methods
  there are many alternatives, each of which is
  appropriate for different distributions or types of data.
• Appropriate architecture: The number of hidden
  layers, depending on the selected learning method; the
  quantity of processing elements (nodes) per hidden
  layer.
• Appropriate amount of training data: Time-series and
  classification problems.
The designer’s choices for these design criteria will affect
the performance of the resulting ANN on out-of-sample
data. Inappropriate selection of the values for these design
factors may produce ANN applications that perform worse
than random selection of an output (dependent) value.
III. INPUT VARIABLE SELECTION
The generalization performance of supervised learning
artificial neural networks (e.g., backpropagation) usually
improves when the network size is minimized with respect
to the weighted connections between processing nodes
(elements of the input, hidden, and output layers). ANNs
that are too large tend to overfit or memorize the input
data. Conversely, ANNs with too few weighted connec-
tions do not contain enough processing elements to cor-
rectly model the input data set, underfitting the data. Both
of these situations result in poor out-of-sample general-
ization.
Therefore, when developing supervised learning neural
networks (e.g., backpropagation, radial basis function, or
fuzzy ARTMAP), the developer must determine what in-
put variables should be selected to accurately model the
domain.
ANN designers must spend a significant amount of time
performing the task of knowledge acquisition to avoid the
fact that “garbage in, garbage out” also applies to ANN
applications. ANNs as well as other artificial intelligence
(AI) techniques are highly dependent on the specification
of input variables. However, ANN designers tend to mis-
specify input variables.
Input variable misspecification occurs because ANN
designers follow the expert system approach of incor-
porating as much domain knowledge as possible into an
intelligent system. The common belief is that ANN perfor-
mance improves as additional domain knowledge is pro-
vided through the input variables. This belief is partially
correct, because if a sufficient
amount of information representing critical decision cri-
teria is not given to an ANN, it cannot develop a correct
model of the domain. Most ANN designers believe that
since ANNs learn, they will be able to determine those
input variables that are important and develop a corre-
sponding model through the modification of the weights
associated with the connections between the input layer
and the hidden layers.
Noise input variables produce poor generalization per-
formance in ANNs. The presence of too many input vari-
ables causes poor generalization when the ANN not only
models the true predictors, but also includes the noise vari-
ables in the model. Interaction between input variables
produces critical differences in output values, further ob-
scuring the ideal problem model when unnecessary vari-
ables are included in the set of input values.
As indicated above and shown in the following sec-
tions, both under- and overspecification of input variables
produce suboptimal performance. The following section
describes the guidelines for selecting input (independent)
variables for an ANN solution to a domain problem.
A. Determination of Input Variables
Two approaches exist regarding the selection of input pa-
rameter variables for supervised learning neural networks.
In the first approach, it is thought that since a neural net-
work that utilizes supervised training will adjust its con-
nection weights to better approximate the desired output
values, then all possible domain-relevant variables should
be given to the neural network as input values. The idea is
that the connection weights that indicate the contribution
of nonsignificant variables will approach zero and thus
effectively eliminate any effect on the output value from
these variables:

    lim_{t→∞} ε_t ⇒ 0,

where ε is the error term of the neural network and t is the
number of training iterations.
The second approach emphasizes the fact that the
weighted connections never achieve a value of true zero
and thus there will always be some contribution to the out-
put value of the neural network by all of the input variables.
Hence, ANN designers must research domain variables to
determine their potential contribution to the desired out-
put values. Selection of input variables for neural networks
is a complex, but necessary task. Selection of irrelevant
variables may cause output value fluctuations of up to 7%.
Designers should determine applicability through knowl-
edge acquisition of experts in the domain, similar to expert
systems development. Highly correlated variables should
be removed from the input vector because they can mul-
tiply the effect of those variables and consequently cause
noise in the output values. This process should produce
an expert-specified set of significant variables that are not
intercorrelated, and which will yield the optimal perfor-
mance for supervised learning neural networks.
The first step in determining the optimal set of input
variables is to perform standard knowledge acquisition.
Typically, this involves consultation with multiple domain
experts. Various researchers have indicated the require-
ment for extensive knowledge acquisition utilizing do-
main experts to specify ANN input variables. The primary
purpose of the knowledge acquisition phase is to guarantee
that the input variable set is not underspecified, providing
all relevant domain criteria to the ANN.
Once a base set of input variables is defined through
knowledge acquisition, the set can be pruned to elimi-
nate variables that contribute noise to the ANN and con-
sequently reduce the ANN generalization performance.
ANN input variables need to be predictive, but should
not be correlated. Correlated variables degrade ANN per-
formance by interacting with each other as well as other
elements to produce a biased effect. The designer should
calculate the correlation of pairs of variables—Pearson
correlation matrix—to identify “noise” variables. If two
variables have a high correlation, then one of these two
variables may be removed from the set of variables with-
out adversely affecting the ANN performance. Alterna-
tively, a chi-square test may be used for categorical vari-
ables. The cutoff value for variable elimination is an
arbitrary value and must be determined separately for ev-
ery ANN application, but any correlation absolute value
of 0.20 or higher indicates a probable noise source to the
ANN.
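A sketch of this correlation filter, using the 0.20 absolute-value cutoff
mentioned above, is shown below; the candidate variables and data are
hypothetical, and the cutoff should be tuned for each application.

import pandas as pd
import numpy as np

def prune_correlated_inputs(df, cutoff=0.20):
    # Flag one variable from each pair whose absolute Pearson correlation
    # meets or exceeds the cutoff (keep the first, drop the second).
    corr = df.corr().abs()
    to_drop = set()
    cols = corr.columns
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            if corr.iloc[i, j] >= cutoff:
                to_drop.add(cols[j])
    return df.drop(columns=sorted(to_drop)), sorted(to_drop)

# Hypothetical candidate input variables produced by knowledge acquisition.
rng = np.random.default_rng(1)
base = rng.normal(size=200)
candidates = pd.DataFrame({
    "rate_today": base,
    "rate_yesterday": base + 0.05 * rng.normal(size=200),  # nearly duplicates rate_today
    "volume": rng.normal(size=200),
})
pruned, dropped = prune_correlated_inputs(candidates)
print("dropped as probable noise sources:", dropped)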
Additional statistical techniques may be applied, de-
pending on the distribution properties of the data set. Step-
wise multiple or logistic regression and factor analysis
provide viable tools for evaluating the predictive value
of input variables and may serve as a secondary filter to
the Pearson correlation matrix. Multiple regression and
factor analysis perform best with normally distributed lin-
ear data, while logistic regression assumes a curvilinear
relationship.
Several researchers have shown that smaller input vari-
able sets can produce better generalization performance
by an ANN. As mentioned above, high correlation values
of variables that share a common element need to be dis-
regarded. Smaller input variable sets frequently improve
the ANN generalization performance and reduce the net
cost of data acquisition for development and usage of the
ANN. However, care must be taken when removing vari-
ables from the ANN’s input set to ensure that a complete
set of noncorrelated predictor variables is available for
the ANN, otherwise the reduced variable sets may worsen
generalization performance.
IV. LEARNING METHOD SELECTION
After determining a heuristically optimal set of input vari-
ables using the methods from the previous section, an
ANN learning method must be selected. The learning
method is what enables the ANNs to correctly model cate-
gorization and time-series problems. Artificial neural net-
work learning methods can be divided into two distinct
categories: unsupervised learning and supervised learn-
ing. Both unsupervised and supervised learning methods
require a collection of training examples that enable the
ANN to model the data set and produce accurate output
values.
Unsupervised learning systems, such as adaptive res-
onance theory (ART), self-organizing map (SOM, also
called Kohonen networks), or Hopfield networks, do not
require that the output value for a training sample be pro-
vided at the time of training.
Supervised learning systems, such as backpropagation
(MLP), radial basis function (RBF), counterpropagation,
or fuzzy ARTMAP networks, require that a known output
value for all training samples be provided to the ANN.
FIGURE 2 Kohonen layer (12-node) learning of a square.
Unsupervised learning methods determine output val-
ues directly from the input variable data set. Most un-
supervised learning methods have less computational
complexity and less generalization accuracy than super-
vised methods, because the answers must be contained
within or directly learned from the input values. Hence,
unsupervised learning techniques are typically used for
classification problems, where the desired classes are self-
descriptive. For example, the ART algorithm is a good
technique to use for performing object recognition in pic-
torial or graphical data. An example of a problem that has
been solved with ART-based ANNs is the recognition of
hand-written numerals. The hand-written numerals 0–9
are each unique, although in some cases similar (for exam-
ple, 1 and 7, or 3 and 8), and define the pattern to be learned:
the shapes of the numerals 0–9. The advantage of using
unsupervised learning methods is that these ANNs can
be designed to learn much more rapidly than supervised
learning systems.
A. Unsupervised Learning
The unsupervised learning algorithms—ART, SOM
(Kohonen), and Hopfield—form categories based on the
input data. Typically, this requires a presentation of each of
the training examples to the unsupervised learning ANN.
Distinct categories of the input vector are formed and re-
formed as new input examples are presented to the ANN.
The ART learning algorithm establishes a category for
the initial training example. As additional examples are
presented to the ART-based ANN, new categories are
formed based on how closely the new example matches
one of the existing categories with respect to both negative
inhibition and positive excitation of the neurodes in the
network. As a worst case, an ART-trained ANN may pro-
duce M distinct categories for M input examples. When
building ART-based networks, the architecture of the net-
work is given explicitly by the quantity of input values
and the desired number of categories (output values). The
hidden layer, usually called the F1 layer, is the same
size as the input layer and serves as the feature detector
for the categories. The output or F2 layer is defined by the
quantity of categories to be defined.
SOM-trained networks are composed of a Kohonen
layer of neurodes that are two dimensional as opposed to
the vector alignments of most other ANNs. The collection
of neurodes (also called the grid) maps input values onto
the grid of neurodes to preserve order, which means that
two input values that are close together will be mapped to
the same neurode. The Kohonen grid is connected to both
an input and output layer. As training progresses, the neu-
rodes in the grid attempt to approximate the feature space
of the input by adjusting the collection of values mapped
onto each neurode. A graphical example of the learn-
ing process in the Kohonen layer of the SOM is
shown in Fig. 2, which is a grid of 12 neurodes (3 × 4) that
is trying to learn the category of a hollow square object.
Figures 2a–d represent the two-dimensional coordinates
of each of the 12 Kohonen-layer processing elements.
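The following simplified numpy sketch mimics the Fig. 2 example: a 3 × 4
Kohonen layer whose neurode weights gradually move toward points sampled
from the outline of a square. The learning-rate schedule and neighborhood
rule are illustrative simplifications, not the full SOM algorithm.

import numpy as np

rng = np.random.default_rng(0)
grid_shape = (3, 4)                                   # 12 neurodes, as in Fig. 2
weights = rng.uniform(0, 1, size=grid_shape + (2,))  # each neurode holds a 2-D coordinate

def sample_square_outline():
    # Random point on the outline of the unit square (the pattern to learn).
    t = rng.uniform(0, 1)
    side = rng.integers(4)
    return np.array({0: (t, 0.0), 1: (t, 1.0), 2: (0.0, t), 3: (1.0, t)}[side])

for step in range(2000):
    x = sample_square_outline()
    # Winner: the neurode whose weight vector is closest to the input.
    dists = np.linalg.norm(weights - x, axis=-1)
    wi, wj = np.unravel_index(np.argmin(dists), grid_shape)
    lr = 0.5 * (1 - step / 2000)                      # decaying learning rate
    for i in range(grid_shape[0]):
        for j in range(grid_shape[1]):
            grid_dist = abs(i - wi) + abs(j - wj)
            if grid_dist <= 1:                        # update winner and immediate neighbors
                weights[i, j] += lr * np.exp(-grid_dist) * (x - weights[i, j])

print(np.round(weights, 2))   # neurode coordinates now roughly trace the square outline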
The Hopfield training algorithm is similar in nature to
the ART training algorithm. Both require a hidden layer
(in this case called the Hopfield layer as opposed to an
F1 layer for ART-based ANNs) that is the same size as
the input layer. The Hopfield algorithm is based on spin
glass physics and views the state of the network as an en-
ergy surface. Both SOM and Hopfield trained ANNs have
been used to solve traveling salesman problems in addition
to the more traditional image processing of unsupervised
learning ANNs. Hopfield ANNs are also used for opti-
mization problems. A difficulty with Hopfield ANNs is the
capacity of the network, which is estimated at n/(4 ln n),
where n is the number of neurodes in the Hopfield
layer.
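As a quick check of this capacity estimate, the computation below evaluates
n/(4 ln n) for a few Hopfield-layer sizes.

import math

for n in (25, 100, 1000):
    capacity = n / (4 * math.log(n))
    print(f"n = {n:5d} neurodes  ->  ~{capacity:.1f} storable patterns")
# n = 100 gives roughly 100 / (4 * 4.605), or about 5.4 patterns.

Thus a 100-neurode Hopfield layer can reliably store only on the order of
five distinct patterns.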
B. Supervised Learning
The backpropagation learning algorithm is one of the most
popular design choices for implementing ANNs, since
this algorithm is available and supported by most com-
mercial neural network shells and is based on a very ro-
bust paradigm. Backpropagation-trained ANNs have been
shown to be universal approximators, and they are able
to learn arbitrary category mappings. Various researchers
have supported this finding and shown the superiority of
backpropagation-trained ANNs to different ANN learning
paradigms including radial basis function (RBF), coun-
terpropagation, and fuzzy adaptive resonance theory. An
ANN’s performance has been found to be more dependent
on data representation than on the selection of a learning
rule. Learning rules other than backpropagation perform
well if the data from the domain have specific proper-
ties. The mathematical specifications of the various ANN
learning methods described in this section are available in
the reference articles and books given at the end of this
article.
Backpropagation is the superior learning method when
a sufficient number of noise/error-free training examples
exist, regardless of the complexity of the specific domain
problem. Backpropagation ANNs can handle noise in the
training data and they may actually generalize better if
some noise is present in the training data. However, too
many erroneous training values may prevent the ANN
from learning the desired model.
For ANN applications that provide only a few train-
ing examples or very noisy training data, other super-
vised learning methods should be selected. RBF networks
perform well in domains with limited training sets and
counterpropagation networks perform well when a suffi-
cient number of training examples is available, but may
contain very noisy data. For resource allocation problems
(configuration), backpropagation produced the best results,
although the first appearance of the problem indicated
that counterpropagation might outperform backpropaga-
tion due to anticipated noise in the training data set. Hence,
although properties of the data population may strongly
indicate the preference of a particular training method,
because of the strength of the backpropagation network,
this type of learning method should always be tried in ad-
dition to any other methods prescribed by domain data
tendencies.
Domains that have a large collection of relatively error-
free historical examples with known outcomes suit back-
propagation ANN implementations. Both the ART and
RBF ANNs have worse performance than the back-
propagation ANN performance for this specific domain
problem.
Many other ANN learning methods exist and each is
subject to constraints on the type of data that is best pro-
cessed by that specific learning method. For example, gen-
eral regression neural networks are capable of solving any
problem that can also be solved by a statistical regression
model, but do not require that a specific model type (e.g.,
multiple linear or logistic) be specified in advance. How-
ever, regression ANNs suffer from the same constraints as
regression models, such as the linear or curvilinear rela-
tionship of the data with heteroscedastic error. Likewise,
learning vector quantization (LVQ) networks try to divide
input values into disjoint categories similar to discrimi-
nant analysis and consequently have the same data dis-
tribution requirements as discriminant analysis. Research
using resource allocation problems has indicated that LVQ
neural networks produced the second best allocation re-
sults, which led to the previously unrecognized insight that
the categories used for allocating resources were unique.
To summarize, backpropagation MLP networks are
usually implemented due to their robust and genera-
lized problem-solving capabilities. General regression
networks are implemented to simulate the statistical
regression models. Radial basis function networks are im-
plemented to resolve domain problems having a partial
sample or a training data set that is too small. Both coun-
terpropagation and fuzzy ARTMAP networks are imple-
mented to resolve the difficulty of extremely noisy training
data. The combination of unsupervised (clustering and
ART) learning techniques with supervised learning may
improve the performance of neural networks in the noisy
domains. Finally, learning vector quantization networks
are implemented to exploit the potential for unique deci-
sion criteria of disjoint sets.
The selection of a learning method is an open problem
and ANN designers must use the constraints of the train-
ing data set for determining the optimal learning method.
If reasonably large quantities of relatively noise-free train-
ing examples are available, then backpropagation provides
an effective learning method, which is relatively easy to
implement.
V. ARCHITECTURE DESIGN
The architecture of an ANN consists of the number of
layers of processing elements or nodes, including input,
output, and any hidden layers, and the quantity of nodes
contained in each layer. Selection of input variables (i.e.,
input vector) was discussed in Section III, and the output
vector is normally predefined by the problem to be solved
with the ANN. Design of hidden layers is dependent on the
selected learning algorithm (discussed in Section IV). For
example, unsupervised learning methods such as ART nor-
mally require a first hidden layer quantity of nodes equal
to the size of the input layer. Supervised learning sys-
tems are generally more flexible in the design of hidden
layers. The remaining discussion focuses on backpropa-
gation ANN systems or other similar supervised learning
ANNs. The designer should determine the following as-
pects regarding the hidden layers of the ANN architecture:
(1) number of hidden layers and (2) number of nodes in
the hidden layer(s).
A. Number of Hidden Layers
It is possible to design an ANN with no hidden layers,
but these types of ANNs can only classify input data that
is linearly separable, which severely limits their applica-
tion. Artificial neural networks that contain hidden layers
have the ability to deal robustly with nonlinear and com-
plex problems and therefore can operate on more inter-
esting problems. The quantity of hidden layers is asso-
ciated with the complexity of the domain problem to be
solved. ANNs with a single hidden layer create a hyper-
plane. ANNs with two hidden layers combine hyperplanes
to form convex decision areas, and ANNs with three hidden
layers combine convex decision areas to form decision
areas that may contain concave regions. The convexity or
concavity of a decision region corresponds roughly to the
number of unique inferences or abstractions that are per-
formed on the input variables to produce the desired output
result.
Increasing the number of hidden unit layers enables
a trade-off between smoothness and closeness-of-fit. A
greater quantity of hidden layers enables an ANN to im-
prove its closeness-of-fit, while a smaller quantity im-
proves the smoothness or extrapolation capabilities of the
ANN.
Several researchers have indicated that a single hid-
den layer architecture, with an arbitrarily large quantity of
hidden nodes in the single layer, is capable of modeling
any categorization mapping. On the other hand two hid-
den layer networks outperform their single hidden layer
counterparts for specific problems. A heuristic for deter-
mining the quantity of hidden layers required by an ANN
is as follows: “As the dimensionality of the problem space
increases—higher order problems—the number of hidden
layers should increase correspondingly.”
The number of hidden layers is heuristically set by de-
termining the number of intermediate steps, dependent on
previous categorizations, required to translate the input
variables into an output value. Therefore, domain prob-
lems that have a standard nonlinear equation solution are
solvable by a single hidden layer ANN.
B. Number of Nodes per Hidden Layer
When choosing the number of nodes to be contained in
a hidden layer, there is a trade-off between training time
and the accuracy of training. A greater number of hidden
unit nodes results in a longer (slower) training period,
while fewer hidden units provide shorter (faster) training,
but at the cost of having fewer feature detectors. Too many
hidden nodes in an ANN enable it to memorize the training
data set, which produces poor generalization performance.
Some of the heuristics used for selecting the quantity of
hidden nodes for an ANN are (a small worked example
follows the list):

• 75 percent of the quantity of input nodes,
• 50 percent of the quantity of input and output nodes, or
• 2n + 1 hidden layer nodes, where n is the number of
  nodes in the input layer.
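These rules of thumb are easy to compute, as the illustrative sketch below
shows for a hypothetical network with 10 input nodes and 1 output node.

def hidden_node_heuristics(n_inputs, n_outputs):
    # The three rule-of-thumb hidden-node counts listed above (illustrative only).
    return {
        "75% of input nodes": round(0.75 * n_inputs),
        "50% of input + output nodes": round(0.50 * (n_inputs + n_outputs)),
        "2n + 1 (n = input nodes)": 2 * n_inputs + 1,
    }

# Example: 10 input variables, 1 output value -> suggestions of 8, 6, and 21 hidden nodes.
print(hidden_node_heuristics(10, 1))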
These algorithmic heuristics do not utilize domain knowl-
edge for estimating the quantity of hidden nodes and may
be counterproductive.
As with the knowledge acquisition and elimination of
correlated input variables heuristic for defining the opti-
mal input node set, the number of decision factors (DFs)
heuristically determines the optimal number of hidden
units for an ANN. Knowledge acquisition or existing
knowledge bases may be used to determine the DFs for a
particular domain and consequently the hidden layer ar-
chitecture and optimal quantity of hidden nodes. Decision
factors are the separable elements that help to form the
unique categories of the input vector space. The DFs are
comparable to the collection of heuristic production rules
used in an expert system.
An example of the DF design principle is provided by
the NETTalk neural network research project. NETTalk
has 203 input nodes representing seven textual characters,
and 33 output units representing the phonetic notation of
the spoken text words. Hidden units are varied from 0 to
120. NETTalk improved output accuracy as the number of
hidden units was increased from 0 to 120, but only a min-
imal improvement in the output accuracy was observed
between 60 and 120 hidden units. This indicates that the
ideal quantity of DFs for the NETTalk problem was around
60; adding hidden units beyond 60 increased the training
time, but did not provide any appreciable difference in the
ANN’s performance.
Several researchers have found that ANNs perform
poorly until a sufficient number of hidden units is avail-
able to represent the correlations between the input vec-
tor and the desired output values. Increasing the num-
ber of hidden units beyond the sufficient number served
to increase training time without a corresponding in-
crease in output accuracy. Knowledge acquisition is nec-
essary to determine the optimal input variable set to be
used in an ANN system. During the knowledge acqui-
sition phase, additional knowledge engineering can be
performed to determine the DFs and subsequently the
minimum number of hidden units required by the ANN
architecture. The ANN designer must acquire the heuris-
tic rules or clustering methods used by domain experts,
similar to the knowledge that must be acquired dur-
ing the knowledge acquisition process for expert sys-
tems. The number of heuristic rules or clusters used
by domain experts is equivalent to the DFs used in the
domain.
Researchers have explored and shown techniques for
automatically producing an ANN architecture with the
exact number of hidden units required to model the DFs for
the problem space. The approach used by these automatic
methods consists of three steps:
1. Initially create a neural network architecture with a
very small or very large number of hidden units.
2. Train the network for some predetermined number of
epochs.
3. Evaluate the error of the output nodes.
If the error exceeds a set threshold value, then a hidden
unit is added or deleted, respectively, and the process is re-
peated until the error term is less than the threshold value.
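A hedged, library-neutral sketch of the growing variant of this approach is
given below; train_fn and eval_fn are hypothetical callables standing in
for whatever ANN shell or library is used, and the thresholds are
placeholders.

def search_hidden_units(train_fn, eval_fn, error_threshold,
                        start_units=1, max_units=50, epochs=100):
    # Sketch of the automatic approach: grow the hidden layer until the
    # output error falls below the threshold.
    #   train_fn(n_hidden, epochs) -> trained model (hypothetical callable)
    #   eval_fn(model)             -> output-node error (hypothetical callable)
    n_hidden = start_units
    model = None
    while n_hidden <= max_units:
        model = train_fn(n_hidden, epochs)   # step 2: train for a set number of epochs
        error = eval_fn(model)               # step 3: evaluate the output error
        if error <= error_threshold:
            return model, n_hidden           # error acceptable: stop growing
        n_hidden += 1                        # otherwise add a hidden unit and repeat
    return model, n_hidden - 1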
Another method to automatically determine the optimum
architecture is to use genetic algorithms to generate mul-
tiple ANN architectures and select the architectures with
the best performance. Determining the optimum number
of hidden units for an ANN application is a very com-
plex problem, and an accurate method for automatically
determining the DF quantity of hidden units without per-
forming the corresponding knowledge acquisition remains
a current research topic.
In this section, the heuristic architecture design princi-
ple of acquiring decision factors to determine the quantity
of hidden nodes and the configuration of hidden layers
has been presented. A number of hidden nodes equal to
the number of the DFs is required by an ANN to perform
robustly in a domain and produce accurate results. This
concept is similar to the principle of a minimum size in-
put vector determined through knowledge acquisition pre-
sented in Section III. The knowledge acquisition process
for ANN designers must acquire the heuristic decision
rules or clustering methods of domain experts. The DFs
for a domain are equivalent to the heuristic decision rules
used by domain experts. Further analysis of the DFs to de-
termine the dimensionality of the problem space enables
the knowledge engineer to configure the hidden nodes into
the optimal number of hidden layers for efficient modeling
of the problem space.
VI. TRAINING SAMPLES SELECTION
Acquisition of training data has direct costs associated
with the data themselves, and indirect costs due to the
fact that larger training sets require a larger quantity of
training epochs to optimize the neural network’s learning.
The common belief is that the generalization performance
of a neural network will increase when larger quantities
of training samples are used to train the neural network,
especially for time-series applications of neural networks.
Based on this belief, the neural network designer must
acquire as much data as possible to ensure the optimal
learning of a neural network.
A “rule of thumb” lower bound on the number of train-
ing examples required to train a backpropagation ANN is
four times the number of weighted connections contained
in the network. Therefore, if a training database contains
only 100 training examples, the maximum size of the ANN
is 25 connections or approximately 10 nodes depending
on the ANN architecture. While the general heuristic of
four times the number of connections is applicable to most
classification problems, time-series problems, including
the prediction of financial time series (e.g., stock values),
are more dependent on business cycles. Recent research
has conclusively shown that a maximum of 1 or 2 years
of data is all that is required to produce optimal fore-
casting results for ANNs performing financial time-series
prediction.
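The rule of thumb is easy to apply once the number of weighted connections
is known, as in the small sketch below for a hypothetical 8-5-1
feed-forward architecture.

def weighted_connections(layer_sizes):
    # Number of weighted connections in a fully connected feed-forward ANN.
    return sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))

architecture = [8, 5, 1]                 # 8 inputs, 5 hidden nodes, 1 output (hypothetical)
connections = weighted_connections(architecture)
print(connections, "connections ->", 4 * connections,
      "training examples (rule-of-thumb minimum)")
# 8*5 + 5*1 = 45 connections, so roughly 180 training examples.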
Another issue to be considered during training sample
selection is how well the samples in the training set model
the real world. If training samples are skewed such that
they only cover a small portion of the possible real-world
instances that a neural network will be asked to classify
or predict, then the neural network can only learn how to
classify or predict results for this subset of the domain.
Therefore, developers should take care to ensure that their
training set samples have a similar distribution to the do-
main in which the neural network must operate.
Artificial neural network training sets should be rep-
resentative of the population-at-large. This indicates that
categorization-based ANNs require at least one example
of each category to be classified and that the distribu-
tion of training data should approximate the distribution
of the population at large. A small amount of additional
examples from each category will help to improve the
generalization performance of the ANN. Thus a catego-
rization ANN trying to classify items into one of seven
categories with distributions of 5, 10, 10, 15, 15, 20, and
25% would need a minimum of 20 training examples, but
would benefit by having 40–100 training examples. Time-
series domain problems are dependent on the distribution
of the time series, with the neural network normally re-
quiring one complete cycle of data. Again, recent research
in financial time series has demonstrated that 1- and 2-year
cycle times are prevalent and thus the minimum required
training data for a financial time-series ANN would be
from 1 to 2 years of training examples.
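The minimum training set size for the seven-category example above follows
directly from the rarest category's share of the population, as the short
calculation below illustrates.

import math

distribution = [0.05, 0.10, 0.10, 0.15, 0.15, 0.20, 0.25]   # the seven categories above

# At least one example of the rarest category implies a minimum total set size.
minimum = math.ceil(1 / min(distribution))
print("minimum training examples:", minimum)                     # 20

# A few extra examples per category improves generalization performance.
print("more comfortable range:", 2 * minimum, "-", 5 * minimum)  # 40 - 100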
Based on these more recent findings we suggest that
neural network developers should use an iterative ap-
proach to training. Starting with a small quantity of train-
ing data, train the neural network and then increase the
quantity of samples in the training data set and repeat
training until a decrease in performance occurs.
Development of optimal neural networks is a difficult
and complex task. Limiting both the set of input variables
to those that are thought to be predictive and the training
set size increases the probability of developing robust and
highly accurate neural network models.
Most neural network models of financial time series are
homogeneous. Homogeneous models utilize data from the
specific time series being forecast or directly obtainable
from that time series (e.g., a k-day trend or moving av-
erage). Heterogeneous models utilize information from
outside the time series in addition to the time series itself.
Homogeneous models rely on the predictive capabilities
of the time series itself, corresponding to a technical anal-
ysis as opposed to a fundamental analysis.
Most neural network forecasting in the capital markets
produces an output value that is the future price or ex-
change rate. Measuring the mean standard error of these
neural networks may produce misleading evaluations of
the neural networks’ capabilities, since even very small
errors that are incorrect in the direction of change will
result in a capital loss. Instead of measuring the mean
standard error of a forecast, some researchers argue that
a better method for measuring the performance of neural
networks is to analyze the direction of change. The direc-
tion of change is calculated by subtracting today’s price
from the forecast price and determining the sign (positive
or negative) of the result. The percentage of correct direc-
tion of change forecasts is equivalent to the percentage of
profitable trades enabled by the ANN system.
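A minimal sketch of this direction-of-change evaluation is shown below; the
price series are hypothetical.

import numpy as np

def direction_of_change_accuracy(today, forecast, actual):
    # Fraction of forecasts whose sign of change (forecast - today)
    # matches the sign of the realized change (actual - today).
    predicted_sign = np.sign(forecast - today)
    actual_sign = np.sign(actual - today)
    return float(np.mean(predicted_sign == actual_sign))

# Hypothetical closing prices for five days.
today    = np.array([1.10, 1.12, 1.11, 1.15, 1.14])
forecast = np.array([1.11, 1.11, 1.13, 1.16, 1.12])   # next-day forecasts
actual   = np.array([1.12, 1.10, 1.12, 1.14, 1.13])   # realized next-day prices
print("direction-of-change accuracy:",
      direction_of_change_accuracy(today, forecast, actual))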
The effect on the quality of the neural network model
forecasting outputs achieved from the quantities of train-
ing data has been called the “time-series (TS) recency
effect.” The TS recency effect states that for time-series
data, model construction data that are closer in time to
the values to be forecast produce better forecasting mod-
els. This effect is similar to the concept of a random walk
model that assumes future values are only affected by the
previous time period’s value, but able to use a wider range
of proximal data for formulating the forecasts.
Requirements for training or modeling knowledge were
investigated when building nonlinear financial time-series
forecasting models with neural networks. Homogeneous
neural network forecasting models were developed for
trading the U.S. dollar against various other foreign cur-
rencies (i.e., dollar/pound, dollar/mark, dollar/yen). Var-
ious training sets were used, ranging from 22 years to
1 year of historic training data. The differences between
the neural network models for a specific currency ex-
isted only in the quantity of training data used to develop
each time-series forecasting model. The researchers crit-
ically examined the qualitative effect of training set size
on neural network foreign exchange rate forecasting mod-
els. Training data sets of up to 22 years of data are used
to predict 1-day future spot rates for several nominal ex-
change rates. Multiple neural network forecasting models
for each exchange rate forecasting model were trained on
incrementally larger quantities of training data. The re-
sulting outputs were used to empirically evaluate whether
neural network exchange rate forecasting models achieve
optimal performance in the presence of a critical amount
of data used to train the network. Once this critical quan-
tity of data is obtained, addition of more training data does
not improve and may, in fact, hinder the forecasting per-
formance of the neural network forecasting model. For
most exchange rate predictions, a maximum of 2 years of
training data produces the best neural network forecast-
ing model performance. Hence, this finding leads to the
induction of the empirical hypothesis for a time-series re-
cency effect. The TS recency effect can be summarized in
the following statement: “The use of data that are closer
in time to the data that are to be forecast by the model
produces a higher quality model.”
The TS recency effect provides several direct benefits
for both neural network researchers and developers:
• A new paradigm for choosing training samples for
  producing a time-series model
• Higher quality models, by having better forecasting
  performance through the use of smaller quantities of
  data
• Lower development costs for neural network
  time-series models, because fewer training data are
  required
• Less development time, because smaller training set
  sizes typically require fewer training iterations to
  accurately model the training data.
The time-series recency effect refutes existing heuris-
tics and is a call to revise previous claims of longevity ef-
fects in financial time series. The empirical method used
to evaluate and determine the critical quantity of train-
ing data for exchange rate forecasting is generalized for
application to other financial time series, indicating the
generality of the TS recency effect to other financial time
series.
The TS recency effect offers an explanation as to why
previous research efforts using neural network models
have not surpassed the 60% prediction accuracy demon-
strated as a realistic threshold by researchers. The diffi-
culty in most prior neural network research is that too
much data is typically used. In attempting to build the
best possible forecasting model, as was perceived at that
time, too much training data is used (typically 4–6 years of
data), thus violating the TS recency effect by introducing
data into the model that is not representative of the current
time-series behavior. Training, test, and general use data
represent an important and recurring cost for information
systems in general and neural networks in particular. Thus,
if the 2-year training set produces the best performance and
represents the minimal quantity of data required to achieve
this level of performance, then this minimal amount of data
is all that should be used to minimize the costs of neural
network development and maintenance. For example, the
Chicago Mercantile Exchange (CME) sells historical data
on commodities (including currency exchange rates) at the
cost of $100 per year per commodity. At this rate, using
1–2 years of data instead of the full 22 years of data pro-
vides an immediate data cost savings of $2000 to $2100
for producing the neural network models.
The only variation in the ANN models above was the
quantity of data used to build the ANN models. It may be
argued that certain years of training data contain noise and
would thus adversely affect the forecasting performance
of the neural network model. In such case, the addition of
more training data (older) that is error free should com-
pensate for the noise effects in middle data, creating a
U-shaped performance curve. The most recent data pro-
vide high performance and the largest quantity of data
available also provides high performance due to drown-
ing out the noise in middle-time frame samples.
The TS recency effect has been demonstrated for the
three most widely traded currencies against the U.S. dollar.
These results contradict current approaches which state
that as the quantity of training data used in constructing
neural network models increases, the forecasting perfor-
mance of the neural networks correspondingly improves.
The results were tested for robustness by extending the
research method to other foreign currencies. Three ad-
ditional currencies were selected: the French franc, the
Swiss franc, and the Italian lira. These three currencies
were chosen to approximate the set of nominal currencies
used in the previous study.
Results for the six different ANN models for each of
the three new currencies show that the full 22-year train-
ing data set continues to be outperformed by either the
1- or 2-year training sets. The exception is the French
franc, which has equivalent performance for the most
recent and the largest training data sets. The result that
the 22-year data set cannot outperform the smaller 1- or
2-year training data sets provides further empirical evi-
dence that a critical amount of training data, less than the
full 22 years for the foreign exchange time series, pro-
duces optimal performance for neural network financial
time-series models.
The French franc ANN models, like those for the Japanese
yen, have identical performance for the largest (22-year)
data set and the smallest (1-year) data set.
Because no increase in performance is provided through
the use of additional data, economics dictates that the
smaller 1-year set be used as the training paradigm for
the French franc, producing a possible $2100 savings in
data costs.
Additionally, the TS recency effect is supported by all
three currencies; however, the Swiss franc achieves its
maximum performance with 4 years of training data. The
quality of the ANN outputs for the Swiss franc model con-
tinually increases as new training data years are added,
through the fourth year, then precipitously drops in per-
formance as additional data are added to the training set.
Again, the Swiss franc results still support the research
goal of determining a critical training set size and the
discovered TS recency effect. However, the Swiss franc
results indicate that validation tests should be performed
individually for all financial time series to determine the
minimum quantity of data required for producing the best
forecasting performance.
While a significant amount of evidence has been ac-
quired to support the TS recency effect for ANN models
of foreign exchange rates, can the TS recency effect be
generalized to apply to other financial time series? The
knowledge that only a few years of data are necessary
to construct neural network models with maximum fore-
casting performance would serve to save neural network
developers significant development time, effort, and costs.
On the other hand, the dollar/Swiss franc ANNs described
above indicate that a cutoff of 2 years of training data may
not always be appropriate.
A method for determining the optimal training set size
for financial time series ANN models has been proposed.
This method consists of the following steps:
1. Create a 1-year training set using the most recent data;
determine an appropriate test set.
2. Train with the 1-year set and test (baseline); record the
performance.
3. Add 1 year of training data (the year closest in time to the
current training set).
4. Train with the newest training set, test on the original
test set, and record the performance.
5. If the performance of the newest training set is better
than, or equal to, the previous performance,
Then
Go to step 3
Otherwise
Use the previous training data set, which
produced the best performance.
This is an iterative approach that starts with a single year of
training data and continues to add additional years of train-
ing data until the trained neural network’s performance
begins to decrease. In other words, the process continues
to search for better training set sizes as long as the perfor-
mance increases or remains the same. The optimal training
set size is then set to be the smallest quantity of training
data to achieve the best forecasting performance.
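To make the procedure concrete, the following is a minimal Python sketch of the iterative search described above. The train_and_test function and the year-indexed data are hypothetical stand-ins for training a backpropagation ANN on the supplied years and returning its forecasting accuracy on the fixed test set.

```python
# Minimal sketch of the iterative training-set-size search described above.
# `train_and_test` is a hypothetical stand-in: it trains a backpropagation ANN
# on the supplied training years (oldest first, most recent last) and returns
# the forecasting accuracy measured on the fixed, original test set.

def find_minimum_training_set(years, train_and_test):
    """Return (number_of_years, accuracy) for the smallest training set whose
    performance is not exceeded by adding older data."""
    best_n = 1
    best_score = train_and_test(years[-1:])      # steps 1-2: 1-year baseline
    for n in range(2, len(years) + 1):
        score = train_and_test(years[-n:])       # steps 3-4: add one older year
        if score < best_score:                   # step 5: performance dropped,
            break                                #   so keep the earlier optimum
        if score > best_score:                   # strictly better: new optimum
            best_n, best_score = n, score
        # equal performance: keep searching, but retain the smaller set found
    return best_n, best_score
```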
Because the described method is a result of the empirical
evidence acquired using foreign exchange rates, testing it on
additional foreign exchange rate forecasting models would do
little more than reconfirm those results. Therefore, three new
financial time series from outside the foreign exchange domain
were used to demonstrate the robustness of the specified method.
The DJIA stock index
closing values, the closing price for the individual DIS
(Walt Disney Co.) stock, and the CAC-40 French stock
index closing values served as the three new financial time
series. Data samples from January 1977 to August 1999,
to simulate the 22 years of data used in the foreign ex-
change neural network training, were used for the DJIA
and DIS time series and data values from August 1988 to
May 1999 were used for the CAC-40 index.
Following the method discussed above, three backprop-
agation ANNs, one for each of the three time series, were
trained on the 1998 data set and tested a single time on the
1999 data values (164 cases for the DJIA and DIS; 123
cases for the CAC-40). Then a single year was added to
the training set, a new ANN model was trained and tested
a single time, with the process repeated until a decrease in
forecasting performance occurred. An additional 3 years
of training data, in 1-year increments, were added to the
training sets and evaluated to strengthen the conclusion
that the optimal training set size has been acquired. A fi-
nal test of the usefulness of the generalized method for
determining minimum optimal training set sizes was per-
formed by training similar neural network models on the
full 22-year training set for the DJIA index and DIS stock
ANNs and on the 10-year training set for all networks,
which was the maximum data quantity available for the
CAC-40. Then each of the ANNs trained on the “largest”
training sets was tested on the 1999 test data set to evaluate
the forecasting performance.
For both the DJIA and the DIS stock, the 1-year train-
ing data set was immediately identified as the best size
for a training data set as soon as the ANN trained on the
2-year data set was tested. The CAC-40 ANN forecast-
ing model, however, achieved its best performance with a
2-year training data set size. While the forecasting accuracy
for these three new financial time series did not reach the 60%
accuracy achieved by many of the foreign exchange forecasting
ANNs, it did support the
generalized method for determining minimum necessary
training data sets and consequently lends support to the
time-series recency effect. Once the correct or best per-
forming minimum training set was identified by the gen-
eralized method, no other ANN model trained on a larger
size training set was able to outperform the “minimum”
training set.
The results for the DIS stock value are slightly better.
The conclusion was that the ANN model, which used
approximately 4 years of training data, emulated a simple
efficient market. A random walk model of the DIS stock
produced a 50% prediction accuracy and so the DIS artifi-
cial neural network forecasting model did outperform the
random walk model, but not by a statistically significant
amount. An improvement to the ANN model to predict
stock price changes may be achieved by following the
generalized method for determining the best size training
set and reducing the overall quantity of training data, thus
limiting the effect of nonrelevant data.
Again as an alternative evaluation mechanism, a simu-
lation is run with the CAC-40 stock index data. A starting
value of $10,000 with sufficient funds and/or credit is as-
sumed to enable a position on 100 index options contracts.
Options are purchased or sold consistent with the ANN
forecasts for the direction of change in the CAC-40 index.
All options contracts are sold at the end of the year-long
simulation. The two-year training data set model produces
a net gain of $16,790, while using the full 10-year training
data set produces a net loss of $15,010. The simulation results
thus yield a net difference between the TS recency effect model
(2 years) and the heuristic greatest-quantity model (10 years) of
$31,800, or roughly three times the size of the initial investment.
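As an illustration only, the following Python sketch shows the general form of such a directional simulation. It is not the article's exact procedure: option premiums and transaction costs are ignored, a position's one-step profit simply tracks the change in the index, and the data, forecasts, and contract size are hypothetical assumptions.

```python
# Hedged sketch of a directional trading simulation of the kind described above.
# Option premiums and transaction costs are ignored; each one-day position's
# profit or loss is taken to track the change in the index closing value.

def simulate_directional_trading(index_closes, up_forecasts,
                                 contracts=100, start_cash=10_000.0):
    """index_closes: daily closing values; up_forecasts[i] is True when the ANN
    forecasts a rise from close i to close i + 1. Returns the net gain or loss."""
    cash = start_cash
    for today, tomorrow, up in zip(index_closes, index_closes[1:], up_forecasts):
        change = tomorrow - today
        cash += contracts * (change if up else -change)  # long on "up", short otherwise
    return cash - start_cash
```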
VII. CONCLUSIONS
General guidelines for the development of artificial neural
networks are few, so this article presents several heuristics
for developing ANNs that produce optimal generalization
performance. Extensive knowledge acquisition is the key
to the design of ANNs.
First, the correct input vector for the ANN must be
determined by capturing all relevant decision criteria used
by domain experts for solving the domain problem to be
modeled by the ANN and eliminating correlated variables.
Second, the selection of a learning method is an open
problem and an appropriate learning method can be
selected by examining the set of constraints imposed by
the collection of available training examples for training
the ANN.
Third, the architecture of the hidden layers is deter-
mined by further analyzing a domain expert’s clustering
of the input variables or heuristic rules for producing an
output value from the input variables. The collection of
clustering/decision heuristics used by the domain expert
has been called the set of decision factors (DFs). The quan-
tity of DFs is equivalent to the minimum number of hidden
units required by an ANN to correctly represent the prob-
lem space of the domain.
Use of the knowledge-based design heuristics enables
an ANN designer to build a minimum size ANN that is
capable of robustly dealing with specific domain prob-
lems. The future may hold automatic methods for deter-
mining the optimum configuration of the hidden layers
for ANNs. Minimum size ANN configurations guarantee
optimal results with the minimum amount of training time.
Finally, a new time-series model effect, termed the
time-series recency effect, has been described and demon-
strated to work consistently across six different currency
exchange time series ANN models. The TS recency effect
claims that model-building data that are nearer in time to the
out-of-sample values to be forecast produce more accurate
forecasting models. The empirical results discussed
in this article show that frequently, a smaller quantity of
training data will produce a better performing backprop-
agation neural network model of a financial time series.
Research indicates that for financial time series 2 years
of training data are frequently all that is required to pro-
duce optimal forecasting accuracy. Results from the Swiss
franc models alert the neural network researcher that the
TS recency effect may extend beyond 2 years. A gener-
alized method is presented for determining the minimum
training set size that produces the best forecasting perfor-
mance. Neural network researchers and developers using
the generalized method for determining the minimum nec-
essary training set size will be able to implement artificial
neural networks with the highest forecasting performance
at the least cost.
Future research can continue to provide evidence for
the TS recency effect by examining the effect of training
set size for additional financial time series (e.g., any other
stock or commodity and any other index value). The TS
recency effect may not be limited only to financial time
series; evidence from nonfinancial time-series domain
neural network implementations already indicates that
smaller quantities of more recent modeling data are ca-
pable of producing high-performance forecasting models.
Additionally, the TS recency effect has been demon-
strated with neural network models trained using back-
propagation. The common belief is that the TS recency
effect holds for all supervised learning neural network
training algorithms (e.g., radial basis function, fuzzy
ARTMAP, probabilistic) and is therefore a general prin-
ciple for time-series modeling and not restricted to back-
propagation neural network models.
In conclusion, it has been noted that ANN systems in-
cur costs from training data. This cost is not only financial,
but also has an impact on the development time and effort.
Empirical evidence demonstrates that frequently only 1 or
2 years of training data will produce the “best” perform-
ing backpropagation trained neural network forecasting
models. The proposed method for identifying the mini-
mum necessary training set size for optimal performance
enables neural network researchers and implementers to
develop the highest quality financial time-series forecast-
ing models in the shortest amount of time and at the lowest
cost.
Therefore, the set of general guidelines for designing
ANNs can be summarized as follows:
1. Perform extensive knowledge acquisition. This
knowledge acquisition should be targeted at
identifying the necessary domain information
required for solving the problem and identifying the
decision factors that are used by domain experts for
solving the type of problem to be modeled by the
ANN.
2. Remove noise variables. Identify highly correlated
variables via a Pearson correlation matrix or
chi-square test, and keep only one variable from each
highly correlated pair (see the sketch following this list).
Identify and remove noncontributing variables,
depending on data distribution and type, via
discriminant/factor analysis or step-wise regression.
3. Select an ANN learning method, based on the
demographic features of the data and decision
problem. If supervised learning methods are
applicable, then implement backpropagation in
addition to any other method indicated by the data
demographics (e.g., radial-basis function for small
training sets or counterpropagation for very noisy
training data).
4. Determine the amount of training data. For time series,
follow the methodology described in Section VI; for
classification problems, use approximately four times the
number of weighted connections (also illustrated in the
sketch following this list).
5. Determine the number of hidden layers. Analyze the
complexity, and number of unique steps, of the
traditional expert decision-making solution. If in
doubt, then use a single hidden layer, but realize that
additional nodes may be required to adequately
model the domain problem.
6. Set the quantity of hidden nodes in the last hidden
layer equal to the decision factors used by domain
experts to solve the problem. Use the knowledge
acquired during step 1 of this set of guidelines.
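As a concrete illustration of guidelines 2 and 4, the Python sketch below flags highly correlated candidate input variables and estimates a classification training-set size as four times the number of weighted connections in a fully connected feed-forward network. The function names, the 0.20 cutoff, and the omission of bias weights are assumptions made for this example; the appropriate correlation cutoff must be determined separately for each application.

```python
# Illustrative helpers for guidelines 2 and 4; names and the 0.20 cutoff are
# assumptions for this sketch, not prescriptions from the article.
import numpy as np

def correlated_pairs(X, names, cutoff=0.20):
    """Flag pairs of candidate inputs whose absolute Pearson correlation meets
    the cutoff, so one variable from each pair can be dropped (guideline 2)."""
    r = np.corrcoef(X, rowvar=False)          # columns of X are the variables
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(r[i, j]) >= cutoff:
                pairs.append((names[i], names[j], float(r[i, j])))
    return pairs

def classification_training_size(n_inputs, hidden_sizes, n_outputs):
    """Roughly four times the number of weighted connections (guideline 4)
    in a fully connected feed-forward network, ignoring bias weights."""
    layer_sizes = [n_inputs, *hidden_sizes, n_outputs]
    connections = sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))
    return 4 * connections
```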
SEE ALSO THE FOLLOWING ARTICLES
ARTIFICIAL INTELLIGENCE • COMPUTER NETWORKS •
EVOLUTIONARY ALGORITHMS AND METAHEURISTICS
More Related Content

Similar to ANN Architecture Guide

Artificial neural networks and its application
Artificial neural networks and its applicationArtificial neural networks and its application
Artificial neural networks and its applicationHưng Đặng
 
Artificial neural networks and its application
Artificial neural networks and its applicationArtificial neural networks and its application
Artificial neural networks and its applicationHưng Đặng
 
Nature Inspired Reasoning Applied in Semantic Web
Nature Inspired Reasoning Applied in Semantic WebNature Inspired Reasoning Applied in Semantic Web
Nature Inspired Reasoning Applied in Semantic Webguestecf0af
 
NETWORK LEARNING AND TRAINING OF A CASCADED LINK-BASED FEED FORWARD NEURAL NE...
NETWORK LEARNING AND TRAINING OF A CASCADED LINK-BASED FEED FORWARD NEURAL NE...NETWORK LEARNING AND TRAINING OF A CASCADED LINK-BASED FEED FORWARD NEURAL NE...
NETWORK LEARNING AND TRAINING OF A CASCADED LINK-BASED FEED FORWARD NEURAL NE...ijaia
 
Neural networks are parallel computing devices.docx.pdf
Neural networks are parallel computing devices.docx.pdfNeural networks are parallel computing devices.docx.pdf
Neural networks are parallel computing devices.docx.pdfneelamsanjeevkumar
 
Artificial Neural Networks: Applications In Management
Artificial Neural Networks: Applications In ManagementArtificial Neural Networks: Applications In Management
Artificial Neural Networks: Applications In ManagementIOSR Journals
 
Neural Network
Neural NetworkNeural Network
Neural NetworkSayyed Z
 
NEURAL NETWORK FOR THE RELIABILITY ANALYSIS OF A SERIES - PARALLEL SYSTEM SUB...
NEURAL NETWORK FOR THE RELIABILITY ANALYSIS OF A SERIES - PARALLEL SYSTEM SUB...NEURAL NETWORK FOR THE RELIABILITY ANALYSIS OF A SERIES - PARALLEL SYSTEM SUB...
NEURAL NETWORK FOR THE RELIABILITY ANALYSIS OF A SERIES - PARALLEL SYSTEM SUB...IAEME Publication
 
Artificial Neural Networks Lect1: Introduction & neural computation
Artificial Neural Networks Lect1: Introduction & neural computationArtificial Neural Networks Lect1: Introduction & neural computation
Artificial Neural Networks Lect1: Introduction & neural computationMohammed Bennamoun
 
What are neural networks.pdf
What are neural networks.pdfWhat are neural networks.pdf
What are neural networks.pdfStephenAmell4
 
What are neural networks.pdf
What are neural networks.pdfWhat are neural networks.pdf
What are neural networks.pdfAnastasiaSteele10
 
What are neural networks.pdf
What are neural networks.pdfWhat are neural networks.pdf
What are neural networks.pdfStephenAmell4
 
Fundamentals of Neural Network (Soft Computing)
Fundamentals of Neural Network (Soft Computing)Fundamentals of Neural Network (Soft Computing)
Fundamentals of Neural Network (Soft Computing)Amit Kumar Rathi
 
EXPERT SYSTEMS AND ARTIFICIAL INTELLIGENCE_ Neural Networks.pptx
EXPERT SYSTEMS AND ARTIFICIAL INTELLIGENCE_ Neural Networks.pptxEXPERT SYSTEMS AND ARTIFICIAL INTELLIGENCE_ Neural Networks.pptx
EXPERT SYSTEMS AND ARTIFICIAL INTELLIGENCE_ Neural Networks.pptxJavier Daza
 
MUSA HURANTA MSHELIA.pptx
MUSA HURANTA MSHELIA.pptxMUSA HURANTA MSHELIA.pptx
MUSA HURANTA MSHELIA.pptxMusaMshelia4
 

Similar to ANN Architecture Guide (20)

Artificial neural networks and its application
Artificial neural networks and its applicationArtificial neural networks and its application
Artificial neural networks and its application
 
Artificial neural networks and its application
Artificial neural networks and its applicationArtificial neural networks and its application
Artificial neural networks and its application
 
Project Report -Vaibhav
Project Report -VaibhavProject Report -Vaibhav
Project Report -Vaibhav
 
Artificial neural network
Artificial neural networkArtificial neural network
Artificial neural network
 
08 neural networks(1).unlocked
08 neural networks(1).unlocked08 neural networks(1).unlocked
08 neural networks(1).unlocked
 
Nature Inspired Reasoning Applied in Semantic Web
Nature Inspired Reasoning Applied in Semantic WebNature Inspired Reasoning Applied in Semantic Web
Nature Inspired Reasoning Applied in Semantic Web
 
Neural networks
Neural networksNeural networks
Neural networks
 
NETWORK LEARNING AND TRAINING OF A CASCADED LINK-BASED FEED FORWARD NEURAL NE...
NETWORK LEARNING AND TRAINING OF A CASCADED LINK-BASED FEED FORWARD NEURAL NE...NETWORK LEARNING AND TRAINING OF A CASCADED LINK-BASED FEED FORWARD NEURAL NE...
NETWORK LEARNING AND TRAINING OF A CASCADED LINK-BASED FEED FORWARD NEURAL NE...
 
Neural networks are parallel computing devices.docx.pdf
Neural networks are parallel computing devices.docx.pdfNeural networks are parallel computing devices.docx.pdf
Neural networks are parallel computing devices.docx.pdf
 
Artificial Neural Networks: Applications In Management
Artificial Neural Networks: Applications In ManagementArtificial Neural Networks: Applications In Management
Artificial Neural Networks: Applications In Management
 
Neural Network
Neural NetworkNeural Network
Neural Network
 
ANN.ppt
ANN.pptANN.ppt
ANN.ppt
 
NEURAL NETWORK FOR THE RELIABILITY ANALYSIS OF A SERIES - PARALLEL SYSTEM SUB...
NEURAL NETWORK FOR THE RELIABILITY ANALYSIS OF A SERIES - PARALLEL SYSTEM SUB...NEURAL NETWORK FOR THE RELIABILITY ANALYSIS OF A SERIES - PARALLEL SYSTEM SUB...
NEURAL NETWORK FOR THE RELIABILITY ANALYSIS OF A SERIES - PARALLEL SYSTEM SUB...
 
Artificial Neural Networks Lect1: Introduction & neural computation
Artificial Neural Networks Lect1: Introduction & neural computationArtificial Neural Networks Lect1: Introduction & neural computation
Artificial Neural Networks Lect1: Introduction & neural computation
 
What are neural networks.pdf
What are neural networks.pdfWhat are neural networks.pdf
What are neural networks.pdf
 
What are neural networks.pdf
What are neural networks.pdfWhat are neural networks.pdf
What are neural networks.pdf
 
What are neural networks.pdf
What are neural networks.pdfWhat are neural networks.pdf
What are neural networks.pdf
 
Fundamentals of Neural Network (Soft Computing)
Fundamentals of Neural Network (Soft Computing)Fundamentals of Neural Network (Soft Computing)
Fundamentals of Neural Network (Soft Computing)
 
EXPERT SYSTEMS AND ARTIFICIAL INTELLIGENCE_ Neural Networks.pptx
EXPERT SYSTEMS AND ARTIFICIAL INTELLIGENCE_ Neural Networks.pptxEXPERT SYSTEMS AND ARTIFICIAL INTELLIGENCE_ Neural Networks.pptx
EXPERT SYSTEMS AND ARTIFICIAL INTELLIGENCE_ Neural Networks.pptx
 
MUSA HURANTA MSHELIA.pptx
MUSA HURANTA MSHELIA.pptxMUSA HURANTA MSHELIA.pptx
MUSA HURANTA MSHELIA.pptx
 

More from Bria Davis

Essay On My Role Model. My Role Model Essay - EdenkruwOrtiz
Essay On My Role Model. My Role Model Essay - EdenkruwOrtizEssay On My Role Model. My Role Model Essay - EdenkruwOrtiz
Essay On My Role Model. My Role Model Essay - EdenkruwOrtizBria Davis
 
Perception Essay. Definition of perception Essay Example StudyHippo.com
Perception Essay. Definition of perception Essay Example  StudyHippo.comPerception Essay. Definition of perception Essay Example  StudyHippo.com
Perception Essay. Definition of perception Essay Example StudyHippo.comBria Davis
 
200 Word Essay Example. write an essay on the topic quot;My favourite moviequ...
200 Word Essay Example. write an essay on the topic quot;My favourite moviequ...200 Word Essay Example. write an essay on the topic quot;My favourite moviequ...
200 Word Essay Example. write an essay on the topic quot;My favourite moviequ...Bria Davis
 
The Gift Of The Magi Essay. The Story The Gift of the Magi and Its Thematic S...
The Gift Of The Magi Essay. The Story The Gift of the Magi and Its Thematic S...The Gift Of The Magi Essay. The Story The Gift of the Magi and Its Thematic S...
The Gift Of The Magi Essay. The Story The Gift of the Magi and Its Thematic S...Bria Davis
 
Marbury Vs Madison Essay. Marbury vs madison case brief. Marbury v. Madison ...
Marbury Vs Madison Essay.  Marbury vs madison case brief. Marbury v. Madison ...Marbury Vs Madison Essay.  Marbury vs madison case brief. Marbury v. Madison ...
Marbury Vs Madison Essay. Marbury vs madison case brief. Marbury v. Madison ...Bria Davis
 
A Humorous Incident Essay. A Terrible Accident Essay Example 400 Words - PHDe...
A Humorous Incident Essay. A Terrible Accident Essay Example 400 Words - PHDe...A Humorous Incident Essay. A Terrible Accident Essay Example 400 Words - PHDe...
A Humorous Incident Essay. A Terrible Accident Essay Example 400 Words - PHDe...Bria Davis
 
Argumentative Essay Cell Phones In School
Argumentative Essay Cell Phones In SchoolArgumentative Essay Cell Phones In School
Argumentative Essay Cell Phones In SchoolBria Davis
 
Essay On Hindi Language. Essay on students and politics in hindi - essnewday....
Essay On Hindi Language. Essay on students and politics in hindi - essnewday....Essay On Hindi Language. Essay on students and politics in hindi - essnewday....
Essay On Hindi Language. Essay on students and politics in hindi - essnewday....Bria Davis
 
Causes Of Global Warming Essay. Essay on Global Warming- Leverage Edu
Causes Of Global Warming Essay. Essay on Global Warming- Leverage EduCauses Of Global Warming Essay. Essay on Global Warming- Leverage Edu
Causes Of Global Warming Essay. Essay on Global Warming- Leverage EduBria Davis
 
Business Law Essay Questions. Law Essay Problem Questions Argument Lawsuit
Business Law Essay Questions. Law Essay Problem Questions  Argument  LawsuitBusiness Law Essay Questions. Law Essay Problem Questions  Argument  Lawsuit
Business Law Essay Questions. Law Essay Problem Questions Argument LawsuitBria Davis
 
Old Paper, Vintage Paper, Vintage Writing Paper, Mol
Old Paper, Vintage Paper, Vintage Writing Paper, MolOld Paper, Vintage Paper, Vintage Writing Paper, Mol
Old Paper, Vintage Paper, Vintage Writing Paper, MolBria Davis
 
The Best Way To Buy Custom Essa. Online assignment writing service.
The Best Way To Buy Custom Essa. Online assignment writing service.The Best Way To Buy Custom Essa. Online assignment writing service.
The Best Way To Buy Custom Essa. Online assignment writing service.Bria Davis
 
HereS How To Start An Essay With A Quote The Right Way
HereS How To Start An Essay With A Quote The Right WayHereS How To Start An Essay With A Quote The Right Way
HereS How To Start An Essay With A Quote The Right WayBria Davis
 
My Path Of Medicine Career Free Essay Example
My Path Of Medicine Career Free Essay ExampleMy Path Of Medicine Career Free Essay Example
My Path Of Medicine Career Free Essay ExampleBria Davis
 
Pin On Postcards Design. Online assignment writing service.
Pin On Postcards Design. Online assignment writing service.Pin On Postcards Design. Online assignment writing service.
Pin On Postcards Design. Online assignment writing service.Bria Davis
 
Zuknftiger Test Aufsatz Ber Die Zukunft Von Stude
Zuknftiger Test Aufsatz Ber Die Zukunft Von StudeZuknftiger Test Aufsatz Ber Die Zukunft Von Stude
Zuknftiger Test Aufsatz Ber Die Zukunft Von StudeBria Davis
 
Money Essay Writing. Essay On Money For All Class In 1
Money Essay Writing. Essay On Money For All Class In 1Money Essay Writing. Essay On Money For All Class In 1
Money Essay Writing. Essay On Money For All Class In 1Bria Davis
 
350 Typography And Packaging Ideas In. Online assignment writing service.
350 Typography And Packaging Ideas In. Online assignment writing service.350 Typography And Packaging Ideas In. Online assignment writing service.
350 Typography And Packaging Ideas In. Online assignment writing service.Bria Davis
 
8 Best Essay Writing Services - Paper Writing Websites For Students
8 Best Essay Writing Services - Paper Writing Websites For Students8 Best Essay Writing Services - Paper Writing Websites For Students
8 Best Essay Writing Services - Paper Writing Websites For StudentsBria Davis
 
Short Essay Writing Help Topics Exampl. Online assignment writing service.
Short Essay Writing Help Topics Exampl. Online assignment writing service.Short Essay Writing Help Topics Exampl. Online assignment writing service.
Short Essay Writing Help Topics Exampl. Online assignment writing service.Bria Davis
 

More from Bria Davis (20)

Essay On My Role Model. My Role Model Essay - EdenkruwOrtiz
Essay On My Role Model. My Role Model Essay - EdenkruwOrtizEssay On My Role Model. My Role Model Essay - EdenkruwOrtiz
Essay On My Role Model. My Role Model Essay - EdenkruwOrtiz
 
Perception Essay. Definition of perception Essay Example StudyHippo.com
Perception Essay. Definition of perception Essay Example  StudyHippo.comPerception Essay. Definition of perception Essay Example  StudyHippo.com
Perception Essay. Definition of perception Essay Example StudyHippo.com
 
200 Word Essay Example. write an essay on the topic quot;My favourite moviequ...
200 Word Essay Example. write an essay on the topic quot;My favourite moviequ...200 Word Essay Example. write an essay on the topic quot;My favourite moviequ...
200 Word Essay Example. write an essay on the topic quot;My favourite moviequ...
 
The Gift Of The Magi Essay. The Story The Gift of the Magi and Its Thematic S...
The Gift Of The Magi Essay. The Story The Gift of the Magi and Its Thematic S...The Gift Of The Magi Essay. The Story The Gift of the Magi and Its Thematic S...
The Gift Of The Magi Essay. The Story The Gift of the Magi and Its Thematic S...
 
Marbury Vs Madison Essay. Marbury vs madison case brief. Marbury v. Madison ...
Marbury Vs Madison Essay.  Marbury vs madison case brief. Marbury v. Madison ...Marbury Vs Madison Essay.  Marbury vs madison case brief. Marbury v. Madison ...
Marbury Vs Madison Essay. Marbury vs madison case brief. Marbury v. Madison ...
 
A Humorous Incident Essay. A Terrible Accident Essay Example 400 Words - PHDe...
A Humorous Incident Essay. A Terrible Accident Essay Example 400 Words - PHDe...A Humorous Incident Essay. A Terrible Accident Essay Example 400 Words - PHDe...
A Humorous Incident Essay. A Terrible Accident Essay Example 400 Words - PHDe...
 
Argumentative Essay Cell Phones In School
Argumentative Essay Cell Phones In SchoolArgumentative Essay Cell Phones In School
Argumentative Essay Cell Phones In School
 
Essay On Hindi Language. Essay on students and politics in hindi - essnewday....
Essay On Hindi Language. Essay on students and politics in hindi - essnewday....Essay On Hindi Language. Essay on students and politics in hindi - essnewday....
Essay On Hindi Language. Essay on students and politics in hindi - essnewday....
 
Causes Of Global Warming Essay. Essay on Global Warming- Leverage Edu
Causes Of Global Warming Essay. Essay on Global Warming- Leverage EduCauses Of Global Warming Essay. Essay on Global Warming- Leverage Edu
Causes Of Global Warming Essay. Essay on Global Warming- Leverage Edu
 
Business Law Essay Questions. Law Essay Problem Questions Argument Lawsuit
Business Law Essay Questions. Law Essay Problem Questions  Argument  LawsuitBusiness Law Essay Questions. Law Essay Problem Questions  Argument  Lawsuit
Business Law Essay Questions. Law Essay Problem Questions Argument Lawsuit
 
Old Paper, Vintage Paper, Vintage Writing Paper, Mol
Old Paper, Vintage Paper, Vintage Writing Paper, MolOld Paper, Vintage Paper, Vintage Writing Paper, Mol
Old Paper, Vintage Paper, Vintage Writing Paper, Mol
 
The Best Way To Buy Custom Essa. Online assignment writing service.
The Best Way To Buy Custom Essa. Online assignment writing service.The Best Way To Buy Custom Essa. Online assignment writing service.
The Best Way To Buy Custom Essa. Online assignment writing service.
 
HereS How To Start An Essay With A Quote The Right Way
HereS How To Start An Essay With A Quote The Right WayHereS How To Start An Essay With A Quote The Right Way
HereS How To Start An Essay With A Quote The Right Way
 
My Path Of Medicine Career Free Essay Example
My Path Of Medicine Career Free Essay ExampleMy Path Of Medicine Career Free Essay Example
My Path Of Medicine Career Free Essay Example
 
Pin On Postcards Design. Online assignment writing service.
Pin On Postcards Design. Online assignment writing service.Pin On Postcards Design. Online assignment writing service.
Pin On Postcards Design. Online assignment writing service.
 
Zuknftiger Test Aufsatz Ber Die Zukunft Von Stude
Zuknftiger Test Aufsatz Ber Die Zukunft Von StudeZuknftiger Test Aufsatz Ber Die Zukunft Von Stude
Zuknftiger Test Aufsatz Ber Die Zukunft Von Stude
 
Money Essay Writing. Essay On Money For All Class In 1
Money Essay Writing. Essay On Money For All Class In 1Money Essay Writing. Essay On Money For All Class In 1
Money Essay Writing. Essay On Money For All Class In 1
 
350 Typography And Packaging Ideas In. Online assignment writing service.
350 Typography And Packaging Ideas In. Online assignment writing service.350 Typography And Packaging Ideas In. Online assignment writing service.
350 Typography And Packaging Ideas In. Online assignment writing service.
 
8 Best Essay Writing Services - Paper Writing Websites For Students
8 Best Essay Writing Services - Paper Writing Websites For Students8 Best Essay Writing Services - Paper Writing Websites For Students
8 Best Essay Writing Services - Paper Writing Websites For Students
 
Short Essay Writing Help Topics Exampl. Online assignment writing service.
Short Essay Writing Help Topics Exampl. Online assignment writing service.Short Essay Writing Help Topics Exampl. Online assignment writing service.
Short Essay Writing Help Topics Exampl. Online assignment writing service.
 

Recently uploaded

ENGLISH6-Q4-W3.pptxqurter our high choom
ENGLISH6-Q4-W3.pptxqurter our high choomENGLISH6-Q4-W3.pptxqurter our high choom
ENGLISH6-Q4-W3.pptxqurter our high choomnelietumpap1
 
ACC 2024 Chronicles. Cardiology. Exam.pdf
ACC 2024 Chronicles. Cardiology. Exam.pdfACC 2024 Chronicles. Cardiology. Exam.pdf
ACC 2024 Chronicles. Cardiology. Exam.pdfSpandanaRallapalli
 
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPTECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPTiammrhaywood
 
Introduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxIntroduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxpboyjonauth
 
Solving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptxSolving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptxOH TEIK BIN
 
Earth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice greatEarth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice greatYousafMalik24
 
Field Attribute Index Feature in Odoo 17
Field Attribute Index Feature in Odoo 17Field Attribute Index Feature in Odoo 17
Field Attribute Index Feature in Odoo 17Celine George
 
Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...Jisc
 
Introduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher EducationIntroduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher Educationpboyjonauth
 
AMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdf
AMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdfAMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdf
AMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdfphamnguyenenglishnb
 
Computed Fields and api Depends in the Odoo 17
Computed Fields and api Depends in the Odoo 17Computed Fields and api Depends in the Odoo 17
Computed Fields and api Depends in the Odoo 17Celine George
 
DATA STRUCTURE AND ALGORITHM for beginners
DATA STRUCTURE AND ALGORITHM for beginnersDATA STRUCTURE AND ALGORITHM for beginners
DATA STRUCTURE AND ALGORITHM for beginnersSabitha Banu
 
Quarter 4 Peace-education.pptx Catch Up Friday
Quarter 4 Peace-education.pptx Catch Up FridayQuarter 4 Peace-education.pptx Catch Up Friday
Quarter 4 Peace-education.pptx Catch Up FridayMakMakNepo
 
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptxMULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptxAnupkumar Sharma
 
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdfLike-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdfMr Bounab Samir
 
Framing an Appropriate Research Question 6b9b26d93da94caf993c038d9efcdedb.pdf
Framing an Appropriate Research Question 6b9b26d93da94caf993c038d9efcdedb.pdfFraming an Appropriate Research Question 6b9b26d93da94caf993c038d9efcdedb.pdf
Framing an Appropriate Research Question 6b9b26d93da94caf993c038d9efcdedb.pdfUjwalaBharambe
 
How to do quick user assign in kanban in Odoo 17 ERP
How to do quick user assign in kanban in Odoo 17 ERPHow to do quick user assign in kanban in Odoo 17 ERP
How to do quick user assign in kanban in Odoo 17 ERPCeline George
 
ROOT CAUSE ANALYSIS PowerPoint Presentation
ROOT CAUSE ANALYSIS PowerPoint PresentationROOT CAUSE ANALYSIS PowerPoint Presentation
ROOT CAUSE ANALYSIS PowerPoint PresentationAadityaSharma884161
 

Recently uploaded (20)

ENGLISH6-Q4-W3.pptxqurter our high choom
ENGLISH6-Q4-W3.pptxqurter our high choomENGLISH6-Q4-W3.pptxqurter our high choom
ENGLISH6-Q4-W3.pptxqurter our high choom
 
ACC 2024 Chronicles. Cardiology. Exam.pdf
ACC 2024 Chronicles. Cardiology. Exam.pdfACC 2024 Chronicles. Cardiology. Exam.pdf
ACC 2024 Chronicles. Cardiology. Exam.pdf
 
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPTECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
 
Introduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxIntroduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptx
 
Solving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptxSolving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptx
 
Earth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice greatEarth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice great
 
Field Attribute Index Feature in Odoo 17
Field Attribute Index Feature in Odoo 17Field Attribute Index Feature in Odoo 17
Field Attribute Index Feature in Odoo 17
 
Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...
 
OS-operating systems- ch04 (Threads) ...
OS-operating systems- ch04 (Threads) ...OS-operating systems- ch04 (Threads) ...
OS-operating systems- ch04 (Threads) ...
 
Model Call Girl in Bikash Puri Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Bikash Puri  Delhi reach out to us at 🔝9953056974🔝Model Call Girl in Bikash Puri  Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Bikash Puri Delhi reach out to us at 🔝9953056974🔝
 
Introduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher EducationIntroduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher Education
 
AMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdf
AMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdfAMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdf
AMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdf
 
Computed Fields and api Depends in the Odoo 17
Computed Fields and api Depends in the Odoo 17Computed Fields and api Depends in the Odoo 17
Computed Fields and api Depends in the Odoo 17
 
DATA STRUCTURE AND ALGORITHM for beginners
DATA STRUCTURE AND ALGORITHM for beginnersDATA STRUCTURE AND ALGORITHM for beginners
DATA STRUCTURE AND ALGORITHM for beginners
 
Quarter 4 Peace-education.pptx Catch Up Friday
Quarter 4 Peace-education.pptx Catch Up FridayQuarter 4 Peace-education.pptx Catch Up Friday
Quarter 4 Peace-education.pptx Catch Up Friday
 
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptxMULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
 
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdfLike-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
 
Framing an Appropriate Research Question 6b9b26d93da94caf993c038d9efcdedb.pdf
Framing an Appropriate Research Question 6b9b26d93da94caf993c038d9efcdedb.pdfFraming an Appropriate Research Question 6b9b26d93da94caf993c038d9efcdedb.pdf
Framing an Appropriate Research Question 6b9b26d93da94caf993c038d9efcdedb.pdf
 
How to do quick user assign in kanban in Odoo 17 ERP
How to do quick user assign in kanban in Odoo 17 ERPHow to do quick user assign in kanban in Odoo 17 ERP
How to do quick user assign in kanban in Odoo 17 ERP
 
ROOT CAUSE ANALYSIS PowerPoint Presentation
ROOT CAUSE ANALYSIS PowerPoint PresentationROOT CAUSE ANALYSIS PowerPoint Presentation
ROOT CAUSE ANALYSIS PowerPoint Presentation
 

ANN Architecture Guide

  • 1. Artificial Neural Networks Steven Walczak University of Colorado, Denver Narciso Cerpa University of Talca, Chile I. Introduction to Artificial Neural Networks II. Need for Guidelines III. Input Variable Selection IV. Learning Method Selection V. Architecture Design VI. Training Samples Selection VII. Conclusions GLOSSARY Architecture The several different topologies into which artificial neural networks can be organized. Processing elements or neurons can be interconnected in different ways. Artificial neural network Model that emulates a biolog- ical neural network using a reduced set of concepts from a biological neural system. Learning method Algorithm for training the artificial neural network. Processing element An artificial neuron that receives in- put(s), processes the input(s), and delivers a single output. Summation function Computes the internal stimulation, or activation level, of the artificial neuron. Training sample Training cases that are used to adjust the weight. Transformation function A linear or nonlinear rela- tionship between the internal activation level and the output. Weight The relative importance of each input to a processing element. ARTIFICIAL NEURAL NETWORKS (ANNS) have been used to support applications across a variety of busi- ness and scientific disciplines during the past years. These computational models of neuronal activity in the brain are defined and illustrated through some brief examples. Neu- ral network designers typically perform extensive knowl- edge engineering and incorporate a significant amount of domain knowledge into ANNs. Once the input variables present in the neural network’s input vector have been selected, training data for these variables with known out- put values must be acquired. Recent research has shown that smaller training set sizes produce better performing neural networks, especially fortime-series applications. 631
  • 2. 632 Artificial Neural Networks Summarizing, this article presents an introduction to arti- ficial neural networks and also a general heuristic method- ology for designing high-quality ANN solutions to various domain problems. I. INTRODUCTION TO ARTIFICIAL NEURAL NETWORKS Artificial neural networks (sometimes just called neural networks or connectionist models) provide a means for dealing with complex pattern-oriented problems of both categorization and time-series (trend analysis) types. The nonparametric nature of neural networks enables models to be developed without having any prior knowledge of the distribution of the data population or possible interaction effects between variables as required by commonly used parametric statistical methods. As an example, multiple regression requires that the er- ror term of the regression equation be distributed normally (with a µ = 0) and also be nonheteroscedastic. Another statistical technique that is frequently used for perform- ing categorization is discriminant analysis, but discrimi- nant analysis requires that the predictor variables be mul- tivariate normally distributed. Because such assumptions are removed from ANN models, the ease of develop- ing a domain problem solution is increased with artifi- cial neural networks. Another factor contributing to the FIGURE 1 Sample artificial neural network architecture (not all weights are shown). success of ANN applications is their ability to create non- linear models as well as traditional linear models and, hence, artificial neural network solutions are applicable across a wider range of problem types (both linear and nonlinear). In the following sections, a brief history of artificial neural networks is presented. Next, a detailed examination of the components of an artificial neural network model is given with respect to the design of artificial neural network models of business and scientific domain problems. A. Biological Basis of Artificial Neural Networks Artificial neural networks are a technology based on stud- ies of the brain and nervous system as depicted in Fig. 1. These networks emulate a biological neural network but they use a reduced set of concepts from biological neural systems. Specifically, ANN models simulate the electri- cal activity of the brain and nervous system. Processing elements (also known as either a neurode or perceptron) are connected to other processing elements. Typically the neurodes are arranged in a layer or vector, with the out- put of one layer serving as the input to the next layer and possibly other layers. A neurode may be connected to all or a subset of the neurodes in the subsequent layer, with these connections simulating the synaptic connec- tionsofthebrain.Weighteddatasignalsenteringaneurode
  • 3. Artificial Neural Networks 633 simulate the electrical excitation of a nerve cell and conse- quently the transference of information within the network or brain. The input values to a processing element, in, are multipliedbyaconnectionweight,wn,m,thatsimulatesthe strengthening of neural pathways in the brain. It is through the adjustment of the connection strengths or weights that learning is emulated in ANNs. All of the weight-adjusted input values to a process- ing element are then aggregated using a vector to scalar function such as summation (i.e., y = wi j xi ), averaging, input maximum, or mode value to produce a single input value to the neurode. Once the input value is calculated, the processing element then uses a transfer function to pro- duce its output (and consequently the input signals for the next processing layer). The transfer function transforms the neurode’s input value. Typically this transformation involves the use of a sigmoid, hyperbolic-tangent, or other nonlinear function. The process is repeated between lay- ers of processing elements until a final output value, on, or vector of values is produced by the neural network. Theoretically, to simulate the asynchronous activity of the human nervous system, the processing elements of the artificial neural network should also be activated with the weighted input signal in an asynchronous manner. Most software and hardware implementations of artificial neu- ral networks, however, implement a more discretized ap- proach that guarantees that each processing element is activated once for each presentation of a vector of input values. B. History and Resurgence of Artificial Neural Networks The idea of combining multiple processing elements into a network is attributed to McCulloch and Pitts in the early 1940s and Hebb in 1949 is credited with being the first to define a learning rule to explain the behavior of networks ofneurons.Inthelate1950s,Rosenblattdevelopedthefirst perceptron learning algorithm. Soon after Rosenblatt’s discovery, Widrow and Hoff developed a similar learn- ing rule for electronic circuits. Artificial neural network research continued strongly throughout the 1960s. In 1969, Minsky and Papert published their book, Perceptrons, in which they showed the computational lim- its of single-layer neural networks, which were the type of artificial neural networks being used at that time. The theoretical limitations of perceptron-like networks led to a decrease in funding and subsequently research on artificial neural networks. Finally in 1986, McClelland and Rumelhart and the PDP research group published the Parallel Distributed Processing texts. These new texts published the back- propagation learning algorithm, which enabled multiple layers of perceptrons to be trained [and thus introduced the hidden layer(s) to artificial neural networks], and was the birth of MLPs (multiple layered perceptrons). Follow- ing the discovery of MLPs and the backpropagation algo- rithm, a revitalization of research and development efforts in artificial neural networks took place. In the past years, ANNs have been used to support applications across a diversity of business and scientific disciplines (e.g., financial, manufacturing, marketing, telecomunications, and biomedical). 
This proliferation of neural network applications has been facilitated by the emergence of neural networks shells (e.g., Brain- maker, Neuralyst, Neuroshell, and Professional II Plus) and tool add-ins (for SAS, MATLAB, and Excel) that provide developers with the means for specifying the ANN architecture and training the neural network. These shells and add-in tools enable ANN developers to build ANN solutions without requiring an in-depth knowl- edge of ANN theory or terminology. Please see either of these World Wide Web sites (active on December 31, 2000): http://www.faqs.org/faqs/ai-faq/neural-nets/part6/ or http://www.emsl.pnl.gov:2080/proj/neuron/neural/sys- tems/software.html for additional links to neural network shell software available commercially. Neural networks may use different learning algorithms and we can classify them into two major categories based on the input format: binary-valued input (i.e., 0s and 1s) or continuous-valued input. These two categories can be sub- divided into supervised learning and unsupervised learn- ing. As mentioned above, supervised learning algorithms use the difference between the desired and actual output to adjust and finally determine the appropriate weights for the ANN. In a variation of this approach some super- vised learning algorithms are informed whether the output for the input is correct and the network adjust its weights with the aims of achieving correct results. Hopfield net- work (binary) and backpropagation (continuous) are ex- amples of supervised learning algorithms. Unsupervised learning algorithms only receive input stimuli and the net- work organizes itself with the aim of having hidden pro- cessing elements that respond differently to each set of input stimuli. The network does not require information on the correctness of the output. ART I (binary) and Koho- nen (continuous) are examples of unsupervised learning algorithms. Neural network applications are frequently viewed as black boxes that mystically determine complex patterns in data. However, ANN designers must perform exten- sive knowledge engineering and incorporate a significant amount of domain knowledge into artificial neural net- works. Successful artificial neural network development requires a deep understanding of the steps involved in de- signing ANNs.
  • 4. 634 Artificial Neural Networks ANN design requires the developer to make many deci- sions such as input values, training and test data set sizes, learning algorithm, network architecture or topology, and transformation function. Several of these decisions are de- pendent on each other. For example, the ANN architecture and the learning algorithm will determine the type of input value (i.e., binary or continuous). Therefore, it is essen- tial to follow a methodology or a well-defined sequence of steps when designing ANNs. These steps are listed below: r Determine data to use. r Determine input variables. r Separate data into training and test sets. r Define the network architecture. r Select a learning algorithm. r Transform variables to network inputs. r Train (repeat until ANN error is below acceptable value). r Test (on hold-out sample to validate generalization of the ANN). In the following sections we discuss the need for guide- lines, and discuss heuristics for input variable selection, learning method selection, architecture design, and train- ing sample selection. Finally we conclude and summarize a set of guidelines for ANN design. II. NEED FOR GUIDELINES Artificial neural networks have been applied to a wide variety of business, engineering, medical, and scientific problems. Several research results have shown that ANNs outperform traditional statistical techniques (e.g., regres- sion or logit) as well as other standard machine learning techniques (e.g., the ID3 algorithm) for a large class of problem types. Many of these ANN applications such as financial time series, e.g., foreign exchange rate forecasts, are difficult to model. Artificial neural networks provide a valuable tool for building nonlinear models of data, especially when the underlying laws governing the system are unknown. Artificial neural network forecasting models have outper- formed both statistical and other machine learning models of financial time series, achieving forecast accuracies of more than 60% and thus are being widely used to model the behavior of financial time series. Other categorization- based applications of ANNs are achieving success rates of well over 90%. Development of effective neural network models is dif- ficult. Most artificial neural network designers develop multiple neural network solutions with regard to the net- work’s architecture—quantity of nodes and arrangement in hidden layers. Two critical design issues are still a chal- lenge for artificial neural networks developers: selection of appropriate input variables and capturing a sufficient quantity of training examples to permit the neural network to adequately model the application. Many different types of ANN applications have being developed in the past several years and are continuing to be developed. Industrial applications exist in the financial, manufacturing, marketing, telecommunications, biomed- ical, and many other domains. While business managers are seeking to develop new applications using ANNs, a basic misunderstanding of the source of intelligence in an ANN exists. As mentioned above, the development of new ANN applications has been facilitated by the emergence of a variety of neural network shells that allow anyone to produce neural network systems by simply specify- ing the ANN architecture and providing a set of train- ing data to be used by the shell to train the ANN. 
These shell-based neural networks may fail or produce subopti- mal results unless a deeper understanding of how to use and incorporate domain knowledge in the ANN is ob- tained by the designers of ANNs in business and industrial domains. The traditional view of an ANN is of a program that emulates biological neural networks and “learns” to rec- ognize patterns or categorize input data by being trained on a set of sample data from the domain. These programs learn through training and subsequently have the ability to generalize broad categories from specific examples. This is the unique perceived source of intelligence in an ANN. However, experienced ANN application designers typi- cally perform extensive knowledge engineering and in- corporate a significant amount of domain knowledge into the design of ANNs even before the learning through train- ing process has begun. The selection of the input variables to be used by the ANN is quite a complex task, due to the misconception that the more input a network is fed the more successful the results produced. This is only true if the information fed is critical to making the decisions; however, noisy input variables commonly result in very poor generalization performance. Design of optimal neural networks is problematic in that there exist a large number of alternative ANN physi- cal architectures and learning methods, all of which may be applied to a given domain problem. Selecting the ap- propriate size of the training data set presents another chal- lenge, since it implies direct and indirect costs, and it can also affect the generalization performance. A general heuristic or rule of thumb for the design of neural networks in time-series domains is that the more knowledge that is available to the neural network for form- ing its model, the better the ultimate performance of the
  • 5. Artificial Neural Networks 635 neural network. A minimum of 2 years of training data is considered to be a nominal starting point for financial time series. Times-series models are considered to im- prove as more data are incorporated into the modeling process. Research has indicated that currency exchange rates have a long-term memory, implying that larger peri- ods of time (data) will produce more comprehensive mod- els and produce better generalization. However, this has been challenged in recent research and will be discussed in Section VI. Neural network researchers have built forecasting and trading systems with training data from 1 to 16 years, including various training set sizes in between the two extremes. However, researchers typically use all of the data in building the neural network forecasting model, with no attempt at comparing data quantity effects on the quality of the produced forecasting models. In this article, a set of guidelines for incorporating knowledge into an ANN and using domain knowledge to design optimal ANNs is described. The guidelines for designing ANNs are made up of the following steps: knowledge-based selection of input values, selection of a learning method, architecture design, and training sample selection. The majority of the ANN design steps described will focus mainly on feed-forward supervised learning (and more specifically backpropagation) ANN applica- tions. Following these guidelines will enable developers and researchers to take advantage of the power of ANNs and will afford economic benefit by producing an ANN that outperforms similar ANNs with improperly specified design parameters. Artificial neural network designers must determine the optimal set of design criteria specified as follows: r Appropriate input (independent) variables. r Best learning method: Learning methods can be classified into either supervised or unsupervised learning methods. Within these learning methods there are many alternatives, each of which is appropriate for different distributions or types of data. r Appropriate architecture: The number of hidden layers depending on the selected learning method; the quantity of processing elements (nodes) per hidden layer. r Appropriate amount of training data: Time series and classification problems. The designer’s choices for these design criteria will affect the performance of the resulting ANN on out-of-sample data. Inappropriate selection of the values for these design factorsmayproduceANNapplicationsthatperformworse than random selection of an output (dependent) value. III. INPUT VARIABLE SELECTION The generalization performance of supervised learning artificial neural networks (e.g., backpropagation) usually improves when the network size is minimized with respect to the weighted connections between processing nodes (elements of the input, hidden, and output layers). ANNs that are too large tend to overfit or memorize the input data. Conversely, ANNs with too few weighted connec- tions do not contain enough processing elements to cor- rectly model the input data set, underfitting the data. Both of these situations result in poor out-of-sample general- ization. Therefore, when developing supervised learning neural networks (e.g., backpropagation, radial basis function, or fuzzy ARTMAP), the developer must determine what in- put variables should be selected to accurately model the domain. 
ANN designers must spend a significant amount of time performing knowledge acquisition to avoid the fact that "garbage in, garbage out" also applies to ANN applications. ANNs, like other artificial intelligence (AI) techniques, are highly dependent on the specification of input variables. However, ANN designers tend to misspecify input variables.

Input variable misspecification occurs because ANN designers follow the expert system approach of incorporating as much domain knowledge as possible into an intelligent system, assuming that ANN performance improves as additional domain knowledge is provided through the input variables. This belief is correct only in part: if a sufficient amount of information representing critical decision criteria is not given to an ANN, it cannot develop a correct model of the domain. Most ANN designers believe that since ANNs learn, they will be able to determine those input variables that are important and develop a corresponding model through the modification of the weights associated with the connections between the input layer and the hidden layers.

Noisy input variables, however, produce poor generalization performance in ANNs. The presence of too many input variables causes poor generalization when the ANN not only models the true predictors, but also includes the noise variables in the model. Interaction between input variables produces critical differences in output values, further obscuring the ideal problem model when unnecessary variables are included in the set of input values.

As indicated above and shown in the following sections, both under- and overspecification of input variables produce suboptimal performance. The following section describes the guidelines for selecting input (independent) variables for an ANN solution to a domain problem.
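To make the overspecification risk concrete, the following is a minimal sketch (not from the article) that trains a small backpropagation-style classifier with and without appended pure-noise inputs. The use of scikit-learn, the dataset sizes, and the network size are all assumptions chosen for illustration; the noisy version will typically, though not always, show lower out-of-sample accuracy.

```python
# Minimal sketch (not from the article): adding pure-noise input variables
# to a backpropagation-style network and comparing out-of-sample accuracy.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

rng = np.random.RandomState(0)

# A toy categorization problem with 6 genuinely predictive variables.
X, y = make_classification(n_samples=600, n_features=6, n_informative=6,
                           n_redundant=0, random_state=0)
# The same problem with 14 additional pure-noise input variables appended.
X_noisy = np.hstack([X, rng.normal(size=(X.shape[0], 14))])

for name, data in [("informative inputs only", X), ("with noise inputs", X_noisy)]:
    X_tr, X_te, y_tr, y_te = train_test_split(data, y, test_size=0.33,
                                              random_state=0)
    net = MLPClassifier(hidden_layer_sizes=(10,), max_iter=2000, random_state=0)
    net.fit(X_tr, y_tr)
    print(f"{name}: out-of-sample accuracy = {net.score(X_te, y_te):.3f}")
```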
A. Determination of Input Variables

Two approaches exist regarding the selection of input parameter variables for supervised learning neural networks. In the first approach, it is thought that since a neural network that utilizes supervised training will adjust its connection weights to better approximate the desired output values, all possible domain-relevant variables should be given to the neural network as input values. The idea is that the connection weights that indicate the contribution of nonsignificant variables will approach zero and thus effectively eliminate any effect on the output value from these variables; that is, ε_t → 0 as t → ∞, where ε_t is the error term of the neural network and t is the number of training iterations.

The second approach emphasizes the fact that the weighted connections never achieve a value of true zero and thus there will always be some contribution to the output value of the neural network by all of the input variables. Hence, ANN designers must research domain variables to determine their potential contribution to the desired output values.

Selection of input variables for neural networks is a complex, but necessary, task. Selection of irrelevant variables may cause output value fluctuations of up to 7%. Designers should determine applicability through knowledge acquisition from experts in the domain, similar to expert systems development. Highly correlated variables should be removed from the input vector because they multiply each other's effect and consequently introduce noise into the output values. This process should produce an expert-specified set of significant variables that are not intercorrelated, and which will yield the optimal performance for supervised learning neural networks.

The first step in determining the optimal set of input variables is to perform standard knowledge acquisition. Typically, this involves consultation with multiple domain experts. Various researchers have indicated the requirement for extensive knowledge acquisition utilizing domain experts to specify ANN input variables. The primary purpose of the knowledge acquisition phase is to guarantee that the input variable set is not underspecified, providing all relevant domain criteria to the ANN.

Once a base set of input variables is defined through knowledge acquisition, the set can be pruned to eliminate variables that contribute noise to the ANN and consequently reduce the ANN generalization performance. ANN input variables need to be predictive, but should not be correlated. Correlated variables degrade ANN performance by interacting with each other as well as with other elements to produce a biased effect. The designer should calculate the correlation of pairs of variables (a Pearson correlation matrix) to identify "noise" variables. If two variables have a high correlation, then one of these two variables may be removed from the set of variables without adversely affecting the ANN performance. Alternatively, a chi-square test may be used for categorical variables. The cutoff value for variable elimination is arbitrary and must be determined separately for every ANN application, but any correlation with an absolute value of 0.20 or higher indicates a probable noise source to the ANN. Additional statistical techniques may be applied, depending on the distribution properties of the data set.
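As an illustration of the correlation-screening step just described, the following sketch assumes the candidate variables are held in a pandas DataFrame and applies the 0.20 absolute-correlation cutoff mentioned above; the function name and usage are hypothetical.

```python
# Sketch of the correlation-based pruning step described above, assuming the
# candidate input variables sit in a pandas DataFrame. The 0.20 cutoff is the
# heuristic threshold mentioned in the text; column names are hypothetical.
import pandas as pd

def prune_correlated_inputs(frame: pd.DataFrame, cutoff: float = 0.20) -> list:
    """Return a reduced list of input variables in which no retained pair has
    an absolute Pearson correlation at or above the cutoff (one variable of
    each offending pair is dropped)."""
    corr = frame.corr(method="pearson").abs()
    keep = []
    for col in frame.columns:
        if all(corr.loc[col, kept] < cutoff for kept in keep):
            keep.append(col)
    return keep

# Hypothetical usage with expert-specified candidate variables:
# candidates = pd.DataFrame({...})          # rows = training cases
# selected = prune_correlated_inputs(candidates)
```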
Stepwise multiple or logistic regression and factor analysis provide viable tools for evaluating the predictive value of input variables and may serve as a secondary filter to the Pearson correlation matrix. Multiple regression and factor analysis perform best with normally distributed linear data, while logistic regression assumes a curvilinear relationship.

Several researchers have shown that smaller input variable sets can produce better generalization performance by an ANN. As mentioned above, highly correlated variables that share a common element need to be disregarded. Smaller input variable sets frequently improve the ANN generalization performance and reduce the net cost of data acquisition for development and usage of the ANN. However, care must be taken when removing variables from the ANN's input set to ensure that a complete set of noncorrelated predictor variables remains available to the ANN; otherwise the reduced variable sets may worsen generalization performance.

IV. LEARNING METHOD SELECTION

After determining a heuristically optimal set of input variables using the methods from the previous section, an ANN learning method must be selected. The learning method is what enables the ANN to correctly model categorization and time-series problems. Artificial neural network learning methods can be divided into two distinct categories: unsupervised learning and supervised learning. Both unsupervised and supervised learning methods require a collection of training examples that enable the ANN to model the data set and produce accurate output values.

Unsupervised learning systems, such as adaptive resonance theory (ART), self-organizing map (SOM, also called Kohonen networks), or Hopfield networks, do not require that the output value for a training sample be provided at the time of training. Supervised learning systems, such as backpropagation (MLP), radial basis function (RBF), counterpropagation,
or fuzzy ARTMAP networks, require that a known output value for all training samples be provided to the ANN.

Unsupervised learning methods determine output values directly from the input variable data set. Most unsupervised learning methods have less computational complexity and less generalization accuracy than supervised methods, because the answers must be contained within or directly learned from the input values. Hence, unsupervised learning techniques are typically used for classification problems where the desired classes are self-descriptive. For example, the ART algorithm is a good technique for performing object recognition in pictorial or graphical data. An example of a problem that has been solved with ART-based ANNs is the recognition of hand-written numerals. The hand-written numerals 0–9 are each unique, although in some cases similar (for example, 1 and 7 or 3 and 8), and define the pattern to be learned: the shapes of the numerals 0–9. The advantage of using unsupervised learning methods is that these ANNs can be designed to learn much more rapidly than supervised learning systems.

A. Unsupervised Learning

The unsupervised learning algorithms (ART, SOM/Kohonen, and Hopfield) form categories based on the input data. Typically, this requires a presentation of each of the training examples to the unsupervised learning ANN. Distinct categories of the input vector are formed and re-formed as new input examples are presented to the ANN.

The ART learning algorithm establishes a category for the initial training example. As additional examples are presented to the ART-based ANN, new categories are formed based on how closely the new example matches one of the existing categories with respect to both negative inhibition and positive excitation of the neurodes in the network. As a worst case, an ART-trained ANN may produce M distinct categories for M input examples. When building ART-based networks, the architecture of the network is given explicitly by the quantity of input values and the desired number of categories (output values). The hidden layer, usually called the F1 layer, is the same size as the input layer and serves as the feature detector for the categories. The output, or F2, layer is defined by the quantity of categories to be defined.

SOM-trained networks are composed of a Kohonen layer of neurodes that is two dimensional, as opposed to the vector alignments of most other ANNs. The collection of neurodes (also called the grid) maps input values onto the grid of neurodes so as to preserve order, which means that two input values that are close together will be mapped to the same neurode. The Kohonen grid is connected to both an input and an output layer. As training progresses, the neurodes in the grid attempt to approximate the feature space of the input by adjusting the collection of values mapped onto each neurode. A graphical example of the learning process in the Kohonen layer of the SOM is shown in Fig. 2, which is a grid of 12 neurodes (3 × 4) that is trying to learn the category of a hollow square object. Figures 2a–d represent the two-dimensional coordinates of each of the 12 Kohonen-layer processing elements.

FIGURE 2 Kohonen layer (12-node) learning of a square.

The Hopfield training algorithm is similar in nature to the ART training algorithm. Both require a hidden layer (in this case called the Hopfield layer, as opposed to an F1 layer for ART-based ANNs) that is the same size as the input layer.
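Before continuing with the Hopfield algorithm, the following is a minimal sketch of the kind of Kohonen-layer adaptation illustrated in Fig. 2: a 3 × 4 grid of neurodes gradually traces points drawn from the outline of a square. The learning-rate and neighborhood schedules are illustrative assumptions, not values from the article.

```python
# Minimal Kohonen (SOM) sketch: a 3 x 4 grid of neurodes adapts to points
# sampled from the outline of a square, loosely mirroring the Fig. 2 example.
import numpy as np

rng = np.random.default_rng(0)
grid_rows, grid_cols = 3, 4
# Neurode coordinates in the grid (used by the neighborhood function).
grid_pos = np.array([[r, c] for r in range(grid_rows) for c in range(grid_cols)], float)
weights = rng.uniform(0, 1, size=(grid_rows * grid_cols, 2))   # 2-D inputs

def sample_square_outline() -> np.ndarray:
    """Draw a random point on the perimeter of the unit square."""
    t = rng.uniform(0, 4)
    edge, frac = int(t), t - int(t)
    return np.array([[frac, 0], [1, frac], [1 - frac, 1], [0, 1 - frac]][edge], float)

for step in range(2000):
    x = sample_square_outline()
    lr = 0.5 * (1 - step / 2000)                     # decaying learning rate
    radius = 2.0 * (1 - step / 2000) + 0.5           # shrinking neighborhood
    bmu = int(np.argmin(np.sum((weights - x) ** 2, axis=1)))  # best matching unit
    dist = np.sum((grid_pos - grid_pos[bmu]) ** 2, axis=1)
    h = np.exp(-dist / (2 * radius ** 2))            # neighborhood strength
    weights += lr * h[:, None] * (x - weights)       # move neurodes toward x

print(np.round(weights, 2))   # the grid neurodes now roughly trace the square
```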
The Hopfield algorithm is based on spin glass physics and views the state of the network as an en- ergy surface. Both SOM and Hopfield trained ANNs have been used to solve traveling salesman problems in addition to the more traditional image processing of unsupervised learning ANNs. Hopfield ANNs are also used for opti- mization problems. A difficulty with Hopfield ANNs is the capacity of the network, which is estimated at n/(4 ln n), where n is the number of neurodes in the Hopfield layer. B. Supervised Learning The backpropagation learning algorithm is one of the most popular design choices for implementing ANNs, since this algorithm is available and supported by most com- mercial neural network shells and is based on a very ro- bust paradigm. Backpropagation-trained ANNs have been shown to be universal approximators, and they are able to learn arbitrary category mappings. Various researchers have supported this finding and shown the superiority of backpropagation-trained ANNs to different ANN learning paradigms including radial basis function (RBF), coun- terpropagation, and fuzzy adaptive resonance theory. An ANN’s performance has been found to be more dependent
on data representation than on the selection of a learning rule. Learning rules other than backpropagation perform well if the data from the domain have specific properties. The mathematical specifications of the various ANN learning methods described in this section are available in the reference articles and books given at the end of this article.

Backpropagation is the superior learning method when a sufficient number of noise/error-free training examples exists, regardless of the complexity of the specific domain problem. Backpropagation ANNs can handle noise in the training data and may actually generalize better if some noise is present in the training data. However, too many erroneous training values may prevent the ANN from learning the desired model.

For ANN applications that provide only a few training examples or very noisy training data, other supervised learning methods should be selected. RBF networks perform well in domains with limited training sets, and counterpropagation networks perform well when a sufficient number of training examples is available but the data may be very noisy. For resource allocation (configuration) problems, backpropagation produced the best results, although the first appearance of the problem indicated that counterpropagation might outperform backpropagation due to anticipated noise in the training data set. Hence, although properties of the data population may strongly indicate the preference for a particular training method, because of the strength of the backpropagation network this type of learning method should always be tried in addition to any other methods prescribed by domain data tendencies.

Domains that have a large collection of relatively error-free historical examples with known outcomes suit backpropagation ANN implementations. Both ART and RBF ANNs performed worse than the backpropagation ANN for this type of domain problem.

Many other ANN learning methods exist, and each is subject to constraints on the type of data that is best processed by that specific learning method. For example, general regression neural networks are capable of solving any problem that can also be solved by a statistical regression model, but do not require that a specific model type (e.g., multiple linear or logistic) be specified in advance. However, regression ANNs suffer from the same constraints as regression models, such as the linear or curvilinear relationship of the data with heteroscedastic error. Likewise, learning vector quantization (LVQ) networks try to divide input values into disjoint categories, similar to discriminant analysis, and consequently have the same data distribution requirements as discriminant analysis. Research using resource allocation problems indicated that LVQ neural networks produced the second best allocation results, which provided the previously unknown insight that the categories used for allocating resources were unique.

To summarize, backpropagation MLP networks are usually implemented due to their robust and generalized problem-solving capabilities. General regression networks are implemented to simulate statistical regression models. Radial basis function networks are implemented to resolve domain problems having a partial sample or a training data set that is too small. Both counterpropagation and fuzzy ARTMAP networks are implemented to resolve the difficulty of extremely noisy training data.
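One rough way to encode the method-selection heuristics summarized above (and the LVQ case noted next) is sketched below. The numeric cutoff for a "limited" training set is an assumption made for illustration; the article gives only qualitative guidance.

```python
# A sketch that encodes the learning-method selection heuristic summarized
# above. The threshold for "limited" data is an assumption, not an article value.
def suggest_learning_methods(n_examples: int, noise_level: str,
                             outputs_known: bool) -> list:
    """Return candidate ANN learning methods, most preferred first."""
    if not outputs_known:
        # No target values available: unsupervised methods only.
        return ["ART", "SOM (Kohonen)", "Hopfield"]
    suggestions = []
    if n_examples < 200:                      # assumed cutoff for "limited" data
        suggestions.append("radial basis function")
    if noise_level == "high":
        suggestions += ["counterpropagation", "fuzzy ARTMAP"]
    # Backpropagation should always be tried as well, per the text.
    suggestions.append("backpropagation")
    return suggestions

print(suggest_learning_methods(n_examples=120, noise_level="high", outputs_known=True))
```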
The combination of unsupervised (clustering and ART) learning techniques with supervised learning may improve the performance of neural networks in noisy domains. Finally, learning vector quantization networks are implemented to exploit the potential for unique decision criteria of disjoint sets.

The selection of a learning method is an open problem, and ANN designers must use the constraints of the training data set to determine the optimal learning method. If reasonably large quantities of relatively noise-free training examples are available, then backpropagation provides an effective learning method that is relatively easy to implement.

V. ARCHITECTURE DESIGN

The architecture of an ANN consists of the number of layers of processing elements or nodes, including the input, output, and any hidden layers, and the quantity of nodes contained in each layer. Selection of input variables (i.e., the input vector) was discussed in Section III, and the output vector is normally predefined by the problem to be solved with the ANN. Design of the hidden layers is dependent on the selected learning algorithm (discussed in Section IV). For example, unsupervised learning methods such as ART normally require a first hidden layer with a quantity of nodes equal to the size of the input layer. Supervised learning systems are generally more flexible in the design of hidden layers. The remaining discussion focuses on backpropagation ANN systems or other similar supervised learning ANNs. The designer should determine the following aspects regarding the hidden layers of the ANN architecture: (1) the number of hidden layers and (2) the number of nodes in the hidden layer(s).

A. Number of Hidden Layers

It is possible to design an ANN with no hidden layers, but these types of ANNs can only classify input data that
is linearly separable, which severely limits their application. Artificial neural networks that contain hidden layers have the ability to deal robustly with nonlinear and complex problems and therefore can operate on more interesting problems. The quantity of hidden layers is associated with the complexity of the domain problem to be solved. ANNs with a single hidden layer create a hyperplane. ANNs with two hidden layers combine hyperplanes to form convex decision areas, and ANNs with three hidden layers combine convex decision areas to form convex decision areas that contain concave regions. The convexity or concavity of a decision region corresponds roughly to the number of unique inferences or abstractions that are performed on the input variables to produce the desired output result.

Increasing the number of hidden layers enables a trade-off between smoothness and closeness-of-fit. A greater quantity of hidden layers enables an ANN to improve its closeness-of-fit, while a smaller quantity improves the smoothness or extrapolation capabilities of the ANN. Several researchers have indicated that a single hidden layer architecture, with an arbitrarily large quantity of hidden nodes in the single layer, is capable of modeling any categorization mapping. On the other hand, two hidden layer networks outperform their single hidden layer counterparts for specific problems. A heuristic for determining the quantity of hidden layers required by an ANN is as follows: "As the dimensionality of the problem space increases (higher order problems), the number of hidden layers should increase correspondingly."

The number of hidden layers is heuristically set by determining the number of intermediate steps, dependent on previous categorizations, required to translate the input variables into an output value. Therefore, domain problems that have a standard nonlinear equation solution are solvable by a single hidden layer ANN.

B. Number of Nodes per Hidden Layer

When choosing the number of nodes to be contained in a hidden layer, there is a trade-off between training time and the accuracy of training. A greater number of hidden unit nodes results in a longer (slower) training period, while fewer hidden units provide shorter (faster) training, but at the cost of having fewer feature detectors. Too many hidden nodes in an ANN enable it to memorize the training data set, which produces poor generalization performance. Some of the heuristics used for selecting the quantity of hidden nodes for an ANN are:

• 75 percent of the quantity of input nodes,
• 50 percent of the quantity of input and output nodes, or
• 2n + 1 hidden layer nodes, where n is the number of nodes in the input layer.

These algorithmic heuristics do not utilize domain knowledge for estimating the quantity of hidden nodes and may be counterproductive.

As with the knowledge acquisition and elimination of correlated input variables heuristic for defining the optimal input node set, the number of decision factors (DFs) heuristically determines the optimal number of hidden units for an ANN. Knowledge acquisition or existing knowledge bases may be used to determine the DFs for a particular domain and consequently the hidden layer architecture and optimal quantity of hidden nodes. Decision factors are the separable elements that help to form the unique categories of the input vector space.
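The three algorithmic heuristics listed above, together with the knowledge-based decision-factor rule, can be sketched as follows; the example node counts are hypothetical.

```python
# Sketch of the three algorithmic hidden-node heuristics listed above, plus
# the decision-factor (DF) rule; the example node counts are hypothetical.
def hidden_node_candidates(n_inputs, n_outputs, n_decision_factors=None):
    candidates = {
        "75% of input nodes": round(0.75 * n_inputs),
        "50% of input + output nodes": round(0.5 * (n_inputs + n_outputs)),
        "2n + 1 (n = input nodes)": 2 * n_inputs + 1,
    }
    if n_decision_factors is not None:
        # Knowledge-based rule preferred by the article: one hidden node per DF.
        candidates["decision factors"] = n_decision_factors
    return candidates

print(hidden_node_candidates(n_inputs=8, n_outputs=1, n_decision_factors=5))
```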
The DFs are comparable to the collection of heuristic production rules used in an expert system. An example of the DF design principle is provided by the NETtalk neural network research project. NETtalk has 203 input nodes representing seven textual characters, and 33 output units representing the phonetic notation of the spoken text words. Hidden units were varied from 0 to 120. NETtalk improved output accuracy as the number of hidden units was increased from 0 to 120, but only a minimal improvement in output accuracy was observed between 60 and 120 hidden units. This indicates that the ideal quantity of DFs for the NETtalk problem was around 60; adding hidden units beyond 60 increased the training time, but did not provide any appreciable difference in the ANN's performance.

Several researchers have found that ANNs perform poorly until a sufficient number of hidden units is available to represent the correlations between the input vector and the desired output values. Increasing the number of hidden units beyond the sufficient number serves to increase training time without a corresponding increase in output accuracy. Knowledge acquisition is necessary to determine the optimal input variable set to be used in an ANN system. During the knowledge acquisition phase, additional knowledge engineering can be performed to determine the DFs and subsequently the minimum number of hidden units required by the ANN architecture. The ANN designer must acquire the heuristic rules or clustering methods used by domain experts, similar to the knowledge that must be acquired during the knowledge acquisition process for expert systems. The number of heuristic rules or clusters used by domain experts is equivalent to the DFs used in the domain.

Researchers have explored and demonstrated techniques for automatically producing an ANN architecture with the exact number of hidden units required to model the DFs for
the problem space. The approach used by these automatic methods consists of three steps:

1. Initially create a neural network architecture with a very small or very large number of hidden units.
2. Train the network for some predetermined number of epochs.
3. Evaluate the error of the output nodes. If the error exceeds a set threshold value, then a hidden unit is added (or deleted, if the network started large), and the process is repeated until the error term is less than the threshold value.

Another method to automatically determine the optimum architecture is to use genetic algorithms to generate multiple ANN architectures and select the architectures with the best performance. Determining the optimum number of hidden units for an ANN application is a very complex problem, and an accurate method for automatically determining the DF quantity of hidden units without performing the corresponding knowledge acquisition remains a current research topic.

In this section, the heuristic architecture design principle of acquiring decision factors to determine the quantity of hidden nodes and the configuration of hidden layers has been presented. A number of hidden nodes equal to the number of DFs is required by an ANN to perform robustly in a domain and produce accurate results. This concept is similar to the principle of a minimum size input vector determined through knowledge acquisition presented in Section III. The knowledge acquisition process for ANN designers must capture the heuristic decision rules or clustering methods of domain experts. The DFs for a domain are equivalent to the heuristic decision rules used by domain experts. Further analysis of the DFs to determine the dimensionality of the problem space enables the knowledge engineer to configure the hidden nodes into the optimal number of hidden layers for efficient modeling of the problem space.

VI. TRAINING SAMPLES SELECTION

Acquisition of training data has direct costs associated with the data themselves, and indirect costs due to the fact that larger training sets require a larger quantity of training epochs to optimize the neural network's learning. The common belief is that the generalization performance of a neural network will increase when larger quantities of training samples are used to train the neural network, especially for time-series applications of neural networks. Based on this belief, the neural network designer must acquire as much data as possible to ensure the optimal learning of a neural network.

A "rule of thumb" lower bound on the number of training examples required to train a backpropagation ANN is four times the number of weighted connections contained in the network. Therefore, if a training database contains only 100 training examples, the maximum size of the ANN is 25 connections, or approximately 10 nodes depending on the ANN architecture. While the general heuristic of four times the number of connections is applicable to most classification problems, time-series problems, including the prediction of financial time series (e.g., stock values), are more dependent on business cycles. Recent research has shown that a maximum of 1 or 2 years of data is all that is required to produce optimal forecasting results for ANNs performing financial time-series prediction.

Another issue to be considered during training sample selection is how well the samples in the training set model the real world.
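Before turning to how representative the samples are, the rule-of-thumb sizing just described can be sketched as follows for a fully connected feed-forward network; the layer sizes in the example are hypothetical.

```python
# Sketch of the "four times the weighted connections" rule of thumb above,
# applied to a fully connected feed-forward architecture (biases ignored).
def count_weighted_connections(layer_sizes):
    """Number of weights in a fully connected feed-forward ANN."""
    return sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))

def min_training_examples(layer_sizes):
    """Lower bound on training cases for backpropagation: 4 x connections."""
    return 4 * count_weighted_connections(layer_sizes)

def max_connections_for(n_examples):
    """Largest network (in weighted connections) the rule allows for a data set."""
    return n_examples // 4

# A 6-input, 5-hidden, 1-output network needs at least 140 training cases;
# a 100-case database supports at most 25 connections, as noted in the text.
print(min_training_examples([6, 5, 1]))   # -> 140
print(max_connections_for(100))           # -> 25
```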
If training samples are skewed such that they only cover a small portion of the possible real-world instances that a neural network will be asked to classify or predict, then the neural network can only learn how to classify or predict results for this subset of the domain. Therefore, developers should take care to ensure that their training set samples have a distribution similar to the domain in which the neural network must operate.

Artificial neural network training sets should be representative of the population at large. This indicates that categorization-based ANNs require at least one example of each category to be classified and that the distribution of training data should approximate the distribution of the population at large. A small number of additional examples from each category will help to improve the generalization performance of the ANN. Thus, a categorization ANN trying to classify items into one of seven categories with distributions of 5, 10, 10, 15, 15, 20, and 25% would need a minimum of 20 training examples, but would benefit from having 40–100 training examples. Time-series domain problems are dependent on the distribution of the time series, with the neural network normally requiring one complete cycle of data. Again, recent research in financial time series has demonstrated that 1- and 2-year cycle times are prevalent, and thus the minimum required training data for a financial time-series ANN would be from 1 to 2 years of training examples.

Based on these more recent findings, we suggest that neural network developers use an iterative approach to training: starting with a small quantity of training data, train the neural network, then increase the quantity of samples in the training data set and repeat training until a decrease in performance occurs.

Development of optimal neural networks is a difficult and complex task. Limiting both the set of input variables to those that are thought to be predictive and the training
set size increases the probability of developing robust and highly accurate neural network models.

Most neural network models of financial time series are homogeneous. Homogeneous models utilize data from the specific time series being forecast or directly obtainable from that time series (e.g., a k-day trend or moving average). Heterogeneous models utilize information from outside the time series in addition to the time series itself. Homogeneous models rely on the predictive capabilities of the time series itself, corresponding to a technical analysis as opposed to a fundamental analysis.

Most neural network forecasting in the capital markets produces an output value that is the future price or exchange rate. Measuring the mean squared error of these neural networks may produce misleading evaluations of the neural networks' capabilities, since even very small errors that are incorrect in the direction of change will result in a capital loss. Instead of measuring the mean squared error of a forecast, some researchers argue that a better method for measuring the performance of neural networks is to analyze the direction of change. The direction of change is calculated by subtracting today's price from the forecast price and determining the sign (positive or negative) of the result. The percentage of correct direction-of-change forecasts is equivalent to the percentage of profitable trades enabled by the ANN system.

The effect of the quantity of training data on the quality of the neural network model's forecasting outputs has been called the "time-series (TS) recency effect." The TS recency effect states that for time-series data, model construction data that are closer in time to the values to be forecast produce better forecasting models. This effect is similar to the concept of a random walk model, which assumes future values are affected only by the previous time period's value, but the TS recency effect uses a wider range of proximal data for formulating the forecasts.

Requirements for training or modeling knowledge were investigated when building nonlinear financial time-series forecasting models with neural networks. Homogeneous neural network forecasting models were developed for trading the U.S. dollar against various other foreign currencies (i.e., dollar/pound, dollar/mark, dollar/yen). Various training sets were used, ranging from 22 years to 1 year of historic training data. The differences between the neural network models for a specific currency existed only in the quantity of training data used to develop each time-series forecasting model. The researchers critically examined the qualitative effect of training set size on neural network foreign exchange rate forecasting models. Training data sets of up to 22 years of data were used to predict 1-day future spot rates for several nominal exchange rates. Multiple neural network forecasting models for each exchange rate were trained on incrementally larger quantities of training data. The resulting outputs were used to empirically evaluate whether neural network exchange rate forecasting models achieve optimal performance in the presence of a critical amount of data used to train the network. Once this critical quantity of data is obtained, addition of more training data does not improve and may, in fact, hinder the forecasting performance of the neural network forecasting model.
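As an aside, the direction-of-change evaluation described above can be sketched as follows; the price and forecast values in the example are hypothetical.

```python
# Sketch of the direction-of-change evaluation described above: compare the
# sign of (forecast - today's value) with the sign of the realized change.
import numpy as np

def direction_of_change_accuracy(today, forecast, actual_next) -> float:
    """Fraction of forecasts whose predicted direction matches reality."""
    today, forecast, actual_next = map(np.asarray, (today, forecast, actual_next))
    predicted_dir = np.sign(forecast - today)
    actual_dir = np.sign(actual_next - today)
    return float(np.mean(predicted_dir == actual_dir))

today       = [1.62, 1.60, 1.61, 1.63]      # hypothetical spot rates today
forecast    = [1.63, 1.59, 1.60, 1.64]      # ANN forecasts for tomorrow
actual_next = [1.64, 1.61, 1.59, 1.65]      # realized values
print(direction_of_change_accuracy(today, forecast, actual_next))  # -> 0.75
```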
For most exchange rate predictions, a maximum of 2 years of training data produces the best neural network forecasting model performance. Hence, this finding leads to the induction of the empirical hypothesis of a time-series recency effect. The TS recency effect can be summarized in the following statement: "The use of data that are closer in time to the data that are to be forecast by the model produces a higher quality model."

The TS recency effect provides several direct benefits for both neural network researchers and developers:

• A new paradigm for choosing training samples for producing a time-series model
• Higher quality models, by having better forecasting performance through the use of smaller quantities of data
• Lower development costs for neural network time-series models, because fewer training data are required
• Less development time, because smaller training set sizes typically require fewer training iterations to accurately model the training data.

The time-series recency effect refutes existing heuristics and is a call to revise previous claims of longevity effects in financial time series. The empirical method used to evaluate and determine the critical quantity of training data for exchange rate forecasting is generalized for application to other financial time series, indicating the generality of the TS recency effect to other financial time series.

The TS recency effect offers an explanation as to why previous research efforts using neural network models have not surpassed the 60% prediction accuracy demonstrated as a realistic threshold by researchers. The difficulty in most prior neural network research is that too much data is typically used. In attempting to build the best possible forecasting model, as was perceived at that time, too much training data is used (typically 4–6 years of data), thus violating the TS recency effect by introducing data into the model that are not representative of the current time-series behavior. Training, test, and general use data represent an important and recurring cost for information systems in general and neural networks in particular. Thus, if the 2-year training set produces the best performance and
represents the minimal quantity of data required to achieve this level of performance, then this minimal amount of data is all that should be used, to minimize the costs of neural network development and maintenance. For example, the Chicago Mercantile Exchange (CME) sells historical data on commodities (including currency exchange rates) at a cost of $100 per year per commodity. At this rate, using 1–2 years of data instead of the full 22 years of data provides an immediate data cost savings of $2000 to $2100 for producing the neural network models.

The only variation in the ANN models above was the quantity of data used to build the models. It may be argued that certain years of training data contain noise and would thus adversely affect the forecasting performance of the neural network model. In that case, the addition of more (older) training data that is error free should compensate for the noise effects in the middle data, creating a U-shaped performance curve: the most recent data provide high performance, and the largest quantity of available data also provides high performance by drowning out the noise in middle-time-frame samples.

The TS recency effect has been demonstrated for the three most widely traded currencies against the U.S. dollar. These results contradict current approaches, which state that as the quantity of training data used in constructing neural network models increases, the forecasting performance of the neural networks correspondingly improves. The results were tested for robustness by extending the research method to other foreign currencies. Three additional currencies were selected: the French franc, the Swiss franc, and the Italian lira. These three currencies were chosen to approximate the set of nominal currencies used in the previous study.

Results for the six different ANN models for each of the three new currencies show that the full 22-year training data set continues to be outperformed by either the 1- or 2-year training sets. The exception is the French franc, which has equivalent performance for the most recent and the largest training data sets. The result that the 22-year data set cannot outperform the smaller 1- or 2-year training data sets provides further empirical evidence that a critical amount of training data, less than the full 22 years for the foreign exchange time series, produces optimal performance for neural network financial time-series models.

The French franc ANN models, similar to the Japanese yen, have identical performance between the largest (22-year) data set and the smallest (1-year) data set. Because no increase in performance is provided through the use of additional data, economics dictates that the smaller 1-year set be used as the training paradigm for the French franc, producing a possible $2100 savings in data costs. Additionally, the TS recency effect is supported by all three currencies; however, the Swiss franc achieves its maximum performance with 4 years of training data. The quality of the ANN outputs for the Swiss franc model continually increases as training data years are added, through the fourth year, then drops precipitously in performance as additional data are added to the training set. Again, the Swiss franc results still support the research goal of determining a critical training set size and the discovered TS recency effect.
However, the Swiss franc results indicate that validation tests should be performed individually for each financial time series to determine the minimum quantity of data required to produce the best forecasting performance.

While a significant amount of evidence has been acquired to support the TS recency effect for ANN models of foreign exchange rates, can the TS recency effect be generalized to apply to other financial time series? The knowledge that only a few years of data are necessary to construct neural network models with maximum forecasting performance would serve to save neural network developers significant development time, effort, and costs. On the other hand, the dollar/Swiss franc ANNs described above indicate that a cutoff of 2 years of training data may not always be appropriate.

A method for determining the optimal training set size for financial time-series ANN models has been proposed. This method consists of the following steps:

1. Create a 1-year training set using the most recent data; determine an appropriate test set.
2. Train with the 1-year set and test (baseline); record performance.
3. Add 1 year of training data, the year closest to the current training set.
4. Train with the newest training set, and test on the original test set; record performance.
5. If the performance of the newest training set is at least as good as the previous performance, then go to step 3; otherwise, use the previous training data set, which produced the best performance.

This is an iterative approach that starts with a single year of training data and continues to add additional years of training data until the trained neural network's performance begins to decrease. In other words, the process continues to search for better training set sizes as long as the performance increases or remains the same. The optimal training set size is then set to be the smallest quantity of training data that achieves the best forecasting performance.
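A minimal sketch of this iterative search is given below, assuming the developer already has training and scoring code to plug in; the function and parameter names are hypothetical.

```python
# Sketch of the iterative training-set-size search listed above. The
# train_and_score callable is a placeholder for whatever modeling and
# direction-of-change scoring code the developer already has.
def find_optimal_training_years(yearly_data, test_set, train_and_score):
    """yearly_data[0] holds the most recent year; train_and_score(train, test)
    returns a performance figure where larger is better. Returns the smallest
    number of years that achieves the best observed performance."""
    best_years = 1
    best_score = train_and_score(yearly_data[:1], test_set)      # baseline
    for n_years in range(2, len(yearly_data) + 1):
        score = train_and_score(yearly_data[:n_years], test_set)
        if score > best_score:
            best_years, best_score = n_years, score   # new best: keep its size
        elif score < best_score:
            break                                     # performance decreased: stop
        # equal performance: keep searching, but retain the smaller set as best
    return best_years, best_score
```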
Because the described method is a result of the empirical evidence acquired using foreign exchange rates, it stands to reason that testing the method on additional neural network foreign exchange rate forecasting models would continue to validate the method. Therefore, three new financial time series were used to demonstrate the robustness of the specified method. The DJIA stock index closing values, the closing price of the individual DIS (Walt Disney Co.) stock, and the CAC-40 French stock index closing values served as the three new financial time series. Data samples from January 1977 to August 1999, to simulate the 22 years of data used in the foreign exchange neural network training, were used for the DJIA and DIS time series, and data values from August 1988 to May 1999 were used for the CAC-40 index.

Following the method discussed above, three backpropagation ANNs, one for each of the three time series, were trained on the 1998 data set and tested a single time on the 1999 data values (164 cases for the DJIA and DIS; 123 cases for the CAC-40). Then a single year was added to the training set, and a new ANN model was trained and tested a single time, with the process repeated until a decrease in forecasting performance occurred. An additional 3 years of training data, in 1-year increments, were added to the training sets and evaluated to strengthen the conclusion that the optimal training set size had been acquired. A final test of the usefulness of the generalized method for determining minimum optimal training set sizes was performed by training similar neural network models on the full 22-year training set for the DJIA index and DIS stock ANNs, and on the 10-year training set for all networks, which was the maximum data quantity available for the CAC-40. Each of the ANNs trained on the "largest" training sets was then tested on the 1999 test data set to evaluate its forecasting performance.

For both the DJIA and the DIS stock, the 1-year training data set was immediately identified as the best size for a training data set as soon as the ANN trained on the 2-year data set was tested. The CAC-40 ANN forecasting model, however, achieved its best performance with a 2-year training data set size. While the forecasting accuracy for these three new financial time series did not reach the 60% forecasting accuracy achieved by many of the foreign exchange forecasting ANNs, the results did support the generalized method for determining minimum necessary training data sets and consequently lend support to the time-series recency effect. Once the correct or best performing minimum training set was identified by the generalized method, no other ANN model trained on a larger training set was able to outperform the "minimum" training set.

The results for the DIS stock value are slightly better. Conclusions were that the ANN model, which used approximately 4 years of training data, emulated a simple efficient market. A random walk model of the DIS stock produced a 50% prediction accuracy, and so the DIS artificial neural network forecasting model did outperform the random walk model, but not by a statistically significant amount. An improvement to the ANN model for predicting stock price changes may be achieved by following the generalized method for determining the best size training set and reducing the overall quantity of training data, thus limiting the effect of nonrelevant data.
Again, as an alternative evaluation mechanism, a simulation is run with the CAC-40 stock index data. A starting value of $10,000, with sufficient funds and/or credit to enable a position on 100 index options contracts, is assumed. Options are purchased or sold consistent with the ANN forecasts for the direction of change in the CAC-40 index. All options contracts are sold at the end of the year-long simulation. The 2-year training data set model produces a net gain of $16,790, while using the full 10-year training data set produces a net loss of $15,010. The simulation results yield a net average difference between the TS recency effect model (2 years) and the heuristic greatest-quantity model (10 years) of $31,800, or roughly three times the size of the initial investment.

VII. CONCLUSIONS

General guidelines for the development of artificial neural networks are few, so this article presents several heuristics for developing ANNs that produce optimal generalization performance. Extensive knowledge acquisition is the key to the design of ANNs.

First, the correct input vector for the ANN must be determined by capturing all relevant decision criteria used by domain experts for solving the domain problem to be modeled by the ANN and by eliminating correlated variables.

Second, the selection of a learning method is an open problem; an appropriate learning method can be selected by examining the set of constraints imposed by the collection of training examples available for training the ANN.

Third, the architecture of the hidden layers is determined by further analyzing a domain expert's clustering of the input variables or the heuristic rules used for producing an output value from the input variables. The collection of clustering/decision heuristics used by the domain expert has been called the set of decision factors (DFs). The quantity of DFs is equivalent to the minimum number of hidden units required by an ANN to correctly represent the problem space of the domain.

Use of the knowledge-based design heuristics enables an ANN designer to build a minimum size ANN that is
capable of robustly dealing with specific domain problems. The future may hold automatic methods for determining the optimum configuration of the hidden layers for ANNs. Minimum size ANN configurations guarantee optimal results with the minimum amount of training time.

Finally, a new time-series model effect, termed the time-series recency effect, has been described and demonstrated to work consistently across six different currency exchange time-series ANN models. The TS recency effect claims that model building data that are nearer in time to the out-of-sample values to be forecast produce more accurate forecasting models. The empirical results discussed in this article show that, frequently, a smaller quantity of training data will produce a better performing backpropagation neural network model of a financial time series. Research indicates that for financial time series, 2 years of training data are frequently all that is required to produce optimal forecasting accuracy. Results from the Swiss franc models alert the neural network researcher that the TS recency effect may extend beyond 2 years. A generalized method is presented for determining the minimum training set size that produces the best forecasting performance. Neural network researchers and developers using the generalized method for determining the minimum necessary training set size will be able to implement artificial neural networks with the highest forecasting performance at the least cost.

Future research can continue to provide evidence for the TS recency effect by examining the effect of training set size for additional financial time series (e.g., any other stock or commodity and any other index value). The TS recency effect may not be limited only to financial time series; evidence from nonfinancial time-series domain neural network implementations already indicates that smaller quantities of more recent modeling data are capable of producing high-performance forecasting models.

Additionally, the TS recency effect has been demonstrated with neural network models trained using backpropagation. The common belief is that the TS recency effect holds for all supervised learning neural network training algorithms (e.g., radial basis function, fuzzy ARTMAP, probabilistic) and is therefore a general principle for time-series modeling, not restricted to backpropagation neural network models.

In conclusion, it has been noted that ANN systems incur costs from training data. This cost is not only financial, but also has an impact on development time and effort. Empirical evidence demonstrates that frequently only 1 or 2 years of training data will produce the "best" performing backpropagation-trained neural network forecasting models. The proposed method for identifying the minimum necessary training set size for optimal performance enables neural network researchers and implementers to develop the highest quality financial time-series forecasting models in the shortest amount of time and at the lowest cost.

Therefore, the set of general guidelines for designing ANNs can be summarized as follows:

1. Perform extensive knowledge acquisition. This knowledge acquisition should be targeted at identifying the necessary domain information required for solving the problem and identifying the decision factors that are used by domain experts for solving the type of problem to be modeled by the ANN.

2. Remove noise variables.
Identify highly correlated variables via a Pearson correlation matrix or a chi-square test, and keep only one variable from each correlated pair. Identify and remove noncontributing variables, depending on data distribution and type, via discriminant/factor analysis or stepwise regression.

3. Select an ANN learning method, based on the demographic features of the data and the decision problem. If supervised learning methods are applicable, then implement backpropagation in addition to any other method indicated by the data demographics (i.e., radial basis function for small training sets or counterpropagation for very noisy training data).

4. Determine the amount of training data. For time series, follow the methodology described in Section VI; for classification problems, use at least four times the number of weighted connections.

5. Determine the number of hidden layers. Analyze the complexity, and the number of unique steps, of the traditional expert decision-making solution. If in doubt, use a single hidden layer, but realize that additional nodes may be required to adequately model the domain problem.

6. Set the quantity of hidden nodes in the last hidden layer equal to the number of decision factors used by domain experts to solve the problem. Use the knowledge acquired during step 1 of this set of guidelines.

SEE ALSO THE FOLLOWING ARTICLES

ARTIFICIAL INTELLIGENCE • COMPUTER NETWORKS • EVOLUTIONARY ALGORITHMS AND METAHEURISTICS

BIBLIOGRAPHY

Bansal, A., Kauffman, R. J., and Weitz, R. R. (1993). "Comparing the modeling performance of regression and neural networks as data quality varies: A business value approach," J. Management Infor. Syst. 10 (1), 11–32.
Barnard, E., and Wessels, L. (1992). "Extrapolation and interpolation in neural network classifiers," IEEE Control Syst. 12 (5), 50–53.
Carpenter, G. A., and Grossberg, S. (1988). "The ART of adaptive pattern recognition by a self-organizing neural network," Computer 21 (3), 77–88.
Carpenter, G. A., Grossberg, S., Markuzon, N., and Reynolds, J. H. (1992). "Fuzzy ARTMAP: A neural network architecture for incremental learning of analog multidimensional maps," IEEE Trans. Neural Networks 3 (5), 698–712.
Dayhoff, J. (1990). "Neural Network Architectures: An Introduction," Van Nostrand Reinhold, New York.
Fu, L. (1996). "Neural Networks in Computer Intelligence," McGraw-Hill, New York.
Gately, E. (1996). "Neural Networks for Financial Forecasting," Wiley, New York.
Hammerstrom, D. (1993). "Neural networks at work," IEEE Spectrum 30 (6), 26–32.
Haykin, S. (1994). "Neural Networks: A Comprehensive Foundation," Macmillan, New York.
Hecht-Nielsen, R. (1988). "Applications of counterpropagation networks," Neural Networks 1, 131–139.
Hertz, J., Krogh, A., and Palmer, R. (1991). "Introduction to the Theory of Neural Computation," Addison-Wesley, Reading, MA.
Hopfield, J. J., and Tank, D. W. (1986). "Computing with neural circuits: A model," Science 233 (4764), 625–633.
Hornik, K., Stinchcombe, M., and White, H. (1989). "Multilayer feedforward networks are universal approximators," Neural Networks 2 (5), 359–366.
Kohonen, T. (1988). "Self-Organization and Associative Memory," Springer-Verlag, Berlin.
Li, E. Y. (1994). "Artificial neural networks and their business applications," Infor. Management 27 (5), 303–313.
Medsker, L., and Liebowitz, J. (1994). "Design and Development of Expert Systems and Neural Networks," Macmillan, New York.
Mehra, P., and Wah, B. W. (19xx). "Artificial Neural Networks: Concepts and Theory," IEEE, New York.
Moody, J., and Darken, C. J. (1989). "Fast learning in networks of locally-tuned processing elements," Neural Comput. 1 (2), 281–294.
Smith, M. (1993). "Neural Networks for Statistical Modeling," Van Nostrand Reinhold, New York.
Specht, D. F. (1991). "A general regression neural network," IEEE Trans. Neural Networks 2 (6), 568–576.
White, H. (1990). "Connectionist nonparametric regression: Multilayer feedforward networks can learn arbitrary mappings," Neural Networks 3 (5), 535–549.
Widrow, B., Rumelhart, D. E., and Lehr, M. A. (1994). "Neural networks: Applications in industry, business and science," Commun. ACM 37 (3), 93–105.