UNIT-4: Artificial Neural Network and Deep Learning
Dr. Radhey Shyam
Professor
Department of Computer Science and Engineering
BIET Lucknow
Following slides have been prepared by Dr. Radhey Shyam, with grateful acknowledgement of others who
made their course contents freely available. Feel free to reuse these slides for your own academic purposes.
Brief Review of Self-Organizing Maps
Dubravko Miljković
Hrvatska elektroprivreda, Zagreb, Croatia
dubravko.miljkovic@hep.hr
Abstract - As a particular type of artificial neural network, self-organizing maps (SOMs) are trained using unsupervised, competitive learning to produce a low-dimensional, discretized representation of the input space of the training samples, called a feature map. Such a map retains the principal features of the input data. Self-organizing maps are known for their clustering, visualization and classification capabilities. In this brief review paper, basic tenets, including motivation, architecture, mathematical description and applications, are reviewed.
I. INTRODUCTION
Among numerous neural network architectures,
a particularly interesting architecture was introduced by the Finnish professor Teuvo Kohonen in the 1980s, [1,2]. The self-organizing map (SOM), sometimes also called a Kohonen map, uses unsupervised, competitive learning to produce a low-dimensional, discretized representation of presented high-dimensional data, while simultaneously preserving similarity relations between the presented data items. Such a low-dimensional representation is called a feature map, hence the map in the name. This brief review paper attempts to introduce the reader to SOMs, covering in short their basic tenets, underlying biological motivation, architecture, mathematical description and various applications, [3-10].
II. NEURAL NETWORKS
Human and animal brains are highly complex,
nonlinear and parallel systems, consisting of billions of
neurons integrated into numerous neural networks, [3].
Neural networks within a brain are massively parallel
distributed processing systems, suitable for storing
knowledge in the form of past experiences and making it
available for future use. They are particularly suitable for
the class of problems where it is difficult to propose an
analytical solution convenient for algorithmic
implementation.
A. Biological Motivation
After millions of years of evolution, the brain in animals
and humans has evolved into a massively parallel stack of
computing power, capable of dealing with the tremendous
variety of situations it can encounter. The biological
neural networks are natural intelligent information
processors. Artificial neural networks (ANN) constitute
computing paradigm motivated by the neural structure of
biological systems, [6]. ANNs employ a computational
approach based on a large collection of artificial neurons
that are much simplified representations of biological
neurons. Synapses that ensure communication among
biological neurons are replaced with neuron input weights.
Adjustment of connection weights is performed by one
of numerous learning algorithms. ANNs have very simple
principles, but their behavior can be very complex. They
have a capability to learn, generalize, associate data and
are fault tolerant. The history of the ANNs begins in the
1940s, but the first significant step came in 1957 with the
introduction of Rosenblatt’s perceptron. The evolution of
the most popular ANN paradigms is shown in Fig. 1, [10].
B. Basic Architectures
An artificial neural network is an interconnected
assembly of simple processing elements, called artificial
neurons (also called units or nodes), whose functionality
mimics that of a biological neuron, [4]. Individual neurons
can be combined into layers, and there are single and
multi-layer networks, with or without feedback. The most
common types of ANNs are shown in Fig. 2, [11]. Among
training algorithms the most popular is backpropagation
and its variants. ANNs can be used for solving a wide
variety of problems. Before use, they have to be
trained. During training, the network adjusts its weights. In
supervised training, input/output pairs are presented to the
network by an external teacher, and the network tries to learn
the desired input-output mapping. Some neural architectures
(like SOM) can learn without supervision (unsupervised)
from the training data without specified input/output pairs.
Figure 1. Evolution of artificial neural network paradigms, based on [10]
Figure 2. Most common artificial neural networks, according to [11]
III. SELF-ORGANIZING MAPS
The self-organizing map (SOM), as a particular neural
network paradigm, has found its inspiration in self-
organizing and biological systems.
A. Self-Organized Systems
Self-organizing systems are types of systems that can
change their internal structure and function in response to
external circumstances and stimuli, [12-15]. Elements of
such a system can influence or organize other elements
within the same system, resulting in a greater stability of
structure or function of the whole against external
fluctuations, [12]. The main aspects of self-organizing
systems are increase of complexity, emergence of new
phenomena (the whole is more than the sum of its parts)
and internal regulation by positive and negative feedback
loops. In 1952 Turing published a paper regarding the
mathematical theory of pattern formation in biology, and
found that global order in a system can arise from local
interactions, [13]. This often produces a system with new,
emergent properties that differ qualitatively from those of
components without interactions, [16]. Self-organizing
systems exist in nature, including non-living as well as
living world, they exist in man-made systems, but also in
the world of abstract ideas, [12].
B. Self-Organizing Map
Networks of neurons with lateral communication,
topologically organized as self-organizing maps, are
common in neurobiology. Various neural functions are
mapped onto identifiable regions of the brain, Fig. 3, [17].
In such topographic maps, the neighborhood relation is
preserved. The brain mostly does not have desired
input-output pairs available and has to learn in an
unsupervised mode.
Figure 3. Maps in brain, [17]
A SOM is a single-layer neural network with units set
along an n-dimensional grid. Most applications use a two-
dimensional, rectangular grid, although many applications
also use hexagonal grids, and some use one-, three-, or
higher-dimensional spaces. SOMs produce low-dimensional
projection images of high-dimensional data distributions,
in which the similarity relations between the data items
are preserved, [18].
C. Principles of Self-Organization in SOMs
The following three processes are common to self-
organization in SOMs, [7,19,20]:
1. Competitive Process
For each input pattern vector presented to the map, all
neurons calculate values of a discriminant function. The
neuron whose weight vector is most similar to the input
pattern vector is the winner (best matching unit, BMU).
2. Cooperative Process
The winning neuron (BMU) determines the spatial location
of a topological neighborhood of excited neurons.
Neurons from this neighborhood may then cooperate.
3. Synaptic Adaptation
Excited neurons can modify the values of their
discriminant function related to the presented input
pattern vector through the process of weight adjustment.
D. Common Topologies
SOM topologies can be in one, two (most common)
or even three dimensions, [2-10]. The two most used
two-dimensional grids in SOMs are the rectangular and the
hexagonal grid. Three-dimensional topologies can take the
form of cylinder or toroid shapes. 1-D (linear) and 2-D grids
are illustrated in Fig. 4, with corresponding SOMs in Fig. 5
and Fig. 6, according to [19].
Figure 4. Most common grids and neuron neighborhoods
Figure 5. 1-D SOM network, according to [19].
Figure 6. 2-D SOM network, according to [19].
IV. LEARNING ALGORITHM
In 1982 Professor Kohonen presented his SOM
algorithm, [1]. Further advancement in the field came with
the second edition of his book “Self-Organization and
Associative Memory” in 1988, [2].
A. Measures of Distance and Similarity
To determine the similarity between the input vector and
the neurons, measures of distance are used. Some popular
distances between input patterns and SOM units are, [21]:
• Euclidean
• Correlation
• Direction cosine
• Block distance
In real applications, the squared Euclidean distance is most often used, (1):

$$d_j = \sum_i \left( x_i - w_{ij} \right)^2 \qquad (1)$$
B. Neighborhood Functions
Neurons within a grid interact among themselves using
a neighborhood function. Neighborhood functions most often
assume the form of the Mexican hat, (2), Fig. 7, which has a
biological motivation behind it (it inhibits some neurons in the
vicinity of the winning neuron), although other functions
(Gaussian, cone and cylinder) are also possible, [22].
The ordering algorithm is robust to the choice of function
type if the neighborhood radius and learning rate decrease to
zero. The popular choice is exponential decay.

$$h\left(w_{ij}, w_{mn}\right) = g(r) = \left(1 - \frac{r^{2}}{\sigma^{2}}\right) e^{-\frac{r^{2}}{2\sigma^{2}}}, \qquad r^{2} = (i-m)^{2} + (j-n)^{2} \qquad (2)$$
Figure 7. Mexican hat function
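As an illustration (not from the paper), here is a minimal Python sketch of the Mexican-hat (Ricker) neighborhood and the simpler Gaussian alternative, both as functions of the grid distance d to the winning neuron; sigma is an assumed width parameter:

```python
import numpy as np

def mexican_hat(d, sigma=1.0):
    # Positive near the winner, inhibitory (negative) further out,
    # decaying to zero: the "Mexican hat" profile of Fig. 7.
    s = (d / sigma) ** 2
    return (1.0 - s) * np.exp(-s / 2.0)

def gaussian(d, sigma=1.0):
    # Simpler, purely excitatory alternative mentioned in the text.
    return np.exp(-(d ** 2) / (2.0 * sigma ** 2))
```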
C. Initialization of Self-Organizing Maps
Before training a SOM, its units (i.e., their weights) should be
initialized. Common approaches are, [2,23]:
1. Use of random values, completely independent of the
training data set
2. Use of random samples from the input training data
3. Initialization that tries to reflect the distribution of
the data (Principal Components)
D. Training
Self-organizing maps use the most popular algorithm
of the unsupervised learning category, [2]. The criterion
D, that is minimized, is the sum of distances between all
input vectors x_n and their respective winning-neuron
weights w_i, calculated at the end of each epoch, (3), [21]:

$$D = \sum_{i=1}^{k} \sum_{n \in c_i} \left\| \mathbf{x}_n - \mathbf{w}_i \right\|^{2} \qquad (3)$$

where c_i denotes the set of input vectors whose winner is unit i.
Training of self-organizing maps, [2,18], can be
accomplished in two ways: as sequential or batch training.
1. Sequential training
• a single vector at a time is presented to the map
• adjustment of the neuron weights is made after the presentation of each vector
• suitable for on-line learning
2. Batch training
• the whole dataset is presented before any adjustment to the neuron weights is made
• suitable for off-line learning
Here are the steps of sequential training, [3,7,19,22]:
1. Initialization
• Initialize the neuron weights (iteration step n = 0)
2. Sampling
• Randomly sample a vector x(n) from the dataset
3. Similarity Matching
• Find the best matching unit (BMU), c, with weights w_bmu = w_c, (4):

$$c = \arg\min_i \left\| \mathbf{x}(n) - \mathbf{w}_i(n) \right\| \qquad (4)$$
4. Updating
• Update each unit i with the following rule, (5):

$$\mathbf{w}_i(n+1) = \mathbf{w}_i(n) + \alpha(n)\, h_{i,\mathrm{bmu}}\!\left(n, r(n)\right) \left[ \mathbf{x}(n) - \mathbf{w}_i(n) \right] \qquad (5)$$
5. Continuation
• Increment n. Repeat steps 2-4 until a stopping criterion is met (e.g., a fixed number of iterations, or the map has reached a stable state). A code sketch of this training loop follows below.
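As an illustration, here is a minimal NumPy sketch of the sequential training loop above. It uses a Gaussian neighborhood in place of the Mexican hat; the grid size, learning rate, radius and decay schedules are illustrative assumptions, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def train_som(data, grid=(10, 10), n_iter=5000, alpha0=0.5, r0=5.0):
    """Sequential SOM training, steps 1-5 above (assumed parameters)."""
    gh, gw = grid
    dim = data.shape[1]
    weights = rng.random((gh, gw, dim))            # 1. random initialization
    # grid coordinates of every unit, used by the neighborhood function
    coords = np.stack(np.meshgrid(np.arange(gh), np.arange(gw),
                                  indexing="ij"), axis=-1)
    for n in range(n_iter):
        x = data[rng.integers(len(data))]          # 2. sampling
        d = ((weights - x) ** 2).sum(axis=-1)      # 3. BMU via eq. (1)
        bmu = np.unravel_index(np.argmin(d), d.shape)
        alpha = alpha0 * np.exp(-n / n_iter)       # decaying learning rate
        r = r0 * np.exp(-n / n_iter)               # decaying radius
        dist2 = ((coords - np.array(bmu)) ** 2).sum(axis=-1)
        h = np.exp(-dist2 / (2 * r * r))           # Gaussian neighborhood
        weights += alpha * h[..., None] * (x - weights)  # 4. update, eq. (5)
    return weights                                 # 5. stop after n_iter steps

weights = train_som(rng.random((500, 3)))          # 500 random 3-D inputs
```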
For convergence and stability to be guaranteed, the
learning rate α(n) and the neighborhood radius r(n)
decrease towards zero with each iteration, [22].
The SOM Sample Hits plot, Fig. 8, shows the number of input
vectors that each unit in the SOM classifies, [24].
Figure 8. SOM Sample Hits, [24]
During the training process two phases may be
distinguished, [7,18]:
1. Self-organizing (ordering) phase:
Topological ordering in the map takes place (roughly the
first 1000 iterations). The learning rate α(n) and
neighborhood radius r(n) are decreasing.
2. Convergence (fine-tuning) phase:
This is fine tuning that provides an accurate statistical
representation of the input space. It typically lasts at
least (500 x number of neurons) iterations. The small
learning rate α(n) and neighborhood radius r(n) may
be kept fixed (e.g., at the last values from the previous phase).
After the training of the SOM is completed, neurons
may be labeled if labeled pattern vectors are available.
E. Classification
Find the best matching unit (BMU), c, (6):

$$c = \arg\min_i \left\| \mathbf{x} - \mathbf{w}_i \right\| \qquad (6)$$
Test pattern x belongs to the class represented by the best
matching unit c.
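A sketch of this classification step, reusing the trained `weights` array from the training sketch above; the `labels` array mapping grid units to classes is an assumption (obtained, as noted above, by labeling neurons from labeled pattern vectors):

```python
def classify(x, weights, labels):
    # BMU search by (squared) Euclidean distance, as in eq. (6)
    d = ((weights - x) ** 2).sum(axis=-1)
    bmu = np.unravel_index(np.argmin(d), d.shape)
    return labels[bmu]   # class of the best matching unit
```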
V. PROPERTIES OF SOM
After the convergence of the SOM algorithm, the resulting
feature map displays important statistical characteristics
of the input space. SOMs are also able to discover relevant
patterns or features present in the input data.
A. Important Properties of SOMs
SOMs have four important properties, [3,7]:
1. Approximation of the Input Space
The resulting mapping provides a good approximation
to the input space. The SOM also performs dimensionality
reduction by mapping multidimensional data onto the SOM grid.
2. Topological Ordering
Spatial locations of the neurons in the SOM lattice are
topologically related to the features of the input space.
3. Density Matching
The density of the output neurons in the map
approximates the statistical distribution of the input
space. Regions of the input space that contain more
training vectors are represented with more output neurons.
4. Feature Selection
The map extracts the principal features of the input space. It
is capable of selecting the best features for approximation
of the underlying statistical distribution of the input space.
B. Representing the Input Space with SOMs of Various
Topologies
1. 1-D
2D input data points are uniformly distributed in a
triangle; the 1D SOM ordering process is shown in Fig. 9, [2].
Figure 9. 2D to 1D mapping by a SOM (ordering process), [2]
2. 2-D
2D input data points are uniformly distributed in a
square; the 2D SOM ordering process is shown in Fig. 10, [3].
Figure 10. 2D to 2D mapping by a SOM (ordering process), [3]
3. Torus SOMs
In a conventional SOM, the size of the neighborhood set is
not always constant because the map has edges. This
problem can be mitigated by use of a torus SOM, which has
no edges, [25]. However, a torus SOM, Fig. 11, is not easy
to visualize, as its edges are now missing.
Figure 11. Torus SOM
4. Hierarchical SOMs
Besides the previous topologies, hierarchical SOMs should
also be mentioned. Hierarchical neural networks are
composed of multiple loosely connected neural networks
that form an acyclic graph. The outputs of lower-level
SOMs can be used as the input for a higher-level SOM,
Fig. 12, [10]. Such input can be formed from several vectors
from the Best Matching Units (BMUs) of many SOMs.
Figure 12. Hierarchical SOM, [10]
VI. APPLICATIONS
Despite their simplicity, SOMs can be used for various
classes of applications, [2,26,27]. This in a broad sense
includes visualization, generation of feature maps,
pattern recognition and classification. Kohonen in [2]
came up with the following categories of applications:
machine vision and image analysis, optical character
recognition and script reading, speech analysis and
recognition, acoustic and musical studies, signal
processing and radar measurements, telecommunications,
industrial and other real world measurements, process
control, robotics, chemistry, physics, design of electronic
circuits, medical applications without image processing,
data processing linguistic and AI problems, mathematical
problems and neurophysiological research. From such an
exhaustive list, as space permits, it is possible here to
mention only some applications that are interesting
and popular.
A. Speech Recognition
The neural phonetic typewriter for Finnish and
Japanese speech was developed by Kohonen in 1988,
[28]. The signal from the microphone proceeds to acoustic
preprocessing, shown in more detail in Fig. 13, forming a
15-component pattern vector (values in 15 frequency
bins taken every 10 ms) containing a short-time spectral
description of speech. These vectors are presented to a
SOM with a hexagonal lattice of size 8 x 12.
Figure 13. Acoustic preprocessing
After training, the resulting phonotopic map is shown in
Fig. 14, [7]. During speech recognition, new pattern
vectors are assigned the category belonging to the closest
prototype in the map.
Figure 14. Phonotopic map, [7]
B. Text Clustering
Text clustering is the technology of processing a large
number of texts to obtain their partition into groups.
Preparation of text for SOM analysis is shown in Fig. 15,
[29], and the complete framework in Fig. 16, [29]. Massive
document collections can be organized using a SOM. It can be
optimized to map large document collections while
preserving much of the classification accuracy. Clustering
of scientific articles is illustrated in Fig. 17, [30].
Figure 15. Preparation of text for SOM analysis, according to [29]
Figure 16. Framework for text clustering, [29]
Figure 17. Clustering of scientific articles, [30]
C. Application in Chemistry
SOMs have found applications in chemistry.
Illustration of the output layer of the SOM model using a
hexagonal grid for the combinatorial design of
cannabinoid compounds is shown in Fig. 18, [11].
Figure 18. Application of SOM in chemistry, [11]
D. Medical Imaging and Analysis
Recognition of diseases from medical images (ECG,
CAT scans, ultrasonic scans, etc.) can be performed by
SOMs, [21]. This includes image segmentation, Fig. 19,
[31], to discover regions of interest and help diagnostics.
Figure 19. Segmentation of hip image using SOM, [31]
E. Maritime Applications
SOMs have been used widely for maritime applications,
[22]. One example is the analysis of passive sonar recordings.
SOMs have also been used for planning ship trajectories.
F. Robotics
Some applications of SOMs in robotics are control of a robot
arm, learning the motion map and solving the traveling salesman
problem (a multi-goal path planning problem), Fig. 20, [32].
Figure 20. Traveling Salesman Problem, [32]
G. Classification of Satellite Images
SOMs can be used for interpreting satellite imagery,
such as land cover classification. Dust sources can also be
spotted in images using the SOM, as shown in Fig. 21, [33].
Figure 21. Detecting dust sources using SOMs, [33]
H. Psycholinguistic Studies
One example is the categorization of words by their
local context in three-word sentences of the type subject-
predicate-object or subject-predicate-predicative that
were constructed artificially. The words become clustered
by the SOM according to their linguistic roles in an orderly
fashion, Fig. 22, [18].
Figure 22. SOM in psycholinguistic studies, [18]
I. Exploring Music Collections
Similarity of music recordings may be determined by
analyzing the lyrics, instrumentation, melody, rhythm,
artists, or the emotions they evoke, Fig. 23, [34].
Figure 23. Exploring music collections, [34]
J. Business Applications
Customer segmentation of the international tourist
market is illustrated in Fig. 24, [35]. Another example is
classifying world poverty (welfare map), [36]. Ordering
of items with respect to 39 features describing various
quality-of-life factors, such as state of health, nutrition, and
educational services, is shown in Fig. 25. Countries with
similar quality-of-life factors cluster together on the map.
Figure 24. Customer segmentation of the international tourist market,[35]
Figure 25. Poverty map based on 39 indicators from World Bank
statistics (1992), [36]
VII. CONCLUSION
Self-organizing maps (SOMs) are a neural network
architecture inspired by the biological structure of human
and animal brains. They have become one of the most popular
neural network architectures. SOMs learn without an external
teacher, i.e., they employ unsupervised learning. Topologically,
SOMs most often use a two-dimensional grid, although
one-dimensional, higher-dimensional and irregular grids
are also possible. A SOM maps a higher-dimensional input
onto a lower-dimensional grid while preserving the
topological ordering present in the input space. During
competitive learning, the SOM uses lateral interactions among
the neurons to form a semantic map in which similar
patterns are mapped closer together than dissimilar ones.
SOMs can be used for a broad range of applications such as
visualization, generation of feature maps, pattern
recognition and classification. Humans cannot visualize
high-dimensional data, so SOMs, by mapping such
data to a two-dimensional grid, are widely used for data
visualization. SOMs are also suitable for the generation of
feature maps. Because they can detect clusters of similar
patterns without supervision, SOMs are a powerful tool
for the identification and classification of spatio-temporal
patterns. SOMs can be used as an analytical tool, but also
in a myriad of real-world applications including science,
medicine, satellite imaging and industry.
REFERENCES
[1] T. Kohonen, “Self-organized formation of topologically correct
feature maps”, Biol. Cybern. 43, pp. 59-69, 1982
[2] T. Kohonen, Self-Organizing Maps, 2nd ed., Springer 1997
[3] S. Haykin, Neural Networks: A Comprehensive Foundation, 2nd
ed., Prentice Hall PTR Upper Saddle River, NJ, USA, 1998
[4] K. Gurney, An introduction to neural network, UCL Press
Limited, London, UK, 1997
[5] D. Kriesel, A Brief Introduction to Neural Networks,
http://www.dkriesel.com
[6] R. Rojas: Neural Networks, A Systematic Introduction, Springer-
Verlag, Berlin, 1996
[7] J. A. Bullinaria, Introduction to Neural Networks - Course
Material and Useful Links, http://www.cs.bham.ac.uk/~jxb/NN/
[8] C. M. Bishop, Neural Networks for Pattern Recognition,
Clarendon Press, Oxford, 1997
[9] R. Eckmiller, C. Malsburg, Neural Computers, NATO ASI Series,
Computer and Systems Sciences, 1988
[10] P. Hodju and J. Halme, Neural Networks Information Homepage,
http://phodju.mbnet.fi/nenet/SiteMap/SiteMap.html
[11] Káthia Maria Honório and A. B. F. da Silva, “Applications of
artificial neural networks in chemical problems”, in Artificial
Neural Networks - Architectures and Applications, InTech, 2013
[12] W. Banzhafl, “Self-organizing systems”, in Encyclopedia of
Complexity and Systems Science, 2009, Springer, Heidelberg,
[13] A. M. Turing, “The chemical basis of morphogenesis”,
Philosophical Transactions of the Royal Society of London. Series
B, Biological Sciences, Vol. 237, No.641. pp.37-72, Aug. 14, 1952
[14] W. R. Ashby, “Principles of the self-organizing system”, E:CO
Special Double Issue Vol. 6, No. 1-2, pp. 102-126, 2004
[15] C. Fuchs, “Self-organizing system”, in Encyclopedia of
Governance, Vol. 2, SAGE Publications, 2006, pp. 863-864
[16] J. Howard, “Self-organisation in biology”, in Research
Perspectives 2010+ of the Max Planck Society, 2010, pp. 28-29
[17] The Wizard of Ads Brain Map - Wernicke and Broca,
https://www.wizardofads.com.au/brain-map-brocas-area/
[18] T. Kohonen, MATLAB Implementations and Applications of the
Self-Organizing Map, Unigrafia, Helsinki, Finland, 2014
[19] Bill Wilson, Self-organisation Notes, 2010,
www.cse.unsw.edu.au/~billw/cs9444/selforganising-10-4up.pdf
[20] J. Boedecker, Self-Organizing Map (SOM), .ppt, Machine
Learning, Summer 2015, Machine Learning Lab, Univ. of Freiburg
[21] L. Grajciarova, J. Mares, P. Dvorak and A. Prochazka, Biomedical
image analysis using self-organizing maps, Matlab Conference 2012
[22] V. J. A. S. Lobo, “Application of Self-Organizing Maps to the
Maritime Environment”, Proc. IF&GIS 2009, 20 May 2009, St.
Petersburg, Russia, pp. 19-36
[23] A. A. Akinduko and E. M. Mirkes, “Initialization of self-organizing
maps: principal components versus random initialization. A case
study”, Information Sciences, Vol. 364, Is. C, pp. 213-221, Oct. 2016
[24] MathWorks, Self-Organizing Maps,
https://www.mathworks.com/help/nnet/ug/cluster-with-self-
organizing-map-neural-network.html
[25] M. Ito, T. Miyoshi, and H. Masuyama, “The characteristics of the
torus self organizing map”, Proc. 6th Int. Conf. on Soft Computing
(IIZUKA’2000), Iizuka, Fukuoka, Japan, Oct. 1-4, 2000, pp. 239-244
[26] M. Johnsson ed., Applications of Self-Organizing Maps, InTech,
November 21, 2012
[27] J. I. Mwasiagi (ed.), Self Organizing Maps - Applications and
Novel Algorithm Design, InTech, 2011
[28] T. Kohonen, “The ‘neural’ phonetic typewriter”, IEEE Computer
21(3), pp. 11–22, 1988
[29] Yuan-Chao Liu, Ming Liu and Xiao-Long Wang, “Application of
self-organizing maps in text clustering: a review”, in “Self Organizing
Maps - Applications and Novel Algorithm Design”, InTech, 2012
[30] K. W. Boyack et al., Supplementary information on data and
methods for “Clustering more than two million biomedical
publications: comparing the accuracies of nine text-based
similarity approaches”, PLoS ONE 6(3): e18029, 2011
[31] A. Aslantas, D. Emre and M. Çakiroğlu, “Comparison of
segmentation algorithms for detection of hotspots in bone
scintigraphy images and effects on CAD systems”, Biomedical
Research, 28 (2), pp. 676-683, 2017
[32] J. Faigl, “Multi-goal path planning for cooperative sensing”, PhD
Thesis, Czech Technical University of Prague, February 2010
[33] D. Lairy, Machine Learning for Scientific Applications, slides,
https://www.slideshare.net/davidlary/machine-learning-for-scientific-applications
[34] E. Pampalk, S. Dixon and G. Widmer, “Exploring music
collections by browsing different views”, Computer Musical
Journal, Vol. 28, No. 2, pp. 49-62, Summer 2004
[35] J. Z. Bloom, “Market segmentation - a neural network application”,
Annals of Tourism Research, Vol. 32, No. 1, pp. 93–111, 2005
[36] World Poverty Map, SOM research page, Univ. of Helsinki,
http://www.cis.hut.fi/research/som-research/worldmap.html
Convolutional Neural Network (CNN)
A Convolutional Neural Network (CNN) is a class of ANNs.
CNNs were developed primarily in response to the challenges of image
recognition.
CNN architectures are strongly influenced by current neuroscience models
of the organization of human and animal visual perception.
The central convolution mechanisms of CNNs are inspired by receptive fields
and their direct connections to specific neuron structures.
The implementation of these mechanisms is based on the concept of the
convolution function in mathematics.
CNNs use relatively little pre-processing compared to other image
classification algorithms. This means that the network learns the filters that in
traditional algorithms were hand-engineered. This independence from prior
knowledge and human effort in feature design is a major advantage.
Image Recognition
The classical problem in computer vision is that of determining whether or not the
image data contains some specific object, feature, or activity. Different varieties of the
recognition problem are:
Object recognition or object classification – one or several pre-specified or learned
objects or object classes can be recognized, usually together with their 2D positions in
the image or 3D poses in the scene.
Identification – an individual instance of an object is recognized. Examples include
identification of a specific person's face or fingerprint, identification of handwritten
digits or letters or identification of a specific object.
Detection – the image data are scanned for a specific condition. Examples include
detection of possible abnormal cells or tissues in medical images or detection of a
vehicle in an automatic road toll system. Detection based on relatively simple and fast
computations is sometimes used for finding smaller regions of interesting image data
which can be further analyzed by more computationally demanding techniques to
produce a correct interpretation.
ImageNet
The ImageNet project is a large visual database designed for use in visual
object recognition software research.
More than 14 million images have been hand-annotated by the project to
indicate what objects are pictured.
ImageNet contains more than 20,000 categories with a typical category
consisting of several hundred images.
Since 2010, the ImageNet project has run an annual software contest, the
ImageNet Large Scale Visual Recognition Challenge (ILSVRC), where
software programs compete to correctly classify and detect objects and
scenes. The challenge uses a specially selected list of one thousand non-
overlapping classes.
[Slide diagram: alternative architectures for image recognition. A classical ANN relies on manual mapping; a CNN performs automated mapping from image input in standard pixel form to a compact symbolic characterization of the image as output.]
Image Recognition Systems
Input to image recognition systems: finite arrays of pixels
RGB Images
An RGB image, sometimes referred to as a true-color image, is an
m-by-n-by-3 data array RGB(.., .., ..) that defines red, green, and blue
color components for each individual pixel. The color of each pixel is
determined by the combination of the red, green, and blue intensities
stored in each color plane at the pixel's location.
An RGB color component is a value between 0 and 1. A pixel whose
color components are (0,0,0) displays as black, and a pixel whose
color components are (1,1,1) displays as white.
The three color components for each pixel are stored along the third
dimension of the data array.
For example, the red, green, and blue color components of the pixel
(2,3) are stored in RGB(2,3,1), RGB(2,3,2), and RGB(2,3,3),
respectively. Suppose RGB(2,3,1) contains the value 0.5176, RGB(2,3,2)
contains 0.1608, and RGB(2,3,3) contains 0.0627. The color of the pixel
at (2,3) is then (0.5176, 0.1608, 0.0627).
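A small NumPy illustration of this layout (note that NumPy indexing is zero-based, while the text above uses one-based MATLAB-style indices; the values are the ones from the example):

```python
import numpy as np

rgb = np.zeros((4, 4, 3))             # an m-by-n-by-3 true-color image
rgb[1, 2] = [0.5176, 0.1608, 0.0627]  # pixel (2,3) in one-based terms

# The three color planes lie along the third dimension:
red_plane, green_plane, blue_plane = rgb[..., 0], rgb[..., 1], rgb[..., 2]
print(rgb[1, 2])   # -> [0.5176 0.1608 0.0627]
```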
Output from an Image Recognition system
One or several object categories (classes) present in the image
Specific objects (instances) present in the image
Subset of features of object and/or categories observable in the image
Topological and Geometrical aspects of the image
Dynamic properties of elements in the image (requires sequences of images)
All the above elements can be represented in symbolic and numeric form.
A feature vector is still the default option.
The Human Visual System
[Slide diagram: the pathway from the eye via the superior colliculus and the dorsal LGN to V1 (striate cortex), then through the extrastriate cortex (V2, V3, V3A, V4, V5) to the posterior parietal cortex and the inferior temporal cortex. Abbreviations: STS = superior temporal sulcus; TEO, TE = inferior temporal cortex.]
The Organization of the Visual Cortex
[Slide diagram: the dorsal and ventral processing streams, both originating in V1.]
The connections between Receptive fields
and Neurons in the Visual Cortex
Nobel prize-awarded work by Hubel and Wiesel in the 1950s and 1960s showed that cat and monkey
visual cortexes contain neurons that individually respond to small regions of the visual field.
Provided the eyes are not moving, the region of visual space within which visual stimuli affect the firing of a
single neuron is called its receptive field. Neighboring neurons have similar and overlapping receptive fields.
Receptive field sizes and locations vary systematically to form a complete map of visual space. The response
of a specific neuron to a subset of stimuli within its receptive field is called neuronal tuning.
A 1968 article by Hubel and Wiesel identified two basic visual cell types in the brain:
• simple cells, whose output is maximized by straight edges having particular orientations within their
receptive field. Neurons of this kind are located in the earlier visual areas (like V1).
• complex cells, which have larger receptive fields and whose output is insensitive to the exact position of the
edges in the field. In the higher visual areas, neurons have complex tuning. For example, in the inferior
temporal cortex, a neuron may fire only when a certain face appears in its receptive field.
Hubel and Wiesel also proposed a cascading model of these two types of cells for use in pattern recognition tasks.
Convolution as defined in Mathematics
Convolution is a mathematical operation on two functions
(f and g) to produce a third function that expresses how
the shape of one is modified by the other.
• Express each function in terms of a dummy variable a.
• Reflect one of the functions: g(a) → g(-a)
• Add a time-offset, x, which allows g to slide along
the a-axis from −∞ to +∞.
• Wherever the two functions intersect, find the integral
of their product.
• In other words, compute a sliding, weighted-sum of
function f(a) where the weighting function is g(-a)
• The resulting waveform is the convolution of
functions f and g.
The term convolution refers to both the result function and
to the process of computing it. Convolution is similar to
cross-correlation and related to autocorrelation.
Example
Compute the convolution of f and g, f*g.
[Slide figure: a worked graphical example with two box-shaped functions f and g drawn over the interval [-2, 2]. The weight function g is reflected to g(-a) and slid along the axis by the offset x; wherever the two functions overlap, the integral of their product is taken. Where they do not overlap, f*g = 0. The final panel shows the resulting waveform of f*g.]
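For a discrete analogue of this sliding weighted sum, here is a short NumPy sketch; the signals f and g are illustrative assumptions, and `np.convolve` performs the reflect-and-slide computation described above:

```python
import numpy as np

f = np.array([0.0, 1.0, 1.0, 0.0])    # assumed sample signal
g = np.array([0.5, 1.0, 0.5])         # assumed weight function

out = np.convolve(f, g, mode="full")  # reflect g, slide, sum products
print(out)                            # length len(f) + len(g) - 1 = 6
```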
A typical Convolutional Neural Network Architecture
Convolutional Neural Network related Terminology
• Convolution
• Filter (or, synonymously, Kernel)
• Stride
• Padding
• Feature map
• Parameter sharing
• Local connectivity
• Pooling (subsampling, downsampling)
• Subsampling ratio
• Max pooling, average pooling
• ReLU
• Softmax
The Feature Learning Phase
The feature learning phase in a CNN consists of an arbitrary number of
pairs of Convolution and Pooling layers.
The number and roles of these pairs of layers are engineering decisions for
particular problem settings, but in general later (deeper) levels handle more
abstract or high-level features or patterns, in analogy with our assumed
model of the functioning of the human visual cortex.
Example
An input image of RGB type.
Convolution for one Filter in a Convolution Layer
In our example we take a 5*5*3 filter and slide it over the
input array with a stride of 1.
Let us disregard the color dimension for a moment.
In each step of the slide, take the dot product between the
filter elements and the elements of the subarea of the input
array that the filter currently covers. Each dot product
yields a scalar.
There are 28*28 unique positions where the filter can be
placed on the image, and therefore the total result is a
feature map: a 28x28x1 array.
If the stride is larger than 1, the feature map becomes
smaller.
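The 28x28 figure follows from the standard output-size relation, assuming a 32x32 input here; W is the input width, F the filter width, S the stride and P the (optional) padding:

$$W_{\text{out}} = \frac{W - F + 2P}{S} + 1, \qquad \frac{32 - 5 + 0}{1} + 1 = 28.$$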
Example of a filter and a single convolution operation
The input is a 7x7 array with 49 elements. The filter is shown in the middle. The
filter size is 3x3 (black and white), and the stride is 1. This is an example of a filter
that detects diagonal patterns (1s on the diagonal). The output is a 5x5 array with
25 elements.
We slide the filter systematically across the input array (in analogy with convolution).
There are 25 distinct sliding positions. For each position we calculate the
elementwise dot product and put the result in the output matrix.
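A minimal NumPy sketch of this sliding dot product (strictly a cross-correlation, since CNNs usually skip the kernel flip; the 7x7 input values are random placeholders):

```python
import numpy as np

def conv2d(image, kernel, stride=1):
    """'Valid' sliding dot product: place the kernel at every position,
    multiply elementwise and sum, as in the example above."""
    H, W = image.shape
    kH, kW = kernel.shape
    oH = (H - kH) // stride + 1
    oW = (W - kW) // stride + 1
    out = np.zeros((oH, oW))
    for i in range(oH):
        for j in range(oW):
            patch = image[i*stride:i*stride+kH, j*stride:j*stride+kW]
            out[i, j] = (patch * kernel).sum()
    return out

diag = np.eye(3)                 # 3x3 filter with 1s on the diagonal
img = np.random.rand(7, 7)       # assumed 7x7 input
print(conv2d(img, diag).shape)   # -> (5, 5): the 25 sliding positions
```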
Padding
Depending on the size of the input array, the size of the
filters and the stride, the sliding process can fail to apply
the filter to some input array elements.
A possibility is to 'pad' the original input array with a frame
and use the extended array as the basis for the convolution.
Whether this is beneficial for the process or not depends on
the specific situation.
If padding is never used, the arrays shrink rapidly,
but if padding is used systematically the size of the
arrays is kept up.
Repeated convolution for all
filters in a convolution layer
Each convolution layer comprises a set of independent
filters.
Each filter is independently convolved with the input
image.
In the example there are 6 filters in this first convolution
layer, which generates 6 feature maps of shape 28*28*1.
Pooling (Subsampling) Layer
A pooling layer is frequently used in a convolutional neural
network with the purpose of progressively reducing the spatial
size of the representation, to reduce the number of features and
the computational complexity of the network.
A pooling layer operates on each feature map independently.
The main reason for the pooling layer is to prevent the model
from overfitting.
The choices of filter size, stride (and possibly padding) are also
relevant for the pooling phases.
The most common approach used in pooling is max pooling.
As an example, a MAXPOOL of 2 x 2 would cause a filter of 2
by 2 to traverse the entire matrix with a stride of 2 and
pick the largest element from each window to be included in the
next representation map. Average pooling takes the average instead.
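A minimal NumPy sketch of 2x2 max pooling with stride 2, as described above; the 28x28 input is an assumed placeholder:

```python
import numpy as np

def max_pool(fmap, size=2, stride=2):
    """Slide a size x size window with the given stride over one
    feature map and keep the largest element of each window."""
    H, W = fmap.shape
    oH, oW = (H - size) // stride + 1, (W - size) // stride + 1
    out = np.zeros((oH, oW))
    for i in range(oH):
        for j in range(oW):
            out[i, j] = fmap[i*stride:i*stride+size,
                             j*stride:j*stride+size].max()
    return out

print(max_pool(np.random.rand(28, 28)).shape)   # -> (14, 14)
```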
Two aspects of the neuron structures
in the convolution and pooling layers
Weight sharing
Based on the motivation that a certain feature/filter should treat all subareas of the visual
space similarly, the same weights should be employed within a convolution computation
phase. This brings down the complexity of the network.
Local connectivity
In contrast to a general ANN, the neuron connections in the input, convolution and pooling
layers are restricted, primarily motivated by the fact that specific neurons are allocated to only
small sub-areas of the total visual field. This also brings down the complexity of the network.
Flattening
When leaving the convolution and pooling
layers, and before entering the fully connected
layers, the output of the previous layers
is flattened.
By this is meant that the dimensions of the input
array from earlier phases are flattened out into
one large dimension.
For example, a 3-D array with a shape of
(10x10x10), when flattened, would become a 1-D
array with 1000 elements.
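The flattening step is a plain reshape, as in this one-line NumPy sketch of the (10x10x10) example:

```python
import numpy as np

maps = np.random.rand(10, 10, 10)   # assumed 3-D activation array
flat = maps.reshape(-1)             # flatten into one dimension
print(flat.shape)                   # -> (1000,)
```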
The Fully Connected Layers
The fully connected layers take as input a flattened array
representing the activation maps of high-level features from
earlier layers, and output an N-dimensional vector.
N is the number of classes that the program has to choose from.
For example, if the task is digit classification, N would be 10,
since there are 10 digits.
The fully connected layers determine which features best
correlate to a particular class.
If a Softmax activation function is used, each number in this N-dimensional vector
represents the probability of a certain class.
For example, if the resulting vector for a digit classification program is [0, 0.1,
0.1, 0.75, 0, 0, 0, 0, 0, 0.05], then this represents a 10% probability that the
image is a 1, a 10% probability that the image is a 2, a 75% probability that the
image is a 3, and a 5% probability that the image is a 9.
Activation functions used
ReLU (Rectified Linear Unit) and Leaky ReLU activation functions
The advantages are simplicity and efficiency.
Typically used in the convolution layers of a CNN.
Sigmoid and hyperbolic tangent (Tanh) functions
Sigmoid and Tanh are typically used for fully connected networks aimed at binary
classification problems. They can be used for the output layers of a CNN.
Softmax
Softmax is equivalent to Sigmoid for binary classification, but is primarily aimed at
the multi-class case, where the non-normalized output of a network is mapped onto a
probability distribution over the predicted output classes.
Typically used in the output layer of a CNN.
Gaussian activation function
Can be used for the output layers of a CNN.
For regression problems, the final layer typically has an identity activation.
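Minimal NumPy sketches of the activation functions listed above (the input scores are illustrative):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def leaky_relu(z, a=0.01):            # 'a' is an assumed leak coefficient
    return np.where(z > 0, z, a * z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    return np.tanh(z)

def softmax(z):
    e = np.exp(z - z.max())           # subtract max for numerical stability
    return e / e.sum()

scores = np.array([1.0, 2.0, 0.5])    # illustrative non-normalized outputs
print(softmax(scores))                # probability distribution over 3 classes
```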
LeNet-5 – A Classic CNN Architecture
In 1998, Yann LeCun, Léon Bottou, Yoshua Bengio and Patrick Haffner proposed a neural network architecture for
handwritten and machine-printed character recognition which they called LeNet-5. The architecture is
straightforward, simple to understand, and well suited as an introduction to CNNs.
The LeNet-5 architecture consists of:
• two sets of convolutional and average pooling (subsampling) layers followed by
• one flattening convolutional layer, then
• two fully-connected layers and finally
• one Softmax classifier.
First Layer
The input to LeNet-5 is a 32×32 grayscale
image, which passes through the first
convolutional layer with 6 feature maps or
filters of size 5×5 and a stride of one. The
image dimensions change from 32x32x1 to
28x28x6.
Second Layer
LeNet-5 then applies an average pooling
(sub-sampling) layer with a filter size of 2×2 and
a stride of two. The resulting image dimensions
are reduced to 14x14x6.
Third Layer
Next, there is a second convolutional layer with 16 feature maps of size
5×5 and a stride of 1. In this layer, only 10 out of 16 feature maps are
connected to the 6 feature maps of the previous layer, as shown below. The main
reason is to break the symmetry in the network and keep the number of
connections within reasonable bounds. That is why the number of training
parameters in this layer is 1,516 instead of 2,400, and similarly the number
of connections is 151,600 instead of 240,000.
Fourth Layer
The fourth layer (S4) is again an average pooling layer with filter size 2×2
and a stride of 2. This layer is the same as the second layer (S2) except it has
16 feature maps so the output will be reduced to 5x5x16.
Fifth Layer
The fifth layer (C5) is a fully connected convolutional layer with
120 feature maps each of size 1×1. Each of the 120 units in C5 is
connected to all the 400 nodes (5x5x16) in the fourth layer S4.
Sixth Layer
The sixth layer is a fully connected layer (F6) with 84 units.
Output Layer
Finally, there is a fully connected Softmax output layer ŷ with 10
possible values corresponding to the digits from 0 to 9.
LeNet-5 layers
Summary of the LeNet-5 architecture
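As an illustration, here is a minimal Keras-style sketch of the layer sequence described above. Two simplifications relative to the original are assumed: tanh activations throughout (a common modern substitution for the original squashing functions), and a C3 layer fully connected across the 6 input maps, giving 2,416 parameters rather than the 1,516 of the partially connected original:

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(6, 5, activation="tanh",
                           input_shape=(32, 32, 1)),     # C1: 28x28x6
    tf.keras.layers.AveragePooling2D(2, strides=2),      # S2: 14x14x6
    tf.keras.layers.Conv2D(16, 5, activation="tanh"),    # C3: 10x10x16
    tf.keras.layers.AveragePooling2D(2, strides=2),      # S4: 5x5x16
    tf.keras.layers.Conv2D(120, 5, activation="tanh"),   # C5: 1x1x120
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(84, activation="tanh"),        # F6: 84 units
    tf.keras.layers.Dense(10, activation="softmax"),     # output: digits 0-9
])
model.summary()
```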
Timeline for CNN
1980 The Neocognitron, introduced by Kunihiko Fukushima.
1987 Time delay neural networks (TDNN), introduced by Alex Waibel.
1989 A system to recognize hand-written ZIP Code numbers using convolutions based on
laboriously hand-designed filters, introduced by Yann LeCun.
1998 LeNet-5, a pioneering 7-level convolutional network by Yann LeCun et al.
2006 The first GPU implementation of a CNN, described by K. Chellapilla et al.
2012 AlexNet, a GPU-based CNN by Alex Krizhevsky, won the ImageNet
Large Scale Visual Recognition Challenge.
Deep learning
Yann LeCun, Yoshua Bengio & Geoffrey Hinton
Nature, Vol. 521, pp. 436-444, 28 May 2015, doi:10.1038/nature14539

Deep learning allows computational models that are composed of multiple processing layers to learn representations of
data with multiple levels of abstraction. These methods have dramatically improved the state-of-the-art in speech
recognition, visual object recognition, object detection and many other domains such as drug discovery and genomics.
Deep learning discovers intricate structure in large data sets by using the backpropagation algorithm to indicate how a
machine should change its internal parameters that are used to compute the representation in each layer from the
representation in the previous layer. Deep convolutional nets have brought about breakthroughs in processing images,
video, speech and audio, whereas recurrent nets have shone light on sequential data such as text and speech.

Machine-learning technology powers many aspects of modern
society: from web searches to content filtering on social net-
works to recommendations on e-commerce websites, and
it is increasingly present in consumer products such as cameras and
smartphones. Machine-learning systems are used to identify objects
in images, transcribe speech into text, match news items, posts or
products with users’ interests, and select relevant results of search.
Increasingly, these applications make use of a class of techniques called
deep learning.
Conventional machine-learning techniques were limited in their
ability to process natural data in their raw form. For decades, con-
structing a pattern-recognition or machine-learning system required
careful engineering and considerable domain expertise to design a fea-
ture extractor that transformed the raw data (such as the pixel values
of an image) into a suitable internal representation or feature vector
from which the learning subsystem, often a classifier, could detect or
classify patterns in the input.
Representation learning is a set of methods that allows a machine to
be fed with raw data and to automatically discover the representations
needed for detection or classification. Deep-learning methods are
representation-learning methods with multiple levels of representa-
tion, obtained by composing simple but non-linear modules that each
transform the representation at one level (starting with the raw input)
into a representation at a higher, slightly more abstract level. With the
composition of enough such transformations, very complex functions
can be learned. For classification tasks, higher layers of representation
amplify aspects of the input that are important for discrimination and
suppress irrelevant variations. An image, for example, comes in the
form of an array of pixel values, and the learned features in the first
layer of representation typically represent the presence or absence of
edges at particular orientations and locations in the image. The second
layer typically detects motifs by spotting particular arrangements of
edges, regardless of small variations in the edge positions. The third
layer may assemble motifs into larger combinations that correspond
to parts of familiar objects, and subsequent layers would detect objects
as combinations of these parts. The key aspect of deep learning is that
these layers of features are not designed by human engineers: they
are learned from data using a general-purpose learning procedure.
Deep learning is making major advances in solving problems that
have resisted the best attempts of the artificial intelligence commu-
nity for many years. It has turned out to be very good at discovering
intricate structures in high-dimensional data and is therefore applica-
ble to many domains of science, business and government. In addition
to beating records in image recognition [1-4] and speech recognition [5-7], it
has beaten other machine-learning techniques at predicting the activ-
ity of potential drug molecules [8], analysing particle accelerator data [9,10],
reconstructing brain circuits [11], and predicting the effects of mutations
in non-coding DNA on gene expression and disease [12,13]. Perhaps more
surprisingly, deep learning has produced extremely promising results
for various tasks in natural language understanding [14], particularly
topic classification, sentiment analysis, question answering [15] and lan-
guage translation [16,17].
We think that deep learning will have many more successes in the
near future because it requires very little engineering by hand, so it
can easily take advantage of increases in the amount of available com-
putation and data. New learning algorithms and architectures that are
currently being developed for deep neural networks will only acceler-
ate this progress.
Supervised learning
The most common form of machine learning, deep or not, is super-
vised learning. Imagine that we want to build a system that can classify
images as containing, say, a house, a car, a person or a pet. We first
collect a large data set of images of houses, cars, people and pets, each
labelled with its category. During training, the machine is shown an
image and produces an output in the form of a vector of scores, one
for each category. We want the desired category to have the highest
score of all categories, but this is unlikely to happen before training.
We compute an objective function that measures the error (or dis-
tance) between the output scores and the desired pattern of scores. The
machine then modifies its internal adjustable parameters to reduce
this error. These adjustable parameters, often called weights, are real
numbers that can be seen as ‘knobs’ that define the input–output func-
tion of the machine. In a typical deep-learning system, there may be
hundreds of millions of these adjustable weights, and hundreds of
millions of labelled examples with which to train the machine.
To properly adjust the weight vector, the learning algorithm com-
putes a gradient vector that, for each weight, indicates by what amount
the error would increase or decrease if the weight were increased by a
tiny amount. The weight vector is then adjusted in the opposite direc-
tion to the gradient vector.
The objective function, averaged over all the training examples, can
be seen as a kind of hilly landscape in the high-dimensional space of
weight values. The negative gradient vector indicates the direction
of steepest descent in this landscape, taking it closer to a minimum,
where the output error is low on average.
In practice, most practitioners use a procedure called stochastic
gradient descent (SGD). This consists of showing the input vector
for a few examples, computing the outputs and the errors, computing
the average gradient for those examples, and adjusting the weights
accordingly. The process is repeated for many small sets of examples
from the training set until the average of the objective function stops
decreasing. It is called stochastic because each small set of examples
gives a noisy estimate of the average gradient over all examples. This
simple procedure usually finds a good set of weights surprisingly
quickly when compared with far more elaborate optimization tech-
niques [18]. After training, the performance of the system is measured
on a different set of examples called a test set. This serves to test the
generalization ability of the machine — its ability to produce sensible
answers on new inputs that it has never seen during training.
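A minimal sketch of one SGD step as described in this paragraph; `grad_fn`, the example pairs and the learning rate are illustrative assumptions:

```python
import numpy as np

def sgd_step(w, grad_fn, batch, lr=0.01):
    """One SGD step: average the per-example gradients over a small
    batch and move the weights in the opposite direction.
    grad_fn(w, x, t) is assumed to return the gradient of the
    objective for a single (input, target) example."""
    g = np.mean([grad_fn(w, x, t) for x, t in batch], axis=0)
    return w - lr * g
```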
Many of the current practical applications of machine learning use
linear classifiers on top of hand-engineered features. A two-class linear
classifier computes a weighted sum of the feature vector components.
If the weighted sum is above a threshold, the input is classified as
belonging to a particular category.
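A two-class linear classifier in a few lines of NumPy, matching the description above; the weights `w`, bias `b` and threshold are assumed inputs:

```python
import numpy as np

def linear_classify(features, w, b=0.0, threshold=0.0):
    # Weighted sum of the feature-vector components against a threshold
    return features @ w + b > threshold
```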
Since the 1960s we have known that linear classifiers can only carve
their input space into very simple regions, namely half-spaces sepa-
rated by a hyperplane [19]. But problems such as image and speech recog-
nition require the input–output function to be insensitive to irrelevant
variations of the input, such as variations in position, orientation or
illumination of an object, or variations in the pitch or accent of speech,
while being very sensitive to particular minute variations (for example,
the difference between a white wolf and a breed of wolf-like white
dog called a Samoyed). At the pixel level, images of two Samoyeds in
different poses and in different environments may be very different
from each other, whereas two images of a Samoyed and a wolf in the
same position and on similar backgrounds may be very similar to each
other. A linear classifier, or any other ‘shallow’ classifier operating on
Figure 1 | Multilayer neural networks and backpropagation. a, A multi-
layer neural network (shown by the connected dots) can distort the input
space to make the classes of data (examples of which are on the red and
blue lines) linearly separable. Note how a regular grid (shown on the left)
in input space is also transformed (shown in the middle panel) by hidden
units. This is an illustrative example with only two input units, two hidden
units and one output unit, but the networks used for object recognition
or natural language processing contain tens or hundreds of thousands of
units. Reproduced with permission from C. Olah (http://colah.github.io/).
b, The chain rule of derivatives tells us how two small effects (that of a small
change of x on y, and that of y on z) are composed. A small change Δx in
x gets transformed first into a small change Δy in y by getting multiplied
by ∂y/∂x (that is, the definition of partial derivative). Similarly, the change
Δy creates a change Δz in z. Substituting one equation into the other
gives the chain rule of derivatives — how Δx gets turned into Δz through
multiplication by the product of ∂y/∂x and ∂z/∂y. It also works when x,
y and z are vectors (and the derivatives are Jacobian matrices). c, The
equations used for computing the forward pass in a neural net with two
hidden layers and one output layer, each constituting a module through
which one can backpropagate gradients. At each layer, we first compute
the total input z to each unit, which is a weighted sum of the outputs of
the units in the layer below. Then a non-linear function f(.) is applied to
z to get the output of the unit. For simplicity, we have omitted bias terms.
The non-linear functions used in neural networks include the rectified
linear unit (ReLU) f(z) = max(0, z), commonly used in recent years, as
well as the more conventional sigmoids, such as the hyperbolic tangent,
f(z) = (exp(z) − exp(−z))/(exp(z) + exp(−z)), and the logistic function,
f(z) = 1/(1 + exp(−z)). d, The equations used for computing the backward
pass. At each hidden layer we compute the error derivative with respect to
the output of each unit, which is a weighted sum of the error derivatives
with respect to the total inputs to the units in the layer above. We then
convert the error derivative with respect to the output into the error
derivative with respect to the input by multiplying it by the gradient of f(z).
At the output layer, the error derivative with respect to the output of a unit
is computed by differentiating the cost function. This gives y_l − t_l if the cost
function for unit l is 0.5(y_l − t_l)^2, where t_l is the target value. Once ∂E/∂z_k
is known, the error derivative for the weight w_jk on the connection from
unit j in the layer below is just y_j ∂E/∂z_k.
[Figure 1 panels, reconstructed. a: a network with two input units, two sigmoid hidden units and one sigmoid output unit. b: the chain rule,

$$\Delta y = \frac{\partial y}{\partial x}\,\Delta x, \qquad \Delta z = \frac{\partial z}{\partial y}\,\Delta y \;\Rightarrow\; \Delta z = \frac{\partial z}{\partial y}\,\frac{\partial y}{\partial x}\,\Delta x.$$

c: the forward pass through input units i, hidden units j ∈ H1 and k ∈ H2, and output units l:

$$z_j = \sum_{i \in \text{Input}} w_{ij}\, x_i, \quad y_j = f(z_j); \qquad z_k = \sum_{j \in H1} w_{jk}\, y_j, \quad y_k = f(z_k); \qquad z_l = \sum_{k \in H2} w_{kl}\, y_k, \quad y_l = f(z_l).$$

d: the backward pass, comparing outputs with the correct answer to get the error derivatives:

$$\frac{\partial E}{\partial y_l} = y_l - t_l, \qquad \frac{\partial E}{\partial z_l} = \frac{\partial E}{\partial y_l}\,\frac{\partial y_l}{\partial z_l};$$

$$\frac{\partial E}{\partial y_k} = \sum_{l \in \text{out}} w_{kl}\,\frac{\partial E}{\partial z_l}, \qquad \frac{\partial E}{\partial z_k} = \frac{\partial E}{\partial y_k}\,\frac{\partial y_k}{\partial z_k};$$

$$\frac{\partial E}{\partial y_j} = \sum_{k \in H2} w_{jk}\,\frac{\partial E}{\partial z_k}, \qquad \frac{\partial E}{\partial z_j} = \frac{\partial E}{\partial y_j}\,\frac{\partial y_j}{\partial z_j}.$$]
raw pixels could not possibly distinguish the latter two, while putting
the former two in the same category. This is why shallow classifiers
require a good feature extractor that solves the selectivity–invariance
dilemma — one that produces representations that are selective to
the aspects of the image that are important for discrimination, but
that are invariant to irrelevant aspects such as the pose of the animal.
To make classifiers more powerful, one can use generic non-linear
features, as with kernel methods [20], but generic features such as those
arising with the Gaussian kernel do not allow the learner to general-
ize well far from the training examples [21]. The conventional option is
to hand design good feature extractors, which requires a consider-
able amount of engineering skill and domain expertise. But this can
all be avoided if good features can be learned automatically using a
general-purpose learning procedure. This is the key advantage of
deep learning.
A deep-learning architecture is a multilayer stack of simple mod-
ules, all (or most) of which are subject to learning, and many of which
compute non-linear input–output mappings. Each module in the
stack transforms its input to increase both the selectivity and the
invariance of the representation. With multiple non-linear layers, say
a depth of 5 to 20, a system can implement extremely intricate func-
tions of its inputs that are simultaneously sensitive to minute details
— distinguishing Samoyeds from white wolves — and insensitive to
large irrelevant variations such as the background, pose, lighting and
surrounding objects.
Backpropagation to train multilayer architectures
From the earliest days of pattern recognition [22,23], the aim of research-
ers has been to replace hand-engineered features with trainable
multilayer networks, but despite its simplicity, the solution was not
widely understood until the mid 1980s. As it turns out, multilayer
architectures can be trained by simple stochastic gradient descent.
As long as the modules are relatively smooth functions of their inputs
and of their internal weights, one can compute gradients using the
backpropagation procedure. The idea that this could be done, and
that it worked, was discovered independently by several different
groups during the 1970s and 1980s [24-27].
The backpropagation procedure to compute the gradient of an
objective function with respect to the weights of a multilayer stack
of modules is nothing more than a practical application of the chain
rule for derivatives. The key insight is that the derivative (or gradi-
ent) of the objective with respect to the input of a module can be
computed by working backwards from the gradient with respect to
the output of that module (or the input of the subsequent module)
(Fig. 1). The backpropagation equation can be applied repeatedly to
propagate gradients through all modules, starting from the output
at the top (where the network produces its prediction) all the way to
the bottom (where the external input is fed). Once these gradients
have been computed, it is straightforward to compute the gradients
with respect to the weights of each module.
Many applications of deep learning use feedforward neural net-
work architectures (Fig. 1), which learn to map a fixed-size input
(for example, an image) to a fixed-size output (for example, a prob-
ability for each of several categories). To go from one layer to the
next, a set of units compute a weighted sum of their inputs from the
previous layer and pass the result through a non-linear function. At
present, the most popular non-linear function is the rectified linear
unit (ReLU), which is simply the half-wave rectifier f(z)=max(z, 0).
In past decades, neural nets used smoother non-linearities, such as
tanh(z) or 1/(1+exp(−z)), but the ReLU typically learns much faster
in networks with many layers, allowing training of a deep supervised
network without unsupervised pre-training28
. Units that are not in
the input or output layer are conventionally called hidden units. The
hidden layers can be seen as distorting the input in a non-linear way
so that categories become linearly separable by the last layer (Fig. 1).
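For reference, the non-linearities mentioned above are one-liners in NumPy; this
is a generic sketch, not code from the paper:

    import numpy as np

    def relu(z):
        # Half-wave rectifier f(z) = max(z, 0); its derivative is 1 for z > 0,
        # so gradients pass through without shrinking, one reason deep stacks
        # of ReLUs train faster than saturating units.
        return np.maximum(z, 0.0)

    def logistic(z):
        # The smooth 'sigmoid' 1/(1 + exp(-z)); saturates at 0 and 1.
        return 1.0 / (1.0 + np.exp(-z))

    z = np.linspace(-3.0, 3.0, 7)
    print(relu(z))       # zero for negative inputs, identity for positive
    print(np.tanh(z))    # saturates at -1 and 1
    print(logistic(z))   # saturates at 0 and 1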
In the late 1990s, neural nets and backpropagation were largely
forsaken by the machine-learning community and ignored by the
computer-vision and speech-recognition communities. It was widely
thought that learning useful, multistage, feature extractors with lit-
tle prior knowledge was infeasible. In particular, it was commonly
thought that simple gradient descent would get trapped in poor local
minima — weight configurations for which no small change would
reduce the average error.
In practice, poor local minima are rarely a problem with large net-
works. Regardless of the initial conditions, the system nearly always
reaches solutions of very similar quality. Recent theoretical and
empirical results strongly suggest that local minima are not a serious
issue in general. Instead, the landscape is packed with a combinato-
rially large number of saddle points where the gradient is zero, and
the surface curves up in most dimensions and curves down in the
remainder29,30. The analysis seems to show that saddle points with only a few
downward curving directions are present in very large numbers, but almost all of
them have very similar values of the objective function. Hence, it does not much
matter which of these saddle points the algorithm gets stuck at.

Figure 2 | Inside a convolutional network. The outputs (not the filters) of each
layer (horizontally) of a typical convolutional network architecture applied to
the image of a Samoyed dog (bottom left; and RGB (red, green, blue) inputs,
bottom right). Each rectangular image is a feature map corresponding to the
output for one of the learned features, detected at each of the image positions.
Information flows bottom up, with lower-level features acting as oriented edge
detectors, and a score is computed for each image class in output; the stages
visible in the figure combine convolution-and-ReLU layers with max-pooling
layers, and Samoyed receives the highest score (16), ahead of Papillon (5.7),
Pomeranian (2.7), Arctic fox (1.0), Eskimo dog (0.6), white wolf (0.4) and
Siberian husky (0.4). ReLU, rectified linear unit.
Interest in deep feedforward networks was revived around 2006
(refs 31–34) by a group of researchers brought together by the Cana-
dian Institute for Advanced Research (CIFAR). The researchers intro-
duced unsupervised learning procedures that could create layers of
feature detectors without requiring labelled data. The objective in
learning each layer of feature detectors was to be able to reconstruct
or model the activities of feature detectors (or raw inputs) in the layer
below. By ‘pre-training’ several layers of progressively more complex
feature detectors using this reconstruction objective, the weights of a
deep network could be initialized to sensible values. A final layer of
output units could then be added to the top of the network and the
whole deep system could be fine-tuned using standard backpropaga-
tion33–35
. This worked remarkably well for recognizing handwritten
digits or for detecting pedestrians, especially when the amount of
labelled data was very limited36
.
The first major application of this pre-training approach was in
speech recognition, and it was made possible by the advent of fast
graphics processing units (GPUs) that were convenient to program37
and allowed researchers to train networks 10 or 20 times faster. In
2009, the approach was used to map short temporal windows of coef-
ficients extracted from a sound wave to a set of probabilities for the
various fragments of speech that might be represented by the frame
in the centre of the window. It achieved record-breaking results on a
standard speech recognition benchmark that used a small vocabu-
lary38
and was quickly developed to give record-breaking results on
a large vocabulary task39
. By 2012, versions of the deep net from 2009
were being developed by many of the major speech groups6
and were
already being deployed in Android phones. For smaller data sets,
unsupervised pre-training helps to prevent overfitting40
, leading to
significantly better generalization when the number of labelled exam-
ples is small, or in a transfer setting where we have lots of examples
for some ‘source’ tasks but very few for some ‘target’ tasks. Once deep
learning had been rehabilitated, it turned out that the pre-training
stage was only needed for small data sets.
There was, however, one particular type of deep, feedforward net-
work that was much easier to train and generalized much better than
networks with full connectivity between adjacent layers. This was
the convolutional neural network (ConvNet)41,42
. It achieved many
practical successes during the period when neural networks were out
of favour and it has recently been widely adopted by the computer-
vision community.
Convolutional neural networks
ConvNets are designed to process data that come in the form of
multiple arrays, for example a colour image composed of three 2D
arrays containing pixel intensities in the three colour channels. Many
data modalities are in the form of multiple arrays: 1D for signals and
sequences, including language; 2D for images or audio spectrograms;
and 3D for video or volumetric images. There are four key ideas
behind ConvNets that take advantage of the properties of natural
signals: local connections, shared weights, pooling and the use of
many layers.
The architecture of a typical ConvNet (Fig. 2) is structured as a
series of stages. The first few stages are composed of two types of
layers: convolutional layers and pooling layers. Units in a convolu-
tional layer are organized in feature maps, within which each unit
is connected to local patches in the feature maps of the previous
layer through a set of weights called a filter bank. The result of this
local weighted sum is then passed through a non-linearity such as a
ReLU. All units in a feature map share the same filter bank. Differ-
ent feature maps in a layer use different filter banks. The reason for
this architecture is twofold. First, in array data such as images, local
groups of values are often highly correlated, forming distinctive local
motifs that are easily detected. Second, the local statistics of images
and other signals are invariant to location. In other words, if a motif
can appear in one part of the image, it could appear anywhere, hence
the idea of units at different locations sharing the same weights and
detecting the same pattern in different parts of the array. Mathemati-
cally, the filtering operation performed by a feature map is a discrete
convolution, hence the name.
Although the role of the convolutional layer is to detect local con-
junctions of features from the previous layer, the role of the pooling
layer is to merge semantically similar features into one. Because the
relative positions of the features forming a motif can vary somewhat,
reliably detecting the motif can be done by coarse-graining the posi-
tion of each feature. A typical pooling unit computes the maximum
of a local patch of units in one feature map (or in a few feature maps).
Neighbouring pooling units take input from patches that are shifted
by more than one row or column, thereby reducing the dimension of
the representation and creating an invariance to small shifts and dis-
tortions. Two or three stages of convolution, non-linearity and pool-
ing are stacked, followed by more convolutional and fully-connected
layers. Backpropagating gradients through a ConvNet is as simple as
through a regular deep network, allowing all the weights in all the
filter banks to be trained.
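To illustrate the two layer types just described, here is a minimal NumPy sketch
of one feature map computed by a single filter (the shared weights) followed by
ReLU and 2×2 max pooling. Real libraries batch many filters with optimized
kernels; this only shows the arithmetic, and the filter is an invented example.

    import numpy as np

    def feature_map(image, filt):
        # Local weighted sums with shared weights (strictly a cross-correlation;
        # flipping the filter gives the discrete convolution named in the text).
        H, W = image.shape
        h, w = filt.shape
        out = np.zeros((H - h + 1, W - w + 1))
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                out[i, j] = np.sum(image[i:i + h, j:j + w] * filt)
        return out

    def max_pool(fmap, size=2):
        # Maximum over non-overlapping size x size patches: coarse-grains the
        # position of each feature, buying invariance to small shifts.
        H, W = fmap.shape
        H, W = H - H % size, W - W % size
        return fmap[:H, :W].reshape(H // size, size, W // size, size).max(axis=(1, 3))

    image = np.random.default_rng(0).normal(size=(8, 8))
    filt = np.array([[1.0, -1.0], [1.0, -1.0]])        # crude vertical-edge detector
    fmap = np.maximum(feature_map(image, filt), 0.0)   # local weighted sum, then ReLU
    pooled = max_pool(fmap)
    print(fmap.shape, pooled.shape)                    # (7, 7) -> (3, 3)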
Deep neural networks exploit the property that many natural sig-
nals are compositional hierarchies, in which higher-level features
are obtained by composing lower-level ones. In images, local combi-
nations of edges form motifs, motifs assemble into parts, and parts
form objects. Similar hierarchies exist in speech and text from sounds
to phones, phonemes, syllables, words and sentences. The pooling
allows representations to vary very little when elements in the previ-
ous layer vary in position and appearance.
The convolutional and pooling layers in ConvNets are directly
inspired by the classic notions of simple cells and complex cells in
visual neuroscience43
, and the overall architecture is reminiscent of
the LGN–V1–V2–V4–IT hierarchy in the visual cortex ventral path-
way44
. When ConvNet models and monkeys are shown the same pic-
ture, the activations of high-level units in the ConvNet explain half
of the variance of random sets of 160 neurons in the monkey’s infer-
otemporal cortex45
. ConvNets have their roots in the neocognitron46
,
the architecture of which was somewhat similar, but did not have an
end-to-end supervised-learning algorithm such as backpropagation.
A primitive 1D ConvNet called a time-delay neural net was used for
the recognition of phonemes and simple words47,48
.
There have been numerous applications of convolutional net-
works going back to the early 1990s, starting with time-delay neu-
ral networks for speech recognition47
and document reading42
. The
document reading system used a ConvNet trained jointly with a
probabilistic model that implemented language constraints. By the
late 1990s this system was reading over 10% of all the cheques in the
United States. A number of ConvNet-based optical character recog-
nition and handwriting recognition systems were later deployed by
Microsoft49
. ConvNets were also experimented with in the early 1990s
for object detection in natural images, including faces and hands50,51
,
and for face recognition52
.
Image understanding with deep convolutional networks
Since the early 2000s, ConvNets have been applied with great success to
the detection, segmentation and recognition of objects and regions in
images. These were all tasks in which labelled data was relatively abun-
dant, such as traffic sign recognition53
, the segmentation of biological
images54
particularly for connectomics55
, and the detection of faces,
text, pedestrians and human bodies in natural images36,50,51,56–58
. A major
recent practical success of ConvNets is face recognition59
.
Importantly, images can be labelled at the pixel level, which will have
applications in technology, including autonomous mobile robots and
self-driving cars60,61
. Companies such as Mobileye and NVIDIA are
using such ConvNet-based methods in their upcoming vision sys-
tems for cars. Other applications gaining importance involve natural
language understanding14
and speech recognition7
.
Despite these successes, ConvNets were largely forsaken by the
mainstream computer-vision and machine-learning communities
until the ImageNet competition in 2012. When deep convolutional
networks were applied to a data set of about a million images from
the web that contained 1,000 different classes, they achieved spec-
tacular results, almost halving the error rates of the best compet-
ing approaches1
. This success came from the efficient use of GPUs,
ReLUs, a new regularization technique called dropout62
, and tech-
niques to generate more training examples by deforming the existing
ones. This success has brought about a revolution in computer vision;
ConvNets are now the dominant approach for almost all recognition
and detection tasks4,58,59,63–65
and approach human performance on
some tasks. A recent stunning demonstration combines ConvNets
and recurrent net modules for the generation of image captions
(Fig. 3).
Recent ConvNet architectures have 10 to 20 layers of ReLUs, hun-
dreds of millions of weights, and billions of connections between
units. Whereas training such large networks could have taken weeks
only two years ago, progress in hardware, software and algorithm
parallelization have reduced training times to a few hours.
The performance of ConvNet-based vision systems has caused
most major technology companies, including Google, Facebook,
Microsoft, IBM, Yahoo!, Twitter and Adobe, as well as a quickly
growing number of start-ups to initiate research and development
projects and to deploy ConvNet-based image understanding products
and services.
ConvNets are easily amenable to efficient hardware implemen-
tations in chips or field-programmable gate arrays66,67
. A number
of companies such as NVIDIA, Mobileye, Intel, Qualcomm and
Samsung are developing ConvNet chips to enable real-time vision
applications in smartphones, cameras, robots and self-driving cars.
Distributed representations and language processing
Deep-learning theory shows that deep nets have two different expo-
nential advantages over classic learning algorithms that do not use
distributed representations21
. Both of these advantages arise from the
power of composition and depend on the underlying data-generating
distribution having an appropriate componential structure40
. First,
learning distributed representations enable generalization to new
combinations of the values of learned features beyond those seen
during training (for example, 2^n combinations are possible with n
binary features)68,69
. Second, composing layers of representation in
a deep net brings the potential for another exponential advantage70
(exponential in the depth).
Figure 3 | From image to text. Captions generated by a recurrent neural network
(RNN) taking, as extra input, the representation extracted by a deep convolutional
neural network (CNN) from a test image, with the RNN trained to 'translate'
high-level representations of images into captions (top). Reproduced with
permission from ref. 102. When the RNN is given the ability to focus its attention
on a different location in the input image (middle and bottom; the lighter patches
were given more attention) as it generates each word (bold), we found86 that it
exploits this to achieve better 'translation' of images into captions. Example
captions from the figure include 'A group of people shopping at an outdoor
market', 'A woman is throwing a frisbee in a park' and 'A dog is standing on a
hardwood floor'.

The hidden layers of a multilayer neural network learn to represent the
network's inputs in a way that makes it easy to predict the target outputs. This is
nicely demonstrated by training a multilayer neural network to predict the next
word in a sequence from a local
context of earlier words71
. Each word in the context is presented to
the network as a one-of-N vector, that is, one component has a value
of 1 and the rest are 0. In the first layer, each word creates a different
pattern of activations, or word vectors (Fig. 4). In a language model,
the other layers of the network learn to convert the input word vec-
tors into an output word vector for the predicted next word, which
can be used to predict the probability for any word in the vocabulary
to appear as the next word. The network learns word vectors that
contain many active components each of which can be interpreted
as a separate feature of the word, as was first demonstrated27
in the
context of learning distributed representations for symbols. These
semantic features were not explicitly present in the input. They were
discovered by the learning procedure as a good way of factorizing
the structured relationships between the input and output symbols
into multiple ‘micro-rules’. Learning word vectors turned out to also
work very well when the word sequences come from a large corpus
of real text and the individual micro-rules are unreliable71
. When
trained to predict the next word in a news story, for example, the
learned word vectors for Tuesday and Wednesday are very similar, as
are the word vectors for Sweden and Norway. Such representations
are called distributed representations because their elements (the
features) are not mutually exclusive and their many configurations
correspond to the variations seen in the observed data. These word
vectors are composed of learned features that were not determined
ahead of time by experts, but automatically discovered by the neural
network. Vector representations of words learned from text are now
very widely used in natural language applications14,17,72–76
.
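The first-layer mapping described above, from a one-of-N code to a dense word
vector, is simply a row lookup into a learned weight matrix. A toy sketch (the
vocabulary, dimension and random weights are invented; in practice the matrix is
learned by backpropagation):

    import numpy as np

    vocab = ["Tuesday", "Wednesday", "Sweden", "Norway", "the"]
    d = 3                                                      # embedding dimension
    E = np.random.default_rng(0).normal(size=(len(vocab), d))  # first-layer weights

    def one_hot(word):
        v = np.zeros(len(vocab))
        v[vocab.index(word)] = 1.0
        return v

    # Multiplying a one-of-N vector by E just selects one row: the word vector.
    assert np.allclose(one_hot("Sweden") @ E, E[vocab.index("Sweden")])

    def cosine(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    # After training, cosine(Tuesday, Wednesday) would be high; with these
    # untrained random vectors the value is arbitrary.
    print(cosine(E[vocab.index("Tuesday")], E[vocab.index("Wednesday")]))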
The issue of representation lies at the heart of the debate between
the logic-inspired and the neural-network-inspired paradigms for
cognition. In the logic-inspired paradigm, an instance of a symbol is
something for which the only property is that it is either identical or
non-identical to other symbol instances. It has no internal structure
that is relevant to its use; and to reason with symbols, they must be
bound to the variables in judiciously chosen rules of inference. By
contrast, neural networks just use big activity vectors, big weight
matrices and scalar non-linearities to perform the type of fast ‘intui-
tive’ inference that underpins effortless commonsense reasoning.
Before the introduction of neural language models71
, the standard
approach to statistical modelling of language did not exploit distrib-
uted representations: it was based on counting frequencies of occur-
rences of short symbol sequences of length up to N (called N-grams).
The number of possible N-grams is on the order of V^N
, where V is
the vocabulary size, so taking into account a context of more than a
handful of words would require very large training corpora. N-grams
treat each word as an atomic unit, so they cannot generalize across
semantically related sequences of words, whereas neural language
models can because they associate each word with a vector of real
valued features, and semantically related words end up close to each
other in that vector space (Fig. 4).
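For contrast, a count-based bigram model (N = 2) fits in a few lines; the toy
corpus is invented, and the point is that the count table grows as V^N, so long
contexts are infeasible:

    from collections import Counter

    corpus = "the cat sat on the mat the cat ran".split()
    V = len(set(corpus))

    unigrams = Counter(corpus)
    bigrams = Counter(zip(corpus, corpus[1:]))

    def p_next(word, prev):
        # Maximum-likelihood estimate of P(word | prev) from raw counts; no
        # generalization across semantically related words is possible.
        return bigrams[(prev, word)] / unigrams[prev]

    print(p_next("cat", "the"))  # 2/3: 'the' is followed by 'cat' twice, 'mat' once
    print(V ** 2)                # possible bigrams already number V**2 = 36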
Recurrent neural networks
When backpropagation was first introduced, its most exciting use was
for training recurrent neural networks (RNNs). For tasks that involve
sequential inputs, such as speech and language, it is often better to
use RNNs (Fig. 5). RNNs process an input sequence one element at a
time, maintaining in their hidden units a ‘state vector’ that implicitly
contains information about the history of all the past elements of
the sequence. When we consider the outputs of the hidden units at
different discrete time steps as if they were the outputs of different
neurons in a deep multilayer network (Fig. 5, right), it becomes clear
how we can apply backpropagation to train RNNs.
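The recurrence just described, using the same parameter names U, V and W as
Figure 5, takes only a few lines; the sizes and random initialization below are
invented for the sketch:

    import numpy as np

    rng = np.random.default_rng(0)
    n_in, n_hid, n_out = 4, 8, 3
    U = 0.1 * rng.normal(size=(n_hid, n_in))    # input -> state
    W = 0.1 * rng.normal(size=(n_hid, n_hid))   # state -> state, shared across time
    V = 0.1 * rng.normal(size=(n_out, n_hid))   # state -> output

    def rnn_forward(xs):
        # Process the sequence one element at a time, carrying the state vector s.
        s = np.zeros(n_hid)
        outputs = []
        for x in xs:                  # the same U, V, W are reused at every step
            s = np.tanh(U @ x + W @ s)
            outputs.append(V @ s)     # o_t depends on all earlier inputs through s
        return outputs

    sequence = [rng.normal(size=n_in) for _ in range(5)]
    outputs = rnn_forward(sequence)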
RNNs are very powerful dynamic systems, but training them has
proved to be problematic because the backpropagated gradients
either grow or shrink at each time step, so over many time steps they
typically explode or vanish77,78
.
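The effect is easy to see numerically: backpropagating through T time steps
multiplies the gradient by the recurrent Jacobian T times, so its norm scales
roughly like the T-th power of the largest singular value of W. A toy
demonstration with a linear recurrence, where the effect is exact:

    import numpy as np

    rng = np.random.default_rng(0)
    g = rng.normal(size=8)                    # gradient arriving at the final step
    for scale in (0.5, 1.5):
        W = scale * np.eye(8)                 # recurrent weights, singular values = scale
        grad = g.copy()
        for _ in range(50):                   # 50 steps of backpropagation through time
            grad = W.T @ grad
        print(scale, np.linalg.norm(grad))    # vanishes or explodes as scale**50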
Thanks to advances in their architecture79,80
and ways of training
them81,82
, RNNs have been found to be very good at predicting the
next character in the text83
or the next word in a sequence75
, but they
can also be used for more complex tasks. For example, after reading
an English sentence one word at a time, an English ‘encoder’ network
can be trained so that the final state vector of its hidden units is a good
representation of the thought expressed by the sentence. This thought
vector can then be used as the initial hidden state of (or as extra input
to) a jointly trained French ‘decoder’ network, which outputs a prob-
ability distribution for the first word of the French translation. If a
particular first word is chosen from this distribution and provided
as input to the decoder network it will then output a probability dis-
tribution for the second word of the translation and so on until a
full stop is chosen17,72,76
. Overall, this process generates sequences of
French words according to a probability distribution that depends on
the English sentence. This rather naive way of performing machine
translation has quickly become competitive with the state-of-the-art,
and this raises serious doubts about whether understanding a sen-
tence requires anything like the internal symbolic expressions that are
manipulated by using inference rules. It is more compatible with the
view that everyday reasoning involves many simultaneous analogies
that each contribute plausibility to a conclusion84,85.

Figure 4 | Visualizing the learned word vectors. On the left is an illustration of
word representations learned for modelling language, non-linearly projected to 2D
for visualization using the t-SNE algorithm103. On the right is a 2D representation
of phrases learned by an English-to-French encoder–decoder recurrent neural
network75. One can observe that semantically similar words or sequences of words
are mapped to nearby representations: in the figure, words such as 'community',
'organizations', 'institutions' and 'society' cluster together, as do phrases such
as 'over the past few months', 'In the last few days' and 'in the coming months'.
The distributed representations of words are obtained by using backpropagation to
jointly learn a representation for each word and a function that predicts a target
quantity such as the next word in a sequence (for language modelling) or a whole
sequence of translated words (for machine translation)18,75.
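The encoder–decoder procedure described above (read the source sentence into a
final 'thought vector', then decode word by word, feeding each chosen word back
in until a full stop is produced) can be outlined as follows. This sketch uses
untrained random weights and an invented toy vocabulary, so its output is
meaningless; it only shows the control flow.

    import numpy as np

    rng = np.random.default_rng(0)
    d, V = 8, 6                                 # state size, target vocabulary (toy)
    We, Ue = 0.1 * rng.normal(size=(d, d)), 0.1 * rng.normal(size=(d, d))
    Wd, Ud = 0.1 * rng.normal(size=(d, d)), 0.1 * rng.normal(size=(d, V))
    Wo = 0.1 * rng.normal(size=(V, d))          # decoder state -> word scores
    EOS = 0                                     # index standing in for the full stop

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    def translate(source, max_len=10):
        s = np.zeros(d)
        for x in source:                        # encoder reads the sentence word by word
            s = np.tanh(We @ s + Ue @ x)        # final s is the 'thought vector'
        out, word = [], EOS
        for _ in range(max_len):
            onehot = np.zeros(V); onehot[word] = 1.0
            s = np.tanh(Wd @ s + Ud @ onehot)   # decoder state, seeded by the encoder
            p = softmax(Wo @ s)                 # distribution over the next target word
            word = int(rng.choice(V, p=p))      # choose a word and feed it back in
            out.append(word)
            if word == EOS:
                break                           # stop when the full stop is chosen
        return out

    print(translate([rng.normal(size=d) for _ in range(5)]))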
Instead of translating the meaning of a French sentence into an
English sentence, one can learn to ‘translate’ the meaning of an image
into an English sentence (Fig. 3). The encoder here is a deep Con-
vNet that converts the pixels into an activity vector in its last hidden
layer. The decoder is an RNN similar to the ones used for machine
translation and neural language modelling. There has been a surge of
interest in such systems recently (see examples mentioned in ref. 86).
RNNs, once unfolded in time (Fig. 5), can be seen as very deep
feedforward networks in which all the layers share the same weights.
Although their main purpose is to learn long-term dependencies,
theoretical and empirical evidence shows that it is difficult to learn
to store information for very long78
.
To correct for that, one idea is to augment the network with an
explicit memory. The first proposal of this kind is the long short-term
memory (LSTM) networks that use special hidden units, the natural
behaviour of which is to remember inputs for a long time79
. A special
unit called the memory cell acts like an accumulator or a gated leaky
neuron: it has a connection to itself at the next time step that has a
weight of one, so it copies its own real-valued state and accumulates
the external signal, but this self-connection is multiplicatively gated
by another unit that learns to decide when to clear the content of the
memory.
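The gated leaky accumulator described here can be captured in a few lines: the
cell copies its own state through a self-connection of weight one, adds the
external signal, and a gate decides when to clear the memory. In a real LSTM the
gate values are themselves learned functions of the inputs; here they are simply
given, to isolate the mechanism.

    def memory_cell(inputs, keep_gates):
        # c_t = keep_t * c_{t-1} + x_t : a self-connection of weight one,
        # multiplicatively gated by keep_t in [0, 1].
        c, trace = 0.0, []
        for x, keep in zip(inputs, keep_gates):
            c = keep * c + x        # accumulate the external signal...
            trace.append(c)         # ...until the gate (keep = 0) clears the memory
        return trace

    # Remember 1.0 for three steps, then a gate value of 0 forgets it:
    print(memory_cell([1.0, 0.0, 0.0, 0.0, 5.0],
                      [1.0, 1.0, 1.0, 0.0, 1.0]))   # [1.0, 1.0, 1.0, 0.0, 5.0]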
LSTM networks have subsequently proved to be more effective
than conventional RNNs, especially when they have several layers for
each time step87
, enabling an entire speech recognition system that
goes all the way from acoustics to the sequence of characters in the
transcription. LSTM networks or related forms of gated units are also
currently used for the encoder and decoder networks that perform
so well at machine translation17,72,76
.
Over the past year, several authors have made different proposals to
augment RNNs with a memory module. Proposals include the Neural
Turing Machine in which the network is augmented by a ‘tape-like’
memory that the RNN can choose to read from or write to88
, and
memory networks, in which a regular network is augmented by a
kind of associative memory89
. Memory networks have yielded excel-
lent performance on standard question-answering benchmarks. The
memory is used to remember the story about which the network is
later asked to answer questions.
Beyond simple memorization, neural Turing machines and mem-
ory networks are being used for tasks that would normally require
reasoning and symbol manipulation. Neural Turing machines can
be taught ‘algorithms’. Among other things, they can learn to output
a sorted list of symbols when their input consists of an unsorted
sequence in which each symbol is accompanied by a real value that
indicates its priority in the list88
. Memory networks can be trained
to keep track of the state of the world in a setting similar to a text
adventure game and after reading a story, they can answer questions
that require complex inference90
. In one test example, the network is
shown a 15-sentence version of The Lord of the Rings and correctly
answers questions such as “where is Frodo now?”89
.
The future of deep learning
Unsupervised learning91–98
had a catalytic effect in reviving interest in
deep learning, but has since been overshadowed by the successes of
purely supervised learning. Although we have not focused on it in this
Review, we expect unsupervised learning to become far more important
in the longer term. Human and animal learning is largely unsupervised:
we discover the structure of the world by observing it, not by being told
the name of every object.
Human vision is an active process that sequentially samples the optic
array in an intelligent, task-specific way using a small, high-resolution
fovea with a large, low-resolution surround. We expect much of the
future progress in vision to come from systems that are trained end-to-
end and combine ConvNets with RNNs that use reinforcement learning
to decide where to look. Systems combining deep learning and rein-
forcement learning are in their infancy, but they already outperform
passive vision systems99
at classification tasks and produce impressive
results in learning to play many different video games100
.
Natural language understanding is another area in which deep learn-
ing is poised to make a large impact over the next few years. We expect
systems that use RNNs to understand sentences or whole documents
will become much better when they learn strategies for selectively
attending to one part at a time76,86
.
Ultimately, major progress in artificial intelligence will come about
through systems that combine representation learning with complex
reasoning. Although deep learning and simple reasoning have been
used for speech and handwriting recognition for a long time, new
paradigms are needed to replace rule-based manipulation of symbolic
expressions by operations on large vectors101
. ■
Received 25 February; accepted 1 May 2015.
1. Krizhevsky, A., Sutskever, I. & Hinton, G. ImageNet classification with deep
convolutional neural networks. In Proc. Advances in Neural Information
Processing Systems 25 1090–1098 (2012).
This report was a breakthrough that used convolutional nets to almost halve
the error rate for object recognition, and precipitated the rapid adoption of
deep learning by the computer vision community.
2. Farabet, C., Couprie, C., Najman, L. & LeCun, Y. Learning hierarchical features for
scene labeling. IEEE Trans. Pattern Anal. Mach. Intell. 35, 1915–1929 (2013).
3. Tompson, J., Jain, A., LeCun, Y. & Bregler, C. Joint training of a convolutional
network and a graphical model for human pose estimation. In Proc. Advances in
Neural Information Processing Systems 27 1799–1807 (2014).
4. Szegedy, C. et al. Going deeper with convolutions. Preprint at http://arxiv.org/
abs/1409.4842 (2014).
5. Mikolov, T., Deoras, A., Povey, D., Burget, L. & Cernocky, J. Strategies for training
large scale neural network language models. In Proc. Automatic Speech
Recognition and Understanding 196–201 (2011).
6. Hinton, G. et al. Deep neural networks for acoustic modeling in speech
recognition. IEEE Signal Processing Magazine 29, 82–97 (2012).
This joint paper from the major speech recognition laboratories, summarizing
the breakthrough achieved with deep learning on the task of phonetic
classification for automatic speech recognition, was the first major industrial
application of deep learning.
7. Sainath, T., Mohamed, A.-R., Kingsbury, B. & Ramabhadran, B. Deep
convolutional neural networks for LVCSR. In Proc. Acoustics, Speech and Signal
Processing 8614–8618 (2013).
8. Ma, J., Sheridan, R. P., Liaw, A., Dahl, G. E. & Svetnik, V. Deep neural nets as a
method for quantitative structure-activity relationships. J. Chem. Inf. Model. 55,
263–274 (2015).
9. Ciodaro, T., Deva, D., de Seixas, J. & Damazio, D. Online particle detection with
neural networks based on topological calorimetry information. J. Phys. Conf.
Series 368, 012030 (2012).
10. Kaggle. Higgs boson machine learning challenge. Kaggle https://www.kaggle.
com/c/higgs-boson (2014).
11. Helmstaedter, M. et al. Connectomic reconstruction of the inner plexiform layer
in the mouse retina. Nature 500, 168–174 (2013).
Figure 5 | A recurrent neural network and the unfolding in time of the
computation involved in its forward computation. The artificial neurons
(for example, hidden units grouped under node s with values st at time t) get
inputs from other neurons at previous time steps (this is represented with the
black square, representing a delay of one time step, on the left). In this way, a
recurrent neural network can map an input sequence with elements xt into an
output sequence with elements ot, with each ot depending on all the previous
xtʹ (for tʹ≤t). The same parameters (matrices U,V,W ) are used at each time
step. Many other architectures are possible, including a variant in which the
network can generate a sequence of outputs (for example, words), each of
which is used as inputs for the next time step. The backpropagation algorithm
(Fig. 1) can be directly applied to the computational graph of the unfolded
network on the right, to compute the derivative of a total error (for example,
the log-probability of generating the right sequence of outputs) with respect to
all the states st and all the parameters.
KCS-055 MLT U4.pdf
KCS-055 MLT U4.pdf

More Related Content

What's hot

Thesis on Hybrid renewable energy system for condo developments
Thesis on Hybrid renewable energy system for condo developmentsThesis on Hybrid renewable energy system for condo developments
Thesis on Hybrid renewable energy system for condo developmentsFasil Ayele
 
Natural language processing with python and amharic syntax parse tree by dani...
Natural language processing with python and amharic syntax parse tree by dani...Natural language processing with python and amharic syntax parse tree by dani...
Natural language processing with python and amharic syntax parse tree by dani...Daniel Adenew
 
Elements of dynamic programming
Elements of dynamic programmingElements of dynamic programming
Elements of dynamic programmingTafhim Islam
 
Evolutionary computing - soft computing
Evolutionary computing - soft computingEvolutionary computing - soft computing
Evolutionary computing - soft computingSakshiMahto1
 
PhD Defense of Teodoro Montanaro
PhD Defense of Teodoro MontanaroPhD Defense of Teodoro Montanaro
PhD Defense of Teodoro MontanaroTeodoro Montanaro
 
Event management by using cloud computing
Event management by using cloud computingEvent management by using cloud computing
Event management by using cloud computingLogesh Waran
 
Fuzzy logic Notes AI CSE 8th Sem
Fuzzy logic Notes AI CSE 8th SemFuzzy logic Notes AI CSE 8th Sem
Fuzzy logic Notes AI CSE 8th SemDigiGurukul
 
Database , 12 Reliability
Database , 12 ReliabilityDatabase , 12 Reliability
Database , 12 ReliabilityAli Usman
 
Artificial nueral network slideshare
Artificial nueral network slideshareArtificial nueral network slideshare
Artificial nueral network slideshareRed Innovators
 
Virtualization (Distributed computing)
Virtualization (Distributed computing)Virtualization (Distributed computing)
Virtualization (Distributed computing)Sri Prasanna
 
Rule Based Architecture System
Rule Based Architecture SystemRule Based Architecture System
Rule Based Architecture SystemFirdaus Adib
 
Concept learning
Concept learningConcept learning
Concept learningAmir Shokri
 
lazy learners and other classication methods
lazy learners and other classication methodslazy learners and other classication methods
lazy learners and other classication methodsrajshreemuthiah
 

What's hot (20)

Thesis on Hybrid renewable energy system for condo developments
Thesis on Hybrid renewable energy system for condo developmentsThesis on Hybrid renewable energy system for condo developments
Thesis on Hybrid renewable energy system for condo developments
 
Fuzzy expert system
Fuzzy expert systemFuzzy expert system
Fuzzy expert system
 
Natural language processing with python and amharic syntax parse tree by dani...
Natural language processing with python and amharic syntax parse tree by dani...Natural language processing with python and amharic syntax parse tree by dani...
Natural language processing with python and amharic syntax parse tree by dani...
 
Elements of dynamic programming
Elements of dynamic programmingElements of dynamic programming
Elements of dynamic programming
 
Evolutionary computing - soft computing
Evolutionary computing - soft computingEvolutionary computing - soft computing
Evolutionary computing - soft computing
 
PhD Defense of Teodoro Montanaro
PhD Defense of Teodoro MontanaroPhD Defense of Teodoro Montanaro
PhD Defense of Teodoro Montanaro
 
Fuzzy logic
Fuzzy logicFuzzy logic
Fuzzy logic
 
Event management by using cloud computing
Event management by using cloud computingEvent management by using cloud computing
Event management by using cloud computing
 
Fuzzy logic Notes AI CSE 8th Sem
Fuzzy logic Notes AI CSE 8th SemFuzzy logic Notes AI CSE 8th Sem
Fuzzy logic Notes AI CSE 8th Sem
 
Beowulf cluster
Beowulf clusterBeowulf cluster
Beowulf cluster
 
KCS-055 U5.pdf
KCS-055 U5.pdfKCS-055 U5.pdf
KCS-055 U5.pdf
 
Fuzzy Logic
Fuzzy LogicFuzzy Logic
Fuzzy Logic
 
Artificial Neural Network Topology
Artificial Neural Network TopologyArtificial Neural Network Topology
Artificial Neural Network Topology
 
Database , 12 Reliability
Database , 12 ReliabilityDatabase , 12 Reliability
Database , 12 Reliability
 
Artificial nueral network slideshare
Artificial nueral network slideshareArtificial nueral network slideshare
Artificial nueral network slideshare
 
Virtualization (Distributed computing)
Virtualization (Distributed computing)Virtualization (Distributed computing)
Virtualization (Distributed computing)
 
Rule Based Architecture System
Rule Based Architecture SystemRule Based Architecture System
Rule Based Architecture System
 
Concept learning
Concept learningConcept learning
Concept learning
 
lazy learners and other classication methods
lazy learners and other classication methodslazy learners and other classication methods
lazy learners and other classication methods
 
Classical Sets & fuzzy sets
Classical Sets & fuzzy setsClassical Sets & fuzzy sets
Classical Sets & fuzzy sets
 

Similar to KCS-055 MLT U4.pdf

Fuzzy Logic Final Report
Fuzzy Logic Final ReportFuzzy Logic Final Report
Fuzzy Logic Final ReportShikhar Agarwal
 
NETWORK LEARNING AND TRAINING OF A CASCADED LINK-BASED FEED FORWARD NEURAL NE...
NETWORK LEARNING AND TRAINING OF A CASCADED LINK-BASED FEED FORWARD NEURAL NE...NETWORK LEARNING AND TRAINING OF A CASCADED LINK-BASED FEED FORWARD NEURAL NE...
NETWORK LEARNING AND TRAINING OF A CASCADED LINK-BASED FEED FORWARD NEURAL NE...ijaia
 
NEURAL NETWORK FOR THE RELIABILITY ANALYSIS OF A SERIES - PARALLEL SYSTEM SUB...
NEURAL NETWORK FOR THE RELIABILITY ANALYSIS OF A SERIES - PARALLEL SYSTEM SUB...NEURAL NETWORK FOR THE RELIABILITY ANALYSIS OF A SERIES - PARALLEL SYSTEM SUB...
NEURAL NETWORK FOR THE RELIABILITY ANALYSIS OF A SERIES - PARALLEL SYSTEM SUB...IAEME Publication
 
CONVOLUTIONAL NEURAL NETWORK BASED FEATURE EXTRACTION FOR IRIS RECOGNITION
CONVOLUTIONAL NEURAL NETWORK BASED FEATURE EXTRACTION FOR IRIS RECOGNITION CONVOLUTIONAL NEURAL NETWORK BASED FEATURE EXTRACTION FOR IRIS RECOGNITION
CONVOLUTIONAL NEURAL NETWORK BASED FEATURE EXTRACTION FOR IRIS RECOGNITION AIRCC Publishing Corporation
 
CONVOLUTIONAL NEURAL NETWORK BASED FEATURE EXTRACTION FOR IRIS RECOGNITION
CONVOLUTIONAL NEURAL NETWORK BASED FEATURE EXTRACTION FOR IRIS RECOGNITION CONVOLUTIONAL NEURAL NETWORK BASED FEATURE EXTRACTION FOR IRIS RECOGNITION
CONVOLUTIONAL NEURAL NETWORK BASED FEATURE EXTRACTION FOR IRIS RECOGNITION ijcsit
 
Optimized Neural Network for Classification of Multispectral Images
Optimized Neural Network for Classification of Multispectral ImagesOptimized Neural Network for Classification of Multispectral Images
Optimized Neural Network for Classification of Multispectral ImagesIDES Editor
 
Employing Neocognitron Neural Network Base Ensemble Classifiers To Enhance Ef...
Employing Neocognitron Neural Network Base Ensemble Classifiers To Enhance Ef...Employing Neocognitron Neural Network Base Ensemble Classifiers To Enhance Ef...
Employing Neocognitron Neural Network Base Ensemble Classifiers To Enhance Ef...cscpconf
 
Modeling of neural image compression using gradient decent technology
Modeling of neural image compression using gradient decent technologyModeling of neural image compression using gradient decent technology
Modeling of neural image compression using gradient decent technologytheijes
 
Analytical Review on the Correlation between Ai and Neuroscience
Analytical Review on the Correlation between Ai and NeuroscienceAnalytical Review on the Correlation between Ai and Neuroscience
Analytical Review on the Correlation between Ai and NeuroscienceIOSR Journals
 
Artificial Neural Networks.pdf
Artificial Neural Networks.pdfArtificial Neural Networks.pdf
Artificial Neural Networks.pdfBria Davis
 
self operating maps
self operating mapsself operating maps
self operating mapsAltafSMT
 
NEURAL NETWORKS
NEURAL NETWORKSNEURAL NETWORKS
NEURAL NETWORKSESCOM
 
A Time Series ANN Approach for Weather Forecasting
A Time Series ANN Approach for Weather ForecastingA Time Series ANN Approach for Weather Forecasting
A Time Series ANN Approach for Weather Forecastingijctcm
 
APPLYING NEURAL NETWORKS FOR SUPERVISED LEARNING OF MEDICAL DATA
APPLYING NEURAL NETWORKS FOR SUPERVISED LEARNING OF MEDICAL DATAAPPLYING NEURAL NETWORKS FOR SUPERVISED LEARNING OF MEDICAL DATA
APPLYING NEURAL NETWORKS FOR SUPERVISED LEARNING OF MEDICAL DATAIJDKP
 
Microscopy images segmentation algorithm based on shearlet neural network
Microscopy images segmentation algorithm based on shearlet neural networkMicroscopy images segmentation algorithm based on shearlet neural network
Microscopy images segmentation algorithm based on shearlet neural networkjournalBEEI
 
Image Recognition With the Help of Auto-Associative Neural Network
Image Recognition With the Help of Auto-Associative Neural NetworkImage Recognition With the Help of Auto-Associative Neural Network
Image Recognition With the Help of Auto-Associative Neural NetworkCSCJournals
 

Similar to KCS-055 MLT U4.pdf (20)

Artificial Neural Networks
Artificial Neural NetworksArtificial Neural Networks
Artificial Neural Networks
 
P2-Artificial.pdf
P2-Artificial.pdfP2-Artificial.pdf
P2-Artificial.pdf
 
Fuzzy Logic Final Report
Fuzzy Logic Final ReportFuzzy Logic Final Report
Fuzzy Logic Final Report
 
NETWORK LEARNING AND TRAINING OF A CASCADED LINK-BASED FEED FORWARD NEURAL NE...
NETWORK LEARNING AND TRAINING OF A CASCADED LINK-BASED FEED FORWARD NEURAL NE...NETWORK LEARNING AND TRAINING OF A CASCADED LINK-BASED FEED FORWARD NEURAL NE...
NETWORK LEARNING AND TRAINING OF A CASCADED LINK-BASED FEED FORWARD NEURAL NE...
 
NEURAL NETWORK FOR THE RELIABILITY ANALYSIS OF A SERIES - PARALLEL SYSTEM SUB...
NEURAL NETWORK FOR THE RELIABILITY ANALYSIS OF A SERIES - PARALLEL SYSTEM SUB...NEURAL NETWORK FOR THE RELIABILITY ANALYSIS OF A SERIES - PARALLEL SYSTEM SUB...
NEURAL NETWORK FOR THE RELIABILITY ANALYSIS OF A SERIES - PARALLEL SYSTEM SUB...
 
CONVOLUTIONAL NEURAL NETWORK BASED FEATURE EXTRACTION FOR IRIS RECOGNITION
CONVOLUTIONAL NEURAL NETWORK BASED FEATURE EXTRACTION FOR IRIS RECOGNITION CONVOLUTIONAL NEURAL NETWORK BASED FEATURE EXTRACTION FOR IRIS RECOGNITION
CONVOLUTIONAL NEURAL NETWORK BASED FEATURE EXTRACTION FOR IRIS RECOGNITION
 
CONVOLUTIONAL NEURAL NETWORK BASED FEATURE EXTRACTION FOR IRIS RECOGNITION
CONVOLUTIONAL NEURAL NETWORK BASED FEATURE EXTRACTION FOR IRIS RECOGNITION CONVOLUTIONAL NEURAL NETWORK BASED FEATURE EXTRACTION FOR IRIS RECOGNITION
CONVOLUTIONAL NEURAL NETWORK BASED FEATURE EXTRACTION FOR IRIS RECOGNITION
 
Optimized Neural Network for Classification of Multispectral Images
Optimized Neural Network for Classification of Multispectral ImagesOptimized Neural Network for Classification of Multispectral Images
Optimized Neural Network for Classification of Multispectral Images
 
Employing Neocognitron Neural Network Base Ensemble Classifiers To Enhance Ef...
Employing Neocognitron Neural Network Base Ensemble Classifiers To Enhance Ef...Employing Neocognitron Neural Network Base Ensemble Classifiers To Enhance Ef...
Employing Neocognitron Neural Network Base Ensemble Classifiers To Enhance Ef...
 
Modeling of neural image compression using gradient decent technology
Modeling of neural image compression using gradient decent technologyModeling of neural image compression using gradient decent technology
Modeling of neural image compression using gradient decent technology
 
Neural network
Neural networkNeural network
Neural network
 
Analytical Review on the Correlation between Ai and Neuroscience
Analytical Review on the Correlation between Ai and NeuroscienceAnalytical Review on the Correlation between Ai and Neuroscience
Analytical Review on the Correlation between Ai and Neuroscience
 
Artificial Neural Networks.pdf
Artificial Neural Networks.pdfArtificial Neural Networks.pdf
Artificial Neural Networks.pdf
 
self operating maps
self operating mapsself operating maps
self operating maps
 
NEURAL NETWORKS
NEURAL NETWORKSNEURAL NETWORKS
NEURAL NETWORKS
 
A Time Series ANN Approach for Weather Forecasting
A Time Series ANN Approach for Weather ForecastingA Time Series ANN Approach for Weather Forecasting
A Time Series ANN Approach for Weather Forecasting
 
APPLYING NEURAL NETWORKS FOR SUPERVISED LEARNING OF MEDICAL DATA
APPLYING NEURAL NETWORKS FOR SUPERVISED LEARNING OF MEDICAL DATAAPPLYING NEURAL NETWORKS FOR SUPERVISED LEARNING OF MEDICAL DATA
APPLYING NEURAL NETWORKS FOR SUPERVISED LEARNING OF MEDICAL DATA
 
B42010712
B42010712B42010712
B42010712
 
Microscopy images segmentation algorithm based on shearlet neural network
Microscopy images segmentation algorithm based on shearlet neural networkMicroscopy images segmentation algorithm based on shearlet neural network
Microscopy images segmentation algorithm based on shearlet neural network
 
Image Recognition With the Help of Auto-Associative Neural Network
Image Recognition With the Help of Auto-Associative Neural NetworkImage Recognition With the Help of Auto-Associative Neural Network
Image Recognition With the Help of Auto-Associative Neural Network
 

More from Dr. Radhey Shyam

KIT-601-L-UNIT-1 (Revised) Introduction to Data Analytcs.pdf
KIT-601-L-UNIT-1 (Revised) Introduction to Data Analytcs.pdfKIT-601-L-UNIT-1 (Revised) Introduction to Data Analytcs.pdf
KIT-601-L-UNIT-1 (Revised) Introduction to Data Analytcs.pdfDr. Radhey Shyam
 
SE-UNIT-3-II-Software metrics, numerical and their solutions.pdf
SE-UNIT-3-II-Software metrics, numerical and their solutions.pdfSE-UNIT-3-II-Software metrics, numerical and their solutions.pdf
SE-UNIT-3-II-Software metrics, numerical and their solutions.pdfDr. Radhey Shyam
 
Introduction to Data Analytics and data analytics life cycle
Introduction to Data Analytics and data analytics life cycleIntroduction to Data Analytics and data analytics life cycle
Introduction to Data Analytics and data analytics life cycleDr. Radhey Shyam
 
KIT-601 Lecture Notes-UNIT-2.pdf
KIT-601 Lecture Notes-UNIT-2.pdfKIT-601 Lecture Notes-UNIT-2.pdf
KIT-601 Lecture Notes-UNIT-2.pdfDr. Radhey Shyam
 
KIT-601 Lecture Notes-UNIT-1.pdf
KIT-601 Lecture Notes-UNIT-1.pdfKIT-601 Lecture Notes-UNIT-1.pdf
KIT-601 Lecture Notes-UNIT-1.pdfDr. Radhey Shyam
 
Deep-Learning-2017-Lecture5CNN.pptx
Deep-Learning-2017-Lecture5CNN.pptxDeep-Learning-2017-Lecture5CNN.pptx
Deep-Learning-2017-Lecture5CNN.pptxDr. Radhey Shyam
 
SE UNIT-3 (Software metrics).pdf
SE UNIT-3 (Software metrics).pdfSE UNIT-3 (Software metrics).pdf
SE UNIT-3 (Software metrics).pdfDr. Radhey Shyam
 
Ip unit 4 modified on 22.06.21
Ip unit 4 modified on 22.06.21Ip unit 4 modified on 22.06.21
Ip unit 4 modified on 22.06.21Dr. Radhey Shyam
 
Ip unit 3 modified of 26.06.2021
Ip unit 3 modified of 26.06.2021Ip unit 3 modified of 26.06.2021
Ip unit 3 modified of 26.06.2021Dr. Radhey Shyam
 
Ip unit 2 modified on 8.6.2021
Ip unit 2 modified on 8.6.2021Ip unit 2 modified on 8.6.2021
Ip unit 2 modified on 8.6.2021Dr. Radhey Shyam
 

More from Dr. Radhey Shyam (20)

KIT-601-L-UNIT-1 (Revised) Introduction to Data Analytcs.pdf
KIT-601-L-UNIT-1 (Revised) Introduction to Data Analytcs.pdfKIT-601-L-UNIT-1 (Revised) Introduction to Data Analytcs.pdf
KIT-601-L-UNIT-1 (Revised) Introduction to Data Analytcs.pdf
 
SE-UNIT-3-II-Software metrics, numerical and their solutions.pdf
SE-UNIT-3-II-Software metrics, numerical and their solutions.pdfSE-UNIT-3-II-Software metrics, numerical and their solutions.pdf
SE-UNIT-3-II-Software metrics, numerical and their solutions.pdf
 
Introduction to Data Analytics and data analytics life cycle
Introduction to Data Analytics and data analytics life cycleIntroduction to Data Analytics and data analytics life cycle
Introduction to Data Analytics and data analytics life cycle
 
KCS-501-3.pdf
KCS-501-3.pdfKCS-501-3.pdf
KCS-501-3.pdf
 
KIT-601 Lecture Notes-UNIT-2.pdf
KIT-601 Lecture Notes-UNIT-2.pdfKIT-601 Lecture Notes-UNIT-2.pdf
KIT-601 Lecture Notes-UNIT-2.pdf
 
KIT-601 Lecture Notes-UNIT-1.pdf
KIT-601 Lecture Notes-UNIT-1.pdfKIT-601 Lecture Notes-UNIT-1.pdf
KIT-601 Lecture Notes-UNIT-1.pdf
 
Deep-Learning-2017-Lecture5CNN.pptx
Deep-Learning-2017-Lecture5CNN.pptxDeep-Learning-2017-Lecture5CNN.pptx
Deep-Learning-2017-Lecture5CNN.pptx
 
SE UNIT-3 (Software metrics).pdf
SE UNIT-3 (Software metrics).pdfSE UNIT-3 (Software metrics).pdf
SE UNIT-3 (Software metrics).pdf
 
SE UNIT-2.pdf
SE UNIT-2.pdfSE UNIT-2.pdf
SE UNIT-2.pdf
 
SE UNIT-1 Revised.pdf
SE UNIT-1 Revised.pdfSE UNIT-1 Revised.pdf
SE UNIT-1 Revised.pdf
 
SE UNIT-3.pdf
SE UNIT-3.pdfSE UNIT-3.pdf
SE UNIT-3.pdf
 
Ip unit 5
Ip unit 5Ip unit 5
Ip unit 5
 
Ip unit 4 modified on 22.06.21
Ip unit 4 modified on 22.06.21Ip unit 4 modified on 22.06.21
Ip unit 4 modified on 22.06.21
 
Ip unit 3 modified of 26.06.2021
Ip unit 3 modified of 26.06.2021Ip unit 3 modified of 26.06.2021
Ip unit 3 modified of 26.06.2021
 
Ip unit 2 modified on 8.6.2021
Ip unit 2 modified on 8.6.2021Ip unit 2 modified on 8.6.2021
Ip unit 2 modified on 8.6.2021
 
Ip unit 1
Ip unit 1Ip unit 1
Ip unit 1
 
Cc unit 5
Cc unit 5Cc unit 5
Cc unit 5
 
Cc unit 4 updated version
Cc unit 4 updated versionCc unit 4 updated version
Cc unit 4 updated version
 
Cc unit 3 updated version
Cc unit 3 updated versionCc unit 3 updated version
Cc unit 3 updated version
 
Cc unit 2 updated
Cc unit 2 updatedCc unit 2 updated
Cc unit 2 updated
 

Recently uploaded

IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...
IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...
IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...RajaP95
 
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICSAPPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICSKurinjimalarL3
 
Introduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptxIntroduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptxupamatechverse
 
SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )Tsuyoshi Horigome
 
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130Suhani Kapoor
 
Introduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptxIntroduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptxupamatechverse
 
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...Call Girls in Nagpur High Profile
 
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).pptssuser5c9d4b1
 
Microscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptxMicroscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptxpurnimasatapathy1234
 
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escortsranjana rawat
 
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...Dr.Costas Sachpazis
 
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service NashikCall Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service NashikCall Girls in Nagpur High Profile
 
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝soniya singh
 
College Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service NashikCollege Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service NashikCall Girls in Nagpur High Profile
 
Porous Ceramics seminar and technical writing
KCS-055 MLT U4.pdf

  • 1. UNIT-4: Artificial Neural Network and Deep Learning Dr. Radhey Shyam Professor Department of Computer Science and Engineering BIET Lucknow Following slides have been prepared by Dr. Radhey Shyam, with grateful acknowledgement of others who made their course contents freely available. Feel free to reuse these slides for your own academic purposes. 1
  • 2.–17. [Slides 2–17 contain figures only; no text is available in the transcript.]
  • 18. 1252 MIPRO 2017/CTS Brief Review of Self-Organizing Maps Dubravko Miljković Hrvatska elektroprivreda, Zagreb, Croatia dubravko.miljkovic@hep.hr Abstract - As a particular type of artificial neural networks, self-organizing maps (SOMs) are trained using an unsupervised, competitive learning to produce a low- dimensional, discretized representation of the input space of the training samples, called a feature map. Such a map retains principle features of the input data. Self-organizing maps are known for its clustering, visualization and classification capabilities. In this brief review paper basic tenets, including motivation, architecture, math description and applications are reviewed. I. INTRODUCTION Among numerous neural network architectures, particularly interesting architecture was introduced by Finish Professor Teuvo Kohonen in the 1980s, [1,2]. Self- organizing map (SOM), sometimes also called a Kohonen map use unsupervised, competitive learning to produce low dimensional, discretized representation of presented high dimensional data, while simultaneously preserving similarity relations between the presented data items. Such low dimensional representation is called a feature map, hence map in the name. This brief review paper attempts to introduce a reader to SOMs, covering in short basic tenets, underlying biological motivation, its architecture, math description and various applications, [3-10]. II. NEURAL NETWORKS Human and animal brains are highly complex, nonlinear and parallel systems, consisting of billions of neurons integrated into numerous neural networks, [3]. A neural networks within a brain are massively parallel distributed processing system suitable for storing knowledge in forms of past experiences and making it available for future use. They are particularly suitable for the class of problems where it is difficult to propose an analytical solution convenient for algorithmic implementation. A. Biological Motivation After millions of years of evolution, brain in animals and humans has evolved into the massive parallel stack of computing power capable of dealing with the tremendous varieties of situations it can encounter. The biological neural networks are natural intelligent information processors. Artificial neural networks (ANN) constitute computing paradigm motivated by the neural structure of biological systems, [6]. ANNs employ a computational approach based on a large collection of artificial neurons that are much simplified representation of biological neurons. Synapses that ensure communication among biological neurons are replaced with neuron input weights. Adjustment of connection weights is performed by some of numerous learning algorithms. ANNs have very simple principles, but their behavior can be very complex. They have a capability to learn, generalize, associate data and are fault tolerant. The history of the ANNs begins in the 1940s, but the first significant step came in 1957 with the introduction of Rosenblatt’s perceptron. The evolution of the most popular ANN paradigms is shown in Fig. 1, [10]. B. Basic Architectures An artificial neural network is an interconnected assembly of simple processing elements, called artificial neurons (also called units or nodes), whose functionality mimics that of a biological neuron, [4]. Individual neurons can be combined into layers, and there are single and multi-layer networks, with or without feedback. The most common types of ANNs are shown in Fig. 2, [11]. 
Among training algorithms the most popular is backpropagation and its variants. ANNs can be used for solving a wide variety of problems, but before use they have to be trained. During training, the network adjusts its weights. In supervised training, input/output pairs are presented to the network by an external teacher and the network tries to learn the desired input-output mapping. Some neural architectures (like the SOM) can learn without supervision (unsupervised), i.e., from training data without specified input/output pairs. Figure 1. Evolution of artificial neural network paradigms, based on [10]. Figure 2. Most common artificial neural networks, according to [11].
  • 19. MIPRO 2017/CTS 1253 III. SELF-ORGANIZING MAPS The self-organizing map (SOM), as a particular neural network paradigm, has found its inspiration in self-organizing and biological systems. A. Self-Organized Systems Self-organizing systems are systems that can change their internal structure and function in response to external circumstances and stimuli, [12-15]. Elements of such a system can influence or organize other elements within the same system, resulting in a greater stability of structure or function of the whole against external fluctuations, [12]. The main aspects of self-organizing systems are an increase of complexity, the emergence of new phenomena (the whole is more than the sum of its parts) and internal regulation by positive and negative feedback loops. In 1952 Turing published a paper regarding the mathematical theory of pattern formation in biology, and found that global order in a system can arise from local interactions, [13]. This often produces a system with new, emergent properties that differ qualitatively from those of components without interactions, [16]. Self-organizing systems exist in nature, in both the non-living and the living world; they exist in man-made systems, but also in the world of abstract ideas, [12]. B. Self-Organizing Map Neural networks of neurons with lateral communication, topologically organized as self-organizing maps, are common in neurobiology. Various neural functions are mapped onto identifiable regions of the brain, Fig. 3, [17]. In such topographic maps the neighborhood relation is preserved. The brain mostly does not have desired input-output pairs available and has to learn in unsupervised mode. Figure 3. Maps in the brain, [17]. A SOM is a single-layer neural network with units set along an n-dimensional grid. Most applications use a two-dimensional, rectangular grid, although many applications also use hexagonal grids, and some use one-, three-, or higher-dimensional spaces. SOMs produce low-dimensional projection images of high-dimensional data distributions, in which the similarity relations between the data items are preserved, [18]. C. Principles of Self-Organization in SOMs The following three processes are common to self-organization in SOMs, [7,19,20]: 1. Competitive Process For each input pattern vector presented to the map, all neurons calculate values of a discriminant function. The neuron that is most similar to the input pattern vector is the winner (best matching unit, BMU). 2. Cooperative Process The winner neuron (BMU) determines the spatial location of a topological neighborhood of excited neurons. Neurons from this neighborhood may then cooperate. 3. Synaptic Adaptation Ensures that excited neurons can modify their values of the discriminant function related to the presented input pattern vector through weight adjustments. D. Common Topologies SOM topologies can be in one, two (most common) or even three dimensions, [2-10]. The two most used two-dimensional grids in SOMs are the rectangular and the hexagonal grid. Three-dimensional topologies can take the form of cylinder or toroid shapes. 1-D (linear) and 2-D grids are illustrated in Fig. 4, with corresponding SOMs in Fig. 5 and Fig. 6, according to [19]. Figure 4. Most common grids and neuron neighborhoods. Figure 5. 1-D SOM network, according to [19]. Figure 6. 2-D SOM network, according to [19]. IV. LEARNING ALGORITHM In 1982 Professor Kohonen presented his SOM algorithm, [1].
Further advancement in the field came with the second edition of his book “Self-Organization and Associative Memory” in 1988, [2]. A. Measures of Distance and Similarity To determine the similarity between the input vector and the neurons, measures of distance are used. Some popular distances between an input pattern and SOM units are, [21]: Euclidean, correlation, direction cosine, and block distance. In real applications the squared Euclidean distance is most often used, (1): $d_j = \sum_i (x_i - w_{ij})^2$ (1)
  • 20. 1254 MIPRO 2017/CTS B. Neighborhood Functions Neurons within a grid interact among themselves through a neighborhood function. Neighborhood functions most often assume the form of the Mexican hat, (2), Fig. 7, which has a biological motivation (it inhibits some neurons in the vicinity of the winning neuron), although other functions (Gaussian, cone and cylinder) are also possible, [22]. The ordering algorithm is robust to the choice of function type if the neighborhood radius and learning rate decrease to zero. The popular choice is exponential decay. A Gaussian-type form between the winning unit at grid position (m,n) and the unit at (i,j) is: $h_{ij,mn}(r) = g(r)\, e^{-\frac{(i-m)^2 + (j-n)^2}{2 r^2}}$ (2) Figure 7. Mexican hat function. C. Initialization of Self-Organizing Maps Before training a SOM, its units (i.e., their weights) should be initialized. Common approaches are, [2,23]: 1. Use of random values, completely independent of the training data set 2. Use of random samples from the input training data 3. Initialization that tries to reflect the distribution of the data (principal components) D. Training Self-organizing maps use the most popular algorithm of the unsupervised learning category, [2]. The criterion D that is minimized is the sum of squared distances between all input vectors x_n and their respective winning neuron weights w_i, calculated at the end of each epoch, (3), [21]: $D = \sum_{i=1}^{k} \sum_{n \in c_i} \| x_n - w_i \|^2$ (3) Training of self-organizing maps, [2,18], can be accomplished in two ways: as sequential or batch training. 1. Sequential training: a single vector at a time is presented to the map; adjustment of neuron weights is made after the presentation of each vector; suitable for on-line learning. 2. Batch training: the whole dataset is presented to the map before any adjustment to the neuron weights is made; suitable for off-line learning. Here are the steps of sequential training, [3,7,19,22]: 1. Initialization: initialize the neuron weights (iteration step n = 0). 2. Sampling: randomly sample a vector x(n) from the dataset. 3. Similarity Matching: find the best matching unit (BMU), c, with weights w_bmu = w_c, (4): $c = \arg\min_i \| x(n) - w_i(n) \|$ (4) 4. Updating: update each unit i with the following rule: $w_i(n+1) = w_i(n) + \alpha(n)\, h_{i,\mathrm{bmu}}(r(n))\, [x(n) - w_i(n)]$ (5) 5. Continuation: increment n; repeat steps 2-4 until a stopping criterion is met (e.g., a fixed number of iterations, or the map has reached a stable state). For convergence and stability to be guaranteed, the learning rate α(n) and the neighborhood radius r(n) decrease with each iteration towards zero, [22]. SOM Sample Hits, Fig. 8, show the number of input vectors that each unit in the SOM classifies, [24]. Figure 8. SOM Sample Hits, [24]. During the training process two phases may be distinguished, [7,18]: 1. Self-organizing (ordering) phase: topological ordering in the map takes place (roughly the first 1000 iterations). The learning rate α(n) and neighborhood radius r(n) are decreasing. 2. Convergence (fine-tuning) phase: fine tuning that provides an accurate statistical representation of the input space. It typically lasts at least (500 x number of neurons) iterations. The smaller learning rate α(n) and neighborhood radius r(n) may be kept fixed (e.g., at the last values from the previous phase). After the training of the SOM is completed, neurons may be labeled if labeled pattern vectors are available. E. Classification Find the best matching unit (BMU), c, (6): $c = \arg\min_i \| x - w_i \|$ (6) The test pattern x belongs to the class represented by the best matching unit c.
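To make the algorithm concrete, here is a minimal Python/NumPy sketch of one sequential training pass over equations (1)-(5). It is not from the paper; the map size, the Gaussian neighborhood and the decay schedules are illustrative assumptions, and the names (som_step, alpha, radius) are invented for the sketch.

```python
# A minimal sketch of one sequential SOM training step, assuming a 2-D
# grid of shape (rows, cols) and a weight array of shape (rows, cols, dim).
import numpy as np

def som_step(weights, x, alpha, radius):
    rows, cols, dim = weights.shape
    # (1)/(4): squared Euclidean distance from x to every unit; the winner
    # (BMU) is the unit whose weight vector is closest to the input.
    d = np.sum((weights - x) ** 2, axis=2)
    bmu = np.unravel_index(np.argmin(d), d.shape)
    # (2): Gaussian neighborhood around the BMU, measured on the grid.
    i, j = np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij")
    grid_dist2 = (i - bmu[0]) ** 2 + (j - bmu[1]) ** 2
    h = np.exp(-grid_dist2 / (2 * radius ** 2))
    # (5): move every unit towards x, scaled by learning rate and neighborhood.
    weights += alpha * h[:, :, None] * (x - weights)
    return weights

rng = np.random.default_rng(0)
w = rng.random((10, 10, 3))                   # 10x10 map, 3-D inputs
for n, x in enumerate(rng.random((1000, 3))):
    alpha = 0.5 * np.exp(-n / 1000)           # decaying learning rate
    radius = max(5 * np.exp(-n / 1000), 0.5)  # decaying neighborhood radius
    w = som_step(w, x, alpha, radius)
```

Classification, equation (6), reuses the same argmin: compute d for a test vector and report the BMU's label.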
V. PROPERTIES OF SOM After the convergence of the SOM algorithm, the resulting feature map displays important statistical characteristics of the input space. SOMs are also able to discover relevant patterns or features present in the input data. A. Important Properties of SOMs SOMs have four important properties, [3,7]: 1. Approximation of the Input Space The resulting mapping provides a good approximation to the input space. The SOM also performs dimensionality reduction by mapping multidimensional data onto the SOM grid. 2. Topological Ordering Spatial locations of the neurons in the SOM lattice are topologically related to the features of the input space. 3. Density Matching The density of the output neurons in the map approximates the statistical distribution of the input space. Regions of the input space that contain more training vectors are represented with more output neurons.
  • 21. MIPRO 2017/CTS 1255 4. Feature Selection The map extracts the principal features of the input space. It is capable of selecting the best features for approximating the underlying statistical distribution of the input space. B. Representing the Input Space with SOMs of Various Topologies 1. 1-D 2D input data points are uniformly distributed in a triangle; the 1D SOM ordering process is shown in Fig. 9, [2]. Figure 9. 2D to 1D mapping by a SOM (ordering process), [2]. 2. 2-D 2D input data points are uniformly distributed in a square; the 2D SOM ordering process is shown in Fig. 10, [3]. Figure 10. 2D to 2D mapping by a SOM (ordering process), [3]. 3. Torus SOMs In a conventional SOM, the size of the neighborhood set is not always constant because the map has edges. This problem can be mitigated by use of a torus SOM, which has no edges, [25]. However, the torus SOM, Fig. 11, is not easy to visualize, precisely because it has no edges. Figure 11. Torus SOM. 4. Hierarchical SOMs After the previous topologies, hierarchical SOMs should also be mentioned. Hierarchical neural networks are composed of multiple loosely connected neural networks that form an acyclic graph. The outputs of the lower-level SOMs can be used as the input for the higher-level SOM, Fig. 12, [10]. Such input can be formed of several vectors from the Best Matching Units (BMUs) of many SOMs. Figure 12. Hierarchical SOM, [10]. VI. APPLICATIONS Despite their simplicity, SOMs can be used for various classes of applications, [2,26,27]. In a broad sense this includes visualizations, generation of feature maps, pattern recognition and classification. Kohonen in [2] came up with the following categories of applications: machine vision and image analysis, optical character recognition and script reading, speech analysis and recognition, acoustic and musical studies, signal processing and radar measurements, telecommunications, industrial and other real-world measurements, process control, robotics, chemistry, physics, design of electronic circuits, medical applications without image processing, data processing, linguistic and AI problems, mathematical problems and neurophysiological research. From such an exhaustive list it is possible, as space permits, to mention only some applications that are interesting and popular. A. Speech Recognition The neural phonetic typewriter for Finnish and Japanese speech was developed by Kohonen in 1988, [28]. The signal from the microphone proceeds to acoustic preprocessing, shown in more detail in Fig. 13, forming a 15-component pattern vector (values in 15 frequency bands taken every 10 ms) containing a short-time spectral description of speech. These vectors are presented to a SOM with a hexagonal lattice of size 8 x 12. Figure 13. Acoustic preprocessing. After training, the resulting phonotopic map is shown in Fig. 14, [7]. During speech recognition, new pattern vectors are assigned the category of the closest prototype in the map. Figure 14. Phonotopic map, [7]. B. Text Clustering Text clustering is the technology of processing a large number of texts to produce their partition. Preparation of text for SOM analysis is shown in Fig. 15, [29], and the complete framework in Fig. 16, [29]. Figure 15. Preparation of text for SOM analysis, according to [29].
  • 22. 1256 MIPRO 2017/CTS Figure 16. Framework for text clustering, [29]. Massive document collections can be organized using a SOM. It can be optimized to map large document collections while preserving much of the classification accuracy. Clustering of scientific articles is illustrated in Fig. 17, [30]. Figure 17. Clustering of scientific articles, [30]. C. Application in Chemistry SOMs have found applications in chemistry. An illustration of the output layer of a SOM model using a hexagonal grid for the combinatorial design of cannabinoid compounds is shown in Fig. 18, [11]. Figure 18. Application of SOM in chemistry, [11]. D. Medical Imaging and Analysis Recognition of diseases from medical images (ECG, CAT scans, ultrasonic scans, etc.) can be performed by SOMs, [21]. This includes image segmentation, Fig. 19, [31], to discover regions of interest and help diagnostics. Figure 19. Segmentation of a hip image using SOM, [31]. E. Maritime Applications SOMs have been widely used for maritime applications, [22]. One example is the analysis of passive sonar recordings. SOMs have also been used for planning ship trajectories. F. Robotics Some applications of SOMs are control of a robot arm, learning the motion map and solving the traveling salesman problem (multi-goal path planning), Fig. 20, [32]. Figure 20. Traveling Salesman Problem, [32]. G. Classification of Satellite Images SOMs can be used for interpreting satellite imagery, e.g., land cover classification. Dust sources can also be spotted in images using the SOM, as shown in Fig. 21, [33]. Figure 21. Detecting dust sources using SOMs, [33]. H. Psycholinguistic Studies One example is the categorization of words by their local context in artificially constructed three-word sentences of the type subject-predicate-object or subject-predicate-predicative. The words become clustered by the SOM according to their linguistic roles in an orderly fashion, Fig. 22, [18]. Figure 22. SOM in psycholinguistic studies, [18]. I. Exploring Music Collections Similarity of music recordings may be determined by analyzing the lyrics, instrumentation, melody, rhythm, artists, or emotions they invoke, Fig. 23, [34]. Figure 23. Exploring music collections, [34].
  • 23. MIPRO 2017/CTS 1257 J. Business Applications Customer segmentation of the international tourist market is illustrated in Fig. 24, [35]. Another example is classifying world poverty (a welfare map), [36]. Ordering of items with respect to 39 features describing various quality-of-life factors, such as state of health, nutrition, and educational services, is shown in Fig. 25. Countries with similar quality-of-life factors cluster together on the map. Figure 24. Customer segmentation of the international tourist market, [35]. Figure 25. Poverty map based on 39 indicators from World Bank statistics (1992), [36]. VII. CONCLUSION Self-organizing maps (SOMs) are a neural network architecture inspired by the biological structure of human and animal brains. They have become one of the most popular neural network architectures. SOMs learn without an external teacher, i.e., they employ unsupervised learning. Topologically, SOMs most often use a two-dimensional grid, although one-dimensional, higher-dimensional and irregular grids are also possible. A SOM maps higher-dimensional input onto the lower-dimensional grid while preserving the topological ordering present in the input space. During competitive learning the SOM uses lateral interactions among the neurons to form a semantic map where similar patterns are mapped closer together than dissimilar ones. SOMs can be used for a broad range of applications such as visualization, generation of feature maps, pattern recognition and classification. Humans cannot visualize high-dimensional data, hence SOMs, by mapping such data to a two-dimensional grid, are widely used for data visualization. SOMs are also suitable for the generation of feature maps. Because they can detect clusters of similar patterns without supervision, SOMs are a powerful tool for identification and classification of spatio-temporal patterns. SOMs can be used as an analytical tool, but also in a myriad of real-world applications including science, medicine, satellite imaging and industry. REFERENCES [1] T. Kohonen, “Self-organized formation of topologically correct feature maps”, Biol. Cybern. 43, pp. 59-69, 1982. [2] T. Kohonen, Self-Organizing Maps, 2nd ed., Springer, 1997. [3] S. Haykin, Neural Networks: A Comprehensive Foundation, 2nd ed., Prentice Hall PTR, Upper Saddle River, NJ, USA, 1998. [4] K. Gurney, An Introduction to Neural Networks, UCL Press Limited, London, UK, 1997. [5] D. Kriesel, A Brief Introduction to Neural Networks, http://www.dkriesel.com. [6] R. Rojas, Neural Networks: A Systematic Introduction, Springer-Verlag, Berlin, 1996. [7] J. A. Bullinaria, Introduction to Neural Networks - Course Material and Useful Links, http://www.cs.bham.ac.uk/~jxb/NN/. [8] C. M. Bishop, Neural Networks for Pattern Recognition, Clarendon Press, Oxford, 1997. [9] R. Eckmiller and C. Malsburg, Neural Computers, NATO ASI Series, Computer and Systems Sciences, 1988. [10] P. Hodju and J. Halme, Neural Networks Information Homepage, http://phodju.mbnet.fi/nenet/SiteMap/SiteMap.html. [11] K. M. Honório and A. B. F. da Silva, “Applications of artificial neural networks in chemical problems”, in Artificial Neural Networks - Architectures and Applications, InTech, 2013. [12] W. Banzhaf, “Self-organizing systems”, in Encyclopedia of Complexity and Systems Science, Springer, Heidelberg, 2009. [13] A. M. Turing, “The chemical basis of morphogenesis”, Philosophical Transactions of the Royal Society of London, Series B, Biological Sciences, Vol. 237, No. 641, pp. 37-72, Aug. 14, 1952. [14] W. R. Ashby, “Principles of the self-organizing system”, E:CO Special Double Issue, Vol. 6, No. 1-2, pp. 102-126, 2004. [15] C. Fuchs, “Self-organizing system”, in Encyclopedia of Governance, Vol. 2, SAGE Publications, 2006, pp. 863-864. [16] J. Howard, “Self-organisation in biology”, in Research Perspectives 2010+ of the Max Planck Society, 2010, pp. 28-29. [17] The Wizard of Ads Brain Map - Wernicke and Broca, https://www.wizardofads.com.au/brain-map-brocas-area/. [18] T. Kohonen, MATLAB Implementations and Applications of the Self-Organizing Map, Unigrafia, Helsinki, Finland, 2014. [19] B. Wilson, Self-organisation Notes, 2010, www.cse.unsw.edu.au/~billw/cs9444/selforganising-10-4up.pdf. [20] J. Boedecker, Self-Organizing Map (SOM), slides, Machine Learning, Summer 2015, Machine Learning Lab, Univ. of Freiburg. [21] L. Grajciarova, J. Mares, P. Dvorak and A. Prochazka, “Biomedical image analysis using self-organizing maps”, Matlab Conference 2012. [22] V. J. A. S. Lobo, “Application of self-organizing maps to the maritime environment”, Proc. IF&GIS 2009, 20 May 2009, St. Petersburg, Russia, pp. 19-36. [23] A. A. Akinduko and E. M. Mirkes, “Initialization of self-organizing maps: principal components versus random initialization. A case study”, Information Sciences, Vol. 364, Is. C, pp. 213-221, Oct. 2016. [24] MathWorks, Self-Organizing Maps, https://www.mathworks.com/help/nnet/ug/cluster-with-self-organizing-map-neural-network.html. [25] M. Ito, T. Miyoshi, and H. Masuyama, “The characteristics of the torus self organizing map”, Proc. 6th Int. Conf. on Soft Computing (IIZUKA’2000), Iizuka, Fukuoka, Japan, Oct. 1-4, 2000, pp. 239-244. [26] M. Johnsson (ed.), Applications of Self-Organizing Maps, InTech, November 21, 2012. [27] J. I. Mwasiagi (ed.), Self Organizing Maps - Applications and Novel Algorithm Design, InTech, 2011. [28] T. Kohonen, “The ‘neural’ phonetic typewriter”, IEEE Computer 21(3), pp. 11-22, 1988. [29] Y.-C. Liu, M. Liu and X.-L. Wang, “Application of self-organizing maps in text clustering: a review”, in Self Organizing Maps - Applications and Novel Algorithm Design, InTech, 2012. [30] K. W. Boyack et al., Supplementary information on data and methods for “Clustering more than two million biomedical publications: comparing the accuracies of nine text-based similarity approaches”, PLoS ONE 6(3): e18029, 2011. [31] A. Aslantas, D. Emre and M. Çakiroğlu, “Comparison of segmentation algorithms for detection of hotspots in bone scintigraphy images and effects on CAD systems”, Biomedical Research, 28(2), pp. 676-683, 2017. [32] J. Faigl, “Multi-goal path planning for cooperative sensing”, PhD Thesis, Czech Technical University in Prague, February 2010. [33] D. Lary, Machine Learning for Scientific Applications, slides, https://www.slideshare.net/davidlary/machine-learning-for-scientific-applications. [34] E. Pampalk, S. Dixon and G. Widmer, “Exploring music collections by browsing different views”, Computer Music Journal, Vol. 28, No. 2, pp. 49-62, Summer 2004. [35] J. Z. Bloom, “Market segmentation - a neural network application”, Annals of Tourism Research, Vol. 32, No. 1, pp. 93-111, 2005. [36] World Poverty Map, SOM research page, Univ. of Helsinki, http://www.cis.hut.fi/research/som-research/worldmap.html
  • 24. Convolutional Neural Network (CNN) A Convolutional Neural Network (CNN) is a class of ANNs. CNN development was primarily triggered by the challenges of image recognition. CNN architectures are strongly influenced by current neuroscience models of the organization of human and animal visual perception. The central convolution mechanisms of CNNs are inspired by receptive fields and their direct connections to specific neuron structures. The implementation of these mechanisms is based on the concept of the convolution function in mathematics. CNNs use relatively little pre-processing compared to other image classification algorithms: the network learns the filters that in traditional algorithms were hand-engineered. This independence from prior knowledge and human effort in feature design is a major advantage.
  • 25. Image Recognition The classical problem in computer vision is that of determining whether or not the image data contains some specific object, feature, or activity. Different varieties of the recognition problem are: Object recognition or object classification – one or several pre-specified or learned objects or object classes can be recognized, usually together with their 2D positions in the image or 3D poses in the scene. Identification – an individual instance of an object is recognized. Examples include identification of a specific person's face or fingerprint, identification of handwritten digits or letters or identification of a specific object. Detection – the image data are scanned for a specific condition. Examples include detection of possible abnormal cells or tissues in medical images or detection of a vehicle in an automatic road toll system. Detection based on relatively simple and fast computations is sometimes used for finding smaller regions of interesting image data which can be further analyzed by more computationally demanding techniques to produce a correct interpretation.
  • 26. ImageNet The ImageNet project is a large visual database designed for use in visual object recognition software research. More than 14 million images have been hand-annotated by the project to indicate what objects are pictured. ImageNet contains more than 20,000 categories, with a typical category consisting of several hundred images. Since 2010, the ImageNet project has run an annual software contest, the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), where software programs compete to correctly classify and detect objects and scenes. The challenge uses a specially selected list of one thousand non-overlapping classes.
  • 27. [Figure: Image Recognition Systems. Input: an image in standard pixel form; output: a compact symbolic characterization of the image. Alternative architectures compared: manual mapping, an ANN architecture, and a CNN architecture (automated mapping).]
  • 28. Input to Image Recognition systems - finite arrays of pixels
  • 29. RGB Images An RGB image, sometimes referred to as a true-color image, is an m-by-n-by-3 data array RGB(.., .., ..) that defines red, green, and blue color components for each individual pixel. The color of each pixel is determined by the combination of the red, green, and blue intensities stored in each color plane at the pixel's location. An RGB color component is a value between 0 and 1. A pixel whose color components are (0,0,0) displays as black, and a pixel whose color components are (1,1,1) displays as white. The three color components for each pixel are stored along the third dimension of the data array. For example, the red, green, and blue color components of the pixel (2,3) are stored in RGB(2,3,1), RGB(2,3,2), and RGB(2,3,3), respectively. Suppose RGB(2,3,1) contains the value 0.5176, RGB(2,3,2) contains 0.1608, and RGB(2,3,3) contains 0.0627. The color for the pixel at (2,3) is then (0.5176, 0.1608, 0.0627).
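A minimal NumPy sketch of this layout follows. Note that the slide uses MATLAB-style 1-based indexing while NumPy is 0-based, so the slide's pixel (2,3) becomes img[1, 2] below; the array name img is illustrative.

```python
# A small sketch of the m-by-n-by-3 RGB layout described above.
import numpy as np

img = np.zeros((5, 5, 3))                 # all-black 5x5 RGB image
img[1, 2] = [0.5176, 0.1608, 0.0627]      # set R, G, B planes of one pixel

print(img[1, 2, 0], img[1, 2, 1], img[1, 2, 2])  # 0.5176 0.1608 0.0627
print(img[0, 0])   # [0. 0. 0.] displays as black; [1. 1. 1.] would be white
```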
  • 30. Output from an Image Recognition system: one or several object categories (classes) present in the image; specific objects (instances) present in the image; a subset of features of objects and/or categories observable in the image; topological and geometrical aspects of the image; dynamic properties of elements in the image (requires sequences of images). All the above elements can be represented in symbolic and numeric form. A feature vector is still the default option.
  • 32. [Figure: The Organization of the Visual Cortex. Pathway from the eye via the superior colliculus and dorsal LGN to the striate cortex (V1), the extrastriate cortex (V2, V3, V3A, V4, V5) and the inferior temporal cortex (TEO, TE), split into a dorsal stream (towards posterior parietal cortex) and a ventral stream; STS = superior temporal sulcus.]
  • 33. The connections between Receptive fields and Neurons in the Visual Cortex Nobel prize-awarded work by Hubel and Wiesel in the 1950s and 1960s showed that cat and monkey visual cortexes contain neurons that individually respond to small regions of the visual field. Provided the eyes are not moving, the regions of visual space within which visual stimuli affect the firing of single neurons are called receptive fields. Neighboring neurons have similar and overlapping receptive fields. Receptive field sizes and locations vary systematically to form a complete map of visual space. The response of a specific neuron to a subset of stimuli within its receptive field is called neuronal tuning. A 1968 article by Hubel and Wiesel identified two basic visual cell types in the brain: • simple cells, whose output is maximized by straight edges having particular orientations within their receptive field. Neurons of this kind are located in the earlier visual areas (like V1). • complex cells, which have larger receptive fields, whose output is insensitive to the exact position of the edges in the field. In the higher visual areas, neurons have complex tuning. For example, in the inferior temporal cortex, a neuron may fire only when a certain face appears in its receptive field. Hubel and Wiesel also proposed a cascading model of these two types of cells for use in pattern recognition tasks.
  • 34. Convolution as defined in Mathematics Convolution is a mathematical operation on two functions (f and g) to produce a third function that expresses how the shape of one is modified by the other. • Express each function in terms of a dummy variable a. • Reflect one of the functions: g(a) → g(−a). • Add a time-offset, x, which allows g to slide along the a-axis from −∞ to +∞. • Wherever the two functions intersect, find the integral of their product. • In other words, compute a sliding, weighted sum of the function f(a), where the weighting function is g(−a). • The resulting waveform is the convolution of the functions f and g. The term convolution refers to both the result function and the process of computing it. Convolution is similar to cross-correlation and related to autocorrelation.
  • 35. [Worked example (figure): computing f*g for two short discrete signals, f with values such as (1, 1, 1/2, −1) and g with values such as (2, −2, 1). The weight function g is reflected and slid along f; at each offset x the overlapping values are multiplied and summed to give one output value, with f*g = 0 at offsets where the functions do not overlap.]
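A small Python sketch of the same recipe for discrete signals follows; the arrays f and g are illustrative stand-ins for the figure's values, and the result is checked against NumPy's built-in np.convolve.

```python
# Reflect g, slide it along f, and take the weighted sum at each offset.
import numpy as np

f = np.array([1.0, 1.0, 0.5, -1.0])
g = np.array([2.0, -2.0, 1.0])

def convolve_full(f, g):
    n = len(f) + len(g) - 1
    out = np.zeros(n)
    g_reflected = g[::-1]                        # g(a) -> g(-a)
    fpad = np.pad(f, (len(g) - 1, len(g) - 1))   # allow partial overlap
    for x in range(n):                           # slide the reflected g
        out[x] = np.dot(fpad[x:x + len(g)], g_reflected)
    return out

print(convolve_full(f, g))
print(np.convolve(f, g))                         # same result
```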
  • 36. A typical Convolutional Neural Network Architecture
  • 37. Convolutional Neural Network related Terminology: Convolution; Filter (or synonymously Kernel); Stride; Padding; Feature map; Parameter sharing; Local connectivity; Pooling / Subsampling / Downsampling; Subsampling ratio; Max pooling; Average pooling; ReLU; Softmax
  • 38. The Feature Learning Phase The feature learning phase in a CNN consists of an arbitrary number of pairs of Convolution and Pooling layers. The number and roles of these pairs of layers are engineering decisions for particular problem settings, but in general later (deeper) levels handle more abstract or high-level features or patterns, in analogy with our assumed model of the functioning of the human visual cortex.
  • 39. Example An input image of RGB type.
  • 40. Convolution for one Filter in a Convolution Layer In our example we take a 5*5*3 filter and slide it over the input array with a stride of 1. Let us disregard the color dimension for a moment. In each step of the slide, take the dot product between the filter elements and the elements of the corresponding subarea of the input array. Every dot product yields a scalar. For a 32x32 input there are 28*28 unique positions where the filter can be placed on the image (28 = 32 − 5 + 1), and therefore the total result is a Feature Map, a 28x28x1 array. If the stride is larger than 1, the feature map becomes smaller.
  • 41. Example of a filter and a single convolution operation The input is a 7x7 array with 49 elements. The filter is shown in the middle; the filter size is 3x3 (black and white) and the stride is 1. This is an example of a filter that detects diagonal patterns (1s on the diagonals). The output is a 5x5 array with 25 elements. We slide the filter systematically across the input array (in analogy with convolution). There are 25 distinct sliding positions. For each position we calculate the elementwise dot product and put the result in the output matrix.
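The sliding dot product the slide describes can be written directly. A minimal sketch, assuming a binary 7x7 input and a main-diagonal 3x3 filter (np.eye) standing in for the figure's diagonal detector:

```python
# Slide a 3x3 filter over a 7x7 input with stride 1: 5x5 = 25 positions.
import numpy as np

inp = np.random.default_rng(1).integers(0, 2, size=(7, 7)).astype(float)
filt = np.eye(3)          # 1s on the main diagonal: a diagonal detector

out = np.zeros((5, 5))    # (7 - 3)/1 + 1 = 5 positions per axis
for i in range(5):
    for j in range(5):
        # elementwise dot product of the filter and the current subarea
        out[i, j] = np.sum(inp[i:i+3, j:j+3] * filt)
print(out)
```

Strictly speaking this is cross-correlation (the filter is not reflected), which is what CNN layers actually compute; for a learned filter the distinction does not matter.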
  • 42. Padding Depending on the size of the input array, the size of the filters and the stride, the sliding process can fail to apply the filter to some input array elements. A possibility is to 'pad' the original input array with a frame and use the extended array as the basis for the convolution. Whether this is beneficial for the process or not depends on the specific situation. If padding is never used, the arrays shrink rapidly from layer to layer, but if padding is used systematically the size of the arrays is kept up.
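A brief sketch of zero padding with NumPy's np.pad, showing how a one-pixel frame keeps a 7x7 input at a 7x7 output size under a 3x3 filter with stride 1 ('same' padding), instead of shrinking it to 5x5:

```python
import numpy as np

inp = np.ones((7, 7))
padded = np.pad(inp, pad_width=1)        # 9x9, framed with zeros
out_size = (padded.shape[0] - 3) // 1 + 1  # (9 - 3)/stride + 1
print(padded.shape, out_size)            # (9, 9) 7
```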
  • 43. Repeated convolution for all filters in a convolution layer Each convolution layer comprises a set of independent filters. Each filter is independently convolved with the input image. In the example there are 6 filters in this first convolution layer, which generates 6 feature maps of shape 28*28*1.
  • 44. Pooling (Subsampling) Layer A pooling layer is frequently used in a convolutional neural network with the purpose of progressively reducing the spatial size of the representation, to reduce the number of features and the computational complexity of the network. The pooling layer operates on each feature map independently. The pooling layer also helps prevent the model from overfitting. The choice of filter size, stride (and possibly padding) is also relevant for the pooling phases. The most common approach used in pooling is max pooling. As an example, a MAXPOOL of 2 x 2 would cause a filter of 2 by 2 to traverse the entire matrix with a stride of 2 and pick the largest element from each window to be included in the next representation map. Average pooling takes the average instead.
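A minimal max-pooling sketch, assuming even input height and width (the reshape trick groups the pixels into non-overlapping 2x2 windows); average pooling would use .mean() in place of .max():

```python
import numpy as np

def max_pool_2x2(fmap):
    h, w = fmap.shape
    # group pixels into non-overlapping 2x2 windows, then take each max
    return fmap.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

fmap = np.arange(16.0).reshape(4, 4)
print(max_pool_2x2(fmap))   # [[ 5.  7.] [13. 15.]]
```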
  • 45. Two aspects of the neuron structures in the convolution and pooling layers Weight sharing Based on the motivation that a certain feature/filter should treat all subareas of the visual space similarly, the same weights are employed within a convolution computation phase. This brings down the complexity of the network. Local connectivity In contrast to a general ANN, the neuron connections in the input, convolution and pooling layers are restricted, primarily motivated by the fact that specific neurons are allocated to only small sub-areas of the total visual field. This also brings down the complexity of the network.
  • 46. Flattening When leaving the convolution and pooling layers, and before entering the fully connected layers, the output of the previous layers is flattened. By this is meant that the dimensions of the input array from earlier phases are flattened out into one large dimension. For example, a 3-D array with a shape of (10x10x10), when flattened, would become a 1-D array with 1000 elements.
  • 47. The Fully Connected Layers The fully connected layers take as input a flattened array representing the activation maps of high-level features from earlier layers, and output an N-dimensional vector, where N is the number of classes the program has to choose from. For example, if the task is digit classification, N would be 10 since there are 10 digits. The fully connected layers determine which features best correlate to a particular class. If a Softmax activation function is used, each number in this N-dimensional vector represents the probability of a certain class. For example, if the resulting vector for a digit classification program is [0, 0.1, 0.1, 0.75, 0, 0, 0, 0, 0, 0.05], then this represents a 10% probability that the image is a 1, a 10% probability that the image is a 2, a 75% probability that the image is a 3, and a 5% probability that the image is a 9.
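A minimal sketch of flattening followed by a fully connected softmax layer; the 10x10x10 volume and the random weights are illustrative stand-ins, not trained values:

```python
import numpy as np

rng = np.random.default_rng(0)
activations = rng.random((10, 10, 10))
flat = activations.reshape(-1)            # 3-D volume -> 1000-element vector

W = rng.normal(0, 0.01, size=(10, 1000))  # N x 1000 weights, N = 10 classes
b = np.zeros(10)
logits = W @ flat + b

probs = np.exp(logits - logits.max())     # softmax (shifted for stability)
probs /= probs.sum()
print(probs.sum(), probs.argmax())        # probabilities sum to 1
```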
  • 48. Activation functions used ReLU (Rectified Linear Unit) and Leaky ReLU activation functions: the advantages are simplicity and efficiency. Typically used in the convolution layers of a CNN. Sigmoid and hyperbolic tangent (Tanh) functions: Sigmoid and Tanh are typically used for fully connected networks aimed at binary classification problems. Can be used for the output layers of a CNN. Softmax: Softmax is equivalent to Sigmoid for binary classification but is primarily aimed at the multi-class case, where the non-normalized output of a network is mapped onto a probability distribution over predicted output classes. Typically used in the output layer of a CNN. Gaussian activation function: can be used for the output layers of a CNN. For regression problems, the final layer typically has an identity activation.
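For reference, minimal NumPy definitions of these functions (the 0.01 leak factor for Leaky ReLU is a common but arbitrary choice):

```python
import numpy as np

relu = lambda z: np.maximum(0, z)                  # max(0, z)
leaky_relu = lambda z, a=0.01: np.where(z > 0, z, a * z)
sigmoid = lambda z: 1 / (1 + np.exp(-z))           # logistic function
tanh = np.tanh                                     # hyperbolic tangent

def softmax(z):
    e = np.exp(z - np.max(z))   # subtract max for numerical stability
    return e / e.sum()          # normalizes to a probability distribution

z = np.array([-2.0, 0.0, 3.0])
print(relu(z), sigmoid(z), softmax(z))
```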
  • 49. LeNet-5 – A Classic CNN Architecture In 1998 Yann LeCun, Leon Bottou, Yoshua Bengio and Patrick Haffner proposed a neural network architecture for handwritten and machine-printed character recognition which they called LeNet-5. The architecture is straightforward, simple to understand, and well suited as an introduction to CNNs. The LeNet-5 architecture consists of: • two sets of convolutional and average pooling (subsampling) layers, followed by • one flattening convolutional layer, then • two fully-connected layers and finally • one Softmax classifier.
  • 50. First Layer The input to LeNet-5 is a 32×32 grayscale image which passes through the first convolutional layer with 6 feature maps or filters of size 5×5 and a stride of one. The image dimensions change from 32x32x1 to 28x28x6. Second Layer Then LeNet-5 applies an average pooling layer or sub-sampling layer with a filter size of 2×2 and a stride of two. The resulting image dimensions will be reduced to 14x14x6.
  • 51. Third Layer Next, there is a second convolutional layer with 16 feature maps of size 5×5 and a stride of 1. In this layer, each of the 16 feature maps is connected to only a subset of the 6 feature maps of the previous layer, as shown below. The main reason is to break the symmetry in the network and keep the number of connections within reasonable bounds. That is why the number of training parameters in this layer is 1,516 instead of 2,400, and similarly the number of connections is 151,600 instead of 240,000. Fourth Layer The fourth layer (S4) is again an average pooling layer with filter size 2×2 and a stride of 2. This layer is the same as the second layer (S2) except it has 16 feature maps, so the output will be reduced to 5x5x16.
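The quoted counts can be checked with a few lines of arithmetic, assuming LeNet-5's standard C3 connection table (6 maps read 3 of the S2 maps, 9 maps read 4, and 1 map reads all 6); the 2,400/240,000 figures correspond to a fully connected alternative counted without biases:

```python
kernel = 5 * 5                       # one 5x5 kernel per incoming S2 map
params = (6 * (3 * kernel + 1)       # 6 maps reading 3 S2 maps (+1 bias each)
          + 9 * (4 * kernel + 1)     # 9 maps reading 4 S2 maps
          + 1 * (6 * kernel + 1))    # 1 map reading all 6 S2 maps
print(params, params * 10 * 10)      # 1516 151600 (each output map is 10x10)

full_no_bias = 16 * 6 * kernel       # fully connected alternative, no biases
print(full_no_bias, full_no_bias * 10 * 10)   # 2400 240000, as quoted
```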
  • 52. Fifth Layer The fifth layer (C5) is a fully connected convolutional layer with 120 feature maps each of size 1×1. Each of the 120 units in C5 is connected to all the 400 nodes (5x5x16) in the fourth layer S4. Sixth Layer The sixth layer is a fully connected layer (F6) with 84 units. Output Layer Finally, there is a fully connected Softmax output layer ŷ with 10 possible values corresponding to the digits from 0 to 9. LeNet-5 layers
  • 53. Summary of LeNet-5 architecture
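As a modern illustration, here is a minimal PyTorch sketch of the same layer stack. It is a simplification, not the original network: it uses full S2-to-C3 connectivity and ReLU in place of the original partial connection table and scaled tanh, and it outputs raw logits (a softmax or cross-entropy loss would be applied on top).

```python
import torch
import torch.nn as nn

class LeNet5(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5),   # C1: 32x32x1 -> 28x28x6
            nn.ReLU(),
            nn.AvgPool2d(2, stride=2),        # S2: -> 14x14x6
            nn.Conv2d(6, 16, kernel_size=5),  # C3: -> 10x10x16
            nn.ReLU(),
            nn.AvgPool2d(2, stride=2),        # S4: -> 5x5x16
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),                     # 5x5x16 -> 400
            nn.Linear(400, 120),              # C5
            nn.ReLU(),
            nn.Linear(120, 84),               # F6
            nn.ReLU(),
            nn.Linear(84, 10),                # output: 10 digit classes
        )

    def forward(self, x):
        return self.classifier(self.features(x))

model = LeNet5()
logits = model(torch.randn(1, 1, 32, 32))     # one 32x32 grayscale image
print(logits.shape)                           # torch.Size([1, 10])
```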
  • 54. Timeline for CNN 1980 The Neocognitron, introduced by Kunihiko Fukushima. 1987 Time delay neural networks (TDNN), introduced by Alex Waibel. 1989 A system to recognize hand-written ZIP Code numbers using convolutions based on laboriously hand-designed filters, introduced by Yann LeCun. 1998 LeNet-5, a pioneering 7-level convolutional network by Yann LeCun et al. 2006 The first GPU implementation of a CNN, described by K. Chellapilla. 2012 AlexNet, a GPU-based CNN by Alex Krizhevsky et al., won the ImageNet Large Scale Visual Recognition Challenge.
  • 55. 1 Facebook AI Research, 770 Broadway, New York, New York 10003 USA. 2 New York University, 715 Broadway, New York, New York 10003, USA. 3 Department of Computer Science and Operations Research, Université de Montréal, Pavillon André-Aisenstadt, PO Box 6128 Centre-Ville STN, Montréal, Quebec H3C 3J7, Canada. 4 Google, 1600 Amphitheatre Parkway, Mountain View, California 94043, USA. 5 Department of Computer Science, University of Toronto, 6 King’s College Road, Toronto, Ontario M5S 3G4, Canada. Machine-learning technology powers many aspects of modern society: from web searches to content filtering on social networks to recommendations on e-commerce websites, and it is increasingly present in consumer products such as cameras and smartphones. Machine-learning systems are used to identify objects in images, transcribe speech into text, match news items, posts or products with users’ interests, and select relevant results of search. Increasingly, these applications make use of a class of techniques called deep learning. Conventional machine-learning techniques were limited in their ability to process natural data in their raw form. For decades, constructing a pattern-recognition or machine-learning system required careful engineering and considerable domain expertise to design a feature extractor that transformed the raw data (such as the pixel values of an image) into a suitable internal representation or feature vector from which the learning subsystem, often a classifier, could detect or classify patterns in the input. Representation learning is a set of methods that allows a machine to be fed with raw data and to automatically discover the representations needed for detection or classification. Deep-learning methods are representation-learning methods with multiple levels of representation, obtained by composing simple but non-linear modules that each transform the representation at one level (starting with the raw input) into a representation at a higher, slightly more abstract level. With the composition of enough such transformations, very complex functions can be learned. For classification tasks, higher layers of representation amplify aspects of the input that are important for discrimination and suppress irrelevant variations. An image, for example, comes in the form of an array of pixel values, and the learned features in the first layer of representation typically represent the presence or absence of edges at particular orientations and locations in the image. The second layer typically detects motifs by spotting particular arrangements of edges, regardless of small variations in the edge positions. The third layer may assemble motifs into larger combinations that correspond to parts of familiar objects, and subsequent layers would detect objects as combinations of these parts. The key aspect of deep learning is that these layers of features are not designed by human engineers: they are learned from data using a general-purpose learning procedure. Deep learning is making major advances in solving problems that have resisted the best attempts of the artificial intelligence community for many years. It has turned out to be very good at discovering intricate structures in high-dimensional data and is therefore applicable to many domains of science, business and government.
In addition to beating records in image recognition1–4 and speech recognition5–7, it has beaten other machine-learning techniques at predicting the activity of potential drug molecules8, analysing particle accelerator data9,10, reconstructing brain circuits11, and predicting the effects of mutations in non-coding DNA on gene expression and disease12,13. Perhaps more surprisingly, deep learning has produced extremely promising results for various tasks in natural language understanding14, particularly topic classification, sentiment analysis, question answering15 and language translation16,17. We think that deep learning will have many more successes in the near future because it requires very little engineering by hand, so it can easily take advantage of increases in the amount of available computation and data. New learning algorithms and architectures that are currently being developed for deep neural networks will only accelerate this progress. Supervised learning The most common form of machine learning, deep or not, is supervised learning. Imagine that we want to build a system that can classify images as containing, say, a house, a car, a person or a pet. We first collect a large data set of images of houses, cars, people and pets, each labelled with its category. During training, the machine is shown an image and produces an output in the form of a vector of scores, one for each category. We want the desired category to have the highest score of all categories, but this is unlikely to happen before training. We compute an objective function that measures the error (or distance) between the output scores and the desired pattern of scores. The machine then modifies its internal adjustable parameters to reduce this error. These adjustable parameters, often called weights, are real numbers that can be seen as ‘knobs’ that define the input–output function of the machine. In a typical deep-learning system, there may be hundreds of millions of these adjustable weights, and hundreds of millions of labelled examples with which to train the machine. To properly adjust the weight vector, the learning algorithm computes a gradient vector that, for each weight, indicates by what amount the error would increase or decrease if the weight were increased by a tiny amount. The weight vector is then adjusted in the opposite direction to the gradient vector. [Paper abstract:] Deep learning allows computational models that are composed of multiple processing layers to learn representations of data with multiple levels of abstraction. These methods have dramatically improved the state-of-the-art in speech recognition, visual object recognition, object detection and many other domains such as drug discovery and genomics. Deep learning discovers intricate structure in large data sets by using the backpropagation algorithm to indicate how a machine should change its internal parameters that are used to compute the representation in each layer from the representation in the previous layer. Deep convolutional nets have brought about breakthroughs in processing images, video, speech and audio, whereas recurrent nets have shone light on sequential data such as text and speech. (“Deep learning”, Yann LeCun, Yoshua Bengio & Geoffrey Hinton, Nature, Vol. 521, 28 May 2015, doi:10.1038/nature14539.) The objective function, averaged over all the training examples, can
  • 56. be seen as a kind of hilly landscape in the high-dimensional space of weight values. The negative gradient vector indicates the direction of steepest descent in this landscape, taking it closer to a minimum, where the output error is low on average. In practice, most practitioners use a procedure called stochastic gradient descent (SGD). This consists of showing the input vector for a few examples, computing the outputs and the errors, computing the average gradient for those examples, and adjusting the weights accordingly. The process is repeated for many small sets of examples from the training set until the average of the objective function stops decreasing. It is called stochastic because each small set of examples gives a noisy estimate of the average gradient over all examples. This simple procedure usually finds a good set of weights surprisingly quickly when compared with far more elaborate optimization techniques18. After training, the performance of the system is measured on a different set of examples called a test set. This serves to test the generalization ability of the machine: its ability to produce sensible answers on new inputs that it has never seen during training. Many of the current practical applications of machine learning use linear classifiers on top of hand-engineered features. A two-class linear classifier computes a weighted sum of the feature vector components. If the weighted sum is above a threshold, the input is classified as belonging to a particular category. Since the 1960s we have known that linear classifiers can only carve their input space into very simple regions, namely half-spaces separated by a hyperplane19. But problems such as image and speech recognition require the input–output function to be insensitive to irrelevant variations of the input, such as variations in position, orientation or illumination of an object, or variations in the pitch or accent of speech, while being very sensitive to particular minute variations (for example, the difference between a white wolf and a breed of wolf-like white dog called a Samoyed). At the pixel level, images of two Samoyeds in different poses and in different environments may be very different from each other, whereas two images of a Samoyed and a wolf in the same position and on similar backgrounds may be very similar to each other. A linear classifier, or any other ‘shallow’ classifier operating on Figure 1 | Multilayer neural networks and backpropagation. a, A multilayer neural network (shown by the connected dots) can distort the input space to make the classes of data (examples of which are on the red and blue lines) linearly separable. Note how a regular grid (shown on the left) in input space is also transformed (shown in the middle panel) by hidden units. This is an illustrative example with only two input units, two hidden units and one output unit, but the networks used for object recognition or natural language processing contain tens or hundreds of thousands of units. Reproduced with permission from C. Olah (http://colah.github.io/). b, The chain rule of derivatives tells us how two small effects (that of a small change of x on y, and that of y on z) are composed. A small change Δx in x gets transformed first into a small change Δy in y by getting multiplied by ∂y/∂x (that is, the definition of partial derivative). Similarly, the change Δy creates a change Δz in z.
Substituting one equation into the other gives the chain rule of derivatives: how Δx gets turned into Δz through multiplication by the product of ∂y/∂x and ∂z/∂y. It also works when x, y and z are vectors (and the derivatives are Jacobian matrices). c, The equations used for computing the forward pass in a neural net with two hidden layers and one output layer, each constituting a module through which one can backpropagate gradients. At each layer, we first compute the total input z to each unit, which is a weighted sum of the outputs of the units in the layer below. Then a non-linear function f(.) is applied to z to get the output of the unit. For simplicity, we have omitted bias terms. The non-linear functions used in neural networks include the rectified linear unit (ReLU) f(z) = max(0,z), commonly used in recent years, as well as the more conventional sigmoids, such as the hyperbolic tangent, f(z) = (exp(z) − exp(−z))/(exp(z) + exp(−z)), and the logistic function, f(z) = 1/(1 + exp(−z)). d, The equations used for computing the backward pass. At each hidden layer we compute the error derivative with respect to the output of each unit, which is a weighted sum of the error derivatives with respect to the total inputs to the units in the layer above. We then convert the error derivative with respect to the output into the error derivative with respect to the input by multiplying it by the gradient of f(z). At the output layer, the error derivative with respect to the output of a unit is computed by differentiating the cost function. This gives $y_l - t_l$ if the cost function for unit l is $0.5(y_l - t_l)^2$, where $t_l$ is the target value. Once $\partial E/\partial z_k$ is known, the error derivative for the weight $w_{jk}$ on the connection from unit j in the layer below is just $y_j\, \partial E/\partial z_k$. [Figure 1 panel equations: forward pass $z_j = \sum_i w_{ij} x_i$, $y_j = f(z_j)$ for hidden layer H1; $z_k = \sum_j w_{jk} y_j$, $y_k = f(z_k)$ for H2; $z_l = \sum_k w_{kl} y_k$, $y_l = f(z_l)$ at the output. Backward pass: $\partial E/\partial y_l = y_l - t_l$, $\partial E/\partial z_l = (\partial E/\partial y_l)(\partial y_l/\partial z_l)$, $\partial E/\partial y_k = \sum_l w_{kl}\, \partial E/\partial z_l$, $\partial E/\partial z_k = (\partial E/\partial y_k)(\partial y_k/\partial z_k)$, and similarly down through H1.]
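To make the caption's forward and backward passes concrete, here is a minimal NumPy sketch. It is not from the paper; the layer sizes, the ReLU non-linearity, the identity output unit and the learning rate are illustrative assumptions.

```python
# Forward and backward pass for a net with two hidden layers, no biases,
# and cost E = 0.5*(y - t)**2, followed by one SGD weight update.
import numpy as np

rng = np.random.default_rng(0)
x = rng.random(2)                      # 2 input units
W1 = rng.normal(size=(3, 2))           # input -> hidden H1
W2 = rng.normal(size=(3, 3))           # H1 -> hidden H2
W3 = rng.normal(size=(1, 3))           # H2 -> output
f = lambda z: np.maximum(0, z)         # ReLU f(z) = max(0, z)
df = lambda z: (z > 0).astype(float)   # its gradient

# forward pass: total input z at each layer, then non-linearity y = f(z)
z1 = W1 @ x;  y1 = f(z1)
z2 = W2 @ y1; y2 = f(z2)
y3 = W3 @ y2                           # identity activation at the output

t = np.array([1.0])                    # target value
dE_dz3 = y3 - t                        # dE/dy = y - t for 0.5*(y-t)^2
dE_dW3 = np.outer(dE_dz3, y2)          # weight gradient: y_j * dE/dz_k
dE_dz2 = (W3.T @ dE_dz3) * df(z2)      # back through weights, then f(z)
dE_dW2 = np.outer(dE_dz2, y1)
dE_dz1 = (W2.T @ dE_dz2) * df(z1)
dE_dW1 = np.outer(dE_dz1, x)

eta = 0.1                              # SGD: step against the gradient
W3 -= eta * dE_dW3; W2 -= eta * dE_dW2; W1 -= eta * dE_dW1
```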
  • 57. raw pixels could not possibly distinguish the latter two, while putting the former two in the same category. This is why shallow classifiers require a good feature extractor that solves the selectivity–invariance dilemma: one that produces representations that are selective to the aspects of the image that are important for discrimination, but that are invariant to irrelevant aspects such as the pose of the animal. To make classifiers more powerful, one can use generic non-linear features, as with kernel methods20, but generic features such as those arising with the Gaussian kernel do not allow the learner to generalize well far from the training examples21. The conventional option is to hand design good feature extractors, which requires a considerable amount of engineering skill and domain expertise. But this can all be avoided if good features can be learned automatically using a general-purpose learning procedure. This is the key advantage of deep learning. A deep-learning architecture is a multilayer stack of simple modules, all (or most) of which are subject to learning, and many of which compute non-linear input–output mappings. Each module in the stack transforms its input to increase both the selectivity and the invariance of the representation. With multiple non-linear layers, say a depth of 5 to 20, a system can implement extremely intricate functions of its inputs that are simultaneously sensitive to minute details (distinguishing Samoyeds from white wolves) and insensitive to large irrelevant variations such as the background, pose, lighting and surrounding objects. Backpropagation to train multilayer architectures From the earliest days of pattern recognition22,23, the aim of researchers has been to replace hand-engineered features with trainable multilayer networks, but despite its simplicity, the solution was not widely understood until the mid 1980s. As it turns out, multilayer architectures can be trained by simple stochastic gradient descent. As long as the modules are relatively smooth functions of their inputs and of their internal weights, one can compute gradients using the backpropagation procedure. The idea that this could be done, and that it worked, was discovered independently by several different groups during the 1970s and 1980s24–27. The backpropagation procedure to compute the gradient of an objective function with respect to the weights of a multilayer stack of modules is nothing more than a practical application of the chain rule for derivatives. The key insight is that the derivative (or gradient) of the objective with respect to the input of a module can be computed by working backwards from the gradient with respect to the output of that module (or the input of the subsequent module) (Fig. 1). The backpropagation equation can be applied repeatedly to propagate gradients through all modules, starting from the output at the top (where the network produces its prediction) all the way to the bottom (where the external input is fed). Once these gradients have been computed, it is straightforward to compute the gradients with respect to the weights of each module. Many applications of deep learning use feedforward neural network architectures (Fig. 1), which learn to map a fixed-size input (for example, an image) to a fixed-size output (for example, a probability for each of several categories).
To go from one layer to the next, a set of units compute a weighted sum of their inputs from the previous layer and pass the result through a non-linear function. At present, the most popular non-linear function is the rectified linear unit (ReLU), which is simply the half-wave rectifier f(z) = max(z, 0). In past decades, neural nets used smoother non-linearities, such as tanh(z) or 1/(1 + exp(−z)), but the ReLU typically learns much faster in networks with many layers, allowing training of a deep supervised network without unsupervised pre-training28. Units that are not in the input or output layer are conventionally called hidden units. The hidden layers can be seen as distorting the input in a non-linear way so that categories become linearly separable by the last layer (Fig. 1).

In the late 1990s, neural nets and backpropagation were largely forsaken by the machine-learning community and ignored by the computer-vision and speech-recognition communities. It was widely thought that learning useful, multistage feature extractors with little prior knowledge was infeasible. In particular, it was commonly thought that simple gradient descent would get trapped in poor local minima — weight configurations for which no small change would reduce the average error.

In practice, poor local minima are rarely a problem with large networks. Regardless of the initial conditions, the system nearly always reaches solutions of very similar quality. Recent theoretical and empirical results strongly suggest that local minima are not a serious issue in general. Instead, the landscape is packed with a combinatorially large number of saddle points where the gradient is zero, and the surface curves up in most dimensions and curves down in the remainder29,30.

[Figure 2 | Inside a convolutional network. The outputs (not the filters) of each layer (horizontally) of a typical convolutional network architecture applied to the image of a Samoyed dog (bottom left; and RGB (red, green, blue) inputs, bottom right). Each rectangular image is a feature map corresponding to the output for one of the learned features, detected at each of the image positions. Information flows bottom up, with lower-level features acting as oriented edge detectors, and a score is computed for each image class in output. ReLU, rectified linear unit. Stages shown: Convolutions and ReLU → Max pooling → Convolutions and ReLU → Max pooling → Convolutions and ReLU. Output class scores: Samoyed (16); Papillon (5.7); Pomeranian (2.7); Arctic fox (1.0); Eskimo dog (0.6); white wolf (0.4); Siberian husky (0.4).]
The analysis seems to show that saddle points with only a few downward curving directions are present in very large numbers, but almost all of them have very similar values of the objective function. Hence, it does not much matter which of these saddle points the algorithm gets stuck at.

Interest in deep feedforward networks was revived around 2006 (refs 31–34) by a group of researchers brought together by the Canadian Institute for Advanced Research (CIFAR). The researchers introduced unsupervised learning procedures that could create layers of feature detectors without requiring labelled data. The objective in learning each layer of feature detectors was to be able to reconstruct or model the activities of feature detectors (or raw inputs) in the layer below. By 'pre-training' several layers of progressively more complex feature detectors using this reconstruction objective, the weights of a deep network could be initialized to sensible values. A final layer of output units could then be added to the top of the network and the whole deep system could be fine-tuned using standard backpropagation33–35. This worked remarkably well for recognizing handwritten digits or for detecting pedestrians, especially when the amount of labelled data was very limited36.

The first major application of this pre-training approach was in speech recognition, and it was made possible by the advent of fast graphics processing units (GPUs) that were convenient to program37 and allowed researchers to train networks 10 or 20 times faster. In 2009, the approach was used to map short temporal windows of coefficients extracted from a sound wave to a set of probabilities for the various fragments of speech that might be represented by the frame in the centre of the window. It achieved record-breaking results on a standard speech recognition benchmark that used a small vocabulary38 and was quickly developed to give record-breaking results on a large vocabulary task39. By 2012, versions of the deep net from 2009 were being developed by many of the major speech groups6 and were already being deployed in Android phones. For smaller data sets, unsupervised pre-training helps to prevent overfitting40, leading to significantly better generalization when the number of labelled examples is small, or in a transfer setting where we have lots of examples for some 'source' tasks but very few for some 'target' tasks. Once deep learning had been rehabilitated, it turned out that the pre-training stage was only needed for small data sets.

There was, however, one particular type of deep, feedforward network that was much easier to train and generalized much better than networks with full connectivity between adjacent layers. This was the convolutional neural network (ConvNet)41,42. It achieved many practical successes during the period when neural networks were out of favour and it has recently been widely adopted by the computer-vision community.

Convolutional neural networks

ConvNets are designed to process data that come in the form of multiple arrays, for example a colour image composed of three 2D arrays containing pixel intensities in the three colour channels. Many data modalities are in the form of multiple arrays: 1D for signals and sequences, including language; 2D for images or audio spectrograms; and 3D for video or volumetric images.
There are four key ideas behind ConvNets that take advantage of the properties of natural signals: local connections, shared weights, pooling and the use of many layers.

The architecture of a typical ConvNet (Fig. 2) is structured as a series of stages. The first few stages are composed of two types of layers: convolutional layers and pooling layers. Units in a convolutional layer are organized in feature maps, within which each unit is connected to local patches in the feature maps of the previous layer through a set of weights called a filter bank. The result of this local weighted sum is then passed through a non-linearity such as a ReLU. All units in a feature map share the same filter bank. Different feature maps in a layer use different filter banks. The reason for this architecture is twofold. First, in array data such as images, local groups of values are often highly correlated, forming distinctive local motifs that are easily detected. Second, the local statistics of images and other signals are invariant to location. In other words, if a motif can appear in one part of the image, it could appear anywhere, hence the idea of units at different locations sharing the same weights and detecting the same pattern in different parts of the array. Mathematically, the filtering operation performed by a feature map is a discrete convolution, hence the name.

Although the role of the convolutional layer is to detect local conjunctions of features from the previous layer, the role of the pooling layer is to merge semantically similar features into one. Because the relative positions of the features forming a motif can vary somewhat, reliably detecting the motif can be done by coarse-graining the position of each feature. A typical pooling unit computes the maximum of a local patch of units in one feature map (or in a few feature maps). Neighbouring pooling units take input from patches that are shifted by more than one row or column, thereby reducing the dimension of the representation and creating an invariance to small shifts and distortions. Two or three stages of convolution, non-linearity and pooling are stacked, followed by more convolutional and fully-connected layers. Backpropagating gradients through a ConvNet is as simple as through a regular deep network, allowing all the weights in all the filter banks to be trained.

Deep neural networks exploit the property that many natural signals are compositional hierarchies, in which higher-level features are obtained by composing lower-level ones. In images, local combinations of edges form motifs, motifs assemble into parts, and parts form objects. Similar hierarchies exist in speech and text, from sounds to phones, phonemes, syllables, words and sentences. The pooling allows representations to vary very little when elements in the previous layer vary in position and appearance.

The convolutional and pooling layers in ConvNets are directly inspired by the classic notions of simple cells and complex cells in visual neuroscience43, and the overall architecture is reminiscent of the LGN–V1–V2–V4–IT hierarchy in the visual cortex ventral pathway44. When ConvNet models and monkeys are shown the same picture, the activations of high-level units in the ConvNet explain half of the variance of random sets of 160 neurons in the monkey's inferotemporal cortex45.
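As an illustrative aside (not code from the paper), the sketch below implements one convolutional stage as just described: a local weighted sum with a shared filter, a ReLU non-linearity, then max pooling. The image size and the hand-set 2×2 filter are arbitrary assumptions; in a real ConvNet the filter-bank weights are learned, and, as in most ConvNet software, the kernel is applied without flipping (strictly a cross-correlation).

    import numpy as np

    def conv2d(image, kernel):
        # Valid 2D filtering of one feature map with one filter:
        # the same weights (shared filter bank) are applied at every location.
        H, W = image.shape
        kh, kw = kernel.shape
        out = np.zeros((H - kh + 1, W - kw + 1))
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
        return out

    def max_pool(fmap, size=2):
        # Max pooling: keep the maximum of each non-overlapping local patch,
        # coarse-graining positions and shrinking the representation.
        H, W = fmap.shape
        H2, W2 = H // size, W // size
        return fmap[:H2*size, :W2*size].reshape(H2, size, W2, size).max(axis=(1, 3))

    rng = np.random.default_rng(0)
    image = rng.normal(size=(8, 8))        # one input channel, illustrative size
    edge_filter = np.array([[1., -1.],     # a hand-set vertical-edge filter;
                            [1., -1.]])    # in a ConvNet these weights are learned

    fmap = np.maximum(conv2d(image, edge_filter), 0.0)  # convolution + ReLU
    pooled = max_pool(fmap)                             # invariance to small shifts
    print(fmap.shape, pooled.shape)                     # (7, 7) -> (3, 3)

A full stage would apply many such filters to produce many feature maps, and stack two or three such stages before the fully-connected layers.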
ConvNets have their roots in the neocognitron46, the architecture of which was somewhat similar, but did not have an end-to-end supervised-learning algorithm such as backpropagation. A primitive 1D ConvNet called a time-delay neural net was used for the recognition of phonemes and simple words47,48.

There have been numerous applications of convolutional networks going back to the early 1990s, starting with time-delay neural networks for speech recognition47 and document reading42. The document reading system used a ConvNet trained jointly with a probabilistic model that implemented language constraints. By the late 1990s this system was reading over 10% of all the cheques in the United States. A number of ConvNet-based optical character recognition and handwriting recognition systems were later deployed by Microsoft49. ConvNets were also experimented with in the early 1990s for object detection in natural images, including faces and hands50,51, and for face recognition52.

Image understanding with deep convolutional networks

Since the early 2000s, ConvNets have been applied with great success to the detection, segmentation and recognition of objects and regions in images. These were all tasks in which labelled data was relatively abundant, such as traffic sign recognition53, the segmentation of biological images54, particularly for connectomics55, and the detection of faces, text, pedestrians and human bodies in natural images36,50,51,56–58. A major recent practical success of ConvNets is face recognition59. Importantly, images can be labelled at the pixel level, which will have applications in technology, including autonomous mobile robots and
self-driving cars60,61. Companies such as Mobileye and NVIDIA are using such ConvNet-based methods in their upcoming vision systems for cars. Other applications gaining importance involve natural language understanding14 and speech recognition7.

Despite these successes, ConvNets were largely forsaken by the mainstream computer-vision and machine-learning communities until the ImageNet competition in 2012. When deep convolutional networks were applied to a data set of about a million images from the web that contained 1,000 different classes, they achieved spectacular results, almost halving the error rates of the best competing approaches1. This success came from the efficient use of GPUs, ReLUs, a new regularization technique called dropout62, and techniques to generate more training examples by deforming the existing ones. This success has brought about a revolution in computer vision; ConvNets are now the dominant approach for almost all recognition and detection tasks4,58,59,63–65 and approach human performance on some tasks. A recent stunning demonstration combines ConvNets and recurrent net modules for the generation of image captions (Fig. 3).

Recent ConvNet architectures have 10 to 20 layers of ReLUs, hundreds of millions of weights, and billions of connections between units. Whereas training such large networks could have taken weeks only two years ago, progress in hardware, software and algorithm parallelization has reduced training times to a few hours.

The performance of ConvNet-based vision systems has caused most major technology companies, including Google, Facebook, Microsoft, IBM, Yahoo!, Twitter and Adobe, as well as a quickly growing number of start-ups, to initiate research and development projects and to deploy ConvNet-based image understanding products and services.

ConvNets are easily amenable to efficient hardware implementations in chips or field-programmable gate arrays66,67. A number of companies such as NVIDIA, Mobileye, Intel, Qualcomm and Samsung are developing ConvNet chips to enable real-time vision applications in smartphones, cameras, robots and self-driving cars.

Distributed representations and language processing

Deep-learning theory shows that deep nets have two different exponential advantages over classic learning algorithms that do not use distributed representations21. Both of these advantages arise from the power of composition and depend on the underlying data-generating distribution having an appropriate componential structure40. First, learning distributed representations enables generalization to new combinations of the values of learned features beyond those seen during training (for example, 2^n combinations are possible with n binary features)68,69. Second, composing layers of representation in a deep net brings the potential for another exponential advantage70 (exponential in the depth).

The hidden layers of a multilayer neural network learn to represent the network's inputs in a way that makes it easy to predict the target outputs. This is nicely demonstrated by training a multilayer neural network to predict the next word in a sequence from a local context of earlier words71.

[Figure 3 | From image to text. Captions generated by a recurrent neural network (RNN) taking, as extra input, the representation extracted by a deep convolutional neural network (CNN) from a test image, with the RNN trained to 'translate' high-level representations of images into captions (top). Reproduced with permission from ref. 102.
When the RNN is given the ability to focus its attention on a different location in the input image (middle and bottom; the lighter patches were given more attention) as it generates each word (bold), we found86 that it exploits this to achieve better 'translation' of images into captions. Pipeline: Vision (Deep CNN) → Language (Generating RNN). Example captions: "A group of people shopping at an outdoor market. There are many vegetables at the fruit stand." "A woman is throwing a frisbee in a park." "A little girl sitting on a bed with a teddy bear." "A group of people sitting on a boat in the water." "A giraffe standing in a forest with trees in the background." "A dog is standing on a hardwood floor." "A stop sign is on a road with a mountain in the background."]
Each word in the context is presented to the network as a one-of-N vector, that is, one component has a value of 1 and the rest are 0. In the first layer, each word creates a different pattern of activations, or word vectors (Fig. 4). In a language model, the other layers of the network learn to convert the input word vectors into an output word vector for the predicted next word, which can be used to predict the probability for any word in the vocabulary to appear as the next word. The network learns word vectors that contain many active components, each of which can be interpreted as a separate feature of the word, as was first demonstrated27 in the context of learning distributed representations for symbols. These semantic features were not explicitly present in the input. They were discovered by the learning procedure as a good way of factorizing the structured relationships between the input and output symbols into multiple 'micro-rules'. Learning word vectors turned out to also work very well when the word sequences come from a large corpus of real text and the individual micro-rules are unreliable71. When trained to predict the next word in a news story, for example, the learned word vectors for Tuesday and Wednesday are very similar, as are the word vectors for Sweden and Norway. Such representations are called distributed representations because their elements (the features) are not mutually exclusive and their many configurations correspond to the variations seen in the observed data. These word vectors are composed of learned features that were not determined ahead of time by experts, but automatically discovered by the neural network. Vector representations of words learned from text are now very widely used in natural language applications14,17,72–76.

The issue of representation lies at the heart of the debate between the logic-inspired and the neural-network-inspired paradigms for cognition. In the logic-inspired paradigm, an instance of a symbol is something for which the only property is that it is either identical or non-identical to other symbol instances. It has no internal structure that is relevant to its use; and to reason with symbols, they must be bound to the variables in judiciously chosen rules of inference. By contrast, neural networks just use big activity vectors, big weight matrices and scalar non-linearities to perform the type of fast 'intuitive' inference that underpins effortless commonsense reasoning.

Before the introduction of neural language models71, the standard approach to statistical modelling of language did not exploit distributed representations: it was based on counting frequencies of occurrences of short symbol sequences of length up to N (called N-grams). The number of possible N-grams is on the order of V^N, where V is the vocabulary size, so taking into account a context of more than a handful of words would require very large training corpora. N-grams treat each word as an atomic unit, so they cannot generalize across semantically related sequences of words, whereas neural language models can, because they associate each word with a vector of real-valued features, and semantically related words end up close to each other in that vector space (Fig. 4).
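To illustrate the mechanics (an illustrative sketch, not code from the paper): multiplying a one-of-N vector by the first-layer weight matrix simply selects one row of that matrix, which is exactly the word's pattern of activations — its word vector. The tiny vocabulary, dimensions and random weights below are assumptions for demonstration; in a trained model the vectors for words such as Tuesday and Wednesday would end up nearby.

    import numpy as np

    vocab = ["the", "cat", "sat", "on", "mat", "Tuesday", "Wednesday"]
    V, d = len(vocab), 4                     # vocabulary size, word-vector dimension
    rng = np.random.default_rng(0)
    E = rng.normal(scale=0.1, size=(V, d))   # first-layer weights; learned in practice

    def one_of_n(word):
        v = np.zeros(V)                      # one component is 1, the rest are 0
        v[vocab.index(word)] = 1.0
        return v

    # One-of-N input times the weight matrix = a row lookup (the word vector).
    assert np.allclose(one_of_n("cat") @ E, E[vocab.index("cat")])

    def cos(u, w):
        # Cosine similarity, often used to compare learned word vectors.
        return u @ w / (np.linalg.norm(u) * np.linalg.norm(w))

    # Meaningless here (random weights), but high for related words after training:
    print(cos(E[vocab.index("Tuesday")], E[vocab.index("Wednesday")]))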
Recurrent neural networks

When backpropagation was first introduced, its most exciting use was for training recurrent neural networks (RNNs). For tasks that involve sequential inputs, such as speech and language, it is often better to use RNNs (Fig. 5). RNNs process an input sequence one element at a time, maintaining in their hidden units a 'state vector' that implicitly contains information about the history of all the past elements of the sequence. When we consider the outputs of the hidden units at different discrete time steps as if they were the outputs of different neurons in a deep multilayer network (Fig. 5, right), it becomes clear how we can apply backpropagation to train RNNs.

RNNs are very powerful dynamic systems, but training them has proved to be problematic because the backpropagated gradients either grow or shrink at each time step, so over many time steps they typically explode or vanish77,78.

Thanks to advances in their architecture79,80 and ways of training them81,82, RNNs have been found to be very good at predicting the next character in the text83 or the next word in a sequence75, but they can also be used for more complex tasks. For example, after reading an English sentence one word at a time, an English 'encoder' network can be trained so that the final state vector of its hidden units is a good representation of the thought expressed by the sentence. This thought vector can then be used as the initial hidden state of (or as extra input to) a jointly trained French 'decoder' network, which outputs a probability distribution for the first word of the French translation. If a particular first word is chosen from this distribution and provided as input to the decoder network, it will then output a probability distribution for the second word of the translation, and so on until a full stop is chosen17,72,76. Overall, this process generates sequences of French words according to a probability distribution that depends on the English sentence. This rather naive way of performing machine translation has quickly become competitive with the state of the art, and this raises serious doubts about whether understanding a sentence requires anything like the internal symbolic expressions that are manipulated by using inference rules. It is more compatible with the view that everyday reasoning involves many simultaneous analogies that each contribute plausibility to a conclusion84,85.

[Figure 4 | Visualizing the learned word vectors. On the left is an illustration of word representations learned for modelling language, non-linearly projected to 2D for visualization using the t-SNE algorithm103 (plotted neighbourhoods include clusters of words such as 'community', 'organizations', 'institutions', 'society'). On the right is a 2D representation of phrases learned by an English-to-French encoder–decoder recurrent neural network75 (with clusters of phrases such as 'over the past few months', 'In the last few days'). One can observe that semantically similar words or sequences of words are mapped to nearby representations. The distributed representations of words are obtained by using backpropagation to jointly learn a representation for each word and a function that predicts a target quantity such as the next word in a sequence (for language modelling) or a whole sequence of translated words (for machine translation)18,75.]
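The recurrent computation described above (and drawn in Fig. 5, with parameters U, V, W shared across all time steps) can be sketched in a few lines. The layer sizes and the tanh non-linearity below are illustrative assumptions, not specifics from the paper.

    import numpy as np

    rng = np.random.default_rng(0)
    n_in, n_hidden, n_out = 3, 5, 2                       # illustrative sizes
    U = rng.normal(scale=0.3, size=(n_hidden, n_in))      # input -> hidden
    W = rng.normal(scale=0.3, size=(n_hidden, n_hidden))  # hidden -> hidden (recurrent)
    V = rng.normal(scale=0.3, size=(n_out, n_hidden))     # hidden -> output

    def rnn_forward(xs):
        # Process the sequence one element at a time, carrying a state vector s
        # that implicitly summarizes all past elements of the sequence.
        s = np.zeros(n_hidden)
        outputs = []
        for x in xs:                      # same parameters U, V, W at every step
            s = np.tanh(U @ x + W @ s)    # state update
            outputs.append(V @ s)         # output o_t at time t
        return outputs, s                 # final s can serve as a 'thought vector'

    xs = [rng.normal(size=n_in) for _ in range(4)]
    os, s_final = rnn_forward(xs)

Unfolding this loop over time gives exactly the deep feedforward network of Fig. 5 (right), to which the backpropagation algorithm of Fig. 1 applies directly.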
Instead of translating the meaning of a French sentence into an English sentence, one can learn to 'translate' the meaning of an image into an English sentence (Fig. 3). The encoder here is a deep ConvNet that converts the pixels into an activity vector in its last hidden layer. The decoder is an RNN similar to the ones used for machine translation and neural language modelling. There has been a surge of interest in such systems recently (see examples mentioned in ref. 86).

RNNs, once unfolded in time (Fig. 5), can be seen as very deep feedforward networks in which all the layers share the same weights. Although their main purpose is to learn long-term dependencies, theoretical and empirical evidence shows that it is difficult to learn to store information for very long78.

To correct for that, one idea is to augment the network with an explicit memory. The first proposal of this kind is the long short-term memory (LSTM) networks that use special hidden units, the natural behaviour of which is to remember inputs for a long time79. A special unit called the memory cell acts like an accumulator or a gated leaky neuron: it has a connection to itself at the next time step that has a weight of one, so it copies its own real-valued state and accumulates the external signal, but this self-connection is multiplicatively gated by another unit that learns to decide when to clear the content of the memory (a code sketch of such a cell follows this passage). LSTM networks have subsequently proved to be more effective than conventional RNNs, especially when they have several layers for each time step87, enabling an entire speech recognition system that goes all the way from acoustics to the sequence of characters in the transcription. LSTM networks or related forms of gated units are also currently used for the encoder and decoder networks that perform so well at machine translation17,72,76.

Over the past year, several authors have made different proposals to augment RNNs with a memory module. Proposals include the Neural Turing Machine, in which the network is augmented by a 'tape-like' memory that the RNN can choose to read from or write to88, and memory networks, in which a regular network is augmented by a kind of associative memory89. Memory networks have yielded excellent performance on standard question-answering benchmarks. The memory is used to remember the story about which the network is later asked to answer questions.

Beyond simple memorization, neural Turing machines and memory networks are being used for tasks that would normally require reasoning and symbol manipulation. Neural Turing machines can be taught 'algorithms'. Among other things, they can learn to output a sorted list of symbols when their input consists of an unsorted sequence in which each symbol is accompanied by a real value that indicates its priority in the list88. Memory networks can be trained to keep track of the state of the world in a setting similar to a text adventure game, and after reading a story, they can answer questions that require complex inference90. In one test example, the network is shown a 15-sentence version of The Lord of the Rings and correctly answers questions such as "where is Frodo now?"89.
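Following up the forward reference above, here is a minimal sketch of an LSTM memory cell, written in the now-standard formulation with input, forget and output gates (the forget gate plays the role of the unit that 'learns to decide when to clear the content of the memory'; the original proposal in ref. 79 predates it). Sizes, initialization and variable names are illustrative assumptions.

    import numpy as np

    def logistic(z):
        return 1.0 / (1.0 + np.exp(-z))

    def lstm_step(x, h, c, P):
        # One LSTM time step. P holds weight matrices/biases for the input (i),
        # forget (f) and output (o) gates and the candidate external signal (g).
        z = np.concatenate([x, h])
        i = logistic(P["Wi"] @ z + P["bi"])   # how much new signal to write
        f = logistic(P["Wf"] @ z + P["bf"])   # learns when to clear the memory
        o = logistic(P["Wo"] @ z + P["bo"])   # how much of the cell to expose
        g = np.tanh(P["Wg"] @ z + P["bg"])    # candidate external signal
        # The memory cell copies its own state (self-connection of weight one)
        # and accumulates the external signal, gated multiplicatively:
        c = f * c + i * g
        h = o * np.tanh(c)
        return h, c

    n_in, n_h = 3, 4
    rng = np.random.default_rng(0)
    P = {k: rng.normal(scale=0.3, size=(n_h, n_in + n_h))
         for k in ("Wi", "Wf", "Wo", "Wg")}
    P.update({b: np.zeros(n_h) for b in ("bi", "bf", "bo", "bg")})

    h = c = np.zeros(n_h)
    for x in [rng.normal(size=n_in) for _ in range(5)]:
        h, c = lstm_step(x, h, c, P)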
The future of deep learning

Unsupervised learning91–98 had a catalytic effect in reviving interest in deep learning, but has since been overshadowed by the successes of purely supervised learning. Although we have not focused on it in this Review, we expect unsupervised learning to become far more important in the longer term. Human and animal learning is largely unsupervised: we discover the structure of the world by observing it, not by being told the name of every object.

Human vision is an active process that sequentially samples the optic array in an intelligent, task-specific way, using a small, high-resolution fovea with a large, low-resolution surround. We expect much of the future progress in vision to come from systems that are trained end-to-end and combine ConvNets with RNNs that use reinforcement learning to decide where to look. Systems combining deep learning and reinforcement learning are in their infancy, but they already outperform passive vision systems99 at classification tasks and produce impressive results in learning to play many different video games100.

Natural language understanding is another area in which deep learning is poised to make a large impact over the next few years. We expect systems that use RNNs to understand sentences or whole documents will become much better when they learn strategies for selectively attending to one part at a time76,86.

Ultimately, major progress in artificial intelligence will come about through systems that combine representation learning with complex reasoning. Although deep learning and simple reasoning have been used for speech and handwriting recognition for a long time, new paradigms are needed to replace rule-based manipulation of symbolic expressions by operations on large vectors101. ■

Received 25 February; accepted 1 May 2015.

1. Krizhevsky, A., Sutskever, I. & Hinton, G. ImageNet classification with deep convolutional neural networks. In Proc. Advances in Neural Information Processing Systems 25, 1090–1098 (2012). This report was a breakthrough that used convolutional nets to almost halve the error rate for object recognition, and precipitated the rapid adoption of deep learning by the computer vision community.
2. Farabet, C., Couprie, C., Najman, L. & LeCun, Y. Learning hierarchical features for scene labeling. IEEE Trans. Pattern Anal. Mach. Intell. 35, 1915–1929 (2013).
3. Tompson, J., Jain, A., LeCun, Y. & Bregler, C. Joint training of a convolutional network and a graphical model for human pose estimation. In Proc. Advances in Neural Information Processing Systems 27, 1799–1807 (2014).
4. Szegedy, C. et al. Going deeper with convolutions. Preprint at http://arxiv.org/abs/1409.4842 (2014).
5. Mikolov, T., Deoras, A., Povey, D., Burget, L. & Cernocky, J. Strategies for training large scale neural network language models. In Proc. Automatic Speech Recognition and Understanding 196–201 (2011).
6. Hinton, G. et al. Deep neural networks for acoustic modeling in speech recognition. IEEE Signal Processing Magazine 29, 82–97 (2012). This joint paper from the major speech recognition laboratories, summarizing the breakthrough achieved with deep learning on the task of phonetic classification for automatic speech recognition, was the first major industrial application of deep learning.
7. Sainath, T., Mohamed, A.-R., Kingsbury, B. & Ramabhadran, B. Deep convolutional neural networks for LVCSR. In Proc. Acoustics, Speech and Signal Processing 8614–8618 (2013).
8. Ma, J., Sheridan, R. P., Liaw, A., Dahl, G. E. & Svetnik, V. Deep neural nets as a method for quantitative structure-activity relationships. J. Chem. Inf. Model. 55, 263–274 (2015).
9. Ciodaro, T., Deva, D., de Seixas, J. & Damazio, D.
Online particle detection with neural networks based on topological calorimetry information. J. Phys. Conf. Series 368, 012030 (2012).
10. Kaggle. Higgs boson machine learning challenge. Kaggle https://www.kaggle.com/c/higgs-boson (2014).
11. Helmstaedter, M. et al. Connectomic reconstruction of the inner plexiform layer in the mouse retina. Nature 500, 168–174 (2013).

[Figure 5 | A recurrent neural network and the unfolding in time of the computation involved in its forward computation. The artificial neurons (for example, hidden units grouped under node s with values st at time t) get inputs from other neurons at previous time steps (this is represented with the black square, representing a delay of one time step, on the left). In this way, a recurrent neural network can map an input sequence with elements xt into an output sequence with elements ot, with each ot depending on all the previous xtʹ (for tʹ ≤ t). The same parameters (matrices U, V, W) are used at each time step. Many other architectures are possible, including a variant in which the network can generate a sequence of outputs (for example, words), each of which is used as inputs for the next time step. The backpropagation algorithm (Fig. 1) can be directly applied to the computational graph of the unfolded network on the right, to compute the derivative of a total error (for example, the log-probability of generating the right sequence of outputs) with respect to all the states st and all the parameters.]