Bachelor's Degree Programme in Physics
GEOMETRIC PROCESSING OF DATA
IN NEURAL NETWORKS
Supervisor:
Prof. Marco Gherardi
Assistant supervisor:
Dr. Pietro Rotondo
Candidate:
Lorenzo Cassani
Student ID: 867016
Academic Year 2021-2022
Acknowledgments
This bachelor’s thesis would not have been possible without the support of many
people. Professor Marco Gherardi has been an ideal teacher and thesis supervisor,
offering advice and encouragement with a perfect blend of insight and humor. I’m
proud of, and grateful for, my time working with Marco. I would also like to thank
my assistant supervisor Pietro Rotondo, as well as graduate student Simone Ciceri,
for their constant help throughout this project.
I wish to extend my special thanks to my university colleagues (and dear friends),
I ragazzi di Via Celoria, who were always there for me over the course of my degree.
Their company in the study halls, laboratories and pubs will always be remembered.
A huge thanks also goes to my friends from Romagna, my homeland. I already
spent some of my best years in their company, and they will always have a special
place in my heart.
I am grateful for my parents whose constant love and support keep me motivated
and confident. My accomplishments and success are because they believed in me.
Deepest thanks to my little sister, who taught me that you should never be afraid of
being yourself.
Finally, I would like to thank my beloved girlfriend Silvia, who’s had my back for
almost eight years. I could never have done this without her.
Abstract
Feed-forward neural networks can be considered as geometric transformations that
act on input data points. It is known that, during training, those transformations
generally bring points belonging to the same class closer together and drive points
belonging to different classes farther away. The purpose of this work is to carry out
a numerical analysis of how this description varies during training. The training task
consisted of a binary classification (e.g. even digits vs. odd digits) of the elements of MNIST and
other similar structured datasets, in order to have a clear vision of the link between
structure in data and possible noteworthy behaviours in the evolution of the inner
geometries of neural networks. Particular attention has been reserved to data points
which neural networks struggle more to correctly classify, and their connection to the
neural networks' generalization capability.
Contents
1 Introduction 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Purpose . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2 Deep Learning as Geometric Processing of Data 4
2.1 Machine Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.1.1 The Supervised Learning Approach . . . . . . . . . . . . . . . 6
2.1.2 The Gradient Descent Optimization Algorithm . . . . . . . . . 7
2.2 Artificial Neural Networks and Deep Learning . . . . . . . . . . . . . 8
2.2.1 Artificial Neurons . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2.2 The Feed-Forward Architecture . . . . . . . . . . . . . . . . . 10
2.2.3 Training Neural Networks . . . . . . . . . . . . . . . . . . . . 11
2.2.4 The Backpropagation Algorithm . . . . . . . . . . . . . . . . . 15
2.3 Structure in Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.3.1 The Object Manifold Model . . . . . . . . . . . . . . . . . . . 17
2.3.2 Geometrical Observables . . . . . . . . . . . . . . . . . . . . . 19
3 Tools and Techniques Employed 20
3.1 Overview of the Datasets . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.1.1 MNIST . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.1.2 EMNIST . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.1.3 KMNIST . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.1.4 Fashion-MNIST . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.2 Programming Language and Libraries . . . . . . . . . . . . . . . . . . 24
3.2.1 Why we opted for Python . . . . . . . . . . . . . . . . . . . . 24
3.2.2 The PyTorch Framework for Machine Learning . . . . . . . . 26
3.3 Data Pre-Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.3.1 Rescaling (min-max Normalization) . . . . . . . . . . . . . . . 28
3.3.2 Standardization (Z-score Normalization) . . . . . . . . . . . . 28
3.4 ML Task and Network Architecture . . . . . . . . . . . . . . . . . . . 29
3.4.1 The Binary Classification Task . . . . . . . . . . . . . . . . . 29
3.4.2 Network Architecture . . . . . . . . . . . . . . . . . . . . . . . 29
3.4.3 Training and Test Errors Computation . . . . . . . . . . . . . 30
3.4.4 Choice of Activation Function . . . . . . . . . . . . . . . . . . 30
3.4.5 Choice of Loss Function . . . . . . . . . . . . . . . . . . . . . 32
4 Numerical Analysis Results 33
4.1 Dynamics of Geometric Observables . . . . . . . . . . . . . . . . . . . 33
4.1.1 Non-Monotonic Behaviours in MNIST . . . . . . . . . . . . . 34
4.1.2 Comparing MNIST with Similar Datasets . . . . . . . . . . . 36
4.1.3 Epochs of Inversion . . . . . . . . . . . . . . . . . . . . . . . . 38
4.2 The Finding of Stragglers Data Points . . . . . . . . . . . . . . . . . 39
4.2.1 The Critical Role of Stragglers . . . . . . . . . . . . . . . . . . 39
4.2.2 Steps Towards a Formal Definition . . . . . . . . . . . . . . . 42
4.2.3 Training and Filtration . . . . . . . . . . . . . . . . . . . . . . 43
4.3 Effects of the Shuffling of Labels . . . . . . . . . . . . . . . . . . . . . 44
5 Conclusions and Future Work 46
Chapter 1
Introduction
1.1 Motivation
The availability of big datasets is a hallmark of modern sciences, including physics,
where data analysis has become an important component of diverse areas, such as
experimental particle physics, observational astronomy and cosmology, condensed
matter physics, biophysics, and quantum computing. Moreover, machine learning
and data science are playing increasingly important roles in many aspects of modern
technology, ranging from biotechnology to the engineering of self-driving cars and
smart devices. Therefore, having a grasp of the concepts and tools used in machine
learning is an important skill that is increasingly relevant in the physical sciences.
This revolution has been spurred by an exponential growth in computing power
and memory commonly known as Moore’s law. This increase in our computational
ability has been accompanied by new techniques for analyzing and learning from
large datasets. These techniques draw heavily from ideas in statistics, computational
neuroscience, computer science, and physics. Similar to physics, modern machine
learning places a premium on empirical results and intuition over the more formal
treatments common in statistics, computer science, and mathematics. This is not
to say that proofs are unimportant or undesirable. Rather, many of the advances
of the last two decades – especially in fields like deep learning – do not have formal
justifications. Physicists are uniquely situated to benefit from and contribute to
machine learning. Many of the core concepts and techniques used in machine learning
- such as Monte-Carlo methods, simulated annealing and variational methods - have
their origins in physics. Moreover, “energy-based models” inspired by statistical
physics are the backbone of many deep learning methods. For these reasons, there is
much in modern ML that will be familiar to physicists. Physicists and astronomers
have also been at the forefront of using “big data”. For example, experiments such
as CMS and ATLAS at the LHC generate petabytes of data per year (Fig. 1.1). In
astronomy, projects such as the Sloan Digital Sky Survey (SDSS) routinely analyze
and release hundreds of terabytes of data measuring the properties of nearly a billion
stars and galaxies. Researchers in these fields are increasingly incorporating recent
advances in ML and data science, and this trend is likely to accelerate in the future.
Fig. 1.1 Data (in TB) recorded on tape at CERN month-by-month. This plot
shows the amount of data recorded on tape generated by the LHC experiments, other
experiments, various back-ups and users. In 2018, over 115 PB of data in total
(including about 88 PB of LHC data) were recorded on tape, with a record peak of
15.8 PB in November (Image: Esma Mobs/CERN).
And it is in this context that we introduce artificial neural networks, a family
of machine learning algorithms that use a “network” consisting of multiple layers of
inter-connected nodes. They are inspired by the animal nervous system, where the
nodes are viewed as neurons and edges are viewed as synapses. Each edge has an
associated parameter, named weight, that the algorithm tunes by itself during the
training process, and the network possesses computational rules, in the form of both
linear and non-linear functions, for passing data from its input to its output layer.
With appropriately defined functions, a neural network can perform various learning
tasks by minimizing a loss function over its weights. Multilayer networks can, for
example, be used to perform feature learning, since they learn a representation of
their input at the intermediate layer(s), commonly known as hidden layer(s), which
is subsequently used for classification at the output layer.
Deep learning is part of a broader family of machine learning methods based on
artificial neural networks with feature learning. The adjective deep refers to the use
of multiple layers in the network. Early work showed that a linear, single-neuron
network, the perceptron, cannot be a universal classifier, but that a network with a
non-polynomial activation function with one hidden layer of unbounded width can.
Deep learning can then be considered a modern variation of machine learning, focusing
on artificial neural networks with an unbounded number of layers of bounded size,
which permits practical application and optimized implementation, while retaining
theoretical universality under mild conditions.
One of the main criticisms of deep learning concerns the previously mentioned
lack of a theoretical framework supporting some of its facets. Learning in the most
common deep architectures is implemented using the well-understood gradient descent
algorithm or one of its many variations. However, the theory surrounding other
aspects, such as how real world structured data are processed by deep neural networks,
is less clear. Deep learning methods are often looked at as a black box, with most
confirmations done empirically, rather than theoretically.
One of the common assumptions of theoretical investigations on neural networks
is that of considering input data points as unstructured data, i.e. uncorrelated data
points, distributed randomly in their state space. Nevertheless, the necessity of a
theoretical framework taking the structure in data into account has already emerged in
different contexts, for instance in investigating how stimuli are represented in the brain
when an object is shown in varying conditions [8], following the discovery of spatial
maps in rodents' brains [20], or in research explicitly concerning machine learning
[9][2][6]. So, what can be considered as “structure” in data? The previously mentioned
studies have adopted different definitions and models for describing structured data,
depending on their specific purpose and field of origin.
Driven by the existing gap between established deep learning techniques and an
inadequate theoretical understanding, an increasing interest in this sector has been
observed over the last few years, motivating the community of physicists to tackle
this problem. Physical approaches and methods originating from statistical physics
have already contributed concretely to this task [10][11], shedding light on some of the
then still obscure aspects of machine learning and offering new interesting insights.
1.2 Purpose
The purpose of this work is to carry out a numerical analysis of how multi-layer,
feed-forward neural networks process structured data, since it is known that this kind
of network can be interpreted as geometric transformations acting on input data
points. We trained these networks on the task of binary classification of data from
well-known structured datasets, such as MNIST and other similar ones, in “even” and
“odd” dichotomies, focusing on how the inner geometric representations of clusters of
elements belonging to the same class, called object manifolds, vary during the training
process. The latter was carried out, for the sake of generality, with basic gradient
descent as the optimization algorithm, in order to highlight noteworthy behaviours in
the evolution of the inner geometries. Particular attention has been reserved to data
which neural networks struggle more to correctly classify, and to their connection to
the generalization capability of the networks.
Chapter 2
Deep Learning as Geometric Processing of Data
2.1 Machine Learning
Machine Learning (ML) is a subfield of Artificial Intelligence (AI) with the goal
of developing algorithms capable of learning from data automatically. Therefore,
techniques in ML tend to be more focused on prediction rather than estimation, in
contrast to classical statistics, which instead is primarily concerned with how to use
data to estimate the value of an unknown quantity. In addition, methods from ML
tend to be applied to more complex high-dimensional problems than those typically
encountered in classical statistics. ML also has intimate ties to optimization: many
learning problems are formulated as minimization of some loss function on a training
set of examples. Loss functions express the discrepancy between the predictions of the
model being trained and the actual problem instances (for example, in classification,
one wants to assign a label to instances, and models are trained to correctly predict
the pre-assigned labels of a set of examples). The difference between optimization
and ML arises from the goal of generalization: while optimization algorithms can
minimize the loss on a training set, machine learning is concerned with minimizing
the loss on unseen samples. Characterizing the generalization capability of various
learning algorithms is an active topic of current research.
The term machine learning was coined in 1959 by Arthur Samuel, an American
IBMer and pioneer in the field of computer gaming and artificial intelligence.[22]
Tom M. Mitchell provided a widely quoted, more formal definition of the algorithms
studied in the machine learning field: “A computer program is said to learn from
experience E with respect to some class of tasks T and performance measure P if its
performance at tasks in T, as measured by P, improves with experience E.”[19] This
definition of the tasks in which machine learning is concerned offers a fundamentally
operational definition rather than defining the field in cognitive terms. This follows
Alan Turing’s proposal in his paper “Computing Machinery and Intelligence” [23], in
which the question “Can machines think?” is replaced with “Can machines do what
we (as thinking entities) can do?”.
ML approaches are traditionally divided into three broad categories, depending on
the nature of the “feedback” available to the learning algorithm: supervised learning,
unsupervised learning, and reinforcement learning.
• Supervised learning consists in learning from labelled data. The learning
algorithm is presented with sample inputs and their desired outputs, the labels,
given by a “supervisor”, and the goal is to learn a general rule that maps
inputs to outputs. Common supervised learning tasks include classification and
regression. Classification algorithms are used when the labels are restricted to
a limited set of values, and regression algorithms are used when the desired
outputs may have any numerical value within a range.
• Unsupervised learning is concerned with finding patterns and structure in
unlabelled data. Unsupervised learning algorithms take a dataset that contains
unlabelled inputs, and find structure in the data, like grouping or clustering of
data points. The algorithms, therefore, learn independently from data that has
not been previously labelled. Instead of responding to feedback, unsupervised
learning algorithms identify commonalities in the data and react based on the
presence or absence of such commonalities in each new piece of input data.
Examples of unsupervised learning include clustering, dimensionality reduction,
and generative modeling.
• Reinforcement learning algorithms learn by interacting with an environment
and taking actions to maximize some notion of reward. The environment is
typically represented as a Markov decision process (MDP)1. In reinforcement
learning, algorithms do not assume knowledge of an exact mathematical model
of the MDP, and are used when exact models are infeasible. Reinforcement
learning algorithms are used in autonomous vehicles or in learning to play a
game against a human opponent.
In this investigation, we made use of techniques belonging exclusively to supervised
learning, since our ML task consisted in a binary classification of MNIST and other
similar datasets in “even” and “odd” dichotomies.
We should point out that ML presents some universal limitations. First, fitting
existing data well is fundamentally different from making predictions about new data.
Next, increasing a model’s complexity (i.e number of fitting parameters) will usually
yield better results on the training data. However when the training data size is
small and the data are noisy, this results in overfitting and can substantially degrade
the predictive performance of the model. Furthermore, as the number of parameters
in the model increases, we are forced to work in high-dimensional spaces, where the
so-called curse of dimensionality ensures that many phenomena that are absent or
rare in low-dimensional spaces become generic. Finally, it is difficult to generalize
beyond the situations encountered in the training data set.
1 A Markov decision process is a discrete-time stochastic control process providing a mathematical framework for modeling decision making in situations where outcomes are partly random and partly under the control of a decision maker.
2.1.1 The Supervised Learning Approach
The first ingredient of a supervised learning task is the dataset $D = \{(x_i, y_i)\}_{i=1}^{n}$, where $x_i$ are the vectors associated with the data points and $y_i$ their respective labels.
The second is the model $f(x_i; w)$, i.e. a function $f : x_i \to \hat{y}_i$ of the parameters $w$.
f is a function used to predict an output vector ŷi from a vector of input variables.
The final ingredient is the loss function L(yi, f(xi; w)) that allows us to judge how
well the model performs on the observations yi. The model is fit by finding the value
of w that minimizes the loss function. For example, one commonly used loss function
is the mean squared error. Minimizing the squared error loss function is known as the
method of least squares, and is typically appropriate for experiments with Gaussian
measurement errors.
ML researchers and data scientists follow a standard recipe to obtain models that
are useful for prediction problems. The first step in the analysis is to randomly divide
the dataset D into two mutually exclusive groups Dtrain and Dtest called the training
set and the test set. The fact that this must be the first step should be heavily
emphasized: performing some analysis (such as using the data to select important
variables) before partitioning the data is a common pitfall that can lead to incorrect
conclusions. Typically, the majority of the data are partitioned into the training set
(e.g. 90%) with the remainder going into the test set. The model is fit by minimizing
the loss function using only the data in the training set
$$\hat{w} = \arg\min_{w} \{ L(y_i^{\mathrm{train}}, f(x_i^{\mathrm{train}}; w)) \}. \qquad (2.1)$$
Finally, the performance of the model is evaluated by computing the loss function
using the test set L(yi test, f(xi test; ŵ)). The performance on unseen data, i.e. the
test set, is known as the generalization capability of the model. Splitting the data
into mutually exclusive training and test sets provides an unbiased estimate for the
predictive performance of the model: this is known as cross-validation in the ML and
statistics literature.
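As a concrete illustration of this recipe, the following sketch (not taken from the code of this work; the synthetic dataset, the linear model and the exact 90/10 split are assumptions made purely for the example) splits a dataset into training and test sets, fits the model by minimizing the squared-error loss on the training set only, and then evaluates the loss on the held-out test set.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                              # data points x_i
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=1000)    # labels y_i

# Step 1: split D into mutually exclusive training and test sets (here 90% / 10%).
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=0)

# Step 2: fit the model by minimizing the loss on the training set only
# (here least squares, i.e. the mean squared error).
model = LinearRegression().fit(X_train, y_train)

# Step 3: evaluate the loss on the held-out test set to estimate the generalization capability.
print("train MSE:", mean_squared_error(y_train, model.predict(X_train)))
print("test  MSE:", mean_squared_error(y_test, model.predict(X_test)))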
Fig. 2.1 A simple flowchart representing the supervised learning approach.
2.1.2 The Gradient Descent Optimization Algorithm
We shall now discuss the method we used for performing the minimization of the
loss function L: the Gradient Descent (GD) algorithm. The basic idea behind this
method is straightforward: iteratively adjust the parameters w in the direction where
the gradient of the loss function w.r.t. the parameters is large and negative. In this
way, the training procedure ensures the parameters flow towards a local minimum of
the loss function. We first initialize the parameters to some value w0 and iteratively
update the parameters according to the equation
$$v_t = \eta_t \nabla_w L(w_t), \qquad w_{t+1} = w_t - v_t, \qquad (2.2)$$
where ∇wL(w) is the gradient of L(w) w.r.t. w and we have introduced a learning
rate, ηt, that controls how big a step we should take in the direction of the gradient
at time step t. It is clear that for a sufficiently small choice of the learning rate ηt
this method will converge to a local minimum (in all directions) of the loss function.
However, choosing a small ηt comes at a huge computational cost. The smaller ηt,
the more steps we have to take to reach the local minimum. In contrast, if ηt is too
large, we can overshoot the minimum and the algorithm becomes unstable (it either
oscillates or even moves away from the minimum). This is shown in Fig. 2.2.
Fig. 2.2 Gradient descent exhibits three qualitatively different regimes as a function of the learning rate. Result of gradient descent on the surface $z = x^2 + y^2 - 1$ for learning rates η = 0.1, 0.5, 1.01. Notice that the trajectory converges to the global minimum in multiple steps for small learning rates (η = 0.1). Increasing the learning rate further (η = 0.5) causes the trajectory to oscillate around the global minimum before converging. For even larger learning rates (η = 1.01) the trajectory diverges from the minimum.
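The update rule of Eq. (2.2) takes only a few lines to implement. The sketch below applies it, with a constant learning rate, to the surface z = x² + y² − 1 of Fig. 2.2; the starting point and the number of iterations are arbitrary choices for illustration, so the printed numbers are only indicative.

import numpy as np

def gradient_descent(grad, w0, eta, n_steps=50):
    """Iterate w_{t+1} = w_t - eta * grad(w_t), cf. Eq. (2.2) with a constant learning rate."""
    w = np.asarray(w0, dtype=float)
    trajectory = [w.copy()]
    for _ in range(n_steps):
        w = w - eta * grad(w)
        trajectory.append(w.copy())
    return np.array(trajectory)

# Loss surface z = x^2 + y^2 - 1 (Fig. 2.2) and its gradient.
grad = lambda w: 2.0 * w

for eta in (0.1, 0.5, 1.01):
    traj = gradient_descent(grad, w0=[2.0, 3.0], eta=eta)
    print(f"eta = {eta}: final point {traj[-1]}, distance from the minimum {np.linalg.norm(traj[-1]):.3g}")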
2.2 Artificial Neural Networks and Deep Learning
Artificial Neural Networks (ANNs), usually simply called Neural Networks (NNs)
or neural nets, are non-linear models for supervised learning inspired by the biological
neural networks that constitute animal brains. An ANN is based on a collection of
connected units or nodes called artificial neurons, which loosely model the neurons in a
biological brain. Each connection, like the synapses in a biological brain, can transmit
a signal to other neurons. An artificial neuron receives a signal then processes it and
can signal neurons connected to it. The “signal” at a connection is a real number, and
the output of each neuron is computed by some non-linear function of the sum of its
inputs. The connections are called edges. Neurons and edges typically have a weight
that adjusts as learning proceeds. The weight increases or decreases the strength of
the signal at a connection. Typically, neurons are aggregated into layers. Different
layers may perform different transformations on their inputs. Signals travel from the
first layer to the last layer, possibly after traversing one or more intermediate layers.
Over the last decade, neural networks have emerged as one of the most powerful
and widely-used supervised learning techniques. Deep Neural Networks (DNNs), that
is NNs containing multiple intermediate layers, have a long history, but re-emerged to
prominence after a rebranding as Deep Learning in the mid 2000s. DNNs truly caught
the attention of the wider ML community and industry in 2012 when Alex Krizhevsky,
Ilya Sutskever, and Geoff Hinton used a GPU-based DNN model (AlexNet) to lower
the error rate on the ImageNet Large Scale Visual Recognition Challenge (ILSVRC)
by an incredible twelve percent from 28% to 16% [15]. Since then, DNNs have become
the workhorse technique for many image and speech recognition based ML tasks. The
large-scale industrial deployment of DNNs has given rise to many high-level libraries
and packages (TensorFlow, Keras, PyTorch, etc.) that make it easy to quickly code
and deploy DNNs.
Fig. 2.3 Neuron and myelinated axon, with signal flow from inputs at dendrites to
outputs at axon terminals.
2.2.1 Artificial Neurons
The basic unit of a neural network is a stylized “neuron” i that takes a vector
of d input features x = (x1, x2, ..., xd) and produces a scalar output ai(x). A neural
net consists of many such neurons stacked into layers, with the output of one layer
serving as the input for the next. The first layer in the neural net is called the input
layer, the intermediate layers are commonly known as “hidden layers”, and the final
layer is called the output layer. The exact function ai varies depending on the type of
non-linearity used in the NN. However, in essentially all cases ai can be decomposed
into a linear operation that weights the relative importance of the various inputs, and
a non-linear transformation fi(z) called activation function which is usually the same
for all neurons. The linear transformation takes the form of a dot product with a set
of neuron-specific weights $w^{(i)} = (w^{(i)}_1, w^{(i)}_2, \ldots, w^{(i)}_d)$ followed by re-centering with a neuron-specific bias $b^{(i)}$:

$$z^{(i)} = w^{(i)} \cdot x + b^{(i)} = \mathbf{x}^T \cdot \mathbf{w}^{(i)}, \qquad (2.3)$$

where $\mathbf{x} = (1, x)$ and $\mathbf{w}^{(i)} = (b^{(i)}, w^{(i)})$. In terms of $z^{(i)}$ and the activation function $f_i(z)$, we can write the full input-output function as

$$a_i(x) = f_i(z^{(i)}), \qquad (2.4)$$

see Figure 2.4.
Fig. 2.4 Sketch of an artificial neuron, consisting of a linear transformation that
weights the importance of various inputs, followed by a non-linear activation function.
Different choices of activation functions lead to different properties for neurons. The underlying reason for this is that we train NNs using gradient descent (see Subsec. 2.1.2), which requires us to take derivatives of the neural input-output function with respect to the weights $w^{(i)}$ and the bias $b^{(i)}$.
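To make Eqs. (2.3) and (2.4) concrete, the following sketch evaluates a single artificial neuron; the tanh activation and the numerical values of x, w and b are arbitrary choices for illustration.

import numpy as np

def neuron_output(x, w, b, activation=np.tanh):
    """Single artificial neuron: a_i(x) = f_i(z) with z = w . x + b, cf. Eqs. (2.3)-(2.4)."""
    z = np.dot(w, x) + b          # linear weighting of the inputs plus the bias
    return activation(z)          # non-linear activation function

x = np.array([0.5, -1.0, 2.0])    # d = 3 input features
w = np.array([0.1, 0.4, -0.2])    # neuron-specific weights
b = 0.05                          # neuron-specific bias
print(neuron_output(x, w, b))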
2.2.2 The Feed-Forward Architecture
The basic idea of all neural networks is to layer neurons in a hierarchical fashion,
the general structure of which is known as the network architecture (See Fig. 2.5).
A feed-forward neural network (FNN) is an ANN wherein connections between the
nodes do not form a cycle. In the simplest feed-forward networks, each neuron in
the input layer takes the inputs x and produces an output ai(x)
that depends on its current weights, see Eq. (2.4). The outputs of the input layer
are then treated as the inputs to the next hidden layer. This is usually repeated
several times until one reaches the top or output layer. Thus, the whole NN can be
thought of as a complicated non-linear transformation of the inputs x into an output
ŷ that depends on the weights and biases of all the neurons in the input, hidden, and
output layers. The use of hidden layers greatly expands the representational power
of a neural network when compared with a simple single-layer network. Perhaps
the most formal expression of the increased representational power of neural nets
(also called the expressivity) is the universal approximation theorem which states
that a neural network with a single hidden layer can approximate any continuous,
multi-input/multi-output function with arbitrary accuracy.
Modern neural networks generally contain multiple hidden layers. There are many
ideas of why such deep architectures are favorable for learning. Increasing the number
of layers increases the number of parameters and hence the representational power
of neural networks. Adding hidden layers is also thought to allow neural nets to
learn more complex features from the data. Choosing the exact network architecture
remains an art that requires extensive numerical experimentation and intuition, and
is often problem-specific. Both the number of hidden layers and the number of
neurons in each layer can affect the performance of an NN.
Fig. 2.5 A simple feed-forward neural network, with two hidden layers between the
input layer and the output layer.
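A feed-forward architecture like the one of Fig. 2.5 can be written compactly in PyTorch; the layer sizes and the activation in the sketch below are placeholders, not the architecture used later in this work.

import torch.nn as nn

# A feed-forward network with two hidden layers, as in Fig. 2.5.
# The sizes (784 inputs, 32-unit hidden layers, 2 outputs) are illustrative placeholders.
model = nn.Sequential(
    nn.Linear(784, 32),   # input layer -> first hidden layer
    nn.Tanh(),
    nn.Linear(32, 32),    # first hidden layer -> second hidden layer
    nn.Tanh(),
    nn.Linear(32, 2),     # second hidden layer -> output layer
)
print(model)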
2.2.3 Training Neural Networks
The basic procedure for training neural networks is the same as we described in
Subsec. 2.1.1 for training simpler supervised learning algorithms: construct a loss
function and then use gradient descent to minimize the loss function and find the
optimal weights and biases. Neural networks differ from these simpler supervised
procedures in that generally they contain multiple hidden layers that make taking
the gradient computationally more difficult. First of all, we shall introduce some
preliminary ML concepts that will be helpful in the description of the actual neural
net training procedure: hyperparameters and classifier functions.
In machine learning, a hyperparameter is a parameter whose value is used to
control the learning process. By contrast, the values of other parameters - such
as the weights in neural networks - are derived via training. Hyperparameters can
be classified as model hyperparameters, that cannot be inferred while fitting the
machine to the training set because they refer to the model selection task, or algorithm
hyperparameters, that in principle have no influence on the performance of the model
but affect the speed and quality of the learning process. An example of a model
hyperparameter is the topology and size of a neural network. Examples of algorithm
hyperparameters are
• the learning rate ηt, which we introduced in Subsec. 2.1.2 while describing the
gradient descent algorithm;
• the batch size, that is the number of training samples to process before we
update the model parameters (i.e. the size of the full training dataset Dtrain for
full-batch GD);
• the number of epochs, which is the number of times that the learning algorithm
will work through the entire training dataset Dtrain.
There are no exact rules for how to configure hyperparameters. One must try different
values and see what works best for each specific problem.
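The three algorithm hyperparameters listed above appear explicitly in a typical training loop. The sketch below (with placeholder values, a placeholder linear model and random stand-in data) is only meant to show where each of them enters.

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Placeholder values for the three algorithm hyperparameters discussed above.
learning_rate = 0.1
batch_size = 64
n_epochs = 10

X = torch.randn(1000, 784)                    # stand-in training data
y = torch.randint(0, 2, (1000,))              # stand-in labels
loader = DataLoader(TensorDataset(X, y), batch_size=batch_size, shuffle=True)

model = nn.Linear(784, 2)                     # placeholder model
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)

for epoch in range(n_epochs):                 # one epoch = one full pass over the training set
    for xb, yb in loader:                     # one parameter update per batch of size batch_size
        optimizer.zero_grad()
        loss = loss_fn(model(xb), yb)
        loss.backward()
        optimizer.step()                      # w <- w - learning_rate * gradient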
Now that we explained what hyperparameters are, we will illustrate two of the
most well-known and widely-used classifier functions: the logistic function and the
softmax function. The standard logistic function is a common sigmoid function2
defined by the equation
$$\sigma(z) := \frac{1}{1 + e^{-z}} \in (0, 1). \qquad (2.5)$$
It is often used as the last activation function of a neural network to clamp signals to
within the (0, 1) interval, thus normalizing the outputs to probabilities. In practice,
due to the nature of the exponential function e−z, it is often sufficient to compute
the standard logistic function for z over a small range of real numbers, as it quickly
converges very close to its saturation values of 0 and 1. The logistic function has the
symmetry property that 1 − σ(z) = σ(−z).
2 A sigmoid function is a function having a characteristic “S”-shaped curve or sigmoid curve.
Fig. 2.6 A plot of the standard logistic function.
The softmax function, also known as softargmax, is a generalization of the logistic
function to multiple dimensions. The softmax function takes as input a vector z of
K real numbers, and normalizes it into a probability distribution consisting of K
probabilities proportional to the exponentials of the input numbers. That is, prior
to applying softmax, some vector components could be negative, or greater than
one, but after applying softmax, each component will be in the interval (0, 1), and
the components will add up to 1, so that they can be interpreted as probabilities.
Furthermore, the larger input components will correspond to larger probabilities. The
standard (unit) softmax function $\sigma : \mathbb{R}^K \to (0, 1)^K$ is defined, when K is greater than one, by the formula

$$\sigma_i(z) := \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}} \quad \text{for } i = 1, \ldots, K \text{ and } z = (z_1, \ldots, z_K) \in \mathbb{R}^K. \qquad (2.6)$$
In simple words, it applies the standard exponential function to each element zi of
the input vector z and normalizes these values by dividing by the sum of all these
exponentials; this normalization ensures that the sum of the components of the output
vector σ(z) is 1.
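Both classifier functions can be implemented directly from Eqs. (2.5) and (2.6); in the sketch below, subtracting the maximum input before exponentiating is an added numerical-stability trick that leaves the softmax output unchanged.

import numpy as np

def sigmoid(z):
    """Standard logistic function, Eq. (2.5)."""
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    """Standard softmax function, Eq. (2.6); shifting by max(z) avoids overflow without changing the result."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([2.0, -1.0, 0.5])
print(sigmoid(z))                      # each component squashed into (0, 1)
print(softmax(z), softmax(z).sum())    # a probability vector summing to 1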
We can now move to the actual training procedure. The first thing one must do
to train a neural network is to define the general network architecture. Given a dataset $D = \{(x_i, y_i)\}_{i=1}^{n}$, $x_i \in \mathbb{R}^d$, the number of neurons in the input layer of our neural network must be equal to the dimension d of the data points. For categorical data, the labels $y_i$ can take on C values so that $y_i \in \{0, 1, \ldots, C - 1\}$: the size of the output layer must be C accordingly. On the other hand, the number of hidden layers and their size are not constrained, and thus become model hyperparameters.
The neural net makes a prediction $\hat{y}_i(w) \in \mathbb{R}^C$ for each data point $x_i$, where $w$ are the parameters of the neural network. Each component $\hat{y}_{i(c+1)}(w)$ of the prediction $\hat{y}_i(w)$ is the activation value relative to class c. The predicted class $p_i$ for data point $x_i$ is then the one corresponding to the greatest among all the activation values

$$p_i = \arg\max_{c} \{\hat{y}_{i(c+1)}(w)\}. \qquad (2.7)$$
The data point is considered correctly classified if the predicted class pi coincides
with its label yi. We will now define two quantities for measuring the performance of
a neural network on the training set Dtrain and the test set Dtest: the training error
Etrain and the test error Etest, respectively. For n data points drawn from the training
set Dtrain, we define the training error as
$$E_{\mathrm{train}} := \frac{1}{n} \sum_{i=1}^{n} \left(1 - \delta_{p_i y_i^{\mathrm{train}}}\right) \in [0, 1], \qquad (2.8)$$

where $\delta_{p_i y_i^{\mathrm{train}}}$ is the Kronecker delta3 of $p_i$ and $y_i^{\mathrm{train}}$. If we were to draw n data points from the test set Dtest, we would get the test error instead

$$E_{\mathrm{test}} := \frac{1}{n} \sum_{i=1}^{n} \left(1 - \delta_{p_i y_i^{\mathrm{test}}}\right) \in [0, 1]. \qquad (2.9)$$
The training and test errors are determined after each training epoch, when the model
parameters are fixed and no learning is taking place, and they simply represent the
fraction of misclassifications on the training set and test set, respectively. The test
error is especially important, as it provides us with a measure of the generalization
capability of the model. One of the most important observations we can make is that
the test error is almost always greater than the training error, i.e. Etest ≥ Etrain.
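In practice, Eqs. (2.7)-(2.9) amount to an argmax over the network outputs followed by a comparison with the labels. A minimal sketch, with made-up outputs and labels, is shown below.

import numpy as np

def classification_error(y_hat, y):
    """Fraction of misclassified points, Eqs. (2.8)-(2.9).

    y_hat: (n, C) array of network outputs, y: (n,) array of integer labels.
    """
    p = np.argmax(y_hat, axis=1)   # predicted classes, Eq. (2.7)
    return np.mean(p != y)         # 1 - delta_{p_i y_i}, averaged over the n points

# Toy example: 4 data points, C = 2 classes (values are illustrative).
y_hat = np.array([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.3, 0.7]])
y = np.array([0, 1, 1, 1])
print(classification_error(y_hat, y))   # -> 0.25, one misclassification out of four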
Like all supervised learning procedures, we must then specify a loss function L.
For categorical data, the most commonly used loss function is the cross-entropy, since
the output layer is often taken to be a softmax classifier. For each datapoint i, we
define a vector yic called a ‘one-hot’ vector, such that
$$y_{ic} := \begin{cases} 1, & \text{if } y_i = c \\ 0, & \text{otherwise.} \end{cases} \qquad (2.10)$$

We can also define the probability that the neural net assigns the data point to category c as the component $\hat{y}_{i(c+1)}(w)$ of the prediction $\hat{y}_i(w)$:

$$\hat{y}_{i(c+1)}(w) := P(y_i = c \mid x_i; w). \qquad (2.11)$$
3 The Kronecker delta is a function of two variables, usually just non-negative integers. The function is 1 if the variables are equal, and 0 otherwise: $\delta_{ij} = \begin{cases} 0, & \text{if } i \neq j \\ 1, & \text{if } i = j. \end{cases}$
The categorical cross-entropy between the true labels yi ∈ {0, 1, ..., C − 1} and the
components ŷi(c+1)(w) of prediction ŷi(w) is defined as
$$L_{\mathrm{CE}}(w) := -\sum_{i=1}^{n} \sum_{c=0}^{C-1} \left[ y_{ic} \log\left(\hat{y}_{i(c+1)}(w)\right) + (1 - y_{ic}) \log\left(1 - \hat{y}_{i(c+1)}(w)\right) \right]. \qquad (2.12)$$
Fig. 2.7 A plot of the cross-entropy loss as a function of the predicted probability
for the true label yi. The loss rapidly increases as the predicted probability for the
actual label moves away from 1.
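Eq. (2.12) can be evaluated directly from the one-hot labels of Eq. (2.10) and the predicted probabilities. In the sketch below the predictions are made up, and the probabilities are clipped away from 0 and 1 purely to avoid evaluating log(0).

import numpy as np

def categorical_cross_entropy(y_hat, y, eps=1e-12):
    """Cross-entropy of Eq. (2.12) from integer labels and predicted probabilities."""
    n, C = y_hat.shape
    y_onehot = np.eye(C)[y]                  # one-hot vectors y_ic, Eq. (2.10)
    y_hat = np.clip(y_hat, eps, 1 - eps)     # avoid log(0)
    return -np.sum(y_onehot * np.log(y_hat) + (1 - y_onehot) * np.log(1 - y_hat))

# Toy example with 3 points and C = 2 classes (values are illustrative).
y_hat = np.array([[0.9, 0.1], [0.2, 0.8], [0.45, 0.55]])
y = np.array([0, 1, 0])
print(categorical_cross_entropy(y_hat, y))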
Having defined an architecture and a loss function, we must now train the model.
Similar to other supervised learning methods, we make use of the gradient descent
method to optimize the loss function, see Subsec. 2.1.2. Recall that the basic idea of gradient
descent is to update the parameters w to move in the direction of the gradient of the
loss function ∇wL(w).
Calculating the gradients for a neural network requires a specialized algorithm,
called backpropagation (often abbreviated backprop) which forms the heart of any
neural network training procedure. A brute force calculation is out of the question
since it requires us to calculate as many gradients as parameters at each step of the
gradient descent. The backpropagation algorithm (Rumelhart and Zipser, 1985 [21])
is a clever procedure that exploits the layered structure of neural networks to more
efficiently compute gradients.
2.2.4 The Backpropagation Algorithm
At its core, backpropagation is simply the chain rule4
for partial differentiation,
and can be summarized using four equations. In order to see this, we must first
establish some useful notation. We will assume that there are L layers in our network
with $l = 1, \ldots, L$ indexing the layer. Denote by $w^l_{jk}$ the weight for the connection from the k-th neuron in layer $l - 1$ to the j-th neuron in layer $l$. We denote the bias of this neuron by $b^l_j$. By construction, in a feed-forward neural network the activation $a^l_j$ of the j-th neuron in the l-th layer can be related to the activities of the neurons in the layer $l - 1$ by the equation

$$a^l_j = f\left(\sum_k w^l_{jk} a^{l-1}_k + b^l_j\right) = f(z^l_j), \qquad (2.13)$$

where we have defined the linear weighted sum

$$z^l_j = \sum_k w^l_{jk} a^{l-1}_k + b^l_j. \qquad (2.14)$$
By definition, the loss function L depends directly on the activities of the output layer $a^L_j$. It of course also indirectly depends on all the activities of neurons in lower layers in the neural network through iteration of Eq. (2.13). Let us define the error $\Delta^L_j$ of the j-th neuron in the L-th layer as the change in the loss function w.r.t. the weighted input $z^L_j$:

$$\Delta^L_j = \frac{\partial L}{\partial z^L_j}. \qquad (2.15)$$
This definition is the first of the four backpropagation equations. We can analogously
define the error of neuron j in layer l, $\Delta^l_j$, as the change in the loss function w.r.t. the weighted input $z^l_j$:

$$\Delta^l_j = \frac{\partial L}{\partial z^l_j} = \frac{\partial L}{\partial a^l_j} f'(z^l_j), \qquad \text{(I)}$$

where $f'(x)$ denotes the derivative of the non-linearity $f(\cdot)$ w.r.t. its input evaluated at x. Notice that the error function $\Delta^l_j$ can also be seen as the partial derivative of the loss function w.r.t. the bias $b^l_j$, since

$$\Delta^l_j = \frac{\partial L}{\partial z^l_j} = \frac{\partial L}{\partial b^l_j} \frac{\partial b^l_j}{\partial z^l_j} = \frac{\partial L}{\partial b^l_j}, \qquad \text{(II)}$$

where in the last line we have used the fact that $\partial b^l_j / \partial z^l_j = 1$, cf. Eq. (2.14). This is
the second of the four backpropagation equations.
4 The chain rule is a formula that expresses the derivative of the composition of two differentiable functions f and g in terms of the derivatives of f and g. More precisely, if $h = f \circ g$ is the function such that $h(x) = f(g(x))$ for every x, then the chain rule is, in Lagrange's notation, $h'(x) = f'(g(x))\, g'(x)$ or, equivalently, $h' = (f \circ g)' = (f' \circ g) \cdot g'$.
We now derive the final two backpropagation equations using the chain rule. Since
the error depends on neurons in layer l only through the activation of neurons in the
subsequent layer l + 1, we can use the chain rule to write
$$\Delta^l_j = \frac{\partial L}{\partial z^l_j} = \sum_k \frac{\partial L}{\partial z^{l+1}_k} \frac{\partial z^{l+1}_k}{\partial z^l_j} = \sum_k \Delta^{l+1}_k \frac{\partial z^{l+1}_k}{\partial z^l_j} = \left(\sum_k \Delta^{l+1}_k w^{l+1}_{kj}\right) f'(z^l_j). \qquad \text{(III)}$$
This is the third backpropagation equation. The final equation can be derived by
differentiating the loss function w.r.t. the weight $w^l_{jk}$ as

$$\frac{\partial L}{\partial w^l_{jk}} = \frac{\partial L}{\partial z^l_j} \frac{\partial z^l_j}{\partial w^l_{jk}} = \Delta^l_j a^{l-1}_k. \qquad \text{(IV)}$$
Together, Eqs. (I), (II), (III), and (IV) define the four backpropagation equations
relating the gradients of the activations of various neurons $a^l_j$, the weighted inputs $z^l_j$ and the errors $\Delta^l_j$. These equations can be combined into a simple, computationally
efficient algorithm to calculate the gradient w.r.t. all parameters.
The Backpropagation Algorithm
1. Activation at input layer: calculate the activations $a^1_j$ of all the neurons in the input layer.

2. Feed-forward: starting with the first layer, exploit the feed-forward architecture through Eq. (2.13) to compute $z^l$ and $a^l$ for each subsequent layer.

3. Error at top layer: calculate the error of the top layer using Eq. (I). This requires knowing the expression for the derivative of both the loss function $L(w) = L(a^L)$ and the activation function $f(z)$.

4. “Backpropagate” the error: use Eq. (III) to propagate the error backwards and calculate $\Delta^l_j$ for all layers.

5. Calculate gradient: use Eqs. (II) and (IV) to calculate $\partial L / \partial b^l_j$ and $\partial L / \partial w^l_{jk}$.
We can now see where the name backpropagation comes from. The algorithm
consists of a forward pass from the bottom layer to the top layer where one calculates
the weighted inputs and activations of all the neurons. One then backpropagates
the error starting with the top layer down to the input layer and uses these errors to
calculate the desired gradients. This description makes clear the incredible utility and
computational efficiency of the backpropagation algorithm. We can calculate all the
derivatives using a single “forward” and “backward” pass of the neural network. This
computational efficiency is crucial since we must calculate the gradient with respect
to all parameters of the neural net at each step of gradient descent. These basic ideas
also underlie almost all modern automatic differentiation packages - such as the one
we used, the torch.autograd package from the PyTorch framework for ML in Python.
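The five steps above can be followed literally for a small network. The sketch below does so for a tiny 2-3-1 architecture with a sigmoid activation and a squared-error loss; these choices are assumptions made to keep the example short, not the setup used elsewhere in this thesis.

import numpy as np

rng = np.random.default_rng(0)
f = lambda z: 1.0 / (1.0 + np.exp(-z))     # activation function
fprime = lambda z: f(z) * (1.0 - f(z))     # its derivative

# A tiny 2-3-1 network: W[l][j, k] is the weight from neuron k in one layer to neuron j in the next.
sizes = [2, 3, 1]
W = [rng.normal(size=(m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
b = [rng.normal(size=m) for m in sizes[1:]]

x = np.array([0.5, -1.2])                  # one input point
y = np.array([1.0])                        # its target

# Steps 1-2: forward pass, storing z^l and a^l (Eqs. (2.13)-(2.14)).
a, zs = [x], []
for Wl, bl in zip(W, b):
    zs.append(Wl @ a[-1] + bl)
    a.append(f(zs[-1]))

# Step 3: error at the top layer, Eq. (I), for the squared-error loss L = 0.5 * ||a^L - y||^2.
delta = (a[-1] - y) * fprime(zs[-1])

# Steps 4-5: backpropagate with Eq. (III) and collect the gradients with Eqs. (II) and (IV).
grad_W = [None] * len(W)
grad_b = [None] * len(b)
for l in reversed(range(len(W))):
    grad_b[l] = delta                       # Eq. (II)
    grad_W[l] = np.outer(delta, a[l])       # Eq. (IV)
    if l > 0:
        delta = (W[l].T @ delta) * fprime(zs[l - 1])   # Eq. (III)

print([g.shape for g in grad_W], [g.shape for g in grad_b])

In practice one never writes this loop by hand for large models: this is precisely the computation that torch.autograd performs automatically, as discussed in Subsec. 3.2.2.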
2.3 Structure in Data
2.3.1 The Object Manifold Model
As we mentioned in Chap. 1, many theoretical investigations on neural networks
make the simplifying assumption of considering input data points as unstructured
data, i.e. uncorrelated data points, distributed randomly in a state space5. However,
in the majority of real-life datasets, data points with similar features are found to be
closer together when compared to data points exhibiting considerable differences in
their characteristics. For instance, in a set of pictures portraying both cats and dogs,
each picture can be represented as a distinct point in a state space; a neural network
will generally be able to distinguish cats from dogs and classify them correctly, but
at the same time, it will have the tendency to similarly identify dogs of the same
breed, thus grouping their representations in the neural state spaces of the inner
network layers closer together, see Fig. 2.8. This interpretation of structure in data is
commonly known as the object manifold model, and has already been the geometrical
framework to many recent machine learning publications, such as the ones by Haim
Sompolinsky et al. [9][2][6], to which our investigation owes its general approach,
based on the numerical analysis of few, but crucial geometric observables during the
training process of a neural network.
Fig. 2.8 Unstructured data (left) vs. data with an object manifold structure (right)
[16]
The visual hierarchy of the brain has a remarkable ability to identify objects
despite differences in appearance due to changes in variables such as orientation,
position, pose, lighting and background [8]. Recent research in machine learning has
shown that deep neural networks can perform invariant object categorization with
almost human-level accuracy [1], and that their network representations are similar
to the brain’s. [14][25][13]. DNNs are therefore very important as models of visual
hierarchy, though understanding their operational capabilities and design principles
remains a significant challenge.
5 A state space is the set of all possible configurations of a system.
To conceptualize object manifolds, consider a set of N neurons responding to
a specific visual signal associated with an object. The neural population response
to that stimulus is a vector in $\mathbb{R}^N$. Changes in the physical parameters of the input
stimulus that do not change the object identity modulate the neural state vector. The
set of all state vectors corresponding to responses to all possible stimuli associated
with the same object can be viewed as a manifold in the neural state space. In this
perspective, object recognition is equivalent to the task of discriminating manifolds of
different objects from each other. As signals propagate from one processing stage to
the next in the visual hierarchy, the geometry of the manifolds is reformatted so that
they become “untangled,” namely they are more easily separated by a biologically
plausible decoder, modeled as a hyperplane6
[7], as illustrated in Fig. 2.9.
Fig. 2.9 Illustration of three layers in a visual hierarchy where the neural population
response of the first layer is mapped into intermediate layer by F1 and into the last
layer by F2 (top). The transformation of per-stimuli responses is associated with
changes in the geometry of the object manifold, the collection of responses to stimuli
of the same object (colored blue for a ‘dog’ manifold and pink for a ‘cat’ manifold).
Changes in geometry may result in transforming object manifolds which are not
linearly separable (in the first and intermediate layers) into separable ones in the last
layer (separating hyperplane, colored orange) [6].
The object manifold structure can have a positive impact on the learning ability
of a neural network, because classifying data points uniformly distributed in the state
space is surely more complicated than classifying data points that are clustered. On
the other hand, this makes it harder to develop an analytic description of the subject.
6 A hyperplane of an n-dimensional space V is a subspace of dimension n − 1, or equivalently, of codimension 1 in V.
2.3.2 Geometrical Observables
We tried to capture the structure within neural representations by introducing two
independent geometric observables, namely the radius of gyration $R_M$, measuring
the average spread of data points belonging to the same manifold around its centre,
and the centre-to-centre distance dctc, quantifying the distance between the centres of
two different manifolds. Measuring these two quantities at each training epoch one
can essentially keep track of the intra-manifold and inter-manifold dynamics during
the entire training process, using a discrete time7
mathematical dynamics approach.
We first define the (normalized) centre of a manifold M at a given time t as
$$x_M(t) := \frac{1}{n_M} \sum_{x \in M(t)} \frac{x}{\|x\|}, \qquad (2.16)$$
where $n_M = |M|$ is the number of elements that belong to manifold M. The
normalization of each data point x ensures that the centre is located within the unit
ball centered at the origin.
Definition 2.3.1 (radius of gyration) The radius of gyration of a manifold M at a given time t is defined as

$$R_M(t) := \sqrt{\frac{1}{n_M} \sum_{x \in M(t)} \left\| \frac{x}{\|x\|} - x_M(t) \right\|^2} \in [0, 1]. \qquad (2.17)$$
Definition 2.3.2 (centre-to-centre distance) The distance between the centres of two manifolds M and N at a given time t is defined as

$$d_{\mathrm{ctc}}(t) := \| x_M(t) - x_N(t) \| \in [0, 2]. \qquad (2.18)$$
The radius of gyration and the centre-to-centre distance are quantities that share the
same associated dimension, the one of the data points x. Since the problem of linear
separation is invariant under rescaling by a positive factor, we decided to introduce
one last geometric observable: the dimensionless radius of gyration $\hat{R}_M$, defined as

$$\hat{R}_M(t) := \frac{R_M(t)}{d_{\mathrm{ctc}}(t)} \in [0, \infty). \qquad (2.19)$$

While $\hat{R}_M$ has the obvious advantage of being a dimensionless quantity, we should note that this rescaling of $R_M$ removes its upper bound, as

$$\lim_{d_{\mathrm{ctc}} \to 0} \frac{R_M}{d_{\mathrm{ctc}}} = \infty.$$
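The three observables are straightforward to compute from the internal representations of a network. The sketch below implements Eqs. (2.16)-(2.19); the Gaussian point clouds stand in for actual layer activations, which in this work would be recomputed at every training epoch.

import numpy as np

def manifold_centre(X):
    """Normalized centre of a manifold, Eq. (2.16); X has one data point per row."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    return Xn.mean(axis=0)

def radius_of_gyration(X):
    """Radius of gyration R_M, Eq. (2.17)."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    c = Xn.mean(axis=0)
    return np.sqrt(np.mean(np.sum((Xn - c) ** 2, axis=1)))

def centre_to_centre(X, Y):
    """Centre-to-centre distance d_ctc, Eq. (2.18)."""
    return np.linalg.norm(manifold_centre(X) - manifold_centre(Y))

# Toy manifolds: random points standing in for the representations of two classes.
rng = np.random.default_rng(0)
M = rng.normal(loc=+1.0, size=(100, 10))
N = rng.normal(loc=-1.0, size=(100, 10))
R_hat = radius_of_gyration(M) / centre_to_centre(M, N)   # dimensionless radius, Eq. (2.19)
print(radius_of_gyration(M), centre_to_centre(M, N), R_hat)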
7 The discrete time framework views values of variables as occurring at distinct, separate “points in time”, or equivalently as being unchanged throughout each non-zero region of time (“time period”) - that is, time is viewed as a discrete variable.
Chapter 3
Tools and Techniques Employed
3.1 Overview of the Datasets
3.1.1 MNIST
MNIST (Modified National Institute of Standards and Technology) is a large
dataset of handwritten digits that is widely used for training and testing in the field
of machine learning. It was created by Yann LeCun et al. in 1998 by “re-mixing”
the samples from NIST’s original datasets [17]. The creators felt that since NIST’s
training dataset was taken from American Census Bureau employees, while the testing
dataset was taken from American high school students, it was not well-suited for
machine learning experiments. Furthermore, the black and white images from NIST
were normalized to fit into a 28 × 28 pixel bounding box and anti-aliased, which
introduced grayscale1
levels.
Fig. 3.1 Sample images from MNIST test dataset.
1 The grayscale intensity is stored as an 8-bit integer giving 256 possible different shades of gray from black to white.
MNIST contains a total of 70,000 images, divided into 60,000 training images and 10,000 testing images, all labeled with their respective digit, which represents the ground truth2. Each image is a 28 × 28 matrix in which every entry aij (i.e. a pixel)
is an integer in the range [0, 255] corresponding to its grayscale intensity, 0 being
white and 255 being black, see Fig. 3.2. To feed an image to a neural network, we
must vectorize3
its 28 × 28 matrix into a 784-dimensional vector, which is directly
compatible with an input layer composed of 784 neurons.
Fig. 3.2 Sample image of an “eight” digit from MNIST in its 28 × 28 matrix form.
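One convenient way to obtain MNIST in this vectorized form is through the torchvision package (an assumption for this example; what follows is not necessarily the loading code used in this work). The transform below rescales the pixel values and flattens each 28 × 28 image into a 784-dimensional vector.

from torchvision import datasets, transforms

# Download MNIST and flatten each 28x28 image into a 784-dimensional vector.
# ToTensor() also rescales the integer pixel values from [0, 255] to [0, 1].
to_vector = transforms.Compose([
    transforms.ToTensor(),
    transforms.Lambda(lambda img: img.view(-1)),
])

train_set = datasets.MNIST(root="data", train=True, download=True, transform=to_vector)
test_set = datasets.MNIST(root="data", train=False, download=True, transform=to_vector)

x, label = train_set[0]
print(x.shape, label)   # torch.Size([784]) and the ground-truth digit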
MNIST is the de facto “Hello, world!” dataset of computer vision. Since its release,
this classic dataset has served as the basis for benchmarking classification algorithms.
As new machine learning techniques emerge, MNIST remains a reliable resource for
researchers and learners alike. The main reason behind its success is the simplicity
of neural networks architectures needed to perform object classification with great
accuracy. On this basis, we decided to focus our exploration mainly on the MNIST
dataset.
2 Ground truth is information that is known to be real or true, provided by direct observation and measurement (i.e. empirical evidence), as opposed to information provided by inference.
3 The vectorization of a matrix is a linear transformation which converts the matrix into a column vector. Specifically, the vectorization of an m × n matrix A, denoted vec(A), is the mn × 1 column vector obtained by stacking the columns of the matrix A on top of one another.
3.1.2 EMNIST
EMNIST (Extended-MNIST) is a dataset developed and released by NIST in 2017
to be the successor to MNIST [5]. While MNIST includes images of handwritten digits
only, EMNIST includes all the images from NIST Special Database 19, which is a
large database of handwritten uppercase and lower case letters as well as digits. The
images in EMNIST were converted into the same 28 × 28 pixel format, by the same
process, as were the MNIST images. Accordingly, tools which work with the older,
smaller, MNIST dataset will work unmodified with EMNIST. There are six different
splits provided in this dataset. A short summary of the dataset is provided below:
• EMNIST ByClass: 814,255 images, 62 unbalanced classes;
• EMNIST ByMerge: 814,255 images, 47 unbalanced classes;
• EMNIST Balanced: 131,600 images, 47 balanced classes;
• EMNIST Letters: 145,600 images, 26 balanced classes;
• EMNIST Digits: 280,000 images, 10 balanced classes;
• EMNIST MNIST: 70,000 images, 10 balanced classes.
The full complement of the NIST Special Database 19 is available in the ByClass
and ByMerge splits. The EMNIST Balanced dataset contains a set of characters
with an equal number of samples per class. The EMNIST Letters dataset merges a
balanced set of the uppercase and lowercase letters into a single 26-class task. The
EMNIST Digits and EMNIST MNIST datasets provide balanced handwritten digit
datasets directly compatible with the original MNIST dataset.
In this work, we decided to examine the EMNIST Letters split only: it is the most
different from MNIST, because it exclusively contains images that are not shared with
it; furthermore, its 26 classes are balanced, which provides a fair ground for our object
classification task.
Fig. 3.3 Sample images from the “Letters” split of the EMNIST dataset.
3.1.3 KMNIST
KMNIST (Kuzushiji-MNIST) is a drop-in replacement for the MNIST dataset
(28 × 28 grayscale, 70,000 images, 10 balanced classes), developed for deep learning
on classical Japanese literature and provided in the original MNIST format [4]. It
contains images with the first entries from the 10 main Japanese hiragana4
character
groups, handwritten in cursive.
Fig. 3.4 Sample images from KMNIST, with the first column showing each char-
acter’s modern hiragana counterpart.
3.1.4 Fashion-MNIST
Fashion-MNIST is another drop-in replacement for MNIST (28 × 28 grayscale,
70,000 images, 10 balanced classes), composed of images of Zalando's articles of clothing [24].
Fig. 3.5 Sample images from Fashion-MNIST.
4 Hiragana is a Japanese syllabary, part of the Japanese writing system, along with katakana as well as kanji.
3.2 Programming Language and Libraries
3.2.1 Why we opted for Python
Python was the natural choice for the programming language of our investigation.
Benefits that make it the best fit for ML and AI-based projects include:
• Simplicity and consistency: Python offers concise and readable code. While
complex algorithms and versatile workflows stand behind ML and AI, Python's
simplicity allows developers to write reliable systems. Developers get to put
all their effort into solving an ML problem instead of focusing on the technical
nuances of the language.
Additionally, Python is appealing to many developers as it’s easy to learn. Being
a high-level5
, interpreted6
programming language, its code is understandable by
humans, which makes it easier to build models for ML.
Since Python is a general-purpose language, it can handle complex ML tasks and
enables you to build prototypes quickly, which allows you to test your product
for ML purposes.
• Easy access to many libraries for ML and AI: implementing machine
learning algorithms can be tricky and requires a lot of time. It’s vital to have
a well-structured and well-tested environment to enable developers to come up
with the best coding solutions.
To reduce development time, programmers turn to a number of Python libraries,
pre-written code that simplify the implementation of different functionalities.
Python, with its rich technology stack, has an extensive set of libraries for AI
and ML. With these solutions, you can develop your product faster: your team
won’t have to reinvent the wheel and can use an existing library to implement
necessary features.
5 In computer science, a high-level programming language is a programming language with strong abstraction from the details of the computer. In contrast to low-level programming languages, it may use natural language elements, be easier to use, or may automate (or even hide entirely) significant areas of computing systems (e.g. memory management), making the process of developing a program simpler and more understandable than when using a lower-level language.
6 An interpreted language is a programming language whose implementations execute instructions directly and freely, without previously compiling a program into machine-language instructions.
Some of the Python libraries we made use of are:
– PyTorch and scikit-learn for machine learning;
– NumPy for high-performance scientific computing and data analysis;
– Matplotlib for data visualization.
• Platform independence: a platform independent language lets developers
implement programs on one machine and use them on another machine without
any (or with only minimal) changes. One key to Python’s popularity is that
it’s a platform independent language. Python is supported by many Operating
Systems (OSs) including Linux, Windows, and macOS. Python code can be
used to create standalone executable programs for most common OSs, which
means that Python software can be easily distributed and used on those OSs
without a Python interpreter.
What’s more, developers usually use services such as Google Colab or Amazon
SageMaker for their computing needs. However, you can often find companies
and data scientists who use their own machines with powerful GPUs to train
their ML models. And the fact that Python is platform independent makes this
training a lot cheaper and easier.
Our investigation, for instance, was conducted on a personal workstation using:
– Microsoft Windows 10 as OS;
– Anaconda Navigator as Graphical User Interface (GUI);
– JupyterLab as (web-based) Interactive Development Environment (IDE).
• A wide community: in the Developer Survey 2020 by Stack Overflow, Python
was among the top 5 most popular programming languages, which ultimately
means that you can easily find a development company with the necessary skill
set to build your AI-based project.
In the Python Developers Survey 2020, data science and ML account for over
27% of the use cases.
All the aforementioned Python features add to the overall popularity of this pro-
gramming language.
3.2.2 The PyTorch Framework for Machine Learning
The numerical implementation of NNs is greatly facilitated by open source Python
packages, such as TensorFlow, Keras, PyTorch and others. The complexity and
learning curves for these packages differ, depending on the user’s level of familiarity
with Python.
We opted for the PyTorch open-source framework, which allows for control over the inter- and intra-layer operations without the need to introduce computational graphs, and offers a library for the automatic differentiation of tensors, already mentioned in Subsec. 2.2.4: the torch.autograd package. As we discussed above, manipulating NNs boils down to fast array multiplication and contraction operations; the PyTorch framework and its libraries therefore provide enough access and control to manipulate the linear algebra operations underlying NNs. We will now show how automatic differentiation with torch.autograd works.
Mathematically, if you have a vector-valued function f = f(x), then the gradient of f with respect to x is the Jacobian matrix⁷ J:

J = \begin{pmatrix} \dfrac{\partial \mathbf{f}}{\partial x_1} & \cdots & \dfrac{\partial \mathbf{f}}{\partial x_n} \end{pmatrix}
  = \begin{pmatrix} \nabla^{T} f_1 \\ \vdots \\ \nabla^{T} f_m \end{pmatrix}
  = \begin{pmatrix} \dfrac{\partial f_1}{\partial x_1} & \cdots & \dfrac{\partial f_1}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \dfrac{\partial f_m}{\partial x_1} & \cdots & \dfrac{\partial f_m}{\partial x_n} \end{pmatrix}.    (3.1)
Essentially, torch.autograd is an engine for computing vector-Jacobian products: given any vector v, it computes the product J^T · v. If v happens to be the gradient of a scalar function g = g(f),

v = \nabla g = \begin{pmatrix} \dfrac{\partial g}{\partial f_1} & \cdots & \dfrac{\partial g}{\partial f_m} \end{pmatrix}^{T},    (3.2)
then, by the chain rule, the vector-Jacobian product would be the gradient of g with respect to x:

J^{T} \cdot v = \begin{pmatrix} \dfrac{\partial f_1}{\partial x_1} & \cdots & \dfrac{\partial f_m}{\partial x_1} \\ \vdots & \ddots & \vdots \\ \dfrac{\partial f_1}{\partial x_n} & \cdots & \dfrac{\partial f_m}{\partial x_n} \end{pmatrix} \begin{pmatrix} \dfrac{\partial g}{\partial f_1} \\ \vdots \\ \dfrac{\partial g}{\partial f_m} \end{pmatrix} = \begin{pmatrix} \dfrac{\partial g}{\partial x_1} \\ \vdots \\ \dfrac{\partial g}{\partial x_n} \end{pmatrix}.    (3.3)

This characteristic of the vector-Jacobian product is what we use in the following example, in which external_grad represents v.
⁷ The Jacobian matrix of a vector-valued function of several variables is the matrix of all its first-order partial derivatives.
We first create two tensors a and b with requires_grad=True. This signals to autograd that every operation on them should be tracked.
import torch
a = torch.tensor([2., 3.], requires_grad=True)
b = torch.tensor([6., 4.], requires_grad=True)
We create another tensor Q from a and b: Q = 3a³ − b².
Q = 3*a**3 - b**2
Let’s assume a and b to be parameters of an NN, and Q to be the error. In NN
training, we want gradients of the error w.r.t. parameters, i.e.
∂Q
∂a
= 9a2
,
∂Q
∂b
= −2b.
When we call .backward() on Q, autograd calculates these gradients and stores them in the respective tensors' .grad attribute. We need to explicitly pass a gradient argument to Q.backward() because Q is a vector: gradient is a tensor of the same shape as Q, and it represents the gradient of Q w.r.t. itself, i.e.

∂Q/∂Q = 1.
external_grad = torch.tensor([1., 1.])
Q.backward(gradient=external_grad)
Equivalently, we can also aggregate Q into a scalar and call backward implicitly, like
Q.sum().backward(). Gradients are now deposited in a.grad and b.grad:
# check that the collected gradients are correct
print(9*a**2 == a.grad)
print(-2*b == b.grad)
Output:
tensor([True, True])
tensor([True, True])
During the training process of a neural network, all its parameters are defined with requires_grad=True, so that every operation involving them (e.g. the computation of the output and of the loss) is kept track of. Once the loss function L has been evaluated, it is sufficient to call loss.backward() to compute its derivatives w.r.t. the parameters through the backpropagation algorithm described in Subsec. 2.2.4 and save them in the w.grad attribute.
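As a minimal sketch of how this is used in a training loop (the model, optimizer, loss_fn and data arguments below are hypothetical placeholders, not the exact code employed in this work), a single training step could read:

def training_step(model, optimizer, loss_fn, x, y):
    # forward pass: every operation on parameters with requires_grad=True is tracked
    outputs = model(x)
    loss = loss_fn(outputs, y)
    # backward pass: backpropagation fills the .grad attribute of each parameter
    optimizer.zero_grad()
    loss.backward()
    # gradient descent update using the gradients stored in .grad
    optimizer.step()
    return loss.item()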
3.3 Data Pre-Processing
It has been found empirically that if the original values of the data differ by
orders of magnitude, training can be slowed down or impeded. This can be traced
back to the vanishing and exploding gradient problem in backpropagation. To avoid
such unwanted effects, we resorted to two tricks, used in succession: rescaling and
standardization of the dataset. In addition to having a positive effect on the learning capacity of neural networks, standardization has also been shown to highlight the geometrical properties already present in structured datasets [3].
3.3.1 Rescaling (min-max Normalization)
As stated in Sec. 3.1, each component (feature) xij of every input vector (data point) xi is an integer in the range [0, 255]. To ensure that the weights of the NN are of a similar order of magnitude, we performed a rescaling of the dataset from the range [0, 255] to [0, 1]. The general formula for a rescaling, also known as min-max normalization, to an arbitrary range [a, b] is given as

x_{ij} \mapsto x'_{ij} = a + \frac{\left(x_{ij} - \min_j\{x_{ij}\}\right)(b - a)}{\max_j\{x_{ij}\} - \min_j\{x_{ij}\}} \in [a, b],    (3.4)

where a and b are the bounds of the target range. The formula for our specific min-max normalization from [0, 255] to [0, 1] is then simply

x_{ij} \mapsto x'_{ij} = \frac{x_{ij}}{255} \in [0, 1].    (3.5)
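In code, this rescaling amounts to a single element-wise division; a minimal sketch, assuming the raw images have already been loaded into a NumPy array X of pixel values (the function name is our own placeholder):

import numpy as np

def min_max_rescale(X, a=0.0, b=1.0):
    # general min-max normalization of Eq. (3.4); here, for simplicity, the minimum
    # and maximum are taken over the whole array rather than per data point
    X = X.astype(np.float64)
    X_min, X_max = X.min(), X.max()
    return a + (X - X_min) * (b - a) / (X_max - X_min)

# for pixel values spanning [0, 255], Eq. (3.5) reduces to a plain division:
# X_rescaled = X / 255.0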
3.3.2 Standardization (Z-score Normalization)
Standardization (also known as Z-score normalization) of datasets is a common requirement for many machine learning estimators: they may behave badly if the individual features do not look, more or less, like standard normally distributed data, i.e. Gaussian with zero mean and unit variance. In practice, we often ignore the shape of the distribution and simply centre the data by removing the mean value of each feature, then scale them by dividing non-constant features by their standard deviation:

x_{ij} \mapsto x'_{ij} = \frac{x_{ij} - \bar{x}_{j}}{\sigma_{j}},    (3.6)

where \bar{x}_{j} and \sigma_{j} are the mean and the standard deviation of the j-th feature over the dataset. This standardization procedure is entirely handled by sklearn.preprocessing.scale from the scikit-learn library mentioned in Subsec. 3.2.1.
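A minimal usage sketch (the array below is a random placeholder standing in for the rescaled dataset):

import numpy as np
from sklearn.preprocessing import scale

X_rescaled = np.random.rand(100, 784)  # placeholder: 100 rescaled 28x28 images
# scale() removes the mean of each column (feature) and divides
# non-constant columns by their standard deviation
X_standardized = scale(X_rescaled)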
3.4 ML Task and Network Architecture
3.4.1 The Binary Classification Task
For the purpose of this work, aimed at exploring geometrical structure in data, we reduced the usual multi-class classification task of MNIST-like datasets to a simpler binary one, so as to focus on the separation of two object manifolds only, without loss of generality. We opted for the even-odd dichotomy, relabelling even-labelled elements with "0" and odd-labelled ones with "1". Given a dataset D = {(xi, yi)}, i = 1, ..., n, we relabel the original labels yi ∈ {0, 1, ..., C − 1} (enumerating the C dataset classes) into the binary labels y′i ∈ {0, 1} through

y'_i = \begin{cases} 0 & \text{if } y_i \text{ is even} \\ 1 & \text{if } y_i \text{ is odd} \end{cases}.    (3.7)
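A sketch of this relabelling on a torchvision copy of MNIST (we assume the dataset object exposes its integer labels through a targets tensor, as torchvision's MNIST does):

from torchvision import datasets, transforms

train_set = datasets.MNIST(root="./data", train=True, download=True,
                           transform=transforms.ToTensor())
# Eq. (3.7): even original digits are mapped to 0, odd original digits to 1
train_set.targets = train_set.targets % 2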
3.4.2 Network Architecture
The feed-forward neural network we designed for this specific classification task is
characterized by the following network architecture (Fig. 3.6):
• one input layer comprised of 784 neurons;
• one hidden layer with a variable number of neurons;
• one output layer composed of 2 neurons.
The number of input neurons is given by the dimension of input vectors (which
is 784 for the images of MNIST and MNIST-like datasets, as discussed in Sec. 3.1),
and the width of the output layer is fixed by the specific task we have to carry out
(in other words, by the number of classes in which data has to be classified). On the
other hand, the number of neurons in the hidden layer is not constrained, and thus
becomes a model hyperparameter which influences the NN performance.
Fig. 3.6 The neural network we designed for the binary classification task.
The results from a previous work show that increasing the width of the hidden layer makes the geometrical properties of structured data more evident, at the cost of a greater computational complexity [3]. Since the focus of this investigation is mainly a numerical analysis of the dynamics of object manifolds in the hidden layer of the neural network as the latter undergoes the training process, we ultimately decided to fix the width of the hidden layer to N = 10, since we found that it struck a good balance between the geometrical expressivity of the network and the computational cost. A hidden layer consisting of 10 neurons means that the layer response to each data point will be a vector in R^10: as a result, we will have to compute the geometric observables introduced in Subsec. 2.3.2 for 10-dimensional object manifolds. In a 10-dimensional state space, the effects of the curse of dimensionality mentioned in Sec. 2.1 should still be mild enough for us to capture genuine noteworthy behaviours in the dynamics of the manifolds. Fixing the width of the hidden layer also means having one less neural network hyperparameter to tune.
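A minimal PyTorch sketch of this architecture (784–10–2, with the hyperbolic tangent on both the hidden and the output layer, as discussed in Subsec. 3.4.4); the class name and the hidden_width argument are our own placeholders:

import torch
import torch.nn as nn

class BinaryClassifierNet(nn.Module):
    def __init__(self, hidden_width=10):
        super().__init__()
        self.hidden = nn.Linear(784, hidden_width)  # input layer -> hidden layer
        self.output = nn.Linear(hidden_width, 2)    # hidden layer -> output layer
        self.activation = nn.Tanh()

    def forward(self, x):
        x = x.view(x.size(0), -1)            # flatten the 28x28 images into 784-dimensional vectors
        h = self.activation(self.hidden(x))  # hidden representation (a vector in R^10 for the default width)
        return self.activation(self.output(h))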
3.4.3 Training and Test Errors Computation
As we already discussed in Subsec. 2.2.3, once an input data point (xi, yi) is processed into the corresponding prediction ŷi by the NN, the predicted class pi is the one corresponding to the greatest among all the activation values ŷi(c+1)(w):

p_i = \begin{cases} 0 & \text{if } \hat{y}_{i1} > \hat{y}_{i2} \\ 1 & \text{otherwise} \end{cases}.    (3.8)
The data point is then considered correctly classified if the predicted label pi coincides
with the real label yi. We computed the training error Etrain and the test error Etest
after each training epoch through Eqs. (2.8) and (2.9), respectively.
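A sketch of how these errors can be computed over a dataset, assuming (as in Eqs. (2.8) and (2.9)) that the error is simply the fraction of misclassified points; the function and loader names are placeholders:

import torch

@torch.no_grad()
def classification_error(model, loader):
    # fraction of misclassified points, evaluated after each training epoch
    wrong, total = 0, 0
    for x, y in loader:
        outputs = model(x)
        predictions = outputs.argmax(dim=1)  # Eq. (3.8): index of the largest activation
        wrong += (predictions != y).sum().item()
        total += y.size(0)
    return wrong / total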
3.4.4 Choice of Activation Function
We chose the hyperbolic tangent as the activation function for every neuron of both the hidden and the output layer, mainly because of its point symmetry w.r.t. the origin: it has been shown that, while non-symmetric activation functions, such as ReLU (Rectified Linear Unit) and Swish, tend to process distinct manifolds in different ways, symmetric activation functions like the logistic function and the hyperbolic tangent do the opposite. Intuitively, this feature can be explained by the fact that symmetric functions process vectors only w.r.t. the absolute value of their components, and not their sign. Since, during training, manifolds are driven away from each other, if the non-linear processing does not depend on the direction of travel (as is the case for symmetric activation functions), we expect to find this symmetry reflected in the geometries of the manifolds. Non-symmetric activation functions, in contrast, process points moving towards negative coordinates differently from points moving towards positive ones, with consequences on their pattern.
The hyperbolic tangent is a real function with domain R, defined as

\tanh(z) := \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}} \in (-1, 1).    (3.9)
Fig. 3.7 A plot of the hyperbolic tangent.
The hyperbolic tangent exhibits the following main properties:
1. it is bounded above by 1 and below by −1, and as such is a bounded function;
2. it is constrained by a pair of horizontal asymptotes as x → ±∞;
3. it is a differentiable function, and its first derivative is bell-shaped;
4. it is a monotonically increasing function;
5. it has exactly one non-stationary inflexion point, at x = 0;
6. it is convex for x < 0 and concave for x > 0;
7. it is an odd function, and as such tanh(−z) = −tanh(z).
Properties 1 to 6 define sigmoid functions, while property 7 is specific to tanh; the logistic function σ(z), defined by Eq. (2.5), also exhibits point symmetry (about the point (0, 1/2)) but, unlike the hyperbolic tangent, it is not an odd function. We can also obtain the hyperbolic tangent from the logistic function: tanh(z) = 2σ(2z) − 1.
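The identity tanh(z) = 2σ(2z) − 1 is easy to verify numerically; a small sketch in PyTorch:

import torch

z = torch.linspace(-5.0, 5.0, steps=11)
lhs = torch.tanh(z)
rhs = 2 * torch.sigmoid(2 * z) - 1
# the two expressions agree up to floating-point precision
print(torch.allclose(lhs, rhs))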
Fig. 3.8 A plot of the hyperbolic tangent and the logistic function.
It has been shown in multiple works [18][12] that the hyperbolic tangent typically
performs better than the logistic function because the former is more likely to produce
outputs (which are inputs to the next layer) that are on average closer to zero, in
contrast to the logistic function whose outputs are always positive and so must have
a mean that is positive.
3.4.5 Choice of Loss Function
As we mentioned in Subsec. 2.2.3, the most common loss function for categorical data is the cross-entropy, so it was a natural choice for our binary classification task. However, the output layer of the NNs we employed is not a softmax classifier: on the contrary, the activation function of the output layer is the hyperbolic tangent, as we just discussed in Subsec. 3.4.4, therefore the components ŷi(c+1) of the prediction ŷi(w) lie in the interval (−1, 1) instead of (0, 1), and they do not add up to 1. For this reason, we cannot interpret the ŷi(c+1) as probabilities. Luckily, the torch.nn.CrossEntropyLoss function from PyTorch automatically applies a softmax to the outputs before calculating the cross-entropy, so that we do not necessarily have to use a softmax activation function in the output layer. The categorical cross-entropy between the binary labels yi ∈ {0, 1} and the softmaxed outputs is then given, using the "one-hot" vector notation of Eq. (2.10), by

L_{CE}(w) = -\sum_{i=1}^{n} \sum_{c=0}^{1} \left[ y_{ic} \log\bigl(\hat{y}_{i(c+1)}(w)\bigr) + (1 - y_{ic}) \log\bigl(1 - \hat{y}_{i(c+1)}(w)\bigr) \right].    (3.10)
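A minimal sketch of how the loss is evaluated in PyTorch (the tensors below are placeholders; torch.nn.CrossEntropyLoss expects raw, unnormalized scores and integer class labels):

import torch
import torch.nn as nn

loss_fn = nn.CrossEntropyLoss()

outputs = torch.tensor([[0.9, -0.7],   # network outputs for a mini-batch of 3 points
                        [-0.2, 0.4],
                        [0.1, 0.8]])
labels = torch.tensor([0, 1, 1])       # binary labels y'_i

# a softmax is applied internally to the outputs before the cross-entropy is computed
loss = loss_fn(outputs, labels)
print(loss.item())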
Chapter 4
Numerical Analysis Results
4.1 Dynamics of Geometric Observables
We shall commence by showing the results obtained from the averaging of 30
runs where we trained the neural network described in Subsec. 3.4.2 on the binary
classification of the MNIST dataset into even and odd digits, as detailed in Subsec.
3.4.1. We used a training set Dtrain of 10000 data points randomly drawn from the
original MNIST training set, the full MNIST test set (also of 10000 points) as the
test set Dtest, and a learning rate η = 0.3 (see Subsec. 2.1.2). We should point out
that the initialization of the network parameters is random as well.
The first quantities we will inspect are the training error Etrain and the test error
Etest (see Subsec. 2.2.3), as is common practice in every machine learning experiment.
Fig. 4.1 Etrain(t) and Etest(t) for 30 runs on the MNIST dataset.
Hyperparameters: |Dtrain| = 10000, η = 0.3.
Fig. 4.1 shows a steady decrease over the training epoch for both Etrain and Etest,
with the latter being slightly (but consistently) larger than the former, as one would
expect in the vast majority of cases. A decreasing training error signifies that our
neural network is successfully learning to classify the training set, while a decreasing
test error implies an increasing generalization capability of the model.
Now that we have made sure that our neural network is correctly learning to perform its given task, we can move on to the actual investigation of the dynamics of the object
manifolds, encapsulated in the geometric observables that we introduced in Subsec.
2.3.2. Since the only hidden layer in our network is comprised of 10 neurons, we will
be computing the geometric observables for 10-dimensional manifolds. Working with
a binary dataset signifies having to deal with two distinct object manifolds only:
• Meven, consisting of the inner representations of the elements with label 0;
• Modd, consisting of the inner representations of the elements with label 1.
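For reference, here is a sketch of how these observables can be computed from the hidden-layer representations; we assume the standard definitions (the radius of gyration as the root-mean-square distance of a manifold's points from their centroid, the centre-to-centre distance as the Euclidean distance between the two centroids), which are meant to mirror, not replace, the exact conventions of Subsec. 2.3.2:

import torch

def radius_of_gyration(H):
    # H: tensor of shape (n_points, hidden_width) holding the hidden representations of one manifold
    centre = H.mean(dim=0)
    return torch.sqrt(((H - centre) ** 2).sum(dim=1).mean())

def centre_to_centre_distance(H_even, H_odd):
    # Euclidean distance between the centroids of the two manifolds
    return torch.norm(H_even.mean(dim=0) - H_odd.mean(dim=0))

# dimensionless radius of gyration: R_M / d_ctc
# R_hat = radius_of_gyration(H_even) / centre_to_centre_distance(H_even, H_odd)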
4.1.1 Non-Monotonic Behaviours in MNIST
First of all, we will focus on the radius of gyration RM, measuring the average
spread of data points belonging to the same manifold M around its centre.
Fig. 4.2 RM(t) for the same 30 MNIST runs of Fig. 4.1.
Hyperparameters: |Dtrain| = 10000, η = 0.3.
The fact that both RMeven and RModd
exhibit a steep drop-off right at the beginning
of the training process may not come as a total surprise: for a deep neural network,
a monotonically decreasing training error corresponds to a monotonically increasing
capacity of untangling the two manifolds in the hidden layers, which would intuitively
be achieved by a steady contraction of the manifolds.
Contrary to what one might expect, the initial contraction of the manifolds does not carry on indefinitely, and their radius of gyration displays a strikingly non-monotonic behaviour: after the noticeable decrease at the beginning of the training process, the radii of gyration both reach a minimum within a few tens of training epochs, and then start slowly increasing without ever reaching a maximum. From a geometric perspective, this non-monotonicity corresponds to an initial rapid contraction of the manifolds followed by a slower expansion, and it was first observed in [3].
Let’s move on to the centre-to-centre distance dctc between Meven and Modd,
measuring their average separation.
Fig. 4.3 dctc(t) for the same 30 MNIST runs of Fig. 4.1.
Hyperparameters: |Dtrain| = 10000, η = 0.3.
Here again, we see that dctc exhibits the same kind of non-monotonic behaviour as the radius of gyration RM, only this time reversed: after a steep initial growth, the centre-to-centre distance reaches a maximum in a few tens of training epochs, then it starts dropping in a milder way. From a geometric point of view, the manifolds show an initial quick distancing and a subsequent, more gradual mutual approach. We must stress that, while this non-monotonic phenomenon and the previously mentioned contraction-expansion one describe two distinct manifold dynamics, they seem to occur at approximately the same training epoch. From now on, we will refer to the training epoch at which the monotonicity of a geometric observable O(t) changes as its epoch of inversion, denoted t∗. We shall come back to this subject in Subsec. 4.1.3.
To conclude our numerical analysis on the dynamics of the geometric observables
introduced in 2.3.2, we will now discuss the dimensionless radius of gyration R̂M.
Since RM and dctc are characterized by opposite dynamics over the training epoch,
with their respective epochs of inversion approximately coinciding, R̂M := RM/dctc
exhibits the same qualitative behaviour as RM, only visibly more pronounced (see Fig.
4.4). R̂M can therefore be interpreted as a dimensionless quantity single-handedly
encapsulating all the intra-manifold and inter-manifold dynamics.
Fig. 4.4 RMeven (t) and R̂Meven (t) for the same 30 MNIST runs of Fig. 4.1.
Hyperparameters: |Dtrain| = 10000, η = 0.3.
4.1.2 Comparing MNIST with Similar Datasets
MNIST is usually the first dataset researchers use as a benchmark to validate
their algorithms, as we have already discussed in Subsec. 3.1.1. “If it doesn’t work on
MNIST, it won’t work at all”, as they say. Well, if it does work on MNIST, it may still
fail on others. For this reason, we decided to extend our investigation of the dynamics
of the usual geometric observables to other, more complex datasets, namely EMNIST
Letters, KMNIST, and Fashion-MNIST (see Sec. 3.1), to see if the non-monotonic
behaviours we found in MNIST were also present in other structured datasets, or if
they were just a special feature of the former. We used the same hyperparameter values for all the datasets we examined, only adjusting the learning rate η to compensate for variations in dataset complexity. We are only going to provide the graphs of the dimensionless radius of gyration R̂M since, as we explained in Subsec. 4.1.1, this geometric observable contains all the information regarding the intra-manifold and inter-manifold dynamics.
(a) EMNIST Letters
(b) KMNIST
(c) Fashion-MNIST
Fig. 4.5 R̂M(t) for 30 runs on EMNIST Letters, KMNIST and Fashion-MNIST.
Hyperparameters: |Dtrain| = 10000, η = 0.2.
We can clearly see that the same exact non-monotonic dynamics we first observed
in MNIST is found in EMNIST Letters (Fig. 4.5 (a)), KMNIST (Fig. 4.5 (b)), and
Fashion-MNIST (Fig. 4.5 (c)) as well. For this reason, from now on we will continue
our investigation solely on MNIST, and the results will be treated as general properties
of all the datasets we examined.
4.1.3 Epochs of Inversion
We conclude this section with a brief discussion on training epochs. As we stated in Subsec. 4.1.1, each non-monotonic geometric observable O(t) is characterized by an epoch of inversion t∗, i.e. the training epoch at which it exhibits a change in monotonicity, defined as

t^{*} := \begin{cases} \arg\min_t\{O(t)\}, & \text{if } O = R_M \lor O = \hat{R}_M \\ \arg\max_t\{O(t)\}, & \text{if } O = d_{ctc} \end{cases}.    (4.1)
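Operationally, t∗ can be read off the recorded time series of an observable; a small sketch, assuming the values of the observable have been stored epoch by epoch in a NumPy array:

import numpy as np

def epoch_of_inversion(observable, kind="min"):
    # observable: 1-D array with one value per training epoch;
    # kind="min" for R_M and R_hat_M (first case of Eq. (4.1)), kind="max" for d_ctc (second case)
    return int(np.argmin(observable)) if kind == "min" else int(np.argmax(observable))

# example: t_star = epoch_of_inversion(radius_history, kind="min")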
Since we already noted that the epochs of inversion of RM and dctc are localized
inside a small range of training epochs, we decided to identify this inversion band for
a sample of 30 distinct runs on MNIST.
Fig. 4.6 Inversion band of RM(t) and dctc(t) for 30 runs (MNIST).
Hyperparameters: |Dtrain| = 10000, η = 0.3.
We can see in Fig. 4.6 that the epochs of inversion of RM(t) and dctc(t) for the 30 runs are localized in a rather narrow inversion band of ∼ 20 training epochs. We can therefore conclude that the two distinct manifold dynamics, i.e. contraction-expansion and distancing-reapproaching, indeed occur almost simultaneously during the training process of a feed-forward neural network.
4.2 The Finding of Stragglers Data Points
In Sec. 4.1, we delved into the array of non-monotonic behaviours displayed by the geometric observables during the training process of a feed-forward neural network. Since we couldn't help but notice that the epochs of inversion of the different geometric observables are all localized in narrow bands, we wondered whether there was something hidden among the data points, interacting with the neural network in a way that causes the changes in monotonicity that we observe. Indeed, we found that all the non-monotonic trends are attributable to those data points that still get misclassified during the training epochs in the inversion band, and we decided to dub them stragglers¹ accordingly. Not only are stragglers the data points responsible for the non-monotonicity of the geometric observables but, as we'll see, they are also intrinsically linked to the generalization capability of neural networks.
4.2.1 The Critical Role of Stragglers
To empirically illustrate the role of stragglers data points in both the appearance
of non-monotonic behaviours in the geometric observables and generalization, we
adopted the following procedure:
1. We trained our neural network on a training set Dtrain of 10000 data points, saving the positions of the N stragglers identified at the epoch of inversion t∗ of the dimensionless radius of gyration R̂M;
2. We removed the N stragglers from the training set Dtrain and trained the neural network on the remaining training set D′train;
3. We removed N random data points from the original training set Dtrain and trained the neural network on the remaining training set D′′train;
4. We cross-compared the results from the previous three steps.
This time around, we will begin by showing the results for the geometric observables and then move on to the training and test errors. As we can plainly see in Fig. 4.7, removing ∼ 10% of the elements in the training set Dtrain in the form of 1125 random data points has a negligible effect on the dynamics of the manifolds, whereas removing the exact same number of data points from Dtrain in the form of stragglers completely lifts the non-monotonicity of the geometric observables: both RM and R̂M turn into monotonically decreasing functions, while dctc becomes monotonically increasing. RM and R̂M also reach considerably smaller values while dctc grows much larger, meaning that the neural network is succeeding in untangling the two object manifolds with a substantially smaller effort.
¹ The Oxford Languages definition of straggler is "a person in a group who becomes separated from the others, typically because of moving more slowly".
(a) RM(t)
(b) dctc(t)
(c) R̂M(t)
Fig. 4.7 RM(t), dctc(t) and R̂M(t) for the three different sets of 30 runs (MNIST).
Hyperparameters: |Dtrain| = 10000, |D′train| = |D′′train| = 8875, η = 0.3.
(a) Etrain
(b) Etest
Fig. 4.8 Etrain(t) and Etest(t) for the same three sets of 30 runs of Fig. 4.7.
Hyperparameters: |Dtrain| = 10000, |D′train| = |D′′train| = 8875, η = 0.3.
Let’s move on the graphs of the training and test errors (Fig. 4.8). As it happens
for the geometric observables, the effects of the removal of 1125 random data points
from the training set Dtrain are virtually insignificant. On the other hand, removing
the 1125 stragglers makes the training error drop to a perfect 0% in a few dozen
training epochs, while the test error always stays above the other two, flattening out
at approximately 10%. This result is very interesting, as it signifies that stragglers
data points are the hardest to learn, but they are also the ones that truly contribute
to the generalization capability of the neural network.
4.2.2 Steps Towards a Formal Definition
At the beginning of this section, we outlined stragglers as the data points that get misclassified during the training epochs in the inversion band, i.e. for t∗min ≤ t ≤ t∗max, where t∗min and t∗max are the minimum and the maximum of the inversion band, respectively. Since each training epoch t is characterized by its own set of misclassified data points S(t), we can define the set of stragglers as

S := \bigcup_{t^{*}_{min} \le t \le t^{*}_{max}} S(t).    (4.2)
In practice, however, there is no need to use the full set of stragglers S, and we can obtain almost identical results using just a subset S(t∗), where t∗ is the epoch of inversion of any non-monotonic geometric observable; for instance, in Subsec. 4.2.1 we used the epoch of inversion of R̂M to define the subset of stragglers.
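A sketch of how the subset S(t∗) can be collected during training (the names are our own placeholders; the misclassification test follows Eq. (3.8)):

import torch

@torch.no_grad()
def misclassified_indices(model, X, y):
    # S(t): indices of the training points that the network misclassifies at the current epoch
    predictions = model(X).argmax(dim=1)
    return set(torch.nonzero(predictions != y).flatten().tolist())

# at the epoch of inversion t*, the straggler subset is simply
# stragglers = misclassified_indices(model, X_train, y_train)
# and the reduced training set D'_train is obtained by dropping those indices from D_train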
To give consistency to definition (4.2), we trained our neural network on multiple training sets D′train(t∗) = Dtrain \ S(t∗), where t∗ indicates the epoch whose set of misclassified data points S(t∗) was removed from the original training set Dtrain, and we then compared their test errors Etest at training epoch t = 200. As we can see in Fig. 4.9, we indeed found that, following an initial exponential decay, Etest(t = 200) becomes linear around the midpoint of the inversion band, remaining almost constant until it starts oscillating at around t∗ = 80.
Fig. 4.9 A scatter plot of Etest(t = 200) as a function of the epoch t∗ whose |S(t∗)| misclassified data points were removed from the training set Dtrain (MNIST).
Hyperparameters: |D′train(t∗)| = 10000 − |S(t∗)|, η = 0.3.
4.2.3 Training and Filtration
As we already mentioned, each training epoch t is characterized by its own set of misclassified data points S(t), but we don't know much about S(t) itself, except that a monotonically decreasing training error Etrain(t) corresponds to a monotonically decreasing size of S(t). We are particularly interested in understanding whether the training process of a neural network can be assimilated to a descending filtration.
In mathematics, a descending filtration F is defined as an indexed family {S_i}_{i∈I} of subobjects of a given algebraic structure S, with the index i running over some totally ordered index set I, subject to the condition that if i ≤ j in I, then S_i ⊇ S_j. A necessary condition for the training process to be a descending filtration is that data points must never enter S(t): once a data point xi gets correctly classified by the neural network, it cannot get misclassified during subsequent training epochs. The results shown in Fig. 4.10 prove that this is definitely not the case, as many training epochs are characterized by a non-zero number of data points entering S(t), especially the ones that precede the inversion band. Therefore, the training process of a feed-forward neural network cannot be considered a descending filtration.
Fig. 4.10 A scatter plot of the number of points entering/exiting S(t) during each
training epoch t (MNIST).
Hyperparameters: |Dtrain| = 10000, η = 0.3.
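The counts of points entering and exiting S(t) shown in Fig. 4.10 can be obtained from the per-epoch misclassification sets with simple set operations; a sketch (S_prev and S_curr are placeholders for the sets of misclassified indices at epochs t−1 and t):

def entering_exiting(S_prev, S_curr):
    entered = len(S_curr - S_prev)  # points correctly classified at t-1 and misclassified at t
    exited = len(S_prev - S_curr)   # points misclassified at t-1 and correctly classified at t
    return entered, exited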
4.3 Effects of the Shuffling of Labels
We shall conclude by investigating the effects that a shuffle of the binary labels yi ∈ {0, 1} (that is, a random relabelling of "even" and "odd" labelled elements) induces in the geometric structure of the dataset D = {(xi, yi)}, i = 1, ..., n.
Let's begin by showing a graph of the training error Etrain and the test error Etest, as usual. To ensure a fair comparison between the two, we shuffled the labels of both the training set Dtrain and the test set Dtest.
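A sketch of the label shuffling itself (placeholder names; a random permutation of the label vector destroys the correspondence between images and labels while preserving the overall label proportions):

import torch

def shuffle_labels(y):
    # y: 1-D tensor of binary labels; returns a randomly permuted copy
    permutation = torch.randperm(y.size(0))
    return y[permutation]

# applied to both the training and the test labels before training:
# y_train_shuffled = shuffle_labels(y_train)
# y_test_shuffled = shuffle_labels(y_test)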
Fig. 4.11 Etrain(t) and Etest(t) for 30 shuffled-labels runs (MNIST).
Hyperparameters: |Dtrain| = 10000, η = 0.3.
As one would expect, the training error continues to decrease over the training epoch even if we shuffle the labels before training, albeit at a slower pace compared to training with unshuffled labels (see Fig. 4.1), meaning the neural network is still learning to classify the training data points. On the contrary, the generalization capability of the model does not improve at all, with the test error staying constant at a value of 50%. What's interesting, however, is that, as we're about to see, the shuffling of the labels completely lifts the non-monotonic behaviours in all the geometric observables, thus totally disrupting the object manifold structure. The radius of gyration RM becomes a monotonically non-increasing function, almost constant over the course of training (Fig. 4.12 (a)), the centre-to-centre distance dctc becomes monotonically increasing (Fig. 4.12 (b)) and, as a result, the dimensionless radius of gyration R̂M is now monotonically decreasing (Fig. 4.12 (c)). As a side effect, the shuffling of the labels makes the gradient descent algorithm (see Subsec. 2.1.2) extremely unstable, with consequent severe oscillations in all the observables, especially in the radius of gyration RM (Fig. 4.12 (a)).
(a) RM(t)
(b) dctc(t)
(c) R̂M(t)
Fig. 4.12 RM(t), dctc(t) and R̂M(t) for the same 30 shuffled-labels runs of Fig. 4.11.
Hyperparameters: |Dtrain| = 10000, η = 0.3.
Chapter 5
Conclusions and Future Work
Because of their ability to reproduce and model non-linear processes, artificial
neural networks have found applications in many modern disciplines, including, but
not limited to, computational neuroscience, quantum chemistry, cybersecurity, data
mining and finance. ANNs have also been used as a tool to solve PDEs in physics and
simulate the properties of many-body open quantum systems. A common simplifying
assumption of theoretical investigations on neural networks is that of considering
input data points as unstructured data. In the majority of real-life datasets, however,
data points with similar features are found to be closer together when compared to
points exhibiting considerable differences in their characteristics, clustered in the
state spaces into geometric structures known as object manifolds, which in the case
of labelled data are simply the collections of data points sharing the same label. It
is known that, during their training process, neural networks generally bring points
belonging to the same manifold closer together and drive points belonging to different
manifolds farther away, effectively untangling them.
The cornerstone of this work was a numerical analysis of how feed-forward neural
networks process the geometrical properties of structured data during their training
process. The training task consisted of a binary classification of the elements of MNIST and other similar structured datasets, with plain gradient descent as the optimization algorithm, in keeping with Occam's razor (also known as the principle of parsimony), in order to have a clear vision of the connection between the structure in data and the interesting dynamics of the latent geometries of neural networks. Over
the course of our numerical investigation, we managed to uncover non-monotonic
behaviours both in the radius of gyration, measuring the average spread of data
points belonging to the same manifold around its centre, and in the centre-to-centre
distance, quantifying the separation between the centres of two different manifolds.
We also found these behaviours to be a common feature of all the structured datasets
we examined. The greatest achievement of this work, however, is the finding that these
non-monotonic dynamics are entirely due to the existence of stragglers data points,
which we also proved to be crucial in the growth of the generalization capability of
neural networks. Finally, we showed that the training process of a neural network
cannot be considered a descending filtration, and that a random shuffle of the data points' labels completely lifts the non-monotonicity of the geometric observables.
Of course, there is still work to be done on the subject: while we have proposed a straightforward operational definition for stragglers data points, we know very little about their intrinsic features. For example, we don't know if it is possible to identify
stragglers without actually training the neural network, and besides, a data point
that is a straggler for a specific neural network may not be so for a different one.
Discovering more about the true nature of stragglers could lead to many interesting
developments, such as the transition to smaller, yet more impactful training sets, and
a more robust theoretical framework for deep learning itself. In addition, we still
cannot claim that the non-monotonic dynamics we observed in all the datasets we
examined are, in fact, general properties of structured datasets, although our intuition
would suggest that they are.
We would like to end this dissertation with a simple, yet meaningful quote by the
American computer programmer and science fiction writer Daniel Keys Moran:
“You can have data without information,
but you cannot have information without data.”
References
[1] Charles F. Cadieu, Ha Hong, Daniel L. K. Yamins, Nicolas Pinto, Diego Ardila,
Ethan A. Solomon, Najib J. Majaj, and James J. DiCarlo. Deep neural networks
rival the representation of primate IT cortex for core visual object recognition.
PLoS Computational Biology, 10(12):e1003963, dec 2014.
[2] SueYeon Chung, Daniel D. Lee, and Haim Sompolinsky. Classification and ge-
ometry of general perceptual manifolds. Physical Review X, 8(3), jul 2018.
[3] Simone Ciceri. Geometrical processing of data in multilayer neural networks.
Bachelor’s Thesis, UNIMI, 2020.
[4] Tarin Clanuwat, Mikel Bober-Irizar, Asanobu Kitamoto, Alex Lamb, Kazuaki
Yamamoto, and David Ha. Deep learning for classical japanese literature. CoRR,
abs/1812.01718, 2018.
[5] Gregory Cohen, Saeed Afshar, Jonathan Tapson, and André van Schaik. EM-
NIST: an extension of MNIST to handwritten letters. CoRR, abs/1702.05373,
2017.
[6] Uri Cohen, Sue Yeon Chung, Daniel Lee, and Haim Sompolinsky. Separability
and geometry of object manifolds in deep neural networks, 05 2019.
[7] James Dicarlo and David Cox. Untangling invariant object recognition. Trends
in cognitive sciences, 11:333–41, 09 2007.
[8] James Dicarlo, Davide Zoccolan, and Nicole Rust. How does the brain solve
visual object recognition? Neuron, 73:415–34, 02 2012.
[9] Surya Ganguli and Haim Sompolinsky. Statistical mechanics of compressed sens-
ing. Phys. Rev. Lett., 104:188701, May 2010.
[10] E Gardner. Maximum storage capacity in neural networks. Europhysics Letters
(EPL), 4(4):481–485, aug 1987.
[11] E. Gardner and Bernard Derrida. Optimal storage properties of neural network
models. Journal of Physics A, 21:271–284, 1988.
[12] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT
Press, 2016.
[13] Nikolaus Kriegeskorte. Deep neural networks: A new framework for modeling
biological vision and brain information processing. Annual Review of Vision
Science, 1:417–446, 11 2015.
[14] Nikolaus Kriegeskorte, Marieke Mur, Douglas Ruff, Roozbeh Kiani, Jerzy Bo-
durka, Hossein Esteky, Keiji Tanaka, and Peter Bandettini. Matching categorical
object representations in inferior temporal cortex of man and monkey. Neuron,
60:1126–41, 01 2009.
[15] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification
with deep convolutional neural networks. In F. Pereira, C.J. Burges, L. Bottou,
and K.Q. Weinberger, editors, Advances in Neural Information Processing Sys-
tems, volume 25. Curran Associates, Inc., 2012.
[16] Andrea Lazzari. Analisi del perceptron e della sua espressività nella classifi-
cazione di dati strutturati. Bachelor’s Thesis, UNIMI, 2020.
[17] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied
to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[18] Yann A. LeCun, Léon Bottou, Genevieve B. Orr, and Klaus-Robert Müller.
Efficient BackProp, pages 9–48. Springer Berlin Heidelberg, Berlin, Heidelberg,
2012.
[19] Tom M. Mitchell. Machine Learning. McGraw-Hill, New York, 1997.
[20] J. O’Keefe and J. Dostrovsky. The hippocampus as a spatial map. preliminary
evidence from unit activity in the freely-moving rat. Brain Research, 34(1):171–
175, 1971.
[21] David E. Rumelhart and David Zipser. Feature discovery by competitive learn-
ing. Cognitive Science, 9(1):75–112, 1985.
[22] A. L. Samuel. Some studies in machine learning using the game of checkers. IBM
Journal of Research and Development, 3(3):210–229, 1959.
[23] A. M. Turing. Computing Machinery and Intelligence. Mind, LIX(236):433–460,
10 1950.
[24] Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-mnist: a novel image
dataset for benchmarking machine learning algorithms. CoRR, abs/1708.07747,
2017.
[25] Daniel Yamins, Ha Hong, Charles Cadieu, Ethan Solomon, Darren Seibert, and
James Dicarlo. Performance-optimized hierarchical models predict neural re-
sponses in higher visual cortex. Proceedings of the National Academy of Sciences
of the United States of America, 111, 05 2014.

More Related Content

Similar to Geometric Processing of Data in Neural Networks

Content Based Image Retrieval
Content Based Image RetrievalContent Based Image Retrieval
Content Based Image RetrievalLéo Vetter
 
KurtPortelliMastersDissertation
KurtPortelliMastersDissertationKurtPortelliMastersDissertation
KurtPortelliMastersDissertationKurt Portelli
 
Stock_Market_Prediction_using_Social_Media_Analysis
Stock_Market_Prediction_using_Social_Media_AnalysisStock_Market_Prediction_using_Social_Media_Analysis
Stock_Market_Prediction_using_Social_Media_AnalysisOktay Bahceci
 
Thesis. A comparison between some generative and discriminative classifiers.
Thesis. A comparison between some generative and discriminative classifiers.Thesis. A comparison between some generative and discriminative classifiers.
Thesis. A comparison between some generative and discriminative classifiers.Pedro Ernesto Alonso
 
Nweke digital-forensics-masters-thesis-sapienza-university-italy
Nweke digital-forensics-masters-thesis-sapienza-university-italyNweke digital-forensics-masters-thesis-sapienza-university-italy
Nweke digital-forensics-masters-thesis-sapienza-university-italyAimonJamali
 
Trade-off between recognition an reconstruction: Application of Robotics Visi...
Trade-off between recognition an reconstruction: Application of Robotics Visi...Trade-off between recognition an reconstruction: Application of Robotics Visi...
Trade-off between recognition an reconstruction: Application of Robotics Visi...stainvai
 
Neural Networks on Steroids
Neural Networks on SteroidsNeural Networks on Steroids
Neural Networks on SteroidsAdam Blevins
 
Integrating IoT Sensory Inputs For Cloud Manufacturing Based Paradigm
Integrating IoT Sensory Inputs For Cloud Manufacturing Based ParadigmIntegrating IoT Sensory Inputs For Cloud Manufacturing Based Paradigm
Integrating IoT Sensory Inputs For Cloud Manufacturing Based ParadigmKavita Pillai
 
Branch and-bound nearest neighbor searching over unbalanced trie-structured o...
Branch and-bound nearest neighbor searching over unbalanced trie-structured o...Branch and-bound nearest neighbor searching over unbalanced trie-structured o...
Branch and-bound nearest neighbor searching over unbalanced trie-structured o...Michail Argyriou
 
Scalable Distributed Real-Time Clustering for Big Data Streams
Scalable Distributed Real-Time Clustering for Big Data StreamsScalable Distributed Real-Time Clustering for Big Data Streams
Scalable Distributed Real-Time Clustering for Big Data StreamsAntonio Severien
 
ubc_2014_spring_dewancker_ian (9)
ubc_2014_spring_dewancker_ian (9)ubc_2014_spring_dewancker_ian (9)
ubc_2014_spring_dewancker_ian (9)Ian Dewancker
 

Similar to Geometric Processing of Data in Neural Networks (20)

edc_adaptivity
edc_adaptivityedc_adaptivity
edc_adaptivity
 
MSc_Thesis
MSc_ThesisMSc_Thesis
MSc_Thesis
 
Content Based Image Retrieval
Content Based Image RetrievalContent Based Image Retrieval
Content Based Image Retrieval
 
KurtPortelliMastersDissertation
KurtPortelliMastersDissertationKurtPortelliMastersDissertation
KurtPortelliMastersDissertation
 
Stock_Market_Prediction_using_Social_Media_Analysis
Stock_Market_Prediction_using_Social_Media_AnalysisStock_Market_Prediction_using_Social_Media_Analysis
Stock_Market_Prediction_using_Social_Media_Analysis
 
main
mainmain
main
 
PhD_Dissertation_HeZhang
PhD_Dissertation_HeZhangPhD_Dissertation_HeZhang
PhD_Dissertation_HeZhang
 
Thesis. A comparison between some generative and discriminative classifiers.
Thesis. A comparison between some generative and discriminative classifiers.Thesis. A comparison between some generative and discriminative classifiers.
Thesis. A comparison between some generative and discriminative classifiers.
 
Sona project
Sona projectSona project
Sona project
 
Nweke digital-forensics-masters-thesis-sapienza-university-italy
Nweke digital-forensics-masters-thesis-sapienza-university-italyNweke digital-forensics-masters-thesis-sapienza-university-italy
Nweke digital-forensics-masters-thesis-sapienza-university-italy
 
Tac note
Tac noteTac note
Tac note
 
Trade-off between recognition an reconstruction: Application of Robotics Visi...
Trade-off between recognition an reconstruction: Application of Robotics Visi...Trade-off between recognition an reconstruction: Application of Robotics Visi...
Trade-off between recognition an reconstruction: Application of Robotics Visi...
 
Thesis
ThesisThesis
Thesis
 
Thesis-DelgerLhamsuren
Thesis-DelgerLhamsurenThesis-DelgerLhamsuren
Thesis-DelgerLhamsuren
 
Neural Networks on Steroids
Neural Networks on SteroidsNeural Networks on Steroids
Neural Networks on Steroids
 
Integrating IoT Sensory Inputs For Cloud Manufacturing Based Paradigm
Integrating IoT Sensory Inputs For Cloud Manufacturing Based ParadigmIntegrating IoT Sensory Inputs For Cloud Manufacturing Based Paradigm
Integrating IoT Sensory Inputs For Cloud Manufacturing Based Paradigm
 
Branch and-bound nearest neighbor searching over unbalanced trie-structured o...
Branch and-bound nearest neighbor searching over unbalanced trie-structured o...Branch and-bound nearest neighbor searching over unbalanced trie-structured o...
Branch and-bound nearest neighbor searching over unbalanced trie-structured o...
 
Scalable Distributed Real-Time Clustering for Big Data Streams
Scalable Distributed Real-Time Clustering for Big Data StreamsScalable Distributed Real-Time Clustering for Big Data Streams
Scalable Distributed Real-Time Clustering for Big Data Streams
 
Tr1546
Tr1546Tr1546
Tr1546
 
ubc_2014_spring_dewancker_ian (9)
ubc_2014_spring_dewancker_ian (9)ubc_2014_spring_dewancker_ian (9)
ubc_2014_spring_dewancker_ian (9)
 

Recently uploaded

Natural Polymer Based Nanomaterials
Natural Polymer Based NanomaterialsNatural Polymer Based Nanomaterials
Natural Polymer Based NanomaterialsAArockiyaNisha
 
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...Sérgio Sacani
 
Orientation, design and principles of polyhouse
Orientation, design and principles of polyhouseOrientation, design and principles of polyhouse
Orientation, design and principles of polyhousejana861314
 
GFP in rDNA Technology (Biotechnology).pptx
GFP in rDNA Technology (Biotechnology).pptxGFP in rDNA Technology (Biotechnology).pptx
GFP in rDNA Technology (Biotechnology).pptxAleenaTreesaSaji
 
Presentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptxPresentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptxgindu3009
 
Physiochemical properties of nanomaterials and its nanotoxicity.pptx
Physiochemical properties of nanomaterials and its nanotoxicity.pptxPhysiochemical properties of nanomaterials and its nanotoxicity.pptx
Physiochemical properties of nanomaterials and its nanotoxicity.pptxAArockiyaNisha
 
Animal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptxAnimal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptxUmerFayaz5
 
Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)PraveenaKalaiselvan1
 
Is RISC-V ready for HPC workload? Maybe?
Is RISC-V ready for HPC workload? Maybe?Is RISC-V ready for HPC workload? Maybe?
Is RISC-V ready for HPC workload? Maybe?Patrick Diehl
 
Formation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disksFormation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disksSérgio Sacani
 
Spermiogenesis or Spermateleosis or metamorphosis of spermatid
Spermiogenesis or Spermateleosis or metamorphosis of spermatidSpermiogenesis or Spermateleosis or metamorphosis of spermatid
Spermiogenesis or Spermateleosis or metamorphosis of spermatidSarthak Sekhar Mondal
 
Analytical Profile of Coleus Forskohlii | Forskolin .pdf
Analytical Profile of Coleus Forskohlii | Forskolin .pdfAnalytical Profile of Coleus Forskohlii | Forskolin .pdf
Analytical Profile of Coleus Forskohlii | Forskolin .pdfSwapnil Therkar
 
Cultivation of KODO MILLET . made by Ghanshyam pptx
Cultivation of KODO MILLET . made by Ghanshyam pptxCultivation of KODO MILLET . made by Ghanshyam pptx
Cultivation of KODO MILLET . made by Ghanshyam pptxpradhanghanshyam7136
 
Nanoparticles synthesis and characterization​ ​
Nanoparticles synthesis and characterization​  ​Nanoparticles synthesis and characterization​  ​
Nanoparticles synthesis and characterization​ ​kaibalyasahoo82800
 
Analytical Profile of Coleus Forskohlii | Forskolin .pptx
Analytical Profile of Coleus Forskohlii | Forskolin .pptxAnalytical Profile of Coleus Forskohlii | Forskolin .pptx
Analytical Profile of Coleus Forskohlii | Forskolin .pptxSwapnil Therkar
 
Call Us ≽ 9953322196 ≼ Call Girls In Mukherjee Nagar(Delhi) |
Call Us ≽ 9953322196 ≼ Call Girls In Mukherjee Nagar(Delhi) |Call Us ≽ 9953322196 ≼ Call Girls In Mukherjee Nagar(Delhi) |
Call Us ≽ 9953322196 ≼ Call Girls In Mukherjee Nagar(Delhi) |aasikanpl
 
STERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCE
STERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCESTERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCE
STERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCEPRINCE C P
 

Recently uploaded (20)

Natural Polymer Based Nanomaterials
Natural Polymer Based NanomaterialsNatural Polymer Based Nanomaterials
Natural Polymer Based Nanomaterials
 
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
 
Orientation, design and principles of polyhouse
Orientation, design and principles of polyhouseOrientation, design and principles of polyhouse
Orientation, design and principles of polyhouse
 
GFP in rDNA Technology (Biotechnology).pptx
GFP in rDNA Technology (Biotechnology).pptxGFP in rDNA Technology (Biotechnology).pptx
GFP in rDNA Technology (Biotechnology).pptx
 
Engler and Prantl system of classification in plant taxonomy
Engler and Prantl system of classification in plant taxonomyEngler and Prantl system of classification in plant taxonomy
Engler and Prantl system of classification in plant taxonomy
 
Presentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptxPresentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptx
 
Physiochemical properties of nanomaterials and its nanotoxicity.pptx
Physiochemical properties of nanomaterials and its nanotoxicity.pptxPhysiochemical properties of nanomaterials and its nanotoxicity.pptx
Physiochemical properties of nanomaterials and its nanotoxicity.pptx
 
9953056974 Young Call Girls In Mahavir enclave Indian Quality Escort service
9953056974 Young Call Girls In Mahavir enclave Indian Quality Escort service9953056974 Young Call Girls In Mahavir enclave Indian Quality Escort service
9953056974 Young Call Girls In Mahavir enclave Indian Quality Escort service
 
Animal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptxAnimal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptx
 
Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)
 
Is RISC-V ready for HPC workload? Maybe?
Is RISC-V ready for HPC workload? Maybe?Is RISC-V ready for HPC workload? Maybe?
Is RISC-V ready for HPC workload? Maybe?
 
Formation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disksFormation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disks
 
Spermiogenesis or Spermateleosis or metamorphosis of spermatid
Spermiogenesis or Spermateleosis or metamorphosis of spermatidSpermiogenesis or Spermateleosis or metamorphosis of spermatid
Spermiogenesis or Spermateleosis or metamorphosis of spermatid
 
Analytical Profile of Coleus Forskohlii | Forskolin .pdf
Analytical Profile of Coleus Forskohlii | Forskolin .pdfAnalytical Profile of Coleus Forskohlii | Forskolin .pdf
Analytical Profile of Coleus Forskohlii | Forskolin .pdf
 
Cultivation of KODO MILLET . made by Ghanshyam pptx
Cultivation of KODO MILLET . made by Ghanshyam pptxCultivation of KODO MILLET . made by Ghanshyam pptx
Cultivation of KODO MILLET . made by Ghanshyam pptx
 
Nanoparticles synthesis and characterization​ ​
Nanoparticles synthesis and characterization​  ​Nanoparticles synthesis and characterization​  ​
Nanoparticles synthesis and characterization​ ​
 
CELL -Structural and Functional unit of life.pdf
CELL -Structural and Functional unit of life.pdfCELL -Structural and Functional unit of life.pdf
CELL -Structural and Functional unit of life.pdf
 
Analytical Profile of Coleus Forskohlii | Forskolin .pptx
Analytical Profile of Coleus Forskohlii | Forskolin .pptxAnalytical Profile of Coleus Forskohlii | Forskolin .pptx
Analytical Profile of Coleus Forskohlii | Forskolin .pptx
 
Call Us ≽ 9953322196 ≼ Call Girls In Mukherjee Nagar(Delhi) |
Call Us ≽ 9953322196 ≼ Call Girls In Mukherjee Nagar(Delhi) |Call Us ≽ 9953322196 ≼ Call Girls In Mukherjee Nagar(Delhi) |
Call Us ≽ 9953322196 ≼ Call Girls In Mukherjee Nagar(Delhi) |
 
STERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCE
STERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCESTERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCE
STERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCE
 

Geometric Processing of Data in Neural Networks

  • 1. Corso di Laurea Triennale in Fisica GEOMETRIC PROCESSING OF DATA IN NEURAL NETWORKS Relatore: Prof. Marco Gherardi Correlatore: Dott. Pietro Rotondo Candidato: Lorenzo Cassani Matricola: 867016 Anno Accademico 2021-2022
  • 2.
  • 3. Acknowledgments This bachelor’s thesis would not have been possible without the support of many people. Professor Marco Gherardi has been an ideal teacher and thesis supervisor, offering advice and encouragement with a perfect blend of insight and humor. I’m proud of, and grateful for, my time working with Marco. I would also like to thank my assistant supervisor Pietro Rotondo, as well as graduate student Simone Ciceri, for their constant help throughout this project. I wish to extend my special thanks to my university colleagues (and dear friends), I ragazzi di Via Celoria, who were always there for me over the course of my degree. Their company in the study halls, laboratories and pubs will always be remembered. A huge thanks also goes to my friends from Romagna, my homeland. I already spent some of my best years in their company, and they will always have a special place in my heart. I am grateful for my parents whose constant love and support keep me motivated and confident. My accomplishments and success are because they believed in me. Deepest thanks to my little sister, who taught me that you should never be afraid of being yourself. Finally, I would like to thank my beloved girlfriend Silvia, who’s had my back for almost eight years. I could never have done this without her.
  • 4. Abstract Feed-forward neural networks can be considered as geometric transformations that act on input data points. It is known that, during training, those transformations generally bring points belonging to the same class closer together and drive points belonging to different classes farther away. The purpose of this work is to carry out a numerical analysis of how this description varies during training. The training task consisted of a binary (e.g. even digits vs. odd digits) of the elements of MNIST and other similar structured datasets, in order to have a clear vision of the link between structure in data and possible noteworthy behaviours in the evolution of the inner geometries of neural networks. Particular attention has been reserved to data points which neural networks struggle more to correctly classify, and their connection to the neural networks generalization capability.
  • 5. Contents
1 Introduction
  1.1 Motivation
  1.2 Purpose
2 Deep Learning as Geometric Processing of Data
  2.1 Machine Learning
    2.1.1 The Supervised Learning Approach
    2.1.2 The Gradient Descent Optimization Algorithm
  2.2 Artificial Neural Networks and Deep Learning
    2.2.1 Artificial Neurons
    2.2.2 The Feed-Forward Architecture
    2.2.3 Training Neural Networks
    2.2.4 The Backpropagation Algorithm
  2.3 Structure in Data
    2.3.1 The Object Manifold Model
    2.3.2 Geometrical Observables
3 Tools and Techniques Employed
  3.1 Overview of the Datasets
    3.1.1 MNIST
    3.1.2 EMNIST
    3.1.3 KMNIST
    3.1.4 Fashion-MNIST
  3.2 Programming Language and Libraries
    3.2.1 Why we opted for Python
    3.2.2 The PyTorch Framework for Machine Learning
  3.3 Data Pre-Processing
    3.3.1 Rescaling (min-max Normalization)
    3.3.2 Standardization (Z-score Normalization)
  3.4 ML Task and Network Architecture
    3.4.1 The Binary Classification Task
    3.4.2 Network Architecture
    3.4.3 Training and Test Errors Computation
    3.4.4 Choice of Activation Function
    3.4.5 Choice of Loss Function
4 Numerical Analysis Results
  4.1 Dynamics of Geometric Observables
    4.1.1 Non-Monotonic Behaviours in MNIST
    4.1.2 Comparing MNIST with Similar Datasets
    4.1.3 Epochs of Inversion
  4.2 The Finding of Stragglers Data Points
    4.2.1 The Critical Role of Stragglers
    4.2.2 Steps Towards a Formal Definition
    4.2.3 Training and Filtration
  4.3 Effects of the Shuffling of Labels
5 Conclusions and Future Work
  • 7. Chapter 1 Introduction 1.1 Motivation The availability of big datasets is a hallmark of modern sciences, including physics, where data analysis has become an important component of diverse areas, such as experimental particle physics, observational astronomy and cosmology, condensed matter physics, biophysics, and quantum computing. Moreover, machine learning and data science are playing increasingly important roles in many aspects of modern technology, ranging from biotechnology to the engineering of self-driving cars and smart devices. Therefore, having a grasp of the concepts and tools used in machine learning is an important skill that is increasingly relevant in the physical sciences. This revolution has been spurred by an exponential growth in computing power and memory commonly known as Moore’s law. This increase in our computational ability has been accompanied by new techniques for analyzing and learning from large datasets. These techniques draw heavily from ideas in statistics, computational neuroscience, computer science, and physics. Similar to physics, modern machine learning places a premium on empirical results and intuition over the more formal treatments common in statistics, computer science, and mathematics. This is not to say that proofs are not important or undesirable. Rather, many of the advances of the last two decades – especially in fields like deep learning – do not have formal justifications. Physicists are uniquely situated to benefit from and contribute to machine learning. Many of the core concepts and techniques used in machine learning - such as Monte-Carlo methods, simulated annealing and variational methods - have their origins in physics. Moreover, “energy-based models” inspired by statistical physics are the backbone of many deep learning methods. For these reasons, there is much in modern ML that will be familiar to physicists. Physicists and astronomers have also been at the forefront of using “big data”. For example, experiments such as CMS and ATLAS at the LHC generate petabytes of data per year (Fig. 1.1). In astronomy, projects such as the Sloan Digital Sky Survey (SDSS) routinely analyze and release hundreds of terabytes of data measuring the properties of nearly a billion stars and galaxies. Researchers in these fields are increasingly incorporating recent advances in ML and data science, and this trend is likely to accelerate in the future.
  • 8. 1.1. Motivation 2 Fig. 1.1 Data (in TB) recorded on tape at CERN month-by-month. This plot shows the amount of data recorded on tape generated by the LHC experiments, other experiments, various back-ups and users. In 2018, over 115 PB of data in total (including about 88 PB of LHC data) were recorded on tape, with a record peak of 15.8 PB in November (Image: Esma Mobs/CERN). And it is in this context that we introduce artificial neural networks, a family of machine learning algorithms that use a “network” consisting of multiple layers of inter-connected nodes. They are inspired by the animal nervous system, where the nodes are viewed as neurons and edges are viewed as synapses. Each edge has an associated parameter, named weight, that the algorithm tunes by itself during the training process, and the network possesses computational rules, in the form of both linear and non-linear functions, for passing data from its input to its output layer. With appropriately defined functions, a neural network can perform various learning tasks by minimizing a loss function over its weights. Multilayer networks can, for example, be used to perform feature learning, since they learn a representation of their input at the intermediate layer(s), commonly known as hidden layer(s), which is subsequently used for classification at the output layer. Deep learning is part of a broader family of machine learning methods based on artificial neural networks with feature learning. The adjective deep refers to the use of multiple layers in the network. Early work showed that a linear, single-neuron network, the perceptron, cannot be a universal classifier, but that a network with a non-polynomial activation function with one hidden layer of unbounded width can. Deep learning can then be considered a modern variation of machine learning, focusing on artificial neural networks with an unbounded number of layers of bounded size, which permits practical application and optimized implementation, while retaining theoretical universality under mild conditions.
  • 9. 1.2. Purpose 3 One of the main criticisms of deep learning concerns the previously mentioned lack of a theoretical framework supporting some of its facets. Learning in the most common deep architectures is implemented using the well-understood gradient descent algorithm or one of its many variations. However, the theory surrounding other aspects, such as how real-world structured data are processed by deep neural networks, is less clear. Deep learning methods are often looked at as a black box, with most confirmations done empirically rather than theoretically. One of the common assumptions of theoretical investigations on neural networks is that of considering input data points as unstructured data, i.e. uncorrelated data points, distributed randomly in their state space. Nevertheless, the necessity of a theoretical framework taking the structure in data into account has already emerged in different contexts, for instance in investigating how stimuli are represented in the brain when an object is shown in varying conditions [8], following the discovery of spatial maps in rodents’ brains [20], or in research explicitly concerning machine learning [9][2][6]. So, what can be considered as “structure” in data? The previously mentioned studies have adopted different definitions and models for describing structured data, depending on their specific purpose and field of origin. Driven by the existing gap between established deep learning techniques and an inadequate theoretical understanding, an increasing interest in this sector has been observed over the last few years, motivating the community of physicists to tackle this problem. Physical approaches and methods originating from statistical physics have already contributed concretely to this task [10][11], shedding light on some of the then still obscure aspects of machine learning and offering new interesting insights.

1.2 Purpose

The purpose of this work is to carry out a numerical analysis of how multi-layer, feed-forward neural networks process structured data, since it is known that this kind of network can be interpreted as a geometric transformation acting on input data points. We trained these networks on the task of binary classification of data from well-known structured datasets, such as MNIST and other similar ones, into “even” and “odd” dichotomies, focusing on how the inner geometric representations of clusters of elements belonging to the same class, called object manifolds, vary during the training process. The training was carried out, for the sake of generality, with basic gradient descent as the optimization algorithm, in order to highlight noteworthy behaviours in the evolution of the inner geometries. Particular attention has been devoted to the data points which neural networks struggle most to classify correctly, and to their connection to the generalization capability of the networks.
  • 10. Chapter 2 Deep Learning as Geometric Processing of Data 2.1 Machine Learning Machine Learning (ML) is a subfield of Artificial Intelligence (AI) with the goal of developing algorithms capable of learning from data automatically. Therefore, techniques in ML tend to be more focused on prediction rather than estimation, in contrast to classical statistics, which instead is primarily concerned with how to use data to estimate the value of an unknown quantity. In addition, methods from ML tend to be applied to more complex high-dimensional problems than those typically encountered classical statistics. ML also has intimate ties to optimization: many learning problems are formulated as minimization of some loss function on a training set of examples. Loss functions express the discrepancy between the predictions of the model being trained and the actual problem instances (for example, in classification, one wants to assign a label to instances, and models are trained to correctly predict the pre-assigned labels of a set of examples). The difference between optimization and ML arises from the goal of generalization: while optimization algorithms can minimize the loss on a training set, machine learning is concerned with minimizing the loss on unseen samples. Characterizing the generalization capability of various learning algorithms is an active topic of current research. The term machine learning was coined in 1959 by Arthur Samuel, an American IBMer and pioneer in the field of computer gaming and artificial intelligence.[22] Tom M. Mitchell provided a widely quoted, more formal definition of the algorithms studied in the machine learning field: “A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E.”[19] This definition of the tasks in which machine learning is concerned offers a fundamentally operational definition rather than defining the field in cognitive terms. This follows Alan Turing’s proposal in his paper “Computing Machinery and Intelligence” [23], in which the question “Can machines think?” is replaced with “Can machines do what we (as thinking entities) can do?”.
  • 11. 2.1. Machine Learning 5 ML approaches are traditionally divided into three broad categories, depending on the nature of the “feedback” available to the learning algorithm: supervised learning, unsupervised learning, and reinforcement learning. • Supervised learning consists in learning from labelled data. The learning algorithm is presented with sample inputs and their desired outputs, the labels, given by a “supervisor”, and the goal is to learn a general rule that maps inputs to outputs. Common supervised learning tasks include classification and regression. Classification algorithms are used when the labels are restricted to a limited set of values, and regression algorithms are used when the desired outputs may have any numerical value within a range. • Unsupervised learning is concerned with finding patterns and structure in unlabelled data. Unsupervised learning algorithms take a dataset that contains unlabelled inputs, and find structure in the data, like grouping or clustering of data points. The algorithms, therefore, learn independently from data that has not been previously labelled. Instead of responding to feedback, unsupervised learning algorithms identify commonalities in the data and react based on the presence or absence of such commonalities in each new piece of input data. Examples of unsupervised learning include clustering, dimensionality reduction, and generative modeling. • Reinforcement learning algorithms learn by interacting with an environment and taking actions to maximize some notion of reward. The environment is typically represented as a Markov decision process (MDP)1 . In Reinforcement learning, algorithms do not assume knowledge of an exact mathematical model of the MDP, and are used when exact models are infeasible. Reinforcement learning algorithms are used in autonomous vehicles or in learning to play a game against a human opponent. In this investigation, we made use of techniques belonging exclusively to supervised learning, since our ML task consisted in a binary classification of MNIST and other similar datasets in “even” and “odd” dichotomies. We should point out that ML presents some universal limitations. First, fitting existing data well is fundamentally different from making predictions about new data. Next, increasing a model’s complexity (i.e number of fitting parameters) will usually yield better results on the training data. However when the training data size is small and the data are noisy, this results in overfitting and can substantially degrade the predictive performance of the model. Furthermore, as the number of parameters in the model increases, we are forced to work in high-dimensional spaces, where the so-called curse of dimensionality ensures that many phenomena that are absent or rare in low-dimensional spaces become generic. Finally, it is difficult to generalize beyond the situations encountered in the training data set. 1 A Markov decision process is a discrete-time stochastic control process providing a mathematical framework for modeling decision making in situations where outcomes are partly random and partly under the control of a decision maker.
  • 12. 2.1. Machine Learning 6 2.1.1 The Supervised Learning Approach

The first ingredient of a supervised learning task is the dataset $D = \{(x_i, y_i)\}_{i=1}^{n}$, where the $x_i$ are the vectors associated to the data points and the $y_i$ their respective labels. The second is the model $f(x_i; w)$, i.e. a function $f : x_i \mapsto \hat{y}_i$ of the parameters $w$, used to predict an output vector ŷi from a vector of input variables. The final ingredient is the loss function L(yi, f(xi; w)) that allows us to judge how well the model performs on the observations yi. The model is fit by finding the value of w that minimizes the loss function. For example, one commonly used loss function is the mean squared error. Minimizing the squared error loss function is known as the method of least squares, and is typically appropriate for experiments with Gaussian measurement errors.

ML researchers and data scientists follow a standard recipe to obtain models that are useful for prediction problems. The first step in the analysis is to randomly divide the dataset D into two mutually exclusive groups Dtrain and Dtest, called the training set and the test set. The fact that this must be the first step should be heavily emphasized: performing some analysis (such as using the data to select important variables) before partitioning the data is a common pitfall that can lead to incorrect conclusions. Typically, the majority of the data are partitioned into the training set (e.g. 90%), with the remainder going into the test set. The model is fit by minimizing the loss function using only the data in the training set

$\hat{w} = \arg\min_w \{ L(y_i^{\text{train}}, f(x_i^{\text{train}}; w)) \}.$ (2.1)

Finally, the performance of the model is evaluated by computing the loss function using the test set, $L(y_i^{\text{test}}, f(x_i^{\text{test}}; \hat{w}))$. The performance on unseen data, i.e. the test set, is known as the generalization capability of the model. Splitting the data into mutually exclusive training and test sets provides an unbiased estimate for the predictive performance of the model: this is known as cross-validation in the ML and statistics literature.

Fig. 2.1 A simple flowchart representing the supervised learning approach.
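As an illustration of the first step of this recipe, the short sketch below performs a 90%/10% split of a toy dataset into mutually exclusive training and test sets. It uses scikit-learn’s train_test_split purely for convenience, and the random arrays stand in for D; it is not meant to reproduce the thesis’ own code.

    import numpy as np
    from sklearn.model_selection import train_test_split

    # Toy stand-in for a dataset D = {(x_i, y_i)}: 1000 points with 784 features each.
    X = np.random.rand(1000, 784)
    y = np.random.randint(0, 2, size=1000)

    # First step of the recipe: random split into training (90%) and test (10%) sets.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=0)
    print(X_train.shape, X_test.shape)   # (900, 784) (100, 784)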
  • 13. 2.1. Machine Learning 7 2.1.2 The Gradient Descent Optimization Algorithm

We shall now discuss the method we used for performing the minimization of the loss function L: the Gradient Descent (GD) algorithm. The basic idea behind this method is straightforward: iteratively adjust the parameters w in the direction where the gradient of the loss function w.r.t. the parameters is large and negative. In this way, the training procedure ensures the parameters flow towards a local minimum of the loss function. We first initialize the parameters to some value w0 and iteratively update them according to the equation

$v_t = \eta_t \nabla_w L(w_t), \qquad w_{t+1} = w_t - v_t,$ (2.2)

where ∇wL(w) is the gradient of L(w) w.r.t. w and we have introduced a learning rate, ηt, that controls how big a step we should take in the direction of the gradient at time step t. It is clear that for a sufficiently small choice of the learning rate ηt this method will converge to a local minimum (in all directions) of the loss function. However, choosing a small ηt comes at a huge computational cost: the smaller ηt, the more steps we have to take to reach the local minimum. In contrast, if ηt is too large, we can overshoot the minimum and the algorithm becomes unstable (it either oscillates or even moves away from the minimum). This is shown in Fig. 2.2.

Fig. 2.2 Gradient descent exhibits three qualitatively different regimes as a function of the learning rate. Result of gradient descent on the surface $z = x^2 + y^2 - 1$ for learning rates η = 0.1, 0.5, 1.01. Notice that the trajectory converges to the global minimum in multiple steps for small learning rates (η = 0.1). Increasing the learning rate further (η = 0.5) causes the trajectory to oscillate around the global minimum before converging. For even larger learning rates (η = 1.01) the trajectory diverges from the minimum.
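The following is a minimal sketch of the update rule of Eq. (2.2), applied to the paraboloid used in Fig. 2.2; the function names, starting point and step counts are purely illustrative.

    import numpy as np

    def grad(w):
        # Analytic gradient of the surface z = w_x^2 + w_y^2 - 1 from Fig. 2.2
        return np.array([2.0 * w[0], 2.0 * w[1]])

    def gradient_descent(w0, eta, n_steps=100):
        """Plain (full-batch) gradient descent, Eq. (2.2): w_{t+1} = w_t - eta * grad L(w_t)."""
        w = np.asarray(w0, dtype=float)
        for _ in range(n_steps):
            w = w - eta * grad(w)
        return w

    print(gradient_descent([2.0, -3.0], eta=0.1))    # converges towards the minimum at (0, 0)
    print(gradient_descent([2.0, -3.0], eta=1.01))   # diverges, as in the eta = 1.01 regime of Fig. 2.2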
  • 14. 2.2. Artificial Neural Networks and Deep Learning 8 2.2 Artificial Neural Networks and Deep Learning

Artificial Neural Networks (ANNs), usually simply called Neural Networks (NNs) or neural nets, are non-linear models for supervised learning inspired by the biological neural networks that constitute animal brains. An ANN is based on a collection of connected units or nodes called artificial neurons, which loosely model the neurons in a biological brain. Each connection, like the synapses in a biological brain, can transmit a signal to other neurons. An artificial neuron receives a signal, processes it, and can signal neurons connected to it. The “signal” at a connection is a real number, and the output of each neuron is computed by some non-linear function of the sum of its inputs. The connections are called edges. Neurons and edges typically have a weight that adjusts as learning proceeds. The weight increases or decreases the strength of the signal at a connection. Typically, neurons are aggregated into layers. Different layers may perform different transformations on their inputs. Signals travel from the first layer to the last layer, possibly after traversing one or more intermediate layers.

Over the last decade, neural networks have emerged as one of the most powerful and widely-used supervised learning techniques. Deep Neural Networks (DNNs), that is NNs containing multiple intermediate layers, have a long history, but re-emerged to prominence after a rebranding as Deep Learning in the mid 2000s. DNNs truly caught the attention of the wider ML community and industry in 2012, when Alex Krizhevsky, Ilya Sutskever, and Geoff Hinton used a GPU-based DNN model (AlexNet) to lower the error rate on the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) by an incredible twelve percentage points, from 28% to 16% [15]. Since then, DNNs have become the workhorse technique for many image and speech recognition based ML tasks. The large-scale industrial deployment of DNNs has given rise to many high-level libraries and packages (TensorFlow, Keras, PyTorch, etc.) that make it easy to quickly code and deploy DNNs.

Fig. 2.3 Neuron and myelinated axon, with signal flow from inputs at dendrites to outputs at axon terminals.
  • 15. 2.2. Artificial Neural Networks and Deep Learning 9 2.2.1 Artificial Neurons

The basic unit of a neural network is a stylized “neuron” i that takes a vector of d input features x = (x1, x2, ..., xd) and produces a scalar output ai(x). A neural net consists of many such neurons stacked into layers, with the output of one layer serving as the input for the next. The first layer in the neural net is called the input layer, the intermediate layers are commonly known as “hidden layers”, and the final layer is called the output layer. The exact function ai varies depending on the type of non-linearity used in the NN. However, in essentially all cases ai can be decomposed into a linear operation that weights the relative importance of the various inputs, and a non-linear transformation fi(z), called activation function, which is usually the same for all neurons. The linear transformation takes the form of a dot product with a set of neuron-specific weights $w^{(i)} = (w^{(i)}_1, w^{(i)}_2, \dots, w^{(i)}_d)$, followed by re-centering with a neuron-specific bias $b^{(i)}$:

$z^{(i)} = w^{(i)} \cdot x + b^{(i)} = \tilde{x}^{T} \cdot \tilde{w}^{(i)},$ (2.3)

where $\tilde{x} = (1, x)$ and $\tilde{w}^{(i)} = (b^{(i)}, w^{(i)})$ are the augmented input and weight vectors. In terms of $z^{(i)}$ and the activation function fi(z), we can write the full input-output function as

$a_i(x) = f_i(z^{(i)}),$ (2.4)

see Figure 2.4.

Fig. 2.4 Sketch of an artificial neuron, consisting of a linear transformation that weights the importance of various inputs, followed by a non-linear activation function.

Different choices of activation functions lead to different properties for neurons. The underlying reason for this is that we train NNs using gradient descent (see Subsec. 2.1.2), which requires us to take derivatives of the neural input-output function with respect to the weights $w^{(i)}$ and the bias $b^{(i)}$.
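A single neuron of Eqs. (2.3)-(2.4) can be sketched in a few lines; the weights, bias and input below are toy values, and tanh is used only as one possible choice of activation function.

    import numpy as np

    def neuron(x, w, b, f=np.tanh):
        """One artificial neuron: linear pre-activation z = w . x + b (Eq. 2.3), then a = f(z) (Eq. 2.4)."""
        z = np.dot(w, x) + b
        return f(z)

    x = np.array([0.5, -1.0, 2.0])    # d = 3 input features
    w = np.array([0.1, 0.4, -0.2])    # neuron-specific weights
    b = 0.05                          # neuron-specific bias
    print(neuron(x, w, b))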
  • 16. 2.2. Artificial Neural Networks and Deep Learning 10 2.2.2 The Feed-Forward Architecture

The basic idea of all neural networks is to layer neurons in a hierarchical fashion, the general structure of which is known as the network architecture (see Fig. 2.5). A feed-forward neural network (FNN) is an ANN wherein connections between the nodes do not form a cycle. In the simplest feed-forward networks, each neuron in the input layer takes the inputs x and produces an output ai(x) that depends on its current weights, see Eq. (2.4). The outputs of the input layer are then treated as the inputs to the next hidden layer. This is usually repeated several times until one reaches the top or output layer. Thus, the whole NN can be thought of as a complicated non-linear transformation of the inputs x into an output ŷ that depends on the weights and biases of all the neurons in the input, hidden, and output layers.

The use of hidden layers greatly expands the representational power of a neural network when compared with a simple mono-layer network. Perhaps the most formal expression of the increased representational power of neural nets (also called the expressivity) is the universal approximation theorem, which states that a neural network with a single hidden layer can approximate any continuous, multi-input/multi-output function with arbitrary accuracy. Modern neural networks generally contain multiple hidden layers. There are many ideas of why such deep architectures are favorable for learning. Increasing the number of layers increases the number of parameters and hence the representational power of neural networks. Adding hidden layers is also thought to allow neural nets to learn more complex features from the data. Choosing the exact network architecture remains an art that requires extensive numerical experimentation and intuition, and is often problem-specific. Both the number of hidden layers and the number of neurons in each layer can affect the performance of an NN.

Fig. 2.5 A simple feed-forward neural network, with two hidden layers between the input layer and the output layer.
  • 17. 2.2. Artificial Neural Networks and Deep Learning 11 2.2.3 Training Neural Networks

The basic procedure for training neural networks is the same as we described in Subsec. 2.1.1 for training simpler supervised learning algorithms: construct a loss function and then use gradient descent to minimize the loss function and find the optimal weights and biases. Neural networks differ from these simpler supervised procedures in that generally they contain multiple hidden layers that make taking the gradient computationally more difficult. First of all, we shall introduce some preliminary ML concepts that will be helpful in the description of the actual neural net training procedure: hyperparameters and classifier functions.

In machine learning, a hyperparameter is a parameter whose value is used to control the learning process. By contrast, the values of other parameters - such as the weights in neural networks - are derived via training. Hyperparameters can be classified as model hyperparameters, which cannot be inferred while fitting the machine to the training set because they refer to the model selection task, or algorithm hyperparameters, which in principle have no influence on the performance of the model but affect the speed and quality of the learning process. An example of a model hyperparameter is the topology and size of a neural network. Examples of algorithm hyperparameters are
• the learning rate ηt, which we introduced in Subsec. 2.1.2 while describing the gradient descent algorithm;
• the batch size, that is the number of training samples to process before we update the model parameters (i.e. the size of the full training dataset Dtrain for full-batch GD);
• the number of epochs, which is the number of times that the learning algorithm will work through the entire training dataset Dtrain.
There are no exact rules for how to configure hyperparameters. One must try different values and see what works best for each specific problem.

Now that we have explained what hyperparameters are, we will illustrate two of the most well-known and widely-used classifier functions: the logistic function and the softmax function. The standard logistic function is a common sigmoid function2 defined by the equation

$\sigma(z) := \frac{1}{1 + e^{-z}} \in (0, 1).$ (2.5)

It is often used as the last activation function of a neural network to clamp signals to within the (0, 1) interval, thus normalizing the outputs to probabilities. In practice, due to the nature of the exponential function e−z, it is often sufficient to compute the standard logistic function for z over a small range of real numbers, as it quickly converges very close to its saturation values of 0 and 1. The logistic function has the symmetry property that 1 − σ(z) = σ(−z).

2 A sigmoid function is a function having a characteristic “S”-shaped curve or sigmoid curve.
  • 18. 2.2. Artificial Neural Networks and Deep Learning 12 Fig. 2.6 A plot of the standard logistic function.

The softmax function, also known as softargmax, is a generalization of the logistic function to multiple dimensions. The softmax function takes as input a vector z of K real numbers, and normalizes it into a probability distribution consisting of K probabilities proportional to the exponentials of the input numbers. That is, prior to applying softmax, some vector components could be negative, or greater than one, but after applying softmax, each component will be in the interval (0, 1), and the components will add up to 1, so that they can be interpreted as probabilities. Furthermore, the larger input components will correspond to larger probabilities. The standard (unit) softmax function $\sigma : \mathbb{R}^K \to (0, 1)^K$ is defined, when K is greater than one, by the formula

$\sigma_i(z) := \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}} \quad \text{for } i = 1, \dots, K \text{ and } z = (z_1, \dots, z_K) \in \mathbb{R}^K.$ (2.6)

In simple words, it applies the standard exponential function to each element zi of the input vector z and normalizes these values by dividing by the sum of all these exponentials; this normalization ensures that the sum of the components of the output vector σ(z) is 1.

We can now move to the actual training procedure. The first thing one must do to train a neural network is define the general network architecture. Given a dataset $D = \{(x_i, y_i)\}_{i=1}^{n}$, $x_i \in \mathbb{R}^d$, the number of neurons in the input layer of our neural network must be equal to the dimension d of the data points. For categorical data, the labels yi can take on C values, so that yi ∈ {0, 1, ..., C − 1}: the size of the output layer must be C accordingly. On the other hand, the number of hidden layers and their sizes are not constrained, and thus become model hyperparameters.
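A direct implementation of Eq. (2.6) is shown below; subtracting the maximum of z before exponentiating is a standard numerical-stability trick that leaves the result unchanged, and is an addition of ours rather than part of the definition.

    import numpy as np

    def softmax(z):
        """Softmax of Eq. (2.6): exponentiate each component and normalize by the sum."""
        e = np.exp(z - np.max(z))   # shift for numerical stability
        return e / e.sum()

    z = np.array([2.0, -1.0, 0.5])
    p = softmax(z)
    print(p, p.sum())   # components in (0, 1), summing to 1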
  • 19. 2.2. Artificial Neural Networks and Deep Learning 13

The neural net makes a prediction ŷi(w) ∈ RC for each data point xi, where w are the parameters of the neural network. Each component ŷi(c+1)(w) of the prediction ŷi(w) is the activation value relative to class c. The predicted class pi for data point xi is then the one corresponding to the greatest among all the activation values

$p_i = \arg\max_c \{\hat{y}_{i(c+1)}(w)\}.$ (2.7)

The data point is considered correctly classified if the predicted class pi coincides with its label yi. We will now define two quantities for measuring the performance of a neural network on the training set Dtrain and the test set Dtest: the training error Etrain and the test error Etest, respectively. For n data points drawn from the training set Dtrain, we define the training error as

$E_{\text{train}} := \frac{1}{n} \sum_{i=1}^{n} \left( 1 - \delta_{p_i y_i^{\text{train}}} \right) \in [0, 1],$ (2.8)

where $\delta_{p_i y_i^{\text{train}}}$ is the Kronecker delta3 of pi and yi train. If we were to draw n data points from the test set Dtest, we would get the test error instead

$E_{\text{test}} := \frac{1}{n} \sum_{i=1}^{n} \left( 1 - \delta_{p_i y_i^{\text{test}}} \right) \in [0, 1].$ (2.9)

The training and test errors are determined after each training epoch, when the model parameters are fixed and no learning is taking place, and they simply represent the fraction of misclassifications on the training set and test set respectively. The test error is especially important, as it provides us with a measure of the generalization capability of the model. One of the most important observations we can make is that the test error is almost always greater than the training error, i.e. Etest ≥ Etrain.

Like all supervised learning procedures, we must then specify a loss function L. For categorical data, the most commonly used loss function is the cross-entropy, since the output layer is often taken to be a softmax classifier. For each data point i, we define a ‘one-hot’ vector yic, such that

$y_{ic} := \begin{cases} 1, & \text{if } y_i = c \\ 0, & \text{otherwise} \end{cases}.$ (2.10)

We can also define the probability that the neural net assigns the data point to category c as the component ŷi(c+1)(w) of the prediction ŷi(w)

$\hat{y}_{i(c+1)}(w) := P(y_i = c \,|\, x_i; w).$ (2.11)

3 The Kronecker delta is a function of two variables, usually just non-negative integers. The function is 1 if the variables are equal, and 0 otherwise: $\delta_{ij} = \begin{cases} 0, & \text{if } i \neq j \\ 1, & \text{if } i = j \end{cases}.$
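A compact way to evaluate Eqs. (2.7)-(2.9) on a batch of outputs is sketched below; the arrays are toy values and the function name is ours.

    import numpy as np

    def classification_error(y_hat, y):
        """Fraction of misclassified points, Eqs. (2.8)-(2.9).
        y_hat: (n, C) array of output activations; y: (n,) array of integer labels."""
        predictions = np.argmax(y_hat, axis=1)   # predicted classes, Eq. (2.7)
        return np.mean(predictions != y)         # average of 1 - Kronecker delta

    y_hat = np.array([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]])
    y = np.array([0, 1, 1])
    print(classification_error(y_hat, y))   # 1/3: the third point is misclassified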
  • 20. 2.2. Artificial Neural Networks and Deep Learning 14

The categorical cross-entropy between the true labels yi ∈ {0, 1, ..., C − 1} and the components ŷi(c+1)(w) of the prediction ŷi(w) is defined as

$L_{\text{CE}}(w) := - \sum_{i=1}^{n} \sum_{c=0}^{C-1} \left[ y_{ic} \log \hat{y}_{i(c+1)}(w) + (1 - y_{ic}) \log \left( 1 - \hat{y}_{i(c+1)}(w) \right) \right].$ (2.12)

Fig. 2.7 A plot of the cross-entropy loss as a function of the predicted probability for the true label yi. The loss rapidly increases as the predicted probability for the actual label moves away from 1.

Having defined an architecture and a loss function, we must now train the model. Similar to other supervised learning methods, we make use of the gradient descent method of Subsec. 2.1.2 to optimize the loss function. Recall that the basic idea of gradient descent is to update the parameters w to move in the direction opposite to the gradient of the loss function ∇wL(w). Calculating the gradients for a neural network requires a specialized algorithm, called backpropagation (often abbreviated backprop), which forms the heart of any neural network training procedure. A brute force calculation is out of the question, since it requires us to calculate as many gradients as parameters at each step of the gradient descent. The backpropagation algorithm (Rumelhart and Zipser, 1985 [21]) is a clever procedure that exploits the layered structure of neural networks to compute gradients more efficiently.
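The sketch below evaluates the loss exactly as written in Eq. (2.12), including the (1 − yic) term; the small eps clip, added here only to avoid log(0), is our own precaution and not part of the definition.

    import numpy as np

    def cross_entropy(y_hat, y_onehot, eps=1e-12):
        """Cross-entropy of Eq. (2.12); y_hat holds predicted class probabilities in (0, 1),
        y_onehot the one-hot labels of Eq. (2.10)."""
        y_hat = np.clip(y_hat, eps, 1.0 - eps)
        return -np.sum(y_onehot * np.log(y_hat) + (1.0 - y_onehot) * np.log(1.0 - y_hat))

    y_hat = np.array([[0.9, 0.1], [0.3, 0.7]])   # toy softmax outputs for two data points
    y_onehot = np.array([[1, 0], [0, 1]])        # their one-hot labels
    print(cross_entropy(y_hat, y_onehot))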
  • 21. 2.2. Artificial Neural Networks and Deep Learning 15 2.2.4 The Backpropagation Algorithm

At its core, backpropagation is simply the chain rule4 for partial differentiation, and can be summarized using four equations. In order to see this, we must first establish some useful notation. We will assume that there are L layers in our network, with l = 1, ..., L indexing the layer. Denote by $w^{l}_{jk}$ the weight for the connection from the k-th neuron in layer l − 1 to the j-th neuron in layer l. We denote the bias of this neuron by $b^{l}_{j}$. By construction, in a feed-forward neural network the activation $a^{l}_{j}$ of the j-th neuron in the l-th layer can be related to the activities of the neurons in the layer l − 1 by the equation

$a^{l}_{j} = f\!\left( \sum_{k} w^{l}_{jk} a^{l-1}_{k} + b^{l}_{j} \right) = f(z^{l}_{j}),$ (2.13)

where we have defined the linear weighted sum

$z^{l}_{j} = \sum_{k} w^{l}_{jk} a^{l-1}_{k} + b^{l}_{j}.$ (2.14)

By definition, the loss function L depends directly on the activities of the output layer $a^{L}_{j}$. It of course also indirectly depends on all the activities of neurons in lower layers in the neural network, through iteration of Eq. (2.13). Let us define the error $\Delta^{L}_{j}$ of the j-th neuron in the L-th layer as the change in the loss function w.r.t. the weighted input $z^{L}_{j}$

$\Delta^{L}_{j} = \frac{\partial L}{\partial z^{L}_{j}}.$ (2.15)

This definition is the first of the four backpropagation equations. We can analogously define the error of neuron j in layer l, $\Delta^{l}_{j}$, as the change in the loss function w.r.t. the weighted input $z^{l}_{j}$:

$\Delta^{l}_{j} = \frac{\partial L}{\partial z^{l}_{j}} = \frac{\partial L}{\partial a^{l}_{j}} f'(z^{l}_{j}),$ (I)

where f′(x) denotes the derivative of the non-linearity f(·) w.r.t. its input, evaluated at x. Notice that the error function $\Delta^{l}_{j}$ can also be seen as the partial derivative of the loss function w.r.t. the bias $b^{l}_{j}$, since

$\Delta^{l}_{j} = \frac{\partial L}{\partial z^{l}_{j}} = \frac{\partial L}{\partial b^{l}_{j}} \frac{\partial b^{l}_{j}}{\partial z^{l}_{j}} = \frac{\partial L}{\partial b^{l}_{j}},$ (II)

where in the last line we have used the fact that $\partial b^{l}_{j} / \partial z^{l}_{j} = 1$, cf. Eq. (2.14). This is the second of the four backpropagation equations.

4 The chain rule is a formula that expresses the derivative of the composition of two differentiable functions f and g in terms of the derivatives of f and g. More precisely, if h = f ◦ g is the function such that h(x) = f(g(x)) for every x, then the chain rule is, in Lagrange’s notation, h′(x) = f′(g(x)) g′(x) or, equivalently, h′ = (f ◦ g)′ = (f′ ◦ g) · g′.
  • 22. 2.2. Artificial Neural Networks and Deep Learning 16

We now derive the final two backpropagation equations using the chain rule. Since the error depends on neurons in layer l only through the activation of neurons in the subsequent layer l + 1, we can use the chain rule to write

$\Delta^{l}_{j} = \frac{\partial L}{\partial z^{l}_{j}} = \sum_{k} \frac{\partial L}{\partial z^{l+1}_{k}} \frac{\partial z^{l+1}_{k}}{\partial z^{l}_{j}} = \sum_{k} \Delta^{l+1}_{k} \frac{\partial z^{l+1}_{k}}{\partial z^{l}_{j}} = \left( \sum_{k} \Delta^{l+1}_{k} w^{l+1}_{kj} \right) f'(z^{l}_{j}).$ (III)

This is the third backpropagation equation. The final equation can be derived by differentiating the loss function w.r.t. the weight $w^{l}_{jk}$ as

$\frac{\partial L}{\partial w^{l}_{jk}} = \frac{\partial L}{\partial z^{l}_{j}} \frac{\partial z^{l}_{j}}{\partial w^{l}_{jk}} = \Delta^{l}_{j} a^{l-1}_{k}.$ (IV)

Together, Eqs. (I), (II), (III), and (IV) define the four backpropagation equations relating the gradients of the activations of the various neurons $a^{l}_{j}$, the weighted inputs $z^{l}_{j}$ and the errors $\Delta^{l}_{j}$. These equations can be combined into a simple, computationally efficient algorithm to calculate the gradient w.r.t. all parameters.

The Backpropagation Algorithm
1. Activation at input layer: calculate the activations $a^{1}_{j}$ of all the neurons in the input layer.
2. Feed-forward: starting with the first layer, exploit the feed-forward architecture through Eq. (2.13) to compute $z^{l}$ and $a^{l}$ for each subsequent layer.
3. Error at top layer: calculate the error of the top layer using Eq. (I). This requires knowing the expression for the derivative of both the loss function L(w) = L(aL) and the activation function f(z).
4. “Backpropagate” the error: use Eq. (III) to propagate the error backwards and calculate $\Delta^{l}_{j}$ for all layers.
5. Calculate gradient: use Eqs. (II) and (IV) to calculate $\partial L / \partial b^{l}_{j}$ and $\partial L / \partial w^{l}_{jk}$.

We can now see where the name backpropagation comes from. The algorithm consists of a forward pass from the bottom layer to the top layer, where one calculates the weighted inputs and activations of all the neurons. One then backpropagates the error starting with the top layer down to the input layer and uses these errors to calculate the desired gradients. This description makes clear the incredible utility and computational efficiency of the backpropagation algorithm. We can calculate all the derivatives using a single “forward” and “backward” pass of the neural network. This computational efficiency is crucial, since we must calculate the gradient with respect to all parameters of the neural net at each step of gradient descent. These basic ideas also underlie almost all modern automatic differentiation packages - such as the one we used, the torch.autograd package from the PyTorch framework for ML in Python.
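As a concrete (and deliberately tiny) sketch of steps 1-5, the snippet below runs one forward pass, one backward pass and one gradient-descent update for a two-layer tanh network. A mean-squared-error loss is used here only because its top-layer error is simple to write down; all sizes and values are toy ones.

    import numpy as np

    rng = np.random.default_rng(0)
    d, N, C = 4, 5, 2                                   # toy input size, hidden width, output size
    W1, b1 = 0.1 * rng.normal(size=(N, d)), np.zeros(N)
    W2, b2 = 0.1 * rng.normal(size=(C, N)), np.zeros(C)
    x, y = rng.normal(size=d), np.array([1.0, 0.0])     # toy input and target

    # Steps 1-2: feed-forward pass, Eqs. (2.13)-(2.14)
    z1 = W1 @ x + b1; a1 = np.tanh(z1)
    z2 = W2 @ a1 + b2; a2 = np.tanh(z2)

    # Step 3: error at the top layer, Eq. (I), with L = 0.5 * ||a2 - y||^2
    delta2 = (a2 - y) * (1.0 - np.tanh(z2) ** 2)
    # Step 4: backpropagate the error, Eq. (III)
    delta1 = (W2.T @ delta2) * (1.0 - np.tanh(z1) ** 2)

    # Step 5: gradients, Eqs. (II) and (IV), then one gradient-descent step, Eq. (2.2)
    eta = 0.1
    W2 -= eta * np.outer(delta2, a1); b2 -= eta * delta2
    W1 -= eta * np.outer(delta1, x);  b1 -= eta * delta1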
  • 23. 2.3. Structure in Data 17 2.3 Structure in Data 2.3.1 The Object Manifold Model As we mentioned in Chap. 1, many theoretical investigations on neural networks make the simplifying assumption of considering input data points as unstructured data, i.e. uncorrelated data points, distributed randomly in a state space5 . However, in the majority of real-life datasets, data points with similar features are found to be closer together when compared to data points exhibiting considerable differences in their characteristics. For instance, in a set of pictures portraying both cats and dogs, each picture can be represented as a distinct point in a state space; a neural network will generally be able to distinguish cats from dogs and classify them correctly, but at the same time, it will have the tendency to similarly identify dogs of the same breed, thus grouping their representations in the neural state spaces of the inner network layers closer together, see Fig. 2.8. This interpretation of structure in data is commonly known as the object manifold model, and has already been the geometrical framework to many recent machine learning publications, such as the ones by Haim Sompolinsky et al. [9][2][6], to which our investigation owes its general approach, based on the numerical analysis of few, but crucial geometric observables during the training process of a neural network. Fig. 2.8 Unstructured data (left) vs. data with an object manifold structure (right) [16] The visual hierarchy of the brain has a remarkable ability to identify objects despite differences in appearance due to changes in variables such as orientation, position, pose, lighting and background [8]. Recent research in machine learning has shown that deep neural networks can perform invariant object categorization with almost human-level accuracy [1], and that their network representations are similar to the brain’s. [14][25][13]. DNNs are therefore very important as models of visual hierarchy, though understanding their operational capabilities and design principles remain a significant challenge. 5 A state space is the set of all possible configurations of a system.
  • 24. 2.3. Structure in Data 18 To conceptualize object manifolds, consider a set of N neurons responding to a specific visual signal associated with an object. The neural population response to that stimulus is a vector in RN . Changes in the physical parameters of the input stimulus that do not change the object identity modulate the neural state vector. The set of all state vectors corresponding to responses to all possible stimuli associated with the same object can be viewed as a manifold in the neural state space. In this perspective, object recognition is equivalent to the task of discriminating manifolds of different objects from each other. As signals propagate from one processing stage to the next in the visual hierarchy, the geometry of the manifolds is reformatted so that they become “untangled,” namely they are more easily separated by a biologically plausible decoder, modeled as a hyperplane6 [7], as illustrated in Fig. 2.9. Fig. 2.9 Illustration of three layers in a visual hierarchy where the neural population response of the first layer is mapped into intermediate layer by F1 and into the last layer by F2 (top). The transformation of per-stimuli responses is associated with changes in the geometry of the object manifold, the collection of responses to stimuli of the same object (colored blue for a ‘dog’ manifold and pink for a ‘cat’ manifold). Changes in geometry may result in transforming object manifolds which are not linearly separable (in the first and intermediate layers) into separable ones in the last layer (separating hyperplane, colored orange) [6]. The object manifold structure can have a positive impact on the learning ability of a neural network, because classifying data points uniformly distributed in the state space is surely more complicated than classifying data points that are clustered. On the other hand, this makes it harder to develop an analytic description of the subject. 6 A hyperplane of an n-dimensional space V is a subspace of dimension n − 1, or equivalently, of codimension 1 in V .
  • 25. 2.3. Structure in Data 19 2.3.2 Geometrical Observables

We tried to capture the structure within neural representations by introducing two independent geometric observables, namely the radius of gyration $R_M$, measuring the average spread of data points belonging to the same manifold around its centre, and the centre-to-centre distance $d_{\text{ctc}}$, quantifying the distance between the centres of two different manifolds. Measuring these two quantities at each training epoch, one can essentially keep track of the intra-manifold and inter-manifold dynamics during the entire training process, using a discrete time7 mathematical dynamics approach. We first define the (normalized) centre of a manifold M at a given time t as

$x_M(t) := \frac{1}{n_M} \sum_{x \in M(t)} \frac{x}{\|x\|},$ (2.16)

where $n_M = |M|$ is the number of elements that belong to manifold M. The normalization of each data point x ensures that the centre is located within the unit ball centered at the origin.

Definition 2.3.1 (radius of gyration) The radius of gyration of a manifold M at a given time t is defined as

$R_M(t) := \sqrt{ \frac{1}{n_M} \sum_{x \in M(t)} \left\| \frac{x}{\|x\|} - x_M(t) \right\|^2 } \in [0, 1].$ (2.17)

Definition 2.3.2 (centre-to-centre distance) The distance between the centres of two manifolds M and N at a given time t is defined as

$d_{\text{ctc}}(t) := \| x_M(t) - x_N(t) \| \in [0, 2].$ (2.18)

The radius of gyration and the centre-to-centre distance are quantities that share the same associated dimension, the one of the data points x. Since the problem of linear separation is invariant under rescaling by a positive factor, we decided to introduce one last geometric observable: the dimensionless radius of gyration $\hat{R}_M$, defined as

$\hat{R}_M(t) := \frac{R_M(t)}{d_{\text{ctc}}(t)} \in [0, \infty).$ (2.19)

While $\hat{R}_M$ has the obvious advantage of being a dimensionless quantity, we should note that this rescaling of $R_M$ removes its upper bound, as $\lim_{d_{\text{ctc}} \to 0} R_M / d_{\text{ctc}} = \infty$.

7 The discrete time framework views values of variables as occurring at distinct, separate “points in time”, or equivalently as being unchanged throughout each non-zero region of time (“time period”) - that is, time is viewed as a discrete variable.
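The observables of Eqs. (2.16)-(2.19) translate directly into a few NumPy functions; the random activations below merely stand in for the hidden-layer responses of two manifolds.

    import numpy as np

    def manifold_centre(X):
        """Normalized centre of a manifold, Eq. (2.16); X has one data point per row."""
        Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
        return Xn.mean(axis=0)

    def radius_of_gyration(X):
        """RMS spread of the normalized points around the centre, Eq. (2.17)."""
        Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
        return np.sqrt(np.mean(np.sum((Xn - Xn.mean(axis=0)) ** 2, axis=1)))

    def centre_to_centre(X_M, X_N):
        """Distance between the centres of two manifolds, Eq. (2.18)."""
        return np.linalg.norm(manifold_centre(X_M) - manifold_centre(X_N))

    rng = np.random.default_rng(0)
    A = rng.normal(1.0, 0.3, size=(100, 10))    # toy 10-dimensional manifold of one class
    B = rng.normal(-1.0, 0.3, size=(100, 10))   # toy manifold of the other class
    print(radius_of_gyration(A), centre_to_centre(A, B))
    print(radius_of_gyration(A) / centre_to_centre(A, B))   # dimensionless radius, Eq. (2.19)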
  • 26. Chapter 3 Tools and Techniques Employed 3.1 Overview of the Datasets 3.1.1 MNIST MNIST (Modified National Institute of Standards and Technology) is a large dataset of handwritten digits that is widely used for training and testing in the field of machine learning. It was created by Yann LeCun et al. in 1998 by “re-mixing” the samples from NIST’s original datasets [17]. The creators felt that since NIST’s training dataset was taken from American Census Bureau employees, while the testing dataset was taken from American high school students, it was not well-suited for machine learning experiments. Furthermore, the black and white images from NIST were normalized to fit into a 28 × 28 pixel bounding box and anti-aliased, which introduced grayscale1 levels. Fig. 3.1 Sample images from MNIST test dataset. 1 The grayscale intensity is stored as an 8-bit integer giving 256 possible different shades of gray from black to white.
  • 27. 3.1. Overview of the Datasets 21 MNIST contains a total of 70000 images, divided into 60,000 training images and 10,000 testing images, all labeled with their respective digit, which represents the ground truth2 . Each image is a 28 × 28 matrix in which every entry aij (i.e. a pixel) is an integer in the range [0, 255] corresponding to its grayscale intensity, 0 being white and 255 being black, see Fig. 3.2. To feed an image to a neural network, we must vectorize3 its 28 × 28 matrix into a 784-dimensional vector, which is directly compatible with an input layer composed of 784 neurons. Fig. 3.2 Sample image of an “eight” digit from MNIST in its 28 × 28 matrix form. MNIST is the de facto “Hello, world!” dataset of computer vision. Since its release, this classic dataset has served as the basis for benchmarking classification algorithms. As new machine learning techniques emerge, MNIST remains a reliable resource for researchers and learners alike. The main reason behind its success is the simplicity of neural networks architectures needed to perform object classification with great accuracy. On this basis, we decided to focus our exploration mainly on the MNIST dataset. 2 Ground truth is information that is known to be real or true, provided by direct observation and measurement (i.e. empirical evidence) as opposed to information provided by inference. 3 The vectorization of a matrix is a linear transformation which converts the matrix into a column vector. Specifically, the vectorization of a m × n matrix A, denoted vec(A), is the mn × 1 column vector obtained by stacking the columns of the matrix A on top of one another.
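As an illustration of this vectorization step, the sketch below loads MNIST through torchvision (assuming that package is installed alongside PyTorch) and flattens one image into a 784-dimensional vector; it is not necessarily how the data were loaded in this work.

    import torch
    from torchvision import datasets, transforms

    # Download MNIST and convert images to tensors with pixel values rescaled to [0, 1].
    mnist_train = datasets.MNIST(root="./data", train=True, download=True,
                                 transform=transforms.ToTensor())

    image, label = mnist_train[0]    # image: tensor of shape (1, 28, 28)
    x = image.view(-1)               # vectorized input of shape (784,)
    print(x.shape, label)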
  • 28. 3.1. Overview of the Datasets 22 3.1.2 EMNIST EMNIST (Extended-MNIST) is a dataset developed and released by NIST in 2017 to be the successor to MNIST [5]. While MNIST includes images of handwritten digits only, EMNIST includes all the images from NIST Special Database 19, which is a large database of handwritten uppercase and lower case letters as well as digits. The images in EMNIST were converted into the same 28 × 28 pixel format, by the same process, as were the MNIST images. Accordingly, tools which work with the older, smaller, MNIST dataset will work unmodified with EMNIST. There are six different splits provided in this dataset. A short summary of the dataset is provided below: • EMNIST ByClass: 814,255 images, 62 unbalanced classes; • EMNIST ByMerge: 814,255 images, 47 unbalanced classes; • EMNIST Balanced: 131,600 images, 47 balanced classes; • EMNIST Letters: 145,600 images, 26 balanced classes; • EMNIST Digits: 280,000 images, 10 balanced classes; • EMNIST MNIST: 70,000 images, 10 balanced classes. The full complement of the NIST Special Database 19 is available in the ByClass and ByMerge splits. The EMNIST Balanced dataset contains a set of characters with an equal number of samples per class. The EMNIST Letters dataset merges a balanced set of the uppercase and lowercase letters into a single 26-class task. The EMNIST Digits and EMNIST MNIST dataset provide balanced handwritten digit datasets directly compatible with the original MNIST dataset. In this work, we decided to examine the EMNIST Letters split only: it is the most different from MNIST, because it exclusively contains images that are not shared with it; furthermore, its 26 classes are balanced, which provides a fair ground for our object classification task. Fig. 3.3 Sample images from the “Letters” split of the EMNIST dataset.
  • 29. 3.1. Overview of the Datasets 23 3.1.3 KMNIST KMNIST (Kuzushiji-MNIST) is a drop-in replacement for the MNIST dataset (28 × 28 grayscale, 70000 images, 10 balanced classes), developed for deep learning on classical Japanese literature and provided in the original MNIST format [4]. It contains images with the first entries from the 10 main Japanese hiragana4 character groups, handwritten in cursive. Fig. 3.4 Sample images from KMNIST, with the first column showing each char- acter’s modern hiragana counterpart. 3.1.4 Fashion-MNIST Fashion-MNIST is another drop-in replacement for MNIST (28 × 28 grayscale, 70000 images, 10 balanced classes), composed of Zalando’s article of clothing images [24]. Fig. 3.5 Sample images from Fashion-MNIST. 4 Hiragana is a Japanese syllabary, part of the Japanese writing system, along with katakana as well as kanji.
  • 30. 3.2. Programming Language and Libraries 24 3.2 Programming Language and Libraries 3.2.1 Why we opted for Python Python was the natural choice for the programming language of our investigation. Benefits that make it the best fit for ML and AI-based projects include: • Simplicity and consistency: Python offers concise and readable code. While complex algorithms and versatile workflows stand behind ML and AI, Python simplicity allows developers to write reliable systems. Developers get to put all their effort into solving an ML problem instead of focusing on the technical nuances of the language. Additionally, Python is appealing to many developers as it’s easy to learn. Being a high-level5 , interpreted6 programming language, its code is understandable by humans, which makes it easier to build models for ML. Since Python is a general-purpose language, it can do a set of complex ML tasks and enable you to build prototypes quickly that allow you to test your product for ML purposes. • Easy access to many libraries for ML and AI: implementing machine learning algorithms can be tricky and requires a lot of time. It’s vital to have a well-structured and well-tested environment to enable developers to come up with the best coding solutions. To reduce development time, programmers turn to a number of Python libraries, pre-written code that simplify the implementation of different functionalities. Python, with its rich technology stack, has an extensive set of libraries for AI and ML. With these solutions, you can develop your product faster: your team won’t have to reinvent the wheel and can use an existing library to implement necessary features. 5 In computer science, a high-level programming language is a programming language with strong abstraction from the details of the computer. In contrast to low-level programming languages, it may use natural language elements, be easier to use, or may automate (or even hide entirely) significant areas of computing systems (e.g. memory management), making the process of developing a program simpler and more understandable than when using a lower-level language. 6 An interpreted language is a programming language whose implementations execute instructions directly and freely, without previously compiling a program into machine-language instructions.
  • 31. 3.2. Programming Language and Libraries 25 Some of the Python libraries we made use of are: – PyTorch and scikit-learn for machine learning; – NumPy for high-performance scientific computing and data analysis; – Matplotlib for data visualization. • Platform independence: a platform independent language lets developers implement programs on one machine and use them on another machine without any (or with only minimal) changes. One key to Python’s popularity is that it’s a platform independent language. Python is supported by many Operating Systems (OSs) including Linux, Windows, and macOS. Python code can be used to create standalone executable programs for most common OSs, which means that Python software can be easily distributed and used on those OSs without a Python interpreter. What’s more, developers usually use services such as Google Colab or Amazon SageMaker for their computing needs. However, you can often find companies and data scientists who use their own machines with powerful GPUs to train their ML models. And the fact that Python is platform independent makes this training a lot cheaper and easier. Our investigation, for instance, was conducted on a personal workstation using: – Microsoft Windows 10 as OS; – Anaconda Navigator as Graphical User Interface (GUI); – JupyterLab as (web-based) Interactive Development Environment (IDE). • A wide community: in the Developer Survey 2020 by Stack Overflow, Python was among the top 5 most popular programming languages, which ultimately means that you can easily find a development company with the necessary skill set to build your AI-based project. In the Python Developers Survey 2020, data science and ML account for over 27% of the use cases. All the aforementioned Python features add to the overall popularity of this pro- gramming language.
  • 32. 3.2. Programming Language and Libraries 26 3.2.2 The PyTorch Framework for Machine Learning

The numerical implementation of NNs is greatly facilitated by open source Python packages, such as TensorFlow, Keras, PyTorch and others. The complexity and learning curves for these packages differ, depending on the user’s level of familiarity with Python. We opted for the PyTorch open source framework, which allows for control over the inter- and intra-layer operations, without the need to introduce computational graphs, and offers a library for the automatic differentiation of tensors, already mentioned in Subsec. 2.2.4: the torch.autograd package. As we discussed above, manipulating NNs boils down to fast array multiplication and contraction operations and, therefore, the PyTorch framework and its libraries do the job of providing enough access and controllability to manipulate the linear algebra operations underlying NNs.

We will now show how automatic differentiation with torch.autograd works. Mathematically, if you have a vector-valued function f = f(x), then the gradient of f with respect to x is a Jacobian matrix7 J:

$J = \left( \frac{\partial f}{\partial x_1}, \dots, \frac{\partial f}{\partial x_n} \right) = \begin{pmatrix} \nabla^T f_1 \\ \vdots \\ \nabla^T f_m \end{pmatrix} = \begin{pmatrix} \frac{\partial f_1}{\partial x_1} & \cdots & \frac{\partial f_1}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial f_m}{\partial x_1} & \cdots & \frac{\partial f_m}{\partial x_n} \end{pmatrix}.$ (3.1)

Essentially, torch.autograd is an engine for computing vector-Jacobian products. That is, given any vector v, it computes the product $J^T \cdot v$. If v happens to be the gradient of a scalar function g = g(f),

$v = \nabla g = \left( \frac{\partial g}{\partial f_1}, \dots, \frac{\partial g}{\partial f_m} \right)^T,$ (3.2)

then by the chain rule the vector-Jacobian product would be the gradient of g with respect to x

$J^T \cdot v = \begin{pmatrix} \frac{\partial f_1}{\partial x_1} & \cdots & \frac{\partial f_m}{\partial x_1} \\ \vdots & \ddots & \vdots \\ \frac{\partial f_1}{\partial x_n} & \cdots & \frac{\partial f_m}{\partial x_n} \end{pmatrix} \begin{pmatrix} \frac{\partial g}{\partial f_1} \\ \vdots \\ \frac{\partial g}{\partial f_m} \end{pmatrix} = \begin{pmatrix} \frac{\partial g}{\partial x_1} \\ \vdots \\ \frac{\partial g}{\partial x_n} \end{pmatrix}.$ (3.3)

This characteristic of the vector-Jacobian product is what we use in the following example, in which external_grad represents v.

7 The Jacobian matrix of a vector-valued function of several variables is the matrix of all its first-order partial derivatives.
  • 33. 3.2. Programming Language and Libraries 27

We first create two tensors a and b with requires_grad=True. This signals to autograd that every operation on them should be tracked.

    import torch
    a = torch.tensor([2., 3.], requires_grad=True)
    b = torch.tensor([6., 4.], requires_grad=True)

We create another tensor Q from a and b: $Q = 3a^3 - b^2$.

    Q = 3*a**3 - b**2

Let’s assume a and b to be parameters of an NN, and Q to be the error. In NN training, we want gradients of the error w.r.t. the parameters, i.e. $\partial Q / \partial a = 9a^2$ and $\partial Q / \partial b = -2b$. When we call .backward() on Q, autograd calculates these gradients and stores them in the respective tensors’ .grad attribute. We need to explicitly pass a gradient argument in Q.backward() because Q is a vector: gradient is a tensor of the same shape as Q, and it represents the gradient of Q w.r.t. itself, i.e. $\partial Q / \partial Q = 1$.

    external_grad = torch.tensor([1., 1.])
    Q.backward(gradient=external_grad)

Equivalently, we can also aggregate Q into a scalar and call backward implicitly, like Q.sum().backward(). Gradients are now deposited in a.grad and b.grad:

    # check if collected gradients are correct
    print(9*a**2 == a.grad)
    print(-2*b == b.grad)

Output:

    tensor([True, True])
    tensor([True, True])

During the training process of a neural network, all its parameters are defined with requires_grad=True, so that every operation involving them (e.g. output and loss computation) is kept track of. Once the loss function L has been evaluated, it is sufficient to call loss.backward() to compute its derivatives w.r.t. the parameters through the backpropagation algorithm described in Subsec. 2.2.4 and save them in the w.grad attribute.
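Putting the last remark into code, the sketch below performs a single full-batch gradient-descent step with autograd; the tensors are purely illustrative stand-ins for a network parameter, a batch of inputs and their targets.

    import torch

    w = torch.randn(784, 2, requires_grad=True)   # toy parameter playing the role of w
    x = torch.randn(100, 784)                     # toy batch of vectorized inputs
    y = torch.randn(100, 2)                       # toy targets

    loss = ((x @ w - y) ** 2).mean()              # evaluate the loss L(w)
    loss.backward()                               # backpropagation: gradients land in w.grad

    eta = 0.1
    with torch.no_grad():                         # update the parameter without tracking the operation
        w -= eta * w.grad
    w.grad.zero_()                                # reset stored gradients before the next epoch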
  • 34. 3.3. Data Pre-Processing 28 3.3 Data Pre-Processing

It has been found empirically that if the original values of the data differ by orders of magnitude, training can be slowed down or impeded. This can be traced back to the vanishing and exploding gradient problem in backpropagation. To avoid such unwanted effects, we resorted to two tricks, used in succession: rescaling and standardization of the dataset. In addition to having a positive effect on the learning capacity of neural networks, standardization has also been shown to highlight the geometrical properties already present in structured datasets [3].

3.3.1 Rescaling (min-max Normalization)

As stated in Sec. 3.1, each component (feature) xij of every input vector (data point) xi is an integer in the range [0, 255]. To ensure that the weights of the NN are of a similar order of magnitude, we performed a rescaling of the dataset from the range [0, 255] to [0, 1]. The general formula for a rescaling, also known as min-max normalization, to an arbitrary range [a, b] is given as

$x_{ij} \mapsto x'_{ij} = a + \frac{(x_{ij} - \min_j\{x_{ij}\})(b - a)}{\max_j\{x_{ij}\} - \min_j\{x_{ij}\}} \in [a, b],$ (3.4)

where a and b are the min-max values. The formula for our specific min-max from [0, 255] to [0, 1] is then simply

$x_{ij} \mapsto x'_{ij} = \frac{x_{ij}}{255} \in [0, 1].$ (3.5)

3.3.2 Standardization (Z-score Normalization)

Standardization (also known as Z-score normalization) of datasets is a common requirement for many machine learning estimators; they might behave badly if the individual features do not more or less look like standard normally distributed data: Gaussian with zero mean and unit variance. In practice we often ignore the shape of the distribution and just transform the data to center them, by removing the mean value of each feature, and then scale them, by dividing non-constant features by their standard deviation:

$x_{ij} \mapsto x'_{ij} = \frac{x_{ij} - \bar{x}_j}{\sigma_j},$ (3.6)

where $\bar{x}_j$ and $\sigma_j$ are the mean and standard deviation of feature j. This standardization procedure is entirely handled by sklearn.preprocessing.scale from the scikit-learn library mentioned in Subsec. 3.2.1.
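The two normalizations can be chained in a couple of lines; the random integer matrix below stands in for the (n, 784) array of vectorized images.

    import numpy as np
    from sklearn.preprocessing import scale

    X = np.random.randint(0, 256, size=(1000, 784)).astype(np.float64)   # toy pixel data in [0, 255]

    X_rescaled = X / 255.0               # min-max rescaling of Eq. (3.5)
    X_standardized = scale(X_rescaled)   # Z-score normalization of Eq. (3.6), feature by feature

    print(X_standardized.mean(axis=0)[:3])   # approximately zero mean per feature
    print(X_standardized.std(axis=0)[:3])    # approximately unit variance per feature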
3.4 ML Task and Network Architecture

3.4.1 The Binary Classification Task

For the purpose of this work, aimed at exploring geometrical structure in data, we reduced the usual multi-class classification task of MNIST-like datasets to a simpler binary one, so as to focus on the separation of two object manifolds only, without loss of generality. We opted for the even-odd dichotomy, relabelling even-labelled elements with "0" and odd-labelled ones with "1". Given a dataset D = {(x_i, y_i)}_{i=1}^{n}, we relabel the original labels y_i ∈ {0, 1, ..., C − 1} (enumerating the C dataset classes) into the binary labels y'_i ∈ {0, 1} through

y'_i = { 0, if y_i is even; 1, if y_i is odd }.   (3.7)

3.4.2 Network Architecture

The feed-forward neural network we designed for this specific classification task is characterized by the following architecture (Fig. 3.6):

• one input layer comprised of 784 neurons;
• one hidden layer with a variable number of neurons;
• one output layer composed of 2 neurons.

The number of input neurons is given by the dimension of the input vectors (which is 784 for the images of MNIST and MNIST-like datasets, as discussed in Sec. 3.1), and the width of the output layer is fixed by the specific task we have to carry out (in other words, by the number of classes into which the data have to be classified). On the other hand, the number of neurons in the hidden layer is not constrained, and thus becomes a model hyperparameter which influences the NN performance.

Fig. 3.6 The neural network we designed for the binary classification task.
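In code, the relabelling of Eq. (3.7) reduces to a parity operation on the original label array; the array below only contains example values.

import numpy as np

y = np.array([5, 0, 4, 1, 9, 2])   # original labels (example values)
y_binary = y % 2                    # 0 for even labels, 1 for odd labels
print(y_binary)                     # [1 0 0 1 1 0]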
The results from a previous work show that increasing the width of the hidden layer makes the geometrical properties of structured data more evident, at the cost of a greater computational complexity [3]. Since the focus of this investigation is mainly a numerical analysis of the dynamics of the object manifolds in the hidden layer of the neural network as the latter undergoes the training process, we ultimately decided to fix the width of the hidden layer to N = 10, since we found that it struck a good balance between the geometrical expressivity of the network and the computational cost. A hidden layer consisting of 10 neurons means that the layer's response to each data point is a vector in R^10: as a result, we will have to compute the geometric observables introduced in Subsec. 2.3.2 for 10-dimensional object manifolds. In a 10-dimensional state space, the effects of the curse of dimensionality mentioned in Sec. 2.1 should still be mild enough for us to capture genuine, noteworthy behaviours in the dynamics of the manifolds. Fixing the width of the hidden layer also means having one less neural network hyperparameter to tune.

3.4.3 Training and Test Errors Computation

As we already discussed in Subsec. 2.2.3, once an input data point (x_i, y_i) is processed into the corresponding prediction ŷ_i by the NN, the predicted class p_i is the one corresponding to the greatest among all the activation values ŷ_{i(c+1)}(w):

p_i = { 0, if ŷ_{i1} > ŷ_{i2}; 1, otherwise }.   (3.8)

The data point is then considered correctly classified if the predicted label p_i coincides with the real label y_i. We computed the training error E_train and the test error E_test after each training epoch through Eqs. (2.8) and (2.9), respectively.

3.4.4 Choice of Activation Function

We chose the hyperbolic tangent as the activation function for every neuron of both the hidden and the output layer, mainly because of its point symmetry w.r.t. the origin: it has been shown that, while non-symmetric activation functions, such as ReLU (Rectified Linear Unit) and Swish, tend to process distinct manifolds in different ways, symmetric activation functions like the logistic function and the hyperbolic tangent do the opposite. Intuitively, this feature can be explained by the fact that symmetric functions process vectors only w.r.t. the absolute value of their components, and not w.r.t. their sign. Since, during training, the manifolds are driven away from each other, if the non-linear processing does not depend on the direction of travel (as for symmetric activation functions), we expect to find this symmetry reflected in the geometries of the manifolds. Non-symmetric activation functions, in contrast, process points moving towards negative coordinates differently from points moving towards positive ones, with consequences on their pattern.
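A possible PyTorch sketch of the 784-10-2 architecture described above, with the tanh activations of Subsec. 3.4.4 and the prediction rule of Eq. (3.8), is the following; the exact module structure and initialization used in this work may differ.

import torch
import torch.nn as nn

N_HIDDEN = 10  # width of the hidden layer (fixed in this work)

model = nn.Sequential(
    nn.Linear(784, N_HIDDEN),   # input layer -> hidden layer
    nn.Tanh(),
    nn.Linear(N_HIDDEN, 2),     # hidden layer -> output layer
    nn.Tanh(),
)

x = torch.randn(5, 784)             # a batch of 5 fake flattened images
outputs = model(x)                  # shape (5, 2), components in (-1, 1)
predicted = outputs.argmax(dim=1)   # Eq. (3.8): class with the largest activation
# The training/test error is then the fraction of predictions differing from the true labels.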
The hyperbolic tangent is a real function with domain R, defined as

tanh(z) := (e^z − e^{−z}) / (e^z + e^{−z}) ∈ (−1, 1).   (3.9)

Fig. 3.7 A plot of the hyperbolic tangent.

The hyperbolic tangent exhibits the following main properties:

1. it is bounded from above by 1 and from below by −1, and as such is a bounded function;
2. it is constrained by a pair of horizontal asymptotes as z → ±∞;
3. it is differentiable, and its first derivative is bell-shaped;
4. it is a monotonically increasing function;
5. it has exactly one non-stationary inflexion point, at z = 0;
6. it is convex for z < 0 and concave for z > 0;
7. it is an odd function, and as such tanh(−z) = −tanh(z).

Properties 1 to 6 define sigmoid functions, while property 7 is specific to tanh: the logistic function σ(z), defined by Eq. (2.5), also exhibits point symmetry but, unlike the hyperbolic tangent, it is not an odd function. We can also obtain the hyperbolic tangent from the logistic function: tanh(z) = 2σ(2z) − 1.
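The last identity follows from a one-line computation:

2σ(2z) − 1 = 2/(1 + e^{−2z}) − 1 = (1 − e^{−2z})/(1 + e^{−2z}) = (e^{z} − e^{−z})/(e^{z} + e^{−z}) = tanh(z).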
Fig. 3.8 A plot of the hyperbolic tangent and the logistic function.

It has been shown in multiple works [18][12] that the hyperbolic tangent typically performs better than the logistic function, because the former is more likely to produce outputs (which are inputs to the next layer) that are on average closer to zero, in contrast to the logistic function, whose outputs are always positive and so must have a positive mean.

3.4.5 Choice of Loss Function

As we mentioned in Subsec. 2.2.3, the most common loss function for categorical data is the cross-entropy, so it was a natural choice for our binary classification task. However, the output layer of the NNs we employed is not a softmax classifier: on the contrary, the activation function of the output layer is the hyperbolic tangent, as we just discussed in Subsec. 3.4.4, therefore the components ŷ_{i(c+1)} of the prediction ŷ_i(w) lie in the interval (−1, 1) instead of (0, 1), and they do not add up to 1. For this reason, we cannot interpret the ŷ_{i(c+1)} as probabilities. Luckily, the torch.nn.CrossEntropyLoss function from PyTorch automatically applies a softmax to the outputs before calculating the cross-entropy, so we do not necessarily have to use a softmax activation function in the output layer. The categorical cross-entropy between the binary labels y_i ∈ {0, 1} and the softmaxed outputs is then given, using the "one-hot" vector notation of Eq. (2.10), by

L_CE(w) = − Σ_{i=1}^{n} Σ_{c=0}^{1} [ y_{ic} log ŷ_{i(c+1)}(w) + (1 − y_{ic}) log(1 − ŷ_{i(c+1)}(w)) ].   (3.10)
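In practice the loss evaluation looks as follows; this is a sketch of the mechanism described above (torch.nn.CrossEntropyLoss takes the raw network outputs together with integer class labels and applies the softmax/log-softmax normalization internally), not necessarily the exact code used in this work.

import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()

raw = torch.randn(4, 2, requires_grad=True)   # stand-in for the pre-activation outputs
outputs = torch.tanh(raw)                     # fake network outputs, components in (-1, 1)
labels = torch.tensor([0, 1, 1, 0])           # binary labels y'_i

loss = criterion(outputs, labels)             # softmax + cross-entropy, handled internally
loss.backward()                               # gradients end up in raw.grad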
Chapter 4
Numerical Analysis Results

4.1 Dynamics of Geometric Observables

We shall commence by showing the results, averaged over 30 runs, obtained by training the neural network described in Subsec. 3.4.2 on the binary classification of the MNIST dataset into even and odd digits, as detailed in Subsec. 3.4.1. We used a training set D_train of 10000 data points randomly drawn from the original MNIST training set, the full MNIST test set (also of 10000 points) as the test set D_test, and a learning rate η = 0.3 (see Subsec. 2.1.2). We should point out that the initialization of the network parameters is random as well. The first quantities we will inspect are the training error E_train and the test error E_test (see Subsec. 2.2.3), as is common practice in every machine learning experiment.

Fig. 4.1 E_train(t) and E_test(t) for 30 runs on the MNIST dataset. Hyperparameters: |D_train| = 10000, η = 0.3.
Fig. 4.1 shows a steady decrease over the training epochs for both E_train and E_test, with the latter being slightly (but consistently) larger than the former, as one would expect in the vast majority of cases. A decreasing training error signifies that our neural network is successfully learning to classify the training set, while a decreasing test error implies an increasing generalization capability of the model. Now that we have made sure that our neural network is correctly learning to perform its given task, we can move on to the actual investigation of the dynamics of the object manifolds, encapsulated in the geometric observables that we introduced in Subsec. 2.3.2. Since the only hidden layer in our network comprises 10 neurons, we will be computing the geometric observables for 10-dimensional manifolds. Working with a binary dataset means having to deal with two distinct object manifolds only:

• M_even, consisting of the inner representations of the elements with label 0;
• M_odd, consisting of the inner representations of the elements with label 1.

4.1.1 Non-Monotonic Behaviours in MNIST

First of all, we will focus on the radius of gyration R_M, measuring the average spread of the data points belonging to the same manifold M around its centre.

Fig. 4.2 R_M(t) for the same 30 MNIST runs of Fig. 4.1. Hyperparameters: |D_train| = 10000, η = 0.3.

The fact that both R_{M_even} and R_{M_odd} exhibit a steep drop-off right at the beginning of the training process may not come as a total surprise: for a deep neural network, a monotonically decreasing training error corresponds to a monotonically increasing capacity to untangle the two manifolds in the hidden layers, which would intuitively be achieved by a steady contraction of the manifolds.
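For reference, the observables tracked in this section can be evaluated directly from the hidden-layer activations. The sketch below assumes the definitions of Subsec. 2.3.2, namely that R_M is the root-mean-square distance of the points of a manifold from their centre and that d_ctc is the Euclidean distance between the two centres; the arrays H_even and H_odd are random stand-ins for the 10-dimensional hidden representations.

import numpy as np

def radius_of_gyration(H):
    # Root-mean-square distance of the points in H (shape (n, N)) from their centre.
    centre = H.mean(axis=0)
    return np.sqrt(((H - centre) ** 2).sum(axis=1).mean())

def centre_to_centre(H_even, H_odd):
    # Euclidean distance between the centres of the two manifolds.
    return np.linalg.norm(H_even.mean(axis=0) - H_odd.mean(axis=0))

H_even = np.random.randn(100, 10)        # fake hidden-layer representations, label 0
H_odd = np.random.randn(100, 10) + 1.0   # fake hidden-layer representations, label 1

R_even = radius_of_gyration(H_even)
d_ctc = centre_to_centre(H_even, H_odd)
R_hat_even = R_even / d_ctc              # dimensionless radius of gyration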
Contrary to what one might expect, the initial contraction of the manifolds does not carry on indefinitely, and their radius of gyration displays a remarkably interesting non-monotonic behaviour: after the noticeable decrease at the beginning of the training process, the radii of gyration both reach a minimum within a few tens of training epochs, and then start slowly increasing without ever reaching a maximum. From a geometric perspective, this non-monotonicity corresponds to an initial rapid contraction of the manifolds followed by a slower expansion, and was first observed in [3]. Let's move on to the centre-to-centre distance d_ctc between M_even and M_odd, measuring their average separation.

Fig. 4.3 d_ctc(t) for the same 30 MNIST runs of Fig. 4.1. Hyperparameters: |D_train| = 10000, η = 0.3.

Here again, we see that d_ctc, too, exhibits the same non-monotonic behaviour as the radius of gyration R_M, only this time it is reversed: beginning with a steep growth, the centre-to-centre distance reaches a maximum within a few tens of training epochs, and then starts dropping in a milder way. From a geometric point of view, the manifolds show an initial quick distancing and a subsequent, more gradual mutual approach. We must stress that, while this non-monotonic phenomenon and the previously mentioned contraction-expansion one describe two distinct manifold dynamics, they seem to occur at approximately the same training epoch. From now on, we will refer to the training epoch at which the monotonicity of a geometric observable O(t) changes as its epoch of inversion, denoted t*. We shall come back to this subject in Subsec. 4.1.3.
To conclude our numerical analysis of the dynamics of the geometric observables introduced in Subsec. 2.3.2, we will now discuss the dimensionless radius of gyration R̂_M. Since R_M and d_ctc are characterized by opposite dynamics over the training epochs, with their respective epochs of inversion approximately coinciding, R̂_M := R_M / d_ctc exhibits the same qualitative behaviour as R_M, only visibly more pronounced (see Fig. 4.4). R̂_M can therefore be interpreted as a dimensionless quantity single-handedly encapsulating all the intra-manifold and inter-manifold dynamics.

Fig. 4.4 R_{M_even}(t) and R̂_{M_even}(t) for the same 30 MNIST runs of Fig. 4.1. Hyperparameters: |D_train| = 10000, η = 0.3.

4.1.2 Comparing MNIST with Similar Datasets

MNIST is usually the first dataset researchers use as a benchmark to validate their algorithms, as we have already discussed in Subsec. 3.1.1. "If it doesn't work on MNIST, it won't work at all", as they say. Well, if it does work on MNIST, it may still fail on others. For this reason, we decided to extend our investigation of the dynamics of the usual geometric observables to other, more complex datasets, namely EMNIST Letters, KMNIST, and Fashion-MNIST (see Sec. 3.1), to see whether the non-monotonic behaviours we found in MNIST were also present in other structured datasets, or whether they were just a peculiarity of MNIST. We used the same hyperparameter values for all the datasets we examined, only adjusting the learning rate η to compensate for variations in dataset complexity. We are only going to provide the graphs of the dimensionless radius of gyration R̂_M since, as we explained in Subsec. 4.1.1, this geometric observable contains all the information regarding the intra-manifold and inter-manifold dynamics.
(a) EMNIST Letters (b) KMNIST (c) Fashion-MNIST

Fig. 4.5 R̂_M(t) for 30 runs on EMNIST Letters, KMNIST and Fashion-MNIST. Hyperparameters: |D_train| = 10000, η = 0.2.
We can clearly see that exactly the same non-monotonic dynamics we first observed in MNIST is found in EMNIST Letters (Fig. 4.5 (a)), KMNIST (Fig. 4.5 (b)), and Fashion-MNIST (Fig. 4.5 (c)) as well. For this reason, from now on we will continue our investigation solely on MNIST, and the results will be treated as general properties of all the datasets we examined.

4.1.3 Epochs of Inversion

We conclude this section with a brief discussion on training epochs. As we stated in Subsec. 4.1.1, each non-monotonic geometric observable O(t) is characterized by an epoch of inversion t*, i.e. the training epoch at which it exhibits a change in monotonicity, defined as

t* := { argmin_t O(t), if O = R_M or O = R̂_M; argmax_t O(t), if O = d_ctc }.   (4.1)

Since we already noted that the epochs of inversion of R_M and d_ctc are localized inside a small range of training epochs, we decided to identify this inversion band for a sample of 30 distinct runs on MNIST.

Fig. 4.6 Inversion band of R_M(t) and d_ctc(t) for 30 runs (MNIST). Hyperparameters: |D_train| = 10000, η = 0.3.

We can see in Fig. 4.6 that the epochs of inversion of R_M(t) and d_ctc(t) for 30 runs are localized in a rather narrow inversion band of ∼ 20 training epochs. Therefore, we can conclude that the two distinct manifold dynamics, i.e. contraction-expansion and distancing-reapproaching, indeed occur almost simultaneously during the training process of a feed-forward neural network.
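Given the curve of an observable recorded at every training epoch, the epoch of inversion of Eq. (4.1) is a simple argmin or argmax; the synthetic curves below merely mimic the contraction-expansion and distancing-reapproaching shapes discussed above.

import numpy as np

epochs = np.arange(200)
R_curve = np.exp(-epochs / 10.0) + 0.002 * epochs        # synthetic contraction-expansion curve
d_curve = 1.0 - np.exp(-epochs / 10.0) - 0.002 * epochs  # synthetic distancing-reapproaching curve

t_star_R = int(np.argmin(R_curve))   # epoch of inversion for R_M (and R_hat_M)
t_star_d = int(np.argmax(d_curve))   # epoch of inversion for d_ctc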
4.2 The Finding of Straggler Data Points

In Sec. 4.1, we delved into the array of non-monotonic behaviours displayed by the geometric observables during the training process of a feed-forward neural network. Since we couldn't help but notice that the epochs of inversion of different geometric observables are all localized in narrow bands, we wondered whether there was something hidden among the data points, interacting with the neural network in a way that causes the changes in monotonicity that we observe. Indeed, we found that all the non-monotonic trends are attributable to those data points that still get misclassified during the training epochs in the inversion band, and we decided to dub them stragglers¹ accordingly. Not only are stragglers the data points responsible for the non-monotonicity of the geometric observables but, as we'll see, they are also intrinsically linked to the generalization capability of neural networks.

4.2.1 The Critical Role of Stragglers

To empirically illustrate the role of straggler data points both in the appearance of non-monotonic behaviours in the geometric observables and in generalization, we adopted the following procedure (see the code sketch below):

1. We trained our neural network on a training set D_train of 10000 data points, saving the positions of the N stragglers identified at the epoch of inversion t* of the dimensionless radius of gyration R̂_M;
2. We removed the N straggler data points from the training set D_train and trained the neural network on the remaining training set D'_train;
3. We removed N random data points from the original training set D_train and trained the neural network on the remaining training set D''_train;
4. We cross-compared the results from the previous three steps.

This time around, we will begin by showing the results for the geometric observables and then move on to the training and test errors. As we can plainly see in Fig. 4.7, removing ∼ 10% of the elements of the training set D_train in the form of 1125 random data points has a negligible effect on the dynamics of the manifolds, whereas removing the exact same number of data points from D_train in the form of stragglers completely lifts the non-monotonicity of the geometric observables: both R_M and R̂_M turn into monotonically decreasing functions, while d_ctc becomes monotonically increasing. R_M and R̂_M also reach considerably smaller values while d_ctc grows much larger, meaning that the neural network succeeds in untangling the two object manifolds with a substantially smaller effort.

¹ The Oxford Languages definition of straggler is "a person in a group who becomes separated from the others, typically because of moving more slowly".
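The construction of the three training sets compared in the procedure above can be sketched as follows; straggler_idx is a placeholder for the indices of the N stragglers actually identified at t*.

import numpy as np

rng = np.random.default_rng(0)

n_train = 10000
all_idx = np.arange(n_train)

straggler_idx = np.array([3, 17, 42])   # placeholder: in practice ~1125 indices found at t*
N = len(straggler_idx)

keep_no_stragglers = np.setdiff1d(all_idx, straggler_idx)   # indices defining D'_train
random_idx = rng.choice(all_idx, size=N, replace=False)
keep_no_random = np.setdiff1d(all_idx, random_idx)          # indices defining D''_train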
(a) R_M(t) (b) d_ctc(t) (c) R̂_M(t)

Fig. 4.7 R_M(t), d_ctc(t) and R̂_M(t) for the three different sets of 30 runs (MNIST). Hyperparameters: |D_train| = 10000, |D'_train| = |D''_train| = 8875, η = 0.3.
(a) E_train (b) E_test

Fig. 4.8 E_train(t) and E_test(t) for the same three sets of 30 runs of Fig. 4.7. Hyperparameters: |D_train| = 10000, |D'_train| = |D''_train| = 8875, η = 0.3.

Let's move on to the graphs of the training and test errors (Fig. 4.8). As for the geometric observables, the effects of the removal of 1125 random data points from the training set D_train are virtually insignificant. On the other hand, removing the 1125 stragglers makes the training error drop to a perfect 0% in a few dozen training epochs, while the test error always stays above the other two, flattening out at approximately 10%. This result is very interesting, as it signifies that straggler data points are the hardest to learn, but they are also the ones that truly contribute to the generalization capability of the neural network.
4.2.2 Steps Towards a Formal Definition

At the beginning of this section, we outlined stragglers as the data points that get misclassified during the training epochs in the inversion band, i.e. for t*_min ≤ t ≤ t*_max, where t*_min and t*_max are the minimum and the maximum of the inversion band, respectively. Since each training epoch t is characterized by its own set of misclassified data points S(t), we can define the set of stragglers as

S := ⋃_{t*_min ≤ t ≤ t*_max} S(t).   (4.2)

In practice, however, there is no need to use the full set of stragglers S, and we can obtain almost identical results using just a subset S(t*), where t* is the epoch of inversion of any non-monotonic geometric observable: for instance, in Subsec. 4.2.1 we used the epoch of inversion of R̂_M to define the subset of stragglers. To check the consistency of definition (4.2), we trained our neural network on multiple training sets D'_train(t*) = D_train \ S(t*), where t* indicates the epoch whose set of misclassified data points S(t*) was removed from the original training set D_train, and we then compared their test errors E_test at training epoch t = 200. As we can see in Fig. 4.9, we indeed found that, following an initial exponential decay, E_test(t = 200) becomes linear at the midpoint of the inversion band, remaining almost constant until it starts oscillating at around t* = 80.

Fig. 4.9 A scatter plot of E_test(t = 200) as a function of the epoch t* whose |S(t*)| misclassified data points were removed from the training set D_train (MNIST). Hyperparameters: |D'_train(t*)| = 10000 − |S(t*)|, η = 0.3.
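Definition (4.2) translates directly into a union of per-epoch misclassification sets. In the sketch below, misclassified(t) is a hypothetical helper returning the indices of the data points misclassified at epoch t (recorded during training).

def straggler_set(misclassified, t_min, t_max):
    # Union of the misclassified sets over the inversion band [t_min, t_max], Eq. (4.2).
    S = set()
    for t in range(t_min, t_max + 1):
        S |= set(misclassified(t))
    return S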
4.2.3 Training and Filtration

As we already mentioned, each training epoch t is characterized by its own set of misclassified data points S(t), but we do not know much about S(t) itself, except that a monotonically decreasing training error E_train(t) corresponds to a monotonically decreasing size of S(t). We are particularly interested in understanding whether the training process of a neural network can be regarded as a descending filtration. In mathematics, a descending filtration F is defined as an indexed family {S_i}_{i∈I} of subobjects of a given algebraic structure S, with the index i running over some totally ordered index set I, subject to the condition that if i ≤ j in I, then S_i ⊇ S_j. A necessary condition for the training process to be a descending filtration is that data points must never enter S(t), i.e. once a data point x_i gets correctly classified by the neural network, it can never get misclassified during subsequent training epochs. The results shown in Fig. 4.10 prove that this is definitely not the case, as many training epochs are characterized by a non-zero number of data points entering S(t), especially the ones that precede the inversion band. Therefore, the training process of a feed-forward neural network cannot be considered a descending filtration.

Fig. 4.10 A scatter plot of the number of points entering/exiting S(t) during each training epoch t (MNIST). Hyperparameters: |D_train| = 10000, η = 0.3.
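The entering/exiting counts of Fig. 4.10 amount to set differences between consecutive epochs. In the sketch below, S_t is assumed to be the list of per-epoch misclassified index sets recorded during training; a descending filtration would require every entry of entering to be zero.

def entering_exiting(S_t):
    # S_t: list of sets of misclassified data-point indices, one per training epoch.
    entering, exiting = [], []
    for prev, curr in zip(S_t[:-1], S_t[1:]):
        entering.append(len(curr - prev))   # points that become misclassified
        exiting.append(len(prev - curr))    # points that become correctly classified
    return entering, exiting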
4.3 Effects of the Shuffling of Labels

We shall conclude by investigating the effects that a shuffle of the binary labels y_i ∈ {0, 1} (that is, a random relabelling of the "even" and "odd" labelled elements) induces in the geometric structure of their dataset D = {(x_i, y_i)}_{i=1}^{n}. Let's begin, as usual, by showing a graph of the training error E_train and the test error E_test. To ensure a fair comparison between the two, we shuffled the labels of both the training set D_train and the test set D_test.

Fig. 4.11 E_train(t) and E_test(t) for 30 shuffled-labels runs (MNIST). Hyperparameters: |D_train| = 10000, η = 0.3.

As one would expect, the training error continues to decrease over the training epochs even if we shuffle the labels before training, albeit at a slower pace compared to training with unshuffled labels (see Fig. 4.1), meaning that the neural network is still learning to classify the training data points. On the contrary, the generalization capability of the model is not improving at all, with the test error staying constant at a value of 50%. What is interesting, however, is that, as we are about to see, the shuffling of the labels completely lifts the non-monotonic behaviours of all the geometric observables, thus totally disrupting the object manifold structure. The radius of gyration R_M becomes a monotonically non-increasing function, almost constant over the course of training (Fig. 4.12 (a)), the centre-to-centre distance d_ctc becomes monotonically increasing (Fig. 4.12 (b)) and, as a result, the dimensionless radius of gyration R̂_M is now monotonically decreasing (Fig. 4.12 (c)). As a side effect, the shuffling of the labels makes the gradient descent algorithm (see Subsec. 2.1.2) become extremely unstable, with consequent severe oscillations in all the observables (especially in the radius of gyration R_M, Fig. 4.12 (a)).
(a) R_M(t) (b) d_ctc(t) (c) R̂_M(t)

Fig. 4.12 R_M(t), d_ctc(t) and R̂_M(t) for the same 30 shuffled-labels runs of Fig. 4.11. Hyperparameters: |D_train| = 10000, η = 0.3.
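The shuffling itself is nothing more than a random permutation of the label arrays, applied to both the training and the test labels; the arrays below are random stand-ins for the actual binary labels.

import numpy as np

rng = np.random.default_rng(0)

y_train = rng.integers(0, 2, size=10000)   # stand-in for the binary training labels
y_test = rng.integers(0, 2, size=10000)    # stand-in for the binary test labels

# Each image keeps its pixels but receives a label drawn at random from the label population.
y_train_shuffled = rng.permutation(y_train)
y_test_shuffled = rng.permutation(y_test)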
Chapter 5
Conclusions and Future Work

Because of their ability to reproduce and model non-linear processes, artificial neural networks have found applications in many modern disciplines, including, but not limited to, computational neuroscience, quantum chemistry, cybersecurity, data mining and finance. ANNs have also been used as a tool to solve PDEs in physics and to simulate the properties of many-body open quantum systems. A common simplifying assumption of theoretical investigations on neural networks is that of considering input data points as unstructured data. In the majority of real-life datasets, however, data points with similar features are found to be closer together than points exhibiting considerable differences in their characteristics, clustered in the state spaces into geometric structures known as object manifolds, which in the case of labelled data are simply the collections of data points sharing the same label. It is known that, during their training process, neural networks generally bring points belonging to the same manifold closer together and drive points belonging to different manifolds farther away, effectively untangling them.

The cornerstone of this work was a numerical analysis of how feed-forward neural networks process the geometrical properties of structured data during their training. The training task consisted of a binary classification of the elements of MNIST and other similar structured datasets, with plain gradient descent as the optimization algorithm, chosen in the spirit of Occam's razor (the principle of parsimony), in order to have a clear vision of the connection between the structure in data and the interesting dynamics of the latent geometries of neural networks. Over the course of our numerical investigation, we managed to uncover non-monotonic behaviours both in the radius of gyration, measuring the average spread of the data points belonging to the same manifold around its centre, and in the centre-to-centre distance, quantifying the separation between the centres of two different manifolds. We also found these behaviours to be a common feature of all the structured datasets we examined. The greatest achievement of this work, however, is the finding that these non-monotonic dynamics are entirely due to the existence of straggler data points, which we also proved to be crucial to the growth of the generalization capability of neural networks. Finally, we showed that the training process of a neural network cannot be considered a descending filtration, and that a random shuffle of the data point labels completely lifts the non-monotonicity of the geometric observables.
Of course, there is still work to be done on the subject: while we have proposed a straightforward operational definition of straggler data points, we know very little about their intrinsic features. For example, we do not know whether it is possible to identify stragglers without actually training the neural network; besides, a data point that is a straggler for a specific neural network may not be one for a different network. Discovering more about the true nature of stragglers could lead to many interesting developments, such as the transition to smaller, yet more impactful training sets, and a more robust theoretical framework for deep learning itself. In addition, we still cannot claim that the non-monotonic dynamics we observed in all the datasets we examined are, in fact, general properties of structured datasets, although our intuition would suggest that they are.

We would like to end this dissertation with a simple, yet meaningful quote by the American computer programmer and science fiction writer Daniel Keys Moran: "You can have data without information, but you cannot have information without data."
References

[1] Charles F. Cadieu, Ha Hong, Daniel L. K. Yamins, Nicolas Pinto, Diego Ardila, Ethan A. Solomon, Najib J. Majaj, and James J. DiCarlo. Deep neural networks rival the representation of primate IT cortex for core visual object recognition. PLoS Computational Biology, 10(12):e1003963, dec 2014.
[2] SueYeon Chung, Daniel D. Lee, and Haim Sompolinsky. Classification and geometry of general perceptual manifolds. Physical Review X, 8(3), jul 2018.
[3] Simone Ciceri. Geometrical processing of data in multilayer neural networks. Bachelor's Thesis, UNIMI, 2020.
[4] Tarin Clanuwat, Mikel Bober-Irizar, Asanobu Kitamoto, Alex Lamb, Kazuaki Yamamoto, and David Ha. Deep learning for classical japanese literature. CoRR, abs/1812.01718, 2018.
[5] Gregory Cohen, Saeed Afshar, Jonathan Tapson, and André van Schaik. EMNIST: an extension of MNIST to handwritten letters. CoRR, abs/1702.05373, 2017.
[6] Uri Cohen, SueYeon Chung, Daniel Lee, and Haim Sompolinsky. Separability and geometry of object manifolds in deep neural networks, 05 2019.
[7] James DiCarlo and David Cox. Untangling invariant object recognition. Trends in Cognitive Sciences, 11:333–41, 09 2007.
[8] James DiCarlo, Davide Zoccolan, and Nicole Rust. How does the brain solve visual object recognition? Neuron, 73:415–34, 02 2012.
[9] Surya Ganguli and Haim Sompolinsky. Statistical mechanics of compressed sensing. Phys. Rev. Lett., 104:188701, May 2010.
[10] E. Gardner. Maximum storage capacity in neural networks. Europhysics Letters (EPL), 4(4):481–485, aug 1987.
[11] E. Gardner and Bernard Derrida. Optimal storage properties of neural network models. Journal of Physics A, 21:271–284, 1988.
[12] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.
[13] Nikolaus Kriegeskorte. Deep neural networks: A new framework for modeling biological vision and brain information processing. Annual Review of Vision Science, 1:417–446, 11 2015.
[14] Nikolaus Kriegeskorte, Marieke Mur, Douglas Ruff, Roozbeh Kiani, Jerzy Bodurka, Hossein Esteky, Keiji Tanaka, and Peter Bandettini. Matching categorical object representations in inferior temporal cortex of man and monkey. Neuron, 60:1126–41, 01 2009.
[15] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classification with deep convolutional neural networks. In F. Pereira, C.J. Burges, L. Bottou, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems, volume 25. Curran Associates, Inc., 2012.
[16] Andrea Lazzari. Analisi del perceptron e della sua espressività nella classificazione di dati strutturati. Bachelor's Thesis, UNIMI, 2020.
[17] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[18] Yann A. LeCun, Léon Bottou, Genevieve B. Orr, and Klaus-Robert Müller. Efficient BackProp, pages 9–48. Springer Berlin Heidelberg, Berlin, Heidelberg, 2012.
[19] Tom M. Mitchell. Machine Learning. McGraw-Hill, New York, 1997.
[20] J. O'Keefe and J. Dostrovsky. The hippocampus as a spatial map. Preliminary evidence from unit activity in the freely-moving rat. Brain Research, 34(1):171–175, 1971.
[21] David E. Rumelhart and David Zipser. Feature discovery by competitive learning. Cognitive Science, 9(1):75–112, 1985.
[22] A. L. Samuel. Some studies in machine learning using the game of checkers. IBM Journal of Research and Development, 3(3):210–229, 1959.
[23] A. M. Turing. Computing machinery and intelligence. Mind, LIX(236):433–460, 10 1950.
[24] Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. CoRR, abs/1708.07747, 2017.
[25] Daniel Yamins, Ha Hong, Charles Cadieu, Ethan Solomon, Darren Seibert, and James DiCarlo. Performance-optimized hierarchical models predict neural responses in higher visual cortex. Proceedings of the National Academy of Sciences of the United States of America, 111, 05 2014.