Non-parametric regressions & Neural Networks
1. Non-parametric estimate of
non-trivial pdf’s & Gender
classification based on
Artificial Neural Networks
Giuseppe Broccolo
Journal Club, Florence, 10/02/2011
(Scuola Normale Superiore & INFN Sez. di Pisa)
2. Introduction
1) Find an analytical expression for a distribution function or a pdf
describing a set of data x = {x}.
>> If the pdf is a priori well known, a parametric estimate f(x; θ) is
made via maximum likelihood, χ² minimization, etc.
>> If the pdf is not a priori known, a "forced" parametrization of the pdf
may introduce a bias.
2) Gender classification: N-hypothesis {ω = ω_1…ω_N} test based on a set
of data x = {x} ⊂ X.
>> If the conditional probability P(ω_i) = ∫dx p(ω_i, x) (i.e. the pdf) is
well known, the Bayesian decision rule can be applied.
>> If there is no knowledge about the hypotheses ω_i and the dataset x,
the Bayesian approach cannot be used.
The goals
3. Parametric estimate
1) Find the analytical expression for a distribution function describing a set of
data Y = {y} related to another set X = {x}:
f: X → Y, x_i ↦ y_i = f(x_i; θ)
Minimization of the "Mahalanobis distance" (χ²), with the covariance matrix R as metric
(= "χ² distance" if R_ij = σ_i² δ_ij, = "Euclidean distance" if R_ij = δ_ij).
2) Find the analytical expression for a pdf p(x; θ) starting from a dataset X = {x}:
P_X(θ) is a "likelihood function" and has to be maximized; for a monotonic p(·) one can
equivalently maximize its logarithm (for an independent dataset).
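The two objective functions above were shown as images in the original slides; a minimal reconstruction in standard notation (the factorized likelihood assumes independent, identically distributed data):

\[
\chi^2(\theta) = \sum_{i,j}\,\bigl[y_i - f(x_i;\theta)\bigr]\,(R^{-1})_{ij}\,\bigl[y_j - f(x_j;\theta)\bigr],
\qquad
P_X(\theta) = \prod_{i=1}^{n} p(x_i;\theta)
\;\Rightarrow\;
\hat{\theta} = \arg\max_{\theta}\,\sum_{i=1}^{n} \ln p(x_i;\theta).
\]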
4. Gender Classification
Try to characterize the distribution of a set of data x = {x} conditioned by the
gender class ω_i: p(x|ω_1), p(x|ω_2), ...
Bayes' theorem gives a relation between the dataset x and the a-posteriori
probability p(ω_i|x):
p(ω_i|x) = p(x|ω_i) P(ω_i) / p(x),
with p(x) = p(x|ω_1) P(ω_1) + p(x|ω_2) P(ω_2).
From the observation of x, "knowing" p(x|ω_i), it is possible to apply the
"Bayesian decision rule":
d(x) = ω_i such that p(ω_i|x) > p(ω_j|x) ∀ j ≠ i
Note that for the Bayesian approach:
>> p(ω_i|x) ∝ p(x|ω_i) P(ω_i), and for equal priors p(ω_i|x) ∝ p(x|ω_i);
>> P(ω_1|x ∈ [ω_2]) is the "ω_1 error confidence level" and
P(ω_2|x ∈ [ω_1]) is the "ω_2 error confidence level".
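The decision rule and the two error terms appeared as formulas in the original slide; a hedged reconstruction of the standard two-class case:

\[
d(x) = \omega_1 \iff p(\omega_1|x) > p(\omega_2|x),
\qquad
P(\omega_1|x\in[\omega_2]) = \int_{\mathcal{R}_1} p(x|\omega_2)\,P(\omega_2)\,dx,
\qquad
P(\omega_2|x\in[\omega_1]) = \int_{\mathcal{R}_2} p(x|\omega_1)\,P(\omega_1)\,dx,
\]

where \(\mathcal{R}_i\) is the region of feature space assigned to \(\omega_i\) by the rule.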
5. Part I
1) Non-parametric estimate of pdf's or distribution functions.
>> "Discrete algorithms" for pdf estimate.
>> Problems remain anyway ⇒ discrete algorithms implemented in a
recursive way (Artificial Neural Networks).
2) Gender classification.
>> "Discrete" classifiers, based on Bayes' theorem.
>> Problems in high dimension for discrete classifiers.
>> Artificial Neural Networks are good statistical estimators for pattern
recognition and noisy-data treatment ⇒ good candidates to be
classifiers.
Alternative approaches
6. Discrete algorithms: Parzen Windows
>> Relation between P (the probability that a pattern falls in a region R ⊂ ℝ^d) and
the pdf p(x): P = ∫_R p(x′) dx′.
>> The idea: divide ℝ^d into contiguous regions R_1…R_n of volume V_1…V_n, centred on x_0 ∈ R_i.
Counting the number (k_n) of patterns x_i in each region, and
considering small regions (V_i → 0): p(x_0) ≈ (k_n/n)/V_n.
>> To guarantee the convergence: V_n → 0, k_n → ∞ and k_n/n → 0 as n → ∞.
>> The solution: introduce the Parzen window (a hypercube, V_n = h_n^d) with a window
function φ, so that p_n(x) = (1/n) Σ_i V_n^(−1) φ((x − x_i)/h_n);
the conditions ∫du φ(u) = 1 & φ(u) ≥ 0
guarantee that Parzen windows
describe pdf's effectively!!
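A minimal 1-D C++ sketch of the Parzen estimator described above (the names are illustrative, not from the original deck); it uses a box window, φ(u) = 1 for |u| ≤ 1/2 and 0 otherwise:

#include <cmath>
#include <vector>

// Box Parzen window: phi(u) = 1 inside the unit hypercube (d = 1 here), 0 outside.
double phi(double u) { return (std::fabs(u) <= 0.5) ? 1.0 : 0.0; }

// Parzen-window pdf estimate at the point x0, given the sample and the width hn.
double parzen_pdf(const std::vector<double>& sample, double x0, double hn)
{
    const double Vn = hn;            // in d dimensions Vn = hn^d; d = 1 here
    double sum = 0.0;
    for (double xi : sample)
        sum += phi((x0 - xi) / hn) / Vn;
    return sum / sample.size();      // p_n(x0) = (1/n) * sum_i Vn^-1 * phi((x0-xi)/hn)
}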
7. kn-Nearest Neighbors
>> Problems with the Parzen-windows algorithm:
- Practical problem: for V_n → 0, k_n ≈ 0, n being fixed for a real sample.
- Theoretical problem: Parzen windows do not give a criterion for the choice of h*:
i) too small an h* leads to k_n ≈ 0 (the practical problem);
ii) too large an h* leads to flat ratios, so that p(x_0) ≈ E[p(x)] over R_n.
>> Possible solution: choose V_n (hence h_n) as a function of the dataset x!!
This means that the estimate of p(x_0) starts from a region R_n which is grown until
k_n points are included (the nearest neighbours).
To be noted: h_n depends on the estimate p(x_0) itself, so a recursive algorithm is needed!
8. k-Nearest Neighbor
>> Try to find a good non-parametric gender classifier starting from a tagged (i.e. each
element x_i belongs to a well-known gender class ω_j) n-dimensional set of data
x = {x|ω}. Then try to estimate the gender class of a data point x_0 using the non-
parametric classifier:
i) Consider a volume V centred on x_0, then consider its k nearest neighbours.
ii) Among the k patterns, k_i is the number of elements belonging to class ω_i.
iii) Using Parzen windows or k_n-nearest neighbours, we can estimate the probability
that the data point x_0 belongs to class ω_i: p(ω_i|x_0) ≈ k_i/k,
i.e. the searched probability equals the relative frequency, so we reduce to a
frequentist approach. This suggests the following (non-Bayesian) decision rule:
k-Nearest-Neighbour decision rule: consider the patterns nearest (in Mahalanobis
distance) to x_0 and assign x_0 to the class ω_i with the highest frequency k_i/k.
Theoretical problems: i) requiring the lowest Mahalanobis distance is only a conjecture;
ii) being based on PWs, k ≫ 1 implies large V's ⇒ large distances.
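A short C++ sketch of the decision rule above (illustrative names; the Euclidean metric is used here for brevity, whereas the deck prescribes the Mahalanobis distance, which would weight coordinates with R^(−1)):

#include <algorithm>
#include <cmath>
#include <map>
#include <vector>

struct Tagged { std::vector<double> x; int omega; };   // pattern + gender-class tag

// Majority vote among the k nearest neighbours of x0.
int knn_classify(std::vector<Tagged> data, const std::vector<double>& x0, int k)
{
    auto dist2 = [&](const Tagged& t) {                // squared Euclidean distance
        double d2 = 0.0;
        for (std::size_t j = 0; j < x0.size(); j++)
            d2 += (t.x[j] - x0[j]) * (t.x[j] - x0[j]);
        return d2;
    };
    std::partial_sort(data.begin(), data.begin() + k, data.end(),
        [&](const Tagged& a, const Tagged& b) { return dist2(a) < dist2(b); });
    std::map<int, int> ki;                             // k_i: counts per class among the k
    for (int i = 0; i < k; i++) ki[data[i].omega]++;
    int best = -1, bestCount = -1;                     // class with the highest k_i/k
    for (auto& p : ki)
        if (p.second > bestCount) { bestCount = p.second; best = p.first; }
    return best;
}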
9. Part II
1) "Discrete" non-parametric algorithms such as PWs, k_n-NNs & k-NN
suffer from structural problems:
>> they require stopping criteria for the recursive algorithms;
>> they are not so well defined in many cases (e.g. with large V's in PWs or k-NN).
2) ANNs could avoid some of these problems...
Neural Networks (ANN)
10. From Biological Neurons...
>> The human brain is not so good at difficult computations, but it is very
good at recognition...
- ~10^11–10^12 basic structures (neurons), which
are activated when inputs are received.
- Neurons are linked with each other: the
output of one becomes the input of another,
the inputs being combined to give the so-called
"synaptic potential". Different inputs
activate different neurons; in this
way the human brain is optimized for pattern
recognition.
- The human brain "is trained" by experience
to learn the right correspondence
between recognized objects and specific
activated neurons.
>> ANNs are inspired by this working principle: many basic units which work
in an (apparently) simple way.
11. ...to ANN !!
>> An ANN is a graph, i.e. an ensemble A = {V, E} where v ∈ V are neurons and e ∈ E are synaptic
connections (or axons). V has 3 sub-ensembles: I ⊂ V (input neurons), H ⊂ V (hidden neurons)
and O ⊂ V (output neurons), with H ∩ I = H ∩ O = ∅. Each synaptic connection e ∈ E is tagged
with a "weight" ω_i ∈ ℝ, and each neuron v ∈ V corresponds to a "propagation function"
f(·): ℝ → ℝ.
>> So the ANN topology is uniquely described by the ensemble A = {I, H, O, E, ω, f(·)}.
>> Typical propagation functions are defined as f(·): ℝ → [0,1] or [−1,1] (tanh(·), θ(·), ...). The
most used is the "sigmoid", thanks to its simple derivative (see later):
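The sigmoid and its derivative were shown as an image in the original slide; a standard reconstruction:

\[
f(a) = \frac{1}{1+e^{-a}}, \qquad f'(a) = f(a)\,\bigl[1 - f(a)\bigr],
\]

so the derivative needed by gradient descent is obtained from the function value itself, with no extra evaluation.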
12. The simplest ANN: the Perceptron
>> One single neuron {f(.)}, n synaptic input connections and one synaptic output connection.
[Diagram: perceptron with inputs x_1…x_n, weights ω_1…ω_n, synaptic potential a, output y = f(a).]
>> Very similar to the biological neuron (n dendrites, one axon): the n inputs form the
"synaptic potential" (F. Rosenblatt, 1962).
>> A linear superposition of the n inputs yields a "trainable" neural network: different
weights correspond to different outputs, expected for a given set x = {x}:
a = (Σ_i ω_i x_i) − θ
- θ is the "threshold weight": it rules the activation of the perceptron.
13. Feed-Forward MultiLayer Perceptrons
>> Topology based on layers: simultaneous forward propagation (feed-forward; no delay
function is applied ⇒ a recursive computation is needed), with fully connected
perceptrons in every layer.
>> The same activation function of the synaptic potential is used for each perceptron.
>> Recursive computation (F. Rosenblatt, 1969): for l = 2…L and i = 1…n_P^(l) (the number of
perceptrons in the l-th layer), the output p_i^(l) of the i-th perceptron in the l-th
layer follows the recursion reconstructed below the diagram.
[Diagram: feed-forward MLP with inputs x_1, x_2, hidden layers, and outputs y_1, y_2. L: tot. # of layers.]
Inputs of the MLP: the n-dim. vector p^(1); outputs of the MLP: the m-dim. vector y = p^(L).
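The feed-forward recursion itself appeared as an image; a reconstruction consistent with the perceptron formula of the previous slide:

\[
p_i^{(l)} = f\!\Bigl(\sum_{j=1}^{n_P^{(l-1)}} \omega_{ij}^{(l)}\, p_j^{(l-1)} \;-\; \theta_i^{(l)}\Bigr),
\qquad l = 2,\dots,L,\quad i = 1,\dots,n_P^{(l)},
\]

with p^(1) = x the input vector and y = p^(L) the output.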
14. Feed-Forward MultiLayer Perceptrons
>> An example of MLP (4 layers, 2+5+3+1 perceptrons) code implemented in C++...
#include <cmath>

typedef float Float_t;   // ROOT-style alias, defined here for self-containment

Float_t transfer_function_sigmoid(Float_t a) { return 1.0f / (1.0f + std::exp(-a)); }

Float_t w[4][5][5];      // synaptic weights w[layer][perceptron][input]
Float_t Theta[4][5];     // threshold weights

// Feed-forward pass; stores every perceptron's output and synaptic potential
// (both are needed later by the training code).
void MultiLayerPerceptron(const Float_t input[2],
                          Float_t output[4][5], Float_t potential[4][5])
{
  const int nperclayer[4] = {2, 5, 3, 1};   // 2+5+3+1 topology
  Float_t Axion[5];                         // inputs of the current layer

  // input data
  for (int k = 0; k < 2; k++)
    Axion[k] = input[k];

  // neural net
  int prevnperc = 2;                        // dimension of the input vector
  for (int l = 0; l < 4; l++)
  {
    int nperc = nperclayer[l];
    for (int i = 0; i < nperc; i++)
    {
      // ******** PERCEPTRON IMPLEMENTATION... ******
      Float_t SynapticPotential = 0;
      for (int j = 0; j < prevnperc; j++)
        SynapticPotential += w[l][i][j] * Axion[j];
      SynapticPotential -= Theta[l][i];
      potential[l][i] = SynapticPotential;
      output[l][i] = transfer_function_sigmoid(SynapticPotential);
    }
    for (int i = 0; i < nperc; i++)         // this layer's outputs feed the next one
      Axion[i] = output[l][i];
    prevnperc = nperc;
  }
}
15. ANN Training
>> Biological neuron activation is trained by experience: the human brain
knows that specific neurons have to be activated only in correlation with
specific inputs.
>> Training is needed in order to fix the synaptic-connection weights ω_ij^(l).
In this way the ANN gives a specific response (y) in correlation with specific
inputs (x), avoiding the simple memorization of examples (the generalization
capability of an ANN):
- Supervised training: a training set x = {x, y} with correlated pairs x_i ↔ y_i is
used for the iterative correction of the
weights (gradient-descent algorithm).
- Unsupervised training: the training set is x = {x}, as is usual when the ANN
describes a pdf.
>> Training is made by iteratively correcting ω_ij^(l) each time the i-th input–output
pair is processed. A complete iteration over the training set x is called an "epoch".
16. Supervised: Gradient Descent
>> Gradient-descent algorithm (Widrow, Hoff, '70): iterations on the training set
have to minimize the Mahalanobis distance between the expected output y_i and the
i-th ANN output p_i^(L)(x).
The minimization iteratively corrects ω^(l) and θ^(l) in each layer, with learning rate η ∈ ]0,1[.
Gradient computation: Widrow-Hoff algorithm:
i) choose random values for ω_ij^(l), θ_i^(l), and a small η > 0;
ii) run the ANN and store p_i^(L)(x);
iii) compute the δ_i^(l), starting from l = L and back-propagating to l = 1;
iv) update ω_ij^(l) and θ_i^(l).
Stopping criterion: χ²(ω_ij^(l), θ_i^(l)) < ε. Estimated number of epochs: ~10^3–10^4 for ε ~ 10^−5.
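The update equations were displayed as images; a standard reconstruction of the back-propagation rules in the slide's notation (a_i^(l) is the synaptic potential of the i-th perceptron of layer l), matching steps iii) and iv) above and the C++ on the next slide:

\[
\delta_i^{(L)} = f'\!\bigl(a_i^{(L)}\bigr)\,\bigl[p_i^{(L)} - y_i\bigr],
\qquad
\delta_i^{(l)} = f'\!\bigl(a_i^{(l)}\bigr)\sum_j \delta_j^{(l+1)}\,\omega_{ji}^{(l+1)},
\]
\[
\omega_{ij}^{(l)} \;\leftarrow\; \omega_{ij}^{(l)} - \eta\,\delta_i^{(l)}\,p_j^{(l-1)},
\qquad
\theta_i^{(l)} \;\leftarrow\; \theta_i^{(l)} + \eta\,\delta_i^{(l)}.
\]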
17. Supervised: Gradient Descent (Δ)
>> An example of supervised MLP (2+5+3+1 topology) training code implemented in C++...
Float_t eta = 0.5;                      // learning rate, eta in ]0,1[
Float_t output[4][5];                   // output of every perceptron, per layer
Float_t potential[4][5];                // synaptic potential of every perceptron
Float_t Delta[4][5];                    // back-propagated deltas
const int nperclayer[4] = {2, 5, 3, 1}; // 2+5+3+1 topology
// x[2]: input pattern, y[1]: desired output; randomizer_threshold(),
// randomizer_weight() and derived_transfer_function() are assumed to be
// defined elsewhere, as in the original slide.

// initialize random weights and thresholds using two void functions...
for (int l = 0; l < 4; l++)
  for (int i = 0; i < nperclayer[l]; i++)
  {
    randomizer_threshold(Theta[l][i]);
    for (int j = 0; j < (l == 0 ? 2 : nperclayer[l - 1]); j++)
      randomizer_weight(w[l][i][j]);
  }
// MLP processing (saving every perceptron's output and potential) starting from inputs x[]...
MultiLayerPerceptron(x, output, potential);
// weights and thresholds corrections, using the desired output y[]...
for (int l = 3; l >= 0; l--)
  for (int i = 0; i < nperclayer[l]; i++)
  {
    if (l == 3)   // output layer: delta = f'(a) * (p - y)
      Delta[l][i] = derived_transfer_function(potential[l][i]) * (output[l][i] - y[i]);
    else          // hidden layers: back-propagate the deltas of layer l+1
    {
      Delta[l][i] = 0;
      for (int j = 0; j < nperclayer[l + 1]; j++)
        Delta[l][i] += Delta[l + 1][j] * w[l + 1][j][i];
      Delta[l][i] *= derived_transfer_function(potential[l][i]);
    }
    // update rule: a = sum_j(w_ij x_j) - theta, hence the opposite signs
    for (int j = 0; j < (l == 0 ? 2 : nperclayer[l - 1]); j++)
      w[l][i][j] -= eta * Delta[l][i] * (l == 0 ? x[j] : output[l - 1][j]);
    Theta[l][i] += eta * Delta[l][i];
  }
18. Unsupervised: "competitive" ANN
>> Try to obtain a supervised dataset x = {x_k ∈ ℝ^n, y_k ∈ ℝ^m : k = 1…n} starting from the
inputs {x_k ∈ ℝ^n} alone. The idea: try to obtain y from the ANN itself. Let ω_ij^(l) be the
weights, which have to be trained to values Ω_ij^(l), and y the outputs computed by the ANN.
The MLP approximates continuous functions (see later!) ⇒ Taylor expansion ⇒ locally monotonic;
assuming this, Bayes' theorem is used to pass from p(Ω_ij^(l)|y_k) to p(y_k|Ω_ij^(l)).
Likelihood maximization then corresponds to Mahalanobis-distance minimization.
The algorithm: initialize ω_ij^(l) and iteratively update towards Ω_ij^(l) !!!
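The equivalence invoked above, reconstructed in standard form (assuming a Gaussian p(y_k|Ω) with covariance R, which is what makes the log-likelihood quadratic):

\[
\max_{\Omega}\;\prod_k p(y_k|\Omega)
\;\Longleftrightarrow\;
\min_{\Omega}\;\sum_k \bigl[y_k - h(x_k;\Omega)\bigr]^{T} R^{-1} \bigl[y_k - h(x_k;\Omega)\bigr],
\]

where h(x_k; Ω) is the MLP output for input x_k.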
19. Fundamental Theorems on MLP
1) Principle of MLP universality (Cybenko, Lippmann, 1989):
∀ continuous z: ℝ^n → ℝ^m and ∀ ε ∈ ℝ⁺, ∃ an MLP h: ℝ^n → ℝ^m with hidden layers of sigmoids
and one linear output layer (p_i^(L) = (Σ_j ω_ij^(L) p_j^(L−1)) − θ_i^(L)) such that
|h(x) − z(x)| < ε ∀ x ∈ ℝ^n.
Note that the theorem says "at least one MLP exists..." but does not specify the topology, which
has to be chosen by the user.
2) Corollary of the principle of MLP universality (E. Trentin, 2001):
∃ an MLP h: ℝ^n → ℝ^m with hidden layers of sigmoids and one output layer with a
sigmoid of the kind f(a) = m/(1 + e^(−a)), with m a trainable parameter, which allows h to
satisfy the principle of MLP universality with h ∈ [0, +∞[.
3) Lippmann-Richard theorem (1996):
It is possible to use an MLP as a Bayesian estimator for gender classification,
provided that:
i) p_i^(L)(x) / Σ_j p_j^(L)(x) is used as the estimator of p(ω_i|x);
ii) the MLP is trained supervised with many training pairs {x, 1} or {x, 0},
according to whether x ∈ ω_i or x ∉ ω_i.
20. Non-parametric pdf estimate with MLP
>> Starting from the (unsupervised) n-dimensional training set x = {x}, try to find a non-
parametric estimate of the 1-dimensional pdf (or distribution function).
Principle of universality: an MLP h(x) can estimate a pdf (f(x) ∈ [0,1]) if sigmoids are used
(and the last layer must contain only one perceptron). The condition
|h(x) − f(x)| < ε ∀ ε ∈ ℝ⁺, ∀ x ∈ ℝ
ensures the convergence to 0 of the Mahalanobis (or χ², or Euclidean) distance in the
gradient-descent minimization. So a trainable MLP can be used to reproduce the pdf...
The training: i) initialize the MLP h(x) with random ω_ij^(l), θ_i^(l);
ii) create n datasets: x_i = x \ {x_i};
iii) use PWs to estimate n outputs: h = h*/√n, V = h^d, y_i = (n−1)^(−1) Σ_{x∈x_i} V^(−1) φ((x_i − x)/h);
iv) a new training set T = {x, y} = {(x_i, y_i) : i = 1…n} is obtained;
v) train h(x) with T using the Widrow-Hoff algorithm.
If needed, h(x) can be normalized to 1 by multiplying by a constant N such that N ∫dx h(x) = 1.
:) weaker dependence on n, h* than PWs, k_n-NN :( a large range of x is needed for a good training
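A minimal C++ sketch of steps ii)–iv) above, building the Parzen-based targets y_i for the MLP training set (the 1-D case and the box window are illustrative choices, and all names are mine, not the deck's):

#include <cmath>
#include <utility>
#include <vector>

// Box Parzen window, phi(u) = 1 for |u| <= 1/2 (d = 1).
double phi(double u) { return (std::fabs(u) <= 0.5) ? 1.0 : 0.0; }

// Build the supervised training set T = {(x_i, y_i)} from the unsupervised sample:
// y_i is the leave-one-out Parzen estimate of the pdf at x_i, with h = h*/sqrt(n).
std::vector<std::pair<double, double>>
build_training_set(const std::vector<double>& x, double hstar)
{
    const std::size_t n = x.size();
    const double h = hstar / std::sqrt(static_cast<double>(n));
    const double V = h;                       // V = h^d, with d = 1 here
    std::vector<std::pair<double, double>> T;
    for (std::size_t i = 0; i < n; i++) {
        double yi = 0.0;
        for (std::size_t j = 0; j < n; j++)
            if (j != i)                       // x_i itself is excluded (x_i = x \ {x_i})
                yi += phi((x[i] - x[j]) / h) / V;
        T.push_back({x[i], yi / (n - 1.0)});  // y_i = (n-1)^-1 * sum
    }
    return T;                                 // then train h(x) on T (Widrow-Hoff)
}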
21. pdf estimate with MLP: an example
>> Try to estimate the distribution function f(x) = e^(−x) describing the set of data
x = {x ∈ ℝ : x > 0} (comparison between PWs & MLP):
[Plots: PW vs MLP estimates of f(x) = e^(−x) for h* = 0.1, 0.2, 0.3 and n = 10, 100, 1000.]
22. Back to the Introduction...
1) Find an analytical expression for a distribution function or a pdf
describing a set of data x = {x}.
>> MLPs with unsupervised training allow one to estimate pdf's in an unbiased
and non-parametric way, provided that a large range in x is used.
2) Gender classification: N-hypothesis {ω = ω_1…ω_N} test based on a set
of data x = {x} ⊂ X.
>> An MLP can be used as a Bayesian gender classifier (i.e. the MLP provides outputs
p_i^(L)(x) ≈ p(ω_i|x)) thanks to the Lippmann-Richard theorem, once x is tagged
with gender classes (x = {x_i|ω_i}).
The goals
23. Part III
1) Study of the evolution of the deceleration parameter with a non-
parametric estimate of the equation of state of the geometry of the universe.
>> An MLP is used for a non-parametric fit of the evolution of the Hubble constant H(z).
>> As a by-product, a good estimate of the Hubble constant H_0.
Two ANN-based Analysis Proposals
2) Non-Bayesian approach for the identification of mini-BH evaporation at the
LHC and the estimate of the fundamental parameters M* and n_exdim of gravity
at the TeV scale.
>> Two MLPs used as gender classifiers, first to identify mini-BH events,
then to estimate M* and n_exdim.
24. Analysis 1)
Study of the evolution of the deceleration parameter with a non-parametric
estimate of the equation of state of the geometry of the universe.
>> The Hubble constant H(t) = Ṙ(t)/R(t) and the deceleration parameter q(t) = −R̈(t)/[R(t)H(t)²]
are related by the Einstein equations in the form reconstructed below, where H_0 is the current
value of the Hubble constant (~65–75 km s^−1 Mpc^−1) and z is the Doppler redshift. The q(z)
parametrization is not uniquely known; the most used is the ΛCDM model:
q(z) = [Ω_m(1+z)^3 − 2(1−Ω_m)] / {2[Ω_m(1+z)^3 + 1 − Ω_m]}, where Ω_m is the matter density.
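The relation itself appeared as an image; a reconstruction under the assumption that it showed the usual kinematic integral expression:

\[
H(z) = H_0 \exp\!\left[\int_0^z \frac{1+q(z')}{1+z'}\,dz'\right].
\]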
>> The parametric estimate is done by fitting the observed supernovae redshifts (extinction-
corrected distance modulus), i.e. minimizing the χ² distribution.
>> State of affairs: the parametric estimate was found to depend strongly on H_0. As an
alternative, we propose a non-parametric estimate of q(z) with an MLP, where H_0 is trained
unsupervised (note that the Mahalanobis distance coincides with χ²), following Trentin's corollary.
G.B., D. Tommasini (University & INFN of Florence)
26. Analysis 2)
Non-Bayesian approach for the identification of mini-BH evaporation at the LHC
and the estimate of the fundamental parameters M* and n_exdim of gravity at the TeV
scale.
>> LED scenario, proposed by Arkani-Hamed, Dimopoulos & Dvali (ADD model): n_exdim extra
dimensions with a compactification radius R are introduced. Gravitons are obtained by expanding
the Einstein-Hilbert action in terms of the (4+n_exdim)-dimensional metric tensor
g_MN = (g^(4)_μν, A_μ(y), h_μν(y)) and are seen as KK modes.
>> This brings a re-definition of the Planck mass: the four-dimensional
M_Pl = G_4^(−1/2) ~ 10^16 TeV is replaced by a fundamental scale M* ~ TeV.
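The re-definition quoted above was shown as a formula; up to O(1) factors, the standard ADD relation between the two scales is (a reconstruction, not taken from the slide):

\[
M_{\mathrm{Pl}}^{2} \;\sim\; M_*^{\,2+n_{\mathrm{exdim}}}\, R^{\,n_{\mathrm{exdim}}},
\]

so M* can sit at the TeV scale if R is large enough.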
>> Following the hoop conjecture, BHs are formed in pp collisions when the fraction
M̂ = x_a x_b s^(1/2) of the collision energy satisfies the hoop condition ⇒
SM particles emitted in the evaporation!!
>> State of affairs: study of the observables (M_ll, visible p_T, E_T, sphericity) through MC
simulation of SUSY, SM and mini-BH events, used for the identification and the M*, n_exdim estimate.
G.B., F. Coradeschi (University & INFN of Florence)
[Diagram: mini-BH evolution — multipole excess, Planck phase, Hawking phase, hadronization.]
[Plots: dilepton mass M_l+l− distributions for BHs (Planck phase), BHs (no Planck phase) and SUSY; 2nd Fox-Wolfram moment for n_P = 2 vs n_P = 0.]
28. Conclusions
1) pdf's or distribution functions can be estimated analytically, without
any a-priori knowledge of the structure of f(·), using MLPs.
2) Gender classification: an MLP can be used as a non-Bayesian classifier.
The Lippmann-Richard theorem gives a direct quantitative estimator for
pattern recognition.
3) MLPs could be used to estimate important cosmological constants affected
by large errors in the standard parametric estimates.
4) MLPs could be used in particle physics, where the phenomenology has a high
multiplicity and changes with the reference model, to obtain unbiased classifiers.