This document describes experiments on using a general meta-heuristic optimizer to solve six different vehicle routing problem (VRP) types. The optimizer uses a discrete event simulation model, user-defined cost functions, and a genetic algorithm with local search heuristics like M1 disjoint search. Experiments show the optimizer finds solutions within 0.5-7.4% of best known solutions for problems with up to around 100-200 requests, filling a niche as a flexible approach for a range of small to medium VRP instances.
1) The document describes a MATLAB orientation course organized by FOCUS-R&D.
2) The course covers fundamentals of MATLAB including programming basics, plotting, statistical analysis, numerical analysis, and symbolic mathematics.
3) It provides information on MATLAB's basic window, help features, GUI, toolboxes including Simulink, and documentation set.
This document discusses introducing a new data type called "optional" in IBM Streams to support variables that can have a null value. It describes the new optional type, how to declare optional variables, the new null literal, operators for accessing optional values (??, !, ?:) and checking for nullness (??). It also discusses changes needed to existing IBM Streams operators and toolkits like JSON, JDBC, and ObjectStorage to support working with optional type data, as well as API changes. The goal is to allow representing missing or unknown data values in IBM Streams applications.
This chapter discusses arrays and strings in C++. It covers topics such as declaring and initializing one-dimensional and multi-dimensional arrays, accessing array elements, passing arrays to functions, and built-in functions for manipulating C-strings. The chapter also introduces parallel arrays, arrays of strings, and common array processing tasks like searching, sorting, summing elements, and finding maximum/minimum values.
This document discusses theorem proving in propositional and first-order logic. It covers topics like:
- Propositional logic, including syntax, semantics, and techniques like truth tables, natural deduction, resolution, and DPLL algorithms.
- Converting propositional formulas to conjunctive normal form.
- Applications of theorem proving like hardware and software verification.
- First-order logic, including its syntax of terms, formulas, and semantics based on interpretations and variable assignments in a domain.
The document provides an overview of fundamental concepts and techniques in automated theorem proving at the propositional and first-order levels. It also mentions applications and references for further study.
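The truth-table method mentioned above can be sketched in a few lines of Python; this is an illustrative toy (names like `is_tautology` are invented here), exhaustively enumerating all assignments, which mirrors the method's exponential cost:

```python
from itertools import product

def is_tautology(formula, variables):
    """Check a propositional formula by truth-table enumeration.

    `formula` is a callable taking one boolean per variable; the loop is
    exponential in len(variables), as the truth-table method always is.
    """
    return all(formula(*values)
               for values in product([False, True], repeat=len(variables)))

# Material implication, defined from the basic connectives.
implies = lambda a, b: (not a) or b

# (p -> q) <-> (not p or q) holds for every assignment; p -> q does not.
print(is_tautology(lambda p, q: implies(p, q) == ((not p) or q), ["p", "q"]))  # True
print(is_tautology(lambda p, q: implies(p, q), ["p", "q"]))                    # False
```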
Non-linear classification models commonly rely on kernel functions. These models are highly dependent on the training (labeled) data set, so a model and its underlying kernel must adapt to the most recent labeled observations.
This presentation describes a solution for automating the evaluation and selection of a kernel function appropriate to a specific training set during online training.
This document discusses theorem proving and its applications. It covers several topics:
1. Theorem proving can be useful for mathematics, hardware verification, safety-critical systems, automated planning, and cryptanalysis. Early implementations included proving evenness of sums.
2. Propositional logic is a simple but important logic that is decidable and can model many problems. Techniques for checking tautologies include truth tables, natural deduction, and converting to SAT problems solved by SAT solvers.
3. Modern SAT solvers use techniques like conflict-driven clause learning and non-chronological backtracking to efficiently handle large problems from applications like electronic design automation.
We try to solve the Vehicle Routing Problem by using the Artificial Bee Colony (ABC) algorithm--an optimisation algorithm that mimics the swarm intelligence of bees in nature. We implement this algorithm in parallel over several cores and present a comparative study of the results.
This chapter covers overloading operators and templates in C++. It discusses overloading operators as member and non-member functions, and the restrictions on operator overloading. The chapter also explains templates for functions and classes, which allow writing generic code for related types. Pointers, friend functions, and classes with pointer members are additionally addressed.
This presentation is part of the COP2272C college-level course taught at Florida Polytechnic University in Lakeland, Florida. The purpose of this course is to introduce students to the C++ language and the fundamentals of object-oriented programming.
The course is one semester in length and meets for 2 hours twice a week. The Instructor is Dr. Jim Anderson.
The document discusses applications of stacks, including reversing strings and lists, Polish notation for mathematical expressions, converting between infix, prefix and postfix notations, evaluating postfix and prefix expressions, recursion, and the Tower of Hanoi problem. Recursion involves defining a function in terms of itself, with a stopping condition. Stacks can be used to remove recursion by saving local variables at each step.
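The postfix-evaluation application can be sketched directly: push operands; on an operator, pop the top two operands, apply, and push the result. A minimal Python version (function and token names are invented for illustration):

```python
def eval_postfix(tokens):
    """Evaluate a postfix (reverse Polish) expression using a stack."""
    ops = {"+": lambda a, b: a + b,
           "-": lambda a, b: a - b,
           "*": lambda a, b: a * b,
           "/": lambda a, b: a / b}
    stack = []
    for tok in tokens:
        if tok in ops:
            b = stack.pop()        # right operand sits on top
            a = stack.pop()
            stack.append(ops[tok](a, b))
        else:
            stack.append(float(tok))
    return stack.pop()

# "3 4 2 * +" is postfix for the infix expression 3 + 4 * 2.
print(eval_postfix("3 4 2 * +".split()))  # 11.0
```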
This chapter discusses pointers, classes, virtual functions, and abstract classes in C++. Pointers contain the addresses of other variables and can be used to access dynamic memory. The address of and dereferencing operators are used to work with pointers. Classes and structs can have pointer member variables. Virtual functions allow dynamic binding at runtime rather than compile-time. Abstract classes define pure virtual functions that derived classes must implement.
This presentation describes some key features of Scala used in the creation of machine learning algorithms:
1. Functorial definition of tensors for learning non-linear models (manifolds)
2. Monads to compose explicit kernel functions in Euclidean space
3. Implicit classes to extend the Scala standard library
4. Stackable traits and dependency injection to build formal models and dynamic workflows
5. Tail recursion to implement dynamic programming techniques
6. Streaming to reduce memory consumption for big data
6. Streaming to reduce memory consumption for big data
7. Control of back pressure in data flows
http://patricknicolas.blogspot.com
http://bit.ly/12GjRu9
This document summarizes Praveen Varma's seminar on slicing object-oriented programs. It introduces different types of program slicing including static, dynamic, backward, and forward slicing. It discusses approaches to slicing like control flow graph based and dependence graph based methods. It also covers interprocedural slicing, static and dynamic slicing of object-oriented programs, and applications of program slicing. Limitations of existing work on object-oriented program slicing are presented along with the motivation, objectives, and work done on developing more efficient slicing techniques for object-oriented programs.
This document discusses classic model checking algorithms for checking properties expressed in linear temporal logic (LTL), computational tree logic (CTL), and CTL* against models expressed as finite state machines or Kripke structures. It describes CTL model checking, which aims to establish if a model satisfies a specification. The algorithm works by labeling states with subformulas and building the parse tree bottom-up. Complexity is linear in the size of the model and exponential in the size of the formula. LTL model checking constructs a product automaton to check if the system satisfies the property.
This chapter discusses classes and data abstraction in C++. It covers objectives like learning about classes, private/public/protected class members, accessor and mutator functions, constructors and destructors. The key topics covered include defining classes, declaring class objects, accessing class members, passing class objects as function parameters, implementing member functions, and using constructors to initialize class objects. The chapter aims to explain how classes are used to implement abstract data types and differences between classes and structs in C++.
This document discusses stacks and queues. It describes their properties and common operations like push, pop, enqueue, and dequeue. It provides implementations of stacks and queues as both arrays and linked lists. Specific algorithms covered include using a stack to evaluate postfix expressions, and a non-recursive algorithm to print a linked list backwards using a stack. The key advantages of stacks (LIFO) and queues (FIFO) as data structures are explained.
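The non-recursive backwards-print algorithm mentioned above can be sketched in Python (the `(value, next)` pair representation of a linked node is an assumption made here for brevity):

```python
def backwards(head):
    """Collect a singly linked list's values in reverse without recursion:
    push every value onto a stack while walking forward, then pop them all,
    exploiting the stack's LIFO order."""
    stack = []
    node = head
    while node is not None:
        stack.append(node[0])      # node is a (value, next) pair
        node = node[1]
    return [stack.pop() for _ in range(len(stack))]

# Build the list 1 -> 2 -> 3 as nested (value, next) pairs.
lst = (1, (2, (3, None)))
print(backwards(lst))  # [3, 2, 1]
```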
This document provides an overview of model checking, a technique used to verify that a system meets its specifications. It discusses how model checking is an automatic and model-based approach to verify that a system model satisfies given properties. The document also describes linear temporal logic (LTL) and computational tree logic (CTL), the two main logics used to specify properties in model checking. It introduces the syntax of LTL and CTL and explains some of their temporal operators. Finally, it mentions NuSMV, a model checking tool that can check if LTL and CTL formulas are valid on system models, returning either "yes" or counterexamples.
An older presentation I gave on temporal logic and model checking. Note that the diamond operator (signifying eventuality) does not appear properly in the uploaded slide.
CSCI 2033: Elementary Computational Linear Algebra
(Spring 2020)
Assignment 1 (100 points)
Due date: February 21st, 2019 11:59pm
In this assignment, you will implement Matlab functions to perform row
operations, compute the RREF of a matrix, and use it to solve a real-world
problem that involves linear algebra, namely GPS localization.
For each function that you are asked to implement, you will need to complete
the corresponding .m file with the same name that is already provided to you in
the zip file. In the end, you will zip up all your complete .m files and upload the
zip file to the assignment submission page on Gradescope.
In this and future assignments, you may not use any of Matlab’s built-in
linear algebra functionality like rref, inv, or the linear solve function A\b,
except where explicitly permitted. However, you may use the high-level array
manipulation syntax like A(i,:) and [A,B]. See “Accessing Multiple Elements”
and “Concatenating Matrices” in the Matlab documentation for more information.
However, you are allowed to call a function you have implemented in this
assignment to use in the implementation of other functions for this assignment.
Note on plagiarism A submission with any indication of plagiarism will be
directly reported to the University. Copying others’ solutions or letting another
person copy your solutions will be penalized equally. Protect your code!
1 Submission Guidelines
You will submit a zip file that contains the following .m files to Gradescope.
Your filename must be in this format: Firstname Lastname ID hw1 sol.zip
(please replace the name and ID accordingly). Failing to do so may result in
points lost.
• interchange.m
• scaling.m
• replacement.m
• my_rref.m
• gps2d.m
• gps3d.m
• solve.m
The code should be stand-alone. No credit will be given if the function does not
comply with the expected input and output.
Late submission policy: 25% off up to 24 hours late; 50% off up to 48 hours late;
no credit for more than 48 hours late.
2 Elementary row operations (30 points)
As this may be your first experience with serious programming in Matlab,
we will ease into it by first writing some simple functions that perform the
elementary row operations on a matrix: interchange, scaling, and replacement.
In this exercise, complete the following files:
function B = interchange(A, i, j)
Input: a rectangular matrix A and two integers i and j.
Output: the matrix resulting from swapping rows i and j, i.e. performing the
row operation Ri ↔ Rj.
function B = scaling(A, i, s)
Input: a rectangular matrix A, an integer i, and a scalar s.
Output: the matrix resulting from multiplying all entries in row i by s, i.e.
performing the row operation Ri ← sRi.
function B = replacement(A, i, j, s)
Input: a rectangular matrix A, two integers i and j, and a scalar s.
Output: the matrix resulting from adding s times row j to row i, i.e. performing
the row operation Ri ← Ri + sRj.
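The three elementary row operations can be sketched in Python with NumPy; this is an illustrative translation, not the required MATLAB .m files, and rows are 1-indexed here to match the assignment's notation:

```python
import numpy as np

def interchange(A, i, j):
    """Swap rows i and j (Ri <-> Rj); i and j are 1-indexed."""
    B = A.astype(float).copy()
    B[[i - 1, j - 1]] = B[[j - 1, i - 1]]
    return B

def scaling(A, i, s):
    """Multiply every entry of row i by the scalar s (Ri <- s*Ri)."""
    B = A.astype(float).copy()
    B[i - 1] *= s
    return B

def replacement(A, i, j, s):
    """Add s times row j to row i (Ri <- Ri + s*Rj)."""
    B = A.astype(float).copy()
    B[i - 1] += s * B[j - 1]
    return B

A = np.array([[1.0, 2.0], [3.0, 4.0]])
print(interchange(A, 1, 2))      # rows swapped
print(replacement(A, 2, 1, -3))  # [[1, 2], [0, -2]]
```

Each function copies its input rather than mutating it, so repeated calls compose cleanly when building up an RREF routine.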
This document discusses linear regression parameters and how to implement linear regression in Weka. It explains that linear regression is used when the outcome is numeric and all attributes are numeric. It can express the class as a linear combination of weighted attributes. The document then reviews options specific to Weka's linear regression classifier, including producing debugging output, attribute selection methods, eliminating collinear attributes, and setting the ridge parameter.
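As a sketch of what a ridge parameter does mathematically, the weights can be computed from the regularized normal equations w = (XᵀX + λI)⁻¹Xᵀy, where the λ term stabilizes the solve when attributes are nearly collinear. This is a generic illustration in Python, not Weka's actual implementation, and the function name is invented:

```python
import numpy as np

def ridge_fit(X, y, ridge=1e-8):
    """Linear regression weights via regularized normal equations.

    Appends an intercept column, then solves
    (X^T X + ridge * I) w = X^T y; a small ridge keeps the system
    well-conditioned when attributes are (nearly) collinear.
    """
    X1 = np.hstack([X, np.ones((X.shape[0], 1))])  # intercept column
    d = X1.shape[1]
    return np.linalg.solve(X1.T @ X1 + ridge * np.eye(d), X1.T @ y)

# Exactly linear data: y = 2*x + 1, so the fit recovers [2, 1].
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])
print(ridge_fit(X, y))  # approximately [2. 1.]
```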
This document provides a summary of online algorithms and introduces various tools for analyzing online algorithms, including potential functions, work functions, linear programming, and the classify and randomly select technique. It begins with an example of the ski rental problem and how it can be solved optimally using different online algorithms. It then outlines the main topics covered and provides examples to illustrate each technique. Potential functions are introduced using a list reorganization problem. Work functions are explained using a file migration problem on a graph. Linear programming is demonstrated for a fractional set cover problem. Finally, classify and randomly select is presented as the last technique for analyzing online algorithms.
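The ski rental example admits a very short sketch of the classic break-even strategy: rent until the rent paid would reach the purchase price, then buy, which bounds the cost at roughly twice the offline optimum. A toy Python version (parameter names are invented for illustration):

```python
def ski_rental(rent, buy, days_skied):
    """Break-even online strategy for ski rental.

    Rent for buy // rent days, then buy; without knowing days_skied in
    advance, the total paid is at most about twice the offline optimum
    min(rent * days_skied, buy), i.e. the strategy is 2-competitive.
    """
    threshold = buy // rent          # days to rent before buying
    if days_skied <= threshold:
        return rent * days_skied     # never reached the break-even point
    return rent * threshold + buy    # rented, then bought

# rent=1, buy=10: the offline optimum for 30 days is 10; online pays 20.
print(ski_rental(1, 10, 30))  # 20
print(ski_rental(1, 10, 5))   # 5
```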
This document provides instruction on determining the domain and range of functions, identifying functions using mapping and the vertical line test, and interpreting graphs. It includes examples of determining domain and range for various functions defined by equations. Warm-up questions are provided to have students plot points and draw lines from equations. Students are reminded that their class work CW 3.4 is due for full credit.
This document discusses operator precedence parsers and how they are constructed. An operator precedence parser is built for operator precedence grammars, which do not contain epsilon productions or adjacent nonterminals on the right-hand side of productions. It describes how to generate a precedence function by partitioning symbols into groups based on precedence relations and building a directed graph to determine the longest paths. An example grammar is provided to illustrate how an operator precedence relation table and precedence relations graph are constructed from it. Finally, it notes some pros and cons of operator precedence grammars and how error recovery is handled.
Gradient descent is an optimization algorithm used to minimize a cost function by iteratively adjusting parameter values in the direction of the steepest descent. It works by calculating the derivative of the cost function to determine which direction leads to lower cost, then taking a step in that direction. This process repeats until reaching a minimum. Gradient descent is simple but requires knowing the gradient of the cost function. Backpropagation extends gradient descent to neural networks by propagating error backwards from the output to calculate gradients to update weights.
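The update rule described above is just x ← x − η·∇f(x), repeated until convergence. A minimal sketch on a one-dimensional quadratic, whose gradient is known in closed form:

```python
def gradient_descent(grad, x0, lr=0.1, steps=100):
    """Minimize a function by repeatedly stepping against its gradient:
    x <- x - lr * grad(x). Assumes the caller supplies the gradient."""
    x = x0
    for _ in range(steps):
        x = x - lr * grad(x)
    return x

# f(x) = (x - 3)^2 has gradient 2*(x - 3) and its minimum at x = 3.
x_min = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
print(round(x_min, 4))  # 3.0
```

Backpropagation applies exactly this rule to every weight in a network, with the gradients obtained by propagating errors backward through the layers.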
This document provides an introduction to programming in C, specifically covering operators and expressions, data input and output functions, and control statements. It discusses various operators in C like arithmetic, relational, logical, and assignment operators. It also describes common library functions for input like scanf and output like printf that allow entering and displaying data. Finally, it discusses control statements for branching and looping flow control.
This document summarizes two projects completed by an intern during a rotation at Analog Devices: 1) Implementing a graphical debugger in Python for an evaluation board's software. The new debugger has buttons and highlights the current line, providing a more visual interface compared to the previous debugger. 2) Using Excel to model and compare low-pass filters, including modeling the transfer function of an amplifier, cascaded first-order RC filters, and a Sallen-Key filter topology. The intern gained experience with Python, Excel, circuit analysis, and electrical design through these projects.
This document summarizes a presentation on sparse additive models (SpAM). SpAM combines ideas from sparse linear models and additive nonparametric regression to allow for nonlinear relationships while imposing sparsity. The key points covered include:
1) SpAM extends additive nonparametric regression by adding a sparsity constraint, treating it as a functional version of group lasso.
2) A sparse backfitting algorithm is derived based on iteratively soft-thresholding residual updates to estimate the dimensional functions.
3) Theorem 1 shows the sparse backfitting algorithm estimates the SpAM by soft-thresholding the projection of residuals onto smoothing matrices, inducing sparsity in a similar manner to lasso.
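The shrinkage step at the core of that backfitting update can be sketched as a group soft-thresholding operator (a generic illustration; the threshold `lam` and the sample vectors are placeholders, not values from the paper):

```python
import numpy as np

def soft_threshold(v, lam):
    """Shrink the vector v toward zero; zero it entirely if ||v|| <= lam."""
    norm = np.linalg.norm(v)
    if norm <= lam:
        return np.zeros_like(v)       # the whole component drops out: sparsity
    return (1.0 - lam / norm) * v     # otherwise shrink its magnitude

shrunk = soft_threshold(np.array([3.0, 4.0]), lam=2.5)  # norm 5 -> scaled by 0.5
zeroed = soft_threshold(np.array([0.3, 0.4]), lam=2.5)  # norm 0.5 <= 2.5 -> zeros
```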
Alpine Data Labs presents a deep dive into its implementation of multinomial logistic regression with Apache Spark. Machine Learning Engineer DB Tsai walks through the technical implementation details step by step. First, he explains how the state of the art of machine learning on Hadoop falls short of the promise of Big Data. Next, he explains how Spark is a strong match for machine learning through its in-memory caching capability, demonstrating up to 100x performance improvements. Third, he takes us through each aspect of a multinomial logistic regression and how it is developed with the Spark APIs. Fourth, he demonstrates an extension of MLOR and its training parameters. He then benchmarks MLOR with 11M rows, 123 features, and 11% non-zero elements on a 5-node Hadoop cluster. Finally, he shows Alpine's unique visual environment with Spark and verifies the performance with the job tracker. In conclusion, Alpine supports the state-of-the-art Cloudera and Pivotal Hadoop clusters and performs at a level that far exceeds its nearest competitor.
Multinomial Logistic Regression with Apache Spark, by DB Tsai
Logistic regression can be used not only for modeling binary outcomes but also multinomial outcomes, with some extension. In this talk, DB walks through the basic idea of binary logistic regression step by step, and then extends it to the multinomial case. He shows how easy it is with Spark to parallelize this iterative algorithm by utilizing the in-memory RDD cache to scale horizontally (in the number of training samples). However, there is a mathematical limitation on scaling vertically (in the number of training features), while many recent applications from document classification and computational linguistics are of exactly this type. He discusses how to address this problem with the L-BFGS optimizer instead of the Newton optimizer.
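The binary case with an L-BFGS optimizer can be sketched with NumPy and SciPy; the data, dimensions, and random seed below are invented for illustration and are not taken from the talk:

```python
import numpy as np
from scipy.optimize import minimize

# Toy training data: 200 samples, 3 features, labels from a noisy linear rule.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = (X @ true_w + rng.normal(scale=0.1, size=200) > 0).astype(float)

def neg_log_likelihood(w):
    z = X @ w
    # sum of log(1 + exp(z)) - y*z, written stably with logaddexp
    return np.sum(np.logaddexp(0.0, z) - y * z)

def gradient(w):
    p = 1.0 / (1.0 + np.exp(-(X @ w)))  # predicted probabilities
    return X.T @ (p - y)

# Each L-BFGS iteration needs only the loss and gradient, both of which
# parallelize over the rows of X (the horizontal-scaling point of the talk).
res = minimize(neg_log_likelihood, np.zeros(3), jac=gradient, method="L-BFGS-B")
w_hat = res.x
```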
Bio:
DB Tsai is a machine learning engineer at Alpine Data Labs. He has recently been working with the Spark MLlib team to add support for the L-BFGS optimizer and multinomial logistic regression upstream. He also led Apache Spark development at Alpine Data Labs. Before joining Alpine Data Labs, he worked on large-scale optimization of optical quantum circuits at Stanford as a PhD student.
This document summarizes key parts of Java 8 including lambda expressions, method references, default methods, streams API improvements, removal of PermGen space, and the new date/time API. It provides code examples and explanations of lambda syntax and functional interfaces. It also discusses advantages of the streams API like lazy evaluation and parallelization. Finally, it briefly outlines the motivation for removing PermGen and standardizing the date/time API in Java 8.
1. The document discusses functions in Python including types of functions, arguments, parameters, scope of variables, and returning values from functions.
2. Functions allow you to organize and reuse code, and in Python are defined using the def keyword. Arguments pass information into a function as variables called parameters.
3. Variables can have local or global scope depending on whether they are defined inside or outside of a function. The global keyword is used to read or write global variables inside a function.
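The points above can be sketched in a short example (the function and variable names are invented for illustration):

```python
counter = 0  # a global variable

def increment(step=1):
    """The parameter `step` receives the argument passed in."""
    global counter  # needed to rebind the global variable from inside a function
    counter += step
    return counter

def shadow():
    counter = 100  # a new local variable; the global one is untouched
    return counter

increment()          # counter is now 1
increment(step=4)    # counter is now 5
local_value = shadow()
```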
The fmincon function finds the minimum of a constrained nonlinear multivariable function subject to bounds and linear and nonlinear constraints. It uses either a medium-scale or large-scale algorithm depending on whether the gradient of the objective function is provided. The user defines the objective function fun and optional nonlinear constraints nonlcon, both of which can return gradient and Hessian information. Fmincon returns the optimized values, objective function value, exit condition, and additional output details.
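fmincon itself is a MATLAB function, but the same shape of problem can be sketched in Python with scipy.optimize.minimize; the objective, bounds, and linear constraint below are invented for illustration:

```python
import numpy as np
from scipy.optimize import minimize

# Minimize (x-1)^2 + (y-2.5)^2 subject to x + y <= 3 and 0 <= x, y <= 5.
def objective(v):
    x, y = v
    return (x - 1) ** 2 + (y - 2.5) ** 2

def objective_grad(v):
    x, y = v
    return np.array([2 * (x - 1), 2 * (y - 2.5)])  # user-supplied gradient

constraints = [{"type": "ineq", "fun": lambda v: 3 - v[0] - v[1]}]  # x + y <= 3
bounds = [(0, 5), (0, 5)]

res = minimize(objective, x0=np.array([2.0, 0.0]), jac=objective_grad,
               method="SLSQP", bounds=bounds, constraints=constraints)
# res carries the optimized point, objective value, and exit condition,
# analogous to fmincon's multiple return values.
```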
This document presents an algorithm for robust pole placement and optimized cost function using linear quadratic regulator (LQR) design. The algorithm chooses optimal weighting matrices Q and R to place closed-loop poles near desired locations while minimizing performance and control costs. If desired poles lie outside the achievable LQR region, it finds poles closest to the desired ones. The solution uses gradient search to minimize a weighted eigenvalue difference cost function. The algorithm is demonstrated by applying it to an F-4 aircraft model. Results show it accurately places poles closer to desired locations than existing methods while providing more robustness.
Robust Low-rank and Sparse Decomposition for Moving Object Detection, by ActiveEon
Presentation summary:
* Moving object detection by background modeling and subtraction.
* Solved and unsolved challenges.
* Framework for low-rank and sparse decomposition.
* Some applications of RPCA on:
* * Background modeling and foreground separation.
* * Very dynamic background.
* * Multidimensional and streaming data.
* LRSLibrary1 + demo.
Intro to Reinforcement learning - part III, by Mikko Mäkipää
Introduction to Reinforcement Learning, part III: Basic approximate methods
This is the final presentation in a three-part series covering the basics of Reinforcement Learning (RL).
In this presentation, we introduce value function approximation and cover three different approaches to generating features for linear models.
We then take a sidestep to cover stochastic gradient descent in some detail before we return to introduce semi-gradient descent for RL. We also briefly cover a batch method as an alternative for episodic methods.
We discuss the implementation of the RL algorithms. For further discussion and illustrating the simulation results, we refer to Github repositories with source code of the implementation as well as Jupyter notebooks visualizing the simulation results.
Functional Programming Concepts for Imperative Programmers, by Chris
The document discusses functional programming concepts including the origins of the λ-calculus and Lisp. It covers functions as data, lambda expressions, closures, function composition, and higher-order functions. Examples are provided in JavaScript and Scala of implementing functions like fold to operate on lists. While many functional concepts are covered, topics like currying, monads, and lazy evaluation are noted but not discussed in detail.
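The fold idea mentioned above can be sketched in Python rather than the talk's JavaScript/Scala (a generic illustration; the helper name `fold` is invented):

```python
from functools import reduce

# A fold expressed with a higher-order function: the combining function is
# passed in as data, and the accumulator threads through the list.
def fold(combine, initial, items):
    acc = initial
    for item in items:
        acc = combine(acc, item)
    return acc

total = fold(lambda acc, x: acc + x, 0, [1, 2, 3, 4])   # sums the list
same = reduce(lambda acc, x: acc + x, [1, 2, 3, 4], 0)  # stdlib equivalent
```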
Similar to Method of Potential Function as Feature Choice Criterion in Alpha Procedure (20)
Anti-Universe And Emergent Gravity and the Dark Universe, by Sérgio Sacani
Recent theoretical progress indicates that spacetime and gravity emerge together from the entanglement structure of an underlying microscopic theory. These ideas are best understood in Anti-de Sitter space, where they rely on the area law for entanglement entropy. The extension to de Sitter space requires taking into account the entropy and temperature associated with the cosmological horizon. Using insights from string theory, black hole physics and quantum information theory we argue that the positive dark energy leads to a thermal volume law contribution to the entropy that overtakes the area law precisely at the cosmological horizon. Due to the competition between area and volume law entanglement the microscopic de Sitter states do not thermalise at sub-Hubble scales: they exhibit memory effects in the form of an entropy displacement caused by matter. The emergent laws of gravity contain an additional ‘dark’ gravitational force describing the ‘elastic’ response due to the entropy displacement. We derive an estimate of the strength of this extra force in terms of the baryonic mass, Newton’s constant and the Hubble acceleration scale a0 = cH0, and provide evidence for the fact that this additional ‘dark gravity force’ explains the observed phenomena in galaxies and clusters currently attributed to dark matter.
CLASS 12th CHEMISTRY SOLID STATE ppt (Animated), by eitps1506
Description:
Dive into the fascinating realm of solid-state physics with our meticulously crafted online PowerPoint presentation. This immersive educational resource offers a comprehensive exploration of the fundamental concepts, theories, and applications within the realm of solid-state physics.
From crystalline structures to semiconductor devices, this presentation delves into the intricate principles governing the behavior of solids, providing clear explanations and illustrative examples to enhance understanding. Whether you're a student delving into the subject for the first time or a seasoned researcher seeking to deepen your knowledge, our presentation offers valuable insights and in-depth analyses to cater to various levels of expertise.
Key topics covered include:
Crystal Structures: Unravel the mysteries of crystalline arrangements and their significance in determining material properties.
Band Theory: Explore the electronic band structure of solids and understand how it influences their conductive properties.
Semiconductor Physics: Delve into the behavior of semiconductors, including doping, carrier transport, and device applications.
Magnetic Properties: Investigate the magnetic behavior of solids, including ferromagnetism, antiferromagnetism, and ferrimagnetism.
Optical Properties: Examine the interaction of light with solids, including absorption, reflection, and transmission phenomena.
With visually engaging slides, informative content, and interactive elements, our online PowerPoint presentation serves as a valuable resource for students, educators, and enthusiasts alike, facilitating a deeper understanding of the captivating world of solid-state physics. Explore the intricacies of solid-state materials and unlock the secrets behind their remarkable properties with our comprehensive presentation.
Evidence of Jet Activity from the Secondary Black Hole in the OJ 287 Binary S..., by Sérgio Sacani
We report the study of a huge optical intraday flare on 2021 November 12 at 2 a.m. UT in the blazar OJ287. In the binary black hole model, it is associated with an impact of the secondary black hole on the accretion disk of the primary. Our multifrequency observing campaign was set up to search for such a signature of the impact based on a prediction made 8 yr earlier. The first I-band results of the flare have already been reported by Kishore et al. (2024). Here we combine these data with our monitoring in the R-band. There is a big change in the R–I spectral index by 1.0 ± 0.1 between the normal background and the flare, suggesting a new component of radiation. The polarization variation during the rise of the flare suggests the same. The limits on the source size place it most reasonably in the jet of the secondary BH. We then ask why we have not seen this phenomenon before. We show that OJ287 was never before observed with sufficient sensitivity on the night when the flare should have happened according to the binary model. We also study the probability that this flare is just an oversized example of intraday variability using the Krakow data set of intense monitoring between 2015 and 2023. We find that the occurrence of a flare of this size and rapidity is unlikely. In machine-readable Tables 1 and 2, we give the full orbit-linked historical light curve of OJ287 as well as the dense monitoring sample of Krakow.
PPT on Alternate Wetting and Drying presented at the three-day 'Training and Validation Workshop on Modules of Climate Smart Agriculture (CSA) Technologies in South Asia' workshop on April 22, 2024.
Mechanisms and Applications of Antiviral Neutralizing Antibodies - Creative B..., by Creative-Biolabs
Neutralizing antibodies, pivotal in immune defense, specifically bind and inhibit viral pathogens, thereby playing a crucial role in protecting against and mitigating infectious diseases. In this slide, we will introduce what antibodies and neutralizing antibodies are, the production and regulation of neutralizing antibodies, their mechanisms of action, classification and applications, as well as the challenges they face.
Signatures of wave erosion in Titan’s coasts, by Sérgio Sacani
The shorelines of Titan’s hydrocarbon seas trace flooded erosional landforms such as river valleys; however, it is unclear whether coastal erosion has subsequently altered these shorelines. Spacecraft observations and theoretical models suggest that wind may cause waves to form on Titan’s seas, potentially driving coastal erosion, but the observational evidence of waves is indirect, and the processes affecting shoreline evolution on Titan remain unknown. No widely accepted framework exists for using shoreline morphology to quantitatively discern coastal erosion mechanisms, even on Earth, where the dominant mechanisms are known. We combine landscape evolution models with measurements of shoreline shape on Earth to characterize how different coastal erosion mechanisms affect shoreline morphology. Applying this framework to Titan, we find that the shorelines of Titan’s seas are most consistent with flooded landscapes that subsequently have been eroded by waves, rather than a uniform erosional process or no coastal erosion, particularly if wave growth saturates at fetch lengths of tens of kilometers.
Microbial interaction
Microorganisms interact with each other and can be physically associated with other organisms in a variety of ways.
One organism can be located on the surface of another organism as an ectobiont, or within another organism as an endobiont.
Microbial interactions may be positive, such as mutualism, proto-cooperation, and commensalism, or negative, such as parasitism, predation, or competition.
Types of microbial interaction
Positive interaction: mutualism, proto-cooperation, commensalism
Negative interaction: Ammensalism (antagonism), parasitism, predation, competition
I. Mutualism:
It is defined as a relationship in which each organism in the interaction benefits from the association. It is an obligatory relationship in which the mutualist and the host are metabolically dependent on each other.
A mutualistic relationship is very specific: one member of the association cannot be replaced by another species.
Mutualism requires close physical contact between the interacting organisms.
Mutualism allows organisms to exist in habitats that could not be occupied by either species alone.
A mutualistic relationship between organisms allows them to act as a single organism.
Examples of mutualism:
i. Lichens:
Lichens are an excellent example of mutualism.
They are associations of specific fungi with certain genera of algae. In a lichen, the fungal partner is called the mycobiont and the algal partner is called the phycobiont.
II. Syntrophism:
It is an association in which the growth of one organism either depends on, or is improved by, a substrate provided by another organism.
In syntrophism, both organisms in the association benefit.
[Diagram] Compound A (utilized by population 1) -> Compound B (utilized by population 2) -> Compound C (utilized by both populations 1 and 2) -> Products
In this theoretical example of syntrophism, population 1 is able to utilize and metabolize compound A, forming compound B, but cannot metabolize beyond compound B without the cooperation of population 2. Population 2 is unable to utilize compound A, but it can metabolize compound B, forming compound C. Both populations together can then carry out the metabolic reactions leading to the formation of end products that neither population could produce alone.
Examples of syntrophism:
i. Methanogenic ecosystem in sludge digester
Methane production by methanogenic bacteria depends upon interspecies hydrogen transfer from other, fermentative bacteria.
Anaerobic fermentative bacteria generate CO2 and H2 from carbohydrates, which are then utilized by methanogenic bacteria (Methanobacter) to produce methane.
ii. Lactobacillus arabinosus and Enterococcus faecalis:
In minimal media, Lactobacillus arabinosus and Enterococcus faecalis are able to grow together but not alone.
A synergistic relationship between E. faecalis and L. arabinosus occurs in which E. faecalis requires folic acid
TOPIC OF DISCUSSION: CENTRIFUGATION SLIDESHARE.pptx, by shubhijain836
Centrifugation is a powerful technique used in laboratories to separate components of a heterogeneous mixture based on their density. This process utilizes centrifugal force to rapidly spin samples, causing denser particles to migrate outward more quickly than lighter ones. As a result, distinct layers form within the sample tube, allowing for easy isolation and purification of target substances.
Discovery of An Apparent Red, High-Velocity Type Ia Supernova at 𝐳 = 2.9 wi..., by Sérgio Sacani
We present the JWST discovery of SN 2023adsy, a transient object located in a host galaxy JADES-GS+53.13485−27.82088 with a host spectroscopic redshift of 2.903 ± 0.007. The transient was identified in deep James Webb Space Telescope (JWST)/NIRCam imaging from the JWST Advanced Deep Extragalactic Survey (JADES) program. Photometric and spectroscopic followup with NIRCam and NIRSpec, respectively, confirm the redshift and yield UV-NIR light-curve, NIR color, and spectroscopic information all consistent with a Type Ia classification. Despite its classification as a likely SN Ia, SN 2023adsy is both fairly red (a color of ∼0.9 mag) despite a host galaxy with low extinction, and has a high Ca II velocity (19,000 ± 2,000 km/s) compared to the general population of SNe Ia. While these characteristics are consistent with some Ca-rich SNe Ia, particularly SN 2016hnk, SN 2023adsy is intrinsically brighter than the low-z Ca-rich population. Although such an object is too red for any low-z cosmological sample, we apply a fiducial standardization approach to SN 2023adsy and find that the SN 2023adsy luminosity distance measurement is in excellent agreement (≲1σ) with ΛCDM. Therefore, unlike low-z Ca-rich SNe Ia, SN 2023adsy is standardizable and gives no indication that SN Ia standardized luminosities change significantly with redshift. A larger sample of distant SNe Ia is required to determine if SN Ia population characteristics at high-z truly diverge from their low-z counterparts, and to confirm that standardized luminosities nevertheless remain constant with redshift.
Sexuality - Issues, Attitude and Behaviour - Applied Social Psychology - Psyc..., by PsychoTech Services
A proprietary approach developed by bringing together the best of learning theories from Psychology, design principles from the world of visualization, and pedagogical methods from over a decade of training experience, that enables you to: Learn better, faster!
HUMAN EYE By-R.M Class 10 phy best digital notes.pdf
Method of Potential Function as Feature Choice Criterion in Alpha Procedure
1. Hamburg, 2015 Roman Rader
Method of Potential Function as Feature Choice Criterion in Alpha Procedure
Method of Potential Function as Feature
Choice Criterion in Alpha Procedure
Roman Rader
National Technical University of Ukraine “Kyiv Polytechnical Institute”, Ukraine
Scientific Advisor:
Prof. Dr.-Ing. Tatjana Lange
Contents
● Overview of Alpha Procedure
● Separation power
● Alternative to separation power using Potential Function
● Comparison with original method with visualization
Intro
First, let's take a high-level overview of the Alpha Procedure method.
The Alpha Procedure is a pattern recognition algorithm. Its most important advantages are that it:
– is non-parametric;
– can significantly reduce the feature space;
– operates only on 2D and 3D spaces, which makes it easy to visualize the learning and recognition process.
Alpha Procedure: Input data
AP is a supervised learning method, so it requires input feature vectors pk = (x1, x2, ..., xn) that have already been classified by a “trainer” into two classes; let's call them A and B.
Since the method is non-parametric, no other data need to be provided.

#   p1     p2     p3     class
1   X1,1   X1,2   X1,3   A
2   X2,1   X2,2   X2,3   B
3   X3,1   X3,2   X3,3   A
4   X4,1   X4,2   X4,3   B

{(x1^1, x2^1, ..., xn^1, C^1), (x1^2, x2^2, ..., xn^2, C^2), ..., (x1^k, x2^k, ..., xn^k, C^k)}

x1 = (x11, x12, ..., x1n)
x2 = (x21, x22, ..., x2n)
x3 = (x31, x32, ..., x3n)
x4 = (x41, x42, ..., x4n)
Alpha Procedure: Algorithm
The method is based on step-by-step selection of the most “powerful” feature.
If we represent a single feature as an axis and put all training samples on it, we can choose the one that separates the points best (the method of determining this will be described later). That will be the most “powerful” feature.
Alpha Procedure: Algorithm
1. From the given features we select the one that separates the data best; it becomes our basis feature (and also the current repère axis), f0.
[Figure: projections onto candidate axes p1 and p2, with the intersection area marked; p1 clearly separates the data better, so we use it as the repère axis, f0 = p1.]
Alpha Procedure: Algorithm
2. Now let's build a set of 2D spaces using f0 as the first axis and each of the remaining features as the second axis.
[Figure: 2D plane with axes f0 and fk.]
Alpha Procedure: Algorithm
2. Let's create a new axis that goes through the origin and turn it around the origin by the angle α.
At each step of the rotation, we project the points of the plane onto this axis.
[Figure: 2D plane with axes f0 and fk and the rotating axis at angle α.]
Alpha Procedure: Algorithm
2. The goal of this step is to find the best pair of second-axis feature and angle for the new axis.
In the end we'll have n-1 pairs of feature and angle, so using the “power” metric of the data axis we can choose the best pair; at that point we have two features projected onto a new axis. This new axis will be our first repère vector, f1.
[Figure: axis f0 with the new axis f1 at angle α1.]
Alpha Procedure: Algorithm
3. In the next step we use f1 as the basis axis and repeat the same procedure as in step 2: walk through all remaining features, build a 2D space with each of them, and rotate a new axis around the origin.
[Figure: axes f1 and fk with the new axis f2 at angle α2.]
Alpha Procedure: Algorithm
4, 5, … On the following steps we repeat the previous one until the data are separated or no more features remain.
[Figure: axes f1 and fk with the new axis f2 at angle α2.]
Alpha Procedure
As described above, the Alpha Procedure is based on geometric transformations of the space, which are quite easy to follow: no matter how many features are given, the method operates only on 2D spaces.
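The walk-through above can be put together as a minimal illustrative sketch in Python. This is one reading of the slides, not the author's code: the overlap-count separation power and the fixed angle grid are simplifications, and the sample data are invented.

```python
import numpy as np

def separation_power(z, labels):
    """Fraction of points falling in the 1-D intersection area of the classes."""
    a, b = z[labels == 0], z[labels == 1]
    lo, hi = max(a.min(), b.min()), min(a.max(), b.max())
    if lo > hi:            # the class ranges do not overlap at all
        return 0.0
    return np.sum((z >= lo) & (z <= hi)) / len(z)

def alpha_procedure(X, labels, angles=np.linspace(0.0, np.pi, 37)):
    n = X.shape[1]
    # Step 1: the single feature that separates the classes best becomes f0.
    best = min(range(n), key=lambda j: separation_power(X[:, j], labels))
    axis, remaining = X[:, best], [j for j in range(n) if j != best]
    # Steps 2...: pair the current axis with each remaining feature, rotate a
    # new axis by angle alpha, and keep the best (feature, angle) projection.
    while remaining and separation_power(axis, labels) > 0:
        f, j, a = min((separation_power(np.cos(a) * axis + np.sin(a) * X[:, j], labels), j, a)
                      for j in remaining for a in angles)
        axis = np.cos(a) * axis + np.sin(a) * X[:, j]
        remaining.remove(j)
    return axis

# Two well-separated synthetic clusters in 3 features.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2.0, 0.5, (30, 3)), rng.normal(2.0, 0.5, (30, 3))])
labels = np.array([0] * 30 + [1] * 30)
projection = alpha_procedure(X, labels)
```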
Method of Potential Function
Let's digress from the Alpha Procedure and look at the Potential Function.
Assume, as previously, that our problem is to classify objects x = (x1, x2, ..., xn) into classes A and B. The input data contain the objects' features and the “teacher” classification:
{(x1^1, x2^1, ..., xn^1, C^1), (x1^2, x2^2, ..., xn^2, C^2), ..., (x1^k, x2^k, ..., xn^k, C^k)}
Method of Potential Function
In the geometric interpretation, objects are points in the space X = ℝ^n with coordinates (x1, x2, ..., xn), where xi ∈ ℝ, i ∈ 1..n.
The solution of this problem will be a scalar field Φ = Φ(x), Φ ∈ ℝ, which is positive if the point should be classified as class A and negative if it should be classified as class B:
class(x) = “A” if Φ(x) ≥ 0, “B” if Φ(x) < 0.
Method of Potential Function
Let's introduce a function K(x, x*), x, x* ∈ X, the so-called “kernel”.
For a fixed point x*, K(·, x*) is a function defined on the whole space X that depends on the location x* of the signal source. In physics, functions of this kind are called potential functions, which is the origin of the method's name.
The figure below shows an example of a potential function with the signal source at the point 0.
Method of Potential Function
Now, let's introduce the functions
K_A(x) = Σ_{xj ∈ A} K(x, xj),  K_B(x) = Σ_{xj ∈ B} K(x, xj).
In the geometric interpretation, the result of these functions is the superposition of the potentials of all points of a specific class at the given point x.
Given the properties of the potential function K, in the resulting plots of K_A and K_B densely located points magnify the potentials of their neighbours and form a common region on the plot.
Method of Potential Function
We now have functions that define the “power” of a specific class at a point, so we can introduce the recognition function
Φ(x) = K_A(x) − K_B(x),
where x is the feature vector.
This function, which is actually a scalar field, is the solution of our problem. Having a method for calculating Φ(x), we can predict the class of an arbitrary object x.
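A small sketch of this decision rule in Python, using the kernel K(ρ) = 1/(1 + aρ²) that appears later in the talk; the sample points and the value of a are invented for illustration:

```python
import numpy as np

def kernel(x, x_star, a=1.0):
    """K(rho) = 1 / (1 + a * rho^2), with rho the distance between the points."""
    rho = np.linalg.norm(x - x_star)
    return 1.0 / (1.0 + a * rho ** 2)

def phi(x, class_a, class_b, a=1.0):
    k_a = sum(kernel(x, xj, a) for xj in class_a)  # superposed A potentials
    k_b = sum(kernel(x, xj, a) for xj in class_b)  # superposed B potentials
    return k_a - k_b                               # the recognition function

class_a = [np.array([0.0, 0.0]), np.array([0.5, 0.2])]
class_b = [np.array([3.0, 3.0]), np.array([3.5, 2.8])]
label = "A" if phi(np.array([0.2, 0.1]), class_a, class_b) >= 0 else "B"
```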
Method of Potential Function
[Figure: plot of the recognition function Φ(x) = K_A(x) − K_B(x).]
Separation Power
Now let's return to the Alpha Procedure. As mentioned, it is based on choosing the best axis in terms of the quality of data separation. Let's elaborate on how the original Alpha Procedure does this.
Separation Power
To calculate which feature separates the data better, the Alpha Procedure offers a straightforward way of computing the separation power: we find the intersection area, i.e., the region where objects cannot be unambiguously classified by putting a “separation point” between the A-class point cloud and the B-class cloud.
It can be defined as
F(pq) = ωq / l,
where l is the overall number of objects and ωq is the count of objects in the intersection area.
[Figure: two point clouds on an axis with the intersection area marked.]
Separation Power
For the example given in the figure, ωq = 3 and l = 10, so F = 3/10 = 0.3.
[Figure: two point clouds on an axis; three objects lie in the intersection area.]
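In code, the same computation might look like this (an illustrative sketch, not the author's implementation; the projection values below are invented and give F = 0.4 rather than the slide's 0.3):

```python
def separation_power(points_a, points_b):
    """F(pq) = omega_q / l: the share of points inside the classes' overlap."""
    lo = max(min(points_a), min(points_b))  # left edge of intersection area
    hi = min(max(points_a), max(points_b))  # right edge of intersection area
    if lo > hi:                             # clouds do not overlap at all
        return 0.0
    omega = sum(lo <= p <= hi for p in points_a + points_b)
    return omega / (len(points_a) + len(points_b))

a = [0.1, 0.4, 0.8, 1.2, 1.5]  # class A projections on the candidate axis
b = [1.1, 1.4, 2.0, 2.3, 2.9]  # class B projections
F = separation_power(a, b)     # 4 of 10 points fall in [1.1, 1.5]
```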
Separation Power
Regardless of how it is calculated, the idea of the function F as a metric of separation quality gives us the ability to find other, possibly better, ways of computing it while keeping the same “interface”.
So let's fix that 0 ≤ F(pq) ≤ 1.
Potential Function as Separation Power
Let's consider a way to use the Potential Function as the separation power.
For each axis, at each step, we find the separation function Φ = Φ(x). In our case
it is a function of one argument, because our data is defined on a single axis.
To calculate Φ we have to define the kernel function K(x, x*), which describes how a point
influences the potential depending on the distance to its location.
Let's introduce the distance

ρ = ρ(x, x*) = |x − x*|

and take the kernel of the potential function as

K(ρ) = 1 / (1 + aρ²)
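The kernel and the potential Φ(x) = K_A(x) − K_B(x) can be sketched directly from these definitions. Summing the kernel over each class cloud is my reading of how K_A and K_B are formed; the function names are illustrative:

```python
import numpy as np

def kernel(rho, a):
    # kernel from the slides: K(rho) = 1 / (1 + a * rho^2)
    return 1.0 / (1.0 + a * rho**2)

def potential(x, a_points, b_points, a=1.0):
    # Phi(x) = K_A(x) - K_B(x): total kernel influence of the A-class
    # cloud minus that of the B-class cloud at location x
    k_a = kernel(np.abs(x - a_points), a).sum()
    k_b = kernel(np.abs(x - b_points), a).sum()
    return k_a - k_b
```

Near an A-class point the potential is positive, near a B-class point negative, and it vanishes midway between two symmetric single-point clouds.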
Potential Function Shape
For the kernel function K the parameter a has to be determined. It defines the shape of the
potential function: the greater its value, the less influence each object exerts on its
neighbours, and the more tightly the plot of the potential function fits the original data.
The Alpha Procedure is a non-parametric method, and we want to keep it that way, so
automatic determination of the kernel shape will be very helpful.

K(ρ) = 1 / (1 + aρ²)
Potential Function Shape
Let's see how the parameter a influences the kernel shape

K(ρ) = 1 / (1 + aρ²)

(Figure: kernel plots for a = 0.5, a = 1, a = 5)
Potential Function Shape
To determine the parameter, we have to find a way to estimate how accurately the kernel
separates the data while also preventing overfitting.
In this study we used cross-validation, so that the optimal kernel makes the fewest
recognition errors on the test dataset.
Let's introduce the recognition error functions:

ξ_A(Φ) = Σ_{x_j ∈ A} θ(−Φ(x_j))
ξ_B(Φ) = Σ_{x_j ∈ B} θ(Φ(x_j))

where θ(x) is the Heaviside step function:

θ(x) = 0 for x < 0, 1 for x ≥ 0
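These counts translate almost literally into code. This is a sketch under the convention used in the formulas above (A-class points should have Φ > 0, B-class points Φ < 0); `recognition_errors` is an illustrative name:

```python
def theta(x):
    # Heaviside step: theta(x) = 0 for x < 0, 1 for x >= 0
    return 0 if x < 0 else 1

def recognition_errors(a_points, b_points, phi):
    # xi_A counts A-objects with theta(-Phi(x)) = 1, i.e. Phi(x) <= 0;
    # xi_B counts B-objects with theta(Phi(x)) = 1, i.e. Phi(x) >= 0
    xi_a = sum(theta(-phi(x)) for x in a_points)
    xi_b = sum(theta(phi(x)) for x in b_points)
    return xi_a + xi_b
```

For a perfect separator (Φ positive on all of A, negative on all of B) the total error is 0; each misclassified object adds 1.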
Potential Function Shape
Then the total recognition error of a function Φ with kernel K is

ξ(Φ) = ξ_A(Φ) + ξ_B(Φ)

Now let's treat the kernel parameter a as a parameter of K and Φ. Then the
kernel function becomes

K = K(ρ, a)

the recognition function becomes

Φ = Φ(x, a)

and the recognition error function becomes

ξ_X(Φ, a) = Σ_{x_j ∈ X} θ(−Φ(x_j, a))
Potential Function Shape
Now let's fix the function Φ by currying, since the method of recognition does not
matter for the recognition error function. We get

ξ_X(a) = Σ_{x_j ∈ X} θ(−Φ(x_j, a))

Then the problem of finding the best parameter a reduces to finding the minimum of
the function ξ(a):

ξ(a) = ξ_A(a) + ξ_B(a)

a = argmin_{a ∈ [a_min, a_max]} ξ(a)
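The argmin over [a_min, a_max] can be sketched as a simple grid search. Note this simplified version minimises the error on the training data itself, whereas the slides pick a by cross-validation; the function names and the grid are my own assumptions:

```python
import numpy as np

def xi(a, A, B):
    # total recognition error xi(a) = xi_A(a) + xi_B(a) for kernel width a,
    # with Phi(x) = sum_A K(|x - x_j|) - sum_B K(|x - x_j|)
    # and K(rho) = 1 / (1 + a * rho^2)
    phi = lambda x: (1/(1 + a*(x - A)**2)).sum() - (1/(1 + a*(x - B)**2)).sum()
    # theta(-Phi) flags A-errors (Phi <= 0), theta(Phi) flags B-errors (Phi >= 0)
    return sum(phi(x) <= 0 for x in A) + sum(phi(x) >= 0 for x in B)

def best_a(A, B, a_grid):
    # a = argmin_{a in grid} xi(a); ties resolve to the first grid value
    return min(a_grid, key=lambda a: xi(a, A, B))
```

On well-separated clouds every grid value yields zero error, so the first candidate is returned.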
Potential Function: Conclusion
● The shape of the kernel function has to be determined:
a = argmin_{a ∈ [a_min, a_max]} ξ(a)
● Having the parameter a, we have everything we need to build the recognition function Φ,
based on the potential function, for a specific data axis.
● We can then calculate the separation power for the data using the recognition function Φ,
now based on the potential function, and use the recognition error function to
determine the better axis.
● All other steps of the Alpha Procedure algorithm remain the same, so we kept the
"interface" untouched and only changed the internal way of determining the better feature.
Study
We have described how the Potential Function can be incorporated into the Alpha Procedure's
feature choice algorithm. Let's now see how it influences the method in general.
Study
● The potential function is able to split our data in a more complicated way than just into two
partitions.
● A two-partition division covers only the case of a single separation point between
the two clouds.
Study: Separation of complex data
Let's consider this case:
the blue circles here can represent any class that is described by a two-sided
inequality.
Study: Separation of complex data
The figure shows a generated data sample illustrating how the potential function can split
the data into three partitions (the intersection area is marked).
Study: Separation of complex data
● Here we have three partitions: left
and right, where the potential function is
positive, and a central one, where the potential
function is negative.
● This approach gives the Alpha Procedure
the flexibility to solve complex problems,
where the data is located on the axes in
a complex way, and hence to return more
accurate predictions.
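This three-partition behaviour is easy to reproduce. In the hypothetical sample below (my own construction, not the slides' dataset) the A-class forms two outer clusters and the B-class sits between them; the sign of Φ then partitions the axis into three regions:

```python
import numpy as np

def phi(x, A, B, a=1.0):
    # Phi(x) = K_A(x) - K_B(x) with K(rho) = 1 / (1 + a * rho^2)
    return (1/(1 + a*(x - A)**2)).sum() - (1/(1 + a*(x - B)**2)).sum()

# hypothetical sample: A-class in two outer clusters, B-class in the middle
A = np.array([0.0, 0.5, 1.0, 9.0, 9.5, 10.0])
B = np.array([4.5, 5.0, 5.5])

# sign of Phi at probe points in each region: positive on the left and
# right (A territory), negative in the centre (B territory)
signs = [phi(x, A, B) > 0 for x in (0.5, 5.0, 9.5)]
```

No single separation point could classify this sample, yet the sign of the potential function handles it directly.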
Study: Robustness
● Outliers can seriously degrade the
quality of separation.
● Let's consider the "Banknote
Authentication Data Set":
https://archive.ics.uci.edu/ml/datasets/banknote+authentication
On this data an outlier extends the
intersection area, and the number of
wrongly classified objects is 210.
Study: Robustness
● Now let's use the potential function.
● In this case the number of wrongly classified
objects is 151.
This means the separation quality
improved by 28%:
(210 − 151) / 210 · 100% = 28%
Study: Robustness
● This happens because an outlier
influences the value of the potential
function only at its own location and
in its neighbourhood. So we cannot break a
cloud of objects just by putting one
outlier in the middle.
Study: Robustness
● Let's consider the same data without the
outlier, separated with the original
separation power.
● As we see, just one outlier added
about 17.6% of wrongly classified objects:
(210 − 173) / 210 · 100% ≈ 17.6%
Study: Robustness
● On the other hand, without the outlier the
potential function returned 149
errors, which is almost the same amount
as with the outlier:
(151 − 149) / 151 · 100% ≈ 1.3%
● This means that the potential-function-based
separation power is much more
robust than the original method.
Study: Overfitting
● Let's consider the "Wine Data Set":
http://archive.ics.uci.edu/ml/machine-learning-databases/wine/
The data on the x0 axis of this dataset is not
dense, and the algorithm chose a rather thin
shape for the kernel function. Hence the
plot of the potential function is clearly
overfitted.
● A wider kernel would be better for
overall recognition quality, but
recognition on this specific axis
would be worse.
Study: Overfitting
This shows that there are ways to
enhance the method by improving the
calculation of the kernel parameters.
Conclusions: Original separation power

Pros:
● Can uniquely determine the quality of separation of the data on an axis into two clouds
● Easy to implement
● Reliable (does not require parametrization to work properly)

Cons:
● Does not work on non-linearly separable datasets
● Not robust: single outliers can break the calculations
Conclusions: Potential Method

Pros:
● Can separate complex data dispositions, such as segmented clouds
● Robust: outliers can hardly influence the error function value
● On most datasets where the original method worked well, this method also works and chooses the same axes as a basis; the order of the axes is also the same or similar

Cons:
● Prone to overfitting
● The implementation of this algorithm is O(x²), while the original method is O(x)
● The potential function tries to separate the data the best way it can, but does not show real results if the kernel width (parameters) is wrong
References
● Aizerman M.A., Braverman E.M., Rozonoer L.I.: Метод потенциальных функций в теории обучения машин (The method of potential functions in the theory of machine learning). Nauka, Moscow, 1970.
● Vasil'ev V.I., Lange T., Baranoff A.E.: Interpretation of fuzzy terms (in Russian). VIII Meshdunarodnaya Konferenziya 1999. KDS 99. Kiazjaveli (Krim).
● Vasil'ev V.I.: The reduction principle in problems of revealing regularities (in Russian). Cybernetics and System Analysis 5, Part I: 69-81, 2003; Part II: 7-16, 2004.