A non-linear dimensionality reduction method is considered which takes into account
the discriminating power of the solution found with respect to the values of the categorical
variable associated with each observation. A stochastic optimization method known as
particle swarm optimization is proposed to find the characteristics that ensure the
best separation of observations in terms of a given quality functional. The quality of a
solution is evaluated through the purity of the clusters obtained with the k-means method
or with self-organizing Kohonen feature maps.
Particle Swarm Optimization for Multidimensional Clustering of Natural Language Data
http://www.iaeme.com/IJCIET/index.asp
An additional result of the method will be the coordinates of the centroids found in the
resulting characteristic space, which ensure the best division of the set of observations.
If a categorical variable whose value is known for all observations of the training sample
is specified, then for any clustering result a quality functional can be defined that is
associated with the purity of the resulting clusters with respect to this variable. The search
for a subset of components that is optimal for discriminating the classes can then be
treated as a combinatorial task of searching among the $2^N$ possible parameter
combinations (where $N$ is the dimension of the original feature space) for the one that
provides the best value of the given quality functional.
The task of reducing the original dimensionality of the feature space while preserving
quasi-optimal discrimination with respect to the given categorical variable was thus
formulated as a combinatorial optimization task, and a stochastic method using a
multicomponent quality functional was proposed to solve it.
2. FORMULATION OF THE TASK
A sample of observations $V = \{v_1, \dots, v_l\}$ is given, where each observation $v_i$ is
characterized by a class label $c_i \in \{c_1, \dots, c_z\}$, one of the $z$ possible values of the
categorical variable, and a parametric vector $x_i = (x_{i1}, \dots, x_{ik})$ of $k$ real
components (the coordinates of the observation in the $k$-dimensional space). It is necessary to
determine a subset of $m \le k$ unique components $\bar{X} = \{x_{j_1}, \dots, x_{j_m}\}$ such
that the sample $\bar{V}$ reduced to these components provides quasi-optimal values of the
quality functional $Q(\bar{V})$.
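As a minimal illustration of this formulation, a candidate solution can be represented as a binary mask over the $k$ feature columns (a toy sketch; NumPy and the names X, P and X_reduced are assumptions of the example, not notation from the paper):

```python
import numpy as np

# Toy data: l = 6 observations in a k = 5 dimensional feature space.
X = np.arange(30, dtype=float).reshape(6, 5)

# One candidate solution: a binary selection vector with m = 3 ones.
P = np.array([1, 0, 1, 0, 1])

# The reduced sample keeps only the selected components of every observation.
X_reduced = X[:, P == 1]
```

Each such mask is one of the $2^k$ candidate component subsets over which the quality functional is optimized.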
3. THE CONCEPT OF THE QUALITY FUNCTIONAL OF THE
FEATURE SPACE FOUND
The quality functional $Q(\bar{V})$ is defined so that larger values correspond to clustering
results with greater cluster homogeneity. The k-means clustering method is used, with the
centroids initialized with random values or by any other suitable method. Homogeneity will
be defined through a generalized measure of the share of observations of the "winning"
class $c \in \{c_1, \dots, c_z\}$ in each cluster relative to the total number of observations
assigned to that cluster (hereinafter, the partial homogeneity $H_p$) and through the
generalized deviation of the number of unique observation classes in each cluster from 1
(hereinafter, the formation homogeneity $H_f$). If the numbers of observations of each class
in $V$ are comparable, it is advisable to supplement the quality functional with a characteristic
reflecting the generalized deviation of the volumes of the resulting clusters from the expected
volume (for example, from the arithmetic mean volume $l/n$); the term weighting $W$ will be
hereinafter used to refer to this component.
4. GRAPHICAL REPRESENTATION OF STRUCTURAL
CHARACTERISTICS OF THE CLUSTERING RESULTS
To illustrate the results of applying the proposed algorithm, the graphical representation of
cluster structure purity proposed by Narayana Swamy [1] will be used.
Since this form of representation (Fig. 1) is not standardized, a short explanation will be
given on its interpretation.
G.A. Yuryev, E.K. Verkhovskaya and N.E. Yuryeva
Figure 1 Example of a cluster purity diagram
It is assumed that the class tags of the clustered observations are known in advance; their
values are listed in the legend on the right-hand side of Fig. 1, and each class is
associated with a color code, also shown in the legend. A set of circles is placed
in the main part of the diagram, the number of which corresponds to the number of clusters
obtained during the clustering procedure (9 in the example in Fig. 1). The ratio of the areas
of the circles relative to each other corresponds to the ratio of the volumes of the
corresponding clusters. The size of each sector within a circle corresponds to the number
of observations of the class with the corresponding color code in that cluster; for
example, in the cluster represented by the upper left circle, most of the observations are
assigned to the class c4 with a red color code.
5. PRACTICAL ASSESSMENTS OF THE QUALITY FUNCTIONAL
Formal estimates of each of the components of $Q(\bar{V})$ listed above can now be given. Before
starting the calculations, a clustering result for $\bar{V}$ should be obtained using the
k-means method; the set of $n$ resulting clusters will further be denoted as
$G = \{G_1, \dots, G_n\}$, where each $G_i$ is the set of class labels corresponding to the
observations assigned to the i-th cluster. Let $F(G_i)$ be the function that returns the
absolute number of occurrences of the class label that has the maximum share in the i-th
cluster. Then the partial homogeneity can be estimated as

$$H_p = \frac{1}{n} \sum_{i=1}^{n} \frac{F(G_i)}{l_i} \qquad (1)$$

where $l_i$ is the number of observations assigned to cluster $G_i$ as a result of clustering.
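Estimate (1) can be sketched in code as follows (an illustrative implementation, assuming each cluster is represented simply by the list of class labels of its observations; the function name is an assumption of the example):

```python
from collections import Counter

def partial_homogeneity(clusters):
    """Mean share of the 'winning' class across clusters: (1/n) * sum of F(G_i)/l_i."""
    shares = []
    for labels in clusters:
        f_gi = Counter(labels).most_common(1)[0][1]  # F(G_i): count of the majority label
        shares.append(f_gi / len(labels))            # l_i: the cluster volume
    return sum(shares) / len(shares)

pure = partial_homogeneity([["a", "a"], ["b", "b", "b"]])        # single-class clusters
mixed = partial_homogeneity([["a", "b"], ["b", "b", "a", "a"]])  # half-and-half clusters
```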
The result of maximizing the partial homogeneity on the sample data for a fixed number of
clusters is shown in Fig. 2.
Figure 2 Result of maximizing partial homogeneity
Let $U(G_i)$ be the function that returns the number of unique class labels in the i-th
cluster. Then the formation homogeneity can be estimated as

$$H_f = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{U(G_i)} \qquad (2)$$
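Estimate (2) admits an equally short sketch under the same assumptions (illustrative name and data representation):

```python
def formation_homogeneity(clusters):
    """Mean of 1/U(G_i) across clusters, where U(G_i) counts unique labels in cluster i."""
    return sum(1 / len(set(labels)) for labels in clusters) / len(clusters)

pure = formation_homogeneity([["a", "a"], ["b"]])        # one class per cluster
mixed = formation_homogeneity([["a", "b"], ["c", "c"]])  # (1/2 + 1/1) / 2
```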
The result of maximizing the formation homogeneity on the same data array is shown in
Fig. 3.
Figure 3 Result of maximizing formation homogeneity
Despite the similarity of the results of maximizing the two criteria, it is easy to see that in the
first case (Fig. 2) preference is given to increasing the share of the winning class in each of
the clusters, while in the second case (Fig. 3) preference is given to a smaller number of
classes represented within the same cluster.
The weighting estimate, based on the assumption of equal sizes of the resulting clusters, is
given as

$$W = 1 - \frac{1}{n} \sum_{i=1}^{n} \sqrt{\left(\frac{l_i}{l} - \frac{1}{n}\right)^2} \qquad (3)$$

where $l_i$ is the number of observations assigned to cluster $G_i$ as a result of clustering and
$l$ is the total sample size.
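A plausible reading of estimate (3), assuming it is normalized so that equal-sized clusters give a value of one (consistent with the remark that the quality functional tends towards one), can be sketched as:

```python
import math

def weighting(cluster_sizes):
    """1 minus the mean absolute deviation of cluster shares from the equal split 1/n."""
    l = sum(cluster_sizes)   # total sample size
    n = len(cluster_sizes)   # number of clusters
    dev = sum(math.sqrt((l_i / l - 1 / n) ** 2) for l_i in cluster_sizes) / n
    return 1.0 - dev

balanced = weighting([50, 50])  # equal volumes
skewed = weighting([90, 10])    # deviation of 0.4 per cluster
```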
The result of maximizing the weighting with the same initial data is shown in Fig. 4.
Figure 4 Result of maximizing weighting
The cumulative quality functional can be written as follows:

$$Q(\bar{V}) = \alpha_1 H_p + \alpha_2 H_f + \alpha_3 W \qquad (4)$$

where $\alpha_1$, $\alpha_2$ and $\alpha_3$ are the gain factors corresponding to the
components of the given quality functional (normalized so that they sum to one), which make
it possible to emphasize the desired characteristics of the clustering results obtained in the
optimization process. The result of maximizing the total quality functional with equal gain
factors is shown in Fig. 5.
Figure 5 Result of maximizing the three-component quality function
The value of $Q(\bar{V})$ tends towards one; with the described evaluation procedures, strict
equality to 1 is achieved when all clusters are of equal size and each cluster contains
observations of only one class.
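The three estimates and their combination (4) can be computed together as follows (an illustrative sketch under the same cluster-as-label-list assumption; the gain factors are taken to sum to one):

```python
from collections import Counter
import math

def quality(clusters, a1, a2, a3):
    """Weighted sum (4) of the three estimates; reaches 1.0 only for equal-sized
    clusters that each contain a single class (a1 + a2 + a3 assumed to be 1)."""
    n = len(clusters)
    l = sum(len(c) for c in clusters)
    h_p = sum(Counter(c).most_common(1)[0][1] / len(c) for c in clusters) / n
    h_f = sum(1 / len(set(c)) for c in clusters) / n
    w = 1 - sum(math.sqrt((len(c) / l - 1 / n) ** 2) for c in clusters) / n
    return a1 * h_p + a2 * h_f + a3 * w

# Equal-sized, single-class clusters: the ideal case.
ideal = quality([["a", "a"], ["b", "b"]], 1/3, 1/3, 1/3)
```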
6. STOCHASTIC SOLUTION OF THE COMBINATORIAL
OPTIMIZATION TASK
The solution of optimization tasks that do not have an explicit analytical interpretation often
involves searching for quasi-optimal parameters based on numerical estimates of the
gradient of the quality functional or on stochastic methods of directed search [2, 3]. This
section describes a stochastic optimization method based on the "particle swarm" method,
applied to the formulated dimensionality reduction task.
The original idea of the "particle swarm" method was proposed and developed in studies
[4, 5, 6]; the original algorithm was applied to model the social behavior of birds, bees and
other animals characterized by spatial movement within large groups (swarms). Later it was
noted that this model can be used effectively to explore feature spaces, in particular to
search for quasi-optimal solutions of multidimensional optimization tasks.
Swarm optimization algorithms can be classed as evolutionary; the general scheme for the
enumeration of solutions is described by the following sequence of steps:
1. A population of "individuals" is generated, each of which contains some random
solution of the target task. The search for a solution is represented by an iterative
process (steps 2 and 3). In each iteration (epoch), the decisions (positions) of all
individuals are slightly modified according to the rules ensuring the convergence
of the iterative process to quasi-optimal solutions.
2. The direction of the change in the position of each individual is calculated,
depending on its current position, the best solution of the given individual
throughout its "existence" (local extremum) and the best known solution (obtained
by any individual) for the entire population (global extremum).
3. New positions of the individuals (their coordinates) are computed in accordance
with the directions obtained in step 2.
4. The stopping criteria are checked; if they are not satisfied, go to step 2,
otherwise the search is terminated.
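The steps above can be sketched as a generic maximization loop (an illustrative Python skeleton; the function names, the toy one-dimensional fitness and the movement rule are assumptions of the example, not taken from the paper):

```python
import random

def particle_swarm(init, move, fitness, swarm_size=20, epochs=50):
    """Skeleton of steps 1-4 above, for maximization."""
    swarm = [init() for _ in range(swarm_size)]         # step 1: random population
    pbest = list(swarm)                                 # local extrema
    gbest = max(swarm, key=fitness)                     # global extremum
    for _ in range(epochs):
        for i in range(swarm_size):
            swarm[i] = move(swarm[i], pbest[i], gbest)  # steps 2-3: new positions
            if fitness(swarm[i]) > fitness(pbest[i]):
                pbest[i] = swarm[i]
        gbest = max(pbest, key=fitness)                 # step 4: iterate until done
    return gbest

# Toy run: maximize -(x - 3)^2 over the reals.
random.seed(0)
best = particle_swarm(
    init=lambda: random.uniform(-10.0, 10.0),
    move=lambda x, p, g: x + random.random() * (p - x) + random.random() * (g - x),
    fitness=lambda x: -(x - 3.0) ** 2,
)
```

For the binary task considered in this paper, the continuous movement rule is replaced by the probabilistic bit-inversion rules described below.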
The balance between the local and global trends in the behavior of the individuals is determined
by coefficients interpreted as the accelerations of the movement of the individuals towards the
local and global extrema. The original formulation of the method assumed a real-valued solution
space, which made it impossible to apply the method directly to discrete tasks, in particular
combinatorial optimization tasks. A number of authors proposed adaptations of the method
to binary tasks [7], in which the "accelerations" received a probabilistic interpretation. Let us
consider in more detail the modified version of the algorithm described in [7], which can be
used to find component subsets $\bar{X}$ that are quasi-optimal in terms of $Q(\bar{V})$.
The search for an optimal combination of parameters from $X = \{x_1, \dots, x_k\}$ can be
formulated as a multidimensional knapsack task with single sampling. Its solution can be
represented as a binary vector $P = \{p_1, \dots, p_k\}$, $p_i \in \{0, 1\}$, whose
components at the $m$ selected positions are equal to 1 and all others to 0; that is, a 1 in
the i-th position of $P$ indicates that $x_i$ enters the component subset $\bar{X}$ selected
for clustering.
The population consists of $d$ individuals; each individual $i$ stores its current solution
$X_i$, its best solution $X_i^{best}$ and two vectors $V_i^{10}$, $V_i^{01}$ that determine
the inversion frequency of each of the $k$ bits in each of the two possible directions:
$v_{ie}^{10}$ is the probability of replacing 1 with 0 in position $e$ for the i-th individual;
$v_{ie}^{01}$ is the probability of replacing 0 with 1 in position $e$ for the i-th individual.
The values $v_{ie}^{10}$, $v_{ie}^{01}$ change at each iteration, as do the positions of the
individuals. For the position of each individual at each iteration, the probability of inversion
of each bit, denoted $v_{ie}^{c}$, is determined by its current position according to the
following rule:

$$v_{ie}^{c} = \begin{cases} v_{ie}^{10}, & x_{ie} = 1 \\ v_{ie}^{01}, & x_{ie} = 0 \end{cases} \qquad (5)$$
The updated velocities $\ddot{v}_{ie}^{10}$, $\ddot{v}_{ie}^{01}$ are affected by their
current values, the best local solution $X_i^{best}$, the best global solution $P^{best}$ and
the coefficient of inertia $\omega$:

$$\ddot{v}_{ie}^{10} = \omega v_{ie}^{10} + d_{ie,1}^{10} + d_{ie,2}^{10} \qquad (6)$$

$$\ddot{v}_{ie}^{01} = \omega v_{ie}^{01} + d_{ie,1}^{01} + d_{ie,2}^{01} \qquad (7)$$
To calculate $d_{ie,1}$ and $d_{ie,2}$, two scaling factors $c_1$ and $c_2$ are used,
whose values lie within {0…1}; they form the balance between the local and global trends
in the derived positions. In addition, at each iteration another pair of scaling values
$r_1$ and $r_2$ is randomly generated within {0…1}. The final values are formed according
to the following rules:

$$d_{ie,1}^{10} = \begin{cases} c_1 r_1, & x_{ie}^{best} = 0 \\ -c_1 r_1, & x_{ie}^{best} = 1 \end{cases} \qquad (8)$$

$$d_{ie,1}^{01} = \begin{cases} c_1 r_1, & x_{ie}^{best} = 1 \\ -c_1 r_1, & x_{ie}^{best} = 0 \end{cases} \qquad (9)$$

$$d_{ie,2}^{10} = \begin{cases} c_2 r_2, & p_{e}^{best} = 0 \\ -c_2 r_2, & p_{e}^{best} = 1 \end{cases} \qquad (10)$$

$$d_{ie,2}^{01} = \begin{cases} c_2 r_2, & p_{e}^{best} = 1 \\ -c_2 r_2, & p_{e}^{best} = 0 \end{cases} \qquad (11)$$
In fact, the probability of inversion of each vector component decreases, provided that its
value coincides with the corresponding value of the known optimal solution, and increases
otherwise.
To calculate the new position of each individual $i$, the components $e$ of the corresponding
solution are inverted with probabilities derived from $v_{ie}^{c}$; for this, each value is
compared with a random value $RND_{ie}$ generated for each comparison. The following
normalizing (sigmoid) condition is applied to the inversion probabilities:

$$v_{ie}' = \frac{1}{1 + e^{-v_{ie}^{c}}} \qquad (12)$$
Then the rule for determining the derived positions is as follows:

$$\ddot{x}_{ie} = \begin{cases} \overline{x_{ie}}, & RND_{ie} < v_{ie}' \\ x_{ie}, & \text{otherwise} \end{cases} \qquad (13)$$

where $\overline{x_{ie}}$ denotes the inversion of the bit $x_{ie}$.
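Rules (5)-(13) can be collected into a single update step for one individual (an illustrative sketch; the function name and argument conventions are assumptions of the example, and the sigmoid normalization follows the binary PSO of [7]):

```python
import math

def bpso_step(x, v10, v01, xbest, gbest, w, c1, c2, rnd):
    """One velocity and position update for a binary particle, rules (5)-(13).

    x, xbest, gbest: bit lists; v10[e]/v01[e]: velocities toward flipping
    bit e from 1 to 0 / from 0 to 1; rnd(): uniform {0...1} random source.
    """
    new_x = list(x)
    for e in range(len(x)):
        r1, r2 = rnd(), rnd()
        # Rules (8)-(11): push flip probabilities toward the local/global best bits.
        d10 = (c1 * r1 if xbest[e] == 0 else -c1 * r1) \
            + (c2 * r2 if gbest[e] == 0 else -c2 * r2)
        d01 = (c1 * r1 if xbest[e] == 1 else -c1 * r1) \
            + (c2 * r2 if gbest[e] == 1 else -c2 * r2)
        v10[e] = w * v10[e] + d10                 # rule (6)
        v01[e] = w * v01[e] + d01                 # rule (7)
        vc = v10[e] if x[e] == 1 else v01[e]      # rule (5)
        p_flip = 1.0 / (1.0 + math.exp(-vc))      # normalization (12)
        if rnd() < p_flip:                        # rule (13): invert the bit
            new_x[e] = 1 - x[e]
    return new_x

# With a zero random source, every comparison 0 < sigmoid(0) = 0.5 succeeds,
# so every bit is inverted.
flipped = bpso_step([1, 0], [0.0, 0.0], [0.0, 0.0], [1, 0], [1, 0],
                    w=0.9, c1=0.5, c2=0.5, rnd=lambda: 0.0)
```

Passing a deterministic random source, as in the call above, makes the update reproducible; in the actual algorithm rnd would be a uniform random generator.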
The above-mentioned algorithm corresponds to the one described in [7].
Obviously, such a procedure will significantly reduce the probability of new solutions
appearing in the population after a certain number of steps, i.e. the algorithm will get "stuck"
at the local extremum found. To overcome the problem of local extrema, it is proposed to
introduce a procedure for randomizing the positions based on a "swarm density" criterion.
Here density refers to a generalized assessment of the deviation of the position of each
individual from $P^{best}$; if $H(a, b)$ is the function that returns the Hamming distance
between the binary vectors $a$ and $b$, the swarm density is estimated as

$$Density = \frac{1}{d} \sum_{i=1}^{d} \left(1 - \frac{H(X_i, P^{best})}{k}\right) \qquad (14)$$

where $k$ is the length of the binary solution vector and $d$ is the number of individuals in
the population.
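Estimate (14) can be sketched as follows (illustrative; positions are bit lists and the Hamming distance is computed inline):

```python
def swarm_density(positions, gbest):
    """Estimate (14): mean per-bit agreement of particle positions with the global best."""
    k = len(gbest)      # solution vector length
    d = len(positions)  # number of individuals
    hamming = lambda a, b: sum(u != v for u, v in zip(a, b))
    return sum(1 - hamming(x, gbest) / k for x in positions) / d

full = swarm_density([[1, 0, 1], [1, 0, 1]], [1, 0, 1])  # all positions coincide
half = swarm_density([[1, 0], [0, 1]], [1, 0])           # one exact match, one inversion
```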
The range of values of $Density$ corresponds to the interval {0…1}, with a value
of one indicating that the current positions of all individuals coincide with the best known
solution found by the algorithm. The value is estimated at the end of each iteration; if the
predetermined density threshold is exceeded, it is proposed to randomize the positions of a
certain percentage of the individuals and the corresponding vectors $V^{10}$, $V^{01}$,
while preserving the information about their best local solutions. This approach allows the
numerical procedure to be led out of local extrema automatically without losing the general
search direction formed during the optimization process. Lower threshold values of
$Density$ lead to a less intensive search for a solution in the area of the current
extremum, and vice versa.
7. EVALUATION OF EFFICIENCY UPON CLUSTERING OF
NATURAL LANGUAGE DATA
The proposed concept was tested on natural language data on air incidents taken from the
open database of the Aviation Safety Network [8]. The initial data contained descriptions of
the incidents in English and the categorical variables associated with these incidents (damage
level, vessel type, flight phase, etc.). The incident descriptions were parameterized using
word2vec technology [9]. For each word included in a description, the corresponding
vector was taken from the freely distributed dictionary learned from Google News text
data. Each observation was then associated with a 300-dimensional numerical vector
obtained as the component-wise arithmetic mean of the vectors of the words making up the
description. These vectors were considered integral estimates of the description semantics
that have no explicit interpretation in the context of the specified categorical variables. To
test the proposed algorithm, a computational experiment was conducted whose purpose was
to reduce the dimension of the multidimensional vector descriptions while maximizing their
discriminating power with respect to the level of damage resulting from the incident.
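The parameterization step can be sketched as follows (a toy example with a made-up two-word dictionary and 3-dimensional vectors in place of the pretrained 300-dimensional Google News vectors; the names are assumptions of the example):

```python
import numpy as np

# Toy stand-in for the pretrained word2vec dictionary.
vectors = {
    "engine":  np.array([1.0, 0.0, 2.0]),
    "failure": np.array([0.0, 2.0, 0.0]),
}

def description_vector(text, vectors):
    """Component-wise arithmetic mean of the vectors of in-vocabulary words."""
    words = [w for w in text.lower().split() if w in vectors]
    return np.mean([vectors[w] for w in words], axis=0)

# Out-of-vocabulary words ("on", "approach") are simply skipped here.
v = description_vector("Engine failure on approach", vectors)
```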
The following levels of damage values were considered:
Serious – the aircraft was significantly damaged as a result of the incident;
Minor – the aircraft suffered minor damage as a result of the incident;
None – the aircraft was not damaged as a result of the incident;
Missing – as a result of the incident, the aircraft was completely lost, its fate is unknown.
The full sample consisted of 600 observations, of which 300 were used as a training
sample and 300 as a control sample. Division was performed into 4, 8 and 12 clusters, using
k-means clustering and self-organizing Kohonen feature maps as the clustering algorithms
when calculating the quality functional.
The results of this study are graphically presented in Table 1.
Table 1 Visualization of the purity of the cluster structures obtained by reducing the dimensionality of
the text descriptions, using 3 damage levels (3 types) and 4 damage levels (4 types), with k-means and
self-organizing Kohonen feature maps
k-means (3 types) SOM (3 types) k-means (4 types) SOM (4 types)
In the clustering results it is noticeable that, for the categorical variable (damage
level) taken with 4 levels, cases with minor damage (Minor, blue sector) and with
serious damage (Serious, red sector) are regularly combined into one cluster. In fact, these
descriptions are quite close: using the non-linear dimensionality reduction method t-SNE to
visualize the location of the multidimensional (300-component) observation points in
3-dimensional space shows that the points with the corresponding levels of damage are
weakly distinguishable (Fig. 6).
Figure 6 Distribution of the Minor and Serious observations in 3-dimensional space
When the Minor and Serious categories are aggregated into one Damaged class (Table 1,
columns 1 and 2), the separation result improves significantly. The class definition
error on the control sample amounts to ~12% when partitioning into 4
clusters and ~8% when partitioning into 6 and 8 clusters. The results depend weakly
on the chosen clustering method, although this conclusion may be specific to this particular
task.
The dimensionality of the resulting feature space in each of the cases presented in Table 1
amounted to between one third and one tenth of the original number of components.
8. CONCLUSIONS
A new method was developed and tested for dimensionality reduction of data, which provides
solutions that are quasi-optimal from the point of view of discrimination of the given classes.
The method can be used in combination with text parameterization technology to process
natural language records in arbitrary application areas.
1. The result of the proposed algorithm is not only a combination of the initial
features, but also the coordinates of the cluster centers that combine observations
in the space found (or the trained Kohonen network if it is chosen as a clustering
method).
2. The proposed method does not require preliminary assumptions regarding the type
of initial distribution of the observed features.
3. The described reduction procedure was prototyped, thus confirming its practical
applicability.
4. The efficiency of the proposed technology was evaluated using examples of tasks
with a real-valued source space. The method can be extended to any initial spaces
in which points can be clustered using the k-means clustering method.
5. A multicomponent quality functional was formulated that makes it possible to
control the process of dimensionality reduction in the feature space and to shape
various characteristics of the resultant space.
6. For combinatorial optimization using the "particle swarm" method, a criterion for
detecting that the algorithm is "stuck" in the region of a local extremum was
proposed, together with a procedure, performed on the basis of this criterion, for
leading the algorithm out of this region.
ACKNOWLEDGEMENTS
This work was financially supported by the Ministry of Education and Science of the
Russian Federation within the framework of the Subsidy Agreement dating September 26,
2017 No. 14.576.21.0092 (Unique identifier of the agreement RFMEFI57617X0092) for the
implementation of applied scientific research on the topic: "Development of a neural network
forecasting system for aviation incidents and safety risk management based on historical data
including many parameters and text descriptions of events".
REFERENCES
[1] Narayana, S. Cluster Purity Visualizer.
https://bl.ocks.org/nswamy14/e28ec2c438e9e8bd302f
[2] Kuravsky, L. S., Marmalyuk, P. A., Yuryev, G. A., and Dumin, P. N. A Numerical
Technique for the Identification of Discrete-State Continuous-Time Markov Models.
Applied Mathematical Sciences, 9(8), 2015, pp. 379-391.
https://dx.doi.org/10.12988/ams.2015.410882
[3] Kuravsky, L. S., Marmalyuk, P. A., Yuryev, G. A., Belyaeva, O. B., and Prokopieva, O.
Yu. Mathematical Foundations of Flight Crew Diagnostics Based on Videooculography
Data. Applied Mathematical Sciences. 10(30), 2016, pp. 1449–1466.
https://dx.doi.org/10.12988/ams.2016.6122.
[4] Kennedy, J., and Eberhart, R. Swarm Intelligence. Morgan Kaufmann Publishers, Inc.,
San Francisco, CA, 2001.
[5] Kennedy, J., and Eberhart, R. "Particle Swarm Optimization", IEEE International
Conference on Neural Networks (Perth, Australia), IEEE Service Center, Piscataway, NJ,
IV, 1995.
[6] Eberhart, R., and Kennedy, J. A New Optimizer Using Particles Swarm Theory, Proc.
Sixth International Symposium on MicroMachine and Human Science (Nagoya, Japan),
IEEE Service Center, Piscataway, NJ, 1995.
[7] Khanesar, M. A., Tavakoli, H., Teshnehlab, M., and Shoorehdeli, M. A. Novel Binary
Particle Swarm Optimization, Particle Swarm Optimization, Aleksandar Lazinica (Ed.),
InTech, 2009.
https://www.intechopen.com/books/particle_swarm_optimization/novel_binary_particle_swarm_optimization
[8] Aviation safety network https://aviation-safety.net/database/
[9] Mikolov, T., Yih, W., and Zweig, G. Linguistic Regularities in Continuous Space Word
Representations. Proceedings of NAACL HLT, 2013.