This document discusses a hierarchical fuzzy rule-based classification system using genetic rule selection to filter unwanted messages from online social networks. It aims to improve performance on imbalanced data sets by increasing granularity of fuzzy partitions at class boundaries. The system uses a neural network learning model and genetic algorithm for rule selection to build an accurate and compact fuzzy rule-based model. It analyzes challenges in classifying short texts from social media posts and reviews related work on content-based filtering and policy-based personalization for social networks. The document also discusses issues with imbalanced data sets and proposes oversampling the minority class using SMOTE (Synthetic Minority Over-sampling Technique) as a preprocessing step to address class imbalance problems.
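As a concrete illustration of the SMOTE preprocessing step mentioned above, the sketch below generates synthetic minority samples by interpolating between a sample and one of its nearest neighbours. This is a minimal, from-scratch version; the function name, the tiny dataset, and the parameter choices are illustrative, not taken from the paper:

```python
import numpy as np

def smote(minority, k=3, n_new=4, rng=None):
    """Generate synthetic minority samples by interpolating between
    each sample and one of its k nearest neighbours."""
    rng = np.random.default_rng(rng)
    minority = np.asarray(minority, dtype=float)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(minority))
        x = minority[i]
        # distances from x to every minority sample
        d = np.linalg.norm(minority - x, axis=1)
        neighbours = np.argsort(d)[1:k + 1]   # skip the sample itself
        j = rng.choice(neighbours)
        gap = rng.random()                    # interpolation factor in [0, 1)
        synthetic.append(x + gap * (minority[j] - x))
    return np.array(synthetic)

minority = [[1.0, 1.0], [1.2, 0.9], [0.9, 1.1], [1.1, 1.2]]
new_samples = smote(minority, k=2, n_new=3, rng=0)
print(new_samples.shape)  # (3, 2)
```

Each synthetic point lies on the line segment between two real minority samples, which is what lets SMOTE densify the minority region instead of merely duplicating points.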
Iaetsd efficient filtration of unwanted messages - Iaetsd
This document discusses an efficient filtration system for unwanted messages on social networking sites. It proposes a Trust Evaluation System (TES) that uses a reputation metric to evaluate new messages submitted by users and assign a confidence level based on the trustworthiness of the reporter. TES rewards reporters whose feedback agrees with highly trusted users and penalizes those who disagree. It also continuously updates the confidence level of messages based on additional feedback. The system aims to induct a community of trusted reporters and automatically filter future messages matching fingerprints that have been cataloged as spam.
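The confidence-level update TES performs is not spelled out here, but a minimal sketch of one plausible reputation-weighted scheme looks like this; the update rule, trust scores, and weight are all illustrative assumptions, not the paper's actual metric:

```python
# Illustrative reputation-weighted confidence update: a message's
# spam-confidence moves toward each reporter's verdict, scaled by
# how much the system trusts that reporter.

def update_confidence(confidence, reporter_trust, says_spam, weight=0.5):
    """Nudge confidence (in [0, 1]) toward the reporter's verdict."""
    target = 1.0 if says_spam else 0.0
    return confidence + weight * reporter_trust * (target - confidence)

conf = 0.5                                                    # neutral prior
conf = update_confidence(conf, reporter_trust=0.9, says_spam=True)
conf = update_confidence(conf, reporter_trust=0.2, says_spam=False)
print(conf)
```

A highly trusted reporter moves the score much more than a low-trust one, which matches the reward/penalty behaviour the summary describes.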
International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nanotechnology & Science, Power Electronics, Electronics & Communication Engineering, Computational Mathematics, Image Processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design, etc.
International Journal of Pharmaceutical Science Invention (IJPSI) is an international journal intended for professionals and researchers in all fields of Pharmaceutical Science. IJPSI publishes research articles and reviews within the whole field of Pharmacy and Pharmaceutical Science, covering new teaching methods, assessment, validation and the impact of new technologies, and it will continue to provide information on the latest trends and developments in this ever-expanding subject. Papers are selected through double peer review to ensure originality, relevance, and readability. The articles published in our journal can be accessed online.
A system to filter unwanted messages from OSN user walls - Gajanand Sharma
The document presents a system to filter unwanted messages from user walls on online social networks. It uses machine learning techniques like text classification and radial basis function networks to categorize messages as neutral or non-neutral, and further classify non-neutral messages. Users can define custom filtering rules and blacklists to automatically filter messages on their walls based on content, user relationships, and other criteria. The system aims to give users more control over their timeline posts while maintaining flexibility.
Filtering Unwanted Messages from Online Social Networks (OSN) using Rule Base... - IOSR Journals
Online Social Networks (OSNs) are today one of the most popular interactive media to share, communicate, and distribute a significant amount of information about human life. In OSNs, information filtering can also be used for a different, more responsive function. This is owing to the fact that OSNs allow posting or commenting on other posts in particular public/private regions, generally called walls. Information filtering can therefore be used to give users the ability to automatically control the messages written on their own walls by filtering out unwanted messages. Today's OSNs provide very little support to prevent unwanted messages on user walls. For instance, Facebook permits users to state who is allowed to insert messages on their walls (i.e., friends, defined groups of friends, or friends of friends). However, no content-based preferences are supported, and therefore it is not possible to prevent undesired messages, for instance political or offensive ones, regardless of the user who posts them. The goal is to propose and experimentally evaluate an automated system, called Filtered Wall (FW), able to filter unwanted messages from OSN user walls.
This document outlines a proposed system to filter unwanted messages from online social networks. It discusses the existing problems of misuse on social media platforms. The proposed system would use machine learning techniques like SVM for text categorization and identification of fake profiles to filter content by category (e.g. abusive, vulgar, sexual). It presents the system architecture as a three-tier structure and provides results of testing the filtering mechanism and classifier. The conclusion is that the "Filtered wall" system could address concerns around unwanted content on social media walls.
The document proposes a system called Filtered Wall (FW) to filter unwanted messages from users' walls in Online Social Networks (OSNs). FW uses machine learning techniques to automatically categorize short text messages. It also provides flexible filtering rules that allow users to customize which content is displayed on their walls based on message categorization, user profiles, and relationships. The system was experimentally evaluated on its ability to accurately categorize messages and effectively apply the filtering rules. A prototype was implemented for Facebook to demonstrate the system.
A system to filter unwanted messages from the... - Madan Golla
This document presents a system to filter unwanted messages from social network users' walls. It consists of three main components: filtering rules, thresholds for applying the rules which are customized for each user, and a blacklist mechanism. The filtering rules allow users to control what types of messages are allowed on their walls based on attributes of the message creator and their relationship to the user. The system aims to provide flexible and transparent filtering of messages while minimizing mistakes.
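A minimal sketch of how such filtering rules and a blacklist might combine is shown below; the `Message` fields, rule schema, and trust threshold are illustrative assumptions, not the paper's actual design:

```python
from dataclasses import dataclass

@dataclass
class Message:
    creator: str
    relationship: str   # e.g. "friend", "friend_of_friend", "stranger"
    trust: float        # wall owner's trust in the creator, in [0, 1]

def allowed(msg, rules, blacklist, trust_threshold=0.5):
    """A message is shown only if its creator is not blacklisted,
    the creator's trust meets the owner's threshold, and the
    relationship is one the owner's rules accept."""
    if msg.creator in blacklist:
        return False
    if msg.trust < trust_threshold:
        return False
    return msg.relationship in rules["allowed_relationships"]

rules = {"allowed_relationships": {"friend", "friend_of_friend"}}
blacklist = {"mallory"}

print(allowed(Message("alice", "friend", 0.9), rules, blacklist))    # True
print(allowed(Message("mallory", "friend", 0.9), rules, blacklist))  # False
print(allowed(Message("bob", "stranger", 0.8), rules, blacklist))    # False
```

The point of the structure is that each check (blacklist, threshold, relationship rule) can be customized per user, matching the per-user thresholds the summary describes.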
1) The document discusses challenges with achieving interoperability between ultra large scale systems due to heterogeneity in platforms, data, and semantics.
2) It proposes a three-layered model for interoperability using web service technologies and semantic web approaches to address these challenges.
3) Key aspects of interoperability discussed include different levels (e.g. syntactic, semantic), use of ontologies to provide common understandings and resolve conflicts, and semantic web service approaches like OWL-S that semantically annotate service descriptions.
Semantic Message Addressing based on Social Cloud Actor's Interests - CSCJournals
Wireless communication with mobile terminals has become a popular tool for collecting and sending information and data. With mobile communication comes Short Message Service (SMS) technology, an ideal way to stay connected with anyone, anywhere, anytime, and to help maintain business relationships with customers. Sending individual SMS messages to a long list of mobile numbers can be very time consuming, faces wireless communication problems such as variable and asymmetric bandwidth, geographical mobility, and high usage costs, and suffers from the rigidity of lists. This paper proposes a technique that assures sending a message to a semantically specified group of recipients. A recipient group is automatically identified based on personal information (interests, work place, publications, social relationships, etc.) and behavior, using a populated ontology created by integrating publicly available FOAF (Friend-of-a-Friend) documents. We demonstrate that our simple technique can, first, extract groups effectively according to the descriptive attributes and, second, send SMS effectively; it can help combat unintentional spam and preserve the privacy of mobile numbers and even individual identities. The technique provides a fast, effective, and dynamic solution that saves time in constructing lists and sending group messages, and it can be applied both on a personal level and in business.
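A toy sketch of the group-addressing idea: recipients are selected by matching a profile attribute, loosely mimicking FOAF interest fields. The profile structure below is illustrative, not the actual FOAF schema:

```python
# Hypothetical FOAF-like profiles; in the paper these would be built by
# integrating real FOAF documents into a populated ontology.
profiles = [
    {"name": "alice", "phone": "+100", "interests": {"semantic web", "sms"}},
    {"name": "bob",   "phone": "+101", "interests": {"databases"}},
    {"name": "carol", "phone": "+102", "interests": {"semantic web"}},
]

def recipients(profiles, interest):
    """Return the phone numbers of everyone sharing the given interest."""
    return [p["phone"] for p in profiles if interest in p["interests"]]

group = recipients(profiles, "semantic web")
print(group)  # ['+100', '+102']
```

The sender never handles the raw number list directly, which is how the approach preserves the privacy of mobile numbers.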
This document summarizes a research paper that proposes and compares fuzzy and Naive Bayes models for detecting obfuscated plagiarism in Marathi language texts. It first provides background on plagiarism detection and describes different types of plagiarism, including obfuscated plagiarism. It then presents the fuzzy semantic similarity model, which uses fuzzy logic rules and semantic relatedness between words to calculate similarity scores between texts. Next, it describes the Naive Bayes model for plagiarism detection using Bayes' theorem. The paper compares the performance of the fuzzy and Naive Bayes models on precision, recall, F-measure and granularity. It finds that the Naive Bayes model provides more accurate detection of obfuscated plagiarism.
A Review: Text Classification on Social Media Data - IOSR Journals
This document provides a review of different classifiers used for text classification on social media data. It discusses how social media data is often unstructured and contains users' opinions and sentiments. Various machine learning algorithms can be used to classify this social media text data, extracting meaningful information. The document focuses on describing Naive Bayes classifiers, which are commonly used for text classification tasks. It explains how Naive Bayes classifiers work by calculating the posterior probability that a document belongs to a certain class, based on applying Bayes' theorem with an independence assumption between features.
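The posterior computation the review describes can be sketched from scratch as a small multinomial Naive Bayes classifier, with the toy spam/ham corpus below serving only as illustration:

```python
import math
from collections import Counter

def train(docs):
    """docs: list of (word_list, label). Returns class priors and
    raw word counts for Laplace-smoothed likelihoods."""
    labels = [y for _, y in docs]
    classes = set(labels)
    priors = {c: labels.count(c) / len(labels) for c in classes}
    counts = {c: Counter() for c in classes}
    for words, y in docs:
        counts[y].update(words)
    vocab = {w for words, _ in docs for w in words}
    return priors, counts, vocab

def predict(words, priors, counts, vocab):
    """Pick the class maximizing log P(c) + sum_w log P(w | c),
    i.e. Bayes' theorem with the word-independence assumption."""
    best, best_lp = None, -math.inf
    for c, prior in priors.items():
        total = sum(counts[c].values())
        lp = math.log(prior)
        for w in words:
            # Laplace smoothing so unseen words never zero out a class
            lp += math.log((counts[c][w] + 1) / (total + len(vocab)))
        if lp > best_lp:
            best, best_lp = c, lp
    return best

docs = [
    ("buy cheap pills now".split(), "spam"),
    ("cheap offer buy now".split(), "spam"),
    ("meeting notes attached".split(), "ham"),
    ("see you at the meeting".split(), "ham"),
]
model = train(docs)
print(predict("cheap pills".split(), *model))      # spam
print(predict("meeting tomorrow".split(), *model)) # ham
```

Working in log-space avoids numerical underflow when documents are long, which is standard practice for this family of classifiers.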
Context Driven Technique for Document Classification - IDES Editor
In this paper we present an innovative hybrid Text Classification (TC) system that bridges the gap between statistical and context-based techniques. Our algorithm harnesses contextual information at two stages. First, it extracts a cohesive set of keywords for each category by using lexical references, implicit context as derived from LSA, and word-vicinity-driven semantics. Second, each document is represented by a set of context-rich features whose values are derived by considering both lexical cohesion and the extent of coverage of salient concepts via lexical chaining. After keywords are extracted, a subset of the input documents is apportioned as a training set. Its members are assigned categories based on their keyword representation. These labeled documents are used to train binary SVM classifiers, one for each category. The remaining documents are supplied to the trained classifiers in the form of their context-enhanced feature vectors. Each document is finally ascribed its appropriate category by an SVM classifier.
Classification-based Retrieval Methods to Enhance Information Discovery on th... - IJMIT JOURNAL
The widespread adoption of the World-Wide Web (the Web) has created challenges both for society as a whole and for the technology used to build and maintain the Web. The ongoing struggle of information retrieval systems is to wade through this vast pile of data and satisfy users by presenting them with information that most adequately fits their needs. On a societal level, the Web is expanding faster than we can comprehend its implications or develop rules for its use. The ubiquitous use of the Web has raised important social concerns in the areas of privacy, censorship, and access to information. On a technical level, the novelty of the Web and the pace of its growth have created challenges not only in the development of new applications that realize the power of the Web, but also in the technology needed to scale applications to accommodate the resulting large data sets and heavy loads. This thesis presents searching algorithms and hierarchical classification techniques for increasing a search service's understanding of web queries. Existing search services rely solely on a query's occurrence in the document collection to locate relevant documents. They typically do not perform any task- or topic-based analysis of queries using other available resources, and do not leverage changes in user query patterns over time. Provided within are a set of techniques and metrics for performing temporal analysis on query logs. Our log analyses are shown to be reasonable and informative, and can be used to detect changing trends and patterns in the query stream, thus providing valuable data to a search service.
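The kind of temporal query-log analysis described above can be sketched as bucketing queries by day and flagging terms whose frequency is rising. The log format and data below are assumptions for illustration, not the thesis's actual log schema:

```python
from collections import Counter, defaultdict

# Hypothetical (timestamp, query) log entries
log = [
    ("2024-01-01", "weather"), ("2024-01-01", "news"),
    ("2024-01-02", "weather"), ("2024-01-02", "eclipse"),
    ("2024-01-03", "eclipse"), ("2024-01-03", "eclipse"),
]

# Bucket query frequencies by day
by_day = defaultdict(Counter)
for day, query in log:
    by_day[day][query] += 1

def trending(by_day, earlier, later):
    """Terms strictly more frequent on the later day than the earlier one."""
    return {q for q in by_day[later] if by_day[later][q] > by_day[earlier][q]}

print(sorted(trending(by_day, "2024-01-01", "2024-01-03")))  # ['eclipse']
```

Real systems compare smoothed frequencies over longer windows, but the core operation, per-period counting followed by a cross-period comparison, is the same.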
QUERY EXPANSION WITH ENRICHED USER PROFILES FOR PERSONALIZED SEARCH UTILIZING... - Prasadu Peddi
The document proposes two novel techniques for personalized query expansion using folksonomy data. It introduces a model that constructs enriched user profiles by integrating word embeddings with topic models from user annotations and an external corpus. The first technique selects expansion terms using topical weights-enhanced word embeddings. The second calculates topical relevance between the query and user profile terms. An evaluation shows the approaches outperform existing non-personalized and personalized query expansion methods.
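A toy sketch of the embedding-based selection of expansion terms: candidates are ranked by cosine similarity to the query vector. The 3-dimensional vectors below are made up for illustration; real systems use learned word embeddings, and the paper's topical weights are omitted here:

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Made-up embeddings for illustration only
embeddings = {
    "car":     (0.9, 0.1, 0.0),
    "vehicle": (0.8, 0.2, 0.1),
    "banana":  (0.0, 0.9, 0.3),
}

query_vec = embeddings["car"]
candidates = ["vehicle", "banana"]
ranked = sorted(candidates,
                key=lambda t: cosine(embeddings[t], query_vec),
                reverse=True)
print(ranked)  # ['vehicle', 'banana']
```

In the personalized setting the paper describes, each candidate's score would additionally be weighted by its relevance to the user's profile topics before the final ranking.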
Scaling Down Dimensions and Feature Extraction in Document Repository Classif... - ijdmtaiir
In this study a comprehensive evaluation of two supervised feature selection methods for dimensionality reduction is performed: Latent Semantic Indexing (LSI) and Principal Component Analysis (PCA). These are gauged against unsupervised techniques like fuzzy feature clustering using hard fuzzy C-means (FCM). The main objective of the study is to estimate the relative efficiency of the two supervised techniques against unsupervised fuzzy techniques while reducing the feature space. It is found that clustering using FCM leads to better accuracy in classifying documents than algorithms like LSI and PCA. The results show that clustering of features improves the accuracy of document classification.
Filter unwanted messages from walls and block non-legitimate users in OSN - IJSRD
This document proposes a system to filter unwanted messages from walls and block non-legitimate users in online social networks. It uses machine learning for content-based filtering of messages. Short text is classified and filtering rules are provided to block certain content. Blacklists are also used to prevent some users from posting messages temporarily. The proposed system aims to provide privacy and control over the content visible on users' walls.
New prediction method for data spreading in social networks based on machine ... - TELKOMNIKA JOURNAL
Information diffusion prediction is the study of the path of dissemination of news, information, or topics through structured data such as a graph. Research in this area is focused on two goals: tracing the information diffusion path and finding the members that determine the next path. The major problem of traditional approaches in this area is the use of simple probabilistic methods rather than intelligent methods. Recent years have seen growing interest in the use of machine learning algorithms in this field. Recently, deep learning, a branch of machine learning, has been increasingly applied to information diffusion prediction. This paper presents a machine learning method based on the graph neural network algorithm, which involves the selection of inactive vertices for activation based on the neighboring vertices that are active in a given scientific topic. Essentially, in this method, information diffusion paths are predicted through the activation of inactive vertices by active vertices. The method is tested on three scientific bibliography datasets: The Digital Bibliography and Library Project (DBLP), Pubmed, and Cora. The method attempts to answer the question of who will be the publisher of the next article in a specific field of science. The comparison of the proposed method with other methods shows 10% and 5% improved precision on the DBLP and Pubmed datasets, respectively.
IRJET - Semantic Based Document Clustering Using Lexical Chains - IRJET Journal
This document discusses a semantic-based document clustering approach using lexical chains. It proposes using WordNet to perform word sense disambiguation on documents to extract core semantic features represented as lexical chains. Lexical chains identify semantically related words in a text based on relations like synonyms and hypernyms. Documents are then clustered based on the lexical chains extracted. The approach aims to overcome issues in traditional clustering like synonyms and polysemy by incorporating semantic information from WordNet ontology. It is argued that identifying themes based on disambiguated semantic features extracted via lexical chains can improve text clustering performance compared to bag-of-words models. An evaluation of the approach showed better results when using a threshold of 50% for lexical chain selection.
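A toy lexical-chain builder conveys the core idea: a word joins a chain when it is related to a word already in it. The relation map below is hand-coded for illustration and stands in for WordNet synonym/hypernym lookups:

```python
# Hand-coded word relations standing in for WordNet; illustrative only.
related = {
    ("car", "vehicle"), ("vehicle", "truck"),
    ("apple", "fruit"), ("fruit", "banana"),
}

def in_relation(a, b):
    """Symmetric relatedness check against the toy relation map."""
    return (a, b) in related or (b, a) in related

def lexical_chains(words):
    """Greedily append each word to the first chain containing a
    related word; otherwise start a new chain."""
    chains = []
    for w in words:
        for chain in chains:
            if any(in_relation(w, c) for c in chain):
                chain.append(w)
                break
        else:
            chains.append([w])
    return chains

text = ["car", "apple", "vehicle", "truck", "fruit"]
print(lexical_chains(text))
# [['car', 'vehicle', 'truck'], ['apple', 'fruit']]
```

Each resulting chain approximates one theme of the text; documents can then be clustered on their chains rather than on raw bag-of-words counts, which is the advantage the summary claims.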
Rule based message filtering and blacklist management for online social networks - eSAT Publishing House
IJRET : International Journal of Research in Engineering and Technology is an international peer reviewed, online journal published by eSAT Publishing House for the enhancement of research in various disciplines of Engineering and Technology. The aim and scope of the journal is to provide an academic medium and an important reference for the advancement and dissemination of research results that support high-level learning, teaching and research in the fields of Engineering and Technology. We bring together Scientists, Academicians, Field Engineers, Scholars and Students of related fields of Engineering and Technology.
Exploiting Wikipedia and Twitter for Text Mining Applications - IRJET Journal
This document discusses exploiting Wikipedia and Twitter for text mining applications. It explores using Wikipedia's category-article structure for text classification, subjectivity analysis, and keyword extraction. It evaluates classifying tweets as relevant/irrelevant to entities or brands and classifying tweets into topical dimensions like workplace or innovation. Features used include relatedness scores between tweet text and Wikipedia categories, topic modeling scores, and Twitter-specific features. Experimental results show the Wikipedia framework based on its category-article structure outperforms standard text mining techniques.
The increased potential of ontologies to reduce human interference has a wide range of applications. This paper identifies requirements for an ontology development platform to innovate an artificially intelligent web. To facilitate this process, RDF and OWL have been developed as standard formats for the sharing and integration of data and knowledge, with the knowledge expressed in the form of rich conceptual schemas called ontologies. Based on the framework, an architectural paradigm is put forward in view of ontology engineering and the development of ontology applications, and a development portal is designed to support ontology engineering, content authoring, and application development, with a view to maximal scalability in the size and complexity of semantic knowledge and flexible reuse of ontology models and ontology application processes in a distributed and collaborative engineering environment.
Association Rule Mining Based Extraction of Semantic Relations Using Markov L... - IJwest
An ontology is a conceptualization of a domain into a human-understandable, yet machine-readable, format consisting of entities, attributes, relationships, and axioms. Ontologies formalize the intensional aspects of a domain, whereas the extensional part is provided by a knowledge base that contains assertions about instances of concepts and relations. Using semantic relations, it would be possible to extract the whole family tree of a prominent personality employing a resource like Wikipedia. In a way, relations describe the semantic relationships among the entities involved, which is useful for a better understanding of human language. Relations can be identified from the result of concept hierarchy extraction. The existing ontology learning process only produces the concept hierarchy; it does not produce the semantic relations between the concepts. Here, we construct predicates and first-order logic formulas, and find inference and learning weights using a Markov Logic Network. To improve the relations for every input, and the relations between contents, we propose the concept of ARSRE. This method can find frequent items between concepts and convert existing lightweight ontologies to formal ones. The experimental results show good extraction of semantic relations compared to the state-of-the-art method.
Sentimental classification analysis of polarity multi-view textual data using... - IJECEIAES
The data and information available in most community environments is complex in nature. Sentimental data resources may consist of textual data collected from multiple information sources with different representations, usually handled by different analytical models. These characteristics can form multi-view polarity textual data. However, knowledge creation from this type of sentimental textual data requires considerable analytical effort and capability. In particular, data mining practices can provide exceptional results in handling textual data formats. Moreover, when the textual data exists in multi-view or unstructured formats, hybrid and integrated text data mining algorithms are vital to obtain helpful results. The objective of this research is to enhance knowledge discovery from sentimental multi-view textual data, which can be considered an unstructured data format, by classifying the polarity information documents into two categories of useful information. A proposed framework with integrated data mining algorithms is discussed in this paper, achieved through the application of the X-means algorithm for clustering and the HotSpot algorithm for association rules. The analysis results show improved accuracy in classifying the sentimental multi-view textual data into two categories through the application of the proposed framework on an online polarity user-reviews dataset on given topics.
DOW JONES INDUSTRIAL AVERAGE Time Series Data Analysis: Analysis and Results ... - Ajay Bidyarthy
This document analyzes time series data of the Dow Jones Industrial Average from 2005 to 2012. Various statistical analyses are performed, including calculating descriptive statistics, plotting scatter plots and histograms, and performing autocorrelation and partial autocorrelation analyses. Autoregressive (AR) models (AR(1) and AR(2)) are fitted to the log-return data and found to be stationary. Bootstrap confidence intervals are estimated for the mean and standard deviation of the original and AR models' data.
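The bootstrap confidence-interval step can be sketched with a plain percentile bootstrap for the mean; the return series and parameters below are illustrative, not the document's data:

```python
import random
import statistics

def bootstrap_ci_mean(data, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean: resample with replacement,
    collect each resample's mean, take the alpha/2 and 1-alpha/2 quantiles."""
    rng = random.Random(seed)
    means = sorted(
        statistics.fmean(rng.choices(data, k=len(data)))
        for _ in range(n_boot)
    )
    lo = means[int(n_boot * alpha / 2)]
    hi = means[int(n_boot * (1 - alpha / 2)) - 1]
    return lo, hi

# Illustrative daily log-returns
returns = [0.01, -0.02, 0.015, 0.003, -0.007, 0.02, -0.011, 0.005]
low, high = bootstrap_ci_mean(returns)
print(low < statistics.fmean(returns) < high)
```

The same resampling applies to the standard deviation by swapping `statistics.fmean` for `statistics.stdev`; for AR-model parameters one would resample residuals instead of raw observations.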
Understanding Reality - The Allisone Paradigm
This presentation deals with Seven Principles of Reality to help you understand the nature of reality.
These Principles are:
1. Fractals
2. Sacred Geometry
3. Cymatics
4. Swarm Intelligence
5. Holography
6. Biocentrism
7. The Trinity
Another Huge Paradigm Shift
In 1610, Galileo changed the Astronomical Paradigm. Again, in 1905, Albert Einstein changed the Scientific Paradigm. Now, our current paradigm is about to make another profound change. This new paradigm is already here – it is called the Allisone Paradigm™.
For thousands of years, humanity’s lack of understanding of the fundamental principles of reality has diminished the quality of life on our planet.
We have the ability to change our lives for the better -- by “creating” the things we want in our lives instead of creating the things we don't want.
As the Allisone Paradigm™ becomes a widely accepted belief system, peace will begin to break out on the earth – because each of us will become aware that others are actually One with us – and we cannot harm (or help) another without harming (or benefiting) ourselves.
This Oneness principle is not just a sweet religious concept – it is pure physical reality; To believe otherwise means that we have become ensnared in the illusions that we have been taught all of our lives.
Wonderful scientific and spiritual insights will come to you as you begin to comprehend Oneness principles.
David Allisone -- is a Petroleum Engineer and Geologist with an MBA. He is also halfway to a Master's degree in Psychology. David was a USAF Intelligence Officer during the Vietnam War / Cold War and he has worked as an engineer and computer specialist with five Fortune-100 Companies. He was the Chairman of Houston's oldest Internet Company and he is the founder of the petroleum industry's oldest project listing service, which connects energy projects with funding sources. After spending years as an agnostic, David has regained an appreciation for Christianity, as well as other religions of the world and, accordingly, he felt inspired to revive the ancient non-denominational order of the Children of the Law of One. He has been happily married for over 50 years and he has four sons, thirteen grandchildren, and he has lived in The Woodlands, Texas for the past 38 years. While David actively participates in several spiritual traditions, he follows but one religion – called TRUTH.
Phone: 281-962-0400
Email: davidallisone@gmail.com
www.childrenofthelawofone.org
Where all religions meet, there is One God.
Ideliance is a precursor of the so-called Semantic Web. The first version was developed in 1993. It allows individuals and groups to organize their personal or collective/corporate knowledge as a semantic network. This document is a presentation of the product written in the year 2000. Ideliance has been marketed to large companies such as Air Liquide, France Telecom, PSA, CEA, EDF, GDF, Danone, and Merck. It has been used in military intelligence applications. It was designed by Sylvie Le Bars (Arkandis.com) and Jean Rohmer, and implemented mainly by Stéphane Jean and Denis Poisson.
A system to filter unwanted messages from the… (Madan Golla)
This document presents a system to filter unwanted messages from social network users' walls. It consists of three main components: filtering rules, thresholds for applying the rules which are customized for each user, and a blacklist mechanism. The filtering rules allow users to control what types of messages are allowed on their walls based on attributes of the message creator and their relationship to the user. The system aims to provide flexible and transparent filtering of messages while minimizing mistakes.
1) The document discusses challenges with achieving interoperability between ultra large scale systems due to heterogeneity in platforms, data, and semantics.
2) It proposes a three-layered model for interoperability using web service technologies and semantic web approaches to address these challenges.
3) Key aspects of interoperability discussed include different levels (e.g. syntactic, semantic), use of ontologies to provide common understandings and resolve conflicts, and semantic web service approaches like OWL-S that semantically annotate service descriptions.
Semantic Message Addressing based on Social Cloud Actor's Interests (CSCJournals)
Wireless communication with mobile terminals has become a popular tool for collecting and sending information and data. With mobile communication comes Short Message Service (SMS) technology, an ideal way to stay connected with anyone, anywhere, anytime, and to help maintain business relationships with customers. Sending individual SMS messages to a long list of mobile numbers can be very time consuming, faces the usual problems of wireless communication such as variable and asymmetric bandwidth, geographical mobility, and high usage costs, and suffers from the rigidity of static lists. This paper proposes a technique that ensures a message is sent to a semantically specified group of recipients. A recipient group is automatically identified based on personal information (interests, workplace, publications, social relationships, etc.) and behavior, using a populated ontology created by integrating publicly available FOAF (Friend-of-a-Friend) documents. We demonstrate that this simple technique can, first, extract groups effectively according to the descriptive attributes and, second, send SMS messages effectively; it can also help combat unintentional spam and preserve the privacy of mobile numbers and even individual identities. The technique provides a fast, effective, and dynamic solution to save time in constructing lists and sending group messages, applicable at both the personal and the business level.
This document summarizes a research paper that proposes and compares fuzzy and Naive Bayes models for detecting obfuscated plagiarism in Marathi language texts. It first provides background on plagiarism detection and describes different types of plagiarism, including obfuscated plagiarism. It then presents the fuzzy semantic similarity model, which uses fuzzy logic rules and semantic relatedness between words to calculate similarity scores between texts. Next, it describes the Naive Bayes model for plagiarism detection using Bayes' theorem. The paper compares the performance of the fuzzy and Naive Bayes models on precision, recall, F-measure and granularity. It finds that the Naive Bayes model provides more accurate detection of obfuscated plagiarism.
A Review: Text Classification on Social Media Data (IOSR Journals)
This document provides a review of different classifiers used for text classification on social media data. It discusses how social media data is often unstructured and contains users' opinions and sentiments. Various machine learning algorithms can be used to classify this social media text data, extracting meaningful information. The document focuses on describing Naive Bayes classifiers, which are commonly used for text classification tasks. It explains how Naive Bayes classifiers work by calculating the posterior probability that a document belongs to a certain class, based on applying Bayes' theorem with an independence assumption between features.
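The posterior calculation described above can be sketched with a minimal multinomial Naive Bayes classifier over word counts. This is an illustrative toy, not code from the reviewed paper; the documents and labels are invented for demonstration:

```python
import math
from collections import Counter

def train_nb(docs, labels):
    """Estimate log priors and add-one-smoothed log likelihoods per class."""
    classes = set(labels)
    prior = {c: math.log(labels.count(c) / len(labels)) for c in classes}
    counts = {c: Counter() for c in classes}
    for doc, c in zip(docs, labels):
        counts[c].update(doc.split())
    vocab = {w for cnt in counts.values() for w in cnt}
    loglik = {}
    for c in classes:
        total = sum(counts[c].values())
        loglik[c] = {w: math.log((counts[c][w] + 1) / (total + len(vocab)))
                     for w in vocab}
    return prior, loglik, vocab

def classify(doc, prior, loglik, vocab):
    """Pick the class maximizing log P(c) + sum of log P(w|c) over known words."""
    scores = {c: prior[c] + sum(loglik[c][w] for w in doc.split() if w in vocab)
              for c in prior}
    return max(scores, key=scores.get)

docs = ["buy cheap pills now", "meeting agenda for monday",
        "cheap offer click now", "project status meeting"]
labels = ["spam", "ham", "spam", "ham"]
model = train_nb(docs, labels)
print(classify("cheap pills offer", *model))  # → spam
```

The independence assumption shows up directly in the sum over per-word log likelihoods: each word contributes to the score independently of the others.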
Context Driven Technique for Document Classification (IDES Editor)
In this paper we present an innovative hybrid Text Classification (TC) system that bridges the gap between statistical and context-based techniques. Our algorithm harnesses contextual information at two stages. First, it extracts a cohesive set of keywords for each category by using lexical references, implicit context as derived from LSA, and word-vicinity-driven semantics. Second, each document is represented by a set of context-rich features whose values are derived by considering both lexical cohesion and the extent of coverage of salient concepts via lexical chaining. After keywords are extracted, a subset of the input documents is apportioned as a training set. Its members are assigned categories based on their keyword representation. These labeled documents are used to train binary SVM classifiers, one for each category. The remaining documents are supplied to the trained classifiers in the form of their context-enhanced feature vectors. Each document is finally ascribed its appropriate category by an SVM classifier.
Classification-based Retrieval Methods to Enhance Information Discovery on th... (IJMIT JOURNAL)
The widespread adoption of the World-Wide Web (the Web) has created challenges both for society as a whole and for the technology used to build and maintain the Web. The ongoing struggle of information retrieval systems is to wade through this vast pile of data and satisfy users by presenting them with information that most adequately fits their needs. On a societal level, the Web is expanding faster than we can comprehend its implications or develop rules for its use. The ubiquitous use of the Web has raised important social concerns in the areas of privacy, censorship, and access to information. On a technical level, the novelty of the Web and the pace of its growth have created challenges not only in the development of new applications that realize the power of the Web, but also in the technology needed to scale applications to accommodate the resulting large data sets and heavy loads. This thesis presents searching algorithms and hierarchical classification techniques for increasing a search service's understanding of web queries. Existing search services rely solely on a query's occurrence in the document collection to locate relevant documents. They typically do not perform any task- or topic-based analysis of queries using other available resources, and do not leverage changes in user query patterns over time. Provided within are a set of techniques and metrics for performing temporal analysis on query logs. Our log analyses are shown to be reasonable and informative, and can be used to detect changing trends and patterns in the query stream, thus providing valuable data to a search service.
QUERY EXPANSION WITH ENRICHED USER PROFILES FOR PERSONALIZED SEARCH UTILIZING... (Prasadu Peddi)
The document proposes two novel techniques for personalized query expansion using folksonomy data. It introduces a model that constructs enriched user profiles by integrating word embeddings with topic models from user annotations and an external corpus. The first technique selects expansion terms using topical weights-enhanced word embeddings. The second calculates topical relevance between the query and user profile terms. An evaluation shows the approaches outperform existing non-personalized and personalized query expansion methods.
Scaling Down Dimensions and Feature Extraction in Document Repository Classif... (ijdmtaiir)
In this study a comprehensive evaluation of two supervised feature selection methods for dimensionality reduction is performed: Latent Semantic Indexing (LSI) and Principal Component Analysis (PCA). This is gauged against unsupervised techniques like fuzzy feature clustering using hard fuzzy C-means (FCM). The main objective of the study is to estimate the relative efficiency of the two supervised techniques against unsupervised fuzzy techniques while reducing the feature space. It is found that clustering using FCM leads to better accuracy in classifying documents than techniques like LSI and PCA. Results show that the clustering of features improves the accuracy of document classification.
Filter unwanted messages from walls and blocking nonlegitimate user in osn (IJSRD)
This document proposes a system to filter unwanted messages from walls and block non-legitimate users in online social networks. It uses machine learning for content-based filtering of messages. Short text is classified and filtering rules are provided to block certain content. Blacklists are also used to prevent some users from posting messages temporarily. The proposed system aims to provide privacy and control over the content visible on users' walls.
New prediction method for data spreading in social networks based on machine ... (TELKOMNIKA JOURNAL)
Information diffusion prediction is the study of the path of dissemination of news, information, or topics in structured data such as a graph. Research in this area is focused on two goals: tracing the information diffusion path and finding the members that determine the next path. The major problem of traditional approaches in this area is the use of simple probabilistic methods rather than intelligent methods. Recent years have seen growing interest in the use of machine learning algorithms in this field. Recently, deep learning, a branch of machine learning, has been increasingly used in the field of information diffusion prediction. This paper presents a machine learning method based on the graph neural network algorithm, which involves the selection of inactive vertices for activation based on the neighboring vertices that are active in a given scientific topic. Basically, in this method, information diffusion paths are predicted through the activation of inactive vertices by active vertices. The method is tested on three scientific bibliography datasets: the Digital Bibliography and Library Project (DBLP), Pubmed, and Cora. The method attempts to answer the question of who will publish the next article in a specific field of science. The comparison of the proposed method with other methods shows 10% and 5% improved precision on the DBLP and Pubmed datasets, respectively.
IRJET - Semantic Based Document Clustering Using Lexical Chains (IRJET Journal)
This document discusses a semantic-based document clustering approach using lexical chains. It proposes using WordNet to perform word sense disambiguation on documents to extract core semantic features represented as lexical chains. Lexical chains identify semantically related words in a text based on relations like synonyms and hypernyms. Documents are then clustered based on the lexical chains extracted. The approach aims to overcome issues in traditional clustering like synonyms and polysemy by incorporating semantic information from WordNet ontology. It is argued that identifying themes based on disambiguated semantic features extracted via lexical chains can improve text clustering performance compared to bag-of-words models. An evaluation of the approach showed better results when using a threshold of 50% for lexical chain selection.
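The chain-building step described above can be sketched as a greedy grouping: each word joins the first existing chain containing a related word, or starts a new chain. This is a simplified illustration, not the paper's implementation; the WordNet synonym/hypernym lookup is replaced by a tiny hand-made relatedness table:

```python
# Hand-made stand-in for WordNet relations (synonyms/hypernyms); assumption
# for illustration only.
RELATED = {
    ("car", "automobile"), ("car", "vehicle"), ("automobile", "vehicle"),
    ("dog", "animal"), ("cat", "animal"), ("dog", "cat"),
}

def related(a, b):
    return a == b or (a, b) in RELATED or (b, a) in RELATED

def build_chains(words):
    """Greedily assign each word to the first chain with a related member."""
    chains = []
    for w in words:
        for chain in chains:
            if any(related(w, m) for m in chain):
                chain.append(w)
                break
        else:
            chains.append([w])
    return chains

print(build_chains(["car", "dog", "automobile", "cat", "vehicle"]))
# → [['car', 'automobile', 'vehicle'], ['dog', 'cat']]
```

In the full approach each chain would then be scored (e.g. by length and relation strength) and the top chains kept as the document's semantic features.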
Rule based message filtering and blacklist management for online social network (eSAT Publishing House)
IJRET: International Journal of Research in Engineering and Technology is an international peer-reviewed online journal published by eSAT Publishing House for the enhancement of research in various disciplines of Engineering and Technology. The aim and scope of the journal is to provide an academic medium and an important reference for the advancement and dissemination of research results that support high-level learning, teaching and research in the fields of Engineering and Technology. We bring together scientists, academicians, field engineers, scholars and students of related fields of Engineering and Technology.
Exploiting Wikipedia and Twitter for Text Mining Applications (IRJET Journal)
This document discusses exploiting Wikipedia and Twitter for text mining applications. It explores using Wikipedia's category-article structure for text classification, subjectivity analysis, and keyword extraction. It evaluates classifying tweets as relevant/irrelevant to entities or brands and classifying tweets into topical dimensions like workplace or innovation. Features used include relatedness scores between tweet text and Wikipedia categories, topic modeling scores, and Twitter-specific features. Experimental results show the Wikipedia framework based on its category-article structure outperforms standard text mining techniques.
The growing potential of ontologies to reduce human intervention has a wide range of applications. This paper identifies requirements for an ontology development platform to enable an artificially intelligent web. To facilitate this process, RDF and OWL have been developed as standard formats for the sharing and integration of data and knowledge. The knowledge takes the form of rich conceptual schemas called ontologies. Based on the framework, an architectural paradigm is put forward for ontology engineering and the development of ontology applications, along with a development portal designed to support ontology engineering, content authoring and application development, with a view to maximal scalability in the size and complexity of semantic knowledge and flexible reuse of ontology models and ontology application processes in a distributed and collaborative engineering environment.
Association Rule Mining Based Extraction of Semantic Relations Using Markov L... (IJwest)
An ontology is a conceptualization of a domain into a human-understandable, machine-readable format consisting of entities, attributes, relationships and axioms. Ontologies formalize the intensional aspects of a domain, whereas the extensional part is provided by a knowledge base that contains assertions about instances of concepts and relations. Using semantic relations, it would be possible to extract the whole family tree of a prominent personality from a resource like Wikipedia. In a way, relations describe the semantic relationships among the entities involved, which is useful for a better understanding of human language. Relations can be identified from the result of concept hierarchy extraction. The existing ontology learning process only produces the result of concept hierarchy extraction; it does not produce the semantic relations between concepts. Here, we construct predicates and first-order logic formulas, and find inference and learning weights using a Markov Logic Network. To improve the relations of every input, and the relations between contents, we propose the concept of ARSRE. This method can find frequent items between concepts and convert existing lightweight ontologies into formal ones. The experimental results show good extraction of semantic relations compared to state-of-the-art methods.
Sentimental classification analysis of polarity multi-view textual data using... (IJECEIAES)
The data and information available in most community environments is complex in nature. Sentimental data resources may consist of textual data collected from multiple information sources with different representations, usually handled by different analytical models. These data resource characteristics can form multi-view polarity textual data. However, knowledge creation from this type of sentimental textual data requires considerable analytical effort and capability. In particular, data mining practices can provide exceptional results in handling textual data formats. Moreover, when the textual data exists in multi-view or unstructured formats, hybrid and integrated text data mining algorithms are vital to get helpful results. The objective of this research is to enhance knowledge discovery from sentimental multi-view textual data, which can be considered an unstructured data format, by classifying polarity information documents into two different categories of useful information. A proposed framework with integrated data mining algorithms is discussed in this paper, achieved through the application of the X-means algorithm for clustering and the HotSpot algorithm for association rules. The analysis results show improved accuracy in classifying sentimental multi-view textual data into two categories when applying the proposed framework to an online polarity user-reviews dataset on a given topic.
DOW JONES INDUSTRIAL AVERAGE Time series Data Analysis: Analysis and Results ... (Ajay Bidyarthy)
This document analyzes time series data of the Dow Jones Industrial Average from 2005 to 2012. Various statistical analyses are performed, including calculating descriptive statistics, plotting scatter plots and histograms, and performing autocorrelation and partial autocorrelation analyses. Autoregressive (AR) models (AR(1) and AR(2)) are fitted to the log-return data and found to be stationary. Bootstrap confidence intervals are estimated for the mean and standard deviation of the original and AR models' data.
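The core steps of that analysis (log returns, AR(1) fit, bootstrap confidence interval) can be sketched as follows. This uses a synthetic price series as a stand-in for the actual DJIA data, and fits AR(1) by ordinary least squares rather than whatever estimator the document used:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic price series standing in for DJIA closes (not real data).
prices = 100 * np.exp(np.cumsum(rng.normal(0.0005, 0.01, 500)))
r = np.diff(np.log(prices))          # log returns

# Fit AR(1): r_t = c + phi * r_{t-1} + eps_t, via least squares.
X = np.column_stack([np.ones(len(r) - 1), r[:-1]])
c, phi = np.linalg.lstsq(X, r[1:], rcond=None)[0]
print(f"phi = {phi:.3f}")            # |phi| < 1 indicates a stationary AR(1)

# Bootstrap 95% confidence interval for the mean log return.
boot = [rng.choice(r, size=len(r), replace=True).mean() for _ in range(1000)]
lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"mean return 95% CI: [{lo:.5f}, {hi:.5f}]")
```

The stationarity check in the document corresponds to verifying that the estimated AR coefficients keep the characteristic roots inside the unit circle; for AR(1) that reduces to |phi| < 1.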
Understanding Reality: The Allisone Paradigm
This presentation deals with Seven Principles of Reality to help you understand the nature of reality.
These Principles are:
1. Fractals
2. Sacred Geometry
3. Cymatics
4. Swarm Intelligence
5. Holography
6. Biocentrism
7. The Trinity
This document discusses fuzzy rules and fuzzy reasoning. It covers the extension principle, fuzzy relations, fuzzy if-then rules, the compositional rule of inference, and fuzzy reasoning using single and multiple rules with single and multiple antecedents. Methods like max-min and max-product composition are presented for combining fuzzy relations. Linguistic variables and terms that take linguistic values like "old" are also introduced.
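The max-min composition mentioned above combines two fuzzy relations R (on X×Y) and S (on Y×Z) by taking, for each pair (x, z), the maximum over y of min(R(x, y), S(y, z)). A minimal sketch with relations as plain nested lists (the example matrices are invented):

```python
def max_min_compose(R, S):
    """Max-min composition of fuzzy relations R (X x Y) and S (Y x Z):
    (R o S)[i][j] = max over k of min(R[i][k], S[k][j])."""
    return [[max(min(R[i][k], S[k][j]) for k in range(len(S)))
             for j in range(len(S[0]))]
            for i in range(len(R))]

R = [[0.2, 0.8], [0.6, 0.4]]
S = [[0.5, 0.9], [0.7, 0.3]]
print(max_min_compose(R, S))  # → [[0.7, 0.3], [0.5, 0.6]]
```

Max-product composition differs only in replacing `min(...)` with a product; the compositional rule of inference applies the same operation to a fuzzy set and a relation encoding an if-then rule.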
The document is a social media handbook for the United States Army. It provides guidance on using social media for soldiers and army personnel. It emphasizes maintaining operational security and not sharing sensitive information online. The handbook also outlines standards for army leaders on social media and provides checklists for establishing an official social media presence and handling crisis communications through social media.
Artificial Intelligence Past Present and Future (Jean Rohmer)
A presentation from IFIP Congress 2004, where I give my vision of the evolution of AI, the reasons for the AI winter, and my belief that it is the only way to improve information processing in the future.
L'informatique n'est pas l'amie des données ("Computing is no friend of data") (Jean Rohmer)
Here is the presentation I gave at the GREC-O colloquium "Les systèmes complexes face au tsunami exponentiel du numérique" ("Complex systems facing the exponential digital tsunami").
For me, a piece of data is a complete sentence, a statement.
In it I explain that computer hardware was not built to process data, and neither were programming languages.
And that this greatly handicaps the uses of computing.
New Challenges in Learning Classifier Systems: Mining Rarities and Evolving F... (Albert Orriols-Puig)
The document discusses new challenges in learning classifier systems (LCS) when dealing with domains containing rare classes. It proposes using a design decomposition approach to analyze how LCS address rare classes. Specifically, it examines how the extended classifier system (XCS) handles rare classes. It identifies five critical elements of LCS that are important for detecting small niches associated with rare classes: 1) estimating classifier parameters correctly, 2) providing representatives of rare niches during initialization, 3) generating and growing representatives of rare niches, 4) adjusting the genetic algorithm application rate, and 5) ensuring representatives of rare niches dominate their niches. The document focuses on analyzing the first element, estimating classifier parameters for XCS, when dealing with domains containing rare classes.
The document discusses particle swarm optimization (PSO), which is a population-based optimization technique where multiple candidate solutions called particles fly through the problem search space looking for the optimal position. Each particle adjusts its position based on its own experience and the experience of neighboring particles. The procedure for implementing PSO involves initializing particles with random positions and velocities, evaluating each particle, updating particles' velocities and positions based on personal and global best experiences, and repeating until a stopping criterion is met. The document also discusses modifications to basic PSO such as limiting maximum velocity, adding an inertia weight, using a constriction factor, features of PSO, and strategies for selecting PSO parameters.
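The procedure described above (initialize, evaluate, update velocities from personal and global bests, repeat) can be sketched as a basic global-best PSO with an inertia weight. This is an illustrative minimal version; the parameter values are conventional defaults, not taken from the document:

```python
import random

def pso(f, dim, n_particles=30, iters=200, w=0.7, c1=1.5, c2=1.5, bound=5.0):
    """Minimize f over [-bound, bound]^dim with basic global-best PSO."""
    pos = [[random.uniform(-bound, bound) for _ in range(dim)]
           for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]                       # personal bests
    pbest_val = [f(p) for p in pos]
    g = pbest[min(range(n_particles), key=lambda i: pbest_val[i])][:]
    for _ in range(iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = random.random(), random.random()
                # inertia + cognitive pull to pbest + social pull to gbest
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (g[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            val = f(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < f(g):
                    g = pos[i][:]
    return g

best = pso(lambda x: sum(v * v for v in x), dim=2)
print(best)  # should land close to the sphere function's minimum at [0, 0]
```

The modifications the document mentions map directly onto this loop: clamping `vel[i][d]` implements maximum-velocity limiting, `w` is the inertia weight, and a constriction factor would scale the whole velocity update.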
This document explains how to build a deductive inference engine for rule-based systems and business rules. It leads to a useful architecture for complex event processing and data streams.
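The deductive engine described can be sketched as a naive forward-chaining loop over if-then rules, a drastic simplification of the architecture the document develops (the example rules are invented):

```python
def forward_chain(facts, rules):
    """Repeatedly fire rules (premises -> conclusion) until no new facts
    can be derived; returns the deductive closure of the fact set."""
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if set(premises) <= facts and conclusion not in facts:
                facts.add(conclusion)
                changed = True
    return facts

rules = [
    (["order_placed", "in_stock"], "ship_order"),
    (["ship_order"], "send_invoice"),
]
print(forward_chain({"order_placed", "in_stock"}, rules))
```

A complex-event-processing variant would feed facts in incrementally from a stream and fire only the rules whose premises mention the newly arrived fact, rather than rescanning the whole rule set.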
A brief introduction to the principles of particle swarm optimization by Rajorshi Mukherjee. This presentation has been compiled from various sources (not my own work) and proper references have been made in the bibliography section for further reading. This presentation was made for submission for our college subject Soft Computing.
Intelligence Artificielle: résolution de problèmes en Prolog ou Prolog pour l... ("Artificial Intelligence: problem solving in Prolog, or Prolog for...") (Jean Rohmer)
This paper explains in detail, and in a pedagogical way, how to solve artificial intelligence problems using the Prolog language. The classic wolf, goat, and cabbage puzzle and the Tower of Hanoi are explained in detail. It describes how to apply the "general problem solver" approach in Prolog.
Introduction and architecture of expert system (premdeshmane)
An expert system is an interactive computer program that uses knowledge acquired from experts to solve complex problems in a specific domain. It consists of an inference engine that applies rules and logic to the facts contained within a knowledge base in order to provide recommendations or advice to users. The first expert system was called DENDRAL and was developed in the 1970s at Stanford University to identify unknown organic molecules. Expert systems are used in applications like diagnosis, financial planning, configuration, and more to perform tasks previously requiring human expertise. They have benefits like increased productivity and quality, reduced costs and errors, and the ability to capture scarce human knowledge. However, they also have limitations, such as the difficulty of acquiring and representing human expertise and an inability to operate outside their narrow domain of expertise.
This document discusses particle swarm optimization (PSO), which is an optimization technique inspired by swarm intelligence. It summarizes that PSO was developed in 1995 and can be applied to various search and optimization problems. PSO works by having a swarm of particles that communicate locally to find the best solution within a search space, balancing exploration and exploitation.
The document discusses Particle Swarm Optimization (PSO), which is an optimization technique inspired by swarm intelligence and the social behavior of bird flocking. PSO initializes a population of random solutions and searches for optima by updating generations of candidate solutions. Each candidate, or particle, updates its position based on its own experience and the experience of neighboring highly-ranked particles. The algorithm is simple to implement and converges quickly to produce approximate solutions to difficult optimization problems.
An expert system is a knowledge-based information system that uses knowledge from a specific domain to provide information to users like a human expert. Expert systems are useful when human experts are unavailable, inconsistent, or unable to clearly explain decisions. They can be applied when a problem lacks a clear algorithmic solution, is hazardous, has a scarcity of human experts, or requires standardization. Some examples of early expert systems include LITHIAN which advised archaeologists and DENDRAL which identified chemical structures. Expert systems have advantages like enhanced decision quality, reduced consulting costs, and ability to solve complex problems, but developing and maintaining them can be difficult and expensive.
The document discusses expert systems, which are designed to solve real problems in a particular domain that normally require human expertise. Developing an expert system involves extracting knowledge from domain experts. The key components of an expert system are the knowledge base, inference engine, explanation facility, knowledge acquisition facility, and user interface. Expert systems use knowledge rather than data to solve problems and can explain their reasoning. They have limitations such as being difficult to maintain and only applicable to narrow problems.
Lecture 5: Expert Systems and Artificial Intelligence (Kodok Ngorex)
Expert systems aim to emulate human expertise by storing knowledge provided by human experts. They utilize various artificial intelligence techniques like rule-based reasoning, pattern recognition, and case-based reasoning to solve complex problems. An expert system consists of a user interface, knowledge base containing domain-specific knowledge, and an inference engine that applies logic and reasoning to the knowledge base. While expert systems can increase availability of expertise, there are limitations in coding human common sense and adapting to new problems.
Content Based Message Filtering For OSNs Using Machine Learning Classifier (IJMER)
The document proposes a content-based message filtering system for online social networks (OSNs) using machine learning classifiers. It aims to filter unwanted messages from OSN user walls. The system uses a machine learning classifier to categorize messages and implements customizable filtering rules. It also includes a blacklist mechanism to block users who frequently post unwanted content. The architecture is divided into three layers: a social network manager layer, a content filtering layer using classifiers, and a graphical user interface layer. Filtering rules allow restricting messages based on sender attributes and relationships. Blacklist rules determine which users to block based on the percentage of their messages that violate rules.
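The blacklist criterion just described (block a user once the share of their messages that violate the filtering rules crosses a threshold) can be sketched as follows; the threshold value is an illustrative assumption, not one taken from the paper:

```python
def should_blacklist(history, threshold=0.3):
    """history: list of booleans, True where a message violated a filtering
    rule. Block the user when the violation ratio exceeds the threshold."""
    if not history:
        return False
    return sum(history) / len(history) > threshold

print(should_blacklist([True, False, True, False, False]))   # 0.4 > 0.3 -> True
print(should_blacklist([False, False, True, False, False]))  # 0.2 -> False
```

In the full system this check would run per wall owner, since each user can customize both the filtering rules and the blacklist threshold, and a block would typically be temporary rather than permanent.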
This document describes a system called Filtered Wall (FW) that aims to filter unwanted messages from users' walls on online social networks (OSNs). The system uses machine learning techniques like radial basis function networks to classify short text messages as neutral or non-neutral. Non-neutral messages are further classified into categories. The system also provides flexible rules that allow users to specify which content should not be displayed on their walls based on criteria like user relationships, profiles, and user-defined blacklists. When a user posts a message, the system extracts metadata using text classification and enforces the user's filtering rules to determine if the message will be published or filtered.
This document discusses machine learning techniques for filtering unwanted messages in online social networks. It proposes a content-based filtering system that allows users to control the messages posted on their walls by filtering out unwanted messages. The system uses a machine learning-based classifier to automatically categorize short text messages based on their content. It also includes a blacklist feature to block specific users from posting if they consistently share unwanted messages. The goal is to give users better control over their social media experience by reducing noise and unwanted content on their walls.
The document proposes a system to automatically filter unwanted messages from online social network user walls based on message content and the relationship between the message creator and recipient. It utilizes machine learning text classification techniques to categorize messages and provides flexible rules that allow users to customize filtering criteria for their walls. The system was found to effectively filter political and vulgar messages while allowing for personalized control over wall content.
An automatic filtering task in osn using content based approach (IAEME Publication)
This document summarizes an academic paper on developing an automatic filtering system for online social networks using content-based approaches. It describes a three-tier architecture for the filtering system, with the lowest layer managing social networks, a middle layer performing message categorization and blacklisting, and a top layer providing a graphical user interface. The system works by intercepting messages, extracting metadata using machine learning classification, applying filtering and blacklisting rules, and publishing approved messages while filtering unwanted ones based on content and creator. It aims to allow users more control over messages on their walls by blocking offensive, political, or other undesirable content in an automatic way.
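The intercept-classify-filter-publish flow described above can be sketched in a few lines (the term lists, rule shapes, and names are illustrative stand-ins, not the paper's actual implementation):

```python
# Stand-in for the ML classifier: label a message with any matching categories.
def classify(message):
    unwanted_terms = {"vulgar", "offensive", "political"}
    labels = {t for t in unwanted_terms if t in message.lower()}
    return labels or {"neutral"}

# Apply blacklist and filtering rules to decide the message's fate.
def apply_rules(labels, creator, blacklist, blocked_labels):
    if creator in blacklist:
        return "filtered"
    if labels & blocked_labels:
        return "filtered"
    return "published"

labels = classify("a political rant")
result = apply_rules(labels, "alice", blacklist=set(), blocked_labels={"political"})
print(result)  # filtered
```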
A novel method for generating an elearning ontology (IJDKP)
The Semantic Web provides a common framework that allows data to be shared and reused across
applications, enterprises, and community boundaries. The existing web applications need to express
semantics that can be extracted from users' navigation and content, in order to fulfill users' needs. Elearning
has specific requirements that can be satisfied through the extraction of semantics from learning
management systems (LMS) that use relational databases (RDB) as backend. In this paper, we propose
transformation rules for building owl ontology from the RDB of the open source LMS Moodle. It allows
transforming all possible cases in RDBs into ontological constructs. The proposed rules are enriched by
analyzing stored data to detect disjointness and totalness constraints in hierarchies, and calculating the
participation level of tables in n-ary relations. In addition, our technique is generic; hence it can be applied
to any RDB.
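A minimal sketch of the kind of transformation rule described, mapping a table to an owl:Class and a column to an owl:DatatypeProperty (pure-Python triples for illustration; a real implementation would emit OWL via an RDF library, and the table and column names here are hypothetical, not Moodle's actual schema):

```python
# Accumulate RDF-style triples produced by the transformation rules.
triples = set()

def table_to_class(table):
    # Rule: every table becomes an OWL class.
    triples.add((f"ex:{table}", "rdf:type", "owl:Class"))

def column_to_property(table, column, xsd_type):
    # Rule: every column becomes a datatype property scoped to its table.
    triples.add((f"ex:{column}", "rdf:type", "owl:DatatypeProperty"))
    triples.add((f"ex:{column}", "rdfs:domain", f"ex:{table}"))
    triples.add((f"ex:{column}", "rdfs:range", xsd_type))

table_to_class("course")
column_to_property("course", "fullname", "xsd:string")
print(len(triples))  # 4
```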
Filter unwanted messages from walls and blocking nonlegitimate user in osn (IJSRD)
Today's life is largely based on the Internet; nowadays people cannot imagine life without it. Information and communication technology plays a vital role in today's online networked society, and online social networks are used for posting and sharing information across various social networking sites. However, online social networks provide little or no support for protecting users' sensitive information and preserving their privacy. To filter unwanted messages, we propose a system using machine learning (ML), in which content-based filtering is performed by a soft classifier. The proposed system provides filtering rules (FRs) for content-independent filtering, and blacklists add flexibility by widening the filtering choices. The proposed system thus provides security to online social networks.
Here are the key points about using content-based filtering techniques:
- Content-based filtering relies on analyzing the content or description of items to recommend items similar to what the user has liked in the past. It looks for patterns and regularities in item attributes/descriptions to distinguish highly rated items.
- The item content/descriptions are analyzed automatically by extracting information from sources like web pages, or entered manually from product databases.
- It focuses on objective attributes about items that can be extracted algorithmically, like text analysis of documents.
- However, personal preferences and what makes an item appealing are often subjective qualities not easily extracted algorithmically, like writing style or taste.
- So while content-based filtering works well for items whose relevant attributes can be extracted algorithmically, it tends to miss these subjective qualities of appeal.
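A minimal sketch of this idea, scoring catalogue items against a liked item by TF-IDF cosine similarity over their descriptions (the catalogue and descriptions are invented):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical item descriptions; recommend items resembling what the user liked.
items = [
    "space opera novel with interstellar battles",
    "cookbook of quick vegetarian recipes",
    "science fiction story about alien contact",
]
liked = "science fiction novel set in space"

# Vectorize catalogue plus the liked description, then compare the liked
# item's vector against every catalogue item.
vec = TfidfVectorizer()
matrix = vec.fit_transform(items + [liked])
scores = cosine_similarity(matrix[-1], matrix[:-1]).ravel()
best = scores.argmax()
```

The cookbook shares no vocabulary with the liked description, so its score is lowest; either science-fiction item may rank first depending on the TF-IDF weights.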
Filter unwanted messages from walls and blocking non legitimate users in osn (IAEME Publication)
1. The document presents a system to filter unwanted messages from user walls in online social networks. It aims to give users more control over the content that appears on their walls.
2. A machine learning classifier is used to automatically label messages by category. Users can then specify filtering rules to block certain categories or keywords from appearing.
3. The system also implements a blacklist to temporarily or permanently block users who frequently post unwanted content, as determined by filtering rules and a threshold.
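The blacklist threshold in point 3 might be sketched as follows (the threshold value, minimum-message guard, and field names are assumptions for illustration, not values from the paper):

```python
def should_blacklist(user_stats, threshold=0.5, min_messages=5):
    """Block a user once the share of their messages caught by filtering
    rules exceeds a threshold. Requires a minimum message count so a
    single early violation does not trigger a block."""
    total = user_stats["total"]
    if total < min_messages:
        return False
    return user_stats["violations"] / total > threshold

print(should_blacklist({"total": 10, "violations": 7}))  # True
print(should_blacklist({"total": 10, "violations": 2}))  # False
```

A real system would additionally distinguish temporary from permanent blocks, as the summary above notes.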
International Journal of Engineering Research and Development (IJERD) (IJERD Editor)
This document presents a novel approach for clustering textual information in emails using text data mining techniques. It discusses using k-means clustering and a vector space model to group similar emails based on word patterns and frequencies. The methodology involves preprocessing emails, applying a Porter stemmer, calculating term frequencies, and using k-means to form clusters. Clusters will contain emails with similar content, allowing users to more easily process emails based on priority. This clustering approach could reduce the time users spend filtering through emails one by one.
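A compact sketch of the clustering step on a vector space model (TF-IDF here in place of raw term frequencies, with the Porter-stemming step omitted; the example emails are invented):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Two invented topical groups of emails: work meetings and sales spam.
emails = [
    "meeting agenda for monday project review",
    "project review meeting moved to tuesday",
    "discount offer buy now limited sale",
    "huge sale discount ends tonight buy",
]

# Build the vector space model and form k=2 clusters of similar emails.
X = TfidfVectorizer().fit_transform(emails)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
```

With this clear vocabulary separation, the two meeting emails end up in one cluster and the two sales emails in the other.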
Building a recommendation system based on the job offers extracted from the w... (IJECEIAES)
Recruitment, or job search, is increasingly used throughout the world by a large population of users through various channels, such as websites, platforms, and professional networks. Given the large volume of information related to job descriptions and user profiles, it is complicated to appropriately match a user's profile with a job description, and vice versa. The traditional job search approach has drawbacks, since the job seeker needs to search for job offers on each recruitment platform, manage separate accounts, and apply for the relevant vacancies, which wastes considerable time and effort. The contribution of this research work is the construction of a recommendation system based on job offers extracted from the web and on the e-portfolios of job seekers. After data extraction, natural language processing is applied so that the structured data is ready for filtering and analysis. The proposed system is content-based: it measures the degree of correspondence between the attributes of the e-portfolio and those of each job offer in the same list of competence specialties using the Euclidean distance, and the results are sorted in decreasing order to display the most relevant job offers first.
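The Euclidean-distance ranking can be sketched as follows (the skill dimensions, vectors, and offer names are hypothetical):

```python
import numpy as np

# Hypothetical competence vector for an e-portfolio, one dimension per skill.
portfolio = np.array([0.9, 0.2, 0.7])

# Job offers from the same competence specialty, encoded in the same space.
offers = {
    "data analyst": np.array([0.8, 0.1, 0.6]),
    "web developer": np.array([0.1, 0.9, 0.2]),
}

# Sort offers from most to least relevant: smaller distance = closer match.
ranked = sorted(offers, key=lambda name: np.linalg.norm(portfolio - offers[name]))
print(ranked)  # ['data analyst', 'web developer']
```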
This document discusses web document clustering using a hybrid approach in data mining. It begins with an abstract describing the huge amount of data on the internet and need to organize web documents into clusters. It then discusses requirements for document clustering like scalability, noise tolerance, and ability to present concise cluster summaries. Different existing document clustering approaches are described, including text-based and link-based approaches. The proposed approach uses a concept-based mining model along with hierarchical agglomerative clustering and link-based algorithms to cluster web documents based on both their content and hyperlinks. This hybrid approach aims to provide more relevant clustered documents to users than previous methods.
A Framework for Content Preparation to Support Open-Corpus Adaptive Hypermedia (Killian Levacher)
This document proposes a framework for preparing open-corpus content to support adaptive hypermedia systems. The framework includes three components: 1) structural analysis to segment web pages and remove redundant information, 2) statistical analysis to extract concepts using techniques like hidden Markov models, and 3) intelligent slicing to fulfill specific information requests from adaptive systems by retrieving and tailoring open-corpus content. The goal is to leverage existing open web content for adaptive systems by automatically preparing and enriching content with metadata in a format agnostic to any specific system.
IRJET- Concept Extraction from Ambiguous Text Document using K-Means (IRJET Journal)
This document discusses using a K-means clustering algorithm to extract concepts from ambiguous text documents. It involves preprocessing the text by tokenizing, removing stop words, and stemming words. The words are then represented as vectors and dimensionality reduction using PCA is applied. Finally, K-means clustering is used to group similar words into clusters to identify the overall concepts in the document without reading the entire text. The aim is to help users understand the key topics in a document in a time-efficient manner without having to read the full text.
Great model a model for the automatic generation of semantic relations betwee... (ijcsity)
The large available amount of non-structured texts that belong to different domains such as healthcare (e.g. medical records), justice (e.g. laws, declarations), insurance (e.g. declarations), etc. increases the effort required for the analysis of information in a decision-making process. Different projects and tools have proposed strategies to reduce this complexity by classifying, summarizing or annotating the texts. Particularly, text summary strategies have proven to be very useful to provide a compact view of an original text. However, the available strategies to generate these summaries do not fit very well within domains that require taking into consideration the temporal dimension of the text (e.g. a recent piece of text in a medical record is more important than a previous one) and the profile of the person who requires the summary (e.g. the medical specialization). To cope with these limitations this paper presents "GReAT", a model for automatic summary generation that relies on natural language processing and text mining techniques to extract the most relevant information from narrative texts and discover new information from the detection of related information. The GReAT model was implemented in software to be validated in a health institution, where it has shown to be very useful to display a preview of the information in medical health records and to discover new facts and hypotheses within the information. Several tests were executed on the implemented software, covering Functionality, Usability and Performance. In addition, precision and recall measures were applied to the results obtained through the implemented tool, as well as to the loss of information incurred by providing a text shorter than the original.
An in-depth review on News Classification through NLP (IRJET Journal)
This document provides an in-depth literature review of news classification through natural language processing (NLP). It discusses several existing approaches to news classification, including models that use convolutional neural networks (CNNs), graph-based approaches, and attention mechanisms. The document also notes that current search engines often return too many irrelevant results, so classification could help layer search results. It concludes that while many techniques have been developed, inconsistencies remain in effectively classifying news, so further research on combining NLP, feature extraction, and fuzzy logic is needed.
Scraping and Clustering Techniques for the Characterization of Linkedin Profiles (csandit)
The socialization of the web has undertaken a new dimension after the emergence of the Online
Social Networks (OSN) concept. The fact that each Internet user becomes a potential content
creator entails managing a big amount of data. This paper explores the most popular
professional OSN: LinkedIn. A scraping technique was implemented to get around 5 Million
public profiles. The application of natural language processing techniques (NLP) to classify the
educational background and to cluster the professional background of the collected profiles led
us to provide some insights about this OSN’s users and to evaluate the relationships between
educational degrees and professional careers.
World Wide Web is a huge repository of information and there is a tremendous increase in the volume of
information daily. The number of users are also increasing day by day. To reduce users browsing time lot
of research is taken place. Web Usage Mining is a type of web mining in which mining techniques are
applied in log data to extract the behaviour of users. Clustering plays an important role in a broad range
of applications like Web analysis, CRM, marketing, medical diagnostics, computational biology, and many
others. Clustering is the grouping of similar instances or objects. The key factor for clustering is some sort
of measure that can determine whether two objects are similar or dissimilar. In this paper a novel
clustering method to partition user sessions into accurate clusters is discussed. The accuracy and various
performance measures of the proposed algorithm shows that the proposed method is a better method for
web log mining.
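One common similarity measure for such session clustering is the Jaccard coefficient over the sets of pages visited in each session (chosen here for illustration, not necessarily the measure the paper uses; the session contents are invented):

```python
def jaccard(a, b):
    """Similarity of two user sessions, each a set of visited pages:
    size of the intersection over size of the union."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

sessions = {
    "s1": {"/home", "/products", "/cart"},
    "s2": {"/home", "/products", "/checkout"},
    "s3": {"/blog", "/about"},
}
print(jaccard(sessions["s1"], sessions["s2"]))  # 0.5
print(jaccard(sessions["s1"], sessions["s3"]))  # 0.0
```

Sessions whose pairwise similarity exceeds some threshold would then be grouped into the same cluster.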
This document provides a survey of semantic web personalization techniques. It begins by defining semantic web personalization and its advantages over traditional web personalization. It then classifies semantic web personalization approaches into several categories, including ontology-based, context-based, and hybrid recommendation systems. For each category, it provides examples of approaches and compares their methods and steps for personalization. The goal of the survey is to analyze and compare different techniques used for personalization in the semantic web.
Similar to Iaetsd hierarchical fuzzy rule based classification (20)
iaetsd Survey on cooperative relay based data transmission (Iaetsd Iaetsd)
The document discusses cooperative relay based data transmission and proposes a system to select the most energy efficient relay node for a source node to transmit data through. It analyzes different cooperative relaying techniques like amplify-and-forward, decode-and-forward, and compress-and-forward. The proposed system aims to minimize the source node's cost for cooperation by selecting the relay node that provides the highest energy efficiency. This allows high data transmission over long distances with improved energy efficiency compared to direct transmission without a relay.
iaetsd Software defined am transmitter using vhdl (Iaetsd Iaetsd)
This document discusses the design and implementation of an amplitude modulation (AM) software defined radio transmitter using an FPGA. It begins with an abstract describing the goals of the project. It then provides an overview of the system design, including discussion of the individual components like the microphone, analog to digital converter, digital to analog converter, carrier frequency generator, and antenna. It describes how these components will be implemented on the FPGA, including using behavioral modeling with VHDL. It also discusses designing filters and modulation/demodulation circuits. The overall summary is that this document outlines the goals and high-level system design for creating an AM transmitter using an FPGA that can transmit an audio signal by digitally modulating a carrier frequency.
iaetsd Health monitoring system with wireless alarm (Iaetsd Iaetsd)
The document describes a health monitoring system with wireless alarm that detects a patient's heart rate and temperature. It consists of a sensor unit worn on the wrist that monitors vital signs and transmits data wirelessly to an alarm and display unit. This allows caregivers to be alerted quickly if a patient's condition changes, such as if their heart rate is too high or low. The system uses a microcontroller to process sensor readings from a pulse oximetry sensor and transmit data via RF to the receiving unit, which contains another microcontroller connected to an RF receiver and buzzer alarm. If an abnormal heart rate is detected, the system triggers an alarm to notify caregivers.
iaetsd Equalizing channel and power based on cognitive radio system over mult... (Iaetsd Iaetsd)
This document summarizes a research paper about equalizing power and channel allocation in a cognitive radio system using multiuser OFDM. It discusses how frequency spectrum is becoming scarce due to increased wireless usage, and how cognitive radio can help improve spectrum utilization by allowing unlicensed secondary users to access licensed bands opportunistically when primary users are not using them. The paper presents a system model for a cognitive radio network with one primary user and multiple secondary user pairs. It formulates the problem of allocating subcarriers and power to the secondary users while avoiding interference to the primary user.
iaetsd Economic analysis and re design of driver’s car seat (Iaetsd Iaetsd)
The document discusses redesigning car seats to improve comfort. It notes that car seat design must balance comfort, safety, and health. Static comfort relates to the seat's form and support, while dynamic comfort considers vibration levels. The study reexamines existing car seat designs and proposes a novel design with improvements in form, features, usability, and comfort. A survey was also conducted to define important comfort factors like pain prevention to help guide future seat designs.
iaetsd Design of slotted microstrip patch antenna for wlan application (Iaetsd Iaetsd)
This document describes the design and simulation of a slotted microstrip patch antenna for wireless local area network (WLAN) applications operating at 2.4 GHz. The antenna was designed on an FR-4 substrate with a dielectric constant of 4.2 and thickness of 1.6 mm. Simulation in HFSS showed the antenna has a voltage standing wave ratio of 1.88 at the resonant frequency, with omnidirectional radiation patterns. The compact size and simple design make this slotted patch antenna suitable for use in embedded wireless systems.
REVIEW PAPER- ON ENHANCEMENT OF HEAT TRANSFER USING RIBS (Iaetsd Iaetsd)
This document reviews research on enhancing heat transfer using ribs mounted inside ducts. Various studies investigated ribs of different shapes, pitches, heights and angles. Continuous ribs, transverse ribs, angled ribs, and other rib configurations were examined. Most studies found that ribs increased turbulence and heat transfer compared to smooth ducts. Some key findings included V-shaped ribs providing better performance than other shapes, and certain rib pitches and angles performing better depending on parameters like Reynolds number. In general, ribs were found to effectively enhance heat transfer through boundary layer disruption and increased turbulence compared to smooth ducts.
A HYBRID AC/DC SOLAR POWERED STANDALONE SYSTEM WITHOUT INVERTER BASED ON LOAD... (Iaetsd Iaetsd)
This document discusses two methods for generating power from solar panels for a home without using inverters or batteries.
Method 1 proposes a hybrid AC/DC home grid system that shifts harmonic intensive loads to the DC side to reduce power conversion losses and isolates harmonic content. Solar power is fed to the home through a DC-DC converter, MPPT, and inverter to power AC loads, with a separate DC connection for DC loads.
Method 2 generates AC power directly from an array of solar cells connected in an alternating anti-parallel configuration, eliminating power losses from an inverter. Compatibility with residential loads is analyzed. This novel technique could remove the need for batteries and reduce overall cost.
This document describes the fabrication of a dual power bike that can operate using either an internal combustion engine or electric motor. The goal is to improve fuel efficiency and reduce pollution by allowing electric-only operation in the city. The bike combines a petrol engine with a battery and electric motor, resulting in twice the fuel economy of a conventional bike. It works by using the electric motor powered by the battery for low-power city driving, and switching to the petrol engine for higher speeds or power needs. This hybrid system aims to lower costs and pollution compared to other vehicles.
This document discusses Blue Brain technology and the goal of creating an artificial brain using silicon chips. It aims to upload the contents of a natural human brain into a virtual brain. This would allow human intelligence, memories, and personalities to potentially persist after death through the virtual brain. The document outlines how nanobots could scan a human brain at a cellular level and transfer that information to a supercomputer to recreate the brain's structure and function virtually. It compares key aspects of natural and virtual brains, such as how inputs, interpretation, outputs, memory, and processing would theoretically work for a virtual brain modeled after the human brain.
iirdem The Livable Planet – A Revolutionary Concept through Innovative Street... (Iaetsd Iaetsd)
The document proposes an innovative street lighting and surveillance system using Internet of Things (IoT) and Li-Fi technologies. The system uses LED street lights equipped with sensors and cameras that can monitor traffic, detect crimes, and provide emergency assistance. Data from the lights would be transmitted using Li-Fi and stored in the cloud for analysis. This integrated system could save energy, reduce costs, and improve safety, traffic management, and emergency response capabilities in cities.
The document proposes a Surveillance Aided Robotic Bird (SARB) to improve on existing surveillance systems. SARB would be designed like a bird and equipped with cameras, including night vision, to monitor areas remotely. It would be powered by carbon nanotubes, allowing for wireless charging and extended flight time. SARB could track intruders under the control of image processing and fly between fixed points for charging. This would provide a more natural, mobile and energy efficient form of surveillance compared to static cameras.
iirdem Growing India Time Monopoly – The Key to Initiate Long Term Rapid Growth (Iaetsd Iaetsd)
This document discusses how small and medium enterprises can achieve long term rapid growth. It focuses on the concept of "time monopoly" which has two components - a competitive advantage over competitors in a niche market, and a time advantage where it takes competitors longer to catch up.
The literature review discusses different sources of competitive advantage according to Porter, including variety-based positioning, needs-based positioning, and access-based positioning. It also discusses the importance of fit between different business activities for achieving a competitive advantage.
The paper proposes five propositions for rapid growth. These include that all areas can enable or hinder growth; areas can be transformed from hindering to enabling growth; businesses need scalability; and time monopoly,
iirdem Design of Efficient Solar Energy Collector using MPPT Algorithm (Iaetsd Iaetsd)
This document discusses the design of an efficient solar energy collector using a Maximum Power Point Tracking (MPPT) algorithm. It aims to maximize solar energy output through the use of lenses to concentrate sunlight onto solar panels and an MPPT algorithm to track the optimal power point. The methodology involves designing a DC-DC boost converter, lens-based solar cell, and a microcontroller with driver circuit. Simulations and hardware implementation will analyze the solar array, boost converter, and verify the system collects more energy than a fixed panel system.
iirdem CRASH IMPACT ATTENUATOR (CIA) FOR AUTOMOBILES WITH THE ADVOCATION OF M... (Iaetsd Iaetsd)
This document describes a proposed crash impact attenuation system for automobiles that uses mechatronic systems. The system includes an accident prediction system using ultrasound sensors to monitor vehicle surroundings and detect potential collisions. It also includes a crash absorption system with components like a pneumatic cylinder attached to the vehicle chassis that can push and pull a shock energy absorber upon detection of an imminent crash by the microcontroller. This proposed system aims to reduce crash impacts and potentially save lives by fully absorbing crash forces through controlled actuation of the absorber components.
iirdem ADVANCING OF POWER MANAGEMENT IN HOME WITH SMART GRID TECHNOLOGY AND S... (Iaetsd Iaetsd)
1) The document describes a smart home energy management system that uses wireless sensor networks and ZigBee technology to monitor and control home appliances in real-time. Electrical parameters like voltage, current, and power consumption are measured.
2) The system allows flexible control of appliances based on consumer needs. Appliances can be monitored and controlled remotely or automatically based on power consumption thresholds.
3) Key features of the system include using a TRIAC circuit to control appliances without needing a microcontroller, and providing flexible control options to users for switching devices on/off according to their preferences. This allows improving consumer comfort while optimizing energy use.
iaetsd Shared authority based privacy preserving protocol (Iaetsd Iaetsd)
This document proposes a Shared Authority based Privacy preserving Authentication protocol (SAPA) for handling privacy issues in cloud storage. SAPA achieves shared access authority through an anonymous access request matching mechanism. It applies attribute-based access control to allow users to reliably access their own data fields. It also uses proxy re-encryption to provide temporary authorized data sharing among multiple users. The goal is to preserve user privacy during data access and sharing in the cloud.
iaetsd Secured multiple keyword ranked search over encrypted databases (Iaetsd Iaetsd)
This document proposes a Robust Key-Aggregate Cryptosystem (RKAC) that allows flexible and efficient assignment of decryption rights for encrypted data stored in cloud storage. The RKAC produces constant-sized ciphertexts such that a constant-sized aggregate decryption key can decrypt any subset of ciphertexts. This allows the data owner to share access to selected encrypted files by sending a single small aggregate key to authorized users, without decrypting the files themselves or distributing individual keys. The RKAC is described as providing a secure and flexible method for sharing encrypted data stored in the cloud.
iaetsd Robots in oil and gas refineries (Iaetsd Iaetsd)
This document discusses attribute-based encryption in cloud computing with outsourced revocation. It proposes a pseudonym generation scheme for identity-based encryption and outsourced revocation in cloud computing. The scheme offloads most key generation operations to a Key Update Cloud Service Provider during key issuing and updating, leaving only simple operations for the Private Key Generator and users. It aims to reduce computation overhead at the Private Key Generator while using an untrusted cloud service provider.
iaetsd Modeling of solar steam engine system using parabolic (Iaetsd Iaetsd)
The document describes the modeling and testing of a solar-steam engine system using a parabolic concentrator. The system focuses solar radiation onto a boiler to generate steam, which is then used to power an oscillating steam engine coupled to a generator to produce electricity. The parabolic dish has a diameter of 0.625m and focuses sunlight onto a 1L boiler. Testing showed the system could produce 9V with no load and 5.3V under load, demonstrating its potential for rural electrification applications.
Electric vehicle and photovoltaic advanced roles in enhancing the financial p... (IJECEIAES)
Climate change's impact on the planet has forced the United Nations and governments to promote green energies and electric transportation. The deployment of photovoltaic (PV) and electric vehicle (EV) systems has gained stronger momentum due to their numerous advantages over fossil fuel alternatives; these advantages go beyond sustainability to include financial support and stability. The work in this paper introduces a hybrid system between PV and EV to support industrial and commercial plants. The paper covers the theoretical framework of the proposed hybrid system, including the equations required to complete the cost analysis when PV and EV are present. In addition, the proposed design diagram, which sets the priorities and requirements of the system, is presented. The proposed approach allows setups to improve their power stability, especially during power outages. The presented information supports researchers and plant owners in completing the necessary analysis while promoting the deployment of clean energy. The results of a case study representing a dairy milk farmer support the theoretical work and highlight its benefits to existing plants. The short return on investment supports the paper's novel approach to a sustainable electrical system. In addition, the proposed system allows for an isolated power setup without the need for a transmission line, which enhances the safety of the electrical network.
Use PyCharm for remote debugging of WSL on a Windo cf5c162d672e4e58b4dde5d797... (shadow0702a)
This document serves as a comprehensive step-by-step guide on how to effectively use PyCharm for remote debugging of the Windows Subsystem for Linux (WSL) on a local Windows machine. It meticulously outlines several critical steps in the process, starting with the crucial task of enabling permissions, followed by the installation and configuration of WSL.
The guide then proceeds to explain how to set up the SSH service within the WSL environment, an integral part of the process. Alongside this, it also provides detailed instructions on how to modify the inbound rules of the Windows firewall to facilitate the process, ensuring that there are no connectivity issues that could potentially hinder the debugging process.
The document further emphasizes on the importance of checking the connection between the Windows and WSL environments, providing instructions on how to ensure that the connection is optimal and ready for remote debugging.
It also offers an in-depth guide on how to configure the WSL interpreter and files within the PyCharm environment. This is essential for ensuring that the debugging process is set up correctly and that the program can be run effectively within the WSL terminal.
Additionally, the document provides guidance on how to set up breakpoints for debugging, a fundamental aspect of the debugging process which allows the developer to stop the execution of their code at certain points and inspect their program at those stages.
Finally, the document concludes by providing a link to a reference blog. This blog offers additional information and guidance on configuring the remote Python interpreter in PyCharm, providing the reader with a well-rounded understanding of the process.
Optimizing Gradle Builds - Gradle DPE Tour Berlin 2024 (Sinan KOZAK)
Sinan, from the Delivery Hero mobile infrastructure engineering team, shares a deep dive into performance acceleration with Gradle build cache optimizations, covering the team's journey into solving complex build-cache problems that affect Gradle builds. By walking through the challenges and solutions found along the way, the talk demonstrates what is possible in terms of faster builds. The case study reveals how overlapping outputs and cache misconfigurations led to significant increases in build times, especially as the project scaled up with numerous modules using Paparazzi tests. The journey from diagnosing to defeating cache issues offers invaluable lessons on maintaining cache integrity without sacrificing functionality.
artificial intelligence and data science contents.pptx (GauravCar)
What is artificial intelligence? Artificial intelligence is the ability of a computer or computer-controlled robot to perform tasks that are commonly associated with the intellectual processes characteristic of humans, such as the ability to reason.
Batteries: Introduction – types of batteries – discharging and charging of a battery – characteristics of a battery – battery rating – various tests on a battery – primary battery: silver button cell – secondary battery: Ni-Cd battery – modern battery: lithium-ion battery – maintenance of batteries – choice of batteries for electric vehicle applications.
Fuel Cells: Introduction – importance and classification of fuel cells – description, principle, components, and applications of fuel cells: H2-O2 fuel cell, alkaline fuel cell, molten carbonate fuel cell, and direct methanol fuel cells.
Null Bangalore | Pentesters Approach to AWS IAM (Divyanshu)
# Abstract:
- Learn real-world methods for auditing AWS IAM (Identity and Access Management) as a pentester. We begin with a brief discussion of IAM, then cover typical misconfigurations and their potential exploits, in order to reinforce IAM security best practices.
- Gain actionable insights into AWS IAM policies and roles using a hands-on approach.
# Prerequisites:
- Basic understanding of AWS services and architecture
- Familiarity with cloud security concepts
- Experience using the AWS Management Console or AWS CLI.
- For hands on lab create account on [killercoda.com](https://killercoda.com/cloudsecurity-scenario/)
# Scenario Covered:
- Basics of IAM in AWS
- Implementing IAM Policies with Least Privilege to Manage S3 Bucket
- Objective: Create an S3 bucket with least privilege IAM policy and validate access.
- Steps:
- Create S3 bucket.
- Attach least privilege policy to IAM user.
- Validate access.
- Exploiting IAM PassRole Misconfiguration
- Allows a user to pass a specific IAM role to an AWS service (EC2), typically used for service access delegation; this misconfiguration is then exploited to grant unauthorized access to sensitive resources.
- Objective: Demonstrate how a PassRole misconfiguration can grant unauthorized access.
- Steps:
- Allow user to pass IAM role to EC2.
- Exploit misconfiguration for unauthorized access.
- Access sensitive resources.
- Exploiting IAM AssumeRole Misconfiguration with Overly Permissive Role
- An overly permissive IAM role configuration can lead to privilege escalation: a role is created with administrative privileges and a user is allowed to assume it.
- Objective: Show how overly permissive IAM roles can lead to privilege escalation.
- Steps:
- Create role with administrative privileges.
- Allow user to assume the role.
- Perform administrative actions.
- Differentiation between PassRole and AssumeRole
Try at [killercoda.com](https://killercoda.com/cloudsecurity-scenario/)
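The least-privilege S3 scenario above can be sketched in Python. The bucket name and policy shape below are illustrative placeholders, not values from the lab:

```python
import json

def least_privilege_s3_policy(bucket_name):
    """Build a least-privilege IAM policy document that only allows
    listing and reading objects in a single, named S3 bucket.
    The bucket name is a hypothetical placeholder."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ListOnlyThisBucket",
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": f"arn:aws:s3:::{bucket_name}",
            },
            {
                "Sid": "ReadOnlyObjects",
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": f"arn:aws:s3:::{bucket_name}/*",
            },
        ],
    }

policy_json = json.dumps(least_privilege_s3_policy("demo-pentest-bucket"), indent=2)
```

With boto3, such a document could then be attached via `iam.put_user_policy`; note that admin-level actions and `iam:PassRole` are deliberately absent, which is the point of the exercise.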
Rainfall intensity duration frequency curve statistical analysis and modeling... (bijceesjournal)
Using 41 years of data from Patna, India (1981−2020), the study's goal is to analyze trends in how often it rains on a weekly, seasonal, and annual basis. First, the historical rainfall data set for Patna was evaluated for quality by statistically analyzing rainfall using the intensity-duration-frequency (IDF) curve and relationship. Changes in the hydrologic cycle resulting from increased greenhouse gas emissions are expected to induce variations in the intensity, length, and frequency of precipitation events. One strategy to lessen vulnerability is to quantify probable changes and adapt to them. Techniques such as the log-normal, normal, and Gumbel (EV-I) distributions are used. Distributions were created for durations of 1, 2, 3, 6, and 24 h and return periods of 2, 5, 10, 25, and 100 years. Mathematical correlations between rainfall and recurrence interval were also derived.
Findings: The Gumbel approach produced the highest intensity values, whereas the other approaches produced values close to each other. The data indicate that 461.9 mm of rain fell during the monsoon season's 301st week. However, the 29th week had the greatest average rainfall, 92.6 mm. With 952.6 mm on average, the monsoon season saw the highest rainfall, and the yearly rainfall averaged 1171.1 mm. Using Weibull's method, the study was extended to examine rainfall distribution at recurrence intervals of 2, 5, 10, and 25 years, and mathematical correlations between rainfall and recurrence interval were developed. Further regression analysis revealed that short-wave irradiation, wind direction, wind speed, pressure, relative humidity, and temperature all had a substantial influence on rainfall.
Originality and value: The results of the rainfall IDF curves can provide useful information to policymakers in making appropriate decisions in managing and minimizing floods in the study area.
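The Gumbel (EV-I) estimates referred to in the findings follow the standard Gumbel frequency-factor relation; the sketch below uses made-up sample statistics, not the Patna data:

```python
import math

def gumbel_frequency_factor(T):
    """Frequency factor K_T for the Gumbel (EV-I) distribution and
    return period T (years), via Chow's relation."""
    return -(math.sqrt(6) / math.pi) * (0.5772 + math.log(math.log(T / (T - 1))))

def design_intensity(mean, std, T):
    """Design rainfall intensity x_T = mean + K_T * std, where mean and
    std are the sample statistics of the annual-maximum series."""
    return mean + gumbel_frequency_factor(T) * std

# Hypothetical annual-maximum statistics (mm/h), for illustration only.
intensities = {T: round(design_intensity(60.0, 15.0, T), 1)
               for T in (2, 5, 10, 25, 100)}
```

Because K_T grows with the return period, the estimated intensity increases monotonically from the 2-year to the 100-year event, which is what an IDF curve encodes.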
AI assisted telemedicine KIOSK for Rural India.pptx
Iaetsd hierarchical fuzzy rule based classification
Hierarchical Fuzzy Rule Based Classification Systems with Genetic Rule Selection to Filter Unwanted Messages
Shaik Masthan Baba (1), Shaik Aslam (2), K. Naresh Babu (3)
(1, 2, 3) Computer Science and Engineering
(1) masthan201@gmail.com, (2) aslambasha592@gmail.com, (3) naresh.kosuri@gmail.com
Abstract: Social networking sites that facilitate the communication of information between users offer message posting as an important function. Unwanted posts can spam a user's wall, the page where posts are displayed, preventing the user from viewing relevant messages. The aim of this paper is to improve the performance of fuzzy rule based classification systems on imbalanced domains by increasing the granularity of the fuzzy partitions in the boundary areas between the classes, in order to obtain better separability. We propose the use of a hierarchical fuzzy rule based classification system with a neural network learning model to filter out unwanted messages from Online Social Network (OSN) user walls. The approach is based on the refinement of a simple linguistic fuzzy model by extending the structure of the knowledge base in a hierarchical way, together with a genetic rule selection process that yields a compact and accurate model.
Keywords: On-line Social Networks, Classification, Fuzzy rule based classification systems, Imbalanced data-sets, Genetic rule selection
1. INTRODUCTION
Online Social Networks (OSNs) are today among the most popular interactive media for communicating, sharing, and disseminating a considerable amount of human life information. Daily and continuous communications imply the exchange of several types of content, including free text, image, audio, and video data. The huge and dynamic character of these data creates the premise for the employment of web content mining strategies aimed at automatically discovering useful information dormant within the data. They are instrumental in providing active support for complex and sophisticated tasks involved in OSN management, such as access control or information filtering. Information filtering has been greatly explored for textual documents and, more recently, for web content [1-3]. Information filtering can therefore be used to give users the ability to automatically control the messages written on their own walls, by filtering out unwanted messages. Indeed, today OSNs provide very little support to prevent unwanted messages on user walls [4-6]. No content-based preferences are supported, and therefore it is not possible to prevent undesired messages. Providing this service is not only a matter of using previously
INTERNATIONAL CONFERENCE ON DEVELOPMENTS IN ENGINEERING RESEARCH, ICDER - 2014
INTERNATIONAL ASSOCIATION OF ENGINEERING & TECHNOLOGY FOR SKILL DEVELOPMENT www.iaetsd.in
defined web content mining techniques for a different application; rather, it requires the design of ad hoc classification strategies. The aim of the present work is therefore to propose and experimentally evaluate an automated system, called Filtered Wall (FW), able to filter unwanted messages from OSN user walls [7-9]. We exploit Machine Learning (ML) text categorization techniques to automatically assign to each short text message a set of categories based on its content. The major efforts in building a robust short text classifier (STC) are concentrated in the extraction and selection of a set of characterizing and discriminant features, from which we inherit the learning model and the elicitation procedure for generating pre-classified data. In particular, we base the overall short text classification strategy on Radial Basis Function Networks (RBFN), for their proven capabilities in acting as soft classifiers and in managing noisy data and intrinsically vague classes. We insert the neural model within a hierarchical two-level classification strategy [8-10]. In the first level, the RBFN categorizes short messages as Neutral or Non-neutral; in the second stage, Non-neutral messages are classified, producing gradual estimates of appropriateness for each of the considered categories. Besides classification facilities, the system provides a powerful rule layer exploiting a flexible language to specify Filtering Rules (FRs), by which users can state what content should not be displayed on their walls. FRs can support a variety of filtering criteria that can be combined and customized according to the user's needs. In addition, the system supports user-defined Blacklists (BLs), that is, lists of users that are temporarily prevented from posting any kind of message on a user wall [11-13]. The experiments we have carried out show the effectiveness of the developed filtering techniques: a system that automatically filters unwanted messages from OSN user walls on the basis of both message content and the message creator's relationships and characteristics.
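As a concrete illustration of the two-level strategy and the FR/BL layer just described, the following toy sketch wires the pieces together. The keyword scoring merely stands in for the paper's RBFN classifiers, and all names, categories, and thresholds are invented placeholders:

```python
# Toy Filtered Wall pipeline: stage 1 flags Non-neutral messages,
# stage 2 produces per-category scores, then Filtering Rules (FRs)
# and the Blacklist (BL) decide whether the message is displayed.
VULGAR = {"spamword", "vulgar"}       # placeholder vocabularies,
POLITICS = {"election", "party"}      # not a trained classifier

def stage1_nonneutral(message):
    """First level: Neutral vs Non-neutral."""
    words = set(message.lower().split())
    return bool(words & (VULGAR | POLITICS))

def stage2_scores(message):
    """Second level: gradual appropriateness estimate per category."""
    words = set(message.lower().split())
    return {
        "Vulgar": len(words & VULGAR) / max(len(words), 1),
        "Politics": len(words & POLITICS) / max(len(words), 1),
    }

def filter_message(message, author, blocked_categories, blacklist,
                   threshold=0.1):
    """Return True if the message may be displayed on the wall."""
    if author in blacklist:           # BL check comes first
        return False
    if not stage1_nonneutral(message):
        return True                   # Neutral messages always pass
    scores = stage2_scores(message)
    return all(scores[c] < threshold for c in blocked_categories)
```

A user who blocks the "Vulgar" category would still see neutral posts, while posts from blacklisted authors are suppressed regardless of content.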
2. LITERATURE REVIEW
The main contribution of this paper is the
design of a system providing customizable
content-based message filtering for OSNs,
based on ML techniques. As we have
pointed out in the introduction, to the best of
our knowledge we are the first proposing
such kind of application for OSNs.
However, our work has relationships both
with the state of the art in content-based
filtering, as well as with the field of policy-based personalization for OSNs and, more generally, web content. Therefore, in what follows, we survey the literature in both these fields.
2.1 Content-based filtering:
Information filtering systems are designed to classify a stream of dynamically generated information, dispatched asynchronously by an information producer, and to present to the user the information likely to satisfy his/her requirements [3].
assumed to operate independently. As a
result, a content-based filtering system
selects information items based on the
correlation between the content of the items
and the user preferences as opposed to a
collaborative filtering system that chooses
items based on the correlation between
people with similar preferences. Documents
processed in content-based filtering are
mostly textual in nature and this makes
content-based filtering close to text
classification. The activity of filtering can, in fact, be modeled as a case of single-label, binary classification, partitioning incoming
documents into relevant and non-relevant categories [4]. More complex filtering systems include multi-label text categorization, automatically labeling messages with partial thematic categories.
Content-based filtering is mainly based on
the use of the ML paradigm according to
which a classifier is automatically induced
by learning from a set of pre-classified
examples. A remarkable variety of related work has recently appeared, differing in the adopted feature extraction methods, model learning, and collection of samples [5], [6], [7], [8], [9]. The feature extraction
procedure maps text into a compact
representation of its content and is uniformly
applied to training and generalization
phases. The application of content-based filtering to messages posted on OSN user walls poses additional challenges, given the short length of these messages in addition to the wide range of topics that can be discussed.
Short text classification has so far received little attention in the scientific community. Recent work highlights difficulties in defining robust features, essentially because the description of a short text is concise, with many misspellings, nonstandard terms, and noise. Focusing on the OSN domain, interest
in access control and privacy protection is
quite recent. As far as privacy is concerned,
current work is mainly focusing on privacy-
preserving data mining techniques, that is,
protecting information related to the
network, i.e., relationships/nodes, while
performing social network analysis [5].
Works more related to our proposals are
those in the field of access control. In this
field, many different access control models
and related mechanisms have been proposed
so far (e.g., [6,2,10]), which mainly differ on
the expressivity of the access control policy
language and on the way access control is
enforced (e.g., centralized vs. decentralized).
Most of these models express access control
requirements in terms of relationships that
the requestor should have with the resource
owner. We use a similar idea to identify the
users to which a filtering rule applies.
However, the overall goal of our proposal is
completely different, since we mainly deal
with filtering of unwanted contents rather
than with access control. As such, one of the
key ingredients of our system is the
availability of a description for the message
contents to be exploited by the filtering
mechanism as well as by the language to
express filtering rules. In contrast, none of the access control models previously cited exploits the content of the resources to enforce access control. We believe that this is a fundamental difference. Moreover, the notion of blacklists and their management is not considered by any of these access control models.
2.2 Policy-based personalization of OSN
contents
Recently, there have been some proposals
exploiting classification mechanisms for
personalizing access in OSNs. For instance,
in [11] a classification method has been
proposed to categorize short text messages
in order to avoid overwhelming users of
microblogging services by raw data. The
system described in [11] focuses on Twitter2
and associates a set of categories with each
tweet describing its content. The user can
then view only certain types of tweets based
on his/her interests. In contrast, Golbeck and
Kuter [12] propose an application, called
FilmTrust, that exploits OSN trust
relationships and provenance information to
personalize access to the website. However,
such systems do not provide a filtering policy layer by which the user can exploit the result of the classification process to decide how, and to what extent, to filter out unwanted information. In contrast, our filtering policy language allows the setting
of FRs according to a variety of criteria that consider not only the results of the classification process but also the relationships of the wall owner with other OSN users, as well as information on the user profile. Moreover, our system is
complemented by a flexible mechanism for
BL management that provides a further
opportunity of customization to the filtering
procedure. The only social networking service we are aware of that provides filtering abilities to its users is MyWOT, which gives its subscribers the ability to: 1) rate resources
with respect to four criteria: trustworthiness,
vendor reliability, privacy, and child safety;
2) specify preferences determining whether
the browser should block access to a given
resource, or should simply return a warning
message on the basis of the specified rating.
Despite the existence of some similarities,
the approach adopted by MyWOT is quite
different from ours. In particular, it supports
filtering criteria which are far less flexible
than the ones of Filtered Wall since they are
only based on the four above-mentioned
criteria. Moreover, no automatic
classification mechanism is provided to the
end user. Our work is also inspired by the
many access control models and related
policy languages and enforcement
mechanisms that have been proposed so far
for OSNs, since filtering shares several
similarities with access control. Actually,
content filtering can be considered as an
extension of access control, since it can be
used both to protect objects from
unauthorized subjects, and subjects from
inappropriate objects. In the field of OSNs,
the majority of access control models
proposed so far enforce topology-based
access control, according to which access
control requirements are expressed in terms
of relationships that the requester should
have with the resource owner. We use a
similar idea to identify the users to which a
FR applies. However, our filtering policy
language extends the languages proposed for
access control policy specification in OSNs
to cope with the extended requirements of
the filtering domain. Indeed, since we are
dealing with filtering of unwanted contents
rather than with access control, one of the
key ingredients of our system is the
availability of a description for the message
contents to be exploited by the filtering
mechanism. In contrast, none of the access control models previously cited exploits the content of the resources to enforce access control. Moreover, the notion of BLs and their management is not considered by any of the above-mentioned access control models.
3. ANALYSIS OF THE PROBLEM
The use of effective and appropriate methods in facilitating a project enhances its effectiveness and efficiency. The method applied here is the system analysis and design method, in which an existing system is studied in order to offer better options for solving existing problems. Indeed, today OSNs provide very little support to prevent unwanted messages on user walls. For example, Facebook allows users to state who is allowed to insert messages in their walls (i.e., friends, friends of friends, or defined groups of friends). However, no content-based preferences are supported, and therefore it is not possible to prevent undesired messages, such as political or vulgar ones, regardless of the user who posts them. Providing this service is not only a matter of using previously defined web content mining techniques for a different application; rather, it requires the design of ad hoc classification strategies. This is because wall messages are constituted by short text
for which traditional classification methods have serious limitations, since short texts do not provide sufficient word occurrences.
4. IMBALANCED DATA-SETS IN CLASSIFICATION
In this section, we will first introduce the
problem of imbalanced data-sets. Then, we
will describe the preprocessing technique
that we have applied in order to deal with
the imbalanced data-sets: the SMOTE
algorithm [7]. Finally, we will present the
evaluation metrics for this kind of
classification problem.
4.1. The problem of imbalanced data-sets
Learning from imbalanced data is an important topic that has recently emerged in the machine learning community. When dealing with imbalanced data-sets, one or more classes may be represented by a large number of examples, whereas the others are represented by only a few. We focus on binary-class imbalanced data-sets, where there is only one positive and one negative class. We consider the positive class to be the one with the lowest number of examples and the negative class the one with the highest number of examples. Furthermore, in this work we use the imbalance ratio (IR), defined as the ratio of the number of instances of the majority class to that of the minority class, to organize the different data-sets according to their IR. The problem of imbalanced data-sets is extremely significant because it is implicit in most real-world applications, such as fraud detection [16], text classification, risk management, and medical applications. In classification, this problem (also named the "class imbalance problem") biases the training of classifiers and results in lower sensitivity in detecting minority class examples. For this reason, a large number of approaches have been proposed to deal with the class imbalance problem. These approaches can be categorized into two groups: internal approaches, which create new algorithms or modify existing ones to take the class imbalance problem into consideration [3], and external approaches, which preprocess the data in order to diminish the effect caused by the class imbalance [4,15]. Internal approaches have the disadvantage of being algorithm specific, whereas external approaches are independent of the classifier used and are, for this reason, more versatile. Furthermore, in our previous work on this topic [18] we analyzed the cooperation of some preprocessing methods with FRBCSs, showing good behaviour for the over-sampling methods, especially in the case of the SMOTE methodology. Accordingly, in this paper we employ the SMOTE algorithm to deal with the problem of imbalanced data-sets. This method is detailed in the next subsection.
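The imbalance ratio (IR) used above to organize the data-sets can be computed directly from the label counts:

```python
from collections import Counter

def imbalance_ratio(labels):
    """IR = number of majority-class instances / number of
    minority-class instances, as defined in the text."""
    counts = Counter(labels).most_common()  # sorted by frequency, descending
    n_majority, n_minority = counts[0][1], counts[-1][1]
    return n_majority / n_minority
```

For example, a data-set with 90 negative and 10 positive examples has IR = 9.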
4.2. Preprocessing imbalanced data-sets: the SMOTE algorithm
As mentioned before, applying a preprocessing step in order to balance the class distribution is a positive solution to the imbalanced data-set problem [4]. Specifically, in this work we have chosen an over-sampling method that is a reference in this area: the SMOTE algorithm [7]. In this approach the minority class is over-sampled by taking each minority class sample and introducing synthetic examples along the line segments joining any/all of the k minority class nearest neighbours. Depending upon the amount of over-sampling required, neighbours from the k nearest neighbours are randomly chosen. This process is illustrated in Fig. 1, where xi is the selected point, xi1 to xi4 are some selected nearest neighbours, and r1 to r4 are the synthetic data points created by the randomized interpolation. The implementation employed
in this work uses only one nearest neighbour (by Euclidean distance) and balances both classes to a 50% distribution. Synthetic samples are generated in the following way: take the difference between the feature vector (sample) under consideration and its nearest neighbour; multiply this difference by a random number between 0 and 1, and add it to the feature vector under consideration. This causes the selection of a random point along the line segment between the two specific features. This approach effectively forces the decision region of the minority class to become more general. An example is detailed in Fig. 2. In short, the main idea is to form new minority class examples by interpolating between several minority class examples that lie close together. Thus, the overfitting problem is avoided, and the decision boundaries for the minority class spread further into the majority class space.
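The interpolation just described (the 1-NN variant, with a random factor between 0 and 1) can be sketched as follows; this is a minimal illustration, not the exact implementation used in the experiments:

```python
import math
import random

def smote_1nn(minority, n_new, seed=1):
    """Create n_new synthetic minority samples: pick a sample, find its
    single nearest neighbour (Euclidean distance), and interpolate a
    random point on the segment joining them."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        neighbour = min((p for p in minority if p != x),
                        key=lambda p: math.dist(x, p))
        r = rng.random()  # random number in [0, 1)
        synthetic.append(tuple(xi + r * (ni - xi)
                               for xi, ni in zip(x, neighbour)))
    return synthetic
```

Each synthetic point lies on a line segment between two existing minority samples, which is why the minority decision region becomes more general rather than overfitting individual points.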
4.3. Evaluation in imbalanced domains
The measures of the quality of classification are built from a confusion matrix (shown in Table 1), which records correctly and incorrectly recognized examples for each class. The most used empirical measure, accuracy (1), does not distinguish between the number of correct labels of different classes, which in the framework of imbalanced problems may lead to erroneous conclusions. For example, a classifier that obtains an accuracy of 90% on a data-set with an IR value of 9 might not be accurate if it does not correctly cover any minority class instance.
Because of this, instead of using accuracy, more appropriate metrics are considered. Two common measures, sensitivity and specificity (2,3), approximate the probability of the positive (negative) label being true; in other words, they assess the effectiveness of the algorithm on a single class:

sensitivity = TP / (TP + FN)    (2)
specificity = TN / (TN + FP)    (3)

The metric used in this work is the geometric mean of the true rates [3], which can be defined as

GM = sqrt(sensitivity × specificity)    (4)

Fig. 1. An illustration of how the synthetic data points are created in the SMOTE algorithm.
Fig. 2. Example of SMOTE application.

This metric attempts to maximize the accuracy on each of the two classes with a good balance; it is a performance metric that links both objectives.
Table 1. Confusion matrix for a two-class problem.

                 Positive prediction    Negative prediction
Positive class   True positive (TP)     False negative (FN)
Negative class   False positive (FP)    True negative (TN)
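The sensitivity, specificity, and geometric-mean metric described above follow directly from the entries of Table 1:

```python
import math

def geometric_mean(tp, fn, fp, tn):
    """GM = sqrt(sensitivity * specificity), built from the four
    confusion-matrix entries of Table 1."""
    sensitivity = tp / (tp + fn)  # true positive rate
    specificity = tn / (tn + fp)  # true negative rate
    return math.sqrt(sensitivity * specificity)
```

For the 90%-accuracy example with IR = 9 that ignores every minority instance (tp = 0, fn = 10, fp = 0, tn = 90), accuracy is 0.9 but the GM is 0, exposing the useless minority-class behaviour.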
5. HIERARCHICAL RULE BASE GENETIC RULE SELECTION PROCESS
In the previous section we mentioned that an excessive number of rules may not produce good performance and makes it difficult to understand the model's behaviour. We may find different types of rules in a large fuzzy rule set: irrelevant rules, which do not contain significant information; redundant rules, whose actions are covered by other rules; erroneous rules, which are wrongly defined and distort the performance of the FRBCS; and conflicting rules, which perturb the performance of the FRBCS when they coexist with others. In this work, we consider the CHC genetic model [14] for the rule selection process, since it has achieved good results on binary selection problems [6]. In the following, the main characteristics of this genetic approach are presented.
1. Coding scheme and initial gene pool: It is based on a binary-coded GA where each gene indicates whether a rule is selected or not (alleles '1' or '0', respectively). Considering that N rules are contained in the preliminary/candidate rule set, the chromosome C = (c1, . . . , cN) represents a subset of rules composing the final HRB, such that:

HRB = { Ri | ci = 1, i = 1, . . . , N },

with Ri being the corresponding ith rule in the candidate rule set and HRB being the final hierarchical rule base. The initial pool is obtained with one individual having all genes set to '1' and the remaining individuals generated at random in {0, 1}, so that the initial HRB is taken into account in the genetic selection process.
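The coding scheme and initial gene pool can be sketched as below; the population size and the rule labels are placeholders:

```python
import random

def initial_population(n_rules, pop_size, seed=0):
    """One individual with all genes '1' (the whole candidate rule set),
    the remaining individuals generated at random in {0, 1}."""
    rng = random.Random(seed)
    population = [[1] * n_rules]
    population += [[rng.randint(0, 1) for _ in range(n_rules)]
                   for _ in range(pop_size - 1)]
    return population

def decode(chromosome, candidate_rules):
    """HRB = { R_i | c_i = 1 }: the rule subset selected by the chromosome."""
    return [rule for gene, rule in zip(chromosome, candidate_rules)
            if gene == 1]
```

Seeding the pool with the all-ones chromosome guarantees that the full initial HRB competes in the selection process from the first generation.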
2. Chromosome evaluation: The fitness function must be in accordance with the framework of imbalanced data-sets. Thus, we use, as presented in Section 4.3, the geometric mean of the true rates defined in (4): GM = sqrt(sensitivity × specificity).
3. Crossover operator: The half uniform crossover scheme (HUX) is employed. In this approach, the two parents are combined to produce two new offspring. The individual bits in the string are compared between the two parents, and exactly half of the non-matching bits are swapped. Thus, the Hamming distance (the number of differing bits) is first calculated; this number is divided by two, and the result is how many of the bits that do not match between the two parents will be swapped.
4. Restarting approach: To escape from local optima, the algorithm uses a restart approach: the best chromosome is maintained and the remaining ones are generated at random in {0, 1}. The restart procedure is applied when a threshold value is reached, which means that all the individuals coexisting in the population are very similar.
5. Evolutionary model: The CHC genetic model makes use of a "Population-based Selection" approach: N parents and their corresponding offspring are combined, and the best N individuals are selected to take part in the next population. The CHC approach makes use of an incest prevention mechanism and a restarting process to provoke diversity in the population, instead of the well-known mutation operator.
This incest prevention mechanism is considered when applying the HUX operator, i.e., two parents are crossed only if their Hamming distance divided by 2 is higher than a predetermined threshold, L. The threshold value is initialized as L = (#Genes/4.0). Following the original CHC scheme, L is decremented by one when the population does not change in one generation. The algorithm restarts when L drops below zero. We stop the genetic process if more than 3 restarts are performed without including any new chromosome in the population.
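The incest-prevention test and the restart step described above can be sketched as:

```python
import random

def may_cross(parent1, parent2, threshold):
    """Incest prevention: cross only if Hamming distance / 2 > L."""
    hamming = sum(a != b for a, b in zip(parent1, parent2))
    return hamming / 2.0 > threshold

def restart_population(best, pop_size, n_genes, seed=0):
    """Restart: keep the best chromosome, regenerate the rest at
    random in {0, 1}."""
    rng = random.Random(seed)
    return [list(best)] + [[rng.randint(0, 1) for _ in range(n_genes)]
                           for _ in range(pop_size - 1)]

# The threshold L starts at #Genes/4.0 and is decremented by one after
# each generation in which the population does not change; a restart
# fires once L drops below zero.
```

As the population converges, parents become similar, fewer pairs pass the `may_cross` test, L decays, and the restart eventually re-injects diversity around the best chromosome.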
6. CONCLUSION
In this paper, we have proposed an HFRBCS approach for classification with imbalanced data-sets. Our aim was to employ a hierarchical model to obtain a good balance among different granularity levels: a fine granularity is applied in the boundary areas, while a coarse granularity may be applied in the rest of the classification space, providing good generalization. This approach thus enhances the classification performance in the overlapping areas between the minority and majority classes. Furthermore, we have made use of the SMOTE algorithm to balance the training data before the rule learning generation phase. This preprocessing step enables better fuzzy rules to be obtained than with the original data-sets and, therefore, improves the global performance of the fuzzy model in filtering out unwanted messages from Online Social Network (OSN) user walls.
7. REFERENCES
[1] R. Alcalá, J. Alcalá-Fdez, F. Herrera, J.
Otero, Genetic learning of accurate and
compact fuzzy rule based systems based on
the 2-tuples linguistic representation,
International Journal of Approximate
Reasoning 44 (2007) 45–64.
[2] A. Asuncion, D. Newman, 2007. UCI
machine learning repository. University of
California, Irvine, School of Information
and Computer Sciences. URL: <http://www.ics.uci.edu/~mlearn/MLRepository.html>.
[3] R. Barandela, J.S. Sánchez, V. García, E.
Rangel, Strategies for learning in class
imbalance problems, Pattern Recognition 36
(3) (2003) 849–851.
[4] G.E.A.P.A. Batista, R.C. Prati, M.C.
Monard, A study of the behaviour of several
methods for balancing machine learning
training data, SIGKDD Explorations 6 (1)
(2004) 20–29.
[5] P. Campadelli, E. Casiraghi, G.
Valentini, Support vector machines for
candidate nodules classification,
Neurocomputing 68 (2005) 281–288.
[6] J.R. Cano, F. Herrera, M. Lozano, Using
evolutionary algorithms as instance selection
for data reduction in kdd: an experimental
study, IEEE Transactions on Evolutionary
Computation 7 (6) (2003) 561–575.
[7] N.V. Chawla, K.W. Bowyer, L.O. Hall,
W.P. Kegelmeyer, SMOTE: synthetic
minority over-sampling technique, Journal
of Artificial Intelligence Research 16 (2002)
321–357.
[8] N.V. Chawla, N. Japkowicz, A. Kolcz,
Editorial: special issue on learning from
imbalanced data-sets, SIGKDD Explorations
6 (1) (2004) 1–6.
[9] Z. Chi, H. Yan, T. Pham, Fuzzy
algorithms with applications to image
processing and pattern recognition, World
Scientific, 1996.
[10] J.-N. Choi, S.-K. Oh, W. Pedrycz,
Structural and parametric design of fuzzy
inference systems using hierarchical fair
competition-based parallel genetic
algorithms and information granulation,
International Journal of Approximate
Reasoning 49 (3) (2008) 631–648.
INTERNATIONAL CONFERENCE ON DEVELOPMENTS IN ENGINEERING RESEARCH, ICDER - 2014
INTERNATIONAL ASSOCIATION OF ENGINEERING & TECHNOLOGY FOR SKILL DEVELOPMENT www.iaetsd.in
[11] O. Cordón, M.J. del Jesus, F. Herrera,
A proposal on reasoning methods in fuzzy
rule-based classification systems,
International Journal of Approximate
Reasoning 20 (1) (1999) 21–45.
[12] O. Cordón, F. Herrera, I. Zwir,
Linguistic modeling by hierarchical systems
of linguistic rules, IEEE Transactions on
Fuzzy Systems 10 (1) (2002) 2–20.
[13] J. Demšar, Statistical comparisons of
classifiers over multiple data-sets, Journal of
Machine Learning Research 7 (2006) 1–30.
[14] L.J. Eshelman, The CHC adaptive
search algorithm: how to have safe search
when engaging in nontraditional genetic
recombination, in: Foundations of Genetic
Algorithms, Morgan Kaufmann, 1991, pp.
265–283.
[15] A. Estabrooks, T. Jo, N. Japkowicz, A
multiple resampling method for learning
from imbalanced data-sets, Computational
Intelligence 20 (1) (2004) 18–36.
[16] T. Fawcett, F.J. Provost, Adaptive fraud
detection, Data Mining and Knowledge
Discovery 1 (3) (1997) 291–316.
[17] A. Fernández, S. García, M.J. del Jesus,
F. Herrera, An analysis of the rule weights
and fuzzy reasoning methods for linguistic
rule based classification systems applied to
problems with highly imbalanced data-sets,
in: International Workshop on Fuzzy Logic
and Applications (WILF07), Lecture Notes
in Computer Science, vol. 4578, Springer-
Verlag, 2007, pp. 170–179.
[18] A. Fernández, S. García, M.J. del Jesus,
F. Herrera, A study of the behaviour of
linguistic fuzzy rule based classification
systems in the framework of imbalanced
data-sets, Fuzzy Sets and Systems 159 (18)
(2008) 2378–2398.
[19] M. Friedman, The use of ranks to avoid
the assumption of normality implicit in the
analysis of variance, Journal of the
American Statistical Association 32 (1937)
675–701.
[20] S. García, D. Molina, M. Lozano, F.
Herrera, A study on the use of non-
parametric tests for analyzing the
evolutionary algorithms’ behaviour: a case
study on the CEC’2005 special session on
real parameter optimization. Journal of
Heuristics, in press, doi: 10.1007/s10732-
008-9080-4.