International Journal of Applied Engineering Research
ISSN 0973-4562 Volume 9, Number 24 (2014) pp. 30795-30812
© Research India Publications
http://www.ripublication.com
Anomaly Detection Via Eliminating Data
Redundancy and Rectifying Duplication in Uncertain
Data Streams
M. Nalini¹ and S. Anbu²
¹Research Scholar, Department of Computer Science and Engineering, St.Peter's University, Avadi, Chennai, India. Email: nalinicseme@gmail.com
²Professor, Department of Computer Science and Engineering, St.Peter's College of Engineering and Technology, Avadi, Chennai, India. Email: anbuss16@gmail.com
Abstract
Anomaly detection is an important problem in emerging fields such as data warehousing and data mining. Various generic anomaly detection techniques have already been developed for common applications. Most real-time systems that rely on consistent data to offer high-quality services are affected by record duplication, quasi-replicas, or partially erroneous data. Even government and private organizations are developing procedures for eliminating replicas from their data repositories. To maintain data quality, databases and database-driven applications require data that is error-free, with replicas removed and duplicates resolved, so that queries return accurate results. In this paper, we propose a Particle Swarm Optimization (PSO) approach to record de-duplication that combines several pieces of evidence extracted from the data content to identify a de-duplication method able to decide whether two entries in a data repository are replicas or not. In our experiments, this approach outperforms the state-of-the-art method found in the literature; the PSO approach is more capable than the existing approaches. The experimental results show that the proposed approach, implemented in the .NET Framework (Visual Studio 2010), is more efficient than the existing approaches.
Keywords: Data Duplication, Error Correction, DBMS, Data Mining, TPR,
FNR.
INTRODUCTION
Anomaly detection refers to the problem of finding patterns in any kind of data that do not conform to expected behavior. These non-conforming patterns are often referred to as anomalies, outliers, discordant observations, exceptions, aberrations, surprises, peculiarities, or contaminants in different application domains. Of these, anomalies and outliers are the two terms used most commonly, and sometimes interchangeably, in the context of anomaly detection. Anomaly detection finds extensive use in a wide variety of applications, such as fraud detection for credit cards, insurance, or health care; intrusion detection for cyber-security; fault detection in safety-critical systems; and military surveillance for enemy activities.
The importance of anomaly detection is due to the fact that anomalies in data translate to significant (and often critical) actionable information in a wide variety of application domains. For example, an anomalous traffic pattern in a computer network could mean that a hacked computer is sending sensitive data to an unauthorized destination. An anomalous MRI image may indicate the presence of malignant tumors. Anomalies in credit card transaction data could indicate credit card or identity theft, and anomalous readings from a spacecraft sensor could signify a fault in some component of the spacecraft. Detecting outliers or anomalies in data has been studied in the statistics community since as early as the 19th century. Over time, a variety of anomaly detection techniques have been developed in several research communities. Many of these techniques have been developed specifically for certain application domains, while others are more generic.
The data considered in this paper are uncertain data of very large size. The PSO approach is applied here to find and count duplicates: it combines several pieces of evidence extracted from the data content to produce a de-duplication method able to identify whether any two entries in a repository are the same. In the same manner, more pieces of the data are taken as evidence and compared against the whole data used as training data, and this function is applied repeatedly over the whole data or over the repositories. Newly inserted data can be compared in the same manner against the evidence to avoid replicas. A method applied to record de-duplication must accomplish individual but conflicting objectives; above all, the process should effectively increase the identification of replicated records. Genetic programming (GP) [15] is chosen as the baseline approach, since it is suited to finding accurate answers to a given problem without searching the data as a whole; for the record de-duplication problem, the existing approaches [14, 16] apply genetic programming and provide good solutions to it.
In this paper, the results of the existing system [16], which does not use a PSO-based approach, are taken for comparison; our approach is able to automatically find more effective de-duplication methods. Moreover, the PSO-based approach can interoperate with the best existing de-duplication methods by adapting the replica-identification boundaries used to classify a pair of records as a match or not. Our experiments use real datasets containing scientific article citations and hotel index records, together with synthetically generated datasets that allow a controlled experimental environment. Our approach can be applied in all these scenarios.
In summary, the contribution of this paper is a PSO-based approach to find and count duplicates, as follows:
• A solution with lower computational time for duplicate detection.
• Fewer individual comparisons, using the PSO approach to find the similarity values.
• Identification of replicas by computing the TPR and FPR over the data.
• Rectification of errors in the data entries.
RELATED WORKS
In [3] the authors propose an approach to data reduction; data reduction functions are essential to machine learning and data mining, and an agent-based population algorithm is used to solve the data reduction problem. Data reduction alone, however, is not sufficient to improve the quality of databases. Databases of various sizes are used to provide high-quality classification of the data for finding anomalies. In [4], evolutionary and non-evolutionary algorithms are applied and the results are compared to find the algorithm best suited to anomaly detection. N-ary relations are computed to define patterns in the dataset in [5], which provides relations over one-dimensional data. DBLEARN and DBDISCOVER [6] are two systems developed to analyze relational DBMSs. The main objective of the data mining technique in [7] is to detect and classify data in a huge database without compromising the speed of the process; PCA is used for data reduction and SVM for data classification. In [8] a data redundancy method is explored using a mathematical representation. In [9], software is developed with safe, correct, and reliable operation for avionics and automobile database systems. A statistical QA (question-answer) model is applied to develop a prototype that avoids web-based data redundancy [10]. For geographic data warehouses (GDW) [11], SOLAP (Spatial On-Line Analytical Processing) is applied to GiST-based and other spatial databases for analysis, indexing, and generating various reports without error. In [12], an effective method is proposed for peer-to-peer data sharing, in which data duplication is removed during the sharing. Web entity data extraction associated with the attributes of the data [13] can be obtained using a novel approach that exploits duplicated attribute-value pairs.
de Carvalho et al. [1] used genetic programming to mark duplicates and de-duplicate the data, concentrating mainly on identifying whether entries in a repository are replicas. Their approach outperformed the earlier approaches, providing 6.2% higher accuracy on the two datasets described in [2]. Our proposed approach can be extended to various benchmark and real-world datasets, such as time series data, clinical data, and the 20 Newsgroups collection.
PARTICLE SWARM OPTIMIZATION GENERAL CONCEPTS
Natural selection influences virtually all living things, and it inspires the ideas behind evolutionary programming approaches. Particle swarm optimization is one of the best-known evolutionary programming techniques. It is considered a heuristic approach and was initially applied to optimizing data properties and availability. PSO can also handle multi-objective problems with environmental constraints. PSO and the various other evolutionary approaches are widely known and applied in a variety of applications because of their good performance when searching over a large set of data. Instead of processing a single point in the search space of the problem, PSO maintains a population of individuals. This behavior is the essential aspect of the PSO approach: it creates additional new solutions with new combined features and moves them forward relative to the existing solutions in the search space.
PARTICLE OPERATIONS
PSO generates random particles representing individuals. In this paper, the particles are modeled as trees representing arithmetic functions, as illustrated in Figure-1. When using this tree representation in the PSO-based approach, the set of all inputs, variables, constants, and methods must be defined [8]. The terminating nodes of the trees are called leaves. A collection of operators, statements, and methods is used in the PSO evolutionary process to manipulate the terminal values; these methods are placed in the internal nodes of the tree, as shown in Figure-1. In general, PSO models the social behavior of birds: in order to search for food, every bird in a flock is guided by a velocity based on its personal experience and on information collected by interacting with the other members of the flock. This is the basic idea behind PSO. Each particle denotes a bird, and its flight denotes a search through the subspace of the optimization problem for the optimum solution. In PSO, the set of candidate solutions within an iteration is called the swarm, and its size equals the population size.
FIGURE-1: Tree used for mapping a function. [The figure shows an expression tree whose internal nodes hold operators (here + and /) and whose leaves hold terminals (here x and z); for example, Tree(a, b, c) = a + (b + b).]
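For concreteness, here is a minimal sketch in Python of such an expression tree (the Node class and its evaluator are illustrative assumptions; the paper gives no implementation):

```python
import operator

# Operators allowed in internal nodes (assumed set).
OPS = {"+": operator.add, "-": operator.sub,
       "*": operator.mul, "/": operator.truediv}

class Node:
    """One node of the function tree: operators inside, terminals at leaves."""
    def __init__(self, value, left=None, right=None):
        self.value, self.left, self.right = value, left, right

    def evaluate(self, env):
        if self.left is None and self.right is None:   # leaf node
            return env.get(self.value, self.value)     # variable or constant
        return OPS[self.value](self.left.evaluate(env),
                               self.right.evaluate(env))

# Tree(a, b, c) = a + (b + b), as in Figure-1's example
tree = Node("+", Node("a"), Node("+", Node("b"), Node("b")))
print(tree.evaluate({"a": 1.0, "b": 2.0, "c": 3.0}))   # prints 5.0
```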
PROPOSED APPROACH
The proposed approach uses the functionality of the PSO optimization method to find the difference between the entities of each record pair in a database. This difference is a similarity index between two data entities and decides the duplication: if the distance between two data entities x and y is less than a threshold value ε, then x and y are decided to be duplicates. The PSO algorithm applied in this paper is given here:
1. Generate a random population P, with a particle representing each individual data entry.
2. Assume a random feasible solution from the particles.
3. For i = 1 to P:
4.    Evaluate all particles based on the objective function.
5.    The objective function is Pr{ dist(x[i], y[j]) ≤ ε } ≥ α.
6.    Gbest = the particle with the best solution.
7.    Compute the velocity of the particles toward Gbest.
8.    Update the current position of the best solution.
9. Next i
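A minimal sketch of this loop in Python follows (the velocity and position updates use the standard PSO formulation; the particle encoding, inertia and acceleration constants, and the toy objective are assumptions, since the paper gives only the outline above):

```python
import numpy as np

def pso(objective, dim, n_particles=30, iters=100, w=0.7, c1=1.5, c2=1.5):
    """Generic PSO loop following steps 1-9 above (assumed parameterization)."""
    rng = np.random.default_rng(0)
    x = rng.uniform(0, 1, (n_particles, dim))       # step 1: random population
    v = np.zeros_like(x)
    pbest = x.copy()                                # personal best positions
    pbest_val = np.array([objective(p) for p in x])
    gbest = pbest[pbest_val.argmin()].copy()        # step 6: best particle
    for _ in range(iters):                          # step 3
        r1, r2 = rng.random(x.shape), rng.random(x.shape)
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)  # step 7
        x = x + v                                   # step 8
        vals = np.array([objective(p) for p in x])  # step 4: evaluate
        improved = vals < pbest_val
        pbest[improved], pbest_val[improved] = x[improved], vals[improved]
        gbest = pbest[pbest_val.argmin()].copy()
    return gbest

# Toy usage: minimize the distance to the point (0.3, 0.3).
best = pso(lambda p: float(np.sum((p - 0.3) ** 2)), dim=2)
```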
A database is a rectangular table consisting of a number of records:

DB = {R_1, R_2, ..., R_n}    --- (1)

and each record consists of a number of entities:

R_i = (e_i1, e_i2, ..., e_im)    --- (2)

where e_ij is the entity at row i and column j of the data; i represents the row and j represents the column. In this paper, the threshold value ε is a user-defined small value between 0 and 1.
Fig.1: Proposed Approach. [The flowchart shows the pipeline: load data → pre-process the data → divide data into windows → find similarity ratio → normalize the data → check data redundancy and error → mark redundant data → persist the data → anomaly detection.]
The overall functionality of the proposed approach is depicted in Fig.1. The database may be in any form, such as ORACLE, SQL Server, MySQL, MS-ACCESS, or EXCEL.
PREPROCESSING
Consider as an example the employee data of a multinational company whose branches are located all over the world. The entire dataset is read from the database and inspected for '~', empty spaces, '#', '*', and other irrelevant characters placed as entity values. [For example, if an entity is numerical data, it should contain only the digits 0 to 9; if it is a name, it should contain only alphabetic characters, possibly combined with '.', '_', or '-'.] If irrelevant characters are present in any entity, that entity is treated as erroneous data and is corrected, removed, or replaced by other relevant characters.
If the data type of the field is a string, the preprocessing function assigns "NULL" to the corresponding entity; if the data type of the field is numeric, the preprocessing function assigns 0's (according to the length of the numeric data type). Similarly, the preprocessing function replaces the entity with today's date if the data type is 'date', with '*' if the data type is 'character', and so on. Once the data is preprocessed, SQL queries return good results; otherwise, errors are generated.
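A minimal sketch of this cleaning rule in Python (the field-type validation patterns and placeholder values follow the description above; the function and pattern names are illustrative assumptions):

```python
import re
from datetime import date

# Placeholder values per field type, as described above.
PLACEHOLDERS = {"string": "NULL", "numeric": "0",
                "date": date.today().isoformat(), "character": "*"}
# Validation patterns for the field types the text spells out.
VALID = {"numeric": re.compile(r"^\d+$"),
         "string": re.compile(r"^[A-Za-z._-]+$")}

def preprocess(value, field_type):
    """Replace an entity containing irrelevant characters with a placeholder."""
    pattern = VALID.get(field_type)
    if pattern and not pattern.match(str(value)):
        return PLACEHOLDERS[field_type]    # erroneous entity -> placeholder
    return value

print(preprocess("##", "numeric"))   # -> "0"
print(preprocess("~ty", "string"))   # -> "NULL"
```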
For example, in Table-1 below, the first row gives the field names and the remaining rows contain records. In the first record, the fourth field holds the irrelevant character '~'; in the same way, the third field of the second record holds '##' instead of a number. This produces an error when a query such as
Select City from EMP;
is passed against the table EMP [Table-1]. To avoid errors during query processing, the City and Age fields are corrected by verifying the original data sources. If that is not possible, "NULL" is applied to alphanumeric fields and "0" to numeric fields to replace and correct the error. If the record cannot be corrected, the data is marked ['*'] and moved to a separate pool area.
Table-1: Sample Error Records, Pre-Processed and Marked ['*'] [EMP].

No      Name    Age    City       State    Comment
0001*   Kabir   45     ~ty                 Employee
0002*   Ramu    ##     Chennai    TN       Employee
The entire dataset can be divided into sub-windows for easy and fast processing. Let the dataset be DB; it is divided into sub-windows, shown in Fig.2 as DB1 and DB2, and each of DB1 and DB2 consists of a number of windows 1, 2, ..., n.
DATA NORMALIZING
In general, an uncertain data stream is considered for anomaly detection; the main problem addressed in this paper is anomaly detection for any kind of data stream. Since the size of a data stream is huge, in our approach the complete data is divided into subsets of data streams. A data stream DS is divided into two uncertain data streams DS1 and DS2, where both streams consist of a sequence of continuously arriving uncertain objects at various time intervals, denoted as

DS1 = {x[1], x[2], ..., x[t], ...}    --- (3)
DS2 = {y[1], y[2], ..., y[t], ...}    --- (4)

where x[i] (or y[i]) is a k-dimensional uncertain object at time interval i, and t is the current time interval. To group nearest neighbors, the operator should retrieve close pairs of objects within a period. Thus a compartment-window concept is adopted for the uncertain stream group (USG) operator. As shown in Fig.2, a USG operator always considers the most recent cw uncertain objects in each stream, that is

CW(DS1) = {x[t - cw + 1], x[t - cw + 2], ..., x[t]}    --- (5)
CW(DS2) = {y[t - cw + 1], y[t - cw + 2], ..., y[t]}    --- (6)
at the current time interval t. In other words, when a new uncertain object x[t+1] (y[t+1]) arrives at the next time interval (t+1), the new object x[t+1] (y[t+1]) is appended to DS1 (DS2). At that moment, the old object x[t-cw+1] (y[t-cw+1]) expires and is ejected from memory. Thus, USG at time interval (t+1) is conducted on a new compartment window {x[t-cw+2], ..., x[t+1]} ({y[t-cw+2], ..., y[t+1]}) of size cw.
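A minimal sketch of this compartment-window maintenance in Python (using a fixed-size deque is an implementation assumption; the simulated stream is illustrative):

```python
from collections import deque

cw = 50                                   # compartment-window size (user-defined)
window_ds1 = deque(maxlen=cw)             # CW(DS1): the most recent cw objects

# Simulated arrivals for DS1; each object is a k-dimensional point (k = 3 here).
stream = [[float(t), float(t) + 0.5, float(t) - 0.5] for t in range(200)]

for x_new in stream:
    window_ds1.append(x_new)              # new object appended; once the window
                                          # is full, x[t-cw+1] expires automatically
```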
Fig.2: Data Set Divided as Sub-Windows
For grouping the uncertain data streams, the two streams DS1 and DS2, a distance threshold ε, and a probabilistic threshold α ∈ [0, 1] are given. A group operator on uncertain data streams continuously monitors pairs of uncertain objects x[i] and y[j] within the compartment windows CW(DS1) and CW(DS2), respectively, each of size cw, at the current timestamp t. Here, the data streams DS1 and DS2 are compared to find the similarity distance, which is obtained using PSO, such that

PSO( Pr{ dist(x[i], y[j]) ≤ ε } ≥ α )    --- (7)

holds, where t - cw + 1 ≤ i, j ≤ t and dist(·,·) is the Euclidean distance function between two objects. To perform a USG, Equation (7), users need to register two parameters in PSO: the distance threshold ε and the probabilistic threshold α. Since each uncertain object at a timestamp consists of R samples, the grouping probability Pr{ dist(x[i], y[j]) ≤ ε } in Inequality (7) can be rewritten via samples as

Pr{ dist(x[i], y[j]) ≤ ε } = Σ_{s1 ∈ x[i]} Σ_{s2 ∈ y[j]} { s1.p · s2.p  if dist(s1, s2) ≤ ε;  0 otherwise }    --- (8)

where s1.p and s2.p are the probabilities of samples s1 and s2.
[Fig.2 depicts the two uncertain streams DS1 and DS2 (y[1], ..., y[t-cw+1], ..., y[t], y[t+1], ...), the compartment windows CW(DS1) and CW(DS2) at time interval t, the uncertain object that expires at time interval (t+1), the newly arriving uncertain object, and the USG answers.]
Note that a straightforward method to perform USG over compartment windows is to follow the USG definition directly: for every object pair <x[i], y[j]> from the compartment windows CW(DS1) and CW(DS2), respectively, compute the grouping probability that x[i] is within distance ε of y[j] (via samples) based on Equation (8). If the resulting probability is greater than or equal to the probabilistic threshold α, the pair <x[i], y[j]> is reported as a USG answer; otherwise it is a false alarm and can be safely discarded. The number of false alarms is counted using PSO by repeating the search n times and generating particles in the search space for each individual data item. For comparison, verification, and other related tasks, window-based data makes the work easy and fast for any DBMS; for example, a database of 1000 records can be divided into 4 sub-datasets of 250 records each.
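A minimal sketch of this straightforward USG check in Python with NumPy (the sample weights and the nested-loop evaluation follow Equations (7)-(8); everything else, including the toy objects, is assumed):

```python
import numpy as np

def grouping_probability(x_samples, y_samples, eps):
    """Equation (8): sum of s1.p * s2.p over sample pairs within distance eps.
    Each sample is a (point, probability) pair; probabilities per object sum to 1."""
    total = 0.0
    for p1, w1 in x_samples:
        for p2, w2 in y_samples:
            if np.linalg.norm(np.asarray(p1) - np.asarray(p2)) <= eps:
                total += w1 * w2
    return total

def usg(cw_ds1, cw_ds2, eps, alpha):
    """Report every pair <x[i], y[j]> whose grouping probability reaches alpha."""
    answers = []
    for i, x in enumerate(cw_ds1):
        for j, y in enumerate(cw_ds2):
            if grouping_probability(x, y, eps) >= alpha:
                answers.append((i, j))
    return answers

# Two toy uncertain objects, each with R = 2 equally weighted samples.
x_obj = [((0.0, 0.0), 0.5), ((0.1, 0.1), 0.5)]
y_obj = [((0.05, 0.0), 0.5), ((2.0, 2.0), 0.5)]
print(usg([x_obj], [y_obj], eps=0.2, alpha=0.4))   # -> [(0, 0)]
```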
Data in the database can be normalized using any normalization form for fast and accurate query processing. In this paper a user-defined normalization is also applied to improve efficiency, such as arranging the data in ascending or descending order according to the SQL query keywords.
PSO BASED SIMILARITY COMPUTATION
This paper applies a PSO-based comparison to decide whether two items are similar or dissimilar. PSO uses a measurement between two data items in a database defined over appropriate features. Since it accounts for unequal variances as well as the correlation between features, it evaluates the distance adequately by assigning different weights, or importance factors, to the features of the data entities. In this way, inconsistent data can be removed from real-world digital libraries.
Assume two groups, G1 and G2, holding data about girls and boys in a school. A number of girls are categorized into the same sub-group of G1 since their attributes or characteristics are the same. This is computed by PSO as

d(x_i, x_j) = (x_i − x_j) ≤ 1    --- (9)
The correlation among datasets is computed using the Similarity-Distance. Data entities are the main objects of data mining, and they are arranged in an order according to their attributes. A data entity with K attributes is considered a K-dimensional vector, represented as

x = (x_1, x_2, ..., x_K)    --- (10)

N data entities form a set

X = (x_1, x_2, ..., x_N) ⊂ ℝ^K    --- (11)

known as the data set. X can be represented by an N × K matrix

X = [x_ij]    --- (12)
where x_ij is the j-th component of the data entity x_i. There are various methods used for data mining; numerous of them, for example nearest-neighbor classification techniques, cluster analysis, and multi-dimensional scaling methods, are based on measures of similarity between data. As a replacement for measuring similarity, measuring dissimilarity among the entities gives the same results. One parameter that can be used for measuring dissimilarity is distance; this category of measures is also known as separability, divergence, or discrimination measures.
A distance metric is a real-valued function d such that for any data points x, y, and z:

d(x, y) ≥ 0, and d(x, y) = 0 iff x = y    --- (13)
d(x, y) = d(y, x)    --- (14)
d(x, z) ≤ d(x, y) + d(y, z)    --- (15)
The first property (13), positive definiteness, assures that the distance is non-negative and that the distance is zero only when the points are the same. The second property (14) indicates the symmetric nature of distance, and the third (15) is the triangle inequality. Various distance formulas are available, such as Euclidean, Manhattan, Lp-norm, and the Similarity-Distance. In this paper the Similarity-Distance is taken as the main method to find the similarity distance between two datasets. The distance between a set of observed groups in an m-dimensional space determined by m variables is known as the Similarity-Distance; a smaller distance value says the data in the groups are very close, and a larger one says they are not. The mathematical formula of the Similarity-Distance for two data samples X and Y is written as

d(X, Y) = sqrt( (X − Y)^T Σ⁻¹ (X − Y) )    --- (16)

where Σ⁻¹ is the inverse covariance matrix.
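A minimal sketch of Equation (16) in Python with NumPy (the Similarity-Distance as written coincides with the Mahalanobis distance; the toy data is an assumption):

```python
import numpy as np

def similarity_distance(x, y, data):
    """Equation (16): sqrt((x - y)^T * inv(cov) * (x - y)),
    with the covariance matrix estimated from `data` (rows = entities)."""
    cov_inv = np.linalg.inv(np.cov(data, rowvar=False))
    diff = np.asarray(x) - np.asarray(y)
    return float(np.sqrt(diff @ cov_inv @ diff))

data = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]])
print(similarity_distance(data[0], data[3], data))
```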
The similarity value between the sub-windows of dataset DB1 and dataset DB2 is computed and the result is stored in a variable named score:

score[i] = Σ_j ( W_j(DB1) − W_j(DB2) )    --- (17)

label[i] = 1 if score[i] ≤ ε, 0 if score[i] = 0, and −1 if score[i] > ε    --- (18)

The first case of (18) says that the data available in the corresponding windows of DB1 and DB2 are more or less similar; the second case says they are exactly the same; and the third says the data are different. Whenever the distance between datasets satisfies score[i] = 0 or score[i] ≤ ε, both data are marked in the DB. The value of score[i] yields two decision boundaries:
TPR: if the similarity value lies above this boundary [−1 to 1], the records are considered replicas;
TNR: if the similarity value lies below this boundary, the records are considered not to be replicas.
When the similarity value lies between the two boundaries, the records are classified as "possible matches"; in this case, human judgment is also necessary to find the matching score. Usually, most existing approaches to replica identification depend on several choices to set their parameters, and these choices may not always be optimal. Setting these parameters requires the accomplishment of the following tasks:
Selecting the best evidence to use: with more evidence it takes more time to detect duplication, because more processing is applied to compute the similarity among the data. Deciding how to merge the best evidence: some evidence may be more effective for duplicate identification than others. Finding the best boundary values to use: bad boundaries may increase the number of identification errors (e.g., false positives and false negatives), nullifying the whole process. Window 1 from DB1 is compared with window 1, window 2, window 3, and so on from DB2, which can be written as

score[i] = W_1(DB1) − Σ_j W_j(DB2)    --- (19)

If score[i] = 0, then W_1(DB1) and W_j(DB2) are the same, and the pair is marked as duplicate; otherwise W_1(DB1) is compared with the next window W_{j+1}(DB2).
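A minimal sketch of this window-to-window comparison in Python (following the decision rule of Equations (17)-(18); aggregating each window by a simple sum is an assumption, since the paper leaves the window encoding open):

```python
import numpy as np

eps = 0.05   # user-defined threshold between 0 and 1

def window_score(w1, w2):
    """Score between two windows of numeric records (assumed aggregation)."""
    return float(abs(np.sum(w1) - np.sum(w2)))

def mark_duplicates(db1_windows, db2_windows):
    """Mark window pairs that are identical or near-identical per Eq. (18)."""
    marks = []
    for i, w1 in enumerate(db1_windows):
        for j, w2 in enumerate(db2_windows):
            s = window_score(w1, w2)
            if s == 0 or s <= eps:          # identical, or within threshold
                marks.append((i, j))
    return marks

db1 = [np.array([1.0, 2.0]), np.array([5.0, 6.0])]
db2 = [np.array([1.0, 2.0]), np.array([9.0, 9.0])]
print(mark_duplicates(db1, db2))            # -> [(0, 0)]
```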
The objective of this paper is to improve the quality of the data in a DBMS so that it is error-free and can deliver fast results for any SQL query. It also concentrates on de-duplicating the data model where possible. Removing duplicates is inefficient and difficult in government organizations, and avoiding duplicate data enables high-quality retrieval from huge datasets such as banking data.
DATA:
Two real datasets commonly employed for evaluating record de-duplication are used for experimenting with the proposed approach; they are based on data gathered from web indexes. Additionally, further datasets were created using a synthetic data set generator. The first, the Cora Bibliographic dataset, is a collection of 1,295 distinct citations to 122 computer science papers taken from the Cora research paper search engine; these citations were separated into multiple attributes (author names, year, title, venue, pages, and other info) by an information extraction system. The second real dataset, the Restaurants dataset, comprises 864 records of restaurant names and supplementary data, including 112 replicas that were obtained by merging records from the Fodor and Zagat guidebooks; the attributes used from this dataset are (restaurant) name, address, city, and specialty. The synthetic datasets were created using the Synthetic Data Set Generator (SDG) [32] available in the Febrl [26] package.
Since real datasets such as time series datasets, the 20 Newsgroups dataset, and customer data from OLX.in are not always sufficient or easily accessible for the experiment, synthetic data was used as well. It contains fields such as name, age, city, address, and phone numbers (like a social security number). Using SDG, errors and duplications can also be introduced into the data manually, and some modifications can be applied at the record attribute level. The data taken for the experiments are:

DATA-1: This dataset contains four files of 1000 records (600 originals and 400 duplicates), with a maximum of five duplicates based on one original record (using a Poisson distribution of duplicate records) and with a maximum of two modifications in a single attribute and in the full record.
DATA-2: This dataset contains four files of 1000 records (750 originals and 250 duplicates), with a maximum of five duplicates based on one original record (using a Poisson distribution of duplicate records) and with a maximum of two modifications in a single attribute and four in the full record.

DATA-3: This dataset contains four files of 1000 records (800 originals and 200 duplicates), with a maximum of seven duplicates based on one original record (using a Poisson distribution of duplicate records) and with a maximum of four modifications in a single attribute and five in the full record. The duplication can be applied to each attribute of the data in the form of the pair ⟨attribute, similarity-function⟩ [i.e., the evidence].
The experiment on the time series data was done in MATLAB, and the time complexity is compared with the existing system. The elapsed time for running the proposed approach is 5.482168 seconds. The results obtained for all the functionalities defined in Fig.1 are depicted in Fig.3 to Fig.6.
Fig.3: Original Data Not Preprocessed
Fig.3 shows the data in its original form as taken from the web; it contains errors, redundancy, and noise. The three lines show the data DB divided into DB1, DB2, and DB3. The figure makes clear that DB1, DB2, and DB3 coincide and overlap in many places, which indicates data redundancy; the zigzag form shows the data is not preprocessed. In the time series data, 14 numerical entries were preprocessed [replaced with 0's], which was verified against the database.
Fig.4: Preprocessed Data
The preprocessed and normalized data is shown in Fig.4. The user-defined normalization arranges the data in order for easy processing. Even within DB1, DB2, and DB3, overlapping data remains, which indicates that much of the data is similar; Fig.4 clearly shows that DB1 and DB2 have more similar, overlapping data. After finding the similarity index, the matching data can be marked as duplication and removed for easier processing.
Fig.5: Single Window Data in DB1
After normalization, the data is divided into windows, as shown in Fig.5, where the window size of 50 is defined by the developer; each window holds 50 items for fast comparison. To confirm the behavior observed with real data, we conducted additional experiments using our synthetic datasets. The user-selected evidence setup used in this experiment was built from the following list of evidence:

<first name, PSO>, <last name, PSO>, <street number, string distance>, <address1, PSO>, <address2, PSO>, <suburb, PSO>, <postcode, string distance>, <state, PSO>, <date of birth, string distance>, <age, string distance>, <phone number, string distance>, <social security number, string distance>.

This list of evidence, using the PSO similarity function for free-text attributes and a string distance function for numeric attributes, was chosen because it required less processing time in our initial tuning tests.
Table-2: Original Data
Data set            Original Data    Good Data    Similar Data    Error Data
Time Series         1000             600          400             24%
Restaurant          1000             750          250             15%
Student Database    1000             800          200             12.4%
Cora                1000             700          300             19.2%
Table-3: Data Duplication Detection and De-Duplication
Data set       Original Data    Marked Duplication    De-Duplicated    Not De-Duplicated
Time Series    1000             400                   395              5
Restaurant     1000             250                   206              44
Student DB     1000             200                   146              54
Cora           1000             300                   244              46
Fig.6: Performance Evaluation of Proposed Approach
The performance of the proposed approach is evaluated by comparing duplicate detection, error detection, duplicate marking, the number of de-duplications achieved, and error correction across the various datasets. Fig.6 shows the performance evaluation of the proposed approach using the Similarity-Distance. According to the distance score, the duplicate and error records are detected and marked. The Similarity-Distance rectifies errors of 24%, 15%, 12.4%, and 19.2% for the Time series, Restaurant, Student, and Cora data respectively. The number of duplicate records detected by PSO is 400, 250, 200, and 300 for the Time series, Restaurant, Student, and Cora data, and the de-duplicated records are 395, 206, 146, and 244 respectively. Because of complexity or errors in the data, de-duplication does not reach 100%.
Some performance metrics can be calculated to determine the accuracy of the proposed approach:

TPR = Number of duplications found correctly / Total number of data
FPR = Number of duplications wrongly obtained / Total number of data to be identified

with TNR = 1 − FPR and FNR = 1 − TPR. From these counts,

Sensitivity = TP / (TP + FN) = 99%
Specificity = TN / (TN + FP) = 88.5%
Accuracy = (TP + TN) / (P + N) = 96.3%
where P = TP + FN and N = FP + TN. The proposed approach proves more efficient in terms of duplication detection, error detection, and de-duplication, with an accuracy of 96.3%. Hence the Similarity-Distance based duplication detection is more efficient.
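A minimal sketch of these metric computations in Python (the confusion-matrix counts are illustrative placeholders chosen to match the sensitivity and specificity above, not the paper's raw counts):

```python
# Illustrative placeholder counts (not the paper's raw data):
tp, fn = 990, 10        # P = TP + FN = 1000  -> sensitivity 99.0%
tn, fp = 885, 115       # N = FP + TN = 1000  -> specificity 88.5%

p, n = tp + fn, fp + tn
sensitivity = tp / p                  # TPR
specificity = tn / n                  # TNR = 1 - FPR
accuracy = (tp + tn) / (p + n)

print(f"Sensitivity = {sensitivity:.1%}")   # 99.0%
print(f"Specificity = {specificity:.1%}")   # 88.5%
# For these placeholder counts accuracy is 93.8%; the paper reports 96.3%
# on its datasets, where P and N are not equal.
print(f"Accuracy    = {accuracy:.1%}")
```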
CONCLUSION
In this paper, the PSO-based distance method is taken as the main method for finding similarity [redundancy] in any database; the similarity score is computed for various databases and the performance is compared. The accuracy obtained using the proposed approach is 96.3% over four different databases: the time series data is in Excel form, the Cora data is a table, the student data is in MS-Access, and the restaurant data is an SQL table. It is concluded from the experimental results that, with the proposed approach, it is easy to perform anomaly detection and removal in terms of data redundancy and error. In future work, reliability and scalability will be investigated with respect to data size and data variation.
REFERENCES
[1] Moisés G. de Carvalho, Alberto H.F. Laender, Marcos André Gonçalves, and Altigran S. da Silva, "A Genetic Programming Approach to Record Deduplication", IEEE Transactions on Knowledge and Data Engineering, Vol. 24, No. 3, March 2012.
[2] M. Wheatley, "Operation Clean Data", CIO Asia Magazine, http://www.cio-asia.com, Aug. 2004.
[3] Ireneusz Czarnowski, Piotr Jędrzejowicz, "Data Reduction Algorithm for Machine Learning and Data Mining", Volume 5027, 2008, pp. 276-285.
[4] Jose Ramon Cano, Francisco Herrera, Manuel Lozano, "Strategies for Scaling Up Evolutionary Instance Reduction Algorithms for Data Mining", Soft Computing, Volume 163, 2005, pp. 21-39.
[5] Gabriel Poesia, Loïc Cerf, "A Lossless Data Reduction for Mining Constrained Patterns in n-ary Relations", Machine Learning and Knowledge Discovery in Databases, Volume 8725, 2014, pp. 581-596.
[6] Nick J. Cercone, Howard J. Hamilton, Xiaohua Hu, Ning Shan, "Data Mining Using Attribute-Oriented Generalization and Information Reduction", Rough Sets and Data Mining, 1997, pp. 199-22.
[7] Vikrant Sabnis, Neelu Khare, "An Adaptive Iterative PCA-SVM Based Technique for Dimensionality Reduction to Support Fast Mining of Leukemia Data", SocProS 2012.
[8] Paul Ammann, Dahlard L. Lukes, John C. Knight, "Applying Data Redundancy to Differential Equation Solvers", Annals of Software Engineering, 1997, Volume 4, Issue 1, pp. 65-77.
[9] P.E. Ammann, "Data Redundancy for the Detection and Tolerance of Software Faults", Computing Science and Statistics, 1992, pp. 43-52.
[10] Rita Aceves-Pérez, Luis Villaseñor-Pineda, Manuel Montes-y-Gómez, "Towards a Multilingual QA System Based on the Web Data Redundancy", Lecture Notes in Computer Science, Volume 3528, 2005, pp. 32-37.
[11] Thiago Luís Lopes Siqueira, Cristina Dutra de Aguiar Ciferri, Valéria Cesário Times, Anjolina Grisi de Oliveira, Ricardo Rodrigues Ciferri, "The Impact of Spatial Data Redundancy on SOLAP Query Performance", Journal of the Brazilian Computer Society, June 2009, Volume 15, Issue 2, pp. 19-34.
[12] Ahmad Ali Iqbal, Maximilian Ott, Aruna Seneviratne, "Removing the Redundancy from Distributed Semantic Web Data", Database and Expert Systems Applications, Lecture Notes in Computer Science, Volume 6261, 2010, pp. 512-519.
[13] Yanxu Zhu, Gang Yin, Xiang Li, Huaimin Wang, Dianxi Shi, Lin Yuan, "Exploiting Attribute Redundancy for Web Entity Data Extraction", Digital Libraries: For Cultural Heritage, Knowledge Dissemination, and Future Creation, Lecture Notes in Computer Science, Volume 7008, 2011, pp. 98-107.
[14] M.G. de Carvalho, M.A. Gonçalves, A.H.F. Laender, and A.S. da Silva, "Learning to Deduplicate", Proc. Sixth ACM/IEEE CS Joint Conf. Digital Libraries, pp. 41-50, 2006.
[15] J.R. Koza, Genetic Programming: On the Programming of Computers by Means of Natural Selection, MIT Press, 1992.
[16] M.G. de Carvalho, A.H.F. Laender, M.A. Gonçalves, and A.S. da Silva, "Replica Identification Using Genetic Programming", Proc. 23rd Ann. ACM Symp. Applied Computing (SAC), pp. 1801-1806, 2008.
Heap Sort (SS).ppt FOR ENGINEERING GRADUATES, BCA, MCA, MTECH, BSC STUDENTSHeap Sort (SS).ppt FOR ENGINEERING GRADUATES, BCA, MCA, MTECH, BSC STUDENTS
Heap Sort (SS).ppt FOR ENGINEERING GRADUATES, BCA, MCA, MTECH, BSC STUDENTS
 
DfMAy 2024 - key insights and contributions
DfMAy 2024 - key insights and contributionsDfMAy 2024 - key insights and contributions
DfMAy 2024 - key insights and contributions
 
Design and Analysis of Algorithms-DP,Backtracking,Graphs,B&B
Design and Analysis of Algorithms-DP,Backtracking,Graphs,B&BDesign and Analysis of Algorithms-DP,Backtracking,Graphs,B&B
Design and Analysis of Algorithms-DP,Backtracking,Graphs,B&B
 
block diagram and signal flow graph representation
block diagram and signal flow graph representationblock diagram and signal flow graph representation
block diagram and signal flow graph representation
 
CW RADAR, FMCW RADAR, FMCW ALTIMETER, AND THEIR PARAMETERS
CW RADAR, FMCW RADAR, FMCW ALTIMETER, AND THEIR PARAMETERSCW RADAR, FMCW RADAR, FMCW ALTIMETER, AND THEIR PARAMETERS
CW RADAR, FMCW RADAR, FMCW ALTIMETER, AND THEIR PARAMETERS
 
NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...
NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...
NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...
 
Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...
Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...
Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...
 
CME397 Surface Engineering- Professional Elective
CME397 Surface Engineering- Professional ElectiveCME397 Surface Engineering- Professional Elective
CME397 Surface Engineering- Professional Elective
 
NUMERICAL SIMULATIONS OF HEAT AND MASS TRANSFER IN CONDENSING HEAT EXCHANGERS...
NUMERICAL SIMULATIONS OF HEAT AND MASS TRANSFER IN CONDENSING HEAT EXCHANGERS...NUMERICAL SIMULATIONS OF HEAT AND MASS TRANSFER IN CONDENSING HEAT EXCHANGERS...
NUMERICAL SIMULATIONS OF HEAT AND MASS TRANSFER IN CONDENSING HEAT EXCHANGERS...
 
Planning Of Procurement o different goods and services
Planning Of Procurement o different goods and servicesPlanning Of Procurement o different goods and services
Planning Of Procurement o different goods and services
 
English lab ppt no titlespecENG PPTt.pdf
English lab ppt no titlespecENG PPTt.pdfEnglish lab ppt no titlespecENG PPTt.pdf
English lab ppt no titlespecENG PPTt.pdf
 
Standard Reomte Control Interface - Neometrix
Standard Reomte Control Interface - NeometrixStandard Reomte Control Interface - Neometrix
Standard Reomte Control Interface - Neometrix
 
Hierarchical Digital Twin of a Naval Power System
Hierarchical Digital Twin of a Naval Power SystemHierarchical Digital Twin of a Naval Power System
Hierarchical Digital Twin of a Naval Power System
 
Student information management system project report ii.pdf
Student information management system project report ii.pdfStudent information management system project report ii.pdf
Student information management system project report ii.pdf
 
AP LAB PPT.pdf ap lab ppt no title specific
AP LAB PPT.pdf ap lab ppt no title specificAP LAB PPT.pdf ap lab ppt no title specific
AP LAB PPT.pdf ap lab ppt no title specific
 
一比一原版(SFU毕业证)西蒙菲莎大学毕业证成绩单如何办理
一比一原版(SFU毕业证)西蒙菲莎大学毕业证成绩单如何办理一比一原版(SFU毕业证)西蒙菲莎大学毕业证成绩单如何办理
一比一原版(SFU毕业证)西蒙菲莎大学毕业证成绩单如何办理
 
6th International Conference on Machine Learning & Applications (CMLA 2024)
6th International Conference on Machine Learning & Applications (CMLA 2024)6th International Conference on Machine Learning & Applications (CMLA 2024)
6th International Conference on Machine Learning & Applications (CMLA 2024)
 

Anomaly detection via eliminating data redundancy and rectifying data error in uncertain data streams

  • 1. International Journal of Applied Engineering Research ISSN 0973-4562 Volume 9, Number 24 (2014) pp. 30795-30812 © Research India Publications http://www.ripublication.com Anomaly Detection Via Eliminating Data Redundancy and Rectifying Duplication in Uncertain Data Streams M. Nalini 1* and S. Anbu 2 1* Research Scholar, Department of Computer Science and Engineering, St.Peter's University, Avadi, Chennai, India, email: nalinicseme@gmail.com 2 Professor, Department of Computer Science and Engineering, St.Peter's College of Engineering and Technology, Avadi, Chennai ,India, email: anbuss16@gmail.com Abstract One of the important problems is anomaly detection which necessitates in emerging fields, such as data warehouse and data mining etc. Various generic anomaly detection techniques were already developed for common applications. Most of the real time systems using consistent data offering high-quality services are affected by record duplication, quasi replicas or partial erred data. Even government and private organizations are developing procedures for eliminating replicas from their data repositories. In data maintenance the quality of the data, database, and database related applications needs error-free, replica removed, de-duplicated to improve the accuracy of the query passed. In this paper, we propose Particle Swarm Optimization approach to record de-duplication which combines various pieces of evidence extracted from the data content to identify a de-duplication method. which can able to find out whether two entries in a data repository are replicas or not. From the experiment, our approach outperforms available state-of-the-art method found in the literature. Our PSO approach is capable than the existing approaches. The experiment result shows that the proposed approach is more efficient than the existing approaches where the proposed approach is implemented in DOTNET framework-2010. Keywords: Data Duplication, Error Correction, DBMS, Data Mining, TPR, FNR. INTRODUCTION Anomaly detection refers to the problem of finding patterns in any kind of data which
variety of anomaly detection techniques have been developed in several research communities. Many of these techniques were developed for specific application domains, while others are more generic.

The data considered in this paper is uncertain data, and its volume is large. The PSO approach is applied here to find and count duplicates: it combines various pieces of evidence extracted from the data content to produce a de-duplication method. This approach can identify whether any two entries in a repository are the same. In the same manner, more pieces of the data are combined, taken as evidence, and compared against the whole data as training data. This function is applied repeatedly over the whole data set or over the repositories, and newly inserted data can be compared in the same way, so that replicas are avoided by comparison with the evidence.

A method applied to record de-duplication should accomplish distinct but contradictory objectives: it should effectively increase the identification of replicated records while keeping the computational cost of the comparisons low. Genetic programming (GP) [15] is a suitable basis for finding accurate answers to a given problem without searching the whole data, and the existing approaches [14, 16] apply it to the record de-duplication problem with good results. In this paper, the results of the existing system [16] are taken for comparison against our PSO-based approach, which is able to automatically find more effective de-duplication methods. Moreover, the PSO-based approach can interoperate with the best existing de-duplication methods by adjusting the replica-identification limits used to classify a pair of records as a match or not. Our experiments use real data sets containing scientific article citations and hotel index records, together with synthetically generated data sets that allow a controlled experimental environment. Our approach can be applied in all of these scenarios.
On the whole, the contribution of this paper is a PSO-based approach to find and count duplications, as follows:
- A solution with low computational time for duplicate detection.
- Fewer individual comparisons, using PSO to find the similarity values.
- Identification of replicas by computing the TPR and FPR over the data.
- Rectification of errors in the data entries.

RELATED WORKS
In [3] the author proposed an approach to data reduction; data reduction functions are essential to machine learning and data mining, and an agent-based population algorithm is used to solve the reduction problem. Data reduction alone, however, is not sufficient to improve the quality of databases. Databases of various sizes are used to provide high-quality classification among the data for finding anomalies. In [4], two classes of algorithms, evolutionary and non-evolutionary, are applied and their results compared to find the algorithm best suited for anomaly detection. N-ary relations are computed to define the patterns in a data set in [5], which provides relations in one-dimensional data. DBLEARN and DBDISCOVER [6] are two systems developed to analyze relational DBMSs. The main objective of a data mining technique is to detect and classify data in a huge database [7] without compromising the speed of the process; in [7], PCA is used for data reduction and SVM for data classification. In [8] a data redundancy method is explored using a mathematical representation. In [9], software with safe, correct and reliable operation is developed for avionics- and automobile-based database systems. A statistical QA (Question-Answer) model is applied to build a prototype that avoids web-based data redundancy [10]. For GDW (Geographic Data Warehouses) [11], SOLAP (Spatial On-Line Analytical Processing) is applied to the Gist database and other spatial databases for analysis, indexing, and generating various reports without error. In [12], an effective method was proposed for P2P data sharing, in which data duplication is removed during sharing. Web entity data extraction associated with the attributes of the data [13] can be achieved using a novel approach that exploits duplicated attribute-value pairs. De Carvalho et al. [1] used a genetic algorithm to mark duplications and perform de-duplication, concentrating mainly on identifying whether entries are replicas; this approach outperformed the earlier approaches by 6.2% in accuracy on the two data sets found in [2]. Our proposed approach can be extended to various benchmark and real-world data sets such as time series data, clinical data, the 20 Newsgroups collection, etc.

PARTICLE SWARM OPTIMIZATION: GENERAL CONCEPTS
Virtually all living things are influenced by the natural selection process, which inspired the ideas behind evolutionary programming approaches. Particle swarm
optimization (PSO) is one of the best-known evolutionary programming techniques. It is a heuristic approach, initially applied for optimizing data properties and availability, and it can also handle multi-objective problems under environmental constraints. PSO and the other evolutionary approaches are widely applied across a variety of applications because of their good performance when searching over a large data set. Instead of processing a single point in the problem's search space, PSO maintains a population of individuals; this behavior is the essential aspect of the PSO approach, as it creates additional new solutions with new combined features and moves them forward relative to the existing solutions in the search space.

PARTICLE OPERATIONS
PSO generates random particles representing individuals. In this paper, the particles are modeled as trees representing arithmetic functions, as illustrated in Figure-1. When using this tree representation in the PSO-based approach, the set of all inputs, variables, constants and methods must be defined [8]. The nodes terminating the trees are called leaves; the collection of operators, statements and methods used in the PSO evolutionary process to manipulate the terminal values is placed in the internal nodes of the tree, as shown in Figure-1. In general, PSO models the social behavior of birds: in order to search for food, every bird in a flock flies with a velocity based on its personal experience and on information collected by interacting with the others inside the flock. Each particle denotes a bird, and its flight denotes searching the subspace of the optimization problem for the optimum solution. In PSO, the solutions within an iteration are called a swarm and are equal in number to the population.

FIGURE-1: Tree used for mapping a function [the recoverable node labels form a tree with root +, a leaf x, and a subtree x / z, i.e., Tree(x, z) = x + (x / z)]
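To make the tree representation concrete, the following minimal Python sketch (ours, not the authors' code) shows how such a tree can be encoded and evaluated: internal nodes hold arithmetic operators, leaves hold terminal variables, and evaluating the root combines the leaf values into a single score. The function x + (x / z) matches the node labels recoverable from Figure-1 and is otherwise an assumption.

```python
# A minimal sketch (not from the paper) of the Figure-1 tree representation:
# internal nodes hold arithmetic operators, leaves hold terminal variables.
import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul,
       "/": lambda a, b: a / b if b != 0 else 0.0}  # protected division

class Node:
    def __init__(self, value, left=None, right=None):
        self.value, self.left, self.right = value, left, right

    def evaluate(self, env):
        if self.value in OPS:                      # internal node: operator
            return OPS[self.value](self.left.evaluate(env),
                                   self.right.evaluate(env))
        return env[self.value]                     # leaf: terminal variable

# The tree x + (x / z), matching the node labels recoverable from Figure-1.
tree = Node("+", Node("x"), Node("/", Node("x"), Node("z")))
print(tree.evaluate({"x": 0.8, "z": 0.5}))        # -> 2.4
```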
PROPOSED APPROACH
The proposed approach uses the PSO optimization method to find the difference between the entities of each record in a database. This difference indicates the similarity between two data entities and decides duplication: if the distance between two data entities e_i and e_j is less than a threshold value θ, then e_i and e_j are declared duplicates. The PSO algorithm applied in this paper is as follows:

1. Generate a random population P, representing each individual of the data entries.
2. Assume a random feasible solution from the particles.
3. For i = 1 to P:
4.    Evaluate all particles based on the objective function.
5.    The objective function is Pr{dist(e_i, e_j) ≤ θ} ≥ α.
6.    Gbest = the particle with the best solution.
7.    Compute the velocity of the Gbest particles.
8.    Update the current position of the best solution.
9. Next i.

A database is a rectangular table consisting of a number of records:

DB = {R_1, R_2, …, R_n} --- (1)

and each record R_i consists of a number of entities:

R_i = (e_i1, e_i2, …, e_ik) --- (2)

where e_ij is the entity at row i and column j of the data. In this paper the threshold value θ is a user-defined, very small value between 0 and 1.
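The following Python sketch illustrates steps 1-9 under illustrative assumptions of our own: each particle encodes per-attribute weights, and the fitness function counts labelled record pairs whose weighted distance disagrees with their duplicate/non-duplicate label. The labelled sample, the inertia and acceleration constants, and the THETA value are all hypothetical choices, not values from the paper.

```python
# A hedged sketch of the PSO loop above: particles are weight vectors, and
# fitness counts misclassified labelled record pairs (illustrative data).
import random

PAIRS = [  # (attribute-distance vector between two records, is_duplicate)
    ([0.05, 0.10, 0.00], True), ([0.90, 0.80, 0.70], False),
    ([0.10, 0.05, 0.10], True), ([0.60, 0.95, 0.40], False),
]
THETA = 0.3  # user-defined threshold in (0, 1), as in the paper

def fitness(w):
    errors = 0
    for dists, is_dup in PAIRS:
        d = sum(wi * di for wi, di in zip(w, dists)) / sum(w)
        errors += (d <= THETA) != is_dup          # step 4-5: objective
    return errors

DIM, P = 3, 20
pos = [[random.random() for _ in range(DIM)] for _ in range(P)]
vel = [[0.0] * DIM for _ in range(P)]
pbest = [p[:] for p in pos]
gbest = min(pbest, key=fitness)                   # step 6: global best

for _ in range(50):
    for i in range(P):
        for d in range(DIM):                      # steps 7-8: velocity/position
            r1, r2 = random.random(), random.random()
            vel[i][d] = (0.7 * vel[i][d]
                         + 1.5 * r1 * (pbest[i][d] - pos[i][d])
                         + 1.5 * r2 * (gbest[d] - pos[i][d]))
            pos[i][d] = min(1.0, max(0.01, pos[i][d] + vel[i][d]))
        if fitness(pos[i]) <= fitness(pbest[i]):
            pbest[i] = pos[i][:]
    gbest = min(pbest, key=fitness)

print("best weights:", gbest, "errors:", fitness(gbest))
```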
Fig.1: Proposed approach [flowchart: load data → pre-process the data → divide data as windows → find the similarity ratio → normalize the data → check data redundancy & error → yes: mark redundant data / no: persist the data → anomaly detection]

The overall functionality of the proposed approach is depicted in Fig.1. The database may be in any form, such as ORACLE, SQL, MySQL, MS-ACCESS or EXCEL.

PREPROCESSING
Consider as an example the employee data of a multinational company whose branches are located all over the world. The entire data set is read from the database and inspected for '~', empty spaces, '#', '*' and other irrelevant characters placed as an entity in the database. (For example, if an entity is numerical data, it should contain only the digits 0 to 9; if it is a name, it should contain only alphabets, combined only with '.', '_' or '-'.) When irrelevant characters are present in a data set, the affected entities are treated as error data and are corrected, removed or replaced with relevant characters. If the data type of the field is a string, the preprocessing function assigns "NULL" to the corresponding entity; if the data type is numeric, it assigns 0's (according to the length of the numeric data type). Similarly, the preprocessing function replaces the entity with today's date if the data type is 'date', with '*' if the data type is 'character', and so on. Once the data is preprocessed, SQL queries return good results; otherwise errors are generated. For example, in Table-1 below, the first row gives the field names and the remaining rows contain records. The fourth field of the first record contains the irrelevant character '~', and the third field of the second record contains '##' instead of numbers. An error is raised when a query such as
"Select City from EMP;" is passed against the table EMP [Table-1]. To avoid errors during query processing, the City and Age fields are corrected by verifying the original data sources. Where this is not possible, "NULL" is used to replace and correct the error in alphanumeric fields and "0" in numeric fields. If a record cannot be corrected, it is marked ['*'] and moved to a separate pool area.

Table-1: Sample error records, pre-processed and marked ['*'] [EMP]

No     Name   Age  City     State  Comment
0001*  Kabir  45   ~ty             Employee
0002*  Ramu   ##   Chennai  TN     Employee

The entire data set can be divided into sub-windows for easy and fast processing. Let the data set be DB; it is divided as shown in Fig.2 into DB1 and DB2, and each of DB1 and DB2 consists of a number of windows w_1, w_2, …, w_n. (A sketch of these preprocessing and windowing steps follows below.)
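As a minimal sketch of the preprocessing and windowing steps just described (the field schema and the regular expressions are our own assumptions): irrelevant characters in a string field become "NULL", in a numeric field become 0, in a date field become today's date, and the cleaned table is then divided into fixed-size sub-windows.

```python
# A hedged sketch of the cleaning rules and window split (illustrative schema).
import re
from datetime import date

def clean(value, ftype):
    if ftype == "numeric":
        return int(value) if re.fullmatch(r"\d+", value) else 0
    if ftype == "string":
        return value if re.fullmatch(r"[\w.\-]+", value) else "NULL"
    if ftype == "date":
        try:
            return date.fromisoformat(value)
        except ValueError:
            return date.today()          # rule: replace bad dates with today
    return value

SCHEMA = [("No", "string"), ("Name", "string"), ("Age", "numeric"),
          ("City", "string"), ("State", "string")]
EMP = [["0001", "Kabir", "45", "~ty", ""],
       ["0002", "Ramu", "##", "Chennai", "TN"]]

cleaned = [[clean(v, (t := s[1])) for v, s in zip(row, SCHEMA)] for row in EMP]

def windows(records, size):
    """Divide a table into fixed-size sub-windows w1, w2, ... for comparison."""
    return [records[i:i + size] for i in range(0, len(records), size)]

print(cleaned)           # '~ty' and '' -> "NULL", '##' -> 0
print(windows(cleaned, 1))
```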
DATA NORMALIZING
In general, an uncertain data stream is considered for anomaly detection; the main problem addressed in this paper is anomaly detection for any kind of data stream. Since the size of the data stream is huge, in our approach the complete data is divided into subsets of data streams. A data stream DS is divided into two uncertain data streams DS1 and DS2, where each stream consists of a sequence of continuously arriving uncertain objects over various time intervals, denoted as

DS1 = {x[1], x[2], …, x[t], …} --- (3)
DS2 = {y[1], y[2], …, y[t], …} --- (4)

where x[t] (resp. y[t]) is a k-dimensional uncertain object at time interval t, and t is the current time interval. To group nearest neighbors, the operator should retrieve close pairs of objects within a period; thus a compartment window concept is adopted for the uncertain stream group (USG) operator. As shown in Fig.2, a USG operator always considers the cw most recent uncertain objects in each stream, that is,

CW(DS1) = {x[t − cw + 1], x[t − cw + 2], …, x[t]} --- (5)
CW(DS2) = {y[t − cw + 1], y[t − cw + 2], …, y[t]} --- (6)

at the current time interval t. In other words, when a new uncertain object x[t+1] (y[t+1]) arrives at the next time interval (t+1), it is appended to DS1 (DS2); at that moment the old object x[t−cw+1] (y[t−cw+1]) expires and is ejected from memory. Thus, USG at time interval (t+1) is conducted on the new compartment window {x[t−cw+2], …, x[t+1]} ({y[t−cw+2], …, y[t+1]}) of size cw.

Fig.2: Data set divided into sub-windows [the figure shows the compartment windows CW(DS1) and CW(DS2) over the two uncertain streams at time interval t, the expired object and the newly arriving object at time interval (t+1), and the USG answers]

For grouping the uncertain data streams, the two streams DS1 and DS2, a distance threshold ε and a probabilistic threshold α ∈ [0, 1] are given. A group operator on uncertain data streams continuously monitors pairs of uncertain objects x[i] and y[j] within the compartment windows CW(DS1) and CW(DS2), respectively, of size cw at the current time stamp t. The similarity distance is obtained using PSO, such that

Pr{dist(x[i], y[j]) ≤ ε} ≥ α --- (7)

holds, where t − cw + 1 ≤ i, j ≤ t and dist(·, ·) is the Euclidean distance between two objects. To perform USG as in Equation (7), users register two parameters in PSO: the distance threshold ε and the probabilistic threshold α. Since each uncertain object at a timestamp consists of R samples, the grouping probability Pr{dist(x[i], y[j]) ≤ ε} in Inequality (7) can be rewritten via samples as

Pr{dist(x[i], y[j]) ≤ ε} = Σ_{s1 ∈ x[i], s2 ∈ y[j]} s1.prob · s2.prob if dist(s1, s2) ≤ ε; the pair contributes 0 otherwise --- (8)
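A small Python sketch of Equation (8) follows; the sample representation (a list of (point, probability) pairs per uncertain object) is an assumption of ours, chosen to mirror the R weighted samples mentioned above.

```python
# A hedged sketch of Equation (8): the grouping probability sums the joint
# weights of all sample pairs that fall within Euclidean distance eps.
import math

def grouping_probability(x_samples, y_samples, eps):
    """x_samples, y_samples: lists of (point, probability) for x[i], y[j]."""
    return sum(px * py
               for sx, px in x_samples
               for sy, py in y_samples
               if math.dist(sx, sy) <= eps)

x = [((0.0, 0.0), 0.5), ((1.0, 0.0), 0.5)]   # R = 2 samples, weight 0.5 each
y = [((0.1, 0.0), 0.6), ((5.0, 5.0), 0.4)]
print(grouping_probability(x, y, eps=0.5))    # -> 0.3: only the close pair counts
```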
Note that one straightforward method to perform USG directly over the compartment windows is to follow the USG definition: for every object pair <x[i], y[j]> from the compartment windows CW(DS1) and CW(DS2), respectively, we compute via samples, based on Equation (8), the grouping probability that x[i] is within distance ε of y[j]. If the resulting probability is greater than or equal to the probabilistic threshold α, the pair <x[i], y[j]> is reported as a USG answer; otherwise it is a false alarm and can be safely discarded. The number of false alarms is counted using PSO by repeating the process n times, generating a number of particles in the search space for each individual data item.
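This straightforward pair-by-pair evaluation can be sketched as follows. The deque-based compartment windows expire the oldest object automatically when a new one arrives, as Equations (5)-(6) require; the streams and parameter values are illustrative assumptions.

```python
# A sketch of the straightforward USG evaluation over compartment windows.
from collections import deque
import math

def grouping_probability(xs, ys, eps):            # Equation (8), as before
    return sum(px * py for sx, px in xs for sy, py in ys
               if math.dist(sx, sy) <= eps)

def usg_scan(cw1, cw2, eps, alpha):
    answers, false_alarms = [], 0
    for i, x in enumerate(cw1):                   # every pair across windows
        for j, y in enumerate(cw2):
            if grouping_probability(x, y, eps) >= alpha:
                answers.append((i, j))            # reported USG answer
            else:
                false_alarms += 1                 # counted, then discarded
    return answers, false_alarms

win1, win2 = deque(maxlen=3), deque(maxlen=3)     # compartment windows, cw = 3
for t in range(5):                                # one new object per interval
    win1.append([((float(t), 0.0), 1.0)])         # single-sample objects
    win2.append([((float(t) + 0.2, 0.0), 1.0)])
    print(t, usg_scan(win1, win2, eps=0.5, alpha=0.9))
```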
For any comparison, verification or other relevant task, window-based data makes the job easy and completes it quickly in any DBMS. For example, a database of 1000 records can be divided into four sub-datasets of 250 records each. Data in the database can be normalized using any normal form for fast and accurate query processing; in this paper a user-defined normalization is also applied to improve efficiency, arranging the data in a proper order (ascending or descending) according to the SQL query keywords.

PSO-BASED SIMILARITY COMPUTATION
This paper applies a PSO-based comparison for deciding whether two data items are similar or dissimilar. PSO uses a measurement between two data items in a database that is well defined by appropriate features; since it accounts for unequal variances as well as the correlation between features, it adequately evaluates the distance by assigning different weights (importance factors) to the features of the data entities. In this way the inconsistency of data in real-time digital libraries can be removed. Assume two groups G1 and G2 holding data about girls and boys in a school; the girls are categorized into the same sub-group of G1 since their attributes and characteristics are the same. PSO computes this closeness as

d(g_i, g_j) = (g_i − g_j)² ≤ 1 --- (9)

The correlation within a data set is computed using the Similarity-Distance. Data entities are the main objects of data mining, and they are arranged in an order according to their attributes. A data item with k attributes is considered a k-dimensional vector,

x = (x_1, x_2, …, x_k) --- (10)

and N data entities form a set

X = {x_1, x_2, …, x_N} ⊂ ℝ^k --- (11)

known as the data set. X can be represented by an N × k matrix

X = [x_ij] --- (12)

where x_ij is the jth component of the ith data entity. There are various methods used for data mining; numerous such methods, for example NN-classification techniques, cluster investigation, and multi-dimensional scaling methods, are based on measures of similarity between data. Instead of measuring similarity, measuring dissimilarity among the entities gives the same results, and one parameter that can measure dissimilarity is distance. This category of measures is also known as separability, divergence or discrimination measures. A distance metric is a real-valued function d such that for any data points x, y and z:

d(x, y) ≥ 0, and d(x, y) = 0 iff x = y --- (13)
d(x, y) = d(y, x) --- (14)
d(x, z) ≤ d(x, y) + d(y, z) --- (15)

The first property (13), positive definiteness, ensures that the distance is non-negative and is zero only when the two points are the same. The second property states the symmetry of distance, and the third is the triangle inequality. Various distance formulas are available, such as the Euclidean, Manhattan, Lp-norm and Similarity-Distance; in this paper the Similarity-Distance is the main method used to find the similarity between two data sets. It measures the distance between observed groups in an m-dimensional space determined by m variables: a small distance value means the data in the groups are very close, a large value that they are not. The Similarity-Distance between two data samples X and Y is written as

d(X, Y) = (X − Y)^T Σ⁻¹ (X − Y) --- (16)

where Σ⁻¹ is the inverse covariance matrix. The similarity value between the sub-windows of data set DB1 and data set DB2 is computed and the result is stored in a variable named score:

score[i] = Σ (w_i(DB1) − w_i(DB2)) --- (17)

mark[i] = 1 if 0 < score[i] ≤ threshold; 0 if score[i] = 0; −1 if score[i] > threshold --- (18)

The first case in (18) says that the data in the corresponding windows of DB1 and DB2 are more or less similar; the second says they are exactly the same; and the third says the data are different. Whenever the distances satisfy score[i] = 0 or score[i] ≤ threshold, both data items are marked in the DB. (A sketch of this distance and marking rule follows below.)
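A minimal sketch of Equations (16)-(18), assuming NumPy and illustrative sample windows: the Similarity-Distance is computed with the inverse covariance matrix, and the resulting score is mapped to the three-way mark (exact duplicate, similar, or different).

```python
# A hedged sketch of the Similarity-Distance (16) and marking rule (18).
import numpy as np

def similarity_distance(x, y, cov_inv):
    d = x - y
    return float(d @ cov_inv @ d)        # (x - y)^T  Sigma^{-1}  (x - y)

def mark(score, threshold):
    if score == 0:
        return 0                         # exactly the same: duplicate
    if score <= threshold:
        return 1                         # more or less similar
    return -1                            # different

w1 = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 5.0]])   # window from DB1
w2 = np.array([[1.0, 2.1], [2.2, 3.0], [3.0, 5.0]])   # window from DB2
cov_inv = np.linalg.pinv(np.cov(np.vstack([w1, w2]).T))

scores = [similarity_distance(a, b, cov_inv) for a, b in zip(w1, w2)]
print([(round(s, 3), mark(s, threshold=0.5)) for s in scores])
# the identical third pair scores 0 and is marked as an exact duplicate
```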
The value of score[i] yields two classes of decision:
TPR: if the similarity value lies above the boundary [-1 to 1], the records are considered replicas;
TNR: if the similarity value lies below this boundary, the records are considered not to be replicas.
When the similarity value lies between the two boundaries, the records are classified as "possible matches"; in this case human judgment is needed to settle the matching score. Most existing approaches to replica identification depend on several parameter choices, which are not always optimal. Setting these parameters requires accomplishing the following tasks: selecting the best evidence to use (with more evidence, finding duplications takes more time, since more processes are needed to compute the similarity among the data); deciding how to merge the evidence (some evidence may be more effective for duplication identification than others); and finding the best boundary values to use (bad boundaries may increase the number of identification errors, e.g., false positives and false negatives, nullifying the whole process).

Window w_1 from DB1 is compared with windows w_1, w_2, w_3 and so on from DB2, which can be written as

score[i] = Σ (w_1(DB1) − w_i(DB2)) --- (19)

If score[i] = 0, then w_1(DB1) and w_i(DB2) are the same, and the pair is marked as duplicate; otherwise w_1(DB1) is compared with the next window w_{i+1}(DB2). The objective of this paper is to improve the quality of the data in a DBMS so that it is error-free and can return fast outputs for any SQL query; it also concentrates on de-duplication where possible in the data model. The removal of duplicates is inefficient and difficult in government-based organizations, and avoiding duplicate data enables high-quality retrieval from huge data sets such as banking.

DATA
Two real data sets commonly employed for evaluating record de-duplication are used in the experiments; they are based on data gathered from web indexes. Additionally, further data sets were created using a synthetic data set generator. The first, the Cora data set, is a collection of 1,295 distinct citations to 122 computer science papers taken from the Cora research paper search engine; these citations were separated into multiple attributes (author names, year, title, venue, pages and other info) by an information extraction system. The second real data set, the Restaurants data set, comprises 864 records of restaurant names and supplementary data, including 112 replicas, obtained by integrating records from Fodor's and Zagat's guidebooks; the following attributes are used from this data set: (restaurant) name, address, city, and specialty. The synthetic data sets were created using the Synthetic Data Set Generator (SDG) [32] available in the Febrl [26] package.
Since real data sets such as time series data, the 20 Newsgroups data set and customer data from OLX.in are not sufficient and not easily accessible for the experiment, SDG is used; the generated data contain fields such as name, age, city, address and phone numbers (like a social security number). With SDG, errors and duplications can also be introduced into the data manually, and some modifications can be applied at the attribute level of the records. The data sets taken for the experiments are:

DATA-1: four files of 1000 records (600 originals and 400 duplicates), with a maximum of five duplicates based on one original record (using a Poisson distribution of duplicate records) and with a maximum of two modifications in a single attribute and in the full record.

DATA-2: four files of 1000 records (750 originals and 250 duplicates), with a maximum of five duplicates based on one original record (using a Poisson distribution of duplicate records) and with a maximum of two modifications in a single attribute and four in the full record.

DATA-3: four files of 1000 records (800 originals and 200 duplicates), with a maximum of seven duplicates based on one original record (using a Poisson distribution of duplicate records) and with a maximum of four modifications in a single attribute and five in the full record.

The duplication can be applied to each attribute of the data in the form of evidence pairs <attribute, similarity-function>. The experiment on the time series data was done in MATLAB, and the time complexity is compared with the existing system; the elapsed time for the proposed approach is 5.482168 seconds. The results obtained for all the functionalities defined in Fig.1 are depicted in Fig.3 to Fig.6.
Fig.3: Original data, not preprocessed

Fig.3 shows the data as taken from the web: it has errors, redundancy and noise. The three lines show that the data DB is divided into DB1, DB2 and DB3. The figure makes clear that DB1, DB2 and DB3 coincide and overlap in many places, which indicates data redundancy, and the zigzag form of the curves shows that the data is not preprocessed. In the time series data, 14 numerical entries were preprocessed [replaced with 0's], as verified from the database.

Fig.4: Preprocessed data
The preprocessed and normalized data is shown in Fig.4. The user-defined normalization arranges the data in order for easy processing. Even within themselves, DB1, DB2 and DB3 have overlapping data, which indicates that many data items are similar; Fig.4 clearly shows that DB1 and DB2 have the most overlapping, similar data. After finding the similarity index, those identical data can be marked as duplicates and removed for easier processing.

Fig.5: Single window of data in DB1

After normalization the data is divided into windows, as shown in Fig.5, where the window size of 50 is defined by the developer; each window holds 50 data items for fast comparison. To confirm the behavior observed with real data, we conducted additional experiments using our synthetic data sets. The user-selected evidence setup used in this experiment was built from the following list of evidence: <first name, PSO>, <last name, PSO>, <street number, string distance>, <address1, PSO>, <address2, PSO>, <suburb, PSO>, <postcode, string distance>, <state, PSO>, <date of birth, string distance>, <age, string distance>, <phone number, string distance>, <social security number, string distance>. This list, using the PSO similarity function for free-text attributes and a string distance function for numeric attributes, was chosen because it required less processing time in our initial tuning tests.
Table-2: Original data

Data set          Original Data  Good Data  Similar Data  Error Data
Time Series       1000           600        400           24%
Restaurant        1000           750        250           15%
Student Database  1000           800        200           12.4%
Cora              1000           700        300           19.2%

Table-3: Data duplication detection and de-duplication

Data set     Original Data  Marked Duplication  De-Duplicated  Not De-Duplicated
Time Series  1000           400                 395            5
Restaurant   1000           250                 206            44
Student DB   1000           200                 146            54
Cora         1000           300                 244            46

Fig.6: Performance evaluation of the proposed approach

The performance of the proposed approach is evaluated by comparing the detection of duplications and errors, the marking of duplications, the number of de-duplications achieved, and the error correction across the data sets. Fig.6 shows the performance evaluation of the proposed approach using the Similarity-Distance. According to the distance score, the duplicate and error records are detected and marked. The Similarity-Distance rectifies errors of 24%, 15%, 12.4% and 19.2% for the Time series, Restaurant, Student and Cora data respectively. The number of duplicate records detected by PSO is 400, 250, 200 and 300 for the Time series, Restaurant, Student and Cora data, and the de-duplicated counts are
395, 206, 146 and 244 respectively. Because of complexity and errors in the data, de-duplication is not achieved 100%. Several performance metrics can be calculated to quantify the accuracy of the proposed approach (a computational sketch follows below):

TPR = (number of duplications found correctly) / (total number of data)
FNR = (number of duplications wrongly obtained) / (total number of data to be identified)
Sensitivity = TP / P = 99%
Specificity = TN / N = 88.5%
Accuracy = (TP + TN) / (P + N) = 96.3%

where P = TP + FN and N = FP + TN. The proposed approach thus shows better efficiency in terms of duplication detection, error detection and de-duplication, with an accuracy of 96.3%; hence the Similarity-Distance-based duplication detection is more efficient.
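As a computational sketch of these metrics under the usual confusion-matrix definitions (the counts below are illustrative, not the paper's raw tallies):

```python
# A short sketch of the reported metrics, with P = TP + FN and N = FP + TN.
def metrics(tp, fp, tn, fn):
    p, n = tp + fn, fp + tn
    return {
        "TPR/sensitivity": tp / p,
        "specificity":     tn / n,
        "FNR":             fn / p,
        "accuracy":        (tp + tn) / (p + n),
    }

print(metrics(tp=991, fp=115, tn=885, fn=9))
# sensitivity ~0.99, specificity ~0.885, accuracy ~0.94 (illustrative counts)
```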
CONCLUSION
In this paper the PSO-based distance method is the main method for finding similarity [redundancy] in a database: the similarity score is computed for various databases and the performance is compared. The accuracy obtained with the proposed approach is 96.3% across four different databases. The time series data is in Excel form, the Cora data in table form, the student data in MS-Access, and the restaurant data in an SQL table. The experimental results show that with the proposed approach it is easy to perform anomaly detection and removal in terms of data redundancy and error. In future work, reliability and scalability will be investigated with respect to data size and data variation.

REFERENCES
[1] Moisés G. de Carvalho, Alberto H.F. Laender, Marcos André Gonçalves, and Altigran S. da Silva, "A Genetic Programming Approach to Record Deduplication", IEEE Transactions on Knowledge and Data Engineering, Vol. 24, No. 3, March 2012.
[2] M. Wheatley, "Operation Clean Data", CIO Asia Magazine, http://www.cio-asia.com, Aug. 2004.
[3] Ireneusz Czarnowski, Piotr Jędrzejowicz, "Data Reduction Algorithm for Machine Learning and Data Mining", Volume 5027, 2008, pp. 276-285.
[4] Jose Ramon Cano, Francisco Herrera, Manuel Lozano, "Strategies for Scaling Up Evolutionary Instance Reduction Algorithms for Data Mining", in Soft Computing, Volume 163, 2005, pp. 21-39.
[5] Gabriel Poesia, Loïc Cerf, "A Lossless Data Reduction for Mining Constrained Patterns in n-ary Relations", Machine Learning and Knowledge Discovery in Databases, Volume 8725, 2014, pp. 581-596.
[6] Nick J. Cercone, Howard J. Hamilton, Xiaohua Hu, Ning Shan, "Data Mining Using Attribute-Oriented Generalization and Information Reduction", Rough Sets and Data Mining, 1997, pp. 199-22.
[7] Vikrant Sabnis, Neelu Khare, "An Adaptive Iterative PCA-SVM Based Technique for Dimensionality Reduction to Support Fast Mining of Leukemia Data", SocProS 2012.
[8] Paul Ammann, Dahlard L. Lukes, John C. Knight, "Applying Data Redundancy to Differential Equation Solvers", Annals of Software Engineering, 1997, Volume 4, Issue 1, pp. 65-77.
[9] P. E. Ammann, "Data Redundancy for the Detection and Tolerance of Software Faults", Computing Science and Statistics, 1992, pp. 43-52.
[10] Rita Aceves-Pérez, Luis Villaseñor-Pineda, Manuel Montes-y-Gómez, "Towards a Multilingual QA System Based on the Web Data Redundancy", Computer Science Volume 3528, 2005, pp. 32-37.
[11] Thiago Luís Lopes Siqueira, Cristina Dutra de Aguiar Ciferri, Valéria Cesário Times, Anjolina Grisi de Oliveira, Ricardo Rodrigues Ciferri, "The Impact of Spatial Data Redundancy on SOLAP Query Performance", Journal of the Brazilian Computer Society, June 2009, Volume 15, Issue 2, pp. 19-34.
[12] Ahmad Ali Iqbal, Maximilian Ott, Aruna Seneviratne, "Removing the Redundancy from Distributed Semantic Web Data", Database and Expert Systems Applications, Lecture Notes in Computer Science, Volume 6261, 2010, pp. 512-519.
[13] Yanxu Zhu, Gang Yin, Xiang Li, Huaimin Wang, Dianxi Shi, Lin Yuan, "Exploiting Attribute Redundancy for Web Entity Data Extraction", Digital Libraries: For Cultural Heritage, Knowledge Dissemination, and Future Creation, Lecture Notes in Computer Science, Volume 7008, 2011, pp. 98-107.
[14] M.G. de Carvalho, M.A. Gonçalves, A.H.F. Laender, and A.S. da Silva, "Learning to Deduplicate", Proc. Sixth ACM/IEEE CS Joint Conf. on Digital Libraries, pp. 41-50, 2006.
[15] J.R. Koza, Genetic Programming: On the Programming of Computers by Means of Natural Selection. MIT Press, 1992.
[16] M.G. de Carvalho, A.H.F. Laender, M.A. Gonçalves, and A.S. da Silva, "Replica Identification Using Genetic Programming", Proc. 23rd Ann. ACM Symp. on Applied Computing (SAC), pp. 1801-1806, 2008.