SlideShare a Scribd company logo
International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056
Volume: 05 Issue: 06 | June 2018 www.irjet.net p-ISSN: 2395-0072
© 2018, IRJET | Impact Factor value: 7.211 | ISO 9001:2008 Certified Journal | Page 2354
Efficient Data Linkage Technique Using One Class Clustering Tree For
Database Misuse Domain
Mandilkar Kalyani1, Prof.Panhalkar A.R2
1Student MEIT, Dept. of Information Technology, Amrutvahini college of engineering, Sangamner, Maharastra.
2Professor, MEIT, Dept. of Information Technology, Amrutvahini college of engineering, Sangamner, Maharastra.
---------------------------------------------------------------------***---------------------------------------------------------------------
Abstract - The task of data linkage is to join datasets that do
not share a common identifier.The process of Data Linkage is
implement with entities which are of same and differenttypes.
One-to-many data Linkage is an necessary Data Model. Two
types of data linkages techniques are one-to-one and one-to-
many . In this paper, we propose a one-to-many data linkage.
it is used to accomplish link between different types in the
matching sets. This present method is depends on One-Class
Clustering Tree (OCCT). The tree is constructed in such a way
that it is easily understandable and can be transformed into
association rules. The inner nodes of the tree include features
of the first set of entities. The leaves of the tree appear as
features of the second set that are matching. The data is split
using four splitting criteria,and two pruningmethodsareused
.The result shows that OCCT gives better performances in
terms of precision and recall.A threshold is calculated which
plays a main role in the categorization of the users. A pruning
technique is used to reduce the tree by avoiding the
unnecessary branches. Thus, the proposed system is enhanced
in terms of time complexity and accuracy.
Key Words: Pruning, Splitting criteria, Record Linkage
1. INTRODUCTION
There is a need to found the techniquesto linkthedatasets
that does not share the common identity; it comesunderthe
data linkage process. Data linkage is used when information
from two or more records of independent sources are
brought together, when they are perceived to belong to the
same individual, family, event or place . It is the technique
that is used to connect the information across several
disparate data sources. The major task of data linkage is to
identify the different objects which were used to refer the
same entity across the different source of data.
The data linkage can be divided into two types: one to one
and one to many. The goal is to associate an entity from one
data set with a single matching entity in another data set is
in one to one linkage. Most of the previous works focus on
one-to-one data linkage. we propose a new data linkage
method aimed at performing one to many linkages. The
proposed data linkage technique can match entities of
different types, while data linkage is usually performed
among entities of the same type.[1]
In this paper, we propose a new data linkage method one-
to-many linkage that can match entities of different types.
For example, Let TA and TB are the two tables of different
types. The inner nodes of the tree consist of attributes
referring to both of the tables being matched (TA and TB).
The leaves of the tree will determine whether a pair of
records described by the path in the tree ending with the
current leaf is a match or a nonmatch.We proposeanewdata
linkage method aimed at performingone-to-manylinkage.In
addition,while data linkage is usually performed among
entities of the same type, this proposed data linkage
technique can match entitiesof different types. Forexample,
if there is a student database we might want to linkastudent
record with the courses she should take (according to
different features that describe the student and features
describing the courses). The proposedmethodlinksbetween
the entities using a One-Class Clustering Tree (OCCT). A
clustering tree is a tree in which each of the leavescontainsa
cluster instead of a single classification and each cluster is
generalized by a set of rulesthat is stored in the appropriate
leaf [9], [10].
Detecting data misuse poses a great challenge for
organizations. Whether caused by malicious intent or an
inadvertent mistake, data leakage/misuse can diminish a
company's brand, reduce shareholder value,anddamagethe
company's goodwill and reputation. This challenge is
intensified when trying to detect and/or prevent data
leakage/misuse performed by an insider with legitimate
permissions to access the organization's systems and its
critical data.
The goal of using the OCCT in this domain is to link a set of
records, representing the context of the request (i.e., the
actual access to certain data), with a set of records
representing the data that can be legitimately retrieved
within the specific context. Thus, the inner nodes of the
OCCT represent contextual attributes in which the request
occurs (table TA), and the probabilistic models in the leaves
represent the data that can be legitimately retrieved in the
specific context (table TB). Pairs which are detected as
nonmatching are assumed to be illegitimate (i.e., malicious),
and will trigger an alert to the organization’s securityofficer.
By analyzing both the context of the request as well as the
International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056
Volume: 05 Issue: 06 | June 2018 www.irjet.net p-ISSN: 2395-0072
© 2018, IRJET | Impact Factor value: 7.211 | ISO 9001:2008 Certified Journal | Page 2355
data that the user is exposed to, the OCCT method can
improve the detection accuracy and better distinguish
between a normal and abnormal request
In this paper, a new record linkage method is introduced
which performs many-to-many record linkage. In many-to-
many record linkage the dataset from different tablesmatch
with the entity of different tables. Decision tree employd for
decide which records are similar to each-other.Decision
trees are usually regarded as representing for classification.
The leaves of the tree contain the classes and the branches
from the root to a leaf contain sufficient conditions for
classification.
The proposed method is implemented using the database
misuse domain used to identify the common and malicious
users. In this domain, the goal is to identify anomalous
access to database records that may indicate a possible data
leakage, loss of data integrity or data mishandling.
2. RELATED WORK
Record linkage is the process of matching entities from two
different data sources that may or may not contain a
common element. Shabtai et al. [1]usedone-to-manylinkage
in different domains like fraud detection, recommender
systems and data leakage prevention by developing a One-
Class Clustering Tree. Four splitting criteria and pruning
methods are implemented to construct the tree and to avoid
the needless branches of the tree correspondingly. The
shortcoming of this approach is that it is difficult to reduce
the linkage computation time and it is a one-class approach.
Christen and Goiser [2] used a C4.5 decision tree to
determine which records must be matched to one another.
In their work, various string comparison methods are used
and compared by constructing different decision trees.
However, their method performs the matching of attributes
that are only predefined. Moreover only one or two
attributes are usually used.
Henry et al. [3] used one-to-many linkage for genealogical
research. Record linkage was performed using five
attributes: name of the person, birth date, place, gender and
the relationships between the persons. Using these five
attributes a decision tree wasinduced. The drawback of this
approach is that it performs matching using the specific
attributes and therefore it is very hard to generalize.
F.De Comite, F.Denis, R. Gilleron and F.Letouzey introduced
POSC4.5 algorithm for record linkage of positive and
unlabeled examples. They considered binary classification
and hence this method is not generalized. They require
not only the data set but also the information of positive
examples out of whole data set. The attraction of their work
is that they presented modified entropy formula which that
considers weight of positive examples in a given data set.
They assumed that negative examples are in unlabeled data
set as per given distribution [4].
[3] PROPOSED SYSTEM
A. Problem Statement
The aim of this paper is to link a record from a table TA
with records from another table TB. The generated model is
in the form of a tree in which the inner nodes represent
attributes from TA and the leafs hold a compact
representation of a subset of records from TB.
B. Feature
The OCCT yields better performance in terms of precision
and recall.It will improve the efficiency of classifiers by
improvement in the pre and post processing techniques.
The simplicity of the model, which can easily betransformed
to rules of the type A ->B.
The prepruning approach which is used to reduce the time
complexity of the algorithm.
C. Scope
The OCCT can used in different domains like:
FraudDetection- To find the fraudulent users.In
Recommender systemstheproposed system can beused
formatchingnew userswiththeirproductexpectations.In
Data leakag prevention.-todetecttheab-normalaccessto
the database records that indicates data leakage or data
misuse.
The process flow of the proposed system is as follows: First
consider the customer dataset and the context dataset for
linkage process. Then preprocess both the databy removing
the null values and error data. After preprocessing, create a
training table based on both the context and customer
dataset. For training table creation, consider the attributes
such as customer id, requesting location, requesting day,
requesting time from the first database andintheothertable
consider the attributes such as user location and the
business type.
Then measure the similarity score based on the attributesof
requesting location, requesting day, requesting time with
user location and the businesstype. Based on that, construct
the tree by applying the splitting criteria. Then apply the
post pruning method to remove the unnecessary branches
that are not used in the classification process.Thenapplythe
SEMANC method which is used to analyzethenormaluseror
the malicious user from the linked data. At last, the
International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056
Volume: 05 Issue: 06 | June 2018 www.irjet.net p-ISSN: 2395-0072
© 2018, IRJET | Impact Factor value: 7.211 | ISO 9001:2008 Certified Journal | Page 2356
performance graph is drawn based on both the existing and
the proposed system.
Figure 1:- Architecture diagram
4. SPLITTING CRITERIA
Maximum Likelihood Estimation:
This algorithm is used for choosing the suitable splitting
methods. This splitting methodshelp to find outthe suitable
attributes that are not split until now. We choose those
attributes those are having the uppermost value ascompare
to our threshold value. Those vlaue are havingmorethanthe
thresold value that record is a exact match. Those value are
having less than the threshold value that records are flase
match. Its depends upon the time complexity for calculating
the all the values for their respective attribute. Its help to
calculate the time complexity..Its help to calculate the early
threshold value and the choose the next which feature going
to be split.
The domain that is going to be implemented is database
misuse domain. Then the attributes are split up by the
similarity values between the clusters. Finally tree will be
constructed. The internal nodes of the tree consist of
attributes referring to both of the tables being matched (TA
and TB). The leaves of the tree will determine whetherapair
of records described by the path in the tree ending with the
current leaf is a match or a non-match.
5. PRUNING
Pruning is considered to be an important activity during the
tree construction process. The necessity of pruning is to
build a tree with accuracy and also to avoid over fitting.
Pruning can be classified into two types: pre-pruning and
post-pruning. In pre-pruning, the branches are pruned
during the induction process if there are no possible splits
found. In post-pruning, the tree is built completely followed
by a bottom-up approach to determine which branches are
not beneficial. Pruning is the process to trim unnecessary
branches to improve accuracy of model. Thustreeisinduced
using matching examplesonly. The pruning process isuseto
give compact representation of tree i.e. it contains small
number of attributes. It also avoids over fitting and improve
the time complexity.
In our proposed system we are using pre-pruning
processto reduce the time complexity. Thedecisionwhether
to prune the branch taken once best splitting attribute is
chosen. We propose MLE for our system. In Maximum
Likelihood Estimation (MLE), a MLE score is calculated for
each of splitting attribute. If none of the candidate attribute
achieve MLE score greater than current node then branch is
pruned and current node becomes leaf node.
6. IMPLEMENTATION DETAILS
6.1 ONE-TO-MANY DATA LINKAGE PROCESS
A characteristic record linkage trouble consists of two data
tables that do not split an ordinary identifier. Consider two
tables Table1 and Table2 as an example. The first table is
viewer table and second one as movie table. Here A and B
are taken as attribute for table T1 and T2. The goal is to
match the records of the table T1 and T2. Generally here we
define the same entity for the both table. Here each dataset
of the both table match each other. here we proposed an
many-to-many linkage model also. In many-to-many record
linkage the dataset from different tables match with the
entity of different tables.
Here the problem is define as |TA| x |TB|. The
advance indexing technique can used for an efficient linkage
of records.TAB is denoted for matching records and TAB is
denoted for non-matching records. The objective of the
algorithm is to accurately recognize the true identical pairs
(true positive) and identify non-identical pairs (False
Positives). Each possible recordspair of T1 is matchwiththe
matching records of T2. Every probable records pair are
allot a value that describe the probability of two identical
tables. For Each class one probability value should be
provides. A threshold value has to be declare from the
starting. If the value surpasses the define threshold, the
records measure as true match or link otherwise it declare
as non-match.
For implementation purpose we create customer
database of an organization are shared with a business
partner. Therefore, the simulated data include requests for
customer records of an organization ,submitted by a
business partner of the organization.
6.2 THE PROCESS FLOW OF THE PROPOSED SYSTEM:-
 First the user dataset and the server log dataset
for linkage process is taken into account.
 Then preprocessboth the data by removingthenull
values or missing values and those attributes that
have error data.
 After preprocessing, create a training table based
on both the server log and user dataset. Fortraining
Inputdataset(Cust
omer&contextdata
)
preprocessing Splitting
criteria(ML
E
prepruningOutput(clu
stering
tree)
Database
International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056
Volume: 05 Issue: 06 | June 2018 www.irjet.net p-ISSN: 2395-0072
© 2018, IRJET | Impact Factor value: 7.211 | ISO 9001:2008 Certified Journal | Page 2357
table creation, consider the elements such as
requesting location, requestingday,requestingtime
from the first database and in the other table
consider the elements such asuser location and the
business type.
 Then measure the similarity score based on the
elements of requesting location, requesting day,
requesting time with user location and the business
type by using the Maximum likelihood estimation
technique.
 Then apply the pre pruning method to remove the
unnecessary branches that are not used
 A total probability value (sum all possible values of
an attribute) is calculated which is compared with
the probability calculated for each value of an
attribute and that is used in the classification of the
users as the common user and the malicious users.
 At last, the performance of the existing and the
proposed system is evaluated in terms of execution
time and the accuracy of the result.
6.3 PREPARING THE INPUT
The training and also the testing input data for the OCCT
algorithm should be a single dataset TAB which contains
only matching instancesfrom TA ⊆ TB. Each record, r, of the
dataset is actually a pair of records (r(a), r(b)), one from the
table TA and one from the table TB, so that: r = (r(a); r(b)) Є
TAB ⊆ TA × TB. In the current version of the
implementation, due to circumstances, we require all the
attributes of TA to appear in the input dataset, followed by
all the attributes of TB. By this way, the attributes of TB
(City) can be pointed by a single index which refers to the
first attribute of TB in the input dataset.
Given a record a -> TA, a leaf of the tree is extracted
by following the values of the record a to the correct path of
the tree. Each leaf of the tree representsa set ofrecordsfrom
the second table (TB) which are the most likely to be linked
with record a. In order to represent this set in acompactway
and avoid overfitting, a set of probabilistic modelsisinduced
for each leaf. Each model is used for deriving the probability
of a value of some attribute from table TB, given all other
attributes from that table.
In our implementation there is no need to save modelsforall
possible attributes of TB. Thus, a feature selection processis
executed on the leaf dataset in order to choose the attributes
that will be represented by the leafs. In our implementation
it needs only examples of matching pairs in order to learn
and build the model. So this leads to much more accuracy as
only matching pairs are train so it accurately choose
matching pairs. This feature is important since in many
domains it is difficult to obtain nonmatching examples.
6.4 RESULT AND DISSCUSSION
I. First consider the customer dataset and the context
dataset for linkage process. Then preprocess both
the data by removing the null valuesand error data.
After preprocessing, create a trainingtablebasedon
both the context and customer dataset. For training
table creation, consider the attributes such as
customer id, requesting location, requesting day,
requesting time from the first database and in the
other table consider the attributes such as user
location and the business type.
Fig 2:- Customer dataset for database misuse domain
II. After applying splitting criteria,it gives result of
four splitting methods.for implementation we
consider MLE as a splitting method.
International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056
Volume: 05 Issue: 06 | June 2018 www.irjet.net p-ISSN: 2395-0072
© 2018, IRJET | Impact Factor value: 7.211 | ISO 9001:2008 Certified Journal | Page 2358
Fig 3:-Using different splitting criteria result shows the tree size and leaves for dataset database misuse
III. Now result of pruning method.we use prepruning
approach.
Fig 4:- After applying pruning method it shows tree size and
leaves
IV. Precision and Recall value comparison for
different methods
Fig 5: Precision and Recall for Database Misuse domain
7. CONCLUSIONS
A new method is proposed to link the data that do not have
the common elements. A tree is constructed to accomplish
the one-to-many record linkage and the database users are
classified as normal and abnormal user. Improved efficiency
is attained by means of the time complexity and accuracy of
the record linkage process.
REFERENCES
[1] ma’ayan dror, asaf shabtai, lior rokach, and yuval
elovici, “occt: a one-class clustering tree for
implementing one-to-many data linkage”, ieee
transactions on knowledge and data engineering,vol.
26, no. 3, march 2014.
[2] I.p. fellegi and a.b. sunter, “a theory for record
linkage,” j. am. statistical soc., vol. 64, no. 328, pp.
1183-1210, dec. 1969.
[3] m. yakout, a.k. elmagarmid, h. elmeleegy, m.
quzzani, and a. qi, “behavior based record linkage”,
proc. vldb endowment, vol. 3, nos. 1/2, pp. 439-448,
2010.
[4] f. de comite´, f. denis, r. gilleron, and f. letouzey,
“positive and unlabeled examples help learning”,
proc. 10th int’l conf. algorithmic learning theory,pp.
219-230, 1999.
[5] a.j.storkey, c.k.i.williams, e.taylorandr.g.mann“an
expectation maximisation algorithm for one-to-
many record linkage,” university of edinburgh
informatics research report, 2005
[6] S.Ivie, G.Henry, H.Gatrell and C.Giraud-Carrier, “A
Metric Based Machine Learning Approachto Genea-
Logical Record Linkage,” in Proc. of the 7th Annual
Workshop on Technology for Family History and
Genealogical Research, 2007.
[7] P.Christen and K.Goiser, “Towards AutomatedData
Linkage and Deduplication,” Australian National
University, Technical Report, 2005.
[8] A. Gershman et al., “A Decision Tree Based
Recommender System,” Proc. 10th Int’l Conf.
Innovative Internet Community Services, pp. 170-
179, 2010.
International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056
Volume: 05 Issue: 06 | June 2018 www.irjet.net p-ISSN: 2395-0072
© 2018, IRJET | Impact Factor value: 7.211 | ISO 9001:2008 Certified Journal | Page 2359
[9] L. Gu and R. Baxter, “Decision Models for
RecordLinkage,” DataMining, vol. 3755, pp. 146-
160, 2006.
[10] O. Benjelloun, H. Garcia, D. Menestrina, Q. Su,
S.Whang, and J.Widom, “Swoosh: A Generic
Approach toEntity Resolution,” TheVLDB J., vol.
18, no. 1, pp. 255-276, 2009.
[11] G.Henry, S.Ivie, H.Gatrell and C.Giraud-Carrier, “A
Metric Based Machine Learning Approach to Genea-
Logical Record Linkage,” in Proc. of the 7th Annual
Workshop on Technology for Family History and
Genealogical Research, 2007.

More Related Content

What's hot

AN IMPROVED TECHNIQUE FOR DOCUMENT CLUSTERING
AN IMPROVED TECHNIQUE FOR DOCUMENT CLUSTERINGAN IMPROVED TECHNIQUE FOR DOCUMENT CLUSTERING
AN IMPROVED TECHNIQUE FOR DOCUMENT CLUSTERING
International Journal of Technical Research & Application
 
Indexing based Genetic Programming Approach to Record Deduplication
Indexing based Genetic Programming Approach to Record DeduplicationIndexing based Genetic Programming Approach to Record Deduplication
Indexing based Genetic Programming Approach to Record Deduplication
idescitation
 
A statistical data fusion technique in virtual data integration environment
A statistical data fusion technique in virtual data integration environmentA statistical data fusion technique in virtual data integration environment
A statistical data fusion technique in virtual data integration environment
IJDKP
 
G045033841
G045033841G045033841
G045033841
IJERA Editor
 
Cross Domain Data Fusion
Cross Domain Data FusionCross Domain Data Fusion
Cross Domain Data Fusion
IRJET Journal
 
IRJET- Deduplication Detection for Similarity in Document Analysis Via Vector...
IRJET- Deduplication Detection for Similarity in Document Analysis Via Vector...IRJET- Deduplication Detection for Similarity in Document Analysis Via Vector...
IRJET- Deduplication Detection for Similarity in Document Analysis Via Vector...
IRJET Journal
 
INTEGRATED ASSOCIATIVE CLASSIFICATION AND NEURAL NETWORK MODEL ENHANCED BY US...
INTEGRATED ASSOCIATIVE CLASSIFICATION AND NEURAL NETWORK MODEL ENHANCED BY US...INTEGRATED ASSOCIATIVE CLASSIFICATION AND NEURAL NETWORK MODEL ENHANCED BY US...
INTEGRATED ASSOCIATIVE CLASSIFICATION AND NEURAL NETWORK MODEL ENHANCED BY US...
IJDKP
 
Granularity analysis of classification and estimation for complex datasets wi...
Granularity analysis of classification and estimation for complex datasets wi...Granularity analysis of classification and estimation for complex datasets wi...
Granularity analysis of classification and estimation for complex datasets wi...
IJECEIAES
 
Recommendation system using bloom filter in mapreduce
Recommendation system using bloom filter in mapreduceRecommendation system using bloom filter in mapreduce
Recommendation system using bloom filter in mapreduce
IJDKP
 
Data Mining Classification Comparison (Naïve Bayes and C4.5 Algorithms)
Data Mining Classification Comparison (Naïve Bayes and C4.5 Algorithms)Data Mining Classification Comparison (Naïve Bayes and C4.5 Algorithms)
Data Mining Classification Comparison (Naïve Bayes and C4.5 Algorithms)
Universitas Pembangunan Panca Budi
 
LINK MINING PROCESS
LINK MINING PROCESSLINK MINING PROCESS
LINK MINING PROCESS
IJDKP
 
PATTERN GENERATION FOR COMPLEX DATA USING HYBRID MINING
PATTERN GENERATION FOR COMPLEX DATA USING HYBRID MININGPATTERN GENERATION FOR COMPLEX DATA USING HYBRID MINING
PATTERN GENERATION FOR COMPLEX DATA USING HYBRID MINING
IJDKP
 
A Robust Keywords Based Document Retrieval by Utilizing Advanced Encryption S...
A Robust Keywords Based Document Retrieval by Utilizing Advanced Encryption S...A Robust Keywords Based Document Retrieval by Utilizing Advanced Encryption S...
A Robust Keywords Based Document Retrieval by Utilizing Advanced Encryption S...
IRJET Journal
 
Applying K-Means Clustering Algorithm to Discover Knowledge from Insurance Da...
Applying K-Means Clustering Algorithm to Discover Knowledge from Insurance Da...Applying K-Means Clustering Algorithm to Discover Knowledge from Insurance Da...
Applying K-Means Clustering Algorithm to Discover Knowledge from Insurance Da...
theijes
 
Ijmet 10 02_050
Ijmet 10 02_050Ijmet 10 02_050
Ijmet 10 02_050
IAEME Publication
 
Frequent Item set Mining of Big Data for Social Media
Frequent Item set Mining of Big Data for Social MediaFrequent Item set Mining of Big Data for Social Media
Frequent Item set Mining of Big Data for Social Media
IJERA Editor
 
Interpreting the Semantics of Anomalies Based on Mutual Information in Link M...
Interpreting the Semantics of Anomalies Based on Mutual Information in Link M...Interpreting the Semantics of Anomalies Based on Mutual Information in Link M...
Interpreting the Semantics of Anomalies Based on Mutual Information in Link M...
ijdms
 
11.software modules clustering an effective approach for reusability
11.software modules clustering an effective approach for  reusability11.software modules clustering an effective approach for  reusability
11.software modules clustering an effective approach for reusability
Alexander Decker
 

What's hot (18)

AN IMPROVED TECHNIQUE FOR DOCUMENT CLUSTERING
AN IMPROVED TECHNIQUE FOR DOCUMENT CLUSTERINGAN IMPROVED TECHNIQUE FOR DOCUMENT CLUSTERING
AN IMPROVED TECHNIQUE FOR DOCUMENT CLUSTERING
 
Indexing based Genetic Programming Approach to Record Deduplication
Indexing based Genetic Programming Approach to Record DeduplicationIndexing based Genetic Programming Approach to Record Deduplication
Indexing based Genetic Programming Approach to Record Deduplication
 
A statistical data fusion technique in virtual data integration environment
A statistical data fusion technique in virtual data integration environmentA statistical data fusion technique in virtual data integration environment
A statistical data fusion technique in virtual data integration environment
 
G045033841
G045033841G045033841
G045033841
 
Cross Domain Data Fusion
Cross Domain Data FusionCross Domain Data Fusion
Cross Domain Data Fusion
 
IRJET- Deduplication Detection for Similarity in Document Analysis Via Vector...
IRJET- Deduplication Detection for Similarity in Document Analysis Via Vector...IRJET- Deduplication Detection for Similarity in Document Analysis Via Vector...
IRJET- Deduplication Detection for Similarity in Document Analysis Via Vector...
 
INTEGRATED ASSOCIATIVE CLASSIFICATION AND NEURAL NETWORK MODEL ENHANCED BY US...
INTEGRATED ASSOCIATIVE CLASSIFICATION AND NEURAL NETWORK MODEL ENHANCED BY US...INTEGRATED ASSOCIATIVE CLASSIFICATION AND NEURAL NETWORK MODEL ENHANCED BY US...
INTEGRATED ASSOCIATIVE CLASSIFICATION AND NEURAL NETWORK MODEL ENHANCED BY US...
 
Granularity analysis of classification and estimation for complex datasets wi...
Granularity analysis of classification and estimation for complex datasets wi...Granularity analysis of classification and estimation for complex datasets wi...
Granularity analysis of classification and estimation for complex datasets wi...
 
Recommendation system using bloom filter in mapreduce
Recommendation system using bloom filter in mapreduceRecommendation system using bloom filter in mapreduce
Recommendation system using bloom filter in mapreduce
 
Data Mining Classification Comparison (Naïve Bayes and C4.5 Algorithms)
Data Mining Classification Comparison (Naïve Bayes and C4.5 Algorithms)Data Mining Classification Comparison (Naïve Bayes and C4.5 Algorithms)
Data Mining Classification Comparison (Naïve Bayes and C4.5 Algorithms)
 
LINK MINING PROCESS
LINK MINING PROCESSLINK MINING PROCESS
LINK MINING PROCESS
 
PATTERN GENERATION FOR COMPLEX DATA USING HYBRID MINING
PATTERN GENERATION FOR COMPLEX DATA USING HYBRID MININGPATTERN GENERATION FOR COMPLEX DATA USING HYBRID MINING
PATTERN GENERATION FOR COMPLEX DATA USING HYBRID MINING
 
A Robust Keywords Based Document Retrieval by Utilizing Advanced Encryption S...
A Robust Keywords Based Document Retrieval by Utilizing Advanced Encryption S...A Robust Keywords Based Document Retrieval by Utilizing Advanced Encryption S...
A Robust Keywords Based Document Retrieval by Utilizing Advanced Encryption S...
 
Applying K-Means Clustering Algorithm to Discover Knowledge from Insurance Da...
Applying K-Means Clustering Algorithm to Discover Knowledge from Insurance Da...Applying K-Means Clustering Algorithm to Discover Knowledge from Insurance Da...
Applying K-Means Clustering Algorithm to Discover Knowledge from Insurance Da...
 
Ijmet 10 02_050
Ijmet 10 02_050Ijmet 10 02_050
Ijmet 10 02_050
 
Frequent Item set Mining of Big Data for Social Media
Frequent Item set Mining of Big Data for Social MediaFrequent Item set Mining of Big Data for Social Media
Frequent Item set Mining of Big Data for Social Media
 
Interpreting the Semantics of Anomalies Based on Mutual Information in Link M...
Interpreting the Semantics of Anomalies Based on Mutual Information in Link M...Interpreting the Semantics of Anomalies Based on Mutual Information in Link M...
Interpreting the Semantics of Anomalies Based on Mutual Information in Link M...
 
11.software modules clustering an effective approach for reusability
11.software modules clustering an effective approach for  reusability11.software modules clustering an effective approach for  reusability
11.software modules clustering an effective approach for reusability
 

Similar to IRJET-Efficient Data Linkage Technique using one Class Clustering Tree for Database Misuse Domain

Iaetsd a survey on one class clustering
Iaetsd a survey on one class clusteringIaetsd a survey on one class clustering
Iaetsd a survey on one class clustering
Iaetsd Iaetsd
 
Implementation of Matching Tree Technique for Online Record Linkage
Implementation of Matching Tree Technique for Online Record LinkageImplementation of Matching Tree Technique for Online Record Linkage
Implementation of Matching Tree Technique for Online Record Linkage
IOSR Journals
 
A PROCESS OF LINK MINING
A PROCESS OF LINK MININGA PROCESS OF LINK MINING
A PROCESS OF LINK MINING
csandit
 
Privacy preservation techniques in data mining
Privacy preservation techniques in data miningPrivacy preservation techniques in data mining
Privacy preservation techniques in data mining
eSAT Journals
 
Privacy preservation techniques in data mining
Privacy preservation techniques in data miningPrivacy preservation techniques in data mining
Privacy preservation techniques in data mining
eSAT Publishing House
 
Asif nosql
Asif nosqlAsif nosql
Asif nosql
Asif Ali
 
A study and survey on various progressive duplicate detection mechanisms
A study and survey on various progressive duplicate detection mechanismsA study and survey on various progressive duplicate detection mechanisms
A study and survey on various progressive duplicate detection mechanisms
eSAT Journals
 
LINK MINING PROCESS
LINK MINING PROCESSLINK MINING PROCESS
LINK MINING PROCESS
IJDKP
 
Literature review of attribute level and
Literature review of attribute level andLiterature review of attribute level and
Literature review of attribute level and
IJDKP
 
Survey on Text Mining Based on Social Media Comments as Big Data Analysis Usi...
Survey on Text Mining Based on Social Media Comments as Big Data Analysis Usi...Survey on Text Mining Based on Social Media Comments as Big Data Analysis Usi...
Survey on Text Mining Based on Social Media Comments as Big Data Analysis Usi...
IJMREMJournal
 
IRJET- Diverse Approaches for Document Clustering in Product Development Anal...
IRJET- Diverse Approaches for Document Clustering in Product Development Anal...IRJET- Diverse Approaches for Document Clustering in Product Development Anal...
IRJET- Diverse Approaches for Document Clustering in Product Development Anal...
IRJET Journal
 
MAP/REDUCE DESIGN AND IMPLEMENTATION OF APRIORIALGORITHM FOR HANDLING VOLUMIN...
MAP/REDUCE DESIGN AND IMPLEMENTATION OF APRIORIALGORITHM FOR HANDLING VOLUMIN...MAP/REDUCE DESIGN AND IMPLEMENTATION OF APRIORIALGORITHM FOR HANDLING VOLUMIN...
MAP/REDUCE DESIGN AND IMPLEMENTATION OF APRIORIALGORITHM FOR HANDLING VOLUMIN...
acijjournal
 
TUPLE VALUE BASED MULTIPLICATIVE DATA PERTURBATION APPROACH TO PRESERVE PRIVA...
TUPLE VALUE BASED MULTIPLICATIVE DATA PERTURBATION APPROACH TO PRESERVE PRIVA...TUPLE VALUE BASED MULTIPLICATIVE DATA PERTURBATION APPROACH TO PRESERVE PRIVA...
TUPLE VALUE BASED MULTIPLICATIVE DATA PERTURBATION APPROACH TO PRESERVE PRIVA...
IJDKP
 
Multikeyword Hunt on Progressive Graphs
Multikeyword Hunt on Progressive GraphsMultikeyword Hunt on Progressive Graphs
Multikeyword Hunt on Progressive Graphs
IRJET Journal
 
IRJET- Text Document Clustering using K-Means Algorithm
IRJET-  	  Text Document Clustering using K-Means Algorithm IRJET-  	  Text Document Clustering using K-Means Algorithm
IRJET- Text Document Clustering using K-Means Algorithm
IRJET Journal
 
How Partitioning Clustering Technique For Implementing...
How Partitioning Clustering Technique For Implementing...How Partitioning Clustering Technique For Implementing...
How Partitioning Clustering Technique For Implementing...
Nicolle Dammann
 
CLASSIFICATION ALGORITHM USING RANDOM CONCEPT ON A VERY LARGE DATA SET: A SURVEY
CLASSIFICATION ALGORITHM USING RANDOM CONCEPT ON A VERY LARGE DATA SET: A SURVEYCLASSIFICATION ALGORITHM USING RANDOM CONCEPT ON A VERY LARGE DATA SET: A SURVEY
CLASSIFICATION ALGORITHM USING RANDOM CONCEPT ON A VERY LARGE DATA SET: A SURVEY
Editor IJMTER
 
Spe165 t
Spe165 tSpe165 t
Spe165 t
Rajesh War
 
Elimination of data redundancy before persisting into dbms using svm classifi...
Elimination of data redundancy before persisting into dbms using svm classifi...Elimination of data redundancy before persisting into dbms using svm classifi...
Elimination of data redundancy before persisting into dbms using svm classifi...
nalini manogaran
 
Bi4101343346
Bi4101343346Bi4101343346
Bi4101343346
IJERA Editor
 

Similar to IRJET-Efficient Data Linkage Technique using one Class Clustering Tree for Database Misuse Domain (20)

Iaetsd a survey on one class clustering
Iaetsd a survey on one class clusteringIaetsd a survey on one class clustering
Iaetsd a survey on one class clustering
 
Implementation of Matching Tree Technique for Online Record Linkage
Implementation of Matching Tree Technique for Online Record LinkageImplementation of Matching Tree Technique for Online Record Linkage
Implementation of Matching Tree Technique for Online Record Linkage
 
A PROCESS OF LINK MINING
A PROCESS OF LINK MININGA PROCESS OF LINK MINING
A PROCESS OF LINK MINING
 
Privacy preservation techniques in data mining
Privacy preservation techniques in data miningPrivacy preservation techniques in data mining
Privacy preservation techniques in data mining
 
Privacy preservation techniques in data mining
Privacy preservation techniques in data miningPrivacy preservation techniques in data mining
Privacy preservation techniques in data mining
 
Asif nosql
Asif nosqlAsif nosql
Asif nosql
 
A study and survey on various progressive duplicate detection mechanisms
A study and survey on various progressive duplicate detection mechanismsA study and survey on various progressive duplicate detection mechanisms
A study and survey on various progressive duplicate detection mechanisms
 
LINK MINING PROCESS
LINK MINING PROCESSLINK MINING PROCESS
LINK MINING PROCESS
 
Literature review of attribute level and
Literature review of attribute level andLiterature review of attribute level and
Literature review of attribute level and
 
Survey on Text Mining Based on Social Media Comments as Big Data Analysis Usi...
Survey on Text Mining Based on Social Media Comments as Big Data Analysis Usi...Survey on Text Mining Based on Social Media Comments as Big Data Analysis Usi...
Survey on Text Mining Based on Social Media Comments as Big Data Analysis Usi...
 
IRJET- Diverse Approaches for Document Clustering in Product Development Anal...
IRJET- Diverse Approaches for Document Clustering in Product Development Anal...IRJET- Diverse Approaches for Document Clustering in Product Development Anal...
IRJET- Diverse Approaches for Document Clustering in Product Development Anal...
 
MAP/REDUCE DESIGN AND IMPLEMENTATION OF APRIORIALGORITHM FOR HANDLING VOLUMIN...
MAP/REDUCE DESIGN AND IMPLEMENTATION OF APRIORIALGORITHM FOR HANDLING VOLUMIN...MAP/REDUCE DESIGN AND IMPLEMENTATION OF APRIORIALGORITHM FOR HANDLING VOLUMIN...
MAP/REDUCE DESIGN AND IMPLEMENTATION OF APRIORIALGORITHM FOR HANDLING VOLUMIN...
 
TUPLE VALUE BASED MULTIPLICATIVE DATA PERTURBATION APPROACH TO PRESERVE PRIVA...
TUPLE VALUE BASED MULTIPLICATIVE DATA PERTURBATION APPROACH TO PRESERVE PRIVA...TUPLE VALUE BASED MULTIPLICATIVE DATA PERTURBATION APPROACH TO PRESERVE PRIVA...
TUPLE VALUE BASED MULTIPLICATIVE DATA PERTURBATION APPROACH TO PRESERVE PRIVA...
 
Multikeyword Hunt on Progressive Graphs
Multikeyword Hunt on Progressive GraphsMultikeyword Hunt on Progressive Graphs
Multikeyword Hunt on Progressive Graphs
 
IRJET- Text Document Clustering using K-Means Algorithm
IRJET-  	  Text Document Clustering using K-Means Algorithm IRJET-  	  Text Document Clustering using K-Means Algorithm
IRJET- Text Document Clustering using K-Means Algorithm
 
How Partitioning Clustering Technique For Implementing...
How Partitioning Clustering Technique For Implementing...How Partitioning Clustering Technique For Implementing...
How Partitioning Clustering Technique For Implementing...
 
CLASSIFICATION ALGORITHM USING RANDOM CONCEPT ON A VERY LARGE DATA SET: A SURVEY
CLASSIFICATION ALGORITHM USING RANDOM CONCEPT ON A VERY LARGE DATA SET: A SURVEYCLASSIFICATION ALGORITHM USING RANDOM CONCEPT ON A VERY LARGE DATA SET: A SURVEY
CLASSIFICATION ALGORITHM USING RANDOM CONCEPT ON A VERY LARGE DATA SET: A SURVEY
 
Spe165 t
Spe165 tSpe165 t
Spe165 t
 
Elimination of data redundancy before persisting into dbms using svm classifi...
Elimination of data redundancy before persisting into dbms using svm classifi...Elimination of data redundancy before persisting into dbms using svm classifi...
Elimination of data redundancy before persisting into dbms using svm classifi...
 
Bi4101343346
Bi4101343346Bi4101343346
Bi4101343346
 

More from IRJET Journal

TUNNELING IN HIMALAYAS WITH NATM METHOD: A SPECIAL REFERENCES TO SUNGAL TUNNE...
TUNNELING IN HIMALAYAS WITH NATM METHOD: A SPECIAL REFERENCES TO SUNGAL TUNNE...TUNNELING IN HIMALAYAS WITH NATM METHOD: A SPECIAL REFERENCES TO SUNGAL TUNNE...
TUNNELING IN HIMALAYAS WITH NATM METHOD: A SPECIAL REFERENCES TO SUNGAL TUNNE...
IRJET Journal
 
STUDY THE EFFECT OF RESPONSE REDUCTION FACTOR ON RC FRAMED STRUCTURE
STUDY THE EFFECT OF RESPONSE REDUCTION FACTOR ON RC FRAMED STRUCTURESTUDY THE EFFECT OF RESPONSE REDUCTION FACTOR ON RC FRAMED STRUCTURE
STUDY THE EFFECT OF RESPONSE REDUCTION FACTOR ON RC FRAMED STRUCTURE
IRJET Journal
 
A COMPARATIVE ANALYSIS OF RCC ELEMENT OF SLAB WITH STARK STEEL (HYSD STEEL) A...
A COMPARATIVE ANALYSIS OF RCC ELEMENT OF SLAB WITH STARK STEEL (HYSD STEEL) A...A COMPARATIVE ANALYSIS OF RCC ELEMENT OF SLAB WITH STARK STEEL (HYSD STEEL) A...
A COMPARATIVE ANALYSIS OF RCC ELEMENT OF SLAB WITH STARK STEEL (HYSD STEEL) A...
IRJET Journal
 
Effect of Camber and Angles of Attack on Airfoil Characteristics
Effect of Camber and Angles of Attack on Airfoil CharacteristicsEffect of Camber and Angles of Attack on Airfoil Characteristics
Effect of Camber and Angles of Attack on Airfoil Characteristics
IRJET Journal
 
A Review on the Progress and Challenges of Aluminum-Based Metal Matrix Compos...
A Review on the Progress and Challenges of Aluminum-Based Metal Matrix Compos...A Review on the Progress and Challenges of Aluminum-Based Metal Matrix Compos...
A Review on the Progress and Challenges of Aluminum-Based Metal Matrix Compos...
IRJET Journal
 
Dynamic Urban Transit Optimization: A Graph Neural Network Approach for Real-...
Dynamic Urban Transit Optimization: A Graph Neural Network Approach for Real-...Dynamic Urban Transit Optimization: A Graph Neural Network Approach for Real-...
Dynamic Urban Transit Optimization: A Graph Neural Network Approach for Real-...
IRJET Journal
 
Structural Analysis and Design of Multi-Storey Symmetric and Asymmetric Shape...
Structural Analysis and Design of Multi-Storey Symmetric and Asymmetric Shape...Structural Analysis and Design of Multi-Storey Symmetric and Asymmetric Shape...
Structural Analysis and Design of Multi-Storey Symmetric and Asymmetric Shape...
IRJET Journal
 
A Review of “Seismic Response of RC Structures Having Plan and Vertical Irreg...
A Review of “Seismic Response of RC Structures Having Plan and Vertical Irreg...A Review of “Seismic Response of RC Structures Having Plan and Vertical Irreg...
A Review of “Seismic Response of RC Structures Having Plan and Vertical Irreg...
IRJET Journal
 
A REVIEW ON MACHINE LEARNING IN ADAS
A REVIEW ON MACHINE LEARNING IN ADASA REVIEW ON MACHINE LEARNING IN ADAS
A REVIEW ON MACHINE LEARNING IN ADAS
IRJET Journal
 
Long Term Trend Analysis of Precipitation and Temperature for Asosa district,...
Long Term Trend Analysis of Precipitation and Temperature for Asosa district,...Long Term Trend Analysis of Precipitation and Temperature for Asosa district,...
Long Term Trend Analysis of Precipitation and Temperature for Asosa district,...
IRJET Journal
 
P.E.B. Framed Structure Design and Analysis Using STAAD Pro
P.E.B. Framed Structure Design and Analysis Using STAAD ProP.E.B. Framed Structure Design and Analysis Using STAAD Pro
P.E.B. Framed Structure Design and Analysis Using STAAD Pro
IRJET Journal
 
A Review on Innovative Fiber Integration for Enhanced Reinforcement of Concre...
A Review on Innovative Fiber Integration for Enhanced Reinforcement of Concre...A Review on Innovative Fiber Integration for Enhanced Reinforcement of Concre...
A Review on Innovative Fiber Integration for Enhanced Reinforcement of Concre...
IRJET Journal
 
Survey Paper on Cloud-Based Secured Healthcare System
Survey Paper on Cloud-Based Secured Healthcare SystemSurvey Paper on Cloud-Based Secured Healthcare System
Survey Paper on Cloud-Based Secured Healthcare System
IRJET Journal
 
Review on studies and research on widening of existing concrete bridges
Review on studies and research on widening of existing concrete bridgesReview on studies and research on widening of existing concrete bridges
Review on studies and research on widening of existing concrete bridges
IRJET Journal
 
React based fullstack edtech web application
React based fullstack edtech web applicationReact based fullstack edtech web application
React based fullstack edtech web application
IRJET Journal
 
A Comprehensive Review of Integrating IoT and Blockchain Technologies in the ...
A Comprehensive Review of Integrating IoT and Blockchain Technologies in the ...A Comprehensive Review of Integrating IoT and Blockchain Technologies in the ...
A Comprehensive Review of Integrating IoT and Blockchain Technologies in the ...
IRJET Journal
 
A REVIEW ON THE PERFORMANCE OF COCONUT FIBRE REINFORCED CONCRETE.
A REVIEW ON THE PERFORMANCE OF COCONUT FIBRE REINFORCED CONCRETE.A REVIEW ON THE PERFORMANCE OF COCONUT FIBRE REINFORCED CONCRETE.
A REVIEW ON THE PERFORMANCE OF COCONUT FIBRE REINFORCED CONCRETE.
IRJET Journal
 
Optimizing Business Management Process Workflows: The Dynamic Influence of Mi...
Optimizing Business Management Process Workflows: The Dynamic Influence of Mi...Optimizing Business Management Process Workflows: The Dynamic Influence of Mi...
Optimizing Business Management Process Workflows: The Dynamic Influence of Mi...
IRJET Journal
 
Multistoried and Multi Bay Steel Building Frame by using Seismic Design
Multistoried and Multi Bay Steel Building Frame by using Seismic DesignMultistoried and Multi Bay Steel Building Frame by using Seismic Design
Multistoried and Multi Bay Steel Building Frame by using Seismic Design
IRJET Journal
 
Cost Optimization of Construction Using Plastic Waste as a Sustainable Constr...
Cost Optimization of Construction Using Plastic Waste as a Sustainable Constr...Cost Optimization of Construction Using Plastic Waste as a Sustainable Constr...
Cost Optimization of Construction Using Plastic Waste as a Sustainable Constr...
IRJET Journal
 

More from IRJET Journal (20)

TUNNELING IN HIMALAYAS WITH NATM METHOD: A SPECIAL REFERENCES TO SUNGAL TUNNE...
TUNNELING IN HIMALAYAS WITH NATM METHOD: A SPECIAL REFERENCES TO SUNGAL TUNNE...TUNNELING IN HIMALAYAS WITH NATM METHOD: A SPECIAL REFERENCES TO SUNGAL TUNNE...
TUNNELING IN HIMALAYAS WITH NATM METHOD: A SPECIAL REFERENCES TO SUNGAL TUNNE...
 
STUDY THE EFFECT OF RESPONSE REDUCTION FACTOR ON RC FRAMED STRUCTURE
STUDY THE EFFECT OF RESPONSE REDUCTION FACTOR ON RC FRAMED STRUCTURESTUDY THE EFFECT OF RESPONSE REDUCTION FACTOR ON RC FRAMED STRUCTURE
STUDY THE EFFECT OF RESPONSE REDUCTION FACTOR ON RC FRAMED STRUCTURE
 
A COMPARATIVE ANALYSIS OF RCC ELEMENT OF SLAB WITH STARK STEEL (HYSD STEEL) A...
A COMPARATIVE ANALYSIS OF RCC ELEMENT OF SLAB WITH STARK STEEL (HYSD STEEL) A...A COMPARATIVE ANALYSIS OF RCC ELEMENT OF SLAB WITH STARK STEEL (HYSD STEEL) A...
A COMPARATIVE ANALYSIS OF RCC ELEMENT OF SLAB WITH STARK STEEL (HYSD STEEL) A...
 
Effect of Camber and Angles of Attack on Airfoil Characteristics
Effect of Camber and Angles of Attack on Airfoil CharacteristicsEffect of Camber and Angles of Attack on Airfoil Characteristics
Effect of Camber and Angles of Attack on Airfoil Characteristics
 
A Review on the Progress and Challenges of Aluminum-Based Metal Matrix Compos...
A Review on the Progress and Challenges of Aluminum-Based Metal Matrix Compos...A Review on the Progress and Challenges of Aluminum-Based Metal Matrix Compos...
A Review on the Progress and Challenges of Aluminum-Based Metal Matrix Compos...
 
Dynamic Urban Transit Optimization: A Graph Neural Network Approach for Real-...
Dynamic Urban Transit Optimization: A Graph Neural Network Approach for Real-...Dynamic Urban Transit Optimization: A Graph Neural Network Approach for Real-...
Dynamic Urban Transit Optimization: A Graph Neural Network Approach for Real-...
 
Structural Analysis and Design of Multi-Storey Symmetric and Asymmetric Shape...
Structural Analysis and Design of Multi-Storey Symmetric and Asymmetric Shape...Structural Analysis and Design of Multi-Storey Symmetric and Asymmetric Shape...
Structural Analysis and Design of Multi-Storey Symmetric and Asymmetric Shape...
 
A Review of “Seismic Response of RC Structures Having Plan and Vertical Irreg...
A Review of “Seismic Response of RC Structures Having Plan and Vertical Irreg...A Review of “Seismic Response of RC Structures Having Plan and Vertical Irreg...
A Review of “Seismic Response of RC Structures Having Plan and Vertical Irreg...
 
A REVIEW ON MACHINE LEARNING IN ADAS
A REVIEW ON MACHINE LEARNING IN ADASA REVIEW ON MACHINE LEARNING IN ADAS
A REVIEW ON MACHINE LEARNING IN ADAS
 
Long Term Trend Analysis of Precipitation and Temperature for Asosa district,...
Long Term Trend Analysis of Precipitation and Temperature for Asosa district,...Long Term Trend Analysis of Precipitation and Temperature for Asosa district,...
Long Term Trend Analysis of Precipitation and Temperature for Asosa district,...
 
P.E.B. Framed Structure Design and Analysis Using STAAD Pro
P.E.B. Framed Structure Design and Analysis Using STAAD ProP.E.B. Framed Structure Design and Analysis Using STAAD Pro
P.E.B. Framed Structure Design and Analysis Using STAAD Pro
 
A Review on Innovative Fiber Integration for Enhanced Reinforcement of Concre...
A Review on Innovative Fiber Integration for Enhanced Reinforcement of Concre...A Review on Innovative Fiber Integration for Enhanced Reinforcement of Concre...
A Review on Innovative Fiber Integration for Enhanced Reinforcement of Concre...
 
Survey Paper on Cloud-Based Secured Healthcare System
Survey Paper on Cloud-Based Secured Healthcare SystemSurvey Paper on Cloud-Based Secured Healthcare System
Survey Paper on Cloud-Based Secured Healthcare System
 
Review on studies and research on widening of existing concrete bridges
Review on studies and research on widening of existing concrete bridgesReview on studies and research on widening of existing concrete bridges
Review on studies and research on widening of existing concrete bridges
 
React based fullstack edtech web application
React based fullstack edtech web applicationReact based fullstack edtech web application
React based fullstack edtech web application
 
A Comprehensive Review of Integrating IoT and Blockchain Technologies in the ...
A Comprehensive Review of Integrating IoT and Blockchain Technologies in the ...A Comprehensive Review of Integrating IoT and Blockchain Technologies in the ...
A Comprehensive Review of Integrating IoT and Blockchain Technologies in the ...
 
A REVIEW ON THE PERFORMANCE OF COCONUT FIBRE REINFORCED CONCRETE.
A REVIEW ON THE PERFORMANCE OF COCONUT FIBRE REINFORCED CONCRETE.A REVIEW ON THE PERFORMANCE OF COCONUT FIBRE REINFORCED CONCRETE.
A REVIEW ON THE PERFORMANCE OF COCONUT FIBRE REINFORCED CONCRETE.
 
Optimizing Business Management Process Workflows: The Dynamic Influence of Mi...
Optimizing Business Management Process Workflows: The Dynamic Influence of Mi...Optimizing Business Management Process Workflows: The Dynamic Influence of Mi...
Optimizing Business Management Process Workflows: The Dynamic Influence of Mi...
 
Multistoried and Multi Bay Steel Building Frame by using Seismic Design
Multistoried and Multi Bay Steel Building Frame by using Seismic DesignMultistoried and Multi Bay Steel Building Frame by using Seismic Design
Multistoried and Multi Bay Steel Building Frame by using Seismic Design
 
Cost Optimization of Construction Using Plastic Waste as a Sustainable Constr...
Cost Optimization of Construction Using Plastic Waste as a Sustainable Constr...Cost Optimization of Construction Using Plastic Waste as a Sustainable Constr...
Cost Optimization of Construction Using Plastic Waste as a Sustainable Constr...
 

Recently uploaded

Flow Through Pipe: the analysis of fluid flow within pipes
Flow Through Pipe:  the analysis of fluid flow within pipesFlow Through Pipe:  the analysis of fluid flow within pipes
Flow Through Pipe: the analysis of fluid flow within pipes
Indrajeet sahu
 
AI + Data Community Tour - Build the Next Generation of Apps with the Einstei...
AI + Data Community Tour - Build the Next Generation of Apps with the Einstei...AI + Data Community Tour - Build the Next Generation of Apps with the Einstei...
AI + Data Community Tour - Build the Next Generation of Apps with the Einstei...
Paris Salesforce Developer Group
 
Accident detection system project report.pdf
Accident detection system project report.pdfAccident detection system project report.pdf
Accident detection system project report.pdf
Kamal Acharya
 
Impartiality as per ISO /IEC 17025:2017 Standard
Impartiality as per ISO /IEC 17025:2017 StandardImpartiality as per ISO /IEC 17025:2017 Standard
Impartiality as per ISO /IEC 17025:2017 Standard
MuhammadJazib15
 
paper relate Chozhavendhan et al. 2020.pdf
paper relate Chozhavendhan et al. 2020.pdfpaper relate Chozhavendhan et al. 2020.pdf
paper relate Chozhavendhan et al. 2020.pdf
ShurooqTaib
 
一比一原版(USF毕业证)旧金山大学毕业证如何办理
一比一原版(USF毕业证)旧金山大学毕业证如何办理一比一原版(USF毕业证)旧金山大学毕业证如何办理
一比一原版(USF毕业证)旧金山大学毕业证如何办理
uqyfuc
 
Particle Swarm Optimization–Long Short-Term Memory based Channel Estimation w...
Particle Swarm Optimization–Long Short-Term Memory based Channel Estimation w...Particle Swarm Optimization–Long Short-Term Memory based Channel Estimation w...
Particle Swarm Optimization–Long Short-Term Memory based Channel Estimation w...
IJCNCJournal
 
Sri Guru Hargobind Ji - Bandi Chor Guru.pdf
Sri Guru Hargobind Ji - Bandi Chor Guru.pdfSri Guru Hargobind Ji - Bandi Chor Guru.pdf
Sri Guru Hargobind Ji - Bandi Chor Guru.pdf
Balvir Singh
 
ITSM Integration with MuleSoft.pptx
ITSM  Integration with MuleSoft.pptxITSM  Integration with MuleSoft.pptx
ITSM Integration with MuleSoft.pptx
VANDANAMOHANGOUDA
 
Blood finder application project report (1).pdf
Blood finder application project report (1).pdfBlood finder application project report (1).pdf
Blood finder application project report (1).pdf
Kamal Acharya
 
一比一原版(爱大毕业证书)爱荷华大学毕业证如何办理
一比一原版(爱大毕业证书)爱荷华大学毕业证如何办理一比一原版(爱大毕业证书)爱荷华大学毕业证如何办理
一比一原版(爱大毕业证书)爱荷华大学毕业证如何办理
nedcocy
 
Introduction to Computer Networks & OSI MODEL.ppt
Introduction to Computer Networks & OSI MODEL.pptIntroduction to Computer Networks & OSI MODEL.ppt
Introduction to Computer Networks & OSI MODEL.ppt
Dwarkadas J Sanghvi College of Engineering
 
一比一原版(uofo毕业证书)美国俄勒冈大学毕业证如何办理
一比一原版(uofo毕业证书)美国俄勒冈大学毕业证如何办理一比一原版(uofo毕业证书)美国俄勒冈大学毕业证如何办理
一比一原版(uofo毕业证书)美国俄勒冈大学毕业证如何办理
upoux
 
Sachpazis_Consolidation Settlement Calculation Program-The Python Code and th...
Sachpazis_Consolidation Settlement Calculation Program-The Python Code and th...Sachpazis_Consolidation Settlement Calculation Program-The Python Code and th...
Sachpazis_Consolidation Settlement Calculation Program-The Python Code and th...
Dr.Costas Sachpazis
 
A high-Speed Communication System is based on the Design of a Bi-NoC Router, ...
A high-Speed Communication System is based on the Design of a Bi-NoC Router, ...A high-Speed Communication System is based on the Design of a Bi-NoC Router, ...
A high-Speed Communication System is based on the Design of a Bi-NoC Router, ...
DharmaBanothu
 
FULL STACK PROGRAMMING - Both Front End and Back End
FULL STACK PROGRAMMING - Both Front End and Back EndFULL STACK PROGRAMMING - Both Front End and Back End
FULL STACK PROGRAMMING - Both Front End and Back End
PreethaV16
 
AN INTRODUCTION OF AI & SEARCHING TECHIQUES
AN INTRODUCTION OF AI & SEARCHING TECHIQUESAN INTRODUCTION OF AI & SEARCHING TECHIQUES
AN INTRODUCTION OF AI & SEARCHING TECHIQUES
drshikhapandey2022
 
SELENIUM CONF -PALLAVI SHARMA - 2024.pdf
SELENIUM CONF -PALLAVI SHARMA - 2024.pdfSELENIUM CONF -PALLAVI SHARMA - 2024.pdf
SELENIUM CONF -PALLAVI SHARMA - 2024.pdf
Pallavi Sharma
 
This study Examines the Effectiveness of Talent Procurement through the Imple...
This study Examines the Effectiveness of Talent Procurement through the Imple...This study Examines the Effectiveness of Talent Procurement through the Imple...
This study Examines the Effectiveness of Talent Procurement through the Imple...
DharmaBanothu
 
Open Channel Flow: fluid flow with a free surface
Open Channel Flow: fluid flow with a free surfaceOpen Channel Flow: fluid flow with a free surface
Open Channel Flow: fluid flow with a free surface
Indrajeet sahu
 

Recently uploaded (20)

Flow Through Pipe: the analysis of fluid flow within pipes
Flow Through Pipe:  the analysis of fluid flow within pipesFlow Through Pipe:  the analysis of fluid flow within pipes
Flow Through Pipe: the analysis of fluid flow within pipes
 
AI + Data Community Tour - Build the Next Generation of Apps with the Einstei...
AI + Data Community Tour - Build the Next Generation of Apps with the Einstei...AI + Data Community Tour - Build the Next Generation of Apps with the Einstei...
AI + Data Community Tour - Build the Next Generation of Apps with the Einstei...
 
Accident detection system project report.pdf
Accident detection system project report.pdfAccident detection system project report.pdf
Accident detection system project report.pdf
 
Impartiality as per ISO /IEC 17025:2017 Standard
Impartiality as per ISO /IEC 17025:2017 StandardImpartiality as per ISO /IEC 17025:2017 Standard
Impartiality as per ISO /IEC 17025:2017 Standard
 
paper relate Chozhavendhan et al. 2020.pdf
paper relate Chozhavendhan et al. 2020.pdfpaper relate Chozhavendhan et al. 2020.pdf
paper relate Chozhavendhan et al. 2020.pdf
 
一比一原版(USF毕业证)旧金山大学毕业证如何办理
一比一原版(USF毕业证)旧金山大学毕业证如何办理一比一原版(USF毕业证)旧金山大学毕业证如何办理
一比一原版(USF毕业证)旧金山大学毕业证如何办理
 
Particle Swarm Optimization–Long Short-Term Memory based Channel Estimation w...
Particle Swarm Optimization–Long Short-Term Memory based Channel Estimation w...Particle Swarm Optimization–Long Short-Term Memory based Channel Estimation w...
Particle Swarm Optimization–Long Short-Term Memory based Channel Estimation w...
 
Sri Guru Hargobind Ji - Bandi Chor Guru.pdf
Sri Guru Hargobind Ji - Bandi Chor Guru.pdfSri Guru Hargobind Ji - Bandi Chor Guru.pdf
Sri Guru Hargobind Ji - Bandi Chor Guru.pdf
 
ITSM Integration with MuleSoft.pptx
ITSM  Integration with MuleSoft.pptxITSM  Integration with MuleSoft.pptx
ITSM Integration with MuleSoft.pptx
 
Blood finder application project report (1).pdf
Blood finder application project report (1).pdfBlood finder application project report (1).pdf
Blood finder application project report (1).pdf
 
一比一原版(爱大毕业证书)爱荷华大学毕业证如何办理
一比一原版(爱大毕业证书)爱荷华大学毕业证如何办理一比一原版(爱大毕业证书)爱荷华大学毕业证如何办理
一比一原版(爱大毕业证书)爱荷华大学毕业证如何办理
 
Introduction to Computer Networks & OSI MODEL.ppt
Introduction to Computer Networks & OSI MODEL.pptIntroduction to Computer Networks & OSI MODEL.ppt
Introduction to Computer Networks & OSI MODEL.ppt
 
一比一原版(uofo毕业证书)美国俄勒冈大学毕业证如何办理
一比一原版(uofo毕业证书)美国俄勒冈大学毕业证如何办理一比一原版(uofo毕业证书)美国俄勒冈大学毕业证如何办理
一比一原版(uofo毕业证书)美国俄勒冈大学毕业证如何办理
 
Sachpazis_Consolidation Settlement Calculation Program-The Python Code and th...
Sachpazis_Consolidation Settlement Calculation Program-The Python Code and th...Sachpazis_Consolidation Settlement Calculation Program-The Python Code and th...
Sachpazis_Consolidation Settlement Calculation Program-The Python Code and th...
 
A high-Speed Communication System is based on the Design of a Bi-NoC Router, ...
A high-Speed Communication System is based on the Design of a Bi-NoC Router, ...A high-Speed Communication System is based on the Design of a Bi-NoC Router, ...
A high-Speed Communication System is based on the Design of a Bi-NoC Router, ...
 
FULL STACK PROGRAMMING - Both Front End and Back End
FULL STACK PROGRAMMING - Both Front End and Back EndFULL STACK PROGRAMMING - Both Front End and Back End
FULL STACK PROGRAMMING - Both Front End and Back End
 
AN INTRODUCTION OF AI & SEARCHING TECHIQUES
AN INTRODUCTION OF AI & SEARCHING TECHIQUESAN INTRODUCTION OF AI & SEARCHING TECHIQUES
AN INTRODUCTION OF AI & SEARCHING TECHIQUES
 
SELENIUM CONF -PALLAVI SHARMA - 2024.pdf
SELENIUM CONF -PALLAVI SHARMA - 2024.pdfSELENIUM CONF -PALLAVI SHARMA - 2024.pdf
SELENIUM CONF -PALLAVI SHARMA - 2024.pdf
 
This study Examines the Effectiveness of Talent Procurement through the Imple...
This study Examines the Effectiveness of Talent Procurement through the Imple...This study Examines the Effectiveness of Talent Procurement through the Imple...
This study Examines the Effectiveness of Talent Procurement through the Imple...
 
Open Channel Flow: fluid flow with a free surface
Open Channel Flow: fluid flow with a free surfaceOpen Channel Flow: fluid flow with a free surface
Open Channel Flow: fluid flow with a free surface
 

IRJET-Efficient Data Linkage Technique using one Class Clustering Tree for Database Misuse Domain

  • 1. International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056 Volume: 05 Issue: 06 | June 2018 www.irjet.net p-ISSN: 2395-0072 © 2018, IRJET | Impact Factor value: 7.211 | ISO 9001:2008 Certified Journal | Page 2354 Efficient Data Linkage Technique Using One Class Clustering Tree For Database Misuse Domain Mandilkar Kalyani1, Prof.Panhalkar A.R2 1Student MEIT, Dept. of Information Technology, Amrutvahini college of engineering, Sangamner, Maharastra. 2Professor, MEIT, Dept. of Information Technology, Amrutvahini college of engineering, Sangamner, Maharastra. ---------------------------------------------------------------------***--------------------------------------------------------------------- Abstract - The task of data linkage is to join datasets that do not share a common identifier.The process of Data Linkage is implement with entities which are of same and differenttypes. One-to-many data Linkage is an necessary Data Model. Two types of data linkages techniques are one-to-one and one-to- many . In this paper, we propose a one-to-many data linkage. it is used to accomplish link between different types in the matching sets. This present method is depends on One-Class Clustering Tree (OCCT). The tree is constructed in such a way that it is easily understandable and can be transformed into association rules. The inner nodes of the tree include features of the first set of entities. The leaves of the tree appear as features of the second set that are matching. The data is split using four splitting criteria,and two pruningmethodsareused .The result shows that OCCT gives better performances in terms of precision and recall.A threshold is calculated which plays a main role in the categorization of the users. A pruning technique is used to reduce the tree by avoiding the unnecessary branches. Thus, the proposed system is enhanced in terms of time complexity and accuracy. Key Words: Pruning, Splitting criteria, Record Linkage 1. INTRODUCTION There is a need to found the techniquesto linkthedatasets that does not share the common identity; it comesunderthe data linkage process. Data linkage is used when information from two or more records of independent sources are brought together, when they are perceived to belong to the same individual, family, event or place . It is the technique that is used to connect the information across several disparate data sources. The major task of data linkage is to identify the different objects which were used to refer the same entity across the different source of data. The data linkage can be divided into two types: one to one and one to many. The goal is to associate an entity from one data set with a single matching entity in another data set is in one to one linkage. Most of the previous works focus on one-to-one data linkage. we propose a new data linkage method aimed at performing one to many linkages. The proposed data linkage technique can match entities of different types, while data linkage is usually performed among entities of the same type.[1] In this paper, we propose a new data linkage method one- to-many linkage that can match entities of different types. For example, Let TA and TB are the two tables of different types. The inner nodes of the tree consist of attributes referring to both of the tables being matched (TA and TB). The leaves of the tree will determine whether a pair of records described by the path in the tree ending with the current leaf is a match or a nonmatch.We proposeanewdata linkage method aimed at performingone-to-manylinkage.In addition,while data linkage is usually performed among entities of the same type, this proposed data linkage technique can match entitiesof different types. Forexample, if there is a student database we might want to linkastudent record with the courses she should take (according to different features that describe the student and features describing the courses). The proposedmethodlinksbetween the entities using a One-Class Clustering Tree (OCCT). A clustering tree is a tree in which each of the leavescontainsa cluster instead of a single classification and each cluster is generalized by a set of rulesthat is stored in the appropriate leaf [9], [10]. Detecting data misuse poses a great challenge for organizations. Whether caused by malicious intent or an inadvertent mistake, data leakage/misuse can diminish a company's brand, reduce shareholder value,anddamagethe company's goodwill and reputation. This challenge is intensified when trying to detect and/or prevent data leakage/misuse performed by an insider with legitimate permissions to access the organization's systems and its critical data. The goal of using the OCCT in this domain is to link a set of records, representing the context of the request (i.e., the actual access to certain data), with a set of records representing the data that can be legitimately retrieved within the specific context. Thus, the inner nodes of the OCCT represent contextual attributes in which the request occurs (table TA), and the probabilistic models in the leaves represent the data that can be legitimately retrieved in the specific context (table TB). Pairs which are detected as nonmatching are assumed to be illegitimate (i.e., malicious), and will trigger an alert to the organization’s securityofficer. By analyzing both the context of the request as well as the
  • 2. International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056 Volume: 05 Issue: 06 | June 2018 www.irjet.net p-ISSN: 2395-0072 © 2018, IRJET | Impact Factor value: 7.211 | ISO 9001:2008 Certified Journal | Page 2355 data that the user is exposed to, the OCCT method can improve the detection accuracy and better distinguish between a normal and abnormal request In this paper, a new record linkage method is introduced which performs many-to-many record linkage. In many-to- many record linkage the dataset from different tablesmatch with the entity of different tables. Decision tree employd for decide which records are similar to each-other.Decision trees are usually regarded as representing for classification. The leaves of the tree contain the classes and the branches from the root to a leaf contain sufficient conditions for classification. The proposed method is implemented using the database misuse domain used to identify the common and malicious users. In this domain, the goal is to identify anomalous access to database records that may indicate a possible data leakage, loss of data integrity or data mishandling. 2. RELATED WORK Record linkage is the process of matching entities from two different data sources that may or may not contain a common element. Shabtai et al. [1]usedone-to-manylinkage in different domains like fraud detection, recommender systems and data leakage prevention by developing a One- Class Clustering Tree. Four splitting criteria and pruning methods are implemented to construct the tree and to avoid the needless branches of the tree correspondingly. The shortcoming of this approach is that it is difficult to reduce the linkage computation time and it is a one-class approach. Christen and Goiser [2] used a C4.5 decision tree to determine which records must be matched to one another. In their work, various string comparison methods are used and compared by constructing different decision trees. However, their method performs the matching of attributes that are only predefined. Moreover only one or two attributes are usually used. Henry et al. [3] used one-to-many linkage for genealogical research. Record linkage was performed using five attributes: name of the person, birth date, place, gender and the relationships between the persons. Using these five attributes a decision tree wasinduced. The drawback of this approach is that it performs matching using the specific attributes and therefore it is very hard to generalize. F.De Comite, F.Denis, R. Gilleron and F.Letouzey introduced POSC4.5 algorithm for record linkage of positive and unlabeled examples. They considered binary classification and hence this method is not generalized. They require not only the data set but also the information of positive examples out of whole data set. The attraction of their work is that they presented modified entropy formula which that considers weight of positive examples in a given data set. They assumed that negative examples are in unlabeled data set as per given distribution [4]. [3] PROPOSED SYSTEM A. Problem Statement The aim of this paper is to link a record from a table TA with records from another table TB. The generated model is in the form of a tree in which the inner nodes represent attributes from TA and the leafs hold a compact representation of a subset of records from TB. B. Feature The OCCT yields better performance in terms of precision and recall.It will improve the efficiency of classifiers by improvement in the pre and post processing techniques. The simplicity of the model, which can easily betransformed to rules of the type A ->B. The prepruning approach which is used to reduce the time complexity of the algorithm. C. Scope The OCCT can used in different domains like: FraudDetection- To find the fraudulent users.In Recommender systemstheproposed system can beused formatchingnew userswiththeirproductexpectations.In Data leakag prevention.-todetecttheab-normalaccessto the database records that indicates data leakage or data misuse. The process flow of the proposed system is as follows: First consider the customer dataset and the context dataset for linkage process. Then preprocess both the databy removing the null values and error data. After preprocessing, create a training table based on both the context and customer dataset. For training table creation, consider the attributes such as customer id, requesting location, requesting day, requesting time from the first database andintheothertable consider the attributes such as user location and the business type. Then measure the similarity score based on the attributesof requesting location, requesting day, requesting time with user location and the businesstype. Based on that, construct the tree by applying the splitting criteria. Then apply the post pruning method to remove the unnecessary branches that are not used in the classification process.Thenapplythe SEMANC method which is used to analyzethenormaluseror the malicious user from the linked data. At last, the
  • 3. International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056 Volume: 05 Issue: 06 | June 2018 www.irjet.net p-ISSN: 2395-0072 © 2018, IRJET | Impact Factor value: 7.211 | ISO 9001:2008 Certified Journal | Page 2356 performance graph is drawn based on both the existing and the proposed system. Figure 1:- Architecture diagram 4. SPLITTING CRITERIA Maximum Likelihood Estimation: This algorithm is used for choosing the suitable splitting methods. This splitting methodshelp to find outthe suitable attributes that are not split until now. We choose those attributes those are having the uppermost value ascompare to our threshold value. Those vlaue are havingmorethanthe thresold value that record is a exact match. Those value are having less than the threshold value that records are flase match. Its depends upon the time complexity for calculating the all the values for their respective attribute. Its help to calculate the time complexity..Its help to calculate the early threshold value and the choose the next which feature going to be split. The domain that is going to be implemented is database misuse domain. Then the attributes are split up by the similarity values between the clusters. Finally tree will be constructed. The internal nodes of the tree consist of attributes referring to both of the tables being matched (TA and TB). The leaves of the tree will determine whetherapair of records described by the path in the tree ending with the current leaf is a match or a non-match. 5. PRUNING Pruning is considered to be an important activity during the tree construction process. The necessity of pruning is to build a tree with accuracy and also to avoid over fitting. Pruning can be classified into two types: pre-pruning and post-pruning. In pre-pruning, the branches are pruned during the induction process if there are no possible splits found. In post-pruning, the tree is built completely followed by a bottom-up approach to determine which branches are not beneficial. Pruning is the process to trim unnecessary branches to improve accuracy of model. Thustreeisinduced using matching examplesonly. The pruning process isuseto give compact representation of tree i.e. it contains small number of attributes. It also avoids over fitting and improve the time complexity. In our proposed system we are using pre-pruning processto reduce the time complexity. Thedecisionwhether to prune the branch taken once best splitting attribute is chosen. We propose MLE for our system. In Maximum Likelihood Estimation (MLE), a MLE score is calculated for each of splitting attribute. If none of the candidate attribute achieve MLE score greater than current node then branch is pruned and current node becomes leaf node. 6. IMPLEMENTATION DETAILS 6.1 ONE-TO-MANY DATA LINKAGE PROCESS A characteristic record linkage trouble consists of two data tables that do not split an ordinary identifier. Consider two tables Table1 and Table2 as an example. The first table is viewer table and second one as movie table. Here A and B are taken as attribute for table T1 and T2. The goal is to match the records of the table T1 and T2. Generally here we define the same entity for the both table. Here each dataset of the both table match each other. here we proposed an many-to-many linkage model also. In many-to-many record linkage the dataset from different tables match with the entity of different tables. Here the problem is define as |TA| x |TB|. The advance indexing technique can used for an efficient linkage of records.TAB is denoted for matching records and TAB is denoted for non-matching records. The objective of the algorithm is to accurately recognize the true identical pairs (true positive) and identify non-identical pairs (False Positives). Each possible recordspair of T1 is matchwiththe matching records of T2. Every probable records pair are allot a value that describe the probability of two identical tables. For Each class one probability value should be provides. A threshold value has to be declare from the starting. If the value surpasses the define threshold, the records measure as true match or link otherwise it declare as non-match. For implementation purpose we create customer database of an organization are shared with a business partner. Therefore, the simulated data include requests for customer records of an organization ,submitted by a business partner of the organization. 6.2 THE PROCESS FLOW OF THE PROPOSED SYSTEM:-  First the user dataset and the server log dataset for linkage process is taken into account.  Then preprocessboth the data by removingthenull values or missing values and those attributes that have error data.  After preprocessing, create a training table based on both the server log and user dataset. Fortraining Inputdataset(Cust omer&contextdata ) preprocessing Splitting criteria(ML E prepruningOutput(clu stering tree) Database
  • 4. International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056 Volume: 05 Issue: 06 | June 2018 www.irjet.net p-ISSN: 2395-0072 © 2018, IRJET | Impact Factor value: 7.211 | ISO 9001:2008 Certified Journal | Page 2357 table creation, consider the elements such as requesting location, requestingday,requestingtime from the first database and in the other table consider the elements such asuser location and the business type.  Then measure the similarity score based on the elements of requesting location, requesting day, requesting time with user location and the business type by using the Maximum likelihood estimation technique.  Then apply the pre pruning method to remove the unnecessary branches that are not used  A total probability value (sum all possible values of an attribute) is calculated which is compared with the probability calculated for each value of an attribute and that is used in the classification of the users as the common user and the malicious users.  At last, the performance of the existing and the proposed system is evaluated in terms of execution time and the accuracy of the result. 6.3 PREPARING THE INPUT The training and also the testing input data for the OCCT algorithm should be a single dataset TAB which contains only matching instancesfrom TA ⊆ TB. Each record, r, of the dataset is actually a pair of records (r(a), r(b)), one from the table TA and one from the table TB, so that: r = (r(a); r(b)) Є TAB ⊆ TA × TB. In the current version of the implementation, due to circumstances, we require all the attributes of TA to appear in the input dataset, followed by all the attributes of TB. By this way, the attributes of TB (City) can be pointed by a single index which refers to the first attribute of TB in the input dataset. Given a record a -> TA, a leaf of the tree is extracted by following the values of the record a to the correct path of the tree. Each leaf of the tree representsa set ofrecordsfrom the second table (TB) which are the most likely to be linked with record a. In order to represent this set in acompactway and avoid overfitting, a set of probabilistic modelsisinduced for each leaf. Each model is used for deriving the probability of a value of some attribute from table TB, given all other attributes from that table. In our implementation there is no need to save modelsforall possible attributes of TB. Thus, a feature selection processis executed on the leaf dataset in order to choose the attributes that will be represented by the leafs. In our implementation it needs only examples of matching pairs in order to learn and build the model. So this leads to much more accuracy as only matching pairs are train so it accurately choose matching pairs. This feature is important since in many domains it is difficult to obtain nonmatching examples. 6.4 RESULT AND DISSCUSSION I. First consider the customer dataset and the context dataset for linkage process. Then preprocess both the data by removing the null valuesand error data. After preprocessing, create a trainingtablebasedon both the context and customer dataset. For training table creation, consider the attributes such as customer id, requesting location, requesting day, requesting time from the first database and in the other table consider the attributes such as user location and the business type. Fig 2:- Customer dataset for database misuse domain II. After applying splitting criteria,it gives result of four splitting methods.for implementation we consider MLE as a splitting method.
  • 5. International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056 Volume: 05 Issue: 06 | June 2018 www.irjet.net p-ISSN: 2395-0072 © 2018, IRJET | Impact Factor value: 7.211 | ISO 9001:2008 Certified Journal | Page 2358 Fig 3:-Using different splitting criteria result shows the tree size and leaves for dataset database misuse III. Now result of pruning method.we use prepruning approach. Fig 4:- After applying pruning method it shows tree size and leaves IV. Precision and Recall value comparison for different methods Fig 5: Precision and Recall for Database Misuse domain 7. CONCLUSIONS A new method is proposed to link the data that do not have the common elements. A tree is constructed to accomplish the one-to-many record linkage and the database users are classified as normal and abnormal user. Improved efficiency is attained by means of the time complexity and accuracy of the record linkage process. REFERENCES [1] ma’ayan dror, asaf shabtai, lior rokach, and yuval elovici, “occt: a one-class clustering tree for implementing one-to-many data linkage”, ieee transactions on knowledge and data engineering,vol. 26, no. 3, march 2014. [2] I.p. fellegi and a.b. sunter, “a theory for record linkage,” j. am. statistical soc., vol. 64, no. 328, pp. 1183-1210, dec. 1969. [3] m. yakout, a.k. elmagarmid, h. elmeleegy, m. quzzani, and a. qi, “behavior based record linkage”, proc. vldb endowment, vol. 3, nos. 1/2, pp. 439-448, 2010. [4] f. de comite´, f. denis, r. gilleron, and f. letouzey, “positive and unlabeled examples help learning”, proc. 10th int’l conf. algorithmic learning theory,pp. 219-230, 1999. [5] a.j.storkey, c.k.i.williams, e.taylorandr.g.mann“an expectation maximisation algorithm for one-to- many record linkage,” university of edinburgh informatics research report, 2005 [6] S.Ivie, G.Henry, H.Gatrell and C.Giraud-Carrier, “A Metric Based Machine Learning Approachto Genea- Logical Record Linkage,” in Proc. of the 7th Annual Workshop on Technology for Family History and Genealogical Research, 2007. [7] P.Christen and K.Goiser, “Towards AutomatedData Linkage and Deduplication,” Australian National University, Technical Report, 2005. [8] A. Gershman et al., “A Decision Tree Based Recommender System,” Proc. 10th Int’l Conf. Innovative Internet Community Services, pp. 170- 179, 2010.
  • 6. International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056 Volume: 05 Issue: 06 | June 2018 www.irjet.net p-ISSN: 2395-0072 © 2018, IRJET | Impact Factor value: 7.211 | ISO 9001:2008 Certified Journal | Page 2359 [9] L. Gu and R. Baxter, “Decision Models for RecordLinkage,” DataMining, vol. 3755, pp. 146- 160, 2006. [10] O. Benjelloun, H. Garcia, D. Menestrina, Q. Su, S.Whang, and J.Widom, “Swoosh: A Generic Approach toEntity Resolution,” TheVLDB J., vol. 18, no. 1, pp. 255-276, 2009. [11] G.Henry, S.Ivie, H.Gatrell and C.Giraud-Carrier, “A Metric Based Machine Learning Approach to Genea- Logical Record Linkage,” in Proc. of the 7th Annual Workshop on Technology for Family History and Genealogical Research, 2007.