Iaetsd a survey on one class clustering

A SURVEY ON ONE CLASS CLUSTERING
HIERARCHY FOR PERFORMING DATA
LINKAGE
S. Rajalakshmi,
Assistant Professor, Department of CSE,
Velammal Engineering College, Anna University,
Chennai, India.
raji780@yahoo.co.in
A. Jayanthi,
M.E (CSE), Department of CSE,
Velammal Engineering College, Anna University,
Chennai, India.
jayanthiarumugamk@gmail.com
Abstract-- Data linkage is the process of matching records from
several databases that refer to entities of the same type. Data
linkage is also possible for entities that do not share a common
identifier. With the growing size of today's databases, the
complexity of the matching process becomes a major challenge
for data linkage. Many indexing techniques have been developed
for data linkage, but those techniques are not efficient. This
paper presents a data linkage method called the One Class
Clustering Tree (OCCT), which overcomes the existing challenges
and performs data linkage for entities that do not share a
common identifier. The technique builds the tree in such a way
that the inner nodes represent the features of the first set of
entities and the leaves represent the features of the similar
entities in the second set. The one class clustering tree uses
splitting criteria and pruning methods for data linkage.
Keywords--Linkage, classification, clustering, splitting, decision
tree induction, index techniques.
I. INTRODUCTION
Data linkage is the process of identifying different entries that
refer to the same entity across different data sources [1]. The
main aim of data linkage is to join datasets that do not share a
common identifier or foreign key. Data linkage is usually
performed to reduce large data collections to smaller ones. It
also helps in removing duplicate records from a dataset, a task
known as deduplication [19]. Data linkage can be classified into
two types, namely one-to-one data linkage and one-to-many data
linkage [15]. In one-to-one data linkage, the aim is to link an
entity from one dataset with the matching entity from the other
dataset. In one-to-many data linkage, the aim is to link an
entity from the first dataset with the group of matching
entities from the other dataset. This paper uses a data linkage
approach called the One Class Clustering Tree (OCCT), which is
aimed at performing one-to-many data linkage. OCCT is preferable
to the indexing techniques because it can easily be translated
into linkage rules.
The paper is structured as follows: Section II reviews indexing
techniques, Section III deals with data linkage using OCCT, and
Section IV concludes the paper.
II. INDEXING TECHNIQUES
This section discusses the various indexing techniques and the
differences among them. The indexing process of data linkage can
be divided into two phases:
1) Build: All the records in the database are read and their
Blocking Key Values (BKVs) are generated. Most indexing
techniques use an inverted index approach [6], where the
identifiers of records that have the same BKV are inserted into
the same inverted index list.
2) Retrieve: For every block, the list of record identifiers is
retrieved from the inverted index and the candidate record pairs
are generated from the list.
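The two phases can be sketched in a few lines of Python. This is only an illustrative sketch: the blocking key definition (first three letters of the surname plus the postcode), the record fields, and the function names are assumptions made for the example and do not come from the surveyed papers.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(record):
    # Hypothetical BKV: first three letters of the surname plus the postcode.
    return record["surname"][:3].lower() + record["postcode"]


def build_index(records):
    # Build phase: read every record and file its identifier under its BKV.
    index = defaultdict(list)
    for rec_id, record in records.items():
        index[blocking_key(record)].append(rec_id)
    return index


def retrieve_pairs(index):
    # Retrieve phase: every pair of identifiers in one list is a candidate pair.
    pairs = set()
    for rec_ids in index.values():
        pairs.update(combinations(sorted(rec_ids), 2))
    return pairs


records = {
    "r1": {"surname": "Smith", "postcode": "2600"},
    "r2": {"surname": "Smithe", "postcode": "2600"},
    "r3": {"surname": "Smyth", "postcode": "2600"},
    "r4": {"surname": "Jones", "postcode": "2612"},
}
index = build_index(records)
candidate_pairs = retrieve_pairs(index)
```

With these sample records only r1 and r2 share a BKV, so they form the single candidate pair; r3 ("Smyth") lands in a different list because of one differing letter.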
A. TRADITIONAL BLOCKING
Traditional blocking is one of the techniques used in data
linkage [1]. In traditional blocking, all the records that have
the same BKV are inserted into the same block, and only the
records within that block are compared with each other. This
technique can be implemented using an inverted index [6]. The
main disadvantage of traditional blocking is that errors and
variations in the record fields used to generate the BKVs lead
to records being inserted into the wrong block. The second
disadvantage is that the sizes of the generated blocks depend on
the frequency distribution of the BKVs, so it is difficult to
predict the total number of candidate record pairs that will be
generated.
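The dependence of the candidate count on the block sizes can be written out explicitly. As a sketch, assume that block b holds n_b records when a single database is deduplicated, and n_{A,b} and n_{B,b} records when two databases A and B are linked (these symbols are introduced here for illustration and do not appear in the surveyed papers). Every pair of records inside a block is compared, so:

```latex
\[
  \text{pairs}_{\text{dedup}} = \sum_{b} \frac{n_b\,(n_b - 1)}{2},
  \qquad
  \text{pairs}_{\text{link}} = \sum_{b} n_{A,b}\, n_{B,b}
\]
```

For example, three blocks of sizes 2, 3 and 100 give 1 + 3 + 4950 = 4954 pairs in the deduplication case, almost all of them from the largest block.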
Proceedings of International Conference on Advancements in Engineering and Technology
ISBN NO : 978 - 1502893314
www.iaetsd.in
International Association of Engineering and Technology for Skill Development
B. SORTED NEIGHBORHOOD INDEXING
Sorted neighborhood indexing sorts the database according to the
BKVs and then moves a window of a fixed number of records over
the sorted values. Candidate record pairs are generated only
from the records within the current window. There are three
approaches, namely the sorted array based approach [4], the
inverted index based approach [14] and the adaptive sorted
neighborhood approach [16]. The sorted array based approach is
not suitable when the window size is small. The inverted index
based approach has the same drawback as traditional blocking and
is inefficient because it takes a lot of time to split the
entities. The adaptive sorted neighborhood approach is not
suitable when the window size is too large.
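The sorted array based variant can be sketched as follows. This is a minimal illustration, not the implementation from [4]: the function name, the sample BKVs, and the tie-break on the record identifier are assumptions made for the example.

```python
from itertools import combinations


def sorted_neighborhood_pairs(bkvs, window):
    """Return candidate pairs from a fixed-size window slid over sorted BKVs.

    bkvs maps a record identifier to its BKV; window is the number of
    records per window and must be at least 2.
    """
    if window < 2:
        raise ValueError("window must cover at least two records")
    # Sort identifiers by BKV (identifier used as a tie-break for determinism).
    ordered = sorted(bkvs, key=lambda rec_id: (bkvs[rec_id], rec_id))
    pairs = set()
    # Slide the window one position at a time over the sorted array.
    for start in range(max(1, len(ordered) - window + 1)):
        for a, b in combinations(ordered[start:start + window], 2):
            pairs.add(tuple(sorted((a, b))))
    return pairs


bkvs = {"r1": "smith", "r2": "smyth", "r3": "jones", "r4": "smithe"}
pairs = sorted_neighborhood_pairs(bkvs, window=2)
```

With a window of two records, "smith" (r1) and "smyth" (r2) are never compared because "smithe" (r4) sorts between them, which illustrates how the window size controls which true matches can be found.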
C. Q-GRAM BASED INDEXING
Q-gram based indexing overcomes the drawback of traditional
blocking and sorted neighborhood indexing. The main aim of this
technique is to index the database such that records that have a
similar, and not just the same, BKV are inserted into the same
block [8]. However, a much larger number of candidate record
pairs is generated, leading to a more time-consuming process.
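The idea of grouping similar rather than identical BKVs can be illustrated with a simplified variant in which every individual q-gram of a BKV is used as an index key. This is an assumption made to keep the sketch short; the schemes described in the literature build their keys from combinations of q-grams, and the function names below are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def qgrams(value, q=2):
    # All overlapping substrings of length q, e.g. "smith" -> sm, mi, it, th.
    return [value[i:i + q] for i in range(len(value) - q + 1)]


def qgram_index(bkvs, q=2):
    # Simplified index: each q-gram is a key pointing to the records whose
    # BKV contains it, so one record appears in several lists.
    index = defaultdict(set)
    for rec_id, bkv in bkvs.items():
        for gram in qgrams(bkv, q):
            index[gram].add(rec_id)
    return index


def candidate_pairs(index):
    pairs = set()
    for rec_ids in index.values():
        pairs.update(combinations(sorted(rec_ids), 2))
    return pairs


bkvs = {"r1": "smith", "r2": "smyth", "r3": "jones"}
pairs = candidate_pairs(qgram_index(bkvs))
```

Here "smith" and "smyth" become a candidate pair because they share the bigrams "sm" and "th", although their BKVs differ. Because every record is inserted under several keys, the number of candidate pairs grows quickly on real data.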
D. SUFFIX ARRAY-BASED INDEXING
Suffix array-based indexing is one of the most efficient
approaches compared to the techniques above. The basic idea of
this technique is to insert the BKVs and their suffixes into a
suffix array based inverted index [11]. A variant called Robust
Suffix Array Based Indexing merges the inverted index lists of
suffix values that are similar to each other in the sorted
suffix array [13]. However, this technique also takes a lot of
time to merge the values.
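A minimal sketch of the basic idea is given below. The minimum suffix length parameter, the function names and the sample BKVs are assumptions made for illustration, and the merging step of the robust variant is not shown.

```python
from collections import defaultdict


def suffixes(bkv, min_len=3):
    # The BKV itself and each of its suffixes down to the minimum length.
    return [bkv[i:] for i in range(len(bkv) - min_len + 1)]


def suffix_index(bkvs, min_len=3):
    # Insert every suffix as a key, then keep the keys in sorted order
    # so that the index can be read as a sorted suffix array.
    index = defaultdict(set)
    for rec_id, bkv in bkvs.items():
        for suffix in suffixes(bkv, min_len):
            index[suffix].add(rec_id)
    return dict(sorted(index.items()))


bkvs = {"r1": "katherine", "r2": "catherine", "r3": "jones"}
index = suffix_index(bkvs)
```

The records r1 and r2 differ in their first letter, so their full BKVs fall under different keys, but they share every suffix from "atherine" down to "ine" and therefore end up in common blocks.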
E. CANOPY CLUSTERING
Canopy clustering [14] is built by converting BKVs into lists of
tokens, with each unique token becoming a key in the inverted
index. It uses two approaches, the threshold-based approach and
the nearest-neighbor-based approach. The drawback of canopy
clustering is similar to that of the sorted neighborhood
technique based on the sorted array.
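The threshold-based approach can be sketched as follows. Several choices here are assumptions made for the example rather than details taken from [14]: bigrams are used as tokens, Jaccard similarity is used to compare token sets, the canopy centre is picked deterministically instead of at random, and the threshold values are arbitrary.

```python
def jaccard(a, b):
    # Jaccard similarity of two sets; two empty sets count as identical.
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def tokens(bkv, q=2):
    return {bkv[i:i + q] for i in range(len(bkv) - q + 1)}


def canopy_clusters(bkvs, loose=0.2, tight=0.6):
    """Group records into overlapping canopies using two thresholds."""
    if loose > tight:
        raise ValueError("loose threshold must not exceed tight threshold")
    token_sets = {rec_id: tokens(bkv) for rec_id, bkv in bkvs.items()}
    pool = sorted(token_sets)
    canopies = []
    while pool:
        centre = pool[0]  # assumption: first remaining record, not a random one
        sims = {r: jaccard(token_sets[centre], token_sets[r]) for r in pool}
        # Records within the loose threshold join the canopy.
        canopies.append([r for r in pool if sims[r] >= loose])
        # Records within the tight threshold leave the pool for good.
        pool = [r for r in pool if sims[r] < tight and r != centre]
    return canopies


bkvs = {"r1": "smith", "r2": "smyth", "r3": "jones", "r4": "smithe"}
canopies = canopy_clusters(bkvs)
```

In this run r2 joins the canopy around r1 (similarity 1/3) but is not close enough to be removed from the pool, so it later forms a canopy of its own. Canopies may therefore overlap.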
F. STRING-MAP-BASED INDEXING
String-map-based indexing [9] maps BKVs to objects in a
multidimensional Euclidean space such that the distances between
pairs of strings are preserved. Groups of similar strings are
then generated by extracting the objects that are similar to
each other. However, this technique fails when the size of the
database is too large or too small.
All of the indexing techniques discussed above therefore have
some drawbacks in the data linkage process. To overcome these
indexing problems, an approach called the One Class Clustering
Tree is proposed. It uses four splitting criteria for the data
split, namely Coarse-Grained Jaccard Coefficient, Fine-Grained
Jaccard Coefficient, Least Probable Intersection (LPI) and
Maximum Likelihood Estimation (MLE), together with pruning
techniques.
III. DATA LINKAGE USING OCCT
OCCT is induced using one of the splitting criteria. The
splitting criterion determines which attribute should be used at
each step of building the tree. OCCT uses prepruning to decide
which branches should be trimmed.
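How a splitting criterion might pick an attribute can be sketched with a plain Jaccard coefficient. The sketch below is an assumption-laden illustration and not the OCCT algorithm itself: it assumes that matching pairs are given as (record from the first dataset, identifier from the second dataset), that a candidate attribute is scored by the mean pairwise Jaccard similarity between the sets of second-dataset identifiers reached through each attribute value, and that the attribute with the lowest similarity is preferred. All names and sample data are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def split_score(pairs, attribute):
    # Group the linked identifiers by the value of the candidate attribute.
    groups = defaultdict(set)
    for a_record, b_id in pairs:
        groups[a_record[attribute]].add(b_id)
    subsets = list(groups.values())
    if len(subsets) < 2:
        return 1.0  # a single value does not split the data at all
    scores = [jaccard(x, y) for x, y in combinations(subsets, 2)]
    return sum(scores) / len(scores)


def choose_split(pairs, attributes):
    # Assumption: the less the subsets overlap, the better the split.
    return min(attributes, key=lambda attr: split_score(pairs, attr))


pairs = [
    ({"city": "Chennai", "gender": "F"}, "b1"),
    ({"city": "Chennai", "gender": "M"}, "b2"),
    ({"city": "Delhi", "gender": "F"}, "b3"),
    ({"city": "Delhi", "gender": "M"}, "b3"),
    ({"city": "Delhi", "gender": "M"}, "b4"),
]
best = choose_split(pairs, ["gender", "city"])
```

In this toy data, splitting on "city" separates the linked identifiers completely (similarity 0), whereas splitting on "gender" leaves b3 in both subsets (similarity 0.25), so "city" is chosen.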
Fig 1: Work Flow Diagram
First, the tree is constructed so that the inner nodes consist
of attributes and the leaves represent clusters of matching
entities. Second, prepruning is applied, which means that the
algorithm stops expanding a branch whenever the sub-branch does
not improve the accuracy of the model. OCCT uses a probabilistic
model to find the similar entities that are to be matched. This
probabilistic approach helps to avoid overfitting. For these
reasons, OCCT is considered a better approach for data linkage
than the indexing techniques.
IV. CONCLUSION
This paper uses the OCCT approach, which performs one-to-many
data linkage. The method is based on a one-class decision tree
model that summarizes the knowledge of which records should be
linked together. Its one-class approach gives more accurate
results. The OCCT model has also proved successful in three
different domains, namely data leakage prevention, recommender
systems and fraud detection.
[Fig 1 box labels: Database A; Database B; Construct OCCT using all entities; Prepruning technique; Compare entities; Matching entity; Non-matching entity; Final result]

REFERENCES
1. I.P. Fellegi and A.B. Sunter, “A Theory for Record
Linkage,” J. Am. Statistical Assoc., vol. 64, no. 328, pp.
1183-1210, Dec. 1969.
2. D.D. Dorfman and E. Alf, “Maximum-Likelihood
Estimation of Parameters of Signal-Detection Theory
and Determination of Confidence Intervals—Rating-
Method Data,” J. Math. Psychology,vol. 6, no. 3, pp.
487-496, 1969.
3. J.R. Quinlan, “Induction of Decision Trees,” Machine
Learning, vol. 1, no. 1, pp. 81-106, March 1986.
4. M.A. Hernandez and S.J. Stolfo, “The Merge/Purge
Problem for Large Databases,” Proc. ACM SIGMOD
Int’l Conf. Management of Data (SIGMOD ’95),
1995.
5. P. Langley, Elements of Machine Learning, San
Francisco, Morgan Kaufmann, 1996.
6. I.H. Witten, A. Moffat, and T.C. Bell, Managing
Gigabytes, second ed. Morgan Kaufmann, 1999.
7. S. Guha, R. Rastogi, and K. Shim, “Rock: A Robust
Clustering Algorithm for Categorical Attributes,”
Information Systems, vol. 25, no. 5, pp. 345-366,
July 2000.
8. L. Gravano, P.G. Ipeirotis, H.V. Jagadish, N. Koudas,
S. Muthukrishnan, and D. Srivastava, “Approximate
String Joins in a Database (Almost) for Free,” Proc.
27th Int’l Conf. Very Large Data Bases (VLDB ’01),
pp. 491-500, 2001.
9. L. Jin, C. Li, and S. Mehrotra, “Efficient Record
Linkage in Large Data Sets,” Proc. Eighth Int’l Conf.
Database Systems for Advanced Applications
(DASFAA ’03), pp. 137-146, 2003.
10. I.S. Dhillon, S. Mallela, and D.S. Modha,
“Information-Theoretic Co-Clustering,” Proc. Ninth
ACM SIGKDD Int’l Conf. Knowledge Discovery
and Data Mining, pp. 89-98, 2003.
11. A. Aizawa and K. Oyama, “A Fast Linkage Detection
Scheme for Multi-Source Information Integration,”
Proc. Int’l Workshop Challenges in Web
Information Retrieval and Integration (WIRI ’05),
2005.
12. A.J. Storkey, C.K.I. Williams, E. Taylor, and R.G. Mann,
“An Expectation Maximisation Algorithm for
One-to-Many Record Linkage,” University of Edinburgh
Informatics Research Report, 2005.
13. P. Christen, “A Comparison of Personal Name
Matching: Techniques and Practical Issues,” Proc.
IEEE Sixth Data Mining Workshop (ICDM ’06),
2006.
14. P. Christen, “Towards Parameter-Free Blocking for
Scalable Record Linkage,” Technical Report
TR-CS-07-03, Dept. of Computer Science, The Australian
Nat’l Univ., 2007.
15. P. Christen and K. Goiser, “Quality and Complexity
Measures for Data Linkage and Deduplication,”
Quality Measures in Data Mining, vol. 43, pp. 127-
151, 2007.
16. S. Yan, D. Lee, M.Y. Kan, and L.C. Giles, “Adaptive
Sorted Neighborhood Methods for Efficient Record
Linkage,” Proc. Seventh ACM/IEEE-CS Joint Conf.
Digital Libraries (JCDL ’07), 2007.
17. A. Gershman et al., “A Decision Tree Based
Recommender System,” Proc. 10th Int’l Conf.
Innovative Internet Community Services, pp. 170-179,
2010.
18. M. Yakout, A.K. Elmagarmid, H. Elmeleegy,
M. Ouzzani, and A. Qi, “Behavior Based Record
Linkage,” Proc. VLDB Endowment, vol. 3,
no. 1-2, pp. 439-448, 2010.
19. P. Christen, “A Survey of Indexing Techniques for
Scalable Record Linkage and Deduplication,” IEEE
Trans. Knowledge and Data Eng., vol. 24, no. 9, pp.
1537-1555, Sept. 2012, doi:10.1109/TKDE.2011.127.
20. M. Dror, A. Shabtai, L. Rokach, and Y. Elovici, “OCCT: A
One-Class Clustering Tree for Implementing
One-to-Many Data Linkage,” IEEE Trans. Knowledge
and Data Eng., TKDE-2011-09-0577, 2013.