Efficient Techniques for Online
        Record Linkage

            Guided By,
            Mrs.k.Sujatha B.E.,M.Tech.,(Ph.D..,)


                       Presented by,
                         D. Angelin chitra,
                        A. Jainambu sariba,
                         S. Sathiya priya,
                     K. Sridevi.


                                                   1
Overview
Introduction
LiteratureReview
Record Linkage
System Design
Modules
Screen Shots
Conclusion



                         2
Introduction
 Databases   frequently contain duplicate fields and
  records that refer to the same real-world entity.
 The data needed to support these decisions are often
  scattered in heterogeneous distributed databases.
 Heterogeneous databases are usually designed and
  managed by different organizations there may be no
  common candidate key for linking the records.
 If the databases use the same set of design standards
  linking can easily be done using the primary key.
 In this project we are developing a software which
  will produce the relevant links from heterogenous
  database with reference to the keyword.



                                                          3
ABSTRACT
   Matching records that refer to the same entity across databases is
    becoming an increasingly important part of many data mining
    projects, as often data from multiple sources needs to be matched in
    order to enrich data or improve its quality.
    Record linkage is the computation of the associations among
    records of multiple databases.
   Matching data from heterogeneous data source has been a real
    problem.
    Statistical record linkage techniques could be used for resolving
    this problem but it causes communication bottleneck in a
    distributed environment.
   A matching tree is used to overcome communication overhead and
    give matching decision as obtained using the conventional linkage
    technique.

                                                                       4
Literature Review

Existing System
 No common candidate key for linking the records in
  Heterogeneous databases.
 It is possible to use common non key attributes to access
  the heterogeneous database, but the result obtained using
  these attributes may not always be accurate.
 When the matching records reside at a remote site,
  existing techniques cannot be directly applied because
  they would involve transferring the entire remote
  relation, thereby incurring a huge communication
  overhead.
 Different ranking algorithm are used for the search
  which may be time consuming.
                                                          5
Disadvantages

 It   has not been work on online.
 Not    cost effective.
 It   cannot be reduce communication overhead.




                                                  6
Proposed System
 An  efficient technique is developed to facilitate record
  linkage decisions in a distributed, online setting.
 A matching tree is developed for attribute acquisition
  based on sequential decision making.
 The proposed techniques reduce the communication
  overhead considerably, and the linkage performance is
  assured to be at the same level as the traditional
  approach.




                                                              7
Advantages

 Reducing     the communication overhead in a distributed
  environment cost effective model.
 It   is cost effective model.




                                                             8
Sequential record linkage




                            9
Principles of matching tree
Input  selection
      Assume that we are at some node of the
 tree and are trying to decide how to branch
 from there. At that point , we would be
 interested in finding the next best attribute to
  be acquired from the set of remaining
 attributes.
Stopping
      The stopping decision is made when no
 realisation of the remaining attributes can
 sufficiently revise the current matching
 probability so that the matching decision
 changes

                                                10
Tree based linkage techniques
◦ Here we develop efficient online record linkage techniques
  based on the matching tree
◦ The first two stages in this process are performed offline,
  using the training data.
◦ Once the matching tree has been built, the online linkage is
  done as the final step.
◦ We can now characterize the different techniques that can
  be employed in the last step.
◦ Given a local enquiry record, the ultimate goal of any
  linkage technique is to identify and fetch all the records
  from the remote site that have a matching probability
◦ The partitioning itself can be done in one of two possible
  ways:
    1) Sequential
    2) Concurrent

                                                                 11
Sequential partitioning
The  set of remote records is partitioned
 recursively, till we obtain the desired
 partition of all the relevant records.
This partitioning can be done in one of
 two ways:
        i). Sequential Attribute Acquisition
        ii). Sequential Identifier Acquisition



                                                 12
Concurrent Partitioning
The   tree is used to formulate a database
 query that selects the relevant remote
 records directly, in one single step.
Once the relevant records are identified,
 all their attribute values are transferred




                                              13
Record Linkage
Record   linkage refers to the task of
 finding records in a data set that refer to the
 same entity across different data sources (e.g.,
 data files, books, websites, databases).
Record linkage is necessary when joining data
 sets based on entities that may or may not
 share a common identifier.
Record linkage has applications in customer
 systems for marketing, relationship
 management, fraud detection, law
 enforcement and government administration


                                                    14
15
Modules

 Multiple Source Data.
 Detecting Overhead By Matching Tree.
 Overheads Eliminated.




                                         16
MODULE DESCRIPTION




                     17
Multiple Source Data
 Databases   is becoming an increasingly important part of
  many data mining projects, as often data from multiple
  sources needs to be matched in order to enrich data or
  improve its quality.
 The data needed to support these decisions are often scattered
  in heterogeneous distributed databases.
 In such cases, it maybe necessary to link records in multiple
  databases so that one can consolidate and use the data
  pertaining to the same real world entity.
 Entity matching is a crucial task for data integration and data
  cleaning. It is the task of identifying entities (objects, data
  instances) referring to the same real-world entity.
 The entity matching problem arises when there is no common
  identifier across the heterogeneous data sources.

                                                                18
Detecting Overhead by Matching
              Tree
To  integrate or link the data stored in
 heterogeneous data sources, a critical
 problem is entity matching, i.e., matching
 records representing corresponding
 entities in the real world, across the
 sources.
In this paper, we describe how this
 method can be applied in entity matching
 rules from heterogeneous databases.

                                              19
Overheads Eliminated
We   develop a matching tree, similar to
 a decision tree, and use it to reduce
 the communication overhead
 significantly.
The online record linkage process
 become more efficient by reducing the
 communication overhead in a
 distributed environment.


                                            20
Screen shots
Login   page




                           21
Registration page




                    22
Search page




              23
Search results




                 24
Conclusion
Record   linkage is an important issue in
 heterogeneous database systems where
 the records representing the same real-
 world entity type are identified using
 different identifiers in different databases.
In the absence of a common identifier, it
 is often difficult to find records in a
 remote database that are similar to a local
 enquiry record.

                                                 25
References
   A.K. Elmagarmid, P.G. Ipeirotis, and V.S. Verykios, “Duplicate
                     Record Detection: A Survey,” IEEE Trans.
    Knowledge and Data Eng.,vol. 19, no. 1, pp. 1-16, Jan. 2007.
   B. Tepping, “A Model for Optimum Linkage of Records,” J. Am.
    Statistical Assoc., vol. 63, pp. 1321-1332, 1968.
   C. Batini, M. Lenzerini, and S.B. Navathe, “A Comparative
    Analysis of Methodologies for Database Schema
    Integration,”ACM Computing Surveys, vol. 18, no. 4, pp. 323-
    364, 1986
   D. Dey, “Record Matching in Data Warehouses: A Decision
    Model for Data Consolidation,” Operations Research, vol. 51,
    no. 2, pp. 240-254,
   D. Dey, S. Sarkar, and P. De, “A Distance-Based Approach to
    Entity Reconciliation in Heterogeneous Databases,” IEEE
    Trans.Knowledge and Data Eng., vol. 14, no. 3, pp. 567-582,
    May/June 2002.


                                                                 26
THANK YOU




            27
QUERIES




          28

online Record Linkage

  • 1.
    Efficient Techniques forOnline Record Linkage Guided By, Mrs.k.Sujatha B.E.,M.Tech.,(Ph.D..,) Presented by, D. Angelin chitra, A. Jainambu sariba, S. Sathiya priya, K. Sridevi. 1
  • 2.
  • 3.
    Introduction  Databases frequently contain duplicate fields and records that refer to the same real-world entity.  The data needed to support these decisions are often scattered in heterogeneous distributed databases.  Heterogeneous databases are usually designed and managed by different organizations there may be no common candidate key for linking the records.  If the databases use the same set of design standards linking can easily be done using the primary key.  In this project we are developing a software which will produce the relevant links from heterogenous database with reference to the keyword. 3
  • 4.
    ABSTRACT  Matching records that refer to the same entity across databases is becoming an increasingly important part of many data mining projects, as often data from multiple sources needs to be matched in order to enrich data or improve its quality.  Record linkage is the computation of the associations among records of multiple databases.  Matching data from heterogeneous data source has been a real problem.  Statistical record linkage techniques could be used for resolving this problem but it causes communication bottleneck in a distributed environment.  A matching tree is used to overcome communication overhead and give matching decision as obtained using the conventional linkage technique. 4
  • 5.
    Literature Review Existing System No common candidate key for linking the records in Heterogeneous databases.  It is possible to use common non key attributes to access the heterogeneous database, but the result obtained using these attributes may not always be accurate.  When the matching records reside at a remote site, existing techniques cannot be directly applied because they would involve transferring the entire remote relation, thereby incurring a huge communication overhead.  Different ranking algorithm are used for the search which may be time consuming. 5
  • 6.
    Disadvantages  It has not been work on online.  Not cost effective.  It cannot be reduce communication overhead. 6
  • 7.
    Proposed System  An efficient technique is developed to facilitate record linkage decisions in a distributed, online setting.  A matching tree is developed for attribute acquisition based on sequential decision making.  The proposed techniques reduce the communication overhead considerably, and the linkage performance is assured to be at the same level as the traditional approach. 7
  • 8.
    Advantages  Reducing the communication overhead in a distributed environment cost effective model.  It is cost effective model. 8
  • 9.
  • 10.
    Principles of matchingtree Input selection Assume that we are at some node of the tree and are trying to decide how to branch from there. At that point , we would be interested in finding the next best attribute to be acquired from the set of remaining attributes. Stopping The stopping decision is made when no realisation of the remaining attributes can sufficiently revise the current matching probability so that the matching decision changes 10
  • 11.
    Tree based linkagetechniques ◦ Here we develop efficient online record linkage techniques based on the matching tree ◦ The first two stages in this process are performed offline, using the training data. ◦ Once the matching tree has been built, the online linkage is done as the final step. ◦ We can now characterize the different techniques that can be employed in the last step. ◦ Given a local enquiry record, the ultimate goal of any linkage technique is to identify and fetch all the records from the remote site that have a matching probability ◦ The partitioning itself can be done in one of two possible ways: 1) Sequential 2) Concurrent 11
  • 12.
    Sequential partitioning The set of remote records is partitioned recursively, till we obtain the desired partition of all the relevant records. This partitioning can be done in one of two ways: i). Sequential Attribute Acquisition ii). Sequential Identifier Acquisition 12
  • 13.
    Concurrent Partitioning The tree is used to formulate a database query that selects the relevant remote records directly, in one single step. Once the relevant records are identified, all their attribute values are transferred 13
  • 14.
    Record Linkage Record linkage refers to the task of finding records in a data set that refer to the same entity across different data sources (e.g., data files, books, websites, databases). Record linkage is necessary when joining data sets based on entities that may or may not share a common identifier. Record linkage has applications in customer systems for marketing, relationship management, fraud detection, law enforcement and government administration 14
  • 15.
  • 16.
    Modules  Multiple SourceData.  Detecting Overhead By Matching Tree.  Overheads Eliminated. 16
  • 17.
  • 18.
    Multiple Source Data Databases is becoming an increasingly important part of many data mining projects, as often data from multiple sources needs to be matched in order to enrich data or improve its quality.  The data needed to support these decisions are often scattered in heterogeneous distributed databases.  In such cases, it maybe necessary to link records in multiple databases so that one can consolidate and use the data pertaining to the same real world entity.  Entity matching is a crucial task for data integration and data cleaning. It is the task of identifying entities (objects, data instances) referring to the same real-world entity.  The entity matching problem arises when there is no common identifier across the heterogeneous data sources. 18
  • 19.
    Detecting Overhead byMatching Tree To integrate or link the data stored in heterogeneous data sources, a critical problem is entity matching, i.e., matching records representing corresponding entities in the real world, across the sources. In this paper, we describe how this method can be applied in entity matching rules from heterogeneous databases. 19
  • 20.
    Overheads Eliminated We develop a matching tree, similar to a decision tree, and use it to reduce the communication overhead significantly. The online record linkage process become more efficient by reducing the communication overhead in a distributed environment. 20
  • 21.
  • 22.
  • 23.
  • 24.
  • 25.
    Conclusion Record linkage is an important issue in heterogeneous database systems where the records representing the same real- world entity type are identified using different identifiers in different databases. In the absence of a common identifier, it is often difficult to find records in a remote database that are similar to a local enquiry record. 25
  • 26.
    References  A.K. Elmagarmid, P.G. Ipeirotis, and V.S. Verykios, “Duplicate Record Detection: A Survey,” IEEE Trans. Knowledge and Data Eng.,vol. 19, no. 1, pp. 1-16, Jan. 2007.  B. Tepping, “A Model for Optimum Linkage of Records,” J. Am. Statistical Assoc., vol. 63, pp. 1321-1332, 1968.  C. Batini, M. Lenzerini, and S.B. Navathe, “A Comparative Analysis of Methodologies for Database Schema Integration,”ACM Computing Surveys, vol. 18, no. 4, pp. 323- 364, 1986  D. Dey, “Record Matching in Data Warehouses: A Decision Model for Data Consolidation,” Operations Research, vol. 51, no. 2, pp. 240-254,  D. Dey, S. Sarkar, and P. De, “A Distance-Based Approach to Entity Reconciliation in Heterogeneous Databases,” IEEE Trans.Knowledge and Data Eng., vol. 14, no. 3, pp. 567-582, May/June 2002. 26
  • 27.
  • 28.