SlideShare a Scribd company logo
RULE-BASED METHOD FOR ENTITY RESOLUTION
Abstract—The objective of entity resolution (ER) is to identify records referring
to the same real-world entity. Traditional ER approaches identify records based on
pairwise similarity comparisons, which assumes that records referring to the same
entity are more similar to each other than otherwise. However, this assumption
does not always hold in practice and similarity comparisons do not work well when
such assumption breaks. We propose a new class of rules which could describe the
complex matching conditions between records and entities. Based on this class of
rules, we present the rule-based entity resolution problem and develop an on-line
approach for ER. In this framework, by applying rules to each record, we identify
which entity the record refers to. Additionally, we propose an effective and
efficient rule discovery algorithm. We experimentally evaluated our rule-based ER
algorithm on real data sets. The experimental results show that both our rule
discovery algorithm and rule-based ER algorithm can achieve high performance.
EXISTING SYSTEM:
The work on entity resolution can be broadly divided into three categories.
Pairwise ER. Most works on ER focus on record matching, which involves
comparing record pairs and identifying whether they match. A major part of work
on record matching focuses on similarity functions. To capture string variations,
proposed a transformation-based framework for record matching. Some machine-
learningbased approaches can identify matching strings which are syntactically far
apart. Similarity based on record relationships are also proposed to solve the
people identification problem. Since in our work, records are not compared with
each other, our work is orthogonal to record matching. However, string similarity
functions can be applied to fuzzy match operator (denoted by _) in ER-rules. For
example, given a string s, we say s _ “wei wang” if the edit distance between s and
“wei wang” is smaller than a given threshold. Decision trees are employed to teach
record matching rules. However, decision trees cannot be used to discover ER-
rules. This is because the domain of the righthand side of record matching rules is
{yes, no} (two records are mapped or not mapped), while the domain of the
righthand side of ER-rules is an entity set. Non-pairwise ER. The research on non-
pairwise ER includes clustering strategies and classifiers. Most strategies solve ER
based on the relationship graph among records, by modeling the records as nodes
and the relationships as edges. Machine learning approaches are also proposed by
using global information to solve ER effectively. However, these methods are not
suitable for massive data because of efficiency issues. We choose a representative
work for comparison.
PROPOSED SYSTEM:
This paper aims at the aforementioned problems, and the main contributions of the
paper are as following.
1) The syntax and semantics of the rules for ER are designed, and the
independence, consistency, completeness and validity of the rules are defined and
analyzed.
2) An efficient rule discovery algorithm based on training data is proposed and
analyzed.
3) An efficient rule-based algorithm for solving entity resolution problem is
proposedand analyzed.
4) A rule maintaining method is proposedwhen entity information is changed.
5) Experiments are performed on real data to verify the effectiveness and
efficiency of the proposed algorithms.
In fact, our method and traditional ER approaches can be considered as the
complementary to each other and be applied together. This is because our rule-
based method can identify records which cannot be resolved by traditional ER
methods and traditional ER methods can identify most of the records effectively
and do not require the availability of correct entity set. In this way, the limitations
of both methods can be overcome.
Module 1
Rules for entity resolution
In this section, a rule system for entity resolution, called ERrule, is defined.
(1)The If clause includes constraints on attributes of records, such as “including
zhang in coauthors”, and
(2)Then clause indicates the real world entity referred by the records that
satisfy the first clause of the rule, such as “refers to entity e1”. Thus, we use
A ) B to express the rules “8o, If o satisfies A Then o refers to B” for ER.
We denote the left-hand side and the right-hand side of a rule r as LHSðrÞ
and RHSðrÞ respectively.
Syntax
An ER-rule is where clause with the form of (Ai opi vi), (vi opi Ai), :(Aiopi vi) or
:(vi opi Ai), where Ai is an attribute, vi is a constant in the domain of Ai and opi
can be any domaindependent operator defined by users, such as exact match
operator ¼, fuzzy match operator _ [16] for string value, for numeric value, or 2
for set value. The clausewith form (Ai opi vi) or (vi opi Ai) is calledpositive
clause, and the clause with form :(Ai opi vi) or :(vi opiAi) is callednegative clause.
Semantics
In the following definitions, we let o be a record, S be a data set, r be an ER-rule
and R be an ER-rule set. For the convenience of discussion, we assume the
mapping from each record in S to its actual entity is given. Since an ER-rule does
not include disjoint clauses, we define the condition of matching the left-hand and
righthand sides of rule as follows.
Definition 1. 0 matches the LHS of r if o satisfies all the clauses in LHSðrÞ. o
matches the RHS of r if o refers to entity RHSðrÞ.
Module 2
Properties of ER-Rule Set
Given an ER-rule set R and a data set S, to ensure R performs well on S, we
require that (1) there is no false matches
between record and entity (validity); (2) there is no conflicting decisions by R
(consistency); (3) each record in S can be mapped to an entity by R (completeness)
and (4) there is no redundant rules in R (independence). Now we present the
formal definitions of these properties. Definition 4 (Validity). R is valid for S if
each ER-rule in R is
valid for S.
Module 3
RULE DISCOVERY
Since it might be too expensive to construct ER-rules manually, we discuss how to
discover useful rules from a training data set for efficient and effective entity
resolution in this section. We assume that the operator on each attribute can be any
domain-dependent operator defined by users. First, we discuss the requirements of
the discovered rule sets and present our framework of rule discovery. Then we
describe the algorithms in the rule discovery framework and study the correctness
and complexity of our algorithm. For exposition, proofs have been deferred to the
Appendix. For the convenience of discussion, some concepts are introduced first.
We classify ER-rules into two categories according to whether negative clauses are
included. Definition 8 (PR). PR is an ER-rule which only includes positive clauses.
Module 4
Requirements for rule Discovery
Even though these properties are satisfied on the training data set, it cannot be
ensured that the generated rule set can
also perform well on the other data sets. To make the rule set suitable for ER for
many data sets other than only suitable for the training data set, we require the
discovered ERrule set, denoted by R, should also satisfy two requirements
described as follows.
_ Length Requirement: Given a threshold l, each rule r in R satisfies that jrj To
determine whether record o matches the LHS of ERrule r, we should check
whether o satisfies each clause in LHSðrÞ. Thus to guarantee the efficiency of
rule-based ER (R-ER) and avoid overfitting, the length of each rule (the number of
clauses) should be no more than a threshold.
_ PR Requirement: each rule r in R is a PR. The reasonwhywe give priority toPRs
is that,positiveliterals lead to bounded spaceswhile negative literals lead to
unbounded spaces. Therefore the discovered PRs are more possible to identify
other data sets effectively than the discoveredNRs.
Module 5
Rule-basedentity resolution
In this section, we discuss the algorithm of entity resolution by leveraging ER-
rules. We first define the rule-based ER
problem. Next we develop an online algorithm for rulebased ER problem. Finally,
we describe how to incorporate this algorithm into a generalized ER framework.
Problem 1 (Rule-based ER). Rule-based ER takes U and RE as input, and outputs
U. U is a data set, RE is an ER-rule set of entity set E ¼ fe1; . . . ; emg, U ¼ fU1; .
. . ; Umg is a partition of records where each group Uj(1 _ j _ m) is a subset of U
which are determined to refer to the entity ej and [1_j_mUj is a subset of U. Our
rule-based ER algorithm R-ER scans records one by one and determines the entity
for each record. The determination process can be divided into the following steps.
First, we find all the rules satisfied by o (FINDRULES). Second, for each entity e
to which o might refer, we compute the confidence that o refers to e according to
the rules of e that are satisfied by o (COMPCONF). Third, we select the entity e
with the largest confidence to which o might refer, and if this confidence is larger
than a confidence threshold, it is determined that o refers to e (SELENTITY).
These procedures are described as follows. It takes a record to and find all the rules
that are satisfied by it. The intuitive idea is to compare with each rule. In practice,
it does not match the LHS of most of the rules. In order to find the rules whose
LHS are matched by oi efficiently, we construct an inverted index and a B-tree for
rules, denoted by LR and TR respectively. LR is for rules including clauses with ¼
and 2 operators and TR is for rules including clauses with range operators.
CONCLUSION
This paper developed a class of ER-rules which are capable to describe the
complex matching conditions between records and entities. Based on these rules,
we developed an ER algorithm R-ER. We experimentally evaluated our algorithms
on real data sets. The experimental results show that our algorithm can achieve a
good performance both on efficiency and accuracy. For future work, we would like
to extend our techniques to more general cases. For instance, how to discover ER-
rules when the operator for each attribute is not given? We would also like to
consider how to incorporate human resources, such as Crowd, into our rule-
discovery framework to improve the quality of rules.
REFERENCES
[1] N. Koudas, S. Sarawagi, and D. Srivastava, “Record linkage: Similarity
measures and algorithms,” in Proc. ACM SIGMOD Int. Conf. Manage. Data,
2006, pp. 802–803.
[2] M. Bilenko and R. J. Mooney, “Adaptive duplicate detection using learnable
string similarity measures,” in Proc. ACM SIGKDD Int. Conf. Knowl. Discovery
Data Mining, 2003, pp. 39–48.
[3] W. W. Cohen, “Integration of heterogeneous databases without common
domains using queries based on textual similarity,” ACM SIGMOD Rec., vol. 27,
no. 2, pp. 201–212, 1998.
[4] L. Gravano, P. G. Ipeirotis, N. Koudas, and D. Srivastava, “Text joins in an
RDBMS for web data integration,” in Proc. 12th Int. Conf. World Wide Web,
2003, pp. 90–101.
[5] M. A. Jaro, “Advances in record-linkage methodology as applied to matching
the 1985 census of Tampa, Florida,” J. Amer. Statist. Assoc., vol. 84, no. 406, pp.
414–420, 1989.
[6] A. Arasu, S. Chaudhuri, and R. Kaushik, “Transformation-based framework for
record matching,” in Proc. 24th Int. Conf. Data Eng., 2008, pp. 40–49.
[7] S. Chaudhuri, B. C. Chen, V. Ganti, and R. Kaushik, “Exampledriven design of
efficient record matching queries,” in Proc. 33rd Int. Conf. Very Large Databases,
2007, pp. 327–338.
[8] A. Arasu, S. Chaudhuri, and R. Kaushik, “Learning string transformations from
examples,” Proc. VLDB Endowment, vol. 2, no. 1, pp. 514–525, 2009.
[9] R. Bekkerman and A. McCallum, “Disambiguating web appearances of people
in a social network,” in Proc. 14th Int. Conf. World Wide Web, 2005, pp. 463–470.
[10] S. Tejada, C. Knoblock, and S. Minton, “Learning object identification rules
for information integration,” Inf. Syst., vol. 26, no. 8, pp. 607–633, 2001.
[11] X. Fan, J. Wang, X. Pu, L. Zhou, and B. Lv, “On graph-based name
disambiguation,” J. Data Inf. Quality, vol. 2, no. 2, p. 10, 2011.
[12] L. Shu, B. Long, and W. Meng, “A latent topic model for complete entity
resolution,” in Proc. 25th Int. Conf. Data Eng., 2009, pp. 880– 891.

More Related Content

Viewers also liked

Cost minimizing dynamic migration of content
Cost minimizing dynamic migration of contentCost minimizing dynamic migration of content
Cost minimizing dynamic migration of content
nexgentech15
 
Space efficient verifiable secret sharing
Space efficient verifiable secret sharingSpace efficient verifiable secret sharing
Space efficient verifiable secret sharing
nexgentech15
 
Detection and rectification
Detection and rectificationDetection and rectification
Detection and rectification
nexgentech15
 
User defined privacy grid system
User defined privacy grid system User defined privacy grid system
User defined privacy grid system
nexgentech15
 
Cahr a contextually adaptive home based
Cahr a contextually adaptive home basedCahr a contextually adaptive home based
Cahr a contextually adaptive home based
nexgentech15
 
A computational dynamic trust model
A computational dynamic trust modelA computational dynamic trust model
A computational dynamic trust model
nexgentech15
 
Detecting malicious facebook applications
Detecting malicious facebook applicationsDetecting malicious facebook applications
Detecting malicious facebook applications
nexgentech15
 
Bang diem 22 24
Bang diem 22 24Bang diem 22 24
Bang diem 22 24
Nam Ninh Hà
 
CV_Paul Berkowitz_July_2015
CV_Paul Berkowitz_July_2015CV_Paul Berkowitz_July_2015
CV_Paul Berkowitz_July_2015Paul Berkowitz
 
Robert Vincent Peace - FIFA World Rankings
Robert Vincent Peace - FIFA World RankingsRobert Vincent Peace - FIFA World Rankings
Robert Vincent Peace - FIFA World Rankings
Robert Vincent Peace
 
Lanfredini
LanfrediniLanfredini
Lanfredini
P1pp0Free
 

Viewers also liked (12)

Cost minimizing dynamic migration of content
Cost minimizing dynamic migration of contentCost minimizing dynamic migration of content
Cost minimizing dynamic migration of content
 
Space efficient verifiable secret sharing
Space efficient verifiable secret sharingSpace efficient verifiable secret sharing
Space efficient verifiable secret sharing
 
Detection and rectification
Detection and rectificationDetection and rectification
Detection and rectification
 
User defined privacy grid system
User defined privacy grid system User defined privacy grid system
User defined privacy grid system
 
Cahr a contextually adaptive home based
Cahr a contextually adaptive home basedCahr a contextually adaptive home based
Cahr a contextually adaptive home based
 
A computational dynamic trust model
A computational dynamic trust modelA computational dynamic trust model
A computational dynamic trust model
 
Detecting malicious facebook applications
Detecting malicious facebook applicationsDetecting malicious facebook applications
Detecting malicious facebook applications
 
Bang diem 22 24
Bang diem 22 24Bang diem 22 24
Bang diem 22 24
 
CV_Tanvir
CV_TanvirCV_Tanvir
CV_Tanvir
 
CV_Paul Berkowitz_July_2015
CV_Paul Berkowitz_July_2015CV_Paul Berkowitz_July_2015
CV_Paul Berkowitz_July_2015
 
Robert Vincent Peace - FIFA World Rankings
Robert Vincent Peace - FIFA World RankingsRobert Vincent Peace - FIFA World Rankings
Robert Vincent Peace - FIFA World Rankings
 
Lanfredini
LanfrediniLanfredini
Lanfredini
 

Similar to Rule based method for entity resolution

Rule based method for entity resolution
Rule based method for entity resolutionRule based method for entity resolution
Rule based method for entity resolution
ieeepondy
 
Learning Collaborative Agents with Rule Guidance for Knowledge Graph Reasoning
Learning Collaborative Agents with Rule Guidance for Knowledge Graph ReasoningLearning Collaborative Agents with Rule Guidance for Knowledge Graph Reasoning
Learning Collaborative Agents with Rule Guidance for Knowledge Graph Reasoning
Deren Lei
 
Rule Based Algorithms.pptx
Rule Based Algorithms.pptxRule Based Algorithms.pptx
Rule Based Algorithms.pptx
RoshanSuvedi1
 
Effectiveness of ERules in Generating Non Redundant Rule Sets in Pharmacy Dat...
Effectiveness of ERules in Generating Non Redundant Rule Sets in Pharmacy Dat...Effectiveness of ERules in Generating Non Redundant Rule Sets in Pharmacy Dat...
Effectiveness of ERules in Generating Non Redundant Rule Sets in Pharmacy Dat...
International Journal of Science and Research (IJSR)
 
Secured Ontology Mapping
Secured Ontology Mapping Secured Ontology Mapping
Secured Ontology Mapping
dannyijwest
 
Proposed Method for String Transformation using Probablistic Approach
Proposed Method for String Transformation using Probablistic ApproachProposed Method for String Transformation using Probablistic Approach
Proposed Method for String Transformation using Probablistic Approach
Editor IJMTER
 
A COMPARISON OF DOCUMENT SIMILARITY ALGORITHMS
A COMPARISON OF DOCUMENT SIMILARITY ALGORITHMSA COMPARISON OF DOCUMENT SIMILARITY ALGORITHMS
A COMPARISON OF DOCUMENT SIMILARITY ALGORITHMS
gerogepatton
 
A COMPARISON OF DOCUMENT SIMILARITY ALGORITHMS
A COMPARISON OF DOCUMENT SIMILARITY ALGORITHMSA COMPARISON OF DOCUMENT SIMILARITY ALGORITHMS
A COMPARISON OF DOCUMENT SIMILARITY ALGORITHMS
gerogepatton
 
DBMS and Rdbms fundamental concepts
DBMS and Rdbms fundamental conceptsDBMS and Rdbms fundamental concepts
DBMS and Rdbms fundamental concepts
Kuntal Bhowmick
 
A Survey of String Matching Algorithms
A Survey of String Matching AlgorithmsA Survey of String Matching Algorithms
A Survey of String Matching Algorithms
IJERA Editor
 
Improving Tools in Artificial Intelligence
Improving Tools in Artificial IntelligenceImproving Tools in Artificial Intelligence
Improving Tools in Artificial Intelligence
Bogdan Patrut
 
Parallel Materialisation of Datalog Programs in Centralised, Main-Memory RDF ...
Parallel Materialisation of Datalog Programs in Centralised, Main-Memory RDF ...Parallel Materialisation of Datalog Programs in Centralised, Main-Memory RDF ...
Parallel Materialisation of Datalog Programs in Centralised, Main-Memory RDF ...
DBOnto
 
Dbms Concepts
Dbms ConceptsDbms Concepts
Dbms Conceptsadukkas
 
Ijarcet vol-2-issue-2-676-678
Ijarcet vol-2-issue-2-676-678Ijarcet vol-2-issue-2-676-678
Ijarcet vol-2-issue-2-676-678Editor IJARCET
 
Dbms basic nots
Dbms basic notsDbms basic nots
Dbms basic nots
nitinnitinnitin
 
Optimization of Mining Association Rule from XML Documents
Optimization of Mining Association Rule from XML DocumentsOptimization of Mining Association Rule from XML Documents
Optimization of Mining Association Rule from XML Documents
IOSR Journals
 
ch7-clean.ppt
ch7-clean.pptch7-clean.ppt
ch7-clean.ppt
RAJULKUMARSUTHAR
 
A FILM SYNOPSIS GENRE CLASSIFIER BASED ON MAJORITY VOTE
A FILM SYNOPSIS GENRE CLASSIFIER BASED ON MAJORITY VOTEA FILM SYNOPSIS GENRE CLASSIFIER BASED ON MAJORITY VOTE
A FILM SYNOPSIS GENRE CLASSIFIER BASED ON MAJORITY VOTE
kevig
 

Similar to Rule based method for entity resolution (20)

Rule based method for entity resolution
Rule based method for entity resolutionRule based method for entity resolution
Rule based method for entity resolution
 
Learning Collaborative Agents with Rule Guidance for Knowledge Graph Reasoning
Learning Collaborative Agents with Rule Guidance for Knowledge Graph ReasoningLearning Collaborative Agents with Rule Guidance for Knowledge Graph Reasoning
Learning Collaborative Agents with Rule Guidance for Knowledge Graph Reasoning
 
Rule Based Algorithms.pptx
Rule Based Algorithms.pptxRule Based Algorithms.pptx
Rule Based Algorithms.pptx
 
Effectiveness of ERules in Generating Non Redundant Rule Sets in Pharmacy Dat...
Effectiveness of ERules in Generating Non Redundant Rule Sets in Pharmacy Dat...Effectiveness of ERules in Generating Non Redundant Rule Sets in Pharmacy Dat...
Effectiveness of ERules in Generating Non Redundant Rule Sets in Pharmacy Dat...
 
Secured Ontology Mapping
Secured Ontology Mapping Secured Ontology Mapping
Secured Ontology Mapping
 
Proposed Method for String Transformation using Probablistic Approach
Proposed Method for String Transformation using Probablistic ApproachProposed Method for String Transformation using Probablistic Approach
Proposed Method for String Transformation using Probablistic Approach
 
A COMPARISON OF DOCUMENT SIMILARITY ALGORITHMS
A COMPARISON OF DOCUMENT SIMILARITY ALGORITHMSA COMPARISON OF DOCUMENT SIMILARITY ALGORITHMS
A COMPARISON OF DOCUMENT SIMILARITY ALGORITHMS
 
A COMPARISON OF DOCUMENT SIMILARITY ALGORITHMS
A COMPARISON OF DOCUMENT SIMILARITY ALGORITHMSA COMPARISON OF DOCUMENT SIMILARITY ALGORITHMS
A COMPARISON OF DOCUMENT SIMILARITY ALGORITHMS
 
DBMS and Rdbms fundamental concepts
DBMS and Rdbms fundamental conceptsDBMS and Rdbms fundamental concepts
DBMS and Rdbms fundamental concepts
 
A Survey of String Matching Algorithms
A Survey of String Matching AlgorithmsA Survey of String Matching Algorithms
A Survey of String Matching Algorithms
 
Improving Tools in Artificial Intelligence
Improving Tools in Artificial IntelligenceImproving Tools in Artificial Intelligence
Improving Tools in Artificial Intelligence
 
Parallel Materialisation of Datalog Programs in Centralised, Main-Memory RDF ...
Parallel Materialisation of Datalog Programs in Centralised, Main-Memory RDF ...Parallel Materialisation of Datalog Programs in Centralised, Main-Memory RDF ...
Parallel Materialisation of Datalog Programs in Centralised, Main-Memory RDF ...
 
Hw fdb(2)
Hw fdb(2)Hw fdb(2)
Hw fdb(2)
 
Hw fdb(2)
Hw fdb(2)Hw fdb(2)
Hw fdb(2)
 
Dbms Concepts
Dbms ConceptsDbms Concepts
Dbms Concepts
 
Ijarcet vol-2-issue-2-676-678
Ijarcet vol-2-issue-2-676-678Ijarcet vol-2-issue-2-676-678
Ijarcet vol-2-issue-2-676-678
 
Dbms basic nots
Dbms basic notsDbms basic nots
Dbms basic nots
 
Optimization of Mining Association Rule from XML Documents
Optimization of Mining Association Rule from XML DocumentsOptimization of Mining Association Rule from XML Documents
Optimization of Mining Association Rule from XML Documents
 
ch7-clean.ppt
ch7-clean.pptch7-clean.ppt
ch7-clean.ppt
 
A FILM SYNOPSIS GENRE CLASSIFIER BASED ON MAJORITY VOTE
A FILM SYNOPSIS GENRE CLASSIFIER BASED ON MAJORITY VOTEA FILM SYNOPSIS GENRE CLASSIFIER BASED ON MAJORITY VOTE
A FILM SYNOPSIS GENRE CLASSIFIER BASED ON MAJORITY VOTE
 

More from nexgentech15

Privacy policy inference of user uploaded
Privacy policy inference of user uploadedPrivacy policy inference of user uploaded
Privacy policy inference of user uploaded
nexgentech15
 
Discovery of ranking fraud for mobile apps
Discovery of ranking fraud for mobile appsDiscovery of ranking fraud for mobile apps
Discovery of ranking fraud for mobile apps
nexgentech15
 
Secure auditing and deduplicating data in cloud
Secure auditing and deduplicating data in cloudSecure auditing and deduplicating data in cloud
Secure auditing and deduplicating data in cloud
nexgentech15
 
New algorithms for secure outsourcing of
New algorithms for secure outsourcing ofNew algorithms for secure outsourcing of
New algorithms for secure outsourcing of
nexgentech15
 
Identity based encryption with outsourced
Identity based encryption with outsourcedIdentity based encryption with outsourced
Identity based encryption with outsourced
nexgentech15
 
Control cloud data access privilege and
Control cloud data access privilege andControl cloud data access privilege and
Control cloud data access privilege and
nexgentech15
 
A trusted iaa s environment
A trusted iaa s environmentA trusted iaa s environment
A trusted iaa s environment
nexgentech15
 
A profit maximization scheme with guaranteed
A profit maximization scheme with guaranteedA profit maximization scheme with guaranteed
A profit maximization scheme with guaranteed
nexgentech15
 
Learning to rank image tags with limited
Learning to rank image tags with limitedLearning to rank image tags with limited
Learning to rank image tags with limited
nexgentech15
 
Collusion tolerable privacy-preserving sum
Collusion tolerable privacy-preserving sumCollusion tolerable privacy-preserving sum
Collusion tolerable privacy-preserving sum
nexgentech15
 
Automatic face naming by learning discriminative
Automatic face naming by learning discriminativeAutomatic face naming by learning discriminative
Automatic face naming by learning discriminative
nexgentech15
 
Query aware determinization of uncertain
Query aware determinization of uncertainQuery aware determinization of uncertain
Query aware determinization of uncertain
nexgentech15
 
Page a partition aware engine
Page a partition aware enginePage a partition aware engine
Page a partition aware engine
nexgentech15
 
Learning fingerprint reconstruction
Learning fingerprint reconstructionLearning fingerprint reconstruction
Learning fingerprint reconstruction
nexgentech15
 
Combining efficiency, fidelity, and flexibility in
Combining efficiency, fidelity, and flexibility inCombining efficiency, fidelity, and flexibility in
Combining efficiency, fidelity, and flexibility in
nexgentech15
 
Cloud based multimedia content protection system3
Cloud based multimedia content protection system3Cloud based multimedia content protection system3
Cloud based multimedia content protection system3
nexgentech15
 
Ieee 2105 2016 nexgen tech matlab
Ieee  2105 2016 nexgen tech matlabIeee  2105 2016 nexgen tech matlab
Ieee 2105 2016 nexgen tech matlab
nexgentech15
 
Ieee 2015 16 ns2 titles
Ieee  2015 16 ns2 titlesIeee  2015 16 ns2 titles
Ieee 2015 16 ns2 titlesnexgentech15
 

More from nexgentech15 (18)

Privacy policy inference of user uploaded
Privacy policy inference of user uploadedPrivacy policy inference of user uploaded
Privacy policy inference of user uploaded
 
Discovery of ranking fraud for mobile apps
Discovery of ranking fraud for mobile appsDiscovery of ranking fraud for mobile apps
Discovery of ranking fraud for mobile apps
 
Secure auditing and deduplicating data in cloud
Secure auditing and deduplicating data in cloudSecure auditing and deduplicating data in cloud
Secure auditing and deduplicating data in cloud
 
New algorithms for secure outsourcing of
New algorithms for secure outsourcing ofNew algorithms for secure outsourcing of
New algorithms for secure outsourcing of
 
Identity based encryption with outsourced
Identity based encryption with outsourcedIdentity based encryption with outsourced
Identity based encryption with outsourced
 
Control cloud data access privilege and
Control cloud data access privilege andControl cloud data access privilege and
Control cloud data access privilege and
 
A trusted iaa s environment
A trusted iaa s environmentA trusted iaa s environment
A trusted iaa s environment
 
A profit maximization scheme with guaranteed
A profit maximization scheme with guaranteedA profit maximization scheme with guaranteed
A profit maximization scheme with guaranteed
 
Learning to rank image tags with limited
Learning to rank image tags with limitedLearning to rank image tags with limited
Learning to rank image tags with limited
 
Collusion tolerable privacy-preserving sum
Collusion tolerable privacy-preserving sumCollusion tolerable privacy-preserving sum
Collusion tolerable privacy-preserving sum
 
Automatic face naming by learning discriminative
Automatic face naming by learning discriminativeAutomatic face naming by learning discriminative
Automatic face naming by learning discriminative
 
Query aware determinization of uncertain
Query aware determinization of uncertainQuery aware determinization of uncertain
Query aware determinization of uncertain
 
Page a partition aware engine
Page a partition aware enginePage a partition aware engine
Page a partition aware engine
 
Learning fingerprint reconstruction
Learning fingerprint reconstructionLearning fingerprint reconstruction
Learning fingerprint reconstruction
 
Combining efficiency, fidelity, and flexibility in
Combining efficiency, fidelity, and flexibility inCombining efficiency, fidelity, and flexibility in
Combining efficiency, fidelity, and flexibility in
 
Cloud based multimedia content protection system3
Cloud based multimedia content protection system3Cloud based multimedia content protection system3
Cloud based multimedia content protection system3
 
Ieee 2105 2016 nexgen tech matlab
Ieee  2105 2016 nexgen tech matlabIeee  2105 2016 nexgen tech matlab
Ieee 2105 2016 nexgen tech matlab
 
Ieee 2015 16 ns2 titles
Ieee  2015 16 ns2 titlesIeee  2015 16 ns2 titles
Ieee 2015 16 ns2 titles
 

Recently uploaded

CLASS 11 CBSE B.St Project AIDS TO TRADE - INSURANCE
CLASS 11 CBSE B.St Project AIDS TO TRADE - INSURANCECLASS 11 CBSE B.St Project AIDS TO TRADE - INSURANCE
CLASS 11 CBSE B.St Project AIDS TO TRADE - INSURANCE
BhavyaRajput3
 
Mule 4.6 & Java 17 Upgrade | MuleSoft Mysore Meetup #46
Mule 4.6 & Java 17 Upgrade | MuleSoft Mysore Meetup #46Mule 4.6 & Java 17 Upgrade | MuleSoft Mysore Meetup #46
Mule 4.6 & Java 17 Upgrade | MuleSoft Mysore Meetup #46
MysoreMuleSoftMeetup
 
Model Attribute Check Company Auto Property
Model Attribute  Check Company Auto PropertyModel Attribute  Check Company Auto Property
Model Attribute Check Company Auto Property
Celine George
 
ESC Beyond Borders _From EU to You_ InfoPack general.pdf
ESC Beyond Borders _From EU to You_ InfoPack general.pdfESC Beyond Borders _From EU to You_ InfoPack general.pdf
ESC Beyond Borders _From EU to You_ InfoPack general.pdf
Fundacja Rozwoju Społeczeństwa Przedsiębiorczego
 
Sectors of the Indian Economy - Class 10 Study Notes pdf
Sectors of the Indian Economy - Class 10 Study Notes pdfSectors of the Indian Economy - Class 10 Study Notes pdf
Sectors of the Indian Economy - Class 10 Study Notes pdf
Vivekanand Anglo Vedic Academy
 
Additional Benefits for Employee Website.pdf
Additional Benefits for Employee Website.pdfAdditional Benefits for Employee Website.pdf
Additional Benefits for Employee Website.pdf
joachimlavalley1
 
Phrasal Verbs.XXXXXXXXXXXXXXXXXXXXXXXXXX
Phrasal Verbs.XXXXXXXXXXXXXXXXXXXXXXXXXXPhrasal Verbs.XXXXXXXXXXXXXXXXXXXXXXXXXX
Phrasal Verbs.XXXXXXXXXXXXXXXXXXXXXXXXXX
MIRIAMSALINAS13
 
MARUTI SUZUKI- A Successful Joint Venture in India.pptx
MARUTI SUZUKI- A Successful Joint Venture in India.pptxMARUTI SUZUKI- A Successful Joint Venture in India.pptx
MARUTI SUZUKI- A Successful Joint Venture in India.pptx
bennyroshan06
 
Students, digital devices and success - Andreas Schleicher - 27 May 2024..pptx
Students, digital devices and success - Andreas Schleicher - 27 May 2024..pptxStudents, digital devices and success - Andreas Schleicher - 27 May 2024..pptx
Students, digital devices and success - Andreas Schleicher - 27 May 2024..pptx
EduSkills OECD
 
Unit 8 - Information and Communication Technology (Paper I).pdf
Unit 8 - Information and Communication Technology (Paper I).pdfUnit 8 - Information and Communication Technology (Paper I).pdf
Unit 8 - Information and Communication Technology (Paper I).pdf
Thiyagu K
 
The Roman Empire A Historical Colossus.pdf
The Roman Empire A Historical Colossus.pdfThe Roman Empire A Historical Colossus.pdf
The Roman Empire A Historical Colossus.pdf
kaushalkr1407
 
How to Split Bills in the Odoo 17 POS Module
How to Split Bills in the Odoo 17 POS ModuleHow to Split Bills in the Odoo 17 POS Module
How to Split Bills in the Odoo 17 POS Module
Celine George
 
How to Break the cycle of negative Thoughts
How to Break the cycle of negative ThoughtsHow to Break the cycle of negative Thoughts
How to Break the cycle of negative Thoughts
Col Mukteshwar Prasad
 
How libraries can support authors with open access requirements for UKRI fund...
How libraries can support authors with open access requirements for UKRI fund...How libraries can support authors with open access requirements for UKRI fund...
How libraries can support authors with open access requirements for UKRI fund...
Jisc
 
Supporting (UKRI) OA monographs at Salford.pptx
Supporting (UKRI) OA monographs at Salford.pptxSupporting (UKRI) OA monographs at Salford.pptx
Supporting (UKRI) OA monographs at Salford.pptx
Jisc
 
The Challenger.pdf DNHS Official Publication
The Challenger.pdf DNHS Official PublicationThe Challenger.pdf DNHS Official Publication
The Challenger.pdf DNHS Official Publication
Delapenabediema
 
Ethnobotany and Ethnopharmacology ......
Ethnobotany and Ethnopharmacology ......Ethnobotany and Ethnopharmacology ......
Ethnobotany and Ethnopharmacology ......
Ashokrao Mane college of Pharmacy Peth-Vadgaon
 
Instructions for Submissions thorugh G- Classroom.pptx
Instructions for Submissions thorugh G- Classroom.pptxInstructions for Submissions thorugh G- Classroom.pptx
Instructions for Submissions thorugh G- Classroom.pptx
Jheel Barad
 
Template Jadual Bertugas Kelas (Boleh Edit)
Template Jadual Bertugas Kelas (Boleh Edit)Template Jadual Bertugas Kelas (Boleh Edit)
Template Jadual Bertugas Kelas (Boleh Edit)
rosedainty
 
Polish students' mobility in the Czech Republic
Polish students' mobility in the Czech RepublicPolish students' mobility in the Czech Republic
Polish students' mobility in the Czech Republic
Anna Sz.
 

Recently uploaded (20)

CLASS 11 CBSE B.St Project AIDS TO TRADE - INSURANCE
CLASS 11 CBSE B.St Project AIDS TO TRADE - INSURANCECLASS 11 CBSE B.St Project AIDS TO TRADE - INSURANCE
CLASS 11 CBSE B.St Project AIDS TO TRADE - INSURANCE
 
Mule 4.6 & Java 17 Upgrade | MuleSoft Mysore Meetup #46
Mule 4.6 & Java 17 Upgrade | MuleSoft Mysore Meetup #46Mule 4.6 & Java 17 Upgrade | MuleSoft Mysore Meetup #46
Mule 4.6 & Java 17 Upgrade | MuleSoft Mysore Meetup #46
 
Model Attribute Check Company Auto Property
Model Attribute  Check Company Auto PropertyModel Attribute  Check Company Auto Property
Model Attribute Check Company Auto Property
 
ESC Beyond Borders _From EU to You_ InfoPack general.pdf
ESC Beyond Borders _From EU to You_ InfoPack general.pdfESC Beyond Borders _From EU to You_ InfoPack general.pdf
ESC Beyond Borders _From EU to You_ InfoPack general.pdf
 
Sectors of the Indian Economy - Class 10 Study Notes pdf
Sectors of the Indian Economy - Class 10 Study Notes pdfSectors of the Indian Economy - Class 10 Study Notes pdf
Sectors of the Indian Economy - Class 10 Study Notes pdf
 
Additional Benefits for Employee Website.pdf
Additional Benefits for Employee Website.pdfAdditional Benefits for Employee Website.pdf
Additional Benefits for Employee Website.pdf
 
Phrasal Verbs.XXXXXXXXXXXXXXXXXXXXXXXXXX
Phrasal Verbs.XXXXXXXXXXXXXXXXXXXXXXXXXXPhrasal Verbs.XXXXXXXXXXXXXXXXXXXXXXXXXX
Phrasal Verbs.XXXXXXXXXXXXXXXXXXXXXXXXXX
 
MARUTI SUZUKI- A Successful Joint Venture in India.pptx
MARUTI SUZUKI- A Successful Joint Venture in India.pptxMARUTI SUZUKI- A Successful Joint Venture in India.pptx
MARUTI SUZUKI- A Successful Joint Venture in India.pptx
 
Students, digital devices and success - Andreas Schleicher - 27 May 2024..pptx
Students, digital devices and success - Andreas Schleicher - 27 May 2024..pptxStudents, digital devices and success - Andreas Schleicher - 27 May 2024..pptx
Students, digital devices and success - Andreas Schleicher - 27 May 2024..pptx
 
Unit 8 - Information and Communication Technology (Paper I).pdf
Unit 8 - Information and Communication Technology (Paper I).pdfUnit 8 - Information and Communication Technology (Paper I).pdf
Unit 8 - Information and Communication Technology (Paper I).pdf
 
The Roman Empire A Historical Colossus.pdf
The Roman Empire A Historical Colossus.pdfThe Roman Empire A Historical Colossus.pdf
The Roman Empire A Historical Colossus.pdf
 
How to Split Bills in the Odoo 17 POS Module
How to Split Bills in the Odoo 17 POS ModuleHow to Split Bills in the Odoo 17 POS Module
How to Split Bills in the Odoo 17 POS Module
 
How to Break the cycle of negative Thoughts
How to Break the cycle of negative ThoughtsHow to Break the cycle of negative Thoughts
How to Break the cycle of negative Thoughts
 
How libraries can support authors with open access requirements for UKRI fund...
How libraries can support authors with open access requirements for UKRI fund...How libraries can support authors with open access requirements for UKRI fund...
How libraries can support authors with open access requirements for UKRI fund...
 
Supporting (UKRI) OA monographs at Salford.pptx
Supporting (UKRI) OA monographs at Salford.pptxSupporting (UKRI) OA monographs at Salford.pptx
Supporting (UKRI) OA monographs at Salford.pptx
 
The Challenger.pdf DNHS Official Publication
The Challenger.pdf DNHS Official PublicationThe Challenger.pdf DNHS Official Publication
The Challenger.pdf DNHS Official Publication
 
Ethnobotany and Ethnopharmacology ......
Ethnobotany and Ethnopharmacology ......Ethnobotany and Ethnopharmacology ......
Ethnobotany and Ethnopharmacology ......
 
Instructions for Submissions thorugh G- Classroom.pptx
Instructions for Submissions thorugh G- Classroom.pptxInstructions for Submissions thorugh G- Classroom.pptx
Instructions for Submissions thorugh G- Classroom.pptx
 
Template Jadual Bertugas Kelas (Boleh Edit)
Template Jadual Bertugas Kelas (Boleh Edit)Template Jadual Bertugas Kelas (Boleh Edit)
Template Jadual Bertugas Kelas (Boleh Edit)
 
Polish students' mobility in the Czech Republic
Polish students' mobility in the Czech RepublicPolish students' mobility in the Czech Republic
Polish students' mobility in the Czech Republic
 

Rule based method for entity resolution

  • 1. RULE-BASED METHOD FOR ENTITY RESOLUTION Abstract—The objective of entity resolution (ER) is to identify records referring to the same real-world entity. Traditional ER approaches identify records based on pairwise similarity comparisons, which assumes that records referring to the same entity are more similar to each other than otherwise. However, this assumption does not always hold in practice and similarity comparisons do not work well when such assumption breaks. We propose a new class of rules which could describe the complex matching conditions between records and entities. Based on this class of rules, we present the rule-based entity resolution problem and develop an on-line approach for ER. In this framework, by applying rules to each record, we identify which entity the record refers to. Additionally, we propose an effective and efficient rule discovery algorithm. We experimentally evaluated our rule-based ER algorithm on real data sets. The experimental results show that both our rule discovery algorithm and rule-based ER algorithm can achieve high performance. EXISTING SYSTEM: The work on entity resolution can be broadly divided into three categories. Pairwise ER. Most works on ER focus on record matching, which involves
  • 2. comparing record pairs and identifying whether they match. A major part of work on record matching focuses on similarity functions. To capture string variations, proposed a transformation-based framework for record matching. Some machine- learningbased approaches can identify matching strings which are syntactically far apart. Similarity based on record relationships are also proposed to solve the people identification problem. Since in our work, records are not compared with each other, our work is orthogonal to record matching. However, string similarity functions can be applied to fuzzy match operator (denoted by _) in ER-rules. For example, given a string s, we say s _ “wei wang” if the edit distance between s and “wei wang” is smaller than a given threshold. Decision trees are employed to teach record matching rules. However, decision trees cannot be used to discover ER- rules. This is because the domain of the righthand side of record matching rules is {yes, no} (two records are mapped or not mapped), while the domain of the righthand side of ER-rules is an entity set. Non-pairwise ER. The research on non- pairwise ER includes clustering strategies and classifiers. Most strategies solve ER based on the relationship graph among records, by modeling the records as nodes and the relationships as edges. Machine learning approaches are also proposed by using global information to solve ER effectively. However, these methods are not suitable for massive data because of efficiency issues. We choose a representative work for comparison.
  • 3. PROPOSED SYSTEM: This paper aims at the aforementioned problems, and the main contributions of the paper are as following. 1) The syntax and semantics of the rules for ER are designed, and the independence, consistency, completeness and validity of the rules are defined and analyzed. 2) An efficient rule discovery algorithm based on training data is proposed and analyzed. 3) An efficient rule-based algorithm for solving entity resolution problem is proposedand analyzed. 4) A rule maintaining method is proposedwhen entity information is changed. 5) Experiments are performed on real data to verify the effectiveness and efficiency of the proposed algorithms. In fact, our method and traditional ER approaches can be considered as the complementary to each other and be applied together. This is because our rule- based method can identify records which cannot be resolved by traditional ER methods and traditional ER methods can identify most of the records effectively
  • 4. and do not require the availability of correct entity set. In this way, the limitations of both methods can be overcome. Module 1 Rules for entity resolution In this section, a rule system for entity resolution, called ERrule, is defined. (1)The If clause includes constraints on attributes of records, such as “including zhang in coauthors”, and (2)Then clause indicates the real world entity referred by the records that satisfy the first clause of the rule, such as “refers to entity e1”. Thus, we use A ) B to express the rules “8o, If o satisfies A Then o refers to B” for ER. We denote the left-hand side and the right-hand side of a rule r as LHSðrÞ and RHSðrÞ respectively. Syntax An ER-rule is where clause with the form of (Ai opi vi), (vi opi Ai), :(Aiopi vi) or :(vi opi Ai), where Ai is an attribute, vi is a constant in the domain of Ai and opi can be any domaindependent operator defined by users, such as exact match operator ¼, fuzzy match operator _ [16] for string value, for numeric value, or 2 for set value. The clausewith form (Ai opi vi) or (vi opi Ai) is calledpositive clause, and the clause with form :(Ai opi vi) or :(vi opiAi) is callednegative clause.
  • 5. Semantics In the following definitions, we let o be a record, S be a data set, r be an ER-rule and R be an ER-rule set. For the convenience of discussion, we assume the mapping from each record in S to its actual entity is given. Since an ER-rule does not include disjoint clauses, we define the condition of matching the left-hand and righthand sides of rule as follows. Definition 1. 0 matches the LHS of r if o satisfies all the clauses in LHSðrÞ. o matches the RHS of r if o refers to entity RHSðrÞ. Module 2 Properties of ER-Rule Set Given an ER-rule set R and a data set S, to ensure R performs well on S, we require that (1) there is no false matches between record and entity (validity); (2) there is no conflicting decisions by R (consistency); (3) each record in S can be mapped to an entity by R (completeness) and (4) there is no redundant rules in R (independence). Now we present the formal definitions of these properties. Definition 4 (Validity). R is valid for S if each ER-rule in R is valid for S. Module 3
  • 6. RULE DISCOVERY Since it might be too expensive to construct ER-rules manually, we discuss how to discover useful rules from a training data set for efficient and effective entity resolution in this section. We assume that the operator on each attribute can be any domain-dependent operator defined by users. First, we discuss the requirements of the discovered rule sets and present our framework of rule discovery. Then we describe the algorithms in the rule discovery framework and study the correctness and complexity of our algorithm. For exposition, proofs have been deferred to the Appendix. For the convenience of discussion, some concepts are introduced first. We classify ER-rules into two categories according to whether negative clauses are included. Definition 8 (PR). PR is an ER-rule which only includes positive clauses. Module 4 Requirements for rule Discovery Even though these properties are satisfied on the training data set, it cannot be ensured that the generated rule set can also perform well on the other data sets. To make the rule set suitable for ER for many data sets other than only suitable for the training data set, we require the
  • 7. discovered ERrule set, denoted by R, should also satisfy two requirements described as follows. _ Length Requirement: Given a threshold l, each rule r in R satisfies that jrj To determine whether record o matches the LHS of ERrule r, we should check whether o satisfies each clause in LHSðrÞ. Thus to guarantee the efficiency of rule-based ER (R-ER) and avoid overfitting, the length of each rule (the number of clauses) should be no more than a threshold. _ PR Requirement: each rule r in R is a PR. The reasonwhywe give priority toPRs is that,positiveliterals lead to bounded spaceswhile negative literals lead to unbounded spaces. Therefore the discovered PRs are more possible to identify other data sets effectively than the discoveredNRs. Module 5 Rule-basedentity resolution In this section, we discuss the algorithm of entity resolution by leveraging ER- rules. We first define the rule-based ER problem. Next we develop an online algorithm for rulebased ER problem. Finally, we describe how to incorporate this algorithm into a generalized ER framework. Problem 1 (Rule-based ER). Rule-based ER takes U and RE as input, and outputs
  • 8. U. U is a data set, RE is an ER-rule set of entity set E ¼ fe1; . . . ; emg, U ¼ fU1; . . . ; Umg is a partition of records where each group Uj(1 _ j _ m) is a subset of U which are determined to refer to the entity ej and [1_j_mUj is a subset of U. Our rule-based ER algorithm R-ER scans records one by one and determines the entity for each record. The determination process can be divided into the following steps. First, we find all the rules satisfied by o (FINDRULES). Second, for each entity e to which o might refer, we compute the confidence that o refers to e according to the rules of e that are satisfied by o (COMPCONF). Third, we select the entity e with the largest confidence to which o might refer, and if this confidence is larger than a confidence threshold, it is determined that o refers to e (SELENTITY). These procedures are described as follows. It takes a record to and find all the rules that are satisfied by it. The intuitive idea is to compare with each rule. In practice, it does not match the LHS of most of the rules. In order to find the rules whose LHS are matched by oi efficiently, we construct an inverted index and a B-tree for rules, denoted by LR and TR respectively. LR is for rules including clauses with ¼ and 2 operators and TR is for rules including clauses with range operators. CONCLUSION This paper developed a class of ER-rules which are capable to describe the complex matching conditions between records and entities. Based on these rules,
  • 9. we developed an ER algorithm R-ER. We experimentally evaluated our algorithms on real data sets. The experimental results show that our algorithm can achieve a good performance both on efficiency and accuracy. For future work, we would like to extend our techniques to more general cases. For instance, how to discover ER- rules when the operator for each attribute is not given? We would also like to consider how to incorporate human resources, such as Crowd, into our rule- discovery framework to improve the quality of rules. REFERENCES [1] N. Koudas, S. Sarawagi, and D. Srivastava, “Record linkage: Similarity measures and algorithms,” in Proc. ACM SIGMOD Int. Conf. Manage. Data, 2006, pp. 802–803. [2] M. Bilenko and R. J. Mooney, “Adaptive duplicate detection using learnable string similarity measures,” in Proc. ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2003, pp. 39–48. [3] W. W. Cohen, “Integration of heterogeneous databases without common domains using queries based on textual similarity,” ACM SIGMOD Rec., vol. 27, no. 2, pp. 201–212, 1998.
  • 10. [4] L. Gravano, P. G. Ipeirotis, N. Koudas, and D. Srivastava, “Text joins in an RDBMS for web data integration,” in Proc. 12th Int. Conf. World Wide Web, 2003, pp. 90–101. [5] M. A. Jaro, “Advances in record-linkage methodology as applied to matching the 1985 census of Tampa, Florida,” J. Amer. Statist. Assoc., vol. 84, no. 406, pp. 414–420, 1989. [6] A. Arasu, S. Chaudhuri, and R. Kaushik, “Transformation-based framework for record matching,” in Proc. 24th Int. Conf. Data Eng., 2008, pp. 40–49. [7] S. Chaudhuri, B. C. Chen, V. Ganti, and R. Kaushik, “Exampledriven design of efficient record matching queries,” in Proc. 33rd Int. Conf. Very Large Databases, 2007, pp. 327–338. [8] A. Arasu, S. Chaudhuri, and R. Kaushik, “Learning string transformations from examples,” Proc. VLDB Endowment, vol. 2, no. 1, pp. 514–525, 2009. [9] R. Bekkerman and A. McCallum, “Disambiguating web appearances of people in a social network,” in Proc. 14th Int. Conf. World Wide Web, 2005, pp. 463–470. [10] S. Tejada, C. Knoblock, and S. Minton, “Learning object identification rules for information integration,” Inf. Syst., vol. 26, no. 8, pp. 607–633, 2001. [11] X. Fan, J. Wang, X. Pu, L. Zhou, and B. Lv, “On graph-based name disambiguation,” J. Data Inf. Quality, vol. 2, no. 2, p. 10, 2011.
  • 11. [12] L. Shu, B. Long, and W. Meng, “A latent topic model for complete entity resolution,” in Proc. 25th Int. Conf. Data Eng., 2009, pp. 880– 891.