Entity Resolution in Large Graphs
MS Project Report
Entity Resolution in Graph Data
Under guidance of Prof. Yoav Freund and Dr. Amarnath Gupta
Apurva Kumar
Department of Computer Science and Engineering, UC San Diego
apk005@cs.ucsd.edu
Abstract
Entity Resolution is a problem that occurs in many information
integration processes and applications. The primary goal of entity
resolution is to resolve data references that correspond to the same
real-world entity. Ambiguity in references arises in various networks,
such as social networks, biological networks, citation graphs, and
many others. Ambiguity in references leads not only to data redundancy
but also to inaccuracies in knowledge representation, extraction, and
query processing. There have been many approaches, such as pair-wise
similarity over the attributes of references, a parallel approach for
morphing the graph data onto a cluster of nodes (P-Swoosh) [2], and
relational clustering, which makes use of relational information in
addition to attribute similarity. In this project, we use relational
clustering to resolve author-name ambiguities in a subset of a
real-world dataset: a US patent network consisting of more than
650,000 author references.
General Terms Entity, references, buckets.
Keywords Entity Resolution, Name disambiguation, algorithms,
relational clustering.
1. Introduction
Entity Resolution is a generic problem that is currently embedded
in various networks, such as biological networks; it is spread across
multiple social networks such as Facebook, Google+, LinkedIn,
Twitter, and Yahoo! Social; even in search engines such as Google,
Bing, and Yahoo! we can see multiple results for the same search
topic. Other examples include author search in Google Scholar,
DBLP, and numerous other citation networks. In all these
applications, there are a variety of ways of referring to the same
underlying object.
Given a collection of objects, we want to a) determine the collection
of 'true' underlying entities and b) correctly map the object
references in the collection to these entities. This problem is
inherent throughout computer science. Examples include computer
vision, where we need to figure out when regions in two different
images refer to the same underlying object (also known as the
correspondence problem); natural language processing, where we
would like to determine which noun phrases refer to the same
underlying entity (co-reference resolution); and databases, where,
when merging two databases or cleaning a database, we would like
to determine when two records are referring to the same underly-
ing individual (deduplication). In digital libraries, a big challenge
is to automatically group bibliographic references that refer to the
same publication. This problem is known as citation matching. Not
only publications, but also authors, venues, etc. are often referred
to by different representations and need to be disambiguated. For
example, the dataset might describe publications written by two
people, Johny Chao and Jane Chao, but refer to both of them as J.
Chao, leading to ambiguity. This is a more general disambiguation
challenge. It is also known as fuzzy grouping [10] and object con-
solidation.
To formalize the entity resolution problem, let us consider a dataset D
that describes a set of entities E = {e1, e2, ..., em} and the relationships
in which they participate. Entities can be of different types, such as
publications, authors, search topics, and venues. Entities in D
are represented by a set of instantiated attributes R = {r1, r2, ..., rn},
referred to as entity representations or references. The goal is to
correctly group the representations in R that co-refer, that is, refer
to the same entity. Entity resolution (ER) is the process of identifying
and merging records judged to represent the same real-world
entity.
In this project we address the author name disambiguation
problem in the U.S. patent citation network [25], which originally
consists of 3,774,768 author references. From it, we extracted
651,877 author references: all patents filed by US-based authors
from 1975 to 1999 who have only a first name and last name
(301,877 references), plus 350,000 references for authors who also
have a middle name in addition to first and last name. We
restricted our input to these numbers because the jobs were memory
bound and a single machine can handle approximately this many
references at a time. To solve this problem, we considered
numerous approaches, as explained in Section 2; Sections 4 and 5
explain entity resolution as a graph-based approach and describe an
unsupervised clustering algorithm [2] to solve it. Section 6 explains
the dataset, experiments, and results. We used a graph-based
unsupervised relational clustering algorithm, which produced
good results, as can be seen in Section 6.
2. Related Work
Many previous works have tackled the entity resolution problem
as a manual healing or data mining problem in different domains.
Some of the approaches include both historical and fundamental
ones: (a) manual assignment by librarians [Scoville et al. 2003;
MathSciNet1]; (b) community-based efforts [WikiAuthors2]; (c)
unsupervised clustering that groups articles by similarity [Han et
al. 2005; Soler 2007; Yin et al. 2007]; (d) supervised methods that
utilize manually compiled training sets [Han et al. 2004; Reuther
Project report on Entity Resolution in Graph Data. 1 2013/3/19
and Walter 2006; On et al. 2005]; (e) pair-wise attribute analysis
by attribute comparison (Author-ity model; Vetle I. Torvik and
Neil R. Smalheiser 2009) [4]; (f) methods that go beyond pairwise
analysis of explicit information to analyze graphs and implicit in-
formation [Bhattacharya and Getoor 2006, 2007; Huang et al. 2006;
Culotta and McCallum 2006; Kalashnikov and Mehrotra 2006; Cu-
lotta et al. 2007; Galvez and Moya-Anegon 2007].
Author name disambiguation is also closely related to several
other data mining problems such as record linkage in adminis-
trative databases [Jaro 1995; Winkler 1995; Koudas et al. 2006],
authorship attribution of anonymous or disputed documents using
stylometry [Holmes et al. 2001; Madigan et al. 2005], and entity
resolution, for example, mentions of a personal name across multi-
ple different websites [Mann and Yarowsky 2003].
There are different kinds of information present in a dataset that
an approach can utilize for disambiguation: attributes, context, re-
lationships, etc. The traditional approach to solve the disambigua-
tion challenge is to compare the values in attributes of two entity
descriptions (references) to decide whether they co-refer (i.e., re-
fer to the same real-world entity), for all pairs of entity references.
These approaches are commonly known as feature-based similarity
(FBS) methods [7, 8]. Recently, there has been a number of inter-
esting techniques developed that enhance the traditional techniques
by being able to utilize certain types of context to improve the qual-
ity [9, 10, 11]. All these methods utilize directly linked entities as
additional information when computing record similarity, also re-
ferred to as context-based methods. There are also techniques that can
analyze another type of information: inter-entity relationships [12,
13]. In addition to analyzing the information present in the dataset,
some of the disambiguation techniques can also analyze meta-information
about the dataset, specified by the analyst [13]. A few disambiguation
approaches also model the dependence among the co-reference
decisions and no longer make pair-wise matching decisions
independently. These are called relational methods [2, 16,
17, 18, 19].
The motivation for solving this problem using graphical models
is that the dataset [6] that we are using consists of graph data.
There are different graphical disambiguation techniques, which
visualize different graphs:
2.1 Internet graph
To disambiguate appearances of people on the internet, it has been
proposed to analyze the Web graph [21], wherein the webpages are
represented as nodes and hyperlinks as edges.
2.2 Co-reference dependency graph
Some of the relational approaches [16] visualize the dependence
(edges) among co-reference decisions (nodes) as a graph.
2.3 Entity-relationship graph
In the standard entity-relationship graph, the nodes represent the
entities in the dataset and the edges represent the relationships
among the entities [2, 3, 12, 13, 14]. The technique presented in
[2, 3] was further extended to make it self-adaptive to the
underlying data, thus avoiding the need for a domain analyst.
The technique used in this project is an unsupervised relational
clustering algorithm [1, 2] that makes use of both attribute similarity
and dependence among co-references (neighbourhood similarity).
It is a graphical approach, as it visualizes the dataset as the standard
entity-relationship graph.
3. A motivating example for Entity Resolution
In this section, let us consider an example that illustrates collective
resolution of author references. Figure 1 shows four papers
on a similar topic, say "Topic Modelling" in machine learning,
each with its own author references and an optional institution
attribute for each author. Figure 1a shows paper P1 with
two author references, J. Ullman and Andrew Ng; Andrew Ng
has an institution attribute with the value Stanford. Figure 1b
shows paper P2 with author references J.D. Ullman and Andrew Ng.
Figure 1c shows four author references: K. Chen, Andrew Ng,
J.D. Ullman, and Di Mario; every author except J.D. Ullman has
Stanford as the institution attribute value. Figure 1d shows the
fourth paper, with three author references: J. Dean, J. Ullman, and
Rajat Monga. Two of its authors, J. Dean and Rajat Monga, have
Google as their institution attribute value.
In all, there are eleven author references. Seven of them carry five
unique names (Andrew Ng, K. Chen, Di Mario, Rajat Monga, and
J. Dean), so these seven references reduce to five entities.
Among these seven references, three correspond to Andrew Ng,
who, after gathering evidence from Figures 1a, 1b, and 1c, can be
reduced to a single entity with the institution attribute value
Stanford. The four remaining references out of the eleven are
J. Ullman (1a), J.D. Ullman (1b), J.D. Ullman (1c), and J. Ullman (1d).
From Figures 1a and 1b alone, it is not clear whether J. Ullman
and J.D. Ullman are the same entity. But since Andrew Ng wrote
two papers, P2 and P3, with the same author J.D. Ullman, and since
Andrew Ng and the two other authors K. Chen and Di Mario share
the institution attribute value Stanford in (c), J.D. Ullman appears
to work with authors from Stanford. On the other hand, J. Ullman
worked on paper P4 with the author references J. Dean and
Rajat Monga, who share the institution attribute value Google.
So, based on the cumulative evidence from Figures 1a, 1b, 1c,
and 1d, it is clear that the two ambiguous references J. Ullman
and J.D. Ullman are not the same but two distinct entities. Thus, entity
resolution in this example made use of both attribute similarity and
relational (neighborhood) similarity to resolve the author name
ambiguities.
4. Entity Resolution using Relationships
4.1 Problem
In this section we formally define the notation that we will use
to represent entity resolution graphs. In the entity resolution
problem, we are given a set of references R = {ri}, where each
reference r has attributes r.A1, r.A2, ..., r.Ak. The references
correspond to some set of unknown entities E = {ei}. We introduce
the notation r.E to refer to the entity to which reference r
corresponds. The problem is to recover the hidden set of entities
E = {ei} and the entity labels r.E for individual references, given
the observed attributes of the references. In addition to the
attributes, we assume that the references are not observed
independently but that they co-occur. We describe the co-occurrence
with a set of hyper-edges H = {hi}.
Each hyper-edge h may have attributes as well, which we denote
h.A1, h.A2, ..., h.Ak, and we use h.R to denote the set of refer-
ences that it connects. A reference r can belong to zero or more
hyper-edges and we use r.H to denote the set of hyper-edges in
which r participates. In this project, we only consider entity resolu-
tion when each reference is associated with zero or one hyper-edge,
but in other domains it is possible for multiple hyper-edges to share
references. For example, if we have paper, author and venue refer-
ences, then a paper reference may be connected to multiple author
references and also to a venue reference. Let us now illustrate how
Figure 1. (a), (b), (c), (d) correspond to four papers P1, P2, P3 and P4
on "Topic Modelling". Each paper has some author references; some are
unique, e.g. K. Chen in (c), but some author references occur in more
than one paper, e.g. Andrew Ng occurs in (a), (b) and (c), and he can
be resolved to a single entity based on attribute similarity (Stanford) from
(a) and (c) and relational similarity from (b) and (c), sharing co-author
J.D. Ullman in both. Some author references are ambiguous; for instance,
from (a) and (b) it is not clear whether J. Ullman and J.D. Ullman refer
to the same author. Only after gathering relational evidence from Andrew Ng
and a unique neighborhood match for J.D. Ullman in (c) and J. Ullman in (d)
can these two references be resolved to two distinct entities.
Figure 2. (a), (b), (c) and (d) are the same as the corresponding figures in
Figure 1. Each edge between two author references has been labeled. Figures
(c') and (d') represent the reduced entities of the author references in (a)-(d).
Each node in (c') and (d') corresponds to the different reference labels that
each author carried in papers P1-P4, and an edge between two authors may
bundle edges from other references, representing a hyper-edge.
For e.g., in fig (c') Andrew Ng and J.D. Ullman share edges h2 and h5.
our running example is represented in this notation.
Figure 2 shows the references and hyper-edges corresponding
to Figure 1. Each observed author name corresponds to a reference,
so there are eleven references r1 through r11. In this case,
the names are the only attributes of the references, so, for example,
r1.A is J. Ullman, r2.A is Andrew Ng, r3.A is J.D. Ullman,
r5.A is K. Chen, and so on. The set of true entities is E = {Andrew
Ng, K. Chen, Di Mario, Rajat Monga, J. Dean, J. Ullman, J.D. Ullman},
as shown in Figure 2(c', d'). References r2 and r4 correspond
to Andrew Ng, so that r2.E = r4.E = Andrew Ng. Similarly,
r1.E = r9.E = J. Ullman, and so on. There are also the
hyper-edges H = {H1, H2, H3, H4}, where H1 = {h1}, H2 =
{h2}, H3 = {h3, h4, h5, h6} and H4 = {h7, h8, h9}. The
attributes of the hyper-edges in this domain are the paper titles; for
example, H1.A1 = "Unsupervised topic modeling", and H1.R = {r1, r2}
is the set of author references observed in P1. Similarly, the references
r2, r3, r4, r7 are associated with hyper-edges h2 and h5, since they
are the observed author references (Andrew Ng and J.D. Ullman)
in P2 and P3. Each reference likewise records the hyper-edges it
belongs to, and this is done for all the references in the same way.
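To make the notation concrete, the references and hyper-edges above can be sketched as small data structures. This is a hypothetical sketch; the class and field names are our illustrative choices, not from the report.

```python
from dataclasses import dataclass, field
from typing import Optional, Set

@dataclass
class Reference:
    """An observed author mention r, with attribute r.A (the name),
    hidden entity label r.E, and hyper-edge set r.H."""
    rid: int
    name: str                                           # r.A
    entity: Optional[str] = None                        # r.E, to be recovered
    hyper_edges: Set[int] = field(default_factory=set)  # r.H

@dataclass
class HyperEdge:
    """A co-occurrence h (here, one paper) with attribute h.A1 (the title)
    and the set h.R of reference ids it connects."""
    hid: int
    title: str                                          # h.A1
    refs: Set[int] = field(default_factory=set)         # h.R

# Paper P1 from the running example connects r1 (J. Ullman) and r2 (Andrew Ng).
r1 = Reference(1, "J. Ullman")
r2 = Reference(2, "Andrew Ng")
h1 = HyperEdge(1, "Unsupervised topic modeling", refs={1, 2})
for r in (r1, r2):
    r.hyper_edges.add(h1.hid)   # each reference records the edges it is in
```

Resolution then amounts to filling in the `entity` fields so that co-referring references share a label.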
4.2 Collective Relational Entity Resolution
The goal of collective relational entity resolution (CR) is to make
resolutions using a dependency model so that one resolution de-
cision affects other resolutions via hyper-edges. We now moti-
vate entity resolution as a clustering problem and propose a re-
lational clustering algorithm for collective relational entity reso-
lution. Given any similarity measure between pairs of references,
entity resolution can be posed as a clustering problem where the
goal is to cluster the references so that only those that correspond
to the same entity are assigned to the same cluster. We use a greedy
agglomerative clustering algorithm where, at any stage, the current
set C = {ci} of entity clusters reflects the current belief about
the mapping of the references to entities. We use r.C to denote
the current cluster label for a reference; references that have the
same cluster label correspond to the same entity. So far, we have
discussed similarity measures for references; for the clustering ap-
proach to entity resolution, we need to define similarities between
clusters of references. For collective entity resolution, we define
the similarity of two clusters ci and cj as:
sim(ci, cj) = (1 − α) × simA(ci, cj) + α × simR(ci, cj),
where 0 ≤ α ≤ 1,
where simA() is the similarity of the attributes and simR()
is the relational similarity between the references in the two entity
clusters. From the equation, we can see that it reduces to
attribute-based similarity for α = 0. The relational component of
the similarity measures both the attribute similarity of the related
references that are connected through hyper-edges and the labels of
the related clusters that represent entities. This similarity is dynamic
in nature, which is one of the most important and interesting aspects
of the collective approach. For attribute-based and naive relational
resolution, the similarity between two references is fixed [2].
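As a minimal sketch, the weighting is exactly the convex combination above; the helper names and the example α values are illustrative, not from the report.

```python
def combined_sim(ci, cj, sim_attr, sim_rel, alpha=0.5):
    """sim(ci, cj) = (1 - alpha) * simA(ci, cj) + alpha * simR(ci, cj),
    with 0 <= alpha <= 1. alpha = 0 reduces to pure attribute-based
    resolution; alpha = 1 uses relational evidence only."""
    assert 0.0 <= alpha <= 1.0
    return (1.0 - alpha) * sim_attr(ci, cj) + alpha * sim_rel(ci, cj)

# With alpha = 0 only the attribute component contributes:
s0 = combined_sim("c1", "c2", lambda a, b: 0.8, lambda a, b: 0.2, alpha=0.0)
# With alpha = 1 only the relational component contributes:
s1 = combined_sim("c1", "c2", lambda a, b: 0.8, lambda a, b: 0.2, alpha=1.0)
```

Note that because simR depends on the current cluster labels, in the real algorithm it is recomputed as clusters merge, which is what makes the measure dynamic.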
In contrast, for collective resolution, the similarity of two clus-
ters depends on the current cluster labels of their neighbors, and
therefore changes as their labels are updated. The references in
each cluster c are connected to other references via hyper-edges.
For collective entity resolution, relational similarity considers the
cluster labels of all these connected references. Recall that each
reference r is associated with one or more hyper-edges in H. There-
fore, the set of hyper-edges c.H that we need to consider for an
entity cluster c is defined as
c.H = ∪{h|h ∈ H ∧ r ∈ h.R}, where r ∈ R ∧ r.C = c
The hyper-edges connect c to other clusters. The relational sim-
ilarity for two clusters needs to compare their connectivity patterns
to other clusters. For any cluster c, the set of other clusters to which
c is connected via its hyper edge set c.H form the neighborhood
Figure 3. Pseudo-code for the relational clustering algorithm used in this
project. Original source: [2]
Nbr(c) of cluster c:
Nbr(c) = ∪{cj|cj = r.C}, where h ∈ c.H, r ∈ h.R
This defines the neighborhood as a set of related clusters, but the
neighborhood can also be defined as a bag or multi-set, in which the
multiplicity of the different neighboring clusters is preserved. We
will use NbrB(ci) to denote the bag of neighboring clusters. In our
example in Figure 2, the neighborhood of the cluster for J. Ullman
consists of the clusters for Andrew Ng, J. Dean and Rajat Monga.
Note that we do not constrain the definition of the neighborhood of
a cluster to exclude the cluster itself.
For measuring the attribute similarity, we use Levenshtein
distance [23]; for neighborhood similarity, we use the Jaccard
coefficient with frequency [24]. For the relational similarity
between two clusters, we look for commonness in their neighborhoods.
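A minimal sketch of these two measures follows; the normalisation of the edit distance and the exact bag-based Jaccard variant are our illustrative choices.

```python
from collections import Counter

def levenshtein(a: str, b: str) -> int:
    """Edit distance via the classic dynamic program, row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,               # delete ca
                           cur[j - 1] + 1,            # insert cb
                           prev[j - 1] + (ca != cb))) # substitute
        prev = cur
    return prev[-1]

def name_sim(a: str, b: str) -> float:
    """Attribute similarity: edit distance normalised into [0, 1]."""
    longest = max(len(a), len(b)) or 1
    return 1.0 - levenshtein(a, b) / longest

def jaccard_freq(nbr_a: Counter, nbr_b: Counter) -> float:
    """Jaccard coefficient over neighbourhood bags (multi-sets),
    so the multiplicity of shared neighbouring clusters counts."""
    inter = sum((nbr_a & nbr_b).values())
    union = sum((nbr_a | nbr_b).values())
    return inter / union if union else 0.0
```

For instance, `levenshtein("J. Ullman", "J.D. Ullman")` is 2 (the inserted "D."), so the two Ullman variants score highly on attribute similarity alone, which is exactly why relational evidence is needed to keep them apart.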
5. Relational clustering algorithm
Given the similarity measure for a pair of reference clusters, we use
a greedy agglomerative clustering algorithm [2] that finds the clos-
est cluster pair at each iteration and merges them. Figure 3 shows
the pseudo-code for the algorithm. In this section, we walk through
its implementation and discuss several important implementation
and performance issues regarding the relational clustering
algorithm.
5.1 Blocking
Since comparing all the references is costly, O(n²), unless the
dataset is small, blocking techniques [Hernandez and Stolfo 1995;
Monge and Elkan 1997; McCallum et al. 2000] [2] are used to
reduce the number of reference pairs that will result in a non-match.
On a single machine with 4-8 GB of RAM, it is impractical to
consider all possible pairs as potential candidates for merging. Apart
from the scaling limitation, most pairs checked by an O(n²)
approach will be rejected, since usually only about 1% of all pairs
are true matches [2]. The blocking technique used is to separate
references into possibly overlapping buckets and only pairs of ref-
erences within each bucket are considered as potential matches.
The relational clustering algorithm uses the blocking method as
a black-box and any method that can quickly identify potential
matches minimizing false negatives can be used. We use a variant
of an algorithm proposed by McCallum et al. [2000] that we briefly
describe below.
This algorithm just makes a single pass over the list of refer-
ences and assigns them to buckets using an attribute similarity mea-
sure. To find the best potential bucket for a reference, each bucket
has a representative reference that is the most similar to all ref-
erences currently in the bucket. For assigning any reference, it is
compared to the representative for each bucket. It is assigned to all
buckets for which the similarity is above a threshold. If no simi-
lar bucket is found, a new bucket is created for this reference. A
naive implementation yields an O(n(b + f)) algorithm for n references
and b buckets, when a reference is assigned to at most
f buckets. This can be improved by maintaining an inverted index
over buckets; this implementation detail was adapted from [2].
For example, when dealing with names,
for each character we maintain the list of buckets storing last names
starting with that character. This helps in finding the right potential
set of buckets in O(1) time for each reference, leading to an O(nf)
algorithm.
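A simplified sketch of this bucketing pass is shown below. As a simplification of the "most similar to all members" rule described above, each bucket's representative here is just its first member, and the inverted index is keyed on the first character of the name.

```python
from collections import defaultdict

def block_references(names, sim, threshold=0.8):
    """Single-pass bucketing in the spirit of McCallum et al. [2000]:
    a reference joins every existing bucket whose representative is
    similar enough, else it starts a new bucket. Illustrative sketch."""
    buckets = []                  # (representative, members) pairs
    index = defaultdict(list)     # first character -> bucket ids
    for name in names:
        key = name[:1].lower()
        placed = False
        for bid in index[key]:    # only buckets sharing the first character
            rep, members = buckets[bid]
            if sim(name, rep) >= threshold:
                members.append(name)          # may join several buckets
                placed = True
        if not placed:
            buckets.append((name, [name]))    # new bucket; name is its rep
            index[key].append(len(buckets) - 1)
    return [members for _, members in buckets]
```

With an equality-based similarity, `block_references(["ann", "ann", "bob"], lambda a, b: 1.0 if a == b else 0.0)` yields two buckets, one per distinct name.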
5.2 Bootstrapping
This phase of the relational clustering algorithm is an iterative loop
that utilizes clustering decisions made in previous iterations to
make new decisions. It is done by measuring the shared neighborhood
of similar clusters, as explained in Section 4.2. The
problem here is that if we begin with each reference in a distinct
cluster, then initially there are no shared neighbors for references
that belong to different hyper-edges. So the initial iterations of the
algorithm have no relational evidence to depend on. Therefore, the
relational component of the similarity between clusters would be
zero and merges would occur based on attribute similarity alone.
Many of such initial merges can be inaccurate, particularly for the
references with ambiguous attribute values. To avoid this, we need
to bootstrap the clustering algorithm such that each reference is not
assigned to a distinct cluster. Specifically, if we are confident that
some reference pair is coreferent, then they should be assigned to
the same initial cluster.
However, precision is crucial for the bootstrap process, since our
algorithm cannot undo any of these initial merge operations. In this
subsection, we describe our bootstrapping scheme for relational
clustering that makes use of the hyper-edges for improved boot-
strap performance. The basic idea is very similar to the naive rela-
tional approach [2], with the difference that we use exact matches
instead of similarity for attributes. To determine if any two refer-
ences should be assigned to the same initial cluster, we first check
if their attributes match exactly. For references with ambiguous at-
tributes, we also check if the attributes of their related references
match. This approach is covered in depth in the following two
paragraphs.
The bootstrap scheme goes over each reference pair that is po-
tentially coreferent (as determined by blocking) and determines
if it is a bootstrap candidate. First, consider the simple bootstrap
scheme that looks only at the attributes of two references. Refer-
ences with ambiguous attribute values are assigned to distinct clus-
ters. Any reference pair whose attribute values match and are not
ambiguous is considered to be a bootstrap candidate. The problem
with this simple approach that recall can be very poor for datasets
with large ambiguous references if it assigns all references with
ambiguous attributes to distinct clusters. When hyper-edges are
available, they can be used as further evidence for bootstrapping
ambiguous references.
A pair of ambiguous references forms a bootstrap candidate if
their hyper-edges match. Two hyper-edges h1 and h2 are said to
have a k-exact-match if there are at least k pairs of references
(ri, rj), ri ∈ h1.R, rj ∈ h2.R, with exactly matching attributes,
i.e., ri.A = rj.A. Two references r1 and r2 are bootstrap candidates
if any pair of their hyper-edges has a k-exact-match. Referring
to our example from Figures 1 and 2, two references with the
name 'J. Ullman' will not be merged during bootstrapping on the
basis of the name alone. However, if the first Ullman reference has
co-authors 'A. Ng' and 'K. Chen', and the second Ullman has a
co-author 'A. Ng' in some paper, then they have a 1-exact-match
and, depending on the threshold for k, they would be merged. The
value of k for the hyper-edge test depends
on the ambiguity of the domain. A higher value of k should be used
for domains with high ambiguity. Other attributes of the references,
and also of the hyper-edges, when available, can be used to further
constrain bootstrap candidates.
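The k-exact-match test can be sketched as follows. Here attributes are just name strings, and the set intersection is a simplification that ignores duplicate names within one hyper-edge.

```python
def k_exact_match(h1_names, h2_names, k=1):
    """True if hyper-edges with author-name sets h1_names and h2_names
    have at least k pairs of references with exactly matching attributes.
    A higher k suits more ambiguous domains."""
    return len(set(h1_names) & set(h2_names)) >= k

# The two Ullman references above: co-authors {A. Ng, K. Chen} vs {A. Ng}.
ok_k1 = k_exact_match({"A. Ng", "K. Chen"}, {"A. Ng"}, k=1)  # a 1-exact-match
ok_k2 = k_exact_match({"A. Ng", "K. Chen"}, {"A. Ng"}, k=2)  # not a 2-exact-match
```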
Two references are considered only if these other attributes do
not conflict. In the bibliographic domain, author references from
two different papers can be merged only if their institutions and
correspondence addresses match. After the bootstrap candidates are
identified, the initial clusters are created using the union-find ap-
proach so that any two references that are bootstrap candidates are
assigned to the same initial cluster. In addition to improving accu-
racy of the relational clustering algorithm, bootstrapping reduces
execution time by significantly lowering the initial number of clus-
ters without having to find the most similar cluster-pairs or perform
expensive similarity computations.
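The union-find step that builds the initial clusters can be sketched as below; this is a generic disjoint-set pass, not the report's exact implementation.

```python
def bootstrap_clusters(n, candidate_pairs):
    """Union-find over bootstrap candidate pairs: any two references
    judged coreferent end up in the same initial cluster."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path compression
            x = parent[x]
        return x

    for a, b in candidate_pairs:
        parent[find(a)] = find(b)           # union the two components

    clusters = {}                           # group references by root
    for r in range(n):
        clusters.setdefault(find(r), []).append(r)
    return list(clusters.values())
```

For example, with five references and candidate pairs (0, 1) and (1, 2), references 0, 1, and 2 land in one initial cluster while 3 and 4 remain singletons.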
5.3 Iterative cluster merge and evidence updates
Once the similar clusters have been identified and bootstrapping
has been performed, the algorithm iteratively merges the most sim-
ilar cluster pair and updates similarities until the similarity drops
below some specified threshold. This is shown in lines 5-14 of the
pseudo-code in Figure 3. The similarity update steps for related
clusters in lines 12-14 are the key steps for the collective relational
clustering. In order to perform the update steps efficiently, indexes
need to be maintained for each cluster.
Here we describe the data structures used for this purpose.
In addition to its list of references, we maintain
three additional lists with each cluster. First, we maintain the
list of similar clusters for each cluster. The second list keeps track
of all neighboring clusters. Finally, we keep track of all the queue
entries that involve this cluster. For a cluster that has a single refer-
ence r, the similar clusters are those that contain references in the
same bucket as r after blocking. Also, the neighbors for this cluster
are the clusters containing references that share a hyper-edge with r.
Then, as two clusters merge to form a new cluster, all of these lists
can be constructed locally for the new cluster from those of its par-
ents. All of the update operations from lines 9-14 can be performed
efficiently using these lists. For example, updates for related clus-
ters are done by first accessing the neighbor list and then traversing
the similar list for each of them.
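The merge loop of lines 5-14 can be sketched as below. For brevity, this sketch skips stale priority-queue entries lazily instead of updating them through the per-cluster lists just described; the cluster contents and similarity function in the demo are illustrative.

```python
import heapq

def greedy_merge(initial, sim, threshold):
    """Greedy agglomerative loop: repeatedly merge the most similar
    cluster pair until the best similarity drops below threshold.
    Simplified sketch of the pseudo-code in Figure 3."""
    clusters = {i: set(c) for i, c in enumerate(initial)}
    heap, ids = [], list(clusters)
    for x in range(len(ids)):
        for y in range(x + 1, len(ids)):
            i, j = ids[x], ids[y]
            heapq.heappush(heap, (-sim(clusters[i], clusters[j]), i, j))
    next_id = len(ids)
    while heap:
        neg_s, i, j = heapq.heappop(heap)
        if -neg_s < threshold:
            break                        # best remaining pair is too dissimilar
        if i not in clusters or j not in clusters:
            continue                     # stale entry: one side already merged
        merged = clusters.pop(i) | clusters.pop(j)
        for k in list(clusters):         # enqueue similarities to the new cluster
            heapq.heappush(heap, (-sim(merged, clusters[k]), next_id, k))
        clusters[next_id] = merged
        next_id += 1
    return list(clusters.values())

# Tiny demo: clusters sharing a first letter count as fully similar.
demo = greedy_merge(
    [{"ann"}, {"anne"}, {"bob"}],
    lambda a, b: 1.0 if {x[0] for x in a} & {x[0] for x in b} else 0.0,
    threshold=0.5)
```

In the report's actual implementation, queue entries for clusters affected by a merge are updated in place via the similar-cluster and neighbor lists rather than lazily discarded.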
5.4 Time complexity
Having described each component of the relational clustering al-
gorithm, in this section we discuss the running time of each
component. First, we look at how the number of similarity
computations required in lines 3-4 of Figure 3 is reduced by the
blocking method. We first consider the worst-case scenario, where the
bootstrapping approach does not reduce the number of clusters
at all. We need to compare every pair of references within each
bucket. Suppose we have n references that are assigned to b buckets,
with each reference being assigned to at most f buckets. Then,
using an optimistic estimate, we have (nf/b) references in each
bucket, leading to O((nf/b)²) comparisons per bucket and a total
of O(n²f²/b) comparisons. We have assumed that the number
of buckets is proportional to the number of references, i.e., b is of
the order of O(n). Additionally, assuming that f is a small constant
independent of n, we have O(n) computations. It should be noted
that this is not a worst-case analysis for the bucketing: a bad
bucketing algorithm that assigns O(n) references to any bucket will
lead to O(n²) comparisons. Now, let us consider the time taken by
each iteration of the algorithm.
To analyze how many update or insert operations are required,
we assume that for each bucket affected by a merge operation,
all of the O((nf/b)²) computations need to be redone.
Then we need to find out how many buckets may be affected by
a merge operation. We say that two buckets are connected if any
hyper-edge connects two references in the two buckets. Then, if
any bucket is connected to k other buckets, each merge operation
leads to O(k(nf/b)²) update/insert operations. Note that this is
still only O(k) operations when f is a constant independent of n
and b is O(n). Using a binary-heap implementation for the priority
queue, the extract-max and each insert and update operation take
O(log q) time, where q is the number of entries in the queue. So
the total cost of each iteration of the algorithm is O(k log q).
Next, we count the total number of iterations that our algo-
rithm may require. In the worst case, the algorithm may have to
exhaust the priority queue before the similarity falls below the
threshold. So we need to consider the number of merge operations
that are required to exhaust a queue that has q entries. If the merge
tree is perfectly balanced, then the size of each cluster is doubled
by each merge operation and as few as O(log q) merges are re-
quired. However, in the worst case, the merge tree may be q deep
requiring as many as O(q) merges. With each merge operation re-
quiring O(k log q) time, the total cost of the iterative process is
O(qk log q).
Finally, in order to put a bound on the priority queue's initial
size q, we again consider the worst-case scenario, where bootstrapping
does not reduce the number of initial clusters. This results
in O(n²f²/b) entries in the queue, as shown earlier. Since this is
again O(n), the total cost of the algorithm can be bounded
by O(nk log n). Let us now consider the cost of bootstrapping. We can
analyze the bootstrapping by considering it as a sequence of cluster
merge operations that do not require any updates or inserts to
the priority queue; then the worst-case analysis of the number of
iterations accounts for the bootstrapping as well. To compare this
with the attribute-based and naive relational baselines, observe that they
need to take a decision for each pair of references in a bucket. This
leads to a worst-case analysis of O(n) using the same assumptions
as before. However, each similarity computation is more expensive
for the naive relational approach than for the attribute-based approach,
because the former additionally requires a pair-wise match to be computed
between the two hyper-edges.
6. Experiments and Results
We evaluated our relational entity resolution algorithm on a real-
world dataset - U.S. patent records. This dataset is maintained
by the National Bureau of Economic Research (NBER). The data
set spans 37 years (January 1, 1963 to December 30, 1999), and
includes all the utility patents granted during that period, totaling
3,923,922 patents (author references). The citation graph includes
all citations made by patents granted between 1975 and 1999,
totaling 16,522,438 citations [6]. For the patents dataset there are
1,803,511 nodes for which there is no information with NBER
about their citations (they only have in-links to the records). The
machine used for this experiment is a quad-core Intel(R) Xeon
X3210, 2.13GHz machine with 4GB RAM.
6.1 Dataset
Since the relational clustering algorithm that we implemented is
sequential, not parallel, we took a fraction (approximately 15%) of the
records of the entire US patent dataset for the period 1975 to 1999.
The total set of records D on which we ran our algorithm holds
651,877 records. D consists of two parts, Dataset 1 (D1) and
Dataset 2 (D2). D1 contains all the records for authors who are
based in the U.S. and have no middle name, i.e., they have a first
name, last name, and the other remaining attributes (such as address,
country, company, author sequence number in the patent, etc.).
D1 holds 301,877 records. D2 holds the first 350,000 records for
all authors who are based in the U.S. and also have a middle name
in addition to a first name, last name, and the remaining attributes.
These numbers were selected by sampling records from D and then
running the clustering algorithm over them to see if they could
successfully run on our machine. We also tried to run the entire
4 million records of D on our machine, but the system ran out of
memory. The reason is that the binary heap used in the bootstrapping
phase consumes O(n) memory; the memory therefore scales linearly
with the number of records, putting an upper bound on the
input dataset size.
6.2 Experiments
In this section we discuss the different experiments we ran to test
our algorithm. The goal of the first experiment was to disambiguate
more records than the memory of the system allows. Since our
system had only 4 GB of RAM, running the algorithm on D1 and D2
combined is not feasible, as the dataset would not fit in memory. To
work around this we used a statistical technique. We first ran the
relational clustering algorithm on D1 and built the buckets for D1
using the blocking technique (Section 5.2), and then ran the
algorithm on D2 using these clusters. Since all assignments to
clusters during blocking occur based on attribute similarity, the run
of D2 over the clusters used for D1 provides added evidence to
disambiguate D2. Let us discuss this approach in more detail: we
first ran the clustering algorithm on D1, performing the blocking and
bootstrapping as explained in Sections 5.1 and 5.2. This built the
buckets in which all author references that are similar according to
the attribute similarity measures are clustered together; thus running
relational clustering on D1 builds evidence clusters for D1. We then
ran the clustering algorithm on D2 using the clusters built for D1.
Because we already have attribute similarity evidence for the
records in D1, whenever a record in D2 reaches one of these clusters
based on attribute similarity, we can combine that previous evidence
with the relational similarity computed during the iterative phase
(Section 5.3) to disambiguate the records in D2.
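The two-phase evidence reuse described above can be sketched roughly as follows. This is a minimal Python illustration, not the project's actual implementation; the record fields and the blocking key are simplified stand-ins for the real attribute-similarity-based blocking:

```python
from collections import defaultdict

def block(records, buckets=None):
    """Assign each record to a bucket keyed on attribute similarity
    (here simply the lower-cased last/first name).

    Passing in the buckets built from a previous run (e.g. on D1) lets
    a second dataset (e.g. D2) reuse that evidence instead of starting
    from empty clusters."""
    if buckets is None:
        buckets = defaultdict(list)
    for r in records:
        key = (r["lname"].lower(), r["fname"].lower())
        buckets[key].append(r)
    return buckets

# Hypothetical records standing in for the patent author references:
D1 = [{"fname": "John", "lname": "Smith"}, {"fname": "Mary", "lname": "Jones"}]
D2 = [{"fname": "JOHN", "lname": "Smith"}]

buckets = block(D1)            # phase 1: evidence clusters built from D1
buckets = block(D2, buckets)   # phase 2: D2 routed into the D1 clusters
```

A D2 reference whose attributes match an existing D1 bucket thus inherits that bucket's accumulated evidence for the iterative phase.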
Thus, this statistical evidence-based reduction gave us two
advantages: first, it allowed us to process more records than the
memory could hold at once, and second, it provided more evidence
for the records in D2, yielding better precision. The second
experiment was used to test the accuracy of the algorithm on a
labeled dataset and to validate the number of entities obtained by
the algorithm against a labeled count of entities known beforehand.
We explain the experiment methodology and results in the
following subsections:
Figure 4. A bar chart showing the number of true names against number
of ambiguous names found in dataset 1 and dataset 2 after blocking phase.
Red bars represent the true names and green bars represent the ambiguous
names.
6.3 Results
6.3.1 Ambiguous names
For each run of relational clustering algorithm on D1 and D2, we
computed the total number of authors that constitute the dataset
and also the number of ambiguous names found. Dataset D1 had
92,585 total authors and 11,560 ambiguous names. On the other
hand, dataset D2 had 140,589 total authors and 24,072 ambiguous
names. The bar chart in Figure 4 displays this information. The
count of ambiguous names is a measure of the quality of a dataset:
if the dataset is noisy, the fraction of ambiguous names relative to
the total author names will be high, and vice versa for a good-quality
dataset.
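As a small illustration of this quality measure, the ambiguity fraction can be computed directly from the counts reported above (a sketch; the function name is ours, not from the project's code):

```python
def ambiguity_ratio(total_authors, ambiguous_names):
    """Fraction of ambiguous names among all author names;
    higher values indicate a noisier dataset."""
    return ambiguous_names / total_authors

# Counts reported in the text:
d1_ratio = ambiguity_ratio(92585, 11560)    # ~0.125 for D1
d2_ratio = ambiguity_ratio(140589, 24072)   # ~0.171 for D2, i.e. D2 is noisier
```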
6.3.2 Precision and Recall
Since the dataset we are using is unlabeled, we used successful
pair-wise match counts among all the matched pairs to estimate
precision. Figures 5 and 6 show how precision varies as the
algorithm iterates over all pairs in the priority queue during the
pair-matching process. This is an approximate measure, as we do
not have any prior information about the quality of the dataset. The
initial true pair count τt is set at the end of the blocking phase to the
count of similar matches made during blocking. Then, as the
iterative process begins, we count the actual number of pair-wise
similarity matches τa that were made. Precision P is then given as
P = τt / τa. Note that because we are using an unlabeled dataset,
recall has been assumed to be one, just to measure the accuracy of
the similarity measure of the relational clustering process. Figure 5
shows the precision vs. recall graph for the complete run of the
algorithm on D1. The precision ranges from 0.918627 to 1.
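The approximate precision measure described above reduces to a simple ratio; a sketch (the function name and the example counts are ours, not from the implementation):

```python
def approximate_precision(true_pair_count, actual_pair_count):
    """P = tau_t / tau_a. tau_t is fixed at the end of the blocking phase
    (the count of similar matches made during blocking); tau_a is the
    running count of pair-wise similarity matches made in the iterative
    phase. Recall is assumed to be 1 because the dataset is unlabeled."""
    return true_pair_count / actual_pair_count

# Hypothetical snapshot: blocking produced 900 true pairs and the
# iterative phase has made 980 pair-wise matches so far:
p = approximate_precision(900, 980)
```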
Figure 6 shows the precision vs recall graph for the run on D2.
In this case the precision ranges from 0.663406 to 0.690025. The
precision results are not as good for the run on D2, suggesting that
attribute similarity is not the best similarity measure for integrating
evidence. This integration seems to have created many false-positive
match pairs, and precision suffered as a result.
6.3.3 F1 score and α
Since α is a coefficient that tunes the relational clustering and
decides the relative weight of attribute similarity and neighborhood
similarity, we varied α from 0.0 to 1.0 and measured the F1 value
for both datasets D1 and D2. Figure 7 shows how the F1 score varies
with α. Note that this measurement makes the same assumption
regarding recall as in Section 6.3.2. As we can see, D1 has a higher
F1 score than D2 for all values of α. However, the F1 score for D2
increases as α increases, suggesting that as α grows, the relational
similarity factor carries more weight (according to sim(ci, cj)),
leading to an increase in the F1 score for the D2 run. On
Figure 5. This figure shows precision vs recall in a run of relational
clustering algorithm on dataset D1.
Figure 6. This figure shows precision vs recall in a run of the relational
clustering algorithm on dataset D2. Precision dropped sharply on D2
because D2 was run on the evidence clusters generated during the run on
D1, so attribute-similarity-based matching resulted in false records in each
cluster that was reused for D2.
the other hand, for D1, F1 drops slightly around 0.4 and then remains
stable around that mark, signifying a lesser impact of α.
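The weighted combination that α controls can be sketched as below. This mirrors the general form of the relational clustering similarity in [2], with sim_attr and sim_rel standing in for the actual attribute and neighborhood similarity measures:

```python
def combined_similarity(sim_attr, sim_rel, alpha=0.5):
    """sim(ci, cj) = (1 - alpha) * attribute similarity
                   + alpha * neighborhood (relational) similarity.
    alpha = 0 ignores relational evidence; alpha = 1 uses only it."""
    return (1 - alpha) * sim_attr + alpha * sim_rel

# As alpha grows, cluster pairs with strong relational evidence
# score progressively higher:
scores = [combined_similarity(sim_attr=0.4, sim_rel=0.9, alpha=a)
          for a in (0.0, 0.5, 1.0)]
```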
6.3.4 CPU Usage
Figure 8 shows the running time of the relational clustering algorithm
against the total number of record references. It is evident that as the
number of references increases, the running time also increases along
a roughly linear curve, consistent with the O(n log n) algorithm.
6.3.5 Memory usage
Figure 9 shows the peak memory usage of the algorithm against the
total number of references. It is quite evident that as the number of
references increases, memory scales linearly as well, suggesting O(n)
peak memory usage.
6.4 Validation
We tested the accuracy of the relational clustering algorithm on a
selected set of labeled records from D2, say S. The author references
in S were manually disambiguated using Google Patents [26] and
also using the attributes and the relational similarity w.r.t. the
co-authors present in S. The size of S was 250 references, containing
in total 86 author entities. Some authors also had co-author reference
records in S, so both attribute similarity and relational similarity
were used to validate the clustering algorithm. Denote the total
number of labeled author entities as T, the number of entities
identified by the algorithm as A, and the number of correctly identified entities as
Figure 7. This figure shows the F1 score versus our tuning parameter alpha
on D1 and D2. Because of the lower precision for D2 (Figure 6), the F1
score is lower for D2 than for D1, but it improves as alpha is scaled from
0 to 1. D1, on the other hand, drops around alpha = 0.4 and then shows
steady behavior as alpha increases.
Figure 8. CPU run time vs. number of records. It can be seen that the two
scale roughly linearly with each other, validating the O(nk log n) algorithm
discussed in the time complexity analysis (Section 5.4).
Figure 9. Peak memory usage of the algorithm against the total number
of references. It can be seen that as the number of references increases,
memory scales linearly as well, suggesting O(n) peak memory usage.
Figure 10. This graph shows the precision, recall and F1 score of the
algorithm against the top k labeled reference records, where k = 25, 50,
75, ..., 250. All three measures have high values initially, then drop around
k = 75, suggesting a lack of evidence for some sequences of author
references leading to mispredictions; but as the dataset size increases and
the algorithm gathers more evidence, precision, recall and F1 score keep
improving.
C, respectively. We use precision P = C/A, recall R = C/T and
F-score F1 = 2PR/(P + R) to measure the performance of the
algorithm. These performance and accuracy parameters were
calculated for every run of the algorithm on the top k records of S,
where k = 25, 50, 75, ..., 250. The precision, recall and F1
measurements are shown in
Figure 10. From these results, we can see that as the number of
reference records increases and more relational evidence merges
with the existing evidence, the iterative algorithm performs well
over larger sets of top-k data and makes better predictions.
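The validation metrics above can be computed as follows; the example counts are hypothetical, not taken from S:

```python
def validation_scores(T, A, C):
    """Precision, recall and F1 from labeled entity counts:
    T = total labeled author entities, A = entities identified by the
    algorithm, C = correctly identified entities."""
    P = C / A
    R = C / T
    F1 = 2 * P * R / (P + R)
    return P, R, F1

# Hypothetical counts for one top-k slice: 20 labeled entities,
# 22 found by the algorithm, 18 of them correct.
P, R, F1 = validation_scores(T=20, A=22, C=18)
```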
7. Discussion
In this section we discuss the various performance trade-offs that
were made, as well as the error analysis done during the course of
implementing and running the algorithm. The first optimization was
adding a strong evidence attribute field called assignee, taken from
the pat63_99.txt file [25], to the patent references. This attribute was
given a higher weight compared to other fields such as zipcode or
state, and the results obtained support this step. The second
performance optimization, which helped during the blocking phase,
was using a normalized author name (lname fname) or (lname fname
mname): if the patent records are pre-processed by sorting on the
normalized name, blocking is faster, since identical author names
form pairs quickly. Another technique, used in Section 6.3.2, was to
measure approximate precision by setting recall to one; we simply
wanted to check how precision varied if we trust that bootstrapping
alone can create good potential pairs, and Figure 5 shows good
results for this approximate precision measurement. A further lesson
learned was that rather than relying on a purely deterministic
approach of running the entire dataset at once, it is a good idea to
reduce the dataset and use a statistical approach to build evidence
for the next iteration on new data. This approach is useful for
sequential algorithms like relational clustering, for which the entire
dataset cannot fit in main memory.
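The normalized-name optimization mentioned above can be sketched as follows (a simplified illustration; the real records carry many more attributes):

```python
def blocking_key(record):
    """Normalized author name used as the blocking key:
    'lname fname' or 'lname fname mname'. Sorting the records on this
    key makes identical names adjacent, so blocking forms pairs quickly."""
    parts = [record["lname"], record["fname"]]
    if record.get("mname"):
        parts.append(record["mname"])
    return " ".join(p.strip().lower() for p in parts)

records = [
    {"fname": "John", "lname": "Smith", "mname": "A"},
    {"fname": "MARY", "lname": "Jones"},
    {"fname": "john", "lname": "smith", "mname": "a"},
]
records.sort(key=blocking_key)  # identical normalized names become adjacent
```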
8. Future Work
The relational clustering algorithm runs well for medium-sized
datasets; to scale it to large datasets, a parallel algorithm needs to be
devised. Approaches such as P-Swoosh [3] try to morph the graphs
onto a set of nodes, but that approach is not generic and might not
apply to different types of graph datasets. One key observation from
this work is that if the dataset quality is good enough, i.e. it does not
contain a large number of ambiguous references, the algorithm will
produce fewer clusters. If that is the case, MapReduce, GraphLab or
other such parallel frameworks can be used to distribute these
reference clusters onto a set of machines and then run the clustering
algorithm on each machine, solving the problem in a distributed
manner.
9. Conclusion
In this project we implemented a relational clustering algorithm [2]
to disambiguate author references in a real-world patent citation
dataset. To get good results, our similarity function made use of
both attribute similarity and neighborhood similarity. Two different
types of experiments were performed to validate the effectiveness
of the algorithm. The first was a statistical approach that added more
evidence from a previous run of the algorithm; but since our
blocking technique relied on attribute similarity, the second run
produced many false positives and hence only moderate precision
in the iterative phase. The second experiment made use of a
manually labeled dataset, and the results obtained validated the
effectiveness of the algorithm.
Acknowledgments
I would like to thank Prof. Freund and Dr. Gupta for their valuable
guidance during the course of the project.
References
[1] Lisa Getoor, Indrajit Bhattacharya, Entity resolution in graphs, Mining
Graph Data, Wiley publication, Chapter 13.
[2] Lisa Getoor, Indrajit Bhattacharya, Collective entity resolution in
relational data, ACM Transactions on Knowledge Discovery from Data
(TKDD), Volume 1 Issue 1, March 2007.
[3] Kawai, H., Garcia-Molina, H., Benjelloun, O., Menestrina, D., Whang,
E., Gong, H.: P-Swoosh: Parallel algorithm for generic entity resolution.
Tech. Rep. 2006-19, Department of Computer Science, Stanford
University (2006)
[4] Vetle I. Torvik, Neil R. Smalheiser. Author name disambiguation in
MEDLINE. ACM Transactions on Knowledge Discovery from Data
(TKDD) Volume 3 Issue 3, July 2009.
[5] Vetle I. Torvik , Neil R. Smalheiser, Author name disambiguation in
MEDLINE, ACM Transactions on Knowledge Discovery from Data
(TKDD), v.3 n.3, p.1-29, July 2009
[6] SNAP http://snap.stanford.edu/data/cit-Patents.html
[7] I. Fellegi and A. Sunter. A theory for record linkage. Journal of Amer.
Statistical Association 1969.
[8] M. Hernandez and S. Stolfo. The merge/purge problem for large
databases. In SIGMOD, 1995.
[9] R. Ananthakrishna, S. Chaudhuri, and V. Ganti. Eliminating fuzzy
duplicates in data warehouses. In VLDB, 2002.
[10] I. Bhattacharya and L. Getoor. Iterative record linkage for cleaning
and integration. In DMKD Workshop, 2004.
[11] X. Dong, A. Y. Halevy, and J. Madhavan. Reference reconciliation in
complex information spaces. In SIGMOD, 2005
[12] D. V. Kalashnikov, S. Mehrotra, and Z. Chen. Exploiting relationships
for domain-independent data cleaning. In SIAM Data Mining (SDM),
2005.
[13] B. Malin. Unsupervised name disambiguation via social network
similarity. In Workshop on Link Analysis, Counterterrorism, and
Security, 2005.
[14] Z. Chen, D.V. Kalashnikov and S. Mehrotra, Adaptive Graphical
Approach to Entity Resolution, Proc. ACM IEEE Joint Conf. Digital
Libraries (JCDL), 2007.
[15] I. G. Councill, H. Li, Z. Zhuang, S. Debnath, L. Bolelli, W. C. Lee, A.
Sivasubramaniam, and C. L. Giles. Learning metadata from the evidence
in an on-line citation matching scheme. In JCDL, 2006.
[16] X. Dong, A. Y. Halevy, and J. Madhavan. Reference reconciliation in
complex information spaces. In SIGMOD, 2005
[17] H. Han, L. Giles, H. Zha, C. Li, and K. Tsioutsiouliklis. Two
supervised learning approaches for name disambiguation in author
citations. In JCDL, 2004.
[18] A. McCallum and B. Wellner. Conditional models of identity
uncertainty with application to noun coreference. In NIPS, 2004.
[19] H. Pasula, B. Marthi, B. Milch, S. Russell, and I. Shpitser. Identity
uncertainty and citation matching. In NIPS, 2002.
[20] P. Singla and P. Domingos. Multi-relational record linkage. In MRDM
Workshop, 2004.
[21] R. Bekkerman and A. McCallum. Disambiguating web appearances
of people in a social network. In WWW, 2005.
[22] Z. Chen, D.V. Kalashnikov and S. Mehrotra, Adaptive Graphical
Approach to Entity Resolution, Proc. ACM IEEE Joint Conf. Digital
Libraries (JCDL), 2007.
[23] Levenshtein distance. http://en.wikipedia.org/wiki/Levenshtein_distance
[24] Jaccard index. http://en.wikipedia.org/wiki/Jaccard_index
[25] The NBER U.S. Patent Citations Data File: Lessons, Insights, and
Methodological Tools. http://data.nber.org/patents/
[26] Google Patents. www.google.com/patents.
A. Appendix A
This section explains how to run the code and change the input
parameters. To build the code, run make. If the data file is input.txt,
then running $ ./ER -makeclusters w -f input.txt will do the
blocking and then run the iterative algorithm on the dataset. To tune
the alpha parameter (default 0.5) to, say, 0.8, use
$ ./ER -makeclusters w -f input.txt -alpha 0.8. To set the
threshold (default 0) for the similarity function to, say, 0.5, use
$ ./ER -makeclusters w -f input.txt -t 0.5. More information
can be obtained from the readme file in the source directory.