REPAIRING HIDDEN LINKS IN LINKED DATA
ENHANCING THE QUALITY OF RDF KNOWLEDGE GRAPHS
Nandana Mihindukulasooriya, Mariano Rico, Idafen Santana-Pérez,
Raúl García-Castro and Asunción Gómez-Pérez
December 5th, 2017
The Ninth International Conference on Knowledge Capture
K-CAP 2017, Austin, Texas, United States
Things, not strings!
Things as strings!
dbr:Tim_Cook   dbo:employer    “Apple”
dbr:Apple_Pie  dbo:ingredient  “Apple”
• Introduces model inconsistencies
• Increases ambiguity
• Reduces connectivity
… after correctly linked
dbr:Tim_Cook   dbo:employer    dbr:Apple_Inc
dbr:Apple_Pie  dbo:ingredient  dbr:Apple
dbr:Apple_Inc  rdf:type        dbo:Company
dbr:Apple_Inc  dbo:industry    dbr:Consumer_electronics
dbr:Apple_Inc  dbo:product     dbr:IPhone
dbr:Apple      rdf:type        yago:Fruit
dbr:Apple      dbo:family      dbr:Rosaceae
dbr:Apple      dbp:fiber       2.4
Is this a common problem?
• Web Data Commons (JSON-LD embedded data)
• http://webdatacommons.org/structureddata/
Property                   # of objects  # of string literals  Literal %  IRI %
schema:creator               16,573,426            15,782,874      95.23   4.77
schema:brand                  4,694,411             2,420,908      51.57  48.43
schema:author                26,193,682             1,500,898       5.73  94.27
schema:hiringOrganization     1,252,870             1,231,840      98.84   1.16
schema:publisher             31,317,151               560,577       1.79  98.21
Research Questions
• How to identify string literals that denote
entities in a KG?
• How to transform those string literals
into IRIs that correspond to entities they
denote?
• How to measure the improvement in
quality because of the transformation?
Entity relations
• Entity relations link two entities.
• All objects of entity relations should be entities.
• Not all relations are entity relations.
• An entity relation: Person → Company
• A non-entity relation: Company → “Company Description”
String to Entity IRI Transformation
Such string literals can be transformed to their
corresponding entity IRIs with high precision using both
ontological axioms and data profiling information.
Candidate entities for “Apple”: dbr:Apple, dbr:Apple_Inc, dbr:Apple_Bank, dbr:The_Apple_Film
Connectivity
Connectivity of a knowledge graph can be improved by
transforming literal nodes in entity relations into their
corresponding entity IRIs.
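The transformation described above can be sketched in a few lines. This is only an illustration under stated assumptions: triples are plain tuples, the `Literal` wrapper and the `links` lookup table are hypothetical, and how the links are actually obtained is the subject of the approach slides.

```python
from typing import NamedTuple

class Literal(NamedTuple):
    """Hypothetical wrapper marking an object as a string literal."""
    text: str

def repair(triples, links):
    """Replace literal objects with entity IRIs where a link is known.

    links maps (subject, predicate, literal text) to an entity IRI.
    Literals without a known link (e.g. numeric values) are left as-is.
    """
    repaired = []
    for s, p, o in triples:
        if isinstance(o, Literal) and (s, p, o.text) in links:
            o = links[(s, p, o.text)]
        repaired.append((s, p, o))
    return repaired

triples = [
    ("dbr:Tim_Cook", "dbo:employer", Literal("Apple")),
    ("dbr:Apple_Pie", "dbo:ingredient", Literal("Apple")),
    ("dbr:Apple", "dbp:fiber", Literal("2.4")),
]
links = {
    ("dbr:Tim_Cook", "dbo:employer", "Apple"): "dbr:Apple_Inc",
    ("dbr:Apple_Pie", "dbo:ingredient", "Apple"): "dbr:Apple",
}
repaired = repair(triples, links)
```

The same string “Apple” ends up as two different IRIs because the lookup is keyed on the subject and predicate as well as the text.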
Related Problems - I
• Named entity disambiguation from text
Kill Bill was directed by Quentin Tarantino and stars Uma Thurman.
dbr:Kill_Bill:_Volume_1 dbr:Quentin_Tarantino dbr:Uma_Thurman
Related Problems - II
• Web Table Matching
company  ind.        loc.
IBM      technology  United States
United   airlines    USA
…        …           …
Matched to: dbr:IBM, dbr:Technology, dbr:United_States, dbr:United_Airlines, dbr:Airline, dbo:country
Approach
Identification of entity relations:
• Using ontological axioms
• Using data profiling information
String literal to IRI conversion:
• Context generation
• Type identification (entity relation range)
• Entity IRI identification
Identification of entity relations
[Figure: identification of entity relations. Inputs: RDF graph, ontology definitions. Outputs: entity relations, other relations. Property labels shown: P1, P2, P3, Pn, PX, PY.]
Identification of entity relations
Ontology-driven:
• OWL Object Properties
• Property range
Data-driven (Entity Relation Classifier):
• Features: IRI %, Lit%, DistinctIRI%, DistinctLit%, String%, Num%, Date%
• Training data: known entity relations, manually annotated rdfs props
See the paper for the algorithm.
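The data profiling features listed above could be computed per property roughly as follows. This is a minimal sketch, not the algorithm from the paper: the `Obj` record, the datatype labels, and the choice of denominators (all objects for the type shares, distinct objects for the distinct shares) are assumptions made for illustration.

```python
from collections import namedtuple

# Hypothetical object record: kind is "iri" or "literal";
# datatype is "string", "number" or "date" (None for IRIs).
Obj = namedtuple("Obj", ["value", "kind", "datatype"])

def pct(part, whole):
    """Percentage rounded to two decimals; 0.0 for an empty whole."""
    return round(100.0 * part / whole, 2) if whole else 0.0

def profile(objects):
    """Profile the objects of one property (assumed feature definitions)."""
    n = len(objects)
    distinct = {(o.kind, o.value) for o in objects}
    distinct_iris = sum(1 for kind, _ in distinct if kind == "iri")
    n_iris = sum(1 for o in objects if o.kind == "iri")

    def share(datatype):
        return pct(sum(1 for o in objects if o.datatype == datatype), n)

    return {
        "IRI%": pct(n_iris, n),
        "Lit%": pct(n - n_iris, n),
        "DistinctIRI%": pct(distinct_iris, len(distinct)),
        "DistinctLit%": pct(len(distinct) - distinct_iris, len(distinct)),
        "String%": share("string"),
        "Num%": share("number"),
        "Date%": share("date"),
    }

employer_objects = [
    Obj("dbr:Google", "iri", None),
    Obj("Apple", "literal", "string"),
    Obj("Apple", "literal", "string"),
    Obj("Mars", "literal", "string"),
]
features = profile(employer_objects)
```

A feature vector like this, one per property, is the kind of input a classifier could be trained on using the known entity relations as labels.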
Approach
Identification of entity relations:
• Using ontological axioms
• Using data profiling information
String literal to IRI conversion:
• Type identification (entity relation range)
• Context generation
• Entity disambiguation
Type identification and entity linking
Property: employer
Existing IRI objects: dbr:Google, dbr:Microsoft, dbr:Yahoo!, dbr:Amazon_(company)
String literal objects: “Apple”, “I.B.M”, “Johnson & Johnson”, “Mars”, “Oracle”, “Sun”
Type prediction (range):
• Range information (ontology)
• Type restrictions
• Type frequency analysis
Entity linking:
• Entity disambiguation
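One plausible reading of type frequency analysis is: look at the types of the objects that are already IRIs, take the dominant type as the predicted range, and use it to restrict the candidates for each string. The sketch below assumes exactly that; the `min_share` threshold and the types given for dbr:Apple_Bank and dbr:The_Apple_Film are illustrative assumptions, not values from the slides.

```python
from collections import Counter

def predict_range(object_types, min_share=0.5):
    """Return the most frequent type among existing IRI objects.

    object_types holds one set of types per IRI object. The type is
    accepted only if at least min_share of the objects carry it.
    """
    counts = Counter(t for types in object_types for t in types)
    if not counts:
        return None
    best, freq = counts.most_common(1)[0]
    return best if freq / len(object_types) >= min_share else None

def restrict(candidates, predicted_type):
    """Keep only the candidate IRIs that carry the predicted type."""
    return [iri for iri, types in candidates.items() if predicted_type in types]

# Types of the existing IRI objects of "employer" (illustrative).
existing = [
    {"dbo:Company"},                 # dbr:Google
    {"dbo:Company"},                 # dbr:Microsoft
    {"dbo:Company"},                 # dbr:Yahoo!
    {"dbo:Company", "dbo:Website"},  # dbr:Amazon_(company)
]
candidates_for_apple = {
    "dbr:Apple": {"yago:Fruit"},
    "dbr:Apple_Inc": {"dbo:Company"},
    "dbr:Apple_Bank": {"dbo:Company"},   # assumed type
    "dbr:The_Apple_Film": {"dbo:Film"},  # assumed type
}
predicted = predict_range(existing)
remaining = restrict(candidates_for_apple, predicted)
```

Two candidates survive the type restriction here, which is why a separate disambiguation step is still needed.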
Context generation
Tim Cook employer Apple.
Timothy Donald "Tim" Cook is an American
business executive, industrial engineer and
developer. Cook is the current and seventh
Chief Executive Officer of Apple Inc.,
previously serving as the company's Chief
Operating Officer, under its ....
Apple Pie ingredient Apple.
An apple pie is a fruit pie, in which the
principal filling ingredient is apple. It is, on
occasion, served with whipped cream or ice
cream on top, or alongside cheddar cheese.
The pastry is generally used top-and-bottom
...
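To show how such a context could drive disambiguation, here is a deliberately simple token-overlap baseline. It is a stand-in, not the disambiguation method used in the work; the candidate descriptions are made up for illustration.

```python
import re

def tokens(text):
    """Lower-cased alphabetic tokens of a text, as a set."""
    return set(re.findall(r"[a-z]+", text.lower()))

def best_candidate(context, candidates):
    """Pick the candidate whose description shares most tokens with the context."""
    ctx = tokens(context)
    scores = {iri: len(ctx & tokens(desc)) for iri, desc in candidates.items()}
    return max(scores, key=scores.get)

# Hypothetical candidate descriptions.
candidates = {
    "dbr:Apple_Inc": "Apple Inc. is a technology company whose "
                     "chief executive officer is Tim Cook",
    "dbr:Apple": "The apple is a fruit produced by an apple tree",
}
employer_context = ("Tim Cook employer Apple. Cook is the current and "
                    "seventh Chief Executive Officer of Apple Inc.")
ingredient_context = ("Apple Pie ingredient Apple. An apple pie is a fruit pie, "
                      "in which the principal filling ingredient is apple.")

employer_link = best_candidate(employer_context, candidates)
ingredient_link = best_candidate(ingredient_context, candidates)
```

The context for each triple combines the triple itself with the description of its subject, so the two occurrences of “Apple” are scored against different text.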
Evaluation
Evaluation - I
• Can the approach correctly identify entity relations?
• manually-annotated gold standard
• 3 annotators with high inter-annotator agreement
Class            Entity relations  Detected correct  Detected incorrect  Prec.   Recall
dbo:Athlete                   183               178                  53  77.06%  97.27%
dbo:SportsTeam                157               156                  48  76.47%  99.36%
dbo:SportsEvent               116               115                  41  73.71%  99.14%
Total                         456               449                 142  75.97%  98.47%
High Recall
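The reported figures are consistent with the standard definitions: precision is the share of detected relations that are correct, and recall is the share of gold-standard entity relations that were detected. A small sketch, with a hypothetical function name:

```python
def precision_recall(correct, incorrect, gold_total):
    """Precision and recall as percentages, rounded to two decimals."""
    precision = correct / (correct + incorrect)
    recall = correct / gold_total
    return round(100 * precision, 2), round(100 * recall, 2)

# dbo:Athlete row: 183 gold entity relations, 178 detected correctly,
# 53 detected incorrectly.
athlete = precision_recall(correct=178, incorrect=53, gold_total=183)
# dbo:SportsTeam row.
sports_team = precision_recall(correct=156, incorrect=48, gold_total=157)
```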
Evaluation - II
• Can the approach correctly disambiguate and link the
identified strings?
• Manual verification of a random sample of 300
Class            Sample size  Disambiguated  Correct  Prec.   Recall
dbo:Athlete              100             51       50  98.04%  50%
dbo:SportsTeam           100             44       44  100%    44%
dbo:SportsEvent          100             58       55  94.83%  56%
Total                    300            153      149  97.38%  49.67%
High Precision
Graph metrics
• connected components
          # of edges  # of components  Isolated components  Largest component size
Original           8                6                    3                       5
Repaired     12 (+4)           2 (-4)               0 (-3)                 12 (+7)
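These metrics can be computed with a small union-find pass over the edges. The sketch below uses its own toy graph rather than the one on the slide, and it assumes that an isolated component is a component with a single node; the slides do not define the term.

```python
from collections import Counter

def component_metrics(nodes, edges):
    """Edge count, component count, single-node components, largest size."""
    parent = {n: n for n in nodes}

    def find(x):
        # Follow parent pointers to the root, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in edges:
        parent[find(a)] = find(b)

    sizes = Counter(find(n) for n in nodes)
    return {
        "edges": len(edges),
        "components": len(sizes),
        "isolated": sum(1 for s in sizes.values() if s == 1),
        "largest": max(sizes.values()),
    }

nodes = ["dbr:Tim_Cook", "dbr:Apple_Inc", "dbr:IPhone",
         "dbr:Apple_Pie", "dbr:Apple", "dbr:Rosaceae"]
# Entity-to-entity links only; string literals contribute no edge.
original_edges = [("dbr:Apple_Inc", "dbr:IPhone"),
                  ("dbr:Apple", "dbr:Rosaceae")]
# Links recovered by turning the two "Apple" literals into IRIs.
repaired_edges = original_edges + [("dbr:Tim_Cook", "dbr:Apple_Inc"),
                                   ("dbr:Apple_Pie", "dbr:Apple")]

before = component_metrics(nodes, original_edges)
after = component_metrics(nodes, repaired_edges)
```

In this toy graph the two repaired links halve the number of components and remove both single-node components.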
Evaluation - III
• Does this transformation increase the connectivity of
the graph?
Graph     Edges      Components  Isolated  Largest component size  Largest component % of nodes
Original    828,310     119,623   112,331                 168,128                         54.28
Repaired  1,035,912      99,137    93,507                 192,805                         64.16
Change     +207,602     -20,486   -18,824                 +24,677                         +9.88
Change %    +25.06%     -17.12%   -16.76%                 +14.68%
Improved Connectivity
Conclusions and future work
• A large number of entities are represented as strings
in Linked Data.
• The proposed approach can detect such strings with
high recall (98%) and convert them to their
corresponding entity IRIs with high precision (97%).
• We added 25% more links and reduced the number of
connected components in a subgraph of DBpedia by 17%.
• In future work, improve the algorithm by using more
context information from the graph for the task.