Type Inference on Noisy RDF Data

10/31/13
Heiko Paulheim, Christian Bizer

The Problem
• One promise of the Semantic Web:
– You can issue structured queries
– e.g., "List all presidents that graduated from Harvard Law School"
– SELECT ?x WHERE {
    ?x a dbpedia-owl:President .
    ?x dbpedia-owl:almaMater dbpedia:Harvard_Law_School }

The Problem
• SELECT ?x WHERE {
    ?x a dbpedia-owl:President .
    ?x dbpedia-owl:almaMater dbpedia:Harvard_Law_School }
• ...if we run this against DBpedia, we get one result
– i.e., Elwell Stephen Otis
• But...

The Problem
• So what is going wrong?
• SELECT ?x WHERE {
    ?x a dbpedia-owl:President .
    ?x dbpedia-owl:almaMater dbpedia:Harvard_Law_School }
• In DBpedia, Barack Obama is not of type President!
• How can we add missing types?

Is It a Big Problem?
• DBpedia has at least 2.7 million missing type statements
– w.r.t. the DBpedia ontology
– found using co-occurrence analysis of matching classes in YAGO and DBpedia
– a very optimistic lower bound
• Highly incomplete classes:
– Species: >870,000 missing statements
– Person: >510,000 missing statements
– Event: >150,000 missing statements

A Naive Approach
• Idea: exploit properties with domain and range
• Pseudo RDFS reasoning:
– CONSTRUCT { ?x a ?t }
  WHERE { { ?x ?r ?y . ?r rdfs:domain ?t }
          UNION
          { ?y ?r ?x . ?r rdfs:range ?t } }
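A minimal Python sketch of the same domain/range rule may help to see how it behaves on dirty data. The domain and range table, the `dbo:`/`dbr:` prefixes, and the example triples are illustrative assumptions, not an extract of the real DBpedia ontology.

```python
from collections import defaultdict

# Hypothetical toy schema (assumption): property -> domain / range class.
DOMAIN = {"dbo:award": "dbo:Person", "dbo:capital": "dbo:Country"}
RANGE = {"dbo:award": "dbo:Award", "dbo:capital": "dbo:City"}


def infer_types(triples, domain=DOMAIN, rng=RANGE):
    """Give every subject the property's domain and every object its range."""
    types = defaultdict(set)
    for s, p, o in triples:
        if p in domain:
            types[s].add(domain[p])
        if p in rng:
            types[o].add(rng[p])
    return types


triples = [
    ("dbr:Germany", "dbo:capital", "dbr:Berlin"),
    # One noisy statement: a country used as the value of an award property.
    ("dbr:Kurt_H._Debus", "dbo:award", "dbr:Germany"),
]
inferred = infer_types(triples)
print(sorted(inferred["dbr:Germany"]))
```

In this toy run a single noisy triple is enough to type the country as an award, because the rule treats every statement as fully trustworthy.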

A Naive Approach
• Experiment with Barack Obama
– Inferred types: Person, PersonFunction, Actor, Organization
• Experiment with Germany
– Inferred types: Place, Award, PopulatedPlace, City, SportsTeam, Mountain, Agent, Organisation, Country, Stadium, RecordLabel, MilitaryUnit, Company, EducationalInstitution, PersonFunction, EthnicGroup, Architect, WineRegion, Language, MilitaryConflict, Settlement, RouteOfTransportation

A Naive Approach
• What is going on here?
– DBpedia data is noisy
– One wrong statement is enough for a wrong conclusion
– e.g.: dbpedia:Kurt_H._Debus dbpedia-owl:award dbpedia:Germany
• Germany example: 69,000 statements
– 20 wrong types can come from 20 wrong statements
– i.e., an error rate of 0.03% is enough for a completely wrong result
– ...but that would be excellent data quality for a LOD source!
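The 0.03% figure follows directly from the two numbers quoted for the Germany example. As a quick check (variable names are illustrative):

```python
# Numbers taken from the Germany example: 69,000 statements, 20 of them wrong.
total_statements = 69_000
wrong_statements = 20

error_rate_percent = wrong_statements / total_statements * 100
print(f"{error_rate_percent:.3f}%")  # roughly 0.03%
```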

SDType Approach
• Idea: outgoing/incoming properties are indicators for a resource's type
– e.g.: starring → Movie
– e.g.: author⁻¹ → Writer (⁻¹ marks an incoming link)
• Basic compiled statistics
– P(C|p) := probability of class C in presence of property p
– e.g.: P(dbpedia:Film|starring) = 0.79
– e.g.: P(dbpedia:Writer|author⁻¹) = 0.44
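As a rough illustration, such statistics could be estimated from already-typed resources as sketched below. The sketch counts each resource once per property (a simplifying assumption), writes incoming properties with a `^-1` suffix (a hypothetical naming convention), and uses made-up toy data.

```python
from collections import Counter, defaultdict


def conditional_type_probs(links, types):
    """Estimate P(C|p): share of resources having property p that are typed C.

    links: iterable of (resource, property) pairs.
    types: dict mapping a resource to the set of its known classes.
    """
    resources_with = defaultdict(set)
    for resource, prop in links:
        resources_with[prop].add(resource)

    probs = {}
    for prop, resources in resources_with.items():
        counts = Counter(c for r in resources for c in types.get(r, ()))
        probs[prop] = {c: n / len(resources) for c, n in counts.items()}
    return probs


# Toy data (assumption): four resources with an outgoing "starring" property.
links = [("Film_A", "starring"), ("Film_B", "starring"),
         ("Film_C", "starring"), ("Show_D", "starring")]
types = {"Film_A": {"Film"}, "Film_B": {"Film"},
         "Film_C": {"Film"}, "Show_D": {"TelevisionShow"}}

probs = conditional_type_probs(links, types)
print(probs["starring"])
```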

SDType Approach
• Based on precompiled statistics
– Find types of instances
– Using voting
• score(C) = average of P(C|p) over all properties p
• Refinement:
– Weight for properties: discriminative power
– weight(p) = sum over all classes C of (P(C) − P(C|p))²
– i.e., how strongly this property's class distribution deviates from the overall class distribution
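The voting step with property weights can be sketched as follows. Normalising by the sum of the weights (a weighted average) is an assumption made here for illustration. Only the 0.79 for Film given starring comes from the slides; the prior and the "runtime" statistics are invented.

```python
def property_weight(prior, cond):
    """weight(p) = sum over all classes C of (P(C) - P(C|p))^2."""
    classes = set(prior) | set(cond)
    return sum((prior.get(c, 0.0) - cond.get(c, 0.0)) ** 2 for c in classes)


def score_types(properties, probs, prior):
    """Weighted average of P(C|p) over the properties seen on one resource."""
    weights = {p: property_weight(prior, probs[p]) for p in properties}
    total = sum(weights.values())
    if total == 0:
        return {}
    classes = {c for p in properties for c in probs[p]}
    return {
        c: sum(weights[p] * probs[p].get(c, 0.0) for p in properties) / total
        for c in classes
    }


# Hypothetical overall class distribution P(C) and per-property P(C|p).
prior = {"Film": 0.05, "Writer": 0.02, "Person": 0.30}
probs = {
    "starring": {"Film": 0.79},
    "runtime": {"Film": 0.60, "TelevisionShow": 0.30},
}

scores = score_types(["starring", "runtime"], probs, prior)
single = score_types(["starring"], probs, prior)
print(scores)
```

A class would then be assigned when its score exceeds a chosen threshold; the threshold trades precision against recall.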

Evaluation
• Two-fold evaluation
– On DBpedia and OpenCyc as "silver standard" (automatic, 10,000 random instances)
– On untyped DBpedia resources (manual, 100 instances)
• Using only incoming properties
– Using outgoing properties is trivial!
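For the automatic part, predicted types can be compared with the types already present in the silver standard. The sketch below shows one common way to compute precision, recall, and F-measure over type statements. Micro-averaging and the toy data are assumptions, not necessarily the exact protocol used in this evaluation.

```python
def evaluate(predicted, gold):
    """Micro-averaged precision, recall and F1 over (resource, type) pairs."""
    tp = fp = fn = 0
    for resource, gold_types in gold.items():
        pred = predicted.get(resource, set())
        tp += len(pred & gold_types)   # predicted and in the silver standard
        fp += len(pred - gold_types)   # predicted but not in the standard
        fn += len(gold_types - pred)   # in the standard but not predicted
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Toy data (assumption): two resources with known and predicted types.
gold = {"a": {"Person", "Agent"}, "b": {"Film"}}
predicted = {"a": {"Person"}, "b": {"Film", "Work"}}

precision, recall, f1 = evaluate(predicted, gold)
print(precision, recall, f1)
```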

Evaluation Results
• On DBpedia
[Figure: precision (y-axis) over recall (x-axis), both from 0 to 1, with one curve each for min. 1 link, min. 10 links, and min. 25 links]

Evaluation Results
• On OpenCyc
[Figure: precision (y-axis) over recall (x-axis), both from 0 to 1, with one curve each for min. 1 link, min. 10 links, and min. 25 links]

Evaluation Results
• Evaluation on untyped resources
– Random sample of 100 untyped resources
– Manual checking of precision
[Figure: precision (left axis, 0 to 1) and # found types (right axis, 0 to 12) over the lower bound for the threshold (x-axis, 0.9 down to 0)]

Evaluation Results
• DBpedia:
– works reasonably well (F-measure 0.89)
• OpenCyc:
– harder because of deeper class hierarchy (F-measure 0.60)
• General:
– having more links increases precision (in contrast to RDFS reasoning)
– more general types (e.g., Band) are easier than specific ones (e.g., PunkRockBand)

Deployment
• Heuristic types have been included in DBpedia 3.9
– for previously untyped instances
– 3.4 million type statements at precision ~0.95
• Also includes many resources without a Wikipedia page
– i.e., generated from a red link
• Runtime
– Complexity O(PT), where P is the number of property assertions and T the number of type assertions
– ~24h for processing DBpedia

Conclusion and Outlook
• SDType approach works at high quality
– outperforms most state-of-the-art approaches on DBpedia
– deployed for DBpedia 3.9
• Same approach can be used for validating links
– within a dataset: deployed for DBpedia 3.9 (removed ~13,000 wrong statements)
– across datasets: to be done
