Type Inference on Noisy RDF Data

Type Inference on Noisy RDF Data

10/31/13 Paulheim, Christian Bizer
Heiko Paulheim, Christian Bizer
Heiko

1

The Problem
•

One promise of the Semantic Web:
– You can issue structured queries
– e.g., „List all presidents that graduated from Harvard Law School“
– SELECT ?x WHERE {
?x a dbpedia-owl:President .
?x dbpedia-owl:almaMater
dbpedia:Harvard_Law_School }

10/31/13


2

The Problem
•

SELECT ?x WHERE {

•

...if we run this against DBpedia, we get one result
– i.e., Elwell Stephen Otis

•

But...

10/31/13


3

The Problem

10/31/13


4

The Problem
•

So what is going wrong?

•

SELECT ?x WHERE {

•

In DBpedia, Barack Obama is not of type President!

•

How can we add missing types?

10/31/13


5

Is It a Big Problem?
•

DBpedia has at least 2.7 million missing type statements
– w.r.t. the DBpedia ontology
– found using co-occurence analysis of matching classes
in YAGO and DBpedia
– a very optimistic lower bound

•

Highly incomplete classes:
– Species: >870,000 missing statements
– Person: >510,000 missing statements
– Event: >150,000 missing statements

10/31/13


6

A Naive Approach
•

Idea: exploit properties with domain and range

•

Pseudo RDFS Reasoning:
– CONSTRUCT {?x a ?t}
WHERE { {?x ?r ?y . ?r rdfs:domain ?t}
UNION
{?y ?r ?x . ?r rdfs:range ?t} }

10/31/13


7

A Naive Approach
•

Experiment with Barack Obama
– Person, PersonFunction, Actor, Organization

•

Experiment with Germany:
– Place, Award, Populated Place, City, SportsTeam, Mountain, Agent,
Organisation, Country, Stadium, RecordLabel, MilitaryUnit, Company,
EducationalInstitution, PersonFunction, EthnicGroup, Architect, WineRegion,
Language, MilitaryConflict, Settlement, RouteOfTransportation

10/31/13


8

A Naive Approach
•

What is going on here?
– DBpedia data is noisy
– One wrong statement is enough for a wrong conclusion
– e.g.: dbpedia:Kurt_H._Debus dbpedia-owl:award dbpedia:Germany

•

Germany example: 69,000 statements
– 20 wrong types can come from 20 wrong statements
– i.e., an error rate of 0.03% is enough for a totally screwed result
– ...but that would be an excellent data quality for a LOD source!

10/31/13


9

SDType Approach
•

Idea: outgoing/incoming properties are indicators
for a resource's type
– e.g.: starring → Movie
– e.g.: author-1 → Writer

•

Basic compiled statistics
– P(C|p) := probability of class C in presence of property p
– e.g.: P(dbpedia:Film|starring) = 0.79
– e.g.: P(dbpedia:Writer|author-1) = 0.44

10/31/13


10

SDType Approach
•

Based on precompiled statistics
– Find types of instances
– Using voting

•

score(C) = avg(all properties p) P(C|p)

•

Refinement:
– Weight for properties: discriminative power
– weight(p) = sum(all classes c) (p(c)-p(c|p))²
– i.e., how strongly this property's class distribution
deviates from the overall class distribution

10/31/13


11

Evaluation
•

Two fold evaluation
– On DBpedia and OpenCyc as „Silver Standard“
(automatic, 10,000 random instances)
– On untyped DBpedia resources (manual, 100 instances)

•

Using only incoming properties
– Using outgoing properties is trivial!

10/31/13


12

Evaluation Results
•

On DBpedia

1
0.9
0.8

Precision

0.7
0.6
min. 1 link
min. 10 links
min. 25 links

0.5
0.4
0.3
0.2
0.1
0
0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

Recall

10/31/13


13

Evaluation Results
•

On OpenCyc

1
0.9
0.8

Precision

0.7
0.6
min. 1 link
min. 10 links
min. 25 links

0.5
0.4
0.3
0.2
0.1
0
0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

Recall

10/31/13


14

Evaluation Results
•

Evaluation on untyped resources
– Random sample of 100 untyped resources
– Manual checking of precision

1

12

0.9
10

0.8
0.7
Precision

0.6
0.5

6

0.4
4

0.3
0.2

# found types

8
# found
types
precision

2

0.1
0

0
0.9

0.8

0.7

0.6

0.5

0.4

0.3

0.2

0.1

0

Lower bound for threshold

10/31/13


15

Evaluation Results
•

DBpedia:
– works reasonably well (F-measure 0.89)

•

OpenCyc:
– harder because of deeper class hierarchy (F-measure 0.60)

•

General:
– having more links increases precision
(in contrast to RDFS reasoning)
– more general types (e.g., Band) are easier than specific ones
(e.g., PunkRockBand)

10/31/13


16

Deployment
•

Heuristic types have been included in DBpedia 3.9
– for previously untyped instances
– 3.4 million type statements at precision ~0.95

•

Includes also many resources without a Wikipedia page
– i.e., generated from a red link

•

Runtime
– Complexity O(PT)
P: number of property assertions
T: number of type assertions
– ~24h for processing DBpedia

10/31/13


17

Conclusion and Outlook
•

SDType approach works at high quality
– outperforms most state of the art on DBpedia
– deployed for DBpedia 3.9

•

Same approach can be used for
– validating links
– within dataset: deployed for DBpedia 3.9 (removed ~13,000 wrong statements)
– across datasets: to be done

10/31/13


18

Type Inference on Noisy RDF Data

10/31/13 Paulheim, Christian Bizer
Heiko

19

Type Inference on Noisy RDF Data

More Related Content

What's hot

Similar to Type Inference on Noisy RDF Data

More from Heiko Paulheim

Recently uploaded

Type Inference on Noisy RDF Data