Exploiting RDFS and OWL for Integrating Heterogeneous, Large-Scale, Linked Data Corpora Aidan Hogan PhD Viva
Cold Open   Figure 1: Web of Data explicit  data implicit  data Topic of thesis:   How can consumers tap into the implicit data
PRELUDE The Area… The Problem… The Hypothesis…
The Area… … Linked Data / Linking Open Data
  Bottom-up Approach to Semantic  Web Individual Publishers should: Use URIs to name things  (not just documents) Use HTTP URIs   that can be looked up Return   information in a common structured data model  ( RDF ) Use external URIs in your data  so as to  link to related data … the micro … Linked Data Principles
  … the macro … A Web of Data Images from:  http://richard.cyganiak.de/2007/10/lod/ ;  Cyganiak, Jentzsch September 2010 August 2007 November 2007   February 2008   March 2008   September 2008   March 2009   July 2009
… so what’s  The Problem ? … … heterogeneity
Take  Query Answering … SPARQL endpoints over Web data such as  YARS2 , Virtuoso, FactForge, etc. Search engines such as  SWSE , Sindice, Falcons, Swoogle, Watson, etc.
Take  Query Answering …   Gimme   webpages   relating to Tim Berners-Lee foaf:page   timbl:i   timbl:i   foaf:page   ?pages  .
Heterogeneity in  terminology …   webpage:  properties   foaf:page   foaf:homepage   foaf:isPrimaryTopicOf   foaf:weblog   doap:homepage   foaf:topic   foaf:primaryTopic   mo:musicBrainz   mo:myspace   … = rdfs:subPropertyOf  = owl:inverseOf
Linked Data, RDFS and OWL:    Linked Vocabularies   … … Image from  http://blog.dbtune.org/public/.081005_lod_constellation_m.jpg : ;  Giasson, Bergman
  Heterogeneity in  naming … Tim Berners-Lee:  URIs … timbl:i dblp:100007 identica:45563 adv:timbl fb:en.tim_berners-lee db:Tim-Berners_Lee = owl:sameAs
Returning to our  Query …   Gimme   webpages   relating to Tim Berners-Lee foaf:page   timbl:i  timbl:i   foaf:page   ?pages  . ...   7 x 6 = 42  possible patterns foaf:homepage   foaf:isPrimaryTopicOf   doap:homepage   foaf:topic   foaf:primaryTopic   mo:myspace   dblp:100007 identica:45563 adv:timbl fb:en.tim_berners-lee db:Tim-Berners_Lee
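To make the 7 x 6 = 42 figure concrete, here is a minimal Python sketch that enumerates the triple patterns a consumer would otherwise have to write by hand. The property and URI lists are copied from the slide; treating foaf:topic and foaf:primaryTopic as pointing from page to person (and so flipping the pattern) is an assumption based on the owl:inverseOf relation mentioned earlier.

```python
from itertools import product

# Seven page-like properties from the slide.
PROPERTIES = [
    "foaf:page", "foaf:homepage", "foaf:isPrimaryTopicOf", "doap:homepage",
    "foaf:topic", "foaf:primaryTopic", "mo:myspace",
]
# Six URIs naming Tim Berners-Lee from the slide.
URIS = [
    "timbl:i", "dblp:100007", "identica:45563", "adv:timbl",
    "fb:en.tim_berners-lee", "db:Tim-Berners_Lee",
]
# Assumption: these two properties point from the page to the person.
PAGE_TO_PERSON = {"foaf:topic", "foaf:primaryTopic"}


def query_patterns():
    """Return every (property, URI) triple pattern as a string."""
    patterns = []
    for prop, uri in product(PROPERTIES, URIS):
        if prop in PAGE_TO_PERSON:
            patterns.append(f"?pages {prop} {uri} .")
        else:
            patterns.append(f"{uri} {prop} ?pages .")
    return patterns
```

Without reasoning or consolidation, a query would need the union of all of these patterns to approach complete answers.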
… The Hypothesis ? … … we can use the OWL and RDFS inherent in Linked Data to attenuate the problem of heterogeneity for consumers
Scenario… … take a  static  corpus crawled from Linked Data… … about a  billion triples  or so… … and  tackle the problem (s)  of heterogeneity … ( without domain-specific “cheats” ).
Setup… hardware … 9 machines … ~6 years old… 4 GB RAM, 2.2 GHz, Ethernet
Setup… corpus … crawl ( 9 machines: 52.5 hr ) … took random seed URIs from Billion Triple Challenge 2009 dataset … crawled  ~4 million RDF/XML documents … from arbitrary domains (e.g.,  dbpedia.org) Only found 785 domains providing RDF/XML … 1.118 billion quadruples … 947 million unique triples
Setup… ranking ( 9 machines: 30.3 hr ) … applied PageRank over interlinked source docs. … source A links to source B if A uses a URI which “dereferences” (points) to B
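As a rough illustration of the ranking step, the following Python sketch runs a plain power-iteration PageRank over a document-level link graph. The damping factor, the iteration count and the even redistribution of rank from documents without outlinks are assumptions of this sketch, not details taken from the slides.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each source document to the documents it links to,
    i.e. the documents its URIs dereference to."""
    docs = set(links) | {t for targets in links.values() for t in targets}
    n = len(docs)
    rank = {d: 1.0 / n for d in docs}
    for _ in range(iterations):
        new_rank = {d: (1.0 - damping) / n for d in docs}
        for d in docs:
            targets = links.get(d, ())
            if targets:
                # Split this document's rank evenly over its outlinks.
                share = damping * rank[d] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling document: spread its rank over all documents.
                share = damping * rank[d] / n
                for t in docs:
                    new_rank[t] += share
        rank = new_rank
    return rank
```

For example, `pagerank({"a": ["foaf"], "b": ["foaf"], "foaf": []})` gives the hypothetical document "foaf" a higher rank than "a" or "b", since both of them link to it.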
Challenges… … what (OWL) reasoning is feasible for Linked Data?
Linked Data Reasoning:  Challenges   Scalable Expressive Robust Domain-Agnostic
CORE 1.  Reasoning… 2.  Annotated Reasoning… 3.   Consolidation…
1. Reasoning
High Level Approach… … apply a subset of OWL 2 RL/RDF rules over the data
  Forward Chaining materialisation: Avoid runtime expense of backward-chaining Users taught impatience by Google Pre-compute answers for quick retrieval Web-scale systems should be scalable! More data = more disk-space/machines Web Reasoning: Forward Chaining! One size does  not fit all! Don't materialise too much!
Scalable   Authoritative  OWL Reasoner   Our Approach
  Our Approach… INPUT: Flat file of triples (quads) OUTPUT: Flat file of (partial) inferred triples (quads)
Scalable   Reasoning:  In-mem T-Box Main optimisation: Store T-Box in memory T-Box: (loosely) data describing classes and properties.  Aka. schemata/vocabularies/ontologies/terminologies. E.g.,  foaf:topic owl:inverseOf foaf:page . sioc:UserAccount rdfs:subClassOf foaf:OnlineAccount . Most commonly accessed data   for reasoning Quite small (~0.1% for our Linked Data corpus) High selectivity (if you prefer) A-Box:   Lots   ?s foaf:page ?o .   vs.  T-Box:   Few   foaf:page ?p ?o .   +   ?s ?p foaf:page .
Scan 1:  Scan input data, separate out the T-Box statements and load them into memory. Do T-Box level reasoning if required (semi-naïve). Scan 2:  Scan all on-disk data, join with in-memory T-Box.   Scalable   Reasoning:  Two Scans
Scalable   Reasoning:  No A-Box Joins
IN-MEM T-BOX
ON-DISK A-BOX: ... ex:me foaf:homepage ex:hp . ...
ON-DISK OUTPUT: ... ex:hp rdf:type foaf:Document . ex:me foaf:page ex:hp . ex:hp foaf:topic ex:me . ...
Execution of three rules:
OWL 2 RL rule prp-inv1: ?p1 owl:inverseOf ?p2 . ?x ?p1 ?y . ⇒ ?y ?p2 ?x .
OWL 2 RL rule prp-rng: ?p rdfs:range ?c . ?x ?p ?y . ⇒ ?y a ?c .
OWL 2 RL rule prp-spo1: ?p1 rdfs:subPropertyOf ?p2 . ?x ?p1 ?y . ⇒ ?x ?p2 ?y .
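The two-scan idea with the three rules above can be sketched in a few lines of Python. This is an in-memory toy, not the on-disk system described in the talk: the function names are hypothetical, and the three T-Box axioms in `EXAMPLE` are illustrative assumptions chosen so that the output matches the slide.

```python
TBOX_PREDICATES = ("owl:inverseOf", "rdfs:range", "rdfs:subPropertyOf")

EXAMPLE = [
    # Illustrative T-Box axioms (assumed for this sketch).
    ("foaf:homepage", "rdfs:subPropertyOf", "foaf:page"),
    ("foaf:page", "owl:inverseOf", "foaf:topic"),
    ("foaf:page", "rdfs:range", "foaf:Document"),
    # A-Box triple from the slide.
    ("ex:me", "foaf:homepage", "ex:hp"),
]


def load_tbox(triples):
    """Scan 1: index T-Box statements in memory by their subject."""
    tbox = {p: {} for p in TBOX_PREDICATES}
    for s, p, o in triples:
        if p in tbox:
            tbox[p].setdefault(s, set()).add(o)
    return tbox


def infer(triple, tbox):
    """Apply prp-inv1, prp-rng and prp-spo1 to one triple (no A-Box join)."""
    x, p, y = triple
    out = set()
    for p2 in tbox["owl:inverseOf"].get(p, ()):
        out.add((y, p2, x))
    for c in tbox["rdfs:range"].get(p, ()):
        out.add((y, "rdf:type", c))
    for p2 in tbox["rdfs:subPropertyOf"].get(p, ()):
        out.add((x, p2, y))
    return out


def reason(triples):
    """Scan 2: join every triple with the in-memory T-Box, re-applying
    the rules to each new inference until nothing new appears."""
    tbox = load_tbox(triples)
    inferred = set()
    for triple in triples:
        pending = [triple]
        while pending:
            for new in infer(pending.pop(), tbox):
                if new not in inferred:
                    inferred.add(new)
                    pending.append(new)
    return inferred - set(triples)
```

Because each rule has only one A-Box pattern, every input triple can be processed on its own, which is what makes a single pass over the on-disk data sufficient.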
However: some rules do require A-Box joins ?p a owl:TransitiveProperty .  ?x ?p ?y . ?y ?p ?z .  ⇒  ?x ?p ?z . Difficult to engineer a scalable solution (which reaches a fixpoint) for Linked Data(?) Can lead to quadratic inferences A lot of useful reasoning still possible without A-Box joins…   Scalable   Reasoning:  A-Box joins?
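The quadratic growth mentioned above is easy to see with a small Python sketch of the transitivity rule, applied naively until a fixpoint is reached. The function name and the chain example are illustrative only.

```python
def transitive_closure(edges):
    """Repeatedly join the relation with itself (?x ?p ?y . ?y ?p ?z .)
    until no new (?x, ?z) pair is produced."""
    closure = set(edges)
    while True:
        successors = {}
        for x, y in closure:
            successors.setdefault(x, set()).add(y)
        new = {
            (x, z)
            for x, y in closure
            for z in successors.get(y, ())
        } - closure
        if not new:
            return closure
        closure |= new


# A chain of 10 nodes has 9 explicit edges.
chain = [(i, i + 1) for i in range(9)]
```

Closing the 9-edge chain yields 45 pairs, i.e. n(n-1)/2 for n = 10 nodes, so the inferred data grows quadratically with the length of the chain.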
Consider source of T-Box (schemata) data Class/property URIs dereference to their  authoritative  document FOAF spec authoritative for  foaf:Person   ✓  MY spec not authoritative for  foaf:Person   ✘ Allow “extension” in authoritative documents my:Person rdfs:subClassOf foaf:Person .  (MY spec)  ✓ BUT:  Reduce obscure memberships foaf:Person rdfs:subClassOf my:Person .  (MY spec)  ✘ ALSO:  Protect specifications foaf:knows a owl:SymmetricProperty .  (MY spec)  ✘   Authoritative   Reasoning
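The three examples on this slide can be checked with a very small Python sketch. It simplifies the idea to a single test, namely that the document serving an axiom must be the one the axiom's subject dereferences to. The `DEREF` table and document names are hypothetical, and the full definition of authority may cover more cases than this subject-only check.

```python
# Hypothetical lookup: which document each term's URI dereferences to.
DEREF = {
    "foaf:Person": "foaf-spec",
    "foaf:knows": "foaf-spec",
    "my:Person": "my-spec",
}


def is_authoritative(axiom, source_doc, deref=DEREF):
    """Accept a T-Box axiom only if its source document is authoritative
    for the axiom's subject (simplifying assumption of this sketch)."""
    subject, _predicate, _obj = axiom
    return deref.get(subject) == source_doc
```

Under this check, MY spec may extend FOAF by declaring my:Person a subclass of foaf:Person, but it cannot redefine foaf:Person or foaf:knows.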
Survey of terminology:  counts
Looked at use of RDFS and OWL in our corpus
rdfs:subClassOf        ~307k axioms   ~51k docs   ✓
owl:equivalentClass    ~23k axioms    ~23k docs   ✓
rdfs:domain            ~16k axioms    623 docs    ✓
rdfs:range             ~14k axioms    717 docs    ✓
owl:unionOf            ~13k axioms    109 docs    ✓
rdfs:subPropertyOf     ~9k axioms     227 docs    ✓
owl:inverseOf          ~1k axioms     98 docs     ✓
owl:disjointWith       917 axioms     60 docs     ✘
owl:someValuesFrom     465 axioms     48 docs     ✓
owl:intersectionOf     325 axioms     12 docs     ✓ / ✘
…
...summary please? Our “cheap rules” cover 99%  of RDFS/OWL axioms in our corpus 82.3% of such axioms have an authoritative version - 78.3% of all non-authoritative axioms come from one doc - (without which, ~96% of axioms have auth. version) 9.1% of documents have non-authoritative axioms Authoritative reasoning for cheap rules fully supports 90.6% of the “vocabulary documents”   Survey of terminology: counts
Survey of terminology:  ranks
Looked at use of RDFS and OWL wrt. ranks of documents…
rdfs:subClassOf                 0.295   ✓
rdfs:range                      0.294   ✓
rdfs:domain                     0.292   ✓
rdfs:subPropertyOf              0.090   ✓
owl:FunctionalProperty          0.063   ✘
owl:disjointWith                0.049   ✘
owl:inverseOf                   0.047   ✓
owl:unionOf                     0.035   ✓
owl:SymmetricProperty           0.033   ✓
owl:equivalentClass             0.021   ✓
owl:InverseFunctionalProperty   0.030   ✘
owl:equivalentProperty          0.030   ✓
owl:someValuesFrom              0.030   ✓ / ✘
...summary please? Adding up the ranks of all vocabularies our rules  fully  support gives  77%  of the total rank of all vocabularies Adding up the ranks of all vocabularies our authoritative rules fully support gives  70%  of the total rank of all vocabularies The highest ranked document our rules do not fully support was 5th overall: SKOS The highest ranked document with non-authoritative axioms was 7th overall: FOAF   Survey of terminology: ranks
...let’s stick to the simple rules
Scalable   Distributed Reasoning   [Diagram: each machine runs EXTRACT T-BOX over its own segment of the data (e.g. ex:me ex:presented ex:ThisTalk .); the extracted statements are gathered in COLLECT T-BOX so that every machine holds the SAME T-BOX; each machine then reasons over its own DIFF. A-BOX and writes LOCAL OUTPUT (e.g. ex:me rdf:type ex:Awesome .)]
  Reasoning Performance (1 machine)
Reasoning Performance: Distrib. 9 machines: Total 3.35 hours
  Reasoning: Results 962 million  unique/novel triples 947 million unique triples
2.  Annotated Reasoning
Annotated Reasoning Let’s try to track some meta-information during the reasoning process Annotate input triples with information Use annotated reasoning framework for transforming annotations on input triples into annotations on output triples
Each input triple is assigned the sum of the ranks of the documents in which it appears… foaf:Person rdfs:subClassOf foaf:Agent 0.3 . timbl:i rdf:type foaf:Person 0.04 . aidan:me rdf:type foaf:Person 0.0001 .   Annotated Reasoning: ranks
During reasoning, inferences are assigned the rank of the least-trustworthy triple involved in their “proof” foaf:Person rdfs:subClassOf foaf:Agent  0.3  . timbl:i rdf:type foaf:Person  0.04  . ⇒ timbl:i rdf:type foaf:Agent  0.04  .   Annotated Reasoning
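A minimal Python sketch of the two annotation steps: summing document ranks per input triple, then propagating the minimum rank through a subclass inference. The function names are hypothetical and only the rdfs:subClassOf rule is shown. Keeping the maximum when a triple is derivable in several ways is this sketch's reading of "removing non-optimal triples".

```python
def sum_ranks(quads, doc_rank):
    """Annotate each triple with the summed rank of the distinct
    documents in which it appears."""
    ranks = {}
    for s, p, o, doc in set(quads):
        ranks[(s, p, o)] = ranks.get((s, p, o), 0.0) + doc_rank[doc]
    return ranks


def annotated_subclass(ranks):
    """Infer rdf:type triples through rdfs:subClassOf, annotating each
    inference with the lowest rank used in its derivation."""
    inferred = {}
    for (c1, p, c2), axiom_rank in ranks.items():
        if p != "rdfs:subClassOf":
            continue
        for (x, p2, c), fact_rank in ranks.items():
            if p2 == "rdf:type" and c == c1:
                triple = (x, "rdf:type", c2)
                rank = min(axiom_rank, fact_rank)
                # Keep the best annotation if derived more than once.
                inferred[triple] = max(inferred.get(triple, 0.0), rank)
    return inferred
```

With the ranks from the slide, timbl:i rdf:type foaf:Agent receives 0.04, the smaller of 0.3 and 0.04.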
Can do top- k  materialisation Only give me inferences above a certain rank threshold Only give me top- k  inferences Can fix inconsistencies in the data… … aka. logical contradictions … interpreting the rank values as denoting “trustworthy” data   Why?
foaf:Person owl:disjointWith foaf:Document .   Inconsistencies:    aka. Contradictions
?c1 owl:disjointWith ?c2 .   ?x rdf:type ?c1 .  ?x rdf:type ?c2 .  ⇒   false foaf:Person owl:disjointWith foaf:Document . ex:sleepygirl rdf:type foaf:Person . ex:sleepygirl rdf:type foaf:Document . ⇒   false   Cannot compute…
Considered two approaches: 1.   Find the “consistency threshold” of the input + inferred data: The largest rank such that all data above that rank are consistent Unfortunately, the 22nd ranked document had an ill-typed literal, and so was inconsistent… So we would keep the data of ~22 documents And throw away the data of nearly four million   Fixing inconsistencies
Time for Plan B: 2.   Perform a “granular” repair of the data Remove the weakest triple causing each contradiction foaf:Person owl:disjointWith foaf:Document  0.3  . ex:sleepygirl rdf:type foaf:Person  0.007  . ex:sleepygirl rdf:type foaf:Document  0.002 .   Fixing inconsistencies
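The granular repair can be sketched in Python for the disjoint-class case. This toy version only looks at explicit triples held in a rank dictionary; tracing inferred triples back to their inputs, which a full repair would need, is left out. The function name is hypothetical.

```python
def repair_disjointness(ranks):
    """For every individual typed with two disjoint classes, drop the
    lowest-ranked triple among the three involved in the contradiction.
    Returns (repaired ranks, set of removed triples)."""
    removed = set()
    for (c1, p, c2) in ranks:
        if p != "owl:disjointWith":
            continue
        for (x, p2, c) in ranks:
            if p2 != "rdf:type" or c != c1:
                continue
            clash = (x, "rdf:type", c2)
            if clash in ranks:
                involved = [(c1, p, c2), (x, "rdf:type", c1), clash]
                # The weakest triple is the one with the lowest rank.
                removed.add(min(involved, key=ranks.get))
    repaired = {t: r for t, r in ranks.items() if t not in removed}
    return repaired, removed
```

On the slide's example, the triple typing ex:sleepygirl as a foaf:Document (rank 0.002) is the one removed.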
~294k ill-typed datatypes ~7k members of disjoint classes   Inconsistencies found
Performance 9 machines Annotated Reasoning:  14.6 hrs  (vs. 3.35hrs w/o annotations: need to do a distributed sort to remove non-optimal triples ) Detect/Extract Inconsistencies: 2.9 hrs Diagnosis/Repair 2.8 hrs Total ~20.3 hours
3. Consolidation
Consolidation for Linked Data
Baseline Approach… … use the explicit  owl:sameAs   relations given in the data…
Scan the data and extract all  owl:sameAs  triples  timbl:i  owl:sameAs  identica:45563  . dbpedia:Berners-Lee  owl:sameAs  identica:45563  . Load into memory Use a map to store equivalences: each of timbl:i, identica:45563 and dbpedia:Berners-Lee -> the shared set { timbl:i, identica:45563, dbpedia:Berners-Lee }   Consolidation: Baseline
For each set of equivalent identifiers, choose a canonical term   Consolidation: Baseline timbl:i identica:45563 dbpedia:Berners-Lee
Scan data a second time: Rewrite identifiers to their canonical version Skip predicates and values of  rdf:type   Canonicalisation
Before: timbl:i  rdf:type foaf:Person . identica:48404 foaf:knows  identica:45563  . dbpedia:Berners-Lee   dpo:birthDate  "1955-06-08"^^xsd:date  .
After: dbpedia:Berners-Lee  rdf:type foaf:Person . identica:48404 foaf:knows  dbpedia:Berners-Lee  . dbpedia:Berners-Lee   dpo:birthDate  "1955-06-08"^^xsd:date  .
Equivalence set: timbl:i identica:45563 dbpedia:Berners-Lee
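The baseline steps (gather owl:sameAs pairs in memory, pick a canonical term per set, rewrite the data) can be sketched in Python with a union-find structure. Choosing the lexicographically lowest identifier as the canonical term is an assumption made here for determinism; the slides do not say how the canonical term is picked. Function names are hypothetical.

```python
def build_canonical_map(same_as_pairs):
    """Union-find over owl:sameAs pairs; each set's root is its
    lexicographically lowest member, used as the canonical term."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in same_as_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            low, high = sorted((root_a, root_b))
            parent[high] = low
    return {term: find(term) for term in list(parent)}


def canonicalise(triples, canon):
    """Second scan: rewrite subjects and objects to canonical terms,
    skipping predicates and the values of rdf:type."""
    out = []
    for s, p, o in triples:
        s = canon.get(s, s)
        if p != "rdf:type":
            o = canon.get(o, o)
        out.append((s, p, o))
    return out
```

With the two owl:sameAs triples from the earlier slide, all three identifiers map to dbpedia:Berners-Lee, and rewriting the example triples gives the consolidated output shown above.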
Baseline Consolidation:  Performance 9 machines Extract  owl:sameAs :  0.2 hr  Gather  owl:sameAs :  0.1 hr Canonicalise data  0.7 hr Total ~1.1 hours
Applied over raw input data ~12 million  owl:sameAs  triples ~2.2 million sets of equivalent identifiers ~5.8 million identifiers involved ~2.65 identifiers per set ~99.99% of terms were URIs ~6.25% of all URIs   Baseline Consolidation: Results
Extended Approach… … use the  owl:sameAs   relations inferable through reasoning…
Infer  owl:sameAs  through reasoning (OWL 2 RL/RDF) explicit  owl:sameAs  (again) owl:InverseFunctionalProperty owl:FunctionalProperty owl:cardinality 1 / owl:maxCardinality 1 foaf:homepage a owl:InverseFunctionalProperty  . timbl:i  foaf:homepage w3c:timblhomepage . adv:timbl  foaf:homepage w3c:timblhomepage . ⇒ timbl:i  owl:sameAs  adv:timbl  . … then apply consolidation as before   Extended   Consolidation
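The inverse-functional-property case can be sketched in Python by grouping subjects that share the same (property, value) pair. This version works in memory, whereas the talk describes on-disk sorts and merge-joins for the real corpus; the function name is hypothetical, and emitting one pair per additional subject (rather than all pairs) is a choice of this sketch.

```python
from collections import defaultdict


def same_as_from_ifp(triples):
    """Derive owl:sameAs triples for subjects sharing a value of an
    owl:InverseFunctionalProperty."""
    ifps = {
        s for s, p, o in triples
        if p == "rdf:type" and o == "owl:InverseFunctionalProperty"
    }
    subjects = defaultdict(set)
    for s, p, o in triples:
        if p in ifps:
            subjects[(p, o)].add(s)
    pairs = set()
    for group in subjects.values():
        ordered = sorted(group)
        # Link every other subject to the first; the consolidation
        # step can then compute the full equivalence sets.
        for other in ordered[1:]:
            pairs.add((ordered[0], "owl:sameAs", other))
    return pairs
```

Run over the three triples on the slide, this yields a single owl:sameAs triple linking adv:timbl and timbl:i.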
OWL 2 RL/RDF consolidation rules require A-Box joins! Might not be able to fit  owl:sameAs  index in memory ( 4 Gb ) ! ⇒   Use on-disk batch-processing Distributed sorts, scans and merge-joins   Derive  owl:sameAs  on-disk
Extended Consolidation:  Performance 9 machines Inferring owl:sameAs ~7.4 hr Canonicalise data  ~4.9 hr Total ~12.3 hours (11X baseline)
~12 million explicit  owl:sameAs  triples (as before) ~8.7 million thru.  owl:InverseFunctionalProperty ~106 thousand thru.   owl:FunctionalProperty none thru.   owl:cardinality / owl:maxCardinality ~2.8 million sets of equivalent identifiers (1.31x baseline) ~14.86 million identifiers involved (2.58x baseline) ~5.8 million URIs (1.014x baseline)   Extended Consolidation:  Results
CONCLUSION
  timbl:i   foaf:page   ?pages  . timbl:i identica:45563 dbpedia:Berners-Lee dbpedia:Berners-Lee   foaf:page   ?pages  .
Heterogeneity poses a significant problem for consuming Linked Data  Lightweight reasoning can go a long way Simple/authoritative rules have reasonable coverage Deceit/Noise   ≠  End Of World Inconsistency  ≠  End Of World Useful for finding noise in fact! Explicit   owl:sameAs  vs. extended consolidation:  Extended consolidation mostly for consolidating blank-nodes from older FOAF exporters   Conclusions
