Reasoning with a Billion of Linked Data Facts

Presentation Transcript

  • Reasoning with a Billion of Linked Data Facts Atanas Kiryakov ESTC 2009, Vienna Dec, 2009
  • Presentation Outline
    • Introduction
    • Reasoning with Linked Data
    • LDSR: the largest body of common-sense knowledge
    • PIKB: 20 biomedical databases in a box
    • Quo vadis?
  • What do we do?
    • We build upon lightweight semantics that is easy to understand, deploy, and manage
    • For instance, think of ontologies as database schemata with simple interpretation rules. Plenty of obvious (but useful) implicit facts can be inferred and match queries right away
  • It is simple
    [Figure: RDF graph connecting myData:Maria and myData:Ivan to the classes ptop:Agent, ptop:Person, ptop:Woman and the properties ptop:childOf, ptop:parentOf, ptop:relativeOf; edges are labelled rdf:type, rdfs:subClassOf, rdfs:subPropertyOf, rdfs:range, owl:inverseOf, owl:SymmetricProperty, and some edges are marked as inferred]
  • Rule-Based Inference
    • <C1,rdfs:subClassOf,C2> and <C2,rdfs:subClassOf,C3> entail <C1,rdfs:subClassOf,C3>
    • <I,rdf:type,C1> and <C1,rdfs:subClassOf,C2> entail <I,rdf:type,C2>
    • <I1,P1,I2> and <P1,rdfs:range,C2> entail <I2,rdf:type,C2>
    • <P1,owl:inverseOf,P2> and <I1,P1,I2> entail <I2,P2,I1>
    • <P1,rdf:type,owl:SymmetricProperty> entails <P1,owl:inverseOf,P1>
    [Figure: the Maria/Ivan RDF graph from the previous slide, repeated]
  • So, what?
    • With the rules above applied to the same graph, the database will return Ivan as a result of the query
      • Maria relativeOf ?x
    • when the fact asserted was
      • Ivan childOf Maria
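The rules on the slides above can be tried out with a small forward-chaining loop. The following is only a minimal sketch, not how a semantic repository implements materialization: it keeps all triples in a Python set and re-applies the rules until nothing new appears. The rdfs:subPropertyOf rule is an assumption taken from the edge labels in the diagram (it is not in the rule list), and the rdfs:range fact in the example data is hypothetical.

```python
# Toy forward-chaining reasoner over (subject, predicate, object) triples.
TYPE = "rdf:type"
SUBCLASS = "rdfs:subClassOf"
SUBPROP = "rdfs:subPropertyOf"  # assumption: from the diagram, not the rule list
RANGE = "rdfs:range"
INVERSE = "owl:inverseOf"
SYMMETRIC = "owl:SymmetricProperty"


def apply_rules(triples):
    """Return everything derivable in a single pass over the given triples."""
    derived = set()
    for s, p, o in triples:
        if p == SUBCLASS:
            for s2, p2, o2 in triples:
                if p2 == SUBCLASS and s2 == o:   # C1 sub C2, C2 sub C3
                    derived.add((s, SUBCLASS, o2))
                if p2 == TYPE and o2 == s:       # I type C1, C1 sub C2
                    derived.add((s2, TYPE, o))
        elif p == RANGE:
            for s2, p2, o2 in triples:
                if p2 == s:                      # I1 P1 I2, P1 range C2
                    derived.add((o2, TYPE, o))
        elif p == INVERSE:
            for s2, p2, o2 in triples:
                if p2 == s:                      # P1 inverseOf P2, I1 P1 I2
                    derived.add((o2, o, s2))
        elif p == TYPE and o == SYMMETRIC:       # P1 is symmetric
            derived.add((s, INVERSE, s))
        elif p == SUBPROP:
            for s2, p2, o2 in triples:
                if p2 == s:                      # I1 P1 I2, P1 subPropertyOf P2
                    derived.add((s2, o, o2))
    return derived


def materialize(triples):
    """Repeat the rules until a fixpoint is reached (full materialization)."""
    closure = set(triples)
    while True:
        new = apply_rules(closure) - closure
        if not new:
            return closure
        closure |= new


facts = {
    ("ptop:Woman", SUBCLASS, "ptop:Person"),
    ("ptop:Person", SUBCLASS, "ptop:Agent"),
    ("myData:Maria", TYPE, "ptop:Woman"),
    ("ptop:childOf", INVERSE, "ptop:parentOf"),
    ("ptop:childOf", SUBPROP, "ptop:relativeOf"),
    ("ptop:relativeOf", TYPE, SYMMETRIC),
    ("ptop:parentOf", RANGE, "ptop:Person"),     # hypothetical range, for illustration
    ("myData:Ivan", "ptop:childOf", "myData:Maria"),
}

closure = materialize(facts)
relatives_of_maria = sorted(
    o for s, p, o in closure if s == "myData:Maria" and p == "ptop:relativeOf"
)
print(relatives_of_maria)
```

The nested loops are quadratic in the number of triples, which is fine for a handful of facts; the point is only to show that the query "Maria relativeOf ?x" finds Ivan although only "Ivan childOf Maria" was asserted.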
  • Scalable Reasoning Map (Jun’09)
  • Interlinking Text and Data
  • Elevator Pitch
    • We link your data, your content, and the web!
    • In 10 weeks we can build a solution which:
      • integrates 10 databases with the linked data cloud
      • mines 10 million documents and web pages
    • and lets you search and navigate all this information
      • in 10 different ways
      • from a $10,000 server
  • Linking Open Data
    • Linking Open Data (LOD) W3C SWEO Community project http://esw.w3.org/topic/SweoIG/TaskForces/CommunityProjects/LinkingOpenData
    • Initiative for publishing “linked data”: a set of principles that allows RDF data spread across different servers to be browsed the way HTML is browsed
  • Reason-able views to the LOD
    • Reasoning is unfeasible on the web of linked data
      • At least, it is not straightforward with the LOD taken “as is”
    • The major obstacles:
      • Most of the popular reasoners work with the “closed-world assumption”
      • The complexity of reasoning even with the simplest DL (say OWL Lite) is prohibitively high
      • Some of the datasets of LOD are not suitable for reasoning
        • Data publishers use OWL vocabulary with no account for its formal semantics
        • E.g. there are long cycles in category hierarchies
      • Reasoning with distributed data is practically unfeasible
        • It is possible, but it is much slower than reasoning with local data.
        • The fundamental reason is related to the so-called "remote join" problem
        • Speed and availability are also major issues
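One of the obstacles named above is cycles in category hierarchies. As a rough illustration of the kind of cleanup check this implies, the sketch below looks for a cycle among (child, parent) edges with an iterative depth-first search, so long chains do not hit Python's recursion limit. The function name and the `cat:` identifiers are hypothetical and not taken from any LOD dataset.

```python
def find_cycle(edges):
    """Return one cycle as a list of nodes (first == last), or None if acyclic.

    edges: iterable of (child, parent) pairs, e.g. rdfs:subClassOf links.
    """
    graph = {}
    for child, parent in edges:
        graph.setdefault(child, []).append(parent)
        graph.setdefault(parent, [])

    WHITE, GREY, BLACK = 0, 1, 2          # unvisited, on current path, finished
    colour = {node: WHITE for node in graph}

    for start in graph:
        if colour[start] != WHITE:
            continue
        colour[start] = GREY
        path = [start]
        stack = [(start, iter(graph[start]))]
        while stack:
            node, successors = stack[-1]
            nxt = next(successors, None)
            if nxt is None:               # all successors explored
                colour[node] = BLACK
                stack.pop()
                path.pop()
            elif colour[nxt] == GREY:     # back edge: nxt is on the current path
                return path[path.index(nxt):] + [nxt]
            elif colour[nxt] == WHITE:
                colour[nxt] = GREY
                path.append(nxt)
                stack.append((nxt, iter(graph[nxt])))
    return None


# Hypothetical category hierarchy with a three-step cycle.
cyclic = [("cat:A", "cat:B"), ("cat:B", "cat:C"), ("cat:C", "cat:A"), ("cat:D", "cat:A")]
acyclic = [("cat:A", "cat:B"), ("cat:B", "cat:C"), ("cat:D", "cat:B")]

cycle = find_cycle(cyclic)
print(cycle)
print(find_cycle(acyclic))
```

Under a transitive subclass semantics, every class on such a cycle becomes a subclass of every other one, which is why a cycle of this kind can make inference results meaningless.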
  • Reason-able views to the LOD (2)
    • Reason-able views represent an approach for reasoning and management of linked data
    • Key ideas:
      • Group selected datasets and ontologies in a compound dataset
      • Load the compound dataset in a single semantic repository
      • Perform inference with respect to tractable OWL dialects
    • Objectives
      • Make reasoning and query evaluation feasible
      • Guarantee a basic level of consistency
      • Guarantee availability
      • Easier exploration and querying of unseen data
        • Lower the cost of entry
  • Reason-able: Selection Criteria
    • Dataset selection:
      • The dataset must allow inference that
        • delivers meaningful results under the semantics determined for the view
        • may be limited to a part of the dataset, provided that part is easy to define and isolate
        • may require some data cleanup, but the cleanup should be easy and cheap to perform in a (semi-)automated fashion
      • The dataset must be more or less static, i.e. not a wrapper for a database
  • Two reason-able views to the web of linked data
    • LDSR: Linked Data Semantic Repository (in red)
      • Some of the central LOD datasets
      • General-purpose information (not specific to a domain)
      • 426M explicit plus 567M inferred, almost 1B triples in total
      • The largest upper level knowledge base
      • http://www.ontotext.com/ldsr/
    • Linked Life Data - PIKB (in yellow)
      • More than 20 of the most popular life-science datasets
      • Complemented by gluing ontologies
      • 2.5 billion explicit and 2.8 billion inferred, total 5.3 billion statements
      • The largest body of knowledge that was used for reasoning
      • http://www.linkedlifedata.com
  • Linking Open Data Datasets and Views (red and yellow)
  • Linked Data Semantic Repository
    • Datasets: DBPedia, Geonames, UMBEL, Wordnet, CIA World Factbook, Lingvoj
    • Ontologies: Dublin Core, SKOS, RSS, FOAF
    • Inference: materialization with respect to owl-max
      • One of the richest tractable fragments of OWL (close to OWL2 RL)
      • Seems to completely cover the semantics of the data
      • The owl:sameAs optimization in BigOWLIM allows reduction of the indices without loss of semantics or performance
    • Publicly available at http://ldsr.ontotext.com
      • Query and explore through Forest and Tabulator
      • RDF Search: retrieve a ranked list of URIs by keywords
      • SPARQL end-point
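The slide does not say how the owl:sameAs optimization works inside BigOWLIM. One common way to get a similar effect, shown here purely as an assumption, is to group equivalent URIs with a union-find structure, store each statement once under a single representative per group, and expand back to all equivalent URIs at query time. The class name, the helper functions, and the `ex:` URIs are hypothetical.

```python
SAME_AS = "owl:sameAs"


class SameAsIndex:
    """Union-find over URIs; each equivalence class gets one representative."""

    def __init__(self):
        self.parent = {}

    def find(self, uri):
        self.parent.setdefault(uri, uri)
        root = uri
        while self.parent[root] != root:
            root = self.parent[root]
        while self.parent[uri] != root:          # path compression
            self.parent[uri], uri = root, self.parent[uri]
        return root

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            # Deterministic choice: the lexicographically smaller URI wins.
            self.parent[max(ra, rb)] = min(ra, rb)

    def members(self, uri):
        root = self.find(uri)
        return sorted(u for u in list(self.parent) if self.find(u) == root)


def compress(triples):
    """Store every non-sameAs statement once, under class representatives."""
    index = SameAsIndex()
    for s, p, o in triples:
        if p == SAME_AS:
            index.union(s, o)
    stored = {(index.find(s), p, index.find(o)) for s, p, o in triples if p != SAME_AS}
    return index, stored


def describe(uri, index, stored):
    """Answer 'all statements about uri', as if sameAs had been fully expanded."""
    rep = index.find(uri)
    return sorted((uri, p, o) for s, p, o in stored if s == rep)


triples = [
    ("ex:Berlin_a", SAME_AS, "ex:Berlin_b"),
    ("ex:Berlin_b", SAME_AS, "ex:Berlin_c"),
    ("ex:Berlin_a", "ex:label", "Berlin"),
    ("ex:Berlin_b", "ex:label", "Berlin"),       # duplicate once URIs are merged
    ("ex:Berlin_c", "ex:country", "ex:Germany"),
]

index, stored = compress(triples)
print(len(stored))
print(describe("ex:Berlin_c", index, stored))
```

In this toy data two stored statements stand in for what a full expansion would list once for each of the three equivalent URIs, which is the sense in which such a scheme shrinks the indices without changing query answers.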
  • RDF Search
    • Objective:
      • Be able to search in an RDF graph by keywords
      • Get useable results
    • What and how to index:
      • Index URIs
      • Acquire a text representation for each URI by collecting the text from its RDF molecule
      • Index the text representations with standard FTS methods
    • What to return as result:
      • List of URIs, ranked by FTS + PageRank-like metric
      • Present them with human-readable labels and text snippets
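The indexing and ranking steps above can be sketched in a few lines. This is a simplified illustration under several assumptions: the "RDF molecule" is reduced to the literals directly attached to a URI, full-text scoring is a plain TF-IDF sum, and the slide's "FTS + PageRank-like metric" is read as a weighted sum with a made-up weight `alpha`. The rank values and `ex:` URIs are hypothetical.

```python
import math
from collections import defaultdict


class Literal(str):
    """Marks an object value as literal text rather than a URI."""


def text_representations(triples):
    """Collect the literal text attached to each subject URI."""
    texts = defaultdict(list)
    for s, _p, o in triples:
        if isinstance(o, Literal):
            texts[s].extend(o.lower().split())
    return texts


def build_index(texts):
    """Inverted index: term -> {uri: term frequency}."""
    index = defaultdict(dict)
    for uri, words in texts.items():
        for word in words:
            index[word][uri] = index[word].get(uri, 0) + 1
    return index


def search(query, index, n_docs, rank, alpha=0.5):
    """Rank URIs by TF-IDF score plus alpha times a precomputed node rank."""
    scores = defaultdict(float)
    for term in query.lower().split():
        postings = index.get(term, {})
        if not postings:
            continue
        idf = math.log(1 + n_docs / len(postings))
        for uri, tf in postings.items():
            scores[uri] += tf * idf
    return sorted(scores, key=lambda u: (-(scores[u] + alpha * rank.get(u, 0.0)), u))


triples = [
    ("ex:Berlin", "ex:label", Literal("Berlin")),
    ("ex:Berlin", "ex:comment", Literal("capital of Germany")),
    ("ex:Berlin_NH", "ex:label", Literal("Berlin")),
    ("ex:Berlin_NH", "ex:comment", Literal("city in New Hampshire")),
    ("ex:Berlin", "ex:country", "ex:Germany"),   # URI object, not indexed as text
]
rank = {"ex:Berlin": 0.9, "ex:Berlin_NH": 0.1}    # hypothetical PageRank-like scores

texts = text_representations(triples)
index = build_index(texts)
print(search("berlin", index, len(texts), rank))
print(search("hampshire", index, len(texts), rank))
```

For the query "berlin" both URIs get the same text score, so the node rank alone decides the order, which is the role a PageRank-like metric plays when many resources share a label.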
  • LDSR Statistics
    Dataset | Explicit indexed triples ('000) | Inferred indexed triples ('000) | Total # of indexed triples ('000) | Entities ('000 of nodes in the graph) | Inferred closure ratio
    Schemata and ontologies | 10 | 11 | 20 | 5 | 1.1
    DBPedia (SKOS categories) | 2,233 | 263,208 | 265,441 | 952 | 117.9
    DBpedia (owl:sameAs) | 163 | 0 | 163 | 312 | 0.0
    UMBEL | 3,197 | 40,709 | 43,906 | 2,085 | 12.7
    Lingvoj | 20 | 855 | 874 | 18 | 43.4
    CIA Factbook | 161 | 40 | 201 | 53 | 0.2
    Wordnet | 72,748 | 130,806 | 203,554 | 33,388 | 1.8
    Geonames | 1,943 | 9,296 | 11,239 | 842 | 4.8
    DBpedia 3.3 core | 345,652 | 121,962 | 467,614 | 86,449 | 0.4
    Total | 426,126 | 566,887 | 993,013 | 124,104 | 1.3
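The last column of the LDSR statistics appears to be the number of inferred triples divided by the number of explicit triples; the slide does not define it, so this reading is an assumption. The short check below recomputes the ratio and the total for the larger rows, using the figures from the table (in thousands). The smallest rows are left out because their counts are rounded to thousands and do not reproduce the published ratio exactly.

```python
# (explicit, inferred, total, published ratio), counts in thousands of triples.
ldsr_rows = {
    "DBPedia (SKOS categories)": (2233, 263208, 265441, 117.9),
    "UMBEL": (3197, 40709, 43906, 12.7),
    "Wordnet": (72748, 130806, 203554, 1.8),
    "Geonames": (1943, 9296, 11239, 4.8),
    "DBpedia 3.3 core": (345652, 121962, 467614, 0.4),
    "Total": (426126, 566887, 993013, 1.3),
}


def closure_ratio(explicit, inferred):
    """Inferred triples per explicit triple (assumed definition of the column)."""
    return inferred / explicit


for name, (explicit, inferred, total, published) in ldsr_rows.items():
    ratio = round(closure_ratio(explicit, inferred), 1)
    consistent = ratio == published and explicit + inferred == total
    print(f"{name}: ratio {ratio}, consistent with the table: {consistent}")
```

Read this way, the SKOS category hierarchy of DBPedia produces roughly 118 inferred statements per explicit one, while the DBpedia core adds well under one.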
  • Post-processing
    • Several kinds of post-processing were performed
      • Goal: to allow for easier navigation and browsing
      • Mechanisms: the results are available through system predicates
    • Statements which were not originally inserted or inferred, but can be retrieved:
        • "compressed" through sameAs-optimization: 418M
        • encoding preferred labels of the nodes: 14M
        • encoding PageRanks of URIs: 50M
    • Total number of retrievable statements: 1.47B
  • LDSR+Freebase
    • We are still working to integrate Freebase properly
        • Already loaded together with the other datasets
        • Still refining it to integrate it better and to allow for more useful/interesting queries
    • Statistics including Freebase and DBPedia 3.4
        • Number of entities (RDF graph nodes): 225M
        • Number of inserted statements (NIS): 865M
        • Number of stored statements (NSS): 1.2B
        • Number of retrievable statements (NRS): 3.6B
  • Reasoning and Querying Across Datasets
    • PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    • PREFIX dbpedia4: <http://dbpedia.org/ontology/>
    • PREFIX dbpedia3: <http://dbpedia.org/resource/>
    • PREFIX opencyc: <http://sw.opencyc.org/2008/06/10/concept/en/>
    • PREFIX ontology: <http://www.geonames.org/ontology#>
    • SELECT *
    • WHERE { ?Person dbpedia4:birthplace ?BirthPlace .
    • ?BirthPlace ontology:parentFeature dbpedia3:Germany.
    • ?Person rdf:type opencyc:Entertainer
    • }
    • This query involves data from DBPedia, Geonames, and UMBEL (OpenCyc)
    • It involves inference over types, sub-classes, and transitive relationships
    • Without a setup like LDSR, getting an answer in real time would be impossible
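To make the role of inference in the query above concrete, here is a toy in-memory evaluation of the same three patterns. It is only a sketch on invented data: the `ex:` resources are hypothetical, `parentFeature` is treated as transitive, and types are propagated along sub-class links, which is what the slide says the query relies on. Nothing here reflects how LDSR evaluates SPARQL.

```python
def transitive_closure(pairs):
    """Smallest transitive relation containing the given (a, b) pairs."""
    closure = set(pairs)
    while True:
        new = {(a, d) for a, b in closure for c, d in closure if b == c} - closure
        if not new:
            return closure
        closure |= new


def propagate_types(types, subclass_pairs):
    """Add (instance, superclass) for every superclass reachable via sub-class links."""
    superclasses = transitive_closure(subclass_pairs)
    inferred = set(types)
    for instance, cls in types:
        inferred |= {(instance, sup) for sub, sup in superclasses if sub == cls}
    return inferred


# Hypothetical data standing in for DBPedia, Geonames, and UMBEL (OpenCyc).
birthplace = {
    ("ex:PersonA", "ex:District1"),
    ("ex:PersonB", "ex:City2"),
    ("ex:PersonC", "ex:City1"),
}
parent_feature = {
    ("ex:District1", "ex:City1"),
    ("ex:City1", "ex:Germany"),
    ("ex:City2", "ex:France"),
}
types = {
    ("ex:PersonA", "ex:Actor"),
    ("ex:PersonB", "ex:Actor"),
    ("ex:PersonC", "ex:Politician"),
}
subclass_of = {("ex:Actor", "ex:Entertainer")}

located_in = transitive_closure(parent_feature)
all_types = propagate_types(types, subclass_of)

# ?Person birthplace ?BirthPlace . ?BirthPlace parentFeature Germany . ?Person type Entertainer
results = sorted(
    (person, place)
    for person, place in birthplace
    if (place, "ex:Germany") in located_in and (person, "ex:Entertainer") in all_types
)
print(results)
```

Without the two inference steps the result would be empty: no one is asserted to be an Entertainer, and the matching birthplace is linked to Germany only through an intermediate feature.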
  • Time to Guess It?
  • A Pharmaceutical Industry Researcher
    • Hard to find information
    • Difficult to use data because context information is lacking
    • Hard to collaborate across domains due to information silos
    • No easy way to interpret the information (most of the time is spent preparing and transforming data)
  • LinkedLifeData is a Platform to Help the Drug Development Process
  • Other Data Sources
    • datasource:organization/AstraZeneca
    • datasource:organization/AstraZeneca_LP
    • datasource:organization/AstraZeneca_Pharmaceuticals%2C_LP
    • datasource:organization/AstraZeneca_Pharmaceuticals_LP
    • datasource:organization/AstraZeneca_Pharmaeuticals_LP
    • datasource:organization/AstraZeneca_Pharnaceuticals_LP
  • [Figure: patterns for linking identifiers across data sources (namespaces ns-x and ns-y, database ids, accession numbers, terms, names, descriptive text): namespace mapping, reference node, mismatched identifiers, value dereference, transitive link, literal extraction]
  • Pathway and Interaction Knowledge Base Dataset
    • Linked Life Data statistics:
      • gene – proteins – pathways – targets – disease – drugs – patient
    • Number of statements: 2,187,294,998
    • Prototype to test the scalability and performance of Ontotext’s Linked Data infrastructure
  • Linked Life Data (LLD): 20 life science databases in a box
    Database | Size (statements) | Schema | Description
    Uniprot | 1,146,084,021 | Original by the provider | Protein sequences and annotations
    Entrez-Gene | 107,193,308 | Custom RDF schema | Genes and annotation
    Gene Ontology | 9,656,074 | Schema by the provider | Gene and gene product annotation thesaurus
    BioGRID | 1,892,897 | BioPAX 2.0 (custom generated) | Protein interactions extracted from the literature
    NCI - Pathway Interaction Database | 333,415 | BioPAX 2.0 (original by the provider) | Human pathway interaction database
    The Cancer Cell Map | 173,914 | BioPAX 2.0 (original by the provider) | Cancer pathways database
    Reactome | 2,538,793 | BioPAX 2.0 (original by the provider) | Human pathways and interactions
    INOH | 432,456 | BioPAX 2.0 (original by the provider) | Pathway database
    KEGG | 18,128,735 | BioPAX 1.0 (original by the provider) | Molecular Interaction
    PubMed * | 900,861,385 | Custom RDF schema | Biomedical citations
    UMLS * | 79,88,309 | Public OWL semantic network + custom RDF schema | Biomedical terms
    Total | 2,187,294,998 | |
  • LinkedLifeData Datasets
    • Linked Life Data statistics:
      • gene – proteins – pathways – targets – disease – drugs – patient
    • Number of explicit statements: over 3 billion
    • Data sources:
      • PIKB - Uniprot, Entrez-Gene, PubMed, UMLS (MeSH, Taxonomy, GeneOntology), BioGRID, NCI, Reactome, BioCarta, KEGG, BioCyc
      • LODD – DailyMed, DrugBank, Diseasome, Sider, LinkedCT
      • DBPedia
  • The Challenge of extracting relevant information
    • Find all drugs related to asthma
    • Extract all genes into a text file
    • Compose a long query to list the genes by OR
    • Get a filtered list of genes
    • For each gene send a query in molecular interaction database
    • List all genes related to inflammatory response that participate in interactions described in the literature
    • Choose only genes that are known to be investigational or approved drug targets
    • Restrict it only to drugs used to treat asthma
    Get valuable expert knowledge
  • LinkedLifeData and LDSR in LarKC
    • LLD is the basis of a couple of life-science use cases in LarKC
    • LDSR was set up as a testbed for selection and ranking components for RDF
    • PageRankRDF performance on LDSR:
      • it takes only 10 seconds to perform one iteration of PageRank
      • 3 minutes to compute the ranks of the 100 million nodes in LDSR
    • DualRDF (an RDF priming component) performance on LDSR:
      • The performance of the spreading activation tasks varies considerably depending on the parameters of the process
      • As a reference point, consider the following result: it takes 7 seconds to activate about 7 thousand nodes when spreading activation from the resource http://dbpedia.org/resource/Berlin with a decay factor of 0.25.
      • Queries on the “primed” or “selected” part of a dataset run up to 20 times faster and return only focussed results
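The slides do not describe how DualRDF spreads activation, so the following is only one simple formulation, offered as an assumption: the start node gets activation 1.0, each node passes its activation multiplied by the decay factor to its neighbours, and propagation stops once the passed-on value drops below a threshold. The decay factor 0.25 comes from the slide; the threshold, the function name, and the toy graph are hypothetical.

```python
from collections import deque


def spread_activation(graph, start, decay=0.25, threshold=0.05):
    """Return {node: activation} for every node reached from start.

    graph: dict mapping a node to an iterable of neighbouring nodes.
    A node keeps the highest activation it ever receives.
    """
    activation = {start: 1.0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        passed_on = activation[node] * decay
        if passed_on < threshold:          # too weak to activate anything further
            continue
        for neighbour in graph.get(node, ()):
            if passed_on > activation.get(neighbour, 0.0):
                activation[neighbour] = passed_on
                queue.append(neighbour)
    return activation


# Hypothetical chain of resources, three hops long.
toy_graph = {
    "ex:Berlin": ["ex:Germany"],
    "ex:Germany": ["ex:Europe"],
    "ex:Europe": ["ex:Earth"],
}

activated = spread_activation(toy_graph, "ex:Berlin")
print(activated)
```

With these settings only nodes within two hops of the start are activated, which illustrates how a query can then be restricted to a small "primed" part of a much larger graph.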
  • Thank you!
    • We develop core semantic technology
    • Ontotext invested 200 person-years, partnered with 100 leading groups,
    • created some of the most popular tools, and delivered multiple solutions.
    • We know what works and what doesn’t
    • Ontotext set many benchmarks and advanced the frontiers of the semantic databases.
    • We invented the “semantic annotation” – linking text with data
    • Now we are prepared to
    • interlink your data, your content, and the web