SlideShare a Scribd company logo
1 of 21
Download to read offline
SAMPLD 
STRUCTURAL PROPERTIES AS PROXY FOR 
SEMANTIC RELEVANCE 
LAURENS RIETVELD 
HTTP://PRESENTATIONS.LAURENSRIETVELD.NL/NOW
QUANTITY OVER QUALITY 
Datasets become too large to run on commodity hardware 
We use only a small portion 
Can't we extract the part we are interested in? 
Dataset #triples #queries coverage 
DBpedia 459M 1640 0.003% 
Linked Geo Data 289M 891 1.917% 
MetaLex 204M 4933 0.016% 
Open-BioMed 79M 931 0.011% 
BIO2RDF (KEGG) 50M 1297 2.013% 
Semantic Web Dog 
0.24M 193 62.4% 
Food
RELEVANCE BASED SAMPLING 
Find the smallest possible RDF subgraph, that covers the 
maximum number of potential queries 
How can we determine which triples are relevant, and which 
are not? 
Can we implement a scalable sampling pipeline? 
Can we evaluate the results in a scalable fashion?
HOW TO DETERMINE RELEVANCE OF TRIPLES 
INFORMED SAMPLING 
UNINFORMED SAMPLING 
We know exactly 
which queries will be 
asked 
Extract those triples 
needed to answer the 
queries 
Problem: only a limited 
number of queries 
known 
We do not know which queries will 
be asked 
Use information contained in the 
graph to determine relevance 
Rank triples by relevance, and select 
the k best triples (0 < k < size of 
graph)
APPROACH 
Use the topology of the graph to determine relevance 
(network analysis) 
Evaluate the relevance of our samples against the queries that 
we do know 
Is network structure a good predictor for query answerability?
NETWORK ANALYSIS 
Example: Explain real-world phenomenons 
Find central parts of the graph 
Betweenness Centrality 
Google PageRank 
We apply 
In Degree 
Out Degree 
PageRank
RANKED LIST OF TRIPLES 
Rewrite 
Apply network analysis 
Nodes Weights → Triples Weights 
W(triple) = max(W(sub), W(obj)) 
Weighted list of triples → Sample
EVALUATION 
Sample sizes: 1% - 99% 
Baselines: 
Random Sample (10x) 
Resource Frequency
NAIVE EVALUATION DOES NOT SCALE 
 
0!	0 
   ø)!0$+  ø)!0$+  ø 
% 
% 
 // / 0  
Over 15.000 datasets, and over 1.4 trillion triples 
Requirements 
Fast loading of samples 
Powerful hardware 
Not Scalable: load all triples, execute queries, and calculate 
recall
SCALABLE APPROACH 
Retrieve which triples are used by a query 
Use a hadoop cluster to find the weights of these triples 
Analyze whether these triples would have been included in the 
sample 
Scalable. Only execute each query once
EXAMPLE 
SELECT ?person ?country WHERE { 
?person :bornIn ?city . 
?city :capitalOf ?country . 
} 
1234 
?person ?country 
:Laurens :NL 
:Stefan :Germany 
QUERY 
DATASET 
Subject Predicate Object Weight 
:Amsterdam :capitalOf :NL 0.1 
:Berlin :capitalOf :Germany 0.5 
:Laurens :bornIn :Amsterdam 0.6 
:Rinke :bornIn :Heerenveen 0.1 
:Stefan :bornIn :Berlin 0.5
QUERY RESULTS → TRIPLES 
1234 Subject Predicate Object ?person ?city ?country 
SELECT ?person ?country WHERE { 
?person :bornIn ?city . 
?city :capitalOf ?country . 
} 
1234 
SELECT DISTINCT * WHERE { 
?person :bornIn ?city . 
?city :capitalOf ?country . 
} 
:Laurens :bornIn :Amsterdam :Laurens :Amsterdam :NL 
:Amsterdam :capitalOf :NL 
:Stefan :bornIn :Berlin :Stefan :Berlin :Germany 
:Berlin :capitalOf :Germany
TRIPLES→RECALL 
WHICH ANSWERS WOULD WE GET WITH A SAMPLE OF 60%? 
Dataset 
Subject Predicate Object Weight 
:Laurens :bornIn :Amsterdam 0.6 
:Stefan :bornIn :Berlin 0.5 
:Berlin :capitalOf :Germany 0.5 
:Amsterdam :capitalOf :NL 0.1 
:Rinke :bornIn :Heerenveen 0.1 
Triples used in query resultsets 
Subject Predicate Object 
:Laurens :bornIn :Amsterdam 
:Amsterdam :capitalOf :NL 
:Stefan :bornIn :Berlin 
:Berlin :capitalOf :Germany
EVALUATION 
Better specificity than regular recall 
Scalable: PIG instead of SPARQL 
Special cases, e.g. GROUP BY, LIMIT, DISTINCT, OPTIONAL, 
UNION
SPECIAL CASE: UNIONS 
SELECT ?name WHERE { 
{:laurens rdfs:label ?name} 
UNION 
{:laurens foaf:name ?name} 
} 
12345 
SELECT DISTINCT * WHERE { 
{:laurens rdfs:label ?name} 
UNION 
{:laurens foaf:name ?name} 
} 
12345 Subject Predicate Object 
:Laurens rdfs:label Laurens 
:Laurens foaf:name Laurens
EVALUATION DATASETS 
Dataset #triples #queries coverage 
DBpedia 459M 1640 0.003% 
Linked Geo Data 289M 891 1.917% 
MetaLex 204M 4933 0.016% 
Open-BioMed 79M 931 0.011% 
BIO2RDF (KEGG) 50M 1297 2.013% 
Semantic Web Dog 
0.24M 193 62.4% 
Food
RESULTS click for more results 
Sample Size 
0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 
1.0 
0.9 
0.8 
0.7 
0.6 
0.5 
0.4 
0.3 
0.2 
0.1 
0.0 
Recall 
BIO2RDF ­UniqueLiterals 
­Outdegree 
DBpedia ­Path 
­Pagerank 
LinkedGeoData ­UniqueLiterals 
­Outdegree 
Metalex ­ResourceFrequency 
OpenBioMed ­WithoutLiterals 
­Outdegree 
SemanticWebDogFood ­Path 
­Pagerank
OBSERVATIONS 
The influence of a single triple 
DBpedia 
Good: Path + PageRank 
Bad: Path + Out Degree 
Queries: 2/3 require literals 
Other Observations 
# properties vs 'Context Literals' rewrite method 
# query triple patterns
CONCLUSION 
Scalable pipeline: network analysis algorithms + rewrite 
methods 
Able to eval over 15.000 datasets, and 1.4 trillion triples 
Number of query sets too limited to learn significant 
correlations 
Topology of the graphs can be used to determine good samples 
Mimic semantic relevance through structural properties, 
without an a-priori notion of relevance 
Special thanks to Semantic Web Science Association (SWSA)

More Related Content

What's hot

Orchestrating the Intelligent Web with Apache Mahout
Orchestrating the Intelligent Web with Apache MahoutOrchestrating the Intelligent Web with Apache Mahout
Orchestrating the Intelligent Web with Apache Mahoutaneeshabakharia
 
Linking the world with Python and Semantics
Linking the world with Python and SemanticsLinking the world with Python and Semantics
Linking the world with Python and SemanticsTatiana Al-Chueyr
 
RDF Stream Processing Tutorial: RSP implementations
RDF Stream Processing Tutorial: RSP implementationsRDF Stream Processing Tutorial: RSP implementations
RDF Stream Processing Tutorial: RSP implementationsJean-Paul Calbimonte
 
Hadoop with Python
Hadoop with PythonHadoop with Python
Hadoop with PythonDonald Miner
 
Rethinking Data-Intensive Science Using Scalable Analytics Systems
Rethinking Data-Intensive Science Using Scalable Analytics Systems Rethinking Data-Intensive Science Using Scalable Analytics Systems
Rethinking Data-Intensive Science Using Scalable Analytics Systems fnothaft
 
Linked Data, Labels, URIs
Linked Data, Labels, URIsLinked Data, Labels, URIs
Linked Data, Labels, URIsTrey Pendragon
 
Lightning fast genomics with Spark, Adam and Scala
Lightning fast genomics with Spark, Adam and ScalaLightning fast genomics with Spark, Adam and Scala
Lightning fast genomics with Spark, Adam and ScalaAndy Petrella
 
useR! 2012 Talk
useR! 2012 TalkuseR! 2012 Talk
useR! 2012 Talkrtelmore
 
Scala Programming for Semantic Web Developers ESWC Semdev2015
Scala Programming for Semantic Web Developers ESWC Semdev2015Scala Programming for Semantic Web Developers ESWC Semdev2015
Scala Programming for Semantic Web Developers ESWC Semdev2015Jean-Paul Calbimonte
 

What's hot (9)

Orchestrating the Intelligent Web with Apache Mahout
Orchestrating the Intelligent Web with Apache MahoutOrchestrating the Intelligent Web with Apache Mahout
Orchestrating the Intelligent Web with Apache Mahout
 
Linking the world with Python and Semantics
Linking the world with Python and SemanticsLinking the world with Python and Semantics
Linking the world with Python and Semantics
 
RDF Stream Processing Tutorial: RSP implementations
RDF Stream Processing Tutorial: RSP implementationsRDF Stream Processing Tutorial: RSP implementations
RDF Stream Processing Tutorial: RSP implementations
 
Hadoop with Python
Hadoop with PythonHadoop with Python
Hadoop with Python
 
Rethinking Data-Intensive Science Using Scalable Analytics Systems
Rethinking Data-Intensive Science Using Scalable Analytics Systems Rethinking Data-Intensive Science Using Scalable Analytics Systems
Rethinking Data-Intensive Science Using Scalable Analytics Systems
 
Linked Data, Labels, URIs
Linked Data, Labels, URIsLinked Data, Labels, URIs
Linked Data, Labels, URIs
 
Lightning fast genomics with Spark, Adam and Scala
Lightning fast genomics with Spark, Adam and ScalaLightning fast genomics with Spark, Adam and Scala
Lightning fast genomics with Spark, Adam and Scala
 
useR! 2012 Talk
useR! 2012 TalkuseR! 2012 Talk
useR! 2012 Talk
 
Scala Programming for Semantic Web Developers ESWC Semdev2015
Scala Programming for Semantic Web Developers ESWC Semdev2015Scala Programming for Semantic Web Developers ESWC Semdev2015
Scala Programming for Semantic Web Developers ESWC Semdev2015
 

Similar to SampLD, Structural Properties as Proxy for Semantic Relevance

ParlBench: a SPARQL-benchmark for electronic publishing applications.
ParlBench: a SPARQL-benchmark for electronic publishing applications.ParlBench: a SPARQL-benchmark for electronic publishing applications.
ParlBench: a SPARQL-benchmark for electronic publishing applications.Tatiana Tarasova
 
2009 Dils Flyweb
2009 Dils Flyweb2009 Dils Flyweb
2009 Dils FlywebJun Zhao
 
Aidan's PhD Viva
Aidan's PhD VivaAidan's PhD Viva
Aidan's PhD VivaAidan Hogan
 
RDFUnit - Test-Driven Linked Data quality Assessment (WWW2014)
RDFUnit - Test-Driven Linked Data quality Assessment (WWW2014)RDFUnit - Test-Driven Linked Data quality Assessment (WWW2014)
RDFUnit - Test-Driven Linked Data quality Assessment (WWW2014)Dimitris Kontokostas
 
A Little SPARQL in your Analytics
A Little SPARQL in your AnalyticsA Little SPARQL in your Analytics
A Little SPARQL in your AnalyticsDr. Neil Brittliff
 
2009 0807 Lod Gmod
2009 0807 Lod Gmod2009 0807 Lod Gmod
2009 0807 Lod GmodJun Zhao
 
Predicting query performance and explaining results to assist Linked Data con...
Predicting query performance and explaining results to assist Linked Data con...Predicting query performance and explaining results to assist Linked Data con...
Predicting query performance and explaining results to assist Linked Data con...Rakebul Hasan
 
Extracting Relevant Questions to an RDF Dataset Using Formal Concept Analysis
Extracting Relevant Questions to an RDF Dataset Using Formal Concept AnalysisExtracting Relevant Questions to an RDF Dataset Using Formal Concept Analysis
Extracting Relevant Questions to an RDF Dataset Using Formal Concept AnalysisMathieu d'Aquin
 
Emerging technologies /frameworks in Big Data
Emerging technologies /frameworks in Big DataEmerging technologies /frameworks in Big Data
Emerging technologies /frameworks in Big DataRahul Jain
 
Building a Knowledge Graph @ Graph Day 2018
Building a Knowledge Graph @ Graph Day 2018Building a Knowledge Graph @ Graph Day 2018
Building a Knowledge Graph @ Graph Day 2018DanBennett47
 
VALA Tech Camp 2017: Intro to Wikidata & SPARQL
VALA Tech Camp 2017: Intro to Wikidata & SPARQLVALA Tech Camp 2017: Intro to Wikidata & SPARQL
VALA Tech Camp 2017: Intro to Wikidata & SPARQLJane Frazier
 
2010 03 Lodoxf Openflydata
2010 03 Lodoxf Openflydata2010 03 Lodoxf Openflydata
2010 03 Lodoxf OpenflydataJun Zhao
 
Visualize open data with Plone - eea.daviz PLOG 2013
Visualize open data with Plone - eea.daviz PLOG 2013Visualize open data with Plone - eea.daviz PLOG 2013
Visualize open data with Plone - eea.daviz PLOG 2013Antonio De Marinis
 
SPARQL-DL - Theory & Practice
SPARQL-DL - Theory & PracticeSPARQL-DL - Theory & Practice
SPARQL-DL - Theory & PracticeAdriel Café
 
Visualizations using Visualbox
Visualizations using VisualboxVisualizations using Visualbox
Visualizations using VisualboxAlvaro Graves
 

Similar to SampLD, Structural Properties as Proxy for Semantic Relevance (20)

ParlBench: a SPARQL-benchmark for electronic publishing applications.
ParlBench: a SPARQL-benchmark for electronic publishing applications.ParlBench: a SPARQL-benchmark for electronic publishing applications.
ParlBench: a SPARQL-benchmark for electronic publishing applications.
 
AnzoGraph DB - SPARQL 101
AnzoGraph DB - SPARQL 101AnzoGraph DB - SPARQL 101
AnzoGraph DB - SPARQL 101
 
2009 Dils Flyweb
2009 Dils Flyweb2009 Dils Flyweb
2009 Dils Flyweb
 
Aidan's PhD Viva
Aidan's PhD VivaAidan's PhD Viva
Aidan's PhD Viva
 
RDFUnit - Test-Driven Linked Data quality Assessment (WWW2014)
RDFUnit - Test-Driven Linked Data quality Assessment (WWW2014)RDFUnit - Test-Driven Linked Data quality Assessment (WWW2014)
RDFUnit - Test-Driven Linked Data quality Assessment (WWW2014)
 
A Little SPARQL in your Analytics
A Little SPARQL in your AnalyticsA Little SPARQL in your Analytics
A Little SPARQL in your Analytics
 
2009 0807 Lod Gmod
2009 0807 Lod Gmod2009 0807 Lod Gmod
2009 0807 Lod Gmod
 
Predicting query performance and explaining results to assist Linked Data con...
Predicting query performance and explaining results to assist Linked Data con...Predicting query performance and explaining results to assist Linked Data con...
Predicting query performance and explaining results to assist Linked Data con...
 
BioSD Tutorial 2014 Editition
BioSD Tutorial 2014 EdititionBioSD Tutorial 2014 Editition
BioSD Tutorial 2014 Editition
 
SWAT4LS 2014 SLIDE by Yamamoto
SWAT4LS 2014 SLIDE by YamamotoSWAT4LS 2014 SLIDE by Yamamoto
SWAT4LS 2014 SLIDE by Yamamoto
 
SWT Lecture Session 2 - RDF
SWT Lecture Session 2 - RDFSWT Lecture Session 2 - RDF
SWT Lecture Session 2 - RDF
 
Extracting Relevant Questions to an RDF Dataset Using Formal Concept Analysis
Extracting Relevant Questions to an RDF Dataset Using Formal Concept AnalysisExtracting Relevant Questions to an RDF Dataset Using Formal Concept Analysis
Extracting Relevant Questions to an RDF Dataset Using Formal Concept Analysis
 
Emerging technologies /frameworks in Big Data
Emerging technologies /frameworks in Big DataEmerging technologies /frameworks in Big Data
Emerging technologies /frameworks in Big Data
 
Building a Knowledge Graph @ Graph Day 2018
Building a Knowledge Graph @ Graph Day 2018Building a Knowledge Graph @ Graph Day 2018
Building a Knowledge Graph @ Graph Day 2018
 
VALA Tech Camp 2017: Intro to Wikidata & SPARQL
VALA Tech Camp 2017: Intro to Wikidata & SPARQLVALA Tech Camp 2017: Intro to Wikidata & SPARQL
VALA Tech Camp 2017: Intro to Wikidata & SPARQL
 
2010 03 Lodoxf Openflydata
2010 03 Lodoxf Openflydata2010 03 Lodoxf Openflydata
2010 03 Lodoxf Openflydata
 
R Basics
R BasicsR Basics
R Basics
 
Visualize open data with Plone - eea.daviz PLOG 2013
Visualize open data with Plone - eea.daviz PLOG 2013Visualize open data with Plone - eea.daviz PLOG 2013
Visualize open data with Plone - eea.daviz PLOG 2013
 
SPARQL-DL - Theory & Practice
SPARQL-DL - Theory & PracticeSPARQL-DL - Theory & Practice
SPARQL-DL - Theory & Practice
 
Visualizations using Visualbox
Visualizations using VisualboxVisualizations using Visualbox
Visualizations using Visualbox
 

Recently uploaded

In-pond Race way systems for Aquaculture (IPRS).pptx
In-pond Race way systems for Aquaculture (IPRS).pptxIn-pond Race way systems for Aquaculture (IPRS).pptx
In-pond Race way systems for Aquaculture (IPRS).pptxMAGOTI ERNEST
 
Gliese 12 b, a temperate Earth-sized planet at 12 parsecs discovered with TES...
Gliese 12 b, a temperate Earth-sized planet at 12 parsecs discovered with TES...Gliese 12 b, a temperate Earth-sized planet at 12 parsecs discovered with TES...
Gliese 12 b, a temperate Earth-sized planet at 12 parsecs discovered with TES...Sérgio Sacani
 
GBSN - Microbiology (Unit 6) Human and Microbial interaction
GBSN - Microbiology (Unit 6) Human and Microbial interactionGBSN - Microbiology (Unit 6) Human and Microbial interaction
GBSN - Microbiology (Unit 6) Human and Microbial interactionAreesha Ahmad
 
Triploidy ...............................pptx
Triploidy ...............................pptxTriploidy ...............................pptx
Triploidy ...............................pptxCherry
 
Tuberculosis (TB)-Notes.pdf microbiology notes
Tuberculosis (TB)-Notes.pdf microbiology notesTuberculosis (TB)-Notes.pdf microbiology notes
Tuberculosis (TB)-Notes.pdf microbiology notesjyothisaisri
 
Exomoons & Exorings with the Habitable Worlds Observatory I: On the Detection...
Exomoons & Exorings with the Habitable Worlds Observatory I: On the Detection...Exomoons & Exorings with the Habitable Worlds Observatory I: On the Detection...
Exomoons & Exorings with the Habitable Worlds Observatory I: On the Detection...Sérgio Sacani
 
Erythropoiesis- Dr.E. Muralinath-C Kalyan
Erythropoiesis- Dr.E. Muralinath-C KalyanErythropoiesis- Dr.E. Muralinath-C Kalyan
Erythropoiesis- Dr.E. Muralinath-C Kalyanmuralinath2
 
Pests of sugarcane_Binomics_IPM_Dr.UPR.pdf
Pests of sugarcane_Binomics_IPM_Dr.UPR.pdfPests of sugarcane_Binomics_IPM_Dr.UPR.pdf
Pests of sugarcane_Binomics_IPM_Dr.UPR.pdfPirithiRaju
 
GBSN - Microbiology (Unit 7) Microbiology in Everyday Life
GBSN - Microbiology (Unit 7) Microbiology in Everyday LifeGBSN - Microbiology (Unit 7) Microbiology in Everyday Life
GBSN - Microbiology (Unit 7) Microbiology in Everyday LifeAreesha Ahmad
 
National Biodiversity protection initiatives and Convention on Biological Di...
National Biodiversity protection initiatives and  Convention on Biological Di...National Biodiversity protection initiatives and  Convention on Biological Di...
National Biodiversity protection initiatives and Convention on Biological Di...PABOLU TEJASREE
 
B lymphocytes, Receptors, Maturation and Activation
B lymphocytes, Receptors, Maturation and ActivationB lymphocytes, Receptors, Maturation and Activation
B lymphocytes, Receptors, Maturation and ActivationBhanu Krishan
 
GBSN - Microbiology Lab 1 (Microbiology Lab Safety Procedures)
GBSN -  Microbiology Lab  1 (Microbiology Lab Safety Procedures)GBSN -  Microbiology Lab  1 (Microbiology Lab Safety Procedures)
GBSN - Microbiology Lab 1 (Microbiology Lab Safety Procedures)Areesha Ahmad
 
Ostiguy & Panizza & Moffitt (eds.) - Populism in Global Perspective. A Perfor...
Ostiguy & Panizza & Moffitt (eds.) - Populism in Global Perspective. A Perfor...Ostiguy & Panizza & Moffitt (eds.) - Populism in Global Perspective. A Perfor...
Ostiguy & Panizza & Moffitt (eds.) - Populism in Global Perspective. A Perfor...frank0071
 
Hemoglobin metabolism: C Kalyan & E. Muralinath
Hemoglobin metabolism: C Kalyan & E. MuralinathHemoglobin metabolism: C Kalyan & E. Muralinath
Hemoglobin metabolism: C Kalyan & E. Muralinathmuralinath2
 
Microbial bio Synthesis of nanoparticles.pptx
Microbial bio Synthesis of nanoparticles.pptxMicrobial bio Synthesis of nanoparticles.pptx
Microbial bio Synthesis of nanoparticles.pptxCherry
 
Plasmapheresis - Dr. E. Muralinath - Kalyan . C.pptx
Plasmapheresis - Dr. E. Muralinath - Kalyan . C.pptxPlasmapheresis - Dr. E. Muralinath - Kalyan . C.pptx
Plasmapheresis - Dr. E. Muralinath - Kalyan . C.pptxmuralinath2
 
Emergent ribozyme behaviors in oxychlorine brines indicate a unique niche for...
Emergent ribozyme behaviors in oxychlorine brines indicate a unique niche for...Emergent ribozyme behaviors in oxychlorine brines indicate a unique niche for...
Emergent ribozyme behaviors in oxychlorine brines indicate a unique niche for...Sérgio Sacani
 
Film Coated Tablet and Film Coating raw materials.pdf
Film Coated Tablet and Film Coating raw materials.pdfFilm Coated Tablet and Film Coating raw materials.pdf
Film Coated Tablet and Film Coating raw materials.pdfPharmatech-rx
 
GBSN - Biochemistry (Unit 4) Chemistry of Carbohydrates
GBSN - Biochemistry (Unit 4) Chemistry of CarbohydratesGBSN - Biochemistry (Unit 4) Chemistry of Carbohydrates
GBSN - Biochemistry (Unit 4) Chemistry of CarbohydratesAreesha Ahmad
 
GBSN - Microbiology Lab 2 (Compound Microscope)
GBSN - Microbiology Lab 2 (Compound Microscope)GBSN - Microbiology Lab 2 (Compound Microscope)
GBSN - Microbiology Lab 2 (Compound Microscope)Areesha Ahmad
 

Recently uploaded (20)

In-pond Race way systems for Aquaculture (IPRS).pptx
In-pond Race way systems for Aquaculture (IPRS).pptxIn-pond Race way systems for Aquaculture (IPRS).pptx
In-pond Race way systems for Aquaculture (IPRS).pptx
 
Gliese 12 b, a temperate Earth-sized planet at 12 parsecs discovered with TES...
Gliese 12 b, a temperate Earth-sized planet at 12 parsecs discovered with TES...Gliese 12 b, a temperate Earth-sized planet at 12 parsecs discovered with TES...
Gliese 12 b, a temperate Earth-sized planet at 12 parsecs discovered with TES...
 
GBSN - Microbiology (Unit 6) Human and Microbial interaction
GBSN - Microbiology (Unit 6) Human and Microbial interactionGBSN - Microbiology (Unit 6) Human and Microbial interaction
GBSN - Microbiology (Unit 6) Human and Microbial interaction
 
Triploidy ...............................pptx
Triploidy ...............................pptxTriploidy ...............................pptx
Triploidy ...............................pptx
 
Tuberculosis (TB)-Notes.pdf microbiology notes
Tuberculosis (TB)-Notes.pdf microbiology notesTuberculosis (TB)-Notes.pdf microbiology notes
Tuberculosis (TB)-Notes.pdf microbiology notes
 
Exomoons & Exorings with the Habitable Worlds Observatory I: On the Detection...
Exomoons & Exorings with the Habitable Worlds Observatory I: On the Detection...Exomoons & Exorings with the Habitable Worlds Observatory I: On the Detection...
Exomoons & Exorings with the Habitable Worlds Observatory I: On the Detection...
 
Erythropoiesis- Dr.E. Muralinath-C Kalyan
Erythropoiesis- Dr.E. Muralinath-C KalyanErythropoiesis- Dr.E. Muralinath-C Kalyan
Erythropoiesis- Dr.E. Muralinath-C Kalyan
 
Pests of sugarcane_Binomics_IPM_Dr.UPR.pdf
Pests of sugarcane_Binomics_IPM_Dr.UPR.pdfPests of sugarcane_Binomics_IPM_Dr.UPR.pdf
Pests of sugarcane_Binomics_IPM_Dr.UPR.pdf
 
GBSN - Microbiology (Unit 7) Microbiology in Everyday Life
GBSN - Microbiology (Unit 7) Microbiology in Everyday LifeGBSN - Microbiology (Unit 7) Microbiology in Everyday Life
GBSN - Microbiology (Unit 7) Microbiology in Everyday Life
 
National Biodiversity protection initiatives and Convention on Biological Di...
National Biodiversity protection initiatives and  Convention on Biological Di...National Biodiversity protection initiatives and  Convention on Biological Di...
National Biodiversity protection initiatives and Convention on Biological Di...
 
B lymphocytes, Receptors, Maturation and Activation
B lymphocytes, Receptors, Maturation and ActivationB lymphocytes, Receptors, Maturation and Activation
B lymphocytes, Receptors, Maturation and Activation
 
GBSN - Microbiology Lab 1 (Microbiology Lab Safety Procedures)
GBSN -  Microbiology Lab  1 (Microbiology Lab Safety Procedures)GBSN -  Microbiology Lab  1 (Microbiology Lab Safety Procedures)
GBSN - Microbiology Lab 1 (Microbiology Lab Safety Procedures)
 
Ostiguy & Panizza & Moffitt (eds.) - Populism in Global Perspective. A Perfor...
Ostiguy & Panizza & Moffitt (eds.) - Populism in Global Perspective. A Perfor...Ostiguy & Panizza & Moffitt (eds.) - Populism in Global Perspective. A Perfor...
Ostiguy & Panizza & Moffitt (eds.) - Populism in Global Perspective. A Perfor...
 
Hemoglobin metabolism: C Kalyan & E. Muralinath
Hemoglobin metabolism: C Kalyan & E. MuralinathHemoglobin metabolism: C Kalyan & E. Muralinath
Hemoglobin metabolism: C Kalyan & E. Muralinath
 
Microbial bio Synthesis of nanoparticles.pptx
Microbial bio Synthesis of nanoparticles.pptxMicrobial bio Synthesis of nanoparticles.pptx
Microbial bio Synthesis of nanoparticles.pptx
 
Plasmapheresis - Dr. E. Muralinath - Kalyan . C.pptx
Plasmapheresis - Dr. E. Muralinath - Kalyan . C.pptxPlasmapheresis - Dr. E. Muralinath - Kalyan . C.pptx
Plasmapheresis - Dr. E. Muralinath - Kalyan . C.pptx
 
Emergent ribozyme behaviors in oxychlorine brines indicate a unique niche for...
Emergent ribozyme behaviors in oxychlorine brines indicate a unique niche for...Emergent ribozyme behaviors in oxychlorine brines indicate a unique niche for...
Emergent ribozyme behaviors in oxychlorine brines indicate a unique niche for...
 
Film Coated Tablet and Film Coating raw materials.pdf
Film Coated Tablet and Film Coating raw materials.pdfFilm Coated Tablet and Film Coating raw materials.pdf
Film Coated Tablet and Film Coating raw materials.pdf
 
GBSN - Biochemistry (Unit 4) Chemistry of Carbohydrates
GBSN - Biochemistry (Unit 4) Chemistry of CarbohydratesGBSN - Biochemistry (Unit 4) Chemistry of Carbohydrates
GBSN - Biochemistry (Unit 4) Chemistry of Carbohydrates
 
GBSN - Microbiology Lab 2 (Compound Microscope)
GBSN - Microbiology Lab 2 (Compound Microscope)GBSN - Microbiology Lab 2 (Compound Microscope)
GBSN - Microbiology Lab 2 (Compound Microscope)
 

SampLD, Structural Properties as Proxy for Semantic Relevance

  • 1. SAMPLD STRUCTURAL PROPERTIES AS PROXY FOR SEMANTIC RELEVANCE LAURENS RIETVELD HTTP://PRESENTATIONS.LAURENSRIETVELD.NL/NOW
  • 2. QUANTITY OVER QUALITY Datasets become too large to run on commodity hardware We use only a small portion Can't we extract the part we are interested in? Dataset #triples #queries coverage DBpedia 459M 1640 0.003% Linked Geo Data 289M 891 1.917% MetaLex 204M 4933 0.016% Open-BioMed 79M 931 0.011% BIO2RDF (KEGG) 50M 1297 2.013% Semantic Web Dog 0.24M 193 62.4% Food
  • 3. RELEVANCE BASED SAMPLING Find the smallest possible RDF subgraph, that covers the maximum number of potential queries How can we determine which triples are relevant, and which are not? Can we implement a scalable sampling pipeline? Can we evaluate the results in a scalable fashion?
  • 4. HOW TO DETERMINE RELEVANCE OF TRIPLES INFORMED SAMPLING UNINFORMED SAMPLING We know exactly which queries will be asked Extract those triples needed to answer the queries Problem: only a limited number of queries known We do not know which queries will be asked Use information contained in the graph to determine relevance Rank triples by relevance, and select the k best triples (0 < k < size of graph)
  • 5. APPROACH Use the topology of the graph to determine relevance (network analysis) Evaluate the relevance of our samples against the queries that we do know Is network structure a good predictor for query answerability?
  • 6. NETWORK ANALYSIS Example: Explain real-world phenomenons Find central parts of the graph Betweenness Centrality Google PageRank We apply In Degree Out Degree PageRank
  • 7. RANKED LIST OF TRIPLES Rewrite Apply network analysis Nodes Weights → Triples Weights W(triple) = max(W(sub), W(obj)) Weighted list of triples → Sample
  • 8. EVALUATION Sample sizes: 1% - 99% Baselines: Random Sample (10x) Resource Frequency
  • 9. NAIVE EVALUATION DOES NOT SCALE 0! 0 ø)!0$+ ø)!0$+ ø % % // / 0 Over 15.000 datasets, and over 1.4 trillion triples Requirements Fast loading of samples Powerful hardware Not Scalable: load all triples, execute queries, and calculate recall
  • 10. SCALABLE APPROACH Retrieve which triples are used by a query Use a hadoop cluster to find the weights of these triples Analyze whether these triples would have been included in the sample Scalable. Only execute each query once
  • 11. EXAMPLE SELECT ?person ?country WHERE { ?person :bornIn ?city . ?city :capitalOf ?country . } 1234 ?person ?country :Laurens :NL :Stefan :Germany QUERY DATASET Subject Predicate Object Weight :Amsterdam :capitalOf :NL 0.1 :Berlin :capitalOf :Germany 0.5 :Laurens :bornIn :Amsterdam 0.6 :Rinke :bornIn :Heerenveen 0.1 :Stefan :bornIn :Berlin 0.5
  • 12. QUERY RESULTS → TRIPLES 1234 Subject Predicate Object ?person ?city ?country SELECT ?person ?country WHERE { ?person :bornIn ?city . ?city :capitalOf ?country . } 1234 SELECT DISTINCT * WHERE { ?person :bornIn ?city . ?city :capitalOf ?country . } :Laurens :bornIn :Amsterdam :Laurens :Amsterdam :NL :Amsterdam :capitalOf :NL :Stefan :bornIn :Berlin :Stefan :Berlin :Germany :Berlin :capitalOf :Germany
  • 13. TRIPLES→RECALL WHICH ANSWERS WOULD WE GET WITH A SAMPLE OF 60%? Dataset Subject Predicate Object Weight :Laurens :bornIn :Amsterdam 0.6 :Stefan :bornIn :Berlin 0.5 :Berlin :capitalOf :Germany 0.5 :Amsterdam :capitalOf :NL 0.1 :Rinke :bornIn :Heerenveen 0.1 Triples used in query resultsets Subject Predicate Object :Laurens :bornIn :Amsterdam :Amsterdam :capitalOf :NL :Stefan :bornIn :Berlin :Berlin :capitalOf :Germany
  • 14. EVALUATION Better specificity than regular recall Scalable: PIG instead of SPARQL Special cases, e.g. GROUP BY, LIMIT, DISTINCT, OPTIONAL, UNION
  • 15. SPECIAL CASE: UNIONS SELECT ?name WHERE { {:laurens rdfs:label ?name} UNION {:laurens foaf:name ?name} } 12345 SELECT DISTINCT * WHERE { {:laurens rdfs:label ?name} UNION {:laurens foaf:name ?name} } 12345 Subject Predicate Object :Laurens rdfs:label Laurens :Laurens foaf:name Laurens
  • 16.
  • 17. EVALUATION DATASETS Dataset #triples #queries coverage DBpedia 459M 1640 0.003% Linked Geo Data 289M 891 1.917% MetaLex 204M 4933 0.016% Open-BioMed 79M 931 0.011% BIO2RDF (KEGG) 50M 1297 2.013% Semantic Web Dog 0.24M 193 62.4% Food
  • 18.
  • 19. RESULTS click for more results Sample Size 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 1.0 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0.0 Recall BIO2RDF ­UniqueLiterals ­Outdegree DBpedia ­Path ­Pagerank LinkedGeoData ­UniqueLiterals ­Outdegree Metalex ­ResourceFrequency OpenBioMed ­WithoutLiterals ­Outdegree SemanticWebDogFood ­Path ­Pagerank
  • 20. OBSERVATIONS The influence of a single triple DBpedia Good: Path + PageRank Bad: Path + Out Degree Queries: 2/3 require literals Other Observations # properties vs 'Context Literals' rewrite method # query triple patterns
  • 21. CONCLUSION Scalable pipeline: network analysis algorithms + rewrite methods Able to eval over 15.000 datasets, and 1.4 trillion triples Number of query sets too limited to learn significant correlations Topology of the graphs can be used to determine good samples Mimic semantic relevance through structural properties, without an a-priori notion of relevance Special thanks to Semantic Web Science Association (SWSA)