SAMPLD 
STRUCTURAL PROPERTIES AS PROXY FOR 
SEMANTIC RELEVANCE 
LAURENS RIETVELD 
HTTP://PRESENTATIONS.LAURENSRIETVELD.NL/NOW
QUANTITY OVER QUALITY 
Datasets become too large to run on commodity hardware 
We use only a small portion 
Can't we extract the part we are interested in? 
Dataset                #triples  #queries  coverage
DBpedia                459M      1640      0.003%
Linked Geo Data        289M      891       1.917%
MetaLex                204M      4933      0.016%
Open-BioMed            79M       931       0.011%
BIO2RDF (KEGG)         50M       1297      2.013%
Semantic Web Dog Food  0.24M     193       62.4%
RELEVANCE BASED SAMPLING 
Find the smallest possible RDF subgraph, that covers the 
maximum number of potential queries 
How can we determine which triples are relevant, and which 
are not? 
Can we implement a scalable sampling pipeline? 
Can we evaluate the results in a scalable fashion?
HOW TO DETERMINE RELEVANCE OF TRIPLES 
INFORMED SAMPLING
We know exactly which queries will be asked
Extract those triples needed to answer the queries
Problem: only a limited number of queries known
UNINFORMED SAMPLING
We do not know which queries will be asked
Use information contained in the graph to determine relevance
Rank triples by relevance, and select the k best triples (0 < k < size of graph)
APPROACH 
Use the topology of the graph to determine relevance 
(network analysis) 
Evaluate the relevance of our samples against the queries that 
we do know 
Is network structure a good predictor for query answerability?
NETWORK ANALYSIS 
Example: Explain real-world phenomena
Find central parts of the graph 
Betweenness Centrality 
Google PageRank 
We apply 
In Degree 
Out Degree 
PageRank
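A minimal sketch of the three measures, assuming each triple has already been rewritten into a directed edge from subject to object (one possible rewrite, chosen here only for illustration). The edge list, the damping factor and the iteration count are assumptions, not values from the talk.

```python
from collections import defaultdict

def degrees(edges):
    """Return (in_degree, out_degree) dicts for a list of (source, target) edges."""
    indeg, outdeg = defaultdict(int), defaultdict(int)
    for src, dst in edges:
        outdeg[src] += 1
        indeg[dst] += 1
        # Touch the other counters so every node appears in both dicts.
        indeg[src] += 0
        outdeg[dst] += 0
    return dict(indeg), dict(outdeg)

def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank; dangling nodes spread their rank uniformly."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = defaultdict(list)
    for src, dst in edges:
        out_links[src].append(dst)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        new_rank = {node: (1 - damping) / n + damping * dangling / n for node in nodes}
        for src in nodes:
            for dst in out_links[src]:
                new_rank[dst] += damping * rank[src] / len(out_links[src])
        rank = new_rank
    return rank

# Toy graph (illustrative): subject -> object edges.
edges = [
    (":Laurens", ":Amsterdam"),
    (":Amsterdam", ":NL"),
    (":Stefan", ":Berlin"),
    (":Berlin", ":Germany"),
    (":Rinke", ":Heerenveen"),
]
in_degree, out_degree = degrees(edges)
ranks = pagerank(edges)
```

Each of the three dicts maps a node to a weight, so any of them can serve as the node weighting for the ranking step.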
RANKED LIST OF TRIPLES 
Rewrite 
Apply network analysis 
Node weights → Triple weights 
W(triple) = max(W(sub), W(obj)) 
Weighted list of triples → Sample
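The last two steps can be sketched as follows, assuming node weights are already available as a dict. The function names, the node weights and the tie-breaking rule (ties keep input order) are illustrative assumptions.

```python
def triple_weights(triples, node_weight):
    """W(triple) = max(W(sub), W(obj)); unknown nodes count as 0.0 (assumption)."""
    return {
        (s, p, o): max(node_weight.get(s, 0.0), node_weight.get(o, 0.0))
        for s, p, o in triples
    }

def top_k(triples, node_weight, k):
    """Rank triples by weight, highest first, and keep the k best."""
    weights = triple_weights(triples, node_weight)
    # sorted() is stable, so equally weighted triples keep their input order.
    ranked = sorted(weights, key=weights.get, reverse=True)
    return ranked[:k]

# Hypothetical node weights and triples, for illustration only.
node_weight = {"a": 0.9, "b": 0.2, "c": 0.5, "d": 0.1}
triples = [("a", "p", "b"), ("b", "p", "d"), ("c", "p", "d")]
sample = top_k(triples, node_weight, k=2)
```

Here the triple weights are 0.9, 0.2 and 0.5, so the sample holds the first and third triple.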
EVALUATION 
Sample sizes: 1% - 99% 
Baselines: 
Random Sample (10x) 
Resource Frequency
NAIVE EVALUATION DOES NOT SCALE 
 
Over 15,000 datasets and over 1.4 trillion triples 
Requirements 
Fast loading of samples 
Powerful hardware 
Not Scalable: load all triples, execute queries, and calculate 
recall
SCALABLE APPROACH 
Retrieve which triples are used by a query 
Use a Hadoop cluster to find the weights of these triples 
Analyze whether these triples would have been included in the 
sample 
Scalable. Only execute each query once
EXAMPLE 
QUERY 
SELECT ?person ?country WHERE { 
  ?person :bornIn ?city . 
  ?city :capitalOf ?country . 
} 
?person   ?country 
:Laurens  :NL 
:Stefan   :Germany 
DATASET 
Subject     Predicate   Object       Weight 
:Amsterdam  :capitalOf  :NL          0.1 
:Berlin     :capitalOf  :Germany     0.5 
:Laurens    :bornIn     :Amsterdam   0.6 
:Rinke      :bornIn     :Heerenveen  0.1 
:Stefan     :bornIn     :Berlin      0.5
QUERY RESULTS → TRIPLES 
SELECT ?person ?country WHERE { 
  ?person :bornIn ?city . 
  ?city :capitalOf ?country . 
} 
SELECT DISTINCT * WHERE { 
  ?person :bornIn ?city . 
  ?city :capitalOf ?country . 
} 
?person   ?city       ?country 
:Laurens  :Amsterdam  :NL 
:Stefan   :Berlin     :Germany 
Subject     Predicate   Object 
:Laurens    :bornIn     :Amsterdam 
:Amsterdam  :capitalOf  :NL 
:Stefan     :bornIn     :Berlin 
:Berlin     :capitalOf  :Germany
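A minimal sketch of this step: once the SELECT DISTINCT * form returns a binding for every variable, substituting each solution into the triple patterns yields the triples that produced it. The function name and the tuple encoding of patterns are assumptions, and only plain basic graph patterns are handled.

```python
def instantiate(patterns, bindings):
    """Substitute each solution's variable bindings into the triple patterns."""
    used = set()
    for solution in bindings:
        for pattern in patterns:
            # Variables are replaced by their bound value; constants stay as-is.
            used.add(tuple(solution.get(term, term) for term in pattern))
    return used

patterns = [
    ("?person", ":bornIn", "?city"),
    ("?city", ":capitalOf", "?country"),
]
bindings = [
    {"?person": ":Laurens", "?city": ":Amsterdam", "?country": ":NL"},
    {"?person": ":Stefan", "?city": ":Berlin", "?country": ":Germany"},
]
used_triples = instantiate(patterns, bindings)
```

For the two solutions above, `used_triples` contains exactly the four triples listed on the slide.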
TRIPLES→RECALL 
WHICH ANSWERS WOULD WE GET WITH A SAMPLE OF 60%? 
Dataset 
Subject Predicate Object Weight 
:Laurens :bornIn :Amsterdam 0.6 
:Stefan :bornIn :Berlin 0.5 
:Berlin :capitalOf :Germany 0.5 
:Amsterdam :capitalOf :NL 0.1 
:Rinke :bornIn :Heerenveen 0.1 
Triples used in query result sets 
Subject Predicate Object 
:Laurens :bornIn :Amsterdam 
:Amsterdam :capitalOf :NL 
:Stefan :bornIn :Berlin 
:Berlin :capitalOf :Germany
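The question on this slide can be worked through with the weights and result-set triples listed above. The sketch assumes that a 60% sample means the top 60% of triples by weight, and that an answer survives only if every triple it needs is in the sample; the function and variable names are illustrative.

```python
dataset = [
    ((":Laurens", ":bornIn", ":Amsterdam"), 0.6),
    ((":Stefan", ":bornIn", ":Berlin"), 0.5),
    ((":Berlin", ":capitalOf", ":Germany"), 0.5),
    ((":Amsterdam", ":capitalOf", ":NL"), 0.1),
    ((":Rinke", ":bornIn", ":Heerenveen"), 0.1),
]

# For each answer (?person, ?country): the triples needed to produce it.
answers = {
    (":Laurens", ":NL"): {
        (":Laurens", ":bornIn", ":Amsterdam"),
        (":Amsterdam", ":capitalOf", ":NL"),
    },
    (":Stefan", ":Germany"): {
        (":Stefan", ":bornIn", ":Berlin"),
        (":Berlin", ":capitalOf", ":Germany"),
    },
}

def recall_at(dataset, answers, fraction):
    """Keep the top `fraction` of triples by weight and check each answer."""
    ranked = sorted(dataset, key=lambda row: row[1], reverse=True)
    k = round(fraction * len(ranked))
    sample = {triple for triple, _ in ranked[:k]}
    kept = [answer for answer, needed in answers.items() if needed <= sample]
    return kept, len(kept) / len(answers)

kept, recall = recall_at(dataset, answers, 0.6)
print(kept, recall)  # [(':Stefan', ':Germany')] 0.5
```

The 60% sample holds the three heaviest triples. The answer for :Laurens is lost because :Amsterdam :capitalOf :NL (weight 0.1) falls outside the sample, so recall is 1 out of 2.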
EVALUATION 
Better specificity than regular recall 
Scalable: Apache Pig instead of SPARQL 
Special cases, e.g. GROUP BY, LIMIT, DISTINCT, OPTIONAL, 
UNION
SPECIAL CASE: UNIONS 
SELECT ?name WHERE { 
  {:laurens rdfs:label ?name} 
  UNION 
  {:laurens foaf:name ?name} 
} 
SELECT DISTINCT * WHERE { 
  {:laurens rdfs:label ?name} 
  UNION 
  {:laurens foaf:name ?name} 
} 
Subject   Predicate   Object 
:Laurens  rdfs:label  Laurens 
:Laurens  foaf:name   Laurens
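One plausible reading of this special case, sketched under assumptions: the same answer can be produced by either UNION branch, so it survives in a sample if the triples of at least one branch are present. The function name and the list-of-sets encoding are hypothetical and not taken from the talk.

```python
def answer_retained(alternatives, sample):
    """alternatives: one set of triples per UNION branch that yields the answer."""
    return any(branch <= sample for branch in alternatives)

# The answer "Laurens" can come from either branch of the UNION.
alternatives = [
    {(":Laurens", "rdfs:label", "Laurens")},
    {(":Laurens", "foaf:name", "Laurens")},
]

only_foaf = {(":Laurens", "foaf:name", "Laurens")}
retained = answer_retained(alternatives, only_foaf)
lost = answer_retained(alternatives, set())
```

Under this reading, a sample that holds only the foaf:name triple still returns the answer, while an empty sample does not.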
EVALUATION DATASETS 
Dataset                #triples  #queries  coverage
DBpedia                459M      1640      0.003%
Linked Geo Data        289M      891       1.917%
MetaLex                204M      4933      0.016%
Open-BioMed            79M       931       0.011%
BIO2RDF (KEGG)         50M       1297      2.013%
Semantic Web Dog Food  0.24M     193       62.4%
RESULTS 
[Figure: Recall (y-axis, 0.0 to 1.0) against Sample Size (x-axis, 0.0 to 1.0). Legend:
BIO2RDF: UniqueLiterals, Outdegree
DBpedia: Path, Pagerank
LinkedGeoData: UniqueLiterals, Outdegree
Metalex: ResourceFrequency
OpenBioMed: WithoutLiterals, Outdegree
SemanticWebDogFood: Path, Pagerank]
OBSERVATIONS 
The influence of a single triple 
DBpedia 
Good: Path + PageRank 
Bad: Path + Out Degree 
Queries: 2/3 require literals 
Other Observations 
# properties vs 'Context Literals' rewrite method 
# query triple patterns
CONCLUSION 
Scalable pipeline: network analysis algorithms + rewrite 
methods 
Able to evaluate over 15,000 datasets and 1.4 trillion triples 
Number of query sets too limited to learn significant 
correlations 
Topology of the graphs can be used to determine good samples 
Mimic semantic relevance through structural properties, 
without an a-priori notion of relevance 
Special thanks to Semantic Web Science Association (SWSA)
