gStore: A Graph-based SPARQL Query Engine

  • 1,101 views
Uploaded on

 

More in: Technology
  • Full Name Full Name Comment goes here.
    Are you sure you want to
    Your message goes here
    Be the first to comment
No Downloads

Views

Total Views
1,101
On Slideshare
0
From Embeds
0
Number of Embeds
0

Actions

Shares
Downloads
30
Comments
0
Likes
4

Embeds 0

No embeds

Report content

Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
    No notes for slide

Transcript

  • 1. gStore: A Graph-based SPARQL Query Engine ¨ M. Tamer Ozsu University of Waterloo David R. Cheriton School of Computer Science Joint work with Lei Zou, Peking University and Lei Chen, Hong Kong University of Science and TechnologyUPenn/2012-12-04 1
  • 2. RDF and Semantic Web RDF is a language for the conceptual modeling of information about web resources A building block of semantic web Facilitates exchange of information Search engines can retrieve more relevant information Facilitates data integration (mashes) Machine understandable Understand the information on the web and the interrelationships among themUPenn/2012-12-04 2
  • 3. RDF Uses Yago and DBPedia extract facts from Wikipedia & represent as RDF → structural queries Communities build RDF data E.g., biologists: Bio2RDF and Uniprot RDF Web data integration Linked Data Cloud ...UPenn/2012-12-04 3
  • 4. RDF Data Volumes . . . . . . are growing – and fast Linked data cloud currently consists of 325 datasets with >25B triples Size almost doubling every yearUPenn/2012-12-04 4
  • 5. RDF Data Volumes . . . . . . are growing – and fast Linked data cloud currently consists of 325 datasets with >25B triples Size almost doubling every year Sem- Wiki- Surge Web- company Radio LIBRIS Central RDF ohloh Doap- Music- space Semantic Resex brainz Audio- Eurécom Flickr Web.org MySpace Scrobbler QDOS SW exporter Wrapper Conference IRIT Corpus Toulouse RAE BBC BBC Crunch 2001 FOAF SIOC ACM BBC Later + John Base Revyu Jamendo Peel profiles Sites Playcount TOTP Open- Buda- Data Guides pest DBLP BME flickr RKB Project Pub Geo- Euro- wrappr Explorer Guten- Virtuoso Guide names stat berg Pisa BBC Sponger eprints Programm Open es Calais New- riese World Linked ECS castle Fact- MDB South- IEEE book ampton Magna- Gov- tune RDF Book Track Mashup DBpedia US Census Data W3C WordNet lingvoj Freebase DBLP Hannover CiteSeer UniRef LAAS- CNRS IBM March ’09: GEO Open Cyc Yago UMBEL LinkedCT Species DBLP Berlin Reactome UniParc Taxonomy 89 datasets Drug PROSITE Daily Bank Med Pub GeneID Homolo Chem Gene KEGG UniProt Pfam ProDom Disea- CAS Gene some ChEBI Ontology Symbol OMIM Inter Pro UniSTS PDB HGNC MGI PubMed As of March 2009 Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/UPenn/2012-12-04 4
  • 6. RDF Data Volumes . . . . . . are growing – and fast Linked data cloud currently consists of 325 datasets with >25B triples Size almost doubling every year Sussex St. Reading Andrews NDL Audio- Lists Resource subjects t4gm MySpace scrobbler Lists Moseley (DBTune) (DBTune) RAMEAU Folk NTU SH lobid GTAA Plymouth Resource Lists Organi- Reading Lists sations Music The Open ECS Magna- Brainz Music DB tune Library LCSH South- (Data Brainz LIBRIS ampton Tropes lobid Ulm Incubator) (zitgist) Man- EPrints Resources chester Surge Reading biz. Music RISKS Radio Lists The Open ECS data. John Brainz Discogs Library PSH Gem. UB South- gov.uk Peel (DBTune) FanHubz (Data In- (Talis) Norm- Mann- ampton (DB cubator) Jamendo datei heim RESEX Tune) Popula- Poké- DEPLOY Last.fm tion (En- pédia Artists Last.FM Linked RDF AKTing) research EUTC (DBTune) (rdfize) LCCN VIAF Book Wiki data.gov Produc- Pisa Eurécom P20 Mashup semantic NHS .uk tions classical web.org (EnAKTing) Pokedex (DB Mortality Tune) PBAC ECS (En- AKTing) BBC MARC (RKB Budapest Program Codes Explorer) Energy education OpenEI BBC List Semantic Lotico Revyu OAI (En- CO2 data.gov mes Music Crunch SW AKTing) (En- .uk Chronic- Linked Dog NSZL Base AKTing) ling Event- MDB RDF Food IRIT America Media Catalog ohloh BBC DBLP ACM IBM Good- BibBase Ord- Wildlife (RKB Openly Recht- win nance Finder Explorer) Local spraak. Family DBLP legislation Survey Tele- New VIVO UF .gov.uk nl graphis York flickr (L3S) New- VIVO castle Times URI wrappr Open Indiana RAE2001 UK Post- Burner Calais DBLP codes statistics (FU VIVO CiteSeer Roma September ’10: data.gov LOIUS Taxon iServe Berlin) IEEE .uk Cornell Concept Geo World data ESD Fact- OS dcs Names book dotAC stan- reference Project Linked Data NASA (FUB) Freebase dards data.gov Guten- .uk for Intervals (Data GESIS Course- transport DBpedia berg STW ePrints CORDIS Incu- ware data.gov bator) (FUB) Fishes ERA UN/ .uk 203 datasets of Texas Geo LOCODE Uberblic Euro- Species The stat dbpedia TCM SIDER Pub KISTI (FUB) lite Gene STITCH Chem JISC London Geo KEGG DIT LAAS Gazette TWC LOGD Linked Daily OBO Drug Eurostat Data UMBEL lingvoj Med (es) Disea- YAGO Medi some Care ChEBI KEGG NSF Linked KEGG KEGG Linked Drug Cpd GovTrack rdfabout Glycan Sensor Data CT Bank Pathway US SEC Open Reactome (Kno.e.sis) riese Uni Cyc Lexvo Path- way PDB Media Semantic totl.net Pfam HGNC XBRL WordNet KEGG KEGG Geographic (VUA) Linked Taxo- CAS Reaction rdfabout Twarql UniProt Enzyme EUNIS Open nomy US Census Publications Numbers PRO- ProDom SITE Chem2 UniRef Bio2RDF User-generated content Climbing WordNet SGD Homolo Linked (W3C) Affy- Gene GeoData Cornetto metrix Government PubMed Gene UniParc Ontology GeneID Cross-domain Airports Product DB UniSTS MGI Gen Life sciences Bank OMIM InterPro As of September 2010 Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/UPenn/2012-12-04 4
  • 7. RDF Data Volumes . . . . . . are growing – and fast Linked data cloud currently consists of 325 datasets with >25B triples Size almost doubling every year Linked LOV User Slideshare tags2con Audio Feedback 2RDF delicious Moseley Scrobbler Bricklink Sussex Folk (DBTune) Reading St. GTAA Magna- Lists Andrews Klapp- tune stuhl- Resource NTU DB club Lists Resource Tropes Lotico Semantic yovisto John Music Man- Lists Music Tweet chester Hellenic Peel Brainz NDL (DBTune) (Data Brainz Reading subjects FBD (zitgist) Lists Open EUTC Incubator) Linked Hellenic Library Open t4gm Produc- Crunch- PD Surge RDF info tions Discogs base Library Radio Ontos Source Code Crime ohloh Plymouth (Talis) (Data News LEM Ecosystem Reading RAMEAU Reports business Incubator) Crime data.gov. Portal Linked Data Lists SH UK Music Jamendo (En- uk Brainz (DBtune) LinkedL Ox AKTing) FanHubz gnoss ntnusc (DBTune) SSW CCN Points Thesau- Last.FM Poké- Thesaur Popula- artists pédia Didactal us rus W LIBRIS tion (En- (DBTune) Last.FM ia theses. LCSH Rådata reegle research patents MARC AKTing) (rdfize) my fr nå! data.gov. data.go Codes Ren. NHS uk v.uk Good- Experi- Classical List Energy (En- win flickr ment (DB Pokedex Family Norwe- Genera- AKTing) Mortality BBC wrappr Sudoc PSH Tune) gian (En- tors Program MeSH AKTing) semantic mes BBC IdRef GND CO2 educatio OpenEI web.org SW Energy Sudoc ndlna Emission n.data.g Music Dog VIAF EEA (En- Chronic- Linked (En- ov.uk Portu- Food UB AKTing) ling Event MDB AKTing) guese Mann- Europeana BBC America Media DBpedia Calames heim Ord- Recht- Wildlife Deutsche Open Revyu DDC Openly spraak. Finder Bio- lobid Election nance legislation Local nl RDF graphie Resources NSZL Swedish Data Survey Tele- data Ulm EU New Book Project data.gov.uk graphis bnf.fr Catalog Open Insti- York Open Mashup Cultural tutions Times URI Greek P20 UK Post- Burner Calais Heritage codes DBpedia ECS Wiki statistics lobid GovWILD data.gov. Taxon iServe South- Organi- LOIUS BNB Brazilian uk Concept ECS ampton sations Geo World OS BibBase STW GESIS Poli- ESD South- ECS Names Fact- (RKB ticians stan- reference ampton data.gov.uk book Freebase Explorer) Budapest dards data.gov. NASA EPrints uk intervals Project OAI Lichfield transport (Data DBpedia data Pisa September ’11: Spen- Incu- Guten- dcs data.gov. RESEX Scholaro- ISTAT ding bator) Fishes berg DBLP DBLP uk Geo meter Immi- Scotland of Texas (FU (L3S) Pupils & Uberblic DBLP gration Species Berlin) IRIT Exams Euro- dbpedia data- (RKB London TCM ACM stat lite open- Explorer) NVD Gazette (FUB) Gene IBM Traffic Geo ac-uk Scotland TWC LOGD Eurostat Daily DIT Linked UN/ Data UMBEL Med ERA Data LOCODE DEPLOY Gov.ie CORDIS YAGO New- lingvoj Disea- 295 datasets (RKB some SIDER RAE2001 castle LOCAH CORDIS Explorer) Linked Eurécom Eurostat Drug CiteSeer Roma (FUB) Sensor Data GovTrack (Ontology (Kno.e.sis) Open Bank Pfam Course- Central) riese Enipedia Cyc Lexvo LinkedCT ware Linked PDB UniProt VIVO EURES EDGAR dotAC US SEC Indiana ePrints IEEE (Ontology totl.net (rdfabout) Central) WordNet RISKS (VUA) Taxono UniProt US Census EUNIS Twarql HGNC Semantic Cornetto (Bio2RDF) (rdfabout) my VIVO FTS XBRL PRO- ProDom STITCH Cornell LAAS SITE KISTI NSF Scotland Geo- GeoWord LODE graphy Net WordNet WordNet JISC (W3C) (RKB Climbing Linked Affy- KEGG SMC Explorer) SISVU Pub VIVO UF Piedmont GeoData metrix Drug ECCO- Finnish Journals PubMed Gene SGD Chem Munici- Accomo- El AGROV Ontology TCP Media dations Alpine bible palities Viajero OC Ski ontology Tourism KEGG Ocean Austria Enzyme PBAC Geographic Metoffice GEMET ChEMBL Italian Drilling OMIM KEGG Weather Open public Codices AEMET Linked MGI Pathway schools Forecasts Data Open InterPro GeneID Publications EARTh Thesau- KEGG Turismo rus Colors Reaction de Zaragoza Product Smart KEGG User-generated content Weather DB Link Medi Glycan Janus Stations Product Care KEGG AMP UniParc UniRef UniSTS Government Types Italian Homolo Com- Yahoo! Airports Museums pound Ontology Google Gene Geo Art Planet National wrapper Chem2 Cross-domain Radio- Bio2RDF activity UniPath JP Sears Open Linked OGOLOD way Life sciences Corpo- Amster- Reactome dam medu- Open rates Numbers Museum cator As of September 2011 Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/UPenn/2012-12-04 4
  • 8. Outline 1 RDF Introduction 2 Answering Regular SPARQL Queries Answering SPARQL queries using graph pattern matching ¨ [ZouMoZhaoChenOzsu, PVLDB 2011] 3 Aggregation Answering Aggregate SPARQL Queries Over Large RDF Repositories 4 Future ResearchUPenn/2012-12-04 5
  • 9. Outline 1 RDF Introduction 2 Answering Regular SPARQL Queries Answering SPARQL queries using graph pattern matching ¨ [ZouMoZhaoChenOzsu, PVLDB 2011] 3 Aggregation Answering Aggregate SPARQL Queries Over Large RDF Repositories 4 Future ResearchUPenn/2012-12-04 6
  • 10. http://en.wikipedia.org/wiki/Abraham LincolnRDF Introduction Everything is an uniquely named resourceUPenn/2012-12-04 7
  • 11. xmlns:y=http://en.wikipedia.org/wikiRDF Introduction y:Abraham Lincoln Everything is an uniquely named resource Namespaces can be used to scope the namesUPenn/2012-12-04 7
  • 12. xmlns:y=http://en.wikipedia.org/wikiRDF Introduction y:Abraham Lincoln Everything is an uniquely named resource Namespaces can be used to scope the names Properties of resources can be defined Abraham Lincoln:hasName “Abraham Lincoln” Abraham Lincoln:BornOnDate: “1809-02-12” Abraham Lincoln:DiedOnDate: “1865-04-15”UPenn/2012-12-04 7
  • 13. xmlns:y=http://en.wikipedia.org/wikiRDF Introduction y:Abraham Lincoln Everything is an uniquely named resource Namespaces can be used to scope the names Properties of resources can be defined Abraham Lincoln:hasName “Abraham Lincoln” Abraham Lincoln:BornOnDate: “1809-02-12” Relationships with other resources Abraham Lincoln:DiedOnDate: “1865-04-15” can be defined Abraham Lincoln:DiedIn y:Washington DCUPenn/2012-12-04 7
  • 14. xmlns:y=http://en.wikipedia.org/wikiRDF Introduction y:Abraham Lincoln Everything is an uniquely named resource Namespaces can be used to scope the names Properties of resources can be defined Abraham Lincoln:hasName “Abraham Lincoln” Abraham Lincoln:BornOnDate: “1809-02-12” Relationships with other resources Abraham Lincoln:DiedOnDate: “1865-04-15” can be defined Resources can be contributed by Abraham Lincoln:DiedIn different people/groups and can be located anywhere in the web Integrated web “database” y:Washington DCUPenn/2012-12-04 7
  • 15. RDF Data Model Triple: Subject, Predicate (Property), U Object (s, p, o) Subject: the entity that is described Predicate (URI or blank node) Subject Object Predicate: a feature of the entity (URI) Object: value of the feature (URI, U B U B L blank node or literal) U: set of URIs (s, p, o) ∈ (U ∪ B) × U × (U ∪ B ∪ L) B: set of blank nodes L: set of literals Set of RDF triples is called an RDF graph Subject Predicate Object Abraham Lincoln hasName “Abraham Lincoln” Abraham Lincoln BornOnDate “1809-02-12” Abraham Lincoln DiedOnDate “1865-04-15”UPenn/2012-12-04 8
  • 16. RDF Example Instance Prefix: y=http://en.wikipedia.org/wiki Subject Predicate Object y: Abraham Lincoln hasName “Abraham Lincoln” y: Abraham Lincoln BornOnDate “1809-02-12”’ Literal y: Abraham Lincoln DiedOnDate “1865-04-15” URI y:Abraham Lincoln y: Abraham Lincoln bornIn DiedIn y:Hodgenville KY y: Washington DC y:Abraham Lincoln title “President” y:Abraham Lincoln gender “Male” y: Washington DC y:Washington DC hasName foundingYear “Washington D.C.” “1790” URI y:Hodgenville KY hasName “Hodgenville” y:United States hasName “United States” y:United States hasCapital y:Washington DC y:United States foundingYear “1776” y:Reese Witherspoon bornOnDate “1976-03-22” y:Reese Witherspoon bornIn y:New Orleans LA y:Reese Witherspoon hasName “Reese Witherspoon” y:Reese Witherspoon gender “Female” y:Reese Witherspoon title “Actress” y:New Orleans LA foundingYear “1718” y:New Orleans LA locatedIn y:United States y:Franklin Roosevelt hasName “Franklin D. Roosevelt” y:Franklin Roosevelt bornIn y:Hyde Park NY y:Franklin Roosevelt title “President” y:Franklin Roosevelt gender “Male” y:Hyde Park NY foundingYear “1810” y:Hyde Park NY locatedIn y:United States y:Marilyn Monroe gender “Female” y:Marilyn Monroe hasName “Marilyn Monroe” y:Marilyn Monroe bornOnDate “1926-07-01” y:Marilyn Monroe diedOnDate “1962-08-05”UPenn/2012-12-04 9
  • 17. RDF Graph “1926-07-01” “Female” bornOnDate gender diedOnDate hasName “1962-08-05” y:Marilyn Monroe “Marilyn Monroe” “Abraham Lincoln” “President” “Male” hasName title gender bornOnDate bornIn hasName “1809-02-12” y:Abraham Lincoln y:Hodgenville KY “Hodgenville” diedOnDate diedIn “Franklin D. Roosevelt” “Male” “1865-04-15” “1776” hasName gender y:Washington D.C. y:Franklin Roosevelt foundingYear foundYear hasName hasCapital “1976-03-22” title bornOnDate “1790” “Washington D.C.” y:United States “President” y:Reese Witherspoon bornIn gender hasName locatedIn locatedIn title hasName bornIn “Female” “United States” “Actress”“Reese Witherspoon” y:New Orleans LA y:Hyde Park NY foundingYear foundingYear “1718” “1810”UPenn/2012-12-04 10
  • 18. RDF Query Model Query Model - SPARQL Protocol and RDF Query Language Given U (set of URIs), L (set of literals), and V (set of variables), a SPARQL expression is defined recursively: an atomic triple pattern, which is an element of (U ∪ V ) × (U ∪ V ) × (U ∪ V ∪ L) ?x hasName “Abraham Lincoln” P FILTER R, where P is a graph pattern expression and R is a built-in SPARQL condition (i.e., analogous to a SQL predicate) ?x price ?p FILTER(?p < 30) P1 AND/OPT/UNION P2, where P1 and P2 are graph pattern expressions Example: SELECT ?name WHERE { ?m <b o r n I n > ? c i t y . ?m <hasName> ?name . ?m<bornOnDate> ? bd . ? c i t y <f o u n d i n g Y e a r > ‘ ‘ 1 7 1 8 ’ ’ . FILTER ( regex ( s t r ( ? bd ) , ‘ ‘ 1 9 7 6 ’ ’ ) ) }UPenn/2012-12-04 11
  • 19. SPARQL Queries SELECT ?name WHERE { ?m <b o r n I n > ? c i t y . ?m <hasName> ?name . ?m<bornOnDate> ? bd . ? c i t y <f o u n d i n g Y e a r > ‘ ‘ 1 7 1 8 ’ ’ . FILTER ( regex ( s t r ( ? bd ) , ‘ ‘ 1 9 7 6 ’ ’ ) ) } FILTER(regex(str (?bd),“1976”)) ?name ?bd “1718” hasName bornOnDate foundingYear bornIn ?m ?cityUPenn/2012-12-04 12
  • 20. Na¨ Triple Store Design ıve SELECT ?name WHERE { ?m <b o r n I n > ? c i t y . ?m <hasName> ?name . ?m<bornOnDate> ? bd . ? c i t y <f o u n d i n g Y e a r > ‘ ‘ 1 7 1 8 ’ ’ . FILTER ( regex ( s t r ( ? bd ) , ‘ ‘ 1 9 7 6 ’ ’ ) ) } Subject Property Object y:Abraham Lincoln hasName “Abraham Lincoln” y:Abraham Lincoln bornOnDate “1809-02-12” y:Abraham Lincoln diedOnDate “1865-04-15” y:Abraham Lincoln bornIn y:Hodgenville KY y:Abraham Lincoln diedIn y:Washington DC y:Abraham Lincoln title “President” y:Abraham Lincoln gender “Male” y:Washington DC hasName “Washington D.C.” y:Washington DC foundingYear “1790” y:Hodgenville KY hasName “Hodgenville” y:United States hasName “United States” y:United States hasCapital y:Washington DC y:United States foundingYear “1776” y:Reese Witherspoon bornOnDate “1976-03-22” y:Reese Witherspoon bornIn y:New Orleans LA y:Reese Witherspoon hasName “Reese Witherspoon” y:Reese Witherspoon gender “Female” y:Reese Witherspoon title “Actress” y:New Orleans LA foundingYear “1718” y:New Orleans LA locatedIn y:United States y:Franklin Roosevelt hasName “Franklin D. Roo- sevelt” y:Franklin Roosevelt bornIn y:Hyde Park NY y:Franklin Roosevelt title “President” y:Franklin Roosevelt gender “Male” y:Hyde Park NY foundingYear “1810” y:Hyde Park NY locatedIn y:United States y:Marilyn Monroe gender “Female” y:Marilyn Monroe hasName “Marilyn Monroe” y:Marilyn Monroe bornOnDate “1926-07-01” y:Marilyn Monroe diedOnDate “1962-08-05”UPenn/2012-12-04 13
  • 21. Na¨ Triple Store Design ıve SELECT ?name WHERE { ?m <b o r n I n > ? c i t y . ?m <hasName> ?name . ?m<bornOnDate> ? bd . ? c i t y <f o u n d i n g Y e a r > ‘ ‘ 1 7 1 8 ’ ’ . FILTER ( regex ( s t r ( ? bd ) , ‘ ‘ 1 9 7 6 ’ ’ ) ) } Subject Property Object y:Abraham Lincoln hasName “Abraham Lincoln” y:Abraham Lincoln bornOnDate “1809-02-12” y:Abraham Lincoln diedOnDate “1865-04-15” y:Abraham Lincoln bornIn y:Hodgenville KY y:Abraham Lincoln diedIn y:Washington DC y:Abraham Lincoln title “President” y:Abraham Lincoln gender “Male” y:Washington DC hasName “Washington D.C.” y:Washington DC foundingYear “1790” y:Hodgenville KY hasName “Hodgenville” y:United States hasName “United States” SELECT T2 . o b j e c t y:United States y:United States hasCapital foundingYear y:Washington DC “1776” FROM T a s T1 , T a s T2 , T a s T3 , y:Reese Witherspoon y:Reese Witherspoon bornOnDate bornIn “1976-03-22” y:New Orleans LA T a s T4 y:Reese Witherspoon y:Reese Witherspoon hasName gender “Reese Witherspoon” “Female” WHERE T1 . p r o p e r t y=” b o r n I n ” y:Reese Witherspoon y:New Orleans LA title foundingYear “Actress” “1718” AND T2 . p r o p e r t y=” hasName ” y:New Orleans LA y:Franklin Roosevelt locatedIn hasName y:United States “Franklin D. Roo- AND T3 . p r o p e r t y=” bornOnDate ” sevelt” AND T1 . s u b j e c t=T2 . s u b j e c t y:Franklin Roosevelt bornIn y:Hyde Park NY y:Franklin Roosevelt title “President” AND T2 . s u b j e c t=T3 . s u b j e c t y:Franklin Roosevelt gender “Male” y:Hyde Park NY foundingYear “1810” AND T4 . p r o p e t y=” f o u n d i n g Y e a r ” y:Hyde Park NY locatedIn y:United States y:Marilyn Monroe gender “Female” AND T1 . o b j e c t=T4 . s u b j e c t y:Marilyn Monroe hasName “Marilyn Monroe” y:Marilyn Monroe bornOnDate “1926-07-01” AND T4 . o b j e c t=” 1718 ” y:Marilyn Monroe diedOnDate “1962-08-05” AND T3 . o b j e c t LIKE ’ %1976% ’UPenn/2012-12-04 13
  • 22. Na¨ Triple Store Design ıve SELECT ?name WHERE { ?m <b o r n I n > ? c i t y . ?m <hasName> ?name . ?m<bornOnDate> ? bd . ? c i t y <f o u n d i n g Y e a r > ‘ ‘ 1 7 1 8 ’ ’ . FILTER ( regex ( s t r ( ? bd ) , ‘ ‘ 1 9 7 6 ’ ’ ) ) } Subject y:Abraham Lincoln y:Abraham Lincoln Property hasName bornOnDate Object “Abraham Lincoln” “1809-02-12” Too many self-joins! y:Abraham Lincoln diedOnDate “1865-04-15” y:Abraham Lincoln bornIn y:Hodgenville KY y:Abraham Lincoln diedIn y:Washington DC y:Abraham Lincoln title “President” y:Abraham Lincoln gender “Male” y:Washington DC hasName “Washington D.C.” y:Washington DC foundingYear “1790” y:Hodgenville KY hasName “Hodgenville” y:United States hasName “United States” SELECT T2 . o b j e c t y:United States y:United States hasCapital foundingYear y:Washington DC “1776” FROM T a s T1 , T a s T2 , T a s T3 , y:Reese Witherspoon y:Reese Witherspoon bornOnDate bornIn “1976-03-22” y:New Orleans LA T a s T4 y:Reese Witherspoon y:Reese Witherspoon hasName gender “Reese Witherspoon” “Female” WHERE T1 . p r o p e r t y=” b o r n I n ” y:Reese Witherspoon y:New Orleans LA title foundingYear “Actress” “1718” AND T2 . p r o p e r t y=” hasName ” y:New Orleans LA y:Franklin Roosevelt locatedIn hasName y:United States “Franklin D. Roo- AND T3 . p r o p e r t y=” bornOnDate ” sevelt” AND T1 . s u b j e c t=T2 . s u b j e c t y:Franklin Roosevelt bornIn y:Hyde Park NY y:Franklin Roosevelt title “President” AND T2 . s u b j e c t=T3 . s u b j e c t y:Franklin Roosevelt gender “Male” y:Hyde Park NY foundingYear “1810” AND T4 . p r o p e t y=” f o u n d i n g Y e a r ” y:Hyde Park NY locatedIn y:United States y:Marilyn Monroe gender “Female” AND T1 . o b j e c t=T4 . s u b j e c t y:Marilyn Monroe hasName “Marilyn Monroe” y:Marilyn Monroe bornOnDate “1926-07-01” AND T4 . o b j e c t=” 1718 ” y:Marilyn Monroe diedOnDate “1962-08-05” AND T3 . o b j e c t LIKE ’ %1976% ’UPenn/2012-12-04 13
  • 23. Existing Solutions 1. Property table Each class of objects go to a different table ⇒ similar to normalized relations Eliminates some of the joins 2. Vertically partitioned tables For each property, build a two-column table, containing both subject and object, ordered by subjects Can use merge join (faster) Good for subject-subject joins but does not help with subject-object joins 3. Exhaustive indexing Create indexes for each permutation of the three columns Query components become range queries over individual relations with merge-join to combine Excessive space usageUPenn/2012-12-04 14
  • 24. Existing Solutions 1. Property table Each class of objects go to a different table ⇒ similar to normalized relations Eliminates some of the joins 2. Vertically partitioned tables For each property, build a two-column table, containing both subject and object, ordered by subjects Can use merge join (faster) Good for subject-subject joins but does not help with subject-object joins 3. Exhaustive indexing Create indexes for each permutation of the three columns Query components become range queries over individual relations with merge-join to combine Excessive space usage None can handle wildcard queries easily Most cannot handle updatesUPenn/2012-12-04 14
  • 25. Outline 1 RDF Introduction 2 Answering Regular SPARQL Queries Answering SPARQL queries using graph pattern matching ¨ [ZouMoZhaoChenOzsu, PVLDB 2011] 3 Aggregation Answering Aggregate SPARQL Queries Over Large RDF Repositories 4 Future ResearchUPenn/2012-12-04 15
  • 26. Our Approach – General Idea We work directly on the RDF graph and the SPARQL query graph Answering SPARQL query ≡ subgraph matching Subgraph matching is computationally expensive Use a signature-based encoding of each entity and class vertex to speed up matching Filter-and-evaluate Use a false positive algorithm to prune nodes and obtain a set of candidates; then do more detailed evaluation on those We develop an index (VS∗ -tree) over the data signature graph (has light maintenance load) for efficient pruningUPenn/2012-12-04 16
  • 27. Why not ordinary graph pattern matching? 1. RDF graphs are much larger than what is considered in usual graph pattern matching (by orders of magnitude) 2. Cardinality of vertices and edges in an RDF graph is much larger than in traditional graph databases AIDS dataset: 10,000 data graphs, each with avg 20 vertices & 25 edges with a total size 5MB Yago RDF: 500M vertices with a total size >3GB 3. SPARQL queries have special structure Star subqueries that use several properties of the same entityUPenn/2012-12-04 17
  • 28. 0. Start with RDF Graph G FILTER(regex(str (?bd),“1976”)) ?name ?bd “1718” hasName bornOnDate foundingYear bornIn ?m ?city “1926-07-01” “Female” bornOnDate gender diedOnDate hasName “1962-08-05” y:Marilyn Monroe “Marilyn Monroe” “Abraham Lincoln” “President” “Male” hasName title gender bornOnDate bornIn hasName “1809-02-12” y:Abraham Lincoln y:Hodgenville KY “Hodgenville” diedOnDate diedIn “Franklin D. Roosevelt” “Male” “1865-04-15” “1776” hasName gender y:Washington D.C. y:Franklin Roosevelt foundingYear foundYear hasName hasCapital “1976-03-22” title bornOnDate “1790” “Washington D.C.” y:United States “President” y:Reese Witherspoon bornIn gender hasName locatedIn locatedIn title hasName bornIn “Female” “United States” “Actress”“Reese Witherspoon” y:New Orleans LA y:Hyde Park NY foundingYear foundingYear “1718” “1810”UPenn/2012-12-04 18
  • 29. 0. Start with RDF Graph G FILTER(regex(str (?bd),“1976”)) Finding matches over a large ?name ?bd “1718” graph is not a trivial task! hasName bornOnDate foundingYear bornIn ?m ?city “1926-07-01” “Female” bornOnDate gender diedOnDate hasName “1962-08-05” y:Marilyn Monroe “Marilyn Monroe” “Abraham Lincoln” “President” “Male” hasName title gender bornOnDate bornIn hasName “1809-02-12” y:Abraham Lincoln y:Hodgenville KY “Hodgenville” diedOnDate diedIn “Franklin D. Roosevelt” “Male” “1865-04-15” “1776” hasName gender y:Washington D.C. y:Franklin Roosevelt foundingYear foundYear hasName hasCapital “1976-03-22” title bornOnDate “1790” “Washington D.C.” y:United States “President” y:Reese Witherspoon bornIn gender hasName locatedIn locatedIn title hasName bornIn “Female” “United States” “Actress”“Reese Witherspoon” y:New Orleans LA y:Hyde Park NY foundingYear foundingYear “1718” “1810”UPenn/2012-12-04 18
  • 30. 1. Represent G as Adjacency List and Encode Each Vertex Prefix: y=http://en.wikipedia.org/wiki/ Label adjList . . . . . . y:Reese Witherspoon (BornOnDate,“1976-03-22”), (gender,“Female”), (title,“Actress”), (hasName,“Reese Witherspoon”), (BornIn,New Orleans LA) . . . . . .UPenn/2012-12-04 19
  • 31. 1. Represent G as Adjacency List and Encode Each Vertex Prefix: y=http://en.wikipedia.org/wiki/ Label adjList . . . . . . y:Reese Witherspoon (BornOnDate,“1976-03-22”), (gender,“Female”), (title,“Actress”), (hasName,“Reese Witherspoon”), (BornIn,New Orleans LA) . . . . . . Encode all neighbors of a vertex into a bit string 0010 1000 vertex signatureUPenn/2012-12-04 19
  • 32. 2. Construct Data Signature Graph G ∗ 001 003 009 10000 0000 0001 0001 1000 0010 0000 00001 002 0010 1000 0011 1000 10000 10000 005 004 008 01000 0010 1000 1000 0001 0100 0000 10000 006 01000 1000 0100UPenn/2012-12-04 20
  • 33. 3. Encode Q to Get Signature Graph Q ∗ Q FILTER(regex(str (?bd),“1976”)) ?name ?bd “1718” hasName bornOnDate foundingYear bornIn ?m ?city Q∗ 10000 0010 1000 1000 0000UPenn/2012-12-04 21
  • 34. 4. Filter-and-Evaluate Query signature graph Q ∗ 10000 0010 1000 1000 0000 Data signature graph G ∗ 10000 0000 0001 0001 1000 0010 0000 00001 0010 1000 0011 1000 10000 10000 01000 0010 1000 1000 0001 0100 0000 10000 01000 1000 0100 Find matches of Q ∗ over Verify each match in signature graph G ∗ RDF graph GUPenn/2012-12-04 22
  • 35. How to Generate Candidate ListUPenn/2012-12-04 23
  • 36. How to Generate Candidate List Two step process: 1. Run an inclusion query for each node of Q ∗ and get lists of nodes in G ∗ that include that node. Given query signature q and a set of data signatures S, find all data signatures si ∈ S where q&si = q 2. Do a multi-way join to get the candidate listUPenn/2012-12-04 23
  • 37. How to Generate Candidate List Two step process: 1. Run an inclusion query for each node of Q ∗ and get lists of nodes in G ∗ that include that node. Given query signature q and a set of data signatures S, find all data signatures si ∈ S where q&si = q 2. Do a multi-way join to get the candidate list Alternatives: Sequential scan of G ∗ Both steps are inefficientUPenn/2012-12-04 23
  • 38. How to Generate Candidate List Two step process: 1. Run an inclusion query for each node of Q ∗ and get lists of nodes in G ∗ that include that node. Given query signature q and a set of data signatures S, find all data signatures si ∈ S where q&si = q 2. Do a multi-way join to get the candidate list Alternatives: Sequential scan of G ∗ Both steps are inefficient Use S-trees Height-balanced tree over signatures Support inclusion queries well Does not support second step – expensiveUPenn/2012-12-04 23
  • 39. How to Generate Candidate List Two step process: 1. Run an inclusion query for each node of Q ∗ and get lists of nodes in G ∗ that include that node. Given query signature q and a set of data signatures S, find all data signatures si ∈ S where q&si = q 2. Do a multi-way join to get the candidate list Alternatives: Sequential scan of G ∗ Both steps are inefficient Use S-trees Height-balanced tree over signatures Support inclusion queries well Does not support second step – expensive VS-tree (and VS∗ -tree) Multi-resolution summary graph based on S-tree Supports both steps efficientlyUPenn/2012-12-04 23
  • 40. S-tree Solution 10000 0010 1000 1000 0000 1 d1 1111 1101 G1 G2 2 d1 0011 1001 1101 1101 2 d2 G3 3 3 3 3 d1 0010 1001 0011 1000 d2 d3 0101 1000 1000 0101 d4 001 002 003 004 0000 0001 0011 1000 0001 1000 1000 0001 005 009 007 008 006 0010 1000 0010 0000 0010 1000 0100 0000 1000 0100UPenn/2012-12-04 24
  • 41. S-tree Solution 10000 0010 1000 1000 0000 1 d1 1111 1101 G1 G2 2 d1 0011 1001 1101 1101 2 d2 G3 3 3 3 3 d1 0010 1001 0011 1000 d2 d3 0101 1000 1000 0101 d4 001 002 003 004 0000 0001 0011 1000 0001 1000 1000 0001 005 009 007 008 006 0010 1000 0010 0000 0010 1000 0100 0000 1000 0100UPenn/2012-12-04 24
  • 42. S-tree Solution 002 10000 0010 1000 1000 0000 005 007 1 d1 1111 1101 G1 G2 2 d1 0011 1001 1101 1101 2 d2 G3 3 3 3 3 d1 0010 1001 0011 1000 d2 d3 0101 1000 1000 0101 d4 001 002 003 004 0000 0001 0011 1000 0001 1000 1000 0001 005 009 007 008 006 0010 1000 0010 0000 0010 1000 0100 0000 1000 0100UPenn/2012-12-04 24
  • 43. S-tree Solution 002 10000 004 0010 1000 1000 0000 005 006 007 1 d1 1111 1101 G1 G2 2 d1 0011 1001 1101 1101 2 d2 G3 3 3 3 3 d1 0010 1001 0011 1000 d2 d3 0101 1000 1000 0101 d4 001 002 003 004 0000 0001 0011 1000 0001 1000 1000 0001 005 009 007 008 006 0010 1000 0010 0000 0010 1000 0100 0000 1000 0100UPenn/2012-12-04 24
  • 44. S-tree Solution 002 10000 004 0010 1000 1000 0000 005 006 007 1 d1 1111 1101 G1 G2 2 d1 0011 1001 1101 1101 2 d2 G3 3 3 3 3 d1 0010 1001 0011 1000 d2 d3 0101 1000 1000 0101 d4 001 002 003 004 0000 0001 0011 1000 0001 1000 1000 0001 005 009 007 008 006 0010 1000 0010 0000 0010 1000 0100 0000 1000 0100UPenn/2012-12-04 24
  • 45. S-tree Solution 002 10000 004 0010 1000 1000 0000 005 006 007 1 d1 1111 1101 Large join space! G1 G2 2 d1 0011 1001 1101 1101 2 d2 G3 3 3 3 3 d1 0010 1001 0011 1000 d2 d3 0101 1000 1000 0101 d4 001 002 003 004 0000 0001 0011 1000 0001 1000 1000 0001 005 009 007 008 006 0010 1000 0010 0000 0010 1000 0100 0000 1000 0100UPenn/2012-12-04 24
  • 46. VS-tree 11101 G1 1 d1 1111 1101 00001 01000 G2 10000 2 d1 2 d2 0011 1001 1101 1101 00100 00010 10000 G 3d 3 0010 1001 3 3 0011 1000 d2 d3 0101 1000 1000 0101 3 d4 1 00001 10000 01000 00100 10000 00100 001 002 003 004 00001 01000 01000 0000 0001 0011 1000 0001 1000 1000 0001 005 009 007 008 01000 006 10000 0010 1000 0010 0000 0010 1000 0001 0000 1000 0100 10000 Super edgeUPenn/2012-12-04 25
  • 47. Pruning with VS-Tree 11101 10000 0010 1000 1000 0000 G1 1 d1 1111 1101 00001 01000 G2 10000 2 d1 2 d2 0011 1001 1101 1101 00100 00010 10000 G 3d 3 0010 1001 3 3 0011 1000 d2 d3 0101 1000 1000 0101 3 d4 1 00001 10000 01000 00100 10000 00100 001 002 003 004 00001 01000 01000 0000 0001 0011 1000 0001 1000 1000 0001 005 009 007 008 01000 006 10000 0010 1000 0010 0000 0010 1000 0001 0000 1000 0100 10000UPenn/2012-12-04 26
  • 48. Pruning with VS-Tree 11101 10000 0010 1000 1000 0000 G1 1 d1 1111 1101 00001 01000 G2 10000 2 d1 2 d2 0011 1001 1101 1101 00100 00010 10000 G 3d 3 0010 1001 3 3 0011 1000 d2 d3 0101 1000 1000 0101 3 d4 1 00001 10000 01000 00100 10000 00100 001 002 003 004 00001 01000 01000 0000 0001 0011 1000 0001 1000 1000 0001 005 009 007 008 01000 006 10000 0010 1000 0010 0000 0010 1000 0001 0000 1000 0100 10000UPenn/2012-12-04 26
  • 49. Pruning with VS-Tree 11101 10000 0010 1000 1000 0000 G1 1 d1 1111 1101 00001 01000 3 10000 3 G2 10000 d1 d4 2 d1 2 d2 0011 1001 1101 1101 00100 3 d2 00010 10000 G3 G 3d 3 0010 1001 3 3 0011 1000 d2 d3 0101 1000 1000 0101 3 d4 1 00001 10000 01000 00100 10000 00100 001 002 003 004 00001 01000 01000 0000 0001 0011 1000 0001 1000 1000 0001 005 009 007 008 01000 006 10000 0010 1000 0010 0000 0010 1000 0001 0000 1000 0100 10000UPenn/2012-12-04 26
  • 50. Pruning with VS-Tree 11101 10000 0010 1000 1000 0000 G1 1 d1 1111 1101 00001 01000 3 10000 3 G2 10000 d1 d4 2 d1 2 d2 0011 1001 1101 1101 00100 3 d2 00010 10000 G3 G 3d 3 0010 1001 3 3 0011 1000 d2 d3 0101 1000 1000 0101 3 d4 1 00001 10000 01000 00100 10000 00100 001 002 003 004 00001 01000 01000 0000 0001 0011 1000 0001 1000 1000 0001 005 009 007 008 01000 006 005 004 10000 0010 1000 0010 0000 0010 1000 0001 0000 1000 0100 002 006 10000 007UPenn/2012-12-04 26
  • 51. Pruning with VS-Tree 11101 10000 0010 1000 1000 0000 G1 1 d1 1111 1101 00001 01000 3 10000 3 G2 10000 d1 d4 2 d1 2 d2 0011 1001 1101 1101 00100 3 d2 00010 10000 G3 G 3d 3 0010 1001 3 3 0011 1000 d2 d3 0101 1000 1000 0101 3 d4 1 00001 10000 01000 00100 10000 00100 001 002 003 004 00001 01000 01000 0000 0001 0011 1000 0001 1000 1000 0001 005 009 007 008 01000 006 005 004 10000 0010 1000 0010 0000 0010 1000 0001 0000 1000 0100 002 006 10000 007UPenn/2012-12-04 26
  • 52. Pruning with VS-Tree 11101 10000 0010 1000 1000 0000 G1 1 d1 1111 1101 00001 01000 3 10000 3 G2 10000 d1 d4 2 d1 2 d2 0011 1001 1101 1101 00100 3 d2 00010 10000 G3 G 3d 3 0010 1001 3 3 0011 1000 d2 d3 0101 1000 1000 0101 3 d4 1 00001 10000 01000 00100 10000 00100 001 002 003 004 00001 01000 01000 0000 0001 0011 1000 0001 1000 1000 0001 005 009 007 008 01000 006 005 004 10000 0010 1000 0010 0000 0010 1000 0001 0000 1000 0100 002 006 10000 007 Reduced join space!UPenn/2012-12-04 Performance 26
  • 53. Evaluation of VS-tree 11101 G1 1 d1 1111 1101 00001 01000 G2 10000 2 d1 2 d2 0011 1001 1101 1101 00100 00010 10000 G 3d 3 0010 1001 3 3 0011 1000 d2 d3 0101 1000 1000 0101 3 d4 1 00001 10000 01000 00100 10000 00100 001 002 003 004 00001 01000 01000 0000 0001 0011 1000 0001 1000 1000 0001 005 009 007 008 01000 006 10000 0010 1000 0010 0000 0010 1000 0001 0000 1000 0100 10000UPenn/2012-12-04 27
  • 54. Evaluation of VS-tree Very dense in higher levels 11101 of VS-tree G1 1 d1 1111 1101 00001 01000 G2 10000 2 d1 2 d2 0011 1001 1101 1101 00100 00010 10000 G 3d 3 0010 1001 3 3 0011 1000 d2 d3 0101 1000 1000 0101 3 d4 1 00001 10000 01000 00100 10000 00100 001 002 003 004 00001 01000 01000 0000 0001 0011 1000 0001 1000 1000 0001 005 009 007 008 01000 006 10000 0010 1000 0010 0000 0010 1000 0001 0000 1000 0100 10000UPenn/2012-12-04 27
  • 55. Evaluation of VS-tree Very dense in higher levels 11101 of VS-tree G1 1 d1 1111 1101 00001 01000 G2 10000 2 d1 2 d2 0011 1001 1101 1101 00100 00010 10000 G 3d 3 1 0010 1001 3 3 0011 1000 d2 d3 0101 1000 1000 0101 3 d4 Lots of summary matches 00001 10000 01000 10000 00100 00100 in higher levels 001 002 003 004 0000 0001 00001 0011 1000 0001 1000 01000 1000 0001 01000 of VS-tree 005 009 007 008 01000 006 10000 0010 1000 0010 0000 0010 1000 0001 0000 1000 0100 10000UPenn/2012-12-04 27
  • 56. Evaluation of VS-tree Very dense in higher levels 11101 of VS-tree G1 1 d1 1111 1101 00001 01000 G2 10000 2 d1 2 d2 0011 1001 1101 1101 00100 00010 10000 G 3d 3 1 0010 1001 3 3 0011 1000 d2 d3 0101 1000 1000 0101 3 d4 Lots of summary matches 00001 10000 01000 10000 00100 00100 in higher levels 001 002 003 004 0000 0001 00001 0011 1000 0001 1000 01000 1000 0001 01000 of VS-tree 005 009 007 008 01000 006 10000 0010 1000 0010 0000 0010 1000 0001 0000 1000 0100 10000 Reduced pruning powerUPenn/2012-12-04 27
  • 57. Possible Optimizations 1. Do not start the multi-way join from the root Find an oracle that determines the best place to start the join Cost-based 2. Do not materialize summary matches at higher levels of VS-tree Use a DFS-search 3. Reduce the number of super-edges when building the VS-tree Reduce the number of vertices in each listUPenn/2012-12-04 28
  • 58. VS*-tree 1 1 d1 d1 G1 1001 1000 1001 1000 G1 10000 00010 10000 00010 10000 2 01000 2 2 2 d1 1100 0000 1011 0001 d2 d1 1100 0000 1011 0001 d2 G2 G2 01000 10000 G∗ 1000 0000 0100 0000 1001 0000 0010 0001 1000 0000 0000 1000 1001 0000 0010 0001 10000 00010 G∗ 00010 10000 01000 01000 0100 0000 Enter new vertex signature u into non-leaf node d where Hamming(u, d ) is minimumUPenn/2012-12-04 29
  • 59. VS*-tree 1 1 d1 d1 1001 1000 G1 1001 1000 G1 10000 00010 10000 00010 01000 2 01000 2 2 d1 1100 0000 1011 0001 2 d2 d1 1100 0000 1011 0001 d2 G2 G2 10000 G∗ 1000 0000 0100 0000 1001 0000 0010 0001 1000 0000 0100 0000 1001 0000 0000 1000 10000 10000 00010 G∗ 01000 00010 0010 0001 01000 Enter new vertex signature u into non-leaf node d considering Hamming distance & no of supernodes introducedUPenn/2012-12-04 29
  • 60. Pruning Effect ? x1 y : hasGivenName ? x5 ? x1 y : hasFamilyName ? x6 Before After ? x1 r d f : t y p e Pruning Pruning <w o r d n e t s c i e n t i s t 1 1 0 5 6 0 6 3 7 > x1 810 810 ? x1 y : B o r n I n ? x2 ? x1 y : h a s A c a d e m i c A d v i s o r ? x4 x2 424 197 ? x2 y : l o c a t e d I n <S w i t z e r l a n d > x3 66 66 ? x3 y : l o c a t e d I n <Germany> x4 36187 6686 ? x4 y : b o r n I n ? x3UPenn/2012-12-04 30
  • 61. Exact Queries Yago network (20 million triples & size 3.1GB) 6,000 Query Response Time (in ms) 5,000 4,000 3,000 2,000 1,000 0 A1 A2 B1 B2 B3 C1 C2 Queries gStore RDF-3X SW-Store x-RDF-3x BigOWLIM GRINUPenn/2012-12-04 31
  • 62. Wildcard Queries Yago network (20 million triples & size 3.1GB) 10,000 Query Response Time (in ms) 8,000 6,000 4,000 2,000 0 A1 A2 B1 B2 B3 C1 C2 Queries gStore RDF-3X+PF SW-Store+PF x-RDF-3x+PF BigOWLIM GRIN+PFUPenn/2012-12-04 32
  • 63. Outline 1 RDF Introduction 2 Answering Regular SPARQL Queries Answering SPARQL queries using graph pattern matching ¨ [ZouMoZhaoChenOzsu, PVLDB 2011] 3 Aggregation Answering Aggregate SPARQL Queries Over Large RDF Repositories 4 Future ResearchUPenn/2012-12-04 33
  • 64. Aggregation Queries(Q1 ) Group all individuals by their Prefix: y=http://en.wikipedia.org/wiki Subject Predicate Object genders, titles, and the founding year y: Abraham Lincoln hasName “Abraham Lincoln” y: Abraham Lincoln BornOnDate “1809-02-12”’ of their birth places, and report the y: Abraham Lincoln y:Abraham Lincoln DiedOnDate bornIn “1865-04-15” y:Hodgenville KY number of individuals in each group. y: Abraham Lincoln y:Abraham Lincoln DiedIn title y: Washington DC “President” SELECT ? t ? g ? y COUNT( ?m) y:Abraham Lincoln gender “Male” y: Washington DC hasName “Washington D.C.” WHERE {?m <b o r n I n> ? c . y:Washington DC foundingYear “1790” ?m <t i t l e > ? t . y:Hodgenville KY hasName “Hodgenville” ?m <g e n d e r> ? g . y:United States hasName “United States” ? c <f o u n d i n g Y e a r> ? y . } y:United States hasCapital y:Washington DC GROUP BY ? t ? g , ? y . y:United States foundingYear “1776” y:Reese Witherspoon bornOnDate “1976-03-22” y:Reese Witherspoon bornIn y:New Orleans LA y:Reese Witherspoon hasName “Reese Witherspoon” y:Reese Witherspoon gender “Female” y:Reese Witherspoon title “Actress” y:New Orleans LA foundingYear “1718” y:New Orleans LA locatedIn y:United States y:Franklin Roosevelt hasName “Franklin D. Roosevelt” y:Franklin Roosevelt bornIn y:Hyde Park NY y:Franklin Roosevelt title “President” y:Franklin Roosevelt gender “Male” y:Hyde Park NY foundingYear “1810” y:Hyde Park NY locatedIn y:United States y:Marilyn Monroe gender “Female” y:Marilyn Monroe hasName “Marilyn Monroe” y:Marilyn Monroe bornOnDate “1926-07-01” y:Marilyn Monroe diedOnDate “1962-08-05”UPenn/2012-12-04 34
  • 65. Aggregation Queries(Q1 ) Group all individuals by their Prefix: y=http://en.wikipedia.org/wiki Subject Predicate Object genders, titles, and the founding year y: Abraham Lincoln hasName “Abraham Lincoln” y: Abraham Lincoln BornOnDate “1809-02-12”’ of their birth places, and report the y: Abraham Lincoln y:Abraham Lincoln DiedOnDate bornIn “1865-04-15” y:Hodgenville KY number of individuals in each group. y: Abraham Lincoln y:Abraham Lincoln DiedIn title y: Washington DC “President” SELECT ? t ? g ? y COUNT( ?m) y:Abraham Lincoln gender “Male” y: Washington DC hasName “Washington D.C.” WHERE {?m <b o r n I n> ? c . y:Washington DC foundingYear “1790” ?m <t i t l e > ? t . y:Hodgenville KY hasName “Hodgenville” ?m <g e n d e r> ? g . y:United States hasName “United States” ? c <f o u n d i n g Y e a r> ? y . } y:United States hasCapital y:Washington DC GROUP BY ? t ? g , ? y . y:United States foundingYear “1776” y:Reese Witherspoon bornOnDate “1976-03-22”(Q2 ) Group all individuals by their genders, y:Reese Witherspoon bornIn y:New Orleans LA y:Reese Witherspoon hasName “Reese Witherspoon” titles, and the founding year of their y:Reese Witherspoon gender “Female” y:Reese Witherspoon title “Actress” birth places, and report the number y:New Orleans LA foundingYear “1718” y:New Orleans LA locatedIn y:United States of individuals in each group for those y:Franklin Roosevelt hasName “Franklin D. Roosevelt” y:Franklin Roosevelt bornIn y:Hyde Park NY groups with more than 10 members. y:Franklin Roosevelt y:Franklin Roosevelt title gender “President” “Male” SELECT ? t ? g ? y COUNT( ?m) y:Hyde Park NY foundingYear “1810” WHERE {?m <b o r n I n> ? c . y:Hyde Park NY locatedIn y:United States y:Marilyn Monroe gender “Female” ?m <t i t l e > ? t . y:Marilyn Monroe hasName “Marilyn Monroe” ?m <g e n d e r> ? g . y:Marilyn Monroe bornOnDate “1926-07-01” ? c <f o u n d i n g Y e a r> ? y . } y:Marilyn Monroe diedOnDate “1962-08-05” GROUP BY ? t ? g , ? y . HAVING COUNT( ?m) >10.UPenn/2012-12-04 34
  • 66. Aggregate Query Results SELECT ? t ? g ? y COUNT( ?m) WHERE {?m <b o r n I n> ? c . ?m <t i t l e > ? t . ?m <g e n d e r> ? g . ? c <f o u n d i n g Y e a r> ? y . } GROUP BY ? t ? g , ? y . 030 031 “1926-07-01” “Female” bornOnDate gender 009 029 diedOnDate hasName 032 “1962-08-05” y:Marilyn Monroe “Marilyn Monroe” 010 013 014 “Abraham Lincoln” “President” “Male” hasName title gender 001 003 ?m bornIn foundingYear 011 bornOnDate bornIn hasName 017 v1 v2 “1809-02-12” y:Abraham Lincoln y:Hodgenville KY “Hodgenville” diedOnDate 025 026 012 diedIn 019 “Franklin D. Roosevelt” “Male” gender title “1865-04-15” 002 “1776” hasName 007 gender y:Washington D.C. 020 y:Franklin Roosevelt foundingYear foundYear hasName hasCapital “1976-03-22” 015 016 004 title bornOnDate 005 “1790” 027 “Washington D.C.” y:United States “President” y:Reese Witherspoon bornIn gender hasName 021 locatedIn locatedIn title hasName bornIn “Female” 018 022 023 006 “United States” 008 “Actress”“Reese Witherspoon” y:New Orleans LA y:Hyde Park NY foundingYear foundingYear 024 028 gender title foundingYear COUNT(?m) “1718” “1810” Male President 1718 1 Female Actress 1810 1 Aggregate ResultUPenn/2012-12-04 36
  • 67. Aggregate Query Results SELECT ? t ? g ? y COUNT( ?m) WHERE {?m <b o r n I n> ? c . Query ?m <t i t l e > ? t . ?m <g e n d e r> ? g . pattern ? c <f o u n d i n g Y e a r> ? y . } GROUP BY ? t ? g , ? y . 030 031 “1926-07-01” “Female” Measure bornOnDate gender 009 029 hasName 032 dimension “1962-08-05” diedOnDate y:Marilyn Monroe “Marilyn Monroe” 010 013 014 “Abraham Lincoln” “President” “Male” hasName title gender ?m bornIn foundingYear 011 bornOnDate 001 bornIn 003 hasName 017 v1 v2 “1809-02-12” y:Abraham Lincoln y:Hodgenville KY “Hodgenville” diedOnDate 025 026 012 diedIn 019 “Franklin D. Roosevelt” “Male” gender title “1865-04-15” 002 “1776” hasName 007 gender y:Washington D.C. 020 y:Franklin Roosevelt foundingYear foundYear hasName hasCapital “1976-03-22” 015 016 004 title bornOnDate 005 “1790” 027 “Washington D.C.” y:United States “President” y:Reese Witherspoon bornIn gender hasName Group-by 021 “Female” title hasName bornIn locatedIn 018 “United States” locatedIn 022 023 006 008 dimensions “Actress”“Reese Witherspoon” y:New Orleans LA y:Hyde Park NY foundingYear foundingYear 024 028 gender title foundingYear COUNT(?m) “1718” “1810” Male President 1718 1 Female Actress 1810 1 Aggregate ResultUPenn/2012-12-04 36
  • 68. Existing Solutions 1. Using Existing SPARQL Query Engines Ignore aggregation Employ existing Online compute function SPARQL engine aggregation Aggregate query Query without Result without Query result Q aggregation (Q ) aggregation (R(Q )) R(Q)UPenn/2012-12-04 37
  • 69. Existing Solutions 1. Using Existing SPARQL Query Engines Ignore aggregation Employ existing Online compute function SPARQL engine aggregation Aggregate query Query without Result without Query result Q aggregation (Q ) aggregation (R(Q )) R(Q) Problem: missed opportunities for query optimizationUPenn/2012-12-04 37
  • 70. Existing Solutions 1. Using Existing SPARQL Query Engines Ignore aggregation Employ existing Online compute function SPARQL engine aggregation Aggregate query Query without Result without Query result Q aggregation (Q ) aggregation (R(Q )) R(Q) Problem: missed opportunities for query optimization 2. Using a Relational DBMS Map to table Build relational RDF dataset (property tables) aggregate indexUPenn/2012-12-04 37
  • 71. Existing Solutions 1. Using Existing SPARQL Query Engines Ignore aggregation Employ existing Online compute function SPARQL engine aggregation Aggregate query Query without Result without Query result Q aggregation (Q ) aggregation (R(Q )) R(Q) Problem: missed opportunities for query optimization 2. Using a Relational DBMS Map to table Build relational RDF dataset (property tables) aggregate index Problems: (1) Usual problems with relational implementations; (2 )RDF data is high dimensional (DBpedia has more than 1000 properties) ⇒ prohibits using relational optimization techniquesUPenn/2012-12-04 37
  • 72. RDF Data are not Well Structured In relational systems aggregates can be computed from materialized views. Key A B C SUM Key A B SUM View V1 Aggregate table T T can be constructed by scanning V1UPenn/2012-12-04 38
  • 73. RDF Data are not Well Structured This can’t always be done in RDF data(Q1 ) Group all individuals by their (Q2 ) Group all individuals by their titles, gender and birthday, titles and gender, and report and report the number of the number of individuals in individuals in each group. each group. SELECT ? t ? g ? y COUNT( ?m) SELECT ? t ? g ? y COUNT( ?m) WHERE {?m <b o r n I n> ? c . WHERE {?m <b o r n I n> ? c . ?m <t i t l e > ? t . ?m <t i t l e > ? t . ?m <g e n d e r> ? g . ?m <g e n d e r> ? g . } ? c <bornOnDate> ? y . } GROUP BY ? t ? g . GROUP BY ? t ? g , ? y . gender title bornOnDate COUNT gender title COUNT Male President 1865-04-15 1 Male President 2 Female Actress 1976-03-22 1 Female Actress 1 R(Q1 ) R(Q2 ) R(Q2 ) cannot be constructed by scanning R(Q1 )UPenn/2012-12-04 38
  • 74. Overview of Approach Work on RDF graphs and query graphs ⇒ graph pattern matching. Q 1. Decompose aggregate ?m bornIn foundingYear query Q into star v1 v2 queries {S1 , . . . , Sn } gender title ?m foundingYear v1 v2 gender title S1 S2UPenn/2012-12-04 39
  • 75. Overview of Approach Work on RDF graphs and query graphs ⇒ graph pattern matching. Q 1. Decompose aggregate ?m bornIn foundingYear query Q into star v1 v2 queries {S1 , . . . , Sn } gender title 2. Evaluate each star ?m foundingYear v1 v2 query to get results R(S1 ), . . . , R(Sn ) gender title S1 S2 T-index: trie foundingYear structure gender title 1790 Evaluate without Male President 1718 Female Actress 1810 joins 1776 R(S1 ) R(S2 )UPenn/2012-12-04 39
  • 76. Overview of Approach Work on RDF graphs and query graphs ⇒ graph pattern matching. Q 1. Decompose aggregate ?m bornIn foundingYear query Q into star v1 v2 queries {S1 , . . . , Sn } gender title 2. Evaluate each star ?m foundingYear v1 v2 query to get results R(S1 ), . . . , R(Sn ) gender title S1 S2 T-index: trie foundingYear structure gender title 1790 Evaluate without Male President 1718 Female Actress 1810 joins 1776 3. Compute result of Q R(S1 ) R(S2 ) as R(Q) = i (Ri ) Use VS*-tree to speed up join gender title foundingYear COUNT(?m) Male President 1718 1 processing Female Actress 1810 1UPenn/2012-12-04 39
  • 77. Selective Materialization of Views: T-index Selectively materialize views through a trie structure called T-index. M(N4 ) hasName gender bornOnDate title L Abraham Male 1865-04-15 President {001} Lincoln Transaction Database D Reese With- Female 1976-03-22 Actress {005} erspoon Vertex Entity vertex Adjacent Attributes root ID N0 001 Abraham Lincoln hasName, gender, bornOnDate,title, {001,002,003,004,005,007,009} hasName foundingYear diedOnDate N1 N9 {006,008} 002 Washington DC hasName, foundingYear {001,005,007,009} 003 Hodgenville KY hasName gender foundingYear Dimension N2 N8 004 United States hasName, {002,004} List DL foundingYear {001,005,009} hasName:7 005 Reese Wither- hasName, gender, bonOnDate title foundingYear:4 N3 N7 spoon bornOnDate, title {007} gender:4 006 New Orleans foundingYear {001,005} 007 Franklin Roo- hasName, gender, title diedOnDate bornOnDate:3 sevelt title N4 N6 title:3 {009} 008 Hyde Park NY foundingYear {001} diedOnDate:2 009 Marilyn Monroe hasName, diedOnDate gender, bornOnDate, N5 diedOnDate M(N7 ) hasName gender title L Franklin D. Male President {007} RooseveltUPenn/2012-12-04 40
  • 78. T-index Construction Transaction Database D Vertex Entity vertex Adjacent Attributes ID 001 Abraham Lincoln hasName, gender, bornOnDate,title, diedOnDate 002 Washington DC hasName, foundingYear 003 Hodgenville KY hasName 004 United States hasName, foundingYear 005 Reese Wither- spoon hasName, gender, bornOnDate, title 006 New Orleans foundingYear 007 Franklin Roo- hasName, gender, sevelt title 008 Hyde Park NY foundingYear 009 Marilyn Monroe hasName, gender, bornOnDate, diedOnDateUPenn/2012-12-04 41
  • 79. T-index Construction Transaction Database D Vertex Entity vertex Adjacent Attributes ID 001 Abraham Lincoln root hasName, gender, N0 bornOnDate,title, diedOnDate {001} hasName 002 Washington DC N1 hasName, foundingYear {001} 003 Hodgenville KY gender hasName N2 004 United States hasName, {001} foundingYear bonOnDate N3 005 Reese Wither- spoon hasName, gender, bornOnDate, title {001} title 006 New Orleans foundingYear N4 007 Franklin Roo- hasName, gender, sevelt title {001} diedOnDate 008 Hyde Park NY foundingYear N5 009 Marilyn Monroe hasName, gender, bornOnDate, diedOnDateUPenn/2012-12-04 41
  • 80. T-index Construction Transaction Database D Vertex Entity vertex Adjacent Attributes ID 001 Abraham Lincoln root hasName, gender, N0 bornOnDate,title, diedOnDate {001,002} hasName 002 Washington DC N1 hasName, foundingYear {001} 003 Hodgenville KY gender foundingYear hasName N2 N8 {002} 004 United States hasName, {001} foundingYear bonOnDate N3 005 Reese Wither- spoon hasName, gender, bornOnDate, title {001} title 006 New Orleans foundingYear N4 007 Franklin Roo- hasName, gender, sevelt title {001} diedOnDate 008 Hyde Park NY foundingYear N5 009 Marilyn Monroe hasName, gender, bornOnDate, diedOnDateUPenn/2012-12-04 41
  • 81. T-index Construction Transaction Database D Vertex Entity vertex Adjacent Attributes ID 001 Abraham Lincoln root hasName, gender, N0 bornOnDate,title, diedOnDate {001,002,003} hasName 002 Washington DC N1 hasName, foundingYear {001} 003 Hodgenville KY gender foundingYear hasName N2 N8 {002} 004 United States hasName, {001} foundingYear bonOnDate N3 005 Reese Wither- spoon hasName, gender, bornOnDate, title {001} title 006 New Orleans foundingYear N4 007 Franklin Roo- hasName, gender, sevelt title {001} diedOnDate 008 Hyde Park NY foundingYear N5 009 Marilyn Monroe hasName, gender, bornOnDate, diedOnDateUPenn/2012-12-04 41
  • 82. T-index Construction Transaction Database D Vertex Entity vertex Adjacent Attributes ID 001 Abraham Lincoln root hasName, gender, N0 bornOnDate,title, diedOnDate {001,002,003,004} hasName 002 Washington DC N1 hasName, foundingYear {001} 003 Hodgenville KY gender foundingYear hasName N2 N8 {002,004} 004 United States hasName, {001} foundingYear bonOnDate N3 005 Reese Wither- spoon hasName, gender, bornOnDate, title {001} title 006 New Orleans foundingYear N4 007 Franklin Roo- hasName, gender, sevelt title {001} diedOnDate 008 Hyde Park NY foundingYear N5 009 Marilyn Monroe hasName, gender, bornOnDate, diedOnDateUPenn/2012-12-04 41
  • 83. T-index Construction Transaction Database D Vertex Entity vertex Adjacent Attributes ID 001 Abraham Lincoln root hasName, gender, N0 bornOnDate,title, diedOnDate {001,002,003,004,005} hasName 002 Washington DC N1 hasName, foundingYear {001,005} 003 Hodgenville KY gender foundingYear hasName N2 N8 {002,004} 004 United States hasName, {001,005} foundingYear bonOnDate N3 005 Reese Wither- spoon hasName, gender, bornOnDate, title {001,005} title 006 New Orleans foundingYear N4 007 Franklin Roo- hasName, gender, sevelt title {001} diedOnDate 008 Hyde Park NY foundingYear N5 009 Marilyn Monroe hasName, gender, bornOnDate, diedOnDateUPenn/2012-12-04 41
  • 84. T-index Construction Transaction Database D Vertex Entity vertex Adjacent Attributes ID 001 Abraham Lincoln root hasName, gender, N0 bornOnDate,title, diedOnDate {001,002,003,004,005} hasName foundingYear 002 Washington DC N1 N9 hasName, {006} foundingYear {001,005} 003 Hodgenville KY gender foundingYear hasName N2 N8 {002,004} 004 United States hasName, {001,005} foundingYear bonOnDate N3 005 Reese Wither- spoon hasName, gender, bornOnDate, title {001,005} title 006 New Orleans foundingYear N4 007 Franklin Roo- hasName, gender, sevelt title {001} diedOnDate 008 Hyde Park NY foundingYear N5 009 Marilyn Monroe hasName, gender, bornOnDate, diedOnDateUPenn/2012-12-04 41
  • 85. T-index Construction Transaction Database D Vertex Entity vertex Adjacent Attributes ID 001 Abraham Lincoln root hasName, gender, N0 bornOnDate,title, diedOnDate {001,002,003,004,005,007,009} hasName foundingYear 002 Washington DC N1 N9 hasName, {006,008} foundingYear {001,005,007,009} 003 Hodgenville KY gender foundingYear hasName N2 N8 {002,004} 004 United States hasName, {001,005,009} foundingYear bonOnDate title N3 N7 005 Reese Wither- {007} spoon hasName, gender, bornOnDate, title {001,005} title diedOnDate 006 New Orleans foundingYear N4 N6 {009} 007 Franklin Roo- hasName, gender, sevelt title {001} diedOnDate 008 Hyde Park NY foundingYear N5 009 Marilyn Monroe hasName, gender, bornOnDate, diedOnDateUPenn/2012-12-04 41
  • 86. T-index Construction M(N4 ) Transaction Database D hasName gender bornOnDate title L Abraham Male 1865-04-15 President {001} Lincoln Vertex Entity vertex Adjacent Attributes Reese With- Female 1976-03-22 Actress {005} ID erspoon 001 Abraham Lincoln root hasName, gender, N0 bornOnDate,title, diedOnDate {001,002,003,004,005,007,009} hasName foundingYear 002 Washington DC N1 N9 hasName, {006,008} foundingYear {001,005,007,009} 003 Hodgenville KY gender foundingYear hasName N2 N8 {002,004} 004 United States hasName, {001,005,009} foundingYear bonOnDate title N3 N7 005 Reese Wither- {007} spoon hasName, gender, bornOnDate, title {001,005} title diedOnDate 006 New Orleans foundingYear N4 N6 {009} 007 Franklin Roo- hasName, gender, sevelt title {001} diedOnDate 008 Hyde Park NY foundingYear N5 009 Marilyn Monroe hasName, gender, bornOnDate, diedOnDate M(N7 ) hasName gender title L Franklin D. Male President {007} RooseveltUPenn/2012-12-04 41
  • 87. T-index Construction M(N4 ) Transaction Database D hasName gender bornOnDate title L Abraham Male 1865-04-15 President {001} Lincoln Vertex Entity vertex Adjacent Attributes Reese With- Female 1976-03-22 Actress {005} ID erspoon 001 Abraham Lincoln root hasName, gender, N0 bornOnDate,title, diedOnDate {001,002,003,004,005,007,009} hasName foundingYear 002 Washington DC N1 N9 hasName, {006,008} foundingYear {001,005,007,009} 003 Hodgenville KY gender foundingYear hasName N2 N8 Dimension {002,004} 004 United States List DL hasName, {001,005,009} hasName:7 foundingYear bonOnDate title N3 N7 foundingYear:4 005 Reese Wither- {007} hasName, gender, gender:4 spoon {001,005} bornOnDate, title bornOnDate:3 title diedOnDate 006 New Orleans foundingYear N4 N6 title:3 {009} 007 Franklin Roo- hasName, gender, diedOnDate:2 sevelt title {001} diedOnDate 008 Hyde Park NY foundingYear N5 009 Marilyn Monroe hasName, gender, bornOnDate, diedOnDate M(N7 ) hasName gender title L Franklin D. Male President {007} RooseveltUPenn/2012-12-04 41
  • 88. Star Aggregate (SA) Query Processing ?m Given a SA Query v1 S(v , p1 , . . . , pn ): gender titleUPenn/2012-12-04 42
  • 89. Star Aggregate (SA) Query Processing ?m Given a SA Query v1 S(v , p1 , . . . , pn ): gender title 1. Find all matches of {N1 , N2 , N3 , N4 } {N1 , N2 , N7 } (p1 , . . . , pn ) in T-indexUPenn/2012-12-04 42
  • 90. Star Aggregate (SA) Query Processing ?m Given a SA Query v1 S(v , p1 , . . . , pn ): gender title 1. Find all matches of {N1 , N2 , N3 , N4 } {N1 , N2 , N7 } (p1 , . . . , pn ) in M(N4 ) T-index hasName gender bornOnDate title L 2. For each match Abraham Male 1865-04-15 President {001} Lincoln 2.1 Let Ni be the Reese Female 1976-03-22 Actress {005} Witherspoon lowest node in the match 2.2 Let M(Ni ) be Ni ’s M (N4 ) M (N7 ) gender title L aggregate set Male President {001} gender title L Male President {007} 2.3 M (Ni ) = Female Actress {005} Π(P1 ,...,Pn ) M(Ni ) M(N7 ) hasName gender title L Franklin D. Male President {007} RooseveltUPenn/2012-12-04 42
  • 91. Star Aggregate (SA) Query Processing ?m Given a SA Query v1 S(v , p1 , . . . , pn ): gender title 1. Find all matches of (p1 , . . . , pn ) in T-index M (N4 ) M (N7 ) 2. For each match gender title L gender title L Male President {001} 2.1 Let Ni be the Female Actress {005} Male President {007} lowest node in the match 2.2 Let M(Ni ) be Ni ’s aggregate set 2.3 M (Ni ) = ∪ Π(P1 ,...,Pn ) M(Ni ) gender title L COUNT Male President {001,007} 2 3. M = ∪i M (Ni ) Female Actress {005} 1UPenn/2012-12-04 42
  • 92. General Aggregate (GA) Query Processing Q 1. Decompose aggregate ?m v1 bornIn v2 foundingYear query Q into star gender title queries {S1 , . . . , Sn } ?m foundingYear v1 v2 gender title S1 S2UPenn/2012-12-04 43
  • 93. General Aggregate (GA) Query Processing Q 1. Decompose aggregate ?m v1 bornIn v2 foundingYear query Q into star gender title queries {S1 , . . . , Sn } ?m foundingYear v1 v2 gender title S1 S2 2. Evaluate them to get R(S1 ) R(S2 ) foundingYear L R(S1 ), . . . , R(Sn ) gender title L 1790 {002} Male President {001,007} 1718 {006} Female Actress {005} 1810 {008} 1776 {004}UPenn/2012-12-04 43
  • 94. General Aggregate (GA) Query Processing Q 1. Decompose aggregate ?m v1 bornIn v2 foundingYear query Q into star gender title queries {S1 , . . . , Sn } ?m foundingYear v1 v2 gender title S1 S2 2. Evaluate them to get R(S1 ) R(S2 ) foundingYear L R(S1 ), . . . , R(Sn ) gender title L 1790 {002} Male President {001,007} 1718 {006} Female Actress {005} 1810 {008} 1776 {004} 3. Generate vertex lists L1 = {001, 005, 007} L2 = {002, 004, 006, 008} (L1 , . . . , Ln ) for each R(S1 ), . . . , R(Sn )UPenn/2012-12-04 43
  • 95. GA Query Processing (cont’d) L1 = {001, 005, 007} L2 = {002, 004, 006, 008}UPenn/2012-12-04 44
  • 96. GA Query Processing (cont’d) L1 = {001, 005, 007} L2 = {002, 004, 006, 008} 4. Use VS*-tree to find bornIn u1 u2 all subgraph matches of (L1 , . . . , Ln ) – u1 u2 result is join 005 006 according to link 007 008 attributeUPenn/2012-12-04 44
  • 97. GA Query Processing (cont’d) L1 = {001, 005, 007} L2 = {002, 004, 006, 008} 4. Use VS*-tree to find bornIn u1 u2 all subgraph matches of (L1 , . . . , Ln ) – u1 u2 result is join 005 006 according to link 007 008 attribute 5. Compute the final Matches gender title foundingYear u1 u2 COUNT aggregate result (we Male President 1718 005 006 1 Female Actress 1810 007 008 1 have optimizations here)UPenn/2012-12-04 44
  • 98. Performance Comparison Yago network (20 million triples & size 3.1GB) Query Response Time (in ms) 8,000 6,000 4,000 2,000 0 SA1 SA2 SA3 GA1 GA2 GA3 Queries gStore CAA VirtuosoUPenn/2012-12-04 45
  • 99. Outline 1 RDF Introduction 2 Answering Regular SPARQL Queries Answering SPARQL queries using graph pattern matching ¨ [ZouMoZhaoChenOzsu, PVLDB 2011] 3 Aggregation Answering Aggregate SPARQL Queries Over Large RDF Repositories 4 Future ResearchUPenn/2012-12-04 47
  • 100. Systematic Study of SPARQL queryprocessing/optimization (G¨nes Alu¸) u c Representation as graphs Systematic study of dynamic RDF data management Continuously and seamlessly self-tune their configurations Systematic study of distributed RDF data management Utilization of view materialization to improve performance exploiting graph structure EndUPenn/2012-12-04 48
  • 101. Efficient SPARQL Query Execution over Dynamic RDFDatabases Problem: Existing RDF management systems are designed for static datasets and workloads, whereas the RDF data on the Web require dynamic solutions. The underlying physical database layout in existing RDF management systems is fixed a-priori. Hence, existing systems portray suboptimal performance when data are updated or the characteristics of the workloads change. Objective: To design techniques (e.g., data structures, indexes, algorithms, etc.) that enable RDF management systems to continuously and seamlessly self-tune their configurations to maintain an optimal performance.UPenn/2012-12-04 49
  • 102. Efficient SPARQL Query Execution over Distributed RDFDatabases Problem: RDF is inherently distributed across the Web, where query execution can be much more expensive than in a centralized setting. Objective: To reduce the cost of query execution in a distributed RDF database through view materialization. More specifically, to design efficient view selection algorithms (w.r.t. space/time) that potentially exploit the graph-theoretic properties of RDF.UPenn/2012-12-04 50
  • 103. Thank you!UPenn/2012-12-04 51
  • 104. Encoding Technique vLabel adjList y:Abraham Lincoln (hasName, “Abraham Lincoln”), (BornOnDate, “1809-02-12”), (DiedOnDate,“1865-04-15”), (DiedIn, y:Washington DC) (hasName, “Abraham Lincoln”) 0010 0000 0000UPenn/2012-12-04 52
  • 105. Encoding Technique vLabel adjList y:Abraham Lincoln (hasName, “Abraham Lincoln”), (BornOnDate, “1809-02-12”), (DiedOnDate,“1865-04-15”), (DiedIn, y:Washington DC) “Abr” “bra” (hasName, “Abraham Lincoln”) 0010 0000 0000 “rah” ...UPenn/2012-12-04 52
  • 106. Encoding Technique vLabel adjList y:Abraham Lincoln (hasName, “Abraham Lincoln”), (BornOnDate, “1809-02-12”), (DiedOnDate,“1865-04-15”), (DiedIn, y:Washington DC) “Abr” 0000 0010 0000 0000 “bra” 1000 0000 0000 0000 (hasName, “Abraham Lincoln”) 0010 0000 0000 “rah” 0000 0000 0100 0000 ... ...UPenn/2012-12-04 52
  • 107. Encoding Technique vLabel adjList y:Abraham Lincoln (hasName, “Abraham Lincoln”), (BornOnDate, “1809-02-12”), (DiedOnDate,“1865-04-15”), (DiedIn, y:Washington DC) “Abr” 0000 0010 0000 0000 “bra” 1000 0000 0000 0000 (hasName, “Abraham Lincoln”) 0010 0000 0000 “rah” 0000 0000 0100 0000 ... ... OR 1000 0010 0100 0000UPenn/2012-12-04 52
  • 108. Encoding Technique vLabel adjList y:Abraham Lincoln (hasName, “Abraham Lincoln”), (BornOnDate, “1809-02-12”), (DiedOnDate,“1865-04-15”), (DiedIn, y:Washington DC) (hasName, “Abraham Lincoln”) 0010 0000 0000 1000 0010 0100 0000 OR 1000 0010 0100 0000UPenn/2012-12-04 52
  • 109. Encoding Technique vLabel adjList y:Abraham Lincoln (hasName, “Abraham Lincoln”), (BornOnDate, “1809-02-12”), (DiedOnDate,“1865-04-15”), (DiedIn, y:Washington DC) (hasName, “Abraham Lincoln”) 0010 0000 0000 1000 0010 0100 0000 (BornOnDate, “1908-02-12”) 0100 0000 0000 0100 0010 0100 1000 (DiedOnDate, “1965-04-15”) 0000 1000 0000 0000 0010 0100 0000 (DiedIn, y:Washington DC) 0000 0010 0000 1000 0010 0100 0001UPenn/2012-12-04 52
  • 110. Encoding Technique vLabel adjList y:Abraham Lincoln (hasName, “Abraham Lincoln”), (BornOnDate, “1809-02-12”), (DiedOnDate,“1865-04-15”), (DiedIn, y:Washington DC) (hasName, “Abraham Lincoln”) 0010 0000 0000 1000 0010 0100 0000 (BornOnDate, “1908-02-12”) 0100 0000 0000 0100 0010 0100 1000 (DiedOnDate, “1965-04-15”) 0000 1000 0000 0000 0010 0100 0000 OR (DiedIn, y:Washington DC) 0000 0010 0000 1100 0010 0100 1001 0000 0010 0000 1000 0010 0100 0001UPenn/2012-12-04 52
  • 111. T-index Maintenance Updates modeled as triple deletion + insertion - Consider (s, p, o) DL’s order does not change – s not in the RDF data Insert new transaction into DL Insert the new transaction into one path of T-index Update materialized sets along the path M(N2 ) Transaction Database D salary gender title L $100,000 Male Professor {002} Vertex Entity ver- Adjacency Attributes $50,000 Male Assistant Professor {001} ID tex $40,000 Male Assistant Professor {003} $50,000 Female Assistant Professor {004} 001 Person1 salary, gender, title Dimension 002 Person2 salary, gender, title List DL 003 Person3 salary, gender, title salary 004 Person4 salary, gender, null title, nationality {010,011} gender 005 Person5 salary, gender, {001,002,003,004,005,006} {007,008,009} locatedIn title nationality salary year N0 N7 N10 nationality 006 Person6 salary, title, year nationality {001,002,003,004,005} {007,008,009} 007 Paper1 year, inProceedings, gender {006} inProceedings inProceedings title N1 N5 N8 N11 paperTitle paperTitle 008 Paper2 year, inProceedings, name locatedIn paperTitle {001,002,003,004} {010,011} title name 009 Paper3 year, inProceedings, N2 N4 N6 paperTitle N9 {004} nationality nationality paperTitle 010 School1 locatedIn, name {005} {006} nationality {007,008,009} 011 School2 locatedIn, name N3 M(N3 ) salary gender title nationality L $50,000 Female Assistant Professor Chinese {004}UPenn/2012-12-04 53
  • 112. T-index Maintenance Updates modeled as triple deletion + insertion - Consider (s, p, o) DL’s order does not change – s is in the RDF data; pi > p > pi+1 Find Nn with (p1 , . . . , pi , . . . , pn ) and Ni with (p1 , . . . , pi ) Remove s from path Ni+1 − Nn and update M(Nj ), i + 1 ≤ j ≤ n Insert (p, pi+1 , . . . , pn ) at Ni ; update M(Nj ), null ≤ j ≤ i M(N2 ) Transaction Database D salary gender title L $100,000 Male Professor {002} Vertex Entity ver- Adjacency Attributes $50,000 Male Assistant Professor {001} ID tex $40,000 Male Assistant Professor {003} $50,000 Female Assistant Professor {004} Dimension 001 Person1 salary, gender, title 002 Person2 salary, gender, title List DL 003 Person3 salary, gender, title salary 004 Person4 salary, gender, null title, nationality {010,011} gender 005 Person5 salary, gender, {001,002,003,004,005,006} {007,008,009} locatedIn title nationality salary year N0 N7 N10 nationality 006 Person6 salary, title, year nationality {001,002,003,004,005} {007,008,009} 007 Paper1 year, inProceedings, gender {006} inProceedings inProceedings title N1 N5 N8 N11 paperTitle paperTitle 008 Paper2 year, inProceedings, name locatedIn paperTitle {001,002,003,004} {010,011} title name 009 Paper3 year, inProceedings, N2 N4 N6 paperTitle N9 {004} nationality nationality paperTitle 010 School1 locatedIn, name {005} {006} nationality {007,008,009} 011 School2 locatedIn, name N3 M(N3 ) salary gender title nationality L $50,000 Female Assistant Professor Chinese {004}UPenn/2012-12-04 53
  • 113. T-index Maintenance Updates modeled as triple deletion + insertion - Consider (s, p, o) DL’s order does not change – s is in the RDF data; pi > p > pi+1 Find Nn with (p1 , . . . , pi , . . . , pn ) and Ni with (p1 , . . . , pi ) Remove s from path Ni+1 − Nn and update M(Nj ), i + 1 ≤ j ≤ n Insert (p, pi+1 , . . . , pn ) at Ni ; update M(Nj ), null ≤ j ≤ i M(N2 ) Insert (Person6, gender, “Male”) Transaction Database D salary gender title L $100,000 Male Professor {002} Vertex Entity ver- Adjacency Attributes $50,000 Male Assistant Professor {001} ID tex $40,000 Male Assistant Professor {003} $50,000 Female Assistant Professor {004} Dimension 001 Person1 salary, gender, title 002 Person2 salary, gender, title List DL 003 Person3 salary, gender, title salary 004 Person4 salary, gender, null title, nationality {010,011} gender 005 Person5 salary, gender, {001,002,003,004,005,006} {007,008,009} locatedIn title nationality salary year N0 N7 N10 nationality 006 Person6 salary, gender, title, year nationality {001,002,003,004,005} {007,008,009} 007 Paper1 year, inProceedings, gender {006} inProceedings inProceedings title N1 N5 N8 N11 paperTitle paperTitle 008 Paper2 year, inProceedings, name locatedIn paperTitle {001,002,003,004} {010,011} title name 009 Paper3 year, inProceedings, N2 N4 N6 paperTitle N9 {004} nationality nationality paperTitle 010 School1 locatedIn, name {005} {006} nationality {007,008,009} 011 School2 locatedIn, name N3 M(N3 ) salary gender title nationality L $50,000 Female Assistant Professor Chinese {004}UPenn/2012-12-04 53
  • 114. T-index Maintenance Updates modeled as triple deletion + insertion - Consider (s, p, o) DL’s order does not change – s is in the RDF data; pi > p > pi+1 Find Nn with (p1 , . . . , pi , . . . , pn ) and Ni with (p1 , . . . , pi ) Remove s from path Ni+1 − Nn and update M(Nj ), i + 1 ≤ j ≤ n Insert (p, pi+1 , . . . , pn ) at Ni ; update M(Nj ), null ≤ j ≤ i M(N2 ) Insert (Person6, gender, “Male”) Transaction Database D salary gender title L $100,000 Male Professor {002} Vertex Entity ver- Adjacency Attributes $50,000 Male Assistant Professor {001} ID tex $40,000 Male Assistant Professor {003} $50,000 Female Assistant Professor {004} Dimension 001 Person1 salary, gender, title 002 Person2 salary, gender, title List DL 003 Person3 salary, gender, title salary 004 Person4 salary, gender, null title, nationality {010,011} gender 005 Person5 salary, gender, {001,002,003,004,005,006} {007,008,009} locatedIn title nationality salary year N0 N7 N10 nationality 006 Person6 salary, gender, title, year nationality {001,002,003,004,005} {007,008,009} 007 Paper1 year, inProceedings, gender inProceedings inProceedings N1 N8 N11 paperTitle paperTitle 008 Paper2 year, inProceedings, name locatedIn paperTitle {001,002,003,004} {010,011} title name 009 Paper3 year, inProceedings, N2 N4 paperTitle N9 {004} nationality paperTitle 010 School1 locatedIn, name {005} nationality {007,008,009} 011 School2 locatedIn, name N3 M(N3 ) salary gender title nationality L $50,000 Female Assistant Professor Chinese {004}UPenn/2012-12-04 53
  • 115. T-index Maintenance Updates modeled as triple deletion + insertion - Consider (s, p, o) DL’s order does not change – s is in the RDF data; pi > p > pi+1 Find Nn with (p1 , . . . , pi , . . . , pn ) and Ni with (p1 , . . . , pi ) Remove s from path Ni+1 − Nn and update M(Nj ), i + 1 ≤ j ≤ n Insert (p, pi+1 , . . . , pn ) at Ni ; update M(Nj ), null ≤ j ≤ i M(N2 ) Transaction Database D salary gender title L Insert (Person6, gender, “Male”) $100,000 Male Professor {002} Vertex Entity ver- Adjacency Attributes $50,000 Male Assistant Professor {001} $40,000 Male Assistant Professor {003} ID tex $50,000 Female Assistant Professor {004} 001 Person1 salary, gender, title $50,000 Male Associate Professor {006} Dimension 002 Person2 salary, gender, title List DL 003 Person3 salary, gender, title salary 004 Person4 salary, gender, null gender title, nationality {010,011} 005 Person5 salary, gender, locatedIn title {001,002,003,004,005,006} {007,008,009} nationality salary year nationality 006 Person6 salary, gender, title, N0 N7 N10 year nationality 007 Paper1 year, inProceedings, {001,002,003,004,005,006} {007,008,009} inProceedings gender inProceedings paperTitle N1 N8 N11 paperTitle 008 Paper2 year, inProceedings, locatedIn paperTitle name {001,002,003,004,006} {010,011} name 009 Paper3 year, inProceedings, title N2 N4 paperTitle N9 010 School1 locatedIn, name {004,006} nationality paperTitle nationality {005} {007,008,009} 011 School2 locatedIn, name N3 M(N3 ) salary gender title nationality L $50,000 Female Assistant Professor Chinese {004}UPenn/2012-12-04 53
  • 116. T-index Maintenance Updates modeled as triple deletion + insertion - Consider (s, p, o) DL’s order does change This causes swapping of pi and pi+1 in DL First update as if the order does not change Then update DL, T-index and materialized sets M(N2 ) Transaction Database D salary gender title L $100,000 Male Professor {002} Vertex Entity ver- Adjacency Attributes $50,000 Male Assistant Professor {001} ID tex $40,000 Male Assistant Professor {003} $50,000 Female Assistant Professor {004} 001 Person1 salary, gender, title Dimension 002 Person2 salary, gender, title List DL 003 Person3 salary, gender, title salary 004 Person4 salary, gender, null title, nationality {010,011} gender 005 Person5 salary, gender, {001,002,003,004,005,006} {007,008,009} locatedIn title nationality salary year N0 N7 N10 nationality 006 Person6 salary, title, year nationality {001,002,003,004,005} {007,008,009} 007 Paper1 year, inProceedings, gender {006} inProceedings inProceedings title N1 N5 N8 N11 paperTitle paperTitle 008 Paper2 year, inProceedings, name locatedIn paperTitle {001,002,003,004} {010,011} title name 009 Paper3 year, inProceedings, N2 N4 N6 paperTitle N9 {004} nationality nationality paperTitle 010 School1 locatedIn, name {005} {006} nationality {007,008,009} 011 School2 locatedIn, name N3 M(N3 ) salary gender title nationality L $50,000 Female Assistant Professor Chinese {004}UPenn/2012-12-04 53
  • 117. T-index Maintenance Updates modeled as triple deletion + insertion - Consider (s, p, o) DL’s order does change This causes swapping of pi and pi+1 in DL First update as if the order does not change Then update DL, T-index and materialized sets M(N2 ) Delete (Person2, gender, “Male”) Transaction Database D salary gender title L Vertex Entity ver- Adjacency Attributes $100,000 $50,000 Male Male Professor Assistant Professor {002} {001} swap gender (pi )and title (pi+1 ) ID tex $40,000 Male Assistant Professor {003} $50,000 Female Assistant Professor {004} 001 Person1 salary, gender, title Dimension 002 Person2 salary, gender, title List DL 003 Person3 salary, gender, title salary 004 Person4 salary, gender, null title, nationality {010,011} gender 005 Person5 salary, gender, {001,002,003,004,005,006} {007,008,009} locatedIn title nationality salary year N0 N7 N10 nationality 006 Person6 salary, title, year nationality {001,002,003,004,005} {007,008,009} 007 Paper1 year, inProceedings, gender {006} inProceedings inProceedings title N1 N5 N8 N11 paperTitle paperTitle 008 Paper2 year, inProceedings, name locatedIn paperTitle {001,002,003,004} {010,011} title name 009 Paper3 year, inProceedings, N2 N4 N6 paperTitle N9 {004} nationality nationality paperTitle 010 School1 locatedIn, name {005} {006} nationality {007,008,009} 011 School2 locatedIn, name N3 M(N3 ) salary gender title nationality L $50,000 Female Assistant Professor Chinese {004}UPenn/2012-12-04 53
  • 118. T-index Maintenance Updates modeled as triple deletion + insertion - Consider (s, p, o) DL’s order does change This causes swapping of pi and pi+1 in DL First update as if the order does not change Then update DL, T-index and materialized sets M(N2 ) Delete (Person2, gender, “Male”) Transaction Database D salary gender title L Vertex Entity ver- Adjacency Attributes $100,000 $50,000 Male Male Professor Assistant Professor {002} {001} Delete 002 from path null − N3 ID tex $40,000 Male Assistant Professor {003} $50,000 Female Assistant Professor {004} 001 Person1 salary, gender, title Dimension 002 Person2 salary, gender, title List DL 003 Person3 salary, gender, title salary 004 Person4 salary, gender, null title, nationality {010,011} gender 005 Person5 salary, gender, {001,002,003,004,005,006} {007,008,009} locatedIn title nationality salary year N0 N7 N10 nationality 006 Person6 salary, title, year nationality {001,003,004,005} {007,008,009} 007 Paper1 year, inProceedings, gender {002,006} inProceedings inProceedings title N1 N5 N8 N11 paperTitle paperTitle 008 Paper2 year, inProceedings, name locatedIn paperTitle {001,003,004} {010,011} title name 009 Paper3 year, inProceedings, N2 N4 N6 paperTitle N9 {004} nationality nationality paperTitle 010 School1 locatedIn, name {005} {006} nationality {007,008,009} 011 School2 locatedIn, name N3 M(N3 ) salary gender title nationality L $50,000 Female Assistant Professor Chinese {004}UPenn/2012-12-04 53
  • 119. T-index Maintenance Updates modeled as triple deletion + insertion - Consider (s, p, o) DL’s order does change This causes swapping of pi and pi+1 in DL First update as if the order does not change Then update DL, T-index and materialized sets M(N2 ) Delete (Person2, gender, “Male”) Transaction Database D salary gender title L Vertex Entity ver- Adjacency Attributes $100,000 $50,000 Male Male Professor Assistant Professor {002} {001} Insert (Person, salary, title) into ID tex $40,000 $50,000 Male Female Assistant Professor Assistant Professor {003} {004} T-index 001 Person1 salary, gender, title Dimension 002 Person2 salary, gender, title List DL 003 Person3 salary, gender, title salary 004 Person4 salary, gender, null title, nationality {010,011} gender 005 Person5 salary, gender, {001,002,003,004,005,006} {007,008,009} locatedIn title nationality salary year N0 N7 N10 nationality 006 Person6 salary, title, year nationality {001,003,004,005} {007,008,009} 007 Paper1 year, inProceedings, gender {002,006} inProceedings inProceedings title N1 N5 N8 N11 paperTitle paperTitle 008 Paper2 year, inProceedings, name locatedIn paperTitle {001,003,004} {010,011} title name 009 Paper3 year, inProceedings, N2 N4 N6 paperTitle N9 {004} nationality nationality paperTitle 010 School1 locatedIn, name {005} {006} nationality {007,008,009} 011 School2 locatedIn, name N3 M(N3 ) salary gender title nationality L $50,000 Female Assistant Professor Chinese {004}UPenn/2012-12-04 53
  • 120. T-index Maintenance Updates modeled as triple deletion + insertion - Consider (s, p, o) DL’s order does change This causes swapping of pi and pi+1 in DL First update as if the order does not change Then update DL, T-index and materialized sets Delete (Person2, gender, “Male”) Transaction Database D Vertex Entity ver- Adjacency Attributes Swap “gender” and “title” ID tex 001 Person1 salary, gender, title Dimension 002 Person2 salary, gender, title List DL 003 Person3 salary, gender, title salary 004 Person4 salary, gender, null title, nationality {010,011} title locatedIn gender 005 Person5 salary, gender, {001,002,003,004,005,006} {007,008,009} nationality salary year N0 N7 N10 nationality 006 Person6 salary, title, year nationality {007,008,009} 007 Paper1 year, inProceedings, {001,003,004} title {002,006} title inProceedings inProceedings N1 N1 N5 N8 N11 paperTitle paperTitle gender 008 Paper2 year, inProceedings, {005} name locatedIn {001,003,004} paperTitle gender {010,011} name 009 Paper3 year, inProceedings, N2 N4 N6 paperTitle N9 nationality nationality paperTitle 010 School1 locatedIn, name {004} nationality {005} {006} {007,008,009} 011 School2 locatedIn, name N3UPenn/2012-12-04 53
  • 121. T-index Maintenance Updates modeled as triple deletion + insertion - Consider (s, p, o) DL’s order does change This causes swapping of pi and pi+1 in DL First update as if the order does not change Then update DL, T-index and materialized sets Delete (Person2, gender, “Male”) Transaction Database D Vertex Entity ver- Adjacency Attributes Swap “gender” and “title” ID tex 001 Person1 salary, gender, title Dimension 002 Person2 salary, gender, title List DL 003 Person3 salary, gender, title salary 004 Person4 salary, gender, null title, nationality {010,011} title locatedIn gender 005 Person5 salary, gender, {001,002,003,004,005,006} {007,008,009} nationality salary year N0 N7 N10 nationality 006 Person6 salary, title, year nationality {007,008,009} 007 Paper1 year, inProceedings, {001,003,004,006} title inProceedings inProceedings N1 N1 N8 N11 paperTitle paperTitle gender 008 Paper2 year, inProceedings, {005} name locatedIn {001,003,004} paperTitle gender {010,011} name 009 Paper3 year, inProceedings, N2 N6 N4 paperTitle N9 nationality nationality paperTitle 010 School1 locatedIn, name {004} nationality {006} {005} {007,008,009} 011 School2 locatedIn, name N3UPenn/2012-12-04 53
  • 122. T-index Maintenance Updates modeled as triple deletion + insertion - Consider (s, p, o) DL’s order does change This causes swapping of pi and pi+1 in DL First update as if the order does not change Then update DL, T-index and materialized sets Delete (Person2, gender, “Male”) Transaction Database D Vertex Entity ver- Adjacency Attributes Update materialized sets ID tex 001 Person1 salary, gender, title Dimension 002 Person2 salary, gender, title List DL 003 Person3 salary, gender, title salary 004 Person4 salary, gender, null title, nationality {010,011} title locatedIn gender 005 Person5 salary, gender, {001,002,003,004,005,006} {007,008,009} nationality salary year N0 N7 N10 nationality 006 Person6 salary, title, year nationality {007,008,009} 007 Paper1 year, inProceedings, {001,003,004,006} title inProceedings inProceedings N1 N1 N8 N11 paperTitle paperTitle gender 008 Paper2 year, inProceedings, {005} name locatedIn {001,003,004} paperTitle gender {010,011} name 009 Paper3 year, inProceedings, N2 N6 N4 paperTitle N9 nationality nationality paperTitle 010 School1 locatedIn, name {004} nationality {006} {005} {007,008,009} 011 School2 locatedIn, name N3UPenn/2012-12-04 53