Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Graph lecture

2,344 views

Published on

A talk that I gave at the Croucher Advanced Study Institute on Frontiers in Big Data Graph Research in Hong Kong, December 2015

Published in: Data & Analytics
  • Hello! Get Your Professional Job-Winning Resume Here - Check our website! https://vk.cc/818RFv
       Reply 
    Are you sure you want to  Yes  No
    Your message goes here

Graph lecture

  1. 1. An Overview of Graph Data Management and Analysis M. Tamer ¨Ozsu University of Waterloo David R. Cheriton School of Computer Science © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 1 / 96
  2. 2. Graph Data are Very Common Internet © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 2 / 96
  3. 3. Graph Data are Very Common Social networks © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 2 / 96
  4. 4. Graph Data are Very Common Trade volumes and connections © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 2 / 96
  5. 5. Graph Data are Very Common Biological networks © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 2 / 96
  6. 6. Graph Data are Very Common As of September 2011 Music Brainz (zitgist) P20 Turismo de Zaragoza yovisto Yahoo! Geo Planet YAGO World Fact- book El Viajero Tourism WordNet (W3C) WordNet (VUA) VIVO UF VIVO Indiana VIVO Cornell VIAF URI Burner Sussex Reading Lists Plymouth Reading Lists UniRef UniProt UMBEL UK Post- codes legislation data.gov.uk Uberblic UB Mann- heim TWC LOGD Twarql transport data.gov. uk Traffic Scotland theses. fr Thesau- rus W totl.net Tele- graphis TCM Gene DIT Taxon Concept Open Library (Talis) tags2con delicious t4gm info Swedish Open Cultural Heritage Surge Radio Sudoc STW RAMEAU SH statistics data.gov. uk St. Andrews Resource Lists ECS South- ampton EPrints SSW Thesaur us Smart Link Slideshare 2RDF semantic web.org Semantic Tweet Semantic XBRL SW Dog Food Source Code Ecosystem Linked Data US SEC (rdfabout) Sears Scotland Geo- graphy Scotland Pupils & Exams Scholaro- meter WordNet (RKB Explorer) Wiki UN/ LOCODE Ulm ECS (RKB Explorer) Roma RISKS RESEX RAE2001 Pisa OS OAI NSF New- castle LAAS KISTI JISC IRIT IEEE IBM Eurécom ERA ePrints dotAC DEPLOY DBLP (RKB Explorer) Crime Reports UK Course- ware CORDIS (RKB Explorer) CiteSeer Budapest ACM riese Revyu research data.gov. ukRen. Energy Genera- tors reference data.gov. uk Recht- spraak. nl RDF ohloh Last.FM (rdfize) RDF Book Mashup Rådata nå! PSH Product Types Ontology Product DB PBAC Poké- pédia patents data.go v.uk Ox Points Ord- nance Survey Openly Local Open Library Open Cyc Open Corpo- rates Open Calais OpenEI Open Election Data Project Open Data Thesau- rus Ontos News Portal OGOLOD Janus AMP Ocean Drilling Codices New York Times NVD ntnusc NTU Resource Lists Norwe- gian MeSH NDL subjects ndlna my Experi- ment Italian Museums medu- cator MARC Codes List Man- chester Reading Lists Lotico Weather Stations London Gazette LOIUS Linked Open Colors lobid Resources lobid Organi- sations LEM Linked MDB LinkedL CCN Linked GeoData LinkedCT Linked User Feedback LOV Linked Open Numbers LODE Eurostat (Ontology Central) Linked EDGAR (Ontology Central) Linked Crunch- base lingvoj Lichfield Spen- ding LIBRIS Lexvo LCSH DBLP (L3S) Linked Sensor Data (Kno.e.sis) Klapp- stuhl- club Good- win Family National Radio- activity JP Jamendo (DBtune) Italian public schools ISTAT Immi- gration iServe IdRef Sudoc NSZL Catalog Hellenic PD Hellenic FBD Piedmont Accomo- dations GovTrack GovWILD Google Art wrapper gnoss GESIS GeoWord Net Geo Species Geo Names Geo Linked Data GEMET GTAA STITCH SIDER Project Guten- berg Medi Care Euro- stat (FUB) EURES Drug Bank Disea- some DBLP (FU Berlin) Daily Med CORDIS (FUB) Freebase flickr wrappr Fishes of Texas Finnish Munici- palities ChEMBL FanHubz Event Media EUTC Produc- tions Eurostat Europeana EUNIS EU Insti- tutions ESD stan- dards EARTh Enipedia Popula- tion (En- AKTing) NHS (En- AKTing) Mortality (En- AKTing) Energy (En- AKTing) Crime (En- AKTing) CO2 Emission (En- AKTing) EEA SISVU educatio n.data.g ov.uk ECS South- ampton ECCO- TCP GND Didactal ia DDC Deutsche Bio- graphie data dcs Music Brainz (DBTune) Magna- tune John Peel (DBTune) Classical (DB Tune) Audio Scrobbler (DBTune) Last.FM artists (DBTune) DB Tropes Portu- guese DBpedia dbpedia lite Greek DBpedia DBpedia data- open- ac-uk SMC Journals Pokedex Airports NASA (Data Incu- bator) Music Brainz (Data Incubator) Moseley Folk Metoffice Weather Forecasts Discogs (Data Incubator) Climbing data.gov.uk intervals Data Gov.ie data bnf.fr Cornetto reegle Chronic- ling America Chem2 Bio2RDF Calames business data.gov. uk Bricklink Brazilian Poli- ticians BNB UniSTS UniPath way UniParc Taxono my UniProt (Bio2RDF) SGD Reactome PubMed Pub Chem PRO- SITE ProDom Pfam PDB OMIM MGI KEGG Reaction KEGG Pathway KEGG Glycan KEGG Enzyme KEGG Drug KEGG Com- pound InterPro Homolo Gene HGNC Gene Ontology GeneID Affy- metrix bible ontology BibBase FTS BBC Wildlife Finder BBC Program mes BBC Music Alpine Ski Austria LOCAH Amster- dam Museum AGROV OC AEMET US Census (rdfabout) Media Geographic Publications Government Cross-domain Life sciences User-generated content Linked data © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 2 / 96 Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/
  7. 7. Outline 1 Introduction – Graph Types 2 Property Graph Processing Classification Online querying Offline analytics 3 RDF Graph Querying Data Warehousing Distributed SPARQL Execution Linked Object Data Querying © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 3 / 96
  8. 8. Outline 1 Introduction – Graph Types 2 Property Graph Processing Classification Online querying Offline analytics 3 RDF Graph Querying Data Warehousing Distributed SPARQL Execution Linked Object Data Querying © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 4 / 96
  9. 9. Graph Types Property graph film 2014 (initial release date, “1980-05-23”) (label, “The Shining”) books 0743424425 (rating, 4.7) offers 0743424425amazonOffer geo 2635167 (name, “United Kingdom”) (population, 62348447) actor 29704 (actor name, “Jack Nicholson”) film 3418 (label, “The Passenger”) film 1267 (label, “The Last Tycoon”) director 8476 (director name, “Stanley Kubrick”) film 2685 (label, “A Clockwork Orange”) film 424 (label, “Spartacus”) actor 30013 (relatedBook) (hasOffer) (based near) (actor) (director) (actor) (actor) (actor) (director) (director) © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 5 / 96
  10. 10. Graph Types RDF graph mdb:film/2014 “1980-05-23” movie:initial release date “The Shining” refs:label bm:books/0743424425 4.7 rev:rating bm:offers/0743424425amazonOffer geo:2635167 “United Kingdom” gn:name 62348447 gn:population mdb:actor/29704 “Jack Nicholson” movie:actor name mdb:film/3418 “The Passenger” refs:label mdb:film/1267 “The Last Tycoon” refs:label mdb:director/8476 “Stanley Kubrick” movie:director name mdb:film/2685 “A Clockwork Orange” refs:label mdb:film/424 “Spartacus” refs:label mdb:actor/30013 movie:relatedBook scam:hasOffer foaf:based near movie:actor movie:director movie:actor movie:actor movie:actor movie:director movie:director © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 5 / 96
  11. 11. Graph Types Property graph film 2014 (initial release date, “1980-05-23”) (label, “The Shining”) books 0743424425 (rating, 4.7) offers 0743424425amazonOffer geo 2635167 (name, “United Kingdom”) (population, 62348447) actor 29704 (actor name, “Jack Nicholson”) film 3418 (label, “The Passenger”) film 1267 (label, “The Last Tycoon”) director 8476 (director name, “Stanley Kubrick”) film 2685 (label, “A Clockwork Orange”) film 424 (label, “Spartacus”) actor 30013 (relatedBook) (hasOffer) (based near) (actor) (director) (actor) (actor) (actor) (director) (director) Workload: Online queries and analytic workloads Query execution: Varies RDF graph mdb:film/2014 “1980-05-23” movie:initial release date “The Shining” refs:label bm:books/0743424425 4.7 rev:rating bm:offers/0743424425amazonOffer geo:2635167 “United Kingdom” gn:name 62348447 gn:population mdb:actor/29704 “Jack Nicholson” movie:actor name mdb:film/3418 “The Passenger” refs:label mdb:film/1267 “The Last Tycoon” refs:label mdb:director/8476 “Stanley Kubrick” movie:director name mdb:film/2685 “A Clockwork Orange” refs:label mdb:film/424 “Spartacus” refs:label mdb:actor/30013 movie:relatedBook scam:hasOffer foaf:based near movie:actor movie:director movie:actor movie:actor movie:actor movie:director movie:director Workload: SPARQL queries Query execution: subgraph matching by homomorphism © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 5 / 96
  12. 12. Outline 1 Introduction – Graph Types 2 Property Graph Processing Classification Online querying Offline analytics 3 RDF Graph Querying Data Warehousing Distributed SPARQL Execution Linked Object Data Querying © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 6 / 96
  13. 13. Outline 1 Introduction – Graph Types 2 Property Graph Processing Classification Online querying Offline analytics 3 RDF Graph Querying Data Warehousing Distributed SPARQL Execution Linked Object Data Querying © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 7 / 96
  14. 14. Classification [Ammar and ¨Ozsu, 2015] Graph Dynamism Static Graphs Dynamic Graphs Streaming Graphs Evolving Graphs Algorithm Types Offline Online Streaming Incremental Dynamic Batch Dynamic Workload Types Online Queries Analytics Workloads © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 8 / 96
  15. 15. Classification [Ammar and ¨Ozsu, 2015] Graph Dynamism Static Graphs Dynamic Graphs Streaming Graphs Evolving Graphs Algorithm Types Offline Online Streaming Incremental Dynamic Batch Dynamic Workload Types Online Queries Analytics Workloads Focus here is on the dynamism of the graphs in whether or not they change and how they change. © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 8 / 96
  16. 16. Classification [Ammar and ¨Ozsu, 2015] Graph Dynamism Static Graphs Dynamic Graphs Streaming Graphs Evolving Graphs Algorithm Types Offline Online Streaming Incremental Dynamic Batch Dynamic Workload Types Online Queries Analytics Workloads Focus here is on the dynamism of the graphs in whether or not they change and how they change. Focus here is on the how algorithms behave as their input changes. © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 8 / 96
  17. 17. Classification [Ammar and ¨Ozsu, 2015] Graph Dynamism Static Graphs Dynamic Graphs Streaming Graphs Evolving Graphs Algorithm Types Offline Online Streaming Incremental Dynamic Batch Dynamic Workload Types Online Queries Analytics Workloads Focus here is on the dynamism of the graphs in whether or not they change and how they change. Focus here is on the how algorithms behave as their input changes. The types of workloads that the approaches are designed to handle. © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 8 / 96
  18. 18. Classification [Ammar and ¨Ozsu, 2015] Graph Dynamism Static Graphs Dynamic Graphs Streaming Graphs Evolving Graphs Algorithm Types Offline Online Streaming Incremental Dynamic Batch Dynamic Workload Types Online Queries Analytics Workloads © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 8 / 96
  19. 19. Classification [Ammar and ¨Ozsu, 2015] Graph Dynamism Static Graphs Dynamic Graphs Streaming Graphs Evolving Graphs Algorithm Types Offline Online Streaming Incremental Dynamic Batch Dynamic Workload Types Online Queries Analytics Workloads Graphs do not change or we are not inter- ested in their changes – only a snapshot is considered. © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 8 / 96
  20. 20. Classification [Ammar and ¨Ozsu, 2015] Graph Dynamism Static Graphs Dynamic Graphs Streaming Graphs Evolving Graphs Algorithm Types Offline Online Streaming Incremental Dynamic Batch Dynamic Workload Types Online Queries Analytics Workloads Graphs do not change or we are not inter- ested in their changes – only a snapshot is considered. Graphs change and we are interested in their changes. © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 8 / 96
  21. 21. Classification [Ammar and ¨Ozsu, 2015] Graph Dynamism Static Graphs Dynamic Graphs Streaming Graphs Evolving Graphs Algorithm Types Offline Online Streaming Incremental Dynamic Batch Dynamic Workload Types Online Queries Analytics Workloads Graphs do not change or we are not inter- ested in their changes – only a snapshot is considered. Graphs change and we are interested in their changes. Dynamic graphs with high veloc- ity changes – not possible to see the entire graph at once. © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 8 / 96
  22. 22. Classification [Ammar and ¨Ozsu, 2015] Graph Dynamism Static Graphs Dynamic Graphs Streaming Graphs Evolving Graphs Algorithm Types Offline Online Streaming Incremental Dynamic Batch Dynamic Workload Types Online Queries Analytics Workloads Graphs do not change or we are not inter- ested in their changes – only a snapshot is considered. Graphs change and we are interested in their changes. Dynamic graphs with high veloc- ity changes – not possible to see the entire graph at once. Dynamic graphs with un- known changes – requires re- discovery of the graph (e.g., LOD). © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 8 / 96
  23. 23. Classification [Ammar and ¨Ozsu, 2015] Graph Dynamism Static Graphs Dynamic Graphs Streaming Graphs Evolving Graphs Algorithm Types Offline Online Streaming Incremental Dynamic Batch Dynamic Workload Types Online Queries Analytics Workloads © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 8 / 96
  24. 24. Classification [Ammar and ¨Ozsu, 2015] Graph Dynamism Static Graphs Dynamic Graphs Streaming Graphs Evolving Graphs Algorithm Types Offline Online Streaming Incremental Dynamic Batch Dynamic Workload Types Online Queries Analytics Workloads Computation accesses a portion of the graph and the results are computed for a subset of vertices; e.g., point- to-point shortest path, subgraph matching, reachability, SPARQL. © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 8 / 96
  25. 25. Classification [Ammar and ¨Ozsu, 2015] Graph Dynamism Static Graphs Dynamic Graphs Streaming Graphs Evolving Graphs Algorithm Types Offline Online Streaming Incremental Dynamic Batch Dynamic Workload Types Online Queries Analytics Workloads Computation accesses a portion of the graph and the results are computed for a subset of vertices; e.g., point- to-point shortest path, subgraph matching, reachability, SPARQL. Computation accesses the entire graph and may require multiple iterations; e.g., PageR- ank, clustering, graph colouring, all pairs shortest path. © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 8 / 96
  26. 26. Classification [Ammar and ¨Ozsu, 2015] Graph Dynamism Static Graphs Dynamic Graphs Streaming Graphs Evolving Graphs Algorithm Types Offline Online Streaming Incremental Dynamic Batch Dynamic Workload Types Online Queries Analytics Workloads © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 8 / 96
  27. 27. Classification [Ammar and ¨Ozsu, 2015] Graph Dynamism Static Graphs Dynamic Graphs Streaming Graphs Evolving Graphs Algorithm Types Offline Online Streaming Incremental Dynamic Batch Dynamic Workload Types Online Queries Analytics Workloads Sees the en- tire input in advance. © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 8 / 96
  28. 28. Classification [Ammar and ¨Ozsu, 2015] Graph Dynamism Static Graphs Dynamic Graphs Streaming Graphs Evolving Graphs Algorithm Types Offline Online Streaming Incremental Dynamic Batch Dynamic Workload Types Online Queries Analytics Workloads Sees the en- tire input in advance. Sees the input piece-meal as it executes. © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 8 / 96
  29. 29. Classification [Ammar and ¨Ozsu, 2015] Graph Dynamism Static Graphs Dynamic Graphs Streaming Graphs Evolving Graphs Algorithm Types Offline Online Streaming Incremental Dynamic Batch Dynamic Workload Types Online Queries Analytics Workloads Sees the en- tire input in advance. Sees the input piece-meal as it executes. One-pass on- line algorithm with limited memory. © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 8 / 96
  30. 30. Classification [Ammar and ¨Ozsu, 2015] Graph Dynamism Static Graphs Dynamic Graphs Streaming Graphs Evolving Graphs Algorithm Types Offline Online Streaming Incremental Dynamic Batch Dynamic Workload Types Online Queries Analytics Workloads Sees the en- tire input in advance. Sees the input piece-meal as it executes. One-pass on- line algorithm with limited memory. Online algo- rithm with some info about forth- coming input. © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 8 / 96
  31. 31. Classification [Ammar and ¨Ozsu, 2015] Graph Dynamism Static Graphs Dynamic Graphs Streaming Graphs Evolving Graphs Algorithm Types Offline Online Streaming Incremental Dynamic Batch Dynamic Workload Types Online Queries Analytics Workloads Sees the en- tire input in advance. Sees the input piece-meal as it executes. One-pass on- line algorithm with limited memory. Online algo- rithm with some info about forth- coming input. Sees the en- tire input in advance, which may change; an- swers computed as change oc- curs. © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 8 / 96
  32. 32. Classification [Ammar and ¨Ozsu, 2015] Graph Dynamism Static Graphs Dynamic Graphs Streaming Graphs Evolving Graphs Algorithm Types Offline Online Streaming Incremental Dynamic Batch Dynamic Workload Types Online Queries Analytics Workloads Sees the en- tire input in advance. Sees the input piece-meal as it executes. One-pass on- line algorithm with limited memory. Online algo- rithm with some info about forth- coming input. Sees the en- tire input in advance, which may change; an- swers computed as change oc- curs. Similar to dy- namic, but computation happens in batches of © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 8 / 96
  33. 33. Example Design Points Graph Dynamism Static Graphs Dynamic Graphs Streaming Graphs Evolving Graphs Algorithm Types Offline Online Streaming Incremental Dynamic Batch Dynamic Workload Types Online Queries Analytics Workloads Compute the query result/perform analytic computation over the graph as it exists. © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 9 / 96
  34. 34. Example Design Points Graph Dynamism Static Graphs Dynamic Graphs Streaming Graphs Evolving Graphs Algorithm Types Offline Online Streaming Incremental Dynamic Batch Dynamic Workload Types Online Queries Analytics Workloads Compute the query result/perform analytic computation over the graph as it is revealed. © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 9 / 96
  35. 35. Example Design Points Graph Dynamism Static Graphs Dynamic Graphs Streaming Graphs Evolving Graphs Algorithm Types Offline Online Streaming Incremental Dynamic Batch Dynamic Workload Types Online Queries Analytics Workloads Compute the query result/perform analytic computation on each snap- shot from scratch. © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 9 / 96
  36. 36. Example Design Points Graph Dynamism Static Graphs Dynamic Graphs Streaming Graphs Evolving Graphs Algorithm Types Offline Online Streaming Incremental Dynamic Batch Dynamic Workload Types Online Queries Analytics Workloads Continuously compute the query result/perform analytic computation as the input changes. © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 9 / 96
  37. 37. Example Design Points Graph Dynamism Static Graphs Dynamic Graphs Streaming Graphs Evolving Graphs Algorithm Types Offline Online Streaming Incremental Dynamic Batch Dynamic Workload Types Online Queries Analytics Workloads Compute the query result/perform analytic computation after a batch of input changes. © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 9 / 96
  38. 38. Example Design Points – Not all alternatives make sense Graph Dynamism Static Graphs Dynamic Graphs Streaming Graphs Evolving Graphs Algorithm Types Offline Online Streaming Incremental Dynamic Batch Dynamic Workload Types Online Queries Analytics Workloads Dynamic (or batch-dynamic) algorithms do not make sense for static graphs. © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 10 / 96
  39. 39. Graph Processing Systems System Memory/ Disk Architecture Computing paradigm Supported Workloads Hadoop Disk Parallel/Distributed MapReduce Analytical Haloop Disk Parallel/Distributed MapReduce Analytical Pegasus Disk Parallel/Distributed MapReduce Analytical GraphX Disk Parallel/Distributed MapReduce (Spark) Analytical Pregel/Giraph Memory Parallel/Distributed Vertex-Centric Analytical GraphLab Memory Parallel/Distributed Vertex-Centric Analytical GraphChi Disk Single machine Vertex-Centric Analytical Stream Disk Single machine Edge-Centric Analytical Trinity Memory Parallel/Distributed Flexible using K-V store on DSM Online & Analytical Titan Disk Parallel/Distributed K-V store (Cassandra) Online Neo4J Disk Single machine Procedural/ Linked-list Online © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 11 / 96
  40. 40. Graph Workloads Online graph querying Reachability Single source shortest-path Subgraph matching SPARQL queries Offline graph analytics PageRank Clustering Strongly connected components Diameter finding Graph colouring All pairs shortest path Graph pattern mining Machine learning algorithms (Belief propagation, Gaussian non-negative matrix factorization) © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 12 / 96
  41. 41. Outline 1 Introduction – Graph Types 2 Property Graph Processing Classification Online querying Offline analytics 3 RDF Graph Querying Data Warehousing Distributed SPARQL Execution Linked Object Data Querying © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 13 / 96
  42. 42. Reachability Queries film 2014 (initial release date, “1980-05-23”) (label, “The Shining”) books 0743424425 (rating, 4.7) offers 0743424425amazonOffer geo 2635167 (name, “United Kingdom”) (population, 62348447) actor 29704 (actor name, “Jack Nicholson”) film 3418 (label, “The Passenger”) film 1267 (label, “The Last Tycoon”) director 8476 (director name, “Stanley Kubrick”) film 2685 (label, “A Clockwork Orange”) film 424 (label, “Spartacus”) actor 30013 (relatedBook) (hasOffer) (based near) (actor) (director) (actor) (actor) (actor) (director) (director) © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 14 / 96
  43. 43. Reachability Queries film 2014 (initial release date, “1980-05-23”) (label, “The Shining”) books 0743424425 (rating, 4.7) offers 0743424425amazonOffer geo 2635167 (name, “United Kingdom”) (population, 62348447) actor 29704 (actor name, “Jack Nicholson”) film 3418 (label, “The Passenger”) film 1267 (label, “The Last Tycoon”) director 8476 (director name, “Stanley Kubrick”) film 2685 (label, “A Clockwork Orange”) film 424 (label, “Spartacus”) actor 30013 (relatedBook) (hasOffer) (based near) (actor) (director) (actor) (actor) (actor) (director) (director) Can you reach film 1267 from film 2014? © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 14 / 96
  44. 44. Reachability Queries film 2014 (initial release date, “1980-05-23”) (label, “The Shining”) books 0743424425 (rating, 4.7) offers 0743424425amazonOffer geo 2635167 (name, “United Kingdom”) (population, 62348447) actor 29704 (actor name, “Jack Nicholson”) film 3418 (label, “The Passenger”) film 1267 (label, “The Last Tycoon”) director 8476 (director name, “Stanley Kubrick”) film 2685 (label, “A Clockwork Orange”) film 424 (label, “Spartacus”) actor 30013 (relatedBook) (hasOffer) (based near) (actor) (director) (actor) (actor) (actor) (director) (director) Is there a book whose rating is > 4.0 associated with a film that was directed by Stanley Kubrick? © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 14 / 96
  45. 45. Reachability Queries Think of Facebook graph and finding friends of friends. © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 14 / 96
  46. 46. Subgraph Matching ?m ?d movie:director ?name rdfs:label ?b movie:relatedBook “Stanley Kubrick” movie:director name ?r rev:rating FILTER(?r > 4.0) mdb:film/2014 “1980-05-23” movie:initial release date “The Shining” refs:label bm:books/0743424425 4.7 rev:rating bm:offers/0743424425amazonOffer geo:2635167 “United Kingdom” gn:name 62348447 gn:population mdb:actor/29704 “Jack Nicholson” movie:actor name mdb:film/3418 “The Passenger” refs:label mdb:film/1267 “The Last Tycoon” refs:label mdb:director/8476 “Stanley Kubrick” movie:director name mdb:film/2685 “A Clockwork Orange” refs:label mdb:film/424 “Spartacus” refs:label mdb:actor/30013 movie:relatedBook scam:hasOffer foaf:based near movie:actor movie:director movie:actor movie:actor movie:actor movie:director movie:director Su bgraphMatching © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 15 / 96
  47. 47. Outline 1 Introduction – Graph Types 2 Property Graph Processing Classification Online querying Offline analytics 3 RDF Graph Querying Data Warehousing Distributed SPARQL Execution Linked Object Data Querying © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 16 / 96
  48. 48. PageRank Computation A web page is important if it is pointed to by other important pages. P1 P2 P3 P5P6 P4 r(Pi ) = Pj ∈BPi r(Pj ) |FPj | r(P2) = r(P1) 2 + r(P3) 3 rk+1(Pi ) = Pj ∈BPi rk(Pj ) |FPj | BPi : in-neighbours of Pi FPi : out-neighbours of Pi © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 17 / 96
  49. 49. PageRank Computation A web page is important if it is pointed to by other important pages. P1 P2 P3 P5P6 P4 rk+1(Pi ) = Pj ∈BPi rk(Pj ) |FPj | Iteration 0 Iteration 1 Iteration 2 Rank at Iter. 2 r0(P1) = 1/6 r1(P1) = 1/18 r2(P1) = 1/36 5 r0(P2) = 1/6 r1(P2) = 5/36 r2(P2) = 1/18 4 r0(P3) = 1/6 r1(P3) = 1/12 r2(P3) = 1/36 5 r0(P4) = 1/6 r1(P4) = 1/4 r2(P4) = 17/72 1 r0(P5) = 1/6 r1(P5) = 5/36 r2(P5) = 11/72 3 r0(P6) = 1/6 r1(P6) = 1/6 r2(P6) = 14/72 2 Iterative processing. © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 17 / 96
  50. 50. Some Alternative Computational Models for Offline Analytics Vertex-centric (Scatter-Gather) Specify (a) computation at each vertex, and (b) communication with neighbour vertices Synchronous – Pregel [Malewicz et al., 2010], Giraph Asynchronous – GraphLab [Low et al., 2012] © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 18 / 96
  51. 51. Some Alternative Computational Models for Offline Analytics Vertex-centric (Scatter-Gather) Specify (a) computation at each vertex, and (b) communication with neighbour vertices Synchronous – Pregel [Malewicz et al., 2010], Giraph Asynchronous – GraphLab [Low et al., 2012] Block-centric Similar to vertex-centric but on blocks for communication Connected subgraph of the graph Blogel [Yan et al., 2014] © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 18 / 96
  52. 52. Some Alternative Computational Models for Offline Analytics Vertex-centric (Scatter-Gather) Specify (a) computation at each vertex, and (b) communication with neighbour vertices Synchronous – Pregel [Malewicz et al., 2010], Giraph Asynchronous – GraphLab [Low et al., 2012] Block-centric Similar to vertex-centric but on blocks for communication Connected subgraph of the graph Blogel [Yan et al., 2014] MapReduce Need to save in HDFS intermediate results of each iteration – both good and bad Hadoop, Haloop [Bu et al., 2012] © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 18 / 96
  53. 53. Some Alternative Computational Models for Offline Analytics Vertex-centric (Scatter-Gather) Specify (a) computation at each vertex, and (b) communication with neighbour vertices Synchronous – Pregel [Malewicz et al., 2010], Giraph Asynchronous – GraphLab [Low et al., 2012] Block-centric Similar to vertex-centric but on blocks for communication Connected subgraph of the graph Blogel [Yan et al., 2014] MapReduce Need to save in HDFS intermediate results of each iteration – both good and bad Hadoop, Haloop [Bu et al., 2012] Modified MapReduce Based on Spark [Zaharia et al., 2010; Zaharia, 2016] Keep intermediate states in memory Provide fault-tolerance by keeping lineage GraphX [Gonzalez et al., 2014] © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 18 / 96
  54. 54. Vertex-Centric Computation “Think like a vertex” vertex_scatter(vertex v) Push local computation to neighbours on the out-bound edges vertex_gather(vertex v) Gather local computation from neighbours on the in-bound edges Continue until all vertices are inactive ? © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 19 / 96
  55. 55. Vertex-Centric Computation “Think like a vertex” vertex_scatter(vertex v) Push local computation to neighbours on the out-bound edges vertex_gather(vertex v) Gather local computation from neighbours on the in-bound edges Continue until all vertices are inactive Vertex state machine ? Active Inactive Vote halt Message received © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 19 / 96
  56. 56. Synchronous Vertex-Centric Computation Computation © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 20 / 96
  57. 57. Synchronous Vertex-Centric Computation Superstep 1 Superstep 2 Superstep 3 © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 20 / 96
  58. 58. Synchronous Vertex-Centric Computation Machine 1 Machine 2 Machine 3 Communication Barrier Each machine performs vertex-centric computation on its graph partition Superstep 1 Superstep 2 Superstep 3 © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 20 / 96
  59. 59. Synchronous Vertex-Centric Computation Machine 1 Machine 2 Machine 3 Machine 1 Machine 2 Machine 3 Communication Barrier Each machine performs vertex-centric computation on its graph partition Communication Barrier Superstep 1 Superstep 2 Superstep 3 © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 20 / 96
  60. 60. Synchronous Vertex-Centric Computation Machine 1 Machine 2 Machine 3 Machine 1 Machine 2 Machine 3 Machine 1 Machine 2 Machine 3 Communication Barrier Each machine performs vertex-centric computation on its graph partition Communication Barrier Superstep 1 Superstep 2 Superstep 3 © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 20 / 96
  61. 61. Asynchronous Vertex-Centric Computation No communication barriers. Uses the most recent vertex values. Machine 1 Machine 2 Machine 3 Machine 1 Machine 2 Machine 3 © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 21 / 96
  62. 62. Asynchronous Vertex-Centric Computation No communication barriers. Uses the most recent vertex values. Implemented via distributed locking Machine 1 Machine 2 Machine 3 Machine 1 Machine 2 Machine 3 v0 v1 v2 v3 v4 © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 21 / 96
  63. 63. Asynchronous Vertex-Centric Computation No communication barriers. Uses the most recent vertex values. Implemented via distributed locking Machine 1 Machine 2 Machine 3 Machine 1 Machine 2 Machine 3 v0 v1 v2 v3 v4 © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 21 / 96
  64. 64. Asynchronous Vertex-Centric Computation No communication barriers. Uses the most recent vertex values. Implemented via distributed locking Machine 1 Machine 2 Machine 3 Machine 1 Machine 2 Machine 3 v0 v1 v2 v3 v4 © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 21 / 96
  65. 65. Asynchronous Vertex-Centric Computation No communication barriers. Uses the most recent vertex values. Implemented via distributed locking Machine 1 Machine 2 Machine 3 Machine 1 Machine 2 Machine 3 v0 v1 v2 v3 v4 © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 21 / 96
  66. 66. Asynchronous Vertex-Centric Computation No communication barriers. Uses the most recent vertex values. Implemented via distributed locking Machine 1 Machine 2 Machine 3 Machine 1 Machine 2 Machine 3 v0 v1 v2 v3 v4 © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 21 / 96
  67. 67. Summary of an Experiment [Han et al., 2014] A large study comparing Giraph, GraphLab, GPS, Mizan. 1 Giraph scales better across graphs; GraphLab scales better across more machines. © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 22 / 96
  68. 68. Summary of an Experiment [Han et al., 2014] A large study comparing Giraph, GraphLab, GPS, Mizan. 1 Giraph scales better across graphs; GraphLab scales better across more machines. 64 machines TW UK Giraph (byte array) 5.8GB 7.0GB GraphLab (sync) 4.5GB 14GB TW 16 machines 128 machines Giraph (byte array) 8.5GB 5.8GB GraphLab (sync) 11GB 3.3GB © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 22 / 96
  69. 69. Summary of an Experiment [Han et al., 2014] A large study comparing Giraph, GraphLab, GPS, Mizan. 1 Giraph scales better across graphs; GraphLab scales better across more machines. 2 Distributed locking for asynchronous execution is not scalable – Performance degrades as more machines are used due to lock contention, termination scheme, lack of message batching © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 22 / 96
  70. 70. Summary of an Experiment [Han et al., 2014] A large study comparing Giraph, GraphLab, GPS, Mizan. 1 Giraph scales better across graphs; GraphLab scales better across more machines. 2 Distributed locking for asynchronous execution is not scalable – Performance degrades as more machines are used due to lock contention, termination scheme, lack of message batching 3 Graph storage should be memory and mutation efficient. © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 22 / 96
  71. 71. Summary of an Experiment [Han et al., 2014] A large study comparing Giraph, GraphLab, GPS, Mizan. 1 Giraph scales better across graphs; GraphLab scales better across more machines. 2 Distributed locking for asynchronous execution is not scalable – Performance degrades as more machines are used due to lock contention, termination scheme, lack of message batching 3 Graph storage should be memory and mutation efficient. No Mutations Time Memory Byte array Hash map With Mutations (DMST) Time Memory Byte array Hash map © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 22 / 96
  72. 72. Summary of an Experiment [Han et al., 2014] A large study comparing Giraph, GraphLab, GPS, Mizan. 1 Giraph scales better across graphs; GraphLab scales better across more machines. 2 Distributed locking for asynchronous execution is not scalable – Performance degrades as more machines are used due to lock contention, termination scheme, lack of message batching 3 Graph storage should be memory and mutation efficient. 4 Message processing optimizations are very important. © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 22 / 96
  73. 73. Summary of an Experiment [Han et al., 2014] A large study comparing Giraph, GraphLab, GPS, Mizan. 1 Giraph scales better across graphs; GraphLab scales better across more machines. 2 Distributed locking for asynchronous execution is not scalable – Performance degrades as more machines are used due to lock contention, termination scheme, lack of message batching 3 Graph storage should be memory and mutation efficient. 4 Message processing optimizations are very important. 5 Workloads have different resource demands Algorithm CPU Memory Network PageRank Medium Medium High SSSP Low Low Low WCC Low Medium Medium DMST High High Medium © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 22 / 96
  74. 74. Block-Centric Computation Blogel [Yan et al., 2014]: “Think like a block”; also “think like a graph” [Tian et al., 2013] Vertex-centric assumes all vertices communicate over the network; this is not efficient Read-world graphs have skewed vertex degree distribution Common in power-law graphs Problem: imbalanced communication workloads Real-world graphs have large diameters Common in road networks, web graphs, terrain meshes Problem: one superstep per hop ⇒ too many supersteps Real-world graphs have high average vertex degree Common in social networks, mobile communication networks Problem: heavy average communication workloads © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 23 / 96
  75. 75. Blogel Principles Exploit the partitioning of the graph Message exchanges only among blocks Block: a connected subgraph of the graph Within a block, run a serial in-memory algorithm; no need to follow a BSP model © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 24 / 96
  76. 76. Benefits of Block-Centric Computation High-degree vertices inside a block send no messages Fewer number of supersteps Fewer number of blocks than vertices © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 25 / 96
  77. 77. Example: Weakly Connected Component Algorithm exchanges vertex id’s with neighbours id(vi ) ← min{vi , vj , . . . , vk} where vj , . . . , vk are neighbours of vi Vertex-centric requires every vertex sends to its neighbours until every vertex is reached Block-centric needs two iterations: 1 All vertices in partition A exchange ids; X and Y send ids to neighbours in partition B 2 All vertices in partition B exchange ids A B 0 X Y © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 26 / 96
  78. 78. Block Construction The partitioning algorithm needs to maximize number of vertices that have all their edges in the same partition Hash partitioning is not suitable because many vertices will probably have at least one cut-edge URL partitioner For web graphs: based on domain names of web page nodes 2D partitioner For spatial networks: based on coordinates of node Graph Voronoi diagram partitioner For general graphs © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 27 / 96
  79. 79. MapReduce Basics [Li et al., 2014] For data analysis of very large data sets Highly dynamic, irregular, schemaless, etc. SQL too heavy “Embarrassingly parallel problems” New, simple parallel programming model Data structured as (key, value) pairs E.g. (doc-id, content), (word, count), etc. Functional programming style with two functions to be given: Map(k1,v1) → list(k2,v2) Reduce(k2, list (v2)) → list(v3) Implemented on a distributed file system (e.g., Google File System) on very large clusters © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 28 / 96
  80. 80. MapReduce Processing ... Inputdataset Map Map Map Map (k1, v) (k2, v) (k2, v) (k2, v) (k1, v) (k1, v) (k2, v) Group by k Group by k (k1, (v, v, v)) (k1, (v, v, v, v)) Reduce Reduce Outputdataset © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 29 / 96
  81. 81. MapReduce Architecture Scheduler Master Input Module Map Module Combine Module Partition Module Map Process Worker Input Module Map Module Combine Module Partition Module Map Process Worker Input Module Map Module Combine Module Partition Module Map Process Worker Group Module Reduce Module Output Module Reduce Process Worker Group Module Reduce Module Output Module Reduce Process Worker © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 30 / 96
  82. 82. Execution Flow with Architecture [Dean and Ghemawat, 2008] MapReduce: Simplified Data Processing on Large Clusters split 0 split 1 split 2 split 3 split 4 (1) fork (3) read (4) local write (1) fork (1) fork (6) write worker worker worker Master User Program output file 0 output file 1 worker worker (2) assign map (2) assign reduce (5) remote (5) read Input files Map phasr Intermediate files (on local disks) Reduce phase Output files © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 31 / 96
  83. 83. Hadoop Most popular MapReduce implementation – developed by Yahoo! Two components Processing engine HDFS: Hadoop Distributed Storage System – others possible Can be deployed on the same machine or on different machines Processes Job tracker: hosted on the master node and implements the schedule Task tracker: hosted on the worker nodes and accepts tasks from job tracker and executes them HDFS Name node: stores how data are partitioned, monitors the status of data nodes, and data dictionary Data node: Stores and manages data chunks assigned to it Task Tracker Job Tracker Task Tracker Data Node Name Node Data Node Worker 1 Name Node Worker n MapReduce HDFS © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 32 / 96
  84. 84. HaLoop [Bu et al., 2012] Overcome MapReduce shortcomings for iterative jobs Having to save data in HDFS in between each iteration Checking the fixpoint requires a new job at each iteration Scheduler change: assign to the same machine the map reduce tasks that occur in different iterations but access the same data Cache invariant data Cache reduce output to easily check for fixpoint © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 33 / 96
  85. 85. Spark System MapReduce does not perform well in iterative computations Workflow model is acyclic Have to write to HDFS after each iteration and have to read from HDFS at the beginning of next iteration © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 34 / 96
  86. 86. Spark System MapReduce does not perform well in iterative computations Workflow model is acyclic Have to write to HDFS after each iteration and have to read from HDFS at the beginning of next iteration Spark objectives Better support for iterative programs Provide a complete ecosystem Similar abstraction (to MapReduce) for programming Maintain MapReduce fault-tolerance and scalability © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 34 / 96
  87. 87. Spark System MapReduce does not perform well in iterative computations Workflow model is acyclic Have to write to HDFS after each iteration and have to read from HDFS at the beginning of next iteration Spark objectives Better support for iterative programs Provide a complete ecosystem Similar abstraction (to MapReduce) for programming Maintain MapReduce fault-tolerance and scalability Fundamental concepts RDD: Reliable Distributed Datasets Caching of working set Maintaining lineage for fault-tolerance © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 34 / 96
  88. 88. Spark Ecosystem [Michiardi, 2015] Native Spark Apps Spark SQL Spark Streaming MLlib (machine learning) GraphX (graph processing) Apache Spark © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 35 / 96
  89. 89. Spark Programming Model [Zaharia et al., 2010, 2012] HDFS Create RDD · · · RDD Cache? Cache Yes Transform RDD? No Process No Transform Yes HDFS © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 36 / 96
  90. 90. Spark Programming Model [Zaharia et al., 2010, 2012] HDFS Create RDD · · · RDD Cache? Cache Yes Transform RDD? No Process No Transform Yes HDFS Each transform generates a new RDD that may also be cached or processed © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 36 / 96
  91. 91. Spark Programming Model [Zaharia et al., 2010, 2012] HDFS Create RDD · · · RDD Cache? Cache Yes Transform RDD? No Process No Transform Yes HDFS Each transform generates a new RDD that may also be cached or processed Created from HDFS or parallelized arrays; Partitioned across worker machines; May be made persistent lazily; © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 36 / 96
  92. 92. Spark Programming Model [Zaharia et al., 2010, 2012] HDFS Create RDD · · · RDD Cache? Cache Yes Transform RDD? No Process No Transform Yes HDFS Each transform generates a new RDD that may also be cached or processed Created from HDFS or parallelized arrays; Partitioned across worker machines; May be made persistent lazily; Processing done on one of the RDDs; Done in parallel across workers; First processing on a RDD is from disk; Subsequent processing of the same RDD from cache © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 36 / 96
  93. 93. Example – Log Mining [Zaharia et al., 2010, 2012] Load log messages from a file system, create a new file by filtering the error messages, read this file into memory, then interactively search for various patterns Driver WorkerWorkerWorker Block 1 Block 2 Block 3 © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 37 / 96
  94. 94. Example – Log Mining [Zaharia et al., 2010, 2012] Load log messages from a file system, create a new file by filtering the error messages, read this file into memory, then interactively search for various patterns lines = spark.textFile(hdfs://...) CreateRDD Driver WorkerWorkerWorker Block 1 Block 2 Block 3 © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 37 / 96
  95. 95. Example – Log Mining [Zaharia et al., 2010, 2012] Load log messages from a file system, create a new file by filtering the error messages, read this file into memory, then interactively search for various patterns lines = spark.textFile(hdfs://...) errors = lines.filter( .startsWith(ERROR)) Transform RDD Driver WorkerWorkerWorker Block 1 Block 2 Block 3 © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 37 / 96
  96. 96. Example – Log Mining [Zaharia et al., 2010, 2012] Load log messages from a file system, create a new file by filtering the error messages, read this file into memory, then interactively search for various patterns lines = spark.textFile(hdfs://...) errors = lines.filter( .startsWith(ERROR)) messages = errors.map( .split(‘t ’)(2)) Another transform Driver WorkerWorkerWorker Block 1 Block 2 Block 3 © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 37 / 96
  97. 97. Example – Log Mining [Zaharia et al., 2010, 2012] Load log messages from a file system, create a new file by filtering the error messages, read this file into memory, then interactively search for various patterns lines = spark.textFile(hdfs://...) errors = lines.filter( .startsWith(ERROR)) messages = errors.map( .split(‘t ’)(2)) cachedMsgs = messages.cache() Cache results Driver WorkerWorkerWorker Block 1 Block 2 Block 3 © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 37 / 96
  98. 98. Example – Log Mining [Zaharia et al., 2010, 2012] Load log messages from a file system, create a new file by filtering the error messages, read this file into memory, then interactively search for various patterns lines = spark.textFile(hdfs://...) errors = lines.filter( .startsWith(ERROR)) messages = errors.map( .split(‘t ’)(2)) cachedMsgs = messages.cache() cachedMsgs.filter( .contains(foo)).count Action Driver WorkerWorkerWorker Block 1 Block 2 Block 3 © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 37 / 96
  99. 99. Example – Log Mining [Zaharia et al., 2010, 2012] Load log messages from a file system, create a new file by filtering the error messages, read this file into memory, then interactively search for various patterns lines = spark.textFile(hdfs://...) errors = lines.filter( .startsWith(ERROR)) messages = errors.map( .split(‘t ’)(2)) cachedMsgs = messages.cache() cachedMsgs.filter( .contains(foo)).count Driver WorkerWorkerWorker Block 1 Block 2 Block 3 Tasks © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 37 / 96
  100. 100. Example – Log Mining [Zaharia et al., 2010, 2012] Load log messages from a file system, create a new file by filtering the error messages, read this file into memory, then interactively search for various patterns lines = spark.textFile(hdfs://...) errors = lines.filter( .startsWith(ERROR)) messages = errors.map( .split(‘t ’)(2)) cachedMsgs = messages.cache() cachedMsgs.filter( .contains(foo)).count Driver WorkerWorkerWorker Block 1 Block 2 Block 3 TasksResults © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 37 / 96
  101. 101. Example – Log Mining [Zaharia et al., 2010, 2012] Load log messages from a file system, create a new file by filtering the error messages, read this file into memory, then interactively search for various patterns lines = spark.textFile(hdfs://...) errors = lines.filter( .startsWith(ERROR)) messages = errors.map( .split(‘t ’)(2)) cachedMsgs = messages.cache() cachedMsgs.filter( .contains(foo)).count Driver WorkerWorkerWorker Block 1 Block 2 Block 3 TasksResults Cache Cache Cache © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 37 / 96
  102. 102. Example – Log Mining [Zaharia et al., 2010, 2012] Load log messages from a file system, create a new file by filtering the error messages, read this file into memory, then interactively search for various patterns lines = spark.textFile(hdfs://...) errors = lines.filter( .startsWith(ERROR)) messages = errors.map( .split(‘t ’)(2)) cachedMsgs = messages.cache() cachedMsgs.filter( .contains(foo)).count cachedMsgs.filter( .contains(bar)).count Another Action accesses cache Driver WorkerWorkerWorker Block 1 Block 2 Block 3 TasksResults Cache Cache Cache © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 37 / 96
  103. 103. RDD and Processing HDFS lines = spark.textFile(hdfs://...) lines Error, msg1 Warn, msg2 Error, msg1 Info, msg8 Warn, msg2 Info, msg8 Error, msg3 Info, msg5 Info, msg5 Error, msg4 Warn, msg9 Error, msg1 errors errors = lines.filter( .startsWith(ERROR)) Error, msg1 Error, msg1 Error, msg3 Error, msg4 Error, msg1 messages messages = errors.map .split(‘t ’)(2) msg1 msg1 msg3 msg4 msg1 Thesearenotyetgenerated © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 38 / 96
  104. 104. RDD and Processing lines errors messages msg1 msg1 msg3 msg4 msg1 lines messages.filter( .contains(foo)).count errors messages msg1 msg1 msg3 msg4 msg1 NowtheRDDsarematerialized; Commandnotyetexecuted Driver messages.filter( .contains(foo)).count © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 38 / 96
  105. 105. GraphX [Gonzalez et al., 2014] Built on top of Spark Objective is to combine data analytics with graph processing Unify computation on tables and graphs Carefully convert graph to tabular representation Native GraphX API or can accommodate vertex-centric computation Native Spark Apps Spark SQL Spark Streaming MLlib (machine learning) GraphX (graph processing) Apache Spark Vertex- centric API AppApp App App © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 39 / 96
  106. 106. GraphX: Representation of Graphs as Tables A B C D E F G H I J © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 40 / 96
  107. 107. GraphX: Representation of Graphs as Tables Partition 1 Partition 2 A B C D E F G H I J Edge-disjoint partitioning © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 40 / 96
  108. 108. GraphX: Representation of Graphs as Tables Partition 1 Partition 2 Machine1Machine2 Vertex Table (RDD) v-prop:vertex prop. A B C D E F G H I J Edge-disjoint partitioning A v-prop B v-prop ... I v-prop D v-prop E v-prop F v-prop J v-prop © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 40 / 96
  109. 109. GraphX: Representation of Graphs as Tables Partition 1 Partition 2 Machine1Machine2 Vertex Table (RDD) v-prop:vertex prop. Edge Table (RDD) e-prop:edge prop. A B C D E F G H I J Edge-disjoint partitioning A v-prop B v-prop ... I v-prop D v-prop E v-prop F v-prop J v-prop A e-prop B A e-prop C ... F e-prop G A e-prop D A e-prop E ... E e-prop F © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 40 / 96
  110. 110. GraphX: Representation of Graphs as Tables Partition 1 Partition 2 Machine1Machine2 Vertex Table (RDD) v-prop:vertex prop. Edge Table (RDD) e-prop:edge prop. A B C D E F G H I J Edge-disjoint partitioning A v-prop B v-prop ... I v-prop D v-prop E v-prop F v-prop J v-prop A e-prop B A e-prop C ... F e-prop G A e-prop D A e-prop E ... E e-prop F Joining vertices and edges Move vertices to edges © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 40 / 96
  111. 111. GraphX: Representation of Graphs as Tables Partition 1 Partition 2 Machine1Machine2 Vertex Table (RDD) v-prop:vertex prop. Edge Table (RDD) e-prop:edge prop. Routing Table (RDD) A B C D E F G H I J Edge-disjoint partitioning A v-prop B v-prop ... I v-prop D v-prop E v-prop F v-prop J v-prop A e-prop B A e-prop C ... F e-prop G A e-prop D A e-prop E ... E e-prop F A 1 2 B 1 ... I 1 F 1 2 D 2 E 2 J 2 © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 40 / 96
  112. 112. GraphX: Computation Model Machine1Machine2 Vertex Table Edge Table A v-prop B v-prop ... I v-prop D v-prop E v-prop F v-prop J v-prop A e-prop B A e-prop C ... F e-prop G A e-prop D A e-prop E ... E e-prop F © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 41 / 96
  113. 113. GraphX: Computation Model Machine1Machine2 Vertex Table Edge Table A v-prop B v-prop ... I v-prop D v-prop E v-prop F v-prop J v-prop A e-prop B A e-prop C ... F e-prop G A e-prop D A e-prop E ... E e-prop F First Phase: Join Vertex table Edge table Triples View A v-prop e-prop B v-prop A v-prop e-prop C v-prop C v-prop e-prop G v-prop ... E v-prop e-prop G v-prop J v-prop e-prop G v-prop © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 41 / 96
  114. 114. GraphX: Computation Model Machine1Machine2 Vertex Table Edge Table A v-prop B v-prop ... I v-prop D v-prop E v-prop F v-prop J v-prop A e-prop B A e-prop C ... F e-prop G A e-prop D A e-prop E ... E e-prop F Triples View A v-prop e-prop B v-prop A v-prop e-prop C v-prop C v-prop e-prop G v-prop ... E v-prop e-prop G v-prop J v-prop e-prop G v-prop Second Phase: Compute neighbourhood Group-by aggregate © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 41 / 96
  115. 115. GraphX: Operators Table transform operators – inherited from Spark map(func) Return a new RDD formed by passing each element of the source through a function func filter(func) Return a new RDD formed by selecting those elements of the source on which func returns true flatMap(func) Similar to map, but each input item can be mapped to 0 or more output items mapPartitions(func) Similar to map, but runs separately on each partition (block) of the RDD, so func must be of type Iterator sample(repl, fraction, seed) Sample a fraction fraction of the data, with or without replacement (set repl accordingly), using a given random number generator seed union(otherDataset) intersection() Return a new RDD containing the union/intersection of the elements in the source RDD and the argument groupByKey() Operates on a RDD of (K, V) pairs, returns a RDD of (K, IterableV) pairs reduceByKey(func, . . .) Operates on a RDD of (K, V) pairs, returns a RDD of (K, V) pairs where the values for each key are aggregated using the given reduce function func © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 42 / 96
  116. 116. GraphX: Operators Table transform operators – inherited from Spark Graph operators Graph(vertex coll, edge coll) Logically binds together a pair of vertex and edge property collections into a property graph; verifies that each vertex occurs only once and edges connect existing vertices triplets(vertex coll, vertex coll, edge coll) Returns the triplets view of the graph mrTriplets(map,reduce) MapReduce triplets - encodes the two-stage process of join to create triplets and group by © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 42 / 96
  117. 117. Outline 1 Introduction – Graph Types 2 Property Graph Processing Classification Online querying Offline analytics 3 RDF Graph Querying Data Warehousing Distributed SPARQL Execution Linked Object Data Querying © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 43 / 96
  118. 118. RDF Introduction Everything is an uniquely named resource http://data.linkedmdb.org/resource/actor/JN29704 © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 44 / 96
  119. 119. RDF Introduction Everything is an uniquely named resource Prefixes can be used to shorten the names xmlns:y=http://data.linkedmdb.org/resource/actor/ y:JN29704 © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 44 / 96
  120. 120. RDF Introduction Everything is an uniquely named resource Prefixes can be used to shorten the names Properties of resources can be defined xmlns:y=http://data.linkedmdb.org/resource/actor/ y:JN29704 y:JN29704:hasName “Jack Nicholson” y:JN29704:BornOnDate “1937-04-22” © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 44 / 96
  121. 121. RDF Introduction Everything is an uniquely named resource Prefixes can be used to shorten the names Properties of resources can be defined Relationships with other resources can be defined xmlns:y=http://data.linkedmdb.org/resource/actor/ y:JN29704 y:JN29704:hasName “Jack Nicholson” y:JN29704:BornOnDate “1937-04-22” y:TS2014:title “The Shining” y:TS2014:releaseDate “1980-05-23” y:TS2014 JN29704:movieActor © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 44 / 96
  122. 122. RDF Introduction Everything is an uniquely named resource Prefixes can be used to shorten the names Properties of resources can be defined Relationships with other resources can be defined Resource descriptions can be contributed by different people/groups and can be located anywhere in the web Integrated web “database” xmlns:y=http://data.linkedmdb.org/resource/actor/ y:JN29704 y:JN29704:hasName “Jack Nicholson” y:JN29704:BornOnDate “1937-04-22” y:TS2014:title “The Shining” y:TS2014:releaseDate “1980-05-23” y:TS2014 JN29704:movieActor © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 44 / 96
  123. 123. RDF Data Model Triple: Subject, Predicate (Property), Object (s, p, o) Subject: the entity that is described (URI or blank node) Predicate: a feature of the entity (URI) Object: value of the feature (URI, blank node or literal) (s, p, o) ∈ (U ∪ B) × U × (U ∪ B ∪ L) Set of RDF triples is called an RDF graph U Subject Object U B U B L U: set of URIs B: set of blank nodes L: set of literals Predicate Subject Predicate Object http://...imdb.../film/2014 rdfs:label “The Shining” http://...imdb.../film/2014 movie:releaseDate “1980-05-23” http://...imdb.../29704 movie:actor name “Jack Nicholson” . . . . . . . . . © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 45 / 96
  124. 124. RDF Example Instance Prefixes: mdb=http://data.linkedmdb.org/resource/; geo=http://sws.geonames.org/ bm=http://wifo5-03.informatik.uni-mannheim.de/bookmashup/ lexvo=http://lexvo.org/id/;wp=http://en.wikipedia.org/wiki/ Subject Predicate Object mdb: film/2014 rdfs:label “The Shining” mdb:film/2014 movie:initial release date “1980-05-23”’ mdb:film/2014 movie:director mdb:director/8476 mdb:film/2014 movie:actor mdb:actor/29704 mdb:film/2014 movie:actor mdb: actor/30013 mdb:film/2014 movie:music contributor mdb: music contributor/4110 mdb:film/2014 foaf:based near geo:2635167 mdb:film/2014 movie:relatedBook bm:0743424425 mdb:film/2014 movie:language lexvo:iso639-3/eng mdb:director/8476 movie:director name “Stanley Kubrick” mdb:film/2685 movie:director mdb:director/8476 mdb:film/2685 rdfs:label “A Clockwork Orange” mdb:film/424 movie:director mdb:director/8476 mdb:film/424 rdfs:label “Spartacus” mdb:actor/29704 movie:actor name “Jack Nicholson” mdb:film/1267 movie:actor mdb:actor/29704 mdb:film/1267 rdfs:label “The Last Tycoon” mdb:film/3418 movie:actor mdb:actor/29704 mdb:film/3418 rdfs:label “The Passenger” geo:2635167 gn:name “United Kingdom” geo:2635167 gn:population 62348447 geo:2635167 gn:wikipediaArticle wp:United Kingdom bm:books/0743424425 dc:creator bm:persons/Stephen+King bm:books/0743424425 rev:rating 4.7 bm:books/0743424425 scom:hasOffer bm:offers/0743424425amazonOffer lexvo:iso639-3/eng rdfs:label “English” lexvo:iso639-3/eng lvont:usedIn lexvo:iso3166/CA lexvo:iso639-3/eng lvont:usesScript lexvo:script/Latn URI Literal URI © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 46 / 96
  125. 125. RDF Graph mdb:film/2014 “1980-05-23” movie:initial release date “The Shining” refs:label bm:books/0743424425 4.7 rev:rating bm:offers/0743424425amazonOffer geo:2635167 “United Kingdom” gn:name 62348447 gn:population mdb:actor/29704 “Jack Nicholson” movie:actor name mdb:film/3418 “The Passenger” refs:label mdb:film/1267 “The Last Tycoon” refs:label mdb:director/8476 “Stanley Kubrick” movie:director name mdb:film/2685 “A Clockwork Orange” refs:label mdb:film/424 “Spartacus” refs:label mdb:actor/30013 movie:relatedBook scam:hasOffer foaf:based near movie:actor movie:director movie:actor movie:actor movie:actor movie:director movie:director © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 47 / 96
  126. 126. RDF Query Model – SPARQL Query Model - SPARQL Protocol and RDF Query Language Given U (set of URIs), L (set of literals), and V (set of variables), a SPARQL expression is defined recursively: an atomic triple pattern, which is an element of (U ∪ V ) × (U ∪ V ) × (U ∪ V ∪ L) ?x rdfs:label “The Shining” P FILTER R, where P is a graph pattern expression and R is a built-in SPARQL condition (i.e., analogous to a SQL predicate) ?x rev:rating ?p FILTER(?p 3.0) P1 AND/OPT/UNION P2, where P1 and P2 are graph pattern expressions Example: SELECT ?name WHERE { ?m r d f s : l a b e l ?name . ?m movie : d i r e c t o r ?d . ?d movie : director name ” Stanley Kubrick ” . ?m movie : relatedBook ?b . ?b rev : r a t i n g ? r . FILTER(? r 4.0) } © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 48 / 96
  127. 127. SPARQL Queries SELECT ?name WHERE { ?m r d f s : l a b e l ?name . ?m movie : d i r e c t o r ?d . ?d movie : director name ” Stanley Kubrick ” . ?m movie : relatedBook ?b . ?b rev : r a t i n g ? r . FILTER(? r 4.0) } ?m ?d movie:director ?name rdfs:label ?b movie:relatedBook “Stanley Kubrick” movie:director name ?r rev:rating FILTER(?r 4.0) © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 49 / 96
  128. 128. Outline 1 Introduction – Graph Types 2 Property Graph Processing Classification Online querying Offline analytics 3 RDF Graph Querying Data Warehousing Distributed SPARQL Execution Linked Object Data Querying © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 50 / 96
  129. 129. Na¨ıve Triple Store Design SELECT ?name WHERE { ?m r d f s : l a b e l ?name . ?m movie : d i r e c t o r ?d . ?d movie : director name ” Stanley Kubrick ” . ?m movie : relatedBook ?b . ?b rev : r a t i n g ? r . FILTER(? r 4.0) } Subject Property Object mdb:film/2014 rdfs:label “The Shining” mdb:film/2014 movie:initial release date “1980-05-23” mdb:film/2014 movie:director mdb:director/8476 mdb:film/2014 movie:actor mdb:actor/29704 mdb:film/2014 movie:actor mdb: actor/30013 mdb:film/2014 movie:music contributor mdb: music contributor/4110 mdb:film/2014 foaf:based near geo:2635167 mdb:film/2014 movie:relatedBook bm:0743424425 mdb:film/2014 movie:language lexvo:iso639-3/eng mdb:director/8476 movie:director name “Stanley Kubrick” mdb:film/2685 movie:director mdb:director/8476 mdb:film/2685 rdfs:label “A Clockwork Orange” mdb:film/424 movie:director mdb:director/8476 mdb:film/424 rdfs:label “Spartacus” mdb:actor/29704 movie:actor name “Jack Nicholson” mdb:film/1267 movie:actor mdb:actor/29704 mdb:film/1267 rdfs:label “The Last Tycoon” mdb:film/3418 movie:actor mdb:actor/29704 mdb:film/3418 rdfs:label “The Passenger” geo:2635167 gn:name “United Kingdom” geo:2635167 gn:population 62348447 geo:2635167 gn:wikipediaArticle wp:United Kingdom bm:books/0743424425 dc:creator bm:persons/Stephen+King bm:books/0743424425 rev:rating 4.7 bm:books/0743424425 scom:hasOffer bm:offers/0743424425amazonOffer lexvo:iso639-3/eng rdfs:label “English” lexvo:iso639-3/eng lvont:usedIn lexvo:iso3166/CA lexvo:iso639-3/eng lvont:usesScript lexvo:script/Latn © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 51 / 96
  130. 130. Na¨ıve Triple Store Design SELECT ?name WHERE { ?m r d f s : l a b e l ?name . ?m movie : d i r e c t o r ?d . ?d movie : director name ” Stanley Kubrick ” . ?m movie : relatedBook ?b . ?b rev : r a t i n g ? r . FILTER(? r 4.0) } Subject Property Object mdb:film/2014 rdfs:label “The Shining” mdb:film/2014 movie:initial release date “1980-05-23” mdb:film/2014 movie:director mdb:director/8476 mdb:film/2014 movie:actor mdb:actor/29704 mdb:film/2014 movie:actor mdb: actor/30013 mdb:film/2014 movie:music contributor mdb: music contributor/4110 mdb:film/2014 foaf:based near geo:2635167 mdb:film/2014 movie:relatedBook bm:0743424425 mdb:film/2014 movie:language lexvo:iso639-3/eng mdb:director/8476 movie:director name “Stanley Kubrick” mdb:film/2685 movie:director mdb:director/8476 mdb:film/2685 rdfs:label “A Clockwork Orange” mdb:film/424 movie:director mdb:director/8476 mdb:film/424 rdfs:label “Spartacus” mdb:actor/29704 movie:actor name “Jack Nicholson” mdb:film/1267 movie:actor mdb:actor/29704 mdb:film/1267 rdfs:label “The Last Tycoon” mdb:film/3418 movie:actor mdb:actor/29704 mdb:film/3418 rdfs:label “The Passenger” geo:2635167 gn:name “United Kingdom” geo:2635167 gn:population 62348447 geo:2635167 gn:wikipediaArticle wp:United Kingdom bm:books/0743424425 dc:creator bm:persons/Stephen+King bm:books/0743424425 rev:rating 4.7 bm:books/0743424425 scom:hasOffer bm:offers/0743424425amazonOffer lexvo:iso639-3/eng rdfs:label “English” lexvo:iso639-3/eng lvont:usedIn lexvo:iso3166/CA lexvo:iso639-3/eng lvont:usesScript lexvo:script/Latn SELECT T1 . o b j e c t FROM T as T1 , T as T2 , T as T3 , T as T4 , T as T5 WHERE T1 . p=” r d f s : l a b e l ” AND T2 . p=” movie : relatedBook ” AND T3 . p=” movie : d i r e c t o r ” AND T4 . p=” rev : r a t i n g ” AND T5 . p=” movie : d i r e c t o r n a m e ” AND T1 . s=T2 . s AND T1 . s=T3 . s AND T2 . o=T4 . s AND T3 . o=T5 . s AND T4 . o 4.0 AND T5 . o=” S t a n l e y Kubrick ” © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 51 / 96
  131. 131. Na¨ıve Triple Store Design SELECT ?name WHERE { ?m r d f s : l a b e l ?name . ?m movie : d i r e c t o r ?d . ?d movie : director name ” Stanley Kubrick ” . ?m movie : relatedBook ?b . ?b rev : r a t i n g ? r . FILTER(? r 4.0) } Subject Property Object mdb:film/2014 rdfs:label “The Shining” mdb:film/2014 movie:initial release date “1980-05-23” mdb:film/2014 movie:director mdb:director/8476 mdb:film/2014 movie:actor mdb:actor/29704 mdb:film/2014 movie:actor mdb: actor/30013 mdb:film/2014 movie:music contributor mdb: music contributor/4110 mdb:film/2014 foaf:based near geo:2635167 mdb:film/2014 movie:relatedBook bm:0743424425 mdb:film/2014 movie:language lexvo:iso639-3/eng mdb:director/8476 movie:director name “Stanley Kubrick” mdb:film/2685 movie:director mdb:director/8476 mdb:film/2685 rdfs:label “A Clockwork Orange” mdb:film/424 movie:director mdb:director/8476 mdb:film/424 rdfs:label “Spartacus” mdb:actor/29704 movie:actor name “Jack Nicholson” mdb:film/1267 movie:actor mdb:actor/29704 mdb:film/1267 rdfs:label “The Last Tycoon” mdb:film/3418 movie:actor mdb:actor/29704 mdb:film/3418 rdfs:label “The Passenger” geo:2635167 gn:name “United Kingdom” geo:2635167 gn:population 62348447 geo:2635167 gn:wikipediaArticle wp:United Kingdom bm:books/0743424425 dc:creator bm:persons/Stephen+King bm:books/0743424425 rev:rating 4.7 bm:books/0743424425 scom:hasOffer bm:offers/0743424425amazonOffer lexvo:iso639-3/eng rdfs:label “English” lexvo:iso639-3/eng lvont:usedIn lexvo:iso3166/CA lexvo:iso639-3/eng lvont:usesScript lexvo:script/Latn SELECT T1 . o b j e c t FROM T as T1 , T as T2 , T as T3 , T as T4 , T as T5 WHERE T1 . p=” r d f s : l a b e l ” AND T2 . p=” movie : relatedBook ” AND T3 . p=” movie : d i r e c t o r ” AND T4 . p=” rev : r a t i n g ” AND T5 . p=” movie : d i r e c t o r n a m e ” AND T1 . s=T2 . s AND T1 . s=T3 . s AND T2 . o=T4 . s AND T3 . o=T5 . s AND T4 . o 4.0 AND T5 . o=” S t a n l e y Kubrick ” Easy to implement but too many self-joins! © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 51 / 96
  132. 132. Exhaustive Indexing RDF-3X [Neumann and Weikum, 2008, 2009], Hexastore [Weiss et al., 2008] Strings are mapped to ids using a mapping table Original triple table Subject Property Object mdb: film/2014 rdfs:label “The Shining” mdb:film/2014 movie:initial release date “1980-05-23” mdb:director/8476 movie:director name “Stanley Kubrick” mdb:film/2685 movie:director mdb:director/8476 Encoded triple table Subject Property Object 0 1 2 0 3 4 5 6 7 8 9 5 Mapping table ID Value 0 mdb: film/2014 1 rdfs:label 2 “The Shining” 3 movie:initial release date 4 “1980-05-23” 5 mdb:director/8476 6 movie:director name 7 “Stanley Kubrick” 8 mdb:film/2685 9 movie:director © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 52 / 96
  133. 133. Exhaustive Indexing RDF-3X [Neumann and Weikum, 2008, 2009], Hexastore [Weiss et al., 2008] Strings are mapped to ids using a mapping table Triples are indexed in a clustered B+ tree in lexicographic order Subject Property Object 0 1 2 0 3 4 5 6 7 8 9 5 ... ... ... B+ tree Easy querying through mapping table © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 52 / 96
  134. 134. Exhaustive Indexing RDF-3X [Neumann and Weikum, 2008, 2009], Hexastore [Weiss et al., 2008] Strings are mapped to ids using a mapping table Triples are indexed in a clustered B+ tree in lexicographic order Create indexes for permutations of the three columns: SPO, SOP, PSO, POS, OPS, OSP Subject Property Object 0 1 2 0 3 4 5 6 7 8 9 5 ... ... ... B+ tree Easy querying through mapping table © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 52 / 96
  135. 135. Exhaustive Indexing–Query Execution Each triple pattern can be answered by a range query Joins between triple patterns computed using merge join Join order is easy due to extensive indexing Subject Property Object 0 1 2 0 3 4 5 6 7 8 9 5 ... ... ... ID Value 0 mdb: film/2014 1 rdfs:label 2 “The Shining” 3 movie:initial release date 4 “1980-05-23” 5 mdb:director/8476 6 movie:director name 7 “Stanley Kubrick” 8 mdb:film/2685 9 movie:director © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 53 / 96
  136. 136. Exhaustive Indexing–Query Execution Each triple pattern can be answered by a range query Joins between triple patterns computed using merge join Join order is easy due to extensive indexing Subject Property Object 0 1 2 0 3 4 5 6 7 8 9 5 ... ... ... ID Value 0 mdb: film/2014 1 rdfs:label 2 “The Shining” 3 movie:initial release date 4 “1980-05-23” 5 mdb:director/8476 6 movie:director name 7 “Stanley Kubrick” 8 mdb:film/2685 9 movie:director Advantages Eliminates some of the joins – they become range queries Merge join is easy and fast © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 53 / 96
  137. 137. Exhaustive Indexing–Query Execution Each triple pattern can be answered by a range query Joins between triple patterns computed using merge join Join order is easy due to extensive indexing Subject Property Object 0 1 2 0 3 4 5 6 7 8 9 5 ... ... ... ID Value 0 mdb: film/2014 1 rdfs:label 2 “The Shining” 3 movie:initial release date 4 “1980-05-23” 5 mdb:director/8476 6 movie:director name 7 “Stanley Kubrick” 8 mdb:film/2685 9 movie:director Advantages Eliminates some of the joins – they become range queries Merge join is easy and fast Disadvantages Space usage © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 53 / 96
  138. 138. Property Tables Grouping by entities; Jena [Wilkinson, 2006], DB2-RDF [Bornea et al., 2013] Clustered property table: group together the properties that tend to occur in the same (or similar) subjects Property-class table: cluster the subjects with the same type of property into one property table Subject Property Object mdb:film/2014 rdfs:label “The Shining” mdb:film/2014 movie:director mdb:director/8476 mdb:film/2685 movie:director mdb:director/8476 mdb:film/2685 rdfs:label “A Clockwork Orange” mdb:actor/29704 movie:actor name “Jack Nicholson” . . . . . . . . . Subject refs:label movie:director mob:film/2014 “The Shining” mob:director/8476 mob:film/2685 “The Clockwork Orange” mob:director/8476 Subject movie:actor name mdb:actor “Jack Nicholson” © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 54 / 96
  139. 139. Property Tables Grouping by entities; Jena [Wilkinson, 2006], DB2-RDF [Bornea et al., 2013] Clustered property table: group together the properties that tend to occur in the same (or similar) subjects Property-class table: cluster the subjects with the same type of property into one property table Subject Property Object mdb:film/2014 rdfs:label “The Shining” mdb:film/2014 movie:director mdb:director/8476 mdb:film/2685 movie:director mdb:director/8476 mdb:film/2685 rdfs:label “A Clockwork Orange” mdb:actor/29704 movie:actor name “Jack Nicholson” . . . . . . . . . Subject refs:label movie:director mob:film/2014 “The Shining” mob:director/8476 mob:film/2685 “The Clockwork Orange” mob:director/8476 Subject movie:actor name mdb:actor “Jack Nicholson” Advantages Fewer joins If the data is structured, we have a relational system – similar to normalized relations © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 54 / 96
  140. 140. Property Tables Grouping by entities; Jena [Wilkinson, 2006], DB2-RDF [Bornea et al., 2013] Clustered property table: group together the properties that tend to occur in the same (or similar) subjects Property-class table: cluster the subjects with the same type of property into one property table Subject Property Object mdb:film/2014 rdfs:label “The Shining” mdb:film/2014 movie:director mdb:director/8476 mdb:film/2685 movie:director mdb:director/8476 mdb:film/2685 rdfs:label “A Clockwork Orange” mdb:actor/29704 movie:actor name “Jack Nicholson” . . . . . . . . . Subject refs:label movie:director mob:film/2014 “The Shining” mob:director/8476 mob:film/2685 “The Clockwork Orange” mob:director/8476 Subject movie:actor name mdb:actor “Jack Nicholson” Advantages Fewer joins If the data is structured, we have a relational system – similar to normalized relations Disadvantages Potentially a lot of NULLs Clustering is not trivial Multi-valued properties are complicated © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 54 / 96
  141. 141. Binary Tables Grouping by properties: For each property, build a two-column table, containing both subject and object, ordered by subjects [Abadi et al., 2007, 2009] Also called vertical partitioned tables n two column tables (n is the number of unique properties in the data) Subject Property Object mdb:film/2014 rdfs:label “The Shining” mdb:film/2014 movie:director mdb:director/8476 mdb:film/2685 movie:director mdb:director/8476 mdb:film/2685 rdfs:label “A Clockwork Orange” mdb:actor/29704 movie:actor name “Jack Nicholson” . . . . . . . . . Subject Object mdb:film/2014 mdb:director/8476 mdb:film/2685 mdb:director/8476 movie:director Subject Object mob:film/2014 “The Shining” mob:film/2685 “The Clockwork Orange” refs:label Subject Object mdb:actor/29704 “Jack Nicholson” movie:actor name © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 55 / 96
  142. 142. Binary Tables Grouping by properties: For each property, build a two-column table, containing both subject and object, ordered by subjects [Abadi et al., 2007, 2009] Also called vertical partitioned tables n two column tables (n is the number of unique properties in the data) Subject Property Object mdb:film/2014 rdfs:label “The Shining” mdb:film/2014 movie:director mdb:director/8476 mdb:film/2685 movie:director mdb:director/8476 mdb:film/2685 rdfs:label “A Clockwork Orange” mdb:actor/29704 movie:actor name “Jack Nicholson” . . . . . . . . . Subject Object mdb:film/2014 mdb:director/8476 mdb:film/2685 mdb:director/8476 movie:director Subject Object mob:film/2014 “The Shining” mob:film/2685 “The Clockwork Orange” refs:label Subject Object mdb:actor/29704 “Jack Nicholson” movie:actor name Advantages Supports multi-valued properties No NULLs No clustering Read only needed attributes (i.e. less I/O) Good performance for subject-subject joins © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 55 / 96
  143. 143. Binary Tables Grouping by properties: For each property, build a two-column table, containing both subject and object, ordered by subjects [Abadi et al., 2007, 2009] Also called vertical partitioned tables n two column tables (n is the number of unique properties in the data) Subject Property Object mdb:film/2014 rdfs:label “The Shining” mdb:film/2014 movie:director mdb:director/8476 mdb:film/2685 movie:director mdb:director/8476 mdb:film/2685 rdfs:label “A Clockwork Orange” mdb:actor/29704 movie:actor name “Jack Nicholson” . . . . . . . . . Subject Object mdb:film/2014 mdb:director/8476 mdb:film/2685 mdb:director/8476 movie:director Subject Object mob:film/2014 “The Shining” mob:film/2685 “The Clockwork Orange” refs:label Subject Object mdb:actor/29704 “Jack Nicholson” movie:actor name Advantages Supports multi-valued properties No NULLs No clustering Read only needed attributes (i.e. less I/O) Good performance for subject-subject joins Disadvantages Not useful for subject-object joins Expensive inserts © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 55 / 96
  144. 144. Graph-based Approach Answering SPARQL query ≡ subgraph matching using homomorphism gStore [Zou et al., 2011, 2014], chameleon-db [Alu¸c et al., 2013] ?m ?d movie:director ?name rdfs:label ?b movie:relatedBook “Stanley Kubrick” movie:director name ?r rev:rating FILTER(?r 4.0) mdb:film/2014 “1980-05-23” movie:initial release date “The Shining” refs:label bm:books/0743424425 4.7 rev:rating bm:offers/0743424425amazonOffer geo:2635167 “United Kingdom” gn:name 62348447 gn:population mdb:actor/29704 “Jack Nicholson” movie:actor name mdb:film/3418 “The Passenger” refs:label mdb:film/1267 “The Last Tycoon” refs:label mdb:director/8476 “Stanley Kubrick” movie:director name mdb:film/2685 “A Clockwork Orange” refs:label mdb:film/424 “Spartacus” refs:label mdb:actor/30013 movie:relatedBook scam:hasOffer foaf:based near movie:actor movie:director movie:actor movie:actor movie:actor movie:director movie:director Subgraph M atching © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 56 / 96
  145. 145. Graph-based Approach Answering SPARQL query ≡ subgraph matching using homomorphism gStore [Zou et al., 2011, 2014], chameleon-db [Alu¸c et al., 2013] ?m ?d movie:director ?name rdfs:label ?b movie:relatedBook “Stanley Kubrick” movie:director name ?r rev:rating FILTER(?r 4.0) mdb:film/2014 “1980-05-23” movie:initial release date “The Shining” refs:label bm:books/0743424425 4.7 rev:rating bm:offers/0743424425amazonOffer geo:2635167 “United Kingdom” gn:name 62348447 gn:population mdb:actor/29704 “Jack Nicholson” movie:actor name mdb:film/3418 “The Passenger” refs:label mdb:film/1267 “The Last Tycoon” refs:label mdb:director/8476 “Stanley Kubrick” movie:director name mdb:film/2685 “A Clockwork Orange” refs:label mdb:film/424 “Spartacus” refs:label mdb:actor/30013 movie:relatedBook scam:hasOffer foaf:based near movie:actor movie:director movie:actor movie:actor movie:actor movie:director movie:director Subgraph M atching Advantages Maintains the graph structure Full set of queries can be handled © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 56 / 96
  146. 146. Graph-based Approach Answering SPARQL query ≡ subgraph matching using homomorphism gStore [Zou et al., 2011, 2014], chameleon-db [Alu¸c et al., 2013] ?m ?d movie:director ?name rdfs:label ?b movie:relatedBook “Stanley Kubrick” movie:director name ?r rev:rating FILTER(?r 4.0) mdb:film/2014 “1980-05-23” movie:initial release date “The Shining” refs:label bm:books/0743424425 4.7 rev:rating bm:offers/0743424425amazonOffer geo:2635167 “United Kingdom” gn:name 62348447 gn:population mdb:actor/29704 “Jack Nicholson” movie:actor name mdb:film/3418 “The Passenger” refs:label mdb:film/1267 “The Last Tycoon” refs:label mdb:director/8476 “Stanley Kubrick” movie:director name mdb:film/2685 “A Clockwork Orange” refs:label mdb:film/424 “Spartacus” refs:label mdb:actor/30013 movie:relatedBook scam:hasOffer foaf:based near movie:actor movie:director movie:actor movie:actor movie:actor movie:director movie:director Subgraph M atching Advantages Maintains the graph structure Full set of queries can be handled Disadvantages Graph pattern matching is expensive © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 56 / 96
  147. 147. Two Systems gStoreSystem Architecture Offline Online Storage Input Input RDF Parser RDF Graph Builder Encoding Module VS*-tree builder RDF data RDF Triples RDF Graph Signature Graph Key-Value Store VS*-tree Store SPARQL Parser SPARQL Query Encoding Module VS*-tree Query Graph Filter Module Join Module Signature Graph Node Candidate Results Fig. 4. System Architecture itstrings, denoted as vS ig(u). We encode query Q with the ame encoding method. Consequently, the match between Q nd G can be verified by simply checking the match between orresponding encoded bitstrings. chameleon-db Structural Index ... Vertex Index Spill Index ClusterIndexStorageSystem StorageAdvisor Query Engine Plan Generation Evaluation © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 57 / 96
  148. 148. gStore General Approach: Work directly on the RDF graph and the SPARQL query graph Use a signature-based encoding of each entity and class vertex to speed up matching Filter-and-evaluate Use a false positive algorithm to prune nodes and obtain a set of candidates; then do more detailed evaluation on those Use an index (VS∗-tree) over the data signature graph (has light maintenance load) for efficient pruning © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 58 / 96
  149. 149. 1. Encode Q and G to Get Signature Graphs Query signature graph Q∗ 0100 0000 1000 0000 00010 0000 0100 10000 Data signature graph G∗ 0010 1000 0100 0001 00001 1000 0001 00010 0000 0100 10000 0000 1000 10000 0000 0010 10000 0000 1001 00100 0001 0001 01000 0100 1000 01000 1001 1000 01000 0001 0100 01000 © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 59 / 96
  150. 150. 2. Filter-and-Evaluate Query signature graph Q∗ 0100 0000 1000 0000 00010 0000 0100 10000 Data signature graph G∗ 0010 1000 0100 0001 00001 1000 0001 00010 0000 0100 10000 0000 1000 10000 0000 0010 10000 0000 1001 00100 0001 0001 01000 0100 1000 01000 1001 1000 01000 0001 0100 01000 Find matches of Q∗ over signature graph G∗ Verify each match in RDF graph G © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 60 / 96
  151. 151. How to Generate Candidate List Two step process: 1 For each node of Q∗ get lists of nodes in G∗ that include that node. 2 Do a multi-way join to get the candidate list © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 61 / 96
  152. 152. How to Generate Candidate List Two step process: 1 For each node of Q∗ get lists of nodes in G∗ that include that node. 2 Do a multi-way join to get the candidate list Alternatives: © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 61 / 96
  153. 153. How to Generate Candidate List Two step process: 1 For each node of Q∗ get lists of nodes in G∗ that include that node. 2 Do a multi-way join to get the candidate list Alternatives: Sequential scan of G∗ Both steps are inefficient © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 61 / 96
  154. 154. How to Generate Candidate List Two step process: 1 For each node of Q∗ get lists of nodes in G∗ that include that node. 2 Do a multi-way join to get the candidate list Alternatives: Sequential scan of G∗ Both steps are inefficient Use S-trees Height-balanced tree over signatures Run an inclusion query for each node of Q∗ and get lists of nodes in G∗ that include that node. • Given query signature q and a set of data signatures S, find all data signatures si ∈ S where qsi = q Does not support second step – expensive © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 61 / 96
  155. 155. How to Generate Candidate List Two step process: 1 For each node of Q∗ get lists of nodes in G∗ that include that node. 2 Do a multi-way join to get the candidate list Alternatives: Sequential scan of G∗ Both steps are inefficient Use S-trees Height-balanced tree over signatures Run an inclusion query for each node of Q∗ and get lists of nodes in G∗ that include that node. • Given query signature q and a set of data signatures S, find all data signatures si ∈ S where qsi = q Does not support second step – expensive VS-tree (and VS∗ -tree) Multi-resolution summary graph based on S-tree Supports both steps efficiently Grouping by vertices © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 61 / 96
  156. 156. S-tree Solution 1111 1111 0110 1111 1101 1101 0000 1110 0110 1001 1100 1001 1001 1101 0000 1000 0000 0100 0000 0010 0010 1000 0100 0001 1000 0001 0000 1001 0100 1000 1001 1000 0001 0100 0001 0001 005 004 006 001 002 003 007 011 008 009 010 d1 1 d2 1 d2 2 d3 1 d3 2 d3 3 d3 4 G3 G2 G1 1000 00000100 0000 00010 0000 0100 10000 © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 62 / 96
  157. 157. S-tree Solution 1111 1111 0110 1111 1101 1101 0000 1110 0110 1001 1100 1001 1001 1101 0000 1000 0000 0100 0000 0010 0010 1000 0100 0001 1000 0001 0000 1001 0100 1000 1001 1000 0001 0100 0001 0001 005 004 006 001 002 003 007 011 008 009 010 d1 1 d2 1 d2 2 d3 1 d3 2 d3 3 d3 4 G3 G2 G1 1000 00000100 0000 00010 0000 0100 10000 © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 62 / 96
  158. 158. S-tree Solution 1111 1111 0110 1111 1101 1101 0000 1110 0110 1001 1100 1001 1001 1101 0000 1000 0000 0100 0000 0010 0010 1000 0100 0001 1000 0001 0000 1001 0100 1000 1001 1000 0001 0100 0001 0001 005 004 006 001 002 003 007 011 008 009 010 d1 1 d2 1 d2 2 d3 1 d3 2 d3 3 d3 4 G3 G2 G1 1000 00000100 0000 00010 0000 0100 10000 002 011 © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 62 / 96
  159. 159. S-tree Solution 1111 1111 0110 1111 1101 1101 0000 1110 0110 1001 1100 1001 1001 1101 0000 1000 0000 0100 0000 0010 0010 1000 0100 0001 1000 0001 0000 1001 0100 1000 1001 1000 0001 0100 0001 0001 005 004 006 001 002 003 007 011 008 009 010 d1 1 d2 1 d2 2 d3 1 d3 2 d3 3 d3 4 G3 G2 G1 1000 00000100 0000 00010 0000 0100 10000 002 011 003 008 © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 62 / 96
  160. 160. S-tree Solution 1111 1111 0110 1111 1101 1101 0000 1110 0110 1001 1100 1001 1001 1101 0000 1000 0000 0100 0000 0010 0010 1000 0100 0001 1000 0001 0000 1001 0100 1000 1001 1000 0001 0100 0001 0001 005 004 006 001 002 003 007 011 008 009 010 d1 1 d2 1 d2 2 d3 1 d3 2 d3 3 d3 4 G3 G2 G1 1000 00000100 0000 00010 0000 0100 10000 002 011 003 008 004 009 © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 62 / 96
  161. 161. S-tree Solution 1111 1111 0110 1111 1101 1101 0000 1110 0110 1001 1100 1001 1001 1101 0000 1000 0000 0100 0000 0010 0010 1000 0100 0001 1000 0001 0000 1001 0100 1000 1001 1000 0001 0100 0001 0001 005 004 006 001 002 003 007 011 008 009 010 d1 1 d2 1 d2 2 d3 1 d3 2 d3 3 d3 4 G3 G2 G1 1000 00000100 0000 00010 0000 0100 10000 002 011 003 008 004 009 © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 62 / 96
  162. 162. S-tree Solution 1111 1111 0110 1111 1101 1101 0000 1110 0110 1001 1100 1001 1001 1101 0000 1000 0000 0100 0000 0010 0010 1000 0100 0001 1000 0001 0000 1001 0100 1000 1001 1000 0001 0100 0001 0001 005 004 006 001 002 003 007 011 008 009 010 d1 1 d2 1 d2 2 d3 1 d3 2 d3 3 d3 4 G3 G2 G1 1000 00000100 0000 00010 0000 0100 10000 002 011 003 008 004 009 Possibly large join space! © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 62 / 96
  163. 163. VS-tree 1111 1111 0110 1111 1101 1101 0000 1110 0110 1001 1100 1001 1001 1101 0000 1000 0000 0100 0000 0010 0010 1000 0100 0001 1000 0001 0000 1001 0100 1000 1001 1000 0001 0100 0001 0001 005 004 006 001 002 003 007 011 008 009 010 d1 1 d2 1 d2 2 d3 1 d3 2 d3 3 d3 4 G3 G2 G1 11101 10010 10001 01100 10000 00001 01100 00010 10000 01000 01000 10000 10000 10000 10000 00010 00100 01000 01000 01000 01000 Super edge © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 63 / 96
  164. 164. Pruning with VS-Tree 1111 1111 0110 1111 1101 1101 0000 1110 0110 1001 1100 1001 1001 1101 0000 1000 0000 0100 0000 0010 0010 1000 0100 0001 1000 0001 0000 1001 0100 1000 1001 1000 0001 0100 0001 0001 005 004 006 001 002 003 007 011 008 009 010 d1 1 d2 1 d2 2 d3 1 d3 2 d3 3 d3 4 G3 G2 G1 11101 10010 10001 01100 10000 00001 01100 00010 10000 01000 01000 10000 10000 10000 10000 00010 00100 01000 01000 01000 01000 1000 00000100 0000 00010 0000 0100 10000 © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 64 / 96
  165. 165. Pruning with VS-Tree 1111 1111 0110 1111 1101 1101 0000 1110 0110 1001 1100 1001 1001 1101 0000 1000 0000 0100 0000 0010 0010 1000 0100 0001 1000 0001 0000 1001 0100 1000 1001 1000 0001 0100 0001 0001 005 004 006 001 002 003 007 011 008 009 010 d1 1 d2 1 d2 2 d3 1 d3 2 d3 3 d3 4 G3 G2 G1 11101 10010 10001 01100 10000 00001 01100 00010 10000 01000 01000 10000 10000 10000 10000 00010 00100 01000 01000 01000 01000 1000 00000100 0000 00010 0000 0100 10000 © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 64 / 96
  166. 166. Pruning with VS-Tree 1111 1111 0110 1111 1101 1101 0000 1110 0110 1001 1100 1001 1001 1101 0000 1000 0000 0100 0000 0010 0010 1000 0100 0001 1000 0001 0000 1001 0100 1000 1001 1000 0001 0100 0001 0001 005 004 006 001 002 003 007 011 008 009 010 d1 1 d2 1 d2 2 d3 1 d3 2 d3 3 d3 4 G3 G2 G1 11101 10010 10001 01100 10000 00001 01100 00010 10000 01000 01000 10000 10000 10000 10000 00010 00100 01000 01000 01000 01000 1000 00000100 0000 00010 0000 0100 10000 d3 2 d3 3 d3 3 d3 4 d3 1 d3 4 G3 00010 10000 01000 © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 64 / 96
  167. 167. Pruning with VS-Tree 1111 1111 0110 1111 1101 1101 0000 1110 0110 1001 1100 1001 1001 1101 0000 1000 0000 0100 0000 0010 0010 1000 0100 0001 1000 0001 0000 1001 0100 1000 1001 1000 0001 0100 0001 0001 005 004 006 001 002 003 007 011 008 009 010 d1 1 d2 1 d2 2 d3 1 d3 2 d3 3 d3 4 G3 G2 G1 11101 10010 10001 01100 10000 00001 01100 00010 10000 01000 01000 10000 10000 10000 10000 00010 00100 01000 01000 01000 01000 1000 00000100 0000 00010 0000 0100 10000 d3 2 d3 3 d3 3 d3 4 d3 1 d3 4 G3 00010 10000 01000 003 008 002 011 004 009 © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 64 / 96
  168. 168. Pruning with VS-Tree 1111 1111 0110 1111 1101 1101 0000 1110 0110 1001 1100 1001 1001 1101 0000 1000 0000 0100 0000 0010 0010 1000 0100 0001 1000 0001 0000 1001 0100 1000 1001 1000 0001 0100 0001 0001 005 004 006 001 002 003 007 011 008 009 010 d1 1 d2 1 d2 2 d3 1 d3 2 d3 3 d3 4 G3 G2 G1 11101 10010 10001 01100 10000 00001 01100 00010 10000 01000 01000 10000 10000 10000 10000 00010 00100 01000 01000 01000 01000 1000 00000100 0000 00010 0000 0100 10000 d3 2 d3 3 d3 3 d3 4 d3 1 d3 4 G3 00010 10000 01000 003 008 002 011 004 009 © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 64 / 96
  169. 169. Pruning with VS-Tree 1111 1111 0110 1111 1101 1101 0000 1110 0110 1001 1100 1001 1001 1101 0000 1000 0000 0100 0000 0010 0010 1000 0100 0001 1000 0001 0000 1001 0100 1000 1001 1000 0001 0100 0001 0001 005 004 006 001 002 003 007 011 008 009 010 d1 1 d2 1 d2 2 d3 1 d3 2 d3 3 d3 4 G3 G2 G1 11101 10010 10001 01100 10000 00001 01100 00010 10000 01000 01000 10000 10000 10000 10000 00010 00100 01000 01000 01000 01000 1000 00000100 0000 00010 0000 0100 10000 d3 2 d3 3 d3 3 d3 4 d3 1 d3 4 G3 00010 10000 01000 003 008 002 011 004 009 Reduced join space! © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 64 / 96
  170. 170. Adaptivity to Workload Applications that rely on RDF data are increasingly popular and are more varied [Verborgh et al., 2014] Data that are being handled are far more heterogeneous [Duan et al., 2011] SPARQL queries are becoming more diverse [Arias et al., 2011] and dynamic [Kirchberg et al., 2011] An experiment [Alu¸c et al., 2014a] No single system is a sole winner across all queries No single system is the sole loser across all queries, either There can be 2–5 orders of magnitude difference in the performance (i.e., query execution time) between the best and the worst system for a given query The winner in one query may timeout in another Performance difference widens as dataset size increases © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 65 / 96
  171. 171. Adaptivity to Workload Applications that rely on RDF data are increasingly popular and are more varied [Verborgh et al., 2014] Data that are being handled are far more heterogeneous [Duan et al., 2011] SPARQL queries are becoming more diverse [Arias et al., 2011] and dynamic [Kirchberg et al., 2011] An experiment [Alu¸c et al., 2014a] No single system is a sole winner across all queries No single system is the sole loser across all queries, either There can be 2–5 orders of magnitude difference in the performance (i.e., query execution time) between the best and the worst system for a given query The winner in one query may timeout in another Performance difference widens as dataset size increases Can existing systems cope with these trends – workload diversity dynamism No! [Alu¸c et al., 2014b] Fragmented data Suboptimal pruning by indexes Unnecessarily large sets of intermediate result tuples © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 65 / 96
  172. 172. Adaptivity to Workload Applications that rely on RDF data are increasingly popular and are more varied [Verborgh et al., 2014] Data that are being handled are far more heterogeneous [Duan et al., 2011] SPARQL queries are becoming more diverse [Arias et al., 2011] and dynamic [Kirchberg et al., 2011] An experiment [Alu¸c et al., 2014a] No single system is a sole winner across all queries No single system is the sole loser across all queries, either There can be 2–5 orders of magnitude difference in the performance (i.e., query execution time) between the best and the worst system for a given query The winner in one query may timeout in another Performance difference widens as dataset size increases Our proposal: Idea behind chameleon-db When designing and implementing an RDF data management system, assume nothing about the workload upfront Organize data dynamically and purely based on the workload © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 65 / 96
  173. 173. Group-by-Query Approach v1 v21 A v21 v20 B v1 v98 A v98 v30 C v1 v250 A v250 v40 D v0 v32 A v0 v52 C v0 v80 C v0 v66C v0 v47 B v6 v7 A v6 v8 B v6 v9 C C1 C2 C3 C4 C5 Characteristics: Records are not necessarily of fixed length Records are not grouped into tables Records do not necessarily share the same set of RDF predicates Each record represents a very tiny part of the RDF graph © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 66 / 96
  174. 174. Group-by-Query Approach v1 v21 A v21 v20 B v1 v98 A v98 v30 C v1 v250 A v250 v40 D v0 v32 A v0 v52 C v0 v80 C v0 v66C v0 v47 B v6 v7 A v6 v8 B v6 v9 C C1 C2 C3 C4 C5 ?x ?y A ?y ?z ?b Figure: Query 1 © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 66 / 96
  175. 175. Group-by-Query Approach v1 v21 A v21 v20 B v1 v98 A v98 v30 C v1 v250 A v250 v40 D v0 v32 A v0 v52 C v0 v80 C v0 v66C v0 v47 B v6 v7 A v6 v8 B v6 v9 C C1 C2 C3 C4 C5 ?x ?y A ?y ?z ?b Figure: Query 1 © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 66 / 96
  176. 176. Group-by-Query Approach v1 v21 A v21 v20 B v1 v98 A v98 v30 C v1 v250 A v250 v40 D v0 v32 A v0 v52 C v0 v80 C v0 v66C v0 v47 B v6 v7 A v6 v8 B v6 v9 C C1 C2 C3 C4 C5 ?x ?y A ?y ?z ?b Figure: Query 1 © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 66 / 96
  177. 177. Group-by-Query Approach v1 v21 A v21 v20 B v1 v98 A v98 v30 C v1 v250 A v250 v40 D v0 v32 A v0 v52 C v0 v80 C v0 v66C v0 v47 B v6 v7 A v6 v8 B v6 v9 C C1 C2 C3 C4 C5 ?a ?z C ?a ?x A ?a ?y B Figure: Query 2 © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 66 / 96
  178. 178. Group-by-Query Approach v1 v21 A v21 v20 B v1 v98 A v98 v30 C v1 v250 A v250 v40 D v0 v32 A v0 v52 C v0 v80 C v0 v66C v0 v47 B v6 v7 A v6 v8 B v6 v9 C C1 C2 C3 C4 C5 ?a ?z C ?a ?x A ?a ?y B Figure: Query 2 © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 66 / 96
  179. 179. Group-by-Query Approach v1 v21 A v21 v20 B v1 v98 A v98 v30 C v1 v250 A v250 v40 D v0 v32 A v0 v52 C v0 v80 C v0 v66C v0 v47 B v6 v7 A v6 v8 B v6 v9 C C1 C2 C3 C4 C5 ?a ?z C ?a ?x A ?a ?y B Figure: Query 2 © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 66 / 96
  180. 180. Group-by-Query Approach v1 v21 A v21 v20 B v1 v98 A v98 v30 C v1 v250 A v250 v40 D v0 v32 A v0 v52 C v0 v80 C v0 v66C v0 v47 B v6 v7 A v6 v8 B v6 v9 C C1 C2 C3 C4 C5 ?a ?z C ?a ?x A ?a ?y B Figure: Query 2 Advantages Data are physically clustered for the workload Better pruning by the indexes Fewer intermediate result tuples © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 66 / 96
  181. 181. Challenges Physical Data Layout: As the workloads change, the way data are grouped together may no longer be suitable Hierarchical Clustering Algorithm [Alu¸c et al., 2015] Tunable-LSH [Alu¸c et al., 2015] Indexing: Indexing upfront is not a choice Query Evaluation: Can we execute queries efficiently even when the physical layout is constantly changing? [Alu¸c et al., 2015] © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 67 / 96
  182. 182. chameleon-db Prototype system [Alu¸c et al., 2013] 35,000 lines of code in C++ under Linux (plus code for SPARQL 1.0 parser) Structural Index ... Vertex Index Spill Index ClusterIndexStorageSystem StorageAdvisor Query Engine Plan Generation Evaluation © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 68 / 96
  183. 183. Outline 1 Introduction – Graph Types 2 Property Graph Processing Classification Online querying Offline analytics 3 RDF Graph Querying Data Warehousing Distributed SPARQL Execution Linked Object Data Querying © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 69 / 96
  184. 184. Remember the Environment Distributed environment Some of the data sites can process SPARQL queries – SPARQL endpoints Not all data sites can process queries © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 70 / 96
  185. 185. Remember the Environment Distributed environment Some of the data sites can process SPARQL queries – SPARQL endpoints Not all data sites can process queries Alternatives © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 70 / 96
  186. 186. Remember the Environment Distributed environment Some of the data sites can process SPARQL queries – SPARQL endpoints Not all data sites can process queries Alternatives Data re-distribution + query decomposition © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 70 / 96

×