
Virtuoso, The Prometheus of RDF -- Semantics 2014 Conference Keynote

Virtuoso, The Prometheus of RDF presented by Orri Erling (Virtuoso Program Manager)

Published in: Technology

  1. Virtuoso: The Prometheus of RDF-based Relational Data Management
     By Orri Erling, Virtuoso Program Manager, OpenLink Software
  2. Linked Data at Dawn
     - The Promise and the Practice
     - The Science of Speed
     - The Structure which Is
     - Ongoing Research
     License CC-BY-SA 4.0 (International).
  3. Linked Data Promises
     - RDF is a generic, minimalistic model for describing things
     - RDF has global identifiers and data is self-describing
     - URIs may be dereferenceable
     - RDF is flexible to query and does not force a single hierarchical view like XML
  4. Linked Data Scenarios
     - RDF is used because of
       - schema flexibility
       - global identifiers
     - Inference, if present, is usually trivial
       - Subclass
       - Sub-property
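The "trivial" inference named on slide 4 can be sketched in a few lines. This is a minimal illustration of subclass expansion only, not how Virtuoso implements it; the class and instance names are hypothetical toy data.

```python
from collections import defaultdict

# Hypothetical toy data; these names do not come from the talk.
SUBCLASS_OF = [("Keynote", "Talk"), ("Talk", "Event")]
TYPE_OF = [("semantics2014_keynote", "Keynote")]


def superclasses(cls, subclass_of):
    """Return cls plus every class reachable through subclass edges."""
    parents = defaultdict(set)
    for child, parent in subclass_of:
        parents[child].add(parent)
    seen, stack = {cls}, [cls]
    while stack:
        for parent in parents[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def inferred_types(type_of, subclass_of):
    """Expand each asserted type with all of its superclasses."""
    return {
        (subject, c)
        for subject, cls in type_of
        for c in superclasses(cls, subclass_of)
    }
```

Sub-property inference follows the same pattern, with property names in place of class names.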
  5. Where Triples Come From
     - Relational extracts or web content are converted to and stored as triples
     - NLP extraction
     - New applications with RDF as primary data model
     - Doing SPARQL against data in RDBs is possible but rare, and does not deliver the flexibility
  6. Linked Data Verticals and Patterns
     - Publishing: tagging & annotations, evolving vocabularies
     - Archives: self description, long term identifiers, many versions of schema
     - Semantic search: structured, semi-structured, and full text, all in one
     - Business intelligence: many sources, ease of adding sources, no 6 month DW schema change cycle
     - E-science, often in life sciences: common interchange format, nano-publications, NLP extracts, different users cook their data differently, provenance
  7. The Hopes and Perceptions
     - The age of ad hoc
     - Find insight in any data, when you need it, from any source, any format
     - No data warehouse planning cycles; make your own from the pieces you need, when you need it
     - Still, data integration remains hard work; quality and coverage of sources vary
     - Flexibility may be there, but are performance and scalability on the level?
  8. Yes, But ...
     - Web and Big Data: Everybody reinvents the triple. Self-description, long term identifiers, key-value pairs in many non-RDF use cases
     - SPARQL and RDF would be the natural, standards-compliant choice if they beat SQL, information retrieval, custom big data, key value, map reduce solutions
     - Is this intrinsic to linked data or is this lack of engineering?
     - Linked data has unique advantages in breadth of coverage and expressivity but performance must not lag behind.
  9. What is the RDF Tax?
     - 90% of bad performance comes from non-optimal query plans
     - Some comes from indexing too much (e.g., SQL bulk load with no indices is 50x faster than the equivalent in RDF with all indexed)
     - Some comes from string ops on URIs, literals
     - Some comes from having a join for every attribute. Vectoring and right plans help, though
  10. The Bane of the Triple
      When data is stored as triples:
      - There is structure still but it is harder to exploit. Schema re-emerges as correlations
      - More joins make more possible query plans, bigger errors in plan cost estimation
      - More joining reduces locality
      - Lack of schema causes needless indexing; data takes more space
      - A URI for everything takes space and time
      For the same workload, Virtuoso SQL can also be 2–20x faster than Virtuoso SPARQL
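The "join for every attribute" cost from slides 9 and 10 can be illustrated with a toy triple list. This is a rough sketch, assuming single-valued properties and hypothetical order data. It rebuilds rows with one self-join per property, and counts left-deep join orders as a simple proxy for how the plan space grows.

```python
from math import factorial

# Hypothetical toy data: two orders, three single-valued properties each.
TRIPLES = [
    ("order1", "customer", "alice"),
    ("order1", "date", "2014-09-04"),
    ("order1", "total", 120),
    ("order2", "customer", "bob"),
    ("order2", "date", "2014-09-05"),
    ("order2", "total", 80),
]


def rows_from_triples(triples, properties):
    """Rebuild rows by joining one triple pattern per property on the subject."""
    first = properties[0]
    rows = {s: {first: o} for s, p, o in triples if p == first}
    for prop in properties[1:]:
        # Each further property is another scan of the triples plus a join.
        matches = {s: o for s, p, o in triples if p == prop}
        rows = {
            s: {**row, prop: matches[s]}
            for s, row in rows.items()
            if s in matches
        }
    return rows


def left_deep_join_orders(n_patterns):
    """Number of left-deep orderings of n triple patterns (n factorial)."""
    return factorial(n_patterns)
```

A table with the same three columns would return each row from a single lookup. With triples, three attributes already allow 6 left-deep join orders, and eight allow 40,320.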
  11. The Question is Raised
      - LOD2 FP7, now ending: RDF performance parity with relational?
      - SQL is the senior science. Who ignores history is bound to repeat it
      - Integral mastery of RDB science is a prerequisite, but do not forget the subtle twists of schema-less-ness
  12. Virtuoso RDF Relational DBMS Leadership
      - 2000–2006, v1.x–4.x: SQL row store with SQL federation and XML
      - 2007–2008, v5.x–6.x: SPARQL, adapted for RDF quads with more compression, bitmap indices, special data types, RDF awareness in query optimization
      - 2009, v6.x: Scale-out cluster-capable
      - 2010–2013, v7.x: Column store, vectored execution, 3x more space efficient, 10+x more speed
      - 2013: Star Schema benchmark with SPARQL, 100x MySQL SQL, 0.8x MonetDB SQL
      - 2014: Top of the line SQL analytics, 500 Gtriples, Structure Awareness
  13. Triples Done Right, so?
      - Column-store techniques are a good fit; index-based triple storage does not get much better
      - RAM-only pointer-based techniques can be faster but cost 10–100x more to scale up
      - To take RDF to SQL parity, Virtuoso must first be on the level with the best in SQL
      - TPC-H is the checklist for mastery of DW and query optimization; who survives shall not fear
      - Parity is achieved when running with triples, just like with tables
  14. Structure is Everywhere
      CWI in LOD2:
      - 90% of triples in Common Crawl fall into 20 tables
      - All relational extractions are 100% tables
      - Even DBpedia is 90% covered by 500 tables, but is unusually heterogeneous, albeit not very large
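Coverage figures like those on slide 14 can be approximated by grouping subjects by the set of properties they carry and asking how many triples the most common sets account for. The sketch below is one simple way to measure this, not necessarily the method CWI used; the data is hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical toy data: four regular "person" subjects and two odd ones.
TRIPLES = [
    ("p1", "name", "A"), ("p1", "age", 30),
    ("p2", "name", "B"), ("p2", "age", 41),
    ("p3", "name", "C"), ("p3", "age", 25),
    ("p4", "name", "D"), ("p4", "age", 52),
    ("c1", "label", "X"),
    ("d1", "name", "E"),
]


def property_sets(triples):
    """Map each subject to the set of properties it uses."""
    by_subject = defaultdict(set)
    for s, p, _ in triples:
        by_subject[s].add(p)
    return by_subject


def coverage_by_top_shapes(triples, top_n):
    """Fraction of triples whose subject has one of the top_n property sets."""
    if not triples:
        return 0.0
    by_subject = property_sets(triples)
    triples_per_shape = Counter()
    for s, _, _ in triples:
        triples_per_shape[frozenset(by_subject[s])] += 1
    covered = sum(n for _, n in triples_per_shape.most_common(top_n))
    return covered / len(triples)
```

On this toy data the single most common property set, `{name, age}`, covers 8 of the 10 triples. Each such set is a candidate table.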
  15. The Glorious Dawn: Structure is the Servant, not the Tyrant
      - A set of subjects with all the same single-valued properties is in fact a table.
      - So, store it as a table
      - Allow exceptions, e.g., sometimes multiple values, different values in different graphs, extra properties, etc.
      - If it is big, it has repeating structure
      - All RDF semantics are preserved; any triple is possible, but the common ones are SQL compact and SQL fast
      - With tables, query optimization returns to SQL complexity and is much more reliable
      - So, more tricks from the SQL analytics bag become safe and applicable
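The idea on slide 15 can be sketched as a split of triples into table rows plus leftover exception triples. This is an illustrative toy, not Virtuoso's storage format. The declared column set, the data, and the function names are all assumptions made for the example, and graphs are ignored.

```python
from collections import defaultdict

# Hypothetical toy data: p2 has a multi-valued "age" and an extra property.
TRIPLES = [
    ("p1", "name", "Ann"), ("p1", "age", 30),
    ("p2", "name", "Bob"), ("p2", "age", 41), ("p2", "age", 42),
    ("p2", "nick", "bobby"),
]


def split_into_table_and_exceptions(triples, columns):
    """Put single-valued declared properties in rows; keep the rest as triples."""
    values = defaultdict(lambda: defaultdict(list))
    for s, p, o in triples:
        values[s][p].append(o)
    table, exceptions = {}, []
    for s, props in values.items():
        row = {}
        for p, objs in props.items():
            if p in columns and len(objs) == 1:
                row[p] = objs[0]  # regular case: one cell in the row
            else:
                # multi-valued or undeclared property: stays a plain triple
                exceptions.extend((s, p, o) for o in objs)
        if row:
            table[s] = row
    return table, exceptions


def all_triples(table, exceptions):
    """Reconstruct the full triple set, showing that nothing is lost."""
    from_rows = [(s, p, o) for s, row in table.items() for p, o in row.items()]
    return from_rows + exceptions
```

With `columns = {"name", "age"}`, `p1` becomes a full row, `p2` keeps only `name` in its row, and its two `age` values and its `nick` remain triples. `all_triples` returns the original set, which is the sense in which any triple stays possible while the common ones are stored as a table.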
  16. Gains from Structure Awareness
      - 3+x load speed
      - 2x more space efficiency
      - SPARQL queries against regular data within 10–20% of SQL speeds
      - Just declare which properties tend to occur together; no strict schema-first like with SQL
      - Later, self configuration
  17. The Cycle of Adventure
      - Rebels: SQL not cool, too rigid, drop ACID, go key-value, map-reduce, the triple is all there is, semantic web
      - Pioneers: Life on the frontier is hard, infrastructure missing or bad
      - Same everyday problems also in Utopia
      - Recognizing the objective values, e.g., schema freedom and identifiers, no AI. Do the job, forget dogma
      - Reconciliation: schema-first and schema-last converge in structure awareness
  18. Present FP7 Research
      - LDBC: Transparency and Relevance for Graph DB, RDF performance
      - GeoKnow: GeoData is everywhere, how to carry the planet in your pocket
      - LOD2: Where no triple has gone before (and come back)
      - Open PHACTS: A Data Platform for Drug Discovery
  19. LDBC - Linked Data Benchmark Council
      - Rebels: SQL not cool, too rigid, drop ACID, go key-value, map-reduce, the triple is all there is, semantic web
      - Pioneers: Life on the frontier is hard, infrastructure missing or bad
      - Same everyday problems also in Utopia
      - Recognizing the objective values, e.g., schema freedom and identifiers, no AI. Do the job, forget dogma
      - Reconciliation: Some of the rebel thinking becomes mainstream, e.g., schema-first and schema-last converge in structure awareness
  20. LDBC, Independent Industry Forum for Benchmarking
      - The TPC for the frontiers of database
      - Bootstrapped in the LDBC FP7, continues as independent industry association
      - OpenLink, Ontotext, Neo Technology, Sparsity as founding members
      - IBM, Oracle Labs, Systap, SPARQL City already joined
      - DB superstars Peter Boncz and Thomas Neumann as founders and scientific leads
  21. LDBC Benchmarks
      Social Network
      - Online: Lookups, updates, analysis of social environment
      - Business Intelligence: Spotting trends, key players, big query
      - Graph analytics: Community detection, PageRank, graph metrics
      Semantic Publishing
      - Modeled after the BBC linked data portal, online lookups, drill downs and updates
  22. GeoKnow - The Planet in your Pocket
      Ms. Globe and Mr. Cube have a thing going on:
      - Mr. Cube: Desiloization ... integrated metadata ... Explicit semantics.
      - Ms. Globe: I can feel it ... but are you man enough? ... you need to show me.
  23. Planet Scale Roadmap
      Jan 2014:
      - Virtuoso SPARQL outperforms PostGIS in map lookups with planet-wide OpenStreetMap
      - Virtuoso SQL adds 5x more power
  24. Next: Jan 2015
      - Parity between SPARQL and SQL via structure awareness
      - Geospatial data clustering
      - Graph analytics close to the data (Pregel, Giraph, etc.) in the DB itself
      - Adding fine-grained geo dimension to LDBC social network benchmark
  25. The LOD2 scaling adventures
      Experiments at CWI’s SciLens cluster
      - Jan 2013: 150 Gtriples (8 x 256GB RAM)
      - Aug 2014: 500 Gtriples (12 x 256GB RAM)
      - Some trillion-triple claims exist, but do not detail any query workload
      BSBM explore and BI workloads
      - 10x speed gains for BI queries between 2013 and 2014
      Bulk load at 6M triples/s
      - All done in triples, structure awareness will go further still
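The figures on slide 25 allow some back-of-the-envelope arithmetic. The sketch below assumes decimal gigabytes, and it assumes the 6M triples/s load rate could be sustained over the whole 500 Gtriple data set, which the slide does not state. The results are aggregate cluster RAM per triple and a hypothetical load time, not measured storage footprints or measured load runs.

```python
def ram_bytes_per_triple(nodes, gb_per_node, triples):
    """Aggregate cluster RAM divided by triple count (1 GB taken as 10**9 bytes)."""
    return nodes * gb_per_node * 10**9 / triples


def load_hours(triples, triples_per_second):
    """Hours to load at a constant rate (an assumption, not a reported result)."""
    return triples / triples_per_second / 3600


# Figures from the slide; Gtriples taken as 10**9 triples.
jan_2013 = ram_bytes_per_triple(8, 256, 150 * 10**9)
aug_2014 = ram_bytes_per_triple(12, 256, 500 * 10**9)
full_load = load_hours(500 * 10**9, 6 * 10**6)
```

Under these assumptions the cluster had roughly 13.7 bytes of RAM per triple in Jan 2013 and about 6.1 in Aug 2014, and a constant 6M triples/s would load 500 Gtriples in about 23 hours.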
  26. Open PHACTS Partners:
  27. Virtuoso Now
      Snapshot of RDF Linked Data customers in the Enterprise:
      - Data.Gov (U.S. Govt. Open Linked Data initiative)
      - Bank of America
      - Booz Allen Hamilton
      - Northrop Grumman
      - Elsevier
      - French National Library
      - Samsung
      - Globo
      - Daimler Benz
      - Johnson & Johnson
      - Bayer
      - St. Jude Medical
      - Fujitsu
      - Syngenta
      - and many more
  28. Virtuoso Availability
      - Most capabilities as open source
      - Commercial adds
        - Cluster scale-out
        - SQL Federation
        - Replication (SQL & RDF)
        - Advanced RDF security; ABAC & RBAC (ACLs)
        - Wide tables
        - and more
      - Up to the minute tech previews via v7fasttrack on GitHub, e.g., superfast TPC-H implementation
  29. Virtuoso Future
      - Preview of structure-aware RDF store in fall 2014 via v7fasttrack
      - Integrated graph analytics framework
        - Embed complex graph algorithms, e.g., community detection, shortest path inside SPARQL/SQL
      - Comparison of SQL and SPARQL for big data analytics
  30. Linked Data Now
      - Adoption across major industries
      - Superior flexibility and time to solution
      - Dramatic performance gains in the last 5 years
      - Benchmarking will continue to drive progress, to the benefit of users and vendors alike
      - Run circles around most open source SQL in SPARQL: Virtuoso SPARQL beats MySQL in SSB by 100x
      - With structure awareness, SPARQL to match the best in SQL for data warehousing, OLTP
      - Linked Data no longer a long shot but a technology that makes sense
  31. About OpenLink Software
      OpenLink Software is a privately-held company founded in 1992 by its President & CEO, Kingsley Idehen. The company is an industry acclaimed technology innovator in the following areas:
      - ODBC, JDBC, ADO.NET, and OLE DB compliant Data Access Drivers for Oracle, Microsoft SQL Server, Informix, Ingres, Sybase, Progress, MySQL, and PostgreSQL
      - High-Performance & Scalable Multi-Model (Relational & Graph) Database Technology
      - Data Integration Middleware (Data Virtualization Technology across a wide variety of Protocols & Formats)
      - Socially-enhanced Distributed Collaborative Applications Platforms (Weblogs, Wikis, Feed Aggregation and Syndication, Web File Systems, Discussion Forums, etc.)
      - Web Application Server Technology
      - Linked Data Deployment & Management
      - Identity Management
  32. Office Locations
      - USA: OpenLink Software, Inc, 10 Burlington Mall Road, Suite 265, Burlington, MA 01803. Tel.: +1 781 273 0900, Fax: +1 781 229 8030
      - UK: OpenLink Software Ltd., Airport House, Purley Way, Croydon, Surrey CR0 0XZ. Tel.: +44 (0)20 8681 7701, Fax: +44 (0)20 8681 7702
  33. Additional Information
      Web Sites
      - OpenLink Software
      - YouID – Digital Identity Card (Certificate) Generator
      - OpenLink Data Spaces – Semantically enhanced Personal & Enterprise Data Spaces & Collaboration Platform
      - OpenLink Virtuoso - Hybrid Data Management, Integration, Application, and Identity Server
      - Universal Data Access Drivers - High-Performance ODBC, JDBC, ADO.NET, and OLE-DB Drivers
      - LDAP and NetID-TLS – How to use LDAP scheme URIs with NetID-TLS Authentication
      Social Media Data spaces
      - http://www.openlinksw.com/weblog/oerling/ (Orri Erling weblog)
      - http://kidehen.blogspot.com (Kingsley Idehen weblog)
      - http://www.openlinksw.com/blog/~kidehen/ (Kingsley Idehen weblog)
      - https://twitter.com/OpenLink (Twitter)
      - Hashtags: #LinkedData #SemanticWeb #BigData #RDF (Anywhere).
