Virtuoso, The Prometheus of RDF -- Semantics 2014 Conference Keynote

  1. Virtuoso: The Prometheus of RDF-based Relational Data Management
     By Orri Erling, Virtuoso Program Manager, OpenLink Software
  2. Linked Data at Dawn
     - The Promise and the Practice
     - The Science of Speed
     - The Structure which Is
     - Ongoing Research
     License: CC-BY-SA 4.0 (International).
  3. Linked Data Promises
     - RDF is a generic, minimalistic model for describing things
     - RDF has global identifiers and data is self-describing
     - URIs may be dereferenceable
     - RDF is flexible to query; it does not force a single hierarchical view the way XML does
  4. Linked Data Scenarios
     - RDF is used because of
       - schema flexibility
       - global identifiers
     - Inference, if present, is usually trivial
       - Subclass
       - Sub-property
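The "trivial" inference named above (subclass and sub-property) can be sketched in a few lines. This is a minimal illustration, not how Virtuoso implements it: the function names `ancestors` and `infer` and the `ex:` identifiers in the usage note are hypothetical, and triples are assumed to be plain Python tuples.

```python
from collections import defaultdict

RDF_TYPE = "rdf:type"
SUBCLASS = "rdfs:subClassOf"
SUBPROP = "rdfs:subPropertyOf"


def ancestors(start, parent_of):
    """Return start plus everything reachable from it through parent_of."""
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        for parent in parent_of.get(node, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def infer(triples):
    """Add the triples implied by subclass and sub-property declarations."""
    super_class = defaultdict(set)
    super_prop = defaultdict(set)
    for s, p, o in triples:
        if p == SUBCLASS:
            super_class[s].add(o)
        elif p == SUBPROP:
            super_prop[s].add(o)

    result = set(triples)
    for s, p, o in triples:
        if p == RDF_TYPE:
            # An instance of a class is also an instance of every superclass.
            for cls in ancestors(o, super_class):
                result.add((s, RDF_TYPE, cls))
        # A statement with a property also holds for every super-property.
        for prop in ancestors(p, super_prop):
            result.add((s, prop, o))
    return result
```

For example, given `ex:Dog rdfs:subClassOf ex:Mammal` and `ex:rex rdf:type ex:Dog`, `infer` adds `ex:rex rdf:type ex:Mammal`. A production engine would more likely rewrite the query than materialize the extra triples; that choice is outside the scope of this sketch.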
  5. Where Triples Come From
     - Relational extracts or web content are converted to triples and stored as such
     - NLP extraction
     - New applications with RDF as the primary data model
     - Running SPARQL against data in RDBs is possible, but it is rare and does not deliver the flexibility
  6. Linked Data Verticals and Patterns
     - Publishing: tagging and annotations, evolving vocabularies
     - Archives: self-description, long-term identifiers, many versions of schema
     - Semantic search: structured, semi-structured, and full text, all in one
     - Business intelligence: many sources, ease of adding sources, no 6-month DW schema change cycle
     - E-science, often in life sciences: common interchange format, nano-publications, NLP extracts, different users cook their data differently, provenance
  7. The Hopes and Perceptions
     - The age of ad hoc
     - Find insight in any data, when you need it, from any source, in any format
     - No data warehouse planning cycles; make your own from the pieces you need, when you need it
     - Still, data integration remains hard work; quality and coverage of sources vary
     - Flexibility may be there, but are performance and scalability on the level?
  8. Yes, But ...
     - Web and Big Data: everybody reinvents the triple. Self-description, long-term identifiers, and key-value pairs appear in many non-RDF use cases
     - SPARQL and RDF would be the natural, standards-compliant choice if they did beat SQL, information retrieval, custom big data, key-value, and map-reduce solutions
     - Is this intrinsic to linked data, or is it a lack of engineering?
     - Linked data has unique advantages in breadth of coverage and expressivity, but performance must not lag behind
  9. What is the RDF Tax?
     - 90% of bad performance comes from non-optimal query plans
     - Some comes from indexing too much (e.g., SQL bulk load with no indices is 50x faster than the equivalent in RDF with everything indexed)
     - Some comes from string operations on URIs and literals
     - Some comes from having a join for every attribute. Vectoring and the right plans help, though
  10. The Bane of the Triple
     When data is stored as triples:
     - There is structure still, but it is harder to exploit. Schema re-emerges as correlations
     - More joins make for more possible query plans and bigger errors in plan cost estimation
     - More joining reduces locality
     - Lack of schema causes needless indexing; data takes more space
     - A URI for everything takes space and time
     For the same workload, Virtuoso SQL can also be 2–20x faster than Virtuoso SPARQL
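The "join for every attribute" cost can be made concrete with a toy comparison. The sketch below is an assumption-laden illustration, not a model of Virtuoso internals: the order data, the two dictionary "indices", and the probe counters are all hypothetical, and a probe simply means one keyed lookup.

```python
from collections import defaultdict

# Hypothetical order data, first as rows keyed by subject ...
ROWS = {
    "ex:o1": {"status": "shipped", "total": 120, "customer": "ex:alice"},
    "ex:o2": {"status": "open", "total": 75, "customer": "ex:bob"},
    "ex:o3": {"status": "shipped", "total": 40, "customer": "ex:alice"},
}
# ... and the same facts as (subject, predicate, object) triples.
TRIPLES = [(s, p, o) for s, row in ROWS.items() for p, o in row.items()]


def build_triple_indices(triples):
    """Build two toy indices: (p, o) -> subjects and (s, p) -> objects."""
    pos = defaultdict(list)
    spo = defaultdict(list)
    for s, p, o in triples:
        pos[(p, o)].append(s)
        spo[(s, p)].append(o)
    return pos, spo


def shipped_from_triples(pos, spo):
    """Fetch total and customer of shipped orders: one probe per attribute."""
    probes = 1  # find the subjects whose status is "shipped"
    results = []
    for s in pos.get(("status", "shipped"), []):
        total = spo[(s, "total")][0]        # join 1
        customer = spo[(s, "customer")][0]  # join 2
        probes += 2
        results.append((s, total, customer))
    return results, probes


def shipped_from_rows(rows):
    """Same query against rows: one probe per matching row."""
    status_index = defaultdict(list)
    for key, row in rows.items():
        status_index[row["status"]].append(key)
    probes = 1  # find the keys whose status is "shipped"
    results = []
    for key in status_index.get("shipped", []):
        row = rows[key]  # all attributes arrive together
        probes += 1
        results.append((key, row["total"], row["customer"]))
    return results, probes
```

Both functions return the same answer, but the triple version pays one extra probe for every additional attribute per subject, while the row version pays one probe per subject regardless of how many attributes are read.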
  11. The Question is Raised
     - LOD2 FP7, now ending: RDF performance parity with relational?
     - SQL is the senior science. Who ignores history is bound to repeat it
     - Integral mastery of RDB science is a prerequisite, but do not forget the subtle twists of schema-less-ness
  12. Virtuoso RDF Relational DBMS Leadership
     - 2000–2006, v1.x–4.x: SQL row store with SQL federation and XML
     - 2007–2008, v5.x–6.x: SPARQL, adapted for RDF quads with more compression, bitmap indices, special data types, RDF awareness in query optimization
     - 2009, v6.x: Scale-out cluster-capable
     - 2010–2013, v7.x: Column store, vectored execution, 3x more space efficient, 10+x more speed
     - 2013: Star Schema benchmark with SPARQL, 100x MySQL SQL, 0.8x MonetDB SQL
     - 2014: Top of the line SQL analytics, 500 Gtriples, Structure Awareness
  13. Triples Done Right, so?
     - Column-store techniques are a good fit; index-based triple storage does not get much better
     - RAM-only pointer-based techniques can be faster but cost 10–100x more to scale up
     - To take RDF to SQL parity, Virtuoso must first be on the level with the best in SQL
     - TPC-H is the checklist for mastery of DW and query optimization; who survives shall not fear
     - Parity is achieved when running with triples, just like with tables
  14. Structure is Everywhere
     CWI in LOD2:
     - 90% of triples in Common Crawl fall into 20 tables
     - All relational extractions are 100% tables
     - Even DBpedia is 90% covered by 500 tables, but is unusually heterogeneous, albeit not very large
  15. The Glorious Dawn: Structure is the Servant, not the Tyrant
     - A set of subjects with all the same single-valued properties is in fact a table
     - So, store it as a table
     - Allow exceptions, e.g., sometimes multiple values, different values in different graphs, extra properties, etc.
     - If it is big, it has repeating structure
     - All RDF semantics are preserved; any triple is possible, but the common ones are SQL compact and SQL fast
     - With tables, query optimization returns to SQL complexity and is much more reliable
     - So, more tricks from the SQL analytics bag become safe and applicable
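The idea of storing regular subjects as table rows while letting exceptions stay as triples can be sketched as follows. This is only an illustration of the principle under stated assumptions: the functions `partition` and `to_triples` are hypothetical, the "declared" property group is supplied by hand, graphs are ignored, and nothing here reflects Virtuoso's actual storage layout.

```python
from collections import defaultdict


def partition(triples, declared):
    """Split triples into table rows plus leftover exception triples.

    declared is an ordered tuple of properties that tend to occur together.
    A declared property with exactly one value goes into the subject's row;
    multi-valued or undeclared properties remain ordinary triples.
    """
    by_subject = defaultdict(lambda: defaultdict(list))
    for s, p, o in triples:
        by_subject[s][p].append(o)

    table, exceptions = {}, []
    for s, props in by_subject.items():
        row = {}
        for p, values in props.items():
            if p in declared and len(values) == 1:
                row[p] = values[0]
            else:
                exceptions.extend((s, p, v) for v in values)
        if row:
            # Missing declared properties become None, like SQL NULL.
            table[s] = tuple(row.get(p) for p in declared)
    return table, exceptions


def to_triples(table, exceptions, declared):
    """Reconstruct the full triple set, showing that nothing was lost."""
    out = set(exceptions)
    for s, row in table.items():
        out.update((s, p, v) for p, v in zip(declared, row) if v is not None)
    return out
```

With `declared = ("name", "age")`, a subject that has one name and one age becomes a single compact row, while a subject with two ages keeps its name in the row and both ages as exception triples. `to_triples` returns the original set in either case, which is the sense in which any triple remains possible.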
  16. Gains from Structure Awareness
     - 3+x load speed
     - 2x more space efficiency
     - SPARQL queries against regular data within 10–20% of SQL speeds
     - Just declare which properties tend to occur together; no strict schema-first like with SQL
     - Later, self-configuration
  17. The Cycle of Adventure
     - Rebels: SQL not cool, too rigid, drop ACID, go key-value, map-reduce, the triple is all there is, semantic web
     - Pioneers: Life on the frontier is hard, infrastructure missing or bad
     - Same everyday problems also in Utopia
     - Recognizing the objective values, e.g., schema freedom and identifiers, no AI. Do the job, forget dogma
     - Reconciliation: schema-first and schema-last converge in structure awareness
  18. Present FP7 Research
     - LDBC: Transparency and Relevance for Graph DB, RDF performance
     - GeoKnow: GeoData is everywhere, how to carry the planet in your pocket
     - LOD2: Where no triple has gone before (and come back)
     - Open PHACTS: A Data Platform for Drug Discovery
  19. LDBC - Linked Data Benchmark Council
     - Rebels: SQL not cool, too rigid, drop ACID, go key-value, map-reduce, the triple is all there is, semantic web
     - Pioneers: Life on the frontier is hard, infrastructure missing or bad
     - Same everyday problems also in Utopia
     - Recognizing the objective values, e.g., schema freedom and identifiers, no AI. Do the job, forget dogma
     - Reconciliation: Some of the rebel thinking becomes mainstream, e.g., schema-first and schema-last converge in structure awareness
  20. LDBC, Independent Industry Forum for Benchmarking
     - The TPC for the frontiers of database
     - Bootstrapped in the LDBC FP7, continues as an independent industry association
     - OpenLink, Ontotext, Neo Technologies, Sparsity as founding members
     - IBM, Oracle Labs, Systap, SPARQL City already joined
     - DB superstars Peter Boncz and Thomas Neumann as founders and scientific leads
  21. LDBC Benchmarks
     - Social Network
       - Online: Lookups, updates, analysis of social environment
       - Business Intelligence: Spotting trends, key players, big query
       - Graph analytics: Community detection, PageRank, graph metrics
     - Semantic Publishing
       - Modeled after the BBC linked data portal, online lookups, drill downs and updates
  22. GeoKnow - The Planet in your Pocket
     Ms. Globe and Mr. Cube have a thing going on:
     - Mr. Cube: Desiloization ... integrated metadata ... explicit semantics.
     - Ms. Globe: I can feel it ... but are you man enough? ... you need to show me.
  23. Planet Scale Roadmap
     Jan 2014:
     - Virtuoso SPARQL outperforms PostGIS in map lookups with planet-wide Open Street Map
     - Virtuoso SQL adds 5x more power
  24. Next: Jan 2015
     - Parity between SPARQL and SQL via structure awareness
     - Geospatial data clustering
     - Graph analytics close to the data: Pregel, Giraph, etc., in the DB itself
     - Adding a fine-grained geo dimension to the LDBC social network benchmark
  25. The LOD2 scaling adventures
     - Experiments at CWI’s Scilens cluster
       - Jan 2013: 150 Gtriples (8 x 256GB RAM)
       - Aug 2014: 500 Gtriples (12 x 256GB RAM)
       - Some trillion-triple claims exist, but do not detail any query workload
     - BSBM explore and BI workloads
       - 10x speed gains for BI queries between 2013 and 2014
     - Bulk load at 6M triples/s
       - All done in triples; structure awareness will go further still
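A quick back-of-envelope calculation puts the cluster figures above in perspective. The sketch assumes decimal gigabytes, treats the cluster's total RAM as the only budget (the slide does not say whether the whole dataset was memory-resident), and assumes the quoted bulk-load rate applies to the 500 Gtriples run; the helper names are hypothetical.

```python
def ram_bytes_per_triple(triples, nodes, ram_gb_per_node):
    """Total cluster RAM divided by triple count (1 GB taken as 10**9 bytes)."""
    return nodes * ram_gb_per_node * 10**9 / triples


def load_hours(triples, triples_per_second):
    """Hours needed to bulk load at a constant rate."""
    return triples / triples_per_second / 3600


# Figures quoted on the slide.
jan_2013 = ram_bytes_per_triple(150 * 10**9, nodes=8, ram_gb_per_node=256)
aug_2014 = ram_bytes_per_triple(500 * 10**9, nodes=12, ram_gb_per_node=256)
hours_for_500g = load_hours(500 * 10**9, 6 * 10**6)

print(f"Jan 2013: {jan_2013:.2f} bytes of RAM per triple")
print(f"Aug 2014: {aug_2014:.2f} bytes of RAM per triple")
print(f"500 Gtriples at 6M triples/s: {hours_for_500g:.1f} hours")
```

Under these assumptions the RAM available per triple roughly halves between the two experiments, and a 500 Gtriples load at the quoted rate would take on the order of a day.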
  26. Open PHACTS Partners:
  27. Virtuoso Now
     Snapshot of RDF Linked Data customers in the Enterprise:
     - Data.Gov (U.S. Govt. Open Linked Data initiative)
     - Bank of America
     - Booz Allen Hamilton
     - Northrop Grumman
     - Elsevier
     - French National Library
     - Samsung
     - Globo
     - Daimler Benz
     - Johnson & Johnson
     - Bayer
     - St. Jude Medical
     - Fujitsu
     - Syngenta
     - and many more
  28. Virtuoso Availability
     - Most capabilities as open source
     - Commercial adds
       - Cluster scale-out
       - SQL Federation
       - Replication (SQL & RDF)
       - Advanced RDF security; ABAC & RBAC (ACLs)
       - Wide tables
       - and more
     - Up-to-the-minute tech previews via v7fasttrack on GitHub, e.g., superfast TPC-H implementation
  29. Virtuoso Future
     - Preview of structure-aware RDF store in fall 2014 via v7fasttrack
     - Integrated graph analytics framework
       - Embed complex graph algorithms, e.g., community detection, shortest path, inside SPARQL/SQL
     - Comparison of SQL and SPARQL for big data analytics
  30. Linked Data Now
     - Adoption across major industries
     - Superior flexibility and time to solution
     - Dramatic performance gains in the last 5 years
     - Benchmarking will continue to drive progress, to the benefit of users and vendors alike
     - Run circles around most open source SQL in SPARQL: Virtuoso SPARQL beats MySQL in SSB by 100x
     - With structure awareness, SPARQL to match the best in SQL for data warehousing, OLTP
     - Linked Data no longer a long shot but a technology that makes sense
  31. About OpenLink Software
     OpenLink Software is a privately-held company founded in 1992 by its President & CEO, Kingsley Idehen. The company is an industry-acclaimed technology innovator in the following areas:
     - ODBC, JDBC, ADO.NET, and OLE DB compliant Data Access Drivers for Oracle, Microsoft SQL Server, Informix, Ingres, Sybase, Progress, MySQL, and PostgreSQL
     - High-Performance & Scalable Multi-Model (Relational & Graph) Database Technology
     - Data Integration Middleware (Data Virtualization Technology across a wide variety of Protocols & Formats)
     - Socially-enhanced Distributed Collaborative Applications Platforms (Weblogs, Wikis, Feed Aggregation and Syndication, Web File Systems, Discussion Forums, etc.)
     - Web Application Server Technology
     - Linked Data Deployment & Management
     - Identity Management
  32. Office Locations
     - USA: OpenLink Software, Inc, 10 Burlington Mall Road, Suite 265, Burlington, MA 01803. Tel.: +1 781 273 0900. Fax: +1 781 229 8030
     - UK: OpenLink Software Ltd., Airport House, Purley Way, Croydon, Surrey CR0 0XZ. Tel.: +44 (0)20 8681 7701. Fax: +44 (0)20 8681 7702
  33. Additional Information
     - Web Sites
       - OpenLink Software
       - YouID: Digital Identity Card (Certificate) Generator
       - OpenLink Data Spaces: Semantically enhanced Personal & Enterprise Data Spaces & Collaboration Platform
       - OpenLink Virtuoso: Hybrid Data Management, Integration, Application, and Identity Server
       - Universal Data Access Drivers: High-Performance ODBC, JDBC, ADO.NET, and OLE-DB Drivers
       - LDAP and NetID-TLS: How to use LDAP scheme URIs with NetID-TLS Authentication
     - Social Media Data Spaces
       - Orri Erling weblog
       - Kingsley Idehen weblog
       - Twitter
       - Hashtags: #LinkedData #SemanticWeb #BigData #RDF (Anywhere)