Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.
Why is data independence        (still) so important?Julian Hyde @julianhydehttp://github.com/julianhyde/optiqhttp://githu...
Data independenceThis is my opinion about data management systems in general. I dont   claim that it is the right answer f...
About meJulian HydeDatabase hacker (Oracle, Broadbase, SQLstream, LucidDB)Open source hacker (Mondrian, olap4j, LucidDB, O...
http://www.flickr.com/photos/torkildr/3462606643
http://www.flickr.com/photos/sylvar/31436961/
“Big Data”Right data, right timeDiverse data sources / Performance / Suitable formatVolume / Velocity / VarietyVolume – so...
VarietyVariety of source formats (csv, avro, json, weblogs)Variety of storage structures (indexes, projections, sort  orde...
Use case: Optiq* at Splunk    SQL interface on NoSQL system    “Smart” JDBC driver – pushes processing down to Splunk    *...
Expression tree                                 SELECT p.“product_name”, COUNT(*) AS c                                    ...
Expression tree                               SELECT p.“product_name”, COUNT(*) AS c                                      ...
Conventional DBMS architecture                JDBC client                JDBC server                SQL parser /          ...
Drill architecture                      DrQL client                     DrQL parser /                       validator     ...
Optiq architecture                         JDBC client                          JDBC server                 Optional SQL p...
Analogy: Compiler architecture  front end    C++        C          Fortran  middle end         Optimizations  back end    ...
Conclusions     Clear logical / physical separation allows a data        management system to handle a wider variety of da...
Extra material follows...
Writing an adapterDriver – if you want a vanity URL like “jdbc:drill:”Schema – describes what tables existTable – what are...
Upcoming SlideShare
Loading in …5
×

Why is data independence (still) so important? Optiq and Apache Drill.

4,498 views

Published on

Presentation to the Apache Drill Meetup in Sunnyvale, CA on 2012/9/13. Framing the debate about Drill's goals in terms of a "typical" modern DBMS architecture; and also introducing the Optiq extensible query optimizer.

Published in: Technology
  • Be the first to comment

Why is data independence (still) so important? Optiq and Apache Drill.

  1. 1. Why is data independence (still) so important?Julian Hyde @julianhydehttp://github.com/julianhyde/optiqhttp://github.com/julianhyde/optiq-splunkApache Drill Meeting2012/9/13
  2. 2. Data independenceThis is my opinion about data management systems in general. I dont claim that it is the right answer for Apache Drill.I claim that a logical/physical separation can make a data management system more widely applicable, therefore more widely adopted, therefore better.What “data independence” means in todays “big data” world.
  3. 3. About meJulian HydeDatabase hacker (Oracle, Broadbase, SQLstream, LucidDB)Open source hacker (Mondrian, olap4j, LucidDB, Optiq)@julianhydehttp://github.com/julianhyde
  4. 4. http://www.flickr.com/photos/torkildr/3462606643
  5. 5. http://www.flickr.com/photos/sylvar/31436961/
  6. 6. “Big Data”Right data, right timeDiverse data sources / Performance / Suitable formatVolume / Velocity / VarietyVolume – solved :)Velocity – not one of Drills goals (?)Variety – ?
  7. 7. VarietyVariety of source formats (csv, avro, json, weblogs)Variety of storage structures (indexes, projections, sort order, materialized views) now or in futureVariety of query languages (DrQL, SQL)Combine with other data (join, union)Embed within other systems, e.g. HiveSource for other systems, e.g. Drill | Cascading > TeradataTools generate SQL
  8. 8. Use case: Optiq* at Splunk SQL interface on NoSQL system “Smart” JDBC driver – pushes processing down to Splunk * Truth in advertising: I am the author of Optiq.
  9. 9. Expression tree SELECT p.“product_name”, COUNT(*) AS c FROM “splunk”.”splunk” AS s JOIN “mysql”.”products” AS p ON s.”product_id” = p.”product_id” WHERE s.“action” = purchase GROUP BY p.”product_name” Splunk ORDER BY c DESC Table: splunk Key: product_name Key: product_id Agg: count Condition: Key: c DESC action = purchase scan join MySQL filter group sort scan Table: products
  10. 10. Expression tree SELECT p.“product_name”, COUNT(*) AS c FROM “splunk”.”splunk” AS s(optimized) JOIN “mysql”.”products” AS p ON s.”product_id” = p.”product_id” WHERE s.“action” = purchase GROUP BY p.”product_name” Splunk ORDER BY c DESC Condition: Table: splunk action = purchase Key: product_name Agg: count Key: c DESC Key: product_id scan filter MySQL join group sort scan Table: products
  11. 11. Conventional DBMS architecture JDBC client JDBC server SQL parser / validator Metadata Query optimizer Data-flow operators Data Data
  12. 12. Drill architecture DrQL client DrQL parser / validator ? Metadata Data-flow operators Data Data
  13. 13. Optiq architecture JDBC client JDBC server Optional SQL parser / Metadata validator SPI Core Query Pluggable optimizer rules 3rd 3rd Pluggable party party ops ops 3rd party 3rd party data data
  14. 14. Analogy: Compiler architecture front end C++ C Fortran middle end Optimizations back end x86 ARM Fortran
  15. 15. Conclusions Clear logical / physical separation allows a data management system to handle a wider variety of data, query languages, and packaging. Also provides a clear interface between the sub-teams working on query language and operators. A query optimizer allows new operators, and alternative algorithms and data structures, to be easily added to the system.
  16. 16. Extra material follows...
  17. 17. Writing an adapterDriver – if you want a vanity URL like “jdbc:drill:”Schema – describes what tables existTable – what are the columns, and how to get the data.Operators (optional) – non-relational operators, if anyRules (optional, but recommended) – improve efficiency by changing the questionParser (optional) – additional source languages

×