SQL on Big Datausing Optiq@julianhydeReal-time Big Data Meetupat RichRelevanceApril 2013
What is “SQL on Big Data?”□ “Open-source Teradata”□ SQL generator for Map-Reduce□ ETL (Extract-Transform Load)□ Scalable t...
Revolution & counter-revolution“Big Data” was a revolution in data management.Lots of broken things got fixed (unlimited s...
Conventional DBMS architecture              JDBC client              JDBC server              SQL parser /                ...
Optiq architecture                   JDBC client                   JDBC server       Optional    SQL parser /            M...
SELECT p.product_name, COUNT(*) AS cExpression                            FROM splunk.splunk AS s                         ...
SELECT p.product_name, COUNT(*) AS cExpression                           FROM splunk.splunk AS s                          ...
Apache Drill“Apache Drill (incubating) is a distributed system  for interactive analysis of large-scale datasets,  based o...
Cascading Lingual“Cascading is the de facto Java API for creating  complex data processing workloads and the  engine under...
Mondrian (Pentaho Analysis)
Mondrian next-gen architecture      mondrian               mondrian              mondrian     Optiq provides SQL          ...
Summary: Data independenceLogical & physical data modelsRequires & allows query optimizationAllows you (or the system) to ...
Thank you!@julianhydeoptiq https://github.com/julianhyde/optiqdrill http://incubator.apache.org/drill/lingual http://www.c...
Upcoming SlideShare
Loading in...5
×

SQL on Big Data using Optiq

3,902

Published on

SQL on Big Data is not a "one size fits all". Optiq is a framework that allows you to build a data management system on top of any back-end system, including NoSQL and Hadoop, and rules that optimize query processing for capabilities of the data source. We show how Optiq is used in the Apache Drill and Cascading Lingual projects, and how we plan to combine Optiq materialized views, Mondrian, and a data grid to create next-generation in-memory analytics.

This presentation was given at the Real-Time Big Data meetup at RichRelevance in San Francisco, 2013-04-09.

Published in: Technology
0 Comments
0 Likes
Statistics
Notes
  • Be the first to comment

  • Be the first to like this

No Downloads
Views
Total Views
3,902
On Slideshare
0
From Embeds
0
Number of Embeds
30
Actions
Shares
0
Downloads
44
Comments
0
Likes
0
Embeds 0
No embeds

No notes for slide

SQL on Big Data using Optiq

  1. 1. SQL on Big Datausing Optiq@julianhydeReal-time Big Data Meetupat RichRelevanceApril 2013
  2. 2. What is “SQL on Big Data?”□ “Open-source Teradata”□ SQL generator for Map-Reduce□ ETL (Extract-Transform Load)□ Scalable transaction processing□ Querying nested data sets□ Querying documents & populating databases□ Continuous query/streaming(Check one or more.)
  3. 3. Revolution & counter-revolution“Big Data” was a revolution in data management.Lots of broken things got fixed (unlimited scale, data anywhere & any format, late schema, flexible queries).Some useful things got broken (standard interface, data independence, central control).“In 5 years everyone will be using Hadoop and they wont even know it.” – me, a few years ago
  4. 4. Conventional DBMS architecture JDBC client JDBC server SQL parser / validator Metadata Query optimizer Data-flow operators Data Data
  5. 5. Optiq architecture JDBC client JDBC server Optional SQL parser / Metadata validator SPI Core Query Pluggable optimizer rules 3rd 3rd Pluggable party party ops ops 3rd party 3rd party data data
  6. 6. SELECT p.product_name, COUNT(*) AS cExpression FROM splunk.splunk AS s JOIN mysql.products AS ptree ON s.product_id = p.product_id WHERE s.action = purchase GROUP BY p.product_name ORDER BY c DESCSplunkTable: splunk Key: product_name Key: product_id Agg: count Condition: Key: c DESC action = purchasescan joinMySQL filter group sort scan Table: products
  7. 7. SELECT p.product_name, COUNT(*) AS cExpression FROM splunk.splunk AS s JOIN mysql.products AS ptree (optimized) ON s.product_id = p.product_id WHERE s.action = purchase GROUP BY p.product_name ORDER BY c DESC Splunk Condition: Table: splunk action = purchase Key: product_name Agg: count Key: c DESC Key: product_id scan filterMySQL join group sort scan Table: products
  8. 8. Apache Drill“Apache Drill (incubating) is a distributed system for interactive analysis of large-scale datasets, based on Googles Dremel. Its goal is to efficiently process nested data. It is a design goal to scale to 10,000 servers or more and to be able to process petabyes of data and trillions of records in seconds.”Data model: JSON, late-bindingOptiq: SQL → logical plan (current) Logical → physical plan (proposed)
  9. 9. Cascading Lingual“Cascading is the de facto Java API for creating complex data processing workloads and the engine underneath Scalding, Cascalog, and others.”Lingual uses Optiq to translate SQL onto Cascading flowsSQL is “yet another DSL” for CascadingJust released!
  10. 10. Mondrian (Pentaho Analysis)
  11. 11. Mondrian next-gen architecture mondrian mondrian mondrian Optiq provides SQL view onto hybrid SQL + NoSQL + optiq optiq optiq in-memory store cache data cache grid cache In-memory tables (query results, planned & on-the-fly materializations) control control control cache cache cache Raw data + summarized / HDFS MongoDB DBMS projected / sorted / re-organized data. Partitions.
  12. 12. Summary: Data independenceLogical & physical data modelsRequires & allows query optimizationAllows you (or the system) to re-organize dataQuery federation, data movement, cachingSQL interface for humans & machinesOptiq lets you add rules to optimize better
  13. 13. Thank you!@julianhydeoptiq https://github.com/julianhyde/optiqdrill http://incubator.apache.org/drill/lingual http://www.cascading.org/lingual/mondrian http://mondrian.pentaho.comslides https://github.com/julianhyde/share/tree/master/slides
  1. A particular slide catching your eye?

    Clipping is a handy way to collect important slides you want to go back to later.

×