SQL on Big Data is not "one size fits all". Optiq is a framework that lets you build a data management system on top of any back-end system, including NoSQL and Hadoop, and write rules that optimize query processing for the capabilities of each data source. We show how Optiq is used in the Apache Drill and Cascading Lingual projects, and how we plan to combine Optiq materialized views, Mondrian, and a data grid to create next-generation in-memory analytics.
This presentation was given at the Real-Time Big Data meetup at RichRelevance in San Francisco, 2013-04-09.
SQL on Big Data using Optiq
@julianhyde
Real-time Big Data Meetup at RichRelevance
April 2013
What is “SQL on Big Data?”
□ “Open-source Teradata”
□ SQL generator for Map-Reduce
□ ETL (Extract-Transform-Load)
□ Scalable transaction processing
□ Querying nested data sets
□ Querying documents & populating databases
□ Continuous query/streaming
(Check one or more.)
Revolution & counter-revolution
“Big Data” was a revolution in data management.
Lots of broken things got fixed (unlimited scale, data anywhere & any format, late schema, flexible queries).
Some useful things got broken (standard interface, data independence, central control).
“In 5 years everyone will be using Hadoop and they won't even know it.” – me, a few years ago
Conventional DBMS architecture
[Diagram: JDBC client → JDBC server → SQL parser / validator (with Metadata) → Query optimizer → Data-flow operators → Data]
Optiq architecture
[Diagram. Labels: JDBC client; JDBC server; SQL parser / validator; "Optional"; Metadata SPI; Core; Query optimizer; Pluggable rules; Pluggable 3rd party ops; 3rd party data]
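To make the idea of pluggable third-party data concrete, here is a minimal sketch in Python. It is not Optiq's SPI: the `Table` interface, the two toy back-ends, and `count_rows` are all hypothetical names invented for illustration. The point is only that the core code depends on one small interface, so any back-end that implements it can be plugged in.

```python
from typing import Dict, Iterable, Iterator, Protocol


class Table(Protocol):
    """Hypothetical adapter interface: anything that can yield rows as dicts."""

    def scan(self) -> Iterator[dict]: ...


class ListTable:
    """Toy back-end that holds its rows in memory."""

    def __init__(self, rows: Iterable[dict]):
        self.rows = list(rows)

    def scan(self) -> Iterator[dict]:
        return iter(self.rows)


class CsvTextTable:
    """Toy back-end that parses comma-separated text with a header line."""

    def __init__(self, text: str):
        self.text = text

    def scan(self) -> Iterator[dict]:
        lines = self.text.strip().splitlines()
        header = lines[0].split(",")
        for line in lines[1:]:
            yield dict(zip(header, line.split(",")))


def count_rows(catalog: Dict[str, Table], name: str) -> int:
    """The 'core' talks only to the Table interface, never to a concrete back-end."""
    return sum(1 for _ in catalog[name].scan())


# Two different back-ends registered side by side in one catalog.
catalog = {
    "products": ListTable([{"product_id": "1", "product_name": "apple"}]),
    "events": CsvTextTable("product_id,action\n1,purchase\n1,view\n"),
}
```

Calling `count_rows(catalog, "events")` and `count_rows(catalog, "products")` runs the same core code against both back-ends.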
Expression tree

    SELECT p.product_name, COUNT(*) AS c
    FROM splunk.splunk AS s
      JOIN mysql.products AS p
      ON s.product_id = p.product_id
    WHERE s.action = 'purchase'
    GROUP BY p.product_name
    ORDER BY c DESC

[Diagram: Splunk scan (Table: splunk) and MySQL scan (Table: products) → join (Key: product_id) → filter (Condition: action = 'purchase') → group (Key: product_name, Agg: count) → sort (Key: c DESC)]
Expression tree (optimized)

    SELECT p.product_name, COUNT(*) AS c
    FROM splunk.splunk AS s
      JOIN mysql.products AS p
      ON s.product_id = p.product_id
    WHERE s.action = 'purchase'
    GROUP BY p.product_name
    ORDER BY c DESC

[Diagram: within Splunk, scan (Table: splunk) → filter (Condition: action = 'purchase'); the filtered rows and the MySQL scan (Table: products) → join (Key: product_id) → group (Key: product_name, Agg: count) → sort (Key: c DESC)]
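The difference between the two trees is that the filter has moved below the join, onto the Splunk side. A rewrite of this kind can be sketched as a small rule over an expression tree. The sketch below is a toy in Python, not Optiq's rule API: the `Scan`, `Filter`, and `Join` classes and the `push_filter_into_join` function are hypothetical, and the condition is kept as an opaque string tagged with the alias it refers to.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scan:
    source: str  # back-end system, e.g. "splunk" or "mysql"
    table: str
    alias: str


@dataclass(frozen=True)
class Filter:
    child: object
    alias: str      # alias of the table the condition refers to
    condition: str  # opaque text; a real planner would keep an expression


@dataclass(frozen=True)
class Join:
    left: object
    right: object
    key: str


def aliases(node):
    """Return the set of table aliases produced by a subtree."""
    if isinstance(node, Scan):
        return {node.alias}
    if isinstance(node, Filter):
        return aliases(node.child)
    return aliases(node.left) | aliases(node.right)


def push_filter_into_join(node):
    """Rewrite Filter(Join(l, r)) so the filter sits on the side it refers to.

    Returns the rewritten tree, or the node unchanged if the rule does not match.
    """
    if not (isinstance(node, Filter) and isinstance(node.child, Join)):
        return node
    join = node.child
    if node.alias in aliases(join.left):
        return Join(Filter(join.left, node.alias, node.condition), join.right, join.key)
    if node.alias in aliases(join.right):
        return Join(join.left, Filter(join.right, node.alias, node.condition), join.key)
    return node


# The join and filter from the slide, before and after applying the rule.
splunk = Scan("splunk", "splunk", "s")
products = Scan("mysql", "products", "p")
before = Filter(Join(splunk, products, "product_id"), "s", "action = 'purchase'")
after = push_filter_into_join(before)
```

Once the filter sits directly above the Splunk scan, a further rule could hand both to Splunk as a single search, so fewer rows cross the join.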
Apache Drill
“Apache Drill (incubating) is a distributed system for interactive analysis of large-scale datasets, based on Google's Dremel. Its goal is to efficiently process nested data. It is a design goal to scale to 10,000 servers or more and to be able to process petabytes of data and trillions of records in seconds.”
Data model: JSON, late-binding
Optiq: SQL → logical plan (current); logical → physical plan (proposed)
Cascading Lingual
“Cascading is the de facto Java API for creating complex data processing workloads and the engine underneath Scalding, Cascalog, and others.”
Lingual uses Optiq to translate SQL onto Cascading flows
SQL is “yet another DSL” for Cascading
Just released!
Mondrian next-gen architecture
[Diagram: three nodes, each with mondrian, optiq, a cache, and cache control, joined by a data grid; underlying stores are HDFS, MongoDB, and a DBMS]
Optiq provides SQL view onto hybrid SQL + NoSQL + in-memory store
In-memory tables (query results, planned & on-the-fly materializations)
Raw data + summarized / projected / sorted / re-organized data. Partitions.
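The idea of on-the-fly materialization can be illustrated with a small sketch: the first time a summary is requested it is computed from raw data and kept in memory, and later requests are answered from the in-memory copy. This is a toy in Python under stated assumptions, not the planned Mondrian/Optiq design; the `MaterializationCache` class and its `count_by` method are hypothetical, and a real system would also need invalidation and eviction.

```python
from collections import Counter


class MaterializationCache:
    """Toy stand-in for an in-memory store of summary tables (hypothetical)."""

    def __init__(self, raw_tables):
        self.raw_tables = raw_tables  # table name -> list of row dicts
        self.materialized = {}        # (table, column) -> Counter of row counts
        self.raw_scans = 0            # how often we fell back to raw data

    def count_by(self, table, column):
        """Return row counts grouped by one column, materializing on first use."""
        key = (table, column)
        if key not in self.materialized:
            # On-the-fly materialization: scan raw data once, keep the summary.
            self.raw_scans += 1
            self.materialized[key] = Counter(
                row[column] for row in self.raw_tables[table]
            )
        return dict(self.materialized[key])


cache = MaterializationCache({
    "sales": [
        {"product_name": "apple"},
        {"product_name": "pear"},
        {"product_name": "apple"},
    ]
})
first = cache.count_by("sales", "product_name")   # computed from raw rows
second = cache.count_by("sales", "product_name")  # served from memory
```

After both calls, `cache.raw_scans` is still 1, because the second query never touches the raw rows.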
Summary: Data independence
Logical & physical data models
Requires & allows query optimization
Allows you (or the system) to re-organize data
Query federation, data movement, caching
SQL interface for humans & machines
Optiq lets you add rules to optimize better