SQL on Big Data using Optiq

  • 3,701 views
Uploaded on

SQL on Big Data is not a "one size fits all". Optiq is a framework that allows you to build a data management system on top of any back-end system, including NoSQL and Hadoop, and rules that optimize …

SQL on Big Data is not a "one size fits all". Optiq is a framework that allows you to build a data management system on top of any back-end system, including NoSQL and Hadoop, and rules that optimize query processing for capabilities of the data source. We show how Optiq is used in the Apache Drill and Cascading Lingual projects, and how we plan to combine Optiq materialized views, Mondrian, and a data grid to create next-generation in-memory analytics.

This presentation was given at the Real-Time Big Data meetup at RichRelevance in San Francisco, 2013-04-09.

More in: Technology
  • Full Name Full Name Comment goes here.
    Are you sure you want to
    Your message goes here
    Be the first to comment
    Be the first to like this
No Downloads

Views

Total Views
3,701
On Slideshare
0
From Embeds
0
Number of Embeds
29

Actions

Shares
Downloads
40
Comments
0
Likes
0

Embeds 0

No embeds

Report content

Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
    No notes for slide

Transcript

  • 1. SQL on Big Datausing Optiq@julianhydeReal-time Big Data Meetupat RichRelevanceApril 2013
  • 2. What is “SQL on Big Data?”□ “Open-source Teradata”□ SQL generator for Map-Reduce□ ETL (Extract-Transform Load)□ Scalable transaction processing□ Querying nested data sets□ Querying documents & populating databases□ Continuous query/streaming(Check one or more.)
  • 3. Revolution & counter-revolution“Big Data” was a revolution in data management.Lots of broken things got fixed (unlimited scale, data anywhere & any format, late schema, flexible queries).Some useful things got broken (standard interface, data independence, central control).“In 5 years everyone will be using Hadoop and they wont even know it.” – me, a few years ago
  • 4. Conventional DBMS architecture JDBC client JDBC server SQL parser / validator Metadata Query optimizer Data-flow operators Data Data
  • 5. Optiq architecture JDBC client JDBC server Optional SQL parser / Metadata validator SPI Core Query Pluggable optimizer rules 3rd 3rd Pluggable party party ops ops 3rd party 3rd party data data
  • 6. SELECT p.product_name, COUNT(*) AS cExpression FROM splunk.splunk AS s JOIN mysql.products AS ptree ON s.product_id = p.product_id WHERE s.action = purchase GROUP BY p.product_name ORDER BY c DESCSplunkTable: splunk Key: product_name Key: product_id Agg: count Condition: Key: c DESC action = purchasescan joinMySQL filter group sort scan Table: products
  • 7. SELECT p.product_name, COUNT(*) AS cExpression FROM splunk.splunk AS s JOIN mysql.products AS ptree (optimized) ON s.product_id = p.product_id WHERE s.action = purchase GROUP BY p.product_name ORDER BY c DESC Splunk Condition: Table: splunk action = purchase Key: product_name Agg: count Key: c DESC Key: product_id scan filterMySQL join group sort scan Table: products
  • 8. Apache Drill“Apache Drill (incubating) is a distributed system for interactive analysis of large-scale datasets, based on Googles Dremel. Its goal is to efficiently process nested data. It is a design goal to scale to 10,000 servers or more and to be able to process petabyes of data and trillions of records in seconds.”Data model: JSON, late-bindingOptiq: SQL → logical plan (current) Logical → physical plan (proposed)
  • 9. Cascading Lingual“Cascading is the de facto Java API for creating complex data processing workloads and the engine underneath Scalding, Cascalog, and others.”Lingual uses Optiq to translate SQL onto Cascading flowsSQL is “yet another DSL” for CascadingJust released!
  • 10. Mondrian (Pentaho Analysis)
  • 11. Mondrian next-gen architecture mondrian mondrian mondrian Optiq provides SQL view onto hybrid SQL + NoSQL + optiq optiq optiq in-memory store cache data cache grid cache In-memory tables (query results, planned & on-the-fly materializations) control control control cache cache cache Raw data + summarized / HDFS MongoDB DBMS projected / sorted / re-organized data. Partitions.
  • 12. Summary: Data independenceLogical & physical data modelsRequires & allows query optimizationAllows you (or the system) to re-organize dataQuery federation, data movement, cachingSQL interface for humans & machinesOptiq lets you add rules to optimize better
  • 13. Thank you!@julianhydeoptiq https://github.com/julianhyde/optiqdrill http://incubator.apache.org/drill/lingual http://www.cascading.org/lingual/mondrian http://mondrian.pentaho.comslides https://github.com/julianhyde/share/tree/master/slides