OSCON 2013: Apache Drill Workshop > SQL Queries



Discussion of Drill overall, exercises in SQL.

  • With the recent explosion of everything related to Hadoop, it is no surprise that new projects and implementations in the Hadoop ecosystem keep appearing, including quite a few initiatives that provide SQL interfaces to Hadoop. The Apache Drill project is a distributed system for interactive analysis of large-scale datasets, inspired by Google's Dremel. Drill is not trying to replace existing Big Data batch-processing frameworks such as Hadoop MapReduce, or stream-processing frameworks such as S4 or Storm. Rather, it fills an existing void: real-time interactive processing of large data sets.
  • Technical detail: like Dremel, the Drill implementation is based on processing nested, tree-like data. In Dremel this data is based on protocol buffers, a nested, schema-based data model. Drill plans to extend this data model with additional schema-based implementations, for example Apache Avro, and with schema-less data models such as JSON and BSON. Beyond single data structures, Drill also plans to support "baby joins": joins against small data structures that can be loaded entirely into memory.
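The "baby join" idea mentioned above, joining a large data set against a small table that fits in memory, is essentially a broadcast hash join. A minimal sketch in Python, assuming made-up record shapes and field names (this is not Drill's API):

```python
# Sketch of a "baby join": the small side is loaded fully into memory
# as a hash table and the large side is streamed against it.
# Record layouts and field names here are invented for illustration.

def baby_join(large_rows, small_rows, key):
    # Build an index over the small, memory-resident side.
    index = {}
    for row in small_rows:
        index.setdefault(row[key], []).append(row)
    # Stream the large side, emitting merged rows on key matches.
    for row in large_rows:
        for match in index.get(row[key], []):
            merged = dict(match)
            merged.update(row)
            yield merged

users = [{"user_id": 1, "name": "Homer"}, {"user_id": 2, "name": "Marge"}]
events = [{"user_id": 1, "action": "click"},
          {"user_id": 1, "action": "buy"},
          {"user_id": 3, "action": "click"}]

# Only events with a matching user survive the (inner) join.
joined = list(baby_join(events, users, "user_id"))
```

The design point is that only the small side needs to fit in memory; the large side is never materialized.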
    1. 1. Apache Drill Hackers Workshop Part 1: SQL layer
    2. 2. Agenda • Introductions – to us – to Apache Drill – to the Apache Software Foundation • Logistics – access, tools, methods – teams – goal setting • Hacking • Wrap-up and future plans
    3. 3. Me? • Ted Dunning – Chief Application Architect at MapR – Committer/PMC on Mahout, Zookeeper, Drill, … • @ted_dunning • ted.dunning@gmail.com • tdunning@apache.org • tdunning@maprtech.com
    4. 4. Jacques • Jacques Nadeau – Committer/PMC Apache Drill – Software Architect with MapR • @intjesus • jacques@apache.org • jnadeau@maprtech.com
    5. 5. You? • Background? • Aspirations?
    6. 6. Apache Drill • Brand new Apache Incubator project • Goal is ad hoc query – SQL, other – flexible execution, advanced optimization • Very welcoming community! – non-code contributions very welcome – @apacheDrill – http://incubator.apache.org/drill/ – drill-dev-subscribe@incubator.apache.org
    7. 7. Apache • Many projects – some well known … Hadoop, httpd, Tomcat, Solr – some very obscure/inactive • Community over Code – lightning rod projects are fine … but not at Apache – code is dead, community is living • Consensus driven – respectful debate is fine, argument is not – find points of agreement, move forward
    8. 8. Drill in a Nutshell • Apache Drill overview • Key features • Status and progress • Discuss potential use cases and cooperation
    9. 9. Big Data Workloads • ETL • Data mining • Blob store • Lightweight OLTP on large datasets • Index and model generation • Web crawling • Stream processing • Clustering, anomaly detection and classification • Interactive analysis
    10. 10. Interactive Queries and Hadoop • Common solutions – Compile SQL to MapReduce (SQL-based analytics) – Export MapReduce results to an RDBMS and query the RDBMS – External tables in an MPP database • Emerging technologies – Real-time interactive queries (Impala, Drill)
    11. 11. Example Problem • Jane works as an analyst at an e-commerce company • How does she figure out good targeting segments for the next marketing campaign? • She has some ideas and lots of data – User profiles – Transaction information – Access logs
    12. 12. Solving the Problem with Traditional Systems • Use an RDBMS – ETL the data from MongoDB and Hadoop into the RDBMS • MongoDB data must be flattened, schematized, filtered and aggregated • Hadoop data must be filtered and aggregated – Query the data using any SQL-based tool • Use MapReduce – ETL the data from Oracle and MongoDB into Hadoop – Work with the MapReduce team to generate the desired analyses • Use Hive – ETL the data from Oracle and MongoDB into Hadoop • MongoDB data must be flattened and schematized – But HiveQL is limited, queries take too long and BI tool support is limited
    13. 13. WWGD • Google stack: GFS (distributed file system), BigTable (NoSQL), Dremel (interactive analysis), MapReduce (batch processing) • Hadoop equivalents: HDFS, HBase, ???, Hadoop MapReduce • Build Apache Drill to provide a true open source solution to interactive analysis of Big Data
    14. 14. Apache Drill Overview • Interactive analysis of Big Data using standard SQL • Fast – Low latency queries – Columnar execution • Inspired by Google Dremel/BigQuery – Complements native interfaces and MapReduce/Hive/Pig • Open – Community-driven open source project – Under the Apache Software Foundation • Modern – Standard ANSI SQL:2003 (select/into) – Nested/hierarchical data support – Schema is optional – Supports RDBMS, Hadoop and NoSQL • Workload split: interactive queries (100 ms–20 min; data analysts, reporting) go to Apache Drill, while data mining, modeling and large ETL (20 min–20 hr) stay on MapReduce/Hive/Pig
    15. 15. How Does It Work? • Drillbits run on each node, designed to maximize data locality • Processing is done outside MapReduce paradigm (but possibly within YARN) • Queries can be fed to any Drillbit • Coordination, query planning, optimization, scheduling, and execution are distributed SELECT * FROM oracle.transactions, mongo.users, hdfs.events LIMIT 1
    16. 16. Key Features • Full SQL (ANSI SQL:2003) • Nested data • Schema is optional • Flexible and extensible architecture
    17. 17. Full SQL (ANSI SQL:2003) • Drill supports standard ANSI SQL:2003 – Correlated subqueries, analytic functions, … – SQL-like is not enough • Use any SQL-based tool with Apache Drill – Tableau, MicroStrategy, Excel, SAP Crystal Reports, Toad, SQuirreL, … – Standard ODBC and JDBC drivers • [Architecture diagram: client tools (Tableau, MicroStrategy, Excel, SAP Crystal Reports) → Drill ODBC driver → driver client → Drillbit (SQL query parser, query planner) → Drillbits / Drill workers]
    18. 18. Nested Data • Nested data is becoming prevalent – JSON, BSON, XML, Protocol Buffers, Avro, etc. – The data source may or may not be aware • MongoDB supports nested data natively • A single HBase value could be a JSON document (compound nested type) – Google Dremel's innovation was efficient columnar storage and querying of nested data • Flattening nested data is error-prone and often impossible – Think about repeated and optional fields at every level… • Apache Drill supports nested data – Extensions to ANSI SQL:2003 • Example (Avro): enum Gender { MALE, FEMALE } record User { string name; Gender gender; long followers; } • Example (JSON): { "name": "Homer", "gender": "Male", "followers": 100, "children": [ {"name": "Bart"}, {"name": "Lisa"} ] }
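Why flattening nested data is error-prone can be seen in a tiny sketch. With a repeated field such as `children`, a flat relational layout forces a choice between row explosion and silent data loss; the record below is illustrative (Python used only for demonstration):

```python
# Illustrative only: flattening a record with a repeated field forces
# a choice between row explosion and data loss.
record = {"name": "Homer", "followers": 100,
          "children": [{"name": "Bart"}, {"name": "Lisa"}]}

# Option 1: explode - one output row per repeated element, with the
# parent columns duplicated in every row (and zero rows if the list
# is empty, dropping the parent entirely).
exploded = [{"name": record["name"], "followers": record["followers"],
             "child": c["name"]} for c in record["children"]]

# Option 2: truncate - keep only the first element, silently dropping
# the rest.
truncated = {"name": record["name"],
             "child": record["children"][0]["name"]}
```

With repeated and optional fields at every level of nesting, these distortions compound, which is why Drill (like Dremel) keeps the nesting intact instead.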
    19. 19. Schema is Optional • Many data sources do not have rigid schemas – Schemas change rapidly – Each record may have a different schema • Sparse and wide rows in HBase and Cassandra, MongoDB • Apache Drill supports querying against unknown schemas – Query any HBase, Cassandra or MongoDB table • User can define the schema or let the system discover it automatically – System of record may already have schema information • Why manage it in a separate system? – No need to manage schema evolution • Example (HBase): row "com.cnn.www" has contents:html = "<html>…", anchor:my.look.ca = "CNN.com", anchor:cnnsi.com = "CNN"; row "com.foxnews.www" has contents:html = "<html>…", anchor:en.wikipedia.org = "Fox News"; …
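Automatic schema discovery over schema-free records can be approximated by scanning records and merging the fields and types observed. A toy sketch under invented data, not Drill's actual (per-batch, columnar) mechanism:

```python
# Toy schema discovery: infer a merged schema from heterogeneous
# records as {field: set of observed Python type names}. Illustrative
# only; not how Drill represents schemas internally.
def discover_schema(records):
    schema = {}
    for rec in records:
        for field, value in rec.items():
            schema.setdefault(field, set()).add(type(value).__name__)
    return schema

rows = [{"id": 1, "tags": ["a", "b"]},   # sparse rows with
        {"id": "2", "name": "x"}]        # differing field types
schema = discover_schema(rows)
# "id" was seen as both int and str; "name" and "tags" are optional.
```

The point of the sketch: the schema is a property discovered from the data, not something the user must declare up front.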
    20. 20. Flexible and Extensible Architecture • Apache Drill is designed for extensibility • Well-documented APIs and interfaces • Data sources and file formats – Implement a custom scanner to support a new data source or file format • Query languages – SQL:2003 is the primary language – Implement a custom Parser to support a Domain Specific Language – UDFs and UDTFs • Optimizers – Drill will have a cost-based optimizer – Clear surrounding APIs support easy optimizer exploration • Operators – Custom operators can be implemented • Special operators for Mahout (k-means) being designed – Operator push-down to data source (RDBMS)
    21. 21. Drill Status: In Progress • Heavy active development by multiple organizations • Available – Logical plan syntax and interpreter – Reference interpreter • In progress – SQL interpreter – Storage engine implementations for Accumulo, Cassandra, HBase and various file formats • Significant community momentum – Over 200 people on the Drill mailing list – Over 200 members of the Bay Area Drill User Group – Drill meetups across the US and Europe – OpenDremel team joined Apache Drill • Anticipated schedule: – Prototype: Q1 – Alpha: Q2 – Beta: Q3
    22. 22. Logistics • Tools – IntelliJ / Eclipse – Maven – JIRA + mailing lists + GitHub
    23. 23. Let’s hack! • Get the code – git clone https://github.com/apache/incubator-drill.git – cd incubator-drill/sandbox/prototype – git checkout 9f69ed0 – mvn clean install • Run Sqlline – ./sqlline – sqlline> !connect jdbc:optiq:model=common/target/test-classes/donuts-model.json admin admin • Try out queries – select * from donuts; – Try other queries. Note that because donuts.json (our input file) is a JSON file, the SQL parser cannot figure out what fields exist in a record ahead of time. As a result, we have to use the virtual field _map, which always exists for every record, and access fields dynamically, as in: – select donuts._map['name'], donuts._map['ppu'], donuts._map['batters'] from donuts;
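The `_map` virtual field is easy to picture: each JSON record is a map, and field access is a dynamic key lookup resolved per record at read time. A rough sketch of the idea (the sample data below is invented in the spirit of donuts.json; the real file ships with the Drill prototype sandbox):

```python
import json

# Invented sample data resembling donuts.json; the actual file is in
# the incubator-drill prototype checkout.
raw = '''
[{"name": "Cake",   "ppu": 0.55, "batters": ["regular", "chocolate"]},
 {"name": "Raised", "ppu": 0.69, "batters": ["regular"]}]
'''
donuts = json.loads(raw)

# Equivalent in spirit to:
#   select donuts._map['name'], donuts._map['ppu'] from donuts;
# No schema is declared; fields are looked up record by record.
result = [(d["name"], d["ppu"]) for d in donuts]
```

This mirrors why the workshop queries use `donuts._map['name']` rather than a bare column name: without a declared schema, every field access is a lookup into the record's map.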