• Share
  • Email
  • Embed
  • Like
  • Save
  • Private Content
Apache Drill
 

Apache Drill

on

  • 7,544 views

Summary of recent progress on Apache Drill, an open-source community-driven project to provide easy, dependable, fast and flexible ad hoc query capabilities.

Summary of recent progress on Apache Drill, an open-source community-driven project to provide easy, dependable, fast and flexible ad hoc query capabilities.

Statistics

Views

Total Views
7,544
Views on SlideShare
7,266
Embed Views
278

Actions

Likes
21
Downloads
146
Comments
2

4 Embeds 278

https://twitter.com 141
http://www.mapr.com 135
http://www.diffbot.com&_=1361181527023 HTTP 1
http://www.linkedin.com 1

Accessibility

Categories

Upload Details

Uploaded via as Microsoft PowerPoint

Usage Rights

© All Rights Reserved

Report content

Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel

12 of 2 previous next

  • Full Name Full Name Comment goes here.
    Are you sure you want to
    Your message goes here
    Processing…
  • Excellent stuff ... can't wait.
    Are you sure you want to
    Your message goes here
    Processing…
  • For more updates on this open source project, follow @ApacheDrill on Twitter
    Are you sure you want to
    Your message goes here
    Processing…
Post Comment
Edit your comment
  • With the recent explosion of everything related to Hadoop, it is no surprise that new projects/implementations related to the Hadoop ecosystem keep appearing. There have been quite a few initiatives that provide SQL interfaces into Hadoop. The Apache Drill project is a distributed system for interactive analysis of large-scale datasets, inspired by Google's Dremel. Drill is not trying to replace existing Big Data batch processing frameworks, such as Hadoop MapReduce or stream processing frameworks, such as S4 or Storm. It rather fills the existing void – real-time interactive processing of large data sets.------------------------------Technical DetailSimilar to Dremel, the Drill implementation is based on the processing of nested, tree-like data. In Dremel this data is based on protocol buffers – nested schema-based data model. Drill is planning to extend this data model by adding additional schema-based implementations, for example, Apache Avro and schema-less data models such asJSON and BSON. In addition to a single data structure, Drill is also planning to support “baby joins” – joins to the small, loadable in memory, data structures.

Apache Drill Apache Drill Presentation Transcript

  • Apache’s Answer to Low Latency Interactive Query for Big Data February 13, 2013
  • Agenda• Apache Drill overview• Key features• Status and progress• Discuss potential use cases and cooperation
  • Big Data Workloads• ETL• Data mining• Blob store• Lightweight OLTP on large datasets• Index and model generation• Web crawling• Stream processing• Clustering, anomaly detection and classification• Interactive analysis
  • Interactive Queries and Hadoop Common Solutions Compile SQL to Export MapReduce External tables in an MapReduce results to RDBMS and MPP database query the RDBMS Emerging Technologies Impala Real-time Real-time SQL basedinteractive queries interactive queries analytics
  • Example Problem • Jane works as an Transaction analyst at an e- information commerce company • How does she figure User out good targeting profiles segments for the next marketing campaign? • She has some ideas Access and lots of data logs
  • Solving the Problem with Traditional Systems• Use an RDBMS – ETL the data from MongoDB and Hadoop into the RDBMS • MongoDB data must be flattened, schematized, filtered and aggregated • Hadoop data must be filtered and aggregated – Query the data using any SQL-based tool• Use MapReduce – ETL the data from Oracle and MongoDB into Hadoop – Work with the MapReduce team to generate the desired analyses• Use Hive – ETL the data from Oracle and MongoDB into Hadoop • MongoDB data must be flattened and schematized – But HiveQL is limited, queries take too long and BI tool support is limited
  • WWGD Distributed Interactive Batch NoSQL File System analysis processing GFS BigTable Dremel MapReduce Hadoop HDFS HBase ??? MapReduce Build Apache Drill to provide a true open source solution to interactive analysis of Big Data
  • Apache Drill Overview• Interactive analysis of Big Data using standard SQL• Fast – Low latency queries – Columnar execution • Inspired by Google Dremel/BigQuery Interactive queries – Complement native interfaces and Apache Drill Data analyst MapReduce/Hive/Pig Reporting• Open 100 ms-20 min – Community driven open source project – Under Apache Software Foundation• Modern – Standard ANSI SQL:2003 (select/into) Data mining – Nested/hierarchical data support MapReduce Modeling Hive – Schema is optional Pig Large ETL – Supports RDBMS, Hadoop and NoSQL 20 min-20 hr
  • How Does It Work?• Drillbits run on each node, designed to maximize data locality• Processing is done outside MapReduce SELECT * FROM paradigm (but possibly within YARN) oracle.transactions,• Queries can be fed to any Drillbit mongo.users, hdfs.e vents• Coordination, query LIMIT 1 planning, optimization, scheduling, and execution are distributed
  • Key Features• Full SQL (ANSI SQL:2003)• Nested data• Schema is optional• Flexible and extensible architecture
  • Full SQL (ANSI SQL:2003)• Drill supports standard ANSI SQL:2003 – Correlated subqueries, analytic functions, … – SQL-like is not enough• Use any SQL-based tool with Apache Drill – Tableau, Microstrategy, Excel, SAP Crystal Reports, Toad, SQuirreL, … – Standard ODBC and JDBC drivers Client Tableau Drillbit MicroStrategy Drill% ODBC% SQL%Query% Query% Driver Drillbits Driver Parser Planner Drill% Worker Excel Drill%Worker SAP% Crystal% Reports
  • Nested Data JSON• Nested data is becoming prevalent { "name": "Homer", – JSON, BSON, XML, Protocol Buffers, Avro, etc. "gender": "Male", "followers": 100 – The data source may or may not be aware children: [ • MongoDB supports nested data natively {name: "Bart"}, {name: "Lisa”} • A single HBase value could be a JSON document ] (compound nested type) } – Google Dremel’s innovation was efficient columnar storage and querying of nested data• Flattening nested data is error-prone and often Avro enum Gender { impossible MALE, FEMALE – Think about repeated and optional fields at every } level… record User {• Apache Drill supports nested data string name; Gender gender; – Extensions to ANSI SQL:2003 long followers; }
  • Schema is Optional• Many data sources do not have rigid schemas – Schemas change rapidly – Each record may have a different schema • Sparse and wide rows in HBase and Cassandra, MongoDB• Apache Drill supports querying against unknown schemas – Query any HBase, Cassandra or MongoDB table• User can define the schema or let the system discover it automatically – System of record may already have schema information • Why manage it in a separate system? – No need to manage schema evolutionRow Key CF contents CF anchor"com.cnn.www" contents:html = "<html>…" anchor:my.look.ca = "CNN.com" anchor:cnnsi.com = "CNN""com.foxnews.www" contents:html = "<html>…" anchor:en.wikipedia.org = "Fox News"… … …
  • Flexible and Extensible Architecture• Apache Drill is designed for extensibility• Well-documented APIs and interfaces• Data sources and file formats – Implement a custom scanner to support a new data source or file format• Query languages – SQL:2003 is the primary language – Implement a custom Parser to support a Domain Specific Language – UDFs and UDTFs• Optimizers – Drill will have a cost-based optimizer – Clear surrounding APIs support easy optimizer exploration• Operators – Custom operators can be implemented • Special operators for Mahout (k-means) being designed – Operator push-down to data source (RDBMS)
  • How Does Impala Fit In?Impala Strengths Questions• Beta currently available • Open Source ‘Lite’• Easy install and setup on top of • Doesn’t support RDBMS or other Cloudera NoSQLs (beyond Hadoop/HBase)• Faster than Hive on some queries • Early row materialization increases• SQL-like query language footprint and reduces performance • Limited file format support • Query results must fit in memory! • Rigid schema is required • No support for nested data • Compound APIs restrict optimizer progression • SQL-like (not SQL) Many important features are “coming soon”. Architectural foundation is constrained. No community development.
  • Status: In Progress• Heavy active development by multiple organizations• Available – Logical plan syntax and interpreter – Reference interpreter• In progress – SQL interpreter – Storage engine implementations for Accumulo, Cassandra, HBase and various file formats• Significant community momentum – Over 200 people on the Drill mailing list – Over 200 members of the Bay Area Drill User Group – Drill meetups across the US and Europe – OpenDremel team joined Apache Drill• Anticipated schedule: – Prototype: Q1 – Alpha: Q2 – Beta: Q3
  • Why Apache Drill Will Be SuccessfulResources Community Architecture• Contributors have strong • Development done in the • Full SQL backgrounds from open • New data support companies like Oracle, • Active contributors from • Extensible APIs IBM Netezza, Informatica, multiple companies • Full Columnar Execution Clustrix and Pentaho • Rapidly growing • Beyond Hadoop
  • Questions?• What problems can Drill solve for you?• Where does it fit in the organization?• Which data sources and BI tools are important to you?
  • Let’s Talk!• tdunning@maprtech.com• tdunning@apache.org• @ted_dunning @ApacheDrill• Slides at http://bit.ly/YxZq8X• See also http://www.mapr.com/support/community- resources/drill
  • APPENDIX
  • Why Not Leverage MapReduce?• Scheduling Model – Coarse resource model reduces hardware utilization – Acquisition of resources typically takes 100’s of millis to seconds• Barriers – Map completion required before shuffle/reduce commencement – All maps must complete before reduce can start – In chained jobs, one job must finish entirely before the next one can start• Persistence and Recoverability – Data is persisted to disk between each barrier – Serialization and deserialization are required between execution phase