Apache Drill

T
Ted DunningSoftware Engineer at MapR Technologies
Apache’s Answer to Low Latency
 Interactive Query for Big Data
         February 13, 2013
Agenda

•   Apache Drill overview
•   Key features
•   Status and progress
•   Discuss potential use cases and cooperation
Big Data Workloads
•   ETL
•   Data mining
•   Blob store
•   Lightweight OLTP on large datasets
•   Index and model generation
•   Web crawling
•   Stream processing
•   Clustering, anomaly detection and classification
•   Interactive analysis
Interactive Queries and Hadoop
                          Common Solutions




 Compile SQL to                                  Export MapReduce
                      External tables in an
  MapReduce                                     results to RDBMS and
                         MPP database
                                                  query the RDBMS

                        Emerging Technologies

                             Impala
     Real-time                Real-time              SQL based
interactive queries      interactive queries          analytics
Example Problem
     • Jane works as an        Transaction
       analyst at an e-        information
       commerce company
     • How does she figure
                                 User
       out good targeting       profiles
       segments for the next
       marketing campaign?
     • She has some ideas        Access
       and lots of data           logs
Solving the Problem with Traditional Systems

• Use an RDBMS
   – ETL the data from MongoDB and Hadoop into the RDBMS
       • MongoDB data must be flattened, schematized, filtered and aggregated
       • Hadoop data must be filtered and aggregated
   – Query the data using any SQL-based tool
• Use MapReduce
   – ETL the data from Oracle and MongoDB into Hadoop
   – Work with the MapReduce team to generate the desired analyses
• Use Hive
   – ETL the data from Oracle and MongoDB into Hadoop
       • MongoDB data must be flattened and schematized
   – But HiveQL is limited, queries take too long and BI tool support is
     limited
WWGD

          Distributed              Interactive     Batch
                        NoSQL
          File System                analysis    processing


             GFS        BigTable    Dremel       MapReduce


                                                  Hadoop
            HDFS         HBase        ???
                                                 MapReduce




 Build Apache Drill to provide a true open source
    solution to interactive analysis of Big Data
Apache Drill Overview
• Interactive analysis of Big Data using standard SQL
• Fast
    – Low latency queries
    – Columnar execution
         • Inspired by Google Dremel/BigQuery                          Interactive queries
    – Complement native interfaces and                  Apache Drill   Data analyst
      MapReduce/Hive/Pig                                               Reporting
• Open                                                                 100 ms-20 min

    – Community driven open source project
    – Under Apache Software Foundation
• Modern
    –   Standard ANSI SQL:2003 (select/into)                           Data mining
    –   Nested/hierarchical data support                MapReduce      Modeling
                                                             Hive
    –   Schema is optional                                    Pig
                                                                       Large ETL
    –   Supports RDBMS, Hadoop and NoSQL                               20 min-20 hr
How Does It Work?
• Drillbits run on each node, designed to
  maximize data locality
• Processing is done outside MapReduce
                                            SELECT * FROM
  paradigm (but possibly within YARN)       oracle.transactions,
• Queries can be fed to any Drillbit        mongo.users, hdfs.e
                                            vents
• Coordination, query                       LIMIT 1
  planning, optimization, scheduling, and
  execution are distributed
Key Features

•   Full SQL (ANSI SQL:2003)
•   Nested data
•   Schema is optional
•   Flexible and extensible architecture
Full SQL (ANSI SQL:2003)
• Drill supports standard ANSI SQL:2003
   – Correlated subqueries, analytic functions, …
   – SQL-like is not enough
• Use any SQL-based tool with Apache Drill
   – Tableau, Microstrategy, Excel, SAP Crystal Reports, Toad, SQuirreL, …
   – Standard ODBC and JDBC drivers
                          Client

            Tableau

                                                         Drillbit
          MicroStrategy
                          Drill%
                               ODBC%            SQL%Query%           Query%
                                       Driver                                  Drillbits
                            Driver                Parser            Planner   Drill%
                                                                                   Worker
              Excel
                                                                               Drill%Worker

           SAP%
              Crystal%
            Reports
Nested Data
                                                                   JSON
• Nested data is becoming prevalent                      {
                                                             "name": "Homer",
   – JSON, BSON, XML, Protocol Buffers, Avro, etc.           "gender": "Male",
                                                             "followers": 100
   – The data source may or may not be aware                 children: [
       • MongoDB supports nested data natively                 {name: "Bart"},
                                                               {name: "Lisa”}
       • A single HBase value could be a JSON document       ]
         (compound nested type)                          }
   – Google Dremel’s innovation was efficient columnar
     storage and querying of nested data
• Flattening nested data is error-prone and often                  Avro
                                                         enum Gender {
  impossible                                               MALE, FEMALE
   – Think about repeated and optional fields at every   }

     level…                                              record User {
• Apache Drill supports nested data                        string name;
                                                           Gender gender;
   – Extensions to ANSI SQL:2003                           long followers;
                                                         }
Schema is Optional
• Many data sources do not have rigid schemas
     – Schemas change rapidly
     – Each record may have a different schema
          • Sparse and wide rows in HBase and Cassandra, MongoDB
• Apache Drill supports querying against unknown schemas
     – Query any HBase, Cassandra or MongoDB table
• User can define the schema or let the system discover it automatically
     – System of record may already have schema information
          • Why manage it in a separate system?
     – No need to manage schema evolution

Row Key              CF contents                  CF anchor
"com.cnn.www"        contents:html = "<html>…"    anchor:my.look.ca = "CNN.com"
                                                  anchor:cnnsi.com = "CNN"
"com.foxnews.www"    contents:html = "<html>…"    anchor:en.wikipedia.org = "Fox News"

…                    …                            …
Flexible and Extensible Architecture
• Apache Drill is designed for extensibility
• Well-documented APIs and interfaces
• Data sources and file formats
    – Implement a custom scanner to support a new data source or file format
• Query languages
    – SQL:2003 is the primary language
    – Implement a custom Parser to support a Domain Specific Language
    – UDFs and UDTFs
• Optimizers
    – Drill will have a cost-based optimizer
    – Clear surrounding APIs support easy optimizer exploration
• Operators
    – Custom operators can be implemented
         • Special operators for Mahout (k-means) being designed
    – Operator push-down to data source (RDBMS)
How Does Impala Fit In?

Impala Strengths                                 Questions
•   Beta currently available                     •   Open Source ‘Lite’
•   Easy install and setup on top of             •   Doesn’t support RDBMS or other
    Cloudera                                         NoSQLs (beyond Hadoop/HBase)
•   Faster than Hive on some queries             •   Early row materialization increases
•   SQL-like query language                          footprint and reduces performance
                                                 •   Limited file format support
                                                 •   Query results must fit in memory!
                                                 •   Rigid schema is required
                                                 •   No support for nested data
                                                 •   Compound APIs restrict optimizer
                                                     progression
                                                 •   SQL-like (not SQL)


    Many important features are “coming soon”. Architectural foundation is constrained. No
                                  community development.
Status: In Progress
•   Heavy active development by multiple organizations
•   Available
     – Logical plan syntax and interpreter
     – Reference interpreter
•   In progress
     – SQL interpreter
     – Storage engine implementations for Accumulo, Cassandra, HBase and various file formats
•   Significant community momentum
     –   Over 200 people on the Drill mailing list
     –   Over 200 members of the Bay Area Drill User Group
     –   Drill meetups across the US and Europe
     –   OpenDremel team joined Apache Drill
•   Anticipated schedule:
     – Prototype: Q1
     – Alpha: Q2
     – Beta: Q3
Why Apache Drill Will Be Successful
Resources                      Community                    Architecture
• Contributors have strong     • Development done in the    • Full SQL
   backgrounds from              open                       • New data support
   companies like Oracle,      • Active contributors from   • Extensible APIs
   IBM Netezza, Informatica,     multiple companies         • Full Columnar Execution
   Clustrix and Pentaho        • Rapidly growing            • Beyond Hadoop
Questions?

• What problems can Drill solve for you?
• Where does it fit in the organization?
• Which data sources and BI tools are important
  to you?
Let’s Talk!

•   tdunning@maprtech.com
•   tdunning@apache.org
•   @ted_dunning @ApacheDrill
•   Slides at http://bit.ly/YxZq8X
•   See also
     http://www.mapr.com/support/community-
    resources/drill
APPENDIX
Why Not Leverage MapReduce?
• Scheduling Model
   – Coarse resource model reduces hardware utilization
   – Acquisition of resources typically takes 100’s of millis to seconds
• Barriers
   – Map completion required before shuffle/reduce
     commencement
   – All maps must complete before reduce can start
   – In chained jobs, one job must finish entirely before the next one
     can start
• Persistence and Recoverability
   – Data is persisted to disk between each barrier
   – Serialization and deserialization are required between execution
     phase
1 of 21

Recommended

Putting Apache Drill into Production by
Putting Apache Drill into ProductionPutting Apache Drill into Production
Putting Apache Drill into ProductionMapR Technologies
3.1K views38 slides
Apache Spark Architecture by
Apache Spark ArchitectureApache Spark Architecture
Apache Spark ArchitectureAlexey Grishchenko
75.9K views114 slides
Apache Tez: Accelerating Hadoop Query Processing by
Apache Tez: Accelerating Hadoop Query Processing Apache Tez: Accelerating Hadoop Query Processing
Apache Tez: Accelerating Hadoop Query Processing DataWorks Summit
31.7K views44 slides
Hive tuning by
Hive tuningHive tuning
Hive tuningMichael Zhang
44.1K views91 slides
Hive+Tez: A performance deep dive by
Hive+Tez: A performance deep diveHive+Tez: A performance deep dive
Hive+Tez: A performance deep divet3rmin4t0r
9.6K views41 slides
What is in a Lucene index? by
What is in a Lucene index?What is in a Lucene index?
What is in a Lucene index?lucenerevolution
48K views38 slides

More Related Content

What's hot

Spark graphx by
Spark graphxSpark graphx
Spark graphxCarol McDonald
1.5K views63 slides
Hadoop World 2011: Hadoop Troubleshooting 101 - Kate Ting - Cloudera by
Hadoop World 2011: Hadoop Troubleshooting 101 - Kate Ting - ClouderaHadoop World 2011: Hadoop Troubleshooting 101 - Kate Ting - Cloudera
Hadoop World 2011: Hadoop Troubleshooting 101 - Kate Ting - ClouderaCloudera, Inc.
24.5K views42 slides
Efficient Data Storage for Analytics with Parquet 2.0 - Hadoop Summit 2014 by
Efficient Data Storage for Analytics with Parquet 2.0 - Hadoop Summit 2014Efficient Data Storage for Analytics with Parquet 2.0 - Hadoop Summit 2014
Efficient Data Storage for Analytics with Parquet 2.0 - Hadoop Summit 2014Julien Le Dem
16.6K views40 slides
HDFS on Kubernetes—Lessons Learned with Kimoon Kim by
HDFS on Kubernetes—Lessons Learned with Kimoon KimHDFS on Kubernetes—Lessons Learned with Kimoon Kim
HDFS on Kubernetes—Lessons Learned with Kimoon KimDatabricks
9.1K views21 slides
Hadoop YARN | Hadoop YARN Architecture | Hadoop YARN Tutorial | Hadoop Tutori... by
Hadoop YARN | Hadoop YARN Architecture | Hadoop YARN Tutorial | Hadoop Tutori...Hadoop YARN | Hadoop YARN Architecture | Hadoop YARN Tutorial | Hadoop Tutori...
Hadoop YARN | Hadoop YARN Architecture | Hadoop YARN Tutorial | Hadoop Tutori...Simplilearn
5.1K views61 slides
Apache Tez - A New Chapter in Hadoop Data Processing by
Apache Tez - A New Chapter in Hadoop Data ProcessingApache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data ProcessingDataWorks Summit
18.3K views39 slides

What's hot(20)

Hadoop World 2011: Hadoop Troubleshooting 101 - Kate Ting - Cloudera by Cloudera, Inc.
Hadoop World 2011: Hadoop Troubleshooting 101 - Kate Ting - ClouderaHadoop World 2011: Hadoop Troubleshooting 101 - Kate Ting - Cloudera
Hadoop World 2011: Hadoop Troubleshooting 101 - Kate Ting - Cloudera
Cloudera, Inc.24.5K views
Efficient Data Storage for Analytics with Parquet 2.0 - Hadoop Summit 2014 by Julien Le Dem
Efficient Data Storage for Analytics with Parquet 2.0 - Hadoop Summit 2014Efficient Data Storage for Analytics with Parquet 2.0 - Hadoop Summit 2014
Efficient Data Storage for Analytics with Parquet 2.0 - Hadoop Summit 2014
Julien Le Dem16.6K views
HDFS on Kubernetes—Lessons Learned with Kimoon Kim by Databricks
HDFS on Kubernetes—Lessons Learned with Kimoon KimHDFS on Kubernetes—Lessons Learned with Kimoon Kim
HDFS on Kubernetes—Lessons Learned with Kimoon Kim
Databricks9.1K views
Hadoop YARN | Hadoop YARN Architecture | Hadoop YARN Tutorial | Hadoop Tutori... by Simplilearn
Hadoop YARN | Hadoop YARN Architecture | Hadoop YARN Tutorial | Hadoop Tutori...Hadoop YARN | Hadoop YARN Architecture | Hadoop YARN Tutorial | Hadoop Tutori...
Hadoop YARN | Hadoop YARN Architecture | Hadoop YARN Tutorial | Hadoop Tutori...
Simplilearn5.1K views
Apache Tez - A New Chapter in Hadoop Data Processing by DataWorks Summit
Apache Tez - A New Chapter in Hadoop Data ProcessingApache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data Processing
DataWorks Summit18.3K views
Securing Hadoop with Apache Ranger by DataWorks Summit
Securing Hadoop with Apache RangerSecuring Hadoop with Apache Ranger
Securing Hadoop with Apache Ranger
DataWorks Summit20.6K views
Ozone- Object store for Apache Hadoop by Hortonworks
Ozone- Object store for Apache HadoopOzone- Object store for Apache Hadoop
Ozone- Object store for Apache Hadoop
Hortonworks4.1K views
Hadoop Distributed File System by elliando dias
Hadoop Distributed File SystemHadoop Distributed File System
Hadoop Distributed File System
elliando dias4.6K views
Spark (v1.3) - Présentation (Français) by Alexis Seigneurin
Spark (v1.3) - Présentation (Français)Spark (v1.3) - Présentation (Français)
Spark (v1.3) - Présentation (Français)
Alexis Seigneurin5.9K views
Introduction to Bigdata and HADOOP by vinoth kumar
Introduction to Bigdata and HADOOP Introduction to Bigdata and HADOOP
Introduction to Bigdata and HADOOP
vinoth kumar1K views
Parquet Hadoop Summit 2013 by Julien Le Dem
Parquet Hadoop Summit 2013Parquet Hadoop Summit 2013
Parquet Hadoop Summit 2013
Julien Le Dem26K views
Hadoop hive presentation by Arvind Kumar
Hadoop hive presentationHadoop hive presentation
Hadoop hive presentation
Arvind Kumar5.5K views
Sqoop on Spark for Data Ingestion by DataWorks Summit
Sqoop on Spark for Data IngestionSqoop on Spark for Data Ingestion
Sqoop on Spark for Data Ingestion
DataWorks Summit16.3K views
Construisez votre première application MongoDB by MongoDB
Construisez votre première application MongoDBConstruisez votre première application MongoDB
Construisez votre première application MongoDB
MongoDB356 views

Viewers also liked

Apache Drill: Building Highly Flexible, High Performance Query Engines by M.C... by
Apache Drill: Building Highly Flexible, High Performance Query Engines by M.C...Apache Drill: Building Highly Flexible, High Performance Query Engines by M.C...
Apache Drill: Building Highly Flexible, High Performance Query Engines by M.C...The Hive
10.9K views63 slides
Spark SQL versus Apache Drill: Different Tools with Different Rules by
Spark SQL versus Apache Drill: Different Tools with Different RulesSpark SQL versus Apache Drill: Different Tools with Different Rules
Spark SQL versus Apache Drill: Different Tools with Different RulesDataWorks Summit/Hadoop Summit
5.8K views66 slides
Apache Drill and Zeppelin: Two Promising Tools You've Never Heard Of by
Apache Drill and Zeppelin: Two Promising Tools You've Never Heard OfApache Drill and Zeppelin: Two Promising Tools You've Never Heard Of
Apache Drill and Zeppelin: Two Promising Tools You've Never Heard OfCharles Givre
8.1K views56 slides
An introduction to apache drill presentation by
An introduction to apache drill presentationAn introduction to apache drill presentation
An introduction to apache drill presentationMapR Technologies
2.7K views21 slides
Drill into Drill – How Providing Flexibility and Performance is Possible by
Drill into Drill – How Providing Flexibility and Performance is PossibleDrill into Drill – How Providing Flexibility and Performance is Possible
Drill into Drill – How Providing Flexibility and Performance is PossibleMapR Technologies
1.9K views44 slides
SQL-on-Hadoop with Apache Drill by
SQL-on-Hadoop with Apache DrillSQL-on-Hadoop with Apache Drill
SQL-on-Hadoop with Apache DrillMapR Technologies
1.4K views40 slides

Viewers also liked(20)

Apache Drill: Building Highly Flexible, High Performance Query Engines by M.C... by The Hive
Apache Drill: Building Highly Flexible, High Performance Query Engines by M.C...Apache Drill: Building Highly Flexible, High Performance Query Engines by M.C...
Apache Drill: Building Highly Flexible, High Performance Query Engines by M.C...
The Hive10.9K views
Apache Drill and Zeppelin: Two Promising Tools You've Never Heard Of by Charles Givre
Apache Drill and Zeppelin: Two Promising Tools You've Never Heard OfApache Drill and Zeppelin: Two Promising Tools You've Never Heard Of
Apache Drill and Zeppelin: Two Promising Tools You've Never Heard Of
Charles Givre8.1K views
An introduction to apache drill presentation by MapR Technologies
An introduction to apache drill presentationAn introduction to apache drill presentation
An introduction to apache drill presentation
MapR Technologies2.7K views
Drill into Drill – How Providing Flexibility and Performance is Possible by MapR Technologies
Drill into Drill – How Providing Flexibility and Performance is PossibleDrill into Drill – How Providing Flexibility and Performance is Possible
Drill into Drill – How Providing Flexibility and Performance is Possible
MapR Technologies1.9K views
Apache Drill Workshop by Charles Givre
Apache Drill WorkshopApache Drill Workshop
Apache Drill Workshop
Charles Givre1.4K views
Killing ETL with Apache Drill by Charles Givre
Killing ETL with Apache DrillKilling ETL with Apache Drill
Killing ETL with Apache Drill
Charles Givre3.7K views
Data Exploration with Apache Drill: Day 2 by Charles Givre
Data Exploration with Apache Drill: Day 2Data Exploration with Apache Drill: Day 2
Data Exploration with Apache Drill: Day 2
Charles Givre1.2K views
Data Exploration with Apache Drill: Day 1 by Charles Givre
Data Exploration with Apache Drill:  Day 1Data Exploration with Apache Drill:  Day 1
Data Exploration with Apache Drill: Day 1
Charles Givre1.1K views
Drill / SQL / Optiq by Julian Hyde
Drill / SQL / OptiqDrill / SQL / Optiq
Drill / SQL / Optiq
Julian Hyde3.5K views
SQL for NoSQL and how Apache Calcite can help by Christian Tzolov
SQL for NoSQL and how  Apache Calcite can helpSQL for NoSQL and how  Apache Calcite can help
SQL for NoSQL and how Apache Calcite can help
Christian Tzolov1.2K views
Aws athenaを使ってみた by Sunggyu Rhie
Aws athenaを使ってみたAws athenaを使ってみた
Aws athenaを使ってみた
Sunggyu Rhie660 views
Cluster management and automation with cloudera manager by Chris Westin
Cluster management and automation with cloudera managerCluster management and automation with cloudera manager
Cluster management and automation with cloudera manager
Chris Westin4.9K views
Interactive SQL-on-Hadoop and JethroData by Ofir Manor
Interactive SQL-on-Hadoop and JethroDataInteractive SQL-on-Hadoop and JethroData
Interactive SQL-on-Hadoop and JethroData
Ofir Manor4K views

Similar to Apache Drill

Drill njhug -19 feb2013 by
Drill njhug -19 feb2013Drill njhug -19 feb2013
Drill njhug -19 feb2013MapR Technologies
1.3K views30 slides
Apache drill by
Apache drillApache drill
Apache drillMapR Technologies
2.2K views21 slides
No sql and sql - open analytics summit by
No sql and sql - open analytics summitNo sql and sql - open analytics summit
No sql and sql - open analytics summitOpen Analytics
4.9K views28 slides
Apache Drill at ApacheCon2014 by
Apache Drill at ApacheCon2014Apache Drill at ApacheCon2014
Apache Drill at ApacheCon2014Neeraja Rentachintala
296 views41 slides
Big Data Developers Moscow Meetup 1 - sql on hadoop by
Big Data Developers Moscow Meetup 1  - sql on hadoopBig Data Developers Moscow Meetup 1  - sql on hadoop
Big Data Developers Moscow Meetup 1 - sql on hadoopbddmoscow
931 views41 slides
Etu Solution Day 2014 Track-D: 掌握Impala和Spark by
Etu Solution Day 2014 Track-D: 掌握Impala和SparkEtu Solution Day 2014 Track-D: 掌握Impala和Spark
Etu Solution Day 2014 Track-D: 掌握Impala和SparkJames Chen
691 views60 slides

Similar to Apache Drill(20)

No sql and sql - open analytics summit by Open Analytics
No sql and sql - open analytics summitNo sql and sql - open analytics summit
No sql and sql - open analytics summit
Open Analytics4.9K views
Big Data Developers Moscow Meetup 1 - sql on hadoop by bddmoscow
Big Data Developers Moscow Meetup 1  - sql on hadoopBig Data Developers Moscow Meetup 1  - sql on hadoop
Big Data Developers Moscow Meetup 1 - sql on hadoop
bddmoscow931 views
Etu Solution Day 2014 Track-D: 掌握Impala和Spark by James Chen
Etu Solution Day 2014 Track-D: 掌握Impala和SparkEtu Solution Day 2014 Track-D: 掌握Impala和Spark
Etu Solution Day 2014 Track-D: 掌握Impala和Spark
James Chen691 views
Real time hadoop + mapreduce intro by Geoff Hendrey
Real time hadoop + mapreduce introReal time hadoop + mapreduce intro
Real time hadoop + mapreduce intro
Geoff Hendrey5.5K views
SQL on Hadoop by Bigdatapump
SQL on HadoopSQL on Hadoop
SQL on Hadoop
Bigdatapump1.5K views
2013 year of real-time hadoop by Geoff Hendrey
2013 year of real-time hadoop2013 year of real-time hadoop
2013 year of real-time hadoop
Geoff Hendrey1.1K views
Cloudera Impala - San Diego Big Data Meetup August 13th 2014 by cdmaxime
Cloudera Impala - San Diego Big Data Meetup August 13th 2014Cloudera Impala - San Diego Big Data Meetup August 13th 2014
Cloudera Impala - San Diego Big Data Meetup August 13th 2014
cdmaxime1.3K views
Processing Big Data by cwensel
Processing Big DataProcessing Big Data
Processing Big Data
cwensel817 views
Understanding the Value and Architecture of Apache Drill by DataWorks Summit
Understanding the Value and Architecture of Apache DrillUnderstanding the Value and Architecture of Apache Drill
Understanding the Value and Architecture of Apache Drill
DataWorks Summit6.4K views
Drill at the Chug 9-19-12 by Ted Dunning
Drill at the Chug 9-19-12Drill at the Chug 9-19-12
Drill at the Chug 9-19-12
Ted Dunning1.9K views
4. hadoop גיא לבנברג by Taldor Group
4. hadoop  גיא לבנברג4. hadoop  גיא לבנברג
4. hadoop גיא לבנברג
Taldor Group864 views

More from Ted Dunning

Dunning - SIGMOD - Data Economy.pptx by
Dunning - SIGMOD - Data Economy.pptxDunning - SIGMOD - Data Economy.pptx
Dunning - SIGMOD - Data Economy.pptxTed Dunning
19 views21 slides
How to Get Going with Kubernetes by
How to Get Going with KubernetesHow to Get Going with Kubernetes
How to Get Going with KubernetesTed Dunning
593 views80 slides
Progress for big data in Kubernetes by
Progress for big data in KubernetesProgress for big data in Kubernetes
Progress for big data in KubernetesTed Dunning
473 views82 slides
Anomaly Detection: How to find what you didn’t know to look for by
Anomaly Detection: How to find what you didn’t know to look forAnomaly Detection: How to find what you didn’t know to look for
Anomaly Detection: How to find what you didn’t know to look forTed Dunning
766 views104 slides
Streaming Architecture including Rendezvous for Machine Learning by
Streaming Architecture including Rendezvous for Machine LearningStreaming Architecture including Rendezvous for Machine Learning
Streaming Architecture including Rendezvous for Machine LearningTed Dunning
679 views83 slides
Machine Learning Logistics by
Machine Learning LogisticsMachine Learning Logistics
Machine Learning LogisticsTed Dunning
613 views52 slides

More from Ted Dunning(20)

Dunning - SIGMOD - Data Economy.pptx by Ted Dunning
Dunning - SIGMOD - Data Economy.pptxDunning - SIGMOD - Data Economy.pptx
Dunning - SIGMOD - Data Economy.pptx
Ted Dunning19 views
How to Get Going with Kubernetes by Ted Dunning
How to Get Going with KubernetesHow to Get Going with Kubernetes
How to Get Going with Kubernetes
Ted Dunning593 views
Progress for big data in Kubernetes by Ted Dunning
Progress for big data in KubernetesProgress for big data in Kubernetes
Progress for big data in Kubernetes
Ted Dunning473 views
Anomaly Detection: How to find what you didn’t know to look for by Ted Dunning
Anomaly Detection: How to find what you didn’t know to look forAnomaly Detection: How to find what you didn’t know to look for
Anomaly Detection: How to find what you didn’t know to look for
Ted Dunning766 views
Streaming Architecture including Rendezvous for Machine Learning by Ted Dunning
Streaming Architecture including Rendezvous for Machine LearningStreaming Architecture including Rendezvous for Machine Learning
Streaming Architecture including Rendezvous for Machine Learning
Ted Dunning679 views
Machine Learning Logistics by Ted Dunning
Machine Learning LogisticsMachine Learning Logistics
Machine Learning Logistics
Ted Dunning613 views
Tensor Abuse - how to reuse machine learning frameworks by Ted Dunning
Tensor Abuse - how to reuse machine learning frameworksTensor Abuse - how to reuse machine learning frameworks
Tensor Abuse - how to reuse machine learning frameworks
Ted Dunning883 views
Machine Learning logistics by Ted Dunning
Machine Learning logisticsMachine Learning logistics
Machine Learning logistics
Ted Dunning3.9K views
T digest-update by Ted Dunning
T digest-updateT digest-update
T digest-update
Ted Dunning1.4K views
Finding Changes in Real Data by Ted Dunning
Finding Changes in Real DataFinding Changes in Real Data
Finding Changes in Real Data
Ted Dunning803 views
Where is Data Going? - RMDC Keynote by Ted Dunning
Where is Data Going? - RMDC KeynoteWhere is Data Going? - RMDC Keynote
Where is Data Going? - RMDC Keynote
Ted Dunning545 views
Real time-hadoop by Ted Dunning
Real time-hadoopReal time-hadoop
Real time-hadoop
Ted Dunning1.7K views
Cheap learning-dunning-9-18-2015 by Ted Dunning
Cheap learning-dunning-9-18-2015Cheap learning-dunning-9-18-2015
Cheap learning-dunning-9-18-2015
Ted Dunning1.8K views
Sharing Sensitive Data Securely by Ted Dunning
Sharing Sensitive Data SecurelySharing Sensitive Data Securely
Sharing Sensitive Data Securely
Ted Dunning1.8K views
Real-time Puppies and Ponies - Evolving Indicator Recommendations in Real-time by Ted Dunning
Real-time Puppies and Ponies - Evolving Indicator Recommendations in Real-timeReal-time Puppies and Ponies - Evolving Indicator Recommendations in Real-time
Real-time Puppies and Ponies - Evolving Indicator Recommendations in Real-time
Ted Dunning2.8K views
How the Internet of Things is Turning the Internet Upside Down by Ted Dunning
How the Internet of Things is Turning the Internet Upside DownHow the Internet of Things is Turning the Internet Upside Down
How the Internet of Things is Turning the Internet Upside Down
Ted Dunning1.7K views
Apache Kylin - OLAP Cubes for SQL on Hadoop by Ted Dunning
Apache Kylin - OLAP Cubes for SQL on HadoopApache Kylin - OLAP Cubes for SQL on Hadoop
Apache Kylin - OLAP Cubes for SQL on Hadoop
Ted Dunning8.5K views
Dunning time-series-2015 by Ted Dunning
Dunning time-series-2015Dunning time-series-2015
Dunning time-series-2015
Ted Dunning1.1K views
Doing-the-impossible by Ted Dunning
Doing-the-impossibleDoing-the-impossible
Doing-the-impossible
Ted Dunning3.3K views
Anomaly Detection - New York Machine Learning by Ted Dunning
Anomaly Detection - New York Machine LearningAnomaly Detection - New York Machine Learning
Anomaly Detection - New York Machine Learning
Ted Dunning6.3K views

Recently uploaded

iSAQB Software Architecture Gathering 2023: How Process Orchestration Increas... by
iSAQB Software Architecture Gathering 2023: How Process Orchestration Increas...iSAQB Software Architecture Gathering 2023: How Process Orchestration Increas...
iSAQB Software Architecture Gathering 2023: How Process Orchestration Increas...Bernd Ruecker
33 views69 slides
Future of Indian ConsumerTech by
Future of Indian ConsumerTechFuture of Indian ConsumerTech
Future of Indian ConsumerTechKapil Khandelwal (KK)
21 views68 slides
Kyo - Functional Scala 2023.pdf by
Kyo - Functional Scala 2023.pdfKyo - Functional Scala 2023.pdf
Kyo - Functional Scala 2023.pdfFlavio W. Brasil
298 views92 slides
Info Session November 2023.pdf by
Info Session November 2023.pdfInfo Session November 2023.pdf
Info Session November 2023.pdfAleksandraKoprivica4
11 views15 slides
Voice Logger - Telephony Integration Solution at Aegis by
Voice Logger - Telephony Integration Solution at AegisVoice Logger - Telephony Integration Solution at Aegis
Voice Logger - Telephony Integration Solution at AegisNirmal Sharma
31 views1 slide
Igniting Next Level Productivity with AI-Infused Data Integration Workflows by
Igniting Next Level Productivity with AI-Infused Data Integration Workflows Igniting Next Level Productivity with AI-Infused Data Integration Workflows
Igniting Next Level Productivity with AI-Infused Data Integration Workflows Safe Software
257 views86 slides

Recently uploaded(20)

iSAQB Software Architecture Gathering 2023: How Process Orchestration Increas... by Bernd Ruecker
iSAQB Software Architecture Gathering 2023: How Process Orchestration Increas...iSAQB Software Architecture Gathering 2023: How Process Orchestration Increas...
iSAQB Software Architecture Gathering 2023: How Process Orchestration Increas...
Bernd Ruecker33 views
Voice Logger - Telephony Integration Solution at Aegis by Nirmal Sharma
Voice Logger - Telephony Integration Solution at AegisVoice Logger - Telephony Integration Solution at Aegis
Voice Logger - Telephony Integration Solution at Aegis
Nirmal Sharma31 views
Igniting Next Level Productivity with AI-Infused Data Integration Workflows by Safe Software
Igniting Next Level Productivity with AI-Infused Data Integration Workflows Igniting Next Level Productivity with AI-Infused Data Integration Workflows
Igniting Next Level Productivity with AI-Infused Data Integration Workflows
Safe Software257 views
Business Analyst Series 2023 - Week 3 Session 5 by DianaGray10
Business Analyst Series 2023 -  Week 3 Session 5Business Analyst Series 2023 -  Week 3 Session 5
Business Analyst Series 2023 - Week 3 Session 5
DianaGray10237 views
AMAZON PRODUCT RESEARCH.pdf by JerikkLaureta
AMAZON PRODUCT RESEARCH.pdfAMAZON PRODUCT RESEARCH.pdf
AMAZON PRODUCT RESEARCH.pdf
JerikkLaureta19 views
GDG Cloud Southlake 28 Brad Taylor and Shawn Augenstein Old Problems in the N... by James Anderson
GDG Cloud Southlake 28 Brad Taylor and Shawn Augenstein Old Problems in the N...GDG Cloud Southlake 28 Brad Taylor and Shawn Augenstein Old Problems in the N...
GDG Cloud Southlake 28 Brad Taylor and Shawn Augenstein Old Problems in the N...
James Anderson66 views
Five Things You SHOULD Know About Postman by Postman
Five Things You SHOULD Know About PostmanFive Things You SHOULD Know About Postman
Five Things You SHOULD Know About Postman
Postman30 views
ESPC 2023 - Protect and Govern your Sensitive Data with Microsoft Purview in ... by Jasper Oosterveld
ESPC 2023 - Protect and Govern your Sensitive Data with Microsoft Purview in ...ESPC 2023 - Protect and Govern your Sensitive Data with Microsoft Purview in ...
ESPC 2023 - Protect and Govern your Sensitive Data with Microsoft Purview in ...
Case Study Copenhagen Energy and Business Central.pdf by Aitana
Case Study Copenhagen Energy and Business Central.pdfCase Study Copenhagen Energy and Business Central.pdf
Case Study Copenhagen Energy and Business Central.pdf
Aitana16 views
SAP Automation Using Bar Code and FIORI.pdf by Virendra Rai, PMP
SAP Automation Using Bar Code and FIORI.pdfSAP Automation Using Bar Code and FIORI.pdf
SAP Automation Using Bar Code and FIORI.pdf
PharoJS - Zürich Smalltalk Group Meetup November 2023 by Noury Bouraqadi
PharoJS - Zürich Smalltalk Group Meetup November 2023PharoJS - Zürich Smalltalk Group Meetup November 2023
PharoJS - Zürich Smalltalk Group Meetup November 2023
Noury Bouraqadi126 views
Automating a World-Class Technology Conference; Behind the Scenes of CiscoLive by Network Automation Forum
Automating a World-Class Technology Conference; Behind the Scenes of CiscoLiveAutomating a World-Class Technology Conference; Behind the Scenes of CiscoLive
Automating a World-Class Technology Conference; Behind the Scenes of CiscoLive

Apache Drill

  • 1. Apache’s Answer to Low Latency Interactive Query for Big Data February 13, 2013
  • 2. Agenda • Apache Drill overview • Key features • Status and progress • Discuss potential use cases and cooperation
  • 3. Big Data Workloads • ETL • Data mining • Blob store • Lightweight OLTP on large datasets • Index and model generation • Web crawling • Stream processing • Clustering, anomaly detection and classification • Interactive analysis
  • 4. Interactive Queries and Hadoop Common Solutions Compile SQL to Export MapReduce External tables in an MapReduce results to RDBMS and MPP database query the RDBMS Emerging Technologies Impala Real-time Real-time SQL based interactive queries interactive queries analytics
  • 5. Example Problem • Jane works as an Transaction analyst at an e- information commerce company • How does she figure User out good targeting profiles segments for the next marketing campaign? • She has some ideas Access and lots of data logs
  • 6. Solving the Problem with Traditional Systems • Use an RDBMS – ETL the data from MongoDB and Hadoop into the RDBMS • MongoDB data must be flattened, schematized, filtered and aggregated • Hadoop data must be filtered and aggregated – Query the data using any SQL-based tool • Use MapReduce – ETL the data from Oracle and MongoDB into Hadoop – Work with the MapReduce team to generate the desired analyses • Use Hive – ETL the data from Oracle and MongoDB into Hadoop • MongoDB data must be flattened and schematized – But HiveQL is limited, queries take too long and BI tool support is limited
  • 7. WWGD Distributed Interactive Batch NoSQL File System analysis processing GFS BigTable Dremel MapReduce Hadoop HDFS HBase ??? MapReduce Build Apache Drill to provide a true open source solution to interactive analysis of Big Data
  • 8. Apache Drill Overview • Interactive analysis of Big Data using standard SQL • Fast – Low latency queries – Columnar execution • Inspired by Google Dremel/BigQuery Interactive queries – Complement native interfaces and Apache Drill Data analyst MapReduce/Hive/Pig Reporting • Open 100 ms-20 min – Community driven open source project – Under Apache Software Foundation • Modern – Standard ANSI SQL:2003 (select/into) Data mining – Nested/hierarchical data support MapReduce Modeling Hive – Schema is optional Pig Large ETL – Supports RDBMS, Hadoop and NoSQL 20 min-20 hr
  • 9. How Does It Work? • Drillbits run on each node, designed to maximize data locality • Processing is done outside MapReduce SELECT * FROM paradigm (but possibly within YARN) oracle.transactions, • Queries can be fed to any Drillbit mongo.users, hdfs.e vents • Coordination, query LIMIT 1 planning, optimization, scheduling, and execution are distributed
  • 10. Key Features • Full SQL (ANSI SQL:2003) • Nested data • Schema is optional • Flexible and extensible architecture
  • 11. Full SQL (ANSI SQL:2003) • Drill supports standard ANSI SQL:2003 – Correlated subqueries, analytic functions, … – SQL-like is not enough • Use any SQL-based tool with Apache Drill – Tableau, Microstrategy, Excel, SAP Crystal Reports, Toad, SQuirreL, … – Standard ODBC and JDBC drivers Client Tableau Drillbit MicroStrategy Drill% ODBC% SQL%Query% Query% Driver Drillbits Driver Parser Planner Drill% Worker Excel Drill%Worker SAP% Crystal% Reports
  • 12. Nested Data JSON • Nested data is becoming prevalent { "name": "Homer", – JSON, BSON, XML, Protocol Buffers, Avro, etc. "gender": "Male", "followers": 100 – The data source may or may not be aware children: [ • MongoDB supports nested data natively {name: "Bart"}, {name: "Lisa”} • A single HBase value could be a JSON document ] (compound nested type) } – Google Dremel’s innovation was efficient columnar storage and querying of nested data • Flattening nested data is error-prone and often Avro enum Gender { impossible MALE, FEMALE – Think about repeated and optional fields at every } level… record User { • Apache Drill supports nested data string name; Gender gender; – Extensions to ANSI SQL:2003 long followers; }
  • 13. Schema is Optional • Many data sources do not have rigid schemas – Schemas change rapidly – Each record may have a different schema • Sparse and wide rows in HBase and Cassandra, MongoDB • Apache Drill supports querying against unknown schemas – Query any HBase, Cassandra or MongoDB table • User can define the schema or let the system discover it automatically – System of record may already have schema information • Why manage it in a separate system? – No need to manage schema evolution Row Key CF contents CF anchor "com.cnn.www" contents:html = "<html>…" anchor:my.look.ca = "CNN.com" anchor:cnnsi.com = "CNN" "com.foxnews.www" contents:html = "<html>…" anchor:en.wikipedia.org = "Fox News" … … …
  • 14. Flexible and Extensible Architecture • Apache Drill is designed for extensibility • Well-documented APIs and interfaces • Data sources and file formats – Implement a custom scanner to support a new data source or file format • Query languages – SQL:2003 is the primary language – Implement a custom Parser to support a Domain Specific Language – UDFs and UDTFs • Optimizers – Drill will have a cost-based optimizer – Clear surrounding APIs support easy optimizer exploration • Operators – Custom operators can be implemented • Special operators for Mahout (k-means) being designed – Operator push-down to data source (RDBMS)
  • 15. How Does Impala Fit In? Impala Strengths Questions • Beta currently available • Open Source ‘Lite’ • Easy install and setup on top of • Doesn’t support RDBMS or other Cloudera NoSQLs (beyond Hadoop/HBase) • Faster than Hive on some queries • Early row materialization increases • SQL-like query language footprint and reduces performance • Limited file format support • Query results must fit in memory! • Rigid schema is required • No support for nested data • Compound APIs restrict optimizer progression • SQL-like (not SQL) Many important features are “coming soon”. Architectural foundation is constrained. No community development.
  • 16. Status: In Progress • Heavy active development by multiple organizations • Available – Logical plan syntax and interpreter – Reference interpreter • In progress – SQL interpreter – Storage engine implementations for Accumulo, Cassandra, HBase and various file formats • Significant community momentum – Over 200 people on the Drill mailing list – Over 200 members of the Bay Area Drill User Group – Drill meetups across the US and Europe – OpenDremel team joined Apache Drill • Anticipated schedule: – Prototype: Q1 – Alpha: Q2 – Beta: Q3
  • 17. Why Apache Drill Will Be Successful Resources Community Architecture • Contributors have strong • Development done in the • Full SQL backgrounds from open • New data support companies like Oracle, • Active contributors from • Extensible APIs IBM Netezza, Informatica, multiple companies • Full Columnar Execution Clustrix and Pentaho • Rapidly growing • Beyond Hadoop
  • 18. Questions? • What problems can Drill solve for you? • Where does it fit in the organization? • Which data sources and BI tools are important to you?
  • 19. Let’s Talk! • tdunning@maprtech.com • tdunning@apache.org • @ted_dunning @ApacheDrill • Slides at http://bit.ly/YxZq8X • See also http://www.mapr.com/support/community- resources/drill
  • 21. Why Not Leverage MapReduce? • Scheduling Model – Coarse resource model reduces hardware utilization – Acquisition of resources typically takes 100’s of millis to seconds • Barriers – Map completion required before shuffle/reduce commencement – All maps must complete before reduce can start – In chained jobs, one job must finish entirely before the next one can start • Persistence and Recoverability – Data is persisted to disk between each barrier – Serialization and deserialization are required between execution phase

Editor's Notes

  1. With the recent explosion of everything related to Hadoop, it is no surprise that new projects/implementations related to the Hadoop ecosystem keep appearing. There have been quite a few initiatives that provide SQL interfaces into Hadoop. The Apache Drill project is a distributed system for interactive analysis of large-scale datasets, inspired by Google&apos;s Dremel. Drill is not trying to replace existing Big Data batch processing frameworks, such as Hadoop MapReduce or stream processing frameworks, such as S4 or Storm. It rather fills the existing void – real-time interactive processing of large data sets.------------------------------Technical DetailSimilar to Dremel, the Drill implementation is based on the processing of nested, tree-like data. In Dremel this data is based on protocol buffers – nested schema-based data model. Drill is planning to extend this data model by adding additional schema-based implementations, for example, Apache Avro and schema-less data models such asJSON and BSON. In addition to a single data structure, Drill is also planning to support “baby joins” – joins to the small, loadable in memory, data structures.