SlideShare a Scribd company logo
1 of 21
Apache’s Answer to Low Latency
 Interactive Query for Big Data
         February 13, 2013
Agenda

•   Apache Drill overview
•   Key features
•   Status and progress
•   Discuss potential use cases and cooperation
Big Data Workloads
•   ETL
•   Data mining
•   Blob store
•   Lightweight OLTP on large datasets
•   Index and model generation
•   Web crawling
•   Stream processing
•   Clustering, anomaly detection and classification
•   Interactive analysis
Interactive Queries and Hadoop
                          Common Solutions




 Compile SQL to                                  Export MapReduce
                      External tables in an
  MapReduce                                     results to RDBMS and
                         MPP database
                                                  query the RDBMS

                        Emerging Technologies

                             Impala
     Real-time                Real-time              SQL based
interactive queries      interactive queries          analytics
Example Problem
     • Jane works as an        Transaction
       analyst at an e-        information
       commerce company
     • How does she figure
                                 User
       out good targeting       profiles
       segments for the next
       marketing campaign?
     • She has some ideas        Access
       and lots of data           logs
Solving the Problem with Traditional Systems

• Use an RDBMS
   – ETL the data from MongoDB and Hadoop into the RDBMS
       • MongoDB data must be flattened, schematized, filtered and aggregated
       • Hadoop data must be filtered and aggregated
   – Query the data using any SQL-based tool
• Use MapReduce
   – ETL the data from Oracle and MongoDB into Hadoop
   – Work with the MapReduce team to generate the desired analyses
• Use Hive
   – ETL the data from Oracle and MongoDB into Hadoop
       • MongoDB data must be flattened and schematized
   – But HiveQL is limited, queries take too long and BI tool support is
     limited
WWGD

          Distributed              Interactive     Batch
                        NoSQL
          File System                analysis    processing


             GFS        BigTable    Dremel       MapReduce


                                                  Hadoop
            HDFS         HBase        ???
                                                 MapReduce




 Build Apache Drill to provide a true open source
    solution to interactive analysis of Big Data
Apache Drill Overview
• Interactive analysis of Big Data using standard SQL
• Fast
    – Low latency queries
    – Columnar execution
         • Inspired by Google Dremel/BigQuery                          Interactive queries
    – Complement native interfaces and                  Apache Drill   Data analyst
      MapReduce/Hive/Pig                                               Reporting
• Open                                                                 100 ms-20 min

    – Community driven open source project
    – Under Apache Software Foundation
• Modern
    –   Standard ANSI SQL:2003 (select/into)                           Data mining
    –   Nested/hierarchical data support                MapReduce      Modeling
                                                             Hive
    –   Schema is optional                                    Pig
                                                                       Large ETL
    –   Supports RDBMS, Hadoop and NoSQL                               20 min-20 hr
How Does It Work?
• Drillbits run on each node, designed to
  maximize data locality
• Processing is done outside MapReduce
                                            SELECT * FROM
  paradigm (but possibly within YARN)       oracle.transactions,
• Queries can be fed to any Drillbit        mongo.users, hdfs.e
                                            vents
• Coordination, query                       LIMIT 1
  planning, optimization, scheduling, and
  execution are distributed
Key Features

•   Full SQL (ANSI SQL:2003)
•   Nested data
•   Schema is optional
•   Flexible and extensible architecture
Full SQL (ANSI SQL:2003)
• Drill supports standard ANSI SQL:2003
   – Correlated subqueries, analytic functions, …
   – SQL-like is not enough
• Use any SQL-based tool with Apache Drill
   – Tableau, Microstrategy, Excel, SAP Crystal Reports, Toad, SQuirreL, …
   – Standard ODBC and JDBC drivers
                          Client

            Tableau

                                                         Drillbit
          MicroStrategy
                          Drill%
                               ODBC%            SQL%Query%           Query%
                                       Driver                                  Drillbits
                            Driver                Parser            Planner   Drill%
                                                                                   Worker
              Excel
                                                                               Drill%Worker

           SAP%
              Crystal%
            Reports
Nested Data
                                                                   JSON
• Nested data is becoming prevalent                      {
                                                             "name": "Homer",
   – JSON, BSON, XML, Protocol Buffers, Avro, etc.           "gender": "Male",
                                                             "followers": 100
   – The data source may or may not be aware                 children: [
       • MongoDB supports nested data natively                 {name: "Bart"},
                                                               {name: "Lisa”}
       • A single HBase value could be a JSON document       ]
         (compound nested type)                          }
   – Google Dremel’s innovation was efficient columnar
     storage and querying of nested data
• Flattening nested data is error-prone and often                  Avro
                                                         enum Gender {
  impossible                                               MALE, FEMALE
   – Think about repeated and optional fields at every   }

     level…                                              record User {
• Apache Drill supports nested data                        string name;
                                                           Gender gender;
   – Extensions to ANSI SQL:2003                           long followers;
                                                         }
Schema is Optional
• Many data sources do not have rigid schemas
     – Schemas change rapidly
     – Each record may have a different schema
          • Sparse and wide rows in HBase and Cassandra, MongoDB
• Apache Drill supports querying against unknown schemas
     – Query any HBase, Cassandra or MongoDB table
• User can define the schema or let the system discover it automatically
     – System of record may already have schema information
          • Why manage it in a separate system?
     – No need to manage schema evolution

Row Key              CF contents                  CF anchor
"com.cnn.www"        contents:html = "<html>…"    anchor:my.look.ca = "CNN.com"
                                                  anchor:cnnsi.com = "CNN"
"com.foxnews.www"    contents:html = "<html>…"    anchor:en.wikipedia.org = "Fox News"

…                    …                            …
Flexible and Extensible Architecture
• Apache Drill is designed for extensibility
• Well-documented APIs and interfaces
• Data sources and file formats
    – Implement a custom scanner to support a new data source or file format
• Query languages
    – SQL:2003 is the primary language
    – Implement a custom Parser to support a Domain Specific Language
    – UDFs and UDTFs
• Optimizers
    – Drill will have a cost-based optimizer
    – Clear surrounding APIs support easy optimizer exploration
• Operators
    – Custom operators can be implemented
         • Special operators for Mahout (k-means) being designed
    – Operator push-down to data source (RDBMS)
How Does Impala Fit In?

Impala Strengths                                 Questions
•   Beta currently available                     •   Open Source ‘Lite’
•   Easy install and setup on top of             •   Doesn’t support RDBMS or other
    Cloudera                                         NoSQLs (beyond Hadoop/HBase)
•   Faster than Hive on some queries             •   Early row materialization increases
•   SQL-like query language                          footprint and reduces performance
                                                 •   Limited file format support
                                                 •   Query results must fit in memory!
                                                 •   Rigid schema is required
                                                 •   No support for nested data
                                                 •   Compound APIs restrict optimizer
                                                     progression
                                                 •   SQL-like (not SQL)


    Many important features are “coming soon”. Architectural foundation is constrained. No
                                  community development.
Status: In Progress
•   Heavy active development by multiple organizations
•   Available
     – Logical plan syntax and interpreter
     – Reference interpreter
•   In progress
     – SQL interpreter
     – Storage engine implementations for Accumulo, Cassandra, HBase and various file formats
•   Significant community momentum
     –   Over 200 people on the Drill mailing list
     –   Over 200 members of the Bay Area Drill User Group
     –   Drill meetups across the US and Europe
     –   OpenDremel team joined Apache Drill
•   Anticipated schedule:
     – Prototype: Q1
     – Alpha: Q2
     – Beta: Q3
Why Apache Drill Will Be Successful
Resources                      Community                    Architecture
• Contributors have strong     • Development done in the    • Full SQL
   backgrounds from              open                       • New data support
   companies like Oracle,      • Active contributors from   • Extensible APIs
   IBM Netezza, Informatica,     multiple companies         • Full Columnar Execution
   Clustrix and Pentaho        • Rapidly growing            • Beyond Hadoop
Questions?

• What problems can Drill solve for you?
• Where does it fit in the organization?
• Which data sources and BI tools are important
  to you?
Let’s Talk!

•   tdunning@maprtech.com
•   tdunning@apache.org
•   @ted_dunning @ApacheDrill
•   Slides at http://bit.ly/YxZq8X
•   See also
     http://www.mapr.com/support/community-
    resources/drill
APPENDIX
Why Not Leverage MapReduce?
• Scheduling Model
   – Coarse resource model reduces hardware utilization
   – Acquisition of resources typically takes 100’s of millis to seconds
• Barriers
   – Map completion required before shuffle/reduce
     commencement
   – All maps must complete before reduce can start
   – In chained jobs, one job must finish entirely before the next one
     can start
• Persistence and Recoverability
   – Data is persisted to disk between each barrier
   – Serialization and deserialization are required between execution
     phase

More Related Content

What's hot

CVE-2021-44228 Log4j (and Log4Shell) Executive Explainer by cje@bugcrowd
CVE-2021-44228 Log4j (and Log4Shell) Executive Explainer by cje@bugcrowdCVE-2021-44228 Log4j (and Log4Shell) Executive Explainer by cje@bugcrowd
CVE-2021-44228 Log4j (and Log4Shell) Executive Explainer by cje@bugcrowdCasey Ellis
 
DDoS Attack PPT by Nitin Bisht
DDoS Attack  PPT by Nitin BishtDDoS Attack  PPT by Nitin Bisht
DDoS Attack PPT by Nitin BishtNitin Bisht
 
DDoS 101: Attack Types and Mitigation
DDoS 101: Attack Types and MitigationDDoS 101: Attack Types and Mitigation
DDoS 101: Attack Types and MitigationCloudflare
 
DoS Attack - Incident Handling
DoS Attack - Incident HandlingDoS Attack - Incident Handling
DoS Attack - Incident HandlingMarcelo Silva
 
Managing your Hadoop Clusters with Apache Ambari
Managing your Hadoop Clusters with Apache AmbariManaging your Hadoop Clusters with Apache Ambari
Managing your Hadoop Clusters with Apache AmbariDataWorks Summit
 
Large Scale Math with Hadoop MapReduce
Large Scale Math with Hadoop MapReduceLarge Scale Math with Hadoop MapReduce
Large Scale Math with Hadoop MapReduceHortonworks
 
Elk with Openstack
Elk with OpenstackElk with Openstack
Elk with OpenstackArun prasath
 
DB 모니터링 신규 & 개선 기능 (박명규)
DB 모니터링 신규 & 개선 기능 (박명규)DB 모니터링 신규 & 개선 기능 (박명규)
DB 모니터링 신규 & 개선 기능 (박명규)WhaTap Labs
 
Chicago Data Summit: Apache HBase: An Introduction
Chicago Data Summit: Apache HBase: An IntroductionChicago Data Summit: Apache HBase: An Introduction
Chicago Data Summit: Apache HBase: An IntroductionCloudera, Inc.
 
Honeypot 101 (slide share)
Honeypot 101 (slide share)Honeypot 101 (slide share)
Honeypot 101 (slide share)Emil Tan
 
Denial of service attack
Denial of service attackDenial of service attack
Denial of service attackKaustubh Padwad
 
Hadoop introduction , Why and What is Hadoop ?
Hadoop introduction , Why and What is  Hadoop ?Hadoop introduction , Why and What is  Hadoop ?
Hadoop introduction , Why and What is Hadoop ?sudhakara st
 
MongoDB vs. Postgres Benchmarks
MongoDB vs. Postgres Benchmarks MongoDB vs. Postgres Benchmarks
MongoDB vs. Postgres Benchmarks EDB
 

What's hot (20)

CVE-2021-44228 Log4j (and Log4Shell) Executive Explainer by cje@bugcrowd
CVE-2021-44228 Log4j (and Log4Shell) Executive Explainer by cje@bugcrowdCVE-2021-44228 Log4j (and Log4Shell) Executive Explainer by cje@bugcrowd
CVE-2021-44228 Log4j (and Log4Shell) Executive Explainer by cje@bugcrowd
 
Honeypots
HoneypotsHoneypots
Honeypots
 
Honeypot a trap to hackers
Honeypot a trap to hackersHoneypot a trap to hackers
Honeypot a trap to hackers
 
DDoS Attack PPT by Nitin Bisht
DDoS Attack  PPT by Nitin BishtDDoS Attack  PPT by Nitin Bisht
DDoS Attack PPT by Nitin Bisht
 
DDoS 101: Attack Types and Mitigation
DDoS 101: Attack Types and MitigationDDoS 101: Attack Types and Mitigation
DDoS 101: Attack Types and Mitigation
 
DoS Attack - Incident Handling
DoS Attack - Incident HandlingDoS Attack - Incident Handling
DoS Attack - Incident Handling
 
Managing your Hadoop Clusters with Apache Ambari
Managing your Hadoop Clusters with Apache AmbariManaging your Hadoop Clusters with Apache Ambari
Managing your Hadoop Clusters with Apache Ambari
 
Large Scale Math with Hadoop MapReduce
Large Scale Math with Hadoop MapReduceLarge Scale Math with Hadoop MapReduce
Large Scale Math with Hadoop MapReduce
 
Wireshark
WiresharkWireshark
Wireshark
 
Elk with Openstack
Elk with OpenstackElk with Openstack
Elk with Openstack
 
DB 모니터링 신규 & 개선 기능 (박명규)
DB 모니터링 신규 & 개선 기능 (박명규)DB 모니터링 신규 & 개선 기능 (박명규)
DB 모니터링 신규 & 개선 기능 (박명규)
 
PPT on Hadoop
PPT on HadoopPPT on Hadoop
PPT on Hadoop
 
Chicago Data Summit: Apache HBase: An Introduction
Chicago Data Summit: Apache HBase: An IntroductionChicago Data Summit: Apache HBase: An Introduction
Chicago Data Summit: Apache HBase: An Introduction
 
Honeypot Basics
Honeypot BasicsHoneypot Basics
Honeypot Basics
 
Honeypot 101 (slide share)
Honeypot 101 (slide share)Honeypot 101 (slide share)
Honeypot 101 (slide share)
 
Denial of service attack
Denial of service attackDenial of service attack
Denial of service attack
 
Hadoop introduction , Why and What is Hadoop ?
Hadoop introduction , Why and What is  Hadoop ?Hadoop introduction , Why and What is  Hadoop ?
Hadoop introduction , Why and What is Hadoop ?
 
DNS Cache Poisoning
DNS Cache PoisoningDNS Cache Poisoning
DNS Cache Poisoning
 
MongoDB vs. Postgres Benchmarks
MongoDB vs. Postgres Benchmarks MongoDB vs. Postgres Benchmarks
MongoDB vs. Postgres Benchmarks
 
MongoDB
MongoDBMongoDB
MongoDB
 

Viewers also liked

Apache Drill and Zeppelin: Two Promising Tools You've Never Heard Of
Apache Drill and Zeppelin: Two Promising Tools You've Never Heard OfApache Drill and Zeppelin: Two Promising Tools You've Never Heard Of
Apache Drill and Zeppelin: Two Promising Tools You've Never Heard OfCharles Givre
 
An introduction to apache drill presentation
An introduction to apache drill presentationAn introduction to apache drill presentation
An introduction to apache drill presentationMapR Technologies
 
Drill into Drill – How Providing Flexibility and Performance is Possible
Drill into Drill – How Providing Flexibility and Performance is PossibleDrill into Drill – How Providing Flexibility and Performance is Possible
Drill into Drill – How Providing Flexibility and Performance is PossibleMapR Technologies
 
SQL-on-Hadoop with Apache Drill
SQL-on-Hadoop with Apache DrillSQL-on-Hadoop with Apache Drill
SQL-on-Hadoop with Apache DrillMapR Technologies
 
Apache Drill Workshop
Apache Drill WorkshopApache Drill Workshop
Apache Drill WorkshopCharles Givre
 
Killing ETL with Apache Drill
Killing ETL with Apache DrillKilling ETL with Apache Drill
Killing ETL with Apache DrillCharles Givre
 
Data Exploration with Apache Drill: Day 2
Data Exploration with Apache Drill: Day 2Data Exploration with Apache Drill: Day 2
Data Exploration with Apache Drill: Day 2Charles Givre
 
Data Exploration with Apache Drill: Day 1
Data Exploration with Apache Drill:  Day 1Data Exploration with Apache Drill:  Day 1
Data Exploration with Apache Drill: Day 1Charles Givre
 
Drill / SQL / Optiq
Drill / SQL / OptiqDrill / SQL / Optiq
Drill / SQL / OptiqJulian Hyde
 
Deep Learning vs. Cheap Learning
Deep Learning vs. Cheap LearningDeep Learning vs. Cheap Learning
Deep Learning vs. Cheap LearningMapR Technologies
 
SQL for NoSQL and how Apache Calcite can help
SQL for NoSQL and how  Apache Calcite can helpSQL for NoSQL and how  Apache Calcite can help
SQL for NoSQL and how Apache Calcite can helpChristian Tzolov
 
Aws athenaを使ってみた
Aws athenaを使ってみたAws athenaを使ってみた
Aws athenaを使ってみたSunggyu Rhie
 
Cluster management and automation with cloudera manager
Cluster management and automation with cloudera managerCluster management and automation with cloudera manager
Cluster management and automation with cloudera managerChris Westin
 
Interactive SQL-on-Hadoop and JethroData
Interactive SQL-on-Hadoop and JethroDataInteractive SQL-on-Hadoop and JethroData
Interactive SQL-on-Hadoop and JethroDataOfir Manor
 

Viewers also liked (20)

Apache Drill and Zeppelin: Two Promising Tools You've Never Heard Of
Apache Drill and Zeppelin: Two Promising Tools You've Never Heard OfApache Drill and Zeppelin: Two Promising Tools You've Never Heard Of
Apache Drill and Zeppelin: Two Promising Tools You've Never Heard Of
 
An introduction to apache drill presentation
An introduction to apache drill presentationAn introduction to apache drill presentation
An introduction to apache drill presentation
 
Drill into Drill – How Providing Flexibility and Performance is Possible
Drill into Drill – How Providing Flexibility and Performance is PossibleDrill into Drill – How Providing Flexibility and Performance is Possible
Drill into Drill – How Providing Flexibility and Performance is Possible
 
SQL-on-Hadoop with Apache Drill
SQL-on-Hadoop with Apache DrillSQL-on-Hadoop with Apache Drill
SQL-on-Hadoop with Apache Drill
 
Apache Drill Workshop
Apache Drill WorkshopApache Drill Workshop
Apache Drill Workshop
 
Killing ETL with Apache Drill
Killing ETL with Apache DrillKilling ETL with Apache Drill
Killing ETL with Apache Drill
 
Data Exploration with Apache Drill: Day 2
Data Exploration with Apache Drill: Day 2Data Exploration with Apache Drill: Day 2
Data Exploration with Apache Drill: Day 2
 
Data Exploration with Apache Drill: Day 1
Data Exploration with Apache Drill:  Day 1Data Exploration with Apache Drill:  Day 1
Data Exploration with Apache Drill: Day 1
 
Meet Spark
Meet SparkMeet Spark
Meet Spark
 
Drill / SQL / Optiq
Drill / SQL / OptiqDrill / SQL / Optiq
Drill / SQL / Optiq
 
Deep Learning vs. Cheap Learning
Deep Learning vs. Cheap LearningDeep Learning vs. Cheap Learning
Deep Learning vs. Cheap Learning
 
SQL for NoSQL and how Apache Calcite can help
SQL for NoSQL and how  Apache Calcite can helpSQL for NoSQL and how  Apache Calcite can help
SQL for NoSQL and how Apache Calcite can help
 
SQL on Hadoop in Taiwan
SQL on Hadoop in TaiwanSQL on Hadoop in Taiwan
SQL on Hadoop in Taiwan
 
Aws athenaを使ってみた
Aws athenaを使ってみたAws athenaを使ってみた
Aws athenaを使ってみた
 
Avro introduction
Avro introductionAvro introduction
Avro introduction
 
Hive vs. Impala
Hive vs. ImpalaHive vs. Impala
Hive vs. Impala
 
Cluster management and automation with cloudera manager
Cluster management and automation with cloudera managerCluster management and automation with cloudera manager
Cluster management and automation with cloudera manager
 
Interactive SQL-on-Hadoop and JethroData
Interactive SQL-on-Hadoop and JethroDataInteractive SQL-on-Hadoop and JethroData
Interactive SQL-on-Hadoop and JethroData
 
Intro to Apache Mahout
Intro to Apache MahoutIntro to Apache Mahout
Intro to Apache Mahout
 
Using Apache Drill
Using Apache DrillUsing Apache Drill
Using Apache Drill
 

Similar to Apache Drill

No sql and sql - open analytics summit
No sql and sql - open analytics summitNo sql and sql - open analytics summit
No sql and sql - open analytics summitOpen Analytics
 
Big Data Developers Moscow Meetup 1 - sql on hadoop
Big Data Developers Moscow Meetup 1  - sql on hadoopBig Data Developers Moscow Meetup 1  - sql on hadoop
Big Data Developers Moscow Meetup 1 - sql on hadoopbddmoscow
 
Etu Solution Day 2014 Track-D: 掌握Impala和Spark
Etu Solution Day 2014 Track-D: 掌握Impala和SparkEtu Solution Day 2014 Track-D: 掌握Impala和Spark
Etu Solution Day 2014 Track-D: 掌握Impala和SparkJames Chen
 
Real time hadoop + mapreduce intro
Real time hadoop + mapreduce introReal time hadoop + mapreduce intro
Real time hadoop + mapreduce introGeoff Hendrey
 
Microsoft Openness Mongo DB
Microsoft Openness Mongo DBMicrosoft Openness Mongo DB
Microsoft Openness Mongo DBHeriyadi Janwar
 
2013 year of real-time hadoop
2013 year of real-time hadoop2013 year of real-time hadoop
2013 year of real-time hadoopGeoff Hendrey
 
Cloudera Impala - San Diego Big Data Meetup August 13th 2014
Cloudera Impala - San Diego Big Data Meetup August 13th 2014Cloudera Impala - San Diego Big Data Meetup August 13th 2014
Cloudera Impala - San Diego Big Data Meetup August 13th 2014cdmaxime
 
How to use Big Data and Data Lake concept in business using Hadoop and Spark...
 How to use Big Data and Data Lake concept in business using Hadoop and Spark... How to use Big Data and Data Lake concept in business using Hadoop and Spark...
How to use Big Data and Data Lake concept in business using Hadoop and Spark...Institute of Contemporary Sciences
 
Processing Big Data
Processing Big DataProcessing Big Data
Processing Big Datacwensel
 
Hadoop Summit - Hausenblas 20 March
Hadoop Summit - Hausenblas 20 MarchHadoop Summit - Hausenblas 20 March
Hadoop Summit - Hausenblas 20 MarchMapR Technologies
 
Understanding the Value and Architecture of Apache Drill
Understanding the Value and Architecture of Apache DrillUnderstanding the Value and Architecture of Apache Drill
Understanding the Value and Architecture of Apache DrillDataWorks Summit
 
Demystifying data engineering
Demystifying data engineeringDemystifying data engineering
Demystifying data engineeringThang Bui (Bob)
 
Drill at the Chug 9-19-12
Drill at the Chug 9-19-12Drill at the Chug 9-19-12
Drill at the Chug 9-19-12Ted Dunning
 
4. hadoop גיא לבנברג
4. hadoop  גיא לבנברג4. hadoop  גיא לבנברג
4. hadoop גיא לבנברגTaldor Group
 

Similar to Apache Drill (20)

Drill njhug -19 feb2013
Drill njhug -19 feb2013Drill njhug -19 feb2013
Drill njhug -19 feb2013
 
Apache drill
Apache drillApache drill
Apache drill
 
No sql and sql - open analytics summit
No sql and sql - open analytics summitNo sql and sql - open analytics summit
No sql and sql - open analytics summit
 
Apache Drill at ApacheCon2014
Apache Drill at ApacheCon2014Apache Drill at ApacheCon2014
Apache Drill at ApacheCon2014
 
Big Data Developers Moscow Meetup 1 - sql on hadoop
Big Data Developers Moscow Meetup 1  - sql on hadoopBig Data Developers Moscow Meetup 1  - sql on hadoop
Big Data Developers Moscow Meetup 1 - sql on hadoop
 
Etu Solution Day 2014 Track-D: 掌握Impala和Spark
Etu Solution Day 2014 Track-D: 掌握Impala和SparkEtu Solution Day 2014 Track-D: 掌握Impala和Spark
Etu Solution Day 2014 Track-D: 掌握Impala和Spark
 
Real time hadoop + mapreduce intro
Real time hadoop + mapreduce introReal time hadoop + mapreduce intro
Real time hadoop + mapreduce intro
 
Microsoft Openness Mongo DB
Microsoft Openness Mongo DBMicrosoft Openness Mongo DB
Microsoft Openness Mongo DB
 
SQL on Hadoop
SQL on HadoopSQL on Hadoop
SQL on Hadoop
 
2013 year of real-time hadoop
2013 year of real-time hadoop2013 year of real-time hadoop
2013 year of real-time hadoop
 
Impala for PhillyDB Meetup
Impala for PhillyDB MeetupImpala for PhillyDB Meetup
Impala for PhillyDB Meetup
 
Cloudera Impala - San Diego Big Data Meetup August 13th 2014
Cloudera Impala - San Diego Big Data Meetup August 13th 2014Cloudera Impala - San Diego Big Data Meetup August 13th 2014
Cloudera Impala - San Diego Big Data Meetup August 13th 2014
 
How to use Big Data and Data Lake concept in business using Hadoop and Spark...
 How to use Big Data and Data Lake concept in business using Hadoop and Spark... How to use Big Data and Data Lake concept in business using Hadoop and Spark...
How to use Big Data and Data Lake concept in business using Hadoop and Spark...
 
Processing Big Data
Processing Big DataProcessing Big Data
Processing Big Data
 
Hadoop Summit - Hausenblas 20 March
Hadoop Summit - Hausenblas 20 MarchHadoop Summit - Hausenblas 20 March
Hadoop Summit - Hausenblas 20 March
 
Understanding the Value and Architecture of Apache Drill
Understanding the Value and Architecture of Apache DrillUnderstanding the Value and Architecture of Apache Drill
Understanding the Value and Architecture of Apache Drill
 
Apache Hadoop Hive
Apache Hadoop HiveApache Hadoop Hive
Apache Hadoop Hive
 
Demystifying data engineering
Demystifying data engineeringDemystifying data engineering
Demystifying data engineering
 
Drill at the Chug 9-19-12
Drill at the Chug 9-19-12Drill at the Chug 9-19-12
Drill at the Chug 9-19-12
 
4. hadoop גיא לבנברג
4. hadoop  גיא לבנברג4. hadoop  גיא לבנברג
4. hadoop גיא לבנברג
 

More from Ted Dunning

Dunning - SIGMOD - Data Economy.pptx
Dunning - SIGMOD - Data Economy.pptxDunning - SIGMOD - Data Economy.pptx
Dunning - SIGMOD - Data Economy.pptxTed Dunning
 
How to Get Going with Kubernetes
How to Get Going with KubernetesHow to Get Going with Kubernetes
How to Get Going with KubernetesTed Dunning
 
Progress for big data in Kubernetes
Progress for big data in KubernetesProgress for big data in Kubernetes
Progress for big data in KubernetesTed Dunning
 
Anomaly Detection: How to find what you didn’t know to look for
Anomaly Detection: How to find what you didn’t know to look forAnomaly Detection: How to find what you didn’t know to look for
Anomaly Detection: How to find what you didn’t know to look forTed Dunning
 
Streaming Architecture including Rendezvous for Machine Learning
Streaming Architecture including Rendezvous for Machine LearningStreaming Architecture including Rendezvous for Machine Learning
Streaming Architecture including Rendezvous for Machine LearningTed Dunning
 
Machine Learning Logistics
Machine Learning LogisticsMachine Learning Logistics
Machine Learning LogisticsTed Dunning
 
Tensor Abuse - how to reuse machine learning frameworks
Tensor Abuse - how to reuse machine learning frameworksTensor Abuse - how to reuse machine learning frameworks
Tensor Abuse - how to reuse machine learning frameworksTed Dunning
 
Machine Learning logistics
Machine Learning logisticsMachine Learning logistics
Machine Learning logisticsTed Dunning
 
Finding Changes in Real Data
Finding Changes in Real DataFinding Changes in Real Data
Finding Changes in Real DataTed Dunning
 
Where is Data Going? - RMDC Keynote
Where is Data Going? - RMDC KeynoteWhere is Data Going? - RMDC Keynote
Where is Data Going? - RMDC KeynoteTed Dunning
 
Real time-hadoop
Real time-hadoopReal time-hadoop
Real time-hadoopTed Dunning
 
Cheap learning-dunning-9-18-2015
Cheap learning-dunning-9-18-2015Cheap learning-dunning-9-18-2015
Cheap learning-dunning-9-18-2015Ted Dunning
 
Sharing Sensitive Data Securely
Sharing Sensitive Data SecurelySharing Sensitive Data Securely
Sharing Sensitive Data SecurelyTed Dunning
 
Real-time Puppies and Ponies - Evolving Indicator Recommendations in Real-time
Real-time Puppies and Ponies - Evolving Indicator Recommendations in Real-timeReal-time Puppies and Ponies - Evolving Indicator Recommendations in Real-time
Real-time Puppies and Ponies - Evolving Indicator Recommendations in Real-timeTed Dunning
 
How the Internet of Things is Turning the Internet Upside Down
How the Internet of Things is Turning the Internet Upside DownHow the Internet of Things is Turning the Internet Upside Down
How the Internet of Things is Turning the Internet Upside DownTed Dunning
 
Apache Kylin - OLAP Cubes for SQL on Hadoop
Apache Kylin - OLAP Cubes for SQL on HadoopApache Kylin - OLAP Cubes for SQL on Hadoop
Apache Kylin - OLAP Cubes for SQL on HadoopTed Dunning
 
Dunning time-series-2015
Dunning time-series-2015Dunning time-series-2015
Dunning time-series-2015Ted Dunning
 
Doing-the-impossible
Doing-the-impossibleDoing-the-impossible
Doing-the-impossibleTed Dunning
 
Anomaly Detection - New York Machine Learning
Anomaly Detection - New York Machine LearningAnomaly Detection - New York Machine Learning
Anomaly Detection - New York Machine LearningTed Dunning
 

More from Ted Dunning (20)

Dunning - SIGMOD - Data Economy.pptx
Dunning - SIGMOD - Data Economy.pptxDunning - SIGMOD - Data Economy.pptx
Dunning - SIGMOD - Data Economy.pptx
 
How to Get Going with Kubernetes
How to Get Going with KubernetesHow to Get Going with Kubernetes
How to Get Going with Kubernetes
 
Progress for big data in Kubernetes
Progress for big data in KubernetesProgress for big data in Kubernetes
Progress for big data in Kubernetes
 
Anomaly Detection: How to find what you didn’t know to look for
Anomaly Detection: How to find what you didn’t know to look forAnomaly Detection: How to find what you didn’t know to look for
Anomaly Detection: How to find what you didn’t know to look for
 
Streaming Architecture including Rendezvous for Machine Learning
Streaming Architecture including Rendezvous for Machine LearningStreaming Architecture including Rendezvous for Machine Learning
Streaming Architecture including Rendezvous for Machine Learning
 
Machine Learning Logistics
Machine Learning LogisticsMachine Learning Logistics
Machine Learning Logistics
 
Tensor Abuse - how to reuse machine learning frameworks
Tensor Abuse - how to reuse machine learning frameworksTensor Abuse - how to reuse machine learning frameworks
Tensor Abuse - how to reuse machine learning frameworks
 
Machine Learning logistics
Machine Learning logisticsMachine Learning logistics
Machine Learning logistics
 
T digest-update
T digest-updateT digest-update
T digest-update
 
Finding Changes in Real Data
Finding Changes in Real DataFinding Changes in Real Data
Finding Changes in Real Data
 
Where is Data Going? - RMDC Keynote
Where is Data Going? - RMDC KeynoteWhere is Data Going? - RMDC Keynote
Where is Data Going? - RMDC Keynote
 
Real time-hadoop
Real time-hadoopReal time-hadoop
Real time-hadoop
 
Cheap learning-dunning-9-18-2015
Cheap learning-dunning-9-18-2015Cheap learning-dunning-9-18-2015
Cheap learning-dunning-9-18-2015
 
Sharing Sensitive Data Securely
Sharing Sensitive Data SecurelySharing Sensitive Data Securely
Sharing Sensitive Data Securely
 
Real-time Puppies and Ponies - Evolving Indicator Recommendations in Real-time
Real-time Puppies and Ponies - Evolving Indicator Recommendations in Real-timeReal-time Puppies and Ponies - Evolving Indicator Recommendations in Real-time
Real-time Puppies and Ponies - Evolving Indicator Recommendations in Real-time
 
How the Internet of Things is Turning the Internet Upside Down
How the Internet of Things is Turning the Internet Upside DownHow the Internet of Things is Turning the Internet Upside Down
How the Internet of Things is Turning the Internet Upside Down
 
Apache Kylin - OLAP Cubes for SQL on Hadoop
Apache Kylin - OLAP Cubes for SQL on HadoopApache Kylin - OLAP Cubes for SQL on Hadoop
Apache Kylin - OLAP Cubes for SQL on Hadoop
 
Dunning time-series-2015
Dunning time-series-2015Dunning time-series-2015
Dunning time-series-2015
 
Doing-the-impossible
Doing-the-impossibleDoing-the-impossible
Doing-the-impossible
 
Anomaly Detection - New York Machine Learning
Anomaly Detection - New York Machine LearningAnomaly Detection - New York Machine Learning
Anomaly Detection - New York Machine Learning
 

Recently uploaded

AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
APIForce Zurich 5 April Automation LPDG
APIForce Zurich 5 April  Automation LPDGAPIForce Zurich 5 April  Automation LPDG
APIForce Zurich 5 April Automation LPDGMarianaLemus7
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 

Recently uploaded (20)

AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort ServiceHot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
APIForce Zurich 5 April Automation LPDG
APIForce Zurich 5 April  Automation LPDGAPIForce Zurich 5 April  Automation LPDG
APIForce Zurich 5 April Automation LPDG
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 

Apache Drill

  • 1. Apache’s Answer to Low Latency Interactive Query for Big Data February 13, 2013
  • 2. Agenda • Apache Drill overview • Key features • Status and progress • Discuss potential use cases and cooperation
  • 3. Big Data Workloads • ETL • Data mining • Blob store • Lightweight OLTP on large datasets • Index and model generation • Web crawling • Stream processing • Clustering, anomaly detection and classification • Interactive analysis
  • 4. Interactive Queries and Hadoop Common Solutions Compile SQL to Export MapReduce External tables in an MapReduce results to RDBMS and MPP database query the RDBMS Emerging Technologies Impala Real-time Real-time SQL based interactive queries interactive queries analytics
  • 5. Example Problem • Jane works as an Transaction analyst at an e- information commerce company • How does she figure User out good targeting profiles segments for the next marketing campaign? • She has some ideas Access and lots of data logs
  • 6. Solving the Problem with Traditional Systems • Use an RDBMS – ETL the data from MongoDB and Hadoop into the RDBMS • MongoDB data must be flattened, schematized, filtered and aggregated • Hadoop data must be filtered and aggregated – Query the data using any SQL-based tool • Use MapReduce – ETL the data from Oracle and MongoDB into Hadoop – Work with the MapReduce team to generate the desired analyses • Use Hive – ETL the data from Oracle and MongoDB into Hadoop • MongoDB data must be flattened and schematized – But HiveQL is limited, queries take too long and BI tool support is limited
  • 7. WWGD Distributed Interactive Batch NoSQL File System analysis processing GFS BigTable Dremel MapReduce Hadoop HDFS HBase ??? MapReduce Build Apache Drill to provide a true open source solution to interactive analysis of Big Data
  • 8. Apache Drill Overview • Interactive analysis of Big Data using standard SQL • Fast – Low latency queries – Columnar execution • Inspired by Google Dremel/BigQuery Interactive queries – Complement native interfaces and Apache Drill Data analyst MapReduce/Hive/Pig Reporting • Open 100 ms-20 min – Community driven open source project – Under Apache Software Foundation • Modern – Standard ANSI SQL:2003 (select/into) Data mining – Nested/hierarchical data support MapReduce Modeling Hive – Schema is optional Pig Large ETL – Supports RDBMS, Hadoop and NoSQL 20 min-20 hr
  • 9. How Does It Work? • Drillbits run on each node, designed to maximize data locality • Processing is done outside MapReduce SELECT * FROM paradigm (but possibly within YARN) oracle.transactions, • Queries can be fed to any Drillbit mongo.users, hdfs.e vents • Coordination, query LIMIT 1 planning, optimization, scheduling, and execution are distributed
  • 10. Key Features • Full SQL (ANSI SQL:2003) • Nested data • Schema is optional • Flexible and extensible architecture
  • 11. Full SQL (ANSI SQL:2003) • Drill supports standard ANSI SQL:2003 – Correlated subqueries, analytic functions, … – SQL-like is not enough • Use any SQL-based tool with Apache Drill – Tableau, Microstrategy, Excel, SAP Crystal Reports, Toad, SQuirreL, … – Standard ODBC and JDBC drivers Client Tableau Drillbit MicroStrategy Drill% ODBC% SQL%Query% Query% Driver Drillbits Driver Parser Planner Drill% Worker Excel Drill%Worker SAP% Crystal% Reports
  • 12. Nested Data JSON • Nested data is becoming prevalent { "name": "Homer", – JSON, BSON, XML, Protocol Buffers, Avro, etc. "gender": "Male", "followers": 100 – The data source may or may not be aware children: [ • MongoDB supports nested data natively {name: "Bart"}, {name: "Lisa”} • A single HBase value could be a JSON document ] (compound nested type) } – Google Dremel’s innovation was efficient columnar storage and querying of nested data • Flattening nested data is error-prone and often Avro enum Gender { impossible MALE, FEMALE – Think about repeated and optional fields at every } level… record User { • Apache Drill supports nested data string name; Gender gender; – Extensions to ANSI SQL:2003 long followers; }
  • 13. Schema is Optional • Many data sources do not have rigid schemas – Schemas change rapidly – Each record may have a different schema • Sparse and wide rows in HBase and Cassandra, MongoDB • Apache Drill supports querying against unknown schemas – Query any HBase, Cassandra or MongoDB table • User can define the schema or let the system discover it automatically – System of record may already have schema information • Why manage it in a separate system? – No need to manage schema evolution Row Key CF contents CF anchor "com.cnn.www" contents:html = "<html>…" anchor:my.look.ca = "CNN.com" anchor:cnnsi.com = "CNN" "com.foxnews.www" contents:html = "<html>…" anchor:en.wikipedia.org = "Fox News" … … …
  • 14. Flexible and Extensible Architecture • Apache Drill is designed for extensibility • Well-documented APIs and interfaces • Data sources and file formats – Implement a custom scanner to support a new data source or file format • Query languages – SQL:2003 is the primary language – Implement a custom Parser to support a Domain Specific Language – UDFs and UDTFs • Optimizers – Drill will have a cost-based optimizer – Clear surrounding APIs support easy optimizer exploration • Operators – Custom operators can be implemented • Special operators for Mahout (k-means) being designed – Operator push-down to data source (RDBMS)
  • 15. How Does Impala Fit In? Impala Strengths Questions • Beta currently available • Open Source ‘Lite’ • Easy install and setup on top of • Doesn’t support RDBMS or other Cloudera NoSQLs (beyond Hadoop/HBase) • Faster than Hive on some queries • Early row materialization increases • SQL-like query language footprint and reduces performance • Limited file format support • Query results must fit in memory! • Rigid schema is required • No support for nested data • Compound APIs restrict optimizer progression • SQL-like (not SQL) Many important features are “coming soon”. Architectural foundation is constrained. No community development.
  • 16. Status: In Progress • Heavy active development by multiple organizations • Available – Logical plan syntax and interpreter – Reference interpreter • In progress – SQL interpreter – Storage engine implementations for Accumulo, Cassandra, HBase and various file formats • Significant community momentum – Over 200 people on the Drill mailing list – Over 200 members of the Bay Area Drill User Group – Drill meetups across the US and Europe – OpenDremel team joined Apache Drill • Anticipated schedule: – Prototype: Q1 – Alpha: Q2 – Beta: Q3
  • 17. Why Apache Drill Will Be Successful Resources Community Architecture • Contributors have strong • Development done in the • Full SQL backgrounds from open • New data support companies like Oracle, • Active contributors from • Extensible APIs IBM Netezza, Informatica, multiple companies • Full Columnar Execution Clustrix and Pentaho • Rapidly growing • Beyond Hadoop
  • 18. Questions? • What problems can Drill solve for you? • Where does it fit in the organization? • Which data sources and BI tools are important to you?
  • 19. Let’s Talk! • tdunning@maprtech.com • tdunning@apache.org • @ted_dunning @ApacheDrill • Slides at http://bit.ly/YxZq8X • See also http://www.mapr.com/support/community- resources/drill
  • 21. Why Not Leverage MapReduce? • Scheduling Model – Coarse resource model reduces hardware utilization – Acquisition of resources typically takes 100’s of millis to seconds • Barriers – Map completion required before shuffle/reduce commencement – All maps must complete before reduce can start – In chained jobs, one job must finish entirely before the next one can start • Persistence and Recoverability – Data is persisted to disk between each barrier – Serialization and deserialization are required between execution phase

Editor's Notes

  1. With the recent explosion of everything related to Hadoop, it is no surprise that new projects/implementations related to the Hadoop ecosystem keep appearing. There have been quite a few initiatives that provide SQL interfaces into Hadoop. The Apache Drill project is a distributed system for interactive analysis of large-scale datasets, inspired by Google&apos;s Dremel. Drill is not trying to replace existing Big Data batch processing frameworks, such as Hadoop MapReduce or stream processing frameworks, such as S4 or Storm. It rather fills the existing void – real-time interactive processing of large data sets.------------------------------Technical DetailSimilar to Dremel, the Drill implementation is based on the processing of nested, tree-like data. In Dremel this data is based on protocol buffers – nested schema-based data model. Drill is planning to extend this data model by adding additional schema-based implementations, for example, Apache Avro and schema-less data models such asJSON and BSON. In addition to a single data structure, Drill is also planning to support “baby joins” – joins to the small, loadable in memory, data structures.