Apache Drill
Interactive Analysis of Large-Scale Datasets


Jason Frantz
Architect, MapR
My Background
•   Caltech
•   Clustrix
•   MapR
•   Founding member of Apache Drill
MapR Technologies
• The open enterprise-grade distribution for Hadoop
   – Easy, dependable and fast
   – Open source with standards-based extensions

• MapR is deployed at thousands of companies
   – From small Internet startups to the world’s largest enterprises

• MapR customers analyze massive amounts of data:
   – Hundreds of billions of events daily
   – 90% of the world’s Internet population monthly
   – $1 trillion in retail purchases annually

• MapR has partnered with Google to provide Hadoop on Google
  Compute Engine
Latency Matters
• Ad-hoc analysis with interactive tools

• Real-time dashboards

• Event/trend detection and analysis
  – Network intrusions
  – Fraud
  – Failures
Big Data Processing
                      Batch processing    Interactive analysis      Stream processing
Query runtime         Minutes to hours    Milliseconds to minutes   Never-ending
Data volume           TBs to PBs          GBs to PBs                Continuous stream
Programming model     MapReduce           Queries                   DAG
Users                 Developers          Analysts and developers   Developers
Google project        MapReduce           Dremel
Open source project   Hadoop MapReduce                              Storm and S4


                 Introducing Apache Drill…
GOOGLE DREMEL
Google Dremel
• Interactive analysis of large-scale datasets
    –   Trillion records at interactive speeds
    –   Complementary to MapReduce
    –   Used by thousands of Google employees
    –   Paper published at VLDB 2010
         • Authors: Sergey Melnik, Andrey Gubarev, Jing Jing Long, Geoffrey Romer, Shiva
           Shivakumar, Matt Tolton, Theo Vassilakis


• Model
    – Nested data model with schema
         • Most data at Google is stored/transferred in Protocol Buffers
         • Normalization (to relational) is prohibitive
    – SQL-like query language with nested data support

• Implementation
    – Column-based storage and processing
    – In-situ data access (GFS and Bigtable)
    – Tree architecture as in Web search (and databases)
Google BigQuery
• Hosted Dremel (Dremel as a Service)
• CLI (bq) and Web UI
• Import data from Google Cloud Storage or local files
   – Files must be in CSV format
        • Nested data not supported [yet], except in built-in datasets
   – Schema definition required
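As a rough sketch (not from the original deck), a query over a CSV import might look like the following; the weblogs table and its columns are hypothetical:

-- Hypothetical table "weblogs" loaded from a CSV file in Google Cloud Storage;
-- a schema (e.g. domain STRING, status INTEGER) must be supplied at import time.
SELECT domain, COUNT(*) AS hits
FROM weblogs
WHERE status = 404
GROUP BY domain
ORDER BY hits DESC
LIMIT 10;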
APACHE DRILL
Architecture



• Only the execution engine knows the physical attributes of the cluster
    – # nodes, hardware, file locations, …

• Public interfaces enable extensibility
    – Developers can build parsers for new query languages
    – Developers can provide an execution plan directly

• Each level of the plan has a human-readable representation
    – Facilitates debugging and unit testing
Architecture (2)
Execution Engine Layers
• Drill execution engine has two layers
   – Operator layer is serialization-aware
       • Processes individual records
   – Execution layer is not serialization-aware
       • Processes batches of records (blobs)
       • Responsible for communication, dependencies and fault tolerance
Data Flow
Nested Query Languages
• DrQL
   – SQL-like query language for nested data
   – Compatible with Google BigQuery/Dremel
      • BigQuery applications should work with Drill
   – Designed to support efficient column-based processing
      • No record assembly during query processing


• Mongo Query Language
   – {$query: {x: 3, y: "abc"}, $orderby: {x: 1}}

• Other languages/programming models can plug in
Nested Data Model
•   The data model in Dremel is Protocol Buffers
     – Nested
     – Schema
•   Apache Drill is designed to support multiple data models
     – Schema: Protocol Buffers, Apache Avro, …
     – Schema-less: JSON, BSON, …
•   Flat records are supported as a special case of nested data
     – CSV, TSV, …

Avro IDL:

  enum Gender {
    MALE, FEMALE
  }

  record User {
    string name;
    Gender gender;
    long followers;
  }

JSON:

  {
    "name": "Srivas",
    "gender": "Male",
    "followers": 100
  }
  {
    "name": "Raina",
    "gender": "Female",
    "followers": 200,
    "zip": "94305"
  }
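A minimal sketch (not in the original deck) of querying the schema-less JSON records above; the users source name is hypothetical:

-- Hypothetical "users" source containing the two JSON records shown above;
-- the optional "zip" field would presumably surface as NULL where absent.
SELECT name, followers, zip
FROM users
WHERE followers > 150;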
DrQL Example

SELECT DocId AS Id,
  COUNT(Name.Language.Code) WITHIN Name AS Cnt,
  Name.Url + ',' + Name.Language.Code AS Str
FROM t
WHERE REGEXP(Name.Url, '^http')
  AND DocId < 20;

                                    * Example from the Dremel paper
Query Components
• Query components:
   –   SELECT
   –   FROM
   –   WHERE
   –   GROUP BY
   –   HAVING
   –   (JOIN)

• Key logical operators:
   –   Scan
   –   Filter
   –   Aggregate
   –   (Join)
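A hedged sketch (table and column names are hypothetical) showing how the query components map onto the logical operators:

SELECT Country, COUNT(*) AS Cnt    -- Aggregate computes Cnt per group
FROM t                             -- Scan reads records from the data source
WHERE Clicks > 0                   -- Filter drops non-matching records
GROUP BY Country                   -- Aggregate groups records by Country
HAVING COUNT(*) > 100;             -- Filter applied again, after aggregation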
Extensibility
•   Nested query languages
     –   Pluggable model
     –   DrQL
     –   Mongo Query Language
     –   Cascading

•   Distributed execution engine
     – Extensible model (e.g., Dryad)
     – Low-latency
     – Fault tolerant

•   Nested data formats
     – Pluggable model
     – Column-based (ColumnIO/Dremel, Trevni, RCFile) and row-based (RecordIO, Avro, JSON, CSV)
     – Schema (Protocol Buffers, Avro, CSV) and schema-less (JSON, BSON)

•   Scalable data sources
     – Pluggable model
     – Hadoop
     – HBase
Scan Operators
• Drill supports multiple data formats by having per-format scan operators
   • Queries involving multiple data formats/sources are supported

• Fields and predicates can be pushed down into the scan operator

• Scan operators may have adaptive side-effects (database cracking)
   • Produce ColumnIO from RecordIO
   • Google PowerDrill stores materialized expressions with the data
                         Scan with schema                            Scan without schema
Operator output          Protocol Buffers                            JSON-like (MessagePack)
Supported data formats   ColumnIO (column-based protobuf/Dremel),    JSON,
                         RecordIO (row-based protobuf),              HBase
                         CSV
SELECT … FROM …          ColumnIO(proto URI, data URI),              Json(data URI),
                         RecordIO(proto URI, data URI)               HBase(table name)
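For example, a query can name the scan operator directly in the FROM clause, per the table above; the URIs are placeholders and the exact quoting is an assumption:

-- Schema-based scan: ColumnIO takes the protobuf schema URI and the data URI.
SELECT Name.Url
FROM ColumnIO('hdfs:///schemas/document.proto', 'hdfs:///data/documents')
WHERE DocId < 20;

-- Schema-less scan over JSON data (placeholder URI).
SELECT name, followers
FROM Json('hdfs:///data/users.json');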
Design Principles
Flexible
• Pluggable query languages
• Extensible execution engine
• Pluggable data formats
   • Column-based and row-based
   • Schema and schema-less
• Pluggable data sources

Easy
• Unzip and run
• Zero configuration
• Reverse DNS not needed
• IP addresses can change
• Clear and concise log messages

Dependable
• No SPOF
• Instant recovery from crashes

Fast
• C/C++ core with Java support
   • Google C++ style guide
• Min latency and max throughput (limited only by hardware)
Hadoop Integration
• Hadoop data sources
   – Hadoop FileSystem API (HDFS/MapR-FS)
   – HBase
• Hadoop data formats
   – Apache Avro
   – RCFile
• MapReduce-based tools to create column-based formats
• Table registry in HCatalog
• Run long-running services in YARN
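As a hedged sketch of how such a source might be queried through the HBase(table name) scan form from the earlier table (the table and column names here are hypothetical):

-- Hypothetical HBase table "events"; the column names are illustrative only.
SELECT event_type, COUNT(*) AS occurrences
FROM HBase('events')
GROUP BY event_type;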
Get Involved!
• Download these slides
    – http://www.mapr.com/company/events/bay-area-hug/9-19-2012


• Join the mailing list
    – drill-dev-subscribe@incubator.apache.org


• Join MapR
    – jobs@mapr.com
