Apache Drill
 Architecture Outlines




     Jason Frantz
whoami
• Jason


• Software Architect at MapR


• Background in Databases, File Systems,
  Distributed Systems


• jfrantz@maprtech.com, jason@apache.org
Data Flow
Big Picture




• Same basic pieces as most databases:
  –   Driver: manage interaction with client
  –   Parser: process textual query language
  –   Compiler / Optimizer: convert logical query into physical plan
  –   Execution Engine: run physical plan across cluster
  –   Storage Handlers: feed user data in/out of execution
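
A minimal sketch of how these pieces might hand a query off to one another. Every name
below is hypothetical and chosen only to mirror the slide; none of it is Drill's actual API.

    // Hypothetical pipeline sketch: driver -> parser -> optimizer -> engine.
    class PipelineSketch {
      static class LogicalPlan {}    // placeholder types for the sketch
      static class PhysicalPlan {}
      static class ResultStream {}

      interface Parser    { LogicalPlan parse(String queryText); }          // text -> logical DAG
      interface Optimizer { PhysicalPlan optimize(LogicalPlan logical); }   // logical -> physical plan
      interface Engine    { ResultStream execute(PhysicalPlan physical); }  // run plan across cluster

      // The driver owns the client interaction: query text in, result stream out.
      static class Driver {
        private final Parser parser;
        private final Optimizer optimizer;
        private final Engine engine;

        Driver(Parser p, Optimizer o, Engine e) { parser = p; optimizer = o; engine = e; }

        ResultStream run(String queryText) {
          LogicalPlan logical = parser.parse(queryText);
          PhysicalPlan physical = optimizer.optimize(logical);
          return engine.execute(physical);   // storage handlers feed data in/out underneath
        }
      }
    }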
Parser
• Converts text-based query language into internal DAG representation
  – Grammar, syntax, basic query validation
  – Generally straightforward to implement


• Initial goal is to support a SQL-like query language for nested data (DrQL); see the
  sketch after this list
   – Compatible with Google BigQuery/Dremel
   – Designed to support data sources that have a well-defined schema (e.g.
     protocol buffers) as well as those that don't (e.g. JSON)

• Other potential input styles:
  – MongoDB's query language
  – Hive
  – Pig
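
To make the parser's job concrete: a DrQL-flavored query over nested records and the kind
of logical DAG it might produce. The query syntax and node names below are illustrative
assumptions, not a specification of DrQL.

    // Illustrative only: hand-built logical DAG for a nested-data query.
    class ParserSketch {
      static final String QUERY =
          "SELECT name, COUNT(children.age) AS kids "
          + "FROM people WHERE children.age > 10 GROUP BY name";

      // A logical node: an operator plus the nodes it reads from (the edges of the DAG).
      record LogicalNode(String op, java.util.List<LogicalNode> inputs) {}

      static LogicalNode parse(String query) {
        // A real parser derives this from the grammar; here the DAG is built by hand.
        LogicalNode scan   = new LogicalNode("scan(people)", java.util.List.of());
        LogicalNode filter = new LogicalNode("filter(children.age > 10)", java.util.List.of(scan));
        return new LogicalNode("aggregate(group=name, count(children.age))",
                               java.util.List.of(filter));
      }

      public static void main(String[] args) {
        System.out.println(parse(QUERY));   // prints the nested DAG via the record toString
      }
    }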
Traditional Query Optimizers
• 30+ years of history in relational query optimization
   – We have to follow the same general path


• Converts a logical query plan into a physical one
  – Example: convert a logical “JOIN” operator into a specific hash join operator
    (sketched below)
  – Attempts to choose the “best” overall execution plan


• Magic black box of statistics!
  – Optimizers do great with queries that can be easily modeled with
    available statistics
  – Difficulties: lack of statistics, complex schemas, complex queries
  – Database users often work around the optimizer using query hints
    ● “force index”
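
A minimal sketch of the logical-to-physical step, assuming a hypothetical rule that picks a
hash join and builds on the smaller input when row-count estimates are available. The
threshold and class names are stand-ins for a real cost model, not Drill's optimizer.

    // Illustrative logical JOIN -> physical join operator selection.
    class JoinPlanningSketch {
      record LogicalJoin(long leftRowsEstimate, long rightRowsEstimate) {}

      interface PhysicalJoin {}
      record HashJoin(String buildSide) implements PhysicalJoin {}
      record MergeJoin() implements PhysicalJoin {}

      static PhysicalJoin choose(LogicalJoin join) {
        long smaller = Math.min(join.leftRowsEstimate(), join.rightRowsEstimate());
        // With good statistics, build the hash table on the smaller side; when the
        // estimates are missing or wrong, this is exactly where plans go bad and
        // users reach for hints such as "force index".
        if (smaller < 1_000_000) {
          return new HashJoin(join.leftRowsEstimate() <= join.rightRowsEstimate() ? "left" : "right");
        }
        return new MergeJoin();   // fall back to a sort-based join for two large inputs
      }
    }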
Intermediate Representation
• Intermediate Representation (IR) is the common internal API
   – Output from Parser
   – Input/Output from Optimizer
   – Input to Execution Engine


• Textual Representation:
   – Flexibility
     ● Different users can enter at different levels of the IR

     ● Advanced users can skip optimizer entirely


   – Easier to test various pieces
   – Easy to cache
      ● Query optimization can be computationally expensive, so traditional databases go to
        great lengths to reuse execution plans (see the caching sketch below)

• Ideally the IR would be the format used between optimization passes
   – Inspiration: LLVM, SQL Server showplan
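
One concrete payoff of a textual IR is plan reuse: the IR text itself can key a cache, so a
repeated query (or a pre-optimized plan submitted directly by an advanced user) skips the
expensive compilation steps. A minimal sketch with hypothetical types:

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.function.Function;

    // Sketch of plan reuse keyed by the textual IR (illustrative, not Drill's API).
    // Because the IR is plain text, it may come from the parser, from a previous
    // optimization pass, or straight from a user who skipped the optimizer entirely.
    class PlanCacheSketch {
      static class PhysicalPlan {}

      private final Map<String, PhysicalPlan> cache = new ConcurrentHashMap<>();

      PhysicalPlan plan(String irText, Function<String, PhysicalPlan> compile) {
        // Optimization is expensive, so a repeated IR string skips it entirely.
        return cache.computeIfAbsent(irText, compile);
      }
    }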
Execution Engine
• Execution layer
   – Query is a DAG of operators

• Operator layer
   – Implementation of individual operators and data format
     serialization
Execution Layer
• Query structured as a Directed Acyclic Graph (DAG) representing the data flow
  – Each node is an abstract “operator”
  – Communication between nodes is “blobs” of data
  – Data model described well in Microsoft's Dryad paper (Isard '07)


• Responsible for handling:
  – Operator dependencies
  – Task scheduling
  – Inter-node communication


• Notable features:
  – Speculative execution
  – Pipelining with spill-to-disk as fallback
  – Back pressure
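
A minimal sketch of this execution model, assuming a pull-based operator tree: each operator
asks its child for the next batch, which gives pipelining for free and provides back pressure,
since upstream operators produce data only when downstream asks. The interface is an
assumption for the sketch, not Drill's execution API.

    import java.util.List;
    import java.util.function.Predicate;

    // Illustrative pull-based operator DAG: each node is an operator, and the edges
    // carry batches ("blobs") of rows. Returning null signals end of stream.
    interface Operator {
      List<Object[]> nextBatch();
    }

    // A filter pipelines naturally: it pulls one batch from its child only when the
    // operator above asks for output, which is also what gives the DAG back pressure.
    class FilterOperator implements Operator {
      private final Operator child;
      private final Predicate<Object[]> keepRow;

      FilterOperator(Operator child, Predicate<Object[]> keepRow) {
        this.child = child;
        this.keepRow = keepRow;
      }

      @Override
      public List<Object[]> nextBatch() {
        List<Object[]> batch = child.nextBatch();
        if (batch == null) return null;                 // upstream exhausted
        return batch.stream().filter(keepRow).toList(); // pass through matching rows only
      }
    }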
Operator Execution
• Implementation of individual operators
   – Example built-in operators: hash aggregate, filter, json-scan
   – Extensible so new operators are easy to plug in


• Serialization-aware:
   – Each “blob” is a batch of rows in a particular format:
     ● Row-wise, no schema: MessagePack

     ● Row-wise, schema: Protocol Buffers

     ● Columnar, schema: Dremel-style format


   – Different operator implementations for different serializations
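
Because each blob is tagged with its serialization, the engine can pick an operator
implementation that matches the batch format, e.g. a columnar aggregate for Dremel-style
batches and a row-wise one otherwise. A hedged sketch; the formats mirror the slide but the
operator classes and factory are placeholders.

    // Illustrative dispatch on blob serialization, not Drill's internals.
    class SerializationDispatchSketch {
      enum BatchFormat { ROW_NO_SCHEMA_MSGPACK, ROW_SCHEMA_PROTOBUF, COLUMNAR_SCHEMA_DREMEL }

      interface HashAggregate { void consume(byte[] batch); }

      static class RowWiseHashAggregate implements HashAggregate {
        public void consume(byte[] batch) { /* decode rows, update the hash table */ }
      }
      static class ColumnarHashAggregate implements HashAggregate {
        public void consume(byte[] batch) { /* work directly on the column vectors */ }
      }

      static HashAggregate hashAggregateFor(BatchFormat format) {
        return switch (format) {
          case COLUMNAR_SCHEMA_DREMEL                     -> new ColumnarHashAggregate();
          case ROW_NO_SCHEMA_MSGPACK, ROW_SCHEMA_PROTOBUF -> new RowWiseHashAggregate();
        };
      }
    }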
Storage Interfaces
• Scanner operators
   – Common APIs to convert user data into formats understood by
     execution operators (see the scanner sketch after this slide)
   – Example conversions:
     ● JSON → MessagePack

     ● CSV → MessagePack

     ● Dremel: columnar serialization → Protocol Buffers




• Data sources:
  – HDFS
  – NFS
  – HBase / Cassandra
  – MySQL / PostgreSQL / etc
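
A scanner's job, sketched below with hypothetical types, is to turn an external
representation (here a line-delimited JSON file; equally an HBase table or a CSV on NFS)
into batches the execution operators understand. The interface and encoding are assumptions
for the sketch, not Drill's storage API.

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.List;

    // Illustrative scanner interface: external data in, engine batches out.
    interface ScanOperator {
      List<byte[]> nextBatch() throws IOException;   // e.g. MessagePack-encoded rows
      void close() throws IOException;
    }

    class JsonFileScanner implements ScanOperator {
      private final BufferedReader reader;

      JsonFileScanner(Path path) throws IOException {
        this.reader = Files.newBufferedReader(path);
      }

      @Override
      public List<byte[]> nextBatch() throws IOException {
        String line = reader.readLine();   // one JSON record per line, for simplicity
        if (line == null) return null;     // end of input
        // A real scanner would parse the JSON and re-encode it (e.g. as MessagePack);
        // the raw bytes stand in for that conversion here.
        return List.of(line.getBytes(StandardCharsets.UTF_8));
      }

      @Override
      public void close() throws IOException { reader.close(); }
    }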
Storage Interfaces
• Scanner Flexibility:
   – Allow in-place filtering (predicate pushdown); sketched after this slide
   – Scanners can manage their own caching policies for their
     data

• In-place processing
   – Having a separate “ETL” step is painful
     ● Easiest to process data on demand


   – Query workload gives feedback on scanner access patterns
      ● Database Cracking: adaptively convert storage layout into
        more efficient forms
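
Predicate pushdown simply means the filter travels with the scan request, so a scanner that
can evaluate it at the source (an HBase filter, a WHERE clause sent to MySQL, a min/max check
on a columnar chunk) returns fewer rows to the engine. A small hedged sketch; all names here
are assumptions.

    import java.util.List;
    import java.util.Map;
    import java.util.function.Predicate;

    // Illustrative predicate pushdown: rows are dropped inside the scanner instead of
    // after they are shipped to a downstream filter operator.
    class PushdownSketch {
      record Row(Map<String, Object> fields) {}

      interface PushdownScanner {
        List<Row> scan(Predicate<Row> pushedDownFilter);
      }

      // In-memory stand-in for a real source (an HDFS file, an HBase table, a MySQL query...).
      static class ListScanner implements PushdownScanner {
        private final List<Row> rows;
        ListScanner(List<Row> rows) { this.rows = rows; }

        @Override
        public List<Row> scan(Predicate<Row> pushedDownFilter) {
          // The filtering happens at the source; only matching rows reach the engine.
          return rows.stream().filter(pushedDownFilter).toList();
        }
      }
    }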
Design Principles
Flexible
•   Pluggable query languages
•   Extensible execution engine
•   Pluggable data formats
•   Column-based and row-based
•   Schema and schema-less
•   Pluggable data sources

Easy
•   Unzip and run
•   Zero configuration
•   Reverse DNS not needed
•   IP addresses can change
•   Clear and concise log messages

Dependable
• No SPOF
• Instant recovery from crashes

Fast
• C/C++ core with Java support
• Google C++ style guide
• Min latency and max throughput (limited only by hardware)
