Technical Overview of Apache Drill by Jacques Nadeau


Published on

A deep dive into Apache Drill given by Jacques Nadeau on May 2013.

Published in: Technology
1 Like
  • Be the first to comment

No Downloads
Total views
On SlideShare
From Embeds
Number of Embeds
Embeds 0
No embeds

No notes for slide

Technical Overview of Apache Drill by Jacques Nadeau

  1. 1. 1 > Technical Overview Jacques Nadeau, May 22, 2013
  2. 2. 2 Basic Process Zookeeper DFS/HBase DFS/HBase DFS/HBase Drillbit Distributed Cache Drillbit Distributed Cache Drillbit Distributed Cache Query 1. Query comes to any Drillbit 2. Drillbit generates execution plan based on affinity 3. Fragments are farmed to individual nodes 4. Data is returned to driving node
  3. 3. 3 Core Modules within a Drillbit SQL Parser Optimizer PhysicalPlan DFS Engine HBase Engine RPC Endpoint Distributed Cache StorageEngineInterface LogicalPlan Execution
  4. 4. 4 Query States SQL  What we want to do (analyst friendly) Logical Plan:  What we want to do (language agnostic, computer friendly) Physical Plan  How we want to do it (the best way we can tell) Execution Plan (fragments)  Where we want to do it
  5. 5. 5 SQL SELECT as name, SUM(t.cf1.sales) as total_sales FROM m7://cluster1/sales t GROUP BY name ORDER BY by total_sales desc LIMIT 10;
  6. 6. 6 Logical Plan: API/Format using JSON  Designed to be as easy as possible for language implementers to utilize – Sugared syntax such as sequence meta-operator  Don’t constrain ourselves to SQL specific paradigm – support complex data type operators such as collapse and expand as well  Allow late typing sequence: [ { op: scan, storageengine: m7, selection: {table: sales}} { op: project, projections: [ {ref: name, expr:}, {ref: sales, expr: cf1.sales}]} { op: segment, ref: by_name, exprs: [name]} { op: collapsingaggregate, target: by_name, carryovers: [name], aggregations: [{ref: total_sales, expr: sum(name)}]} { op: order, ordering: [{order: desc, expr: total_sales}]} { op: store, storageengine: screen} ]
  7. 7. 7 Physical Plan  Insert points of parallelization where optimizer thinks they are necessary – If we thought that the cardinality of name would be high, we might use an alternative of sort > range-merge-exchange > streaming aggregate > sort > range-merge-exchange instead of the simpler hash-random-exchange > sorting-hash-aggregate.  Pick the right version of each operator – For example, here we’ve picked the sorting hash aggregate. Since a hash aggregate is already a blocking operator, doing the sort simultaneously allows us to avoid materializing an intermediate state  Apply projection and other push-down rules into capable operators – Note that the projection is gone, applied directly to the m7scan operator. { @id: 1, pop: m7scan, cluster: def, table: sales, cols: [,]} { @id: 2, op: hash-random-exchange, input: 1, expr: 1} { @id: 3, op: sorting-hash-aggregate, input: 2, grouping: 1, aggr:[sum(2)], carry: [1], sort: ~agrr[0] } { @id: 4, op: screen, input: 4}
  8. 8. 8 Execution Plan  Break plan into major fragments  Determine quantity of parallelization for each task based on estimated costs as well as maximum parallelization for each fragment (file size for now)  Collect up endpoint affinity for each particular HasAffinity operator  Assign particular nodes based on affinity, load and topology  Generate minor versions of each fragment for individual execution FragmentId:  Major = portion of dataflow  Minor = a particular version of that execution (1 or more)
  9. 9. 9 Execution Plan, cont’d Each execution plan has:  One root fragment (runs on driving node)  Leaf fragments (first tasks to run)  Intermediate fragments (won’t start until they receive data from their children)  In the case where the query output is routed to storage, the root operator will often receive metadata to present rather than data Root Intermediate Leaf Intermediate Leaf
  10. 10. 10 Example Fragments Leaf Fragment 1 { pop : "hash-partition-sender", @id : 1, child : { pop : "mock-scan", @id : 2, url : "", entries : [ { id : 1, records : 4000}] }, destinations : [ "Cglsb2NhbGhvc3QY0gk=" ] Leaf Fragment 2 { pop : "hash-partition-sender", @id : 1, child : { pop : "mock-scan", @id : 2, url : "", entries : [ { id : 1, records : 4000 }, { id : 2, records : 4000 } ] }, destinations : [ "Cglsb2NhbGhvc3QY0gk=" ] } Root Fragment { pop : "screen", @id : 1, child : { pop : "random-receiver", @id : 2, providingEndpoints : [ "Cglsb2NhbGhvc3QY0gk=" ] } } Intermediate Fragment { pop : "single-sender", @id : 1, child : { pop : "mock-store", @id : 2, child : { pop : "filter", @id : 3, child : { pop : "random-receiver", @id : 4, providingEndpoints : [ "Cglsb2NhbGhvc3QYqRI=", "Cglsb2NhbGhvc3QY0gk=" ] }, expr : " ('b') > (5) " } }, destinations : [ "Cglsb2NhbGhvc3QYqRI=" ] }
  11. 11. 11 Execution Flow Drill Client UserServer Query Foreman BitCom Parser Optimizer Execution Planner
  12. 12. 12 SQL Parser  Leverage Optiq  Add support for “any” type  Add support for nested and repeated[] references  Add transformation rules to convert from SQL AST to Logical plan syntax
  13. 13. 13 Optimizer  Convert Logical to Physical  Very much TBD  Likely leverage Optiq  Hardest problem in system, especially given lack of statistics  Probably not parallel
  14. 14. 14 Execution Planner  Each scan operator provides a maximum width of parallelization based on the number of read entries (similar to splits)  Decision of parallelization width is based on simple disk costs size  Affinity orders the location of fragment assignment  Storage, Scan and Exchange operators are informed of the actual endpoint assignments to then re-decide their entries (splits)
  15. 15. 15 Grittier
  16. 16. 16 Execution Engine  Single JVM per Drillbit  Small heap space for object management  Small set of network event threads to manage socket operations  Callbacks for each message sent  Messages contain header and collection of native byte buffers  Designed to minimize copies and ser/de costs  Query setup and fragment runners are managed via processing queues & thread pools
  17. 17. 17 Data  Records are broken into batches  Batches contain a schema and a collection of fields  Each field has a particular type (e.g. smallint)  Fields (a.k.a. columns) are stored in ValueVectors  ValueVectors are façades to byte buffers.  The in-memory structure of each ValueVector is well defined and language agnostic  ValueVectors defined based on the width and nature of the underlying data – RepeatMap Fixed1 Fixed2 Fixed4 Fixed8 Fixed12 Fixed16 Bit FixedLen VarLen1 VarLen2 VarLen4  There are three sub value vector types – Optional (nullable), required or repeated
  18. 18. 18 Execution Paradigm  We will have a large amount of operators  Each operator works on a batch of records at a time  A loose goal is batches are roughly a single core’s L2 cache in size  Each batch of records carries a schema  An operator is responsible for reconfiguring itself if a new schema arrives (or rejecting the record batch if the schema is disallowed)  Most operators are the combination of a set of static operations along with the evaluation of query specific expressions  Runtime compiled operators are the combination of a pre-compiled template and a runtime compiled set of expressions  Exchange operators are converted into Senders and Receiver when execution plan is materialized  Each operator must support consumption of a SelectionVector, a partial materialization of a filter
  19. 19. 19 Storage Engine  Input and output is done through storage engines – (and the screen specialized storage operator)  A storage engine is responsible for providing metadata and statistics about the data  A storage engine exposes a set of optimizer (plan rewrite) rules to support things such as predicate pushdown  A storage engine provides one or more storage engine specific scan operators that can support affinity exposure and task splitting – These are generated based on a StorageEngine specific configuration  The primary interfaces are RecordReader and RecordWriter.  RecordReaders are responsible for – Converting stored data into Drill canonical ValueVector format a batch at a time – Providing schema for each record batch  Our initial storage engines will be for DFS and HBase
  20. 20. 20 Messages  Foreman drives query  Foreman saves intermediate fragments to distributed cache  Foreman sends leaf fragments directly to execution nodes  Executing fragments push record batches to their fragment’s destination nodes  When destination node receives first fragment for a new query, it retrieves its appropriate fragment from distributed cache, setups up required framework, then waits until the start requirement is needed: – A fragment is evaluated for the number of different sending streams that are required before the query can actually be scheduled based on each exchanges “supportsOutOfOrder” capability. – When the IncomingBatchHandler recognizes that its start criteria has been reached, it begins – In the meantime, destination mode will buffer (potentially to disk)  Fragment status messages are pushed back to foreman directly from individual nodes  A single failure status causes the foreman to cancel all other parts of query
  21. 21. 21 Scheduling  Plan is to leverage the concepts inside Sparrow  Reality is that receiver-side buffering and pre-assigned execution locations means that this is very much up in the air right now
  22. 22. 22 Operation/Configuration  Drillbit is a single JVM  Extension is done by building to an api and generating a jar file that includes a drill-module.conf file with information about where that module needs to be inserted  All configuration is done via a JSON like configuration metaphort that supports complex types  Node discovery/service registry is done through Zookeeper  Metrics are collected utilizing the Yammer metrics module
  23. 23. 23 User Interfaces  Drill provides DrillClient – Encapsulates endpoint discovery – Supports logical and physical plan submission, query cancellation, query status – Supports streaming return results  Drill will provide a JDBC driver which converts JDBC into DrillClient communication. – Currently SQL parsing is done client side • Artifact of the current state of Optiq • Need to slim up the JDBC driver and push stuff remotely  In time, will add REST proxy for DrillClient
  24. 24. 24 Technologies  Jackson for JSON SerDe for metadata  Typesafe HOCON for configuration and module management  Netty4 as core RPC engine, protobuf for communication  Vanilla Java, Larray and Netty ByteBuf for off-heap large data structure help  Hazelcast for distributed cache  Curator on top of Zookeeper for service registry  Optiq for SQL parsing and cost optimization  Parquet (probably) as ‘native’ format  Janino for expression compilation  ASM for ByteCode manipulation  Yammer Metrics for metrics  Guava extensively  Carrot HPC for primitive collections