Apache Drill Technical Overview



  1. Technical Overview
     Jacques Nadeau, jacques@apache.org
     May 22, 2013
  2. Basic Process
     (Diagram: Zookeeper above a cluster of Drillbits, each Drillbit colocated
     with DFS/HBase and a distributed cache.)
     1. Query comes to any Drillbit
     2. Drillbit generates execution plan based on affinity
     3. Fragments are farmed out to individual nodes
     4. Data is returned to the driving node
  3. Core Modules within a Drillbit
     • SQL Parser
     • Optimizer
     • Logical Plan
     • Physical Plan / Execution
     • Storage Engine Interface (DFS Engine, HBase Engine)
     • RPC Endpoint
     • Distributed Cache
  4. Query States
     • SQL: what we want to do (analyst friendly)
     • Logical Plan: what we want to do (language agnostic, computer friendly)
     • Physical Plan: how we want to do it (the best way we can tell)
     • Execution Plan (fragments): where we want to do it
  5. SQL
     SELECT
       t.cf1.name as name,
       SUM(t.cf1.sales) as total_sales
     FROM m7://cluster1/sales t
     GROUP BY name
     ORDER BY total_sales DESC
     LIMIT 10;
  6. Logical Plan: API/Format using JSON
     • Designed to be as easy as possible for language implementers to utilize
       – Sugared syntax such as the sequence meta-operator
     • Don’t constrain ourselves to a SQL-specific paradigm: support complex
       data type operators such as collapse and expand as well
     • Allow late typing

     sequence: [
       { op: scan, storageengine: m7, selection: {table: sales} },
       { op: project, projections: [
           {ref: name, expr: cf1.name},
           {ref: sales, expr: cf1.sales} ]},
       { op: segment, ref: by_name, exprs: [name] },
       { op: collapsingaggregate, target: by_name, carryovers: [name],
         aggregations: [{ref: total_sales, expr: sum(sales)}] },
       { op: order, ordering: [{order: desc, expr: total_sales}] },
       { op: store, storageengine: screen }
     ]
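As a rough illustration of what this plan computes, here is a minimal Java sketch (class and method names are hypothetical, and the in-memory evaluation is for illustration only, not Drill code) of the segment / collapsing-aggregate / order / limit pipeline over plain collections:

```java
import java.util.*;
import java.util.stream.*;

// Sketch of the logical plan's semantics: scan -> project(name, sales) ->
// segment by name -> collapsing aggregate (SUM per segment) -> order desc
// -> limit. Not Drill code; purely an in-memory illustration.
public class LogicalPlanSketch {

    /** Returns the top-N (name, total_sales) pairs, highest totals first. */
    public static LinkedHashMap<String, Long> topSales(List<Object[]> rows, int limit) {
        // segment + collapsingaggregate: one running SUM per distinct name
        Map<String, Long> totals = new HashMap<>();
        for (Object[] r : rows) {
            totals.merge((String) r[0], ((Number) r[1]).longValue(), Long::sum);
        }
        // order desc + limit, preserving the sorted order in a LinkedHashMap
        return totals.entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
                .limit(limit)
                .collect(LinkedHashMap::new,
                         (m, e) -> m.put(e.getKey(), e.getValue()),
                         LinkedHashMap::putAll);
    }

    public static void main(String[] args) {
        List<Object[]> rows = Arrays.asList(
                new Object[]{"ada", 5}, new Object[]{"bob", 3}, new Object[]{"ada", 4});
        System.out.println(topSales(rows, 10)); // {ada=9, bob=3}
    }
}
```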
  7. Physical Plan
     • Insert points of parallelization where the optimizer thinks they are
       necessary
       – If we thought that the cardinality of name would be high, we might use
         an alternative of sort > range-merge-exchange > streaming aggregate >
         sort > range-merge-exchange instead of the simpler
         hash-random-exchange > sorting-hash-aggregate.
     • Pick the right version of each operator
       – For example, here we’ve picked the sorting hash aggregate. Since a hash
         aggregate is already a blocking operator, doing the sort simultaneously
         allows us to avoid materializing an intermediate state.
     • Apply projection and other push-down rules into capable operators
       – Note that the projection is gone, applied directly to the m7scan
         operator.

     { @id: 1, pop: m7scan, cluster: def, table: sales, cols: [cf1.name, cf1.sales]}
     { @id: 2, op: hash-random-exchange, input: 1, expr: 1}
     { @id: 3, op: sorting-hash-aggregate, input: 2, grouping: 1,
       aggr: [sum(2)], carry: [1], sort: ~aggr[0]}
     { @id: 4, op: screen, input: 3}
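The hash-random-exchange above repartitions rows across receiving fragments by hashing the grouping key, so a downstream hash aggregate sees every row for a given name. A minimal sketch (the modulo scheme and names are assumptions for illustration, not Drill's actual partitioner):

```java
// Sketch of hash partitioning for an exchange operator: rows with equal
// grouping keys always route to the same receiving fragment.
public class HashExchangeSketch {

    /** Picks a destination fragment index in [0, receivers) for a key. */
    public static int destination(Object key, int receivers) {
        // Mask the sign bit so the index is never negative.
        return (key.hashCode() & 0x7fffffff) % receivers;
    }

    public static void main(String[] args) {
        // Identical keys always land on the same receiver.
        System.out.println(destination("ada", 4) == destination("ada", 4)); // true
    }
}
```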
  8. Execution Plan
     • Break the plan into major fragments
     • Determine the quantity of parallelization for each task based on
       estimated costs as well as the maximum parallelization for each fragment
       (file size for now)
     • Collect up endpoint affinity for each particular HasAffinity operator
     • Assign particular nodes based on affinity, load and topology
     • Generate minor versions of each fragment for individual execution

     FragmentId:
     • Major = a portion of the dataflow
     • Minor = a particular version of that execution (1 or more)
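One way to picture the major/minor split (identifiers and formatting are hypothetical): each major fragment is cloned into `width` minor fragments, one per parallel execution:

```java
import java.util.*;

// Sketch: a major fragment (a portion of the dataflow) fans out into
// `width` minor fragments, each identified by a (majorId, minorId) pair.
public class FragmentIds {

    public static List<String> minorFragments(int majorId, int width) {
        List<String> ids = new ArrayList<>();
        for (int minor = 0; minor < width; minor++) {
            ids.add(majorId + ":" + minor); // e.g. "2:0", "2:1", ...
        }
        return ids;
    }

    public static void main(String[] args) {
        System.out.println(minorFragments(2, 3)); // [2:0, 2:1, 2:2]
    }
}
```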
  9. Execution Plan, cont’d
     Each execution plan has:
     • One root fragment (runs on the driving node)
     • Leaf fragments (first tasks to run)
     • Intermediate fragments (won’t start until they receive data from their
       children)
     In the case where the query output is routed to storage, the root operator
     will often receive metadata to present rather than data.
     (Diagram: a tree with the root fragment at the top, intermediate fragments
     below it, and leaf fragments at the bottom.)
  10. Example Fragments

      Leaf Fragment 1
      {
        pop : "hash-partition-sender",
        @id : 1,
        child : {
          pop : "mock-scan",
          @id : 2,
          url : "http://apache.org",
          entries : [ { id : 1, records : 4000 } ]
        },
        destinations : [ "Cglsb2NhbGhvc3QY0gk=" ]
      }

      Leaf Fragment 2
      {
        pop : "hash-partition-sender",
        @id : 1,
        child : {
          pop : "mock-scan",
          @id : 2,
          url : "http://apache.org",
          entries : [ { id : 1, records : 4000 }, { id : 2, records : 4000 } ]
        },
        destinations : [ "Cglsb2NhbGhvc3QY0gk=" ]
      }

      Root Fragment
      {
        pop : "screen",
        @id : 1,
        child : {
          pop : "random-receiver",
          @id : 2,
          providingEndpoints : [ "Cglsb2NhbGhvc3QY0gk=" ]
        }
      }

      Intermediate Fragment
      {
        pop : "single-sender",
        @id : 1,
        child : {
          pop : "mock-store",
          @id : 2,
          child : {
            pop : "filter",
            @id : 3,
            child : {
              pop : "random-receiver",
              @id : 4,
              providingEndpoints : [ "Cglsb2NhbGhvc3QYqRI=", "Cglsb2NhbGhvc3QY0gk=" ]
            },
            expr : " (b) > (5) "
          }
        },
        destinations : [ "Cglsb2NhbGhvc3QYqRI=" ]
      }
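The destinations and providingEndpoints entries look like base64-encoded protobuf endpoint messages. Assuming the length-delimited field holds the address and the varint field holds the port (an inference from the raw bytes, not taken from Drill's .proto definitions), a small stdlib-only decoder recovers them:

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hand-decodes an endpoint token such as "Cglsb2NhbGhvc3QY0gk=": the
// length-delimited protobuf field (wire type 2) carries the address and a
// varint field (wire type 0) carries the port.
public class EndpointDecode {

    public static String decode(String token) {
        byte[] b = Base64.getDecoder().decode(token);
        int i = 0;
        String address = null;
        long port = -1;
        while (i < b.length) {
            int tag = b[i++] & 0xff;
            int wireType = tag & 0x7;
            if (wireType == 2) {           // length-delimited: the address
                int len = b[i++] & 0xff;
                address = new String(b, i, len, StandardCharsets.UTF_8);
                i += len;
            } else if (wireType == 0) {    // varint: the port
                long v = 0; int shift = 0; int x;
                do { x = b[i++] & 0xff; v |= (long) (x & 0x7f) << shift; shift += 7; }
                while ((x & 0x80) != 0);
                port = v;
            } else {
                break;                     // unknown wire type; stop
            }
        }
        return address + ":" + port;
    }

    public static void main(String[] args) {
        System.out.println(decode("Cglsb2NhbGhvc3QY0gk=")); // localhost:1234
    }
}
```

Decoding both tokens shows the two example destinations are the same host on different ports, which matches the two-Drillbit shape of the example plan.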
  11. Execution Flow
      (Diagram: a Drill Client talks to the UserServer; the Query Foreman
      coordinates the Parser, Optimizer and Execution Planner, with BitCom
      handling inter-Drillbit communication.)
  12. SQL Parser
      • Leverage Optiq
      • Add support for “any” type
      • Add support for nested and repeated [] references
      • Add transformation rules to convert from the SQL AST to Logical Plan
        syntax
  13. Optimizer
      • Converts Logical Plan to Physical Plan
      • Very much TBD
      • Likely to leverage Optiq
      • Hardest problem in the system, especially given the lack of statistics
      • Probably not parallel
  14. Execution Planner
      • Each scan operator provides a maximum width of parallelization based on
        the number of read entries (similar to splits)
      • The decision on parallelization width is based on a simple disk cost
        (size)
      • Affinity orders the location of fragment assignment
      • Storage, Scan and Exchange operators are informed of the actual
        endpoint assignments so they can re-decide their entries (splits)
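A toy version of the affinity-ordering step (the heuristic and names are invented for illustration): rank candidate endpoints by how many of a scan's read entries are local to them, so fragment assignment prefers data-local nodes:

```java
import java.util.*;

// Sketch of affinity ordering: endpoints hosting more of a scan's read
// entries (splits) sort first and are preferred when fragments are placed.
public class AffinityAssign {

    /** entriesByEndpoint: endpoint -> number of locally hosted read entries. */
    public static List<String> orderByAffinity(Map<String, Integer> entriesByEndpoint) {
        List<String> endpoints = new ArrayList<>(entriesByEndpoint.keySet());
        // Sort descending by local-entry count (small counts, so subtraction is safe).
        endpoints.sort((a, b) -> entriesByEndpoint.get(b) - entriesByEndpoint.get(a));
        return endpoints;
    }

    public static void main(String[] args) {
        Map<String, Integer> affinity = new HashMap<>();
        affinity.put("node-a", 1);
        affinity.put("node-b", 3);
        System.out.println(orderByAffinity(affinity)); // [node-b, node-a]
    }
}
```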
  15. Grittier
  16. Execution Engine
      • Single JVM per Drillbit
      • Small heap space for object management
      • Small set of network event threads to manage socket operations
      • Callbacks for each message sent
      • Messages contain a header and a collection of native byte buffers
      • Designed to minimize copies and ser/de costs
      • Query setup and fragment runners are managed via processing queues &
        thread pools
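The copy-minimizing design can be illustrated with direct (off-heap) ByteBuffers: slicing a buffer creates a view over the same memory instead of copying bytes. This is a general NIO technique, shown here as a sketch rather than Drill's actual buffer code:

```java
import java.nio.ByteBuffer;

// Demonstrates zero-copy sharing: a slice of a direct (off-heap) buffer
// views the same memory, so handing message payloads to operators avoids
// byte copies and their ser/de cost.
public class OffHeapSlice {

    /** Returns a view of [offset, offset+length) sharing the message's memory. */
    public static ByteBuffer payloadView(ByteBuffer message, int offset, int length) {
        ByteBuffer dup = message.duplicate();   // independent position/limit
        dup.position(offset);
        dup.limit(offset + length);
        return dup.slice();                     // shares the same backing memory
    }

    public static void main(String[] args) {
        ByteBuffer msg = ByteBuffer.allocateDirect(8);
        msg.put(4, (byte) 42);
        ByteBuffer view = payloadView(msg, 4, 4);
        System.out.println(view.get(0)); // 42
        msg.put(4, (byte) 7);            // writes through: the view sees 7
        System.out.println(view.get(0)); // 7
    }
}
```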
  17. Data
      • Records are broken into batches
      • Batches contain a schema and a collection of fields
      • Each field has a particular type (e.g. smallint)
      • Fields (a.k.a. columns) are stored in ValueVectors
      • ValueVectors are façades to byte buffers
      • The in-memory structure of each ValueVector is well defined and
        language agnostic
      • ValueVectors are defined based on the width and nature of the
        underlying data
        – RepeatMap, Fixed1, Fixed2, Fixed4, Fixed8, Fixed12, Fixed16, Bit,
          FixedLen, VarLen1, VarLen2, VarLen4
      • There are three sub value vector types
        – Optional (nullable), required or repeated
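A variable-length ValueVector can be pictured as two buffers: an offsets buffer and a contiguous data buffer, where offsets[i]..offsets[i+1] delimits value i. The sketch below illustrates that layout; the details are simplified and do not reproduce Drill's exact format:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Sketch of a VarLen vector: an offsets buffer plus one contiguous data
// buffer, so values are read positionally without per-row object overhead.
public class VarLenVectorSketch {
    private final ByteBuffer offsets;  // (count + 1) ints
    private final ByteBuffer data;     // all values back to back
    private final int count;

    public VarLenVectorSketch(String[] values) {
        count = values.length;
        offsets = ByteBuffer.allocate(4 * (count + 1));
        int total = 0;
        for (String v : values) total += v.getBytes(StandardCharsets.UTF_8).length;
        data = ByteBuffer.allocate(total);
        offsets.putInt(0);                       // value 0 starts at byte 0
        for (String v : values) {
            data.put(v.getBytes(StandardCharsets.UTF_8));
            offsets.putInt(data.position());     // end offset of this value
        }
    }

    /** Reads value i back out of the shared data buffer. */
    public String get(int i) {
        int start = offsets.getInt(4 * i);
        int end = offsets.getInt(4 * (i + 1));
        byte[] out = new byte[end - start];
        for (int j = 0; j < out.length; j++) out[j] = data.get(start + j);
        return new String(out, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        VarLenVectorSketch v = new VarLenVectorSketch(new String[]{"hi", "drill"});
        System.out.println(v.get(1)); // drill
    }
}
```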
  18. Execution Paradigm
      • We will have a large number of operators
      • Each operator works on a batch of records at a time
      • A loose goal is that batches are roughly a single core’s L2 cache in
        size
      • Each batch of records carries a schema
      • An operator is responsible for reconfiguring itself if a new schema
        arrives (or rejecting the record batch if the schema is disallowed)
      • Most operators are the combination of a set of static operations along
        with the evaluation of query-specific expressions
      • Runtime-compiled operators are the combination of a pre-compiled
        template and a runtime-compiled set of expressions
      • Exchange operators are converted into Senders and Receivers when the
        execution plan is materialized
      • Each operator must support consumption of a SelectionVector, a partial
        materialization of a filter
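The SelectionVector idea can be sketched as an index array: the filter records which row positions passed, and downstream operators read the original batch through those indices instead of copying survivors into a new batch. Names here are hypothetical, and the example reuses the `(b) > (5)` filter expression from the intermediate fragment above:

```java
import java.util.*;

// Sketch of a selection vector: a filter emits the indices of qualifying
// rows; downstream operators iterate over those indices rather than
// materializing a filtered copy of the batch.
public class SelectionVectorSketch {

    /** Returns the indices of values in `batch` where value > threshold. */
    public static int[] filterGreaterThan(int[] batch, int threshold) {
        int[] sel = new int[batch.length];
        int n = 0;
        for (int i = 0; i < batch.length; i++) {
            if (batch[i] > threshold) sel[n++] = i;
        }
        return Arrays.copyOf(sel, n); // trim to the number of matches
    }

    public static void main(String[] args) {
        int[] b = {3, 9, 6, 1};
        // Rows 1 and 9 and 6 satisfy b > 5, at positions 1 and 2.
        System.out.println(Arrays.toString(filterGreaterThan(b, 5))); // [1, 2]
    }
}
```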
  19. Storage Engine
      • Input and output are done through storage engines
        – (and the specialized screen storage operator)
      • A storage engine is responsible for providing metadata and statistics
        about the data
      • A storage engine exposes a set of optimizer (plan rewrite) rules to
        support things such as predicate pushdown
      • A storage engine provides one or more storage engine specific scan
        operators that can support affinity exposure and task splitting
        – These are generated based on a StorageEngine specific configuration
      • The primary interfaces are RecordReader and RecordWriter
      • RecordReaders are responsible for
        – Converting stored data into Drill’s canonical ValueVector format a
          batch at a time
        – Providing the schema for each record batch
      • Our initial storage engines will be for DFS and HBase
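The RecordReader contract can be pictured as: yield one record batch at a time, with the batch's schema attached. The interface shape below is illustrative only, not Drill's actual signatures:

```java
import java.util.*;

// Illustrative RecordReader shape: each call to next() yields one record
// batch (schema plus column data) until the source is exhausted.
public class MockRecordReader {
    private final List<int[]> batches;
    private int cursor = 0;

    public MockRecordReader(List<int[]> batches) { this.batches = batches; }

    /** Hypothetical batch: a schema string and one column of values. */
    public static final class Batch {
        public final String schema;
        public final int[] column;
        Batch(String schema, int[] column) { this.schema = schema; this.column = column; }
    }

    /** Returns the next batch, or null when the source is exhausted. */
    public Batch next() {
        if (cursor >= batches.size()) return null;
        return new Batch("sales:int", batches.get(cursor++));
    }

    public static void main(String[] args) {
        MockRecordReader r = new MockRecordReader(
                Arrays.asList(new int[]{1, 2}, new int[]{3}));
        int total = 0;
        for (Batch b = r.next(); b != null; b = r.next()) total += b.column.length;
        System.out.println(total); // 3 records across two batches
    }
}
```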
  20. Messages
      • The Foreman drives the query
      • The Foreman saves intermediate fragments to the distributed cache
      • The Foreman sends leaf fragments directly to execution nodes
      • Executing fragments push record batches to their fragment’s destination
        nodes
      • When a destination node receives the first fragment for a new query, it
        retrieves its appropriate fragment from the distributed cache, sets up
        the required framework, then waits until its start criteria are met:
        – A fragment is evaluated for the number of different sending streams
          that are required before the query can actually be scheduled, based
          on each exchange’s “supportsOutOfOrder” capability
        – When the IncomingBatchHandler recognizes that its start criteria have
          been reached, it begins
        – In the meantime, the destination node will buffer (potentially to
          disk)
      • Fragment status messages are pushed back to the Foreman directly from
        individual nodes
      • A single failure status causes the Foreman to cancel all other parts of
        the query
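The start criterion can be sketched as a countdown over required sender streams (class and method names are hypothetical): the handler buffers incoming batches until it has heard from every required stream, then reports that the fragment may start:

```java
import java.util.*;

// Sketch of an incoming-batch handler's start criterion: buffer batches
// until every required sending stream has been seen at least once, then
// report that the fragment may begin executing.
public class StartCriteriaSketch {
    private final Set<Integer> required;           // sender streams we must hear from
    private final Set<Integer> seen = new HashSet<>();

    public StartCriteriaSketch(Set<Integer> requiredSenderIds) {
        required = requiredSenderIds;
    }

    /** Records a batch from a sender; returns true once all senders were seen. */
    public boolean onBatch(int senderId) {
        seen.add(senderId);
        return seen.containsAll(required);
    }

    public static void main(String[] args) {
        StartCriteriaSketch h = new StartCriteriaSketch(new HashSet<>(Arrays.asList(1, 2)));
        System.out.println(h.onBatch(1)); // false: still waiting on sender 2
        System.out.println(h.onBatch(2)); // true: start criteria reached
    }
}
```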
  21. Scheduling
      • The plan is to leverage the concepts inside Sparrow
      • The reality is that receiver-side buffering and pre-assigned execution
        locations mean that this is very much up in the air right now
  22. Operation/Configuration
      • A Drillbit is a single JVM
      • Extension is done by building to an API and generating a jar file that
        includes a drill-module.conf file with information about where that
        module needs to be inserted
      • All configuration is done via a JSON-like configuration metaphor that
        supports complex types
      • Node discovery/service registry is done through Zookeeper
      • Metrics are collected utilizing the Yammer metrics module
  23. User Interfaces
      • Drill provides DrillClient
        – Encapsulates endpoint discovery
        – Supports logical and physical plan submission, query cancellation and
          query status
        – Supports streaming return of results
      • Drill will provide a JDBC driver which converts JDBC calls into
        DrillClient communication
        – Currently SQL parsing is done client side
          • An artifact of the current state of Optiq
          • Need to slim down the JDBC driver and push work remotely
      • In time, we will add a REST proxy for DrillClient
  24. Technologies
      • Jackson for JSON SerDe for metadata
      • Typesafe HOCON for configuration and module management
      • Netty 4 as the core RPC engine, protobuf for communication
      • Vanilla Java, LArray and Netty ByteBuf for off-heap large data
        structure help
      • Hazelcast for the distributed cache
      • Curator on top of Zookeeper for the service registry
      • Optiq for SQL parsing and cost optimization
      • Parquet (probably) as the ‘native’ format
      • Janino for expression compilation
      • ASM for bytecode manipulation
      • Yammer Metrics for metrics
      • Guava extensively
      • Carrot HPPC for primitive collections