Drill Meetup March 13 email@example.com MapR Technologies
Vision: A tool for interactive analysis using SQL
• Fast
– Low latency queries
– Columnar, vectorized execution
– Fully pipelined streaming engine
– Complement native interfaces and MapReduce/Hive/Pig
• Open
– Community driven open source project
– Under Apache Software Foundation
• Modern
– Standard ANSI SQL:2003 (select/into)
– Nested/hierarchical data support
– Schema is optional
– Supports RDBMS, Hadoop and NoSQL
[Figure: Interactive queries (data analyst, reporting; 100 ms-20 min): Apache Drill. Data mining, modeling, large ETL (20 min-20 hr): MapReduce, Hive, Pig.]
Tenets
• The community is core
– Make it easy for the Hadoop community to work with Drill
– Mostly Java, native where it matters
– Clean APIs at every layer allow extensions in other languages: a DSL in Scala, an optimizer in C, UDFs in Python, etc.
• Memory is scarce: keep things compressed wherever and whenever possible
– Work with record batches, not individual records
– Batches should be off heap: allows dropping to native for things like codec or SIMD support and minimizes GC concerns
– Focus on in-memory formats before disk formats
• Embrace key developments from the past decade
– Cache is the new memory: leverage cache-aware algorithms and vectorized operations
– The rise of nested and late-schema data isn’t a fad; Drill must support it
– Support late tuple materialization and column-aware operators
– Provide extended compression interfaces to apply operations on compressed data
  • E.g. sort an RLE-compressed column, or filter a dictionary-coded column while maintaining compression.
• Operational simplicity
– Single process, no SPOF, extensible HOCON-based modular configuration, No dep…
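The "operate on compressed data" tenet can be illustrated with a small sketch. This is not Drill code: `DictColumn`, `encode` and `filter_dict_column` are hypothetical names invented for the example, and plain Python lists stand in for off-heap buffers. The point is only that a filter on a dictionary-coded column can evaluate the predicate once per distinct value and then work on integer codes, so the output is still dictionary-coded.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class DictColumn:
    """Hypothetical dictionary-coded column: each row stores a small
    integer code that indexes into a list of distinct values."""
    dictionary: List[str]
    codes: List[int]

    def decode(self) -> List[str]:
        # Materialize plain values; only needed at the very end of a query.
        return [self.dictionary[c] for c in self.codes]


def encode(values: List[str]) -> DictColumn:
    """Build a dictionary-coded column from plain values."""
    dictionary: List[str] = []
    index: Dict[str, int] = {}
    codes: List[int] = []
    for v in values:
        if v not in index:
            index[v] = len(dictionary)
            dictionary.append(v)
        codes.append(index[v])
    return DictColumn(dictionary, codes)


def filter_dict_column(col: DictColumn,
                       predicate: Callable[[str], bool]) -> DictColumn:
    """Filter rows without decoding them.

    The predicate runs once per dictionary entry, not once per row.
    Rows are kept or dropped by comparing integer codes only, and the
    result reuses the same dictionary, so it stays dictionary-coded.
    """
    keep = {code for code, value in enumerate(col.dictionary)
            if predicate(value)}
    return DictColumn(col.dictionary, [c for c in col.codes if c in keep])
```

For example, `filter_dict_column(encode(["us", "de", "us"]), lambda v: v != "de")` evaluates the predicate twice (once per distinct value) rather than three times, and the saving grows with the number of rows per distinct value.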
High Level Architecture
• By default, Drillbits hold all roles; modules can optionally be disabled.
• Any Drillbit can act as the endpoint for a particular query.
• Zookeeper maintains ephemeral cluster membership information only.
• A small distributed cache utilizing embedded Hazelcast maintains information about individual queue depth, cached query plans, metadata, locality information, etc.
• The originating Drillbit acts as foreman and manages all execution for its particular query, scheduling based on priority, queue depth and locality information.
• Drillbit data communication is streaming and avoids any serialization/deserialization.
[Figure: Zookeeper and three nodes, each with a Drillbit, a distributed cache and a storage process.]
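As a rough illustration of a foreman scheduling on queue depth and locality, the sketch below assigns data blocks to Drillbits. The policy shown (prefer a Drillbit that holds the block locally, then the shortest queue) is an assumption made for the example, not Drill's actual scheduler, and it leaves query priority out entirely. `DrillbitInfo` and `assign_fragments` are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class DrillbitInfo:
    """Hypothetical view of one Drillbit as seen in the distributed cache."""
    name: str
    queue_depth: int
    local_blocks: Set[str] = field(default_factory=set)


def assign_fragments(blocks: List[str],
                     drillbits: List[DrillbitInfo]) -> Dict[str, str]:
    """Map each data block to the name of the Drillbit that should read it.

    Assumed policy: Drillbits holding the block locally win; ties go to
    the shortest queue, then to the lowest name for determinism. The
    tracked queue depth grows with every assignment so work spreads out.
    """
    depth = {d.name: d.queue_depth for d in drillbits}
    assignment: Dict[str, str] = {}
    for block in blocks:
        best = min(
            drillbits,
            key=lambda d: (block not in d.local_blocks, depth[d.name], d.name),
        )
        assignment[block] = best.name
        depth[best.name] += 1
    return assignment
```

A block with no local copy anywhere simply falls through to the least loaded Drillbit, which is one plausible way to trade locality against queue depth.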
Life of a SQL Query
• Query: human or tool written ANSI compliant query
• Logical Plan: dataflow of what should logically be done
• Physical Plan: how physical and exchange operators should be applied
• Execution Plan: assignment to particular nodes and cores
• Execution: actual query execution
Physical Plan versus Execution Plan
Physical Plan (Optimizer)
• Locations of exchanges
• Types and order of physical operators (including spools)
• Which projection of the raw data to utilize
• Query recovery points
• Estimated memory, CPU, bandwidth and IO required for each operation
Execution Plan (Scheduler)
• Field ordering per fragment
• The level of parallelization of each exchange (remotely and locally)
• The scheduling of each query fragment (including any pauses)
• The memory allocation for each task
• The size of record batches
• What disk locations to use for spooling purposes
• When to start various sub-pieces of the query plan
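The split between the two plans can be sketched as a function that turns optimizer output into scheduler output. Everything here is hypothetical (`PhysicalOp`, `ExecutionFragment`, `schedule`, the round-robin placement and the even memory split are assumptions for illustration). It only shows which decisions belong on which side: the physical plan carries operators and estimates, while the execution plan fixes parallelism, node placement, per-task memory and batch size.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class PhysicalOp:
    """Optimizer output: which operator to run and its estimated memory."""
    name: str
    est_memory_mb: int


@dataclass
class ExecutionFragment:
    """Scheduler output: one concrete task pinned to a node."""
    op: str
    node: str
    memory_mb: int
    batch_size: int


def schedule(ops: List[PhysicalOp],
             nodes: List[str],
             parallelism: Dict[str, int],
             batch_size: int = 4096) -> List[ExecutionFragment]:
    """Expand each physical operator into concrete fragments.

    Assumed policy: each operator is split into `parallelism[op.name]`
    fragments (default 1), placed round-robin across nodes, with the
    estimated memory divided evenly and rounded up.
    """
    fragments: List[ExecutionFragment] = []
    slot = 0
    for op in ops:
        width = max(1, parallelism.get(op.name, 1))
        share = -(-op.est_memory_mb // width)  # ceiling division
        for _ in range(width):
            node = nodes[slot % len(nodes)]
            fragments.append(ExecutionFragment(op.name, node, share, batch_size))
            slot += 1
    return fragments
```

Note that nothing in `PhysicalOp` mentions nodes or batch sizes; in this sketch those choices are deferred until the scheduler knows the cluster it is running on.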
Status Report and Plan
• The last few months:
– Define a logical plan
– Build a reference interpreter
– Basic SQL Parser
• March/April
– Larger SQL syntax
– Physical plan
– In-memory compressed data interfaces
– Distributed execution focused on large cluster high performance sort, aggregation and join
• Goals: Alpha Q2, Beta Q3
Exciting things to watch/leverage
• Parquet and ORC file formats
– Drill will probably adopt one as a primary
• Tez/Stinger: Make Hive more SQL’y, add a new execution engine, faster with ORC.
– Depending on status and code drop, maybe portions of execution engine can be shared
• 0xdata: Distributed Fork-Join framework plus analytics engine
– Potential for code sharing at a lower level to simplify/combine cluster coordination and distributed cache, ultimately support MPI-lite workloads
• Impala: Hive replacement query engine. Backend entirely in C++, flat data, primarily in-memory datasets when blocking operators required
– Inspiration around external integration with Hive metastore, collaboration on use and extension of Parquet
• Shark+Spark: Scala query engine, record at a time, focused on intermediate resultset caching
– Ideas around adaptive caching, cleaner Scala interfaces
• Tajo: Cleaner APIs, still record at a time execution, very object oriented
– API inspiration, front end test cases, expansion to reference interpreter via code sharing
Community
Shout-outs:
• Julian Hyde @ Pentaho
• Timothy Chen @ Microsoft
• Chris Merrick @ RJMetrics
• David Alves @ UT Austin
• Sree Vaadi @ SSS/NGData
More needed:
• Not just code: we need use cases, query planning, code review, design help, UI, etc.
• Pick a JIRA, write your own JIRA, just say hi!
Join In
• Mailing list: firstname.lastname@example.org
• Twitter: @ApacheDrill
• Source: http://github.com/apache/incubator-drill
• Jira: https://issues.apache.org/jira/browse/DRILL
Upcoming Events
• Meetups: Late April, Hadoop Summit
• Hackathon: May
Work on Apache Drill full time:
• MapR is hiring full-time open source Drill developers
• Come chat with us or write to email@example.com