Drill architecture 20120913: Presentation Transcript

    • Apache Drill Architecture Outlines
      Jason Frantz
    • whoami
      • Jason
      • Software Architect at MapR
      • Background in databases, file systems, and distributed systems
      • jfrantz@maprtech.com, jason@apache.org
    • Data Flow
    • Big Picture
      • Same basic pieces as most databases (pipeline sketched below):
        – Driver: manages interaction with the client
        – Parser: processes the textual query language
        – Compiler / Optimizer: converts the logical query into a physical plan
        – Execution Engine: runs the physical plan across the cluster
        – Storage Handlers: feed user data in/out of execution
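To make the pipeline concrete, here is a minimal Java sketch of how these stages could hand off to one another. All names (QueryDriver, LogicalPlan, and so on) are illustrative assumptions, not Drill's actual API.

```java
// Hypothetical sketch of the query pipeline described above.
// None of these names come from Drill itself; they only show
// how the stages hand work to each other.

interface Parser {
    LogicalPlan parse(String queryText);          // text -> logical DAG
}

interface Optimizer {
    PhysicalPlan optimize(LogicalPlan logical);   // logical -> physical plan
}

interface ExecutionEngine {
    Iterable<Object[]> run(PhysicalPlan plan);    // physical plan -> rows
}

// Placeholder plan types; a real system would model operator DAGs here.
record LogicalPlan(String description) {}
record PhysicalPlan(String description) {}

// The driver ties the stages together on behalf of the client.
final class QueryDriver {
    private final Parser parser;
    private final Optimizer optimizer;
    private final ExecutionEngine engine;

    QueryDriver(Parser parser, Optimizer optimizer, ExecutionEngine engine) {
        this.parser = parser;
        this.optimizer = optimizer;
        this.engine = engine;
    }

    Iterable<Object[]> execute(String queryText) {
        LogicalPlan logical = parser.parse(queryText);
        PhysicalPlan physical = optimizer.optimize(logical);
        return engine.run(physical);  // storage handlers feed data in/out
    }
}
```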
    • Parser
      • Converts a text-based query language into an internal DAG representation (see the sketch below)
        – Grammar, syntax, basic query validation
        – Generally straightforward to implement
      • Initial goal is to support a SQL-like query language for nested data (DrQL)
        – Compatible with Google BigQuery/Dremel
        – Designed to support data sources that have a well-defined schema (e.g. protocol buffers) as well as those that don't (e.g. JSON)
      • Other potential input styles:
        – MongoDB's query language
        – Hive
        – Pig
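As a rough illustration of the parser's output, the sketch below builds the kind of logical DAG a DrQL-like query might parse into. The node types and the sample query are invented for illustration.

```java
import java.util.List;

// Hypothetical logical-plan nodes for a parsed query; each node's
// children are its inputs, so the structure forms a DAG.
sealed interface LogicalNode permits Scan, Filter, Project {}

record Scan(String source) implements LogicalNode {}
record Filter(String predicate, LogicalNode input) implements LogicalNode {}
record Project(List<String> columns, LogicalNode input) implements LogicalNode {}

class ParserOutputExample {
    public static void main(String[] args) {
        // Roughly what "SELECT name FROM logs WHERE level = 'ERROR'"
        // could parse into (query and plan are illustrative only):
        LogicalNode plan =
            new Project(List.of("name"),
                new Filter("level = 'ERROR'",
                    new Scan("logs")));
        System.out.println(plan);
    }
}
```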
    • Traditional Query Optimizers
      • 30+ years of history in relational query optimization
        – We have to follow the same general path
      • Converts a logical query plan into a physical one
        – Example: convert a logical “JOIN” operator into a specific hash join operator (sketched below)
        – Attempts to choose the “best” overall execution plan
      • Magic black box of statistics!
        – Optimizers do great with queries that can be easily modeled with available statistics
        – Difficulties: lack of statistics, complex schemas, complex queries
        – Database users often work around the optimizer using query hints
          ● “force index”
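A toy version of that "magic black box": choose a physical join operator from (possibly missing) statistics. The decision rule, thresholds, and names are assumptions for illustration, not an actual optimizer.

```java
// Toy cost-based decision: choose a physical join operator from
// (possibly missing) table statistics. Purely illustrative.
class JoinChooser {
    enum PhysicalJoin { HASH_JOIN, SORT_MERGE_JOIN, NESTED_LOOP_JOIN }

    /**
     * @param buildRows  estimated row count of the smaller input,
     *                   or -1 if no statistics are available
     * @param memoryRows how many rows fit in the in-memory hash table
     */
    static PhysicalJoin choose(long buildRows, long memoryRows) {
        if (buildRows < 0) {
            // Missing statistics: exactly the situation where optimizers
            // struggle and users reach for hints like "force index".
            return PhysicalJoin.NESTED_LOOP_JOIN;
        }
        return buildRows <= memoryRows
            ? PhysicalJoin.HASH_JOIN        // build side fits in memory
            : PhysicalJoin.SORT_MERGE_JOIN; // spill-friendly fallback
    }

    public static void main(String[] args) {
        System.out.println(choose(10_000, 1_000_000));  // HASH_JOIN
        System.out.println(choose(-1, 1_000_000));      // NESTED_LOOP_JOIN
    }
}
```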
    • Intermediate Representation
      • Intermediate Representation (IR) is the common internal API
        – Output from the Parser
        – Input/output for the Optimizer
        – Input to the Execution Engine
      • Textual representation:
        – Flexibility
          ● Different users can enter at different levels of the IR
          ● Advanced users can skip the optimizer entirely
        – Easier to test various pieces
        – Easy to cache (sketched below)
          ● Query optimization can be computationally expensive, so traditional databases go to great lengths to reuse execution plans
      • Ideally the IR would be the format used between optimization passes
        – Inspiration: LLVM, SQL Server showplan
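The caching point is easy to sketch: with a textual IR, the IR text itself doubles as a cache key, so an expensive optimization pass can be reused. The IR syntax and class names below are invented.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of caching optimized plans keyed by their textual IR.
// The IR syntax here is invented; it only mimics the idea of a
// showplan/LLVM-style textual form that every stage can read.
class PlanCache {
    record PhysicalPlan(String ir) {}

    private final Map<String, PhysicalPlan> cache = new ConcurrentHashMap<>();

    PhysicalPlan planFor(String textualIr) {
        // Optimization is expensive, so reuse the plan when the
        // same IR text shows up again.
        return cache.computeIfAbsent(textualIr, PlanCache::optimize);
    }

    private static PhysicalPlan optimize(String ir) {
        return new PhysicalPlan(ir + "  // optimized");
    }

    public static void main(String[] args) {
        String ir = """
            scan(source=logs)
            filter(predicate=level='ERROR')
            project(columns=[name])
            """;
        PlanCache cache = new PlanCache();
        PhysicalPlan first = cache.planFor(ir);
        PhysicalPlan again = cache.planFor(ir);   // cache hit, no re-optimization
        System.out.println(first == again);       // true
    }
}
```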
    • Execution Engine
      • Execution layer
        – Query is a DAG of operators
      • Operator layer
        – Implementation of individual operators and data format serialization
    • Execution Layer
      • Query structured as a Directed Acyclic Graph (DAG) representing the data flow
        – Each node is an abstract “operator”
        – Communication between nodes is “blobs” of data
        – Data model described well in Microsoft's Dryad paper (Isard '07)
      • Responsible for handling:
        – Operator dependencies
        – Task scheduling
        – Inter-node communication
      • Notable features:
        – Speculative execution
        – Pipelining, with spill-to-disk as a fallback
        – Back pressure (sketched below)
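Back pressure can be sketched with a bounded queue between two pipelined operators: when the consumer falls behind, the producer blocks. This illustrates the mechanism only; it is not Drill's implementation.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Back pressure between two pipelined operators, sketched with a
// bounded queue: when the buffer is full, put() blocks the producer.
class BackPressureExample {
    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<int[]> channel = new ArrayBlockingQueue<>(4); // small buffer

        Thread producer = new Thread(() -> {
            try {
                for (int batch = 0; batch < 16; batch++) {
                    channel.put(new int[]{batch});  // blocks when buffer is full
                    System.out.println("produced batch " + batch);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        Thread consumer = new Thread(() -> {
            try {
                for (int i = 0; i < 16; i++) {
                    int[] batch = channel.take();
                    Thread.sleep(10);               // deliberately slow consumer
                    System.out.println("consumed batch " + batch[0]);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        producer.start();
        consumer.start();
        producer.join();
        consumer.join();
    }
}
```

A real engine would pipeline this way per operator edge and spill to disk rather than block indefinitely, per the slide's fallback note.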
    • Operator Execution
      • Implementation of individual operators
        – Example built-in operators: hash aggregate, filter, json-scan
        – Extensible, so new operators are easy to plug in
      • Serialization-aware:
        – Each “blob” is a batch of rows in a particular format:
          ● Row-wise, no schema: MessagePack
          ● Row-wise, schema: Protocol Buffers
          ● Columnar, schema: Dremel-style format
        – Different operator implementations for different serializations (sketched below)
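One way to read "serialization-aware": each batch advertises its format, and the engine picks a matching operator implementation. The enum, interface, and selection logic below are illustrative assumptions.

```java
// Sketch of serialization-aware operator selection: each data "blob"
// carries its batch format, and operator implementations declare which
// format they consume. All names are illustrative.
class OperatorFormats {
    enum BatchFormat { ROW_MSGPACK, ROW_PROTOBUF, COLUMNAR_DREMEL }

    interface FilterOperator {
        BatchFormat format();
        Object filter(Object batch);   // placeholder for real batch types
    }

    // Row-wise filter over schema-less MessagePack-style rows.
    static final FilterOperator ROW_FILTER = new FilterOperator() {
        public BatchFormat format() { return BatchFormat.ROW_MSGPACK; }
        public Object filter(Object batch) { return batch; }
    };

    // Columnar filter that could skip whole column chunks.
    static final FilterOperator COLUMNAR_FILTER = new FilterOperator() {
        public BatchFormat format() { return BatchFormat.COLUMNAR_DREMEL; }
        public Object filter(Object batch) { return batch; }
    };

    static FilterOperator forFormat(BatchFormat format) {
        return switch (format) {
            case ROW_MSGPACK, ROW_PROTOBUF -> ROW_FILTER;
            case COLUMNAR_DREMEL -> COLUMNAR_FILTER;
        };
    }
}
```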
    • Storage Interfaces
      • Scanner operators (see the sketch below)
        – Common APIs to convert user data into formats understood by the execution operators
        – Example conversions:
          ● JSON → MessagePack
          ● CSV → MessagePack
          ● Dremel-style columnar serialization → Protocol Buffers
      • Data sources:
        – HDFS
        – NFS
        – HBase / Cassandra
        – MySQL / PostgreSQL / etc.
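A scanner's job reduces to "read the source format, emit batches the operators understand." Below is a minimal sketch of a CSV scanner emitting generic in-memory row batches; encoding those rows as MessagePack, as the slide suggests, is left out.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

// Minimal scanner sketch: read CSV and emit row batches in a generic
// in-memory form. A real scanner would then serialize each batch
// (e.g. as MessagePack rows) for the execution operators.
class CsvScanner {
    static List<Object[]> scan(BufferedReader input, int batchSize) throws IOException {
        List<Object[]> batch = new ArrayList<>(batchSize);
        String line;
        while (batch.size() < batchSize && (line = input.readLine()) != null) {
            batch.add(line.split(","));    // naive CSV: no quoting support
        }
        return batch;                      // empty list signals end of data
    }

    public static void main(String[] args) throws IOException {
        BufferedReader csv = new BufferedReader(
            new StringReader("alice,30\nbob,25\n"));
        for (Object[] row : scan(csv, 1024)) {
            System.out.println(row[0] + " / " + row[1]);
        }
    }
}
```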
    • Storage Interfaces
      • Scanner flexibility:
        – Allow in-place filtering (predicate pushdown; sketched below)
        – Scanners can manage their own caching policies for their data
      • In-place processing
        – Having a separate “ETL” step is painful
          ● Easiest to process data on demand
        – Query workload gives feedback on scanner access patterns
          ● Database Cracking: adaptively convert the storage layout into more efficient forms
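Predicate pushdown can be sketched by handing the scanner a filter to apply while it reads, so non-matching rows never leave the scan. Names and types below are illustrative.

```java
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Stream;

// Predicate pushdown sketch: the filter runs inside the scan, so rows
// that fail it are dropped before they reach downstream operators.
class PushdownScanner {
    record Row(String level, String message) {}

    static List<Row> scan(Stream<Row> source, Predicate<Row> pushedDown) {
        return source.filter(pushedDown).toList();   // filter at the source
    }

    public static void main(String[] args) {
        Stream<Row> source = Stream.of(
            new Row("INFO", "started"),
            new Row("ERROR", "disk full"),
            new Row("INFO", "heartbeat"));

        // "WHERE level = 'ERROR'" pushed into the scanner:
        List<Row> rows = scan(source, r -> r.level().equals("ERROR"));
        System.out.println(rows);   // [Row[level=ERROR, message=disk full]]
    }
}
```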
    • Design Principles
      • Flexible
        – Pluggable query languages
        – Extensible execution engine
        – Pluggable data formats
        – Column-based and row-based
        – Schema and schema-less
        – Pluggable data sources
      • Easy
        – Unzip and run
        – Zero configuration
        – Reverse DNS not needed
        – IP addresses can change
        – Clear and concise log messages
      • Dependable
        – No SPOF
        – Instant recovery from crashes
      • Fast
        – C/C++ core with Java support
        – Google C++ style guide
        – Min latency and max throughput (limited only by hardware)