Drill Architecture (2012-09-13)

Transcript

  • 1. Apache Drill Architecture Outlines (Jason Frantz)
  • 2. whoami
    • Jason
    • Software Architect at MapR
    • Background in Databases, File Systems, Distributed Systems
    • jfrantz@maprtech.com, jason@apache.org
  • 3. Data Flow
  • 4. Big Picture
    • Same basic pieces as most databases (sketched below):
      – Driver: manage interaction with client
      – Parser: process textual query language
      – Compiler / Optimizer: convert logical query into physical plan
      – Execution Engine: run physical plan across cluster
      – Storage Handlers: feed user data in/out of execution
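    To make the division of labor concrete, below is a minimal Java sketch of how those five stages could be wired together. Every type name here (Parser, Optimizer, ExecutionEngine, LogicalPlan, PhysicalPlan) is a hypothetical stand-in, not Drill's actual API.

      import java.util.List;

      // Hypothetical stage interfaces; not Drill's real APIs.
      public class PipelineSketch {
          interface LogicalPlan {}    // parser output: logical DAG
          interface PhysicalPlan {}   // optimizer output: physical plan

          interface Parser { LogicalPlan parse(String queryText); }
          interface Optimizer { PhysicalPlan optimize(LogicalPlan plan); }
          interface ExecutionEngine { List<Object[]> run(PhysicalPlan plan); }

          // The driver manages the client interaction by threading the
          // query through parse -> optimize -> execute.
          static List<Object[]> driver(String queryText, Parser p,
                                       Optimizer o, ExecutionEngine e) {
              return e.run(o.optimize(p.parse(queryText)));
          }
      }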
  • 5. Parser
    • Converts text-based query language into internal DAG representation (a toy illustration follows)
      – Grammar, syntax, basic query validation
      – Generally straightforward to implement
    • Initial goal is to support a SQL-like query language for nested data (DrQL)
      – Compatible with Google BigQuery/Dremel
      – Designed to support data sources that have a well-defined schema (e.g. protocol buffers) as well as those that don't (e.g. JSON)
    • Other potential input styles:
      – MongoDB's query language
      – Hive
      – Pig
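    As a toy illustration of the "text in, DAG out" contract, this snippet hand-builds the logical DAG a parser might emit for a DrQL-flavored query. Both the query syntax in the comment and the LogicalNode type are assumptions for illustration only.

      import java.util.List;

      public class ParserOutputSketch {
          // One node of the logical DAG: an operator plus its input nodes.
          record LogicalNode(String operator, List<LogicalNode> inputs) {}

          public static void main(String[] args) {
              // Hypothetical DrQL-style query over nested records:
              //   SELECT user.id, COUNT(events.click) FROM logs GROUP BY user.id
              LogicalNode scan    = new LogicalNode("scan(logs)", List.of());
              LogicalNode groupBy = new LogicalNode(
                      "groupBy(user.id, count(events.click))", List.of(scan));
              LogicalNode project = new LogicalNode(
                      "project(user.id, count)", List.of(groupBy));
              System.out.println(project); // prints the DAG from the root down
          }
      }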
  • 6. Traditional Query Optimizers
    • 30+ years of history in relational query optimization
      – We have to follow the same general path
    • Converts a logical query plan into a physical one (sketched below)
      – Example: convert logical “JOIN” operator into specific hash join operator
      – Attempts to choose the “best” overall execution plan
    • Magic black box of statistics!
      – Optimizers do great with queries that can be easily modeled with available statistics
      – Difficulties: lack of statistics, complex schemas, complex queries
      – Database users often work around the optimizer using query hints
        ● “force index”
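    The slide's JOIN example can be pictured as a single rewrite rule. A minimal sketch, assuming row-count statistics are available; the operator types and the build-side heuristic are illustrative, not Drill's specified behavior.

      public class JoinRuleSketch {
          interface PhysicalOp {}
          record TableScan(String table) implements PhysicalOp {}
          record HashJoin(PhysicalOp build, PhysicalOp probe, String key)
                  implements PhysicalOp {}

          // Logical join annotated with (possibly missing!) statistics.
          record LogicalJoin(String left, String right, String key,
                             long leftRows, long rightRows) {}

          // Rewrite logical JOIN -> hash join, building the hash table on
          // the smaller input; a stand-in for real cost-based decisions.
          static HashJoin toHashJoin(LogicalJoin j) {
              PhysicalOp l = new TableScan(j.left());
              PhysicalOp r = new TableScan(j.right());
              return (j.leftRows() <= j.rightRows())
                      ? new HashJoin(l, r, j.key())
                      : new HashJoin(r, l, j.key());
          }
      }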
  • 7. Intermediate Representation
    • Intermediate Representation (IR) is the common internal API
      – Output from Parser
      – Input/Output from Optimizer
      – Input to Execution Engine
    • Textual representation:
      – Flexibility
        ● Different users can enter at different levels of the IR
        ● Advanced users can skip the optimizer entirely
      – Easier to test various pieces
      – Easy to cache (sketched below)
        ● Query optimization can be computationally expensive, so traditional databases go to great lengths to reuse execution plans
    • Ideally the IR would be the format used between optimization passes
      – Inspiration: LLVM, SQL Server showplan
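    One concrete payoff of a textual IR is that caching becomes trivial: the IR string itself can key the cache, so a repeated query skips the expensive optimization pass. A minimal sketch, assuming a string-to-string optimizer; none of these names come from Drill.

      import java.util.Map;
      import java.util.concurrent.ConcurrentHashMap;
      import java.util.function.Function;

      public class PlanCacheSketch {
          // Cache from logical IR text to optimized (physical) IR text.
          private final Map<String, String> cache = new ConcurrentHashMap<>();

          // Optimize only on a cache miss; identical IR strings reuse the plan.
          public String optimizedIr(String logicalIr,
                                    Function<String, String> optimizer) {
              return cache.computeIfAbsent(logicalIr, optimizer);
          }
      }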
  • 8. Execution Engine
    • Execution layer
      – Query is a DAG of operators
    • Operator layer
      – Implementation of individual operators and data format serialization
  • 9. Execution Layer
    • Query structured as a Directed Acyclic Graph (DAG) representing the data flow
      – Each node is an abstract “operator”
      – Communication between nodes is “blobs” of data
      – Data model described well in Microsoft's Dryad paper (Isard '07)
    • Responsible for handling:
      – Operator dependencies
      – Task scheduling
      – Inter-node communication
    • Notable features:
      – Speculative execution
      – Pipelining with spill-to-disk as fallback
      – Back pressure (sketched below)
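    Back pressure along a DAG edge can be illustrated with a bounded queue: when the downstream operator lags, the producer blocks instead of flooding memory. The sketch below is illustrative only, not Drill's execution code.

      import java.util.concurrent.ArrayBlockingQueue;
      import java.util.concurrent.BlockingQueue;

      public class BackPressureSketch {
          public static void main(String[] args) throws InterruptedException {
              // Each element stands in for a "blob" (a batch of rows) on a DAG edge.
              BlockingQueue<int[]> edge = new ArrayBlockingQueue<>(4); // small buffer

              Thread scan = new Thread(() -> {
                  try {
                      for (int batch = 0; batch < 100; batch++) {
                          edge.put(new int[]{batch}); // blocks when downstream lags
                      }
                  } catch (InterruptedException e) {
                      Thread.currentThread().interrupt();
                  }
              });

              scan.start();
              for (int i = 0; i < 100; i++) {
                  edge.take(); // downstream "operator" consumes one batch
              }
              scan.join();
          }
      }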
  • 10. Operator Execution
    • Implementation of individual operators (a filter sketch follows)
      – Example built-in operators: hash aggregate, filter, json-scan
      – Extensible so new operators are easy to plug in
    • Serialization-aware:
      – Each “blob” is a batch of rows in a particular format:
        ● Row-wise, no schema: MessagePack
        ● Row-wise, schema: Protocol Buffers
        ● Columnar, schema: Dremel-style format
      – Different operator implementations for different serializations
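    Here is what a batch-at-a-time filter operator might look like under the model above. The Batch and Operator shapes are assumptions for illustration; real implementations would be specialized per serialization format (row-wise vs. columnar).

      import java.util.List;
      import java.util.function.Predicate;

      public class FilterOperatorSketch {
          // A "blob": one batch of rows, here simply a list of Object[] rows.
          record Batch(List<Object[]> rows) {}

          interface Operator { Batch next(); } // pull-based operator API (assumption)

          // Filter: passes through only the rows matching the predicate.
          static Operator filter(Operator input, Predicate<Object[]> pred) {
              return () -> {
                  Batch in = input.next();
                  if (in == null) return null; // end of stream
                  return new Batch(in.rows().stream().filter(pred).toList());
              };
          }
      }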
  • 11. Storage Interfaces
    • Scanner operators (a toy scanner follows)
      – Common APIs to convert user data into formats understood by execution operators
      – Example conversions:
        ● JSON → MessagePack
        ● CSV → MessagePack
        ● Dremel-style columnar serialization → Protocol Buffers
    • Data sources:
      – HDFS
      – NFS
      – HBase / Cassandra
      – MySQL / PostgreSQL / etc.
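    A scanner's job is to adapt an external source into the batch format the execution operators consume. The sketch below shows the shape such an API might take over an in-memory "table"; the Scanner and Batch types are hypothetical.

      import java.util.Iterator;
      import java.util.List;

      public class ScannerSketch {
          record Batch(List<Object[]> rows) {}

          // Each storage handler (HDFS, HBase, MySQL, ...) would provide one.
          interface Scanner extends Iterator<Batch> {}

          // Toy scanner over an in-memory table, emitting fixed-size batches.
          static Scanner inMemoryScanner(List<Object[]> table, int batchSize) {
              return new Scanner() {
                  private int pos = 0;
                  public boolean hasNext() { return pos < table.size(); }
                  public Batch next() {
                      int end = Math.min(pos + batchSize, table.size());
                      Batch b = new Batch(table.subList(pos, end));
                      pos = end;
                      return b;
                  }
              };
          }
      }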
  • 12. Storage Interfaces
    • Scanner flexibility:
      – Allow in-place filtering (predicate pushdown; sketched after slide 13)
      – Scanners can manage their own caching policies for their data
    • In-place processing
      – Having a separate “ETL” step is painful
        ● Easiest to process data on demand
      – Query workload gives feedback on scanner access patterns
        ● Database Cracking: adaptively convert storage layout into more efficient forms
  • 13. Design Principles
    • Flexible:
      – Pluggable query languages
      – Extensible execution engine
      – Pluggable data formats
      – Column-based and row-based
      – Schema and schema-less
      – Pluggable data sources
    • Easy:
      – Unzip and run
      – Zero configuration
      – Reverse DNS not needed
      – IP addresses can change
      – Clear and concise log messages
    • Dependable:
      – No SPOF
      – Instant recovery from crashes
    • Fast:
      – C/C++ core with Java support
      – Google C++ style guide
      – Min latency and max throughput (limited only by hardware)
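    Predicate pushdown from slide 12 can be sketched as handing the scanner the filter itself, so non-matching rows are dropped at the source instead of being materialized and filtered later. All names below are hypothetical.

      import java.util.List;
      import java.util.function.Predicate;

      public class PushdownSketch {
          interface Scanner {
              // Filter applied in place, inside the scanner.
              List<Object[]> scan(Predicate<Object[]> pushedFilter);
          }

          // Toy scanner that applies the pushed-down predicate while reading.
          static Scanner csvScanner(List<Object[]> rawRows) {
              return filter -> rawRows.stream().filter(filter).toList();
          }

          public static void main(String[] args) {
              Scanner s = csvScanner(List.of(
                      new Object[]{"alice", 30},
                      new Object[]{"bob", 17}));
              // Only matching rows ever leave the scanner:
              List<Object[]> adults = s.scan(row -> (int) row[1] >= 18);
              System.out.println(adults.size()); // 1
          }
      }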
  • 13. Design PrinciplesFlexible Easy• Pluggable query languages • Unzip and run• Extensible execution engine • Zero configuration• Pluggable data formats • Reverse DNS not needed• Column-based and row-based • IP addresses can change• Schema and schema-less • Clear and concise log• Pluggable data sources messagesDependable Fast• No SPOF • C/C++ core with Java support• Instant recovery from crashes • Google C++ style guide • Min latency and max throughput (limited only by hardware)