Drill architecture 20120913

Transcript

  • 1. Apache Drill Architecture Outlines
    Jason Frantz
  • 2. whoami
    • Jason
    • Software Architect at MapR
    • Background in Databases, File Systems, Distributed Systems
    • jfrantz@maprtech.com, jason@apache.org
  • 3. Data Flow
  • 4. Big Picture
    • Same basic pieces as most databases:
      – Driver: manage interaction with client
      – Parser: process textual query language
      – Compiler / Optimizer: convert logical query into physical plan
      – Execution Engine: run physical plan across cluster
      – Storage Handlers: feed user data in/out of execution
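The pipeline above can be summarized in a minimal Java sketch. All names here (Parser, Optimizer, ExecutionEngine, StorageHandler, Driver, RowBatch) are illustrative placeholders for the stage boundaries the slide describes, not Drill's actual API.

```java
interface Parser {
    LogicalPlan parse(String queryText);            // text -> logical DAG
}

interface Optimizer {
    PhysicalPlan optimize(LogicalPlan logical);     // logical DAG -> physical plan
}

interface ExecutionEngine {
    Iterable<RowBatch> run(PhysicalPlan plan);      // run the plan across the cluster
}

interface StorageHandler {
    Iterable<RowBatch> scan(String source);         // feed user data into execution
}

// Placeholder types so the sketch stands on its own.
class LogicalPlan {}
class PhysicalPlan {}
class RowBatch {}

// The driver manages the interaction with the client and ties the stages together.
class Driver {
    private final Parser parser;
    private final Optimizer optimizer;
    private final ExecutionEngine engine;

    Driver(Parser parser, Optimizer optimizer, ExecutionEngine engine) {
        this.parser = parser;
        this.optimizer = optimizer;
        this.engine = engine;
    }

    Iterable<RowBatch> execute(String queryText) {
        LogicalPlan logical = parser.parse(queryText);        // Parser
        PhysicalPlan physical = optimizer.optimize(logical);  // Compiler / Optimizer
        return engine.run(physical);                          // Execution Engine
    }
}
```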
  • 5. Parser
    • Converts text-based query language into internal DAG representation
      – Grammar, syntax, basic query validation
      – Generally straightforward to implement
    • Initial goal is to support a SQL-like query language for nested data (DrQL)
      – Compatible with Google BigQuery/Dremel
      – Designed to support data sources that have a well-defined schema (e.g. protocol buffers) as well as those that don't (e.g. JSON)
    • Other potential input styles:
      – MongoDB's query language
      – Hive
      – Pig
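As a rough illustration of the parser's output, the sketch below builds the kind of logical DAG a DrQL-like query such as SELECT name FROM users WHERE age > 30 might produce. The node types (Scan, Filter, Project) are assumptions for illustration, not the actual DrQL grammar or plan format.

```java
import java.util.List;

// Abstract logical operator; each node keeps references to its inputs, forming a DAG.
abstract class LogicalOp {
    final List<LogicalOp> inputs;
    LogicalOp(List<LogicalOp> inputs) { this.inputs = inputs; }
}

class Scan extends LogicalOp {
    final String source;
    Scan(String source) { super(List.of()); this.source = source; }
}

class Filter extends LogicalOp {
    final String predicate;
    Filter(LogicalOp input, String predicate) { super(List.of(input)); this.predicate = predicate; }
}

class Project extends LogicalOp {
    final List<String> columns;
    Project(LogicalOp input, List<String> columns) { super(List.of(input)); this.columns = columns; }
}

class ParserSketch {
    public static void main(String[] args) {
        // The parser would emit this DAG for: SELECT name FROM users WHERE age > 30
        LogicalOp plan = new Project(new Filter(new Scan("users"), "age > 30"), List.of("name"));
        System.out.println(plan.inputs.get(0).inputs.get(0)); // walks project -> filter -> scan
    }
}
```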
  • 6. Traditional Query Optimizers
    • 30+ years of history in relational query optimization
      – We have to follow down the same general path
    • Converts a logical query plan into a physical one
      – Example: convert logical "JOIN" operator into a specific hash join operator
      – Attempts to choose the "best" overall execution plan
    • Magic black box of statistics!
      – Optimizers do great with queries that can be easily modeled with available statistics
      – Difficulties: lack of statistics, complex schemas, complex queries
      – Database users often work around the optimizer using query hints ("force index")
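A toy example of the "magic black box of statistics": picking a physical join operator from estimated row counts. The cost formulas and thresholds below are invented for illustration; real optimizers use far richer cost models, and bad estimates are exactly what drives users to hints like "force index".

```java
// Chooses a physical join operator from estimated input cardinalities.
class JoinChooser {
    enum PhysicalJoin { HASH_JOIN, NESTED_LOOP_JOIN }

    static PhysicalJoin choose(long leftRows, long rightRows) {
        // Nested-loop cost grows with the product of the inputs; a hash join pays
        // roughly a linear cost to build and probe the hash table.
        double nestedLoopCost = (double) leftRows * rightRows;
        double hashJoinCost = 3.0 * (leftRows + rightRows);
        return hashJoinCost < nestedLoopCost ? PhysicalJoin.HASH_JOIN
                                             : PhysicalJoin.NESTED_LOOP_JOIN;
    }

    public static void main(String[] args) {
        System.out.println(choose(1_000_000, 50_000)); // HASH_JOIN
        // If statistics are missing and both sides are guessed at one row, the
        // optimizer happily picks a nested loop, which is why hints exist.
        System.out.println(choose(1, 1));              // NESTED_LOOP_JOIN
    }
}
```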
  • 7. Intermediate Representation
    • Intermediate Representation (IR) is the common internal API
      – Output from Parser
      – Input/Output from Optimizer
      – Input to Execution Engine
    • Textual representation:
      – Flexibility
        ● Different users can enter at different levels of the IR
        ● Advanced users can skip the optimizer entirely
      – Easier to test various pieces
      – Easy to cache
        ● Query optimization can be computationally expensive, so traditional databases go to great lengths to reuse execution plans
    • Ideally the IR would be the format used between optimization passes
      – Inspiration: LLVM, SQL Server showplan
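One reason a textual IR is "easy to cache": an optimized plan can be keyed by the IR text of the incoming query, so repeated queries skip the optimization passes entirely. The sketch below assumes a hypothetical PlanCache class and represents both the IR and the optimized plan as plain strings.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.UnaryOperator;

// Caches optimized plans keyed by the textual IR of the query.
class PlanCache {
    private final Map<String, String> optimizedPlans = new ConcurrentHashMap<>();

    // `optimize` stands in for the full chain of optimization passes, each of
    // which could read and emit the same textual IR.
    String getOrOptimize(String irText, UnaryOperator<String> optimize) {
        return optimizedPlans.computeIfAbsent(irText, optimize);
    }

    public static void main(String[] args) {
        PlanCache cache = new PlanCache();
        String ir = "scan(users) | filter(age > 30) | project(name)";
        // The first call runs the optimizer; the second is served from the cache.
        System.out.println(cache.getOrOptimize(ir, text -> text + "  [optimized]"));
        System.out.println(cache.getOrOptimize(ir, text -> text + "  [optimized]"));
    }
}
```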
  • 8. Execution Engine
    • Execution layer
      – Query is a DAG of operators
    • Operator layer
      – Implementation of individual operators and data format serialization
  • 9. Execution Layer
    • Query structured as a Directed Acyclic Graph (DAG) representing the data flow
      – Each node is an abstract "operator"
      – Communication between nodes is "blobs" of data
      – Data model described well in Microsoft's Dryad paper (Isard '07)
    • Responsible for handling:
      – Operator dependencies
      – Task scheduling
      – Inter-node communication
    • Notable features:
      – Speculative execution
      – Pipelining with spill-to-disk as fallback
      – Back pressure
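A minimal sketch of pipelining with back pressure between two operators in the DAG, assuming a bounded in-memory queue as the "blob" channel: the producer blocks when the slower consumer falls behind. Spill-to-disk, speculative execution, and inter-node transport are omitted, and the operator names are illustrative.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

class PipelineSketch {
    public static void main(String[] args) throws InterruptedException {
        // A small bounded queue between two operators: when it fills up, the
        // upstream operator blocks on put(), which is the back pressure.
        BlockingQueue<String> blobs = new ArrayBlockingQueue<>(4);

        Thread scanner = new Thread(() -> {
            try {
                for (int i = 0; i < 20; i++) {
                    blobs.put("batch-" + i);   // blocks while the consumer is behind
                }
                blobs.put("EOF");
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        Thread aggregator = new Thread(() -> {
            try {
                String batch;
                while (!(batch = blobs.take()).equals("EOF")) {
                    Thread.sleep(10);          // simulate a slower downstream operator
                    System.out.println("consumed " + batch);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        scanner.start();
        aggregator.start();
        scanner.join();
        aggregator.join();
    }
}
```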
  • 10. Operator Execution
    • Implementation of individual operators
      – Example built-in operators: hash aggregate, filter, json-scan
      – Extensible so new operators are easy to plug in
    • Serialization-aware:
      – Each "blob" is a batch of rows in a particular format:
        ● Row-wise, no schema: MessagePack
        ● Row-wise, schema: Protocol Buffers
        ● Columnar, schema: Dremel-style format
      – Different operator implementations for different serializations
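A minimal sketch of a row-wise operator processing one "blob" (a batch of rows). Rows are plain Java maps here for readability; an actual engine would use a compact serialization such as MessagePack or a columnar layout, with a separate operator implementation per format, and the FilterOperator name is an assumption.

```java
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// A filter operator: consumes one batch of rows, emits the rows that pass.
class FilterOperator {
    private final Predicate<Map<String, Object>> predicate;

    FilterOperator(Predicate<Map<String, Object>> predicate) {
        this.predicate = predicate;
    }

    List<Map<String, Object>> process(List<Map<String, Object>> batch) {
        return batch.stream().filter(predicate).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Map<String, Object> alice = Map.of("name", "alice", "age", 42);
        Map<String, Object> bob   = Map.of("name", "bob", "age", 17);

        FilterOperator ageFilter = new FilterOperator(row -> (int) row.get("age") > 30);
        System.out.println(ageFilter.process(List.of(alice, bob))); // keeps only alice's row
    }
}
```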
  • 11. Storage Interfaces
    • Scanner operators
      – Common APIs to convert user data into formats understood by execution operators
      – Example conversions:
        ● JSON → MessagePack
        ● CSV → MessagePack
        ● Dremel-style columnar serialization → Protocol Buffers
    • Data sources:
      – HDFS
      – NFS
      – HBase / Cassandra
      – MySQL / PostgreSQL / etc.
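A rough sketch of a scanner operator in this spirit: converting CSV text into the row-batch shape the operators above consume. The CsvScanner class and its in-memory maps are assumptions for illustration; a real scanner would emit a compact serialization and read from sources like HDFS or HBase.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Converts CSV text (header line + data lines) into a batch of rows.
class CsvScanner {
    static List<Map<String, String>> scan(List<String> csvLines) {
        String[] header = csvLines.get(0).split(",");
        List<Map<String, String>> batch = new ArrayList<>();
        for (String line : csvLines.subList(1, csvLines.size())) {
            String[] fields = line.split(",");
            Map<String, String> row = new LinkedHashMap<>();
            for (int i = 0; i < header.length; i++) {
                row.put(header[i], fields[i]);
            }
            batch.add(row);
        }
        return batch;
    }

    public static void main(String[] args) {
        // Prints: [{name=alice, age=42}, {name=bob, age=17}]
        System.out.println(scan(List.of("name,age", "alice,42", "bob,17")));
    }
}
```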
  • 12. Storage Interfaces
    • Scanner flexibility:
      – Allow in-place filtering (predicate pushdown)
      – Scanners can manage their own caching policies for their data
    • In-place processing
      – Having a separate "ETL" step is painful
        ● Easiest to process data on demand
      – Query workload gives feedback on scanner access patterns
        ● Database cracking: adaptively convert storage layout into more efficient forms
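To make predicate pushdown concrete, the sketch below hands the filter predicate to the scanner so non-matching rows are dropped at the data source instead of downstream in the execution DAG. The PushdownScanner name and the string-typed rows are illustrative only.

```java
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// The scanner applies the pushed-down predicate while reading, so rows that
// cannot match never leave the storage layer.
class PushdownScanner {
    static List<String> scan(List<String> rawRows, Predicate<String> pushedDownPredicate) {
        return rawRows.stream()
                      .filter(pushedDownPredicate)   // filtering happens at the source
                      .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> rows = List.of("alice,42", "bob,17", "carol,35");
        // Only rows with age > 30 are materialized for the operators above the scan.
        System.out.println(scan(rows, r -> Integer.parseInt(r.split(",")[1]) > 30));
    }
}
```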
  • 13. Design Principles
    Flexible
    • Pluggable query languages
    • Extensible execution engine
    • Pluggable data formats
    • Column-based and row-based
    • Schema and schema-less
    • Pluggable data sources

    Easy
    • Unzip and run
    • Zero configuration
    • Reverse DNS not needed
    • IP addresses can change
    • Clear and concise log messages

    Dependable
    • No SPOF
    • Instant recovery from crashes

    Fast
    • C/C++ core with Java support
    • Google C++ style guide
    • Min latency and max throughput (limited only by hardware)
