Big Picture
• Same basic pieces as most databases:
  – Driver: manage interaction with client
  – Parser: process textual query language
  – Compiler / Optimizer: convert logical query into physical plan
  – Execution Engine: run physical plan across cluster
  – Storage Handlers: feed user data in/out of execution
Parser
• Converts text-based query language into internal DAG representation
  – Grammar, syntax, basic query validation
  – Generally straightforward to implement
• Initial goal is to support a SQL-like query language for nested data (DrQL)
  – Compatible with Google BigQuery/Dremel
  – Designed to support data sources that have a well-defined schema (e.g. protocol buffers) as well as those that don't (e.g. JSON)
• Other potential input styles:
  – MongoDB's query language
  – Hive
  – Pig
Traditional Query Optimizers
• 30+ year history of relational query optimization
  – We have to follow the same general path
• Converts a logical query plan into a physical one
  – Example: convert logical “JOIN” operator into specific hash join operator
  – Attempts to choose the “best” overall execution plan
• Magic black box of statistics!
  – Optimizers do great with queries that can be easily modeled with available statistics
  – Difficulties: lack of statistics, complex schemas, complex queries
  – Database users often work around the optimizer using query hints
    ● “force index”
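As a rough illustration of the logical-to-physical step above, the sketch below turns a logical join into a hash join and uses row-count statistics to pick the build side. All names here (`LogicalJoin`, `PhysicalJoin`, `choose_physical_join`, `force_build_side`) are hypothetical and are not taken from Drill; the "smaller input becomes the build side" rule is a common heuristic assumed for the example, and `force_build_side` stands in for a user hint that overrides the optimizer.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class LogicalJoin:
    """Logical operator: says what to join, not how (hypothetical type)."""
    left: str
    right: str
    key: str


@dataclass(frozen=True)
class PhysicalJoin:
    """Physical operator: a concrete algorithm plus its input roles."""
    algorithm: str
    build_side: str
    probe_side: str
    key: str


def choose_physical_join(
    join: LogicalJoin,
    row_counts: Dict[str, int],
    force_build_side: Optional[str] = None,
) -> PhysicalJoin:
    """Map a logical join to a hash join, choosing the build side."""
    sides = (join.left, join.right)

    # A user hint wins over statistics, like "force index" in SQL databases.
    if force_build_side in sides:
        build = force_build_side
    else:
        left_rows = row_counts.get(join.left)
        right_rows = row_counts.get(join.right)
        if left_rows is None or right_rows is None:
            # No statistics: fall back to the order the query was written in.
            build = join.right
        else:
            # Assumed heuristic: build the hash table on the smaller input.
            build = join.left if left_rows <= right_rows else join.right

    probe = join.right if build == join.left else join.left
    return PhysicalJoin("hash_join", build, probe, join.key)


logical = LogicalJoin(left="orders", right="users", key="user_id")
plan = choose_physical_join(logical, {"orders": 1_000_000, "users": 500})
print(plan)
```

With missing statistics the function still returns a plan, just not an informed one, which is the failure mode the slide describes.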
Intermediate Representation
• Intermediate Representation (IR) is the common internal API
  – Output from Parser
  – Input/Output from Optimizer
  – Input to Execution Engine
• Textual Representation:
  – Flexibility
    ● Different users can enter at different levels of the IR
    ● Advanced users can skip the optimizer entirely
  – Easier to test various pieces
  – Easy to cache
    ● Query optimization can be computationally expensive, so traditional databases go to great lengths to reuse execution plans
• Ideally the IR would be the format used between optimization passes
  – Inspiration: LLVM, SQL Server showplan
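To make the caching and "enter at any level" points concrete, here is a minimal sketch that treats the IR as text and caches optimized plans keyed by the logical plan text. The JSON encoding, the `op`/`inputs` field names, and `toy_optimize` are assumptions made for illustration only; the slides do not specify the actual IR syntax.

```python
import json


def plan_to_text(plan):
    """Serialize a plan tree to canonical text (JSON is an assumed encoding)."""
    return json.dumps(plan, sort_keys=True)


def plan_from_text(text):
    """Parse plan text back into a tree of dicts."""
    return json.loads(text)


def toy_optimize(node):
    """Stand-in optimizer pass: rewrite every logical join to a hash join."""
    out = dict(node)
    if out.get("op") == "join":
        out["op"] = "hash_join"
    out["inputs"] = [toy_optimize(child) for child in node.get("inputs", [])]
    return out


class PlanCache:
    """Reuse optimized plans for logical plans that were seen before."""

    def __init__(self, optimize):
        self._optimize = optimize
        self._cache = {}
        self.misses = 0

    def get(self, logical_text):
        if logical_text not in self._cache:
            self.misses += 1
            physical = self._optimize(plan_from_text(logical_text))
            self._cache[logical_text] = plan_to_text(physical)
        return self._cache[logical_text]


logical_text = plan_to_text({
    "op": "join",
    "inputs": [
        {"op": "scan", "source": "a", "inputs": []},
        {"op": "scan", "source": "b", "inputs": []},
    ],
})
cache = PlanCache(toy_optimize)
physical_text = cache.get(logical_text)  # optimized once
cache.get(logical_text)                  # served from the cache
print(physical_text)
```

Because both sides of the optimizer are plain text, an advanced user could hand `physical_text` straight to an execution engine and skip the optimizer.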
Execution Engine
• Execution layer
  – Query is a DAG of operators
• Operator layer
  – Implementation of individual operators and data format serialization
Execution Layer
• Query structured as a Directed Acyclic Graph (DAG) representing the data flow
  – Each node is an abstract “operator”
  – Communication between nodes is “blobs” of data
  – Data model described well in Microsoft's Dryad paper (Isard 07)
• Responsible for handling:
  – Operator dependencies
  – Task scheduling
  – Inter-node communication
• Notable features:
  – Speculative execution
  – Pipelining with spill-to-disk as fallback
  – Back pressure
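The dependency-handling part of the execution layer can be sketched as running operators in topological order and handing each one the blobs produced by its upstream nodes. This is a single-process toy, assuming Python 3.9+ for the standard-library `graphlib` module; `run_dag` and the operator names are hypothetical, and the sketch deliberately omits scheduling across machines, speculative execution, pipelining, and back pressure.

```python
from graphlib import TopologicalSorter


def run_dag(operators, dependencies):
    """Run each operator after all of its upstream operators.

    operators:    name -> callable taking a list of input blobs
    dependencies: name -> list of upstream operator names
    """
    results = {}
    # static_order() yields every node after all of its predecessors
    # and raises graphlib.CycleError if the graph is not acyclic.
    for name in TopologicalSorter(dependencies).static_order():
        inputs = [results[dep] for dep in dependencies.get(name, [])]
        results[name] = operators[name](inputs)
    return results


operators = {
    "scan": lambda inputs: [1, 2, 3, 4],
    "filter": lambda inputs: [x for x in inputs[0] if x % 2 == 0],
    "sum": lambda inputs: sum(inputs[0]),
}
dependencies = {"filter": ["scan"], "sum": ["filter"]}

results = run_dag(operators, dependencies)
print(results["sum"])  # prints 6
```

Each value in `results` plays the role of a "blob" passed between nodes; a real engine would stream batches rather than materialize whole outputs.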
Operator Execution
• Implementation of individual operators
  – Example built-in operators: hash aggregate, filter, json-scan
  – Extensible so new operators are easy to plug in
• Serialization-aware:
  – Each “blob” is a batch of rows in a particular format:
    ● Row-wise, no schema: MessagePack
    ● Row-wise, schema: Protocol Buffers
    ● Columnar, schema: Dremel-style format
  – Different operator implementations for different serializations
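The idea of one logical operator with several serialization-specific implementations might look like the sketch below: a filter written once for row-wise batches and once for columnar batches, selected by a format tag on the blob. Plain Python lists and dicts stand in for MessagePack and the Dremel-style columnar format, and the registry and function names are hypothetical.

```python
def filter_rows(batch, column, predicate):
    """Row-wise batch: a list of dicts, one dict per row."""
    return [row for row in batch if predicate(row[column])]


def filter_columnar(batch, column, predicate):
    """Columnar batch: a dict mapping column name to a list of values."""
    # Evaluate the predicate on one column only, then gather matching positions.
    keep = [i for i, value in enumerate(batch[column]) if predicate(value)]
    return {name: [values[i] for i in keep] for name, values in batch.items()}


# Registry of implementations, keyed by the blob's format tag (assumed tags).
FILTER_IMPLEMENTATIONS = {
    "row": filter_rows,
    "columnar": filter_columnar,
}


def filter_blob(blob_format, batch, column, predicate):
    """Dispatch to the filter implementation that matches the blob format."""
    return FILTER_IMPLEMENTATIONS[blob_format](batch, column, predicate)


row_batch = [{"id": 1, "age": 17}, {"id": 2, "age": 42}]
col_batch = {"id": [1, 2], "age": [17, 42]}
is_adult = lambda age: age >= 18

print(filter_blob("row", row_batch, "age", is_adult))
print(filter_blob("columnar", col_batch, "age", is_adult))
```

The columnar version touches only the filtered column until it knows which positions survive, which is the main reason to keep a separate implementation per format.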
Storage Interfaces
• Scanner operators
  – Common APIs to convert user data into formats understood by execution operators
  – Example conversions:
    ● JSON → MessagePack
    ● CSV → MessagePack
    ● Dremel: columnar serialization → Protocol Buffers
• Data sources:
  – HDFS
  – NFS
  – HBase / Cassandra
  – MySQL / PostgreSQL / etc.
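A scanner's job, converting user data into one format the operators understand, can be sketched with two small readers that both emit the same row shape. Plain Python dicts stand in for MessagePack rows here so the example needs only the standard library; the function names and the newline-delimited JSON input are assumptions.

```python
import csv
import io
import json


def scan_json_lines(text):
    """Yield one dict per non-empty line of newline-delimited JSON."""
    for line in text.splitlines():
        if line.strip():
            yield json.loads(line)


def scan_csv(text):
    """Yield one dict per CSV record, keyed by the header row.

    CSV carries no type information, so every value comes out as a string.
    """
    for record in csv.DictReader(io.StringIO(text)):
        yield dict(record)


json_rows = list(scan_json_lines('{"id": 1, "name": "a"}\n{"id": 2, "name": "b"}\n'))
csv_rows = list(scan_csv("id,name\n1,a\n2,b\n"))
print(json_rows)
print(csv_rows)
```

Downstream operators see the same row-of-dicts interface regardless of source, although the JSON rows keep numeric types while the CSV rows do not.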
Storage Interfaces
• Scanner Flexibility:
  – Allow in-place filtering (predicate pushdown)
  – Scanners can manage their own caching policies for their data
• In-place processing
  – Having a separate “ETL” step is painful
    ● Easiest to process data on demand
  – Query workload gives feedback on scanner access patterns
    ● Database Cracking: adaptively convert storage layout into more efficient forms
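Predicate pushdown can be illustrated by letting the scanner accept the filter and apply it while reading, so non-matching rows never reach the rest of the plan. The `CsvScanner` class and its `rows_emitted` counter are hypothetical and exist only to make the effect visible.

```python
import csv
import io


class CsvScanner:
    """Toy scanner that can apply a pushed-down predicate while reading."""

    def __init__(self, text):
        self._text = text
        self.rows_emitted = 0  # rows handed to downstream operators

    def scan(self, predicate=None):
        """Yield rows as dicts; skip rows that fail the predicate, if given."""
        for record in csv.DictReader(io.StringIO(self._text)):
            if predicate is None or predicate(record):
                self.rows_emitted += 1
                yield dict(record)


data = "id,country\n1,US\n2,DE\n3,US\n"

# Without pushdown: the scanner emits everything and a later operator filters.
plain = CsvScanner(data)
later_filtered = [row for row in plain.scan() if row["country"] == "US"]

# With pushdown: the scanner filters in place and emits only matching rows.
pushed = CsvScanner(data)
pushed_rows = list(pushed.scan(lambda row: row["country"] == "US"))

print(plain.rows_emitted, pushed.rows_emitted)  # prints 3 2
```

Both paths produce the same result; the difference is how much data crosses the scanner boundary, which matters most when the source is remote.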
Design Principles
Flexible
• Pluggable query languages
• Extensible execution engine
• Pluggable data formats
• Column-based and row-based
• Schema and schema-less
• Pluggable data sources
Easy
• Unzip and run
• Zero configuration
• Reverse DNS not needed
• IP addresses can change
• Clear and concise log messages
Dependable
• No SPOF
• Instant recovery from crashes
Fast
• C/C++ core with Java support
• Google C++ style guide
• Min latency and max throughput (limited only by hardware)