April 16 Hangout firstname.lastname@example.org MapR Technologies
Key Design Goals (reminder)
• Keep data off-heap and metadata on-heap: heap < 0.5 GB
• Support reasonable JNI interplay as desired
• Specified, compatible wire-level formats
• Pipelined vectorized columnar execution
• Nested data and late schema
• Full SQL
Julian Hyde’s work on SQL parser
• GitHub push soon
• Support for basic scan, project and filter.
  – Includes sub-queries, scalar function pass-through, nested references and the any data type
• Next up: Group By, Union, Join
Topics
• Configuration
• In-memory Formats
• Schema Management
• RPC Framework
• Specific RPC protocols
• Cluster Coordination and cache
Configuration
• Leverage HOCON for modular configuration
  – JSON++ for configurations: allows composite configuration definitions and looser syntax.
• Hierarchical precedence
  – Common module loads drill-default.conf as the top-level configuration.
  – All other classpath drill-module.conf files are loaded to integrate additional classes.
  – drill-override.conf provides user-level properties.
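The layering above can be sketched in plain Java. This is only an illustration of the precedence idea and does not use the HOCON library itself; the class name `LayeredConfigSketch`, the flat key/value model, and the assumed order (defaults, then modules, then override winning last) are assumptions made for the example, not details from the slide.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: each configuration file is modeled as a flat key/value map.
final class LayeredConfigSketch {

    /**
     * Merges the three layers in order. Later layers win on key conflicts, so
     * module settings replace defaults and override settings replace both.
     */
    static Map<String, String> resolve(Map<String, String> defaults,
                                       List<Map<String, String>> modules,
                                       Map<String, String> override) {
        Map<String, String> merged = new LinkedHashMap<>(defaults); // drill-default.conf
        for (Map<String, String> module : modules) {                // each drill-module.conf
            merged.putAll(module);
        }
        merged.putAll(override);                                    // drill-override.conf
        return merged;
    }
}
```

A real HOCON setup would also merge nested objects and resolve substitutions, which this flat map deliberately ignores.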
Schema Management and RecordBatch
• RecordBatch is the relational operator unit of work
• Targets ~256k in size, designed to fit in a single core's L2 cache
• Internally manages a set of fields
  – Focused on fields required for completion of the query. Inference provides some type information.
  – Untouched or asterisk fields may be stored in secondary compound inline fields, depending on the RecordReader implementation
• Each next() call moves forward the set of records
  – Each movement forward reports whether a new schema was found. If so, the consumer should reconfigure based on the updated schema.
  – Schema can be expanded: from one type to any type. Ultimately may be able to contract as well (e.g. nullable to not-nullable).
  – A change in an incoming schema does not necessarily modify an outgoing schema.
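The next() contract above can be sketched as a small consumer loop. Everything named here (`RecordBatchSketch`, `NextOutcome`, `Batch`, `ListBatchSource`, `CountingConsumer`) is hypothetical and exists only for illustration; a list of field names stands in for a real schema.

```java
import java.util.List;

// Hypothetical sketch of a consumer reacting to schema changes reported by next().
final class RecordBatchSketch {

    // Assumed outcome codes; the slide only says next() reports whether a new schema was found.
    enum NextOutcome { NONE, OK, OK_NEW_SCHEMA }

    // A batch here is just a schema (field names) plus a record count.
    static final class Batch {
        final List<String> schema;
        final int recordCount;

        Batch(List<String> schema, int recordCount) {
            this.schema = schema;
            this.recordCount = recordCount;
        }
    }

    // In-memory source that walks a fixed list of batches.
    static final class ListBatchSource {
        private final List<Batch> batches;
        private int position = -1;

        ListBatchSource(List<Batch> batches) {
            this.batches = batches;
        }

        /** Advances one batch and reports whether its schema differs from the previous one. */
        NextOutcome next() {
            if (position + 1 >= batches.size()) {
                return NextOutcome.NONE;
            }
            List<String> previous = position < 0 ? null : batches.get(position).schema;
            position++;
            boolean sameSchema = batches.get(position).schema.equals(previous);
            return sameSchema ? NextOutcome.OK : NextOutcome.OK_NEW_SCHEMA;
        }

        Batch current() {
            return batches.get(position);
        }
    }

    // Consumer that counts records and how often it had to reconfigure.
    static final class CountingConsumer {
        int reconfigurations;
        long records;

        void drain(ListBatchSource source) {
            NextOutcome outcome;
            while ((outcome = source.next()) != NextOutcome.NONE) {
                if (outcome == NextOutcome.OK_NEW_SCHEMA) {
                    reconfigurations++; // a real operator would rebuild its field bindings here
                }
                records += source.current().recordCount;
            }
        }
    }
}
```

In this sketch the first batch always counts as a new schema, since the consumer has nothing configured before it arrives.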
In-memory Formats
• Values are managed in one of three ValueModes: ValueVector, RLE or Dict
• More concrete than some research work such as C-Store, but also allows for simpler implementation with most of the benefits.
• Physical plan describes the ValueMode of the particular fields (a field-level physical property).
• Depending on the particular requirements of a query and operator capabilities, data can be maintained in a compressed value-based structure.
  – Decision occurs at the physical plan level prior to scans (requires format foreknowledge)
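To illustrate why an operator might prefer to keep data in a compressed value-based structure, here is a minimal run-length encoding sketch. The class `RleSketch` and its methods are hypothetical and are not the in-memory layout described on the slide; the point is only that an aggregate such as a sum can run directly on the runs without expanding them.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: run-length encode an int column and aggregate on the runs.
final class RleSketch {

    // One run: a value and how many consecutive times it occurs.
    static final class Run {
        final int value;
        final int length;

        Run(int value, int length) {
            this.value = value;
            this.length = length;
        }
    }

    /** Collapses consecutive equal values into runs. */
    static List<Run> encode(int[] values) {
        List<Run> runs = new ArrayList<>();
        int i = 0;
        while (i < values.length) {
            int start = i;
            while (i < values.length && values[i] == values[start]) {
                i++;
            }
            runs.add(new Run(values[start], i - start));
        }
        return runs;
    }

    /** Sums the column without decompressing: one multiply per run. */
    static long sum(List<Run> runs) {
        long total = 0;
        for (Run run : runs) {
            total += (long) run.value * run.length;
        }
        return total;
    }
}
```

For the column `{7, 7, 7, 3, 3, 9}` this produces three runs, and the sum touches three entries rather than six.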
ValueVector
• Primary common representation is ValueVector, a vectorized (array) uncompressed structure.
• Off-heap native buffers, manually reference counted and fronted by Netty4’s ByteBuf abstraction.
• Supports zero-copy transfer semantics when moving between operators.
• Zero data serialization/deserialization allows direct write to and from sockets, along with batch-level metadata.
• Ultimately generate a JNI operator stub so that individual operators or groups of operators can live outside the core system.
• Designed to leverage shared mmap between StorageEngine record readers and Drillbits to minimize overhead and reduce the need for storage-engine-level pushdown.
• Data type variations include required, nullable and repeated.
• First few implementations are done, such as SInt32, variable-length bytes, and nullables.
• Repeated will support cross-field references for record-level and repeated-node boundaries.
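A rough sketch of a nullable fixed-width vector with manual reference counting follows. It uses the JDK's direct `ByteBuffer` rather than Netty's ByteBuf, and the class name `NullableIntVectorSketch`, the bitmap layout (one bit per slot, 1 meaning "set"), and the little-endian choice are all assumptions made for the example.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical sketch: off-heap nullable int32 vector with a manual reference count.
final class NullableIntVectorSketch {
    private final ByteBuffer values; // 4 bytes per slot, off-heap
    private final ByteBuffer setBits; // 1 bit per slot, off-heap; 1 = value present
    private final int capacity;
    private int refCount = 1;         // the creator holds the first reference

    NullableIntVectorSketch(int capacity) {
        this.capacity = capacity;
        // Direct buffers start zero-filled, so every slot begins as null.
        this.values = ByteBuffer.allocateDirect(capacity * 4).order(ByteOrder.LITTLE_ENDIAN);
        this.setBits = ByteBuffer.allocateDirect((capacity + 7) / 8);
    }

    void set(int index, int value) {
        check(index);
        values.putInt(index * 4, value);
        int byteIndex = index >>> 3;
        setBits.put(byteIndex, (byte) (setBits.get(byteIndex) | (1 << (index & 7))));
    }

    boolean isNull(int index) {
        check(index);
        return (setBits.get(index >>> 3) & (1 << (index & 7))) == 0;
    }

    int get(int index) {
        if (isNull(index)) {
            throw new IllegalStateException("slot " + index + " is null");
        }
        return values.getInt(index * 4);
    }

    /** Takes an additional reference, for example when handing the vector to another operator. */
    NullableIntVectorSketch retain() {
        ensureLive();
        refCount++;
        return this;
    }

    /** Drops one reference; returns true when the last reference is gone. */
    boolean release() {
        ensureLive();
        refCount--;
        return refCount == 0;
    }

    private void check(int index) {
        ensureLive();
        if (index < 0 || index >= capacity) {
            throw new IndexOutOfBoundsException("index " + index);
        }
    }

    private void ensureLive() {
        if (refCount <= 0) {
            throw new IllegalStateException("vector already released");
        }
    }
}
```

The JDK gives no way to free a direct buffer explicitly, so `release()` here only tracks ownership; with a pooled allocator the final release would return the memory.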
RPC Framework
• Zero-copy byte buffer transfers wrapped in a protobuf envelope.
• Fully symmetric push+pull based protocol
• Top-level envelope utilizes standard protobuf envelope encoding so that any language can interact: CompleteRpcMessage
  – Composed of three parts: RpcHeader, ProtobufBody, RawBody. RawBody is optional (bytes).
  – For Java, we manually encode/decode the top-level envelope so that we can keep RawBody off-heap
• Fully asynchronous using futures
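The three-part envelope can be sketched with simple length-prefixed framing. This is not protobuf encoding and not the actual wire format; the class `EnvelopeSketch`, the 4-byte length prefixes, and the part order are assumptions chosen to show how a decoder can hand out views of the raw body without copying it.

```java
import java.nio.ByteBuffer;

// Hypothetical sketch: frame = [len][header][len][body][len][raw], lengths as 4-byte ints.
final class EnvelopeSketch {

    /** Encodes the three parts into one frame; a null raw body is written as length 0. */
    static ByteBuffer encode(byte[] header, byte[] body, byte[] raw) {
        int rawLength = raw == null ? 0 : raw.length;
        ByteBuffer out = ByteBuffer.allocate(12 + header.length + body.length + rawLength);
        out.putInt(header.length).put(header);
        out.putInt(body.length).put(body);
        out.putInt(rawLength);
        if (rawLength > 0) {
            out.put(raw);
        }
        out.flip();
        return out;
    }

    /**
     * Splits a frame into header, body and raw body. Each part is a slice that
     * shares memory with the frame, so no payload bytes are copied.
     */
    static ByteBuffer[] decode(ByteBuffer frame) {
        ByteBuffer in = frame.duplicate(); // independent position, same bytes
        ByteBuffer[] parts = new ByteBuffer[3];
        for (int i = 0; i < parts.length; i++) {
            int length = in.getInt();
            ByteBuffer part = in.slice();
            part.limit(length);
            parts[i] = part;
            in.position(in.position() + length);
        }
        return parts;
    }
}
```

Because the slices share memory with the frame, a write to the frame's raw region is visible through the decoded raw part, which is the property that makes keeping RawBody off-heap worthwhile.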
RPC Protocols: Two Key Types
User to Bit
• UserClient and UserServer
• Supports RunQuery > Handle, RequestResults > QueryResult, CancelQuery > Ack
• Query Results mode can operate in: STREAM_FULL, STREAM_FIRST, QUERY_FOR_STATUS
Bit to Bit
• Each Drillbit can interact with all other Drillbits
• Locations are managed via a cluster cache
• Either Drillbit can act as server or client (bi-directional)
• Managed via BitCom, which maintains server sessions and client connections as necessary
• Supports activity such as fragment announcement, send record batch, node progress, cancel fragment.
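The User to Bit request/response pairs, combined with the futures-based style of the RPC layer, might look roughly like the in-memory stand-in below. The class `UserProtocolSketch`, its method signatures, and the `QueryState` enum are hypothetical; only the message names and the three result modes come from the slide, and no network is involved.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical in-memory stand-in for the server side of the User to Bit protocol.
final class UserProtocolSketch {

    // Result modes as named on the slide.
    enum ResultMode { STREAM_FULL, STREAM_FIRST, QUERY_FOR_STATUS }

    // Assumed query states, for illustration only.
    enum QueryState { RUNNING, CANCELLED }

    private final AtomicLong nextHandle = new AtomicLong(1);
    private final Map<Long, QueryState> queries = new ConcurrentHashMap<>();

    /** RunQuery > Handle: registers the query and answers with its handle. */
    CompletableFuture<Long> runQuery(String plan, ResultMode mode) {
        long handle = nextHandle.getAndIncrement();
        queries.put(handle, QueryState.RUNNING);
        return CompletableFuture.completedFuture(handle);
    }

    /** CancelQuery > Ack: acknowledges true only if a running query was cancelled. */
    CompletableFuture<Boolean> cancelQuery(long handle) {
        boolean cancelled = queries.replace(handle, QueryState.RUNNING, QueryState.CANCELLED);
        return CompletableFuture.completedFuture(cancelled);
    }

    QueryState state(long handle) {
        return queries.get(handle);
    }
}
```

A caller would chain on the returned futures (for example `runQuery(...).thenCompose(...)`) instead of blocking, which is the main reason to expose futures at this layer.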
Cluster Coordination and Cache
• Cluster coordination is done through the ClusterCoordinator abstraction
  – Manages node-level service registration, currently singular across both RPC types
  – Leverages Netflix’s Curator framework
  – Manages a cache of available Drillbits and associated capabilities per node
  – Used by clients and Drillbits
• DistributedCache implemented through embedded Hazelcast
  – Sets up a distributed topic for queue depth management
  – Will be used for query plan caching, other shared state
  – Expected to be used only by Drillbits, not clients
Other Discussion
• Timothy: Overview of Supersonic exploration
• David: Ideas around HBase and other work
Where we need help
• Addition of Values operator to Reference Interpreter for SQL parser
• Modify reference interpreter to avoid modification of existing records
• First-level code reviews
• Physical plan definition and documentation
• More test cases
• TPC-H logical and physical plans
• Simple identity transformer/optimizer (logical > physical)
• Execution fragment format
• First full-execution-level storage operator, potentially using mmap shared memory
• Foreman implementation for query processing management
• Review and evaluation of newer file formats and interaction with in-memory formats
• First POP implementations
• Lots of scalar function vector implementations