OpenDremel's Metaxa Architecture

This is an old document on the OpenDremel architecture.

Published in: Technology, News & Politics

  1. Metaxa Architecture. June 22nd. By Camuel, OpenDremel
  2. Meet Metaxa
     • Implements Dremel using LAPHROAIG as the execution engine and as the storage backend.
     • No distribution: METAXA is a single jar file executed in a single JVM; it produces and executes a single-threaded MAP job.
     • All input data resides inside a single LAPHROAIG object.
     • Output is one of the following:
        – A new LAPHROAIG object.
        – Streamed back.
     • Convert-type commands convert a single LAPHROAIG object from popular object serialization formats to the nested columnar Dremel format, or vice versa.
     • Query-type commands process LAPHROAIG objects in the nested columnar Dremel format and can store the result in another object, or convert it to a popular object serialization format and stream it back to the user.
     • A LAPHROAIG object is a container of other objects, either “serialized objects” or “columnar encoded objects”. The two types of objects are not to be confused.
     • Just five use cases:
        – Convert “serialized objects” into “columnar encoded objects”.
        – Convert “columnar encoded objects” into “serialized objects”.
        – Query “columnar encoded objects” with BQL, producing “serialized objects” and streaming them back to the caller.
        – Query “columnar encoded objects” with BQL, producing “serialized objects” and saving them as a new LAPHROAIG “container” object.
        – Query “columnar encoded objects” with BQL, producing “columnar encoded objects” and saving them as a new LAPHROAIG “container” object.
  3. Use case #1: Convert serialized objects into columnar-encoded objects
     [Diagram: Convert command, hierarchical schema and serialized objects (Protobuf, Avro, Thrift) → Metaxa.jar / LAPHROAIG → columnar-encoded objects (Tablet)]
  4. Use case #2: Convert columnar-encoded objects into serialized objects
     [Diagram: Convert command and columnar-encoded objects (Tablet) → Metaxa.jar / LAPHROAIG → hierarchical schema and serialized objects (Protobuf, Avro, Thrift)]
  5. Use case #3: Query “columnar encoded objects” with BQL, producing “serialized objects” and streaming them back to the caller
     [Diagram: BQL query and columnar-encoded objects (Tablet) → Metaxa.jar / LAPHROAIG → hierarchical schema and serialized objects (Protobuf, Avro, Thrift)]
  6. Use case #4: Query “columnar encoded objects” with BQL, producing “serialized objects” and saving them
     [Diagram: BQL query and columnar-encoded objects (Tablet) → Metaxa.jar / LAPHROAIG → hierarchical schema and serialized objects (Protobuf, Avro, Thrift)]
  7. Use case #5: Query “columnar encoded objects” with BQL, producing “columnar encoded objects” and saving them
     [Diagram: BQL query and columnar-encoded objects (Tablet) → Metaxa.jar / LAPHROAIG → columnar-encoded objects (Tablet)]
  8. SerObjs – Serialized Objects
     • Data produced by serializing objects with Protobuf, Avro or Thrift.
     • Hierarchical data.
     • Flat data such as CSV.
     • RDBMS-originated data.
     • Data from KV stores and document stores.
     • Logs.
     • The schema may be embedded or provided separately.
  9. Tablet – Columnar-encoded objects
     • Immutable chunk of data.
     • Logically composed of Slices and can be turned into a Slice series.
     • Columnar and Dremel-encoded.
     • Consists of a header (called the Tablet Schema) and multiple {byte, word, dword or qword}-streams.
     • The Tablet Schema describes:
        – Tablet columns (multi-dimensional arrays), including metadata, compression and encoding metadata, and references to associated dictionaries, rep & def levels, etc.
        – The original SerObjs schema and its mapping to tablet columns.
        – Future: additional SerObjs schemas and mappings.
     • Tablet data is a set of multidimensional arrays of 8-, 16-, 32- or 64-bit elements, denoted byte (b), word (w), double word (dw) and quad word (qw). Each array represents a column and can be accessed independently without incurring access costs for neighboring arrays. Every element is a bit field whose bits carry different pieces of information, for example (multiple) column values, counts (RLE), and rep and def levels.
     • The tablet scanner can mask some of the details of column encoding and provide a higher-level interface to the tablet, automatically decoding RLE, dictionaries and rep & def levels. However, the tablet binary format is a stable interface between Metaxa modules and between different versions of the OpenDremel system.
     • Tablets are horizontal partitions of a larger columnar dataset.
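
The bit-field idea on this slide can be illustrated with a minimal Java sketch. The slides do not give the actual bit assignment, so the layout below (2 bits of repetition level, 2 bits of definition level, a 12-bit RLE count and a 16-bit value inside one dword) and the class name `DwordElement` are assumptions made purely for illustration.

```java
// Hypothetical layout of one 32-bit (dword) tablet element. The real Metaxa
// bit assignment is not given in the slides; these field widths are assumed.
final class DwordElement {
    // bits 0-1: repetition level, bits 2-3: definition level,
    // bits 4-15: RLE run count, bits 16-31: column value (or dictionary id)
    static final int REP_BITS = 2;
    static final int DEF_BITS = 2;
    static final int COUNT_BITS = 12;
    static final int VALUE_BITS = 16;
    static final int DEF_SHIFT = REP_BITS;
    static final int COUNT_SHIFT = DEF_SHIFT + DEF_BITS;
    static final int VALUE_SHIFT = COUNT_SHIFT + COUNT_BITS;

    private DwordElement() {}

    // Packs the four fields into a single int, truncating each to its width.
    static int pack(int value, int count, int def, int rep) {
        return ((value & mask(VALUE_BITS)) << VALUE_SHIFT)
             | ((count & mask(COUNT_BITS)) << COUNT_SHIFT)
             | ((def & mask(DEF_BITS)) << DEF_SHIFT)
             | (rep & mask(REP_BITS));
    }

    static int value(int element) { return (element >>> VALUE_SHIFT) & mask(VALUE_BITS); }
    static int count(int element) { return (element >>> COUNT_SHIFT) & mask(COUNT_BITS); }
    static int def(int element)   { return (element >>> DEF_SHIFT) & mask(DEF_BITS); }
    static int rep(int element)   { return element & mask(REP_BITS); }

    private static int mask(int bits) { return (1 << bits) - 1; }
}
```

A scanner of the kind the slide mentions would hide these shifts and masks and hand plain integers to its caller.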
  10. Slice – Columnar-encoded object fraction
     • A Slice is a vector (ordered list of scalars) in which each scalar corresponds to the current value of a different tablet column that is being scanned / iterated.
     • A Tablet can be broken down into an ordered list of slices and reassembled from a series of slices.
     • A Slice in Metaxa contains plain integer values (not bit fields) of type b, w, dw and qw.
     • A Slice may contain fewer values than there are columns in the tablet. In this case the columns represented in the slice are called “projected columns”.
     • A Slice also contains an additional integer field called Level. This Level is also aliased as FetchLevel or SelectLevel, depending on whether the Tablet is being sliced into a series of slices or being reconstructed from a series of slices.
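
As a rough illustration of the Slice abstraction, here is a minimal Java sketch that slices a flat (non-nested) set of columns into slices over projected columns and rebuilds the columns from them. The class names `Slice` and `FlatTabletSlicer`, the widening of all element types to `long`, and the use of level 0 for every row (each row of a flat dataset starts a new record) are assumptions, not the Metaxa implementation.

```java
import java.util.ArrayList;
import java.util.List;

// One slice: the current value of each projected column plus a level.
final class Slice {
    final int[] projectedColumns; // indices of the tablet columns present in this slice
    final long[] values;          // plain integers (b, w, dw, qw widened to long)
    final int level;              // FetchLevel when slicing, SelectLevel when rebuilding

    Slice(int[] projectedColumns, long[] values, int level) {
        this.projectedColumns = projectedColumns;
        this.values = values;
        this.level = level;
    }
}

final class FlatTabletSlicer {
    private FlatTabletSlicer() {}

    // Synchronized iteration over the projected columns, one slice per row.
    static List<Slice> slice(long[][] columns, int[] projected) {
        List<Slice> slices = new ArrayList<>();
        int rows = columns.length == 0 ? 0 : columns[0].length;
        for (int row = 0; row < rows; row++) {
            long[] values = new long[projected.length];
            for (int i = 0; i < projected.length; i++) {
                values[i] = columns[projected[i]][row];
            }
            slices.add(new Slice(projected, values, 0)); // flat data: every row is level 0
        }
        return slices;
    }

    // Reassembles columns (only the projected ones) from a series of slices.
    static long[][] rebuild(List<Slice> slices, int width) {
        long[][] columns = new long[width][slices.size()];
        for (int row = 0; row < slices.size(); row++) {
            for (int i = 0; i < width; i++) {
                columns[i][row] = slices.get(row).values[i];
            }
        }
        return columns;
    }
}
```

Nested data would additionally need the level field to vary from slice to slice, which this flat sketch does not attempt.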
  11. Query Plan (QP)
     • A QP is a descriptor of a source tablet, a result tablet, a set of scalar transformations and a DAG of their dataflow interconnections.
     • Scalar transformations are of one of the following types:
        – Plain transformations => also called expressions; many inputs but one output.
        – Predicates => boolean expressions which, when evaluating to false, cancel the issuance of the result slice.
        – Aggregates => Count, Sum and Distinct functions; they aggregate slices and then, when the last slice in an aggregation group is detected, issue multiple result slices.
     • QP input and output is always a slice. Because of predicates, it is possible that no output slice is issued for some input slices. Because of aggregates, it is also possible that multiple output slices are issued for one input slice.
     • Input slices contain FetchLevel and output slices contain SelectLevel (according to appendix D in the paper).
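
The three kinds of scalar transformation can be sketched in Java as follows. This is a simplified illustration under stated assumptions: slices are reduced to bare `long[]` rows, the class `MiniQueryPlan` and its method names are hypothetical, and the aggregate emits a single result slice per group rather than several.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.LongBinaryOperator;
import java.util.function.Predicate;

final class MiniQueryPlan {
    private MiniQueryPlan() {}

    // Predicate plus plain transformation: a slice failing the predicate
    // produces no output slice; otherwise one expression value is issued.
    static List<long[]> filterAndProject(List<long[]> input,
                                         Predicate<long[]> where,
                                         LongBinaryOperator expr) {
        List<long[]> output = new ArrayList<>();
        for (long[] slice : input) {
            if (!where.test(slice)) {
                continue; // predicate is false: issuance of the result slice is cancelled
            }
            output.add(new long[] { expr.applyAsLong(slice[0], slice[1]) });
        }
        return output;
    }

    // Aggregate: sums column 1 per value of column 0. Input is assumed to be
    // sorted by column 0, so the end of a group is detected by looking ahead.
    static List<long[]> sumByFirstColumn(List<long[]> input) {
        List<long[]> output = new ArrayList<>();
        long sum = 0;
        for (int i = 0; i < input.size(); i++) {
            long[] slice = input.get(i);
            sum += slice[1];
            boolean lastInGroup = i + 1 == input.size() || input.get(i + 1)[0] != slice[0];
            if (lastInGroup) {
                output.add(new long[] { slice[0], sum }); // result issued only at group end
                sum = 0;
            }
        }
        return output;
    }
}
```

The sketch shows why the number of output slices can differ from the number of input slices in both directions: the predicate drops slices, and the aggregate holds results back until a group closes.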
  12. Conceptual View of Tablet
     [Diagram: Record [0] through Record [5] laid out across levels (dimensions) 0, 1 and 2]
  13. Conceptual View of Tablet Slicing
     [Diagram: Record [0] broken into slices [0][0][0], [0][0][1], [0][0][2], [0][1][0], [0][1][1] and [0][2][2] across levels (dimensions) 0, 1 and 2]
  14. Conceptual View of QP
     [Diagram: Record [0] and Record [1] with slices [0][0][0] and [0][1][1] across levels (dimensions) 0, 1 and 2, feeding Expr (rep=0), Expr (rep=1) and Expr (rep=2)]
  15. Compiler: Translates BQL into Query Plan
     Requirements:
        – Must parse and compile valid BQL as defined by BigQuery.
        – Must not accept invalid BQL, and must supply user-friendly messages.
        – Must produce an executable QP object with the following features:
           • It is Serializable => no circular references, no references to “system” objects such as file handles; a pure object model.
           • getProcessSliceSource => returns the text of the QP in Java source-code form.
           • getSourceTablets => returns the tablets to run the QP on.
           • setResultTablet => sets the result tablet.
           • setExecutionStatusCode => indicates the status of QP execution.
           • log => allows logging important events during QP execution.
           • getDiagram => returns a graphic image of the QP diagram (for debugging).
        – Must provide basic command-line argument functionality as well as simple shell functionality.
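
The QP object requirements above could be expressed as a Java interface along the following lines. Only the method names come from the slide; every parameter and return type (tablets identified by `String` names, the diagram as a `byte[]` image, an `int` status code), the class `SimpleQueryPlan` and the `roundTrip` helper are assumptions for the sake of a runnable sketch.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;

// Method names follow the slide; all signatures are hypothetical.
interface QueryPlan extends Serializable {
    String getProcessSliceSource();
    List<String> getSourceTablets();
    void setResultTablet(String tabletName);
    void setExecutionStatusCode(int code);
    void log(String event);
    byte[] getDiagram();
}

// Pure object model: only strings, ints and lists, no file handles or cycles.
final class SimpleQueryPlan implements QueryPlan {
    private static final long serialVersionUID = 1L;
    private final String source;
    private final ArrayList<String> sourceTablets;
    private final ArrayList<String> events = new ArrayList<>();
    private String resultTablet;
    private int statusCode;

    SimpleQueryPlan(String source, List<String> sourceTablets) {
        this.source = source;
        this.sourceTablets = new ArrayList<>(sourceTablets);
    }

    public String getProcessSliceSource() { return source; }
    public List<String> getSourceTablets() { return sourceTablets; }
    public void setResultTablet(String tabletName) { resultTablet = tabletName; }
    public void setExecutionStatusCode(int code) { statusCode = code; }
    public void log(String event) { events.add(event); }
    public byte[] getDiagram() { return new byte[0]; } // diagram rendering omitted

    String getResultTablet() { return resultTablet; }
    int getExecutionStatusCode() { return statusCode; }
    List<String> getEvents() { return events; }

    // Serializes and restores a plan in memory to show it survives the trip.
    static QueryPlan roundTrip(QueryPlan plan) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(plan);
            }
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return (QueryPlan) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("query plan is not serializable", e);
        }
    }
}
```

Standard Java serialization is used here only because it needs no extra libraries; the slides mention a SerObjLib framework for the semantic model, which may well be what the project intended for the QP too.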
  16. Compiler: Vocabulary
     • Token – lexeme
     • Parse tree – token tree
     • AST – Abstract Syntax Tree
     • SM – Semantic Model
     • ASM – Annotated Semantic Model
     • QP – Query Plan
     • DAG – Directed Acyclic Graph
     • Schema – metadata about a dataset
  17. Compiler: Prerequisite Materials
     – http://code.google.com/apis/bigquery/docs/query-reference.html
     – http://www.antlr.org/
     – http://en.wikipedia.org/wiki/Parsing
     – http://en.wikipedia.org/wiki/Query_plan
     – http://en.wikipedia.org/wiki/Compiler_construction
     – http://www.amazon.com/Terence-Parr/e/B001JS3O0U
  18. Compiler: High-Level Design (verbose)
     [Diagram labels, in approximate pipeline order: command-line arguments / shell; Shell; input BQL; ANTLR Parser; AST; Semantic Parser; SM (Semantic Model); Semantic Analyzer (validation, resolving references, optimization); Schema Generator (result schema inference); Annotated Semantic Model (Java object model implemented via Java collections); QP Generator; Query Plan (includes result tablet metadata). Side inputs and outputs: SerObjs schema; result SerObjs schema; Metadata (file locations and statistics); optimization rules; validation rules; C / asm template]
  19. Compiler: [Annotated] Semantic Model
     • Comprehensively describes the query in every detail.
     • Java objects (packed into collections, without spaghetti cyclic references).
     • Must be serializable to a file with the SerObjLib framework, and restorable.
     • Must be printable in a form comprehensible to a human.
     • Must be renderable on request as a clear graphic diagram with a legend.
  20. Compiler: QP Scalar Transformation Functions (Expr)
     • A set of primitive predefined scalar operations and functions applied to xfunc arguments in a particular prescribed order.
     • Expressed in valid C or assembly with some restrictions.
     • Purely functional => side-effect free, meaning no static/global variables and no memory allocations. However, for performance and brevity they are inlined into a single processSlice function.
     • Some functions have a context object in which they can store their externalized state between calls. One regular array and one associative array are provided as context for these functions.
        – Context-free transformation functions
           • One value in, one value out: a+b
        – Scalar context transformation functions
           • Many values in, many values out: sum(a) within links
        – Map context transformation functions
           • Many values in, many values out (out of sync): sum(a) group by date
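
The three context flavours can be contrasted with a small Java sketch. The slide targets C or assembly; Java is used here only for consistency with the rest of the deck. The names `XfuncContext` and `Xfuncs`, the array size, and the use of a `HashMap` as the associative array are assumptions.

```java
import java.util.HashMap;
import java.util.Map;

// Externalized state handed to transformation functions between calls:
// one regular array and one associative array, as described on the slide.
final class XfuncContext {
    final long[] array = new long[4];               // fixed-size slots, e.g. accumulators
    final Map<Long, Long> assoc = new HashMap<>();  // keyed accumulators, e.g. per date
}

final class Xfuncs {
    private Xfuncs() {}

    // Context-free: the result depends only on the arguments (a + b).
    static long add(long a, long b) {
        return a + b;
    }

    // Scalar context: a running sum kept in one slot of the regular array.
    static long runningSum(XfuncContext ctx, int slot, long a) {
        ctx.array[slot] += a;
        return ctx.array[slot];
    }

    // Map context: one running sum per group key (e.g. sum(a) group by date).
    static long groupedSum(XfuncContext ctx, long key, long a) {
        return ctx.assoc.merge(key, a, Long::sum);
    }
}
```

Keeping all state in the context object is what lets the functions themselves stay free of static or global variables, as the slide requires.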
  21. Compiler: QP in C Form
     • Generated ProcessSlice(..){..} function.
        – Input: inSlice
        – Output: outSlice
        – Context object for state externalization
     • inSlice contains scalar values for every source function, and also fetchLevel.
     • outSlice must have correct scalar values for every result function, and also the correct selectLevel.
        – outSlice is guaranteed to preserve its content between calls, so it can be used as a cache for result functions that haven’t changed, and also as a cache for selectLevel if it has not changed.
        – outSlice values can also be read (they contain the results of the previous outSlice).
        – On the first call, all values in outSlice are guaranteed to be zeros.
  22. Compiler: QP template (according to appendix D)
      void processSlice(inSlice, outSlice, Context) {
          Evaluate where clause…; if it evaluates to false then do:
              outSlice.setSkip;
              outSlice.selectLevel = min(outSlice.selectLevel, inSlice.fetchLevel);
              return;
          If where clause evaluates to true then…
          switch (inSlice.fetchLevel) {
              case 0: Evaluate expressions (xfuncs) with repetition level = 0
              ……..
              ……..
              case n: Evaluate expressions (xfuncs) with repetition level = n
              If it is the last slice in aggregation group then:
                  // the line below will cause additional calls to ProcessSlice
                  outSlice.setAdditionalSliceCount(number of slices in aggregation)
          }
      }
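
A concrete instance of this template might look like the Java sketch below, written for a hypothetical query `SELECT id, price * 2 WHERE price > 10` in which `id` has repetition level 0 and `price * 2` has repetition level 1. Several points are assumptions rather than statements from the slide: the classes `InSlice`, `OutSlice` and `GeneratedPlan`; the switch cases falling through so that every expression with a repetition level at or above fetchLevel is re-evaluated while shallower ones keep their cached outSlice values; and setting selectLevel to fetchLevel when the where clause holds. The aggregation branch is left out.

```java
final class InSlice {
    final long[] values;   // values[0] = id, values[1] = price (hypothetical columns)
    final int fetchLevel;

    InSlice(long[] values, int fetchLevel) {
        this.values = values;
        this.fetchLevel = fetchLevel;
    }
}

final class OutSlice {
    final long[] values = new long[2]; // all zeros on the first call
    int selectLevel;
    boolean skip;
}

final class GeneratedPlan {
    private GeneratedPlan() {}

    static void processSlice(InSlice in, OutSlice out) {
        long id = in.values[0];
        long price = in.values[1];

        // WHERE price > 10: on false, skip this slice and lower selectLevel.
        if (!(price > 10)) {
            out.skip = true;
            out.selectLevel = Math.min(out.selectLevel, in.fetchLevel);
            return;
        }

        out.skip = false;
        switch (in.fetchLevel) {
            case 0:
                out.values[0] = id;        // expression with repetition level 0
                // intentional fall-through (assumed): deeper levels are refreshed too
            case 1:
                out.values[1] = price * 2; // expression with repetition level 1
                break;
            default:
                break;
        }
        out.selectLevel = in.fetchLevel;   // assumed, based on a reading of appendix D
    }
}
```

Because outSlice keeps its content between calls, a slice arriving with fetchLevel 1 leaves `out.values[0]` untouched, which is the caching behaviour described for outSlice on the previous slide.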
  23. Tableton: Columnar Abstraction
     • Tableton is a set of sequentially accessed multidimensional scalar arrays.
     • A Tablet is a serialized, Dremel-encoded columnar dataset with a fixed size. Each array in a tablet can be independently serially accessed without incurring the cost of buffering neighboring arrays.
     • Four types of arrays: bytes, words (16b), dwords (32b), qwords (64b).
     • The following operations are defined:
        – Parsing Tablet Schema => reading and parsing the tablet header/metadata, also called the tablet schema, and providing an object model for it.
        – Reading => converting a Tablet to SerObjs using an FSM for better performance, as described in the Dremel paper (calling callback functions to let them construct SerObjs in various formats).
        – Slicing => synchronized multi-array scalar iteration of a Tablet.
        – Building Tablet Schema => creating the tablet header/metadata, also called the tablet schema, with a convenient builder API. Also called the TabletSchema Editor.
        – Construction => re-creating a Tablet from slices; this interface is also used for dissecting SerObjs into a tablet.
        – Compaction => the constructed Tablet is compressed and a hash key is generated for it; from that point on it becomes immutable.
  24. Tableton: What about other datatypes?
     • They are mapped into yet another dimension of a scalar array.
     • It is strongly recommended not to use Java strings. They are impossible to work with without incurring the full cost of object lifecycle management.
     • It is OK not to support them at all at first, and then gradually add support for them.
     • All the Java String class goodies will be impossible to support in Metaxa anyway, because of performance.
     • The same goes for BLOBs, images and any other complex data type. All are mapped to yet another dimension of a scalar array.
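
One way to read “yet another dimension of a scalar array” is that a string column becomes an array of byte arrays, with comparisons done on the raw bytes instead of on `String` objects. The Java sketch below shows that reading; the class `ByteStringColumn`, the choice of UTF-8, and the nested `byte[][]` representation are assumptions, since the slide does not say how the extra dimension is laid out.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class ByteStringColumn {
    private ByteStringColumn() {}

    // Encodes each string once, at load time, into its UTF-8 bytes.
    // The second array dimension indexes the bytes within one value.
    static byte[][] encode(String[] values) {
        byte[][] column = new byte[values.length][];
        for (int i = 0; i < values.length; i++) {
            column[i] = values[i].getBytes(StandardCharsets.UTF_8);
        }
        return column;
    }

    // Scans the column comparing raw bytes; no String objects are created.
    static int countEqual(byte[][] column, byte[] needle) {
        int matches = 0;
        for (byte[] value : column) {
            if (Arrays.equals(value, needle)) {
                matches++;
            }
        }
        return matches;
    }
}
```

The same shape would apply to BLOBs or images: the scan works on scalar bytes and never needs the richer Java object.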
  25. Tableton: Hierarchical vs. Columnar
     • Different abstractions / domains / contexts.
     • Different schemas.
     • Most confusion stems from not differentiating!
     • Always keep the context in mind when you are developing…
     • Don’t think about both at the same time unless you are willing to develop schizophrenia.
     • Columnar is not an implementation artifact of hierarchical. Columnar is a whole new model of its own.
     • We must adopt two different vocabularies for these domains. Confusion is notoriously common here.
  26. Tableton: Hierarchical vs. Columnar
     Hierarchical                      | Columnar
     A SerObjs in our lingo            | A Tablet in our lingo
     Protobuf, Avro, Thrift files      | Dremel-generated tablets
     Serialized objects                | Multi-dimensional arrays
     The only user-level abstraction   | User never knows what it is
     BQL queries written against it    | Query plans executed against it
     More frontend-related             | More backend-related
     More logical / external format    | More physical / internal format
     Hierarchical is queried           | Columns are scanned
     SerObjLib component               | Tableton component
  27. Tableton: Hierarchical Example
  28. Executor: Executes QP against tablets
     • Requirements
        – Must convert the QP into executable bytecode and execute it (not interpret it).
        – Must work with the QP as an object model, but initially compiling and running the QP in Java form will suffice.
        – Must not mask data and task parallelism.
           • Data parallelism at the tablet level and also at the column level within a tablet.
           • Task parallelism across separate QP transformation functions.
        – Must be ultra-high performance.
           • Latency overhead within a few milliseconds (assuming data in RAM).
           • Throughput of multiple GB/sec.
  29. Executor: Vocabulary
     • QP – Query Plan
     • DAG – Directed Acyclic Graph
     • Slot – like a thread (todo)
     • Expression – operator tree on scalar arguments and scalar constants
     • CF – Context Free (stateless scalar expression)
     • FC – Fixed-size Context (scalar expression with an accumulator)
     • VC – Variable-size Context (scalar expression with a growing list of accumulators)
  30. Executor: Code generation
     • [todo] Janino!
     • [todo] Explain dynamic Java code generation and compilation.
     • [todo] Use code templates! No classes or functions, just a code listing with labels and jumps. The generated code is different every time, and no one is going to study it. Put the static portions in a library and pre-compile it regularly. The dynamic portion is just a code snippet.
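
The template idea can be sketched without any compiler library: the static part of the generated function lives in fixed strings, and only the per-query expressions are spliced in. The class `ProcessSliceTemplate`, the `long[]` signature of the generated function and the expression strings are hypothetical; compiling the resulting text (for instance with Janino, as the slide suggests) is deliberately left out.

```java
import java.util.List;

final class ProcessSliceTemplate {
    // Static portion of the template: identical for every query.
    private static final String HEADER =
        "static void processSlice(long[] in, long[] out) {\n";
    private static final String FOOTER = "}\n";

    private ProcessSliceTemplate() {}

    // Dynamic portion: one assignment per result expression of the query.
    static String generate(List<String> expressions) {
        StringBuilder source = new StringBuilder(HEADER);
        for (int i = 0; i < expressions.size(); i++) {
            source.append("    out[").append(i).append("] = ")
                  .append(expressions.get(i)).append(";\n");
        }
        return source.append(FOOTER).toString();
    }
}
```

For example, `ProcessSliceTemplate.generate(java.util.Arrays.asList("in[0] + in[1]", "in[0] * 2"))` yields the source text of a two-expression processSlice function, which is the kind of string a getProcessSliceSource method could return.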
  31. Thanks (a sneak preview of future versions follows in the next slides)
  32. The overall vision for OpenDremel
     • An interactive data cloud platform for managing high volumes of static data in the form of serialized objects.
     • Compatible with Google tools such as BigQuery, Prediction API, Fusion Tables, Google Storage, etc.
     • Aggressively use existing open-source software, preferably Apache-licensed, to quickly “implement” the desired functionality.
  33. Features Backlog
     • Processing compressed data directly without decompressing.
     • Macro parallelism: 1) multithreading 2) multi-process 3) multi-node 4) massive clustering.
     • Micro parallelism: 1) SSE & AVX 2) OpenCL 3) better machine code to leverage ILP 4) light threads for parallel processing of a single tablet 5) LLVM 6) special hardware (GPU & Tilera).
     • Interactive joins and indexing support, zone maps and global system-recognized dimensions such as time, geography and IP.
     • Advanced analytics, statistics and machine learning capabilities.
     • Richer SerObjLib, more formats.
     • Advanced visualization and streaming.
     • Batch data-crunching and map-reduce support.
     • Multi-tenancy, resource control, metering and accounting.
     • CEP capabilities, fast lookups, and querying of data that is not yet packed into tablets.
     • User-defined functions.
     • Scratch tables and rolling queries.
