OpenDremel Story: 2010
• Camuel Gilyadov started a Dremel implementation, named OpenDremel, in summer 2010.
• David Gruzman joined the effort a few months later, followed by Constantine Peresypkin.
• There wasn't a comprehensive design or architecture. The goal was to get the hierarchical-columnar transformation working smoothly and in strict accordance with the Dremel paper. We published several working implementations under the Apache License.
• Hong San was hired as the first full-timer to speed up development. The Metaxa milestone was set.
OpenDremel Story: 2011
• The early OpenDremel design was found to be too naive, mainly due to Java underperformance in inner number-crunching loops.
• After fierce brainstorming, the project was restarted from scratch under a new name, Dazo. With Dazo, a query plan is an arbitrary piece of executable native code, with a Java frontend.
• From then on we took our inspiration from BigQuery rather than from the Dremel paper.
• We decided to use Google NaCl as the sandboxing technology, both to isolate queries and to meter resource consumption. The new sandbox was named ZeroVM.
• For storage we decided to use OpenStack Swift.
OpenDremel Story: 2012
• Four people work full-time and several others part-time. We still don't have a fully integrated version, but we are satisfied with what we have achieved and convinced that the decisions behind Dazo were correct.
• We believe ZeroVM could be a disruptive technology in itself, revolutionizing the BigData@Cloud space.
• We are excited by the Apache Drill initiative and hope to be useful to it.
Design Tenet #1
• Apache Drill must support multi-tenant semantics internally rather than being run inside guest VMs.
• It should be inspired by BigQuery and not only by the Dremel/PowerDrill/Tenzing papers.
• It is not practical to set up a dedicated cloud (billed hourly) just to run a query for a few seconds.
• The codebase must be clearly divided into a trusted part and an untrusted part. The trusted part must be kept to an absolute minimum and must be peer-reviewed, secured, audited and metered.
Design Tenet #2
• Apache Drill must be extremely flexible and customizable.
• The schema-on-read concept must be supported. It must be possible to embed imperative high-performance parser code in the query.
• SQL is no longer enough. It must be easy to add new query languages as plug-ins or as user-defined functions (UDFs).
• Additionally, various data formats must be supported, such as column stores, row stores, PAX, RCFile, etc.
Design Tenet #2 (cont.)
• We suggest that the query plan format be relaxed to arbitrary distributed executable code, and the data format relaxed to an arbitrary opaque BLOB.
• This way new query languages and new data formats could be easily supported without changing the backend.
• As an added benefit, the backend becomes a generic, lightweight, homogeneous compute-storage cloud.
• Such an approach exhibits good separation of control. The cloud operator controls and bills for generic infrastructure, and the query engine is left completely in the control of the tenant/user.
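The separation described in this tenet can be illustrated with a minimal sketch. It is not Dazo code: the names `Job`, `run_on_backend` and `csv_sum_job` are hypothetical, and a plain Python function stands in for the "arbitrary executable code" that a real backend would run inside a sandbox. The point is only that the backend never parses the data; the tenant's job does.

```python
from typing import Callable, Iterable, List

# Hypothetical type: a "query plan" is just tenant code taking an opaque BLOB.
Job = Callable[[bytes], int]


def run_on_backend(job: Job, blobs: Iterable[bytes]) -> List[int]:
    """Generic backend: hands each opaque BLOB to the tenant's job.

    The backend knows nothing about the data format or the query language.
    """
    return [job(blob) for blob in blobs]


def csv_sum_job(blob: bytes) -> int:
    """Tenant-supplied job: schema-on-read parse of CSV, summing column 1."""
    total = 0
    for line in blob.decode("utf-8").splitlines():
        if line:
            total += int(line.split(",")[1])
    return total


# Two stored objects; in a real deployment each could live on a different node.
blobs = [b"a,1\nb,2\n", b"c,3\n"]
partials = run_on_backend(csv_sum_job, blobs)  # one partial result per BLOB
total = sum(partials)                          # tenant-side combine step
```

Supporting a different data format or query language here means supplying a different job function; `run_on_backend` stays unchanged.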
Design Tenet #3
• Apache Drill requests/queries must be hyper-elastic, meaning they can exploit the compute capacity of thousands of servers for a short duration of just a few seconds. No resources may be kept spinning per user between queries or when idle.
• Traditional VMs are too heavyweight for that. Container approaches such as OpenVZ/LXC are not secure enough in a multi-tenancy context.
• We suggest making sandboxing pluggable and supporting ZeroVM (developed for OpenDremel) and LXC (fine for private clouds) to begin with.
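A rough sketch of what "pluggable sandboxing" could look like follows. Everything in it is an assumption made for illustration: the `Sandbox` interface, the `SANDBOXES` registry and `execute_query` are hypothetical names, and the only implementation shown is a plain child process, which is not a security boundary and only stands in for a real sandbox such as ZeroVM or LXC.

```python
import subprocess
import sys
from abc import ABC, abstractmethod


class Sandbox(ABC):
    """Hypothetical plug-in interface: run untrusted code on input bytes."""

    @abstractmethod
    def run(self, code: str, data: bytes, timeout_s: float) -> bytes:
        ...


class SubprocessSandbox(Sandbox):
    """Stand-in only: a child Python process, NOT a real isolation layer."""

    def run(self, code: str, data: bytes, timeout_s: float) -> bytes:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            input=data,            # fed to the child's stdin
            capture_output=True,   # collect stdout and stderr
            timeout=timeout_s,     # crude limit on wall-clock time
            check=True,            # raise if the child exits non-zero
        )
        return proc.stdout


# Registry of available sandbox plug-ins, keyed by a configuration name.
SANDBOXES = {"subprocess": SubprocessSandbox}


def execute_query(kind: str, code: str, data: bytes, timeout_s: float = 5.0) -> bytes:
    """Create a fresh sandbox per request; nothing is kept alive afterwards."""
    sandbox = SANDBOXES[kind]()
    return sandbox.run(code, data, timeout_s)


UPPERCASE_JOB = "import sys; sys.stdout.buffer.write(sys.stdin.buffer.read().upper())"
result = execute_query("subprocess", UPPERCASE_JOB, b"abc")
```

Adding another sandbox type under this sketch would mean registering one more `Sandbox` subclass; the calling code would not change.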
Design Tenet #4
• Apache Drill must be efficient.
• Value-per-byte is extremely low with BigData.
• Overhead in the inner loop must be kept to a minimum.
• We found Java inefficient for general number crunching (such as data compression). The main problem with Java is that GC overhead is unavoidable across the whole data corpus being scanned. We went so far as to keep all data in byte arrays and auto-generate the transformation code; it still underperformed, and code complexity went through the roof.
Suggested Architecture
[Diagram: three tiers, left to right]
• Browser / Client
• Single-Tenant Frontend, running inside a traditional guest VM
• Multi-Tenant Backend, running inside a scale-out object store with in-situ compute
[Diagram labels: Query, JVM, Query Compiler, Custom executable job]
OpenDremel/Dazo
[Diagram: the same three tiers as implemented in OpenDremel/Dazo]
• Client: two separate unfinished jQuery apps and a cmdline app, with no particular codenames
• Frontend: we call it Metaxa (historic reasons); BQL parser and an unfinished compiler based on Apache Velocity
• Backend: we call it Zwift (Swift + ZeroVM); alpha quality
[Diagram labels: Query, JVM, Query Compiler, Custom executable job]
What is Swift?
“Swift is a highly available, distributed, eventually consistent object/blob store. Organizations can use Swift to store lots of data efficiently, safely, and cheaply.”
Haven’t got it?
Swift is THE open-source implementation of Amazon S3
What is ZeroVM?
Highly secure, low-overhead, low-latency container-style virtualization based on the Google Native Client project. The critical security code is taken verbatim from the Chrome browser project and is therefore as secure as the Chrome browser. More info: http://ZeroVM.org and http://news.ycombinator.com/item?id=3746222
ZeroVM highlights
1. Disposable VM per request
2. Hyper-elasticity per request
3. Embeddable into everything
4. High performance (x86/ARM)
5. Erlang-inspired clustering
6. Written in pure C, no deps
Haven’t got it?
ZeroVM is to virtualization what SQLite is to databases
Where is the code?
• OpenDremel (1st generation design):
  – http://code.google.com/p/dremel/source/browse?repo=dremel
  – http://code.google.com/p/dremel/source/browse?repo=metaxa
• Dazo (2nd generation design):
  – https://github.com/Dazo-org