Apache Drill (ver. 0.2)

Updated version of the OpenDremel team's suggested design for the Apache Drill project.

  1. Apache Drill: design proposal from the OpenDremel team
     HLD Version 0.2, 9/sep/2012
     Camuel Gilyadov & Constantine Peresypkin, Email: Camuel@BigDataCraft.com
  2. Intro
     • This is a high-level design proposal for the Apache Drill project from the OpenDremel team.
     • The history slides and the usual “about us” material have been moved to the end of the deck.
     • A slide with all relevant links is also included at the end.
  3. Design Tenet #1
     • Apache Drill must support multi-tenant semantics internally, rather than being run entirely inside guest VMs.
     • It should be inspired by BigQuery and not only by the Dremel/PowerDrill/Tenzing papers.
     • It is not practical to set up a dedicated cloud (billed hourly) just to run a query for a few seconds.
     • The codebase must be clearly divided into a trusted part and an untrusted part. The trusted part must be kept to an absolute minimum and must be peer-reviewed, secured, audited and metered.
  4. Design Tenet #2
     • Apache Drill must be modular and customizable in many dimensions.
     • The schema-on-read concept must be supported. An imperatively coded, high-performance data parser must be embeddable into the query.
     • SQL is no longer enough. It must be easy to add new query languages, as well as user-defined functions (UDFs) implementing deep analytics (such as statistics and machine learning).
     • Additionally, various data formats must be supported, such as column stores, row stores, PAX, RCFiles, etc.
  5. Design Tenet #2 (cont.)
     • We suggest relaxing the query plan format to an arbitrary executable, and the data format to an arbitrary opaque BLOB.
     • This way, new query languages and new data formats can be supported easily without changing the backend.
     • As an added benefit, the backend becomes a generic, lightweight, homogeneous compute-storage cloud.
     • This approach gives a good separation of control: the cloud operator controls and bills for generic infrastructure, and the query engine is left completely in the control of the tenant/user.
  6. Design Tenet #3
     • Apache Drill requests/queries must be hyper-elastic, meaning able to exploit the compute capacity of thousands of servers for a short duration of just a few seconds. No resources may be kept spinning per user between queries or when idle.
     • Traditional VMs are too heavyweight for that. Container approaches such as OpenVZ/LXC are not secure enough in a multi-tenant context.
     • We suggest making sandboxing pluggable and supporting ZeroVM (developed for OpenDremel) and LXC (fine for private clouds) to begin with.
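
One possible reading of "pluggable sandboxing" is a small registry keyed by sandbox name. The sketch below is an illustration under assumptions, not part of the proposal: `Sandbox`, `StubSandbox`, `SandboxRegistry` and the `multiTenantSafe` flag are hypothetical names, and the stub launches nothing. It only shows how a deployment could accept LXC on a private cloud while refusing it in a multi-tenant one.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical plug-in point: one implementation per isolation technology. */
interface Sandbox {
    String name();

    /** True if this sandbox is trusted to isolate mutually untrusted tenants. */
    boolean multiTenantSafe();

    /** Stub: a real implementation would start the job inside the sandbox. */
    String launch(String jobId);
}

/** Stand-in implementation that launches nothing and only reports what it would do. */
final class StubSandbox implements Sandbox {
    private final String name;
    private final boolean multiTenantSafe;

    StubSandbox(String name, boolean multiTenantSafe) {
        this.name = name;
        this.multiTenantSafe = multiTenantSafe;
    }

    @Override public String name() { return name; }
    @Override public boolean multiTenantSafe() { return multiTenantSafe; }
    @Override public String launch(String jobId) { return "would run " + jobId + " in " + name; }
}

/** Looks up a sandbox by name and refuses weak isolation on shared clouds. */
final class SandboxRegistry {
    private final Map<String, Sandbox> byName = new LinkedHashMap<>();

    void register(Sandbox sandbox) { byName.put(sandbox.name(), sandbox); }

    Sandbox select(String name, boolean multiTenantDeployment) {
        Sandbox sandbox = byName.get(name);
        if (sandbox == null) {
            throw new IllegalArgumentException("unknown sandbox: " + name);
        }
        if (multiTenantDeployment && !sandbox.multiTenantSafe()) {
            throw new IllegalStateException(name + " is not allowed in a multi-tenant deployment");
        }
        return sandbox;
    }
}
```

Registering `new StubSandbox("zerovm", true)` and `new StubSandbox("lxc", false)` mirrors the slide: either can be selected for a private cloud, but only the first passes `select(name, true)`.
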
  7. Design Tenet #4
     • Apache Drill must be efficient.
     • Value-per-bit is extremely low with BigData.
     • Overhead in the inner loop must be kept to a minimum.
     • Java was found inefficient for general number crunching (such as data compression). The main problem with Java is that GC overhead is unavoidable for the whole data corpus being scanned. We went so far as to keep all data in byte arrays and auto-generate the transformation code; it still underperformed, and code complexity went through the roof.
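  8. Suggested Architecture
     [Diagram, three stages from left to right:]
     • Browser / Client
     • Single-Tenant Frontend, running inside a traditional guest VM: a JVM in which the Query passes through the Query Compiler and leaves as an Executable job
     • Multi-Tenant Backend: scale-out object store and in-situ compute, which receives the Executable job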
  9. Suggested Frontend Design
     • Usual Java single-tenant web application.
     • In charge of:
       – All interaction with the user
       – Query/job submission
       – Query/job progress monitoring
       – Result browsing
     [Diagram: Client Tools (CLI, AJAX App) talk to a Java Servlet (REST Gateway, Query Compiler).]
  10. Suggested AJAX
      • What AJAX framework?
      • ExtJs?
      • Look & feel: just clone the Google App with the trademarks and logos replaced?
      • Why is a web UI more important for Drill than for Hive?
        – Drill is interactive; at least a basic web UI must be provided with each release.
  11. Suggested CLI Design
      • Would Bash + curl suffice?
      • Or a full-blown Java CLI tool?
  12. Suggested REST-GW Design
      • Usual vanilla Java WebApp with Spring!
  13. Suggested Query Compiler Design #1
      • The Query Compiler consists of two component libraries with a stable but language-dependent interface between them (so, unfortunately, no reuse).
      [Diagram: Query Text → Parsers → SemanticModelReader → Planners → Executable Script. Parsers report Syntax Errors; Planners report Semantic Errors.]
  14. Suggested Query Compiler Design #2
      • DrqlSemanticModelReader is ready and published under …..
      • The SemanticModel that the parsers produce closely follows the original language. Parsers just parse the query text and do not attempt to “give it meaning” or annotate it.
      • Simplified example:
        – List<Expression> getResultColumns();
        – List<DrqlQuery> getFromClause();
        – List<ColumnId> getGroupByClause();
        – etc.
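
To make the simplified example concrete, here is a minimal sketch of how such a read-only model could look in Java. Only the three getter signatures come from the slide. Everything else is an assumption made for illustration: the `Expression` and `ColumnId` placeholder classes, the `ParsedQuery` holder, the `ModelWalker` helper, and the choice to let the `DrqlQuery` interface itself play the reader role. This is not the published DrqlSemanticModelReader.

```java
import java.util.Collections;
import java.util.List;

/** Placeholder for a parsed expression; the real type is not shown in the slides. */
final class Expression {
    final String text;
    Expression(String text) { this.text = text; }
}

/** Placeholder for a column reference; the real type is not shown in the slides. */
final class ColumnId {
    final String name;
    ColumnId(String name) { this.name = name; }
}

/** Read-only view of a parsed query, using the three getters listed on the slide. */
interface DrqlQuery {
    List<Expression> getResultColumns();
    List<DrqlQuery> getFromClause();
    List<ColumnId> getGroupByClause();
}

/** Immutable holder a parser could fill in; it records syntax only, no annotations. */
final class ParsedQuery implements DrqlQuery {
    private final List<Expression> resultColumns;
    private final List<DrqlQuery> fromClause;
    private final List<ColumnId> groupByClause;

    ParsedQuery(List<Expression> resultColumns, List<DrqlQuery> fromClause, List<ColumnId> groupByClause) {
        this.resultColumns = Collections.unmodifiableList(resultColumns);
        this.fromClause = Collections.unmodifiableList(fromClause);
        this.groupByClause = Collections.unmodifiableList(groupByClause);
    }

    @Override public List<Expression> getResultColumns() { return resultColumns; }
    @Override public List<DrqlQuery> getFromClause() { return fromClause; }
    @Override public List<ColumnId> getGroupByClause() { return groupByClause; }
}

/** Example consumer: a planner-side helper that only reads through the interface. */
final class ModelWalker {
    /** Nesting depth: 1 for a query with an empty FROM list, plus 1 per subquery level. */
    static int depth(DrqlQuery query) {
        int deepest = 0;
        for (DrqlQuery sub : query.getFromClause()) {
            deepest = Math.max(deepest, depth(sub));
        }
        return 1 + deepest;
    }
}
```

Because the consumer sees only getters, a planner written against the interface does not depend on how a particular parser builds the model.
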
  15. Suggested Query Compiler Design #3
      • What is an Executable Script?
        – A self-contained, serializable, executable object. When executed with an appropriate executor, it yields the correct query result on the given input data of the expected format.
        – Self-contained means no dependencies: everything is included in the executable object.
        – In particular, data parsing logic is included.
        – However, data access logic is NOT included.
        – The model for a script is: “here is your blob of size N mapped to memory starting from address S, you have time T to generate your result up to size R in memory starting from address D. You will be terminated without advance notice for any attempted violation of any restriction”
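
The contract quoted above (input blob, time limit T, result limit R) can be approximated in plain Java to show its shape. This is a sketch under stated assumptions and not a sandbox: `ExecutableScript`, `LimitedExecutor` and `ScriptTerminatedException` are hypothetical names, `ByteBuffer`s stand in for the mapped memory regions, and a thread with a timeout stands in for real isolation. Thread interruption is cooperative, so unlike ZeroVM or LXC this cannot forcibly stop a misbehaving script.

```java
import java.nio.ByteBuffer;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical script contract: read the blob, write the result, nothing else. */
interface ExecutableScript {
    void run(ByteBuffer input, ByteBuffer output) throws Exception;
}

/** Raised when a script breaks a limit or fails; the cause says which one. */
final class ScriptTerminatedException extends RuntimeException {
    ScriptTerminatedException(String reason, Throwable cause) { super(reason, cause); }
}

final class LimitedExecutor {
    /** Runs the script with a read-only input, a bounded output buffer and a deadline. */
    static byte[] execute(ExecutableScript script, byte[] blob, int maxResultBytes, long timeoutMillis) {
        ByteBuffer input = ByteBuffer.wrap(blob).asReadOnlyBuffer();   // "blob of size N"
        ByteBuffer output = ByteBuffer.allocate(maxResultBytes);       // "result up to size R"
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            Future<?> done = pool.submit(() -> {
                script.run(input, output);
                return null;
            });
            done.get(timeoutMillis, TimeUnit.MILLISECONDS);            // "time T"
        } catch (TimeoutException e) {
            throw new ScriptTerminatedException("time limit exceeded", e);
        } catch (ExecutionException e) {
            // Includes BufferOverflowException when the script writes past the limit.
            throw new ScriptTerminatedException("script failed or violated a limit", e.getCause());
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new ScriptTerminatedException("executor interrupted", e);
        } finally {
            pool.shutdownNow(); // interrupts the worker; cooperative only
        }
        output.flip();
        byte[] result = new byte[output.remaining()];
        output.get(result);
        return result;
    }
}
```

A script such as `(in, out) -> out.putInt(in.remaining())` returns four bytes holding the blob size, while one that writes more than `maxResultBytes` or runs past the deadline ends in a `ScriptTerminatedException`.
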
  16. Suggested Query Compiler Design #3 (cont.)
      • How is the executable script generated?
        1. The parser provides the planner with a query object implementing the SemanticModelReader interface.
        2. The planner logic examines the semantic model through the SemanticModelReader interface and produces a query plan object that implements the QueryPlanModelReader interface. Query analysis and optimization take place during this stage; if needed, additional QueryPlanModelRewriter and/or QueryPlanModelVisitor interfaces could be created for this purpose. However, DrQL is a simple language without a large (or any) search space, so the value of an optimizer is small. We suggest bypassing query rewriting and query optimization altogether for the initial releases.
        3. Once the query plan is generated, the most appropriate code template script is selected. The template engine then processes the template together with the QueryPlanModelReader object to produce the executable script.
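
Step 3 above can be sketched as follows. The shape of `QueryPlanModelReader` is not given in the slides, so the two methods shown here are assumptions, as are `SimplePlan`, `ScriptGenerator` and the `${name}` placeholder syntax. Plain `String.replace` is swapped in for a real template engine to keep the example dependency-free; the slides mention Apache Velocity for the existing compiler.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical read-only plan; the real QueryPlanModelReader is not shown in the slides. */
interface QueryPlanModelReader {
    /** Name of the code template the planner considers most appropriate. */
    String templateName();

    /** Values to substitute into the template, keyed by placeholder name. */
    Map<String, String> parameters();
}

final class SimplePlan implements QueryPlanModelReader {
    private final String templateName;
    private final Map<String, String> parameters;

    SimplePlan(String templateName, Map<String, String> parameters) {
        this.templateName = templateName;
        this.parameters = new LinkedHashMap<>(parameters);
    }

    @Override public String templateName() { return templateName; }
    @Override public Map<String, String> parameters() { return parameters; }
}

/** Selects a template by name and fills in ${placeholders} from the plan. */
final class ScriptGenerator {
    private final Map<String, String> templates = new LinkedHashMap<>();

    void addTemplate(String name, String body) { templates.put(name, body); }

    String generate(QueryPlanModelReader plan) {
        String body = templates.get(plan.templateName());
        if (body == null) {
            throw new IllegalArgumentException("no template named " + plan.templateName());
        }
        String script = body;
        for (Map.Entry<String, String> p : plan.parameters().entrySet()) {
            script = script.replace("${" + p.getKey() + "}", p.getValue());
        }
        // Crude check: any "${" left over is treated as an unbound placeholder.
        if (script.contains("${")) {
            throw new IllegalStateException("unbound placeholder in template " + plan.templateName());
        }
        return script;
    }
}
```

In this sketch the output is source text; turning it into something an executor can run (for example by compiling it) is a separate step that the sketch leaves out.
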
  17. Suggested Backend Design
      • TODO
      • Executors per se
        – Janino-based Java Executor
        – LXC-GCC based C Executor
        – ZeroVM-GCC based C Executor
      • Storage platforms with collocated data processing
        – Local files (non-distributed)
        – HDFS
        – OpenStack Swift
  18. OpenDremel/Dazo
      [Diagram: the suggested architecture annotated with the current state of each part.]
      • Client: two separate unfinished jQuery apps and a cmdline app, with no particular codenames.
      • Frontend (JVM: Query → Query Compiler → Executable job): we call it Metaxa (historic reasons). BQL parser, unfinished compiler based on Apache Velocity.
      • Backend: we call it Zwift (Swift + ZeroVM). Alpha quality.
  19. What is Swift?
      “Swift is a highly available, distributed, eventually consistent object/blob store. Organizations can use Swift to store lots of data efficiently, safely, and cheaply.”
  20. Don’t get it?
      Swift is THE open-source implementation of Amazon S3
  21. What is ZeroVM?
      Highly secure, low-overhead, low-latency container-style virtualization based on the Google Native Client project. The critical security code is transferred verbatim from the Chrome Browser project and therefore is as secure as Chrome Browser.
      More info: http://ZeroVM.org and http://news.ycombinator.com/item?id=3746222
  22. ZeroVM highlights
      1. Disposable VM per request
      2. HyperElasticity per request
      3. Embeddable into everything
      4. High-performance (x86/ARM)
      5. Erlang-inspired clustering
      6. Written in pure C, no deps
  23. Don’t get it?
      ZeroVM is to Virtualization what SQLite is to Databases
  24. Links
      • https://github.com/ApacheDrill/Brainstorm/wiki/Apache-Drill-Links
      • OpenDremel (1st generation design):
        – http://code.google.com/p/dremel/source/browse?repo=dremel
        – http://code.google.com/p/dremel/source/browse?repo=metaxa
      • Dazo (2nd generation design):
        – https://github.com/Dazo-org
  25. OpenDremel Story: 2010
      • Camuel Gilyadov started a Dremel implementation, named OpenDremel, in summer 2010.
      • David Gruzman joined the effort a few months later, followed by Constantine Peresypkin.
      • There wasn’t a comprehensive design or architecture. The goal was to get the hierarchical-columnar transformation working smoothly and in strict accordance with the Dremel paper. Several working implementations were published by us under the Apache License.
      • Hong San was hired as the first full-timer to speed up development. The Metaxa milestone was set.
  26. OpenDremel Story: 2011
      • The early OpenDremel design was found too naive, mainly due to Java underperformance in inner number-crunching loops.
      • After fierce brainstorming, the project was restarted from scratch under the new name Dazo. With Dazo, the query plan is an arbitrary piece of executable native code, with a Java frontend.
      • From then on we took our inspiration from BigQuery rather than from the Dremel paper.
      • We decided to use Google NaCl as the sandboxing technology to isolate queries as well as to meter resource consumption. The new sandbox was named ZeroVM.
      • As for storage, we decided to use OpenStack Swift.
  27. OpenDremel Story: 2012
      • Four people work full-time and several others part-time. We still don’t have a fully integrated version, but we are satisfied with what we have achieved and convinced that the decisions behind Dazo were correct.
      • We believe ZeroVM could be a disruptive technology in itself, revolutionizing the BigData@Cloud space.
      • We are excited by the Apache Drill initiative and hope to be useful to it.
      • Check the blog: http://BigDataCraft.com
  28. Thanks
      Camuel Gilyadov, Email: Camuel@BigDataCraft.com
