Apache Drill Architecture – High-Performance SQL with a JSON Data Model

http://bit.ly/1zYUhKr – Apache Drill is a high-performance, low-latency SQL engine with a JSON data model.

  1. © 2015 MapR Technologies. How Drill Achieves Flexibility with Performance
  2. Drill Supports Schema Discovery On-The-Fly
     Schema declared in advance (schema on write, schema before read):
     • Fixed schema
     • Leverage schema in centralized repository (Hive Metastore)
     Schema discovered on-the-fly (schema on the fly):
     • Fixed schema, evolving schema or schema-less
     • Leverage schema in centralized repository or self-describing data
  3. Drill’s Data Model is Flexible
     [Diagram: formats placed by schema flexibility (fixed schema to dynamic schema) and structure (flat to complex): JSON, BSON, HBase, Parquet, Avro, CSV, TSV]
     RDBMS/SQL-on-Hadoop table:
       Name      Gender  Age
       Michael   M       6
       Jennifer  F       3
     Apache Drill table:
       { name: { first: Michael, last: Smith }, hobbies: [ski, soccer], district: Los Altos }
       { name: { first: Jennifer, last: Gates }, hobbies: [sing], preschool: CCLC }
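The two Drill records on this slide carry different fields, so a reader has to work out the schema from the data itself. The following is a minimal Python sketch of that idea, not Drill's actual implementation: `discover_schema` is a hypothetical helper that walks each record and collects every dotted field path together with the type names it observed.

```python
from typing import Any


def discover_schema(records: list[dict[str, Any]]) -> dict[str, set[str]]:
    """Map each dotted field path to the set of type names observed for it."""
    schema: dict[str, set[str]] = {}

    def walk(value: Any, path: str) -> None:
        if isinstance(value, dict):
            # Nested maps contribute one path per leaf field.
            for key, child in value.items():
                walk(child, f"{path}.{key}" if path else key)
        else:
            schema.setdefault(path, set()).add(type(value).__name__)

    for record in records:
        walk(record, "")
    return schema


# The two self-describing records from the slide.
records = [
    {"name": {"first": "Michael", "last": "Smith"},
     "hobbies": ["ski", "soccer"], "district": "Los Altos"},
    {"name": {"first": "Jennifer", "last": "Gates"},
     "hobbies": ["sing"], "preschool": "CCLC"},
]
schema = discover_schema(records)
for path in sorted(schema):
    print(path, sorted(schema[path]))
```

Fields that appear in only some records (`district`, `preschool`) simply show up as extra paths; no schema has to be declared before the data is read.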
  4. Drill enables ‘SQL on Everything’
     SELECT * FROM dfs.yelp.`business.json`
     Storage plugin instance (dfs):
     • DFS (Text, Parquet, JSON)
     • HBase/MapRDB
     • Hive Metastore/HCatalog
     • Easy API to go beyond Hadoop
     Workspace (yelp):
     • Sub-directory
     • HBase namespace
     • Hive database
     Table (`business.json`):
     • Pathnames
     • Hive table
     • HBase table
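The query on this slide names its table with three parts: storage plugin instance, workspace, and table. As an illustration only, the Python sketch below splits such a reference into those parts, honouring backtick quoting so that the dot inside `business.json` is not treated as a separator. `TableRef` and `parse_table_ref` are hypothetical names, and the sketch handles only the three-part form shown on the slide.

```python
import re
from typing import NamedTuple


class TableRef(NamedTuple):
    plugin: str      # storage plugin instance, e.g. dfs
    workspace: str   # e.g. a sub-directory, HBase namespace or Hive database
    table: str       # e.g. a pathname, Hive table or HBase table


# One identifier: either backtick-quoted (may contain dots) or bare.
_PART = re.compile(r"`([^`]*)`|([^.`]+)")


def parse_table_ref(ref: str) -> TableRef:
    """Split plugin.workspace.table, keeping quoted parts intact."""
    parts = [quoted or bare for quoted, bare in _PART.findall(ref)]
    if len(parts) != 3:
        raise ValueError(f"expected plugin.workspace.table, got {ref!r}")
    return TableRef(*parts)


ref = parse_table_ref("dfs.yelp.`business.json`")
print(ref)
```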
  5. Drill is a Distributed SQL query engine
     [Diagram: three nodes, each running a drillbit alongside a DataNode/RegionServer, plus three ZooKeeper nodes]
     • Scale out
     • Columnar and vectorized execution
     • Optimistic and pipelined execution (no MR, Spark, Tez)
     • Late binding
     • Extensible
  6. Drill allows reuse of existing SQL Tools and Skills
     • Leverage SQL-compatible tools (BI, query builders, etc.) via Drill’s standard ODBC, JDBC and ANSI SQL support
     • Enable business analysts, technical analysts and data scientists to explore and analyze large volumes of real-time data
  7. Drill is Designed For A Wide Set Of Use Cases
     • Use cases: Raw Data Exploration, JSON Analytics, DWH Offload, …
     • Sources: Hive, HBase, Files, Directories, …
     • Formats: {JSON}, Parquet, Text Files, …
  8. MapR Optimized Data Architecture
     [Diagram labels]
     • Sources: relational, SaaS, mainframe; documents, emails; log files, clickstreams; sensors; blogs, tweets, link data; data warehouse
     • Data Movement, Data Access
     • Analytics: search; schema-less data exploration; BI, reporting; ad-hoc integrated analytics; data transformation, enrichment and integration
     • Operational Apps: recommendations, fraud detection, logistics
     • Machine Learning
     • MapR Distribution for Hadoop: Streaming (Spark Streaming, Storm), Batch (MapReduce, Spark, Hive, Pig), Interactive (Drill, Impala)
     • MapR Data Platform: MapR-DB, MapR-FS
  9. Architecture – Under the hood
  10. High Level Architecture
     • Cluster of commodity servers
       - Daemon (drillbit) on each node
     • ZooKeeper maintains ephemeral cluster membership information
       - Drillbit uses ZooKeeper to find other drillbits in the cluster
       - Client uses ZooKeeper to find drillbits
     • Built-in, optimistic query execution engine. Doesn’t require a particular storage or execution system (MapReduce, Spark, Tez)
       - Better performance and manageability
     • Data processing unit is columnar record batches
       - Enables schema flexibility with negligible performance impact
  11. Basic Process
     [Diagram: a query arrives at one of three Drillbits, each sitting on DFS/HBase/Hive, coordinated through ZooKeeper]
     1. Query comes to any Drillbit (JDBC, ODBC, CLI, REST)
     2. Drillbit generates execution plan based on query optimization & locality
     3. Fragments are farmed to individual nodes
     4. Result is returned to driving node
  12. Core Modules within drillbit
     [Diagram: RPC Endpoint, SQL Parser, Logical Plan, Optimizer, Physical Plan, Execution, Storage Plugins (DFS, HBase, Hive, MongoDB)]
  13. A Query engine that is…
     • Columnar/Vectorized
     • Optimistic/pipelined
     • Runtime compilation
     • Late binding
     • Extensible
  14. Columnar representation
     [Diagram: columns A, B, C, D, E and how they are laid out on disk]
  15. Columnar Encoding
     • Values in a column are stored next to one another
       - Better compression
       - Range-map: save min-max, can skip if not present
     • Only retrieve columns participating in query
     • Drill optimizes for BOTH columnar storage and execution
     [Diagram: columns A, B, C, D, E on disk]
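The range-map bullet on this slide can be illustrated with a small Python sketch. It is a toy model rather than Drill's storage code: `ColumnChunk` and `scan_equal` are hypothetical names, each chunk records the minimum and maximum of its values, and a lookup skips any chunk whose range cannot contain the target.

```python
from dataclasses import dataclass, field


@dataclass
class ColumnChunk:
    """A run of values from one column, plus min/max statistics."""
    values: list[int]  # must be non-empty
    lo: int = field(init=False)
    hi: int = field(init=False)

    def __post_init__(self) -> None:
        self.lo, self.hi = min(self.values), max(self.values)


def scan_equal(chunks: list[ColumnChunk], target: int) -> tuple[list[int], int]:
    """Return matching values and the number of chunks actually scanned."""
    matches: list[int] = []
    scanned = 0
    for chunk in chunks:
        if target < chunk.lo or target > chunk.hi:
            continue  # range-map says the value cannot be here: skip the chunk
        scanned += 1
        matches.extend(v for v in chunk.values if v == target)
    return matches, scanned


# Only the "age" column is touched; other columns are never read.
age_chunks = [
    ColumnChunk([3, 6, 5]),
    ColumnChunk([21, 34, 28]),
    ColumnChunk([61, 70, 65]),
]
matches, scanned = scan_equal(age_chunks, 34)
print(matches, scanned)
```

In this example two of the three chunks are skipped on their statistics alone, which is the saving the range-map is meant to provide.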
  16. Vectorization
     • Drill operates on more than one record at a time
       - Word-sized manipulations
       - SIMD instructions (GCC, LLVM and JVM all do various optimizations automatically)
       - Manually coded algorithms
     • Logical vectorization
       - Bitmaps allow lightning-fast null-checks
       - Avoid branching to speed CPU pipeline
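To make the bitmap point concrete, here is a hedged Python sketch of a nullable integer column. It is an illustration of the general technique, not Drill's value-vector code: `IntVector`, `from_optional`, `add` and `null_count` are hypothetical names, and a Python `int` stands in for the validity bitmap. Adding two columns combines their null information with a single bitwise AND instead of testing each value for null.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class IntVector:
    data: list[int]  # null slots hold a placeholder 0
    validity: int    # bit i is set when data[i] is non-null


def from_optional(values: list[Optional[int]]) -> IntVector:
    data = [0 if v is None else v for v in values]
    validity = sum(1 << i for i, v in enumerate(values) if v is not None)
    return IntVector(data, validity)


def add(a: IntVector, b: IntVector) -> IntVector:
    # Arithmetic runs over every slot with no per-value null branch;
    # a result is valid only where both inputs are valid (one AND).
    return IntVector([x + y for x, y in zip(a.data, b.data)],
                     a.validity & b.validity)


def null_count(vec: IntVector) -> int:
    # Population count of the bitmap gives the number of non-null slots.
    return len(vec.data) - bin(vec.validity).count("1")


a = from_optional([1, None, 3, 4])
b = from_optional([10, 20, None, 40])
total = add(a, b)
print(total.data, bin(total.validity), null_count(total))
```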
  17. Optimistic Execution
     • With a short time horizon, failures are infrequent
       - Don’t spend energy and time creating boundaries and checkpoints to minimize recovery time
       - Rerun the entire query in the face of failure
     • No barriers
     • No persistence unless memory overflows
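The "rerun the entire query" policy can be sketched in a few lines of Python. The slide does not say which component triggers the rerun, so the retry wrapper below is an assumption for illustration: `run_optimistically` is a hypothetical helper, and `flaky_query` simulates a node failure on its first call.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def run_optimistically(run_query: Callable[[], T],
                       max_attempts: int = 3) -> tuple[T, int]:
    """Run with no checkpoints; on failure, start the whole query again."""
    last_error: Exception | None = None
    for attempt in range(1, max_attempts + 1):
        try:
            return run_query(), attempt
        except RuntimeError as exc:
            last_error = exc  # nothing to recover: intermediate state is discarded
    raise RuntimeError(f"query failed after {max_attempts} attempts") from last_error


calls = {"count": 0}


def flaky_query() -> list[tuple[str, int]]:
    calls["count"] += 1
    if calls["count"] == 1:
        raise RuntimeError("simulated node failure")
    return [("Michael", 6), ("Jennifer", 3)]


rows, attempts = run_optimistically(flaky_query)
print(rows, attempts)
```

The trade-off is the one the slide names: no time is spent writing checkpoints on the common path, at the price of repeating all work on the rare failure.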
  18. Pipelining
     • Record batch is the unit of work for Drill
       - Operators work on a record batch
     • Record batches are pipelined between nodes
       - ~256 kB usually
     • Operator reconfiguration happens at batch boundaries
     [Diagram: record batches flowing between three Drillbits]
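A rough way to picture batch pipelining is a chain of Python generators, where each operator consumes one batch and hands it on before the upstream operator has produced the next. This is a simplified sketch under stated assumptions: the operator names `scan`, `filter_op` and `project_op` are hypothetical, batches are sized by row count rather than bytes, and rows are plain dictionaries instead of Drill's columnar batches.

```python
from typing import Callable, Iterable, Iterator

Row = dict
Batch = list[Row]


def scan(rows: list[Row], batch_size: int) -> Iterator[Batch]:
    """Emit the input as fixed-size record batches."""
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]


def filter_op(batches: Iterable[Batch],
              predicate: Callable[[Row], bool]) -> Iterator[Batch]:
    """Filter one batch at a time; empty batches are not forwarded."""
    for batch in batches:
        kept = [row for row in batch if predicate(row)]
        if kept:
            yield kept


def project_op(batches: Iterable[Batch], columns: list[str]) -> Iterator[Batch]:
    """Keep only the requested columns of each batch."""
    for batch in batches:
        yield [{c: row[c] for c in columns} for row in batch]


rows = [{"name": f"user{i}", "age": i} for i in range(10)]
pipeline = project_op(
    filter_op(scan(rows, 4), lambda row: row["age"] % 2 == 0),
    ["name"],
)
result = list(pipeline)  # batches are pulled through the chain one by one
print(result)
```

Nothing is materialised between operators, so the first output batch is available as soon as the first input batch has passed through the chain.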
  19. Runtime Compilation is Faster
     [Chart: time for 1 million evaluations (ms), axis 0 to 500, for Trivial, Simple and Moderate expressions; legend: Janino, interpreted. Source: http://bit.ly/16Xk32x]
  20. Drill compiler
     [Diagram]
     • CodeModel generates code
     • Janino compiles runtime byte-code
     • Precompiled byte-code templates
     • Merge byte-code of the two classes
     • Loaded class
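The difference between interpreting an expression and compiling it at runtime can be shown with a Python analogy. This is not Drill's compiler: Janino and CodeModel are Java tools, there is no byte-code merging here, and `interpret`, `to_source` and `compile_expr` are hypothetical names. The sketch generates source text for an expression tree once and compiles it with Python's built-in `compile`, so that each row evaluation avoids walking the tree again.

```python
import operator

OPS = {"+": operator.add, "*": operator.mul, ">": operator.gt}

# Expression tree: ("col", name) | ("lit", value) | (op, left, right)


def interpret(expr, row):
    """Walk the tree for every row (the interpreted path)."""
    kind = expr[0]
    if kind == "col":
        return row[expr[1]]
    if kind == "lit":
        return expr[1]
    return OPS[kind](interpret(expr[1], row), interpret(expr[2], row))


def to_source(expr) -> str:
    """Generate source text for the expression (the code-generation step)."""
    kind = expr[0]
    if kind == "col":
        return f"row[{expr[1]!r}]"
    if kind == "lit":
        return repr(expr[1])
    if kind not in OPS:
        raise ValueError(f"unsupported operator: {kind!r}")
    return f"({to_source(expr[1])} {kind} {to_source(expr[2])})"


def compile_expr(expr):
    """Compile the generated source once; reuse the function for every row."""
    source = f"lambda row: {to_source(expr)}"
    return eval(compile(source, "<generated>", "eval"))


expr = (">", ("+", ("col", "age"), ("lit", 1)), ("lit", 6))
evaluate = compile_expr(expr)
print(to_source(expr), evaluate({"age": 6}), interpret(expr, {"age": 6}))
```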
  21. Cost-based Optimization
     • Pluggable rules and cost model
     • Rules for distributed plan generation
       - Insert Exchange operator into physical plan
       - Parallel query plans
     • Pluggable cost model
       - CPU, IO, memory, network cost (data locality)
       - Storage engine features (HDFS vs Hive vs HBase)
     [Diagram: Query Optimizer with pluggable rules]
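A pluggable cost model can be sketched as a weighted sum over the four cost dimensions the slide lists. Everything in this Python example is illustrative: the class names, the plan names and all numbers are made up, and Drill's real optimizer is considerably more involved. The point is only that swapping the cost model can change which candidate plan wins.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlanStats:
    """Estimated resource use of one candidate plan (made-up units)."""
    name: str
    cpu: float
    io: float
    memory: float
    network: float


@dataclass(frozen=True)
class CostModel:
    """Pluggable weights over the four cost dimensions from the slide."""
    cpu_weight: float = 1.0
    io_weight: float = 1.0
    memory_weight: float = 1.0
    network_weight: float = 1.0

    def cost(self, plan: PlanStats) -> float:
        return (self.cpu_weight * plan.cpu
                + self.io_weight * plan.io
                + self.memory_weight * plan.memory
                + self.network_weight * plan.network)


def choose_plan(plans: list[PlanStats], model: CostModel) -> PlanStats:
    """Pick the cheapest candidate under the given cost model."""
    return min(plans, key=model.cost)


# Hypothetical candidates with invented estimates.
plans = [
    PlanStats("broadcast_join", cpu=10, io=20, memory=30, network=5),
    PlanStats("hash_partition_join", cpu=10, io=20, memory=10, network=40),
]
default_choice = choose_plan(plans, CostModel())
memory_sensitive = choose_plan(plans, CostModel(memory_weight=5.0))
print(default_choice.name, memory_sensitive.name)
```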
  22. Integration and extensibility points
     • Support UDFs
       - UDFs/UDAFs using high-performance Java API
     • Not Hadoop-centric
       - Work with other NoSQL solutions including MongoDB, Cassandra, Riak, etc.
       - Build one distributed query engine together rather than one per technology
     • Built-in classpath scanning and plugin concept to add additional storage engines, functions and operators with zero configuration
     • Support direct execution of strongly specified JSON-based logical and physical plans
       - Simplifies testing
       - Enables integration of alternative query languages
  23. Additional Resources
     • Download Apache Drill
     • Tutorial: Apache Drill in 10 Minutes
     • Whiteboard Video with Tomer Shiran