
Apache Drill - Why, What, How

Apache Drill is the next generation of SQL query engines. It builds on ANSI SQL 2003 and extends it to handle newer formats such as JSON, Parquet, and ORC, alongside the usual CSV, TSV, XML, and other Hadoop formats. Most importantly, it melts away the barriers that have caused databases to become silos of data. It does so by handling schema changes on the fly, enabling a new level of self-service and data agility.

Published in: Data & Analytics

Apache Drill - Why, What, How

  1. © MapR Technologies, confidential © 2014 MapR Technologies M. C. Srivas, CTO and Founder
  2. MapR is Unbiased Open Source
  3. Linux Is Unbiased
      • Linux provides choice
        – MySQL
        – PostgreSQL
        – SQLite
      • Linux provides choice
        – Apache httpd
        – Nginx
        – Lighttpd
  4. MapR Is Unbiased
      • MapR provides choice

                        | MapR Distribution for Hadoop                             | Distribution C       | Distribution H
        ----------------|----------------------------------------------------------|----------------------|-----------------------
        Spark           | Spark (all of it) and SparkSQL                           | Spark only           | No
        Interactive SQL | Impala, Drill, Hive/Tez, SparkSQL                        | One option (Impala)  | One option (Hive/Tez)
        Scheduler       | YARN, Mesos                                              | One option (YARN)    | One option (YARN)
        Versions        | Hive 0.10, 0.11, 0.12, 0.13; Pig 0.11, 0.12; HBase 0.94, 0.98 | One version     | One version
  5. MapR Distribution for Apache Hadoop [architecture diagram]
      • MapR Data Platform (Random Read/Write): Enterprise Grade, Data Hub, Operational
        – MapR-FS (POSIX)
        – MapR-DB (High-Performance NoSQL)
      • Apache Hadoop and OSS ecosystem, in two bands: Execution Engines; Data Governance and Operations
      • Category labels: Batch, Streaming, NoSQL & Search, ML/Graph, SQL, Provisioning & Coordination, Workflow & Data Governance, Data Integration & Access, Security
      • Components: YARN, Pig, Cascading, Spark, Spark Streaming, Storm*, HBase, Solr, Juju, Savannah*, Mahout, MLLib, GraphX, MapReduce v1 & v2, Tez*, Accumulo*, Hive, Impala, Shark, Drill*, Sqoop, Sentry*, Oozie, ZooKeeper, Flume, Knox*, Falcon*, Whirr, HttpFS, Hue
      • Access APIs: NFS, HDFS API, HBase API, JSON API
      • MapR Control System (Management and Monitoring): CLI, REST API, GUI
      • * In roadmap for inclusion/certification
  7. Hadoop as an augmentation for the EDW: why?
  13. But inside, it looks like this …
  14. And this …
  15. And this …
  16. Consolidating schemas is very hard.
  17. Consolidating schemas is very hard, and it causes silos.
  18. Silos make analysis very difficult
      • How do I identify a unique {customer, trade} across data sets?
      • How can I guarantee the lack of anomalous behavior if I can’t see all data?
  19. Hard to know what’s of value a priori
  20. Hard to know what’s of value a priori
  21. Why Hadoop
  22. Rethink SQL for Big Data
      Preserve
      • ANSI SQL
        – Familiar and ubiquitous
      • Performance
        – Interactive nature crucial for BI/Analytics
      • One technology
        – Painful to manage different technologies
      • Enterprise ready
        – System-of-record, HA, DR, Security, Multi-tenancy, …
  23. Rethink SQL for Big Data
      Preserve
      • ANSI SQL
        – Familiar and ubiquitous
      • Performance
        – Interactive nature crucial for BI/Analytics
      • One technology
        – Painful to manage different technologies
      • Enterprise ready
        – System-of-record, HA, DR, Security, Multi-tenancy, …
      Invent
      • Flexible data-model
        – Allow schemas to evolve rapidly
        – Support semi-structured data types
      • Agility
        – Self-service possible when developer and DBA are the same person
      • Scalability
        – In all dimensions: data, speed, schemas, processes, management
  24. SQL is here to stay
  25. Hadoop is here to stay
  26. YOU CAN’T HANDLE REAL SQL
  27. SQL
          select * from A
          where exists (
            select 1 from B where B.b < 100 );
      • Did you know Apache Hive cannot compute it?
        – e.g., Hive, Impala, Spark/Shark
  28. Self-described Data
          select cf.month, cf.year
          from hbase.table1;
      • Did you know normal SQL cannot handle the above?
      • Nor can Hive and its variants like Impala, Shark?
  29. Self-described Data
          select cf.month, cf.year
          from hbase.table1;
      • Why?
      • Because there’s no meta-store definition available
  30. Self-Describing Data Ubiquitous
      • Centralized schema
        – Static
        – Managed by the DBAs
        – In a centralized repository
        – Long, meticulous data preparation process (ETL, create/alter schema, etc.); can take 6-18 months
      • Self-describing, or schema-less, data
        – Dynamic/evolving
        – Managed by the applications
        – Embedded in the data
        – Less schema, more suitable for data that has higher volume, variety and velocity
      • Apache Drill
  31. A Quick Tour through Apache Drill
  32. Data Source is in the Query
          select timestamp, message
          from dfs1.logs.`AppServerLogs/2014/Jan/p001.parquet`
          where errorLevel > 2
  33. Data Source is in the Query
          select timestamp, message
          from dfs1.logs.`AppServerLogs/2014/Jan/p001.parquet`
          where errorLevel > 2
      • dfs1: a cluster in Apache Drill
        – DFS
        – HBase
        – Hive meta-store
  34. Data Source is in the Query
          select timestamp, message
          from dfs1.logs.`AppServerLogs/2014/Jan/p001.parquet`
          where errorLevel > 2
      • dfs1: a cluster in Apache Drill
        – DFS
        – HBase
        – Hive meta-store
      • logs: a work-space
        – Typically a sub-directory
        – Hive database
  35. Data Source is in the Query
          select timestamp, message
          from dfs1.logs.`AppServerLogs/2014/Jan/p001.parquet`
          where errorLevel > 2
      • dfs1: a cluster in Apache Drill
        – DFS
        – HBase
        – Hive meta-store
      • logs: a work-space
        – Typically a sub-directory
        – Hive database
      • `AppServerLogs/2014/Jan/p001.parquet`: a table
        – Pathnames
        – HBase table
        – Hive table
  36. Combine data sources on the fly
      • JSON
      • CSV
      • ORC (i.e., all Hive types)
      • Parquet
      • HBase tables
      • … can combine them
          select USERS.name, USERS.emails.work
          from dfs.logs.`/data/logs` LOGS,
               dfs.users.`/profiles.json` USERS
          where LOGS.uid = USERS.uid
            and errorLevel > 5
          group by USERS.name, USERS.emails.work
          order by count(*);
  37. Can be an entire directory tree
          -- On a file
          select errorLevel, count(*)
          from dfs.logs.`/AppServerLogs/2014/Jan/part0001.parquet`
          group by errorLevel;
  38. Can be an entire directory tree
          -- On a file
          select errorLevel, count(*)
          from dfs.logs.`/AppServerLogs/2014/Jan/part0001.parquet`
          group by errorLevel;

          -- On the entire data collection: all years, all months
          select errorLevel, count(*)
          from dfs.logs.`/AppServerLogs`
          group by errorLevel
  39. Can be an entire directory tree
          -- On a file
          select errorLevel, count(*)
          from dfs.logs.`/AppServerLogs/2014/Jan/part0001.parquet`
          group by errorLevel;

          -- On the entire data collection: all years, all months
          select errorLevel, count(*)
          from dfs.logs.`/AppServerLogs`
          group by errorLevel
      • Callouts on the path: dirs[1], dirs[2] (directory levels, e.g. 2014 and Jan)
  40. Can be an entire directory tree
          -- On a file
          select errorLevel, count(*)
          from dfs.logs.`/AppServerLogs/2014/Jan/part0001.parquet`
          group by errorLevel;

          -- On the entire data collection, filtered and grouped by directory level
          select errorLevel, count(*)
          from dfs.logs.`/AppServerLogs`
          where dirs[1] > 2012
          group by errorLevel, dirs[2]
      • Callouts on the path: dirs[1], dirs[2] (directory levels, e.g. 2014 and Jan)
  41. Querying JSON
      donuts.json:
          {"name": "classic", "fillings": [{"name": "sugar", "cal": 400}]}
          {"name": "choco", "fillings": [{"name": "sugar", "cal": 400}, {"name": "chocolate", "cal": 300}]}
          {"name": "bostoncreme", "fillings": [{"name": "sugar", "cal": 400}, {"name": "cream", "cal": 1000}, {"name": "jelly", "cal": 600}]}
  42. Cursors inside Drill
          DrillClient drill = new DrillClient().connect( …);
          ResultReader r = drill.runSqlQuery("select * from `donuts.json`");
          while (r.next()) {
            String donutName = r.reader("name").readString();
            ListReader fillings = r.reader("fillings");
            while (fillings.next()) {
              int calories = fillings.reader("cal").readInteger();
              if (calories > 400)
                print(donutName, calories, fillings.reader("name").readString());
            }
          }
      Data: donuts.json, as on slide 41.
  43. Direct queries on nested data
          -- Flattening maps in JSON, Parquet and other nested records
          select name, flatten(fillings) as f
          from dfs.users.`/donuts.json`
          where f.cal < 300;
          -- lists the fillings < 300 calories
      Data: donuts.json, as on slide 41.
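
The flatten query on slide 43 can be pictured with a small in-memory sketch. This is not Drill's API or its implementation: the classes `FlattenSketch`, `Donut`, `Filling`, and `Row` are hypothetical names introduced here, and the data is the donuts.json sample from slide 41. The sketch only shows the shape of the result, one output row per (donut, filling) pair, followed by a filter on the flattened field.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical in-memory model of donuts.json; none of these types are Drill's API.
final class FlattenSketch {
    static final class Filling {
        final String name;
        final int cal;
        Filling(String name, int cal) { this.name = name; this.cal = cal; }
    }

    static final class Donut {
        final String name;
        final List<Filling> fillings;
        Donut(String name, Filling... fillings) {
            this.name = name;
            this.fillings = Arrays.asList(fillings);
        }
    }

    // One flattened output row: the donut name paired with a single filling.
    static final class Row {
        final String donut;
        final Filling f;
        Row(String donut, Filling f) { this.donut = donut; this.f = f; }
    }

    // Emits one row per element of the nested fillings list.
    static List<Row> flatten(List<Donut> donuts) {
        List<Row> rows = new ArrayList<>();
        for (Donut d : donuts) {
            for (Filling f : d.fillings) {
                rows.add(new Row(d.name, f));
            }
        }
        return rows;
    }

    // Keeps the flattened rows whose filling has fewer than `limit` calories.
    static List<Row> calBelow(List<Row> rows, int limit) {
        List<Row> kept = new ArrayList<>();
        for (Row r : rows) {
            if (r.f.cal < limit) kept.add(r);
        }
        return kept;
    }

    // The three records shown on slide 41.
    static List<Donut> sample() {
        return Arrays.asList(
            new Donut("classic", new Filling("sugar", 400)),
            new Donut("choco", new Filling("sugar", 400), new Filling("chocolate", 300)),
            new Donut("bostoncreme", new Filling("sugar", 400),
                      new Filling("cream", 1000), new Filling("jelly", 600)));
    }

    public static void main(String[] args) {
        List<Row> rows = flatten(sample());
        System.out.println(rows.size() + " flattened rows");
        for (Row r : calBelow(rows, 300)) {
            System.out.println(r.donut + " " + r.f.name + " " + r.f.cal);
        }
    }
}
```

With the slide 41 data, no filling is strictly below 300 calories, so the filtered result is empty; raising the limit shows the per-filling rows.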
  44. Complex Data Using SQL or Fluent API
          // SQL
          Result r = drill.sql("select name, flatten(fillings) from `donuts.json` where fillings.cal < 300");

          // or Fluent API
          Result r = drill.table("donuts.json")
                          .lt("fillings.cal", 300).all();

          while (r.next()) {
            String name = r.get("name").string();
            List fillings = r.get("fillings").list();
            while (fillings.next()) {
              print(name, fillings.get("name").string());
            }
          }
      Data:
          {"name": "classic", "fillings": [{"name": "sugar", "cal": 400}]}
          {"name": "choco", "fillings": [{"name": "sugar", "cal": 400}, {"name": "plain", "cal": 280}]}
          {"name": "bostoncreme", "fillings": [{"name": "sugar", "cal": 400}, {"name": "cream", "cal": 1000}, {"name": "jelly", "cal": 600}]}
  45. Queries on embedded data
          -- embedded JSON value inside column donut-json
          -- inside column-family cf1 of an HBase table donuts
          select d.name, count(d.fillings)
          from (
            select convert_from(cf1.`donut-json`, 'JSON') as d
            from hbase.user.`donuts` );
  46. Queries inside JSON records
          -- Each JSON record itself can be a whole database
          -- example: get all donuts with at least 1 filling with > 300 calories
          select d.name, count(d.fillings),
                 max(d.fillings.cal) within record as maxcal
          from (
            select convert_from(cf1.`donut-json`, 'JSON') as d
            from hbase.user.`donuts` )
          where maxcal > 300;
  47. • Schema can change over the course of a query
      • Operators are able to reconfigure themselves on schema change events
        – Minimize flexibility overhead
        – Support more advanced execution optimization based on actual data characteristics
  48. De-centralized metadata
          -- count the number of tweets per customer, where the customers are in Hive
          -- and their tweets are in HBase. Note that the HBase data has no meta-data information
          select c.customerName, hb.tweets.count
          from hive.CustomersDB.`Customers` c
          join hbase.user.`SocialData` hb
            on c.customerId = convert_from(hb.rowkey, 'UTF8');
  49. So what does this all mean?
  50. A Drill Database
      • What is a database with Drill/MapR?
  51. A Drill Database
      • What is a database with Drill/MapR?
      • Just a directory, with a bunch of related files
  52. A Drill Database
      • What is a database with Drill/MapR?
      • Just a directory, with a bunch of related files
      • There’s no need for artificial boundaries
        – No need to bunch a set of tables together to call it a “database”
  53. A Drill Database
      Directory: /user/srivas/work/bugs

      BugList
        symptom      | version | date    | bugid | dump-name
        -------------|---------|---------|-------|----------
        impala crash | 3.1.1   | 14/7/14 | 12345 | cust1.tgz
        cldb slow    | 3.1.0   | 12/7/14 | 45678 | cust2.tgz

      Customers
        name | rep   | se      | dump-name
        -----|-------|---------|----------
        xxxx | dkim  | junhyuk | cust1.tgz
        yyyy | yoshi | aki     | cust2.tgz
  54. Queries are simple
          select b.bugid, b.symptom, b.date
          from dfs.bugs.`/Customers` c, dfs.bugs.`/BugList` b
          where c.`dump-name` = b.`dump-name`
  55. Queries are simple
          select b.bugid, b.symptom, b.date
          from dfs.bugs.`/Customers` c, dfs.bugs.`/BugList` b
          where c.`dump-name` = b.`dump-name`
      Let’s say I want to cross-reference against your list:
          select b.bugid, b.symptom
          from dfs.bugs.`/BugList` b, dfs.yourbugs.`/YourBugFile` b2
          where b.bugid = b2.xxx
  56. What does it mean?
  57. What does it mean?
      • No ETL
      • Reach out directly to the particular table/file
      • As long as the permissions are fine, you can do it
      • No need to have the meta-data
        – None needed
  58. Another example
          select d.name, count(d.fillings)
          from (
            select convert_from(cf1.`donut-json`, 'JSON') as d
            from hbase.user.`donuts` );
      • convert_from(xx, 'JSON') invokes the JSON parser inside Drill
  59. Another example
          select d.name, count(d.fillings)
          from (
            select convert_from(cf1.`donut-json`, 'JSON') as d
            from hbase.user.`donuts` );
      • convert_from(xx, 'JSON') invokes the JSON parser inside Drill
      • What if you could plug in any parser?
  60. Another example
          select d.name, count(d.fillings)
          from (
            select convert_from(cf1.`donut-json`, 'JSON') as d
            from hbase.user.`donuts` );
      • convert_from(xx, 'JSON') invokes the JSON parser inside Drill
      • What if you could plug in any parser?
        – XML?
        – Semi-conductor yield-analysis files? Oil-exploration readings?
        – Telescope readings of stars?
        – RFIDs of various things?
  61. No ETL
      • Basically, Drill is querying the raw data directly
      • Joining with processed data
      • NO ETL
      • Folks, this is very, very powerful
      • NO ETL
  62. Seamless integration with Apache Hive
      • Low latency queries on Hive tables
      • Support for 100s of Hive file formats
      • Ability to reuse Hive UDFs
      • Support for multiple Hive metastores in a single query
  63. Underneath the Covers
  64. Basic Process
      1. Query comes to any Drillbit (JDBC, ODBC, CLI, protobuf)
      2. Drillbit generates execution plan based on query optimization & locality
      3. Fragments are farmed to individual nodes
      4. Result is returned to driving node
      [Diagram: Zookeeper; three nodes, each with a Drillbit, a Distributed Cache, and DFS/HBase; an incoming Query]
  65. Stages of Query Planning
      • SQL Query
      • Parser
      • Logical Planner (heuristic and cost based)
      • Physical Planner (cost based)
      • Foreman
      • Plan fragments sent to drill bits
  66. Query Execution
      [Diagram components: SQL Parser, Pig Parser, HiveQL Parser; Logical Plan; Optimizer; Physical Plan; Scheduler; Foreman; Operators; RPC Endpoint, JDBC Endpoint, ODBC Endpoint; Distributed Cache; Storage Engine Interface; HDFS, HBase, Mongo, Cassandra]
  67. A Query engine that is…
      • Columnar/Vectorized
      • Optimistic/pipelined
      • Runtime compilation
      • Late binding
      • Extensible
  68. Columnar representation
      [Figure: columns A, B, C, D, E and their layout on disk]
  69. Columnar Encoding
      • Values in a column are stored next to one another
        – Better compression
        – Range-map: save min-max, can skip if not present
      • Only retrieve columns participating in query
      • Aggregations can be performed without decoding
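
The range-map idea on slide 69 (save min-max, skip if not present) can be sketched as follows. This is an illustration under stated assumptions, not Drill's storage code: the class `RangeMapSketch`, its `Chunk` type, and the chunking of a column into small arrays are all hypothetical. Each chunk records its minimum and maximum once, and a lookup skips any chunk whose range cannot contain the target.

```java
// Hypothetical sketch of min/max ("range-map") chunk skipping; not Drill's code.
final class RangeMapSketch {
    // A slice of one column plus the min/max statistics saved alongside it.
    static final class Chunk {
        final int[] values;
        final int min;
        final int max;
        Chunk(int... values) {
            this.values = values;
            int lo = Integer.MAX_VALUE;
            int hi = Integer.MIN_VALUE;
            for (int v : values) {
                lo = Math.min(lo, v);
                hi = Math.max(hi, v);
            }
            this.min = lo;
            this.max = hi;
        }
    }

    // Returns {matches, chunksScanned}: chunks whose [min, max] range
    // excludes the target are skipped without reading their values.
    static int[] countEqual(Chunk[] chunks, int target) {
        int matches = 0;
        int scanned = 0;
        for (Chunk c : chunks) {
            if (target < c.min || target > c.max) {
                continue; // value cannot be present in this chunk
            }
            scanned++;
            for (int v : c.values) {
                if (v == target) matches++;
            }
        }
        return new int[] {matches, scanned};
    }

    static Chunk[] sample() {
        return new Chunk[] {
            new Chunk(1, 2, 3), new Chunk(10, 12, 11), new Chunk(20, 25, 20)
        };
    }

    public static void main(String[] args) {
        int[] r = countEqual(sample(), 20);
        System.out.println(r[0] + " matches, " + r[1] + " chunk(s) scanned");
    }
}
```

The benefit depends on how well the data is clustered: if every chunk spans the full value range, nothing can be skipped.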
  70. Run-length-encoding & Sum
      • Dataset encoded as <val> <run-length>:
        – 2, 4 (4 2’s)
        – 8, 10 (10 8’s)
      • Goal: sum all the records
      • Normally:
        – Decompress: 2, 2, 2, 2, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8
        – Add: 2 + 2 + 2 + 2 + 8 + 8 + 8 + 8 + 8 + 8 + 8 + 8 + 8 + 8
      • Optimized work: 2 * 4 + 8 * 10
        – Less memory, fewer operations
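
The arithmetic on slide 70 can be checked with a short sketch. The class `RleSumSketch` and the `{value, runLength}` pair layout are assumptions made for illustration, not Drill's encoding format. One method sums directly on the encoded runs; the other decompresses first, as in the "normally" path on the slide.

```java
// Hypothetical sketch of summing run-length-encoded data; not Drill's code.
final class RleSumSketch {
    // Each run is {value, runLength}. Sums without decoding:
    // one multiply-add per run instead of one add per record.
    static long sumEncoded(int[][] runs) {
        long total = 0;
        for (int[] run : runs) {
            total += (long) run[0] * run[1];
        }
        return total;
    }

    // Expands the runs back into individual records.
    static int[] decode(int[][] runs) {
        int n = 0;
        for (int[] run : runs) n += run[1];
        int[] out = new int[n];
        int pos = 0;
        for (int[] run : runs) {
            for (int i = 0; i < run[1]; i++) out[pos++] = run[0];
        }
        return out;
    }

    // The "normal" path: decompress, then add record by record.
    static long sumDecoded(int[][] runs) {
        long total = 0;
        for (int v : decode(runs)) total += v;
        return total;
    }

    public static void main(String[] args) {
        int[][] runs = {{2, 4}, {8, 10}}; // four 2's, ten 8's
        System.out.println(sumEncoded(runs)); // 2 * 4 + 8 * 10
        System.out.println(sumDecoded(runs));
    }
}
```

Both paths give 88 for the slide's data; the encoded path does two multiply-adds and allocates nothing, while the decoded path materializes 14 values.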
  71. Bit-packed Dictionary Sort
      • Dataset encoded with a dictionary and bit-positions:
        – Dictionary: [Rupert, Bill, Larry] {0, 1, 2}
        – Values: [1,0,1,2,1,2,1,0]
      • Normal work
        – Decompress & store: Bill, Rupert, Bill, Larry, Bill, Larry, Bill, Rupert
        – Sort: ~24 comparisons of variable width strings
      • Optimized work
        – Sort dictionary: {Bill: 1, Larry: 2, Rupert: 0}
        – Sort bit-packed values
        – Work: max 3 string comparisons, ~24 comparisons of fixed-width dictionary bits
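
The optimized path on slide 71 can be sketched like this. It is a simplified illustration, not Drill's implementation: the class `DictionarySortSketch` is a hypothetical name, and plain `int` codes stand in for bit-packed values. String comparisons happen only while sorting the small dictionary; the values themselves are sorted as integers.

```java
import java.util.Arrays;

// Hypothetical sketch of sorting dictionary-encoded values; not Drill's code.
final class DictionarySortSketch {
    static String[] sortEncoded(String[] dictionary, int[] codes) {
        // 1. Sort the dictionary entries. This is the only step that
        //    compares variable-width strings.
        Integer[] order = new Integer[dictionary.length];
        for (int i = 0; i < order.length; i++) order[i] = i;
        Arrays.sort(order, (a, b) -> dictionary[a].compareTo(dictionary[b]));

        // 2. Map each original code to its rank in the sorted dictionary.
        int[] rank = new int[dictionary.length];
        for (int r = 0; r < order.length; r++) rank[order[r]] = r;

        // 3. Re-code the values by rank and sort them as fixed-width integers.
        int[] ranked = new int[codes.length];
        for (int i = 0; i < codes.length; i++) ranked[i] = rank[codes[i]];
        Arrays.sort(ranked);

        // 4. Decode only at the end, for display.
        String[] out = new String[codes.length];
        for (int i = 0; i < ranked.length; i++) out[i] = dictionary[order[ranked[i]]];
        return out;
    }

    public static void main(String[] args) {
        String[] dictionary = {"Rupert", "Bill", "Larry"}; // codes 0, 1, 2
        int[] codes = {1, 0, 1, 2, 1, 2, 1, 0};
        System.out.println(Arrays.toString(sortEncoded(dictionary, codes)));
    }
}
```

The re-coding step is what makes integer order agree with string order; sorting the original codes directly would put Rupert (code 0) first.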
  72. Drill 4-value semantics
      • SQL’s 3-valued semantics
        – True
        – False
        – Unknown
      • Drill adds fourth
        – Repeated
  73. Vectorization
      • Drill operates on more than one record at a time
        – Word-sized manipulations
        – SIMD-like instructions
          • GCC, LLVM and JVM all do various optimizations automatically
        – Manually code algorithms
      • Logical Vectorization
        – Bitmaps allow lightning fast null-checks
        – Avoid branching to speed CPU pipeline
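
The two "logical vectorization" bullets on slide 73 can be illustrated with a validity bitmap. This is a sketch under assumptions, not Drill's value-vector code: the class `NullBitmapSketch` and the convention that a set bit means "non-null" are choices made here. Counting non-null values works a 64-bit word at a time, and the sum uses the bit as a 0/1 multiplier instead of an `if` per record.

```java
// Hypothetical sketch of bitmap-based null handling; not Drill's code.
final class NullBitmapSketch {
    // Packs validity flags into 64-bit words; bit i set means values[i] is non-null.
    static long[] buildBitmap(boolean[] valid) {
        long[] bits = new long[(valid.length + 63) / 64];
        for (int i = 0; i < valid.length; i++) {
            if (valid[i]) bits[i >>> 6] |= 1L << (i & 63);
        }
        return bits;
    }

    // Null check for 64 records per step: one popcount per word.
    static int countNonNull(long[] bitmap) {
        int n = 0;
        for (long word : bitmap) n += Long.bitCount(word);
        return n;
    }

    // Sums only the non-null slots without a data-dependent branch:
    // the validity bit (0 or 1) multiplies the value.
    static long sum(int[] values, long[] bitmap) {
        long total = 0;
        for (int i = 0; i < values.length; i++) {
            long validBit = (bitmap[i >>> 6] >>> (i & 63)) & 1L;
            total += validBit * values[i];
        }
        return total;
    }

    public static void main(String[] args) {
        int[] values = {5, 99, 7, 99, 1}; // slots 1 and 3 are null; 99 is filler
        long[] bitmap = buildBitmap(new boolean[] {true, false, true, false, true});
        System.out.println(countNonNull(bitmap) + " non-null, sum " + sum(values, bitmap));
    }
}
```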
  74. Runtime Compilation is Faster
      • JIT is smart, but more gains with runtime compilation
      • Janino: Java-based Java compiler
      From http://bit.ly/16Xk32x
  75. Drill compiler
      [Diagram labels: CodeModel generates code; Janino compiles runtime byte-code; Precompiled byte-code templates; Merge byte-code of the two classes; Loaded class]
  76. Optimistic
      [Chart: speed vs. check-pointing, axis 0 to 160; labels: No need to checkpoint, Apache Drill, Checkpoint frequently]
  77. Optimistic Execution
      • Recovery code trivial
        – Running instances discard the failed query’s intermediate state
      • Pipelining possible
        – Send results as soon as batch is large enough
        – Requires barrier-less decomposition of query
  78. Batches of Values
      • Value vectors
        – List of values, with same schema
        – With the 4-value semantics for each value
      • Shipped around in batches
        – max 256k bytes in a batch
        – max 64K rows in a batch
      • RPC designed for multiple replies to a request
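
The two batch limits on slide 78 can be combined into a simple sizing rule. This is a back-of-the-envelope sketch, not Drill's batching logic: the class `BatchSketch` is hypothetical, "k" is assumed to mean 1024, rows are assumed to be fixed-width, and a row wider than the byte limit is assumed to travel alone in its own batch.

```java
// Hypothetical sketch of batch sizing under a byte cap and a row cap; not Drill's code.
final class BatchSketch {
    static final int MAX_BATCH_BYTES = 256 * 1024; // "max 256k bytes" (assuming k = 1024)
    static final int MAX_BATCH_ROWS = 64 * 1024;   // "max 64K rows"

    // Rows that fit in one batch: whichever limit is hit first applies.
    static int rowsPerBatch(int rowWidthBytes) {
        if (rowWidthBytes <= 0) {
            throw new IllegalArgumentException("row width must be positive");
        }
        int byBytes = MAX_BATCH_BYTES / rowWidthBytes;
        // Assumption: an oversized row still ships, one row per batch.
        return Math.max(1, Math.min(MAX_BATCH_ROWS, byBytes));
    }

    // Number of batches needed to ship totalRows rows (ceiling division).
    static long batchCount(long totalRows, int rowWidthBytes) {
        long per = rowsPerBatch(rowWidthBytes);
        return (totalRows + per - 1) / per;
    }

    public static void main(String[] args) {
        System.out.println(rowsPerBatch(4));          // 4-byte rows
        System.out.println(rowsPerBatch(16));         // 16-byte rows
        System.out.println(batchCount(100_000L, 16)); // batches for 100,000 such rows
    }
}
```

Under these assumptions the two limits coincide at 4-byte rows (65,536 rows is exactly 256 KiB); wider rows are bounded by bytes, narrower ones by the row count.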
  79. Pipelining
      • Record batches are pipelined between nodes
        – ~256kB usually
      • Unit of work for Drill
        – Operators work on a batch
      • Operator reconfiguration happens at batch boundaries
      [Diagram: three DrillBits]
  80. Pipelining Record Batches
      [Diagram components: SQL Parser, Pig Parser, HiveQL Parser; Logical Plan; Optimizer; Physical Plan; Scheduler; Foreman; Operators; RPC Endpoint, JDBC Endpoint, ODBC Endpoint; Distributed Cache; Storage Engine Interface; HDFS, HBase, Mongo, Cassandra]
  81. Pipelining
      • Random access: sort without copy or restructuring
      • Avoids serialization/deserialization
      • Off-heap (no GC woes when lots of memory)
      • Full specification + off-heap + batch
        – Enables C/C++ operators (fast!)
      • Read/write to disk
        – when data larger than memory
      [Diagram: Drill Bit memory; overflow uses disk]
  82. Cost-based Optimization
      • Using Optiq, an extensible framework
      • Pluggable rules, and cost model
      • Rules for distributed plan generation
      • Insert Exchange operator into physical plan
      • Optiq enhanced to explore parallel query plans
      • Pluggable cost model
        – CPU, IO, memory, network cost (data locality)
        – Storage engine features (HDFS vs Hive vs HBase)
      [Diagram: Query Optimizer with pluggable rules and pluggable cost model]
  83. Distributed Plan Cost
      • Operators have distribution property
        – Hash, Broadcast, Singleton, …
      • Exchange operator to enforce distributions
        – Hash: HashToRandomExchange
        – Broadcast: BroadcastExchange
        – Singleton: UnionExchange, SingleMergeExchange
      • Enumerate all, use cost to pick best
        – Merge Join vs Hash Join
        – Partition-based join vs Broadcast-based join
        – Streaming Aggregation vs Hash Aggregation
      • Aggregation in one phase or two phases
        – partial local aggregation followed by final aggregation
      [Diagram: Data, Data, Data; Sort; HashToRandomExchange; Streaming-Aggregation]
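
The last bullet of slide 83, partial local aggregation followed by final aggregation, can be sketched for a simple count-per-key. This is an illustration, not Drill's operators: the class `TwoPhaseAggSketch` is hypothetical, lists of strings stand in for per-node data, and merging maps in one loop stands in for the exchange between phases.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of two-phase aggregation (count per key); not Drill's code.
final class TwoPhaseAggSketch {
    // Phase 1: each node counts its own rows, so only one
    // (key, partial count) pair per distinct key leaves the node.
    static Map<String, Long> partialCount(List<String> localKeys) {
        Map<String, Long> counts = new HashMap<>();
        for (String k : localKeys) counts.merge(k, 1L, Long::sum);
        return counts;
    }

    // Phase 2: partial counts for the same key are added together.
    static Map<String, Long> finalCount(List<Map<String, Long>> partials) {
        Map<String, Long> counts = new HashMap<>();
        for (Map<String, Long> partial : partials) {
            for (Map.Entry<String, Long> e : partial.entrySet()) {
                counts.merge(e.getKey(), e.getValue(), Long::sum);
            }
        }
        return counts;
    }

    // Runs both phases over the rows held by each node.
    static Map<String, Long> countAcrossNodes(List<List<String>> nodes) {
        List<Map<String, Long>> partials = new ArrayList<>();
        for (List<String> node : nodes) partials.add(partialCount(node));
        return finalCount(partials);
    }

    static List<List<String>> sample() {
        return Arrays.asList(
            Arrays.asList("ERROR", "WARN", "ERROR"),
            Arrays.asList("ERROR", "INFO"),
            Arrays.asList("WARN"));
    }

    public static void main(String[] args) {
        System.out.println(countAcrossNodes(sample()));
    }
}
```

Whether two phases beat one depends on how much the local phase shrinks the data, which is the kind of trade-off the slide leaves to the cost model.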
  84. Interactive SQL-on-Hadoop options

                       | Drill 1.0                                      | Hive 0.13 w/ Tez            | Impala 1.x
        ---------------|------------------------------------------------|-----------------------------|---------------------------
        Latency        | Low                                            | Medium                      | Low
        Files          | Yes (all Hive file formats, plus JSON, Text, …) | Yes (all Hive file formats) | Yes (Parquet, Sequence, …)
        HBase/MapR-DB  | Yes                                            | Yes, perf issues            | Yes, with issues
        Schema         | Hive or schema-less                            | Hive                        | Hive
        SQL support    | ANSI SQL                                       | HiveQL                      | HiveQL (subset)
        Client support | ODBC/JDBC                                      | ODBC/JDBC                   | ODBC/JDBC
        Hive compat    | High                                           | High                        | Low
        Large datasets | Yes                                            | Yes                         | Limited
        Nested data    | Yes                                            | Limited                     | No
        Concurrency    | High                                           | Limited                     | Medium
  85. Apache Drill Roadmap
      • 1.0: Data exploration/ad-hoc queries
        – Low-latency SQL
        – Schema-less execution
        – Files & HBase/M7 support
        – Hive integration
        – BI and SQL tool support via ODBC/JDBC
      • 1.1: Advanced analytics and operational data
        – HBase query speedup
        – Nested data functions
        – Advanced SQL functionality
      • 2.0: Operational SQL
        – Ultra low latency queries
        – Single row insert/update/delete
        – Workload management
  86. MapR Distribution for Apache Hadoop [architecture diagram, repeated from slide 5]
  87. Apache Drill Resources
      • Drill 0.5 released last week
      • Getting started with Drill is easy
        – just download tarball and start running SQL queries on local files
      • Mailing lists
        – drill-user@incubator.apache.org
        – drill-dev@incubator.apache.org
      • Docs: https://cwiki.apache.org/confluence/display/DRILL/Apache+Drill+Wiki
      • Fork us on GitHub: http://github.com/apache/incubator-drill/
      • Create a JIRA: https://issues.apache.org/jira/browse/DRILL
  88. Active Drill Community
      • Large community, growing rapidly
        – 35-40 contributors, 16 committers
        – Microsoft, LinkedIn, Oracle, Facebook, Visa, Lucidworks, Concurrent, many universities
      • In 2014
        – over 20 meet-ups, many more coming soon
        – 3 hackathons, with 40+ participants
      • Encourage you to join, learn, contribute and have fun …
  89. Drill at MapR
      • World-class SQL team, ~20 people
      • 150+ years combined experience building commercial databases
        – Oracle, DB2, ParAccel, Teradata, SQLServer, Vertica
      • Team works on Drill, Hive, Impala
      • Fixed some of the toughest problems in Apache Hive
  90. Thank you!
      M. C. Srivas
      srivas@mapr.com
      Did I mention we are hiring…
