© MapR Technologies, confidential 
© 2014 MapR Technologies 
M. C. Srivas, CTO and Founder
MapR is Unbiased Open Source
Linux Is Unbiased 
• Linux provides choice of databases 
– MySQL 
– PostgreSQL 
– SQLite 
• Linux provides choice of web servers 
– Apache httpd 
– Nginx 
– Lighttpd
MapR Is Unbiased 
• MapR provides choice 

Component       | MapR Distribution for Hadoop                                   | Distribution C     | Distribution H 
Spark           | Spark (all of it) and SparkSQL                                 | Spark only         | No 
Interactive SQL | Impala, Drill, Hive/Tez, SparkSQL                              | One option (Impala)| One option (Hive/Tez) 
Scheduler       | YARN, Mesos                                                    | One option (YARN)  | One option (YARN) 
Versions        | Hive 0.10, 0.11, 0.12, 0.13; Pig 0.11, 0.12; HBase 0.94, 0.98  | One version        | One version
MapR Distribution for Apache Hadoop 

APACHE HADOOP AND OSS ECOSYSTEM (execution engines; data governance and operations) 
• Batch: Pig, Cascading, Spark, MapReduce v1 & v2 
• Streaming: Spark Streaming, Storm* 
• NoSQL & Search: HBase, Solr, Accumulo* 
• ML, Graph: Mahout, MLLib, GraphX 
• SQL: Hive, Impala, Shark, Drill*, Tez* 
• Workflow & Data Governance: Oozie, Falcon* 
• Data Integration & Access: Sqoop, Flume, HttpFS, Hue 
• Provisioning & Coordination: Juju, Savannah*, ZooKeeper, Whirr 
• Security: Sentry*, Knox* 
• Scheduling: YARN 

MapR Data Platform (Random Read/Write): Enterprise Grade, Data Hub, Operational 
• MapR-FS (POSIX) 
• MapR-DB (High-Performance NoSQL) 
• Access: NFS, HDFS API, HBase API, JSON API 

MapR Control System (Management and Monitoring): CLI, REST API, GUI 

* In roadmap for inclusion/certification
Hadoop as an augmentation for the EDW: Why?
But inside, it looks like this …
And this …
And this …
Consolidating schemas is very hard, and it creates silos.
Silos make analysis very difficult 
• How do I identify a unique {customer, trade} across data sets? 
• How can I guarantee the lack of anomalous behavior if I can’t see all data?
Hard to know what’s of value a priori
Why Hadoop
Rethink SQL for Big Data 

Preserve 
• ANSI SQL 
– Familiar and ubiquitous 
• Performance 
– Interactive nature crucial for BI/Analytics 
• One technology 
– Painful to manage different technologies 
• Enterprise ready 
– System-of-record, HA, DR, Security, Multi-tenancy, … 

Invent 
• Flexible data-model 
– Allow schemas to evolve rapidly 
– Support semi-structured data types 
• Agility 
– Self-service possible when the developer and the DBA are the same person 
• Scalability 
– In all dimensions: data, speed, schemas, processes, management
SQL is here to stay
Hadoop is here to stay
YOU CAN’T HANDLE REAL SQL
SQL 
select * from A 
where exists ( 
    select 1 from B where B.b < 100 ); 
• Did you know most SQL-on-Hadoop engines cannot compute this? 
– e.g., Hive, Impala, Spark/Shark
Self-described Data 
select cf.month, cf.year 
from hbase.table1; 
• Did you know normal SQL cannot handle the above? 
• Nor can Hive, or engines like Impala and Shark.
Self-described Data 
select cf.month, cf.year 
from hbase.table1; 
• Why? 
• Because there’s no meta-store definition available
Self-Describing Data Is Ubiquitous 

Centralized schema 
- Static 
- Managed by the DBAs 
- In a centralized repository 
- Long, meticulous data preparation process (ETL, create/alter schema, etc.) – can take 6-18 months 

Self-describing, or schema-less, data 
- Dynamic/evolving 
- Managed by the applications 
- Embedded in the data 
- Less schema; more suitable for data with higher volume, variety and velocity 
- This is the data Apache Drill targets
A Quick Tour through Apache Drill
Data Source is in the Query 

select timestamp, message 
from dfs1.logs.`AppServerLogs/2014/Jan/p001.parquet` 
where errorLevel > 2 

`dfs1` is a cluster in Apache Drill 
- DFS 
- HBase 
- Hive meta-store 
`logs` is a work-space 
- Typically a sub-directory 
- A Hive database 
`AppServerLogs/2014/Jan/p001.parquet` is a table 
- Pathnames 
- HBase table 
- Hive table
Combine data sources on the fly 
• JSON 
• CSV 
• ORC (i.e., all Hive types) 
• Parquet 
• HBase tables 
• … and you can combine them: 

select USERS.name, USERS.emails.work 
from 
  dfs.logs.`/data/logs` LOGS, 
  dfs.users.`/profiles.json` USERS 
where 
  LOGS.uid = USERS.uid and 
  errorLevel > 5 
order by count(*);
Can be an entire directory tree 

// On a file 
select errorLevel, count(*) 
from dfs.logs.`/AppServerLogs/2014/Jan/part0001.parquet` 
group by errorLevel; 

// On the entire data collection: all years, all months 
select errorLevel, count(*) 
from dfs.logs.`/AppServerLogs` 
group by errorLevel; 

// Path components become implicit columns: dirs[1] is the year 
// sub-directory, dirs[2] the month; they can be filtered and selected 
select errorLevel, dirs[2], count(*) 
from dfs.logs.`/AppServerLogs` 
where dirs[1] > 2012 
group by errorLevel, dirs[2];
Querying JSON 

donuts.json: 
{ name: "classic", 
  fillings: [ 
    { name: "sugar", cal: 400 } ] } 
{ name: "choco", 
  fillings: [ 
    { name: "sugar", cal: 400 }, 
    { name: "chocolate", cal: 300 } ] } 
{ name: "bostoncreme", 
  fillings: [ 
    { name: "sugar", cal: 400 }, 
    { name: "cream", cal: 1000 }, 
    { name: "jelly", cal: 600 } ] }
Cursors inside Drill 

DrillClient drill = new DrillClient().connect( … ); 
ResultReader r = drill.runSqlQuery( "select * from `donuts.json`" ); 
while ( r.next() ) { 
  String donutName = r.reader( "name" ).readString(); 
  ListReader fillings = r.reader( "fillings" ); 
  while ( fillings.next() ) { 
    int calories = fillings.reader( "cal" ).readInteger(); 
    if (calories > 400) 
      print( donutName, calories, fillings.reader( "name" ).readString() ); 
  } 
}
Direct queries on nested data 

// Flattening maps in JSON, parquet and other nested records 
select name, flatten(fillings) as f 
from dfs.users.`/donuts.json` 
where f.cal < 300; 
// lists the fillings < 300 calories
Complex Data Using SQL or Fluent API 

// SQL 
Result r = drill.sql( "select name, flatten(fillings) from 
  `donuts.json` where fillings.cal < 300" ); 

// or Fluent API 
Result r = drill.table( "donuts.json" ) 
  .lt( "fillings.cal", 300 ).all(); 

while ( r.next() ) { 
  String name = r.get( "name" ).string(); 
  ListReader fillings = r.get( "fillings" ).list(); 
  while ( fillings.next() ) { 
    print( name, fillings.get( "cal" ).integer(), fillings.get( "name" ).string() ); 
  } 
} 

{ name: "classic", 
  fillings: [ 
    { name: "sugar", cal: 400 } ] } 
{ name: "choco", 
  fillings: [ 
    { name: "sugar", cal: 400 }, 
    { name: "plain", cal: 280 } ] } 
{ name: "bostoncreme", 
  fillings: [ 
    { name: "sugar", cal: 400 }, 
    { name: "cream", cal: 1000 }, 
    { name: "jelly", cal: 600 } ] }
Queries on embedded data 

// embedded JSON value inside column donut-json inside column-family cf1 
// of an hbase table donuts 
select d.name, count( d.fillings ) 
from ( 
  select convert_from( cf1.`donut-json`, 'JSON' ) as d 
  from hbase.user.`donuts` );
Queries inside JSON records 

// Each JSON record itself can be a whole database 
// example: get all donuts with at least 1 filling with > 300 calories 
select d.name, count( d.fillings ), 
  max( d.fillings.cal ) within record as maxcal 
from ( select convert_from( cf1.`donut-json`, 'JSON' ) as d 
  from hbase.user.`donuts` ) 
where maxcal > 300;
• Schema can change over the course of a query 
• Operators are able to reconfigure themselves on schema change events 
– Minimize flexibility overhead 
– Support more advanced execution optimization based on actual data characteristics
De-centralized metadata 

// count the number of tweets per customer, where the customers are in Hive, 
// and their tweets are in HBase. Note that the hbase data has no meta-data 
select c.customerName, hb.tweets.count 
from hive.CustomersDB.`Customers` c 
join hbase.user.`SocialData` hb 
  on c.customerId = convert_from( hb.rowkey, 'UTF8' );
So what does this all mean?
A Drill Database 
• What is a database with Drill/MapR? 
• Just a directory, with a bunch of related files 
• There’s no need for artificial boundaries 
– No need to bunch a set of tables together to call it a “database”
A Drill Database 

/user/srivas/work/bugs 

BugList: 
symptom      | version | date    | bugid | dump-name 
impala crash | 3.1.1   | 14/7/14 | 12345 | cust1.tgz 
cldb slow    | 3.1.0   | 12/7/14 | 45678 | cust2.tgz 

Customers: 
name | rep   | se      | dump-name 
xxxx | dkim  | junhyuk | cust1.tgz 
yyyy | yoshi | aki     | cust2.tgz
Queries are simple 

select b.bugid, b.symptom, b.date 
from dfs.bugs.`/Customers` c, dfs.bugs.`/BugList` b 
where c.`dump-name` = b.`dump-name` 

Let’s say I want to cross-reference against your list: 

select bugid, symptom 
from dfs.bugs.`/BugList` b, dfs.yourbugs.`/YourBugFile` b2 
where b.bugid = b2.xxx
What does it mean? 
• No ETL 
• Reach out directly to the particular table/file 
• As long as the permissions allow it, you can do it 
• No meta-data needed
Another example 

select d.name, count( d.fillings ) 
from ( select convert_from( cf1.`donut-json`, 'JSON' ) as d 
  from hbase.user.`donuts` ); 

• convert_from( xx, 'JSON' ) invokes the JSON parser inside Drill 
• What if you could plug in any parser? 
– XML? 
– Semi-conductor yield-analysis files? Oil-exploration readings? 
– Telescope readings of stars? 
– RFIDs of various things?
No ETL 
• Drill queries the raw data directly 
• Raw data can be joined with processed data 
• No ETL needed, and that is very, very powerful
Seamless integration with Apache Hive 
• Low latency queries on Hive tables 
• Support for 100s of Hive file formats 
• Ability to reuse Hive UDFs 
• Support for multiple Hive metastores in a single query
Underneath the Covers
Basic Process 

A Drillbit (each with a distributed cache) runs on every node, alongside DFS/HBase; ZooKeeper coordinates the cluster. 

1. Query comes to any Drillbit (JDBC, ODBC, CLI, protobuf) 
2. Drillbit generates an execution plan based on query optimization & locality 
3. Fragments are farmed out to individual nodes 
4. Result is returned to the driving node
Stages of Query Planning 

SQL Query → Parser → Logical Planner (heuristic and cost based) → Physical Planner (cost based) → Foreman sends plan fragments to drillbits
Query Execution 

• Client endpoints: JDBC, ODBC, RPC 
• Parsers: SQL, HiveQL, Pig 
• Foreman: Optimizer (Logical Plan → Physical Plan), Scheduler 
• Drillbit services: Operators, Distributed Cache 
• Storage Engine Interface: HDFS, HBase, Mongo, Cassandra
A Query engine that is… 
• Columnar/Vectorized 
• Optimistic/pipelined 
• Runtime compilation 
• Late binding 
• Extensible
Columnar representation 

A row layout stores each record’s fields A B C D E together; the columnar layout stores all values of A together on disk, then all of B, and so on.
Columnar Encoding 
• Values in a column are stored next to one another 
– Better compression 
– Range-map: store the min-max per chunk, so a chunk can be skipped if the value can’t be present 
• Only retrieve columns participating in the query 
• Aggregations can be performed without decoding
Run-length-encoding & Sum 
• Dataset encoded as <val> <run-length>: 
– 2, 4 (four 2’s) 
– 8, 10 (ten 8’s) 
• Goal: sum all the records 
• Normally: 
– Decompress: 2, 2, 2, 2, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8 
– Add: 2 + 2 + 2 + 2 + 8 + 8 + 8 + 8 + 8 + 8 + 8 + 8 + 8 + 8 
• Optimized work: 2 * 4 + 8 * 10 
– Less memory, fewer operations
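The arithmetic above can be sketched in a few lines of Java (a toy illustration, not Drill’s actual code): summing directly over the (value, run-length) pairs touches two runs instead of fourteen values.

```java
// Toy illustration (not Drill's code): summing a run-length-encoded column
// without decompressing it. The pairs mirror the example: four 2's, ten 8's.
class RleSum {
    static long sumEncoded(long[][] runs) {
        long total = 0;
        for (long[] run : runs) {
            long value = run[0], runLength = run[1];
            total += value * runLength;   // one multiply-add per run
        }
        return total;
    }

    public static void main(String[] args) {
        long[][] runs = { { 2, 4 }, { 8, 10 } };
        System.out.println(sumEncoded(runs)); // prints 88 (= 2*4 + 8*10)
    }
}
```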
Bit-packed Dictionary Sort 
• Dataset encoded with a dictionary and bit-positions: 
– Dictionary: [Rupert, Bill, Larry] {0, 1, 2} 
– Values: [1,0,1,2,1,2,1,0] 
• Normal work 
– Decompress & store: Bill, Rupert, Bill, Larry, Bill, Larry, Bill, Rupert 
– Sort: ~24 comparisons of variable width strings 
• Optimized work 
– Sort dictionary: {Bill: 1, Larry: 2, Rupert: 0} 
– Sort bit-packed values 
– Work: max 3 string comparisons, ~24 comparisons of fixed-width dictionary bits
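A toy Java sketch of the same idea (hypothetical, not Drill’s implementation): rank the small dictionary once with a handful of string comparisons, then sort only the fixed-width codes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Toy illustration of sorting dictionary-encoded data: rank the dictionary
// once, then all remaining comparisons are on fixed-width integer codes.
class DictSort {
    static List<String> sortEncoded(String[] dict, int[] codes) {
        // Rank each dictionary entry: only |dict| values get string-compared.
        Integer[] byValue = new Integer[dict.length];
        for (int i = 0; i < dict.length; i++) byValue[i] = i;
        Arrays.sort(byValue, Comparator.comparing((Integer i) -> dict[i]));
        int[] rank = new int[dict.length];
        for (int r = 0; r < dict.length; r++) rank[byValue[r]] = r;

        // Sort the packed codes by rank: fixed-width comparisons only.
        int[] sorted = Arrays.stream(codes).boxed()
                .sorted(Comparator.comparingInt((Integer c) -> rank[c]))
                .mapToInt(Integer::intValue).toArray();

        // Decode for display; a real engine could keep the codes packed.
        List<String> out = new ArrayList<>();
        for (int c : sorted) out.add(dict[c]);
        return out;
    }

    public static void main(String[] args) {
        String[] dict = { "Rupert", "Bill", "Larry" };   // codes 0, 1, 2
        int[] values = { 1, 0, 1, 2, 1, 2, 1, 0 };
        System.out.println(sortEncoded(dict, values));
        // [Bill, Bill, Bill, Bill, Larry, Larry, Rupert, Rupert]
    }
}
```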
Drill 4-value semantics 
• SQL’s 3-valued semantics 
– True 
– False 
– Unknown 
• Drill adds fourth 
– Repeated
Vectorization 
• Drill operates on more than one record at a time 
– Word-sized manipulations 
– SIMD-like instructions 
• GCC, LLVM and the JVM all do various optimizations automatically; key algorithms are also manually coded 
• Logical vectorization 
– Bitmaps allow lightning-fast null-checks 
– Avoid branching to speed up the CPU pipeline
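As a minimal illustration of the bitmap null-check idea (a toy, not Drill’s code): with a null bitmap, counting the populated slots of a 64-value vector is a single popcount on one machine word instead of 64 branches.

```java
// Toy illustration: a null bitmap turns 64 per-value null checks into one
// branch-free popcount over a single machine word.
class NullBitmap {
    static int countNonNull(long bitmapWord) {
        return Long.bitCount(bitmapWord); // maps to a single POPCNT on x86
    }

    public static void main(String[] args) {
        long bitmap = 0b1011L;            // slots 0, 1, 3 set; slot 2 is null
        System.out.println(countNonNull(bitmap)); // prints 3
    }
}
```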
Runtime Compilation is Faster 
• JIT is smart, but there are more gains with runtime compilation 
• Janino: a Java-based Java compiler 
• Benchmark chart from http://bit.ly/16Xk32x
Drill compiler 

CodeModel generates code → Janino compiles the runtime byte-code → the runtime byte-code is merged with precompiled byte-code templates → the merged class is loaded
Optimistic 

Chart (speed vs. check-pointing): Apache Drill has no need to checkpoint, while engines that checkpoint frequently pay a large performance penalty.
Optimistic Execution 
• Recovery code trivial 
– Running instances discard the failed query’s intermediate state 
• Pipelining possible 
– Send results as soon as batch is large enough 
– Requires barrier-less decomposition of query
Batches of Values 
• Value vectors 
– List of values, with same schema 
– With the 4-value semantics for each value 
• Shipped around in batches 
– max 256k bytes in a batch 
– max 64K rows in a batch 
• RPC designed for multiple replies to a request
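A rough sketch of these limits (all names invented for illustration): a batch of same-schema values that refuses additions once the row or byte cap is hit, at which point it would be shipped downstream as one unit.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch (names invented): a value-vector batch holds values of
// one schema plus a null bitmap, and enforces the 64K-row / 256KB caps so a
// full batch can be pipelined to the next drillbit as a unit.
class ValueVectorBatch {
    static final int MAX_ROWS = 64 * 1024;     // max 64K rows per batch
    static final int MAX_BYTES = 256 * 1024;   // max 256KB per batch

    final List<Integer> values = new ArrayList<>();
    long nullBitmap = 0;   // toy: non-null flags for the first 64 slots only
    int bytesUsed = 0;

    // Returns false when the batch is full and must be shipped downstream.
    boolean tryAdd(Integer v) {
        if (values.size() >= MAX_ROWS || bytesUsed + Integer.BYTES > MAX_BYTES)
            return false;
        if (v != null && values.size() < 64)
            nullBitmap |= 1L << values.size(); // mark the slot as populated
        values.add(v == null ? 0 : v);         // nulls keep a placeholder slot
        bytesUsed += Integer.BYTES;
        return true;
    }

    public static void main(String[] args) {
        ValueVectorBatch batch = new ValueVectorBatch();
        batch.tryAdd(7);
        batch.tryAdd(null);
        System.out.println(batch.values.size()); // prints 2
    }
}
```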
Pipelining 
• Record batches are pipelined between drillbits 
– ~256KB usually 
• Unit of work for Drill 
– Operators work on a batch 
• Operator reconfiguration happens at batch boundaries
Pipelining Record Batches 

(The record batches flow through the same execution architecture described under Query Execution.)
Pipelining 
• Random access: sort without copy or restructuring 
• Avoids serialization/deserialization 
• Off-heap (no GC woes when using lots of memory) 
• Full specification + off-heap + batch 
– Enables C/C++ operators (fast!) 
• Read/write to disk 
– Memory overflow uses disk when data is larger than memory
Cost-based Optimization 
• Uses Optiq, an extensible optimizer framework 
• Pluggable rules and a pluggable cost model 
• Rules for distributed plan generation 
– Insert Exchange operators into the physical plan 
• Optiq enhanced to explore parallel query plans 
• Pluggable cost model 
– CPU, IO, memory, network cost (data locality) 
– Storage engine features (HDFS vs Hive vs HBase)
Distributed Plan Cost 
• Operators have a distribution property 
– Hash, Broadcast, Singleton, … 
• Exchange operators enforce distributions 
– Hash: HashToRandomExchange 
– Broadcast: BroadcastExchange 
– Singleton: UnionExchange, SingleMergeExchange 
• Enumerate all alternatives, use cost to pick the best 
– Merge Join vs Hash Join 
– Partition-based join vs Broadcast-based join 
– Streaming Aggregation vs Hash Aggregation 
– Aggregation in one phase or two phases (partial local aggregation followed by final aggregation) 

Diagram: a distributed plan combining HashToRandomExchange, Sort and Streaming-Aggregation over the data nodes.
Interactive SQL-on-Hadoop options 

                | Drill 1.0                                       | Hive 0.13 w/ Tez            | Impala 1.x 
Latency         | Low                                             | Medium                      | Low 
Files           | Yes (all Hive file formats, plus JSON, Text, …) | Yes (all Hive file formats) | Yes (Parquet, Sequence, …) 
HBase/MapR-DB   | Yes                                             | Yes, perf issues            | Yes, with issues 
Schema          | Hive or schema-less                             | Hive                        | Hive 
SQL support     | ANSI SQL                                        | HiveQL                      | HiveQL (subset) 
Client support  | ODBC/JDBC                                       | ODBC/JDBC                   | ODBC/JDBC 
Hive compat     | High                                            | High                        | Low 
Large datasets  | Yes                                             | Yes                         | Limited 
Nested data     | Yes                                             | Limited                     | No 
Concurrency     | High                                            | Limited                     | Medium
Apache Drill Roadmap 

1.0: Data exploration/ad-hoc queries 
• Low-latency SQL 
• Schema-less execution 
• Files & HBase/M7 support 
• Hive integration 
• BI and SQL tool support via ODBC/JDBC 

1.1: Advanced analytics and operational data 
• HBase query speedup 
• Nested data functions 
• Advanced SQL functionality 

2.0: Operational SQL 
• Ultra low latency queries 
• Single row insert/update/delete 
• Workload management
Apache Drill Resources 
• Drill 0.5 released last week 
• Getting started with Drill is easy 
– just download tarball and start running SQL queries on local files 
• Mailing lists 
– drill-user@incubator.apache.org 
– drill-dev@incubator.apache.org 
• Docs: https://cwiki.apache.org/confluence/display/DRILL/Apache+Drill+Wiki 
• Fork us on GitHub: http://github.com/apache/incubator-drill/ 
• Create a JIRA: https://issues.apache.org/jira/browse/DRILL
Active Drill Community 
• Large community, growing rapidly 
– 35-40 contributors, 16 committers 
– Microsoft, LinkedIn, Oracle, Facebook, Visa, Lucidworks, Concurrent, many universities 
• In 2014 
– Over 20 meet-ups, many more coming soon 
– 3 hackathons, with 40+ participants 
• We encourage you to join, learn, contribute and have fun …
Drill at MapR 
• World-class SQL team, ~20 people 
• 150+ years combined experience building commercial databases 
– Oracle, DB2, ParAccel, Teradata, SQLServer, Vertica 
• Team works on Drill, Hive, Impala 
• Fixed some of the toughest problems in Apache Hive
Thank you! 
M. C. Srivas 
srivas@mapr.com 
Did I mention we are hiring…

Apache Drill - Why, What, How

  • 1.
    © MapR Technologies,confidential © 2014 MapR Technologies M. C. Srivas, CTO and Founder
  • 2.
    © MapR Technologies,confidential MapR is Unbiased Open Source
  • 3.
    © MapR Technologies,confidential Linux Is Unbiased • Linux provides choice – MySQL – PostgreSQL – SQLite • Linux provides choice – Apache httpd – Nginx – Lighttpd
  • 4.
    © MapR Technologies,confidential MapR Is Unbiased • MapR provides choice MapR Distribution for Hadoop Distribution C Distribution H Spark Spark (all of it) and SparkSQL Spark only No Interactive SQL Impala, Drill, Hive/Tez, SparkSQL One option (Impala) One option (Hive/Tez) Scheduler YARN, Mesos One option (YARN) One option (YARN) Versions Hive 0.10, 0.11, 0.12, 0.13 Pig 0.11, 012 HBase 0.94, 0.98 One version One version
  • 5.
    © MapR Technologies,confidential MapR Distribution for Apache Hadoop MapR Data Platform (Random Read/Write) Enterprise Grade Data Hub Operational MapR-FS (POSIX) MapR-DB (High-Performance NoSQL) Security YARN Pig Cascading Spark Batch Spark Streaming Storm* Streaming HBase Solr NoSQL & Search Juju Provisioning & Coordination Savannah* Mahout MLLib ML, Graph GraphX MapReduc e v1 & v2 APACHE HADOOP AND OSS ECOSYSTEM EXECUTION ENGINES DATA GOVERNANCE AND OPERATIONS Workflow & Data Tez* Governance Accumulo* Hive Impala Shark Drill* SQL Sqoop Sentry* Oozie ZooKeeper Flume Knox* Falcon* Whirr Data Integration & Access HttpFS Hue NFS HDFS API HBase API JSON API MapR Control System (Management and Monitoring) * In Roadmap for inclusion/certification CLI REST API GUI
  • 6.
  • 7.
    © MapR Technologies,confidential Hadoop an augmentation for EDW—Why?
  • 8.
  • 9.
  • 10.
  • 11.
  • 12.
  • 13.
    © MapR Technologies,confidential But inside, it looks like this …
  • 14.
    © MapR Technologies,confidential And this …
  • 15.
    © MapR Technologies,confidential And this …
  • 16.
    © MapR Technologies,confidential Consolidating schemas is very hard.
  • 17.
    © MapR Technologies,confidential Consolidating schemas is very hard, causes SILOs
  • 18.
    © MapR Technologies,confidential Silos make analysis very difficult • How do I identify a unique {customer, trade} across data sets? • How can I guarantee the lack of anomalous behavior if I can’t see all data?
  • 19.
    © 2014 MapRTechnologies 19 Hard to know what’s of value a priori
  • 20.
    © 2014 MapRTechnologies 20 Hard to know what’s of value a priori
  • 21.
    © 2014 MapRTechnologies 21 Why Hadoop
  • 22.
    © MapR Technologies,confidential Rethink SQL for Big Data Preserve •ANSI SQL • Familiar and ubiquitous • Performance • Interactive nature crucial for BI/Analytics • One technology • Painful to manage different technologies • Enterprise ready • System-of-record, HA, DR, Security, Multi-tenancy, …
  • 23.
    © MapR Technologies,confidential Rethink SQL for Big Data Preserve •ANSI SQL • Familiar and ubiquitous • Performance • Interactive nature crucial for BI/Analytics • One technology • Painful to manage different technologies • Enterprise ready • System-of-record, HA, DR, Security, Multi-tenancy, … Invent • Flexible data-model • Allow schemas to evolve rapidly • Support semi-structured data types • Agility • Self-service possible when developer and DBA is same • Scalability • In all dimensions: data, speed, schemas, processes, management
  • 24.
    © MapR Technologies,confidential SQL is here to stay
  • 25.
    © MapR Technologies,confidential Hadoop is here to stay
  • 26.
    © MapR Technologies,confidential YOU CAN’T HANDLE REAL SQL
  • 27.
    © MapR Technologies,confidential SQL select * from A where exists ( select 1 from B where B.b < 100 ); • Did you know Apache HIVE cannot compute it? – eg, Hive, Impala, Spark/Shark
  • 28.
    © MapR Technologies,confidential Self-described Data select cf.month, cf.year from hbase.table1; • Did you know normal SQL cannot handle the above? • Nor can HIVE and its variants like Impala, Shark?
  • 29.
    © MapR Technologies,confidential Self-described Data select cf.month, cf.year from hbase.table1; • Why? • Because there’s no meta-store definition available
  • 30.
    © MapR Technologies,confidential Self-Describing Data Ubiquitous Centralized schema - Static - Managed by the DBAs - In a centralized repository Long, meticulous data preparation process (ETL, create/alter schema, etc.) – can take 6-18 months Self-describing, or schema-less, data - Dynamic/evolving - Managed by the applications - Embedded in the data Less schema, more suitable for data that has higher volume, variety and velocity Apache Drill
  • 31.
    © MapR Technologies,confidential A Quick Tour through Apache Drill
  • 32.
    © MapR Technologies,confidential Data Source is in the Query select timestamp, message from dfs1.logs.`AppServerLogs/2014/Jan/p001.parquet` where errorLevel > 2
  • 33.
    © MapR Technologies,confidential Data Source is in the Query select timestamp, message from dfs1.logs.`AppServerLogs/2014/Jan/p001.parquet` where errorLevel > 2 This is a cluster in Apache Drill - DFS - HBase - Hive meta-store
  • 34.
    © MapR Technologies,confidential Data Source is in the Query select timestamp, message from dfs1.logs.`AppServerLogs/2014/Jan/p001.parquet` where errorLevel > 2 This is a cluster in Apache Drill - DFS - HBase - Hive meta-store A work-space - Typically a sub-directory - HIVE database
  • 35.
    © MapR Technologies,confidential Data Source is in the Query select timestamp, message from dfs1.logs.`AppServerLogs/2014/Jan/p001.parquet` where errorLevel > 2 This is a cluster in Apache Drill - DFS - HBase - Hive meta-store A work-space - Typically a sub-directory - HIVE database A table - pathnames - Hbase table - Hive table
  • 36.
    © MapR Technologies,confidential Combine data sources on the fly • JSON • CSV • ORC (ie, all Hive types) • Parquet • HBase tables • … can combine them Select USERS.name, USERS.emails.work from dfs.logs.`/data/logs` LOGS, dfs.users.`/profiles.json` USERS, where LOGS.uid = USERS.uid and errorLevel > 5 order by count(*);
  • 37.
    © MapR Technologies,confidential Can be an entire directory tree // On a file select errorLevel, count(*) from dfs.logs.`/AppServerLogs/2014/Jan/part0001.parquet` group by errorLevel;
  • 38.
    © MapR Technologies,confidential Can be an entire directory tree // On a file select errorLevel, count(*) from dfs.logs.`/AppServerLogs/2014/Jan/part0001.parquet` group by errorLevel; // On the entire data collection: all years, all months select errorLevel, count(*) from dfs.logs.`/AppServerLogs` group by errorLevel
  • 39.
    © MapR Technologies,confidential Can be an entire directory tree // On a file select errorLevel, count(*) from dfs.logs.`/AppServerLogs/2014/Jan/part0001.parquet` group by errorLevel; // On the entire data collection: all years, all months select errorLevel, count(*) from dfs.logs.`/AppServerLogs` group by errorLevel dirs[1] dirs[2]
  • 40.
    © MapR Technologies,confidential Can be an entire directory tree // On a file select errorLevel, count(*) from dfs.logs.`/AppServerLogs/2014/Jan/part0001.parquet` group by errorLevel; // On the entire data collection: all years, all months select errorLevel, count(*) from dfs.logs.`/AppServerLogs` group by errorLevel where dirs[1] > 2012 , dirs[2] dirs[1] dirs[2]
  • 41.
    © MapR Technologies,confidential Querying JSON { name: classic fillings: [ { name: sugar cal: 400 }]} { name: choco fillings: [ { name: sugar cal: 400 } { name: chocolate cal: 300 }]} { name: bostoncreme fillings: [ { name: sugar cal: 400 } { name: cream cal: 1000 } { name: jelly cal: 600 }]} donuts.json
  • 42.
    © MapR Technologies,confidential Cursors inside Drill DrillClient drill = new DrillClient().connect( …); ResultReader r = drill.runSqlQuery( "select * from `donuts.json`"); while( r.next()) { String donutName = r.reader( “name").readString(); ListReader fillings = r.reader( "fillings"); while( fillings.next()) { int calories = fillings.reader( "cal").readInteger(); if (calories > 400) print( donutName, calories, fillings.reader( "name").readString()); } } { name: classic fillings: [ { name: sugar cal: 400 }]} { name: choco fillings: [ { name: sugar cal: 400 } { name: chocolate cal: 300 }]} { name: bostoncreme fillings: [ { name: sugar cal: 400 } { name: cream cal: 1000 } { name: jelly cal: 600 }]}
  • 43.
    © MapR Technologies,confidential Direct queries on nested data // Flattening maps in JSON, parquet and other nested records select name, flatten(fillings) as f from dfs.users.`/donuts.json` where f.cal < 300; // lists the fillings < 300 calories { name: classic fillings: [ { name: sugar cal: 400 }]} { name: choco fillings: [ { name: sugar cal: 400 } { name: chocolate cal: 300 }]} { name: bostoncreme fillings: [ { name: sugar cal: 400 } { name: cream cal: 1000 } { name: jelly cal: 600 }]}
  • 44.
Complex Data Using SQL or Fluent API

// SQL
Result r = drill.sql("select name, flatten(fillings) as f " +
                     "from `donuts.json` where f.cal < 300");

// or Fluent API
Result r = drill.table("donuts.json").lt("fillings.cal", 300).all();

while (r.next()) {
  String name = r.get("name").string();
  ListReader fillings = r.get("fillings").list();
  while (fillings.next()) {
    int calories = fillings.get("cal").integer();
    print(name, calories, fillings.get("name").string());
  }
}

Queries on embedded data

// embedded JSON value inside column donut-json, inside column family cf1
// of the HBase table donuts
select d.name, count(d.fillings)
from (
  select convert_from(cf1.`donut-json`, 'JSON') as d
  from hbase.user.`donuts`
);

Queries inside JSON records

// Each JSON record itself can be a whole database
// example: get all donuts with at least 1 filling with > 300 calories
select d.name, count(d.fillings),
       max(d.fillings.cal) within record as maxcal
from (
  select convert_from(cf1.`donut-json`, 'JSON') as d
  from hbase.user.`donuts`
)
where maxcal > 300;

Schema-less execution

• Schema can change over the course of a query
• Operators are able to reconfigure themselves on schema change events
  – Minimize flexibility overhead
  – Support more advanced execution optimization based on actual data characteristics

De-centralized metadata

// count the number of tweets per customer, where the customers are in Hive
// and their tweets are in HBase; note that the HBase data has no metadata
select c.customerName, hb.tweets.count
from hive.CustomersDB.`Customers` c
join hbase.user.`SocialData` hb
  on c.customerId = convert_from(hb.rowkey, 'UTF8');

So what does this all mean?

A Drill Database

• What is a database with Drill/MapR?
• Just a directory, with a bunch of related files
• There's no need for artificial boundaries
  – No need to bunch a set of tables together to call it a "database"

A Drill Database

/user/srivas/work/bugs

BugList:
  symptom       version  date     bugid  dump-name
  impala crash  3.1.1    14/7/14  12345  cust1.tgz
  cldb slow     3.1.0    12/7/14  45678  cust2.tgz

Customers:
  name  rep    se       dump-name
  xxxx  dkim   junhyuk  cust1.tgz
  yyyy  yoshi  aki      cust2.tgz

Queries are simple

select b.bugid, b.symptom, b.date
from dfs.bugs.`/Customers` c, dfs.bugs.`/BugList` b
where c.`dump-name` = b.`dump-name`

Let's say I want to cross-reference against your list:

select bugid, symptom
from dfs.bugs.`/BugList` b, dfs.yourbugs.`/YourBugFile` b2
where b.bugid = b2.xxx

What does it mean?

• No ETL
• Reach out directly to the particular table/file
• As long as the permissions are fine, you can do it
• No metadata needed

Another example

select d.name, count(d.fillings)
from (
  select convert_from(cf1.`donut-json`, 'JSON') as d
  from hbase.user.`donuts`
);

• convert_from(xx, 'JSON') invokes the JSON parser inside Drill
• What if you could plug in any parser?
  – XML?
  – Semiconductor yield-analysis files? Oil-exploration readings?
  – Telescope readings of stars? RFIDs of various things?

No ETL

• Basically, Drill is querying the raw data directly
• Joining it with processed data
• No ETL needed
• Folks, this is very, very powerful

Seamless integration with Apache Hive

• Low-latency queries on Hive tables
• Support for 100s of Hive file formats
• Ability to reuse Hive UDFs
• Support for multiple Hive metastores in a single query

Underneath the Covers

Basic Process

1. Query comes to any Drillbit (via JDBC, ODBC, CLI or protobuf)
2. The Drillbit generates an execution plan based on query optimization and locality
3. Plan fragments are farmed out to individual nodes
4. The result is returned to the driving node

(Diagram: ZooKeeper coordinating Drillbits, each with a distributed cache, running against DFS/HBase.)

Stages of Query Planning

SQL query → Parser → Logical planner (heuristic and cost based) → Physical planner (cost based) → Foreman sends plan fragments to the Drillbits

Query Execution

(Diagram: JDBC, ODBC and RPC endpoints feed the Foreman; SQL, HiveQL and Pig parsers produce a logical plan; the optimizer and scheduler turn it into a physical plan; operators run through the storage engine interface against HDFS, HBase, Mongo and Cassandra, backed by a distributed cache.)

A Query engine that is…

• Columnar/vectorized
• Optimistic/pipelined
• Runtime compilation
• Late binding
• Extensible

Columnar representation

(Diagram: a record with fields A B C D E stored row-wise in memory, but column-by-column A, B, C, D, E on disk.)

Columnar Encoding

• Values in a column are stored next to one another
  – Better compression
  – Range-map: store each block's min and max, so a block can be skipped when the sought value cannot be present
• Only the columns participating in the query are retrieved
• Aggregations can be performed without decoding

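As a hedged illustration of the range-map idea (the class and method names below are invented for this sketch, not Drill's actual code), a per-block min/max lets a scan skip whole blocks without decoding them:

```java
// Hypothetical sketch of a range-map: each column block remembers its
// min and max, so a scan can skip blocks whose range cannot contain
// the value being searched for.
public class RangeMap {
    static final class Block {
        final int min, max;
        final int[] values;
        Block(int[] v) {                      // assumes a non-empty block
            values = v;
            int lo = v[0], hi = v[0];
            for (int x : v) { lo = Math.min(lo, x); hi = Math.max(hi, x); }
            min = lo; max = hi;
        }
    }

    // Count occurrences of target, consulting each block's range first
    static int count(Block[] blocks, int target) {
        int n = 0;
        for (Block b : blocks) {
            if (target < b.min || target > b.max)
                continue;                     // whole block skipped, no decode
            for (int v : b.values)
                if (v == target) n++;
        }
        return n;
    }
}
```

Searching for 7 in blocks with ranges [1, 3] and [10, 20] touches no values at all; only blocks whose range covers the target are scanned.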
Run-length Encoding & Sum

• Dataset encoded as <value> <run-length> pairs:
  – 2, 4 (four 2's)
  – 8, 10 (ten 8's)
• Goal: sum all the records
• Normally:
  – Decompress: 2, 2, 2, 2, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8
  – Add: 2 + 2 + 2 + 2 + 8 + 8 + 8 + 8 + 8 + 8 + 8 + 8 + 8 + 8
• Optimized work: 2 * 4 + 8 * 10
  – Less memory, fewer operations

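The arithmetic above can be sketched directly (a toy illustration, not Drill code): summing an RLE column costs one multiply and one add per run instead of one add per row.

```java
// Toy sketch of summing a run-length-encoded column: values[i] occurs
// runLengths[i] times, so each run contributes values[i] * runLengths[i].
public class RleSum {
    static long sum(int[] values, int[] runLengths) {
        long total = 0;
        for (int i = 0; i < values.length; i++)
            total += (long) values[i] * runLengths[i];  // one multiply per run
        return total;
    }
}
```

For the slide's dataset, sum(new int[]{2, 8}, new int[]{4, 10}) does 2 multiplies instead of 14 additions and returns 88.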
Bit-packed Dictionary Sort

• Dataset encoded with a dictionary and bit-packed positions:
  – Dictionary: [Rupert, Bill, Larry] → codes {0, 1, 2}
  – Values: [1, 0, 1, 2, 1, 2, 1, 0]
• Normal work:
  – Decompress & store: Bill, Rupert, Bill, Larry, Bill, Larry, Bill, Rupert
  – Sort: ~24 comparisons of variable-width strings
• Optimized work:
  – Sort the dictionary: {Bill: 1, Larry: 2, Rupert: 0}
  – Sort the bit-packed values
  – Work: at most 3 string comparisons, plus ~24 comparisons of fixed-width dictionary codes

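The optimized sort above can be sketched as follows (illustrative only; DictSort and sortEncoded are invented names): sort the small dictionary once, remap the codes so their numeric order matches the string order, then sort nothing but fixed-width integers.

```java
import java.util.Arrays;

// Illustrative dictionary-encoded sort: the variable-width strings are
// compared only while sorting the (tiny) dictionary; the row values are
// then sorted as fixed-width integer codes.
public class DictSort {
    static String[] sortEncoded(String[] dictionary, int[] codes) {
        String[] sortedDict = dictionary.clone();
        Arrays.sort(sortedDict);                      // few string comparisons
        int[] rank = new int[dictionary.length];      // old code -> sorted position
        for (int i = 0; i < dictionary.length; i++)
            rank[i] = Arrays.binarySearch(sortedDict, dictionary[i]);
        int[] remapped = new int[codes.length];
        for (int i = 0; i < codes.length; i++)
            remapped[i] = rank[codes[i]];
        Arrays.sort(remapped);                        // integer comparisons only
        String[] out = new String[codes.length];
        for (int i = 0; i < codes.length; i++)
            out[i] = sortedDict[remapped[i]];
        return out;
    }
}
```

With the slide's data, sortEncoded(new String[]{"Rupert", "Bill", "Larry"}, new int[]{1, 0, 1, 2, 1, 2, 1, 0}) yields the names in sorted order without ever sorting the strings themselves.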
Drill 4-value semantics

• SQL's 3-valued semantics
  – True
  – False
  – Unknown
• Drill adds a fourth
  – Repeated

Vectorization

• Drill operates on more than one record at a time
  – Word-sized manipulations
  – SIMD-like instructions
• GCC, LLVM and the JVM all do various optimizations automatically
  – Algorithms can also be hand-coded for vectorization
• Logical vectorization
  – Bitmaps allow lightning-fast null checks
  – Avoid branching to keep the CPU pipeline full

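As a hedged sketch of the bitmap null-check idea (not Drill's value-vector code; BitmapSum is invented for this example), a validity bitmap lets 64 values be tested for nulls with a single word comparison, and an all-valid word can be summed with no per-value branch:

```java
// Toy validity bitmap: bit i of validity means values[i] is non-null.
// Whole 64-value words can be skipped (all null) or summed branch-free
// (all valid) after a single comparison each.
public class BitmapSum {
    static long sum(int[] values, long[] validity) {
        long total = 0;
        for (int w = 0; w < validity.length; w++) {
            int start = w * 64;
            int end = Math.min(values.length, start + 64);
            if (validity[w] == 0)
                continue;                          // 64 nulls skipped at once
            if (validity[w] == -1L) {              // all valid: no per-value branch
                for (int i = start; i < end; i++) total += values[i];
            } else {                               // mixed word: test each bit
                for (int i = start; i < end; i++)
                    if ((validity[w] >>> (i - start) & 1) != 0) total += values[i];
            }
        }
        return total;
    }
}
```

A validity word of 5 (bits 0 and 2 set) over {10, 20, 30, 40} sums only 10 and 30, with no null sentinel stored in the data itself.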
Runtime Compilation is Faster

• The JIT is smart, but there are more gains with runtime compilation
• Janino: a Java-based Java compiler

(Benchmark chart from http://bit.ly/16Xk32x)

Drill compiler

Precompiled byte-code templates + code generated via CodeModel → Janino compiles the runtime byte-code → the byte-code of the two classes is merged → loaded class

Optimistic

(Chart comparing speed vs. check-pointing: Apache Drill, which has no need to checkpoint, against systems that checkpoint frequently.)

Optimistic Execution

• Recovery code is trivial
  – Running instances discard the failed query's intermediate state
• Pipelining is possible
  – Send results as soon as a batch is large enough
  – Requires barrier-less decomposition of the query

Batches of Values

• Value vectors
  – A list of values with the same schema
  – With the 4-value semantics for each value
• Shipped around in batches
  – At most 256 KB per batch
  – At most 64K rows per batch
• RPC designed for multiple replies to a single request

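A hedged sketch of those batch caps (BatchLimits is an invented name; Drill's real batching logic differs): rows accumulate into a batch until adding the next row would exceed the 64K-row or 256 KB limit.

```java
// Toy illustration of the batch limits above: seal the current batch
// whenever adding the next row would exceed 64K rows or 256 KB.
public class BatchLimits {
    static final int MAX_ROWS = 1 << 16;       // 64K rows per batch
    static final int MAX_BYTES = 256 * 1024;   // 256 KB per batch

    // Given the encoded size of each row, count the batches needed
    static int countBatches(int[] rowBytes) {
        int batches = 0, rows = 0, bytes = 0;
        for (int size : rowBytes) {
            if (rows + 1 > MAX_ROWS || bytes + size > MAX_BYTES) {
                batches++;                     // seal the full batch
                rows = 0;
                bytes = 0;
            }
            rows++;
            bytes += size;
        }
        return batches + (rows > 0 ? 1 : 0);   // plus the trailing partial batch
    }
}
```

Three 100 KB rows need two batches: the first two fit under 256 KB, the third starts a new batch.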
Pipelining

• Record batches are pipelined between nodes (~256 KB usually)
• The batch is the unit of work for Drill
  – Operators work on one batch at a time
• Operator reconfiguration happens at batch boundaries

(Diagram: record batches flowing between Drillbits.)

Pipelining Record Batches

(Same execution diagram as before: record batches flow between the parsers, planner, operators and endpoints.)

Pipelining

• Random access: sort without copying or restructuring
• Avoids serialization/deserialization
• Off-heap memory (no GC woes when using lots of memory)
• Full specification + off-heap + batch
  – Enables C/C++ operators (fast!)
• Read/write to disk when data is larger than memory
  – Memory overflow spills to disk

Cost-based Optimization

• Uses Optiq, an extensible framework
  – Pluggable rules and a pluggable cost model
• Rules for distributed plan generation
  – Insert Exchange operators into the physical plan
  – Optiq enhanced to explore parallel query plans
• Pluggable cost model
  – CPU, IO, memory and network cost (data locality)
  – Storage engine features (HDFS vs. Hive vs. HBase)

Distributed Plan Cost

• Operators have a distribution property
  – Hash, Broadcast, Singleton, …
• Exchange operators enforce distributions
  – Hash: HashToRandomExchange
  – Broadcast: BroadcastExchange
  – Singleton: UnionExchange, SingleMergeExchange
• Enumerate all alternatives, use cost to pick the best
  – Merge join vs. hash join
  – Partition-based join vs. broadcast-based join
  – Streaming aggregation vs. hash aggregation
  – Aggregation in one phase or two phases (partial local aggregation followed by final aggregation)

(Diagram: Data → HashToRandomExchange → Sort → Streaming aggregation.)

Interactive SQL-on-Hadoop options

                  Drill 1.0 / Hive 0.13 w/ Tez / Impala 1.x
• Latency:        Low / Medium / Low
• Files:          Yes (all Hive file formats, plus JSON, Text, …) / Yes (all Hive file formats) / Yes (Parquet, Sequence, …)
• HBase/MapR-DB:  Yes / Yes, perf issues / Yes, with issues
• Schema:         Hive or schema-less / Hive / Hive
• SQL support:    ANSI SQL / HiveQL / HiveQL (subset)
• Client support: ODBC/JDBC / ODBC/JDBC / ODBC/JDBC
• Hive compat:    High / High / Low
• Large datasets: Yes / Yes / Limited
• Nested data:    Yes / Limited / No
• Concurrency:    High / Limited / Medium

Apache Drill Roadmap

1.0: Data exploration / ad-hoc queries
• Low-latency SQL
• Schema-less execution
• Files & HBase/M7 support
• Hive integration
• BI and SQL tool support via ODBC/JDBC

1.1: Advanced analytics and operational data
• HBase query speedup
• Nested data functions
• Advanced SQL functionality

2.0: Operational SQL
• Ultra-low-latency queries
• Single-row insert/update/delete
• Workload management

Apache Drill Resources

• Drill 0.5 released last week
• Getting started with Drill is easy: just download the tarball and start running SQL queries on local files
• Mailing lists
  – drill-user@incubator.apache.org
  – drill-dev@incubator.apache.org
• Docs: https://cwiki.apache.org/confluence/display/DRILL/Apache+Drill+Wiki
• Fork us on GitHub: http://github.com/apache/incubator-drill/
• Create a JIRA: https://issues.apache.org/jira/browse/DRILL

Active Drill Community

• Large community, growing rapidly
  – 35-40 contributors, 16 committers
  – Microsoft, LinkedIn, Oracle, Facebook, Visa, Lucidworks, Concurrent, many universities
• In 2014
  – Over 20 meet-ups, many more coming soon
  – 3 hackathons, with 40+ participants
• We encourage you to join, learn, contribute and have fun…

Drill at MapR

• World-class SQL team, ~20 people
  – 150+ years combined experience building commercial databases
  – Oracle, DB2, ParAccel, Teradata, SQL Server, Vertica
• The team works on Drill, Hive and Impala
• Fixed some of the toughest problems in Apache Hive

Thank you!

M. C. Srivas
srivas@mapr.com

Did I mention we are hiring…