Rethinking SQL for Big Data with Apache Drill

© 2015 MapR Technologies
Jim Scott – Director, Enterprise Strategy & Architecture
@kingmesal #BigDataMadison

Find my presentation and other related resources here:
http://events.mapr.com/BigDataMadison
(you can find this link in the event’s page at meetup.com)
•  Today’s Presentation
•  Whiteboard & demo videos
•  Free On-Demand Training
•  Free eBooks
•  Free Hadoop Sandbox
•  And more…

Topics
•  Motivation
•  Using Drill
•  SQL + NoSQL = ???
•  Security Controls
•  Demo
•  Resources

Empowering “as it happens” businesses by speeding up the data-to-action cycle

Top-Ranked NoSQL
Top-Ranked Hadoop Distribution
Top-Ranked SQL-on-Hadoop Solution

Motivation

Data is Doubling Every Two Years
Unstructured data will account for more than 80% of the data collected by organizations
[Chart: total data stored, 1980 to 2020, split into structured and semi-structured data]
Source: Human-Computer Interaction & Knowledge Discovery in Complex Unstructured, Big Data

Data Increasingly Stored in Non-Relational Datastores
[Timeline: 1980 to 2020, moving from relational databases to non-relational datastores]
Relational databases
•  Volume: GBs-TBs
•  Structure: Structured
•  Development: Planned (release cycle = months-years)
•  Database: Fixed schema; DBA controls structure
Non-relational datastores
•  Volume: TBs-PBs
•  Structure: Structured, semi-structured and unstructured
•  Development: Iterative (release cycle = days-weeks)
•  Database: Dynamic / Flexible schema; application controls structure

How To Bring SQL to Non-Relational Data Stores?
Familiarity of SQL
•  ANSI SQL semantics
•  BI (Tableau, MicroStrategy, etc.)
•  Low latency
Agility of NoSQL
•  No schema management
–  HDFS (Parquet, JSON, etc.)
–  HBase
–  …
•  No transformation
–  No silos of data
•  Ease of use

Industry's First Schema-free SQL engine for Big Data

Enabling “As-It-Happens” Business with Instant Analytics
Traditional approach: Hadoop data → Data modeling → Transformation → Data movement (optional) → Users
Total time to insight: weeks to months
Exploratory approach: Hadoop data → Users
Total time to insight: minutes
Drivers shown for both approaches: new business questions, source data evolution

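Evolution Towards Self-Service Data Exploration
Traditional BI w/ RDBMS
•  Data Modeling and Transformation: IT-driven
•  Data Visualization: IT-driven
Self-Service BI w/ RDBMS
•  Data Modeling and Transformation: IT-driven
•  Data Visualization: Self-service
SQL-on-Hadoop
•  Data Modeling and Transformation: IT-driven
•  Data Visualization: Self-service
Self-Service Data Exploration (zero-day analytics)
•  Data Modeling and Transformation: Optional
•  Data Visualization: Self-service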

Common Use Cases
•  Raw Data Exploration
•  JSON Analytics
•  DWH offload
Sources shown: Hive, HBase, files, directories, {JSON}, Parquet, text files, …

Drill Supports Schema Discovery On-The-Fly
Schema Declared In Advance (schema on write, schema before read)
•  Fixed schema
•  Leverage schema in centralized repository (Hive Metastore)
Schema Discovered On-The-Fly (schema on the fly)
•  Fixed schema, evolving schema or schema-less
•  Leverage schema in centralized repository or self-describing data

Drill Enables ‘SQL-on-Everything’
SELECT * FROM dfs.yelp.`business.json`
Storage plugin instance (dfs)
-  DFS (Text, Parquet, JSON)
-  HBase/MapR-DB
-  Hive Metastore/HCatalog
-  Easy API to go beyond Hadoop
Workspace (yelp)
-  Sub-directory
-  HBase namespace
-  Hive database
Table (`business.json`)
-  Pathnames
-  Hive table
-  HBase table
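
As a rough illustration of the three-part name in the query above, the following Python sketch splits a reference into storage plugin instance, workspace and table. The function name `split_table_reference` is hypothetical, and this is not how Drill itself parses identifiers; it only shows why the backticks matter when a file name contains a dot.

```python
def split_table_reference(ref: str) -> list[str]:
    """Split a dotted reference, keeping backtick-quoted parts whole."""
    parts, current, quoted = [], [], False
    for ch in ref:
        if ch == "`":
            # Backticks toggle quoting and are not part of the name itself.
            quoted = not quoted
        elif ch == "." and not quoted:
            # An unquoted dot separates plugin, workspace and table.
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    parts.append("".join(current))
    return parts


plugin, workspace, table = split_table_reference("dfs.yelp.`business.json`")
print(plugin, workspace, table)
```

Without the backticks, the dot in the file name would be read as one more separator and the reference would split into four parts.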

Drill’s Data Model is Flexible
[Chart: JSON, BSON, HBase, Parquet, Avro, CSV and TSV placed by schema flexibility (fixed schema to dynamic schema) and structure (flat to complex)]
RDBMS/SQL-on-Hadoop table
Name      Gender  Age
Michael   M       6
Jennifer  F       3
Apache Drill table
{
  name: {
    first: Michael,
    last: Smith
  },
  hobbies: [ski, soccer],
  district: Los Altos
}
{
  name: {
    first: Jennifer,
    last: Gates
  },
  hobbies: [sing],
  preschool: CCLC
}

Reuse Existing SQL Tools and Skills
•  Leverage SQL-compatible tools (BI, query builders, etc.) via Drill’s standard ODBC, JDBC and ANSI SQL support
•  Enable business analysts, technical analysts and data scientists to explore and analyze large volumes of real-time data

Using Drill with Yelp

Business dataset
{
  "business_id": "4bEjOyTaDG24SY5TxsaUNQ",
  "full_address": "3655 Las Vegas Blvd S\nThe Strip\nLas Vegas, NV 89109",
  "hours": {
    "Monday": {"close": "23:00", "open": "07:00"},
    "Tuesday": {"close": "23:00", "open": "07:00"},
    "Friday": {"close": "00:00", "open": "07:00"},
    "Wednesday": {"close": "23:00", "open": "07:00"},
    "Thursday": {"close": "23:00", "open": "07:00"},
    "Sunday": {"close": "23:00", "open": "07:00"},
    "Saturday": {"close": "00:00", "open": "07:00"}
  },
  "open": true,
  "categories": ["Breakfast & Brunch", "Steakhouses", "French", "Restaurants"],
  "city": "Las Vegas",
  "review_count": 4084,
  "name": "Mon Ami Gabi",
  "neighborhoods": ["The Strip"],
  "longitude": -115.172588519464,
  "state": "NV",
  "stars": 4.0,
  "attributes": {
    "Alcohol": "full_bar",
    "Noise Level": "average",
    "Has TV": false,
    "Attire": "casual",
    "Ambience": {
      "romantic": true,
      "intimate": false,
      "touristy": false,
      "hipster": false,
      "classy": true,
      "trendy": false,
      "casual": false
    },
    "Good For": {"dessert": false, "latenight": false, "lunch": false,
                 "dinner": true, "breakfast": false, "brunch": false}
  }
}

Reviews dataset
{
  "votes": {"funny": 0, "useful": 2, "cool": 1},
  "user_id": "Xqd0DzHaiyRqVH3WRG7hzg",
  "review_id": "15SdjuK7DmYqUAj6rjGowg",
  "stars": 5,
  "date": "2007-05-17",
  "text": "dr. goldberg offers everything ...",
  "type": "review",
  "business_id": "vcNAWiLM4dR7D2nwwJ7nCA"
}

Zero to Results in 2 minutes
Install
$ tar -xvzf apache-drill-1.0.0.tar.gz
Launch shell (embedded mode)
$ bin/sqlline -u jdbc:drill:zk=local
$ bin/drill-embedded
Query files and directories
> SELECT state, city, count(*) AS businesses
  FROM dfs.yelp.`business.json`
  GROUP BY state, city
  ORDER BY businesses DESC LIMIT 10;
Results
+--------+-------------+-------------+
| state  | city        | businesses  |
+--------+-------------+-------------+
| NV     | Las Vegas   | 12021       |
| AZ     | Phoenix     | 7499        |
| AZ     | Scottsdale  | 3605        |
| EDH    | Edinburgh   | 2804        |
| AZ     | Mesa        | 2041        |
| AZ     | Tempe       | 2025        |
| NV     | Henderson   | 1914        |
| AZ     | Chandler    | 1637        |
| WI     | Madison     | 1630        |
| AZ     | Glendale    | 1196        |
+--------+-------------+-------------+

Directories are implicit partitions
SELECT dir0, SUM(amount)
FROM sales
WHERE dir1 IN ('q1', 'q2')
GROUP BY dir0
sales
├── 2014
│   ├── q1
│   ├── q2
│   ├── q3
│   └── q4
└── 2015
    └── q1
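
To make the dir0/dir1 idea concrete, here is a minimal Python sketch that derives partition columns from directory levels and then totals amounts per year for selected quarters. It reads the slide's query as "filter on the quarter directory, group by the year directory". The file names and amounts are invented for illustration, and `partition_columns` and `sum_by_year` are hypothetical helpers, not Drill APIs.

```python
from collections import defaultdict

# Hypothetical files under the `sales` directory: (relative path, amount).
FILES = [
    ("2014/q1/a.csv", 100),
    ("2014/q2/b.csv", 150),
    ("2014/q3/c.csv", 70),
    ("2015/q1/d.csv", 200),
]


def partition_columns(relative_path):
    """Map each directory level to dir0, dir1, ... (the file name is excluded)."""
    directories = relative_path.split("/")[:-1]
    return {f"dir{i}": name for i, name in enumerate(directories)}


def sum_by_year(files, quarters):
    """Total the amounts per dir0 (year), keeping only rows whose dir1 is in `quarters`."""
    totals = defaultdict(int)
    for path, amount in files:
        cols = partition_columns(path)
        if cols.get("dir1") in quarters:
            totals[cols["dir0"]] += amount
    return dict(totals)


print(sum_by_year(FILES, {"q1", "q2"}))
```

No partition metadata is declared anywhere in this sketch: the directory layout alone supplies the year and quarter columns.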

Intuitive SQL Access to Complex Data
Query data with any levels of nesting
// It’s Friday 10pm in Vegas and looking for Hummus
> SELECT name, stars, b.hours.Friday friday, categories
  FROM b
  WHERE b.hours.Friday.`open` < '22:00' AND
        b.hours.Friday.`close` > '22:00' AND
        REPEATED_CONTAINS(categories, 'Mediterranean') AND
        city = 'Las Vegas'
  ORDER BY stars DESC
  LIMIT 2;
+--------------------------------+--------+-----------------------------------+--------------------------------------------------------------+
| name                           | stars  | friday                            | categories                                                   |
+--------------------------------+--------+-----------------------------------+--------------------------------------------------------------+
| Olives                         | 4.0    | {"close":"22:30","open":"11:00"}  | ["Mediterranean","Restaurants"]                              |
| Marrakech Moroccan Restaurant  | 4.0    | {"close":"23:00","open":"17:30"}  | ["Mediterranean","Middle Eastern","Moroccan","Restaurants"]  |
+--------------------------------+--------+-----------------------------------+--------------------------------------------------------------+
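
The filter in this query can be mimicked on plain nested records. The sketch below is an assumption-laden Python analogue, not Drill code: the record list is hand-built from values shown on the slides, and `open_at` and `find_businesses` are hypothetical helpers. Like the SQL, it compares zero-padded "HH:MM" strings lexicographically.

```python
# Records assembled from values that appear on the slides.
BUSINESSES = [
    {"name": "Mon Ami Gabi", "stars": 4.0, "city": "Las Vegas",
     "hours": {"Friday": {"open": "07:00", "close": "00:00"}},
     "categories": ["Breakfast & Brunch", "Steakhouses", "French", "Restaurants"]},
    {"name": "Olives", "stars": 4.0, "city": "Las Vegas",
     "hours": {"Friday": {"open": "11:00", "close": "22:30"}},
     "categories": ["Mediterranean", "Restaurants"]},
    {"name": "Marrakech Moroccan Restaurant", "stars": 4.0, "city": "Las Vegas",
     "hours": {"Friday": {"open": "17:30", "close": "23:00"}},
     "categories": ["Mediterranean", "Middle Eastern", "Moroccan", "Restaurants"]},
]


def open_at(business, day, time):
    """True if `time` falls strictly between the day's open and close strings."""
    hours = business.get("hours", {}).get(day)
    return hours is not None and hours["open"] < time < hours["close"]


def find_businesses(businesses, city, category, day, time, limit):
    """Filter on city, list membership and nested hours; sort by stars, descending."""
    matches = [
        b for b in businesses
        if b.get("city") == city
        and category in b.get("categories", [])  # analogue of REPEATED_CONTAINS
        and open_at(b, day, time)
    ]
    matches.sort(key=lambda b: b["stars"], reverse=True)
    return [b["name"] for b in matches[:limit]]


print(find_businesses(BUSINESSES, "Las Vegas", "Mediterranean", "Friday", "22:00", 2))
```

One consequence of comparing times as strings is visible here: a business that closes at "00:00" fails the `close > '22:00'` test even though it is open past 10pm.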

ANSI SQL Compatibility
Use familiar SQL functionality (Joins, Aggregations, Sorting, Sub-queries, SQL data types)
// Get top cool rated businesses
> SELECT b.name FROM b
  WHERE b.business_id IN
    (SELECT r.business_id FROM dfs.yelp.`review.json` r
     GROUP BY r.business_id HAVING SUM(r.votes.cool) > 2000
     ORDER BY SUM(r.votes.cool) DESC);
+-------------------------------+
| name                          |
+-------------------------------+
| Earl of Sandwich              |
| XS Nightclub                  |
| The Cosmopolitan of Las Vegas |
| Wicked Spoon                  |
+-------------------------------+

Logical Views
Lightweight file system based views for granular and de-centralized data management
// Create a view combining business and reviews datasets
> CREATE OR REPLACE VIEW dfs.tmp.BusinessReviews AS
  SELECT b.name, b.stars, r.votes.funny,
         r.votes.useful, r.votes.cool, r.`date`
  FROM b, r
  WHERE r.business_id = b.business_id;
+-------+------------------------------------------------------------------+
| ok    | summary                                                          |
+-------+------------------------------------------------------------------+
| true  | View 'BusinessReviews' created successfully in 'dfs.tmp' schema  |
+-------+------------------------------------------------------------------+
> SELECT COUNT(*) AS Total FROM dfs.tmp.BusinessReviews;
+------------+
| Total      |
+------------+
| 1125458    |
+------------+

Materialized Views AKA Tables
Save analysis results as tables using familiar CTAS syntax
> ALTER SESSION SET `store.format` = 'parquet';
> CREATE TABLE dfs.yelp.BusinessReviewsTbl AS
  SELECT b.name, b.stars, r.votes.funny funny,
         r.votes.useful useful, r.votes.cool cool, r.`date`
  FROM b, r
  WHERE r.business_id = b.business_id;
+-----------+----------------------------+
| Fragment  | Number of records written  |
+-----------+----------------------------+
| 1_0       | 176448                     |
| 1_1       | 192439                     |
| 1_2       | 198625                     |
| 1_3       | 200863                     |
| 1_4       | 181420                     |
| 1_5       | 175663                     |
+-----------+----------------------------+

Extensions to ANSI SQL to work with repeated values
// Get most common business categories
> SELECT category, count(*) AS categorycount
  FROM (SELECT name, FLATTEN(categories) AS category
        FROM dfs.yelp.`business.json`) c
  GROUP BY category ORDER BY categorycount DESC;
+---------------+----------------+
| category      | categorycount  |
+---------------+----------------+
| Restaurants   | 14303          |
…
| Australian    | 1              |
| Boat Dealers  | 1              |
| Firewood      | 1              |
+---------------+----------------+
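
What FLATTEN contributes to this query can be sketched in a few lines of Python: one output row per element of a repeated field, followed by an ordinary group-and-count. This is only an analogue under stated assumptions. The three records reuse category lists shown on earlier slides, and `flatten` and `category_counts` are hypothetical helper names.

```python
from collections import Counter

# Category lists taken from businesses shown on earlier slides.
RECORDS = [
    {"name": "Mon Ami Gabi",
     "categories": ["Breakfast & Brunch", "Steakhouses", "French", "Restaurants"]},
    {"name": "Olives",
     "categories": ["Mediterranean", "Restaurants"]},
    {"name": "Marrakech Moroccan Restaurant",
     "categories": ["Mediterranean", "Middle Eastern", "Moroccan", "Restaurants"]},
]


def flatten(records, field):
    """Yield one copy of each record per element of the repeated `field`."""
    for record in records:
        for value in record.get(field, []):
            row = dict(record)
            row[field] = value
            yield row


def category_counts(records):
    """Count flattened category values, most common first."""
    counts = Counter(row["categories"] for row in flatten(records, "categories"))
    return counts.most_common()


for category, count in category_counts(RECORDS):
    print(category, count)
```

Three input records become ten flattened rows, which is why the grouping step can treat `category` like any scalar column.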

Checkins dataset
{
  "checkin_info": {
    "3-4": 1,
    "13-5": 1,
    "6-6": 1,
    "14-5": 1,
    "14-6": 1,
    "14-2": 1,
    "14-3": 1,
    "19-0": 1,
    "11-5": 1,
    "13-2": 1,
    "11-6": 2,
    "11-3": 1,
    "12-6": 1,
    "6-5": 1,
    "5-5": 1,
    "9-2": 1,
    "9-5": 1,
    "9-6": 1,
    "5-2": 1,
    "7-6": 1,
    "7-5": 1,
    "7-4": 1,
    "17-5": 1,
    "8-5": 1,
    "10-2": 1,
    "10-5": 1,
    "10-6": 1
  },
  "type": "checkin",
  "business_id": "JwUE5GmEO-sH1FuwJgKBlQ"
}

Supports Dynamic / Unknown Columns
Convert Map with a wide set of dynamic columns into an array of key-value pairs
> SELECT KVGEN(checkin_info) checkins
  FROM dfs.yelp.`checkin.json` LIMIT 1;
+------------+
| checkins   |
+------------+
| [{"key":"3-4","value":1},{"key":"13-5","value":1},{"key":"6-6","value":1},{"key":"14-5","value":1},
{"key":"14-6","value":1},{"key":"14-2","value":1},{"key":"14-3","value":1},{"key":"19-0","value":1},
{"key":"10-2","value":1},{"key":"10-5","value":1},{"key":"10-6","value":1}] |
+------------+
> SELECT FLATTEN(KVGEN(checkin_info)) checkins FROM
  dfs.yelp.`checkin.json` LIMIT 6;
+----------------------------+
| checkins                   |
+----------------------------+
| {"key":"3-4","value":1}    |
| {"key":"13-5","value":1}   |
| {"key":"6-6","value":1}    |
| {"key":"14-5","value":1}   |
| {"key":"14-6","value":1}   |
| {"key":"14-2","value":1}   |
+----------------------------+
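
The map-to-array conversion described here is small enough to sketch directly. The Python function below is a hypothetical stand-in named after the SQL function; it is not Drill's implementation, and it assumes the map arrives as an ordinary dictionary.

```python
def kvgen(mapping):
    """Turn a map with arbitrary keys into a list of {"key", "value"} pairs."""
    return [{"key": key, "value": value} for key, value in mapping.items()]


# First three entries of the checkin_info map from the checkins dataset.
checkin_info = {"3-4": 1, "13-5": 1, "6-6": 1}
print(kvgen(checkin_info))
```

After the conversion every record has the same two field names, `key` and `value`, so a query no longer needs to know the column names in advance.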

Makes it easy to work with dynamic/unknown columns
// Count total number of checkins on Sunday midnight
> SELECT SUM(checkintbl.checkins.`value`) AS SundayMidnightCheckins
  FROM
    (SELECT FLATTEN(KVGEN(checkin_info)) checkins
     FROM dfs.yelp.`checkin.json`) checkintbl
  WHERE checkintbl.checkins.key = '23-0';
+-------------------------+
| SundayMidnightCheckins  |
+-------------------------+
| 8575                    |
+-------------------------+
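
The same aggregation can be traced by hand on a tiny data set. The sketch below is a Python analogue with invented check-in records; `total_for_key` is a hypothetical helper, and the inner loop stands in for FLATTEN(KVGEN(...)) by visiting each key-value pair of each record's map.

```python
# Invented check-in records, shaped like the checkins dataset.
CHECKINS = [
    {"business_id": "a", "checkin_info": {"23-0": 2, "11-6": 2}},
    {"business_id": "b", "checkin_info": {"23-0": 3}},
    {"business_id": "c", "checkin_info": {"9-5": 1}},
]


def total_for_key(records, wanted_key):
    """Sum the values stored under `wanted_key` across all records."""
    total = 0
    for record in records:
        # Each (key, value) pair plays the role of one flattened row.
        for key, value in record["checkin_info"].items():
            if key == wanted_key:
                total += value
    return total


print(total_for_key(CHECKINS, "23-0"))
```

Records that never use the key simply contribute nothing, which is the behaviour the WHERE clause gives the SQL version.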

SQL + NoSQL = Accessible & Linearly Scalable

MusicBrainz on NoSQL
Artists, albums, tracks and labels are key objects
Reality check:
Add works (compositions), recordings, release, release group
7 tables for artist alone
12 for place, 7 for label, 17 for release/group, 8 for work
(but only 4 for recording!)
Total of 12 + 7 + 17 + 8 + 4 = 48 tables
But wait, there’s more!
10 annotation tables, 10 edit tables, 19 tag tables, 5 rating tables, 86 link tables, 5 cover art tables and 3 tables for CD timing info (138 total)
And 50 more tables that aren’t documented yet
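
The running totals on this slide can be checked with a few lines of Python. The dictionary labels follow the slide's wording; the grouping into "core" and "supporting" tables is only a naming convenience introduced for this sketch.

```python
# Table counts as listed on the slide, in the order the slide adds them up.
core = {"place": 12, "label": 7, "release/group": 17, "work": 8, "recording": 4}
supporting = {
    "annotation": 10, "edit": 10, "tag": 19, "rating": 5,
    "link": 86, "cover art": 5, "CD timing": 3,
}
undocumented = 50

core_total = sum(core.values())
supporting_total = sum(supporting.values())
grand_total = core_total + supporting_total + undocumented

print(core_total, supporting_total, grand_total)
```

The three figures come out to 48, 138 and 236, matching the totals quoted on this slide and the "236 tables" headline that follows.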

180 Tables
NOT SHOWN!

236 tables
to describe 7 kinds of things

artist
id
gid
name
sort_name
begin_date
end_date
ended
type
gender
area
begin_area
end_area
comment
list<ipi>
list<isni>
list<alias>
list<release_id>
list<recording_id>
artist
id
gid
name
sort_name
begin_date
end_date
ended
type
gender
area
begin_area
end_area
comment
list<ipi>
list<isni>
list<alias>

artist
id
gid
name
sort_name
begin_date
end_date
ended
type
gender
area
begin_area
end_area
comment
list<ipi>
list<isni>
list<alias>
list<release_id>
list<recording_id>
Primitive values
One to many relations
Equivalent to indexes

Further Reductions
All 86 link tables become properties on artists, releases and other entities
All 44 tag, rating and annotation tables become list properties
All 5 cover art tables become lists of file references
Current score: 162 tables become 4
You get the idea

Is This Good?
Expressivity
–  The JSON data model is at least as expressive as the original relational model
•  Many cases easier to describe in nested data
•  No cases are harder
Efficiency
–  Inlining can increase data size. Locality improves, however
–  Sessionizing can substantially decrease data size
–  Inlining back-references is more efficient than ordinary indexes
–  Inlined columnar data allows 1000x speedup for time series
Introspection (you decide)

Searching for Elvis
// Find discs where Elvis was credited
> SELECT distinct album_id, name FROM
    (SELECT id album_id, artist_id, name, FLATTEN(credit)
     FROM release) albums
  join
    (SELECT distinct artist_id FROM
       (SELECT id artist_id, FLATTEN(alias)
        FROM artist
        where name like 'Elvis%Presley')
    ) artists
  USING artist_id;

Benefits
Extended relational model allows massive simplification
–  On a real example, we see >20x reduction in number of tables
Simplification drives improved introspection
–  This is good
Apache Drill gives very high performance execution for extended relational problems
You can try this out today

Security Controls

Access Controls that Scale
•  PAM Authentication + User Impersonation
•  Fine-grained row and column level access control with Drill Views – no centralized security repository required
[Diagram: users query Drill View 1 and Drill View 2, which sit on top of Files, HBase and Hive]

Granular Security via Drill Views
Raw File (/raw/cards.csv) (Owner: Admins; Permission: Admins)
Name  City      State  Credit Card #
Dave  San Jose  CA     1374-7914-3865-4817
John  Boulder   CO     1374-9735-1794-9711
Data Scientist View (/views/maskedcards.csv) (Owner: Admins; Permission: Data Scientists)
Not a physical data copy
Name  City      State  Credit Card #
Dave  San Jose  CA     1374-1111-1111-1111
John  Boulder   CO     1374-1111-1111-1111
Business Analyst View (Owner: Admins; Permission: Business Analysts)
Name  City      State
Dave  San Jose  CA
John  Boulder   CO
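
The two views on this slide are, in effect, functions over the raw rows rather than copies of them. The Python sketch below illustrates that idea under stated assumptions: the field names and the helpers `mask_card`, `data_scientist_view` and `business_analyst_view` are hypothetical, and the masking rule (keep the first group, replace the rest with 1111) is inferred from the sample output shown.

```python
# Rows from the raw file shown on the slide.
RAW_ROWS = [
    {"name": "Dave", "city": "San Jose", "state": "CA", "card": "1374-7914-3865-4817"},
    {"name": "John", "city": "Boulder", "state": "CO", "card": "1374-9735-1794-9711"},
]


def mask_card(card):
    """Keep the first group of digits and replace every later group with 1111."""
    first, *rest = card.split("-")
    return "-".join([first] + ["1111"] * len(rest))


def data_scientist_view(rows):
    """Column masking: same rows, with the card number obscured."""
    return [{**row, "card": mask_card(row["card"])} for row in rows]


def business_analyst_view(rows):
    """Column removal: the card number is not exposed at all."""
    return [{k: v for k, v in row.items() if k != "card"} for row in rows]


print(data_scientist_view(RAW_ROWS)[0])
print(business_analyst_view(RAW_ROWS)[0])
```

Both functions compute their result from `RAW_ROWS` on demand and leave it unchanged, which mirrors the "not a physical data copy" point.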

Ownership Chaining
Combine Self Service Exploration with Data Governance
Raw File (/raw/cards.csv): John (Owner)
Dave  San Jose  CA  1374-7914-3865-4817
John  Boulder   CO  1374-9735-1794-9711
Data Scientist (/views/V_Scientist): Jane (Read), John (Owner)
Dave  San Jose  CA  1374-1111-1111-1111
John  Boulder   CO  1374-1111-1111-1111
Analyst (/views/V_Analyst): Jack (Read), Jane (Owner)
Name  City      State
Dave  San Jose  CA
John  Boulder   CO
Ownership chaining: RAWFILE, V_Scientist, V_Analyst
Access path: Jack queries the view V_Analyst
•  Does Jack have access to V_Analyst? -> YES
•  Who is the owner of V_Analyst? -> Jane
•  Drill accesses V_Analyst as Jane (Impersonation hop 1)
•  Does Jane have access to V_Scientist? -> YES
•  Who is the owner of V_Scientist? -> John
•  Drill accesses V_Scientist as John (Impersonation hop 2)
•  Does John have permissions on raw file? -> YES
•  Who is the owner of raw file? -> John
•  Drill accesses source file as John (no impersonation here)
*Ownership chain length (# hops) is configurable
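
The access path above can be walked mechanically. The following Python sketch is a simplified model of that walk, not Drill's implementation: the `OBJECTS` table, the `resolve` function and the `max_hops` parameter are hypothetical names, and permissions are reduced to an owner plus a set of readers.

```python
# Owners and readers as shown on the slide; "source" is the object each view reads.
OBJECTS = {
    "V_Analyst": {"owner": "Jane", "readers": {"Jack"}, "source": "V_Scientist"},
    "V_Scientist": {"owner": "John", "readers": {"Jane"}, "source": "RAWFILE"},
    "RAWFILE": {"owner": "John", "readers": set(), "source": None},
}


def resolve(user, name, max_hops):
    """Follow the ownership chain, returning (object, accessed-as) pairs."""
    hops = 0
    trace = []
    while name is not None:
        obj = OBJECTS[name]
        if user != obj["owner"] and user not in obj["readers"]:
            raise PermissionError(f"{user} cannot read {name}")
        if user != obj["owner"]:
            # Switching to the owner's identity counts as one impersonation hop.
            hops += 1
            if hops > max_hops:
                raise PermissionError(f"more than {max_hops} impersonation hops")
            user = obj["owner"]
        trace.append((name, user))
        name = obj["source"]
    return trace


for step in resolve("Jack", "V_Analyst", max_hops=2):
    print(step)
```

With a limit of two hops Jack's query reaches the raw file as John; lowering the limit to one makes the same query fail at V_Scientist.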

Security Summary
Logical
–  No physical data copies/silos
Granular
–  Row level and column level security controls
De-centralized
–  User impersonation respecting storage system permissions
–  No separate permission repository for granular controls
–  Integrated with Hadoop File System permissions and LDAP
Self-service w/ governance
–  If you have access to data, you control who can access it and how widely
–  Audits

National Nutrient Database

[Schema diagrams, three slides: Complex, Simpler, Simplest]

Sample SR27 Records
~01001~^~0100~^~Butter, salted~^~BUTTER,WITH SALT~^~~^~~^~Y~^~~^0^~~^6.38^4.27^8.79^3.87
~01004~^~0100~^~Cheese, blue~^~CHEESE,BLUE~^~~^~~^~Y~^~~^0^~~^6.38^4.27^8.79^3.87
~01005~^~0100~^~Cheese, brick~^~CHEESE,BRICK~^~~^~~^~Y~^~~^0^~~^6.38^4.27^8.79^3.87
~01006~^~0100~^~Cheese, brie~^~CHEESE,BRIE~^~~^~~^~Y~^~~^0^~~^6.38^4.27^8.79^3.87
~01007~^~0100~^~Cheese, camembert~^~CHEESE,CAMEMBERT~^~~^~~^~Y~^~~^0^~~^6.38^4.27^8.79^3.87
~01008~^~0100~^~Cheese, caraway~^~CHEESE,CARAWAY~^~~^~~^~~^~~^0^~~^6.38^4.27^8.79^3.87

Configuration
-- Format --
"nndb": {
"type": "text",
"extensions": [ "txt" ],
"quote": "~",
"escape": "~",
"delimiter": "^"
},
-- Workspace --
"nndb": {
"location": "/opt/drill/nndb",
"writable": true,
"storageformat": "parquet"
},
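
To see what the quote and delimiter settings mean for an SR27 record, the sketch below parses the first sample line with Python's standard `csv` module. This is an approximation for illustration, not Drill's text reader: `parse_nndb_line` is a hypothetical helper, and the `csv` module's default quote-doubling is assumed to be close enough to the `"escape": "~"` setting for these records.

```python
import csv

# First sample record from the SR27 slide.
SAMPLE = (
    "~01001~^~0100~^~Butter, salted~^~BUTTER,WITH SALT~^~~^~~^~Y~^~~^0^~~"
    "^6.38^4.27^8.79^3.87"
)


def parse_nndb_line(line):
    """Split one record on '^', treating '~' as the quote character."""
    return next(csv.reader([line], delimiter="^", quotechar="~"))


fields = parse_nndb_line(SAMPLE)
print(len(fields), fields[:4])
```

The commas inside "Butter, salted" survive intact because the field separator is the caret, and the empty `~~` fields come back as empty strings.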

Sample JSON
{
"ndb_no":"08613",
"shrt_desc":"CEREALS RTE,KELLOGG'S SPL K MULTIGRAIN OATS & HONEY",
"nut_data":[{
"nutr_no": "203",
"nutr_val": "7.80",
"nutr_def": {"num_dec":2,"tagname":"PROCNT","nutrdesc":"Protein"},
"data_src":[{
"datasrc_id": "S6941",
"authors": "A Kellogg, Co.",
"title": "Kellogg Company Data",
"Year": "2011"
}]
}, {
"nutr_no": "205",
"nutr_val": "85.00",
"nutr_def": {"num_dec":2,"tagname":"CHOCDF","nutrdesc":"Carbohydrate, by difference"},
"data_src":[{
"datasrc_id": "S6941",
"authors": "C Kellogg, Co.",
"title": "Kellogg Company Data",
"Year": "2011"
}]
}],
"langual":["ANISE","FRUIT","WHOLE, NATURAL SHAPE","NOT HEAT-TREATED","COOKING METHOD NOT APPLICABLE","WATER REMOVED","HEAT DRIED","HUMAN FOOD, NO AGE SPECIFICATION"]
}

Demo Queries
All queries can be found within these blogs:
https://www.mapr.com/blog/drilling-healthy-choices
https://www.mapr.com/blog/evolution-database-schemas-using-sql-nosql

Live Demo

Drill is Top-Ranked SQL-on-Hadoop
Source: Gigaom Research, 2015
Key:
•  Number indicates company’s relative strength across all vectors
•  Size of ball indicates company’s relative strength along individual vector
“Drill isn’t just about SQL-on-Hadoop. It’s about SQL-on-pretty-much-anything, immediately, and without formality.”

Drill Project Status
[Timeline with dates Sep’12, Jun’13, Aug’14, Sep’14, Nov’14, Dec’14, Jan’15, Mar’15, Apr’15, May’15]
Milestones shown: project incubation; first release, Drill 0.1; Dev Preview, Drill 0.4; Beta, Drill 0.5; Drill 0.6; Drill 0.7; Apache Top Level Project; Drill 0.8; Drill 0.9; GigaOm top-ranked SQL-on-Hadoop; Drill 1.0 (May’15, just released)
Highlights
•  Apache Top Level Project
•  Growing user adoption: 1000’s downloads
•  Iterative project cycles: 7 releases < 9 months
•  Large community, growing rapidly: 50 contributors

Recommendations for Getting Started with Drill
New to Drill?
–  Get started with Free MapR On Demand training
–  Test Drive Drill on cloud with AWS
–  Learn how to use Drill with Hadoop using MapR sandbox
Ready to play with your data?
–  Try out Apache Drill in 10 mins guide on your desktop
–  Download Drill for your cluster and start exploration
–  Comprehensive tutorials and documentation available
Ask questions
–  user@drill.apache.org


Q&A
@kingmesal maprtech
jscott@mapr.com
Engage with us!
MapR
maprtech
mapr-technologies
