Rethinking SQL for Big Data with Apache Drill
- 1.
© 2015 MapR Technologies
Jim Scott – Director, Enterprise Strategy & Architecture
@kingmesal #BigDataMadison
- 2.
Find my presentation and other related resources here:
http://events.mapr.com/BigDataMadison
(you can find this link on the event’s page at meetup.com)
• Today’s Presentation
• Whiteboard & demo videos
• Free On-Demand Training
• Free eBooks
• Free Hadoop Sandbox
• And more…
- 3.
Topics
• Motivation
• Using Drill
• SQL + NoSQL = ???
• Security Controls
• Demo
• Resources
- 4.
Empowering “as it happens” businesses by speeding up the data-to-action cycle
- 5.
Top-Ranked NoSQL
Top-Ranked Hadoop Distribution
Top-Ranked SQL-on-Hadoop Solution
- 6.
Motivation
- 7.
Data is Doubling Every Two Years
Unstructured data will account for more than 80% of the data collected by organizations
[Chart: total data stored, 1980 to 2020, split into structured data and semi-structured data]
Source: Human-Computer Interaction & Knowledge Discovery in Complex Unstructured, Big Data
- 8.
Data Increasingly Stored in Non-Relational Datastores
[Timeline: 1980 to 2020, relational databases followed by non-relational datastores]
• Volume: GBs-TBs (relational databases) vs. TBs-PBs (non-relational datastores)
• Structure: Structured (relational) vs. Structured, semi-structured and unstructured (non-relational)
• Development: Planned, release cycle = months-years (relational) vs. Iterative, release cycle = days-weeks (non-relational)
• Database: Fixed schema, DBA controls structure (relational) vs. Dynamic / Flexible schema, application controls structure (non-relational)
- 9.
How To Bring SQL to Non-Relational Data Stores?
Familiarity of SQL
• ANSI SQL semantics
• BI (Tableau, MicroStrategy, etc.)
• Low latency
Agility of NoSQL
• No schema management
– HDFS (Parquet, JSON, etc.)
– HBase
– …
• No transformation
– No silos of data
• Ease of use
- 10.
Industry's First Schema-free SQL engine for Big Data
- 11.
Enabling “As-It-Happens” Business with Instant Analytics
Traditional approach: Hadoop data -> Data modeling -> Transformation -> Data movement (optional) -> Users
Total time to insight: weeks to months
Exploratory approach: Hadoop data -> Users
Total time to insight: minutes
[Diagram labels: New business questions; Source data evolution]
- 12.
Evolution Towards Self-Service Data Exploration
• Traditional BI w/ RDBMS: data modeling and transformation IT-driven; data visualization IT-driven
• Self-Service BI w/ RDBMS: data modeling and transformation IT-driven; data visualization self-service
• SQL-on-Hadoop: data modeling and transformation IT-driven; data visualization self-service
• Self-Service Data Exploration (zero-day analytics): data modeling and transformation optional; data visualization self-service
- 13.
Common Use Cases
• Raw Data Exploration
• JSON Analytics
• DWH offload
[Diagram: sources include Hive, HBase, files, directories, JSON, Parquet, text files, …]
- 14.
Drill Supports Schema Discovery On-The-Fly
Schema Declared In Advance (schema on write, schema before read)
• Fixed schema
• Leverage schema in centralized repository (Hive Metastore)
Schema Discovered On-The-Fly (schema on the fly)
• Fixed schema, evolving schema or schema-less
• Leverage schema in centralized repository or self-describing data
- 15.
Drill Enables ‘SQL-on-Everything’
SELECT * FROM dfs.yelp.`business.json`
Storage plugin instance (dfs)
- DFS (Text, Parquet, JSON)
- HBase/MapR-DB
- Hive Metastore/HCatalog
- Easy API to go beyond Hadoop
Workspace (yelp)
- Sub-directory
- HBase namespace
- Hive database
Table (`business.json`)
- Pathnames
- Hive table
- HBase table
- 16.
Drill’s Data Model is Flexible
[Chart: JSON, BSON, HBase, Parquet, Avro, CSV and TSV placed on two axes, flat to complex and fixed schema to dynamic schema]
RDBMS/SQL-on-Hadoop table
Name      Gender  Age
Michael   M       6
Jennifer  F       3
Apache Drill table
{
  name: {
    first: Michael,
    last: Smith
  },
  hobbies: [ski, soccer],
  district: Los Altos
}
{
  name: {
    first: Jennifer,
    last: Gates
  },
  hobbies: [sing],
  preschool: CCLC
}
- 17.
Reuse Existing SQL Tools and Skills
• Leverage SQL-compatible tools (BI, query builders, etc.) via Drill’s standard ODBC, JDBC and ANSI SQL support
• Enable business analysts, technical analysts and data scientists to explore and analyze large volumes of real-time data
- 18.
Using Drill with Yelp
- 19.
Business dataset
{
  "business_id": "4bEjOyTaDG24SY5TxsaUNQ",
  "full_address": "3655 Las Vegas Blvd S\nThe Strip\nLas Vegas, NV 89109",
  "hours": {
    "Monday": {"close": "23:00", "open": "07:00"},
    "Tuesday": {"close": "23:00", "open": "07:00"},
    "Friday": {"close": "00:00", "open": "07:00"},
    "Wednesday": {"close": "23:00", "open": "07:00"},
    "Thursday": {"close": "23:00", "open": "07:00"},
    "Sunday": {"close": "23:00", "open": "07:00"},
    "Saturday": {"close": "00:00", "open": "07:00"}
  },
  "open": true,
  "categories": ["Breakfast & Brunch", "Steakhouses", "French", "Restaurants"],
  "city": "Las Vegas",
  "review_count": 4084,
  "name": "Mon Ami Gabi",
  "neighborhoods": ["The Strip"],
  "longitude": -115.172588519464,
  "state": "NV",
  "stars": 4.0,
  "attributes": {
    "Alcohol": "full_bar",
    "Noise Level": "average",
    "Has TV": false,
    "Attire": "casual",
    "Ambience": {
      "romantic": true,
      "intimate": false,
      "touristy": false,
      "hipster": false,
      "classy": true,
      "trendy": false,
      "casual": false
    },
    "Good For": {
      "dessert": false,
      "latenight": false,
      "lunch": false,
      "dinner": true,
      "breakfast": false,
      "brunch": false
    }
  }
}
- 20.
Reviews dataset
{
  "votes": {"funny": 0, "useful": 2, "cool": 1},
  "user_id": "Xqd0DzHaiyRqVH3WRG7hzg",
  "review_id": "15SdjuK7DmYqUAj6rjGowg",
  "stars": 5,
  "date": "2007-05-17",
  "text": "dr. goldberg offers everything ...",
  "type": "review",
  "business_id": "vcNAWiLM4dR7D2nwwJ7nCA"
}
- 21.
Zero to Results in 2 minutes
Install
$ tar -xvzf apache-drill-1.0.0.tar.gz
Launch shell (embedded mode)
$ bin/sqlline -u jdbc:drill:zk=local
$ bin/drill-embedded
Query files and directories
> SELECT state, city, count(*) AS businesses
  FROM dfs.yelp.`business.json`
  GROUP BY state, city
  ORDER BY businesses DESC
  LIMIT 10;
Results
+-------+------------+------------+
| state | city       | businesses |
+-------+------------+------------+
| NV    | Las Vegas  | 12021      |
| AZ    | Phoenix    | 7499       |
| AZ    | Scottsdale | 3605       |
| EDH   | Edinburgh  | 2804       |
| AZ    | Mesa       | 2041       |
| AZ    | Tempe      | 2025       |
| NV    | Henderson  | 1914       |
| AZ    | Chandler   | 1637       |
| WI    | Madison    | 1630       |
| AZ    | Glendale   | 1196       |
+-------+------------+------------+
- 22.
Directories are implicit partitions
SELECT dir0, SUM(amount)
FROM sales
WHERE dir1 IN ('q1', 'q2')
GROUP BY dir0
sales
├── 2014
│ ├── q1
│ ├── q2
│ ├── q3
│ └── q4
└── 2015
└── q1
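As a rough illustration of the idea on this slide, the Python sketch below mimics how the directory levels under sales can be exposed as dir0 and dir1 columns and then filtered and grouped. It is a plain-Python emulation of the concept, not Drill's implementation; the file names and amounts are made-up sample data.

```python
from collections import defaultdict
from pathlib import PurePosixPath

# Hypothetical files under the "sales" root, each with a made-up total amount.
FILES = {
    "sales/2014/q1/a.csv": 100,
    "sales/2014/q2/b.csv": 150,
    "sales/2014/q3/c.csv": 200,
    "sales/2015/q1/d.csv": 120,
}


def partition_columns(path, root="sales"):
    """Map the directories between root and the file to dir0, dir1, ..."""
    parts = PurePosixPath(path).relative_to(root).parts[:-1]  # drop the file name
    return {f"dir{i}": name for i, name in enumerate(parts)}


def sum_by_year(files, quarters):
    """Emulates: SELECT dir0, SUM(amount) ... WHERE dir1 IN quarters GROUP BY dir0."""
    totals = defaultdict(int)
    for path, amount in files.items():
        cols = partition_columns(path)
        if cols["dir1"] in quarters:
            totals[cols["dir0"]] += amount
    return dict(totals)


print(sum_by_year(FILES, {"q1", "q2"}))
```

The point of the sketch is that the year and quarter never appear inside the files; they come from the path alone.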
- 23.
Intuitive SQL Access to Complex Data
Query data with any levels of nesting
// It’s Friday 10pm in Vegas and looking for Hummus
> SELECT name, stars, b.hours.Friday friday, categories
  FROM dfs.yelp.`business.json` b
  WHERE b.hours.Friday.`open` < '22:00'
    AND b.hours.Friday.`close` > '22:00'
    AND REPEATED_CONTAINS(categories, 'Mediterranean')
    AND city = 'Las Vegas'
  ORDER BY stars DESC
  LIMIT 2;
+------------+------------+------------+------------+
| name | stars | friday | categories |
+------------+------------+------------+------------+
| Olives | 4.0 | {"close":"22:30","open":"11:00"} | ["Mediterranean","Restaurants"] |
| Marrakech Moroccan Restaurant | 4.0 | {"close":"23:00","open":"17:30"} | ["Mediterranean","Middle Eastern","Moroccan","Restaurants"] |
+------------+------------+------------+------------+
- 24.
ANSI SQL Compatibility
Use familiar SQL functionality (Joins, Aggregations, Sorting, Sub-queries, SQL data types)
// Get top cool rated businesses
> SELECT b.name
  FROM dfs.yelp.`business.json` b
  WHERE b.business_id IN (
    SELECT r.business_id
    FROM dfs.yelp.`review.json` r
    GROUP BY r.business_id
    HAVING SUM(r.votes.cool) > 2000
    ORDER BY SUM(r.votes.cool) DESC);
+-------------------------------+
| name                          |
+-------------------------------+
| Earl of Sandwich              |
| XS Nightclub                  |
| The Cosmopolitan of Las Vegas |
| Wicked Spoon                  |
+-------------------------------+
- 25.
Logical Views
Lightweight file system based views for granular and de-centralized data management
// Create a view combining business and reviews datasets
> CREATE OR REPLACE VIEW dfs.tmp.BusinessReviews AS
  SELECT b.name, b.stars, r.votes.funny, r.votes.useful, r.votes.cool, r.`date`
  FROM dfs.yelp.`business.json` b, dfs.yelp.`review.json` r
  WHERE r.business_id = b.business_id;
+------+-------------------------------------------------------------------+
| ok   | summary                                                           |
+------+-------------------------------------------------------------------+
| true | View 'BusinessReviews' created successfully in 'dfs.tmp' schema   |
+------+-------------------------------------------------------------------+
> SELECT COUNT(*) AS Total FROM dfs.tmp.BusinessReviews;
+------------+
| Total      |
+------------+
| 1125458    |
+------------+
- 26.
Materialized Views AKA Tables
Save analysis results as tables using familiar CTAS syntax
> ALTER SESSION SET `store.format` = 'parquet';
> CREATE TABLE dfs.yelp.BusinessReviewsTbl AS
  SELECT b.name, b.stars, r.votes.funny funny, r.votes.useful useful, r.votes.cool cool, r.`date`
  FROM dfs.yelp.`business.json` b, dfs.yelp.`review.json` r
  WHERE r.business_id = b.business_id;
+----------+---------------------------+
| Fragment | Number of records written |
+----------+---------------------------+
| 1_0      | 176448                    |
| 1_1      | 192439                    |
| 1_2      | 198625                    |
| 1_3      | 200863                    |
| 1_4      | 181420                    |
| 1_5      | 175663                    |
+----------+---------------------------+
- 27.
Repeated Values Support
// Flatten repeated categories
> SELECT name, categories
  FROM dfs.yelp.`business.json`
  LIMIT 3;
+----------------------------+------------------------------------------+
| name                       | categories                               |
+----------------------------+------------------------------------------+
| Eric Goldberg, MD          | ["Doctors","Health & Medical"]           |
| Pine Cone Restaurant       | ["Restaurants"]                          |
| Deforest Family Restaurant | ["American (Traditional)","Restaurants"] |
+----------------------------+------------------------------------------+
> SELECT name, FLATTEN(categories) AS categories
  FROM dfs.yelp.`business.json`
  LIMIT 5;
+----------------------------+------------------------+
| name                       | categories             |
+----------------------------+------------------------+
| Eric Goldberg, MD          | Doctors                |
| Eric Goldberg, MD          | Health & Medical       |
| Pine Cone Restaurant       | Restaurants            |
| Deforest Family Restaurant | American (Traditional) |
| Deforest Family Restaurant | Restaurants            |
+----------------------------+------------------------+
Dynamically flatten repeated and nested data elements as part of SQL queries. No ETL necessary
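To make the flattening of repeated values concrete, here is a small Python sketch of the same idea: each element of a list-valued column becomes its own row, with the other columns copied alongside. This is only an emulation of the behavior shown in the results above, not how Drill implements it; the function name flatten and the three sample rows are chosen for illustration.

```python
from collections import Counter


def flatten(rows, column):
    """Yield one output row per element of the list stored in `column`."""
    for row in rows:
        for element in row[column]:
            out = dict(row)        # copy the other columns unchanged
            out[column] = element  # replace the list with a single element
            yield out


# Sample rows taken from the query output shown on the slide.
businesses = [
    {"name": "Eric Goldberg, MD", "categories": ["Doctors", "Health & Medical"]},
    {"name": "Pine Cone Restaurant", "categories": ["Restaurants"]},
    {"name": "Deforest Family Restaurant",
     "categories": ["American (Traditional)", "Restaurants"]},
]

flat = list(flatten(businesses, "categories"))
for row in flat:
    print(row["name"], "|", row["categories"])

# Once flattened, an ordinary group-and-count works on the single values.
category_counts = Counter(row["categories"] for row in flat)
print(category_counts.most_common(1))
```

Once the list is expanded into rows, counting or grouping by category needs no special handling of arrays.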
- 28.
Extensions to ANSI SQL to work with repeated values
// Get most common business categories
> SELECT category, count(*) AS categorycount
  FROM (SELECT name, FLATTEN(categories) AS category
        FROM dfs.yelp.`business.json`) c
  GROUP BY category
  ORDER BY categorycount DESC;
+--------------+---------------+
| category     | categorycount |
+--------------+---------------+
| Restaurants  | 14303         |
…
| Australian   | 1             |
| Boat Dealers | 1             |
| Firewood     | 1             |
+--------------+---------------+
- 29.
Checkins dataset
{
  "checkin_info": {
    "3-4": 1, "13-5": 1, "6-6": 1, "14-5": 1, "14-6": 1, "14-2": 1, "14-3": 1,
    "19-0": 1, "11-5": 1, "13-2": 1, "11-6": 2, "11-3": 1, "12-6": 1, "6-5": 1,
    "5-5": 1, "9-2": 1, "9-5": 1, "9-6": 1, "5-2": 1, "7-6": 1, "7-5": 1,
    "7-4": 1, "17-5": 1, "8-5": 1, "10-2": 1, "10-5": 1, "10-6": 1
  },
  "type": "checkin",
  "business_id": "JwUE5GmEO-sH1FuwJgKBlQ"
}
- 30.
Supports Dynamic / Unknown Columns
> SELECT KVGEN(checkin_info) checkins
  FROM dfs.yelp.`checkin.json`
  LIMIT 1;
+------------+
| checkins   |
+------------+
| [{"key":"3-4","value":1},{"key":"13-5","value":1},{"key":"6-6","value":1},{"key":"14-5","value":1},
{"key":"14-6","value":1},{"key":"14-2","value":1},{"key":"14-3","value":1},{"key":"19-0","value":1},
{"key":"11-5","value":1},{"key":"13-2","value":1},{"key":"11-6","value":2},{"key":"11-3","value":1},
{"key":"12-6","value":1},{"key":"6-5","value":1},{"key":"5-5","value":1},{"key":"9-2","value":1},
{"key":"9-5","value":1},{"key":"9-6","value":1},{"key":"5-2","value":1},{"key":"7-6","value":1},
{"key":"7-5","value":1},{"key":"7-4","value":1},{"key":"17-5","value":1},{"key":"8-5","value":1},
{"key":"10-2","value":1},{"key":"10-5","value":1},{"key":"10-6","value":1}] |
+------------+
> SELECT FLATTEN(KVGEN(checkin_info)) checkins
  FROM dfs.yelp.`checkin.json`
  LIMIT 6;
+---------------------------+
| checkins                  |
+---------------------------+
| {"key":"3-4","value":1}   |
| {"key":"13-5","value":1}  |
| {"key":"6-6","value":1}   |
| {"key":"14-5","value":1}  |
| {"key":"14-6","value":1}  |
| {"key":"14-2","value":1}  |
+---------------------------+
Convert Map with a wide set of dynamic columns into an array of key-value pairs
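The conversion from a map with unknown keys into an array of key-value pairs can be sketched in a few lines of Python. This emulates the shape of the output shown above and is not Drill's implementation; the second record, its business_id and its counts are invented so that the aggregation has something to add up.

```python
def kvgen(mapping):
    """Turn {"k": v, ...} into [{"key": "k", "value": v}, ...]."""
    return [{"key": key, "value": value} for key, value in mapping.items()]


# First record is a shortened version of the slide's sample; the second is hypothetical.
checkins = [
    {"checkin_info": {"3-4": 1, "13-5": 1, "11-6": 2},
     "business_id": "JwUE5GmEO-sH1FuwJgKBlQ"},
    {"checkin_info": {"11-6": 3, "23-0": 4},
     "business_id": "hypothetical-id"},
]


def total_for_key(records, key):
    """Flatten every record's key-value pairs and sum the values for one key."""
    return sum(
        pair["value"]
        for record in records
        for pair in kvgen(record["checkin_info"])
        if pair["key"] == key
    )


print(kvgen(checkins[0]["checkin_info"]))
print(total_for_key(checkins, "11-6"))
```

Because the keys become data rather than column names, a filter on key works without knowing in advance which keys exist.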
- 31.
Makes it easy to work with dynamic/unknown columns
// Count total number of checkins on Sunday midnight
> SELECT SUM(checkintbl.checkins.`value`) AS SundayMidnightCheckins
  FROM (SELECT FLATTEN(KVGEN(checkin_info)) checkins
        FROM dfs.yelp.`checkin.json`) checkintbl
  WHERE checkintbl.checkins.key = '23-0';
+------------------------+
| SundayMidnightCheckins |
+------------------------+
| 8575                   |
+------------------------+
- 32.
SQL + NoSQL = Accessible & Linearly Scalable
- 33.
MusicBrainz on NoSQL
Artists, albums, tracks and labels are key objects
Reality check:
Add works (compositions), recordings, release, release group
7 tables for artist alone
12 for place, 7 for label, 17 for release/group, 8 for work
(but only 4 for recording!)
Total of 12 + 7 + 17 + 8 + 4 = 48 tables
But wait, there’s more!
10 annotation tables, 10 edit tables, 19 tag tables, 5 rating tables, 86 link tables, 5 cover art tables and 3 tables for CD timing info (138 total)
And 50 more tables that aren’t documented yet
- 35.
236 tables
to describe 7 kinds of things
- 36.
artist
id
gid
name
sort_name
begin_date
end_date
ended
type
gender
area
begin_area
end_area
comment
list<ipi>
list<isni>
list<alias>
list<release_id>
list<recording_id>
artist
id
gid
name
sort_name
begin_date
end_date
ended
type
gender
area
begin_area
end_area
comment
list<ipi>
list<isni>
list<alias>
- 37.
artist
id
gid
name
sort_name
begin_date
end_date
ended
type
gender
area
begin_area
end_area
comment
list<ipi>
list<isni>
list<alias>
list<release_id>
list<recording_id>
Primitive values
One to many relations
Equivalent to indexes
- 38.
Further Reductions
All 86 link tables become properties on artists, releases and other entities
All 44 tag, rating and annotation tables become list properties
All 5 cover art tables become lists of file references
Current score: 162 tables become 4
You get the idea
- 39.
Is This Good?
Expressivity
– The JSON data model is at least as expressive as the original relational model
• Many cases easier to describe in nested data
• No cases are harder
Efficiency
– Inlining can increase data size. Locality improves, however
– Sessionizing can substantially decrease data size
– Inlining back-references is more efficient than ordinary indexes
– Inlined columnar data allows 1000x speedup for time series
Introspection (you decide)
- 40.
Searching for Elvis
// Find discs where Elvis was credited
> SELECT DISTINCT album_id, name
  FROM (SELECT id album_id, artist_id, name, FLATTEN(credit)
        FROM release) albums
  JOIN (SELECT DISTINCT artist_id
        FROM (SELECT id artist_id, FLATTEN(alias)
              FROM artist
              WHERE name LIKE 'Elvis%Presley')
       ) artists
  USING (artist_id);
- 41.
Benefits
Extended relational model allows massive simplification
– On a real example, we see >20x reduction in number of tables
Simplification drives improved introspection
– This is good
Apache Drill gives very high performance execution for extended relational problems
You can try this out today
- 42.
Security Controls
- 43.
Access Controls that Scale
• PAM Authentication + User Impersonation
• Fine-grained row and column level access control with Drill Views – no centralized security repository required
[Diagram: users (U) query Drill View 1 and Drill View 2, which are defined over Files, HBase and Hive]
- 44.
Granular Security via Drill Views
Raw File (/raw/cards.csv), Owner: Admins, Permission: Admins
Name  City      State  Credit Card #
Dave  San Jose  CA     1374-7914-3865-4817
John  Boulder   CO     1374-9735-1794-9711
Data Scientist View (/views/maskedcards.csv), Owner: Admins, Permission: Data Scientists
Name  City      State  Credit Card #
Dave  San Jose  CA     1374-1111-1111-1111
John  Boulder   CO     1374-1111-1111-1111
Business Analyst View, Owner: Admins, Permission: Business Analysts
Name  City      State
Dave  San Jose  CA
John  Boulder   CO
Not a physical data copy
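The masking and column-hiding views on this slide can be pictured as functions evaluated over the raw rows at read time, which is why no second copy of the data exists. The Python sketch below illustrates that idea only; it is not Drill view syntax, and the function names and the column key "card" are assumptions made for the example.

```python
# Rows copied from the raw file shown on the slide.
RAW_CARDS = [
    {"name": "Dave", "city": "San Jose", "state": "CA", "card": "1374-7914-3865-4817"},
    {"name": "John", "city": "Boulder", "state": "CO", "card": "1374-9735-1794-9711"},
]


def mask_card(card):
    """Keep the first group of digits and replace every later group with 1111."""
    first, *rest = card.split("-")
    return "-".join([first] + ["1111"] * len(rest))


def data_scientist_view(rows):
    """Same columns as the raw file, but with the card number masked."""
    return [{**row, "card": mask_card(row["card"])} for row in rows]


def business_analyst_view(rows):
    """Only the name, city and state columns; the card column is not exposed."""
    return [{key: row[key] for key in ("name", "city", "state")} for row in rows]


print(data_scientist_view(RAW_CARDS))
print(business_analyst_view(RAW_CARDS))
```

Both views are computed from RAW_CARDS on every call, so a change to the raw rows is visible through them immediately and the raw data itself is never modified.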
- 45.
Ownership Chaining
Combine Self Service Exploration with Data Governance
Raw File (/raw/cards.csv), John (Owner)
Name  City      State  Credit Card #
Dave  San Jose  CA     1374-7914-3865-4817
John  Boulder   CO     1374-9735-1794-9711
Data Scientist (/views/V_Scientist), John (Owner), Jane (Read)
Name  City      State  Credit Card #
Dave  San Jose  CA     1374-1111-1111-1111
John  Boulder   CO     1374-1111-1111-1111
Analyst (/views/V_Analyst), Jane (Owner), Jack (Read)
Name  City      State
Dave  San Jose  CA
John  Boulder   CO
Access path: V_Analyst -> V_Scientist -> RAWFILE
Jack queries the view V_Analyst
Does Jack have access to V_Analyst? -> YES
Who is the owner of V_Analyst? -> Jane
Drill accesses V_Analyst as Jane (Impersonation hop 1)
Does Jane have access to V_Scientist? -> YES
Who is the owner of V_Scientist? -> John
Drill accesses V_Scientist as John (Impersonation hop 2)
Does John have permissions on raw file? -> YES
Who is the owner of raw file? -> John
Drill accesses source file as John (no impersonation here)
*Ownership chain length (# hops) is configurable
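The walk-through above can be traced with a short Python sketch that follows the chain from the queried view down to the raw file, switching to each object's owner when the current user is only a reader. It models the slide's example and nothing more; the dictionary layout, the function name resolve and the max_hops parameter are hypothetical stand-ins for whatever Drill uses internally.

```python
# Objects from the slide: each has an owner, a set of readers and an underlying source.
OBJECTS = {
    "V_Analyst": {"owner": "Jane", "readers": {"Jack"}, "source": "V_Scientist"},
    "V_Scientist": {"owner": "John", "readers": {"Jane"}, "source": "RAWFILE"},
    "RAWFILE": {"owner": "John", "readers": set(), "source": None},
}


def resolve(user, name, max_hops=3):
    """Return [(object, accessed_as), ...] or raise PermissionError."""
    trail = []
    hops = 0
    current_user = user
    while name is not None:
        obj = OBJECTS[name]
        if current_user != obj["owner"] and current_user not in obj["readers"]:
            raise PermissionError(f"{current_user} cannot read {name}")
        if current_user != obj["owner"]:
            hops += 1  # one impersonation hop: continue as the object's owner
            if hops > max_hops:
                raise PermissionError("ownership chain is longer than max_hops")
            current_user = obj["owner"]
        trail.append((name, current_user))
        name = obj["source"]
    return trail


print(resolve("Jack", "V_Analyst"))
```

In this model Jack never needs permissions on V_Scientist or the raw file; each hop is authorized by the owner of the object above it.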
- 46.
Security Summary
Logical
– No physical data copies/silos
Granular
– Row level and column level security controls
De-centralized
– User impersonation respecting storage system permissions
– No separate permission repository for granular controls
– Integrated with Hadoop File System permissions and LDAP
Self-service w/ governance
– If you have access to data, you control who can access it and how widely
– Audits
- 47.
National Nutrient Database
- 51.
Sample SR27 Records
~01001~^~0100~^~Butter, salted~^~BUTTER,WITH SALT~^~~^~~^~Y~^~~^0^~~^6.38^4.27^8.79^3.87
~01004~^~0100~^~Cheese, blue~^~CHEESE,BLUE~^~~^~~^~Y~^~~^0^~~^6.38^4.27^8.79^3.87
~01005~^~0100~^~Cheese, brick~^~CHEESE,BRICK~^~~^~~^~Y~^~~^0^~~^6.38^4.27^8.79^3.87
~01006~^~0100~^~Cheese, brie~^~CHEESE,BRIE~^~~^~~^~Y~^~~^0^~~^6.38^4.27^8.79^3.87
~01007~^~0100~^~Cheese, camembert~^~CHEESE,CAMEMBERT~^~~^~~^~Y~^~~^0^~~^6.38^4.27^8.79^3.87
~01008~^~0100~^~Cheese, caraway~^~CHEESE,CARAWAY~^~~^~~^~~^~~^0^~~^6.38^4.27^8.79^3.87
- 52.
Configuration
-- Format --
"nndb": {
"type": "text",
"extensions": [ "txt" ],
"quote": "~",
"escape": "~",
"delimiter": "^"
},
-- Workspace --
"nndb": {
"location": "/opt/drill/nndb",
"writable": true,
"storageformat": "parquet"
},
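To see what the quote and delimiter settings in the format definition mean for the SR27 records, the same two characters can be handed to Python's standard csv module. This is only a way to check the record layout outside Drill, under the assumption that the csv module's quoting rules are close enough for these sample lines; it says nothing about how Drill's text reader is implemented.

```python
import csv
import io

# Two records copied from the "Sample SR27 Records" slide.
SAMPLE = (
    "~01001~^~0100~^~Butter, salted~^~BUTTER,WITH SALT~^~~^~~^~Y~^~~^0^~~^6.38^4.27^8.79^3.87\n"
    "~01008~^~0100~^~Cheese, caraway~^~CHEESE,CARAWAY~^~~^~~^~~^~~^0^~~^6.38^4.27^8.79^3.87\n"
)


def parse_sr27(text):
    """Split records on '^' and treat '~' as the quote character."""
    reader = csv.reader(io.StringIO(text), delimiter="^", quotechar="~")
    return list(reader)


rows = parse_sr27(SAMPLE)
for row in rows:
    print(row[0], "|", row[2], "|", row[-4:])
```

Commas inside a quoted value such as "Butter, salted" stay within one field because the delimiter is "^", not ",".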
- 53.
Sample JSON
{
"ndb_no":"08613",
"shrt_desc":"CEREALS RTE,KELLOGG'S SPL K MULTIGRAIN OATS & HONEY",
"nut_data":[{
"nutr_no": "203",
"nutr_val": "7.80",
"nutr_def": {"num_dec":2,"tagname":"PROCNT","nutrdesc":"Protein"},
"data_src":[{
"datasrc_id": "S6941",
"authors": "A Kellogg, Co.",
"title": "Kellogg Company Data",
"Year": "2011"
}]
}, {
"nutr_no": "205",
"nutr_val": "85.00",
"nutr_def": {"num_dec":2,"tagname":"CHOCDF","nutrdesc":"Carbohydrate, by difference"},
"data_src":[{
"datasrc_id": "S6941",
"authors": "C Kellogg, Co.",
"title": "Kellogg Company Data",
"Year": "2011"
}]
}],
"langual":["ANISE","FRUIT","WHOLE, NATURAL SHAPE","NOT HEAT-TREATED","COOKING METHOD NOT APPLICABLE","WATER REMOVED","HEAT DRIED","HUMAN FOOD, NO AGE SPECIFICATION"]
}
- 54.
Demo Queries
All queries can be found within these blogs:
https://www.mapr.com/blog/drilling-healthy-choices
https://www.mapr.com/blog/evolution-database-schemas-using-sql-nosql
- 55.
Live Demo
- 56.
Drill is Top-Ranked SQL-on-Hadoop
Source: Gigaom Research, 2015
Key:
• Number indicates company’s relative strength across all vectors
• Size of ball indicates company’s relative strength along individual vector
“Drill isn’t just about SQL-on-Hadoop. It’s about SQL-on-pretty-much-anything, immediately, and without formality.”
- 57.
Drill Project Status
[Timeline marks: Sep’12, Jun’13, Aug’14, Sep’14, Nov’14, Dec’14, Jan’15, Mar’15, Apr’15, May’15]
[Timeline milestones: Project incubation; First release, Drill 0.1; Dev Preview, Drill 0.4; Beta, Drill 0.5; Drill 0.6; Apache Top Level Project; Drill 0.7; Drill 0.8; GigaOm top ranked SQL on Hadoop; Drill 0.9; Drill 1.0 (May’15, just released)]
Highlights
• Apache Top Level Project
• Growing user adoption
• Iterative project cycles
• Large community, growing rapidly
• 50 contributors, 1000’s downloads, 7 releases < 9 months
- 58.
Recommendations for Getting Started with Drill
New to Drill?
– Get started with Free MapR On Demand training
– Test Drive Drill on cloud with AWS
– Learn how to use Drill with Hadoop using MapR sandbox
Ready to play with your data?
– Try out Apache Drill in 10 mins guide on your desktop
– Download Drill for your cluster and start exploration
– Comprehensive tutorials and documentation available
Ask questions
– user@drill.apache.org
- 61.
Q&A
Engage with us!
@kingmesal
jscott@mapr.com
MapR: maprtech, mapr-technologies