
Rethinking SQL for Big Data with Apache Drill


Jim Scott's Presentation at Big Data Madison


  1. © 2015 MapR Technologies
     Jim Scott – Director, Enterprise Strategy & Architecture
     @kingmesal #BigDataMadison
  2. Find my presentation and other related resources here: http://events.mapr.com/BigDataMadison (you can find this link in the event’s page at meetup.com)
     •  Today’s Presentation
     •  Whiteboard & demo videos
     •  Free On-Demand Training
     •  Free eBooks
     •  Free Hadoop Sandbox
     •  And more…
  3. Topics
     •  Motivation
     •  Using Drill
     •  SQL + NoSQL = ???
     •  Security Controls
     •  Demo
     •  Resources
  4. Empowering “as it happens” businesses by speeding up the data-to-action cycle
  5. Top-Ranked NoSQL; Top-Ranked Hadoop Distribution; Top-Ranked SQL-on-Hadoop Solution
  6. Motivation
  7. Data is Doubling Every Two Years
     Unstructured data will account for more than 80% of the data collected by organizations.
     Source: Human-Computer Interaction & Knowledge Discovery in Complex Unstructured, Big Data
     [Chart: total data stored, 1980 to 2020, structured and semi-structured data]
  8. Data Increasingly Stored in Non-Relational Datastores (timeline: 1980 to 2020)
                   RELATIONAL DATABASES                      NON-RELATIONAL DATASTORES
     Volume        GBs-TBs                                   TBs-PBs
     Database      Fixed schema; DBA controls structure      Dynamic / flexible schema; application controls structure
     Structure     Structured                                Structured, semi-structured and unstructured
     Development   Planned (release cycle = months-years)    Iterative (release cycle = days-weeks)
  9. How To Bring SQL to Non-Relational Data Stores?
     Familiarity of SQL
     •  ANSI SQL semantics
     •  BI (Tableau, MicroStrategy, etc.)
     •  Low latency
     Agility of NoSQL
     •  No schema management
        –  HDFS (Parquet, JSON, etc.)
        –  HBase
        –  …
     •  No transformation
        –  No silos of data
     •  Ease of use
  10. Industry's First Schema-free SQL engine for Big Data
  11. Enabling “As-It-Happens” Business with Instant Analytics
      Traditional approach: Hadoop data -> Data modeling -> Transformation -> Data movement (optional) -> Users
      Total time to insight: weeks to months
      Exploratory approach: Hadoop data -> Users
      Total time to insight: minutes
      New business questions; source data evolution
  12. Evolution Towards Self-Service Data Exploration
                                        Data Modeling and Transformation   Data Visualization
      Traditional BI w/ RDBMS           IT-driven                          IT-driven
      Self-Service BI w/ RDBMS          IT-driven                          Self-service
      SQL-on-Hadoop                     IT-driven                          Self-service
      Self-Service Data Exploration     Optional                           Self-service
      Zero-day analytics
  13. Common Use Cases
      •  Raw Data Exploration
      •  JSON Analytics
      •  DWH offload
      Data sources: Hive, HBase, Files, Directories, …; {JSON}, Parquet, Text Files, …
  14. Drill Supports Schema Discovery On-The-Fly
      Schema Declared In Advance (SCHEMA ON WRITE, SCHEMA BEFORE READ)
      •  Fixed schema
      •  Leverage schema in centralized repository (Hive Metastore)
      Schema Discovered On-The-Fly (SCHEMA ON THE FLY)
      •  Fixed schema, evolving schema or schema-less
      •  Leverage schema in centralized repository or self-describing data
  15. Drill Enables ‘SQL-on-Everything’
      SELECT * FROM dfs.yelp.`business.json`
      dfs: Storage plugin instance
      -  DFS (Text, Parquet, JSON)
      -  HBase/MapR-DB
      -  Hive Metastore/HCatalog
      -  Easy API to go beyond Hadoop
      yelp: Workspace
      -  Sub-directory
      -  HBase namespace
      -  Hive database
      `business.json`: Table
      -  Pathnames
      -  Hive table
      -  HBase table
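The three-part naming on this slide can be illustrated with a small Python sketch. It is not Drill code: `split_table_path` and the regular expression are hypothetical helpers that only show how a name such as dfs.yelp.`business.json` breaks down into storage plugin, workspace and table, assuming exactly three parts and backtick quoting for parts that contain dots.

```python
import re

# Matches either a `backtick-quoted` part or a bare part between dots.
_PART = re.compile(r"`([^`]*)`|([^.`]+)")


def split_table_path(path: str) -> dict:
    """Split 'plugin.workspace.table' into its three named parts."""
    parts = [quoted or bare for quoted, bare in _PART.findall(path)]
    if len(parts) != 3:
        raise ValueError(f"expected plugin.workspace.table, got {path!r}")
    plugin, workspace, table = parts
    return {"storage_plugin": plugin, "workspace": workspace, "table": table}


print(split_table_path("dfs.yelp.`business.json`"))
```

The backticks matter because the file name itself contains a dot; without them the name would split into four parts.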
  16. Drill’s Data Model is Flexible
      Formats placed by flexibility, from flat to complex and from fixed schema to dynamic schema: JSON, BSON, HBase, Parquet, Avro, CSV, TSV
      RDBMS/SQL-on-Hadoop table:
        Name      Gender  Age
        Michael   M       6
        Jennifer  F       3
      Apache Drill table:
        {
          name: {
            first: Michael,
            last: Smith
          },
          hobbies: [ski, soccer],
          district: Los Altos
        }
        {
          name: {
            first: Jennifer,
            last: Gates
          },
          hobbies: [sing],
          preschool: CCLC
        }
  17. Reuse Existing SQL Tools and Skills
      •  Leverage SQL-compatible tools (BI, query builders, etc.) via Drill’s standard ODBC, JDBC and ANSI SQL support
      •  Enable business analysts, technical analysts and data scientists to explore and analyze large volumes of real-time data
  18. Using Drill with Yelp
  19. Business dataset
      {
        "business_id": "4bEjOyTaDG24SY5TxsaUNQ",
        "full_address": "3655 Las Vegas Blvd S\nThe Strip\nLas Vegas, NV 89109",
        "hours": {
          "Monday": {"close": "23:00", "open": "07:00"},
          "Tuesday": {"close": "23:00", "open": "07:00"},
          "Friday": {"close": "00:00", "open": "07:00"},
          "Wednesday": {"close": "23:00", "open": "07:00"},
          "Thursday": {"close": "23:00", "open": "07:00"},
          "Sunday": {"close": "23:00", "open": "07:00"},
          "Saturday": {"close": "00:00", "open": "07:00"}
        },
        "open": true,
        "categories": ["Breakfast & Brunch", "Steakhouses", "French", "Restaurants"],
        "city": "Las Vegas",
        "review_count": 4084,
        "name": "Mon Ami Gabi",
        "neighborhoods": ["The Strip"],
        "longitude": -115.172588519464,
        "state": "NV",
        "stars": 4.0,
        "attributes": {
          "Alcohol": "full_bar",
          "Noise Level": "average",
          "Has TV": false,
          "Attire": "casual",
          "Ambience": {
            "romantic": true,
            "intimate": false,
            "touristy": false,
            "hipster": false,
            "classy": true,
            "trendy": false,
            "casual": false
          },
          "Good For": {"dessert": false, "latenight": false, "lunch": false,
                       "dinner": true, "breakfast": false, "brunch": false}
        }
      }
  20. Reviews dataset
      {
        "votes": {"funny": 0, "useful": 2, "cool": 1},
        "user_id": "Xqd0DzHaiyRqVH3WRG7hzg",
        "review_id": "15SdjuK7DmYqUAj6rjGowg",
        "stars": 5,
        "date": "2007-05-17",
        "text": "dr. goldberg offers everything ...",
        "type": "review",
        "business_id": "vcNAWiLM4dR7D2nwwJ7nCA"
      }
  21. Zero to Results in 2 minutes
      Install:
      $ tar -xvzf apache-drill-1.0.0.tar.gz
      Launch shell (embedded mode):
      $ bin/sqlline -u jdbc:drill:zk=local
      $ bin/drill-embedded
      Query files and directories:
      > SELECT state, city, count(*) AS businesses
        FROM dfs.yelp.`business.json`
        GROUP BY state, city
        ORDER BY businesses DESC LIMIT 10;
      Results:
      +------------+------------+-------------+
      | state      | city       | businesses  |
      +------------+------------+-------------+
      | NV         | Las Vegas  | 12021       |
      | AZ         | Phoenix    | 7499        |
      | AZ         | Scottsdale | 3605        |
      | EDH        | Edinburgh  | 2804        |
      | AZ         | Mesa       | 2041        |
      | AZ         | Tempe      | 2025        |
      | NV         | Henderson  | 1914        |
      | AZ         | Chandler   | 1637        |
      | WI         | Madison    | 1630        |
      | AZ         | Glendale   | 1196        |
      +------------+------------+-------------+
  22. Directories are implicit partitions
      SELECT dir0, SUM(amount)
      FROM sales
      WHERE dir1 IN ('q1', 'q2')
      GROUP BY dir0
      sales
      ├── 2014
      │   ├── q1
      │   ├── q2
      │   ├── q3
      │   └── q4
      └── 2015
          └── q1
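To make the idea of implicit partitions concrete, here is a minimal Python sketch that derives dir0 and dir1 from file paths under the sales directory and then totals amounts per year for selected quarters. It only mimics the behaviour the slide describes; the file names, the amounts and the helpers `dir_columns` and `sum_by_year` are hypothetical, and nothing here is Drill's implementation.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def dir_columns(root: str, file_path: str) -> dict:
    """Map each directory level below `root` to dir0, dir1, ..."""
    rel = PurePosixPath(file_path).relative_to(root)
    # rel.parts[:-1] drops the file name and keeps only the directories.
    return {f"dir{i}": part for i, part in enumerate(rel.parts[:-1])}


# Hypothetical files with one amount each, laid out like the slide's tree.
FILES = [
    ("sales/2014/q1/jan.csv", 100),
    ("sales/2014/q2/apr.csv", 50),
    ("sales/2014/q3/jul.csv", 70),
    ("sales/2015/q1/jan.csv", 30),
]


def sum_by_year(files, quarters):
    """Sum amounts per dir0 (year), keeping only rows whose dir1 is in `quarters`."""
    totals = defaultdict(int)
    for path, amount in files:
        cols = dir_columns("sales", path)
        if cols["dir1"] in quarters:
            totals[cols["dir0"]] += amount
    return dict(totals)


print(sum_by_year(FILES, {"q1", "q2"}))
```

The q3 file is skipped by the quarter filter, so the 2014 total covers only q1 and q2.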
  23. Intuitive SQL Access to Complex Data
      Query data with any levels of nesting
      // It’s Friday 10pm in Vegas and looking for Hummus
      > SELECT name, stars, b.hours.Friday friday, categories
        FROM dfs.yelp.`business.json` b
        WHERE b.hours.Friday.`open` < '22:00' AND
              b.hours.Friday.`close` > '22:00' AND
              REPEATED_CONTAINS(categories, 'Mediterranean') AND
              city = 'Las Vegas'
        ORDER BY stars DESC
        LIMIT 2;
      +------------+------------+------------+------------+
      | name | stars | friday | categories |
      +------------+------------+------------+------------+
      | Olives | 4.0 | {"close":"22:30","open":"11:00"} | ["Mediterranean","Restaurants"] |
      | Marrakech Moroccan Restaurant | 4.0 | {"close":"23:00","open":"17:30"} | ["Mediterranean","Middle Eastern","Moroccan","Restaurants"] |
      +------------+------------+------------+------------+
  24. ANSI SQL Compatibility
      Use familiar SQL functionality (Joins, Aggregations, Sorting, Sub-queries, SQL data types)
      // Get top cool rated businesses
      > SELECT b.name FROM dfs.yelp.`business.json` b
        WHERE b.business_id IN
          (SELECT r.business_id FROM dfs.yelp.`review.json` r
           GROUP BY r.business_id HAVING SUM(r.votes.cool) > 2000
           ORDER BY SUM(r.votes.cool) DESC);
      +------------+
      | name |
      +------------+
      | Earl of Sandwich |
      | XS Nightclub |
      | The Cosmopolitan of Las Vegas |
      | Wicked Spoon |
      +------------+
  25. Logical Views
      Lightweight file system based views for granular and de-centralized data management
      // Create a view combining business and reviews datasets
      > CREATE OR REPLACE VIEW dfs.tmp.BusinessReviews AS
        SELECT b.name, b.stars, r.votes.funny,
               r.votes.useful, r.votes.cool, r.`date`
        FROM dfs.yelp.`business.json` b, dfs.yelp.`review.json` r
        WHERE r.business_id = b.business_id;
      +------------+------------+
      | ok | summary |
      +------------+------------+
      | true | View 'BusinessReviews' created successfully in 'dfs.tmp' schema |
      +------------+------------+
      > SELECT COUNT(*) AS Total FROM dfs.tmp.BusinessReviews;
      +------------+
      | Total      |
      +------------+
      | 1125458    |
      +------------+
  26. Materialized Views AKA Tables
      Save analysis results as tables using familiar CTAS syntax
      > ALTER SESSION SET `store.format` = 'parquet';
      > CREATE TABLE dfs.yelp.BusinessReviewsTbl AS
        SELECT b.name, b.stars, r.votes.funny funny,
               r.votes.useful useful, r.votes.cool cool, r.`date`
        FROM dfs.yelp.`business.json` b, dfs.yelp.`review.json` r
        WHERE r.business_id = b.business_id;
      +------------+---------------------------+
      | Fragment   | Number of records written |
      +------------+---------------------------+
      | 1_0        | 176448                    |
      | 1_1        | 192439                    |
      | 1_2        | 198625                    |
      | 1_3        | 200863                    |
      | 1_4        | 181420                    |
      | 1_5        | 175663                    |
      +------------+---------------------------+
  27. Repeated Values Support
      Dynamically flatten repeated and nested data elements as part of SQL queries. No ETL necessary
      // Flatten repeated categories
      > SELECT name, categories
        FROM dfs.yelp.`business.json` LIMIT 3;
      +------------+------------+
      | name | categories |
      +------------+------------+
      | Eric Goldberg, MD | ["Doctors","Health & Medical"] |
      | Pine Cone Restaurant | ["Restaurants"] |
      | Deforest Family Restaurant | ["American (Traditional)","Restaurants"] |
      +------------+------------+
      > SELECT name, FLATTEN(categories) AS categories
        FROM dfs.yelp.`business.json` LIMIT 5;
      +------------+------------+
      | name | categories |
      +------------+------------+
      | Eric Goldberg, MD | Doctors |
      | Eric Goldberg, MD | Health & Medical |
      | Pine Cone Restaurant | Restaurants |
      | Deforest Family Restaurant | American (Traditional) |
      | Deforest Family Restaurant | Restaurants |
      +------------+------------+
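What FLATTEN does to a repeated column can be emulated in a few lines of Python. This is a sketch of the row-expansion semantics shown in the slide's output, not Drill's implementation; the `flatten` helper is hypothetical, and the three input rows are copied from the slide.

```python
from collections import Counter


def flatten(rows, column):
    """Yield one output row per element of the list stored in `column`."""
    for row in rows:
        for element in row[column]:
            # Copy the row and replace the list with a single element.
            yield {**row, column: element}


businesses = [
    {"name": "Eric Goldberg, MD", "categories": ["Doctors", "Health & Medical"]},
    {"name": "Pine Cone Restaurant", "categories": ["Restaurants"]},
    {"name": "Deforest Family Restaurant",
     "categories": ["American (Traditional)", "Restaurants"]},
]

flat = list(flatten(businesses, "categories"))
for row in flat:
    print(row["name"], "|", row["categories"])

# Once flattened, a plain group-and-count gives category frequencies.
category_counts = Counter(row["categories"] for row in flat)
```

Three input rows become five output rows, one per category, which is the shape needed for the GROUP BY on the next slide.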
  28. Extensions to ANSI SQL to work with repeated values
      // Get most common business categories
      > SELECT category, count(*) AS categorycount
        FROM (SELECT name, FLATTEN(categories) AS category
              FROM dfs.yelp.`business.json`) c
        GROUP BY category ORDER BY categorycount DESC;
      +------------+------------+
      | category | categorycount |
      +------------+------------+
      | Restaurants | 14303 |
      …
      | Australian | 1 |
      | Boat Dealers | 1 |
      | Firewood | 1 |
      +------------+------------+
  29. Checkins dataset
      {
        "checkin_info": {
          "3-4": 1, "13-5": 1, "6-6": 1, "14-5": 1, "14-6": 1, "14-2": 1, "14-3": 1,
          "19-0": 1, "11-5": 1, "13-2": 1, "11-6": 2, "11-3": 1, "12-6": 1, "6-5": 1,
          "5-5": 1, "9-2": 1, "9-5": 1, "9-6": 1, "5-2": 1, "7-6": 1, "7-5": 1,
          "7-4": 1, "17-5": 1, "8-5": 1, "10-2": 1, "10-5": 1, "10-6": 1
        },
        "type": "checkin",
        "business_id": "JwUE5GmEO-sH1FuwJgKBlQ"
      }
  30. Supports Dynamic / Unknown Columns
      Convert Map with a wide set of dynamic columns into an array of key-value pairs
      > SELECT KVGEN(checkin_info) checkins
        FROM dfs.yelp.`checkin.json` LIMIT 1;
      +------------+
      | checkins |
      +------------+
      | [{"key":"3-4","value":1},{"key":"13-5","value":1},{"key":"6-6","value":1},{"key":"14-5","value":1},
        {"key":"14-6","value":1},{"key":"14-2","value":1},{"key":"14-3","value":1},{"key":"19-0","value":1},
        {"key":"11-5","value":1},{"key":"13-2","value":1},{"key":"11-6","value":2},{"key":"11-3","value":1},
        {"key":"12-6","value":1},{"key":"6-5","value":1},{"key":"5-5","value":1},{"key":"9-2","value":1},
        {"key":"9-5","value":1},{"key":"9-6","value":1},{"key":"5-2","value":1},{"key":"7-6","value":1},
        {"key":"7-5","value":1},{"key":"7-4","value":1},{"key":"17-5","value":1},{"key":"8-5","value":1},
        {"key":"10-2","value":1},{"key":"10-5","value":1},{"key":"10-6","value":1}] |
      +------------+
      > SELECT FLATTEN(KVGEN(checkin_info)) checkins
        FROM dfs.yelp.`checkin.json` LIMIT 6;
      +------------+
      | checkins |
      +------------+
      | {"key":"3-4","value":1} |
      | {"key":"13-5","value":1} |
      | {"key":"6-6","value":1} |
      | {"key":"14-5","value":1} |
      | {"key":"14-6","value":1} |
      | {"key":"14-2","value":1} |
      +------------+
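The KVGEN step can be pictured with a short Python emulation: a map whose keys are not known in advance becomes a list of key/value pairs, which can then be filtered and summed like ordinary rows. This is a sketch of the semantics only; `kvgen` and `total_for_key` are hypothetical helpers, the first record is a subset of the slide's sample, and the second record is invented for illustration.

```python
def kvgen(mapping):
    """Turn {'3-4': 1, ...} into [{'key': '3-4', 'value': 1}, ...]."""
    return [{"key": key, "value": value} for key, value in mapping.items()]


checkins = [
    # Subset of the sample record shown on the Checkins dataset slide.
    {"business_id": "JwUE5GmEO-sH1FuwJgKBlQ",
     "checkin_info": {"3-4": 1, "13-5": 1, "11-6": 2}},
    # Hypothetical second record, added so the sum spans more than one row.
    {"business_id": "hypothetical-second-business",
     "checkin_info": {"11-6": 3, "23-0": 4}},
]


def total_for_key(records, key):
    """Flatten every record's key/value pairs and sum the values for one key."""
    return sum(
        pair["value"]
        for record in records
        for pair in kvgen(record["checkin_info"])
        if pair["key"] == key
    )


print(kvgen(checkins[0]["checkin_info"]))
print(total_for_key(checkins, "11-6"))
```

Because the keys become data rather than column names, the filter on a single key needs no prior knowledge of which columns exist in any given record.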
  31. Makes it easy to work with dynamic/unknown columns
      // Count total number of checkins on Sunday midnight
      > SELECT SUM(checkintbl.checkins.`value`) AS SundayMidnightCheckins
        FROM (SELECT FLATTEN(KVGEN(checkin_info)) checkins
              FROM dfs.yelp.`checkin.json`) checkintbl
        WHERE checkintbl.checkins.key = '23-0';
      +------------------------+
      | SundayMidnightCheckins |
      +------------------------+
      | 8575                   |
      +------------------------+
  32. SQL + NoSQL = Accessible & Linearly Scalable
  33. MusicBrainz on NoSQL
      Artists, albums, tracks and labels are key objects
      Reality check:
      •  Add works (compositions), recordings, release, release group
      •  7 tables for artist alone
      •  12 for place, 7 for label, 17 for release/group, 8 for work (but only 4 for recording!)
      •  Total of 12 + 7 + 17 + 8 + 4 = 48 tables
      But wait, there’s more!
      •  10 annotation tables, 10 edit tables, 19 tag tables, 5 rating tables, 86 link tables, 5 cover art tables and 3 tables for CD timing info (138 total)
      •  And 50 more tables that aren’t documented yet
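The table counts on this slide can be checked with a few lines of Python. The sketch uses only the numbers stated on the slide; the dictionary names are hypothetical labels for the three groups.

```python
# Counts exactly as stated on the slide.
core_tables = {"place": 12, "label": 7, "release/group": 17, "work": 8, "recording": 4}
supporting_tables = {
    "annotation": 10, "edit": 10, "tag": 19, "rating": 5,
    "link": 86, "cover art": 5, "CD timing": 3,
}
undocumented_tables = 50

core_total = sum(core_tables.values())
supporting_total = sum(supporting_tables.values())
grand_total = core_total + supporting_total + undocumented_tables

print(core_total, supporting_total, grand_total)
```

The three groups add up to the 236 tables quoted two slides later.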
  34. 180 Tables NOT SHOWN!
  35. 236 tables to describe 7 kinds of things
  36. artist (with back-references): id, gid, name, sort_name, begin_date, end_date, ended, type, gender, area, begin_area, end_area, comment, list<ipi>, list<isni>, list<alias>, list<release_id>, list<recording_id>
      artist: id, gid, name, sort_name, begin_date, end_date, ended, type, gender, area, begin_area, end_area, comment, list<ipi>, list<isni>, list<alias>
  37. artist
      Primitive values: id, gid, name, sort_name, begin_date, end_date, ended, type, gender, area, begin_area, end_area, comment
      One to many relations: list<ipi>, list<isni>, list<alias>
      Equivalent to indexes: list<release_id>, list<recording_id>
  38. Further Reductions
      •  All 86 link tables become properties on artists, releases and other entities
      •  All 44 tag, rating and annotation tables become list properties
      •  All 5 cover art tables become lists of file references
      Current score: 162 tables become 4
      You get the idea
  39. Is This Good?
      Expressivity
      –  The JSON data model is at least as expressive as the original relational model
         •  Many cases easier to describe in nested data
         •  No cases are harder
      Efficiency
      –  Inlining can increase data size. Locality improves, however
      –  Sessionizing can substantially decrease data size
      –  Inlining back-references is more efficient than ordinary indexes
      –  Inlined columnar data allows 1000x speedup for time series
      Introspection (you decide)
  40. Searching for Elvis
      // Find discs where Elvis was credited
      > SELECT distinct album_id, name FROM
          (SELECT id album_id, artist_id, name, FLATTEN(credit) FROM release) albums
        join
          (SELECT distinct artist_id FROM
            (SELECT id artist_id, FLATTEN(alias) FROM artist
             where name like 'Elvis%Presley')
          ) artists
        USING artist_id;
  41. Benefits
      Extended relational model allows massive simplification
      –  On a real example, we see >20x reduction in number of tables
      Simplification drives improved introspection
      –  This is good
      Apache Drill gives very high performance execution for extended relational problems
      You can try this out today
  42. Security Controls
  43. Access Controls that Scale
      •  PAM Authentication + User Impersonation
      •  Fine-grained row and column level access control with Drill Views – no centralized security repository required
      [Diagram: Files, HBase, Hive; Drill View 1, Drill View 2; users (U)]
  44. Granular Security via Drill Views
      Raw File (/raw/cards.csv). Owner: Admins. Permission: Admins
        Name  City      State  Credit Card #
        Dave  San Jose  CA     1374-7914-3865-4817
        John  Boulder   CO     1374-9735-1794-9711
      Data Scientist View (/views/maskedcards.csv). Not a physical data copy. Owner: Admins. Permission: Data Scientists
        Name  City      State  Credit Card #
        Dave  San Jose  CA     1374-1111-1111-1111
        John  Boulder   CO     1374-1111-1111-1111
      Business Analyst View. Owner: Admins. Permission: Business Analysts
        Name  City      State
        Dave  San Jose  CA
        John  Boulder   CO
  45. Ownership Chaining
      Combine Self Service Exploration with Data Governance
      Raw File (/raw/cards.csv). John (Owner)
        Name  City      State  Credit Card #
        Dave  San Jose  CA     1374-7914-3865-4817
        John  Boulder   CO     1374-9735-1794-9711
      Data Scientist (/views/V_Scientist). John (Owner), Jane (Read)
        Name  City      State  Credit Card #
        Dave  San Jose  CA     1374-1111-1111-1111
        John  Boulder   CO     1374-1111-1111-1111
      Analyst (/views/V_Analyst). Jane (Owner), Jack (Read)
        Name  City      State
        Dave  San Jose  CA
        John  Boulder   CO
      Ownership chaining: V_Analyst -> V_Scientist -> RAWFILE
      Access path: Jack queries the view V_Analyst
      •  Does Jack have access to V_Analyst? -> YES
         Who is the owner of V_Analyst? -> Jane
         Drill accesses V_Analyst as Jane (Impersonation hop 1)
      •  Does Jane have access to V_Scientist? -> YES
         Who is the owner of V_Scientist? -> John
         Drill accesses V_Scientist as John (Impersonation hop 2)
      •  Does John have permissions on raw file? -> YES
         Who is the owner of raw file? -> John
         Drill accesses source file as John (no impersonation here)
      * Ownership chain length (# hops) is configurable
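The access path on this slide can be walked through with a simplified Python model. It is a sketch under stated assumptions, not Drill's implementation: `Resource`, `resolve_access` and the `max_hops` parameter are hypothetical names, each resource has one owner and a set of readers, and each impersonation hop switches the acting user to the owner of the view just checked.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Resource:
    name: str
    owner: str
    readers: set = field(default_factory=set)
    source: Optional["Resource"] = None  # underlying view or file; None for a raw file


def resolve_access(user, resource, max_hops=2):
    """Return [(resource name, acting user), ...] or raise PermissionError."""
    path, acting, hops = [], user, 0
    current = resource
    while current is not None:
        # The acting user must own the resource or have read permission on it.
        if acting != current.owner and acting not in current.readers:
            raise PermissionError(f"{acting} cannot read {current.name}")
        path.append((current.name, acting))
        # To read what the view is built on, switch to the view's owner (one hop).
        if current.source is not None and acting != current.owner:
            hops += 1
            if hops > max_hops:
                raise PermissionError("ownership chain exceeds the allowed hops")
            acting = current.owner
        current = current.source
    return path


raw_file = Resource("/raw/cards.csv", owner="John")
v_scientist = Resource("V_Scientist", owner="John", readers={"Jane"}, source=raw_file)
v_analyst = Resource("V_Analyst", owner="Jane", readers={"Jack"}, source=v_scientist)

print(resolve_access("Jack", v_analyst))
```

In this model Jack can read the raw data only through the chain of views: a direct query against the raw file fails, and so does the view query if the hop limit is lowered to 1.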
  46. Security Summary
      Logical
      –  No physical data copies/silos
      Granular
      –  Row level and column level security controls
      De-centralized
      –  User impersonation respecting storage system permissions
      –  No separate permission repository for granular controls
      –  Integrated with Hadoop File System permissions and LDAP
      Self-service w/ governance
      –  If you have access to data, you control who can access it and how widely
      –  Audits
  47. National Nutrient Database
  48. Complex
  49. Simpler
  50. Simplest
  51. Sample SR27 Records
      ~01001~^~0100~^~Butter, salted~^~BUTTER,WITH SALT~^~~^~~^~Y~^~~^0^~~^6.38^4.27^8.79^3.87
      ~01004~^~0100~^~Cheese, blue~^~CHEESE,BLUE~^~~^~~^~Y~^~~^0^~~^6.38^4.27^8.79^3.87
      ~01005~^~0100~^~Cheese, brick~^~CHEESE,BRICK~^~~^~~^~Y~^~~^0^~~^6.38^4.27^8.79^3.87
      ~01006~^~0100~^~Cheese, brie~^~CHEESE,BRIE~^~~^~~^~Y~^~~^0^~~^6.38^4.27^8.79^3.87
      ~01007~^~0100~^~Cheese, camembert~^~CHEESE,CAMEMBERT~^~~^~~^~Y~^~~^0^~~^6.38^4.27^8.79^3.87
      ~01008~^~0100~^~Cheese, caraway~^~CHEESE,CARAWAY~^~~^~~^~~^~~^0^~~^6.38^4.27^8.79^3.87
  52. Configuration
      -- Format --
      "nndb": {
        "type": "text",
        "extensions": [ "txt" ],
        "quote": "~",
        "escape": "~",
        "delimiter": "^"
      },
      -- Workspace --
      "nndb": {
        "location": "/opt/drill/nndb",
        "writable": true,
        "storageformat": "parquet"
      },
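The effect of the quote and delimiter settings above can be tried outside Drill with Python's standard csv module. This is only an illustration of how a tilde-quoted, caret-delimited SR27 record splits into fields; it does not reproduce Drill's text reader, and `parse_sr27` is a hypothetical helper. The sample line is the first record from the previous slide.

```python
import csv
import io

SAMPLE = (
    "~01001~^~0100~^~Butter, salted~^~BUTTER,WITH SALT~^~~^~~^~Y~^~~^0^~~"
    "^6.38^4.27^8.79^3.87\n"
)


def parse_sr27(text):
    """Split SR27 lines into fields using '^' as delimiter and '~' as quote."""
    reader = csv.reader(io.StringIO(text), delimiter="^", quotechar="~")
    return list(reader)


rows = parse_sr27(SAMPLE)
print(len(rows[0]), rows[0][:4])
```

Commas inside the quoted descriptions stay within a single field, which is why the format needs a custom delimiter rather than the default CSV settings.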
  53. Sample JSON
      {
        "ndb_no": "08613",
        "shrt_desc": "CEREALS RTE,KELLOGG'S SPL K MULTIGRAIN OATS & HONEY",
        "nut_data": [{
          "nutr_no": "203",
          "nutr_val": "7.80",
          "nutr_def": {"num_dec": 2, "tagname": "PROCNT", "nutrdesc": "Protein"},
          "data_src": [{
            "datasrc_id": "S6941",
            "authors": "A Kellogg, Co.",
            "title": "Kellogg Company Data",
            "Year": "2011"
          }]
        }, {
          "nutr_no": "205",
          "nutr_val": "85.00",
          "nutr_def": {"num_dec": 2, "tagname": "CHOCDF", "nutrdesc": "Carbohydrate, by difference"},
          "data_src": [{
            "datasrc_id": "S6941",
            "authors": "C Kellogg, Co.",
            "title": "Kellogg Company Data",
            "Year": "2011"
          }]
        }],
        "langual": ["ANISE", "FRUIT", "WHOLE, NATURAL SHAPE", "NOT HEAT-TREATED",
                    "COOKING METHOD NOT APPLICABLE", "WATER REMOVED", "HEAT DRIED",
                    "HUMAN FOOD, NO AGE SPECIFICATION"]
      }
  54. Demo Queries
      All queries can be found within these blogs:
      https://www.mapr.com/blog/drilling-healthy-choices
      https://www.mapr.com/blog/evolution-database-schemas-using-sql-nosql
  55. Live Demo
  56. Drill is Top-Ranked SQL-on-Hadoop
      Source: Gigaom Research, 2015
      Key:
      •  Number indicates company’s relative strength across all vectors
      •  Size of ball indicates company’s relative strength along individual vector
      “Drill isn’t just about SQL-on-Hadoop. It’s about SQL-on-pretty-much-anything, immediately, and without formality.”
  57. Drill Project Status
      Timeline dates: Sep’12, Jun’13, Aug’14, Sep’14, Nov’14, Dec’14, Jan’15, Mar’15, Apr’15, May’15
      Milestones: Project incubation; First release (Drill 0.1); Dev Preview (Drill 0.4); Beta (Drill 0.5); Drill 0.6; Apache Top Level Project; Drill 0.7; Drill 0.8; Drill 0.9; GigaOm Top ranked SQL On Hadoop; Drill 1.0 (May’15, just released)
      Highlights
      •  Apache Top Level Project
      •  Growing user adoption
      •  Iterative project cycles
      •  Large community, growing rapidly
      •  50 contributors
      •  1000’s downloads
      •  7 releases < 9 months
  58. Recommendations for Getting Started with Drill
      New to Drill?
      –  Get started with Free MapR On Demand training
      –  Test Drive Drill on cloud with AWS
      –  Learn how to use Drill with Hadoop using MapR sandbox
      Ready to play with your data?
      –  Try out Apache Drill in 10 mins guide on your desktop
      –  Download Drill for your cluster and start exploration
      –  Comprehensive tutorials and documentation available
      Ask questions
      –  user@drill.apache.org
  60. Find my presentation and other related resources here: http://events.mapr.com/BigDataMadison (you can find this link in the event’s page at meetup.com)
      •  Today’s Presentation
      •  Whiteboard & demo videos
      •  Free On-Demand Training
      •  Free eBooks
      •  Free Hadoop Sandbox
      •  And more…
  61. Q&A
      @kingmesal, maprtech, jscott@mapr.com
      Engage with us! MapR, maprtech, mapr-technologies
