© 2015 MapR Technologies
Jim Scott – Director, Enterprise Strategy & Architecture
@kingmesal #BigDataMadison
Find my presentation and other related resources here:
http://events.mapr.com/BigDataMadison
(you can find this link in the event’s page at meetup.com)
Today’s Presentation
Whiteboard & demo videos
Free On-Demand Training
Free eBooks
Free Hadoop Sandbox
And more…
Topics
•  Motivation
•  Using Drill
•  SQL + NoSQL = ???
•  Security Controls
•  Demo
•  Resources
Empowering “as it happens” businesses by speeding up the data-to-action cycle
Top-Ranked NoSQL
Top-Ranked Hadoop Distribution
Top-Ranked SQL-on-Hadoop Solution
Motivation
Data is Doubling Every Two Years
Unstructured data will account for more than 80% of the data collected by organizations
Source: Human-Computer Interaction & Knowledge Discovery in Complex Unstructured, Big Data
[Chart: total data stored, 1980 to 2020, split into structured data and semi-structured data]
Data Increasingly Stored in Non-Relational Datastores
[Timeline: 1980 to 2020, relational databases followed by non-relational datastores]
Relational databases
•  Volume: GBs-TBs
•  Structure: Structured
•  Database: Fixed schema; DBA controls structure
•  Development: Planned (release cycle = months-years)
Non-relational datastores
•  Volume: TBs-PBs
•  Structure: Structured, semi-structured and unstructured
•  Database: Dynamic / Flexible schema; Application controls structure
•  Development: Iterative (release cycle = days-weeks)
How To Bring SQL to Non-Relational Data Stores?
Familiarity of SQL
•  ANSI SQL semantics
•  BI (Tableau, MicroStrategy, etc.)
•  Low latency
Agility of NoSQL
•  No schema management
–  HDFS (Parquet, JSON, etc.)
–  HBase
–  …
•  No transformation
–  No silos of data
•  Ease of use
Industry's First Schema-free SQL engine for Big Data
Enabling “As-It-Happens” Business with Instant Analytics
Traditional approach: Hadoop data -> Data modeling -> Transformation -> Data movement (optional) -> Users
Total time to insight: weeks to months
Exploratory approach: Hadoop data -> Users
Total time to insight: minutes
[Figure labels: New business questions; Source data evolution]
Evolution Towards Self-Service Data Exploration
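+-------------------------------+----------------------------------+--------------------+
| Approach                      | Data Modeling and Transformation | Data Visualization |
+-------------------------------+----------------------------------+--------------------+
| Traditional BI w/ RDBMS       | IT-driven                        | IT-driven          |
| Self-Service BI w/ RDBMS      | IT-driven                        | Self-service       |
| SQL-on-Hadoop                 | IT-driven                        | Self-service       |
| Self-Service Data Exploration | Optional                         | Self-service       |
+-------------------------------+----------------------------------+--------------------+
Zero-day analytics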
Common Use Cases
•  Raw Data Exploration
•  JSON Analytics
•  DWH offload
Data sources pictured: Files, Directories, Hive, HBase, {JSON}, Parquet, Text Files, …
Drill Supports Schema Discovery On-The-Fly
Schema Declared In Advance (SCHEMA ON WRITE, SCHEMA BEFORE READ)
•  Fixed schema
•  Leverage schema in centralized repository (Hive Metastore)
Schema Discovered On-The-Fly (SCHEMA ON THE FLY)
•  Fixed schema, evolving schema or schema-less
•  Leverage schema in centralized repository or self-describing data
Drill Enables ‘SQL-on-Everything’
SELECT * FROM dfs.yelp.`business.json`
Storage plugin instance (dfs)
-  DFS (Text, Parquet, JSON)
-  HBase/MapR-DB
-  Hive Metastore/HCatalog
-  Easy API to go beyond Hadoop
Workspace (yelp)
-  Sub-directory
-  HBase namespace
-  Hive database
Table (`business.json`)
-  Pathnames
-  Hive table
-  HBase table
Drill’s Data Model is Flexible
[Figure: formats placed on two axes of flexibility, flat to complex and fixed schema to dynamic schema: JSON, BSON, HBase, Parquet, Avro, CSV, TSV]
RDBMS/SQL-on-Hadoop table
+----------+--------+-----+
| Name     | Gender | Age |
+----------+--------+-----+
| Michael  | M      | 6   |
| Jennifer | F      | 3   |
+----------+--------+-----+
Apache Drill table
{
  name: {
    first: Michael,
    last: Smith
  },
  hobbies: [ski, soccer],
  district: Los Altos
}
{
  name: {
    first: Jennifer,
    last: Gates
  },
  hobbies: [sing],
  preschool: CCLC
}
Reuse Existing SQL Tools and Skills
Leverage SQL-compatible tools (BI, query builders, etc.) via Drill’s standard ODBC, JDBC and ANSI SQL support
Enable business analysts, technical analysts and data scientists to explore and analyze large volumes of real-time data
Using Drill with Yelp
Business dataset
{
  "business_id": "4bEjOyTaDG24SY5TxsaUNQ",
  "full_address": "3655 Las Vegas Blvd S\nThe Strip\nLas Vegas, NV 89109",
  "hours": {
    "Monday": {"close": "23:00", "open": "07:00"},
    "Tuesday": {"close": "23:00", "open": "07:00"},
    "Friday": {"close": "00:00", "open": "07:00"},
    "Wednesday": {"close": "23:00", "open": "07:00"},
    "Thursday": {"close": "23:00", "open": "07:00"},
    "Sunday": {"close": "23:00", "open": "07:00"},
    "Saturday": {"close": "00:00", "open": "07:00"}
  },
  "open": true,
  "categories": ["Breakfast & Brunch", "Steakhouses", "French", "Restaurants"],
  "city": "Las Vegas",
  "review_count": 4084,
  "name": "Mon Ami Gabi",
  "neighborhoods": ["The Strip"],
  "longitude": -115.172588519464,
  "state": "NV",
  "stars": 4.0,
  "attributes": {
    "Alcohol": "full_bar",
    "Noise Level": "average",
    "Has TV": false,
    "Attire": "casual",
    "Ambience": {
      "romantic": true,
      "intimate": false,
      "touristy": false,
      "hipster": false,
      "classy": true,
      "trendy": false,
      "casual": false
    },
    "Good For": {"dessert": false, "latenight": false, "lunch": false,
                 "dinner": true, "breakfast": false, "brunch": false}
  }
}
Reviews dataset
{
  "votes": {"funny": 0, "useful": 2, "cool": 1},
  "user_id": "Xqd0DzHaiyRqVH3WRG7hzg",
  "review_id": "15SdjuK7DmYqUAj6rjGowg",
  "stars": 5,
  "date": "2007-05-17",
  "text": "dr. goldberg offers everything ...",
  "type": "review",
  "business_id": "vcNAWiLM4dR7D2nwwJ7nCA"
}
Zero to Results in 2 minutes
Install
$ tar -xvzf apache-drill-1.0.0.tar.gz
Launch shell (embedded mode)
$ bin/sqlline -u jdbc:drill:zk=local
$ bin/drill-embedded
Query files and directories
> SELECT state, city, count(*) AS businesses
  FROM dfs.yelp.`business.json`
  GROUP BY state, city
  ORDER BY businesses DESC LIMIT 10;
Results
+-------+------------+------------+
| state | city       | businesses |
+-------+------------+------------+
| NV    | Las Vegas  | 12021      |
| AZ    | Phoenix    | 7499       |
| AZ    | Scottsdale | 3605       |
| EDH   | Edinburgh  | 2804       |
| AZ    | Mesa       | 2041       |
| AZ    | Tempe      | 2025       |
| NV    | Henderson  | 1914       |
| AZ    | Chandler   | 1637       |
| WI    | Madison    | 1630       |
| AZ    | Glendale   | 1196       |
+-------+------------+------------+
Directories are implicit partitions
SELECT dir0, SUM(amount)
FROM sales
WHERE dir1 IN ('q1', 'q2')
GROUP BY dir0
sales
├── 2014
│   ├── q1
│   ├── q2
│   ├── q3
│   └── q4
└── 2015
    └── q1
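A minimal Python sketch of the idea on this slide, assuming each directory level below the table root is exposed as a column named dir0, dir1, and so on. The file names and amounts are hypothetical, and the helper `implicit_partitions` is illustrative only; it is not part of Drill.

```python
from pathlib import PurePosixPath

def implicit_partitions(root: str, file_path: str) -> dict:
    """Map each directory level below `root` to dir0, dir1, ..."""
    relative = PurePosixPath(file_path).relative_to(root)
    # Every path component except the file name is a directory level.
    return {f"dir{i}": part for i, part in enumerate(relative.parts[:-1])}

# Hypothetical files laid out like the sales tree: (path, amount).
files = [
    ("sales/2014/q1/a.csv", 100),
    ("sales/2014/q2/b.csv", 150),
    ("sales/2014/q3/c.csv", 999),
    ("sales/2015/q1/d.csv", 200),
]

totals = {}
for path, amount in files:
    dirs = implicit_partitions("sales", path)
    if dirs["dir1"] in ("q1", "q2"):  # keep only the first two quarters
        year = dirs["dir0"]           # group by the year directory
        totals[year] = totals.get(year, 0) + amount

print(totals)
```

With this sample data the totals come out as 250 for 2014 and 200 for 2015; the q3 file is filtered out by its directory name alone, without reading any column from the file.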
Intuitive SQL Access to Complex Data
// It’s Friday 10pm in Vegas and looking for Hummus
> SELECT name, stars, b.hours.Friday friday, categories
  FROM dfs.yelp.`business.json` b
  WHERE b.hours.Friday.`open` < '22:00' AND
        b.hours.Friday.`close` > '22:00' AND
        REPEATED_CONTAINS(categories, 'Mediterranean') AND
        city = 'Las Vegas'
  ORDER BY stars DESC
  LIMIT 2;
+-------------------------------+-------+----------------------------------+-------------------------------------------------------------+
| name                          | stars | friday                           | categories                                                  |
+-------------------------------+-------+----------------------------------+-------------------------------------------------------------+
| Olives                        | 4.0   | {"close":"22:30","open":"11:00"} | ["Mediterranean","Restaurants"]                             |
| Marrakech Moroccan Restaurant | 4.0   | {"close":"23:00","open":"17:30"} | ["Mediterranean","Middle Eastern","Moroccan","Restaurants"] |
+-------------------------------+-------+----------------------------------+-------------------------------------------------------------+
Query data with any levels of nesting
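The same filter can be sketched in plain Python to show what the dotted path and the array test are doing. This is an illustration under assumptions, not Drill's implementation: the records are hypothetical, and the sketch treats a missing `hours` field as "no match".

```python
# Hypothetical records shaped like the Yelp business.json documents.
businesses = [
    {"name": "Olives", "stars": 4.0, "city": "Las Vegas",
     "hours": {"Friday": {"open": "11:00", "close": "22:30"}},
     "categories": ["Mediterranean", "Restaurants"]},
    {"name": "Early Bird Cafe", "stars": 4.5, "city": "Las Vegas",
     "hours": {"Friday": {"open": "06:00", "close": "14:00"}},
     "categories": ["Breakfast & Brunch", "Restaurants"]},
    {"name": "Hours Unknown Grill", "stars": 5.0, "city": "Las Vegas",
     "categories": ["Mediterranean"]},  # no "hours" field at all
]

def friday_hours(record):
    """Follow the path hours.Friday; return None if any level is missing."""
    return record.get("hours", {}).get("Friday")

def open_at(record, time):
    hours = friday_hours(record)
    # "HH:MM" strings compare correctly as text, as in the SQL predicate.
    return hours is not None and hours["open"] < time < hours["close"]

matches = [
    b for b in businesses
    if open_at(b, "22:00")
    and "Mediterranean" in b["categories"]  # plays the role of REPEATED_CONTAINS
    and b["city"] == "Las Vegas"
]
top = sorted(matches, key=lambda b: b["stars"], reverse=True)[:2]
print([b["name"] for b in top])
```

Only "Olives" survives the filter here. Note that a plain string comparison excludes a venue whose closing time is recorded as "00:00", which applies to the SQL predicate as well.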
  
ANSI SQL Compatibility
// Get top cool rated businesses
> SELECT b.name FROM dfs.yelp.`business.json` b
  WHERE b.business_id IN
    (SELECT r.business_id FROM dfs.yelp.`review.json` r
     GROUP BY r.business_id HAVING SUM(r.votes.cool) > 2000
     ORDER BY SUM(r.votes.cool) DESC);
+-------------------------------+
| name                          |
+-------------------------------+
| Earl of Sandwich              |
| XS Nightclub                  |
| The Cosmopolitan of Las Vegas |
| Wicked Spoon                  |
+-------------------------------+
Use familiar SQL functionality (Joins, Aggregations, Sorting, Sub-queries, SQL data types)
Logical Views
// Create a view combining business and reviews datasets
> CREATE OR REPLACE VIEW dfs.tmp.BusinessReviews AS
    SELECT b.name, b.stars, r.votes.funny,
           r.votes.useful, r.votes.cool, r.`date`
    FROM dfs.yelp.`business.json` b, dfs.yelp.`review.json` r
    WHERE r.business_id = b.business_id;
+------+-----------------------------------------------------------------+
| ok   | summary                                                         |
+------+-----------------------------------------------------------------+
| true | View 'BusinessReviews' created successfully in 'dfs.tmp' schema |
+------+-----------------------------------------------------------------+
> SELECT COUNT(*) AS Total FROM dfs.tmp.BusinessReviews;
+------------+
| Total      |
+------------+
| 1125458    |
+------------+
Lightweight file system based views for granular and de-centralized data management
Materialized Views AKA Tables
> ALTER SESSION SET `store.format` = 'parquet';
> CREATE TABLE dfs.yelp.BusinessReviewsTbl AS
    SELECT b.name, b.stars, r.votes.funny funny,
           r.votes.useful useful, r.votes.cool cool, r.`date`
    FROM dfs.yelp.`business.json` b, dfs.yelp.`review.json` r
    WHERE r.business_id = b.business_id;
+----------+---------------------------+
| Fragment | Number of records written |
+----------+---------------------------+
| 1_0      | 176448                    |
| 1_1      | 192439                    |
| 1_2      | 198625                    |
| 1_3      | 200863                    |
| 1_4      | 181420                    |
| 1_5      | 175663                    |
+----------+---------------------------+
Save analysis results as tables using familiar CTAS syntax
Repeated Values Support
// Flatten repeated categories
> SELECT name, categories
  FROM dfs.yelp.`business.json` LIMIT 3;
+----------------------------+------------------------------------------+
| name                       | categories                               |
+----------------------------+------------------------------------------+
| Eric Goldberg, MD          | ["Doctors","Health & Medical"]           |
| Pine Cone Restaurant       | ["Restaurants"]                          |
| Deforest Family Restaurant | ["American (Traditional)","Restaurants"] |
+----------------------------+------------------------------------------+
> SELECT name, FLATTEN(categories) AS categories
  FROM dfs.yelp.`business.json` LIMIT 5;
+----------------------------+------------------------+
| name                       | categories             |
+----------------------------+------------------------+
| Eric Goldberg, MD          | Doctors                |
| Eric Goldberg, MD          | Health & Medical       |
| Pine Cone Restaurant       | Restaurants            |
| Deforest Family Restaurant | American (Traditional) |
| Deforest Family Restaurant | Restaurants            |
+----------------------------+------------------------+
Dynamically flatten repeated and nested data elements as part of SQL queries.
No ETL necessary
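The row-multiplying behaviour of FLATTEN shown in the results can be sketched in a few lines of Python. The `flatten` helper is a hypothetical stand-in written for this illustration, and the three rows are the ones from the slide.

```python
def flatten(rows, column):
    """Yield one output row per element of the list stored in `column`."""
    for row in rows:
        for element in row[column]:
            # Copy the other columns and replace the list with one element.
            yield {**row, column: element}

rows = [
    {"name": "Eric Goldberg, MD", "categories": ["Doctors", "Health & Medical"]},
    {"name": "Pine Cone Restaurant", "categories": ["Restaurants"]},
    {"name": "Deforest Family Restaurant",
     "categories": ["American (Traditional)", "Restaurants"]},
]

flat = list(flatten(rows, "categories"))
for r in flat:
    print(r["name"], "|", r["categories"])
```

Three input rows become five output rows, one per category, which matches the second result table on the slide.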
  
Extensions to ANSI SQL to work with repeated values
// Get most common business categories
> SELECT category, count(*) AS categorycount
  FROM (SELECT name, FLATTEN(categories) AS category
        FROM dfs.yelp.`business.json`) c
  GROUP BY category ORDER BY categorycount DESC;
+--------------+---------------+
| category     | categorycount |
+--------------+---------------+
| Restaurants  | 14303         |
…
| Australian   | 1             |
| Boat Dealers | 1             |
| Firewood     | 1             |
+--------------+---------------+
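Flatten-then-group can be sketched in Python with a `Counter`. The three records below are a hypothetical sample, so the counts are not those of the real dataset.

```python
from collections import Counter

# Hypothetical sample of business records with a repeated "categories" field.
businesses = [
    {"name": "Eric Goldberg, MD", "categories": ["Doctors", "Health & Medical"]},
    {"name": "Pine Cone Restaurant", "categories": ["Restaurants"]},
    {"name": "Deforest Family Restaurant",
     "categories": ["American (Traditional)", "Restaurants"]},
]

# The nested loop flattens the arrays; Counter does the GROUP BY and count.
category_counts = Counter(
    category for b in businesses for category in b["categories"]
)

# most_common() sorts by count, descending, like ORDER BY categorycount DESC.
ranked = category_counts.most_common()
print(ranked[0])
```

The most frequent category in this sample is "Restaurants" with a count of 2.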
  
Checkins dataset
{
  "checkin_info": {
    "3-4": 1,
    "13-5": 1,
    "6-6": 1,
    "14-5": 1,
    "14-6": 1,
    "14-2": 1,
    "14-3": 1,
    "19-0": 1,
    "11-5": 1,
    "13-2": 1,
    "11-6": 2,
    "11-3": 1,
    "12-6": 1,
    "6-5": 1,
    "5-5": 1,
    "9-2": 1,
    "9-5": 1,
    "9-6": 1,
    "5-2": 1,
    "7-6": 1,
    "7-5": 1,
    "7-4": 1,
    "17-5": 1,
    "8-5": 1,
    "10-2": 1,
    "10-5": 1,
    "10-6": 1
  },
  "type": "checkin",
  "business_id": "JwUE5GmEO-sH1FuwJgKBlQ"
}
Supports Dynamic / Unknown Columns
> SELECT KVGEN(checkin_info) checkins
  FROM dfs.yelp.`checkin.json` LIMIT 1;
+----------+
| checkins |
+----------+
| [{"key":"3-4","value":1},{"key":"13-5","value":1},{"key":"6-6","value":1},{"key":"14-5","value":1},
{"key":"14-6","value":1},{"key":"14-2","value":1},{"key":"14-3","value":1},{"key":"19-0","value":1},
{"key":"11-5","value":1},{"key":"13-2","value":1},{"key":"11-6","value":2},{"key":"11-3","value":1},
{"key":"12-6","value":1},{"key":"6-5","value":1},{"key":"5-5","value":1},{"key":"9-2","value":1},
{"key":"9-5","value":1},{"key":"9-6","value":1},{"key":"5-2","value":1},{"key":"7-6","value":1},
{"key":"7-5","value":1},{"key":"7-4","value":1},{"key":"17-5","value":1},{"key":"8-5","value":1},
{"key":"10-2","value":1},{"key":"10-5","value":1},{"key":"10-6","value":1}] |
+----------+
> SELECT FLATTEN(KVGEN(checkin_info)) checkins
  FROM dfs.yelp.`checkin.json` LIMIT 6;
+--------------------------+
| checkins                 |
+--------------------------+
| {"key":"3-4","value":1}  |
| {"key":"13-5","value":1} |
| {"key":"6-6","value":1}  |
| {"key":"14-5","value":1} |
| {"key":"14-6","value":1} |
| {"key":"14-2","value":1} |
+--------------------------+
Convert Map with a wide set of dynamic columns into an array of key-value pairs
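What KVGEN does to a map can be sketched in Python. The `kvgen` function below is a hypothetical stand-in for illustration, and the record is a shortened copy of the checkin document.

```python
def kvgen(mapping):
    """Turn {"k": v, ...} into [{"key": "k", "value": v}, ...]."""
    return [{"key": key, "value": value} for key, value in mapping.items()]

# Shortened copy of the checkin record; the full one has 27 keys.
record = {
    "checkin_info": {"3-4": 1, "13-5": 1, "11-6": 2},
    "type": "checkin",
    "business_id": "JwUE5GmEO-sH1FuwJgKBlQ",
}

pairs = kvgen(record["checkin_info"])
for pair in pairs:  # iterating the array is what FLATTEN adds on top
    print(pair)
```

Because the column names are now data (the `key` field), a query no longer needs to know them in advance.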
  
Makes it easy to work with dynamic/unknown columns
// Count total number of checkins on Sunday midnight
> SELECT SUM(checkintbl.checkins.`value`) AS SundayMidnightCheckins
  FROM
    (SELECT FLATTEN(KVGEN(checkin_info)) checkins
     FROM dfs.yelp.`checkin.json`) checkintbl
  WHERE checkintbl.checkins.key = '23-0';
+------------------------+
| SundayMidnightCheckins |
+------------------------+
| 8575                   |
+------------------------+
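A Python sketch of the same aggregation, assuming hypothetical checkin documents. The key "23-0" is the one the slide's query uses for Sunday midnight.

```python
# Hypothetical checkin documents; each has its own set of dynamic keys.
checkins = [
    {"business_id": "a", "checkin_info": {"23-0": 3, "9-5": 1}},
    {"business_id": "b", "checkin_info": {"23-0": 2, "14-2": 4}},
    {"business_id": "c", "checkin_info": {"10-6": 7}},
]

def key_value_rows(documents):
    """One row per (key, value) pair across all documents."""
    for doc in documents:
        for key, value in doc["checkin_info"].items():
            yield {"key": key, "value": value}

# Filter on the key, then sum the values.
sunday_midnight = sum(
    row["value"] for row in key_value_rows(checkins) if row["key"] == "23-0"
)
print(sunday_midnight)
```

Documents that lack the key simply contribute nothing, so the third sample document does not affect the total of 5.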
  
SQL + NoSQL = Accessible & Linearly Scalable
MusicBrainz on NoSQL
Artists, albums, tracks and labels are key objects
Reality check:
Add works (compositions), recordings, release, release group
7 tables for artist alone
12 for place, 7 for label, 17 for release/group, 8 for work
(but only 4 for recording!)
Total of 12 + 7 + 17 + 8 + 4 = 48 tables
But wait, there’s more!
10 annotation tables, 10 edit tables, 19 tag tables, 5 rating tables, 86 link tables, 5 cover art tables and 3 tables for CD timing info (138 total)
And 50 more tables that aren’t documented yet
180 Tables
NOT SHOWN!
236 tables
to describe 7 kinds of things
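The table counts quoted on the preceding slides can be tallied with a short Python check. The dictionary keys are just labels for this sketch. The slide's first sum lists five terms, so the 7 artist tables appear not to be included in the 48.

```python
# Counts as quoted on the slides.
core = {"place": 12, "label": 7, "release/group": 17, "work": 8, "recording": 4}
support = {"annotation": 10, "edit": 10, "tag": 19, "rating": 5,
           "link": 86, "cover art": 5, "CD timing": 3}
undocumented = 50

core_total = sum(core.values())
support_total = sum(support.values())
grand_total = core_total + support_total + undocumented

print(core_total, support_total, grand_total)
```

The three figures come out as 48, 138 and 236, matching the totals stated on the slides.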
artist
id
gid
name
sort_name
begin_date
end_date
ended
type
gender
area
begin_area
end_area
comment
list<ipi>
list<isni>
list<alias>
list<release_id>
list<recording_id>
artist
id
gid
name
sort_name
begin_date
end_date
ended
type
gender
area
begin_area
end_area
comment
list<ipi>
list<isni>
list<alias>
artist
–  Primitive values: id, gid, name, sort_name, begin_date, end_date, ended, type, gender, area, begin_area, end_area, comment
–  One to many relations: list<ipi>, list<isni>, list<alias>
–  Equivalent to indexes: list<release_id>, list<recording_id>
Further Reductions
All 86 link tables become properties on artists, releases and other entities
All 44 tag, rating and annotation tables become list properties
All 5 cover art tables become lists of file references
Current score: 162 tables become 4
You get the idea
Is This Good?
Expressivity
–  The JSON data model is at least as expressive as the original relational model
•  Many cases easier to describe in nested data
•  No cases are harder
Efficiency
–  Inlining can increase data size. Locality improves, however
–  Sessionizing can substantially decrease data size
–  Inlining back-references is more efficient than ordinary indexes
–  Inlined columnar data allows 1000x speedup for time series
Introspection (you decide)
Searching for Elvis
// Find discs where Elvis was credited
> SELECT distinct album_id, name FROM
    (SELECT id album_id, artist_id, name, FLATTEN(credit) FROM release) albums
  join
    (SELECT distinct artist_id FROM
      (SELECT id artist_id, FLATTEN(alias) FROM artist
       where name like 'Elvis%Presley')
    ) artists
  USING (artist_id);
Benefits
Extended relational model allows massive simplification
–  On a real example, we see >20x reduction in number of tables
Simplification drives improved introspection
–  This is good
Apache Drill gives very high performance execution for extended relational problems
You can try this out today
Security Controls
Access Controls that Scale
PAM Authentication + User Impersonation
Fine-grained row and column level access control with Drill Views – no centralized security repository required
[Figure: users (U) query Drill View 1 and Drill View 2, which are defined over Files, HBase and Hive]
Granular Security via Drill Views
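Raw File (/raw/cards.csv); Owner: Admins; Permission: Admins
+------+----------+-------+---------------------+
| Name | City     | State | Credit Card #       |
+------+----------+-------+---------------------+
| Dave | San Jose | CA    | 1374-7914-3865-4817 |
| John | Boulder  | CO    | 1374-9735-1794-9711 |
+------+----------+-------+---------------------+
Data Scientist View (/views/maskedcards.csv); Owner: Admins; Permission: Data Scientists
Not a physical data copy
+------+----------+-------+---------------------+
| Name | City     | State | Credit Card #       |
+------+----------+-------+---------------------+
| Dave | San Jose | CA    | 1374-1111-1111-1111 |
| John | Boulder  | CO    | 1374-1111-1111-1111 |
+------+----------+-------+---------------------+
Business Analyst View; Owner: Admins; Permission: Business Analysts
+------+----------+-------+
| Name | City     | State |
+------+----------+-------+
| Dave | San Jose | CA    |
| John | Boulder  | CO    |
+------+----------+-------+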
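The two views can be sketched as Python functions over the raw rows, to show that they are projections computed on read rather than copies. The function names and the masking rule (keep the first group, replace the rest with 1111) are assumptions inferred from the sample output on the slide.

```python
# The raw rows from the slide.
raw_cards = [
    {"name": "Dave", "city": "San Jose", "state": "CA", "card": "1374-7914-3865-4817"},
    {"name": "John", "city": "Boulder", "state": "CO", "card": "1374-9735-1794-9711"},
]

def mask_card(number):
    """Keep the first group of digits and replace every other group with 1111."""
    first, *rest = number.split("-")
    return "-".join([first] + ["1111"] * len(rest))

def data_scientist_view(rows):
    """Column-level masking: same columns, card number obscured."""
    return [{**row, "card": mask_card(row["card"])} for row in rows]

def business_analyst_view(rows):
    """Column-level restriction: the card column is not exposed at all."""
    return [{k: v for k, v in row.items() if k != "card"} for row in rows]

print(data_scientist_view(raw_cards)[0]["card"])
print(business_analyst_view(raw_cards)[0])
```

Neither function stores anything; each call derives its result from the raw rows, which mirrors the "not a physical data copy" point.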
Ownership Chaining
Combine Self Service Exploration with Data Governance
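Raw File (/raw/cards.csv): John (Owner)
+------+----------+-------+---------------------+
| Name | City     | State | Credit Card #       |
+------+----------+-------+---------------------+
| Dave | San Jose | CA    | 1374-7914-3865-4817 |
| John | Boulder  | CO    | 1374-9735-1794-9711 |
+------+----------+-------+---------------------+
Data Scientist (/views/V_Scientist): John (Owner), Jane (Read)
+------+----------+-------+---------------------+
| Name | City     | State | Credit Card #       |
+------+----------+-------+---------------------+
| Dave | San Jose | CA    | 1374-1111-1111-1111 |
| John | Boulder  | CO    | 1374-1111-1111-1111 |
+------+----------+-------+---------------------+
Analyst (/views/V_Analyst): Jane (Owner), Jack (Read)
+------+----------+-------+
| Name | City     | State |
+------+----------+-------+
| Dave | San Jose | CA    |
| John | Boulder  | CO    |
+------+----------+-------+
Access path: V_Analyst -> V_Scientist -> RAWFILE
Ownership chaining when Jack queries the view V_Analyst:
•  Does Jack have access to V_Analyst? -> YES
•  Who is the owner of V_Analyst? -> Jane
•  Drill accesses V_Analyst as Jane (Impersonation hop 1)
•  Does Jane have access to V_Scientist? -> YES
•  Who is the owner of V_Scientist? -> John
•  Drill accesses V_Scientist as John (Impersonation hop 2)
•  Does John have permissions on raw file? -> YES
•  Who is the owner of raw file? -> John
•  Drill accesses source file as John (no impersonation here)
*Ownership chain length (# hops) is configurable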
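The walk through the chain can be sketched in Python. This is a simplified model under stated assumptions, not Drill's implementation: `Resource`, `can_query` and the `max_hops` parameter are hypothetical names, with `max_hops` standing in for the configurable chain length.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Resource:
    name: str
    owner: str
    readers: frozenset = frozenset()
    source: Optional["Resource"] = None  # None marks the raw file

def can_query(user: str, resource: Resource, max_hops: int = 3) -> bool:
    """Walk from a view down to the raw file, switching identity to each owner."""
    identity, hops, node = user, 0, resource
    while node is not None:
        # The current identity must own the resource or have read access.
        if identity != node.owner and identity not in node.readers:
            return False
        # A view reads its source as the view owner: one impersonation hop.
        if node.source is not None and identity != node.owner:
            hops += 1
            if hops > max_hops:
                return False
            identity = node.owner
        node = node.source
    return True

raw_file = Resource("RAWFILE", owner="John")
v_scientist = Resource("V_Scientist", owner="John",
                       readers=frozenset({"Jane"}), source=raw_file)
v_analyst = Resource("V_Analyst", owner="Jane",
                     readers=frozenset({"Jack"}), source=v_scientist)

print(can_query("Jack", v_analyst))              # two hops: Jane, then John
print(can_query("Jack", v_analyst, max_hops=1))  # chain too long for this limit
print(can_query("Jack", raw_file))               # no direct access to raw data
```

Jack can read through V_Analyst even though he has no rights on the raw file; lowering the hop limit to 1 blocks the same query.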
Security Summary
Logical
–  No physical data copies/silos
Granular
–  Row level and column level security controls
De-centralized
–  User impersonation respecting storage system permissions
–  No separate permission repository for granular controls
–  Integrated with Hadoop File System permissions and LDAP
Self-service w/ governance
–  If you have access to data, you control who can access it and how widely
–  Audits
National Nutrient Database
Complex
Simpler
Simplest
Sample SR27 Records
~01001~^~0100~^~Butter, salted~^~BUTTER,WITH SALT~^~~^~~^~Y~^~~^0^~~^6.38^4.27^8.79^3.87
~01004~^~0100~^~Cheese, blue~^~CHEESE,BLUE~^~~^~~^~Y~^~~^0^~~^6.38^4.27^8.79^3.87
~01005~^~0100~^~Cheese, brick~^~CHEESE,BRICK~^~~^~~^~Y~^~~^0^~~^6.38^4.27^8.79^3.87
~01006~^~0100~^~Cheese, brie~^~CHEESE,BRIE~^~~^~~^~Y~^~~^0^~~^6.38^4.27^8.79^3.87
~01007~^~0100~^~Cheese, camembert~^~CHEESE,CAMEMBERT~^~~^~~^~Y~^~~^0^~~^6.38^4.27^8.79^3.87
~01008~^~0100~^~Cheese, caraway~^~CHEESE,CARAWAY~^~~^~~^~~^~~^0^~~^6.38^4.27^8.79^3.87
Configuration
-- Format --
"nndb": {
  "type": "text",
  "extensions": [ "txt" ],
  "quote": "~",
  "escape": "~",
  "delimiter": "^"
},
-- Workspace --
"nndb": {
  "location": "/opt/drill/nndb",
  "writable": true,
  "storageformat": "parquet"
},
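To see what the quote and delimiter settings mean for an SR27 record, the first sample line can be parsed with Python's standard `csv` module. This is a sketch of the file format only, not of Drill's text reader; it assumes `~` acts as the quote character and `^` as the field delimiter, as the format block states.

```python
import csv
import io

# First sample SR27 record from the slide.
sample = ("~01001~^~0100~^~Butter, salted~^~BUTTER,WITH SALT~^~~^~~^~Y~^~~"
          "^0^~~^6.38^4.27^8.79^3.87\n")

# '^' separates fields; '~' wraps text fields, so commas inside them are data.
reader = csv.reader(io.StringIO(sample), delimiter="^", quotechar="~")
row = next(reader)

print(len(row))
print(row[0], "|", row[2], "|", row[3])
```

The record splits into 14 fields, and "Butter, salted" stays a single field because the comma sits inside the `~` quotes.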
Sample JSON
{
  "ndb_no": "08613",
  "shrt_desc": "CEREALS RTE,KELLOGG'S SPL K MULTIGRAIN OATS & HONEY",
  "nut_data": [{
    "nutr_no": "203",
    "nutr_val": "7.80",
    "nutr_def": {"num_dec": 2, "tagname": "PROCNT", "nutrdesc": "Protein"},
    "data_src": [{
      "datasrc_id": "S6941",
      "authors": "A Kellogg, Co.",
      "title": "Kellogg Company Data",
      "Year": "2011"
    }]
  }, {
    "nutr_no": "205",
    "nutr_val": "85.00",
    "nutr_def": {"num_dec": 2, "tagname": "CHOCDF", "nutrdesc": "Carbohydrate, by difference"},
    "data_src": [{
      "datasrc_id": "S6941",
      "authors": "C Kellogg, Co.",
      "title": "Kellogg Company Data",
      "Year": "2011"
    }]
  }],
  "langual": ["ANISE", "FRUIT", "WHOLE, NATURAL SHAPE", "NOT HEAT-TREATED", "COOKING METHOD NOT APPLICABLE", "WATER REMOVED", "HEAT DRIED", "HUMAN FOOD, NO AGE SPECIFICATION"]
}
Demo Queries
All queries can be found within these blogs:
https://www.mapr.com/blog/drilling-healthy-choices
https://www.mapr.com/blog/evolution-database-schemas-using-sql-nosql
Live Demo
Drill is Top-Ranked SQL-on-Hadoop
Source: Gigaom Research, 2015
Key:
•  Number indicates company’s relative strength across all vectors
•  Size of ball indicates company’s relative strength along individual vector
“Drill isn’t just about SQL-on-Hadoop. It’s about SQL-on-pretty-much-anything, immediately, and without formality.”
Drill Project Status
[Timeline figure, Sep’12 to May’15; date ticks: Sep’12, Jun’13, Aug’14, Sep’14, Nov’14, Dec’14, Jan’15, Mar’15, Apr’15, May’15]
Milestones shown: Project incubation; First release Drill 0.1; Dev Preview Drill 0.4; Beta Drill 0.5; Drill 0.6; Apache Top Level Project; Drill 0.7; Drill 0.8; Drill 0.9; GigaOm Top ranked SQL On Hadoop; Drill 1.0 (May’15, just released)
Highlights
•  Apache Top Level Project
•  Growing user adoption: 1000’s downloads
•  Iterative project cycles: 7 releases < 9 months
•  Large community, growing rapidly: 50 contributors
Recommendations for Getting Started with Drill
New to Drill?
–  Get started with Free MapR On Demand training
–  Test Drive Drill on cloud with AWS
–  Learn how to use Drill with Hadoop using MapR sandbox
Ready to play with your data?
–  Try out Apache Drill in 10 mins guide on your desktop
–  Download Drill for your cluster and start exploration
–  Comprehensive tutorials and documentation available
Ask questions
–  user@drill.apache.org
Q&A
@kingmesal maprtech
jscott@mapr.com
Engage with us!
MapR
maprtech
mapr-technologies


Rethinking SQL for Big Data with Apache Drill

  • 1. Jim Scott, Director, Enterprise Strategy & Architecture, MapR Technologies (© 2015). @kingmesal #BigDataMadison
  • 2. Find my presentation and other related resources here: http://events.mapr.com/BigDataMadison (you can find this link in the event’s page at meetup.com). Today’s presentation; whiteboard & demo videos; free on-demand training; free eBooks; free Hadoop sandbox; and more…
  • 3. Topics: Motivation; Using Drill; SQL + NoSQL = ???; Security Controls; Demo; Resources
  • 4. Empowering “as it happens” businesses by speeding up the data-to-action cycle
  • 5. Top-Ranked NoSQL; Top-Ranked Hadoop Distribution; Top-Ranked SQL-on-Hadoop Solution
  • 6. Motivation
  • 7. Data is Doubling Every Two Years. (Chart: total data stored, 1980 to 2020, structured vs. semi-structured data.) Unstructured data will account for more than 80% of the data collected by organizations. Source: Human-Computer Interaction & Knowledge Discovery in Complex Unstructured, Big Data
  • 8. Data Increasingly Stored in Non-Relational Datastores (timeline 1980 to 2020)
      Relational databases: volume GBs-TBs; fixed schema, DBA controls structure; structured data; planned development (release cycle = months-years)
      Non-relational datastores: volume TBs-PBs; dynamic / flexible schema, application controls structure; structured, semi-structured and unstructured data; iterative development (release cycle = days-weeks)
  • 9. How To Bring SQL to Non-Relational Data Stores?
      Familiarity of SQL: ANSI SQL semantics; BI (Tableau, MicroStrategy, etc.); low latency
      Agility of NoSQL: no schema management (HDFS (Parquet, JSON, etc.), HBase, …); no transformation (no silos of data); ease of use
  • 10. Industry's First Schema-free SQL engine for Big Data
  • 11. Enabling “As-It-Happens” Business with Instant Analytics
      Traditional approach: Hadoop data, data modeling, transformation, data movement (optional), users. Total time to insight: weeks to months
      Exploratory approach: Hadoop data, users. Total time to insight: minutes
      Diagram labels: New business questions; Source data evolution
  • 12. Evolution Towards Self-Service Data Exploration
      Traditional BI w/ RDBMS: data modeling and transformation IT-driven; data visualization IT-driven
      Self-Service BI w/ RDBMS: data modeling and transformation IT-driven; data visualization self-service
      SQL-on-Hadoop: data modeling and transformation IT-driven; data visualization self-service
      Self-Service Data Exploration: data modeling and transformation optional; data visualization self-service (zero-day analytics)
  • 13. Common Use Cases: Raw Data Exploration; JSON Analytics; DWH offload. Sources pictured: Hive, HBase, files, directories, …; {JSON}, Parquet, text files, …
  • 14. Drill Supports Schema Discovery On-The-Fly
      Schema declared in advance (SCHEMA ON WRITE, SCHEMA BEFORE READ): fixed schema; leverage schema in centralized repository (Hive Metastore)
      Schema discovered on-the-fly (SCHEMA ON THE FLY): fixed schema, evolving schema or schema-less; leverage schema in centralized repository or self-describing data
  • 15. Drill Enables ‘SQL-on-Everything’
      SELECT * FROM dfs.yelp.`business.json`
      Storage plugin instance (dfs): DFS (Text, Parquet, JSON); HBase/MapR-DB; Hive Metastore/HCatalog; easy API to go beyond Hadoop
      Workspace (yelp): sub-directory; HBase namespace; Hive database
      Table (`business.json`): pathnames; Hive table; HBase table
  • 16. Drill’s Data Model is Flexible. (Chart axes: fixed schema to dynamic schema, flat to complex; formats shown: JSON, BSON, HBase, Parquet, Avro, CSV, TSV.)
      RDBMS/SQL-on-Hadoop table:
      Name      Gender  Age
      Michael   M       6
      Jennifer  F       3
      Apache Drill table:
      { name: { first: Michael, last: Smith }, hobbies: [ski, soccer], district: Los Altos }
      { name: { first: Jennifer, last: Gates }, hobbies: [sing], preschool: CCLC }
  • 17. Reuse Existing SQL Tools and Skills. Leverage SQL-compatible tools (BI, query builders, etc.) via Drill’s standard ODBC, JDBC and ANSI SQL support. Enable business analysts, technical analysts and data scientists to explore and analyze large volumes of real-time data
  • 18. Using Drill with Yelp
  • 19. Business dataset
      {
        "business_id": "4bEjOyTaDG24SY5TxsaUNQ",
        "full_address": "3655 Las Vegas Blvd S\nThe Strip\nLas Vegas, NV 89109",
        "hours": {
          "Monday": {"close": "23:00", "open": "07:00"},
          "Tuesday": {"close": "23:00", "open": "07:00"},
          "Friday": {"close": "00:00", "open": "07:00"},
          "Wednesday": {"close": "23:00", "open": "07:00"},
          "Thursday": {"close": "23:00", "open": "07:00"},
          "Sunday": {"close": "23:00", "open": "07:00"},
          "Saturday": {"close": "00:00", "open": "07:00"}
        },
        "open": true,
        "categories": ["Breakfast & Brunch", "Steakhouses", "French", "Restaurants"],
        "city": "Las Vegas",
        "review_count": 4084,
        "name": "Mon Ami Gabi",
        "neighborhoods": ["The Strip"],
        "longitude": -115.172588519464,
        "state": "NV",
        "stars": 4.0,
        "attributes": {
          "Alcohol": "full_bar",
          "Noise Level": "average",
          "Has TV": false,
          "Attire": "casual",
          "Ambience": {
            "romantic": true, "intimate": false, "touristy": false, "hipster": false,
            "classy": true, "trendy": false, "casual": false
          },
          "Good For": {"dessert": false, "latenight": false, "lunch": false,
                       "dinner": true, "breakfast": false, "brunch": false}
        }
      }
  • 20. Reviews dataset
      {
        "votes": {"funny": 0, "useful": 2, "cool": 1},
        "user_id": "Xqd0DzHaiyRqVH3WRG7hzg",
        "review_id": "15SdjuK7DmYqUAj6rjGowg",
        "stars": 5,
        "date": "2007-05-17",
        "text": "dr. goldberg offers everything ...",
        "type": "review",
        "business_id": "vcNAWiLM4dR7D2nwwJ7nCA"
      }
  • 21. Zero to Results in 2 minutes
      Install:
      $ tar -xvzf apache-drill-1.0.0.tar.gz
      Launch shell (embedded mode), using either command:
      $ bin/sqlline -u jdbc:drill:zk=local
      $ bin/drill-embedded
      Query files and directories:
      > SELECT state, city, count(*) AS businesses
        FROM dfs.yelp.`business.json`
        GROUP BY state, city
        ORDER BY businesses DESC LIMIT 10;
      Results:
      +-------+------------+------------+
      | state | city       | businesses |
      +-------+------------+------------+
      | NV    | Las Vegas  | 12021      |
      | AZ    | Phoenix    | 7499       |
      | AZ    | Scottsdale | 3605       |
      | EDH   | Edinburgh  | 2804       |
      | AZ    | Mesa       | 2041       |
      | AZ    | Tempe      | 2025       |
      | NV    | Henderson  | 1914       |
      | AZ    | Chandler   | 1637       |
      | WI    | Madison    | 1630       |
      | AZ    | Glendale   | 1196       |
      +-------+------------+------------+
  • 22. Directories are implicit partitions
      SELECT dir0, SUM(amount)
      FROM sales
      WHERE dir1 IN ('q1', 'q2')
      GROUP BY dir0
      sales
      ├── 2014
      │   ├── q1
      │   ├── q2
      │   ├── q3
      │   └── q4
      └── 2015
          └── q1
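The idea that directory levels act as queryable columns can be sketched outside Drill. The Python below is only an illustration of the semantics, not Drill's implementation: the file names and amounts are made up, and `dir_columns` and `sum_by_year` are hypothetical helpers.

```python
from collections import defaultdict


def dir_columns(path, root="sales"):
    """Map the directory levels below `root` to dir0, dir1, ... (file name excluded)."""
    parts = path.strip("/").split("/")
    levels = parts[parts.index(root) + 1:-1]
    return {f"dir{i}": name for i, name in enumerate(levels)}


def sum_by_year(rows, quarters):
    """Sum amounts per dir0, keeping only rows whose dir1 is in `quarters`."""
    totals = defaultdict(float)
    for path, amount in rows:
        cols = dir_columns(path)
        if cols.get("dir1") in quarters:
            totals[cols["dir0"]] += amount
    return dict(totals)


# Hypothetical files laid out like the tree on the slide.
rows = [
    ("sales/2014/q1/a.csv", 10.0),
    ("sales/2014/q2/b.csv", 5.0),
    ("sales/2014/q3/c.csv", 7.0),
    ("sales/2015/q1/d.csv", 2.5),
]
totals = sum_by_year(rows, {"q1", "q2"})
```

Filtering on `dir1` and grouping on `dir0` here mirrors a query that restricts to two quarters and totals per year.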
  • 23. Intuitive SQL Access to Complex Data. Query data with any levels of nesting
      // It’s Friday 10pm in Vegas and looking for Hummus
      > SELECT name, stars, b.hours.Friday friday, categories
        FROM dfs.yelp.`business.json` b
        WHERE b.hours.Friday.`open` < '22:00' AND
              b.hours.Friday.`close` > '22:00' AND
              REPEATED_CONTAINS(categories, 'Mediterranean') AND
              city = 'Las Vegas'
        ORDER BY stars DESC
        LIMIT 2;
      +-------------------------------+-------+----------------------------------+-------------------------------------------------------------+
      | name                          | stars | friday                           | categories                                                  |
      +-------------------------------+-------+----------------------------------+-------------------------------------------------------------+
      | Olives                        | 4.0   | {"close":"22:30","open":"11:00"} | ["Mediterranean","Restaurants"]                             |
      | Marrakech Moroccan Restaurant | 4.0   | {"close":"23:00","open":"17:30"} | ["Mediterranean","Middle Eastern","Moroccan","Restaurants"] |
      +-------------------------------+-------+----------------------------------+-------------------------------------------------------------+
  • 24. ANSI SQL Compatibility. Use familiar SQL functionality (joins, aggregations, sorting, sub-queries, SQL data types)
      // Get top cool rated businesses
      > SELECT b.name FROM dfs.yelp.`business.json` b
        WHERE b.business_id IN
          (SELECT r.business_id FROM dfs.yelp.`review.json` r
           GROUP BY r.business_id HAVING SUM(r.votes.cool) > 2000
           ORDER BY SUM(r.votes.cool) DESC);
      +-------------------------------+
      | name                          |
      +-------------------------------+
      | Earl of Sandwich              |
      | XS Nightclub                  |
      | The Cosmopolitan of Las Vegas |
      | Wicked Spoon                  |
      +-------------------------------+
  • 25. Logical Views. Lightweight file system based views for granular and de-centralized data management
      // Create a view combining business and reviews datasets
      > CREATE OR REPLACE VIEW dfs.tmp.BusinessReviews AS
        SELECT b.name, b.stars, r.votes.funny,
               r.votes.useful, r.votes.cool, r.`date`
        FROM dfs.yelp.`business.json` b, dfs.yelp.`review.json` r
        WHERE r.business_id = b.business_id;
      +------+-----------------------------------------------------------------+
      | ok   | summary                                                         |
      +------+-----------------------------------------------------------------+
      | true | View 'BusinessReviews' created successfully in 'dfs.tmp' schema |
      +------+-----------------------------------------------------------------+
      > SELECT COUNT(*) AS Total FROM dfs.tmp.BusinessReviews;
      +---------+
      | Total   |
      +---------+
      | 1125458 |
      +---------+
  • 26. Materialized Views AKA Tables. Save analysis results as tables using familiar CTAS syntax
      > ALTER SESSION SET `store.format` = 'parquet';
      > CREATE TABLE dfs.yelp.BusinessReviewsTbl AS
        SELECT b.name, b.stars, r.votes.funny funny,
               r.votes.useful useful, r.votes.cool cool, r.`date`
        FROM dfs.yelp.`business.json` b, dfs.yelp.`review.json` r
        WHERE r.business_id = b.business_id;
      +----------+---------------------------+
      | Fragment | Number of records written |
      +----------+---------------------------+
      | 1_0      | 176448                    |
      | 1_1      | 192439                    |
      | 1_2      | 198625                    |
      | 1_3      | 200863                    |
      | 1_4      | 181420                    |
      | 1_5      | 175663                    |
      +----------+---------------------------+
  • 27. Repeated Values Support. Dynamically flatten repeated and nested data elements as part of SQL queries. No ETL necessary
      // Flatten repeated categories
      > SELECT name, categories
        FROM dfs.yelp.`business.json` LIMIT 3;
      +----------------------------+------------------------------------------+
      | name                       | categories                               |
      +----------------------------+------------------------------------------+
      | Eric Goldberg, MD          | ["Doctors","Health & Medical"]           |
      | Pine Cone Restaurant       | ["Restaurants"]                          |
      | Deforest Family Restaurant | ["American (Traditional)","Restaurants"] |
      +----------------------------+------------------------------------------+
      > SELECT name, FLATTEN(categories) AS categories
        FROM dfs.yelp.`business.json` LIMIT 5;
      +----------------------------+------------------------+
      | name                       | categories             |
      +----------------------------+------------------------+
      | Eric Goldberg, MD          | Doctors                |
      | Eric Goldberg, MD          | Health & Medical       |
      | Pine Cone Restaurant       | Restaurants            |
      | Deforest Family Restaurant | American (Traditional) |
      | Deforest Family Restaurant | Restaurants            |
      +----------------------------+------------------------+
  • 28. Extensions to ANSI SQL to work with repeated values
      // Get most common business categories
      > SELECT category, count(*) AS categorycount
        FROM (SELECT name, FLATTEN(categories) AS category
              FROM dfs.yelp.`business.json`) c
        GROUP BY category ORDER BY categorycount DESC;
      +--------------+---------------+
      | category     | categorycount |
      +--------------+---------------+
      | Restaurants  | 14303         |
      | …            |               |
      | Australian   | 1             |
      | Boat Dealers | 1             |
      | Firewood     | 1             |
      +--------------+---------------+
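What FLATTEN does to a repeated column can be modelled in a few lines of Python. This is a sketch of the row-expansion semantics only, not of how Drill executes it; `flatten` is a hypothetical helper, and the three records are the ones shown in the result tables above.

```python
from collections import Counter


def flatten(rows, column):
    """Emit one output row per element of the repeated `column`."""
    for row in rows:
        for value in row[column]:
            yield {**row, column: value}


businesses = [
    {"name": "Eric Goldberg, MD", "categories": ["Doctors", "Health & Medical"]},
    {"name": "Pine Cone Restaurant", "categories": ["Restaurants"]},
    {"name": "Deforest Family Restaurant",
     "categories": ["American (Traditional)", "Restaurants"]},
]

flat = list(flatten(businesses, "categories"))

# Equivalent in spirit to GROUP BY category with count(*).
category_counts = Counter(row["categories"] for row in flat)
```

Once each array element is its own row, ordinary grouping and counting apply, which is why the category ranking needs no separate ETL step.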
  • 29. Checkins dataset
      {
        "checkin_info": {
          "3-4":1, "13-5":1, "6-6":1, "14-5":1, "14-6":1, "14-2":1, "14-3":1, "19-0":1, "11-5":1,
          "13-2":1, "11-6":2, "11-3":1, "12-6":1, "6-5":1, "5-5":1, "9-2":1, "9-5":1, "9-6":1,
          "5-2":1, "7-6":1, "7-5":1, "7-4":1, "17-5":1, "8-5":1, "10-2":1, "10-5":1, "10-6":1
        },
        "type": "checkin",
        "business_id": "JwUE5GmEO-sH1FuwJgKBlQ"
      }
  • 30. Supports Dynamic / Unknown Columns. Convert a map with a wide set of dynamic columns into an array of key-value pairs
      > SELECT KVGEN(checkin_info) checkins
        FROM dfs.yelp.`checkin.json` LIMIT 1;
      +----------+
      | checkins |
      +----------+
      | [{"key":"3-4","value":1},{"key":"13-5","value":1},{"key":"6-6","value":1},{"key":"14-5","value":1},
        {"key":"14-6","value":1},{"key":"14-2","value":1},{"key":"14-3","value":1},{"key":"19-0","value":1},
        {"key":"11-5","value":1},{"key":"13-2","value":1},{"key":"11-6","value":2},{"key":"11-3","value":1},
        {"key":"12-6","value":1},{"key":"6-5","value":1},{"key":"5-5","value":1},{"key":"9-2","value":1},
        {"key":"9-5","value":1},{"key":"9-6","value":1},{"key":"5-2","value":1},{"key":"7-6","value":1},
        {"key":"7-5","value":1},{"key":"7-4","value":1},{"key":"17-5","value":1},{"key":"8-5","value":1},
        {"key":"10-2","value":1},{"key":"10-5","value":1},{"key":"10-6","value":1}] |
      +----------+
      > SELECT FLATTEN(KVGEN(checkin_info)) checkins
        FROM dfs.yelp.`checkin.json` LIMIT 6;
      +---------------------------+
      | checkins                  |
      +---------------------------+
      | {"key":"3-4","value":1}   |
      | {"key":"13-5","value":1}  |
      | {"key":"6-6","value":1}   |
      | {"key":"14-5","value":1}  |
      | {"key":"14-6","value":1}  |
      | {"key":"14-2","value":1}  |
      +---------------------------+
  • 31. Makes it easy to work with dynamic/unknown columns
      // Count total number of checkins on Sunday midnight
      > SELECT SUM(checkintbl.checkins.`value`) AS SundayMidnightCheckins
        FROM (SELECT FLATTEN(KVGEN(checkin_info)) checkins
              FROM dfs.yelp.`checkin.json`) checkintbl
        WHERE checkintbl.checkins.key = '23-0';
      +------------------------+
      | SundayMidnightCheckins |
      +------------------------+
      | 8575                   |
      +------------------------+
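The KVGEN-then-FLATTEN pattern can be illustrated with a small Python model. It is a sketch of the semantics under the assumption that each map entry becomes one {"key", "value"} pair, as in the output above; `kvgen` and `sum_for_key` are hypothetical helpers and the two records are invented.

```python
def kvgen(mapping):
    """Turn a map with arbitrary keys into a list of {"key", "value"} pairs."""
    return [{"key": k, "value": v} for k, v in mapping.items()]


def sum_for_key(records, key):
    """Flatten every record's key-value pairs and sum the values for one key."""
    total = 0
    for record in records:
        for pair in kvgen(record["checkin_info"]):
            if pair["key"] == key:
                total += pair["value"]
    return total


# Invented check-in records; the key format follows the dataset above.
checkins = [
    {"business_id": "a", "checkin_info": {"3-4": 1, "11-6": 2, "23-0": 4}},
    {"business_id": "b", "checkin_info": {"23-0": 3, "9-5": 1}},
]
sunday_midnight = sum_for_key(checkins, "23-0")
```

Because the keys become data rather than column names, the filter on `key` works even when no two records share the same set of keys.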
  • 32. SQL + NoSQL = Accessible & Linearly Scalable
  • 33. MusicBrainz on NoSQL
      Artists, albums, tracks and labels are key objects
      Reality check: add works (compositions), recordings, release, release group
      7 tables for artist alone
      12 for place, 7 for label, 17 for release/group, 8 for work (but only 4 for recording!)
      Total of 12 + 7 + 17 + 8 + 4 = 48 tables
      But wait, there’s more! 10 annotation tables, 10 edit tables, 19 tag tables, 5 rating tables, 86 link tables, 5 cover art tables and 3 tables for CD timing info (138 total)
      And 50 more tables that aren’t documented yet
  • 34. 180 Tables NOT SHOWN!
  • 35. 236 tables to describe 7 kinds of things
  • 36. artist (with inlined lists): id, gid, name, sort_name, begin_date, end_date, ended, type, gender, area, begin_area, end_area, comment, list<ipi>, list<isni>, list<alias>, list<release_id>, list<recording_id>
      artist (without back-references): id, gid, name, sort_name, begin_date, end_date, ended, type, gender, area, begin_area, end_area, comment, list<ipi>, list<isni>, list<alias>
  • 37. artist
      Primitive values: id, gid, name, sort_name, begin_date, end_date, ended, type, gender, area, begin_area, end_area, comment
      One to many relations: list<ipi>, list<isni>, list<alias>
      Equivalent to indexes: list<release_id>, list<recording_id>
  • 38. Further Reductions
      All 86 link tables become properties on artists, releases and other entities
      All 44 tag, rating and annotation tables become list properties
      All 5 cover art tables become lists of file references
      Current score: 162 tables become 4
      You get the idea
  • 39. Is This Good?
      Expressivity
      –  The JSON data model is at least as expressive as the original relational model
         •  Many cases are easier to describe in nested data
         •  No cases are harder
      Efficiency
      –  Inlining can increase data size; locality improves, however
      –  Sessionizing can substantially decrease data size
      –  Inlining back-references is more efficient than ordinary indexes
      –  Inlined columnar data allows 1000x speedup for time series
      Introspection (you decide)
  • 40. Searching for Elvis
      // Find discs where Elvis was credited
      > SELECT DISTINCT album_id, name FROM
          (SELECT id album_id, artist_id, name, FLATTEN(credit) FROM release) albums
        JOIN
          (SELECT DISTINCT artist_id FROM
            (SELECT id artist_id, FLATTEN(alias) FROM artist
             WHERE name LIKE 'Elvis%Presley')
          ) artists
        USING (artist_id);
  • 41. Benefits
      Extended relational model allows massive simplification
      –  On a real example, we see >20x reduction in number of tables
      Simplification drives improved introspection
      –  This is good
      Apache Drill gives very high performance execution for extended relational problems
      You can try this out today
  • 42. Security Controls
  • 43. Access Controls that Scale. PAM authentication + user impersonation. Fine-grained row and column level access control with Drill Views; no centralized security repository required. (Diagram: users, Drill View 1 and Drill View 2, over Files, HBase and Hive.)
  • 44. Granular Security via Drill Views
      Raw File (/raw/cards.csv). Owner: Admins. Permission: Admins
      Name  City      State  Credit Card #
      Dave  San Jose  CA     1374-7914-3865-4817
      John  Boulder   CO     1374-9735-1794-9711
      Data Scientist View (/views/maskedcards.csv), not a physical data copy. Owner: Admins. Permission: Data Scientists
      Name  City      State  Credit Card #
      Dave  San Jose  CA     1374-1111-1111-1111
      John  Boulder   CO     1374-1111-1111-1111
      Business Analyst View. Owner: Admins. Permission: Business Analysts
      Name  City      State
      Dave  San Jose  CA
      John  Boulder   CO
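The two views can be thought of as pure functions over the raw rows, which is why no physical copy is needed. The Python below sketches that idea only; it is not how Drill defines views. The masking rule (keep the first group, replace the rest with 1111) is inferred from the sample rows, and `mask_card`, `scientist_view` and `analyst_view` are hypothetical names.

```python
def mask_card(number):
    """Keep the first group of digits and replace every later group with 1111."""
    first, *rest = number.split("-")
    return "-".join([first] + ["1111"] * len(rest))


def scientist_view(rows):
    """Column-level masking: same columns, card numbers obscured."""
    return [{**row, "credit_card": mask_card(row["credit_card"])} for row in rows]


def analyst_view(rows):
    """Column-level restriction: the card column is not exposed at all."""
    return [{k: v for k, v in row.items() if k != "credit_card"} for row in rows]


raw = [
    {"name": "Dave", "city": "San Jose", "state": "CA",
     "credit_card": "1374-7914-3865-4817"},
    {"name": "John", "city": "Boulder", "state": "CO",
     "credit_card": "1374-9735-1794-9711"},
]
masked = scientist_view(raw)
restricted = analyst_view(raw)
```

Each view is computed on read from the raw rows, so access can be granted per view while the raw file stays readable by admins only.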
  • 45. Ownership Chaining: Combine Self Service Exploration with Data Governance
      Raw File (/raw/cards.csv). Owner: John
      Name  City      State  Credit Card #
      Dave  San Jose  CA     1374-7914-3865-4817
      John  Boulder   CO     1374-9735-1794-9711
      Data Scientist view (/views/V_Scientist). Owner: John. Read: Jane
      Name  City      State  Credit Card #
      Dave  San Jose  CA     1374-1111-1111-1111
      John  Boulder   CO     1374-1111-1111-1111
      Analyst view (/views/V_Analyst). Owner: Jane. Read: Jack
      Name  City      State
      Dave  San Jose  CA
      John  Boulder   CO
      Access path when Jack queries the view V_Analyst:
      1. Does Jack have access to V_Analyst? -> YES. Who is the owner of V_Analyst? -> Jane. Drill accesses V_Analyst as Jane (impersonation hop 1)
      2. Does Jane have access to V_Scientist? -> YES. Who is the owner of V_Scientist? -> John. Drill accesses V_Scientist as John (impersonation hop 2)
      3. Does John have permissions on the raw file? -> YES. Who is the owner of the raw file? -> John. Drill accesses the source file as John (no impersonation here)
      * Ownership chain length (# hops) is configurable
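The access path above amounts to a walk down the chain of views, switching identity to each object's owner and counting hops. The Python below is a simplified model of that walk, not Drill's actual permission check: the `OBJECTS` table, the `can_query` function and the default `max_hops` value are all assumptions made for illustration.

```python
# Each object records its owner, who else may read it, and what it is built on.
OBJECTS = {
    "V_Analyst": {"owner": "Jane", "readers": {"Jack"}, "source": "V_Scientist"},
    "V_Scientist": {"owner": "John", "readers": {"Jane"}, "source": "raw"},
    "raw": {"owner": "John", "readers": set(), "source": None},
}


def can_query(user, name, max_hops=3):
    """Follow the ownership chain from `name` down to the raw file."""
    hops = 0
    current_user = user
    while name is not None:
        obj = OBJECTS[name]
        is_owner = current_user == obj["owner"]
        if not is_owner and current_user not in obj["readers"]:
            return False  # current identity may not read this object
        if not is_owner:
            hops += 1  # impersonate the owner for the next level
            if hops > max_hops:
                return False
            current_user = obj["owner"]
        name = obj["source"]
    return True
```

With these assumptions, `can_query("Jack", "V_Analyst")` succeeds after two hops, while Jack querying `V_Scientist` or the raw file directly is refused; lowering `max_hops` to 1 also blocks the first query.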
  • 46. Security Summary
      Logical
      –  No physical data copies/silos
      Granular
      –  Row level and column level security controls
      De-centralized
      –  User impersonation respecting storage system permissions
      –  No separate permission repository for granular controls
      –  Integrated with Hadoop File System permissions and LDAP
      Self-service w/ governance
      –  If you have access to data, you control who can access it and how widely
      –  Audits
  • 47. National Nutrient Database
  • 48. Complex
  • 49. Simpler
  • 50. Simplest
  • 51. Sample SR27 Records
    ~01001~^~0100~^~Butter, salted~^~BUTTER,WITH SALT~^~~^~~^~Y~^~~^0^~~^6.38^4.27^8.79^3.87
    ~01004~^~0100~^~Cheese, blue~^~CHEESE,BLUE~^~~^~~^~Y~^~~^0^~~^6.38^4.27^8.79^3.87
    ~01005~^~0100~^~Cheese, brick~^~CHEESE,BRICK~^~~^~~^~Y~^~~^0^~~^6.38^4.27^8.79^3.87
    ~01006~^~0100~^~Cheese, brie~^~CHEESE,BRIE~^~~^~~^~Y~^~~^0^~~^6.38^4.27^8.79^3.87
    ~01007~^~0100~^~Cheese, camembert~^~CHEESE,CAMEMBERT~^~~^~~^~Y~^~~^0^~~^6.38^4.27^8.79^3.87
    ~01008~^~0100~^~Cheese, caraway~^~CHEESE,CARAWAY~^~~^~~^~~^~~^0^~~^6.38^4.27^8.79^3.87
  • 52. Configuration
    -- Format --
    "nndb": {
      "type": "text",
      "extensions": [ "txt" ],
      "quote": "~",
      "escape": "~",
      "delimiter": "^"
    },
    -- Workspace --
    "nndb": {
      "location": "/opt/drill/nndb",
      "writable": true,
      "storageformat": "parquet"
    },
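To see what the format settings mean for the SR27 records, the same delimiter and quote characters can be applied with Python's standard `csv` module. This is a rough analogue and not Drill's text reader; `SAMPLE_RECORDS` and `parse_sr27` are hypothetical names, and the two records are copied from the Sample SR27 Records slide.

```python
import csv
import io

# Two records copied from the Sample SR27 Records slide.
SAMPLE_RECORDS = (
    "~01001~^~0100~^~Butter, salted~^~BUTTER,WITH SALT~^~~^~~^~Y~^~~^0^~~^6.38^4.27^8.79^3.87\n"
    "~01004~^~0100~^~Cheese, blue~^~CHEESE,BLUE~^~~^~~^~Y~^~~^0^~~^6.38^4.27^8.79^3.87\n"
)


def parse_sr27(text: str):
    """Split caret-delimited, tilde-quoted text into lists of string fields."""
    # delimiter and quotechar mirror the "delimiter" and "quote" settings.
    # The csv module's default doublequote=True treats a doubled '~' inside a
    # quoted field as a literal '~', which plays the role of the "escape" setting.
    reader = csv.reader(io.StringIO(text), delimiter="^", quotechar="~")
    return [row for row in reader]


if __name__ == "__main__":
    for fields in parse_sr27(SAMPLE_RECORDS):
        print(fields)
```

Quoting matters here because descriptions such as "Butter, salted" contain commas; with `^` as the delimiter and `~` as the quote character, each description stays in a single field and `~~` parses as an empty string.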
  • 53. Sample JSON
    {
      "ndb_no": "08613",
      "shrt_desc": "CEREALS RTE,KELLOGG'S SPL K MULTIGRAIN OATS & HONEY",
      "nut_data": [
        {
          "nutr_no": "203",
          "nutr_val": "7.80",
          "nutr_def": {"num_dec": 2, "tagname": "PROCNT", "nutrdesc": "Protein"},
          "data_src": [
            {
              "datasrc_id": "S6941",
              "authors": "A Kellogg, Co.",
              "title": "Kellogg Company Data",
              "Year": "2011"
            }
          ]
        },
        {
          "nutr_no": "205",
          "nutr_val": "85.00",
          "nutr_def": {"num_dec": 2, "tagname": "CHOCDF", "nutrdesc": "Carbohydrate, by difference"},
          "data_src": [
            {
              "datasrc_id": "S6941",
              "authors": "C Kellogg, Co.",
              "title": "Kellogg Company Data",
              "Year": "2011"
            }
          ]
        }
      ],
      "langual": ["ANISE", "FRUIT", "WHOLE, NATURAL SHAPE", "NOT HEAT-TREATED", "COOKING METHOD NOT APPLICABLE", "WATER REMOVED", "HEAT DRIED", "HUMAN FOOD, NO AGE SPECIFICATION"]
    }
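Querying a document like this usually means turning the nested `nut_data` array into one row per nutrient, which Drill exposes in SQL through its `FLATTEN` function. The Python sketch below shows the same reshaping outside Drill, as an illustration under assumptions: `SAMPLE_FOOD` is a trimmed copy of the slide's record (the `data_src` and `langual` fields are left out), and `flatten_nutrients` is a hypothetical helper.

```python
# Trimmed copy of the Sample JSON record; data_src and langual are omitted.
SAMPLE_FOOD = {
    "ndb_no": "08613",
    "shrt_desc": "CEREALS RTE,KELLOGG'S SPL K MULTIGRAIN OATS & HONEY",
    "nut_data": [
        {
            "nutr_no": "203",
            "nutr_val": "7.80",
            "nutr_def": {"num_dec": 2, "tagname": "PROCNT", "nutrdesc": "Protein"},
        },
        {
            "nutr_no": "205",
            "nutr_val": "85.00",
            "nutr_def": {
                "num_dec": 2,
                "tagname": "CHOCDF",
                "nutrdesc": "Carbohydrate, by difference",
            },
        },
    ],
}


def flatten_nutrients(food):
    """Yield one flat row per entry in the food's nut_data array."""
    for nutrient in food["nut_data"]:
        yield {
            "ndb_no": food["ndb_no"],
            "nutr_no": nutrient["nutr_no"],
            "nutrdesc": nutrient["nutr_def"]["nutrdesc"],
            # Values are stored as strings in the source record.
            "nutr_val": float(nutrient["nutr_val"]),
        }


if __name__ == "__main__":
    for row in flatten_nutrients(SAMPLE_FOOD):
        print(row)
```

Each output row repeats the parent's `ndb_no` next to one nutrient, so the result can be filtered or aggregated like an ordinary relational table.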
  • 54. Demo Queries
    All queries can be found within these blogs:
    https://www.mapr.com/blog/drilling-healthy-choices
    https://www.mapr.com/blog/evolution-database-schemas-using-sql-nosql
  • 55. Live Demo
  • 56. Drill is Top-Ranked SQL-on-Hadoop
    Source: Gigaom Research, 2015
    Key:
      •  Number indicates company’s relative strength across all vectors
      •  Size of ball indicates company’s relative strength along individual vector
    “Drill isn’t just about SQL-on-Hadoop. It’s about SQL-on-pretty-much-anything, immediately, and without formality.”
  • 57. Drill Project Status
    Timeline dates: Sep’12, Jun’13, Aug’14, Sep’14, Nov’14, Dec’14, Jan’15, Mar’15, Apr’15, May’15
    Milestones (in the order labeled on the slide): Project incubation; First release, Drill 0.1; Beta, Drill 0.5 + Apache Top Level Project; Drill 0.7; Drill 0.8; Drill 0.9; GigaOm top-ranked SQL-on-Hadoop; Drill 0.6; Dev Preview, Drill 0.4; Drill 1.0 (May’15, just released)
    Highlights:
      –  Apache Top Level Project
      –  Growing user adoption
      –  Iterative project cycles
      –  Large community, growing rapidly
      –  50 contributors
      –  1000s of downloads
      –  7 releases in under 9 months
  • 58. Recommendations for Getting Started with Drill
    New to Drill?
      –  Get started with Free MapR On Demand training
      –  Test Drive Drill on cloud with AWS
      –  Learn how to use Drill with Hadoop using MapR sandbox
    Ready to play with your data?
      –  Try out Apache Drill in 10 mins guide on your desktop
      –  Download Drill for your cluster and start exploration
      –  Comprehensive tutorials and documentation available
    Ask questions
      –  user@drill.apache.org
  • 60. Find my presentation and other related resources here: http://events.mapr.com/BigDataMadison (you can find this link in the event’s page at meetup.com)
    Resources: Today’s Presentation; Whiteboard & demo videos; Free On-Demand Training; Free eBooks; Free Hadoop Sandbox; and more…
  • 61. Q&A: @kingmesal, maprtech, jscott@mapr.com
    Engage with us! MapR, maprtech, mapr-technologies