© 2016 MapR Technologies© 2016 MapR TechnologiesMapR Confidential
© 2016 MapR Technologies
1
SQL-on-Hadoop with
© 2016 MapR Technologies© 2016 MapR Technologies@tgrall 3
Ingest
Store
Process
Consume
© 2016 MapR Technologies© 2016 MapR Technologies@tgrall 4
MapR Converged Data Platform
© 2016 MapR Technologies© 2016 MapR Technologies@tgrall 6
Agenda
• Why Drill?
• Some Drilling with Open Data
• How does it work?
© 2016 MapR Technologies© 2016 MapR Technologies@tgrall 7
1980 2000 20101990 2020
Fixed schema
DBA controls structure
Dynamic / Flexible schema
Application controls structure
NON-RELATIONAL DATASTORESRELATIONAL DATABASES
GBs-TBs TBs-PBsVolume
Database
Data Increasingly Stored in Non-Relational Datastores
Structure
Development
Structured Structured, semi-structured and unstructured
Planned (release cycle = months-years) Iterative (release cycle = days-weeks)
© 2016 MapR Technologies© 2016 MapR Technologies@tgrall 8
How To Bring SQL to Non-Relational Data Stores?
Familiarity of SQL Agility of NoSQL
• ANSI SQL semantics
• Low latency
• Integrated with Tools/Applications
• No schema management
– HDFS (Parquet, JSON, etc.)
– HBase
– …
• No transformation
– No silos of data
• Ease of use
© 2016 MapR Technologies© 2016 MapR Technologies@tgrall 9
Drill Supports Schema Discovery On-The-Fly
• Fixed schema
• Leverage schema in centralized
repository (Hive Metastore)
• Fixed schema, evolving schema or
schema-less
• Leverage schema in centralized
repository or self-describing data
2Schema Discovered On-The-FlySchema Declared In Advance
SCHEMA ON
WRITE
SCHEMA BEFORE
READ
SCHEMA ON THE
FLY
© 2016 MapR Technologies© 2016 MapR Technologies@tgrall 10
• Pioneering Data Agility for Hadoop
• Apache open source project
• Scale-out execution engine for low-latency queries
• Unified SQL-based API for analytics & operational applications
50+ contributors
150+ years of experience building
databases and distributed systems
Contributing to Apache Drill
© 2016 MapR Technologies© 2016 MapR Technologies@tgrall 11
- Sub-directory
- HBase namespace
- Hive database
- Database
Drill Enables ‘SQL-on-Everything’
SELECT * FROM dfs.yelp.`business.json`
Workspace
- Pathnames
- Hive table
- HBase table
- Table
Table
- DFS (Text, Parquet, JSON, XML)
- HBase/MapR-DB
- Hive Metastore/HCatalog
- RDBMS using JDBC
- Easy API to go beyond Hadoop
Storage plugin instance
© 2016 MapR Technologies© 2016 MapR Technologies@tgrall 12
Drill’s Data Model is Flexible
JSON
BSON
HBase
Parquet
Avro
CSV
TSV
Dynamic
schema
Fixed schema
Complex
Flat
Flexibility
Name Gender Age
Michael M 6
Jennifer F 3
RDBMS/SQL-on-Hadoop table
Flexibility
Apache Drill table
{
name: {
first: Michael,
last: Smith
},
hobbies: [ski, soccer],
district: Los Altos
}
{
name: {
first: Jennifer,
last: Gates
},
hobbies: [sing],
preschool: CCLC
}
© 2016 MapR Technologies© 2016 MapR Technologies@tgrall 13
Drilling into Data
© 2016 MapR Technologies© 2016 MapR Technologies@tgrall 14
Business dataset {
"business_id": "4bEjOyTaDG24SY5TxsaUNQ",
"full_address": "3655 Las Vegas Blvd SnThe StripnLas Vegas, NV 89109",
"hours": {
"Monday": {"close": "23:00", "open": "07:00"},
"Tuesday": {"close": "23:00", "open": "07:00"},
"Friday": {"close": "00:00", "open": "07:00"},
"Wednesday": {"close": "23:00", "open": "07:00"},
"Thursday": {"close": "23:00", "open": "07:00"},
"Sunday": {"close": "23:00", "open": "07:00"},
"Saturday": {"close": "00:00", "open": "07:00"}
},
"open": true,
"categories": ["Breakfast & Brunch", "Steakhouses", "French", "Restaurants"],
"city": "Las Vegas",
"review_count": 4084,
"name": "Mon Ami Gabi",
"neighborhoods": ["The Strip"],
"longitude": -115.172588519464,
"state": "NV",
"stars": 4.0,
"attributes": {
"Alcohol": "full_bar”,
"Noise Level": "average",
"Has TV": false,
"Attire": "casual",
"Ambience": {
"romantic": true,
"intimate": false,
"touristy": false,
"hipster": false,
"classy": true,
"trendy": false,
"casual": false
},
"Good For": {"dessert": false, "latenight": false, "lunch": false,
"dinner": true, "breakfast": false, "brunch": false},
}
}
© 2016 MapR Technologies© 2016 MapR Technologies@tgrall 15
Reviews dataset
{
"votes": {"funny": 0, "useful": 2, "cool": 1},
"user_id": "Xqd0DzHaiyRqVH3WRG7hzg",
"review_id": "15SdjuK7DmYqUAj6rjGowg",
"stars": 5,
"date": "2007-05-17",
"text": "dr. goldberg offers everything ...",
"type": "review",
"business_id": "vcNAWiLM4dR7D2nwwJ7nCA"
}
© 2016 MapR Technologies© 2016 MapR Technologies@tgrall 16
Interfaces and Tools
WebUIDrill Explorer
ODBC/JDBC
Sqlline
© 2016 MapR Technologies© 2016 MapR Technologies@tgrall 17
`
$ tar -xvzf apache-drill-1.4.0.tar.gz
$ bin/sqlline -u jdbc:drill:zk=local
$ bin/drill-embedded
> SELECT state, city, count(*) AS businesses
FROM dfs.yelp.`business.json.gz`
GROUP BY state, city
ORDER BY businesses DESC LIMIT 10;
+------------+------------+-------------+
| state | city | businesses |
+------------+------------+-------------+
| NV | Las Vegas | 12021 |
| AZ | Phoenix | 7499 |
| AZ | Scottsdale | 3605 |
| EDH | Edinburgh | 2804 |
| AZ | Mesa | 2041 |
| AZ | Tempe | 2025 |
| NV | Henderson | 1914 |
| AZ | Chandler | 1637 |
| WI | Madison | 1630 |
| AZ | Glendale | 1196 |
+------------+------------+-------------+
© 2016 MapR Technologies© 2016 MapR Technologies@tgrall 18
Intuitive SQL Access to Complex Data
// It’s Friday 10pm in Vegas and looking for Hummus
> SELECT name, stars, b.hours.Friday friday, categories
FROM dfs.yelp.`business.json` b
WHERE b.hours.Friday.`open` < '22:00' AND
b.hours.Friday.`close` > '22:00' AND
REPEATED_CONTAINS(categories, 'Mediterranean') AND
city = 'Las Vegas'
ORDER BY stars DESC
LIMIT 2;
+------------------------------------+--------+-----------------------------------+-----------------------------------------------------------+
| name | stars | friday | categories |
+------------------------------------+--------+-----------------------------------+-----------------------------------------------------------+
| Khoury's Mediterranean Restaurant | 4.0 | {"close":"23:00","open":"11:00"} | ["Greek","Mediterranean","Middle Eastern","Restaurants"] |
| Olives | 4.0 | {"close":"22:30","open":"11:00"} | ["Bars","Mediterranean","Nightlife","Restaurants"] |
+------------------------------------+--------+-----------------------------------+-----------------------------------------------------------+
© 2016 MapR Technologies© 2016 MapR Technologies@tgrall 19
ANSI SQL Compatibility
//Get top cool rated businesses
 SELECT b.name from dfs.yelp.`business.json` b
WHERE b.business_id IN
(SELECT r.business_id FROM dfs.yelp.`review.json` r
GROUP BY r.business_id HAVING SUM(r.votes.cool) > 2000 ORDER BY
SUM(r.votes.cool) DESC);
+--------------------------------+
| name |
+--------------------------------+
| Earl of Sandwich |
| XS Nightclub |
| The Cosmopolitan of Las Vegas |
| Wicked Spoon |
| Bacchanal Buffet |
+--------------------------------+
© 2016 MapR Technologies© 2016 MapR Technologies@tgrall 20
Logical Views
//Create a view combining business and reviews datasets
> CREATE OR REPLACE VIEW dfs.tmp.BusinessReviews AS
SELECT b.name, b.stars, r.votes.funny,
r.votes.useful, r.votes.cool, r.`date`
FROM dfs.yelp.`business.json` b, dfs.yelp.`review.json` r
WHERE r.business_id = b.business_id;
+-------+-------------------------------------------------------------------+
| ok | summary |
+-------+-------------------------------------------------------------------+
| true | View 'BusinessReviews' replaced successfully in 'dfs.tmp' schema |
+-------+-------------------------------------------------------------------+
> SELECT COUNT(*) AS Total FROM dfs.tmp.BusinessReviews;
+------------+
| Total |
+------------+
| 1125458 |
+------------+
© 2016 MapR Technologies© 2016 MapR Technologies@tgrall 21
Materialized Views AKA Tables
> ALTER SESSION SET `store.format` = 'parquet';
> CREATE TABLE dfs.yelp.BusinessReviewsTbl AS
SELECT b.name, b.stars, r.votes.funny funny,
r.votes.useful useful, r.votes.cool cool, r.`date`
FROM dfs.yelp.`business.json` b, dfs.yelp.`review.json` r
WHERE r.business_id = b.business_id;
+------------+---------------------------+
| Fragment | Number of records written |
+------------+---------------------------+
| 1_0 | 176448 |
| 1_1 | 192439 |
| 1_2 | 198625 |
| 1_3 | 200863 |
| 1_4 | 181420 |
| 1_5 | 175663 |
+------------+---------------------------+
© 2016 MapR Technologies© 2016 MapR Technologies@tgrall 22
Repeated Values Support
// Flatten repeated categories
> SELECT name, categories
FROM dfs.yelp.`business.json` LIMIT 3;
+---------------------------+-------------------------------------+
| name | categories |
+---------------------------+-------------------------------------+
| Eric Goldberg, MD | ["Doctors","Health & Medical"] |
| Clancy's Pub | ["Nightlife"] |
| Cool Springs Golf Center | ["Active Life","Mini Golf","Golf"] |
+—————————————+-------------------------------------+
> SELECT name, FLATTEN(categories) AS categories
FROM dfs.yelp.`business.json` LIMIT 5;
+---------------------------+-------------------+
| name | categories |
+---------------------------+-------------------+
| Eric Goldberg, MD | Doctors |
| Eric Goldberg, MD | Health & Medical |
| Clancy's Pub | Nightlife |
| Cool Springs Golf Center | Active Life |
| Cool Springs Golf Center | Mini Golf |
+---------------------------+-------------------+
© 2016 MapR Technologies© 2016 MapR Technologies@tgrall 23
Extensions to ANSI SQL to work with repeated values
// Get most common business categories
>SELECT category, count(*) AS categorycount
FROM (SELECT name, FLATTEN(categories) AS category
FROM dfs.yelp.`business.json`) c
GROUP BY category ORDER BY categorycount DESC;
+-------------------------------+----------------+
| category | categorycount |
+-------------------------------+----------------+
| Restaurants | 21892 |
| Shopping | 8919 |
| Automotive | 2965 |
… … … … … … … … … … … … … … … … … … … … … … … … … … …
| Oriental | 1 |
| High Fidelity Audio Equipment | 1 |
+--------------------------------+---------------+
© 2016 MapR Technologies© 2016 MapR Technologies@tgrall 24
Checkins dataset {
"checkin_info":{
"3-4":1,
"13-5":1,
"6-6":1,
"14-5":1,
"14-6":1,
"14-2":1,
"14-3":1,
"19-0":1,
"11-5":1,
"13-2":1,
"11-6":2,
"11-3":1,
"12-6":1,
"6-5":1,
"5-5":1,
"9-2":1,
"9-5":1,
"9-6":1,
"5-2":1,
"7-6":1,
"7-5":1,
"7-4":1,
"17-5":1,
"8-5":1,
"10-2":1,
"10-5":1,
"10-6":1
},
"type":"checkin",
"business_id":"JwUE5GmEO-sH1FuwJgKBlQ"
}
© 2016 MapR Technologies© 2016 MapR Technologies@tgrall 25
Supports Dynamic / Unknown Columns
> SELECT KVGEN(checkin_info) checkins
FROM dfs.yelp.`checkin.json` LIMIT 1;
+------------+
| checkins |
+------------+
| [{"key":"3-4","value":1},{"key":"13-5","value":1},{"key":"6-6","value":1},{"key":"14-
5","value":1},{"key":"14-6","value":1},{"key":"14-2","value":1},{"key":"14-3","value":1},{"key":"19-
0","value":1},{"key":"11-5","value":1},{"key":"13-2","value":1},{"key":"11-6","value":2},{"key":"11-
3","value":1},{"key":"12-6","value":1},{"key":"6-5","value":1},{"key":"5-5","value":1},{"key":"9-
2","value":1},{"key":"9-5","value":1},{"key":"9-6","value":1},{"key":"5-2","value":1},{"key":"7-
6","value":1},{"key":"7-5","value":1},{"key":"7-4","value":1},{"key":"17-5","value":1},{"key":"8-
5","value":1},{"key":"10-2","value":1},{"key":"10-5","value":1},{"key":"10-6","value":1}] |
+------------+
> SELECT FLATTEN(KVGEN(checkin_info)) checkins FROM
dfs.yelp.`checkin.json` limit 6;
+---------------------------+
| checkins |
+---------------------------+
| {"key":"9-5","value":1} |
| {"key":"7-5","value":1} |
| {"key":"13-3","value":1} |
| {"key":"17-6","value":1} |
| {"key":"13-0","value":1} |
| {"key":"17-3","value":1} |
+---------------------------+
© 2016 MapR Technologies© 2016 MapR Technologies@tgrall 26
Makes it easy to work with dynamic/unknown columns
// Count total number of checkins on Sunday midnight
> SELECT SUM(checkintbl.checkins.`value`) as SundayMidnightCheckins FROM
(SELECT FLATTEN(KVGEN(checkin_info)) checkins
FROM dfs.yelp.checkin.json`) checkintbl
WHERE checkintbl.checkins.key='23-0';
+------------------------+
| SundayMidnightCheckins |
+------------------------+
| 8575 |
+------------------------+
© 2016 MapR Technologies© 2016 MapR Technologies@tgrall 27
Federated Queries
// Join JSON File, Parquet and MongoDB collection
> SELECT u.name, b.category, count(1) nb_review
FROM mongo.yelp.`user` u , dfs.yelp.`review.parquet` r, (select business_id, flatten(categories) category from
dfs.yelp.`business.json` ) b
WHERE u.user_id = r.user_id
AND b.business_id = r.business_id
GROUP BY u.user_id, u.name, b.category
ORDER BY nb_review DESC
LIMIT 10;
+-----------+--------------+------------+
| name | category | nb_review |
+-----------+--------------+------------+
| Rand | Restaurants | 1086 |
| J | Restaurants | 661 |
| Jennifer | Restaurants | 657 |
| Brad | Restaurants | 638 |
| Jade | Restaurants | 586 |
… … … … … … … … … .. .. ..
| Emily | Restaurants | 560 |
| Norm | Restaurants | 543 |
| Aileen | Restaurants | 499 |
| Michael | Restaurants | 496 |
+-----------+--------------+------------+
© 2016 MapR Technologies© 2016 MapR Technologies@tgrall 28
Drill Data Sources
JDBC
© 2016 MapR Technologies© 2016 MapR Technologies@tgrall 29
Directories are implicit partitions
select dir0, count(1)
from dfs.data.`/logs/`
where dir1 in (1,2,3)
group by dir0
logs
├── 2014
│ ├── 1
│ ├── 2
│ ├── 3
│ └── 4
└── 2015
└── 1
© 2016 MapR Technologies© 2016 MapR Technologies@tgrall 30
How does it work?
© 2016 MapR Technologies© 2016 MapR Technologies@tgrall 31
Everything is a “Drillbit”
• “Drillbits” run on each node
• In-memory columnar execution
• Coordination, Planning,
Execution
• Networked or not
• Exposes JDBC, ODBC, REST
• Built In Web UI and CLI
• Extensible
• Custom Functions
• Data Sources
Drillbit
© 2016 MapR Technologies© 2016 MapR Technologies@tgrall 32
Data Locality
HBase HBase
Mac
HDFS HDFS HDFS
HDFS HDFS HDFS
mongod mongod
HBase
Windows
Clusters DesktopClusters
HDFS & HBase cluster
HDFS cluster
MongoDB cluster
© 2016 MapR Technologies© 2016 MapR Technologies@tgrall 33
Run drillbits close to your data!
Drillbit
HBase HBase
Mac
HDFS HDFS HDFS
HDFS HDFS HDFS
mongod mongod
HBase
Windows
Clusters DesktopClusters
HDFS & HBase cluster
HDFS cluster
MongoDB cluster
Drillbit Drillbit
Drillbit Drillbit Drillbit
Drillbit Drillbit
Drillbit
Drillbit
© 2016 MapR Technologies© 2016 MapR Technologies@tgrall 34
Query Execution
Client
Drillbit Drillbit Drillbit
• Client connects to “a” Drillbit
• This Drillbit becomes the foreman
• Foreman generates execution plan
• cost base query optimisation
© 2016 MapR Technologies© 2016 MapR Technologies@tgrall 35
Query Execution
Client
Drillbit Drillbit Drillbit
• Execution fragment are farmed to other Drillbits
• Drillbits exchange data when necessary
© 2016 MapR Technologies© 2016 MapR Technologies@tgrall 36
Query Execution
Client
Drillbit Drillbit Drillbit
• Results are returned to the user through foreman
© 2016 MapR Technologies© 2016 MapR Technologies@tgrall 37
Granular Security via Drill Views
Name City State Credit Card #
Dave San Jose CA 1374-7914-3865-4817
John Boulder CO 1374-9735-1794-9711
Raw File (/raw/cards.csv)
Owner
Admins
Permission Admins
Business Analyst Data Scientist
Name City State Credit Card #
Dave San Jose CA 1374-1111-1111-1111
John Boulder CO 1374-1111-1111-1111
Data Scientist View (/views/maskedcards.csv)
Not a physical data copy
Name City State
Dave San Jose CA
John Boulder CO
Business Analyst View
Owner
Admins
Permission
Business
Analysts
Owner
Admins
Permission
Data
Scientists
© 2016 MapR Technologies© 2016 MapR Technologies@tgrall 38
Extensibility
© 2016 MapR Technologies© 2016 MapR Technologies@tgrall 39
Extending Apache Drill
• File Format
• ORC
• …
• Data Sources
• NoSQL Databases
• Search Engines
• REST
• ….
• Custom Functions
https://github.com/mapr-demos/simple-drill-functions
© 2016 MapR Technologies© 2016 MapR Technologies@tgrall 40
Recommendations for Getting Started with Drill
New to Drill?
– Get started with Free MapR On Demand training
– Test Drive Drill on cloud with AWS
– Learn how to use Drill with Hadoop using MapR sandbox
– Workshop : https://github.com/tgrall/drill-workshop
Ready to play with your data?
– Try out Apache Drill in 10 mins guide on your desktop
– Download Drill for your cluster and start exploration
– Comprehensive tutorials and documentation available
Ask questions
– user@drill.apache.org
– dev@drill.apache.com
© 2016 MapR Technologies© 2016 MapR Technologies@tgrall 41
© 2016 MapR Technologies© 2016 MapR Technologies@tgrall 42
Q&A
@tgrall maprtech
tug@mapr.com
Engage with us!
MapR
maprtech
mapr-technologies

SQL-on-Hadoop with Apache Drill

  • 1.
    © 2016 MapRTechnologies© 2016 MapR TechnologiesMapR Confidential © 2016 MapR Technologies 1 SQL-on-Hadoop with
  • 2.
    © 2016 MapRTechnologies© 2016 MapR Technologies@tgrall 3 Ingest Store Process Consume
  • 3.
    © 2016 MapRTechnologies© 2016 MapR Technologies@tgrall 4 MapR Converged Data Platform
  • 4.
    © 2016 MapRTechnologies© 2016 MapR Technologies@tgrall 6 Agenda • Why Drill? • Some Drilling with Open Data • How does it work?
  • 5.
    © 2016 MapRTechnologies© 2016 MapR Technologies@tgrall 7 1980 2000 20101990 2020 Fixed schema DBA controls structure Dynamic / Flexible schema Application controls structure NON-RELATIONAL DATASTORESRELATIONAL DATABASES GBs-TBs TBs-PBsVolume Database Data Increasingly Stored in Non-Relational Datastores Structure Development Structured Structured, semi-structured and unstructured Planned (release cycle = months-years) Iterative (release cycle = days-weeks)
  • 6.
    © 2016 MapRTechnologies© 2016 MapR Technologies@tgrall 8 How To Bring SQL to Non-Relational Data Stores? Familiarity of SQL Agility of NoSQL • ANSI SQL semantics • Low latency • Integrated with Tools/Applications • No schema management – HDFS (Parquet, JSON, etc.) – HBase – … • No transformation – No silos of data • Ease of use
  • 7.
    © 2016 MapRTechnologies© 2016 MapR Technologies@tgrall 9 Drill Supports Schema Discovery On-The-Fly • Fixed schema • Leverage schema in centralized repository (Hive Metastore) • Fixed schema, evolving schema or schema-less • Leverage schema in centralized repository or self-describing data 2Schema Discovered On-The-FlySchema Declared In Advance SCHEMA ON WRITE SCHEMA BEFORE READ SCHEMA ON THE FLY
  • 8.
    © 2016 MapRTechnologies© 2016 MapR Technologies@tgrall 10 • Pioneering Data Agility for Hadoop • Apache open source project • Scale-out execution engine for low-latency queries • Unified SQL-based API for analytics & operational applications 50+ contributors 150+ years of experience building databases and distributed systems Contributing to Apache Drill
  • 9.
    © 2016 MapRTechnologies© 2016 MapR Technologies@tgrall 11 - Sub-directory - HBase namespace - Hive database - Database Drill Enables ‘SQL-on-Everything’ SELECT * FROM dfs.yelp.`business.json` Workspace - Pathnames - Hive table - HBase table - Table Table - DFS (Text, Parquet, JSON, XML) - HBase/MapR-DB - Hive Metastore/HCatalog - RDBMS using JDBC - Easy API to go beyond Hadoop Storage plugin instance
  • 10.
    © 2016 MapRTechnologies© 2016 MapR Technologies@tgrall 12 Drill’s Data Model is Flexible JSON BSON HBase Parquet Avro CSV TSV Dynamic schema Fixed schema Complex Flat Flexibility Name Gender Age Michael M 6 Jennifer F 3 RDBMS/SQL-on-Hadoop table Flexibility Apache Drill table { name: { first: Michael, last: Smith }, hobbies: [ski, soccer], district: Los Altos } { name: { first: Jennifer, last: Gates }, hobbies: [sing], preschool: CCLC }
  • 11.
    © 2016 MapRTechnologies© 2016 MapR Technologies@tgrall 13 Drilling into Data
  • 12.
    © 2016 MapRTechnologies© 2016 MapR Technologies@tgrall 14 Business dataset { "business_id": "4bEjOyTaDG24SY5TxsaUNQ", "full_address": "3655 Las Vegas Blvd SnThe StripnLas Vegas, NV 89109", "hours": { "Monday": {"close": "23:00", "open": "07:00"}, "Tuesday": {"close": "23:00", "open": "07:00"}, "Friday": {"close": "00:00", "open": "07:00"}, "Wednesday": {"close": "23:00", "open": "07:00"}, "Thursday": {"close": "23:00", "open": "07:00"}, "Sunday": {"close": "23:00", "open": "07:00"}, "Saturday": {"close": "00:00", "open": "07:00"} }, "open": true, "categories": ["Breakfast & Brunch", "Steakhouses", "French", "Restaurants"], "city": "Las Vegas", "review_count": 4084, "name": "Mon Ami Gabi", "neighborhoods": ["The Strip"], "longitude": -115.172588519464, "state": "NV", "stars": 4.0, "attributes": { "Alcohol": "full_bar”, "Noise Level": "average", "Has TV": false, "Attire": "casual", "Ambience": { "romantic": true, "intimate": false, "touristy": false, "hipster": false, "classy": true, "trendy": false, "casual": false }, "Good For": {"dessert": false, "latenight": false, "lunch": false, "dinner": true, "breakfast": false, "brunch": false}, } }
  • 13.
    © 2016 MapRTechnologies© 2016 MapR Technologies@tgrall 15 Reviews dataset { "votes": {"funny": 0, "useful": 2, "cool": 1}, "user_id": "Xqd0DzHaiyRqVH3WRG7hzg", "review_id": "15SdjuK7DmYqUAj6rjGowg", "stars": 5, "date": "2007-05-17", "text": "dr. goldberg offers everything ...", "type": "review", "business_id": "vcNAWiLM4dR7D2nwwJ7nCA" }
  • 14.
    © 2016 MapRTechnologies© 2016 MapR Technologies@tgrall 16 Interfaces and Tools WebUIDrill Explorer ODBC/JDBC Sqlline
  • 15.
    © 2016 MapRTechnologies© 2016 MapR Technologies@tgrall 17 ` $ tar -xvzf apache-drill-1.4.0.tar.gz $ bin/sqlline -u jdbc:drill:zk=local $ bin/drill-embedded > SELECT state, city, count(*) AS businesses FROM dfs.yelp.`business.json.gz` GROUP BY state, city ORDER BY businesses DESC LIMIT 10; +------------+------------+-------------+ | state | city | businesses | +------------+------------+-------------+ | NV | Las Vegas | 12021 | | AZ | Phoenix | 7499 | | AZ | Scottsdale | 3605 | | EDH | Edinburgh | 2804 | | AZ | Mesa | 2041 | | AZ | Tempe | 2025 | | NV | Henderson | 1914 | | AZ | Chandler | 1637 | | WI | Madison | 1630 | | AZ | Glendale | 1196 | +------------+------------+-------------+
  • 16.
    © 2016 MapRTechnologies© 2016 MapR Technologies@tgrall 18 Intuitive SQL Access to Complex Data // It’s Friday 10pm in Vegas and looking for Hummus > SELECT name, stars, b.hours.Friday friday, categories FROM dfs.yelp.`business.json` b WHERE b.hours.Friday.`open` < '22:00' AND b.hours.Friday.`close` > '22:00' AND REPEATED_CONTAINS(categories, 'Mediterranean') AND city = 'Las Vegas' ORDER BY stars DESC LIMIT 2; +------------------------------------+--------+-----------------------------------+-----------------------------------------------------------+ | name | stars | friday | categories | +------------------------------------+--------+-----------------------------------+-----------------------------------------------------------+ | Khoury's Mediterranean Restaurant | 4.0 | {"close":"23:00","open":"11:00"} | ["Greek","Mediterranean","Middle Eastern","Restaurants"] | | Olives | 4.0 | {"close":"22:30","open":"11:00"} | ["Bars","Mediterranean","Nightlife","Restaurants"] | +------------------------------------+--------+-----------------------------------+-----------------------------------------------------------+
  • 17.
    © 2016 MapRTechnologies© 2016 MapR Technologies@tgrall 19 ANSI SQL Compatibility //Get top cool rated businesses  SELECT b.name from dfs.yelp.`business.json` b WHERE b.business_id IN (SELECT r.business_id FROM dfs.yelp.`review.json` r GROUP BY r.business_id HAVING SUM(r.votes.cool) > 2000 ORDER BY SUM(r.votes.cool) DESC); +--------------------------------+ | name | +--------------------------------+ | Earl of Sandwich | | XS Nightclub | | The Cosmopolitan of Las Vegas | | Wicked Spoon | | Bacchanal Buffet | +--------------------------------+
  • 18.
    © 2016 MapRTechnologies© 2016 MapR Technologies@tgrall 20 Logical Views //Create a view combining business and reviews datasets > CREATE OR REPLACE VIEW dfs.tmp.BusinessReviews AS SELECT b.name, b.stars, r.votes.funny, r.votes.useful, r.votes.cool, r.`date` FROM dfs.yelp.`business.json` b, dfs.yelp.`review.json` r WHERE r.business_id = b.business_id; +-------+-------------------------------------------------------------------+ | ok | summary | +-------+-------------------------------------------------------------------+ | true | View 'BusinessReviews' replaced successfully in 'dfs.tmp' schema | +-------+-------------------------------------------------------------------+ > SELECT COUNT(*) AS Total FROM dfs.tmp.BusinessReviews; +------------+ | Total | +------------+ | 1125458 | +------------+
  • 19.
    © 2016 MapRTechnologies© 2016 MapR Technologies@tgrall 21 Materialized Views AKA Tables > ALTER SESSION SET `store.format` = 'parquet'; > CREATE TABLE dfs.yelp.BusinessReviewsTbl AS SELECT b.name, b.stars, r.votes.funny funny, r.votes.useful useful, r.votes.cool cool, r.`date` FROM dfs.yelp.`business.json` b, dfs.yelp.`review.json` r WHERE r.business_id = b.business_id; +------------+---------------------------+ | Fragment | Number of records written | +------------+---------------------------+ | 1_0 | 176448 | | 1_1 | 192439 | | 1_2 | 198625 | | 1_3 | 200863 | | 1_4 | 181420 | | 1_5 | 175663 | +------------+---------------------------+
  • 20.
    © 2016 MapRTechnologies© 2016 MapR Technologies@tgrall 22 Repeated Values Support // Flatten repeated categories > SELECT name, categories FROM dfs.yelp.`business.json` LIMIT 3; +---------------------------+-------------------------------------+ | name | categories | +---------------------------+-------------------------------------+ | Eric Goldberg, MD | ["Doctors","Health & Medical"] | | Clancy's Pub | ["Nightlife"] | | Cool Springs Golf Center | ["Active Life","Mini Golf","Golf"] | +—————————————+-------------------------------------+ > SELECT name, FLATTEN(categories) AS categories FROM dfs.yelp.`business.json` LIMIT 5; +---------------------------+-------------------+ | name | categories | +---------------------------+-------------------+ | Eric Goldberg, MD | Doctors | | Eric Goldberg, MD | Health & Medical | | Clancy's Pub | Nightlife | | Cool Springs Golf Center | Active Life | | Cool Springs Golf Center | Mini Golf | +---------------------------+-------------------+
  • 21.
    © 2016 MapRTechnologies© 2016 MapR Technologies@tgrall 23 Extensions to ANSI SQL to work with repeated values // Get most common business categories >SELECT category, count(*) AS categorycount FROM (SELECT name, FLATTEN(categories) AS category FROM dfs.yelp.`business.json`) c GROUP BY category ORDER BY categorycount DESC; +-------------------------------+----------------+ | category | categorycount | +-------------------------------+----------------+ | Restaurants | 21892 | | Shopping | 8919 | | Automotive | 2965 | … … … … … … … … … … … … … … … … … … … … … … … … … … … | Oriental | 1 | | High Fidelity Audio Equipment | 1 | +--------------------------------+---------------+
  • 22.
    © 2016 MapRTechnologies© 2016 MapR Technologies@tgrall 24 Checkins dataset { "checkin_info":{ "3-4":1, "13-5":1, "6-6":1, "14-5":1, "14-6":1, "14-2":1, "14-3":1, "19-0":1, "11-5":1, "13-2":1, "11-6":2, "11-3":1, "12-6":1, "6-5":1, "5-5":1, "9-2":1, "9-5":1, "9-6":1, "5-2":1, "7-6":1, "7-5":1, "7-4":1, "17-5":1, "8-5":1, "10-2":1, "10-5":1, "10-6":1 }, "type":"checkin", "business_id":"JwUE5GmEO-sH1FuwJgKBlQ" }
  • 23.
    © 2016 MapRTechnologies© 2016 MapR Technologies@tgrall 25 Supports Dynamic / Unknown Columns > SELECT KVGEN(checkin_info) checkins FROM dfs.yelp.`checkin.json` LIMIT 1; +------------+ | checkins | +------------+ | [{"key":"3-4","value":1},{"key":"13-5","value":1},{"key":"6-6","value":1},{"key":"14- 5","value":1},{"key":"14-6","value":1},{"key":"14-2","value":1},{"key":"14-3","value":1},{"key":"19- 0","value":1},{"key":"11-5","value":1},{"key":"13-2","value":1},{"key":"11-6","value":2},{"key":"11- 3","value":1},{"key":"12-6","value":1},{"key":"6-5","value":1},{"key":"5-5","value":1},{"key":"9- 2","value":1},{"key":"9-5","value":1},{"key":"9-6","value":1},{"key":"5-2","value":1},{"key":"7- 6","value":1},{"key":"7-5","value":1},{"key":"7-4","value":1},{"key":"17-5","value":1},{"key":"8- 5","value":1},{"key":"10-2","value":1},{"key":"10-5","value":1},{"key":"10-6","value":1}] | +------------+ > SELECT FLATTEN(KVGEN(checkin_info)) checkins FROM dfs.yelp.`checkin.json` limit 6; +---------------------------+ | checkins | +---------------------------+ | {"key":"9-5","value":1} | | {"key":"7-5","value":1} | | {"key":"13-3","value":1} | | {"key":"17-6","value":1} | | {"key":"13-0","value":1} | | {"key":"17-3","value":1} | +---------------------------+
  • 24.
    © 2016 MapRTechnologies© 2016 MapR Technologies@tgrall 26 Makes it easy to work with dynamic/unknown columns // Count total number of checkins on Sunday midnight > SELECT SUM(checkintbl.checkins.`value`) as SundayMidnightCheckins FROM (SELECT FLATTEN(KVGEN(checkin_info)) checkins FROM dfs.yelp.checkin.json`) checkintbl WHERE checkintbl.checkins.key='23-0'; +------------------------+ | SundayMidnightCheckins | +------------------------+ | 8575 | +------------------------+
  • 25.
    © 2016 MapRTechnologies© 2016 MapR Technologies@tgrall 27 Federated Queries // Join JSON File, Parquet and MongoDB collection > SELECT u.name, b.category, count(1) nb_review FROM mongo.yelp.`user` u , dfs.yelp.`review.parquet` r, (select business_id, flatten(categories) category from dfs.yelp.`business.json` ) b WHERE u.user_id = r.user_id AND b.business_id = r.business_id GROUP BY u.user_id, u.name, b.category ORDER BY nb_review DESC LIMIT 10; +-----------+--------------+------------+ | name | category | nb_review | +-----------+--------------+------------+ | Rand | Restaurants | 1086 | | J | Restaurants | 661 | | Jennifer | Restaurants | 657 | | Brad | Restaurants | 638 | | Jade | Restaurants | 586 | … … … … … … … … … .. .. .. | Emily | Restaurants | 560 | | Norm | Restaurants | 543 | | Aileen | Restaurants | 499 | | Michael | Restaurants | 496 | +-----------+--------------+------------+
  • 26.
    © 2016 MapRTechnologies© 2016 MapR Technologies@tgrall 28 Drill Data Sources JDBC
  • 27.
    © 2016 MapRTechnologies© 2016 MapR Technologies@tgrall 29 Directories are implicit partitions select dir0, count(1) from dfs.data.`/logs/` where dir1 in (1,2,3) group by dir0 logs ├── 2014 │ ├── 1 │ ├── 2 │ ├── 3 │ └── 4 └── 2015 └── 1
  • 28.
    © 2016 MapRTechnologies© 2016 MapR Technologies@tgrall 30 How does it work?
  • 29.
    © 2016 MapRTechnologies© 2016 MapR Technologies@tgrall 31 Everything is a “Drillbit” • “Drillbits” run on each node • In-memory columnar execution • Coordination, Planning, Execution • Networked or not • Exposes JDBC, ODBC, REST • Built In Web UI and CLI • Extensible • Custom Functions • Data Sources Drillbit
  • 30.
    © 2016 MapRTechnologies© 2016 MapR Technologies@tgrall 32 Data Locality HBase HBase Mac HDFS HDFS HDFS HDFS HDFS HDFS mongod mongod HBase Windows Clusters DesktopClusters HDFS & HBase cluster HDFS cluster MongoDB cluster
  • 31.
    © 2016 MapRTechnologies© 2016 MapR Technologies@tgrall 33 Run drillbits close to your data! Drillbit HBase HBase Mac HDFS HDFS HDFS HDFS HDFS HDFS mongod mongod HBase Windows Clusters DesktopClusters HDFS & HBase cluster HDFS cluster MongoDB cluster Drillbit Drillbit Drillbit Drillbit Drillbit Drillbit Drillbit Drillbit Drillbit
  • 32.
    © 2016 MapRTechnologies© 2016 MapR Technologies@tgrall 34 Query Execution Client Drillbit Drillbit Drillbit • Client connects to “a” Drillbit • This Drillbit becomes the foreman • Foreman generates execution plan • cost base query optimisation
  • 33.
    © 2016 MapRTechnologies© 2016 MapR Technologies@tgrall 35 Query Execution Client Drillbit Drillbit Drillbit • Execution fragment are farmed to other Drillbits • Drillbits exchange data when necessary
  • 34.
    © 2016 MapRTechnologies© 2016 MapR Technologies@tgrall 36 Query Execution Client Drillbit Drillbit Drillbit • Results are returned to the user through foreman
  • 35.
    © 2016 MapRTechnologies© 2016 MapR Technologies@tgrall 37 Granular Security via Drill Views Name City State Credit Card # Dave San Jose CA 1374-7914-3865-4817 John Boulder CO 1374-9735-1794-9711 Raw File (/raw/cards.csv) Owner Admins Permission Admins Business Analyst Data Scientist Name City State Credit Card # Dave San Jose CA 1374-1111-1111-1111 John Boulder CO 1374-1111-1111-1111 Data Scientist View (/views/maskedcards.csv) Not a physical data copy Name City State Dave San Jose CA John Boulder CO Business Analyst View Owner Admins Permission Business Analysts Owner Admins Permission Data Scientists
  • 36.
    © 2016 MapRTechnologies© 2016 MapR Technologies@tgrall 38 Extensibility
  • 37.
    © 2016 MapRTechnologies© 2016 MapR Technologies@tgrall 39 Extending Apache Drill • File Format • ORC • … • Data Sources • NoSQL Databases • Search Engines • REST • …. • Custom Functions https://github.com/mapr-demos/simple-drill-functions
  • 38.
    © 2016 MapRTechnologies© 2016 MapR Technologies@tgrall 40 Recommendations for Getting Started with Drill New to Drill? – Get started with Free MapR On Demand training – Test Drive Drill on cloud with AWS – Learn how to use Drill with Hadoop using MapR sandbox – Workshop : https://github.com/tgrall/drill-workshop Ready to play with your data? – Try out Apache Drill in 10 mins guide on your desktop – Download Drill for your cluster and start exploration – Comprehensive tutorials and documentation available Ask questions – user@drill.apache.org – dev@drill.apache.com
  • 39.
    © 2016 MapRTechnologies© 2016 MapR Technologies@tgrall 41
  • 40.
    © 2016 MapRTechnologies© 2016 MapR Technologies@tgrall 42 Q&A @tgrall maprtech tug@mapr.com Engage with us! MapR maprtech mapr-technologies

Editor's Notes

  • #2 Change copyright to 2016
  • #6 The power of MapR begins with the power of open source innovation and community participation. In some cases MapR leads the community in projects like Apache Mahout (machine learning) or Apache Drill (SQL on Hadoop) In other areas, MapR contributes, integrates Apache and other open source software (OSS) projects into the MapR distribution, delivering a more reliable and performant system with lower overall TCO and easier system management. MapR releases a new version with the latest OSS innovations on a monthly basis. We add 2-4 new Apache projects annually as new projects become production ready and based on customer demand.
  • #8 The database/datastore landscape is evolving to meet the new requirements. 2009 was the inflection point. NoSchema systems in which applications control structure. Developers are being empowered and they are voting for the agility offered by these systems. In the early days if this revolution we sacrificed the query language, and we eliminated the ability to leverage the knowledge and tools available to millions of people. We’re changing that by a distributed SQL engine. But when we do that, we have to keep in mind that this transition to a NoSchema world happened for a reason, and we don’t want to reintroduce the centralized, DBA-managed schema.
  • #9 The database/datastore landscape is evolving to meet the new requirements. 2009 was the inflection point. NoSchema systems in which applications control structure. Developers are being empowered and they are voting for the agility offered by these systems. In the early days if this revolution we sacrificed the query language, and we eliminated the ability to leverage the knowledge and tools available to millions of people. We’re changing that by a distributed SQL engine. But when we do that, we have to keep in mind that this transition to a NoSchema world happened for a reason, and we don’t want to reintroduce the centralized, DBA-managed schema.
  • #13 All SQL engines (traditional or SQL-on-Hadoop) view tables as spreadsheet-like data structures with rows and columns. All records have the same structure, and there is no support for nested data or repeating fields. Drill views tables conceptually as collections of JSON (with additional types) documents. Each record can have a different structure (hence, schema-less). This is revolutionary and has never been done before. If you consider the four data models shown in the 2x2, all models can be represented by the complex, no schema model (JSON) because it is the most flexible. However, no other data model can be represented by the flat, fixed schema model. Therefore, when using any SQL engine except Drill, the data has to be transformed before it can be available to queries.
  • #15 CSV vs JSON formatting
  • #16 CSV vs JSON formatting
  • #19 It’s 10pm in Vegas and I Want Good Hummus!
  • #25 CSV vs JSON formatting
  • #35 User Connect to the Drill bit That drillbit becomes the FOREMAN Foreman generate execution plan cost based query optimisation and locality
  • #36 Execution fragment are farmed to other drillbits Drillbits exange fata as necessary
  • #37 Results are returned to the user through foreman
  • #43 Follow me on twitter to get updates for documentation on the Zeta Architecture