- 1. Analyzing Real-World Data with Apache Drill
Tomer Shiran
VP Product Management, MapR Technologies
Co-Founder, PMC Member and Committer, Apache Drill
November 20, 2014
® © 2014 MapR Technologies 1
- 2. Data is doubling in size every two years
- 3. IDC estimates that in 2020, there will be 44 zettabytes of data in the world
Year | Data volume
2011 | 1.8 zettabytes
2013 | 4.4 zettabytes
2020 | 44 zettabytes
Source: IDC Digital Universe
- 4. Unstructured data will account for more than 80% of the data collected by organizations
[Chart: total data stored, 1980 to 2020, structured data vs. unstructured data]
Source: Human-Computer Interaction & Knowledge Discovery in Complex Unstructured, Big Data
- 5. NoSchema Datastores are Capturing this Data
 | RELATIONAL DATABASES | “NOSCHEMA” DATASTORES
Volume | MBs-GBs | TBs-PBs
Structure | Structured | Structured, semi-structured and unstructured
Development | Planned (release cycle = months-years) | Iterative (release cycle = days-weeks)
Database | Fixed schema; DBA controls structure | Dynamic schema (schema-free); application controls structure
[Timeline: 1980 1990 2000 2010 2020]
- 6. SQL in the Big Data World
WANT
• SQL
• BI (Tableau, MicroStrategy, etc.)
• Low latency
• Scalability
DON’T WANT
• Create and maintain schemas on:
– HDFS (Parquet, JSON, etc.)
– HBase
– MongoDB
• Transform or copy data
We want SQL and BI support without compromising the flexibility and agility of NoSchema datastores
- 7. APACHE DRILL
• Schema-free scale-out query engine for Hadoop and NoSQL
• Point-and-query vs. schema-first
• Low latency
• Extreme ease of use
• Industry-standard APIs: ANSI SQL, ODBC/JDBC, RESTful APIs
40+ contributors
150+ years of experience building databases and distributed systems
- 8. Evolution Towards Self-Service Data Exploration
Approach | Data Modeling and Transformation | Data Visualization
Traditional BI w/ RDBMS | IT-driven | IT-driven
Self-Service BI w/ RDBMS | IT-driven | Self-service
SQL-on-Hadoop | IT-driven | Self-service
Self-Service Data Exploration (zero-day analytics) | Not needed | Self-service
- 10. Drill’s Data Model is Flexible
[Diagram: two axes of flexibility, fixed schema to schema-less and flat to complex. Data sources shown: HBase, JSON, BSON, CSV, TSV, Parquet, Avro]
RDBMS/SQL-on-Hadoop table:
Name | Gender | Age
Michael | M | 6
Jennifer | F | 3
Apache Drill table:
{
  name: {
    first: Michael,
    last: Smith
  },
  hobbies: [ski, soccer],
  district: Los Altos
}
{
  name: {
    first: Jennifer,
    last: Gates
  },
  hobbies: [sing],
  preschool: CCLC
}
- 11. Drill Supports Schema Discovery On-The-Fly
Schema Declared In Advance (SCHEMA ON WRITE, SCHEMA BEFORE READ)
• Fixed schema
• Leverage schema in centralized repository (Hive Metastore)
Schema Discovered On-The-Fly (SCHEMA ON THE FLY)
• Fixed schema, evolving schema or schema-less
• Leverage schema in centralized repository or self-describing data
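To make "schema discovered on the fly" concrete, here is a minimal Python sketch that infers field paths and types from self-describing records while reading them. It only illustrates the idea and is not how Drill is implemented; `discover_schema` is a hypothetical helper, and the two records are modeled on the Michael/Jennifer example from the data-model slide.

```python
import json

def discover_schema(records):
    """Return a mapping of dotted field path -> set of observed type names."""
    schema = {}

    def walk(value, path):
        if isinstance(value, dict):
            # Descend into nested maps, extending the dotted path.
            for key, child in value.items():
                walk(child, f"{path}.{key}" if path else key)
        else:
            schema.setdefault(path, set()).add(type(value).__name__)

    for record in records:
        walk(record, "")
    return schema

# Two self-describing records with different fields; no schema is declared.
raw = [
    '{"name": {"first": "Michael", "last": "Smith"}, "hobbies": ["ski", "soccer"], "district": "Los Altos"}',
    '{"name": {"first": "Jennifer", "last": "Gates"}, "hobbies": ["sing"], "preschool": "CCLC"}',
]
schema = discover_schema(json.loads(line) for line in raw)
for path in sorted(schema):
    print(path, sorted(schema[path]))
```

Neither record had to be registered in a metastore first: fields that appear in only one record (district, preschool) simply show up in the discovered schema.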
- 12. Native JSON
JSON query with Drill:
SELECT po_document.AllowPartialShipment
FROM j_purchaseorder;
JSON query with Oracle:
SELECT json_value(po_document, '$.AllowPartialShipment' RETURNING NUMBER)
FROM j_purchaseorder;
Relational databases cannot provide true schema-free JSON support.
- 13. Architecture
- 14. High Level Architecture
• Cluster of commodity servers
– Daemon (drillbit) on each node
• No dependency on other execution engines (MapReduce, Spark, Tez)
– Better performance and manageability
• ZooKeeper maintains ephemeral cluster membership information
– drillbit uses ZooKeeper to find other drillbits in the cluster
– Client uses ZooKeeper to find drillbits
• Data processing unit is columnar record batches
– Enables schema flexibility with negligible performance impact
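As a rough illustration of why batches of columnar records leave room for schema flexibility, the Python sketch below pivots rows into column-oriented batches, each carrying its own schema. The layout and the names `to_record_batch` and `batches` are assumptions made for this example; the slides do not describe Drill's actual in-memory format.

```python
def to_record_batch(rows):
    """Pivot a list of row dicts into one columnar batch with its own schema."""
    schema = sorted({key for row in rows for key in row})
    # One list per column; rows lacking a field contribute None.
    columns = {name: [row.get(name) for row in rows] for name in schema}
    return {"schema": schema, "columns": columns}

def batches(rows, batch_size):
    """Yield columnar batches; the schema may differ from batch to batch."""
    for start in range(0, len(rows), batch_size):
        yield to_record_batch(rows[start:start + batch_size])

rows = [
    {"name": "Michael", "district": "Los Altos"},
    {"name": "Jennifer", "preschool": "CCLC"},
    {"name": "Lee", "review_count": 6},
]
result = list(batches(rows, batch_size=2))
for batch in result:
    print(batch["schema"])
```

Because the schema travels with each batch, a downstream operator in this sketch only has to react when a batch's schema differs from the previous one.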
- 15. Drill Maximizes Data Locality
[Diagram: a ZooKeeper ensemble and a cluster of nodes, each running a drillbit next to a DataNode/RegionServer/mongod]
Data Source | Best Practice
HDFS or MapR-FS | drillbit on each DataNode
HBase or MapR-DB | drillbit on each RegionServer
MongoDB | drillbit on each mongod node (when using replicas, run it on the replica node)
- 16. SELECT* Query Execution
[Diagram: client (JDBC, ODBC, REST), ZooKeeper ensemble, drillbits]
1. Find drillbits (once per session)
2. Submit query to drillbit
3. Create logical and physical execution plans
4. Farm out execution of fragments to cluster (completely distributed execution)
5. Return results to client
* CTAS (CREATE TABLE AS SELECT) queries include steps 1-4
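For the REST client path, step 2 (submit query to a drillbit) might look like the Python sketch below. It assumes the `POST /query.json` endpoint and the `queryType`/`query` body fields documented for later Drill releases; whether the 0.7.0 build used in this talk exposes them is not stated in the slides. The request is only constructed here, not sent, and `build_query_request` is a hypothetical helper.

```python
import json
import urllib.request

def build_query_request(sql, host="localhost", port=8047):
    """Build (but do not send) an HTTP request that submits a SQL query."""
    # Assumed body shape: {"queryType": "SQL", "query": "<sql text>"}.
    body = json.dumps({"queryType": "SQL", "query": sql}).encode("utf-8")
    return urllib.request.Request(
        url=f"http://{host}:{port}/query.json",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

request = build_query_request("SELECT * FROM dfs.demo.`yelp/review.json` LIMIT 1")
print(request.get_method(), request.full_url)

# Against a running drillbit one would then call, for example:
#     with urllib.request.urlopen(request) as response:
#         result = json.load(response)
```

A REST client talks to one drillbit's web port directly, whereas the JDBC/ODBC clients in the diagram first locate drillbits through ZooKeeper.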
- 17. Core Modules within drillbit
[Diagram of modules: RPC Endpoint, SQL Parser, Logical Plan, Optimizer, Physical Plan, Execution, Distributed Cache, Storage Plugins (DFS, Hive, HBase, MongoDB)]
- 19. Demo Plan
1. Run Drill
2. Configure DFS and MongoDB storage plugins
3. Explore the data
– Basics
– Complex data
– Views
- 20. Run Drill
- 21. Run Drill in Embedded Mode (sqlline)
$ tar xf apache-drill-0.7.0.tar.gz
$ cd apache-drill-0.7.0
$ bin/sqlline -u jdbc:drill:zk=local
You can now access the Web UI: http://localhost:8047
> SELECT * FROM dfs.root.`/Users/tshiran/Development/demo/data/yelp/user.json` LIMIT 1;
+---------------+---------------------------------+--------------+------+------------------------+
| yelping_since | votes                           | review_count | name | user_id                |
+---------------+---------------------------------+--------------+------+------------------------+
| 2012-02       | {"funny":1,"useful":5,"cool":0} | 6            | Lee  | qtrmBGNqCvupHMHL_bKFgQ |
+---------------+---------------------------------+--------------+------+------------------------+
• drillbit (Drill daemon) starts automatically in embedded mode
• No ZooKeeper in embedded mode (hence zk=local)
• Can’t use BI clients (JDBC/ODBC) in embedded mode
- 22. Or Run Drill in Distributed Mode…
• Make sure ZooKeeper (zkServer) is running:
$ zkServer start
• Define the Drill cluster name and ZooKeeper nodes in conf/drill-override.conf
• Start drillbit:
$ bin/drillbit.sh start
• Access the Web UI: http://localhost:8047
• Connect a client to the cluster (e.g., sqlline):
$ bin/sqlline -u jdbc:drill:zk=localhost:2181
• Clients (like sqlline) connect to ZooKeeper to discover the cluster nodes
• If you have multiple Drill clusters registered in one ZooKeeper ensemble, specify the desired cluster in the JDBC connection string: jdbc:drill:zk=localhost:2181/drill/<clustername>
• Not sure if ZooKeeper is running? Run telnet localhost 2181 and make sure it connects
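The two connection details above (the telnet-style ZooKeeper check and the JDBC connection string) can be sketched in Python as follows. The helper names are hypothetical, `demo_cluster` is a made-up cluster name, and joining several ZooKeeper hosts with commas is an assumption that the slides do not show.

```python
import socket

def zookeeper_reachable(host="localhost", port=2181, timeout=2.0):
    """Return True if a TCP connection succeeds, like `telnet host port`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def drill_jdbc_url(zk_hosts, cluster_name=None):
    """Build jdbc:drill:zk=<hosts>, appending /drill/<clustername> if given."""
    url = "jdbc:drill:zk=" + ",".join(zk_hosts)  # comma-joining is an assumption
    if cluster_name:
        url += f"/drill/{cluster_name}"
    return url

print(drill_jdbc_url(["localhost:2181"]))
print(drill_jdbc_url(["localhost:2181"], cluster_name="demo_cluster"))
print("ZooKeeper reachable:", zookeeper_reachable())
```

The check only confirms that something accepts connections on the port; it does not verify that the listener is ZooKeeper or that a drillbit has registered with it.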
- 23. Configure Storage Plugins
- 24. Enable MongoDB Storage Plugin
- 26. Explore the Data: Basics
- 27. Inventory: DFS Files
{
  "votes": {"funny": 0, "useful": 2, "cool": 1},
  "user_id": "Xqd0DzHaiyRqVH3WRG7hzg",
  "review_id": "15SdjuK7DmYqUAj6rjGowg",
  "stars": 5,
  "date": "2007-05-17",
  "text": "dr. goldberg offers everything ...",
  "type": "review",
  "business_id": "vcNAWiLM4dR7D2nwwJ7nCA"
}
- 28. Inventory: MongoDB Collections
$ mongo
MongoDB shell version: 2.6.5
> show databases;
admin  (empty)
local  0.078GB
yelp   0.453GB
> use yelp
> db.users.findOne()
{
  "_id" : ObjectId("54566cdf3237149de181a92a"),
  "yelping_since" : "2012-02",
  "votes" : {
    "funny" : 1,
    "useful" : 5,
    "cool" : 0
  },
  "review_count" : 6,
  "name" : "Lee",
  "user_id" : "qtrmBGNqCvupHMHL_bKFgQ",
  "friends" : [ ]
}
- 29. Let’s Go!
> SELECT * FROM dfs.root.`/Users/tshiran/Development/demo/data/yelp/review.json`
  WHERE stars = 1 LIMIT 1;
+-------+---------+-----------+-------+------+------+------+-------------+
| votes | user_id | review_id | stars | date | text | type | business_id |
+-------+---------+-----------+-------+------+------+------+-------------+
| {"funny":0,"useful":0,"cool":0} | Qrs3EICADUKNFoUq2iHStA | _ePLBPrkrf4bhyiKWEn4Qg | 1 | 2013-04-19 | I don't know what Dr. Goldberg was like before moving to Arizona, but let me tell you, STAY AWAY from this doctor and this office. | review | vcNAWiLM4dR7D2nwwJ7nCA |
+-------+---------+-----------+-------+------+------+------+-------------+
- 30. Using Storage Plugins and Workspaces
Parts of a table reference: storage plugin . workspace . path relative to workspace
> SELECT * FROM dfs.root.`/Users/tshiran/Development/demo/data/yelp/review.json` LIMIT 1;
> SELECT * FROM dfs.demo.`yelp/review.json` LIMIT 1;
> SELECT * FROM mongo.yelp.users LIMIT 1;
> USE mongo.yelp;
> SELECT * FROM users LIMIT 1;
Storage Plugin | Workspace | Table
dfs | Path | Path relative to workspace
mongo | Database | Collection
hive | Database | Table
hbase | Namespace | Table
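The relationship between a dfs workspace and the file path can be sketched in Python. The `WORKSPACES` table below is hypothetical: the location of the `demo` workspace is inferred from the two review.json queries on this slide that refer to the same file, and `resolve` is an illustrative helper, not a Drill API.

```python
import posixpath

# Hypothetical workspace definitions; "demo" is inferred from the slide.
WORKSPACES = {
    "root": "/",
    "demo": "/Users/tshiran/Development/demo/data",
}

def resolve(table_ref):
    """Map 'dfs.<workspace>.`<path>`' to an absolute file path."""
    # Split at the first two dots only, so dots inside the path survive.
    plugin, workspace, quoted_path = table_ref.split(".", 2)
    if plugin != "dfs":
        raise ValueError("this sketch only handles the dfs storage plugin")
    path = quoted_path.strip("`")
    # Treat the path as relative to the workspace location.
    return posixpath.join(WORKSPACES[workspace], path.lstrip("/"))

full = resolve("dfs.root.`/Users/tshiran/Development/demo/data/yelp/review.json`")
short = resolve("dfs.demo.`yelp/review.json`")
print(full)
print(short)
```

Both references resolve to the same file, which is why the second query on the slide can use the shorter workspace-relative form.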
- 31. Most Common User Names (MongoDB)
> SELECT name, count(*) AS users
  FROM mongo.yelp.users
  GROUP BY name
  ORDER BY users DESC
  LIMIT 10;
+----------+-------+
| name     | users |
+----------+-------+
| David    | 2453  |
| John     | 2378  |
| Michael  | 2322  |
| Chris    | 2202  |
| Mike     | 2037  |
| Jennifer | 1867  |
| Jessica  | 1463  |
| Jason    | 1457  |
| Michelle | 1439  |
| Brian    | 1436  |
+----------+-------+
- 32. Cities with the Most Businesses
> SELECT state, city, count(*) AS businesses
  FROM dfs.demo.`/yelp/business.json`
  GROUP BY state, city
  ORDER BY businesses DESC
  LIMIT 10;
+-------+------------+------------+
| state | city       | businesses |
+-------+------------+------------+
| NV    | Las Vegas  | 12021      |
| AZ    | Phoenix    | 7499       |
| AZ    | Scottsdale | 3605       |
| EDH   | Edinburgh  | 2804       |
| AZ    | Mesa       | 2041       |
| AZ    | Tempe      | 2025       |
| NV    | Henderson  | 1914       |
| AZ    | Chandler   | 1637       |
| WI    | Madison    | 1630       |
| AZ    | Glendale   | 1196       |
+-------+------------+------------+
- 33. Explore the Data: Complex Data
- 34. business.json (1)
{
  "business_id": "4bEjOyTaDG24SY5TxsaUNQ",
  "full_address": "3655 Las Vegas Blvd S\nThe Strip\nLas Vegas, NV 89109",
  "hours": {
    "Monday": {"close": "23:00", "open": "07:00"},
    "Tuesday": {"close": "23:00", "open": "07:00"},
    "Friday": {"close": "00:00", "open": "07:00"},
    "Wednesday": {"close": "23:00", "open": "07:00"},
    "Thursday": {"close": "23:00", "open": "07:00"},
    "Sunday": {"close": "23:00", "open": "07:00"},
    "Saturday": {"close": "00:00", "open": "07:00"}
  },
  "open": true,
  "categories": ["Breakfast & Brunch", "Steakhouses", "French", "Restaurants"],
  "city": "Las Vegas",
  "review_count": 4084,
  "name": "Mon Ami Gabi",
  "neighborhoods": ["The Strip"],
  "longitude": -115.172588519464,
business.json (2)
"state":
"NV",
"stars":
4.0,
"attributes":
{
"Alcohol":
"full_bar”,
"Noise
Level":
"average",
"Has
TV":
false,
"Attire":
"casual",
"Ambience":
{
"romantic":
true,
"intimate":
false,
"touristy":
false,
"hipster":
false,
"classy":
true,
"trendy":
false,
"casual":
false
},
"Good
For":
{"dessert":
false,
"latenight":
false,
"lunch":
false,
"dinner":
true,
"breakfast":
false,
"brunch":
false},
}
}
- 36. Which Places Are Open Right Now (22:00)?
> SELECT name, b.hours
  FROM dfs.demo.`yelp/business.json` b
  WHERE b.hours.Saturday.`open` < '22:00' AND
        b.hours.Saturday.`close` > '22:00'
  LIMIT 2;
+------+-------+
| name | hours |
+------+-------+
| Chang Jiang Chinese Kitchen | {"Tuesday":{"close":"22:00","open":"11:00"},"Friday":{"close":"22:30","open":"11:00"},"Monday":{"close":"22:00","open":"11:00"},"Wednesday":{"close":"22:00","open":"11:00"},"Thursday":{"close":"22:00","open":"11:00"},"Sunday":{"close":"21:00","open":"16:00"},"Saturday":{"close":"22:30","open":"11:00"}} |
| Grand China Restaurant | {"Tuesday":{"close":"22:00","open":"11:00"},"Friday":{"close":"23:00","open":"11:00"},"Monday":{"close":"22:00","open":"11:00"},"Wednesday":{"close":"22:00","open":"11:00"},"Thursday":{"close":"22:00","open":"11:00"},"Sunday":{"close":"22:00","open":"12:00"},"Saturday":{"close":"23:00","open":"11:00"}} |
+------+-------+
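The WHERE clause above compares zero-padded HH:MM strings inside a nested map. A small Python sketch of the same predicate shows why that works and where it does not. `is_open` is a hypothetical helper, the Saturday hours are copied from the slides, and the record without an hours field is made up for illustration.

```python
def is_open(business, day, now):
    """Mirror the filter: hours.<day>.open < now AND hours.<day>.close > now."""
    hours = business.get("hours", {}).get(day)
    if hours is None:
        # A record without the nested field cannot satisfy the predicate.
        return False
    # Zero-padded 24-hour strings sort in the same order as the times.
    return hours["open"] < now and hours["close"] > now

businesses = [
    {"name": "Chang Jiang Chinese Kitchen",
     "hours": {"Saturday": {"close": "22:30", "open": "11:00"}}},
    {"name": "Mon Ami Gabi",
     "hours": {"Saturday": {"close": "00:00", "open": "07:00"}}},
    {"name": "No Hours Listed"},  # made-up record with no hours field
]
open_now = [b["name"] for b in businesses if is_open(b, "Saturday", "22:00")]
print(open_now)
```

In this sketch a midnight close ("00:00") compares as earlier than "22:00", so Mon Ami Gabi is filtered out even though it is open at 22:00; a production query would need to handle closing times past midnight.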
- 37. It’s 10pm in Vegas and I Want Good Hummus!
> SELECT name, stars, b.hours.Friday, categories
  FROM dfs.demo.`yelp/business.json` b
  WHERE b.hours.Friday.`open` < '22:00' AND
        b.hours.Friday.`close` > '22:00' AND
        REPEATED_CONTAINS(categories, 'Mediterranean') AND
        city = 'Las Vegas'
  ORDER BY stars DESC
  LIMIT 2;
+------+-------+--------+------------+
| name | stars | EXPR$2 | categories |
+------+-------+--------+------------+
| Olives | 4.0 | {"close":"22:30","open":"11:00"} | ["Mediterranean","Restaurants"] |
| Marrakech Moroccan Restaurant | 4.0 | {"close":"23:00","open":"17:30"} | ["Mediterranean","Middle Eastern","Moroccan","Restaurants"] |
+------+-------+--------+------------+
- 38. Flatten Repeated Values
> SELECT name, categories
  FROM dfs.demo.`yelp/business.json`
  LIMIT 3;
+------+------------+
| name | categories |
+------+------------+
| Eric Goldberg, MD | ["Doctors","Health & Medical"] |
| Pine Cone Restaurant | ["Restaurants"] |
| Deforest Family Restaurant | ["American (Traditional)","Restaurants"] |
+------+------------+
> SELECT name, FLATTEN(categories) AS categories
  FROM dfs.demo.`yelp/business.json`
  LIMIT 5;
+----------------------------+------------------------+
| name                       | categories             |
+----------------------------+------------------------+
| Eric Goldberg, MD          | Doctors                |
| Eric Goldberg, MD          | Health & Medical       |
| Pine Cone Restaurant       | Restaurants            |
| Deforest Family Restaurant | American (Traditional) |
| Deforest Family Restaurant | Restaurants            |
+----------------------------+------------------------+
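What FLATTEN does to a repeated column can be emulated in a few lines of Python. This is a sketch of the semantics shown in the two result sets above, not Drill's implementation; `flatten` is a hypothetical helper and the rows are the three businesses from the slide.

```python
def flatten(rows, column):
    """Emit one output row per element of the repeated column."""
    for row in rows:
        for value in row[column]:
            # Copy the other fields and replace the array with one element.
            yield {**row, column: value}

businesses = [
    {"name": "Eric Goldberg, MD", "categories": ["Doctors", "Health & Medical"]},
    {"name": "Pine Cone Restaurant", "categories": ["Restaurants"]},
    {"name": "Deforest Family Restaurant",
     "categories": ["American (Traditional)", "Restaurants"]},
]
flat = list(flatten(businesses, "categories"))
for row in flat:
    print(row["name"], "|", row["categories"])
```

Three input rows become five output rows, matching the second result set: the non-repeated column (name) is duplicated once per array element.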
- 39. Most and Least Common Business Categories
> SELECT category, count(*) AS businesses
  FROM (SELECT name, FLATTEN(categories) AS category
        FROM dfs.demo.`yelp/business.json`) c
  GROUP BY category
  ORDER BY businesses DESC;
+--------------+------------+
| category     | businesses |
+--------------+------------+
| Restaurants  | 14303      |
…
| Australian   | 1          |
| Boat Dealers | 1          |
| Firewood     | 1          |
+--------------+------------+
715 rows selected (3.439 seconds)
> SELECT name, categories
  FROM dfs.demo.`yelp/business.json`
  WHERE true and REPEATED_CONTAINS(categories, 'Australian');
+------+------------+
| name | categories |
+------+------------+
| The Australian AZ | ["Bars","Burgers","Nightlife","Australian","Sports Bars","Restaurants"] |
+------+------------+
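The flatten-then-aggregate pattern and the follow-up membership filter can be sketched in Python on a tiny sample. The three rows are taken from the slides, so the counts differ from the real result, and treating REPEATED_CONTAINS as exact element membership is a simplifying assumption of this sketch.

```python
from collections import Counter

businesses = [
    {"name": "The Australian AZ",
     "categories": ["Bars", "Burgers", "Nightlife", "Australian",
                    "Sports Bars", "Restaurants"]},
    {"name": "Pine Cone Restaurant", "categories": ["Restaurants"]},
    {"name": "Deforest Family Restaurant",
     "categories": ["American (Traditional)", "Restaurants"]},
]

# Subquery: one (name, category) row per array element, like FLATTEN.
flattened = [(b["name"], c) for b in businesses for c in b["categories"]]

# Outer query: GROUP BY category, ORDER BY businesses DESC.
category_counts = Counter(c for _, c in flattened).most_common()

# Follow-up query: keep rows whose array contains the given value.
australian = [b["name"] for b in businesses if "Australian" in b["categories"]]

print(category_counts[0])
print(australian)
```

The aggregate needs the flattened rows because grouping on the array column itself would group by whole category lists rather than by individual categories.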
- 40. Explore the Data: Views
- 41. Create a View for Name-Gender Mapping
names.csv: [screenshot of the file, with columns[0] and columns[4] highlighted]
> CREATE VIEW dfs.tmp.`names` AS
  SELECT columns[0] AS name, columns[4] AS gender
  FROM dfs.demo.`names.csv`;
> USE dfs.tmp;
> CREATE VIEW names1 AS
  SELECT columns[0] AS name, columns[4] AS gender
  FROM dfs.demo.`names.csv`;
> SELECT * FROM dfs.tmp.names WHERE name = 'John';
+------+--------+
| name | gender |
+------+--------+
| John | Male   |
+------+--------+
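The view gives names to positional CSV fields. A rough Python equivalent is sketched below; the sample CSV text is hypothetical (only the positions of name and gender, columns[0] and columns[4], come from the slide), and `names_view` is an illustrative helper.

```python
import csv
import io

# Hypothetical file contents; columns 1-3 are placeholders.
NAMES_CSV = "John,x,x,x,Male\nJennifer,x,x,x,Female\nChris,x,x,x,Unknown\n"

def names_view(csv_text):
    """Expose positional CSV fields under column names, like the view."""
    for columns in csv.reader(io.StringIO(csv_text)):
        yield {"name": columns[0], "gender": columns[4]}

johns = [row for row in names_view(NAMES_CSV) if row["name"] == "John"]
print(johns)
```

Like the view in this sketch's analogy, the generator stores no data of its own: each call re-reads the underlying CSV text and applies the column mapping.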
- 42. Most Common Names (and their Genders) on Yelp
> SELECT u.name, n.gender, count(*) AS number
  FROM mongo.yelp.users u, dfs.tmp.names n
  WHERE u.name = n.name
  GROUP BY u.name, n.gender
  ORDER BY number DESC
  LIMIT 10;
+----------+---------+--------+
| name     | gender  | number |
+----------+---------+--------+
| David    | Male    | 2453   |
| John     | Male    | 2378   |
| Michael  | Male    | 2322   |
| Chris    | Unknown | 2202   |
| Mike     | Male    | 2037   |
| Jennifer | Female  | 1867   |
| Jessica  | Female  | 1463   |
| Jason    | Male    | 1457   |
| Michelle | Female  | 1439   |
| Brian    | Male    | 1436   |
+----------+---------+--------+
- 43. Who Rates Higher – Men or Women?
> SELECT n.gender, count(*) AS users, round(avg(average_stars), 2) stars
  FROM mongo.yelp.users u, dfs.tmp.names n
  WHERE u.name = n.name
  GROUP BY n.gender;
+---------+--------+-------+
| gender  | users  | stars |
+---------+--------+-------+
| Female  | 103684 | 3.77  |
| Male    | 97430  | 3.696 |
| Unknown | 18409  | 3.727 |
+---------+--------+-------+
- 44. Who Writes More – Men or Women?
It takes a 3-way join to find out…
> SELECT n.gender, round(avg(length(r.text))) AS review_length
  FROM dfs.demo.`yelp/review.json` r, mongo.yelp.users u, dfs.tmp.names n
  WHERE u.name = n.name AND
        r.user_id = u.user_id
  GROUP BY n.gender;
+---------+---------------+
| gender  | review_length |
+---------+---------------+
| Male    | 665           |
| Female  | 730           |
| Unknown | 711           |
+---------+---------------+
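The shape of this 3-way join (a JSON file, a MongoDB collection, and a view over a CSV file) can be sketched in plain Python with hash lookups. All rows below are made up, and the dictionary lookups assume user_id and name are unique on the lookup side; this illustrates the query's logic, not Drill's join execution.

```python
from collections import defaultdict

# Made-up stand-ins for the three sources in the query.
reviews = [  # dfs.demo.`yelp/review.json`
    {"user_id": "u1", "text": "Great hummus."},
    {"user_id": "u2", "text": "STAY AWAY from this office."},
    {"user_id": "u1", "text": "Solid brunch spot."},
]
users = [  # mongo.yelp.users
    {"user_id": "u1", "name": "Jennifer"},
    {"user_id": "u2", "name": "John"},
]
names = [  # dfs.tmp.names
    {"name": "Jennifer", "gender": "Female"},
    {"name": "John", "gender": "Male"},
]

# Hash tables on the join keys (assumes the keys are unique).
user_by_id = {u["user_id"]: u for u in users}
gender_by_name = {n["name"]: n["gender"] for n in names}

lengths = defaultdict(list)
for review in reviews:
    user = user_by_id.get(review["user_id"])       # r.user_id = u.user_id
    if user is None:
        continue
    gender = gender_by_name.get(user["name"])      # u.name = n.name
    if gender is None:
        continue
    lengths[gender].append(len(review["text"]))

# GROUP BY n.gender with round(avg(length(r.text))).
review_length = {g: round(sum(v) / len(v)) for g, v in lengths.items()}
print(review_length)
```

Rows that fail either lookup are dropped, which matches the inner-join behavior of the comma-separated FROM list with equality conditions in the WHERE clause.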
- 45. Drill Tweets (@ApacheDrill)
- 46. Thank You
• Learn: incubator.apache.org/drill/
• Download: incubator.apache.org/drill/download/
• Ask questions: drill-user@incubator.apache.org
• Contact me: tshiran@apache.org
- 47. Thank You
Tomer Shiran, VP Product Management
tshiran@mapr.com
@mapr | maprtech | MapRTechnologies | mapr-technologies