SlideShare a Scribd company logo
®
© 2014 MapR Technologies 1
®
© 2014 MapR Technologies
Adding complex data to Spark stack
Neeraja Rentachintala, Director of Product Management
6/16/2015
®
© 2014 MapR Technologies 2
COMPLEX/
SEMI-STRUCTURED DATA
STRUCTURED DATA
1980 2000 20101990 2020
Complex data is growing
Unstructured data will account
for more than 80% of the data
collected by organizations
Source: Human-Computer Interaction & Knowledge Discovery in Complex Unstructured, Big Data
TotalDataStored
®
© 2014 MapR Technologies 3
1980 2000 20101990 2020
Fixed schema
DBA controls structure
Dynamic / Flexible schema
Application controls structure
NON-RELATIONAL DATASTORESRELATIONAL DATABASES
GBs-TBs TBs-PBsVolume
Database
Data Increasingly Stored in Non-Relational Datastores
Structure
Development
Structured Structured, semi-structured and unstructured
Planned (release cycle = months-years) Iterative (release cycle = days-weeks)
®
© 2014 MapR Technologies 4
Session topics
•  Apache Drill overview
•  Integrating Drill & Spark
•  Status & Next steps
®
© 2014 MapR Technologies 5
Drill is the Industry’s first Schema-Free SQL engine
Point-and-query vs. schema-first
Access to any data source & types
Industry standard APIs
(ANSI SQL, JDBC/ODBC, REST)
Performance at Scale
Extreme Ease of Use
®
© 2014 MapR Technologies 6
-  Sub-directory
-  HBase namespace
-  Hive database
Drill enables ‘SQL on Everything’ (Omni-SQL)
SELECT	
  *	
  FROM	
  dfs.yelp.`business.json`	
  !
Workspace
-  Pathnames
-  Hive table
-  HBase table
Table
-  File system (Text, Parquet, JSON)
-  HBase/MapR-DB
-  Hive Metastore/HCatalog
- Easy API to go beyond Hadoop
Storage plugin instance
®
© 2014 MapR Technologies 7
Drill is a Distributed SQL query engine
drillbit	
  
DataNode/
RegionServer	
  
drillbit	
  
DataNode/
RegionServer	
  
drillbit	
  
DataNode/
RegionServer	
  
ZooKeeper
ZooKeeper
ZooKeeper
…
Ø  Scale-out (single node to 1000’s of nodes)
Ø  Columnar and Vectorized execution
Ø  Optimistic execution (no MR, Spark, Tez)
Ø  Extensible
®
© 2014 MapR Technologies 8
Core Modules within drillbit	
  
SQL Parser
Hive
HBase
StoragePlugins
MongoDB
DFS
PhysicalPlan
Execution
LogicalPlan
Optimizer
RPC Endpoint
®
© 2014 MapR Technologies 9
Integrating Spark and Drill
Features at a glance:
•  Use Drill as an input to Spark
•  Query Spark RDDs via Drill and create data pipelines
Disk (DFS)
Memory
RDD
Files Files
®
© 2014 MapR Technologies 10© 2014 MapR Technologies
®
Drill examples walkthrough
®
© 2014 MapR Technologies 11
business.json {	
  
	
  "business_id":	
  "4bEjOyTaDG24SY5TxsaUNQ",	
  
	
  "full_address":	
  "3655	
  Las	
  Vegas	
  Blvd	
  SnThe	
  StripnLas	
  Vegas,	
  NV	
  89109",	
  
	
  "hours":	
  {	
  
	
   	
  "Monday":	
  {"close":	
  "23:00",	
  "open":	
  "07:00"},	
  
	
   	
  "Tuesday":	
  {"close":	
  "23:00",	
  "open":	
  "07:00"},	
  
	
   	
  "Friday":	
  {"close":	
  "00:00",	
  "open":	
  "07:00"},	
  
	
   	
  "Wednesday":	
  {"close":	
  "23:00",	
  "open":	
  "07:00"},	
  
	
   	
  "Thursday":	
  {"close":	
  "23:00",	
  "open":	
  "07:00"},	
  
	
   	
  "Sunday":	
  {"close":	
  "23:00",	
  "open":	
  "07:00"},	
  
	
   	
  "Saturday":	
  {"close":	
  "00:00",	
  "open":	
  "07:00"}	
  
	
  },	
  
	
  "open":	
  true,	
  
	
  "categories":	
  ["Breakfast	
  &	
  Brunch",	
  "Steakhouses",	
  "French",	
  "Restaurants"],	
  
	
  "city":	
  "Las	
  Vegas",	
  
	
  "review_count":	
  4084,	
  
	
  "name":	
  "Mon	
  Ami	
  Gabi",	
  
	
  "neighborhoods":	
  ["The	
  Strip"],	
  
	
  "longitude":	
  -­‐115.172588519464,	
  
	
  "state":	
  "NV",	
  
	
  "stars":	
  4.0,	
  
	
   	
  "attributes":	
  {	
  
	
   	
  "Alcohol":	
  "full_bar”,	
  
	
   	
   	
  "Noise	
  Level":	
  "average",	
  
	
   	
  "Has	
  TV":	
  false,	
  
	
   	
  "Attire":	
  "casual",	
  
	
   	
  "Ambience":	
  {	
  
	
   	
   	
  "romantic":	
  true,	
  
	
   	
   	
  "intimate":	
  false,	
  
	
   	
   	
  "touristy":	
  false,	
  
	
   	
   	
  "hipster":	
  false,	
  
	
   	
   	
   	
  "classy":	
  true,	
  
	
   	
   	
  "trendy":	
  false,	
  
	
   	
   	
   	
  "casual":	
  false	
  
	
   	
  },	
  
	
   	
  "Good	
  For":	
  {"dessert":	
  false,	
  "latenight":	
  false,	
  "lunch":	
  false,	
  
	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  "dinner":	
  true,	
  "breakfast":	
  false,	
  "brunch":	
  false},	
  
	
  }	
  
}	
  
®
© 2014 MapR Technologies 12
review.json
{	
  
	
  	
  "votes":	
  {"funny":	
  0,	
  "useful":	
  2,	
  "cool":	
  1},	
  
	
  	
  "user_id":	
  "Xqd0DzHaiyRqVH3WRG7hzg",	
  
	
  	
  "review_id":	
  "15SdjuK7DmYqUAj6rjGowg",	
  
	
  	
  "stars":	
  5,	
  
	
  	
  "date":	
  "2007-­‐05-­‐17",	
  
	
  	
  "text":	
  "dr.	
  goldberg	
  offers	
  everything	
  ...",	
  
	
  	
  "type":	
  "review",	
  
	
  	
  "business_id":	
  "vcNAWiLM4dR7D2nwwJ7nCA"	
  
}	
  
®
© 2014 MapR Technologies 13
Zero to Results in 2 minutes
$	
  tar	
  -­‐xvzf	
  apache-­‐drill-­‐1.0.0.tar.gz	
  
	
  
$	
  bin/sqlline	
  -­‐u	
  jdbc:drill:zk=local	
  
	
  
>	
  SELECT	
  state,	
  city,	
  count(*)	
  AS	
  businesses	
  
	
  	
  FROM	
  dfs.yelp.`business.json`	
  
	
  	
  GROUP	
  BY	
  state,	
  city	
  
	
  	
  ORDER	
  BY	
  businesses	
  DESC	
  LIMIT	
  10;	
  
	
  
+-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐+-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐+-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐+	
  
|	
  	
  	
  state	
  	
  	
  	
  |	
  	
  	
  	
  city	
  	
  	
  	
  |	
  	
  businesses	
  |	
  
+-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐+-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐+-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐+	
  
|	
  NV	
  	
  	
  	
  	
  	
  	
  	
  	
  |	
  Las	
  Vegas	
  	
  |	
  12021	
  	
  	
  	
  	
  	
  	
  |	
  
|	
  AZ	
  	
  	
  	
  	
  	
  	
  	
  	
  |	
  Phoenix	
  	
  	
  	
  |	
  7499	
  	
  	
  	
  	
  	
  	
  	
  |	
  
|	
  AZ	
  	
  	
  	
  	
  	
  	
  	
  	
  |	
  Scottsdale	
  |	
  3605	
  	
  	
  	
  	
  	
  	
  	
  |	
  
|	
  EDH	
  	
  	
  	
  	
  	
  	
  	
  |	
  Edinburgh	
  	
  |	
  2804	
  	
  	
  	
  	
  	
  	
  	
  |	
  
|	
  AZ	
  	
  	
  	
  	
  	
  	
  	
  	
  |	
  Mesa	
  	
  	
  	
  	
  	
  	
  |	
  2041	
  	
  	
  	
  	
  	
  	
  	
  |	
  
|	
  AZ	
  	
  	
  	
  	
  	
  	
  	
  	
  |	
  Tempe	
  	
  	
  	
  	
  	
  |	
  2025	
  	
  	
  	
  	
  	
  	
  	
  |	
  
|	
  NV	
  	
  	
  	
  	
  	
  	
  	
  	
  |	
  Henderson	
  	
  |	
  1914	
  	
  	
  	
  	
  	
  	
  	
  |	
  
|	
  AZ	
  	
  	
  	
  	
  	
  	
  	
  	
  |	
  Chandler	
  	
  	
  |	
  1637	
  	
  	
  	
  	
  	
  	
  	
  |	
  
|	
  WI	
  	
  	
  	
  	
  	
  	
  	
  	
  |	
  Madison	
  	
  	
  	
  |	
  1630	
  	
  	
  	
  	
  	
  	
  	
  |	
  
|	
  AZ	
  	
  	
  	
  	
  	
  	
  	
  	
  |	
  Glendale	
  	
  	
  |	
  1196	
  	
  	
  	
  	
  	
  	
  	
  |	
  
+-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐+-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐+-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐+	
  
Install	
  
Query	
  files	
  
and	
  
directories	
  
Results	
  
Launch	
  shell	
  
(embedded	
  
mode)	
  
®
© 2014 MapR Technologies 14
Directories are implicit partitions
	
  
SELECT	
  dir0,	
  SUM(amount)	
  
FROM	
  sales	
  
GROUP	
  BY	
  dir1	
  IN	
  (q1,	
  q2)	
  
	
  
	
  
	
  
sales	
  
├──	
  2014	
  
│	
  	
  	
  ├──	
  q1	
  
│	
  	
  	
  ├──	
  q2	
  
│	
  	
  	
  ├──	
  q3	
  
│	
  	
  	
  └──	
  q4	
  
└──	
  2015	
  
	
  	
  	
  	
  └──	
  q1	
  
®
© 2014 MapR Technologies 15
Intuitive SQL access to complex data
//	
  It’s	
  Friday	
  10pm	
  in	
  Vegas	
  and	
  looking	
  for	
  Hummus	
  
	
  
>	
  SELECT	
  name,	
  stars,	
  b.hours.Friday	
  friday,	
  categories	
  
	
  	
  FROM	
  dfs.yelp.`business.json`	
  b	
  
	
  	
  WHERE	
  b.hours.Friday.`open`	
  <	
  '22:00'	
  AND	
  
	
  	
  	
  	
  	
  	
  	
  	
  b.hours.Friday.`close`	
  >	
  '22:00'	
  AND	
  
	
  	
  	
  	
  	
  	
  	
  	
  REPEATED_CONTAINS(categories,	
  'Mediterranean')	
  AND	
  
	
  	
  	
  	
  	
  	
  	
  	
  city	
  =	
  'Las	
  Vegas'	
  
	
  	
  ORDER	
  BY	
  stars	
  DESC	
  
	
  	
  LIMIT	
  2;	
  
	
  
+-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐+-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐+-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐+-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐+	
  
|	
  	
  	
  	
  name	
  	
  	
  	
  |	
  	
  	
  stars	
  	
  	
  	
  |	
  	
  	
  friday	
  	
  	
  |	
  categories	
  |	
  
+-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐+-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐+-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐+-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐+	
  
|	
  Olives	
  	
  	
  	
  	
  |	
  4.0	
  	
  	
  	
  	
  	
  	
  	
  |	
  {"close":"22:30","open":"11:00"}	
  |	
  
["Mediterranean","Restaurants"]	
  |	
  
|	
  Marrakech	
  Moroccan	
  Restaurant	
  |	
  4.0	
  	
  	
  	
  	
  	
  	
  	
  |	
  {"close":"23:00","open":"17:30"}	
  |	
  
["Mediterranean","Middle	
  Eastern","Moroccan","Restaurants"]	
  |	
  
+-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐+-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐+-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐+-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐+	
  
Query	
  data	
  
with	
  any	
  
levels	
  of	
  
nesting	
  
®
© 2014 MapR Technologies 16
ANSI SQL compatibility
//Get	
  top	
  cool	
  rated	
  businesses	
  
	
  
Ø  SELECT	
  b.name	
  from	
  dfs.yelp.`business.json`	
  b	
  	
  
	
  	
  	
  WHERE	
  b.business_id	
  IN	
  
	
  	
  (SELECT	
  r.business_id	
  FROM	
  dfs.yelp.`review.json`	
  r	
  
	
  	
  	
  GROUP	
  BY	
  r.business_id	
  HAVING	
  SUM(r.votes.cool)	
  >	
  2000	
  ORDER	
  BY	
  	
  
	
  	
  	
  SUM(r.votes.cool)	
  DESC);	
  
	
  
+-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐+	
  
|	
  	
  	
  	
  name 	
  |	
  
+-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐+	
  
|	
  Earl	
  of	
  Sandwich	
  |	
  
|	
  XS	
  Nightclub	
  |	
  
|	
  The	
  Cosmopolitan	
  of	
  Las	
  Vegas	
  |	
  
|	
  Wicked	
  Spoon	
  |	
  
+-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐+	
  
Use	
  familiar	
  SQL	
  
functionality	
  
(Joins,	
  
Aggregations,	
  
Sorting,	
  Sub-­‐
queries,	
  SQL	
  data	
  
types)	
  
®
© 2014 MapR Technologies 17
Logical views
//Create	
  a	
  view	
  combining	
  business	
  and	
  reviews	
  datasets	
  
	
  
>	
  CREATE	
  OR	
  REPLACE	
  VIEW	
  dfs.tmp.BusinessReviews	
  AS	
  
	
  	
  	
  	
  SELECT	
  b.name,	
  b.stars,	
  r.votes.funny,	
  
	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  r.votes.useful,	
  r.votes.cool,	
  r.`date`	
  
	
  	
  	
  	
  	
  	
  FROM	
  dfs.yelp.`business.json`	
  b,	
  dfs.yelp.`review.json`	
  r	
  
	
  	
  	
  	
  	
  	
  WHERE	
  r.business_id	
  =	
  b.business_id;	
  
	
  
+-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐+-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐+	
  
|	
  	
  	
  	
  	
  ok	
  	
  	
  	
  	
  |	
  	
  summary	
  	
  	
  |	
  
+-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐+-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐+	
  
|	
  true	
  	
  	
  	
  	
  	
  	
  |	
  View	
  'BusinessReviews'	
  created	
  successfully	
  in	
  'dfs.tmp'	
  schema	
  |	
  
+-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐+-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐+	
  
	
  
>	
  SELECT	
  COUNT(*)	
  AS	
  Total	
  FROM	
  dfs.tmp.BusinessReviews;	
  
	
  
+------------+
| Total |
+------------+
| 1125458 |
+------------+
Lightweight	
  file	
  
system	
  based	
  
views	
  for	
  
granular	
  and	
  de-­‐
centralized	
  data	
  
management	
  
®
© 2014 MapR Technologies 18
Materialized Views AKA Tables
>	
  ALTER	
  SESSION	
  SET	
  `store.format`	
  =	
  'parquet';	
  
	
  
>	
  CREATE	
  TABLE	
  dfs.yelp.BusinessReviewsTbl	
  AS	
  
	
  	
  	
  	
  SELECT	
  b.name,	
  b.stars,	
  r.votes.funny	
  funny,	
  
	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  r.votes.useful	
  useful,	
  r.votes.cool	
  cool,	
  r.`date`	
  
	
  	
  	
  	
  	
  	
  FROM	
  dfs.yelp.`business.json`	
  b,	
  dfs.yelp.`review.json`	
  r	
  
	
  	
  	
  	
  	
  	
  WHERE	
  r.business_id	
  =	
  b.business_id;	
  
	
  
+-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐+-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐+	
  
|	
  	
  Fragment	
  	
  |	
  Number	
  of	
  records	
  written	
  |	
  
+-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐+-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐+	
  
|	
  1_0	
  	
  	
  	
  	
  	
  	
  	
  |	
  176448	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  |	
  
|	
  1_1	
  	
  	
  	
  	
  	
  	
  	
  |	
  192439	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  |	
  
|	
  1_2	
  	
  	
  	
  	
  	
  	
  	
  |	
  198625	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  |	
  
|	
  1_3	
  	
  	
  	
  	
  	
  	
  	
  |	
  200863	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  |	
  
|	
  1_4	
  	
  	
  	
  	
  	
  	
  	
  |	
  181420	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  |	
  
|	
  1_5	
  	
  	
  	
  	
  	
  	
  	
  |	
  175663	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  |	
  
+-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐+-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐+	
  
Save	
  analysis	
  
results	
  as	
  
tables	
  using	
  
familiar	
  CTAS	
  
syntax	
  
®
© 2014 MapR Technologies 19
Extensions to ANSI SQL to work with repeated values
//	
  Flatten	
  repeated	
  categories	
  	
  
	
  
>	
  SELECT	
  name,	
  categories	
  
	
  	
  FROM	
  dfs.yelp.`business.json`	
  LIMIT	
  3;	
  
	
  
+-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐+-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐+	
  
|	
  	
  	
  	
  name	
  	
  	
  	
  |	
  categories	
  |	
  
+-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐+-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐+	
  
|	
  Eric	
  Goldberg,	
  MD	
  |	
  ["Doctors","Health	
  &	
  Medical"]	
  |	
  
|	
  Pine	
  Cone	
  Restaurant	
  |	
  ["Restaurants"]	
  |	
  
|	
  Deforest	
  Family	
  Restaurant	
  |	
  ["American	
  (Traditional)","Restaurants"]	
  |	
  
+-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐+-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐+	
  
	
  
>	
  SELECT	
  name,	
  FLATTEN(categories)	
  AS	
  categories	
  
	
  	
  FROM	
  dfs.yelp.`business.json`	
  LIMIT	
  5;	
  
+-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐+-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐+	
  
|	
  	
  	
  	
  name	
  	
  	
  	
  |	
  categories	
  |	
  
+-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐+-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐+	
  
|	
  Eric	
  Goldberg,	
  MD	
  |	
  Doctors	
  	
  	
  	
  |	
  
|	
  Eric	
  Goldberg,	
  MD	
  |	
  Health	
  &	
  Medical	
  |	
  
|	
  Pine	
  Cone	
  Restaurant	
  |	
  Restaurants	
  |	
  
|	
  Deforest	
  Family	
  Restaurant	
  |	
  American	
  (Traditional)	
  |	
  
|	
  Deforest	
  Family	
  Restaurant	
  |	
  Restaurants	
  |	
  
+-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐+-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐+	
  
Dynamically	
  
flatten	
  repeated	
  
and	
  nested	
  data	
  
elements	
  as	
  part	
  
of	
  SQL	
  queries.	
  
No	
  ETL	
  necessary	
  
®
© 2014 MapR Technologies 20
Extensions to ANSI SQL to work with repeated values
//	
  Get	
  most	
  common	
  business	
  categories	
  
	
  	
  
>SELECT	
  category,	
  count(*)	
  AS	
  categorycount	
  
	
  	
  FROM	
  (SELECT	
  name,	
  FLATTEN(categories)	
  AS	
  category	
  
	
  	
  	
  	
  	
  	
  	
  	
  FROM	
  dfs.yelp.`business.json`)	
  c	
  
	
  	
  GROUP	
  BY	
  category	
  ORDER	
  BY	
  categorycount	
  DESC;	
  
	
  
+-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐+-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐+	
  
|	
  	
  category	
  	
  |	
  categorycount|	
  
+-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐+-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐+	
  
|	
  Restaurants	
  |	
  14303	
  	
  	
  	
  	
  	
  |	
  
…	
  
|	
  Australian	
  |	
  1	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  |	
  
|	
  Boat	
  Dealers	
  |	
  1	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  |	
  
|	
  Firewood	
  	
  	
  |	
  1	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  |	
  
+-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐+-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐+	
  
®
© 2014 MapR Technologies 21
Extensions to ANSI SQL to work with embedded JSON
-­‐-­‐	
  embedded	
  JSON	
  value	
  inside	
  column	
  donutjson	
  inside	
  column-­‐
family	
  cf1	
  	
  of	
  an	
  hbase	
  table	
  donuts	
  
SELECT	
  	
  
	
  d.name,	
  COUNT(d.fillings)	
  
FROM	
  (	
  
	
  SELECT	
  convert_from(cf1.donutjson,	
  JSON)	
  as	
  d	
  
	
  FROM	
  hbase.donuts);	
  
!
®
© 2014 MapR Technologies 22© 2014 MapR Technologies
®
Working with Dynamic Columns
®
© 2014 MapR Technologies 23
checkin.json
check-­‐in	
  
{	
  
	
  	
  	
  	
  'type':	
  'checkin',	
  
	
  	
  	
  	
  'business_id':	
  (encrypted	
  business	
  id),	
  
	
  	
  	
  	
  'checkin_info':	
  {	
  
	
  	
  	
  	
  	
  	
  	
  	
  '0-­‐0':	
  (number	
  of	
  checkins	
  from	
  00:00	
  to	
  01:00	
  on	
  all	
  Sundays),	
  
	
  	
  	
  	
  	
  	
  	
  	
  '1-­‐0':	
  (number	
  of	
  checkins	
  from	
  01:00	
  to	
  02:00	
  on	
  all	
  Sundays),	
  
	
  	
  	
  	
  	
  	
  	
  	
  ...	
  
	
  	
  	
  	
  	
  	
  	
  	
  '14-­‐4':	
  (number	
  of	
  checkins	
  from	
  14:00	
  to	
  15:00	
  on	
  all	
  Thursdays),	
  
	
  	
  	
  	
  	
  	
  	
  	
  …	
  
	
  	
  	
  	
  	
  	
  	
  	
  '23-­‐6':	
  (number	
  of	
  checkins	
  from	
  23:00	
  to	
  00:00	
  on	
  all	
  Saturdays)	
  
	
  	
  	
  	
  },	
  #	
  if	
  there	
  was	
  no	
  checkin	
  for	
  a	
  hour-­‐day	
  block	
  it	
  will	
  not	
  be	
  in	
  the	
  
dataset	
  
}	
  
®
© 2014 MapR Technologies 24
Makes it easy to work with dynamic/unknown columns
>	
  jdbc:drill:zk=local>	
  SELECT	
  KVGEN(checkin_info)	
  checkins	
  	
  
	
  	
  	
  FROM	
  dfs.yelp.`checkin.json`	
  LIMIT	
  1;	
  
+-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐+	
  
|	
  	
  checkins	
  	
  |	
  
+-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐+	
  
|	
  [{"key":"3-­‐4","value":1},{"key":"13-­‐5","value":1},{"key":"6-­‐6","value":1},{"key":"14-­‐5","value":1},
{"key":"14-­‐6","value":1},{"key":"14-­‐2","value":1},{"key":"14-­‐3","value":1},{"key":"19-­‐0","value":1},
{"key":"11-­‐5","value":1},{"key":"13-­‐2","value":1},{"key":"11-­‐6","value":2},{"key":"11-­‐3","value":1},
{"key":"12-­‐6","value":1},{"key":"6-­‐5","value":1},{"key":"5-­‐5","value":1},{"key":"9-­‐2","value":1},
{"key":"9-­‐5","value":1},{"key":"9-­‐6","value":1},{"key":"5-­‐2","value":1},{"key":"7-­‐6","value":1},
{"key":"7-­‐5","value":1},{"key":"7-­‐4","value":1},{"key":"17-­‐5","value":1},{"key":"8-­‐5","value":1},
{"key":"10-­‐2","value":1},{"key":"10-­‐5","value":1},{"key":"10-­‐6","value":1}]	
  |	
  
+-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐+	
  
	
  
>	
  jdbc:drill:zk=local>	
  SELECT	
  FLATTEN(KVGEN(checkin_info))	
  
checkins	
  FROM	
  dfs.yelp.`checkin.json`	
  limit	
  6;	
  
	
  
+-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐+	
  
|	
  	
  checkins	
  	
  |	
  
+-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐+	
  
|	
  {"key":"3-­‐4","value":1}	
  |	
  
|	
  {"key":"13-­‐5","value":1}	
  |	
  
|	
  {"key":"6-­‐6","value":1}	
  |	
  
|	
  {"key":"14-­‐5","value":1}	
  |	
  
|	
  {"key":"14-­‐6","value":1}	
  |	
  
|	
  {"key":"14-­‐2","value":1}	
  |	
  
+-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐+	
  
Convert	
  Map	
  with	
  
a	
  wide	
  set	
  of	
  
dynamic	
  columns	
  
into	
  an	
  array	
  of	
  
key-­‐value	
  pairs	
  
®
© 2014 MapR Technologies 25
Makes it easy to work with dynamic/unknown columns
//	
  Count	
  total	
  number	
  of	
  checkins	
  on	
  Sunday	
  midnight	
  
	
  
Ø  jdbc:drill:zk=local>	
  SELECT	
  SUM(checkintbl.checkins.`value`)	
  as	
  
SundayMidnightCheckins	
  FROM	
  	
  
	
  	
  	
  (SELECT	
  FLATTEN(KVGEN(checkin_info))	
  checkins	
  
	
  	
  	
  	
  FROM	
  dfs.yelp.checkin.json`)	
  checkintbl	
  	
  
	
  	
  	
  	
  WHERE	
  checkintbl.checkins.key='23-­‐0';	
  
	
  
+-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐+	
  
|	
  SundayMidnightCheckins	
  |	
  
+-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐+	
  
|	
  8575	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  |	
  
+-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐+	
  
®
© 2014 MapR Technologies 26© 2014 MapR Technologies
®
Combining Drill & Spark together
®
© 2014 MapR Technologies 27
Using Drill as an input to Spark Jobs
object	
  DosDemo	
  extends	
  App	
  	
  
{ 	
   	
   	
  	
  
	
   	
  	
  
val	
  conf	
  =	
  new	
  SparkConf().setAppName("DrillSql").setMaster("local")	
  	
  
	
   	
  	
  
val	
  sc	
  =	
  new	
  SparkContext(conf)	
  	
  
	
   	
  	
  
	
   	
  	
  
def	
  queryOne():	
  Unit	
  =	
  
{ 	
   	
  	
  
val	
  sql	
  =	
  "SELECT	
  name,	
  full_address,	
  hours,	
  categories	
  FROM	
  `business.json`	
  where	
  
repeated_contains(categories,’Restaurants’)	
  " 	
  	
  
	
   	
  	
  
val	
  result	
  =	
  sc.drillRDD(sql) 	
  	
  
	
   	
  	
  
val	
  formatted	
  =	
  result.map	
  {	
  r	
  => 	
  	
  
	
   	
  	
  
val	
  (name,	
  address,	
  open,	
  close)	
  =	
  (r.name,	
  r.full_address,	
  r.hours.friday.open,	
  r.hours.friday,close) 	
  	
  
	
   	
  	
  
s"$name$address$open$close” 	
  	
  	
  } 	
  	
  
	
   	
   	
   	
  	
  
formatted.foreach(println) 	
  	
  
	
  
	
  	
  	
  	
  	
  	
  } 	
  	
  
Create	
  RDDs	
  from	
  
Drill	
  queries	
  
Access	
  nested	
  
data	
  elements	
  
intuitively	
  
®
© 2014 MapR Technologies 28
Query Spark RDDs via Drill
def	
  queryTwo():	
  Unit	
  =	
  { 	
   	
   	
  	
  
	
   	
  	
  
var	
  easy	
  =	
  sc.parallelize(0	
  until	
  10)	
  	
  
	
   	
  	
  
val	
  squared	
  =	
  easy.map	
  {	
  i	
  => 	
  	
  
	
   	
  	
  
CObject(("index",	
  i),	
  ("value",	
  i*i))	
  	
  
} 	
  	
  
	
   	
  	
  
sc.registerAsTable("squared",	
  squared)	
  	
  
	
   	
  	
  
val	
  sql	
  =	
  "select	
  *	
  from	
  spark.squared	
  where	
  index>3	
  and	
  index<10	
  limit	
  5" 	
  	
  
	
   	
  	
  
val	
  result	
  =	
  sc.drillRDD(sql) 	
  	
  
	
   	
  	
  
result.foreach(println) 	
  	
  
	
  
	
  } 	
  	
  
	
   	
   	
   	
   	
  	
  
} 	
  	
  
Construct	
  
complex	
  objects	
  
as	
  part	
  of	
  the	
  
Spark	
  pipelines	
  
Operate	
  on	
  the	
  
complex	
  data	
  
directly	
  using	
  
Drill	
  queries	
  
®
© 2014 MapR Technologies 29
Combine benefits of Drill & Spark
•  Flexibility (schema-discovery,
complex data, UDFs)
•  Rich storage plugin support
•  Columnar execution
•  ANSI SQL
•  Rapid application
development
•  Advanced analytic workflows
•  Efficient batch processing
®
© 2014 MapR Technologies 30
Status and next steps
•  Functional integration complete
•  Beta release coming soon
•  Next steps
–  Improve memory efficiencies
–  Improve serialization efficiencies
–  Improve task and data locality
•  Ask questions and/or contribute
–  user@drill.apache.org
–  dev@drill.apache.org
®
© 2014 MapR Technologies 31
Recommendations On Trying and Using Drill
New to Drill?
–  Get started with Free MapR On Demand training
–  Test Drive Drill on cloud with AWS
–  Learn how to use Drill with Hadoop using MapR sandbox
Ready to play with your data?
–  Try out Apache Drill in 10 mins guide on your desktop
–  Download Drill for your cluster and start exploration
–  Comprehensive tutorials and documentation available

More Related Content

What's hot

Optimizing Delta/Parquet Data Lakes for Apache Spark
Optimizing Delta/Parquet Data Lakes for Apache SparkOptimizing Delta/Parquet Data Lakes for Apache Spark
Optimizing Delta/Parquet Data Lakes for Apache Spark
Databricks
 
The Hidden Life of Spark Jobs
The Hidden Life of Spark JobsThe Hidden Life of Spark Jobs
The Hidden Life of Spark Jobs
DataWorks Summit
 
Map reduce vs spark
Map reduce vs sparkMap reduce vs spark
Map reduce vs spark
Tudor Lapusan
 
Cassandra & Spark for IoT
Cassandra & Spark for IoTCassandra & Spark for IoT
Cassandra & Spark for IoT
Matthias Niehoff
 
Lambda Architectures in Practice
Lambda Architectures in PracticeLambda Architectures in Practice
Lambda Architectures in Practice
C4Media
 
Microservices, Containers, and Machine Learning
Microservices, Containers, and Machine LearningMicroservices, Containers, and Machine Learning
Microservices, Containers, and Machine Learning
Paco Nathan
 
QCon São Paulo: Real-Time Analytics with Spark Streaming
QCon São Paulo: Real-Time Analytics with Spark StreamingQCon São Paulo: Real-Time Analytics with Spark Streaming
QCon São Paulo: Real-Time Analytics with Spark Streaming
Paco Nathan
 
Introduction to Apache Spark
Introduction to Apache Spark Introduction to Apache Spark
Introduction to Apache Spark
Hubert Fan Chiang
 
Big Data (NJ SQL Server User Group)
Big Data (NJ SQL Server User Group)Big Data (NJ SQL Server User Group)
Big Data (NJ SQL Server User Group)Don Demcsak
 
Cassandra spark connector
Cassandra spark connectorCassandra spark connector
Cassandra spark connector
Duyhai Doan
 
Analyzing Time Series Data with Apache Spark and Cassandra
Analyzing Time Series Data with Apache Spark and CassandraAnalyzing Time Series Data with Apache Spark and Cassandra
Analyzing Time Series Data with Apache Spark and Cassandra
Patrick McFadin
 
Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...
Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...
Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...
Databricks
 
Challenging Web-Scale Graph Analytics with Apache Spark with Xiangrui Meng
Challenging Web-Scale Graph Analytics with Apache Spark with Xiangrui MengChallenging Web-Scale Graph Analytics with Apache Spark with Xiangrui Meng
Challenging Web-Scale Graph Analytics with Apache Spark with Xiangrui Meng
Databricks
 
Rapid Prototyping in PySpark Streaming: The Thermodynamics of Docker Containe...
Rapid Prototyping in PySpark Streaming: The Thermodynamics of Docker Containe...Rapid Prototyping in PySpark Streaming: The Thermodynamics of Docker Containe...
Rapid Prototyping in PySpark Streaming: The Thermodynamics of Docker Containe...
Richard Seymour
 
Spark and Cassandra: An Amazing Apache Love Story by Patrick McFadin
Spark and Cassandra: An Amazing Apache Love Story by Patrick McFadinSpark and Cassandra: An Amazing Apache Love Story by Patrick McFadin
Spark and Cassandra: An Amazing Apache Love Story by Patrick McFadin
Spark Summit
 
Apache Spark: The Next Gen toolset for Big Data Processing
Apache Spark: The Next Gen toolset for Big Data ProcessingApache Spark: The Next Gen toolset for Big Data Processing
Apache Spark: The Next Gen toolset for Big Data Processing
prajods
 
Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming
Djamel Zouaoui
 
Sqoop on Spark for Data Ingestion-(Veena Basavaraj and Vinoth Chandar, Uber)
Sqoop on Spark for Data Ingestion-(Veena Basavaraj and Vinoth Chandar, Uber)Sqoop on Spark for Data Ingestion-(Veena Basavaraj and Vinoth Chandar, Uber)
Sqoop on Spark for Data Ingestion-(Veena Basavaraj and Vinoth Chandar, Uber)
Spark Summit
 
Strata 2015 Data Preview: Spark, Data Visualization, YARN, and More
Strata 2015 Data Preview: Spark, Data Visualization, YARN, and MoreStrata 2015 Data Preview: Spark, Data Visualization, YARN, and More
Strata 2015 Data Preview: Spark, Data Visualization, YARN, and More
Paco Nathan
 
Spark Streaming: Pushing the throughput limits by Francois Garillot and Gerar...
Spark Streaming: Pushing the throughput limits by Francois Garillot and Gerar...Spark Streaming: Pushing the throughput limits by Francois Garillot and Gerar...
Spark Streaming: Pushing the throughput limits by Francois Garillot and Gerar...
Spark Summit
 

What's hot (20)

Optimizing Delta/Parquet Data Lakes for Apache Spark
Optimizing Delta/Parquet Data Lakes for Apache SparkOptimizing Delta/Parquet Data Lakes for Apache Spark
Optimizing Delta/Parquet Data Lakes for Apache Spark
 
The Hidden Life of Spark Jobs
The Hidden Life of Spark JobsThe Hidden Life of Spark Jobs
The Hidden Life of Spark Jobs
 
Map reduce vs spark
Map reduce vs sparkMap reduce vs spark
Map reduce vs spark
 
Cassandra & Spark for IoT
Cassandra & Spark for IoTCassandra & Spark for IoT
Cassandra & Spark for IoT
 
Lambda Architectures in Practice
Lambda Architectures in PracticeLambda Architectures in Practice
Lambda Architectures in Practice
 
Microservices, Containers, and Machine Learning
Microservices, Containers, and Machine LearningMicroservices, Containers, and Machine Learning
Microservices, Containers, and Machine Learning
 
QCon São Paulo: Real-Time Analytics with Spark Streaming
QCon São Paulo: Real-Time Analytics with Spark StreamingQCon São Paulo: Real-Time Analytics with Spark Streaming
QCon São Paulo: Real-Time Analytics with Spark Streaming
 
Introduction to Apache Spark
Introduction to Apache Spark Introduction to Apache Spark
Introduction to Apache Spark
 
Big Data (NJ SQL Server User Group)
Big Data (NJ SQL Server User Group)Big Data (NJ SQL Server User Group)
Big Data (NJ SQL Server User Group)
 
Cassandra spark connector
Cassandra spark connectorCassandra spark connector
Cassandra spark connector
 
Analyzing Time Series Data with Apache Spark and Cassandra
Analyzing Time Series Data with Apache Spark and CassandraAnalyzing Time Series Data with Apache Spark and Cassandra
Analyzing Time Series Data with Apache Spark and Cassandra
 
Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...
Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...
Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...
 
Challenging Web-Scale Graph Analytics with Apache Spark with Xiangrui Meng
Challenging Web-Scale Graph Analytics with Apache Spark with Xiangrui MengChallenging Web-Scale Graph Analytics with Apache Spark with Xiangrui Meng
Challenging Web-Scale Graph Analytics with Apache Spark with Xiangrui Meng
 
Rapid Prototyping in PySpark Streaming: The Thermodynamics of Docker Containe...
Rapid Prototyping in PySpark Streaming: The Thermodynamics of Docker Containe...Rapid Prototyping in PySpark Streaming: The Thermodynamics of Docker Containe...
Rapid Prototyping in PySpark Streaming: The Thermodynamics of Docker Containe...
 
Spark and Cassandra: An Amazing Apache Love Story by Patrick McFadin
Spark and Cassandra: An Amazing Apache Love Story by Patrick McFadinSpark and Cassandra: An Amazing Apache Love Story by Patrick McFadin
Spark and Cassandra: An Amazing Apache Love Story by Patrick McFadin
 
Apache Spark: The Next Gen toolset for Big Data Processing
Apache Spark: The Next Gen toolset for Big Data ProcessingApache Spark: The Next Gen toolset for Big Data Processing
Apache Spark: The Next Gen toolset for Big Data Processing
 
Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming
 
Sqoop on Spark for Data Ingestion-(Veena Basavaraj and Vinoth Chandar, Uber)
Sqoop on Spark for Data Ingestion-(Veena Basavaraj and Vinoth Chandar, Uber)Sqoop on Spark for Data Ingestion-(Veena Basavaraj and Vinoth Chandar, Uber)
Sqoop on Spark for Data Ingestion-(Veena Basavaraj and Vinoth Chandar, Uber)
 
Strata 2015 Data Preview: Spark, Data Visualization, YARN, and More
Strata 2015 Data Preview: Spark, Data Visualization, YARN, and MoreStrata 2015 Data Preview: Spark, Data Visualization, YARN, and More
Strata 2015 Data Preview: Spark, Data Visualization, YARN, and More
 
Spark Streaming: Pushing the throughput limits by Francois Garillot and Gerar...
Spark Streaming: Pushing the throughput limits by Francois Garillot and Gerar...Spark Streaming: Pushing the throughput limits by Francois Garillot and Gerar...
Spark Streaming: Pushing the throughput limits by Francois Garillot and Gerar...
 

Similar to Adding Complex Data to Spark Stack-(Neeraja Rentachintala, MapR)

Adding Complex Data to Spark Stack by Tug Grall
Adding Complex Data to Spark Stack by Tug GrallAdding Complex Data to Spark Stack by Tug Grall
Adding Complex Data to Spark Stack by Tug Grall
Spark Summit
 
Apache drill self service data exploration (113)
Apache drill   self service data exploration (113)Apache drill   self service data exploration (113)
Apache drill self service data exploration (113)
MapR Technologies
 
Introduction to Apache Drill - NYC Apache Drill Meetup
Introduction to Apache Drill - NYC Apache Drill MeetupIntroduction to Apache Drill - NYC Apache Drill Meetup
Introduction to Apache Drill - NYC Apache Drill Meetup
Vince Gonzalez
 
SQL-on-Hadoop with Apache Drill
SQL-on-Hadoop with Apache DrillSQL-on-Hadoop with Apache Drill
SQL-on-Hadoop with Apache Drill
MapR Technologies
 
Self service data exploration with apache drill
Self service data exploration with apache drillSelf service data exploration with apache drill
Self service data exploration with apache drill
MapR Technologies
 
Free Code Friday: Drill 101 - Basics of Apache Drill
Free Code Friday: Drill 101 - Basics of Apache DrillFree Code Friday: Drill 101 - Basics of Apache Drill
Free Code Friday: Drill 101 - Basics of Apache Drill
MapR Technologies
 
What and Why and How: Apache Drill ! - Tugdual Grall
What and Why and How: Apache Drill ! - Tugdual GrallWhat and Why and How: Apache Drill ! - Tugdual Grall
What and Why and How: Apache Drill ! - Tugdual Grall
distributed matters
 
Analyzing Real-World Data with Apache Drill
Analyzing Real-World Data with Apache DrillAnalyzing Real-World Data with Apache Drill
Analyzing Real-World Data with Apache DrillTomer Shiran
 
HUG Italy meet-up with Tugdual Grall, MapR Technical Evangelist
HUG Italy meet-up with Tugdual Grall, MapR Technical EvangelistHUG Italy meet-up with Tugdual Grall, MapR Technical Evangelist
HUG Italy meet-up with Tugdual Grall, MapR Technical Evangelist
SpagoWorld
 
Rethinking SQL for Big Data with Apache Drill
Rethinking SQL for Big Data with Apache DrillRethinking SQL for Big Data with Apache Drill
Rethinking SQL for Big Data with Apache Drill
MapR Technologies
 
Analyzing Real-World Data with Apache Drill
Analyzing Real-World Data with Apache DrillAnalyzing Real-World Data with Apache Drill
Analyzing Real-World Data with Apache Drill
tshiran
 
Lambda Architecture: The Best Way to Build Scalable and Reliable Applications!
Lambda Architecture: The Best Way to Build Scalable and Reliable Applications!Lambda Architecture: The Best Way to Build Scalable and Reliable Applications!
Lambda Architecture: The Best Way to Build Scalable and Reliable Applications!
Tugdual Grall
 
Couchbase Tutorial: Big data Open Source Systems: VLDB2018
Couchbase Tutorial: Big data Open Source Systems: VLDB2018Couchbase Tutorial: Big data Open Source Systems: VLDB2018
Couchbase Tutorial: Big data Open Source Systems: VLDB2018
Keshav Murthy
 
N1QL+GSI: Language and Performance Improvements in Couchbase 5.0 and 5.5
N1QL+GSI: Language and Performance Improvements in Couchbase 5.0 and 5.5N1QL+GSI: Language and Performance Improvements in Couchbase 5.0 and 5.5
N1QL+GSI: Language and Performance Improvements in Couchbase 5.0 and 5.5
Keshav Murthy
 
Terraform Tips and Tricks - LAOUC 2022
Terraform Tips and Tricks - LAOUC 2022Terraform Tips and Tricks - LAOUC 2022
Terraform Tips and Tricks - LAOUC 2022
Nelson Calero
 
Why NoSQL Makes Sense
Why NoSQL Makes SenseWhy NoSQL Makes Sense
Why NoSQL Makes Sense
MongoDB
 
Why NoSQL Makes Sense
Why NoSQL Makes SenseWhy NoSQL Makes Sense
Why NoSQL Makes Sense
MongoDB
 
Drilling on JSON
Drilling on JSONDrilling on JSON
Drilling on JSON
Keshav Murthy
 
Drill 1.0
Drill 1.0Drill 1.0
Practical JSON in MySQL 5.7 and Beyond
Practical JSON in MySQL 5.7 and BeyondPractical JSON in MySQL 5.7 and Beyond
Practical JSON in MySQL 5.7 and Beyond
Ike Walker
 

Similar to Adding Complex Data to Spark Stack-(Neeraja Rentachintala, MapR) (20)

Adding Complex Data to Spark Stack by Tug Grall
Adding Complex Data to Spark Stack by Tug GrallAdding Complex Data to Spark Stack by Tug Grall
Adding Complex Data to Spark Stack by Tug Grall
 
Apache drill self service data exploration (113)
Apache drill   self service data exploration (113)Apache drill   self service data exploration (113)
Apache drill self service data exploration (113)
 
Introduction to Apache Drill - NYC Apache Drill Meetup
Introduction to Apache Drill - NYC Apache Drill MeetupIntroduction to Apache Drill - NYC Apache Drill Meetup
Introduction to Apache Drill - NYC Apache Drill Meetup
 
SQL-on-Hadoop with Apache Drill
SQL-on-Hadoop with Apache DrillSQL-on-Hadoop with Apache Drill
SQL-on-Hadoop with Apache Drill
 
Self service data exploration with apache drill
Self service data exploration with apache drillSelf service data exploration with apache drill
Self service data exploration with apache drill
 
Free Code Friday: Drill 101 - Basics of Apache Drill
Free Code Friday: Drill 101 - Basics of Apache DrillFree Code Friday: Drill 101 - Basics of Apache Drill
Free Code Friday: Drill 101 - Basics of Apache Drill
 
What and Why and How: Apache Drill ! - Tugdual Grall
What and Why and How: Apache Drill ! - Tugdual GrallWhat and Why and How: Apache Drill ! - Tugdual Grall
What and Why and How: Apache Drill ! - Tugdual Grall
 
Analyzing Real-World Data with Apache Drill
Analyzing Real-World Data with Apache DrillAnalyzing Real-World Data with Apache Drill
Analyzing Real-World Data with Apache Drill
 
HUG Italy meet-up with Tugdual Grall, MapR Technical Evangelist
HUG Italy meet-up with Tugdual Grall, MapR Technical EvangelistHUG Italy meet-up with Tugdual Grall, MapR Technical Evangelist
HUG Italy meet-up with Tugdual Grall, MapR Technical Evangelist
 
Rethinking SQL for Big Data with Apache Drill
Rethinking SQL for Big Data with Apache DrillRethinking SQL for Big Data with Apache Drill
Rethinking SQL for Big Data with Apache Drill
 
Analyzing Real-World Data with Apache Drill
Analyzing Real-World Data with Apache DrillAnalyzing Real-World Data with Apache Drill
Analyzing Real-World Data with Apache Drill
 
Lambda Architecture: The Best Way to Build Scalable and Reliable Applications!
Lambda Architecture: The Best Way to Build Scalable and Reliable Applications!Lambda Architecture: The Best Way to Build Scalable and Reliable Applications!
Lambda Architecture: The Best Way to Build Scalable and Reliable Applications!
 
Couchbase Tutorial: Big data Open Source Systems: VLDB2018
Couchbase Tutorial: Big data Open Source Systems: VLDB2018Couchbase Tutorial: Big data Open Source Systems: VLDB2018
Couchbase Tutorial: Big data Open Source Systems: VLDB2018
 
N1QL+GSI: Language and Performance Improvements in Couchbase 5.0 and 5.5
N1QL+GSI: Language and Performance Improvements in Couchbase 5.0 and 5.5N1QL+GSI: Language and Performance Improvements in Couchbase 5.0 and 5.5
N1QL+GSI: Language and Performance Improvements in Couchbase 5.0 and 5.5
 
Terraform Tips and Tricks - LAOUC 2022
Terraform Tips and Tricks - LAOUC 2022Terraform Tips and Tricks - LAOUC 2022
Terraform Tips and Tricks - LAOUC 2022
 
Why NoSQL Makes Sense
Why NoSQL Makes SenseWhy NoSQL Makes Sense
Why NoSQL Makes Sense
 
Why NoSQL Makes Sense
Why NoSQL Makes SenseWhy NoSQL Makes Sense
Why NoSQL Makes Sense
 
Drilling on JSON
Drilling on JSONDrilling on JSON
Drilling on JSON
 
Drill 1.0
Drill 1.0Drill 1.0
Drill 1.0
 
Practical JSON in MySQL 5.7 and Beyond
Practical JSON in MySQL 5.7 and BeyondPractical JSON in MySQL 5.7 and Beyond
Practical JSON in MySQL 5.7 and Beyond
 

More from Spark Summit

FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
Spark Summit
 
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
Spark Summit
 
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang WuApache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Spark Summit
 
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Spark Summit
 
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
Spark Summit
 
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
Spark Summit
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
Spark Summit
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
Spark Summit
 
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
Spark Summit
 
Next CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakNext CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub Wozniak
Spark Summit
 
Powering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimPowering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin Kim
Spark Summit
 
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraImproving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Spark Summit
 
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Spark Summit
 
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
Spark Summit
 
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spark Summit
 
Goal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovGoal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim Simeonov
Spark Summit
 
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Spark Summit
 
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkGetting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Spark Summit
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Spark Summit
 
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
Spark Summit
 

More from Spark Summit (20)

FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
 
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
 
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang WuApache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
 
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
 
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
 
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
 
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
 
Next CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakNext CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub Wozniak
 
Powering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimPowering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin Kim
 
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraImproving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
 
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
 
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
 
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
 
Goal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovGoal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim Simeonov
 
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
 
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkGetting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir Volk
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
 
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
 

Recently uploaded

哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
axoqas
 
Q1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year ReboundQ1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year Rebound
Oppotus
 
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...
pchutichetpong
 
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
74nqk8xf
 
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
AbhimanyuSinha9
 
Influence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business PlanInfluence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business Plan
jerlynmaetalle
 
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
u86oixdj
 
Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)
TravisMalana
 
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
ewymefz
 
Quantitative Data AnalysisReliability Analysis (Cronbach Alpha) Common Method...
Quantitative Data AnalysisReliability Analysis (Cronbach Alpha) Common Method...Quantitative Data AnalysisReliability Analysis (Cronbach Alpha) Common Method...
Quantitative Data AnalysisReliability Analysis (Cronbach Alpha) Common Method...
2023240532
 
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
dwreak4tg
 
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
slg6lamcq
 
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
Timothy Spann
 
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
Timothy Spann
 
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Subhajit Sahu
 
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdfCh03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
haila53
 
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
oz8q3jxlp
 
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
ahzuo
 
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
mbawufebxi
 
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Subhajit Sahu
 

Recently uploaded (20)

哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
 
Q1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year ReboundQ1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year Rebound
 
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...
 
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
 
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
 
Influence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business PlanInfluence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business Plan
 
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
 
Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)
 
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
 
Quantitative Data AnalysisReliability Analysis (Cronbach Alpha) Common Method...
Quantitative Data AnalysisReliability Analysis (Cronbach Alpha) Common Method...Quantitative Data AnalysisReliability Analysis (Cronbach Alpha) Common Method...
Quantitative Data AnalysisReliability Analysis (Cronbach Alpha) Common Method...
 
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
 
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
 
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
 
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
 
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
 
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdfCh03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
 
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
 
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
 
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
 
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
 

Adding Complex Data to Spark Stack-(Neeraja Rentachintala, MapR)

  • 1. ® © 2014 MapR Technologies 1 ® © 2014 MapR Technologies Adding complex data to Spark stack Neeraja Rentachintala, Director of Product Management 6/16/2015
  • 2. ® © 2014 MapR Technologies 2 COMPLEX/ SEMI-STRUCTURED DATA STRUCTURED DATA 1980 2000 20101990 2020 Complex data is growing Unstructured data will account for more than 80% of the data collected by organizations Source: Human-Computer Interaction & Knowledge Discovery in Complex Unstructured, Big Data TotalDataStored
  • 3. ® © 2014 MapR Technologies 3 1980 2000 20101990 2020 Fixed schema DBA controls structure Dynamic / Flexible schema Application controls structure NON-RELATIONAL DATASTORESRELATIONAL DATABASES GBs-TBs TBs-PBsVolume Database Data Increasingly Stored in Non-Relational Datastores Structure Development Structured Structured, semi-structured and unstructured Planned (release cycle = months-years) Iterative (release cycle = days-weeks)
  • 4. ® © 2014 MapR Technologies 4 Session topics •  Apache Drill overview •  Integrating Drill & Spark •  Status & Next steps
  • 5. ® © 2014 MapR Technologies 5 Drill is the Industry’s first Schema-Free SQL engine Point-and-query vs. schema-first Access to any data source & types Industry standard APIs (ANSI SQL, JDBC/ODBC, REST) Performance at Scale Extreme Ease of Use
  • 6. ® © 2014 MapR Technologies 6 -  Sub-directory -  HBase namespace -  Hive database Drill enables ‘SQL on Everything’ (Omni-SQL) SELECT  *  FROM  dfs.yelp.`business.json`  ! Workspace -  Pathnames -  Hive table -  HBase table Table -  File system (Text, Parquet, JSON) -  HBase/MapR-DB -  Hive Metastore/HCatalog - Easy API to go beyond Hadoop Storage plugin instance
  • 7. ® © 2014 MapR Technologies 7 Drill is a Distributed SQL query engine drillbit   DataNode/ RegionServer   drillbit   DataNode/ RegionServer   drillbit   DataNode/ RegionServer   ZooKeeper ZooKeeper ZooKeeper … Ø  Scale-out (single node to 1000’s of nodes) Ø  Columnar and Vectorized execution Ø  Optimistic execution (no MR, Spark, Tez) Ø  Extensible
  • 8. ® © 2014 MapR Technologies 8 Core Modules within drillbit   SQL Parser Hive HBase StoragePlugins MongoDB DFS PhysicalPlan Execution LogicalPlan Optimizer RPC Endpoint
  • 9. ® © 2014 MapR Technologies 9 Integrating Spark and Drill Features at a glance: •  Use Drill as an input to Spark •  Query Spark RDDs via Drill and create data pipelines Disk (DFS) Memory RDD Files Files
  • 10. ® © 2014 MapR Technologies 10© 2014 MapR Technologies ® Drill examples walkthrough
  • 11. ® © 2014 MapR Technologies 11 business.json {    "business_id":  "4bEjOyTaDG24SY5TxsaUNQ",    "full_address":  "3655  Las  Vegas  Blvd  SnThe  StripnLas  Vegas,  NV  89109",    "hours":  {      "Monday":  {"close":  "23:00",  "open":  "07:00"},      "Tuesday":  {"close":  "23:00",  "open":  "07:00"},      "Friday":  {"close":  "00:00",  "open":  "07:00"},      "Wednesday":  {"close":  "23:00",  "open":  "07:00"},      "Thursday":  {"close":  "23:00",  "open":  "07:00"},      "Sunday":  {"close":  "23:00",  "open":  "07:00"},      "Saturday":  {"close":  "00:00",  "open":  "07:00"}    },    "open":  true,    "categories":  ["Breakfast  &  Brunch",  "Steakhouses",  "French",  "Restaurants"],    "city":  "Las  Vegas",    "review_count":  4084,    "name":  "Mon  Ami  Gabi",    "neighborhoods":  ["The  Strip"],    "longitude":  -­‐115.172588519464,    "state":  "NV",    "stars":  4.0,      "attributes":  {      "Alcohol":  "full_bar”,        "Noise  Level":  "average",      "Has  TV":  false,      "Attire":  "casual",      "Ambience":  {        "romantic":  true,        "intimate":  false,        "touristy":  false,        "hipster":  false,          "classy":  true,        "trendy":  false,          "casual":  false      },      "Good  For":  {"dessert":  false,  "latenight":  false,  "lunch":  false,                                                  "dinner":  true,  "breakfast":  false,  "brunch":  false},    }   }  
  • 12. ® © 2014 MapR Technologies 12 review.json {      "votes":  {"funny":  0,  "useful":  2,  "cool":  1},      "user_id":  "Xqd0DzHaiyRqVH3WRG7hzg",      "review_id":  "15SdjuK7DmYqUAj6rjGowg",      "stars":  5,      "date":  "2007-­‐05-­‐17",      "text":  "dr.  goldberg  offers  everything  ...",      "type":  "review",      "business_id":  "vcNAWiLM4dR7D2nwwJ7nCA"   }  
  • 13. ® © 2014 MapR Technologies 13 Zero to Results in 2 minutes $  tar  -­‐xvzf  apache-­‐drill-­‐1.0.0.tar.gz     $  bin/sqlline  -­‐u  jdbc:drill:zk=local     >  SELECT  state,  city,  count(*)  AS  businesses      FROM  dfs.yelp.`business.json`      GROUP  BY  state,  city      ORDER  BY  businesses  DESC  LIMIT  10;     +-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐+-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐+-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐+   |      state        |        city        |    businesses  |   +-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐+-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐+-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐+   |  NV                  |  Las  Vegas    |  12021              |   |  AZ                  |  Phoenix        |  7499                |   |  AZ                  |  Scottsdale  |  3605                |   |  EDH                |  Edinburgh    |  2804                |   |  AZ                  |  Mesa              |  2041                |   |  AZ                  |  Tempe            |  2025                |   |  NV                  |  Henderson    |  1914                |   |  AZ                  |  Chandler      |  1637                |   |  WI                  |  Madison        |  1630                |   |  AZ                  |  Glendale      |  1196                |   +-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐+-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐+-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐+   Install   Query  files   and   directories   Results   Launch  shell   (embedded   mode)  
  • 14. ® © 2014 MapR Technologies 14 Directories are implicit partitions   SELECT  dir0,  SUM(amount)   FROM  sales   GROUP  BY  dir1  IN  (q1,  q2)         sales   ├──  2014   │      ├──  q1   │      ├──  q2   │      ├──  q3   │      └──  q4   └──  2015          └──  q1  
  • 15. ® © 2014 MapR Technologies 15 Intuitive SQL access to complex data //  It’s  Friday  10pm  in  Vegas  and  looking  for  Hummus     >  SELECT  name,  stars,  b.hours.Friday  friday,  categories      FROM  dfs.yelp.`business.json`  b      WHERE  b.hours.Friday.`open`  <  '22:00'  AND                  b.hours.Friday.`close`  >  '22:00'  AND                  REPEATED_CONTAINS(categories,  'Mediterranean')  AND                  city  =  'Las  Vegas'      ORDER  BY  stars  DESC      LIMIT  2;     +-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐+-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐+-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐+-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐+   |        name        |      stars        |      friday      |  categories  |   +-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐+-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐+-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐+-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐+   |  Olives          |  4.0                |  {"close":"22:30","open":"11:00"}  |   ["Mediterranean","Restaurants"]  |   |  Marrakech  Moroccan  Restaurant  |  4.0                |  {"close":"23:00","open":"17:30"}  |   ["Mediterranean","Middle  Eastern","Moroccan","Restaurants"]  |   +-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐+-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐+-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐+-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐+   Query  data   with  any   levels  of   nesting  
  • 16. ® © 2014 MapR Technologies 16 ANSI SQL compatibility //Get  top  cool  rated  businesses     Ø  SELECT  b.name  from  dfs.yelp.`business.json`  b          WHERE  b.business_id  IN      (SELECT  r.business_id  FROM  dfs.yelp.`review.json`  r        GROUP  BY  r.business_id  HAVING  SUM(r.votes.cool)  >  2000  ORDER  BY          SUM(r.votes.cool)  DESC);     +-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐+   |        name  |   +-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐+   |  Earl  of  Sandwich  |   |  XS  Nightclub  |   |  The  Cosmopolitan  of  Las  Vegas  |   |  Wicked  Spoon  |   +-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐+   Use  familiar  SQL   functionality   (Joins,   Aggregations,   Sorting,  Sub-­‐ queries,  SQL  data   types)  
  • 17. ® © 2014 MapR Technologies 17 Logical views //Create  a  view  combining  business  and  reviews  datasets     >  CREATE  OR  REPLACE  VIEW  dfs.tmp.BusinessReviews  AS          SELECT  b.name,  b.stars,  r.votes.funny,                        r.votes.useful,  r.votes.cool,  r.`date`              FROM  dfs.yelp.`business.json`  b,  dfs.yelp.`review.json`  r              WHERE  r.business_id  =  b.business_id;     +-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐+-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐+   |          ok          |    summary      |   +-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐+-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐+   |  true              |  View  'BusinessReviews'  created  successfully  in  'dfs.tmp'  schema  |   +-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐+-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐+     >  SELECT  COUNT(*)  AS  Total  FROM  dfs.tmp.BusinessReviews;     +------------+ | Total | +------------+ | 1125458 | +------------+ Lightweight  file   system  based   views  for   granular  and  de-­‐ centralized  data   management  
  • 18. ® © 2014 MapR Technologies 18 Materialized Views AKA Tables >  ALTER  SESSION  SET  `store.format`  =  'parquet';     >  CREATE  TABLE  dfs.yelp.BusinessReviewsTbl  AS          SELECT  b.name,  b.stars,  r.votes.funny  funny,                        r.votes.useful  useful,  r.votes.cool  cool,  r.`date`              FROM  dfs.yelp.`business.json`  b,  dfs.yelp.`review.json`  r              WHERE  r.business_id  =  b.business_id;     +-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐+-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐+   |    Fragment    |  Number  of  records  written  |   +-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐+-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐+   |  1_0                |  176448                                        |   |  1_1                |  192439                                        |   |  1_2                |  198625                                        |   |  1_3                |  200863                                        |   |  1_4                |  181420                                        |   |  1_5                |  175663                                        |   +-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐+-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐+   Save  analysis   results  as   tables  using   familiar  CTAS   syntax  
  • 19. ® © 2014 MapR Technologies 19 Extensions to ANSI SQL to work with repeated values //  Flatten  repeated  categories       >  SELECT  name,  categories      FROM  dfs.yelp.`business.json`  LIMIT  3;     +-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐+-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐+   |        name        |  categories  |   +-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐+-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐+   |  Eric  Goldberg,  MD  |  ["Doctors","Health  &  Medical"]  |   |  Pine  Cone  Restaurant  |  ["Restaurants"]  |   |  Deforest  Family  Restaurant  |  ["American  (Traditional)","Restaurants"]  |   +-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐+-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐+     >  SELECT  name,  FLATTEN(categories)  AS  categories      FROM  dfs.yelp.`business.json`  LIMIT  5;   +-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐+-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐+   |        name        |  categories  |   +-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐+-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐+   |  Eric  Goldberg,  MD  |  Doctors        |   |  Eric  Goldberg,  MD  |  Health  &  Medical  |   |  Pine  Cone  Restaurant  |  Restaurants  |   |  Deforest  Family  Restaurant  |  American  (Traditional)  |   |  Deforest  Family  Restaurant  |  Restaurants  |   +-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐+-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐+   Dynamically   flatten  repeated   and  nested  data   elements  as  part   of  SQL  queries.   No  ETL  necessary  
  • 20. ® © 2014 MapR Technologies 20 Extensions to ANSI SQL to work with repeated values //  Get  most  common  business  categories       >SELECT  category,  count(*)  AS  categorycount      FROM  (SELECT  name,  FLATTEN(categories)  AS  category                  FROM  dfs.yelp.`business.json`)  c      GROUP  BY  category  ORDER  BY  categorycount  DESC;     +-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐+-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐+   |    category    |  categorycount|   +-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐+-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐+   |  Restaurants  |  14303            |   …   |  Australian  |  1                    |   |  Boat  Dealers  |  1                    |   |  Firewood      |  1                    |   +-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐+-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐+  
  • 21. ® © 2014 MapR Technologies 21 Extensions to ANSI SQL to work with embedded JSON -­‐-­‐  embedded  JSON  value  inside  column  donutjson  inside  column-­‐ family  cf1    of  an  hbase  table  donuts   SELECT      d.name,  COUNT(d.fillings)   FROM  (    SELECT  convert_from(cf1.donutjson,  JSON)  as  d    FROM  hbase.donuts);   !
  • 22. ® © 2014 MapR Technologies 22© 2014 MapR Technologies ® Working with Dynamic Columns
  • 23. ® © 2014 MapR Technologies 23 checkin.json check-­‐in   {          'type':  'checkin',          'business_id':  (encrypted  business  id),          'checkin_info':  {                  '0-­‐0':  (number  of  checkins  from  00:00  to  01:00  on  all  Sundays),                  '1-­‐0':  (number  of  checkins  from  01:00  to  02:00  on  all  Sundays),                  ...                  '14-­‐4':  (number  of  checkins  from  14:00  to  15:00  on  all  Thursdays),                  …                  '23-­‐6':  (number  of  checkins  from  23:00  to  00:00  on  all  Saturdays)          },  #  if  there  was  no  checkin  for  a  hour-­‐day  block  it  will  not  be  in  the   dataset   }  
  • 24. ® © 2014 MapR Technologies 24 Makes it easy to work with dynamic/unknown columns >  jdbc:drill:zk=local>  SELECT  KVGEN(checkin_info)  checkins          FROM  dfs.yelp.`checkin.json`  LIMIT  1;   +-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐+   |    checkins    |   +-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐+   |  [{"key":"3-­‐4","value":1},{"key":"13-­‐5","value":1},{"key":"6-­‐6","value":1},{"key":"14-­‐5","value":1}, {"key":"14-­‐6","value":1},{"key":"14-­‐2","value":1},{"key":"14-­‐3","value":1},{"key":"19-­‐0","value":1}, {"key":"11-­‐5","value":1},{"key":"13-­‐2","value":1},{"key":"11-­‐6","value":2},{"key":"11-­‐3","value":1}, {"key":"12-­‐6","value":1},{"key":"6-­‐5","value":1},{"key":"5-­‐5","value":1},{"key":"9-­‐2","value":1}, {"key":"9-­‐5","value":1},{"key":"9-­‐6","value":1},{"key":"5-­‐2","value":1},{"key":"7-­‐6","value":1}, {"key":"7-­‐5","value":1},{"key":"7-­‐4","value":1},{"key":"17-­‐5","value":1},{"key":"8-­‐5","value":1}, {"key":"10-­‐2","value":1},{"key":"10-­‐5","value":1},{"key":"10-­‐6","value":1}]  |   +-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐+     >  jdbc:drill:zk=local>  SELECT  FLATTEN(KVGEN(checkin_info))   checkins  FROM  dfs.yelp.`checkin.json`  limit  6;     +-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐+   |    checkins    |   +-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐+   |  {"key":"3-­‐4","value":1}  |   |  {"key":"13-­‐5","value":1}  |   |  {"key":"6-­‐6","value":1}  |   |  {"key":"14-­‐5","value":1}  |   |  {"key":"14-­‐6","value":1}  |   |  {"key":"14-­‐2","value":1}  |   +-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐+   Convert  Map  with   a  wide  set  of   dynamic  columns   into  an  array  of   key-­‐value  pairs  
  • 25. ® © 2014 MapR Technologies 25 Makes it easy to work with dynamic/unknown columns //  Count  total  number  of  checkins  on  Sunday  midnight     Ø  jdbc:drill:zk=local>  SELECT  SUM(checkintbl.checkins.`value`)  as   SundayMidnightCheckins  FROM          (SELECT  FLATTEN(KVGEN(checkin_info))  checkins          FROM  dfs.yelp.checkin.json`)  checkintbl            WHERE  checkintbl.checkins.key='23-­‐0';     +-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐+   |  SundayMidnightCheckins  |   +-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐+   |  8575                                      |   +-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐+  
  • 26. ® © 2014 MapR Technologies 26© 2014 MapR Technologies ® Combining Drill & Spark together
  • 27. ® © 2014 MapR Technologies 27 Using Drill as an input to Spark Jobs object  DosDemo  extends  App     {               val  conf  =  new  SparkConf().setAppName("DrillSql").setMaster("local")           val  sc  =  new  SparkContext(conf)                 def  queryOne():  Unit  =   {       val  sql  =  "SELECT  name,  full_address,  hours,  categories  FROM  `business.json`  where   repeated_contains(categories,’Restaurants’)  "           val  result  =  sc.drillRDD(sql)           val  formatted  =  result.map  {  r  =>           val  (name,  address,  open,  close)  =  (r.name,  r.full_address,  r.hours.friday.open,  r.hours.friday,close)           s"$name$address$open$close”      }               formatted.foreach(println)                  }     Create  RDDs  from   Drill  queries   Access  nested   data  elements   intuitively  
  • 28. ® © 2014 MapR Technologies 28 Query Spark RDDs via Drill def  queryTwo():  Unit  =  {               var  easy  =  sc.parallelize(0  until  10)           val  squared  =  easy.map  {  i  =>           CObject(("index",  i),  ("value",  i*i))     }           sc.registerAsTable("squared",  squared)           val  sql  =  "select  *  from  spark.squared  where  index>3  and  index<10  limit  5"           val  result  =  sc.drillRDD(sql)           result.foreach(println)        }                 }     Construct   complex  objects   as  part  of  the   Spark  pipelines   Operate  on  the   complex  data   directly  using   Drill  queries  
  • 29. ® © 2014 MapR Technologies 29 Combine benefits of Drill & Spark •  Flexibility (schema-discovery, complex data, UDFs) •  Rich storage plugin support •  Columnar execution •  ANSI SQL •  Rapid application development •  Advanced analytic workflows •  Efficient batch processing
  • 30. ® © 2014 MapR Technologies 30 Status and next steps •  Functional integration complete •  Beta release coming soon •  Next steps –  Improve memory efficiencies –  Improve serialization efficiencies –  Improve task and data locality •  Ask questions and/or contribute –  user@drill.apache.org –  dev@drill.apache.org
  • 31. ® © 2014 MapR Technologies 31 Recommendations On Trying and Using Drill New to Drill? –  Get started with Free MapR On Demand training –  Test Drive Drill on cloud with AWS –  Learn how to use Drill with Hadoop using MapR sandbox Ready to play with your data? –  Try out Apache Drill in 10 mins guide on your desktop –  Download Drill for your cluster and start exploration –  Comprehensive tutorials and documentation available