The Heterogeneous Data Lake

DREMIO
The Heterogeneous Data Lake
Tomer Shiran, Co-Founder & CEO at Dremio
tshiran@dremio.com | @tshiran
Hadoop Summit Europe 2016
April 13, 2016

Company Background
Jacques Nadeau, Founder & CTO
• Recognized SQL & NoSQL expert
• Apache Arrow & Drill PMC Chair
• Quigo (AOL); Offermatica (ADBE); aQuantive (MSFT)
Tomer Shiran, Founder & CEO
• MapR (VP Product); Microsoft; IBM Research
• Apache Drill Founder
• Carnegie Mellon, Technion
Julien Le Dem, Architect
• Apache Parquet Founder
• Apache Pig PMC Member
• Twitter (Lead, Analytics Data Pipeline); Yahoo! (Architect)
Top Silicon Valley VCs
• Stealth data analytics startup
• Founded in 2015
• Led by experts in Big Data and open source

The Rise of Heterogeneous Data Infrastructure
(Timeline figure: 1980 to 2016)

Can’t Simply Connect a BI Tool…
• Too slow for interactive analysis
• Manual process to map data to relational model
• NoSQL data often inconsistent & unclean (e.g., mixed types)
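The "mixed types" problem can be made concrete with a small sketch. The documents and field names below are hypothetical, invented only to show how the same field can carry different types from one record to the next, which is what breaks a naive mapping to relational columns.

```python
from collections import defaultdict

# Hypothetical NoSQL documents: the same field appears with different types,
# and some fields are missing entirely.
docs = [
    {"business_id": 123, "zip": 94040},
    {"business_id": "456", "zip": "94041"},
    {"business_id": 789},
]

def field_types(records):
    """Collect the set of Python type names observed for each field."""
    seen = defaultdict(set)
    for record in records:
        for field, value in record.items():
            seen[field].add(type(value).__name__)
    return dict(seen)

types = field_types(docs)
# Fields observed with more than one type cannot map to a single SQL column type.
mixed = sorted(field for field, names in types.items() if len(names) > 1)
print(mixed)
```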

Can’t Simply ETL the Data Into One System…
The Relational World
(Diagram: multiple RDBMS sources loaded into a single DW)
• ETL between similar systems
• SQL -> SQL
• Flat -> flat
• Small & slowly evolving data
• Even then, ETL was hard!
Today
(Diagram: S3, HDFS, Solr, Oracle, MongoDB, SQL Server, HBase and Elastic loaded into a single DW)
• ETL between very different systems
• Search -> SQL
• Complex -> flat
• Big & rapidly evolving data
• ETL is now much harder…

Towards a Heterogeneous Data Lake…
• A platform that enables data analysis across disparate data sources
• Storage-agnostic
– The data can live anywhere
– Join across disparate data sources
– Leverage the strengths of each data source
• There’s a reason it was chosen to store that data…
• Client-agnostic
– Tableau, Qlik, Power BI, Excel, R, …
• Scalability & performance
– It’s the era of Big Data…
• Simple & complex analysis

Apache Arrow: Columnar In-Memory Execution
Apache Arrow is the new standard for columnar in-memory execution technology
• 10-100x speedup on modern CPUs
• High-performance sharing & interchange
• High-speed Python and R integration
Arrow is backed by the lead developers of the major open source Big Data technologies
Data Sources: Parquet, HBase, Kudu, Phoenix, Hadoop, Cassandra
Execution: Drill, Spark, Impala, Storm
Data Science: Pandas (Python), R, Ibis

Arrow Enables High Performance Interchange
Pre-Arrow
• Each system has its own internal memory format
• 70-80% CPU wasted on serialization and deserialization
• Similar functionality implemented in multiple projects
With Arrow
• All systems utilize the same memory format
• No overhead for cross-system communication
• Projects can share functionality (e.g., Parquet-to-Arrow reader)
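The difference between the two columns can be sketched with the Python standard library. This is not the Arrow API: `array` and `memoryview` stand in for a shared columnar buffer, and `pickle` stands in for a system-specific serialization step. The point is only that a consumer that understands the producer's memory layout can read it in place, while a consumer with its own format must pay for a copy.

```python
import array
import pickle

# A column of 64-bit integers in one contiguous buffer
# (a stand-in for a columnar vector, not an actual Arrow structure).
column = array.array("q", range(5))

# Pre-Arrow style hand-off: serialize, then deserialize into a private copy.
copied = pickle.loads(pickle.dumps(column))

# Shared-format hand-off: the consumer views the producer's bytes directly.
shared = memoryview(column)

# A write by the producer shows which consumer holds a copy and which holds a view.
column[0] = 99
print(copied[0])  # the copy still holds the old value
print(shared[0])  # the view sees the new value, so no bytes were duplicated
```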

Arrow is Designed for CPU Efficiency
(Figure: Traditional Memory Buffer vs. Arrow Memory Buffer)
• Cache locality
• Super-scalar & vectorized operation
• Minimal structure overhead
• Constant value access
• Operate directly on columnar compressed data
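A rough illustration of the layout difference, using only the standard library rather than Arrow itself. The records and field names are hypothetical. In the columnar form every value of a field sits in one fixed-width, contiguous buffer, so element i is found by arithmetic on its byte offset, and a scan over one field touches only that field's bytes.

```python
import array
import struct

# Hypothetical records in row layout: the values of one record sit together.
rows = [(1, 10.5), (2, 20.0), (3, 7.25)]

# Column layout: the values of one field sit together in a contiguous buffer.
ids = array.array("q", (row[0] for row in rows))      # 64-bit integers
amounts = array.array("d", (row[1] for row in rows))  # 64-bit floats

# Constant-time value access: element i lives at byte offset i * itemsize.
i = 2
offset = i * amounts.itemsize
value = struct.unpack_from("d", amounts, offset)[0]

# A scan over one field reads only that field's buffer.
total = sum(amounts)
print(offset, value, total)
```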

Apache Drill: A Storage-Agnostic Query Engine
• Query any data source as if it’s a relational database
• Join data from multiple data sources in a single query
Clients: Tableau, Excel, Qlik, …; Custom Applications; CLI
Engine: Apache Drill
Data sources: MongoDB*, HBase, Elasticsearch*, MapR, HDFS, NAS, Local Files, Amazon S3, RDBMS*, Azure Blob Storage
* Currently being developed/enhanced

Omni-SQL (“SQL-on-Everything”)
Drill: Omni-SQL
“Whereas the other engines we're discussing here create a relational
database environment on top of Hadoop, Drill instead enables a SQL
language interface to data in numerous formats, without requiring a formal
schema to be declared. This enables plug-and-play discovery over a huge
universe of data without prerequisites and preparation. So while Drill uses
SQL, and can connect to Hadoop, calling it SQL-on-Hadoop kind of misses
the point. A better name might be SQL-on-Everything, with very low setup
requirements.”

Everything Starts With a Drillbit…
drillbit: single process (daemon or CLI)
• High performance query executor
• In-memory columnar execution
• Directly interacts with data, acquiring knowledge as it reads
• Built to leverage large amounts of memory
• Networked or not
• Exposes ODBC, JDBC, REST
• Built-in Web UI and CLI
• Extensible

Data Lake, More Like Data Maelstrom
Clustered Services
• Hadoop Cluster: HDFS, HDFS, HDFS
• HBase Cluster: HBase, HBase, HBase over HDFS, HDFS, HDFS
• Elasticsearch Cluster: ES, ES, ES
• MongoDB Cluster: MongoDB, MongoDB, MongoDB
Cloud Services
• DynamoDB
• Amazon S3
Desktops
• Linux
• Mac
• Windows

Run Drill Co-Located with the Data, or Not
Clustered Services
• HDFS, HDFS, HDFS
• HBase, HBase, HBase
• HDFS, HDFS, HDFS
• ES, ES, ES
• MongoDB, MongoDB, MongoDB
Cloud Services
• DynamoDB
• Amazon S3
Desktops
• Linux
• Mac
• Windows
(Diagram: drillbit processes are overlaid on nodes across these environments)

Extensible Datastore Architecture
Chapter 2: Connecting to Datastores
Execution Engine
Storage Plugin API
• MongoDB Plugin
• HBase Plugin
• Hive Plugin
• Kudu Plugin
• Phoenix Plugin
• File Plugin
– FileSystem API: HDFS, S3, MapR-FS
– Format Plugin API: Parquet, JSON, CSV

Referencing a Table
SELECT * FROM production.website.users;
• production: Datastore
• website: Workspace
• users: Table
Chapter 3: The Universal Namespace

Namespaces & Tables
Storage Plugin Type         Workspace   Table
mongo                       Database    Collection
hive                        Database    Table
hbase                       Namespace   Table
file (HDFS cluster, S3, …)  Directory   File or directory
…                           …           …
User defines these in the datastore configuration
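The three-part naming scheme can be sketched as a small parser. This is an illustrative helper, not part of Drill: `split_table_ref` is a hypothetical name, and it assumes exactly three components, with backticks protecting a file path that itself contains dots or slashes.

```python
import re

# One component: either a backtick-quoted name or a run of characters
# containing neither a dot nor a backtick.
_PART = re.compile(r"`([^`]*)`|([^.`]+)")

def split_table_ref(ref):
    """Split 'datastore.workspace.table' into its three parts (simplified)."""
    parts = [quoted or bare for quoted, bare in _PART.findall(ref)]
    if len(parts) != 3:
        raise ValueError(f"expected datastore.workspace.table, got {ref!r}")
    return tuple(parts)

print(split_table_ref("production.website.users"))
print(split_table_ref("dfs.root.`yelp/review.json`"))
```

With a file-type datastore, the workspace resolves to a directory and the table to a file or directory inside it, which is why the last component needs quoting.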

Joining Across Datastores is Easy!
> SELECT *
  FROM dfs.root.`yelp/review.json` r,
       mongo.yelp.business b
  WHERE r.business_id = b.business_id;
• dfs: alias to a specific file system (S3, HDFS, local, NAS)
• mongo: alias to a specific MongoDB cluster
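Conceptually, the query above is an equi-join between records read from two unrelated sources. The sketch below shows one common way to do that, a hash join, over plain Python dictionaries. It is a generic illustration and makes no claim about how Drill plans or executes the join; the sample records and the `stars` field are hypothetical.

```python
from collections import defaultdict

def hash_join(left, right, key):
    """Yield merged records for every left/right pair that agrees on `key`."""
    # Build phase: index the right-hand records by join key.
    index = defaultdict(list)
    for row in right:
        index[row[key]].append(row)
    # Probe phase: look up each left-hand record's key in the index.
    for lrow in left:
        for rrow in index.get(lrow[key], []):
            yield {**rrow, **lrow}

# Hypothetical rows, standing in for a JSON file and a MongoDB collection.
reviews = [
    {"business_id": 123, "stars": 4},
    {"business_id": 456, "stars": 2},
]
businesses = [{"business_id": 123, "name": "McDonalds"}]

joined = list(hash_join(reviews, businesses, "business_id"))
print(joined)
```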

Native JSON Data Model
{
  "business_id": 123,
  "name": "McDonalds",
  "categories": ["restaurant", "fast food"],
  "attributes": {
    "family friendly": true,
    "fast": true,
    "romantic": false
  }
}
Access Arrays
SELECT categories[0]
Access Maps
WHERE t.attributes.romantic IS TRUE
Flatten Arrays
SELECT name, FLATTEN(categories)
Extract Keys
SELECT name, KVGEN(attributes)
Flatten Maps
SELECT name, FLATTEN(KVGEN(attributes))
Access Embedded JSON Blobs
SELECT d.address.state
FROM (SELECT CONVERT_FROM(t.data, 'JSON') d FROM t)
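The array access, map access and key extraction shown on this slide can be mimicked in plain Python to make the semantics concrete. The `kvgen` function below is a hypothetical stand-in that models the KVGEN behaviour described here, turning a map into a list of key/value pairs; it is not Drill code.

```python
# The sample document from the slide, as a Python dictionary.
business = {
    "business_id": 123,
    "name": "McDonalds",
    "categories": ["restaurant", "fast food"],
    "attributes": {"family friendly": True, "fast": True, "romantic": False},
}

def kvgen(mapping):
    """Model of KVGEN: turn a map into a list of {'key', 'value'} records."""
    return [{"key": key, "value": value} for key, value in mapping.items()]

# Access Arrays: categories[0]
first_category = business["categories"][0]

# Access Maps: attributes.romantic IS TRUE
is_romantic = business["attributes"]["romantic"] is True

# Extract Keys: KVGEN(attributes)
pairs = kvgen(business["attributes"])
print(first_category, is_romantic, pairs)
```

Once the map is a list of pairs, it can be flattened like any other array, which is what FLATTEN(KVGEN(attributes)) expresses.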

FLATTEN
• FLATTEN converts a single record with an array field into multiple records
– One output record for each array element
• Non-FLATTENed fields are repeated in each of the output records
> SELECT categories
  FROM business LIMIT 2;
+-------------------------------------------+
| categories                                |
+-------------------------------------------+
| ["American (Traditional)","Restaurants"]  |
| ["Chinese","Restaurants"]                 |
+-------------------------------------------+
> SELECT FLATTEN(categories)
  FROM business LIMIT 4;
+-------------------------+
| EXPR$0                  |
+-------------------------+
| American (Traditional)  |
| Restaurants             |
| Chinese                 |
| Restaurants             |
+-------------------------+
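The two rules above (one output record per array element, other fields repeated) fit in a few lines of Python. This is a model of the described behaviour, not Drill's implementation; the `name` values are hypothetical, added only to show a non-flattened field being repeated.

```python
def flatten(records, field):
    """Yield one record per element of record[field], repeating other fields."""
    for record in records:
        for element in record[field]:
            yield {**record, field: element}

# Category arrays from the slide; the names are invented for illustration.
business = [
    {"name": "A", "categories": ["American (Traditional)", "Restaurants"]},
    {"name": "B", "categories": ["Chinese", "Restaurants"]},
]

flattened = list(flatten(business, "categories"))
for row in flattened:
    print(row["name"], row["categories"])
```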

ODBC and JDBC
• Drill includes standard ODBC/JDBC drivers
– ODBC for native apps
– JDBC for Java apps
• User installs the driver on the client
– The same machine as the BI tool
• Driver communicates with Drill cluster(s)
• Make sure driver and cluster are compatible versions
(Diagram: on the client (e.g., a laptop), Tableau with the Drill ODBC Driver and TIBCO Spotfire with the Drill JDBC Driver, each connecting to the Drill Cluster)

Thank You
• Learn about Apache Arrow
– Jacques Nadeau’s blog post: www.dremio.com/blog/Apache-Arrow/
– Apache Arrow website: arrow.apache.org
• Download Apache Drill: drill.apache.org
• Reach out to learn more about the Dremio private beta
– Email me: tshiran@dremio.com
– Sign up on the site: www.dremio.com

Questions
• User trends based on yelping_since (Mongo)
• Top business categories, with coloring by state
• Which businesses are gross? (Elastic<->Mongo)
• Which of those had the most website clicks?
– distinct(business_id) on elastic, mongo.business, hdfs.default.click
