DREMIO
The Heterogeneous Data Lake
Tomer Shiran, Co-Founder & CEO at Dremio
tshiran@dremio.com | @tshiran
Hadoop Summit Europe 2016
April 13, 2016
Company Background
Jacques Nadeau
Founder & CTO
• Recognized SQL & NoSQL expert
• Apache Arrow & Drill PMC Chair
• Quigo (AOL); Offermatica (ADBE);
aQuantive (MSFT)
Tomer Shiran
Founder & CEO
• MapR (VP Product); Microsoft; IBM
Research
• Apache Drill Founder
• Carnegie Mellon, Technion
Julien Le Dem
Architect
• Apache Parquet Founder
• Apache Pig PMC Member
• Twitter (Lead, Analytics Data
Pipeline); Yahoo! (Architect)
Top Silicon Valley VCs
• Stealth data analytics startup
• Founded in 2015
• Led by experts in Big Data and open source
The Rise of Heterogeneous Data Infrastructure
1980 2016
Can’t Simply Connect a BI Tool…
• Too slow for interactive
analysis
• Manual process to map
data to relational model
• NoSQL data often
inconsistent & unclean
(e.g., mixed types)
Can’t Simply ETL the Data Into One System…
The Relational World (diagram: multiple RDBMS sources feeding a DW)
• ETL between similar systems
• SQL -> SQL
• Flat -> flat
• Small & slowly evolving data
• Even then, ETL was hard!
Today (diagram: S3, HDFS, Solr, Oracle, MongoDB, SQL Server, HBase, and Elastic feeding a DW)
• ETL between very different systems
• Search -> SQL
• Complex -> flat
• Big & rapidly evolving data
• ETL is now much harder…
Towards a Heterogeneous Data Lake…
• A platform that enables data analysis across disparate data sources
• Storage-agnostic
– The data can live anywhere
– Join across disparate data sources
– Leverage the strengths of each data source
• There’s a reason it was chosen to store that data…
• Client-agnostic
– Tableau, Qlik, Power BI, Excel, R, …
• Scalability & performance
– It’s the era of Big Data…
• Simple & complex analysis
Apache Arrow: Columnar In-Memory Execution
Arrow is backed by the lead developers of the major open source Big Data technologies
10-100x speedup
on modern CPUs
High-performance
sharing & interchange
High-speed Python
and R integration
Apache Arrow is the new standard for columnar in-memory execution technology
Data Sources: Parquet, HBase, Kudu, Phoenix, Hadoop, Cassandra
Execution: Drill, Spark, Impala, Storm
Data Science: Pandas (Python), R, Ibis
Arrow Enables High Performance Interchange
Pre-Arrow
• Each system has its own internal memory format
• 70-80% CPU wasted on serialization and deserialization
• Similar functionality implemented in multiple projects
With Arrow
• All systems utilize the same memory format
• No overhead for cross-system communication
• Projects can share functionality (e.g., Parquet-to-Arrow reader)
Arrow is Designed for CPU Efficiency
Traditional
Memory Buffer
Arrow
Memory Buffer
• Cache locality
• Super-scalar & vectorized
operation
• Minimal structure overhead
• Constant value access
• Operate directly on
columnar compressed data
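To make the layout difference concrete, here is a minimal pure-Python sketch contrasting a record-at-a-time buffer with a column-oriented one. It is not Arrow's API or its actual memory format; the field names and values are hypothetical, and the standard-library `array` type merely stands in for a contiguous typed buffer.

```python
from array import array

# Hypothetical records; the field names are illustrative only.
rows = [
    {"business_id": 1, "stars": 4.5, "review_count": 120},
    {"business_id": 2, "stars": 3.0, "review_count": 15},
    {"business_id": 3, "stars": 5.0, "review_count": 42},
]


def to_columns(records):
    """Pivot row-oriented records into one contiguous typed buffer per field."""
    return {
        "business_id": array("q", (r["business_id"] for r in records)),
        "stars": array("d", (r["stars"] for r in records)),
        "review_count": array("q", (r["review_count"] for r in records)),
    }


columns = to_columns(rows)

# Row layout: summing one field has to visit every record object.
total_row_layout = sum(r["review_count"] for r in rows)

# Columnar layout: the same aggregate scans a single contiguous buffer,
# and any value is reachable by position in constant time.
total_columnar = sum(columns["review_count"])
second_stars = columns["stars"][1]
```

Both totals are the same; what changes is how much unrelated data the scan has to step over, which is where the cache-locality and vectorization points on the slide come from.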
Apache Drill: A Storage-Agnostic Query Engine
Clients: Tableau, Excel, Qlik, …; custom applications; CLI
Data sources: HBase, MongoDB*, Elasticsearch*, RDBMS*, MapR, HDFS, NAS, local files, Amazon S3, Azure Blob Storage
* Currently being developed/enhanced
1. Query any data source as if it’s a relational database
2. Join data from multiple data sources in a single query
Omni-SQL (“SQL-on-Everything”)
Drill: Omni-SQL
“Whereas the other engines we're discussing here create a relational
database environment on top of Hadoop, Drill instead enables a SQL
language interface to data in numerous formats, without requiring a formal
schema to be declared. This enables plug-and-play discovery over a huge
universe of data without prerequisites and preparation. So while Drill uses
SQL, and can connect to Hadoop, calling it SQL-on-Hadoop kind of misses
the point. A better name might be SQL-on-Everything, with very low setup
requirements.”
ARCHITECTURE
Everything Starts With a Drillbit…
• High performance query executor
• In-memory columnar execution
• Directly interacts with data, acquiring
knowledge as it reads
• Built to leverage large amounts of memory
• Networked or not
• Exposes ODBC, JDBC, REST
• Built-in Web UI and CLI
• Extensible
Single process
(daemon or CLI)
drillbit
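As a rough illustration of the REST interface mentioned above, the sketch below builds (but does not send) an HTTP request for a drillbit's `/query.json` endpoint. The host, the port 8047, and the helper name `build_drill_query_request` are assumptions for illustration; adjust them to your own deployment.

```python
import json
import urllib.request


def build_drill_query_request(sql, base_url="http://localhost:8047"):
    """Build a POST request for a drillbit's /query.json endpoint.

    base_url is an assumption (a local drillbit on its usual web port).
    """
    payload = json.dumps({"queryType": "SQL", "query": sql}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/query.json",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_drill_query_request("SELECT name FROM mongo.yelp.business LIMIT 1")

# To actually run the query against a live drillbit, something like:
#   with urllib.request.urlopen(request) as response:
#       result = json.load(response)
```

Keeping request construction separate from sending makes the payload easy to inspect before pointing it at a real cluster.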
Data Lake, More Like Data Maelstrom
Clustered Services: Hadoop cluster (HDFS), HBase cluster (HBase, HDFS), Elasticsearch cluster (ES), MongoDB cluster (MongoDB)
Cloud Services: DynamoDB, Amazon S3
Desktops: Linux, Mac, Windows
Run Drill Co-Located with the Data, or Not
(diagram: the same clustered services (HDFS, HBase, ES, MongoDB), cloud services (DynamoDB, Amazon S3), and desktops (Linux, Mac, Windows), now with drillbits deployed across them)
Extensible Datastore Architecture
Execution Engine
Storage Plugin API: MongoDB Plugin, HBase Plugin, Hive Plugin, Kudu Plugin, Phoenix Plugin, File Plugin
File Plugin
– FileSystem API: HDFS, S3, MapR-FS
– Format Plugin API: Parquet, JSON, CSV
QUERYING DATA
Referencing a Table
SELECT * FROM production.website.users;
Datastore: production, Workspace: website, Table: users
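The three-part naming can be sketched as a small parser. This is only an illustration of the datastore.workspace.table convention, not how Drill itself resolves names; the function name `split_table_reference` is hypothetical, and it assumes exactly three parts with backticks used to quote a part that contains dots or slashes.

```python
def split_table_reference(ref):
    """Split 'datastore.workspace.table' into parts, honoring backtick quoting."""
    parts, current, quoted = [], [], False
    for ch in ref:
        if ch == "`":
            quoted = not quoted  # backticks delimit a quoted identifier
        elif ch == "." and not quoted:
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    parts.append("".join(current))
    if len(parts) != 3:
        raise ValueError(f"expected datastore.workspace.table, got {ref!r}")
    datastore, workspace, table = parts
    return {"datastore": datastore, "workspace": workspace, "table": table}


simple = split_table_reference("production.website.users")
file_based = split_table_reference("dfs.root.`/opt/tutorial/yelp/business.json`")
```

The second call shows why the backticks matter: without them, the dot in `business.json` would be read as another namespace separator.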
Run Your First Query
> SELECT name FROM mongo.yelp.business LIMIT 1;
+--------------------+
| name |
+--------------------+
| Eric Goldberg, MD |
+--------------------+
> SELECT name FROM dfs.root.`/opt/tutorial/yelp/business.json`
LIMIT 1;
+--------------------+
| name |
+--------------------+
| Eric Goldberg, MD |
+--------------------+
Namespaces & Tables
Storage Plugin Type        | Workspace | Table
mongo                      | Database  | Collection
hive                       | Database  | Table
hbase                      | Namespace | Table
file (HDFS cluster, S3, …) | Directory | File or directory
…                          | …         | …
User defines these in the datastore configuration
Joining Across Datastores is Easy!
> SELECT *
  FROM dfs.root.`yelp/review.json` r,
       mongo.yelp.business b
  WHERE r.business_id = b.business_id;
dfs: alias to a specific file system (S3, HDFS, local, NAS)
mongo: alias to a specific MongoDB cluster
What Business Has the Most Reviews on Yelp?
> SELECT b.name AS name, COUNT(*) AS reviews
  FROM dfs.yelp.`review.json` r,
       mongo.yelp.business b
  WHERE r.business_id = b.business_id
  GROUP BY b.business_id, b.name
  ORDER BY reviews DESC
  LIMIT 3;
+-------------------+----------+
| name | reviews |
+-------------------+----------+
| Mon Ami Gabi | 3695 |
| Earl of Sandwich | 3263 |
| Wicked Spoon | 3011 |
+-------------------+----------+
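For readers who want to see what the query above computes, here is a minimal Python sketch of the same logic: an inner join on business_id, a count per business, and a top-N cut. The in-memory lists are hypothetical stand-ins for the two sources, not Yelp data, and this says nothing about how Drill actually plans or executes the join.

```python
from collections import Counter

# Hypothetical stand-ins for the two sources in the query.
reviews = [  # rows from dfs.yelp.`review.json`
    {"business_id": "b1"}, {"business_id": "b2"}, {"business_id": "b1"},
    {"business_id": "b3"}, {"business_id": "b1"}, {"business_id": "b2"},
]
businesses = [  # rows from mongo.yelp.business
    {"business_id": "b1", "name": "Cafe A"},
    {"business_id": "b2", "name": "Diner B"},
    {"business_id": "b3", "name": "Bistro C"},
]


def top_reviewed(reviews, businesses, limit):
    """Join reviews to businesses on business_id, count, and keep the top rows."""
    names = {b["business_id"]: b["name"] for b in businesses}  # lookup side
    counts = Counter(
        r["business_id"] for r in reviews if r["business_id"] in names
    )  # inner join: reviews without a matching business are dropped
    return [(names[bid], n) for bid, n in counts.most_common(limit)]


result = top_reviewed(reviews, businesses, 3)
```

The point of the slide is that in Drill neither side has to be copied into the other system first; the sketch only mirrors the relational semantics.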
Native JSON Data Model
{
  "business_id": 123,
  "name": "McDonalds",
  "categories": ["restaurant", "fast food"],
  "attributes": {
    "family friendly": true,
    "fast": true,
    "romantic": false
  }
}
Access Arrays
SELECT categories[0]
Access Maps
WHERE t.attributes.romantic IS TRUE
Flatten Arrays
SELECT name, FLATTEN(categories)
Extract Keys
SELECT name, KVGEN(attributes)
Flatten Maps
SELECT name, FLATTEN(KVGEN(attributes))
Access Embedded JSON Blobs
SELECT d.address.state
FROM (SELECT CONVERT_FROM(t.data, 'JSON') d FROM t)
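The KVGEN and FLATTEN(KVGEN(...)) operations can be emulated in a few lines of Python to show the shape of their output. This is a sketch of the semantics only, using the McDonalds record from this slide; the helper name `kvgen` and the output field name `attribute` are assumptions for illustration.

```python
business = {
    "business_id": 123,
    "name": "McDonalds",
    "categories": ["restaurant", "fast food"],
    "attributes": {"family friendly": True, "fast": True, "romantic": False},
}


def kvgen(mapping):
    """Emulate KVGEN: turn a map into a list of {"key", "value"} pairs."""
    return [{"key": key, "value": value} for key, value in mapping.items()]


# SELECT name, KVGEN(attributes): one record, attributes become a list of pairs.
pairs = kvgen(business["attributes"])

# SELECT name, FLATTEN(KVGEN(attributes)): one record per key/value pair,
# with the name repeated in each.
flattened = [{"name": business["name"], "attribute": pair} for pair in pairs]
```

Turning map keys into data this way is what lets a query filter or group on attribute names that are not known up front.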
Accessing Array Elements
> SELECT categories FROM business LIMIT 2;
+-------------------------------------------+
| categories |
+-------------------------------------------+
| ["American (Traditional)","Restaurants"] |
| ["Chinese","Restaurants"] |
+-------------------------------------------+
> SELECT categories[0] FROM business LIMIT 2;
+-------------------------+
| EXPR$0 |
+-------------------------+
| American (Traditional) |
| Chinese |
+-------------------------+
FLATTEN
• FLATTEN converts a single record with an array field into multiple records
– One output record for each array element
• Non-FLATTENed fields are repeated in each of the output records
> SELECT categories
FROM business LIMIT 2;
+-------------------------------------------+
| categories |
+-------------------------------------------+
| ["American (Traditional)","Restaurants"] |
| ["Chinese","Restaurants"] |
+-------------------------------------------+
> SELECT FLATTEN(categories)
FROM business LIMIT 4;
+-------------------------+
| EXPR$0 |
+-------------------------+
| American (Traditional) |
| Restaurants |
| Chinese |
| Restaurants |
+-------------------------+
Non-FLATTENed Fields are Repeated
> SELECT name, categories FROM business LIMIT 2;
+------------------------------+-------------------------------------------+
| name | categories |
+------------------------------+-------------------------------------------+
| Deforest Family Restaurant | ["American (Traditional)","Restaurants"] |
| Chang Jiang Chinese Kitchen | ["Chinese","Restaurants"] |
+------------------------------+-------------------------------------------+
> SELECT name, FLATTEN(categories) FROM business LIMIT 4;
+------------------------------+-------------------------+
| name | EXPR$1 |
+------------------------------+-------------------------+
| Deforest Family Restaurant | American (Traditional) |
| Deforest Family Restaurant | Restaurants |
| Chang Jiang Chinese Kitchen | Chinese |
| Chang Jiang Chinese Kitchen | Restaurants |
+------------------------------+-------------------------+
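The repetition of non-flattened fields can be reproduced with a short Python generator. This is an emulation of the behavior shown in the result tables above, not Drill code; `flatten_rows` is a hypothetical helper name, and `islice` plays the role of LIMIT 4.

```python
from itertools import islice

businesses = [
    {"name": "Deforest Family Restaurant",
     "categories": ["American (Traditional)", "Restaurants"]},
    {"name": "Chang Jiang Chinese Kitchen",
     "categories": ["Chinese", "Restaurants"]},
]


def flatten_rows(records, array_field):
    """Yield one row per array element, copying every other field unchanged."""
    for record in records:
        for element in record[array_field]:
            yield {**record, array_field: element}


# SELECT name, FLATTEN(categories) FROM business LIMIT 4
first_four = list(islice(flatten_rows(businesses, "categories"), 4))
```

Each input record with two categories becomes two output rows carrying the same name, matching the second table on the slide.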
ODBC and JDBC
• Drill includes standard
ODBC/JDBC drivers
– ODBC for native apps
– JDBC for Java apps
• User installs the driver
on the client
– The same machine as
the BI tool
• Driver communicates
with Drill cluster(s)
• Make sure driver and
cluster are compatible
versions
Client: TIBCO Spotfire + Drill JDBC Driver -> Drill Cluster
Client (e.g., laptop): Tableau + Drill ODBC Driver -> Drill Cluster
DEMO TIME!
Thank You
• Learn about Apache Arrow
• Jacques Nadeau’s blog post: www.dremio.com/blog/Apache-Arrow/
• Apache Arrow website: arrow.apache.org
• Download Apache Drill: drill.apache.org
• Reach out to learn more about the Dremio private beta
• Email me: tshiran@dremio.com
• Sign up on the site: www.dremio.com
APPENDIX
Questions
• User trends based on yelping_since (Mongo)
• Top business categories, with coloring by state
• Which businesses are gross? (Elastic<->Mongo)
• Which of those had the most website clicks?
– distinct(business_id) on elastic, mongo.business,
hdfs.default.click
