From Warehouse to Lakehouse
February 15, 2023 (Paris)
Dremio
The Easy and Open Data Lakehouse: self-service analytics with data warehouse functionality and data lake flexibility across all of your data.
■ The Only Data Lakehouse with Self-Service SQL Analytics
■ Your Data Forever, No Lock-In
■ Sub-Second Performance, 1/10th the Cost of Data Warehouses

Open Source & Community: Apache Arrow (70M+ downloads/month), Apache Iceberg, Nessie; creator and host of the Subsurface LIVE conference
Enterprise Adoption: 1000s of companies across all industries, including 5 of the Fortune 10
Data Analytics - The History

1980 – 2010: Enterprise Data Warehouse (transactional and analytical workloads)
● Fixed compute & storage capacity
● Mostly on-prem
● Harder to use & manage

2010 – 2015: Big Data + Open Source (introduced big data processing)
● Fixed compute & storage capacity
● Mostly on-prem
● Harder to use & manage

2015 – Present: Cloud Data Warehouse (analytics on the cloud data warehouse)
● Scale storage and compute independently
● Must load data into a proprietary system
● Limited to one processing engine
● Cost prohibitive

The Movement: Open Data Lakehouse (self-service analytics on the data lake)
● Data in open file and table formats
● No need to copy & move data
● Multiple best-of-breed processing engines
Competing Data Priorities
[Diagram: lines of business prioritize access and self-service/agility; centralized teams prioritize governance and security.]
Companies Want to Democratize Data… But How?
▪ Everyone wants access
▪ Data volumes are exploding
▪ Security risks
▪ Compliance requirements
▪ Limited resources
[Diagram: consumers (SQL, data science, dashboards, apps) sit above continuous new data flowing from application databases, IoT, web and logs into cloud object storage (S3, ADLS, GCS) and on-prem systems (RDBMS).]
Data Warehouses: Expensive, Proprietary, Complex
✗ Skyrocketing costs
✗ Vendor lock-in
✗ Exploding backlog
✗ Can’t explore data
✗ No self-service
[Diagram: the same consumers and data sources as above, with a proprietary data warehouse in between.]
Data Lakehouse: Easy, Open, 1/10th the Cost
✓ Sub-second performance
✓ Eliminate data silos
✓ Improve data discovery and access
✓ No data movement required
✓ No copies
✓ Inexpensive
✓ No lock-in
[Diagram: SQL, data science, dashboards and apps connect over ODBC, JDBC, REST and Arrow Flight; the lakehouse engine applies parallelism, caching and optimized push-downs to query data in place on cloud object storage (S3, ADLS, GCS) and on-prem sources (RDBMS).]
Table Formats Enable Data Warehouse Workloads on the Lake
■ Created by Netflix, Apple and other big tech
■ INSERT/UPDATE/DELETE with any engine
■ Strong momentum in the OSS community

Record-level data mutations with SQL DML:
INSERT INTO t1 ...
UPDATE t1 SET col1 = ...
DELETE FROM t1 WHERE state = 'CA'

Automatic partitioning:
CREATE TABLE t1 PARTITIONED BY (month(date), bucket[1000](user))

Instant schema and partition evolution:
ALTER TABLE t1 ADD/DROP/MODIFY/RENAME COLUMN c1 ...
ALTER TABLE t1 ADD/DROP PARTITION FIELD ...

Time travel:
SELECT * FROM t1 AT/BEFORE <timestamp>
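Putting these together, a minimal end-to-end sketch. It assumes Spark-style Iceberg SQL (partition-transform and time-travel syntax varies slightly by engine), and the table and column names are hypothetical:

-- Create a partitioned Iceberg table (Spark SQL transform syntax)
CREATE TABLE orders (id BIGINT, user_id BIGINT, state STRING, order_date DATE)
USING iceberg
PARTITIONED BY (months(order_date), bucket(1000, user_id));

-- Record-level DML
INSERT INTO orders VALUES (1, 42, 'CA', DATE '2023-01-15');
UPDATE orders SET state = 'WA' WHERE id = 1;
DELETE FROM orders WHERE state = 'CA';

-- Schema and partition evolution are metadata-only operations
ALTER TABLE orders ADD COLUMN total DECIMAL(10, 2);
ALTER TABLE orders ADD PARTITION FIELD state;

-- Time travel to an earlier state of the table
SELECT * FROM orders TIMESTAMP AS OF '2023-01-16 00:00:00';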
Iceberg is a Community-Built, Vendor-Agnostic Table Format
Iceberg is the Primary Format of Cloud and Platform Vendors
AWS, Dremio, GCP (BigQuery, etc.), Snowflake, Tabular, Databricks, …
Data Analytics - The History Continued

1980 – 2010: Enterprise Data Warehouse (transactional and analytical workloads)
● Fixed compute & storage capacity
● Mostly on-prem
● Harder to use & manage

2010 – 2015: Big Data + Open Source (introduced big data processing)
● Fixed compute & storage capacity
● Mostly on-prem
● Harder to use & manage

2015 – Present: Cloud Data Warehouse (analytics on the cloud data warehouse)
● Scale storage and compute independently
● Must load data into a proprietary system
● Limited to one processing engine
● Cost prohibitive

The Movement: Open Data Lakehouse (self-service analytics on the data lake)
● Data in open file and table formats
● No need to copy & move data
● Multiple best-of-breed processing engines

2023 – …: Data Mesh with Data-as-Code (data mesh with data managed as code)
● Isolated data exploration and engineering with branches
● Version control with commits and tags
● Governance and domain management
Dremio Arctic (in preview)
Dremio Arctic is a Data Lakehouse Management Service

Lakehouse Catalog
▪ Tables and views in hierarchical namespaces
▪ Native Iceberg catalog (Delta Lake on the roadmap)

Data as Code
▪ Commits, tags and branches

Governance & Security
▪ Fine-grained access control
▪ Commit/audit logs

Data Optimization
▪ Automated data optimization
▪ Automated garbage collection
▪ Automated data ingestion (roadmap)

[Diagram: query engines on top; the Arctic catalog manages Apache Iceberg tables stored on cloud storage (ADLS, S3, GCS).]
Dremio Arctic is a Modern Lakehouse Catalog

ICEBERG-NATIVE
▪ Nessie (the Arctic catalog) is built into the open source Apache Iceberg project
▪ Use a variety of Iceberg-compatible engines including Dremio Sonar, Spark and Flink

MULTIPLE DOMAINS
▪ Multiple isolated domains/catalogs in an organization, each containing a folder hierarchy of tables and views
▪ Designed to enable data mesh (including federated ownership and data sharing)

ACCESS CONTROL
▪ Table-, column- and row-based access control
▪ Custom roles and integration with existing user/group directories (AAD, Okta, etc.)
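For illustration, table-level access control might look like the following. This is a sketch assuming Dremio-style GRANT syntax; the catalog, table, role and user names are hypothetical:

-- Hypothetical role for analysts in one business domain
CREATE ROLE marketing_analysts;

-- Table-level read access within the marketing domain
GRANT SELECT ON TABLE arctic.marketing.web_events TO ROLE marketing_analysts;

-- Assign the role to a user (users and groups can also come from AAD, Okta, etc.)
GRANT ROLE marketing_analysts TO USER dave;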
Manage Data as Code with Git-like Capabilities

ISOLATION
▪ Experiment with data without impacting other users
▪ Ingest, transform and test data before exposing it to other users in an atomic merge

VERSION CONTROL
▪ Reproduce models and dashboards from historical data based on time or tags
▪ Recover from any mistake by instantly undoing accidental data or metadata changes

GOVERNANCE
▪ All changes to the data and metadata are tracked: who accessed what data and when
▪ Fine-grained privileges to control access to the data at the table, column and row level
Automatic Data Optimization Enables Faster Queries

TABLE OPTIMIZATION
▪ Dremio Arctic automatically rewrites smaller files into larger files and groups similar rows in a table together
▪ Table optimization significantly accelerates query performance

TABLE VACUUM
▪ Dremio Arctic automatically removes unused manifest files, manifest lists, and data files
▪ Cleanup runs in the background and ensures efficient use of data lake storage
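Arctic runs both jobs automatically in the background. For reference, the manual equivalents look roughly like this (a sketch assuming Dremio's OPTIMIZE TABLE and VACUUM TABLE commands; the table name is hypothetical):

-- Compact small files into fewer, larger files and cluster similar rows
OPTIMIZE TABLE arctic.web.events;

-- Expire old snapshots so unreferenced manifests and data files can be cleaned up
VACUUM TABLE arctic.web.events EXPIRE SNAPSHOTS;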
5 Use Cases for Data as Code
1: Ensure data quality with ETL branches

Create an ETL branch and ingest the data with COPY INTO, CTAS or Spark:

CREATE BRANCH events_etl_9_28_22
USE BRANCH events_etl_9_28_22
COPY INTO web.events …

Run queries to test data quality (a valid IPv4 address is at least 7 characters long):

SELECT COUNT(*) FROM web.events WHERE length(ip_address) < 7

Test the dashboard to see that it looks OK. Then fix the problems and merge into main:

DELETE FROM web.events WHERE length(ip_address) < 7
USE BRANCH main
MERGE BRANCH events_etl_9_28_22

[Diagram: the events_etl_9_28_22 branch forks from main (production) for data quality checks and merges back when clean.]
2: Experiment with data in transient branches

Create a transient branch and perform data explorations and transformations in it:

CREATE BRANCH dave_9_28_22
USE BRANCH dave_9_28_22
CREATE TABLE t AS SELECT …
UPDATE t … SET …

Create ad-hoc visualizations on the branch via a notebook.

Delete the branch, or merge it, when experimentation is complete:

DROP BRANCH dave_9_28_22

[Diagram: the dave_9_28_22 experimentation branch forks from main (production) and is dropped when finished.]
3: Reproduce models or analysis

Change context to a named tag:

spark.sql("USE REFERENCE modelA IN arctic")

Create an ML model based on historical data:

import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.param.ParamMap

val trainingData = spark.read.table("arctic.t")
val lr = new LogisticRegression()
// configure logistic regression...
val paramMap = ParamMap(...)
val model = lr.fit(trainingData, paramMap)

Alternatively, select a tag, commit or branch to query in the SQL Runner.
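The same pinned state can also be queried directly in SQL. A sketch assuming Dremio's tag and AT TAG syntax (the tag and table names follow the example above):

-- Create a named tag pinning the current state of the catalog
CREATE TAG modelA;

-- Query the table exactly as it was when the tag was created
SELECT * FROM arctic.t AT TAG modelA;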
4: Recover from mistakes

Move the branch head to a historical commit:

ALTER BRANCH main ASSIGN COMMIT …f724

[Diagram: commit graph in which the head of main is reassigned from its current commit back to historical commit …f724; commits …a233, …b84c, …9bc8, …2563 and …4231 remain in the history.]
5: Troubleshooting (see who changed the data)

Get the commit history for a branch:

SHOW LOGS AT REFERENCE etl;

Get the commit history for a specific table via the catalog's REST API:

curl -X GET -H 'Authorization: Bearer <PAT>' \
  <Catalog API Endpoint>/trees/tree/<reference name>/log?filter="operations.exists(op,op.key=='<table name>')"
