Confidential - Do Not Share or Distribute
Dremio
The Easy and Open Lakehouse Platform
About Dremio
The Easy and Open Data Lakehouse Platform
– Data warehouse performance directly on the lake
– Query acceleration to eliminate copies and BI extracts
– Semantic layer to enable governed self-service
– Database connectors to enable queries on other sources
Enterprise Adoption
– 1000s of companies across all industries
– 5 of the Fortune 10
Open Source & Community
– Apache Arrow (60M+ downloads/month), Apache Iceberg, Nessie
– Creator and host of Subsurface LIVE conference
Companies Want to Democratize Data… But How?
▪ Everyone wants access
▪ Data volumes are exploding
▪ Security risks
▪ Compliance requirements
▪ Limited resources
[Diagram: consumers (SQL, Data Science, Dashboards, Apps) over storage (cloud object storage: ADLS, S3, GCS; on-prem: RDBMS), fed by continuous new data from Application Databases | IoT | Web | Logs]
Data Warehouses: Expensive, Proprietary, Complex
✗ Skyrocketing costs
✗ Vendor lock-in
✗ Exploding backlog
✗ Can’t explore data
✗ No self-service
[Diagram: consumers (SQL, Data Science, Dashboards, Apps) over storage (cloud object storage: ADLS, S3, GCS; on-prem: RDBMS), fed by continuous new data from Application Databases | IoT | Web | Logs]
Dremio Data Lakehouse: Easy, Open, 1/10th the Cost
⇅ ODBC | JDBC | REST | Arrow Flight ⇅
⇅ Parallelism | Caching | Optimized Push-Downs ⇅
✓ Sub-second performance
✓ Eliminate Data Silos
✓ Improve Data Discovery and Access
✓ No Data Movement Required
✓ No Copies
✓ Inexpensive
✓ No lock-in
[Diagram: consumers (SQL, Data Science, Dashboards, Apps) over storage (cloud object storage: ADLS, S3, GCS; on-prem: RDBMS), fed by continuous new data from Application Databases | IoT | Web | Logs]
IT-Governed Self-Service Semantic Layer
Standardized, User-Defined Abstraction Layer Enabling Virtual Data Sets, with an Easy-to-Use UI
[Diagram: raw zone (physical datasets on ADLS or S3) and semantic zone (virtual datasets), with acceleration via Data Reflections; roles shown: Data Engineers, Data Analysts and Engineers; consumers (BI Users, SQL, Data Scientists) connect over ODBC | JDBC | REST | Arrow Flight]
Data Analysts
✓ Consistent business logic & KPIs
✓ No more waiting for IT
✓ Use visualization tool(s) of choice
Data Engineers & Architects
✓ Centralize data security & governance
✓ No more reactive, tedious work
✓ Easy collaboration with data analysts
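The layering of virtual datasets over physical datasets can be illustrated with ordinary SQL views. The sketch below is not Dremio code: it uses Python's built-in sqlite3 module as a stand-in engine, and the table, view, and column names (raw_orders, orders_clean, revenue_by_region) are hypothetical. It only shows the idea that business logic is defined once in a view and reused by every consumer, without copying data.

```python
import sqlite3

# In-memory SQLite database standing in for the lakehouse engine (assumption).
conn = sqlite3.connect(":memory:")

# "Raw zone": a physical dataset, loaded as-is.
conn.execute(
    "CREATE TABLE raw_orders ("
    "order_id INTEGER, region TEXT, amount_cents INTEGER, status TEXT)"
)
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?, ?)",
    [
        (1, "EMEA", 12000, "paid"),
        (2, "EMEA", 8000, "cancelled"),
        (3, "APAC", 5000, "paid"),
    ],
)

# "Semantic zone": virtual datasets are views, so no data is copied.
# First layer: cleaning rules (only paid orders, amounts in currency units).
conn.execute(
    "CREATE VIEW orders_clean AS "
    "SELECT order_id, region, amount_cents / 100.0 AS amount "
    "FROM raw_orders WHERE status = 'paid'"
)
# Second layer: a KPI defined once, on top of the cleaned view.
conn.execute(
    "CREATE VIEW revenue_by_region AS "
    "SELECT region, SUM(amount) AS revenue "
    "FROM orders_clean GROUP BY region"
)

# Any consumer querying the KPI view gets the same business logic.
revenue = dict(
    conn.execute(
        "SELECT region, revenue FROM revenue_by_region ORDER BY region"
    ).fetchall()
)
print(revenue)
```

Because the KPI lives in one view, changing the definition of "paid revenue" means editing orders_clean once rather than updating every dashboard.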
A Realistic Example: DW Offload
✗ Maxed capacity
✗ End-of-life support
✗ Complex ETL processes
✗ Legacy query engine performance
[Diagram: consumers (SQL, Data Science, Dashboards, Apps) over an RDBMS, fed by continuous new data from Application Databases | IoT | Web | Logs]
A Realistic Example: DW Offload
✓ Unified layer
✓ Combine DW and DL data
✓ Address DW capacity issues
✓ Smooth transition
Third-Party Data: Bloomberg, S&P, AWS Data Exchange…
[Diagram: Semantic Model layer between consumers (SQL, Data Science, Dashboards, Apps) and sources (RDBMS, third-party data, continuous new data from Application Databases | IoT | Web | Logs)]
Dremio at a glance
Fast Performance
– Apache Arrow-based columnar execution increases throughput and reduces cost.
Transparent Acceleration
– Reflections enable sub-second queries and eliminate copies and BI extracts.
Semantic Layer
– Data teams define and expose a logical data model for governed self-service.
Ingest & Transform Data
– DML and dbt integration help ingest data into the lakehouse and transform it as needed.
Open Data Formats
– Apache Iceberg ensures no vendor lock-in and the flexibility to use any engine.
Enterprise-Grade Security
– Role-based access control, native row/column-level policies and advanced integrations.
Powering Analytics for Thousands of Companies
Thank you!
Open Source Roots: Apache Arrow Inside
Apache Arrow was co-created by Dremio
– Dremio seeded the market with its internal memory format
– Arrow is now downloaded over 60M times per month
– Dremio is the only Arrow-based engine in the market
Arrow-based vectorized execution
– Data is immediately read into Arrow
– All operators use Arrow as input and output
– Gandiva: LLVM-based vectorized execution on Apache Arrow
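To make the idea of operators that take columns in and hand columns out more concrete, here is a minimal sketch in plain Python. It does not use Apache Arrow or Gandiva: the standard-library array module stands in for typed, contiguous column buffers, and the operator names (multiply, greater_than, filter_column) and the sample data are hypothetical.

```python
from array import array

# Row-oriented input: (order_id, unit_price, quantity).
rows = [(1, 10.0, 2), (2, 4.5, 4), (3, 20.0, 1)]

# Columnar layout: one typed, contiguous buffer per column.
order_id = array("q", (r[0] for r in rows))
unit_price = array("d", (r[1] for r in rows))
quantity = array("q", (r[2] for r in rows))


def multiply(left, right):
    """Column-at-a-time operator: takes two columns, returns a new column."""
    return array("d", (a * b for a, b in zip(left, right)))


def greater_than(column, threshold):
    """Produce a boolean selection mask for a whole column."""
    return [value > threshold for value in column]


def filter_column(column, mask):
    """Keep only the values whose mask entry is True, preserving the type."""
    return array(column.typecode, (v for v, keep in zip(column, mask) if keep))


# Operators chain without converting back to rows in between.
line_total = multiply(unit_price, quantity)
mask = greater_than(line_total, 19.0)
big_orders = filter_column(order_id, mask)
print(list(line_total), list(big_orders))
```

Each operator processes a whole column per call, which is the property that lets a real engine apply SIMD or LLVM-generated code to tight loops over a single data type.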
Dremio deployment architecture
[Diagram: data consumer tools (BI Users, SQL, Data Scientists) connect over ODBC | JDBC | REST | Arrow Flight to the Dremio data lake engine, made up of coordinator nodes and executor nodes orchestrated via Cloud, Kubernetes or YARN; the engine reaches data sources through optimized push-downs and uses data reflection stores, including external ones]
Query Acceleration: BI on Data Lakes
Columnar Cloud Cache (C3)
– NVMe-level I/O performance on S3/ADLS/GCS
– Eliminate S3/ADLS I/O costs (10-15% of cost per query)
– Use existing NVMe/SSD on EC2 instances & Azure VMs
– Transparent to analysts and engineers
Data Reflections
– Enable low-latency (including sub-second) BI queries
– Eliminate cubes and BI extracts
– Reduce infrastructure costs by up to 100x
– Persisted on Data Lake as Parquet/Iceberg tables
– Transparent to analysts (advanced query plan rewrites)
[Diagram, C3: engine executors, each with local NVMe, caching data from the data lake]
[Diagram, Reflections: Traditional (DL/DW): user-specific cubes, extracts and aggregations plus domain-specific data marts, where the user picks the best optimization. Dremio (DL): Reflections, where Dremio picks the best optimization]
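The "transparent to analysts" point rests on the planner, not the user, deciding when a precomputed result can answer a query. The toy sketch below illustrates that idea only. It is not Dremio's planner, and every name in it (AggReflection, build_reflection, sum_by, the sample rows) is hypothetical. The assumption is a simple matching rule: a SUM query can be served from an aggregate reflection when its group-by columns are a subset of the reflection's dimensions.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class AggReflection:
    """A precomputed SUM(measure) grouped by a fixed set of dimensions."""
    dimensions: tuple
    measure: str
    totals: dict  # maps a tuple of dimension values to the summed measure


def build_reflection(rows, dimensions, measure):
    totals = defaultdict(float)
    for row in rows:
        totals[tuple(row[d] for d in dimensions)] += row[measure]
    return AggReflection(tuple(dimensions), measure, dict(totals))


def sum_by(rows, group_by, measure, reflections=()):
    """Answer SUM(measure) GROUP BY group_by, preferring a matching reflection."""
    for ref in reflections:
        if ref.measure == measure and set(group_by) <= set(ref.dimensions):
            # Roll the smaller, pre-aggregated data up to the requested grain.
            positions = [ref.dimensions.index(d) for d in group_by]
            result = defaultdict(float)
            for key, total in ref.totals.items():
                result[tuple(key[i] for i in positions)] += total
            return dict(result), "reflection"
    # No usable reflection: fall back to scanning the raw rows.
    result = defaultdict(float)
    for row in rows:
        result[tuple(row[d] for d in group_by)] += row[measure]
    return dict(result), "raw scan"


sales = [
    {"region": "EMEA", "product": "a", "amount": 10.0},
    {"region": "EMEA", "product": "b", "amount": 5.0},
    {"region": "APAC", "product": "a", "amount": 7.0},
]
reflection = build_reflection(sales, ("region", "product"), "amount")

# The caller writes the same query either way; only the plan differs.
accelerated, plan = sum_by(sales, ("region",), "amount", [reflection])
baseline, baseline_plan = sum_by(sales, ("region",), "amount")
print(plan, accelerated)
```

The query text is identical in both calls, which is what removes the need for analysts to point their tools at a specific cube or extract.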
Multi-Engine Architecture
Engine Routing Rules
● User
● Roles
● Query type
● Query cost
● Connection parameters
● Date & time
● ...
[Diagram: each query is matched against the routing rules and placed in a queue, which feeds a right-sized engine (M, L, XL)]
60% LOWER EC2 COSTS
Auto-stop/start and right-sized engines eliminate the need to over-provision infrastructure.
0 NOISY NEIGHBOR CONCERNS
Workloads are physically separated so one workload can’t impact the performance of another workload.
100% CONTROL OF RESOURCES
Control resource allocation with policies such as query priority, max query cost, max queue time, max runtime, etc.
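Routing rules of this kind can be pictured as an ordered list of predicates in which the first match decides the engine. The sketch below is an illustration under that assumption, not Dremio's rule syntax or API. The Query and Rule types, the rule names, the cost thresholds, and the role and query-type strings are all hypothetical; only the engine sizes (M, L, XL) and the rule criteria (user, roles, query type, query cost) come from the slide.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Query:
    user: str
    roles: frozenset
    query_type: str
    cost: float


@dataclass(frozen=True)
class Rule:
    name: str
    matches: Callable[[Query], bool]
    engine: Optional[str]  # None means the query is rejected


# Rules are evaluated top to bottom; the first match wins (assumption).
RULES = [
    Rule("reject runaway queries", lambda q: q.cost > 1e9, None),
    Rule("ETL jobs", lambda q: q.query_type == "etl", "XL"),
    Rule(
        "heavy queries from data engineers",
        lambda q: "data_engineer" in q.roles and q.cost > 1e6,
        "L",
    ),
    Rule("everything else", lambda q: True, "M"),
]


def route(query, rules=RULES):
    """Return the engine name for a query, or raise if a rule rejects it."""
    for rule in rules:
        if rule.matches(query):
            if rule.engine is None:
                raise ValueError(f"query rejected by rule: {rule.name}")
            return rule.engine
    raise LookupError("no routing rule matched")


dashboard = Query("ana", frozenset({"analyst"}), "dashboard", 5e3)
backfill = Query("eli", frozenset({"data_engineer"}), "etl", 4e7)
adhoc = Query("eli", frozenset({"data_engineer"}), "adhoc", 2e6)
print(route(dashboard), route(backfill), route(adhoc))
```

Keeping a catch-all rule at the end means every query lands somewhere, while the reject rule at the top enforces a maximum query cost before any engine is chosen.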
The Dremio Advantage
Open Data, No Lock-In
Based on community-driven standards:
● Apache Parquet
● Apache Iceberg
● Apache Arrow
Sub-Second Performance at 1/10th the Cost
● Lightning-fast queries
● High concurrency
● No expensive data copies to manage
Self-Service Analytics
● Modern and Intuitive User Interface
● Unified View of Data (on-prem, hybrid and Cloud)
● Federated Queries
Compared with alternative platforms (names not preserved in the extracted text):
● No semantic layer
● No federated queries
● Cloud only
● Proprietary platform
● Must ingest data in order to query it
● Limited Apache Iceberg support
● Very expensive
● Data duplication
● No query acceleration
● Poor performance with open standards
● Designed for batch processing (ETL/data science)
● No semantic layer
● Experimental federated queries
● Cloud only
● Focused on Delta Lake, not Apache Iceberg
● No query acceleration, BI extracts/imports required for low latency
● Limited and expensive for data serving
● Proven cost reduction after replacement by Dremio

Dremio Overview
