Making Data Timelier and More Reliable
with Lakehouse Technology
Matei Zaharia
Databricks and Stanford University
Talk Outline
§ Many problems with data analytics today stem from the complex data
architectures we use
§ New “Lakehouse” technologies can remove this complexity by enabling
fast data warehousing, streaming & ML directly on data lake storage
The biggest challenges with data today:
data quality and staleness
Data Analyst Survey
60% reported data quality as top challenge
86% of analysts had to use stale data, with
41% using data that is >2 months old
90% regularly had unreliable data sources
Data Scientist Survey
[Chart: three bars at 75%, 51% and 42%; the category labels were not preserved in the export]
Getting high-quality, timely data is hard…
but it’s partly a problem of our own making!
The Evolution of Data Management
1980s: Data Warehouses
§ ETL data directly from operational
database systems
§ Purpose-built for SQL analytics & BI:
schemas, indexes, caching, etc
§ Powerful management features such as
ACID transactions and time travel
[Diagram: Operational Data → ETL → Data Warehouses → BI & Reports]
2010s: New Problems for Data Warehouses
§ Could not support rapidly growing
unstructured and semi-structured data:
time series, logs, images, documents, etc
§ High cost to store large datasets
§ No support for data science & ML
2010s: Data Lakes
§ Low-cost storage to hold all raw data
(e.g. Amazon S3, HDFS)
▪ $12/TB/month for S3's infrequent-access tier!
§ ETL jobs then load specific data into
warehouses, possibly for further ELT
§ Directly readable in ML libraries (e.g.
TensorFlow) due to open file format
[Diagram: structured, semi-structured & unstructured data lands in the Data Lake; ETL / data preparation jobs load it into Data Warehouses and a Real-Time Database, which serve BI, Reports, Data Science and Machine Learning]
Problems with Today’s Data Lakes
Cheap to store all the data, but system architecture is much more complex!
Data reliability suffers:
§ Multiple storage systems with different
semantics, SQL dialects, etc
§ Extra ETL steps that can go wrong
Timeliness suffers:
§ Extra ETL steps before data is available
in data warehouses
Summary
At least some of the problems in modern data architectures are due to
unnecessary system complexity
§ We wanted low-cost storage for large historical data, but we
designed separate storage systems (data lakes) for that
§ Now we need to sync data across systems all the time!
What if we didn’t need to have all these different data systems?
Lakehouse Technology
New techniques to provide data warehousing features directly on
data lake storage
§ Retain existing open file formats (e.g. Apache Parquet, ORC)
§ Add management and performance features on top
(transactions, data versioning, indexes, etc)
§ Can also help eliminate other data systems, e.g. message queues
Key parts: metadata layers such as Delta Lake (from Databricks)
and Apache Iceberg (from Netflix) + new engine designs
Lakehouse Vision
§ Data lake storage for all data
§ Single platform for every use case: streaming analytics, BI, data science & machine learning
§ Management features (transactions, versioning, etc)
[Diagram: one lakehouse layer over structured, semi-structured & unstructured data, serving all of these workloads]
Key Technologies Enabling Lakehouse
1. Metadata layers for data lakes: add transactions, versioning & more
2. New query engine designs: great SQL performance on data lake
storage systems and file formats
3. Optimized access for data science & ML
Metadata Layers for Data Lakes
§ A data lake is normally just a collection of files
§ Metadata layers keep track of which files are part of a table to enable
richer management features such as transactions
▪ Clients can then still access the underlying files at high speed
§ Implemented in multiple systems:
▪ Delta Lake and Apache Iceberg keep their metadata (and ACID transaction log) in the object store itself, while systems such as Hive ACID keep it in a database
Example: Basic Data Lake
The “events” table is just a collection of files: file1.parquet, file2.parquet, file3.parquet.
Query: delete all events data about customer #17
The engine rewrites file1.parquet → file1b.parquet and file3.parquet → file3b.parquet, then deletes file1.parquet and file3.parquet.
Problem: What if a query reads the table while the delete is running?
Problem: What if the query doing the delete fails partway through?
Example with Delta Lake
The “events” table again holds file1.parquet, file2.parquet and file3.parquet, plus a transaction log (_delta_log/v1.parquet, v2.parquet, …) that tracks which files are part of each version of the table (e.g. v2 = file1, file2, file3).
Query: delete all events data about customer #17
The engine rewrites file1.parquet → file1b.parquet and file3.parquet → file3b.parquet, then atomically adds a new log file: v3 = file1b, file2, file3b.
Clients now always read a consistent table version!
• If a client reads v2 of the log, it sees file1, file2, file3 (no delete)
• If a client reads v3 of the log, it sees file1b, file2, file3b (all deleted)
See our VLDB 2020 paper for details
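To make the log protocol concrete, here is a minimal Python sketch of how a reader could resolve a consistent snapshot from a Delta-style log. It is illustrative only: the real Delta Lake log uses JSON action files plus Parquet checkpoints, and the full protocol is specified in the VLDB 2020 paper.

import json
import os

def latest_snapshot(table_path):
    # Resolve the set of data files in the newest committed version
    # of a Delta-style table (simplified illustration).
    log_dir = os.path.join(table_path, "_delta_log")
    # Log entries are named by increasing version number; replaying
    # them in order reconstructs the table state.
    versions = sorted(
        int(name.split(".")[0])
        for name in os.listdir(log_dir)
        if name.endswith(".json")
    )
    files = set()
    for v in versions:
        with open(os.path.join(log_dir, "%020d.json" % v)) as fh:
            for line in fh:
                action = json.loads(line)
                if "add" in action:       # a data file joined this version
                    files.add(action["add"]["path"])
                elif "remove" in action:  # a data file was logically deleted
                    files.discard(action["remove"]["path"])
    return files

Because a new log entry is added atomically, a concurrent reader sees either the old version or the new one, never a half-applied delete.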
Other Management Features with Delta Lake
§ Time travel to an old table version
§ Zero-copy CLONE by forking the log
§ DESCRIBE HISTORY
§ INSERT, UPSERT, DELETE & MERGE
SELECT * FROM my_table
  TIMESTAMP AS OF '2020-05-01'
CREATE TABLE my_table_dev
SHALLOW CLONE my_table
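The same time travel is available from the DataFrame reader; a hedged example using Delta Lake's documented timestampAsOf option (the table path is hypothetical):

df = (spark.read.format("delta")
      .option("timestampAsOf", "2020-05-01")   # read the table as of this date
      .load("/delta/my_table"))                # hypothetical path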
Other Management Features with Delta Lake
§ Streaming I/O: treat a table as a
stream of changes to remove need
for message buses like Kafka
§ Schema enforcement & evolution
§ Expectations for data quality
CREATE TABLE orders (
product_id INTEGER NOT NULL,
quantity INTEGER CHECK(quantity > 0),
list_price DECIMAL CHECK(list_price > 0),
discount_price DECIMAL
CHECK(discount_price > 0 AND
discount_price <= list_price)
);
spark.readStream
.format("delta")
.table("events")
Adoption
§ Used by thousands of companies to process exabytes of data/day
§ Grew from zero to ~50% of the Databricks workload in 3 years
§ Largest deployments: exabyte tables and 1000s of users
Available Connectors
[Slide: logo grid of systems that can store data in, ingest into, and query Delta Lake]
Key Technologies Enabling Lakehouse
1. Metadata layers for data lakes: add transactions, versioning & more
2. New query engine designs: great SQL performance on data lake
storage systems and file formats
3. Optimized access for data science & ML
The Challenge
§ Most data warehouses have full control over the data storage system
and query engine, so they design them together
§ The key idea in a Lakehouse is to store data in open storage formats
(e.g. Parquet) for direct access from many systems
§ How can we get great performance with these standard, open formats?
Enabling Lakehouse Performance
Even with a fixed, directly-accessible storage format, four optimizations
can enable great SQL performance:
§ Caching hot data, possibly in a different format
§ Auxiliary data structures like statistics and indexes
§ Data layout optimizations to minimize I/O
§ Vectorized execution engines for modern CPUs
New query engines such as Databricks Delta Engine use these ideas
Optimization 1: Caching
§ Most query workloads have concentrated accesses on “hot” data
▪ Data warehouses use SSD and memory caches to improve performance
§ The same techniques work in a Lakehouse if we have a metadata layer
such as Delta Lake to correctly maintain the cache
▪ Caches can even hold data in a faster format (e.g. decompressed)
§ Example: SSD cache in Databricks Delta Engine
[Chart: values read per second per core (millions) for Parquet on S3 vs. Parquet on SSD vs. the Delta Engine cache]
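On Databricks, this SSD cache is controlled by a documented configuration flag; a hedged example (Databricks-specific, a no-op in open-source Spark):

spark.conf.set("spark.databricks.io.cache.enabled", "true")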
Optimization 2: Auxiliary Data Structures
§ Even if the base data is in Parquet, we can build many other data
structures to speed up queries and maintain them transactionally
▪ Inspired by the literature on databases for “raw” data formats
§ Example: min/max statistics on Parquet files for data skipping
file1.parquet: year min 2018, max 2019; uid min 12000, max 23000
file2.parquet: year min 2018, max 2020; uid min 12000, max 14000
file3.parquet: year min 2020, max 2020; uid min 23000, max 25000
(statistics updated transactionally with the Delta table log)
Query: SELECT * FROM events WHERE year=2020 AND uid=24000
→ only file3.parquet can contain matching rows, so the other two files are skipped
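A minimal Python sketch of the skipping logic using the statistics above (illustrative only; real engines evaluate general predicates against the per-file min/max ranges stored in the log):

stats = {
    "file1.parquet": {"year": (2018, 2019), "uid": (12000, 23000)},
    "file2.parquet": {"year": (2018, 2020), "uid": (12000, 14000)},
    "file3.parquet": {"year": (2020, 2020), "uid": (23000, 25000)},
}

def files_to_scan(filters):
    # filters: {column: value} equality predicates
    return [
        f for f, cols in stats.items()
        if all(cols[c][0] <= v <= cols[c][1] for c, v in filters.items())
    ]

print(files_to_scan({"year": 2020, "uid": 24000}))  # -> ['file3.parquet']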
Optimization 2: Auxiliary Data Structures (continued)
§ Example: indexes over Parquet files
Query: SELECT * FROM events WHERE type = 'DELETE_ACCOUNT'
[Diagram: a tree index over file1.parquet, file2.parquet and file3.parquet locates the files that contain matching records]
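Databricks, for instance, exposes Bloom filter indexes on Delta tables for exactly this kind of point lookup; a hedged example of the documented SQL syntax (column and option values are illustrative):

spark.sql("""
    CREATE BLOOMFILTER INDEX ON TABLE events
    FOR COLUMNS (type OPTIONS (fpp = 0.1, numItems = 1000000))
""")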
Optimization 3: Data Layout
§ Query execution time primarily depends on amount of data accessed
§ Even with a fixed storage format such as Parquet, we can optimize the
data layout within tables to reduce execution time
§ Example: sorting a table for fast access
file1.parquet: uid = 0…1000
file2.parquet: uid = 1001…2000
file3.parquet: uid = 2001…3000
file4.parquet: uid = 3001…4000
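A hedged sketch of producing such a layout with standard Spark APIs: range-partition by uid, sort within partitions, and write out as Delta (the output path is hypothetical):

(spark.table("events")
      .repartitionByRange(4, "uid")      # roughly one file per uid range
      .sortWithinPartitions("uid")
      .write.format("delta")
      .mode("overwrite")
      .save("/delta/events_sorted"))     # hypothetical path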
Optimization 3: Data Layout (continued)
§ Example: Z-ordering for multi-dimensional access
[Diagram: a Z-order curve interleaving dimension 1 and dimension 2]

Data files skipped (higher is better):

  Filter on    Sort by col1    Z-Order by col1-4
  col1         99%             67%
  col2         0%              60%
  col3         0%              47%
  col4         0%              44%

Sorting by one column skips almost everything for that column but nothing for the others; Z-ordering gives useful skipping on all four dimensions.
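On Databricks, Z-ordering is applied with the documented OPTIMIZE command; a hedged example (table and column names are illustrative):

spark.sql("OPTIMIZE events ZORDER BY (col1, col2, col3, col4)")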
Optimization 4: Vectorized Execution
§ Modern data warehouses optimize CPU time by using vector (SIMD) instructions such as AVX-512
§ Many of these optimizations can also be applied over Parquet
§ Databricks Delta Engine: ~10x faster
than Java-based engines
TPC-DS Benchmark Time (s): Delta Engine 3,220; Apache Spark 3.0 25,544; Presto 230 35,861
Putting These Optimizations Together
§ Given that (1) most reads are from a cache, (2) I/O cost is the key factor
for non-cached data, and (3) CPU time can be optimized via SIMD…
§ Lakehouse engines can offer similar performance to DWs!
[Chart: TPC-DS 30 TB Benchmark Time (s) for three commercial data warehouses (DW1, DW2, DW3) and Delta Engine; absolute values not preserved in the export]
Key Technologies Enabling Lakehouse
1. Metadata layers for data lakes: add transactions, versioning & more
2. New query engine designs: great SQL performance on data lake
storage systems and file formats
3. Optimized access for data science & ML
ML over a Data Warehouse is Painful
§ Unlike SQL workloads, ML workloads need to process large amounts of
data with non-SQL code (e.g. TensorFlow, XGBoost, etc)
§ SQL over JDBC/ODBC interface is too slow for this at scale
§ Export data to a data lake? → adds a third ETL step and more staleness!
§ Maintain production datasets in both DW & lake? → even more complex
ML over a Lakehouse
§ Direct access to data files without overloading the SQL frontend
(e.g., just run a GPU cluster to do deep learning on S3 data)
▪ ML frameworks already support reading Parquet!
§ New declarative APIs for ML data prep enable further optimization
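To illustrate the direct-access point, a hedged sketch that trains a model straight from Parquet files on the lake, with no JDBC/ODBC export step (the path and column names are hypothetical):

import pandas as pd
import xgboost as xgb

# Read lake files directly in their open format
df = pd.read_parquet("s3://bucket/events/")          # hypothetical path
X, y = df.drop(columns=["label"]), df["label"]       # hypothetical label column
model = xgb.XGBClassifier(n_estimators=100)
model.fit(X, y)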
Example: Spark’s Declarative DataFrame API
Users write DataFrame code in Python, R or Java
users = spark.table("users")
buyers = users[users.kind == "buyer"]
train_set = buyers["start_date", "zip", "quantity"].fillna(0)
The same code, extended with model training:

...
model.fit(train_set)

builds a lazily evaluated query plan:

users
→ SELECT(kind = "buyer")
→ PROJECT(start_date, zip, …)
→ PROJECT(NULL → 0)

which the engine can execute with optimizations such as the cache, statistics and indexes.
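Because evaluation is lazy, the plan can be inspected before any data is read:

train_set.explain(True)   # prints the logical and physical plans Spark will execute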
ML over Lakehouse: Management Features
Lakehouse systems’ management features also make ML easier!
§ Use time travel for data versioning and reproducible experiments
§ Use transactions to reliably update tables
§ Always access the latest data from streaming I/O
Example: organizations using Delta Lake as an ML “feature store”
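For example, a training set can be pinned to a fixed table version for a reproducible experiment, using Delta Lake's documented versionAsOf reader option (the version number and path are illustrative):

train_v12 = (spark.read.format("delta")
             .option("versionAsOf", 12)    # reproducible snapshot
             .load("/delta/features"))     # hypothetical path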
Summary
Lakehouse systems combine the benefits of data warehouses & lakes
§ Management features via metadata
layers (transactions, CLONE, etc)
§ Performance via new query engines
§ Direct access via open file formats
§ Low cost, equal to cloud object storage
[Diagram: the lakehouse serving streaming analytics, BI, data science and machine learning over structured, semi-structured & unstructured data]
Result: simplify data architectures to
improve both reliability & timeliness
Before and After Lakehouse

Typical Architecture with Many Data Systems:
[Diagram: input flows through a Message Queue and ETL Jobs into Parquet Tables 1-3 on a Cloud Object Store, with further ETL Jobs copying data into two separate Data Warehouses; Streaming Analytics, Data Scientists and BI Users each read from different systems]

Lakehouse Architecture: All Data in Object Store:
[Diagram: input flows through ETL Jobs into Delta Lake Tables 1-3 on the Cloud Object Store, which directly serve Streaming Analytics, Data Scientists and BI Users]

Fewer copies of the data, fewer ETL steps, no divergence & faster results!
Learn More
Download and learn Delta Lake at delta.io
View free content from our conferences at spark-summit.org