SlideShare a Scribd company logo
Delta Lake and the Delta
Architecture
07/07/2021
1
Personal introduction
Yue Fang
a big data enthusiast, and has worked on big data tech skills for almost 10 years.
builds data pipelines and platforms on Cloudera's platform and Azure's Cloud.
is a certified AWS solution architect.
deep experience using spark structured streaming, Kafka, Cassandra, Hive,
HBase, Solr, EventHub and Cosmosdb.
worked on the Azure Databricks platform and Delta Lake as well.
2
Outline
● Apache Spark problems
● Data Lake problems
● What is DataBricks?
● Delta Lake key features
● Delta Lake architecture
● Lakehouse architecture
3
Apache Spark Problems
● Not ACID compliant
● Missing schema enforcement
● Small files - big problems
- File listing
- File opening/closing
- Reduced compression effectiveness
- Excessive metadata(external HIVE tables)
4
Two docs for details.
Generic Load/Save Functions - Spark 3.1.2 Documentation
Transactional writes to cloud storage with DBIO | Databricks on AWS
Data Lakes Problems
A data lake is a centralized repository that allows you to store all your
structured and unstructured data at any scale.
5
Reliability issues
● Failed production jobs leave data in
corrupt state
● Lack of schema enforcement
● creates inconsistent and low quality
data(schema-on-read)
● Lack of consistency makes it almost
impossible to mix appends and reads,
batch and streaming
Performance issues
● File size inconsistency with either too small or
too big files
● Slow read/write performance of cloud storage
compared to file system storage
Garbage In Garbage Out
Why is Databricks?
6
source: Comparing Databricks to Apache Spark
Databricks builds on top of Spark and adds:
- Highly reliable and performant data
pipelines
- Productive data science at scale.
Delta Lake introduction
7
What is Delta Lake?
● an open source project that enables building a Lakehouse architecture on top
of data lakes.
● a storage layer that brings scalable, ACID transactions to Apache Spark and
other big-data engines.
● Delta Lake provides ACID transactions, scalable metadata handling, and
unifies streaming and batch data processing on top of existing data lakes,
such as S3, ADLS, GCS, and HDFS.
8
Delta Lake key features
● ACID Transactions
● Scalable Metadata Handling
● Time Travel (data versioning)
● Open Format
● Delta Lake change data feed
● Unified Batch and Streaming Source and Sink
● Schema Enforcement
● Schema Evolution
● Audit History
● Updates and Delete
● 100% Compatible with Apache Spark API
● Data Clean-up
9
Delta Lake key feature - ACID transaction
● What the transaction log is.
● How the transaction log serves as a single source of truth to support ACID.
● How Delta Lake computes the state of each table.
● Using optimistic concurrency control.
● How Delta Lake uses mutual exclusion to ensure that commits
are serialized properly.
10
DEMO
Delta Lake key feature - Schema Enforcement
Schema enforcement, also known as schema validation, is a safeguard in Delta
Lake that ensures data quality by rejecting writes to a table that do not match the
table’s schema.
● Schema validation on write.
● Cannot contain any additional columns that are not present in the target
table’s schema
● Cannot have column data types that differ from the column data types in the
target table.
● Can not contain column names that differ only by case.
● Table’s schema is saved in JSON format inside the transaction log.
11
Delta Lake key feature - Schema Evolution
Schema evolution is a feature that allows users to easily change a table’s current
schema to accommodate data that is changing over time.
“Read-Compatible” Schema Change .option('mergeSchema', 'true')
● Adding new columns (this is the most common scenario)
● Changing of data types from NullType -> any other type, or upcasts from
ByteType -> ShortType -> IntegerType
“Non-Read-Compabtile” Schema Change .option("overwriteSchema", "true")
● Dropping a column
● Changing an existing column’s data type (in place)
● Renaming column names that differ only by case (e.g. “Foo” and “foo”)
12
DEMO
Delta Lake key feature - Time Travel
Delta Lake time travel allows you to query an older snapshot of a Delta table.
● Timestamp based
● Version number based
● Data retention
Transaction log file retention period
delta.logRetentionDuration = 30 days at default
Data file retention period
delta.deletedFileRetentionDuration = 7
● Use cases:
○ Audit data changes
○ Reproduce experiments & report
○ Rollbacks
https://docs.databricks.com/spark/latest/spark-sql/language-manual/sql-ref-syntax-ddl-alter-table.html#delta-table-schema-options
13
DEMO
Delta Lake key feature - Table Utility Commands
● Remove files no longer referenced by Delta table
● Audit History
● Retrieve Table Details
● Generate a manifest file
● Convert parquet table to Delta table
● Convert Delta table to parquet table
14
DML Internals (Delete, Update, Merge) - Delta Lake Tech Talks Table deletes, updates, and merges | Databricks on AWS
DEMO
Delta Lake key feature - Insert|Delete|UpSert
● SQL
INSERT
DELETE
UPDATE
MERGE
● Delta Table API
delete
update
Merge
A merge operation can fail if multiple rows of the source dataset match and
attempt to update the same rows of the target Delta table
15
DML Internals (Delete, Update, Merge) - Delta Lake Tech Talks Table deletes, updates, and merges | Databricks on AWS
Delta Lake key feature - Clean up
● Transaction Log clean up
_delta_log
Checkpoint log file
delta.logRetentionDuration=30 days at default
● Data file clean up
SQL
API
Vacuum command
Retention 7 days at default
spark.databricks.delta.retentionDurationCheck.enabled = true|false
16
DML Internals (Delete, Update, Merge) - Delta Lake Tech Talks Table deletes, updates, and merges | Databricks on AWS
Delta Lake key feature - Streaming as source and sink
● Delta Lake is deeply integrated with Spark Structured Streaming through
readStream and writeStream.
● Delta Lake overcomes many of the limitations typically associated with
streaming systems and files, including:
Maintaining “exactly-once” processing with more than one stream (or
concurrent batch jobs)
Efficiently discovering which files are new when using files as the source
for a stream
17
DML Internals (Delete, Update, Merge) - Delta Lake Tech Talks Table deletes, updates, and merges | Databricks on AWS
Delta Lake key feature - Streaming as source and sink
As a source
does not handle input that is not an append and
throws an exception if any modifications occur on
the table being used as a source.
● delete the output and checkpoint and restart
the stream from the beginning.
● set either of these two options:
ignoreDeletes
ignoreChanges
Specify initial position
● startingVersion
● startingTimestamp
18
As a sink
● Append mode
● Complete mode
Delta Lake key feature - Delta Lake change data feed
● Support DataBricks Runtime 8.4 and above
● The Delta change data feed represents row-level changes between
versions of a Delta table.
● set spark.databricks.delta.properties.defaults.enableChangeDataFeed =
true;
● Change data event schema
In addition to the data columns, change data contains metadata
columns that identify the type of change event:
_change_type
>>insert, update_preimage , update_postimage, delete
_commit_version
_commit_timestamp
19
DML Internals (Delete, Update, Merge) - Delta Lake Tech Talks Table deletes, updates, and merges | Databricks on AWS
Delta Lake Architecture
20
DEMO
LakeHouse Architecture
21
A paradigm or conception of modern architecture.
Rely on Delta Lake under the hood
Replace additional data warehouse and data lake
Need fast SQL analysis engine
Future trend
Thank you
22
Any questions are welcome.
Learning together

More Related Content

What's hot

[DSC Europe 22] Overview of the Databricks Platform - Petar Zecevic
[DSC Europe 22] Overview of the Databricks Platform - Petar Zecevic[DSC Europe 22] Overview of the Databricks Platform - Petar Zecevic
[DSC Europe 22] Overview of the Databricks Platform - Petar Zecevic
DataScienceConferenc1
 
Free Training: How to Build a Lakehouse
Free Training: How to Build a LakehouseFree Training: How to Build a Lakehouse
Free Training: How to Build a Lakehouse
Databricks
 
Azure DataBricks for Data Engineering by Eugene Polonichko
Azure DataBricks for Data Engineering by Eugene PolonichkoAzure DataBricks for Data Engineering by Eugene Polonichko
Azure DataBricks for Data Engineering by Eugene Polonichko
Dimko Zhluktenko
 
Spark with Delta Lake
Spark with Delta LakeSpark with Delta Lake
Spark with Delta Lake
Knoldus Inc.
 
[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...
[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...
[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...
DataScienceConferenc1
 
Intro to databricks delta lake
 Intro to databricks delta lake Intro to databricks delta lake
Intro to databricks delta lake
Mykola Zerniuk
 
3D: DBT using Databricks and Delta
3D: DBT using Databricks and Delta3D: DBT using Databricks and Delta
3D: DBT using Databricks and Delta
Databricks
 
Delta Lake with Azure Databricks
Delta Lake with Azure DatabricksDelta Lake with Azure Databricks
Delta Lake with Azure Databricks
Dustin Vannoy
 
Moving to Databricks & Delta
Moving to Databricks & DeltaMoving to Databricks & Delta
Moving to Databricks & Delta
Databricks
 
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
Databricks
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and Hudi
Databricks
 
Data Migration to Azure
Data Migration to AzureData Migration to Azure
Data Migration to Azure
Sanjay B. Bhakta
 
Apache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the CoversApache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the Covers
ScyllaDB
 
TechEvent Databricks on Azure
TechEvent Databricks on AzureTechEvent Databricks on Azure
TechEvent Databricks on Azure
Trivadis
 
The delta architecture
The delta architectureThe delta architecture
The delta architecture
Prakash Chockalingam
 
Modernizing to a Cloud Data Architecture
Modernizing to a Cloud Data ArchitectureModernizing to a Cloud Data Architecture
Modernizing to a Cloud Data Architecture
Databricks
 
The Hidden Value of Hadoop Migration
The Hidden Value of Hadoop MigrationThe Hidden Value of Hadoop Migration
The Hidden Value of Hadoop Migration
Databricks
 
Change Data Feed in Delta
Change Data Feed in DeltaChange Data Feed in Delta
Change Data Feed in Delta
Databricks
 
Apache Iceberg Presentation for the St. Louis Big Data IDEA
Apache Iceberg Presentation for the St. Louis Big Data IDEAApache Iceberg Presentation for the St. Louis Big Data IDEA
Apache Iceberg Presentation for the St. Louis Big Data IDEA
Adam Doyle
 
Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta LakeSimplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Databricks
 

What's hot (20)

[DSC Europe 22] Overview of the Databricks Platform - Petar Zecevic
[DSC Europe 22] Overview of the Databricks Platform - Petar Zecevic[DSC Europe 22] Overview of the Databricks Platform - Petar Zecevic
[DSC Europe 22] Overview of the Databricks Platform - Petar Zecevic
 
Free Training: How to Build a Lakehouse
Free Training: How to Build a LakehouseFree Training: How to Build a Lakehouse
Free Training: How to Build a Lakehouse
 
Azure DataBricks for Data Engineering by Eugene Polonichko
Azure DataBricks for Data Engineering by Eugene PolonichkoAzure DataBricks for Data Engineering by Eugene Polonichko
Azure DataBricks for Data Engineering by Eugene Polonichko
 
Spark with Delta Lake
Spark with Delta LakeSpark with Delta Lake
Spark with Delta Lake
 
[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...
[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...
[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...
 
Intro to databricks delta lake
 Intro to databricks delta lake Intro to databricks delta lake
Intro to databricks delta lake
 
3D: DBT using Databricks and Delta
3D: DBT using Databricks and Delta3D: DBT using Databricks and Delta
3D: DBT using Databricks and Delta
 
Delta Lake with Azure Databricks
Delta Lake with Azure DatabricksDelta Lake with Azure Databricks
Delta Lake with Azure Databricks
 
Moving to Databricks & Delta
Moving to Databricks & DeltaMoving to Databricks & Delta
Moving to Databricks & Delta
 
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and Hudi
 
Data Migration to Azure
Data Migration to AzureData Migration to Azure
Data Migration to Azure
 
Apache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the CoversApache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the Covers
 
TechEvent Databricks on Azure
TechEvent Databricks on AzureTechEvent Databricks on Azure
TechEvent Databricks on Azure
 
The delta architecture
The delta architectureThe delta architecture
The delta architecture
 
Modernizing to a Cloud Data Architecture
Modernizing to a Cloud Data ArchitectureModernizing to a Cloud Data Architecture
Modernizing to a Cloud Data Architecture
 
The Hidden Value of Hadoop Migration
The Hidden Value of Hadoop MigrationThe Hidden Value of Hadoop Migration
The Hidden Value of Hadoop Migration
 
Change Data Feed in Delta
Change Data Feed in DeltaChange Data Feed in Delta
Change Data Feed in Delta
 
Apache Iceberg Presentation for the St. Louis Big Data IDEA
Apache Iceberg Presentation for the St. Louis Big Data IDEAApache Iceberg Presentation for the St. Louis Big Data IDEA
Apache Iceberg Presentation for the St. Louis Big Data IDEA
 
Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta LakeSimplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake
 

Similar to Delta lake and the delta architecture

Pulsar in the Lakehouse: Overview of Apache Pulsar and Delta Lake Connector -...
Pulsar in the Lakehouse: Overview of Apache Pulsar and Delta Lake Connector -...Pulsar in the Lakehouse: Overview of Apache Pulsar and Delta Lake Connector -...
Pulsar in the Lakehouse: Overview of Apache Pulsar and Delta Lake Connector -...
StreamNative
 
Databricks Platform.pptx
Databricks Platform.pptxDatabricks Platform.pptx
Databricks Platform.pptx
Alex Ivy
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
Databricks
 
Growing the Delta Ecosystem to Rust and Python with Delta-RS
Growing the Delta Ecosystem to Rust and Python with Delta-RSGrowing the Delta Ecosystem to Rust and Python with Delta-RS
Growing the Delta Ecosystem to Rust and Python with Delta-RS
Databricks
 
Reshape Data Lake (as of 2020.07)
Reshape Data Lake (as of 2020.07)Reshape Data Lake (as of 2020.07)
Reshape Data Lake (as of 2020.07)
Eric Sun
 
Healthcare Claim Reimbursement using Apache Spark
Healthcare Claim Reimbursement using Apache SparkHealthcare Claim Reimbursement using Apache Spark
Healthcare Claim Reimbursement using Apache Spark
Databricks
 
MS-SQL SERVER ARCHITECTURE
MS-SQL SERVER ARCHITECTUREMS-SQL SERVER ARCHITECTURE
MS-SQL SERVER ARCHITECTURE
Douglas Bernardini
 
Aioug vizag oracle12c_new_features
Aioug vizag oracle12c_new_featuresAioug vizag oracle12c_new_features
Aioug vizag oracle12c_new_features
AiougVizagChapter
 
From Data Warehouse to Lakehouse
From Data Warehouse to LakehouseFrom Data Warehouse to Lakehouse
From Data Warehouse to Lakehouse
Modern Data Stack France
 
Getting Started with Delta Lake on Databricks
Getting Started with Delta Lake on DatabricksGetting Started with Delta Lake on Databricks
Getting Started with Delta Lake on Databricks
Knoldus Inc.
 
Otimizações de Projetos de Big Data, Dw e AI no Microsoft Azure
Otimizações de Projetos de Big Data, Dw e AI no Microsoft AzureOtimizações de Projetos de Big Data, Dw e AI no Microsoft Azure
Otimizações de Projetos de Big Data, Dw e AI no Microsoft Azure
Luan Moreno Medeiros Maciel
 
Building Reliable Lakehouses with Apache Flink and Delta Lake
Building Reliable Lakehouses with Apache Flink and Delta LakeBuilding Reliable Lakehouses with Apache Flink and Delta Lake
Building Reliable Lakehouses with Apache Flink and Delta Lake
Flink Forward
 
Transactional writes to cloud storage with Eric Liang
Transactional writes to cloud storage with Eric LiangTransactional writes to cloud storage with Eric Liang
Transactional writes to cloud storage with Eric Liang
Databricks
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Databricks
 
Simplifying Hadoop with RecordService, A Secure and Unified Data Access Path ...
Simplifying Hadoop with RecordService, A Secure and Unified Data Access Path ...Simplifying Hadoop with RecordService, A Secure and Unified Data Access Path ...
Simplifying Hadoop with RecordService, A Secure and Unified Data Access Path ...
Cloudera, Inc.
 
Be A Hero: Transforming GoPro Analytics Data Pipeline
Be A Hero: Transforming GoPro Analytics Data PipelineBe A Hero: Transforming GoPro Analytics Data Pipeline
Be A Hero: Transforming GoPro Analytics Data Pipeline
Chester Chen
 
Building the Petcare Data Platform using Delta Lake and 'Kyte': Our Spark ETL...
Building the Petcare Data Platform using Delta Lake and 'Kyte': Our Spark ETL...Building the Petcare Data Platform using Delta Lake and 'Kyte': Our Spark ETL...
Building the Petcare Data Platform using Delta Lake and 'Kyte': Our Spark ETL...
Databricks
 
In-memory ColumnStore Index
In-memory ColumnStore IndexIn-memory ColumnStore Index
In-memory ColumnStore Index
SolidQ
 
SQL on Hadoop
SQL on HadoopSQL on Hadoop
SQL on Hadoop
nvvrajesh
 
Bravo Six, Going Realtime. Transitioning Activision Data Pipeline to Streaming
Bravo Six, Going Realtime. Transitioning Activision Data Pipeline to StreamingBravo Six, Going Realtime. Transitioning Activision Data Pipeline to Streaming
Bravo Six, Going Realtime. Transitioning Activision Data Pipeline to Streaming
Yaroslav Tkachenko
 

Similar to Delta lake and the delta architecture (20)

Pulsar in the Lakehouse: Overview of Apache Pulsar and Delta Lake Connector -...
Pulsar in the Lakehouse: Overview of Apache Pulsar and Delta Lake Connector -...Pulsar in the Lakehouse: Overview of Apache Pulsar and Delta Lake Connector -...
Pulsar in the Lakehouse: Overview of Apache Pulsar and Delta Lake Connector -...
 
Databricks Platform.pptx
Databricks Platform.pptxDatabricks Platform.pptx
Databricks Platform.pptx
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 
Growing the Delta Ecosystem to Rust and Python with Delta-RS
Growing the Delta Ecosystem to Rust and Python with Delta-RSGrowing the Delta Ecosystem to Rust and Python with Delta-RS
Growing the Delta Ecosystem to Rust and Python with Delta-RS
 
Reshape Data Lake (as of 2020.07)
Reshape Data Lake (as of 2020.07)Reshape Data Lake (as of 2020.07)
Reshape Data Lake (as of 2020.07)
 
Healthcare Claim Reimbursement using Apache Spark
Healthcare Claim Reimbursement using Apache SparkHealthcare Claim Reimbursement using Apache Spark
Healthcare Claim Reimbursement using Apache Spark
 
MS-SQL SERVER ARCHITECTURE
MS-SQL SERVER ARCHITECTUREMS-SQL SERVER ARCHITECTURE
MS-SQL SERVER ARCHITECTURE
 
Aioug vizag oracle12c_new_features
Aioug vizag oracle12c_new_featuresAioug vizag oracle12c_new_features
Aioug vizag oracle12c_new_features
 
From Data Warehouse to Lakehouse
From Data Warehouse to LakehouseFrom Data Warehouse to Lakehouse
From Data Warehouse to Lakehouse
 
Getting Started with Delta Lake on Databricks
Getting Started with Delta Lake on DatabricksGetting Started with Delta Lake on Databricks
Getting Started with Delta Lake on Databricks
 
Otimizações de Projetos de Big Data, Dw e AI no Microsoft Azure
Otimizações de Projetos de Big Data, Dw e AI no Microsoft AzureOtimizações de Projetos de Big Data, Dw e AI no Microsoft Azure
Otimizações de Projetos de Big Data, Dw e AI no Microsoft Azure
 
Building Reliable Lakehouses with Apache Flink and Delta Lake
Building Reliable Lakehouses with Apache Flink and Delta LakeBuilding Reliable Lakehouses with Apache Flink and Delta Lake
Building Reliable Lakehouses with Apache Flink and Delta Lake
 
Transactional writes to cloud storage with Eric Liang
Transactional writes to cloud storage with Eric LiangTransactional writes to cloud storage with Eric Liang
Transactional writes to cloud storage with Eric Liang
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
 
Simplifying Hadoop with RecordService, A Secure and Unified Data Access Path ...
Simplifying Hadoop with RecordService, A Secure and Unified Data Access Path ...Simplifying Hadoop with RecordService, A Secure and Unified Data Access Path ...
Simplifying Hadoop with RecordService, A Secure and Unified Data Access Path ...
 
Be A Hero: Transforming GoPro Analytics Data Pipeline
Be A Hero: Transforming GoPro Analytics Data PipelineBe A Hero: Transforming GoPro Analytics Data Pipeline
Be A Hero: Transforming GoPro Analytics Data Pipeline
 
Building the Petcare Data Platform using Delta Lake and 'Kyte': Our Spark ETL...
Building the Petcare Data Platform using Delta Lake and 'Kyte': Our Spark ETL...Building the Petcare Data Platform using Delta Lake and 'Kyte': Our Spark ETL...
Building the Petcare Data Platform using Delta Lake and 'Kyte': Our Spark ETL...
 
In-memory ColumnStore Index
In-memory ColumnStore IndexIn-memory ColumnStore Index
In-memory ColumnStore Index
 
SQL on Hadoop
SQL on HadoopSQL on Hadoop
SQL on Hadoop
 
Bravo Six, Going Realtime. Transitioning Activision Data Pipeline to Streaming
Bravo Six, Going Realtime. Transitioning Activision Data Pipeline to StreamingBravo Six, Going Realtime. Transitioning Activision Data Pipeline to Streaming
Bravo Six, Going Realtime. Transitioning Activision Data Pipeline to Streaming
 

More from Adam Doyle

ML Ops.pptx
ML Ops.pptxML Ops.pptx
ML Ops.pptx
Adam Doyle
 
Data Engineering Roles
Data Engineering RolesData Engineering Roles
Data Engineering Roles
Adam Doyle
 
Managed Cluster Services
Managed Cluster ServicesManaged Cluster Services
Managed Cluster Services
Adam Doyle
 
Great Expectations Presentation
Great Expectations PresentationGreat Expectations Presentation
Great Expectations Presentation
Adam Doyle
 
May 2021 Spark Testing ... or how to farm reputation on StackOverflow
May 2021 Spark Testing ... or how to farm reputation on StackOverflowMay 2021 Spark Testing ... or how to farm reputation on StackOverflow
May 2021 Spark Testing ... or how to farm reputation on StackOverflow
Adam Doyle
 
Automate your data flows with Apache NIFI
Automate your data flows with Apache NIFIAutomate your data flows with Apache NIFI
Automate your data flows with Apache NIFI
Adam Doyle
 
Localized Hadoop Development
Localized Hadoop DevelopmentLocalized Hadoop Development
Localized Hadoop Development
Adam Doyle
 
The new big data
The new big dataThe new big data
The new big data
Adam Doyle
 
Feature store Overview St. Louis Big Data IDEA Meetup aug 2020
Feature store Overview   St. Louis Big Data IDEA Meetup aug 2020Feature store Overview   St. Louis Big Data IDEA Meetup aug 2020
Feature store Overview St. Louis Big Data IDEA Meetup aug 2020
Adam Doyle
 
Snowflake Data Science and AI/ML at Scale
Snowflake Data Science and AI/ML at ScaleSnowflake Data Science and AI/ML at Scale
Snowflake Data Science and AI/ML at Scale
Adam Doyle
 
Operationalizing Data Science St. Louis Big Data IDEA
Operationalizing Data Science St. Louis Big Data IDEAOperationalizing Data Science St. Louis Big Data IDEA
Operationalizing Data Science St. Louis Big Data IDEA
Adam Doyle
 
Retooling on the Modern Data and Analytics Tech Stack
Retooling on the Modern Data and Analytics Tech StackRetooling on the Modern Data and Analytics Tech Stack
Retooling on the Modern Data and Analytics Tech Stack
Adam Doyle
 
Stl meetup cloudera platform - january 2020
Stl meetup   cloudera platform  - january 2020Stl meetup   cloudera platform  - january 2020
Stl meetup cloudera platform - january 2020
Adam Doyle
 
How stlrda does data
How stlrda does dataHow stlrda does data
How stlrda does data
Adam Doyle
 
Tailoring machine learning practices to support prescriptive analytics
Tailoring machine learning practices to support prescriptive analyticsTailoring machine learning practices to support prescriptive analytics
Tailoring machine learning practices to support prescriptive analytics
Adam Doyle
 
Synthesis of analytical methods data driven decision-making
Synthesis of analytical methods data driven decision-makingSynthesis of analytical methods data driven decision-making
Synthesis of analytical methods data driven decision-making
Adam Doyle
 
Big Data IDEA 101 2019
Big Data IDEA 101 2019Big Data IDEA 101 2019
Big Data IDEA 101 2019
Adam Doyle
 
Data Engineering and the Data Science Lifecycle
Data Engineering and the Data Science LifecycleData Engineering and the Data Science Lifecycle
Data Engineering and the Data Science Lifecycle
Adam Doyle
 
Data engineering Stl Big Data IDEA user group
Data engineering   Stl Big Data IDEA user groupData engineering   Stl Big Data IDEA user group
Data engineering Stl Big Data IDEA user group
Adam Doyle
 
Cloudera - Docker on hadoop
Cloudera - Docker on hadoopCloudera - Docker on hadoop
Cloudera - Docker on hadoop
Adam Doyle
 

More from Adam Doyle (20)

ML Ops.pptx
ML Ops.pptxML Ops.pptx
ML Ops.pptx
 
Data Engineering Roles
Data Engineering RolesData Engineering Roles
Data Engineering Roles
 
Managed Cluster Services
Managed Cluster ServicesManaged Cluster Services
Managed Cluster Services
 
Great Expectations Presentation
Great Expectations PresentationGreat Expectations Presentation
Great Expectations Presentation
 
May 2021 Spark Testing ... or how to farm reputation on StackOverflow
May 2021 Spark Testing ... or how to farm reputation on StackOverflowMay 2021 Spark Testing ... or how to farm reputation on StackOverflow
May 2021 Spark Testing ... or how to farm reputation on StackOverflow
 
Automate your data flows with Apache NIFI
Automate your data flows with Apache NIFIAutomate your data flows with Apache NIFI
Automate your data flows with Apache NIFI
 
Localized Hadoop Development
Localized Hadoop DevelopmentLocalized Hadoop Development
Localized Hadoop Development
 
The new big data
The new big dataThe new big data
The new big data
 
Feature store Overview St. Louis Big Data IDEA Meetup aug 2020
Feature store Overview   St. Louis Big Data IDEA Meetup aug 2020Feature store Overview   St. Louis Big Data IDEA Meetup aug 2020
Feature store Overview St. Louis Big Data IDEA Meetup aug 2020
 
Snowflake Data Science and AI/ML at Scale
Snowflake Data Science and AI/ML at ScaleSnowflake Data Science and AI/ML at Scale
Snowflake Data Science and AI/ML at Scale
 
Operationalizing Data Science St. Louis Big Data IDEA
Operationalizing Data Science St. Louis Big Data IDEAOperationalizing Data Science St. Louis Big Data IDEA
Operationalizing Data Science St. Louis Big Data IDEA
 
Retooling on the Modern Data and Analytics Tech Stack
Retooling on the Modern Data and Analytics Tech StackRetooling on the Modern Data and Analytics Tech Stack
Retooling on the Modern Data and Analytics Tech Stack
 
Stl meetup cloudera platform - january 2020
Stl meetup   cloudera platform  - january 2020Stl meetup   cloudera platform  - january 2020
Stl meetup cloudera platform - january 2020
 
How stlrda does data
How stlrda does dataHow stlrda does data
How stlrda does data
 
Tailoring machine learning practices to support prescriptive analytics
Tailoring machine learning practices to support prescriptive analyticsTailoring machine learning practices to support prescriptive analytics
Tailoring machine learning practices to support prescriptive analytics
 
Synthesis of analytical methods data driven decision-making
Synthesis of analytical methods data driven decision-makingSynthesis of analytical methods data driven decision-making
Synthesis of analytical methods data driven decision-making
 
Big Data IDEA 101 2019
Big Data IDEA 101 2019Big Data IDEA 101 2019
Big Data IDEA 101 2019
 
Data Engineering and the Data Science Lifecycle
Data Engineering and the Data Science LifecycleData Engineering and the Data Science Lifecycle
Data Engineering and the Data Science Lifecycle
 
Data engineering Stl Big Data IDEA user group
Data engineering   Stl Big Data IDEA user groupData engineering   Stl Big Data IDEA user group
Data engineering Stl Big Data IDEA user group
 
Cloudera - Docker on hadoop
Cloudera - Docker on hadoopCloudera - Docker on hadoop
Cloudera - Docker on hadoop
 

Recently uploaded

Overview IFM June 2024 Consumer Confidence INDEX Report.pdf
Overview IFM June 2024 Consumer Confidence INDEX Report.pdfOverview IFM June 2024 Consumer Confidence INDEX Report.pdf
Overview IFM June 2024 Consumer Confidence INDEX Report.pdf
nhutnguyen355078
 
一比一原版澳洲西澳大学毕业证(uwa毕业证书)如何办理
一比一原版澳洲西澳大学毕业证(uwa毕业证书)如何办理一比一原版澳洲西澳大学毕业证(uwa毕业证书)如何办理
一比一原版澳洲西澳大学毕业证(uwa毕业证书)如何办理
aguty
 
Econ3060_Screen Time and Success_ final_GroupProject.pdf
Econ3060_Screen Time and Success_ final_GroupProject.pdfEcon3060_Screen Time and Success_ final_GroupProject.pdf
Econ3060_Screen Time and Success_ final_GroupProject.pdf
blueshagoo1
 
Hyderabad Call Girls 7339748667 With Free Home Delivery At Your Door
Hyderabad Call Girls 7339748667 With Free Home Delivery At Your DoorHyderabad Call Girls 7339748667 With Free Home Delivery At Your Door
Hyderabad Call Girls 7339748667 With Free Home Delivery At Your Door
Russian Escorts in Delhi 9711199171 with low rate Book online
 
一比一原版马来西亚博特拉大学毕业证(upm毕业证)如何办理
一比一原版马来西亚博特拉大学毕业证(upm毕业证)如何办理一比一原版马来西亚博特拉大学毕业证(upm毕业证)如何办理
一比一原版马来西亚博特拉大学毕业证(upm毕业证)如何办理
eudsoh
 
[VCOSA] Monthly Report - Cotton & Yarn Statistics May 2024
[VCOSA] Monthly Report - Cotton & Yarn Statistics May 2024[VCOSA] Monthly Report - Cotton & Yarn Statistics May 2024
[VCOSA] Monthly Report - Cotton & Yarn Statistics May 2024
Vietnam Cotton & Spinning Association
 
Discovering Digital Process Twins for What-if Analysis: a Process Mining Appr...
Discovering Digital Process Twins for What-if Analysis: a Process Mining Appr...Discovering Digital Process Twins for What-if Analysis: a Process Mining Appr...
Discovering Digital Process Twins for What-if Analysis: a Process Mining Appr...
Marlon Dumas
 
Essential Skills for Family Assessment - Marital and Family Therapy and Couns...
Essential Skills for Family Assessment - Marital and Family Therapy and Couns...Essential Skills for Family Assessment - Marital and Family Therapy and Couns...
Essential Skills for Family Assessment - Marital and Family Therapy and Couns...
PsychoTech Services
 
一比一原版南昆士兰大学毕业证如何办理
一比一原版南昆士兰大学毕业证如何办理一比一原版南昆士兰大学毕业证如何办理
一比一原版南昆士兰大学毕业证如何办理
ugydym
 
[VCOSA] Monthly Report - Cotton & Yarn Statistics March 2024
[VCOSA] Monthly Report - Cotton & Yarn Statistics March 2024[VCOSA] Monthly Report - Cotton & Yarn Statistics March 2024
[VCOSA] Monthly Report - Cotton & Yarn Statistics March 2024
Vietnam Cotton & Spinning Association
 
Call Girls Goa👉9024918724👉Low Rate Escorts in Goa 💃 Available 24/7
Call Girls Goa👉9024918724👉Low Rate Escorts in Goa 💃 Available 24/7Call Girls Goa👉9024918724👉Low Rate Escorts in Goa 💃 Available 24/7
Call Girls Goa👉9024918724👉Low Rate Escorts in Goa 💃 Available 24/7
nitachopra
 
一比一原版(曼大毕业证书)曼尼托巴大学毕业证如何办理
一比一原版(曼大毕业证书)曼尼托巴大学毕业证如何办理一比一原版(曼大毕业证书)曼尼托巴大学毕业证如何办理
一比一原版(曼大毕业证书)曼尼托巴大学毕业证如何办理
ytypuem
 
一比一原版(uob毕业证书)伯明翰大学毕业证如何办理
一比一原版(uob毕业证书)伯明翰大学毕业证如何办理一比一原版(uob毕业证书)伯明翰大学毕业证如何办理
一比一原版(uob毕业证书)伯明翰大学毕业证如何办理
9gr6pty
 
PyData London 2024: Mistakes were made (Dr. Rebecca Bilbro)
PyData London 2024: Mistakes were made (Dr. Rebecca Bilbro)PyData London 2024: Mistakes were made (Dr. Rebecca Bilbro)
PyData London 2024: Mistakes were made (Dr. Rebecca Bilbro)
Rebecca Bilbro
 
Do People Really Know Their Fertility Intentions? Correspondence between Sel...
Do People Really Know Their Fertility Intentions?  Correspondence between Sel...Do People Really Know Their Fertility Intentions?  Correspondence between Sel...
Do People Really Know Their Fertility Intentions? Correspondence between Sel...
Xiao Xu
 
CAP Excel Formulas & Functions July - Copy (4).pdf
CAP Excel Formulas & Functions July - Copy (4).pdfCAP Excel Formulas & Functions July - Copy (4).pdf
CAP Excel Formulas & Functions July - Copy (4).pdf
frp60658
 
Call Girls Lucknow 0000000000 Independent Call Girl Service Lucknow
Call Girls Lucknow 0000000000 Independent Call Girl Service LucknowCall Girls Lucknow 0000000000 Independent Call Girl Service Lucknow
Call Girls Lucknow 0000000000 Independent Call Girl Service Lucknow
hiju9823
 
Senior Engineering Sample EM DOE - Sheet1.pdf
Senior Engineering Sample EM DOE  - Sheet1.pdfSenior Engineering Sample EM DOE  - Sheet1.pdf
Senior Engineering Sample EM DOE - Sheet1.pdf
Vineet
 
一比一原版悉尼大学毕业证如何办理
一比一原版悉尼大学毕业证如何办理一比一原版悉尼大学毕业证如何办理
一比一原版悉尼大学毕业证如何办理
keesa2
 
Template xxxxxxxx ssssssssssss Sertifikat.pptx
Template xxxxxxxx ssssssssssss Sertifikat.pptxTemplate xxxxxxxx ssssssssssss Sertifikat.pptx
Template xxxxxxxx ssssssssssss Sertifikat.pptx
TeukuEriSyahputra
 

Recently uploaded (20)

Overview IFM June 2024 Consumer Confidence INDEX Report.pdf
Overview IFM June 2024 Consumer Confidence INDEX Report.pdfOverview IFM June 2024 Consumer Confidence INDEX Report.pdf
Overview IFM June 2024 Consumer Confidence INDEX Report.pdf
 
一比一原版澳洲西澳大学毕业证(uwa毕业证书)如何办理
一比一原版澳洲西澳大学毕业证(uwa毕业证书)如何办理一比一原版澳洲西澳大学毕业证(uwa毕业证书)如何办理
一比一原版澳洲西澳大学毕业证(uwa毕业证书)如何办理
 
Econ3060_Screen Time and Success_ final_GroupProject.pdf
Econ3060_Screen Time and Success_ final_GroupProject.pdfEcon3060_Screen Time and Success_ final_GroupProject.pdf
Econ3060_Screen Time and Success_ final_GroupProject.pdf
 
Hyderabad Call Girls 7339748667 With Free Home Delivery At Your Door
Hyderabad Call Girls 7339748667 With Free Home Delivery At Your DoorHyderabad Call Girls 7339748667 With Free Home Delivery At Your Door
Hyderabad Call Girls 7339748667 With Free Home Delivery At Your Door
 
一比一原版马来西亚博特拉大学毕业证(upm毕业证)如何办理
一比一原版马来西亚博特拉大学毕业证(upm毕业证)如何办理一比一原版马来西亚博特拉大学毕业证(upm毕业证)如何办理
一比一原版马来西亚博特拉大学毕业证(upm毕业证)如何办理
 
[VCOSA] Monthly Report - Cotton & Yarn Statistics May 2024
[VCOSA] Monthly Report - Cotton & Yarn Statistics May 2024[VCOSA] Monthly Report - Cotton & Yarn Statistics May 2024
[VCOSA] Monthly Report - Cotton & Yarn Statistics May 2024
 
Discovering Digital Process Twins for What-if Analysis: a Process Mining Appr...
Discovering Digital Process Twins for What-if Analysis: a Process Mining Appr...Discovering Digital Process Twins for What-if Analysis: a Process Mining Appr...
Discovering Digital Process Twins for What-if Analysis: a Process Mining Appr...
 
Essential Skills for Family Assessment - Marital and Family Therapy and Couns...
Essential Skills for Family Assessment - Marital and Family Therapy and Couns...Essential Skills for Family Assessment - Marital and Family Therapy and Couns...
Essential Skills for Family Assessment - Marital and Family Therapy and Couns...
 
一比一原版南昆士兰大学毕业证如何办理
一比一原版南昆士兰大学毕业证如何办理一比一原版南昆士兰大学毕业证如何办理
一比一原版南昆士兰大学毕业证如何办理
 
[VCOSA] Monthly Report - Cotton & Yarn Statistics March 2024
[VCOSA] Monthly Report - Cotton & Yarn Statistics March 2024[VCOSA] Monthly Report - Cotton & Yarn Statistics March 2024
[VCOSA] Monthly Report - Cotton & Yarn Statistics March 2024
 
Call Girls Goa👉9024918724👉Low Rate Escorts in Goa 💃 Available 24/7
Call Girls Goa👉9024918724👉Low Rate Escorts in Goa 💃 Available 24/7Call Girls Goa👉9024918724👉Low Rate Escorts in Goa 💃 Available 24/7
Call Girls Goa👉9024918724👉Low Rate Escorts in Goa 💃 Available 24/7
 
一比一原版(曼大毕业证书)曼尼托巴大学毕业证如何办理
一比一原版(曼大毕业证书)曼尼托巴大学毕业证如何办理一比一原版(曼大毕业证书)曼尼托巴大学毕业证如何办理
一比一原版(曼大毕业证书)曼尼托巴大学毕业证如何办理
 
一比一原版(uob毕业证书)伯明翰大学毕业证如何办理
一比一原版(uob毕业证书)伯明翰大学毕业证如何办理一比一原版(uob毕业证书)伯明翰大学毕业证如何办理
一比一原版(uob毕业证书)伯明翰大学毕业证如何办理
 
PyData London 2024: Mistakes were made (Dr. Rebecca Bilbro)
PyData London 2024: Mistakes were made (Dr. Rebecca Bilbro)PyData London 2024: Mistakes were made (Dr. Rebecca Bilbro)
PyData London 2024: Mistakes were made (Dr. Rebecca Bilbro)
 
Do People Really Know Their Fertility Intentions? Correspondence between Sel...
Do People Really Know Their Fertility Intentions?  Correspondence between Sel...Do People Really Know Their Fertility Intentions?  Correspondence between Sel...
Do People Really Know Their Fertility Intentions? Correspondence between Sel...
 
CAP Excel Formulas & Functions July - Copy (4).pdf
CAP Excel Formulas & Functions July - Copy (4).pdfCAP Excel Formulas & Functions July - Copy (4).pdf
CAP Excel Formulas & Functions July - Copy (4).pdf
 
Call Girls Lucknow 0000000000 Independent Call Girl Service Lucknow
Call Girls Lucknow 0000000000 Independent Call Girl Service LucknowCall Girls Lucknow 0000000000 Independent Call Girl Service Lucknow
Call Girls Lucknow 0000000000 Independent Call Girl Service Lucknow
 
Senior Engineering Sample EM DOE - Sheet1.pdf
Senior Engineering Sample EM DOE  - Sheet1.pdfSenior Engineering Sample EM DOE  - Sheet1.pdf
Senior Engineering Sample EM DOE - Sheet1.pdf
 
一比一原版悉尼大学毕业证如何办理
一比一原版悉尼大学毕业证如何办理一比一原版悉尼大学毕业证如何办理
一比一原版悉尼大学毕业证如何办理
 
Template xxxxxxxx ssssssssssss Sertifikat.pptx
Template xxxxxxxx ssssssssssss Sertifikat.pptxTemplate xxxxxxxx ssssssssssss Sertifikat.pptx
Template xxxxxxxx ssssssssssss Sertifikat.pptx
 

Delta lake and the delta architecture

  • 1. Delta Lake and the Delta Architecture 07/07/2021 1
  • 2. Personal introduction Yue Fang a big data enthusiast, and has worked on big data tech skills for almost 10 years. builds data pipelines and platforms on Cloudera's platform and Azure's Cloud. is a certified AWS solution architect. deep experience using spark structured streaming, Kafka, Cassandra, Hive, HBase, Solr, EventHub and Cosmosdb. worked on the Azure Databricks platform and Delta Lake as well. 2
  • 3. Outline ● Apache Spark problems ● Data Lake problems ● What is DataBricks? ● Delta Lake key features ● Delta Lake architecture ● Lakehouse architecture 3
  • 4. Apache Spark Problems ● Not ACID compliant ● Missing schema enforcement ● Small files - big problems - File listing - File opening/closing - Reduced compression effectiveness - Excessive metadata(external HIVE tables) 4 Two docs for details. Generic Load/Save Functions - Spark 3.1.2 Documentation Transactional writes to cloud storage with DBIO | Databricks on AWS
  • 5. Data Lakes Problems A data lake is a centralized repository that allows you to store all your structured and unstructured data at any scale. 5 Reliability issues ● Failed production jobs leave data in corrupt state ● Lack of schema enforcement ● creates inconsistent and low quality data(schema-on-read) ● Lack of consistency makes it almost impossible to mix appends and reads, batch and streaming Performance issues ● File size inconsistency with either too small or too big files ● Slow read/write performance of cloud storage compared to file system storage Garbage In Garbage Out
  • 6. Why is Databricks? 6 source: Comparing Databricks to Apache Spark Databricks builds on top of Spark and adds: - Highly reliable and performant data pipelines - Productive data science at scale.
  • 8. What is Delta Lake? ● an open source project that enables building a Lakehouse architecture on top of data lakes. ● a storage layer that brings scalable, ACID transactions to Apache Spark and other big-data engines. ● Delta Lake provides ACID transactions, scalable metadata handling, and unifies streaming and batch data processing on top of existing data lakes, such as S3, ADLS, GCS, and HDFS. 8
  • 9. Delta Lake key features ● ACID Transactions ● Scalable Metadata Handling ● Time Travel (data versioning) ● Open Format ● Delta Lake change data feed ● Unified Batch and Streaming Source and Sink ● Schema Enforcement ● Schema Evolution ● Audit History ● Updates and Delete ● 100% Compatible with Apache Spark API ● Data Clean-up 9
  • 10. Delta Lake key feature - ACID transaction ● What the transaction log is. ● How the transaction log serves as a single source of truth to support ACID. ● How Delta Lake computes the state of each table. ● Using optimistic concurrency control. ● How Delta Lake uses mutual exclusion to ensure that commits are serialized properly. 10 DEMO
  • 11. Delta Lake key feature - Schema Enforcement Schema enforcement, also known as schema validation, is a safeguard in Delta Lake that ensures data quality by rejecting writes to a table that do not match the table’s schema. ● Schema validation on write. ● Cannot contain any additional columns that are not present in the target table’s schema ● Cannot have column data types that differ from the column data types in the target table. ● Can not contain column names that differ only by case. ● Table’s schema is saved in JSON format inside the transaction log. 11
  • 12. Delta Lake key feature - Schema Evolution Schema evolution is a feature that allows users to easily change a table’s current schema to accommodate data that is changing over time. “Read-Compatible” Schema Change .option('mergeSchema', 'true') ● Adding new columns (this is the most common scenario) ● Changing of data types from NullType -> any other type, or upcasts from ByteType -> ShortType -> IntegerType “Non-Read-Compabtile” Schema Change .option("overwriteSchema", "true") ● Dropping a column ● Changing an existing column’s data type (in place) ● Renaming column names that differ only by case (e.g. “Foo” and “foo”) 12 DEMO
  • 13. Delta Lake key feature - Time Travel Delta Lake time travel allows you to query an older snapshot of a Delta table. ● Timestamp based ● Version number based ● Data retention Transaction log file retention period delta.logRetentionDuration = 30 days at default Data file retention period delta.deletedFileRetentionDuration = 7 ● Use cases: ○ Audit data changes ○ Reproduce experiments & report ○ Rollbacks https://docs.databricks.com/spark/latest/spark-sql/language-manual/sql-ref-syntax-ddl-alter-table.html#delta-table-schema-options 13 DEMO
  • 14. Delta Lake key feature - Table Utility Commands ● Remove files no longer referenced by Delta table ● Audit History ● Retrieve Table Details ● Generate a manifest file ● Convert parquet table to Delta table ● Convert Delta table to parquet table 14 DML Internals (Delete, Update, Merge) - Delta Lake Tech Talks Table deletes, updates, and merges | Databricks on AWS DEMO
  • 15. Delta Lake key feature - Insert|Delete|UpSert ● SQL INSERT DELETE UPDATE MERGE ● Delta Table API delete update Merge A merge operation can fail if multiple rows of the source dataset match and attempt to update the same rows of the target Delta table 15 DML Internals (Delete, Update, Merge) - Delta Lake Tech Talks Table deletes, updates, and merges | Databricks on AWS
  • 16. Delta Lake key feature - Clean up ● Transaction Log clean up _delta_log Checkpoint log file delta.logRetentionDuration=30 days at default ● Data file clean up SQL API Vacuum command Retention 7 days at default spark.databricks.delta.retentionDurationCheck.enabled = true|false 16 DML Internals (Delete, Update, Merge) - Delta Lake Tech Talks Table deletes, updates, and merges | Databricks on AWS
  • 17. Delta Lake key feature - Streaming as source and sink ● Delta Lake is deeply integrated with Spark Structured Streaming through readStream and writeStream. ● Delta Lake overcomes many of the limitations typically associated with streaming systems and files, including: Maintaining “exactly-once” processing with more than one stream (or concurrent batch jobs) Efficiently discovering which files are new when using files as the source for a stream 17 DML Internals (Delete, Update, Merge) - Delta Lake Tech Talks Table deletes, updates, and merges | Databricks on AWS
  • 18. Delta Lake key feature - Streaming as source and sink As a source does not handle input that is not an append and throws an exception if any modifications occur on the table being used as a source. ● delete the output and checkpoint and restart the stream from the beginning. ● set either of these two options: ignoreDeletes ignoreChanges Specify initial position ● startingVersion ● startingTimestamp 18 As a sink ● Append mode ● Complete mode
  • 19. Delta Lake key feature - Delta Lake change data feed ● Support DataBricks Runtime 8.4 and above ● The Delta change data feed represents row-level changes between versions of a Delta table. ● set spark.databricks.delta.properties.defaults.enableChangeDataFeed = true; ● Change data event schema In addition to the data columns, change data contains metadata columns that identify the type of change event: _change_type >>insert, update_preimage , update_postimage, delete _commit_version _commit_timestamp 19 DML Internals (Delete, Update, Merge) - Delta Lake Tech Talks Table deletes, updates, and merges | Databricks on AWS
  • 21. LakeHouse Architecture 21 A paradigm or conception of modern architecture. Rely on Delta Lake under the hood Replace additional data warehouse and data lake Need fast SQL analysis engine Future trend
  • 22. Thank you 22 Any questions are welcome. Learning together