SlideShare a Scribd company logo
Delta Lake and the Delta
Architecture
07/07/2021
1
Personal introduction
Yue Fang
a big data enthusiast, and has worked on big data tech skills for almost 10 years.
builds data pipelines and platforms on Cloudera's platform and Azure's Cloud.
is a certified AWS solution architect.
deep experience using spark structured streaming, Kafka, Cassandra, Hive,
HBase, Solr, EventHub and Cosmosdb.
worked on the Azure Databricks platform and Delta Lake as well.
2
Outline
● Apache Spark problems
● Data Lake problems
● What is DataBricks?
● Delta Lake key features
● Delta Lake architecture
● Lakehouse architecture
3
Apache Spark Problems
● Not ACID compliant
● Missing schema enforcement
● Small files - big problems
- File listing
- File opening/closing
- Reduced compression effectiveness
- Excessive metadata(external HIVE tables)
4
Two docs for details.
Generic Load/Save Functions - Spark 3.1.2 Documentation
Transactional writes to cloud storage with DBIO | Databricks on AWS
Data Lakes Problems
A data lake is a centralized repository that allows you to store all your
structured and unstructured data at any scale.
5
Reliability issues
● Failed production jobs leave data in
corrupt state
● Lack of schema enforcement
● creates inconsistent and low quality
data(schema-on-read)
● Lack of consistency makes it almost
impossible to mix appends and reads,
batch and streaming
Performance issues
● File size inconsistency with either too small or
too big files
● Slow read/write performance of cloud storage
compared to file system storage
Garbage In Garbage Out
Why is Databricks?
6
source: Comparing Databricks to Apache Spark
Databricks builds on top of Spark and adds:
- Highly reliable and performant data
pipelines
- Productive data science at scale.
Delta Lake introduction
7
What is Delta Lake?
● an open source project that enables building a Lakehouse architecture on top
of data lakes.
● a storage layer that brings scalable, ACID transactions to Apache Spark and
other big-data engines.
● Delta Lake provides ACID transactions, scalable metadata handling, and
unifies streaming and batch data processing on top of existing data lakes,
such as S3, ADLS, GCS, and HDFS.
8
Delta Lake key features
● ACID Transactions
● Scalable Metadata Handling
● Time Travel (data versioning)
● Open Format
● Delta Lake change data feed
● Unified Batch and Streaming Source and Sink
● Schema Enforcement
● Schema Evolution
● Audit History
● Updates and Delete
● 100% Compatible with Apache Spark API
● Data Clean-up
9
Delta Lake key feature - ACID transaction
● What the transaction log is.
● How the transaction log serves as a single source of truth to support ACID.
● How Delta Lake computes the state of each table.
● Using optimistic concurrency control.
● How Delta Lake uses mutual exclusion to ensure that commits
are serialized properly.
10
DEMO
Delta Lake key feature - Schema Enforcement
Schema enforcement, also known as schema validation, is a safeguard in Delta
Lake that ensures data quality by rejecting writes to a table that do not match the
table’s schema.
● Schema validation on write.
● Cannot contain any additional columns that are not present in the target
table’s schema
● Cannot have column data types that differ from the column data types in the
target table.
● Can not contain column names that differ only by case.
● Table’s schema is saved in JSON format inside the transaction log.
11
Delta Lake key feature - Schema Evolution
Schema evolution is a feature that allows users to easily change a table’s current
schema to accommodate data that is changing over time.
“Read-Compatible” Schema Change .option('mergeSchema', 'true')
● Adding new columns (this is the most common scenario)
● Changing of data types from NullType -> any other type, or upcasts from
ByteType -> ShortType -> IntegerType
“Non-Read-Compabtile” Schema Change .option("overwriteSchema", "true")
● Dropping a column
● Changing an existing column’s data type (in place)
● Renaming column names that differ only by case (e.g. “Foo” and “foo”)
12
DEMO
Delta Lake key feature - Time Travel
Delta Lake time travel allows you to query an older snapshot of a Delta table.
● Timestamp based
● Version number based
● Data retention
Transaction log file retention period
delta.logRetentionDuration = 30 days at default
Data file retention period
delta.deletedFileRetentionDuration = 7
● Use cases:
○ Audit data changes
○ Reproduce experiments & report
○ Rollbacks
https://docs.databricks.com/spark/latest/spark-sql/language-manual/sql-ref-syntax-ddl-alter-table.html#delta-table-schema-options
13
DEMO
Delta Lake key feature - Table Utility Commands
● Remove files no longer referenced by Delta table
● Audit History
● Retrieve Table Details
● Generate a manifest file
● Convert parquet table to Delta table
● Convert Delta table to parquet table
14
DML Internals (Delete, Update, Merge) - Delta Lake Tech Talks Table deletes, updates, and merges | Databricks on AWS
DEMO
Delta Lake key feature - Insert|Delete|UpSert
● SQL
INSERT
DELETE
UPDATE
MERGE
● Delta Table API
delete
update
Merge
A merge operation can fail if multiple rows of the source dataset match and
attempt to update the same rows of the target Delta table
15
DML Internals (Delete, Update, Merge) - Delta Lake Tech Talks Table deletes, updates, and merges | Databricks on AWS
Delta Lake key feature - Clean up
● Transaction Log clean up
_delta_log
Checkpoint log file
delta.logRetentionDuration=30 days at default
● Data file clean up
SQL
API
Vacuum command
Retention 7 days at default
spark.databricks.delta.retentionDurationCheck.enabled = true|false
16
DML Internals (Delete, Update, Merge) - Delta Lake Tech Talks Table deletes, updates, and merges | Databricks on AWS
Delta Lake key feature - Streaming as source and sink
● Delta Lake is deeply integrated with Spark Structured Streaming through
readStream and writeStream.
● Delta Lake overcomes many of the limitations typically associated with
streaming systems and files, including:
Maintaining “exactly-once” processing with more than one stream (or
concurrent batch jobs)
Efficiently discovering which files are new when using files as the source
for a stream
17
DML Internals (Delete, Update, Merge) - Delta Lake Tech Talks Table deletes, updates, and merges | Databricks on AWS
Delta Lake key feature - Streaming as source and sink
As a source
does not handle input that is not an append and
throws an exception if any modifications occur on
the table being used as a source.
● delete the output and checkpoint and restart
the stream from the beginning.
● set either of these two options:
ignoreDeletes
ignoreChanges
Specify initial position
● startingVersion
● startingTimestamp
18
As a sink
● Append mode
● Complete mode
Delta Lake key feature - Delta Lake change data feed
● Support DataBricks Runtime 8.4 and above
● The Delta change data feed represents row-level changes between
versions of a Delta table.
● set spark.databricks.delta.properties.defaults.enableChangeDataFeed =
true;
● Change data event schema
In addition to the data columns, change data contains metadata
columns that identify the type of change event:
_change_type
>>insert, update_preimage , update_postimage, delete
_commit_version
_commit_timestamp
19
DML Internals (Delete, Update, Merge) - Delta Lake Tech Talks Table deletes, updates, and merges | Databricks on AWS
Delta Lake Architecture
20
DEMO
LakeHouse Architecture
21
A paradigm or conception of modern architecture.
Rely on Delta Lake under the hood
Replace additional data warehouse and data lake
Need fast SQL analysis engine
Future trend
Thank you
22
Any questions are welcome.
Learning together

More Related Content

What's hot

Big data architectures and the data lake
Big data architectures and the data lakeBig data architectures and the data lake
Big data architectures and the data lake
James Serra
 
DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
Databricks
 
Introduction SQL Analytics on Lakehouse Architecture
Introduction SQL Analytics on Lakehouse ArchitectureIntroduction SQL Analytics on Lakehouse Architecture
Introduction SQL Analytics on Lakehouse Architecture
Databricks
 
Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)
James Serra
 
[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...
[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...
[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...
DataScienceConferenc1
 
Introduction to Dremio
Introduction to DremioIntroduction to Dremio
Introduction to Dremio
Dremio Corporation
 
Making Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta LakeMaking Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta Lake
Databricks
 
From Data Warehouse to Lakehouse
From Data Warehouse to LakehouseFrom Data Warehouse to Lakehouse
From Data Warehouse to Lakehouse
Modern Data Stack France
 
Lakehouse in Azure
Lakehouse in AzureLakehouse in Azure
Lakehouse in Azure
Sergio Zenatti Filho
 
The delta architecture
The delta architectureThe delta architecture
The delta architecture
Prakash Chockalingam
 
Introducing Databricks Delta
Introducing Databricks DeltaIntroducing Databricks Delta
Introducing Databricks Delta
Databricks
 
Databricks Fundamentals
Databricks FundamentalsDatabricks Fundamentals
Databricks Fundamentals
Dalibor Wijas
 
Snowflake: The most cost-effective agile and scalable data warehouse ever!
Snowflake: The most cost-effective agile and scalable data warehouse ever!Snowflake: The most cost-effective agile and scalable data warehouse ever!
Snowflake: The most cost-effective agile and scalable data warehouse ever!
Visual_BI
 
Free Training: How to Build a Lakehouse
Free Training: How to Build a LakehouseFree Training: How to Build a Lakehouse
Free Training: How to Build a Lakehouse
Databricks
 
Technical Deck Delta Live Tables.pdf
Technical Deck Delta Live Tables.pdfTechnical Deck Delta Live Tables.pdf
Technical Deck Delta Live Tables.pdf
Ilham31574
 
Making Data Timelier and More Reliable with Lakehouse Technology
Making Data Timelier and More Reliable with Lakehouse TechnologyMaking Data Timelier and More Reliable with Lakehouse Technology
Making Data Timelier and More Reliable with Lakehouse Technology
Matei Zaharia
 
What’s New with Databricks Machine Learning
What’s New with Databricks Machine LearningWhat’s New with Databricks Machine Learning
What’s New with Databricks Machine Learning
Databricks
 
Owning Your Own (Data) Lake House
Owning Your Own (Data) Lake HouseOwning Your Own (Data) Lake House
Owning Your Own (Data) Lake House
Data Con LA
 
Data Lakehouse, Data Mesh, and Data Fabric (r2)
Data Lakehouse, Data Mesh, and Data Fabric (r2)Data Lakehouse, Data Mesh, and Data Fabric (r2)
Data Lakehouse, Data Mesh, and Data Fabric (r2)
James Serra
 

What's hot (20)

Big data architectures and the data lake
Big data architectures and the data lakeBig data architectures and the data lake
Big data architectures and the data lake
 
DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
Introduction SQL Analytics on Lakehouse Architecture
Introduction SQL Analytics on Lakehouse ArchitectureIntroduction SQL Analytics on Lakehouse Architecture
Introduction SQL Analytics on Lakehouse Architecture
 
Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)
 
[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...
[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...
[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...
 
Introduction to Dremio
Introduction to DremioIntroduction to Dremio
Introduction to Dremio
 
Making Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta LakeMaking Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta Lake
 
From Data Warehouse to Lakehouse
From Data Warehouse to LakehouseFrom Data Warehouse to Lakehouse
From Data Warehouse to Lakehouse
 
Lakehouse in Azure
Lakehouse in AzureLakehouse in Azure
Lakehouse in Azure
 
The delta architecture
The delta architectureThe delta architecture
The delta architecture
 
Introducing Databricks Delta
Introducing Databricks DeltaIntroducing Databricks Delta
Introducing Databricks Delta
 
Databricks Fundamentals
Databricks FundamentalsDatabricks Fundamentals
Databricks Fundamentals
 
Snowflake: The most cost-effective agile and scalable data warehouse ever!
Snowflake: The most cost-effective agile and scalable data warehouse ever!Snowflake: The most cost-effective agile and scalable data warehouse ever!
Snowflake: The most cost-effective agile and scalable data warehouse ever!
 
Free Training: How to Build a Lakehouse
Free Training: How to Build a LakehouseFree Training: How to Build a Lakehouse
Free Training: How to Build a Lakehouse
 
Technical Deck Delta Live Tables.pdf
Technical Deck Delta Live Tables.pdfTechnical Deck Delta Live Tables.pdf
Technical Deck Delta Live Tables.pdf
 
Making Data Timelier and More Reliable with Lakehouse Technology
Making Data Timelier and More Reliable with Lakehouse TechnologyMaking Data Timelier and More Reliable with Lakehouse Technology
Making Data Timelier and More Reliable with Lakehouse Technology
 
What’s New with Databricks Machine Learning
What’s New with Databricks Machine LearningWhat’s New with Databricks Machine Learning
What’s New with Databricks Machine Learning
 
Owning Your Own (Data) Lake House
Owning Your Own (Data) Lake HouseOwning Your Own (Data) Lake House
Owning Your Own (Data) Lake House
 
Data Lakehouse, Data Mesh, and Data Fabric (r2)
Data Lakehouse, Data Mesh, and Data Fabric (r2)Data Lakehouse, Data Mesh, and Data Fabric (r2)
Data Lakehouse, Data Mesh, and Data Fabric (r2)
 

Similar to Delta lake and the delta architecture

Pulsar in the Lakehouse: Overview of Apache Pulsar and Delta Lake Connector -...
Pulsar in the Lakehouse: Overview of Apache Pulsar and Delta Lake Connector -...Pulsar in the Lakehouse: Overview of Apache Pulsar and Delta Lake Connector -...
Pulsar in the Lakehouse: Overview of Apache Pulsar and Delta Lake Connector -...
StreamNative
 
Intro to databricks delta lake
 Intro to databricks delta lake Intro to databricks delta lake
Intro to databricks delta lake
Mykola Zerniuk
 
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
Databricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
Databricks
 
Growing the Delta Ecosystem to Rust and Python with Delta-RS
Growing the Delta Ecosystem to Rust and Python with Delta-RSGrowing the Delta Ecosystem to Rust and Python with Delta-RS
Growing the Delta Ecosystem to Rust and Python with Delta-RS
Databricks
 
Reshape Data Lake (as of 2020.07)
Reshape Data Lake (as of 2020.07)Reshape Data Lake (as of 2020.07)
Reshape Data Lake (as of 2020.07)
Eric Sun
 
Healthcare Claim Reimbursement using Apache Spark
Healthcare Claim Reimbursement using Apache SparkHealthcare Claim Reimbursement using Apache Spark
Healthcare Claim Reimbursement using Apache Spark
Databricks
 
MS-SQL SERVER ARCHITECTURE
MS-SQL SERVER ARCHITECTUREMS-SQL SERVER ARCHITECTURE
MS-SQL SERVER ARCHITECTURE
Douglas Bernardini
 
Aioug vizag oracle12c_new_features
Aioug vizag oracle12c_new_featuresAioug vizag oracle12c_new_features
Aioug vizag oracle12c_new_features
AiougVizagChapter
 
Getting Started with Delta Lake on Databricks
Getting Started with Delta Lake on DatabricksGetting Started with Delta Lake on Databricks
Getting Started with Delta Lake on Databricks
Knoldus Inc.
 
Otimizações de Projetos de Big Data, Dw e AI no Microsoft Azure
Otimizações de Projetos de Big Data, Dw e AI no Microsoft AzureOtimizações de Projetos de Big Data, Dw e AI no Microsoft Azure
Otimizações de Projetos de Big Data, Dw e AI no Microsoft Azure
Luan Moreno Medeiros Maciel
 
Building Reliable Lakehouses with Apache Flink and Delta Lake
Building Reliable Lakehouses with Apache Flink and Delta LakeBuilding Reliable Lakehouses with Apache Flink and Delta Lake
Building Reliable Lakehouses with Apache Flink and Delta Lake
Flink Forward
 
Transactional writes to cloud storage with Eric Liang
Transactional writes to cloud storage with Eric LiangTransactional writes to cloud storage with Eric Liang
Transactional writes to cloud storage with Eric Liang
Databricks
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Databricks
 
Spark with Delta Lake
Spark with Delta LakeSpark with Delta Lake
Spark with Delta Lake
Knoldus Inc.
 
Simplifying Hadoop with RecordService, A Secure and Unified Data Access Path ...
Simplifying Hadoop with RecordService, A Secure and Unified Data Access Path ...Simplifying Hadoop with RecordService, A Secure and Unified Data Access Path ...
Simplifying Hadoop with RecordService, A Secure and Unified Data Access Path ...
Cloudera, Inc.
 
Be A Hero: Transforming GoPro Analytics Data Pipeline
Be A Hero: Transforming GoPro Analytics Data PipelineBe A Hero: Transforming GoPro Analytics Data Pipeline
Be A Hero: Transforming GoPro Analytics Data Pipeline
Chester Chen
 
Building the Petcare Data Platform using Delta Lake and 'Kyte': Our Spark ETL...
Building the Petcare Data Platform using Delta Lake and 'Kyte': Our Spark ETL...Building the Petcare Data Platform using Delta Lake and 'Kyte': Our Spark ETL...
Building the Petcare Data Platform using Delta Lake and 'Kyte': Our Spark ETL...
Databricks
 
In-memory ColumnStore Index
In-memory ColumnStore IndexIn-memory ColumnStore Index
In-memory ColumnStore Index
SolidQ
 
SQL on Hadoop
SQL on HadoopSQL on Hadoop
SQL on Hadoop
nvvrajesh
 

Similar to Delta lake and the delta architecture (20)

Pulsar in the Lakehouse: Overview of Apache Pulsar and Delta Lake Connector -...
Pulsar in the Lakehouse: Overview of Apache Pulsar and Delta Lake Connector -...Pulsar in the Lakehouse: Overview of Apache Pulsar and Delta Lake Connector -...
Pulsar in the Lakehouse: Overview of Apache Pulsar and Delta Lake Connector -...
 
Intro to databricks delta lake
 Intro to databricks delta lake Intro to databricks delta lake
Intro to databricks delta lake
 
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 
Growing the Delta Ecosystem to Rust and Python with Delta-RS
Growing the Delta Ecosystem to Rust and Python with Delta-RSGrowing the Delta Ecosystem to Rust and Python with Delta-RS
Growing the Delta Ecosystem to Rust and Python with Delta-RS
 
Reshape Data Lake (as of 2020.07)
Reshape Data Lake (as of 2020.07)Reshape Data Lake (as of 2020.07)
Reshape Data Lake (as of 2020.07)
 
Healthcare Claim Reimbursement using Apache Spark
Healthcare Claim Reimbursement using Apache SparkHealthcare Claim Reimbursement using Apache Spark
Healthcare Claim Reimbursement using Apache Spark
 
MS-SQL SERVER ARCHITECTURE
MS-SQL SERVER ARCHITECTUREMS-SQL SERVER ARCHITECTURE
MS-SQL SERVER ARCHITECTURE
 
Aioug vizag oracle12c_new_features
Aioug vizag oracle12c_new_featuresAioug vizag oracle12c_new_features
Aioug vizag oracle12c_new_features
 
Getting Started with Delta Lake on Databricks
Getting Started with Delta Lake on DatabricksGetting Started with Delta Lake on Databricks
Getting Started with Delta Lake on Databricks
 
Otimizações de Projetos de Big Data, Dw e AI no Microsoft Azure
Otimizações de Projetos de Big Data, Dw e AI no Microsoft AzureOtimizações de Projetos de Big Data, Dw e AI no Microsoft Azure
Otimizações de Projetos de Big Data, Dw e AI no Microsoft Azure
 
Building Reliable Lakehouses with Apache Flink and Delta Lake
Building Reliable Lakehouses with Apache Flink and Delta LakeBuilding Reliable Lakehouses with Apache Flink and Delta Lake
Building Reliable Lakehouses with Apache Flink and Delta Lake
 
Transactional writes to cloud storage with Eric Liang
Transactional writes to cloud storage with Eric LiangTransactional writes to cloud storage with Eric Liang
Transactional writes to cloud storage with Eric Liang
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
 
Spark with Delta Lake
Spark with Delta LakeSpark with Delta Lake
Spark with Delta Lake
 
Simplifying Hadoop with RecordService, A Secure and Unified Data Access Path ...
Simplifying Hadoop with RecordService, A Secure and Unified Data Access Path ...Simplifying Hadoop with RecordService, A Secure and Unified Data Access Path ...
Simplifying Hadoop with RecordService, A Secure and Unified Data Access Path ...
 
Be A Hero: Transforming GoPro Analytics Data Pipeline
Be A Hero: Transforming GoPro Analytics Data PipelineBe A Hero: Transforming GoPro Analytics Data Pipeline
Be A Hero: Transforming GoPro Analytics Data Pipeline
 
Building the Petcare Data Platform using Delta Lake and 'Kyte': Our Spark ETL...
Building the Petcare Data Platform using Delta Lake and 'Kyte': Our Spark ETL...Building the Petcare Data Platform using Delta Lake and 'Kyte': Our Spark ETL...
Building the Petcare Data Platform using Delta Lake and 'Kyte': Our Spark ETL...
 
In-memory ColumnStore Index
In-memory ColumnStore IndexIn-memory ColumnStore Index
In-memory ColumnStore Index
 
SQL on Hadoop
SQL on HadoopSQL on Hadoop
SQL on Hadoop
 

More from Adam Doyle

ML Ops.pptx
ML Ops.pptxML Ops.pptx
ML Ops.pptx
Adam Doyle
 
Data Engineering Roles
Data Engineering RolesData Engineering Roles
Data Engineering Roles
Adam Doyle
 
Managed Cluster Services
Managed Cluster ServicesManaged Cluster Services
Managed Cluster Services
Adam Doyle
 
Great Expectations Presentation
Great Expectations PresentationGreat Expectations Presentation
Great Expectations Presentation
Adam Doyle
 
May 2021 Spark Testing ... or how to farm reputation on StackOverflow
May 2021 Spark Testing ... or how to farm reputation on StackOverflowMay 2021 Spark Testing ... or how to farm reputation on StackOverflow
May 2021 Spark Testing ... or how to farm reputation on StackOverflow
Adam Doyle
 
Automate your data flows with Apache NIFI
Automate your data flows with Apache NIFIAutomate your data flows with Apache NIFI
Automate your data flows with Apache NIFI
Adam Doyle
 
Apache Iceberg Presentation for the St. Louis Big Data IDEA
Apache Iceberg Presentation for the St. Louis Big Data IDEAApache Iceberg Presentation for the St. Louis Big Data IDEA
Apache Iceberg Presentation for the St. Louis Big Data IDEA
Adam Doyle
 
Localized Hadoop Development
Localized Hadoop DevelopmentLocalized Hadoop Development
Localized Hadoop Development
Adam Doyle
 
The new big data
The new big dataThe new big data
The new big data
Adam Doyle
 
Feature store Overview St. Louis Big Data IDEA Meetup aug 2020
Feature store Overview   St. Louis Big Data IDEA Meetup aug 2020Feature store Overview   St. Louis Big Data IDEA Meetup aug 2020
Feature store Overview St. Louis Big Data IDEA Meetup aug 2020
Adam Doyle
 
Snowflake Data Science and AI/ML at Scale
Snowflake Data Science and AI/ML at ScaleSnowflake Data Science and AI/ML at Scale
Snowflake Data Science and AI/ML at Scale
Adam Doyle
 
Operationalizing Data Science St. Louis Big Data IDEA
Operationalizing Data Science St. Louis Big Data IDEAOperationalizing Data Science St. Louis Big Data IDEA
Operationalizing Data Science St. Louis Big Data IDEA
Adam Doyle
 
Retooling on the Modern Data and Analytics Tech Stack
Retooling on the Modern Data and Analytics Tech StackRetooling on the Modern Data and Analytics Tech Stack
Retooling on the Modern Data and Analytics Tech Stack
Adam Doyle
 
Stl meetup cloudera platform - january 2020
Stl meetup   cloudera platform  - january 2020Stl meetup   cloudera platform  - january 2020
Stl meetup cloudera platform - january 2020
Adam Doyle
 
How stlrda does data
How stlrda does dataHow stlrda does data
How stlrda does data
Adam Doyle
 
Tailoring machine learning practices to support prescriptive analytics
Tailoring machine learning practices to support prescriptive analyticsTailoring machine learning practices to support prescriptive analytics
Tailoring machine learning practices to support prescriptive analytics
Adam Doyle
 
Synthesis of analytical methods data driven decision-making
Synthesis of analytical methods data driven decision-makingSynthesis of analytical methods data driven decision-making
Synthesis of analytical methods data driven decision-making
Adam Doyle
 
Big Data IDEA 101 2019
Big Data IDEA 101 2019Big Data IDEA 101 2019
Big Data IDEA 101 2019
Adam Doyle
 
Data Engineering and the Data Science Lifecycle
Data Engineering and the Data Science LifecycleData Engineering and the Data Science Lifecycle
Data Engineering and the Data Science Lifecycle
Adam Doyle
 
Data engineering Stl Big Data IDEA user group
Data engineering   Stl Big Data IDEA user groupData engineering   Stl Big Data IDEA user group
Data engineering Stl Big Data IDEA user group
Adam Doyle
 

More from Adam Doyle (20)

ML Ops.pptx
ML Ops.pptxML Ops.pptx
ML Ops.pptx
 
Data Engineering Roles
Data Engineering RolesData Engineering Roles
Data Engineering Roles
 
Managed Cluster Services
Managed Cluster ServicesManaged Cluster Services
Managed Cluster Services
 
Great Expectations Presentation
Great Expectations PresentationGreat Expectations Presentation
Great Expectations Presentation
 
May 2021 Spark Testing ... or how to farm reputation on StackOverflow
May 2021 Spark Testing ... or how to farm reputation on StackOverflowMay 2021 Spark Testing ... or how to farm reputation on StackOverflow
May 2021 Spark Testing ... or how to farm reputation on StackOverflow
 
Automate your data flows with Apache NIFI
Automate your data flows with Apache NIFIAutomate your data flows with Apache NIFI
Automate your data flows with Apache NIFI
 
Apache Iceberg Presentation for the St. Louis Big Data IDEA
Apache Iceberg Presentation for the St. Louis Big Data IDEAApache Iceberg Presentation for the St. Louis Big Data IDEA
Apache Iceberg Presentation for the St. Louis Big Data IDEA
 
Localized Hadoop Development
Localized Hadoop DevelopmentLocalized Hadoop Development
Localized Hadoop Development
 
The new big data
The new big dataThe new big data
The new big data
 
Feature store Overview St. Louis Big Data IDEA Meetup aug 2020
Feature store Overview   St. Louis Big Data IDEA Meetup aug 2020Feature store Overview   St. Louis Big Data IDEA Meetup aug 2020
Feature store Overview St. Louis Big Data IDEA Meetup aug 2020
 
Snowflake Data Science and AI/ML at Scale
Snowflake Data Science and AI/ML at ScaleSnowflake Data Science and AI/ML at Scale
Snowflake Data Science and AI/ML at Scale
 
Operationalizing Data Science St. Louis Big Data IDEA
Operationalizing Data Science St. Louis Big Data IDEAOperationalizing Data Science St. Louis Big Data IDEA
Operationalizing Data Science St. Louis Big Data IDEA
 
Retooling on the Modern Data and Analytics Tech Stack
Retooling on the Modern Data and Analytics Tech StackRetooling on the Modern Data and Analytics Tech Stack
Retooling on the Modern Data and Analytics Tech Stack
 
Stl meetup cloudera platform - january 2020
Stl meetup   cloudera platform  - january 2020Stl meetup   cloudera platform  - january 2020
Stl meetup cloudera platform - january 2020
 
How stlrda does data
How stlrda does dataHow stlrda does data
How stlrda does data
 
Tailoring machine learning practices to support prescriptive analytics
Tailoring machine learning practices to support prescriptive analyticsTailoring machine learning practices to support prescriptive analytics
Tailoring machine learning practices to support prescriptive analytics
 
Synthesis of analytical methods data driven decision-making
Synthesis of analytical methods data driven decision-makingSynthesis of analytical methods data driven decision-making
Synthesis of analytical methods data driven decision-making
 
Big Data IDEA 101 2019
Big Data IDEA 101 2019Big Data IDEA 101 2019
Big Data IDEA 101 2019
 
Data Engineering and the Data Science Lifecycle
Data Engineering and the Data Science LifecycleData Engineering and the Data Science Lifecycle
Data Engineering and the Data Science Lifecycle
 
Data engineering Stl Big Data IDEA user group
Data engineering   Stl Big Data IDEA user groupData engineering   Stl Big Data IDEA user group
Data engineering Stl Big Data IDEA user group
 

Recently uploaded

Ch03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdfCh03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
haila53
 
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
ewymefz
 
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
slg6lamcq
 
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
oz8q3jxlp
 
社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .
NABLAS株式会社
 
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
AbhimanyuSinha9
 
一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单
ewymefz
 
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
v3tuleee
 
Opendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptxOpendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptx
Opendatabay
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP
 
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Subhajit Sahu
 
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
ewymefz
 
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
John Andrews
 
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
u86oixdj
 
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
ukgaet
 
Predicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Predicting Product Ad Campaign Performance: A Data Analysis Project PresentationPredicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Predicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Boston Institute of Analytics
 
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdfSample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Linda486226
 
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
mbawufebxi
 
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
vcaxypu
 
standardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghhstandardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghh
ArpitMalhotra16
 

Recently uploaded (20)

Ch03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdfCh03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
 
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
 
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
 
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
 
社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .
 
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
 
一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单
 
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
 
Opendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptxOpendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptx
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
 
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
 
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
 
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
 
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
 
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
 
Predicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Predicting Product Ad Campaign Performance: A Data Analysis Project PresentationPredicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Predicting Product Ad Campaign Performance: A Data Analysis Project Presentation
 
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdfSample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
 
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
 
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
 
standardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghhstandardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghh
 

Delta lake and the delta architecture

  • 1. Delta Lake and the Delta Architecture 07/07/2021 1
  • 2. Personal introduction Yue Fang a big data enthusiast, and has worked on big data tech skills for almost 10 years. builds data pipelines and platforms on Cloudera's platform and Azure's Cloud. is a certified AWS solution architect. deep experience using spark structured streaming, Kafka, Cassandra, Hive, HBase, Solr, EventHub and Cosmosdb. worked on the Azure Databricks platform and Delta Lake as well. 2
  • 3. Outline ● Apache Spark problems ● Data Lake problems ● What is DataBricks? ● Delta Lake key features ● Delta Lake architecture ● Lakehouse architecture 3
  • 4. Apache Spark Problems ● Not ACID compliant ● Missing schema enforcement ● Small files - big problems - File listing - File opening/closing - Reduced compression effectiveness - Excessive metadata(external HIVE tables) 4 Two docs for details. Generic Load/Save Functions - Spark 3.1.2 Documentation Transactional writes to cloud storage with DBIO | Databricks on AWS
  • 5. Data Lakes Problems A data lake is a centralized repository that allows you to store all your structured and unstructured data at any scale. 5 Reliability issues ● Failed production jobs leave data in corrupt state ● Lack of schema enforcement ● creates inconsistent and low quality data(schema-on-read) ● Lack of consistency makes it almost impossible to mix appends and reads, batch and streaming Performance issues ● File size inconsistency with either too small or too big files ● Slow read/write performance of cloud storage compared to file system storage Garbage In Garbage Out
  • 6. Why is Databricks? 6 source: Comparing Databricks to Apache Spark Databricks builds on top of Spark and adds: - Highly reliable and performant data pipelines - Productive data science at scale.
  • 8. What is Delta Lake? ● an open source project that enables building a Lakehouse architecture on top of data lakes. ● a storage layer that brings scalable, ACID transactions to Apache Spark and other big-data engines. ● Delta Lake provides ACID transactions, scalable metadata handling, and unifies streaming and batch data processing on top of existing data lakes, such as S3, ADLS, GCS, and HDFS. 8
  • 9. Delta Lake key features ● ACID Transactions ● Scalable Metadata Handling ● Time Travel (data versioning) ● Open Format ● Delta Lake change data feed ● Unified Batch and Streaming Source and Sink ● Schema Enforcement ● Schema Evolution ● Audit History ● Updates and Delete ● 100% Compatible with Apache Spark API ● Data Clean-up 9
  • 10. Delta Lake key feature - ACID transaction ● What the transaction log is. ● How the transaction log serves as a single source of truth to support ACID. ● How Delta Lake computes the state of each table. ● Using optimistic concurrency control. ● How Delta Lake uses mutual exclusion to ensure that commits are serialized properly. 10 DEMO
  • 11. Delta Lake key feature - Schema Enforcement Schema enforcement, also known as schema validation, is a safeguard in Delta Lake that ensures data quality by rejecting writes to a table that do not match the table’s schema. ● Schema validation on write. ● Cannot contain any additional columns that are not present in the target table’s schema ● Cannot have column data types that differ from the column data types in the target table. ● Can not contain column names that differ only by case. ● Table’s schema is saved in JSON format inside the transaction log. 11
  • 12. Delta Lake key feature - Schema Evolution Schema evolution is a feature that allows users to easily change a table’s current schema to accommodate data that is changing over time. “Read-Compatible” Schema Change .option('mergeSchema', 'true') ● Adding new columns (this is the most common scenario) ● Changing of data types from NullType -> any other type, or upcasts from ByteType -> ShortType -> IntegerType “Non-Read-Compabtile” Schema Change .option("overwriteSchema", "true") ● Dropping a column ● Changing an existing column’s data type (in place) ● Renaming column names that differ only by case (e.g. “Foo” and “foo”) 12 DEMO
  • 13. Delta Lake key feature - Time Travel Delta Lake time travel allows you to query an older snapshot of a Delta table. ● Timestamp based ● Version number based ● Data retention Transaction log file retention period delta.logRetentionDuration = 30 days at default Data file retention period delta.deletedFileRetentionDuration = 7 ● Use cases: ○ Audit data changes ○ Reproduce experiments & report ○ Rollbacks https://docs.databricks.com/spark/latest/spark-sql/language-manual/sql-ref-syntax-ddl-alter-table.html#delta-table-schema-options 13 DEMO
  • 14. Delta Lake key feature - Table Utility Commands ● Remove files no longer referenced by Delta table ● Audit History ● Retrieve Table Details ● Generate a manifest file ● Convert parquet table to Delta table ● Convert Delta table to parquet table 14 DML Internals (Delete, Update, Merge) - Delta Lake Tech Talks Table deletes, updates, and merges | Databricks on AWS DEMO
  • 15. Delta Lake key feature - Insert|Delete|UpSert ● SQL INSERT DELETE UPDATE MERGE ● Delta Table API delete update Merge A merge operation can fail if multiple rows of the source dataset match and attempt to update the same rows of the target Delta table 15 DML Internals (Delete, Update, Merge) - Delta Lake Tech Talks Table deletes, updates, and merges | Databricks on AWS
  • 16. Delta Lake key feature - Clean up ● Transaction Log clean up _delta_log Checkpoint log file delta.logRetentionDuration=30 days at default ● Data file clean up SQL API Vacuum command Retention 7 days at default spark.databricks.delta.retentionDurationCheck.enabled = true|false 16 DML Internals (Delete, Update, Merge) - Delta Lake Tech Talks Table deletes, updates, and merges | Databricks on AWS
  • 17. Delta Lake key feature - Streaming as source and sink ● Delta Lake is deeply integrated with Spark Structured Streaming through readStream and writeStream. ● Delta Lake overcomes many of the limitations typically associated with streaming systems and files, including: Maintaining “exactly-once” processing with more than one stream (or concurrent batch jobs) Efficiently discovering which files are new when using files as the source for a stream 17 DML Internals (Delete, Update, Merge) - Delta Lake Tech Talks Table deletes, updates, and merges | Databricks on AWS
  • 18. Delta Lake key feature - Streaming as source and sink As a source does not handle input that is not an append and throws an exception if any modifications occur on the table being used as a source. ● delete the output and checkpoint and restart the stream from the beginning. ● set either of these two options: ignoreDeletes ignoreChanges Specify initial position ● startingVersion ● startingTimestamp 18 As a sink ● Append mode ● Complete mode
  • 19. Delta Lake key feature - Delta Lake change data feed ● Support DataBricks Runtime 8.4 and above ● The Delta change data feed represents row-level changes between versions of a Delta table. ● set spark.databricks.delta.properties.defaults.enableChangeDataFeed = true; ● Change data event schema In addition to the data columns, change data contains metadata columns that identify the type of change event: _change_type >>insert, update_preimage , update_postimage, delete _commit_version _commit_timestamp 19 DML Internals (Delete, Update, Merge) - Delta Lake Tech Talks Table deletes, updates, and merges | Databricks on AWS
  • 21. LakeHouse Architecture 21 A paradigm or conception of modern architecture. Rely on Delta Lake under the hood Replace additional data warehouse and data lake Need fast SQL analysis engine Future trend
  • 22. Thank you 22 Any questions are welcome. Learning together