SlideShare a Scribd company logo
February 21, 2020
Intro to Databricks Delta
Lake
Personal introduction
Mykola Zerniuk - Java / Scala Developer / Data Engineer
Andrii Lenets - Java / Scala Developer / Data Engineer
Outline
❏ Data lake problems
❏ Spark and ACID
❏ Delta Lake key features overview
❏ Comparison with similar data lake storage layers
Data challenges with data lakes
❏ Reliability issues
❏ Failed production jobs leave data in corrupt state requiring tedious recovery
❏ Lack of schema enforcement creates inconsistent and low quality data
❏ Lack of consistency makes it almost impossible to mix appends and reads, batch and streaming
❏ Performance issues
❏ File size inconsistency with either too small or too big files
❏ Partitioning, while useful, can be a performance bottleneck when a query selects too many fields
❏ Slow read/write performance of cloud storage compared to file system storage
Apache Spark and ACID
❏ Why ACID is critical?
❏ Atomicity - all or nothing
❏ Consistency - data is always in a valid state
❏ Isolation - an operation must be isolated from other concurrent operation
❏ Durability - once committed data is never lost
Delta Lake key features
❏ 100% Compatible with Apache Spark API
❏ ACID Transactions
❏ Updates and Deletes
❏ Time Travel (data versioning)
❏ Schema Enforcement / Schema Evolution
ACID Transactions
❏ Delta Lake Transaction Log
❏ Single Source of Truth
❏ The Implementation of Atomicity on Delta Lake
❏ Consistency support
❏ Isolation and Durability out of the box
Updates and Deletes
Time travel
❏ Common Challenges with Changing Data
❏ Audit data changes
❏ Reproduce experiments & reports
❏ Rollbacks
❏ Introducing Time Travel
❏ By version number
❏ By timestamp
Schema Enforcement
❏ How Is Schema Enforcement Useful?
❏ What Is Schema Evolution?
❏ Delta Lake Schema Evolution Options:
❏ Merge schema
❏ Overwrite schema
Delta Lake Transaction Log
❏ How Does the Transaction Log Work
❏ Breaking Down Transactions Into Atomic Commits
❏ The Delta Lake Transaction Log at the File Level
❏ Quickly Recomputing State With Checkpoint Files
❏ Dealing With Multiple Concurrent Reads and Writes
Similar storage layers
Users of Delta Lake
References
https://delta.io/
https://docs.delta.io/latest/index.html
https://databricks.com/blog/2019/08/21/diving-into-delta-lake-unpacking-the-transaction-log.html
https://databricks.com/blog/2019/09/24/diving-into-delta-lake-schema-enforcement-evolution.html
https://databricks.com/blog/2019/02/04/introducing-delta-time-travel-for-large-scale-data-lakes.html
Building Robust Production Data Pipelines with Databricks Delta - https://youtu.be/_-6jqiLnJUM

More Related Content

What's hot

The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
Databricks
 

What's hot (20)

Building Modern Data Platform with Microsoft Azure
Building Modern Data Platform with Microsoft AzureBuilding Modern Data Platform with Microsoft Azure
Building Modern Data Platform with Microsoft Azure
 
Spark with Delta Lake
Spark with Delta LakeSpark with Delta Lake
Spark with Delta Lake
 
Getting Started with Delta Lake on Databricks
Getting Started with Delta Lake on DatabricksGetting Started with Delta Lake on Databricks
Getting Started with Delta Lake on Databricks
 
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
 
Intro to Delta Lake
Intro to Delta LakeIntro to Delta Lake
Intro to Delta Lake
 
Databricks Platform.pptx
Databricks Platform.pptxDatabricks Platform.pptx
Databricks Platform.pptx
 
Data Lake Overview
Data Lake OverviewData Lake Overview
Data Lake Overview
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 
Designing modern dw and data lake
Designing modern dw and data lakeDesigning modern dw and data lake
Designing modern dw and data lake
 
Making Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta LakeMaking Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta Lake
 
TechEvent Databricks on Azure
TechEvent Databricks on AzureTechEvent Databricks on Azure
TechEvent Databricks on Azure
 
Modern Data architecture Design
Modern Data architecture DesignModern Data architecture Design
Modern Data architecture Design
 
Building Lakehouses on Delta Lake with SQL Analytics Primer
Building Lakehouses on Delta Lake with SQL Analytics PrimerBuilding Lakehouses on Delta Lake with SQL Analytics Primer
Building Lakehouses on Delta Lake with SQL Analytics Primer
 
The delta architecture
The delta architectureThe delta architecture
The delta architecture
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
Data Platform Architecture Principles and Evaluation Criteria
Data Platform Architecture Principles and Evaluation CriteriaData Platform Architecture Principles and Evaluation Criteria
Data Platform Architecture Principles and Evaluation Criteria
 
[DSC Europe 22] Overview of the Databricks Platform - Petar Zecevic
[DSC Europe 22] Overview of the Databricks Platform - Petar Zecevic[DSC Europe 22] Overview of the Databricks Platform - Petar Zecevic
[DSC Europe 22] Overview of the Databricks Platform - Petar Zecevic
 
Modern Data Architecture
Modern Data ArchitectureModern Data Architecture
Modern Data Architecture
 
Introducing Databricks Delta
Introducing Databricks DeltaIntroducing Databricks Delta
Introducing Databricks Delta
 
Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta LakeSimplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake
 

Similar to Intro to databricks delta lake

Jdbc Best Practices - DB2/ IDUG - Orlando, May 10, 2004
Jdbc Best Practices - DB2/ IDUG - Orlando, May 10, 2004Jdbc Best Practices - DB2/ IDUG - Orlando, May 10, 2004
Jdbc Best Practices - DB2/ IDUG - Orlando, May 10, 2004
derek_clark_ashmore
 
In-Memory Logical Data Warehouse for accelerating Machine Learning Pipelines ...
In-Memory Logical Data Warehouse for accelerating Machine Learning Pipelines ...In-Memory Logical Data Warehouse for accelerating Machine Learning Pipelines ...
In-Memory Logical Data Warehouse for accelerating Machine Learning Pipelines ...
Gianmario Spacagna
 
datastage training | datastage online training | datastage training videos | ...
datastage training | datastage online training | datastage training videos | ...datastage training | datastage online training | datastage training videos | ...
datastage training | datastage online training | datastage training videos | ...
Nancy Thomas
 

Similar to Intro to databricks delta lake (20)

Healthcare Claim Reimbursement using Apache Spark
Healthcare Claim Reimbursement using Apache SparkHealthcare Claim Reimbursement using Apache Spark
Healthcare Claim Reimbursement using Apache Spark
 
Things learned from OpenWorld 2013
Things learned from OpenWorld 2013Things learned from OpenWorld 2013
Things learned from OpenWorld 2013
 
Edge performance with in memory nosql
Edge performance with in memory nosqlEdge performance with in memory nosql
Edge performance with in memory nosql
 
CI/CD with an Idempotent Kafka Producer & Consumer | Kafka Summit London 2022
CI/CD with an Idempotent Kafka Producer & Consumer | Kafka Summit London 2022CI/CD with an Idempotent Kafka Producer & Consumer | Kafka Summit London 2022
CI/CD with an Idempotent Kafka Producer & Consumer | Kafka Summit London 2022
 
VMworld 2013: Virtualizing Databases: Doing IT Right
VMworld 2013: Virtualizing Databases: Doing IT Right VMworld 2013: Virtualizing Databases: Doing IT Right
VMworld 2013: Virtualizing Databases: Doing IT Right
 
Data stage Online Training
Data stage Online TrainingData stage Online Training
Data stage Online Training
 
So You Want to Build a Data Lake?
So You Want to Build a Data Lake?So You Want to Build a Data Lake?
So You Want to Build a Data Lake?
 
Jdbc Best Practices - DB2/ IDUG - Orlando, May 10, 2004
Jdbc Best Practices - DB2/ IDUG - Orlando, May 10, 2004Jdbc Best Practices - DB2/ IDUG - Orlando, May 10, 2004
Jdbc Best Practices - DB2/ IDUG - Orlando, May 10, 2004
 
In-Memory Logical Data Warehouse for accelerating Machine Learning Pipelines ...
In-Memory Logical Data Warehouse for accelerating Machine Learning Pipelines ...In-Memory Logical Data Warehouse for accelerating Machine Learning Pipelines ...
In-Memory Logical Data Warehouse for accelerating Machine Learning Pipelines ...
 
DataStax | DataStax Tools for Developers (Alex Popescu) | Cassandra Summit 2016
DataStax | DataStax Tools for Developers (Alex Popescu) | Cassandra Summit 2016DataStax | DataStax Tools for Developers (Alex Popescu) | Cassandra Summit 2016
DataStax | DataStax Tools for Developers (Alex Popescu) | Cassandra Summit 2016
 
Clustering van IT-componenten
Clustering van IT-componentenClustering van IT-componenten
Clustering van IT-componenten
 
datastage training | datastage online training | datastage training videos | ...
datastage training | datastage online training | datastage training videos | ...datastage training | datastage online training | datastage training videos | ...
datastage training | datastage online training | datastage training videos | ...
 
Database Architecture & Scaling Strategies, in the Cloud & on the Rack
Database Architecture & Scaling Strategies, in the Cloud & on the Rack Database Architecture & Scaling Strategies, in the Cloud & on the Rack
Database Architecture & Scaling Strategies, in the Cloud & on the Rack
 
NDC Minnesota - Analyzing StackExchange data with Azure Data Lake
NDC Minnesota - Analyzing StackExchange data with Azure Data LakeNDC Minnesota - Analyzing StackExchange data with Azure Data Lake
NDC Minnesota - Analyzing StackExchange data with Azure Data Lake
 
20160524 ibm fast data meetup
20160524 ibm fast data meetup20160524 ibm fast data meetup
20160524 ibm fast data meetup
 
Reshape Data Lake (as of 2020.07)
Reshape Data Lake (as of 2020.07)Reshape Data Lake (as of 2020.07)
Reshape Data Lake (as of 2020.07)
 
Whats new in Oracle Database 12c release 12.1.0.2
Whats new in Oracle Database 12c release 12.1.0.2Whats new in Oracle Database 12c release 12.1.0.2
Whats new in Oracle Database 12c release 12.1.0.2
 
Oracle golden gate 12c New Features
Oracle golden gate 12c New FeaturesOracle golden gate 12c New Features
Oracle golden gate 12c New Features
 
Accelerating Machine Learning Pipelines with Alluxio at Alluxio Meetup 2016
Accelerating Machine Learning Pipelines with Alluxio at Alluxio Meetup 2016Accelerating Machine Learning Pipelines with Alluxio at Alluxio Meetup 2016
Accelerating Machine Learning Pipelines with Alluxio at Alluxio Meetup 2016
 
Sa introduction to big data pipelining with cassandra & spark west mins...
Sa introduction to big data pipelining with cassandra & spark   west mins...Sa introduction to big data pipelining with cassandra & spark   west mins...
Sa introduction to big data pipelining with cassandra & spark west mins...
 

Recently uploaded

一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
ewymefz
 
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
vcaxypu
 
一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单
enxupq
 
一比一原版(TWU毕业证)西三一大学毕业证成绩单
一比一原版(TWU毕业证)西三一大学毕业证成绩单一比一原版(TWU毕业证)西三一大学毕业证成绩单
一比一原版(TWU毕业证)西三一大学毕业证成绩单
ocavb
 
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
nscud
 
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
ukgaet
 
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
vcaxypu
 
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
nscud
 
standardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghhstandardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghh
ArpitMalhotra16
 
Professional Data Engineer Certification Exam Guide  _  Learn  _  Google Clou...
Professional Data Engineer Certification Exam Guide  _  Learn  _  Google Clou...Professional Data Engineer Certification Exam Guide  _  Learn  _  Google Clou...
Professional Data Engineer Certification Exam Guide  _  Learn  _  Google Clou...
Domenico Conte
 
Opendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptxOpendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptx
Opendatabay
 

Recently uploaded (20)

Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
 
Uber Ride Supply Demand Gap Analysis Report
Uber Ride Supply Demand Gap Analysis ReportUber Ride Supply Demand Gap Analysis Report
Uber Ride Supply Demand Gap Analysis Report
 
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
 
Using PDB Relocation to Move a Single PDB to Another Existing CDB
Using PDB Relocation to Move a Single PDB to Another Existing CDBUsing PDB Relocation to Move a Single PDB to Another Existing CDB
Using PDB Relocation to Move a Single PDB to Another Existing CDB
 
Business update Q1 2024 Lar España Real Estate SOCIMI
Business update Q1 2024 Lar España Real Estate SOCIMIBusiness update Q1 2024 Lar España Real Estate SOCIMI
Business update Q1 2024 Lar España Real Estate SOCIMI
 
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
 
Webinar One View, Multiple Systems No-Code Integration of Salesforce and ERPs
Webinar One View, Multiple Systems No-Code Integration of Salesforce and ERPsWebinar One View, Multiple Systems No-Code Integration of Salesforce and ERPs
Webinar One View, Multiple Systems No-Code Integration of Salesforce and ERPs
 
How can I successfully sell my pi coins in Philippines?
How can I successfully sell my pi coins in Philippines?How can I successfully sell my pi coins in Philippines?
How can I successfully sell my pi coins in Philippines?
 
一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单
 
一比一原版(TWU毕业证)西三一大学毕业证成绩单
一比一原版(TWU毕业证)西三一大学毕业证成绩单一比一原版(TWU毕业证)西三一大学毕业证成绩单
一比一原版(TWU毕业证)西三一大学毕业证成绩单
 
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
 
Jpolillo Amazon PPC - Bid Optimization Sample
Jpolillo Amazon PPC - Bid Optimization SampleJpolillo Amazon PPC - Bid Optimization Sample
Jpolillo Amazon PPC - Bid Optimization Sample
 
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
 
Q1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year ReboundQ1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year Rebound
 
Predicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Predicting Product Ad Campaign Performance: A Data Analysis Project PresentationPredicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Predicting Product Ad Campaign Performance: A Data Analysis Project Presentation
 
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
 
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
 
standardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghhstandardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghh
 
Professional Data Engineer Certification Exam Guide  _  Learn  _  Google Clou...
Professional Data Engineer Certification Exam Guide  _  Learn  _  Google Clou...Professional Data Engineer Certification Exam Guide  _  Learn  _  Google Clou...
Professional Data Engineer Certification Exam Guide  _  Learn  _  Google Clou...
 
Opendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptxOpendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptx
 

Intro to databricks delta lake