
Delta Lake OSS: Create reliable and performant Data Lake by Quentin Ambard

Delta Lake is an open source framework that sits on top of Parquet files in your data lake to provide reliability and performance. It was open-sourced by Databricks this year and is gaining traction to become the de facto data lake format. We'll see everything Delta Lake can do for your data: ACID transactions, DDL operations, schema enforcement, batch and streaming support, and more!

The Delta Architecture
Quentin Ambard
quentin.ambard@databricks.com
Databricks Workspace
Collaborative Notebooks, production jobs & business insights
Managed platform
Cloud Native
Databricks: Unified Data Analytics Platform
ML Runtime
For your Big data and Machine Learning Lifecycle
...
The Delta Agenda
● A typical Data Lake Architecture
● The Delta Architecture
● Inside Delta Lake
● Demo
Enterprises have been spending millions of dollars getting data into data lakes.
The aspiration is to do data science and ML on all that data using Apache Spark!
Data Science & ML
• Recommendation Engines
• Risk, Fraud Detection
• IoT & Predictive Maintenance
• Genomics & DNA Sequencing
But the data is not ready for data science & ML.
The majority of these projects are failing due to complex pipelines and unreliable data!

Delta Lake OSS: Create reliable and performant Data Lake by Quentin Ambard

  • 2. Databricks Workspace Collaborative Notebooks, production jobs & business insights Managed platform Cloud Native Databricks: Unified Data Analytics Platform ML Runtime For your Big data and Machine Learning Lifecycle ...
  • 3. ● A typical Data Lake Architecture ● The Delta Architecture ● Inside Delta Lake ● Demo The Delta Agenda
  • 4. Enterprises have been spending millions of dollars getting data into data lakes Data Lake
  • 5. The aspiration is to do data science and ML on all that data using Apache Spark! Data Lake Data Science & ML • Recommendation Engines • Risk, Fraud Detection • IoT & Predictive Maintenance • Genomics & DNA Sequencing
  • 6. But the data is not ready for data science & ML. The majority of these projects are failing due to complex pipelines and unreliable data! Data Lake → Data Science & ML • Recommendation Engines • Risk, Fraud Detection • IoT & Predictive Maintenance • Genomics & DNA Sequencing
  • 7. What does a typical data lake project look like?
  • 8. Evolution of a Cutting-Edge Data Lake (diagram labels: Events, ?, Data Lake, Streaming Analytics, AI & Reporting)
  • 9. Evolution of a Cutting-Edge Data Lake (diagram labels: Events, Data Lake, Streaming Analytics, AI & Reporting)
  • 10. Challenge #1: Historical Queries? (diagram labels: Events, λ-arch, Data Lake, Streaming Analytics, AI & Reporting)
  • 11. Challenge #2: Messy Data? (diagram labels: Events, λ-arch, Validation, Data Lake, Streaming Analytics, AI & Reporting)
  • 12. Challenge #3: Mistakes and Failures? (diagram labels: Events, λ-arch, Validation, Reprocessing, Partitioned, Data Lake, Streaming Analytics, AI & Reporting)
  • 13. Challenge #4: Updates? (diagram labels: Events, λ-arch, Validation, Reprocessing, Partitioned, Updates, GDPR..., UPDATE & MERGE, Scheduled to Avoid Modifications, Data Lake, Streaming Analytics, AI & Reporting)
  • 14. Challenge #5: Stability at scale? (diagram labels: Events, λ-arch, Validation, Reprocessing, Partitioned, Small files, Updates, GDPR..., UPDATE & MERGE, Scheduled to Avoid Modifications, Data Lake, Streaming Analytics, AI & Reporting)
  • 15. Data reliability challenges with data lakes: ✗ No atomicity: failed jobs leave data in a corrupt state, requiring tedious recovery ✗ No quality enforcement: creates inconsistent and low-quality data ✗ Lack of consistency / isolation: makes it almost impossible to mix deletes, appends and reads, batch and streaming
  • 16. Let’s try it instead with Delta Lake
  • 17. A New Standard for Building Data Lakes ● Open Format Based on Parquet ● By the creator of Apache Spark ● With Transactions ● Using Spark APIs
  • 18. Is there a better architecture? (same diagram as slide 14, showing all five challenges)
  • 19. Diagram: sources (Data Lake; CSV, JSON, TXT…; Kinesis) → Bronze (Raw Ingestion) → Silver (Filtered, Cleaned, Augmented) → Gold (Business-level Aggregates) → Streaming Analytics, AI & Reporting
  • 20. Quality: Delta Lake allows you to improve the quality of your data until it is ready for consumption.
  • 21. Bronze (Raw Ingestion): raw data with minimal parsing. Supports long retention (years).
  • 22. Silver (Filtered, Cleaned, Augmented): intermediate data with some cleanup applied. Schema enforcement/evolution, data expectations. Queryable for easy debugging!
  • 23. Gold (Business-level Aggregates): clean data, ready for consumption. Read with Spark, Presto, Glue* (*Coming Soon)
  • 24. • Full ACID Transactions • Open Source (Apache License) • Powered by Apache Spark
  • 25. Streams move data through the Delta Lake • Low-latency or manually triggered • Eliminates management of schedules and jobs
  • 26. Delta Lake also supports batch jobs and standard DML while streams run: INSERT, UPDATE, DELETE, MERGE, OVERWRITE • Retention • Corrections • GDPR
  • 27. Easy to recompute when business logic changes: • Clear tables (DELETE) • Restart streams
  • 28. How do I use Delta Lake?
  • 29. Get Started with Delta using Spark APIs
      Instead of parquet...
        dataframe.write.format("parquet").save("/data")
      … simply say delta
        dataframe.write.format("delta").save("/data")
      Add Spark Package
        pyspark --packages io.delta:delta-core_2.12:0.1.0
        bin/spark-shell --packages io.delta:delta-core_2.12:0.1.0
      Maven
        <dependency>
          <groupId>io.delta</groupId>
          <artifactId>delta-core_2.12</artifactId>
          <version>0.1.0</version>
        </dependency>
  • 32. Log Structured Storage: changes to the table are stored as ordered, atomic units called commits. 000000.json: Add 1.parquet, Add 2.parquet. 000001.json: Remove 1.parquet, Remove 2.parquet, Add 3.parquet. …
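A minimal sketch of how ordered commits determine the current table contents, assuming the first commit adds files 1 and 2 and the second replaces them with file 3. This is a toy model in plain Python: the `replay` function and the tuple-based action format are illustrative assumptions, not Delta Lake's actual log format or API.

```python
from typing import Iterable, List, Set, Tuple

# Hypothetical action format: ("add" | "remove", file name).
Action = Tuple[str, str]


def replay(commits: Iterable[List[Action]]) -> Set[str]:
    """Apply commits in order and return the set of live data files."""
    live: Set[str] = set()
    for actions in commits:
        for op, path in actions:
            if op == "add":
                live.add(path)
            elif op == "remove":
                live.discard(path)
            else:
                raise ValueError(f"unknown action: {op}")
    return live


# Assumed grouping of the actions into two commits.
commit_0 = [("add", "1.parquet"), ("add", "2.parquet")]
commit_1 = [("remove", "1.parquet"), ("remove", "2.parquet"), ("add", "3.parquet")]

print(sorted(replay([commit_0])))            # ['1.parquet', '2.parquet']
print(sorted(replay([commit_0, commit_1])))  # ['3.parquet']
```

Because each commit is applied as a whole, a reader sees either the state before a commit or the state after it, never a mix of the two.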
  • 33. Handling Massive Metadata: large tables can have millions of files in them! How do we scale the metadata? Use Spark for scaling! Transaction log: … 0009.json, 0010.json, checkpoint-1.parquet, 0011.json …
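The role of a checkpoint can be sketched as follows: a reader starts from the most recent checkpoint at or before the requested version and replays only the commits written after it. This is a toy model under stated assumptions; `load_snapshot`, the dictionary layout, and the version numbers are hypothetical, and Delta Lake itself reads checkpoints with Spark rather than in plain Python.

```python
from typing import Dict, List, Set, Tuple

Action = Tuple[str, str]  # hypothetical: ("add" | "remove", file name)


def apply_commit(live: Set[str], actions: List[Action]) -> Set[str]:
    """Return a new file set with one commit's actions applied."""
    result = set(live)
    for op, path in actions:
        if op == "add":
            result.add(path)
        else:
            result.discard(path)
    return result


def load_snapshot(
    commits: Dict[int, List[Action]],
    checkpoints: Dict[int, Set[str]],
    version: int,
) -> Set[str]:
    """Rebuild the file set at `version`, starting from the latest usable checkpoint."""
    usable = [v for v in checkpoints if v <= version]
    if usable:
        start = max(usable)
        live = set(checkpoints[start])
        first = start + 1
    else:
        live = set()
        first = 0
    for v in range(first, version + 1):
        live = apply_commit(live, commits[v])
    return live


commits = {
    0: [("add", "a.parquet")],
    1: [("add", "b.parquet")],
    2: [("remove", "a.parquet"), ("add", "c.parquet")],
    3: [("add", "d.parquet")],
}
# A checkpoint stores the full state as of version 2.
checkpoints = {2: {"b.parquet", "c.parquet"}}

# With the checkpoint, only commit 3 has to be read.
print(sorted(load_snapshot({3: commits[3]}, checkpoints, 3)))
```

The result matches a full replay from version 0, but the work grows with the number of commits since the last checkpoint rather than with the whole history.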
  • 34. Delta Lake ensures data reliability. Key Features: ● ACID Transactions / full DML ● Data quality ● Unified Batch & Streaming ● Time Travel/Data Snapshots. Diagram: Batch, Streaming, Updates/Deletes → Parquet Files + Transactional Log → High Quality & Reliable Data always ready for analytics
  • 35. Support concurrent operation Notebook/User 1: SELECT * FROM customers WHERE firstname='xxx' Notebook/User 2: INSERT INTO customers (firstname, …) VALUES ('marc', …) Notebook/User 3: DELETE FROM customers WHERE firstname='quentin'
  • 36. Support concurrent operations. Isolation level: WriteSerializable. Delta resolves conflicts optimistically: conflicting concurrent modifications to a table trigger a rollback.
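Optimistic conflict handling can be illustrated with a toy commit log: a writer records the version it read, and its commit is rejected if another writer committed in the meantime. This sketch is a deliberate simplification, and `ToyLog`, `try_commit`, and `ConflictError` are hypothetical names. It rejects every stale commit, whereas a real implementation may check whether the intervening changes actually conflict before rolling back.

```python
from typing import List


class ConflictError(Exception):
    """Raised when the table changed after the writer read it."""


class ToyLog:
    def __init__(self) -> None:
        self.commits: List[str] = []

    @property
    def version(self) -> int:
        # -1 means the table has no commits yet.
        return len(self.commits) - 1

    def try_commit(self, read_version: int, change: str) -> int:
        """Append `change` only if nobody committed since `read_version`."""
        if read_version != self.version:
            raise ConflictError(
                f"read version {read_version}, table is now at {self.version}"
            )
        self.commits.append(change)
        return self.version


log = ToyLog()
log.try_commit(-1, "create table customers")

# Two writers read the same version, then both try to commit.
seen_by_a = log.version
seen_by_b = log.version
log.try_commit(seen_by_a, "INSERT marc")

try:
    log.try_commit(seen_by_b, "DELETE quentin")
except ConflictError:
    # The losing writer re-reads the table and retries on the new version.
    log.try_commit(log.version, "DELETE quentin")

print(log.commits)
```

No locks are held while the writers work; the only serialized step is the check-and-append at commit time.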
  • 37. Upsert/Merge: Fine-grained Updates
        MERGE INTO customers -- Delta table
        USING updates
        ON customers.customerId = updates.customerId
        WHEN MATCHED THEN
          UPDATE SET address = updates.address
        WHEN NOT MATCHED THEN
          INSERT (customerId, address) VALUES (updates.customerId, updates.address)
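The semantics of this MERGE statement can be sketched on plain Python rows: a matching `customerId` updates only the address, and a non-matching one inserts a new row with just the two listed columns. The `merge_customers` helper and the sample rows are illustrative assumptions; this is not how Delta Lake executes a merge, and it skips the check for several source rows matching the same target row.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def merge_customers(customers: List[Row], updates: List[Row]) -> List[Row]:
    """Toy upsert keyed on customerId, mirroring the MERGE clauses."""
    by_id = {row["customerId"]: dict(row) for row in customers}
    for upd in updates:
        key = upd["customerId"]
        if key in by_id:
            # WHEN MATCHED THEN UPDATE SET address = updates.address
            by_id[key]["address"] = upd["address"]
        else:
            # WHEN NOT MATCHED THEN INSERT (customerId, address)
            by_id[key] = {"customerId": key, "address": upd["address"]}
    return list(by_id.values())


customers = [
    {"customerId": 1, "firstname": "marc", "address": "old street"},
    {"customerId": 2, "firstname": "quentin", "address": "same street"},
]
updates = [
    {"customerId": 1, "address": "new street"},
    {"customerId": 3, "address": "other street"},
]

for row in merge_customers(customers, updates):
    print(row)
```

Customer 1 keeps its `firstname` and gets the new address, customer 2 is untouched, and customer 3 is inserted without a `firstname`.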
  • 38. Ensure Data Quality* (*Coming Soon): enforce metadata, schema, and quality declaratively. Inserts will fail if data doesn’t respect schema or quality.
        table("warehouse")
          .location(…) // Location on DBFS
          .schema(my_schema) // Optional strict schema checking
          .metastoreName(…) // Registration in Hive Metastore
          .description(…) // Human readable description for users
          .expect("validTimestamp", // Expectations on data quality*
                  "timestamp > 2012-01-01 AND …",
                  "fail / alert / quarantine")
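Since the declarative API on this slide is marked as coming soon, the idea can only be sketched with a toy validator: each row is checked against a schema, then against one expectation, and a violation either fails the whole insert or sends the row to quarantine. Every name here (`SCHEMA`, `check_schema`, `insert`, the two-column table) is a hypothetical stand-in, not a Delta Lake API.

```python
from datetime import datetime
from typing import Any, Dict, List, Tuple

Row = Dict[str, Any]

# Hypothetical schema for illustration.
SCHEMA = {"id": int, "timestamp": datetime}


def check_schema(row: Row) -> None:
    """Reject rows whose columns or types differ from SCHEMA."""
    if set(row) != set(SCHEMA):
        raise TypeError(f"unexpected columns: {sorted(row)}")
    for column, expected in SCHEMA.items():
        if not isinstance(row[column], expected):
            raise TypeError(f"column {column!r} is not {expected.__name__}")


def valid_timestamp(row: Row) -> bool:
    """Toy version of the 'validTimestamp' expectation."""
    return row["timestamp"] > datetime(2012, 1, 1)


def insert(rows: List[Row], on_violation: str = "fail") -> Tuple[List[Row], List[Row]]:
    """Return (accepted, quarantined); raise in 'fail' mode on any violation."""
    accepted: List[Row] = []
    quarantined: List[Row] = []
    for row in rows:
        check_schema(row)
        if valid_timestamp(row):
            accepted.append(row)
        elif on_violation == "quarantine":
            quarantined.append(row)
        else:
            raise ValueError("expectation validTimestamp failed")
    return accepted, quarantined


rows = [
    {"id": 1, "timestamp": datetime(2019, 11, 7)},
    {"id": 2, "timestamp": datetime(2001, 1, 1)},
]
accepted, quarantined = insert(rows, on_violation="quarantine")
print(len(accepted), len(quarantined))  # 1 1
```

In "fail" mode the function raises before returning anything, so no partial result reaches the caller.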
  • 39. Unified batch and streaming: concurrent stream/batch with exactly-once processing guarantee. Join stream with table/stream.
  • 40. Time Travel
      Reproduce experiments & reports:
        SELECT count(*) FROM events TIMESTAMP AS OF timestamp
        SELECT count(*) FROM events VERSION AS OF version
        spark.read.format("delta").option("timestampAsOf", timestamp_string).load("/events/")
      Rollback accidental bad writes:
        INSERT INTO my_table SELECT * FROM my_table TIMESTAMP AS OF date_sub(current_date(), 1)
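One way to picture a timestamp-based read is as a lookup from a timestamp to a table version: the reader picks the latest version committed at or before the requested time. The sketch below assumes that rule and uses hypothetical names (`version_as_of`, `commit_times`, `snapshots`); it models the lookup only, not Delta Lake's implementation.

```python
from bisect import bisect_right
from datetime import datetime
from typing import List


def version_as_of(commit_times: List[datetime], ts: datetime) -> int:
    """Return the latest version whose commit time is <= ts.

    `commit_times` must be sorted; its index is the version number.
    """
    index = bisect_right(commit_times, ts) - 1
    if index < 0:
        raise ValueError("timestamp is before the first commit")
    return index


# Hypothetical history: one commit time and one row count per version.
commit_times = [
    datetime(2019, 11, 1),
    datetime(2019, 11, 3),
    datetime(2019, 11, 5),
]
snapshots = [100, 250, 0]  # version 2 was an accidental bad write

version = version_as_of(commit_times, datetime(2019, 11, 4))
print(version, snapshots[version])  # 1 250
```

Reading version 1 again gives the same 250 rows that a report saw on November 4, even though a later write emptied the table.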
  • 42. Workshop Delta & MLflow, Thursday, November 7, 9:00-12:30 https://dbricks.co/workshop-databricks