SlideShare a Scribd company logo
WIFI SSID:Spark+AISummit | Password: UnifiedDataAnalytics
Nagaraj Sengodan, HCL Technologies
Nitin Raj Soundararajan, Cognizant Worldwide Limited
Databricks Delta Lake
And Its Benefits
#UnifiedDataAnalytics #SparkAISummit
3
1
2
3
4
5
What is Delta lake offering from Databricks and overview
The Brief
Good news is Delta Lake is open source now!
Open Source
Why I care about Delta lake?
Benefits
We do the necessary steps to deliver the result.
Modern Warehouse
Are we ready for building unified data platform?
Conclusion
Agenda
4
Who
are
we ? NITIN RAJ
SOUNDARARAJAN
Senior Consultant –
Data, AI and Analytics
Cognizant Worldwide Limited
NAGARAJ
SENGODAN
Senior Manager – Data
and Analytics
HCL Technologies
Common Challenges with Data Lakes
5
5
Data Lake getting Polluted
Unsafe Writes Orphan Data
No Schema Evolution
No Schema
Hot path for Streaming
ACID Transaction Metadata Handling
Unified Batch and Stream
Schema Enforcement
Time Travel Audit History
Full DML Support
Typical Databricks Architecture
6
SecurityIntegration
DATABRICKS COLLABORATIVE
WORKSPACE
Apis
Jobs Models
Notebooks
Dashboards
DATA ENGINEERS DATA
SCIENTISTS
DATABRICKS RUNTIME
for Big Data for Machine Learning
Batch & Streaming
Data Lakes & Data
Warehouses
DATABRICKS CLOUD SERVICE
Databricks - Delta Lake Architecture
7
SecurityIntegration
DATABRICKS COLLABORATIVE
WORKSPACE
Apis
Jobs Models
Notebooks
Dashboards
DATA ENGINEERS DATA
SCIENTISTS
DATABRICKS RUNTIME
for Big Data for Machine Learning
Batch & Streaming
Data Lakes & Data
Warehouses
DATABRICKS CLOUD SERVICE
DATABRICKS DELTA
The Brief
8
Delta.io – OPEN SOURCE.
APR. 2019
Announcing Delta Lake Open Source Project | Ali Ghodsi
Delta 0.2 – Cloud storage
JUN. 2019
Support for cloud storage (Amazon S3, Azure Blob Storage) and Improved Concurrency (Append-only
writes ensuring serializability)
Delta 0.3 – Scala/Java API
AUG. 2019
Scala Java APIs and DML Commands, Query Commit History and vacuuming old files
Delta 0.4 – Python APIs and Convert to Delta
OCT. 2019
Python APIs for DML and utility operations, Convert-to-Delta, SQL for Utility Operations
Demo
9
Benefits
10
ACID transactions on Spark
Scalable metadata handling
Unified Batch and Streaming Source
Schema enforcement
Time travel
Audit History
Full DML Support
ACID Transaction on Spark
11
The Brief
Every write is a transaction
Serial order for writes recorded in a transaction
log
01 Multiple Writes
Multiple writes trying to modify the same
files don’t happen that often
02
Optimistic
concurrency
Continuously keep writing to a directory or
table and consumers to keep reading from
the same directory or table
03
Serializable
isolation level
Scalable Metadata Handling
12
Metadata information of a table or directory in the
transaction log instead of the metastore
01
Metadata in
Transaction
Log
Delta Lake can list files in large directories in
constant time
02
Efficient Data
Read
Unified Batch and Streaming Sink
13
Efficient streaming sink with Apache Spark’s
structured streaming01 Streaming Link
with ACID transactions and scalable metadata
handling, the efficient streaming sink now
enables lot of near real-time analytics use
cases without having to maintain a
complicated streaming and batch pipeline
02
Near Real
Time
Analytics
Schema enforcement
14
Automatically validates the DataFrame’s schema
with schema of the table01
Automatic
Schema
Validation
Columns that are present in the table but not
in the DataFrame are set to null
Exception is thrown when when extra column
present in the DataFrame but not in Table
02
Column
Validation
03
Serializable
isolation level
Delta Lake has DDL to explicitly add new
columns
Ability to update the schema automatically
Time Travel and Data Versioning
15
Allows users to read a previous snapshot of the
table or directory
01 Snapshots
Newer version of the files are created when
the files are modified during writes and older
versions are preserved
02 Versioning
Provide a timestamp or a version number to
Apache Spark’s read APIs to read the older
version of the table or directory
Delta Lake constructs the full snapshot as of
that timestamp or version based on the
information in the transaction log
User can reproduce experiments and reports
and also can revert a table to its older
versions
03
Timestamp and
Transaction
Log
Record Update and Deletion (Coming Soon)
16
Will support Merge, Update and Delete
Easily upsert and delete records in data lakes
simplify their change data capture and GDPR
use cases
01
Merge
Update
Delete
More efficient than reading and overwriting
entire partitions or tables02
File-level
granularity
Data Expectations (Coming Soon)
17
Will support an API to set expectations on tables
or directories01
API to set data
expectations
Engineers will be able to specify a boolean
condition and tune the severity to handle
data expectations
02
Severity to
handle
expectations
Modern Data warehouse
18
Ingestion Tables Refined Tables
Feature/Agg Data Store
Existing Data Lake
Azure Data Lake Storage
Analytics
Machine Learning
Conclusion
19
Delta Lake Data Lake
Unification High
Batch and Stream data set can be processed in
same pipeline
Medium
Stream process require hot
pipeline
Reliable High
As it enforce schema and ACID operations
helps data lake more reliable
Less
Accept all data and late binding
leads lot of orphan data
Ease of Use Medium
Delta require DBA operations like Vacuum and
Optimize
High
No write on schema and accept
any data
Performance High
Z-Order skipping files for efficient read
Medium
Sequence read
Like to know more?
20
https://github.com/KRSNagaraj/SparkSummit2019
https://www.linkedin.com/in/NagarajSengodan
https://www.linkedin.com/in/NitinRajS/
DON’T FORGET TO RATE
AND REVIEW THE SESSIONS
SEARCH SPARK + AI SUMMIT

More Related Content

What's hot

Databricks Platform.pptx
Databricks Platform.pptxDatabricks Platform.pptx
Databricks Platform.pptx
Alex Ivy
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
Databricks
 
Delta Lake with Azure Databricks
Delta Lake with Azure DatabricksDelta Lake with Azure Databricks
Delta Lake with Azure Databricks
Dustin Vannoy
 
Delta from a Data Engineer's Perspective
Delta from a Data Engineer's PerspectiveDelta from a Data Engineer's Perspective
Delta from a Data Engineer's Perspective
Databricks
 
Making Data Timelier and More Reliable with Lakehouse Technology
Making Data Timelier and More Reliable with Lakehouse TechnologyMaking Data Timelier and More Reliable with Lakehouse Technology
Making Data Timelier and More Reliable with Lakehouse Technology
Matei Zaharia
 
Introducing Databricks Delta
Introducing Databricks DeltaIntroducing Databricks Delta
Introducing Databricks Delta
Databricks
 
Architect’s Open-Source Guide for a Data Mesh Architecture
Architect’s Open-Source Guide for a Data Mesh ArchitectureArchitect’s Open-Source Guide for a Data Mesh Architecture
Architect’s Open-Source Guide for a Data Mesh Architecture
Databricks
 
Lakehouse in Azure
Lakehouse in AzureLakehouse in Azure
Lakehouse in Azure
Sergio Zenatti Filho
 
Diving into Delta Lake: Unpacking the Transaction Log
Diving into Delta Lake: Unpacking the Transaction LogDiving into Delta Lake: Unpacking the Transaction Log
Diving into Delta Lake: Unpacking the Transaction Log
Databricks
 
Build Real-Time Applications with Databricks Streaming
Build Real-Time Applications with Databricks StreamingBuild Real-Time Applications with Databricks Streaming
Build Real-Time Applications with Databricks Streaming
Databricks
 
Microsoft Azure Databricks
Microsoft Azure DatabricksMicrosoft Azure Databricks
Microsoft Azure Databricks
Sascha Dittmann
 
Apache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the CoversApache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the Covers
ScyllaDB
 
Databricks Fundamentals
Databricks FundamentalsDatabricks Fundamentals
Databricks Fundamentals
Dalibor Wijas
 
[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...
[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...
[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...
DataScienceConferenc1
 
Pipelines and Data Flows: Introduction to Data Integration in Azure Synapse A...
Pipelines and Data Flows: Introduction to Data Integration in Azure Synapse A...Pipelines and Data Flows: Introduction to Data Integration in Azure Synapse A...
Pipelines and Data Flows: Introduction to Data Integration in Azure Synapse A...
Cathrine Wilhelmsen
 
DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
Databricks
 
Building Lakehouses on Delta Lake with SQL Analytics Primer
Building Lakehouses on Delta Lake with SQL Analytics PrimerBuilding Lakehouses on Delta Lake with SQL Analytics Primer
Building Lakehouses on Delta Lake with SQL Analytics Primer
Databricks
 
Free Training: How to Build a Lakehouse
Free Training: How to Build a LakehouseFree Training: How to Build a Lakehouse
Free Training: How to Build a Lakehouse
Databricks
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and Hudi
Databricks
 
[DSC Europe 22] Overview of the Databricks Platform - Petar Zecevic
[DSC Europe 22] Overview of the Databricks Platform - Petar Zecevic[DSC Europe 22] Overview of the Databricks Platform - Petar Zecevic
[DSC Europe 22] Overview of the Databricks Platform - Petar Zecevic
DataScienceConferenc1
 

What's hot (20)

Databricks Platform.pptx
Databricks Platform.pptxDatabricks Platform.pptx
Databricks Platform.pptx
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
Delta Lake with Azure Databricks
Delta Lake with Azure DatabricksDelta Lake with Azure Databricks
Delta Lake with Azure Databricks
 
Delta from a Data Engineer's Perspective
Delta from a Data Engineer's PerspectiveDelta from a Data Engineer's Perspective
Delta from a Data Engineer's Perspective
 
Making Data Timelier and More Reliable with Lakehouse Technology
Making Data Timelier and More Reliable with Lakehouse TechnologyMaking Data Timelier and More Reliable with Lakehouse Technology
Making Data Timelier and More Reliable with Lakehouse Technology
 
Introducing Databricks Delta
Introducing Databricks DeltaIntroducing Databricks Delta
Introducing Databricks Delta
 
Architect’s Open-Source Guide for a Data Mesh Architecture
Architect’s Open-Source Guide for a Data Mesh ArchitectureArchitect’s Open-Source Guide for a Data Mesh Architecture
Architect’s Open-Source Guide for a Data Mesh Architecture
 
Lakehouse in Azure
Lakehouse in AzureLakehouse in Azure
Lakehouse in Azure
 
Diving into Delta Lake: Unpacking the Transaction Log
Diving into Delta Lake: Unpacking the Transaction LogDiving into Delta Lake: Unpacking the Transaction Log
Diving into Delta Lake: Unpacking the Transaction Log
 
Build Real-Time Applications with Databricks Streaming
Build Real-Time Applications with Databricks StreamingBuild Real-Time Applications with Databricks Streaming
Build Real-Time Applications with Databricks Streaming
 
Microsoft Azure Databricks
Microsoft Azure DatabricksMicrosoft Azure Databricks
Microsoft Azure Databricks
 
Apache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the CoversApache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the Covers
 
Databricks Fundamentals
Databricks FundamentalsDatabricks Fundamentals
Databricks Fundamentals
 
[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...
[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...
[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...
 
Pipelines and Data Flows: Introduction to Data Integration in Azure Synapse A...
Pipelines and Data Flows: Introduction to Data Integration in Azure Synapse A...Pipelines and Data Flows: Introduction to Data Integration in Azure Synapse A...
Pipelines and Data Flows: Introduction to Data Integration in Azure Synapse A...
 
DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Building Lakehouses on Delta Lake with SQL Analytics Primer
Building Lakehouses on Delta Lake with SQL Analytics PrimerBuilding Lakehouses on Delta Lake with SQL Analytics Primer
Building Lakehouses on Delta Lake with SQL Analytics Primer
 
Free Training: How to Build a Lakehouse
Free Training: How to Build a LakehouseFree Training: How to Build a Lakehouse
Free Training: How to Build a Lakehouse
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and Hudi
 
[DSC Europe 22] Overview of the Databricks Platform - Petar Zecevic
[DSC Europe 22] Overview of the Databricks Platform - Petar Zecevic[DSC Europe 22] Overview of the Databricks Platform - Petar Zecevic
[DSC Europe 22] Overview of the Databricks Platform - Petar Zecevic
 

Similar to Databricks Delta Lake and Its Benefits

Otimizações de Projetos de Big Data, Dw e AI no Microsoft Azure
Otimizações de Projetos de Big Data, Dw e AI no Microsoft AzureOtimizações de Projetos de Big Data, Dw e AI no Microsoft Azure
Otimizações de Projetos de Big Data, Dw e AI no Microsoft Azure
Luan Moreno Medeiros Maciel
 
Lessons Learned from Modernizing USCIS Data Analytics Platform
Lessons Learned from Modernizing USCIS Data Analytics PlatformLessons Learned from Modernizing USCIS Data Analytics Platform
Lessons Learned from Modernizing USCIS Data Analytics Platform
Databricks
 
Getting Started with Delta Lake on Databricks
Getting Started with Delta Lake on DatabricksGetting Started with Delta Lake on Databricks
Getting Started with Delta Lake on Databricks
Knoldus Inc.
 
Reshape Data Lake (as of 2020.07)
Reshape Data Lake (as of 2020.07)Reshape Data Lake (as of 2020.07)
Reshape Data Lake (as of 2020.07)
Eric Sun
 
What Is Delta Lake ???
What Is Delta Lake ???What Is Delta Lake ???
What Is Delta Lake ???
✪Computants✪IBM_BP
 
2018 02-08-what's-new-in-apache-spark-2.3
2018 02-08-what's-new-in-apache-spark-2.3 2018 02-08-what's-new-in-apache-spark-2.3
2018 02-08-what's-new-in-apache-spark-2.3
Chester Chen
 
What's New in Upcoming Apache Spark 2.3
What's New in Upcoming Apache Spark 2.3What's New in Upcoming Apache Spark 2.3
What's New in Upcoming Apache Spark 2.3
Databricks
 
Streaming Data and Stream Processing with Apache Kafka
Streaming Data and Stream Processing with Apache KafkaStreaming Data and Stream Processing with Apache Kafka
Streaming Data and Stream Processing with Apache Kafka
confluent
 
Cosmos DB Real-time Advanced Analytics Workshop
Cosmos DB Real-time Advanced Analytics WorkshopCosmos DB Real-time Advanced Analytics Workshop
Cosmos DB Real-time Advanced Analytics Workshop
Databricks
 
The Future of Hadoop: A deeper look at Apache Spark
The Future of Hadoop: A deeper look at Apache SparkThe Future of Hadoop: A deeper look at Apache Spark
The Future of Hadoop: A deeper look at Apache Spark
Cloudera, Inc.
 
Leveraging Mainframe Data for Modern Analytics
Leveraging Mainframe Data for Modern AnalyticsLeveraging Mainframe Data for Modern Analytics
Leveraging Mainframe Data for Modern Analytics
confluent
 
Webinar future dataintegration-datamesh-and-goldengatekafka
Webinar future dataintegration-datamesh-and-goldengatekafkaWebinar future dataintegration-datamesh-and-goldengatekafka
Webinar future dataintegration-datamesh-and-goldengatekafka
Jeffrey T. Pollock
 
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
Databricks
 
Data Science Across Data Sources with Apache Arrow
Data Science Across Data Sources with Apache ArrowData Science Across Data Sources with Apache Arrow
Data Science Across Data Sources with Apache Arrow
Databricks
 
Discovery Day 2019 Sofia - What is new in SQL Server 2019
Discovery Day 2019 Sofia - What is new in SQL Server 2019Discovery Day 2019 Sofia - What is new in SQL Server 2019
Discovery Day 2019 Sofia - What is new in SQL Server 2019
Ivan Donev
 
Designing & Optimizing Micro Batching Systems Using 100+ Nodes (Ananth Ram, R...
Designing & Optimizing Micro Batching Systems Using 100+ Nodes (Ananth Ram, R...Designing & Optimizing Micro Batching Systems Using 100+ Nodes (Ananth Ram, R...
Designing & Optimizing Micro Batching Systems Using 100+ Nodes (Ananth Ram, R...
DataStax
 
Pulsar in the Lakehouse: Overview of Apache Pulsar and Delta Lake Connector -...
Pulsar in the Lakehouse: Overview of Apache Pulsar and Delta Lake Connector -...Pulsar in the Lakehouse: Overview of Apache Pulsar and Delta Lake Connector -...
Pulsar in the Lakehouse: Overview of Apache Pulsar and Delta Lake Connector -...
StreamNative
 
Building a modern data warehouse
Building a modern data warehouseBuilding a modern data warehouse
Building a modern data warehouse
James Serra
 
Unlocking the Value of Your Data Lake
Unlocking the Value of Your Data LakeUnlocking the Value of Your Data Lake
Unlocking the Value of Your Data Lake
DATAVERSITY
 
Google take on heterogeneous data base replication
Google take on heterogeneous data base replication Google take on heterogeneous data base replication
Google take on heterogeneous data base replication
Svetlin Stanchev
 

Similar to Databricks Delta Lake and Its Benefits (20)

Otimizações de Projetos de Big Data, Dw e AI no Microsoft Azure
Otimizações de Projetos de Big Data, Dw e AI no Microsoft AzureOtimizações de Projetos de Big Data, Dw e AI no Microsoft Azure
Otimizações de Projetos de Big Data, Dw e AI no Microsoft Azure
 
Lessons Learned from Modernizing USCIS Data Analytics Platform
Lessons Learned from Modernizing USCIS Data Analytics PlatformLessons Learned from Modernizing USCIS Data Analytics Platform
Lessons Learned from Modernizing USCIS Data Analytics Platform
 
Getting Started with Delta Lake on Databricks
Getting Started with Delta Lake on DatabricksGetting Started with Delta Lake on Databricks
Getting Started with Delta Lake on Databricks
 
Reshape Data Lake (as of 2020.07)
Reshape Data Lake (as of 2020.07)Reshape Data Lake (as of 2020.07)
Reshape Data Lake (as of 2020.07)
 
What Is Delta Lake ???
What Is Delta Lake ???What Is Delta Lake ???
What Is Delta Lake ???
 
2018 02-08-what's-new-in-apache-spark-2.3
2018 02-08-what's-new-in-apache-spark-2.3 2018 02-08-what's-new-in-apache-spark-2.3
2018 02-08-what's-new-in-apache-spark-2.3
 
What's New in Upcoming Apache Spark 2.3
What's New in Upcoming Apache Spark 2.3What's New in Upcoming Apache Spark 2.3
What's New in Upcoming Apache Spark 2.3
 
Streaming Data and Stream Processing with Apache Kafka
Streaming Data and Stream Processing with Apache KafkaStreaming Data and Stream Processing with Apache Kafka
Streaming Data and Stream Processing with Apache Kafka
 
Cosmos DB Real-time Advanced Analytics Workshop
Cosmos DB Real-time Advanced Analytics WorkshopCosmos DB Real-time Advanced Analytics Workshop
Cosmos DB Real-time Advanced Analytics Workshop
 
The Future of Hadoop: A deeper look at Apache Spark
The Future of Hadoop: A deeper look at Apache SparkThe Future of Hadoop: A deeper look at Apache Spark
The Future of Hadoop: A deeper look at Apache Spark
 
Leveraging Mainframe Data for Modern Analytics
Leveraging Mainframe Data for Modern AnalyticsLeveraging Mainframe Data for Modern Analytics
Leveraging Mainframe Data for Modern Analytics
 
Webinar future dataintegration-datamesh-and-goldengatekafka
Webinar future dataintegration-datamesh-and-goldengatekafkaWebinar future dataintegration-datamesh-and-goldengatekafka
Webinar future dataintegration-datamesh-and-goldengatekafka
 
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
 
Data Science Across Data Sources with Apache Arrow
Data Science Across Data Sources with Apache ArrowData Science Across Data Sources with Apache Arrow
Data Science Across Data Sources with Apache Arrow
 
Discovery Day 2019 Sofia - What is new in SQL Server 2019
Discovery Day 2019 Sofia - What is new in SQL Server 2019Discovery Day 2019 Sofia - What is new in SQL Server 2019
Discovery Day 2019 Sofia - What is new in SQL Server 2019
 
Designing & Optimizing Micro Batching Systems Using 100+ Nodes (Ananth Ram, R...
Designing & Optimizing Micro Batching Systems Using 100+ Nodes (Ananth Ram, R...Designing & Optimizing Micro Batching Systems Using 100+ Nodes (Ananth Ram, R...
Designing & Optimizing Micro Batching Systems Using 100+ Nodes (Ananth Ram, R...
 
Pulsar in the Lakehouse: Overview of Apache Pulsar and Delta Lake Connector -...
Pulsar in the Lakehouse: Overview of Apache Pulsar and Delta Lake Connector -...Pulsar in the Lakehouse: Overview of Apache Pulsar and Delta Lake Connector -...
Pulsar in the Lakehouse: Overview of Apache Pulsar and Delta Lake Connector -...
 
Building a modern data warehouse
Building a modern data warehouseBuilding a modern data warehouse
Building a modern data warehouse
 
Unlocking the Value of Your Data Lake
Unlocking the Value of Your Data LakeUnlocking the Value of Your Data Lake
Unlocking the Value of Your Data Lake
 
Google take on heterogeneous data base replication
Google take on heterogeneous data base replication Google take on heterogeneous data base replication
Google take on heterogeneous data base replication
 

More from Databricks

Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
Databricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
Databricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
Databricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
Databricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
Databricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
Databricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Databricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
Databricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Databricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
Databricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Databricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
Databricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
Databricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
Databricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
Databricks
 
Machine Learning CI/CD for Email Attack Detection
Machine Learning CI/CD for Email Attack DetectionMachine Learning CI/CD for Email Attack Detection
Machine Learning CI/CD for Email Attack Detection
Databricks
 
Jeeves Grows Up: An AI Chatbot for Performance and Quality
Jeeves Grows Up: An AI Chatbot for Performance and QualityJeeves Grows Up: An AI Chatbot for Performance and Quality
Jeeves Grows Up: An AI Chatbot for Performance and Quality
Databricks
 

More from Databricks (20)

Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 
Machine Learning CI/CD for Email Attack Detection
Machine Learning CI/CD for Email Attack DetectionMachine Learning CI/CD for Email Attack Detection
Machine Learning CI/CD for Email Attack Detection
 
Jeeves Grows Up: An AI Chatbot for Performance and Quality
Jeeves Grows Up: An AI Chatbot for Performance and QualityJeeves Grows Up: An AI Chatbot for Performance and Quality
Jeeves Grows Up: An AI Chatbot for Performance and Quality
 

Recently uploaded

transgenders community data in india by govt
transgenders community data in india by govttransgenders community data in india by govt
transgenders community data in india by govt
palanisamyiiiier
 
Celebrity Girls Call Delhi 🎈🔥9711199171 🔥💋🎈 Provide Best And Top Girl Service...
Celebrity Girls Call Delhi 🎈🔥9711199171 🔥💋🎈 Provide Best And Top Girl Service...Celebrity Girls Call Delhi 🎈🔥9711199171 🔥💋🎈 Provide Best And Top Girl Service...
Celebrity Girls Call Delhi 🎈🔥9711199171 🔥💋🎈 Provide Best And Top Girl Service...
tanupasswan6
 
ISBP 821 - UCP 600 - ed).pdf banking standards
ISBP 821 - UCP 600 - ed).pdf banking standardsISBP 821 - UCP 600 - ed).pdf banking standards
ISBP 821 - UCP 600 - ed).pdf banking standards
DevanshuAnada1
 
Mumbai Girls Call Mumbai 🛵🚡9910780858 💃 Choose Best And Top Girl Service And ...
Mumbai Girls Call Mumbai 🛵🚡9910780858 💃 Choose Best And Top Girl Service And ...Mumbai Girls Call Mumbai 🛵🚡9910780858 💃 Choose Best And Top Girl Service And ...
Mumbai Girls Call Mumbai 🛵🚡9910780858 💃 Choose Best And Top Girl Service And ...
norina2645
 
Maruti Wagon R on road price in Faridabad - CarDekho
Maruti Wagon R on road price in Faridabad - CarDekhoMaruti Wagon R on road price in Faridabad - CarDekho
Maruti Wagon R on road price in Faridabad - CarDekho
kamli sharma#S10
 
CHAPTER-1-Introduction-to-Marketing.pptx
CHAPTER-1-Introduction-to-Marketing.pptxCHAPTER-1-Introduction-to-Marketing.pptx
CHAPTER-1-Introduction-to-Marketing.pptx
girewiy968
 
Harendra Singh, AI Strategy and Consulting Portfolio
Harendra Singh, AI Strategy and Consulting PortfolioHarendra Singh, AI Strategy and Consulting Portfolio
Harendra Singh, AI Strategy and Consulting Portfolio
harendmgr
 
OpenMetadata Spotlight - OpenMetadata @ Aspire by Vinol Joy Dsouza
OpenMetadata Spotlight - OpenMetadata @ Aspire by Vinol Joy DsouzaOpenMetadata Spotlight - OpenMetadata @ Aspire by Vinol Joy Dsouza
OpenMetadata Spotlight - OpenMetadata @ Aspire by Vinol Joy Dsouza
OpenMetadata
 
ch8_multiplexing cs553 st07 slide share ss
ch8_multiplexing cs553 st07 slide share ssch8_multiplexing cs553 st07 slide share ss
ch8_multiplexing cs553 st07 slide share ss
MinThetLwin1
 
DU degree offer diploma Transcript
DU degree offer diploma TranscriptDU degree offer diploma Transcript
DU degree offer diploma Transcript
uapta
 
Data analytics and Access Program Recommendations
Data analytics and Access Program RecommendationsData analytics and Access Program Recommendations
Data analytics and Access Program Recommendations
hemantsharmaus
 
Research proposal seminar ,Research Methodology
Research proposal seminar ,Research MethodologyResearch proposal seminar ,Research Methodology
Research proposal seminar ,Research Methodology
doctorzlife786
 
Fine-Tuning of Small/Medium LLMs for Business QA on Structured Data
Fine-Tuning of Small/Medium LLMs for Business QA on Structured DataFine-Tuning of Small/Medium LLMs for Business QA on Structured Data
Fine-Tuning of Small/Medium LLMs for Business QA on Structured Data
kevig
 
Noida Girls Call Noida 9873940964 Unlimited Short Providing Girls Service Ava...
Noida Girls Call Noida 9873940964 Unlimited Short Providing Girls Service Ava...Noida Girls Call Noida 9873940964 Unlimited Short Providing Girls Service Ava...
Noida Girls Call Noida 9873940964 Unlimited Short Providing Girls Service Ava...
kinni singh$A17
 
Introduction to the Red Hat Portfolio.pdf
Introduction to the Red Hat Portfolio.pdfIntroduction to the Red Hat Portfolio.pdf
Introduction to the Red Hat Portfolio.pdf
kihus38
 
Machine learning _new.pptx for a presentation
Machine learning _new.pptx for a presentationMachine learning _new.pptx for a presentation
Machine learning _new.pptx for a presentation
RahulS66654
 
Girls Call Vadodara 000XX00000 Provide Best And Top Girl Service And No1 in City
Girls Call Vadodara 000XX00000 Provide Best And Top Girl Service And No1 in CityGirls Call Vadodara 000XX00000 Provide Best And Top Girl Service And No1 in City
Girls Call Vadodara 000XX00000 Provide Best And Top Girl Service And No1 in City
gargnatasha985
 
Supervised Learning (Data Science).pptx
Supervised Learning  (Data Science).pptxSupervised Learning  (Data Science).pptx
Supervised Learning (Data Science).pptx
TARIKU ENDALE
 
all about the data science process, covering the steps present in almost ever...
all about the data science process, covering the steps present in almost ever...all about the data science process, covering the steps present in almost ever...
all about the data science process, covering the steps present in almost ever...
palaniappancse
 
Artificial Intelligence (AI) Technology Project Proposal _ by Slidesgo.pptx
Artificial Intelligence (AI) Technology Project Proposal _ by Slidesgo.pptxArtificial Intelligence (AI) Technology Project Proposal _ by Slidesgo.pptx
Artificial Intelligence (AI) Technology Project Proposal _ by Slidesgo.pptx
vaishnavisharma877623
 

Recently uploaded (20)

transgenders community data in india by govt
transgenders community data in india by govttransgenders community data in india by govt
transgenders community data in india by govt
 
Celebrity Girls Call Delhi 🎈🔥9711199171 🔥💋🎈 Provide Best And Top Girl Service...
Celebrity Girls Call Delhi 🎈🔥9711199171 🔥💋🎈 Provide Best And Top Girl Service...Celebrity Girls Call Delhi 🎈🔥9711199171 🔥💋🎈 Provide Best And Top Girl Service...
Celebrity Girls Call Delhi 🎈🔥9711199171 🔥💋🎈 Provide Best And Top Girl Service...
 
ISBP 821 - UCP 600 - ed).pdf banking standards
ISBP 821 - UCP 600 - ed).pdf banking standardsISBP 821 - UCP 600 - ed).pdf banking standards
ISBP 821 - UCP 600 - ed).pdf banking standards
 
Mumbai Girls Call Mumbai 🛵🚡9910780858 💃 Choose Best And Top Girl Service And ...
Mumbai Girls Call Mumbai 🛵🚡9910780858 💃 Choose Best And Top Girl Service And ...Mumbai Girls Call Mumbai 🛵🚡9910780858 💃 Choose Best And Top Girl Service And ...
Mumbai Girls Call Mumbai 🛵🚡9910780858 💃 Choose Best And Top Girl Service And ...
 
Maruti Wagon R on road price in Faridabad - CarDekho
Maruti Wagon R on road price in Faridabad - CarDekhoMaruti Wagon R on road price in Faridabad - CarDekho
Maruti Wagon R on road price in Faridabad - CarDekho
 
CHAPTER-1-Introduction-to-Marketing.pptx
CHAPTER-1-Introduction-to-Marketing.pptxCHAPTER-1-Introduction-to-Marketing.pptx
CHAPTER-1-Introduction-to-Marketing.pptx
 
Harendra Singh, AI Strategy and Consulting Portfolio
Harendra Singh, AI Strategy and Consulting PortfolioHarendra Singh, AI Strategy and Consulting Portfolio
Harendra Singh, AI Strategy and Consulting Portfolio
 
OpenMetadata Spotlight - OpenMetadata @ Aspire by Vinol Joy Dsouza
OpenMetadata Spotlight - OpenMetadata @ Aspire by Vinol Joy DsouzaOpenMetadata Spotlight - OpenMetadata @ Aspire by Vinol Joy Dsouza
OpenMetadata Spotlight - OpenMetadata @ Aspire by Vinol Joy Dsouza
 
ch8_multiplexing cs553 st07 slide share ss
ch8_multiplexing cs553 st07 slide share ssch8_multiplexing cs553 st07 slide share ss
ch8_multiplexing cs553 st07 slide share ss
 
DU degree offer diploma Transcript
DU degree offer diploma TranscriptDU degree offer diploma Transcript
DU degree offer diploma Transcript
 
Data analytics and Access Program Recommendations
Data analytics and Access Program RecommendationsData analytics and Access Program Recommendations
Data analytics and Access Program Recommendations
 
Research proposal seminar ,Research Methodology
Research proposal seminar ,Research MethodologyResearch proposal seminar ,Research Methodology
Research proposal seminar ,Research Methodology
 
Fine-Tuning of Small/Medium LLMs for Business QA on Structured Data
Fine-Tuning of Small/Medium LLMs for Business QA on Structured DataFine-Tuning of Small/Medium LLMs for Business QA on Structured Data
Fine-Tuning of Small/Medium LLMs for Business QA on Structured Data
 
Noida Girls Call Noida 9873940964 Unlimited Short Providing Girls Service Ava...
Noida Girls Call Noida 9873940964 Unlimited Short Providing Girls Service Ava...Noida Girls Call Noida 9873940964 Unlimited Short Providing Girls Service Ava...
Noida Girls Call Noida 9873940964 Unlimited Short Providing Girls Service Ava...
 
Introduction to the Red Hat Portfolio.pdf
Introduction to the Red Hat Portfolio.pdfIntroduction to the Red Hat Portfolio.pdf
Introduction to the Red Hat Portfolio.pdf
 
Machine learning _new.pptx for a presentation
Machine learning _new.pptx for a presentationMachine learning _new.pptx for a presentation
Machine learning _new.pptx for a presentation
 
Girls Call Vadodara 000XX00000 Provide Best And Top Girl Service And No1 in City
Girls Call Vadodara 000XX00000 Provide Best And Top Girl Service And No1 in CityGirls Call Vadodara 000XX00000 Provide Best And Top Girl Service And No1 in City
Girls Call Vadodara 000XX00000 Provide Best And Top Girl Service And No1 in City
 
Supervised Learning (Data Science).pptx
Supervised Learning  (Data Science).pptxSupervised Learning  (Data Science).pptx
Supervised Learning (Data Science).pptx
 
all about the data science process, covering the steps present in almost ever...
all about the data science process, covering the steps present in almost ever...all about the data science process, covering the steps present in almost ever...
all about the data science process, covering the steps present in almost ever...
 
Artificial Intelligence (AI) Technology Project Proposal _ by Slidesgo.pptx
Artificial Intelligence (AI) Technology Project Proposal _ by Slidesgo.pptxArtificial Intelligence (AI) Technology Project Proposal _ by Slidesgo.pptx
Artificial Intelligence (AI) Technology Project Proposal _ by Slidesgo.pptx
 

Databricks Delta Lake and Its Benefits

  • 1. WIFI SSID:Spark+AISummit | Password: UnifiedDataAnalytics
  • 2. Nagaraj Sengodan, HCL Technologies Nitin Raj Soundararajan, Cognizant Worldwide Limited Databricks Delta Lake And Its Benefits #UnifiedDataAnalytics #SparkAISummit
  • 3. 3 1 2 3 4 5 What is Delta lake offering from Databricks and overview The Brief Good news is Delta Lake is open source now! Open Source Why I care about Delta lake? Benefits We do the necessary steps to deliver the result. Modern Warehouse Are we ready for building unified data platform? Conclusion Agenda
  • 4. 4 Who are we ? NITIN RAJ SOUNDARARAJAN Senior Consultant – Data, AI and Analytics Cognizant Worldwide Limited NAGARAJ SENGODAN Senior Manager – Data and Analytics HCL Technologies
  • 5. Common Challenges with Data Lakes 5 5 Data Lake getting Polluted Unsafe Writes Orphan Data No Schema Evolution No Schema Hot path for Streaming ACID Transaction Metadata Handling Unified Batch and Stream Schema Enforcement Time Travel Audit History Full DML Support
  • 6. Typical Databricks Architecture 6 SecurityIntegration DATABRICKS COLLABORATIVE WORKSPACE Apis Jobs Models Notebooks Dashboards DATA ENGINEERS DATA SCIENTISTS DATABRICKS RUNTIME for Big Data for Machine Learning Batch & Streaming Data Lakes & Data Warehouses DATABRICKS CLOUD SERVICE
  • 7. Databricks - Delta Lake Architecture 7 SecurityIntegration DATABRICKS COLLABORATIVE WORKSPACE Apis Jobs Models Notebooks Dashboards DATA ENGINEERS DATA SCIENTISTS DATABRICKS RUNTIME for Big Data for Machine Learning Batch & Streaming Data Lakes & Data Warehouses DATABRICKS CLOUD SERVICE DATABRICKS DELTA
  • 8. The Brief 8 Delta.io – OPEN SOURCE. APR. 2019 Announcing Delta Lake Open Source Project | Ali Ghodsi Delta 0.2 – Cloud storage JUN. 2019 Support for cloud storage (Amazon S3, Azure Blob Storage) and Improved Concurrency (Append-only writes ensuring serializability) Delta 0.3 – Scala/Java API AUG. 2019 Scala Java APIs and DML Commands, Query Commit History and vacuuming old files Delta 0.4 – Python APIs and Convert to Delta OCT. 2019 Python APIs for DML and utility operations, Convert-to-Delta, SQL for Utility Operations
  • 10. Benefits 10 ACID transactions on Spark Scalable metadata handling Unified Batch and Streaming Source Schema enforcement Time travel Audit History Full DML Support
  • 11. ACID Transaction on Spark 11 The Brief Every write is a transaction Serial order for writes recorded in a transaction log 01 Multiple Writes Multiple writes trying to modify the same files don’t happen that often 02 Optimistic concurrency Continuously keep writing to a directory or table and consumers to keep reading from the same directory or table 03 Serializable isolation level
  • 12. Scalable Metadata Handling 12 Metadata information of a table or directory in the transaction log instead of the metastore 01 Metadata in Transaction Log Delta Lake can list files in large directories in constant time 02 Efficient Data Read
  • 13. Unified Batch and Streaming Sink 13 Efficient streaming sink with Apache Spark’s structured streaming01 Streaming Link with ACID transactions and scalable metadata handling, the efficient streaming sink now enables lot of near real-time analytics use cases without having to maintain a complicated streaming and batch pipeline 02 Near Real Time Analytics
  • 14. Schema enforcement 14 Automatically validates the DataFrame’s schema with schema of the table01 Automatic Schema Validation Columns that are present in the table but not in the DataFrame are set to null Exception is thrown when when extra column present in the DataFrame but not in Table 02 Column Validation 03 Serializable isolation level Delta Lake has DDL to explicitly add new columns Ability to update the schema automatically
  • 15. Time Travel and Data Versioning 15 Allows users to read a previous snapshot of the table or directory 01 Snapshots Newer version of the files are created when the files are modified during writes and older versions are preserved 02 Versioning Provide a timestamp or a version number to Apache Spark’s read APIs to read the older version of the table or directory Delta Lake constructs the full snapshot as of that timestamp or version based on the information in the transaction log User can reproduce experiments and reports and also can revert a table to its older versions 03 Timestamp and Transaction Log
  • 16. Record Update and Deletion (Coming Soon) 16 Will support Merge, Update and Delete Easily upsert and delete records in data lakes simplify their change data capture and GDPR use cases 01 Merge Update Delete More efficient than reading and overwriting entire partitions or tables02 File-level granularity
  • 17. Data Expectations (Coming Soon) 17 Will support an API to set expectations on tables or directories01 API to set data expectations Engineers will be able to specify a boolean condition and tune the severity to handle data expectations 02 Severity to handle expectations
  • 18. Modern Data warehouse 18 Ingestion Tables Refined Tables Feature/Agg Data Store Existing Data Lake Azure Data Lake Storage Analytics Machine Learning
  • 19. Conclusion 19 Delta Lake Data Lake Unification High Batch and Stream data set can be processed in same pipeline Medium Stream process require hot pipeline Reliable High As it enforce schema and ACID operations helps data lake more reliable Less Accept all data and late binding leads lot of orphan data Ease of Use Medium Delta require DBA operations like Vacuum and Optimize High No write on schema and accept any data Performance High Z-Order skipping files for efficient read Medium Sequence read
  • 20. Like to know more? 20 https://github.com/KRSNagaraj/SparkSummit2019 https://www.linkedin.com/in/NagarajSengodan https://www.linkedin.com/in/NitinRajS/
  • 21. DON’T FORGET TO RATE AND REVIEW THE SESSIONS SEARCH SPARK + AI SUMMIT