Using Delta Lake to transform a
legacy SparkSQL to support
complex CRUD operations
Lantao Jin
Staff Software Engineer @ eBay
About Me
Lantao Jin is a software engineer at eBay's
Infrastructure Data Platform.
8+ years big data infra development exp.
Focusing on Spark internal optimization and
efficient platform building.
https://www.linkedin.com/in/lantaojin
https://github.com/LantaoJin
https://databricks.com/speaker/lantao-jin
Previous presentation
https://databricks.com/session/managing-
apache-spark-workload-and-automatic-
optimizing
Agenda
1. Background
Our requirements and technical selection
2. Implementations
Cross tables update/delete and insertion
3. Optimizations
10x faster and reduced memory consumption
4. Managements
Auto vacuum and UI
Background
▪ A blocker for offloading the commercial data warehouse to open source
▪ SQL syntax compatible with the commercial product
▪ For example, migrate the ad-hoc workload from the MPP engine to Spark SQL
▪ CRUD is a fundamental requirement in data processing
▪ Developed a FileFormat (Hive ACID)
▪ Used left joins with incremental data to perform updates
▪ Databaselization is a trend in analytic datasets
▪ Google BigQuery
▪ Provides an option with a new approach in many scenarios
Requirements
▪ Fully support the commercial data warehouse SQL syntax
▪ Complex update/delete SQL syntax
▪ Match it in performance
▪ Based on Spark 2.3.0
▪ Legacy Spark
▪ Deliver in a short time
Project Timeline
▪ Started from Nov. 2019 based on Spark 2.3.0 + Delta Lake 0.4.0
▪ Delivered to customers in Mar. 2020
▪ Migration signed off in May 2020
▪ Forward-ported to Spark 3.0.0 + Delta Lake 0.7.0 in Sep. 2020
Usage data as of end of Sep. 2020
▪ Overall, 5x ~ 10x faster than the open-source version in our scenarios
▪ 10+ business units are using Delta tables
▪ 2000+ production tables converted to Delta tables
▪ 3000+ update/delete statements per day
Why we choose Delta Lake?
https://delta.io/
Evaluated in Nov. 2019
What did we do?
▪ Stage 1
▪ From Nov. 2019 To Mar. 2020
▪ Refactor Delta Lake to support Spark 2.3
▪ Cross table update/delete syntax
▪ Auto vacuuming
▪ Rollback and At by SQL
▪ Stage 2
▪ From Apr. 2020 To Jun 2020
▪ Bugfix
▪ Support bucketing join
▪ Performance improvements
▪ Delta table UI
▪ Stage 3
▪ From Jul. 2020 To Oct. 2020
▪ Migrate to Spark 3.0 + Delta Lake 0.7
▪ Reduce memory consumption
▪ Support subquery in WHERE
▪ Support index for Delta
▪ Stage 4
▪ From Nov. 2020
▪ Support Kudu
▪ Runtime Filter
▪ Z-ordering
▪ Native engine
Challenges – Stage 1
▪ SQL hadn’t been supported in Delta Lake 0.4
▪ Delta Lake requires Apache Spark 2.4 and above
▪ Spark 3.0 only supports single table update/delete syntax
▪ Integration with our internal features
Stage 1
Implementations
- Support SQL (Delta Lake 0.4)
- Cross tables update/delete
- Insertion
- SQL based Time Travel
Management
- Auto Vacuuming
Delta Lake 0.4 + Spark 2.3
▪ Added some patches in Spark 2.3
▪ Backported update/delete code to Spark 2.3
▪ Downgraded partial codegen interfaces in Delta Lake 0.4
▪ Rewrote the resolution code with Data Source V1
Cross tables update/delete
Support SQL
▪ Implement it in Catalyst
▪ SparkSessionExtensions
▪ Store essential metadata in HiveMetastore
▪ Rewrite in DataSource V1
Based on Delta Lake 0.4 + Spark 2.3.0
Support SQL
Update internals – parsing
▪ Extend SqlBase.g4 and SparkSqlParser.scala with visitUpdateTable()
▪ Parse the FROM clause and build a cross-table join context
▪ Package the Assignments and the Condition
▪ Generate an UpdateTable node
▪ Inject the resolution rule via SparkSessionExtensions (see the sketch below)
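A minimal sketch of the SparkSessionExtensions hook this relies on, assuming an illustrative class name (CrudSqlExtensions) and a pass-through placeholder rule body; the real rule resolves the UpdateTable node produced by the extended parser.

import org.apache.spark.sql.SparkSessionExtensions
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// Injects an analyzer rule through the standard extension point (available since Spark 2.2).
class CrudSqlExtensions extends (SparkSessionExtensions => Unit) {
  override def apply(ext: SparkSessionExtensions): Unit = {
    ext.injectResolutionRule { session =>
      new Rule[LogicalPlan] {
        // Placeholder body; the real rule rewrites UpdateTable into a runnable command.
        override def apply(plan: LogicalPlan): LogicalPlan = plan
      }
    }
  }
}

// Enabled per session, for example:
//   spark-submit --conf spark.sql.extensions=com.example.CrudSqlExtensions ...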
Update internals – resolution
▪ Resolve the Assignments and Conditions
▪ Assignments foldable && Condition empty?
▪ yes: return a single-table update
▪ no: use the join conditions and the attributes in the assignments to push a projection (ProjectionPushdown) down to the source side
▪ Source side contains a join? yes: infer all conditions that only appear on the source side and push them down
▪ Generate an UpdateWithJoinTable node (see the classification sketch below)
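A hedged sketch of the branch above, using only Spark's Expression API; the real rule produces logical plan nodes (a simple update vs. UpdateWithJoinTable) rather than the illustrative markers used here.

import org.apache.spark.sql.catalyst.expressions.Expression

// Illustrative markers standing in for the logical nodes the rule builds.
sealed trait UpdatePlanKind
case object SimpleUpdate extends UpdatePlanKind    // constant SET values, no condition: no source scan needed
case object UpdateWithJoin extends UpdatePlanKind  // cross-table update, goes through the join path

def classifyUpdate(assignments: Seq[Expression], condition: Option[Expression]): UpdatePlanKind =
  if (assignments.forall(_.foldable) && condition.isEmpty) SimpleUpdate
  else UpdateWithJoin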
Update internals – execution (UpdateWithJoinCommand)
▪ Match the tahoeFileIndex and build an UpdateWithJoinCommand
▪ Get the touched files by an inner join, filtering out the files not correlated with the target table (see the sketch below)
▪ Multiple rows matched? yes: throw an exception; no: continue
▪ Get filesToRewrite and mark them as RemoveFiles
▪ Build a left join plan; mapPartition on the plan: if a row matched, write the output of the source side, otherwise write the output of the target side
▪ Is it a bucket table? yes: repartition the plan
▪ FileFormatWrite.write and mark the new files as AddedFiles
▪ Commit to the transaction log
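A hedged DataFrame-level sketch of the touched-file discovery, assuming hypothetical target/source frames and a join condition; the real command goes through Delta's TahoeFileIndex, but the idea is the same: only files that contain at least one matched row are rewritten.

import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.functions.input_file_name

// Phase 1 of the update-with-join: inner join target and source to find which
// target files hold matched rows; all other files stay untouched.
def touchedFiles(target: DataFrame, source: DataFrame, joinCond: Column)
                (implicit spark: SparkSession): Array[String] = {
  import spark.implicits._
  target
    .withColumn("_touched_file", input_file_name())
    .join(source, joinCond, "inner")
    .select($"_touched_file")
    .distinct()
    .as[String]
    .collect()
}
// Phase 2 then left-joins only these files against the source and rewrites them.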
Support SQL
Insert internals – logical planning
▪ INSERT INTO / INSERT OVERWRITE ...
▪ DataSourceStrategy: case InsertIntoTable
▪ Contains a static partition? yes: add a static-partition projection
▪ Build InsertIntoDataSource with the plan as its child
▪ Get the actualQuery and package it into an InsertIntoDataSourceCommand
Insert internals – physical planning
▪ SparkStrategy (BasicOperators): case InsertIntoDataSource
▪ Is it a bucketed table?
▪ no: InsertIntoDataSourceCommand.run()
▪ yes: generate InsertIntoDataSourceExec and add a HashClusteredDistribution requirement
▪ EnsureRequirements: ensureDistributionAndOrdering()
▪ EnsurePartitionForWriting: add a ShuffleExchangeExec
Insert internals – write path
▪ InsertableRelation.insert()
▪ Inserting a static partition with overwrite? yes: fill replace_where; assemble the predicates and use snapshot.fileForScan to get the deleteFiles, mark them as RemoveFiles
▪ OptimisticTransaction.write: FileFormatWrite.write and mark the new files as AddedFiles
▪ Commit to the transaction log
(The open-source replaceWhere equivalent is sketched below.)
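For reference, open-source Delta Lake exposes the same "overwrite one static partition" behaviour through the DataFrame writer's replaceWhere option; the toy frame, column names, and path below are made up for illustration.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("replace-where-demo").getOrCreate()
import spark.implicits._

// Toy data standing in for one day's partition of a Delta table partitioned by dt.
val df = Seq((1, "2020-09-01"), (2, "2020-09-01")).toDF("id", "dt")

// INSERT OVERWRITE of the static partition dt='2020-09-01' maps onto replaceWhere:
// only files whose rows satisfy the predicate are removed and rewritten.
df.write
  .format("delta")
  .mode("overwrite")
  .option("replaceWhere", "dt = '2020-09-01'")
  .partitionBy("dt")
  .save("/tmp/delta/events")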
Time Travel via SQL
Rollback & At
▪ AT
▪ ROLLBACK
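The AT and ROLLBACK keywords are part of our internal SQL dialect; for comparison, open-source Delta Lake (0.4 and later) exposes time travel through reader options, sketched here with a made-up path and assuming an active SparkSession named spark.

// Read the table as of an older version (or a timestamp) instead of the latest snapshot.
val asOfVersion = spark.read
  .format("delta")
  .option("versionAsOf", 0)            // alternatively: .option("timestampAsOf", "2020-09-01")
  .load("/tmp/delta/events")

asOfVersion.show()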
Architecture background
[Architecture diagram: JDBC/ODBC clients, a load balancer, gateways, tenant YARN queues (Tenant A/B/C), Carmel Spark Thrift Servers with an SSD shuffle cache, Hive Metastore, the Apollo and Hercules (HDD) clusters, Hermes, VDM, Prod DBs, Alation, Tableau, and Zeta.]
▪ 1 BU, 1 queue (YARN)
▪ 1 queue, 1 or N STS(s)
▪ 1 queue (1 STS) is reserved
Auto Vacuuming
Implementation
▪ Every STS uses a listener to asynchronously store Delta metadata to third-party storage on:
o Convert to delta
o Create table using delta
o Rename table
o Alter table
o Drop table
▪ The STS in the reserved queue double checks whether any events were lost
▪ The STS in the reserved queue triggers auto vacuuming (see the sketch below) and attaches the Delta UI
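A minimal sketch (not the production implementation) of the vacuuming sweep the reserved-queue STS could run, assuming the tracked table paths have already been loaded from the external metadata store; DeltaTable.vacuum is the standard Delta Lake API.

import io.delta.tables.DeltaTable
import org.apache.spark.sql.SparkSession

// Sweep every tracked Delta table and physically delete files that are no longer
// referenced by the transaction log and are older than the retention period.
def autoVacuum(spark: SparkSession,
               trackedPaths: Seq[String],
               retentionHours: Double = 168.0): Unit = {
  trackedPaths.foreach { path =>
    try {
      DeltaTable.forPath(spark, path).vacuum(retentionHours)
    } catch {
      case e: Exception =>
        // One broken table must not stop the whole sweep.
        println(s"auto vacuum failed for $path: ${e.getMessage}")
    }
  }
}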
Main contributions
▪ Support cross tables update/delete
▪ Support update/delete with multiple tables join
▪ Support inferring join conditions
Based on Delta Lake 0.4 + Spark 2.3.0
Main contributions
▪ Insertion with static/dynamic partitions
▪ Auto vacuuming
▪ Time travel via SQL
Based on Delta Lake 0.4 + Spark 2.3.0
Challenges – Stage 2
▪ The performance of update/delete in Delta Lake lagged behind the commercial product.
▪ In a long-running Spark Thrift Server, a big query on a Delta table easily causes a Spark driver OOM.
▪ Managing the capacity of Delta tables and the small-files problem.
Stage 2
Optimizations
- Support bucketing join
- Resolving the small files problem automatically
- Rewrite outer join to reduce shuffle
- More FilterPushDown
Management
- Delta UI
Support bucketing join
▪ Store bucketSpec in delta table metadata
▪ requiredChildDistribution returns a HashClusteredDistribution on the bucket columns (see the sketch below)
▪ Example
▪ UPDATE a big table (4.7 TB) joined with a table (200 GB), without AQE
▪ Before: OOM
▪ After: 180s
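A hedged sketch of what "requiredChildDistribution + HashClusteredDistribution" means in practice, using real Spark physical-planning types but an illustrative exec node (InsertIntoDeltaBucketedExec is not a real class); when both join sides are already hash-clustered on the bucket columns, EnsureRequirements can skip the extra shuffle.

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.Attribute
import org.apache.spark.sql.catalyst.plans.physical.{Distribution, HashClusteredDistribution}
import org.apache.spark.sql.execution.{SparkPlan, UnaryExecNode}

// Illustrative write node for a bucketed Delta table: it demands that its input
// be hash-partitioned on the bucket columns taken from the table's bucketSpec.
case class InsertIntoDeltaBucketedExec(
    bucketColumns: Seq[Attribute],  // from the bucketSpec stored in table metadata
    numBuckets: Int,
    child: SparkPlan) extends UnaryExecNode {

  override def output: Seq[Attribute] = Nil

  // EnsureRequirements inserts a ShuffleExchangeExec only if the child's
  // partitioning does not already satisfy this distribution.
  override def requiredChildDistribution: Seq[Distribution] =
    Seq(HashClusteredDistribution(bucketColumns))

  override protected def doExecute(): RDD[InternalRow] =
    child.execute() // each bucket-aligned partition is written as one bucket file
}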
Auto resolving small files problem
▪ Community approach (0.7.0)
▪ Our solution
[Comparison shown as diagrams; the documented open-source compaction pattern is sketched below for reference.]
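For context, the compaction pattern documented for open-source Delta Lake at the time: rewrite one partition into fewer files with dataChange set to false so the rewrite is treated as pure compaction rather than a logical change. Path, partition column, and target file count are made up, and an active SparkSession named spark is assumed.

val tablePath = "/tmp/delta/events"
val targetFileCount = 16

spark.read
  .format("delta")
  .load(tablePath)
  .where("dt = '2020-09-01'")              // compact one partition at a time
  .repartition(targetFileCount)
  .write
  .format("delta")
  .mode("overwrite")
  .option("dataChange", "false")           // rewritten files carry no logical change
  .option("replaceWhere", "dt = '2020-09-01'")
  .save(tablePath)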
Rewrite heavy outer join to reduce shuffle data
▪ In a right outer join, even when some predicates only reference the right side, all rows in filesToRewrite still have to go through the shuffle to perform the join.
▪ We move the right-side-only predicates from the join conditions into filters on the right side, then union the join result with the right side filtered by the negated (anti) predicates.
▪ In our testing and practice, the SMJ becomes 5~10 times faster after this patch, depending on how much data is skipped from the shuffle (see the sketch below).
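A hedged DataFrame sketch of the equivalence being exploited, with illustrative handling of the null columns (the real patch works on the plan inside the update command): a right outer join whose join condition contains a predicate p that only references the right side can be split so that rows failing p never enter the shuffle.

import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.{coalesce, lit, not}

//   left RIGHT OUTER JOIN right ON cond AND p(right)
// is equivalent to
//   (left RIGHT OUTER JOIN (right WHERE p) ON cond)
//   UNION ALL
//   (right WHERE NOT p or p is null, with NULLs for the left-side columns)
def rewriteRightOuter(left: DataFrame, right: DataFrame, cond: Column, p: Column): DataFrame = {
  val joined = left.join(right.where(p), cond, "right_outer")

  val nullLeftCols = left.schema.fields.map(f => lit(null).cast(f.dataType))
  val skipped = right
    .where(not(coalesce(p, lit(false))))                  // p false or null
    .select(nullLeftCols ++ right.columns.map(right(_)): _*)
    .toDF(joined.columns: _*)

  joined.union(skipped)
}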
Delta UI
Our contributions
Based on Delta Lake 0.4 + Spark 2.3.0
▪ Support bucket Delta table
▪ Support bucketing join
▪ Auto resolving small files problem
Our contributions
Based on Delta Lake 0.4 + Spark 2.3.0
▪ Rewrite heavy outer join to reduce shuffle data
▪ https://github.com/delta-io/delta/pull/435
▪ Apply filter pushdown to source rows for the matched-only case of the right outer join
▪ https://github.com/delta-io/delta/pull/438
▪ Apply filter pushdown to the inner join to avoid scanning all rows in Parquet files
▪ https://github.com/delta-io/delta/pull/432
Challenges – Stage 3
▪ Planned to upgrade to Spark 3.0 this year
▪ Subquery statements not supported
▪ File index, materialized view, range partition not supported
▪ Availability & Robustness
Stage 3
Implementations
- Migrate our changes to Spark 3.0 + Delta 0.7
- Support Subquery in WHERE
Optimization
- Reduce memory consumption
Support Subquery in WHERE
Supported:
§ IN
§ EXISTS
§ NOT IN with IS NOT NULL
§ NOT EXISTS
§ Correlated Subquery
§ Nested Subquery
§ Multiple subqueries combined conjunctively (AND)
Unsupported:
§ NOT IN without IS NOT NULL
§ Scalar Subquery
§ Multiple subqueries combined disjunctively (OR)
Support Subquery in WHERE
UPDATE target t
SET t.b = 0
WHERE t.a IN
(SELECT s.a FROM source s WHERE s.a % 2 = 0)
Support Subquery in WHERE
UPDATE target t
SET t.b = 0
WHERE NOT EXISTS
(SELECT * FROM source s WHERE t.a = s.a AND
s.a % 2 = 0)
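Under the hood, subqueries like the two above can be folded into the existing cross-table update machinery by turning the WHERE subquery into a join that identifies the touched rows; a hedged spark.sql illustration of the equivalent row selection (semi join for IN, anti join for NOT EXISTS), using the table names from the slides and assuming an active SparkSession named spark.

// Rows the IN-subquery UPDATE would touch:
val touchedByIn = spark.sql("""
  SELECT t.*
  FROM target t
  LEFT SEMI JOIN source s
    ON t.a = s.a AND s.a % 2 = 0
""")

// Rows the NOT EXISTS UPDATE would touch:
val touchedByNotExists = spark.sql("""
  SELECT t.*
  FROM target t
  LEFT ANTI JOIN source s
    ON t.a = s.a AND s.a % 2 = 0
""")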
Our contributions
Based on Delta Lake 0.7 + Spark 3.0.0
▪ Migrate all changes and improvements to the latest version
▪ Support Subquery in Where
▪ Reduce memory consumption in Driver
▪ [SPARK-32994][CORE] Update external heavy accumulators before they enter the listener event loop
▪ Skip schema inference and merging when the table schema can be read from the catalog
▪ Fallback to simple update if all SET statements are foldable and no join
[Performance comparison chart: commercial data warehouse vs. Stage 2 vs. Stage 3]
PS: Pulse is our regular release
Future work
Future work – Stage 4
▪ Range partition for delta (WIP)
▪ File index for delta (WIP)
▪ Runtime Filter Join optimization (WIP)
▪ Support Kudu (WIP)
▪ Z-ordering
▪ Native engine
Feedback
Your feedback is important to us.
Don’t forget to rate
and review the sessions.