SlideShare a Scribd company logo
Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC pipeline with Spark
Streaming SQL and Delta Lake
Jun Song@windpiger
songjun.sj@alibaba-inc.com
June 24, 2020
About Me
• Staff Engineer from Alibaba Cloud E-MapReduce Product Team
• Spark contributor focused on SparkSQL
• HiveOnDelta contributor(https://github.com/delta-io/connectors)
Agenda
• What is CDC
• CDC solution using Spark
Streaming SQL & Delta Lake
• Future Works
What is CDC
Change Data Capture
Target
Storage
AdHoc/
ETL
Collect Change Sets & Merge
Analytics
CDC
OLTP database
Data Warehouse
Data Lake
…
▪ load pressure on source database
▪ high latency batch job(hourly/daily/…)
▪ can not handle delete rows
▪ can not handle schema change
DrawbacksUsing Sqoop(Batch Mode)
Change Data Capture
sqoop
--incremental lastmodified
--last-value '2028/01/01 13:00:00’
...
sqoop merge
--new-data newer
--onto older
--merge-key id
…
▪ heavy servers & operational
support(Kudu&HBase)
▪ HBase can not support high-throughput
analytics
▪ complex merge implement by java/scala code
▪ can not handle schema change
DrawbacksUsing binlog(Streaming Mode)
Change Data Capture
binlog
scala/java
CDC solution using Spark Streaming SQL & Delta
Spark Streaming SQL
SparkCore
SparkSQL
Structured Streaming
Spark Streaming SQL
https://www.alibabacloud.com/help/doc-detail/124684.htm
SQL is a Standard Declarative Language, which can simplify real-time analytics
▪ DDL
CREATE TABLE、CREATE TABLE AS SELECT、CREATE SCAN、CREATE STREAM
▪ DML
INSERT INTO、MERGE INTO
▪ SELECT
SELECT FROM、WHERE、GROUP BY 、JOIN、UNION ALL
▪ UDF
TUMBLING、HOPPING、DELAY、SparkSQL UDF
▪ Data Source
Delta、Kafka、HBase、JDBC、Druid、Redis、Kudu、Alibaba Cloud(Loghub、Tablestore、DataHub)
Design Doc: https://docs.google.com/document/d/19degwnIIcuMSELv6BQ_1VQI5AIVcvGeqOm5xE2-aRA0/edit
Spark Streaming SQL
CREATE TABLE kafka_test
USING kafka
OPTIONS(
kafka.bootstrap.servers='',
subscribe='test')
batch
streaming
CREATE SCAN kafka_test_batch_scan
ON kafka_test
USING batch
CREATE SCAN kafka_test_batch_scan
ON kafka_test
USING stream
OPTIONS(
maxOffsetsPerTrigger='100000'
)
SELECT count(*)
FROM kafka_test_batch_scan
?
CREATE SCAN
CREATE SCAN tbName_alias
ON tbName
USING queryType
OPTIONS (propertyName=propertyValue[,propertyName=propertyValue]*)
Spark Streaming SQL
CREATE SCAN kafka_test_batch_scan
ON kafka_test
USING stream
OPTIONS(
maxOffsetsPerTrigger='100000'
)
CREATE STREAM
CREATE STREAM kafka_test_stream_job
OPTIONS(
checkpointLocation='/tmp/spark',
outputMode='Append',
triggerType='ProcessingTime'
triggerIntervalMs='3000')
INSERT INTO target_tbl
SELECT * FROM kafka_test_batch_scan
WHERE units > 1000;
CREATE STREAM queryName
OPTIONS (propertyName=propertyValue[,propertyName=propertyValue]*)
INSERT INTO tbName
queryStatement;
Spark Streaming SQL
MERGE INTO
mergeInto
: MERGE INTO target=tableIdentifier tableAlias
USING (source=tableIdentifier (timeTravel)? | '(' subquery = query ')') tableAlias
mergeCondition?
matchedClauses*
notMatchedClause?
MERGE INTO target_table t
USING source_table s
ON s.id = t.id
WHEN MATCHED AND s.opType = 'delete' THEN DELETE
WHEN MATCHED AND s.opTye = 'update' THEN UPDATE SET id = s.id, name = s.name
WHEN NOT MATCHED AND s.opType = 'insert' THEN INSERT (key, value) VALUES (key, value)
Spark Streaming SQL
DELAY / TUMBLING / HOPPING
WHERE delay(colName) < 'duration'withWatermark("colName", "duration")
SELECT avg(inv_quantity_on_hand) qoh
FROM kafka_inventory
WHERE delay(inv_data_time) < '2 minutes'
GROUP BY TUMBLING (inv_data_time, interval 1 minute)
Delta Lake
Streaming/Batch
Streaming/Batch
ACID Transactions
Metadata Management
Unified Batch&Streaming Schema Enforcement&Evolution
Update&Delete&Merge
Time Travel
Parquet
Key Feature
Delta Lake
Improvement
Delta Lake
SparkSQL Spark Streaming SQL
Update/Delete/
Optimize/Vacuum
DDL/DML
Hive/
Presto
HDFS Alibaba OSS S3 …
Delta Lake
Improvement
CREATE EXTERNAL TABLE delta_tbl(a string, b int)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT 'io.delta.hive.DeltaInputFormat'
OUTPUTFORMAT 'io.delta.hive.DeltaOutputFormat'
LOCATION 'oss://testbucket/delta/events'
Spark Streaming SQL
binlog
▪ no extra operational support for delta
▪ no load pressure on source database
▪ merge implement easily by SQL
▪ realtime low-latency(minute)
CDC solution using Spark Streaming SQL & Delta Lake
Spark Streaming SQL
binlog
CREATE SCAN cdctest_incremental_scan ON kafka_cdctest
USING STREAM
OPTIONS(
startingOffsets='earliest',
maxOffsetsPerTrigger='100000',
failOnDataLoss=false
);
CREATE STREAM cdctest_job
OPTIONS(checkpointLocation='/delta/cdctest_checkpoint_oss')
MERGE INTO delta_cdctest_oss as target
USING (
SELECT
// binlog parser
…
FROM cdctest_incremental_scan
)
ON target.id = source.before_id
WHEN MATCHED AND source.recordType='UPDATE' THEN
UPDATE SET …
WHEN MATCHED AND source.recordType='DELETE' THEN
DELETE
WHEN NOT MATCHED AND source.recordType='INSERT' THEN
INSERT …
CREATE TABLE kafka_cdctest USING KAFKA …
CREATE TABLE delta_cdctest_oss USING DELTA …
streaming-sql --master yarn --use-emr-datasource -f cdc_oss.sql
1
2
3
4
Delta Table
batch-batch- batch- …batch-
DeltaTable
.merge
DeltaTable
.merge
DeltaTable
.merge
DeltaTable
.merge
Spark Streaming SQL MERGE INTO
CDC solution using Spark Streaming SQL & Delta Lake
Long Running Stability Improvement
How to handle small files?
▪ increase batch interval(minutes)
▪ compcation(change data layout, not change data)
▪ adaptive execution mode
Long Running Stability Improvement
How to handle small files? - Scheduled Compaction
Delta Table
batch-batch- batch- …
batch-
DeltaTable
.merge
DeltaTable
.merge
DeltaTable
.merge
DeltaTable
.merge
CompactionCompaction
scheduled job hourly/daily/…
OPTIMIZE <tbl> [WHERE where_clause]
Long Running Stability Improvement
Problem: Streaming Job Failed when doing a compaction
batch-
Delta Table
Compaction do transaction commit
merge
Transaction conflict check
read
How to handle small files? - Scheduled Compaction
Long Running Stability Improvement
Problem: Streaming Job Failed when doing a compaction
binlog in batch
streaming job
status
Improvement
only insert Succeed Fix one bug: https://github.com/delta-io/delta/issues/326
including
delete/update Failed streaming job retry this batch
How to handle small files? - Scheduled Compaction
Long Running Stability Improvement
How to handle small files? - Auto Compaction
Delta Table
batch-batch- …batch-
merge
compaction compaction
merge merge
sequential execution, no conflict
select files which file_size
> COMPACT_FILE_SIZE
flies
number >
TRIGGER_FILE_COU
NT
do compaction
Yes
No
continue streaming
Strategy
Long Running Stability Improvement
How to handle small files? - Adaptive Execution
batch
binlog
target
delta
target
changed files
Join
rewrite
all changed files
Join
spark.sql.adaptive.enabled -> true
Adaptive Excetion can auto merge small partitions,
to decrease the number of reducers, then decrease
the number of output files.
Long Running Stability Improvement
Performance issue
batch-batch- batch- …batch-
size of target delta table
merge performance
Runtime Filter https://issues.apache.org/jira/browse/SPARK-27227
batch
binlog
target
delta
target
changed files
Join
filter
Future Works
Future
▪ auto schema change detected
▪ long running stable performance(read on merge)
▪ simplify user experience by SYNC grammar
SYNC kafka_binlog_tbl
TO delta_tbl
OPTIONS(
type='debezium.mysql'
)
Feedback
Your feedback is important to us.
Don’t forget to rate and
review the sessions.

More Related Content

What's hot

Delta lake and the delta architecture
Delta lake and the delta architectureDelta lake and the delta architecture
Delta lake and the delta architecture
Adam Doyle
 
Apache Hudi: The Path Forward
Apache Hudi: The Path ForwardApache Hudi: The Path Forward
Apache Hudi: The Path Forward
Alluxio, Inc.
 
Apache Ranger
Apache RangerApache Ranger
Apache Ranger
Rommel Garcia
 
Building an open data platform with apache iceberg
Building an open data platform with apache icebergBuilding an open data platform with apache iceberg
Building an open data platform with apache iceberg
Alluxio, Inc.
 
3D: DBT using Databricks and Delta
3D: DBT using Databricks and Delta3D: DBT using Databricks and Delta
3D: DBT using Databricks and Delta
Databricks
 
Getting Started with Confluent Schema Registry
Getting Started with Confluent Schema RegistryGetting Started with Confluent Schema Registry
Getting Started with Confluent Schema Registry
confluent
 
Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)
Ryan Blue
 
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudAmazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Noritaka Sekiyama
 
Iceberg: a fast table format for S3
Iceberg: a fast table format for S3Iceberg: a fast table format for S3
Iceberg: a fast table format for S3
DataWorks Summit
 
SF Big Analytics 20190612: Building highly efficient data lakes using Apache ...
SF Big Analytics 20190612: Building highly efficient data lakes using Apache ...SF Big Analytics 20190612: Building highly efficient data lakes using Apache ...
SF Big Analytics 20190612: Building highly efficient data lakes using Apache ...
Chester Chen
 
Apache Arrow Flight: A New Gold Standard for Data Transport
Apache Arrow Flight: A New Gold Standard for Data TransportApache Arrow Flight: A New Gold Standard for Data Transport
Apache Arrow Flight: A New Gold Standard for Data Transport
Wes McKinney
 
Using Apache Arrow, Calcite, and Parquet to Build a Relational Cache
Using Apache Arrow, Calcite, and Parquet to Build a Relational CacheUsing Apache Arrow, Calcite, and Parquet to Build a Relational Cache
Using Apache Arrow, Calcite, and Parquet to Build a Relational Cache
Dremio Corporation
 
Delta Lake OSS: Create reliable and performant Data Lake by Quentin Ambard
Delta Lake OSS: Create reliable and performant Data Lake by Quentin AmbardDelta Lake OSS: Create reliable and performant Data Lake by Quentin Ambard
Delta Lake OSS: Create reliable and performant Data Lake by Quentin Ambard
Paris Data Engineers !
 
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
HostedbyConfluent
 
Achieving Lakehouse Models with Spark 3.0
Achieving Lakehouse Models with Spark 3.0Achieving Lakehouse Models with Spark 3.0
Achieving Lakehouse Models with Spark 3.0
Databricks
 
Databricks Delta Lake and Its Benefits
Databricks Delta Lake and Its BenefitsDatabricks Delta Lake and Its Benefits
Databricks Delta Lake and Its Benefits
Databricks
 
Stream processing using Kafka
Stream processing using KafkaStream processing using Kafka
Stream processing using Kafka
Knoldus Inc.
 
Large Scale Lakehouse Implementation Using Structured Streaming
Large Scale Lakehouse Implementation Using Structured StreamingLarge Scale Lakehouse Implementation Using Structured Streaming
Large Scale Lakehouse Implementation Using Structured Streaming
Databricks
 
Introduction SQL Analytics on Lakehouse Architecture
Introduction SQL Analytics on Lakehouse ArchitectureIntroduction SQL Analytics on Lakehouse Architecture
Introduction SQL Analytics on Lakehouse Architecture
Databricks
 
Intro to Delta Lake
Intro to Delta LakeIntro to Delta Lake
Intro to Delta Lake
Databricks
 

What's hot (20)

Delta lake and the delta architecture
Delta lake and the delta architectureDelta lake and the delta architecture
Delta lake and the delta architecture
 
Apache Hudi: The Path Forward
Apache Hudi: The Path ForwardApache Hudi: The Path Forward
Apache Hudi: The Path Forward
 
Apache Ranger
Apache RangerApache Ranger
Apache Ranger
 
Building an open data platform with apache iceberg
Building an open data platform with apache icebergBuilding an open data platform with apache iceberg
Building an open data platform with apache iceberg
 
3D: DBT using Databricks and Delta
3D: DBT using Databricks and Delta3D: DBT using Databricks and Delta
3D: DBT using Databricks and Delta
 
Getting Started with Confluent Schema Registry
Getting Started with Confluent Schema RegistryGetting Started with Confluent Schema Registry
Getting Started with Confluent Schema Registry
 
Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)
 
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudAmazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
 
Iceberg: a fast table format for S3
Iceberg: a fast table format for S3Iceberg: a fast table format for S3
Iceberg: a fast table format for S3
 
SF Big Analytics 20190612: Building highly efficient data lakes using Apache ...
SF Big Analytics 20190612: Building highly efficient data lakes using Apache ...SF Big Analytics 20190612: Building highly efficient data lakes using Apache ...
SF Big Analytics 20190612: Building highly efficient data lakes using Apache ...
 
Apache Arrow Flight: A New Gold Standard for Data Transport
Apache Arrow Flight: A New Gold Standard for Data TransportApache Arrow Flight: A New Gold Standard for Data Transport
Apache Arrow Flight: A New Gold Standard for Data Transport
 
Using Apache Arrow, Calcite, and Parquet to Build a Relational Cache
Using Apache Arrow, Calcite, and Parquet to Build a Relational CacheUsing Apache Arrow, Calcite, and Parquet to Build a Relational Cache
Using Apache Arrow, Calcite, and Parquet to Build a Relational Cache
 
Delta Lake OSS: Create reliable and performant Data Lake by Quentin Ambard
Delta Lake OSS: Create reliable and performant Data Lake by Quentin AmbardDelta Lake OSS: Create reliable and performant Data Lake by Quentin Ambard
Delta Lake OSS: Create reliable and performant Data Lake by Quentin Ambard
 
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
 
Achieving Lakehouse Models with Spark 3.0
Achieving Lakehouse Models with Spark 3.0Achieving Lakehouse Models with Spark 3.0
Achieving Lakehouse Models with Spark 3.0
 
Databricks Delta Lake and Its Benefits
Databricks Delta Lake and Its BenefitsDatabricks Delta Lake and Its Benefits
Databricks Delta Lake and Its Benefits
 
Stream processing using Kafka
Stream processing using KafkaStream processing using Kafka
Stream processing using Kafka
 
Large Scale Lakehouse Implementation Using Structured Streaming
Large Scale Lakehouse Implementation Using Structured StreamingLarge Scale Lakehouse Implementation Using Structured Streaming
Large Scale Lakehouse Implementation Using Structured Streaming
 
Introduction SQL Analytics on Lakehouse Architecture
Introduction SQL Analytics on Lakehouse ArchitectureIntroduction SQL Analytics on Lakehouse Architecture
Introduction SQL Analytics on Lakehouse Architecture
 
Intro to Delta Lake
Intro to Delta LakeIntro to Delta Lake
Intro to Delta Lake
 

Similar to Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake

Self-serve analytics journey at Celtra: Snowflake, Spark, and Databricks
Self-serve analytics journey at Celtra: Snowflake, Spark, and DatabricksSelf-serve analytics journey at Celtra: Snowflake, Spark, and Databricks
Self-serve analytics journey at Celtra: Snowflake, Spark, and Databricks
Grega Kespret
 
In-memory ColumnStore Index
In-memory ColumnStore IndexIn-memory ColumnStore Index
In-memory ColumnStore Index
SolidQ
 
Copy Data Management for the DBA
Copy Data Management for the DBACopy Data Management for the DBA
Copy Data Management for the DBA
Kellyn Pot'Vin-Gorman
 
iceberg introduction.pptx
iceberg introduction.pptxiceberg introduction.pptx
iceberg introduction.pptx
Dori Waldman
 
Pulsar in the Lakehouse: Overview of Apache Pulsar and Delta Lake Connector -...
Pulsar in the Lakehouse: Overview of Apache Pulsar and Delta Lake Connector -...Pulsar in the Lakehouse: Overview of Apache Pulsar and Delta Lake Connector -...
Pulsar in the Lakehouse: Overview of Apache Pulsar and Delta Lake Connector -...
StreamNative
 
High Throughput Analytics with Cassandra & Azure
High Throughput Analytics with Cassandra & AzureHigh Throughput Analytics with Cassandra & Azure
High Throughput Analytics with Cassandra & Azure
DataStax Academy
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
Databricks
 
Denver SQL Saturday The Next Frontier
Denver SQL Saturday The Next FrontierDenver SQL Saturday The Next Frontier
Denver SQL Saturday The Next Frontier
Kellyn Pot'Vin-Gorman
 
Collaborate 2009 - Migrating a Data Warehouse from Microsoft SQL Server to Or...
Collaborate 2009 - Migrating a Data Warehouse from Microsoft SQL Server to Or...Collaborate 2009 - Migrating a Data Warehouse from Microsoft SQL Server to Or...
Collaborate 2009 - Migrating a Data Warehouse from Microsoft SQL Server to Or...
djkucera
 
PayPal merchant ecosystem using Apache Spark, Hive, Druid, and HBase
PayPal merchant ecosystem using Apache Spark, Hive, Druid, and HBase PayPal merchant ecosystem using Apache Spark, Hive, Druid, and HBase
PayPal merchant ecosystem using Apache Spark, Hive, Druid, and HBase
DataWorks Summit
 
OOW13 Exadata and ODI with Parallel
OOW13 Exadata and ODI with ParallelOOW13 Exadata and ODI with Parallel
OOW13 Exadata and ODI with Parallel
Kellyn Pot'Vin-Gorman
 
ETL and pivoting in spark
ETL and pivoting in sparkETL and pivoting in spark
ETL and pivoting in spark
Subhasish Guha
 
ETL and pivoting in spark
ETL and pivoting in sparkETL and pivoting in spark
ETL and pivoting in spark
Subhasish Guha
 
Manchester Hadoop Meetup: Spark Cassandra Integration
Manchester Hadoop Meetup: Spark Cassandra IntegrationManchester Hadoop Meetup: Spark Cassandra Integration
Manchester Hadoop Meetup: Spark Cassandra Integration
Christopher Batey
 
It Depends
It DependsIt Depends
It Depends
Maggie Pint
 
Operating and Supporting Delta Lake in Production
Operating and Supporting Delta Lake in ProductionOperating and Supporting Delta Lake in Production
Operating and Supporting Delta Lake in Production
Databricks
 
Building a High-Performance Database with Scala, Akka, and Spark
Building a High-Performance Database with Scala, Akka, and SparkBuilding a High-Performance Database with Scala, Akka, and Spark
Building a High-Performance Database with Scala, Akka, and Spark
Evan Chan
 
Autonomous Transaction Processing (ATP): In Heavy Traffic, Why Drive Stick?
Autonomous Transaction Processing (ATP): In Heavy Traffic, Why Drive Stick?Autonomous Transaction Processing (ATP): In Heavy Traffic, Why Drive Stick?
Autonomous Transaction Processing (ATP): In Heavy Traffic, Why Drive Stick?
Jim Czuprynski
 
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Databricks
 
User Defined Partitioning on PlazmaDB
User Defined Partitioning on PlazmaDBUser Defined Partitioning on PlazmaDB
User Defined Partitioning on PlazmaDB
Kai Sasaki
 

Similar to Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake (20)

Self-serve analytics journey at Celtra: Snowflake, Spark, and Databricks
Self-serve analytics journey at Celtra: Snowflake, Spark, and DatabricksSelf-serve analytics journey at Celtra: Snowflake, Spark, and Databricks
Self-serve analytics journey at Celtra: Snowflake, Spark, and Databricks
 
In-memory ColumnStore Index
In-memory ColumnStore IndexIn-memory ColumnStore Index
In-memory ColumnStore Index
 
Copy Data Management for the DBA
Copy Data Management for the DBACopy Data Management for the DBA
Copy Data Management for the DBA
 
iceberg introduction.pptx
iceberg introduction.pptxiceberg introduction.pptx
iceberg introduction.pptx
 
Pulsar in the Lakehouse: Overview of Apache Pulsar and Delta Lake Connector -...
Pulsar in the Lakehouse: Overview of Apache Pulsar and Delta Lake Connector -...Pulsar in the Lakehouse: Overview of Apache Pulsar and Delta Lake Connector -...
Pulsar in the Lakehouse: Overview of Apache Pulsar and Delta Lake Connector -...
 
High Throughput Analytics with Cassandra & Azure
High Throughput Analytics with Cassandra & AzureHigh Throughput Analytics with Cassandra & Azure
High Throughput Analytics with Cassandra & Azure
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 
Denver SQL Saturday The Next Frontier
Denver SQL Saturday The Next FrontierDenver SQL Saturday The Next Frontier
Denver SQL Saturday The Next Frontier
 
Collaborate 2009 - Migrating a Data Warehouse from Microsoft SQL Server to Or...
Collaborate 2009 - Migrating a Data Warehouse from Microsoft SQL Server to Or...Collaborate 2009 - Migrating a Data Warehouse from Microsoft SQL Server to Or...
Collaborate 2009 - Migrating a Data Warehouse from Microsoft SQL Server to Or...
 
PayPal merchant ecosystem using Apache Spark, Hive, Druid, and HBase
PayPal merchant ecosystem using Apache Spark, Hive, Druid, and HBase PayPal merchant ecosystem using Apache Spark, Hive, Druid, and HBase
PayPal merchant ecosystem using Apache Spark, Hive, Druid, and HBase
 
OOW13 Exadata and ODI with Parallel
OOW13 Exadata and ODI with ParallelOOW13 Exadata and ODI with Parallel
OOW13 Exadata and ODI with Parallel
 
ETL and pivoting in spark
ETL and pivoting in sparkETL and pivoting in spark
ETL and pivoting in spark
 
ETL and pivoting in spark
ETL and pivoting in sparkETL and pivoting in spark
ETL and pivoting in spark
 
Manchester Hadoop Meetup: Spark Cassandra Integration
Manchester Hadoop Meetup: Spark Cassandra IntegrationManchester Hadoop Meetup: Spark Cassandra Integration
Manchester Hadoop Meetup: Spark Cassandra Integration
 
It Depends
It DependsIt Depends
It Depends
 
Operating and Supporting Delta Lake in Production
Operating and Supporting Delta Lake in ProductionOperating and Supporting Delta Lake in Production
Operating and Supporting Delta Lake in Production
 
Building a High-Performance Database with Scala, Akka, and Spark
Building a High-Performance Database with Scala, Akka, and SparkBuilding a High-Performance Database with Scala, Akka, and Spark
Building a High-Performance Database with Scala, Akka, and Spark
 
Autonomous Transaction Processing (ATP): In Heavy Traffic, Why Drive Stick?
Autonomous Transaction Processing (ATP): In Heavy Traffic, Why Drive Stick?Autonomous Transaction Processing (ATP): In Heavy Traffic, Why Drive Stick?
Autonomous Transaction Processing (ATP): In Heavy Traffic, Why Drive Stick?
 
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
 
User Defined Partitioning on PlazmaDB
User Defined Partitioning on PlazmaDBUser Defined Partitioning on PlazmaDB
User Defined Partitioning on PlazmaDB
 

More from Databricks

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
Databricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
Databricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
Databricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
Databricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
Databricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
Databricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
Databricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Databricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
Databricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Databricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
Databricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Databricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
Databricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
Databricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
Databricks
 
Machine Learning CI/CD for Email Attack Detection
Machine Learning CI/CD for Email Attack DetectionMachine Learning CI/CD for Email Attack Detection
Machine Learning CI/CD for Email Attack Detection
Databricks
 

More from Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Machine Learning CI/CD for Email Attack Detection
Machine Learning CI/CD for Email Attack DetectionMachine Learning CI/CD for Email Attack Detection
Machine Learning CI/CD for Email Attack Detection
 

Recently uploaded

Female Girls Call Noida 🎈🔥9873940964 🔥💋🎈 Provide Best And Top Girl Service An...
Female Girls Call Noida 🎈🔥9873940964 🔥💋🎈 Provide Best And Top Girl Service An...Female Girls Call Noida 🎈🔥9873940964 🔥💋🎈 Provide Best And Top Girl Service An...
Female Girls Call Noida 🎈🔥9873940964 🔥💋🎈 Provide Best And Top Girl Service An...
sheetal singh$A17
 
Celebrity Girls Call Noida 9873940964 Unlimited Short Providing Girls Service...
Celebrity Girls Call Noida 9873940964 Unlimited Short Providing Girls Service...Celebrity Girls Call Noida 9873940964 Unlimited Short Providing Girls Service...
Celebrity Girls Call Noida 9873940964 Unlimited Short Providing Girls Service...
ginni singh$A17
 
Noida Girls Call Noida 9873940964 Unlimited Short Providing Girls Service Ava...
Noida Girls Call Noida 9873940964 Unlimited Short Providing Girls Service Ava...Noida Girls Call Noida 9873940964 Unlimited Short Providing Girls Service Ava...
Noida Girls Call Noida 9873940964 Unlimited Short Providing Girls Service Ava...
kinni singh$A17
 
Exclusive Girls Call Noida 🎈🔥9873940964 🔥💋🎈 Provide Best And Top Girl Service...
Exclusive Girls Call Noida 🎈🔥9873940964 🔥💋🎈 Provide Best And Top Girl Service...Exclusive Girls Call Noida 🎈🔥9873940964 🔥💋🎈 Provide Best And Top Girl Service...
Exclusive Girls Call Noida 🎈🔥9873940964 🔥💋🎈 Provide Best And Top Girl Service...
sheetal singh$A17
 
VIP Kanpur Girls Call Kanpur 0X0000000X Doorstep High-Profile Girl Service Ca...
VIP Kanpur Girls Call Kanpur 0X0000000X Doorstep High-Profile Girl Service Ca...VIP Kanpur Girls Call Kanpur 0X0000000X Doorstep High-Profile Girl Service Ca...
VIP Kanpur Girls Call Kanpur 0X0000000X Doorstep High-Profile Girl Service Ca...
satpalsheravatmumbai
 
社内勉強会資料_TransNeXt: Robust Foveal Visual Perception for Vision Transformers
社内勉強会資料_TransNeXt: Robust Foveal Visual Perception for Vision Transformers社内勉強会資料_TransNeXt: Robust Foveal Visual Perception for Vision Transformers
社内勉強会資料_TransNeXt: Robust Foveal Visual Perception for Vision Transformers
NABLAS株式会社
 
Busty Girls Call Delhi 🎈🔥9711199171 🔥💋🎈 Provide Best And Top Girl Service And...
Busty Girls Call Delhi 🎈🔥9711199171 🔥💋🎈 Provide Best And Top Girl Service And...Busty Girls Call Delhi 🎈🔥9711199171 🔥💋🎈 Provide Best And Top Girl Service And...
Busty Girls Call Delhi 🎈🔥9711199171 🔥💋🎈 Provide Best And Top Girl Service And...
tanupasswan6
 
Potential Uses of the Floyd-Warshall Algorithm as appropriate
Potential Uses of the Floyd-Warshall Algorithm as appropriatePotential Uses of the Floyd-Warshall Algorithm as appropriate
Potential Uses of the Floyd-Warshall Algorithm as appropriate
huseindihon
 
DataScienceConcept_Kanchana_Weerasinghe.pptx
DataScienceConcept_Kanchana_Weerasinghe.pptxDataScienceConcept_Kanchana_Weerasinghe.pptx
DataScienceConcept_Kanchana_Weerasinghe.pptx
Kanchana Weerasinghe
 
FINAL PROJECT WORK PORTFOLIO MANAGEMENT (2) hhh (1) (2) (5) (1) (1).pdf
FINAL PROJECT WORK PORTFOLIO MANAGEMENT (2)  hhh (1) (2) (5) (1) (1).pdfFINAL PROJECT WORK PORTFOLIO MANAGEMENT (2)  hhh (1) (2) (5) (1) (1).pdf
FINAL PROJECT WORK PORTFOLIO MANAGEMENT (2) hhh (1) (2) (5) (1) (1).pdf
bala krishna
 
Celebrity Girls Call Andheri 9930245274 Unlimited Short Providing Girls Servi...
Celebrity Girls Call Andheri 9930245274 Unlimited Short Providing Girls Servi...Celebrity Girls Call Andheri 9930245274 Unlimited Short Providing Girls Servi...
Celebrity Girls Call Andheri 9930245274 Unlimited Short Providing Girls Servi...
revolutionary575
 
Harnessing Wild and Untamed (Publicly Available) Data for the Cost efficient ...
Harnessing Wild and Untamed (Publicly Available) Data for the Cost efficient ...Harnessing Wild and Untamed (Publicly Available) Data for the Cost efficient ...
Harnessing Wild and Untamed (Publicly Available) Data for the Cost efficient ...
weiwchu
 
AWS re:Invent 2023 - Deep dive into Amazon Aurora and its innovations DAT408
AWS re:Invent 2023 - Deep dive into Amazon Aurora and its innovations DAT408AWS re:Invent 2023 - Deep dive into Amazon Aurora and its innovations DAT408
AWS re:Invent 2023 - Deep dive into Amazon Aurora and its innovations DAT408
Grant McAlister
 
PyData Sofia May 2024 - Intro to Apache Arrow
PyData Sofia May 2024 - Intro to Apache ArrowPyData Sofia May 2024 - Intro to Apache Arrow
PyData Sofia May 2024 - Intro to Apache Arrow
Uwe Korn
 
DU degree offer diploma Transcript
DU degree offer diploma TranscriptDU degree offer diploma Transcript
DU degree offer diploma Transcript
uapta
 
Why_are_we_hypnotizing_ourselves-_ATeggin-1.pdf
Why_are_we_hypnotizing_ourselves-_ATeggin-1.pdfWhy_are_we_hypnotizing_ourselves-_ATeggin-1.pdf
Why_are_we_hypnotizing_ourselves-_ATeggin-1.pdf
Alexander Teggin
 
New Girls Call Delhi 🎈🔥9711199171 🔥💋🎈 Provide Best And Top Girl Service And N...
New Girls Call Delhi 🎈🔥9711199171 🔥💋🎈 Provide Best And Top Girl Service And N...New Girls Call Delhi 🎈🔥9711199171 🔥💋🎈 Provide Best And Top Girl Service And N...
New Girls Call Delhi 🎈🔥9711199171 🔥💋🎈 Provide Best And Top Girl Service And N...
tanupasswan6
 
Mumbai Girls Call Mumbai 🛵🚡9910780858 💃 Choose Best And Top Girl Service And ...
Mumbai Girls Call Mumbai 🛵🚡9910780858 💃 Choose Best And Top Girl Service And ...Mumbai Girls Call Mumbai 🛵🚡9910780858 💃 Choose Best And Top Girl Service And ...
Mumbai Girls Call Mumbai 🛵🚡9910780858 💃 Choose Best And Top Girl Service And ...
norina2645
 
Oracle Database Desupported Features on 23ai (Part B)
Oracle Database Desupported Features on 23ai (Part B)Oracle Database Desupported Features on 23ai (Part B)
Oracle Database Desupported Features on 23ai (Part B)
Alireza Kamrani
 
Celebrity Girls Call Noida 9873940964 Unlimited Short Providing Girls Service...
Celebrity Girls Call Noida 9873940964 Unlimited Short Providing Girls Service...Celebrity Girls Call Noida 9873940964 Unlimited Short Providing Girls Service...
Celebrity Girls Call Noida 9873940964 Unlimited Short Providing Girls Service...
ginni singh$A17
 

Recently uploaded (20)

Female Girls Call Noida 🎈🔥9873940964 🔥💋🎈 Provide Best And Top Girl Service An...
Female Girls Call Noida 🎈🔥9873940964 🔥💋🎈 Provide Best And Top Girl Service An...Female Girls Call Noida 🎈🔥9873940964 🔥💋🎈 Provide Best And Top Girl Service An...
Female Girls Call Noida 🎈🔥9873940964 🔥💋🎈 Provide Best And Top Girl Service An...
 
Celebrity Girls Call Noida 9873940964 Unlimited Short Providing Girls Service...
Celebrity Girls Call Noida 9873940964 Unlimited Short Providing Girls Service...Celebrity Girls Call Noida 9873940964 Unlimited Short Providing Girls Service...
Celebrity Girls Call Noida 9873940964 Unlimited Short Providing Girls Service...
 
Noida Girls Call Noida 9873940964 Unlimited Short Providing Girls Service Ava...
Noida Girls Call Noida 9873940964 Unlimited Short Providing Girls Service Ava...Noida Girls Call Noida 9873940964 Unlimited Short Providing Girls Service Ava...
Noida Girls Call Noida 9873940964 Unlimited Short Providing Girls Service Ava...
 
Exclusive Girls Call Noida 🎈🔥9873940964 🔥💋🎈 Provide Best And Top Girl Service...
Exclusive Girls Call Noida 🎈🔥9873940964 🔥💋🎈 Provide Best And Top Girl Service...Exclusive Girls Call Noida 🎈🔥9873940964 🔥💋🎈 Provide Best And Top Girl Service...
Exclusive Girls Call Noida 🎈🔥9873940964 🔥💋🎈 Provide Best And Top Girl Service...
 
VIP Kanpur Girls Call Kanpur 0X0000000X Doorstep High-Profile Girl Service Ca...
VIP Kanpur Girls Call Kanpur 0X0000000X Doorstep High-Profile Girl Service Ca...VIP Kanpur Girls Call Kanpur 0X0000000X Doorstep High-Profile Girl Service Ca...
VIP Kanpur Girls Call Kanpur 0X0000000X Doorstep High-Profile Girl Service Ca...
 
社内勉強会資料_TransNeXt: Robust Foveal Visual Perception for Vision Transformers
社内勉強会資料_TransNeXt: Robust Foveal Visual Perception for Vision Transformers社内勉強会資料_TransNeXt: Robust Foveal Visual Perception for Vision Transformers
社内勉強会資料_TransNeXt: Robust Foveal Visual Perception for Vision Transformers
 
Busty Girls Call Delhi 🎈🔥9711199171 🔥💋🎈 Provide Best And Top Girl Service And...
Busty Girls Call Delhi 🎈🔥9711199171 🔥💋🎈 Provide Best And Top Girl Service And...Busty Girls Call Delhi 🎈🔥9711199171 🔥💋🎈 Provide Best And Top Girl Service And...
Busty Girls Call Delhi 🎈🔥9711199171 🔥💋🎈 Provide Best And Top Girl Service And...
 
Potential Uses of the Floyd-Warshall Algorithm as appropriate
Potential Uses of the Floyd-Warshall Algorithm as appropriatePotential Uses of the Floyd-Warshall Algorithm as appropriate
Potential Uses of the Floyd-Warshall Algorithm as appropriate
 
DataScienceConcept_Kanchana_Weerasinghe.pptx
DataScienceConcept_Kanchana_Weerasinghe.pptxDataScienceConcept_Kanchana_Weerasinghe.pptx
DataScienceConcept_Kanchana_Weerasinghe.pptx
 
FINAL PROJECT WORK PORTFOLIO MANAGEMENT (2) hhh (1) (2) (5) (1) (1).pdf
FINAL PROJECT WORK PORTFOLIO MANAGEMENT (2)  hhh (1) (2) (5) (1) (1).pdfFINAL PROJECT WORK PORTFOLIO MANAGEMENT (2)  hhh (1) (2) (5) (1) (1).pdf
FINAL PROJECT WORK PORTFOLIO MANAGEMENT (2) hhh (1) (2) (5) (1) (1).pdf
 
Celebrity Girls Call Andheri 9930245274 Unlimited Short Providing Girls Servi...
Celebrity Girls Call Andheri 9930245274 Unlimited Short Providing Girls Servi...Celebrity Girls Call Andheri 9930245274 Unlimited Short Providing Girls Servi...
Celebrity Girls Call Andheri 9930245274 Unlimited Short Providing Girls Servi...
 
Harnessing Wild and Untamed (Publicly Available) Data for the Cost efficient ...
Harnessing Wild and Untamed (Publicly Available) Data for the Cost efficient ...Harnessing Wild and Untamed (Publicly Available) Data for the Cost efficient ...
Harnessing Wild and Untamed (Publicly Available) Data for the Cost efficient ...
 
AWS re:Invent 2023 - Deep dive into Amazon Aurora and its innovations DAT408
AWS re:Invent 2023 - Deep dive into Amazon Aurora and its innovations DAT408AWS re:Invent 2023 - Deep dive into Amazon Aurora and its innovations DAT408
AWS re:Invent 2023 - Deep dive into Amazon Aurora and its innovations DAT408
 
PyData Sofia May 2024 - Intro to Apache Arrow
PyData Sofia May 2024 - Intro to Apache ArrowPyData Sofia May 2024 - Intro to Apache Arrow
PyData Sofia May 2024 - Intro to Apache Arrow
 
DU degree offer diploma Transcript
DU degree offer diploma TranscriptDU degree offer diploma Transcript
DU degree offer diploma Transcript
 
Why_are_we_hypnotizing_ourselves-_ATeggin-1.pdf
Why_are_we_hypnotizing_ourselves-_ATeggin-1.pdfWhy_are_we_hypnotizing_ourselves-_ATeggin-1.pdf
Why_are_we_hypnotizing_ourselves-_ATeggin-1.pdf
 
New Girls Call Delhi 🎈🔥9711199171 🔥💋🎈 Provide Best And Top Girl Service And N...
New Girls Call Delhi 🎈🔥9711199171 🔥💋🎈 Provide Best And Top Girl Service And N...New Girls Call Delhi 🎈🔥9711199171 🔥💋🎈 Provide Best And Top Girl Service And N...
New Girls Call Delhi 🎈🔥9711199171 🔥💋🎈 Provide Best And Top Girl Service And N...
 
Mumbai Girls Call Mumbai 🛵🚡9910780858 💃 Choose Best And Top Girl Service And ...
Mumbai Girls Call Mumbai 🛵🚡9910780858 💃 Choose Best And Top Girl Service And ...Mumbai Girls Call Mumbai 🛵🚡9910780858 💃 Choose Best And Top Girl Service And ...
Mumbai Girls Call Mumbai 🛵🚡9910780858 💃 Choose Best And Top Girl Service And ...
 
Oracle Database Desupported Features on 23ai (Part B)
Oracle Database Desupported Features on 23ai (Part B)Oracle Database Desupported Features on 23ai (Part B)
Oracle Database Desupported Features on 23ai (Part B)
 
Celebrity Girls Call Noida 9873940964 Unlimited Short Providing Girls Service...
Celebrity Girls Call Noida 9873940964 Unlimited Short Providing Girls Service...Celebrity Girls Call Noida 9873940964 Unlimited Short Providing Girls Service...
Celebrity Girls Call Noida 9873940964 Unlimited Short Providing Girls Service...
 

Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake

  • 2. Simplify CDC pipeline with Spark Streaming SQL and Delta Lake Jun Song@windpiger songjun.sj@alibaba-inc.com June 24, 2020
  • 3. About Me • Staff Engineer from Alibaba Cloud E-MapReduce Product Team • Spark contributor focused on SparkSQL • HiveOnDelta contributor(https://github.com/delta-io/connectors)
  • 4. Agenda • What is CDC • CDC solution using Spark Streaming SQL & Delta Lake • Future Works
  • 6. Change Data Capture Target Storage AdHoc/ ETL Collect Change Sets & Merge Analytics CDC OLTP database Data Warehouse Data Lake …
  • 7. ▪ load pressure on source database ▪ high latency batch job(hourly/daily/…) ▪ can not handle delete rows ▪ can not handle schema change DrawbacksUsing Sqoop(Batch Mode) Change Data Capture sqoop --incremental lastmodified --last-value '2028/01/01 13:00:00’ ... sqoop merge --new-data newer --onto older --merge-key id …
  • 8. ▪ heavy servers & operational support(Kudu&HBase) ▪ HBase can not support high-throughput analytics ▪ complex merge implement by java/scala code ▪ can not handle schema change DrawbacksUsing binlog(Streaming Mode) Change Data Capture binlog scala/java
  • 9. CDC solution using Spark Streaming SQL & Delta
  • 10. Spark Streaming SQL SparkCore SparkSQL Structured Streaming Spark Streaming SQL https://www.alibabacloud.com/help/doc-detail/124684.htm SQL is a Standard Declarative Language, which can simplify real-time analytics ▪ DDL CREATE TABLE、CREATE TABLE AS SELECT、CREATE SCAN、CREATE STREAM ▪ DML INSERT INTO、MERGE INTO ▪ SELECT SELECT FROM、WHERE、GROUP BY 、JOIN、UNION ALL ▪ UDF TUMBLING、HOPPING、DELAY、SparkSQL UDF ▪ Data Source Delta、Kafka、HBase、JDBC、Druid、Redis、Kudu、Alibaba Cloud(Loghub、Tablestore、DataHub) Design Doc: https://docs.google.com/document/d/19degwnIIcuMSELv6BQ_1VQI5AIVcvGeqOm5xE2-aRA0/edit
  • 11. Spark Streaming SQL CREATE TABLE kafka_test USING kafka OPTIONS( kafka.bootstrap.servers='', subscribe='test') batch streaming CREATE SCAN kafka_test_batch_scan ON kafka_test USING batch CREATE SCAN kafka_test_batch_scan ON kafka_test USING stream OPTIONS( maxOffsetsPerTrigger='100000' ) SELECT count(*) FROM kafka_test_batch_scan ? CREATE SCAN CREATE SCAN tbName_alias ON tbName USING queryType OPTIONS (propertyName=propertyValue[,propertyName=propertyValue]*)
  • 12. Spark Streaming SQL CREATE SCAN kafka_test_batch_scan ON kafka_test USING stream OPTIONS( maxOffsetsPerTrigger='100000' ) CREATE STREAM CREATE STREAM kafka_test_stream_job OPTIONS( checkpointLocation='/tmp/spark', outputMode='Append', triggerType='ProcessingTime' triggerIntervalMs='3000') INSERT INTO target_tbl SELECT * FROM kafka_test_batch_scan WHERE units > 1000; CREATE STREAM queryName OPTIONS (propertyName=propertyValue[,propertyName=propertyValue]*) INSERT INTO tbName queryStatement;
  • 13. Spark Streaming SQL MERGE INTO mergeInto : MERGE INTO target=tableIdentifier tableAlias USING (source=tableIdentifier (timeTravel)? | '(' subquery = query ')') tableAlias mergeCondition? matchedClauses* notMatchedClause? MERGE INTO target_table t USING source_table s ON s.id = t.id WHEN MATCHED AND s.opType = 'delete' THEN DELETE WHEN MATCHED AND s.opTye = 'update' THEN UPDATE SET id = s.id, name = s.name WHEN NOT MATCHED AND s.opType = 'insert' THEN INSERT (key, value) VALUES (key, value)
  • 14. Spark Streaming SQL DELAY / TUMBLING / HOPPING WHERE delay(colName) < 'duration'withWatermark("colName", "duration") SELECT avg(inv_quantity_on_hand) qoh FROM kafka_inventory WHERE delay(inv_data_time) < '2 minutes' GROUP BY TUMBLING (inv_data_time, interval 1 minute)
  • 15. Delta Lake Streaming/Batch Streaming/Batch ACID Transactions Metadata Management Unified Batch&Streaming Schema Enforcement&Evolution Update&Delete&Merge Time Travel Parquet Key Feature
  • 16. Delta Lake Improvement Delta Lake SparkSQL Spark Streaming SQL Update/Delete/ Optimize/Vacuum DDL/DML Hive/ Presto HDFS Alibaba OSS S3 …
  • 17. Delta Lake Improvement CREATE EXTERNAL TABLE delta_tbl(a string, b int) ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' STORED AS INPUTFORMAT 'io.delta.hive.DeltaInputFormat' OUTPUTFORMAT 'io.delta.hive.DeltaOutputFormat' LOCATION 'oss://testbucket/delta/events'
  • 18. Spark Streaming SQL binlog ▪ no extra operational support for delta ▪ no load pressure on source database ▪ merge implement easily by SQL ▪ realtime low-latency(minute) CDC solution using Spark Streaming SQL & Delta Lake
  • 19. Spark Streaming SQL binlog CREATE SCAN cdctest_incremental_scan ON kafka_cdctest USING STREAM OPTIONS( startingOffsets='earliest', maxOffsetsPerTrigger='100000', failOnDataLoss=false ); CREATE STREAM cdctest_job OPTIONS(checkpointLocation='/delta/cdctest_checkpoint_oss') MERGE INTO delta_cdctest_oss as target USING ( SELECT // binlog parser … FROM cdctest_incremental_scan ) ON target.id = source.before_id WHEN MATCHED AND source.recordType='UPDATE' THEN UPDATE SET … WHEN MATCHED AND source.recordType='DELETE' THEN DELETE WHEN NOT MATCHED AND source.recordType='INSERT' THEN INSERT … CREATE TABLE kafka_cdctest USING KAFKA … CREATE TABLE delta_cdctest_oss USING DELTA … streaming-sql --master yarn --use-emr-datasource -f cdc_oss.sql 1 2 3 4
  • 20. Delta Table batch-batch- batch- …batch- DeltaTable .merge DeltaTable .merge DeltaTable .merge DeltaTable .merge Spark Streaming SQL MERGE INTO CDC solution using Spark Streaming SQL & Delta Lake
  • 21. Long Running Stability Improvement How to handle small files? ▪ increase batch interval(minutes) ▪ compcation(change data layout, not change data) ▪ adaptive execution mode
  • 22. Long Running Stability Improvement How to handle small files? - Scheduled Compaction Delta Table batch-batch- batch- … batch- DeltaTable .merge DeltaTable .merge DeltaTable .merge DeltaTable .merge CompactionCompaction scheduled job hourly/daily/… OPTIMIZE <tbl> [WHERE where_clause]
  • 23. Long Running Stability Improvement Problem: Streaming Job Failed when doing a compaction batch- Delta Table Compaction do transaction commit merge Transaction conflict check read How to handle small files? - Scheduled Compaction
  • 24. Long Running Stability Improvement Problem: Streaming Job Failed when doing a compaction binlog in batch streaming job status Improvement only insert Succeed Fix one bug: https://github.com/delta-io/delta/issues/326 including delete/update Failed streaming job retry this batch How to handle small files? - Scheduled Compaction
  • 25. Long Running Stability Improvement How to handle small files? - Auto Compaction Delta Table batch-batch- …batch- merge compaction compaction merge merge sequential execution, no conflict select files which file_size > COMPACT_FILE_SIZE flies number > TRIGGER_FILE_COU NT do compaction Yes No continue streaming Strategy
  • 26. Long Running Stability Improvement How to handle small files? - Adaptive Execution batch binlog target delta target changed files Join rewrite all changed files Join spark.sql.adaptive.enabled -> true Adaptive Excetion can auto merge small partitions, to decrease the number of reducers, then decrease the number of output files.
  • 27. Long Running Stability Improvement Performance issue batch-batch- batch- …batch- size of target delta table merge performance Runtime Filter https://issues.apache.org/jira/browse/SPARK-27227 batch binlog target delta target changed files Join filter
  • 29. Future ▪ auto schema change detected ▪ long running stable performance(read on merge) ▪ simplify user experience by SYNC grammar SYNC kafka_binlog_tbl TO delta_tbl OPTIONS( type='debezium.mysql' )
  • 30. Feedback Your feedback is important to us. Don’t forget to rate and review the sessions.