
Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake

Change Data Capture (CDC) is a typical use case in real-time data warehousing: it tracks the change log (binlog) of a relational OLTP database and replays those changes promptly into external storage such as Delta or Kudu for real-time OLAP. To implement a robust CDC streaming pipeline, many factors must be considered: how to ensure data accuracy, how to handle schema changes in the OLTP source, and whether the pipeline is easy to build for a variety of databases with little code.
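Conceptually, replaying the change log means upserting or deleting rows in the target by primary key. A minimal sketch of that replay step in MERGE syntax (the table, column, and op-type names here are illustrative, not from the talk):

    MERGE INTO olap_target t
    USING binlog_changes c      -- one micro-batch of parsed binlog records
    ON t.id = c.id
    WHEN MATCHED AND c.op_type = 'DELETE' THEN DELETE   -- row deleted upstream
    WHEN MATCHED THEN UPDATE SET val = c.val            -- row updated upstream
    WHEN NOT MATCHED AND c.op_type = 'INSERT' THEN INSERT (id, val) VALUES (c.id, c.val)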


  1. Simplify CDC pipeline with Spark Streaming SQL and Delta Lake. Jun Song (@windpiger), songjun.sj@alibaba-inc.com, June 24, 2020
  2. About Me • Staff Engineer on the Alibaba Cloud E-MapReduce product team • Spark contributor focused on SparkSQL • Hive-on-Delta contributor (https://github.com/delta-io/connectors)
  3. Agenda • What is CDC • A CDC solution using Spark Streaming SQL & Delta Lake • Future Work
  4. What is CDC
  5. Change Data Capture. [Diagram: CDC collects change sets from an OLTP database and merges them into target storage (data warehouse, data lake, …), where they serve ad-hoc queries, ETL, and analytics.]
  6. Change Data Capture with Sqoop (batch mode): sqoop --incremental lastmodified --last-value '2028/01/01 13:00:00' … followed by sqoop merge --new-data newer --onto older --merge-key id … Drawbacks: load pressure on the source database; high-latency batch jobs (hourly/daily/…); cannot handle deleted rows; cannot handle schema changes.
  7. Change Data Capture with binlog (streaming mode), parsed and merged by hand-written Scala/Java jobs. Drawbacks: heavy servers and operational support (Kudu & HBase); HBase cannot support high-throughput analytics; complex merge logic implemented in Java/Scala code; cannot handle schema changes.
  8. CDC solution using Spark Streaming SQL & Delta Lake
  9. Spark Streaming SQL: built on SparkCore, SparkSQL, and Structured Streaming (https://www.alibabacloud.com/help/doc-detail/124684.htm). SQL is a standard declarative language, which can simplify real-time analytics. • DDL: CREATE TABLE, CREATE TABLE AS SELECT, CREATE SCAN, CREATE STREAM • DML: INSERT INTO, MERGE INTO • SELECT: SELECT FROM, WHERE, GROUP BY, JOIN, UNION ALL • UDF: TUMBLING, HOPPING, DELAY, SparkSQL UDFs • Data sources: Delta, Kafka, HBase, JDBC, Druid, Redis, Kudu, Alibaba Cloud (Loghub, Tablestore, DataHub). Design doc: https://docs.google.com/document/d/19degwnIIcuMSELv6BQ_1VQI5AIVcvGeqOm5xE2-aRA0/edit
  10. Spark Streaming SQL: CREATE SCAN. Syntax: CREATE SCAN tbName_alias ON tbName USING queryType OPTIONS (propertyName=propertyValue[,propertyName=propertyValue]*). Given CREATE TABLE kafka_test USING kafka OPTIONS( kafka.bootstrap.servers='', subscribe='test'), a batch scan is CREATE SCAN kafka_test_batch_scan ON kafka_test USING batch, which can be queried directly with SELECT count(*) FROM kafka_test_batch_scan; a streaming scan is CREATE SCAN kafka_test_stream_scan ON kafka_test USING stream OPTIONS( maxOffsetsPerTrigger='100000' ).
  11. Spark Streaming SQL: CREATE STREAM. Syntax: CREATE STREAM queryName OPTIONS (propertyName=propertyValue[,propertyName=propertyValue]*) INSERT INTO tbName queryStatement. Example, reading from the streaming scan above: CREATE STREAM kafka_test_stream_job OPTIONS( checkpointLocation='/tmp/spark', outputMode='Append', triggerType='ProcessingTime', triggerIntervalMs='3000') INSERT INTO target_tbl SELECT * FROM kafka_test_stream_scan WHERE units > 1000;
  12. Spark Streaming SQL: MERGE INTO. Grammar: MERGE INTO target=tableIdentifier tableAlias USING (source=tableIdentifier (timeTravel)? | '(' subquery=query ')') tableAlias mergeCondition? matchedClauses* notMatchedClause? Example: MERGE INTO target_table t USING source_table s ON s.id = t.id WHEN MATCHED AND s.opType = 'delete' THEN DELETE WHEN MATCHED AND s.opType = 'update' THEN UPDATE SET id = s.id, name = s.name WHEN NOT MATCHED AND s.opType = 'insert' THEN INSERT (key, value) VALUES (s.key, s.value)
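     In CDC use, one micro-batch may carry several changes for the same key, while MERGE INTO requires each target row to match at most one source row. A minimal sketch of keeping only the latest change per key before merging (the change_ts ordering column is hypothetical, not from the slides):

         MERGE INTO target_table t
         USING (
           SELECT id, name, opType FROM (
             SELECT *, row_number() OVER (PARTITION BY id ORDER BY change_ts DESC) AS rn
             FROM source_table
           ) ranked WHERE rn = 1          -- latest change per key wins
         ) s
         ON s.id = t.id
         WHEN MATCHED AND s.opType = 'delete' THEN DELETE
         WHEN MATCHED AND s.opType = 'update' THEN UPDATE SET id = s.id, name = s.name
         WHEN NOT MATCHED AND s.opType = 'insert' THEN INSERT (id, name) VALUES (s.id, s.name)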
  13. Spark Streaming SQL: DELAY / TUMBLING / HOPPING. WHERE delay(colName) < 'duration' corresponds to withWatermark("colName", "duration"). Example: SELECT avg(inv_quantity_on_hand) qoh FROM kafka_inventory WHERE delay(inv_data_time) < '2 minutes' GROUP BY TUMBLING (inv_data_time, interval 1 minute)
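     HOPPING (sliding) windows are listed on slide 9 but not shown. By analogy with the TUMBLING call above, and assuming a (column, window, slide) argument order that I have not verified against the Alibaba documentation, a sliding variant might look like:

         SELECT avg(inv_quantity_on_hand) qoh
         FROM kafka_inventory
         WHERE delay(inv_data_time) < '2 minutes'
         GROUP BY HOPPING (inv_data_time, interval 5 minute, interval 1 minute)  -- 5-minute window sliding every minute (assumed order)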
  14. Delta Lake key features: ACID transactions; metadata management; unified batch & streaming reads and writes; schema enforcement & evolution; update, delete & merge; time travel; Parquet as the storage format.
  15. Delta Lake improvement. [Diagram: Delta Lake extended with SparkSQL and Spark Streaming SQL support for Update/Delete/Optimize/Vacuum and other DDL/DML; query access from Hive/Presto; storage on HDFS, Alibaba OSS, S3, …]
  16. Delta Lake improvement: exposing a Delta table to Hive. CREATE EXTERNAL TABLE delta_tbl(a string, b int) ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' STORED AS INPUTFORMAT 'io.delta.hive.DeltaInputFormat' OUTPUTFORMAT 'io.delta.hive.DeltaOutputFormat' LOCATION 'oss://testbucket/delta/events'
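     Once declared, the table reads like any other Hive table, for example:

         SELECT count(*) FROM delta_tbl;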
  17. CDC solution using Spark Streaming SQL & Delta Lake: consume the binlog with Spark Streaming SQL and merge it into Delta. Benefits: no extra operational support for Delta; no load pressure on the source database; merge logic expressed easily in SQL; real-time, low-latency (minute-level) results.
  18. The pipeline in Spark Streaming SQL, submitted with streaming-sql --master yarn --use-emr-datasource -f cdc_oss.sql: (1) declare the endpoints: CREATE TABLE kafka_cdctest USING KAFKA …; CREATE TABLE delta_cdctest_oss USING DELTA …; (2) create the incremental scan: CREATE SCAN cdctest_incremental_scan ON kafka_cdctest USING STREAM OPTIONS( startingOffsets='earliest', maxOffsetsPerTrigger='100000', failOnDataLoss=false ); (3) define the streaming job: CREATE STREAM cdctest_job OPTIONS( checkpointLocation='/delta/cdctest_checkpoint_oss' ); (4) whose body merges each batch into Delta: MERGE INTO delta_cdctest_oss AS target USING ( SELECT /* binlog parser */ … FROM cdctest_incremental_scan ) AS source ON target.id = source.before_id WHEN MATCHED AND source.recordType='UPDATE' THEN UPDATE SET … WHEN MATCHED AND source.recordType='DELETE' THEN DELETE WHEN NOT MATCHED AND source.recordType='INSERT' THEN INSERT …
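     The binlog-parser subquery is elided on the slide. A purely hypothetical expansion, assuming the Kafka value carries Debezium-style JSON (every field name and the op-code mapping below are assumptions, not from the talk):

         SELECT
           CASE get_json_object(CAST(value AS STRING), '$.op')    -- Debezium op codes (assumed)
             WHEN 'c' THEN 'INSERT'
             WHEN 'u' THEN 'UPDATE'
             WHEN 'd' THEN 'DELETE'
           END AS recordType,
           get_json_object(CAST(value AS STRING), '$.before.id')  AS before_id,
           get_json_object(CAST(value AS STRING), '$.after.id')   AS id,
           get_json_object(CAST(value AS STRING), '$.after.name') AS name
         FROM cdctest_incremental_scan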
  19. CDC solution using Spark Streaming SQL & Delta Lake. [Diagram: the Spark Streaming SQL MERGE INTO applies each micro-batch (batch-1, batch-2, …) to the Delta table via DeltaTable.merge.]
  20. Long-running stability improvement: how to handle small files? • increase the batch interval (minutes) • compaction (changes the data layout, not the data itself) • adaptive execution mode
  21. Long-running stability improvement: how to handle small files? Scheduled compaction: a scheduled job (hourly/daily/…) runs OPTIMIZE <tbl> [WHERE where_clause] to compact the small files left behind by the per-batch DeltaTable.merge commits.
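     For example, compacting one partition of the table from slide 18 (the ds partition column and its value are hypothetical; the OPTIMIZE statement itself is the EMR extension shown on the slide):

         OPTIMIZE delta_cdctest_oss WHERE ds = '2020-06-23';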
  22. Long-running stability improvement: scheduled compaction. Problem: the streaming job fails while a compaction is running. [Diagram: a merge batch reads the Delta table while the compaction commits its own transaction; the merge's transaction-conflict check then detects the concurrent commit and the batch fails.]
  23. Long-running stability improvement: scheduled compaction. Whether a batch survives a concurrent compaction depends on its contents: if the binlog batch contains only inserts, it succeeds (after fixing one bug: https://github.com/delta-io/delta/issues/326); if it includes deletes/updates, it fails, and the improvement is to have the streaming job retry that batch.
  24. Long-running stability improvement: auto compaction. Run compaction inside the streaming job: merge and compaction execute sequentially on the Delta table, so there is no conflict. Strategy after each merge: select the files whose file_size is below COMPACT_FILE_SIZE; if their number exceeds TRIGGER_FILE_COUNT, do a compaction, otherwise continue streaming.
  25. Long-running stability improvement: adaptive execution. Each batch joins the binlog against the target Delta table to find the changed files and then rewrites all of them. With spark.sql.adaptive.enabled set to true, adaptive execution can automatically merge small partitions, decreasing the number of reducers and thereby the number of output files.
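     A minimal sketch of enabling this in the SQL session (the flag is standard Spark; version-specific knobs for the target partition size are omitted):

         SET spark.sql.adaptive.enabled=true;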
  26. Long-running stability improvement: performance. Merge performance degrades as the target Delta table grows, because each batch joins the binlog against the whole table to locate changed files. A runtime filter (https://issues.apache.org/jira/browse/SPARK-27227) derived from the batch's binlog keys is pushed into the scan of the target table, so the join reads only the files that could have changed.
  27. Future Work
  28. Future work: • automatic detection of schema changes • stable long-running performance (read on merge) • a simpler user experience via a SYNC grammar: SYNC kafka_binlog_tbl TO delta_tbl OPTIONS( type='debezium.mysql' )
  29. Feedback. Your feedback is important to us. Don't forget to rate and review the sessions.
