
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Architect Things Right


Structured Streaming has proven to be the best platform for building distributed stream processing applications. Its unified SQL/Dataset/DataFrame APIs and Spark’s built-in functions make it easy for developers to express complex computations. Delta Lake, on the other hand, is the best way to store structured data because it is an open-source storage layer that brings ACID transactions to Apache Spark and big data workloads. Together, these can make it very easy to build pipelines in many common scenarios. However, expressing the business logic is only part of the larger problem of building end-to-end streaming pipelines that interact with a complex ecosystem of storage systems and workloads. It is important for the developer to truly understand the business problem that needs to be solved. Apache Spark, being a unified analytics engine doing both batch and stream processing, often provides multiple ways to solve the same problem. So understanding the requirements carefully helps you architect a pipeline that solves your business needs in the most resource-efficient manner.

In this talk, I am going to examine a number of common streaming design patterns in the context of the following questions.

WHAT are you trying to consume? What are you trying to produce? What is the final output that the business wants? What are your throughput and latency requirements?
WHY do you really have those requirements? Would solving the requirements of the individual pipeline actually solve your end-to-end business requirements?
HOW are you going to architect the solution? And how much are you willing to pay for it?
Clarity in understanding the ‘what and why’ of any problem automatically brings much clarity on the ‘how’ to architect it using Structured Streaming and, in many cases, Delta Lake.



  1. 1. DESIGNING ETL PIPELINES WITH STRUCTURED STREAMING: How to architect things right. Spark Summit Europe, 16 October 2019. Tathagata “TD” Das (@tathadas)
  2. 2. About Me: Started the Streaming project in AMPLab, UC Berkeley. Currently focused on Structured Streaming and Delta Lake. Staff Engineer on the StreamTeam. Team Motto: "We make all your streams come true"
  3. 3. Structured Streaming Distributed stream processing built on SQL engine High throughput, second-scale latencies Fault-tolerant, exactly-once Great set of connectors Philosophy: Treat data streams like unbounded tables Users write batch-like queries on tables Spark will continuously execute the queries incrementally on streams 3
  4. 4. Structured Streaming ETL example: read JSON data from Kafka, parse nested JSON, store in a structured Parquet table, and get end-to-end failure guarantees.

  spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", ...)
    .option("subscribe", "topic")
    .load()
    .selectExpr("cast (value as string) as json")
    .select(from_json("json", schema).as("data"))
    .writeStream
    .format("parquet")
    .option("path", "/parquetTable/")
    .trigger(Trigger.ProcessingTime("1 minute"))
    .option("checkpointLocation", "…")
    .start()
  5. 5. Anatomy of a Streaming Query:

  spark.readStream.format("kafka")                 // Specify where to read data from
    .option("kafka.bootstrap.servers", ...)
    .option("subscribe", "topic")
    .load()
    .selectExpr("cast (value as string) as json")  // Specify data transformations
    .select(from_json("json", schema).as("data"))
    .writeStream
    .format("parquet")                             // Specify where to write data to
    .option("path", "/parquetTable/")
    .trigger(Trigger.ProcessingTime("1 minute"))   // Specify how to process data
    .option("checkpointLocation", "…")
    .start()
  6. 6. DataFrames, Datasets, SQL: Spark automatically streamifies! Spark SQL converts the batch-like query (logical plan: read from Kafka → project cast(value) as json → project from_json(json) → write to Parquet) into a series of incremental execution plans (Kafka source → optimized operators with codegen, off-heap data, etc. → Parquet sink) that process new batches of data at t = 1, t = 2, t = 3, …
  7. 7. • ACID transactions • Schema Enforcement and Evolution • Data versioning and Audit History • Time travel to old versions • Open formats • Scalable Metadata Handling • Great with batch + streaming • Upserts and Deletes Open-source storage layer that brings ACID transactions to Apache Spark™ and big data workloads. https://delta.io/
  8. 8. THE GOOD OF DATA LAKES • Massive scale out • Open Formats • Mixed workloads THE GOOD OF DATA WAREHOUSES • Pristine Data • Transactional Reliability • Fast SQL Queries Open-source storage layer that brings ACID transactions to Apache Spark™ and big data workloads. https://delta.io/
  9. 9. STRUCTURED STREAMING How to build streaming data pipelines with them?
  10. 10. STRUCTURED STREAMING What are the design patterns to correctly architect streaming data pipelines?
  11. 11. Another streaming design pattern talk???? Most talks Focus on a pure streaming engine Explain one way of achieving the end goal 11 This talk Spark is more than a streaming engine Spark has multiple ways of achieving the end goal with tunable perf, cost and quality
  12. 12. This talk How to think about design Common design patterns How we are making this easier 12
  13. 13. Streaming Pipeline Design 13 Data streams Insights ????
  14. 14. 14 ???? What? Why? How?
  15. 15. What? 15 What is your input? What is your data? What format and system is your data in? What is your output? What results do you need? What throughput and latency do you need?
  16. 16. Why? 16 Why do you want this output in this way? Who is going to take actions based on it? When and how are they going to consume it? humans? computers?
  17. 17. Why? Common mistakes! #1 "I want my dashboard with counts to be updated every second": no point in updating every second if humans are going to take actions in minutes or hours. #2 "I want to generate automatic alerts with up-to-the-last-second counts" (but my input data is often delayed): no point in taking fast actions on low-quality data and results.
  18. 18. Why? Common mistakes! #3 "I want to train machine learning models on the results" (but my results are in a key-value store) 18 Key-value stores are not great for large, repeated data scans which machine learning workloads perform
  19. 19. How? 19 ???? How to process the data? ???? How to store the results?
  20. 20. Streaming Design Patterns: What? Why? How? Complex ETL
  21. 21. Pattern 1: ETL 21 Input: unstructured input stream from files, Kafka, etc. What? Why? Query latest structured data interactively or with periodic jobs 01:06:45 WARN id = 1 , update failed 01:06:45 INFO id=23, update success 01:06:57 INFO id=87: update postpo … Output: structured tabular data
  22. 22. P1: ETL 22 Convert unstructured input to structured tabular data Latency: few minutes What? How? Why? Query latest structured data interactively or with periodic jobs Process: Use Structured Streaming query to transform unstructured, dirty data Run 24/7 on a cluster with default trigger Store: Save to structured scalable storage that supports data skipping, etc. E.g.: Parquet, ORC, or even better, Delta Lake STRUCTURED STREAMING ETL QUERY 01:06:45 WARN id = 1 , update failed 01:06:45 INFO id=23, update success 01:06:57 INFO id=87: update postpo …
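To make the transform step of this pattern concrete, here is a plain-Python sketch of the parsing stage (illustrative only; the regex and the `parse_line` helper are assumptions based on the slide's sample log lines, not code from the talk): each raw line either becomes a structured record or is discarded as dirty data.

```python
import re

# Hypothetical log format matching the slide's sample lines, e.g.
# "01:06:45 WARN id = 1 , update failed"
LOG_PATTERN = re.compile(
    r"(?P<time>\d{2}:\d{2}:\d{2})\s+(?P<level>\w+)\s+"
    r"id\s*=\s*(?P<id>\d+)\s*[,:]\s*(?P<message>.*)"
)

def parse_line(line):
    """Turn one raw log line into a structured record, or None if it is dirty."""
    m = LOG_PATTERN.match(line.strip())
    if m is None:
        return None  # dirty data: drop it (or route it to a quarantine table)
    rec = m.groupdict()
    rec["id"] = int(rec["id"])
    return rec

raw = [
    "01:06:45 WARN id = 1 , update failed",
    "01:06:45 INFO id=23, update success",
    "garbage line that does not parse",
]
structured = [r for r in (parse_line(l) for l in raw) if r is not None]
```

In the real pipeline this logic lives inside the streaming query's transformations, and the structured rows land in Parquet/ORC/Delta Lake rather than a list.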
  23. 23. P1: ETL with Delta Lake. How? Store: Save to Delta Lake. Read with snapshot guarantees while writes are in progress. Concurrently reprocess data with full ACID guarantees. Coalesce small files into larger files. Update the table to fix mistakes in the data. Delete data for GDPR.
  24. 24. P1.1: Cheaper ETL. What: Convert unstructured input to structured tabular data. Latency: hours (relaxed from a few minutes). Do not keep clusters up 24/7. Why: Query the latest data interactively or with periodic jobs; cheaper solution. How? Process: Still use a Structured Streaming query! Run the streaming query with Trigger.Once to process all data available since the last batch, and set up an external schedule (every few hours?) to periodically start a cluster and run one batch.
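The semantics of this run-one-batch-and-stop pattern can be sketched in plain Python (illustrative only; the offset-based source, `run_once`, and the checkpoint dict are stand-ins, not Spark APIs): process everything that arrived since the last committed offset, commit progress, then shut down until the next scheduled run.

```python
# Minimal sketch of Trigger.Once-style processing under assumed names.
def run_once(source, checkpoint, process):
    """Process everything that arrived since the last run, then stop."""
    start = checkpoint.get("offset", 0)
    batch = source[start:]              # all records available since last batch
    process(batch)
    checkpoint["offset"] = len(source)  # commit progress before shutting down

source = ["a", "b", "c", "d"]
checkpoint = {}
out = []
run_once(source, checkpoint, out.extend)   # first scheduled run processes a..d
source += ["e", "f"]                       # more data arrives while cluster is down
run_once(source, checkpoint, out.extend)   # next run picks up only e, f
```

The checkpoint is what lets each scheduled run resume exactly where the previous one stopped, which is the same role the streaming checkpointLocation plays.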
  25. 25. P1.2: Query faster than ETL! Latency: seconds (instead of hours). What + Why: Query the latest, up-to-the-last-second data interactively. How? Query the data in Kafka directly using Spark SQL; this can process up to the last records received by Kafka when the query was started.
  26. 26. Pattern 2: Key-value output 26 Input: new data for each key Lookup latest value for key (dashboards, websites, etc.) OR Summary tables for querying interactively or with periodic jobs KEY LATEST VALUE key1 value2 key2 value3 { "key1": "value1" } { "key1": "value2" } { "key2": "value3" } Output: updated values for each key Aggregations (sum, count, …) Sessionizations What? Why?
  27. 27. P2.1: Key-value output for lookup. What: Generate updated values for keys. Latency: seconds/minutes. Why: Look up the latest value for a key. How? Process: Use Structured Streaming with stateful operations for aggregation. Store: Save in key-value stores optimized for single-key lookups.
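As a plain-Python sketch of this stateful-aggregation-plus-upsert pattern (the dicts standing in for Spark's state store and the key-value sink are assumptions for illustration): each micro-batch updates running per-key state, then upserts only the changed keys into the lookup store.

```python
from collections import defaultdict

state = defaultdict(int)   # streaming state: running count per key
kv_store = {}              # sink optimized for single-key lookups

def process_micro_batch(events):
    """Update per-key counts, then upsert only the keys that changed."""
    updated = set()
    for key in events:
        state[key] += 1
        updated.add(key)
    for key in updated:
        kv_store[key] = state[key]

process_micro_batch(["key1", "key1", "key2"])
process_micro_batch(["key2"])
```

Upserting only the changed keys per batch is what keeps the write volume proportional to the update rate rather than the full key space.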
  28. 28. P2.2: Key-value output for analytics. What: Generate updated values for keys. Latency: seconds/minutes. Why: Look up the latest value for a key; summary tables for analytics. How? Process: Use Structured Streaming with stateful operations for aggregation. Store: Save in Delta Lake! Delta Lake supports upserts using MERGE.
  29. 29. P2.2: Key-value output for analytics. How? Stateful operations for aggregation; Delta Lake supports upserts using the MERGE SQL operation, with Scala/Java/Python APIs that have the same semantics as SQL MERGE (SQL MERGE supported in Databricks Delta, not in OSS yet):

  streamingDataFrame.writeStream.foreachBatch { (batchOutputDF: DataFrame, batchId: Long) =>
    DeltaTable.forPath(spark, "/aggs/").as("t")
      .merge(batchOutputDF.as("s"), "t.key = s.key")
      .whenMatched().update(...)
      .whenNotMatched().insert(...)
      .execute()
  }.start()
  30. 30. P2.2: Key-value output for analytics. How? Caveat: Stateful aggregation requires setting a watermark to drop very late data, and dropping some data leads to some inaccuracies.
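The watermark semantics behind this caveat can be sketched in plain Python (illustrative only; `run_with_watermark` and the fixed `delay` are assumed names, not Spark APIs): the watermark trails the maximum event time seen so far by the configured delay, and any event that arrives with a timestamp older than the watermark is dropped.

```python
def run_with_watermark(events, delay):
    """events: event times in arrival order; returns (kept, dropped)."""
    max_event_time = 0
    kept, dropped = [], []
    for t in events:
        max_event_time = max(max_event_time, t)
        watermark = max_event_time - delay   # watermark trails max event time
        (kept if t >= watermark else dropped).append(t)
    return kept, dropped

# event times arriving out of order, with a watermark delay of 10
kept, dropped = run_with_watermark([100, 105, 112, 95, 104], delay=10)
```

Here the event at time 95 arrives after the watermark has advanced past it and is dropped, which is exactly the source of the inaccuracy the slide mentions.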
  31. 31. P2.3: Key-value aggregations for analytics. What: Generate aggregated values for keys. Latency: hours/days. Do not drop any late data. Why: Summary tables for analytics; correct results. How? Process: ETL to a structured table (no stateful aggregation). Store: Save in Delta Lake. Post-process: Aggregate after all delayed data has been received.
  32. 32. Pattern 3: Joining multiple inputs 32 #UnifiedAnalytics #SparkAISummit { "id": 14, "name": "td", "v": 100.. { "id": 23, "name": "by", "v": -10.. { "id": 57, "name": "sz", "v": 34.. … 01:06:45 WARN id = 1 , update failed 01:06:45 INFO id=23, update success 01:06:57 INFO id=87: update postpo … id update value Input: Multiple data streams based on common key Output: Combined information What?
  33. 33. P3.1: Joining fast and slow streams 33 Input: One fast stream of facts and one slow stream of dimension changes Output: Fast stream enriched by data from slow stream Example: product sales info ("facts") enriched by more product info ("dimensions") What + Why? { "id": 14, "name": "td", "v": 100.. { "id": 23, "name": "by", "v": -10.. { "id": 57, "name": "sz", "v": 34.. … 01:06:45 WARN id = 1 , update failed 01:06:45 INFO id=23, update success 01:06:57 INFO id=87: update postpo … id update value How? ETL slow stream to a dimension table Join fast stream with snapshots of the dimension table FAST ETL DIMENSION TABLE COMBINED TABLE JOIN SLOW ETL
  34. 34. P3.1: Joining fast and slow streams. How? Caveats: Structured Streaming does not reload the dimension table snapshot, so changes made by the slow ETL won't be seen until restart. Better solution: Store the dimension table in Delta Lake; Delta Lake's versioning allows changes to be detected and the snapshot automatically reloaded without restart** (** available only in Databricks Delta Lake).
  35. 35. P3.1: Joining fast and slow streams. How? Caveats: Delays in updates to the dimension table can cause joins against stale dimension data, e.g. the sale of a product received even before the product table has any info on the product. Better solution: Treat it as "joining fast and fast data".
  36. 36. P3.2: Joining fast and fast data. What + Why: Input: two fast streams where either stream may be delayed. Output: combined info even if one stream is delayed relative to the other. Example: ad impressions and ad clicks. How? Use stream-stream joins in Structured Streaming: data will be buffered as state, and watermarks define how long to buffer before giving up on matching. See my past deep-dive talk for more info.
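The buffer-as-state behavior can be sketched in plain Python (illustrative only; `stream_stream_join` and its event encoding are assumptions, not Spark APIs): each side buffers unmatched rows, a match on the other side's buffer emits a joined result, and the watermark bounds how long buffered state is kept before giving up.

```python
def stream_stream_join(events, max_delay):
    """events: list of (side, key, time) in arrival order; returns joined keys."""
    buffers = {"left": {}, "right": {}}
    joined, watermark = [], 0
    for side, key, t in events:
        watermark = max(watermark, t - max_delay)
        other = "right" if side == "left" else "left"
        if key in buffers[other]:
            joined.append(key)           # matched: emit and clear the state
            del buffers[other][key]
        else:
            buffers[side][key] = t       # buffer and wait for the other side
        for buf in buffers.values():     # evict state older than the watermark
            for k in [k for k, bt in buf.items() if bt < watermark]:
                del buf[k]
    return joined

# ad impressions (left) matched against ad clicks (right)
joined = stream_stream_join(
    [("left", "ad1", 1), ("left", "ad2", 2), ("right", "ad1", 3)],
    max_delay=10)
```

The eviction step is the trade-off the slide describes: a larger `max_delay` tolerates more skew between the streams but holds more state.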
  37. 37. Pattern 4: Change data capture. What: Input: change data based on a primary key (INSERT a, 1; INSERT b, 2; UPDATE a, 3; DELETE b; INSERT b, 4). Output: the final table after the changes (a → 3, b → 4). Why: End-to-end replication of transactional tables into analytical tables.
  38. 38. P4: Change data capture. How? Use foreachBatch and MERGE: in each batch, apply the changes to the Delta table using MERGE. Delta Lake's MERGE supports extended syntax which includes multiple match clauses, clause conditions, and deletes. See the merge examples in the docs for more info.

  streamingDataFrame.writeStream.foreachBatch { (batchOutputDF: DataFrame, batchId: Long) =>
    DeltaTable.forPath(spark, "/deltaTable/").as("t")
      .merge(batchOutputDF.as("updates"), "t.key = updates.key")
      .whenMatched("updates.delete = false").update(...)
      .whenMatched("updates.delete = true").delete()
      .whenNotMatched().insert(...)
      .execute()
  }.start()
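What the per-batch merge achieves can be sketched in plain Python (a dict stands in for the Delta table; `apply_changes` is an illustrative name, not a Delta API), reproducing the change log and final table from the previous slide:

```python
def apply_changes(table, changes):
    """Apply a primary-key change log (op, key[, value]) to the target table."""
    for change in changes:
        op, key, *value = change
        if op == "DELETE":
            table.pop(key, None)   # like whenMatched(...).delete()
        else:                      # INSERT and UPDATE are both upserts
            table[key] = value[0]  # like whenMatched().update / whenNotMatched().insert
    return table

# the change log from the slide
table = apply_changes({}, [
    ("INSERT", "a", 1), ("INSERT", "b", 2),
    ("UPDATE", "a", 3), ("DELETE", "b"), ("INSERT", "b", 4),
])
```

Applying the log in order yields {a: 3, b: 4}, matching the slide; MERGE does the same per micro-batch, with ACID guarantees on the Delta table.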
  39. 39. Pattern 5: Writing to multiple outputs 39 id val1 val2 id sum count RAW LOGS + SUMMARIES TABLE 2 TABLE 1 { "id": 14, "name": "td", "v": 100.. { "id": 23, "name": "by", "v": -10.. { "id": 57, "name": "sz", "v": 34.. … SUMMARY FOR ANALYTICS + SUMMARY FOR LOOKUP TABLE CHANGE LOG + UPDATED TABLE What?
  40. 40. P5: Serial or Parallel? How? Serial: write table 1, then read it again to produce table 2; cheap or expensive depending on the size and format of table 1; higher latency. Parallel: two queries each read the raw input, so the input is read (and possibly parsed) twice; cheap or expensive depending on the size of the raw input plus the parsing cost.
  41. 41. P5: Combination! How? Combo 1: Multiple streaming queries. Do the expensive parsing once, write to table 1, then do cheaper follow-up processing from table 1. Good for ETL plus multiple levels of summaries, or change log plus updated table. Still writing and re-reading table 1. Combo 2: Single query + foreachBatch. Compute once, write multiple times. Cheaper, but loses the exactly-once guarantee.

  streamingDataFrame.writeStream.foreachBatch { (batchSummaryData: DataFrame, batchId: Long) =>
    // write summary to Delta Lake
    // write summary to key-value store
  }.start()
  42. 42. Production Pipelines: Structured Streaming queries move data through Bronze Tables, Silver Tables, and Gold Tables, feeding reporting and streaming analytics.
  43. 43. In Progress: Declarative Pipelines. Allow users to declaratively express the entire DAG of pipelines (Structured Streaming queries feeding reporting and streaming analytics) as a dataflow graph.
  44. 44. In Progress: Declarative Pipelines. Enforce metadata, storage, and quality declaratively. *Coming Soon

  dataset("warehouse")
    .query(input("kafka").select(…).join(…))  // Query to materialize
    .location(…)                              // Storage location
    .schema(…)                                // Optional strict schema checking
    .metastoreName(…)                         // Hive Metastore
    .description(…)                           // Human-readable description
    .expect("validTimestamp",                 // Expectations on data quality
            "timestamp > 2012-01-01 AND …",
            "fail / alert / quarantine")
  47. 47. In Progress: Declarative Pipelines. Unit test part or all of the DAG with sample input data. Integration test the DAG with production data. Deploy/upgrade new DAG code. Monitor the whole DAG in one place. Roll back code and data when needed. *Coming Soon
  48. 48. Build your own Delta Lake at https://delta.io
