
Composable Data Processing with Apache Spark

As the usage of Apache Spark continues to ramp up within the industry, a major challenge has been scaling our development. Too often we find that developers are re-implementing a similar set of cross-cutting concerns, sprinkled with some variant of use-case-specific business logic, as a concrete Spark app.


  1. Composable Data Processing (Shone Sadler & Dilip Biswal)
  2. Agenda. The Why: background on the problems that drove our need for Composable Data Processing (CDP). The What: a high-level walk-through of our CDP design. The How: the challenges of achieving CDP with Spark. The Results: the impact CDP has had and where we are headed.
  3. Adobe Experience Platform (AEP) Zen Statement: centralize and standardize customer data and content across the enterprise, powering 360° customer profiles and enabling data science and data governance to drive real-time personalized experiences. (Diagram: sources such as POS, CRM, product usage, marketing automation, IoT, geo-location, and commerce flow into DATA, governed by SEMANTICS & CONTROL, feeding INSIGHTS via ML & query and driving ACTION.)
  4. Adobe Experience Cloud Evolution
  5. Data Landing (aka Siphon). Scale: 1M batches per day, 13 terabytes per day, 32 billion events per day. Producers (customers, solutions, 3rd parties) send data through Siphon into the Data Lake. Siphon provides transformation, validation, partitioning, compaction, writing with exactly-once semantics, and lineage tracking.
  6. Siphon's Cross-Cutting Features. (Diagram: producers write to a queue; multiple Siphon instances, isolated into bulkheads and managed by a supervisor, perform streaming and batch ingest into the Data Lake, coordinating through a Catalog service.)
  7. Siphon's Data Processing (aka the Ingest Pipeline). (Diagram: Producers → Siphon ingest pipeline with stages Parse, Convert, Validate, Report, Write Data, and Write Errors → Data Lake.)
  8. Engineering Bottleneck. (Chart: over time, an increasing share of engineering effort shifts from feature work and data processing to cross-cutting concerns.)
  9. Option A: Path of Least Resistance. Everything lands in one app (Siphon + Feature X + Feature Y + ...). Drawbacks: hardening gets deprioritized, context switching adds overhead, the code tends toward spaghetti, and the app becomes increasingly difficult to test and maintain over time.
  10. Option B: "Delegate" the Problem. Each feature runs as its own service (Feature X by Service X, Feature Y by Service Y, ...), each producing its own output. Drawbacks: lack of reuse and consistency; complex to test, monitor, and maintain end to end; increased latency; COGS not tenable.
  11. Option C: Composable Data Processing. Features are built by their owning teams (Feature X by Team X, Feature Y by Team Y, ...) but composed into Siphon's single pipeline (Input → Output). Benefits: scalable engineering, modularized design and code, clear separation of responsibilities, easier to test and maintain, maximized reuse, minimized complexity, latency, and COGS.
  12. The What
  13. Goal: implement a framework that enables different teams to extend Siphon's data ingestion pipeline. The framework must be efficient, modular, pluggable, composable, and supportable.
  14. Modularizing the Pipeline. Sample JSON input (note that record 2 is missing its closing brace):
     {"id": "1", "name": "Jared Dunn", "bday": "1988-11-13", "level": "bronze"}
     {"id": "2", "name": "Russ Hannerman", "bday": "1972-05-20"
     {"id": "3", "name": "Monica Hall", "bday": "1985-02", "level": "silver"}
     {"id": "4", "Name": "Dinesh", "bday": "1985-01-05", "level": "blah"}
     Target schema: id (String), firstName (String), lastName (String), birthDate (Date), rewardsLevel (String, enum [bronze, silver, gold]).
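
For concreteness, the target schema above maps naturally onto a Spark schema. A minimal sketch; the enum constraint on rewardsLevel has no schema-level equivalent and is left to a validator:

    import org.apache.spark.sql.types._

    // The slide's target schema as a StructType; the enum constraint on
    // rewardsLevel cannot be expressed here and is checked at validation time.
    val targetSchema = StructType(Seq(
      StructField("id", StringType),
      StructField("firstName", StringType),
      StructField("lastName", StringType),
      StructField("birthDate", DateType),
      StructField("rewardsLevel", StringType)
    ))
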
  15. Step 1: Parsing. Records 1, 3, and 4 parse and pass through with columns (id, name, bday, level); record 2 fails with _errors = [{"code": "101", "message": "Missing closing bracket."}].
  16. Step 2: Conversion. Records 1 and 4 convert to the target schema (id, firstName, lastName, birthDate, rewardsLevel), record 4 with an empty lastName; record 3 fails with _errors = [{"code": "355", "message": "Invalid Date", "column": "bday"}] because "1985-02" is not a complete date.
  17. Step 3: Validation. Record 1 passes; record 4 fails with _errors = [{"code": "401", "message": "Required value", "column": "lastName"}, {"code": "411", "message": "Invalid enum value: blah, must be one of bronze|silver|gold.", "column": "rewardsLevel"}].
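
The two failing checks on this slide are straightforward to express over the converted DataFrame. A hedged sketch, with column names from the example and `converted` assumed to be the conversion output:

    import org.apache.spark.sql.functions.col

    // Required-value check on lastName (error 401) and enum membership
    // check on rewardsLevel (error 411), as in the example above.
    val enumValues = Seq("bronze", "silver", "gold")
    val invalid = converted.filter(
      col("lastName").isNull || !col("rewardsLevel").isin(enumValues: _*)
    )
    val valid = converted.exceptAll(invalid)
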
  18. Step 4: Persisting the Good. The one fully valid record (1, Jared, Dunn, 1988-11-13, bronze) is written to the Data Lake.
  19. Step 5: Quarantining the Bad. The error rows from every stage (record 2's parse error, record 3's conversion error, record 4's validation errors) are unioned and joined back to the parser output, so each quarantined record keeps its original columns alongside its _errors.
  20. Weaving It All Together. Inside Siphon, the Plugin Runtime chains Parser → Converter → Validator, passing errors + data along at each stage; data flows to the Data Sink and errors to the Error Sink.
  21. The How
  22. Challenges: the DSL, the APIs, parsing errors, conversion/validation errors, error consolidation, error trapping using a custom expression, and externalization of errors.
  23. Domain-Specific Language. A pipeline is declared as configuration and interpreted by the SIP Runtime, which wires the named parser, converters, validators, data sink, and error sink together, passing errors + data between stages:
     {
       "parser": "csv",
       "converters": ["mapper"],
       "validators": ["isRequiredCheck", "enumCheck", "isIdentityCheck"],
       "dataSink": "dataLake",
       "errorSink": "quarantine"
     }
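
As a sketch of how such a document might be consumed, the DSL deserializes naturally into a small spec whose entries are looked up in plugin registries. The case class below mirrors the JSON fields; the registry resolution itself is an assumption, not something shown in the deck:

    // Hypothetical pipeline spec mirroring the DSL: each string is a
    // registry key the SIP Runtime would resolve to a plugin instance.
    case class PipelineSpec(
      parser: String,
      converters: Seq[String],
      validators: Seq[String],
      dataSink: String,
      errorSink: String
    )
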
  24. Converter Interface (code shown on slide)
  25. ConvertResult Interface (code shown on slide)
  26. Validator Interface (code shown on slide)
  27. ValidateResult Interface (code shown on slide)
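
Slides 24 through 27 are code screenshots that did not survive transcription. As a placeholder, here is a minimal Scala sketch of plugin contracts with these shapes; the names match the slide titles, but the signatures are assumptions rather than the actual API:

    import org.apache.spark.sql.DataFrame

    // A converter transforms incoming rows, returning the successfully
    // converted rows together with the rows that failed conversion.
    trait Converter {
      def convert(input: DataFrame): ConvertResult
    }
    case class ConvertResult(converted: DataFrame, errors: DataFrame)

    // A validator does not transform rows; it partitions them into valid
    // rows and rows annotated with an _errors column for quarantine.
    trait Validator {
      def validate(input: DataFrame): ValidateResult
    }
    case class ValidateResult(valid: DataFrame, errors: DataFrame)
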
  28. Parsing Errors. Parsing is processed only once, by SIP, at the beginning, and applies only to file sources like CSV and JSON. SIP relies on Spark to capture the parsing errors by passing the appropriate read options (for CSV: mode = PERMISSIVE, columnNameOfCorruptRecord = "_corrupt_record"). Parsing-error records are captured by applying a predicate on the _corrupt_record column; good records are passed on to the plugins for further processing.
  29. Parsing Errors, example. p.json has no record terminator on its second record, and the third row of p.csv does not conform to the schema:
     p.json:
     { "name": "John", "age": 30 }
     { "name": "Mike", "age": 20
     p.csv:
     name,age
     John,30
     Mike,20,20
     spark.read.json("p.json").show(false)
     spark.read.schema(csvSchema).options(csvOptions).csv("p.csv").show(false)
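
Put together, the CSV case the slides describe looks roughly like this. A self-contained sketch: csvSchema and the file name come from the example above, and the cache() works around Spark's restriction on filtering the corrupt-record column of an uncached file scan:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.types._

    val spark = SparkSession.builder().master("local[*]").appName("parse-errors").getOrCreate()

    // The schema must include the corrupt-record column so PERMISSIVE mode
    // routes unparseable rows into it instead of failing the read.
    val csvSchema = new StructType()
      .add("name", StringType)
      .add("age", IntegerType)
      .add("_corrupt_record", StringType)

    val raw = spark.read
      .schema(csvSchema)
      .option("header", "true")
      .option("mode", "PERMISSIVE")
      .option("columnNameOfCorruptRecord", "_corrupt_record")
      .csv("p.csv")
      .cache()

    // The predicate on _corrupt_record splits the batch: bad rows are
    // quarantined, good rows continue into the plugin pipeline.
    val bad  = raw.filter(raw("_corrupt_record").isNotNull)
    val good = raw.filter(raw("_corrupt_record").isNull).drop("_corrupt_record")
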
  30. Conversion/Validation Errors. SIP invokes the plugins in sequence. A converter plugin collects both good and error records and passes the good records to the next plugin in the sequence. A validator plugin returns the error records. An error record may be processed multiple times in order to capture all possible errors for it; for example, plugin-1 and plugin-2 may each find different errors in different columns of the same record. (A minimal sequencing sketch follows.)
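
Using the trait sketch above, the converter sequencing reduces to a fold; emptyErrors is an assumed empty DataFrame with the error schema:

    import org.apache.spark.sql.DataFrame

    // Each converter sees only the rows that survived the previous one,
    // while every stage's error rows accumulate for later quarantine.
    def runConverters(parsedGood: DataFrame, emptyErrors: DataFrame,
                      converters: Seq[Converter]): ConvertResult =
      converters.foldLeft(ConvertResult(parsedGood, emptyErrors)) { (acc, c) =>
        val step = c.convert(acc.converted)
        ConvertResult(step.converted, acc.errors.unionByName(step.errors))
      }
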
  31. Error Consolidation. INPUT_DATA (first_name, last_name, age) is first tagged with a row_id via monotonically_increasing_id(), producing DATA_WITH_ROW_ID: (John, Vanau, 40, row 1), (Jack, NULL, 24, row 2), (Michael, Shankar, -32, row 3). Applying the mapping rule first_name || last_name → full_name succeeds for rows 1 and 3 but fails row 2 with _errors = [[last_name, ERR-100, "Field `last_name` can not be null"]]. Applying the TARGET_SCHEMA (full_name String, no constraint; age Short, constraint age > 0) then fails row 3 with _errors = [[age, ERR-200, "Field `age` cannot be < 0"]]. The per-stage error sets are unioned and joined back to DATA_WITH_ROW_ID on row_id to produce the final error output (the original columns plus _errors for rows 2 and 3), while an anti-join against the error row_ids yields the final success output: (John Vanau, 40).
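
The consolidation above reduces to a union of per-stage error sets plus a join and an anti-join on row_id. A sketch that assumes `input`, `mappingErrors`, and `schemaErrors` DataFrames shaped as on the slide, with _errors an array per row:

    import org.apache.spark.sql.functions._

    // Tag every input row with a stable id so errors found downstream can
    // be traced back to the original record.
    val withId = input.withColumn("row_id", monotonically_increasing_id())

    // Union the error sets from each stage, merging error lists per row_id.
    val allErrors = mappingErrors
      .unionByName(schemaErrors)
      .groupBy("row_id")
      .agg(flatten(collect_list(col("_errors"))).as("_errors"))

    // Final errors keep the original columns alongside _errors; the
    // anti-join leaves only the rows that passed every stage.
    val finalErrors  = withId.join(allErrors, Seq("row_id"))
    val finalSuccess = withId.join(allErrors, Seq("row_id"), "left_anti")
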
  32. Error Trapping. Most of the existing conversions and validations use UDFs, and nested type conversions use nested UDFs; it is currently not possible to capture errors from nested UDFs. A custom expression is therefore used to trap errors: it captures the input column value, error code, and error text when an error occurs, and the output column value upon successful conversion/validation.
  33. Error Trapping (contd.): the custom expression (code shown on slide)
  34. Error Trapping, example (code shown on slide)
  35. Error Trapping, example continued (code shown on slide)
  36. Error Trapping, example continued (code shown on slide); a stand-in sketch of the idea follows.
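
Since the custom expression appears only as screenshots, here is a stand-in sketch of the trapping idea, not the authors' Catalyst expression: a UDF that returns a struct carrying either the converted value or the error details (error code 355 reused from the earlier example):

    import org.apache.spark.sql.functions.{col, udf}
    import java.sql.Date
    import scala.util.{Failure, Success, Try}

    // Each row yields either a converted value or (code, message), so one
    // pass produces both the output column and its error column.
    case class Trapped(value: Option[Date], errCode: Option[String], errMsg: Option[String])

    val toDateTrapped = udf { (raw: String) =>
      Try(Date.valueOf(raw)) match {
        case Success(d) => Trapped(Some(d), None, None)
        case Failure(_) => Trapped(None, Some("355"), Some(s"Invalid Date: $raw"))
      }
    }

    // Usage: split the struct into the converted column and error fields.
    // df.withColumn("r", toDateTrapped(col("bday")))
    //   .select(col("r.value").as("birthDate"), col("r.errCode"), col("r.errMsg"))
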
  37. The Results
  38. Benefits. Cross-cutting: scalable engineering, separation of responsibilities, more readable code, more testable code, easier to maintain. Data processing: more ETL features, more validation features, more error-reporting features, minimized latency (from 10 min to 10 sec), reuse, 50% or more storage savings, 50% or more compute savings.
  39. Feedback. Your feedback is important to us. Don't forget to rate and review the sessions.
