Modern ETL Pipelines with Change Data Capture

In this talk we'll present how, at GetYourGuide, we built a completely new ETL pipeline from scratch using Debezium, Kafka, Spark, and Airflow, one that can automatically handle schema changes. Our starting point was an error-prone legacy system that ran daily and was vulnerable to breaking schema changes, which caused many sleepless on-call nights. Like most companies, we have traditional SQL databases that we need to connect to in order to extract relevant data.

This is usually done through either full or partial copies of the data with tools such as Sqoop. However, another approach that has become quite popular lately is to use Debezium as the change data capture layer: it reads the databases' binlogs and streams these changes directly to Kafka. Since having data once a day is no longer enough for our business, and we wanted our pipelines to be resilient to upstream schema changes, we decided to rebuild our ETL around Debezium.

We'll walk the audience through the steps we followed to architect and develop such a solution, using Databricks to reduce operation time. With this new pipeline we are now able to refresh our data lake multiple times a day, giving our users fresh data and protecting our nights of sleep.
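
For context on the change data capture approach described above: Debezium turns each binlog entry into a change event that arrives in Kafka as an envelope holding the old and new row state. The sketch below is a rough Scala model of that envelope; the Customer payload and its fields are hypothetical and only illustrate the shape of the data the downstream pipeline consumes, not code from the talk.

    // Rough Scala model of the Debezium change-event envelope that arrives in Kafka.
    // The Customer payload is hypothetical; a real event also carries a "source" block
    // with binlog coordinates, omitted here for brevity.
    final case class Customer(id: Long, email: String)

    final case class ChangeEvent[A](
      before: Option[A], // row state before the change (None for inserts)
      after: Option[A],  // row state after the change (None for deletes)
      op: String,        // "c" = create, "u" = update, "d" = delete, "r" = snapshot read
      tsMs: Long         // processing time in epoch milliseconds
    )

    // An update to a customer row would look roughly like:
    // ChangeEvent(Some(Customer(42, "old@example.com")), Some(Customer(42, "new@example.com")), "u", 1572531234000L)
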

Modern ETL Pipelines with Change Data Capture

  1. WIFI SSID: Spark+AISummit | Password: UnifiedDataAnalytics
  2. Thiago Rigo and David Mariassy, GetYourGuide. Modern ETL Pipelines with Change Data Capture. #UnifiedDataAnalytics #SparkAISummit
  3. Who are we? 5 years of experience in Business Intelligence and Data Engineering roles in the Berlin e-commerce scene; Data Engineer, Data Platform. Software engineer for the past 7 years, the last 3 focused on data engineering; Senior Data Engineer, Data Platform.
  4. Agenda: 1) Intro to GetYourGuide, 2) GYG's Legacy ETL Pipeline, 3) Rivulus ETL Pipeline, 4) Conclusion, 5) Questions
  5. Intro to GetYourGuide
  6. We make it simple to book and enjoy incredible experiences
  7. Europe's largest marketplace for travel experiences: 50k+ products in 150+ countries, 25M+ tickets sold, $650M+ in VC funding, 600+ strong global team, 150+ traveler nationalities
  8. GYG's Legacy ETL Pipeline
  9. Where we started: breaking schema changes upstream, requires special knowledge, long recovery times, difficult to test, bad SLAs
  10. What we wanted, in contrast to those legacy pains: automatic handling of schema changes, familiar tooling (Scala, SQL), maximum parallelism, built for testability, better SLAs
  11. Rivulus ETL Pipeline
  12. Overview
  13. Extraction Layer
  14. The pipeline
  15. Debezium ● Open-source distributed platform for change data capture ● Can read several databases: MySQL, Postgres, Cassandra, Oracle, SQL Server, and MongoDB ● Works as a connector on Kafka Connect ● Streams the database's event log (binlog changes) into Kafka
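
A hedged sketch of how such a connector might be registered with Kafka Connect's REST API is shown below. The host names, credentials, and server name are placeholders, and exact configuration property names vary between Debezium versions; this illustrates the mechanism, not GetYourGuide's actual setup.

    // Sketch: register a Debezium MySQL connector via Kafka Connect's REST API (placeholders throughout).
    import java.net.URI
    import java.net.http.{HttpClient, HttpRequest, HttpResponse}

    object RegisterConnector {
      val connectorJson: String =
        """{
          |  "name": "gyg-mysql-connector",
          |  "config": {
          |    "connector.class": "io.debezium.connector.mysql.MySqlConnector",
          |    "database.hostname": "mysql.internal",
          |    "database.port": "3306",
          |    "database.user": "debezium",
          |    "database.password": "********",
          |    "database.server.id": "184054",
          |    "database.server.name": "gyg",
          |    "database.history.kafka.bootstrap.servers": "kafka:9092",
          |    "database.history.kafka.topic": "schema-changes.gyg"
          |  }
          |}""".stripMargin

      def main(args: Array[String]): Unit = {
        // POST the connector config to the Kafka Connect REST endpoint (default port 8083).
        val request = HttpRequest.newBuilder()
          .uri(URI.create("http://kafka-connect:8083/connectors"))
          .header("Content-Type", "application/json")
          .POST(HttpRequest.BodyPublishers.ofString(connectorJson))
          .build()
        val response = HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())
        println(s"${response.statusCode()} ${response.body()}")
      }
    }
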
  16. Schema Service: automatic handling of schema changes ● Scala library ● Keeps track of all schema changes applied to the tables ● Holds PK, timestamp, and partition columns ● Prevents breaking changes from being introduced (e.g. incompatible type changes) ● Upcasts types where safe ● Works on the column level
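
The slides don't show the Schema Service's code, but a column-level compatibility check of the kind described could look roughly like the sketch below. The type names and the set of allowed upcasts are assumptions for illustration, not the actual service.

    // Hypothetical column-level compatibility check: allow safe widening ("upcasts"),
    // flag everything else as a breaking change.
    object SchemaCompatibility {
      private val safeUpcasts: Map[String, Set[String]] = Map(
        "int"   -> Set("long", "double"),
        "long"  -> Set("double"),
        "float" -> Set("double"),
        "date"  -> Set("timestamp")
      )

      sealed trait Decision
      case object Unchanged extends Decision
      final case class Upcast(from: String, to: String) extends Decision
      final case class Breaking(column: String, from: String, to: String) extends Decision

      def check(column: String, oldType: String, newType: String): Decision =
        if (oldType == newType) Unchanged
        else if (safeUpcasts.getOrElse(oldType, Set.empty).contains(newType)) Upcast(oldType, newType)
        else Breaking(column, oldType, newType)
    }
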
  17. Avro Converter: automatic handling of schema changes ● Regular Scala application ● Runs as part of an Airflow DAG ● Reads raw Avro files from S3 ● Communicates with the Schema Service to handle schema changes automatically ● Writes out Parquet files
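
A minimal sketch of such a conversion step in Spark is shown below, assuming the spark-avro package is on the classpath. The S3 paths are placeholders, and the Schema Service call made by the real application is omitted.

    // Sketch: read raw Avro landed on S3 and write it back out as Parquet.
    import org.apache.spark.sql.SparkSession

    object AvroToParquet {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("avro-converter").getOrCreate()

        // Raw Debezium output landed on S3 as Avro files (placeholder path).
        val raw = spark.read.format("avro").load("s3://my-bucket/raw/gyg/customer/")

        // The real pipeline reconciles column types with the Schema Service here;
        // this sketch simply converts the data to Parquet.
        raw.write.mode("append").parquet("s3://my-bucket/converted/gyg/customer/")

        spark.stop()
      }
    }
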
  18. Upsert ● Spark application ● Runs as part of an Airflow DAG ● Reads in new Parquet files ● Communicates with the Schema Service to get PK, timestamp, and partition columns ● Compacts the data based on the table's PK ● Creates a Hive table that contains a replica of the source DB
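
A hedged sketch of the compaction idea follows: keep only the latest change event per primary key using a window function. The column names (customer_id, update_timestamp), paths, and table names are illustrative, not the talk's actual schema.

    // Sketch: PK-based compaction and materialisation as a Hive table.
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.{col, row_number}

    object Upsert {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("upsert").enableHiveSupport().getOrCreate()

        val changes = spark.read.parquet("s3://my-bucket/converted/gyg/customer/")

        // Rank change events per primary key by recency and keep only the newest one.
        val latestPerKey = Window.partitionBy("customer_id").orderBy(col("update_timestamp").desc)
        val compacted = changes
          .withColumn("rn", row_number().over(latestPerKey))
          .filter(col("rn") === 1)
          .drop("rn")

        // Materialise the replica of the source table in Hive.
        compacted.write.mode("overwrite").saveAsTable("replica.gyg_customer")

        spark.stop()
      }
    }
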
  19. Transformation Layer
  20. The performance penalty of managing transformation dependencies inefficiently
  21. The gradual forsaking of performance on the altar of dependency management: humble beginnings ● Small set of transformations. ● Small team / single engineer. ● Simple one-to-one type dependencies. ● Defining an optimal dependency model by hand is possible.
  22. Complexity on the horizon ● Growing set of transformations. ● Growing team. ● One-to-many / many-to-many type dependencies. ● Defining a dependency model by hand becomes cumbersome and error-prone.
  23. The hard choice between performance and correctness ● As optimal dependency models become ever more difficult to maintain and expand manually without making errors, teams decide to optimise for correctness over performance. ● This results in crude dependency models with a lot of sequential execution in places where parallelization would be possible.
  24. The performance bottleneck strikes back ● Sequential execution results in long execution and long recovery times; in other words, poor SLAs. ● 💣🔥
  25. Rivulus SQL for automated dependency inference (maximum parallelism)
  26. Main components ● SQL transformations: a collection of Rivulus SQL files that make use of a set of custom template variables. ● Executor app: a Spark app that executes a single transformation at a time. ● DGB (Dependency Graph Builder): parses all files in the SQL library and builds a dependency graph of the transformations by interpolating Rivulus SQL template vars. ● Airflow: executes the transformations on Databricks in the order specified by the DGB.
  27. Rivulus SQL syntax (familiar tooling: Scala, SQL) ● {% reference:target 'dim_tour' %}: declares a dependency between this transformation and the dim_tour transformation, which must be defined in the same SQL library. ● {% reference:source 'gyg__customer' %}: declares a dependency between this transformation and a raw data source (gyg.customer) that is loaded into Hive by an extraction job. ● {% load 'file.sql' %}: loads a reusable subquery defined in file.sql into this transformation.
  28. Example: from Rivulus SQL to Airflow. At build time, the DGB parses a Rivulus SQL transformation such as

        SELECT nps_feedback_id, nps_feedback_stage_id, booking_id, score, feedback, update_timestamp, source
        FROM {% reference:source 'gyg__nps_feedback' %} AS nf
        LEFT JOIN {% reference:target 'dim_nps_feedback_stage' %} nfs
          ON nfs.nps_feedback_stage_name = nf.stage

      and emits the dependency graph entry

        {
          "fact_nps_feedback": {
            "source_dependencies": ["gyg__nps_feedback"],
            "transformation_dependencies": ["dim_nps_feedback_stage"]
          }
        }

      At runtime, Airflow uses this graph to schedule the Executor app invocations on Databricks.
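
A rough Scala illustration of how a dependency graph builder could extract source and transformation dependencies from Rivulus SQL template variables is shown below; the object and method names are hypothetical, not the actual DGB implementation.

    // Sketch: extract {% reference:source/target '...' %} dependencies from a Rivulus SQL file.
    object DependencyGraphBuilder {
      private val ReferencePattern =
        """\{%\s*reference:(source|target)\s+'([^']+)'\s*%\}""".r

      final case class Dependencies(sources: Set[String], transformations: Set[String])

      def dependenciesOf(sql: String): Dependencies = {
        val refs = ReferencePattern.findAllMatchIn(sql).toList
        Dependencies(
          sources         = refs.collect { case m if m.group(1) == "source" => m.group(2) }.toSet,
          transformations = refs.collect { case m if m.group(1) == "target" => m.group(2) }.toSet
        )
      }
    }

    // For the fact_nps_feedback SQL above, dependenciesOf would yield
    // Dependencies(Set("gyg__nps_feedback"), Set("dim_nps_feedback_stage")),
    // matching the graph entry the DGB emits.
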
  29. A word on testing (built for testability) ● Maximum parallelism enhances testability ● Separation of config from code: configurable input and output paths
  30. Conclusion
  31. Results ● Eliminated vulnerability to upstream schema changes ● Democratized our ETL by migrating all business logic to SQL ● Minimized recovery time by maximizing parallelism ● Designed for E2E testability ● Cut processing time by 70% (further reductions are possible)
  32. Next Steps ● Intra-day micro-batches ● Database Replication as a Service ● Rivulus SQL is GYG's standard tool for writing transformations ● Delta for Upsert
  33. Questions? We're hiring! https://careers.getyourguide.com