Real-Time ETL Processing
By Veeramani Moorthy
Agenda
Real-Time ETL Architecture
Requirements for Reconciler
Why Reconciler?
Reconciler Data model
Q & A
Real-Time ETL Architecture
[Architecture diagram; labels recovered from the figure, grouped by numbered step]
Source DB, GoldenGate
[1.0] Data Extract
Trail Files
[1.1] Data Pump
Adapter: Read
[1.2] Get/Create/Update Schema (Schema Registry)
[1.2.1] JDBC Fetch Table Schema
[2.1] CDC Events to broker (Companies Topic, Addresses Topic)
Streaming Reconciler job (Spark Reconciler): Get Table Schema, Read/Write for Reconcile Companies, Read/Write for Reconcile Addresses, Write output
Reconciled Companies Topic, Reconciled Addresses Topic
[3.1] CDC Events to broker
Streaming Joiner/Transformer Job (Spark Joiner): Get Table Schema, fn
Mapping service: Get Mapping
• Schema Registry is a repository of ALL schemas, which are versioned.
• GoldenGate captures the table change events
• Kafka – Distributed Messaging system
• CDC – Change Data Capture
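The "repository of ALL schemas which are versioned" idea behind step [1.2] can be sketched in a few lines. This is only an in-memory illustration: the class name SchemaRegistry, its register/get methods, and the "companies" table name are assumptions made for the example, not the API of the registry used in the talk.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class SchemaRegistry:
    """In-memory stand-in for a versioned schema registry (hypothetical API)."""
    # table name -> list of column tuples; list index + 1 is the version number
    _versions: Dict[str, List[Tuple[str, ...]]] = field(default_factory=dict)

    def register(self, table: str, columns: List[str]) -> int:
        """Create or update a schema; return the version that holds `columns`."""
        history = self._versions.setdefault(table, [])
        cols = tuple(columns)
        if history and history[-1] == cols:
            return len(history)  # unchanged schema: reuse the current version
        history.append(cols)     # changed schema: keep the old one, add a new version
        return len(history)

    def get(self, table: str, version: int = 0) -> Tuple[str, ...]:
        """Fetch a schema by version; version 0 means "latest"."""
        history = self._versions[table]
        return history[version - 1] if version else history[-1]


registry = SchemaRegistry()
base = ["company_id", "company_name", "company_addr"]
v1 = registry.register("companies", base)
v1_again = registry.register("companies", base)            # same columns, same version
v2 = registry.register("companies", base + ["registered_name"])
```

Older versions stay readable after an update, so a consumer holding an event written under version 1 can still look up the schema it was written with.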
Requirements for Reconciler
Support for Idempotency
Support for immutability
Support for Schema evolution
Support to handle out of order CDC events
Challenges in Spark streaming
Out-of-sequence events
The UPDATE arrives first; the INSERT arrives later
Challenges in Spark streaming …
Data model
Instead of

Company_id   Company_name   Company_addr
10201        ABC Inc        EGL, BLR
….

Go with

Tuple Id   Source DB Timestamp   Attribute Name   Attribute value   isDelete?
10201      12345677              company_id       10201             false
10201      12345677              company_name     ABC Inc           false
10201      12345677              company_addr     EGL, BLR          false
10201      22345677              company_addr     Ecospace, BLR     false
….
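A minimal sketch of the conversion from the wide row to the attribute-level rows shown on this slide. The names AttributeTuple and to_attribute_tuples, and the choice of the key column as the Tuple Id, are assumptions made for illustration.

```python
from typing import Dict, List, NamedTuple


class AttributeTuple(NamedTuple):
    tuple_id: str
    source_ts: int   # source DB timestamp of the change
    name: str        # attribute (column) name
    value: str       # attribute value
    is_delete: bool


def to_attribute_tuples(key_column: str, row: Dict[str, str], source_ts: int,
                        is_delete: bool = False) -> List[AttributeTuple]:
    """Emit one attribute-level row per column of a wide row."""
    tuple_id = row[key_column]  # assumption: the key column value is the Tuple Id
    return [AttributeTuple(tuple_id, source_ts, name, value, is_delete)
            for name, value in row.items()]


rows = to_attribute_tuples(
    "company_id",
    {"company_id": "10201", "company_name": "ABC Inc", "company_addr": "EGL, BLR"},
    source_ts=12345677,
)
```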
How does it solve?
Immutability?
Idempotency?
Out of sequence events?
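One plausible reading of how the attribute-level model answers these three questions, sketched under stated assumptions: rows are only ever appended (immutability), the store is a set so replaying an event changes nothing (idempotency), and the current view takes the row with the highest source timestamp per attribute, so arrival order does not matter (out-of-sequence events). The function names append and current_view are hypothetical, and ties on the timestamp are not handled here.

```python
from typing import Dict, FrozenSet, Iterable, Tuple

# (tuple_id, source_ts, attribute_name, attribute_value, is_delete)
Fact = Tuple[str, int, str, str, bool]


def append(store: FrozenSet[Fact], events: Iterable[Fact]) -> FrozenSet[Fact]:
    """Return a new store; existing rows are never modified or removed."""
    return store | frozenset(events)


def current_view(store: FrozenSet[Fact], tuple_id: str) -> Dict[str, str]:
    """Latest value per attribute, chosen by source timestamp, not arrival order."""
    latest: Dict[str, Fact] = {}
    for fact in store:
        if fact[0] != tuple_id:
            continue
        name = fact[2]
        # Ties on the timestamp are not resolved in this sketch.
        if name not in latest or fact[1] > latest[name][1]:
            latest[name] = fact
    return {name: f[3] for name, f in latest.items() if not f[4]}


insert = [
    ("10201", 12345677, "company_id", "10201", False),
    ("10201", 12345677, "company_name", "ABC Inc", False),
    ("10201", 12345677, "company_addr", "EGL, BLR", False),
]
update = [("10201", 22345677, "company_addr", "Ecospace, BLR", False)]

in_order = append(append(frozenset(), insert), update)
out_of_order = append(append(frozenset(), update), insert)  # UPDATE first
replayed = append(in_order, update)                         # same event twice
view = current_view(out_of_order, "10201")
```

All three stores hold the same four rows, and the view resolves company_addr to the value with the later source timestamp regardless of which event arrived first.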
Schema Evolution
Tuple Id   Source DB Timestamp   Attribute Name    Attribute value     isDelete?
10201      12345677              company_id        10201               false
10201      12345677              company_name      ABC Inc             false
10201      12345677              company_addr      EGL, BLR            false
10201      22345677              company_addr      Ecospace, BLR       false
10201      22345900              Registered_name   ABC India Pvt Ltd   false
….
Do I have to change the destination schema?
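A small sketch of what the table above suggests: a new source column shows up as one more row, while the five-field layout of the destination stays the same. The FIELDS tuple and the helper attribute_names are names invented for this example.

```python
from typing import List, Set, Tuple

# (tuple_id, source_ts, attribute_name, attribute_value, is_delete)
Fact = Tuple[str, int, str, str, bool]
FIELDS = ("tuple_id", "source_ts", "attribute_name", "attribute_value", "is_delete")

facts: List[Fact] = [
    ("10201", 12345677, "company_id", "10201", False),
    ("10201", 12345677, "company_name", "ABC Inc", False),
    ("10201", 12345677, "company_addr", "EGL, BLR", False),
    ("10201", 22345677, "company_addr", "Ecospace, BLR", False),
]


def attribute_names(rows: List[Fact]) -> Set[str]:
    """Set of attribute names currently present in the store."""
    return {row[2] for row in rows}


before = attribute_names(facts)
# The source table gains a column; the destination only gains a row.
facts.append(("10201", 22345900, "Registered_name", "ABC India Pvt Ltd", False))
after = attribute_names(facts)

new_columns = after - before
same_layout = all(len(row) == len(FIELDS) for row in facts)
```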
Schema Evolution
Addition of new column
Deletion of an existing column
Data Type change
Real time ETL processing using Spark streaming