EDW Optimization with Hadoop and CDAP
Change Data Capture
Sagar Kapare
Software Engineer, Cask
● Data Warehouses are expensive
○ $20,000 to $40,000 per terabyte
○ Forced to choose between costly expansion and discarding data
○ Often proprietary technology on an appliance
● Data Warehouses are limited
○ Unable to handle petabyte-scale data volumes
○ Unable to handle unstructured and semi-structured data
○ Limited access to the data due to low-concurrency capabilities
● Expensive computations at lower cost
● Handles large volumes of data
● Horizontally scalable
● Lots of projects and tooling around it!
● Dumping the whole database through an export/import process
● Incremental ingest
○ Timestamp on rows (sketched below)
○ Version number on rows
○ Custom applications
○ Change Data Capture (CDC) through transactional logs
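A minimal sketch of the timestamp-on-rows approach, assuming a hypothetical `orders` table with a `last_modified` column; the JDBC connection details, table, and column names are illustrative:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Timestamp;

public class TimestampIncrementalIngest {
    public static void main(String[] args) throws Exception {
        // In practice the high-water mark would be loaded from a checkpoint store.
        Timestamp lastRun = Timestamp.valueOf("2017-01-01 00:00:00");

        // Hypothetical connection details and table/column names.
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:oracle:thin:@//dbhost:1521/ORCL", "user", "password");
             PreparedStatement stmt = conn.prepareStatement(
                 "SELECT id, name, last_modified FROM orders WHERE last_modified > ?")) {
            stmt.setTimestamp(1, lastRun);
            try (ResultSet rs = stmt.executeQuery()) {
                while (rs.next()) {
                    // Ship each changed row to the target store (e.g., HDFS or Kudu).
                    System.out.printf("%d %s %s%n",
                        rs.getLong("id"), rs.getString("name"),
                        rs.getTimestamp("last_modified"));
                }
            }
        }
    }
}
```

Note that this approach never observes deleted rows and depends on the application reliably maintaining the timestamp column, which is one reason log-based CDC is attractive.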
● Minimal impact on the database
● No need to make programmatic changes to the application
● Low latency in acquiring changes
● No need to change the database schema
[Diagram: CDC reads DML operations (Insert, Update, Delete) from the Oracle transactional logs]
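Conceptually, each captured change carries the operation type plus before/after row images. A minimal sketch of such a change record; the field names and `OpType` enum are assumptions for illustration, not an actual GoldenGate record schema:

```java
import java.util.Map;

// Illustrative shape of a captured change; field names and the OpType enum
// are assumptions, not an actual GoldenGate record schema.
public class ChangeEvent {
    public enum OpType { INSERT, UPDATE, DELETE }

    public final String table;         // fully qualified source table, e.g. "SCOTT.ORDERS"
    public final OpType opType;        // which DML operation was captured
    public final long commitTimestamp; // commit time of the source transaction
    public final Map<String, Object> before; // row image before the change (UPDATE/DELETE)
    public final Map<String, Object> after;  // row image after the change (INSERT/UPDATE)

    public ChangeEvent(String table, OpType opType, long commitTimestamp,
                       Map<String, Object> before, Map<String, Object> after) {
        this.table = table;
        this.opType = opType;
        this.commitTimestamp = commitTimestamp;
        this.before = before;
        this.after = after;
    }
}
```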
[Architecture: Oracle transactional logs → Oracle GoldenGate connector → Kafka → Ingest → Normalize → Store → Kudu → Impala query]
- The Oracle transactional logs persist all DDL and DML operations
- The GoldenGate connector publishes DDL and DML to Kafka
- The pipeline consumes GoldenGate events, normalizes the data, and performs upserts and deletes in Kudu
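The ingest step is an ordinary Kafka consumer reading the events the GoldenGate connector publishes. A minimal sketch with the Kafka Java client; the broker address, consumer group, and topic name are assumptions:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class GoldenGateEventConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker:9092");   // assumed broker address
        props.put("group.id", "cdc-ingest");             // assumed consumer group
        props.put("key.deserializer",
            "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
            "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("ogg-events")); // assumed topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    // Each record is one DDL or DML event from the GoldenGate connector;
                    // downstream stages parse, normalize, and apply it to Kudu.
                    System.out.println(record.value());
                }
            }
        }
    }
}
```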
● Enterprise-grade, production-ready solution
● Self Service - drag and drop data pipelines
○ Create operational pipelines within minutes
● Operational Metrics
● Operational Logging
● Audit Trails - READ, WRITE, METADATA change, etc.
● Lineage - Who moved my dataset?
● Security - Authentication, Authorization, and Impersonation
● Ease of Operations
Unified Integration Platform for Big Data
[CDAP data pipeline: Oracle transactional logs → Oracle GoldenGate connector → Kafka → Golden Gate Source → CDC Normalizer → CDC Kudu Sink → Kudu → Impala query]
- The Oracle transactional logs persist all DDL and DML operations
- The GoldenGate connector publishes DDL and DML to Kafka
- In the CDAP data pipeline, the Golden Gate Source consumes GoldenGate events, the CDC Normalizer normalizes the data, and the CDC Kudu Sink performs upserts and deletes in Kudu
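Each stage of a CDAP data pipeline is a plugin. A minimal skeleton of a transform stage in the shape of the CDC Normalizer, using the Cask-era CDAP ETL API; the plugin name and pass-through body are placeholders, not the actual plugin's implementation:

```java
import co.cask.cdap.api.annotation.Name;
import co.cask.cdap.api.annotation.Plugin;
import co.cask.cdap.api.data.format.StructuredRecord;
import co.cask.cdap.etl.api.Emitter;
import co.cask.cdap.etl.api.Transform;

// Skeleton of a transform stage; the real CDC Normalizer's logic is not shown.
@Plugin(type = Transform.PLUGIN_TYPE)
@Name("CDCNormalizerSketch") // hypothetical plugin name
public class CDCNormalizerSketch extends Transform<StructuredRecord, StructuredRecord> {
    @Override
    public void transform(StructuredRecord input, Emitter<StructuredRecord> emitter)
            throws Exception {
        // A real implementation would turn a raw GoldenGate event into a
        // normalized change record for the CDC Kudu Sink; this just passes through.
        emitter.emit(input);
    }
}
```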
● DML Operations (Insert, Update, and Delete; sketched below)
● DDL Operations (Add column, Remove column)
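With Kudu as the target, insert and update events both map to upserts keyed on the primary key, and deletes need only the key columns. A minimal sketch with the Kudu Java client; the master address, table, and column names are assumptions:

```java
import org.apache.kudu.client.Delete;
import org.apache.kudu.client.KuduClient;
import org.apache.kudu.client.KuduSession;
import org.apache.kudu.client.KuduTable;
import org.apache.kudu.client.PartialRow;
import org.apache.kudu.client.Upsert;

public class KuduChangeApplier {
    public static void main(String[] args) throws Exception {
        // Master address, table, and column names are assumptions.
        try (KuduClient client =
                 new KuduClient.KuduClientBuilder("kudu-master:7051").build()) {
            KuduTable table = client.openTable("orders");
            KuduSession session = client.newSession();

            // INSERT and UPDATE events both map to an upsert on the primary key.
            Upsert upsert = table.newUpsert();
            PartialRow row = upsert.getRow();
            row.addLong("id", 42L);
            row.addString("status", "SHIPPED");
            session.apply(upsert);

            // DELETE events need only the primary key columns.
            Delete delete = table.newDelete();
            delete.getRow().addLong("id", 42L);
            session.apply(delete);

            session.close(); // flushes any pending operations
        }
    }
}
```

Treating inserts and updates uniformly as upserts makes replaying events idempotent: applying the same change twice leaves the row in the same state.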
● Different Oracle CDC sinks, such as HBase, Hive, Database, etc.
● CDC for SQL Server
Sagar Kapare
sagar@cask.co
