Containerized Stream Engine to build
Modern Delta Lake
Sandeep Reddy Bheemi Reddy, Senior Data Engineer
Karthikeyan Siva Baskaran, Senior Data Engineer
Who We Are
Sandeep Reddy Bheemi Reddy
Senior Data Engineer
Karthikeyan Siva Baskaran
Senior Data Engineer
TIGER DATA FOUNDATION
Containerized Stream Engine to
build Modern Delta Lake
Contact us
+1-408-508-4430
info@tigeranalytics.com
https://www.tigeranalytics.com/
Agenda
Objective
Design Considerations
Infrastructure Provisioning
Solution Deep Dive
Application Monitoring
Points to be Noted
Questions
Objective
To Build a Single Source of Truth for the Enterprise via CDC
The most compelling operational analytics demand
real-time rather than historical data.
Data Agility
The speed of business is rapidly accelerating,
driving the need to deliver intelligent solutions
fast.
Accommodate large amounts of data from multiple sources by
tracking the changes made to the source data and combining
them to build a Single Source of Truth for
data-driven decisions.
Build SSOT from Siloed Data
Demand for real-time Data
Design Considerations
A Few Common Ways to Capture Data to Get Insights
Change Data Capture: App → DB → LOG → Analytics Data Lake
Dual Writes: App → DB and App → Pub Sub System → Analytics Data Lake
Direct JDBC: App → DB → Analytics Data Lake
Inconsistent Data
A job failure in overwrite mode leaves
the data in an inconsistent state
Schema Enforcement & Evolution
DDL changes are not supported, so the flow
breaks whenever upstream applications
change the schema
Roll Back not possible
In case of failure, it is not possible to
roll back to the previous state of the data
No Metadata layer
As there is no metadata layer, there is no clear
isolation between reads and writes, so operations
are not consistent, durable, or atomic
Schema Versioning · Data Corruption · Not ACID Compliant
Problems with Today’s Data Lake
Provides clear isolation
between different writes by
maintaining a log entry for each
transaction
Even a job failure in
overwrite mode will not
corrupt the data
Provides serializable isolation
levels to keep the data
consistent across multiple users
Changes to the table are
maintained as ordered, atomic
commits
ACID Compliant: Atomicity, Consistency, Isolation, Durability
mergeSchema – any column that is present in
the DataFrame but not in the target table is
automatically added to the end of the
schema.
overwriteSchema – for datatype changes and
dropped or renamed columns.
Time Travel to older versions
All metadata and lineage of your data are
stored; to travel back to a previous version of
your Delta table, provide a timestamp or a
specific version number.
Expectations for data quality that prevent
invalid data from entering your enterprise data
lake.
Data Check Constraints
Schema Enforcement & Evolution
Delta Lake to Rescue
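A minimal sketch of these features, assuming a Spark session with the delta-core package on the classpath; newDf and delta_tbl_loc are placeholders (the latter reused from a later slide):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("delta-features").getOrCreate()

// Schema evolution: columns present in the DataFrame but not in the table
// are appended to the end of the table schema.
newDf.write
  .format("delta")
  .mode("append")
  .option("mergeSchema", "true")
  .save(delta_tbl_loc)

// Time travel: read an older snapshot by version number or by timestamp.
val byVersion = spark.read.format("delta")
  .option("versionAsOf", 1)
  .load(delta_tbl_loc)
val byTime = spark.read.format("delta")
  .option("timestampAsOf", "2018-01-01 16:02:00")
  .load(delta_tbl_loc)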
Infrastructure Provisioning
On Premise
Code Repo – to maintain versions of Terraform files
Open Source Agent – security & compliance checks
Terraform – to deploy TF files and maintain the state of TF files
[Flow: DevOps professional → Code Repo (TF files) → CD pipeline → Deploy TF files / TF state files]
▪ Cloud Agnostic – create & manage infrastructure
across various platforms.
▪ Minimize human errors and configuration
differences across environments.
▪ Maintain the state of the infrastructure.
▪ Perform policy checks on the infrastructure.
IaC – Workflow
Infra Provisioning in Selected Environment
Kubernetes Cluster (With
Scalable worker nodes)
Pods (Deployment,
Replica Sets)
Launch the Deployment
Services (Node Port &
Load Balancer)
Volumes (PV & PVC)
Solution Deep Dive
[Architecture: Source Database → DB Logs → Kafka Connect (Change Data) → Kafka + Schema Registry (Streaming Queue) → Structured Streaming (Processing Layer) on Kubernetes → ADLS / S3 (Storage Layer)]
Kafka Connect uses the
Debezium connector
to parse the database
logs
[Diagram: each Avro message carries a schema id alongside the data; every schema version (Schema-1 → id 1001, Schema-2 → id 1002, … Schema-n → id n) is registered in the Schema Registry]
Provides flexibility by creating
a VIEW over the different schema
versions for different teams,
based on their needs. This lets
downstream apps keep running
without interruption when the
schema changes, as in the sketch below.
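For example, a hypothetical sketch (view and column names are made up) of a stable, team-specific projection over the evolving table:

// Downstream apps read the view, so newly added columns do not break them.
spark.sql(s"""
  CREATE OR REPLACE VIEW emp_reporting_view AS
  SELECT emp_no, first_name, last_name
  FROM ${db_tbl}
""")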
Persistent Volume Claims (PVC)
{
"name": "mssql-${DBName}-connector-${conn_string}",
"config":
{
"connector.class": "io.debezium.connector.sqlserver.SqlServerConnector",
"tasks.max": "1",
"database.hostname": "${Hostname}",
"database.port": "${Port}",
"database.user": "${UserName}",
"database.password": "${Password}",
"database.server.id": "${conn_string}",
"database.server.name": "${Source}.${DBName}.${conn_string}",
"database.whitelist": "${DBName}",
"database.dbname": "${DBName}",
"database.history.kafka.bootstrap.servers": "${KAFKA}:9092",
"database.history.kafka.topic": "${Source}.${DBName}.dbhistory",
"key.converter":"io.confluent.connect.avro.AvroConverter",
"key.converter.schema.registry.url":"http://${SCHEMA_REGISTRY}:8081",
"value.converter":"io.confluent.connect.avro.AvroConverter",
"value.converter.schema.registry.url": "http://${SCHEMA_REGISTRY}:8081",
}
}
Kafka Connector Properties
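On the consuming side, a minimal sketch of subscribing to the resulting change topic with Structured Streaming; the topic name assumes Debezium's <server.name>.<schema>.<table> convention and reuses the placeholders from the config above:

val cdcStream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", s"${KAFKA}:9092")
  .option("subscribe", s"${Source}.${DBName}.${conn_string}.dbo.sample_emp")
  .option("startingOffsets", "earliest")
  .load()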
{
"payload": {
"before": {
"emp_no": 1,
"birth_date": 18044,
"first_name": “Marion",
"last_name": “Colbrun"
},
"after": {
"emp_no": 1,
"birth_date": 18044,
"first_name": “Marion",
"last_name": “Brickell"
}
}
}
{
"payload": {
"before": {
"emp_no": 1,
"birth_date": 18044,
"first_name": "Marion",
"last_name": "Colbrun"
},
"after": null
}
}
{
"payload": {
"before": null,
"after": {
"emp_no": 1,
"birth_date": 18044,
"first_name": "Marion",
"last_name": "Colbrun"
}
}
}
insert into sample_emp values
(1, current_date, 'Marion',
'Colbrun');
update sample_emp set
last_name='Brickell'
where emp_no=1;
delete from sample_emp
where emp_no=1;
INSERT UPDATE DELETE
CDC Code Logic Flow
Read data from Kafka, create the Delta
table, and insert the most recent data per
Primary Key, excluding any
Deletes.
Read the data from Kafka and split the
delete records from Inserts/Updates.
Get the latest record per key using a Rank
window.
Enable the schema autoMerge
property to detect any schema
changes and merge them
into the Delta table
Use the MERGE command to handle
Inserts/Updates/Deletes based on the
Operation (op) column that Debezium
produces when parsing the logs
Flow: Data Preprocess → Initial Load → Incremental Load (Data Preprocess, DDL & DML scenarios)
Flag ID Value CDCTimeStamp
I 1 1 2018-01-01 16:02:00
U 1 11 2018-01-01 16:02:01
I 2 2 2018-01-01 16:02:03
I 3 33 2018-01-01 16:02:04
I 4 40 2018-01-01 16:02:04
U 4 44 2018-01-01 16:02:05
Get Latest Record in order to maintain
SCD Type I
Inserts/Updates
Flag ID Value CDCTimeStamp
D 2 2 2018-01-01 16:02:04
Deletes
Deletes arrive in Kafka from Debezium with a different
schema. For Deletes, take the Before Image data to identify
which primary-key records were deleted, whereas for
Inserts and Updates, pull the data from the After Image.
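A sketch of that split, assuming the Debezium payload has already been parsed into a DataFrame parsedDf with before, after, and op columns:

import org.apache.spark.sql.functions.col

// Deletes carry the row in the before image; inserts/updates in the after.
val deletesDf = parsedDf.filter(col("op") === "d").select("before.*", "op")
val upsertsDf = parsedDf.filter(col("op") =!= "d").select("after.*", "op")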
Flag ID Value CDCTimeStamp
U 1 11 2018-01-01 16:02:01
D 2 2 2018-01-01 16:02:04
I 3 33 2018-01-01 16:02:04
U 4 44 2018-01-01 16:02:05
// Keep only the latest change per primary key (SCD Type I)
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, rank}

val orderBy_lst = List("CDCTimeStamp")
val byPrimaryKey = Window
  .partitionBy(partitionBy_lst.map(col): _*)
  .orderBy(orderBy_lst.map(x => col(x).desc): _*)
val rankDf = dmlDf
  .withColumn("rank", rank().over(byPrimaryKey))
  .filter("rank = 1")
  .drop("rank")
Data Pre-processing
Get Latest Record and exclude Deletes
for Initial Load
Consolidated data for Initial Load
As the requirement is to maintain SCD Type I, there is
no need to load the delete records into the Delta Lake
during the Initial Load.
Flag ID Value CDCTimeStamp
U 1 11 2018-01-01 16:02:01
I 3 33 2018-01-01 16:02:04
U 4 44 2018-01-01 16:02:05
df.where("op != 'd'")
.write
.mode("overwrite")
.option("path", delta_tbl_loc)
.format("delta")
.saveAsTable(db_tbl)
Flag ID Value CDCTimeStamp
U 1 11 2018-01-01 16:02:01
D 2 2 2018-01-01 16:02:04
I 3 33 2018-01-01 16:02:04
U 4 44 2018-01-01 16:02:05
Initial Load
Flag ID Value City CDCTimeStamp
I 11 100 MDU 2018-01-01 16:02:20
U 11 1000 CHN 2018-01-01 16:02:21
U 3 300 MDU 2018-01-01 16:02:22
I 14 400 MDU 2018-01-01 16:02:21
D 4 44 2018-01-01 16:02:22
Incremental Staged Data
Incremental Load: Data Pre-process & Get Latest Record
// Same latest-record window as in the initial-load pre-processing
val orderBy_lst = List("CDCTimeStamp")
val byPrimaryKey = Window
  .partitionBy(partitionBy_lst.map(col): _*)
  .orderBy(orderBy_lst.map(x => col(x).desc): _*)
val rankDf = dmlDf
  .withColumn("rank", rank().over(byPrimaryKey))
  .filter("rank = 1")
  .drop("rank")
Flag ID Value City CDCTimeStamp
U 11 1000 CHN 2018-01-01 16:02:21
U 3 300 MDU 2018-01-01 16:02:22
I 14 400 MDU 2018-01-01 16:02:21
D 4 44 2018-01-01 16:02:22
Latest records from the incremental load
Pre-process the data by splitting Deletes from
Inserts/Updates and taking the latest record per
primary key.
Finally, union both DataFrames before performing the
MERGE, as sketched below.
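A sketch of that union, assuming Spark 3.1+ for allowMissingColumns (the deletes, taken from the before image, lack newly added columns such as City):

// Missing columns (e.g. City on deletes) are filled with nulls.
val stagedDf = rankDf.unionByName(deletesDf, allowMissingColumns = true)
stagedDf.createOrReplaceTempView("staging_tbl")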
Flag ID Value City CDCTimeStamp
U 11 1000 CHN 2018-01-01 16:02:21
U 3 300 MDU 2018-01-01 16:02:22
I 14 400 MDU 2018-01-01 16:02:21
D 4 44 2018-01-01 16:02:22
Latest Incremental Staged Data
Incremental Load – DDL & DML
MERGE INTO ${db_tbl} AS target
USING staging_tbl AS source
ON ${pri_key_const}
WHEN MATCHED AND source.op = 'u'
THEN
UPDATE SET *
WHEN MATCHED AND source.op = 'd'
THEN
DELETE
WHEN NOT MATCHED AND source.op = 'c'
THEN
INSERT *
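In the streaming job, a sketch of applying this statement once per micro-batch via foreachBatch; stagedStream, mergeStmt, and chkpt_loc are placeholders:

import org.apache.spark.sql.DataFrame

val query = stagedStream.writeStream
  .foreachBatch { (batchDf: DataFrame, batchId: Long) =>
    batchDf.createOrReplaceTempView("staging_tbl")
    batchDf.sparkSession.sql(mergeStmt) // the MERGE shown above
    ()
  }
  .option("checkpointLocation", chkpt_loc)
  .start()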
After MERGE (target Delta table, with the new City column):
Flag ID Value City CDCTimeStamp
U 1 11 Null 2018-01-01 16:02:01
U 3 300 MDU 2018-01-01 16:02:22
U 11 1000 CHN 2018-01-01 16:02:21
I 14 400 MDU 2018-01-01 16:02:21
Enable this property to add new columns on the fly
when the MERGE happens:
spark.databricks.delta.schema.autoMerge.enabled
Available only in Delta Lake 0.6.0 and higher.
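For example, one line on the session before the MERGE runs:

spark.conf.set("spark.databricks.delta.schema.autoMerge.enabled", "true")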
Before MERGE (target Delta table):
Flag ID Value CDCTimeStamp
U 1 11 2018-01-01 16:02:01
I 3 33 2018-01-01 16:02:04
U 4 44 2018-01-01 16:02:05
Spark Streaming with Kubernetes
[Diagram: Kubernetes cluster with the Kubernetes Master (API Server, Scheduler), a Spark Driver Pod, and multiple Spark Executor Pods]
1) Spark Submit
2) Start Driver Pod
3) Request Executor Pods
4) Schedule Executor Pods
5) Notify of New Executor
6) Schedule tasks on executors
Key Benefits
▪ Containerization – applications are more portable, and dependencies are easy to package.
▪ Cloud Agnostic – launch the Spark job on any platform without code changes.
▪ Efficient Resource Sharing – resources can be used by other applications while Spark jobs are idle.
File Share for Checkpointing
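A hedged sketch of the submission in step 1 above; the API server address, image, class, and jar names are all placeholders:

bin/spark-submit \
  --master k8s://https://<k8s-apiserver>:6443 \
  --deploy-mode cluster \
  --name cdc-stream \
  --class com.example.CdcStream \
  --conf spark.executor.instances=3 \
  --conf spark.kubernetes.container.image=<spark-image> \
  local:///opt/app/cdc-stream.jar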
Application Monitoring
Fluentd is a popular data collector that runs
as a DaemonSet on the Kubernetes Worker
Nodes to collect container logs from the
local filesystem and ingest them into the
Elasticsearch engine
Metricbeat is a lightweight shipper that
collects and ships system and
service metrics (CPU, memory, disk
usage, etc.) to the Elasticsearch engine
Elasticsearch is a real-time, distributed,
scalable search engine used to index
and search through large volumes of log
data
Kibana is a powerful data visualization tool
for exploring the log data stored in
Elasticsearch and gaining quick insights into
Kubernetes applications
[Diagram: a daemon on each worker node collects container logs and CPU/memory/network metrics from the pods (Pod1 … Podn) and node storage, and ships them to the centralized monitoring store]
Centralized Log Monitoring
Monitoring Dashboard
Points to be noted!
DEBEZIUM
Points to be Noted
A Primary Key is mandatory; without one
it is not possible to track the changes
and apply them to the Target
Primary Key
By default, Kafka Connect creates the topic
with only one partition. Due to this, the Spark
job does not parallelize. To achieve
parallelism, create the topic with a
higher number of partitions
Partitions
One topic is created for each table in the
database, plus one common topic per
database to maintain DDLs
Topic/Table
SPARK
If small files are compacted, the MERGE during
the incremental load does not need to rewrite
most of the files, which improves performance
between micro-batches.
To control the compaction, either
run OPTIMIZE, set the dataChange Delta write
option to false, or enable Adaptive Query
Execution, as sketched below.
Small Files
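For OSS Delta, where OPTIMIZE may be unavailable, a sketch of manual compaction using the dataChange=false write option:

// Rewrite the table into fewer files without flagging the rewrite as new
// data, so streaming readers downstream are not re-triggered.
val numFiles = 16 // target file count; tune per table size
spark.read.format("delta").load(delta_tbl_loc)
  .repartition(numFiles)
  .write
  .format("delta")
  .mode("overwrite")
  .option("dataChange", "false")
  .save(delta_tbl_loc)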
Time travel does not read the Delta log
checkpoint directory: because a specific
version is needed, it reads the individual
JSON commit files instead, since the checkpoint
Parquet file is a consolidation of all the JSON
files committed previously.
Time Travel
Any Questions?
Feedback
Your feedback is important to us.
Don’t forget to rate
and review the session.
THANKS!