Stream Data Deduplication Powered by Kafka Streams | Philipp Schirmer | Kafka Summit Europe 2021
Stream Data Deduplication
Powered by Kafka Streams
Philipp Schirmer
Data Engineer Kafka Summit Europe 2021
Scenario
Deduplication
Alignment
▬ Headline
▬ Content
▬ Location
▬ Company
▬ Tags
▬ Author
▬ Date
▬ …
https://www.kaggle.com/c/kdd-cup-2013-author-paper-identification-challenge/data
Deduplication is hard
Especially if you want to distribute and streamify it for large-scale workloads
Requirements
Document streams: Fewer but much larger messages than event streams
Updates & deletions: Only most recent version of records should be used
Frequent reprocessing: > 100 million existing records
Reduce costs: Scale to zero
Duplicate Detection
Given: a duplicate record classifier (black box)
https://github.com/bakdata/dedupe
Naïve approach: Compare every record to every other record
▬ Does not scale (quadratic complexity)
→ Pre-select plausible candidates
Pipeline: Classification
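To make the quadratic cost concrete, here is a rough count. It assumes n = 10^8 records (the lower bound on existing records named in the requirements) and, purely as an illustrative assumption, a pre-selection that compares each record with at most w = 2 neighbors on either side in a single sort order:

```latex
\text{all pairs: } \binom{n}{2} = \frac{n(n-1)}{2} \approx 5 \times 10^{15}
\qquad
\text{pre-selected: } \le 2wn = 4 \times 10^{8}
```

The value of w is not taken from the talk; the point is only that the second count grows linearly in n.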
Candidate Generation using Sorted Neighborhood
Goal: Only classify similar records.
Assumption: When sorted, similar records will be close together.
Solution: Only classify records within a fixed window when sorted
▬ Linear complexity
What is the best sorting criterion? Use multiple!
Pipeline: Candidate Generation → Classification
Example: inserting "Alphabet Inc" into the sorted list
Alibaba
Alpabet
Alphabet
Alphabet Inc.
Amazon
Amazon.com
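The windowing idea can be sketched in a few lines of Java. This is a minimal in-memory illustration, not the implementation from the talk: the class and method names are hypothetical, sort keys are assumed to be unique, and the window is assumed to mean "up to w records before and up to w records after the insert position".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

/** Illustrative in-memory sorted-neighborhood index (hypothetical names). */
final class SortedNeighborhoodIndex {
    // Sort key -> record ID, kept in key order. Assumes sort keys are unique.
    private final NavigableMap<String, String> index = new TreeMap<>();
    private final int window;

    SortedNeighborhoodIndex(int window) {
        this.window = window;
    }

    /** Returns up to `window` predecessors and successors, then adds the record. */
    List<String> insert(String sortKey, String recordId) {
        List<String> candidates = new ArrayList<>();
        // Walk backwards from the insert position (nearest predecessor first).
        index.headMap(sortKey, false).descendingMap().values().stream()
                .limit(window)
                .forEach(candidates::add);
        // Walk forwards from the insert position (nearest successor first).
        index.tailMap(sortKey, false).values().stream()
                .limit(window)
                .forEach(candidates::add);
        index.put(sortKey, recordId);
        return candidates;
    }
}
```

With a window of 2 and the six company names above already indexed, inserting "Alphabet Inc" returns its two nearest predecessors and two nearest successors, so only four classifications are needed instead of six.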
Transitivity
If A and B are duplicates, and B and C are duplicates, A and C have to be duplicates
→ Ensure transitivity by computing transitive closure
Pipeline: Candidate Generation → Classification → Transitive Closure
Transitive Closure
All connected components are considered a duplicate
▬ Also improves recall
[Figure: records A1, A2 and A3 of Entity A, connected by duplicate scores 0.9 and 0.8]
Transitive Closure
All connected components are considered a duplicate
But: error prone (a single false-positive link merges two entities into one component)
[Figure: records A1, A2, A3 of Entity A and B1, B2 of Entity B, connected by duplicate scores 0.8 and 0.9]
Clustering
Goal: Find strongly connected components
▬ Classify all records within a cluster
▬ Weight negative scores higher (precision over recall)
[Figure: the records of Entity A (A1, A2, A3) and Entity B (B1, B2) with duplicate scores 0.8 and 0.9, split into one cluster per entity]
Overview
Candidate Generation → Classification → Transitive Closure → Clustering
Kafka Streams Apps
Candidate Generation
SN Index (sort key, record ID):
Alibaba ali1
Alpabet alpha1
Alphabet alpha2
Alphabet Inc. alpha3
Amazon amzn1
Amazon.com amzn2
New record: alpha4; Alphabet Inc
Candidates: alpha4; alpha1, alpha2, alpha3, amzn1
How to implement a sorted doubly linked list in Kafka Streams?
▬ Use Kafka Streams state store backed by RocksDB
  ▬ Supports forwards and backwards iteration
  ▬ Partitioned
    → Relevant neighborhood may not be on the local partition
▬ Use Kafka Streams global store backed by RocksDB
  ▬ Supports forwards and backwards iteration
  ▬ Unpartitioned and replicated to every node (> 100 GB)
  ▬ Replication latency
▬ Use SQL database, e.g., PostgreSQL
  ▬ Indices allow quick forwards and backwards neighborhood queries
  ▬ Transactional guarantees for inserts
  ▬ Network and disk overhead
▬ Use Kafka Streams state store backed by RocksDB ✘
▬ Use Kafka Streams global store backed by RocksDB ✘
▬ Use SQL database, e.g., PostgreSQL ✔
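For the SQL option, the forwards and backwards neighborhood lookups could look roughly like the sketch below. The table `sn_index`, its columns `sort_key` and `record_id`, and the class name are assumptions made for this example, not the schema used in the talk. It assumes an index on `sort_key` and a JDBC `Connection` to a database such as PostgreSQL.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

/** Illustrative neighborhood lookup against a hypothetical sn_index table. */
final class SqlNeighborhood {
    // Nearest records before the sort key: scan the index backwards.
    static final String PREDECESSORS =
            "SELECT record_id FROM sn_index WHERE sort_key < ? ORDER BY sort_key DESC LIMIT ?";
    // Nearest records after the sort key: scan the index forwards.
    static final String SUCCESSORS =
            "SELECT record_id FROM sn_index WHERE sort_key > ? ORDER BY sort_key ASC LIMIT ?";

    /** Returns up to `window` record IDs on each side of `sortKey`. */
    static List<String> neighbors(Connection connection, String sortKey, int window)
            throws SQLException {
        List<String> ids = new ArrayList<>();
        for (String sql : List.of(PREDECESSORS, SUCCESSORS)) {
            try (PreparedStatement statement = connection.prepareStatement(sql)) {
                statement.setString(1, sortKey);
                statement.setInt(2, window);
                try (ResultSet rows = statement.executeQuery()) {
                    while (rows.next()) {
                        ids.add(rows.getString(1));
                    }
                }
            }
        }
        return ids;
    }
}
```

Both queries are bounded by LIMIT, so with a suitable index each lookup touches only the window around the new record rather than the whole table.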
Candidate Generation
Records (A2) → Sorted Neighborhood (uses SN Index) → Output (A2; A1, B1, B2)
Overview
Fill Index → SN Index
Candidate Generation → Classification → Transitive Closure → Clustering
Classification
Candidates (A2; A1, B1, B2) → Classification (uses Record Index) → Output (A2; A1, B1)
Overview
Fill Indices → SN Index, Record Index
Candidate Generation → Classification → Transitive Closure → Clustering
Transitive Closure
Two Kafka Streams state stores to maintain transitive closure:
▬ Record ID → Cluster ID
▬ Cluster ID → Record IDs
Single-partitioned because we don’t know ahead of time which records will end up being duplicates
But: not a problem as operation is very cheap and state is small
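The two mappings can be sketched with plain in-memory maps standing in for the Kafka Streams state stores. This is a minimal sketch under assumptions: the class and method names are hypothetical, and a new cluster is simply named after the first record that opens it, which the talk does not specify.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.Set;
import java.util.TreeSet;

/** Illustrative transitive closure over two maps (stand-ins for state stores). */
final class TransitiveClosure {
    private final Map<String, String> recordToCluster = new HashMap<>();
    private final Map<String, Set<String>> clusterToRecords = new HashMap<>();

    /** Merges a record and its classified duplicates; returns the resulting cluster. */
    Set<String> add(String recordId, Collection<String> duplicateIds) {
        List<String> ids = new ArrayList<>(duplicateIds);
        ids.add(0, recordId);
        // Reuse the first existing cluster; otherwise open a new one (assumed naming).
        String target = ids.stream()
                .map(recordToCluster::get)
                .filter(Objects::nonNull)
                .findFirst()
                .orElse(recordId);
        Set<String> members = clusterToRecords.computeIfAbsent(target, id -> new TreeSet<>());
        for (String id : ids) {
            String current = recordToCluster.get(id);
            if (current == null) {
                // Unknown record: add it to the target cluster.
                members.add(id);
                recordToCluster.put(id, target);
            } else if (!current.equals(target)) {
                // Record belongs to another cluster: move that whole cluster over.
                for (String moved : clusterToRecords.remove(current)) {
                    members.add(moved);
                    recordToCluster.put(moved, target);
                }
            }
        }
        return new TreeSet<>(members);
    }

    /** Returns the cluster ID of a record, or null if the record is unknown. */
    String clusterOf(String recordId) {
        return recordToCluster.get(recordId);
    }
}
```

Each update touches only the records of the affected clusters, which fits the observation above that the operation is cheap even when it runs on a single partition.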
Transitive Closure
State Stores: Record ID → Cluster ID, Cluster ID → Record IDs
Classification "A2; A1, B1" arrives:
▬ Record ID → Cluster ID: A1: A, A2: A, B1: A
▬ Cluster ID → Record IDs: A: A1, A2, B1
Classification "B2; B1" arrives:
▬ Record ID → Cluster ID: A1: A, A2: A, B1: A, B2: A
▬ Cluster ID → Record IDs: A: A1, A2, B1, B2
Clustering
Transitive Closure (A1, A2, B1, B2) → Clustering (uses Record Index) → Output (A1, A2 and B1, B2)
Cluster Ordering
Transitive Closure emits "A1, A2, B1" (Timestamp 0) and "A1, A2, B1, B2" (Timestamp 1); Clustering processes them in parallel on Partition 1 and Partition 2
▬ In order: Output is A1: A, A2: A, B1: A, followed by A1: A, A2: A, B1: B, B2: B
▬ Out of order: Output is A1: A, A2: A, B1: B, B2: B, followed by the stale A1: A, A2: A, B1: A
Cluster Ordering
Challenge: Clustering does not preserve order if done in parallel
Solution:
▬ Add sequential id, e.g., timestamp, to results of transitive closure
▬ Parallelize clustering
▬ Filter results with state store (single-partitioned)
  ▬ For each record in group of clusters, store latest timestamp that has been seen
  ▬ For each group of clusters, check if timestamp is younger than all timestamps that have been seen for its records
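The filter step can be sketched as follows, with a plain map standing in for the single-partitioned state store. The class and method names are hypothetical, and the treatment of equal timestamps (rejected here, reading "younger" as strictly younger) is an assumption.

```java
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;

/** Illustrative ordering filter: drops clustering results that arrive too late. */
final class ClusterOrderingFilter {
    // Record ID -> latest timestamp seen (stand-in for the state store).
    private final Map<String, Long> latestSeen = new HashMap<>();

    /** Returns true and remembers the timestamp if the result is newer than all seen. */
    boolean accept(Collection<String> recordIds, long timestamp) {
        for (String id : recordIds) {
            Long seen = latestSeen.get(id);
            // Assumption: a result must be strictly younger than every stored timestamp.
            if (seen != null && seen >= timestamp) {
                return false;
            }
        }
        for (String id : recordIds) {
            latestSeen.put(id, timestamp);
        }
        return true;
    }
}
```

A result for A1, A2, B1, B2 with timestamp 1 is accepted; a late result for A1, A2, B1 with timestamp 0 is then rejected because newer timestamps are already stored for its records.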
Cluster Ordering
Transitive Closure → Clustering → State Store (Record ID → Timestamp) → Output
▬ "A1, A2, B1, B2" (Timestamp: 1) is clustered into "A1, A2" and "B1, B2". No timestamps have been stored yet, so the result is forwarded.
  ▬ State Store: A1: 1, A2: 1, B1: 1, B2: 1
  ▬ Output: A1: A, A2: A, B1: B, B2: B
▬ "A1, A2, B1" (Timestamp: 0) is clustered into "A1, A2, B1". The state store already holds timestamp 1 for these records, so the result is dropped.
▬ "A1, A2, B1, B2, A3" (Timestamp: 2) is clustered into "A1, A2, A3" and "B1, B2". Timestamp 2 is younger than all stored timestamps, so the result is forwarded.
  ▬ State Store: A1: 2, A2: 2, A3: 2, B1: 2, B2: 2
  ▬ Output: A1: A, A2: A, B1: B, B2: B, A3: A
Overview
Fill Indices → SN Index, Record Index
Candidate Generation → Classification → Transitive Closure → Clustering → Cluster Ordering
Deletions
[Figure: pipeline overview]
▬ Record ID is message key
▬ Cluster ID is message key
Performance
▬ Up to 1,200 deduplications per second with 100 parallel classification pods
▬ Deduplication of 150,000,000 records in < 4 days
▬ Throughput limited by memory availability of database and classifier throughput
Deployment & Operations
▬ All deployed and developed with https://github.com/bakdata/streams-bootstrap
▬ Lag-based auto-scaling from 0 to number of partitions powered by KEDA
  https://keda.sh/docs/2.1/scalers/apache-kafka/
▬ Database auto-scaling with Aurora Serverless
  https://aws.amazon.com/rds/aurora/serverless/
autoscaling:
  enabled: true
  maxReplicas: 100
  consumergroup: news-deduplication-class
  lagThreshold: "10000"
Conclusion
Modular architecture: Each module scales independently
High throughput: With regard to complexity of deduplication
Large-scale processing: Hundreds of millions of records can be processed
Scale to zero: If there is no load, we only pay for storage

Consistency and Completeness: Rethinking Distributed Stream Processing in Apa...
 
Data pipeline with kafka
Data pipeline with kafkaData pipeline with kafka
Data pipeline with kafka
 
KFServing Payload Logging for Trusted AI
KFServing Payload Logging for Trusted AIKFServing Payload Logging for Trusted AI
KFServing Payload Logging for Trusted AI
 
14th Athens Big Data Meetup - Landoop Workshop - Apache Kafka Entering The St...
14th Athens Big Data Meetup - Landoop Workshop - Apache Kafka Entering The St...14th Athens Big Data Meetup - Landoop Workshop - Apache Kafka Entering The St...
14th Athens Big Data Meetup - Landoop Workshop - Apache Kafka Entering The St...
 

Stream Data Deduplication Powered by Kafka Streams | Philipp Schirmer, Bakdata

  • 1. Stream Data Deduplication Powered by Kafka Streams | Philipp Schirmer, Data Engineer | Kafka Summit Europe 2021
  • 2. Scenario. Deduplication, Alignment ▬ Headline ▬ Content ▬ Location ▬ Company ▬ Tags ▬ Author ▬ Date ▬ … (Icons made by Freepik from www.flaticon.com; https://www.kaggle.com/c/kdd-cup-2013-author-paper-identification-challenge/data)
  • 3. Scenario. Deduplication, Alignment. Deduplication is hard, especially if you want to distribute and streamify it for large-scale workloads. (Icons made by Freepik from www.flaticon.com; https://www.kaggle.com/c/kdd-cup-2013-author-paper-identification-challenge/data)
  • 4. Requirements ▬ Document streams: fewer but much larger messages than event streams ▬ Updates & deletions: only the most recent version of records should be used ▬ Frequent reprocessing: > 100 million existing records ▬ Reduce costs: scale to zero
  • 5. Duplicate Detection. Given: a duplicate record classifier (black box), https://github.com/bakdata/dedupe. Naïve approach: compare every record to every other record ▬ Does not scale (quadratic complexity) → Pre-select plausible candidates. Pipeline: Classification
  • 6. Candidate Generation using Sorted Neighborhood. Goal: only classify similar records. Assumption: when sorted, similar records will be close together. Solution: only classify records within a fixed window when sorted ▬ Linear complexity. What is the best sorting criterion? Use multiple! Example keys: Alibaba, Alpabet, Alphabet, Alphabet Inc., Amazon, Amazon.com; new record: Alphabet Inc. Pipeline: Candidate Generation → Classification
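
The windowing idea on slides 5 and 6 can be sketched in a few lines. This is an illustrative, in-memory sketch only, not the speaker's implementation: the function names and the window of two neighbours on each side are assumptions, and the record count is the "> 100 million" figure from slide 4.

```python
from bisect import bisect_left, insort


def naive_pairs(n):
    """Comparisons needed when every record is compared to every other one."""
    return n * (n - 1) // 2


def windowed_pairs_upper_bound(n, window):
    """Each record is classified against at most `window` neighbours per side."""
    return n * 2 * window


def neighborhood(sorted_keys, new_key, window=2):
    """Return up to `window` keys on each side of new_key's sort position."""
    pos = bisect_left(sorted_keys, new_key)
    before = sorted_keys[max(0, pos - window):pos]
    after = sorted_keys[pos:pos + window]
    return before + after


# Example keys from slide 6; "Alpabet" is the deliberately misspelled duplicate.
names = sorted(["Alibaba", "Alpabet", "Alphabet", "Alphabet Inc.",
                "Amazon", "Amazon.com"])

# Only these neighbours are handed to the classifier for the new record.
candidates = neighborhood(names, "Alphabet Inc")

# Afterwards the new key joins the sorted index itself.
insort(names, "Alphabet Inc")
```

With a fixed window the work per incoming record is constant, which is where the linear overall complexity comes from. Running the same lookup over several sort keys (the "use multiple" hint) would simply union the candidate lists.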
  • 7. Transitivity. If A and B are duplicates, and B and C are duplicates, A and C have to be duplicates → Ensure transitivity by computing transitive closure. Pipeline: Candidate Generation → Classification → Transitive Closure
  • 8-9. Transitive Closure. All records in a connected component are considered duplicates ▬ Also improves recall. Figure: Entity A with records A1, A2, A3 and edge scores 0.9, 0.8
  • 10-12. Transitive Closure. All records in a connected component are considered duplicates. But: error-prone. Figure: Entity A and Entity B with records A1, A2, A3, B1, B2 and edge scores 0.9, 0.8, 0.8, 0.9, 0.9
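
A small sketch may help to see both effects of the transitive closure described on slides 8 to 12. The slide transcript does not say which record pairs carry which score, so the edge lists and the 0.5 decision threshold below are assumptions chosen for illustration.

```python
def connected_components(pairs, threshold=0.5):
    """Group records linked by pairs whose score reaches the threshold."""
    graph = {}
    for a, b, score in pairs:
        graph.setdefault(a, set())
        graph.setdefault(b, set())
        if score >= threshold:
            graph[a].add(b)
            graph[b].add(a)

    seen = set()
    components = []
    for node in sorted(graph):
        if node in seen:
            continue
        # Depth-first walk collects everything reachable from `node`.
        stack, component = [node], set()
        while stack:
            current = stack.pop()
            if current in component:
                continue
            component.add(current)
            stack.extend(graph[current] - component)
        seen |= component
        components.append(sorted(component))
    return components


# Recall gain: A1 and A3 were never compared, yet end up together.
entity_a = connected_components([("A1", "A2", 0.9), ("A2", "A3", 0.8)])

# Error-prone: one wrong positive (A2, B1) glues two entities together.
merged = connected_components(
    [("A1", "A2", 0.9), ("A2", "B1", 0.8), ("B1", "B2", 0.9)]
)
```

The second call shows why a further clustering step is needed: a single misclassified pair is enough to merge two otherwise separate entities.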
  • 13-14. Clustering. Goal: find strongly connected components ▬ Classify all records within a cluster ▬ Weight negative scores higher (precision over recall). Figure: Entity A and Entity B with records A1, A2, A3, B1, B2, shown first with edge scores 0.9, 0.8, 0.8, 0.9, 0.9 and then as separated clusters
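
The slides do not name the clustering algorithm, so the following is only one possible reading of "classify all records within a cluster" and "weight negative scores higher". It is a greedy agglomerative sketch; `link_weight`, the 0.5 midpoint, the penalty factor of 2.0, and all pair scores are hypothetical.

```python
from itertools import combinations


def link_weight(score, negative_weight=2.0):
    """Turn a classifier score into evidence; negative evidence counts more."""
    evidence = score - 0.5
    return evidence if evidence >= 0 else negative_weight * evidence


def cluster(records, scores, negative_weight=2.0):
    """Greedily merge clusters while the summed evidence between them is positive.

    `scores` must hold a score for every pair, keyed by frozenset({a, b}).
    """
    clusters = [[record] for record in records]

    def affinity(left, right):
        return sum(
            link_weight(scores[frozenset((a, b))], negative_weight)
            for a in left
            for b in right
        )

    while True:
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            value = affinity(clusters[i], clusters[j])
            if value > 0 and (best is None or value > best[0]):
                best = (value, i, j)
        if best is None:
            return sorted(sorted(c) for c in clusters)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]


# Hypothetical scores for the component A1, A2, B1, B2: one misleading
# positive link (A2, B1) and three clear negatives across the entities.
pair_scores = {
    frozenset(("A1", "A2")): 0.9,
    frozenset(("B1", "B2")): 0.9,
    frozenset(("A2", "B1")): 0.8,
    frozenset(("A1", "B1")): 0.1,
    frozenset(("A1", "B2")): 0.1,
    frozenset(("A2", "B2")): 0.1,
}
result = cluster(["A1", "A2", "B1", "B2"], pair_scores)
```

Because every pair inside the component is scored, the single positive link between A2 and B1 is outvoted by the negative pairs, and the component splits into the two entities.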
  • 15-17. Overview. Candidate Generation → Classification → Transitive Closure → Clustering (Kafka Streams apps)
  • 18. Candidate Generation. Sorted index (sort key, record ID): Alibaba ali1; Alpabet alpha1; Alphabet alpha2; Alphabet Inc. alpha3; Amazon amzn1; Amazon.com amzn2 ▬ Input record: alpha4; Alphabet Inc ▬ Output candidates: alpha4; alpha1, alpha2, alpha3, amzn1
  • 19. Candidate Generation. How to implement a sorted doubly linked list in Kafka Streams?
  • 20. Candidate Generation. How to implement a sorted doubly linked list in Kafka Streams? Use Kafka Streams state store backed by RocksDB ▬ Supports forwards and backwards iteration ▬ Partitioned → Relevant neighborhood may not be on the local partition
  • 21. Candidate Generation. Use Kafka Streams state store backed by RocksDB ✘. Use Kafka Streams global store backed by RocksDB ▬ Supports forwards and backwards iteration ▬ Unpartitioned and replicated to every node (> 100 GB) ▬ Replication latency
  • 22. Candidate Generation. Use Kafka Streams state store backed by RocksDB ✘. Use Kafka Streams global store backed by RocksDB ✘. Use SQL database, e.g., PostgreSQL ▬ Indices allow quick forwards and backwards neighborhood queries ▬ Transactional guarantees for inserts ▬ Network and disk overhead
  • 23. Candidate Generation. Use Kafka Streams state store backed by RocksDB ✘. Use Kafka Streams global store backed by RocksDB ✘. Use SQL database, e.g., PostgreSQL ✔
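
To make the SQL option concrete, here is a minimal sketch of a sorted-neighborhood index with forward and backward queries. It uses Python's built-in `sqlite3` as a stand-in for PostgreSQL so that it runs without a server; the table name `sn_index`, its columns, and the window of two are assumptions, while the records are the ones from slide 18.

```python
import sqlite3


def create_index(conn):
    # The index on sort_key is what makes both neighbourhood scans cheap.
    conn.execute(
        "CREATE TABLE sn_index (sort_key TEXT NOT NULL, record_id TEXT PRIMARY KEY)"
    )
    conn.execute("CREATE INDEX sn_sort_key ON sn_index (sort_key)")


def candidates_and_insert(conn, record_id, sort_key, window=2):
    """Look up neighbours on both sides, then insert the record, in one transaction."""
    with conn:  # commits on success, rolls back on error
        before = conn.execute(
            "SELECT record_id FROM sn_index WHERE sort_key < ? "
            "ORDER BY sort_key DESC LIMIT ?",
            (sort_key, window),
        ).fetchall()
        after = conn.execute(
            "SELECT record_id FROM sn_index WHERE sort_key >= ? "
            "ORDER BY sort_key ASC LIMIT ?",
            (sort_key, window),
        ).fetchall()
        conn.execute(
            "INSERT INTO sn_index (sort_key, record_id) VALUES (?, ?)",
            (sort_key, record_id),
        )
    # Backward scan returns nearest first; flip it to restore sort order.
    return [row[0] for row in reversed(before)] + [row[0] for row in after]


conn = sqlite3.connect(":memory:")
create_index(conn)
conn.executemany(
    "INSERT INTO sn_index (sort_key, record_id) VALUES (?, ?)",
    [
        ("Alibaba", "ali1"),
        ("Alpabet", "alpha1"),
        ("Alphabet", "alpha2"),
        ("Alphabet Inc.", "alpha3"),
        ("Amazon", "amzn1"),
        ("Amazon.com", "amzn2"),
    ],
)
conn.commit()

candidates = candidates_and_insert(conn, "alpha4", "Alphabet Inc")
```

The two ordered, limited queries play the role of walking the doubly linked list in each direction. Against PostgreSQL the same statements would go through a driver with its own placeholder syntax, and string ordering would follow the database collation.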
  • 24. Candidate Generation. Records → Sorted Neighborhood (with SN Index) → Output. Example: input A2; output A2; A1, B1, B2
  • 25. Overview. Fill Index → SN Index. Candidate Generation → Classification → Transitive Closure → Clustering
  • 26. Classification. Candidates → Classification (with Record Index) → Output. Example: input A2; A1, B1, B2; output A2; A1, B1
  • 27. Overview. Fill Indices → SN Index, Record Index. Candidate Generation → Classification → Transitive Closure → Clustering
  • 28. Transitive Closure. Two Kafka Streams state stores maintain the transitive closure: ▬ Record ID → Cluster ID ▬ Cluster ID → Record IDs. Single-partitioned because we don't know ahead of time which records will end up being duplicates. But: not a problem, as the operation is very cheap and the state is small
  • 29. Transitive Closure. Classifications: A2; A1, B1. State stores (Record ID → Cluster ID, Cluster ID → Record IDs): no entries shown
  • 30. Transitive Closure. Classifications: A2; A1, B1. Record ID → Cluster ID: A1: A, A2: A, B1: A. Cluster ID → Record IDs: A: A1, A2, B1
  • 31-33. Transitive Closure. Classifications: A2; A1, B1 and B2; B1. Record ID → Cluster ID: A1: A, A2: A, B1: A. Cluster ID → Record IDs: A: A1, A2, B1
  • 34. Transitive Closure. Classifications: A2; A1, B1 and B2; B1. Record ID → Cluster ID: A1: A, A2: A, B1: A, B2: A. Cluster ID → Record IDs: A: A1, A2, B1, B2
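
The walkthrough on slides 28 to 34 can be mirrored with two plain dictionaries standing in for the two state stores. This is a sketch under assumptions, not Kafka Streams code: the class name, the integer cluster IDs, and the rule of merging into the smallest existing cluster ID are all hypothetical.

```python
class TransitiveClosure:
    """Two lookup tables kept in sync, like the two state stores on slide 28."""

    def __init__(self):
        self.cluster_of = {}  # Record ID -> Cluster ID
        self.members = {}     # Cluster ID -> Record IDs
        self._next_id = 0

    def add(self, record, duplicates):
        """Process one classification: `record` is a duplicate of `duplicates`."""
        ids = [record, *duplicates]
        existing = sorted({self.cluster_of[r] for r in ids if r in self.cluster_of})
        if existing:
            target = existing[0]
        else:
            target = self._next_id
            self._next_id += 1
        self.members.setdefault(target, set())

        # Fold every other touched cluster into the target cluster.
        for cluster_id in existing[1:]:
            for member in self.members.pop(cluster_id):
                self.cluster_of[member] = target
                self.members[target].add(member)

        for r in ids:
            self.cluster_of[r] = target
            self.members[target].add(r)
        return target


closure = TransitiveClosure()
closure.add("A2", ["A1", "B1"])  # slide 30: A1, A2, B1 share one cluster
closure.add("B2", ["B1"])        # slide 34: B2 joins via B1
```

Each update touches only the clusters of the records named in the classification, which fits the slide's remark that the operation is cheap even though it runs on a single partition.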
  • 35. Overview (pipeline diagram): SN Index, Fill Indices, Candidate Generation, Classification, Transitive Closure, Clustering, Record Index
  • 36. Clustering (diagram): Transitive Closure → Clustering (connected to the Record Index) → Output. Input group: A1, A2, B1, B2. Resulting clusters: A1, A2 and B1, B2
  • 37. Overview (pipeline diagram): SN Index, Fill Indices, Candidate Generation, Classification, Transitive Closure, Clustering, Record Index
  • 38. Cluster Ordering (diagram): Transitive Closure emits the group A1, A2, B1 (timestamp 0) and the group A1, A2, B1, B2 (timestamp 1). Clustering runs in parallel on Partition 1 and Partition 2. The timestamp 0 group is clustered as A1, A2, B1, and the output receives A1: A, A2: A, B1: A
  • 39. Cluster Ordering (diagram): The timestamp 1 group is clustered as A1, A2 and B1, B2. The output receives A1: A, A2: A, B1: B, B2: B after the timestamp 0 result, so the order is preserved
  • 40. Cluster Ordering (diagram): Same input, but the timestamp 1 group finishes clustering first (A1, A2 and B1, B2). The output receives A1: A, A2: A, B1: B, B2: B
  • 41-42. Cluster Ordering (diagram): The timestamp 0 group finishes clustering afterwards (A1, A2, B1). The output receives the outdated result A1: A, A2: A, B1: A after the newer one
  • 43. Cluster Ordering:
    Challenge: Clustering does not preserve order if done in parallel.
    Solution:
      ▬ Add a sequential ID, e.g., a timestamp, to the results of transitive closure
      ▬ Parallelize clustering
      ▬ Filter results with a state store (single-partitioned)
      ▬ For each record in a group of clusters, store the latest timestamp that has been seen
      ▬ For each group of clusters, check whether its timestamp is newer than all timestamps that have been seen for its records
  • 44. Cluster Ordering (example): Transitive Closure emits the group A1, A2, B1, B2 with timestamp 1. The state store Record ID → Timestamp has no entries
  • 45. Cluster Ordering (example): Clustering splits the group into A1, A2 and B1, B2 and keeps timestamp 1
  • 46. Cluster Ordering (example): The result passes the filter. State store: A1: 1, A2: 1, B1: 1, B2: 1. Output: A1: A, A2: A, B1: B, B2: B
  • 47. Cluster Ordering (example): The group A1, A2, B1 with timestamp 0 arrives. State store unchanged: A1: 1, A2: 1, B1: 1, B2: 1
  • 48-49. Cluster Ordering (example): Clustering returns A1, A2, B1 with timestamp 0. The state store already holds timestamp 1 for these records, so the result is not newer and nothing is written to the output
  • 50. Cluster Ordering (example): Transitive Closure emits the group A1, A2, B1, B2, A3 with timestamp 2. State store: A1: 1, A2: 1, B1: 1, B2: 1
  • 51-52. Cluster Ordering (example): Clustering splits the group into A1, A2, A3 and B1, B2 and keeps timestamp 2
  • 53. Cluster Ordering (example): Timestamp 2 is newer than every stored timestamp, so the result passes the filter. State store: A1: 2, A2: 2, A3: 2, B1: 2, B2: 2. Output: A1: A, A2: A, A3: A, B1: B, B2: B
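The filter described on slide 43 and walked through on slides 44 to 53 can be sketched as follows. This is an illustrative in-memory version under stated assumptions: a dict stands in for the single-partitioned Record ID → Timestamp state store, the names `ClusterOrderingFilter` and `accept` are hypothetical, and a result is accepted only if its timestamp is strictly newer than every timestamp stored for its records.

```python
from typing import Dict, List, Set


class ClusterOrderingFilter:
    """Dict standing in for the Record ID -> Timestamp state store (an assumption)."""

    def __init__(self) -> None:
        self.latest_seen: Dict[str, int] = {}  # Record ID -> latest timestamp seen

    def accept(self, clusters: List[Set[str]], timestamp: int) -> bool:
        """Return True if the clustering result is newer than everything seen for its records."""
        records = set().union(*clusters)
        # Reject the result if any record was already seen with an equal or newer timestamp.
        for record_id in records:
            seen = self.latest_seen.get(record_id)
            if seen is not None and seen >= timestamp:
                return False
        # Accept: remember the new timestamp for every record in the group.
        for record_id in records:
            self.latest_seen[record_id] = timestamp
        return True


ordering = ClusterOrderingFilter()
print(ordering.accept([{"A1", "A2"}, {"B1", "B2"}], timestamp=1))        # True  (slides 44-46)
print(ordering.accept([{"A1", "A2", "B1"}], timestamp=0))                # False (slides 47-49)
print(ordering.accept([{"A1", "A2", "A3"}, {"B1", "B2"}], timestamp=2))  # True  (slides 50-53)
```

Only the lightweight timestamp check runs on a single partition; the expensive clustering itself stays parallel, which is the trade-off the slide describes.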
  • 54-55. Overview (pipeline diagram): SN Index, Fill Indices, Candidate Generation, Classification, Transitive Closure, Clustering, Cluster Ordering, Record Index
  • 56. Deletions (pipeline diagram): SN Index, Fill Indices, Candidate Generation, Classification, Transitive Closure, Clustering, Cluster Ordering, Record Index. Icon made by Freepik from www.flaticon.com
  • 57. Deletions (pipeline diagram, annotation): Record ID is message key
  • 58. Deletions (pipeline diagram, annotation): Cluster ID is message key
  • 59. Deletions (pipeline diagram, no annotation)
  • 60. Performance:
      ▬ Up to 1,200 deduplications per second with 100 parallel classification pods
      ▬ Deduplication of 150,000,000 records in < 4 days
      ▬ Throughput is limited by the memory available to the database and by classifier throughput
    Icon made by Freepik from www.flaticon.com
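As a rough consistency check on the two figures above, the following back-of-the-envelope calculation compares the peak rate with the average rate implied by the full run. It assumes that one "deduplication" corresponds to one processed record, which the slide does not state explicitly; the variable names are only for illustration.

```python
RECORDS = 150_000_000        # records deduplicated (from the slide)
MAX_DAYS = 4                 # upper bound on total duration (from the slide)
PEAK_RATE = 1_200            # deduplications per second at peak (from the slide)
SECONDS_PER_DAY = 24 * 60 * 60

# Minimum average rate needed to finish within four days.
min_average_rate = RECORDS / (MAX_DAYS * SECONDS_PER_DAY)

# Hypothetical duration if the peak rate could be sustained the whole time.
hours_at_peak = RECORDS / PEAK_RATE / 3600

print(round(min_average_rate))  # 434
print(round(hours_at_peak, 1))  # 34.7
```

Under that assumption, the run averaged at least about 434 records per second, roughly a third of the peak rate, which is in line with the slide's note that throughput is bounded by the database and the classifier.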
  • 61. Deployment & Operations:
      ▬ All deployed and developed with https://github.com/bakdata/streams-bootstrap
      ▬ Lag-based auto-scaling from 0 to the number of partitions, powered by KEDA https://keda.sh/docs/2.1/scalers/apache-kafka/
      ▬ Database auto-scaling with Aurora Serverless https://aws.amazon.com/rds/aurora/serverless/
    Configuration shown on the slide:
      autoscaling:
        enabled: true
        maxReplicas: 100
        consumergroup: news-deduplication-class
        lagThreshold: "10000"
    Logo by Kubernetes / CC BY
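To make the effect of `lagThreshold` and `maxReplicas` concrete, here is a simplified model of lag-based scaling. It is a sketch of the general idea only, not KEDA's actual algorithm (which also involves the Kubernetes Horizontal Pod Autoscaler, polling intervals, and cooldown periods). The function name and the `partitions` parameter are assumptions; the slide gives no partition count.

```python
import math


def desired_replicas(total_lag: int, lag_threshold: int, max_replicas: int, partitions: int) -> int:
    """Simplified lag-based replica count: zero when idle, capped by replicas and partitions."""
    if total_lag <= 0:
        return 0  # scale to zero when there is nothing to consume
    wanted = math.ceil(total_lag / lag_threshold)
    # More consumers than partitions would sit idle, so cap at the partition count as well.
    return max(1, min(wanted, max_replicas, partitions))


# Values from the slide: lagThreshold 10000, maxReplicas 100; 100 partitions is assumed.
print(desired_replicas(0, 10_000, 100, 100))          # 0
print(desired_replicas(250_000, 10_000, 100, 100))    # 25
print(desired_replicas(5_000_000, 10_000, 100, 100))  # 100
```

In this model, a lag of 250,000 messages asks for 25 pods, and the deployment only reaches the 100-pod ceiling once the lag exceeds one million messages.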
  • 62. Conclusion:
      ▬ Modular architecture: Each module scales independently
      ▬ High throughput: With regard to the complexity of deduplication
      ▬ Large-scale processing: Hundreds of millions of records can be processed
      ▬ Scale to zero: If there is no load, we only pay for storage