Representations of the same data, e.g., news, persons, or places, differ across sources. We therefore need to identify duplicates, for example, if we want to stream deduplicated news from different sources into a sentiment classifier.
We built a system that collects data from different sources in a streaming fashion, aligns it to a global schema, and then detects duplicates within the data stream without time window constraints. The challenge is not only to process newly published data without significant delay, but also to reprocess hundreds of millions of existing messages, for example, after improving the similarity measure.
In this talk, we present our implementation for deduplication of data streams built on top of Kafka Streams. We leverage Kafka Streams APIs, in particular state stores, and use Kubernetes to auto-scale our application from zero to a defined maximum. This allows us to process live data immediately and also to reprocess all data from scratch within a reasonable amount of time.
Stream Data Deduplication Powered by Kafka Streams
Philipp Schirmer, Data Engineer, bakdata
Kafka Summit Europe 2021
Scenario
Pipeline: Alignment → Deduplication
Global schema:
▬ Headline
▬ Content
▬ Location
▬ Company
▬ Tags
▬ Author
▬ Date
▬ …
Deduplication is hard, especially if you want to distribute and streamify it for large-scale workloads.
Data: https://www.kaggle.com/c/kdd-cup-2013-author-paper-identification-challenge/data
Requirements
▬ Document streams: fewer but much larger messages than event streams
▬ Updates & deletions: only the most recent version of records should be used
▬ Frequent reprocessing: > 100 million existing records
▬ Reduce costs: scale to zero
Duplicate Detection
Given: a duplicate record classifier (black box; see the sketch below)
https://github.com/bakdata/dedupe
Naïve approach: compare every record to every other record
▬ Does not scale (quadratic complexity)
→ Pre-select plausible candidates
Pipeline: Classification
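Since the classifier is a black box, a minimal illustrative interface is enough to follow the rest of the pipeline. This is an assumption for exposition only, not the actual API of bakdata/dedupe:

// Illustrative only: a minimal black-box duplicate classifier interface.
// NOT the actual API of https://github.com/bakdata/dedupe.
public interface DuplicateClassifier<T> {

    // Similarity score between 0 and 1 for a pair of records;
    // higher means more likely to be duplicates.
    double classify(T left, T right);
}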
Candidate Generation using Sorted Neighborhood
Goal: only classify similar records.
Assumption: when sorted, similar records will be close together.
Solution: only classify records within a fixed window of the sorted order
▬ Linear complexity
What is the best sorting criterion? Use multiple!
Pipeline: Candidate Generation → Classification
Example, sorted by company name: Alibaba, Alpabet, Alphabet, Alphabet Inc, Alphabet Inc., Amazon, Amazon.com
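A minimal sketch of the windowing step over an already sorted list. The names and the window handling are illustrative, not the talk's implementation:

import java.util.ArrayList;
import java.util.List;

// Sketch of sorted-neighborhood candidate generation: the input is assumed
// to be sorted by the chosen sorting key already.
public class SortedNeighborhood {

    record Candidate(String recordId, String neighborId) {}

    // Emit one candidate pair per record and each neighbor within the window.
    static List<Candidate> candidates(List<String> sortedIds, int windowSize) {
        List<Candidate> pairs = new ArrayList<>();
        for (int i = 0; i < sortedIds.size(); i++) {
            // Only look ahead: each pair is generated exactly once.
            for (int j = i + 1; j <= i + windowSize && j < sortedIds.size(); j++) {
                pairs.add(new Candidate(sortedIds.get(i), sortedIds.get(j)));
            }
        }
        return pairs;
    }
}

With a window of two, only pairs of records at most two positions apart in the sorted order are classified, which keeps the number of comparisons linear in the number of records.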
Transitivity
If A and B are duplicates, and B and C are duplicates, then A and C have to be duplicates.
→ Ensure transitivity by computing the transitive closure
Pipeline: Candidate Generation → Classification → Transitive Closure
Transitive Closure
All records within a connected component are considered duplicates
▬ Also improves recall
[Diagram: records A1, A2, A3 connected by similarity scores 0.9 and 0.8, forming one connected component: Entity A]
Transitive Closure
All records within a connected component are considered duplicates
But: error prone
[Diagram: similarity edges (0.8–0.9) chain the records of Entity A (A1, A2, A3) to the records of Entity B (B1, B2), so the transitive closure wrongly merges both entities into one component]
Clustering
Goal: find strongly connected components (a sketch follows below)
▬ Classify all records within a cluster
▬ Weight negative scores higher (precision over recall)
[Diagram: the merged component is split back into Entity A (A1, A2, A3) and Entity B (B1, B2)]
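The talk does not spell out the clustering algorithm itself; purely to illustrate weighting negative scores higher, here is a greedy sketch in which the threshold, the penalty factor, and all names are assumptions:

import java.util.ArrayList;
import java.util.List;
import java.util.function.ToDoubleBiFunction;

// Rough illustration only: greedy clustering of one connected component,
// weighting negative evidence (scores below the threshold) higher so that
// precision is favored over recall. Not the talk's actual algorithm.
public class Clustering {

    private static final double THRESHOLD = 0.5;
    private static final double NEGATIVE_WEIGHT = 2.0; // assumed penalty factor

    static <T> List<List<T>> cluster(List<T> component, ToDoubleBiFunction<T, T> classifier) {
        List<List<T>> clusters = new ArrayList<>();
        for (T record : component) {
            List<T> best = null;
            double bestScore = 0.0;
            for (List<T> candidate : clusters) {
                double score = weightedScore(record, candidate, classifier);
                if (score > THRESHOLD && score > bestScore) {
                    best = candidate;
                    bestScore = score;
                }
            }
            if (best == null) {
                best = new ArrayList<>();
                clusters.add(best);
            }
            best.add(record);
        }
        return clusters;
    }

    // Average pairwise score against a cluster; sub-threshold scores count
    // with extra weight so that one bad match can veto a merge.
    private static <T> double weightedScore(T record, List<T> cluster,
            ToDoubleBiFunction<T, T> classifier) {
        double total = 0.0;
        double weights = 0.0;
        for (T member : cluster) {
            double score = classifier.applyAsDouble(record, member);
            double weight = score < THRESHOLD ? NEGATIVE_WEIGHT : 1.0;
            total += weight * score;
            weights += weight;
        }
        return total / weights;
    }
}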
Overview
Pipeline: Candidate Generation → Classification → Transitive Closure → Clustering
Each step is implemented as a Kafka Streams app.
Candidate Generation
Sorted neighborhood index (record IDs in parentheses):
▬ Alibaba (ali1)
▬ Alpabet (alpha1)
▬ Alphabet (alpha2)
▬ Alphabet Inc (alpha4) ← newly inserted
▬ Alphabet Inc. (alpha3)
▬ Amazon (amzn1)
▬ Amazon.com (amzn2)
Inserting "Alphabet Inc" (alpha4) emits the candidate set "alpha4; alpha1, alpha2, alpha3, amzn1": its two neighbors in each direction of the sorted order.

How to implement a sorted doubly linked list in Kafka Streams?
▬ Use a Kafka Streams state store backed by RocksDB ✘
  ▬ Supports forwards and backwards iteration
  ▬ Partitioned → the relevant neighborhood may not be on the local partition
▬ Use a Kafka Streams global store backed by RocksDB ✘
  ▬ Supports forwards and backwards iteration
  ▬ Unpartitioned and replicated to every node (> 100 GB)
  ▬ Replication latency
▬ Use an SQL database, e.g., PostgreSQL ✔
  ▬ Indices allow quick forwards and backwards neighborhood queries
  ▬ Transactional guarantees for inserts
  ▬ But: network and disk overhead
A sketch of the SQL-backed neighborhood query follows below.
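A minimal JDBC sketch, assuming an sn_index table with sort_key and record_id columns and an index on sort_key. All names are illustrative, not from the talk:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

// Sketch of the SQL-backed sorted neighborhood index: an index on sort_key
// makes both the forwards and the backwards query fast.
public class SnIndex {

    private final Connection connection;

    public SnIndex(Connection connection) {
        this.connection = connection;
    }

    // Fetch up to 'window' record IDs on one side of the given sort key.
    public List<String> neighbors(String sortKey, int window, boolean forward)
            throws SQLException {
        String sql = forward
                ? "SELECT record_id FROM sn_index WHERE sort_key > ? ORDER BY sort_key ASC LIMIT ?"
                : "SELECT record_id FROM sn_index WHERE sort_key < ? ORDER BY sort_key DESC LIMIT ?";
        List<String> ids = new ArrayList<>();
        try (PreparedStatement stmt = connection.prepareStatement(sql)) {
            stmt.setString(1, sortKey);
            stmt.setInt(2, window);
            try (ResultSet rs = stmt.executeQuery()) {
                while (rs.next()) {
                    ids.add(rs.getString("record_id"));
                }
            }
        }
        return ids;
    }
}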
Candidate Generation
[Diagram: Records topic → Sorted Neighborhood app, backed by the SN Index → output candidates, e.g., "A2; A1, B1, B2"]
Overview
Pipeline: Candidate Generation → Classification → Transitive Closure → Clustering
A Fill Index app maintains the SN Index used by Candidate Generation.
Classification
[Diagram: candidates "A2; A1, B1, B2" → Classification app, backed by the Record Index → output "A2; A1, B1"]
Overview
Pipeline: Candidate Generation → Classification → Transitive Closure → Clustering
Fill Indices maintain the SN Index and the Record Index.
Transitive Closure
Two Kafka Streams state stores maintain the transitive closure:
▬ Record ID → Cluster ID
▬ Cluster ID → Record IDs
Single-partitioned, because we don't know ahead of time which records will end up being duplicates
But: not a problem, as the operation is very cheap and the state is small
A sketch of the merge logic follows below.
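A minimal sketch of the merge logic, using plain maps in place of the two Kafka Streams state stores. All names are illustrative, not from the talk:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;

// Sketch of the transitive-closure merge over the two mappings described
// above; in the Kafka Streams app these would be KeyValueStores.
public class TransitiveClosure {

    private final Map<String, String> recordToCluster = new HashMap<>();
    private final Map<String, List<String>> clusterToRecords = new HashMap<>();

    // Merge a classification result (record + its duplicates) into one cluster.
    public List<String> merge(String recordId, List<String> duplicateIds) {
        List<String> all = new ArrayList<>(duplicateIds);
        all.add(recordId);

        // Reuse the first existing cluster ID we find, otherwise start a new one.
        String clusterId = all.stream()
                .map(recordToCluster::get)
                .filter(Objects::nonNull)
                .findFirst()
                .orElse(recordId);

        List<String> members = clusterToRecords.computeIfAbsent(clusterId, k -> new ArrayList<>());
        for (String id : all) {
            String oldCluster = recordToCluster.put(id, clusterId);
            // If the record belonged to a different cluster, absorb that cluster.
            if (oldCluster != null && !oldCluster.equals(clusterId)) {
                List<String> absorbed = clusterToRecords.remove(oldCluster);
                if (absorbed != null) {
                    for (String member : absorbed) {
                        recordToCluster.put(member, clusterId);
                        if (!members.contains(member)) {
                            members.add(member);
                        }
                    }
                }
            }
            if (!members.contains(id)) {
                members.add(id);
            }
        }
        return members; // the updated cluster, emitted downstream
    }
}

Because any classification may bridge two previously separate clusters, the merge absorbs the old cluster's members into the surviving one, which keeps both mappings consistent.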
Transitive Closure
Worked example (state stores: Record ID → Cluster ID and Cluster ID → Record IDs):
▬ Classification "A2; A1, B1" arrives → A1: A, A2: A, B1: A and A: A1, A2, B1
▬ Classification "B2; B1" arrives → B1 already belongs to cluster A, so B2 joins it: B2: A and A: A1, A2, B1, B2
Clustering
[Diagram: transitive closure output "A1, A2, B1, B2" → Clustering app, backed by the Record Index → output clusters "A1, A2" and "B1, B2"]
Cluster Ordering
[Diagrams: the transitive closure emits "A1, A2, B1" with timestamp 0 and "A1, A2, B1, B2" with timestamp 1, and the two results are clustered in parallel on two partitions. Depending on which result reaches the output first, the final state is either the correct clusters (A1: A, A2: A, B1: B, B2: B) or the stale ones (A1: A, A2: A, B1: A).]
Cluster Ordering
Challenge: clustering does not preserve order if done in parallel
Solution (sketched below):
▬ Add a sequential ID, e.g., a timestamp, to the results of the transitive closure
▬ Parallelize the clustering
▬ Filter the results with a single-partitioned state store:
  ▬ For each record in a group of clusters, store the latest timestamp that has been seen
  ▬ For each group of clusters, check that its timestamp is not older than any timestamp that has been seen for its records
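A minimal sketch of the timestamp filter, again with a plain map in place of the single-partitioned state store. The names are illustrative, not from the talk:

import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the ordering filter: stale clustering results are dropped,
// accepted ones advance the per-record timestamps.
public class ClusterOrderingFilter {

    private final Map<String, Long> latestTimestamps = new HashMap<>();

    // Accept a group of clusters only if it is not older than anything we
    // have already seen for any of its records.
    public boolean accept(List<List<String>> clusters, long timestamp) {
        boolean stale = clusters.stream()
                .flatMap(List::stream)
                .anyMatch(id -> latestTimestamps.getOrDefault(id, Long.MIN_VALUE) > timestamp);
        if (stale) {
            return false; // a newer clustering already covered these records
        }
        clusters.stream()
                .flatMap(List::stream)
                .forEach(id -> latestTimestamps.put(id, timestamp));
        return true;
    }
}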
Cluster Ordering
Worked example (state store: Record ID → Timestamp):
▬ "A1, A2, B1, B2" (timestamp 1) arrives → clustered into A1, A2 and B1, B2 → nothing newer has been seen, so the store is set to A1: 1, A2: 1, B1: 1, B2: 1 and the clusters are emitted (A1: A, A2: A, B1: B, B2: B)
▬ "A1, A2, B1" (timestamp 0) arrives late → clustered into A1, A2, B1 → the stored timestamps (1) are newer, so the stale result is discarded
▬ "A1, A2, B1, B2, A3" (timestamp 2) arrives → clustered into A1, A2, A3 and B1, B2 → timestamp 2 is not older than anything stored, so the store is updated to 2 for all five records and the clusters are emitted (A1: A, A2: A, A3: A, B1: B, B2: B)
Overview
Pipeline: Candidate Generation → Classification → Transitive Closure → Clustering → Cluster Ordering
Fill Indices maintain the SN Index and the Record Index.
Deletions
Deletions run through the same pipeline:
▬ Record ID is the message key (for the stages up to the clustering)
▬ Cluster ID is the message key (for the stages from the clustering onwards)
Performance
▬ Up to 1,200 deduplications per second with 100 parallel classification pods
▬ Deduplication of 150,000,000 records in < 4 days
▬ Throughput limited by the memory available to the database and by classifier throughput
Deployment & Operations
▬ All apps developed and deployed with https://github.com/bakdata/streams-bootstrap
▬ Lag-based auto-scaling from 0 to the number of partitions, powered by KEDA: https://keda.sh/docs/2.1/scalers/apache-kafka/
▬ Database auto-scaling with Aurora Serverless: https://aws.amazon.com/rds/aurora/serverless/

autoscaling:
  enabled: true
  maxReplicas: 100
  consumergroup: news-deduplication-class
  lagThreshold: "10000"
Conclusion
▬ Modular architecture: each module scales independently
▬ High throughput: with regard to the complexity of deduplication
▬ Large-scale processing: hundreds of millions of records can be processed
▬ Scale to zero: if there is no load, we only pay for storage