Representations of the same data, e.g., news, persons, or places, differ across sources. We therefore need to identify duplicates, for example, if we want to stream deduplicated news from different sources into a sentiment classifier.
We built a system that collects data from different sources in a streaming fashion, aligns it to a global schema, and then detects duplicates within the data stream without time window constraints. The challenge is not only to process newly published data without significant delay, but also to reprocess hundreds of millions of existing messages, for example, after improving the similarity measure.
In this talk, we present our implementation for deduplication of data streams built on top of Kafka Streams. We leverage Kafka Streams APIs, in particular state stores, and use Kubernetes to auto-scale our application from zero to a defined maximum. This allows us to process live data immediately and also to reprocess all data from scratch within a reasonable amount of time.
Stream Data Deduplication Powered by Kafka Streams
Philipp Schirmer, Data Engineer, bakdata
Kafka Summit Europe 2021
Scenario
Pipeline: Alignment → Deduplication
Global schema:
▬ Headline
▬ Content
▬ Location
▬ Company
▬ Tags
▬ Author
▬ Date
▬ …
Deduplication is hard, especially if you want to distribute and streamify it for large-scale workloads.
Data: https://www.kaggle.com/c/kdd-cup-2013-author-paper-identification-challenge/data
Requirements
▬ Document streams: fewer but much larger messages than event streams
▬ Updates & deletions: only the most recent version of records should be used
▬ Frequent reprocessing: > 100 million existing records
▬ Reduce costs: scale to zero
Duplicate Detection
Given: a duplicate record classifier (black box; see the sketch below)
https://github.com/bakdata/dedupe
Naïve approach: compare every record to every other record
▬ Does not scale (quadratic complexity)
→ Pre-select plausible candidates
Pipeline: Classification
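Since the classifier is a black box, a minimal illustrative interface is enough to follow the rest of the pipeline. This is an assumption for exposition only, not the actual API of bakdata/dedupe:

// Illustrative only: a minimal black-box duplicate classifier interface.
// NOT the actual API of https://github.com/bakdata/dedupe.
public interface DuplicateClassifier<T> {

    // Similarity score between 0 and 1 for a pair of records;
    // higher means more likely to be duplicates.
    double classify(T left, T right);
}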
Candidate Generation using Sorted Neighborhood
Goal: only classify similar records.
Assumption: when sorted, similar records will be close together.
Solution: only classify records within a fixed window of the sorted order
▬ Linear complexity
What is the best sorting criterion? Use multiple!
Pipeline: Candidate Generation → Classification
Example, sorted by company name: Alibaba, Alpabet, Alphabet, Alphabet Inc, Alphabet Inc., Amazon, Amazon.com
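A minimal sketch of the windowing step over an already sorted list. The names and the window handling are illustrative, not the talk's implementation:

import java.util.ArrayList;
import java.util.List;

// Sketch of sorted-neighborhood candidate generation: the input is assumed
// to be sorted by the chosen sorting key already.
public class SortedNeighborhood {

    record Candidate(String recordId, String neighborId) {}

    // Emit one candidate pair per record and each neighbor within the window.
    static List<Candidate> candidates(List<String> sortedIds, int windowSize) {
        List<Candidate> pairs = new ArrayList<>();
        for (int i = 0; i < sortedIds.size(); i++) {
            // Only look ahead: each pair is generated exactly once.
            for (int j = i + 1; j <= i + windowSize && j < sortedIds.size(); j++) {
                pairs.add(new Candidate(sortedIds.get(i), sortedIds.get(j)));
            }
        }
        return pairs;
    }
}

With a window of two, only pairs of records at most two positions apart in the sorted order are classified, which keeps the number of comparisons linear in the number of records.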
Transitivity
If A and B are duplicates, and B and C are duplicates, then A and C have to be duplicates.
→ Ensure transitivity by computing the transitive closure
Pipeline: Candidate Generation → Classification → Transitive Closure
Transitive Closure
All records within a connected component are considered duplicates
▬ Also improves recall
[Diagram: records A1, A2, A3 connected by similarity scores 0.9 and 0.8, forming one connected component: Entity A]
Transitive Closure
All records within a connected component are considered duplicates
But: error prone
[Diagram: similarity edges (0.8–0.9) chain the records of Entity A (A1, A2, A3) to the records of Entity B (B1, B2), so the transitive closure wrongly merges both entities into one component]
Clustering
Goal: find strongly connected components (a sketch follows below)
▬ Classify all records within a cluster
▬ Weight negative scores higher (precision over recall)
[Diagram: the merged component is split back into Entity A (A1, A2, A3) and Entity B (B1, B2)]
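The talk does not spell out the clustering algorithm itself; purely to illustrate weighting negative scores higher, here is a greedy sketch in which the threshold, the penalty factor, and all names are assumptions:

import java.util.ArrayList;
import java.util.List;
import java.util.function.ToDoubleBiFunction;

// Rough illustration only: greedy clustering of one connected component,
// weighting negative evidence (scores below the threshold) higher so that
// precision is favored over recall. Not the talk's actual algorithm.
public class Clustering {

    private static final double THRESHOLD = 0.5;
    private static final double NEGATIVE_WEIGHT = 2.0; // assumed penalty factor

    static <T> List<List<T>> cluster(List<T> component, ToDoubleBiFunction<T, T> classifier) {
        List<List<T>> clusters = new ArrayList<>();
        for (T record : component) {
            List<T> best = null;
            double bestScore = 0.0;
            for (List<T> candidate : clusters) {
                double score = weightedScore(record, candidate, classifier);
                if (score > THRESHOLD && score > bestScore) {
                    best = candidate;
                    bestScore = score;
                }
            }
            if (best == null) {
                best = new ArrayList<>();
                clusters.add(best);
            }
            best.add(record);
        }
        return clusters;
    }

    // Average pairwise score against a cluster; sub-threshold scores count
    // with extra weight so that one bad match can veto a merge.
    private static <T> double weightedScore(T record, List<T> cluster,
            ToDoubleBiFunction<T, T> classifier) {
        double total = 0.0;
        double weights = 0.0;
        for (T member : cluster) {
            double score = classifier.applyAsDouble(record, member);
            double weight = score < THRESHOLD ? NEGATIVE_WEIGHT : 1.0;
            total += weight * score;
            weights += weight;
        }
        return total / weights;
    }
}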
Overview
Pipeline: Candidate Generation → Classification → Transitive Closure → Clustering
Each step is implemented as a Kafka Streams app.
Candidate Generation
Sorted neighborhood index (record IDs in parentheses):
▬ Alibaba (ali1)
▬ Alpabet (alpha1)
▬ Alphabet (alpha2)
▬ Alphabet Inc (alpha4) ← newly inserted
▬ Alphabet Inc. (alpha3)
▬ Amazon (amzn1)
▬ Amazon.com (amzn2)
Inserting "Alphabet Inc" (alpha4) emits the candidate set "alpha4; alpha1, alpha2, alpha3, amzn1": its two neighbors in each direction of the sorted order.

How to implement a sorted doubly linked list in Kafka Streams?
▬ Use a Kafka Streams state store backed by RocksDB ✘
  ▬ Supports forwards and backwards iteration
  ▬ Partitioned → the relevant neighborhood may not be on the local partition
▬ Use a Kafka Streams global store backed by RocksDB ✘
  ▬ Supports forwards and backwards iteration
  ▬ Unpartitioned and replicated to every node (> 100 GB)
  ▬ Replication latency
▬ Use an SQL database, e.g., PostgreSQL ✔
  ▬ Indices allow quick forwards and backwards neighborhood queries
  ▬ Transactional guarantees for inserts
  ▬ But: network and disk overhead
A sketch of the SQL-backed neighborhood query follows below.
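A minimal JDBC sketch, assuming an sn_index table with sort_key and record_id columns and an index on sort_key. All names are illustrative, not from the talk:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

// Sketch of the SQL-backed sorted neighborhood index: an index on sort_key
// makes both the forwards and the backwards query fast.
public class SnIndex {

    private final Connection connection;

    public SnIndex(Connection connection) {
        this.connection = connection;
    }

    // Fetch up to 'window' record IDs on one side of the given sort key.
    public List<String> neighbors(String sortKey, int window, boolean forward)
            throws SQLException {
        String sql = forward
                ? "SELECT record_id FROM sn_index WHERE sort_key > ? ORDER BY sort_key ASC LIMIT ?"
                : "SELECT record_id FROM sn_index WHERE sort_key < ? ORDER BY sort_key DESC LIMIT ?";
        List<String> ids = new ArrayList<>();
        try (PreparedStatement stmt = connection.prepareStatement(sql)) {
            stmt.setString(1, sortKey);
            stmt.setInt(2, window);
            try (ResultSet rs = stmt.executeQuery()) {
                while (rs.next()) {
                    ids.add(rs.getString("record_id"));
                }
            }
        }
        return ids;
    }
}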
Candidate Generation
[Diagram: Records topic → Sorted Neighborhood app, backed by the SN Index → output candidates, e.g., "A2; A1, B1, B2"]
Overview
Pipeline: Candidate Generation → Classification → Transitive Closure → Clustering
A Fill Index app maintains the SN Index used by Candidate Generation.
Classification
[Diagram: candidates "A2; A1, B1, B2" → Classification app, backed by the Record Index → output "A2; A1, B1"]
Overview
Pipeline: Candidate Generation → Classification → Transitive Closure → Clustering
Fill Indices maintain the SN Index and the Record Index.
Transitive Closure
Two Kafka Streams state stores maintain the transitive closure:
▬ Record ID → Cluster ID
▬ Cluster ID → Record IDs
Single-partitioned, because we don't know ahead of time which records will end up being duplicates
But: not a problem, as the operation is very cheap and the state is small
A sketch of the merge logic follows below.
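A minimal sketch of the merge logic, using plain maps in place of the two Kafka Streams state stores. All names are illustrative, not from the talk:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;

// Sketch of the transitive-closure merge over the two mappings described
// above; in the Kafka Streams app these would be KeyValueStores.
public class TransitiveClosure {

    private final Map<String, String> recordToCluster = new HashMap<>();
    private final Map<String, List<String>> clusterToRecords = new HashMap<>();

    // Merge a classification result (record + its duplicates) into one cluster.
    public List<String> merge(String recordId, List<String> duplicateIds) {
        List<String> all = new ArrayList<>(duplicateIds);
        all.add(recordId);

        // Reuse the first existing cluster ID we find, otherwise start a new one.
        String clusterId = all.stream()
                .map(recordToCluster::get)
                .filter(Objects::nonNull)
                .findFirst()
                .orElse(recordId);

        List<String> members = clusterToRecords.computeIfAbsent(clusterId, k -> new ArrayList<>());
        for (String id : all) {
            String oldCluster = recordToCluster.put(id, clusterId);
            // If the record belonged to a different cluster, absorb that cluster.
            if (oldCluster != null && !oldCluster.equals(clusterId)) {
                List<String> absorbed = clusterToRecords.remove(oldCluster);
                if (absorbed != null) {
                    for (String member : absorbed) {
                        recordToCluster.put(member, clusterId);
                        if (!members.contains(member)) {
                            members.add(member);
                        }
                    }
                }
            }
            if (!members.contains(id)) {
                members.add(id);
            }
        }
        return members; // the updated cluster, emitted downstream
    }
}

Because any classification may bridge two previously separate clusters, the merge absorbs the old cluster's members into the surviving one, which keeps both mappings consistent.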
Transitive Closure
Worked example (state stores: Record ID → Cluster ID and Cluster ID → Record IDs):
▬ Classification "A2; A1, B1" arrives → A1: A, A2: A, B1: A and A: A1, A2, B1
▬ Classification "B2; B1" arrives → B1 already belongs to cluster A, so B2 joins it: B2: A and A: A1, A2, B1, B2
Clustering
[Diagram: transitive closure output "A1, A2, B1, B2" → Clustering app, backed by the Record Index → output clusters "A1, A2" and "B1, B2"]
Cluster Ordering
[Diagrams: the transitive closure emits "A1, A2, B1" with timestamp 0 and "A1, A2, B1, B2" with timestamp 1, and the two results are clustered in parallel on two partitions. Depending on which result reaches the output first, the final state is either the correct clusters (A1: A, A2: A, B1: B, B2: B) or the stale ones (A1: A, A2: A, B1: A).]
Cluster Ordering
Challenge: clustering does not preserve order if done in parallel
Solution (sketched below):
▬ Add a sequential ID, e.g., a timestamp, to the results of the transitive closure
▬ Parallelize the clustering
▬ Filter the results with a single-partitioned state store:
  ▬ For each record in a group of clusters, store the latest timestamp that has been seen
  ▬ For each group of clusters, check that its timestamp is not older than any timestamp that has been seen for its records
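A minimal sketch of the timestamp filter, again with a plain map in place of the single-partitioned state store. The names are illustrative, not from the talk:

import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the ordering filter: stale clustering results are dropped,
// accepted ones advance the per-record timestamps.
public class ClusterOrderingFilter {

    private final Map<String, Long> latestTimestamps = new HashMap<>();

    // Accept a group of clusters only if it is not older than anything we
    // have already seen for any of its records.
    public boolean accept(List<List<String>> clusters, long timestamp) {
        boolean stale = clusters.stream()
                .flatMap(List::stream)
                .anyMatch(id -> latestTimestamps.getOrDefault(id, Long.MIN_VALUE) > timestamp);
        if (stale) {
            return false; // a newer clustering already covered these records
        }
        clusters.stream()
                .flatMap(List::stream)
                .forEach(id -> latestTimestamps.put(id, timestamp));
        return true;
    }
}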
Cluster Ordering
Worked example (state store: Record ID → Timestamp):
▬ "A1, A2, B1, B2" (timestamp 1) arrives → clustered into A1, A2 and B1, B2 → nothing newer has been seen, so the store is set to A1: 1, A2: 1, B1: 1, B2: 1 and the clusters are emitted (A1: A, A2: A, B1: B, B2: B)
▬ "A1, A2, B1" (timestamp 0) arrives late → clustered into A1, A2, B1 → the stored timestamps (1) are newer, so the stale result is discarded
▬ "A1, A2, B1, B2, A3" (timestamp 2) arrives → clustered into A1, A2, A3 and B1, B2 → timestamp 2 is not older than anything stored, so the store is updated to 2 for all five records and the clusters are emitted (A1: A, A2: A, A3: A, B1: B, B2: B)
Overview
Pipeline: Candidate Generation → Classification → Transitive Closure → Clustering → Cluster Ordering
Fill Indices maintain the SN Index and the Record Index.
Deletions
Deletions run through the same pipeline:
▬ Record ID is the message key (for the stages up to the clustering)
▬ Cluster ID is the message key (for the stages from the clustering onwards)
Performance
▬ Up to 1,200 deduplications per second with 100 parallel classification pods
▬ Deduplication of 150,000,000 records in < 4 days
▬ Throughput limited by the memory available to the database and by classifier throughput
Deployment & Operations
▬ All apps developed and deployed with https://github.com/bakdata/streams-bootstrap
▬ Lag-based auto-scaling from 0 to the number of partitions, powered by KEDA: https://keda.sh/docs/2.1/scalers/apache-kafka/
▬ Database auto-scaling with Aurora Serverless: https://aws.amazon.com/rds/aurora/serverless/

autoscaling:
  enabled: true
  maxReplicas: 100
  consumergroup: news-deduplication-class
  lagThreshold: "10000"
Conclusion
▬ Modular architecture: each module scales independently
▬ High throughput: with regard to the complexity of deduplication
▬ Large-scale processing: hundreds of millions of records can be processed
▬ Scale to zero: if there is no load, we only pay for storage