Crossing the Streams:
Foreign-Key Joins with Kafka Streams
John Roesler
Software Engineer @ Confluent
Agenda
01. The missing join: Foreign-Key Join
02. The current join: Equi Join
03. The problem with FK Join
04. The solution for FK Join
05. Testing
06. Case Study: Bazaarvoice
albums
AlbumId
Title
ArtistId
tracks
TrackId
Name
AlbumId
Composer
Bytes
UnitPrice
Foreign-Key Join
3
SELECT * FROM Tracks
JOIN Albums ON Tracks.AlbumId = Albums.AlbumId
albums
AlbumId
Title
ArtistId
tracks
TrackId
Name
AlbumId
Composer
Bytes
UnitPrice
Foreign-Key Join
4
Primary
Foreign
SELECT * FROM Tracks
JOIN Albums ON Tracks.AlbumId = Albums.AlbumId
albums
AlbumId
Title
ArtistId
tracks
TrackId
Name
AlbumId
Composer
Bytes
UnitPrice
Foreign-Key Join
5
Primary
Foreign
JOIN
SELECT * FROM Tracks
JOIN Albums ON Tracks.AlbumId = Albums.AlbumId
Foreign-Key Join
6
KTable<TrackId, Track> tracks = …
KTable<AlbumId, Album> albums = …
KTable<TrackId, TrackWithAlbum> tracksWithAlbums =
tracks.join(albums,
Track::getAlbumId,
TrackWithAlbum::joiner);
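To make the semantics of the three arguments concrete, here is a minimal plain-Java sketch of what a foreign-key join computes over two in-memory maps. It is not Kafka Streams code: the `fkJoin` helper, the `FkJoinSketch` class, and the sample track and album values are hypothetical, chosen only to mirror the other table, the foreign-key extractor, and the joiner on the slide.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.BiFunction;
import java.util.function.Function;

public class FkJoinSketch {
    // Inner foreign-key join over two in-memory "tables" (illustration only):
    // for each left row, extract the foreign key, look up the right row,
    // and emit the joined value under the LEFT row's primary key.
    static <K, V, KO, VO, R> Map<K, R> fkJoin(
            Map<K, V> left,
            Map<KO, VO> right,
            Function<V, KO> foreignKeyExtractor,
            BiFunction<V, VO, R> joiner) {
        Map<K, R> result = new LinkedHashMap<>();
        for (Map.Entry<K, V> row : left.entrySet()) {
            VO other = right.get(foreignKeyExtractor.apply(row.getValue()));
            if (other != null) { // inner join: left rows without a match are dropped
                result.put(row.getKey(), joiner.apply(row.getValue(), other));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical data: track id -> album id, and album id -> album title.
        Map<String, String> tracks = new LinkedHashMap<>();
        tracks.put("t1", "a1");
        tracks.put("t2", "a1");
        tracks.put("t3", "a9"); // refers to an album that does not exist
        Map<String, String> albums = Map.of("a1", "Album One");

        // Many tracks can point at the same album; the result stays keyed by track id.
        System.out.println(fkJoin(tracks, albums,
                albumId -> albumId,
                (albumId, title) -> albumId + "/" + title));
    }
}
```

The result keeps the left table's key, which is why the slide's output type is a `KTable` keyed by `TrackId`.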
Agenda
7
01. The missing join: Foreign-Key Join
02. The current join: Equi Join
03. The problem with FK Join
04. The solution for FK Join
05. Testing
06. Case Study: Bazaarvoice
track-meta
TrackId
Name
AlbumId
Composer
Bytes
Equi Join
8
track-pricing
TrackId
UnitPrice
tracks
TrackId
Name
AlbumId
Composer
Bytes
UnitPrice
JOIN
Equi Join
KTable<TrackId, TrackMeta> tracksMetadata = …
KTable<TrackId, TrackStore> tracksPricing = …
KTable<TrackId, Track> tracks =
tracksMetadata.join(tracksPricing,
Track::joiner);
9
A: 9
B: 2
C: 4
A: 6
D: 8
A: 9
C: 4
A: 6
B: 2
D: 8
Partition 0 Partition 1
Big Data Processing == Partitioning
10
A: 9
B: 2
C: 4
A: 6
D: 8
Partition 0 Partition 1
A: α
B: β
C: γ
A: ξ
D: σ
Left Right
A: 9
C: 4
A: 6
A: α
C: γ
A: ξ
Left Right
B: 2
D: 8
B: β
D: σ
Left Right
A: (9,α)
C: (4,γ)
A: (6,ξ)
Join
B: (2,β)
D: (8,σ)
Join
Partitioned Equi Join
11
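The partitioned equi join in the diagram can be sketched in plain Java as follows. This is an illustration under assumptions, not Kafka code: `partitionFor` is a toy partitioner based on `String.hashCode()`, so keys may land on different partitions than in the picture, and the tables hold only the latest value per key. The one property that matters is that both sides use the same key and the same partitioner, so each partition can join locally.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class PartitionedEquiJoinSketch {
    // Toy partitioner (assumption): any deterministic function of the key works,
    // as long as both tables use the same one.
    static int partitionFor(String key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    // Split a table into per-partition tables by key.
    static List<Map<String, String>> partition(Map<String, String> table, int numPartitions) {
        List<Map<String, String>> parts = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            parts.add(new TreeMap<>());
        }
        table.forEach((k, v) -> parts.get(partitionFor(k, numPartitions)).put(k, v));
        return parts;
    }

    // Inner equi join of two tables that live on the same partition.
    static Map<String, String> joinLocally(Map<String, String> left, Map<String, String> right) {
        Map<String, String> joined = new TreeMap<>();
        left.forEach((k, l) -> {
            String r = right.get(k);
            if (r != null) {
                joined.put(k, "(" + l + "," + r + ")");
            }
        });
        return joined;
    }

    public static void main(String[] args) {
        // Latest value per key, loosely following the slide (Greek letters spelled out).
        Map<String, String> left = Map.of("A", "6", "B", "2", "C", "4", "D", "8");
        Map<String, String> right = Map.of("A", "xi", "B", "beta", "C", "gamma", "D", "sigma");
        int n = 2;
        List<Map<String, String>> leftParts = partition(left, n);
        List<Map<String, String>> rightParts = partition(right, n);
        for (int p = 0; p < n; p++) {
            // No partition ever needs data held by another partition.
            System.out.println("Partition " + p + ": "
                    + joinLocally(leftParts.get(p), rightParts.get(p)));
        }
    }
}
```

Because matching rows are co-located, the union of the per-partition results equals the join of the whole tables. That property fails once the join key on the left is a foreign key rather than the record key.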
Agenda
12
01. The missing join: Foreign-Key Join
02. The current join: Equi Join
03. The problem with FK Join
04. The solution for FK Join
05. Testing
06. Case Study: Bazaarvoice
A: 9
B: 2
C: 4
A: 6
D: 9
Partition 0 Partition 1
Left Right
A: 9
C: 4
A: 6
Left Right
B: 2
D: 9
Left Right
Join Join
9: α
4: β
3: γ
6: ξ
9: σ
? ?? ?
Partitioned Foreign-Key Join?
13
A: 9
B: 2
C: 4
A: 6
D: 8
Partition 0 Partition 1
Left Right
A: 9
C: 4
A: 6
Left
B: 2
D: 8
Left
9: α
4: β
3: γ
6: ξ
9: σ
Partitioned Foreign-Key Join
Partition 0 Partition 1
9: α
9: σ
Right
4: β
3: γ
6: ξ
Right
14
Agenda
15
01. The missing join: Foreign-Key Join
02. The current join: Equi Join
03. The problem with FK Join
04. The solution for FK Join
05. Testing
06. Case Study: Bazaarvoice
Partitioned Foreign-Key Join
A: 9
B: 9
C: 4
A: 6
D: 8
9: α
4: β
3: γ
6: ξ
9: σ
Left Right
9: A
9: B
4: C
6: A
8: D
Subscriptions
A: α
B: α
C: β
A: ξ
D: null
updates
A: (9,α)
B: (9,α)
C: (4,β)
A: (6,ξ)
D: (8,null)
Join
subscribe
update
16
Partitioned Foreign-Key Join
A: 9 9: α
Left Right
9: A
Subscriptions
updates
A: (9,α)
Join
subscribe
update
17
Partitioned Foreign-Key Join
A: 9
B: 9
9: α
Left Right
9: A
Subscriptions
updates
A: (9,α)
Join
subscribe
update
18
Partitioned Foreign-Key Join
A: 9
B: 9
9: α
Left Right
9: A
Subscriptions
updates
A: (9,α)
Join
subscribe
update
9:B
19
Partitioned Foreign-Key Join
A: 9
B: 9
9: α
Left Right
9: A
9: B
Subscriptions
updates
A: (9,α)
Join
subscribe
update
20
Partitioned Foreign-Key Join
A: 9
B: 9
9: α
Left Right
9: A
9: B
Subscriptions
updates
A: (9,α)
Join
subscribe
update
B: α
21
Partitioned Foreign-Key Join
A: 9
B: 9
9: α
Left Right
9: A
9: B
Subscriptions
updates
A: (9,α)
Join
subscribe
update
B: α
22
Partitioned Foreign-Key Join
A: 9
B: 9
9: α
Left Right
9: A
9: B
Subscriptions
B: α
updates
A: (9,α)
Join
subscribe
update
23
Partitioned Foreign-Key Join
A: 9
B: 9
9: α
Left Right
9: A
9: B
Subscriptions
B: α
updates
A: (9,α)
B: (9,α)
Join
subscribe
update
24
Partitioned Foreign-Key Join
A: 9
B: 9
9: α
Left Right
9: A
9: B
Subscriptions
updates
A: (9,α)
B: (9,α)
Join
subscribe
update
25
Partitioned Foreign-Key Join
A: 9
B: 9
9: α
9: β
Left Right
9: A
9: B
Subscriptions
updates
A: (9,α)
B: (9,α)
Join
subscribe
update
26
Partitioned Foreign-Key Join
A: 9
B: 9
9: α
9: β
Left Right
9: A
9: B
Subscriptions
updates
A: (9,α)
B: (9,α)
Join
subscribe
update
A: β
B: β
27
Partitioned Foreign-Key Join
A: 9
B: 9
9: α
9: β
Left Right
9: A
9: B
Subscriptions
updates
A: (9,α)
B: (9,α)
Join
subscribe
update
A: β
B: β
28
Partitioned Foreign-Key Join
A: 9
B: 9
9: α
9: β
Left Right
9: A
9: B
Subscriptions
A: β
B: β
updates
A: (9,α)
B: (9,α)
Join
subscribe
update
29
Partitioned Foreign-Key Join
A: 9
B: 9
9: α
9: β
Left Right
9: A
9: B
Subscriptions
A: β
B: β
updates
A: (9,α)
B: (9,α)
A: (9,β)
B: (9,β)
Join
subscribe
update
30
Partitioned Foreign-Key Join
A: 9
B: 9
9: β
Left Right
9: A
9: B
Subscriptions
updates
A: (9,β)
B: (9,β)
Join
subscribe
update
31
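The subscribe/update walk-through above can be replayed with a small single-process Java sketch. It is a simplification under stated assumptions: the `SubscriptionJoinSketch` class and its method names are hypothetical, the "messages" between the left and right sides are plain method calls, and the repartition topics, state stores, and stale-response handling of the real implementation are not modeled. Like the diagram, it emits a `null` right value when no right record exists yet.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

public class SubscriptionJoinSketch {
    // Left side: primary key -> foreign key (the slides' "A: 9").
    private final Map<String, Integer> left = new HashMap<>();
    // Right side: foreign key -> value (the slides' "9: alpha").
    private final Map<Integer, String> right = new HashMap<>();
    // Right side: foreign key -> subscribed left keys (the slides' "9: A").
    private final Map<Integer, Set<String>> subscriptions = new HashMap<>();
    // Join output, keyed by the left primary key.
    final Map<String, String> result = new TreeMap<>();

    // A left record arrives: subscribe to its foreign key, then receive the current right value.
    void updateLeft(String pk, int fk) {
        Integer oldFk = left.put(pk, fk);
        if (oldFk != null && oldFk != fk) {
            // The foreign key changed, so drop the old subscription.
            Set<String> old = subscriptions.get(oldFk);
            if (old != null) {
                old.remove(pk);
            }
        }
        subscriptions.computeIfAbsent(fk, k -> new TreeSet<>()).add(pk); // "subscribe"
        onUpdate(pk, right.get(fk));                                     // "update"
    }

    // A right record arrives: store it and notify every subscriber of that foreign key.
    void updateRight(int fk, String value) {
        right.put(fk, value);
        for (String pk : subscriptions.getOrDefault(fk, Set.of())) {
            onUpdate(pk, value);
        }
    }

    // Back on the left side: join the update with the current left record.
    private void onUpdate(String pk, String rightValue) {
        result.put(pk, "(" + left.get(pk) + "," + rightValue + ")");
    }

    public static void main(String[] args) {
        SubscriptionJoinSketch join = new SubscriptionJoinSketch();
        join.updateRight(9, "alpha");
        join.updateLeft("A", 9);
        join.updateLeft("B", 9);
        System.out.println(join.result); // both A and B joined with alpha
        join.updateRight(9, "beta");     // one right update fans out to both subscribers
        System.out.println(join.result);
    }
}
```

The subscription store is what lets a single right-side update reach every left record that references it, without the right side ever holding the left records themselves.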
Agenda
32
01. The missing join: Foreign-Key Join
02. The current join: Equi Join
03. The problem with FK Join
04. The solution for FK Join
05. Testing
06. Case Study: Bazaarvoice
Testing
KTable<TrackId, Track> tracks = …
KTable<AlbumId, Album> albums = …
KTable<TrackId, TrackWithAlbum> tracksWithAlbums =
tracks.join(albums,
Track::getAlbumId,
TrackWithAlbum::joiner);
33
Testing
try(driver = new TopologyTestDriver(...)) {
trackInput = driver.createInputTopic(...)
albumInput = driver.createInputTopic(...)
result = driver.createOutputTopic(...)
}
34
Testing
try(driver = new TopologyTestDriver(...)) {
trackInput = driver.createInputTopic(...)
albumInput = driver.createInputTopic(...)
result = driver.createOutputTopic(...)
trackInput.pipeInput("t1", new Track("a1"))
trackInput.pipeInput("t2", new Track("a1"))
albumInput.pipeInput("a1", new Album(...))
}
35
Testing
try(driver = new TopologyTestDriver(...)) {
trackInput = driver.createInputTopic(...)
albumInput = driver.createInputTopic(...)
result = driver.createOutputTopic(...)
trackInput.pipeInput("t1", new Track("a1"))
trackInput.pipeInput("t2", new Track("a1"))
albumInput.pipeInput("a1", new Album(...))
assertThat(
result.readKeyValuesToMap(),
is(map(
"t1": pair(track1, album1),
"t2": pair(track2, album1)
))
);
}
36
Agenda
37
01. The missing join: Foreign-Key Join
02. The current join: Equi Join
03. The problem with FK Join
04. The solution for FK Join
05. Testing
06. Case Study: Bazaarvoice
Case Study: Bazaarvoice
● Early Relational Streaming adopter
○ In-house streaming platform
○ Periodic bulk DB query jobs
○ Spark, Hadoop, etc.
● Large dataset, healthy update rate
○ 100s of Millions of Products
○ 100s of Billions of Reviews
○ Updates: 10s of Millions a day, at least
○ Views: ludicrous
● Join-heavy workload (high cardinality)
○ Product -> Review fan-out can be 100s of Millions
38
Case Study: Bazaarvoice
● Product
○ Name
○ Description
○ URL
○ Average Rating
● Review
○ ProductId
○ Text
○ Rating
○ Product Name
39
Average Rating (aggregation)
KTable<ReviewId, Review> reviews;
KTable<ProductId, Product> products;
KTable<ProductId, Double> avgRatings =
reviews
.groupBy(Review::getProductId)
.reduce(averageRatings)
KTable<ProductId, ViewProduct> result =
avgRatings.join(products)
40
reviews
products
avgRatings
groupBy(productId)
reduce(avg)
result
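The slide's `reduce(averageRatings)` is shorthand. An average cannot be maintained from averages alone, so one common approach, sketched here as an assumption rather than as Bazaarvoice's actual code, is to keep a running sum and count per product, subtracting a review's old contribution before adding its new one. In Kafka Streams this corresponds to the adder and subtractor that table aggregations take. The `AverageRatingSketch` class below is a hypothetical plain-Java model of that bookkeeping.

```java
import java.util.HashMap;
import java.util.Map;

public class AverageRatingSketch {
    // Running aggregate per product: the sum of ratings and the number of reviews.
    static final class SumCount {
        long sum;
        long count;
    }

    private final Map<String, String> reviewProduct = new HashMap<>();  // reviewId -> productId
    private final Map<String, Integer> reviewRating = new HashMap<>();  // reviewId -> rating
    private final Map<String, SumCount> perProduct = new HashMap<>();   // productId -> aggregate

    // Insert or update a review (table semantics: the latest value per reviewId wins).
    void upsertReview(String reviewId, String productId, int rating) {
        String oldProduct = reviewProduct.put(reviewId, productId);
        Integer oldRating = reviewRating.put(reviewId, rating);
        if (oldProduct != null) {
            // "Subtractor": remove the old contribution from the old product's group.
            SumCount old = perProduct.get(oldProduct);
            old.sum -= oldRating;
            old.count--;
        }
        // "Adder": add the new contribution to the (possibly different) product's group.
        SumCount current = perProduct.computeIfAbsent(productId, k -> new SumCount());
        current.sum += rating;
        current.count++;
    }

    // Average rating for a product, or NaN when it has no reviews.
    double averageFor(String productId) {
        SumCount sc = perProduct.get(productId);
        return (sc == null || sc.count == 0) ? Double.NaN : (double) sc.sum / sc.count;
    }

    public static void main(String[] args) {
        AverageRatingSketch ratings = new AverageRatingSketch();
        ratings.upsertReview("r1", "p1", 5);
        ratings.upsertReview("r2", "p1", 3);
        System.out.println(ratings.averageFor("p1"));
        ratings.upsertReview("r1", "p1", 1); // an edited review replaces its old rating
        System.out.println(ratings.averageFor("p1"));
    }
}
```

The aggregate per product stays constant in size regardless of how many reviews a product has, which is the contrast with the set-collecting workaround shown later in the deck.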
Case Study: Bazaarvoice
● Product
○ Name
○ Description
○ URL
○ Average Rating
● Review
○ ProductId
○ Text
○ Rating
○ Product Name
41
Product Name (join)
KTable<ProductId, Set<ReviewId>> productReviews =
reviews
.groupBy(Review::getProductId)
.reduce(collectReviewIdsSet)
42
groupBy(productId)
reduce(collect set)
all reviews for
each product
reviews
Product Name (join)
KTable<ProductId, Set<ReviewId>> productReviews =
reviews
.groupBy(Review::getProductId)
.reduce(collectReviewIdsSet)
KTable<ProductId,
Pair<String, Set<ReviewId>>> toExplode =
products
.mapValues(Product::getName)
.join(productReviews)
43
groupBy(productId)
reduce(collect set)
products
all reviews for
each product
all reviews and
product name for
each product
reviews
Product Name (join)
KTable<ProductId, Set<ReviewId>> productReviews =
reviews
.groupBy(Review::getProductId)
.reduce(collectReviewIdsSet)
KTable<ProductId,
Pair<String, Set<ReviewId>>> toExplode =
products
.mapValues(Product::getName)
.join(productReviews)
KTable<ReviewId, String> reviewsToProductNames =
toExplode.flatMap( name, reviewSet ->
for (reviewId : reviewSet)
forward(reviewId, name);
)
44
groupBy(productId)
reduce(collect set)
products
all reviews for
each product
all reviews and
product name for
each product
product name for
each review
reviews
Product Name (join)
KTable<ProductId, Set<ReviewId>> productReviews =
reviews
.groupBy(Review::getProductId)
.reduce(collectReviewIdsSet)
KTable<ProductId,
Pair<String, Set<ReviewId>>> toExplode =
products
.mapValues(Product::getName)
.join(productReviews)
KTable<ReviewId, String> reviewsToProductNames =
toExplode.flatMap( name, reviewSet ->
for (reviewId : reviewSet)
forward(reviewId, name);
)
KTable<ReviewId, ViewReview> result =
reviews.join(reviewsToProductNames)
45
groupBy(productId)
reduce(collect set)
products
all reviews for
each product
all reviews and
product name for
each product
product name for
each review
result
reviews
Foreign-Key Join
A: 9
B: 9
9: β
Left Right
9: A
9: B
Subscriptions
updates
A: (9,β)
B: (9,β)
Join
subscribe
update
46
Product Name (join)
KTable<ProductId, Set<ReviewId>> productReviews =
reviews
.groupBy(Review::getProductId)
.reduce(collectReviewIdsSet)
KTable<ProductId,
Pair<String, Set<ReviewId>>> toExplode =
products
.mapValues(Product::getName)
.join(productReviews)
KTable<ReviewId, String> reviewsToProductNames =
toExplode.flatMap( name, reviewSet ->
for (reviewId : reviewSet)
forward(reviewId, name);
)
KTable<ReviewId, ViewReview> result =
reviews.join(reviewsToProductNames)
repartition
repartition
47
Foreign-Key Join
A: 9
B: 9
9: β
Left Right
9: A
9: B
Subscriptions
updates
A: (9,β)
B: (9,β)
Join
subscribe
update
48
Product Name (join)
KTable<ProductId, Set<ReviewId>> productReviews =
reviews
.groupBy(Review::getProductId)
.reduce(collectReviewIdsSet)
KTable<ProductId,
Pair<String, Set<ReviewId>>> toExplode =
products
.mapValues(Product::getName)
.join(productReviews)
KTable<ReviewId, String> reviewsToProductNames =
toExplode.flatMap( name, reviewSet ->
for (reviewId : reviewSet)
forward(reviewId, name);
)
KTable<ReviewId, ViewReview> result =
reviews.join(reviewsToProductNames)
repartition
repartition
store and
transmit
entire set
49
Product Name (join)
KTable<ProductId, Set<ReviewId>> productReviews =
reviews
.groupBy(Review::getProductId)
.reduce(collectReviewIdsSet)
KTable<ProductId,
Pair<String, Set<ReviewId>>> toExplode =
products
.mapValues(Product::getName)
.join(productReviews)
KTable<ReviewId, String> reviewsToProductNames =
toExplode.flatMap( name, reviewSet ->
for (reviewId : reviewSet)
forward(reviewId, name);
)
KTable<ReviewId, ViewReview> result =
reviews.join(reviewsToProductNames)
50
Product Name (join)
KTable<ProductId, String> productNames =
products.mapValues(Product::getName)
KTable<ReviewId, ViewReview> result =
reviews.join(productNames,
Review::getProductId)
51
Coming soon to ksqlDB!
SELECT * FROM
Reviews JOIN Products
ON Reviews.ProductID = Products.ID
52
Thanks to the authors of KIP-213!
● Jan Filipiak (Oct 2017)
● Adam Bellemare (July 2018)
● Accepted Oct 2019
● Released in 2.4.0 Dec 2019
53
Thank you!
john@confluent.io
vvcephei@apache.org
cnfl.io/meetups
cnfl.io/slack
cnfl.io/blog

 
Harnessing the Power of NLP and Knowledge Graphs for Opioid Research
Harnessing the Power of NLP and Knowledge Graphs for Opioid ResearchHarnessing the Power of NLP and Knowledge Graphs for Opioid Research
Harnessing the Power of NLP and Knowledge Graphs for Opioid Research
 
Taking AI to the Next Level in Manufacturing.pdf
Taking AI to the Next Level in Manufacturing.pdfTaking AI to the Next Level in Manufacturing.pdf
Taking AI to the Next Level in Manufacturing.pdf
 
Artificial Intelligence and Electronic Warfare
Artificial Intelligence and Electronic WarfareArtificial Intelligence and Electronic Warfare
Artificial Intelligence and Electronic Warfare
 
JavaLand 2024: Application Development Green Masterplan
JavaLand 2024: Application Development Green MasterplanJavaLand 2024: Application Development Green Masterplan
JavaLand 2024: Application Development Green Masterplan
 
Energy Efficient Video Encoding for Cloud and Edge Computing Instances
Energy Efficient Video Encoding for Cloud and Edge Computing InstancesEnergy Efficient Video Encoding for Cloud and Edge Computing Instances
Energy Efficient Video Encoding for Cloud and Edge Computing Instances
 
Generating privacy-protected synthetic data using Secludy and Milvus
Generating privacy-protected synthetic data using Secludy and MilvusGenerating privacy-protected synthetic data using Secludy and Milvus
Generating privacy-protected synthetic data using Secludy and Milvus
 
"Choosing proper type of scaling", Olena Syrota
"Choosing proper type of scaling", Olena Syrota"Choosing proper type of scaling", Olena Syrota
"Choosing proper type of scaling", Olena Syrota
 
Crafting Excellence: A Comprehensive Guide to iOS Mobile App Development Serv...
Crafting Excellence: A Comprehensive Guide to iOS Mobile App Development Serv...Crafting Excellence: A Comprehensive Guide to iOS Mobile App Development Serv...
Crafting Excellence: A Comprehensive Guide to iOS Mobile App Development Serv...
 
Nordic Marketo Engage User Group_June 13_ 2024.pptx
Nordic Marketo Engage User Group_June 13_ 2024.pptxNordic Marketo Engage User Group_June 13_ 2024.pptx
Nordic Marketo Engage User Group_June 13_ 2024.pptx
 
Monitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdfMonitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdf
 
Freshworks Rethinks NoSQL for Rapid Scaling & Cost-Efficiency
Freshworks Rethinks NoSQL for Rapid Scaling & Cost-EfficiencyFreshworks Rethinks NoSQL for Rapid Scaling & Cost-Efficiency
Freshworks Rethinks NoSQL for Rapid Scaling & Cost-Efficiency
 
Northern Engraving | Nameplate Manufacturing Process - 2024
Northern Engraving | Nameplate Manufacturing Process - 2024Northern Engraving | Nameplate Manufacturing Process - 2024
Northern Engraving | Nameplate Manufacturing Process - 2024
 
Fueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte WebinarFueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte Webinar
 
Driving Business Innovation: Latest Generative AI Advancements & Success Story
Driving Business Innovation: Latest Generative AI Advancements & Success StoryDriving Business Innovation: Latest Generative AI Advancements & Success Story
Driving Business Innovation: Latest Generative AI Advancements & Success Story
 
Principle of conventional tomography-Bibash Shahi ppt..pptx
Principle of conventional tomography-Bibash Shahi ppt..pptxPrinciple of conventional tomography-Bibash Shahi ppt..pptx
Principle of conventional tomography-Bibash Shahi ppt..pptx
 
AppSec PNW: Android and iOS Application Security with MobSF
AppSec PNW: Android and iOS Application Security with MobSFAppSec PNW: Android and iOS Application Security with MobSF
AppSec PNW: Android and iOS Application Security with MobSF
 
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectors
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectorsConnector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectors
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectors
 
Choosing The Best AWS Service For Your Website + API.pptx
Choosing The Best AWS Service For Your Website + API.pptxChoosing The Best AWS Service For Your Website + API.pptx
Choosing The Best AWS Service For Your Website + API.pptx
 

Crossing the Streams: the New Streaming Foreign-Key Join Feature in Kafka Streams (John Roesler, Confluent) Kafka Summit 2020

  • 1. Crossing the Streams: Foreign-Key Joins with Kafka Streams. John Roesler, Software Engineer @ Confluent
  • 2. Agenda: 01. The missing join: Foreign-Key Join 02. The current join: Equi-Join 03. The problem with FK Join 04. The solution for FK Join 05. Testing 06. Case Study: Bazaarvoice
  • 6. Foreign-Key Join: KTable<TrackId, Track> tracks = … KTable<AlbumId, Album> albums = … KTable<TrackId, TrackWithAlbum> = tracks.join(albums, Track::getAlbumId, TrackWithAlbum::joiner);
  • 9. Equi Join: KTable<TrackId, TrackMeta> tracksMetadata = … KTable<TrackId, TrackStore> tracksPricing = … KTable<TrackId, Track> = tracksMetadata.join(tracksPricing, Track::joiner);
  • 10. Big Data Processing == Partitioning. Input: A: 9, B: 2, C: 4, A: 6, D: 8. Partition 0: A: 9, C: 4, A: 6. Partition 1: B: 2, D: 8.
  • 11. Partitioned Equi Join. Left input: A: 9, B: 2, C: 4, A: 6, D: 8. Right input: A: α, B: β, C: γ, A: ξ, D: σ. Partition 0: Left A: 9, C: 4, A: 6; Right A: α, C: γ, A: ξ; Join A: (9,α), C: (4,γ), A: (6,ξ). Partition 1: Left B: 2, D: 8; Right B: β, D: σ; Join B: (2,β), D: (8,σ).
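The partitioned equi join above can be sketched in plain Java, without any Kafka dependency. This is only an illustration under stated assumptions: `PartitionedEquiJoinSketch`, `partitionFor`, `partition`, and `join` are hypothetical names, the tables are in-memory maps holding the latest value per key, and the partitioner is a simple stand-in based on `String.hashCode()`, not Kafka's own. The point it shows is that when both sides are partitioned by the same key, each partition can join using only its local data.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class PartitionedEquiJoinSketch {
    // Stand-in partitioner (an assumption for illustration, not Kafka's partitioner).
    static int partitionFor(String key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    // Split a table into one sub-table per partition, based on the record key.
    static <V> List<Map<String, V>> partition(Map<String, V> table, int numPartitions) {
        List<Map<String, V>> parts = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            parts.add(new LinkedHashMap<>());
        }
        table.forEach((k, v) -> parts.get(partitionFor(k, numPartitions)).put(k, v));
        return parts;
    }

    // Join partition by partition; each partition only reads its own left and right data.
    static Map<String, String> join(Map<String, Integer> left,
                                    Map<String, String> right,
                                    int numPartitions) {
        List<Map<String, Integer>> leftParts = partition(left, numPartitions);
        List<Map<String, String>> rightParts = partition(right, numPartitions);
        Map<String, String> result = new LinkedHashMap<>();
        for (int p = 0; p < numPartitions; p++) {
            Map<String, String> localRight = rightParts.get(p);
            leftParts.get(p).forEach((k, v) -> {
                String r = localRight.get(k);
                if (r != null) {
                    result.put(k, "(" + v + "," + r + ")");
                }
            });
        }
        return result;
    }
}
```

Because both tables use the same key and the same partitioner, a key's left and right records always land in the same partition, so no lookup ever has to cross partitions.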
  • 13. Partitioned Foreign-Key Join? Left input: A: 9, B: 2, C: 4, A: 6, D: 9. Right input: 9: α, 4: β, 3: γ, 6: ξ, 9: σ. Partition 0: Left A: 9, C: 4, A: 6; Right ?; Join ?. Partition 1: Left B: 2, D: 9; Right ?; Join ?.
  • 14. Partitioned Foreign-Key Join. Left input: A: 9, B: 2, C: 4, A: 6, D: 8. Right input: 9: α, 4: β, 3: γ, 6: ξ, 9: σ. Partition 0: Left A: 9, C: 4, A: 6; Right 9: α, 9: σ. Partition 1: Left B: 2, D: 8; Right 4: β, 3: γ, 6: ξ.
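A small plain-Java sketch can make the problem concrete. It assumes the same stand-in `String.hashCode()` partitioner as an illustration (not Kafka's partitioner), and `NaiveFkJoinSketch`, `partitionFor`, and `misplaced` are hypothetical names. The left table is partitioned by its primary key and the right table by its own key, so nothing guarantees that a left record sits in the same partition as the right record its foreign key points to.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class NaiveFkJoinSketch {
    // Stand-in partitioner (an assumption for illustration, not Kafka's partitioner).
    static int partitionFor(String key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    // Given left primary key -> foreign key, return the left keys whose referenced
    // right record would live in a different partition than the left record itself.
    static List<String> misplaced(Map<String, String> foreignKeyByLeftKey, int numPartitions) {
        List<String> out = new ArrayList<>();
        foreignKeyByLeftKey.forEach((leftKey, foreignKey) -> {
            if (partitionFor(leftKey, numPartitions) != partitionFor(foreignKey, numPartitions)) {
                out.add(leftKey);
            }
        });
        return out;
    }
}
```

Any key returned by `misplaced` is a record that a purely local, per-partition join could not match, which is why the foreign-key join needs some way to move information between partitions.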
  • 16. Partitioned Foreign-Key Join. Left: A: 9, B: 9, C: 4, A: 6, D: 8. Right: 9: α, 4: β, 3: γ, 6: ξ, 9: σ. Subscriptions (fed by "subscribe"): 9: A, 9: B, 4: C, 6: A, 8: D. Updates (fed by "update"): A: α, B: α, C: β, A: ξ, D: null. Join: A: (9,α), B: (9,α), C: (4,β), A: (6,ξ), D: (8,null).
  • 17. Partitioned Foreign-Key Join. Left: A: 9. Right: 9: α. Subscriptions: 9: A. Updates: none. Join: A: (9,α).
  • 18. Partitioned Foreign-Key Join. Left: A: 9, B: 9. Right: 9: α. Subscriptions: 9: A. Updates: none. Join: A: (9,α).
  • 19. Partitioned Foreign-Key Join. Left: A: 9, B: 9. Right: 9: α. Subscriptions: 9: A. Updates: none. Join: A: (9,α). Subscribe message in transit: 9: B.
  • 20. Partitioned Foreign-Key Join. Left: A: 9, B: 9. Right: 9: α. Subscriptions: 9: A, 9: B. Updates: none. Join: A: (9,α).
  • 21. Partitioned Foreign-Key Join. Left: A: 9, B: 9. Right: 9: α. Subscriptions: 9: A, 9: B. Updates: none. Join: A: (9,α). Update message in transit: B: α.
  • 22. Partitioned Foreign-Key Join. Left: A: 9, B: 9. Right: 9: α. Subscriptions: 9: A, 9: B. Updates: none. Join: A: (9,α). Update message in transit: B: α.
  • 23. Partitioned Foreign-Key Join. Left: A: 9, B: 9. Right: 9: α. Subscriptions: 9: A, 9: B. Updates: B: α. Join: A: (9,α).
  • 24. Partitioned Foreign-Key Join. Left: A: 9, B: 9. Right: 9: α. Subscriptions: 9: A, 9: B. Updates: B: α. Join: A: (9,α), B: (9,α).
  • 25. Partitioned Foreign-Key Join. Left: A: 9, B: 9. Right: 9: α. Subscriptions: 9: A, 9: B. Updates: none. Join: A: (9,α), B: (9,α).
  • 26. Partitioned Foreign-Key Join. Left: A: 9, B: 9. Right: 9: α, 9: β. Subscriptions: 9: A, 9: B. Updates: none. Join: A: (9,α), B: (9,α).
  • 27. Partitioned Foreign-Key Join. Left: A: 9, B: 9. Right: 9: α, 9: β. Subscriptions: 9: A, 9: B. Updates: none. Join: A: (9,α), B: (9,α). Update messages in transit: A: β, B: β.
  • 28. Partitioned Foreign-Key Join. Left: A: 9, B: 9. Right: 9: α, 9: β. Subscriptions: 9: A, 9: B. Updates: none. Join: A: (9,α), B: (9,α). Update messages in transit: A: β, B: β.
  • 29. Partitioned Foreign-Key Join. Left: A: 9, B: 9. Right: 9: α, 9: β. Subscriptions: 9: A, 9: B. Updates: A: β, B: β. Join: A: (9,α), B: (9,α).
  • 30. Partitioned Foreign-Key Join. Left: A: 9, B: 9. Right: 9: α, 9: β. Subscriptions: 9: A, 9: B. Updates: A: β, B: β. Join: A: (9,α), B: (9,α), A: (9,β), B: (9,β).
  • 31. Partitioned Foreign-Key Join. Left: A: 9, B: 9. Right: 9: β. Subscriptions: 9: A, 9: B. Updates: none. Join: A: (9,β), B: (9,β).
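The subscribe/update walkthrough above can be imitated with a small single-process Java sketch. This is not the Kafka Streams implementation; it is a simplified model under stated assumptions: `FkJoinSketch` and its methods are hypothetical names, all state lives in in-memory maps, messages are delivered instantly instead of through topics, and inner-join semantics are assumed (a left record with no matching right record produces no result).

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

final class FkJoinSketch {
    // Left table: left primary key -> foreign key.
    private final Map<String, String> left = new HashMap<>();
    // Right table: right primary key -> value.
    private final Map<String, String> right = new HashMap<>();
    // Subscription store on the right side: foreign key -> subscribed left keys.
    private final Map<String, Set<String>> subscriptions = new HashMap<>();
    // Join result: left primary key -> "(foreignKey,rightValue)".
    private final Map<String, String> joined = new TreeMap<>();

    // Left-side change: the "subscribe" step registers the left key under its
    // foreign key, and the current right value is sent back as an update.
    void putLeft(String leftKey, String foreignKey) {
        String oldForeignKey = left.put(leftKey, foreignKey);
        if (oldForeignKey != null && !oldForeignKey.equals(foreignKey)) {
            Set<String> oldSubscribers = subscriptions.get(oldForeignKey);
            if (oldSubscribers != null) {
                oldSubscribers.remove(leftKey); // unsubscribe from the previous foreign key
            }
        }
        subscriptions.computeIfAbsent(foreignKey, k -> new TreeSet<>()).add(leftKey);
        applyUpdate(leftKey, foreignKey, right.get(foreignKey));
    }

    // Right-side change: the "update" step fans the new value out to every subscriber.
    void putRight(String rightKey, String value) {
        right.put(rightKey, value);
        for (String leftKey : subscriptions.getOrDefault(rightKey, Collections.emptySet())) {
            applyUpdate(leftKey, rightKey, value);
        }
    }

    // Back on the left side: store the joined result for this left key.
    private void applyUpdate(String leftKey, String foreignKey, String rightValue) {
        if (rightValue == null) {
            joined.remove(leftKey); // inner join assumption: no match, no result
        } else {
            joined.put(leftKey, "(" + foreignKey + "," + rightValue + ")");
        }
    }

    Map<String, String> result() {
        return joined;
    }
}
```

Replaying the slides with this model, putting `9: α` on the right and `A: 9`, `B: 9` on the left gives both A and B a result containing α; a later `9: β` on the right is fanned out to both subscribers, mirroring the final state on slide 31.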
  • 33. Testing: KTable<TrackId, Track> tracks = … KTable<AlbumId, Album> albums = … KTable<TrackId, TrackWithAlbum> = tracks.join(albums, Track::getAlbumId, TrackWithAlbum::joiner);
  • 34. Testing: try(driver = new TopologyTestDriver(...)) { trackInput = driver.createInputTopic(...) albumInput = driver.createInputTopic(...) result = driver.createOutputTopic(...) }
  • 35. Testing: try(driver = new TopologyTestDriver(...)) { trackInput = driver.createInputTopic(...) albumInput = driver.createInputTopic(...) result = driver.createOutputTopic(...) trackInput.pipeInput("t1", new Track("a1")) trackInput.pipeInput("t2", new Track("a1")) albumInput.pipeInput("a1", new Album(...)) }
  • 36. Testing: try(driver = new TopologyTestDriver(...)) { trackInput = driver.createInputTopic(...) albumInput = driver.createInputTopic(...) result = driver.createOutputTopic(...) trackInput.pipeInput("t1", new Track("a1")) trackInput.pipeInput("t2", new Track("a1")) albumInput.pipeInput("a1", new Album(...)) assertThat( result.readValuesToMap(), is(map( "t1": pair(track1, album1), "t2": pair(track2, album1) )) ); }
  • 38. Case Study: Bazaarvoice ● Early Relational Streaming adopter ○ In-house streaming platform ○ Periodic bulk DB query jobs ○ Spark, Hadoop, etc. ● Large dataset, healthy update rate ○ 100s of Millions of Products ○ 100s of Billions of Reviews ○ Updates: 10s of Millions a day, at least ○ Views: ludicrous ● Join-heavy workload (high cardinality) ○ Product -> Review fan-out can be 100s of Millions
  • 39. Case Study: Bazaarvoice ● Product ○ Name ○ Description ○ URL ○ Average Rating ● Review ○ ProductId ○ Text ○ Rating ○ Product Name
  • 40. Average Rating (aggregation): KTable<ReviewId, Review> reviews; KTable<ProductId, Product> products; KTable<ProductId, Double> avgRatings = reviews.groupBy(Review::getProductId).reduce(averageRatings) KTable<ProductId, ViewProduct> result = avgRatings.join(products) Diagram: reviews → groupBy(productId) → reduce(avg) → avgRatings; avgRatings + products → result.
  • 41. Case Study: Bazaarvoice ● Product ○ Name ○ Description ○ URL ○ Average Rating ● Review ○ ProductId ○ Text ○ Rating ○ Product Name
  • 42. Product Name (join): KTable<ProductId, Set<ReviewId>> productReviews = reviews.groupBy(Review::getProductId).reduce(collectReviewIdsSet) Diagram: reviews → groupBy(productId) → reduce(collect set) → all reviews for each product.
  • 43. Product Name (join): KTable<ProductId, Set<ReviewId>> productReviews = reviews.groupBy(Review::getProductId).reduce(collectReviewIdsSet) KTable<ProductId, Pair<String, Set<ReviewId>>> toExplode = products.mapValues(Product::getName).join(productReviews) Diagram: reviews → groupBy(productId) → reduce(collect set) → all reviews for each product; joined with products → all reviews and product name for each product.
  • 44. Product Name (join): KTable<ProductId, Set<ReviewId>> productReviews = reviews.groupBy(Review::getProductId).reduce(collectReviewIdsSet) KTable<ProductId, Pair<String, Set<ReviewId>>> toExplode = products.mapValues(Product::getName).join(productReviews) KTable<ReviewId, String> reviewsToProductNames = toExplode.flatMap( name, reviewSet -> for (reviewId : reviewSet) forward(reviewId, name); ) Diagram: reviews → groupBy(productId) → reduce(collect set) → all reviews for each product; joined with products → all reviews and product name for each product → product name for each review.
  • 45. Product Name (join): KTable<ProductId, Set<ReviewId>> productReviews = reviews.groupBy(Review::getProductId).reduce(collectReviewIdsSet) KTable<ProductId, Pair<String, Set<ReviewId>>> toExplode = products.mapValues(Product::getName).join(productReviews) KTable<ReviewId, String> reviewsToProductNames = toExplode.flatMap( name, reviewSet -> for (reviewId : reviewSet) forward(reviewId, name); ) KTable<ReviewId, ViewReview> result = reviews.join(reviewsToProductNames) Diagram: reviews → groupBy(productId) → reduce(collect set) → all reviews for each product; joined with products → all reviews and product name for each product → product name for each review; joined with reviews → result.
  • 46. Foreign-Key Join. Left: A: 9, B: 9. Right: 9: β. Subscriptions: 9: A, 9: B. Updates: none. Join: A: (9,β), B: (9,β).
  • 47. Product Name (join): KTable<ProductId, Set<ReviewId>> productReviews = reviews.groupBy(Review::getProductId).reduce(collectReviewIdsSet) KTable<ProductId, Pair<String, Set<ReviewId>>> toExplode = products.mapValues(Product::getName).join(productReviews) KTable<ReviewId, String> reviewsToProductNames = toExplode.flatMap( name, reviewSet -> for (reviewId : reviewSet) forward(reviewId, name); ) KTable<ReviewId, ViewReview> result = reviews.join(reviewsToProductNames) Annotations: repartition; repartition.
  • 48. Foreign-Key Join. Left: A: 9, B: 9. Right: 9: β. Subscriptions: 9: A, 9: B. Updates: none. Join: A: (9,β), B: (9,β).
  • 49. Product Name (join): KTable<ProductId, Set<ReviewId>> productReviews = reviews.groupBy(Review::getProductId).reduce(collectReviewIdsSet) KTable<ProductId, Pair<String, Set<ReviewId>>> toExplode = products.mapValues(Product::getName).join(productReviews) KTable<ReviewId, String> reviewsToProductNames = toExplode.flatMap( name, reviewSet -> for (reviewId : reviewSet) forward(reviewId, name); ) KTable<ReviewId, ViewReview> result = reviews.join(reviewsToProductNames) Annotations: repartition; repartition; store and transmit entire set.
  • 50. Product Name (join): KTable<ProductId, Set<ReviewId>> productReviews = reviews.groupBy(Review::getProductId).reduce(collectReviewIdsSet) KTable<ProductId, Pair<String, Set<ReviewId>>> toExplode = products.mapValues(Product::getName).join(productReviews) KTable<ReviewId, String> reviewsToProductNames = toExplode.flatMap( name, reviewSet -> for (reviewId : reviewSet) forward(reviewId, name); ) KTable<ReviewId, ViewReview> result = reviews.join(reviewsToProductNames)
  • 51. Product Name (join): KTable<ProductId, String> productNames = products.mapValues(Product::getName) KTable<ReviewId, ViewReview> result = reviews.join(productNames, Review::getProductId)
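To compare the two formulations of the product-name join, here is a plain-Java sketch over in-memory maps. It is an illustration only, not Kafka Streams code: `ProductNameJoinSketch`, `viaCollectAndExplode`, and `viaForeignKey` are hypothetical names, reviews are reduced to a map from review id to product id, and products to a map from product id to name. The first method mirrors the collect-then-explode workaround; the second mirrors the direct foreign-key lookup.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

final class ProductNameJoinSketch {
    // Workaround: collect all review ids per product, attach the product name,
    // then explode the set back into one (reviewId, name) entry per review.
    static Map<String, String> viaCollectAndExplode(Map<String, String> productIdByReview,
                                                    Map<String, String> nameByProduct) {
        Map<String, Set<String>> productReviews = new HashMap<>();
        productIdByReview.forEach((reviewId, productId) ->
            productReviews.computeIfAbsent(productId, k -> new TreeSet<>()).add(reviewId));

        Map<String, String> out = new TreeMap<>();
        productReviews.forEach((productId, reviewIds) -> {
            String name = nameByProduct.get(productId);
            if (name != null) {
                for (String reviewId : reviewIds) {
                    out.put(reviewId, name);
                }
            }
        });
        return out;
    }

    // Foreign-key formulation: each review looks up its product through the
    // extracted foreign key; no per-product set of review ids is materialized.
    static Map<String, String> viaForeignKey(Map<String, String> productIdByReview,
                                             Map<String, String> nameByProduct) {
        Map<String, String> out = new TreeMap<>();
        productIdByReview.forEach((reviewId, productId) -> {
            String name = nameByProduct.get(productId);
            if (name != null) {
                out.put(reviewId, name);
            }
        });
        return out;
    }
}
```

Both methods produce the same review-to-name mapping, but the workaround has to hold the whole set of review ids for each product, which is the "store and transmit entire set" cost called out on the earlier slide.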
  • 52. Coming soon to ksqlDB! SELECT * FROM Reviews JOIN Products ON Reviews.ProductID = Products.ID
  • 53. Thanks to the authors of KIP-213! ● Jan Filipiak (Oct 2017) ● Adam Bellemare (July 2018) ● Accepted Oct 2019 ● Released in 2.4.0, Dec 2019