Crossing the Streams: the New Streaming Foreign-Key Join Feature in Kafka Streams (John Roesler, Confluent) Kafka Summit 2020

Crossing the Streams:
Foreign-Key Joins with Kafka Streams
John Roesler
Software Engineer @ Confluent

Agenda
01. The missing join: Foreign-Key Join
02. The current join: Equi Join
03. The problem with FK Join
04. The solution for FK Join
05. Testing
06. Case Study: Bazaarvoice

Foreign-Key Join
albums: AlbumId, Title, ArtistId
tracks: TrackId, Name, AlbumId, Composer, Bytes, UnitPrice
SELECT * FROM Tracks
JOIN Albums ON Tracks.AlbumId = Albums.AlbumId

Foreign-Key Join
albums: AlbumId (Primary), Title, ArtistId
tracks: TrackId, Name, AlbumId (Foreign), Composer, Bytes, UnitPrice
JOIN: tracks.AlbumId (Foreign) = albums.AlbumId (Primary)

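Foreign-Key Join
KTable<TrackId, Track> tracks = …
KTable<AlbumId, Album> albums = …
// result name is illustrative; the slide leaves it out
KTable<TrackId, TrackWithAlbum> tracksWithAlbums =
    tracks.join(albums,
        Track::getAlbumId,       // foreign-key extractor
        TrackWithAlbum::joiner); // value joiner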

Agenda
02. The current join: Equi Join

Equi Join
track-meta: TrackId, Name, AlbumId, Composer, Bytes
track-pricing: TrackId, UnitPrice
JOIN -> tracks: TrackId, Name, AlbumId, Composer, Bytes, UnitPrice

Equi Join
KTable<TrackId, TrackMeta> tracksMetadata = …
KTable<TrackId, TrackStore> tracksPricing = …
KTable<TrackId, Track> tracks =
    tracksMetadata.join(tracksPricing,
        Track::joiner); // value joiner

Big Data Processing == Partitioning
Records: A: 9, B: 2, C: 4, A: 6, D: 8
Partition 0: A: 9, C: 4, A: 6
Partition 1: B: 2, D: 8
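The partitioning idea on this slide can be sketched in plain Java. This is only an illustration under stated assumptions: the class `PartitionSketch` and its methods are hypothetical, and `String.hashCode()` stands in for a real partitioner, so the partition numbers need not match the slide. The point is only that records with the same key always land in the same partition.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class PartitionSketch {
    // Hypothetical stand-in for a partitioner: the same key always maps to the same partition.
    static int partitionFor(String key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    // Groups {key, value} records by partition, keeping arrival order inside each partition.
    static Map<Integer, List<String>> partition(List<String[]> records, int numPartitions) {
        Map<Integer, List<String>> byPartition = new TreeMap<>();
        for (String[] r : records) {
            byPartition.computeIfAbsent(partitionFor(r[0], numPartitions), p -> new ArrayList<>())
                       .add(r[0] + ": " + r[1]);
        }
        return byPartition;
    }

    public static void main(String[] args) {
        List<String[]> records = List.of(
            new String[] {"A", "9"}, new String[] {"B", "2"}, new String[] {"C", "4"},
            new String[] {"A", "6"}, new String[] {"D", "8"});
        partition(records, 2).forEach((p, rs) -> System.out.println("Partition " + p + ": " + rs));
    }
}
```

Both `A` records end up together, which is what lets each partition be processed independently.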

Partitioned Equi Join
Left: A: 9, B: 2, C: 4, A: 6, D: 8
Right: A: α, B: β, C: γ, A: ξ, D: σ
Partition 0: Left = A: 9, C: 4, A: 6; Right = A: α, C: γ, A: ξ; Join = A: (9,α), C: (4,γ), A: (6,ξ)
Partition 1: Left = B: 2, D: 8; Right = B: β, D: σ; Join = B: (2,β), D: (8,σ)
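Why the equi join partitions so cleanly can be shown with a small plain-Java sketch. It assumes both tables are sharded with the same key function (the hypothetical `partitionFor` below, based on `String.hashCode()`), and it models each table as a map holding the latest value per key. None of these names come from Kafka Streams.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

final class CoPartitionedJoinSketch {
    static int partitionFor(String key, int n) {
        return Math.floorMod(key.hashCode(), n);
    }

    // Splits a table into n shards; every table uses the same partitioning function.
    static Map<Integer, Map<String, String>> shard(Map<String, String> table, int n) {
        Map<Integer, Map<String, String>> shards = new HashMap<>();
        table.forEach((k, v) ->
            shards.computeIfAbsent(partitionFor(k, n), p -> new HashMap<>()).put(k, v));
        return shards;
    }

    // Joins each left shard only against the right shard with the same partition number.
    static Map<String, String> joinPerPartition(Map<String, String> left,
                                                Map<String, String> right, int n) {
        Map<Integer, Map<String, String>> l = shard(left, n);
        Map<Integer, Map<String, String>> r = shard(right, n);
        Map<String, String> out = new TreeMap<>();
        for (int p = 0; p < n; p++) {
            Map<String, String> rightShard = r.getOrDefault(p, Map.of());
            l.getOrDefault(p, Map.of()).forEach((k, v) -> {
                if (rightShard.containsKey(k)) {
                    out.put(k, "(" + v + "," + rightShard.get(k) + ")");
                }
            });
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> left = Map.of("A", "6", "B", "2", "C", "4", "D", "8");
        Map<String, String> right = Map.of("A", "ξ", "B", "β", "C", "γ", "D", "σ");
        System.out.println(joinPerPartition(left, right, 2));
    }
}
```

Because matching records share a key, they also share a partition, so no shard ever needs data from another shard.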

Agenda
03. The problem with FK Join

Partitioned Foreign-Key Join?
Left: A: 9, B: 2, C: 4, A: 6, D: 9
Right: 9: α, 4: β, 3: γ, 6: ξ, 9: σ
Partition 0: Left = A: 9, C: 4, A: 6; Right = ?; Join = ?
Partition 1: Left = B: 2, D: 9; Right = ?; Join = ?

Partitioned Foreign-Key Join
Left: A: 9, B: 2, C: 4, A: 6, D: 8
Right: 9: α, 4: β, 3: γ, 6: ξ, 9: σ
Left partitions (by left key): {A: 9, C: 4, A: 6} and {B: 2, D: 8}
Right partitions (by right key): {9: α, 9: σ} and {4: β, 3: γ, 6: ξ}
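The mismatch behind this slide can be made concrete with a short plain-Java sketch. It assumes a hypothetical hash-based `partitionFor` (again `String.hashCode()`, not Kafka's partitioner) and checks, for each left record, whether the right record it references would live in the same partition.

```java
final class ForeignKeyPlacementSketch {
    static int partitionFor(String key, int n) {
        return Math.floorMod(key.hashCode(), n);
    }

    // True only when a left record and the right record it references share a partition.
    static boolean coLocated(String leftKey, String foreignKey, int n) {
        return partitionFor(leftKey, n) == partitionFor(foreignKey, n);
    }

    public static void main(String[] args) {
        String[][] left = {{"A", "9"}, {"B", "2"}, {"C", "4"}, {"A", "6"}, {"D", "8"}};
        for (String[] r : left) {
            System.out.println(r[0] + ": " + r[1]
                + " -> left partition " + partitionFor(r[0], 2)
                + ", right partition " + partitionFor(r[1], 2)
                + (coLocated(r[0], r[1], 2) ? "" : "  (not co-located)"));
        }
    }
}
```

The left table is partitioned by its own key and the right table by its own key, so co-location is a coincidence rather than a guarantee. That is why the per-partition trick from the equi join does not carry over directly.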

Agenda
04. The solution for FK Join

Left: A: 9, B: 9, C: 4, A: 6, D: 8
Right: 9: α, 4: β, 3: γ, 6: ξ, 9: σ
subscribe (Left -> Right), Subscriptions: 9: A, 9: B, 4: C, 6: A, 8: D
update (Right -> Left), updates: A: α, B: α, C: β, A: ξ, D: null
Join: A: (9,α), B: (9,α), C: (4,β), A: (6,ξ), D: (8,null)

Start: Left = {A: 9}; Right = {9: α}; Subscriptions = {9: A}; Join = {A: (9,α)}
1. Left receives B: 9.
2. Left sends subscribe 9: B to the Right side.
3. Subscriptions = {9: A, 9: B}.
4. Right sends update B: α back to the Left side.
5. Join emits B: (9,α). Join = {A: (9,α), B: (9,α)}
6. Right receives 9: β, replacing 9: α.
7. Right sends updates A: β and B: β, one per subscriber of key 9.
8. Join emits A: (9,β) and B: (9,β).
End: Left = {A: 9, B: 9}; Right = {9: β}; Subscriptions = {9: A, 9: B}; Join = {A: (9,β), B: (9,β)}
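The subscribe/update exchange walked through above can be simulated in a few lines of plain Java. This is a single-threaded sketch under heavy assumptions: the class `FkJoinSketch` and its methods are hypothetical, the four maps stand in for the Left, Right, Subscriptions, and Join boxes on the slides, and partitioning, repartition topics, and ordering concerns are ignored. As on the slides, a missing right value is paired as `null`.

```java
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

final class FkJoinSketch {
    final Map<String, String> left = new HashMap<>();               // left key -> foreign key
    final Map<String, String> right = new HashMap<>();              // right key -> right value
    final Map<String, Set<String>> subscriptions = new HashMap<>(); // right key -> subscribed left keys
    final Map<String, String> joined = new TreeMap<>();             // left key -> "(fk,rightValue)"

    // Left side: store the record, drop any stale subscription, then subscribe to the new key.
    void putLeft(String key, String foreignKey) {
        String oldFk = left.put(key, foreignKey);
        if (oldFk != null && !oldFk.equals(foreignKey)) {
            Set<String> old = subscriptions.get(oldFk);
            if (old != null) {
                old.remove(key);
            }
        }
        subscribe(foreignKey, key);
    }

    // Right side: record the subscription and answer with the current right value.
    void subscribe(String foreignKey, String leftKey) {
        subscriptions.computeIfAbsent(foreignKey, k -> new LinkedHashSet<>()).add(leftKey);
        update(leftKey, right.get(foreignKey));
    }

    // Right side: store the new value and send it to every subscriber of this key.
    void putRight(String key, String value) {
        right.put(key, value);
        for (String leftKey : subscriptions.getOrDefault(key, Set.of())) {
            update(leftKey, value);
        }
    }

    // Left side: pair the arriving right value with the left record's foreign key.
    void update(String leftKey, String rightValue) {
        joined.put(leftKey, "(" + left.get(leftKey) + "," + rightValue + ")");
    }

    public static void main(String[] args) {
        FkJoinSketch join = new FkJoinSketch();
        join.putRight("9", "α");
        join.putLeft("A", "9");
        join.putLeft("B", "9");
        System.out.println(join.joined); // state after step 5
        join.putRight("9", "β");
        System.out.println(join.joined); // state at the end of the walkthrough
    }
}
```

The design choice to notice is that the right side never scans the left table: it only consults the subscription set for the key that changed, so an update to `9` touches exactly the left records that reference `9`.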

Agenda
05. Testing

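Testing
KTable<TrackId, Track> tracks = …
KTable<AlbumId, Album> albums = …
// result name is illustrative; the slide leaves it out
KTable<TrackId, TrackWithAlbum> tracksWithAlbums =
    tracks.join(albums,
        Track::getAlbumId,       // foreign-key extractor
        TrackWithAlbum::joiner); // value joiner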

Testing
try (var driver = new TopologyTestDriver(...)) {
    var trackInput = driver.createInputTopic(...);
    var albumInput = driver.createInputTopic(...);
    var result = driver.createOutputTopic(...);
}

Testing
    trackInput.pipeInput("t1", new Track("a1"));
    albumInput.pipeInput("a1", new Album(...));
}

Testing
    albumInput.pipeInput("a1", new Album(...));
    assertThat(
        result.readKeyValuesToMap(),
        is(Map.of(
            "t1", pair(track1, album1),
            "t2", pair(track2, album1)
        ))
    );
}

Agenda
06. Case Study: Bazaarvoice

Case Study: Bazaarvoice
● Early Relational Streaming adopter
○ In-house streaming platform
○ Periodic bulk DB query jobs
○ Spark, Hadoop, etc.
● Large dataset, healthy update rate
○ 100s of Millions of Products
○ 100s of Billions of Reviews
○ Updates: 10s of Millions a day, at least
○ Views: ludicrous
● Join-heavy workload (high cardinality)
○ Product -> Review fan-out can be 100s of Millions

● Product
○ Name
○ Description
○ URL
○ Average Rating
● Review
○ ProductId
○ Text
○ Rating
○ Product Name

Average Rating (aggregation)
KTable<ReviewId, Review> reviews;
KTable<ProductId, Product> products;
KTable<ProductId, Double> avgRatings =
    reviews
        .groupBy(Review::getProductId)
        .reduce(averageRatings);
KTable<ProductId, ViewProduct> result =
    avgRatings.join(products);
Diagram: reviews -> groupBy(productId) -> reduce(avg) -> avgRatings; avgRatings JOIN products -> result
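The slide does not show what `averageRatings` does. One way an average can be kept up to date as reviews are added and edited is to track a running sum and count per product, subtracting a review's old contribution before adding its new one. The plain-Java sketch below illustrates that idea only; the class `AverageRatingSketch`, its fields, and its methods are assumptions, not the speaker's code and not a Kafka Streams API.

```java
import java.util.HashMap;
import java.util.Map;

final class AverageRatingSketch {
    private final Map<String, Integer> ratingByReview = new HashMap<>(); // reviewId -> rating
    private final Map<String, String> productByReview = new HashMap<>(); // reviewId -> productId
    private final Map<String, long[]> sumAndCount = new HashMap<>();     // productId -> {sum, count}

    // Upserts a review: subtract its previous contribution (if any), then add the new one.
    void putReview(String reviewId, String productId, int rating) {
        Integer oldRating = ratingByReview.put(reviewId, rating);
        String oldProduct = productByReview.put(reviewId, productId);
        if (oldRating != null) {
            long[] old = sumAndCount.get(oldProduct);
            old[0] -= oldRating;
            old[1] -= 1;
        }
        long[] agg = sumAndCount.computeIfAbsent(productId, p -> new long[2]);
        agg[0] += rating;
        agg[1] += 1;
    }

    // Returns the current average, or NaN when the product has no reviews.
    double average(String productId) {
        long[] agg = sumAndCount.get(productId);
        return (agg == null || agg[1] == 0) ? Double.NaN : (double) agg[0] / agg[1];
    }

    public static void main(String[] args) {
        AverageRatingSketch avg = new AverageRatingSketch();
        avg.putReview("r1", "p1", 5);
        avg.putReview("r2", "p1", 3);
        avg.putReview("r1", "p1", 1); // r1 is edited: the old 5 is subtracted first
        System.out.println(avg.average("p1")); // prints 2.0
    }
}
```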

● Product
○ Name
○ Description
○ URL
○ Average Rating
● Review
○ ProductId
○ Text
○ Rating
○ Product Name

Product Name (join)
KTable<ProductId, Set<ReviewId>> productReviews =
    reviews
        .groupBy(Review::getProductId)
        .reduce(collectReviewIdsSet);
Diagram: reviews -> groupBy(productId) -> reduce(collect set) -> all reviews for each product

Product Name (join)
KTable<ProductId,
       Pair<String, Set<ReviewId>>> toExplode =
    products
        .mapValues(Product::getName)
        .join(productReviews);
Diagram: reviews -> groupBy(productId) -> reduce(collect set) -> all reviews for each product; JOIN products -> all reviews and product name for each product

Product Name (join)
KTable<ReviewId, String> reviewsToProductNames =
    toExplode.flatMap((name, reviewSet) -> {
        for (reviewId : reviewSet)
            forward(reviewId, name);
    });
Diagram: reviews -> groupBy(productId) -> reduce(collect set) -> all reviews for each product; JOIN products -> all reviews and product name for each product -> product name for each review
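The "explode" step above is slide pseudocode. Its effect can be sketched in plain Java as a function that turns one (product name, review-ID set) pair into one entry per review. The class `ExplodeSketch` and the method `explode` are hypothetical names used only for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class ExplodeSketch {
    // Emits one (reviewId -> productName) entry for every review in the product's set.
    static Map<String, String> explode(String productName, Set<String> reviewIds) {
        Map<String, String> out = new LinkedHashMap<>();
        for (String reviewId : reviewIds) {
            out.put(reviewId, productName);
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> reviewIds = new TreeSet<>(Set.of("r1", "r2", "r3"));
        System.out.println(explode("Widget", reviewIds));
    }
}
```

Every change to a product's name or review set re-emits an entry for each review in the set, which is what makes this hand-built approach expensive at high fan-out.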

Product Name (join)
KTable<ReviewId, ViewReview> result =
    reviews.join(reviewsToProductNames);
Diagram: reviews -> groupBy(productId) -> reduce(collect set) -> all reviews for each product; JOIN products -> all reviews and product name for each product -> product name for each review; JOIN reviews -> result

Foreign-Key Join
Left: A: 9, B: 9; Right: 9: β; Subscriptions: 9: A, 9: B; Join: A: (9,β), B: (9,β)

Product Name (join)
reviews.join(reviewsToProductNames)
Diagram annotations on the hand-built topology: repartition (twice); store and transmit entire set

Product Name (join)
KTable<ProductId, String> productNames =
    products.mapValues(Product::getName);
reviews.join(productNames,
    Review::getProductId) // foreign-key extractor; the value joiner argument is not shown on the slide

Coming soon to ksqlDB!
SELECT * FROM
Reviews JOIN Products
ON Reviews.ProductID = Products.ID

Thanks to the authors of KIP-213!
● Jan Filipiak (Oct 2017)
● Adam Bellemare (July 2018)
● Accepted Oct 2019
● Released in 2.4.0 Dec 2019

Thank you!
john@confluent.io
vvcephei@apache.org
cnfl.io/meetups cnfl.io/slack cnfl.io/blog
