"KIP-405 and KIP-833 stand to make a big impact, not just in the world of event streaming but on the larger data infrastructure of your business too. Kafka no longer just powers the operational side of business but is making significant inroads into the analytical side too. Many are realising that a tiered storage enabled Kafka (containing the huge variety of data enabled by Kraft), shows many of the same characteristics as a traditional datalake and can be used to serve requests for raw historical data as well as linking services and applications.
In this talk we will explore the efficiencies that can be gained by using Kafka in this role. Kafka as a datalake can greatly simplify data infrastructure, improve consistency between operational and analytical data, dramatically reduce costs, and provide advantages in terms of the freshness and variety of data available to analysts. We will also cover the critical features that are still missing to make this possible, and ways in which these may be achieved both today and in the future.
Join us for an exciting look into the future where we expand the role of Kafka in the data infrastructure and challenge the ETL/ELT status quo."
18. For analytics, everything must be read
Key:
{
"country": "UK"
}
Value:
{
"name": "TOM SCOTT",
"houseNumber": 5,
"streetName": "ZARA BOULEVARD",
"town": "BENJAMINVILLE",
"isActive": true,
"postCode": "QT55 3RN"
}
Kafka cannot know the value of this without reading the message.
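As an illustrative sketch (plain Python, not Kafka's API): because the broker stores values as opaque bytes, any filter on a field inside the payload forces the consumer to read and deserialise every record.

```python
import json

# Hypothetical batch of raw Kafka record values (bytes). The broker sees
# only opaque byte arrays; there is no index to consult.
raw_values = [
    json.dumps({"country": "UK", "isActive": True}).encode(),
    json.dumps({"country": "FR", "isActive": False}).encode(),
    json.dumps({"country": "UK", "isActive": False}).encode(),
]

def filter_by_country(values, country):
    """Client-side filter: every record must be read and parsed."""
    matches = []
    for raw in values:            # full scan - nothing can be skipped
        record = json.loads(raw)  # deserialise just to test the predicate
        if record["country"] == country:
            matches.append(record)
    return matches

print(len(filter_by_country(raw_values, "UK")))  # 2
```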
21. Kafka is a datalake, just not a very good one.
How can we make it better?
(Image: “what you order” vs. “what you receive”)
22. Why challenge the status quo?
(Diagram: ELT pipeline from Kafka into an external datalake)
Limitations:
Cost - duplicating/transforming data costs money.
Complexity - pipelines must be developed, monitored, maintained and evolved.
Consistency - multiple copies of data cannot be kept consistent.
Scope - only a small subset of data is transferred.
23. Option 1: A better ELT
(Diagram: a consistency service coordinating the data transfer)
Pros:
Minimal Kafka impact
An extension of existing tech
Cons:
Edge cases!
Must be kept in line with new datasets/features/schema evolution etc.
Still duplicating data
24. Spot the ETL/ELT
“Transfer and/or transform of data to an external, persistent data container accessible by tooling outside the source system.”
Which of these are ETL/ELT?
Stream processing state store
Kafka Replication
Any Kafka Connect Sink
26. A Note on Replicas
Both sides of the ELT are resilient
(Diagram: Broker 1, Broker 2, Broker 3 alongside DataNode 1, DataNode 2, DataNode 3)
6 replicas to tolerate a single failure in either system!
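A back-of-the-envelope sketch of that copy count (the replication factors below are common defaults, assumed here for illustration):

```python
# Classic ELT layout: data is replicated independently in Kafka and in
# the external datalake. Replication factors are assumed defaults.
kafka_rf = 3     # e.g. topic replication.factor=3
datalake_rf = 3  # e.g. HDFS default block replication

copies_elt = kafka_rf + datalake_rf
print(copies_elt)  # 6 full copies of every byte

# If Kafka itself served the analytical reads, only one set remains:
copies_unified = kafka_rf
print(copies_elt - copies_unified)  # 3 copies saved
```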
27. Option 1(a): Heterogeneous replicas
(Diagram: Broker 1, Broker 2, Broker 3)
A subset of replicas are stored in formats suitable for analytical workloads
See also:
“Streaming Lakehouse: Part 1 - Introducing Pulsar’s Lakehouse Tiered Storage”, Hang Chen (Software Engineer, StreamNative; Apache Pulsar PMC Member) and Sijie Guo (CEO and Co-Founder, StreamNative; Apache Pulsar PMC Member). Blog, Oct 25, 2023, 8 min read.
“KIP-1008: Parka - the Marriage of Parquet and Kafka”, created by Xinli Shang, last modified on Dec 02, 2023.
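To make the idea concrete, a minimal sketch (pure Python; the layout is illustrative only, not how Parquet or any KIP actually stores data) of the same records held row-oriented on one replica and column-oriented on another:

```python
# The same log of records, kept in two layouts on different replicas.
records = [
    {"country": "UK", "amount": 10},
    {"country": "FR", "amount": 25},
    {"country": "UK", "amount": 7},
]

# Row-oriented leader replica: records as appended (good for streaming).
row_replica = list(records)

# Column-oriented follower replica: one array per field (good for scans).
col_replica = {
    field: [r[field] for r in records] for field in records[0]
}

# An aggregate over one field now touches only that field's array,
# instead of deserialising every whole record:
total = sum(col_replica["amount"])
print(total)  # 42
```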
28. Option 1(a): Heterogeneous Replicas
PROS:
• Server-side solution
• Solves the replicas problem (sort of)
• Preserves the single source of truth
CONS:
• Degraded performance is unpredictable
• Inflexible
29. Option 2: Enrich the source data
INDEXING (WHERE)
PRE-AGGREGATION (GROUP BY)
STATISTICS (JOIN)
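As a sketch of the statistics idea (the segment layout and field names are illustrative, not Kafka's): per-segment min/max values let a query planner skip whole segments for a WHERE predicate without reading them.

```python
# Hypothetical per-segment metadata: min/max of a timestamp field,
# gathered when each segment was written.
segments = [
    {"name": "segment-0", "min_ts": 100, "max_ts": 199},
    {"name": "segment-1", "min_ts": 200, "max_ts": 299},
    {"name": "segment-2", "min_ts": 300, "max_ts": 399},
]

def segments_to_scan(segments, lo, hi):
    """Prune segments whose [min, max] range cannot match ts BETWEEN lo AND hi."""
    return [s["name"] for s in segments
            if s["max_ts"] >= lo and s["min_ts"] <= hi]

# Only two of the three segments can contain matches, so segment-0 is
# never read:
print(segments_to_scan(segments, 250, 320))  # ['segment-1', 'segment-2']
```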
30. Option 2: Enrich the source data
PROS:
• Still a single source of truth
• Wider query surface
• Graceful degradation
• Consistent data format
CONS:
• Will never be the outright speed winner
• No row-to-columnar conversion
32. Wait, what about KIP-833?
More Data = More Variety
1. If you never delete a topic then metadata grows continuously
2. If we divide topics then metadata grows
3. What about warehousing cases?
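A toy model of that growth (the record counts per topic and partition are assumptions for illustration; KRaft's real metadata log holds more record types than this):

```python
# Toy model: KRaft's metadata log holds (at least) one record per topic
# and one per partition. If topics are never deleted, the footprint only
# ever grows. Counts below are illustrative assumptions.
RECORDS_PER_TOPIC = 1      # e.g. a TopicRecord
RECORDS_PER_PARTITION = 1  # e.g. a PartitionRecord

def metadata_records(topics_created, partitions_per_topic):
    return topics_created * (
        RECORDS_PER_TOPIC + partitions_per_topic * RECORDS_PER_PARTITION
    )

# 10 new topics a day, 12 partitions each, never deleting anything:
for day in (1, 30, 365):
    print(day, metadata_records(10 * day, 12))
```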
35. Summary
Today:
Kafka is already a datalake (and always was)
KIP-405 gives history, and with it volume
But… base Kafka is terrible at ad-hoc analytics
The future’s bright - look out for:
A smarter, more real-time-aware ELT
A wide choice of data formats for both batch and streaming workloads
A richer set of metadata for enhanced query performance
Some pointers:
Get close to the raw events; avoid aggregating early
Simplicity is king!
Streams are relevant to humans too