"KIP-405 and KIP-833 stand to make a big impact, not just in the world of event streaming but on the larger data infrastructure of your business too. Kafka no longer just powers the operational side of business but is making significant inroads into the analytical side too. Many are realising that a tiered storage enabled Kafka (containing the huge variety of data enabled by Kraft), shows many of the same characteristics as a traditional datalake and can be used to serve requests for raw historical data as well as linking services and applications.
In this talk we will explore the efficiencies that can be gained by using Kafka in this role. Kafka as a datalake can greatly simplify data infrastructure, improve consistency between operational and analytical data, dramatically reduce costs, and provide advantages in terms of the freshness and variety of data available to analysts. We will also cover the critical features that are still missing to make this possible, and ways in which these may be achieved both today and in the future.
Join us for an exciting look into the future where we expand the role of Kafka in the data infrastructure and challenge the ETL/ELT status quo."
18. For analytics, everything must be read
Key:
{
"country": "UK"
}
Value:
{
"name": "TOM SCOTT",
"houseNumber": 5,
"streetName": "ZARA BOULEVARD",
"town": "BENJAMINVILLE",
"isActive": true,
"postCode": "QT55 3RN"
}
Kafka cannot know the value of this without reading the message.
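As an illustrative sketch (plain Python, not Kafka's API): because the broker stores values as opaque bytes, any filter on a field inside the payload forces the consumer to read and deserialise every record.

```python
import json

# Hypothetical batch of raw Kafka record values (bytes). The broker sees
# only opaque byte arrays; there is no index to consult.
raw_values = [
    json.dumps({"country": "UK", "isActive": True}).encode(),
    json.dumps({"country": "FR", "isActive": False}).encode(),
    json.dumps({"country": "UK", "isActive": False}).encode(),
]

def filter_by_country(values, country):
    """Client-side filter: every record must be read and parsed."""
    matches = []
    for raw in values:            # full scan - nothing can be skipped
        record = json.loads(raw)  # deserialise just to test the predicate
        if record["country"] == country:
            matches.append(record)
    return matches

print(len(filter_by_country(raw_values, "UK")))  # 2
```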
21. Kafka is a datalake, just not a very good one.
How can we make it better?
(Image: “what you order” vs. “what you receive”)
22. Why challenge the status quo?
(Diagram: ELT pipeline from Kafka into an external datalake)
Limitations:
Cost - duplicating/transforming data costs money.
Complexity - pipelines must be developed, monitored, maintained and evolved.
Consistency - multiple copies of data cannot be kept consistent.
Scope - only a small subset of data is transferred.
23. Option 1: A better ELT
(Diagram: a consistency service coordinating the data transfer)
Pros:
Minimal Kafka impact
An extension of existing tech
Cons:
Edge cases!
Must be kept in line with new datasets/features/schema evolution etc.
Still duplicating data
24. Spot the ETL/ELT
“Transfer and/or transform of data to an external, persistent data container accessible by tooling outside the source system.”
Which of these are ETL/ELT?
Stream processing state store
Kafka Replication
Any Kafka Connect Sink
26. A Note on Replicas
Both sides of the ELT are resilient
(Diagram: Broker 1, Broker 2, Broker 3 alongside DataNode 1, DataNode 2, DataNode 3)
6 replicas to tolerate a single failure in either system!
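A back-of-the-envelope sketch of that copy count (the replication factors below are common defaults, assumed here for illustration):

```python
# Classic ELT layout: data is replicated independently in Kafka and in
# the external datalake. Replication factors are assumed defaults.
kafka_rf = 3     # e.g. topic replication.factor=3
datalake_rf = 3  # e.g. HDFS default block replication

copies_elt = kafka_rf + datalake_rf
print(copies_elt)  # 6 full copies of every byte

# If Kafka itself served the analytical reads, only one set remains:
copies_unified = kafka_rf
print(copies_elt - copies_unified)  # 3 copies saved
```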
27. Option 1(a): Heterogeneous replicas
(Diagram: Broker 1, Broker 2, Broker 3)
A subset of replicas are stored in formats suitable for analytical workloads
See also:
“Streaming Lakehouse: Part 1 - Introducing Pulsar’s Lakehouse Tiered Storage”, Hang Chen (Software Engineer, StreamNative; Apache Pulsar PMC Member) and Sijie Guo (CEO and Co-Founder, StreamNative; Apache Pulsar PMC Member). Blog, Oct 25, 2023, 8 min read.
“KIP-1008: Parka - the Marriage of Parquet and Kafka”, created by Xinli Shang, last modified on Dec 02, 2023.
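To make the idea concrete, a minimal sketch (pure Python; the layout is illustrative only, not how Parquet or any KIP actually stores data) of the same records held row-oriented on one replica and column-oriented on another:

```python
# The same log of records, kept in two layouts on different replicas.
records = [
    {"country": "UK", "amount": 10},
    {"country": "FR", "amount": 25},
    {"country": "UK", "amount": 7},
]

# Row-oriented leader replica: records as appended (good for streaming).
row_replica = list(records)

# Column-oriented follower replica: one array per field (good for scans).
col_replica = {
    field: [r[field] for r in records] for field in records[0]
}

# An aggregate over one field now touches only that field's array,
# instead of deserialising every whole record:
total = sum(col_replica["amount"])
print(total)  # 42
```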
28. Option 1(a): Heterogeneous Replicas
PROS:
• Server-side solution
• Solves the replicas problem (sort of)
• Preserves the single source of truth
CONS:
• Degraded performance is unpredictable
• Inflexible
29. Option 2: Enrich the source data
INDEXING (WHERE)
PRE-AGGREGATION (GROUP BY)
STATISTICS (JOIN)
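As a sketch of the statistics idea (the segment layout and field names are illustrative, not Kafka's): per-segment min/max values let a query planner skip whole segments for a WHERE predicate without reading them.

```python
# Hypothetical per-segment metadata: min/max of a timestamp field,
# gathered when each segment was written.
segments = [
    {"name": "segment-0", "min_ts": 100, "max_ts": 199},
    {"name": "segment-1", "min_ts": 200, "max_ts": 299},
    {"name": "segment-2", "min_ts": 300, "max_ts": 399},
]

def segments_to_scan(segments, lo, hi):
    """Prune segments whose [min, max] range cannot match ts BETWEEN lo AND hi."""
    return [s["name"] for s in segments
            if s["max_ts"] >= lo and s["min_ts"] <= hi]

# Only two of the three segments can contain matches, so segment-0 is
# never read:
print(segments_to_scan(segments, 250, 320))  # ['segment-1', 'segment-2']
```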
30. Option 2: Enrich the source data
PROS:
• Still a single source of truth
• Wider query surface
• Graceful degradation
• Consistent data format
CONS:
• Will never be the outright speed winner
• No row-to-columnar conversion
32. Wait, what about KIP-833?
More Data = More Variety
1. If you never delete a topic then metadata grows continuously
2. If we divide topics then metadata grows
3. What about warehousing cases?
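A toy model of that growth (the record counts per topic and partition are assumptions for illustration; KRaft's real metadata log holds more record types than this):

```python
# Toy model: KRaft's metadata log holds (at least) one record per topic
# and one per partition. If topics are never deleted, the footprint only
# ever grows. Counts below are illustrative assumptions.
RECORDS_PER_TOPIC = 1      # e.g. a TopicRecord
RECORDS_PER_PARTITION = 1  # e.g. a PartitionRecord

def metadata_records(topics_created, partitions_per_topic):
    return topics_created * (
        RECORDS_PER_TOPIC + partitions_per_topic * RECORDS_PER_PARTITION
    )

# 10 new topics a day, 12 partitions each, never deleting anything:
for day in (1, 30, 365):
    print(day, metadata_records(10 * day, 12))
```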
35. Summary
Today:
Kafka is already a datalake (and always was)
KIP-405 gives history, and with it volume
But… base Kafka is terrible at ad-hoc analytics
The future’s bright - look out for:
A smarter, more real-time-aware ELT
A wide choice of data formats for both batch and streaming workloads
A richer set of metadata for enhanced query performance
Some pointers:
Get close to the raw events; avoid aggregating early
Simplicity is king!
Streams are relevant to humans too