BASEL | BERN | BRUGG | BUKAREST | DÜSSELDORF | FRANKFURT A.M. | FREIBURG I.BR. | GENF
HAMBURG | KOPENHAGEN | LAUSANNE | MANNHEIM | MÜNCHEN | STUTTGART | WIEN | ZÜRICH
http://guidoschmutz.wordpress.com | @gschmutz
Kafka as your Data Lake - is it Feasible?
Guido Schmutz
Kafka Summit 2020
Guido
Working at Trivadis for more than 23 years
Consultant, Trainer, Platform Architect for Java,
Oracle, SOA and Big Data / Fast Data
Oracle Groundbreaker Ambassador & Oracle ACE
Director
@gschmutz guidoschmutz.wordpress.com
195th
edition
Agenda
1. What is a Data Lake?
2. Four Architecture Blueprints for “treating Kafka as a Data Lake”
3. Summary
Demo environment and code samples available here: https://github.com/gschmutz/kafka-as-your-datalake-demo
What is a Data Lake?
Traditional Data Lake Architecture
[Figure: bulk sources (DB, extract, file) are loaded by file import / SQL import into the raw storage of a big data platform (Hadoop cluster). From there the data is refined into usage-optimized storage and reaches the data consumers (BI apps, data science workbench) through parallel processing, a query engine (SQL / search) or “native” raw access.]
Initial Idea of Data Lake
• Single store of all data (incl. raw data) in the enterprise
• Put an end to data silos
• Reporting, Visualization, Analytics and Machine Learning
• Focus on Schema-on-Read
Tech for 1st Gen Data Lake
• HDFS, MapReduce, Pig, Hive, Impala, Flume, Sqoop
Tech for 2nd Gen Data Lake (Cloud native)
• Object Store (S3, Azure Blob Storage, …), Spark, Flink, Presto, StreamSets, …
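Schema-on-Read can be illustrated with a small sketch: the raw zone keeps records exactly as they arrived, and a schema is applied only when a consumer reads them. The two sample lines are the truck position records shown in the demo use case of this deck; the helper name `read_with_schema` and the field name `routeName` are assumptions made for this illustration.

```python
import csv
from io import StringIO

# Raw zone: records are kept exactly as they arrived (no schema enforced on write).
RAW_ZONE = [
    "2020-06-02 14:39:56.605,98,27,803014426,Wichita to Little Rock Route2,Normal,38.65,90.21,5187297736652502631",
    "2020-06-02 14:39:56.605,21,19,803014427,Wichita to Little Rock Route3,Overspeed,32.35,91.21,5187297736652502632",
]

# Schema applied by the reader: (field name, converter).
# "routeName" is an assumed name for the fifth column.
TRUCK_POSITION_SCHEMA = [
    ("timestamp", str), ("truckId", int), ("driverId", int), ("routeId", int),
    ("routeName", str), ("eventType", str), ("latitude", float),
    ("longitude", float), ("correlationId", str),
]

def read_with_schema(raw_lines, schema):
    """Parse raw CSV lines into dicts at read time (schema-on-read)."""
    for row in csv.reader(StringIO("\n".join(raw_lines))):
        yield {name: convert(value) for (name, convert), value in zip(schema, row)}

records = list(read_with_schema(RAW_ZONE, TRUCK_POSITION_SCHEMA))
# A consumer decides at read time which fields matter to it.
overspeed = [r["truckId"] for r in records if r["eventType"] != "Normal"]
print(overspeed)
```

A different consumer could read the same raw lines with a different schema, which is the flexibility the raw zone is meant to preserve.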
Traditional Data Lake Zones
[Figure: the traditional data lake architecture divided into zones, annotated “high latency”.]
“Streaming Data Lake” – aka. Kappa Architecture
[Figure: bulk sources (DB, extract, file) and event sources (location, weather, IoT data, mobile apps, social, change data capture) feed an event stream into the Event Hub. On the stream processing platform, Stream Processor V1.0 and Stream Processor V2.0 run side by side, each with its own state and result, and an API (switcher) serves the result stream to dashboards. A bulk data flow fills the raw and refined / usage-optimized storage of the (big) data platform, which serves BI apps and the data science workbench through parallel processing, a query engine (SQL / search) or “native” raw access. A replay bulk data flow and a “Source of Truth” label complete the picture.]
[8] – Questioning the Lambda Architecture – by Jay Kreps
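The idea behind the two processor versions and the API switcher can be sketched as follows: a changed processing logic is built by replaying the same event log from the beginning, and the serving layer is switched over once the new result is ready. This is a minimal in-memory sketch, not a Kafka application; the event fields, the counting logic and the `Switcher` class are assumptions made for illustration.

```python
# The event log is the input for every processor version (replayable from offset 0).
EVENT_LOG = [
    {"truckId": 98, "eventType": "Normal"},
    {"truckId": 21, "eventType": "Overspeed"},
    {"truckId": 21, "eventType": "Overspeed"},
]

def processor_v1(events):
    """V1.0: count all events per truck."""
    state = {}
    for e in events:
        state[e["truckId"]] = state.get(e["truckId"], 0) + 1
    return state

def processor_v2(events):
    """V2.0: changed logic, count only non-normal events per truck."""
    state = {}
    for e in events:
        if e["eventType"] != "Normal":
            state[e["truckId"]] = state.get(e["truckId"], 0) + 1
    return state

# Both results are built by replaying the same log from the beginning.
results = {"v1": processor_v1(EVENT_LOG), "v2": processor_v2(EVENT_LOG)}

class Switcher:
    """Serves reads from whichever result version is currently active."""
    def __init__(self, results, active):
        self.results, self.active = results, active

    def get(self, truck_id):
        return self.results[self.active].get(truck_id, 0)

api = Switcher(results, active="v1")
before = api.get(98)   # served from Result V1.0
api.active = "v2"      # cut over once V2.0 has caught up
after = api.get(98)    # served from Result V2.0
```

Because the log itself is never modified, V1.0 can stay available until the switch, and a later V3.0 could be built the same way.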
“Streaming Data Lake” Zones
[Figure: the “Streaming Data Lake” architecture with its zones: event stream and Event Hub, stream processing platform (Stream Processor V1.0 / V2.0 with state and results behind an API switcher), and the (big) data platform with raw and refined / usage-optimized storage serving BI apps, dashboards and the data science workbench.]
Moving the Source of Truth to Event Hub
Turning the database inside-out!
[Figure: the “Streaming Data Lake” architecture, highlighting the Event Hub as the source of truth.]
[1] Turning the database inside out with Apache Samza – by Martin Kleppmann
Moving the Source of Truth to Event Hub
[Figure: the “Streaming Data Lake” architecture with the Event Hub as the source of truth.]
[2] – It’s Okay To Store Data In Apache Kafka – by Jay Kreps
is it feasible?
Confluent Enterprise Tiered Storage
Data Retention
• Never expire (keep data indefinitely)
• Time (TTL) or size-based
• Log compaction-based
Tiered Storage uses two tiers of storage
• Local (the brokers' local disks)
• Remote (Object storage, currently AWS S3 only)
Enables Kafka to be a long-term storage solution
• Transparent (no ETL pipelines needed)
• Cheaper storage for cold data
• Better scalability and less complex operations
[Figure: Broker 1, Broker 2 and Broker 3 hold the hot data, the object storage holds the cold data.]
[3] Infinite Storage in Confluent Platform – by Lucas Bradstreet, Dhruvil Shah, Manveer Chawla
[4] KIP-405: Kafka Tiered Storage – Kafka Improvement Proposal
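The three retention modes map onto standard Kafka topic-level settings. The sketch below only builds the configuration dictionaries; `retention.ms`, `retention.bytes` and `cleanup.policy` are regular Kafka topic configs, while the helper name `retention_config` and the mode labels are made up for this illustration. Tiered Storage settings are deliberately left out.

```python
def retention_config(mode, ttl_ms=None, max_bytes=None):
    """Return topic-level configs for one of the three retention modes.

    mode is a label invented for this sketch: "never", "time_or_size" or "compacted".
    """
    if mode == "never":
        # -1 disables the time limit and the size limit: data is kept indefinitely.
        return {"cleanup.policy": "delete", "retention.ms": "-1", "retention.bytes": "-1"}
    if mode == "time_or_size":
        # Segments are deleted once they exceed the TTL or the partition exceeds max_bytes.
        return {
            "cleanup.policy": "delete",
            "retention.ms": str(ttl_ms if ttl_ms is not None else -1),
            "retention.bytes": str(max_bytes if max_bytes is not None else -1),
        }
    if mode == "compacted":
        # Log compaction keeps at least the latest record per key.
        return {"cleanup.policy": "compact"}
    raise ValueError(f"unknown retention mode: {mode}")

seven_days = retention_config("time_or_size", ttl_ms=7 * 24 * 60 * 60 * 1000)
keep_forever = retention_config("never")
```

Such a dictionary could be passed to whatever admin tooling creates the topics; the choice of tooling is outside the scope of this sketch.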
Four Architecture Blueprints for “treating Kafka as a Data Lake”
How can you access a Kafka topic?
Streaming Queries
• Latest - start at end and continuously consume new data
• Earliest – start at beginning and consume history and then continuously consume new data
• Seek to offset – start at a given offset, consume history, and continuously consume new data
• Seek to timestamp – start at given timestamp, consume history and continuously consume new data
Batch Queries
• From start offset to end offset – start at a given offset and consume until another offset
• From start timestamp to end timestamp – start at a given timestamp and consume until another timestamp
• Full scan – Scan the complete topic from start to end
All of the above access options can be applied to a whole topic or to a set of partitions
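The access options can be made concrete with an in-memory model of a single partition. This is a sketch under simplifying assumptions, not Kafka client code: the partition is a Python list whose index plays the role of the offset, timestamps are assumed to be non-decreasing, and all function names are invented for the illustration.

```python
from bisect import bisect_left

# One partition modelled as a list of (timestamp_ms, value); the list index is the offset.
PARTITION = [(1000, "a"), (2000, "b"), (3000, "c"), (4000, "d"), (5000, "e")]

def offset_for_timestamp(log, ts):
    """Earliest offset whose timestamp is >= ts (len(log) if there is none)."""
    return bisect_left([t for t, _ in log], ts)

def batch_by_offset(log, start, end):
    """Batch query: records from start offset (inclusive) to end offset (exclusive)."""
    return [v for _, v in log[start:end]]

def batch_by_timestamp(log, start_ts, end_ts):
    """Batch query: records with start_ts <= timestamp < end_ts."""
    return batch_by_offset(log, offset_for_timestamp(log, start_ts),
                           offset_for_timestamp(log, end_ts))

def full_scan(log):
    """Batch query: the complete partition from start to end."""
    return batch_by_offset(log, 0, len(log))

def streaming_start_offset(log, mode, value=None):
    """Offset at which a streaming query starts; it then keeps consuming new data."""
    if mode == "latest":
        return len(log)
    if mode == "earliest":
        return 0
    if mode == "offset":
        return value
    if mode == "timestamp":
        return offset_for_timestamp(log, value)
    raise ValueError(f"unknown mode: {mode}")
```

The only difference between a streaming and a batch query in this model is whether an end position exists; applying the same functions per partition gives the topic-level variants.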
BP-1: “Streaming” Data Lake
• Using Stream Processing tools to perform processing on “data in motion” instead of in batch
• Can consume from multiple sources
• Works well if no or limited history is needed
• Queryable State Stores, aka. Interactive Queries or Pull Queries
[5] Streaming Machine Learning with Tiered Storage and Without a Data Lake – by Kai Waehner
BP-1_1: “Streaming” Data Lake with ksqlDB / Kafka Streams
• Kafka Streams or ksqlDB fit perfectly
• Using ksqlDB pull queries to retrieve current state of materialized views
• Store results in another Kafka topic to persist state store information
• Can be combined with BP-4 to store results/state in a database
Capability checklist (Yes / No): Supports JSON, Supports Avro, Supports Protobuf, Schema Registry Integration, Timestamp Filter Pushdown, Offset Filter Pushdown, Partition Filter Pushdown, Supports Produce Operation, Supports Exactly Once
Demo Use Case – Vehicle Tracking
[Figure: demo data flow across the zones Raw, Refined and Usage Opt]
• Truck-1, Truck-2, … Truck-n publish position events to the topic truck_position, for example:
  2020-06-02 14:39:56.605,98,27,803014426,Wichita to Little Rock Route2,Normal,38.65,90.21,5187297736652502631
  2020-06-02 14:39:56.605,21,19,803014427,Wichita to Little Rock Route3,Overspeed,32.35,91.21,5187297736652502632
• Refinement: truck_position → truck_position_avro
• detect_problematic_driving: truck_position_avro → problematic_driving
• jdbc-source: Truck Driver table → truck_driver, for example:
  27, Walter, Ward, Y, 24-JUL-85, 2017-10-02 15:19:00
  {"id":19,"firstName":"Walter","lastName":"Ward","available":"Y","birthdate":"24-JUL-85","last_update":1506923052012}
• join_problematic_driving_driver: problematic_driving joined with truck_driver → problematic_driving_driver (read by a console consumer)
• aggregate by eventType over time window: → problematic_driving_agg, retrieved with a pull query, for example:
  Overspeed,10,10:00:00,10:00:059
Demo
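The “aggregate by eventType over time window” step followed by a pull query can be sketched in plain Python. This is an in-memory illustration of the idea, not ksqlDB or Kafka Streams code; the 5-second tumbling window, the event layout and the event type "Lane Departure" are assumptions made for the example.

```python
from collections import defaultdict

WINDOW_MS = 5_000  # assumed tumbling window size for this sketch

# Materialized view: (eventType, window_start_ms) -> count
view = defaultdict(int)

def process(event):
    """Streaming side: update the view for each incoming event."""
    window_start = event["ts"] - event["ts"] % WINDOW_MS
    view[(event["eventType"], window_start)] += 1

def pull_query(event_type, ts):
    """Pull query: look up the current count for the window containing ts."""
    window_start = ts - ts % WINDOW_MS
    return view.get((event_type, window_start), 0)

# "Lane Departure" is a hypothetical event type used only for illustration.
for e in [
    {"ts": 1_000, "eventType": "Overspeed"},
    {"ts": 4_999, "eventType": "Overspeed"},
    {"ts": 5_000, "eventType": "Overspeed"},
    {"ts": 6_000, "eventType": "Lane Departure"},
]:
    process(e)

first_window = pull_query("Overspeed", 2_000)
second_window = pull_query("Overspeed", 5_500)
```

The pull query is a point lookup against state that the streaming side keeps up to date, which is why it returns immediately instead of scanning the topic.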
BP-2: Batch Processing with Event Hub as Source
• Using a Batch Processing framework to process Event Hub data retrospectively (full history available)
• Write back results to Event Hub
• Read and join multiple sources
• Can be combined with Advanced Analytics capabilities (e.g. machine learning / AI)
BP-2_1: Apache Spark with Kafka as Source
• Apache Spark is a unified analytics engine for large-scale data processing
• Provides complex analytics through MLlib and GraphX
• Can consume from/produce to Kafka both in Streaming as well as Batch Mode
• Use Data Frame / Dataset abstraction as you would with other data sources
Capability checklist (Yes / No): Supports JSON, Supports Avro, Supports Protobuf, Schema Registry Integration, Timestamp Filter Pushdown, Offset Filter Pushdown, Partition Filter Pushdown, Supports Produce Operation, Supports Exactly Once
BP-2_1: Apache Spark with Kafka as Source
from pyspark.sql.functions import from_json
from pyspark.sql.types import StructType, TimestampType, LongType, StringType, DoubleType

# schema of the JSON-encoded truck position messages
truckPositionSchema = (StructType().add("timestamp", TimestampType())
    .add("truckId", LongType())
    .add("driverId", LongType())
    .add("routeId", LongType())
    .add("eventType", StringType())
    .add("latitude", DoubleType())
    .add("longitude", DoubleType())
    .add("correlationId", StringType()))

# batch read of the topic (spark is an existing SparkSession)
rawDf = (spark.read.format("kafka")
    .option("kafka.bootstrap.servers", "kafka-1:19092,kafka-2:19093")
    .option("subscribe", "truck_position")
    .load())

# the Kafka value is binary: cast it to a string, then parse the JSON
jsonDf = rawDf.selectExpr("CAST(value AS string)")
jsonDf = (jsonDf.select(from_json(jsonDf.value, truckPositionSchema)
        .alias("json"))
    .selectExpr("json.*",
        "cast(cast (json.timestamp as double) / 1000 as timestamp) as eventTime"))
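To restrict a batch read to an offset range, Spark's Kafka source accepts the options `startingOffsets` and `endingOffsets` as JSON strings of the form topic → partition → offset, where -2 stands for earliest and -1 for latest. The helper below only builds those strings and runs without Spark; its name and the concrete offsets are assumptions for illustration.

```python
import json

EARLIEST, LATEST = -2, -1  # special offset values understood by Spark's Kafka source

def offsets_option(topic, offsets_by_partition):
    """Build the JSON string for the startingOffsets / endingOffsets options."""
    return json.dumps(
        {topic: {str(p): o for p, o in sorted(offsets_by_partition.items())}}
    )

# Hypothetical range: partition 0 from offset 100 to 500, partition 1 completely.
starting = offsets_option("truck_position", {0: 100, 1: EARLIEST})
ending = offsets_option("truck_position", {0: 500, 1: LATEST})

# Usage with the batch reader of the Spark example (not executed here, needs a SparkSession):
# rawDf = (spark.read.format("kafka")
#     .option("kafka.bootstrap.servers", "kafka-1:19092,kafka-2:19093")
#     .option("subscribe", "truck_position")
#     .option("startingOffsets", starting)
#     .option("endingOffsets", ending)
#     .load())
```

Every partition of the subscribed topic has to appear in both strings, so in practice the partition list would be derived from the topic metadata rather than typed by hand.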
BP-3: Batch Query with Event Hub as Source
• Using a Query Virtualization framework to consume (query) Event Hub data retrospectively (full history available)
• Optionally produce (insert) data into Event Hub
• Read and join multiple sources
• Based on SQL and with the full power of SQL at hand (functions and optionally UDF/UDAF/UDTF)
• Batch SQL not Streaming SQL
BP-3_1: Presto with Kafka as Source
• Presto is a distributed SQL query engine for big data
• Supports accessing data from multiple systems within a single query
• Supports Kafka as a source (query) but not as a target
• Does not yet support pushdown of timestamp queries
• Starburst Enterprise Presto provides fine-grained access control
Capability checklist (Yes / No): Supports JSON, Supports Avro, Supports Protobuf, Schema Registry Integration, Timestamp Filter Pushdown, Offset Filter Pushdown, Partition Filter Pushdown, Supports Produce Operation, Supports Exactly Once
BP-3_1: Presto with Kafka as Source
kafka.properties
connector.name=kafka
kafka.nodes=kafka-1:9092
kafka.table-names=truck_position, truck_driver
kafka.default-schema=logistics
kafka.hide-internal-columns=false
kafka.table-description-dir=etc/kafka

select * from truck_position;
BP-3_1: Presto with Kafka as Source
etc/kafka/truck_position.json (etc/kafka/truck_driver.json follows the same pattern)
{
  "tableName": "truck_position",
  "schemaName": "logistics",
  "topicName": "truck_position",
  "key": {
    "dataFormat": "raw",
    "fields": [
      {
        "name": "kafka_key",
        "dataFormat": "BYTE",
        "type": "VARCHAR",
        "hidden": "false"
      }
    ]
  },
  "message": {
    "dataFormat": "json",
    "fields": [
      {
        "name": "timestamp",
        "mapping": "timestamp",
        "type": "BIGINT"
      },
      {
        "name": "truck_id",
        "mapping": "truckId",
        "type": "BIGINT"
      },
      ...
BP-3_1: Presto with Kafka as Source
select * from truck_position
select * from truck_driver
BP-3_1: Presto with Kafka as Source
Join truck_position with truck_driver (removing non-compacted entries using a Presto window function)
SELECT d.id, d.first_name, d.last_name, t.*
FROM truck_position t
LEFT JOIN (
    SELECT *
    FROM truck_driver
    WHERE (last_update) IN
        (SELECT LAST_VALUE(last_update)
                OVER (PARTITION BY id
                      ORDER BY last_update
                      RANGE BETWEEN UNBOUNDED PRECEDING AND
                                    UNBOUNDED FOLLOWING) AS last_update
         FROM truck_driver) ) d
ON t.driver_id = d.id
WHERE t.event_type != 'Normal';
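What the query achieves can be sketched in plain Python: keep only the latest version of each driver (what a compacted topic would contain) and left-join the non-normal truck positions against it. The column names follow the query; the sample rows and the helper `latest_per_key` are illustrative assumptions, not data from the demo.

```python
# truck_driver as read from a non-compacted topic: several versions per driver id.
truck_driver = [
    {"id": 19, "first_name": "Walter", "last_name": "Ward", "last_update": 100},
    {"id": 19, "first_name": "Walter", "last_name": "Ward (updated)", "last_update": 200},
    {"id": 27, "first_name": "Walter", "last_name": "Ward", "last_update": 150},
]

truck_position = [
    {"truck_id": 21, "driver_id": 19, "event_type": "Overspeed"},
    {"truck_id": 98, "driver_id": 27, "event_type": "Normal"},
    {"truck_id": 40, "driver_id": 99, "event_type": "Overspeed"},  # no driver record
]

def latest_per_key(rows, key, version):
    """Keep only the row with the highest version per key."""
    latest = {}
    for row in rows:
        current = latest.get(row[key])
        if current is None or row[version] >= current[version]:
            latest[row[key]] = row
    return latest

drivers = latest_per_key(truck_driver, key="id", version="last_update")

# LEFT JOIN: every non-normal position is kept, last_name is None when unmatched.
joined = [
    {**pos, "last_name": drivers.get(pos["driver_id"], {}).get("last_name")}
    for pos in truck_position
    if pos["event_type"] != "Normal"
]
```

The deduplication here is keyed on the driver id, so an older version of one driver can never be kept because its timestamp happens to match the latest timestamp of another driver.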
BP-3_2: Apache Drill with Kafka as Source
• Apache Drill is a schema-free SQL Query Engine for Hadoop, NoSQL and Cloud Storage
• Supports accessing data from multiple systems within a single query
• Can push down filters on partitions, timestamp and offset
Capability checklist (Yes / No): Supports JSON, Supports Avro, Supports Protobuf, Schema Registry Integration, Timestamp Filter Pushdown, Offset Filter Pushdown, Partition Filter Pushdown, Supports Produce Operation, Supports Exactly Once
BP-3_3: Hive/Spark SQL with Kafka as Source
• Apache Hive facilitates reading, writing, and managing large datasets residing in distributed storage using SQL
• Part of any Hadoop distribution
• A special storage handler allows access to Kafka topics via Hive external tables
• Spark SQL on data frame as shown in BP-2_1 or by integrating Hive Metastore
Capability checklist (Yes / No): Supports JSON, Supports Avro, Supports Protobuf, Schema Registry Integration, Timestamp Filter Pushdown, Offset Filter Pushdown, Partition Filter Pushdown, Supports Produce Operation, Supports Exactly Once
BP-3_4: Oracle Access to Kafka with Kafka as Source
• Oracle SQL Access to Kafka is a PL/SQL package that enables Oracle SQL to query Kafka topics via DB views and underlying external tables [6]
• Runs in the Oracle database
• Supports Kafka as a source (query) but not (yet) as a target
• Use Oracle SQL to access the Kafka topics and optionally join to RDBMS tables
Capability checklist (Yes / No): Supports JSON, Supports Avro, Supports Protobuf, Schema Registry Integration, Timestamp Filter Pushdown, Offset Filter Pushdown, Partition Filter Pushdown, Supports Produce Operation, Supports Exactly Once
BP-4: Use any storage as “materialized view”
• Use any persistence technology to provide a “Materialized View” to the Data Consumers
• Can be provided in retrospective, run-once, batch or streaming update (in sync) mode
• “On Demand” use cases
• Provide a sandbox environment for data scientists
• Provide part of Kafka topics materialized in object storage
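A materialized view in another store can be sketched with SQLite from the Python standard library: each change event read from a topic is applied as an upsert or delete, so the table always reflects the latest state per key. The changelog content, the table layout and the convention that a `None` value marks a delete are assumptions for this illustration; reading from Kafka is left out.

```python
import sqlite3

# Change events as they might be consumed from a topic: (key, value).
# A value of None marks a delete (assumed convention for this sketch).
changelog = [
    (19, {"first_name": "Walter", "last_name": "Ward", "available": "Y"}),
    (27, {"first_name": "Walter", "last_name": "Ward", "available": "Y"}),
    (19, {"first_name": "Walter", "last_name": "Ward", "available": "N"}),
    (27, None),
]

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE truck_driver ("
    "id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT, available TEXT)"
)

def apply_event(event):
    """Apply one change event to the materialized view."""
    key, value = event
    if value is None:
        db.execute("DELETE FROM truck_driver WHERE id = ?", (key,))
    else:
        db.execute(
            "INSERT OR REPLACE INTO truck_driver VALUES (?, ?, ?, ?)",
            (key, value["first_name"], value["last_name"], value["available"]),
        )

for event in changelog:
    apply_event(event)
db.commit()

rows = db.execute("SELECT id, available FROM truck_driver ORDER BY id").fetchall()
```

Running the loop once over the full history corresponds to the run-once mode, while applying events as they arrive corresponds to the streaming update mode.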
Architecture Blueprints Overview
Blueprint (columns): Streaming BP1_1 | Batch Processing BP2_1 | Query BP3_1 | Query BP3_2 | Query BP3_3 | Query BP3_4
Capability (rows):
Supports JSON
Supports Avro 🔴 🔴
Supports Protobuf 🔴 🔴 🔴 🔴 🔴
Schema Registry Integration 🔴 🔴 🔴 🔴
Timestamp Filter Pushdown ⚪ 🔴
Offset Filter Pushdown ⚪ 🔴
Partition Filter Pushdown ⚪ 🔴 🔴
Supports Produce Operation 🔴 🔴 🔴
Supports Exactly Once 🔴 🔴 🔴 🔴
• BP-1_1: Streaming Data Lake using Kafka Streams / ksqlDB
• BP-2_1: Apache Spark with Kafka as Source
• BP-3_1: Presto with Kafka as Source
• BP-3_2: Apache Drill with Kafka as Source
• BP-3_3: Hive/Spark SQL with Kafka as Source
• BP-3_4: Oracle Access to Kafka with Kafka as Source
Summary
• Move processing / analytics from batch to stream processing pipelines
• Event Hub (Kafka) as the single source of truth => turning the database inside out!
• Everything else is just a “Materialized View” of the Event Hub topic data
• Can still be HDFS, Object Store (S3, …) but also Kudu or Parquet
• NoSQL Databases & Relational Databases
• In-Memory Databases
• Confluent Platform Tiered Storage makes long-term storage feasible
• Does not apply to large, unstructured data (images, videos, …) => a separate path around the Event Hub is necessary, but the metadata is still sent through the Event Hub
• This is the result of a Proof-of-Concept: only functional tests done so far, performance tests will follow
References
1. Turning the database inside out with Apache Samza – by Martin Kleppmann
2. It’s Okay To Store Data In Apache Kafka – by Jay Kreps
3. Infinite Storage in Confluent Platform – by Lucas Bradstreet, Dhruvil Shah, Manveer Chawla
4. KIP-405: Kafka Tiered Storage – Kafka Improvement Proposal
5. Streaming Machine Learning with Tiered Storage and Without a Data Lake – by Kai Waehner
6. Read data from Kafka topic using Oracle SQL Access to Kafka (OSAK) – by Mohammad H. AbdelQader
7. Demo environment and code samples – by Guido Schmutz (on GitHub)
8. Questioning the Lambda Architecture – by Jay Kreps
Kafka as your Data Lake - is it Feasible? (Guido Schmutz, Trivadis) Kafka Summit 2020

More Related Content

What's hot

Streaming all over the world Real life use cases with Kafka Streams
Streaming all over the world  Real life use cases with Kafka StreamsStreaming all over the world  Real life use cases with Kafka Streams
Streaming all over the world Real life use cases with Kafka Streams
confluent
 
Apache Kafka 0.11 の Exactly Once Semantics
Apache Kafka 0.11 の Exactly Once SemanticsApache Kafka 0.11 の Exactly Once Semantics
Apache Kafka 0.11 の Exactly Once Semantics
Yoshiyasu SAEKI
 
Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap...
Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap...Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap...
Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap...
Flink Forward
 
Apache Kafka and KSQL in Action: Let's Build a Streaming Data Pipeline!
Apache Kafka and KSQL in Action: Let's Build a Streaming Data Pipeline!Apache Kafka and KSQL in Action: Let's Build a Streaming Data Pipeline!
Apache Kafka and KSQL in Action: Let's Build a Streaming Data Pipeline!
confluent
 
Apache Kafka 0.8 basic training - Verisign
Apache Kafka 0.8 basic training - VerisignApache Kafka 0.8 basic training - Verisign
Apache Kafka 0.8 basic training - Verisign
Michael Noll
 
Understanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsUnderstanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIs
Databricks
 
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...
Flink Forward
 
How Uber scaled its Real Time Infrastructure to Trillion events per day
How Uber scaled its Real Time Infrastructure to Trillion events per dayHow Uber scaled its Real Time Infrastructure to Trillion events per day
How Uber scaled its Real Time Infrastructure to Trillion events per day
DataWorks Summit
 
Introduction to Kafka Streams
Introduction to Kafka StreamsIntroduction to Kafka Streams
Introduction to Kafka Streams
Guozhang Wang
 
Apache Kafka Architecture & Fundamentals Explained
Apache Kafka Architecture & Fundamentals ExplainedApache Kafka Architecture & Fundamentals Explained
Apache Kafka Architecture & Fundamentals Explained
confluent
 
아름답고 유연한 데이터 파이프라인 구축을 위한 Amazon Managed Workflow for Apache Airflow - 유다니엘 A...
아름답고 유연한 데이터 파이프라인 구축을 위한 Amazon Managed Workflow for Apache Airflow - 유다니엘 A...아름답고 유연한 데이터 파이프라인 구축을 위한 Amazon Managed Workflow for Apache Airflow - 유다니엘 A...
아름답고 유연한 데이터 파이프라인 구축을 위한 Amazon Managed Workflow for Apache Airflow - 유다니엘 A...
Amazon Web Services Korea
 
From Spring Framework 5.3 to 6.0
From Spring Framework 5.3 to 6.0From Spring Framework 5.3 to 6.0
From Spring Framework 5.3 to 6.0
VMware Tanzu
 
Kafka Retry and DLQ
Kafka Retry and DLQKafka Retry and DLQ
Kafka Retry and DLQ
George Teo
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
Databricks
 
ELK Stack
ELK StackELK Stack
ELK Stack
Eberhard Wolff
 
Confluent Operator as Cloud-Native Kafka Operator for Kubernetes
Confluent Operator as Cloud-Native Kafka Operator for KubernetesConfluent Operator as Cloud-Native Kafka Operator for Kubernetes
Confluent Operator as Cloud-Native Kafka Operator for Kubernetes
Kai Wähner
 
CDC patterns in Apache Kafka®
CDC patterns in Apache Kafka®CDC patterns in Apache Kafka®
CDC patterns in Apache Kafka®
confluent
 
Using Kafka Streams to Analyze Live Trading Activity for Crypto Exchanges (Lu...
Using Kafka Streams to Analyze Live Trading Activity for Crypto Exchanges (Lu...Using Kafka Streams to Analyze Live Trading Activity for Crypto Exchanges (Lu...
Using Kafka Streams to Analyze Live Trading Activity for Crypto Exchanges (Lu...
confluent
 
Stl meetup cloudera platform - january 2020
Stl meetup   cloudera platform  - january 2020Stl meetup   cloudera platform  - january 2020
Stl meetup cloudera platform - january 2020
Adam Doyle
 
Evening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in FlinkEvening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in Flink
Flink Forward
 

What's hot (20)

Streaming all over the world Real life use cases with Kafka Streams
Streaming all over the world  Real life use cases with Kafka StreamsStreaming all over the world  Real life use cases with Kafka Streams
Streaming all over the world Real life use cases with Kafka Streams
 
Apache Kafka 0.11 の Exactly Once Semantics
Apache Kafka 0.11 の Exactly Once SemanticsApache Kafka 0.11 の Exactly Once Semantics
Apache Kafka 0.11 の Exactly Once Semantics
 
Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap...
Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap...Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap...
Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap...
 
Apache Kafka and KSQL in Action: Let's Build a Streaming Data Pipeline!
Apache Kafka and KSQL in Action: Let's Build a Streaming Data Pipeline!Apache Kafka and KSQL in Action: Let's Build a Streaming Data Pipeline!
Apache Kafka and KSQL in Action: Let's Build a Streaming Data Pipeline!
 
Apache Kafka 0.8 basic training - Verisign
Apache Kafka 0.8 basic training - VerisignApache Kafka 0.8 basic training - Verisign
Apache Kafka 0.8 basic training - Verisign
 
Understanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsUnderstanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIs
 
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...
 
How Uber scaled its Real Time Infrastructure to Trillion events per day
How Uber scaled its Real Time Infrastructure to Trillion events per dayHow Uber scaled its Real Time Infrastructure to Trillion events per day
How Uber scaled its Real Time Infrastructure to Trillion events per day
 
Introduction to Kafka Streams
Introduction to Kafka StreamsIntroduction to Kafka Streams
Introduction to Kafka Streams
 
Apache Kafka Architecture & Fundamentals Explained
Apache Kafka Architecture & Fundamentals ExplainedApache Kafka Architecture & Fundamentals Explained
Apache Kafka Architecture & Fundamentals Explained
 
아름답고 유연한 데이터 파이프라인 구축을 위한 Amazon Managed Workflow for Apache Airflow - 유다니엘 A...
아름답고 유연한 데이터 파이프라인 구축을 위한 Amazon Managed Workflow for Apache Airflow - 유다니엘 A...아름답고 유연한 데이터 파이프라인 구축을 위한 Amazon Managed Workflow for Apache Airflow - 유다니엘 A...
아름답고 유연한 데이터 파이프라인 구축을 위한 Amazon Managed Workflow for Apache Airflow - 유다니엘 A...
 
From Spring Framework 5.3 to 6.0
From Spring Framework 5.3 to 6.0From Spring Framework 5.3 to 6.0
From Spring Framework 5.3 to 6.0
 
Kafka Retry and DLQ
Kafka Retry and DLQKafka Retry and DLQ
Kafka Retry and DLQ
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
ELK Stack
ELK StackELK Stack
ELK Stack
 
Confluent Operator as Cloud-Native Kafka Operator for Kubernetes
Confluent Operator as Cloud-Native Kafka Operator for KubernetesConfluent Operator as Cloud-Native Kafka Operator for Kubernetes
Confluent Operator as Cloud-Native Kafka Operator for Kubernetes
 
CDC patterns in Apache Kafka®
CDC patterns in Apache Kafka®CDC patterns in Apache Kafka®
CDC patterns in Apache Kafka®
 
Using Kafka Streams to Analyze Live Trading Activity for Crypto Exchanges (Lu...
Using Kafka Streams to Analyze Live Trading Activity for Crypto Exchanges (Lu...Using Kafka Streams to Analyze Live Trading Activity for Crypto Exchanges (Lu...
Using Kafka Streams to Analyze Live Trading Activity for Crypto Exchanges (Lu...
 
Stl meetup cloudera platform - january 2020
Stl meetup   cloudera platform  - january 2020Stl meetup   cloudera platform  - january 2020
Stl meetup cloudera platform - january 2020
 
Evening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in FlinkEvening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in Flink
 

Similar to Kafka as your Data Lake - is it Feasible? (Guido Schmutz, Trivadis) Kafka Summit 2020

Kafka as your Data Lake - is it Feasible?
Kafka as your Data Lake - is it Feasible?Kafka as your Data Lake - is it Feasible?
Kafka as your Data Lake - is it Feasible?
Guido Schmutz
 
Introduction to Stream Processing
Introduction to Stream ProcessingIntroduction to Stream Processing
Introduction to Stream Processing
Guido Schmutz
 
Streaming Visualization
Streaming VisualizationStreaming Visualization
Streaming Visualization
Guido Schmutz
 
JConWorld_ Continuous SQL with Kafka and Flink
JConWorld_ Continuous SQL with Kafka and FlinkJConWorld_ Continuous SQL with Kafka and Flink
JConWorld_ Continuous SQL with Kafka and Flink
Timothy Spann
 
Event Hub (i.e. Kafka) in Modern Data Architecture
Event Hub (i.e. Kafka) in Modern Data ArchitectureEvent Hub (i.e. Kafka) in Modern Data Architecture
Event Hub (i.e. Kafka) in Modern Data Architecture
Guido Schmutz
 
Large-Scale Stream Processing in the Hadoop Ecosystem
Large-Scale Stream Processing in the Hadoop Ecosystem Large-Scale Stream Processing in the Hadoop Ecosystem
Large-Scale Stream Processing in the Hadoop Ecosystem
DataWorks Summit/Hadoop Summit
 
Large-Scale Stream Processing in the Hadoop Ecosystem - Hadoop Summit 2016
Large-Scale Stream Processing in the Hadoop Ecosystem - Hadoop Summit 2016Large-Scale Stream Processing in the Hadoop Ecosystem - Hadoop Summit 2016
Large-Scale Stream Processing in the Hadoop Ecosystem - Hadoop Summit 2016
Gyula Fóra
 
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Guido Schmutz
 
Introduction to GCP Data Flow Presentation
Introduction to GCP Data Flow PresentationIntroduction to GCP Data Flow Presentation
Introduction to GCP Data Flow Presentation
Knoldus Inc.
 
Introduction to GCP DataFlow Presentation
Introduction to GCP DataFlow PresentationIntroduction to GCP DataFlow Presentation
Introduction to GCP DataFlow Presentation
Knoldus Inc.
 
Self-Service Data Ingestion Using NiFi, StreamSets & Kafka
Self-Service Data Ingestion Using NiFi, StreamSets & KafkaSelf-Service Data Ingestion Using NiFi, StreamSets & Kafka
Self-Service Data Ingestion Using NiFi, StreamSets & Kafka
Guido Schmutz
 
Headaches and Breakthroughs in Building Continuous Applications
Headaches and Breakthroughs in Building Continuous ApplicationsHeadaches and Breakthroughs in Building Continuous Applications
Headaches and Breakthroughs in Building Continuous Applications
Databricks
 
Data Warehouse vs. Data Lake vs. Data Streaming – Friends, Enemies, Frenemies?
Data Warehouse vs. Data Lake vs. Data Streaming – Friends, Enemies, Frenemies?Data Warehouse vs. Data Lake vs. Data Streaming – Friends, Enemies, Frenemies?
Data Warehouse vs. Data Lake vs. Data Streaming – Friends, Enemies, Frenemies?
Kai Wähner
 
DBA Fundamentals Group: Continuous SQL with Kafka and Flink
DBA Fundamentals Group: Continuous SQL with Kafka and FlinkDBA Fundamentals Group: Continuous SQL with Kafka and Flink
DBA Fundamentals Group: Continuous SQL with Kafka and Flink
Timothy Spann
 
Keeping Analytics Data Fresh in a Streaming Architecture | John Neal, Qlik
Keeping Analytics Data Fresh in a Streaming Architecture | John Neal, QlikKeeping Analytics Data Fresh in a Streaming Architecture | John Neal, Qlik
Keeping Analytics Data Fresh in a Streaming Architecture | John Neal, Qlik
HostedbyConfluent
 
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...
Landon Robinson
 
Event Hub (i.e. Kafka) in Modern Data (Analytics) Architecture
Event Hub (i.e. Kafka) in Modern Data (Analytics) ArchitectureEvent Hub (i.e. Kafka) in Modern Data (Analytics) Architecture
Event Hub (i.e. Kafka) in Modern Data (Analytics) Architecture
Guido Schmutz
 
XStream: stream processing platform at facebook
XStream:  stream processing platform at facebookXStream:  stream processing platform at facebook
XStream: stream processing platform at facebook
Aniket Mokashi
 
The Never Landing Stream with HTAP and Streaming
The Never Landing Stream with HTAP and StreamingThe Never Landing Stream with HTAP and Streaming
The Never Landing Stream with HTAP and Streaming
Timothy Spann
 
February 2014 HUG : Tez Details and Insides
February 2014 HUG : Tez Details and InsidesFebruary 2014 HUG : Tez Details and Insides
February 2014 HUG : Tez Details and Insides
Yahoo Developer Network
 

Similar to Kafka as your Data Lake - is it Feasible? (Guido Schmutz, Trivadis) Kafka Summit 2020 (20)

Kafka as your Data Lake - is it Feasible?
Kafka as your Data Lake - is it Feasible?Kafka as your Data Lake - is it Feasible?
Kafka as your Data Lake - is it Feasible?
 
Introduction to Stream Processing
Introduction to Stream ProcessingIntroduction to Stream Processing
Introduction to Stream Processing
 
Streaming Visualization
Streaming VisualizationStreaming Visualization
Streaming Visualization
 
JConWorld_ Continuous SQL with Kafka and Flink
JConWorld_ Continuous SQL with Kafka and FlinkJConWorld_ Continuous SQL with Kafka and Flink
JConWorld_ Continuous SQL with Kafka and Flink
 
Event Hub (i.e. Kafka) in Modern Data Architecture





Kafka as your Data Lake - is it Feasible? (Guido Schmutz, Trivadis) Kafka Summit 2020

  • 1. Kafka as your Data Lake - is it Feasible? Guido Schmutz, Kafka Summit 2020 (guidoschmutz.wordpress.com, @gschmutz)
  • 2. Guido Schmutz: Working at Trivadis for more than 23 years. Consultant, Trainer, Platform Architect for Java, Oracle, SOA and Big Data / Fast Data. Oracle Groundbreaker Ambassador & Oracle ACE Director. @gschmutz, guidoschmutz.wordpress.com
  • 3. Agenda
    1. What is a Data Lake?
    2. Four Architecture Blueprints for “treating Kafka as a Data Lake”
    3. Summary
    Demo environment and code samples available here: https://github.com/gschmutz/kafka-as-your-datalake-demo
  • 4. What is a Data Lake?
  • 5. What is a Data Lake? Traditional Data Lake Architecture
    [Diagram labels: Bulk Source (DB, Extract, File, DB); File Import / SQL Import; Big Data Platform (Hadoop Cluster) with Storage (Raw, Refined/UsageOpt); “Native” Raw; Parallel Processing; Query Engine; SQL / Search; Data Consumer (BI Apps, Data Science Workbench); high latency]
    Initial Idea of Data Lake
    • Single store of all data (incl. raw data) in the enterprise
    • Put an end to data silos
    • Reporting, Visualization, Analytics and Machine Learning
    • Focus on Schema-on-Read
    Tech for 1st Gen Data Lake
    • HDFS, MapReduce, Pig, Hive, Impala, Flume, Sqoop
    Tech for 2nd Gen Data Lake (Cloud native)
    • Object Store (S3, Azure Blob Storage, …), Spark, Flink, Presto, StreamSets, …
  • 7. ”Streaming Data Lake” – aka. Kappa Architecture
    [Diagram labels: Bulk Source and Event Source (Location, DB, Extract, File, Weather, DB, IoT Data, Mobile Apps, Social, Change Data Capture); Event Stream; Event Hub; Stream Processing Platform (Stream Processor V1.0 / V2.0, State V1.0 / V2.0, Result V1.0 / V2.0, API (Switcher)); Result Stream; Bulk Data Flow; Reply Bulk Data Flow; (Big) Data Platform (Hadoop Cluster) with Storage (Raw, Refined/UsageOpt); “Native” Raw; Parallel Processing; Query Engine; SQL / Search; Data Science Workbench; Data Consumer (BI Apps, Dashboard); Serving; Source of Truth]
    [8] Questioning the Lambda Architecture – by Jay Kreps
  • 9. Moving the Source of Truth to Event Hub: Turning the database inside-out!
    [Diagram: same labels as on slide 7]
    [1] Turning the database inside out with Apache Samza – by Martin Kleppmann
  • 10. Moving the Source of Truth to Event Hub: is it feasible?
    [Diagram: same labels as on slide 9, without the Reply Bulk Data Flow]
    [2] It’s Okay To Store Data In Apache Kafka – by Jay Kreps
  • 11. Confluent Enterprise Tiered Storage
    Data Retention
    • Never
    • Time (TTL) or Size-based
    • Log-Compacted based
    Tiered Storage uses two tiers of storage
    • Local (same local disks on brokers)
    • Remote (Object storage, currently AWS S3 only)
    Enables Kafka to be a long-term storage solution
    • Transparent (no ETL pipelines needed)
    • Cheaper storage for cold data
    • Better scalability and less complex operations
    [Diagram labels: Broker 1, Broker 2, Broker 3 (hot); Object Storage (cold)]
    [3] Infinite Storage in Confluent Platform – by Lucas Bradstreet, Dhruvil Shah, Manveer Chawla
    [4] KIP-405: Kafka Tiered Storage – Kafka Improvement Proposal
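    The retention options listed on this slide map onto standard Kafka topic-level settings. The following is a minimal sketch of that mapping, not part of the original deck: the helper name `retention_config` and the policy labels are hypothetical, while `cleanup.policy`, `retention.ms` and `retention.bytes` are the regular topic configuration keys. Settings that enable tiered storage itself are deliberately left out.

```python
def retention_config(policy, ttl_ms=None, max_bytes=None):
    """Translate a retention policy label (hypothetical names) into topic settings."""
    if policy == "never":
        # Never delete: -1 disables both the time and the size limit.
        return {"cleanup.policy": "delete",
                "retention.ms": "-1",
                "retention.bytes": "-1"}
    if policy == "time":
        # Time (TTL) based: records older than ttl_ms become eligible for deletion.
        return {"cleanup.policy": "delete", "retention.ms": str(ttl_ms)}
    if policy == "size":
        # Size based: retention.bytes applies per partition, not per topic.
        return {"cleanup.policy": "delete", "retention.bytes": str(max_bytes)}
    if policy == "compacted":
        # Log compaction: keep at least the latest record per key.
        return {"cleanup.policy": "compact"}
    raise ValueError(f"unknown retention policy: {policy}")


seven_days_ms = 7 * 24 * 60 * 60 * 1000
print(retention_config("time", ttl_ms=seven_days_ms))
```

    The resulting dictionary could be passed as topic configuration when creating or altering a topic with whichever admin tool is in use.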
  • 12. Four Architecture Blueprints for “treating Kafka as a Data Lake”
  • 13. How can you access a Kafka topic?
    Streaming Queries
    • Latest – start at the end and continuously consume new data
    • Earliest – start at the beginning, consume history and then continuously consume new data
    • Seek to offset – start at a given offset, consume history and then continuously consume new data
    • Seek to timestamp – start at a given timestamp, consume history and then continuously consume new data
    Batch Queries
    • From start offset to end offset – start at a given offset and consume until another offset
    • From start timestamp to end timestamp – start at a given timestamp and consume until another timestamp
    • Full scan – scan the complete topic from start to end
    All of the above access options can be applied to a topic or to a set of partitions
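    The batch access options above can be illustrated with a small in-memory model of a single partition. This is only a sketch under simplifying assumptions that are not from the deck: the class `PartitionLog` is hypothetical, offsets are the list indices, and record timestamps never decrease (real Kafka does not guarantee that). A timestamp is resolved to the earliest offset whose record timestamp is greater than or equal to it, which mirrors how Kafka looks up offsets by time.

```python
from bisect import bisect_left


class PartitionLog:
    """Toy model of one partition: offset == index, timestamps non-decreasing."""

    def __init__(self, records):
        # records: list of (timestamp_ms, value) tuples in append order
        self.records = list(records)
        self._timestamps = [ts for ts, _ in self.records]

    def end_offset(self):
        return len(self.records)

    def offset_for_time(self, ts_ms):
        # Earliest offset with timestamp >= ts_ms (end offset if none exists).
        return bisect_left(self._timestamps, ts_ms)

    def read_offsets(self, start_offset, end_offset):
        # Batch query "from start offset to end offset" (end exclusive).
        return [value for _, value in self.records[start_offset:end_offset]]

    def read_times(self, start_ts_ms, end_ts_ms):
        # Batch query "from start timestamp to end timestamp".
        return self.read_offsets(self.offset_for_time(start_ts_ms),
                                 self.offset_for_time(end_ts_ms))

    def full_scan(self):
        return self.read_offsets(0, self.end_offset())


log = PartitionLog([(100, "a"), (200, "b"), (300, "c"), (400, "d")])
print(log.read_offsets(1, 3))
print(log.read_times(150, 400))
print(log.full_scan())
```

    A streaming query differs only in that it has no end position: after reaching the current end offset it keeps waiting for new records.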
  • 14. BP-1: ”Streaming” Data Lake
    • Using Stream Processing tools to perform processing on ”data in motion” instead of in batch
    • Can consume from multiple sources
    • Works well if no or limited history is needed
    • Queryable State Stores, aka. Interactive Queries or Pull Queries
    [5] Streaming Machine Learning with Tiered Storage and Without a Data Lake – by Kai Waehner
  • 15. BP-1_1: ”Streaming” Data Lake with ksqlDB / Kafka Streams
    • Kafka Streams or ksqlDB fit perfectly
    • Using ksqlDB pull queries to retrieve current state of materialized views
    • Store results in another Kafka topic to persist state store information
    • Can be combined with BP-4 to store results/state in a database
    [Capability matrix, Yes/No per criterion (values shown only graphically on the slide): Supports Protobuf, Supports Avro, Supports JSON, Schema Registry Integration, Timestamp Filter Pushdown, Offset Filter Pushdown, Partition Filter Pushdown, Supports Produce Operation, Supports Exactly Once]
  • 16. Demo Use Case – Vehicle Tracking
    [Diagram labels: Truck-1, Truck-2, Truck-n; truck_position; Refinement; truck_position_avro; detect_problematic_driving; problematic_driving; Truck Driver; jdbc-source; truck_driver; join_problematic_driving_driver; problematic_driving_driver; aggregate by eventType over time window; problematic_driving_agg; Pull query; console consumer; zones Raw, Refined, Usage Opt]
    Sample truck position messages:
    2020-06-02 14:39:56.605,98,27,803014426, Wichita to Little Rock Route2, Normal,38.65,90.21,5187297736652502631
    2020-06-02 14:39:56.605,21,19,803014427, Wichita to Little Rock Route3, Overspeed,32.35,91.21,5187297736652502632
    Sample Truck Driver row:
    27, Walter, Ward, Y, 24-JUL-85, 2017-10-02 15:19:00
    Sample truck_driver message:
    {"id":19,"firstName":"Walter","lastName":"Ward","available":"Y","birthdate":"24-JUL-85","last_update":1506923052012}
    Sample pull query result:
    Overspeed,10,10:00:00,10:00:059
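    The "aggregate by eventType over time window" step with its pull query can be mimicked in plain Python to show the idea behind a materialized view. This is a sketch only, not the ksqlDB implementation: the class name, the 5-second tumbling window and the filtering of `Normal` events are assumptions made for illustration.

```python
from collections import Counter

WINDOW_MS = 5_000  # assumed tumbling window size


def window_start(ts_ms, size_ms=WINDOW_MS):
    """Start of the tumbling window that contains ts_ms."""
    return ts_ms - ts_ms % size_ms


class ProblematicDrivingAgg:
    """Keeps a count per (eventType, window), updated as events arrive."""

    def __init__(self):
        self._counts = Counter()

    def on_event(self, event_type, ts_ms):
        # Only problematic driving events are counted (assumption).
        if event_type == "Normal":
            return
        self._counts[(event_type, window_start(ts_ms))] += 1

    def pull(self, event_type, ts_ms):
        # "Pull query": look up the current state for one key and window.
        return self._counts.get((event_type, window_start(ts_ms)), 0)


agg = ProblematicDrivingAgg()
for event_type, ts_ms in [("Overspeed", 1_000), ("Overspeed", 4_999),
                          ("Normal", 2_000), ("Overspeed", 5_000)]:
    agg.on_event(event_type, ts_ms)

print(agg.pull("Overspeed", 0), agg.pull("Overspeed", 5_000))
```

    The state is updated continuously by the stream, while a pull query is a point-in-time lookup against that state rather than a scan of the topic.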
  • 18. BP-2: Batch Processing with Event Hub as Source
    • Using a Batch Processing framework to process Event Hub data retrospectively (full history available)
    • Write back results to Event Hub
    • Read and join multiple sources
    • Can be combined with Advanced Analytics capabilities (i.e. machine learning / AI)
  • 19. BP-2_1: Apache Spark with Kafka as Source
    • Apache Spark is a unified analytics engine for large-scale data processing
    • Provides complex analytics through MLlib and GraphX
    • Can consume from/produce to Kafka both in Streaming as well as Batch Mode
    • Use Data Frame / Dataset abstraction as you would with other data sources
    [Capability matrix, Yes/No per criterion (values shown only graphically on the slide): Supports Protobuf, Supports Avro, Supports JSON, Schema Registry Integration, Timestamp Filter Pushdown, Offset Filter Pushdown, Partition Filter Pushdown, Supports Produce Operation, Supports Exactly Once]
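  • 21. BP-2_1: Apache Spark with Kafka as Source

        from pyspark.sql.functions import from_json
        from pyspark.sql.types import (DoubleType, LongType, StringType,
                                       StructType, TimestampType)

        # Schema of the JSON messages in the truck_position topic
        truckPositionSchema = (StructType()
            .add("timestamp", TimestampType())
            .add("truckId", LongType())
            .add("driverId", LongType())
            .add("routeId", LongType())
            .add("eventType", StringType())
            .add("latitude", DoubleType())
            .add("longitude", DoubleType())
            .add("correlationId", StringType()))

        # Batch read of the topic; spark is an existing SparkSession
        rawDf = (spark.read.format("kafka")
            .option("kafka.bootstrap.servers", "kafka-1:19092,kafka-2:19093")
            .option("subscribe", "truck_position")
            .load())

        # Kafka delivers the value as binary, so cast it to a string first
        jsonDf = rawDf.selectExpr("CAST(value AS string)")

        # Parse the JSON and add eventTime, derived by dividing timestamp by 1000
        jsonDf = (jsonDf.select(from_json(jsonDf.value, truckPositionSchema).alias("json"))
            .selectExpr("json.*",
                        "cast(cast(json.timestamp as double) / 1000 as timestamp) as eventTime"))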
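    The read above scans the whole topic. Spark's Kafka source also accepts `startingOffsets` and `endingOffsets` options for bounded batch reads, given either as `earliest` / `latest` or as a JSON string per topic and partition, where `-2` stands for earliest and `-1` for latest. The sketch below only builds those JSON strings; the helper `offsets_option` is hypothetical and the example assumes a `truck_position` topic with two partitions. The Spark call itself is shown as a comment because it needs a running cluster.

```python
import json

EARLIEST, LATEST = -2, -1  # special offset values in Spark's JSON offset syntax


def offsets_option(topic, partition_offsets):
    """Build the JSON string expected by startingOffsets / endingOffsets."""
    per_partition = {str(p): o for p, o in sorted(partition_offsets.items())}
    return json.dumps({topic: per_partition})


starting = offsets_option("truck_position", {0: 100, 1: EARLIEST})
ending = offsets_option("truck_position", {0: 500, 1: LATEST})
print(starting)
print(ending)

# Usage with an existing SparkSession named spark (not executed here):
# rangeDf = (spark.read.format("kafka")
#     .option("kafka.bootstrap.servers", "kafka-1:19092,kafka-2:19093")
#     .option("subscribe", "truck_position")
#     .option("startingOffsets", starting)
#     .option("endingOffsets", ending)
#     .load())
```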
  • 22. BP-2_1: Apache Spark with Kafka as Source
  • 23. BP-3: Batch Query with Event Hub as Source
    • Using a Query Virtualization framework to consume (query) Event Hub data retrospectively (full history available)
    • Optionally produce (insert) data into Event Hub
    • Read and join multiple sources
    • Based on SQL and with the full power of SQL at hand (functions and optionally UDF/UDAF/UDTF)
    • Batch SQL, not Streaming SQL
  • 24. BP-3_1: Presto with Kafka as Source • Presto is a distributed SQL query engine for big data • Supports accessing data from multiple systems within a single query • Supports Kafka as a source (query) but not as a target • Does not yet support pushdown of timestamp queries • Starburst Enterprise Presto provides fine-grained access control • [Capability checklist: Supports JSON, Supports Avro, Supports Protobuf, Schema Registry Integration, Timestamp Filter Pushdown, Offset Filter Pushdown, Partition Filter Pushdown, Supports Produce Operation, Supports Exactly Once]
  • 26. BP-3_1: Presto with Kafka as Source
      kafka.properties:
        kafka.nodes=kafka-1:9092
        kafka.table-names=truck_position, truck_driver
        kafka.default-schema=logistics
        kafka.hide-internal-columns=false
        kafka.table-description-dir=etc/kafka
      Query:
        select * from truck_position;
  • 27. BP-3_1: Presto with Kafka as Source
      Table description files: etc/kafka/truck_position.json, etc/kafka/truck_driver.json
      etc/kafka/truck_position.json:
        {
          "tableName": "truck_position",
          "schemaName": "logistics",
          "topicName": "truck_position",
          "key": {
            "dataFormat": "raw",
            "fields": [
              { "name": "kafka_key", "dataFormat": "BYTE", "type": "VARCHAR", "hidden": "false" }
            ]
          },
          "message": {
            "dataFormat": "json",
            "fields": [
              { "name": "timestamp", "mapping": "timestamp", "type": "BIGINT" },
              { "name": "truck_id", "mapping": "truckId", "type": "BIGINT" },
              ...
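Each field entry in the table description maps a JSON attribute (`mapping`) to a SQL column (`name`, `type`). A small sketch of how such a file could be generated rather than written by hand. The helper names are hypothetical, the `key` section is left out for brevity, and the `driverId` type is an assumption based on the Spark schema shown earlier:

```python
import json
import re


def to_snake_case(name: str) -> str:
    """Convert a camelCase JSON attribute name to a snake_case column name."""
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()


def table_description(table, schema, topic, fields):
    """Build a table description dict from (json_field, sql_type) pairs."""
    return {
        "tableName": table,
        "schemaName": schema,
        "topicName": topic,
        "message": {
            "dataFormat": "json",
            "fields": [
                {"name": to_snake_case(json_field),
                 "mapping": json_field,
                 "type": sql_type}
                for json_field, sql_type in fields
            ],
        },
    }


desc = table_description(
    "truck_position", "logistics", "truck_position",
    [("timestamp", "BIGINT"), ("truckId", "BIGINT"), ("driverId", "BIGINT")],
)
print(json.dumps(desc, indent=2))
```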
  • 28. BP-3_1: Presto with Kafka as Source • select * from truck_position; • select * from truck_driver;
  • 29. BP-3_1: Presto with Kafka as Source • Join truck_position with truck_driver (removing non-compacted entries using a Presto window function)
      SELECT d.id, d.first_name, d.last_name, t.*
      FROM truck_position t
      LEFT JOIN (
          SELECT *
          FROM truck_driver
          WHERE (last_update) IN (
              SELECT LAST_VALUE(last_update) OVER (
                         PARTITION BY id
                         ORDER BY last_update
                         RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
                     ) AS last_update
              FROM truck_driver)
      ) d ON t.driver_id = d.id
      WHERE t.event_type != 'Normal';
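The intent of the query above is to keep only the most recent `truck_driver` record per `id` (the topic may still hold older, not yet compacted versions) and then left-join it to the non-normal truck positions. The same logic as a plain-Python sketch over in-memory records; the helper names and sample rows are invented for illustration:

```python
def latest_per_key(records, key, version):
    """Keep only the record with the highest `version` value for each key."""
    latest = {}
    for rec in records:
        current = latest.get(rec[key])
        if current is None or rec[version] >= current[version]:
            latest[rec[key]] = rec
    return latest


def left_join_positions(positions, drivers_by_id):
    """Left-join non-normal positions to the latest driver record."""
    joined = []
    for pos in positions:
        if pos["event_type"] == "Normal":
            continue
        driver = drivers_by_id.get(pos["driver_id"], {})
        joined.append({
            "first_name": driver.get("first_name"),
            "last_name": driver.get("last_name"),
            **pos,
        })
    return joined


truck_driver = [
    {"id": 11, "first_name": "Ann", "last_name": "Old", "last_update": 1},
    {"id": 11, "first_name": "Ann", "last_name": "New", "last_update": 2},
    {"id": 12, "first_name": "Bob", "last_name": "Lee", "last_update": 1},
]
truck_position = [
    {"truck_id": 1, "driver_id": 11, "event_type": "Overspeed"},
    {"truck_id": 2, "driver_id": 12, "event_type": "Normal"},
    {"truck_id": 3, "driver_id": 99, "event_type": "Lane Departure"},
]
result = left_join_positions(
    truck_position, latest_per_key(truck_driver, "id", "last_update")
)
print(result)
```

Positions without a matching driver are kept with empty name fields, as a left join would do.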
  • 30. BP-3_2: Apache Drill with Kafka as Source • Apache Drill is a schema-free SQL query engine for Hadoop, NoSQL and cloud storage • Supports accessing data from multiple systems within a single query • Can push down filters on partition, timestamp and offset • [Capability checklist: Supports JSON, Supports Avro, Supports Protobuf, Schema Registry Integration, Timestamp Filter Pushdown, Offset Filter Pushdown, Partition Filter Pushdown, Supports Produce Operation, Supports Exactly Once]
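Why filter pushdown matters when Kafka holds the full history: without it, the query engine reads every record of a topic and filters afterwards; with it, the scan can start at the first matching position. The following is a conceptual simulation only, not Drill's implementation. It models one partition as an in-memory list and assumes timestamps never decrease with the offset:

```python
from bisect import bisect_left

# One partition as (offset, timestamp_ms, value) tuples, ordered by offset.
partition = [(o, 1000 + 10 * o, f"event-{o}") for o in range(1000)]
# Stand-in for a broker-side time index (assumption: timestamps are sorted).
time_index = [ts for _, ts, _ in partition]


def scan_without_pushdown(records, min_ts):
    """Read every record and filter afterwards."""
    examined = 0
    hits = []
    for rec in records:
        examined += 1
        if rec[1] >= min_ts:
            hits.append(rec)
    return hits, examined


def scan_with_pushdown(records, index, min_ts):
    """Seek to the first record at or after min_ts, then read from there."""
    start = bisect_left(index, min_ts)
    hits = records[start:]
    return hits, len(hits)


hits_full, read_full = scan_without_pushdown(partition, 10900)
hits_push, read_push = scan_with_pushdown(partition, time_index, 10900)
print(read_full, read_push)
```

Both scans return the same records, but the pushdown variant reads only the matching tail of the partition.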
  • 31. BP-3_3: Hive/Spark SQL with Kafka as Source • Apache Hive facilitates reading, writing, and managing large datasets residing in distributed storage using SQL • Part of any Hadoop distribution • A special storage handler allows access to Kafka topics via Hive external tables • Spark SQL on a DataFrame as shown in BP-2_1, or by integrating the Hive Metastore • [Capability checklist: Supports JSON, Supports Avro, Supports Protobuf, Schema Registry Integration, Timestamp Filter Pushdown, Offset Filter Pushdown, Partition Filter Pushdown, Supports Produce Operation, Supports Exactly Once]
  • 32. BP-3_4: Oracle Access to Kafka with Kafka as Source • Oracle SQL Access to Kafka is a PL/SQL package that enables Oracle SQL to query Kafka topics via DB views and underlying external tables [6] • Runs in the Oracle database • Supports Kafka as a source (query) but not (yet) as a target • Use Oracle SQL to access the Kafka topics and optionally join to RDBMS tables • [Capability checklist: Supports JSON, Supports Avro, Supports Protobuf, Schema Registry Integration, Timestamp Filter Pushdown, Offset Filter Pushdown, Partition Filter Pushdown, Supports Produce Operation, Supports Exactly Once]
  • 33. BP-4: Use any storage as “materialized view” • Use any persistence technology to provide a “Materialized View” to the Data Consumers • Can be provided retrospectively, as a run-once or batch load, or kept in sync through streaming updates • “On Demand” use cases • Provide a sandbox environment for data scientists • Provide part of the Kafka topics materialized in object storage
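A minimal sketch of the materialized-view idea, using SQLite from the Python standard library as the persistence technology. The events are plain tuples standing in for records consumed from a Kafka topic; the table layout and the helper name `apply_events` are assumptions made for illustration:

```python
import sqlite3


def apply_events(conn, events):
    """Upsert (id, first_name, last_name, offset) events, in offset order."""
    conn.executemany(
        "INSERT OR REPLACE INTO truck_driver "
        "(id, first_name, last_name, last_offset) VALUES (?, ?, ?, ?)",
        events,
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE truck_driver ("
    "id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT, "
    "last_offset INTEGER)"
)

# Run-once / retrospective load of the existing history.
apply_events(conn, [(11, "Ann", "Old", 0), (12, "Bob", "Lee", 1)])
# Later incremental update, as a streaming sync would apply it.
apply_events(conn, [(11, "Ann", "New", 2)])

rows = conn.execute(
    "SELECT id, last_name FROM truck_driver ORDER BY id"
).fetchall()
print(rows)
```

Because the view is derived entirely from the topic, it can be dropped and rebuilt at any time by replaying the events from the beginning.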
  • 34. Architecture Blueprints Overview • Blueprint columns: Streaming BP1_1 | Batch Processing BP2_1 | Query BP3_1 | Query BP3_2 | Query BP3_3 | Query BP3_4 • Capability rows: Supports JSON; Supports Avro 🔴 🔴; Supports Protobuf 🔴 🔴 🔴 🔴 🔴; Schema Registry Integration 🔴 🔴 🔴 🔴; Timestamp Filter Pushdown ⚪ 🔴; Offset Filter Pushdown ⚪ 🔴; Partition Filter Pushdown ⚪ 🔴 🔴; Supports Produce Operation 🔴 🔴 🔴; Supports Exactly Once 🔴 🔴 🔴 🔴 • BP-1_1: Streaming Data Lake using Kafka Streams / ksqlDB • BP-2_1: Apache Spark with Kafka as Source • BP-3_1: Presto with Kafka as Source • BP-3_2: Apache Drill with Kafka as Source • BP-3_3: Hive/Spark SQL with Kafka as Source • BP-3_4: Oracle Access to Kafka with Kafka as Source
  • 36. Summary • Move processing / analytics from batch to stream processing pipelines • Event Hub (Kafka) as the single source of truth => turning the database inside out! • Everything else is just a “Materialized View” of the Event Hub topic data • Can still be HDFS or an Object Store (S3, …), but also Kudu or Parquet • NoSQL Databases & Relational Databases • In-Memory Databases • Confluent Platform Tiered Storage makes long-term storage feasible • Does not apply to large, unstructured data (images, videos, …) => a separate path around the Event Hub is necessary, with only the metadata sent through the Event Hub • This is the result of a Proof-of-Concept: only functional tests have been done so far, performance tests will follow
  • 37. References 1. Turning the database inside out with Apache Samza – by Martin Kleppmann 2. It’s Okay To Store Data In Apache Kafka – by Jay Kreps 3. Infinite Storage in Confluent Platform – by Lucas Bradstreet, Dhruvil Shah, Manveer Chawla 4. KIP-405: Kafka Tiered Storage – Kafka Improvement Proposal 5. Streaming Machine Learning with Tiered Storage and Without a Data Lake – by Kai Waehner 6. Read data from Kafka topic using Oracle SQL Access to Kafka (OSAK) – by Mohammad H. AbdelQader 7. Demo environment and code samples – by Guido Schmutz (on GitHub) 8. Questioning the Lambda Architecture – by Jay Kreps