
Kafka 102: Streams and Tables All the Way Down | Kafka Summit San Francisco 2019

Talk URL: https://kafka-summit.org/sessions/kafka-102-streams-tables-way/
Video recording: https://www.confluent.io/kafka-summit-san-francisco-2019/kafka-102-streams-and-tables-all-the-way-down

Abstract: Streams and tables are the foundation of event streaming with Kafka, and they power nearly every conceivable use case, from payment processing to change data capture, from streaming ETL to real-time alerting for connected cars, and even the lowly WordCount example. Tables are something that most of us are familiar with from the world of databases, whereas streams are a rather new concept. Trying to leverage Kafka without understanding tables and streams is like building a rocket ship without understanding the laws of physics: a mission bound to fail.

In this session for developers, operators, and architects alike, we take a deep dive into these two fundamental primitives of Kafka's data model. We discuss how streams and tables, including global tables, relate to each other and to topics, partitioning, compaction, and serialization (Kafka's storage layer), and how they interplay to process data, react to data changes, and manage state in an elastic, scalable, fault-tolerant manner (Kafka's compute layer).

Developers will better understand how to use streams and tables to build event-driven applications with Kafka Streams and KSQL, and we answer questions such as "How can I query my tables?" and "What is data co-partitioning, and how does it affect my join?". Operators will better understand how these applications run in production, with questions such as "How do I scale my application?" and "When my application crashes, how will it recover its state?". At a higher level, we explore how Kafka uses streams and tables to turn the database inside out and put it back together.

Kafka 102:
Streams and Tables All the Way Down
Kafka Summit San Francisco, Sep 2019
Michael G. Noll
Technologist, Office of the CTO, Confluent
@miguno
Streams and Tables: A First Look
Event Streaming: Streams and Tables
An Event Streaming Platform gives you three key functionalities: Publish & Subscribe to Events, Store Events, and Process & Analyze Events.
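To make the three functionalities concrete, here is a minimal in-memory sketch. It is a toy stand-in, not Kafka and not any Kafka client API: the class `TinyEventLog`, its methods, and the `payments` topic are all hypothetical names chosen for illustration.

```python
from collections import defaultdict
from typing import Callable


class TinyEventLog:
    """Toy, in-memory stand-in for an event streaming platform (not Kafka)."""

    def __init__(self) -> None:
        # Store: each topic is an append-only list of events.
        self._topics: dict[str, list[dict]] = defaultdict(list)
        # Subscribe: callbacks to notify for each newly published event.
        self._subscribers: dict[str, list[Callable[[dict], None]]] = defaultdict(list)

    def subscribe(self, topic: str, callback: Callable[[dict], None]) -> None:
        self._subscribers[topic].append(callback)

    def publish(self, topic: str, event: dict) -> None:
        self._topics[topic].append(event)       # the event is stored...
        for callback in self._subscribers[topic]:
            callback(event)                     # ...and pushed to subscribers

    def read(self, topic: str, offset: int = 0) -> list[dict]:
        # Stored events can be re-read later, starting from any position.
        return list(self._topics[topic][offset:])


log = TinyEventLog()
seen: list[dict] = []
log.subscribe("payments", seen.append)

log.publish("payments", {"customer": "alice", "amount": 10})
log.publish("payments", {"customer": "bob", "amount": 5})
log.publish("payments", {"customer": "alice", "amount": 7})

# Process & analyze: compute something over the stored events.
total = sum(event["amount"] for event in log.read("payments"))
```

The point of the sketch is only that the same stored events serve both the live subscriber and a later read; a real platform adds partitioning, replication, and durable storage on top of this idea.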
An Event records the fact that something happened, for example: a good was sold, an invoice was issued, a payment was made, a new customer registered.
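As a rough illustration of how such events relate to streams and tables, the sketch below models an event as an immutable record, a stream as the ordered sequence of those records, and a table as the latest value per key derived from the stream. The field names (`key`, `what`, `timestamp`) and the helper `latest_per_key` are assumptions made for this example and do not come from the slides or from any Kafka API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: an event is a fact and is never modified
class Event:
    key: str        # who or what the event is about (assumed field)
    what: str       # the fact that happened
    timestamp: int  # when it happened, in arbitrary ticks (assumed field)


# A stream: the full, ordered history of events.
stream = [
    Event("alice", "customer registered", 1),
    Event("alice", "payment made", 2),
    Event("bob", "customer registered", 3),
    Event("alice", "invoice issued", 4),
]


def latest_per_key(events: list[Event]) -> dict[str, str]:
    """Fold a stream into a table that keeps only the latest fact per key."""
    table: dict[str, str] = {}
    for event in events:
        table[event.key] = event.what  # later events overwrite earlier ones
    return table


table = latest_per_key(stream)
```

The stream keeps all four facts, while the table holds one row per key; replaying the stream from the start rebuilds the same table.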

 

Kafka 102: Streams and Tables All the Way Down | Kafka Summit San Francisco 2019

  • 1. Kafka 102: Streams and Tables All the Way Down. Kafka Summit San Francisco, Sep 2019. Michael G. Noll, Technologist, Office of the CTO, Confluent. @miguno
  • 5. An Event Streaming Platform gives you three key functionalities: Publish & Subscribe to Events; Store Events; Process & Analyze Events.
  • 6. An Event records the fact that something happened: a good was sold; an invoice was issued; a payment was made; a new customer registered.
  • 7. A Stream records history as a sequence of Events.
  • 8. Event Streaming Paradigm: Highly Scalable, Durable, Persistent, Maintains Order, Fast (Low Latency). Streams record history, even hundreds of years. Kafka = Source of Truth, stores every article since 1851. Normalized assets (images, articles, bylines, etc.). Denormalized into “Content View”. https://www.confluent.io/blog/publishing-apache-kafka-new-york-times/
  • 9. Streams record history: “The ledger of sales.” Tables represent state: “The sales totals.”
  • 10. Streams record history: “The sequence of moves.” (1. e4 e5 2. Nf3 Nc6 3. Bc4 Bc5 4. d3 Nf6 5. Nbd2) Tables represent state: “The state of the board.”
  • 11. Streams = INSERT only: immutable, append-only. Tables = INSERT, UPDATE, DELETE: mutable, row key (event.key) identifies which row.
  • 12. The key to mutability is … the event.key!
      Condition | Stream | Table
      Has unique key constraint? | No | Yes
      First event with key ‘alice’ arrives | INSERT | INSERT
      Another event with key ‘alice’ arrives | INSERT | UPDATE
      Event with key ‘alice’ and value == null arrives | INSERT | DELETE
      Event with key == null arrives | INSERT | <ignored>
    RDBMS analogy: A Stream is ~ a Table that has no unique key and is append-only.
  • 13. Creating a table from a stream or topic.
  • 14. Stream (facts) and Table (dims). [Diagram: the events alice Berlin, alice Rome, bob Lima, alice Paris, bob Sydney shown as a stream and, rotated by 90°, as a table]
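The INSERT / UPDATE / DELETE semantics above can be modelled in a few lines. The following is a minimal sketch in plain Python, not Kafka Streams or KSQL code. The helper names `append_to_stream` and `apply_to_table` are hypothetical, and the event sequence is an assumption (the last two events are added only to exercise the null-key and null-value cases).

```python
from typing import Dict, List, Optional, Tuple

Event = Tuple[Optional[str], Optional[str]]  # (event.key, event.value)


def append_to_stream(stream: List[Event], event: Event) -> None:
    """A stream is append-only: every event becomes a new entry (INSERT)."""
    stream.append(event)


def apply_to_table(table: Dict[str, str], event: Event) -> None:
    """A table is keyed: the event.key decides which row is changed."""
    key, value = event
    if key is None:
        return                  # null key: ignored by the table
    if value is None:
        table.pop(key, None)    # null value (tombstone): DELETE
    else:
        table[key] = value      # first event for a key: INSERT, later ones: UPDATE


# Hypothetical event sequence; the last two events are added for illustration.
events: List[Event] = [
    ("alice", "Berlin"), ("bob", "Lima"), ("alice", "Rome"),
    ("alice", "Paris"), ("bob", "Sydney"), (None, "Oslo"), ("bob", None),
]

stream: List[Event] = []
table: Dict[str, str] = {}
for event in events:
    append_to_stream(stream, event)
    apply_to_table(table, event)

print(len(stream))  # 7
print(table)        # {'alice': 'Paris'}
```

The stream keeps all seven events, while the table ends up with a single row: the latest value for alice, with bob deleted by the tombstone.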
  • 15. Duality: Streams record history, Tables represent state. Stream to Table: aggregation (like SUM, COUNT). Table to Stream: table changes. *See Streams and Tables: Two Sides of the Same Coin, M. Sax et al., BIRTE ’18
  • 16. Aggregating a stream (COUNT example).
  • 17. Aggregating a stream (COUNT example).
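A COUNT aggregation can be sketched as follows to make the stream-table duality concrete. This is plain Python rather than the Kafka Streams or KSQL API. The function name `count_by_key` and the sample events are assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, List, Tuple


def count_by_key(stream: List[Tuple[str, str]]) -> Tuple[Dict[str, int], List[Tuple[str, int]]]:
    """Aggregate a stream into a table of counts.

    Returns the final table and the table's changelog, i.e. the stream of
    updates that the table went through while it was being built.
    """
    counts: Counter = Counter()
    changelog: List[Tuple[str, int]] = []
    for key, _value in stream:
        counts[key] += 1
        changelog.append((key, counts[key]))  # every table change is itself an event
    return dict(counts), changelog


# Hypothetical input events.
stream = [("alice", "Paris"), ("bob", "Sydney"), ("alice", "Rome")]
table, changelog = count_by_key(stream)

print(table)      # {'alice': 2, 'bob': 1}
print(changelog)  # [('alice', 1), ('bob', 1), ('alice', 2)]
```

Replaying the changelog with "latest value per key wins" reproduces the table, which is the sense in which a table's changes form a stream.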
  • 18. Kafka Topics: The storage foundation of Streams and Tables
  • 19. Data storage of a Kafka topic is partitioned, which impacts data processing, as we see later. [Diagram: a Topic in the storage layer with partitions P1, P2, P3, P4]
  • 20. Producers determine the target partition of an event through a partitioning function ƒ(event.key). [Diagram: Producer clients 1 and 2 write to a Topic with partitions P1 to P4; an event is sent and appended to partition 1]
  • 21. Events with the same key should be in the same partition, to ensure proper ordering of related events and to ensure that processing by Consumers returns expected results. [Diagram: Producer clients 1 and 2, partitions P1 to P4; yellow events should always be stored in partition 3]
  • 22. Top causes for same key in different partitions: 1. You increased/decreased the number of partitions. 2. A producer uses a custom partitioner. → Be careful in this situation!
  • 23. Partitions play a central role in Kafka. Topics are partitioned. Partitions enable scalability, elasticity, fault-tolerance. Data is stored in, replicated based on, log-compacted based on, read from and written to, processed based on, and joined based on partitions. [Diagram layers: Processing Layer (KStreams, KSQL, etc.) and Storage Layer (Brokers)]
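The effect of a partitioning function ƒ(event.key), and of changing the partition count, can be sketched as below. This is an illustration only: `partition_for` is a hypothetical helper that uses CRC32 as a stand-in hash, whereas Kafka's default partitioner hashes the serialized key with murmur2. The key names are likewise made up.

```python
import zlib
from typing import Dict, List


def partition_for(key: str, num_partitions: int) -> int:
    """Map an event key to a partition: hash(key) modulo the partition count.

    CRC32 is a stand-in; it is NOT the hash Kafka's default partitioner uses.
    """
    return zlib.crc32(key.encode("utf-8")) % num_partitions


# Hypothetical keys.
keys: List[str] = ["alice", "bob", "carol", "dave", "erin", "frank", "grace", "heidi"]

before: Dict[str, int] = {k: partition_for(k, 4) for k in keys}  # topic with 4 partitions
after: Dict[str, int] = {k: partition_for(k, 6) for k in keys}   # same topic grown to 6

# Keys whose target partition changed after the partition count was increased.
moved = [k for k in keys if before[k] != after[k]]
print(before)
print(after)
print(moved)
```

The same key always maps to the same partition as long as both the function and the partition count stay fixed. Once either changes, new events for a key can land in a different partition than its older events, which is the situation the slide warns about.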
  • 24. From Topics to Streams and Tables
  • 25. Topics live in the Kafka storage layer, the ‘filesystem’ (Brokers). Streams and Tables live in the Kafka processing layer (KStreams, KSQL).
  • 26. Topics vs. Streams and Tables. Storage Layer (Brokers): Topic (00100 11101 11000 00011 00100 00110). Processing Layer (KSQL, KStreams): Topic plus schema (serdes) = Stream (alice Paris, bob Sydney, alice Rome); Stream plus aggregation = Table (alice Rome, bob Sydney).
  • 27. Kafka Processing: data is processed per-partition. [Diagram: Topic partitions P1 to P4 (storage) are read via network by Stream Tasks 1 to 4 (processing, state), which run on Application Instance 1 and Application Instance 2]
  • 28. Streams and Tables are partitioned, too. And so is their processing! [Diagram: partitions P1 to P4 feed Stream Tasks 1 to 4; the KTable / TABLE partitions hold 2 GB, 3 GB, 5 GB, and 2 GB]
  • 29. Global Tables give complete data to every task. Great for joins without re-keying the input or to broadcast info. [Diagram: partitions P1 to P4 feed Stream Tasks 1 to 4; each task’s GlobalKTable holds 2 + 3 + 5 + 2 = 12 GB]
  • 30. Tables are cached in ‘state stores’ on local disk. Tables and other state don’t need to fit into RAM. Enables large-scale state. Saves $$$ on cloud instances. But: The ‘source of truth’ of a table is its underlying topic! [Diagram: Stream Task 1 with a Table and a Global Table, cached on local disk under /var (default storage engine: RocksDB)]
  • 31. Tables and other state are always fault-tolerant because they are backed by Kafka topics (cf. event sourcing). [Diagram: Stream Task 1 on machine A (KSQL, Kafka Streams) makes a streaming backup via network to the table’s changelog topic (log-compacted) in Kafka Storage; Stream Task 1 on machine B makes a streaming restore via network]
  • 32. Standby Replicas speed up application recovery. App instances can optionally maintain copies of another instance’s local state stores to minimize failover times. [Diagram: App Instance 1 (Stream Task 1) and App Instance 2 (Stream Task 2), compared for num.standby.replicas = 0 (default) and num.standby.replicas = 1; Stream Task 2 fails over]
  • 33. Elastic scaling: Stream tasks are migrated, including their state (via Kafka). [Diagram: App Instances 1 and 2 with Stream Tasks 1 to 4; App Instances 3 and 4 are added and take over Stream Tasks 2 and 4]
  • 34. Tables <-> topic log-compaction. A table’s underlying topic is compacted by default, to save Kafka storage space and to speed up failover and recovery for processing. Note: Compaction intentionally removes part of a table’s history. If you need the full history and don’t have the historic data elsewhere, consider disabling compaction. [Diagram: a log with events for alice (Paris, Bern, Rome) and bob (Sydney, Zurich) is compacted to alice Rome and bob Zurich]
  • 35. TL;DR for Log Compaction. Have a Stream? → Disable log compaction for its topic (= default). Have a Table? → Enable log compaction for its topic (= default). Disable only when needed, see previous slide.
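Log compaction can be approximated as "keep only the latest event per key". The sketch below is a simplified model in plain Python, not broker code. The function `compact` is hypothetical, the event order is assumed, and real compaction runs asynchronously in the background and handles tombstones separately.

```python
from typing import Dict, List, Tuple

Log = List[Tuple[str, str]]  # ordered (event.key, event.value) pairs


def compact(log: Log) -> Log:
    """Keep only the most recent event for each key, preserving log order."""
    latest: Dict[str, Tuple[int, str]] = {}
    for offset, (key, value) in enumerate(log):
        latest[key] = (offset, value)  # a later offset overwrites an earlier one
    survivors = sorted(latest.items(), key=lambda item: item[1][0])
    return [(key, value) for key, (_offset, value) in survivors]


def build_table(log: Log) -> Dict[str, str]:
    """Materialize a table from a log: the latest value per key wins."""
    table: Dict[str, str] = {}
    for key, value in log:
        table[key] = value
    return table


# Hypothetical changelog; the order of events is an assumption.
log: Log = [
    ("alice", "Paris"), ("alice", "Bern"), ("bob", "Sydney"),
    ("alice", "Rome"), ("bob", "Zurich"),
]

compacted = compact(log)
print(compacted)                                   # [('alice', 'Rome'), ('bob', 'Zurich')]
print(build_table(compacted) == build_table(log))  # True
```

The table rebuilt from the compacted log equals the table rebuilt from the full log, which is why compaction is safe for tables but discards history that a stream would need.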
  • 36. Topics vs. Streams and Tables
      Layer | Concept | Schema | Partitioned | Unbounded | Ordering | Mutable | Unique key constraint | Fault-tolerant
      Storage Layer | Topic | No (raw bytes) | Yes | Yes | Yes | No | No | Yes
      Processing Layer | Stream | Yes | Yes | Yes | Yes | No | No | Yes
      Processing Layer | Table | Yes | Yes | No* | No | Yes | Yes | Yes
      Processing Layer | Global Table | Yes | No | No* | No | Yes | Yes | Yes
    *Generally speaking the answer is Yes but, in practice, tables are almost always bounded due to finite key space.
  • 38. Max processing parallelism = #input partitions. [Diagram: a Topic with partitions P1 to P4 is read by Application Instances 1 to 4; Application Instances 5 and 6 are idle] → Need higher parallelism? Increase the original topic’s partition count. → Higher parallelism for just one use case? Derive a new topic from the original with higher partition count. Lower its retention to save storage.
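The parallelism limit follows from there being one stream task per input partition. Below is a minimal sketch, assuming a simple round-robin assignment. The real Kafka Streams task assignor is more sophisticated, and `assign_tasks` and the instance names are hypothetical.

```python
from typing import Dict, List


def assign_tasks(num_partitions: int, instances: List[str]) -> Dict[str, List[int]]:
    """Create one task per input partition and spread the tasks round-robin."""
    assignment: Dict[str, List[int]] = {name: [] for name in instances}
    for task in range(1, num_partitions + 1):
        owner = instances[(task - 1) % len(instances)]
        assignment[owner].append(task)
    return assignment


# Four input partitions, six application instances (as on the slide).
instances = [f"instance-{i}" for i in range(1, 7)]
assignment = assign_tasks(4, instances)

idle = [name for name, tasks in assignment.items() if not tasks]
print(assignment["instance-1"])  # [1]
print(idle)                      # ['instance-5', 'instance-6']
```

With four partitions there are only four tasks to hand out, so any instance beyond the fourth has nothing to do until the partition count grows.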
  • 39. How to increase the number of partitions when needed. KSQL example: the statement below creates a new stream with the desired number of partitions. CREATE STREAM products_repartitioned WITH (PARTITIONS=30) AS SELECT * FROM products;
  • 40. ‘Hot’ partitions can be problematic, often caused by: 1. Events not evenly distributed across partitions. 2. Events evenly distributed but certain events take longer to process. Strategies to address hot partitions include: 1a. Ingress: Find a better partitioning function ƒ(event.key) for producers. 1b. Storage: Re-partition data into a new topic if you can’t change the original. 2. Scale processing vertically, e.g. more powerful CPU instances.
  • 41. Joining Streams and Tables: data must be ‘co-partitioned’. [Diagram: a Stream joined with a Table produces the Join Output (Stream)]
  • 42. Joining Streams and Tables: data must be ‘co-partitioned’. (alice, Paris) from stream’s P2 has a matching entry for alice in the table’s P2, so the joined value is female. [Diagram: Table partitions P1, P2, P3 hold entries for bob (male), alice (female), alex (male), zoie (female), andrew (male), mina (female), natalie (female), blake (male)]
  • 43. Joining Streams and Tables: data is looked up in the same partition number. Scenario 2: Here, key ‘alice’ exists in multiple partitions (alice male, alice female). But the entry in P2 (female) is used because the stream-side event (alice, Paris) is from the stream’s partition P2.
  • 44. Joining Streams and Tables: data is looked up in the same partition number. Scenario 3: Here, key ‘alice’ exists only in the table’s P1 != P2 (alice male). The stream-side event (alice, Paris) from P2 finds no match, so the table side is null.
  • 45. Data co-partitioning requirements in detail: 1. Same keying scheme for both input sides. 2. Same number of partitions. 3. Same partitioning function ƒ(event.key). Further Reading on Joining Streams and Tables: https://www.confluent.io/kafka-summit-sf18/zen-and-the-art-of-streaming-joins https://docs.confluent.io/current/ksql/docs/developer-guide/partition-data.html
  • 46. Why is that so? Because of how input data is mapped to stream tasks. [Diagram: Stream Task 2 (processing, state) reads via network from the Stream Topic’s P2 and from the Table Topic’s P2; both topics have partitions P1, P2, P3 in storage]
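The three join scenarios can be reproduced with a small model in which a stream task only sees the table partition that has the same number as its stream partition. This is a sketch in plain Python, not the Kafka Streams join API. The function `join_in_task` is hypothetical, and the placement of keys other than alice is assumed.

```python
from typing import Dict, List, Optional, Tuple

PartitionedTable = Dict[int, Dict[str, str]]  # partition number -> {key: value}


def join_in_task(
    partition: int,
    stream_events: List[Tuple[str, str]],
    table: PartitionedTable,
) -> List[Tuple[str, str, Optional[str]]]:
    """Join one stream partition against the SAME-numbered table partition only."""
    local_table = table.get(partition, {})
    return [(key, value, local_table.get(key)) for key, value in stream_events]


stream_p2 = [("alice", "Paris")]  # the stream-side event arrives in partition 2

# Scenario 1: co-partitioned, alice lives in the table's P2.
co_partitioned: PartitionedTable = {1: {"bob": "male"}, 2: {"alice": "female"}, 3: {"alex": "male"}}
# Scenario 2: alice exists in several table partitions; only P2 is consulted.
duplicated: PartitionedTable = {1: {"alice": "male"}, 2: {"alice": "female"}, 3: {}}
# Scenario 3: alice exists only in the table's P1, so the lookup in P2 misses.
misplaced: PartitionedTable = {1: {"alice": "male"}, 2: {}, 3: {}}

print(join_in_task(2, stream_p2, co_partitioned))  # [('alice', 'Paris', 'female')]
print(join_in_task(2, stream_p2, duplicated))      # [('alice', 'Paris', 'female')]
print(join_in_task(2, stream_p2, misplaced))       # [('alice', 'Paris', None)]
```

The third result shows why all three co-partitioning requirements matter: the data exists, but the task that processes the stream's P2 never reads the table's P1.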
  • 47. How to re-partition your data when needed. KSQL example: the statement below creates a new stream with a changed number of partitions and a new field as event.key (so that its data is now correctly co-partitioned for joining). CREATE STREAM products_repartitioned WITH (PARTITIONS=42) AS SELECT * FROM products PARTITION BY product_id;
  • 48. Joining Streams and Global Tables: no need to worry about co-partitioning! That’s because each stream task has the full data from all the table’s partitions: 2 + 3 + 5 + 2 = 12 GB (Stream Task 1). [Diagram: a Stream joined with a Global Table produces the Join Output (Stream)]
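A global table can be modelled as the union of all table partitions, copied to every task. The sketch below is plain Python, not the GlobalKTable API. The function `build_global_table` and the sample data are hypothetical, and the total at the end simply multiplies the 12 GB per task by an assumed four tasks.

```python
from typing import Dict, List, Optional, Tuple

PartitionedTable = Dict[int, Dict[str, str]]  # partition number -> {key: value}


def build_global_table(table: PartitionedTable) -> Dict[str, str]:
    """Merge every partition into one lookup table, as each task would hold it."""
    merged: Dict[str, str] = {}
    for partition in sorted(table):
        merged.update(table[partition])
    return merged


# alice is stored in the table's P1, but the stream event arrives in P2.
table: PartitionedTable = {1: {"alice": "male"}, 2: {"bob": "male"}, 3: {}}
stream_p2: List[Tuple[str, str]] = [("alice", "Paris")]

global_table = build_global_table(table)
joined: List[Tuple[str, str, Optional[str]]] = [
    (key, value, global_table.get(key)) for key, value in stream_p2
]
print(joined)  # [('alice', 'Paris', 'male')]

# The price is storage: every task holds the sum of all partition sizes.
partition_sizes_gb = [2, 3, 5, 2]
per_task_gb = sum(partition_sizes_gb)
total_gb = per_task_gb * len(partition_sizes_gb)  # assumes one task per partition
print(per_task_gb, total_gb)  # 12 48
```

The lookup succeeds regardless of which partition the stream event came from, at the cost of replicating the whole table to every task.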
  • 49. Capacity Planning and Sizing. Sorry, not covered here! I recommend: KSQL Performance Tuning for Fun and Profit, by Nick Dearden, October 1, 2019 @ 2:50-3:30 pm, Stream Processing track. https://kafka-summit.org/sessions/ksql-performance-tuning-fun-profit/
  • 50. Streams and Tables in KSQL. Process event streams to create new, continuously updated streams or tables. Push Query: CREATE TABLE OrderTotals AS SELECT * FROM ... EMIT CHANGES [Diagram: Stream 01, Stream 02, and Stream 03 feed a Query that produces a Table]
  • 51. Streams and Tables in KSQL (new feature). Query tables similar to a relational database. Pull Query: SELECT * FROM OrderTotals WHERE region = 'Europe' [Diagram: the Query reads the Table and returns a Result]
  • 52. Streams and Tables in KSQL. Query tables from other apps with push or pull queries. Other Applications (Java, Go, Python, etc.) can directly query tables and receive the Result via network (KSQL REST API): SELECT * FROM OrderTotals WHERE region = 'Europe'
  • 53. Streaming import/export of Tables (new feature). KSQL integrates with Kafka Connect. CREATE SOURCE CONNECTOR my-postgres-jdbc WITH ( connector.class = "io.confluent.connect.jdbc.JdbcSourceConnector", connection.url = "jdbc:postgresql://dbserver:5432/my-db", ...); [Diagram labels: controls, controls]
  • 54. KSQL example use case: creating an event-driven dashboard from a customer database. customers table (Table) → Kafka Connect is streaming change events (Stream) → Aggregations are computed in real-time (Table) → Results are continuously updating (Stream) → Elasticsearch (Table (Index)).