The document introduces Apache Kafka's Streams API for stream processing. Some key points covered include:
- The Streams API allows building stream processing applications without needing a separate cluster, providing an elastic, scalable, and fault-tolerant processing engine.
- It integrates with existing Kafka deployments and supports both stateful and stateless computations on data in Kafka topics.
- Applications built with the Streams API are standard Java applications that run on client machines and leverage Kafka for distributed, parallel processing, with fault tolerance provided via state stores backed by Kafka.
Introducing Kafka's Streams API
Introducing Kafka's Streams API
Stream processing made simple
Target audience: technical staff, developers, architects
Expected duration for full deck: 45 minutes
Apache Kafka: birthed as a messaging system, now a streaming platform
• 0.7 (2012): cluster mirroring, data compression
• 0.8 (2013): intra-cluster replication
• 0.9 (2015): data integration (Connect API)
• 0.10 (2016): data processing (Streams API)
Kafka's Streams API: the easiest way to process data in Apache Kafka

Key benefits of Apache Kafka's Streams API
• Build Apps, Not Clusters: no additional cluster required
• Cluster to go: elastic, scalable, distributed, fault-tolerant, secure
• Database to go: tables, local state, interactive queries
• Equally viable for S / M / L / XL / XXL use cases
• "Runs everywhere": integrates with your existing deployment strategies such as containers, automation, cloud

Part of open source Apache Kafka, introduced in 0.10+
• Powerful client library to build stream processing apps
• Apps are standard Java applications that run on client machines
• https://github.com/apache/kafka/tree/trunk/streams

[Diagram: your app embeds the Streams API and talks directly to a Kafka cluster]
Streams API in the context of Kafka
[Diagram: other systems move data in and out of the Kafka cluster via the Connect API; your app embeds the Streams API and talks to the cluster directly]
When to use Kafka's Streams API
• Mainstream application development
• To build core business applications
• Microservices
• Fast Data apps for small and big data
• Reactive applications
• Continuous queries and transformations
• Event-triggered processes
• The "T" in ETL
• <and more>

Use case examples
• Real-time monitoring and intelligence
• Customer 360-degree view
• Fraud detection
• Location-based marketing
• Fleet management
• <and more>
Some public use cases in the wild & external articles
• Applying Kafka's Streams API for an internal message delivery pipeline at LINE Corp.
• http://developers.linecorp.com/blog/?p=3960
• Kafka Streams in production at LINE, a social platform based in Japan with 220+ million users
• Microservices and reactive applications at Capital One
• https://speakerdeck.com/bobbycalderwood/commander-decoupled-immutable-rest-apis-with-kafka-streams
• User behavior analysis
• https://timothyrenner.github.io/engineering/2016/08/11/kafka-streams-not-looking-at-facebook.html
• Containerized Kafka Streams applications in Scala
• https://www.madewithtea.com/processing-tweets-with-kafka-streams.html
• Geo-spatial data analysis
• http://www.infolace.com/blog/2016/07/14/simple-spatial-windowing-with-kafka-streams/
• Language classification with machine learning
• https://dzone.com/articles/machine-learning-with-kafka-streams
Architecture comparison: use case example

Before: undue complexity, heavy footprint, many technologies, split ownership with conflicting priorities
1. Capture business events in Kafka
2. Must process the events with a separate cluster (e.g. Spark) that runs your "job"
3. Must share the latest results through separate systems (e.g. MySQL)
4. Other apps (dashboard, frontend app, ...) access the latest results by querying these DBs

With Kafka Streams: simplified, app-centric architecture that puts app owners in control
1. Capture business events in Kafka
2. Process the events with standard Java apps that use Kafka Streams
3. Now other apps can directly query the latest results
How do I install the Streams API?
• There is, and there should be, no "installation". Build Apps, Not Clusters!
• It's a library. Add it to your app like any other library.

<dependency>
  <groupId>org.apache.kafka</groupId>
  <artifactId>kafka-streams</artifactId>
  <version>0.10.1.1</version>
</dependency>
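For Gradle-based builds, the equivalent dependency declaration would look like this (assuming the same artifact coordinates as the Maven snippet above):

```groovy
dependencies {
    compile 'org.apache.kafka:kafka-streams:0.10.1.1'
}
```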
"But wait a minute. Where's THE CLUSTER to process the data?"
• No cluster needed. Build Apps, Not Clusters!
• Unlearn bad habits: "do cool stuff with data → must have cluster"
Organizational benefits: decouple teams and roadmaps, scale people
[Diagram: the infrastructure team runs Kafka as a shared, multi-tenant service; application teams build on it independently, e.g. the payments team's fraud detection app, the mobile team's recommendations app, the operations team's security alerts app, ...more apps...]
How do I package, deploy, monitor my apps? How do I ...?
• Whatever works for you. Stick to what you/your company think is the best way.
• No magic needed.
• Why? Because an app that uses the Streams API is... a normal Java app.
The API is but the tip of the iceberg
[Diagram: an iceberg with "API, coding" above the waterline; below it, Reality™: architecture, deployment, operations, security, debugging, organizational processes, ...]
• API option 1: DSL (declarative)

KStream<Integer, Integer> input =
    builder.stream("numbers-topic");

// Stateless computation
KStream<Integer, Integer> doubled =
    input.mapValues(v -> v * 2);

// Stateful computation
KTable<Integer, Integer> sumOfOdds = input
    .filter((k, v) -> v % 2 != 0)
    .selectKey((k, v) -> 1)
    .groupByKey()
    .reduce((v1, v2) -> v1 + v2, "sum-of-odds");

The preferred API for most use cases. Particularly appeals to:
• Fans of Scala, functional programming
• Users familiar with e.g. Spark
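To make the DSL's semantics concrete, here is a plain-Java sketch of what the two computations above produce over a finite batch of values. `DslSemantics` and its method names are made up for illustration, and no Kafka is involved; the real pipeline operates on unbounded streams rather than lists.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Illustration only: what the DSL snippet computes, applied to a finite
// list instead of an unbounded Kafka topic.
public class DslSemantics {

    // Mirrors input.mapValues(v -> v * 2): stateless, value by value.
    public static List<Integer> doubled(List<Integer> input) {
        return input.stream().map(v -> v * 2).collect(Collectors.toList());
    }

    // Mirrors filter(odd) -> selectKey(constant) -> groupByKey -> reduce(sum):
    // all odd values end up under one key, whose value is their running sum.
    public static int sumOfOdds(List<Integer> input) {
        return input.stream().filter(v -> v % 2 != 0).mapToInt(Integer::intValue).sum();
    }

    public static void main(String[] args) {
        List<Integer> numbers = Arrays.asList(1, 2, 3, 4, 5);
        System.out.println(doubled(numbers));   // [2, 4, 6, 8, 10]
        System.out.println(sumOfOdds(numbers)); // 9
    }
}
```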
• API option 2: Processor API (imperative)

class PrintToConsoleProcessor implements Processor<K, V> {

    @Override
    public void init(ProcessorContext context) {}

    @Override
    public void process(K key, V value) {
        System.out.println("Got value " + value);
    }

    @Override
    public void punctuate(long timestamp) {}

    @Override
    public void close() {}
}

Full flexibility but more manual work. Appeals to:
• Users who require functionality that is not yet available in the DSL
• Users familiar with e.g. Storm, Samza
• Still, check out the DSL!
When to use Kafka Streams vs. Kafka's "normal" consumer clients

Kafka Streams
• Basically all the time

Kafka consumer clients (Java, C/C++, Python, Go, ...)
• When you must interact with Kafka at a very low level and/or in a very special way
• Example: when integrating your own stream processing tool (Spark, Storm) with Kafka
"My WordCount is better than your WordCount" (?)
[Side-by-side WordCount snippets in Kafka Streams and Spark]
These isolated code snippets are nice (and actually quite similar), but they are not very meaningful. In practice, we also need to read data from somewhere, write data back to somewhere, etc., and we can see none of this here.
Key observation: close relationship between Streams and Tables
http://www.confluent.io/blog/introducing-kafka-streams-stream-processing-made-simple
http://docs.confluent.io/current/streams/concepts.html#duality-of-streams-and-tables
Streams meet Tables
• Most use cases for stream processing require both Streams and Tables
• Essential for any stateful computations
• Kafka ships with first-class support for Streams and Tables
• Scalability, fault tolerance, efficient joins and aggregations, ...
• Benefits include: simplified architectures, fewer moving pieces, less do-it-yourself work
Key features in 0.10
• Native, 100%-compatible Kafka integration
• Secure stream processing using Kafka's security features
Secure stream processing with the Streams API
• Your applications can leverage all client-side security features in Apache Kafka
• Security features include:
• Encrypting data-in-transit between applications and Kafka clusters
• Authenticating applications against Kafka clusters ("only some apps may talk to the production cluster")
• Authorizing applications against Kafka clusters ("only some apps may read data from sensitive topics")
Key features in 0.10
• Native, 100%-compatible Kafka integration
• Secure stream processing using Kafka's security features
• Elastic and highly scalable
• Fault-tolerant
• Stateful and stateless computations
Stateful computations
• Stateful computations like aggregations (e.g. counting), joins, or windowing require state
• State stores are the backbone of state management
• ... are local for best performance
• ... are backed up to Kafka for elasticity and for fault tolerance
• ... are per stream task for isolation (think: share-nothing)
• Pluggable storage engines
• Default: RocksDB (a key-value store) to allow for local state that is larger than available RAM
• You can also use your own, custom storage engine
• From the user perspective:
• DSL: no need to worry about anything; state management is done automatically for you
• Processor API: direct access to state stores; very flexible but more manual work
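The backup mechanism described above can be sketched in a few lines of plain Java. `ChangelogBackedStore` is a made-up toy class, not the real state-store API; it only illustrates the idea that every local write is also appended to a changelog (in Kafka Streams, a compacted Kafka topic), which a restarted or migrated task can replay to rebuild its local state.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy sketch of a changelog-backed state store (names are illustrative).
public class ChangelogBackedStore {
    // Fast local state (RocksDB in the real implementation).
    private final Map<String, Long> local = new HashMap<>();
    // Backup stream of every write (a Kafka changelog topic in reality).
    private final List<SimpleEntry<String, Long>> changelog = new ArrayList<>();

    public void put(String key, long value) {
        local.put(key, value);
        changelog.add(new SimpleEntry<>(key, value));
    }

    public Long get(String key) {
        return local.get(key);
    }

    public List<SimpleEntry<String, Long>> changelog() {
        return changelog;
    }

    // Fault tolerance: rebuild an empty store by replaying the changelog;
    // later entries for the same key overwrite earlier ones.
    public static ChangelogBackedStore restoreFrom(List<SimpleEntry<String, Long>> changelog) {
        ChangelogBackedStore store = new ChangelogBackedStore();
        for (SimpleEntry<String, Long> entry : changelog) {
            store.local.put(entry.getKey(), entry.getValue());
        }
        return store;
    }
}
```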
Interactive Queries: architecture comparison

Before (0.10.0):
1. Capture business events in Kafka
2. Process the events with Kafka Streams
3. (!) Must use external systems to share the latest results
4. Other apps query those external systems for the latest results

After (0.10.1): simplified, more app-centric architecture
1. Capture business events in Kafka
2. Process the events with Kafka Streams
3. Now other apps can directly query the latest results
Key features in 0.10
• Native, 100%-compatible Kafka integration
• Secure stream processing using Kafka's security features
• Elastic and highly scalable
• Fault-tolerant
• Stateful and stateless computations
• Interactive queries
• Time model
Time
• You configure the desired time semantics through timestamp extractors
• The default extractor yields event-time semantics
• It extracts the embedded timestamps of Kafka messages (introduced in v0.10)
Key features in 0.10
• Native, 100%-compatible Kafka integration
• Secure stream processing using Kafka's security features
• Elastic and highly scalable
• Fault-tolerant
• Stateful and stateless computations
• Interactive queries
• Time model
• Windowing
Windowing
• Group events in a stream using time-based windows
• Use case examples:
• Time-based analysis of ad impressions ("number of ads clicked in the past hour")
• Monitoring statistics of telemetry data ("1min/5min/15min averages")

[Diagram: input events from different users (alice, bob, dave) arrive in processing-time order; rectangles denote the event-time windows they are assigned to]
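A minimal sketch of event-time tumbling windows, using made-up names rather than the real windowing API: each event is assigned to a window derived from its embedded event timestamp (not its arrival time), and a per-window count is maintained.

```java
import java.util.HashMap;
import java.util.Map;

// Toy sketch of event-time tumbling windows (names are illustrative).
public class TumblingWindowCount {
    private final long windowSizeMs;
    private final Map<Long, Long> countsByWindowStart = new HashMap<>();

    public TumblingWindowCount(long windowSizeMs) {
        this.windowSizeMs = windowSizeMs;
    }

    // Event-time semantics: the window is derived from the event's own
    // timestamp, so arrival order does not matter.
    public void accept(long eventTimestampMs) {
        long windowStart = (eventTimestampMs / windowSizeMs) * windowSizeMs;
        countsByWindowStart.merge(windowStart, 1L, Long::sum);
    }

    public long countFor(long windowStart) {
        return countsByWindowStart.getOrDefault(windowStart, 0L);
    }
}
```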
Key features in 0.10
• Native, 100%-compatible Kafka integration
• Secure stream processing using Kafka's security features
• Elastic and highly scalable
• Fault-tolerant
• Stateful and stateless computations
• Interactive queries
• Time model
• Windowing
• Supports late-arriving and out-of-order data
Out-of-order and late-arriving data: an example of when this will happen
• Users with mobile phones enter an airplane and lose Internet connectivity
• Emails are being written during the 10h flight
• Internet connectivity is restored, and the phones send their queued emails now
Out-of-order and late-arriving data
• Is very common in practice, not a rare corner case
• Related to the time model discussion
• We want control over how out-of-order data is handled, and handling must be efficient
• Example: we process data in 5-minute windows, e.g. to compute statistics
• Option A: when an event arrives 1 minute late, update the original result!
• Option B: when an event arrives 2 hours late, discard it!
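Options A and B above can be sketched together with a single retention threshold. This is a toy model with made-up names, not the actual Kafka Streams implementation: a late event still updates its original window unless that window has expired relative to the highest event time seen so far, in which case it is discarded.

```java
import java.util.HashMap;
import java.util.Map;

// Toy sketch of late-arrival handling (names are illustrative).
public class LateDataWindows {
    private final long windowSizeMs;
    private final long retentionMs;  // how long old windows stay updatable
    private final Map<Long, Long> counts = new HashMap<>();
    private long maxObservedTime = Long.MIN_VALUE;  // crude stream-time tracker

    public LateDataWindows(long windowSizeMs, long retentionMs) {
        this.windowSizeMs = windowSizeMs;
        this.retentionMs = retentionMs;
    }

    // Returns true if the event was applied, false if it was too late.
    public boolean accept(long eventTimestampMs) {
        maxObservedTime = Math.max(maxObservedTime, eventTimestampMs);
        long windowStart = (eventTimestampMs / windowSizeMs) * windowSizeMs;
        if (windowStart + windowSizeMs + retentionMs < maxObservedTime) {
            return false;  // Option B: window expired long ago, discard
        }
        counts.merge(windowStart, 1L, Long::sum);  // Option A: update original result
        return true;
    }

    public long countFor(long windowStart) {
        return counts.getOrDefault(windowStart, 0L);
    }
}
```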
Key features in 0.10
• Native, 100%-compatible Kafka integration
• Secure stream processing using Kafka's security features
• Elastic and highly scalable
• Fault-tolerant
• Stateful and stateless computations
• Interactive queries
• Time model
• Windowing
• Supports late-arriving and out-of-order data
• Millisecond processing latency, no micro-batching
• At-least-once processing guarantees (exactly-once is in the works as we speak)
Roadmap outlook for Kafka Streams
• Exactly-once processing semantics
• Unified API for real-time processing and "batch" processing
• Global KTables
• Session windows
• ... and more ...
Where to go from here
• Kafka Streams is available in Confluent Platform 3.1 and in Apache Kafka 0.10.1
• http://www.confluent.io/download
• Kafka Streams demos: https://github.com/confluentinc/examples
• Java 7, Java 8+ with lambdas, and Scala
• WordCount, Interactive Queries, Joins, Security, Windowing, Avro integration, ...
• Confluent documentation: http://docs.confluent.io/current/streams/
• Quickstart, Concepts, Architecture, Developer Guide, FAQ
• Recorded talks
• Introduction to Kafka Streams: http://www.youtube.com/watch?v=o7zSLNiTZbA
• Application Development and Data in the Emerging World of Stream Processing (higher-level talk): https://www.youtube.com/watch?v=JQnNHO5506w
Motivating example: continuously compute current users per geo-region
[Diagram, built up over several slides: a real-time dashboard answers "How many users younger than 30y, per region?" from continuously updated per-region counts. Its inputs are two Kafka topics owned by different teams: user-locations (mobile team) and user-prefs (web team), e.g. alice → Asia, 25y; bob → Europe, 46y. When a new record alice → Europe arrives in user-locations, alice's merged profile becomes Europe, 25y, and the dashboard counts update accordingly: Asia 8 → 7 (-1), Europe 5 → 6 (+1)]
Same data, but different use cases require different interpretations

alice San Francisco
alice New York City
alice Rio de Janeiro
alice Sydney
alice Beijing
alice Paris
alice Berlin

Use case 1: Frequent traveler status? "Alice has been to SFO, NYC, Rio, Sydney, Beijing, Paris, and finally Berlin."
Use case 2: Current location? "Alice is in SFO, NYC, Rio, Sydney, Beijing, Paris, Berlin right now."
Streams meet Tables

When you need...        | Topic interpreted as a | Read the topic into a | Messages interpreted as | Example
All the values of a key | record stream          | KStream               | INSERT (append)         | All the places Alice has ever been to
Streams meet Tables

When you need...        | Topic interpreted as a | Read the topic into a | Messages interpreted as     | Example
All the values of a key | record stream          | KStream               | INSERT (append)             | All the places Alice has ever been to
Latest value of a key   | changelog stream       | KTable                | UPSERT (overwrite existing) | Where Alice is right now
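The table above can be sketched as two readings of the same message sequence (plain Java with made-up names, no Kafka involved): the KStream reading keeps every value (INSERT semantics), while the KTable reading keeps only the latest value per key (UPSERT semantics).

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy sketch of the stream/table duality (names are illustrative).
public class StreamTableDuality {

    // KStream view: all the values ever seen for a key (INSERT / append).
    public static List<String> asStream(List<SimpleEntry<String, String>> topic, String key) {
        List<String> values = new ArrayList<>();
        for (SimpleEntry<String, String> msg : topic) {
            if (msg.getKey().equals(key)) {
                values.add(msg.getValue());
            }
        }
        return values;
    }

    // KTable view: the latest value per key; later messages overwrite
    // earlier ones (UPSERT).
    public static Map<String, String> asTable(List<SimpleEntry<String, String>> topic) {
        Map<String, String> table = new LinkedHashMap<>();
        for (SimpleEntry<String, String> msg : topic) {
            table.put(msg.getKey(), msg.getValue());
        }
        return table;
    }
}
```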
Same data, but different use cases require different interpretations
Use case 1: Frequent traveler status? → KStream: "Alice has been to SFO, NYC, Rio, Sydney, Beijing, Paris, and finally Berlin."
Use case 2: Current location? → KTable: "Alice is in SFO, NYC, Rio, Sydney, Beijing, Paris, Berlin right now."
Motivating example: continuously compute current users per geo-region
[Diagram, recap: alice's location update flows through the joined profiles into the dashboard counts: Asia 8 → 7 (-1), Europe 5 → 6 (+1)]
Motivating example: continuously compute current users per geo-region

KTable<UserId, Location> userLocations = builder.table("user-locations-topic");
KTable<UserId, Prefs>    userPrefs     = builder.table("user-preferences-topic");

// Merge into detailed user profiles (continuously updated)
KTable<UserId, UserProfile> userProfiles =
    userLocations.join(userPrefs, (loc, prefs) -> new UserProfile(loc, prefs));
Motivating example: continuously compute current users per geo-region

KTable<UserId, Location> userLocations = builder.table("user-locations-topic");
KTable<UserId, Prefs>    userPrefs     = builder.table("user-preferences-topic");

// Merge into detailed user profiles (continuously updated)
KTable<UserId, UserProfile> userProfiles =
    userLocations.join(userPrefs, (loc, prefs) -> new UserProfile(loc, prefs));

// Compute per-region statistics (continuously updated)
KTable<Location, Long> usersPerRegion = userProfiles
    .filter((userId, profile)  -> profile.age < 30)
    .groupBy((userId, profile) -> profile.location)
    .count();

[Diagram: when alice's location changes to Europe, the usersPerRegion KTable updates continuously, e.g. Asia 8 → 7 and Europe 5 → 6]
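As a sanity check on what the two DSL statements maintain, here is a batch-oriented plain-Java sketch with a made-up class name and simplified types: join the two tables on user id, keep users younger than 30, and count them per region. The real Kafka Streams application keeps this result continuously updated as the input topics change, rather than recomputing it over a snapshot.

```java
import java.util.HashMap;
import java.util.Map;

// Toy sketch of join -> filter(age < 30) -> groupBy(location) -> count().
public class UsersPerRegion {

    public static Map<String, Long> compute(Map<String, String> userLocations,
                                            Map<String, Integer> userAges) {
        Map<String, Long> usersPerRegion = new HashMap<>();
        for (Map.Entry<String, String> e : userLocations.entrySet()) {
            Integer age = userAges.get(e.getKey());  // inner join on user id
            if (age != null && age < 30) {           // filter(profile.age < 30)
                // groupBy(profile.location).count()
                usersPerRegion.merge(e.getValue(), 1L, Long::sum);
            }
        }
        return usersPerRegion;
    }
}
```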
Motivating example: continuously compute current users per geo-region
[Diagram, recap: alice's single location update flows all the way through to the dashboard, whose per-region counts change from Asia 8, Europe 5 to Asia 7, Europe 6]
Another common use case: continuous transformations
• Example: enrich an input stream (user clicks) with side data (current user profile)
[Diagram, built up over several slides: a KStream of "facts" from the user-clicks topic (at 1M msgs/s), e.g. alice → /rental/p8454vb, 06:59 PM PDT, is joined via stream.join(table) with a KTable of "dimensions" built from the user-profiles topic (alice → Asia, 25y; bob → Europe, 46y), producing the enriched event alice → /rental/p8454vb, 06:59 PDT, Asia, 25y. When a new update for alice arrives from the user-locations topic (alice → Europe), her entry in the KTable becomes Europe, 25y and subsequent clicks are enriched with the new profile]
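The enrichment step can be sketched in plain Java (made-up names, no Kafka involved): each click "fact" is joined with the user's current profile "dimension" as of processing time, which is why a profile update changes how all subsequent clicks are enriched.

```java
import java.util.HashMap;
import java.util.Map;

// Toy sketch of a stream-table join: enrich a click event with the
// clicking user's current profile (names are illustrative).
public class ClickEnrichment {

    public static String enrich(String user, String click, Map<String, String> profileTable) {
        // Look up the user's profile as of "now"; a KTable would hold the
        // latest value per user.
        String profile = profileTable.getOrDefault(user, "unknown");
        return click + ", " + profile;
    }
}
```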