2. Kafka Summit 2020 — By the numbers
https://kafkasummit.io/
• 10 Keynote Speakers
• 87 Sessions
• 46 Birds of a Feather & Ask the Experts
• 102 Session Speakers
Attendees from 143 Countries
3. Sessions
• Opening Keynote — Gwen Shapira
• Feed your SIEM Smart with Kafka Connect
• Building a Modern, Scalable Cyber Intelligence Platform with Confluent Kafka
• Learnings From the Field. Lessons From Working with Dozens of Small & Large Deployments
• Maximize the Business Value of Machine Learning and Data Science with Kafka
• Measuring your Digital Transformation: Why Real Time Analytics are the Critical Next Step
• MQTT and Apache Kafka: The Solution to Poor Internet Connectivity in Africa
• The Flux Capacitor of Kafka Streams and ksqlDB
• Keynote: Kafka ♥ Cloud — Jay Kreps
15. Feed your SIEM Smart with Kafka Connect
Vitalii Rudenskyi
Information Security Architect
McKesson Corporation
16. Background and Motivation
• How to use Kafka Connect to ingest, consume, and deliver data to a SIEM
• Migration from an old-generation SIEM to a new SIEM solution
• Not easy, not fun at all!
• To avoid making the same mistake again, they decided to own their data ingestion and consumption rather than depend on any vendor.
29. Key Takeaways
• Kafka has become an integral part of enterprise SIEM modernization
• With Kafka and Connect, customers can take a “vendor-agnostic” approach to their SIEM strategy
• Kafka Connect is a flexible solution for dealing with various data sources/sinks and different data formats. In this particular case, 530+ connectors have been deployed
• Kafka Connect is extremely extensible, and customers can customize or develop their own connectors based on their requirements (see the sketch below).
• The speaker developed a custom transformation library to transform different parts of the messages, and is looking to implement stream processing as a next step.
• The speaker shared smart tips on making use of headers and built a solution for connector high availability.
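To make the extensibility point concrete, here is a minimal sketch of a Kafka Connect Single Message Transform (SMT) that stamps a header onto every record. It is a hypothetical illustration in the spirit of the speaker's transformation library, not their actual code; the class name and config key are invented.

import java.util.Map;
import org.apache.kafka.common.config.ConfigDef;
import org.apache.kafka.connect.connector.ConnectRecord;
import org.apache.kafka.connect.transforms.Transformation;

// Hypothetical SMT: tags each record with a source-identifying header.
public class TagSourceHeader<R extends ConnectRecord<R>> implements Transformation<R> {

    private String sourceName;

    @Override
    public void configure(Map<String, ?> configs) {
        // "source.name" is an invented config key for this sketch.
        sourceName = (String) configs.get("source.name");
    }

    @Override
    public R apply(R record) {
        // Headers survive the Connect pipeline, so downstream consumers
        // (e.g., the SIEM) can route or filter on them.
        record.headers().addString("x-source", sourceName);
        return record;
    }

    @Override
    public ConfigDef config() {
        return new ConfigDef().define("source.name", ConfigDef.Type.STRING,
                ConfigDef.Importance.HIGH, "Logical name of the data source");
    }

    @Override
    public void close() {}
}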
30. Jack Noel - Security Solutions Architect - Intel
Building a Modern, Scalable Cyber Intelligence Platform with Confluent Kafka
31. Intel Information Security’s Mission
To keep Intel’s intel legal and secure!
The mission is never done, e.g., finding the balance between infosec requirements and being cost-effective and agile.
33. ● Data Filtering
● Data comes from Partners
● IT non-security data
● Prioritise high-value data
● Enrich Data
35. ● Acquire data once, consume many times
● Using data is expensive
● Filtering, joining and enriching data in-stream to provide rich data upstream (see the sketch after this list)
● ML in-stream - Advanced
● Become Predictive!
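As a concrete picture of in-stream filtering and enrichment, here is a minimal Kafka Streams sketch; the topic names and the heartbeat filter are hypothetical, not Intel's actual pipeline.

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Produced;

public class InStreamEnrichment {
    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();

        // Raw events keyed by host id; asset inventory as a table (hypothetical topics).
        KStream<String, String> events =
                builder.stream("raw-security-events", Consumed.with(Serdes.String(), Serdes.String()));
        KTable<String, String> assets =
                builder.table("asset-inventory", Consumed.with(Serdes.String(), Serdes.String()));

        events
            // Drop low-value records before they incur downstream cost.
            .filter((host, event) -> event != null && !event.contains("heartbeat"))
            // Enrich each event with asset context joined from the table.
            .join(assets, (event, asset) -> event + " | asset=" + asset)
            .to("enriched-security-events", Produced.with(Serdes.String(), Serdes.String()));

        // builder.build() is then handed to a KafkaStreams instance to run.
    }
}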
36. ● Reduce technical debt, e.g. point-to-point integrations
● Always on
● Thriving community
● Slide speaks for itself!
38. Key Takeaways
• There are a lot of security vendors, each with their own way of producing and parsing data. Kafka makes that more seamless
• No vendor lock-in, e.g., use OS-native as much as possible.
• Share data with other teams
• Collect data from other teams, e.g., vendors or IT
• Operational maturity is important to ensure success, e.g., people, process, and tools.
39. Mitch Henderson - Customer Success Technical Architect
Learnings From the Field. Lessons From Working with Dozens of Small & Large Deployments
40. Key do's and don'ts for managing Kafka installations
• Upgrades - do them well and often; don't fall into the trap of sticking with an old version
• How to execute upgrades well
• Monitoring: JMX
• Configuration - varying from the defaults, recommended tunables (see the sketch after this list)
• Logging
• Quotas
• Clusters - single or multiple
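To illustrate what "varying from the defaults" can mean, here is a minimal producer-side sketch with durability settings that are commonly recommended for production. The specific values are illustrative assumptions, not the talk's official list (the defaults referenced are those of Kafka releases around 2020).

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.StringSerializer;

public class ProductionProducerConfig {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-1:9092"); // hypothetical host
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        // Away-from-default settings commonly recommended for durability:
        props.put(ProducerConfig.ACKS_CONFIG, "all");                // wait for all in-sync replicas
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true"); // avoid duplicates on retry

        // Broker-side counterparts (set in server.properties):
        //   min.insync.replicas=2
        //   unclean.leader.election.enable=false
        //   default.replication.factor=3

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // producer.send(...) as usual
        }
    }
}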
41. Key takeaways
• If you don't have the option of running fully-managed Apache Kafka such as Confluent Cloud, remember that Kafka is a distributed system and takes careful and deliberate management.
• There are many recommended changes from the default settings for certain types of production operations - OOTB settings are really development settings.
• Make upgrades part of your Kafka muscle memory.
• Don't wait until you have problems to set recommended settings and guardrails.
• If in doubt, hire a professional - Confluent PS.
42. Tom Szumowski - Senior Data Scientist, Nuuly
Chirag Dadia - Director of Engineering, Nuuly
Maximize the Business Value of Machine Learning and Data Science with Kafka
43. Background and Motivation
Nuuly is a clothing rental subscription service driven by a Kafka-based architecture.
As an online platform, they are continually looking to improve their service to better meet the needs of their customers and drive revenue.
Data analytics and machine learning form a large part of their optimisation strategy, but how do you implement this mostly offline, batch-style processing on a real-time platform?
Challenge - typical warehouses track SKUs and stock levels, whereas Nuuly tracks individual items. Real-time inventory becomes critical - a rented item must not be rented twice at the same time.
46. Data Science
Everything is asynchronous: ETL pipelines transform data and send it to the warehouse.
ML integrates with user interactions.
Stream processors are used to materialise state - "Kafka is our data store" (see the sketch below).
Adding new steps is as easy as building a new microservice.
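A minimal sketch of stream processors materialising state in Kafka Streams, in the spirit of the talk; the topic, store name, and counting logic are hypothetical, not Nuuly's actual pipeline.

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Materialized;

public class InventoryState {
    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();

        // Count rental events per individual item. The state store is backed by
        // a changelog topic in Kafka - which is what makes Kafka the data store.
        KTable<String, Long> rentalsPerItem = builder
            .stream("rental-events", Consumed.with(Serdes.String(), Serdes.String()))
            .groupByKey()
            .count(Materialized.as("rentals-per-item-store"));

        // The store can be queried interactively or drive downstream logic,
        // e.g., refusing to rent out an item that is already rented.
    }
}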
64. Key Takeaways
• Shared vision across the organization - common view of the world (data)
• Event driven adds value to customer interactions
• “Gen Z” expectations - contextual, personalized, real-time
• Real-time is freshness of data + fast analytics
• Velocity of data and velocity of understanding are distinct measures
66. Background and Motivation
• Internet connectivity is a problem in the remote villages of Africa
• There were challenges implementing agency banking in villages in Tanzania
• How MQTT and Apache Kafka were used to overcome these problems
*** The agency banking model is a function of certain commercial banks in Kenya, regulated by Central Bank of Kenya legislation, that allows them to contract third-party retail networks as banking agents. Upon successful application, vetting, and approval, these agents are authorized to offer selected products and services on behalf of the bank. This relationship creates an agency banking business model.
74. Key Takeaways
• How MQTT and Apache Kafka have been leveraged to provide digital banking to a region with poor internet connectivity
• MQTT maintains the session from handheld devices, but it does not provide long-term message storage, and MQTT connectivity with downstream enterprise solutions is not great either.
• Apache Kafka is used to store data for a longer period of time, and with connectors the data can be pushed to all the downstream systems where further analytics can be done
• An MQTT connector is used as the bridge between Kafka and MQTT (see the sketch after this list)
• Since managing Kafka is a specialized job that requires a lot of effort and $, they are gradually moving to Confluent Cloud
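The deployment uses an off-the-shelf MQTT connector; purely to make the bridging idea concrete, here is a hand-rolled sketch of the same data flow using the Eclipse Paho client. Broker addresses and topic names are hypothetical.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.eclipse.paho.client.mqttv3.MqttClient;

public class MqttToKafkaBridge {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka-broker:9092"); // hypothetical
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");
        KafkaProducer<String, byte[]> producer = new KafkaProducer<>(props);

        MqttClient mqtt = new MqttClient("tcp://mqtt-broker:1883", "bridge-client"); // hypothetical
        mqtt.connect();

        // Forward every MQTT message into Kafka, where it is retained long-term
        // and can fan out to downstream systems via connectors.
        mqtt.subscribe("bank/agents/#", (topic, message) ->
            producer.send(new ProducerRecord<>("agent-transactions", topic, message.getPayload())));
    }
}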
75. Matthias J. Sax | Software Engineer, Confluent
The Flux Capacitor of Kafka Streams and ksqlDB
77. Recap: Time 101
Event Time
• When an event happened (embedded in the message/record; see the extractor sketch after this list)
• Ensures deterministic processing
• Used to express processing semantics, i.e., impacts the result
Processing Time (aka Wall-clock Time)
• When an event/message/record is processed
• Used for non-functional properties
• Timeouts
• Data rate control
• Periodic actions
• Should not impact the result: otherwise, non-deterministic
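Event time is typically pulled out of the record itself with a timestamp extractor. A minimal sketch, assuming a hypothetical payload type that carries its own event time (the TimestampedEvent interface below is invented for illustration):

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.streams.processor.TimestampExtractor;

// Invented payload type for this sketch: events that know their own event time.
interface TimestampedEvent {
    long eventTimeMs();
}

public class PayloadTimestampExtractor implements TimestampExtractor {
    @Override
    public long extract(ConsumerRecord<Object, Object> record, long partitionTime) {
        if (record.value() instanceof TimestampedEvent) {
            // Use the event time embedded in the payload -> deterministic processing.
            return ((TimestampedEvent) record.value()).eventTimeMs();
        }
        // Fall back to the highest timestamp seen so far on this partition.
        return partitionTime;
    }
}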
78. Yeah, well, history is gonna change
Input records with descending event timestamp are considered out-of-order
• Out-of-order if event-time < stream-time
[Timeline figure: records arrive with event timestamps 14:01, 14:03, 14:08, 14:01, 14:02, 14:11; stream-time advances to 14:01, 14:03, 14:08, and finally 14:11, so the 14:01 and 14:02 records that arrive after stream-time has reached 14:08 are out-of-order.]
79. You are not thinking fourth-dimensionally
[Figure: Topic-A, Partition 0 holds records (14:01 through 14:11) while Topic-B, Partition 0 is empty. Processing pauses and poll()s for new data, unblocking when the max.task.idle.ms timeout hits.]
81. Time Windows
Tumbling Windows
• fixed size / non-overlapping / grouped (i.e., GROUP BY) - see the sketch below
[Figure: time axis with back-to-back 5-minute windows at 14:00, 14:05, 14:10, 14:15.]
No variable-size window support yet:
• Weeks, Months, Years
• No out-of-the-box time zone support
• https://github.com/confluentinc/kafka-streams-examples/blob/5.5.0-post/src/test/java/io/confluent/examples/streams/window/DailyTimeWindows.java
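A minimal Kafka Streams sketch of a tumbling-window aggregation matching the figure's 5-minute windows, using the same TimeWindows API as the retention example later in this talk; the stream and topic names are hypothetical.

import java.time.Duration;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.TimeWindows;
import org.apache.kafka.streams.kstream.Windowed;

public class TumblingWindowExample {
    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> events = builder.stream("events"); // hypothetical topic

        // One count per key per non-overlapping 5-minute window.
        KTable<Windowed<String>, Long> counts = events
            .groupByKey()
            .windowedBy(TimeWindows.of(Duration.ofMinutes(5L)))
            .count();
    }
}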
82. Time Windows
Hopping Windows
• fixed size / overlapping / grouped (i.e., GROUP BY) - see the sketch below
• Different to a sliding window!
[Figure: 5-minute windows hopping by 1 minute, starting at 14:00, 14:01, 14:02, 14:03, 14:04, and so on.]
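Hopping windows reuse the tumbling definition with an advanceBy; continuing the sketch above (same imports and events stream):

// Overlapping 5-minute windows, a new one starting every minute.
KTable<Windowed<String>, Long> hoppingCounts = events
    .groupByKey()
    .windowedBy(TimeWindows.of(Duration.ofMinutes(5L)).advanceBy(Duration.ofMinutes(1L)))
    .count();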
83. Sliding Windows
Different use case: aggregate the data of the last (e.g.) 10 minutes
• Window boundaries are data-dependent and unknown upfront (cf. KIP-450; see the sketch below)
[Figure: records at 14:03, 14:07, 14:12, 14:19, 14:26 produce 10-minute windows whose boundaries follow the data: [13:53, 14:03], [13:57, 14:07], [14:02, 14:12], [14:04, 14:14], [14:08, 14:18], [14:09, 14:19], [14:13, 14:23], [14:16, 14:26], [14:20, 14:30].]
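KIP-450 later shipped in Apache Kafka 2.7 as the SlidingWindows class, which was not yet available at the time of this talk; a minimal sketch of that API, continuing the windowing examples above (add an import for org.apache.kafka.streams.kstream.SlidingWindows):

// Aggregate the last 10 minutes of data per key, with a 1-minute grace period
// (SlidingWindows per KIP-450, Apache Kafka 2.7+).
KTable<Windowed<String>, Long> slidingCounts = events
    .groupByKey()
    .windowedBy(SlidingWindows.withTimeDifferenceAndGrace(
        Duration.ofMinutes(10L), Duration.ofMinutes(1L)))
    .count();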
84. When we are processing, we don’t need watermarks
Grace period: defines a cut-off for out-of-order records that are (too) late
• Grace period is defined per operator
• Late if stream-time - event-time > grace period
• Late data is ignored and not processed by the operator
[Timeline figure, as on slide 78: with grace := 5min, the out-of-order record at 14:02 arriving once stream-time has reached 14:08 has a delay of 6 minutes, exceeds the grace period, and is dropped as late.]
85. Retention Time
How long to store data in a (windowed) table.
TimeWindows.of(Duration.ofMinutes(5L)).grace(Duration.ofMinutes(1L))
Materialized.as(…).withRetention(Duration.ofHours(1L))
WINDOW TUMBLING(SIZE 5 MINUTES, GRACE PERIOD 1 MINUTE, RETENTION TIME 1 HOUR)
[Figure: with SIZE 5 MINUTES and GRACE PERIOD 1 MINUTE, a window has windowStart @14:00 and windowEnd @14:05, closes @14:06, and the 1-hour retention keeps it until 15:05, measured in stream-time.]
86. Stream-Stream Join
Streams are conceptually unbounded
• Limited join scope via a sliding time window
leftStream.join(rightStream, JoinWindows.of(Duration.ofMinutes(5L)));
SELECT * FROM leftStream AS l JOIN rightStream AS r WITHIN 5 MINUTES ON l.id = r.id;
[Figure: left-stream records 1@14:04, 2@14:16, 3@14:08 join right-stream records A@14:01, B@14:11, C@14:23 within the 5-minute window, producing 1⨝A@14:04, 2⨝B@14:16, and 3⨝B@14:11; each result is timestamped max(l.ts, r.ts).]