2. Kafka Summit 2020 — By the numbers
https://kafkasummit.io/
• 10 Keynote Speakers
• 87 Sessions
• 46 Birds of a Feather & Ask the Experts
• 102 Session Speakers
Attendees from 143 Countries
3. Sessions
• Opening Keynote — Gwen Shapira
• Feed your SIEM Smart with Kafka Connect
• Building a Modern, Scalable Cyber Intelligence Platform with Confluent Kafka
• Learnings From the Field. Lessons From Working with Dozens of Small & Large Deployments
• Maximize the Business Value of Machine Learning and Data Science with Kafka
• Measuring your Digital Transformation: Why Real Time Analytics are the Critical Next Step
• MQTT and Apache Kafka: The Solution to Poor Internet Connectivity in Africa
• The Flux Capacitor of Kafka Streams and ksqlDB
• Keynote: Kafka ♥ Cloud — Jay Kreps
15. Feed your SIEM Smart with Kafka Connect
Vitalii Rudenskyi
Information Security Architect
McKesson Corporation
16. Background and Motivation
• How to use Kafka Connect to ingest, consume, and deliver data to a SIEM
• Migration from an old-generation SIEM to a new SIEM solution
• Not easy, not fun at all!
• To avoid making the same mistake again, they decided to own their data ingestion and consumption rather than depend on any vendor.
29. Key Takeaways
• Kafka has become an integral part of enterprise SIEM modernization
• With Kafka and Connect, customers can take a “vendor-agnostic” approach to their SIEM strategy
• Kafka Connect is a flexible solution for dealing with various data sources/sinks and different data formats. In this particular case, 530+ connectors have been deployed
• Kafka Connect is extremely extensible, and customers can customize or develop their own connectors based on their requirements (see the sketch below).
• The speaker developed a custom transformation library to transform different parts of the messages, and is looking to implement stream processing as a next step.
• The speaker shared smart tips on making use of headers and built a solution for connector high availability.
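To make the extensibility point concrete, here is a minimal sketch of a Kafka Connect Single Message Transform (SMT) that stamps a header onto every record. It is a hypothetical illustration in the spirit of the speaker's transformation library, not their actual code; the class name and config key are invented.

import java.util.Map;
import org.apache.kafka.common.config.ConfigDef;
import org.apache.kafka.connect.connector.ConnectRecord;
import org.apache.kafka.connect.transforms.Transformation;

// Hypothetical SMT: tags each record with a source-identifying header.
public class TagSourceHeader<R extends ConnectRecord<R>> implements Transformation<R> {

    private String sourceName;

    @Override
    public void configure(Map<String, ?> configs) {
        // "source.name" is an invented config key for this sketch.
        sourceName = (String) configs.get("source.name");
    }

    @Override
    public R apply(R record) {
        // Headers survive the Connect pipeline, so downstream consumers
        // (e.g., the SIEM) can route or filter on them.
        record.headers().addString("x-source", sourceName);
        return record;
    }

    @Override
    public ConfigDef config() {
        return new ConfigDef().define("source.name", ConfigDef.Type.STRING,
                ConfigDef.Importance.HIGH, "Logical name of the data source");
    }

    @Override
    public void close() {}
}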
30. Jack Noel - Security Solutions Architect - Intel
Building a Modern, Scalable Cyber Intelligence Platform with Confluent Kafka
31. Intel Information Security’s Mission
To keep Intel’s intel legal and secure!
The mission is never done, e.g., finding the balance between infosec requirements and being cost-effective and agile.
33. ● Data Filtering
● Data comes from Partners
● IT non-security data
● Prioritise high-value data
● Enrich Data
35. ● Acquire data once, consume many times
● Using data is expensive
● Filtering, joining and enriching data in-stream to provide rich data upstream (see the sketch after this list)
● ML in-stream - Advanced
● Become Predictive!
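As a concrete picture of in-stream filtering and enrichment, here is a minimal Kafka Streams sketch; the topic names and the heartbeat filter are hypothetical, not Intel's actual pipeline.

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Produced;

public class InStreamEnrichment {
    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();

        // Raw events keyed by host id; asset inventory as a table (hypothetical topics).
        KStream<String, String> events =
                builder.stream("raw-security-events", Consumed.with(Serdes.String(), Serdes.String()));
        KTable<String, String> assets =
                builder.table("asset-inventory", Consumed.with(Serdes.String(), Serdes.String()));

        events
            // Drop low-value records before they incur downstream cost.
            .filter((host, event) -> event != null && !event.contains("heartbeat"))
            // Enrich each event with asset context joined from the table.
            .join(assets, (event, asset) -> event + " | asset=" + asset)
            .to("enriched-security-events", Produced.with(Serdes.String(), Serdes.String()));

        // builder.build() is then handed to a KafkaStreams instance to run.
    }
}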
36. ● Reduce technical debt, e.g. point-to-point integrations
● Always on
● Thriving community
● Slide speaks for itself!
38. Key Takeaways
• There are a lot of security vendors, each with their own way of producing and parsing data. Kafka makes that more seamless
• No vendor lock-in, e.g., use OS-native as much as possible.
• Share data with other teams
• Collect data from other teams, e.g., vendors or IT
• Operational maturity is important to ensure success, e.g., people, process, and tools.
39. Mitch Henderson - Customer Success Technical Architect
Learnings From the Field. Lessons From Working with Dozens of Small & Large Deployments
40. Key do's and don'ts for managing Kafka installations
• Upgrades - do them well and often; don't fall into the trap of sticking with an old version
• How to execute upgrades well
• Monitoring: JMX
• Configuration - varying from the defaults, recommended tunables (see the sketch after this list)
• Logging
• Quotas
• Clusters - single or multiple
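To illustrate what "varying from the defaults" can mean, here is a minimal producer-side sketch with durability settings that are commonly recommended for production. The specific values are illustrative assumptions, not the talk's official list (the defaults referenced are those of Kafka releases around 2020).

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.StringSerializer;

public class ProductionProducerConfig {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-1:9092"); // hypothetical host
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        // Away-from-default settings commonly recommended for durability:
        props.put(ProducerConfig.ACKS_CONFIG, "all");                // wait for all in-sync replicas
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true"); // avoid duplicates on retry

        // Broker-side counterparts (set in server.properties):
        //   min.insync.replicas=2
        //   unclean.leader.election.enable=false
        //   default.replication.factor=3

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // producer.send(...) as usual
        }
    }
}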
41. Key takeaways
• If you don't have the option of running fully-managed Apache Kafka such as Confluent Cloud, remember that Kafka is a distributed system and takes careful and deliberate management.
• There are many recommended changes from the default settings for certain types of production operations - OOTB settings are really development settings.
• Make upgrades part of your Kafka muscle memory.
• Don't wait until you have problems to set recommended settings and guardrails.
• If in doubt, hire a professional - Confluent PS.
42. Tom Szumowski - Senior Data Scientist, Nuuly
Chirag Dadia - Director of Engineering, Nuuly
Maximize the Business Value of Machine Learning and Data Science with Kafka
43. Background and Motivation
Nuuly is a clothing rental subscription service driven by a Kafka-based architecture.
As an online platform, they are continually looking to improve their service to better meet the needs of their customers and drive revenue.
Data analytics and machine learning form a large part of their optimisation strategy, but how do you implement this mostly offline, batch-style processing on a real-time platform?
Challenge - typical warehouses track SKUs and stock levels, whereas Nuuly tracks individual items. Real-time inventory becomes critical - a rented item must not be rented twice at the same time.
46. Data Science
Everything is asynchronous: ETL pipelines transform data and send it to the warehouse.
ML integrates with user interactions.
Stream processors are used to materialise state - "Kafka is our data store" (see the sketch below).
Adding new steps is as easy as building a new microservice.
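A minimal sketch of stream processors materialising state in Kafka Streams, in the spirit of the talk; the topic, store name, and counting logic are hypothetical, not Nuuly's actual pipeline.

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Materialized;

public class InventoryState {
    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();

        // Count rental events per individual item. The state store is backed by
        // a changelog topic in Kafka - which is what makes Kafka the data store.
        KTable<String, Long> rentalsPerItem = builder
            .stream("rental-events", Consumed.with(Serdes.String(), Serdes.String()))
            .groupByKey()
            .count(Materialized.as("rentals-per-item-store"));

        // The store can be queried interactively or drive downstream logic,
        // e.g., refusing to rent out an item that is already rented.
    }
}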
64. Key Takeaways
• Shared vision across the organization - common view of the world (data)
• Event driven adds value to customer interactions
• “Gen Z” expectations - contextual, personalized, real-time
• Real-time is freshness of data + fast analytics
• Velocity of data and velocity of understanding are distinct measures
66. Background and Motivation
• Internet connectivity is a problem in the remote villages of Africa
• There were challenges implementing agency banking in villages in Tanzania
• How MQTT and Apache Kafka were used to overcome these problems
*** The agency banking model is a function of certain commercial banks in Kenya, regulated by Central Bank of Kenya legislation, that allows them to contract third-party retail networks as banking agents. Upon successful application, vetting, and approval, these agents are authorized to offer selected products and services on behalf of the bank. This relationship creates an agency banking business model.
74. Key Takeaways
• How MQTT and Apache Kafka have been leveraged to provide digital banking to a region with poor internet connectivity
• MQTT maintains the session from handheld devices, but it does not provide long-term message storage, and MQTT connectivity with downstream enterprise solutions is not great either.
• Apache Kafka is used to store data for a longer period of time, and with connectors the data can be pushed to all the downstream systems where further analytics can be done
• An MQTT connector is used as the bridge between Kafka and MQTT (see the sketch after this list)
• Since managing Kafka is a specialized job that requires a lot of effort and $, they are gradually moving to Confluent Cloud
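The deployment uses an off-the-shelf MQTT connector; purely to make the bridging idea concrete, here is a hand-rolled sketch of the same data flow using the Eclipse Paho client. Broker addresses and topic names are hypothetical.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.eclipse.paho.client.mqttv3.MqttClient;

public class MqttToKafkaBridge {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka-broker:9092"); // hypothetical
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");
        KafkaProducer<String, byte[]> producer = new KafkaProducer<>(props);

        MqttClient mqtt = new MqttClient("tcp://mqtt-broker:1883", "bridge-client"); // hypothetical
        mqtt.connect();

        // Forward every MQTT message into Kafka, where it is retained long-term
        // and can fan out to downstream systems via connectors.
        mqtt.subscribe("bank/agents/#", (topic, message) ->
            producer.send(new ProducerRecord<>("agent-transactions", topic, message.getPayload())));
    }
}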
75. Matthias J. Sax | Software Engineer, Confluent
The Flux Capacitor of Kafka Streams and ksqlDB
77. Recap: Time 101
Event Time
• When an event happened (embedded in the message/record; see the extractor sketch after this list)
• Ensures deterministic processing
• Used to express processing semantics, i.e., impacts the result
Processing Time (aka Wall-clock Time)
• When an event/message/record is processed
• Used for non-functional properties
• Timeouts
• Data rate control
• Periodic actions
• Should not impact the result: otherwise, non-deterministic
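Event time is typically pulled out of the record itself with a timestamp extractor. A minimal sketch, assuming a hypothetical payload type that carries its own event time (the TimestampedEvent interface below is invented for illustration):

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.streams.processor.TimestampExtractor;

// Invented payload type for this sketch: events that know their own event time.
interface TimestampedEvent {
    long eventTimeMs();
}

public class PayloadTimestampExtractor implements TimestampExtractor {
    @Override
    public long extract(ConsumerRecord<Object, Object> record, long partitionTime) {
        if (record.value() instanceof TimestampedEvent) {
            // Use the event time embedded in the payload -> deterministic processing.
            return ((TimestampedEvent) record.value()).eventTimeMs();
        }
        // Fall back to the highest timestamp seen so far on this partition.
        return partitionTime;
    }
}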
78. Yeah, well, history is gonna change
Input records with descending event timestamp are considered out-of-order
• Out-of-order if event-time < stream-time
[Timeline figure: records arrive with event timestamps 14:01, 14:03, 14:08, 14:01, 14:02, 14:11; stream-time advances to 14:01, 14:03, 14:08, and finally 14:11, so the 14:01 and 14:02 records that arrive after stream-time has reached 14:08 are out-of-order.]
79. You are not thinking fourth-dimensionally
[Figure: Topic-A, Partition 0 holds records (14:01 through 14:11) while Topic-B, Partition 0 is empty. Processing pauses and poll()s for new data, unblocking when the max.task.idle.ms timeout hits.]
81. Time Windows
Tumbling Windows
• fixed size / non-overlapping / grouped (i.e., GROUP BY) - see the sketch below
[Figure: time axis with back-to-back 5-minute windows at 14:00, 14:05, 14:10, 14:15.]
No variable-size window support yet:
• Weeks, Months, Years
• No out-of-the-box time zone support
• https://github.com/confluentinc/kafka-streams-examples/blob/5.5.0-post/src/test/java/io/confluent/examples/streams/window/DailyTimeWindows.java
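A minimal Kafka Streams sketch of a tumbling-window aggregation matching the figure's 5-minute windows, using the same TimeWindows API as the retention example later in this talk; the stream and topic names are hypothetical.

import java.time.Duration;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.TimeWindows;
import org.apache.kafka.streams.kstream.Windowed;

public class TumblingWindowExample {
    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> events = builder.stream("events"); // hypothetical topic

        // One count per key per non-overlapping 5-minute window.
        KTable<Windowed<String>, Long> counts = events
            .groupByKey()
            .windowedBy(TimeWindows.of(Duration.ofMinutes(5L)))
            .count();
    }
}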
82. Time Windows
Hopping Windows
• fixed size / overlapping / grouped (i.e., GROUP BY) - see the sketch below
• Different to a sliding window!
[Figure: 5-minute windows hopping by 1 minute, starting at 14:00, 14:01, 14:02, 14:03, 14:04, and so on.]
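Hopping windows reuse the tumbling definition with an advanceBy; continuing the sketch above (same imports and events stream):

// Overlapping 5-minute windows, a new one starting every minute.
KTable<Windowed<String>, Long> hoppingCounts = events
    .groupByKey()
    .windowedBy(TimeWindows.of(Duration.ofMinutes(5L)).advanceBy(Duration.ofMinutes(1L)))
    .count();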
83. Sliding Windows
Different use case: aggregate the data of the last (e.g.) 10 minutes
• Window boundaries are data-dependent and unknown upfront (cf. KIP-450; see the sketch below)
[Figure: records at 14:03, 14:07, 14:12, 14:19, 14:26 produce 10-minute windows whose boundaries follow the data: [13:53, 14:03], [13:57, 14:07], [14:02, 14:12], [14:04, 14:14], [14:08, 14:18], [14:09, 14:19], [14:13, 14:23], [14:16, 14:26], [14:20, 14:30].]
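KIP-450 later shipped in Apache Kafka 2.7 as the SlidingWindows class, which was not yet available at the time of this talk; a minimal sketch of that API, continuing the windowing examples above (add an import for org.apache.kafka.streams.kstream.SlidingWindows):

// Aggregate the last 10 minutes of data per key, with a 1-minute grace period
// (SlidingWindows per KIP-450, Apache Kafka 2.7+).
KTable<Windowed<String>, Long> slidingCounts = events
    .groupByKey()
    .windowedBy(SlidingWindows.withTimeDifferenceAndGrace(
        Duration.ofMinutes(10L), Duration.ofMinutes(1L)))
    .count();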
84. When we are processing, we don’t need watermarks
Grace period: defines a cut-off for out-of-order records that are (too) late
• Grace period is defined per operator
• Late if stream-time - event-time > grace period
• Late data is ignored and not processed by the operator
[Timeline figure, as on slide 78: with grace := 5min, the out-of-order record at 14:02 arriving once stream-time has reached 14:08 has a delay of 6 minutes, exceeds the grace period, and is dropped as late.]
85. Retention Time
How long to store data in a (windowed) table.
TimeWindows.of(Duration.ofMinutes(5L)).grace(Duration.ofMinutes(1L))
Materialized.as(…).withRetention(Duration.ofHours(1L))
WINDOW TUMBLING(SIZE 5 MINUTES, GRACE PERIOD 1 MINUTE, RETENTION TIME 1 HOUR)
[Figure: with SIZE 5 MINUTES and GRACE PERIOD 1 MINUTE, a window has windowStart @14:00 and windowEnd @14:05, closes @14:06, and the 1-hour retention keeps it until 15:05, measured in stream-time.]
86. Stream-Stream Join
Streams are conceptually unbounded
• Limited join scope via a sliding time window
leftStream.join(rightStream, JoinWindows.of(Duration.ofMinutes(5L)));
SELECT * FROM leftStream AS l JOIN rightStream AS r WITHIN 5 MINUTES ON l.id = r.id;
[Figure: left-stream records 1@14:04, 2@14:16, 3@14:08 join right-stream records A@14:01, B@14:11, C@14:23 within the 5-minute window, producing 1⨝A@14:04, 2⨝B@14:16, and 3⨝B@14:11; each result is timestamped max(l.ts, r.ts).]