Data Con LA 2022 - Data Streaming with Kafka

1
© 2020 KPMG LLP, a Delaware limited liability partnership and the U.S. member firm of the KPMG network of independent member firms affiliated with KPMG International
Cooperative (“KPMG International”), a Swiss entity. All rights reserved.
August 13, 2022 at DataConLA
Real Time Data Streaming with Kafka
Speaker:
Jie Chen
Manager Advisory
Engineering Architect
LinkedIn

2
Agenda
Kafka at a Glance (5 min)
Kafka Use Cases (20 min)
Intelligent Forecast System
Kafka in Banking
Distributed Data with CQRS
Key Takeaways (5 min)
Q&A (10 min)

3
Kafka at a Glance

4
Kafka in the Market
CORE CAPABILITIES
Scalable
Scale production clusters up to a thousand brokers, trillions of messages per day, petabytes of data, hundreds of thousands of partitions. Elastically expand and contract storage and processing.
High Throughput
Deliver messages at network-limited throughput using a cluster of machines, with latencies as low as 2 ms.
Permanent Storage
Store streams of data safely in a distributed, durable, fault-tolerant cluster.
High Availability
Stretch clusters efficiently over availability zones or connect separate clusters across geographic regions.
Source: kafka.apache.org

5
Kafka Platform Overview
Event Streaming Platform
Distributed streaming platform that enables real-time, event-driven applications using a topic-based pub-sub model.
Performance at Scale
Kafka operates as a highly available and fault-tolerant cluster that spans servers and even data centers, with a partitioning system that supports data volumes of practically any size.
https://docs.confluent.io/

6
What is Event Driven Streaming with Kafka
[Diagram: producers (ETL, raw message queue, change data capture, mainframe, custom) publish events to a topic, which is split into partitions hosted on the brokers (servers) of the Kafka cluster. Consumers (web, mobile, data warehouse, monitoring tool, partners) subscribe to the topic and drain the data.]
An event is a type of data that describes the entity’s observable state updates over time (Definition by IBM)
Examples include a first-time user registration, a payment, or a social media post.
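The topic, partition, producer, and consumer roles described on this slide can be sketched with a small in-memory model. This is only an illustration, not Kafka client code: the class names, the `user-events` topic, and the CRC32 key hash are assumptions made for the example (Kafka's own partitioner uses a different hash function).

```python
import zlib
from collections import defaultdict


class Topic:
    """In-memory stand-in for a topic: a fixed set of append-only partitions."""

    def __init__(self, name, num_partitions=3):
        self.name = name
        self.partitions = [[] for _ in range(num_partitions)]

    def publish(self, key, value):
        # Hash the key so that events for the same entity always land in the
        # same partition, which preserves their relative order.
        index = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        self.partitions[index].append((key, value))
        return index


class Consumer:
    """Tracks its own read position per partition, independent of other consumers."""

    def __init__(self, topic):
        self.topic = topic
        self.offsets = defaultdict(int)

    def poll(self):
        events = []
        for index, partition in enumerate(self.topic.partitions):
            start = self.offsets[index]
            events.extend(partition[start:])
            self.offsets[index] = len(partition)
        return events


topic = Topic("user-events")  # hypothetical topic name
first_partition = topic.publish("user-42", "registered")
topic.publish("user-7", "payment")
second_partition = topic.publish("user-42", "posted")

# Two subscribers read the same events without affecting each other.
warehouse = Consumer(topic)
monitor = Consumer(topic)
warehouse_events = warehouse.poll()
monitor_events = monitor.poll()
```

Because each consumer keeps its own offsets, adding a new subscriber does not change what existing subscribers receive.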

7
Distributed Data with
CQRS and Kafka

8
Distributed Data with CQRS and Kafka - CQRS at a Glance
Overview
Command Query Responsibility Segregation: read and write workloads are separated, decoupled, and scaled independently.
Event Sourcing
CQRS is often linked with event sourcing, which effectively views data state as a series of discrete events.
Event Sourcing is an approach to handling operations on data that's driven by a sequence of events, each of which is recorded in an append-only store (definition by Microsoft). For example, placing an online order and later returning it under the same user account are recorded as two separate events.
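A minimal sketch of event sourcing, using the order example above. The `OrderEvent` and `EventStore` names and the order identifiers are hypothetical; a real deployment would back the store with a durable log such as a Kafka topic rather than a Python list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OrderEvent:
    user_id: str
    order_id: str
    kind: str  # "placed" or "returned"


class EventStore:
    """Append-only store: events are added and read, never updated or deleted."""

    def __init__(self):
        self._events = []

    def append(self, event):
        self._events.append(event)

    def replay(self):
        # Return an immutable copy so callers cannot rewrite history.
        return tuple(self._events)


def current_orders(events):
    """Derive the current status of each order by folding over the events."""
    status = {}
    for event in events:
        status[event.order_id] = event.kind
    return status


store = EventStore()
store.append(OrderEvent("user-1", "order-100", "placed"))
store.append(OrderEvent("user-1", "order-101", "placed"))
store.append(OrderEvent("user-1", "order-100", "returned"))

state_now = current_orders(store.replay())
# Replaying a prefix of the log reconstructs an earlier state.
state_before_return = current_orders(store.replay()[:2])
```

The current state is never stored directly; it is recomputed from the events, which is what makes auditing and point-in-time reconstruction possible.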

9
Distributed Data with CQRS and Kafka - Traditional Design
Difficult to Scale
The SOR must be able to support the load of all clients and systems. Read replicas can improve scalability.
Single Point of Failure
If the SOR or the API layer is unavailable, all consumers may be affected.
Rigid
All access to SOR data flows through centralized APIs. Consumers receive data in the schemas set up by the access layer.
Difficult to Manipulate Data
Direct data access to the SOR is restricted. Transforms, joins, and analytical operations may be difficult and rely on lagging ETL operations.
Client: external-facing UI, third-party APIs
System: internal-facing ETL, mainframe
SOR: System of Record (the authoritative data source)

10
Distributed Data with CQRS and Kafka - CQRS Design
Data Changes as Events
The current state of the SOR is captured in an event format.
Consumers Subscribe to Changes
Consumers listen to data change events and consume the information according to their own use case.
Other Systems Act on Data
Systems act on data updates as defined by their use case. Systems may replicate the data, enrich the data, or simply process events in real time.
Read / Write Separation
Data reads are segregated from data writes. Read-only consumers introduce no additional load on the SOR.
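The read/write separation on this slide can be sketched as follows. All names here (`EventLog`, `WriteApi`, `ReadModel`, the account identifier) are hypothetical, and the list-based log is only a stand-in for a Kafka topic carrying SOR change events.

```python
class EventLog:
    """Stand-in for a topic that carries SOR change events."""

    def __init__(self):
        self.events = []


class WriteApi:
    """The only component allowed to modify the system of record (SOR)."""

    def __init__(self, log):
        self.sor = {}
        self.log = log

    def set_balance(self, account, balance):
        # Write-side validation lives in one tightly controlled place.
        if balance < 0:
            raise ValueError("balance cannot be negative")
        self.sor[account] = balance
        self.log.events.append({"account": account, "balance": balance})


class ReadModel:
    """Read-only consumer that builds its own view from the change events."""

    def __init__(self, log):
        self.log = log
        self.offset = 0
        self.view = {}

    def catch_up(self):
        for event in self.log.events[self.offset:]:
            self.view[event["account"]] = event["balance"]
        self.offset = len(self.log.events)

    def get_balance(self, account):
        # Served from the local view, so reads never touch the SOR.
        return self.view.get(account)


log = EventLog()
api = WriteApi(log)
read_model = ReadModel(log)

api.set_balance("acct-1", 100)
before = read_model.get_balance("acct-1")  # view has not consumed the event yet
read_model.catch_up()
after = read_model.get_balance("acct-1")
```

The gap between `before` and `after` is the eventual consistency that comes with this pattern: the read side only reflects a write once it has consumed the corresponding event.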

11
Distributed Data with CQRS and Kafka - Advantages and Challenges
Advantages
Independent Scaling
Read and write workloads may be scaled independently based on load and access patterns.
Separation of Concerns
Segregated models allow for tightly controlled write logic while permitting flexibility in read models and stream processing.
System Isolation
Access to the SOR database is restricted to a controlled write API. Consumers may safely read from a replica.
Flexible Consumption
Kafka’s scalable architecture allows consumers to process events differently across systems at different velocities.
Challenges
Eventual Consistency
Reads will be eventually consistent and may have some delay until writes have propagated through the system.
Complexity
Implementation of the pattern increases the complexity of the overall solution.
Different Data Velocity
Consumers may process events at different velocities, resulting in inconsistencies across systems.
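The "different data velocity" challenge can be made concrete by measuring how far each consumer trails the end of the log. The sketch below is an in-memory illustration; the `PacedConsumer` class, the consumer names, and the batch sizes are assumptions chosen for the example.

```python
class PacedConsumer:
    """Consumer that reads a fixed number of events per poll."""

    def __init__(self, name, batch_size):
        self.name = name
        self.batch_size = batch_size
        self.offset = 0
        self.seen = []

    def poll(self, log):
        batch = log[self.offset:self.offset + self.batch_size]
        self.seen.extend(batch)
        self.offset += len(batch)
        return batch

    def lag(self, log):
        # Number of events written to the log but not yet processed here.
        return len(log) - self.offset


event_log = [f"event-{i}" for i in range(10)]
dashboard = PacedConsumer("dashboard", batch_size=5)  # fast consumer
warehouse = PacedConsumer("warehouse", batch_size=2)  # slow consumer

for _ in range(2):
    dashboard.poll(event_log)
    warehouse.poll(event_log)

lags = {c.name: c.lag(event_log) for c in (dashboard, warehouse)}
```

After two polls the fast consumer has caught up while the slow one is still six events behind, so the two systems briefly disagree. Once the slow consumer drains its lag, both have seen the same events in the same order.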

12
Distributed Data with CQRS and Kafka - Common Scenarios
Scenarios to Consider
Complex Data Operations Across Systems
Different systems need to process and transform data in complex and evolving use cases.
Real-time Data Processing Across Systems
Traditional ETL and batch operations are too slow and rigid to meet evolving business requirements. The organization seeks to process data in real time as it becomes available across different systems.
Resource Bottlenecks with Growing Demand
Traditional data system resources are strained and unable to support the growing demands of the business.
Data Security Concerns Across Systems
Data must be shared securely across systems without introducing new security risks.
Increased Demand for Data Sharing Across Enterprise
The enterprise seeks to break down data silos and share data effectively across the organization to increase synergy between systems.

13
Intelligent Forecast
with Kafka

14
Intelligent Forecast with Native Kafka Solution
[Diagram: producers (ETL, raw message queue, change data capture, mainframe, custom) publish to the Kafka cluster. On the consumer side, the ELK Stack subscribes through the Kafka Connector API and drains the data into Elasticsearch for indexing and storage.]
Kafka's role in this solution is to publish the data from the different channels as categorized topics. Through Kafka connector APIs (connector replicators), the ELK Stack subscribes to the specified topics. This pub/sub model is also called event streaming. The customized data can then be rendered through a Kibana dashboard.

15
Challenges
Kafka Connector
Open source replication tools such as MirrorMaker, uReplicator (by Uber), and Mirus (by Salesforce) can serve as alternatives that tackle the scalability bottleneck while reducing licensing cost.
PII encryption
When weighing a Kafka security library against an in-house solution, it is important to establish PII governance early among the producers, the Kafka cluster, and the consumers. In other words, decide who is responsible for masking sensitive data throughout the real-time data streaming pipeline.
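One possible answer to the governance question is to make the producer responsible, so that sensitive values are masked before an event ever reaches the cluster. The sketch below assumes that choice. The field names, the demo key, and the use of keyed hashing (HMAC-SHA256 tokens, which pseudonymize rather than encrypt) are all assumptions for illustration, not a recommendation from the talk.

```python
import hashlib
import hmac
import json

PII_FIELDS = {"card_number", "email"}  # hypothetical sensitive field names
SECRET_KEY = b"demo-key-not-for-production"  # assumption: real keys come from a key manager


def mask_pii(event):
    """Return a copy of the event with sensitive fields replaced by keyed tokens."""
    masked = {}
    for field, value in event.items():
        if field in PII_FIELDS:
            digest = hmac.new(SECRET_KEY, str(value).encode("utf-8"), hashlib.sha256)
            # A truncated token keeps the field joinable without exposing the value.
            masked[field] = digest.hexdigest()[:16]
        else:
            masked[field] = value
    return masked


def serialize_for_topic(event):
    """Producer-side hook: mask first, then encode the payload for publishing."""
    return json.dumps(mask_pii(event), sort_keys=True).encode("utf-8")


raw_event = {"card_number": "4111111111111111", "email": "a@example.com", "amount": 25.0}
payload = serialize_for_topic(raw_event)
```

Because the same input always maps to the same token, downstream consumers can still group or join on the masked field without seeing the original value.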
Intelligent Forecast with Native Kafka Solution
Key Design
Pub/Sub: a decoupled and asynchronous messaging service for scalability
Equivalent Solutions
Azure Event Hubs
Google Pub/Sub
AWS Kinesis
Proactive Analytics
Detect and forecast abnormal trends that fall outside a threshold, in use cases such as transaction fraud at ATMs and restaurant mobile orders.
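As a rough illustration of detecting values outside a threshold on a stream, the sketch below flags an observation when it deviates from a rolling mean by more than `k` standard deviations. The window size, the value of `k`, and the sample amounts are assumptions, and the sketch covers detection only, not forecasting.

```python
from collections import deque
from statistics import mean, pstdev


class ThresholdDetector:
    """Flags values that fall outside mean +/- k * stddev of a rolling window."""

    def __init__(self, window=5, k=3.0):
        self.values = deque(maxlen=window)
        self.k = k

    def observe(self, value):
        is_anomaly = False
        # Only judge once the window holds enough history.
        if len(self.values) == self.values.maxlen:
            mu = mean(self.values)
            sigma = pstdev(self.values)
            # The small floor avoids a zero-width band when all values are equal.
            is_anomaly = abs(value - mu) > self.k * max(sigma, 1e-9)
        # Keep anomalies out of the baseline so they do not widen the band.
        if not is_anomaly:
            self.values.append(value)
        return is_anomaly


detector = ThresholdDetector(window=5, k=3.0)
amounts = [20, 22, 19, 21, 20, 500, 21]  # hypothetical ATM withdrawal amounts
flags = [detector.observe(amount) for amount in amounts]
```

In a streaming pipeline, `observe` would be called once per consumed event, and flagged events could be published to a separate alert topic.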

16
Kafka in Banking

What a Banking Institution Needs to Modernize Its Legacy System
A banking institution has been looking to migrate its legacy system to modern technologies that can meet fast-growing demand in big data, by building a modern data streaming platform as part of its Business Operation Brain (BOB).
Reuse the existing data centers, storage, infrastructure and security procedures
Scalable and reliable (million transactions/events per second) with the existing infrastructure
Data must be logged for transaction tracing and auditing (For example, Change Data Capture)

What options do we have: Kafka and its comparables in the marketplace
Not an exhaustive list of options

Apache Kafka: Open Source, On Prem or Managed Cloud Computing
AWS Kinesis: Proprietary, Managed Cloud Computing
RabbitMQ: Open Source, On Prem
19
Identify KPIs When Evaluating the Options
Options compared: Apache Kafka, AWS Kinesis, RabbitMQ
Operation Cost
AWS Kinesis: Pay as you go; elastic and durable
Messaging
Apache Kafka: Immutable, ordered, replayable messages; user-defined retention policy
RabbitMQ: Queue/message index attached with TTL; messages are removed once consumed
Storage
Apache Kafka: Persistent storage offers durability and reliability; append-only log
AWS Kinesis: Up to 365 days
RabbitMQ: Messages are removed once consumed; in-memory storage is preferred
Scalability
Apache Kafka: Horizontal scaling (scale out), adding more machines to increase disk I/O
AWS Kinesis: Autoscaling
RabbitMQ: Vertical scaling (scale up), adding more CPU and RAM to the existing machine/hardware
Security
Apache Kafka: Customized, manual configuration
AWS Kinesis: Native cloud solution
RabbitMQ: Customized, manual configuration
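The messaging difference in this comparison (a retained, replayable log versus a queue whose messages are removed once consumed) can be sketched in a few lines. Both classes are simplified in-memory stand-ins written for this example; they do not model retention periods, TTLs, acknowledgements, or either product's actual API.

```python
from collections import deque


class RetainedLog:
    """Log-style messaging: reads do not remove records, so they can be replayed."""

    def __init__(self):
        self.records = []

    def append(self, record):
        self.records.append(record)

    def read_from(self, offset):
        return list(self.records[offset:])


class WorkQueue:
    """Queue-style messaging: a message is gone once a consumer has taken it."""

    def __init__(self):
        self._messages = deque()

    def put(self, message):
        self._messages.append(message)

    def drain(self):
        taken = []
        while self._messages:
            taken.append(self._messages.popleft())
        return taken


log = RetainedLog()
queue = WorkQueue()
for tx in ("tx-1", "tx-2", "tx-3"):
    log.append(tx)
    queue.put(tx)

first_log_read = log.read_from(0)
replayed_log_read = log.read_from(0)  # same records again, e.g. for an audit
first_queue_read = queue.drain()
second_queue_read = queue.drain()  # nothing left to re-read
```

For the transaction tracing and auditing requirement in this use case, the ability to re-read the same records is the property that matters.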

20
Key Takeaways

21
Key Takeaways
CQRS Pattern with Kafka
Use the scale, speed, and reliability of Kafka as the backbone for an eventually consistent distributed data solution that allows flexible consumption models and independent scaling.
Kafka in Banking
Objectively select the metrics for the business use case. Design a data streaming solution that is ready to scale.
Intelligent Forecast with Kafka
To reap the scalability benefit, design the Kafka connector solution for future business growth. PII encryption must be applied throughout the Kafka pipeline and automated.

22
Q&A
