In this presentation, we show how Data Reply helped an Austrian fintech customer to overcome previous performance limitations in their data analytics landscape, leverage real-time pipelines, break down monoliths, and foster a self-service data culture to enable new event-driven and business-critical use cases.
Confluent Partner Tech Talk with Reply
1. What are the best practices to debug client applications (producers/consumers in general, but also Kafka Streams applications)?
Starting soon…
3. Tech Talk Q3 - Data Reply
Overcoming Performance Limitations with
"Data in Motion"
Confluent Cloud Free Trial
New signups receive $400
to spend during their first 30 days.
4. Our Partner Technical Sales Enablement offering
Scheduled sessions
Join us for these live sessions, where our experts will guide you through content at different levels and will be available to answer your questions. Some examples of sessions are below:
● Confluent 101: for new starters
● Workshops
● Path to production series
On-demand
Learn the basics with a guided experience, at your own pace, with our on-demand learning paths. You will also find an always-growing repository of more advanced presentations to dig deeper. Some examples are below:
● Confluent 101
● Confluent Use Cases
● Positioning Confluent Value
● Confluent Cloud Networking
● … and many more
Ask the Expert / Workshops
For selected partners, we'll offer additional support, such as:
● Technical Sales workshop
● JIT coaching on spotlight
opportunity
● Build a CoE inside partners by getting people with similar interests together
● Solution discovery
● Tech Talk
● Q&A
6. Goal
Partner Tech Talks are webinars where subject matter experts from a partner talk about a specific use case or project. The goal of Tech Talks is to provide best practices and application insights, along with inspiration, and to help you stay up to date about innovations in the Confluent ecosystem.
8. Confluent Platform 7.5 provides key enhancements to reinforce our core product pillars
Cloud-Native: SSO for Control Center (C3) for Confluent Platform. Streamline Confluent usage and integration with SSO for Control Center, supporting third-party IdPs and REST API use with preferred languages and third-party solutions.
Complete: Confluent REST Proxy Produce v3 API. Streamline interfacing with Confluent Platform with the REST Proxy Produce API v3, supporting custom headers and streaming mode.
Everywhere: Bidirectional Cluster Linking. Optimize disaster recovery and increase reliability with bidirectional Cluster Linking, allowing for active/active architectures.
APACHE KAFKA 3.5
9. Seamless and secure login with SSO for Control Center (C3)
Experience seamless and secure login for any OIDC-supported identity provider, inclusive of added security controls, with Single Sign-On for Control Center
Implement Single Sign-On using a custom language of your choice, even without a Confluent-supported client
Implement Single Sign-On using any identity provider that supports the industry-standard OpenID Connect (OIDC)
Leverage additional security controls such as MFA through the identity provider
Cloud-Native
10. Seamless and secure login with SSO for Control Center
[Diagram: User, Identity Provider (IdP), and Confluent Platform (CP)]
1. User logs into the IdP
2. User selects the application
3. Claims are sent to the user
4. Claims are sent to the application endpoint
5. Response received and user validated
6. Access granted
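Step 5 above (response received and user validated) comes down to verifying the ID token and its claims. Below is a minimal stdlib-only sketch of the claims checks (issuer, audience, expiry) with hypothetical values; a real deployment must also verify the token signature against the IdP's published keys, which Control Center handles for you:

```python
import time

def validate_oidc_claims(claims: dict, issuer: str, client_id: str) -> bool:
    """Check the standard OIDC ID-token claims after signature verification."""
    if claims.get("iss") != issuer:          # token must come from our IdP
        return False
    aud = claims.get("aud")
    # "aud" may be a single audience string or a list of audiences
    if client_id not in (aud if isinstance(aud, list) else [aud]):
        return False
    if claims.get("exp", 0) <= time.time():  # reject expired tokens
        return False
    return True

# Hypothetical claims payload as an IdP would return it
claims = {"iss": "https://idp.example.com", "aud": "control-center",
          "exp": time.time() + 300}
print(validate_oidc_claims(claims, "https://idp.example.com", "control-center"))  # True
```

MFA, as mentioned on the previous slide, is enforced by the IdP before it ever issues these claims, so the application side stays unchanged.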
11. Confluent REST Proxy Produce v3 API
A streamlined interface to CP with REST APIs, now featuring the capability to add custom headers and a streaming mode with v3 of the REST Proxy API.
Enable seamless integrations with serverless solutions, SaaS applications, and legacy integration connectors by incorporating custom headers and streaming mode
Utilize Confluent effortlessly, bypassing the need to delve into the intricacies of the Kafka protocol
Use the REST API with a language of your choice, even if a Confluent-supported client doesn't exist
Complete
12. Simplifying interfacing with Confluent Platform with the REST Produce v3 API
[Diagram: AWS Lambda (serverless), SaaS applications, HTTP client apps, and mainframe/data-center workloads produce over HTTP via REST Produce v3; the REST Proxy translates to the Kafka protocol, alongside native Kafka client apps.]
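As an illustration of the custom-headers capability, here is a hedged Python sketch that builds one record for the v3 records endpoint (`POST /v3/clusters/<cluster-id>/topics/<topic>/records`). The topic and header names are placeholders; header values are base64-encoded, since Kafka record headers are binary:

```python
import base64
import json

def build_produce_record(value: dict, headers: dict[str, bytes]) -> str:
    """Build one JSON record for the REST Proxy Produce v3 API.
    Header values are base64-encoded binary."""
    record = {
        "value": {"type": "JSON", "data": value},
        "headers": [
            {"name": k, "value": base64.b64encode(v).decode("ascii")}
            for k, v in headers.items()
        ],
    }
    return json.dumps(record)

# In streaming mode, many such records can be sent back-to-back over a
# single open HTTP connection to the same /records endpoint.
line = build_produce_record({"order_id": 42}, {"trace-id": b"abc123"})
print(line)
```

This is what lets an HTTP-only client (Lambda, a SaaS webhook, a mainframe job) produce to Kafka without speaking the Kafka protocol.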
13. Bidirectional Cluster Linking
Optimize disaster recovery capabilities and increase reliability when setting up a disaster recovery architecture with Confluent's bidirectional cluster linking.
Facilitate seamless consumer migration, with retained offsets for consistent data processing, with bidirectional cluster links
Increase efficiency and reduce time for data recovery by eliminating the need for custom code
Streamline security configuration with bidirectional links that provide outbound and inbound connections
Everywhere
**Note: bidirectional cluster linking is available for new cluster links only; existing cluster links need to be deleted and re-created to obtain this functionality.
14. Enhanced disaster recovery capabilities with bidirectional cluster linking
[Diagram: Clusters A and B are joined by a bidirectional cluster link, with applications running in both regions. Each side makes its own connection and authenticates with an API key or OAuth for the other cluster, authorized via ACLs/RBAC. Topics on Cluster A are mirrored to Cluster B and topics on Cluster B are mirrored to Cluster A, with data and metadata flowing in both directions.]
17. Cloud data warehouses continue to power business-critical analytics
Legacy Data Warehouses:
● Rigid architecture that makes it hard to integrate with other systems
● Expensive in both upfront and ongoing maintenance costs
● Slow to uncover business insights due to limited analytical capabilities
Cloud Data Warehouses:
● Lower TCO by decoupling storage from compute and leveraging consumption-based pricing
● Increased overall flexibility and business agility
● Scalable analytics with advanced capabilities like machine learning, to power real-time decision-making
18. Challenges with existing approaches to data
pipelines
1. Batch ETL/ELT
Batch-based pipelines use batch ingestion, batch
processing, and batch delivery, which result in
low-fidelity, inconsistent, and stale data.
2. Centralized data teams
Bottlenecks from a centralized, domain-agnostic data
team hinder self-service data access and innovation.
3. Immature governance & observability
A patchwork of point-to-point pipelines has high overhead and lacks observability, data lineage, and data and schema error management.
4. Infra-heavy data processing
Traditional pipelines require intensive, unpredictable
computing and storage with high data volumes and
increasing variety of workloads.
5. Monolithic design
Rigid “black box” pipelines are difficult to change or port
across environments, increasing pipeline sprawl and
technical debt.
[Diagram: On-prem batch jobs & APIs and a DB/data lake feed a legacy data warehouse via ETL/ELT; in the cloud, a SaaS app, CRM, and a DB/data lake feed the cloud data warehouse via separate ETL/ELT pipelines.]
19. Unleash real-time, analytics-ready data in CDWH
with Confluent streaming data pipelines
1. Connect
Break down data silos and stream hybrid, multicloud data from any source to Amazon Redshift using 120+ pre-built connectors.
2. Process
Stream process data in flight with ksqlDB and
use our fully managed service to lower your
cloud data warehouse costs and overall data
pipeline total cost of ownership.
3. Govern
Stream Governance ensures compliance and
data quality for your Cloud Data Warehouse,
allowing teams to focus on building real-time
analytics.
[Diagram: An on-prem data warehouse, heterogeneous databases, a data lake, and SaaS apps stream in real time into the cloud data warehouse: Connect with 120+ pre-built connectors; Process with ksqlDB to join, enrich, aggregate; Govern to reduce risk and ensure data quality.]
21. Who we are
ALEX PIERMATTEO, Associate Partner
7 years of experience building data-driven applications & E2E architectures
Various sectors such as Finance, Telco, Manufacturing and Consumer Entertainment.
MAJID AZIMI, Data Architect
7+ years of experience in event-driven as well as big data architectures
Helped several enterprise clients in Manufacturing, Telco, Automotive and Media with their digital transformation
22. “Leverage your data assets and gain a data-driven competitive advantage with cutting-edge technology”
1. Data Platforms & Cloud Solutions
2. Event-driven & Streaming Applications
3. Data & ML Engineering
DATA PLATFORMS
§ Enterprise Data Analytics Platforms
§ Data Lake(-house)
§ Data Mesh
§ (Post-)Modern Data Stack
§ Data Governance
EVENT-DRIVEN APPLICATIONS
§ Real-Time data applications
§ Event-Driven Microservices & Workflows
§ Event-Driven Edge applications
MLOPS
§ ML Engineering
§ Computer Vision
§ NLP & Conversational AI
§ Anomaly Detection
§ Predictive Maintenance
§ Quality Assurance
DATA ENGINEERING
§ Data processing
§ DataOps
§ Data Security
§ Data Management
§ Data Products
DATA STREAMING
§ Telemetry analytics
§ Continuous Intelligence
§ Real-Time Edge2Cloud Analytics
ANALYTICS
§ Data visualization
§ IoT Analytics
§ Fast OLAP
§ Semantic Search
23. Data Reply & Confluent
§ Working with Confluent since 2015
§ Using Apache Kafka since the beginning (Kafka 0.8.2/Confluent 1.0-Snapshot) for different use cases and architectures
§ Track record of successful platforms and Kafka projects
§ First productive streaming data lake in DACH based on Confluent Kafka at a PayTV provider in 2015
§ Presented at several conferences as well as Confluent webinars (incl. the 1st combined Confluent & Imply webinar in EMEA: FAST OLAP: How and why your data insights should become real-time)
§ Actively supporting Confluent Professional Services at customer engagements
§ Certified Confluent trainers within the team
A Strong Partnership
1. Confluent Premier Consulting & SI Partner
2. Confluent Training Partner
3. Confluent ProServ Partner
25. Case study of our fintech customer
The company in a nutshell
§ Founding: Established in 2014.
§ Mission: Make investing accessible to everyone.
§ Vision: Reimagine investment through user-friendly financial products.
§ Achievements: Over 8 years of success.
§ Team: 700+ members.
§ User Base: 4 million users.
Data-Driven Decision Making
§ Data Collection: Vast data collection daily.
§ Business Intelligence: BI & Analytics team harnesses data.
§ Informed Choices: Daily business decisions driven by insights.
26. The original analytical status quo landscape (early 2022)
Some numbers:
1. 120 tables, 700 GB source database
2. DWH is multi-TB in size, as it contains historical and transactional data
3. 90 stored procedures & 40 views
4. 80+ AWS Glue pipelines from sources to Redshift
5. Redshift & S3 were the main data-sharing approach
6. Many different teams working with this data in silos
27. All this works, but…
Some new challenges were arising:
1. 80+ AWS Glue pipelines running point-to-point data integrations
2. A lot of maintenance work during the day for the Data Engineering team to handle all the different pipelines
3. No concept for data sharing, only small trials with DataHub as a catalog
4. No data ownership
5. No data lineage to understand how all these separate pipelines were transforming the data
6. No real-time capabilities in the DWH, new data coming in only every couple of hours
7. No notion of incremental processing; most data pipelines were traditional erase-copy-rebuild style.
28. All this works, but…
Possible solutions? What would be your first logical choice here to overcome these pain points?
29. The original analytical status quo landscape
Decision for the DWH migration
1. Complex data transformations
§ Such as aggregations, joins and data cleansing; Snowflake for scalable and high-performance SQL queries
§ Airflow for orchestrating these data transformations
2. Diverse data sources
§ Diverse sources such as databases, data lakes, APIs and more.
3. Data sharing and collaboration
§ Business vision to securely share data with external parties or collaborate with partners
§ Snowflake's data sharing capabilities (zero-copy sharing) could simplify this process while maintaining data integrity
§ Airflow to automate data sharing workflows.
4. Real-time data processing
§ The idea was to gradually introduce real-time or near-real-time data processing, and Snowflake's ability to ingest and query streaming data was promising.
Of course, a DWH migration from Redshift to Snowflake + deprecation of Glue 👏👏👏
31. Potential for more improvements
The new ingestion layer
1. Introduction of Kafka (via Confluent Cloud) as the central data entry point for all sources
2. Usage of CC for the Kafka broker maintenance (almost zero operations)
3. We started with only the CDC data from the main DB, but eventually we integrated all sources afterwards
4. Reduced maintenance overhead
5. Data integrated via the Debezium connector, each table as a different Kafka topic; connectors running on K8s for greater flexibility via Confluent Operator
6. Unlocked real-time capabilities for the entire architecture
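The "each table as a different Kafka topic" setup above is typically driven by a connector configuration submitted to the Kafka Connect REST API. A hedged sketch in Python: the hostnames, credentials, and table list are placeholders (the customer's actual source settings are not shown in the slides), and the config keys follow the Debezium PostgreSQL connector's documented options:

```python
import json
import urllib.request

def debezium_connector_config(name: str, tables: list[str]) -> dict:
    """Connector payload for the Kafka Connect REST API (POST /connectors).
    Debezium creates one topic per captured table under the topic prefix."""
    return {
        "name": name,
        "config": {
            "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
            "database.hostname": "source-db.internal",      # placeholder
            "database.port": "5432",
            "database.user": "debezium",
            "database.password": "${secrets:db/password}",  # placeholder
            "database.dbname": "app",
            "topic.prefix": "maindb",   # "database.server.name" in Debezium 1.x
            "table.include.list": ",".join(tables),
        },
    }

def register(connect_url: str, payload: dict) -> None:
    """Submit the connector to a Kafka Connect worker (e.g. on K8s)."""
    req = urllib.request.Request(
        f"{connect_url}/connectors",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)  # raises on HTTP errors

payload = debezium_connector_config("maindb-cdc", ["public.users", "public.orders"])
print(payload["config"]["table.include.list"])  # public.users,public.orders
```

With Confluent Operator, the same config would be expressed as a Connector custom resource rather than a raw REST call, but the keys are identical.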
32. The challenges behind
Ingestion with Debezium – reasons for the usage
Change Data Capture (CDC): Captures real-time changes made to the source database and transforms them into an event stream.
Real-time data integration: Allows other downstream systems to consume and process these changes instantly.
Database decoupling: Reduces the impact of performance issues on the database itself. Helps avoid the multi-write inconsistency problem; the application can write to one data store.
Scalability: Scales horizontally to handle incoming data changes, even if the source database is experiencing performance issues.
33. The challenges behind
Ingestion with Debezium – challenges
Data consistency and ordering: Debezium does its best to provide an ordered stream of events, but we had to implement additional logic to handle certain scenarios.
Schema evolution: Debezium can capture schema changes, but downstream consumers need to be able to handle schema evolution gracefully.
Data volume and throughput: This included configuring the cluster partitions, replication and network bandwidth accordingly.
Performance impact on the source database: The source RDS was already experiencing performance issues, and adding Debezium to capture change data exacerbated these issues. It was critical to monitor the resource usage of the database, and we ended up adding a new read replica for the RDS to mitigate this issue.
34. The challenges behind
DWH Architecture – migration from Redshift to Snowflake
1. Data migration
§ Exporting, transforming and importing data
§ Differences in data types, distribution keys and sort keys between the two platforms can complicate this process
2. Stored procedure transformation
§ Differences in syntax and capabilities
§ To create clean pipelines, we had to template SQL scripts to make them reusable. DBT helped a lot here.
3. Data consistency & schema evolution
§ Ensuring data consistency during migration is critical
§ We needed to create several data quality pipelines that check both DWHs simultaneously for consistency during migration.
4. Testing and validation
§ Rigorous testing of migrated data and procedures is essential to identify and resolve issues before going live
35. The challenges behind
DWH Architecture – ingestion from Kafka to Snowflake via S3
1. Data staging
§ Storing data in S3 acts as a staging area.
§ Decoupling ingestion and transformation steps makes them more flexible and easier to manage.
2. Data lake architecture
§ Provides benefits such as cost efficiency, data versioning and support for different data formats
§ Access with different tools from different departments.
3. Real-time pipeline (where required)
§ For some use cases, a parallel pipeline ingests directly into Snowflake using the Snowflake Sink Connector.
4. Fault tolerance
§ If any part of the pipeline fails, we can easily reprocess data from S3 without losing any information.
5. Cost optimisation
§ Snowflake's pricing model can be impacted by native data storage, so optimising data storage in S3 helped manage costs.
36. From single ingestion layer to "blueprint"
§ Debezium connectors extract all relevant transactions from the source systems.
§ Since Debezium exposes a lot of detail about database transactions, Confluent Cloud ksqlDB will be used to extract the relevant parts and store them in new topics.
§ Raw Debezium topics will be stored as the single source of truth.
§ Confluent Replicator mirrors these new topics into an internal MSK, which is deployed in the DMZ area [ultra-restricted zone for internal access].
§ All the new microservices will start consuming from MSK and store the data in their own data stores.
§ External systems will be fed by Apache Camel using various integration plugins.
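The ksqlDB step of the blueprint can be sketched as follows: expose the raw Debezium topic as a stream, then project only the `after` image and operation type into a clean topic. The stream, column, and topic names here are illustrative, not the customer's actual ones; the statements are submitted to ksqlDB's REST `/ksql` endpoint:

```python
import json

# Illustrative ksqlDB statements over a Debezium envelope
# (fields: before, after, op, ts_ms).
EXTRACT_STMT = """
CREATE STREAM users_raw (
  before STRUCT<id BIGINT, email VARCHAR>,
  after  STRUCT<id BIGINT, email VARCHAR>,
  op     VARCHAR,
  ts_ms  BIGINT
) WITH (KAFKA_TOPIC='maindb.public.users', VALUE_FORMAT='JSON');

CREATE STREAM users_clean WITH (KAFKA_TOPIC='clean.users') AS
  SELECT after->id AS id, after->email AS email, op, ts_ms
  FROM users_raw
  WHERE op = 'c' OR op = 'u'
  EMIT CHANGES;
"""

def ksql_request_body(statements: str) -> str:
    """Body for POST <ksqldb-server>/ksql."""
    return json.dumps({"ksql": statements, "streamsProperties": {}})

body = ksql_request_body(EXTRACT_STMT)
print("users_clean" in body)  # True
```

Keeping the raw topic untouched and deriving the clean one is what lets the raw Debezium topics remain the single source of truth.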
38. What is data transparency?
“Practice of making data, its sources, processing methods, and usage easily accessible and
understandable to relevant stakeholders, ensuring openness and clarity in data management.”
Benefits
§ Informed Decision-Making: Accessible data empowers
users to make informed decisions
§ Trust and Accountability: Transparency builds trust
among stakeholders
§ Data Quality Improvement: Errors and inconsistencies
can be identified and corrected more effectively.
§ Compliance and Governance: Transparency supports
regulatory compliance and data governance by providing
visibility
§ Collaboration: It fosters collaboration by facilitating
shared access to data.
Challenges
§ Technical Infrastructure: Requires investments in data infrastructure, metadata management, and transparency tools.
§ Data Governance: Establishing data governance policies
and enforcing them across the organization.
§ Cultural Shift: It requires a cultural shift within an
organization, as it often involves sharing more data than
previously done.
§ Data Security: Strong security measures are necessary.
§ Data Ownership: Determining data ownership and
responsibilities for data transparency is complex,
especially in large, decentralized organizations.
39. Our solution for data transparency
For the real-time data, with the help of Confluent Cloud
1. Schema Registry
§ Standardised structure for data (format and semantics are clear and consistent)
§ Compatibility: schema evolution is managed, enabling backward and forward compatibility
2. Access control and security
§ Fine-grained access control: policies that specify who has permissions to read, write, and manage Kafka topics.
§ Audit logs: giving a transparent record of who accessed what data and when.
3. Ecosystem compatibility
§ Integration with cloud services ensures that data visibility extends across the data ecosystem
4. Data governance and auditing
§ Data lineage tracking: allowing you to trace the movement of data within Kafka topics.
§ Auditing: audit logs provide a record of data changes, access activity, and system events, providing visibility into data operations.
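Backward compatibility, as managed by Schema Registry above, means a consumer with a newer schema can still read records written with the old one, because added fields carry defaults. A stdlib-only sketch of that resolution step (field names are hypothetical; Schema Registry and the Avro serializers do this for you in practice):

```python
OLD_RECORD = {"id": 7, "email": "a@example.com"}  # written with schema v1

# Schema v2 added an optional field with a default; backward compatibility
# means v2 readers can still consume v1 records. For this sketch, None
# stands for "required field with no default".
V2_FIELDS = {
    "id": None,
    "email": None,
    "marketing_opt_in": False,  # new in v2; the default keeps v1 data readable
}

def resolve(record: dict, fields: dict) -> dict:
    """Fill missing fields from the reader schema's defaults, as an Avro
    reader would during schema resolution."""
    out = {}
    for name, default in fields.items():
        if name in record:
            out[name] = record[name]
        elif default is not None:
            out[name] = default
        else:
            raise ValueError(f"field {name!r} missing and has no default")
    return out

print(resolve(OLD_RECORD, V2_FIELDS))
```

Forward compatibility is the mirror image: old readers simply ignore fields they don't know about.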
40. Our solution for data transparency
DWH Architecture
DataHub Data Catalog integration
§ Unified metadata repository that catalogues data assets across the organisation (Kafka + Snowflake + S3)
§ Understand the available Kafka topics and their schemas through DataHub's easy-to-use interface.
§ DataHub's data lineage capabilities track the flow of data from Kafka topics to downstream systems such as Snowflake.
§ Add context and business glossary terms to make data more transparent in terms of its meaning and use.
Snowflake Data Warehouse integration
§ Implemented semi-automated processes to synchronise schemas between Schema Registry and Snowflake to ensure consistency.
§ Leveraged Snowflake's built-in audit trails and compliance features to track data changes and access activities
41. What is data sharing?
“Providing access to data assets, with authorized users, organizations, or systems, to support
collaborative efforts, analytics, or business processes.”
Benefits
§ Collaboration: Facilitates collaboration by enabling teams
to access and work with shared data resources.
§ Efficiency: Reduces duplication of data and efforts by
centralizing data resources, optimizing resource utilization,
and reducing data silos.
§ Innovation: By making data available for analysis,
experimentation, and the development of new products or
services.
§ Competitive Advantage: By enabling organizations to
respond more effectively to market changes and customer
demands.
§ Cost Savings: Reduces the costs associated with data duplication, storage, and management, especially in cloud-based environments.
Challenges
§ Data Privacy and Security: Unauthorized access and
data breaches must be prevented.
§ Data Governance: Establishing clear data ownership, access policies, and governance practices is essential.
§ Data Quality: Data quality control mechanisms must be in
place.
§ Technical Compatibility: Data sharing may involve
integrating with systems and tools that have varying
technical requirements and compatibility challenges.
§ Data Volume and Scalability: As data sharing grows,
scalability becomes a concern.
§ Data Lifecycle Management: Managing the entire data
lifecycle, including data sharing, archiving, and eventual
retirement, is a complex task.
44. Transaction Checksum Verification
Problem
§ As part of a holistic fraud detection process, there was a need to build a real-time transaction checksum verification component.
§ Transaction checksum verification is used to make sure database changes are applied strictly through the business applications.
§ Any changes outside of the business applications should raise alarms to trigger further auditing.
§ The transaction checksum depends on the user's previous transactions, so it is a stateful checksum.
Solution
§ The solution was achieved by building a stateful Kafka Streams application.
§ This application had two inputs: Debezium messages + ledger information.
§ Merged information and relevant alarms are sent to an output Kafka topic, which triggers alarms in the audit and ledger departments.
[Diagram: Debezium and Ledger topics in Kafka feed a KStreams application, whose output Kafka topic is consumed by Audit.]
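The slides don't disclose the checksum algorithm itself; as an illustration of what a "stateful checksum" can look like, here is a per-user hash-chain sketch in Python (the actual application is a JVM Kafka Streams app; the field names and the GENESIS seed are assumptions). Each transaction's checksum folds in the user's previous checksum, so any change made outside the business applications breaks the chain:

```python
import hashlib

def chain_checksum(prev: str, txn: dict) -> str:
    """Checksum of a transaction, chained to the user's previous checksum."""
    payload = f"{prev}|{txn['user']}|{txn['amount']}|{txn['seq']}"
    return hashlib.sha256(payload.encode()).hexdigest()

class ChecksumVerifier:
    """Stateful verification: this dict mirrors what a Kafka Streams state
    store would hold per user. A mismatch means the DB row was changed
    outside the business application."""
    def __init__(self):
        self.state: dict[str, str] = {}  # user -> last valid checksum

    def verify(self, txn: dict, claimed: str) -> bool:
        prev = self.state.get(txn["user"], "GENESIS")
        expected = chain_checksum(prev, txn)
        if expected != claimed:
            return False          # raise an alarm downstream
        self.state[txn["user"]] = expected
        return True

v = ChecksumVerifier()
t1 = {"user": "u1", "amount": 100, "seq": 1}
c1 = chain_checksum("GENESIS", t1)
print(v.verify(t1, c1))  # True: checksum matches the chain
print(v.verify({"user": "u1", "amount": 999, "seq": 2}, c1))  # False: tampered
```

In the real pipeline, the `False` branch would emit an event to the output topic that alerts the audit and ledger departments.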
45. Real-time Analytics of Leveraged Transactions
Problem
§ Leveraged transactions must be monitored by the finance department to maintain a plausible hedging position.
§ In traditional banking, this is a multi-day process, as the entire system moves slower compared to the cryptocurrency world.
§ Due to the fast-paced environment, the hedging process must be made real-time.
Solution
§ Leveraged transactions are enriched with ledger information by a Kafka Streams application.
§ Enriched information is pushed to Snowflake by utilizing the Snowflake Sink Connector.
§ A materialized view is built on top of the base Snowflake table, with automatic refresh.
§ We were able to achieve a real-time dashboard with a latency within a single minute.
[Diagram: Debezium and Ledger topics in Kafka feed a KStreams application; enriched events land in Kafka and are loaded into Snowflake via the Snowflake Sink Connector, which powers the dashboard.]
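The enrichment step above is a stream-table join: the transaction stream is looked up against ledger state. A minimal Python sketch of that logic, where a dict stands in for the Kafka Streams state store and the field names are illustrative, not the customer's actual schema:

```python
# Ledger state as a Kafka Streams KTable would materialize it (illustrative).
ledger = {
    "acct-1": {"currency": "EUR", "balance": 2_500},
    "acct-2": {"currency": "BTC", "balance": 3},
}

def enrich(txn: dict) -> dict:
    """Stream-table join: attach ledger info to each leveraged transaction."""
    info = ledger.get(txn["account"], {})
    return {**txn, "currency": info.get("currency"), "balance": info.get("balance")}

out = enrich({"account": "acct-1", "amount": 400, "leverage": 5})
print(out["currency"])  # EUR
```

The enriched records are then just rows for the Snowflake Sink Connector; the sub-minute dashboard latency comes from the materialized view's automatic refresh on top of the continuously loaded base table.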
47. Our takeaways
After one and a half years of the project:
§ Confluent Cloud infinite retention is very useful for keeping source topics available at a manageable cost. Infinite retention also accommodates future use cases.
§ Confluent auto-rebalancing makes cluster upgrades incredibly graceful compared to the OSS Kafka distribution.
§ Confluent Cloud Stream Governance is a necessary building block of data discovery and cataloguing.
§ The network cost of re-ingesting big topics by each service can add up. For very high-load topics, it would make sense to have a small Confluent Platform setup in the VPC and use Confluent Replicator.
§ Running hosted connectors is problematic, since Confluent Cloud initiates the connection. In such cases, hosting connectors within the VPC makes sense.
§ Infinite retention is a must, since dismantling the monolith into smaller services is a gradual process; source topics must be retained for the entire migration process, which can take up to years.