20/03/2024
Apache Flink: Building a company-wide self-service Streaming Data Platform
Gleb Shipilov (Data Integration Team Leader, Exness)
Data as Bedrock
Kafka Summit 2024
● Exness is the largest CFD broker by trading volume and active clients
● Every millisecond counts
● As Exness delved deeper into event-driven architecture, the need to process streaming data became paramount
● Each team had to process streaming data on its own, solving the same problems every time:
○ Scalability
○ Fault tolerance
○ State management
○ Security
Streaming Data Platform components
Why Apache Flink?
● Support for several Kafka instances
● Performance
● Fault tolerance
● Support for very large state
● Java-based framework
What challenges have we faced
01 How to provide Python and Go developers with a self-service platform based on a Java framework?
02 How to provide developers with a unified deployment process for all the components and make it simple?
03 How to ensure security?
04 How to flexibly manage and isolate resources between teams?
Flink SQL usage
Flink SQL challenges
● Perfect for simple cases:
○ Aggregating data;
○ Unioning data;
○ Flattening data.
● Works less well for complex cases:
○ A lot of enrichments;
○ Complex business logic;
○ No tracing support.
[Image label: Go developer]
PyFlink usage
PyFlink challenges
● Lack of Apple silicon support (M1 / M2)
● No tracing support out of the box
● Need to use both the Table API and the DataStream API in one PyFlink job
How we addressed them:
● Use Flink 1.17 or later
● Tracing using:
○ OpenTelemetry;
○ Jaeger.
● StreamTableEnvironment to work with both the Table API and the DataStream API
Unified deployment process
● Main components of the deployment process:
○ Terraform;
○ GitLab pipeline;
○ K8S operators.
Streaming Data replication using Terraform over Flink
● Templated Terraform modules instead of many near-identical SQL artefacts
● One module defines the configuration of the whole pipeline, from Kafka topic to S3
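The templating idea above can be sketched in a few lines. The talk uses Terraform modules; the Python below only stands in for the template step, showing how one pipeline description could expand into the source table, sink table, and INSERT statement that would otherwise be written by hand for every topic. The class, field names, and the JSON / Parquet format choices are assumptions for illustration, not the platform's actual module; the connector option keys are those of Flink's Kafka and filesystem SQL connectors.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ReplicationPipeline:
    """Hypothetical description of one Kafka-to-S3 replication pipeline."""
    name: str
    topic: str
    bootstrap_servers: str
    s3_path: str
    columns: Tuple[Tuple[str, str], ...]  # (column name, Flink SQL type) pairs


def render_sql(p: ReplicationPipeline) -> str:
    """Expand one pipeline description into three Flink SQL statements."""
    cols = ",\n  ".join(f"`{name}` {sql_type}" for name, sql_type in p.columns)
    source = (
        f"CREATE TABLE `{p.name}_source` (\n  {cols}\n) WITH (\n"
        "  'connector' = 'kafka',\n"
        f"  'topic' = '{p.topic}',\n"
        f"  'properties.bootstrap.servers' = '{p.bootstrap_servers}',\n"
        f"  'properties.group.id' = '{p.name}',\n"
        "  'scan.startup.mode' = 'earliest-offset',\n"
        "  'format' = 'json'\n"  # assumed input format
        ");"
    )
    sink = (
        f"CREATE TABLE `{p.name}_sink` (\n  {cols}\n) WITH (\n"
        "  'connector' = 'filesystem',\n"
        f"  'path' = '{p.s3_path}',\n"
        "  'format' = 'parquet'\n"  # assumed output format
        ");"
    )
    insert = f"INSERT INTO `{p.name}_sink` SELECT * FROM `{p.name}_source`;"
    return "\n\n".join([source, sink, insert])


if __name__ == "__main__":
    # All values below are placeholders.
    pipeline = ReplicationPipeline(
        name="trades",
        topic="trades.v1",
        bootstrap_servers="kafka:9092",
        s3_path="s3://example-bucket/trades/",
        columns=(("id", "BIGINT"), ("price", "DOUBLE")),
    )
    print(render_sql(pipeline))
```

Adding another replicated topic then means adding one more `ReplicationPipeline` value rather than copying three SQL statements, which is the benefit the slide attributes to templated modules.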
One team–one Flink Cluster
● Security
● Resource management
● Own development environment
● Observability
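As a rough illustration of per-team isolation, the sketch below builds one cluster definition per team, each in its own namespace with its own service account and resource limits. The slides mention K8S operators without naming one; this assumes the Apache Flink Kubernetes Operator and its `FlinkDeployment` resource. The naming scheme, image tag, and resource values are placeholders, not the platform's real settings.

```python
import json


def team_flink_deployment(team: str, tm_cpu: float, tm_memory: str,
                          task_slots: int = 2) -> dict:
    """Build a session-cluster FlinkDeployment manifest for one team.

    Assumes the Apache Flink Kubernetes Operator CRD; names are hypothetical.
    """
    return {
        "apiVersion": "flink.apache.org/v1beta1",
        "kind": "FlinkDeployment",
        "metadata": {
            "name": f"{team}-flink",
            "namespace": f"{team}-streaming",  # one namespace per team
        },
        "spec": {
            "image": "flink:1.17",
            "flinkVersion": "v1_17",
            "serviceAccount": f"{team}-flink",  # per-team identity
            "flinkConfiguration": {
                "taskmanager.numberOfTaskSlots": str(task_slots),
            },
            "jobManager": {"resource": {"cpu": 1, "memory": "2048m"}},
            "taskManager": {"resource": {"cpu": tm_cpu, "memory": tm_memory}},
        },
    }


if __name__ == "__main__":
    # Two hypothetical teams with different resource budgets.
    for manifest in (
        team_flink_deployment("fraud", tm_cpu=4, tm_memory="8192m"),
        team_flink_deployment("marketing", tm_cpu=1, tm_memory="2048m"),
    ):
        print(json.dumps(manifest, indent=2))
```

Because each team's manifest differs only in its arguments, access rules and quotas can be attached to the team namespace instead of to individual jobs.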
Monitoring and alerting
● Separate monitoring for each team
● Slack channels with alerts for each Flink cluster
● One Slack channel for technical support of all the users
The most important projects delivered on our self-service platform
● Trading data processing lag decreased from 2 hours to 2 minutes during peak times
● 1 MLN+ bot activity events prevented in real time
● Fraud and abuse prevention based on real-time data
● Marketing campaigns based on real-time data
Special thanks to:
● Alexey Perminov
● Ilya Soin
● Yury Smirnov
● Igor Matcko
https://medium.com/exness-blog
