© 2019 Ververica
Konstantin Knauf
Head of Product - Ververica Platform
@snntrable
© 2019 Ververica2 = Worst Practice
Don’t use an iterative development process!
Analysis → Project Setup → Development (Testing) → Go Live → Maintenance
© 2019 Ververica3 = Worst Practice
...choose the right setup!
start with one of your most challenging use cases
no training
don’t use the community
no prior stream processing knowledge
© 2019 Ververica4 = Worst Practice
● Community
○ user@flink.apache.org
■ ~ 600 threads per month
■ ∅30h until first response
○ www.stackoverflow.com
■ How not to ask questions?
● https://data.stackexchange.com/stackoverflow/query/1115371/most-down-voted-apache-flink-questions
○ (ASF Slack #flink)
● Training
○ @FlinkForward
○ https://www.ververica.com/training
Getting Help!
© 2019 Ververica5 = Worst Practice
Don’t use an iterative development process!
Analysis → Project Setup → Development (Testing) → Go Live → Maintenance
© 2019 Ververica6 = Worst Practice
...don’t think too much about your requirements.
don’t think about consistency & delivery guarantees
don’t think about the scale of your problem
don’t think too much about your actual business requirements
don’t think about upgrades and application evolution
© 2019 Ververica7 = Worst Practice
Three Questions
1. Do I care about losing records?
No → No checkpointing, any source
Yes → Next question
2. Do I care about correct results?
No → CheckpointingMode.AT_LEAST_ONCE & replayable sources
Yes → Next question
3. Do I care about duplicate (yet correct) records downstream?
No → CheckpointingMode.EXACTLY_ONCE & replayable sources
Yes → CheckpointingMode.EXACTLY_ONCE, replayable sources & transactional sinks
Latency increases as the guarantees get stronger.
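The three questions can be read as a short decision chain. The following is a minimal sketch in plain Java, not Flink code: the `Setup` enum and the `choose` method are hypothetical names introduced for illustration, and only the AT_LEAST_ONCE / EXACTLY_ONCE wording is taken from the slide.

```java
final class GuaranteeChooser {

    // Hypothetical enum for this sketch; the names mirror the options on the slide.
    enum Setup {
        NO_CHECKPOINTING_ANY_SOURCE,
        AT_LEAST_ONCE_REPLAYABLE_SOURCES,
        EXACTLY_ONCE_REPLAYABLE_SOURCES,
        EXACTLY_ONCE_REPLAYABLE_SOURCES_TRANSACTIONAL_SINKS
    }

    // Walks through the three questions in order and stops at the first "No".
    static Setup choose(boolean careAboutLosingRecords,
                        boolean careAboutCorrectResults,
                        boolean careAboutDuplicatesDownstream) {
        if (!careAboutLosingRecords) {
            return Setup.NO_CHECKPOINTING_ANY_SOURCE;
        }
        if (!careAboutCorrectResults) {
            return Setup.AT_LEAST_ONCE_REPLAYABLE_SOURCES;
        }
        if (!careAboutDuplicatesDownstream) {
            return Setup.EXACTLY_ONCE_REPLAYABLE_SOURCES;
        }
        return Setup.EXACTLY_ONCE_REPLAYABLE_SOURCES_TRANSACTIONAL_SINKS;
    }
}
```

For example, `GuaranteeChooser.choose(true, true, false)` selects exactly-once checkpointing with replayable sources but without transactional sinks.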
© 2019 Ververica8 = Worst Practice
[Figure: decision tree over the three questions (losing records? correct results? duplicate records downstream?). Leaves: no checkpointing & any source; CheckpointingMode.AT_LEAST_ONCE & replayable sources; CheckpointingMode.EXACTLY_ONCE & replayable sources; CheckpointingMode.EXACTLY_ONCE, replayable sources & transactional sinks. Axis label: LATENCY.]
© 2019 Ververica9 = Worst Practice
Basics
[Figure: in a Flink job, user code performs local reads / writes that manipulate state in the local state backends; on savepoint, the state is persisted to a DFS as a persisted savepoint.]
© 2019 Ververica10 = Worst Practice
Basics
[Figure: Flink job (user code, local state backends) next to its persisted savepoint.]
Upgrade Application
1. Upgrade Flink cluster
2. Fix bugs
3. Pipeline topology changes
4. Job reconfigurations
5. Adapt state schema
6. ...
© 2019 Ververica11 = Worst Practice
Basics
[Figure: on restore, state is reloaded from the persisted savepoint into the local state backends, and user code continues to access state.]
© 2019 Ververica12 = Worst Practice
flink run -s <savepoint-path> --allowNonRestoredState <jar-file> <arguments>
Sink:
LateDataSink
Topology Changes
© 2019 Ververica13 = Worst Practice
new operator starts ➙ empty state
Still the same?
streamOperator
.uid("AggregatePerSensor")
Topology Changes
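Why stable uids matter can be illustrated with a toy model of savepoint restore. This is a simplified sketch in plain Java under stated assumptions: a savepoint is modelled as a map from operator uid to an opaque state string, and `restore` is a hypothetical helper, not Flink's restore logic. It only mirrors the behaviour described on the slides: state is matched by uid, new operators start with empty state, and savepoint state that matches no operator is only tolerated with --allowNonRestoredState.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

final class UidRestoreSketch {

    // Toy model: maps each operator uid of the new job to the state it starts with.
    static Map<String, String> restore(Map<String, String> savepoint,
                                       Set<String> newJobUids,
                                       boolean allowNonRestoredState) {
        // State in the savepoint that no operator claims is an error unless explicitly allowed.
        for (String uid : savepoint.keySet()) {
            if (!newJobUids.contains(uid) && !allowNonRestoredState) {
                throw new IllegalStateException(
                        "Savepoint state for '" + uid + "' cannot be mapped to any operator");
            }
        }
        Map<String, String> restored = new HashMap<>();
        for (String uid : newJobUids) {
            // Operators without matching savepoint state start with empty state.
            restored.put(uid, savepoint.getOrDefault(uid, ""));
        }
        return restored;
    }
}
```

In this model, an operator that keeps the uid "AggregatePerSensor" gets its old state back, while a renamed or newly added operator starts empty.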
© 2019 Ververica14 = Worst Practice
● Avro Types ✓
○ https://avro.apache.org/docs/1.7.7/spec.html#Schema+Resolution
● Flink POJOs ✓
○ https://ci.apache.org/projects/flink/flink-docs-release-1.9/dev/stream/state/schema_evolution.html#pojo-types
● Kryo ✗
● Key data types cannot be changed
State Schema Evolution
State Unlocked
October 9: 2:30 pm - 3:10 pm
B 07 - B 08
Seth Wiesman, Tzu-Li (Gordon) Tai
● State Processor API to the rescue
© 2019 Ververica15 = Worst Practice
State Size & Network
TaskManager (1 Slot): Source → keyBy → Stateful Operator → Sink
Ingress MB/s: #records/s * size
Shuffle MB/s (appears twice in the figure): #records/s * size * (#tm - 1) / #tm
Egress MB/s: #messages/s * size
FilesystemStatebackend: #key * state_size
Checkpoint MB/s: overall_state_size * 1/checkpoint_interval
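The back-of-the-envelope formulas above can be turned into a small calculator. This is a sketch with assumed example inputs (100,000 records/s of 1,000 bytes each, 4 TaskManagers, 10 million keys with 100 bytes of state each, a 60-second checkpoint interval); the class and method names are hypothetical, and the results are estimates, not measurements.

```java
final class NetworkEstimate {

    // Ingress (and, analogously, egress) rate: #records/s * size, in MB/s.
    static double ingressMBps(long recordsPerSec, long recordSizeBytes) {
        return recordsPerSec * recordSizeBytes / 1_000_000.0;
    }

    // Shuffle rate: the fraction (#tm - 1) / #tm of keyed records leaves the local TaskManager.
    static double shuffleMBps(long recordsPerSec, long recordSizeBytes, int taskManagers) {
        return ingressMBps(recordsPerSec, recordSizeBytes) * (taskManagers - 1) / taskManagers;
    }

    // State size: #key * state_size, in MB.
    static double stateSizeMB(long keys, long stateBytesPerKey) {
        return keys * stateBytesPerKey / 1_000_000.0;
    }

    // Checkpoint rate: overall_state_size * 1/checkpoint_interval, in MB/s.
    static double checkpointMBps(double overallStateSizeMB, double checkpointIntervalSec) {
        return overallStateSizeMB / checkpointIntervalSec;
    }

    public static void main(String[] args) {
        double stateMB = stateSizeMB(10_000_000L, 100L);
        System.out.println("ingress MB/s:    " + ingressMBps(100_000L, 1_000L));
        System.out.println("shuffle MB/s:    " + shuffleMBps(100_000L, 1_000L, 4));
        System.out.println("state size MB:   " + stateMB);
        System.out.println("checkpoint MB/s: " + checkpointMBps(stateMB, 60.0));
    }
}
```

With these assumed inputs, about three quarters of the ingress volume crosses the network again at the keyBy, which is why the shuffle rate matters as much as the ingress rate when sizing.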
© 2019 Ververica16 = Worst Practice
Event Time & Out-Of-Orderness
I want to send an alarm when the number of transactions per customer
exceeds three in ten seconds.
4 8 13 11 15 21
Tumbling Window (10 secs):
Sliding Window (10/5 secs):
“Look Back” (10 secs) :
Fire or wait for watermark?
Fire or wait till watermark reaches end of window?
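How the three window variants differ for this requirement can be checked by hand or with a few lines of plain Java. The sketch below is not Flink code. It assumes the numbers on the slide are event timestamps in seconds, that windows are half-open intervals [start, start + size) aligned to zero, and that all events have already arrived; the class and method names are hypothetical.

```java
import java.util.Map;
import java.util.TreeMap;

final class WindowSemanticsSketch {

    // Counts events per window start for windows of `size` that begin every `slide`
    // (slide == size gives tumbling windows).
    static Map<Long, Integer> countPerWindow(long[] timestamps, long size, long slide) {
        Map<Long, Integer> counts = new TreeMap<>();
        for (long ts : timestamps) {
            long lastStart = ts - Math.floorMod(ts, slide); // latest window start <= ts
            for (long start = lastStart; start > ts - size; start -= slide) {
                counts.merge(start, 1, Integer::sum);
            }
        }
        return counts;
    }

    // "Look back": for each event, counts the events in (ts - lookBack, ts] and returns the maximum.
    static int maxLookBackCount(long[] timestamps, long lookBack) {
        int max = 0;
        for (long ts : timestamps) {
            int count = 0;
            for (long other : timestamps) {
                if (other > ts - lookBack && other <= ts) {
                    count++;
                }
            }
            max = Math.max(max, count);
        }
        return max;
    }

    static int maxCount(Map<Long, Integer> counts) {
        return counts.values().stream().mapToInt(Integer::intValue).max().orElse(0);
    }
}
```

Under these assumptions, neither the tumbling nor the 10/5 sliding windows ever hold more than three of the events 4, 8, 13, 11, 15, 21, while the look-back count reaches four (for example 8, 11, 13, 15 when the event at 15 is evaluated). Only the look-back variant would raise the alarm, so the choice of window type is part of the business requirement.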
© 2019 Ververica17 = Worst Practice
Batch Processing Requirements
I receive two files every ten minutes: transactions.txt & quotes.txt
● They need to be transformed & joined.
● The output must be exactly one file.
● The operation needs to be atomic.
→ Don’t try to re-implement your batch jobs as stream processing jobs.
© 2019 Ververica18 = Worst Practice
Don’t use an iterative development process!
Analysis → Project Setup → Development (Testing) → Production → Maintenance
© 2019 Ververica19 = Worst Practice
Analytics or Application
Applications (physical)         | Analytics (declarative)
DataStream API                  | Table API/SQL
Types are Java / Scala classes  | Logical Schema for Tables
Transformation Functions        | Declarative Language (SQL, Table DSL)
Explicit control over State     | State implicit in operations
Explicit control over Time      | SLAs define when to trigger
Executes as described           | Automatic Optimization
© 2019 Ververica20 = Worst Practice
Table API / SQL Red Flags
● When upgrading Apache Flink or my application I want to migrate state.
● I cannot lose late data.
● I want to change the behaviour of my application during runtime.
© 2019 Ververica21 = Worst Practice
Worst Practices
make use of deeply-nested, complex data types
(don’t think about potential schema evolution)
KeySelectors#getKey can return any serializable type; use this freedom
© 2019 Ververica22 = Worst Practice
● serialization is not to be underestimated
● the simpler your data types the better
● use Flink POJOs or Avro SpecificRecords
● key types matter most
○ part of every KeyedState
○ part of every Timer
● tune locally
Serializer Ops/s
PojoSerializer 305
Kryo 102
Avro (Reflect API) 127
Avro (SpecificRecord API) 297
© 2019 Ververica23 = Worst Practice
sourceStream.flatMap(new Deserializer())
            .keyBy("cities")
            .timeWindow()
            .count()
            .filter(new GeographyFilter("America"))
            .addSink(...)
don’t process data you don’t need
● project early
● filter early
● don’t deserialize unused fields, e.g.
public class Record {
private City city;
private byte[] enclosedRecord;
}
© 2019 Ververica24 = Worst Practice
static variables to share state between Tasks
● bugs
● deadlocks & lock contention (interaction with framework code)
● synchronization overhead
spawning threads in user functions
● complicated & error-prone (checkpointing)
● use Async I/O (AsyncDataStream) to reduce wait times on external IO
● use Timers to schedule tasks
● increase operator parallelism to increase parallelism
© 2019 Ververica25 = Worst Practice
stream.keyBy("key")
      .timeWindow(Time.of(30, DAYS), Time.of(5, SECONDS))
      .apply(new MyWindowFunction())
● each record is added to > 500k windows
● without pre-aggregation
stream.keyBy("key")
      .window(GlobalWindows.create())
      .trigger(new CustomTrigger())
      .evictor(new CustomEvictor())
      .reduce/aggregate/fold/apply()
● Avoid custom windowing
● KeyedProcessFunction usually
○ less error-prone
○ simpler
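The "> 500k windows" figure for the 30-day window sliding every 5 seconds follows from dividing the window size by the slide. A minimal sketch of that arithmetic (the class and method names are hypothetical, and it assumes the size is a multiple of the slide):

```java
import java.time.Duration;

final class SlidingWindowFanOut {

    // Number of sliding windows that each record belongs to: size / slide.
    static long windowsPerRecord(Duration size, Duration slide) {
        return size.toMillis() / slide.toMillis();
    }

    public static void main(String[] args) {
        // 30 days = 2,592,000 seconds; divided by 5 seconds per slide.
        System.out.println(windowsPerRecord(Duration.ofDays(30), Duration.ofSeconds(5)));
    }
}
```

Without pre-aggregation, every incoming record is therefore copied into 518,400 window buffers.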
© 2019 Ververica26 = Worst Practice
Queryable State for Inter-Task Communication
● Non-Thread Safe State Access
○ RocksDBStatebackend ✓
○ FilesystemStatebackend ✗
● Performance?
● Consistency Guarantees?
● Use Queryable State for debugging and
monitoring only
© 2019 Ververica27 = Worst Practice
stream.keyBy("key")
      .flatMap(..)
      .keyBy("key")
      .process(..)
      .keyBy("key")
      .timeWindow(..)
● DataStreamUtils#reinterpretAsKeyedStream
● Note: Stream needs to be partitioned exactly as Flink would partition it.
public void flatMap(Bar bar, Collector<Foo> out) throws Exception {
    MyParserFactory factory = MyParserFactory.newInstance();
    MyParser parser = factory.newParser();
    out.collect(parser.parse(bar));
}
● use RichFunction#open
for initialization logic
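The difference between per-record and one-time initialization can be shown without Flink on the classpath. The sketch below uses hypothetical stand-ins: `Parser` plays the role of MyParser, and `ParsingFunction` only mimics the open()/flatMap() lifecycle of a rich function. It is an illustration of the pattern, not Flink's API.

```java
import java.util.ArrayList;
import java.util.List;

final class OpenLifecycleSketch {

    // Hypothetical stand-in for a parser that is expensive to create.
    static final class Parser {
        static int instancesCreated = 0;

        Parser() {
            instancesCreated++;
        }

        String parse(String raw) {
            return raw.trim().toUpperCase();
        }
    }

    // Hypothetical stand-in that mimics the lifecycle of a rich function.
    static final class ParsingFunction {
        private Parser parser;

        // Called once per function instance, before any record is processed.
        void open() {
            parser = new Parser();
        }

        // Called once per record; reuses the parser created in open().
        void flatMap(String record, List<String> out) {
            out.add(parser.parse(record));
        }
    }

    // Drives the function the way a runtime would: open() once, then flatMap() per record.
    static List<String> run(List<String> records) {
        ParsingFunction function = new ParsingFunction();
        function.open();
        List<String> out = new ArrayList<>();
        for (String record : records) {
            function.flatMap(record, out);
        }
        return out;
    }
}
```

However many records pass through `run`, only one `Parser` is created, whereas the flatMap on the slide creates a factory and a parser for every single record.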
© 2019 Ververica28 = Worst Practice
Don’t use an iterative development process!
Analysis → Project Setup → Development (Testing) → Go Live → Maintenance
© 2019 Ververica30 = Worst Practice
That’s a pyramid!
UDF Unit Tests
AbstractTestHarness
MiniClusterResource
System Tests
© 2019 Ververica31 = Worst Practice
Examples: https://github.com/knaufk/flink-testing-pyramid
Testing Stateful and Timely Operators and Functions
@Test
public void testingStatefulFlatMapFunction() throws Exception {
    // push (timestamped) elements into the operator (and hence the user-defined function)
    testHarness.processElement(2L, 100L);
    // trigger event time timers by advancing the event time of the operator with a watermark
    testHarness.processWatermark(100L);
    // trigger processing time timers by advancing the processing time of the operator directly
    testHarness.setProcessingTime(100L);
    // retrieve list of emitted records for assertions
    assertThat(testHarness.getOutput(), containsInExactlyThisOrder(3L));
    // retrieve list of records emitted to a specific side output for assertions (ProcessFunction only)
    assertThat(testHarness.getSideOutput(new OutputTag<>("invalidRecords")), hasSize(0));
}
© 2019 Ververica32 = Worst Practice
Don’t use an iterative development process!
Analysis → Project Setup → Development (Testing) → Go Live → Maintenance
© 2019 Ververica34 = Worst Practice
Ignore Spiky Loads!
● seasonal fluctuations
● checkpoint alignment
● watermark interval
● GC pauses
● timers
● ...
[Figure: records/s over processing time, with a catch-up scenario marked]
© 2019 Ververica35 = Worst Practice
Monitoring & Metrics
use the Flink Web Interface as monitoring system
● not the right tool for the job → MetricsReporters (e.g. Prometheus, InfluxDB, Datadog)
● too many metrics can bring JobManagers down
if at all, start monitoring now
● don’t miss the chance to learn about Flink’s runtime during development
● not sure how to start → read [1]
using latency markers in production
● high overhead (in particular with metrics.latency.granularity: subtask)
● measure event time lag instead [2]
[1] https://flink.apache.org/news/2019/02/25/monitoring-best-practices.html
[2] https://flink.apache.org/2019/06/05/flink-network-stack.html
© 2019 Ververica36 = Worst Practice
Configuration
choosing RocksDBStatebackend by default
● FilesystemStatebackend is faster & easier to configure
● State Processor API can be used to switch later
NFS/EBS/etc. as state.backend.rocksdb.localdir
● disk IO ultimately determines how fast you can read/write state
playing around with Slots and SlotSharingGroups (too early)
● usually not worth it
● leads to less chaining and additional serialization, more network shuffles and lower resource utilization
© 2019 Ververica37 = Worst Practice
Don’t use an iterative development process!
Analysis → Project Setup → Development (Testing) → Go Live → Maintenance
© 2019 Ververica38 = Worst Practice
with a fast-paced project like Apache Flink, don’t upgrade
© 2019 Ververica
Questions?
© 2019 Ververica
www.ververica.com @snntrable konstantin@ververica.com

A tale of scale & speed: How the US Navy is enabling software delivery from l...
sonjaschweigert1
 

Recently uploaded (20)

Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
 
PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)
 
Elevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object CalisthenicsElevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object Calisthenics
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
 
Assure Contact Center Experiences for Your Customers With ThousandEyes
Assure Contact Center Experiences for Your Customers With ThousandEyesAssure Contact Center Experiences for Your Customers With ThousandEyes
Assure Contact Center Experiences for Your Customers With ThousandEyes
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
 
Introduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - CybersecurityIntroduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - Cybersecurity
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
 
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdfSAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
 
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfObservability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
 
Welocme to ViralQR, your best QR code generator.
Welocme to ViralQR, your best QR code generator.Welocme to ViralQR, your best QR code generator.
Welocme to ViralQR, your best QR code generator.
 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
 
A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...
 

Apache Flink Worst Practices

  • 1. © 2019 Ververica Konstantin Knauf Head of Product - Ververica Platform @snntrable
  • 2. Don’t use an iterative development process! Phases: Analysis, Project Setup, Development, (Testing), Go Live, Maintenance (slide legend: marked items = Worst Practice)
  • 3. ...choose the right setup!
      ● start with one of your most challenging use cases
      ● no training
      ● don’t use the community
      ● no prior stream processing knowledge
  • 4. Getting Help!
      ● Community
          ○ user@flink.apache.org
              ■ ~ 600 threads per month
              ■ ∅ 30h until first response
          ○ www.stackoverflow.com
              ■ How not to ask questions? https://data.stackexchange.com/stackoverflow/query/1115371/most-down-voted-apache-flink-questions
          ○ (ASF Slack #flink)
      ● Training
          ○ @FlinkForward
          ○ https://www.ververica.com/training
  • 5. Don’t use an iterative development process! Phases: Analysis, Project Setup, Development, (Testing), Go Live, Maintenance
  • 6. ...don’t think too much about your requirements.
      ● don’t think about consistency & delivery guarantees
      ● don’t think about the scale of your problem
      ● don’t think too much about your actual business requirements
      ● don’t think about upgrades and application evolution
  • 7. Three Questions (axis label alongside the outcomes: LATENCY)
      1. Do I care about losing records?
           No → No checkpointing, any source
           Yes → Next question
      2. Do I care about correct results?
           No → CheckpointingMode.AT_LEAST_ONCE & replayable sources
           Yes → Next question
      3. Do I care about duplicate (yet correct) records downstream?
           No → CheckpointingMode.EXACTLY_ONCE & replayable sources
           Yes → CheckpointingMode.EXACTLY_ONCE, replayable sources & transactional sinks
  • 8. Decision tree (axis label: LATENCY)
      Do I care about losing records?
        No → No checkpointing, any source
        Yes → Do I care about correct results?
          No → CheckpointingMode.AT_LEAST_ONCE & replayable sources
          Yes → Do I care about duplicate (yet correct) records downstream?
            No → CheckpointingMode.EXACTLY_ONCE & replayable sources
            Yes → CheckpointingMode.EXACTLY_ONCE, replayable sources & transactional sinks
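The three questions can be read as a small decision function. The following is a minimal sketch in plain Java, not Flink code: the class name, the enum `Setup` and the method `recommend` are hypothetical names chosen for illustration, and the four outcomes simply mirror the ones named on the slides.

```java
final class DeliveryGuaranteeAdvisor {

    /** Hypothetical labels for the four outcomes named on the slides. */
    enum Setup {
        NO_CHECKPOINTING_ANY_SOURCE,
        AT_LEAST_ONCE_REPLAYABLE_SOURCES,
        EXACTLY_ONCE_REPLAYABLE_SOURCES,
        EXACTLY_ONCE_REPLAYABLE_SOURCES_TRANSACTIONAL_SINKS
    }

    /** Walks the three questions in order and stops at the first "No". */
    static Setup recommend(boolean caresAboutLosingRecords,
                           boolean caresAboutCorrectResults,
                           boolean caresAboutDownstreamDuplicates) {
        if (!caresAboutLosingRecords) {
            return Setup.NO_CHECKPOINTING_ANY_SOURCE;
        }
        if (!caresAboutCorrectResults) {
            return Setup.AT_LEAST_ONCE_REPLAYABLE_SOURCES;
        }
        if (!caresAboutDownstreamDuplicates) {
            return Setup.EXACTLY_ONCE_REPLAYABLE_SOURCES;
        }
        return Setup.EXACTLY_ONCE_REPLAYABLE_SOURCES_TRANSACTIONAL_SINKS;
    }

    public static void main(String[] args) {
        // Records must not be lost and results must be correct,
        // but downstream duplicates are tolerable.
        System.out.println(recommend(true, true, false));
    }
}
```

Answering the questions early matters because each further "Yes" selects a stronger guarantee, which the slides associate with the LATENCY axis.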
  • 9. Basics (diagram): Flink job user code → local reads / writes that manipulate state → local state backends → persist to DFS on savepoint → persisted savepoint
  • 10. Basics (diagram): Flink job user code, local state backends, persisted savepoint. Upgrade Application:
      1. Upgrade Flink cluster
      2. Fix bugs
      3. Pipeline topology changes
      4. Job reconfigurations
      5. Adapt state schema
      6. ...
  • 11. Basics (diagram): persisted savepoint → reload state into local state backends → Flink job user code continues to access state
  • 12. Topology Changes: flink run --allowNonRestoredState <jar-file> <arguments> (diagram label: Sink: LateDataSink)
  • 13. Topology Changes: new operator starts ➙ empty state. Still the same? streamOperator.uid("AggregatePerSensor")
  • 14. State Schema Evolution
      ● Avro Types ✓
          ○ https://avro.apache.org/docs/1.7.7/spec.html#Schema+Resolution
      ● Flink POJOs ✓
          ○ https://ci.apache.org/projects/flink/flink-docs-release-1.9/dev/stream/state/schema_evolution.html#pojo-types
      ● Kryo ✗
      ● Key data types cannot be changed
      ● State Processor API to the rescue
      Related talk: State Unlocked, October 9, 2:30 pm - 3:10 pm, B 07 - B 08, Seth Wiesman, Tzu-Li (Gordon) Tai
  • 15. State Size & Network (diagram: TaskManager (1 Slot) with Source, keyBy, Stateful Operator, Sink and FilesystemStatebackend)
      ● Ingress MB/s: #records/s * size
      ● Shuffle MB/s (shown twice in the diagram): #records/s * size * (#tm - 1) / #tm
      ● Egress MB/s: #messages/s * size
      ● State held in FilesystemStatebackend: #key * state_size
      ● Checkpoint MB/s: overall_state_size * 1/checkpoint_interval
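The sizing formulas on this slide can be worked through with concrete numbers. The sketch below is plain Java arithmetic under stated assumptions: the workload figures (100,000 records/s of 1,000 bytes, 10 TaskManagers, 50 million keys with 100 bytes of state each, a 60 second checkpoint interval) are invented for illustration, 1 MB is taken as 1,000,000 bytes, and the class and method names are hypothetical.

```java
final class ThroughputEstimate {

    static final double BYTES_PER_MB = 1_000_000.0;

    /** Ingress MB/s = #records/s * size. */
    static double ingressMBps(double recordsPerSecond, double recordSizeBytes) {
        return recordsPerSecond * recordSizeBytes / BYTES_PER_MB;
    }

    /** Shuffle MB/s = #records/s * size * (#tm - 1) / #tm. */
    static double shuffleMBps(double recordsPerSecond, double recordSizeBytes, int taskManagers) {
        return ingressMBps(recordsPerSecond, recordSizeBytes) * (taskManagers - 1) / taskManagers;
    }

    /** State size in MB = #key * state_size. */
    static double stateSizeMB(long keys, double stateBytesPerKey) {
        return keys * stateBytesPerKey / BYTES_PER_MB;
    }

    /** Checkpoint MB/s = overall_state_size * 1 / checkpoint_interval. */
    static double checkpointMBps(double overallStateSizeMB, double checkpointIntervalSeconds) {
        return overallStateSizeMB / checkpointIntervalSeconds;
    }

    public static void main(String[] args) {
        // All workload numbers below are illustrative assumptions.
        double recordsPerSecond = 100_000;
        double recordSizeBytes = 1_000;
        int taskManagers = 10;

        double stateMB = stateSizeMB(50_000_000L, 100);
        System.out.printf("ingress    MB/s: %.2f%n", ingressMBps(recordsPerSecond, recordSizeBytes));
        System.out.printf("shuffle    MB/s: %.2f%n", shuffleMBps(recordsPerSecond, recordSizeBytes, taskManagers));
        System.out.printf("state      MB  : %.2f%n", stateMB);
        System.out.printf("checkpoint MB/s: %.2f%n", checkpointMBps(stateMB, 60));
    }
}
```

With these assumed inputs the estimate comes to 100 MB/s ingress, 90 MB/s per shuffle direction, 5,000 MB of state and roughly 83 MB/s of checkpoint traffic, which shows how quickly checkpointing can rival the data path itself.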
  • 16. Event Time & Out-Of-Orderness: I want to send an alarm when the number of transactions per customer exceeds three in ten seconds.
      Event timestamps shown for each variant: 4 8 13 11 15 21
      ● Tumbling Window (10 secs)
      ● Sliding Window (10/5 secs)
      ● “Look Back” (10 secs)
      Open questions: Fire or wait for watermark? Fire or wait till watermark reaches end of window?
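How much the three readings of "three in ten seconds" differ can be checked on the slide's example timestamps. The sketch below is plain Java, not Flink code, and makes several assumptions: timestamps are whole seconds, windows are aligned to time 0, all events have already arrived (so watermarks and firing are ignored), and "look back" means counting the events in the ten seconds up to and including each event. All names are hypothetical.

```java
import java.util.List;

final class AlarmWindowComparison {

    /** Counts timestamps t with startInclusive <= t < endExclusive. */
    static long countIn(List<Long> timestamps, long startInclusive, long endExclusive) {
        return timestamps.stream()
                .filter(t -> t >= startInclusive && t < endExclusive)
                .count();
    }

    /** Highest count seen in any aligned window; slide == size gives tumbling windows. */
    static long maxWindowCount(List<Long> timestamps, long size, long slide) {
        long maxTimestamp = timestamps.stream().mapToLong(Long::longValue).max().orElse(0L);
        long best = 0;
        for (long start = slide - size; start <= maxTimestamp; start += slide) {
            best = Math.max(best, countIn(timestamps, start, start + size));
        }
        return best;
    }

    /** Highest count seen when looking back `size` seconds from each event, inclusive of the event. */
    static long maxLookBackCount(List<Long> timestamps, long size) {
        long best = 0;
        for (long t : timestamps) {
            best = Math.max(best, countIn(timestamps, t - size + 1, t + 1));
        }
        return best;
    }

    public static void main(String[] args) {
        List<Long> timestamps = List.of(4L, 8L, 13L, 11L, 15L, 21L);
        System.out.println("tumbling 10s : " + maxWindowCount(timestamps, 10, 10));
        System.out.println("sliding 10/5s: " + maxWindowCount(timestamps, 10, 5));
        System.out.println("look back 10s: " + maxLookBackCount(timestamps, 10));
    }
}
```

Under these assumptions only the look-back variant ever sees more than three transactions (8, 11, 13 and 15 fall within ten seconds of the event at 15), so the choice of window semantics decides whether the alarm fires at all.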
  • 17. Batch Processing Requirements: I receive two files every ten minutes: transactions.txt & quotes.txt
      ● They need to be transformed & joined.
      ● The output must be exactly one file.
      ● The operation needs to be atomic.
      → Don’t try to re-implement your batch jobs as stream processing jobs.
  • 18. Don’t use an iterative development process! Phases: Analysis, Project Setup, Development, (Testing), Production, Maintenance
  • 19. Analytics or Application
      Applications (physical): DataStream API
          ● Types are Java / Scala classes
          ● Transformation Functions
          ● Explicit control over State
          ● Explicit control over Time
          ● Executes as described
      Analytics (declarative): Table API/SQL
          ● Logical Schema for Tables
          ● Declarative Language (SQL, Table DSL)
          ● State implicit in operations
          ● SLAs define when to trigger
          ● Automatic Optimization
  • 20. Table API / SQL Red Flags
      ● When upgrading Apache Flink or my application I want to migrate state.
      ● I cannot lose late data.
      ● I want to change the behaviour of my application during runtime.
  • 21. Worst Practices
      ● make use of deeply-nested, complex data types (don’t think about potential schema evolution)
      ● KeySelectors#getKey can return any serializable type; use this freedom
  • 22. ● serialization is not to be underestimated
      ● the simpler your data types the better
      ● use Flink POJOs or Avro SpecificRecords
      ● key types matter most
          ○ part of every KeyedState
          ○ part of every Timer
      ● tune locally
      Serializer                   Ops/s
      PojoSerializer               305
      Kryo                         102
      Avro (Reflect API)           127
      Avro (SpecificRecord API)    297
  • 23. Worst practice:
      sourceStream.flatMap(new Deserializer())
          .keyBy("cities")
          .timeWindow()
          .count()
          .filter(new GeographyFilter("America"))
          .addSink(...)
      don’t process data you don’t need
      ● project early
      ● filter early
      ● don’t deserialize unused fields, e.g.
          public class Record {
            private City city;
            private byte[] enclosedRecord;
          }
  • 24. Worst practices:
      ● static variables to share state between Tasks
      ● spawning threads in user functions
      Consequences:
      ● bugs
      ● deadlocks & lock contention (interaction with framework code)
      ● synchronization overhead
      ● complicated, error-prone (checkpointing)
      Instead:
      ● use AsyncStreams to reduce wait times on external IO
      ● use Timers to schedule tasks
      ● increase operator parallelism to increase parallelism
  • 25. Worst practices:
      stream.keyBy("key")
          .timeWindow(Time.of(30, DAYS), Time.of(5, SECONDS))
          .apply(new MyWindowFunction())
      ● each record is added to > 500k windows
      ● without pre-aggregation
      stream.keyBy("key")
          .window(GlobalWindows.create())
          .trigger(new CustomTrigger())
          .evictor(new CustomEvictor())
          .reduce/aggregate/fold/apply()
      ● Avoid custom windowing
      ● KeyedProcessFunction usually
          ○ less error-prone
          ○ simpler
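The "> 500k windows" figure follows from dividing the window size by the slide: a sliding window assigns each record to size / slide overlapping windows when the size is a multiple of the slide. A minimal sketch of that arithmetic in plain Java (the class and method names are hypothetical):

```java
import java.time.Duration;

final class SlidingWindowFanOut {

    /** Number of sliding windows each record falls into, assuming size is a multiple of slide. */
    static long windowsPerRecord(Duration size, Duration slide) {
        return size.toMillis() / slide.toMillis();
    }

    public static void main(String[] args) {
        // 30 days sliding by 5 seconds, as in the slide's example.
        System.out.println(windowsPerRecord(Duration.ofDays(30), Duration.ofSeconds(5))); // prints 518400
    }
}
```

Without pre-aggregation every one of those windows keeps its own copy of the record, which is why the slide flags this combination.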
  • 26. Worst practice: Queryable State for Inter-Task Communication
      ● Non-Thread Safe State Access
          ○ RocksDBStatebackend ✓
          ○ FilesystemStatebackend ✗
      ● Performance?
      ● Consistency Guarantees?
      ● Use Queryable State for debugging and monitoring only
  • 27. Worst practice:
      stream.keyBy("key")
          .flatMap(..)
          .keyBy("key")
          .process(..)
          .keyBy("key")
          .timeWindow(..)
      ● DataStreamUtils#reinterpretAsKeyedStream
      ● Note: Stream needs to be partitioned exactly as Flink would partition it.
      Worst practice:
      public void flatMap(Bar bar, Collector<Foo> out) throws Exception {
        MyParserFactory factory = MyParserFactory.newInstance();
        MyParser parser = factory.newParser();
        out.collect(parser.parse(bar));
      }
      ● use RichFunction#open for initialization logic
  • 28. Don’t use an iterative development process! Phases: Analysis, Project Setup, Development, (Testing), Go Live, Maintenance
  • 29. That’s a pyramid! Levels: UDF Unit Tests, System Tests
  • 30. That’s a pyramid! Levels: UDF Unit Tests, AbstractTestHarness, MiniClusterResource, System Tests
  • 31. Testing Stateful and Timely Operators and Functions. Examples: https://github.com/knaufk/flink-testing-pyramid
      @Test
      public void testingStatefulFlatMapFunction() throws Exception {
        // push (timestamped) elements into the operator (and hence user defined function)
        testHarness.processElement(2L, 100L);
        // trigger event time timers by advancing the event time of the operator with a watermark
        testHarness.processWatermark(100L);
        // trigger processing time timers by advancing the processing time of the operator directly
        testHarness.setProcessingTime(100L);
        // retrieve list of emitted records for assertions
        assertThat(testHarness.getOutput(), containsInExactlyThisOrder(3L));
        // retrieve list of records emitted to a specific side output for assertions (ProcessFunction only)
        assertThat(testHarness.getSideOutput(new OutputTag<>("invalidRecords")), hasSize(0));
      }
  • 32. Don’t use an iterative development process! Phases: Analysis, Project Setup, Development, (Testing), Go Live, Maintenance
  • 33. Ignore Spiky Loads! (chart: records/s over processing time)
      ● seasonal fluctuations
      ● checkpoint alignment
      ● watermark interval
      ● GC pauses
      ● ...
      ● timers
  • 34. Ignore Spiky Loads! (chart: records/s over processing time)
      ● seasonal fluctuations
      ● checkpoint alignment
      ● watermark interval
      ● GC pauses
      ● ...
      ● timers
      ● catch-up scenario
  • 35. Monitoring & Metrics
      Worst practice: use the Flink Web Interface as monitoring system
      ● not the right tool for the job → MetricsReporters (e.g. Prometheus, InfluxDB, Datadog)
      ● too many metrics can bring JobManagers down
      Worst practice: if at all, start monitoring now
      ● don’t miss the chance to learn about Flink’s runtime during development
      ● not sure how to start → read [1]
      Worst practice: using latency markers in production
      ● high overhead (in particular with metrics.latency.granularity: subtask)
      ● measure event time lag instead [2]
      [1] https://flink.apache.org/news/2019/02/25/monitoring-best-practices.html
      [2] https://flink.apache.org/2019/06/05/flink-network-stack.html
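Event time lag, the measure suggested here in place of latency markers, is simple arithmetic: the current processing time minus the event timestamp (or watermark) being observed. The sketch below is plain Java with hypothetical names; how the value would be exposed as a metric is left out.

```java
final class EventTimeLag {

    /**
     * Lag in milliseconds between "now" and the observed event timestamp.
     * The result can be negative if clocks are skewed or timestamps lie in the future.
     */
    static long lagMillis(long processingTimeMillis, long eventTimestampMillis) {
        return processingTimeMillis - eventTimestampMillis;
    }

    public static void main(String[] args) {
        // Assumed example: an event stamped 2.5 seconds before it is processed.
        long eventTimestamp = System.currentTimeMillis() - 2_500;
        System.out.println(lagMillis(System.currentTimeMillis(), eventTimestamp) + " ms behind");
    }
}
```

A lag that keeps growing indicates the job is falling behind its input.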
  • 36. Configuration
      Worst practice: choosing RocksDBStatebackend by default
      ● FilesystemStatebackend is faster & easier to configure
      ● State Processor API can be used to switch later
      Worst practice: NFS/EBS/etc. as state.backend.rocksdb.localdir
      ● disk IO ultimately determines how fast you can read/write state
      Worst practice: playing around with Slots and SlotSharingGroups (too early)
      ● usually not worth it
      ● leads to less chaining and additional serialization, more network shuffles and lower resource utilization
  • 37. Don’t use an iterative development process! Phases: Analysis, Project Setup, Development, (Testing), Go Live, Maintenance
  • 38. with a fast-paced project like Apache Flink, don’t upgrade
  • 40. www.ververica.com @snntrable konstantin@ververica.com