SlideShare a Scribd company logo
Autoscaling
Timothy Farkas
Senior Software Engineer @ Netflix
Problem
Definition
Our Pain
● Thousands of stateless single source and single sink Flink routers.
● All operators are chained.
● When lag for a router exceeds a threshold we are paged.
Kafka
Consumer
Project
Filter
Sink
Definitions
● Workload: Events being produced to a kafka topic. Two main knobs to turn:
○ Message Rate
○ Message Size
● Lag: The time it would take for a router to process all the remaining
unprocessed events that are buffered in its kafka topic.
● Healthy Router: A router is healthy if it’s lag is always under ~5 minutes.
● Autoscaling Solution: Adjust the number of nodes in the router dynamically
based on the workload to keep the router healthy. Attempt to use the
smallest number of nodes that are required to keep the pipeline healthy.
Solution Space
● Claim: There is no perfect solution. Any autoscaling algorithm can be defeated
by one or more workloads.
● Proof: Take any autoscaling algorithm A. Provide A with a workload W that
does the exact opposite of what A expects whenever A decides to resize the
cluster. =>
A will always make the wrong decision for W by definition. =>
Q.E.D.
○ Understand our limitations.
○ Make assumptions about our workloads.
○ Make a solution that works well when taking both into account.
Limitations
● Rescaling introduces processing pauses.
● Scaling down a Flink job suspends processing for 1 - 3 minutes and possibly
more.
○ CHEAP: Graceful shutdown with savepoint.
○ EXPENSIVE: Remove TMs.
○ CHEAP: Restart from savepoint with reduced parallelism.
● Scaling up a Flink job suspends processing for period < 1 minute.
○ EXPENSIVE: Add TMs.
○ CHEAP: Graceful shutdown with savepoint.
○ CHEAP: Restart from savepoint with increased parallelism.
● There is a two minute delay for propagating metrics through Netflix’s metrics
infrastructure.
Assumptions
● Better to accidentally over allocate than to under allocate.
● Average message size changes infrequently.
● Large spikes in the workload happen, but not frequently.
● Workloads tend to smoothly increase or decrease, when they don’t have a
large spike.
Solution
Desirable Characteristics
● Minimal amount of state
● Deterministic behavior
● Easy to unit test
● Easy to control
Approaches
● Historical Prediction
● Rule Based
● PID Controller
● Statistical Short Term Prediction + Policies
Autoscaling - High Level Steps
● Collect: Receive a batch of metrics for the current 1 minute timebucket.
● Pre-Decision Policy: Apply policies which decide whether the cluster can be
rescaled or whether performance information can be collected about the cluster.
● Decide: Based on latest batch of events, decide whether to:
○ Scale up
○ Scale down
○ Stay the same
○ Also collects cluster performance information
● Calculate Size: If scaling up or down, decide how many nodes need to be
added or removed.
● Post-Decision Policy: Apply policies which can modify scale up and scale
down decisions.
Metrics Collection
Each minute collect the following
● Kafka consumer lag
● Records processed per second
● Cpu utilization
● Max Message Latency
Kafka
Kafka
Consumer
Buffered Events
Router
Sink
Processing
Filter /
Projection
● Kafka messages in per second
● Net in / out utilization
● Sink health metrics
Store the metrics for the past N minutes to inform scaling decisions and to do
regression to predict the workload.
Pre-Decision Policy
Abort autoscaling process if:
● The router has recent task failures
● The router is currently redeploying
Decide - Scale Up
Scale up if:
● There is significant lag AND sink is healthy
● Utilization exceeds the safe threshold AND sink is healthy
Key Insight - Collect cluster performance information:
● If the cluster needs to be scaled up that means the cluster is saturated.
● This is effectively a benchmark for the performance of the cluster at the current
size.
● Save this information in a Performance Table for future scaling decisions.
● More on this in the Performance Table section later.
Decide - Scale Down
Scale down if:
● There is no lag AND we do not anticipate an increase in incoming messages
● More on how we anticipate incoming message rate in the Predict Workload
section later.
Calculate Size
● Predict Workload: Predict the future workload (messages in per second),
while taking spikes into account.
● Target Events Per Second: Compute target events / sec that the pipeline will
need to handle X minutes from now.
● Cluster Size Lookup: Use the target events / sec to estimate the desired
cluster size, which can handle the workload up to X minutes from now.
● Quadratic regression for troughs: ax^2 + bx + c
● Linear regression for everything else: ax + b
Predict Workload
● Assume error for regression is normally distributed and centered at 0.
● Find standard deviation of error.
● Any error greater than 3 * sigma is an outlier.
● After enough consecutive outliers are observed, the baseline is reset.
Spike Detection
Spike Detection - Baseline
Var = (1 + 1 + 1 + 1) / 6 std = sqrt(4 / 6) 3 *std = 2.45
Spike Detection - First Outlier
Spike Detection - Outliers
Spike Detection - Baseline Reset
Calculate Size
● Predict Workload: Predict the future workload (messages in per second),
while taking spikes into account.
● Target Events Per Second: Compute target events / sec that the pipeline will
need to handle X minutes from now.
● Cluster Size Lookup: Use the target events / sec to estimate the desired
cluster size, which can handle the workload up to X minutes from now.
restartTime catchUpTime
recoveryTime = 15 min
2 minutes 13 minutes
t1 t2 = t1 + 15 min
Compute Target Processing Rate
t3 = t1 + 20 min
validTime
1.
2.
3.
Calculate Size
● Predict Workload: Predict the future workload (messages in per second),
while taking spikes into account.
● Target Events Per Second: Compute target events / sec that the pipeline will
need to handle X minutes from now.
● Cluster Size Lookup: Use the target events / sec to estimate the desired
cluster size, which can handle the workload up to X minutes from now.
● Lag and resource usage is high =>
● Pipeline is saturated =>
● We decide to scale up =>
● We know the maximum throughput of the current cluster at the current size =>
● Record the performance in a lookup table
Cluster Size Lookup - The Performance Table
● Given a target rate find the performance records above and below it.
● Do linear interpolation to find the suitable cluster size.
Num Nodes Max Rate
4 10,000
10 20,000
18 35,000
targetRate = 15,000
ratio = (15000 - 10000)
(20000 - 10000)
ratio = .5
clusterSize = .5 (4) + .5 (10)
clusterSize = 7
Performance Table
Cluster Size Lookup - The Performance Table
Num Nodes Max Rate
4 10,000
10 20,000
18 35,000
targetRate = 40,000
clusterSize = (40,000) = 20.57
(35,000 / 18)
ceil(clusterSize) = 21
Performance Table
Cluster Size Lookup - Corner Case
Cluster Size Lookup - Complexities
● Few more corner cases
● Utilization also needs to be taken into account
● Want new cluster size to have reasonable resource utilization 60% or less
Calculate Size
Scale Up vs Scale Down
● Flow and logic is the same
● Minor differences in implementation details
Post-Decision Policy
● Minimum cluster size based on partition count of Kafka topic
● Maximum cluster size based on partition count of Kafka topic
● Cooldown period for scale ups
● Cooldown period for scale downs
● Disable scale downs during region failover (see Region Failover section)
● Safety limit for max scale up. Ex. cannot add more than 50 nodes during a
scale up
● Safety limit for max cluster size
Running In
Production
Architecture Options
● Embed autoscaling in Flink
○ Pros:
■ Lower latency for retrieving metrics.
○ Cons:
■ Complex resource manager interactions get pushed down into Flink.
■ Rescale operation not easily integrated into operations history for the job.
■ Autoscaling changes requires redeploy of the job.
● Run autoscaling as a Mantis pipeline
○ Pros:
■ Flink service control plane handles all resource manager interactions already and it can be
re-used for rescaling the job.
■ Flink service control plane keeps history of all rescale actions.
■ Autoscaling can be changed without redeploying jobs.
○ Cons:
■ 2 minute latency for getting metrics.
Titus
Autoscaling Architecture
Autoscaler
(Mantis Job)
User
Router
User
Router
User
Router
Spinnaker
Flink as a Service
Atlas
Results
Sine Wave
Linear Spikey
Production Router
Fleet Resource Usage
Additional
Considerations
Memory Requirements
● Direct memory has to be preserved for Kafka Consumer and Kafka
Producer
● Direct memory cannot be changed for TMs that are already running
● Smaller clusters require more Direct memory (each node handles more
partitions)
● Larger clusters require less Direct memory (each node handles fewer
partitions)
● Deploy cluster with Direct memory that works for the minimum cluster
size
Partition Balancing
2 Partitions
2 Partitions
1 Partition
1 Partition
1 Partition
1 Partition
● 3 TMs
● 2 Task Slots per TM
● Topic with 8 partitions
TMs with N Task Slots in the worst case
can be assigned N more partitions than
other TMs
● Aggregate consumer lag and
average message latency may
be low.
● A few partitions may have high
latency due to unbalanced
distribution of partitions.
● Round up the cluster so that the
maximum possible partitions per
subtask is reduced by 1.
● Note this is a much looser
requirement on cluster size than
requiring equal distribution of
partitions to subtasks. This
allows finer grained cluster size
control.
Outlier Containers
[Wed Sep 4 17:50:03 2019] EDAC skx MC2: HANDLING MCE MEMORY ERROR
[Wed Sep 4 17:50:03 2019] EDAC skx MC2: CPU 24: Machine Check Event: 0 Bank 13:
cc063f80000800c0
[Wed Sep 4 17:50:03 2019] EDAC skx MC2: TSC 0
[Wed Sep 4 17:50:03 2019] EDAC skx MC2: ADDR 57c24fb4c0
[Wed Sep 4 17:50:03 2019] EDAC skx MC2: MISC 908511010100086
[Wed Sep 4 17:50:03 2019] EDAC skx MC2: PROCESSOR 0:50654 TIME 1567619410 SOCKET 1 APIC 40
[Wed Sep 4 17:50:03 2019] EDAC MC2: 6398 CE memory scrubbing error on
CPU_SrcID#1_MC#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0x57c24fb offset:0x4c0 grain:32
syndrome:0x0 - OVERFLOW err_code:0008:00c0 socket:1 imc:0 rank:1 bg:3 ba:2 row:1aacf col:338)
Region Failover
Netflix
us-west
Netflix
us-east
Netflix
eu-west
User
us-west
User
us-east
User
eu-west
Flink Routers
us-west
Flink Routers
us-west
Flink Routers
us-west
Disable scaledowns in the evacuated
region until traffic comes back.
Future Work
Eager Scale Up
● Current Scale Up Decision:
○ There is significant consumer lag AND sink is healthy
○ Utilization exceeds the safe threshold AND sink is healthy
This is not ideal since in most cases latency builds up in the job before a scale up is
triggered.
● Eager Scale Up Decision:
○ Use the performance table to determine the maximum processing rate of the
current cluster.
○ Use regression to determine if the workload will exceed the processing rate
of the cluster in the near future.
○ If this is the case do a scale up before any lag builds up.
Downscale Optimization
● Current downscale operation
○ CHEAP: Graceful shutdown with savepoint.
○ EXPENSIVE: Remove TMs.
○ CHEAP: Restart from savepoint with reduced parallelism.
● Optimized downscale operation
○ CHEAP: Graceful shutdown with savepoint.
○ CHEAP: Blacklist TMs that will be removed.
○ CHEAP: Restart from savepoint with reduced parallelism.
○ EXPENSIVE: Remove TMs.
Complex DAGs
● Extend the algorithm to support multiple sources and sinks.
● Handle jobs where all operators are not chained.
Acknowledgements
● Steven Wu
● Andrew Nguonly
● Neeraj Joshi
● Mark Cho
● Netflix Flink Team
● Netflix Mantis Team
● Netflix Data Pipeline Team
● Netflix RTDI Team
● https://www.spinnaker.io/
● https://github.com/Netflix/mantis

More Related Content

What's hot

The top 3 challenges running multi-tenant Flink at scale
The top 3 challenges running multi-tenant Flink at scaleThe top 3 challenges running multi-tenant Flink at scale
The top 3 challenges running multi-tenant Flink at scale
Flink Forward
 
Tzu-Li (Gordon) Tai - Stateful Stream Processing with Apache Flink
Tzu-Li (Gordon) Tai - Stateful Stream Processing with Apache FlinkTzu-Li (Gordon) Tai - Stateful Stream Processing with Apache Flink
Tzu-Li (Gordon) Tai - Stateful Stream Processing with Apache Flink
Ververica
 
Building a fully managed stream processing platform on Flink at scale for Lin...
Building a fully managed stream processing platform on Flink at scale for Lin...Building a fully managed stream processing platform on Flink at scale for Lin...
Building a fully managed stream processing platform on Flink at scale for Lin...
Flink Forward
 
A Deep Dive into Kafka Controller
A Deep Dive into Kafka ControllerA Deep Dive into Kafka Controller
A Deep Dive into Kafka Controller
confluent
 
Apache kafka 모니터링을 위한 Metrics 이해 및 최적화 방안
Apache kafka 모니터링을 위한 Metrics 이해 및 최적화 방안Apache kafka 모니터링을 위한 Metrics 이해 및 최적화 방안
Apache kafka 모니터링을 위한 Metrics 이해 및 최적화 방안
SANG WON PARK
 
Data Pipelines with Kafka Connect
Data Pipelines with Kafka ConnectData Pipelines with Kafka Connect
Data Pipelines with Kafka Connect
Kaufman Ng
 
Tame the small files problem and optimize data layout for streaming ingestion...
Tame the small files problem and optimize data layout for streaming ingestion...Tame the small files problem and optimize data layout for streaming ingestion...
Tame the small files problem and optimize data layout for streaming ingestion...
Flink Forward
 
Producer Performance Tuning for Apache Kafka
Producer Performance Tuning for Apache KafkaProducer Performance Tuning for Apache Kafka
Producer Performance Tuning for Apache Kafka
Jiangjie Qin
 
Tuning Apache Kafka Connectors for Flink.pptx
Tuning Apache Kafka Connectors for Flink.pptxTuning Apache Kafka Connectors for Flink.pptx
Tuning Apache Kafka Connectors for Flink.pptx
Flink Forward
 
How Uber scaled its Real Time Infrastructure to Trillion events per day
How Uber scaled its Real Time Infrastructure to Trillion events per dayHow Uber scaled its Real Time Infrastructure to Trillion events per day
How Uber scaled its Real Time Infrastructure to Trillion events per day
DataWorks Summit
 
Apache Flink in the Cloud-Native Era
Apache Flink in the Cloud-Native EraApache Flink in the Cloud-Native Era
Apache Flink in the Cloud-Native Era
Flink Forward
 
Stephan Ewen - Experiences running Flink at Very Large Scale
Stephan Ewen -  Experiences running Flink at Very Large ScaleStephan Ewen -  Experiences running Flink at Very Large Scale
Stephan Ewen - Experiences running Flink at Very Large Scale
Ververica
 
Kafka 101
Kafka 101Kafka 101
Kafka 101
Clement Demonchy
 
Citi Tech Talk Disaster Recovery Solutions Deep Dive
Citi Tech Talk  Disaster Recovery Solutions Deep DiveCiti Tech Talk  Disaster Recovery Solutions Deep Dive
Citi Tech Talk Disaster Recovery Solutions Deep Dive
confluent
 
Kafka At Scale in the Cloud
Kafka At Scale in the CloudKafka At Scale in the Cloud
Kafka At Scale in the Cloud
confluent
 
Apache kafka performance(throughput) - without data loss and guaranteeing dat...
Apache kafka performance(throughput) - without data loss and guaranteeing dat...Apache kafka performance(throughput) - without data loss and guaranteeing dat...
Apache kafka performance(throughput) - without data loss and guaranteeing dat...
SANG WON PARK
 
Scaling Flink in Cloud
Scaling Flink in CloudScaling Flink in Cloud
Scaling Flink in Cloud
Steven Wu
 
Apache Flink internals
Apache Flink internalsApache Flink internals
Apache Flink internals
Kostas Tzoumas
 
Changelog Stream Processing with Apache Flink
Changelog Stream Processing with Apache FlinkChangelog Stream Processing with Apache Flink
Changelog Stream Processing with Apache Flink
Flink Forward
 
Practical learnings from running thousands of Flink jobs
Practical learnings from running thousands of Flink jobsPractical learnings from running thousands of Flink jobs
Practical learnings from running thousands of Flink jobs
Flink Forward
 

What's hot (20)

The top 3 challenges running multi-tenant Flink at scale
The top 3 challenges running multi-tenant Flink at scaleThe top 3 challenges running multi-tenant Flink at scale
The top 3 challenges running multi-tenant Flink at scale
 
Tzu-Li (Gordon) Tai - Stateful Stream Processing with Apache Flink
Tzu-Li (Gordon) Tai - Stateful Stream Processing with Apache FlinkTzu-Li (Gordon) Tai - Stateful Stream Processing with Apache Flink
Tzu-Li (Gordon) Tai - Stateful Stream Processing with Apache Flink
 
Building a fully managed stream processing platform on Flink at scale for Lin...
Building a fully managed stream processing platform on Flink at scale for Lin...Building a fully managed stream processing platform on Flink at scale for Lin...
Building a fully managed stream processing platform on Flink at scale for Lin...
 
A Deep Dive into Kafka Controller
A Deep Dive into Kafka ControllerA Deep Dive into Kafka Controller
A Deep Dive into Kafka Controller
 
Apache kafka 모니터링을 위한 Metrics 이해 및 최적화 방안
Apache kafka 모니터링을 위한 Metrics 이해 및 최적화 방안Apache kafka 모니터링을 위한 Metrics 이해 및 최적화 방안
Apache kafka 모니터링을 위한 Metrics 이해 및 최적화 방안
 
Data Pipelines with Kafka Connect
Data Pipelines with Kafka ConnectData Pipelines with Kafka Connect
Data Pipelines with Kafka Connect
 
Tame the small files problem and optimize data layout for streaming ingestion...
Tame the small files problem and optimize data layout for streaming ingestion...Tame the small files problem and optimize data layout for streaming ingestion...
Tame the small files problem and optimize data layout for streaming ingestion...
 
Producer Performance Tuning for Apache Kafka
Producer Performance Tuning for Apache KafkaProducer Performance Tuning for Apache Kafka
Producer Performance Tuning for Apache Kafka
 
Tuning Apache Kafka Connectors for Flink.pptx
Tuning Apache Kafka Connectors for Flink.pptxTuning Apache Kafka Connectors for Flink.pptx
Tuning Apache Kafka Connectors for Flink.pptx
 
How Uber scaled its Real Time Infrastructure to Trillion events per day
How Uber scaled its Real Time Infrastructure to Trillion events per dayHow Uber scaled its Real Time Infrastructure to Trillion events per day
How Uber scaled its Real Time Infrastructure to Trillion events per day
 
Apache Flink in the Cloud-Native Era
Apache Flink in the Cloud-Native EraApache Flink in the Cloud-Native Era
Apache Flink in the Cloud-Native Era
 
Stephan Ewen - Experiences running Flink at Very Large Scale
Stephan Ewen -  Experiences running Flink at Very Large ScaleStephan Ewen -  Experiences running Flink at Very Large Scale
Stephan Ewen - Experiences running Flink at Very Large Scale
 
Kafka 101
Kafka 101Kafka 101
Kafka 101
 
Citi Tech Talk Disaster Recovery Solutions Deep Dive
Citi Tech Talk  Disaster Recovery Solutions Deep DiveCiti Tech Talk  Disaster Recovery Solutions Deep Dive
Citi Tech Talk Disaster Recovery Solutions Deep Dive
 
Kafka At Scale in the Cloud
Kafka At Scale in the CloudKafka At Scale in the Cloud
Kafka At Scale in the Cloud
 
Apache kafka performance(throughput) - without data loss and guaranteeing dat...
Apache kafka performance(throughput) - without data loss and guaranteeing dat...Apache kafka performance(throughput) - without data loss and guaranteeing dat...
Apache kafka performance(throughput) - without data loss and guaranteeing dat...
 
Scaling Flink in Cloud
Scaling Flink in CloudScaling Flink in Cloud
Scaling Flink in Cloud
 
Apache Flink internals
Apache Flink internalsApache Flink internals
Apache Flink internals
 
Changelog Stream Processing with Apache Flink
Changelog Stream Processing with Apache FlinkChangelog Stream Processing with Apache Flink
Changelog Stream Processing with Apache Flink
 
Practical learnings from running thousands of Flink jobs
Practical learnings from running thousands of Flink jobsPractical learnings from running thousands of Flink jobs
Practical learnings from running thousands of Flink jobs
 

Similar to Virtual Flink Forward 2020: Autoscaling Flink at Netflix - Timothy Farkas

Strata+Hadoop 2017 San Jose: Lessons from a year of supporting Apache Kafka
Strata+Hadoop 2017 San Jose: Lessons from a year of supporting Apache KafkaStrata+Hadoop 2017 San Jose: Lessons from a year of supporting Apache Kafka
Strata+Hadoop 2017 San Jose: Lessons from a year of supporting Apache Kafka
confluent
 
Netflix SRE perf meetup_slides
Netflix SRE perf meetup_slidesNetflix SRE perf meetup_slides
Netflix SRE perf meetup_slides
Ed Hunter
 
Webinar slides: An Introduction to Performance Monitoring for PostgreSQL
Webinar slides: An Introduction to Performance Monitoring for PostgreSQLWebinar slides: An Introduction to Performance Monitoring for PostgreSQL
Webinar slides: An Introduction to Performance Monitoring for PostgreSQL
Severalnines
 
Scaling Monitoring At Databricks From Prometheus to M3
Scaling Monitoring At Databricks From Prometheus to M3Scaling Monitoring At Databricks From Prometheus to M3
Scaling Monitoring At Databricks From Prometheus to M3
LibbySchulze
 
Enabling Presto to handle massive scale at lightning speed
Enabling Presto to handle massive scale at lightning speedEnabling Presto to handle massive scale at lightning speed
Enabling Presto to handle massive scale at lightning speed
Shubham Tagra
 
HBaseCon2017 Improving HBase availability in a multi tenant environment
HBaseCon2017 Improving HBase availability in a multi tenant environmentHBaseCon2017 Improving HBase availability in a multi tenant environment
HBaseCon2017 Improving HBase availability in a multi tenant environment
HBaseCon
 
Netflix Data Pipeline With Kafka
Netflix Data Pipeline With KafkaNetflix Data Pipeline With Kafka
Netflix Data Pipeline With Kafka
Steven Wu
 
Netflix Data Pipeline With Kafka
Netflix Data Pipeline With KafkaNetflix Data Pipeline With Kafka
Netflix Data Pipeline With Kafka
Allen (Xiaozhong) Wang
 
Analyzing Petabyte Scale Financial Data with Apache Pinot and Apache Kafka | ...
Analyzing Petabyte Scale Financial Data with Apache Pinot and Apache Kafka | ...Analyzing Petabyte Scale Financial Data with Apache Pinot and Apache Kafka | ...
Analyzing Petabyte Scale Financial Data with Apache Pinot and Apache Kafka | ...
HostedbyConfluent
 
Salesforce enabling real time scenarios at scale using kafka
Salesforce enabling real time scenarios at scale using kafkaSalesforce enabling real time scenarios at scale using kafka
Salesforce enabling real time scenarios at scale using kafka
Thomas Alex
 
Enabling presto to handle massive scale at lightning speed
Enabling presto to handle massive scale at lightning speedEnabling presto to handle massive scale at lightning speed
Enabling presto to handle massive scale at lightning speed
Shubham Tagra
 
From Three Nines to Five Nines - A Kafka Journey
From Three Nines to Five Nines - A Kafka JourneyFrom Three Nines to Five Nines - A Kafka Journey
From Three Nines to Five Nines - A Kafka Journey
Allen (Xiaozhong) Wang
 
Kafka to the Maxka - (Kafka Performance Tuning)
Kafka to the Maxka - (Kafka Performance Tuning)Kafka to the Maxka - (Kafka Performance Tuning)
Kafka to the Maxka - (Kafka Performance Tuning)
DataWorks Summit
 
Improve Presto Architectural Decisions with Shadow Cache
 Improve Presto Architectural Decisions with Shadow Cache Improve Presto Architectural Decisions with Shadow Cache
Improve Presto Architectural Decisions with Shadow Cache
Alluxio, Inc.
 
In datacenter performance analysis of a tensor processing unit
In datacenter performance analysis of a tensor processing unitIn datacenter performance analysis of a tensor processing unit
In datacenter performance analysis of a tensor processing unit
Jinwon Lee
 
Netflix Keystone Pipeline at Samza Meetup 10-13-2015
Netflix Keystone Pipeline at Samza Meetup 10-13-2015Netflix Keystone Pipeline at Samza Meetup 10-13-2015
Netflix Keystone Pipeline at Samza Meetup 10-13-2015
Monal Daxini
 
AWS Techniques and lessons writing low cost autoscaling GitLab runners
AWS Techniques and lessons writing low cost autoscaling GitLab runnersAWS Techniques and lessons writing low cost autoscaling GitLab runners
AWS Techniques and lessons writing low cost autoscaling GitLab runners
Anthony Scata
 
Distributed Task Scheduling with Akka, Kafka and Cassandra
Distributed Task Scheduling with Akka, Kafka and CassandraDistributed Task Scheduling with Akka, Kafka and Cassandra
Distributed Task Scheduling with Akka, Kafka and Cassandra
David van Geest
 
Cost Dimensions of Kafka - Opti Owl Cloud
Cost Dimensions of Kafka - Opti Owl CloudCost Dimensions of Kafka - Opti Owl Cloud
Cost Dimensions of Kafka - Opti Owl Cloud
SrinivasDevaki
 

Similar to Virtual Flink Forward 2020: Autoscaling Flink at Netflix - Timothy Farkas (20)

Strata+Hadoop 2017 San Jose: Lessons from a year of supporting Apache Kafka
Strata+Hadoop 2017 San Jose: Lessons from a year of supporting Apache KafkaStrata+Hadoop 2017 San Jose: Lessons from a year of supporting Apache Kafka
Strata+Hadoop 2017 San Jose: Lessons from a year of supporting Apache Kafka
 
Netflix SRE perf meetup_slides
Netflix SRE perf meetup_slidesNetflix SRE perf meetup_slides
Netflix SRE perf meetup_slides
 
Webinar slides: An Introduction to Performance Monitoring for PostgreSQL
Webinar slides: An Introduction to Performance Monitoring for PostgreSQLWebinar slides: An Introduction to Performance Monitoring for PostgreSQL
Webinar slides: An Introduction to Performance Monitoring for PostgreSQL
 
Scaling Monitoring At Databricks From Prometheus to M3
Scaling Monitoring At Databricks From Prometheus to M3Scaling Monitoring At Databricks From Prometheus to M3
Scaling Monitoring At Databricks From Prometheus to M3
 
Enabling Presto to handle massive scale at lightning speed
Enabling Presto to handle massive scale at lightning speedEnabling Presto to handle massive scale at lightning speed
Enabling Presto to handle massive scale at lightning speed
 
HBaseCon2017 Improving HBase availability in a multi tenant environment
HBaseCon2017 Improving HBase availability in a multi tenant environmentHBaseCon2017 Improving HBase availability in a multi tenant environment
HBaseCon2017 Improving HBase availability in a multi tenant environment
 
Netflix Data Pipeline With Kafka
Netflix Data Pipeline With KafkaNetflix Data Pipeline With Kafka
Netflix Data Pipeline With Kafka
 
Netflix Data Pipeline With Kafka
Netflix Data Pipeline With KafkaNetflix Data Pipeline With Kafka
Netflix Data Pipeline With Kafka
 
Analyzing Petabyte Scale Financial Data with Apache Pinot and Apache Kafka | ...
Analyzing Petabyte Scale Financial Data with Apache Pinot and Apache Kafka | ...Analyzing Petabyte Scale Financial Data with Apache Pinot and Apache Kafka | ...
Analyzing Petabyte Scale Financial Data with Apache Pinot and Apache Kafka | ...
 
Salesforce enabling real time scenarios at scale using kafka
Salesforce enabling real time scenarios at scale using kafkaSalesforce enabling real time scenarios at scale using kafka
Salesforce enabling real time scenarios at scale using kafka
 
Enabling presto to handle massive scale at lightning speed
Enabling presto to handle massive scale at lightning speedEnabling presto to handle massive scale at lightning speed
Enabling presto to handle massive scale at lightning speed
 
From Three Nines to Five Nines - A Kafka Journey
From Three Nines to Five Nines - A Kafka JourneyFrom Three Nines to Five Nines - A Kafka Journey
From Three Nines to Five Nines - A Kafka Journey
 
Kafka to the Maxka - (Kafka Performance Tuning)
Kafka to the Maxka - (Kafka Performance Tuning)Kafka to the Maxka - (Kafka Performance Tuning)
Kafka to the Maxka - (Kafka Performance Tuning)
 
Improve Presto Architectural Decisions with Shadow Cache
 Improve Presto Architectural Decisions with Shadow Cache Improve Presto Architectural Decisions with Shadow Cache
Improve Presto Architectural Decisions with Shadow Cache
 
In datacenter performance analysis of a tensor processing unit
In datacenter performance analysis of a tensor processing unitIn datacenter performance analysis of a tensor processing unit
In datacenter performance analysis of a tensor processing unit
 
Netflix Keystone Pipeline at Samza Meetup 10-13-2015
Netflix Keystone Pipeline at Samza Meetup 10-13-2015Netflix Keystone Pipeline at Samza Meetup 10-13-2015
Netflix Keystone Pipeline at Samza Meetup 10-13-2015
 
AWS Techniques and lessons writing low cost autoscaling GitLab runners
AWS Techniques and lessons writing low cost autoscaling GitLab runnersAWS Techniques and lessons writing low cost autoscaling GitLab runners
AWS Techniques and lessons writing low cost autoscaling GitLab runners
 
kafka
kafkakafka
kafka
 
Distributed Task Scheduling with Akka, Kafka and Cassandra
Distributed Task Scheduling with Akka, Kafka and CassandraDistributed Task Scheduling with Akka, Kafka and Cassandra
Distributed Task Scheduling with Akka, Kafka and Cassandra
 
Cost Dimensions of Kafka - Opti Owl Cloud
Cost Dimensions of Kafka - Opti Owl CloudCost Dimensions of Kafka - Opti Owl Cloud
Cost Dimensions of Kafka - Opti Owl Cloud
 

More from Flink Forward

Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...
Flink Forward
 
Introducing the Apache Flink Kubernetes Operator
Introducing the Apache Flink Kubernetes OperatorIntroducing the Apache Flink Kubernetes Operator
Introducing the Apache Flink Kubernetes Operator
Flink Forward
 
One sink to rule them all: Introducing the new Async Sink
One sink to rule them all: Introducing the new Async SinkOne sink to rule them all: Introducing the new Async Sink
One sink to rule them all: Introducing the new Async Sink
Flink Forward
 
Flink powered stream processing platform at Pinterest
Flink powered stream processing platform at PinterestFlink powered stream processing platform at Pinterest
Flink powered stream processing platform at Pinterest
Flink Forward
 
Using the New Apache Flink Kubernetes Operator in a Production Deployment
Using the New Apache Flink Kubernetes Operator in a Production DeploymentUsing the New Apache Flink Kubernetes Operator in a Production Deployment
Using the New Apache Flink Kubernetes Operator in a Production Deployment
Flink Forward
 
The Current State of Table API in 2022
The Current State of Table API in 2022The Current State of Table API in 2022
The Current State of Table API in 2022
Flink Forward
 
Flink SQL on Pulsar made easy
Flink SQL on Pulsar made easyFlink SQL on Pulsar made easy
Flink SQL on Pulsar made easy
Flink Forward
 
Dynamic Rule-based Real-time Market Data Alerts
Dynamic Rule-based Real-time Market Data AlertsDynamic Rule-based Real-time Market Data Alerts
Dynamic Rule-based Real-time Market Data Alerts
Flink Forward
 
Processing Semantically-Ordered Streams in Financial Services
Processing Semantically-Ordered Streams in Financial ServicesProcessing Semantically-Ordered Streams in Financial Services
Processing Semantically-Ordered Streams in Financial Services
Flink Forward
 
Batch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & IcebergBatch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & Iceberg
Flink Forward
 
Welcome to the Flink Community!
Welcome to the Flink Community!Welcome to the Flink Community!
Welcome to the Flink Community!
Flink Forward
 
Extending Flink SQL for stream processing use cases
Extending Flink SQL for stream processing use casesExtending Flink SQL for stream processing use cases
Extending Flink SQL for stream processing use cases
Flink Forward
 
Using Queryable State for Fun and Profit
Using Queryable State for Fun and ProfitUsing Queryable State for Fun and Profit
Using Queryable State for Fun and Profit
Flink Forward
 
Large Scale Real Time Fraudulent Web Behavior Detection
Large Scale Real Time Fraudulent Web Behavior DetectionLarge Scale Real Time Fraudulent Web Behavior Detection
Large Scale Real Time Fraudulent Web Behavior Detection
Flink Forward
 
Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap...
Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap...Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap...
Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap...
Flink Forward
 
Building Reliable Lakehouses with Apache Flink and Delta Lake
Building Reliable Lakehouses with Apache Flink and Delta LakeBuilding Reliable Lakehouses with Apache Flink and Delta Lake
Building Reliable Lakehouses with Apache Flink and Delta Lake
Flink Forward
 
Near real-time statistical modeling and anomaly detection using Flink!
Near real-time statistical modeling and anomaly detection using Flink!Near real-time statistical modeling and anomaly detection using Flink!
Near real-time statistical modeling and anomaly detection using Flink!
Flink Forward
 

More from Flink Forward (17)

Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...
 
Introducing the Apache Flink Kubernetes Operator
Introducing the Apache Flink Kubernetes OperatorIntroducing the Apache Flink Kubernetes Operator
Introducing the Apache Flink Kubernetes Operator
 
One sink to rule them all: Introducing the new Async Sink
One sink to rule them all: Introducing the new Async SinkOne sink to rule them all: Introducing the new Async Sink
One sink to rule them all: Introducing the new Async Sink
 
Flink powered stream processing platform at Pinterest
Flink powered stream processing platform at PinterestFlink powered stream processing platform at Pinterest
Flink powered stream processing platform at Pinterest
 
Using the New Apache Flink Kubernetes Operator in a Production Deployment
Using the New Apache Flink Kubernetes Operator in a Production DeploymentUsing the New Apache Flink Kubernetes Operator in a Production Deployment
Using the New Apache Flink Kubernetes Operator in a Production Deployment
 
The Current State of Table API in 2022
The Current State of Table API in 2022The Current State of Table API in 2022
The Current State of Table API in 2022
 
Flink SQL on Pulsar made easy
Flink SQL on Pulsar made easyFlink SQL on Pulsar made easy
Flink SQL on Pulsar made easy
 
Dynamic Rule-based Real-time Market Data Alerts
Dynamic Rule-based Real-time Market Data AlertsDynamic Rule-based Real-time Market Data Alerts
Dynamic Rule-based Real-time Market Data Alerts
 
Processing Semantically-Ordered Streams in Financial Services
Processing Semantically-Ordered Streams in Financial ServicesProcessing Semantically-Ordered Streams in Financial Services
Processing Semantically-Ordered Streams in Financial Services
 
Batch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & IcebergBatch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & Iceberg
 
Welcome to the Flink Community!
Welcome to the Flink Community!Welcome to the Flink Community!
Welcome to the Flink Community!
 
Extending Flink SQL for stream processing use cases
Extending Flink SQL for stream processing use casesExtending Flink SQL for stream processing use cases
Extending Flink SQL for stream processing use cases
 
Using Queryable State for Fun and Profit
Using Queryable State for Fun and ProfitUsing Queryable State for Fun and Profit
Using Queryable State for Fun and Profit
 
Large Scale Real Time Fraudulent Web Behavior Detection
Large Scale Real Time Fraudulent Web Behavior DetectionLarge Scale Real Time Fraudulent Web Behavior Detection
Large Scale Real Time Fraudulent Web Behavior Detection
 
Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap...
Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap...Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap...
Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap...
 
Building Reliable Lakehouses with Apache Flink and Delta Lake
Building Reliable Lakehouses with Apache Flink and Delta LakeBuilding Reliable Lakehouses with Apache Flink and Delta Lake
Building Reliable Lakehouses with Apache Flink and Delta Lake
 
Near real-time statistical modeling and anomaly detection using Flink!
Near real-time statistical modeling and anomaly detection using Flink!Near real-time statistical modeling and anomaly detection using Flink!
Near real-time statistical modeling and anomaly detection using Flink!
 

Recently uploaded

Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
Alan Dix
 
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionGenerative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Aggregage
 
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Albert Hoitingh
 
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdfSAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
Peter Spielvogel
 
By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024
Pierluigi Pugliese
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance
 
20240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 202420240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 2024
Matthew Sinclair
 
Pushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 daysPushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 days
Adtran
 
Introduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - CybersecurityIntroduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - Cybersecurity
mikeeftimakis1
 
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptx
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptxSecstrike : Reverse Engineering & Pwnable tools for CTF.pptx
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptx
nkrafacyberclub
 
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
James Anderson
 
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
Neo4j
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance
 
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfObservability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Paige Cruz
 
Free Complete Python - A step towards Data Science
Free Complete Python - A step towards Data ScienceFree Complete Python - A step towards Data Science
Free Complete Python - A step towards Data Science
RinaMondal9
 
Microsoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdfMicrosoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdf
Uni Systems S.M.S.A.
 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
ControlCase
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
91mobiles
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
KatiaHIMEUR1
 
Climate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing DaysClimate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing Days
Kari Kakkonen
 

Recently uploaded (20)

Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
 
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionGenerative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to Production
 
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
 
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdfSAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
 
By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
 
20240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 202420240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 2024
 
Pushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 daysPushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 days
 
Introduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - CybersecurityIntroduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - Cybersecurity
 
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptx
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptxSecstrike : Reverse Engineering & Pwnable tools for CTF.pptx
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptx
 
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
 
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
 
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfObservability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
 
Free Complete Python - A step towards Data Science
Free Complete Python - A step towards Data ScienceFree Complete Python - A step towards Data Science
Free Complete Python - A step towards Data Science
 
Microsoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdfMicrosoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdf
 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
 
Climate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing DaysClimate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing Days
 

Virtual Flink Forward 2020: Autoscaling Flink at Netflix - Timothy Farkas

  • 3. Our Pain ● Thousands of stateless single source and single sink Flink routers. ● All operators are chained. ● When lag for a router exceeds a threshold we are paged. Kafka Consumer Project Filter Sink
  • 4. Definitions ● Workload: Events being produced to a kafka topic. Two main knobs to turn: ○ Message Rate ○ Message Size ● Lag: The time it would take for a router to process all the remaining unprocessed events that are buffered in its kafka topic. ● Healthy Router: A router is healthy if it’s lag is always under ~5 minutes. ● Autoscaling Solution: Adjust the number of nodes in the router dynamically based on the workload to keep the router healthy. Attempt to use the smallest number of nodes that are required to keep the pipeline healthy.
  • 5. Solution Space ● Claim: There is no perfect solution. Any autoscaling algorithm can be defeated by one or more workloads. ● Proof: Take any autoscaling algorithm A. Provide A with a workload W that does the exact opposite of what A expects whenever A decides to resize the cluster. => A will always make the wrong decision for W by definition. => Q.E.D. ○ Understand our limitations. ○ Make assumptions about our workloads. ○ Make a solution that works well when taking both into account.
  • 6. Limitations ● Rescaling introduces processing pauses. ● Scaling down a Flink job suspends processing for 1 - 3 minutes and possibly more. ○ CHEAP: Graceful shutdown with savepoint. ○ EXPENSIVE: Remove TMs. ○ CHEAP: Restart from savepoint with reduced parallelism. ● Scaling up a Flink job suspends processing for period < 1 minute. ○ EXPENSIVE: Add TMs. ○ CHEAP: Graceful shutdown with savepoint. ○ CHEAP: Restart from savepoint with increased parallelism. ● There is a two minute delay for propagating metrics through Netflix’s metrics infrastructure.
  • 7. Assumptions ● Better to accidentally over allocate than to under allocate. ● Average message size changes infrequently. ● Large spikes in the workload happen, but not frequently. ● Workloads tend to smoothly increase or decrease, when they don’t have a large spike.
  • 9. Desirable Characteristics ● Minimal amount of state ● Deterministic behavior ● Easy to unit test ● Easy to control
  • 10. Approaches ● Historical Prediction ● Rule Based ● PID Controller ● Statistical Short Term Prediction + Policies
  • 11. Autoscaling - High Level Steps ● Collect: Receive a batch of metrics for the current 1 minute timebucket. ● Pre-Decision Policy: Apply policies which decide whether the cluster can be rescaled or whether performance information can be collected about the cluster. ● Decide: Based on latest batch of events, decide whether to: ○ Scale up ○ Scale down ○ Stay the same ○ Also collects cluster performance information ● Calculate Size: If scaling up or down, decide how many nodes need to be added or removed. ● Post-Decision Policy: Apply policies which can modify scale up and scale down decisions.
  • 12. Metrics Collection Each minute collect the following ● Kafka consumer lag ● Records processed per second ● Cpu utilization ● Max Message Latency Kafka Kafka Consumer Buffered Events Router Sink Processing Filter / Projection ● Kafka messages in per second ● Net in / out utilization ● Sink health metrics Store the metrics for the past N minutes to inform scaling decisions and to do regression to predict the workload.
  • 13. Pre-Decision Policy Abort autoscaling process if: ● The router has recent task failures ● The router is currently redeploying
  • 14. Decide - Scale Up Scale up if: ● There is significant lag AND sink is healthy ● Utilization exceeds the safe threshold AND sink is healthy Key Insight - Collect cluster performance information: ● If the cluster needs to be scaled up that means the cluster is saturated. ● This is effectively a benchmark for the performance of the cluster at the current size. ● Save this information in a Performance Table for future scaling decisions. ● More on this in the Performance Table section later.
  • 15. Decide - Scale Down Scale down if: ● There is no lag AND we do not anticipate an increase in incoming messages ● More on how we anticipate incoming message rate in the Predict Workload section later.
  • 16. Calculate Size ● Predict Workload: Predict the future workload (messages in per second), while taking spikes into account. ● Target Events Per Second: Compute target events / sec that the pipeline will need to handle X minutes from now. ● Cluster Size Lookup: Use the target events / sec to estimate the desired cluster size, which can handle the workload up to X minutes from now.
  • 17. ● Quadratic regression for troughs: ax^2 + bx + c ● Linear regression for everything else: ax + b Predict Workload
  • 18. ● Assume error for regression is normally distributed and centered at 0. ● Find standard deviation of error. ● Any error greater than 3 * sigma is an outlier. ● After enough consecutive outliers are observed, the baseline is reset. Spike Detection
  • 19. Spike Detection - Baseline
  • 20. Var = (1 + 1 + 1 + 1) / 6 std = sqrt(4 / 6) 3 *std = 2.45 Spike Detection - First Outlier
  • 21. Spike Detection - Outliers
  • 22. Spike Detection - Baseline Reset
  • 23. Calculate Size ● Predict Workload: Predict the future workload (messages in per second), while taking spikes into account. ● Target Events Per Second: Compute target events / sec that the pipeline will need to handle X minutes from now. ● Cluster Size Lookup: Use the target events / sec to estimate the desired cluster size, which can handle the workload up to X minutes from now.
  • 24. restartTime catchUpTime recoveryTime = 15 min 2 minutes 13 minutes t1 t2 = t1 + 15 min Compute Target Processing Rate t3 = t1 + 20 min validTime 1. 2. 3.
  • 25. Calculate Size ● Predict Workload: Predict the future workload (messages in per second), while taking spikes into account. ● Target Events Per Second: Compute target events / sec that the pipeline will need to handle X minutes from now. ● Cluster Size Lookup: Use the target events / sec to estimate the desired cluster size, which can handle the workload up to X minutes from now.
  • 26. ● Lag and resource usage is high => ● Pipeline is saturated => ● We decide to scale up => ● We know the maximum throughput of the current cluster at the current size => ● Record the performance in a lookup table Cluster Size Lookup - The Performance Table
  • 27. ● Given a target rate find the performance records above and below it. ● Do linear interpolation to find the suitable cluster size. Num Nodes Max Rate 4 10,000 10 20,000 18 35,000 targetRate = 15,000 ratio = (15000 - 10000) (20000 - 10000) ratio = .5 clusterSize = .5 (4) + .5 (10) clusterSize = 7 Performance Table Cluster Size Lookup - The Performance Table
  • 28. Num Nodes Max Rate 4 10,000 10 20,000 18 35,000 targetRate = 40,000 clusterSize = (40,000) = 20.57 (35,000 / 18) ceil(clusterSize) = 21 Performance Table Cluster Size Lookup - Corner Case
  • 29. Cluster Size Lookup - Complexities ● Few more corner cases ● Utilization also needs to be taken into account ● Want new cluster size to have reasonable resource utilization 60% or less
  • 30. Calculate Size Scale Up vs Scale Down ● Flow and logic is the same ● Minor differences in implementation details
  • 31. Post-Decision Policy ● Minimum cluster size based on partition count of Kafka topic ● Maximum cluster size based on partition count of Kafka topic ● Cooldown period for scale ups ● Cooldown period for scale downs ● Disable scale downs during region failover (see Region Failover section) ● Safety limit for max scale up. Ex. cannot add more than 50 nodes during a scale up ● Safety limit for max cluster size
  • 33. Architecture Options ● Embed autoscaling in Flink ○ Pros: ■ Lower latency for retrieving metrics. ○ Cons: ■ Complex resource manager interactions get pushed down into Flink. ■ Rescale operation not easily integrated into operations history for the job. ■ Autoscaling changes requires redeploy of the job. ● Run autoscaling as a Mantis pipeline ○ Pros: ■ Flink service control plane handles all resource manager interactions already and it can be re-used for rescaling the job. ■ Flink service control plane keeps history of all rescale actions. ■ Autoscaling can be changed without redeploying jobs. ○ Cons: ■ 2 minute latency for getting metrics.
  • 41. Memory Requirements ● Direct memory has to be preserved for Kafka Consumer and Kafka Producer ● Direct memory cannot be changed for TMs that are already running ● Smaller clusters require more Direct memory (each node handles more partitions) ● Larger clusters require less Direct memory (each node handles fewer partitions) ● Deploy cluster with Direct memory that works for the minimum cluster size
  • 42. Partition Balancing 2 Partitions 2 Partitions 1 Partition 1 Partition 1 Partition 1 Partition ● 3 TMs ● 2 Task Slots per TM ● Topic with 8 partitions TMs with N Task Slots in the worst case can be assigned N more partitions than other TMs ● Aggregate consumer lag and average message latency may be low. ● A few partitions may have high latency due to unbalanced distribution of partitions. ● Round up the cluster so that the maximum possible partitions per subtask is reduced by 1. ● Note this is a much looser requirement on cluster size than requiring equal distribution of partitions to subtasks. This allows finer grained cluster size control.
  • 43. Outlier Containers [Wed Sep 4 17:50:03 2019] EDAC skx MC2: HANDLING MCE MEMORY ERROR [Wed Sep 4 17:50:03 2019] EDAC skx MC2: CPU 24: Machine Check Event: 0 Bank 13: cc063f80000800c0 [Wed Sep 4 17:50:03 2019] EDAC skx MC2: TSC 0 [Wed Sep 4 17:50:03 2019] EDAC skx MC2: ADDR 57c24fb4c0 [Wed Sep 4 17:50:03 2019] EDAC skx MC2: MISC 908511010100086 [Wed Sep 4 17:50:03 2019] EDAC skx MC2: PROCESSOR 0:50654 TIME 1567619410 SOCKET 1 APIC 40 [Wed Sep 4 17:50:03 2019] EDAC MC2: 6398 CE memory scrubbing error on CPU_SrcID#1_MC#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0x57c24fb offset:0x4c0 grain:32 syndrome:0x0 - OVERFLOW err_code:0008:00c0 socket:1 imc:0 rank:1 bg:3 ba:2 row:1aacf col:338)
  • 44. Region Failover Netflix us-west Netflix us-east Netflix eu-west User us-west User us-east User eu-west Flink Routers us-west Flink Routers us-west Flink Routers us-west Disable scaledowns in the evacuated region until traffic comes back.
  • 46. Eager Scale Up ● Current Scale Up Decision: ○ There is significant consumer lag AND sink is healthy ○ Utilization exceeds the safe threshold AND sink is healthy This is not ideal since in most cases latency builds up in the job before a scale up is triggered. ● Eager Scale Up Decision: ○ Use the performance table to determine the maximum processing rate of the current cluster. ○ Use regression to determine if the workload will exceed the processing rate of the cluster in the near future. ○ If this is the case do a scale up before any lag builds up.
  • 47. Downscale Optimization ● Current downscale operation ○ CHEAP: Graceful shutdown with savepoint. ○ EXPENSIVE: Remove TMs. ○ CHEAP: Restart from savepoint with reduced parallelism. ● Optimized downscale operation ○ CHEAP: Graceful shutdown with savepoint. ○ CHEAP: Blacklist TMs that will be removed. ○ CHEAP: Restart from savepoint with reduced parallelism. ○ EXPENSIVE: Remove TMs.
  • 48. Complex DAGs ● Extend the algorithm to support multiple sources and sinks. ● Handle jobs where all operators are not chained.
  • 49. Acknowledgements ● Steven Wu ● Andrew Nguonly ● Neeraj Joshi ● Mark Cho ● Netflix Flink Team ● Netflix Mantis Team ● Netflix Data Pipeline Team ● Netflix RTDI Team ● https://www.spinnaker.io/ ● https://github.com/Netflix/mantis