Operating Flink on Mesos at Scale
@joerg_schad biswajit@branch.io
© 2018 Mesosphere, Inc. All Rights Reserved.
Jörg Schad
Tech Lead Community @Mesosphere
@joerg_schad
@joerg.mesosphere
Biswajit Das
Chief Architect @Branch
biswajit@branch.io
Apache Mesos in a Nutshell
● Resource Manager
○ Dynamic resource allocation
○ Running multiple applications
○ 2-level scheduling
● Fault-tolerant, battle-tested
● Scalable to 10,000+ nodes
● Created by Mesosphere founder @ UC Berkeley; used in production by 100+ web-scale companies [1]
[1] http://mesos.apache.org/documentation/latest/powered-by-mesos/
Why Flink & Mesos
● Mesos offers full functionality to implement fault-tolerant and elastic distributed applications
● 30% of survey respondents were running Flink on Mesos (prior to proper Mesos support*, September 2016)
● Other deployment models
○ Standalone
○ YARN
○ Kubernetes
*Kudos to Eron Wright for this work
Why Mesos?
Typical Datacenter: siloed, over-provisioned servers, low utilization
[Diagram: siloed servers for Kafka, Kubernetes, HDFS, Flink, Flink Test]
Datacenter
Typical Datacenter: siloed, over-provisioned servers, low utilization
Mesos / DC/OS: automated schedulers, workload multiplexing onto the same machines
[Diagram: HDFS, Kubernetes, Kafka, Flink, Flink 2]
Typical Datacenter: siloed, over-provisioned servers, low utilization
[Diagram labels: Deploy, Scale, Configure, Recover, 3 AM, ...; services: HDFS, Kafka, Kubernetes, Flink, Flink 2]
Two-level Scheduling
1. Agents advertise resources to Master
2. Master offers resources to Framework
3. Framework rejects / uses resources
4. Agent reports task status to Master
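The four steps above can be sketched as a toy simulation. This is not the Mesos API: the `Master`, `Framework`, and `Offer` classes, the agent names, and the resource sizes are all hypothetical, and the offer loop is heavily simplified compared with the real allocator.

```python
from dataclasses import dataclass, field


@dataclass
class Offer:
    agent: str
    cpus: float
    mem_mb: int


@dataclass
class Framework:
    """Hypothetical framework scheduler that needs a fixed slice per task."""
    name: str
    task_cpus: float
    task_mem_mb: int
    tasks_wanted: int
    launched: list = field(default_factory=list)

    def on_offer(self, offer: Offer) -> bool:
        """Step 3: use the offer if it fits and work remains, else reject it."""
        fits = offer.cpus >= self.task_cpus and offer.mem_mb >= self.task_mem_mb
        if fits and len(self.launched) < self.tasks_wanted:
            self.launched.append(offer.agent)
            return True
        return False


class Master:
    def __init__(self):
        self.free = {}    # agent name -> (cpus, mem_mb) still unallocated
        self.status = []  # (agent, framework, state) updates received

    def advertise(self, agent, cpus, mem_mb):
        """Step 1: an agent advertises its resources to the master."""
        self.free[agent] = (cpus, mem_mb)

    def offer_round(self, frameworks):
        """Step 2: offer each agent's free resources to every framework in turn."""
        for agent in list(self.free):
            for fw in frameworks:
                cpus, mem = self.free[agent]
                if fw.on_offer(Offer(agent, cpus, mem)):
                    self.free[agent] = (cpus - fw.task_cpus, mem - fw.task_mem_mb)
                    # Step 4: the agent reports the task status to the master.
                    self.status.append((agent, fw.name, "TASK_RUNNING"))


master = Master()
master.advertise("agent-1", cpus=4, mem_mb=8192)
master.advertise("agent-2", cpus=2, mem_mb=2048)
flink = Framework("flink", task_cpus=2, task_mem_mb=4096, tasks_wanted=2)
kafka = Framework("kafka", task_cpus=1, task_mem_mb=1024, tasks_wanted=1)
master.offer_round([flink, kafka])
```

The point of the split is that the master only decides who is offered what, while each framework decides for itself whether an offer is usable. Here the hypothetical Flink framework rejects agent-2 because the offer has too little memory for one of its tasks.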
MESOS ARCHITECTURE
[Diagram: three Mesos Masters; Mesos Agents running Cassandra, Spark, Docker and CDB executors with their tasks; framework schedulers for Flink, Spark and Kafka]
Powered by Apache Mesos
Flink Mesos Integration (old/simplified)
[Diagram: Apache Flink Framework (Mesos App Master with Flink Mesos ResourceManager and JobManager), Mesos Master, and two Mesos Tasks each running a TaskManager. Arrows: Allocate Resources, Launch Mesos tasks, Register, Execute Job]
Flink Mesos Integration
(1) Marathon: start and monitor dispatcher
(2) Client: HTTP POST JobGraph/Jars
(3) Allocate container for Flink master
(4) Start process (and supervise)
(5) Request slots
(6) Allocate containers for TaskManagers
(7) Register
(8) Deploy tasks
[Diagram: Client, Marathon, Mesos Master, Mesos Cluster, Flink Mesos Dispatcher, Flink Master Process (Flink Mesos ResourceManager, JobManager), two Mesos Tasks each running a TaskManager]
[Diagram labels: SECOR, Tranquility, HDFS/S3, Master, Data/Warehouse, Re-Publish, Streaming Path, Chronos/Schedule, DownStream Batch, private Docker hub, Job Template]
➢ Custom scheduler submits a job once its resource criteria are satisfied
➢ 50 streaming jobs
➢ Stream RPS: 120k
➢ 10B+ events/day
➢ 2.5 TB/day
➢ 200+ node Mesos cluster
➢ Marathon on Marathon
➢ Auto-scaling with custom tool x-scale and ASG
➢ Custom monitoring platform with Prometheus and ELK
Operating Flink on Mesos
Deployments
● Versioned app definition/job
● Immutable Docker tags
● Private Docker registry
● CI/CD
● No manual deployments to Prod
HA Setup
● Use HDFS for HA setup
● dcos package install hdfs
● dcos hdfs endpoints
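As a rough illustration of what the HDFS-backed HA setup means on the Flink side: Flink's high-availability mode uses ZooKeeper for coordination and a storage directory for JobManager metadata, which this slide places on HDFS. The sketch below only renders the relevant `flink-conf.yaml` entries. The helper name, the ZooKeeper address, and the HDFS path are hypothetical placeholders; the key names are the ones I understand Flink to use, so check them against the documentation for your Flink version.

```python
def ha_flink_conf(zk_quorum: str, hdfs_dir: str) -> str:
    """Render the HA-related entries of a flink-conf.yaml as text."""
    entries = {
        "high-availability": "zookeeper",
        "high-availability.zookeeper.quorum": zk_quorum,
        "high-availability.storageDir": hdfs_dir,
    }
    return "\n".join(f"{key}: {value}" for key, value in entries.items())


# Placeholder values; on DC/OS the real HDFS endpoints come from
# `dcos hdfs endpoints`.
print(ha_flink_conf("zk-1.example:2181", "hdfs://namenode.example/flink/ha"))
```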
Containerization
● Which Container Runtime
● UCR vs Docker
● No need to build Docker images
{
  "id": "/flink-app",
  "cmd": "$JAVA_HOME/bin/java -jar MyApp.jar",
  "instances": 1,
  "fetch": [
    { "uri": "http://…/MyApp.jar" },
    { "uri": "https://.../jre-8u121-linux-x64.tar.gz" }
  ]
}
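A versioned app definition like the one above can also be generated from code, for example in a CI/CD step. The following is a minimal sketch under stated assumptions: the `marathon_app` helper, the example URIs, and the `cpus`/`mem` defaults are hypothetical and not from the talk. `"container": {"type": "MESOS"}` is how Marathon selects the Universal Container Runtime, and `fetch` tells Mesos to download the jar and the JRE into the task sandbox, so no Docker image has to be built.

```python
import json


def marathon_app(app_id, jar_uri, jre_uri, cpus=1.0, mem_mb=1024):
    """Build a Marathon app definition that fetches a jar and a JRE at launch."""
    jar_name = jar_uri.rsplit("/", 1)[-1]
    return {
        "id": app_id,
        # $JAVA_HOME must point at the extracted JRE inside the sandbox.
        "cmd": f"$JAVA_HOME/bin/java -jar {jar_name}",
        "instances": 1,
        "cpus": cpus,
        "mem": mem_mb,
        # "MESOS" selects the Universal Container Runtime instead of Docker.
        "container": {"type": "MESOS"},
        # Each URI is downloaded into the sandbox; the archive is extracted.
        "fetch": [{"uri": jar_uri}, {"uri": jre_uri, "extract": True}],
    }


# Hypothetical artifact locations, for illustration only.
app = marathon_app(
    "/flink-app",
    "http://artifacts.example/MyApp.jar",
    "http://artifacts.example/jre.tar.gz",
)
print(json.dumps(app, indent=2))
```

The resulting JSON can be kept under version control and submitted to Marathon by the deployment pipeline rather than by hand.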
Containerization
● JVM and Container
○ Not aware of cgroups
○ Much better with JDK 9 & 10
○ Overwrite JVM default values
https://cloakable.irdeto.com/2017/08/24/java-is-a-first-class-citizen-in-a-docker-ecosystem-now/
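One way to overwrite the JVM defaults is to compute explicit heap flags from the container's memory limit, since an older JVM that ignores cgroups sizes its default heap from the host's memory instead. The sketch below is illustrative only: the function name and the 25% overhead share are assumptions, not values from the talk.

```python
def jvm_memory_flags(container_mem_mb: int, overhead_fraction: float = 0.25) -> list:
    """Derive explicit heap flags from the container memory limit.

    overhead_fraction is an assumed share kept free for non-heap memory
    (metaspace, thread stacks, direct buffers); tune it per job.
    """
    if not 0 <= overhead_fraction < 1:
        raise ValueError("overhead_fraction must be in [0, 1)")
    heap_mb = int(container_mem_mb * (1 - overhead_fraction))
    # Setting both flags pins the heap size instead of trusting JVM defaults.
    return [f"-Xms{heap_mb}m", f"-Xmx{heap_mb}m"]


print(jvm_memory_flags(4096))  # ['-Xms3072m', '-Xmx3072m']
```

Leaving headroom below the container limit matters because the container runtime enforces the limit on the whole process, not just the heap.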
Resource Allocation
● Depends on the job you are running :)
○ Monitor usage/allocation
● Memory
○ Consider overhead in addition to heap
● Flexibility thanks to FLIP-6
Multi-User: Quota
● Share resources between multiple frameworks/jobs
● Without static partitioning
● One role per job/entity
● Use quota per role
● Min and max resource allocation
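To make "quota per role" concrete, the sketch below builds a request body in the shape that, to my understanding, the Mesos master's `/quota` operator endpoint accepted in the Mesos 1.x releases current at the time of the talk. The role name and the resource sizes are hypothetical, and the exact format should be checked against the quota documentation for your Mesos version.

```python
import json


def quota_request(role: str, cpus: float, mem_mb: float) -> str:
    """Build the JSON body for setting a quota guarantee on one role."""

    def scalar(name, value):
        return {"name": name, "type": "SCALAR", "scalar": {"value": value}}

    body = {
        "role": role,
        "guarantee": [scalar("cpus", cpus), scalar("mem", mem_mb)],
    }
    return json.dumps(body, indent=2)


# Hypothetical role and sizes: one role per Flink job.
print(quota_request("flink-job-a", cpus=12, mem_mb=6144))
```

With one role per job, each job gets its own guarantee without carving the cluster into static partitions.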
Monitoring
Configuration Changes and Updates
Currently manual changes and redeploy
● Checkpoints
● Parallel Deployments
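A manual change-and-redeploy cycle might look like the sketch below. It assumes the state is carried over with a Flink savepoint (the slide only says "Checkpoints") and uses the standard `flink savepoint`, `flink cancel`, and `flink run -s` CLI commands. The helper name, job ID, paths, and jar name are hypothetical.

```python
def redeploy_commands(job_id: str, savepoint_dir: str, savepoint_path: str, jar: str) -> list:
    """Return the Flink CLI invocations for a stop-and-resume redeploy."""
    return [
        # Trigger a savepoint for the running job into the target directory.
        ["flink", "savepoint", job_id, savepoint_dir],
        # Cancel the old job once the savepoint has completed.
        ["flink", "cancel", job_id],
        # Start the updated jar, detached, from the savepoint path that the
        # first command reported.
        ["flink", "run", "-d", "-s", savepoint_path, jar],
    ]


# Placeholder job ID and paths, for illustration only.
for cmd in redeploy_commands(
    "<job-id>", "hdfs:///flink/savepoints", "<reported-savepoint-path>", "my-job.jar"
):
    print(" ".join(cmd))
```

For a parallel deployment, the new job would instead be started from the savepoint while the old one keeps running until the switch-over.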
Demo
1. Financial data created by generator
2. Written to Kafka topics
3. Kafka topics consumed by Flink
4. Results written back into Kafka stream (another topic)
7. Results displayed
[Diagram: Generator, Display]
Special Thanks to All Collaborators
Till Rohrmann
Eron Wright
Robin Oh
Mischa Krüger
...
● Contribute!
○ Flink
○ Flink/Mesos
○ DC/OS package
○ Documentation
○ ...

Flink Forward San Francisco 2018: Jörg Schad and Biswajit Das - "Operating Flink on Mesos at Scale"
