Moving Toward
Network Isolated Containers
Clément Michaud
Frédéric Boismenu
MesosCon
November 5th, 2018
1
About us
Clément Michaud
Currently
SRE @ Criteo for 1 year
Previously
C++ Software Engineer in
finance for 3 years
Frédéric Boismenu
Currently
SRE @ Criteo for 1½ years
Previously
Dev, Ops, stuff in a hedge fund
(even Delphi!)
2
A Global technology company
French company created in 2005
3,000 employees
30 offices worldwide
R&D in Paris, Ann Arbor and Palo Alto
Providing targeted online advertising
1.2B exposed consumers per month
3B ads displayed daily
Managing its own infrastructure
140 Site Reliability Engineers
8 datacenters on 3 continents
25,000 servers
3
Mesos @ Criteo
- 10 clusters
- 1400 agents
- 300 applications
- Bare Metal
- No network overlay: 1 IP per server
- Heterogeneous (2 Gbps, 10 Gbps)
4
Agenda
● Why network isolation?
● How we isolate network usage of our containers
● How we declared network bandwidth resource
5
Why Network Isolation?
6
What is QoS / Traffic Control?
[Diagram: Network Card, Shaper, Scheduler, Classifier, Policer]
7
Classification
CONNMARK
8
Scheduling
FQ_CoDel
9
Shaper
HTB
10
Policer
11
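The scheduling (FQ_CoDel) and shaping (HTB) stages named on the preceding slides map onto standard `tc` queueing disciplines. The sketch below only builds the command strings and does not run them. The function name, the device, the class number and the rates are illustrative assumptions, not the configuration used at Criteo.

```python
from typing import List


def egress_commands(dev: str, class_minor: int, rate_mbit: int, ceil_mbit: int) -> List[str]:
    """Build (but do not run) tc commands for one shaped class on `dev`.

    Hypothetical helper: HTB root qdisc, one HTB class with a guaranteed
    rate and a ceiling, and an fq_codel leaf qdisc scheduling that class.
    """
    # tc class IDs are written in hexadecimal, hence the :x format.
    classid = f"1:{class_minor:x}"
    return [
        f"tc qdisc add dev {dev} root handle 1: htb",
        f"tc class add dev {dev} parent 1: classid {classid} "
        f"htb rate {rate_mbit}mbit ceil {ceil_mbit}mbit",
        f"tc qdisc add dev {dev} parent {classid} fq_codel",
    ]


if __name__ == "__main__":
    for cmd in egress_commands("eth0", 0x10, 500, 2000):
        print(cmd)
```

A classifier (for example a filter matching a firewall mark or a cgroup class ID) is still needed to steer packets into the class; it is left out here.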
Shaping for Incoming traffic
[Diagram: Network Card, IFB0, Classifier, Scheduler, Shaper; annotation: "stolen!"]
12
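Shaping incoming traffic through an IFB device is commonly done by redirecting ingress packets from the physical interface to `ifb0` and shaping them there as if they were egress traffic. A minimal sketch of that pattern follows. It only builds command strings, and the function name, devices and rate are assumptions for illustration (the `ifb` kernel module is assumed to be loaded).

```python
from typing import List


def ingress_commands(dev: str, ifb: str, rate_mbit: int) -> List[str]:
    """Build (but do not run) commands that shape traffic entering `dev`.

    Hypothetical helper: ingress packets are redirected to the IFB device,
    where an ordinary egress HTB hierarchy applies the rate limit.
    """
    return [
        f"ip link set dev {ifb} up",
        # Attach the special ingress qdisc to the physical interface.
        f"tc qdisc add dev {dev} handle ffff: ingress",
        # Match every IPv4 packet and redirect it to the IFB device.
        f"tc filter add dev {dev} parent ffff: protocol ip u32 match u32 0 0 "
        f"action mirred egress redirect dev {ifb}",
        # Shape on the IFB device with a single default HTB class.
        f"tc qdisc add dev {ifb} root handle 1: htb default 1",
        f"tc class add dev {ifb} parent 1: classid 1:1 htb rate {rate_mbit}mbit",
    ]


if __name__ == "__main__":
    for cmd in ingress_commands("eth0", "ifb0", 1000):
        print(cmd)
```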
History
13
2017-Q1: Port_mapping test
2017-Q3: TC based Proof-of-Concept
2017-Q4: Ruby “Isolator” in prod
2018-Q3: Mesos Network Resource in prod
Why a custom solution?
● port_mapping
● Istio, {buzzword}
14
Step 1. Proof of Concept
● Basheries
● Some recompilation
● %Bandwidth == %CPU
15
Step 2. Mesos Agent Watcher v0
[Diagram: Mesos Agent, Mesos Agent Watcher (HTTP), cgroups/net_cls, TC / IPTABLES, eth0, ifb0]
16
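The `net_cls` cgroup tags a container's packets with a class ID that `tc` can then match. The kernel expects that ID as `0xAAAABBBB`, where `AAAA` is the tc major handle and `BBBB` the minor. The sketch below shows the encoding only. The function name is hypothetical, and whether the watcher computes IDs this way is an assumption.

```python
def net_cls_classid(major: int, minor: int) -> str:
    """Encode a tc handle major:minor as the value written to net_cls.classid."""
    if not (0 <= major <= 0xFFFF and 0 <= minor <= 0xFFFF):
        raise ValueError("major and minor must each fit in 16 bits")
    # Major handle in the upper 16 bits, minor handle in the lower 16 bits.
    return f"0x{(major << 16) | minor:08x}"


if __name__ == "__main__":
    # tc handles are hexadecimal, so tc class "10:1" is major 0x10, minor 0x1.
    print(net_cls_classid(0x10, 0x1))
```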
Mesos Agent Watcher - Metrology
[Diagram: Mesos Agent Watcher, tc, iptables, Metrics]
17
Any Issues?
● Conntrack Table
● UDP
● Queue Lengths
● Retransmission Timeouts
● Loopback traffic uncontrolled
● No TCP stack tuning per container
18
Troubleshooting?
19
TC for your team
● Mob Debugging
● Presentations
● Q&A Sessions
● Documentation
● Blog Posts
● Tools
20
Toolbox for an Operator
Shut off isolation by API
One task / One server
Debug the container
MesosTerm
21
22
TC for users
Metrology
23
Mesos Enlightenment
24
Network Bandwidth relative to CPU share
Full server: 24 CPUs, 2 Gbps
CPU greedy task: 10 CPUs, 830 Mbps
Network bandwidth greedy task: 19.2 CPUs, 1.6 Gbps
How many servers could allocate this amount?
How many resources are wasted?
25
Handle heterogeneous clusters
AllocatedNet = AllocatedCPUs / TotalAgentCPUs * TotalAgentNet
Implicitly allocated amount
2.5Gbps on a
24CPU/10Gbps server
500Mbps on a
24CPU/2Gbps server
26
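The formula above can be checked with a few lines of Python. The function name and the use of Mbps as the unit are assumptions for illustration. The 6-CPU task is deduced from the slide's figures (2.5Gbps and 500Mbps on 24-CPU servers), it is not stated in the deck.

```python
def implicit_net_mbps(allocated_cpus: float, total_agent_cpus: float,
                      total_agent_net_mbps: float) -> float:
    """AllocatedNet = AllocatedCPUs / TotalAgentCPUs * TotalAgentNet (in Mbps)."""
    return allocated_cpus / total_agent_cpus * total_agent_net_mbps


if __name__ == "__main__":
    # Same 6-CPU task on two different 24-CPU servers.
    print(implicit_net_mbps(6, 24, 10_000))  # 2500.0 on a 10Gbps server
    print(implicit_net_mbps(6, 24, 2_000))   # 500.0 on a 2Gbps server
```

The same task is implicitly granted five times more bandwidth on the 10Gbps server, which is the heterogeneity problem the next slide addresses.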
Handle heterogeneous clusters
TotalAgentNet becomes the cluster minimum (2Gbps at Criteo)
AllocatedNet = AllocatedCPUs / TotalAgentCPUs * 2Gbps
Implicitly allocated amount
27
10 Gbps 10 Gbps
2 Gbps 2 Gbps
How did we solve it?
28
Custom resources!
Quick overview of the Mesos ecosystem
2 frameworks
Aurora
Marathon
29
Migration plan made simple - STEP 1
Patching frameworks
Declaring resources
Accounting resources
30
Pitfalls when declaring custom resources
Aurora crashed!
Old version. :)
Backport fixed the issue
Make sure all your components are up-to-date!
31
Limited support for custom resources in Marathon
Custom resources not supported (MARATHON-4572)
32
Limited support for custom resources in Aurora
Custom resources not supported
33
Limited support for custom resources
Network bandwidth support in other
(custom) frameworks?
Not confident...
34
Migration plan made simple - STEP 2
Patching frameworks
Declaring resources
Accounting resources
35
Do you think this is difficult?
Implicit allocation of mandatory resources
36
What is a mandatory resource?
A resource that is required by any task to run, i.e., a resource that should be accounted for all tasks.
We already know...
● CPU
● RAM
● Disk space
But we might also consider...
● Network bandwidth
● Disk IO
● Number of file descriptors
And we should distinguish them from...
● GPU
● USB ports
● FPGAs
37
How to make a resource accounted for all tasks?
38
Offer matching mechanism
39
Offer: CPU = 4, RAM = 512MB
Framework: Please run java -jar helloworld.jar with 1 CPU and 64MB of RAM
Network bandwidth in offers
40
CPU = 4
RAM = 512MB
NET = 200Mbps
Network? don’t care!
I don’t support it yet.
CPU = 1
RAM = 64MB
41
Implicit allocation of network bandwidth
Offer: CPU = 4, RAM = 512MB, NET = 200Mbps
Task as launched by the framework: CPU = 1, RAM = 64MB
Task as accounted: CPU = 1, RAM = 64MB, NET = 500Mbps
NET injected and accounted by Mesos Master
Downside of implicit allocation
42
Offer: CPU = 4, NET = 200Mbps
Task: CPU = 1, NET = 500Mbps → Rejected
Task: CPU = 0.25, NET = 125Mbps → Accepted
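The accept/reject outcome above follows from injecting the implicit amount and then comparing it with the offer. A minimal sketch, assuming a 4-CPU agent and the 2Gbps cluster minimum from the earlier slide. The type and function names are hypothetical and this is not the Mesos Master code.

```python
from dataclasses import dataclass
from typing import Optional

CLUSTER_MIN_NET_MBPS = 2000  # 2Gbps cluster minimum quoted in the slides


@dataclass
class Resources:
    cpus: float
    net_mbps: Optional[float] = None  # None: the framework did not declare it


def with_implicit_net(task: Resources, total_agent_cpus: float) -> Resources:
    """Inject network bandwidth proportional to the CPU share when missing."""
    if task.net_mbps is not None:
        return task
    net = task.cpus / total_agent_cpus * CLUSTER_MIN_NET_MBPS
    return Resources(task.cpus, net)


def fits(offer: Resources, task: Resources) -> bool:
    """A task fits only if every accounted resource fits in the offer."""
    return task.cpus <= offer.cpus and task.net_mbps <= offer.net_mbps


if __name__ == "__main__":
    offer = Resources(cpus=4, net_mbps=200)
    for cpus in (1, 0.25):
        task = with_implicit_net(Resources(cpus), total_agent_cpus=4)
        verdict = "Accepted" if fits(offer, task) else "Rejected"
        print(f"CPU = {cpus}, NET = {task.net_mbps:g}Mbps: {verdict}")
```

The 1-CPU task is rejected even though CPU is available, because the framework never saw the network amount it was implicitly charged for.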
Network bandwidth for everyone
43
SUPPORT
Resources: {
  cpu: 1,
  network: 100
}
NO SUPPORT
Resources: {
  cpu: 1
}
NO SUPPORT
Labels: {
  FOO: "bar",
  NETWORK: "100"
}
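One way to read the three cases above is as a lookup order: an explicit `network` resource, then a `NETWORK` label, then the implicit amount. The sketch below illustrates that reading. The precedence, the function name and the Mbps unit are assumptions, not behavior confirmed by the slides.

```python
def resolve_network_mbps(resources: dict, labels: dict, implicit_mbps: float) -> float:
    """Pick the network amount to account for a task (assumed precedence)."""
    if "network" in resources:   # framework supports the custom resource
        return float(resources["network"])
    if "NETWORK" in labels:      # no support: the user declares it in a label
        return float(labels["NETWORK"])
    return implicit_mbps         # no declaration at all: implicit fallback


if __name__ == "__main__":
    print(resolve_network_mbps({"cpu": 1, "network": 100}, {}, 500.0))
    print(resolve_network_mbps({"cpu": 1}, {"FOO": "bar", "NETWORK": "100"}, 500.0))
    print(resolve_network_mbps({"cpu": 1}, {}, 500.0))
```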
How do you handle the migration?
44
Restart applications!
45
● Implicit allocation for ALL tasks.
● Mesos becomes aware of ALL allocated network resources.
No patch in frameworks so far!
46
Mesos UI
Remaining issues...
47
Task rejected
Only 2Gbps available
The truth about implicit allocation
Schedulers SHOULD ENFORCE the declaration of mandatory resources
Implicit allocation is a fallback system to cope with the time it takes to patch frameworks
48
Migration plan made simple - STEP 3
Patching frameworks
Declaring resources
Accounting resources
49
We eventually patched frameworks
By replicating what has been done for GPUs
50
And left the power to users...
51
And left the power to users...
52
Project done!
53
Conclusion
54
● Be Agile
● Train your team
● Shed light for your users
● No turn-key solution
55
Network Resource Future
● Master Hook MESOS-9315
● Framework patches
56
Isolator Improvements
● Open Source Isolator
● Real Isolator (C++)
● System resources allocation
● No network resource?
● Configurable Policies?
57
Thank you!
Questions?
We are hiring!
59

Mesos Network Isolation at Criteo