Lessons learned running large real-world Docker environments
Oct 27th 2015
Alois Mayr
@mayralois
alois.mayr@ruxit.com
Dec 3rd 2015
Source: http://www.schoonoart.de/
What is a “large” environment?
Campfire stories
#1 – The Death Star of Service Dependencies
Load-balanced service
System-wide service dependencies
Reverse proxies are essential
App #1
App #2
App #1 depends on App #2
Where is this specified?
Unwanted dependencies break architecture
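One way to answer "Where is this specified?" is to keep an explicit list of allowed dependencies and compare it with the calls actually observed. The sketch below is only an illustration: the service names, the `DECLARED` table, and the `observed` call pairs are all hypothetical, and how calls get observed (proxy logs, tracing, monitoring) is left open.

```python
# Hypothetical declared dependencies: service name -> services it may call.
DECLARED = {
    "app1": {"app2"},  # App #1 depends on App #2
    "app2": set(),     # App #2 should not call anything
}


def undeclared_calls(declared, observed):
    """Return the observed (caller, callee) pairs that are not declared."""
    return sorted(
        (caller, callee)
        for caller, callee in observed
        if callee not in declared.get(caller, set())
    )


if __name__ == "__main__":
    # Hypothetical observations, e.g. taken from reverse proxy logs.
    observed = [("app1", "app2"), ("app2", "app1")]
    for caller, callee in undeclared_calls(DECLARED, observed):
        print(f"undeclared dependency: {caller} -> {callee}")
```

A check like this could run in a build pipeline or as a periodic job, so an unwanted dependency is reported before it becomes part of the architecture.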
Use proper versioning for services, APIs, and images
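For images, one small piece of "proper versioning" is refusing references that float. Docker falls back to the `latest` tag when no tag is given, so the sketch below treats a missing tag and `latest` as unpinned. The function name `is_pinned` and the example image names are assumptions made for illustration, not part of the talk.

```python
def is_pinned(image_ref):
    """Return True if the reference has a digest or an explicit tag other than 'latest'."""
    if "@" in image_ref:
        # Digest form, e.g. repo@sha256:..., always names one exact image.
        return True
    # Only look at the last path segment so a registry port is not taken for a tag.
    last_part = image_ref.rsplit("/", 1)[-1]
    if ":" not in last_part:
        return False  # no tag: Docker would use 'latest'
    tag = last_part.rsplit(":", 1)[1]
    return tag not in ("", "latest")


if __name__ == "__main__":
    for ref in ("registry.example.com:5000/shop/cart:1.4.2", "shop/cart", "shop/cart:latest"):
        print(ref, "pinned" if is_pinned(ref) else "NOT pinned")
```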
#2 – The Network Retransmission Episode
Retransmissions
What was the problem?
• Hardware defect in a single network interface card
• NIC worked well under low load
• Retransmissions only under heavy load
• Affected communications to other machines in datacenter
• Still not sure about exact defect on NIC
Co-locate related containers.
Check network infrastructure.
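A basic way to check a host for this symptom is to look at the kernel's TCP counters. The sketch below assumes a Linux host, where `/proc/net/snmp` exposes `OutSegs` and `RetransSegs` on its `Tcp:` lines. The function name and the sample numbers are made up for illustration. The counters are cumulative since boot, so a real check would compare two samples taken under load rather than read them once.

```python
def tcp_retransmission_ratio(snmp_text):
    """Return RetransSegs / OutSegs from the contents of /proc/net/snmp."""
    tcp_lines = [line.split() for line in snmp_text.splitlines() if line.startswith("Tcp:")]
    if len(tcp_lines) < 2:
        raise ValueError("no Tcp header/value pair found")
    header, values = tcp_lines[0], tcp_lines[1]
    # Pair each counter name from the header line with its value.
    counters = dict(zip(header[1:], (int(v) for v in values[1:])))
    out_segs = counters["OutSegs"]
    return counters["RetransSegs"] / out_segs if out_segs else 0.0


# Made-up sample in the layout of /proc/net/snmp (only the Tcp lines).
SAMPLE = """\
Tcp: RtoAlgorithm RtoMin RtoMax MaxConn ActiveOpens PassiveOpens AttemptFails EstabResets CurrEstab InSegs OutSegs RetransSegs InErrs OutRsts InCsumErrors
Tcp: 1 200 120000 -1 10 5 0 0 3 1000 2000 50 0 0 0
"""

if __name__ == "__main__":
    print(f"retransmission ratio: {tcp_retransmission_ratio(SAMPLE):.3f}")
```

On a real host the input would come from reading `/proc/net/snmp`; comparing the ratio across hosts is one way a single faulty NIC might stand out.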
#3 – The Hungry Container Breakdown
Low disk space
What was the problem?
• Shared /logs partition on host
• No log rotation, no archiving for app logs
• No proper log management in place for the Docker environment
• Shared /logs partition on a single host ran out of space
How the problem evolved over time
• Container health checks failed
• Marathon terminated the task and rescheduled a new one
• Still no free space on /logs
• Termination and rescheduling repeated
• /var/lib/docker ran out of space
• Mesos slave unable to run Docker tasks
How the problem could have been avoided
• Use log management tools for app logs, e.g. Fluentd and Logstash
  --log-driver=none|syslog
• Remove containers when they exit
  --rm=true
• Run Mesos slave with
  --docker_remove_delay=VALUE
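The two `docker run` flags above can be combined in one command line. The sketch below only assembles the argument list and does not start anything. The helper name `docker_run_args` and the image name are hypothetical, and the accepted log drivers are limited to the two values named on the slide.

```python
def docker_run_args(image, log_driver="syslog", remove_on_exit=True):
    """Build a `docker run` argument list using the flags named above."""
    if log_driver not in ("none", "syslog"):
        # Only the two values from the slide are accepted in this sketch.
        raise ValueError(f"unsupported log driver in this sketch: {log_driver}")
    args = ["docker", "run", f"--log-driver={log_driver}"]
    if remove_on_exit:
        # Remove the container's filesystem once it exits.
        args.append("--rm=true")
    args.append(image)
    return args


if __name__ == "__main__":
    # "example/app:1.0" is a placeholder image name.
    print(" ".join(docker_run_args("example/app:1.0")))
```

The resulting list could be passed to `subprocess.run`. In a Marathon and Mesos setup the same flags would more likely be supplied through the app definition, which is not shown here.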
Use log management tools
Empty /var/lib/docker
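Both partitions in this story filled up without anyone noticing in time. A minimal sketch of a free-space check follows, assuming a 10% threshold chosen purely for illustration. The `usage` parameter exists only so the function can be tried without touching a real filesystem; by default it uses `shutil.disk_usage` from the standard library.

```python
import shutil


def low_space_paths(paths, min_free_fraction=0.10, usage=shutil.disk_usage):
    """Return the paths whose free space is below min_free_fraction of the total."""
    low = []
    for path in paths:
        # disk_usage returns (total, used, free) in bytes.
        total, _used, free = usage(path)
        if total and free / total < min_free_fraction:
            low.append(path)
    return low


if __name__ == "__main__":
    # Paths from the story above; skip any that do not exist on this machine.
    for path in ("/logs", "/var/lib/docker"):
        try:
            if low_space_paths([path]):
                print(f"low disk space on {path}")
        except FileNotFoundError:
            print(f"{path} not present on this host")
```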
#4 – The Day Orchestration Stood Still
Queue and deployment methods are slow
What was the problem?
• Marathon 0.8.x keeps all versions of applications for recovery (by default)
• High frequency of microservice deployments
• Slowdown caused by ZooKeeper (zk) overload
How the problem could have been avoided
• The relevant Marathon parameter (zk_max_versions) was not set to a proper limit
  --zk_max_versions=20
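The effect of a version limit can be shown with a toy model. The sketch below is not Marathon's implementation; it only shows how stored versions grow without bound when every deployment is kept, and stay capped when a maximum is set. The class, the function, and the app id `/shop/cart` are hypothetical.

```python
from collections import deque


class VersionStore:
    """Toy model of a per-app version history with an optional cap."""

    def __init__(self, max_versions=None):
        self._max = max_versions  # None means keep everything
        self._versions = {}

    def record(self, app_id, version):
        # A deque with maxlen drops the oldest entry once the cap is reached.
        history = self._versions.setdefault(app_id, deque(maxlen=self._max))
        history.append(version)

    def stored(self):
        return sum(len(history) for history in self._versions.values())


def stored_after(deployments, max_versions=None):
    """Return how many versions are stored after the given number of deployments."""
    store = VersionStore(max_versions)
    for n in range(deployments):
        store.record("/shop/cart", f"v{n}")
    return store.stored()


if __name__ == "__main__":
    print("no limit:", stored_after(1000))
    print("limit 20:", stored_after(1000, max_versions=20))
```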
Track orchestration layer performance
Separate Mesos clusters
#5 – The Mushroom Cloud Effect
Way too many components involved
820 BILLION dependencies!
What was the problem?
• Massive load testing in preparation for Black Friday
• Tests ran for 3 days
• No impact on real users, only backend services affected
• Many components to take into account
[Diagram: Service, Container, Host, with multiplicities 1, 1..*, *, 1; figures shown: 174 / 3.4k and 22 / 13.3k]
Automation needed for problem analysis in large environments
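A first step toward such automation is a machine-readable model of which service runs in which container on which host. The sketch below assumes a flat list of placements and answers one question: which services are touched when a host has a problem. All names and the data in `PLACEMENTS` are made up; a real environment would fill this from its orchestrator or monitoring tool.

```python
from collections import defaultdict

# Hypothetical inventory: (service, container, host).
PLACEMENTS = [
    ("checkout", "c1", "host-a"),
    ("checkout", "c2", "host-b"),
    ("search", "c3", "host-a"),
    ("catalog", "c4", "host-c"),
]


def services_by_host(placements):
    """Index the services that have at least one container on each host."""
    index = defaultdict(set)
    for service, _container, host in placements:
        index[host].add(service)
    return index


def affected_services(placements, failed_host):
    """Return the services with a container on the failed host, sorted by name."""
    return sorted(services_by_host(placements).get(failed_host, set()))


if __name__ == "__main__":
    print(affected_services(PLACEMENTS, "host-a"))
```

With thousands of containers, the same lookup done by hand is where analysis tends to stall, which is the point the slide makes.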
Campfire stories
#1 – The Death Star of Service Dependencies
#2 – The Network Retransmission Episode
#3 – The Hungry Container Breakdown
#4 – The Day Orchestration Stood Still
#5 – The Mushroom Cloud Effect
Free trial - https://ruxit.com/docker-monitoring/
Blog - https://blog.ruxit.com/
@ruxit
What lessons have you learned?

Custom Healthcare Software for Managing Chronic Conditions and Remote Patient...
 
Lecture 1 Introduction to games development
Lecture 1 Introduction to games developmentLecture 1 Introduction to games development
Lecture 1 Introduction to games development
 
Orion Context Broker introduction 20240604
Orion Context Broker introduction 20240604Orion Context Broker introduction 20240604
Orion Context Broker introduction 20240604
 
Top 7 Unique WhatsApp API Benefits | Saudi Arabia
Top 7 Unique WhatsApp API Benefits | Saudi ArabiaTop 7 Unique WhatsApp API Benefits | Saudi Arabia
Top 7 Unique WhatsApp API Benefits | Saudi Arabia
 
APIs for Browser Automation (MoT Meetup 2024)
APIs for Browser Automation (MoT Meetup 2024)APIs for Browser Automation (MoT Meetup 2024)
APIs for Browser Automation (MoT Meetup 2024)
 
Text-Summarization-of-Breaking-News-Using-Fine-tuning-BART-Model.pptx
Text-Summarization-of-Breaking-News-Using-Fine-tuning-BART-Model.pptxText-Summarization-of-Breaking-News-Using-Fine-tuning-BART-Model.pptx
Text-Summarization-of-Breaking-News-Using-Fine-tuning-BART-Model.pptx
 
Utilocate provides Smarter, Better, Faster, Safer Locate Ticket Management
Utilocate provides Smarter, Better, Faster, Safer Locate Ticket ManagementUtilocate provides Smarter, Better, Faster, Safer Locate Ticket Management
Utilocate provides Smarter, Better, Faster, Safer Locate Ticket Management
 
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...
 
Prosigns: Transforming Business with Tailored Technology Solutions
Prosigns: Transforming Business with Tailored Technology SolutionsProsigns: Transforming Business with Tailored Technology Solutions
Prosigns: Transforming Business with Tailored Technology Solutions
 
Providing Globus Services to Users of JASMIN for Environmental Data Analysis
Providing Globus Services to Users of JASMIN for Environmental Data AnalysisProviding Globus Services to Users of JASMIN for Environmental Data Analysis
Providing Globus Services to Users of JASMIN for Environmental Data Analysis
 
Vitthal Shirke Java Microservices Resume.pdf
Vitthal Shirke Java Microservices Resume.pdfVitthal Shirke Java Microservices Resume.pdf
Vitthal Shirke Java Microservices Resume.pdf
 

Lessons learned running large real-world Docker environments

  • 1. Lessons learned running large real-world Docker environments. Alois Mayr (@mayralois, alois.mayr@ruxit.com). Oct 27th 2015; Dec 3rd 2015.
  • 3. What is a “large” environment?
  • 5. Campfire stories: #1 – The Death Star of Service Dependencies
  • 6. #1 – Death Star of Service Dependencies: Load-balanced service. System-wide service dependencies.
  • 7. #1 – The Death Star of Service Dependencies: Reverse proxies are essential.
  • 8. #1 – The Death Star of Service Dependencies: App #1 depends on App #2. Where is this specified? Unwanted dependencies break architecture.
  • 9. #1 – The Death Star of Service Dependencies: Use proper versioning for services, APIs, and images.
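The question on slide 8 ("Where is this specified?") can be made checkable by comparing declared dependencies with the calls actually observed between services. The following is a minimal sketch, not something shown in the talk: the manifest format, the service names (`app1`, `app2`, `app3`), and the function `undeclared_dependencies` are all hypothetical.

```python
from typing import Dict, List, Set, Tuple


def undeclared_dependencies(
    declared: Dict[str, Set[str]],
    observed: List[Tuple[str, str]],
) -> List[Tuple[str, str]]:
    """Return observed (caller, callee) pairs that the manifest does not declare."""
    missing = set()
    for caller, callee in observed:
        if callee not in declared.get(caller, set()):
            missing.add((caller, callee))
    return sorted(missing)


# Hypothetical manifest: app1 is allowed to call app2, app2 calls nothing.
declared = {"app1": {"app2"}, "app2": set()}

# Hypothetical call pairs, e.g. collected from reverse-proxy access logs.
observed = [("app1", "app2"), ("app2", "app1"), ("app1", "app3")]

violations = undeclared_dependencies(declared, observed)
for caller, callee in violations:
    print(f"undeclared dependency: {caller} -> {callee}")
```

Routing all service-to-service traffic through reverse proxies, as slide 7 recommends, is one way such observed call pairs could be gathered in a single place.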
  • 10. Campfire stories: #1 – The Death Star of Service Dependencies; #2 – The Network Retransmission Episode
  • 11. #2 – The Network Retransmission Episode: Retransmissions
  • 12. #2 – The Network Retransmission Episode: What was the problem?
      • Hardware defect in a single network interface card (NIC)
      • NIC worked well under low load
      • Retransmissions only under heavy load
      • Affected communications to other machines in datacenter
      • Still not sure about exact defect on NIC
  • 13. #2 – The Network Retransmission Episode
  • 14. #2 – The Network Retransmission Episode: Co-locate related containers. Check network infrastructure.
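Because the defect only showed up under heavy load, a per-host retransmission ratio sampled over time is one signal worth tracking. On Linux, cumulative TCP counters (including `OutSegs` and `RetransSegs`) are exposed in `/proc/net/snmp`. The sketch below parses two snapshots of that text and computes the share of retransmitted segments between them. The sample values are made up, the helper names are hypothetical, and this is not the tooling used in the talk.

```python
TCP_FIELDS = (
    "RtoAlgorithm RtoMin RtoMax MaxConn ActiveOpens PassiveOpens "
    "AttemptFails EstabResets CurrEstab InSegs OutSegs RetransSegs "
    "InErrs OutRsts InCsumErrors"
)


def parse_tcp_counters(snmp_text):
    """Parse the two 'Tcp:' lines (names, then values) of /proc/net/snmp text."""
    rows = [line.split() for line in snmp_text.splitlines() if line.startswith("Tcp:")]
    header, values = rows[0], rows[1]
    return {name: int(value) for name, value in zip(header[1:], values[1:])}


def retransmission_ratio(before, after):
    """Share of segments sent between two snapshots that were retransmissions."""
    sent = after["OutSegs"] - before["OutSegs"]
    retransmitted = after["RetransSegs"] - before["RetransSegs"]
    return retransmitted / sent if sent > 0 else 0.0


def sample(out_segs, retrans_segs):
    """Build made-up /proc/net/snmp text for the demo below."""
    values = f"1 200 120000 -1 10 5 0 0 3 1000 {out_segs} {retrans_segs} 0 0 0"
    return f"Tcp: {TCP_FIELDS}\nTcp: {values}\n"


# On a real host the text would come from open("/proc/net/snmp").read(),
# taken twice with some interval in between.
before = parse_tcp_counters(sample(1000, 10))
after = parse_tcp_counters(sample(3000, 110))
ratio = retransmission_ratio(before, after)
print(f"retransmitted share of sent segments: {ratio:.1%}")
```

Comparing this ratio across hosts, rather than looking at one host in isolation, is what would make a single faulty NIC stand out.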
  • 15. Campfire stories: #1 – The Death Star of Service Dependencies; #2 – The Network Retransmission Episode; #3 – The Hungry Container Breakdown
  • 16. #3 – The Hungry Container Breakdown: Low disk space
  • 17. #3 – The Hungry Container Breakdown: What was the problem?
      • Shared /logs partition on host
      • No log rotation, no archiving for app logs
      • No proper log management used for Docker environment
      • Shared /logs partition on a single host ran out of space
  • 18. #3 – The Hungry Container Breakdown: How the problem evolved over time
      • Container health checks failed
      • Marathon terminated task and rescheduled new one
      • Still no free space on /logs
      • Termination and rescheduling
      • /var/lib/docker ran out of space
      • Mesos slave unable to run Docker tasks
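The chain of events above started with a partition silently filling up, so a basic free-space check on the two paths named in the slides would have raised the alarm earlier. This is a minimal sketch using Python's standard `shutil.disk_usage`. The 10% threshold and the function names are assumptions, not values from the talk.

```python
import os
import shutil

WATCHED_PATHS = ["/logs", "/var/lib/docker"]  # the two paths named on the slides
MIN_FREE = 0.10  # assumed threshold: warn below 10% free space


def is_low(free_bytes, total_bytes, min_free=MIN_FREE):
    """True when the free fraction of a filesystem is below the threshold."""
    if total_bytes <= 0:
        return True
    return free_bytes / total_bytes < min_free


def check_paths(paths):
    """Return the subset of existing paths whose filesystem is low on space."""
    low = []
    for path in paths:
        if not os.path.exists(path):
            continue  # path is not present on this host
        try:
            usage = shutil.disk_usage(path)
        except OSError:
            continue  # unreadable path: skipped in this sketch
        if is_low(usage.free, usage.total):
            low.append(path)
    return low


if __name__ == "__main__":
    for path in check_paths(WATCHED_PATHS):
        print(f"low disk space on {path}")
```

A check like this only detects the symptom. Log rotation and log management, covered in the next slides, address the cause.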
  • 19. #3 – The Hungry Container Breakdown: How the problem could have been avoided
      • Log management tools for app logs, e.g. Fluentd and Logstash: --log-driver=none|syslog
      • Remove container: --rm=true
      • Run Mesos slave with --docker_remove_delay=VALUE
  • 20. #3 – The Hungry Container Breakdown: Use log management tools. Empty /var/lib/docker.
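The two Docker flags named on slide 19 (`--log-driver` and `--rm`) can be combined into a single `docker run` invocation. The sketch below only assembles the command line and does not execute it. The image name `example/app:1.0` and the helper `docker_run_command` are hypothetical, and `syslog` is just one of the two drivers the slide mentions.

```python
import shlex


def docker_run_command(image, log_driver="syslog", remove_on_exit=True):
    """Assemble a `docker run` argument list using the flags from the slides."""
    argv = ["docker", "run"]
    if remove_on_exit:
        # --rm removes the container's filesystem once it exits, so stopped
        # containers do not pile up under /var/lib/docker.
        argv.append("--rm")
    # Send container output to the chosen driver instead of local files.
    argv.append(f"--log-driver={log_driver}")
    argv.append(image)
    return argv


cmd = docker_run_command("example/app:1.0")  # hypothetical image name
print(shlex.join(cmd))
```

The resulting list could be passed to `subprocess.run` on a host with Docker installed. When containers are launched through Mesos rather than directly, the `--docker_remove_delay` setting from the same slide is the relevant cleanup knob instead.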
  • 21. Campfire stories: #1 – The Death Star of Service Dependencies; #2 – The Network Retransmission Episode; #3 – The Hungry Container Breakdown; #4 – The Day Orchestration Stood Still
  • 22. #4 – The Day Orchestration Stood Still: Queue and deployment methods are slow
  • 23. #4 – The Day Orchestration Stood Still: What was the problem?
      • Marathon 0.8.x keeps all versions of applications for recovery (by default)
      • High frequency of microservices deployments
      • Slowdown through ZooKeeper (zk) overload
  • 24. #4 – The Day Orchestration Stood Still: How the problem could have been avoided
      • Respective parameter (zk_max_versions) was not set to a proper limit: --zk_max_versions=20
  • 25. #4 – The Day Orchestration Stood Still: Track orchestration layer performance. Separate Mesos clusters.
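The effect of capping the stored version history can be illustrated with a bounded buffer: with a limit of 20, frequent deployments overwrite the oldest entries instead of growing the store without bound. This is only an illustration of the idea behind `--zk_max_versions=20`. The class `VersionHistory` is hypothetical and does not reflect how Marathon actually stores versions in ZooKeeper.

```python
from collections import deque


class VersionHistory:
    """Keep at most `max_versions` of the most recent application versions."""

    def __init__(self, max_versions):
        # A deque with maxlen silently drops the oldest entry when full.
        self._versions = deque(maxlen=max_versions)

    def record(self, version):
        self._versions.append(version)

    def versions(self):
        return list(self._versions)


history = VersionHistory(max_versions=20)
for n in range(1, 101):  # simulate 100 deployments of one application
    history.record(f"v{n}")

kept = history.versions()
print(len(kept), kept[0], kept[-1])
```

Without the cap, the same 100 deployments would leave 100 stored versions per application, which is the kind of growth the slides describe as overloading zk.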
  • 26. Campfire stories: #1 – The Death Star of Service Dependencies; #2 – The Network Retransmission Episode; #3 – The Hungry Container Breakdown; #4 – The Day Orchestration Stood Still; #5 – The Mushroom Cloud Effect
  • 27. #5 – The Mushroom Cloud Effect: Way too many components involved. 820 BILLION dependencies!
  • 28. #5 – The Mushroom Cloud Effect: What was the problem?
      • Massive load testing in preparation for Black Friday
      • Tests ran for 3 days
      • No impact to real users, only backend services affected
      • Many components to take into account
      • Diagram: Service, Container, Host (multiplicities 1, 1..*, *, 1; counts 174 / 3.4k, 22 / 13.3k)
  • 30. #5 – The Mushroom Cloud Effect: Automation needed for problem analysis in large environments.
  • 31. Campfire stories: #1 – The Death Star of Service Dependencies; #2 – The Network Retransmission Episode; #3 – The Hungry Container Breakdown; #4 – The Day Orchestration Stood Still; #5 – The Mushroom Cloud Effect
  • 32. What lessons have you learned? Free trial: https://ruxit.com/docker-monitoring/. Blog: https://blog.ruxit.com/. Twitter: @ruxit.