When containers fail

@mayralois
August 2016
The Mushroom Cloud Effect
or
What Happens When Containers Fail?
Alois Mayr
Technology Lead Cloud & Containers
Microservices and Containers Meetup Austin

about:me
• Austrian
• Never seen Sound of Music
• Often seen much more modern technology stuff
• Seen even more technology stuff now with Dynatrace
• Technology Lead for Cloud & Containers: Cloud Foundry, Docker, AWS, etc.

about:dynatrace
• APM market leader that helps companies with digital transformation
• Founded in Austria in 2005
• ~1,600 employees worldwide
• >8,000 customers across all industries
• Has seen many performance and stability problems and patterns out there

about:you
• Which of you run or manage containers in production?
• Whose life has become easier since then?
• What’s needed to make it easy?
• Thanks!

Source: http://www.schoonoart.de/
…there’s been the
mushroom cloud effect
oh yeah, everything
screwed up

Biggest LatAm E-Commerce Company
• ~US$ 2.5 billion revenue
• 4 sites: Americanas, Shoptime, Submarino, Soubarato
• ~ 150 hosts across 4 regions
• 5k-15k containers
• 1k-3k services

About Cloud-Scale Systems

Important Aspects…
• Lots of (micro-)services
• Lots of communication between services
• Service dependencies
• Versioning and API compatibilities
• Zero downtime

Platform-related Aspects
• Most often container-based
• Clustered for scalability
• Ephemeral containers
• Resilient architecture
• Cross AZ fail-overs
• SDN for communication

Deployments are no Longer Static
7:00 a.m.
Low load, service running with minimum redundancy
12:00 p.m.
Scaled-up service during peak load, with failover of a problematic node
7:00 p.m.
Scaled back down for lower load, moved to a different geolocation

Anatomy of dynamic environments
https://www.dynatrace.com/en/ruxit/

All About (Service) Dependencies

Failing containers…
…may or may not have an (immediate) impact on service performance

Cascading Failures Lead to a
Mushroom Cloud Effect

The Hungry Container Breakdown
What was the problem?
• Shared /logs partition on host
• No log rotation, no archiving for app logs
• No proper log management used for Docker environment
• Shared /logs partition ran out of space
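A minimal sketch of the kind of check that could have flagged the shared partition before it filled up. It only uses Python's standard library; the function names and the 90% threshold are assumptions made for illustration, not something from the talk.

```python
import shutil


def partition_usage_percent(path):
    """Return how full the partition holding `path` is, in percent."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        # Some pseudo file systems report a size of zero.
        return 0.0
    return 100.0 * usage.used / usage.total


def is_nearly_full(path, threshold_percent=90.0):
    """True if the partition holding `path` is at or above the threshold.

    The 90% default is an arbitrary, illustrative value.
    """
    return partition_usage_percent(path) >= threshold_percent
```

On a host like the one described, this would be pointed at the shared /logs mount, for example `is_nearly_full("/logs")`, and wired into whatever alerting is already in place.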

How did the problem evolve over time?
• Container health checks failed
• Orchestration killed container and rescheduled new one
• Still no free space on /logs
• Termination and rescheduling
• /var/lib/docker ran out of space
• Cluster nodes were no longer able to run any containers
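The kill-and-reschedule loop above can be illustrated with a toy model. It assumes that every replaced container leaves a fixed amount of data behind on the Docker partition because nothing removes it; the function name and all numbers are hypothetical and only meant to show why the loop ends with a full disk.

```python
def restarts_until_full(free_mb, leftover_mb_per_container):
    """Count kill/reschedule cycles that fit before the partition is full.

    Toy model (assumption): each replaced container leaves
    `leftover_mb_per_container` megabytes behind that are never cleaned up.
    """
    if leftover_mb_per_container <= 0:
        raise ValueError("leftover_mb_per_container must be positive")
    restarts = 0
    while free_mb >= leftover_mb_per_container:
        free_mb -= leftover_mb_per_container  # dead container data stays on disk
        restarts += 1
    return restarts
```

With a failing health check that triggers a reschedule every few seconds, even a generously sized partition is exhausted quickly under this model.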

How did the problem affect services?
• Services at the top of the graph
• Increased failure rates
• Lots of dependent Tomcat and DB services affected
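One way to reason about which services a single failure can reach is to walk the dependency graph upwards from the failed service. The sketch below does that with a breadth-first search; the service names and the `DEPENDS_ON` map are hypothetical and are not the topology from the talk.

```python
from collections import deque


def impacted_services(depends_on, failed):
    """Return every service that directly or transitively depends on `failed`."""
    # Invert the edges: for each service, who depends on it?
    dependents = {}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(service)

    impacted = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for parent in dependents.get(current, ()):
            if parent != failed and parent not in impacted:
                impacted.add(parent)
                queue.append(parent)
    return impacted


# Hypothetical topology: service -> services it calls.
DEPENDS_ON = {
    "web-frontend": ["checkout", "catalog"],
    "checkout": ["tomcat-orders"],
    "catalog": ["tomcat-search"],
    "tomcat-orders": ["orders-db"],
    "tomcat-search": [],
    "orders-db": [],
}
```

In this made-up graph, `impacted_services(DEPENDS_ON, "orders-db")` reaches all the way up to `web-frontend`, which is the shape of the mushroom cloud.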

How could the problem have been avoided?
Log management tools for app logs
--log-driver=none|syslog
Remove containers / clean-up jobs
--rm=true
/var/lib/docker deserves its own partition
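The Docker flags above deal with container output and leftover containers. For app logs written to files, the missing log rotation could be added inside the application itself. The following is a minimal sketch using Python's standard `logging.handlers.RotatingFileHandler`; the helper name, size limit, and backup count are illustrative assumptions.

```python
import logging
from logging.handlers import RotatingFileHandler


def make_rotating_logger(path, max_bytes=1_000_000, backup_count=3):
    """Create a logger whose file is capped in size and rotated.

    At most `backup_count` old files are kept next to `path`, so the
    total disk footprint stays bounded. Defaults are illustrative.
    """
    logger = logging.getLogger(f"app.{path}")
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep the output out of the root logger
    handler = RotatingFileHandler(path, maxBytes=max_bytes, backupCount=backup_count)
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger
```

With this in place the app can log indefinitely while using roughly `max_bytes * (backup_count + 1)` of disk space at most.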

Buggy Containers May Kill Your Nodes

Try to Break Your Clusters Early
(And be Prepared for Black Friday)

Break Your Clusters Early
Massive load testing!
Survive three days of pain
Include everything: services, containers, orchestration, EC2 instances

Testing everything
13.3k containers (+nodes)
3,451 services

Automation Needed to Pinpoint the
Root Cause of Cascading Failures!
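As a rough illustration of what such automation has to do, the sketch below narrows a set of unhealthy services down to root-cause candidates: unhealthy services that do not call any other unhealthy service. This is a deliberately simple heuristic that only looks at direct dependencies, not a description of how any particular monitoring product works, and the topology is hypothetical.

```python
def root_cause_candidates(depends_on, unhealthy):
    """Return unhealthy services with no unhealthy direct dependency.

    Heuristic (assumption): a service whose own dependencies are all
    healthy is more likely a cause than a victim of the cascade.
    """
    unhealthy = set(unhealthy)
    return {
        service
        for service in unhealthy
        if not any(dep in unhealthy for dep in depends_on.get(service, ()))
    }


# Hypothetical topology: service -> services it calls.
DEPENDS_ON = {
    "web-frontend": ["checkout", "catalog"],
    "checkout": ["tomcat-orders"],
    "catalog": ["tomcat-search"],
    "tomcat-orders": ["orders-db"],
    "tomcat-search": [],
    "orders-db": [],
}
```

If `web-frontend`, `checkout`, `tomcat-orders`, and `orders-db` are all reported unhealthy, the heuristic points at `orders-db` as the only candidate.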

Questions? Or Beer? Or Both?
How do you know if a failing container
breaks your apps?
