The Mushroom Cloud Effect - What happens when containers fail?

•Download as PPTX, PDF•

0 likes•175 views

Micro service architectures result in up to 20 times larger environments than their monolithic counterparts. In such big and interconnected environments container metrics will tell you about infrastructure health but not service health. Even if you have implemented service health checks to quickly react on service failures, in a resilient system you will see intermediary mushroom cloud effects of a large number of services being affected temporarily. How do you find out what really caused the problem and how to distinguish effect vs. cause?

Software

@mayralois
Mushroom Cloud Effect
Alois Mayr
Technology Lead – Dynatrace

@mayralois
Monitoring for Highly Dynamic
Container Environments

@mayralois
Source: http://www.schoonoart.de/
…there’s been the
mushroom cloud effect
oh yeah, everything
screwed up

@mayralois
The Mushroom Cloud Effect
or
What Happens When Containers Fail?

@mayralois
Biggest LatAm e-commerce Company
• ~ U$ 2.5 billion revenue
• 4 sites: Americanas, Shoptime,
Submarino, Soubarato
• ~ 150 hosts across 4 regions
• 5k-15k containers
• 1k-3k services

@mayralois
Important Aspects…
• Lots of (micro-)services
• Lots of communication between services
• Service dependencies
• Versioning and API compatibilities
• Zero downtime

@mayralois
Platform-related Aspects
• Most often container-based
• Clustered for scalability
• Ephemeral containers
• Resilient architecture
• Cross AZ fail-overs
• SDN for communication

@mayralois
Deployments are no Longer Static
7:00 a.m.
Low load, service running
with minimum redundancy
12:00 p.m.
Scaled up service during peak load
with failover of problematic node
7:00 p.m.
Scaled back down to lower load,
move to different geolocation

@mayralois
Anatomy of dynamic environments
https://www.dynatrace.com/en/ruxit/

@mayralois
All About (Service) Dependencies

@mayralois
Failing containers…
…may or may not have an (immediate)
impact on service performance

@mayralois
Cascading Failures Lead to a
Mushroom Cloud Effect

@mayralois
The Hungry Container Breakdown
• Shared /logs partition on host
• No log rotation, no archiving for app logs
• No proper log management used for Docker environment
• Shared /logs partition ran out of space
What was the problem?

@mayralois
The Hungry Container Breakdown
• Container health checks failed
• Orchestration killed container and rescheduled new one
• Still no free space on /logs
• Termination and rescheduling
• /var/lib/docker ran out of space
• Cluster nodes were no longer able to run any containers
How the problem has evolved over time?

@mayralois
The Hungry Container Breakdown
• Services at the top of the graph
• Increased failure rates
• Lots of depending Tomcat and DB services affected
How the problem affected services?

@mayralois
The Hungry Container Breakdown
Log management tools for app logs
--log-driver=none|syslog
Remove container
--rm=true
/var/lib/docker deserves its own partition
How the problem could have been avoided?

@mayralois
The Hungry Container Breakdown
Buggy Containers May Kill Your Nodes

@mayralois
Try to Break Your Clusters Early
(And be Prepared for Black Friday)

@mayralois
Break Your Clusters Early
Massive load testing!
Survive three days of pain
Include everything
Services, Containers,
Orchestration, EC2 instances

@mayralois
Testing everything
13.3k containers (+nodes)
3,451 services

@mayralois
Automation Needed to Pinpoint the
Root Cause of Cascading Failures!

@mayralois
Want to learn more?
Stop by our booth!
G15

What's hot

Azure Microservices in Practice - Radu Vunvulea ITCamp Community Timisoara 07...Radu Vunvulea

Getting Started with Serverless ArchitecturesAmazon Web Services

Full Stack Application Monitoring for AWS Powered by AIDynatrace

Dead-Simple Deployment: Headache-Free Java Web Applications in the CloudCraig Dickson

KafkaMajid Hajibaba

An Introduction to MicroservicesAd van der Veer

Building Automated Control Systems for Your AWS InfrastructureAmazon Web Services

Building a reliable, scalable service with Clojure and Core.asyncKapil Reddy

.NET Security (Radu Vunvulea)Radu Vunvulea

Artificial Intelligence & Machine learning foundation topic in AWS Varun Manik

AWS LambdaDanilo Poccia

Keeping Rails on the TracksRubyX

Kapil Thangavelu - Cloud CustodianServerlessConf

How LEGO.com Accelerates With ServerlessSheenBrisals

Ben Kehoe - Serverless Architecture for the Internet of ThingsServerlessConf

Serverless Architecture on AWSRajind Ruparathna

Introduction to RightScaleAkelios

Serverless computingOm Vikram Thapa

Cloud Services Powered by IBM SoftLayer and NetflixOSSaspyker

Using Machine Learning on K8s Logs to Find Root Cause FasterLibbySchulze

What's hot (20)

Azure Microservices in Practice - Radu Vunvulea ITCamp Community Timisoara 07...

Getting Started with Serverless Architectures

Full Stack Application Monitoring for AWS Powered by AI

Dead-Simple Deployment: Headache-Free Java Web Applications in the Cloud

Kafka

An Introduction to Microservices

Building Automated Control Systems for Your AWS Infrastructure

Building a reliable, scalable service with Clojure and Core.async

.NET Security (Radu Vunvulea)

Artificial Intelligence & Machine learning foundation topic in AWS

AWS Lambda

Keeping Rails on the Tracks

Kapil Thangavelu - Cloud Custodian

How LEGO.com Accelerates With Serverless

Ben Kehoe - Serverless Architecture for the Internet of Things

Serverless Architecture on AWS

Introduction to RightScale

Serverless computing

Cloud Services Powered by IBM SoftLayer and NetflixOSS

Using Machine Learning on K8s Logs to Find Root Cause Faster

Viewers also liked

Cost of a Speeding Ticket in WisconsinNicholson Gansner

Excipients and GMPDRA Consulting Oy

Ej03 añadir, insertar y borrar textosegundociclocm

Tatjana Konakov Radic-uverenje ISO 14001.PDFTatjana Radić

Scan_20160815Ahmad Nur Faizzy Bin Ahmad Sabri

Yoga Sutras VibhutiJuan Carlos

Open InnovationMiguel Jimbo

Curriculum vitaeBlanca Marisol Gómez Martínez

Actividad tics (1)danielacorso

Potential Book CoversBeyond Philosophy

Propgaganda Dr. Shriniwas Kashalikarsarojs

Shanti MantraJuan Carlos

a825902cdfef20898579d87c868479c4Jonas K Ö Högberg

Sheep HappensAndy Clark

Performance monitoring and call tracing in microservice environmentsMartin Gutenbrunner

Introduction to RADAR by NITimour Chaikhraziev

Geomorphology fieldbookDavid Rogers

Data Integrity webinar - Essentials & Solutionspi

Viewers also liked (18)

Cost of a Speeding Ticket in Wisconsin

Excipients and GMP

Ej03 añadir, insertar y borrar texto

Tatjana Konakov Radic-uverenje ISO 14001.PDF

Scan_20160815

Yoga Sutras Vibhuti

Open Innovation

Curriculum vitae

Actividad tics (1)

Potential Book Covers

Propgaganda Dr. Shriniwas Kashalikar

Shanti Mantra

a825902cdfef20898579d87c868479c4

Sheep Happens

Performance monitoring and call tracing in microservice environments

Introduction to RADAR by NI

Geomorphology fieldbook

Data Integrity webinar - Essentials & Solutions

Similar to The Mushroom Cloud Effect - What happens when containers fail?

When containers failAlois Mayr

Spotinst 'AWS Cost Optimization' Webinar - Jan 20th, 2016Spotinst

Lessons learned running large real-world Docker environmentsAlois Mayr

Site reliability in the Serverless age - Serverless Boston 2019Erik Peterson

Webinar: Queues with RabbitMQ - Lorna MitchellCodemotion

Microservices with Apache Camel, Docker and Fabric8 v2Christian Posta

Building data intensive applicationsAmit Kejriwal

Webinar Slides: AWS Aurora MySQL Replacement: Break Away From Geo-Limitations...Continuent

Architecting for failure - Why are distributed systems hard?Markus Eisele

Concurrency at Scale: Evolution to Micro-ServicesRandy Shoup

Release the Monkeys ! Testing in the Wild at NetflixGareth Bowles

Stay productive while slicing up the monolithMarkus Eisele

How to fail with serverlessJeremy Daly

.Net Microservices with Event Sourcing, CQRS, Docker and... Windows Server 20...Javier García Magna

Stay productive while slicing up the monolithMarkus Eisele

Design for Scale / Surge 2010Christopher Brown

Webinar Slides: No Data Loss MySQL: Guaranteed Credit Card Transaction Availa...Continuent

John adams talk cloudyJohn Adams

MassTLC Cloud Summit KeynoteAriel Tseitlin

Integrating Microservices with Apache CamelChristian Posta

Similar to The Mushroom Cloud Effect - What happens when containers fail? (20)

When containers fail

Spotinst 'AWS Cost Optimization' Webinar - Jan 20th, 2016

Lessons learned running large real-world Docker environments

Site reliability in the Serverless age - Serverless Boston 2019

Webinar: Queues with RabbitMQ - Lorna Mitchell

Microservices with Apache Camel, Docker and Fabric8 v2

Building data intensive applications

Webinar Slides: AWS Aurora MySQL Replacement: Break Away From Geo-Limitations...

Architecting for failure - Why are distributed systems hard?

Concurrency at Scale: Evolution to Micro-Services

Release the Monkeys ! Testing in the Wild at Netflix

Stay productive while slicing up the monolith

How to fail with serverless

.Net Microservices with Event Sourcing, CQRS, Docker and... Windows Server 20...

Stay productive while slicing up the monolith

Design for Scale / Surge 2010

Webinar Slides: No Data Loss MySQL: Guaranteed Credit Card Transaction Availa...

John adams talk cloudy

MassTLC Cloud Summit Keynote

Integrating Microservices with Apache Camel

Recently uploaded

ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...Christina Lin

The Evolution of Karaoke From Analog to App.pdfPower Karaoke

Hand gesture recognition PROJECT PPT.pptxbodapatigopi8531

Building Real-Time Data Pipelines: Stream & Batch Processing workshop SlideChristina Lin

XpertSolvers: Your Partner in Building Innovative Software SolutionsMehedi Hasan Shohan

Unit 1.1 Excite Part 1, class 9, cbse...aditisharan08

Optimizing AI for immediate response in Smart CCTVshikhaohhpro

Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...MyIntelliSource, Inc.

Cloud Management Software Platforms: OpenStackVICTOR MAESTRE RAMIREZ

What is Binary Language? Computer Number SystemsJheuzeDellosa

Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfkalichargn70th171

The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...ICS

Call Girls in Naraina Delhi 💯Call Us 🔝8264348440🔝soniya singh

The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdfkalichargn70th171

cybersecurity notes for mca students for learningVitsRangannavar

Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...MyIntelliSource, Inc.

Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...kellynguyen01

Salesforce Certified Field Service ConsultantAxelRicardoTrocheRiq

KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxTier1 app

BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASEOrtus Solutions, Corp

Recently uploaded (20)

ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...

The Evolution of Karaoke From Analog to App.pdf

Hand gesture recognition PROJECT PPT.pptx

Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide

XpertSolvers: Your Partner in Building Innovative Software Solutions

Unit 1.1 Excite Part 1, class 9, cbse...

Optimizing AI for immediate response in Smart CCTV

Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...

Cloud Management Software Platforms: OpenStack

What is Binary Language? Computer Number Systems

Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf

The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...

Call Girls in Naraina Delhi 💯Call Us 🔝8264348440🔝

The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdf

cybersecurity notes for mca students for learning

Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...

Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...

Salesforce Certified Field Service Consultant

KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx

BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE

The Mushroom Cloud Effect - What happens when containers fail?

1. @mayralois Mushroom Cloud Effect Alois Mayr Technology Lead – Dynatrace

2. @mayralois Monitoring for Highly Dynamic Container Environments

3. @mayralois Source: http://www.schoonoart.de/ …there’s been the mushroom cloud effect oh yeah, everything screwed up

4. @mayralois The Mushroom Cloud Effect or What Happens When Containers Fail?

5. @mayralois Biggest LatAm e-commerce Company • ~ U$ 2.5 billion revenue • 4 sites: Americanas, Shoptime, Submarino, Soubarato • ~ 150 hosts across 4 regions • 5k-15k containers • 1k-3k services

6. @mayralois TL;DR

7. @mayralois About Cloud-Scale Systems

8. @mayralois Important Aspects… • Lots of (micro-)services • Lots of communication between services • Service dependencies • Versioning and API compatibilities • Zero downtime

9. @mayralois Platform-related Aspects • Most often container-based • Clustered for scalability • Ephemeral containers • Resilient architecture • Cross AZ fail-overs • SDN for communication

10. @mayralois Deployments are no Longer Static 7:00 a.m. Low load, service running with minimum redundancy 12:00 p.m. Scaled up service during peak load with failover of problematic node 7:00 p.m. Scaled back down to lower load, move to different geolocation

11. @mayralois Anatomy of dynamic environments https://www.dynatrace.com/en/ruxit/

12. @mayralois All About (Service) Dependencies

13. @mayralois Failing containers… …may or may not have an (immediate) impact on service performance

14. @mayralois Cascading Failures Lead to a Mushroom Cloud Effect

15. @mayralois

16. @mayralois The Hungry Container Breakdown • Shared /logs partition on host • No log rotation, no archiving for app logs • No proper log management used for Docker environment • Shared /logs partition ran out of space What was the problem?

17. @mayralois The Hungry Container Breakdown • Container health checks failed • Orchestration killed container and rescheduled new one • Still no free space on /logs • Termination and rescheduling • /var/lib/docker ran out of space • Cluster nodes were no longer able to run any containers How the problem has evolved over time?

18. @mayralois The Hungry Container Breakdown • Services at the top of the graph • Increased failure rates • Lots of depending Tomcat and DB services affected How the problem affected services?

19. @mayralois

20. @mayralois The Hungry Container Breakdown Log management tools for app logs --log-driver=none|syslog Remove container --rm=true /var/lib/docker deserves its own partition How the problem could have been avoided?

21. @mayralois The Hungry Container Breakdown Buggy Containers May Kill Your Nodes

22. @mayralois Try to Break Your Clusters Early (And be Prepared for Black Friday)

23. @mayralois Break Your Clusters Early Massive load testing! Survive three days of pain Include everything Services, Containers, Orchestration, EC2 instances

24. @mayralois Testing everything 13.3k containers (+nodes) 3,451 services

25. @mayralois

26. @mayralois Automation Needed to Pinpoint the Root Cause of Cascading Failures!

27. @mayralois Want to learn more? Stop by our booth! G15

28. @mayralois Thank you!

Editor's Notes

Who am I and what we do Dynatrace the monitoring company with a solution for highly dynamic cloud and container environments Many customers, lots of experience, production monitoring,
There are many campfire stories we could go into … but today I’m gonna talk about the Mushroom Cloud Effect Go into context here. Mushroom cloud effect we see very often in dynamic container environements with lots of microservices etc.
All the services should work as designed You need to be careful of api and service compatibilities Additional complexity through feature sets cross geo-location availability Usually designed for zero downtime Most of them apply to netflix website services
However, running such services often done with paas or paas like systems Container based One container, one service One service, multiple containers – somewhere across different hosts Containers are ephemeral, regularly killed and started Orchestration not only for scheduling but also for resilience at container level
This is an example of one of our customers – B2W the largest e-commerce provider in LatAm Left hand side: a single service in v5 provided by 5 containers – one is offline. The conainers run on separate hosts in 2 datacenters. This is the technical part Right hand side is a mess – all services communicate somehow to each other. Service dependencies and communication needs to be real-time
This is a way simpler environment that shows the flow of transactional apps.
Example of a small mushroom cloud effect Bottom: Hosts Middle part: processes / containers Top: services and applications

The Mushroom Cloud Effect - What happens when containers fail?

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Viewers also liked

Viewers also liked (18)

Similar to The Mushroom Cloud Effect - What happens when containers fail?

Similar to The Mushroom Cloud Effect - What happens when containers fail? (20)

Recently uploaded

Recently uploaded (20)

The Mushroom Cloud Effect - What happens when containers fail?

Editor's Notes