“Sh!@$ on Fire, Yo!”
True Stories Inspired by Real Events
Brendan Aye
Technical Director, Platform Architecture
James Webb
Member of Technical Staff
Platform and Infrastructure Engineering
§ 55 Team Members, including redundant and
geo-distributed Joes
§ Virtual Infrastructure
§ 5,000 Virtual Hosts
§ 50,000 Virtual Machines
§ CloudFoundry
§ 30 Foundations
§ 75,000 Application Instances
§ Kubernetes
§ 90 Clusters
§ 22,000 Pods
Who We Are
T-Mobile Confidential
Platform KPIs
Synthetic Transactions
BlackBox Monitoring
Server Infrastructure
Network Infrastructure
Slack
Application Requests
Container Metrics
What Do We Watch?
Architecting a Highly Available CloudFoundry App
[Diagram: Clients resolve Myapp.geo.cf.mydomain.com through GSLB, which steers each request to the load balancer fronting Foundation A, B, or C; all three foundations serve the same Myapp.geo.cf.mydomain.com route.]
§ Platform team built a shiny GSLB-as-a-service
§ Customer team consumed the shiny GSLB-as-a-service
§ Clients queried GSLB to determine an endpoint, then established persistent HTTP connections
§ App teams took one region out of load, which correctly de-registered it with GSLB
§ Persistent connections never needed to query GSLB again, and the load balancer kept them alive… ☹
What Went Wrong?
§ Improved Documentation! GSLB is only one
method to load balance application traffic, so
explaining its benefits and drawbacks is
crucial to a successful partnership.
§ Sharing incident post-mortem with GSLB
customers so they understand what went
wrong, and how they can plan for expected
failure.
§ Suggesting disabling HTTP keep-alive when
using GSLB
§ Investing in alternative platform-supported
load balancing methodologies.
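For clients built on `java.net.HttpURLConnection`, the keep-alive suggestion above can be applied with the JDK's `http.keepAlive` system property, which disables connection reuse so every request resolves through GSLB again. A minimal sketch (the property is standard JDK networking behavior; apps using other HTTP client libraries need their own connection-reuse settings):

```java
public class DisableKeepAlive {
    public static void main(String[] args) {
        // Turn off HTTP persistent connections for java.net.HttpURLConnection.
        // Each request then opens a fresh connection, so a region that GSLB
        // has de-registered stops receiving traffic once DNS updates.
        // Must be set before the first HTTP request is made.
        System.setProperty("http.keepAlive", "false");
        System.out.println("http.keepAlive=" + System.getProperty("http.keepAlive"));
    }
}
```

The same effect can be achieved without a code change via `java -Dhttp.keepAlive=false …` on the launch command.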
How Did We Get Better?
§ Homebrew Java Application running on
WebLogic
§ Running in a single Kubernetes cluster, but
with many instances spread across
multiple share-nothing AZs
§ Application upgrades and restarts
working fine and not causing any impacts
to service
§ Multi-tenant cluster managed by Platform
Team with daytime upgrades planned
during CloudFoundry Summit 2019
Anatomy of a Failing Kubernetes App
§ Cluster upgrades kicked off with max-in-flight
of one
§ As nodes quickly cycled through upgrades,
application had fewer and fewer ‘ready’
pods
§ By the time remaining nodes were upgraded,
all customer pods were in a crashed state
and failing to come back up
§ Management was displeased with our
daytime upgrades performed without a
Change Request, leading to a P1 Incident
What Went Wrong?
§ Switched the application to read entropy from
/dev/urandom instead of /dev/random, so
entropy reads no longer block
§ Customers implemented Pod Disruption
Budgets (PDB) to maintain a minimum
of 66% of ready pods before upgrades
can proceed
§ File a Change Request for anything that
touches a customer-facing cluster (yes,
even non-production)
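The Pod Disruption Budget bullet above can be sketched as a manifest (the name and label selector are illustrative; `minAvailable: 66%` makes the eviction API refuse voluntary disruptions, such as node drains during cluster upgrades, that would drop ready pods below two-thirds):

```yaml
apiVersion: policy/v1        # older clusters (pre-1.21) use policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: myapp-pdb            # illustrative name
spec:
  minAvailable: 66%          # keep at least 66% of pods ready during drains
  selector:
    matchLabels:
      app: myapp             # illustrative label; must match the app's pods
```

With this in place, a max-in-flight-of-one upgrade pauses at each drain until enough replacement pods report ready, instead of cycling nodes faster than the app can recover.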
How Did We Get Better?
How Do You
Prevent
Incidents?
§ Adopt a policy of radical transparency
with your customers
§ Assume your customers are right until
you can demonstrate otherwise
§ Avoid seeing Mean-Time-To-Blame as a
useful KPI
§ When your platform is at fault, accept
responsibility, fix the issue, and explain
how you’ll improve
§ When a customer is doing something
that will lead to failure, ensure your
concern is heard and partner for
success
You
Can’t
Let’s talk