“Sh!@$ on Fire, Yo!”
True Stories Inspired by Real Events
Brendan Aye
Technical Director, Platform Architecture
James Webb
Member of Technical Staff
Platform and Infrastructure Engineering
§ 55 Team Members, including redundant and
geo-distributed Joes
§ Virtual Infrastructure
§ 5,000 Virtual Hosts
§ 50,000 Virtual Machines
§ CloudFoundry
§ 30 Foundations
§ 75,000 Application Instances
§ Kubernetes
§ 90 Clusters
§ 22,000 Pods
Who We Are
T-Mobile Confidential
Platform KPIs
Synthetic Transactions
BlackBox Monitoring
Server Infrastructure
Network Infrastructure
Slack
Application Requests
Container Metrics
What Do We Watch?
Architecting a Highly Available CloudFoundry App
[Diagram: Clients resolve Myapp.geo.cf.mydomain.com through GSLB, which steers each request to the load balancer fronting Foundation A, B, or C; all three foundations serve the same Myapp.geo.cf.mydomain.com route.]
§ Platform team built a shiny GSLB-as-a-service
§ Customer team consumed the shiny GSLB-as-a-service
§ Clients queried GSLB to determine an endpoint, then established persistent HTTP connections
§ App teams took one region out of load, which correctly de-registered it with GSLB
§ Persistent connections never needed to query GSLB again, and the load balancer kept them alive… ☹
What Went Wrong?
§ Improved Documentation! GSLB is only one
method to load balance application traffic, so
explaining its benefits and drawbacks is
crucial to a successful partnership.
§ Sharing incident post-mortem with GSLB
customers so they understand what went
wrong, and how they can plan for expected
failure.
§ Suggesting disabling HTTP keep-alive when
using GSLB
§ Investing in alternative platform-supported
load balancing methodologies.
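For clients built on `java.net.HttpURLConnection`, the keep-alive suggestion above can be applied with the JDK's `http.keepAlive` system property, which disables connection reuse so every request resolves through GSLB again. A minimal sketch (the property is standard JDK networking behavior; apps using other HTTP client libraries need their own connection-reuse settings):

```java
public class DisableKeepAlive {
    public static void main(String[] args) {
        // Turn off HTTP persistent connections for java.net.HttpURLConnection.
        // Each request then opens a fresh connection, so a region that GSLB
        // has de-registered stops receiving traffic once DNS updates.
        // Must be set before the first HTTP request is made.
        System.setProperty("http.keepAlive", "false");
        System.out.println("http.keepAlive=" + System.getProperty("http.keepAlive"));
    }
}
```

The same effect can be achieved without a code change via `java -Dhttp.keepAlive=false …` on the launch command.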
How Did We Get Better?
§ Homebrew Java Application running on
WebLogic
§ Running in a single Kubernetes cluster, but
with many instances spread across
multiple share-nothing AZs
§ Application upgrades and restarts
working fine and not causing any impacts
to service
§ Multi-tenant cluster managed by Platform
Team with daytime upgrades planned
during CloudFoundry Summit 2019
Anatomy of a Failing Kubernetes App
§ Cluster upgrades kicked off with max-in-flight
of one
§ As nodes quickly cycled through upgrades,
application had fewer and fewer ‘ready’
pods
§ By the time remaining nodes were upgraded,
all customer pods were in a crashed state
and failing to come back up
§ Management was displeased with our
daytime upgrades performed without a
Change Request, leading to a P1 Incident
What Went Wrong?
§ Switched the application to read entropy from
/dev/urandom instead of /dev/random, so
entropy reads no longer block
§ Customers implemented Pod Disruption
Budgets (PDB) to maintain a minimum
of 66% of ready pods before upgrades
can proceed
§ File a Change Request for anything that
touches a customer-facing cluster (yes,
even non-production)
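The Pod Disruption Budget bullet above can be sketched as a manifest (the name and label selector are illustrative; `minAvailable: 66%` makes the eviction API refuse voluntary disruptions, such as node drains during cluster upgrades, that would drop ready pods below two-thirds):

```yaml
apiVersion: policy/v1        # older clusters (pre-1.21) use policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: myapp-pdb            # illustrative name
spec:
  minAvailable: 66%          # keep at least 66% of pods ready during drains
  selector:
    matchLabels:
      app: myapp             # illustrative label; must match the app's pods
```

With this in place, a max-in-flight-of-one upgrade pauses at each drain until enough replacement pods report ready, instead of cycling nodes faster than the app can recover.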
How Did We Get Better?
How Do You
Prevent
Incidents?
§ Adopt a policy of radical transparency
with your customers
§ Assume your customers are right until
you can demonstrate otherwise
§ Avoid seeing Mean-Time-To-Blame as a
useful KPI
§ When your platform is at fault, accept
responsibility, fix the issue, and explain
how you’ll improve
§ When a customer is doing something
that will lead to failure, ensure your
concern is heard and partner for
success
You
Can’t
Let’s talk