4. Major airline incident - introduction
• Started with a planned failover on the database cluster that served Core Facilities (CF)
• CF handled flight searches – critical, so it was designed for high availability
• CF was also used by self-service check-in kiosks, IVR, and “channel partner” applications
5. Major airline incident – outage facts
• Thursday evening, 11 pm: a team of engineers executed a manual database
failover from CF db1 to CF db2, then updated db1, then migrated the
database back to db1 and applied the same change to db2
• 12:30 am: the crew marked the change as “Completed, Success” and signed
off (no downtime)
• 2:30 am: all the check-in kiosks in the USA went red (stopped servicing requests)
• minutes later: the IVR servers went red too
• A Severity 1 case was opened immediately
• Priority – restore service: restart the CF and kiosk application servers
• Total elapsed time: approx. 3 hours
6. Major airline incident – consequences
• Cost the company hundreds of thousands of dollars
• When the kiosks went down, off-shift agents had to be called in
• It took until 3 pm to clear the backlog
• Delayed flights, reallocated gates
• Bad publicity for the airline in the media
• Affected the FAA’s annual report card, which measures customer complaints
and on-time arrivals/departures (less bonus money for the CEO)
7. Major airline incident – post-mortem
• Data to collect:
application servers: log files, thread dumps, and configuration files
database servers: configuration files for the db and the cluster server
compare current db configuration files to those from the nightly backup
• Thread dumps:
all threads were blocked inside SocketInputStream.socketRead(), vainly trying
to read a response that would never come
all of the blocked threads had called FlightSearch.lookupByCity()
8. Major airline incident – the culprit
public class FlightSearch implements SessionBean {
    private MonitoredDataSource connectionPool;

    public List lookupByCity(...) throws SQLException, RemoteException {
        Connection conn = null;
        Statement stmt = null;
        try {
            conn = connectionPool.getConnection();
            stmt = conn.createStatement();
            // ...
        } finally {
            // The bug: if stmt.close() throws SQLException, the next line
            // is never reached and the connection leaks out of the pool
            if (stmt != null) { stmt.close(); }
            if (conn != null) { conn.close(); }
        }
    }
}
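A sketch of the fix (not the airline's actual patch): nest the finally blocks so an exception from stmt.close() can never skip conn.close(). The Resource interface below is a stand-in for the JDBC Statement/Connection pair, so the example is self-contained.

```java
public class SafeCleanup {
    // Stand-in for a JDBC resource (Statement or Connection)
    public interface Resource { void close() throws Exception; }

    public static void release(Resource stmt, Resource conn) throws Exception {
        try {
            if (stmt != null) { stmt.close(); }   // may throw
        } finally {
            if (conn != null) { conn.close(); }   // still runs, pool is repaid
        }
    }
}
```

With this shape, an exception from stmt.close() still propagates, but the connection always returns to the pool; in modern Java, try-with-resources gives the same guarantee.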
9. What is stability
• Transaction = an abstract unit of work processed by the system
• System = the complete, interdependent set of hardware, applications,
and services required to process transactions for users
• Stability = system keeps processing transactions, even when there are
transient impulses, persistent stresses, or component failures disrupting
normal processing (users can still get work done)
• A component of the system that starts to fail before everything else
does = a crack in the system
• Cracks propagate!
• Tight coupling accelerates cracks
10. Major airline incident – avoid propagation
• The pool could have been configured to create more connections when
exhausted, or to block callers for a limited time rather than forever
• The client could have set a timeout on the RMI sockets
• CF servers could have been partitioned into more than one service group
• Use a circuit breaker
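The circuit-breaker idea above can be sketched as follows (the class and its thresholds are illustrative, not a production implementation): after a run of consecutive failures the breaker opens and fails fast instead of letting callers block on a dead resource, then allows a trial call after a cool-down period.

```java
import java.util.function.Supplier;

public class CircuitBreaker {
    private final int failureThreshold;
    private final long retryAfterMillis;
    private int consecutiveFailures = 0;
    private long openedAt = 0L;
    private boolean open = false;

    public CircuitBreaker(int failureThreshold, long retryAfterMillis) {
        this.failureThreshold = failureThreshold;
        this.retryAfterMillis = retryAfterMillis;
    }

    public <T> T call(Supplier<T> operation) {
        if (open && System.currentTimeMillis() - openedAt < retryAfterMillis) {
            // Fail fast: do not tie up a thread on a resource known to be down
            throw new IllegalStateException("circuit open: failing fast");
        }
        try {
            T result = operation.get();   // normal call, or half-open trial
            consecutiveFailures = 0;
            open = false;
            return result;
        } catch (RuntimeException e) {
            consecutiveFailures++;
            if (consecutiveFailures >= failureThreshold) {
                open = true;
                openedAt = System.currentTimeMillis();
            }
            throw e;
        }
    }

    public boolean isOpen() { return open; }
}
```

Had the CF clients wrapped lookupByCity() in something like this, the kiosks would have degraded gracefully instead of hanging every request-handling thread.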
12. What is capacity
• Performance measures how fast the system processes a single
transaction
• Throughput describes the number of transactions the system can process
in a given time span
• Capacity is the maximum throughput a system can sustain, for a given
workload, while maintaining an acceptable response time for each
individual transaction
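A small numeric illustration of how the three definitions relate, using Little's law (concurrency = throughput × response time), which the slide does not state but which ties the terms together; all numbers are made up.

```java
public class CapacityMath {
    // Sustainable throughput (transactions/sec) when `concurrency` requests
    // are in flight and each one takes `responseTimeSec` seconds
    public static double throughput(int concurrency, double responseTimeSec) {
        return concurrency / responseTimeSec;
    }
}
```

For example, 50 in-flight requests at 250 ms each yield 200 transactions/sec; capacity is the highest such throughput the system can sustain while response time stays acceptable.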
13. Retailer incident
• 300 people had worked for about 3 years to build a complete
replacement for the online store, content management, customer
service, and order-processing systems
• 9 am: the program manager hit the big red button and the system went live
• 9:05 am: 10,000 sessions active on the servers
• 9:10 am: 50,000 sessions active on the servers
• 9:30 am: 250,000 sessions active on the servers – CRASH!
14. Retailer incident – reasons for failure
• The number of sessions killed the site
• Each session got serialized and transmitted to a session backup server
after each page request (session replication enabled)
• Sessions were consuming RAM, CPU, and network bandwidth
• All load test scripts used cookies to track sessions
• In production:
Search engines drove customers to old-style URLs
Search engine spiders expected the site to support session tracking via URL
rewriting
Scrapers and shopbots did not handle cookies properly
15. Retailer incident – fixes
• Use server scripting to protect the site
• Added a gateway page that provided three critical capabilities:
if the requester did not handle cookies properly, the page redirected the
browser to a separate page that explained how to enable cookies
a throttle was set to determine what percentage of new sessions would be
allowed to the real home page
specific IP addresses could be blocked from hitting the site (shopbots, request floods)
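The throttle capability above can be sketched as follows (the class is hypothetical): admit only a configured percentage of new sessions to the real home page and send the rest to a lightweight static page, so session count stays within what the servers survived in load testing.

```java
import java.util.concurrent.ThreadLocalRandom;

public class SessionThrottle {
    private volatile int admitPercent;   // 0..100, tunable while the site runs

    public SessionThrottle(int admitPercent) { this.admitPercent = admitPercent; }

    public void setAdmitPercent(int p) { this.admitPercent = p; }

    // true -> forward to the real home page; false -> serve the static page
    public boolean admitNewSession() {
        return ThreadLocalRandom.current().nextInt(100) < admitPercent;
    }
}
```

Random admission is the simplest policy; a real gateway might instead cap the absolute number of live sessions.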
17. Zero-downtime deployments – Expansion
• Deploy new static files (images, stylesheets, JS)
• Create new service pools, if needed
• Add new tables
• Add new columns
• Run data migration scripts
• Add bridging triggers
• Apply ZDD (zero-downtime deployment) recursively to prepare secondary clusters
18. Zero-downtime deployments – Rollout
• For each server:
• Unpack code on the server
• Stop accepting new requests
• Shut down the server
• Point to the new code
• Start up the server
• Verify clean startup
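The per-server loop above can be sketched as follows (every operation is a hypothetical callback standing in for load-balancer and service-manager calls): the rollout halts at the first server that fails verification, so a bad build never reaches the whole fleet.

```java
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Predicate;

public class RollingDeploy {
    // Returns the number of servers successfully upgraded
    public static int rollOut(List<String> servers,
                              Consumer<String> upgradeOneServer,
                              Predicate<String> verifyCleanStartup) {
        int upgraded = 0;
        for (String server : servers) {
            upgradeOneServer.accept(server);   // unpack, drain, stop, relink, start
            if (!verifyCleanStartup.test(server)) {
                break;                         // halt the rollout on failure
            }
            upgraded++;
        }
        return upgraded;
    }
}
```

Because old and new code coexist during the rollout (thanks to the bridging triggers from the expansion phase), stopping partway leaves the site fully functional.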
19. Zero-downtime deployments – Cleanup
• Remove bridging triggers
• Remove obsolete referential integrity relations
• Remove obsolete columns
• Remove obsolete tables
• Add new referential integrity relations
• Add NOT NULL constraints
• Remove obsolete static files
• Remove the old code
• Remove old service pools