Talk given by Mike Christian, VP, Site Reliability Engineering at Salesforce, at SRECon in April 2016
We pay an incredible amount of money to put our infrastructure into Tier 3/4 data centers, and still things fail. Worse yet, many of the failures are due to the complexity we've added to the system in a futile attempt to make it more "reliable." It's a false economy, as there is no such thing as a reliable data center. Switching to a resilience model not only solves the problem of keeping the service running when things fail; it also lets us build much lower tier data centers, for a fraction of the cost. No HVAC. No Generators. No UPS. Just an ephemeral building full of ephemeral servers.
This talk will consist of real life failure examples, coupled with resiliency strategies that would have saved the day.
4. Classic DR Planning:
“So we don’t need to put that much effort into it.”
“We will most likely never actually need to use this…”
“We’re okay, we bought some insurance.”
“Box… Checked!”
9. A partner once told me:
“We have never lost a datacenter.”
10. Then, one week later, they did.
“No, it was just the edge routers.”
12. Even the best datacenters DO go down!
For all sorts of unexpected reasons:
• Network instability
• HVAC failures
• Power distribution and switching
• UPS failures
• Generator failures
• Leaky roofs and equipment
• Inadvertent fire suppression
• Partial nuclear meltdown
• Squirrels vs Internet!
Maintenance & Migrations:
• Core router upgrades
• Infrastructure maintenance
• Massive migrations
17. “When the servers crashed, effectively putting eBay out of business, Meg gathered her team, along with the best technology experts from around Silicon Valley, and they stayed there until it was fixed. Literally, sleeping at the office.”
Quote from Meg Whitman for Governor
19. “Remember that Meg Whitman story? I was that guy! No kidding! I was cleaning up some space on the database, so I dropped an unused table, and it ended up being the main customer index. Doh! Meg came in and brought a ton of Oracle experts to sit with me all night, and rebuild the tablespace. DUDE, I LITERALLY SAVED EBAY!”
- Ops Superhero
- (Paraphrased)
- (and Interview Candidate)
36. The Collapse of Complex Societies
Joseph Tainter (Anthropologist/Historian)
37. So What?
• As our platforms grow, scale, and evolve, they will develop innate complexity in an almost organic way.
• Redundancy adds additional complexity on top of that.
• Too many layers of redundancy on top of an already complex system will ultimately lead to a decrease in reliability.
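The claim that stacked redundancy can reduce reliability can be illustrated with a toy calculation. This is only a sketch under stated assumptions, not a model from the talk: failures are treated as independent, a group is "up" if at least one copy is up, and every copy beyond the first is assumed to add one more piece of failover machinery that must itself be working. The function names and the figures 0.99 and 0.999 are made up for illustration.

```python
def availability(copies: int, component: float = 0.99, switch: float = 0.999) -> float:
    """Toy availability of `copies` redundant components.

    Assumptions (all hypothetical): failures are independent, the group is up
    if at least one copy is up, and every copy beyond the first adds one more
    piece of failover machinery (`switch`) that must also be working.
    """
    at_least_one_up = 1.0 - (1.0 - component) ** copies
    failover_machinery_up = switch ** (copies - 1)
    return at_least_one_up * failover_machinery_up


def best_copy_count(max_copies: int = 6) -> int:
    """Return the copy count with the highest availability in this toy model."""
    return max(range(1, max_copies + 1), key=availability)


if __name__ == "__main__":
    for n in range(1, 7):
        print(n, round(availability(n), 6))
    print("best:", best_copy_count())
```

With these illustrative numbers the second copy helps, and each further copy lowers availability, because the added machinery fails more often than the extra copy saves. Different numbers move the turning point; the shape of the curve is what matters.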
39. So What Now?
• Dr. Richard Cook had it right.
• At massive scale (and especially as we grow), the law of diminishing returns means our industry cannot build “reliable” systems.
• But by embracing failure as a fundamental truth, we can design around it.
• We need to stop focusing on reliable systems, and focus on resilient systems.
• And we need to do that globally.
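A rough way to see why resilience across sites can beat reliability within one site is a k-of-n calculation. This is a minimal sketch, assuming each site fails independently of the others (an assumption that correlated failures, such as a bad software push or a dropped table, would break). The function name and the availability figures are hypothetical.

```python
from math import comb


def service_availability(site: float, sites: int, needed: int) -> float:
    """Probability that at least `needed` of `sites` sites are up.

    Assumes every site has the same availability `site` and that sites
    fail independently (a simplification, not a claim about real outages).
    """
    return sum(
        comb(sites, k) * site**k * (1.0 - site) ** (sites - k)
        for k in range(needed, sites + 1)
    )


if __name__ == "__main__":
    # One hypothetical high-tier site versus several cheaper, less reliable ones.
    print("1 site  @ 0.9999:", service_availability(0.9999, 1, 1))
    print("3 sites @ 0.99, need 1:", service_availability(0.99, 3, 1))
    print("4 sites @ 0.99, need 2:", service_availability(0.99, 4, 2))
```

Under these assumptions, a handful of modest sites with spare capacity come out ahead of a single much more reliable site, which is the trade the talk argues for.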