Failure Happens - Reliability and how to run large websites.

From crucially, 2 months ago

Talk from Web 2.0 Expo San Francisco 2008 on how to run large webs…

Slideshow transcript

Slide 1: Failure Happens F***, the f*****g thing is f****d What broke and what we learned

Slide 2: Redundancy Redundancy, in general terms, refers to the quality or state of being redundant, that is: exceeding what is necessary or normal; or duplication. This can have a negative connotation, especially in rhetoric: superfluous or repetitive; or a positive implication, especially in engineering: serving as a duplicate for preventing failure of an entire system.

Slide 3: Jesse Robbins Artur Bergman

Slide 4: Artur Bergman Jesse Robbins

Slide 5: • Jesse – Runs ops for Etelos – Firefighter/EMT – Emergency Manager • Katrina – Experiences running large websites – Had the best title ever “Master of Disaster” • Artur – Runs ops & engineering for Wikia – Experiences of running large websites, enterprise (boring) and stock exchanges – Core Perl developer, long development background • Both of us – Write for O’Reilly Radar – Speak at conferences – Annoy our peers and coworkers – Agree on nearly everything

Slide 6: Redundant

Slide 7: Jesse is sick • Thankfully, we have high availability – Hence this talk • Jesse has a 98% availability • I am more honest, probably more like 90% excluding the time I sleep • Our combined availability is 99.84% • His war stories will be missing
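
The combined figure on Slide 7 comes from treating the two speakers as redundant components: the pair is unavailable only when both are unavailable at once. A minimal sketch of that arithmetic, assuming independent failures (the function name is made up for illustration):

```python
def combined_availability(*availabilities):
    """Availability of redundant components, assuming independent failures."""
    unavailable = 1.0
    for availability in availabilities:
        # The whole system is down only when every component is down.
        unavailable *= 1.0 - availability
    return 1.0 - unavailable


# Jesse at 98%, Artur at roughly 90%
print(f"{combined_availability(0.98, 0.90):.2%}")
```

With exactly 98% and 90% this works out to 99.80%; the 99.84% on the slide would correspond to a second availability of about 92%, which fits the slide's "probably more like 90%".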

Slide 8: June 23-24, 2008 Jesse & Steve

Slide 9: 364.96 Main • San Francisco data center • Hosts a lot of Web 2.0 companies • Power outage • 24 July 2007 – A day I am sure a lot of people remember fondly

Slide 12: Mistakes • Generator 3 took down 1 and 4 – 200% more outage than needed • But really? – Not 365 Main’s fault

Slide 13: Failure happens • A single datacenter is the problem – Since they all fail at some point • Recovery procedures after failure – Power was gone ~45 minutes – Most services took hours to come back – Some unnamed ones more than 12 hours • Communication – All DNS servers in the same datacenter!

Slide 15: Radar article • Disaster recovery plans exist on a different continuum, affecting not just operations but also your entire organisation's response to disasters. • An earthquake is a question of when, not if. Are the startups ready for this? How long will we expect them to be gone? Several of the world's largest websites went down. None of them were ready for a datacenter outage. None of them had backup datacenters or fail over that worked. • None even had a coherent strategy for communicating the situation to the rest of the world.

Slide 16: Futility of MTBF • Mean time between failures – Vendors quote you this all the time • Irrelevant! • Failure is inevitable • 365 Main probably had an excellent aggregated MTBF – But when something fails, the mean time to the next failure is hardly going to make you feel better

Slide 17: MTTR • Mean time to recovery • A short MTTR would have drastically reduced the severity of the power outage, even without hot standby • No one cares if you fail once a minute – If you recover in 50 ms • If you are down 1 minute a week, you are still going to hit 4 nines (99.99%)

Slide 18: Nines (roughly) • 99% 5000 Minutes / Year 3.5 Days

Slide 19: Nines (roughly) • 99% 5000 Min / Year (3.5 days) • 99.9% 500 Min / Year ( 8 hours )

Slide 20: Nines (roughly) • 99% 5000 Min / Year (3.5 days) • 99.9% 500 Min / Year ( 8 hours ) • 99.99% 50 Min / Year

Slide 21: Nines (roughly) • 99% 5000 Min / Year (3.5 days) • 99.9% 500 Min / Year ( 8 hours ) • 99.99% 50 Min / Year • 99.999% 5 Min / Year

Slide 22: Nines (roughly) • 99% 5000 Min / Year (3.5 days) • 99.9% 500 Min / Year ( 8 hours ) • 99.99% 50 Min / Year • 99.999% 5 Min / Year • 99.9999% 30 Seconds / Year

Slide 23: Nines (roughly) • 99% 5000 Min / Year (3.5 days) • 99.9% 500 Min / Year ( 8 hours ) • 99.99% 50 Min / Year • 99.999% 5 Min / Year • 99.9999% 30 Seconds / Year • 99.99999% 3 Seconds / Year
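
The rounded figures in the "Nines" slides can be recomputed directly. A short sketch, assuming a 365-day year (the function name is illustrative):

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; leap years ignored


def downtime_minutes_per_year(availability_percent):
    """Downtime budget per year, in minutes, for a given availability."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


for nines in ("99", "99.9", "99.99", "99.999", "99.9999", "99.99999"):
    minutes = downtime_minutes_per_year(float(nines))
    print(f"{nines}%: {minutes:.2f} min/year ({minutes * 60:.0f} s)")
```

Four nines allows about 52.56 minutes a year, which is why one minute of downtime a week (52 minutes a year) still just qualifies.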

Slide 24: Irrelevance of the nines • Blizzard – $520 million in profit last year • World of Warcraft – 10 million players • 98-99% – By design

Slide 25: Train your users • Scheduled Downtime each week • Very little redundancy • Server failure – Up to 10 minutes of data loss • Been like this from the beginning

Slide 26: “We pay them money, so we have to accept the downtime.”

Slide 27: Reliability • Don’t aim too high unless – Banks – Space shuttles – Lung/heart machines • The higher you aim – Increases complexity (exponentially) – The harder you fail

Slide 29: Complexity killed the cat

Slide 30: Jan-Feb 2008 - Source pingdom.com
Site                  URL                   Downtime
Bebo                  www.bebo.com          12h 28m
Windows Live Spaces   spaces.live.com       7h 25m
Friendster            www.friendster.com    6h 0m
hi5                   www.hi5.com           5h 5m
Reunion.com           www.reunion.com       2h 55m
LinkedIn              www.linkedin.com      4h 0m
Classmates.com        www.classmates.com    2h 5m
Facebook              www.facebook.com      1h 35m
Orkut                 www.orkut.com         1h 10m
Last.fm               www.last.fm           1h 10m
Xanga                 www.xanga.com         45m
MySpace               www.myspace.com       25m
LiveJournal           www.livejournal.com   10m
Yahoo! 360            360.yahoo.com         5m

Slide 31: Jan-Feb 2008 - Source pingdom.com
Site                  URL                   Downtime
Bebo                  www.bebo.com          12h 28m   $800 MM
Windows Live Spaces   spaces.live.com       7h 25m
Friendster            www.friendster.com    6h 0m
hi5                   www.hi5.com           5h 5m
Reunion.com           www.reunion.com       2h 55m
LinkedIn              www.linkedin.com      4h 0m
Classmates.com        www.classmates.com    2h 5m
Facebook              www.facebook.com      1h 35m
Orkut                 www.orkut.com         1h 10m
Last.fm               www.last.fm           1h 10m
Xanga                 www.xanga.com         45m
MySpace               www.myspace.com       25m
LiveJournal           www.livejournal.com   10m
Yahoo! 360            360.yahoo.com         5m

Slide 32: Measurement • How do you measure uptime? • Ping doesn’t work • Connect • Your view is limited from your monitoring stations • Network problems outside your control – Hello Cogent

Slide 33: Measurement • Look at the traffic – The data is there – HTML delivery time – Image delivery time – TCP packet loss – Use an image call to collect end user performance metrics • Calculate expected traffic rates – Benchmark against that (bandwidth curves should be smooth!) – I always watch the bandwidth • Wikipedia method – How many people complain on IRC?
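
Benchmarking against an expected traffic rate can be as simple as flagging samples that stray too far from the baseline. A rough sketch, where the 30% tolerance, the function name and the sample numbers are all assumptions for illustration:

```python
def traffic_anomalies(observed, expected, tolerance=0.3):
    """Return indexes of samples that deviate from the expected rate
    by more than `tolerance` (as a fraction of the expected value)."""
    flagged = []
    for index, (seen, want) in enumerate(zip(observed, expected)):
        if want > 0 and abs(seen - want) / want > tolerance:
            flagged.append(index)
    return flagged


# Hypothetical requests/sec samples; the third one is a sudden drop.
expected = [1000, 1100, 1200, 1250]
observed = [980, 1120, 400, 1230]
print(traffic_anomalies(observed, expected))  # -> [2]
```

A smooth curve produces no flags; a step like the third sample is the kind of discontinuity worth looking at.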

Slide 34: Outage?

Slide 35: Outage!

Slide 36: Youtube vs BGP vs Pakistan • BGP runs your internet – Protocol for routers to share routing data – How to get from me to somewhere else • Each organization has an AS number • Each router keeps track of the number of AS numbers to the destination over different routes • Chooses the shortest one
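
The "shortest one" rule on Slide 36 can be sketched as a comparison of AS path lengths. Real BGP applies local policy and several other tie-breakers as well; this models only the hop count, and the neighbour names and AS numbers are hypothetical:

```python
def best_route(routes):
    """Pick the neighbour advertising the AS path with the fewest hops.

    `routes` maps a neighbour name to the list of AS numbers it advertises
    for the destination.
    """
    return min(routes, key=lambda neighbour: len(routes[neighbour]))


# Hypothetical AS numbers taken from the private-use range.
routes = {
    "isp_a": [64512, 64513, 64520],
    "isp_b": [64514, 64520],
}
print(best_route(routes))  # -> isp_b
```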

Slide 37: Anycast / Multihoming • BGP allows you to tell multiple ISPs that you are capable of handling a network • Traffic will flow the “shortest” path • If a link goes down, that router-router BGP session goes away and the route is then withdrawn through the system • “BGP Convergence” – Don’t ask what it really means

Slide 38: Networks and prefixes • Each netblock is subclassed and has a prefix. • People mostly know /24, which is 256 addresses • /23 is twice that • /8 is a vast quantity
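
Prefix sizes are easy to check with Python's standard `ipaddress` module. The 10.0.0.0 blocks below are private addresses picked purely for illustration:

```python
import ipaddress


def addresses_in(prefix):
    """Number of addresses covered by a CIDR prefix."""
    return ipaddress.ip_network(prefix).num_addresses


for prefix in ("10.0.0.0/24", "10.0.0.0/23", "10.0.0.0/22", "10.0.0.0/8"):
    print(f"{prefix}: {addresses_in(prefix)} addresses")
```

Each bit removed from the prefix length doubles the block: a /24 covers 256 addresses, a /23 covers 512, and a /8 covers 16,777,216.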

Slide 39: IP Conservation vs Routing table conservation • We are running out of IPs • Our routing table is growing fast • To limit the growth of the routing table, routers will usually block any routes more specific than /24 • Youtube was being a good citizen and broadcasting one /22 instead of four /24s

Slide 40: Pakistan Telecom • Government orders ban of Youtube • PT achieves this by broadcasting a BGP route for one of Youtube’s IP ranges using a /24 prefix – Sadly, they did this to the entire world • Routers choose the most specific route first, so /24 wins over /22 • All of Youtube’s traffic went to Pakistan
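
Why the /24 wins can be shown with a longest-prefix match. This is a sketch only: the prefixes are private-range stand-ins rather than the real announcements, and the function name and origin labels are hypothetical:

```python
import ipaddress


def pick_route(destination, announcements):
    """Return the (prefix, origin) pair whose prefix is the most specific
    match for `destination`, or None if nothing matches."""
    address = ipaddress.ip_address(destination)
    matches = [
        (ipaddress.ip_network(prefix), origin)
        for prefix, origin in announcements
        if address in ipaddress.ip_network(prefix)
    ]
    if not matches:
        return None
    network, origin = max(matches, key=lambda match: match[0].prefixlen)
    return str(network), origin


announcements = [
    ("10.0.0.0/22", "legitimate origin"),  # the aggregated announcement
    ("10.0.1.0/24", "hijacker"),           # the more specific one
]
print(pick_route("10.0.1.17", announcements))  # -> ('10.0.1.0/24', 'hijacker')
print(pick_route("10.0.2.17", announcements))  # -> ('10.0.0.0/22', 'legitimate origin')
```

Only addresses inside the hijacked /24 are diverted; the rest of the /22 still reaches the legitimate origin.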

Slide 41: Try reaching for 4 nines • A BGP error anywhere can quickly bring you down • Thank the souls running the large ISPs’ core networking. – They are the reason it works • The only way to solve this is to be a bad citizen and spam the table with more routes. But even that doesn’t fully protect you from local outages

Slide 42: June 23-24, 2008 Jesse & Steve

Slide 43: Value of reliability (operations and performance) • Bad reliability is a waste of R&D • Why develop if you can’t deliver? • Operations is always treated as the stepchild of Engineering • But with no reliability, no company • Fixed amount of time + faster site = more page views

Slide 44: Speed / Reliability • Important • Direct correlation between speed and user interaction • Brand name relies on reliability

Slide 46: Requests /sec Response time

Slide 47: Requests /sec Response time

Slide 48: Nothing matters • This entire conference! • Any cool features! • Unless it works

Slide 49: Cost benefit • Cost of delivery • Revenue earned • Increase cost for more complexity

Slide 50: Metrics you need • Cost per page view • Cost per specific feature/page • This is key: what you should prioritize and what you should do depend on these numbers • How else can you value it? • Don’t always go for cheap; sometimes it is better to buy time using money, sometimes not.
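
Cost per page view is a plain ratio, but computing it per feature makes the comparison concrete. A sketch with entirely hypothetical monthly figures and feature names:

```python
def cost_per_page_view(monthly_cost, monthly_page_views):
    """Cost of serving a single page view, in the currency of `monthly_cost`."""
    return monthly_cost / monthly_page_views


# Hypothetical numbers for illustration only: (monthly cost in $, page views)
features = {
    "article pages": (12_000.0, 40_000_000),
    "search": (6_000.0, 2_000_000),
}
for name, (cost, views) in features.items():
    per_thousand = cost_per_page_view(cost, views) * 1000
    print(f"{name}: ${per_thousand:.2f} per 1000 page views")
```

In this made-up case search costs ten times as much per view as article pages, which is the kind of number that tells you where optimisation effort pays off.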

Slide 51: Operational Engineers • Ops stepchild of development? – Ops is staffed with failed developers • Fire them • Hire good ones • Who are passionate to learn and explore the entire stack

Slide 52: My story • Software developer • Interested in ops • I always get transferred to ops – Fixing the same problems every time • (Save me, go to Velocity and learn!) • I bring engineering to ops, and a way to look at the entire system

Slide 54: Pyromaniac Paranoid

Slide 56: Backups / High Availability • Don’t confuse them • Backups protect your data • High Availability keeps your site running • MySQL replication is a valid HA solution • But it won’t help you with – DROP TABLE;

Slide 57: Debugging • 9 Rules of debugging • http://www.debuggingrules.com/Poster_download.html – Yes the font is horrible

Slide 58: Rule 1: Understand the system • Complexity Kills • No excuse • If you write it, you must know it • If you run it, you must know it • If you buy it, you must know it

Slide 59: Rule 3: Quit thinking and look • “It is a capital mistake to theorize before one has data. Insensibly one begins to twist facts to suit theories, instead of theories to suit facts.”

Slide 60: Rule 3: Quit thinking and look • What do you look at? • The importance of monitoring • Monitoring • Monitoring • Monitoring

Slide 61: My my, confusing term • Monitoring • Alerting • Trending

Slide 62: Alerting • Acts on monitoring data • Severe alerts – Active – Needs action • Passive alerts – Things that need to be done but not right now • DO NOT OVER ALERT • DO NOT CRY WOLF

Slide 63: Wikia alerting strategy • When the site is slow • Or down • We send emails and do phone calls • Europe and US West coast • Looking to hire in East Asia • No night time

Slide 64: Trending • Long term • Capacity planning

Slide 65: Ganglia • We love ganglia • Automatically graphs everything you want - just works • Large scale clusters • Multicast • Zero config • RRD

Slide 66: http://ganglia.wikimedia.org/ • 270 hosts • 880 CPU • 2 clusters • 1.2 TB of Memory

Slide 67: http://ganglia.wikimedia.org

Slide 68: Custom Ganglia Gmetrics • Write your own gmetric --name='Oldest query' --type=int32 --units='sec' --dmax=65 --value=`echo ' show processlist' | mysql -uroot -ppass | grep -v Sleep | grep -v 'system user' | head -2 | tail -1 | cut -f 6`
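
The same metric can be computed in a script instead of a shell pipeline. The sketch below parses tab-separated `show processlist` output and builds the `gmetric` invocation using the flags from the slide. It takes the longest-running non-idle query rather than the first remaining row, which is a deliberate change from the pipeline; the function names and the sample data are made up.

```python
def oldest_query_seconds(processlist_tsv):
    """Longest running time, in seconds, among non-idle client queries in
    tab-separated `show processlist` output (header row first)."""
    lines = processlist_tsv.strip().splitlines()
    header = lines[0].split("\t")
    user_col = header.index("User")
    command_col = header.index("Command")
    time_col = header.index("Time")
    times = []
    for line in lines[1:]:
        fields = line.split("\t")
        # Skip idle connections and replication threads, as the slide does.
        if fields[command_col] == "Sleep" or fields[user_col] == "system user":
            continue
        times.append(int(fields[time_col]))
    return max(times, default=0)


def gmetric_command(seconds):
    """Argument list for gmetric, mirroring the flags shown on the slide."""
    return ["gmetric", "--name=Oldest query", "--type=int32",
            "--units=sec", "--dmax=65", f"--value={seconds}"]


# Hypothetical processlist output for illustration.
sample = (
    "Id\tUser\tHost\tdb\tCommand\tTime\tState\tInfo\n"
    "1\tsystem user\t\tNULL\tConnect\t9000\tWaiting\tNULL\n"
    "2\tapp\tweb1:5000\twiki\tSleep\t120\t\tNULL\n"
    "3\tapp\tweb2:5001\twiki\tQuery\t7\tSending data\tSELECT 1\n"
    "4\troot\tlocalhost\tNULL\tQuery\t0\tNULL\tshow processlist\n"
)
print(gmetric_command(oldest_query_seconds(sample)))
```

To publish the value, the returned list could be handed to `subprocess.run` from a cron job; that step is left out here so the sketch runs without Ganglia or MySQL installed.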

Slide 69: Something is wrong • Don’t worry, data warehouse

Slide 70: Problem found • If it is critical, start a phone conversation • Use IRC to communicate technical data • One person liaises with non technical staff • One person specifically in command • Sleep scheduling ( audit log important )

Slide 71: Post crisis • Root cause analysis – Just find out what went wrong – And how to avoid it – Or fix it faster next time if you can’t • Keep track of your uptime

Slide 72: Automation • All machines are created equal • Seriously • If you manually make changes • You are wrong – Unless you know what you are doing

Slide 73: Best practices • Version control • Gold images • Centralised authentication • Time Sync ( NTP ) • Central logging • ( All of this applies for virtual machines too!)

Slide 74: Puppet • New hip kid on the block • Written in ruby • Better support? • Much nicer syntax • Easier to extend

Slide 75: tcpdump / wireshark • If you suspect the network • Don’t just suspect • LOOK AT IT • tcpdump / wireshark will tell you – If your packets are lost, delayed or corrupted – Your windowing is wrong

Slide 76: Puppet • Automated machine configuration • Automation is key • Our Motd states “If you change anything locally, I will hunt down and kill you”

Slide 77: Rule 4: Divide and Conquer • Look at the problems in turn • Split between people • Go in the order you suspect is the most likely

Slide 78: Rule 5: Change one thing at a time • I cannot stress this enough • IF YOU DO NOT THEN YOU HAVE FAILED TO IDENTIFY THE PROBLEM

Slide 79: Rule 6: Keep an audit trail • You might be making things worse • Good for the root cause analysis • Have your shell log all commands – Good practice anyway • Version control

Slide 80: Rule 9: If you didn’t fix it, it ain’t fixed • You must do something to fix a problem • Or it will bite you again • And again • And again • They don’t just appear and disappear • Except BGP route convergence :)

Slide 82: Good Book!

Slide 83: “multiple and unexpected interactions of failures are inevitable” -Charles Perrow

Slide 84: shit happens. Sky@crucially.net Jesse@oreilly.com