Managing high availability with low cost


Omar del Rio of Sieena Consulting gave this presentation at the Pre-MIX11 event ROCK!

  • Mention how the most scalable systems are not always the most available (banks). Mention how the most reliable systems are not the most scalable (phones, specific-purpose stuff). Analogy with orange chicken.
  • Explain the process a bit more. Tell the story from the perspective of those who use it and those who receive a benefit. Also discuss the AR example and Call routing and call tracking.
  • For reliability, store your session data in the database. For reliability, everything must be redundant. For availability, the system must be designed from the ground up.
  • Explain how scalability is a moving target; maybe add the graphics from IAR. Explain how a system that works now becomes obsolete faster than you think. Mention the ongoing debate over “MySpace failed because of Microsoft”. Lessons from the MS bugs in the session handling code, and enhancements in Windows performance over the different versions (from IIS 6 to IIS 7.5).
  • Explain how availability is defined in terms of users and not in terms of the system.

    1. Managing high availability with low cost
       Experiences from the kitchen
    2. Some definitions
       • Scalability is how big you can get.
       • Reliability is how consistent you are (in the short term).
       • Availability is being reliable and scalable (in the long run).
       • Scalability and reliability are not related (one neither causes nor affects the other).
       • Can’t have availability without scalability or reliability.
    3. Without further ado
       • The requirement:
         – Emergency responder system requires notifications from emergency workers for availability.
       • Results in:
         – System that is available 24x7 to respond to notifications.
       • Constrained by budget.
       • Currently at ~1,200 users/sec
    4. Without further ado
       • The requirement:
         – Call routing system that must respond to every single request as fast as possible.
       • Results in:
         – System that is available 24x7 to respond to calls (marketing and others).
       • Constrained by time.
    5. Results
       • Switch to Word doc
    6. What we tried first
       • What the blogs say
         – Be redundant in every part of the system (this gets very expensive!)
       • What the teachers say (by the book)
         – Formal engineering (this is very expensive too!)
       • What our gut told us
         – Test, test, test!
    7. What we learned
       • Scalability is a process, not a destination.
       • Reliability is not a matter of QA.
       • The tools matter, but not in the traditional sense.
         – SQL Server (from 2005 to 2008 R2)
         – Windows (from 2003 to 2008 R2)
    8. Some statistics
       System availability 2010: 14 minutes of failures from at least 2 of 3 monitoring locations.
       [Pie chart of outage causes. Slices: IIS 7, Network, Framework Bugs, App Bugs, SQL Server - 100% CPU, SQL Server - Mirroring, SQL Server - No reason. Values as extracted, pairing with slices unclear: 1%, 8%, 32%, 52%, 37%, 16%, 4%, 2%.]
       * Only outages at the core router are displayed here as network problems.
    9. Specific Lessons
       • Design code for failure (not for 100% reliability or 0 bugs).
         – Redundancy in code is critical.
       • Fail fast and fail often.
         – Don’t wait until the system fails completely.
       • Monitor and validate.
         – Monitor as frequently as it is affordable.
    10. Design for failure
        • Why once if you can twice?
        • Use all available tools
          – Bidirectional replication for regional duplication of traffic.
          – Cheap load balancers.
          – Cheap RAID 10 SATA.
          – Don’t trust your database.
    11. Fail fast and fail often
        • Specific configuration settings for IIS.
          – Yes! App pool recycling increases availability.
        • Specific configuration for queuing.
          – Use a messaging system that always responds and stores safely in case the database is not available.
        • Make a lot of noise.
    12. Monitor and validate
        • Monitis
          – Cheap, but support is not there.
        • Other tools
          – Expensive, but if you can afford it, great.
        • Inside tools
          – Open source and MS tools.
    13. The bottom line
        • Database approach
          – Traditional route: database clustering.
          – Design for failure: expect the system to operate without a DB for brief periods of time. Do mirroring locally. Do replication remotely.
        • Hardware approach
          – Traditional route: redundancy everywhere.
          – Design for failure: configure for redundancy at the telecom level. Configure regional redundancy (invest in another server with another host, make sure network is different enough).
        • Code approach
          – Traditional route: design for 0 bugs (the formal method). Increase QA.
          – Design for failure: design multiple systems that do the same thing in simple ways. There is nothing wrong with multiple code paths, even processes. Reliability is not having the same bug in the same place.
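
The advice in slides 11 and 13 (a messaging layer that always responds and stores safely, and a system that can run without a DB for brief periods) can be illustrated with a small sketch. This is not the presenter's implementation: the names `SpoolingWriter`, `FlakyDatabase`, and `DatabaseUnavailable` are hypothetical, and a local JSON-lines file stands in for whatever durable store a real deployment would use.

```python
import json
import os
from typing import Callable, Dict


class DatabaseUnavailable(Exception):
    """Raised by the (hypothetical) store function when the database is down."""


class FlakyDatabase:
    """Stand-in for a real database; set `up = False` to simulate an outage."""

    def __init__(self) -> None:
        self.up = True
        self.rows = []

    def insert(self, message: Dict) -> None:
        if not self.up:
            raise DatabaseUnavailable("simulated outage")
        self.rows.append(message)


class SpoolingWriter:
    """Accepts every message; spools to a local file while the database is down."""

    def __init__(self, store: Callable[[Dict], None], spool_path: str) -> None:
        self._store = store
        self._spool_path = spool_path

    def submit(self, message: Dict) -> str:
        """Always answers the caller: 'stored' if the DB took it, else 'spooled'."""
        try:
            self._store(message)
            return "stored"
        except DatabaseUnavailable:
            # Append to the spool and force it to disk before acknowledging.
            with open(self._spool_path, "a", encoding="utf-8") as spool:
                spool.write(json.dumps(message) + "\n")
                spool.flush()
                os.fsync(spool.fileno())
            return "spooled"

    def drain(self) -> int:
        """Replay spooled messages in order; stop at the first failure."""
        if not os.path.exists(self._spool_path):
            return 0
        with open(self._spool_path, encoding="utf-8") as spool:
            pending = [json.loads(line) for line in spool if line.strip()]
        delivered = 0
        try:
            for message in pending:
                self._store(message)
                delivered += 1
        except DatabaseUnavailable:
            pass  # Database went away again; keep the rest for the next attempt.
        # Rewrite the spool so it holds only the undelivered messages.
        with open(self._spool_path, "w", encoding="utf-8") as spool:
            for message in pending[delivered:]:
                spool.write(json.dumps(message) + "\n")
        return delivered
```

A caller would wrap its real insert function, call `submit()` on every request, and call `drain()` periodically (for example from a timer) once the database is reachable again. The sketch does not preserve ordering between spooled messages and messages stored directly after recovery, and a crash in the middle of `drain()` could replay a message twice, so a real system would want idempotent inserts.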