Practical Approaches


Published in: Technology, Business

    1. HA & DR Strategy: Practical Approaches. Giles Gamon of High-Availability.Com, July 2007
    2. Business Continuity
        • A system for planning for, recovering and maintaining both the IT and business environments within an organisation, regardless of the type of interruption. In addition to the IT infrastructure, it covers people, facilities, workplaces, equipment, business processes and more
    3. Defining High-Availability
        • Provision of end-to-end access to a service and data without interruption
            • The elimination of all Single Points of Failure (SPoF)
            • Objective: zero or near-zero downtime
                • Includes handling scheduled downtime
    4. Defining Disaster Recovery
        • The process of restoring and maintaining the data, equipment, applications and other technical resources on which a business depends
        • Response to complete loss of a facility
            • May include dealing with loss of key staff
            • Disaster may also affect alternate facilities that were assumed to be available
    5. Achieving Business Continuity
        • Identification of threats to service
            • Systems failures, human errors, sabotage, software bugs, acts of God, etc.
        • Management of risk
            • Building in redundancy, taking backups, training staff, testing systems, active management solutions
    6. Causes of Down Time (source: IEEE)
    7. Causes - Disaster
        • Planning to cope with disasters is an important component of a High-Availability strategy
            • Flood, fire, power grid failure, terrorism, etc.
        • Most ‘disasters’ are classified as environmental causes of downtime
            • Collectively, environmental causes account for approximately 5% of downtime
    8. Causes - Environmental
        • Power cuts and brownouts
            • UPS & generator
                • What do they power?
        • Cooling system errors
            • Humidity regulation errors can cause hardware failures
    9. Southampton University 2005
    10. UK – Jan 2005 & June 2007
    11. Causes – Hardware Failure
        • Probably the most recognised cause of downtime
            • Server failures
                • Disk, CPU, internal cooling fans, memory faults, …
            • Network failures
                • DNS, DHCP, router, ISP, switches, cables cut, …
            • Other
                • Tape backup corruption, client hardware, …
    12. Causes - Planned
        • Hardware upgrades
        • OS version upgrades
        • Software version upgrades
        • Data migration / transformation
        • Backups
        • Batch processing
        • Preventative maintenance
        • Testing
    13. Causes – Human Factor
        • Failure to maintain
            • File systems full
            • Database tables full
            • Patches for known bugs not applied
        • Accidents
            • root# rm -rf / tmp/tempstuff
            • Network misconfiguration
            • Incorrect cable removed
        • Inexperience
            • root# reboot
            • Cleaner knocks cables out
        • Malice
            • root# uadmin 1 5 (or halt)
            • Physical sabotage
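
The "file systems full" item above is one of the easier failures to catch early. A minimal sketch, assuming a simple percentage threshold is an acceptable warning signal; the function name and the 90% default are illustrative and not taken from the slides:

```python
import shutil

def filesystems_over_threshold(paths, threshold_pct=90.0):
    """Return (path, percent used) for every path at or above the threshold."""
    alerts = []
    for path in paths:
        usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
        if usage.total == 0:
            continue  # skip pseudo file systems that report no capacity
        used_pct = 100.0 * usage.used / usage.total
        if used_pct >= threshold_pct:
            alerts.append((path, round(used_pct, 1)))
    return alerts

# Check the file system holding the current directory.
print(filesystems_over_threshold(["."], threshold_pct=90.0))
```

Run from a scheduler, a check like this turns a slow-building outage into a routine maintenance task.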
    14. Causes – Software Error
        • Code crashes
            • Application suddenly stops with a core dump
        • Memory leaks
            • Slowly consume all memory until the system crashes
        • Runaway code
            • Takes all CPU time in a loop
        • Hanging code
            • Code pauses waiting for a reply that never comes
        • Resource shortfalls
            • Overflowing logs, failure to allocate memory or a process
        • Buffer overflows
            • Possibly exploited, or just bad code
    15. Managing Risks
        • Identify critical services
        • Describe service level targets
        • Map risks to services
        • Quantify the level of threat
        • Design and cost solutions
        • Compromise in a rational way
    16. Identify Critical Services
        • How long can the web server be down?
            • Think: internal & public
        • How about email?
            • Can some emails be lost?
        • How about finance, HR, …?
            • How much downtime is acceptable?
        • Who will be affected?
            • Admin, public, suppliers, …
        • What is the impact on the ‘business’?
            • Reputation, income, disruption, political, …
    17. Describe Service Level Targets
        • Email, Web (external)
            • Downtime < 2 hours per month, 8 a.m. to 2 a.m.
        • Housing Server
            • Downtime < 30 mins per month, 24x7
        • Revenue & Benefits
            • Downtime < 5 mins per year, 24x7
        • Statistical Server
            • Fix when you can; not really required
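
Downtime budgets like those on this slide are often restated as availability percentages. A minimal sketch of that arithmetic, assuming a 30-day month and a 365-day year (both assumptions, as is the function name):

```python
def availability_pct(downtime_minutes: float, window_minutes: float) -> float:
    """Return the availability (in percent) implied by a downtime budget."""
    if window_minutes <= 0:
        raise ValueError("window_minutes must be positive")
    return 100.0 * (1.0 - downtime_minutes / window_minutes)

MONTH_DAYS = 30   # assumption: a 30-day month
YEAR_DAYS = 365   # assumption: a non-leap year

# Email / Web: under 2 hours per month, measured over an 18-hour day (8 a.m. to 2 a.m.)
email_web = availability_pct(120, MONTH_DAYS * 18 * 60)
# Housing server: under 30 minutes per month, 24x7
housing = availability_pct(30, MONTH_DAYS * 24 * 60)
# Revenue & Benefits: under 5 minutes per year, 24x7
revenue = availability_pct(5, YEAR_DAYS * 24 * 60)

for name, pct in [("Email/Web", email_web), ("Housing", housing),
                  ("Revenue & Benefits", revenue)]:
    print(f"{name}: at least {pct:.4f}% availability")
```

Under these assumptions the three targets work out to roughly 99.63%, 99.93% and 99.999% respectively, which makes the gap in cost and effort between the tiers easier to discuss.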
    18. Balancing Risk and Reward
        • Unless you have an infinite budget you will have to make ‘trade-offs’
        • Identify and remove SPoFs for critical services
            • SPoF = Single Point of Failure
        • Identify the least reliable components by MTBF (mean time between failures)
            • Moving parts typically have the lowest MTBF
        • Identify the most difficult components to repair/rebuild
            • e.g. security server, database
        • Identify what will have the biggest impact on failure
            • Usually a core server
                • Database, email, web, authentication server, etc.
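
One common way to turn MTBF into something comparable across components is the steady-state formula availability = MTBF / (MTBF + MTTR), where MTTR is the mean time to repair. MTTR does not appear on the slide, and every figure below is hypothetical, so treat this as a sketch of the ranking step only:

```python
def steady_state_availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Fraction of time a component is up, given its MTBF and MTTR."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

# Hypothetical (MTBF hours, MTTR hours) figures, for illustration only.
components = {
    "cooling fan": (50_000, 2),
    "power supply": (100_000, 2),
    "disk": (300_000, 4),
}

# Least reliable first: sort by MTBF, lowest at the top.
ranked = sorted(components.items(), key=lambda item: item[1][0])
for name, (mtbf, mttr) in ranked:
    avail = steady_state_availability(mtbf, mttr)
    print(f"{name}: MTBF {mtbf} h, availability {avail:.6f}")
```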
    19. Technical Approaches
        • Clustering
        • Replication
            • Transaction / block level
        • Emerging technologies
            • iSCSI
        • Multi-domain clusters
        • Oracle RAC
    20. Typical Multi-Tier Architecture
        • View the service in a holistic fashion
        • List all SPoFs
            • Network
            • Load balancers
            • Switches
            • Application server
            • Database server
            • Data disks
            • Etc.
        • Design in redundancy where possible
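
Why listing every SPoF matters can be shown with the standard series/parallel availability arithmetic. The sketch below assumes component failures are independent and uses hypothetical per-tier availabilities; neither assumption comes from the slides:

```python
from math import prod

def series(availabilities):
    """Every component must be up, so the availabilities multiply."""
    return prod(availabilities)

def redundant(availability, copies):
    """At least one of `copies` independent, identical components must be up."""
    return 1.0 - (1.0 - availability) ** copies

# Hypothetical availability of each single tier in the chain.
tiers = {
    "network": 0.999,
    "load balancer": 0.999,
    "switch": 0.999,
    "application server": 0.995,
    "database server": 0.995,
    "data disks": 0.999,
}

single = series(tiers.values())                             # every tier is a SPoF
doubled = series(redundant(a, 2) for a in tiers.values())   # every tier duplicated

print(f"single chain:  {single:.5f}")
print(f"doubled tiers: {doubled:.7f}")
```

A chain of single components is always less available than its weakest member, which is the arithmetic behind designing in redundancy at each tier.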
    21. Resilient Architecture
        • Multi-site solution
            • Replication to remote site
            • The load balancers shown provide redundancy for each other
            • Multiple switches used but not shown
        • SPoFs reduced to near zero
            • Multiple active blade centres
            • Multiple active application servers
            • Clustered database servers
        • This architecture is resilient to almost every conceivable fault
    22. Resilient Architecture
    23. Resilient Architecture
    24. High-Availability Clustering
        • Intelligent management solution
        • Software only
        • Deployed on critical servers
        • Can be active-active or active-passive
        • Constant monitoring
            • Application availability
            • Server health
            • Network availability
            • Other defined components
        • Automated restart / move in the event of a fault
        • Notifications to administrative staff
            • GUI, email, SMS
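
The monitor, move and notify cycle described on this slide can be sketched in a few lines. This is an illustrative toy, not the behaviour of any particular cluster product: the class names, the three-strikes rule and the health-probe callables are all assumptions.

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Node:
    name: str
    is_healthy: Callable[[], bool]  # hypothetical health probe supplied by the caller

@dataclass
class Cluster:
    nodes: List[Node]
    active: int = 0                        # index of the node running the service
    max_failures: int = 3                  # consecutive failed probes before a move
    notify: Callable[[str], None] = print  # stand-in for GUI / email / SMS alerts
    _failures: int = field(default=0, init=False)

    def check_once(self) -> str:
        """Probe the active node once and return the name of the active node."""
        if self.nodes[self.active].is_healthy():
            self._failures = 0
        else:
            self._failures += 1
            if self._failures >= self.max_failures:
                self._fail_over()
        return self.nodes[self.active].name

    def _fail_over(self) -> None:
        failed = self.nodes[self.active].name
        # Try the remaining nodes in order and move to the first healthy one.
        for offset in range(1, len(self.nodes)):
            candidate = (self.active + offset) % len(self.nodes)
            if self.nodes[candidate].is_healthy():
                self.active = candidate
                self._failures = 0
                self.notify(f"service moved from {failed} to {self.nodes[candidate].name}")
                return
        self.notify(f"{failed} is unhealthy and no standby is available")

# Simulated active-passive pair: node-a fails, node-b takes over.
state = {"node-a up": True}
cluster = Cluster([Node("node-a", lambda: state["node-a up"]),
                   Node("node-b", lambda: True)])
state["node-a up"] = False
for _ in range(3):
    current = cluster.check_once()
print(current)
```

Requiring several consecutive failed probes before moving the service is one simple way to avoid failing over on a single dropped health check.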
    25. High-Availability Clustering
        • Active-Passive
            • Simple setup
        • Externalise ‘shared’ data
        • Use RAID and/or mirroring
        • Low cost, fast and simple
        • Very reliable
    26. High-Availability Replication
        • Traditional cluster locally
        • Replicate to remote node
        • Replication at transaction level
        • Remote node probably included in cluster
            • Automatic locally
            • Manual remotely
    27. High-Availability Replication
        • Typically replication does a ‘log scrape’
            • Although some newer versions have closer integration
        • Takes committed transactions and copies them across to the other node(s)
        • Other nodes ‘apply’ the transactions to a read-only copy of the database
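
The ‘log scrape’ idea on this slide can be illustrated with an in-memory toy: scan the primary's transaction log, pick out the committed transactions that have not yet been shipped, and apply them to the replica. The record layout and function name are hypothetical, and the sketch assumes the log lists committed transactions in commit order; real database replication is considerably more involved.

```python
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple

@dataclass
class LogRecord:
    txn_id: int
    committed: bool
    changes: List[Tuple[str, str]]  # (key, new value) pairs written by the transaction

def ship_committed(log: List[LogRecord], replica: Dict[str, str],
                   applied: Set[int]) -> int:
    """Apply committed, not-yet-applied transactions to the replica.

    Returns the number of transactions applied in this pass.
    """
    shipped = 0
    for record in log:
        if not record.committed or record.txn_id in applied:
            continue  # still in flight, or already on the replica
        for key, value in record.changes:
            replica[key] = value
        applied.add(record.txn_id)
        shipped += 1
    return shipped

# Primary log: transaction 2 has not committed yet, so it is not shipped.
log = [
    LogRecord(1, True, [("balance:alice", "90")]),
    LogRecord(2, False, [("balance:bob", "60")]),
    LogRecord(3, True, [("balance:carol", "10")]),
]
replica: Dict[str, str] = {}
applied: Set[int] = set()
print(ship_committed(log, replica, applied), replica)
```

Tracking which transaction IDs have been applied makes repeated passes over the same log idempotent, so the scrape can run periodically without double-applying changes.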
    28. High-Availability Replication
        • Block level replication
            • Suitable for user files
            • Not ideal for databases
                • Many better approaches that understand database data
            • Available in different guises, such as
                • Sun’s SNDR (remote mirror), in kernel
                    • Sync / async
                    • Streams type module
                • Rsync, in user space
                    • Periodic checking and copy
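
The "periodic checking and copy" style of block-level replication can be sketched by comparing fixed-size block digests and copying only the blocks that differ. This is a deliberately simplified illustration on in-memory buffers; it is not rsync's actual rolling-checksum algorithm, and the block size and function names are assumptions.

```python
import hashlib
from typing import List

BLOCK_SIZE = 4096  # assumption: fixed 4 KiB blocks

def block_digests(data: bytes, block_size: int = BLOCK_SIZE) -> List[bytes]:
    """SHA-256 digest of each fixed-size block of `data`."""
    return [hashlib.sha256(data[i:i + block_size]).digest()
            for i in range(0, len(data), block_size)]

def sync_blocks(source: bytes, target: bytearray,
                block_size: int = BLOCK_SIZE) -> int:
    """Make `target` identical to `source`, copying only blocks that differ.

    Returns the number of blocks copied.
    """
    src = block_digests(source, block_size)
    dst = block_digests(bytes(target), block_size)
    copied = 0
    for index, digest in enumerate(src):
        if index >= len(dst) or dst[index] != digest:
            start = index * block_size
            target[start:start + block_size] = source[start:start + block_size]
            copied += 1
    del target[len(source):]  # drop anything beyond the end of the source
    return copied

# One changed byte on the target means only one block is re-copied.
source = bytes(range(256)) * 64   # 16 KiB, i.e. four blocks
target = bytearray(source)
target[5000] ^= 0xFF              # corrupt a byte inside the second block
print(sync_blocks(source, target), bytes(target) == source)
```

Because the comparison knows nothing about what the blocks contain, it suits ordinary files better than databases, which is the point the slide makes.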
    29. High-Availability Replication
        • Use database replication for databases when possible
        • Use block level replication for other file types and for legacy applications that have no replication option available
    30. Practical Examples
        • Carlisle
            • Some lessons learned
        • Surrey Ambulance
            • 999 call handling centre
        • North Yorkshire Police
            • Tasking & operational management
    31. Carlisle – Jan 2005
        • Extensive flooding, Jan 2005
            • Civic centre, the hub of all operations, hit
        • Backup generators in basement (flooded first)
        • Guardian IT ‘insurance’ not used
        • All major systems down for a week
        • Flooded in Jan 2005 and still dealing with substantial issues today
    32. Carlisle - Lessons
        • Don’t assume that just because you have ‘a plan’ it will actually work
            • Guardian IT / SunGard provided a warm feeling but was not useful (Carlisle terminating)
            • Test it
            • Keep testing and updating
            • Recovery takes longer than you imagine
            • Administration relating to recovery and the process of recovery itself are huge drains on resources
    33. Surrey Ambulance Service
        • 999 call centre
        • 24x7 live operations environment
        • Handling calls from the public
        • Live feeds from ambulance GPS devices
        • Automatic escalation and logging
    34. North Yorkshire Police
        • 24x7 live CAD system
            • Command and control
            • Custody management
            • Crime management
            • Duty rostering
            • Imaging and biometrics
        • Oracle backend to ‘STORM’ application
        • Highly integrated systems
            • Mapping systems
            • PNC links
            • DVLA links
            • Firearms database
            • Neighbouring force systems
    35. North Yorkshire Police
    36. Contacts
        • Giles Gamon
        • High-Availability.Com
        • sales@High-Availability.Com
        • support@High-Availability.Com
        • giles@High-Availability.Com
        • 01565 754 459