Designing and Implementing Mission-Critical Systems

    Designing and Implementing Mission-Critical Systems - Presentation Transcript

    1. www.xdelta.co.uk Slide 1 of 46 Connect - San Francisco Bay Chapter Designing and implementing mission-critical systems Colin Butcher Copyright © Colin Butcher, XDelta Limited, Sep. 2009
    2. BC, DT, HA and DR (1)
       • Business continuity:
         – It’s not just the systems – it’s everything!
       • Disaster tolerance:
         – Continue operations while surviving major site outages without loss of data
       • High availability:
         – Continue operations while surviving equipment and software failures without loss of data
    3. BC, DT, HA and DR (2)
       • Disaster recovery:
         – The process of restarting sufficient operations to run the business after a serious outage, typically from another location
       • Budget and schedule:
         – They have to be appropriate for the problems we’re trying to deal with. Don’t set them first!
    4. Problem solving concepts - TRIZ
       [Diagram: grid with columns Before, Now, After and rows Environment, System, Component]
    5. Design decisions
       • Big decisions which have long-term implications and constraints
       • Small decisions which seem big at the time
       • There will be requirements and constraints you don’t yet understand or know about
       • Need to design in the ability to make changes
       • Must establish a meaningful naming convention
    6. Risk
       Risk is a fact of life. We have to deal with it as best we can – or at least well enough for the circumstances we find ourselves in.
       Think about both project management and technical design as part of systems engineering, then apply techniques from other engineering disciplines to help us analyse the situation and guide our thinking.
       Disaster-tolerant systems aim to minimise the risk of loss of service and loss of data as much as possible.
    7. The risk continuum
       • What is the probability of a situation occurring?
       • What is the impact if that situation occurs?
       • What are the long-term consequences?
       • Most projects handle medium risk well enough
       • Many projects over-specify to cater for what are in fact low risk issues
       • Some projects under-specify and fail to cater for what are in fact high risk issues
       • How can we start to identify what the risks might be?
    8. Risk assessment
       • How can we look for points of failure?
       • How can we assess the impact of failure?
       • Which parts of the system are mission-critical?
       • Which parts of the system are safety-critical?
       • What kind of failure do we prefer?
       • What happens to our information?
    9. Mission-critical systems
       Mission-critical systems need to be able to:
       • Survive failures (resilience and failover)
       • Survive changes (adapt and evolve)
       • Survive people (simplify and automate)
       • Never corrupt or lose critical data (data integrity)
       • Requirements never remain static over an extended period of time, so we need to be able to make changes during the operational lifetime of the system
       • Circumstances change, so we often need to be able to extend the operational lifetime and scope of a system
    10. Do we really need 100% uptime?
        • Safety-critical systems (especially real-time monitoring and control systems such as air traffic control) require exceedingly high levels of availability. They also have to be fail-safe in order not to endanger lives.
        • True 24x365 mission-critical systems are fairly rare. With these there is no “downtime window” to take backups, fix faults or make changes. So, whatever you do has to be done “live” – and very carefully!
        • The closer you get to 100% uptime, the more expensive a satisfactory solution will become.
    11. “Survivability test” (1)
        Cause of outage        Planned (maintenance)   Unplanned (failure)
        Hardware               ?                       ?
        Operating system       ?                       ?
        Network layer          ?                       ?
        Layered products       ?                       ?
        Application software   ?                       ?
        Application data       ?                       ?
        Environment            ?                       ?
        People                 ?                       ?
    12. “Survivability test” (2)
        • How long have we got?
        • How much data can we afford to lose?
        • How long have we got before we need to be ready for the next failure?
        • Nothing happens instantaneously - there is always a “state transition”
    13. RPO and RTO
        RPO = Recovery Point Objective
        • How much data can we tolerate losing?
        • How quickly do we need to react to a failure?
        RTO = Recovery Time Objective
        • What level of service outage can we tolerate?
        • How quickly do we need to recover?
        • How quickly do we need to be ready to deal with a subsequent failure?
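As a rough illustration of the RPO and RTO questions above, the sketch below compares one incident against target values. The function name, the timestamps and the 5-minute and 1-hour targets are all illustrative assumptions, not figures from the presentation.

```python
from datetime import datetime, timedelta


def assess_recovery(last_safe_copy, failure_time, service_restored, rpo, rto):
    """Compare one incident against RPO/RTO targets (all values hypothetical)."""
    data_loss_window = failure_time - last_safe_copy    # work at risk of loss
    outage_duration = service_restored - failure_time   # time without service
    return {
        "data_loss_window": data_loss_window,
        "outage_duration": outage_duration,
        "rpo_met": data_loss_window <= rpo,
        "rto_met": outage_duration <= rto,
    }


# Illustrative incident: the last replicated write landed 2 minutes before
# the failure, and service came back 90 minutes after it.
failure = datetime(2009, 9, 1, 12, 0, 0)
result = assess_recovery(
    last_safe_copy=failure - timedelta(minutes=2),
    failure_time=failure,
    service_restored=failure + timedelta(minutes=90),
    rpo=timedelta(minutes=5),
    rto=timedelta(hours=1),
)
print("RPO met:", result["rpo_met"], "RTO met:", result["rto_met"])
```

In this made-up case the data loss stays inside the RPO while the outage overruns the RTO, which shows why the two objectives need to be set and tested separately.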
    14. The systems “flight envelope”
        • What does the business require the systems to do?
        • What are the consequences if the systems fail?
        • What happens if you push beyond the limits?
        • How far from the edge are you?
        • How do you know?
        • What can we measure?
        • What comparisons can we make?
        • What evidence can we look at?
    15. Availability and performance (1)
        Availability:
        – Probability of system being available for use at a given instant in time within the ‘operational window’
        – Function of both MTBF (reliability) and MTTR (repair time)
        Performance:
        – Performance issues are often the cause of transient system failures and disruption
        – The systems have to have sufficient capacity and performance to deal with the workload in an acceptable period of time under normal, failure and recovery conditions
        Availability is more important than performance
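The slide says availability is a function of both MTBF and MTTR without giving the formula. A minimal sketch, assuming the usual steady-state relation availability = MTBF / (MTBF + MTTR) and a 24x365 operational window; the function names and the input figures are illustrative only.

```python
HOURS_PER_YEAR = 24 * 365


def availability(mtbf_hours, mttr_hours):
    """Steady-state availability: fraction of time the system is usable."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def downtime_hours_per_year(avail):
    """Expected downtime per year for a 24x365 operational window."""
    return (1.0 - avail) * HOURS_PER_YEAR


# Illustrative figures only: one failure every 10,000 hours, 4 hours to repair.
a = availability(mtbf_hours=10_000, mttr_hours=4)
print(f"availability = {a:.5f}")
print(f"downtime     = {downtime_hours_per_year(a):.2f} hours/year")
```

Halving MTTR improves the result about as much as doubling MTBF, which is one reason repair and recovery procedures deserve as much design attention as component reliability.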
    16. Availability and performance (2)
        • Why is understanding performance so important in mission-critical systems?
        • How does performance affect availability?
        • What can we do to test a system before it goes into production?
        • What can we do to test changes to a system before we implement them?
        • What can we do to monitor a system once it’s in production?
    17. Throughput and response times
        • Bandwidth – determines throughput
          – It’s not just “speed”, it’s throughput in terms of “units of stuff per second”
        • Latency – determines response time
          – Determines how much “stuff” is in transit through the system at any given instant
          – “Stuff in transit” is the data at risk if there is a failure
        • Jitter (variation of latency with time) – determines predictability of response
          – Understanding jitter is important for establishing timeout values
          – Latency fluctuations can cause system failures under peak load
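The relationships on this slide can be put into numbers. The sketch below assumes that “stuff in transit” is throughput multiplied by latency, and that a timeout can be set at the mean latency plus a few standard deviations. The multiplier `k`, the function names and the sample values are assumptions for illustration, not recommendations from the presentation.

```python
import statistics


def data_in_transit(throughput_per_s, latency_s):
    """Units of 'stuff' in flight at any instant (throughput x latency)."""
    return throughput_per_s * latency_s


def suggested_timeout(latency_samples_s, k=4.0):
    """Mean latency plus k standard deviations; k is an arbitrary choice."""
    mean = statistics.mean(latency_samples_s)
    jitter = statistics.pstdev(latency_samples_s)
    return mean + k * jitter


# Illustrative numbers: 2,000 transactions/s with 50 ms end-to-end latency.
print("transactions at risk:", data_in_transit(2000, 0.050))

# Measured latencies in seconds (made up for the example).
samples = [0.048, 0.050, 0.052, 0.049, 0.051]
print("timeout (s):", round(suggested_timeout(samples), 4))
```

A timeout set too close to the mean will fire during ordinary latency fluctuations under peak load, so the samples should come from a realistically loaded system.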
    18. Designing for performance
        • Size systems to cope with peaks in workload
        • Minimise “wait states” (caches, parallelism)
        • Minimise contention for resources and data structures
        • Understand the need for synchronisation and serialisation of access to data structures
        • Maximise “User Mode”, minimise the other modes:
          – The fastest IO is the IO you don’t do
          – The fastest code is the code you don’t execute
    19. Parallelism and scalability
        • Understand how the applications could break down into parallel streams of execution:
          – Some will be capable of being split into many small elements with little interaction between the parallel streams of execution
          – Some will require very high interconnectivity between the parallel streams of execution
          – Some will require high-throughput single-stream processing
        • Understand scalability – do as much as possible once only, do as little as possible every time
    20. Designing for availability
        • Which parts of the system are mission-critical?
        • Which parts of the system are safety-critical?
        • What kind of failure do we prefer?
        • What state transitions occur during failure and recovery?
        • How can we recover from a failure without data loss or data corruption?
        • How will you test your failure and recovery scenarios?
        • How can you make changes while it’s working?
    21. Multi-site issues
        • Naming conventions
        • Quorum and voting schemes
        • Data replication schemes
        • Effects of distance on network and storage protocols
        • Symmetric or asymmetric operation – how good is your “crystal ball”?
        • Centralised (and replicated) monitoring and alerting
        • Remote access for management and operation
        • Avoid automation of decision making
    22. Testing
        • We need to prove that service will continue with minimal disruption during failure and recovery
        • We need to test scalability as well as functionality
        • We need to test every aspect of the system and surrounding infrastructure under normal, failure and recovery conditions
        • How will we generate a realistic load for testing?
        • How will we instrument the system and infrastructure?
        • We need to regularly rehearse and test our procedures and plans to ensure that we stay current
    23. Testing for availability
        • Mission-critical systems hardly ever fail, so we need the people responsible for their operation to have a good understanding of, and ‘feel’ for, the way they work
        • We need to find out how the system behaves when it starts to fail
        • We need to know the ‘warning signs’ of incipient failure
        • We need to know how to return the system to its normal operational state without data loss or data corruption
        • We must have a representative offline test environment
    24. Controlling change
        • Physical environment (power, cooling, etc.)
        • Hardware (revision levels, labelling, etc.)
        • Firmware versions
        • Operating system updates
        • Network infrastructure and configurations
        • Storage infrastructure and configurations
        • Database versions and configurations
        • Application software
        • Interactions
        • Interoperability with mixed versions
        • Upgrade and backout actions
        • Data conversions
    25. Measuring system behaviour
        • How do we instrument the system?
          – Application level
          – Operating system level
          – Data storage level
          – Network infrastructure
        • Time synchronisation
        • How do we generate a typical workload?
        • How do we generate representative data sets?
    26. Finding and fixing problems
        • Where can we do the testing?
          – Production environment
          – Pre-production test environment
          – Pre-delivery test environment
        • How might a problem show up – and when?
        • How can we find a problem, e.g. data corruption?
        • Can we recreate a problem in a test environment?
        • Continual monitoring and event logging is essential
        • Knowledge of the whole system is essential
    27. Reliability analysis techniques
        There are many techniques which have evolved over the years and there are tools to help you apply them.
        • Reliability Block Diagrams (RBD)
        • Fault Tree Analysis (FTA)
        • Failure Modes, Effects and Criticality Analysis (FMECA)
        These (and many other) techniques can be applied with software tools available from a number of vendors.
    28. Reliability block diagram (1)
        [Diagram: Node A, Node B]
        For this to be operational, we need to have:
        • Either LAN, and
        • Either node, which in turn needs either connection to either LAN
    29. Reliability block diagram (2)
        [Diagram: blocks LAN A, LAN B, Node A and Node B; each node has a connection “To LAN A” and a connection “To LAN B”; two groups are marked “1 of 2”]
    30. Reliability block diagram (3)
        • Used to identify single points of failure (SPOF)
        • Can be used to derive an overall theoretical probability of failure for the system by assigning probabilities of failure to individual items
        • Be aware that theoretical probability of failure calculations are based on statistics and assumptions – however the process is invaluable in understanding the issues
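The two-node, two-LAN example can be evaluated by brute force. This sketch enumerates every combination of working and failed blocks and sums the probability of those in which some working node has a working connection to a working LAN. It assumes failures are independent, and every availability figure and block name in it is an illustrative assumption rather than a value from the presentation.

```python
from itertools import product

# Availability of each block; every number here is an illustrative assumption.
BLOCKS = {
    "LAN A": 0.999, "LAN B": 0.999,
    "Node A": 0.99, "Node B": 0.99,
    "A->LAN A": 0.995, "A->LAN B": 0.995,
    "B->LAN A": 0.995, "B->LAN B": 0.995,
}


def system_up(up):
    """True if some working node has a working connection to a working LAN."""
    return any(
        up[f"Node {n}"] and up[f"LAN {lan}"] and up[f"{n}->LAN {lan}"]
        for n in "AB" for lan in "AB"
    )


def system_availability(blocks):
    """Sum the probability of every block state in which the system is up."""
    names = list(blocks)
    total = 0.0
    for states in product([True, False], repeat=len(names)):
        up = dict(zip(names, states))
        if system_up(up):
            p = 1.0
            for name in names:
                p *= blocks[name] if up[name] else 1.0 - blocks[name]
            total += p
    return total


print(f"system availability = {system_availability(BLOCKS):.6f}")
```

The independence assumption is exactly the kind of caveat the slide warns about: a common-mode event such as a shared power feed would make the real figure worse than the calculated one.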
    31. FMECA and FMEA
        • Failure Modes, Effects and Criticality Analysis (FMECA)
        • Failure Mode and Effects Analysis (FMEA)
        These are techniques used to:
        – identify potential failure modes
        – assess the risk associated with the failure modes
        – sort the issues in terms of importance
        – identify possible recovery actions
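One common way to carry out the “sort the issues in terms of importance” step is a Risk Priority Number: severity times occurrence times detection, each scored from 1 to 10. The slides do not say which scoring scheme was used, so the scheme, the class name and all three entries below are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class FailureMode:
    item: str
    mode: str
    severity: int      # 1 (negligible) .. 10 (catastrophic)
    occurrence: int    # 1 (rare) .. 10 (frequent)
    detection: int     # 1 (always detected) .. 10 (undetectable)
    recovery: str

    @property
    def rpn(self):
        """Risk Priority Number used to rank the failure modes."""
        return self.severity * self.occurrence * self.detection


# Hypothetical entries for illustration only.
modes = [
    FailureMode("Disk array", "controller failure", 7, 3, 2,
                "fail over to partner controller"),
    FailureMode("Inter-site link", "fibre cut", 8, 2, 1,
                "reroute via second path"),
    FailureMode("Application", "silent data corruption", 10, 2, 9,
                "restore from verified copy"),
]

ranked = sorted(modes, key=lambda m: m.rpn, reverse=True)
for fm in ranked:
    print(f"{fm.rpn:4d}  {fm.item}: {fm.mode} -> {fm.recovery}")
```

In this made-up data the hard-to-detect failure ranks first even though it is no more likely than the others, which is the kind of ordering the analysis is meant to surface.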
    32. Fault Tree Analysis
        • Fault Tree Analysis (FTA)
        • Identify the way that failures can ripple through a system
        • The “inverse” of a fault tree can help to identify “common mode” failure events and guide fault-finding, e.g. loss of one phase can cause loss of power to many devices
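The “inverse” view can be sketched as inverting a dependency map, so that each underlying event lists the devices it touches. The device and phase names below are hypothetical, and the sketch assumes a device with two power feeds keeps running if either feed survives.

```python
from collections import defaultdict

# Which supplies each device depends on; all names are hypothetical.
DEPENDS_ON = {
    "switch-1": ["phase-L1"],
    "switch-2": ["phase-L2"],
    "node-a": ["phase-L1", "phase-L2"],   # dual power feeds
    "storage-1": ["phase-L1"],
    "storage-2": ["phase-L1"],
}


def invert(depends_on):
    """Map each underlying event (a supply lost) to the devices it touches."""
    affected = defaultdict(set)
    for device, supplies in depends_on.items():
        for supply in supplies:
            affected[supply].add(device)
    return affected


def devices_lost(depends_on, failed_supply):
    """Devices left with no working supply when one supply fails."""
    return sorted(
        device for device, supplies in depends_on.items()
        if set(supplies) <= {failed_supply}
    )


print(dict(invert(DEPENDS_ON)))
print(devices_lost(DEPENDS_ON, "phase-L1"))
```

Here the loss of one phase takes out both storage units at once, a common-mode failure that would not show up when each device is analysed on its own.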
    33. State transitions (1)
        • Take the example used in the Reliability Block Diagram where we have two machines in hot-standby operation
        • What states can a pair of machines be in?
        • We need to identify all possible states and ensure that “invalid states” do not occur (or are handled appropriately) and that the state information is propagated to all other participating machines in the overall system
    34. State transitions (2)
        A                    B
        Off to Master        Off
        Master               Off to Standby
        Master to Off        Standby to Master
        Master to Off        Off to Master
        Master to Standby    Standby to Master
        Master to Hung       Standby to Confused
        And so on…
    35. State transitions (3)
        • Nothing happens instantaneously
        • How long do state transitions last for?
        • What can we do while a state transition is in progress?
        • Can we ensure that there are no timing windows / flaws?
        • How can we test it?
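Enumerating the states of the hot-standby pair is easy to automate. The sketch below lists every pair state, flags the invalid ones and checks a proposed transition before it is applied. Treating “both machines Master” as the only invalid state is an assumption made for this example, and the “Hung” and “Confused” fault states are left out for brevity.

```python
from itertools import product

STATES = ("Off", "Standby", "Master")

# Single-machine transitions that appear in the slide's list. A real table
# would need more entries (for example Standby to Off) plus the fault states.
ALLOWED = {
    ("Off", "Master"), ("Off", "Standby"),
    ("Standby", "Master"),
    ("Master", "Standby"), ("Master", "Off"),
}


def pair_is_valid(a, b):
    """Assumption for this sketch: two Masters at once is the invalid state."""
    return not (a == "Master" and b == "Master")


def can_move(pair, machine, new_state):
    """Check one machine's transition and the pair state it would produce."""
    current = pair[machine]
    new_pair = list(pair)
    new_pair[machine] = new_state
    return (current, new_state) in ALLOWED and pair_is_valid(*new_pair)


all_pairs = list(product(STATES, repeat=2))
invalid = [p for p in all_pairs if not pair_is_valid(*p)]
print(len(all_pairs), "pair states, invalid:", invalid)

# B may not be promoted while A is still Master...
print(can_move(("Master", "Standby"), 1, "Master"))
# ...but may be once A has gone Off.
print(can_move(("Off", "Standby"), 1, "Master"))
```

The check treats each transition as instantaneous, which the slides say never holds in practice, so a fuller model would add explicit in-transition states such as “Standby to Master” and test what is permitted while they last.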
    36. Example - NHSBT “Pulse”
        An overview of how the new NHSBT Pulse systems fit into the NHSBT infrastructure.
        We have Production, Archive and Test environments with a shared common infrastructure, all of which is separated from the rest of the existing infrastructure.
    37. Where these systems fit in the new infrastructure
    38. Functional layout
        The systems are split up into:
        • Common infrastructure (SAN fabrics, private network interconnects etc.)
        • Production environment (a split-site cluster with host-based volume shadowed storage)
        • Test environment (a split-site cluster on a smaller scale)
        • Archive environment (a single node at Site A)
        • Duplicated monitoring and reporting facilities
        • External connectivity for users
    39. Common SAN fabric infrastructure
    40. Common data network infrastructure
    41. Multi-site clustering
        • Split-site OpenVMS clusters give us “shared everything” access to data with protection from loss or corruption, even in the event of site failure
        • Host-based volume shadowing (HBVS) ensures that data is consistent across all members of the shadow sets
        • The quorum scheme lets Site A continue if Site B fails and protects us from data corruption due to a partitioned cluster
        • The DTCS software monitors the systems for us and (most important of all) controls the formation of storage shadow sets when the systems boot and when nodes rejoin the cluster
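The effect of the quorum scheme can be illustrated with a simple majority rule. This is a generic sketch, not the OpenVMS or DTCS configuration used in the project: the vote allocation (two votes at Site A, one at Site B) and the function names are hypothetical.

```python
def quorum(expected_votes):
    """Smallest number of votes that is a strict majority."""
    return expected_votes // 2 + 1


def can_continue(surviving_votes, expected_votes):
    """A partition may keep running only if it still holds quorum."""
    return surviving_votes >= quorum(expected_votes)


# Hypothetical vote allocation: the slides do not give the real one.
SITE_VOTES = {"Site A": 2, "Site B": 1}
total = sum(SITE_VOTES.values())

for site, votes in SITE_VOTES.items():
    print(site, "alone can continue:", can_continue(votes, total))
```

Because quorum is a strict majority, at most one side of a partitioned cluster can keep running, which is what protects the data. The price of this asymmetric allocation is that Site B cannot continue on its own without some form of quorum adjustment.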
    42. DTCS – monitoring and control
        • DTCS is a set of HP and 3rd party products with installation, configuration and support services
        • Remote console access, management and console output logging
        • Integrated monitoring and quorum adjustment
        • Rule based monitoring of individual systems / nodes
        • Rule based SNMP polling of equipment
        • Rule based TCP/IP “ping reachability” polling
        • GUI and e-mail based alerting
        • Scripting of failover and recovery actions across all systems / nodes and storage subsystems
    43. Success factors
        • Small team of committed people
        • Clear objectives
        • Built ‘proof of concept’ data migration system first
        • Built system ‘on paper’, discussed it extensively and resolved potential technical problems prior to purchasing equipment and building system platform
        • Project management and planning
        • Leadership and collaborative working
        • Trust between team members
        • Sufficient flexibility to cope with issues as they arose
    44. Summary
        • The art is to select elements that work well together and which provide the bulk of what you need with minimal additional work
        • Establish the minimum requirements that have to be met – and do it as well as possible
        • Availability and performance have to be designed in to the application
        • Monitoring and automation are key components
        • Understand the typical behaviour of your systems and be aware of changes
    45. Key issues
        - You can’t buy high-availability off the shelf
        - Do the minimum you have to do and get it right
        - Design for change on-the-fly with no loss of service
        - Documentation and configuration control
        - Protect the data and ensure that it’s consistent
        - Monitoring, information and automation
        - Testing and continual training
        - Procurement
        - Project management
        - Leadership and collaborative working
    46. Thank you for your participation
        Discussion!
