2012 Architect's Guide: Designing Integrated Multi-Product HA/DR/BC Solutions (v2)

In today's sophisticated IT Cloud world, how do I fuse multiple technologies, products, and clouds together to create a 2012 integrated High Availability, Disaster Recovery, Business Continuity IT solution? This session complements product-specific and overview HA/DR/BC sessions by providing a proven, product-agnostic methodology to architect such a solution, including petabyte-level considerations. We provide a pragmatic, industry-proven, step-by-step methodology and toolset for you to use to work directly with clients to a) crisply elicit and distill HA/DR/BC requirements, b) efficiently organize and map those requirements, c) design an integrated, multi-product, phased-approach IT HA/DR/BC solution which properly combines backup/restore software, tape, tape libraries, dedup, point-in-time and continuous disk replication, and storage virtualization products, and d) provide a template to clearly communicate the solution and gain consensus across multiple levels of operations and management. John Sing is the author of 3 IBM Redbooks, including SG24-6547-03, IBM System Storage Planning for Business Continuity. My only request when referencing this material in your work is that you give full credit to me, John Sing, and IBM, as the authors of this material, research, and methodology. That having been said, please spread the good word.

Published in: Technology, Business

Speaker notes
  • http://en.wikipedia.org/wiki/Disruptive_innovation
  • First, let’s review important IBM 2009 messaging.
  • There are three primary aspects of providing business continuity for key applications and business processes: High Availability, Continuous Operations, and Disaster Recovery. Generally, the higher in the organization, the simpler the term to use. Senior execs are responsible for setting vision and strategy; mid-level managers are responsible more for implementation. So you can get in the door with just "Business Continuity" at the senior level, but you need BC plus HA, CO, and DR to get in at the manager and director level. "Business Continuity" was preferred by senior IT executives and line-of-business titles; lower IT titles preferred more detailed naming that spelled out the solution components, because they wanted to make it relevant to their more limited responsibilities.

High Availability is the ability to provide access to applications. High availability is often provided by clustering solutions that work with operating systems, coupled with hardware infrastructure that has no single points of failure. If a server that is running an application suffers a failure, the application is picked up by another server in the cluster, and users see minimal or no interruption. Today's servers and storage systems are also built with fault-tolerant architectures to minimize application outages due to hardware failures. In addition, many aspects of security are embedded in the hardware, from servers to storage to network components, to help protect against unauthorized access. You can think of high availability as resilient IT infrastructure that masks failures and thus continues to provide access to applications.

Continuous Operations: sometimes you must take important applications down to update files or take backups. Fortunately, great progress has been made in recent years in technology for online backups, but even with these advances, applications must sometimes be taken down as planned outages for maintenance or upgrading of servers or storage. You can think of continuous operations as the ability to keep things running when everything is working right, where you do not have to take applications down merely to do scheduled backups or planned maintenance.

Disaster Recovery is the ability to recover a datacenter at a different site if a disaster destroys the primary site or otherwise renders it inoperable. The characteristics of a disaster recovery solution are that processing resumes at a different site and on different hardware. (A non-disaster problem, such as corruption of a key customer database, may indeed be a catastrophe for a business, but it is not a disaster, in this sense of the term, unless processing must be resumed at a different location and on different hardware.) You can think of disaster recovery as the ability to recover from unplanned outages at a different site, something you do after something has gone wrong. Fortunately, some of the solutions that you can implement as preparedness for disaster recovery can also help with High Availability and Continuous Operations. In this way, your investment in disaster recovery can help your operations even if you never suffer a disaster.

The goal of business continuity is to protect critical business data, to make key applications available, and to enable operations to continue after a disaster. This must be done in such a way that recovery time is both predictable and reliable, and such that costs are predictable and manageable.
  • http://www-935.ibm.com/services/us/igs/smarterdatacenter.html
  • This animated chart is used to organize "who does what" in a recovery, and to define Recovery Time Objective (RTO) and Recovery Point Objective (RPO). Hardware (servers, storage) can only handle the blue portion of the recovery. All the other necessary processes are important; they are simply outside the ability of the hardware, servers, and storage to control. Hence they should be acknowledged as important, but treated as supplemental discussions to have with the Services team, and thus outside the scope of a storage-only or Tivoli-only discussion. It's good to use this chart to help the audience visually organize who does what, and in what order, in a recovery.
  • This animation shows that the previous timeline still applies today. Automation simply makes the multiple steps of the Timeline of an IT Recovery consistent. Automation also provides an affordable way to handle testing and compliance of the Data Protection solution.
  • In summary, the animation shows the storage pool concept – mapped to the different technologies: (click) Backup/Restore (click) Rapid Data Recovery (click) Continuous Availability
  • This slide shows that technology only addresses RPO (i.e., how current is the data?). As we improve the technology, we improve RPO. Notice that RTO (Recovery Time Objective) is not driven by technology. (Next chart)
  • Here we see that automation drives the RTO (Recovery Time Objective). Automation is what affects the RTO, because it addresses all the non-technology factors that take time.
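To make the note above concrete, here is a minimal sketch (illustrative only; the step durations, the 12-hour replication interval, and the split into four recovery steps are assumptions, not measurements from any product) showing that the replication interval bounds RPO, while the sum of the recovery steps, manual or automated, determines RTO:

```python
from dataclasses import dataclass

@dataclass
class RecoveryStep:
    name: str
    manual_hours: float      # elapsed time when people drive the step
    automated_hours: float   # elapsed time when the step is scripted

# Assumed, illustrative numbers only.
steps = [
    RecoveryStep("assess outage / declare disaster", 2.0, 0.25),
    RecoveryStep("recover hardware / OS / network", 4.0, 0.5),
    RecoveryStep("restore data integrity", 6.0, 1.0),
    RecoveryStep("restart applications, verify transactions", 3.0, 0.5),
]

replication_interval_hours = 12.0   # periodic point-in-time copy every 12 hours

# RPO is bounded by how often data reaches the recovery site (technology).
rpo_hours = replication_interval_hours

# RTO is the sum of the recovery steps; automation is what shrinks it.
rto_manual = sum(s.manual_hours for s in steps)
rto_automated = sum(s.automated_hours for s in steps)

print(f"RPO (worst-case data loss): {rpo_hours} h")
print(f"RTO if manual: {rto_manual} h, RTO if automated: {rto_automated} h")
```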
  • First, let’s review important IBM 2009 messaging.
  • Rework title – All Information has a lifespan based on business value
  • Client Issue: How will technologies evolve to meet the needs of business continuity planning? Strategic Planning Assumption: Data replication for disaster recovery will increase in large enterprises from 25 percent in 2004 to 75 percent by 2006 (0.7 probability).
  • Example of Application / Database replication: DB2 Queue Replication URL: http://www-128.ibm.com/developerworks/db2/library/techarticle/dm-0503aschoff/
  • *The Data Center as a Computer: Introduction to Warehouse Scale Computing, p.81 Barroso, Holzle http://www.morganclaypool.com/doi/pdf/10.2200/S00193ED1V01Y200905CAC006
  • Speed of Decision Making: data volumes have a major effect on "time to analysis" (i.e., the elapsed time between data reception, analysis, presentation, and decision-maker activities). There are four architectural options (CEP, OLTP/ODS, EDW, and big data), and big data is most appropriate when addressing slow decision cycles that are based on large data volumes. CEP's requirement to process hundreds or thousands of transactions per second requires that the decision making be automated using models or business rules. OLTP and ODS support the operational reporting function, in which decisions are made at human speed and based on recent data. The EDW, with the time to integrate data from disparate operational systems, process transformations, and compute aggregations, supports historic trend analysis and forecasting. Big data analysis enables the analysis of larger volumes of data than can be processed within the EDW, and so supports long-term/strategic and one-off transactional and behavioral analysis.

Processing Complexity: processing complexity is the inverse of the speed of decision making. In general, CEP has a relatively simple processing model, although CEP often includes the application of behavioral models and business rules that require complex processing on historic data occurring in the EDW or big data analytics phases of the data-processing pipeline. The requirement to process unstructured data at real-time speeds, for example in surveillance and intelligence applications, is changing this model. Processing complexity increases through OLTP, ODS, and EDW. Two trends are emerging: OLTP is beginning to include an analytics component within the business process and to utilize in-database analytics, and the EDW is exploiting the increasing computational power of the database engine. Processing complexities, and the associated data volumes, are so high within the big data analytics phase that parallel processing is the preferred architectural and algorithmic pattern.

Transactional Data Volumes: transactional data volume is the amount of data (either the number of records/events or the event size) processed within a single transaction or analysis operation. Modern internet IT architectures process a huge number of discrete base events to compute sophisticated, "pockets of value" output. OLTP is similarly concerned with transactional or atomic events. Analysis, with its requirement to process many records simultaneously, starts with the ODS, and its complexity grows within the EDW. Big data analytics, with the requirement to model long-term trends and customer behavior on Web clickstream data, processes even larger transactional data volumes.

Data Structure: the prevalence of non-structured data (semi-, quasi-, and unstructured) increases as the data-processing pipeline is traversed from CEP to big data. The EDW layer is becoming increasingly heterogeneous as other, often non-structured, data sources are required by the analysis being undertaken. This is having a corresponding effect on processing complexity. The mining of structured data is advanced, and systems and products are optimized for this form of analysis. The mining of non-structured data (e.g., text analytics and image processing) is less well understood, computationally expensive, and often not integrated into the many commercially available analysis tools and packages. One of the primary uses of big data analysis is processing Web clickstream data, which is quasi-structured. In addition, the data is not stored within databases; rather, it is collected and stored within files. Some examples of non-structured data that fit the big data definition include log files, clickstream data, shopping cart data, social media data, call or support center logs, and telephone call detail records (CDRs). There is an increasing requirement to process unstructured data at real-time speeds, for example in surveillance and intelligence applications, so this class of data is becoming more important in CEP processing.

Flexibility of Processing/Analysis: data management stakeholders understand the processing and scheduling requirements of transactional processing and operational reporting, and their ability to build analysis models is well proven. Peaks and troughs commonly occur across various time intervals (e.g., the overnight batch processing window or a peak holiday period), but these variations have been studied through trending and forecasting. Big data analysis and a growing percentage of EDW processing are ad hoc or one-off in nature. Data relationships may be poorly understood and require experimentation to refine the analysis. Big data analysis models ("analytic heroes") are continually being challenged by new or refined models ("challengers") to see which has better performance or yields better accuracy. The flexibility of such processing is high, and conversely, the governance that can be applied to such processing is low.

Throughput: throughput, a measure of the degree of simultaneous execution of transactions, is high in transactional and reporting processing. The high data volumes and complex processing that characterize big data analysis are often hardware constrained and have low concurrency. The scheduling of big data analysis processing is not time-critical, so big data analysis is not suitable for real-time or near-real-time requirements.

Source for graphic: "InfoSphere Streams Architecture", Mike Spicer, Chief Architect, InfoSphere Streams, June 2, 2011. Source for quote: Dr. Steve Pratt, CenterPoint Energy, May 25, 2011, IBM Smarter Computing Summit, "Managing the Information Explosion" with Brian Truskowski, between 8:20 and 20:40, http://centerlinebeta.net/smarter-computing-palm-springs/index.html
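The four architectural options above can also be read as a rough decision rule. The sketch below is a non-authoritative paraphrase of that reasoning, with thresholds invented for illustration: automated, sub-second decisions point to CEP, human-speed decisions on recent data to OLTP/ODS, integrated historic trend analysis to the EDW, and slow, strategic decisions over very large, often non-structured volumes to big data analytics.

```python
def suggest_architecture(decision_latency_s: float,
                         data_volume_tb: float,
                         mostly_structured: bool) -> str:
    """Rough paraphrase of the CEP / OLTP-ODS / EDW / big data positioning.

    The thresholds are illustrative assumptions, not vendor guidance.
    """
    if decision_latency_s < 1:                        # automated, event-speed decisions
        return "CEP (complex event processing)"
    if decision_latency_s < 3600 and data_volume_tb < 1:
        return "OLTP / ODS (operational reporting at human speed)"
    if mostly_structured and data_volume_tb < 100:    # integrated, aggregated history
        return "EDW (historic trend analysis and forecasting)"
    return "Big data analytics (large, often non-structured volumes)"

print(suggest_architecture(0.01, 0.1, True))        # -> CEP
print(suggest_architecture(8 * 3600, 500, False))   # -> Big data analytics
```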
  • http://wikibon.org/wiki/v/Big_Data:_Hadoop%2C_Business_Analytics_and_Beyond

A Hadoop "stack" is made up of a number of components. They include: the Hadoop Distributed File System (HDFS), the default storage layer in any given Hadoop cluster; the Name Node, the node in a Hadoop cluster that tells clients where in the cluster particular data is stored and whether any nodes have failed; the Secondary Node, a backup to the Name Node that periodically replicates and stores the Name Node's data in case it fails; the Job Tracker, the node in a Hadoop cluster that initiates and coordinates MapReduce jobs, or the processing of the data; and the Slave Nodes, the grunts of any Hadoop cluster, which store data and take direction from the Job Tracker to process it.

In addition to the above, the Hadoop ecosystem is made up of a number of complementary sub-projects. NoSQL data stores like Cassandra and HBase are also used to store the results of MapReduce jobs in Hadoop. In addition to Java, some MapReduce jobs and other Hadoop functions are written in Pig, an open source language designed specifically for Hadoop. Hive is an open source data warehouse originally developed by Facebook that allows for analytic modeling within Hadoop.

Following is a guide to Hadoop's components:

Hadoop Distributed File System (HDFS): the storage layer of Hadoop, a distributed, scalable, Java-based file system adept at storing large volumes of unstructured data.

MapReduce: a software framework that serves as the compute layer of Hadoop. MapReduce jobs are divided into two (obviously named) parts. The "Map" function divides a query into multiple parts and processes data at the node level. The "Reduce" function aggregates the results of the "Map" function to determine the "answer" to the query.

Hive: a Hadoop-based data warehouse developed by Facebook. It allows users to write queries in SQL, which are then converted to MapReduce. This allows SQL programmers with no MapReduce experience to use the warehouse and makes it easier to integrate with business intelligence and visualization tools such as MicroStrategy, Tableau, Revolution Analytics, etc.

Pig: Pig Latin is a Hadoop-based language developed by Yahoo. It is relatively easy to learn and is adept at very deep, very long data pipelines (a limitation of SQL).

HBase: a non-relational database that allows for low-latency, quick lookups in Hadoop. It adds transactional capabilities to Hadoop, allowing users to conduct updates, inserts, and deletes. eBay and Facebook use HBase heavily.

Flume: a framework for populating Hadoop with data. Agents are deployed throughout one's IT infrastructure (inside web servers, application servers, and mobile devices, for example) to collect data and integrate it into Hadoop.

Oozie: a workflow processing system that lets users define a series of jobs written in multiple languages, such as MapReduce, Pig, and Hive, and then intelligently link them to one another. Oozie allows users to specify, for example, that a particular query is only to be initiated after specified previous jobs on which it relies for data are completed.

Whirr: a set of libraries that allows users to easily spin up Hadoop clusters on top of Amazon EC2, Rackspace, or any virtual infrastructure. It supports all major virtualized infrastructure vendors on the market.

Avro: a data serialization system that allows for encoding the schema of Hadoop files. It is adept at parsing data and performing remote procedure calls.

Mahout: a data mining library. It takes the most popular data mining algorithms for clustering, regression testing, and statistical modeling and implements them using the MapReduce model.

Sqoop: a connectivity tool for moving data from non-Hadoop data stores, such as relational databases and data warehouses, into Hadoop. It allows users to specify the target location inside of Hadoop and instruct Sqoop to move data from Oracle, Teradata, or other relational databases to the target.

BigTop: an effort to create a more formal process or framework for packaging and interoperability testing of Hadoop's sub-projects and related components, with the goal of improving the Hadoop platform as a whole.
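To make the Map/Reduce split described above concrete, here is a minimal, locally runnable word-count sketch in the MapReduce style. It is an illustration only: in a real cluster the two functions would run as distributed mapper and reducer tasks (for example via Hadoop Streaming), and the sample input lines are invented.

```python
# Minimal, in-process simulation of the Map/Reduce word-count pattern.
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Tuple

def map_phase(lines: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """'Map': split the input into records and emit (word, 1) pairs per node."""
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reduce_phase(pairs: Iterable[Tuple[str, int]]) -> Dict[str, int]:
    """'Reduce': aggregate the mapper output to produce the answer."""
    totals: Dict[str, int] = defaultdict(int)
    for word, count in pairs:      # grouping by key stands in for shuffle/sort
        totals[word] += count
    return dict(totals)

sample = ["high availability disaster recovery",
          "disaster recovery business continuity"]
print(reduce_phase(map_phase(sample)))
# {'high': 1, 'availability': 1, 'disaster': 2, 'recovery': 2,
#  'business': 1, 'continuity': 1}
```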
  • Understanding your data, and categorizing it by recovery time, is essential in order to build a cost-justifiable, affordable solution. Finally, not every client can justify near-continuous availability or rapid data recovery solutions. A balance between the priorities of uptime and cost, in concert with the needs of the business, is always necessary. For example, many clients may find that the appropriate cost/recovery-time equation is that it is not necessary for the data at the remote site to be within seconds; the requirement is only for the data at the remote site to be no more than 12 hours old. These types of recoveries do not require ongoing real-time consistent update of data at a remote site. Rather, only a periodic point-in-time copy needs to be made (on disk, or on tape for the lower tiers), and then the copies are simply replicated to the remote site. Server and workload restart is semi-automated or manual.
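As a sketch of that 12-hour example (the copy interval, transfer time, and target are invented numbers for illustration), a periodic point-in-time copy taken every N hours keeps the age of the remote copy, and therefore the worst-case data loss, bounded by the copy interval plus the transfer time:

```python
from datetime import timedelta

COPY_INTERVAL = timedelta(hours=12)   # point-in-time copy taken twice a day
TRANSFER_TIME = timedelta(hours=1)    # assumed time to ship the copy off-site
RPO_TARGET = timedelta(hours=12)      # "remote data no more than 12 hours old"

def worst_case_remote_age(interval: timedelta, transfer: timedelta) -> timedelta:
    # Just before the next copy lands, the newest remote copy is one full
    # interval old, plus however long it took to reach the remote site.
    return interval + transfer

age = worst_case_remote_age(COPY_INTERVAL, TRANSFER_TIME)
print(f"Worst-case age of remote copy: {age}")
print("Meets the 12h requirement" if age <= RPO_TARGET
      else "Exceeds it: shorten the copy interval or speed up the transfer")
```

Run as written, the worst case comes out to 13 hours, which is exactly the kind of result that drives either a shorter copy interval or a move up to a higher BC tier.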
  • Data center complexity has reached crisis levels and is continuing to increase, thereby limiting improvement and growth. Businesses spend a large fraction of their IT budgets on data center resource management rather than on valuable applications and business processes. IT management costs are the dominant IT cost component today and have increased over the past ten years in rough proportion to increasing scale-out sprawl. Basic forces will drive continuing increases in IT complexity. The number of systems deployed will continue to grow rapidly, driven largely by new applications (for Web 2.0, surveillance, operational asset management, ...) and by improving hardware price/performance and utilization (more systems per server). The diversity of IT products will increase as competing suppliers continue to introduce new applications, systems, and management software products. The coupling of IT components is extensive and increasing, driven by application tiering, growing SOA usage, and advances in high-performance standard networks. The resulting increase in IT complexity will further exacerbate the current IT management cost crisis. Managing the increasing IT complexity and scale-out sprawl with traditional IT management software will be increasingly difficult and costly. New approaches to data center architectures are needed to simplify IT management and enable growth.
  • In summary, the animation shows the storage pool concept – mapped to the different technologies: (click) Backup/Restore (click) Rapid Data Recovery (click) Continuous Availability
  • This is an optional chart, showing the typical Information Availability System Storage technologies that we would apply to the various pools of storage (click) including the fact that large unstructured data probably needs to be recovered using file system or application involvement.
  • Here is another way of showing the same step-by-step, incremental-improvement concept. It's a 'big picture' positioning of the various kinds of technologies that can be deployed, step by step, to provide IT BC solutions, starting from the low end and moving to the high end of the cost curve. Click to bring up each of the steps. Note how the icons show where the data flows through the different types of technologies that we will discuss further today.
  • Building upon the previous chart, we continue clicking to show enhancements to Rapid Data Recovery capabilities, followed by Continuous Availability capabilities. (This chart starts from where the previous chart left off.)
  • This is an optional chart, showing the typical Information Availability System Storage technologies that we would apply to the various pools of storage (click) including the fact that large unstructured data probably needs to be recovered using file system or application involvement.
  • This is an optional chart, showing the typical Information Availability System Storage technologies that we would apply to the various pools of storage (click) including the fact that large unstructured data probably needs to be recovered using file system or application involvement.
  • This is an optional chart, showing the typical Information Availability System Storage technologies that we would apply to the various pools of storage (click) including the fact that large unstructured data probably needs to be recovered using file system or application involvement.
  • This is an optional chart, showing the typical Information Availability System Storage technologies that we would apply to the various pools of storage (click) including the fact that large unstructured data probably needs to be recovered using file system or application involvement.
  • In summary, the animation shows the storage pool concept – mapped to the different technologies: (click) Backup/Restore (click) Rapid Data Recovery (click) Continuous Availability
  • In summary, the animation shows the storage pool concept – mapped to the general categories: (click) Backup/Restore (click) Rapid Data Recovery (click) Continuous Availability
  • Here's yet another way to look at this process. Each step of the process that we've reviewed is shown here in a build-up, step-by-step project visualization. In this case, we show how the Timeline of an IT Recovery is improved at each step.
  • Thank you!
  • Transcript

    • 1. Architect's Guide to Designing Integrated Multi-Product HA-DR-BC Solutions. John Sing, Executive Strategy, IBM. Session E10.
    • 2. John Sing • 31 years of experience with IBM in high end servers, storage, and software – 2009 - Present: IBM Executive Strategy Consultant: IT Strategy and Planning, Enterprise Large Scale Storage, Internet Scale Workloads and Data Center Design, Big Data Analytics, HA/DR/BC – 2002-2008: IBM IT Data Center Strategy, Large Scale Systems, Business Continuity, HA/DR/BC, IBM Storage – 1998-2001: IBM Storage Subsystems Group - Enterprise Storage Server Marketing Manager, Planner for ESS Copy Services (FlashCopy, PPRC, XRC, Metro Mirror, Global Mirror) – 1994-1998: IBM Hong Kong, IBM China Marketing Specialist for High-End Storage – 1989-1994: IBM USA Systems Center Specialist for High-End S/390 processors – 1982-1989: IBM USA Marketing Specialist for S/370, S/390 customers (including VSE and VSE/ESA) • singj@us.ibm.com • IBM colleagues may access my webpage: – http://snjgsa.ibm.com/~singj/ • You may follow my daily IT research blog – http://www.delicious.com/atsf_arizona 2
    • 3. Agenda: understand today's challenges and best practices for IT High Availability and IT Business Continuity; what has changed and what is the same; strategies for requirements, design, and implementation; a step-by-step approach covering the essential role of automation, accommodating petabyte scale, and exploiting 2012 Cloud deployment options.
    • 4. Agenda: 1. Solving Today's HA-DR-BC Challenges; 2. Guiding HA-DR-BC Principles to mitigate chaos; 3. Traditional Workloads vs. Internet Scale Workloads; 4. Master Vision and Best Practices Methodology.
    • 5. Recovering today's real-time massive streaming workflows is challenging. Chart in public domain: IEEE Massive File Storage presentation, author: Bill Kramer, NCSA: http://storageconference.org/2010/Presentations/MSST/1.Kramer.pdf
    • 6. Today's Data and Data Recovery Conundrum.
    • 7. Many options, including many non-traditional alternatives, for user deployments, workload hosting, and recovery models (inter-disciplinary). Traditional alternatives: other platforms, other vendors. Non-traditional alternatives: the Cloud, the Developing World. Illustrative Cloud examples only; no endorsement is implied or expressed.
    • 8. Finally, we have this 'little' problem regarding Mobile proliferation (Clayton Christensen, Harvard Business School). From an IT standpoint, we are clearly seeing the "consumerization of IT". The key is to recognize and exploit the hyper-pace reality of BYOD's associated data: not just the technology, but also the recovery model (the "cloud"), the business model, and the required ecosystem. http://en.wikipedia.org/wiki/Disruptive_innovation
    • 9. So how do we affordably architect HA / BC / DR in 2012?
    • 10. What has remained the same? (Continued good Guiding Principles that mitigate HA/DR/BC chaos): Storage Efficiency, Service Management, Data Protection.
    • 11. The Business Process is still the Recoverable Unit. The slide shows business processes A through G layered above applications (DB2, WebSphere, MQSeries, analytics/reporting) and infrastructure. 1. An error occurs on a storage device that corrupts a database. 2. The error impacts the ability of two or more applications to share critical data. 3. The loss of both DB2 applications affects two distinctly different business processes. IT Business Continuity must recover at the corresponding business process level.
    • 12. Cloud does not change the business process; it is still the recovery unit. 1. Data is input to the cloud. 2. A Cloud provider outage occurs. 3. The loss of Cloud output affects two distinctly different business processes. Cloud is simply another deployment option, but it doesn't change the fundamental HA/BC approach.
    • 13. When can Cloud recovery provide extremely fast time to project completion? Where entire business process recoverable units can be out-sourced to a Cloud provider. Production example: out-sourcing production, or backup/restore, or an integrated standalone application to a provider. Cloud application-as-a-service (AaaS) example: Salesforce.com, etc.
    • 14. The trick to leveraging Cloud is understanding that Cloud is simply another (albeit powerful) deployment choice. Good news: fundamental principles for HA/DR/BC haven't changed; it's only the deployment options that have changed.
    • 15. Still true: synergistic overlap of valid data protection techniques. IT Data Protection: 1. High Availability – fault-tolerant, failure-resistant, streamlined infrastructure at affordable cost; 2. Continuous Operations – non-disruptive backups and system maintenance coupled with continuous availability of applications; 3. Disaster Recovery – protection against unplanned outages such as disasters through a reliable, predictable recovery foundation. Goals: protection of critical business data, operations continue after a disaster, recovery is predictable and reliable, and costs are predictable and manageable.
    • 16. Four Stages of Data Center Efficiency (pre-requisites for HA/BC/DR), April 2012. http://www-935.ibm.com/services/us/igs/smarterdatacenter.html http://public.dhe.ibm.com/common/ssi/ecm/en/rlw03007usen/RLW03007USEN.PDF
    • 17. Telecom bandwidth is still the major delimiter for any fast recovery. Still true: Timeline of an IT Recovery. From the outage, operations, network, and applications staff assess, then execute hardware, operating system, network, and data integrity recovery (the Recovery Time Objective of hardware data integrity), followed by application and transaction integrity recovery (the RTO of transaction integrity) – now we're done. The Recovery Point Objective (RPO) is how much data must be recreated.
    • 18. Still true: the value of Automation for real-time failover. Automation applied across the same timeline (assess, hardware/OS/network recovery, data integrity, transaction recovery) provides reliability, repeatability, scalability, and frequent testing.
    • 19. Still true: organize High Availability and Business Continuity technologies by balancing recovery time objective with cost/value. BC Tier 1 – restore from tape; BC Tier 2 – tape libraries + automation; BC Tier 3 – VTL, data de-dup, remote vault; BC Tier 4 – add point-in-time replication to backup/restore; BC Tier 5 – add application/database integration to backup/restore; BC Tier 6 – add real-time continuous data replication, server or storage; BC Tier 7 – add server or storage replication with end-to-end automated server recovery. Tiers 1–3 recover from a tape copy, Tiers 4–7 from a disk image; recovery time guidelines run from days (tape) through 24, 12–16, 8–12, 4–8, and 1–4 hours down to roughly 15 minutes (guidelines only).
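The tier ladder on this slide can be treated as a lookup: pick the lowest-cost tier whose guideline recovery time still meets the business RTO. A minimal sketch of that selection follows; the hour values approximate the chart's guideline ranges, and the relative cost ranks are assumptions for illustration, not IBM guidance.

```python
# (tier, description, guideline worst-case RTO in hours, relative cost rank)
BC_TIERS = [
    (1, "Restore from tape",                                     72.0, 1),
    (2, "Tape libraries + automation",                           24.0, 2),
    (3, "VTL, data de-dup, remote vault",                        16.0, 3),
    (4, "Point-in-time replication added to backup/restore",     12.0, 4),
    (5, "Application/database integration with backup/restore",   8.0, 5),
    (6, "Real-time continuous data replication",                  4.0, 6),
    (7, "Replication + end-to-end automated recovery",            0.25, 7),
]

def lowest_cost_tier_for(rto_hours: float):
    """Return the cheapest BC tier whose guideline RTO meets the target."""
    candidates = [t for t in BC_TIERS if t[2] <= rto_hours]
    if not candidates:
        return BC_TIERS[-1]       # only Tier 7 approaches sub-hour recovery
    return min(candidates, key=lambda t: t[3])

tier = lowest_cost_tier_for(rto_hours=6)
print(f"BC Tier {tier[0]}: {tier[1]}")   # -> BC Tier 6 for a 6-hour RTO target
```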
    • 20. Still true: Replication Technology Drives RPO. On the recovery point scale (weeks, days, hours, minutes, seconds of potential data loss), tape backup sits at the far end, then periodic replication, then asynchronous replication, with synchronous replication / HA closest to zero; the matching scale on the right-hand side is recovery time.
    • 21. Still true: Recovery Automation Drives Recovery Time. Recovery time includes fault detection, recovering data, bringing applications back online, and network access. Manual tape restore sits at the slow end of the scale; storage clustering automation and end-to-end automated recovery sit at the fast end.
    • 22. Still true: the "ideal world" construct for IT High Availability and Business Continuity. Business processes drive strategies and are integral to the continuity of business operations; a company cannot be resilient without strategies for alternate workspace, staff members, call centers, and communications channels. The program runs from business prioritization, through assessment of threats, vulnerabilities, risks, impacts of outages, current capabilities, and estimated recovery time (RTO/RPO), to program design, implementation, and integration into IT resilience program management (awareness, regular validation, change management, quarterly management briefings), with a maturity model, ROI measurement, and roadmap. The high availability program design covers: 1. People, 2. Processes, 3. Plans, 4. Strategies, 5. Networks, 6. Platforms, 7. Facilities – spanning the crisis team, business resumption, disaster recovery, and high availability design for servers, storage, data replication, database, and software. Source: IBM STG, IBM Global Services.
    • 23. The 2012 bottom line (IT Business Continuity Planning Steps): for today's real-world environment, we need a faster way than even this simplified 2007 version: 1. Collect information for prioritization; 2. Vulnerability, risk assessment, scope; 3. Define BC targets based on scope; 4. Solution option design and evaluation; 5. Recommend solutions and products; 6. Recommend strategy and roadmap. 2012 key #1: streamline the "ideal" process – you need a basic Data Strategy. 2012 key #2: exploit workload type.
    • 24. Streamlined BC Actions, 2005 version (each action with its input and output). 1. Collect info for prioritization: input – scope, resources, business processes, key performance indicators, IT component inventory; output – business impact, effect on business processes. 2. Vulnerability / risk assessment: input – list of vulnerabilities; output – defined vulnerabilities. 3. Define HA/BC targets based on scope: input – existing BC capability, KPIs, targets, and success rate; output – defined BC baseline, architecture targets, decision and success criteria. 4. Solution design and evaluation: input – technologies and solution options; output – business process segments and solutions. 5. Recommend solutions and products: input – generic solutions that meet criteria; output – recommended IBM solutions and benefits. 6. Recommend strategy and roadmap: input – budget, major project milestones, resource availability, business process priority; output – baseline business continuity strategy, roadmap, benefits, challenges, financial implications and justification.
    • 25. Streamlined BC Actions, 2012 version: the same six actions as the 2005 version, with two changes – at step 2 (vulnerability / risk assessment), do a basic HA/DR Data Strategy, and at step 5 (recommend solutions and products), exploit workload type.
    • 26. How do we get there in 2012? Bottom line #1: have a basic Data Strategy. Bottom line #2: exploit workload type. (Storage Efficiency, Service Management, Data Protection.)
    • 27. i.e. #1: It's all about the Data. Now, what do I mean by that?
    • 28. What is a basic Data Strategy? Specify data usage over its lifespan: applications create data, and information management carries it through archive / retain / delete as the frequency of access and use declines over time.
    • 29. Data strategy = collecting information, prioritizing, vulnerability/risk, and scope – the front end of the "ideal world" BC program shown earlier (business prioritization, risk assessment, program design, implementation, and ongoing resilience program management), now anchored by a Data Strategy. Source: IBM STG, IBM Global Services.
    • 30. Data Strategy defined: its relationship to the Business and IT strategies. Business strategy (business scope, distinct competencies, business governance) and IT strategy (technology scope, system competencies, IT governance) drive the Data Strategy, which in turn shapes the enterprise IT architecture and the IT infrastructure, organization, and processes (people, data, process, technology, structure; skills and tools).
    • 31. The role of the basic "Data Strategy" for HA / BC purposes. Define the major data types "good enough" – i.e. by major application, by business line; it's an ongoing journey, and you have to know your data. For each data type, capture: usage; performance and measurement; security; availability; criticality; organizational role; who manages it; and what standards apply – what type of storage it is deployed on, what database, what virtualization. Be pragmatic: create a basic, "good enough" data strategy for HA/BC purposes, and acquire tools that help you know your data.
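One pragmatic way to make such a "good enough" data strategy concrete is a small catalog with one record per major data type, carrying the attributes the slide lists. The sketch below is illustrative only; the field names and example values are invented, not a prescribed schema.

```python
from dataclasses import dataclass

@dataclass
class DataTypeEntry:
    """One 'good enough' data-strategy record for a major data type."""
    name: str              # by major application or business line
    usage: str
    performance: str
    security: str
    availability: str      # e.g. target recovery class
    criticality: str
    owner: str             # who manages it
    storage_standard: str  # what storage / database / virtualization it lives on

catalog = [
    DataTypeEntry("order-processing database", "OLTP, 24x7", "sub-second reads",
                  "restricted", "minutes (continuous availability)",
                  "mission critical", "DBA team",
                  "virtualized block storage, DB2"),
    DataTypeEntry("marketing file shares", "occasional reference", "best effort",
                  "internal", "next business day", "low",
                  "infrastructure team", "NAS, backed up to VTL"),
]

for entry in catalog:
    print(f"{entry.name}: availability = {entry.availability}, owner = {entry.owner}")
```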
    • 32. Here's the major difference for 2012: there are two major types of workloads. Traditional IT – HA, business continuity, and disaster recovery can be done "agnostic / after the fact" using replication; the data strategy uses traditional tools and concepts to understand and know the data, plus storage/server virtualization and pooling; automation means end-to-end automation of server and storage virtualization. Internet Scale Workloads – HA/DR/BC must be "designed into the software stack from the beginning"; the data strategy uses a proven open source toolset to implement failure tolerance and redundancy in the application stack; automation means end-to-end automation of the application software stack providing failure tolerance. Commonality – apply the master vision and lessons learned from internet scale data centers.
    • 33. Choices for high availability and replication architectures: a production site and a geographically load-balanced recovery site, each with web, application/DB, and server clusters; replication can occur at the workload balancer, application or database, server, or storage level, alongside local backup, point-in-time images, and tape backup at other site(s).
    • 34. Comparing IT BC architectural methods. Application / database / file system replication / workload balancer: typically requires the least bandwidth; may be required if the scale of storage is very large (i.e. internet scale); the span of consistency is that application, database, or file system only; well understood by database, application, and file system administrators; can be a more complex implementation, since it must be implemented for each application (file system / DB / application aware). Replication – server (traditional IT): well understood by operating system administrators; storage- and application-independent, but uses server cycles; the span of recovery is limited to that server platform. Replication – storage (traditional IT): can provide common recovery across multiple application stacks and multiple server platforms (file system / DB / application agnostic); usually requires more bandwidth; requires a storage replication skill set.
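The comparison on this slide amounts to a short decision rule for where to replicate. The sketch below paraphrases it; the criteria and their precedence are a simplified reading of the slide, not a formal IBM method.

```python
def choose_replication_layer(cross_stack_consistency: bool,
                             internet_scale_storage: bool,
                             bandwidth_constrained: bool) -> str:
    """Simplified reading of the slide's trade-offs; precedence is assumed."""
    if internet_scale_storage:
        # At very large scale, replicating in the application, database,
        # or file system layer is often the only practical option.
        return "Application / database / file system replication"
    if cross_stack_consistency:
        # Storage replication gives one common recovery point across many
        # application stacks and server platforms, but needs more bandwidth
        # and storage-replication skills.
        return "Storage replication"
    if bandwidth_constrained:
        # Application-level replication typically needs the least bandwidth,
        # at the cost of implementing it per application.
        return "Application / database / file system replication"
    # Otherwise server-level replication: storage/application independent,
    # with the span of recovery limited to that server platform.
    return "Server replication"

print(choose_replication_layer(cross_stack_consistency=True,
                               internet_scale_storage=False,
                               bandwidth_constrained=False))
# -> Storage replication
```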
    • 35. Principles for Internet Scale Workloads 35
    • 36. Internet Scale Workload Characteristics – 1. Embarrassingly parallel internet workload: immense data sets, but relatively independent records being processed (example: billions of web pages, billions of log / cookie / click entries); web requests from different users are essentially independent of each other, creating natural units of data partitioning and concurrency that lend themselves well to cluster-level scheduling and load-balancing; this independence (very low inter-process communication) means peak server performance is not important – what matters is the aggregate throughput of 100,000s of servers. Workload churn: well-defined, stable high-level APIs (i.e. simple URLs); software release cycles on the order of every couple of weeks, meaning Google's entire core of search services was rewritten in 2 years; great for rapid innovation – expect significant software rewrites to fix problems on an ongoing basis; new products emerge hyper-frequently, often with workload-altering characteristics (example: YouTube).
    • 37. Internet Scale Workload Characteristics – 2. Platform homogeneity: a single company owns, has the technical capability for, and runs the entire platform end-to-end, including an ecosystem; most Web applications are more homogeneous than traditional IT, with an immense number of independent worldwide users. Fault-free operation via application middleware: 1%–2% of all Internet requests fail*; there is some type of failure every few hours, including software bugs, all hidden from users by fault-tolerant middleware – users can't tell the difference between the Internet being down and your system being down, which means the hardware and software don't have to be perfect; hence 99% is good enough. Immense scale: the workload can't be held within one server, or within the maximum size of a tightly-clustered, memory-shared SMP; it requires clusters of 1000s or 10,000s of servers with corresponding petabytes of storage, network, power, cooling, and software; this scale of compute power also makes possible apps such as Google Maps, Google Translate, Amazon Web Services EC2, Facebook, etc. *The Data Center as a Computer: Introduction to Warehouse Scale Computing, p.81, Barroso, Holzle, http://www.morganclaypool.com/doi/pdf/10.2200/S00193ED1V01Y200905CAC006
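Because the records are independent, aggregate throughput comes from partitioning the work across many inexpensive workers rather than from one fast server. The sketch below illustrates that idea locally with a small process pool; the click-log records and the per-record function are invented for illustration.

```python
from multiprocessing import Pool
from typing import Dict, Tuple

# Invented sample records; in practice these are billions of independent
# log / cookie / click entries spread across many servers.
records = [f"user{i % 7},click,/page/{i}" for i in range(10_000)]

def process_record(record: str) -> Tuple[str, int]:
    """Per-record work that depends on no other record."""
    user, _action, _path = record.split(",")
    return (user, 1)

if __name__ == "__main__":
    # Aggregate throughput scales with the number of workers, because the
    # records never need to communicate with one another.
    with Pool(processes=4) as pool:
        results = pool.map(process_record, records, chunksize=500)

    clicks_per_user: Dict[str, int] = {}
    for user, count in results:
        clicks_per_user[user] = clicks_per_user.get(user, 0) + count
    print(clicks_per_user)
```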
    • 38. IT architecture at internet scale. Fundamental assumptions of internet scale architectures: distributed aggregation of data; high availability and failure-tolerance functionality live in software on the server; breakage is "OK" if I can insulate it from the user; use open source software wherever possible; expect that something somewhere in the infrastructure will always be broken, and design the infrastructure top to bottom to address this. The driving criteria are cost, time to market, and affordability, taken to extremes of scale, parallelism, performance, and real time – all other criteria are driven off of these.
    • 39. For Internet Scale workloads, an open source based internet-scale software stack. The example shown is the 2003-2008 Google version: 1. Google File System Architecture – GFS II; 2. Google Database – Bigtable; 3. Google Computation – MapReduce; 4. Google Scheduling – GWQ. Reliability and redundancy are all in the "application stack"; the OS and hardware don't do any of the redundancy.
    • 40. Internet-scale HA/DR/BC IT infrastructure for Internet Scale Workloads: each red block is an inexpensive server with plenty of power for its portion of the workflow; input arrives from the Internet and your customers.
    • 41. Warehouse Scale Computer programmer productivity framework example: Hadoop – overall name of the software stack; HDFS – Hadoop Distributed File System; MapReduce – software compute framework (Map = queries, Reduce = aggregates answers); Hive – Hadoop-based data warehouse; Pig – Hadoop-based language; HBase – non-relational database for fast lookups; Flume – populates Hadoop with data; Oozie – workflow processing system; Whirr – libraries to spin up Hadoop on Amazon EC2, Rackspace, etc.; Avro – data serialization; Mahout – data mining; Sqoop – connectivity to non-Hadoop data stores; BigTop – packaging / interop of all Hadoop components. http://wikibon.org/wiki/v/Big_Data:_Hadoop%2C_Business_Analytics_and_Beyond
    • 42. Summary – two major types of approaches, depending on workload type. Traditional IT: HA/DR/BC can be done "agnostic / after the fact" using replication; use traditional tools and concepts to understand and know the data, with storage/server virtualization and pooling; end-to-end automation of server and storage virtualization. Internet Scale Workloads: HA/DR/BC must be "designed into the software stack from the beginning"; a proven open source toolset implements failure tolerance and redundancy in the application stack; end-to-end automation of the application software stack provides failure tolerance. Commonality: apply the master vision and lessons learned from internet scale data centers.
    • 43. Principles for Architecting IT HA / DR / Business Continuity 43
    • 44. Key strategy: segment data into logical storage pools by appropriate data protection characteristics (animated chart). Mission critical – Continuous Availability (CA): end-to-end automation enhances rapid data recovery; RTO is near-continuous and RPO as small as possible (Tier 7); the priority is uptime, with high-value justification. Rapid Data Recovery (RDR) – enhance backup/restore for the data that requires it; RTO from minutes to approximately 2 to 6 hours (BC Tiers 6, 4); balanced priorities of uptime and cost/value. Backup/Restore (B/R) – assure an efficient foundation: standardize the base backup/restore foundation, provide universal 24-hour to 12-hour (approximate) recovery capability, and address requirements for archival, compliance, and green energy; the priority is cost. Lower cost is enabled by virtualization. Know and categorize your data – it provides the foundation for affordable data protection.
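A small classifier makes the pooling idea concrete: given a data type's required RTO and its business criticality, place it in the Continuous Availability, Rapid Data Recovery, or Backup/Restore pool. This is an illustrative sketch only; the thresholds mirror the approximate ranges on the chart, and the example datasets are invented.

```python
def assign_pool(rto_hours: float, mission_critical: bool) -> str:
    """Map a data type to one of the three storage pools on the chart."""
    if mission_critical and rto_hours < 0.5:
        return "Continuous Availability (Tier 7)"
    if rto_hours <= 6:
        return "Rapid Data Recovery (Tiers 4-6)"
    return "Backup/Restore foundation (Tiers 1-3)"

# Invented examples of categorized data types.
datasets = {
    "core transaction database": (0.1, True),
    "data warehouse":            (4.0, False),
    "archive file shares":       (24.0, False),
}
for name, (rto, critical) in datasets.items():
    print(f"{name}: {assign_pool(rto, critical)}")
```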
    • 45. Virtualization is fundamental to addressing today's IT diversity.
    • 46. Consolidated virtualized systems become the Recoverable Units for IT Business Continuity: the virtualized IT infrastructure beneath the business processes becomes the resource pools that enable recoverability.
    • 47. High Availability and Business Continuity as a step-by-step virtualization journey, balancing recovery time objective with cost/value across the same BC Tiers 1–7 shown earlier (tape restore through end-to-end automated replication), built on a foundation of storage pools.
    • 48. Storage pools: apply the appropriate storage technology to each pool. Real-time replication (storage, server, or software) with automated failover to replicated storage; periodic point-in-time replication (file system, point-in-time disk, VTL to VTL with dedup); removable media as the foundation backup/restore, with physical or electronic transport; petabyte unstructured data, due to its usage and large scale, typically uses application-level intelligent redundancy and failure-toleration design, or file, application, or disk-to-disk periodic replication.
    • 49. Methodology: traditional IT HA / BC / DR in stages, from the bottom up. Foundation: standardized, automated tape backup (Tiers 1, 2); foundation: electronic vaulting, automation, tape library (Tier 3) – e.g. IBM ProtecTier, IBM Virtual Tape Library, IBM Tivoli Storage Manager, with VTL, de-dup, and remote replication at the tape level; then add point-in-time copy, disk-to-disk, and tiered storage (Tier 4) – e.g. IBM FlashCopy, SnapShot, IBM XIV, SVC, DS, SONAS, IBM Tivoli Storage Productivity Center 5.1 – moving up the cost / recovery time curve.
    • 50. Methodology, continued: automate applications and databases for replication and automation (Tier 5) – e.g. server virtualization, Tivoli FlashCopy Manager; consolidate and implement real-time data availability (Tier 6) – e.g. VMware, PowerHA on p for servers, or storage replication such as Metro Mirror, Global Mirror, Hitachi UR on XIV, SVC, DS, and other storage, with TPC 5.1; finally, end-to-end automated site failover of servers, storage, and applications with dynamic application and storage integration (Tier 7).
    • 51. Technology deployments in Cloud span a spectrum: 1. Enterprise data center / private cloud – client-managed implementation, internal or services-partner cloud; 2. Managed private cloud – co-lo operated; 3. Hosted private cloud – co-lo owned and operated; 4. Public cloud services – pay-per-usage, provider-owned assets, finer granularity in the multi-tenancy model, supporting compute-centric workloads, compute cloud and persistent storage; 5. Shared cloud services – standardized, multi-tenant service, pay-per-usage model with provider-owned assets. Consumption models include client-owned and provider-owned assets; delivery options include client premise and hosted, including strategic outsourcing clients with standardized services.
    • 52. Cloud as a remote-site deployment option: production stays on premise and recovery runs in the Cloud, fed by real-time replication (storage, server, or software) or periodic point-in-time replication (file system, point-in-time disk, VTL to VTL with dedup, point-in-time copies, physical or electronic transport); petabyte-level storage typically uses intelligent file or application replication due to its large scale and usage patterns.
    • 53. Data strategy with virtualized, automated storage failover to a remote cloud: the same pools apply – real-time replication with automated failover, periodic point-in-time replication, removable media with physical or electronic transport, disk-to-disk replication, and intelligent file or application replication for petabyte unstructured data.
    • 54. Local Cloud deployment from a data standpoint (including the petabyte unstructured pool).
    • 55. Cloud provider responsibility for HA and BC: your production runs in the Cloud and recovery is performed by the Cloud provider, using the same techniques – real-time replication (storage, server, or software), periodic point-in-time replication (file system, point-in-time disk, VTL to VTL with dedup, point-in-time copies, physical or electronic transport), and intelligent file or application replication for petabyte-level storage.
    • 56. Today's world: High Availability and Business Continuity is a step-by-step data strategy / workload journey across BC Tiers 1–7, balancing recovery time objective with cost/value, with Cloud as a deployment option if needed; data strategy and workload types underpin the journey.
    • 57. Step by step: virtualization, High Availability, and Business Continuity, with Cloud deployment if needed. The BC tiers group into Backup/Restore (Tiers 1–3), Rapid Data Recovery (Tiers 4–6), and Continuous Availability (Tier 7), balancing recovery time objective with cost/value; data strategy and workload types sit underneath.
    • 58. Summary – IT High Availability / Business Continuity best practices 2012. Foundation: understand my data, define the scope of recovery, implement remote recovery sites (Tiers 1, 2), with storage and server virtualization and consolidation, a replicated SAN, and standardized backup/restore at both the production and recovery sites. Backup/Restore: implement Tier 3 – consolidate and standardize backup/restore methods; implement tape VTL, data de-dup, server/storage virtualization and management tools, and basic automation. Rapid Data Recovery: implement Tier 4 – standardize the use of disk-to-disk and point-in-time disk copy; implement Tier 6 – standardize the high-volume data replication method. Continuous Availability: implement BC Tier 7 – standardize the use of continuous availability with automated failover. Data strategy and workload types underpin every step.
    • 59. Summary. Understand today's best practices for IT High Availability and IT Business Continuity (data strategy, workload types). What has changed? What is the same? The principles for requirements have not changed (data strategy); deployment for true internet-scale workloads relies on application-level redundancy. Strategies for requirements, design, and implementation, including in-house vs. out-sourcing Cloud deployment options. Step-by-step approach: automation and virtualization are essential; segment workloads, traditional vs. petabyte scale; exploit Cloud.