Welcome to this session on Private Cloud. The topic is "Private Cloud: Principles, Concepts and Patterns." I'm Tom Shinder, a Principal Writer in the Server and Cloud Division Information Experience Group. You might know me from my ISAserver.org days. If you've been thinking "I know that guy from somewhere" – then now you know from where.
Here's today's agenda. We will start with a short introduction to architecture and why it's important. This talk covers architectural principles, concepts and patterns, so I want to provide you with the rationale and motivation for understanding this material. After the short architecture "sales job" we'll move on to the following subjects: Cloud Service and Deployment Models; Private Cloud Principles, Concepts and Patterns; the Top Ten Private Cloud Architecture Issues and Lessons; and Patterns in Infrastructure as a Service.
Is knowing architecture useful? I've heard each of these said in the past.
Architects are closer to rocket scientists – how do you think the Star Ship Enterprise knew where to go?
Architects need to have an understanding of the capabilities that software can provide, understand what is currently possible and not possible, and inform architectural alternatives.
Architects do a lot of things – they typically have datacenter infrastructure and operations experience before getting into the architecture business.
Are you sure? Maybe you did need an architect!
Ah ha! If you don't know what an architect is, then this is a great time to learn about the purpose and value that architecture provides.
Architects can also work as clowns on the side if they like kids.
It's important to understand that the industry is placing increasing importance on architects and architecture. A Corporate Executive Board study shows that among the CIOs surveyed, 63% believe that architecture is growing in importance, and 47% of them are having difficulty finding architects. A recent Gartner study focused on the future of IT notes that with the growing importance of cloud computing, there will be an increased emphasis on architecture roles, and that the term "cloud architect" is used increasingly often when thinking about the new roles. Gartner sees a number of new cloud architect positions becoming important in the next five years. Many believe that while there is likely to be a significant contraction in the total number of infrastructure and operations people due to widespread adoption of cloud computing, the number of cloud architect positions will increase, so that the total number of IT positions will remain stable or potentially grow as the architect roles are defined and refined.
The Winchester "Mystery" House is an example of what happens when there is no architectural blueprint for a building design. (Read the four bullet points.) Does this remind you of your datacenter network today?
A surgeon doesn't just grab a scalpel and start cutting based on where he thinks the organs are, or even after having participated in a few surgical operations. Surgeons need to understand the entire system on which they're working, as a surgical procedure requires prerequisite knowledge of many areas, such as anatomy, physiology, pharmacology, biochemistry, neuroscience, pathology and microbiology, even before considering the surgical procedures themselves. These areas provide the architectural framework that provides the definitions, constraints, requirements and decision points for every step of the surgical process.
This slide shows the US National Institute of Standards and Technology (or NIST's) definition of "the cloud", which is generally accepted by vendors and service providers across the industry. (http://csrc.nist.gov/groups/SNS/cloud-computing/cloud-computing-v26.ppt) I'm assuming that most people watching this session are broadly familiar with these definitions, so I won't go through them in detail, but I will run through them quickly to ensure that we all have a consistent definition and to set the context for the rest of this session.
A solution must have 5 essential characteristics for it to be considered a "cloud":
On-demand self-service. A consumer can unilaterally provision computing capabilities, such as server time and network storage, as needed automatically without requiring human interaction with each service's provider.
Broad network access. Capabilities are available over the network and accessed through standard mechanisms that promote use by heterogeneous thin or thick client platforms (e.g. mobile phones, laptops, and PDAs).
Resource pooling. The provider's computing resources are pooled to serve multiple consumers using a multi-tenant model, with different physical and virtual resources dynamically assigned and reassigned according to consumer demand. There is a sense of location independence in that the customer generally has no control or knowledge over the exact location of the provided resources but may be able to specify location at a higher level of abstraction (e.g. country, state, or datacenter). Examples of resources include storage, processing, memory, network bandwidth, and virtual machines.
Rapid elasticity. Capabilities can be rapidly and elastically provisioned, in some cases automatically, to quickly scale out, and rapidly released to quickly scale in. To the consumer, the capabilities available for provisioning often appear to be unlimited and can be purchased in any quantity at any time.
Measured Service.
Cloud systems automatically control and optimize resource use by leveraging a metering capability at some level of abstraction appropriate to the type of service (e.g. storage, processing, bandwidth, and active user accounts). Resource usage can be monitored, controlled, and reported, providing transparency for both the provider and consumer of the utilized service.
There are 3 generally-accepted cloud service models:
Cloud Software as a Service (SaaS). The capability provided to the consumer is to use the provider's applications running on a cloud infrastructure and accessible from various client devices through a thin client interface such as a Web browser (e.g. web-based email). The consumer does not manage or control the underlying cloud infrastructure, network, servers, operating systems, storage, or even individual application capabilities, with the possible exception of limited user-specific application configuration settings.
Cloud Platform as a Service (PaaS). The capability provided to the consumer is to deploy onto the cloud infrastructure consumer-created applications using programming languages and tools supported by the provider (e.g. Java, Python, .NET). The consumer does not manage or control the underlying cloud infrastructure, network, servers, operating systems, or storage, but the consumer has control over the deployed applications and possibly application hosting environment configurations.
Cloud Infrastructure as a Service (IaaS). The capability provided to the consumer is to provision processing, storage, networks, and other fundamental computing resources where the consumer is able to deploy and run arbitrary software, which can include operating systems and applications.
The consumer does not manage or control the underlying cloud infrastructure but has control over operating systems, storage, deployed applications, and possibly select networking components (e.g., firewalls, load balancers).
And there are 4 different deployment models:
Private cloud. The cloud infrastructure is operated solely for an organization. It may be managed by the organization or a third party and may exist on premises or off premises.
Community cloud. The cloud infrastructure is shared by several organizations and supports a specific community that has shared concerns (e.g. mission, security requirements, policy, and compliance considerations). It may be managed by the organizations or a third party and may exist on premises or off premises.
Public cloud. The cloud infrastructure is made available to the general public or a large industry group and is owned by an organization selling cloud services (e.g. Office 365, Azure).
Hybrid cloud. The cloud infrastructure is a composition of two or more clouds (private, community, or public) that remain unique entities but are bound together by standardized or proprietary technology that enables data and application portability (e.g. cloud bursting).
NOTE: Stress that there is no vertical alignment between the Deployment models and the Service models as might be implied by the diagram. A cloud implementation can adopt any Deployment model and use any Service model. This presentation will focus on a private cloud implemented as IaaS, but it could equally be PaaS or SaaS.
Finally, there are a set of common characteristics that are not required to be considered "cloud", but are typically associated with or desirable in a cloud offering.
Customers will typically value some characteristics more than others, such as advanced security with less regard to cost or scalability.
Private cloud is the focus of this presentation. IaaS is the service model we are going to focus on because:
It is the common starting point for most organizations, being an evolution (and service orientation) of data center concepts they have already adopted.
It is the most generic, providing fully-managed and self-service compute, network and storage resources. Adding platform and application services makes the security discussion a lot more specific to what is being deployed.
In this presentation we will touch upon each of these private cloud principles. The following provides you with more depth for each of the principles.
Principles
The principles outlined in this section provide general rules and guidelines to support the evolution of a cloud infrastructure. They are enduring, seldom amended, and inform and support the way a cloud fulfills its mission and goals. They also strive to be compelling and aspirational in some respects, since there needs to be a connection with business drivers for change. These principles are often interdependent and together form the basis on which a cloud infrastructure is planned, designed and created.
Achieve Business Value through Measured Continual Improvement
Statement: The productive use of technology to deliver business value should be measured via a process of continual improvement.
Rationale: All investments into IT services need to be clearly and measurably related to delivering business value. Often the returns on major investments into strategic initiatives are managed in the early stages but then tail off, resulting in diminishing returns. By continuously measuring the value which a service is delivering to a business, improvements can be made which achieve the maximum potential value. This ensures the use of evolving technology to the productive benefit of the consumer and the efficiency of the provider. Adhered to successfully, this principle results in a constant evolution of IT services which provide the agile capabilities that a business requires to attain and maintain a competitive advantage.
Implications: The main implication of this principle is the requirement to constantly calculate the current and future return from investments. This governance process needs to determine if there is still value being returned to the business from the current service architecture and, if not, determine which element of the strategy needs to be adjusted.
Perception of Infinite Capacity
Statement: From the consumer's perspective, a cloud service should provide capacity on demand, limited only by the amount of capacity the consumer is willing to pay for.
Rationale: IT has historically designed services to meet peak demand, which results in underutilization that the consumer must pay for. Likewise, once capacity has been reached, IT must often make a monumental investment in time, resources and money in order to expand existing capacity, which may negatively impact business objectives. The consumer wants "utility" services where they pay for what they use and can scale capacity up or down on demand.
Implications: A highly mature capacity management strategy must be employed by the provider in order to deliver capacity on demand. Predictable units of network, storage and compute should be pre-defined as scale units. The procurement and deployment times for each scale unit must be well understood and planned for. Therefore, management tools must be programmed with the intelligence to understand scale units, procurement and deployment times, and current and historical capacity trends that may trigger the need for additional scale units. Finally, the provider (IT) must work closely with the consumer (the business) to understand new and changing business initiatives that may change historical capacity trends. The process of identifying changing business needs and incorporating these changes into the capacity plan will be critical to the provider's Capacity Management processes.
Perception of Continuous Service Availability
Statement: From the consumer's perspective, a cloud service should be available on demand from anywhere and on any device.
Rationale: Traditionally, IT has been challenged by the availability demands of the business. Technology limitations, architectural decisions and lack of process maturity all lead to increased likelihood and duration of availability outages.
High availability services can be offered, but only after a tremendous investment in redundant infrastructure. Access to most services has often been limited to on-premises access due to security implications. Cloud services must provide a cost-effective way of maintaining high availability and address security concerns so that services can be made available over the internet.
Implications: In order to achieve cost-effective highly available services, IT must create a resilient infrastructure and reduce hardware redundancy wherever possible. Resiliency can only be achieved through highly automated fabric management and a high degree of IT service management maturity. In a highly resilient environment, it is expected that hardware components will fail. A robust and intelligent fabric management tool is needed to detect early signs of imminent failure so that workloads can be quickly moved off failing components, ensuring the consumer continues to experience service availability. Legacy applications may not be designed to leverage a resilient infrastructure, and some applications may need to be redesigned or replaced in order to achieve cost-effective high availability. Likewise, in order to allow service access from anywhere, it must be proven that security requirements can be met when access occurs over the internet. Finally, for a true cloud-like experience, considerations should be made to ensure the service can be accessed from the wide array of mobile devices that exist today.
Take a Service Provider's Approach
Statement: The provider of a cloud should think and behave like they are running a Service Provider business rather than an IT department within an Enterprise.
Rationale: Enterprise IT is often driven and funded by business initiatives, which encourages a silo approach and leads to inefficiencies. Solution Architects may feel it is simply too risky to share significant infrastructure between solutions.
The impact of one solution on another cannot be eliminated, and therefore each solution builds its own infrastructure, only sharing capabilities where there is high confidence. The result is the creation of projects that increase efficiencies (e.g. virtualization and data center consolidation).
A cloud service is a shared service, and therefore needs to be defined in a way that gives the consumer confidence to adopt it; its capabilities, performance and availability characteristics are clearly defined. At the same time, the cloud needs to show value to the organization. Because Service Providers sell to customers, there is a clear separation between the provider and the customer/consumer. This relationship drives the provider to define services from capability, capacity, performance, availability and financial perspectives. Enterprise IT needs to take this same approach in offering services to the business.
Implications: Taking a Service Provider's approach requires a high degree of IT Service Management maturity. IT must have a clear understanding of the service levels they can achieve and must consistently meet these targets. IT must also have a clear understanding of the true cost of providing a service and must be able to communicate to the business the cost of consuming the service. There must be a robust capacity management strategy to ensure demand for the service can be met without disruption and with minimal delay. IT must also have a high fidelity view of the health of the service and have automated management tools to monitor and respond to failing components quickly and proactively so that there is no disruption to the service.
Optimization of Resource Usage
Statement: The cloud should automatically make efficient and effective use of infrastructure resources.
Rationale: Resource optimization drives efficiency and cost reduction and is primarily achieved through resource sharing.
Abstracting the platform from the physical infrastructure enables realization of this principle through shared use of pooled resources. Allowing multiple consumers to share resources results in higher resource utilization and a more efficient and effective use of the infrastructure. Optimization through abstraction enables many of the other principles and ultimately helps drive down costs and improve agility.
Implications: The IT organization providing a service needs to clearly understand the business drivers to ensure appropriate emphasis during design and operations. The level of efficiency and effectiveness will vary depending on time/cost/quality drivers for a cloud. At one extreme, the cloud may be built to minimize cost, in which case the design and operation will maximize efficiency via a high degree of sharing. At the other extreme, the business driver may be agility, in which case the design focuses on the time it takes to respond to changes and will therefore likely trade efficiency for effectiveness.
Take a Holistic Approach to Availability Design
Statement: The availability design for a solution should involve all layers of the stack, employ resilience wherever possible, and remove redundancy that is unnecessary.
Rationale: Traditionally, IT has provided highly available services through a strategy of redundancy. In the event of component failure, a redundant component would be standing by to pick up the workload. Redundancy is often applied at multiple layers of the stack, as each layer does not trust that the layer below will be highly available. This redundancy, particularly at the Infrastructure Layer, comes at a premium price in capital as well as operational costs.
A key principle of a cloud is to provide highly available services through resiliency.
Instead of designing for failure prevention, a cloud design accepts and expects that components will fail, and focuses instead on mitigating the impact of failure and rapidly restoring service when the failure occurs. Through virtualization, real-time detection and automated response to health states, workloads can be moved off the failing infrastructure components, often with no perceived impact on the service.
Implications: Because the cloud focuses on resilience, unexpected failures of infrastructure components (e.g. hosting servers) will occur and will affect the machines hosted on them. Therefore, the consumer needs to expect and plan for machine failures at the application level. In other words, the solution availability design needs to build on top of the cloud resilience and use application-level redundancy and/or resilience to achieve the availability goals. Existing applications may not be good tenants for such an infrastructure, especially those which are stateful and assume a redundant infrastructure. Stateless workloads should cope more favorably, provided that resilience is handled by the application or a load balancer, for example.
Minimize Human Involvement
Statement: The day-to-day operations of a cloud should have minimal human involvement.
Rationale: The resiliency required to run a cloud cannot be achieved without a high degree of automation. When relying on human involvement for the detection of and response to failure conditions, continuous service availability cannot be achieved without a fully redundant infrastructure. Therefore, a fully automated fabric management system must be used to perform operational tasks dynamically, detect and respond automatically to failure conditions in the environment, and elastically add or reduce capacity as workloads require. It is important to note that there is a continuum between manual and automated intervention that must be understood.
A manual process is where all steps require human intervention.
A mechanized process is where some steps are automated, but some human intervention is still required (such as detecting that a process should be initiated, or starting a script). To be truly automated, no aspect of a process, from its detection to the response, should require any human intervention.
Implications: Automated fabric management requires specific architectural patterns to be in place, which are described later in this document. The fabric management system must have an awareness of these architectural patterns, and must also reflect a deep understanding of health. This requires a high degree of customization of any automated workflows in the environment.
Drive Predictability
Statement: A cloud must provide a predictable environment, as the consumer expects consistency in the quality and functionality of the services they consume.
Rationale: Traditionally, IT has often provided unpredictable levels of service quality. This lack of predictability hinders the business from fully realizing the strategic benefit that IT could provide. As public cloud offerings emerge, businesses may choose to utilize public offerings over internal IT in order to achieve greater predictability. Therefore, enterprise IT must provide a predictable service on par with public offerings in order to remain a viable option for businesses to choose.
Implications: For IT to provide predictable services, they must deliver an underlying infrastructure that assures a consistent experience to the hosted workloads. This consistency is achieved through the homogenization of underlying physical servers, network devices, and storage systems. In addition to homogenization of infrastructure, a very high level of IT Service Management maturity is also required to achieve predictability.
Well managed change, configuration and release management processes must be adhered to, and highly effective, highly automated incident and problem management processes must be in place.
Incentivize Desired Behavior
Statement: IT will be more successful in meeting business objectives if the services it offers are defined in a way that incentivizes desired behavior from the service consumer.
Rationale: Most business users, when asked what level of availability they would like for a particular application, will usually ask for 99.999% or even 100% uptime when making a request of IT to deliver a service. This typically stems from a lack of insight into the true cost of delivering the service, on the part of the consumer as well as the IT provider. If the IT provider, for example, were to provide a menu-style set of service classifications where the cost of delivering to requirements such as 99.999% availability was very obvious, there would be an immediate injection of reality into the definition of business needs and hence expectations of IT.
For a different, more technical example, many organizations who have adopted virtualization have found it leads to a new phenomenon of virtual server sprawl, where virtual machines (VMs) were created on demand, but there were no incentives for stopping or removing VMs when they were no longer needed. The perception of infinite capacity may result in consumers using capacity as a replacement for effective workload management. While unlimited capacity may be perceived as an improvement in the quality and agility of a service, used irresponsibly it negatively impacts the cost of the cloud capability. In the case above, the cloud provider wants to incentivize the consumers to use only the resources they need. This could be achieved via billing or reporting on consumption. Encouraging desired consumer behavior is a key principle and is related to the principle of taking a service provider approach.
In the electrical utility example, consumers are encouraged to use less, and are charged a lower multiplier when utilization is below an agreed threshold. If they reach the upper bounds of the threshold, a higher multiplier kicks in as additional resources are consumed.
Implications: The IT organization needs to identify the behavior they want to incentivize. The example above was related to inefficient resource usage; other examples include reducing helpdesk calls (charging per call) and using the right level of redundancy (charging more for higher redundancy). Each requires a mature service management capability; e.g. metering and reporting on usage per business unit, tiered services in the product/service catalog, and a move to a service-provider relationship with the business. The incentives should be defined during the product/service design phase.
Create a Seamless User Experience
Statement: Consumers of an IT service should not encounter anything which disrupts their use of the service as a result of crossing a service provider boundary.
Rationale: IT strategies increasingly look to incorporate services from multiple providers to achieve the most cost-effective solution for a business. As more of the services delivered to consumers are provided by a hybrid of providers, the potential for disruption to consumption increases as business transactions cross provider boundaries. The fact that a composite service being delivered to a consumer is sourced from multiple providers should be completely opaque, and the consumer should experience no break in continuity of usage as a result. An example of this may be a consumer who is using a business portal to access information across their organization, such as the status of a purchase order. They may look at the order through the on-premises order management system and click on a link to more detailed information about the purchaser which is held in a CRM system in a public cloud.
In crossing the boundary between the on-premises system and the public cloud-based system, the user should see no hindrance to their progress which would result in a reduction in productivity. There should be no requests for additional verification, they should encounter a consistent look and feel, and performance should be consistent across the whole experience. These are just a few examples of how this principle should be applied.
Implications: The IT provider needs to identify potential causes of disruption to the activities of consumers across a composite service. Security systems may need to be federated to allow for seamless traversal of systems, data transformation may be required to ensure consistent representation of business records, and styling may need to be applied to give the consumer more confidence that they are working within a consistent environment.
The area where this may have the most implications is in the resolution of incidents raised by consumers. As issues occur, their source may not be immediately obvious and may require complex management across providers until the root cause has been established. The consumer should be oblivious to this combined effort, which goes on behind a single point of contact within the service delivery function.
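The utility-style multiplier billing described under "Incentivize Desired Behavior" can be sketched in a few lines of code. This is a hypothetical illustration, not any product's chargeback engine; the function name, rates and threshold are all invented for the example.

```python
# Hypothetical sketch of tiered, multiplier-based chargeback: usage below an
# agreed threshold is billed at a discounted multiplier, while usage above the
# threshold is billed at a premium multiplier, incentivizing the consumer to
# use only the resources they need. All numbers here are illustrative.

def monthly_charge(units_used: float, threshold: float,
                   base_rate: float = 1.0,
                   low_multiplier: float = 0.8,
                   high_multiplier: float = 1.5) -> float:
    """Bill usage under the threshold at a discount and the excess at a premium."""
    billable_low = min(units_used, threshold)
    billable_high = max(units_used - threshold, 0.0)
    return (billable_low * base_rate * low_multiplier
            + billable_high * base_rate * high_multiplier)

# A consumer who stays under the threshold pays the discounted rate...
print(monthly_charge(80, threshold=100))    # 80 * 0.8 = 64.0
# ...while one who exceeds it pays a premium on the excess.
print(monthly_charge(130, threshold=100))   # 100 * 0.8 + 30 * 1.5 = 125.0
```

The same shape works for the other incentives mentioned (per-call helpdesk charges, redundancy tiers): the provider meters consumption per business unit and applies the pricing that encourages the desired behavior.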
The following concepts are abstractions or strategies that support the principles and facilitate the composition of a cloud. They are guided by and directly support one or more of the principles above. We will touch upon many of these concepts during the presentation. The following goes into each of them in more detail.
Predictability
Traditionally, IT has often provided unpredictable levels of service quality. This lack of predictability hinders the business from fully realizing the strategic benefit that IT could provide. As public cloud offerings emerge, businesses may choose to utilize public offerings over internal IT in order to achieve greater predictability. Enterprise IT must provide a predictable service on par with public offerings in order to remain a viable option for businesses to choose. For IT to provide predictable services, they must deliver an underlying infrastructure that assures a consistent experience to the hosted workloads. This consistency is achieved through the homogenization of underlying physical servers, network devices, and storage systems. In addition to homogenization of the infrastructure, a very high level of IT Service Management maturity is also required to achieve predictability. Well managed change, configuration and release management processes must be adhered to, and highly effective, highly automated incident and problem management processes must be in place.
Favor Resiliency Over Redundancy
In order to achieve the perception of continuous availability, a holistic approach must be taken in the way availability is achieved. Traditionally, availability has been the primary measure of the success of IT service delivery and is defined through service level targets that measure the percentage of uptime.
However, defining service delivery success solely through availability targets creates the false perception of "the more nines the better" and does not account for how much availability the consumers actually need. There are two fundamental assumptions behind using availability as the measure of success: first, that any service outage will be significant enough in length that the consumer will be aware of it; and second, that there will be a significant negative impact to the business every time there is an outage. It is also a reasonable assumption that the longer it takes to restore the service, the greater the impact on the business. There are two main factors that affect availability. The first is reliability, which is measured by Mean-Time-Between-Failures (MTBF); this measures the time between service outages. The second is resiliency, which is measured by Mean-Time-to-Restore-Service (MTRS); MTRS measures the total elapsed time from the start of a service outage to the time the service is restored. (Availability over a period can be expressed as MTBF / (MTBF + MTRS), which is why reducing either factor's counterpart raises availability.) The fact that human intervention is normally required to detect and respond to incidents limits how much MTRS can be reduced. Therefore organizations have traditionally focused on MTBF to achieve availability targets. Achieving higher availability through greater reliability requires increased investment in redundant hardware and an exponential increase in the cost of implementing and maintaining this hardware. Using the holistic approach, a cloud achieves higher levels of availability and resiliency by replacing the traditional model of physical redundancy with software tools. The first tool that helps achieve this is virtualization. It provides a means of abstracting the service from a specific server, thereby increasing its portability. The second tool is the hypervisor.
Technologies provided by the hypervisor can allow either the transparent movement or the restart of the workload on other virtualization hosts, thereby increasing resiliency and availability without any other specialized software running within the workload. The final tool is a health model that allows IT to fully understand hardware health status and automatically respond to failure conditions by migrating services away from the failing hardware. While the compute components no longer require hardware redundancy, the storage components continue to require it. In addition, network components require hardware redundancy to support the needs of the storage systems. While current network and storage requirements prevent the complete elimination of hardware redundancy, significant cost savings can still be gained by removing the compute hardware redundancy. In a traditional data center, the MTRS may average well over an hour, while a cloud can recover from failures in a matter of seconds. Combined with the automation of detection and response to failure and warning states within the infrastructure, this can reduce the MTRS (from the perspective of IaaS) dramatically. Thus a significant increase in resiliency makes the reliability factor much less important. In a cloud, availability (minutes of uptime per year) is no longer the primary measure of the success of IT service delivery. The perception of availability and the business impact of unavailability become the measures of success. This chart illustrates these points.
Homogenization of Physical Hardware
Homogenization of the physical hardware is a key concept for driving predictability. The underlying infrastructure must provide a consistent experience to the hosted workloads in order to achieve predictability. This consistency is attained through the homogenization of the underlying servers, network, and storage.
Abstraction of services from the hardware layer through virtualization makes "server stock-keeping unit (SKU) differentiation" a logical rather than a physical construct, eliminating the need for differentiation at the physical server level. Greater homogenization of compute components results in a greater reduction in variability. This reduction in variability increases the predictability of the infrastructure, which in turn improves service quality.

The ultimate goal is to homogenize the compute, storage, and network layers to the point where there is no differentiation between servers. In other words, every server has the same processor and random access memory (RAM); every server connects to the same storage resource; and every server connects to the same networks. This means that any virtualized service runs and functions identically on any physical server, so it can be relocated from a failing or failed physical server to another physical server seamlessly, without any change in service behavior.

It is understood that full homogenization of the physical infrastructure may not be feasible. While homogenization is the recommended strategy, where it is not possible the compute components should at least be standardized to the fullest extent possible. Whether or not the customer homogenizes their compute components, the model requires them to be homogeneous in their storage and network connections so that a Resource Pool may be created to host virtualized services.

It should be noted that homogenization has the potential to allow for a focused vendor strategy and economies of scale. Without this scale, however, there could be a negative impact on cost, because homogenizing hardware detracts from the buying power that a multi-vendor strategy can facilitate.

Pool Compute Resources
Leveraging a shared pool of compute resources is key to cloud computing.
This Resource Pool is a collection of shared compute, storage, and network resources that create the fabric that hosts virtualized workloads. Subsets of these resources are allocated to customers as needed and, conversely, returned to the pool when they are no longer needed. Ideally, the Resource Pool should be homogeneous. However, as previously mentioned, the realities of a customer's current infrastructure may not facilitate a fully homogenized pool.

Virtualized Infrastructure
Virtualization is the abstraction of hardware components into logical entities. Although virtualization occurs differently in each infrastructure component (server, network, and storage), the benefits are generally the same, including little or no downtime during resource management tasks, enhanced portability, simplified management of resources, and the ability to share resources. Virtualization is the catalyst for the other concepts, such as Elastic Infrastructure, Partitioning of Shared Resources, and Pooling Compute Resources. The virtualization of infrastructure components needs to be seamlessly integrated to provide a fluid infrastructure that is capable of growing and shrinking on demand, and that provides global or partitioned resource pools of each component.

Fabric Management
Fabric is the term applied to the collection of compute, network, and storage resources. Fabric Management is a level of abstraction above virtualization; in the same way that virtualization abstracts physical hardware, Fabric Management abstracts services from specific hypervisors and network switches. Fabric Management can be thought of as an orchestration engine that is responsible for managing the life cycle of a consumer's workload. In a cloud, Fabric Management responds to service requests, Systems Management events, and Service Management policies. Traditionally, servers, network, and storage have been managed separately, often on a project-by-project basis.
To ensure resiliency, a cloud must be able to automatically detect whether a hardware component is operating at diminished capacity or has failed. This requires an understanding of all the hardware components that work together to deliver a service, and of the interrelationships between those components. Fabric Management uses this understanding of interrelationships to determine which services are impacted by a component failure. This enables the Fabric Management system to determine whether an automated response action is needed to prevent an outage, or to quickly restore a failed service onto another host within the fabric. From a provider's point of view, the Fabric Management system is key to determining the amount of Reserve Capacity available and the health of existing fabric resources. It also ensures that services are meeting the defined service levels required by the consumer.

Elastic Infrastructure
The concept of an elastic infrastructure enables the perception of infinite capacity. An elastic infrastructure allows resources to be allocated on demand and, more importantly, returned to the Resource Pool when no longer needed. The ability to scale down when capacity is no longer needed is often overlooked or undervalued, resulting in server sprawl and poorly optimized resource usage. It is important to use consumption-based pricing to incent consumers to be responsible in their resource usage. Automated or customer-request-based triggers determine when compute resources are allocated or reclaimed. Achieving an elastic infrastructure requires close alignment between IT and the business, as peak usage and growth rate patterns need to be well understood and planned for as part of Capacity Management.

Partitioning of Shared Resources
Sharing resources to optimize usage is a key principle; however, it is also important to understand when these shared resources need to be partitioned.
While a fully shared infrastructure may provide the greatest optimization of cost and agility, there may be regulatory requirements, business drivers, or issues of multi-tenancy that require various levels of resource partitioning. Partitioning strategies can occur at many layers, such as physical isolation or network partitioning. Much like redundancy, the lower in the stack this isolation occurs, the more expensive it is. Additional hardware and Reserve Capacity may be needed for partitioning strategies such as the separation of resource pools. Ultimately, the business will need to balance the risks and costs associated with partitioning strategies, and the cloud infrastructure will need the capability to provide a secure method of isolating the infrastructure and network traffic while still benefiting from the optimization of shared resources.

Resource Decay
Treating infrastructure resources as a single Resource Pool allows the infrastructure to experience small hardware failures without significant impact on overall capacity. Traditionally, hardware is serviced using an incident model, where the hardware is fixed or replaced as soon as there is a failure. By leveraging the concept of a Resource Pool, hardware can instead be serviced using a maintenance model. A percentage of the Resource Pool can fail because of "decay" before services are impacted and an incident occurs. Failed resources are replaced on a regular maintenance schedule, or when the Resource Pool reaches a certain threshold of decay, instead of through server-by-server replacement. The Decay Model requires the provider to determine the amount of "decay" they are willing to accept before infrastructure components are replaced. This allows for a more predictable maintenance cycle and reduces the costs associated with urgent component replacement. For example, a customer with a Resource Pool containing 100 servers may determine that up to 3 percent of the Resource Pool may decay before an action is taken.
This means that 3 servers can be completely inoperable before an action is required.

Service Classification
Service classification is an important concept for driving predictability and incenting consumer behavior. Each service class is defined in the provider's service catalog, describing service levels for availability, resiliency, reliability, performance, and cost. Each service must meet the pre-defined requirements for its class. These eligibility requirements reflect the differences in cost when resiliency is handled by the application versus when resiliency is provided by the infrastructure. Classification allows consumers to select the service they consume at the price and quality point that is appropriate for their requirements. It also allows the provider to adopt a standardized approach to delivering a service, which reduces complexity and improves predictability, resulting in a higher level of service delivery.

Cost Transparency
Cost transparency is a fundamental concept in taking a service provider's approach to delivering infrastructure. In a traditional data center, it may not be possible to determine what percentage of a shared resource, such as infrastructure, is consumed by a particular service. This makes benchmarking services against the market an impossible task. By defining the cost of infrastructure through service classification and consumption modeling, a more accurate picture of the true cost of utilizing shared resources can be gained. This allows the business to make fair comparisons of internal services to market offerings and enables informed investment decisions. Cost transparency through service classification will also allow the business to make informed decisions when buying or building new applications.
Applications designed to handle their own redundancy are eligible for the most cost-effective service class and can be delivered at roughly a sixth of the cost of applications that depend on the infrastructure to provide redundancy. Finally, cost transparency incents service owners to think about service retirement. In a traditional data center, services may fall out of use, but often there is no consideration of how to retire an unused service. The cost of ongoing support and maintenance for an under-utilized service may be hidden in the cost model of the data center. In a private cloud, monthly consumption costs for each service can be provided to the business, incenting service owners to retire unused services and reduce their cost.

Consumption-Based Pricing
This is the concept of paying for what you use, as opposed to a fixed cost irrespective of the amount consumed. In a traditional pricing model, the consumer's cost is based on flat costs derived from the capital cost of hardware and software plus the expenses to operate the service. In this model, services may be over- or under-priced relative to actual usage. In a consumption-based pricing model, the consumer's cost reflects their usage more accurately. The unit of consumption is defined in the service class and should reflect, as accurately as possible, the true cost of consuming infrastructure services, the amount of Reserve Capacity needed to ensure continuous availability, and the user behaviors that are being incented.
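To make the contrast concrete, here is a minimal sketch comparing a flat-cost charge with a consumption-based charge. The service class names, unit prices, and usage figures are hypothetical, chosen only to show the mechanics:

```python
# Hypothetical sketch of consumption-based versus flat pricing.
# Service class names, unit prices, and usage numbers are illustrative only.

# Unit of consumption priced per service class (e.g., price per VM-hour).
UNIT_PRICE = {
    "stateless": 0.05,   # application provides its own resiliency
    "stateful": 0.12,    # infrastructure provides resiliency
}

def consumption_charge(service_class, units_consumed):
    """Consumption model: the consumer's cost reflects actual usage."""
    return UNIT_PRICE[service_class] * units_consumed

def flat_charge(capital_cost, operating_cost, consumers):
    """Traditional model: fixed costs split evenly, regardless of usage."""
    return (capital_cost + operating_cost) / consumers

# A light user and a heavy user of the same stateful service class:
light = consumption_charge("stateful", 200)    # 200 VM-hours
heavy = consumption_charge("stateful", 5000)   # 5000 VM-hours
flat = flat_charge(100000, 20000, 100)         # everyone pays the same

print(f"light: {light:.2f}  heavy: {heavy:.2f}  flat: {flat:.2f}")
```

Under the flat model, the light user subsidizes the heavy user; under the consumption model, each consumer's bill tracks their usage, which is exactly the behavior the pricing is meant to incent.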
Security and Identity
Security for the cloud is founded on three paradigms: protected infrastructure, application access, and network access.

Protected infrastructure takes advantage of security and identity technologies to ensure that hosts, information, and applications are secured across all scenarios in the data center, including the physical (on-premises) and virtual (on-premises and cloud) environments.

Application access helps ensure that IT managers can extend vital applications to internal users as well as to important business partners and cloud users.

Network access uses an identity-centric approach to ensure that users, whether they are based in the central office or in remote locations, have more secure access no matter what device they are using. This helps ensure that productivity is maintained and that business gets done the way it should.

Most important from a security standpoint, the secure data center makes use of a common integrated technology to assist users in gaining simple access using a common identity. Management is integrated across physical, virtual, and cloud environments so that businesses can take advantage of all capabilities without the need for significant additional financial investment.

Multitenancy
Multitenancy refers to the ability of the infrastructure to be logically subdivided and provisioned to different organizations or organizational units. The traditional example is a hosting company that provides servers to multiple customer organizations. Increasingly, this model is also being used by centralized IT organizations that provide services to multiple business units within a single organization, treating each as a customer or tenant.
Patterns are specific, reusable ideas that have proven to be solutions to commonly occurring problems. The following section introduces a set of patterns useful for enabling the cloud computing concepts and principles described above. Further guidance on how to use these patterns as part of a design is provided in subsequent documents.

Resource Pooling
The Resource Pool pattern divides resources into partitions for management purposes. Its boundaries are driven by Service Management, Capacity Management, or Systems Management tools. Resource Pools exist for either storage (a Storage Resource Pool) or compute and network (a Compute Resource Pool). This decoupling of resources reflects the fact that storage is consumed at one rate while compute and network are collectively consumed at another.

Service Management Partitions
The Service Architect may choose to differentiate service classifications based on security policies, performance characteristics, or consumer (that is, a Dedicated Resource Pool). Each of these classifications could be a separate Resource Pool.

Systems Management Partitions
Systems Management tools depend on defined boundaries to function. For example, deployment, provisioning, and automated failure recovery (VM movement) depend on the tools knowing which servers are available to host VMs. Resource Pools define these boundaries and allow management tool activities to be automated.

Capacity Management Partitions
To perform Capacity Management, it is necessary to know the total amount of resource available to a datacenter. A Resource Pool can represent the total datacenter compute, storage, and network resources that form an enterprise. Resource Pools allow this capacity to be partitioned; for example, to represent different budgetary requirements or the power capacity of a particular UPS.
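A minimal sketch of the Resource Pool pattern follows, with compute and storage tracked as separate pools (since they are consumed at different rates) and with allocate and return operations. The class and method names are hypothetical, invented for illustration:

```python
# Hypothetical sketch of the Resource Pool pattern. Compute and storage are
# tracked as separate pools because they are consumed at different rates.

class ResourcePool:
    def __init__(self, name, capacity):
        self.name = name
        self.capacity = capacity      # total units in the pool
        self.allocated = 0            # units currently handed to consumers

    def allocate(self, units):
        """Hand a subset of the pool to a consumer, if capacity allows."""
        if self.allocated + units > self.capacity:
            raise RuntimeError(f"{self.name}: insufficient capacity")
        self.allocated += units
        return units

    def release(self, units):
        """Return units to the pool when the consumer no longer needs them."""
        self.allocated = max(0, self.allocated - units)

    @property
    def available(self):
        return self.capacity - self.allocated

# Separate pools reflect the decoupling of storage from compute/network.
compute = ResourcePool("Compute Resource Pool", capacity=100)  # e.g., servers
storage = ResourcePool("Storage Resource Pool", capacity=500)  # e.g., TB

compute.allocate(10)
storage.allocate(50)
print(compute.available, storage.available)   # 90 450
compute.release(10)                           # elastic scale-down
print(compute.available)                      # 100
```

The allocate/release cycle is the mechanical core of the elastic infrastructure concept: resources are granted on demand and, just as importantly, returned to the pool when no longer needed.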
The Resource Pool below represents a pool of servers allocated to a datacenter.

Physical Fault Domain
It is important to understand how a fault impacts the Resource Pool, and therefore the resiliency of the VMs. A datacenter is resilient to small outages such as a single server failure or a local direct-attached storage (DAS) failure. Larger faults have a direct impact on the datacenter's capacity, so it becomes important to understand the impact of a non-server hardware component's failure on the size of the available Resource Pool. To do this, select the hardware component that is most likely to fail and determine how many servers will be impacted by that failure. This defines the Physical Fault Domain pattern: the number of "most-likely-to-fail" components sets the number of Physical Fault Domains.

For example, the figure below represents 10 racks with 10 servers in each rack. Assume that each rack has two network switches and an uninterruptible power supply (UPS), and that the component most likely to fail is the UPS. When the UPS fails, it causes all 10 servers in the rack to fail; those 10 servers therefore constitute a Physical Fault Domain. If the 9 other racks are configured identically, there are a total of 10 Physical Fault Domains.

From a practical perspective, it may not be possible to determine the component with the highest fault rate. In that case, the architect should suggest that the customer begin monitoring failure rates of key hardware components and use the bottom-of-rack UPS as the initial boundary for the Physical Fault Domain.

Upgrade Domain
The Upgrade Domain pattern applies to all three categories of datacenter resources: network, compute, and storage. Although the VM creates an abstraction from the physical server, it does not obviate the need for an occasional update or upgrade of the physical server.
The Upgrade Domain pattern accommodates this without disrupting service delivery by dividing the Resource Pool into small groups called Upgrade Domains. All servers in an Upgrade Domain are maintained simultaneously, and each group is targeted in turn. This allows workloads to be migrated away from an Upgrade Domain during maintenance and migrated back after completion. Ideally, an upgrade would follow this pseudocode algorithm:

For each UpgradeDomain in ResourcePool:
    Free the domain's servers of workloads
    Update hardware
    Reinstall OS
    Return the servers to the Resource Pool
Next

The same concept applies to the network. Because the datacenter design is based on a redundant network infrastructure, one Upgrade Domain could be created for all primary switches (or a subset of them) and another for the secondary switches (or a subset). The same applies to the storage network.

Reserve Capacity
The advantage of a homogenized, Resource Pool-based approach is that all VMs run the same way on any server in the pool. This means that during a fault, any VM can be relocated to any physical host, as long as there is capacity available for that VM. Determining how much capacity needs to be reserved is an important part of designing a private cloud. The Reserve Capacity pattern combines the concept of Resource Decay with the Fault Domain and Upgrade Domain patterns to determine the amount of Reserve Capacity a Resource Pool should maintain. To compute Reserve Capacity, define the following:

TotalServers = the total number of servers in a Resource Pool
ServersInFD = the number of servers in a Fault Domain
ServersInUD = the number of servers in an Upgrade Domain
ServersInDecay = the maximum number of servers that can decay before maintenance

The formula is:

Reserve Capacity = (ServersInFD + ServersInUD + ServersInDecay) / TotalServers

This formula makes a few assumptions. First, it assumes that only one Fault Domain will fail at a time.
A customer may elect to base their Reserve Capacity on the assumption that more than one Fault Domain may fail simultaneously; however, this leaves more capacity unused. If we agree to plan for only one Fault Domain failure, the formula assumes that the failure of multiple Fault Domains will trigger the Disaster Recovery plan rather than the Fault Management plan. Second, it assumes a worst case in which a Fault Domain fails while some servers are at maximum decay and other servers are down for upgrade. Finally, it assumes no oversubscription of capacity.

In the formula, the number of servers in the Fault Domain is a constant. The number of servers allowed to decay and the number of servers in an Upgrade Domain are variables determined by the architect. The architect must balance the Reserve Capacity, because too much Reserve Capacity leads to poor utilization. If an Upgrade Domain is too large, the Reserve Capacity will be high; if it is too small, upgrades will take longer to cycle through the Resource Pool. Too small a decay percentage is unrealistic and may require frequent maintenance of the Resource Pool, while too large a decay percentage means that the Reserve Capacity will be high. There is no "correct" answer to the question of Reserve Capacity; it is the architect's job to determine what is most important to the customer and tailor the Reserve Capacity accordingly.

Calculating Reserve Capacity based on the example so far, the numbers would be:

TotalServers = 100
ServersInFD = 10
ServersInUD = 2
ServersInDecay = 3
Reserve Capacity = (10 + 2 + 3) / 100 = 15%

The figure below illustrates the allocation of 15 percent of the Resource Pool for Reserve Capacity.

Scale Unit
At some point, the amount of capacity used will approach the total available capacity (where available capacity equals total capacity minus Reserve Capacity), and new capacity will need to be added to the datacenter.
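Before moving on, the Reserve Capacity calculation above can be reproduced in a short sketch, using the variable names defined earlier and the values from the worked example:

```python
# Reserve Capacity = (ServersInFD + ServersInUD + ServersInDecay) / TotalServers
# Values from the worked example: 100 servers, Fault Domains of 10 servers,
# Upgrade Domains of 2 servers, and a 3-server decay allowance.

def reserve_capacity(total_servers, servers_in_fd, servers_in_ud, servers_in_decay):
    """Fraction of the Resource Pool to hold back, assuming a single
    Fault Domain failure while some servers are at maximum decay and an
    Upgrade Domain is out for maintenance, with no oversubscription."""
    return (servers_in_fd + servers_in_ud + servers_in_decay) / total_servers

rc = reserve_capacity(total_servers=100, servers_in_fd=10,
                      servers_in_ud=2, servers_in_decay=3)
print(f"Reserve Capacity = {rc:.0%}")   # Reserve Capacity = 15%
```

Note that the three terms are summed before dividing; reading the formula left to right with ordinary operator precedence would give the wrong result.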
Ideally, the architect will want to increase the size of the Resource Pool in standardized increments, with known environmental requirements (such as space, power, and cooling), known procurement lead times, and standardized engineering (such as racking, cabling, and configuration). Further, this additional capacity needs to balance accommodating growth against leaving too much capacity unutilized. To do this, the architect can leverage the Scale Unit pattern. A Scale Unit is a standardized unit of capacity that is added to a datacenter. There are two types of Scale Unit: a Compute Scale Unit, which includes servers and network, and a Storage Scale Unit, which includes storage components. Scale Units increase capacity in a predictable, consistent way, allow standardized designs, and enable capacity modeling. Much like Reserve Capacity, Scale Unit sizing is left to the architect.

Capacity Plan
The Capacity Plan pattern combines the infrastructure patterns described above with business demand to ensure that the perception of infinite capacity can be met. The capacity plan cannot be built by IT alone; it must be built, and regularly reviewed and revised, in conjunction with the business. The capacity plan must account for the peak capacity requirements of the business, such as the holiday shopping season for an online retailer. It must account for typical as well as accelerated growth patterns, such as business expansion, mergers and acquisitions, and the development of new markets. It must also account for currently available capacity and define triggers for when the procurement of additional Scale Units should be initiated.
These triggers should be defined by the amount of capacity each Scale Unit provides and the lead time required for purchasing, obtaining, and installing a Scale Unit. The requirements for a well-designed capacity plan cannot be achieved without a high degree of IT Service Management maturity and a close alignment between the business and IT.

Health Model
To ensure resiliency, a datacenter must be able to automatically detect whether a hardware component is operating at diminished capacity or has failed. This requires an understanding of all the hardware components that work together to deliver a service, and of the interrelationships between those components. The Health Model pattern captures these interrelationships, enabling the management layer to determine which VMs are impacted by a hardware component failure and allowing the datacenter management system to decide whether an automated response action is needed to prevent an outage, or to quickly restore a failed VM onto another system. From a broader perspective, the management system needs to classify a failure as Resource Decay, a Physical Fault Domain failure, or a Broad Failure that requires the system to trigger the disaster recovery response.

When creating the Health Model, it is important to consider the connections between systems, including connections to power, network, and storage components. The architect also needs to consider data access when considering interconnections between systems. For example, if a server cannot connect to the correct Logical Unit Number (LUN), the service may fail or operate at diminished capacity. Finally, the architect needs to understand how diminished performance might impact the system. For example, if the network is saturated (say, usage is greater than 80 percent), there may be an impact on performance that requires the management system to move workloads to new hosts.
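The classification step just described (Resource Decay versus Physical Fault Domain failure versus Broad Failure) can be sketched as a simple decision function. This is a deliberate simplification that classifies by failure count alone; the function name and thresholds are hypothetical:

```python
# Hypothetical sketch of the management system's failure classification.
# A failure is treated as ordinary decay, a Fault Domain failure, or a
# broad failure that should trigger the disaster recovery response.

def classify_failure(failed_servers, fault_domain_size, decay_allowance):
    if failed_servers <= decay_allowance:
        return "resource decay"          # absorb via the maintenance model
    if failed_servers <= fault_domain_size:
        return "fault domain failure"    # migrate/restart workloads
    return "broad failure"               # trigger disaster recovery

# Using the running example: 10-server Fault Domains, 3-server decay allowance.
print(classify_failure(2, 10, 3))    # resource decay
print(classify_failure(10, 10, 3))   # fault domain failure
print(classify_failure(25, 10, 3))   # broad failure
```

A production health model would classify on correlated component state (power, network, storage paths) rather than raw counts, but the branching structure is the same: each class of failure maps to a different automated response.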
It is important to understand how to proactively determine both healthy and failed states in a predictable way. The diagrams below show typical system interconnections and demonstrate how the Health Model pattern is used to provide resiliency. In this case, power is a single point of failure, while the network connections and the Fibre Channel connections to the Storage Area Network (SAN) are redundant. When "UPS A" fails, it causes a loss of power to Servers 1-4. It also causes a loss of power to "Network A" and "Fibre Channel A", but because the network and Fibre Channel are redundant, only one Fault Domain fails; the other is diminished, as it loses its redundancy. The management system detects the Fault Domain failure and migrates or restarts workloads on functioning Physical Fault Domains.

While the concept of a health model is not unique, its importance becomes even more critical in a cloud datacenter. To achieve the necessary resiliency, failure states (an indication that a failure has occurred) and warn states (an indication that a failure may soon occur) need to be thoroughly understood for the cloud infrastructure. The Detect and Respond scenario for each state also needs to be understood, documented, and automated. Only then can the benefits of resiliency be fully realized. This dynamic infrastructure, which can automatically move workloads around the fabric in response to health warning states, is only the first step toward dynamic IT. As applications are designed for greater resiliency, they too should have robust, high-fidelity health models, and they should provide the service monitoring toolset with the information needed to detect and respond to health warning states at the application layer as well.

Service Class
Service Class patterns are useful in describing how different applications interact with the cloud platform infrastructure.
While each environment may present unique criteria for its service class definitions, in general three Service Class patterns describe most application behaviors and dependencies.

The first Service Class pattern is designed for stateless applications, where the application itself is responsible for providing redundancy and resiliency. For this pattern, redundancy at the infrastructure layer is reduced to an absolute minimum, making this the least costly Service Class pattern.

The next Service Class pattern is designed for stateful applications. Some redundancy is still required at the infrastructure layer, and resiliency is handled through Live Migration. The cost of providing this service class is higher because of the additional hardware required for redundancy.

The last, and most expensive, Service Class pattern is for applications that are incompatible with a fabric approach to infrastructure. These applications cannot be hosted in a dynamic datacenter and must be provided using traditional data center designs.

Cost Model
Cost Model patterns reflect the cost of providing services in the cloud and the consumer behavior the provider wishes to encourage. These patterns should account for the deployment, operations, and maintenance costs of delivering each service class, as well as the capacity plan requirements for peak usage and future growth. Cost Model patterns must also define the units of consumption, which will likely incorporate some measurement of the compute, storage, and network provided to each workload by Service Class. This can then be used as part of a consumption-based charge model. Organizations that do not use a chargeback model to pay for IT services should still use units of consumption as part of notional charging.
(Notional charging is where consumers are made aware of the cost of the services they consume without actually being billed for them.) The cost model encourages desired behavior in two ways. First, by charging (or notionally charging) consumers based on the unit of consumption, they will likely request only the amount of resources they need; if they need to temporarily scale up their consumption, they will likely return the extra resources when they are no longer needed. Second, by leveraging different cost models based on service class, the business is encouraged to build or buy applications that qualify for the most cost-effective service class wherever possible.
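A sketch of notional charging follows: consumption is metered and priced per service class, and a report is produced for the business without any actual billing. The service names, rates, and usage figures are illustrative only; the 1:6 rate spread between the cheapest and most expensive classes echoes the cost difference noted earlier:

```python
# Hypothetical sketch of notional charging: consumers see the cost of the
# services they consumed, by service class, without actually being billed.

# Per-unit rates by service class; the stateless class is cheapest because
# the application, not the infrastructure, provides redundancy.
RATES = {"stateless": 1.0, "stateful": 3.0, "traditional": 6.0}

usage = [
    ("payroll",   "traditional", 100),  # (service, class, units consumed)
    ("web-store", "stateless",   400),
    ("reporting", "stateful",    150),
]

def notional_report(usage):
    """Price each service's consumption; report only, no billing."""
    return [(service, svc_class, RATES[svc_class] * units)
            for service, svc_class, units in usage]

for service, svc_class, cost in notional_report(usage):
    print(f"{service}: {svc_class} class, notional cost {cost:.2f}")
```

Even without chargeback, a report like this gives service owners visibility into what their services cost, which is what incents them to pick cheaper service classes and to retire services that are no longer used.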
The journey to the cloud enables you to create the data center you want. It's likely that your data center today is not what you would have built if you'd had the choice. If you are like most data center operators and administrators, your data center is something that sort of "grew that way" and ended up being a mix of best practices, OK practices, and not-very-good practices. There are a number of reasons why this happened, but why it happened doesn't matter at this point. The good news is that private cloud enables you to "start over" and create the data center you would create if you could design it the way you wanted.

Because private cloud provides you an opportunity to architect a new solution using what might be considered a new paradigm of datacenter computing, there are several things to keep in mind:

First, what analysts and others considered to be best practices in a traditional datacenter might not apply to a private cloud environment.

Second, remember that analysts are human and can be as wrong about something, including best practices, as anyone else.

Third, just because you always did something a certain way doesn't mean that it will apply to your private cloud. You might have to adjust your way of thinking to get the most out of private cloud core capabilities.

Last, just because auditors said you had to do something doesn't mean they are right. They are not likely up to speed on private cloud principles, concepts, and patterns, and you'll need to challenge them more often in the future than you do now.
Do you spend a lot of money on air conditioning? Most data center operators do! Then someone thought, "Hey, they use passive air cooling to cool buildings; maybe this will work with a datacenter." So they tried it, and it worked! No one said this was a best practice, but with a little thinking outside the conventional, these datacenter architects and designers ended up saving thousands of dollars a year on air conditioning costs. Sure, this solution will only work in a limited number of climates, but it's up to the architect to understand the existing options and constraints and propose a design that will work for the existing conditions.
That's right: you've got competition, and that competition is the public cloud. Public cloud providers will be able to provide many of the same services (or at least claim they can) at a price that is comparable to or lower than what it costs you to provide the same services to your organization. That means you're going to need to understand what it costs to run your data center, where those costs come from, and how you can reduce those costs by increasing efficiency and increasing your value. This is going to require you to increase your level of service management maturity and actually start realizing the principles and concepts proposed by MOF and ITIL. Your entire approach to service management is going to need to change from a primarily reactive, break/fix one to a proactive, resilient, and automated one, which includes robust monitoring and reporting along with transparent usage and pricing models.
It’s important to realize that virtualization is not private cloud. Virtualization is a key enabler for private cloud (although not a requirement) but virtualization is not private cloud.In fact, without a well architected and planned approach, virtualization can actually reduce the overall level of service. What are the reasons for this?Everyone who’s been part of a significant server virtualization project has had to deal with the problem of VM sprawl. Virtualization was thought to be a solution for the old problem of “server sprawl” but the server sprawl was replaced with virtualization sprawl. This led to greater data center complexity because IT started standing up servers and services without feeling the constraints of physical hardware. The result was even more virtual servers than they had previous physical servers! The more systems on your network, the more complex that network becomes and the harder it is to manage and maintain.Because of this complexity, the management of the network become more reactive. One of the reasons for the increase in reactiveness is that admins were not fully trained in virtualization and even if fully trained in virtualization, they hadn’t the long experience with it compared to the experience they had with physical networks. This made it harder for them to determine what they needed to do so that they could be more proactive and less reactive. Because they were reactive, overall service level can go down because of the increased downtime.You also need the right monitoring tools. The monitoring tools that worked in the physical data center might not be as effective in a virtualized datacenter. The end result is that the mean time to restore service went up, which increased downtime and created an overall reduction of quality of service. As you can see, virtualization by itself can create a situation where you end up being less effective and efficient. 
However, when virtualization is used as an enabler for the private cloud, it can end up as one of the critical components required to increase service quality and reduce the time to service restoration.
A private cloud benefits from homogeneity of hardware and software. This enables you to benefit from repeatable patterns that represent best practices – repeatable patterns that let you simplify the management of what appears to be a very complex system. Homogeneity also enables:

Predictable performance – by using the same hardware and software mechanisms throughout the cloud infrastructure, you can baseline current capacity, understand the incremental improvements gained by adding predefined amounts of capacity in the future, and have a clear understanding of what the performance gains will be.

Reduced cost of acquisition – due to cloud scale, you can take advantage of homogeneity to buy in large quantities and come to agreements with vendors regarding the size and frequency of future purchases, because a homogeneous infrastructure lets you predict your growth patterns with greater accuracy.
Traditionally, we've worked toward increasing uptime and service quality by using hardware redundancy – redundant power supplies, redundant UPSs, redundant everything we can think of – all with an eye to the magic "five nines." However, that redundancy comes with a price:

It costs a lot of money to purchase – this represents capital expense.
It also costs a lot of money to maintain – it costs to power and to service the hardware.

Part of the reason this redundancy is required is that stateful applications are intimately tied to the hardware on which they run. Stateful applications were designed for the traditional datacenter, where there was a tight association between the applications and the hardware they depend on. Virtualization at all levels of the private cloud infrastructure decouples these tight connections between applications and hardware and moves us to the private cloud deployments of the 21st century.
When rethinking the value of redundancy, you should think about what you're getting and what it costs to get it. That way, you can build in the right amount of redundancy for your data center and keep the level of service you and your customers/consumers require. Let's look at three models for redundancy:

The "Economy Class" level. Here we have a rack of 10 servers that host 100 virtual machines. There is no UPS and no generator; power comes straight from the utility. The result is that you get the same level of service the utility provides – an availability level of three nines. This availability value indicates a nearly 100% probability of a complete failure in that rack over the course of five years.

The "Business Class" level. Here we have a rack of 10 servers that host 100 virtual machines. There is one UPS, with or without a generator. Without a generator, availability is three nines, and the probability of failure over the course of five years is 87%. With a generator, availability is still three nines, but the probability of failure in five years drops to about 50%. A lot better than economy class, but more expensive.

Then there is the "Gulfstream Class" (a Gulfstream is a model of private jet). Here you have N+1 UPSs and N+1 generators. Availability goes up significantly, to six nines, and the probability of failure in five years is less than one percent. But it costs three times or more what economy class costs.

The point here isn't that redundancy won't increase your uptime – we know it will. But at what cost? What is the cost of downtime for the services versus the cost of the infrastructure? And an even better question to ask: is there a way to increase service uptime without incurring the costs of redundancy? This is where the principle of resiliency comes in.
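To make the nines concrete, here is a small Python sketch (not from the original deck) that converts an availability figure into expected downtime per year. The conversion is simple arithmetic: three nines works out to roughly 500 minutes of downtime a year, while six nines is under a minute.

```python
# Minimal sketch: the expected downtime implied by an availability figure.
# The two availability values mirror the example above; the conversion is
# plain arithmetic, not a vendor formula.
MINUTES_PER_YEAR = 365.25 * 24 * 60  # ~525,960 minutes

def annual_downtime_minutes(availability: float) -> float:
    """Expected unavailable minutes per year at a given availability."""
    return (1.0 - availability) * MINUTES_PER_YEAR

for label, a in [("three nines (0.999)", 0.999),
                 ("six nines (0.999999)", 0.999999)]:
    print(f"{label}: ~{annual_downtime_minutes(a):.1f} minutes/year")
```

The gap between the two numbers is what you are really buying with each additional UPS and generator – which is why the cost question matters so much.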
Redundancy is all about preventing failure. Since you can't ever prevent failure entirely, the best you can do is increase the time until the inevitable failure occurs. Redundancy therefore works hard to increase the Mean Time Between Failures (MTBF).

In contrast, we can use software-based systems to create resiliency. The idea behind software-based resiliency is that it expects and plans for failure, taking advantage of software routines and non-redundant hardware to speed the time to service restoration. While the number of failures is higher in a non-redundant system, the time to restore service is much shorter. Therefore, even though there are more failures in a resilient system, the total time the service is unavailable is much less.
So how do we reach a state of resiliency without redundancy? Through fabric management. With fabric management:
The goal is focused on avoiding service disruption
Fault detection and response are automated
Fail often and recover fast!
A private cloud benefits when users are good stewards of the cloud infrastructure made available to them. What we want to do is incent the users of the cloud to think about the amount of resources they need and be aware of the associated costs. In effect, we want them to be "data center environmentalists." A private cloud reaches increased efficiency and effectiveness when:

Users are aware that they pay only for what they use. For those who believe they "need" five nines, let them know what it costs so they can decide whether they are willing to bear those costs. They will need to determine whether the cost of lost service justifies the cost of paying for five nines. Maybe three nines ends up being the point where the cost is less than the cost of service downtime.

Users can release what they don't need anymore. Make release policies easy for self-service users, so that when they are done with services, the resources are released back into the resource pool. This is a key feature you should make available in your service catalogs, and you should make the advantages of these service catalog options clear.

Developers in your business units are educated to write applications that take advantage of cloud attributes. One key consideration is to create stateless applications that are not tied to any specific hardware component of the system. Cloud-enabled applications have only loose associations with the enabling hardware. This makes the applications "portable" – they can easily be moved anywhere in the cloud infrastructure without any effect on service delivery.
Most of us like to think about solutions in terms of software – it's often the first place we go when we think about problems and solutions. However, hardware fails too, and hardware failure is not rare. The traditional data center is not as good as it should be at gaining visibility into the health of the hardware infrastructure that powers the software solutions. To obtain the advantages of private cloud:

We need to understand that there is more to providing services than just software.
We need a deep understanding of hardware interactions and dependencies with the software.
We need to know what a healthy state of the entire system looks like.
Conversely, we need to know what an unhealthy state of the system looks like.
We need to understand the difference between a "failure" state and a "disaster" state – failure triggers automated procedures that provide resiliency, while disaster triggers a Disaster Recovery response.
The fabric management system needs to know how to detect and respond to failure and disaster states based on these understandings.
Understanding the difference between failure and disaster is crucial, because there are significant differences in how you respond to each of these states. We need to:

Understand how failures affect the service as a whole – do they create a situation of lower performance, a temporary service disruption, or a prolonged outage? At what point does a failure turn into a disaster? Is it a matter of how long the failure is anticipated to last? Is it related to which hardware or software components have failed? Is it a combination of the two? You need to assess your hardware and software systems to make these decisions.

Create a dependency tree – once you see this tree, you will have a good idea of the interrelationships between the software and hardware components, and it will be easier to assess where failure and disaster states will occur.

Once you understand the difference between failure and disaster, you can define the parameters that represent each and then create automated responses. The failure responses enable service resiliency, while the disaster responses trigger disaster recovery.
Throughout this presentation I've talked about automation. In the private cloud, automation is ubiquitous. Automation is different from manual and mechanized processes:

Manual processes require someone to perform each and every step of the solution. They are slow and error prone, because people make mistakes.

Mechanized processes are faster than manual processes, and more predictable and less error prone, because an operator typically fires off a script. The script has been tested and validated, so the steps it performs are not exposed to human error.

Automation seeks to eliminate human intervention altogether. Monitoring, reporting, alerting, detection, and response are all automated. The goal of automation is for every activity that can be automated to be automated. Only those activities that cannot be handled through programmatic controls (such as racking and stacking) require human intervention.
There are several key components of a private cloud infrastructure as a service deployment:
The resource pool
The scale unit
The fault domain
The upgrade domain
Resource decay
Let's look more closely at each of these.
A resource pool is the collection of server, networking, and storage resources available to the private cloud.
At some point, the amount of capacity used will begin to approach the total available capacity (where available capacity equals total capacity minus Reserve Capacity – we'll cover Reserve Capacity later), and new capacity will need to be added to the datacenter. Ideally, the architect will want to increase the size of the Resource Pool in standardized increments, with known environmental requirements (such as space, power, and cooling), known procurement lead time, and standardized engineering (such as racking, cabling, and configuration). Further, this additional capacity needs to strike a balance between accommodating growth and not leaving too much capacity unutilized. To do this, the architect will want to leverage the Scale Unit pattern. A Scale Unit represents a standardized unit of capacity that is added to a datacenter. There are two types of Scale Unit: a Compute Scale Unit, which includes servers and network, and a Storage Scale Unit, which includes storage components. Scale Units increase capacity in a predictable, consistent way, allow standardized designs, and enable capacity modeling. Much like Reserve Capacity, Scale Unit sizing is left to the architect.
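As a sketch of how this purchasing trigger might be automated (the function name and numbers are my own illustration, not from the deck), the decision is simply that projected usage must fit within available capacity – total capacity minus the Reserve Capacity fraction:

```python
def scale_units_needed(total_servers: int, reserve_fraction: float,
                       projected_usage: int, scale_unit_size: int) -> int:
    """How many standardized Scale Units must be added so that projected
    usage fits within available (non-reserve) capacity."""
    units = 0
    # Add Scale Units until usage fits under the available-capacity line.
    while projected_usage > (total_servers + units * scale_unit_size) * (1 - reserve_fraction):
        units += 1
    return units

# 100 servers with a 15% reserve leaves 85 servers of available capacity;
# a projected load of 90 servers calls for one 10-server Scale Unit.
print(scale_units_needed(100, 0.15, 90, 10))
```

Because the Scale Unit is a fixed, standardized increment, the answer is always a whole number of units – which is exactly what makes procurement and capacity modeling predictable.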
Treating infrastructure resources as a single Resource Pool allows the infrastructure to experience small hardware failures without significant impact on the overall capacity. Traditionally, hardware is serviced using an incident model, where the hardware is fixed or replaced as soon as there is a failure. By leveraging the concept of a Resource Pool, hardware can be serviced using a maintenance model. A percentage of the Resource Pool can fail because of "decay" before services are impacted and an incident occurs. Failed resources are replaced on a regular maintenance schedule, or when the Resource Pool reaches a certain threshold of decay, instead of a server-by-server replacement.

The Decay Model requires the provider to determine the amount of "decay" they are willing to accept before infrastructure components are replaced. This allows for a more predictable maintenance cycle and reduces the costs associated with urgent component replacement. For example, a customer with a Resource Pool containing 100 servers may determine that up to 3 percent of the Resource Pool may decay before an action is taken. This means that 3 servers can be completely inoperable before action is required.
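A minimal sketch of the Decay Model's trigger logic (the function and its signature are assumptions for illustration, not part of any product):

```python
def maintenance_due(pool_size: int, failed_servers: int,
                    decay_threshold: float) -> bool:
    """True once the decayed fraction of the pool exceeds the agreed
    threshold, scheduling a maintenance cycle instead of an incident."""
    return failed_servers / pool_size > decay_threshold

# 100-server pool with 3% acceptable decay: three dead servers are
# tolerated; the fourth failure exceeds the threshold.
print(maintenance_due(100, 3, 0.03))  # False: still within tolerated decay
print(maintenance_due(100, 4, 0.03))  # True: maintenance cycle triggered
```

The point of the threshold is economic: one scheduled truck roll replacing several servers is cheaper and more predictable than an urgent dispatch for each individual failure.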
It is important to understand how a fault impacts the Resource Pool, and therefore the resiliency of the VMs. A datacenter is resilient to small outages such as single server failure or local direct-attached storage (DAS) failure. Larger faults have a direct impact on the datacenter’s capacity so it becomes important to understand the impact of a non-server hardware component’s failure on the size of the available Resource Pool. To understand the failure rate of the key hardware components, select the component that is most likely to fail and determine how many servers will be impacted by that failure. This defines the pattern of the Physical Fault Domain. The number of “most-likely-to-fail” components sets the number of Physical Fault Domains. For example, the figure represents 10 racks with 10 servers in each rack. Assume that the racks have two network switches and an uninterruptible power supply (UPS). Also assume that the component most likely to fail is the UPS. When that UPS fails, it will cause all 10 servers in the rack to fail. In this case, those 10 servers become the Physical Fault Domain. If we assume that there are 9 other racks configured identically, then there are a total of 10 Physical Fault Domains. From a practical perspective, it may not be possible to determine the component with the highest fault rate. Therefore, the architect should suggest that the customer begin monitoring failure rates of key hardware components and use the bottom-of-rack UPS as the initial boundary for the Physical Fault Domain.
The Upgrade Domain pattern applies to all three categories of datacenter resources: network, compute, and storage. Although the VM creates an abstraction from the physical server, it doesn't obviate the need for occasional updates or upgrades of the physical server. The Upgrade Domain pattern accommodates this without disrupting service delivery by dividing the Resource Pool into small groups called Upgrade Domains. All servers in an Upgrade Domain are maintained simultaneously, and each group is targeted in turn. This allows workloads to be migrated away from the Upgrade Domain during maintenance. Ideally, an upgrade would follow this pseudocode:

For each UpgradeDomain in ResourcePool:
    Free servers from workloads
    Update hardware
    Reinstall OS
    Return servers to Resource Pool
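The pseudocode can be fleshed out as a runnable Python sketch. The four helper operations are placeholders for whatever fabric-management tooling you actually use (live migration, bare-metal deployment, and so on); here they just record what they would do.

```python
log = []  # records the operations the placeholder helpers would perform

def migrate_workloads_away(server): log.append(("drain", server))
def update_hardware(server):        log.append(("update", server))
def reinstall_os(server):           log.append(("reinstall", server))
def return_to_pool(server):         log.append(("return", server))

def upgrade_resource_pool(upgrade_domains):
    """Walk each Upgrade Domain in turn: drain it, service every server
    in the domain simultaneously, then return the capacity to the pool."""
    for domain in upgrade_domains:
        for server in domain:
            migrate_workloads_away(server)  # free the domain from workloads
        for server in domain:
            update_hardware(server)
            reinstall_os(server)
            return_to_pool(server)

# Two hypothetical two-server Upgrade Domains.
upgrade_resource_pool([["host-01", "host-02"], ["host-03", "host-04"]])
print(len(log))  # 16: four operations for each of the four hosts
```

Draining the whole domain before servicing it is what keeps the maintenance invisible to tenants: their VMs are already running elsewhere in the Resource Pool when the hosts go down.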
Now let's think about a worst-case fault scenario – the worst things might get before there is a noticeable impact on service delivery and performance. In this scenario, the following three events or conditions take place simultaneously:
The maximum number of servers is in decay before a maintenance cycle
An upgrade is in progress, so the servers in one upgrade domain are out of service
An entire fault domain is lost
To prevent a significant decrement in service, we need resources that can make up for the lost resources.
This is where the Reserve Capacity pattern comes into play. The advantage of a homogenized, Resource Pool-based approach is that all VMs will run the same way on any server in the pool. This means that during a fault, any VM can be relocated to any physical host, as long as there is capacity available for that VM. Determining how much capacity to reserve is an important part of designing a private cloud. The Reserve Capacity pattern combines the concept of resource decay with the Fault Domain and Upgrade Domain patterns to determine the amount of Reserve Capacity a Resource Pool should maintain. To compute Reserve Capacity, assume the following:

TotalServers = the total number of servers in a Resource Pool
ServersInFD = the number of servers in a Fault Domain
ServersInUD = the number of servers in an Upgrade Domain
ServersInDecay = the maximum number of servers that can decay before maintenance

The formula is:

Reserve Capacity = (ServersInFD + ServersInUD + ServersInDecay) / TotalServers

This formula makes a few assumptions:

It assumes that only one Fault Domain will fail at a time. A customer may elect to base their Reserve Capacity on the assumption that more than one Fault Domain may fail simultaneously; however, this leaves more capacity unused. And if we plan for only one Fault Domain, it assumes that failure of multiple Fault Domains will trigger the Disaster Recovery plan rather than the Fault Management plan.
It assumes the worst case: a Fault Domain fails while some servers are at maximum decay and others are down for upgrade.
Finally, it assumes no oversubscription of capacity.

In the formula, the number of servers in the Fault Domain is a constant. The number of servers allowed to decay and the number of servers in an Upgrade Domain are variables determined by the architect. The architect must balance the Reserve Capacity, because too much Reserve Capacity leads to poor utilization.
If an Upgrade Domain is too large, the Reserve Capacity will be high; if it is too small, upgrades will take longer to cycle through the Resource Pool. Too small a decay percentage is unrealistic and may require frequent maintenance of the Resource Pool, while too large a decay percentage means that the Reserve Capacity will be high. There is no "correct" answer to the question of Reserve Capacity. It is the architect's job to determine what is most important to the customer and tailor the Reserve Capacity accordingly. Calculating Reserve Capacity based on the example so far:

TotalServers = 100
ServersInFD = 10
ServersInUD = 2
ServersInDecay = 3
Reserve Capacity = (10 + 2 + 3) / 100 = 15%
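The worked example translates directly into code; this short sketch simply restates the Reserve Capacity formula from the text:

```python
def reserve_capacity(total_servers: int, servers_in_fd: int,
                     servers_in_ud: int, servers_in_decay: int) -> float:
    """Reserve Capacity = (ServersInFD + ServersInUD + ServersInDecay)
    / TotalServers."""
    return (servers_in_fd + servers_in_ud + servers_in_decay) / total_servers

# The example: 100 servers, a 10-server Fault Domain, a 2-server Upgrade
# Domain, and 3 servers allowed to decay before maintenance.
rc = reserve_capacity(100, servers_in_fd=10, servers_in_ud=2, servers_in_decay=3)
print(f"Reserve Capacity = {rc:.0%}")  # Reserve Capacity = 15%
```

Changing the Upgrade Domain size or the decay allowance and re-running the calculation is exactly the balancing exercise the architect performs when tuning Reserve Capacity against utilization.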
So, now what do you think of architectural principles, concepts, and patterns?Should I hide my face?Do you think they are scary?Should you protect yourself from them?Or do you LOVE THEM!
I welcome all of you to take this presentation and re-present it. There are lots of speaker's notes to help you. Improve it! Get the word out that before you begin to build your private cloud, you need to understand the core principles, concepts, and patterns of successful private clouds.
The Private Cloud, Principles, Patterns and Concepts
Deployment Models: Private Cloud, Community, Public Cloud, Hybrid Clouds
Service Models: Infrastructure as a Service (IaaS), Platform as a Service (PaaS), Software as a Service (SaaS)
Essential Characteristics: On Demand Self-Service, Broad Network Access, Rapid Elasticity, Resource Pooling, Measured Service
Common Characteristics: Massive Scale, Resilient Computing, Homogeneity, Geographic Distribution, Virtualization, Service Orientation, Low Cost Software, Advanced Security
Principles provide general rules and guidelines to support the evolution of a cloud infrastructure. They are enduring, seldom amended, and inform and support the way a cloud fulfills its mission. They strive to be compelling and aspirational. These principles form the basis on which a cloud infrastructure is planned, designed, and created:
Business Value
Continuous Improvement
Perception of Infinite Capacity
Perception of Continuous Availability
Service Provider's Approach
Optimize Resource Utilization
Holistic Approach to Availability
Ubiquitous Automation
Drive Predictability
Incentivize Desired Behavior
Create a Seamless User Experience
Concepts are abstractions or strategies that support the principles and facilitate the composition of a cloud. They are guided by and directly support one or more of the principles:
Predictability
Resiliency over Redundancy
Homogenized Hardware
Pool Compute Resources
Virtualized Infrastructure
Fabric Management
Elastic Infrastructure
Partition Shared Resources
Resource Decay
Service Classification
Cost Transparency
Consumption Based Pricing
Security and Identity
Multitenancy
Patterns are specific, reusable ideas that have been proven solutions to commonly occurring problems. Patterns are useful for enabling the cloud computing concepts and principles:
Resource Pooling
Physical Fault Domain
Upgrade Domain
Reserve Capacity
Scale Unit
Capacity Plan
Health Model
Service Class
Cost Model
You've got competition
Requires service management maturity
Approach to service management needs to change
Can Reduce Quality of Service:
Greater Complexity
More Reactive
Requires Right Monitoring Tools
MTRS Goes Up
Simplicity is Elegance:
Drives predictable performance
Reduces cost of acquisition
Helps with predicting time for new acquisition
Redundancy comes with a price:
Capital Expense
Operational Expense (power, maintenance)
Driven by stateful applications – stateful applications increase cost
Economy Class:
No UPS, no generator – straight utility power
Availability: 0.999; failure probability in 5 years: ~100%
Business Class:
UPS, with or without generator
No generator – availability: 0.999; failure probability in 5 years: 87%
With generator – availability: 0.999; failure probability in 5 years: ~50%
Gulfstream Class:
N+1 UPS, N+1 generator
Availability: 0.999999; failure probability in 5 years: <1%
Redundancy-Driven HA:
Avoid hardware failure
Redundant at all levels
Longer MTBF – more disruption
Resiliency-Driven HA:
Goal – minimize service disruption
Automated fault detection and response
Fail often – recover fast!
Pay only for what you use
Show the cost of five nines to enable cost comparison
Portable apps
Elasticity – both up and down
Health Model must provide visibility into hardware infrastructure:
More to services than software
Understand hardware interactions and dependencies
What does "healthy" look like?
What does "unhealthy" look like?
What does "failure" look like?
Detect and respond depending on understanding
Define both Failure and Disaster:
How do failures affect the service as a whole?
Create a dependency tree
Determine when failure becomes disaster
Detect and automate both failure and disaster responses
Automation Drives the Cloud:
Manual – slow/error prone
Mechanized – faster/predictable
Automation – fast and predictable
[Diagram: fabric management – a hypervisor fabric of physical servers hosting virtual hosts; the management system tracks host location and the health state of each physical server, and failed servers are removed and replaced from the pool]
Standardized increments
Known environmental requirements
Known procurement lead time
Standardized engineering
Compute scale unit
Storage scale unit
Move away from break/fix incident model
Use a pool-based maintenance model
Define % of decay before maintenance
Consider non-server component failures
Select the component most likely to fail
How many servers are impacted?
That's the physical fault domain
Host servers still need to be upgraded
All servers in an upgrade domain are maintained simultaneously
Workloads are migrated away during upgrade
3% in decay
2% in upgrade
10% lost in fault domain
15% of total capacity lost
Takes advantage of homogeneity
VMs can be relocated predictably
Combines decay/fault/upgrade concepts
The total is the Reserve Capacity
RC = (FD + UD + D(max)) / RP
There should be no decrement in service