Data Center<br />Best Practices and Architecture<br />for the California State University<br />Author(s): DCBPA Task Force<br />Date: OctAug 127, 2009<br />Status: DRAFT<br />Version: 0.34.11<br />The content of this document is the result of the collaborative work of the Data Center Best Practice and Architecture (DCBPA) Task Force established under the Systems Technology Alliance committee within the California State University.<br />Team members who directly contributed to the content of this document are listed below.<br />Samuel G. Scalise, Sonoma, Chair of the STA and the DCBPA Task Force<br />Don Lopez, Sonoma<br />Jim Michael, Fresno<br />Wayne Veres, San Marcos<br />Mike Marcinkevicz, Fullerton<br />Richard Walls, San Luis Obispo<br />David Drivdahl, Pomona<br />Ramiro Diaz-Granados, San Bernardino<br />Don Baker, San Jose<br />Victor Vanleer, San Jose<br />Dustin Mollo, Sonoma<br />David Stein, PlanNet Consulting<br />Mark Berg, PlanNet Consulting<br />Michel Davidoff, Chancellor’s Office<br />Table of Contents<br />
1. Introduction<br />1.1. Purpose<br />1.2. Context<br />1.3. Audience<br />1.4. Development Process<br />1.5. Principles and Properties<br />2. Framework/Reference Model<br />3. Best Practice Components<br />3.1. Standards<br />3.2. Hardware Platforms<br />3.3. Software<br />3.4. Delivery Systems<br />3.5. Disaster Recovery<br />3.6. Total Enterprise Virtualization<br />3.7. Management Disciplines<br />Introduction<br />Purpose<br />As society and institutions of higher education increasingly benefit from technology and collaboration, identifying mutually beneficial best practices and architecture makes this document vital to the behind-the-scenes infrastructure of the university. Key drivers behind the gathering and assimilation of this collection are:<br />Many campuses want to know what the others are doing so they can draw from a knowledge base of successful initiatives and lessons learned. Having a head start in thinking through operational practices and effective architectures--as well as narrowing vendor selection for hardware, software and services--creates efficiencies in time and cost.<br />Campuses are financially constrained, and data center capital and operating expenses need to be curbed. For many, current growth trends are unsustainable: square footage is limited, and the demand for more servers and storage cannot be met without implementing new technologies to virtualize and consolidate.<br />Efficiencies in power and cooling need to be achieved in order to address green initiatives and reduce carbon footprint. 
They are also expected to translate into real cost savings in an energy-conscious economy. Environmentally sound practices are increasingly the mandate and could result in measurable controls on higher energy consumers.<br />Creating uniformity across the federation of campuses allows for consolidation of certain systems, reciprocal agreements between campuses to serve as tertiary backup locations, and opt-in subscription to services hosted at campuses with capacity to support other campuses, such as the C-cubed initiative.<br />Context<br />This document is a collection of Best Practices and Architecture for California State University Data Centers. It identifies practices and architecture associated with the provision and operation of mission-critical production-quality servers in a multi-campus university environment. The scope focuses on the physical hardware of servers, their operating systems, essential related applications (such as virtualization, backup systems and log monitoring tools), the physical environment required to maintain these systems, and the operational practices required to meet the needs of the faculty, students, and staff. Data centers that adopt these practices and architecture should be able to house any end-user service – from Learning Management Systems, to calendaring tools, to file-sharing. <br />This work represents the collective experience and knowledge of data center experts from the 23 campuses and the chancellor’s office of the California State University system. 
It is coordinated by the Systems Technology Alliance, whose charge is to advise the Information Technology Advisory Committee (made up of campus Chief Information Officers and key Chancellor’s Office personnel) on matters relating to servers (i.e., computers which provide a service for other computers connected via a network) and server applications.<br />This is a dynamic, living document that can be used to guide planning to enable collaborative systems, funding, procurement, and interoperability among the campuses and with vendors. <br />This document does not prescribe services used by end-users, such as Learning Management Systems or Document Management Systems. As those services and applications are identified by end-users such as faculty and administrators, this document will describe the data center best practices and architecture needed to support such applications. <br />Campuses are not required to adopt the practices and architecture elucidated in this document. There may be extenuating circumstances that require alternative architectures and practices. However, it is hoped that these alternatives will be documented through this process. <br />It is not the goal to describe a single solution, but rather the range of best solutions that meet the diverse needs of diverse campuses.<br />Audience<br />This information is intended to be reviewed by key stakeholders who have material knowledge of data center facilities and service offerings from business, technical, operational, and financial perspectives.<br />Development Process<br />The process for creating and updating these Best Practices and Architecture (P&A) is to identify the most relevant P&A, inventory existing CSU P&A for key aspects of data center operations, identify current industry trends, and document those P&A which best meet the needs of the CSU. 
This will include information about related training and costs, so that campuses can adopt these P&A with a full understanding of the costs and required expertise.<br />The work of creating this document will be conducted by members of the Systems Technology Alliance appointed by the campus Chief Information Officers, by members of the Chancellor’s Office Technology Infrastructure Services group, and by contracted vendors. <br />Principles and Properties<br />In deciding which Practices and Architecture should be adopted, it is important to have a set of criteria that reflect the unique needs, values, and goals of the organization. These Principles and Properties include:<br />Cost-effectiveness<br />Long-term viability<br />Flexibility to support a range of services<br />Security of the systems and data<br />Reliable and dependable uptime <br />Environmental compatibility<br />Redundancy<br />High availability<br />Performance<br />Training<br />Communication<br />Additionally, the architecture should emphasize criteria that are standards-based. The CSU will implement standards-based solutions in preference to proprietary solutions where this does not compromise the functional implementation.<br />The CSU seeks to adhere to standard ITIL practices and workflows where practical. Systems and solutions described herein should relate to corresponding ITIL and service management principles.<br />Framework/Reference Model<br />The framework is used to describe the components and management processes that lead to a holistic data center design. Data centers are as much about the services offered as they are about the equipment and space contained in them. Taken together, these elements should constitute a reference model for a specific CSU campus implementation.<br />Standards<br />ITIL<br />The Information Technology Infrastructure Library is a set of concepts around managing services and operations. 
The model was developed by the UK Office of Government Commerce and has been refined and adopted internationally. The ITIL version 2 framework for Service Support breaks out several management disciplines that are incorporated in this CSU reference architecture (see Section 3.7).<br />ITIL version 3 has reworked the framework into a collection of five volumes that describe:<br />Service Strategy<br />Service Design<br />Service Transition<br />Service Operation<br />Continual Service Improvement<br />ASHRAE<br />The American Society of Heating, Refrigerating and Air-Conditioning Engineers (ASHRAE) releases updated standards and guidelines for industry consideration in building design. They include recommended and allowable environment envelopes, such as temperature, relative humidity, and altitude for spaces housing datacomm equipment. The purpose of the recommended envelope is to give guidance to data center operators on maintaining high reliability while also operating their data centers in the most energy-efficient manner. <br />Uptime Institute<br />The Uptime Institute addresses architectural, security, electrical, mechanical, and telecommunications design considerations. See the Tiering Standards discussion in Section 3.4 for specific information on tiering standards as applied to data centers.<br />ISO/IEC 20000<br />Among the ISO IT management standards, ISO 20000-1 and ISO 20000-2 are effective resources to draw upon. ISO 20000-1 promotes the adoption of an integrated process approach to effectively deliver managed services to meet the business and customer requirements. It comprises ten sections: Scope; Terms & Definitions; Planning and Implementing Service Management; Requirements for a Management System; Planning & Implementing New or Changed Services; Service Delivery Process; Relationship Processes; Control Processes; Resolution Processes; and Release Process. 
ISO 20000-2 is a 'code of practice', and describes the best practices for service management within the scope of ISO 20000-1. It comprises nine sections: Scope; Terms & Definitions; The Management System; Planning & Implementing Service Management; Service Delivery Processes; Relationship Processes; Resolution Processes; Control Processes; Release Management Processes. <br />Together, this set of ISO standards is the first global standard for IT service management, and is fully compatible with and supportive of the ITIL framework.<br />Hardware Platforms<br />Servers<br />Types<br />Rack-mounted Servers – provide the foundation for any data center’s compute infrastructure. The most common are 1U and 2U: these form factors compose what is known as the volume market. The high-end market, geared towards high-performance computing (HPC) or applications that need more input/output (I/O) and/or storage, is composed of 4U to 6U rack-mounted servers. The primary distinction between volume-market and high-end servers is the I/O and storage capabilities.<br />Blade Servers – are defined by the removal of many components – power supply units (PSUs), network interface cards (NICs) and storage adapters – from the server itself. These components are grouped together as part of the blade chassis and shared by all the blades. The chassis is the piece of equipment that all of the blade servers “plug” into. The blade servers themselves contain processors, memory and a hard drive or two. One of the primary caveats to selecting the blade server option is the potential for future blade/chassis incompatibility. Most independent hardware vendors (IHVs) do not guarantee blade/chassis compatibility beyond two generations or five years. Another potential caveat is the high initial investment in blade technology because of additional costs associated with the chassis.<br />Towers – There are two primary reasons for using tower servers: price and remote locations. Towers offer the least expensive entrance into the server platform market. 
Towers can also be placed outside the confines of a data center. This feature can be useful for locating an additional Domain Name System (DNS) server or backup server in a remote office for redundancy purposes.<br />Principles<br />Application requirements – Applications such as databases, backup servers and other high-I/O workloads are better suited to HPC rack-mounted servers. Applications such as web servers and mail transfer agents (MTAs) work well in a volume-market rack-mounted environment or even in a virtual server environment. These applications allow servers to be easily added and removed to meet spikes in capacity demand. The need to have servers that are physically located at different sites for redundancy or ease of administration can be met by tower servers, especially for low-demand applications. Applications with high I/O requirements perform better with 1U or 2U rack-mounted servers than with blade servers, because stand-alone servers have a dedicated I/O interface rather than the shared one found on the chassis of a blade server.<br />Software support – can determine the platform an application lives on. Some vendors refuse to support virtual servers, making VMs unsuitable if vendor support is a key requirement. Some software does not support running multiple instances of an application, requiring the application to run on a large single server rather than multiple smaller servers.<br />Storage – requirements can vary from a few gigabytes, to accommodate the operating system, application and state data for application servers, to terabytes to support large database servers. Applications requiring large amounts of storage should be SAN-attached using Fibre Channel or iSCSI. Fibre Channel offers greater reliability and performance but demands a higher skill level from SAN administrators. Support for faster speeds and improved reliability is making iSCSI more attractive. Direct Attached Storage (DAS) is still prevalent because it is less costly and easier to manage than SAN storage. 
Rack-mounted 4U to 6U servers have the space to house a large number of disk drives and make suitable DAS servers.<br />Consolidation – projects can result in several applications being combined onto a single server or virtualized. Care must be taken when combining applications to ensure they are compatible with each other and that vendor support can be maintained. Virtualization accomplishes consolidation by allowing each application to think it is running on its own server. The benefits of consolidation include reduced power and space requirements and fewer servers to manage.<br />Energy efficiency – starts with proper cooling design, server utilization management and power management. Replacing old servers with newer, energy-efficient ones reduces energy use and cooling requirements and may be eligible for rebates that help the new servers pay for themselves. <br />Improved management – Many data centers contain “best of breed” technology: server platforms and other devices from many different vendors. Servers may be from vendor A, storage from vendor B and network from vendor C. This complicates troubleshooting and leads to finger pointing. Reducing the number of vendors produces standardization and is more likely to allow a single management interface for all platforms.<br />Business growth/New services – As student enrollment grows and the number of services to support them increases, the demand on the data center’s capacity to run applications and store data increases. This is the most common reason for buying new server platforms. IT administrators must use a variety of gauges to anticipate this need and respond in time. 
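The "gauges" mentioned above can be as simple as a compound-growth projection. The sketch below is a minimal, hedged illustration of estimating when existing capacity would be exhausted; the utilization and growth figures are hypothetical, not CSU data.

```python
import math

def years_until_full(current_utilization: float, annual_growth: float) -> float:
    """Years until utilization reaches 100%, assuming compound annual growth.

    current_utilization: fraction of capacity in use today (e.g. 0.55 for 55%)
    annual_growth: fractional growth in demand per year (e.g. 0.20 for 20%)
    """
    if current_utilization >= 1.0:
        return 0.0
    # Solve current * (1 + growth) ** years = 1.0 for years.
    return math.log(1.0 / current_utilization) / math.log(1.0 + annual_growth)

# Hypothetical example: storage is 55% full and demand grows 20% per year,
# suggesting roughly three years of headroom before new capacity is needed.
print(round(years_until_full(0.55, 0.20), 1))
```

A real forecast would of course track several such gauges (storage, power, cooling, rack space) and revisit the growth assumption regularly.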
<br />Server Virtualization<br />Principles<br />Reliability and availability—An implementation of server virtualization should provide increased reliability of servers and services by providing for server failover in the event of a hardware loss of service, as well as high availability by ensuring that access to shared services like network and disk is fault-tolerant and load-balanced.<br />Reuse—Server virtualization should allow better utilization of hardware and resources by provisioning multiple services and operating environments on the same hardware. Care must be taken to ensure that hardware is operating within the limits of its capacity. Effective capacity planning becomes especially important.<br />Consumability—Server virtualization should allow us to provide quickly available server instances, using technologies such as cloning and templating when appropriate.<br />Agility—Server virtualization should allow us to improve organizational efficiency by provisioning servers and services faster, allowing for rapid deployment of instances using cloning and templates.<br />Administration—Server virtualization will improve administration by providing a single, secure, easy-to-access interface to all virtual servers.<br />Storage<br />SAN – Storage Area Network<br />Fibre Channel<br />iSCSI<br /><ul><li>Benefits
Reduced costs: By leveraging existing network components (network interface cards [NICs], switches, etc.) as a storage fabric, iSCSI increases the return on investment (ROI) made for data center network communications and potentially saves the capital investment required to create a separate storage network. For example, iSCSI host bus adapters (HBAs) are 30-40% less expensive than Fibre Channel HBAs. Also, in some cases, 1 Gigabit Ethernet (GbE) switches cost 50% less than comparable Fibre Channel switches.
Simplified management: Organizations already employ qualified network administrators or trained personnel to manage network operations. Being a network protocol, iSCSI leverages existing network administration knowledge, obviating the need for additional staff and training to manage a separate storage network.
Improved options for DR: One of iSCSI's greatest strengths is its ability to travel long distances using IP wide area networks (WANs). Offsite data replication plays a key part in disaster recovery plans by preserving company data at a co-location that is protected by distance from a disaster affecting the original data center. Using a SAN router (iSCSI to Fibre Channel gateway device) and a target array that supports standard storage protocols (like Fibre Channel), iSCSI can replicate data from a local target array to a remote iSCSI target array, eliminating the need for costly Fibre Channel SAN infrastructure at the remote site.
iSCSI-based tiered storage solutions such as backup-to-disk (B2D) and near-line storage have become popular disaster recovery options. Using iSCSI in conjunction with Serial Advanced Technology Attachment (SATA) disk farms, B2D applications inexpensively back up, restore, and search data at rapid speeds.
Boot from SAN: As operating system (OS) images migrate to network storage, boot from SAN (BfS) becomes a reality, allowing chameleon-like servers to change application personalities based on business needs, while removing ties to the Fibre Channel HBAs previously required for SAN connectivity (BfS still requires a hardware initiator).
Software Initiators: While software initiators offer cost-effective SAN connectivity, there are some issues to consider. The first is host resource consumption versus performance. An iSCSI initiator runs within the input/output (I/O) stack of the operating system, utilizing the host memory space and CPU for iSCSI protocol processing. By leveraging the host, an iSCSI initiator can outperform almost any hardware-based initiator. However, as more iSCSI packets are sent or received by the initiator, more memory and CPU bandwidth is consumed, leaving less for applications. Obviously, the amount of resource consumption is highly dependent on the host CPU, NIC, and initiator implementation, but resource consumption could be problematic in certain scenarios. In virtualized environments, for example, a software iSCSI initiator consumes resource bandwidth that could otherwise be partitioned for additional virtual machines.
Hardware Initiators: iSCSI HBAs simplify boot-from-SAN (BfS). Because an iSCSI HBA is a combination NIC and initiator, it does not require assistance to boot from the SAN, unlike software initiator counterparts. By discovering a bootable target LUN during system power-on self test (POST), an iSCSI HBA can enable an OS to boot an iSCSI target like any DAS or Fibre Channel SAN-connected system. In terms of resource utilization, an iSCSI HBA offloads both TCP and iSCSI protocol processing, saving host CPU cycles and memory. In certain scenarios, like server virtualization, an iSCSI HBA may be the only choice where CPU processing power is consequential.
Software Targets: Any standard server can be used as a software target storage array, but it should be deployed as a stand-alone application. A software target can monopolize platform resources, leaving little room for additional applications.
Hardware Targets: Many of the iSCSI disk array platforms are built using the same storage platform as their Fibre Channel cousin. Thus, many iSCSI storage arrays are similar, if not identical, to Fibre Channel arrays in terms of reliability, scalability, performance, and management. Other than the controller interface, the remaining product features are almost identical.
Tape libraries should be capable of acting as iSCSI target devices; however, broad adoption and support in this category has not yet been seen, and tape remains territory served by native Fibre Channel connectivity.
iSCSI to Fibre Channel gateways and routers play a vital role in two ways. First, these devices increase return on invested capital made in Fibre Channel SANs by extending connectivity to “Ethernet islands” where devices that were previously unable to reach the Fibre Channel SAN can tunnel through using a router or gateway. Second, iSCSI routers and gateways enable Fibre Channel to iSCSI migration. SAN migration is a gradual process; replacing a large investment in Fibre Channel SANs all at once is not a cost reality. As IT administrators carefully migrate from one interconnect to another, iSCSI gateways and routers afford them the luxury of time and money. One note of caution: it is important to know the port speeds and amount of traffic passing through a gateway or router. These devices can become bottlenecks if too much traffic from one network is aggregated into another. For example, some router products offer eight 1 GbE ports and only two 4 Gb Fibre Channel ports. While total throughput is the same, careful attention must be paid to ensure traffic is evenly distributed across ports.
Any x86 server can act as an iSCSI to Fibre Channel gateway. Using a Fibre Channel HBA and iSCSI target software, any x86 server can present LUNs from a Fibre Channel SAN as an iSCSI target. Once again, this is not a turnkey solution—especially for large SANs—and caution should be exercised to prevent performance bottlenecks. However, this configuration can be cost-effective for small environments and connectivity to a single Fibre Channel target or small SAN.
Voracious storage consumption, combined with lower-cost SAN devices, has stimulated SAN growth beyond what administrators can manage without help. iSCSI exacerbates this problem by proliferating iSCSI initiators and low-cost target devices throughout a boundless IP network. Thus, a discovery and configuration service like iSNS (Internet Storage Name Service) is a must for large SAN configurations. Although other discovery services exist for iSCSI SANs, such as Service Location Protocol (SLP), iSNS is emerging as the most widely accepted solution.
Multi-path support</li></ul>NAS – Network Attached Storage<br />DAS – Direct Attached Storage<br />Storage Virtualization<br />Software<br />Operating Systems<br />An Operating System (commonly abbreviated to either OS or O/S) is an interface between hardware and user; an OS is responsible for the management and coordination of activities and the sharing of the resources of the computer. The operating system acts as a host for computing applications that are run on the machine. As a host, one of the purposes of an operating system is to handle the details of the operation of the hardware. This relieves application programs from having to manage these details and makes it easier to write applications.<br />Middleware<br />Middleware is computer software that connects software components or applications. The software consists of a set of services that allows multiple processes running on one or more machines to interact across a network. This technology evolved to provide for interoperability in support of the move to coherent distributed architectures, which are used most often to support and simplify complex, distributed applications. It includes web servers, application servers, and similar tools that support application development and delivery. Middleware is especially integral to modern information technology based on XML, SOAP, Web services, and service-oriented architecture.<br />Identity Management<br />Identity management or ID management is a broad administrative area that deals with identifying individuals in a system (such as a country, a network or an organization) and controlling access to the resources in that system by placing restrictions on the established identities.<br />Databases<br />A database is an integrated collection of logically related records or files; it consolidates records previously stored in separate files into a common pool of data records that provides data for many applications. 
A database is a collection of information that is organized so that it can easily be accessed, managed, and updated. In one view, databases can be classified according to types of content: bibliographic, full-text, numeric, and images. The structure is achieved by organizing the data according to a database model. The model that is most commonly used today is the relational model. Other models such as the hierarchical model and the network model use a more explicit representation of relationships.<br />Core/Enabling Applications<br />Email<br />Electronic mail, often abbreviated as email or e-mail, is a method of exchanging digital messages, designed primarily for human use. E-mail systems are based on a store-and-forward model in which e-mail computer server systems accept, forward, deliver and store messages on behalf of users, who only need to connect to the e-mail infrastructure, typically an e-mail server, with a network-enabled device (e.g., a personal computer) for the duration of message submission or retrieval.<br />Spam Filtering<br />E-mail spam, also known as junk e-mail, is a subset of spam that involves nearly identical messages sent to numerous recipients by e-mail. Spam filtering comes with a large set of rules which are applied to determine whether an email is spam or not. Most rules are based on regular expressions that are matched against the body or header fields of the message, but anti-spam vendors also employ a number of other spam-fighting techniques including header and text analysis, Bayesian filtering, DNS blocklists, and collaborative filtering databases.<br />Web Services<br />A Web Service (also Webservice) is defined by the W3C as "
a software system designed to support interoperable machine-to-machine interaction over a network. It has an interface described in a machine-processable format (specifically WSDL). Other systems interact with the Web service in a manner prescribed by its description using SOAP-messages, typically conveyed using HTTP with an XML serialization in conjunction with other Web-related standards."
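As a hedged illustration of the machine-to-machine interaction the W3C definition describes, the sketch below builds a minimal SOAP 1.1 envelope of the kind a client might POST over HTTP to a web service. The operation name, parameter, and campus value are hypothetical; a real client would take the operation and its namespace from the service's WSDL description.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"

def build_soap_request(operation: str, params: dict) -> bytes:
    """Build a minimal SOAP 1.1 envelope for a hypothetical operation."""
    ET.register_namespace("soap", SOAP_NS)
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    # The operation element is illustrative only; in practice its name and
    # namespace come from the WSDL that describes the service interface.
    op = ET.SubElement(body, operation)
    for name, value in params.items():
        ET.SubElement(op, name).text = str(value)
    return ET.tostring(envelope, encoding="utf-8")

# A client would POST this payload to the service endpoint with a
# Content-Type of text/xml and a SOAPAction header taken from the WSDL.
payload = build_soap_request("GetCourseList", {"campus": "Sonoma"})
print(payload.decode("utf-8"))
```

The response would come back as a similar XML envelope, which the client parses in the manner prescribed by the service description.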
<br />Calendaring<br />iCalendar is a computer file format that allows internet users to send meeting requests and tasks to other internet users via email, or by sharing files with an .ics extension. Recipients of the iCalendar data file (with supporting software, such as an email client or calendar application) can respond to the sender easily or counter-propose another meeting date/time. iCalendar is used and supported by a large number of products. iCalendar data is usually sent with traditional email.<br />DNS<br />Domain Name System (DNS) services enable the use of canonical names (rather than IP addresses) in addressing network resources. To provide a highly available network, DNS servers should be placed in an Enabling Services Network Infrastructure Model (see section 12.7.5). DNS services must also be highly available.<br />DHCP<br />Dynamic Host Configuration Protocol is used to manage the allocation of IP addresses. To provide a highly available network, DHCP servers should be placed in an Enabling Services Network Infrastructure Model (see section 12.7.5). DHCP services must also be highly available.<br />Syslog<br />syslog is a standard for forwarding log messages in an IP network. The term "
syslog" is often used for both the actual syslog protocol and the application or library sending syslog messages. Syslog is essential to capturing system messages generated from network devices. Devices provide a wide range of messages, including changes to device configurations, device errors, and hardware component failures.<br />Desktop Virtualization<br />Desktop virtualization (or Virtual Desktop Infrastructure) is a server-centric computing model that borrows from the traditional thin-client model but is designed to give system administrators and end-users the best of both worlds: the ability to host and centrally manage desktop virtual machines in the data center while giving end users a full PC desktop experience. The user experience is intended to be identical to that of a standard PC, but delivered from a thin client device or similar, whether in the same office or remote.<br />Application Virtualization<br />Application virtualization is an umbrella term that describes software technologies that improve portability, manageability and compatibility of applications by encapsulating them from the underlying operating system on which they are executed. A fully virtualized application is not installed in the traditional sense, although it is still executed as if it were. The application is fooled at runtime into believing that it is directly interfacing with the original operating system and all the resources managed by it, when in reality it is not. Application virtualization differs from operating system virtualization in that in the latter case, the whole operating system is virtualized rather than only specific applications.<br />Third Party Applications<br />LMS<br />A learning management system (LMS) is software for delivering, tracking and managing training/education. 
LMSs range from systems for managing training and educational records to software for distributing courses over the Internet and offering features for online collaboration.<br />CMS<br />The mission of the Common Management Systems (CMS) is to provide efficient, effective and high quality service to the students, faculty and staff of the 23-campus California State University System (CSU) and the Office of the Chancellor.<br />Utilizing a best practices approach, CMS supports human resources, financials and student services administration functions with a common suite of Oracle Enterprise applications in a shared data center, with a supported data warehouse infrastructure.<br />Help Desk/Ticketing<br />Help desks are now a fundamental and key aspect of good business service and operation. Through the help desk, problems are reported, managed and then appropriately resolved in a timely manner. Help desks can provide users the ability to ask questions and receive effective answers. Moreover, help desks can help the organization run smoothly and improve the quality of the support it offers to users.<br />Traditional - Help desks have traditionally been used as call centers. Telephone support was the main medium used until the advent of the Internet.<br />Internet - The advent of the Internet has provided the opportunity for potential and existing customers to communicate with suppliers directly and to review and buy their services online. Customers can email their problems without being put on hold over the phone. One of the largest advantages Internet help desks have over call centers is that they are available 24/7.<br />Delivery Systems<br />Facilities<br />Tiering Standards<br />The industry standard for measuring data center availability is the tiering metric developed by The Uptime Institute, which addresses architectural, security, electrical, mechanical, and telecommunications design considerations. The higher the tier, the higher the availability. 
Tier descriptions include information like raised floor heights, watts per square foot, and points of failure. “Need,” or “N,” indicates the level of component redundancy for each tier, with N representing only the capacity needed to serve the load. Construction cost per square foot is also provided and varies greatly from tier to tier, with Tier 3 costing roughly double that of Tier 1.<br />Tier 1 – Basic: 99.671% Availability<br /><ul><li>Susceptible to disruptions from both planned and unplanned activity
Single path for power and cooling distribution, no redundant components (N)
May or may not have a raised floor, UPS, or generator
Must be shut down completely to perform preventative maintenance</li></ul>Tier 2 – Redundant Components: 99.741% Availability<br /><ul><li>Less susceptible to disruption from both planned and unplanned activity
Single path for power and cooling distribution, includes redundant components (N+1)
Maintenance of the power path and other parts of the infrastructure requires a processing shutdown</li></ul>Tier 3 – Concurrently Maintainable: 99.982% Availability<br /><ul><li>Enables planned activity without disrupting computer hardware operation, but unplanned events will still cause disruption
Multiple power and cooling distribution paths but with only one path active, includes redundant components (N+1)
Includes raised floor and sufficient capacity and distribution to carry load on one path while performing maintenance on the other
Annual downtime of 1.6 hours</li></ul>Tier 4 – Fault Tolerant: 99.995% Availability<br /><ul><li>Planned activity does not disrupt critical load and data center can sustain at least one worst-case unplanned event with no critical load impact
Multiple active power and cooling distribution paths, includes redundant components (2 (N+1), i.e. 2 UPS each with N+1 redundancy)
Annual downtime of 0.4 hours</li></ul>Trying to achieve availability above Tier 4 presents a level of complexity that some believe yields diminishing returns. EYP, which manages HP’s data center design practice, says its empirical data show no additional uptime from the considerable cost of trying to further reduce downtime from 0.4 hours, due to the human element that gets introduced in managing the complexities of the many redundant systems.<br />Spatial Guidelines and Capacities<br />Locale: A primary consideration in data center design is understanding the importance of location. In addition to the obvious criteria of adjacency to business operations and technical support resources, cost factors such as utilities, networking and real estate are prime considerations. Exposure to natural disasters is also a key factor. Power is generally the largest cost factor over time, which has prompted organizations to increasingly consider remote data centers in low utility cost areas. Remote operations and network latency then become essential considerations.<br />Zoned space: Data centers should be block designed with specific tiering levels in mind so that sections of the space can be operated at high density with supporting infrastructure while other sections can be supported with minimal infrastructure. Each zone should have capacity for future growth within that tier.<br />Raised floor: A typical design approach for data centers is to use raised floor for air flow management and cable conveyance. Consideration must be given to air flow volume, which dictates the height of the floor, as well as weight loading. 
Raised floor structures must also be grounded.<br />Rack rows and density: Equipment racks and cabinets should be arranged in rows that provide for logical grouping of equipment types, ease of distribution for power and network, and air flow management, either through perforated floor tiles or direct ducting.<br />Electrical Systems<br />Generators: Can be natural gas or petroleum/diesel fueled. In higher-tier designs, generators are deployed in an N+1 configuration to account for load.<br />UPS: Can be rack-based or large room-based systems. Must be configured for load and runtime considerations. Asset management systems should track the lifecycle of batteries for proactive service and replacement.<br />PDUs: Power distribution units provide receptacles from circuits on the data center power system, usually from the UPS. Intelligent PDUs are able to provide management systems information about power consumption at the rack or even device level. Some PDUs can be remotely managed to allow power cycling of equipment at the receptacle level, which aids in remote operation of servers where a power cycle is required to reboot a hung system. <br />Dual A-B cording: In-rack PDUs should make multiple circuits available so that redundant power supplies (designated A and B) for devices can be corded to separate circuits. Some A-B cording strategies call for both circuits to be on UPS while others call for one power supply to be on house power while the other is on UPS. Each is a function of resilience and availability.<br />HVAC Systems<br /><ul><li>CRAC units: Computer Room Air Conditioners are specifically designed to provide cooling with humidification for data centers. They are typically tied to power systems that can maintain cooling independent of the power distribution to the rest of the building.
Hot/Cold Aisle Containment: Arranging equipment racks in rows that allow for the supply of cold air to the front of racks and exhaust of hot air at the rear. Adjacent rows would have opposite airflow to provide only one set of supply or exhaust ducts. Some very dense rack configurations may require the use of chimney exhaust above the racks to channel hot air away from the cold air supply. The key design component is to not allow hot air exhaust to mix with cold air supply and diminish its overall effectiveness. Containment is achieved through enclosed cabinet panels, end of row wall or panel structures, or plastic sheet curtains.
Economizers: Direct ambient outside air in cooler climates into the data center to supplement cooling.</li></ul>Fire Protection & Life Safety<br />Fire suppression systems are essential for providing life safety protection for occupants of a data center and for protecting the equipment. Design of systems should give priority to human life over equipment, which factors into the choice of certain gas suppression systems.<br />Pre-action: Describes a water sprinkler design that allows for the water pipes serving sprinkler heads within a data center to be free from water until such point that a triggering mechanism allows water to enter the pipes. This is meant to mitigate damage from incidental leakage or spraying water from ruptured water lines normally under pressure.<br />VESDA: Very Early Smoke Detection Apparatus allows for pre-action or gas suppression systems to have a human interrupt and intervention at initial thresholds before ultimately triggering on higher thresholds. The system operates by using lasers to evaluate continuous air samples for very low levels of smoke.<br />Halon: Oxygen-displacing gas suppression system that is generally no longer used in current data center design due to risk to personnel in the occupied space.<br />FM-200: Gas suppression system that quickly rushes the FM-200 gas to the confined data center space, which must be kept airtight for effectiveness. It is a popular replacement for halon gas since it can be implemented without having to replace deployment infrastructure. A purge system is usually required to exhaust and contain the gas after deployment so it does not enter the atmosphere.<br />Novec1230: Gas suppression system that is stored as a liquid at room temperature and allows for more efficient use of space over inert gas systems. 
Also a popular halon gas alternative.<br />Inergen: A gas suppression system that does not require a purge system or air tight facility since it is non-toxic and can enter the atmosphere without environmental concerns. Requires larger footprint for more tanks and is a more expensive gas to use and replace.<br />Access Control<br /><ul><li>Part of a good physical security plan includes access controls which allow you to determine who has access to your Data Center and when. Metal keys can provide a high level of security, but they do not provide an audit trail, and don't allow you to limit access based on times and/or days. Intrusion systems (aka, alarm systems) can sometimes allow for this kind of control in a facility where it is not possible to migrate to an electronic lock system.
Most new Data Centers constructed today include some sort of electronic locking system. These range from simple, offline keypad locks to highly complex systems that include access portals (aka man traps) and anti-tailgating systems. Electronic lock systems allow the flexibility to issue and revoke access instantaneously, or nearly so, depending on the product. Online systems (sometimes referred to as hardwired systems) consist of an access control panel that connects to a set of doors and readers of various types using wiring run through the building. Offline systems consist of locks that have a reader integrated into the lock, a battery and all of the electronics to make access determinations. Updates to these sorts of locks are usually done through some sort of hand-held device that is plugged into the lock.
There are two fairly common reader technologies in use today. One is magnetic stripe based. These systems usually read data encoded on tracks two or three. While the technology is mature and stable, it has a few weaknesses. The data on the cards can be easily duplicated with equipment readily purchased on the Internet. The magnetic stripe can wear out or become erased if it gets close to a magnetic field. One option for improving the security of magnetic swipe installations is the use of a dual-validation reader, where, after swiping the card, the user must enter a PIN code before the lock will open.
The other common access token in use today is the proximity card, also called an RFID card. These cards contain an integrated circuit (IC), a capacitor and a wire coil. When the coil is placed near a reader, the energy field emitted by the reader produces a charge in the capacitor, which powers the IC. Once powered, the IC transmits its information to the reader, and the reader (or the control panel it communicates with) determines whether access should be granted.
Beyond access control, the other big advantage of electronic locking systems is their ability to provide an audit trail. The system will keep track of all credentials presented to the reader, and the resulting outcome of that presentation: access was either granted or denied. Complex access control systems will even allow you to implement features such as a two-man rule, where two people must present authorized credentials before a lock will open, or anti-passback.
Anti-passback systems require a user to present credentials to both enter and exit a given space. Obviously, locking someone into a room would be a life safety issue, so usually some sort of alarm is sounded on egress if proper credentials were not presented. Anti-passback also allows you to track where individuals are at any given time, because the system knows that they presented credentials to exit a space.</li></ul>Commissioning<br />Commissioning is essential to validate the design, verify load capacities, and test failover mechanisms. A commissioning agent can identify design flaws, single points of failure, and inconsistencies in the build-out from the original design. Normally a commissioning agent would be independent of the design or build team.<br />A commissioning agent will inspect for such things as proper wiring, pipe sizes, weight loads, chiller and pump capacities, electrical distribution panels and switch gear. They will test battery run times, UPS and generator step loads, and air conditioning. They will simulate load with resistive coils to generate heat and UPS draw and go through a playbook of what-if scenarios to test all aspects of redundant systems.<br />Load Balancing/High Availability<br />Connectivity<br />Network<br />Network components in the data center—such as Layer 3 backbone switches, WAN edge routers, perimeter firewalls, and wireless access points—are described in the ITRP2 Network Baseline Standard Architecture and Design document, developed by the Network Technology Alliance, sister committee to the Systems Technology Alliance. Latest versions of the standard can be located at http://nta.calstate.edu/ITRP2.shtml.<br />Increasingly, boundaries are blurring between systems and networks. Virtualization is causing an abstraction of traditional networking components and moving them into software and the hypervisor layer.
Virtual switches<br />Considerations beyond “common services”<br />The following components have elements of network enabling services but are also systems-oriented and may be managed by the systems or applications groups. <br />DNS<br />For privacy and security reasons, many large enterprises choose to make only a limited subset of their systems “visible” to external parties on the public Internet. This can be accomplished by creating a separate Domain Name System (DNS) server with entries for these systems, and locating it where it can be readily accessible by any external user on the Internet (e.g., locating it in a DMZ LAN behind external firewalls to the public Internet). Other DNS servers containing records for internally accessible enterprise resources may be provided as “infrastructure servers” hidden behind additional firewalls in “trusted” zones in the data center. This division of responsibility permits the DNS server with records for externally visible enterprise systems to be exposed to the public Internet, while reducing the security exposure of DNS servers containing the records of internal enterprise systems.<br />E-Mail (MTA only) <br />For security reasons, large enterprises may choose to distribute e-mail functionality across different types of e-mail servers. A message transfer agent (MTA) server that only forwards Simple Mail Transfer Protocol (SMTP) traffic (i.e., no mailboxes are contained within it) can be located where it is readily accessible to other enterprise e-mail servers on the Internet. For example, it can be located in a DMZ LAN behind external firewalls to the public Internet. Other e-mail servers containing user agent (UA) mailboxes for enterprise users may be provided as “infrastructure servers” located behind additional firewalls in “trusted” zones in the data center. 
This division of responsibility permits the “external” MTA server to communicate with any other e-mail server on the public Internet, but reduces the security exposure of “internal” UA e-mail servers.<br />Voice Media Gateway <br />The data center site media gateway will include analog or digital voice ports for access to the local PSTN, possibly including integrated services digital network (ISDN) ports.<br />With Ethernet IP phones, the VoIP gateway is used for data center site phone users to gain local dial access to the PSTN. The VoIP media gateway converts voice calls between packetized IP voice traffic on a data center site network and local circuit-switched telephone service. With this configuration, the VoIP media gateway operates under the control of a call control server located at the data center site, or out in the ISP public network as part of an “IP Centrex” or “virtual PBX” service. However, network operators/carriers increasingly are providing a SIP trunking interface between their IP networks and the PSTN; this will permit enterprises to send VoIP calls across IP WANs to communicate with PSTN devices without the need for a voice media gateway or direct PSTN interface. Instead, data center site voice calls can be routed through the site’s WAN edge IP routers and data network access links.<br />Ethernet L2 Virtual Switch<br />In a virtual server environment, the hypervisor manages L2 connections from virtual hosts to the NIC(s) of the physical server.<br />A hypervisor plug-in module may be available to allow the switching characteristics to emulate a specific type of L2 switch so that it can be managed apart from the hypervisor and incorporated into the enterprise NMS.<br />Top-of-Rack Fabric Switches<br />As a method of consolidating and aggregating connections from dense rack configurations in the data center, top-of-rack switching has emerged as a way to provide both Ethernet and Fibre Channel connectivity in one platform. 
Generally, these devices connect to end-of-row switches that, optimally, can manage all downstream devices as one switching fabric. The benefits are a modularized approach to server and storage networks, reduced cross connects and better cable management.<br />Network Virtualization<br />Structured Cabling<br />The CSU has developed a set of standards for infrastructure planning that should serve as a starting place for designing cabling systems and other utilities serving the data center. These Telecommunications Infrastructure Planning (TIP) standards can be referenced at the following link: http://www.calstate.edu/cpdc/ae/gsf/TIP_Guidelines/<br />There is also an NTA working group concerned with cabling infrastructure, known as the Infrastructure Physical Plant Working Group (IPPWG). Information about the working group can be found at the following link: http://nta.calstate.edu/NTA_working_groups/IPP/<br />The approach to structured cabling in a data center differs from other aspects of building wiring due to the following issues:<br />Managing higher densities, particularly fiber optics<br />Cable management, especially with regard to moves, adds and changes<br />Heat control, for which cable management plays a role<br />The following are components of structured cabling design in the data center:<br />Cable types: Cabling may be copper (shielded or unshielded) or fiber optic (single mode or multimode).<br />Cabling pathways: usually a combination of raised floor access and overhead cable tray. 
Cables under raised floor should be in channels that protect them from adjacent systems, such as power and fire suppression.<br />Fiber ducts: fiber optic cabling has specific stress and bend radius requirements to protect the transmission of light; duct systems designed for fiber take into account the proper routing and storage of strands, pigtails and patch cords among the distribution frames and splice cabinets.<br />Fiber connector types: usually MT-RJ, LC, SC or ST. The use of modular fiber “cassettes” and trunk cables allows for higher densities and the benefit of factory terminations rather than terminations in the field, which can be time-consuming and subject to higher dB loss.<br />Cable management:<br />Operations<br />Information Technology (IT) operations refers to the day-to-day management of an IT infrastructure. An IT operation incorporates all the work required to keep a system running smoothly. This process typically includes the introduction and control of small changes to the system, such as mailbox moves and hardware upgrades, but it does not affect the overall system design. Operational support includes systems monitoring, network monitoring, problem determination, problem reporting, problem escalation, operating system upgrades, change control, version management, backup and recovery, capacity planning, performance tuning and system programming. <br />The mission of data center operations is to provide the highest possible quality of central computing support for the campus community and to maximize the availability of central computing systems. 
<br />Data center operations services include: <br />Help Desk Support <br />Network Management <br />Data Center Management <br />Server Management <br />Application Management <br />Database Administration <br />Web Infrastructure Management <br />Systems Integration <br />Business Continuity Planning <br />Disaster Recovery Planning<br />Email Administration<br />Staffing<br />Staffing is the process of acquiring, deploying, and retaining a workforce of sufficient quantity and quality to maximize the organizational effectiveness of the data center.<br />Training<br />Training is not simply a support function, but a strategic element in achieving an organization’s objectives. <br />IT Training Management Processes and Sample Practices<br />Align IT training with business goals: enlist executive-level champions; involve critical stakeholders.<br />Identify and assess IT training needs: document competencies/skills required for each job description; perform a gap analysis to determine needed training.<br />Allocate IT training resources: use an investment process to select and manage training projects; provide resources for management training, e.g., leadership and project management.<br />Design and deliver IT training: give trainees choice among different training delivery methods; build courses using reusable components.<br />Evaluate/demonstrate the value of IT training: collect information on how job performance is affected by training; assess evaluation results in terms of business impact.<br />Monitoring<br />Monitoring is a critical element of data center asset management and covers a wide spectrum of issues such as system availability, system performance levels, component serviceability and timely detection of operational or security problems such as disk capacity exceeding defined thresholds or system binary files being modified.<br />Automation<br />Automation of routine data center tasks reduces staffing headcount by using tools such as automated tape backup systems that auto-load magnetic media from tape libraries and send backup status and exception reports to data center staff. The potential for automating routine tasks is substantial. Automation increases reliability and frees staff from routine tasks so that continuous improvement of operations can occur.<br />Console Management<br />To the extent possible, console management should integrate the management of heterogeneous systems using orchestration or a common management console.<br />Remote Operations<br />Lights-out operations are facilitated by effective remote operations tools. This leverages the economies of scale enjoyed by managing multiple remote production data centers from a single location, which may be dynamically assigned in a manner such as “follow the sun.”<br />Accounting<br />Auditing<br />The CSU publishes findings and campus responses to information security audits. Reports can be found at the following site: http://www.calstate.edu/audit/audit_reports/information_security/index.shtml<br />Disaster Recovery<br />Relationship to overall campus strategy for Business Continuity<br />Campuses should already have a business continuity plan, which typically includes a business impact analysis (BIA) to monetize the effects of interrupted processes and system outages. 
Deducing a maximum allowable downtime through this exercise will inform service and operational level agreements, as well as establish recovery time and point objectives, discussed in the Backup and Recovery section below.<br />Relationship to CSU Remote Backup – DR initiative<br />ITAC has sponsored an initiative to explore business continuity and disaster recovery partnerships between CSU campuses. [Charter document?] Several campuses have teamed to develop documents and procedures and their workproduct is posted at http://drp.sharepointsite.net/itacdrp/default.aspx. <br />Examples of operational considerations, memorandums of understanding, and network diagrams are included in the posted workproduct.<br />Infrastructure considerations<br />Site availability<br />Disaster recovery planning should account for short-, medium-, and long-term disaster and disruption scenarios, including impact on and accessibility to the data center. Consideration should be given to the location, size, capacity, and utilities necessary to recover the level of service required by the critical business functions. Attention should be given to structural, mechanical, electrical, plumbing and control systems and should also include planning for workspace, telephones, workstations, network connectivity, etc.<br />Alternate sites could be geographically diverse locations on the same campus, locations on other campuses (perhaps as part of a reciprocal agreement between campuses to recover each other’s basic operations), or commercially available co-location facilities described in the Co-location section below.<br />When determining an alternate site, management should consider scalability, in the event a long-term disaster becomes a reality. The plan should include logistical procedures for accessing backup data as well as moving personnel to the recovery location. 
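The BIA-derived maximum allowable downtime can drive the alternate-site decision. A minimal sketch of such a mapping (the hour thresholds and strategy labels are illustrative assumptions, not CSU policy):

```python
def recovery_strategy(rto_hours: float) -> str:
    """Map a recovery time objective to a candidate alternate-site strategy.

    Thresholds are illustrative only; each campus BIA should set its own.
    """
    if rto_hours < 4:
        return "hot site (mirrored systems, near-continuous replication)"
    if rto_hours < 72:
        return "warm site (hardware in place, restore from backup)"
    return "cold site (space and utilities only, procure hardware on demand)"

print(recovery_strategy(1))    # e.g., a campus's main web server
print(recovery_strategy(48))   # e.g., a departmental file server
```

Running each critical business function through a table like this makes the cost trade-off explicit before committing to a reciprocal agreement or commercial facility.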
<br />Co-location<br />One method of accomplishing business continuity objectives through redundancy with geographic diversity is to use a co-location scenario, either through a reciprocal agreement with another campus or a commercial provider. The following are typical types of co-location arrangements:<br />Real estate investment trusts (REITs): REITs offer leased shared data center facilities in a business model that leverages tax laws to offer savings to customers. <br />Network-neutral co-location: Network-neutral co-location providers offer leased rack space, power, and cooling with the added service of peer-to-peer network cross-connection.<br />Co-location within hosting center: Hosting centers may offer co-location as a basic service with the ability to upgrade to various levels of managed hosting.<br />Unmanaged hosted services: Hosting centers may offer a form of semi-co-location wherein the hosting provider owns and maintains the server hardware for the customer, but does not manage the operating system or applications/services that run on that hardware.<br />Principles for co-location selection criteria<br />Business process includes or provides an e-commerce solution<br />Business process does not contain applications and services that were developed and are maintained in-house<br />Business process does not predominantly include internal infrastructure or support services that are not web-based<br />Business process contains predominantly commodity and horizontal applications and services (such as email and database systems)<br />Business process requires geographically distant locations for disaster recovery or business continuity<br />Co-location facility meets level of reliability objective (Tier I, II, III, or IV) at less cost than retrofitting or building new campus data centers<br />Access to particular IT staff skills and bandwidth of the current IT staffers<br />Level of SLA matches the campus requirements, including those for disaster 
recovery<br />Co-location provider can accommodate regulatory auditing and reporting for the business process<br />Current data center facilities have run out of space, power, or cooling<br />[concepts from Burton Group article, “Host, Co-Lo, or Do-It-Yourself?”]<br />Operational considerations<br />Recovery Time Objectives and Recovery Point Objectives are discussed under Backup and Recovery below<br />Total Enterprise Virtualization<br />Management Disciplines<br />Service Management<br />IT service management is the integrated set of activities required to ensure the cost and quality of IT services valued by the customer. It is the management of customer-valued IT capabilities through effective processes, organization, information and technology, including:<br />Aligning IT with business objectives<br />Managing IT services and solutions throughout their lifecycles<br />Service management processes like those described in ITIL, ISO/IEC 20000, or IBM’s Process Reference Model for IT.<br />Service Catalog<br />An IT Service Catalog defines the services that an IT organization is delivering to the business users and serves to align the business requirements with IT capabilities, communicate IT services to the business community, plan demand for these services, and orchestrate the delivery of these services across the functionally distributed (and, oftentimes, multi-sourced) IT organization. An effective Service Catalog also segments the customers who may access the catalog - whether end users or business unit executives - and provides different content based on function, roles, needs, locations, and entitlements. <br />The most important requirement for any Service Catalog is that it should be business-oriented, with services articulated in business terms. In following this principle, the Service Catalog can provide a vehicle for communicating and marketing IT services to both business decision-makers and end users. 
<br />The ITIL framework distinguishes between these groups as "customers" 
(the business executives who fund the IT budget) and "users" 
(the consumers of day-to-day IT service deliverables). The satisfaction of both customers and users is equally important, yet these are two very distinct audiences.<br />To be successful, the IT Service Catalog must be focused on addressing the unique requirements of each of these business segments. Depending on the audience, each will require a very different view into the Service Catalog. IT organizations should consider a two-pronged approach to creating an actionable Service Catalog:<br />The executive-level, service portfolio view of the Service Catalog, used by business unit executives to understand how IT's portfolio of service offerings maps to business unit needs. This is referred to in this document as the "service portfolio."
<br />The employee-centric, request-oriented view of the Service Catalog, used by end users (and even other IT staff members) to browse for the services required and submit requests for IT services. For the purposes of this document, this view is referred to as a "
service request catalog."
<br />As described above, a Service Request Catalog should look like a consumer catalog, with easy-to-understand descriptions and an intuitive store-front interface for browsing available service offerings. This customer-focused approach helps ensure that the Service Request Catalog is adopted by end users. The Service Portfolio provides the basis for a balanced, business-level discussion on service quality and cost trade-offs with business decision-makers.<br />To that end, service catalogs should extend beyond a mere list of services offered and can be used to facilitate:<br />IT best practices, captured as Service Catalog templates <br />Operational Level Agreements, Service Level Agreements (aligning internal & external customer expectations) <br />Hierarchical and modular service models <br />Catalogs of supporting and underlying infrastructures and dependencies (including direct links into the CMDB) <br />Demand management and capacity planning <br />Service request, configuration, validation, and approval processes <br />Workflow-driven provisioning of services <br />Key performance indicator (KPI)-based reporting and compliance auditing<br />Service Level Agreements<br />A quality service level agreement is of fundamental importance for any significant service or product delivery. It essentially defines the formal relationship between the supplier and the recipient, and is not an area for short-cutting. Too often it is not given sufficient attention. 
This can lead to serious problems with the relationship, and indeed serious issues with the service itself and potentially the business itself.<br />It should address all key issues, and typically will define and/or cover: <br />The services to be delivered <br />Performance, Tracking and Reporting Mechanisms<br />Problem Management Procedures <br />Dispute Resolution Procedures <br />The Recipient's Duties and Responsibilities <br />Security <br />Legislative Compliance <br />Intellectual Property and Confidential Information Issues<br />Agreement Termination <br />Project Management<br />An organization’s ability to effectively manage projects allows it to adapt to changes and succeed in activities such as system conversions, infrastructure upgrades and system maintenance. A project management system should employ well-defined and proven techniques for managing projects at all stages, including:<br />Initiation<br />Planning<br />Execution<br />Control<br />Close-out<br />Project monitoring will include:<br />Target completion dates – realistically set for each task or phase to improve project control.<br />Project status updates – measured against original targets to assess time and cost overruns.<br />Stakeholders and IT staff should collaborate on defining project requirements, budget, resources, critical success factors, and risk assessment, as well as a transition plan from the implementation team to the operational team.<br />Change Management<br />Change Management addresses routine maintenance and periodic modification of hardware, software and related documentation. It is a core component of a functional ITIL process as well. 
Functions associated with change management are:<br />Major modifications: significant functional changes to an existing system, or converting to or implementing a new system; usually involves detailed file mapping, rigorous testing, and training.<br />Routine modifications: changes to applications or operating systems to improve performance, correct problems or enhance security; usually not of the magnitude of major modifications and can be performed in the normal course of business.<br />Emergency modifications: periodically needed to correct software problems or restore operations quickly. Change procedures should be similar to routine modifications but include abbreviated change request, evaluation and approval procedures to allow for expedited action. Controls should be designed so that management completes detailed evaluation and documentation as soon as possible after implementation.<br />Patch management: similar to routine modifications, but relating to externally developed software.<br />Library controls: provide ways to manage the movement of programs and files between collections of information, typically segregated by the type of stored information, such as for development, test and production.<br />Utility controls: restrict the use of programs used for file maintenance, debugging, and management of storage and operating systems.<br />Documentation maintenance: identifies document authoring, approving and formatting requirements and establishes primary document custodians. Effective documentation allows administrators to maintain and update systems efficiently and to identify and correct programming defects, and also provides end users access to operations manuals.<br />Communication plan: change standards should include communication procedures that ensure management notifies affected parties of changes.
An oversight or change control committee can help clarify requirements and make departments or divisions aware of pending changes.<br />[concepts from FFIEC Development and Acquisition handbook]<br />Configuration Management<br />Configuration Management is the process of creating and maintaining an up-to-date record of all components of the infrastructure.<br />Functions associated with Configuration Management are:<br />Planning <br />Identification <br />Control <br />Status Accounting <br />Verification and Audit <br />Configuration Management Database (CMDB) - A database that contains details about the attributes and history of each Configuration Item and details of the important relationships between CIs. The information held may be in a variety of formats (textual, diagrammatic, photographic, etc.); effectively a data map of the physical reality of the IT Infrastructure.<br />Configuration Item - Any component of an IT Infrastructure which is (or is to be) under the control of Configuration Management.<br />The lowest-level CI is normally the smallest unit that will be changed independently of other components. CIs may vary widely in complexity, size and type, from an entire service (including all its hardware, software, documentation, etc.) to a single program module or a minor hardware component. <br />Data Management<br />Backup and Recovery<br />Concepts<br />Recovery Time Objective, or RTO, is the duration of time within which a set of data, a server, or a business process must be restored. For example, a highly visible server, such as a campus' main web server, may need to be up and running again in a matter of seconds, as the business impact of that service being down is high. Conversely, a server with low visibility, such as a server used in software QA, may have an RTO of a few hours.<br />Recovery Point Objective, or RPO, is the acceptable amount of data loss a business can tolerate, measured in time.
In other words, this is the point in time before a data loss event at which data may be successfully recovered. For less critical systems, it may be acceptable to recover to the most recent backup taken at the end of the business day, whereas highly critical systems may have an RPO of an hour or only a few minutes. RPO and RTO go hand-in-hand in developing your data protection plan.<br />Deduplication:<br />Source deduplication - Source deduplication means that the deduplication work is done up-front by the client being backed up.<br />Target deduplication - Target deduplication is where the deduplication processing is done by the backup appliance and/or server. There tend to be two forms of target deduplication: in-line and post-process.<br />In-line deduplication devices decide whether or not they have seen the data before writing it out to disk.<br />Post-process deduplication devices write all of the data to disk, and then at some later point, analyze that data to find duplicate blocks.<br />Backup types<br />Full backups - Full backups are a backup of a device that includes all data required to restore that device to the point in time at which the backup was performed.<br />Incremental backups - Incremental backups back up the data that has changed since a previous backup of the system was performed. There is no industry standard for incrementals; one vendor's style may differ from another's, and some vendors offer multiple styles of incrementals from which a backup administrator may choose.<br />A cumulative incremental backup is a style of incremental backup where the data set contains all data changed since the last full backup.<br />A differential incremental backup is a style of incremental backup where the data set contains all data changed since the previous backup, whether it be a full or another differential incremental.<br />Tape Media - There are many tape formats to choose from when looking at tape backup purchases.
They range from open standards (many vendors sell compatible drives) to single-vendor or legacy technologies.<br />DLT - Digital Linear Tape, or DLT, was originally developed by Digital Equipment Corporation in 1984. The technology was later purchased by Quantum in 1994. Quantum licenses the technology to other manufacturers, as well as manufacturing its own drives.<br />LTO - Linear Tape Open, or LTO, is a tape technology developed by a consortium of companies in order to compete with proprietary tape formats in use at the time.<br />DAT/DDS - Digital Data Storage, or DDS, is a tape technology that evolved from Digital Audio Tape, or DAT, technology.<br />AIT - Advanced Intelligent Tape, or AIT, is a tape technology developed by Sony in the late 1990s.<br />STK/IBM - StorageTek and IBM have created several proprietary tape formats that are usually found in large, mainframe environments.<br />Methods<br />Disk-to-Tape (D2T) - Disk-to-tape is what most system administrators think of when they think of backups, as it has been the most common backup method in the past. The data typically moves from the client machine through some backup server to an attached tape drive. Writing data to tape is typically faster than reading the data back from the tape.<br />Disk-to-Disk (D2D) - With the dramatic drop in hard drive prices over recent years, disk-to-disk methods and technologies have become more popular. The big advantage they have over the traditional tape method is speed in both the writing and reading of data. Some options available in the disk-to-disk technology space:<br />VTL - Virtual Tape Libraries, or VTLs, are a class of disk-to-disk backup devices where a disk array and software appear as a tape library to your backup software.<br />Standard disk array - Many enterprise backup software packages available today support writing data to attached disk devices instead of a tape drive.
One advantage to this method is that you don't have to purchase a special device in order to gain the speed benefits of disk-to-disk technology.<br />Disk-to-Disk-to-Tape (D2D2T) - Disk-to-disk-to-tape is a combination of the previous two methods. This practice combines the best of both worlds - speed benefits from using disk as your backup target, and tape's value in long-term and off-site storage practices. Many specialized D2D appliances have some support for pushing their images off to tape. Backup applications that support disk targets also tend to support migrating their images to tape at a later date.<br />Snapshots - A snapshot is a copy of a set of files and directories as they were at a particular moment in time. On a server operating system, the snapshot is usually taken by either the logical volume manager (LVM) or the file system driver. File system snapshots tend to be more space-efficient than their LVM counterparts. Most storage arrays come with some sort of snapshot capability, either as a base feature or as a licensable add-on.<br />VM images – In a virtualized environment, backup agents may be installed on the virtual host and file-level backups invoked in the conventional manner. Backing up each virtual instance as a file at the hypervisor level is another consideration. A prime consideration in architecting backup strategies in a virtual environment is the use of a proxy server or intermediate staging server to handle snapshots of active systems. Such proxies allow the virtual host instance to be staged for backup without having to quiesce or reboot the VM. Depending on the platform and the OS, it may also be possible to achieve file-level restores within the VM while backing up the entire VM as a file.<br />Replication<br />On-site - On-site replication is useful if you are trying to protect against device failure. You would typically purchase identical storage arrays and then configure them to mirror the data between them.
This does not, however, protect against some sort of disaster that takes out your entire data center.<br />Off-site - Off-site implies that you are replicating your data to a similar device located away from your campus. Technically, off-site could mean something as simple as a different building on your campus, but generally this term implies some geo-diversity to the configuration.<br />Synchronous vs. Asynchronous - Synchronous replication guarantees zero data loss by performing atomic writes. In other words, the data is written to every array that is part of the replication configuration, or to none of them. A write request is not considered complete until acknowledged by all storage arrays. Depending on your application and the distance between your local and remote arrays, synchronous replication can cause performance impacts, since the application may wait until it has been informed by the OS that the write is complete. Asynchronous replication gets around this by acknowledging the write as soon as the local storage array has written the data. Asynchronous replication may increase performance, but it can contribute to data loss if the local array fails before the remote array has received all data updates.<br />In-band vs. Out-of-band - In-band replication refers to replication capabilities built into the storage device. Out-of-band replication can be accomplished with an appliance, with software installed on a server, or "in the network", usually in the form of a module or licensed feature installed into a storage router or switch.<br />Tape Rotation and Aging Strategies<br />Grandfather, father, son - From Wikipedia: "Grandfather-father-son backup refers to the most common rotation scheme for rotating backup media. Originally designed for tape backup, it works well for any hierarchical backup strategy. The basic method is to define three sets of backups, such as daily, weekly and monthly. The daily, or son, backups are rotated on a daily basis with one graduating to father status each week. The weekly or father backups are rotated on a weekly basis with one graduating to grandfather status each month."
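The grandfather-father-son scheme described above can be expressed as a small scheduling function. A sketch, assuming daily backups with weekly promotion on Sundays and monthly promotion on the first of the month; the promotion days are illustrative choices, not part of the scheme itself:

```python
from datetime import date

def gfs_level(d: date) -> str:
    """Classify the backup taken on date d under a grandfather-father-son
    rotation: monthly ("grandfather"), weekly ("father"), daily ("son")."""
    if d.day == 1:           # first of the month graduates to grandfather status
        return "grandfather"
    if d.weekday() == 6:     # Sundays graduate to father status
        return "father"
    return "son"             # all other days rotate in the daily (son) pool

# Two weeks of August 2009: mostly sons, a father each Sunday,
# and a grandfather on the 1st.
schedule = {day: gfs_level(date(2009, 8, day)) for day in range(1, 15)}
```

Because sons are overwritten weekly and fathers monthly, only the grandfather set ages long enough to be a candidate for off-site vaulting, which ties this scheme to the vaulting practice discussed next.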
<br />Offsite vaults - Vaulting, or moving media from on-site to an off-site storage facility, is usually done with some sort of full backup. The media sent off-site can be either the original copy or a duplicate, but it is common to retain at least one copy of the media rather than sending your only copy. The amount of time it takes to retrieve a given piece of media should be taken into consideration when calculating and planning for your RTO.<br />Retention policies: The CSU maintains a website with links and resources to help campuses comply with requirements contained in Executive Order 1031, the CSU Records/Information Retention and Disposition Schedules. The objective of the executive order is to ensure compliance with legal and regulatory requirements while implementing appropriate operational best practices. The site is located at http://www.calstate.edu/recordsretention.<br />Tape library: A tape library is a device that usually holds multiple tapes and multiple tape drives and has a robot to move tapes between the various slots and drives. A library can help automate the process of switching tapes so that an administrator doesn't have to spend several hours every week changing out tapes in the backup system. A large tape library can also allow you to consolidate the various media formats in use in an environment into a single device (i.e., mixing DLT and LTO tapes and drives).<br />Disk Backup appliances/arrays: some vendor backup solutions may employ a dedicated storage appliance or array that is optimized for their particular backup scheme.
In the case of incorporating deduplication into the backup platform, a dedicated appliance may be involved for handling the indexing of the bit-level data.<br />Archiving<br />Media Lifecycle<br />Destruction of expired data<br />Hierarchical Storage Management<br />Document Management<br />Asset Management<br />Effective data center asset management is necessary for both regulatory and contractual compliance. It can improve life cycle management and facilitate inventory reductions by identifying under-utilized hardware and software, potentially resulting in significant cost savings. An effective management process requires combining current Information Technology Infrastructure Library (ITIL) and Information Technology Asset Management (ITAM) best practices with accurate asset information, ongoing governance and asset management tools. The best systems/tools should be capable of asset discovery and should manage all aspects of the assets - physical, financial, contractual and life cycle - with Web interfaces for real-time access to the data. Recognizing that sophisticated systems may be prohibitively expensive, asset management for smaller environments may be manageable with spreadsheets or a simple database. Optimally, a system that could be shared among campuses while maintaining restricted permission levels would allow for more comprehensive and uniform participation, such as the Network Infrastructure Asset Management System (NIAMS), http://www.calstate.edu/tis/cass/niams.shtml.<br />The following are asset categories to be considered in a management system:<br />Physical Assets – to include the grid, floor space, tile space, racks and cables. The layout of space and the utilization of the attributes above are literally an asset that needs to be tracked both logically and physically.<br />Network Assets – to include routers, switches, firewalls, load balancers, and other network related appliances.
<br />Storage Assets – to include Storage Area Networks (SAN), Network Attached Storage (NAS), tape libraries and virtual tape libraries.<br />Server Assets – to include individual servers, blade servers and enclosures.<br />Electrical Assets – to include Uninterruptible Power Supplies (UPS), Power Distribution Units (PDU), breakers, outlets (NEMA noted), circuit number and grid location of same. Power consumption is another example of a logical asset that needs to be monitored by the data center manager in order to maximize server utilization and understand, if not reduce, associated costs.<br />Air Conditioning Assets – to include air conditioning units, air handlers, chiller plants and other airflow related equipment. Airflow in this instance may be considered a logical asset as well, but its usage plays an important role in a data center environment. Rising energy costs and concerns about global warming require data center managers to track usage carefully. Computational fluid dynamics (CFD) modeling can serve as a tool for maximizing airflow within the data center.<br />Data Center Security and Safety Assets – Media access controllers, cameras, fire alarms, environmental surveillance, access control systems and access cards/devices, and fire and life safety components, such as fire suppression systems.<br />Logical Assets – T1s, PRIs and other communication lines, air conditioning, electrical power usage. Most important in this logical realm is the management of the virtual environment. Following is a list of logical assets or associated attributes that would need to be tracked:<br />A list of Virtual Machines <br />Software licenses in use in the data center<br />Virtual access to assets<br />VPN access accounts to the data center<br />Server/asset accounts local to the asset<br />Information Assets – to include text, images, audio, video and other media. Information is probably the most important asset a data center manager is responsible for.
An information asset is a definable piece of information, stored in any manner, that is recognized as valuable to the organization. Users must have accurate, timely, secure and personalized access to this information.<br />The following are asset groupings to be considered in a management system:<br />By Security Level<br />Confidentiality<br />FERPA<br />HIPAA<br />PCI<br />By Support Organization<br />Departmental<br />Computer Center Supported<br />Project Team<br />Criticality<br />Critical (ex. 24x7 availability)<br />Business Hours only (ex. 8AM - 7PM)<br />Noncritical<br />By Funding Source (useful for recurring costs)<br />Departmental funded<br />Project funded<br />Division funded<br />Tagging/Tracking<br />Licensing<br />Software Distribution<br />Problem Management<br />Problem Management investigates the underlying cause of incidents, and aims to prevent incidents of a similar nature from recurring. By removing errors, which often requires a structural change to the IT infrastructure in an organization, the number of incidents can be reduced over time. Problem Management should not be confused with Incident Management: Problem Management seeks to remove the causes of incidents permanently from the IT infrastructure, whereas Incident Management deals with the symptoms of incidents. Problem Management is proactive while Incident Management is reactive.<br />Fault Detection - A condition often identified as a result of multiple incidents that exhibit common symptoms. Problems can also be identified from a single significant incident, indicative of a single error, for which the cause is unknown, but for which the impact is significant.<br />Correction - An iterative process to diagnose known errors until they are eliminated by the successful implementation of a change under the control of the Change Management process.<br />Reporting - Summarizes Problem Management activities.
Includes number of repeat incidents, problems, open problems, repeat problems, etc.<br />Security<br />Data Security<br />Data security is the protection of data from accidental or malicious modification, destruction, or disclosure. Although the subject of data security is broad and multi-faceted, it should be an overriding concern in the design and operation of a data center. There are multiple laws, regulations and standards that are likely to be applicable, such as the Payment Card Industry Data Security Standard, the ISO 17799 Information Security Standard, California SB 1386, California AB 211, and the California State University Information Security Policy and Standards, to name a few. Compliance with these standards and laws must be demonstrated periodically.<br />Encryption<br />Encryption is the use of an algorithm to encode data in order to render a message or other file readable only for the intended recipient. Its primary functions are to ensure non-repudiation, integrity, and confidentiality in both data transmission and data storage. The use of encryption is especially important for Protected Data (data classified as Level 1 or 2). Common transmission encryption protocols and utilities include SSL/TLS, Secure Shell (SSH), and IPSec. Encrypted data storage programs include PGP's encryption products (other security vendors such as McAfee have products in this space as well), encrypted USB keys, and TrueCrypt's free encryption software. Key management (exchange of keys, protection of keys, and key recovery) should be carefully considered.<br />Authentication<br />Authentication is the verification of the identity of a user. From a security perspective it is important that user identification be unique so that each person can be positively identified. Also, the process of issuing identifiers must be secure and documented.
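Positively identifying a user by password requires storing something that can verify the password without revealing it. A minimal sketch of salted, iterated password hashing using Python's standard library; the iteration count and salt size are illustrative choices, not a mandated configuration:

```python
import hashlib
import hmac
import os

def hash_password(password, salt=None):
    """Derive a salted PBKDF2-HMAC-SHA256 digest; store (salt, digest),
    never the password itself."""
    salt = salt or os.urandom(16)             # random per-user salt
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)
    return salt, digest

def verify_password(password, salt, digest):
    """Recompute the digest and compare in constant time."""
    candidate = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)
    return hmac.compare_digest(candidate, digest)

salt, digest = hash_password("correct horse battery staple")
```

The per-user salt and high iteration count are what blunt precomputed-dictionary attacks against a stolen credential store.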
There are three types of authentication available:<br />What a person knows (e.g., password or passphrase)<br />What a person has (e.g., smart card or token)<br />What a person is or does (e.g., biometrics or keystroke dynamics)<br />Single-factor authentication is the use of one of the above authentication types, two-factor authentication uses two of them, and three-factor authentication uses all of them. <br />Single-factor password authentication remains the most common means of authentication ("What a person knows"). However, due to the computing power of modern computers in the hands of attackers and modern attack technologies
, passwords used for single-factor authentication may soon outlive their usefulness. Strong passwords should be used, and a password should never be transmitted or stored without being encrypted. A reasonably strong password would be a minimum of eight characters and should contain three of the following four character types: lowercase alpha, uppercase alpha, number, and special character.<br />Vulnerability Management<br />Anti-malware Protection<br />Malware (malicious code, such as viruses, worms, and spyware, written to circumvent the security policy of a computer) represents a threat to data center operations. Anti-malware solutions must be deployed on all operating system platforms to detect and reduce the risk to an acceptable level. Solutions for malware infection attacks include firewalls (host and network), antivirus/anti-spyware, host/network intrusion protection systems, and OS/application hardening and patching. Relying on antivirus solutions alone will not fully protect a computer from malware. Determining the correct mix and configuration of the anti-malware solutions depends on the value and type of services provided by a server. Antivirus, firewalls, and intrusion protection systems need to be regularly updated in order to respond to current threats.<br />Patching<br />The ongoing patching of operating systems and applications is an important activity in vulnerability management. Patching includes file updates and configuration alterations. Data Center Operations groups should implement a patching program designed to monitor available patches, then categorize, test, implement, and monitor the deployment of OS and application patches. In order to detect and address emerging vulnerabilities in a timely manner, campus staff members should frequently monitor announcements from sources such as BugTraq, REN-ISAC, US-CERT, and Microsoft and then take appropriate action.
Both timely patch deployment and patch testing are important and should be thoughtfully balanced. Patches should be applied via a change control process. The ability to undo patches is highly desirable in case unexpected consequences are encountered. The capability to verify that patches were successfully applied is also important.<br />Vulnerability Scanning<br />The data center should implement a vulnerability scanning program, such as regular use of McAfee's Foundstone.<br />Compliance Reporting<br />Compliance Reporting informs all parties with responsibility for the data and applications how well risks are reduced to an acceptable level as defined by policy, standards, and procedures. Compliance reporting is also valuable in proving compliance with applicable laws and contracts (HIPAA, PCI DSS, etc.). Compliance reporting should include measures on:<br />How many systems are out of compliance.<br />Percentage of compliant/non-compliant systems.<br />How quickly a system returns to compliance once it is detected as out of compliance.<br />Compliance trends over time.<br />Physical Security<br />When planning for security around your Data Center and the equipment contained therein, physical security must be part of the equation. This would be part of a layered, defense-in-depth
security model. If physical security of critical IT equipment isn't addressed, it doesn't matter how long your passwords are or what method of encryption you are using on your network - once an attacker has gained physical access to your systems, not much else matters.<br />See the discussion of access control systems elsewhere in this document for a description of access control.<br /><insert diagram of reference model with key components as building blocks><br />Best Practice Components<br />Standards<br />ITIL<br />The Information Technology Infrastructure Library (ITIL) Version 3 is a collection of good practices for the management of Information Technology organizations. It consists of five components whose central theme is the management of IT services. The five components are Service Strategy (SS), Service Design (SD), Service Transition (ST), Service Operations (SO), and Service Continuous Improvement (SCI). Together these five components define the ITIL life cycle, with the first four (SS, SD, ST and SO) at the core and SCI overarching them. SCI wraps the core components, reflecting the need for each to continuously look for ways to improve the respective ITIL process.
<br />ITIL defines the five components in terms of functions/activities, concepts, and processes, as illustrated below:<br />Service Strategy<br />Main Activities | Key Concepts | Processes<br />Define the Market | Utility & Warranty | Service Portfolio Management<br />Develop Offerings | Value Creation | Demand Management<br />Develop Strategic Assets | Service Provider | Financial Management<br />Prepare Execution | Service Model | Service Portfolio<br />Service Design<br />Five Aspects of SD | Key Concepts | Processes<br />Service Solution | Four "P's": People, Processes, Products, & Partners | Service Catalog Management<br />Service Management Systems and Solutions | Service Design Package | Service Level Management<br />Technology and Management Architectures & Tools | Delivery Model Options | Availability Management<br />Processes | Service Level Agreement | Capacity Management<br />Measurement Systems, Methods & Metrics | Operational Level Agreement | IT Service Continuity Management<br /> | Underpinning Contract | Information Security Management<br /> | | Supplier Management<br />Service Transition<br />Processes | Key Concepts<br />Change Management | Service Changes<br />Service Asset & Configuration Management | Request for Change<br />Release & Deployment Management | Seven "R's" of Change Management<br />Knowledge Management | Change Types<br />Transition Planning & Support | Release Unit<br />Service Validation & Testing | Configuration Management Database (CMDB)<br />Evaluation | Configuration Management System<br /> | Definitive Media Library (DML)<br />Service Operation<br />Achieving the Right Balance | Processes | Function<br />Internal IT View versus External Business View | Event Management | Service Desk<br />Stability versus Responsiveness | Incident Management | Technical Management<br />Reactive versus Proactive | Problem Management | IT Operations Management<br />Quality versus Cost | Access Management | Application Management<br /> | Request Fulfillment | <br />Service Continuous Improvement<br />The 7-Step Improvement Process, used to identify vision and strategy and tactical and operational goals:<br />1. Define what you should measure.<br />2. Define what you can measure.<br />3. Gather the data. Who? How? When? Integrity of the data?<br />4. Process the data. Frequency, format, system, accuracy.<br />5. Analyze the data. Relationships, trends, according to plan, targets met, corrective actions?<br />6. Present and use the information: assessment summary, action plans, etc.<br />7. Implement corrective action.<br />ASHRAE<br />ASHRAE modified their operational envelope for data centers with the goal of reducing energy consumption. For extended periods of time, the IT manufacturers recommend that data center operators maintain their environment within the recommended envelope. Exceeding the recommended limits for short periods of time should not be a problem, but running near the allowable limits for months could result in increased reliability issues. In reviewing the available data from a number of IT manufacturers, the 2008 expanded recommended operating envelope is the envelope agreed upon as acceptable to all the IT manufacturers, and operation within this envelope will not compromise overall reliability of the IT equipment.<br />Following are the previous and 2008 recommended envelope data:<br /> | 2004 Version | 2008 Version<br />Low End Temperature | 20°C (68 °F) | 18°C (64.4 °F)<br />High End Temperature | 25°C (77 °F) | 27°C (80.6 °F)<br />Low End Moisture | 40% RH | 5.5°C DP (41.9 °F)<br />High End Moisture | 55% RH | 60% RH & 15°C DP (59 °F DP)<br /><Additional comments on the relationship of electro static discharge (ESD) and relative humidity and the impact to printed circuit board (PCB) electronics and component lubricants in drive motors for disk and tape.><br />Uptime Institute<br />Hardware Platforms<br />Servers<br />Server Virtualization<br />Practices<br />Production hardware should run the latest stable release of the selected hypervisor, with patching and upgrade paths defined and pursued on a scheduled basis, with each hardware element (e.g.
blade) dual-attached to the data network and storage environment to provide for load balancing and fault tolerance.<br />Virtual machine templates should be developed, tested and maintained to allow for consistent OS, maintenance and middleware levels across production instances. These templates should be used to support cloning of new instances as required and systematic maintenance of production instances as needed. <br />Virtual machines should be provisioned using a defined work order process that allows for an effective understanding of server requirements and billing/accounting expectations.<br />This process should allow for interaction between requestor and provider to ensure appropriate configuration and acceptance of any fee-for-service arrangements.<br />Virtual machines should be monitored for CPU, memory, network and disk usage. Configurations should be modified, with service-owning unit participation, to ensure an optimum balance between required and committed capacity.<br />Post-provisioning capacity analysis should be performed via a formal, documented process and repeated on a frequent basis. For example, a 4-vCPU virtual machine with 8 gigabytes of RAM that is using less than 10% of 1 vCPU and 500 megabytes of RAM should be adjusted to ensure that resources are not wasted.<br />Virtual machine boot/system disks should be provisioned into a LUN maintained in the storage environment to ensure portability of server instances across hardware elements. <br />To reduce I/O contention, virtual machines with high-performance or high-capacity requirements should have their non-boot/system disks provisioned using dedicated LUNs mapped to logical disks in the sto