ITIC 2009 Global Server Hardware and Server OS Reliability Survey

Published in: Technology, News & Politics

  1. 1. INFORMATION TECHNOLOGY INTELLIGENCE CORP. ITIC 2009 Global Server Hardware and Server OS Reliability Survey. July 2009. © Copyright 2009, Information Technology Intelligence Corp. (ITIC). All rights reserved. Other products and companies referred to herein are trademarks or registered trademarks of their respective companies or mark holders.
  2. 2. Executive Summary
     "Time is money"
     For the second year in a row, IBM AIX UNIX running on the Power or "P" series servers scored the highest reliability ratings among 15 different server operating system platforms, including Linux, Mac OS X, UNIX and Windows.
     Those are the results of the ITIC 2009 Global Server Hardware and Server OS Reliability Survey, which polled C-level executives and IT managers at 400 corporations from 20 countries worldwide. The results indicate that the IBM AIX operating system running on Big Blue’s Power servers (System p5s) is the clear winner; it offers rock solid reliability, besting all competing operating systems, including those running on Intel-based x86 machines. The IBM servers running AIX consistently score at least 99.99% uptime, or just 15 minutes of unplanned per server, per annum downtime (see Exhibit 1).
     Overall, the results showed improvements in reliability and patch management procedures, and an across-the-board reduction in per server, per annum Tier 1, Tier 2 and the most severe Tier 3 outages.
     • IBM AIX on the Power series System p5 and System p6 servers leads all vendors for both server hardware and server OS reliability. The IBM UNIX distribution recorded the fewest Tier 1, Tier 2 and Tier 3 unplanned server outages per year. IBM AIX running on the System p5s and newer p6s had less than one unplanned outage incident per server in a 12 month period. More impressively, the IBM servers experienced no severe Tier 3 outages.
     • Hewlett-Packard’s HP UX 11i running on the HP 9000 and Integrity servers also performed very well, though HP servers notch approximately 21 to 25 minutes more downtime than IBM servers, depending on model and configuration. HP UX 11i v.3 Update 4 on the HP 9000s averages 36 minutes of per server, per annum downtime, while HP UX 11i v.3 Update 4 on HP Integrity servers recorded 39 minutes of per server, per annum downtime.
     • Faster Patch Management.
       IT managers spend approximately 11 minutes applying patches to IBM servers running the AIX operating system, which is again the least amount of time spent patching any server or operating system. The open source Ubuntu distribution is a close second, with IT managers spending 12 minutes to apply patches, while IT managers in the Novell SUSE Enterprise, customized Linux distribution and Apple Mac OS X 10.x environments each spend a very economical 15 to 19 minutes applying patches.
     • Unplanned severe Tier 2 and Tier 3 Outages Decline. IBM also took top honors in another important category: IBM Power Series System p5 and p6 servers running AIX experience the lowest number of the more severe Tier 2 and Tier 3 outages combined of any server hardware or server operating system. The combined total of Tier 2 and Tier 3
  3. 3. outages accounted for just 19% of all per server, per annum failures in IBM network environments. HP UX on the 9000 and Integrity servers, Novell SUSE Linux Enterprise 11 and "other" Linux distributions were close behind, with combined Tier 2 + Tier 3 outages accounting for 24% to 25% of unplanned yearly downtime.
     • Novell SUSE Superiority. Among the Linux and Open Source server operating system distributions, both Novell SUSE Linux Enterprise 10 and 11 consistently achieved superior reliability ratings. In fact, Novell SUSE in a customized implementation had the lowest downtime (approximately 16 minutes per server/server OS, per annum) of any distribution, with the exception of IBM’s AIX on the Power Series. In their anecdotal responses and first person customer interviews, many IT managers specifically extolled the high level of integration and interoperability between their Novell SUSE Linux Enterprise and Microsoft Windows Server 2003 and Windows Server 2008 systems in heterogeneous networks.
     • Most Improved. Microsoft Windows Server 2003 and Windows Server 2008 showed the biggest improvements of any of the vendors. The Windows Server 2003 and 2008 operating systems running on Intel-based platforms saw a 35% reduction in unplanned per server, per annum downtime, from 3.77 hours in 2008 to 2.42 hours in 2009. The number of annual Windows Server Tier 3 outages also decreased by 31% year over year, and the time spent applying patches similarly declined by 35% from last year, to 32 minutes in 2009.
     • Apple Mac and OS X 10.x Competitive Enterprise Reliability. This year’s survey for the first time also incorporated reliability results for the Apple Mac and OS X 10.x OS platform. Over the past two to three years, the Apple Mac platform has made a comeback in corporate enterprises. The numbers of Mac G4 servers are modest in comparison to the more entrenched Windows, Linux and UNIX distributions.
       Nonetheless, they are making their presence known. IT managers report that reliability has generally been very good. The survey respondents indicated that the Apple Mac G4 servers are extremely competitive in an enterprise setting. IT managers spend approximately 15 minutes per server applying patches and record an average downtime of about 40 minutes per server, per annum. It is important to note that, at this point, the workloads of the G4 Macs are not comparable to those of the high end IBM, HP and Sun (now Oracle) UNIX systems or the customized Linux and open source distributions.
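The downtime figures quoted in the summary can be restated as availability percentages. The following is a minimal sketch, assuming a 365-day year; the function name `availability_pct` and the platform labels are hypothetical and are not part of the ITIC report. The input values are the per server, per annum figures cited above.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; assumes a non-leap year


def availability_pct(downtime_minutes: float) -> float:
    """Convert annual downtime in minutes to an availability percentage."""
    if not 0 <= downtime_minutes <= MINUTES_PER_YEAR:
        raise ValueError("downtime must be between 0 and one year of minutes")
    return 100.0 * (1.0 - downtime_minutes / MINUTES_PER_YEAR)


# Per server, per annum downtime figures quoted in the summary, in minutes.
reported = {
    "IBM AIX on Power": 15,
    "HP UX 11i on HP 9000": 36,
    "HP UX 11i on Integrity": 39,
    "Apple Mac OS X on G4": 40,
    "Windows Server (2009)": 2.42 * 60,
}

for platform, minutes in reported.items():
    print(f"{platform}: {availability_pct(minutes):.4f}%")
```

Under this assumption, every figure listed corresponds to better than 99.9% availability, and the 15-minute IBM AIX figure sits above the 99.99% mark.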
  4. 4. The intent of this Report is to quantify and qualify the reliability of 15 different server operating system platforms running on a variety of proprietary UNIX and Intel-based hardware platforms. This will allow organizations to more easily identify baseline reliability metrics associated with individual platforms in order to better determine and optimize their total cost of ownership (TCO), accelerate return on investment (ROI) and more efficiently manage risk.
  5. 5. Table of Contents
     Executive Summary
     Introduction
       Survey Methodology
       Survey Demographics
     Data & Analysis
     Conclusions
     Recommendations
       Recommendations for Corporate Customers
       Recommendations for Vendors
  6. 6. Introduction
     Server hardware and server operating system reliability is the foundation and bedrock upon which crucial applications, storage, security and third party utilities and management rest. The stability and health of the entire network infrastructure depend heavily on the server hardware and the operating systems that run on it. Server hardware and server operating system reliability are inextricably linked to the corporation’s ability to lower its TCO, accelerate ROI and reduce the risk factors that negatively impact performance.
     Information on specific reliability metrics allows businesses to calculate the real-time resources and monies needed to manage and maintain their various server hardware platforms and operating systems. It also enables them to determine whether their mission critical server hardware and operating system software are helping or hindering the business in meeting key service level agreements (SLAs) with customers, business partners and suppliers, as well as internally with the company’s own end users.
     The ITIC self-selecting reliability survey polled IT managers at 400 corporations worldwide on the annual amount and percentage of unplanned per server, per annum downtime experienced in the following 15 hardware and server OS environments:
     • IBM AIX on Power series System p5 and p6 servers
     • HP UX on the 9000
     • HP UX on Integrity servers
     • Sun Solaris UNIX on the SPARC Servers
     • Apple Mac OS X 10.5, 10.6 on G4 Macs
     • Novell SUSE Linux Enterprise on Intel x86 servers
     • Novell SUSE Linux Enterprise on Intel x86 servers
     • Red Hat Enterprise Linux on Intel x86 servers
     • Red Hat Enterprise Linux with customization
     • Windows Server 2003 on Intel x86 servers
     • Windows Server 2008 on Intel x86 servers
     • Ubuntu open source
     • Debian open source
     • Other Linux distributions (e.g. Mandriva, Turbo Linux)
     • Other Linux distributions with customization
     The survey data gives a detailed comparative breakdown of the percentage of Tier 1, Tier 2 and highest severity Tier 3 outages.
  7. 7. ITIC’s definition of server outages is as follows:
     Tier 1: These are typically minor, common, albeit annoying occurrences. A network administrator can usually resolve such incidents with less than 30 minutes of downtime for dependent users. Tier 1 incidents can usually be resolved by rebooting the server and rarely involve any data loss.
     Tier 2: These are moderate issues in which the server may be offline from one hour to four hours, or about a half-day. Tier 2 problems may require the intervention of more than one network administrator to troubleshoot, and they frequently affect the corporation’s end users and possibly business partners, customers and suppliers if they are attempting to access data on an affected corporate extranet.
     Tier 3: This is the most severe type of incident. Tier 3 outages last longer than four hours for network administrators and the company’s associated dependent users. Tier 3 outages almost always require a team of multiple network administrators to resolve. Data loss or damage to systems and applications may or may not occur. Another real threat associated with a protracted Tier 3 outage is potential lost business and potential damage to the company’s reputation.
     The length and severity of each of these incidents correspond to specific line item capital expenditure and operational expenditure costs for the business. Reliability, measured by downtime, can positively or negatively impact TCO and accelerate or delay the time it takes to realize ROI.
     Improvements or declines in reliability also mitigate or increase technical and business risks to the organization’s end users and external customers. The ability to meet service-level agreements (SLAs) hinges on server reliability, uptime and manageability.
     These are key indicators that enable organizations to determine which server operating system platform, or combination thereof, is most suitable.
     The survey data detailed the disparity in the number and severity of unplanned server outages and the amount of downtime, in minutes and hours, that businesses experience on the various Linux, Windows and UNIX platforms.
     The survey closely examined both the quantitative reliability statistics and the qualitative issues that positively or negatively impacted outage time. The ITIC survey queried corporate IT managers and C-level executives on myriad reliability-related functions, including:
     • The amount of downtime (minutes/hours) experienced per server, per annum
     • The amount of time spent patching each server
     • Whether the IT administrators apply updates via an automated group policy procedure or manually apply the patches to individual servers
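ITIC’s tier definitions can be read as duration thresholds. The sketch below is one possible encoding, not ITIC’s own method: the function names are hypothetical, the outage log is made-up illustrative data, and because the definitions leave the range between 30 minutes and one hour unassigned, the code assumes such outages count as Tier 2.

```python
from typing import Iterable


def classify_outage(duration_minutes: float) -> int:
    """Map an outage duration to a tier number (1, 2 or 3).

    Assumption: the source leaves 30 to 60 minutes undefined;
    this sketch folds that range into Tier 2.
    """
    if duration_minutes < 0:
        raise ValueError("duration cannot be negative")
    if duration_minutes < 30:        # Tier 1: less than 30 minutes
        return 1
    if duration_minutes <= 4 * 60:   # Tier 2: up to four hours
        return 2
    return 3                         # Tier 3: longer than four hours


def severe_share(durations: Iterable[float]) -> float:
    """Percentage of outages that fall in Tier 2 or Tier 3."""
    tiers = [classify_outage(d) for d in durations]
    if not tiers:
        raise ValueError("at least one outage is required")
    severe = sum(1 for tier in tiers if tier >= 2)
    return 100.0 * severe / len(tiers)


# Hypothetical outage log for one server, in minutes (not survey data).
outage_log = [5, 12, 20, 25, 90]
print(f"Tier 2 + Tier 3 share: {severe_share(outage_log):.0f}%")
```

The combined Tier 2 + Tier 3 share computed this way is the same kind of figure the report uses when it compares platforms on severe outages.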
  8. 8. On average, individual corporate Linux, Windows and UNIX servers experience from zero to approximately two failures per server per year. This results in anywhere from 20 minutes (IBM AIX running on p5 and p6 Power servers) to 4.3 hours (Debian open source) of annual downtime for each server. Windows Server 2008 servers experienced a total of just under three unplanned yearly Tier 1, Tier 2 and Tier 3 outages. However, the necessity of taking many of the Windows Servers offline to apply monthly patches and then reboot resulted in Windows Server 2008 machines being offline for just under two and a half hours each year. Still, this is a 35% reduction from the 3.77 hours of downtime experienced by Windows Server 2008 machines in last year’s ITIC reliability survey.
     Among the Linux distributions, Novell SUSE Enterprise exhibited consistent reliability reminiscent of the late 1980s and 1990s, when Novell NetWare was famous for running several years (in some cases as long as nine years) without experiencing a failure or the need to reboot. This can be attributed to the stability of the Novell distribution, the experience of the SUSE engineers and the long experience of many IT managers who came from the NetWare environment.
     Novell also inked an interoperability and technical service and support agreement with Microsoft two and a half years ago, which also served to improve reliability.
     The open source Ubuntu distribution also scored some impressive reliability gains as it continues to gain in popularity and deployments.
     Overall, these survey responses provide crucial, comparative reliability metrics that enable customers to make informed choices about which server hardware and server operating system, or combination thereof, best suits their specific business and budget needs.
     Survey Methodology
     ITIC conducted the 2009 Global Server Hardware and Server OS Survey, an independent Web-based survey that included multiple-choice questions and essay responses, from March through July 2009. ITIC polled C-level executives and IT managers at 400 corporations worldwide.
     ITIC analysts supplemented the Web survey by conducting two dozen first-person customer interviews. ITIC conducted additional interviews with customers in October 2009 and updated the Report with specific information on server downtime statistics. The anecdotal data obtained from these interviews validates the survey responses and provides deeper insight into the challenges confronting businesses in both the immediate and long term.
     To deliver the most unbiased, accurate information, ITIC did not accept any vendor sponsorship money for the online poll or the subsequent first-person interviews conducted in connection with this project. ITIC employed authentication and tracking mechanisms to prevent tampering and to prohibit multiple responses by the same parties.
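The year-over-year Windows Server figure can be checked with a one-line calculation. This is a minimal sketch using the two downtime values quoted in the report (3.77 hours in 2008, 2.42 hours in 2009); the helper name `percent_reduction` is hypothetical.

```python
def percent_reduction(before: float, after: float) -> float:
    """Relative reduction from `before` to `after`, as a percentage."""
    if before <= 0:
        raise ValueError("the baseline value must be positive")
    return 100.0 * (before - after) / before


# Annual per server downtime for Windows Server, in hours.
reduction = percent_reduction(3.77, 2.42)
print(f"Downtime reduction: {reduction:.1f}%")
```

The result works out to roughly 35.8%, which is consistent with the 35% reduction the report cites.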
  9. 9. Survey Demographics
     Companies of all sizes and all vertical markets were represented in the survey. Respondents came from companies ranging from small and medium businesses (SMBs) with fewer than 50 workers to large enterprises with more than 100,000 employees.
     Roughly 33% of the survey respondents came from the SMB segment with 1 to 100 employees; 12% of those polled were from midsize companies with 100 to 500 employees; 14% were drawn from corporations employing 500 to 1,000 employees; and 41% of respondents worked in large enterprises with 1,000 to more than 100,000 workers.
     The survey was global in scope. Approximately 85% of respondents came from North America. The remaining 15% hailed from more than 20 countries across Europe, Asia, Australia, New Zealand, South America and Africa.
     Data & Analysis
     Server hardware and server operating system reliability has improved markedly in the last five years.
     When ITIC began conducting reliability research and surveys, our original definition of unplanned downtime was an unexpected external or internal incident that caused the server hardware and/or the server operating system software to spontaneously fail or freeze, thereby disrupting network operations and requiring remediation efforts and a reboot. Depending on the seriousness of the incident, the downtime may also have resulted in lost or damaged data.
     However, it quickly became apparent from the anecdotal survey comments and during our first person customer interviews that IT managers and network administrators had a broader definition of what constituted downtime.
     As far as IT departments are concerned, anything that causes them to take the server offline, regardless of the cause, is unplanned downtime. Included in this category are instances of vendors releasing an unanticipated patch to fix a technical bug or security vulnerability. Such an
Such anoccurrence does not qualify as unplanned downtime in the narrowest definition of the term;network administrators oftentimes do not make that distinction. To them downtime is downtimebecause it disrupts their routine and may also impact daily operations because it means the ITdepartment must devote time to remedial issues that would have been spent performing other ITchores. And in some network environments like Windows, it’s still necessary to take the serversdown, apply the patch and perform a hard reboot.Time very literally equates to money. The economic downturn has forced companies to cut staff,put network and software upgrades on hold, decimated IT departments and has severely reducedthe training and recertification for network administrators.© Copyright 2009, Information Technology Intelligence Corp. (ITIC) All rights reserved.Other products and companies referred to herein are trademarks or registered trademarks of their respective companies or mark holders. Page 9
  10. 10. A recent ITIC survey that polled 250 corporations worldwide in October 2009 found that 47% of businesses had budget cuts within the past 12 months. That number was even greater for companies with over 500 end users; 64% of large enterprises experienced budget cuts. Consequently, 84% of the respondents reported that their IT departments simply pick up the slack and work longer and harder.
     Downtime by the Numbers
     In the early days of networks, corporate enterprises considered 99% uptime to be an adequate reliability standard. Not so in 2009. An ITIC survey of 250 enterprises conducted in October found that only 14% of survey respondents consider 99% uptime adequate for their most mission critical, line of business (LOB) applications. Another 14% said that 99.9%, or three nines, met their reliability needs. A two-thirds majority (66%) of those polled, however, said their network environments require 99.95%, 99.999% or greater reliability for their most mission critical LOB applications.
     It’s easy to see why when you correlate the uptime percentages to actual downtime:
     99% = average unplanned downtime of one hour and 40 minutes per week
     99.9% = average unplanned downtime of 45 minutes per month
     99.95% = average unplanned downtime of 22 minutes per month
     99.999% = average unplanned downtime of 5 1/2 minutes per year
     Taken in this context, it’s easy to understand how the ongoing economic crisis has cast renewed emphasis on server and server operating system reliability. Businesses of all sizes and across all vertical markets are extremely risk averse. IT departments grapple daily with the reality of keeping networks up and running in the face of cost cuts, layoffs and fewer resources. Businesses and their IT departments are under pressure to maximize server hardware and server operating system uptime in order to realize the greatest economies of scale and ensure that their server hardware, server operating systems and the crucial business applications and services that run on them are available to end users, corporate clients, business partners and suppliers. A server outage of even a few minutes’ duration can disrupt network operations and result in lost data, steep monetary losses and damage to a company’s reputation.
     Reliability Then and Now
     The first generations of server hardware and server operating system software platforms, introduced in the mid-to-late 1980s, were proprietary. Network administrators typically became experts in a particular vendor’s platform. The 1.0 versions of new hardware and software products
  11. 11. from 10 to 20 years ago were also rife with bugs. It typically took from six months to a year for the vendors to work the kinks out and achieve an acceptable level of stability, and for IT managers to gain sufficient expertise and knowledge, resulting in higher levels of uptime. It is also worth noting that two decades ago, businesses were not as wholly dependent on their networks as they are today.
     In the 1990s, 99% reliability was considered an acceptable industry standard. That is no longer the case; 99% uptime is the equivalent of over 80 hours of annual per server downtime. ITIC’s separate 2009 Global Application Availability Survey, conducted in April, found that eight out of 10 of the 300 businesses polled said that their major business applications require higher availability rates than they did two or three years ago. However, nearly three-quarters of companies (72%) are unable to quantify the cost of downtime or the impact that unplanned reliability outages have on the business. Among the other 2009 Global Application Availability survey findings:
     • Nearly two-thirds (61%) of organizations are unsure of how to estimate the impact of downtime on the business or do not even attempt to track the losses associated with application downtime and reliability.
     • Two out of five firms (41%) said they require conventional 99% to 99.9% application availability; 29% said they needed 99.95% or 99.99% uptime; while 7% of respondents indicated they need continuous availability of 99.999% or 99.9999%.
     • Just under half (49%) of companies lack the budget to purchase additional third party software or hardware availability technology. This places more of an onus on the underlying server hardware and server OS to deliver high reliability.
     The responses from the ITIC 2009 Global Application Availability Survey underscore the crucial importance of highly reliable server hardware and server operating systems.
     If the servers, server OS and related applications are unavailable for any reason, business and daily operations grind to a halt, with sometimes catastrophic results.
     The demand for server hardware, server OS and application availability has grown, particularly with the emergence of new technologies like cloud computing and virtualization. Corporations need to ensure that reliability keeps pace. To quantify the reliability statistics: 99.99% uptime equates to approximately 53 minutes of per server, per annum downtime.
     Today’s networks demand near perfect reliability; corporations deem any downtime anathema to their business operations. This is particularly true for companies in vertical markets such as banking and finance, stock exchanges, insurance, healthcare and legal, whose businesses are based on intensive data transactions. A server crash of even 15 to 30 minutes’ duration can cost a company anywhere from tens of thousands to tens of millions in lost business and remediation efforts. Zero downtime, or as close to it as is humanly and technologically possible, is the obvious goal and Holy Grail of reliability.
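The relationship between an uptime percentage and yearly downtime is simple arithmetic. The sketch below is illustrative only: it assumes a 365-day year and continuous (24x7) operation, and the function name `annual_downtime_minutes` is hypothetical rather than anything defined by ITIC.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; assumes a non-leap year


def annual_downtime_minutes(uptime_pct: float) -> float:
    """Minutes of downtime per year implied by an uptime percentage."""
    if not 0 <= uptime_pct <= 100:
        raise ValueError("uptime must be a percentage between 0 and 100")
    return (100.0 - uptime_pct) / 100.0 * MINUTES_PER_YEAR


# Uptime levels discussed in the report.
for pct in (99.0, 99.9, 99.95, 99.99, 99.999):
    minutes = annual_downtime_minutes(pct)
    print(f"{pct}% uptime -> {minutes:.1f} min/year ({minutes / 60:.2f} h)")
```

Under these assumptions, 99% uptime corresponds to about 87.6 hours of downtime a year, 99.95% to about 4.4 hours, 99.99% to about 53 minutes and 99.999% to about 5.3 minutes.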
  12. 12. While system flaws will always be present in some fashion, the survey found that, at present, server hardware and server OS reliability is also inextricably linked with several other crucial factors and components. They are:
     • Integration and interoperability is crucial. Over 85% of businesses with 300+ end users have myriad types of server hardware and three different operating systems present in their environment. Heterogeneity and openness are essential to the reliability of today’s networks. The 2007 wide ranging, non-exclusive interoperability pact between Microsoft and Novell was extremely well received and a huge boon for the respective customer bases of both firms. As part of the deal, Microsoft and Novell teamed up to provide joint sales, technical service and support to deliver plug and play interoperability between the Windows and SUSE Linux Enterprise environments.
     • Workloads. The applications themselves are growing in size and complexity. It is therefore imperative that the server hardware be robust enough to handle the increased demands of new classes of applications such as streaming audio and digital and highly complex processes. A robust server configuration that includes new multi-core and multi-threading technologies, maximum memory and hard drive capacity and the fastest processors will perform better than old, outmoded and inadequate equipment. The survey showed, for example, that the high reliability ratings for IBM and HP were no fluke: the powerful IBM System p5 and System p6 Power Series servers and the HP 9000 and Integrity Servers achieved very high reliability (99.99% and 99.999% uptime) while carrying workloads that were 30% to 40% greater than comparable x86-based machines.
     • Experience of the IT managers.
       Errors by neophyte, inexperienced network administrators and IT managers who have not been able to get trained and re-certified on the latest technologies are another major factor that contributes to extended downtime and adversely impacts system reliability.
     • Patch management. The amount of time spent applying patches is one of the biggest contributors to system downtime; this is especially true of security patches, as we see in Exhibit 2 below.
  13. 13. IBM AIX administrators spent the least amount of time (11 minutes) applying patches. They were followed closely by the Ubuntu open source distribution, Apple Mac, niche market "other" customized Linux distributions and Novell SUSE; administrators in each of these environments spent on average from 12 to 15 minutes applying patches.
     This speaks to the underlying stability of these environments as well as the experience of the administrative staff. Typically, UNIX installations (notably IBM’s AIX), as well as Novell SUSE Enterprise and Apple Mac, tend to be stable, static environments with experienced, hands-on network administrators who are familiar with the most minute details of the bits and bytes of their systems. Fast patch management positively impacts reliability.
     The feedback from the survey respondents reinforced the importance of being able to receive and download patches quickly once a bug has been identified. Corporate IT managers noted the significant strides that had been made by all of the vendors across the board in recent years, though they still voiced some concerns. Among the anecdotal comments:
  14. 14. • "IBM has done a wonderful job of keeping our AIX systems up and ready. We rarely if ever have reliability issues," said an IT manager at a Midwest financial institution.
     • "Patch management automation has significantly reduced both the manpower required to apply patches and the downtime associated with patch management over the last three years," noted an IT administrator at a large health care facility in the northeast.
     • "Novell SUSE Linux Enterprise is always very up-to-date on patches; Zenworks is nice and we never have a problem," said a longtime Novell user at a large healthcare provider in the Southwest.
     • "The amount of time it takes to identify vulnerability and when the vendors release the patch, has decreased significantly, but if the bug is a dangerous one, we still worry," according to a chief technology officer at a midsized retailer.
     • "Our patches are tested at our corporate headquarters location and then distributed as needed to the various remote locations, downloaded to a local Microsoft Systems Management Server (SMS) and automatically downloaded via group policy to each workstation and server. The process is accelerated and it’s relatively painless for the IT department," said an administrator at a large West Coast enterprise.
     • "Our patch management dramatically improved with SUSE 10.2 and SUSE 11," noted another veteran Novell administrator. "We have no problems now to speak of."
     • "We currently use Group Policy to download patches on each server, but we manually apply them. So it takes us about 15 minutes to patch each Windows server. This means that each server takes less than 15 min to patch. On a whole, other than hardware issues, we’ve averaged less than two failures per server, per year on our Windows Server 2003 systems," said an IT manager at a large East coast insurance firm.
  15. 15. Serious Tier 2 + Tier 3 Incidents Decline
     The survey results also showed a discernible decline in the number and percentage of the more serious Tier 2, Tier 3 and combined Tier 2 + Tier 3 incidents, according to Exhibit 3 below.
  16. 16. Once again, IBM AIX on the Power Series System p5 and p6 servers recorded the smallest percentage of combined Tier 2 + Tier 3 incidents, at 19%. The other UNIX and Linux distributions, including HP UX 11i v3 on the HP 9000 and HP Integrity, Novell SUSE Linux Enterprise and Sun Solaris, also scored well, with the more serious aggregate Tier 2 + Tier 3 outages accounting for 24% to 25% of total outages. All of the aforementioned distributions managed to lower their scores from the similar survey in 2008.
Microsoft’s Windows Server 2003 on x86-based servers came in with a very respectable 30% of reliability outages in the Tier 2 + Tier 3 categories, a reduction of 11 percentage points from the 41% reported by respondents to the 2008 ITIC Global Reliability Survey.
One of the most impressive statistics was that IBM AIX Power Series System p5 and System p6 servers notched no severe Tier 3 incidents whatsoever. Again, this achievement is even more impressive when one considers that these systems typically run higher workloads than their x86-based counterparts, as shown in Exhibit 4.
  17. 17. HP’s UX 11i v.3 Update 4 on the HP 9000 and Integrity servers, Sun Solaris on SPARC servers (now owned by Oracle), Novell SUSE, Red Hat Enterprise Linux and Apple Mac OS X 10.5.6 on the G4 Macs also recorded very few Tier 3 outages: fewer than one each per server, per annum.
The most common Tier 1 incidents, which usually last between 10 and 30 minutes, also showed across-the-board reductions among all server hardware and server operating system platforms, as we see in Exhibit 5.
  18. 18. In the Tier 1 category, IBM also came out on top with less than one-half of one Tier 1 incident per AIX Power Series System p5 and System p6 server per annum. This equates to about four to seven minutes of downtime per server, per year.
In fact, all of the server hardware and server OS environments each racked up fewer than one Tier 1 outage per server, per annum.
The results were similarly encouraging for the average number of Tier 2 outages, as we see in Exhibit 6 below.
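Per-server downtime figures like the Tier 1 numbers above can be sanity-checked by multiplying an incident rate by a typical incident duration. The following is a minimal sketch of that arithmetic, not the survey's methodology: the function names and the sample inputs (0.5 incidents per year, 12 minutes per incident) are hypothetical and are not taken from the survey data.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def expected_annual_downtime_minutes(incidents_per_year: float,
                                     minutes_per_incident: float) -> float:
    """Expected downtime per server per year = incident rate x incident duration."""
    if incidents_per_year < 0 or minutes_per_incident < 0:
        raise ValueError("inputs must be non-negative")
    return incidents_per_year * minutes_per_incident


def availability_percent(downtime_minutes: float) -> float:
    """Convert annual downtime in minutes into an availability percentage."""
    return 100.0 * (1.0 - downtime_minutes / MINUTES_PER_YEAR)


# Hypothetical inputs for illustration only (not survey results).
downtime = expected_annual_downtime_minutes(incidents_per_year=0.5,
                                            minutes_per_incident=12)
print(f"{downtime:.1f} minutes/year, {availability_percent(downtime):.4f}% available")
```

The same two functions can be applied to a company's own outage log to compare its servers against published per-server, per-annum figures.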
  19. 19. Conclusions
In summary, the ITIC 2009 Global Server Hardware and Server OS Reliability Survey findings indicate that all of the server operating system platforms have achieved a high degree of reliability. However, IBM AIX running on the p5 and p6 Power servers leads the UNIX distributions and is the clear winner, followed closely by HP, Novell SUSE Linux Enterprise and the Ubuntu open source distribution.
These results are especially noteworthy in light of the ongoing economic crunch, which has caused companies to cut their budgets and reduce IT staff. As they strive to accomplish more with fewer resources, IT departments must rely even more heavily on their vendors to deliver more reliable servers and server operating system software.
To reiterate, time is literally money. Even a few minutes of downtime can cost companies thousands or millions of dollars and cause business operations to grind to a halt. Downtime can also adversely affect a company’s relationship with its customers, business suppliers, partners and internal end users. Reliability, or the lack thereof, can potentially damage a company’s reputation and result in lost business.
Hence, corporations must have confidence in the reliability and stability of the underlying server hardware and server OS platforms.
The advances in technology are encouraging. Now companies must tackle other equally important and challenging issues to ensure the highest level of uptime and reliability. Close attention must be paid to integration and interoperability, patch management, documentation and getting the necessary training and certification for the appropriate IT managers. The most bulletproof hardware and software platforms can be undone by human error. It’s equally important that companies find the funds to stay as current as possible on their server hardware and server OS software. Performance will suffer if the server configuration is old and inadequate.
Recommendations
Server hardware and server operating system reliability has improved vastly since the 1980s and 1990s, and even in just the last two to three years. While technical bugs still exist, their number, frequency and severity have declined significantly.
With few exceptions, common human error poses a bigger threat to server hardware and server operating system reliability than technical glitches.
  20. 20. Crucial TCO metrics such as reliability, performance, security and management ultimately depend as much on each firm’s specific implementation as they do on the properties of the server and server OS technology itself. There are inherent dependencies between the underlying capabilities of a particular server operating system and an individual corporation’s ability to adhere to best deployment practices with respect to training, testing and configuration. The reliability, security and manageability of even the most hardened server and server operating system are easily compromised by human error.
A company that does not restrict physical access to the server is asking for trouble. Similarly, any firm that does not enact and enforce strong usage and security policies risks compromising the reliability and integrity of its server hardware and server OS environment. The reliability of the server environment can also be easily undone or seriously compromised by such actions as: a bad configuration; the use of incompatible or unapproved memory and logic chips, hardware, peripherals and software drivers; overclocking machines; failing to apply necessary patches; failing to upgrade or retrofit inadequate or obsolete servers and operating systems; and taxing server and software resources beyond their capabilities.
Recommendations for Corporate Customers
To optimize uptime and reliability, ITIC advises corporations to:
• Regularly analyze and review configurations, usage and performance levels. This will enable companies to determine whether or not their current server and server OS environment allows them to achieve optimal reliability.
• Adopt formal SLAs. Service level agreements enable organizations to define acceptable performance metrics. Companies should meet with their vendors and customers on at least an annual basis to ensure the terms are met.
• Define, measure and monitor reliability and performance metrics. It is imperative that companies measure component, system, server hardware, server OS, desktop OS, security, network infrastructure, storage and application performance. Keep a continuous log of planned and unplanned downtime throughout the enterprise.
• Regularly track server and server OS reliability and downtime. Keep accurate records of outages and their causes. Segment the outages according to their severity and length, e.g. Tier 1, Tier 2 and Tier 3. The appropriate IT managers should also keep detailed logs of remediation efforts in the event of an outage. These logs should include a full account of remediation activities, specifying how the problem was solved, how long it took and which staff members participated. They should also list the monetary costs as well as any material impact on the business, its operations and its end users. This will prove an invaluable resource should the problem recur. It may also make the
  21. 21. difference in containing or curtailing the reliability-related incident, saving precious time for the IT department, the end users and corporate customers.
• Calculate the cost of unplanned downtime. Companies should determine the average cost of minor Tier 1 outages. They should also keep more detailed cost assessments of the more serious unplanned Tier 2 and Tier 3 incidents. It’s essential for businesses to know the monetary cost of each outage, including IT and end user salaries spent on troubleshooting and any lost productivity, as well as the impact on the business. C-level executives and IT managers should also pay close attention to whether the company’s reputation suffered as a result of a reliability incident; whether any litigation ensued; whether customers, business partners and suppliers were impacted (and at what cost); and at least try to gauge whether the company lost business or potential business.
• Ensure that your organization has robust server hardware that can adequately handle the OS and application workloads. The server hardware (standalone, blade, cluster, etc.) and the server operating system are inextricably linked. To achieve optimal performance from both components, corporations must ensure that the server hardware is robust enough to carry both the current and anticipated workloads for the lifecycle of both.
• Compile a list of best practices and adhere to them. This is absolutely essential. Chief technology officers (CTOs), software developers, engineers, network administrators and managers should have extensive familiarity with the products they currently use and are considering. Check and adhere to your vendors’ list of approved, compatible hardware, software and applications. Software developers and network administrators must obey the rules. That means avoiding ill-advised and iffy practices such as overclocking server and desktop hardware or allowing unskilled or neophyte administrators to make changes to the registry. All of these actions can lead to serious reliability problems.
• Don’t skimp on training and recertification for IT administrators, software developers and engineers. In these days of budget cuts, it’s common practice to eliminate monies that were formerly earmarked for training. ITIC understands that money is tight. If you can’t afford the time or expense to re-certify your entire IT department, designate the most experienced or appropriate IT staffer to take the course, even if it’s only an online course, and allow that person to train additional appropriate managers.
• Perform regular asset management testing. Schedule asset management reviews on a yearly, biannual or quarterly basis, as needed. This will assist your company in remaining current on hardware and software and help you to adhere to the terms and conditions of licensing contracts. All of these issues influence network reliability. Regular reviews also leave organizations better equipped to meet their SLA requirements and maintain peak performance and reliability.
• Manual vs. Automated Group Policy Patch Management. IT managers, particularly in high-end UNIX environments and in corporations whose environments feature a high degree of customization, will continue to perform manual patch management.
  22. 22. • Keep your software updated with the latest necessary patches and upgrades. You don’t have to apply every patch, but it’s wise to keep track of which patches are crucial to the network’s health. Construct and adhere to a regular schedule to apply patches, preferably on a monthly basis. This will help the company avoid potentially nasty surprises.
• Standardize legacy and future hardware, server OS and application environments as much as possible. ITIC survey data indicates that standardization (that is, following a prescribed configuration and version for the company’s hardware, software and network infrastructure components) can lower TCO by 15%. Standardization benefits all users, including organizations that have custom configurations.
• Note that custom software implementations require the highest level of expertise. Any firm that elects to customize its Linux or open source server operating system distribution should either employ guru-level administrators or contract with a systems integrator or outsourcer with the appropriate expertise.
• Automated patch management applied via Group Policy vs. manual patching. Companies should also regularly review whether it is feasible for the firm to migrate away from manual patch management. Collecting this information may seem to be a chore at first, but it will be an invaluable source of information that can guide the company to lower its TCO and improve its ROI.
Recommendations for Vendors
It is a buyer’s market and is likely to remain so for the foreseeable future. Competition among vendors is intense because businesses have a wide array of server hardware and server operating system platforms from which to choose. In order to retain the current customer base and attract new corporate customers, all of the vendors must strive to improve the features, performance, reliability and security of their respective server hardware and server OS software. Additionally, ITIC advises vendors to:
• Embrace Interoperability and Integration. The survey data indicates that backward-compatibility and integration issues with other hardware, server OS, applications and third party tools and utilities pose a significant potential threat to the underlying stability of the network environment.
• Provide Explicit Guidance around Patches and Patch Management. Patches vary according to the importance and severity of the fix or update, and by the number of patches in a formal release. Data ITIC obtained from anecdotal essay comments and first person customer interviews underscore the need for vendors to issue patches in an efficient, expeditious manner and to provide full transparency on the nature and severity of all bugs. Many IT managers expressed frustration and confusion with the patch management process, which was sometimes cumbersome. IT managers also noted that oftentimes they were unsure of which patches were crucial versus optional. ITIC advises vendors to deliver specific recommendations and instructions on the download process, since patch management is a crucial element of IT management that can positively or negatively impact reliability.
  23. 23. • Provide the latest technical documentation. Ready access to clear, concise technical guidelines and detailed documentation has never been more important. The economic downturn forced many companies to cut staff. Time and money are scarce or non-existent for training and re-certification of IT administrators. It is therefore crucial that vendors pick up the slack and publicize and disseminate technical “how to” guidelines via their respective Websites, Emails and Webinars.
• Vendors should also actively work with third party ISVs to assist in resolving driver and application compatibility issues. As we noted above, integration and interoperability issues are a top priority for IT departments that wish to maintain a high level of reliability. While many of the largest third party ISVs do an exemplary job of ensuring that their applications and drivers are certified to work with new server hardware and server OS releases, many smaller and niche ISVs, particularly in specific verticals like finance, legal and healthcare, in many instances lack the necessary resources and funds to support new releases. Vendors should poll their customers on which third party applications, drivers and utilities are crucial and, when necessary, assist ISVs in providing the necessary compatibility.
• Work with partners to provide expanded access to discounted certification and online training courses. One of the biggest challenges confronting IT departments today is finding the money and sparing the time to get the appropriate administrators re-trained and certified on the latest server hardware and server OS software.
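As an illustration of the earlier advice to corporate customers to log outages, segment them by tier and calculate their cost, the following minimal sketch shows one way such a log could be kept. It rests on stated assumptions: the report describes Tier 1 incidents as usually lasting 10 to 30 minutes, but the Tier 2 / Tier 3 boundary used here, the field names, the hourly rates and the sample records are all hypothetical, and the cost covers only salaries for the duration of the outage, not lost business or reputational damage.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical tier boundaries in minutes. Only the 30-minute Tier 1 ceiling
# reflects the report's description; the Tier 2 ceiling is an assumption.
TIER1_MAX_MINUTES = 30
TIER2_MAX_MINUTES = 240


@dataclass
class Outage:
    server: str
    minutes: float        # length of the unplanned outage
    cause: str            # short description for the remediation log
    it_staff: int         # IT staff who worked on remediation
    users_affected: int   # end users unable to work during the outage


def tier(outage: Outage) -> int:
    """Classify an outage as Tier 1, 2 or 3 by its duration."""
    if outage.minutes <= TIER1_MAX_MINUTES:
        return 1
    if outage.minutes <= TIER2_MAX_MINUTES:
        return 2
    return 3


def labor_cost(outage: Outage, it_hourly_rate: float, user_hourly_rate: float) -> float:
    """Salary cost of troubleshooting plus lost end-user productivity."""
    hours = outage.minutes / 60
    return hours * (outage.it_staff * it_hourly_rate
                    + outage.users_affected * user_hourly_rate)


def summarize(outages, it_hourly_rate: float, user_hourly_rate: float) -> dict:
    """Aggregate outage count, total minutes and labor cost per tier."""
    summary = defaultdict(lambda: {"count": 0, "minutes": 0.0, "cost": 0.0})
    for outage in outages:
        bucket = summary[tier(outage)]
        bucket["count"] += 1
        bucket["minutes"] += outage.minutes
        bucket["cost"] += labor_cost(outage, it_hourly_rate, user_hourly_rate)
    return dict(summary)


# Made-up sample records for illustration only.
log = [
    Outage("db01", 15, "failed patch", it_staff=1, users_affected=20),
    Outage("web02", 120, "disk failure", it_staff=2, users_affected=50),
]
report = summarize(log, it_hourly_rate=60.0, user_hourly_rate=30.0)
for tier_number in sorted(report):
    print(tier_number, report[tier_number])
```

Keeping the tier thresholds as named constants makes it easy to align the log with whatever tier definitions an organization or its SLA actually uses.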
