
Fault tolerant platforms for manufacturing applications


By Greg Gorbach
ARC Insights, September
Enterprise and Manufacturing Strategies for Industry Executives

KEYWORDS
Fault Tolerance, High Availability, Cluster, Collaborative Manufacturing

SUMMARY
New, low-cost technology for fault-tolerant platforms is now available for Microsoft Windows 2000 environments. Manufacturers should revisit some old assumptions about where they might benefit from deploying these platforms. Collaboration puts a premium on real-time manufacturing information, and these systems can help ensure that the information is always available. Next generation automation systems, production management systems, business systems, and collaborative systems can all benefit from this technology.

The cost of the new fault tolerant systems has fallen so far that manufacturers must rethink ensuring the availability of their critical information.

ANALYSIS
The first approach is the fully replicated, fault-tolerant hardware solution from Stratus Computer Systems, with duplicate components operating in lockstep. In the event of a component failure, there is no interruption in processing, no lost data, and no slowdown in performance. The second approach, offered by Marathon Technologies, isolates all I/O from both the user operating system and the application by placing these tasks on different computers connected through proprietary interface cards, software, and high speed interconnect.

Comparison of Fault Tolerant Solutions

Description                 Stratus              Marathon                 Cluster
Availability                [illegible]          [illegible]              [illegible]
Recovery Time               Zero                 Milliseconds             Minutes
Copies of OS                [illegible]          Multiple                 Multiple
Symmetric Multi-Processing  Available            No                       Available
System Operation            Single System Image  Split Architecture       Multi-System Cluster
Implementation              No work required     Integrate FT Components  Script Development and Testing
Disaster Tolerance          3rd Party            Available                3rd Party
Single Support Contact      Yes                  3rd Party                3rd Party

Beyond Clusters
While the traditional clustering approach to fault tolerance does provide for enhanced availability, there are significant limitations. Cluster solutions do not provide fault tolerance (failure and repair/recovery is transparent to the user), only failover (a backup system automatically restarts the applications and logs on the users). Implementation requires the development, testing, and support of custom failover scripts, licensing and installation of multiple copies of software, and possibly application modifications for a cluster environment. In the event of a hardware failure, a cluster failover always loses all memory contents, and several minutes will be required to recover. Cluster solutions offer 99.9 percent availability (almost 9 hours down per year), but fault tolerant solutions offer 99.999 percent availability (about 5 minutes down per year).

Hardware Fault Tolerance
The first requirement for high availability systems is hardware fault tolerance. Stratus and Marathon each take a different approach.

Stratus ftServer
ftServer uses standard Intel server components and designs, but Stratus designs its own motherboard (using standard Intel server design guidelines), removes the PCI I/O, and adds fault detection logic that is key to fault isolation in a DMR configuration. The system contains two motherboards for Dual Modular Redundancy (DMR) or three motherboards for Triple Modular Redundancy (TMR). All motherboards run in lockstep, using a single system clock and redundant clock cards. Fault-detection and isolation logic (a custom ASIC) compares I/O output from all motherboards. DMR systems rely on fault-detection logic on each motherboard to see which is in error. If no motherboard error is signaled, a software algorithm decides which board to remove. In a TMR system, 3-way voting is used to isolate the failed board.
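
The fault isolation rules just described can be illustrated with a small sketch. This is a conceptual model only, not Stratus code: in ftServer the comparison is done by a custom ASIC, and the report does not describe the software algorithm used when no DMR board signals an error. The function names and the `tie_breaker` parameter are hypothetical stand-ins for that unspecified logic.

```python
from collections import Counter
from typing import Callable, Hashable, Optional, Sequence


def isolate_tmr(outputs: Sequence[Hashable]) -> Optional[int]:
    """3-way voting: return the index of the board that disagrees, or None."""
    if len(outputs) != 3:
        raise ValueError("TMR voting needs exactly three outputs")
    majority, votes = Counter(outputs).most_common(1)[0]
    if votes == 3:
        return None  # all boards agree, nothing to isolate
    if votes == 1:
        raise RuntimeError("all three outputs differ; no majority to vote with")
    # Exactly one board differs from the two that agree.
    return next(i for i, out in enumerate(outputs) if out != majority)


def isolate_dmr(
    outputs: Sequence[Hashable],
    error_flags: Sequence[bool],
    tie_breaker: Callable[[], int] = lambda: 0,  # hypothetical placeholder
) -> Optional[int]:
    """Two boards: rely on each board's own fault-detection flag."""
    if len(outputs) != 2 or len(error_flags) != 2:
        raise ValueError("DMR isolation needs exactly two boards")
    if outputs[0] == outputs[1]:
        return None  # outputs match, nothing to isolate
    flagged = [i for i, flag in enumerate(error_flags) if flag]
    if len(flagged) == 1:
        return flagged[0]  # the board that signaled its own error
    # No single board is clearly at fault: defer to a software decision
    # (assumed here; the real algorithm is not given in the report).
    return tie_breaker()


if __name__ == "__main__":
    print(isolate_tmr([0x2A, 0x2A, 0x2B]))             # board 2 is outvoted
    print(isolate_dmr([0x2A, 0x2B], [False, True]))    # board 1 flagged itself
    print(isolate_dmr([0x2A, 0x2B], [False, False]))   # falls back to tie_breaker
```

The sketch shows why TMR needs no tie-breaking step: with three boards, a single faulty output is always outvoted, whereas two boards that disagree need extra evidence to decide which one to remove.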
ftServer runs a single copy of all software, resulting in lower licensing costs and simple administration.

[Figure: Stratus' ftServer Architecture Ensures Zero Switchover Time, No Single Point of Failure, and a Single Software Image. Panels: DMR and TMR, each showing lockstep CPUs (1-N way SMP), memory, chipset, and fault detection and isolation logic for disk and PCI I/O.]

Marathon Endurance System
Marathon physically and logically separates the two basic operations of computers: manipulating and transforming data (computing), and moving data to and from mass storage, networks, and other I/O devices (I/O processing). The computing function is put on one server (the compute element, or CE), and the I/O processing function is put on another server (the I/O processor, or IOP). These CE/IOP pairs (tuples) connect through proprietary high-speed PCI interfaces and fiber optics. The Marathon Interface Card (MIC) sends and receives data from both systems simultaneously. The MIC also provides the comparison and test logic to ensure that both systems are identical.

Each tuple is a complete system, wherein the operating system running on both the CE and IOP is a Windows server OS. All CE I/O task requests go to the IOP for handling. Marathon software runs as an application on the IOP and controls all of the fault management, disk mirroring, system management, and resynchronization. Because the fault management is done in software, it can impact performance. Depending on the applications running, system performance may degrade by 10-20 percent or more.

It takes two tuples to configure an assured availability system. The IOPs run in parallel, but not in lockstep. If an IOP fails, the other IOP continues to run the system. The failed IOP can then be physically removed. After the Marathon software starts running, the repaired IOP automatically rejoins the configuration. The mirrored disks are re-mirrored in background mode over the private Ethernet linking the IOPs. The same process handles the failure of a mirrored disk.

[Figure: Marathon tuple, the building block for an assured availability system. A Compute Element (CPU, memory, MIC; runs the applications and operating system) is paired with an I/O Processor (CPU, memory, MIC, I/O adapters; handles all I/O, including the network).]

Software Availability
The second requirement is for maximizing software availability. Clusters rely on standard hardware, software, and service models that do not help prevent failures, isolate failures, or resolve failures. They simply recover from failures. Once again, Marathon and Stratus have different approaches.

Stratus
Software availability features seek to prevent outages, minimize those that cannot be prevented, and resolve problems so that they do not happen again. Stratus does not change any of the core Windows code. This guarantees 100 percent binary compatibility of all Windows applications.
Stratus does change the Windows 2000 environment, but only in areas designed to be customized by hardware and software partners and separated from the main body of Windows code by documented, well-defined interfaces.

Drivers cause a significant percentage of NT failures. Stratus driver hardening goes beyond Windows 2000 improvements to further reduce driver-induced OS failures. The driver defines its memory boundaries and works with Stratus hardware to automatically prevent memory transfers beyond the defined memory boundaries. This prevents a bad PCI card from crashing the system. The new Microsoft driver model for Windows 2000 uses WMI (Windows Management Instrumentation) for management, control, and reporting functions. Stratus hardened drivers are completely compatible with WMI.
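
The memory-boundary idea behind driver hardening can be pictured with a short model. This is only an illustration of the principle (a driver declares its regions, and transfers outside them are refused); the actual enforcement happens in Stratus hardware and drivers, whose interfaces the report does not describe. `Region`, `TransferGuard`, and the driver name `"nic"` are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Region:
    """A memory window a driver has declared as its own (hypothetical type)."""
    base: int
    size: int

    def contains(self, addr: int, length: int) -> bool:
        # The whole transfer must fit inside the declared window.
        return length > 0 and self.base <= addr and addr + length <= self.base + self.size


class TransferGuard:
    """Software stand-in for the hardware check on device memory transfers."""

    def __init__(self) -> None:
        self._regions: Dict[str, List[Region]] = {}

    def register(self, driver: str, region: Region) -> None:
        self._regions.setdefault(driver, []).append(region)

    def allow(self, driver: str, addr: int, length: int) -> bool:
        # Refuse any transfer that is not fully inside a declared region.
        return any(r.contains(addr, length) for r in self._regions.get(driver, []))


if __name__ == "__main__":
    guard = TransferGuard()
    guard.register("nic", Region(base=0x1000, size=0x100))
    print(guard.allow("nic", 0x1000, 0x100))  # inside the declared window
    print(guard.allow("nic", 0x10F0, 0x20))   # runs past the end, refused
```

The point of the model is that a misbehaving adapter is stopped at the boundary check rather than being allowed to overwrite memory belonging to the operating system.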
Stratus recommends that all drivers be hardened. Hardened drivers for all installed adapters are required in order to receive Stratus' 100 percent availability guarantee.

Incompatible versions of hardware and software from different suppliers are a well-known source of problems. The Resource Inventory Manager (RIM) identifies all system hardware and software configuration elements, along with their revision levels, at initial install and at all configuration changes. This information is stored and is also sent to the Stratus Customer Assistance Center (CAC), which can check known conflicts and help diagnose any problems.

Marathon
Marathon's architecture provides hardware fault tolerance and protection against transient OS bugs, detects OS failures, and automatically restarts the system. Because the IOPs run Marathon's I/O management and fault-handling software, they are isolated from the loads placed on the CEs by the user's applications and operating system. The IOPs run in parallel, but not in lockstep. Since the IOPs handle all interruptions, the CEs are free to run the OS and user applications without the usual stream of asynchrony. Interruptions are managed through a structured process that eliminates a major source of asynchrony-induced software failures. The IOPs are subjected to these asynchronies, but since there are two autonomous IOPs in a full fault-tolerant system, an interrupt-induced software asynchrony will only affect one of the IOPs. If an IOP goes down, the surviving IOP carries on until an automatic reboot of the failed IOP is completed.

Service
The third requirement for high availability systems is designed-in serviceability. Again, Stratus and Marathon have different approaches.

Stratus
Serviceability is built into the ftServer hardware design in the form of customer replaceable modules, automatic fault isolation and remote management, and reporting through the Stratus remote management card. The Stratus Service Network (SSN) enables remote access to every customer system.
The Stratus Customer Assistance Center provides 24/7 critical support.

ftServer automatically isolates failures to the component level while continuing operation on a second component. Failures are automatically reported to the CAC via a dial connection. A replacement component is shipped from Stratus for next-day arrival. The customer replaces the component while the system continues to operate. The new component is automatically integrated into the running system. The system and application continue to run normally through this entire process.
Each ftServer comes with two ftServer Management PCI adapters. These adapters are themselves board-level computers. They run independently of the host system and are powered even if the rest of the system is powered off. Either redundant ftServer Management adapter provides full control over the ftServer. Access is controlled through a TCP/IP interface via dial modem or local Ethernet.

If a customer calls, Stratus will troubleshoot the problem. If the problem is in Microsoft Windows 2000 code, Stratus calls in Microsoft, based on its service contract with Microsoft. Stratus also has licensed Windows 2000 source code and a staff of kernel-trained engineers. Microsoft has also given Stratus access to its OS debugging tools.

Marathon
The Marathon Assured Availability system has three states: operational, vulnerable, and down. The vulnerable state, invisible to users, notifies the system manager that a repair/resynchronization cycle can be initiated. Marathon provides two notification methods: the system console and the event log. The console presents a graphical model on the system monitor, on remote systems over the network, or through a serial line to the system manager. Color-coded components indicate their state, and a point-and-click interface is used to examine and manage system components. The second method uses the Windows server event log to log all events, including Marathon system events.
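
The three system states named above can be sketched as a simple model. The state names come from the report; the rule mapping healthy tuples to states (all healthy is operational, some healthy is vulnerable, none healthy is down) is an assumption made for illustration, as are the function names and message text. It is not Marathon's implementation.

```python
from enum import Enum
from typing import Optional


class SystemState(Enum):
    OPERATIONAL = "operational"
    VULNERABLE = "vulnerable"
    DOWN = "down"


def system_state(healthy_tuples: int, total_tuples: int = 2) -> SystemState:
    """Map the number of healthy CE/IOP tuples to a state (assumed rule)."""
    if not 0 <= healthy_tuples <= total_tuples:
        raise ValueError("healthy_tuples must be between 0 and total_tuples")
    if healthy_tuples == total_tuples:
        return SystemState.OPERATIONAL
    if healthy_tuples > 0:
        return SystemState.VULNERABLE  # still serving users, redundancy lost
    return SystemState.DOWN


def manager_notice(previous: SystemState, current: SystemState) -> Optional[str]:
    """Return a message for the console and event log, or None if unchanged."""
    if previous is current:
        return None
    if current is SystemState.VULNERABLE:
        return "vulnerable: repair/resynchronization cycle can be initiated"
    return f"state changed: {previous.value} -> {current.value}"


if __name__ == "__main__":
    before = system_state(2)
    after = system_state(1)
    print(after.value)
    print(manager_notice(before, after))
```

In this model the vulnerable state changes nothing for users; it only produces a notice for the system manager, which matches the report's description of that state as invisible to users.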
Several third-party tools are available that use the event log to communicate specified events via beepers, fax, e-mail, etc., to the system manager.

RECOMMENDATIONS
• All systems supporting real-time collaboration throughout the enterprise and value chain should be deployed on fault-tolerant platforms.
• When it comes to control-level, real-time, batch and process control applications, Stratus ftServer has the advantage because its architecture has no single point of failure and zero switchover time.
• When selecting fault tolerant solutions, consider the whole solution, including hardware fault tolerance, software availability, performance, implementation costs, and serviceability.

For further information, contact your account manager or the author at ggorbach@arcweb.com.

Recommended circulation: All EAS and MAS clients.

© 2001 ARC Advisory Group, 3 Allied Drive, Dedham, MA 02026 USA. ARCweb.com. USA, UK, Germany, Japan, India.
