SlideShare a Scribd company logo
Faster Recovery from Operating System
Failure and File Cache Missing
Yudai Kato, Shoichi Saito, Koichi Mouri, and Hiroshi Matsuo
Nagoya Institute of Technology, Nagoya 466-8555, Japan.

A. Jaleel
MS 1390 4692
Outline
Abstract
Introduction
Proposal
Design
Application
Implementation

Evaluation
•

Conclusion
References
Abstract
 Rebooting a computer is one method to fix OS failures. However this method
causes two problems.
a. The computer will not be available until the rebooting is finished.
b. The user data on the volatile main memory will be erased.

 To address these problems, we propose a system that can fix OS failures
within one second without losing file caches by replacing the OS encountering
a fatal error by another OS that is running on the same machine operating
simultaneously.
 One OS is dedicated to the backup of the other OS. Using the backup OS, we
can instantly conduct a failover.
Introduction
OS failure causes various problems, including loss of temporary data on the main
memory. In this research, we resolve these problems by making the restoration
process faster and protecting file caches from OS failure.
The main goals of Research
•
•
•
•
•

Making OSes robust against bugs
Constructing a high availability system in a single machine
Using a machine with minimum hardware
Quickly making failover
Minimizing overhead

For these goals,
• Two OSes are executed simultaneously on a single machine “Active OS &
Backup OS”.
• When the active OS encounters a failure, the backup OS can take over the
active OS by migrating the devices and file cashes from the active OS to itself.
• the proposed system, which does not need special hardware to perform this.
PROPOSAL
A. Outline
The amount of time for the failover of crashed OSes is the total suspension time
until the administrator detects the failure and restarts the OSes. To reduce the
computer downtime, we automatically detect OS failures and start a quick
failover instead of restarting the computer.
This proposed system provides a method to check the health of an OS kernel. If
the kernel is defunct, we simply conduct a failover
The backup OS can quickly take over functions of the active OS when it fails to
operate properly. This shortens the failover time more than rebooting. Therefore
we can reduce the downtime caused by OS failure.
B. Behavior
During failover, we migrate devices and file caches and launch applications.
DESIGN
a.
b.
c.
d.

Multiple-OS execution platform
Alive monitoring
Device migration
File cache migration
A. Multiple-OS execution platform
 For such simultaneously running of the active and backup OSes, we use Logical
PARtition (LPAR), which is a virtualization technique that emulates multiple
machines on a physical machine.
 Each OS is booted on a partition specified by kernel parameters that are given
by a bootloader.
B. Alive monitoring
How to confirm whether the active OS is dead or alive???
1. The crashed OS itself realizes the fault. So, active OS can send a dying message
to the backup OS.
2. the OS cannot realize the fault. this is caused by serious bugs, immediately
stops the OSes after they encounter the bugs .here, we use heartbeat messages at
regular intervals.

C. Device migration
The device migration mechanism is intended for migrating configuration files,
the libraries, the IP addresses and the function of the devices from the active
to the backup OS.
D. File cache migration
we can migrate the file cache that remains on the volatile main memory from
the active to the backup OS. As a result, the file cache can be written properly
to the storage device.
The steps performed during migration.
• The backup OS obtains the data structures related to the file caches from the
memory region of the active OS.
• It reconstructs file caches from the data structures.
• It writes the file caches to the storage.
APPLICATION
The proposed system enhances the availability of a operating system on a
machine using an example of an application to a Network File System (NFS)
server.
NFS is a kind of server-client application for sharing files and this exports its
local directory to the network. Since it’s a stateless application, the NFS server
can handle client requests even if the server is rebooted after a sudden
interruption.
we create an NFS server process on both the active and the backup OS. A NFS
client is running on a remote machine and communicating with the NFS server on
the active OS through the network. The NFS server on the active OS exports a
directory /export, which is a partition on the storage device. When the active OS
crashed because of bugs, alive monitoring detects it, and the backup OS starts the
failover to restore the NFS server.

First, the backup OS migrates the storage device and the NIC from the active OS to
it. The storage device contains the partition for the NFS server. Then the backup
OS mounts the partition on the same directory /export. After that, the backup OS
migrates the dirty file cache that remains on the volatile main memory to it.
Next, attaching the same IP address to the NIC used by the active OS, the backup
OS can communicate with remote NFS clients. In this stage, the backup OS
transparently finishes the failover for the client.
After that, the server begins processing the suspended operations, which are
successfully processed for the client because the previous operations are never
lost. Hence, our proposed system can protect files even if the OS crashed when the
file was being written by the NFS clients. In this way, we migrate environments
from the active to the backup OS, which transparently takes over for the crashed
OS to the NFS clients.
IMPLEMENTATION
Linux (2.6.38, processor type x86 64). Both the active and backup OSes use this
kernel.
A. Implementation of multiple-OS execution platform
For creating a LPAR partition that represents a virtual machine, every OS on the
machine must exclusively access the hardware of its partition. First, the OS is
booted from the normal bootloader, and subsequent OSes are booted by a special
bootloader based on kexec.
B. Failover
The failover of the proposed system is composed of device and file cache
migrations. First, we conduct device migration. In the normal state, both OSes
exclusively access their devices by detaching other’s devices without initializing
them.
C. Implementation of alive monitoring
Inter Processor Interrupt (IPI) Application is used which a core can communicate
with other cores through the processor’s interrupt controller. for detecting the
failure of the active OS. Both messages are sent by IPI.
After receiving this message, the backup OS notices the death of the active OS and
immediately starts failover.
The active OS sends heartbeat messages at regular intervals using interval timers. If
any message cannot be received for several seconds, the backup OS assumes that
the active OS is dead. With these two methods, alive monitoring detects the failure
of the active OS.
EVALUATION
we measure the time required to perform failover.
For the measurement, for simplicity, we deliberately crashed the active OS and
measured the length of the failover & the backup OS immediately detected the
failure of the active OS by receiving a dying message from the active OS. While the
backup OS was taking over the active OS, the backup OS performed the following:
1. Device migration
2. Mounting storage
3. File cache migration
4. Assignment of IP addresses to the NIC.
1. Device migration
2. Mounting storage
3. File cache migration
4. Assignment of IP addresses
to the NIC.

• Measured the amount of time for (3) with several different amounts of caches. The
times for (1)(2)(4) did not differ by conditions, thus the total time for (1)-(4) was
measured instead of measuring them individually. The measurement results of the
amount of time for cache migration are shown in Fig. 2.
• About 90 MBytes of the file cache can be migrated within 70 milliseconds.
After that, we measured the time for (1)-(4). The results were from 560 to 650
milliseconds. The margin of 90 milliseconds reflects the amount of time for file-cache
migration. Our proposed system can finish a failover process within one second.
•

Measured the amount of downtime for the network and the NFS on the proposed system using
a remote machine connected to a switching hub to which our proposed system was also
connected. We measured the time (a)(b)(c) shown in Fig. 3. First, we measured the network
downtime. The echo requests were sent continuously using Internet control message protocol
(ICMP) before the active OS crashed.

•

Measured the downtime using the amount of time between (a) the beginning of the first echo
request that failed to come back and (b) the time of the echo request that first came back
after the failover. The result was about 3.1 seconds, which is slightly longer than the one
second of the failover.
The start time of the suspended operation was the same as (b), and its end is
denoted by (c). The results ranged from about 3 to 13 seconds. They were much
longer than the network downtime.
We conducted the same measure using a UDP protocol, and the time was about 3
milliseconds. Therefore, if NFS uses a UDP protocol, it can communicate
immediately after the network comes back.
CONCLUSION
In this paper, we proposed a system with a failover mechanism for operating
systems. Our proposed system’s failover takes less than one second. The user data
in the file cache are never lost because they are migrated by a file cache
migration mechanism.
our proposed system can resolve problems caused by rebooting after OS failure. It
runs on commodity x86 machines and needs no modifications for applications.
During normal times, since there is no special processing except for heartbeat
messages, our proposed system is lightweight. Future work will randomly insert
bugs into kernels to measure the durability of our proposed system.
REFERENCES
[1] A. S. Tanenbaum, J. N. Herder, and H. Bos, “Can we make operating systems reliable and secure,” Computer, vol. 39, pp.
44–51, 2006.
[2] M. M. Swift, M. Annamalai, B. N. Bershad, and H. M. Levy, “Recovering device drivers,” ACM Transactions on Computer
Systems, vol. 24, pp. 333–360, 2006.
[3] A. Depoutovitch and M. Stumm, “”otherworld”: Giving applications a chance to survive os kernel crashes,” in EuroSys,
2008, pp. 181–194.
[4] A. Pfiffer, “Reducing system reboot time with kexec.” http://www.osdl.org/.
[5] M. K. Mckusick and T. J. Kowalski, “Fsck - the unix file system check program,” 1994.
[6] A. Sweeney, A. Sweeney, D. Doucette, W. Hu, C. Anderson, M. Nishimoto, and G. Peck, “Scalability in the xfs file system,”
in In Proceedings of the 1996 USENIX Annual Technical Conference, 1996, pp.1–14.
[7] M. K. Mckusick, M. K. Mckusick, G. R. Ganger, and G. R. Ganger, “Soft updates: A technique for eliminating most
synchronous writes in the fast filesystem,” in In Proceedings of the Freenix Track: 1999 USENIX Annual Technical Conference,
1999, pp. 1–17.
[8] M. Rosenblum and J. K. Ousterhout, “The design and implementation of a log-structured file system,” ACM Transactions
on Computer Systems, vol. 10, pp. 1–15, 1991.
[9] “Keepalived for linux.” [Online]. Available: http://www.keepalived.org/
[10] T. L. Borden, J. P. Hennessy, and J. W. Rymarczyk, “Multiple operating systems on one processor complex,” IBM Systems
Journal, vol. 28, no. 1, pp. 104 –123, 1989.
[11] T. Shimosawa, H. Matsuba, and Y. Ishikawa, “Logical partitioning without architectural supports,” in International
Computer Software and Applications Conference, 2008, pp. 355–364.
[12] A. Chou, J. Yang, B. Chelf, S. Hallem, and D. Engler, “An empirical study of operating systems errors,” in Symposium on
Operating Systems Principles, 2001, pp. 73–88.
[13] G. Candea, S. Kawamoto, Y. Fujiki, G. Friedman, and A. Fox,“Microreboot - a technique for cheap recovery,” in
Operating Systems Design and Implementation, 2004, pp. 31–44.
[14] P. M. Chen, W. T. Ng, S. Chandra, C. M. Aycock, G. Rajamani, and D. E. Lowell, “The rio file cache: Surviving operating
system crashes,” in Architectural Support for Programming Languages and Operating Systems, 1996, pp. 74–83.
[15] C. Giuffrida, L. Cavallaro, and A. S. Tanenbaum, “We crashed, now what?” in Proceedings of the Sixth international
conference on Hot topics in system dependability, ser. HotDep’10. Berkeley, CA, USA: USENIX Association, 2010, pp. 1–8.
[16] M. M. Swift, B. N. Bershad, and H. M. Levy, “Improving the reliability of commodity operating systems,” in Symposium on
Operating Systems Principles, 2003, pp. 207–222.
[17] N. Palix, G. Thomas, S. Saha, C. Calv`es, J. Lawall, and G. Muller,“Faults in linux: ten years later,” in Architectural
Support for Programming Languages and Operating Systems, 2011, pp. 305–318.
Thank You
Athambawa Jaleel
MS 1390 4692

More Related Content

What's hot

Refining Linux
Refining LinuxRefining Linux
Refining Linux
Jason Murray
 
RTOS - Real Time Operating Systems
RTOS - Real Time Operating SystemsRTOS - Real Time Operating Systems
RTOS - Real Time Operating Systems
Emertxe Information Technologies Pvt Ltd
 
HKG15-305: Real Time processing comparing the RT patch vs Core isolation
HKG15-305: Real Time processing comparing the RT patch vs Core isolationHKG15-305: Real Time processing comparing the RT patch vs Core isolation
HKG15-305: Real Time processing comparing the RT patch vs Core isolation
Linaro
 
Operating Systems Part I-Basics
Operating Systems Part I-BasicsOperating Systems Part I-Basics
Operating Systems Part I-Basics
Ajit Nayak
 
INTERRUPT ROUTINES IN RTOS EN VIRONMENT HANDELING OF INTERRUPT SOURCE CALLS
INTERRUPT ROUTINES IN RTOS EN VIRONMENT HANDELING OF INTERRUPT SOURCE CALLSINTERRUPT ROUTINES IN RTOS EN VIRONMENT HANDELING OF INTERRUPT SOURCE CALLS
INTERRUPT ROUTINES IN RTOS EN VIRONMENT HANDELING OF INTERRUPT SOURCE CALLS
JOLLUSUDARSHANREDDY
 
Vxworks
VxworksVxworks
Vxworks
ismaeil nethead
 
MicroC/OS-II
MicroC/OS-IIMicroC/OS-II
AOS Lab 5: System calls
AOS Lab 5: System callsAOS Lab 5: System calls
AOS Lab 5: System callsZubair Nabi
 
Rtos part2
Rtos part2Rtos part2
Rtos part2navakishore
 
Introduction to Real-Time Operating Systems
Introduction to Real-Time Operating SystemsIntroduction to Real-Time Operating Systems
Introduction to Real-Time Operating Systems
coolmirza143
 
Preempt_rt realtime patch
Preempt_rt realtime patchPreempt_rt realtime patch
Preempt_rt realtime patch
Emre Can Kucukoglu
 
Virtualization overheads
Virtualization overheadsVirtualization overheads
Virtualization overheads
Sandeep Joshi
 
Linux Internals - Interview essentials 3.0
Linux Internals - Interview essentials 3.0Linux Internals - Interview essentials 3.0
Linux Internals - Interview essentials 3.0
Emertxe Information Technologies Pvt Ltd
 
Priority Inversion on Mars
Priority Inversion on MarsPriority Inversion on Mars
Priority Inversion on Mars
National Cheng Kung University
 
2 srs
2 srs2 srs
2 srshanmya
 
Real Time Operating Systems
Real Time Operating SystemsReal Time Operating Systems
Real Time Operating Systems
Murtadha Alsabbagh
 
Rtos Concepts
Rtos ConceptsRtos Concepts
Rtos Concepts
Sundaresan Sundar
 
Real time Linux
Real time LinuxReal time Linux
Real time Linux
navid ashrafi
 
High Performance Storage Devices in the Linux Kernel
High Performance Storage Devices in the Linux KernelHigh Performance Storage Devices in the Linux Kernel
High Performance Storage Devices in the Linux Kernel
Kernel TLV
 

What's hot (20)

Refining Linux
Refining LinuxRefining Linux
Refining Linux
 
RTOS - Real Time Operating Systems
RTOS - Real Time Operating SystemsRTOS - Real Time Operating Systems
RTOS - Real Time Operating Systems
 
HKG15-305: Real Time processing comparing the RT patch vs Core isolation
HKG15-305: Real Time processing comparing the RT patch vs Core isolationHKG15-305: Real Time processing comparing the RT patch vs Core isolation
HKG15-305: Real Time processing comparing the RT patch vs Core isolation
 
Operating Systems Part I-Basics
Operating Systems Part I-BasicsOperating Systems Part I-Basics
Operating Systems Part I-Basics
 
INTERRUPT ROUTINES IN RTOS EN VIRONMENT HANDELING OF INTERRUPT SOURCE CALLS
INTERRUPT ROUTINES IN RTOS EN VIRONMENT HANDELING OF INTERRUPT SOURCE CALLSINTERRUPT ROUTINES IN RTOS EN VIRONMENT HANDELING OF INTERRUPT SOURCE CALLS
INTERRUPT ROUTINES IN RTOS EN VIRONMENT HANDELING OF INTERRUPT SOURCE CALLS
 
Mastering Real-time Linux
Mastering Real-time LinuxMastering Real-time Linux
Mastering Real-time Linux
 
Vxworks
VxworksVxworks
Vxworks
 
MicroC/OS-II
MicroC/OS-IIMicroC/OS-II
MicroC/OS-II
 
AOS Lab 5: System calls
AOS Lab 5: System callsAOS Lab 5: System calls
AOS Lab 5: System calls
 
Rtos part2
Rtos part2Rtos part2
Rtos part2
 
Introduction to Real-Time Operating Systems
Introduction to Real-Time Operating SystemsIntroduction to Real-Time Operating Systems
Introduction to Real-Time Operating Systems
 
Preempt_rt realtime patch
Preempt_rt realtime patchPreempt_rt realtime patch
Preempt_rt realtime patch
 
Virtualization overheads
Virtualization overheadsVirtualization overheads
Virtualization overheads
 
Linux Internals - Interview essentials 3.0
Linux Internals - Interview essentials 3.0Linux Internals - Interview essentials 3.0
Linux Internals - Interview essentials 3.0
 
Priority Inversion on Mars
Priority Inversion on MarsPriority Inversion on Mars
Priority Inversion on Mars
 
2 srs
2 srs2 srs
2 srs
 
Real Time Operating Systems
Real Time Operating SystemsReal Time Operating Systems
Real Time Operating Systems
 
Rtos Concepts
Rtos ConceptsRtos Concepts
Rtos Concepts
 
Real time Linux
Real time LinuxReal time Linux
Real time Linux
 
High Performance Storage Devices in the Linux Kernel
High Performance Storage Devices in the Linux KernelHigh Performance Storage Devices in the Linux Kernel
High Performance Storage Devices in the Linux Kernel
 

Viewers also liked

ราคากลางสอบราคาจ้างปรับปรุงหลังคา รางน้ำ
ราคากลางสอบราคาจ้างปรับปรุงหลังคา รางน้ำราคากลางสอบราคาจ้างปรับปรุงหลังคา รางน้ำ
ราคากลางสอบราคาจ้างปรับปรุงหลังคา รางน้ำ
tanapongslide
 
ราคากลางงานจ้างปรับปรุงห้องน้ำห้องส้วม
ราคากลางงานจ้างปรับปรุงห้องน้ำห้องส้วมราคากลางงานจ้างปรับปรุงห้องน้ำห้องส้วม
ราคากลางงานจ้างปรับปรุงห้องน้ำห้องส้วม
tanapongslide
 
民族主義還是國際主義@2013台社年會
民族主義還是國際主義@2013台社年會民族主義還是國際主義@2013台社年會
民族主義還是國際主義@2013台社年會Po-Chien Chen
 
Fiat riciclata un nuevo concepto de rse
Fiat riciclata  un nuevo concepto de rseFiat riciclata  un nuevo concepto de rse
Fiat riciclata un nuevo concepto de rseEzequiel Bertorini
 
如何站在工人立場評估服貿協議 (台南產業總工會勞教)
如何站在工人立場評估服貿協議 (台南產業總工會勞教)如何站在工人立場評估服貿協議 (台南產業總工會勞教)
如何站在工人立場評估服貿協議 (台南產業總工會勞教)Po-Chien Chen
 
Introduction to the learning in 1G
Introduction to the learning in 1GIntroduction to the learning in 1G
Introduction to the learning in 1G
nickgoligher
 
Secret finalppt
Secret finalpptSecret finalppt
Secret finalpptdipikadeb15
 
แบบปร.6 ปร.5(ก) ปร.4(ก) ปร.4(ข)
แบบปร.6 ปร.5(ก) ปร.4(ก) ปร.4(ข)แบบปร.6 ปร.5(ก) ปร.4(ก) ปร.4(ข)
แบบปร.6 ปร.5(ก) ปร.4(ก) ปร.4(ข)
tanapongslide
 
Genetic Engineering
Genetic Engineering Genetic Engineering
Genetic Engineering
Joshua Porca
 

Viewers also liked (10)

ราคากลางสอบราคาจ้างปรับปรุงหลังคา รางน้ำ
ราคากลางสอบราคาจ้างปรับปรุงหลังคา รางน้ำราคากลางสอบราคาจ้างปรับปรุงหลังคา รางน้ำ
ราคากลางสอบราคาจ้างปรับปรุงหลังคา รางน้ำ
 
ราคากลางงานจ้างปรับปรุงห้องน้ำห้องส้วม
ราคากลางงานจ้างปรับปรุงห้องน้ำห้องส้วมราคากลางงานจ้างปรับปรุงห้องน้ำห้องส้วม
ราคากลางงานจ้างปรับปรุงห้องน้ำห้องส้วม
 
民族主義還是國際主義@2013台社年會
民族主義還是國際主義@2013台社年會民族主義還是國際主義@2013台社年會
民族主義還是國際主義@2013台社年會
 
Gita v7
Gita v7Gita v7
Gita v7
 
Fiat riciclata un nuevo concepto de rse
Fiat riciclata  un nuevo concepto de rseFiat riciclata  un nuevo concepto de rse
Fiat riciclata un nuevo concepto de rse
 
如何站在工人立場評估服貿協議 (台南產業總工會勞教)
如何站在工人立場評估服貿協議 (台南產業總工會勞教)如何站在工人立場評估服貿協議 (台南產業總工會勞教)
如何站在工人立場評估服貿協議 (台南產業總工會勞教)
 
Introduction to the learning in 1G
Introduction to the learning in 1GIntroduction to the learning in 1G
Introduction to the learning in 1G
 
Secret finalppt
Secret finalpptSecret finalppt
Secret finalppt
 
แบบปร.6 ปร.5(ก) ปร.4(ก) ปร.4(ข)
แบบปร.6 ปร.5(ก) ปร.4(ก) ปร.4(ข)แบบปร.6 ปร.5(ก) ปร.4(ก) ปร.4(ข)
แบบปร.6 ปร.5(ก) ปร.4(ก) ปร.4(ข)
 
Genetic Engineering
Genetic Engineering Genetic Engineering
Genetic Engineering
 

Similar to Faster recovery from operating system failure & file cache missing

Lecture1,2,3 (1).pdf
Lecture1,2,3 (1).pdfLecture1,2,3 (1).pdf
Lecture1,2,3 (1).pdf
Taufeeq8
 
Cs6413 operating-systems-laboratory
Cs6413 operating-systems-laboratoryCs6413 operating-systems-laboratory
Cs6413 operating-systems-laboratory
Preeja Ravishankar
 
Os files 2
Os files 2Os files 2
Os files 2Amit Pal
 
Linux Internals - Interview essentials - 1.0
Linux Internals - Interview essentials - 1.0Linux Internals - Interview essentials - 1.0
Linux Internals - Interview essentials - 1.0
Emertxe Information Technologies Pvt Ltd
 
Processor management
Processor managementProcessor management
Processor management
dev3993
 
Introduction to Operating System (Important Notes)
Introduction to Operating System (Important Notes)Introduction to Operating System (Important Notes)
Introduction to Operating System (Important Notes)
Gaurav Kakade
 
Io sy.stemppt
Io sy.stempptIo sy.stemppt
Io sy.stemppt
muthumani mahesh
 
Os functions
Os functionsOs functions
Os functions
Pushpraj Patel
 
OS Structure
OS StructureOS Structure
OS Structure
Ajay Singh Rana
 
OS - Ch1
OS - Ch1OS - Ch1
OS - Ch1sphs
 
Operating systems. replace ch1 with numbers for next chapters
Operating systems. replace ch1 with numbers for next chaptersOperating systems. replace ch1 with numbers for next chapters
Operating systems. replace ch1 with numbers for next chapterssphs
 
Chapter 1 - Introduction
Chapter 1 - IntroductionChapter 1 - Introduction
Chapter 1 - IntroductionWayne Jones Jnr
 
Operating Systems
Operating Systems Operating Systems
Operating Systems
Ziyauddin Shaik
 
Autosar Basics hand book_v1
Autosar Basics  hand book_v1Autosar Basics  hand book_v1
Autosar Basics hand book_v1
Keroles karam khalil
 
IT241 - Full Summary.pdf
IT241 - Full Summary.pdfIT241 - Full Summary.pdf
IT241 - Full Summary.pdf
SHEHABALYAMANI
 
Operating system.pptx
Operating system.pptxOperating system.pptx
Operating system.pptx
KrishnenduMisra2
 
Operating Systems PPT 1 (1).pdf
Operating Systems PPT 1 (1).pdfOperating Systems PPT 1 (1).pdf
Operating Systems PPT 1 (1).pdf
FahanaAbdulVahab
 
Engg-0505-IT-Operating-Systems-2nd-year.pdf
Engg-0505-IT-Operating-Systems-2nd-year.pdfEngg-0505-IT-Operating-Systems-2nd-year.pdf
Engg-0505-IT-Operating-Systems-2nd-year.pdf
nikhil287188
 
OS Content.pdf
OS Content.pdfOS Content.pdf
OS Content.pdf
VAIBHAVSAHU55
 

Similar to Faster recovery from operating system failure & file cache missing (20)

Lecture1,2,3 (1).pdf
Lecture1,2,3 (1).pdfLecture1,2,3 (1).pdf
Lecture1,2,3 (1).pdf
 
Cs6413 operating-systems-laboratory
Cs6413 operating-systems-laboratoryCs6413 operating-systems-laboratory
Cs6413 operating-systems-laboratory
 
Os files 2
Os files 2Os files 2
Os files 2
 
Linux Internals - Interview essentials - 1.0
Linux Internals - Interview essentials - 1.0Linux Internals - Interview essentials - 1.0
Linux Internals - Interview essentials - 1.0
 
Processor management
Processor managementProcessor management
Processor management
 
Introduction to Operating System (Important Notes)
Introduction to Operating System (Important Notes)Introduction to Operating System (Important Notes)
Introduction to Operating System (Important Notes)
 
Io sy.stemppt
Io sy.stempptIo sy.stemppt
Io sy.stemppt
 
Os functions
Os functionsOs functions
Os functions
 
OS Structure
OS StructureOS Structure
OS Structure
 
OS - Ch1
OS - Ch1OS - Ch1
OS - Ch1
 
Operating systems. replace ch1 with numbers for next chapters
Operating systems. replace ch1 with numbers for next chaptersOperating systems. replace ch1 with numbers for next chapters
Operating systems. replace ch1 with numbers for next chapters
 
Chapter 1 - Introduction
Chapter 1 - IntroductionChapter 1 - Introduction
Chapter 1 - Introduction
 
Operating Systems
Operating Systems Operating Systems
Operating Systems
 
Autosar Basics hand book_v1
Autosar Basics  hand book_v1Autosar Basics  hand book_v1
Autosar Basics hand book_v1
 
Ch1
Ch1Ch1
Ch1
 
IT241 - Full Summary.pdf
IT241 - Full Summary.pdfIT241 - Full Summary.pdf
IT241 - Full Summary.pdf
 
Operating system.pptx
Operating system.pptxOperating system.pptx
Operating system.pptx
 
Operating Systems PPT 1 (1).pdf
Operating Systems PPT 1 (1).pdfOperating Systems PPT 1 (1).pdf
Operating Systems PPT 1 (1).pdf
 
Engg-0505-IT-Operating-Systems-2nd-year.pdf
Engg-0505-IT-Operating-Systems-2nd-year.pdfEngg-0505-IT-Operating-Systems-2nd-year.pdf
Engg-0505-IT-Operating-Systems-2nd-year.pdf
 
OS Content.pdf
OS Content.pdfOS Content.pdf
OS Content.pdf
 

Recently uploaded

National Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practicesNational Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practices
Quotidiano Piemontese
 
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfObservability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Paige Cruz
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
Ana-Maria Mihalceanu
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
91mobiles
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance
 
Introduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - CybersecurityIntroduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - Cybersecurity
mikeeftimakis1
 
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptx
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptxSecstrike : Reverse Engineering & Pwnable tools for CTF.pptx
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptx
nkrafacyberclub
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
James Anderson
 
Pushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 daysPushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 days
Adtran
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
Laura Byrne
 
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
SOFTTECHHUB
 
Microsoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdfMicrosoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdf
Uni Systems S.M.S.A.
 
Climate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing DaysClimate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing Days
Kari Kakkonen
 
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdfSAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
Peter Spielvogel
 
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
Neo4j
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
Kari Kakkonen
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
Alan Dix
 
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
Neo4j
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance
 
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Albert Hoitingh
 

Recently uploaded (20)

National Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practicesNational Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practices
 
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfObservability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
 
Introduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - CybersecurityIntroduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - Cybersecurity
 
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptx
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptxSecstrike : Reverse Engineering & Pwnable tools for CTF.pptx
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptx
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
 
Pushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 daysPushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 days
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
 
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
 
Microsoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdfMicrosoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdf
 
Climate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing DaysClimate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing Days
 
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdfSAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
 
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
 
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
 
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
 

Faster recovery from operating system failure & file cache missing

  • 1. Faster Recovery from Operating System Failure and File Cache Missing Yudai Kato, Shoichi Saito, Koichi Mouri, and Hiroshi Matsuo Nagoya Institute of Technology, Nagoya 466-8555, Japan. A. Jaleel MS 1390 4692
  • 3. Abstract  Rebooting a computer is one method to fix OS failures. However this method causes two problems. a. The computer will not be available until the rebooting is finished. b. The user data on the volatile main memory will be erased.  To address these problems, we propose a system that can fix OS failures within one second without losing file caches by replacing the OS encountering a fatal error by another OS that is running on the same machine operating simultaneously.  One OS is dedicated to the backup of the other OS. Using the backup OS, we can instantly conduct a failover.
  • 4. Introduction OS failure causes various problems, including loss of temporary data on the main memory. In this research, we resolve these problems by making the restoration process faster and protecting file caches from OS failure. The main goals of Research • • • • • Making OSes robust against bugs Constructing a high availability system in a single machine Using a machine with minimum hardware Quickly making failover Minimizing overhead For these goals, • Two OSes are executed simultaneously on a single machine “Active OS & Backup OS”. • When the active OS encounters a failure, the backup OS can take over the active OS by migrating the devices and file cashes from the active OS to itself. • the proposed system, which does not need special hardware to perform this.
  • 5. PROPOSAL A. Outline The amount of time for the failover of crashed OSes is the total suspension time until the administrator detects the failure and restarts the OSes. To reduce the computer downtime, we automatically detect OS failures and start a quick failover instead of restarting the computer. This proposed system provides a method to check the health of an OS kernel. If the kernel is defunct, we simply conduct a failover The backup OS can quickly take over functions of the active OS when it fails to operate properly. This shortens the failover time more than rebooting. Therefore we can reduce the downtime caused by OS failure. B. Behavior During failover, we migrate devices and file caches and launch applications.
  • 6. DESIGN a. b. c. d. Multiple-OS execution platform Alive monitoring Device migration File cache migration
  • 7. A. Multiple-OS execution platform  For such simultaneously running of the active and backup OSes, we use Logical PARtition (LPAR), which is a virtualization technique that emulates multiple machines on a physical machine.  Each OS is booted on a partition specified by kernel parameters that are given by a bootloader. B. Alive monitoring How to confirm whether the active OS is dead or alive??? 1. The crashed OS itself realizes the fault. So, active OS can send a dying message to the backup OS. 2. the OS cannot realize the fault. this is caused by serious bugs, immediately stops the OSes after they encounter the bugs .here, we use heartbeat messages at regular intervals. C. Device migration The device migration mechanism is intended for migrating configuration files, the libraries, the IP addresses and the function of the devices from the active to the backup OS.
  • 8. D. File cache migration we can migrate the file cache that remains on the volatile main memory from the active to the backup OS. As a result, the file cache can be written properly to the storage device. The steps performed during migration. • The backup OS obtains the data structures related to the file caches from the memory region of the active OS. • It reconstructs file caches from the data structures. • It writes the file caches to the storage. APPLICATION The proposed system enhances the availability of a operating system on a machine using an example of an application to a Network File System (NFS) server. NFS is a kind of server-client application for sharing files and this exports its local directory to the network. Since it’s a stateless application, the NFS server can handle client requests even if the server is rebooted after a sudden interruption.
  • 9. we create an NFS server process on both the active and the backup OS. A NFS client is running on a remote machine and communicating with the NFS server on the active OS through the network. The NFS server on the active OS exports a directory /export, which is a partition on the storage device. When the active OS crashed because of bugs, alive monitoring detects it, and the backup OS starts the failover to restore the NFS server. First, the backup OS migrates the storage device and the NIC from the active OS to it. The storage device contains the partition for the NFS server. Then the backup OS mounts the partition on the same directory /export. After that, the backup OS migrates the dirty file cache that remains on the volatile main memory to it. Next, attaching the same IP address to the NIC used by the active OS, the backup OS can communicate with remote NFS clients. In this stage, the backup OS transparently finishes the failover for the client. After that, the server begins processing the suspended operations, which are successfully processed for the client because the previous operations are never lost. Hence, our proposed system can protect files even if the OS crashed when the file was being written by the NFS clients. In this way, we migrate environments from the active to the backup OS, which transparently takes over for the crashed OS to the NFS clients.
  • 10. IMPLEMENTATION Linux (2.6.38, processor type x86 64). Both the active and backup OSes use this kernel. A. Implementation of multiple-OS execution platform For creating a LPAR partition that represents a virtual machine, every OS on the machine must exclusively access the hardware of its partition. First, the OS is booted from the normal bootloader, and subsequent OSes are booted by a special bootloader based on kexec. B. Failover The failover of the proposed system is composed of device and file cache migrations. First, we conduct device migration. In the normal state, both OSes exclusively access their devices by detaching other’s devices without initializing them.
  • 11. C. Implementation of alive monitoring Inter Processor Interrupt (IPI) Application is used which a core can communicate with other cores through the processor’s interrupt controller. for detecting the failure of the active OS. Both messages are sent by IPI. After receiving this message, the backup OS notices the death of the active OS and immediately starts failover. The active OS sends heartbeat messages at regular intervals using interval timers. If any message cannot be received for several seconds, the backup OS assumes that the active OS is dead. With these two methods, alive monitoring detects the failure of the active OS.
  • 12. EVALUATION we measure the time required to perform failover. For the measurement, for simplicity, we deliberately crashed the active OS and measured the length of the failover & the backup OS immediately detected the failure of the active OS by receiving a dying message from the active OS. While the backup OS was taking over the active OS, the backup OS performed the following: 1. Device migration 2. Mounting storage 3. File cache migration 4. Assignment of IP addresses to the NIC.
  • 13. 1. Device migration 2. Mounting storage 3. File cache migration 4. Assignment of IP addresses to the NIC. • Measured the amount of time for (3) with several different amounts of caches. The times for (1)(2)(4) did not differ by conditions, thus the total time for (1)-(4) was measured instead of measuring them individually. The measurement results of the amount of time for cache migration are shown in Fig. 2. • About 90 MBytes of the file cache can be migrated within 70 milliseconds. After that, we measured the time for (1)-(4). The results were from 560 to 650 milliseconds. The margin of 90 milliseconds reflects the amount of time for file-cache migration. Our proposed system can finish a failover process within one second.
  • 14. • Measured the amount of downtime for the network and the NFS on the proposed system using a remote machine connected to a switching hub to which our proposed system was also connected. We measured the time (a)(b)(c) shown in Fig. 3. First, we measured the network downtime. The echo requests were sent continuously using Internet control message protocol (ICMP) before the active OS crashed. • Measured the downtime using the amount of time between (a) the beginning of the first echo request that failed to come back and (b) the time of the echo request that first came back after the failover. The result was about 3.1 seconds, which is slightly longer than the one second of the failover.
  • 15. The start time of the suspended operation was the same as (b), and its end is denoted by (c). The results ranged from about 3 to 13 seconds. They were much longer than the network downtime. We conducted the same measure using a UDP protocol, and the time was about 3 milliseconds. Therefore, if NFS uses a UDP protocol, it can communicate immediately after the network comes back. CONCLUSION In this paper, we proposed a system with a failover mechanism for operating systems. Our proposed system’s failover takes less than one second. The user data in the file cache are never lost because they are migrated by a file cache migration mechanism. our proposed system can resolve problems caused by rebooting after OS failure. It runs on commodity x86 machines and needs no modifications for applications. During normal times, since there is no special processing except for heartbeat messages, our proposed system is lightweight. Future work will randomly insert bugs into kernels to measure the durability of our proposed system.
  • 16. REFERENCES [1] A. S. Tanenbaum, J. N. Herder, and H. Bos, “Can we make operating systems reliable and secure,” Computer, vol. 39, pp. 44–51, 2006. [2] M. M. Swift, M. Annamalai, B. N. Bershad, and H. M. Levy, “Recovering device drivers,” ACM Transactions on Computer Systems, vol. 24, pp. 333–360, 2006. [3] A. Depoutovitch and M. Stumm, “”otherworld”: Giving applications a chance to survive os kernel crashes,” in EuroSys, 2008, pp. 181–194. [4] A. Pfiffer, “Reducing system reboot time with kexec.” http://www.osdl.org/. [5] M. K. Mckusick and T. J. Kowalski, “Fsck - the unix file system check program,” 1994. [6] A. Sweeney, A. Sweeney, D. Doucette, W. Hu, C. Anderson, M. Nishimoto, and G. Peck, “Scalability in the xfs file system,” in In Proceedings of the 1996 USENIX Annual Technical Conference, 1996, pp.1–14. [7] M. K. Mckusick, M. K. Mckusick, G. R. Ganger, and G. R. Ganger, “Soft updates: A technique for eliminating most synchronous writes in the fast filesystem,” in In Proceedings of the Freenix Track: 1999 USENIX Annual Technical Conference, 1999, pp. 1–17. [8] M. Rosenblum and J. K. Ousterhout, “The design and implementation of a log-structured file system,” ACM Transactions on Computer Systems, vol. 10, pp. 1–15, 1991. [9] “Keepalived for linux.” [Online]. Available: http://www.keepalived.org/ [10] T. L. Borden, J. P. Hennessy, and J. W. Rymarczyk, “Multiple operating systems on one processor complex,” IBM Systems Journal, vol. 28, no. 1, pp. 104 –123, 1989. [11] T. Shimosawa, H. Matsuba, and Y. Ishikawa, “Logical partitioning without architectural supports,” in International Computer Software and Applications Conference, 2008, pp. 355–364. [12] A. Chou, J. Yang, B. Chelf, S. Hallem, and D. Engler, “An empirical study of operating systems errors,” in Symposium on Operating Systems Principles, 2001, pp. 73–88. [13] G. Candea, S. Kawamoto, Y. Fujiki, G. Friedman, and A. Fox,“Microreboot - a technique for cheap recovery,” in Operating Systems Design and Implementation, 2004, pp. 31–44. [14] P. M. Chen, W. T. Ng, S. Chandra, C. M. Aycock, G. Rajamani, and D. E. Lowell, “The rio file cache: Surviving operating system crashes,” in Architectural Support for Programming Languages and Operating Systems, 1996, pp. 74–83. [15] C. Giuffrida, L. Cavallaro, and A. S. Tanenbaum, “We crashed, now what?” in Proceedings of the Sixth international conference on Hot topics in system dependability, ser. HotDep’10. Berkeley, CA, USA: USENIX Association, 2010, pp. 1–8. [16] M. M. Swift, B. N. Bershad, and H. M. Levy, “Improving the reliability of commodity operating systems,” in Symposium on Operating Systems Principles, 2003, pp. 207–222. [17] N. Palix, G. Thomas, S. Saha, C. Calv`es, J. Lawall, and G. Muller,“Faults in linux: ten years later,” in Architectural Support for Programming Languages and Operating Systems, 2011, pp. 305–318.

Editor's Notes

  1. Rebooting. Within 1 second.by replace one os.on same machine.Backup.failover
  2. Restore faster. Protect file cache.
  3. Backup takeover function of active
  4. 0718433029 upali
  5. See in implementation
  6. Since NFS is a stateless application, the NFS server can handle client requests even if the server is rebooted after a sudden interruption. we create an NFS server process on both the active and the backup OS. A NFS client is running on a remote machine and communicating with the NFS server on the active OS through the network. The NFS server on the active OS exports a directory /export, which is a partition on the storage device. When the active OS crashed because of bugs, alive monitoring detects it, and the backup OS starts the failover to restore the NFS server.  First, the backup OS migrates the storage device and the NIC from the active OS to it. The storage device contains the partition for the NFS server. Then the backup OS mounts the partition on the same directory /export. After that, the backup OS migrates the dirty file cache that remains on the volatile main memory to it.  Next, attaching the same IP address to the NIC used by the active OS, the backup OS can communicate with remote NFS clients. In this stage, the backup OS transparently finishes the failover for the client. After that, the server begins processing the suspended operations, which are successfully processed for the client because the previous operations are never lost. Hence, our proposed system can protect files even if the OS crashed when the file was being written by the NFS clients. In this way, we migrate environments from the active to the backup OS, which transparently takes over for the crashed OS to the NFS clients.
  7. Ipi. Both message.Backup notice death