Design and Evaluation of an I/O Controller for Data Protection
1. DESIGN AND EVALUATION OF AN I/O CONTROLLER FOR DATA PROTECTION: DARC
PRESENTED BY:
ZARA TARIQ REG # 1537119
SAPNA KUMARI REG # 1537131
2. MOTIVATION
Data Integrity for data at rest through error detection and error correction
Human errors through transparent online versioning
Storage device failures through evolving RAID techniques
3. OUR APPROACH: DATA PROTECTION IN THE CONTROLLER
Use persistent checksums for error detection
If an error is detected, use the second (mirror) copy for recovery
Use versioning for dealing with human errors
After failure, revert to previous version
Perform both techniques transparently
Devices: can use any type of (low-cost) devices
Potential for high-rate I/O
Make use of specialized data-path & hardware resources
Perform (some) computations on data while they are in transit
5. BUFFER MANAGEMENT
Buffer pools
Pre-allocated, fixed-size
2 classes: 64KB for application data, 4KB for control information
Trade-off between space-efficiency and latency
I/O allocation/de-allocation overhead (see the buffer-pool sketch below)
Lazy de-allocation
De-allocate when:
Idle, or under extreme memory pressure
Command & completion FIFO queues
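The buffer-pool scheme above can be illustrated with a minimal sketch in C; the names and sizes are illustrative only, not the actual DARC firmware. Two size classes are pre-allocated once, allocation on the I/O path is a constant-time free-list pop, and release is lazy.

#include <stdlib.h>

#define DATA_BUF_SIZE  (64 * 1024)  /* 64KB class: application data    */
#define CTRL_BUF_SIZE  (4 * 1024)   /*  4KB class: control information */

/* One pool per size class; buffers are pre-allocated at start-up so that
 * allocation on the I/O path is a constant-time free-list operation.     */
struct buf_pool {
    size_t  buf_size;   /* size of each buffer in this class */
    size_t  nbufs;      /* number of pre-allocated buffers   */
    void  **free_list;  /* stack of currently free buffers   */
    size_t  nfree;      /* number of entries on free_list    */
};

static int pool_init(struct buf_pool *p, size_t buf_size, size_t nbufs)
{
    p->buf_size  = buf_size;
    p->nbufs     = nbufs;
    p->nfree     = 0;
    p->free_list = malloc(nbufs * sizeof(void *));
    if (!p->free_list)
        return -1;
    for (size_t i = 0; i < nbufs; i++) {
        void *b = malloc(buf_size);
        if (!b)
            return -1;
        p->free_list[p->nfree++] = b;
    }
    return 0;
}

/* Constant-time allocation: pop a buffer from the free list. */
static void *pool_get(struct buf_pool *p)
{
    return p->nfree ? p->free_list[--p->nfree] : NULL;
}

/* Lazy release: the buffer simply returns to the free list; memory is given
 * back to the system only when the controller is idle or under extreme
 * memory pressure (not shown).                                             */
static void pool_put(struct buf_pool *p, void *buf)
{
    p->free_list[p->nfree++] = buf;
}

At start-up the firmware would call pool_init() once per size class (64KB and 4KB), and then use pool_get()/pool_put() on the I/O path.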
6. CONTEXT SCHEDULING
Multiple in-flight I/O commands at any one time
I/O command processing proceeds in discrete stages, with several events/notifications triggered at each
1. Option-I: Event-driven
Design (and tune) dedicated FSM
Many events during I/O processing
E.g.: DMA transfer start/completion, disk I/O start/completion, …
2. Option-II: Thread-based
Encapsulate I/O processing stages in threads, schedule threads
We have used the thread-based option, running a full Linux OS (see the sketch below)
Programmable, infrastructure in-place to build advanced functionality more easily
but more s/w layers, with less control over timing of events/interactions
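A minimal sketch of the thread-based option, in C with POSIX threads; the stage bodies are placeholders rather than DARC code, and real firmware would block on DMA and disk events at each stage.

#include <pthread.h>
#include <stdio.h>

/* Discrete stages of I/O command processing; in the event-driven option,
 * each stage transition would instead be handled by a dedicated FSM.     */
enum io_stage { IO_FETCH_CMD, IO_DMA_TRANSFER, IO_DISK_IO, IO_COMPLETE };

struct io_cmd {
    int           id;
    enum io_stage stage;
};

/* Placeholder stage implementations (hypothetical): real code would issue
 * DMA descriptors and block-layer requests and wait for their events.    */
static void do_dma(struct io_cmd *c)      { c->stage = IO_DISK_IO; }
static void do_disk_io(struct io_cmd *c)  { c->stage = IO_COMPLETE; }
static void do_complete(struct io_cmd *c) { printf("cmd %d done\n", c->id); }

/* Thread body: one in-flight command per worker thread; the OS scheduler
 * interleaves the many concurrent commands.                              */
static void *io_worker(void *arg)
{
    struct io_cmd *c = arg;
    c->stage = IO_DMA_TRANSFER;
    do_dma(c);
    do_disk_io(c);
    do_complete(c);
    return NULL;
}

int main(void)
{
    pthread_t     tid[4];
    struct io_cmd cmds[4];

    for (int i = 0; i < 4; i++) {
        cmds[i] = (struct io_cmd){ .id = i, .stage = IO_FETCH_CMD };
        pthread_create(&tid[i], NULL, io_worker, &cmds[i]);
    }
    for (int i = 0; i < 4; i++)
        pthread_join(tid[i], NULL);
    return 0;
}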
7. ERROR DETECTION AND CORRECTION
The DARC approach to correcting errors combines two mechanisms (sketched below):
1. Error detection through the calculation of data checksums, which are persistently stored and checked on every read command
2. Error correction through data reconstruction using available data
redundancy schemes.
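A minimal sketch of this read-path combination in C; the checksum function and the read_copy callback are illustrative stand-ins, since DARC uses the DMA engine's checksum capability and its own RAID modules.

#include <stdint.h>
#include <stddef.h>

#define BLOCK_SIZE 4096

/* Toy checksum used only for illustration. */
static uint32_t block_checksum(const uint8_t *data, size_t len)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < len; i++)
        sum = (sum << 1 | sum >> 31) ^ data[i];
    return sum;
}

/* Hypothetical read path: verify the persistently stored checksum and, on a
 * mismatch, fall back to the redundant copy (RAID-1 style redundancy).     */
static int protected_read(uint64_t blk, uint8_t *out, uint32_t stored_csum,
                          int (*read_copy)(int copy, uint64_t blk, uint8_t *buf))
{
    for (int copy = 0; copy < 2; copy++) {
        if (read_copy(copy, blk, out) != 0)
            continue;                           /* device error: try mirror  */
        if (block_checksum(out, BLOCK_SIZE) == stored_csum)
            return 0;                           /* checksum matches: data OK */
        /* checksum mismatch: silent corruption, try the other copy          */
    }
    return -1;  /* both copies failed verification */
}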
8. ERROR CORRECTION PROCEDURE IN THE CONTROLLER I/O PATH
9. HOST-CONTROLLER I/O PATH
I/O commands [transferred via host-initiated PIO]
SCSI command descriptor block + DMA segments
DMA segments reference host-side memory addresses
I/O completions [transferred via controller-initiated DMA]
Status code + reference to originally issued I/O command
Options for transfer of commands
PIO vs DMA
PIO: simple, but with high CPU overhead
DMA: high throughput, but completion detection is complicated
Options for detecting completion: polling or interrupts (command/completion layouts are sketched below)
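A hypothetical sketch, in C, of the structures exchanged on this path; the field names and sizes are illustrative and not the actual DARC wire format.

#include <stdint.h>

#define MAX_DMA_SEGS 16

/* One scatter/gather segment referencing host-side memory. */
struct dma_seg {
    uint64_t host_addr;   /* address in host memory  */
    uint32_t length;      /* segment length in bytes */
};

/* I/O command: written by the host into the controller's command FIFO
 * via PIO (programmed I/O).                                            */
struct io_command {
    uint8_t        cdb[16];             /* SCSI command descriptor block */
    uint32_t       nsegs;               /* number of valid DMA segments  */
    struct dma_seg segs[MAX_DMA_SEGS];  /* host buffers for the transfer */
    uint64_t       tag;                 /* host-side command identifier  */
};

/* I/O completion: written by the controller into a host-resident FIFO
 * via a controller-initiated DMA.                                      */
struct io_completion {
    uint64_t tag;      /* references the originally issued command */
    uint32_t status;   /* SCSI status code                         */
};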
11. STORAGE VIRTUALIZATION
DARC uses the Violin block-driver framework for volume virtualization &
versioning
Violin is located above the SCSI (Small Computer System Interface) drivers in
the controller (Violin already provides versioning) and RAID modules.
M. Flouris and A. Bilas – Proc. MSST, 2005
VIOLIN
Provides new virtualization functions for extension modules
Combine these functions in storage hierarchies with rich semantics
Meta-data persistence
VIOLIN supports:
Asynchronous I/O (improves performance, but is challenging to implement)
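As a rough illustration of the idea of stacking virtualization modules in a block-driver framework (this is a generic sketch in C, not the Violin API), each layer translates a request and forwards it to the layer below:

#include <stdint.h>

/* Generic interface of a virtualization layer: each module exposes read
 * and write entry points and keeps a pointer to the layer below it.     */
struct vlayer {
    struct vlayer *below;
    int (*read)(struct vlayer *l, uint64_t blk, void *buf);
    int (*write)(struct vlayer *l, uint64_t blk, const void *buf);
};

/* A versioning layer could remap a written block to a new physical
 * location, so that the previous version stays intact, and then forward
 * the request to the layer below (e.g. a RAID module).                  */
static int versioned_write(struct vlayer *l, uint64_t blk, const void *buf)
{
    uint64_t new_blk = blk;   /* remap-table lookup and update omitted */
    return l->below->write(l->below, new_blk, buf);
}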
12. CONTROLLER ON-BOARD CACHE
Typically, I/O controllers have an on-board cache:
Exploit temporal locality (recently-accessed data blocks)
Read-ahead for spatial locality (prefetch adjacent data blocks)
Coalescing small writes (e.g. partial-stripe updates with RAID-5/6)
Many design decisions needed
RAID affects cache implementation
Performance
Failures (degraded RAID operation)
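A minimal sketch of such an on-board cache, in C: a direct-mapped block cache with naive read-ahead. The prototype's cache and its interaction with the RAID layer are more involved, and disk_read() is an assumed lower-layer call.

#include <stdint.h>
#include <stdbool.h>
#include <string.h>

#define CACHE_BLOCKS 1024
#define BLOCK_SIZE   4096
#define READAHEAD    4        /* blocks prefetched on a miss */

struct cache_line {
    bool     valid;
    bool     dirty;           /* a write path (not shown) would use dirty
                                 lines to coalesce small writes           */
    uint64_t blk;
    uint8_t  data[BLOCK_SIZE];
};

static struct cache_line cache[CACHE_BLOCKS];

int disk_read(uint64_t blk, void *buf);   /* assumed lower-layer call */

/* Read with simple read-ahead: on a miss, also prefetch adjacent blocks
 * to exploit spatial locality.                                          */
int cache_read(uint64_t blk, void *out)
{
    struct cache_line *c = &cache[blk % CACHE_BLOCKS];

    if (!(c->valid && c->blk == blk)) {          /* miss: fetch + prefetch */
        for (uint64_t b = blk; b < blk + READAHEAD; b++) {
            struct cache_line *p = &cache[b % CACHE_BLOCKS];
            if (disk_read(b, p->data) != 0)
                return -1;
            p->valid = true;
            p->dirty = false;
            p->blk   = b;
        }
    }
    memcpy(out, c->data, BLOCK_SIZE);
    return 0;
}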
14. I/O STACK IN DARC - “DATA PROTECTION CONTROLLER”
[Diagram: host I/O stack (user-level applications, system calls, Virtual File System (VFS), file system, raw I/O, SCSI layer, block-level device drivers) layered above the storage controller with its buffer cache]
15. CONCLUSION
I/O controllers are not so much limited by host connectivity capabilities as by internal resources and their allocation and management policies.
Incorporation of data protection features in a commodity I/O controller:
integrity protection using persistent checksums
versioning of storage volumes
Several challenges in implementing an efficient I/O path between the host
machine & the controller
16. REFERENCES
[1] T10 DIF (Data Integrity Field) standard. http://www.t10.org.
[2] Intel. Intel Xscale IOP Linux Kernel Patches.
http://sourceforge.net/projects/xscaleiop/files/.
[3] M. D. Flouris and A. Bilas. Violin: A framework for extensible block-level storage. In Proceedings of the 13th IEEE/NASA Goddard (MSST 2005) Conference on Mass Storage Systems and Technologies, pages 128–142, Monterey, CA, Apr. 2005.
[4] M. D. Flouris and A. Bilas. Clotho: transparent data versioning at the block I/O level. In Proceedings of the 12th IEEE/NASA Goddard (MSST 2004) Conference on Mass Storage Systems and Technologies, pages 315–328, 2004.
[5] G. A. Gibson, D. F. Nagle, K. Amiri, J. Butler, F. W. Chang, H. Gobioff, C. Hardin, E. Riedel, D. Rochberg, and J. Zelenka. A cost-effective, high-bandwidth storage architecture. In Proc. of the 8th ASPLOS Conference. ACM Press, Oct. 1998.
[6] E. K. Lee and C. A. Thekkath. Petal: distributed virtual disks. In Proceedings of the Seventh International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS VII), pages 84–93. ACM SIGARCH/SIGOPS/SIGPLAN, Oct. 1996.
17. REFERENCES
[7] A. Krioukov, L. N. Bairavasundaram, G. R. Goodson, K. Srinivasan, R. Thelen, A. C. Arpaci-Dusseau, and R. H. Arpaci-Dusseau. Parity lost and parity regained. In Proc. of the 6th USENIX Conf. on File and Storage Technologies (FAST 2008), pages 127–141, 2008.
[8] Microsoft. Optimizing Storage for Microsoft Exchange Server 2003. http://technet.microsoft.com/en-us/exchange/default.aspx
[9] C.-H. Moh and B. Liskov. Timeline: a high performance archive for a distributed object store. In NSDI, pages 351–364, 2004.
[10] V. Prabhakaran, L. N. Bairavasundaram, N. Agrawal, H. S. Gunawi, A. C. Arpaci-Dusseau, and R. H. Arpaci-Dusseau. IRON file systems. In Proc. of the 20th ACM Symposium on Operating Systems Principles (SOSP 2005), pages 206–220, Brighton, United Kingdom, October 2005.
[11] M. Fountoulakis, M. Marazakis, M. D. Flouris, and A. Bilas. DARC: Design and Evaluation of an I/O Controller for Data Protection. Foundation for Research and Technology - Hellas (FORTH), Greece, May 2010.
Editor's Notes
With increasing capacity, the rate of failures in storage devices increases as well. RAID technology masks device failures, but it assumes that the stored data is not corrupted. Given the possibility of data corruption, other forms of data protection become necessary as well:
Versioning, to recover from user errors
Checksums, to protect against silent data corruption events
These additional data protection techniques need persistent state (”metadata”), to be maintained & accessed at high I/O rates along the common I/O path.
It is important to guarantee constant overheads for buffer allocation and de-allocation. The allocation of buffers for I/O commands and their completions differs from that of other buffers used by the controller: command buffers are allocated on the controller side but filled in by the host; likewise, completion buffers are allocated on the host but filled in by the controller.
I/O commands are serviced in stages, with several events in each stage (e.g. DMA start/completion, disk I/O start/completion).
Option #1 is to design an event-driven FSM. Option #2 is to design a thread-based system, where each thread handles several of the many events.
In our prototype, we chose Option #2, running a full Linux OS. This gives us a programmable infrastructure to build advanced storage features more easily (in line with our goals for the prototype). However, with more software layers we have less control over the timing of events and interactions.
Error detection and correction (EDC) in DARC is done by using redundancy schemes (e.g. RAID-1), where corrupted data are rebuilt using parity or redundant data block copies. Redundancy schemes, such as RAID-1, 5, or 6, are commonly used for availability and reliability purposes in installed storage systems that must handle disk failures. DARC uses the same redundant data blocks to correct silent data errors, based on the assumption that the probability of data corruption affecting both copies of a disk block is very small. A similar assumption is made for other redundancy schemes, such as RAID-5/6, where the probability of a valid data restoration grows with the amount of redundancy maintained. The likelihood of a second data error occurring within the same group (typically 4-5 data blocks and 1-2 parity blocks in RAID-5/6) is higher than for the two data block copies in RAID-10. The checksum capability of the DMA engines in this platform incurs insignificant overhead when used during DMA data transfers to/from the host; therefore, checksums are used to protect host-accessible data blocks. An advantage of protecting only host-accessible data blocks is that RAID reconstruction remains unaware of the existence of persistent checksums and therefore does not need to be modified. Furthermore, the RAID reconstruction procedure can benefit from the data structure storing the checksums to determine which blocks can be safely skipped because they have never been written; this would reduce RAID reconstruction time.
If the stored checksum does not match the computed one, we proceed to check the mirror copies. This requires the capability to map the "virtual" block number to physical block locations on the devices that make up the RAID-10 volume. Once this mapping is available, the mirror copies of the referenced block are retrieved and their corresponding checksums are computed. These checksums are then compared to one another. If they match, we update the stored checksum and proceed to complete the read request. Otherwise, if one of them matches the stored checksum, we re-issue the original read request after synchronizing the mirror copies. A sketch of this procedure follows.
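A sketch of this procedure in C; every helper name here is a hypothetical stand-in for the controller's RAID-10, checksum, and metadata modules, not actual DARC code.

#include <stdint.h>
#include <stddef.h>

#define BLOCK_SIZE 4096

/* Assumed helpers (hypothetical names). */
void     map_virtual_to_physical(uint64_t vblk, uint64_t *p0, uint64_t *p1);
int      read_physical(uint64_t pblk, uint8_t *buf);
uint32_t block_checksum(const uint8_t *data, size_t len);
void     store_checksum(uint64_t vblk, uint32_t csum);
void     sync_mirrors(uint64_t vblk, int good_copy);

/* Check the mirror copies of a block whose stored checksum did not match. */
int verify_against_mirrors(uint64_t vblk, uint32_t stored_csum)
{
    uint8_t  copy0[BLOCK_SIZE], copy1[BLOCK_SIZE];
    uint64_t p0, p1;

    map_virtual_to_physical(vblk, &p0, &p1);   /* locate the RAID-10 mirrors */
    if (read_physical(p0, copy0) != 0 || read_physical(p1, copy1) != 0)
        return -1;

    uint32_t c0 = block_checksum(copy0, BLOCK_SIZE);
    uint32_t c1 = block_checksum(copy1, BLOCK_SIZE);

    if (c0 == c1) {
        store_checksum(vblk, c0);    /* mirrors agree: refresh stored checksum */
        return 0;                    /* complete the read with this data       */
    }
    if (c0 == stored_csum || c1 == stored_csum) {
        sync_mirrors(vblk, c0 == stored_csum ? 0 : 1);  /* resync from good copy */
        return 1;                    /* caller re-issues the original read       */
    }
    return -1;                       /* unrecoverable corruption */
}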
PIO is simple (load/store instructions to specific memory regions), but incurs CPU overhead; it is not suitable for large data volumes.
DMA offers high throughput (at PCIe rates), but is complex to initiate and its completion is complex to detect.
An I/O command consists of a SCSI CDB & DMA segments, with host-side addresses.
An I/O completion consists of a status code & a reference to the corresponding command.
Figure 5 illustrates the flow of I/O issue and completion from the host to the controller, and back. Commands are transferred from the host to the controller, using programmed I/O, into a circular first-in first-out (FIFO) queue.
The FIFO queues are statically allocated and consist of fixed-size elements. Commands, completions, and other control information may consume multiple consecutive elements of the FIFO queues. A simplified sketch follows.
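A simplified sketch of such a statically allocated circular FIFO in C; the element size, queue depth, and record layout are illustrative only.

#include <stdint.h>
#include <string.h>

#define FIFO_ELEMS 256     /* power of two, statically allocated      */
#define ELEM_SIZE  64      /* fixed-size queue element (illustrative) */

/* Circular FIFO of fixed-size elements; a record larger than one element
 * occupies several consecutive elements.                                */
struct fifo {
    uint8_t  elems[FIFO_ELEMS][ELEM_SIZE];
    uint32_t head;    /* producer index */
    uint32_t tail;    /* consumer index */
};

/* Enqueue a record that may span multiple consecutive elements. */
static int fifo_push(struct fifo *f, const void *rec, size_t len)
{
    size_t needed = (len + ELEM_SIZE - 1) / ELEM_SIZE;
    size_t used   = (f->head - f->tail) % FIFO_ELEMS;

    if (FIFO_ELEMS - used <= needed)
        return -1;                                     /* queue full */
    for (size_t i = 0; i < needed; i++) {
        size_t chunk = len < ELEM_SIZE ? len : ELEM_SIZE;
        memcpy(f->elems[(f->head + i) % FIFO_ELEMS],
               (const uint8_t *)rec + i * ELEM_SIZE, chunk);
        len -= chunk;
    }
    f->head = (f->head + needed) % FIFO_ELEMS;
    return 0;
}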
Typically, I/O controllers have an on-board cache. We have done some work in our prototype to incorporate a cache, and we will discuss the design issues involved.
Cache on the controller serves 3 purposes:
-temporal locality exploitation (if any)
-spatial locality exploitation, via read-ahead
-coalescing of small writes
Cache design is intertwined with the RAID implementation (distinction between normal and degraded-mode operation).
A summary of the prototype design, as presented so far …
Overview of our controller prototype, which is built using real hardware:
- SCSI-layer at Host & its co-ordination with the controller’s firmware.
- The controller’s firmware contains several layers, as described earlier.
In summary, we have designed and implemented two data protection features within a controller:
-persistent checksums for integrity protection
-versioning of storage volumes, to protect against human errors.
From our experiments, on real hardware, we have found the overhead of EDC checking to be in the range of 12 to 20%, depending on the number of concurrent I/Os. Versioning adds a further overhead, in the range of 2.5 to 5%, depending on the number & size of writes.