1. Stonehenge: Multi-Dimensional Storage Virtualization
  • Lan Huang, IBM Almaden Research Center
  • Joint work with Gang Peng and Tzi-cker Chiueh, SUNY Stony Brook
  • June 2004

2. Introduction
  • Storage growth is phenomenal: new hardware
  • Isolated storage: resource waste
  • Management
  • Huge amount of data, heterogeneous devices, spread out everywhere.
  [Figure: clients, database server, and file server connected over an IP LAN/MAN/WAN [Patterson'98]]

3. Storage Virtualization
  • Examples: LVM, xFS, StorageTank
  • Hide physical details from high-level applications
  [Figure: analogy between the OS abstracting hardware resources (disks, controllers) behind an abstract interface for applications, and storage virtualization presenting virtual disks to clients on top of physical disks]

4. Storage Virtualization
  • Storage consolidation
  • VD as tangible as PD:
    - Capacity
    - Throughput
    - Latency
  • Resource efficiency
    - Ei

5. Stonehenge Overview
  • Input: VD (B, C, D, E)
  • Output: VDs with performance guarantees
  • High-level goals:
    - Storage consolidation
    - Performance isolation
    - Efficiency
    - Performance
  [Figure: clients, database server, and file server reach the Stonehenge cluster (LAN) over an IP LAN/MAN/WAN]

6. Hardware Organization
  [Figure: storage servers with disk arrays and clients (application plus in-kernel storage clerk) on a Gigabit network; the storage manager exchanges control messages while data/commands flow directly between clients and servers over object and file interfaces]

7. Key Issues in Stonehenge
  • How to ease the task of storage management?
    - Centralization
    - Virtualization
    - Consolidation
  • How to achieve performance isolation among virtual disks?
    - Run-time QoS guarantees
  • How to do it efficiently?
    - Efficiency-aware algorithms
    - Dynamic adaptive feedback

8. Key Components
  • Mapper
  • CVC scheduler
  • Feedback path between them

9. Virtual to Physical Disk Mapping
  • Multi-dimensional disk mapping: NP-complete
  • Goal: maximize resource utilization
  • Heuristics: maximize a goal function [toyota75]
    - Input: VDs, PDs
    - Goal function G: max(G)
    - Output: VD-to-PD mapping
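The mapping step can be illustrated with a short sketch. This is a minimal, hypothetical example rather than the actual Stonehenge mapper: it places each VD on a single PD (Stonehenge may spread a VD over several PDs), and the goal function shown, which prefers the PD that keeps capacity and bandwidth headroom balanced, is only one of many possible choices; the slides note that several goal functions were tried with similar results.

```python
# Minimal sketch of multi-dimensional VD-to-PD mapping.  Exact mapping is
# NP-complete, so a greedy heuristic driven by a goal function is used; the
# goal function below is a hypothetical example.
from dataclasses import dataclass

@dataclass
class PD:                 # physical disk: remaining resources
    capacity: float       # GB left
    iops: float           # IOPS left

@dataclass
class VD:                 # virtual disk request
    capacity: float
    iops: float

def goal(pd: PD, vd: VD) -> float:
    """Score a candidate placement (higher is better): keep the fractional
    capacity and bandwidth headroom as balanced as possible."""
    cap_left = pd.capacity - vd.capacity
    iops_left = pd.iops - vd.iops
    if cap_left < 0 or iops_left < 0:
        return float("-inf")          # does not fit in some dimension
    return min(cap_left / pd.capacity, iops_left / pd.iops)

def map_vd(vd: VD, pds: list[PD]) -> PD | None:
    """Greedy step: place the VD on the PD that maximizes the goal function."""
    best = max(pds, key=lambda pd: goal(pd, vd), default=None)
    if best is None or goal(best, vd) == float("-inf"):
        return None                   # rejected: no PD fits (the islands effect)
    best.capacity -= vd.capacity      # reserve the resources
    best.iops -= vd.iops
    return best
```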
10. Islands Effect
  [Figure: VDs (1-4) mapped onto PDs, illustrating the islands effect]

11. Key Components
  • Mapper
  • CVC scheduler
  • Feedback path between them

12. Requirements of Real-Time Disk Scheduling
  • Disk specific
    - Improve disk bandwidth utilization (SATF, CSCAN, etc.)
  • Non disk specific
    - Meet real-time requests' deadlines
    - Fair disk bandwidth allocation among virtual disks (virtual clock scheduling)
  • Key: bandwidth guarantee
  [Figure: request latency breakdown into seek, rotation, transfer, and other]

13. CVC Algorithm
  • Two queues:
    - FT: FT(i) = max(FT(i-1), real time) + 1/IOPS_m
    - LBA
  • The LBA queue is used only if the FT queue's slack time allows it:
    real time + service time(R) < starting deadline of the next request
  [Figure: the CVC scheduler serving VD(m) from its FT and LBA queues]
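A minimal sketch of the two-queue decision, assuming a single pending pool viewed in both finishing-time and LBA order; the Request fields, the service_time() callback (standing in for the measured I/O-time estimate discussed later), and the use of the lowest LBA as the "efficient" choice are simplifications, not the actual Stonehenge data structures.

```python
import time
from dataclasses import dataclass

@dataclass
class Request:
    lba: int              # logical block address
    ft: float = 0.0       # virtual-clock finishing tag

def assign_ft(req: Request, prev_ft: float, now: float, iops_m: float) -> float:
    """FT(i) = max(FT(i-1), real time) + 1/IOPS_m for the owning VD m."""
    req.ft = max(prev_ft, now) + 1.0 / iops_m
    return req.ft

def pick_next(pending: list[Request], service_time) -> Request | None:
    """CVC decision: follow LBA order while the earliest FT deadline still
    holds, otherwise fall back to finishing-time order for fairness."""
    if not pending:
        return None
    now = time.time()
    by_ft = min(pending, key=lambda r: r.ft)         # most urgent request
    by_lba = min(pending, key=lambda r: r.lba)       # cheapest for the disk head
    start_deadline = by_ft.ft - service_time(by_ft)  # latest time it may start
    choice = by_lba if now + service_time(by_lba) < start_deadline else by_ft
    pending.remove(choice)
    return choice
```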
14. Real-Life Deployment
  • Dispatch the next N requests from the LBA queue.
  • The next batch is not issued until the previous batch is done.
  [Figure: the CVC scheduler feeds the storage controller and on-disk scheduler from VD(m)'s FT and LBA queues]
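A small extension of the same sketch for batched dispatch: a batch of up to N LBA-ordered requests is handed down only when the slack before the earliest FT deadline covers the batch's estimated service time. Again illustrative only; it reuses the Request and service_time() conventions from the sketch above.

```python
def pick_batch(pending, service_time, n, now):
    """Issue up to n LBA-ordered requests only if the FT slack covers the whole
    batch; the next batch is not issued until this one completes."""
    if not pending:
        return []
    by_ft = min(pending, key=lambda r: r.ft)
    slack = (by_ft.ft - service_time(by_ft)) - now
    batch = sorted(pending, key=lambda r: r.lba)[:n]
    if sum(service_time(r) for r in batch) < slack:
        for r in batch:
            pending.remove(r)
        return batch                  # controller / on-disk scheduler may reorder it
    pending.remove(by_ft)
    return [by_ft]                    # under deadline pressure, serve the urgent one
```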
15. A Two-Level CVC Scheduler
  [Figure: messages 1-6 exchanged among the client, the manager, and three servers in the two-level scheduling protocol]

16. CVC Performance
  • 3 VDs with real-life traces: video stream, web, financial, TPC-C
  • Traces touch 40% of the storage space
  [Figures: results for video streams and mixed traces]

17. Impact of Disk I/O Time Estimate
  • Model disk I/O time?
    - ATA disk: impossible [ECSL TR-81]
    - SCSI disk: possible?
  • Run-time measurement: P(I/O time)
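One way to obtain P(I/O time) at run time is to keep a sliding window of measured service times and query percentiles from it. A minimal sketch; the window size and the percentile used are illustrative choices, not values from the paper.

```python
from collections import deque

class IOTimeEstimator:
    """Sliding window of measured service times; answers percentile queries,
    i.e. an empirical P(I/O time)."""
    def __init__(self, window: int = 1000):
        self.samples = deque(maxlen=window)

    def record(self, io_time: float) -> None:
        self.samples.append(io_time)

    def percentile(self, p: float) -> float:
        if not self.samples:
            return 0.0
        ordered = sorted(self.samples)
        return ordered[min(len(ordered) - 1, int(p * len(ordered)))]

# usage: est.record(measured_time); service_estimate = est.percentile(0.95)
```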
18. CVC Latency Bound
  • If the traffic generated within the period [0, t] satisfies V(t) <= T + r*t, then
      D <= (T + Lmax)/Bi + Lmax/C                                   (1)
  • Storage system (adding the average overhead k):
      D <= ((N+1)*k*C + T + Lmax)/Bi + (k*C + Lmax)/C               (2)
  • Stonehenge (simplified, in terms of I/O rate):
      D <= (N+1)/IOPS_i + 1/IOPS_max                                (3)
  [Figure: N requests and T bytes queued in VD(m)'s FT queue ahead of the tagged request; per-request time split into seek, rotation, transfer, and other]
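A worked numeric instance of the simplified bound (3), with made-up example numbers purely to show the arithmetic.

```python
# D <= (N+1)/IOPS_i + 1/IOPS_max            -- bound (3)
N        = 8        # requests already queued ahead in VD i's FT queue (example)
IOPS_i   = 100.0    # I/O rate reserved for VD i (example)
IOPS_max = 200.0    # peak I/O rate of the disk (example)

D = (N + 1) / IOPS_i + 1.0 / IOPS_max
print(f"worst-case latency bound D = {D * 1000:.1f} ms")   # 9/100 + 1/200 = 95.0 ms
```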
19. Key Components
  • Mapper
  • CVC scheduler
  • Feedback path between them
    - Relaxing the worst-case service time estimate
    - VD multiplexing effect

20. Empirical Latency vs. Worst Case
  • Approximate P(service time, N) with P(service time, N-1)
  • Q is P's inverse function
  • D <= (Q(0.95) + s) * [(N+1)/IOPS_i + 1/IOPS_max]
  [Figure: distribution curves; a point (x, y) means y percent of requests see at most x percent of the worst-case delay]
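A sketch of how the measured empirical-to-worst-case ratio can relax the bound, assuming each completed request logs its observed latency as a fraction of its worst-case bound; the safety margin s = 0.05 is an arbitrary example value, not the paper's.

```python
def q_ratio(observed_over_worst: list[float], p: float = 0.95) -> float:
    """Q(p): the p-quantile of observed latency expressed as a fraction of the
    worst-case bound (1.0 would mean the worst case actually occurs)."""
    ordered = sorted(observed_over_worst)
    return ordered[min(len(ordered) - 1, int(p * len(ordered)))]

def relaxed_bound(n: int, iops_i: float, iops_max: float,
                  ratios: list[float], s: float = 0.05) -> float:
    """D <= (Q(0.95) + s) * [(N+1)/IOPS_i + 1/IOPS_max]; s is a safety margin."""
    worst = (n + 1) / iops_i + 1.0 / iops_max
    return (q_ratio(ratios) + s) * worst
```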
21. Bursty I/O Traffic and P_spare
  • Self-similar
  • Multiplexing effect: P_spare(x)
  [Figures: bursty per-VD traffic and the multiplexed aggregate, from which the spare-bandwidth distribution P_spare(x) is measured]
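The multiplexing effect can be captured by measuring how much of IOPS_max is left over by the aggregate load. A small sketch of deriving Q_spare(E) from periodic samples of the aggregate I/O rate; the sampling scheme is an assumption, not the paper's exact procedure.

```python
def q_spare(aggregate_iops_samples: list[float], iops_max: float, e: float) -> float:
    """Q_spare(E): the spare I/O rate available with probability >= E,
    derived from measured samples of the aggregate (multiplexed) load."""
    spare = sorted(iops_max - load for load in aggregate_iops_samples)
    if not spare:
        return 0.0
    # the value at index (1-E)*len is met or exceeded in roughly an E fraction of samples
    return spare[min(len(spare) - 1, int((1.0 - e) * len(spare)))]
```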
22. Latency or Throughput Bound
  • (B_throughput, C, D, E)
  • D --> B_latency
  • (B_throughput, C, B_latency, E)
  • B_throughput >= B_latency: throughput bound
  • B_throughput < B_latency: latency bound
  [Figure: B_throughput vs. B_latency ("or even less?")]

23. MBAC for Latency-Bound VDs
  • When the jth VD with requirements (Dj, IOPS''j, Cj, E) comes:
    1. For 0 < i <= j, convert Di to IOPS'i:
         Di <= (Q_service(0.95) + s) * [(N+1)/IOPS'i + 1/IOPS_max]
       Let IOPSi = max(IOPS'i, IOPS''i)
    2. If sum(IOPSi) < IOPS_max, accept the new VD; otherwise, reject.
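The conversion from a latency target D_i to an I/O-rate reservation IOPS'_i simply inverts the relaxed bound above. A sketch of the latency-bound admission test; VDs are represented as (D, IOPS'') pairs and q095, s, n, iops_max are passed in explicitly, which is a simplification of the real interface.

```python
def latency_to_iops(d_i: float, q095: float, s: float, n: int, iops_max: float) -> float:
    """Invert D_i <= (Q(0.95)+s) * [(N+1)/IOPS'_i + 1/IOPS_max] for IOPS'_i."""
    denom = d_i / (q095 + s) - 1.0 / iops_max
    if denom <= 0:
        return float("inf")           # latency target unreachable on this disk
    return (n + 1) / denom

def admit_latency_bound(vds, new_vd, q095, s, n, iops_max) -> bool:
    """MBAC for latency-bound VDs: each VD reserves max(IOPS'_i, IOPS''_i) and
    all reservations together must fit within IOPS_max.  VDs are (D, IOPS'') pairs."""
    total = 0.0
    for d, iops_req in vds + [new_vd]:
        total += max(latency_to_iops(d, q095, s, n, iops_max), iops_req)
    return total < iops_max
```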
24. MBAC Performance: P_service

  Table 1. Maximum number of VDs accepted.
    Run   | VD Type   | Probability | Deterministic | MBAC | Oracle
    Run 1 | Financial | 95%         | 7             | 20   | 22
    Run 2 | Mixed     | 95%         | 7             | 14   | 14
    Run 3 | Mixed     | 85%         | 7             | 17   | 17

  Table 2. Resource reservation.
    Number of VDs   | 7   | 9   | 10  | 11  | 13  | 14  | 15
    Q_service(0.95) | 11% | 15% | 19% | 24% | 37% | 49% | -
    MBAC            | N/A | 38% | 43% | 47% | 55% | 67% | 95%
    Deterministic   | 90% | -   | -   | -   | -   | -   | -

25. MBAC for Throughput-Bound VDs
  • When the jth VD (Dj, IOPS''j, Cj, E) comes:
    - Convert Dj to IOPS'j:
        Dj <= (Q_service(0.95) + s) * [(N+1)/IOPS'j + 1/IOPS_max]
    - Let IOPSj = max(IOPS'j, IOPS''j)
    - If IOPSj < Q_spare(E), admit the new VD; otherwise, reject it.

26. MBAC Performance: P_spare
  [Figures: VD 0 (TPC-C), VD 1 (financial), VD 2 (web search)]

27. Measurement-Based Admission Control (MBAC)
  • When the jth VD with requirements (Dj, IOPS''j, Cj, E) comes:
    1. For 0 < i <= j, convert Di to IOPS'i:
         Di <= (Q_service(0.95) + s) * [(N+1)/IOPS'i + 1/IOPS_max]
       Let IOPSi = max(IOPS'i, IOPS''i)
    2. Group the VDs into two sets: a throughput-bounded set T and a latency-bounded set L.
    3. For the throughput-bound VDs, calculate the combined Q_I/O_rate; let Q_spare(x) = IOPS_max - Q_I/O_rate(x).
    4. If sum(IOPS(L)) < Q_spare(E), accept the new VD; otherwise, reject.
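Putting the pieces together, a sketch of the combined test: throughput-bound VDs (whose rate requirement dominates the rate implied by their latency target) are covered by the measured spare-bandwidth distribution, while latency-bound VDs' converted reservations must fit inside Q_spare(E). It reuses latency_to_iops() and q_spare() from the earlier sketches and is illustrative only.

```python
def admit(vds, new_vd, q095, s, n, iops_max, aggregate_iops_samples, e=0.95) -> bool:
    """Combined MBAC.  Each VD is a (D, IOPS'') pair.  A VD is throughput-bound
    when its rate requirement dominates the rate implied by its latency target;
    its load is then already reflected in the measured aggregate used for Q_spare."""
    latency_reservation = 0.0
    for d, iops_req in vds + [new_vd]:
        iops_lat = latency_to_iops(d, q095, s, n, iops_max)   # earlier sketch
        if iops_req >= iops_lat:
            continue                  # throughput-bound: covered by Q_spare below
        latency_reservation += iops_lat
    spare = q_spare(aggregate_iops_samples, iops_max, e)      # earlier sketch
    return latency_reservation < spare
```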
28. Issues with Measurement
  • Stability
    - I/O rate pattern is stable
    - Boundary case for P_service
  • Overhead of monitoring
    - Trivial
  • Window size

29. Putting It All Together: Stonehenge
  • Functionality: a general-purpose IP storage cluster
  • Performance
    - Scheduling
  • Efficiency
    - Measurement

30. Software Architecture
  [Figure: kernel-level modules (FETD, the front-end target driver; iSCSI initiator; Stonehenge virtual and physical tables; scheduler queues; IDE mid-layer driver) and user-level modules (disk mapper, admission controller, traffic-shaping scheduler)]

31. Effectiveness of QoS Guarantees in Stonehenge
  [Figures: (a) CVC, (b) CSCAN, (c) deadline violation percentage]

32. Impact of Leeway Factor
  [Figures: overload probability and violation percentage]

33. Overall System Performance and Latency Breakdown
  • 1 GHz CPU
  • IBM 7200 ATA disk array
  • Promise IDE controllers
  • 64-bit 66 MHz PCI bus
  • Intel GB NICs
  • A maximum of 55 MB/sec per server.

    Software module | Average latency (usec)
    iSCSI client    | 57
    iSCSI server    | 507
    Disk access     | 1360
    Central         | 50
    Network delay 1 | 574
    Network delay 2 | 2

34. Related Work
  • Storage management: Minerva etc. at HPL
  • Efficiency-aware disk schedulers: Cello, Prism, YFQ
  • Run-time QoS guarantees: web servers, video servers, network QoS
  • IP storage

35. Conclusion
  • The IP storage cluster consolidates storage and reduces fragmentation by 20-30%.
  • The efficiency-aware CVC real-time disk scheduler with dynamic I/O time estimation provides performance guarantees and good disk-head utilization.
  • Measurement feedback effectively remedies over-provisioning:
    - Latency: P_service, 2-3 fold
    - Throughput: P_spare, 20%
    - I/O time estimate: P_I/O_time
    - Load imbalance: P_leeway
Speaker Notes
  • According to industry studies, the growth of storage capacity in an enterprise or organization is phenomenal. Along with capacity growth, storage hardware is also changing significantly: areal density keeps increasing, and new devices such as storage routers, SANs, and NAS appear. We are left with huge amounts of data on a wide variety of devices spread out everywhere, which makes storage management extremely difficult. The whole purpose of deploying a SAN is to centralize storage and ease management, but the management problem is more complicated than connecting a set of disks together through Fibre Channel. One important aspect is software abstraction, also known as storage virtualization; another is that different applications accessing a central SAN require appropriate performance isolation.
  • Storage virtualization enables flexible storage management. Just as the OS hides the details of the CPU, memory, disks, and other hardware resources from high-level applications, storage virtualization aims to do the same. The Unix OS provides a standard interface that user-level applications can use easily; storage virtualization provides a virtual disk interface. Users open a virtual disk and create a file system without knowing the hardware details. The virtualization engine can perform disk replacement, data migration, and load balancing underneath the virtual disk without interfering with the clients' tasks. This flexibility is very important for system availability and business continuity.
  • Several previous systems, such as xFS, StorageTank, and LVM, connect a group of distributed disk arrays together and provide a logical volume. The logical volume is essentially a virtual disk, but it differs from a physical disk: it carries no throughput or latency guarantee; only the volume capacity is guaranteed. For application servers that need performance assurance, a logical volume is not enough. Our vision is that a virtual disk should be as tangible as a physical disk, so the virtualization should be multi-dimensional, covering capacity and performance alike. Because QoS guarantees unavoidably translate into over-provisioning, we introduce a fourth, probabilistic parameter Ei to give us room for improving resource efficiency, which we discuss in later slides.
  • To address the issues of storage consolidation and performance isolation, we built Stonehenge. Stonehenge does not introduce any new hardware such as Fibre Channel fabrics or storage routers; the Stonehenge software glues the isolated storage together into a shareable storage pool. The clients can be application servers, including file servers, web servers, and database servers. Stonehenge enforces run-time guarantees for performance isolation.
  • This picture lists the components in a Stonehenge system. A group of storage servers provides storage space, coordinated by one central node, the storage manager. Each client installs the Stonehenge client software and talks to the storage servers through iSCSI, an IP-based protocol. Data flows directly between servers and clients; the manager only sends small control messages to clients and servers.
  • To summarize the key issues in Stonehenge, these are the questions that must be answered.
  • The two key components are the VD-to-PD mapper and the CVC scheduler for run-time performance guarantees. To improve resource efficiency, a feedback path is added between run-time monitoring and the static configuration.
  • Given a VD, Stonehenge needs to assign a set of PDs to place it on. The VD carries capacity, throughput, and latency parameters, so the mapping can be formalized as a multi-dimensional bin-packing problem, which is NP-complete. We therefore use heuristics to map a VD to PDs. The standard procedure is to define a goal function and choose the mapping that maximizes it; different algorithms feature different goal functions. We tried a set of goal functions with different flavors, and the resulting resource utilization did not differ much for the traces we studied.
  • After a VD is assigned to a set of PDs, the CVC scheduler enforces run-time performance guarantees.
  • A real-time disk scheduler has two sets of requirements. The disk-specific requirement is to achieve reasonable disk bandwidth utilization: in the breakdown of an average request's latency, less than 5% is spent on actual data transfer and the rest is overhead, which a good disk scheduler can reduce. The other, standard requirements for a real-time scheduler are fair bandwidth allocation and a latency bound. The key to both is the bandwidth guarantee, since a latency bound can be converted into a related bandwidth reservation.
  • The Stonehenge CVC scheduler tries to achieve efficiency and fair bandwidth allocation simultaneously. It maintains two queues: a finishing-time (FT) queue and a utilization queue ordered by LBA. The finishing time is computed as in a standard virtual clock scheduler, mimicking the finishing time the request would experience under TDM. The optimization we add is that, rather than following FT order all the time, we follow LBA order as much as possible as long as the FT deadlines can still be met; the scheduler evaluates the slack-time condition and decides which request to dispatch next.
  • In a real deployment it is slightly more complicated. The storage controller and the on-disk scheduler do their own scheduling to achieve better disk-head utilization. To take advantage of this, CVC issues a batch of requests at a time, and the slack time needs to be larger than the service time of those N requests.
  • The Stonehenge CVC scheduler has a two-tier structure in which data and control are decoupled. The advantages are ease of management and keeping the server nodes as simple as possible.
  • Performance gain:
  • One key issue is how to strike a balance between aggressively following LBA order and violating deadlines; the accuracy of the slack-time estimate is critical for that. We have two approaches to estimating disk I/O time: modeling and run-time measurement. Modeling turns out to be extremely difficult, if not impossible, because of the complexity of disk internal mechanisms; we studied this direction and report the negative results in a technical report. In Stonehenge we use a measurement-based approach.
  • In the network QoS literature, the virtual clock scheduler has been analyzed to derive the latency bound shown in formula (1). To adapt it to a storage system we add the overhead terms, where k is the average overhead; the rationale is that the overhead time corresponds to an equivalent number of bytes transferred. Stonehenge uses a simplified version expressed in I/O rate: the first term is the draining time of the requests ahead of the tagged request, and the second term is the preemption delay. Empirically, the worst case is not common; since over-provisioning underutilizes resources so much, can we do something about the first term?
  • We therefore add a feedback path between the mapper and the run-time engine.
  • We start by measuring the disk response-time distribution curve. A data point (x, y) on a curve means that y percent of requests experience at most x percent of the worst-case delay. To find a representative ratio for, say, 90% of the requests, we draw a horizontal line crossing all the curves, and the x values of the intersection points give the empirical-to-worst-case ratios. This ratio grows close to 1 as the system becomes more and more loaded. We also add a disturbance factor to the latency bound.
  • Burstiness is shown in the left figure and the multiplexing effect in the right figure: multiplexing yields roughly 20% more spare bandwidth.
  • More here.
  • More here.
  • Here is the admission control procedure in Stonehenge. We will see how P_service and P_spare contribute to the admission improvements.
  • I will skip the software architecture.
  • Bandwidth isolation. Why are there still violations under CVC? Because of the leeway factor and inaccurate disk I/O time estimates.