Paper on RDMA enabled Cluster FileSystem at Intel Developer Forum


This presentation, given at the Intel Developer Forum, describes a new architecture for the Cluster File System.

  • Talk about each component. Indicate that a SAN is a must. Only Symmetrix is available as direct-attached storage.
  • CVM allows failover without migration, the simultaneous access you expect, and common naming:
      • Simultaneous access to volumes from multiple hosts
      • Common logical device name
      • Consistent logical view of volumes from any host
      • Managed from any host in the cluster; updates seen by all nodes
      • Only raw device access supported from CVM
      • Volumes remain accessible from other hosts after a single host failure
      • Failover does not require volume migration
  • Good morning. It’s a little hard to follow that whimsical view of business, but I hope it gave you a view of the issues most of you face in your business today. What I’d like to do now is take you a little deeper into how VERITAS can offer you a strategy for increasing the availability of your business information.

    1. Cluster File System with RDMA
       Ramesh Balan, Manager, Cluster File System
       Somenath Bandyopadhyay, Staff Software Engineer, Cluster File System
       Veritas* Software
       September 9-12, 2002
       Copyright © 2002 Veritas Software Corporation.
    2. Agenda
       • What is Veritas* Cluster File System (CFS)?
       • Remote Direct Memory Access (RDMA)
       • Common RDMA Transport Access Layer (CRTL)
       • Advantages of InfiniBand* Architecture
    3. Cluster File System (CFS): Overview
       • Provides scalable data bandwidth
       • Single File System image across the cluster
       • POSIX* File System Semantics
           • Atomicity of Reads/Writes
           • Cache Coherency across nodes
       • Feature compatibility with VxFS
       • Single binary with VxFS, packaged as a licensable feature
    4. CFS Hardware Topology
       [Diagram: Node 1, Node 2, ..., Node n-1 and Node n, joined by a private network and attached to a Fibre storage switch.]
    5. Kernel Components Of CFS
       [Diagram labels: CFS, CVM, GLM, GAB, MSG, LLT, GAB MONITOR, Shared Disks, Network.]
       InfiniBand* Architecture enabled stack described later
    6. CFS - Architecture
       • Master / Slave Design for metadata I/O
       • Symmetrical design for data I/O
       • Masters are assigned on a per file system basis
       • Application transparency in presence of node crashes
       • Load balancing
       • Group membership and messaging
       • Group Lock Manager (GLM)
    7. CFS - Coherency
       • GLM locks are used for cluster wide coherency
       • Inode Cache Coherency
       • Buffer Cache Coherency
       • Page Cache Coherency
       • Directory Name Lookup Cache (DNLC) Coherency
       InfiniBand* Architecture: RDMA operations
    8. CFS - Recovery
       • Triggered by notification of a node crash
       • New master is elected for every file system mastered on the dead node
       • Data integrity preserved after a crash
       • Locks held by the dead node are recovered
       • Metadata operations continue after recovery
       InfiniBand* Architecture improves message latency
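The recovery rule on this slide, electing a new master for every file system mastered on the dead node, can be sketched as follows. The slides do not say how the new master is chosen, so the policy here (the surviving node that masters the fewest file systems, ties broken by name) is an assumption, and the names `reelect_masters` and `masters` are hypothetical.

```python
from collections import Counter


def reelect_masters(masters, dead_node, survivors):
    """Return a new file-system -> master map after `dead_node` crashes.

    Assumed policy (not from the slides): give each orphaned file system
    to the survivor that currently masters the fewest file systems,
    breaking ties by node name.
    """
    if not survivors:
        raise ValueError("no surviving nodes to take over mastership")
    new_masters = dict(masters)
    # Count how many file systems each survivor already masters.
    load = Counter({node: 0 for node in survivors})
    for node in masters.values():
        if node in load:
            load[node] += 1
    # Reassign only the file systems that were mastered on the dead node.
    orphaned = sorted(fs for fs, node in masters.items() if node == dead_node)
    for fs in orphaned:
        target = min(survivors, key=lambda n: (load[n], n))
        new_masters[fs] = target
        load[target] += 1
    return new_masters


masters = {"/fs1": "node1", "/fs2": "node2", "/fs3": "node1", "/fs4": "node3"}
after = reelect_masters(masters, "node1", ["node2", "node3"])
print(after)
```

File systems mastered on surviving nodes keep their master, which matches the per-file-system mastership described on the architecture slide.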
    9. Group Lock Manager (GLM)
       • Clustered reader-writer locks
       • Locking modes
           • Shared/update/exclusive
       • Locks are identified by a 32-byte string
       • Distributed lock mastership
       • Fair scheduling of lock requests
       Enhancements with InfiniBand* Architecture
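A single-process sketch of the three locking modes and fair scheduling named on this slide. It is not the GLM implementation. The compatibility matrix is an assumption (the slide only names the modes), fairness is modelled as a plain FIFO queue, and the `Lock` class and its methods are hypothetical.

```python
from collections import deque

# Assumed compatibility (not given on the slide): "shared" holders coexist,
# one "update" holder may coexist with "shared" holders, "exclusive" is alone.
COMPATIBLE = {
    ("shared", "shared"),
    ("shared", "update"),
    ("update", "shared"),
}


def compatible(requested, held):
    return (requested, held) in COMPATIBLE


class Lock:
    """One lock, named by a 32-byte identifier, with a FIFO wait queue."""

    def __init__(self, name: bytes):
        if len(name) != 32:
            raise ValueError("lock names are 32 bytes")
        self.name = name
        self.holders = {}       # node -> mode
        self.waiters = deque()  # (node, mode) in arrival order

    def _grantable(self, mode):
        return all(compatible(mode, held) for held in self.holders.values())

    def request(self, node, mode):
        """Grant at once if possible, else queue. Returns True if granted."""
        # Fairness: never overtake a request that is already waiting.
        if not self.waiters and self._grantable(mode):
            self.holders[node] = mode
            return True
        self.waiters.append((node, mode))
        return False

    def release(self, node):
        """Drop a hold, then grant queued requests in arrival order."""
        del self.holders[node]
        granted = []
        while self.waiters and self._grantable(self.waiters[0][1]):
            nxt, mode = self.waiters.popleft()
            self.holders[nxt] = mode
            granted.append(nxt)
        return granted


lock = Lock(b"inode-42".ljust(32, b"\0"))
print(lock.request("node1", "shared"))     # True
print(lock.request("node2", "update"))     # True
print(lock.request("node3", "exclusive"))  # False: queued
print(lock.request("node4", "shared"))     # False: queued behind node3
print(lock.release("node1"))               # []
print(lock.release("node2"))               # ['node3']
```

The shared request from `node4` waits behind the exclusive request even though it is compatible with the current holders, which is what keeps a stream of readers from starving a writer.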
    10. Group Membership and Atomic Broadcast (GAB)
        • Group Membership Service (GMS)
            • Maintains membership of the cluster in the face of joining and leaving nodes
            • Port registration interface
            • Cluster components register on different ports
        • Atomic Broadcast
            • Globally ordered atomic broadcasts
        • Unicast Messages (Directed)
        Issue: message latency, which InfiniBand* Architecture addresses
    11. Cluster Volume Manager (CVM)
        • Simultaneous access to volumes from multiple hosts
        • Common logical device name
        • Consistent logical view of volumes from any host
        • Managed from one host in the cluster; updates seen by all nodes
        • Volumes remain accessible from other hosts after a single host failure
        • Failover does not require volume migration
        • Mirroring and fast resync
        [Diagram labels: Cluster-shareable disk group, Master Node, Slave Node, InfiniBand* Architecture, Storage.]
    12. CFS Applications
        • Oracle Real Application Cluster* (RAC*)
        • Workflow applications (CAD/CAM)
        • File serving
        • Content serving
        • High Performance Computing (HPC)
    13. CFS with RDMA
        VERITAS* Software Corporation
        • Other names and brands may be claimed as the property of others
    14. Limitations with current technology
        • Excessive CPU utilization when interconnect load increases
        • Scope to improve scalability
        • Separate networks for IPC, storage and networking
        • Split Brain Problem
        • Takes time to detect node failures
    15. InfiniBand solves those limitations
        • RDMA
        • Split Brain
        • Virtual Lanes (QOS)
        • Bandwidth
        • Latency
        • Shared Memory
        • Shared data between nodes
        • Single point of failure
        • Handling fabric health
    16. Remote Direct Memory Access
        • Data transfer operations access application buffers directly
            • Zero copy
            • Reduced interrupts
            • Applications can directly schedule data transfer operations
        • Network protocol stack resides on the NIC
            • Lower latency
            • CPU offload
            • Enhanced scalability
        • Remote Direct Memory Access Flavors
            • InfiniBand* Architecture
            • RDMA over IP network
            • Direct Access Transport* (DAT)
            • Virtual Interface Architecture* (VIA)
    17. CRTL
        • Common RDMA Transport Access Layer
        • Building-block for use with various VERITAS* components
            • CFS/GLM/VCS/Oracle* Real Application Cluster (RAC)
        • Version 1
            • H1’03
            • CFS (VCS, GLM) / InfiniBand* Architecture / Linux
            • Designed for porting to different transports and vendors
    18. InfiniBand SourceForge Stack
        [Diagram labels: User, Kernel, Application(s), Sockets, VIPL, MPI, uDAPL, kDAPL, CRTL, SDP, SRP, IPoIB, Other Target Drivers, User-level Proxy Agent, InfiniBand Access Proxy, User-mode InfiniBand Access Interface, InfiniBand Access interface, User-mode HCA Driver Interface, HCA Driver Interface, Subnet Manager, InfiniBand Access (IB PnP, Mgmt. Svcs, SM Query, Resource Mgmt, Connection Mgmt, Work Request Processing). Legend: OSV Components, HCA Vendor Components.]
    19. CRTL Benefits
        • CFS
            • Reduce CPU consumption
            • Lower latencies from secondary nodes
            • RDMA write of file system metadata
            • Use for data pushes
            • Reduce load on shared storage by RDMA write of shared data (future)
            • Quality of service per file system
        • Low Latency Transport (LLT)
            • Improve heartbeat mechanism
            • Faster error detection
            • Automatic Path Migration (APM)
            • Can address Split Brain problem (future)
    20. Split Brain Problem
        • The mechanism that detects the liveness of a node can break down
            • Communication channel failure
            • Excessive load
        • Nodes not responding to heartbeats would be considered dead
        • The cluster is segmented, and each segment considers the others dead, causing split brain
        • InfiniBand* Architecture can address this issue (future)
        Provides Failover, APM, etc.
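A small simulation of the failure described on this slide: when heartbeat links between two groups of nodes fail, each group sees only itself and treats the rest as dead. The majority check in `has_quorum` is a common safeguard added here for illustration. It is an assumption and not the mechanism the slides propose. All function names are hypothetical.

```python
def live_view(node, nodes, links):
    """Nodes that `node` believes are alive: itself plus every node it can
    still reach over working heartbeat links (directly or through peers)."""
    seen, frontier = {node}, [node]
    while frontier:
        cur = frontier.pop()
        for other in nodes:
            if other not in seen and frozenset((cur, other)) in links:
                seen.add(other)
                frontier.append(other)
    return seen


def segments(nodes, links):
    """Split the cluster into groups whose members can still hear each other."""
    return {frozenset(live_view(n, nodes, links)) for n in nodes}


def has_quorum(segment, cluster_size):
    """Assumed safeguard: only a strict majority of the cluster may continue."""
    return len(segment) > cluster_size / 2


nodes = ["n1", "n2", "n3", "n4", "n5"]
full_mesh = {frozenset((a, b)) for a in nodes for b in nodes if a < b}
# Fail every heartbeat link that crosses between {n1, n2} and {n3, n4, n5}.
cut = {link for link in full_mesh if len(link & {"n1", "n2"}) == 1}
parts = segments(nodes, full_mesh - cut)

for part in sorted(parts, key=len):
    print(sorted(part), "quorum" if has_quorum(part, len(nodes)) else "no quorum")
```

Without some rule like the majority check, both segments would carry on as if they were the whole cluster, which is the split brain condition.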
    22. CRTL Future Plans
        • RDMA enabled storage access
            • SCSI RDMA Protocol (SRP)
            • Storage and IPC on same fabric
        • CFS enhancements
        • Investigate Distributed Shared Memory with RDMA
        • Enable data transfer applications with RDMA
        • Enhance other Veritas* applications for RDMA (e.g. Replication, Backup, …)
        • Other IB Stacks
            • We can port CRTL to your stack
    23. Advantages of CRTL
        • Ideal for cluster applications
            • Applications define their own endpoints
            • Ideal for applications that own both endpoints
        • CRTL can be a pass-through interface
            • Because the APIs are simple, CRTL can be implemented on DAT, on IB verbs, or on any other RDMA transport.
            • CRTL does not define new RDMA semantics; it keeps minimal state information.
            • Applications can make full use of the underlying transport.
        • Interoperability
            • CRTL can talk to the other side even if that side is coded to the verbs layer (not using CRTL).
            • It will interoperate with the SourceForge IB stack and other IB applications.
    24. InfiniBand* Performance
        • Performance results based on prototype and third-party data
        • Latency
            • IB message latency: InfiniBand Architecture vendors claim less than 10 microseconds for a reliable connection
            • Low CPU utilization
            • Reduced code path (at least 60% shorter)
            • Several investigations show very low CPU utilization
        • Bandwidth
            • Maximum possible bandwidth available today (10 Gbps)
            • IB supports 30 Gbps using 12x connectors
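The two bandwidth figures on this slide follow from the link widths of the original InfiniBand links: each lane signals at 2.5 Gbps, so a 4x link gives 10 Gbps and a 12x link gives 30 Gbps. The short calculation below also shows the usable data rate after 8b/10b line coding, a detail the slide does not mention. The function name `link_rates` is hypothetical.

```python
LANE_SIGNAL_GBPS = 2.5        # per-lane signalling rate of the original links
ENCODING_EFFICIENCY = 8 / 10  # 8b/10b line coding: 8 data bits per 10 line bits


def link_rates(width):
    """Return (signalling rate, data rate) in Gbps for a link of `width` lanes."""
    signal = LANE_SIGNAL_GBPS * width
    return signal, signal * ENCODING_EFFICIENCY


for width in (1, 4, 12):
    signal, data = link_rates(width)
    print(f"{width}x: {signal:g} Gbps signalling, {data:g} Gbps data")
```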
    25. InfiniBand* Benefits in CFS
        • QOS using Virtual Lanes
            • GLM, GAB, CFS messages in different virtual lanes
        • Nodes can share data using RDMA
            • Schedule RDMA WRITEs
        • Distributed Lock Manager
            • Use Atomic Operations
            • High performance Lock Manager
        • Eliminates Single Point of Failure (SPF)
            • Switched fabric architecture eliminates SPF in fabric
            • Failover Subnet Manager
        • Credit based flow control
            • Eliminates the need for additional flow control logic
        • Low latency switches
    26. What else can InfiniBand* Architecture do for CFS?
        • Looking for scope beyond the current specification
        • Handle Split Brain Problems
            • Exchanging DTOs and heartbeats over the same fabric reduces the problem (but doesn’t fix it completely)
            • Membership awareness
        • Eliminate traditional “polling the fabric” issues
            • Higher CPU utilization, extra network traffic
            • Detection depends on poll interval
            • Subject to operating system scheduling issues
            • Subnet Management Agent (SMA) monitors the fabric health using QP0. SMA notification can be used.
            • An RDMA WRITE of 0 bytes can fix the problem a little!
                • No network traffic/No WQE on receiver side
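To make "detection depends on poll interval" concrete, here is a toy model of a polling monitor. The model is an assumption made for illustration: the monitor probes at fixed intervals and declares a node dead after a fixed number of consecutive unanswered probes. The function name and parameters are hypothetical, and times are in milliseconds.

```python
import math


def detection_delay(crash_time, poll_interval, misses):
    """Delay between a crash and the monitor declaring the node dead.

    Assumed model: probes happen at poll_interval, 2 * poll_interval, ...
    A probe is answered only if it happens strictly before the crash, and
    the node is declared dead after `misses` consecutive unanswered probes.
    """
    first_failed_probe = math.ceil(crash_time / poll_interval) * poll_interval
    declared_dead_at = first_failed_probe + (misses - 1) * poll_interval
    return declared_dead_at - crash_time


# A crash just after a probe is detected almost `misses` intervals later.
print(detection_delay(crash_time=1001, poll_interval=1000, misses=3))  # 2999
# A crash exactly at a probe is detected (misses - 1) intervals later.
print(detection_delay(crash_time=2000, poll_interval=1000, misses=3))  # 2000
```

In this model the delay always falls between `(misses - 1) * poll_interval` and `misses * poll_interval`, so shortening it means polling more often, which is the CPU and traffic cost the slide lists. An event-driven notification avoids that trade-off.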
    27. What else can InfiniBand* Architecture do for CFS? (Cont.)
        • Separate heartbeat mechanisms
            • Application and Transport level heartbeat
            • RDMA WRITE with 0 bytes can be a transport level heartbeat mechanism, a better solution is…
        • Faster Error Recovery possibilities
            • Notify as soon as a node goes down
            • Link State Error Notification mechanism
        • More dynamic multicast operations
            • RDMA operations to ‘n’ multicast nodes
        • Load Sharing with Automatic Path Migration
            • Nodes involved in APM can share load when not migrated
    28. What’s Missing in InfiniBand* Architecture?
        • Problem
            • The only way to use InfiniBand effectively is to code to the verbs API
            • Different vendors provide different verbs APIs
        • Kernel level APIs
            • Common Verbs, Subnet Manager, General Services interface.
        • SourceForge stack for InfiniBand* Architecture kernel access
            • CFS is available on Solaris* and HP-UX*
            • CFS ports to AIX* and Linux are under way
            • Fixes the problem for Linux
            • Support from InfiniBand* Architecture HCA vendors is building
    29. Summary
        • Veritas* Cluster File System
        • Different Cluster components
        • CFS With RDMA
    30. Collateral
        • Where to get additional and updated information?
            • [email_address]
            • [email_address]
    31. Presentation Title
        Ramesh Balan
        Somenath Bandyopadhyay
        Veritas* Software
        Please remember to turn in your session survey form.
    32. Acronyms
        • CRTL: Common RDMA Transport Access Layer
        • CFS: Cluster File System
        • VIA: Virtual Interface Architecture
        • GAB: Group Membership and Atomic Broadcast
        • CVM: Cluster Volume Manager
        • GLM: Group Lock Manager
        • LLT: Low Latency Transport
    33. This presentation will be posted September 26th.
        The attendee password will be sent two weeks after the conference via email.
    34. Copyright © 2002 VERITAS Software Corporation. All rights reserved. VERITAS, the VERITAS logo and all other VERITAS product names and slogans are trademarks or registered trademarks of VERITAS Software Corporation. VERITAS and the VERITAS Logo Reg. U.S. Pat. & Tm. Off. Other product names and/or slogans mentioned herein may be trademarks or registered trademarks of their respective companies.