Windows Server 2012 R2 - Boosted by Mellanox

Presentation at the joint session with Microsoft Japan - 2013/12/4

Transcript

  • 1. Windows Server 2012 R2 – Boosted by Mellanox
    Mellanox Technologies Japan K.K. – Senior System Engineer, 友永 和総 – December 4, 2013
  • 2. Session Agenda
    - Mellanox overview
    - Microsoft + Mellanox: SMB Direct, NVGRE offload
    - Supplementary notes: RoCE configuration, considerations for I/O consolidation
    - Reference material: SMB Direct – Protocol Deep Dive
  • 3. Company Overview (Ticker: MLNX)
    - Leading provider of high-bandwidth, low-latency interconnects for the server and storage markets
      - FDR InfiniBand (56Gbps) and 10/40/56 Gigabit Ethernet supported on common hardware
      - Faster data access dramatically improves application performance
      - Large reductions in node count and better manageability dramatically improve the ROI of data-center IT infrastructure
    - Headquarters and headcount: Yokneam (Israel) and Sunnyvale (USA); roughly 1,400 employees worldwide*
    - Solid financials: FY2011 revenue $259.3M (up 67.6%); FY2012 revenue $500.8M (up 93.2%); cash + investments at 9/30/13 = $306.4M
    * As of September 2013
  • 4. Mellanox Core Technology: High-Performance, Highly Integrated ASICs
    - VPI (Virtual Protocol Interconnect) technology
      - InfiniBand and Ethernet on a single chip (ConnectX-3, SwitchX-2), selectable port by port
      - InfiniBand/Ethernet bridging (SwitchX-2)
    - High throughput, low latency, ultra-low power
    - RDMA (Remote Direct Memory Access) support for high-speed data transfer
    - VXLAN/NVGRE offload (ConnectX-3 Pro)
    [Diagram: ConnectX-3 adapter silicon (2x FDR 56Gb/s InfiniBand or 1/10/40/56GbE, PCIe 3.0 x8/x16, typical power 7.9W for 2-port 40GbE) and SwitchX-2 switch silicon (144 network SerDes; 36x 40/56GbE, 64x 10GbE, or 48x 10GbE + 12x 40/56GbE; 63–83W at 100% load)]
  • 5. Leading Supplier of End-to-End Interconnect Solutions
    [Diagram: Virtual Protocol Interconnect spanning server/compute and storage (front/back-end) through switches/gateways – 56G InfiniBand & FCoIB, 10/40/56GbE & FCoE]
    - Comprehensive end-to-end InfiniBand and Ethernet portfolio: ICs, adapter cards, switches/gateways, host/fabric software, metro/WAN, cables/modules
  • 6. The Future Depends on the Fastest Interconnects (1Gb/s → 10Gb/s → 40/56Gb/s)
  • 7. Top-Tier OEMs, ISVs and Distribution Channels
    [Logo slide: hardware OEMs (server, storage, embedded, medical), software partners, and selected channel partners]
  • 8. InfiniBand Enables the Lowest Application Cost in the Cloud (Examples)
    - Microsoft Windows Azure Cloud: 90.2% cloud efficiency, 33% lower cost per application
    - Application performance improved up to 10X; 3X increase in VMs per physical server; consolidation of network and storage I/O
    - 32% lower cost per application; 694% higher network performance
  • 9. Microsoft + Mellanox
  • 10. Mellanox Technologies in Microsoft Solutions
    - SMB Direct (RDMA) – technology introduced in Windows Server 2012 that dramatically raises I/O performance; enabled by Mellanox ConnectX-3 (InfiniBand and 10G/40G Ethernet NICs)
      - Hyper-V over SMB Direct: higher consolidation ratios and better application performance
      - Hyper-V RDMA Live Migration (Windows Server 2012 R2): shorter live-migration times, simpler and more efficient operations
      - Microsoft SQL Server 2012 AlwaysOn: mirrored DB servers connected over a low-latency Mellanox network for higher DB write performance
      - Hyper-V-based VDI solutions: double the number of VDI clients on the same hardware configuration, improving cost performance
      - Microsoft SQL Server 2012 Parallel Data Warehouse V2: high-speed database appliance built on SMB Direct
    - NVGRE offload (Windows Server 2012 R2) – enabled by Mellanox ConnectX-3 Pro (10G/40G Ethernet NIC)
      - Packet processing for overlay networks is offloaded to the network adapter in hardware
      - Removes the CPU bottleneck and makes full use of high-bandwidth networks
  • 11. RDMA Technology Highlights
    - Remote Direct Memory Access
      - Zero-copy, CPU-bypass data transfer
      - Supported as a standard interconnect protocol
      - Transfers data directly between the buffers of remote applications
      - Enables data transfer with very low latency
    - RDMA protocols
      - InfiniBand – up to 56Gb/s
      - RDMA over Converged Ethernet (RoCE) – up to 40Gb/s
    - SMB Direct integrates RDMA into the Windows file-sharing protocol (SMB 3.0)
      - With Mellanox Ethernet NICs, RDMA also runs over Ethernet* (*DCB configuration is required when using RDMA)
  • 12. RoCE (RDMA over Converged Ethernet)
    [Diagram: RoCE frame format]
    Sources: IBTA Supplement to InfiniBand Architecture Specification Volume 1 Release 1.2.1 – Annex A16: RDMA over Converged Ethernet (RoCE); http://blog.infinibandta.org/2012/02/13/roce-and-infiniband-which-should-i-choose/
  • 13. RDMA Overview
    [Diagram: application buffers in Rack 1 and Rack 2. With RDMA over InfiniBand or Ethernet, the HCA moves data directly between application buffers, bypassing the OS and kernel; with TCP/IP, data is copied through the OS and NIC on both sides.]
    - If data were water: RDMA is a "thick, directly connected hose", while TCP/IP is a "bucket brigade".
  • 14. I/O Offload Frees Up CPU for Application Processing
    - Without RDMA: ~53% of CPU available for the application, ~47% consumed by system-space overhead
    - With RDMA and offload: ~88% of CPU available for the application, ~12% overhead
  • 15. SMB Direct – File Read
    [RoCE frame capture: on a file read, the SMB Server pushes data to the SMB Client with an RDMA Write]
  • 16. SMB Direct – File Write
    [RoCE frame capture: on a file write, the SMB Server pulls data from the SMB Client with an RDMA Read]
  • 17. How to Watch the RDMA Traffic
    - ibdump.exe
  • 18. Measuring SMB Direct Performance
    [Test setup: an I/O micro-benchmark run on four configurations – a single server (local), and an SMB client connected to an SMB server over 10GbE, QDR InfiniBand, and FDR InfiniBand – each backed by Fusion-io storage]
  • 19. Microsoft Delivers a Low-Cost Replacement for High-End Storage
    - FDR 56Gb/s InfiniBand delivers 5X higher throughput with 50% less CPU overhead vs. 10GbE
    - Native throughput performance over FDR InfiniBand
  • 20. Hyper-V over SMB Direct – Performance
    [Test setup: "Native" (SQLIO on a single server with local RAID controllers and SAS JBODs of SSDs) vs. "Remote VM" (SQLIO in a Hyper-V VM accessing a file server over SMB 3.0 with RDMA NICs, backed by the same RAID controller / SAS JBOD / SSD storage)]
  • 21. SMB 3.0 Performance in a Virtualized Environment
    Configuration | BW (MB/sec, 512KB IOs) | IOPS (IOs/sec) | %CPU (Privileged) | Latency (ms)
    Native        | 10,090                 | 38,492         | ~2.5%             | ~3
    Remote VM     | 10,367                 | 39,548         | ~4.6%             | ~3
    - SMB 3.0 over InfiniBand delivers native performance
  • 22. EchoStreams: InfiniBand Enables Near-Linear Scalability
    [Setup: SQLIO on a file client (SMB 3.0) connected by three RDMA NICs to a file server (SMB 3.0) running Storage Spaces over six SAS HBAs and six SAS JBODs of SSDs]
    - 8KB random reads from cache (RAM): ~1,000,000 IOPS
    - 8KB random reads from a mirrored space (disk): ~600,000 IOPS
    - 32KB random reads from a mirrored space (disk): ~500,000 IOPS, ~16.5 GBytes/sec
  • 23. Hyper-V Live Migration over SMB (new in Windows Server 2012 R2)
    - SMB as a transport for live migration of VMs
    - Delivers the power of SMB: RDMA (SMB Direct) and streaming over multiple NICs (SMB Multichannel)
    - Provides the highest bandwidth and lowest latency
    [Chart: live migration times, in seconds, across transports]
    - RDMA makes full use of high-bandwidth networks, can use multiple links, and keeps CPU load to a minimum
  • 24. Microsoft SQL Server 2012 AlwaysOn
    - RDMA data transfer enables low-latency writes to the mirrored database
  • 25. Microsoft PDW* V2 – 10X Faster & 50% Lower Capital Cost
    - Pure hardware costs are ~50% lower; price per raw TB is close to 70% lower due to higher capacity; 70% more disk I/O bandwidth
    - PDW V1 (Ethernet, InfiniBand & Fibre Channel): 160 cores on 10 compute nodes, 1.28 TB of RAM on compute, up to 30 TB of temp DB, up to 150 TB of user data; estimated total HW component list price: $1M
    - PDW V2 (InfiniBand & Ethernet): 128 cores on 8 compute nodes, 2 TB of RAM on compute, up to 168 TB of temp DB, up to 1 PB of user data; estimated total HW component list price: $500K
    *Parallel Data Warehouse
  • 26. Accelerating Microsoft SQL 2012 Parallel Data Warehouse V2
    - Analyze 1 petabyte of data in 1 second
    - Up to 100X faster performance than legacy data warehouse queries
    - Up to 50X faster data query, up to 2X the data loading rate
    - Unlimited storage scalability for future-proofing
    - Accelerated by Mellanox FDR 56Gb/s InfiniBand end-to-end solutions
  • 27. NVGRE H/W Offload
  • 28. ConnectX-3 Pro | The Next-Generation Cloud Competitive Asset
    - World's first cloud-offload interconnect solution
    - Provides hardware offloads for overlay networks – enables mobility, scalability, serviceability
    - Dramatically lowers CPU overhead, reduces cloud application cost
    - Highest throughput (10, 40 & 56GbE), SR-IOV, PCIe Gen3, low power
    - The foundation of Cloud 2.0: more users, mobility, scalability, simpler management, lower application cost
  • 29. World’s First HW Offload Engines for Overlay Network Protocols  Introducing L2 Virtual Tunneling solutions for virtualized data centers • NVGRE and VXLAN  Virtual L2 Tunnels provides a method for “creating” virtual domains on top of a scalable L3 virtualized infrastructure • Enabling virtual domains with complete isolation Three virtual domains connected by Layer 2 Tunneling  Targeting public/private cloud networks with multi-tenants  Mellanox uniqueness: HW offload = higher performance • Checksums, LSO, FlowID calculation, VLAN Stripping / insertion • Combined with steering mechanisms: RSS, VMQ VM VM VM Domain1 Domain2 VM Domain3 Server VM VM VM VM Physical Switch Server © 2013 Mellanox Technologies VM VM VM VM Server 29
  • 30. NVGRE
    - MAC over GRE
    - 24-bit tenant ID
    - Frame layout: outer MAC | IP (v4/v6) | GRE | inner MAC | ...
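    To make the frame layout on slide 30 concrete, here is a minimal Python sketch of the 8-byte GRE header that NVGRE inserts between the outer IP header and the inner MAC frame, assuming the standard NVGRE encoding (Key Present bit set, protocol type 0x6558, 24-bit Virtual Subnet ID plus an 8-bit FlowID in the key field). The function name and the example tenant ID are illustrative, not taken from the presentation.

```python
import struct

GRE_FLAG_KEY_PRESENT = 0x2000  # flags/version word with only the Key Present bit set
GRE_PROTO_TEB = 0x6558         # Transparent Ethernet Bridging: payload is an inner Ethernet frame

def nvgre_gre_header(vsid: int, flow_id: int = 0) -> bytes:
    """Build the GRE header used by NVGRE; the 24-bit tenant id (VSID) sits in
    the upper 24 bits of the 32-bit key field, with an 8-bit FlowID below it."""
    if not 0 <= vsid < (1 << 24):
        raise ValueError("VSID must fit in 24 bits")
    key = (vsid << 8) | (flow_id & 0xFF)
    return struct.pack("!HHI", GRE_FLAG_KEY_PRESENT, GRE_PROTO_TEB, key)

# Tenant 0x00ABCD, default FlowID -> bytes 20 00 65 58 00 ab cd 00 on the wire
assert nvgre_gre_header(0x00ABCD).hex() == "2000655800abcd00"
```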
  • 31. NVGRE Initial Performance Results (ConnectX-3 Pro, 10GbE)
    [Charts: CPU overhead (CPU cycles per byte; lower is better) and throughput (Gb/s; higher is better), comparing NVGRE with ConnectX-3 Pro offloads vs. NVGRE without offloads; annotated gains of 80% and 65%]
    - Higher throughput for less CPU overhead
  • 32. ConnectX-3 Pro NVGRE Throughput (10GbE)
    [Chart: bandwidth (Gb/s) vs. number of VM pairs (2, 4, 8, 16). With NVGRE offload enabled, throughput stays around 8.7–9.2 Gb/s; with offload disabled it drops to roughly 4.5–5.5 Gb/s]
  • 33. Links
    - Microsoft: http://smb3.info – blog posts from Microsoft about SMB Direct
    - Mellanox.com:
      - http://www.mellanox.com/page/file_storage – recipes and how-to guides
      - http://www.mellanox.com/page/edc_system – demo/test RDMA on Windows Server 2012
  • 34. Summary
    - Mellanox RDMA technology is built into Windows Server 2012 and Windows Server 2012 R2 as the standard "SMB Direct" feature – a highly effective technology that raises I/O performance while lowering CPU load.
    - Windows Server 2012 and Windows Server 2012 R2 are the first in the industry to ship, as a standard OS feature, a breakthrough technology that combines the operability and manageability of a file protocol with performance that even exceeds block storage.
      - Simply installing a Mellanox ConnectX-3 adapter unlocks overwhelming network performance – on the order of roughly 10X compared with conventional setups (exceeding even block storage)
      - Works over Ethernet as well as InfiniBand
      - In Hyper-V environments, not only file storage access (Hyper-V over SMB) but also live migration runs over RDMA
      - Applicable to a wide range of solutions built on Windows Server 2012 R2: Hyper-V, VDI, SQL Server
  • 35. Supplementary Notes
  • 36. RoCE Configuration
    - http://www.mellanox.com/pdf/whitepapers/WP_Deploying_Windows_Server_Eth.pdf
    - http://www.mellanox.com/related-docs/prod_software/RoCE_with_Priority_Flow_Control_Application_Guide.pdf
    - Because RoCE uses ordinary Ethernet frames, it works in principle with no configuration at all, but for performance and stability configure DCB (flow control) on both the Windows hosts and the Ethernet switches.
  • 37. Considerations for Using SMB Direct
    - http://technet.microsoft.com/ja-jp/library/jj134210.aspx (limitations by Windows design)
    - SMB Direct can be used in the Hyper-V management operating system to enable Hyper-V over SMB and to provide storage to virtual machines that use the Hyper-V storage stack. However, RDMA-capable network adapters are not exposed directly to Hyper-V clients: when an RDMA-capable network adapter is bound to a virtual switch, the virtual network adapters created from that switch are no longer RDMA-capable.
    - Disabling SMB Multichannel also disables SMB Direct. Because SMB Multichannel detects network adapter capabilities and determines whether an adapter is RDMA-capable, clients cannot use SMB Direct while SMB Multichannel is disabled.
    - SMB Direct is not supported on down-level versions of Windows Server; it is supported only on Windows Server 2012.
  • 38. Considerations for I/O Consolidation
    - From the previous slide: "when an RDMA-capable network adapter is bound to a virtual switch, the virtual network adapters created from that switch are no longer RDMA-capable"
    - Does that mean SMB Direct traffic (storage access and the live-migration path) and VM-to-VM TCP/IP traffic (which must go through a virtual switch) cannot be consolidated onto a single interface?
    - Mellanox's solution: add a virtual interface with the part_man command so that one physical port appears to the OS as two logical ports
  • 39. The part_man Command
    - Example (from the MLNX WinOF 4.55 User Manual):
      # part_man add "イーサネット 4" <any name>
    - Current status: InfiniBand – already supported; Ethernet – (explained at the session) release planned
  • 40. Reference: SMB Direct – Protocol Deep Dive
  • 41. SMB Direct Specification
    - [MS-SMBD], available at http://msdn.microsoft.com/en-us/library/hh536346(v=prot.20).aspx
  • 42. Relationship to Other Protocols
    - RDMA transports
      - The SMBDirect Protocol is transport-independent; it requires only an RDMA lower layer for sending and receiving messages.
      - The RDMA transports most commonly used by SMBDirect include iWARP, InfiniBand Reliable Connected mode, and RDMA over Converged Ethernet (RoCE).
    - Protocols transported by SMBDirect
      - SMB2 Protocol [MS-SMB2], when SMB2 version 3.0 is negotiated and both client and server have an RDMA-capable transport
  • 43. SMBDirect Protocol Overview
    - Requires in-order delivery
    - Requires support for direct data placement via RDMA Write and RDMA Read requests (examples: iWARP, InfiniBand, RoCE)
    - Only 3 message types: Negotiate request, Negotiate response, Data transfer
    - Little-endian byte order (least-significant byte first)
    - Uses multiple connections: the first connection for negotiation, the second and subsequent connections for RDMA
  • 44. Initialization
    - New additions for RDMA
      - Negotiate – server capability advertisement: the server must advertise multi-channel support (multiple connections per session), because SMB Direct always starts with a TCP connection and then opens a second connection (or more) for RDMA
      - Session setup: the initial session needs nothing special (normal processing); a new connection is created when RDMA is detected – as part of RDMA connection setup an SMB Direct negotiation occurs, and the new RDMA connection joins the previously established session
  • 45. Creating the RDMA Connection
    - A normal TCP connection is used to negotiate SMB2.2 and set up the session
    - After session setup, the client uses FSCTL_QUERY_TRANSPORT_INFO to query the server's interface capabilities (Interface Capability = RDMA_CAPABLE)
    - If a server RDMA NIC is found and a local RDMA NIC is found that can connect to it, additional connections using RDMA are created and bound to the session; the original TCP connection is idled and all traffic goes over the RDMA channel
    - RDMA NICs are always selected first over other types of NICs
  • 46. SMBDirect Message Types – 3 Messages
    - Negotiate request – negotiate RDMA parameters
    - Negotiate response – negotiate RDMA parameters
    - Data transfer – encapsulates SMB2 messages
    - SMBDirect data transfer modes:
      - Send/Receive mode – transmits SMB3 metadata requests and small SMB3 reads/writes
      - RDMA mode – transmits data for large SMB3 reads/writes
  • 47. SMBDirect Credits
    - Credits are bidirectional and asymmetric
    - Peers MUST avoid credit deadlock:
      - All sends must request at least one credit
      - When consuming the final credit, at least one credit must also be granted by the message
      - These rules avoid deadlock
    - Peers SHOULD grant many credits and can perform dynamic credit management
    - A KEEPALIVE mechanism supports a liveness probe, with the side effect of refreshing credits
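    The deadlock-avoidance rules on slide 47 can be captured in a few lines. The Python sketch below is a toy model for illustration only (class and method names are invented, not Microsoft's or Mellanox's implementation): every Send requests at least one credit, and a Send that consumes the last available credit must also grant at least one so the peers can keep exchanging messages.

```python
class SendCreditState:
    """Illustrative model of one peer's view of SMBDirect send credits."""

    def __init__(self, initial_send_credits: int):
        self.send_credits = initial_send_credits   # credits we may spend on Sends

    def prepare_send(self, credits_requested: int, credits_granted: int):
        if self.send_credits == 0:
            raise RuntimeError("no send credits: wait for the peer to grant more")
        # Rule: every message requests at least one credit.
        credits_requested = max(1, credits_requested)
        # Rule: when consuming the final credit, the message must also grant at
        # least one credit, otherwise neither side could ever send again.
        if self.send_credits == 1:
            credits_granted = max(1, credits_granted)
        self.send_credits -= 1
        return credits_requested, credits_granted

    def on_receive(self, credits_granted_by_peer: int):
        # Credits granted by the peer replenish our budget for future Sends.
        self.send_credits += credits_granted_by_peer
```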
  • 48. SMBDirect Negotiate Request
    - Field layout: MinVersion (2 bytes) | MaxVersion (2 bytes) | Reserved (2 bytes) | CreditsRequested (2 bytes) | PreferredSendSize (4 bytes) | MaxReceiveSize (4 bytes) | MaxFragmentedSize (4 bytes)
    - CreditsRequested (2 bytes): the number of Send Credits requested of the receiver.
    - PreferredSendSize (4 bytes): the maximum number of bytes that the sender requests to transmit in a single message.
    - MaxReceiveSize (4 bytes): the maximum number of bytes that the sender can receive in a single message.
    - MaxFragmentedSize (4 bytes): the maximum number of upper-layer bytes that the sender can receive as the result of a sequence of fragmented Send operations.
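    Because the protocol is little-endian (slide 43), the 20-byte Negotiate Request above serializes in a few lines. A minimal Python sketch, using the default values from the connection-establishment example on slide 53; this is an illustration of the layout, not an official encoder.

```python
import struct

# MinVersion, MaxVersion, Reserved, CreditsRequested (2 bytes each),
# PreferredSendSize, MaxReceiveSize, MaxFragmentedSize (4 bytes each), little-endian.
NEGOTIATE_REQUEST = struct.Struct("<HHHHIII")

def pack_negotiate_request(credits_requested=0x000A,
                           preferred_send_size=0x00000400,   # 1 KiB
                           max_receive_size=0x00000400,      # 1 KiB
                           max_fragmented_size=0x00020000):  # 128 KiB
    min_version = max_version = 0x0100  # SMBDirect protocol version 1.0
    return NEGOTIATE_REQUEST.pack(min_version, max_version, 0x0000,
                                  credits_requested, preferred_send_size,
                                  max_receive_size, max_fragmented_size)

assert len(pack_negotiate_request()) == NEGOTIATE_REQUEST.size == 20
```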
  • 49. SMBDirect Negotiate Response
    - Field layout: MinVersion (2) | MaxVersion (2) | NegotiatedVersion (2) | Reserved (2) | CreditsRequested (2) | CreditsGranted (2) | Status (4) | MaxReadWriteSize (4) | PreferredSendSize (4) | MaxReceiveSize (4) | MaxFragmentedSize (4)
    - NegotiatedVersion (2 bytes): the SMBDirect Protocol version selected for this connection; MUST be one of the values from the range specified by the SMBDirect Negotiate Request message.
    - CreditsRequested (2 bytes): the number of Send Credits requested of the receiver.
    - CreditsGranted (2 bytes): the number of Send Credits granted by the sender.
    - Status (4 bytes): indicates whether the SMBDirect Negotiate Request message succeeded; MUST be set to STATUS_SUCCESS (0x0000) if it succeeded.
    - MaxReadWriteSize (4 bytes): the maximum number of bytes that the sender will transfer via RDMA Write or RDMA Read request to satisfy a single upper-layer read or write request.
    - PreferredSendSize (4 bytes): the maximum number of bytes that the sender will transmit in a single message; MUST be less than or equal to the MaxReceiveSize value of the SMBDirect Negotiate Request message.
    - MaxReceiveSize (4 bytes): the maximum number of bytes that the sender can receive in a single message.
    - MaxFragmentedSize (4 bytes): the maximum number of upper-layer bytes that the sender can receive as the result of a sequence of fragmented Send operations.
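    The matching decode side for the 32-byte Negotiate Response, again as a hedged Python sketch (names are illustrative); the field order follows the layout above and the check mirrors the STATUS_SUCCESS definition.

```python
import struct
from collections import namedtuple

NegotiateResponse = namedtuple("NegotiateResponse", [
    "min_version", "max_version", "negotiated_version", "reserved",
    "credits_requested", "credits_granted", "status", "max_read_write_size",
    "preferred_send_size", "max_receive_size", "max_fragmented_size",
])

# Six 2-byte fields followed by five 4-byte fields, little-endian (32 bytes total).
NEGOTIATE_RESPONSE = struct.Struct("<HHHHHHIIIII")

def parse_negotiate_response(buf: bytes) -> NegotiateResponse:
    resp = NegotiateResponse(*NEGOTIATE_RESPONSE.unpack_from(buf))
    if resp.status != 0x00000000:  # STATUS_SUCCESS
        raise RuntimeError(f"SMBDirect negotiate failed, status=0x{resp.status:08X}")
    return resp
```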
  • 50. SMBDirect Data Transfer Message
    - Field layout: CreditsRequested (2) | CreditsGranted (2) | Flags (2) | Reserved (2) | RemainingDataLength (4) | DataOffset (4) | DataLength (4) | Padding (variable) | Buffer (variable)
    - RemainingDataLength (4 bytes): the amount of data, in bytes, remaining in a sequence of fragmented messages; if this value is 0x00000000, this message is the final message in the sequence.
    - DataOffset (4 bytes): the offset, in bytes, from the beginning of the SMBDirect header to the first byte of the message's data payload; MUST be 0 if no data payload is associated with this message, and MUST be 8-byte aligned from the beginning of the message.
    - DataLength (4 bytes): the length, in bytes, of the message's data payload; MUST be 0 if no data payload is associated with this message.
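    A sketch of how a payload-carrying Data Transfer message is laid out, applying the 8-byte-alignment rule for DataOffset: the fixed header is 20 bytes, so messages with a payload use DataOffset 24 with 4 bytes of padding, exactly as in the worked examples on slides 53–57. Illustrative Python under those assumptions, not a reference encoder.

```python
import struct

# CreditsRequested, CreditsGranted, Flags, Reserved (2 bytes each),
# RemainingDataLength, DataOffset, DataLength (4 bytes each), little-endian.
DATA_TRANSFER_HEADER = struct.Struct("<HHHHIII")   # 20 bytes

def pack_data_transfer(payload: bytes, credits_requested=10, credits_granted=1,
                       remaining_data_length=0) -> bytes:
    if payload:
        data_offset = 24   # next 8-byte boundary after the 20-byte header
        padding = b"\x00" * (data_offset - DATA_TRANSFER_HEADER.size)
    else:
        data_offset = 0    # no payload: DataOffset and DataLength are both 0
        padding = b""
    header = DATA_TRANSFER_HEADER.pack(credits_requested, credits_granted,
                                       0x0000,                 # Flags
                                       0x0000,                 # Reserved
                                       remaining_data_length,  # 0 = final / unfragmented
                                       data_offset,
                                       len(payload))
    return header + padding + payload
```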
  • 51. SMBDirect Buffer Descriptor V1 Structure
    - Field layout: Offset (8 bytes) | Token (4 bytes) | Length (4 bytes)
    - Offset (8 bytes): the RDMA provider-specific offset, in bytes, identifying the first byte of data to be transferred to or from the registered buffer.
    - Token (4 bytes): an RDMA provider-assigned Steering Tag for accessing the registered buffer.
    - Length (4 bytes): the size, in bytes, of the data to be transferred to or from the registered buffer.
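    The 16-byte descriptor above packs directly; a hedged Python sketch follows, reusing the registration values from the 1 MiB write example on slide 56 (the helper name is invented for illustration).

```python
import struct

# Offset (8 bytes), Token (4 bytes), Length (4 bytes), little-endian.
BUFFER_DESCRIPTOR_V1 = struct.Struct("<QII")

def pack_buffer_descriptor_v1(offset: int, token: int, length: int) -> bytes:
    """Describe one registered RDMA buffer for an upper-layer read or write."""
    return BUFFER_DESCRIPTOR_V1.pack(offset, token, length)

desc = pack_buffer_descriptor_v1(0x00000000ABCDE012, 0x1A00BC56, 0x00100000)
assert len(desc) == 16
```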
  • 52. SMB2 READ Request
    - Field layout: StructureSize | Padding | Reserved | Length | Offset | FileId | MinimumCount | Channel | RemainingBytes | ReadChannelInfoOffset | ReadChannelInfoLength | Buffer (variable)
    - Channel (4 bytes): for the SMB 2.002 and 2.1 dialects, this field MUST NOT be used and MUST be reserved; the client MUST set it to 0, and the server MUST ignore it on receipt. For the SMB 3.0 dialect, this field MUST contain exactly one of the following values:
      - SMB2_CHANNEL_NONE (0x00000000): no channel information is present in the request; the ReadChannelInfoOffset and ReadChannelInfoLength fields MUST be set to 0 by the client and MUST be ignored by the server.
      - SMB2_CHANNEL_RDMA_V1 (0x00000001): one or more SMB_DIRECT_BUFFER_DESCRIPTOR_V1 structures, as specified in [MS-SMBD] section 2.2.3.1, are present in the channel information specified by the ReadChannelInfoOffset and ReadChannelInfoLength fields.
  • 53. Example – Establishing a Connection
    - The initiator (for example, an SMB2 client) sends an SMBDirect Negotiate request indicating that it is capable of version 1.0 of the protocol, can send and receive up to 1 KiB of data per Send operation, and can reassemble fragmented Sends up to 128 KiB. The SMBDirect Negotiate Request fields are set to: MinVersion 0x0100, MaxVersion 0x0100, Reserved 0x0000, CreditsRequested 0x000A (10), PreferredSendSize 0x00000400 (1 KiB), MaxReceiveSize 0x00000400 (1 KiB), MaxFragmentedSize 0x00020000 (128 KiB).
    - The peer receives the request and selects version 1.0 for the connection. The response indicates that the peer can receive up to 1 KiB of data per Send operation and requests that the requestor permit the same; it also grants an initial batch of 10 Send Credits and requests 10 Send Credits for future messages. The SMBDirect Negotiate Response fields are set to: MinVersion 0x0100, MaxVersion 0x0100, NegotiatedVersion 0x0100, Reserved 0x0000, CreditsRequested 0x000A (10), CreditsGranted 0x000A (10), Status 0x0000, MaxReadWriteSize 0x00100000 (1 MiB), PreferredSendSize 0x00000400 (1 KiB), MaxReceiveSize 0x00000400 (1 KiB), MaxFragmentedSize 0x00020000 (128 KiB).
    - The peer sends the first data transfer, typically an upper-layer SMB2 Negotiate Request. The message grants an initial credit limit of 10 and requests 10 credits to begin sending normal traffic. The SMBDirect Data Transfer fields are set to: CreditsRequested 0x000A (10), CreditsGranted 0x000A (10), Flags 0x0000, Reserved 0x0000, RemainingDataLength 0x000000 (nonfragmented message), DataOffset 0x00000018 (24), DataLength 0x00000xxx (length of Buffer), Padding 0x00000000 (4 bytes of 0x00), Buffer (upper-layer message).
    - An SMBDirect Version 1.0 Protocol connection has now been established, and the initial message is processed.
  • 54. Example – Peer Transmits 500 Bytes of Data
    - The peer uses the Send operation to transmit the data because the upper-layer request did not provide an RDMA Buffer Descriptor. An SMBDirect Data Transfer message is sent that carries the 500 bytes as its payload; it requests 10 Send Credits to maintain the current credit limit and grants 1 Send Credit to replace the credit used by the final message.
    - The SMBDirect Data Transfer fields are set to: CreditsRequested 0x000A (10), CreditsGranted 0x0001, Flags 0x0000, Reserved 0x0000, RemainingDataLength 0x000000 (nonfragmented message), DataOffset 0x00000018 (24), DataLength 0x000001F4 (500 = size of the data payload), Padding 0x00000000 (4 bytes of 0x00), Buffer (upper-layer message).
  • 55. Example – Peer Transmits 64 KiB of Data
    - The peer uses fragmented Send operations because the message exceeds the remote peer's negotiated MaxReceiveSize but is within the MaxFragmentedSize. A sequence of fragmented SMBDirect Data Transfer messages is prepared; each requests 10 Send Credits and grants one Send Credit to maintain the credits offered to the peer for expected responses. Because the fragmented sequence requires more credits (65) than are currently available (10), several pauses can occur while waiting for credit replenishment.
    - First fragment: CreditsRequested 0x000A (10), CreditsGranted 0x0001, Flags 0x0000, Reserved 0x0000, RemainingDataLength 0x000000xxx (63 KiB remaining), DataOffset 0x00000018 (24), DataLength 0x000003F8 (1000 = MaxReceiveSize – 24), Padding 0x00000000 (4 bytes of 0x00), Buffer (1000 bytes of the upper-layer message).
    - Second fragment: same fields, with RemainingDataLength 0x000000xxx (62 KiB remaining) and the next 1000 bytes of the upper-layer message.
    - (Additional intermediate fragments, and pauses, elided…)
    - Final fragment: CreditsRequested 0x000A (10), CreditsGranted 0x0001, Flags 0x0000, Reserved 0x0000, RemainingDataLength 0x000000000 (final message of the fragmented sequence), DataOffset 0x00000018 (24), DataLength 0x00000218 (536 = last fragment), Padding 0x00000000 (4 bytes of 0x00), Buffer (536 final bytes of the upper-layer message).
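    The fragmentation bookkeeping in this example is easy to reproduce: each Send carries at most MaxReceiveSize minus the 24 bytes of header-plus-padding as payload, and RemainingDataLength counts down to 0 on the final fragment. A small Python sketch of that split, illustrative only (the generator name is invented).

```python
def fragment_message(message: bytes, max_receive_size: int, data_offset: int = 24):
    """Yield (remaining_data_length, chunk) pairs for a fragmented Send sequence."""
    chunk_size = max_receive_size - data_offset   # payload bytes per Send
    remaining = len(message)
    pos = 0
    while remaining > 0:
        chunk = message[pos:pos + chunk_size]
        pos += len(chunk)
        remaining -= len(chunk)
        # remaining == 0 marks the final fragment, as in the example above
        yield remaining, chunk

# With the 1 KiB MaxReceiveSize negotiated on slide 53, a 64 KiB message splits
# into 1000-byte fragments plus a short (536-byte) final fragment.
fragments = list(fragment_message(b"\x00" * (64 * 1024), 1024))
```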
  • 56. Example – Peer Transmits 1 MiB of Data Via the Upper Layer
    - The upper layer performs the transfer via RDMA. The buffer containing the data to be written is registered, obtaining the following single-element SMBDirect Buffer Descriptor V1, which will be embedded in the upper-layer Write request: Base 0x00000000ABCDE012, Length 0x0000000000100000 (1 MiB), Token 0x1A00BC56.
    - The peer sends an SMBDirect Data Transfer message containing an upper-layer Write request that includes the SMBDirect Buffer Descriptor V1 describing the 1 MiB buffer; the upper-layer message totals 500 bytes. Fields: CreditsRequested 0x000A (10), CreditsGranted 0x0001 (1), Flags 0x0000, Reserved 0x0000, RemainingDataLength 0x000000 (nonfragmented message), DataOffset 0x00000018 (24), DataLength 0x000001F4 (500 = size of the data payload), Padding 0x00000000 (4 bytes of 0x00), Buffer (upper-layer message).
    - The message is recognized by the upper layer as a Write request via RDMA, and the supplied SMBDirect buffer descriptor is used to RDMA Read the data from the peer into a local buffer (the RDMA device performs an RDMA Read operation).
    - The write processing is completed, the upper layer later replies to the peer, and the peer deregisters the buffer and completes the operation.
  • 57. Example – Peer Receives 1 MiB of Data Via the Upper Layer
    - The upper layer performs the transfer via RDMA. The buffer for the data to be read is registered, and the following single-element SMBDirect Buffer Descriptor V1 is obtained, which will be embedded in the upper-layer Read request: Base 0x00000000DCBA024, Length 0x0000000000100000 (1 MiB), Token 0x1A00BC57.
    - The peer sends an SMBDirect Data Transfer message containing an upper-layer Read request that includes the SMBDirect Buffer Descriptor describing the 1 MiB buffer; the upper-layer message totals 500 bytes. Fields: CreditsRequested 0x000A (10), CreditsGranted 0x0001, Flags 0x0000, Reserved 0x0000, RemainingDataLength 0x000000 (nonfragmented message), DataOffset 0x00000018 (24), DataLength 0x000001F4 (500 = size of the data payload), Padding 0x00000000 (4 bytes of 0x00), Buffer (upper-layer message).
    - The message is recognized by the upper layer as a Read request via RDMA, and the 1 MiB of data is prepared; the supplied SMBDirect Buffer Descriptor V1 is used by an RDMA Write request to write the data to the peer from a local buffer (the RDMA device performs an RDMA Write operation).
    - The read processing is completed and the reply is sent; the peer deregisters the buffer and completes the operation.
  • 58. Thank You
