InfiniBand: Today and Tomorrow
Speaker Notes

  • This is a classic mission-critical HPC cluster with MFIO, deployed for a large bank where downtime is unacceptable.
  • This is a classic HPC design with TS120s at the edge and TS270s in the core; it is the most economical way to build one.
  • Three cables side by side (CX4, Superflex, optical dongle). The bottom picture shows a TS120 mounted on its side with elegant cabling at a live customer site.
  • Full IB protocol stack available for all major operating systems. All flavors of Linux are supported; open source is available via openib.org.
  • Cisco supports openib.org with a major investment and is a primary author of major portions, including IP, SRP, and various APIs. OpenIB.org is rapidly nearing production-quality code. MPI integrated with OpenIB will be available in the near future, first for MVAPICH 0.96 and then OpenMPI. The charter of OpenIB.org has expanded to include Windows and iWARP. (Talk about the convergence of the iWARP and IB APIs here.)

InfiniBand: Today and Tomorrow (Presentation Transcript)

  • InfiniBand: Today and Tomorrow. Jamie Riotto, Sr. Director of Engineering, Cisco Systems (formerly Topspin Communications). [email_address]
  • Agenda
    • InfiniBand Today
      • State of the market
      • Cisco and InfiniBand
      • InfiniBand products available now
      • Open source initiatives
    • InfiniBand Tomorrow
      • Scaling InfiniBand
      • Future Issues
    • Q&A
  • InfiniBand Maturity Milestones
    • High adoption rates
      • Currently shipping > 10,000 IB ports / Qtr
    • Cisco acquisition will drive broader market adoption
    • End-to-end price points of <$1000.
    • New Cluster scalability proof-points
      • 1000 to 4000 nodes
  • Cisco Adopts InfiniBand
    • Cisco acquired Topspin on May 16, 2005
    • Adds InfiniBand to Switching Portfolio
      • Network Switches, Storage Switches, now Server Switches
      • Creates independent Business Unit to promote InfiniBand & Server Virtualization
    • New Product line of Server Fabric Switches (SFS)
      • SFS 7000 Series InfiniBand Server Switches
      • SFS 3000 Series Multifabric Server Switches
  • Cisco and InfiniBand: The Server Fabric Switch (diagram: a network switch connects clients to network resources such as the Internet, printers, and servers; a storage switch connects servers to SAN storage; the server switch sits between the servers and both the storage and the network)
  • Cisco HPC Case Studies
  • Real Deployments Today: Wall Street Bank with 512-Node Grid
    • Core fabric: 2 x 96-port TS-270
    • Edge fabric: 23 x 24-port TS-120
    • 512 server nodes
    • Grid I/O: 2 x TS-360 with Ethernet and Fibre Channel gateways into existing LAN and SAN
    • Fibre Channel and GigE connectivity built seamlessly into the cluster
  • NCSA (National Center for Supercomputing Applications) Tungsten 2: 520-Node Supercomputer
    • Parallel MPI codes for commercial clients
    • Point-to-point 5.2us MPI latency
    • 520 dual-CPU nodes (1,040 CPUs)
    • Core fabric: 6 x 72-port TS270; edge fabric: 29 x 24-port TS120
    • 174 uplink cables, 512 1m cables
    • Deployed: November 2004
  • D.E. Shaw Bioinformatics: 1,066-Node Supercomputer
    • 1,066-node, fully non-blocking, fault-tolerant IB cluster
    • Fault-tolerant core fabric: 12 x 96-port TS-270; edge fabric: 89 x 24-port TS-120
    • 1,068 uplink cables (5m/7m/10m/15m), 1,066 1m cables
  • Large Government Lab: World's Largest Commodity Server Cluster (4,096 nodes)
    • Application:
      • High Performance Super Computing Cluster
    • Environment:
      • 4096 Dell Servers
      • 50% Blocking Ratio
      • 8 TS-740s
      • 256 TS-120s
    • Benefits:
      • Compelling Price/Performance
      • Largest Cluster Ever Built (by approx. 2X)
      • Expected to be 2nd Largest Supercomputer in the world by node count
    • Core fabric: 8 x SFS TS740 (288 ports each); edge fabric: 256 x TS120 (24 ports each); 2,048 uplinks (7m/10m/15m/20m)
    • 8,192-processor, 60 TFlop SuperCluster
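A sizing sanity check (arithmetic implied by, not stated on, the slide): if each 24-port TS120 dedicates 16 ports to hosts and 8 to uplinks, the 256 edge switches support 16 x 256 = 4,096 nodes and feed 8 x 256 = 2,048 uplinks into the 8 x 288 = 2,304 core ports; 8 uplinks serving 16 host ports is the quoted 50% blocking ratio, and 4,096 dual-processor servers account for the 8,192 processors.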
  • InfiniBand Products Available Today
  • InfiniBand Switches and HCAs
    • Fully non-blocking switch building blocks available in sizes from 24 up to 288 ports.
    • Blade servers offer integrated switches and pass-through modules
    • HCAs available in PCI-X and PCI-Express
    • IP & Fibre-Channel Gateway Modules
  • Integrated InfiniBand for Blade Servers Create “wire-once” fabric
    • Integrated 10Gbps InfiniBand switches provide unified “wire-once” fabric
    • Optimize density, cooling, space, and cable management.
    • Option of integrated InfiniBand switch (ex: IBM BC) or pass-thru module (ex: Dell 1855)
    • Virtual I/O provides shared Ethernet and Fibre Channel ports across blades and racks
    (diagram: blade chassis with two integrated IB switches and an HCA per blade; 10Gbps and 30Gbps InfiniBand links)
  • Ethernet and Fibre Channel Gateways: Unified "wire-once" fabric
    • Fibre Channel to InfiniBand gateway for storage (SAN) access
    • Ethernet to InfiniBand gateway for LAN/WAN access
    • A single InfiniBand link from each server in the cluster carries both storage and network traffic
  • InfiniBand Price / Performance
    • Myrinet pricing data from the Myricom web site (Dec 2004)
    • InfiniBand pricing data based on Topspin average sales price (Dec 2004)
    • Myrinet, GigE, and IB performance data from the June 2004 OSU study
    • Note: latency figures are MPI processor-to-processor latency; switch latency alone is lower (a minimal measurement sketch follows the table)
                                      InfiniBand PCI-Express   Myrinet D   Myrinet E   GigE        10GigE
    Data Bandwidth (Large Messages)   950MB/s                  245MB/s     495MB/s     100MB/s     900MB/s
    MPI Latency (Small Messages)      5us                      6.5us       5.7us       50us        50us
    HCA Cost (Street Price)           $550                     $535        $880        Free        $2K-$5K
    Switch Port                       $250                     $400        $400        $100-$300   $2K-$6K
    Cable Cost (3m Street Price)      $100                     $175        $175        $25         $100
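The MPI latency figures above come from ping-pong style microbenchmarks (the OSU study cited above uses this pattern). A minimal sketch, assuming two ranks launched over an IB-capable MPI such as MVAPICH; one-way latency is half the averaged round-trip time of a small message:

    /* ping-pong latency sketch: run with 2 ranks, e.g. mpirun -np 2 ./pingpong */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, iters = 10000;
        char msg[8] = {0};                       /* small message, fits in one packet */
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < iters; i++) {
            if (rank == 0) {                     /* rank 0: send, then wait for the echo */
                MPI_Send(msg, sizeof msg, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(msg, sizeof msg, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            } else if (rank == 1) {              /* rank 1: echo every message back */
                MPI_Recv(msg, sizeof msg, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                MPI_Send(msg, sizeof msg, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double t1 = MPI_Wtime();

        if (rank == 0)
            printf("one-way latency: %.2f us\n", (t1 - t0) / iters / 2.0 * 1e6);
        MPI_Finalize();
        return 0;
    }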
  • InfiniBand Cabling
    • CX4 Copper (15m)
    • Flexible 30-Gauge Copper (3m)
    • Fiber Optics up to 150m
  • Host Drivers for Standard Protocols
    • Open source strategy = reliability at low cost
    • IPoIB: legacy TCP/IP applications
    • SDP: reliable socket connections (optional RDMA); see the sketch after this list
    • MPI: leading edge HPCC applications (RDMA)
    • SRP: block storage access (RDMA)
    • uDAPL: User level RDMA
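As a rough illustration of the SDP entry above (not taken from the presentation): on an OFED-style Linux stack an application can opt into SDP explicitly by creating its socket with the SDP address family (historically 27, commonly defined as AF_INET_SDP), or transparently by preloading libsdp under an unmodified TCP application; everything after socket creation is ordinary sockets code. The address and port below are placeholders.

    /* minimal SDP client sketch; the AF_INET_SDP value assumes an OFED-style stack */
    #include <stdio.h>
    #include <unistd.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>
    #include <sys/socket.h>

    #ifndef AF_INET_SDP
    #define AF_INET_SDP 27                              /* assumed OFED convention for SDP */
    #endif

    int main(void)
    {
        int fd = socket(AF_INET_SDP, SOCK_STREAM, 0);   /* only change vs. plain TCP */
        if (fd < 0) { perror("socket(AF_INET_SDP)"); return 1; }

        struct sockaddr_in peer = {0};
        peer.sin_family = AF_INET;                      /* addressing stays IPv4 (an IPoIB address) */
        peer.sin_port   = htons(5000);                  /* placeholder port */
        inet_pton(AF_INET, "192.168.0.10", &peer.sin_addr);   /* placeholder IPoIB address */

        if (connect(fd, (struct sockaddr *)&peer, sizeof peer) < 0) {
            perror("connect");
            return 1;
        }
        write(fd, "hello over SDP\n", 15);              /* ordinary sockets I/O from here on */
        close(fd);
        return 0;
    }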
  • OS Support
    • Operating Systems Available:
      • Linux (Red Hat, SuSE, Fedora, Debian, etc.)
      • Windows 2000 and 2003
      • HP-UX (Via HP)
      • Solaris (Via Sun)
  • The InfiniBand Driver Architecture (diagram: user- and kernel-space stack in which applications reach InfiniBand through BSD sockets, file-system, and DAT/uDAPL APIs layered over IPoIB, SDP, SRP, NFS-RDMA, and the TS API on top of the verbs interface and HCA; the InfiniBand switch fabric connects through Ethernet and Fibre Channel gateways to the LAN/WAN and SAN)
  • Open Software Initiatives
    • OpenIB.org
      • Topspin was a primary author of major portions, including IPoIB, SDP, SRP, and the TS-API; Cisco will continue to invest
      • Current protocol development is nearing production-quality code; release expected by the end of the year
      • The charter has been expanded to include Windows and iWARP
      • MPI will be available in the near future (MVAPICH 0.96)
    • OpenSM
    • OpenMPI
  • InfiniBand Tomorrow
  • Looking into the future
    • Cost
    • Speed
    • Distance Limitations
    • Cable Management
    • Scalability
    • IB and Ethernet
  • Speed: InfiniBand DDR / QDR, 4X / 12X
    • DDR Available end of 2005
      • Doubles 4X wire speed from 10 Gb/s to 20 Gb/s
      • PCI-Express DDR
      • Distances of 5-10m using copper
      • Distances of 100m using fiber
    • QDR Available WHEN?
    • 12X (30 Gb/s) available for over one year!!
      • Not interesting until 12X HCA
        • Not interesting until > 16X PCIe
  • Future InfiniBand Cables
    • InfiniBand over CAT5 / CAT6 / CAT7
      • Shielded cable distances up to ???
      • Leverage existing 10-GigE cabling
      • 10-GigE too expensive?
  • IB Distance Scaling
    • IB Short Haul
      • New Copper drivers
      • 25-50 meters (KeyEye)
      • 75-100 meters (IEEE 10GE)
    • IB WAN
      • Same subnet over distance (300 km target)
      • Buffer / credit / timeout issues
      • Applications: Disaster Recovery, Data Mirroring
    • IB Long Haul
      • IB over IP (over SONET?)
      • Utilizes existing public plant (WDM, debugging, etc.)
  • Scaling InfiniBand
    • Subnet Management
    • Host-side Drivers
      • MPI
      • IPoIB
      • SRP
    • Memory Utilization
  • IB Subnet Manager
    • Subnets are getting bigger
      • 4,000 -> 10,000 nodes
      • Topology convergence times
        • Topology disturbance times
        • Topology disturbance minimization
  • Subnet Management Challenges
    • Cluster Cold Start times
      • Template Routing
      • Persistent Routing
    • Cluster Topology Change Management
      • Intentional Change - Maintenance
      • Unintentional Change – Dealing with Faults
        • How to impact minimum number of connections
        • Predetermine fault reaction strategy?
    • Topology Diagnostic Tools
      • Link/Route Verification
      • Built-in BERT testing
    • Partition Management
  • Multiple Routing Models
    • Minimum Latency Routing:
      • Load-Balanced Shortest-Path Routing
    • Minimum Contention Routing:
      • Lowest-Interference Divergent-Path Routing
    • Template Driven Routing:
      • Supports Pre-Determined Routing Topology
      • For example: Clos routing, matrix row/column, etc.
      • Automatic Cabling Verification for Large Installations
  • IB Routing Challenges
    • Static / Dynamic Routing
      • IB implements static routing through a Linear Forwarding Table (LFT) at each switch chip (a conceptual sketch follows this list)
      • Multi-LID Routing enables Dynamic Routing
    • Credit Loops
    • Cost-Based Routing
      • Speed mismatches force store-and-forward switching (vs. cut-through)
      • SDR <> DDR <> QDR
      • 4X <> 12X
      • Short Haul <> Long Haul
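To make the static-routing point above concrete, here is a purely conceptual sketch (illustrative names, not a vendor or verbs API): each switch chip forwards by a single lookup of destination LID in its Linear Forwarding Table, so path quality is decided entirely by how the subnet manager fills those tables under one of the routing models listed earlier.

    #include <stdint.h>

    #define UNICAST_LID_MAX 0xBFFF            /* LIDs above this range are multicast */

    struct switch_chip {
        uint8_t lft[UNICAST_LID_MAX + 1];     /* lft[dlid] = egress port for that LID */
        uint8_t num_ports;
    };

    /* Forwarding is a plain table lookup: no per-flow or adaptive state in the chip. */
    static inline uint8_t forward_port(const struct switch_chip *sw, uint16_t dlid)
    {
        return sw->lft[dlid];
    }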
  • Multi-LID Source-Based Routing Support
    • Applications can implement “Dynamic” Routing for Contention Avoidance, Failover, and Parallel Data Transfer (a path-selection sketch follows below)
    (diagram: destination LIDs 1, 2, 3, 4 routed over different spine switches between leaf switches)
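A conceptual sketch of the multi-LID idea above (illustrative helper, not a real verbs call): with LMC > 0 the subnet manager assigns each port 2^LMC consecutive LIDs and may route each one over a different spine, so a host library can spread traffic or fail over simply by varying the destination-LID offset.

    #include <stdint.h>

    /* Choose a destination LID for a logical path index, given the peer's
     * base LID and the subnet's LMC (LID Mask Control) setting. */
    static inline uint16_t pick_dlid(uint16_t base_lid, uint8_t lmc, unsigned path_index)
    {
        uint16_t npaths = (uint16_t)1 << lmc;          /* 2^LMC separately routed LIDs */
        return (uint16_t)(base_lid + (path_index % npaths));
    }

    /* e.g. stripe a large transfer across paths 0..3 when LMC = 2, or move to a
     * different offset if the path in use fails. */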
  • New IB Peripherals
    • CPUs?
    • Storage
      • SAN
      • NFS-RDMA
    • Memory (coherent / non-coherent)
    • Purpose-Built Processors?
      • Floating Point Processors
      • Graphics Processors
      • Pattern Matching Hardware
      • XML Processor
  • THANK YOU!
    • Questions & Answers