© 2010 Voltaire Inc.
November 19, 2010
Unified Fabric Manager Overview
Ghislain de Jacquelot
  1. Unified Fabric Manager Overview
     Ghislain de Jacquelot, November 19, 2010. © 2010 Voltaire Inc.
  2. Voltaire Software Portfolio
     • Robust Drivers: robust RDMA drivers
     • Fabric provisioning and performance monitoring
     • MPI Acceleration: collective communication offload
     • Multicast Acceleration: multicast and TCP transport utilizing kernel bypass technology
     • Storage Access Acceleration: RDMA-based storage iSCSI target
  3. Unified Fabric Manager
     “So far, I haven't seen any other solutions claiming to be a "fabric manager" offer the sophisticated insight, resource management, performance trending, and core fabric function extension that UFM can … it fully illustrates what a well architected fabric should be capable of.”
     Jeff Boles, Taneja Group, June 2009
  4. InfiniBand Traditional Management
  5. An InfiniBand Fabric is not a black box (1/2)
     ► Requires hardware management
       • Detect failures, communication problems
         - Inside the InfiniBand fabric: port counters; port status (QDR, DDR, SDR; 4X, 2X, 1X); firmware upgrades (switch and HCA ASICs)
         - Outside the InfiniBand fabric: chassis; power supplies; fans; temperature; chassis software updates (switch management)
  6. An InfiniBand Fabric is not a black box (2/2)
     ► What about performance?
     ► Some embarrassing questions…
       • Blocking vs. non-blocking fabrics?
       • Influence of routing algorithms?
       • Congestion?
       • Mixing different protocols on the same fabric?
       • Running multiple jobs on the same fabric?
       • Performance monitoring tools?
  7. UFM Central Management Platform
     ► In-depth visibility into fabric health and traffic
       • Central Dashboard, unique Congestion Map
       • Advanced monitoring engine, threshold-based alerts
     ► Optimize application performance
       • Quality of Service
       • Traffic Aware Routing Algorithm
     ► Efficient operations of thousands of fabric components
       • Automated configuration of hosts and switches, group tasks
       • Seamless change management
  8. Introducing UFM
     Architecture diagram components: UFM Server; CLI; GUI (Java); Web Services; IB-SM (OpenSM); Perf Mng Providers; Device Mng Providers; SQL DB; HA Daemon; Access Control; Voltaire Plug-ins
     Diagram callouts:
       • User and application interfaces
       • Central administration of multiple switches (or hosts)
       • Hierarchical performance monitoring, variety of sources
       • Leverage open source SM engine
       • Transparent fail-over
       • Fast retrieval, historical data
       • Manage complex relations and workflows
  9. Advanced Monitoring and Analysis
     ► Monitor & analyze fabric performance
       • Bandwidth utilization
       • Unique congestion monitoring
       • Dashboard for aggregated fabric view
     ► Real-time fabric-wide health monitoring
       • Monitor events and errors throughout the fabric
       • Threshold-based alarms
       • Granular monitoring of host and switch parameters
     ► Innovative congestion mapping
       • One view for fabric-wide congestion and traffic patterns
       • Enables root cause analysis for routing, job placement or resource allocation inefficiencies
     ► Everything is managed at the application/aggregation level
       • Event effects are clearly visible
       • Proactive measures can be taken
  10. Central Dashboard
      Screenshot panels: Resource Utilization & Status; Congestion Map; Top 10 alerted nodes; Event Pane; Top 10’s B/W, Congestion; B/W Consumers
  11. Advanced Monitoring Engine
      • Multiple sessions
      • On demand
      • Sessions per Logical Groups: no need to know physical nodes
      • Aggregation per multiple devices
      • Various graphs (linear, bar, histogram, pie…)
      • Correlate switch and host information
      • Formulas (AVG, Max, Min, Sum)
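The idea of monitoring sessions per logical group, aggregated with formulas such as AVG, Max, Min and Sum, can be illustrated with a minimal Python sketch. This is not the UFM API: the device names, group names, sample values and the `aggregate` helper are all hypothetical and serve only to show aggregation by group name rather than by physical node.

```python
from statistics import mean

# Hypothetical bandwidth samples (MB/s) per device; not real UFM data.
samples = {
    "node01": [820.0, 790.0],
    "node02": [400.0, 410.0],
    "node03": [50.0, 70.0],
}

# Hypothetical logical groups: a session targets a group name,
# so the operator does not need to know the physical nodes behind it.
logical_groups = {
    "latency_job": ["node01", "node02"],
    "storage": ["node03"],
}

# The four formulas named on the slide, mapped to Python built-ins.
FORMULAS = {"AVG": mean, "MAX": max, "MIN": min, "SUM": sum}


def aggregate(group, formula):
    """Apply one formula across all samples of all devices in a logical group."""
    values = [v for device in logical_groups[group] for v in samples[device]]
    return FORMULAS[formula](values)


if __name__ == "__main__":
    for name in FORMULAS:
        print(name, aggregate("latency_job", name))
```

Keying the session on the group name means the same query keeps working when nodes are added to or removed from the group.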
  12. Performance Optimization Cycle with UFM
      • Application Requirements (optional orchestrators & schedulers): characterize traffic pattern and priorities
      • UFM Optimization: unique logical fabric model; QoS to prioritize critical apps; optimize routing with Voltaire’s Traffic Optimized Routing (TOR)
      • UFM Monitoring: show traffic and congestion information; unique Congestion Map
      • Feedback and Analysis
  13. Advanced Performance Optimization Mechanisms
      ► Fabric virtualization and Quality of Service (QoS)
        • Run multiple clusters or multiple jobs on the same infrastructure
        • Assure critical applications get priority through QoS policy
        • Provide the required isolation for different departments or jobs
      ► Traffic Aware Routing Algorithm (TARA)
        • Voltaire’s major shift from static to traffic aware routing
        • Routing enhancements are built on top of OpenSM in a modular plug-in architecture
        • Takes into consideration traffic patterns and loads
        • Traffic model can be derived automatically from fabric model or via API with 3rd party schedulers
      Applicable to both DDR and QDR environments
  14. Congestion Example
      ► Degradation due to node oversubscription
        • Destination port in saturation (multiple sources)
        • Congestion spreads across the fabric
        • ALL other flows drop to 20% of capacity
        • Takes time to recover
        • Common with storage traffic drop recovery
  15. Quality of Service Optimization
      UFM Enables QoS Optimization
  16. Test Environment
      ► 2 nodes running a latency-critical job
      ► 12 nodes running a bandwidth-consuming job
      ► Goal: achieve best performance with latency-critical tasks
  17. W/O Partitioning: Latency degradation of ~215%
      • Latency job running alone (latency = ~2.1 µs)
      • Bandwidth job added on same partition (latency = ~4.5 µs)
  18. UFM Logical Model Creates Partition and Sets QoS
      ► 2 Logical Groups
        • Latency job
        • B/W oriented job
      ► QoS settings
      ► UFM creates virtual NICs, partitions and assigns Service Levels on the fabric
  19. With UFM QoS: Cross Application Interference fixed
      • Single job in cluster (latency = 2.1 µs)
      • 2 jobs, UFM optimization (latency = 2.2 µs)
      • 2nd job added (latency = 4.5 µs)
      100% Better Performance Through QoS Implementation
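The latency figures quoted in the QoS test (2.1 µs alone, 4.5 µs with the bandwidth job on the same partition, 2.2 µs with UFM QoS) can be checked with a few lines of arithmetic. The sketch below is one reading of the headline percentages, not a statement of how the deck's author computed them: it suggests that "~215%" matches the ratio of degraded to baseline latency, and that "100% better" matches the roughly 2x ratio between the unoptimized and optimized two-job latencies.

```python
# Latency figures taken from the slides, in microseconds.
baseline_us = 2.1   # latency job running alone
shared_us = 4.5     # bandwidth job added on the same partition
qos_us = 2.2        # two jobs with UFM partitioning and QoS

# Degraded latency relative to baseline: a ratio and a relative increase.
ratio_shared = shared_us / baseline_us
increase_shared_pct = (ratio_shared - 1) * 100

# Residual overhead of the QoS configuration versus a single job.
qos_overhead_pct = (qos_us / baseline_us - 1) * 100

# Improvement of the QoS configuration over the unpartitioned two-job run.
speedup_with_qos = shared_us / qos_us

if __name__ == "__main__":
    print(f"shared / baseline  = {ratio_shared:.2f}x")
    print(f"relative increase  = {increase_shared_pct:.0f}%")
    print(f"QoS overhead       = {qos_overhead_pct:.1f}%")
    print(f"shared / QoS       = {speedup_with_qos:.2f}x")
```

Under this reading, the relative increase without partitioning is about 114%, and the QoS configuration stays within about 5% of the single-job latency.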
  20. Optimize performance #2: routing
      ► Existing routing algorithms
        • Are not aware of application communication flow
        • Distribute paths evenly across the fabric links
      ► In real life, fabrics have non-uniform usage
        • Some endpoints “talk” a lot, some don’t “talk” at all
        • Many-to-many (cluster) and any-to-many (storage) topologies
      ► Result
        • Unbalanced fabric
        • Congestion is created, leading to slower performance and high latency
      Congestion = Latency
  21. TARA Optimization
      ► TARA provides the following benefits:
        • Reduces competition between fabric resources, thus decreasing congestion
        • Increases available bandwidth, resulting in improved fabric utilization
        • Delivers lower latency and shorter application runtime
      ► How?
        • Uses knowledge of cluster usage: logical servers, networks
        • Balances routes depending on usage
        • Not based on real-time analysis of bandwidth / congestion
  22. Routing?
      ► InfiniBand packets are ‘destination routed’ based on the Destination Local ID (DLID) field in the header
      ► In IB: DLID = route (not only remote address)
      ► DLIDs are 16 bits
        • 48K values are used for unicast
        • 16K values are used for multicast
      ► At each switch ASIC, the incoming unicast DLID is used as an index into a Linear Forwarding Table (LFT) that returns the outgoing switch port number
        • E.g. the InfiniScale III ASIC supports all 48K possible LFT entries
      Figure: LFT table with columns DLID (entries 0 to 11 shown) and Out Port #
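The DLID split and the table lookup described above can be sketched in Python. The address boundaries follow the InfiniBand convention (LID 0 reserved, unicast up to 0xBFFF, multicast from 0xC000, 0xFFFF as the permissive LID); the `classify_dlid` and `forward` helpers and the sample table entries are hypothetical and only model the indexing step, not any switch firmware.

```python
UNICAST_MAX = 0xBFFF     # last unicast LID; 0x0000..0xBFFF spans 48K values
MULTICAST_MIN = 0xC000   # first multicast LID; 0xC000..0xFFFF spans 16K values
PERMISSIVE_LID = 0xFFFF  # special value, not a routable multicast group


def classify_dlid(dlid):
    """Classify a 16-bit DLID by the address range it falls into."""
    if not 0 <= dlid <= 0xFFFF:
        raise ValueError("DLID must fit in 16 bits")
    if dlid == 0:
        return "reserved"
    if dlid <= UNICAST_MAX:
        return "unicast"
    if dlid == PERMISSIVE_LID:
        return "permissive"
    return "multicast"


def forward(lft, dlid):
    """Use a unicast DLID as an index into the LFT and return the out port."""
    if classify_dlid(dlid) != "unicast":
        raise ValueError("the LFT is only indexed by unicast DLIDs")
    return lft[dlid]  # None means no route is programmed for this DLID


# A linear table with one slot per possible unicast DLID (hypothetical content).
lft = [None] * (UNICAST_MAX + 1)
lft[19] = 1
lft[37] = 1
lft[20] = 2

if __name__ == "__main__":
    print(classify_dlid(0x0013), forward(lft, 19))
    print(classify_dlid(0xC001))
```

Because the table is indexed directly by DLID, the route to a destination is fixed per switch until the subnet manager rewrites the entry, which is why "DLID = route".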
  23. The real wording should be « rearrangeably non-blocking »
      Diagram: four 36p switches with nodes 1-18, 19-36, 37-54 and 55-72, plus two 36p switches without attached nodes; each link represents 9 cables
      • 18 uplinks, 54 nodes
      • At boot time, 3 routes are assigned to each uplink. Let's assume: 19-37-55 on port #1, 20-38-56 on port #2, etc.
      • What happens if you have a job running on nodes 1-2-3-19-37-55?
      • Unbalanced communication, congestion…
  24. TARA Optimization
      ► TARA provides the following benefits:
        • Reduces competition between fabric resources, thus decreasing congestion
        • Increases available bandwidth, resulting in improved fabric utilization
        • Delivers lower latency and shorter application runtime
      ► How?
        • Uses knowledge of cluster usage: logical servers, networks
        • Balances routes depending on usage
        • Not based on real-time analysis of bandwidth / congestion
  25. With TARA
      Diagram: four 36p switches with nodes 1-18, 19-36, 37-54 and 55-72, plus two 36p switches without attached nodes; each link represents 9 cables
      • 18 uplinks, 3 nodes
      • Job running on nodes 1-2-3-19-37-55
      • At job launch time, routes to nodes used by the job are balanced over all uplinks: 19 on port #1, 37 on port #2, 55 on port #3
      • Others are unchanged
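The contrast between the boot-time assignment (19-37-55 all on port #1) and the job-aware assignment (19, 37, 55 on ports #1, #2, #3) can be reproduced with a small Python sketch. It models only the first switch (nodes 1-18) and its 18 uplinks. The round-robin rebalancing is a simplified stand-in chosen to match the slide's example, not Voltaire's actual TARA algorithm, and all function names are hypothetical.

```python
from collections import Counter

UPLINKS = 18
LOCAL = range(1, 19)    # nodes attached to the switch being modelled
REMOTE = range(19, 73)  # the 54 nodes reached through its uplinks


def static_routes():
    """Boot-time assignment: 19-37-55 on port 1, 20-38-56 on port 2, etc."""
    return {dest: (dest - 19) % UPLINKS + 1 for dest in REMOTE}


def job_aware_routes(job_nodes):
    """Spread the job's remote destinations over distinct uplinks; keep the rest."""
    routes = static_routes()
    remote_in_job = sorted(n for n in job_nodes if n in REMOTE)
    for i, dest in enumerate(remote_in_job):
        routes[dest] = i % UPLINKS + 1
    return routes


def uplink_load(routes, job_nodes):
    """Count job flows per uplink: one flow per (local source, remote destination)."""
    sources = [n for n in job_nodes if n in LOCAL]
    dests = [n for n in job_nodes if n in REMOTE]
    load = Counter()
    for _ in sources:
        for dest in dests:
            load[routes[dest]] += 1
    return load


job = [1, 2, 3, 19, 37, 55]

if __name__ == "__main__":
    print("static   :", dict(uplink_load(static_routes(), job)))
    print("job-aware:", dict(uplink_load(job_aware_routes(job), job)))
```

With the static table all nine flows of the job share one uplink, while the job-aware table places three flows on each of three uplinks, which is the imbalance the "rearrangeably non-blocking" remark points at.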
  26. Example: TARA with 324 nodes cluster
      Logical topology: 300 servers and 24 storage nodes
        • Job 1: 47 nodes; Job 2: 41 nodes; Job 3: 71 nodes; Job 4: 25 nodes; Job 5: 63 nodes; Job 6: 46 nodes
        • Storage nodes (24): traffic to/from storage, average of 200 MB/s per node
        • Internal traffic inside each job: 1000 MB/s from each node
      Physical topology: two charts of traffic bandwidth as port weight (axis 0 to 2000) per switch.port (1.18 to 17.30) for the internal ports on the line cards, one for OpenSM and one for Traffic Optimized Routing
  27. Scale-out and Maintain Control on Fabric
      ► Dozens of switches and 1000s of nodes become a massive operational burden
      ► UFM automates I/O and switch configuration, enabling isolation and QoS
      ► Central Device Management for switches and hosts
      ► High availability and seamless failover of SM and UFM
      ► Advanced API for seamless integration in existing environments
      Automatic, seamless operations save hours of configuration and set-up work
  28. Efficient Troubleshooting
      ► Dozens of traffic and health events
        • Easy central drill-down to counters, alerts and events to the port level
      ► Configurable thresholds and criticality levels
      ► GUI and log level alarms
      ► Alerts correlated to the application level
      ► Alerts correlated to the DC rack level
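Configurable thresholds with criticality levels, evaluated down to the port level, can be illustrated with a short Python sketch. The counter names, limits and level names below are hypothetical and do not come from UFM; the sketch only shows how a per-counter threshold table turns raw port counters into graded alarms.

```python
# Hypothetical thresholds: counter name -> (limit, criticality), highest limit first.
THRESHOLDS = {
    "symbol_errors": [(100, "critical"), (10, "warning")],
    "link_downed": [(1, "critical")],
}


def criticality(counter, value):
    """Return the most severe level whose limit the value reaches, or None."""
    for limit, level in THRESHOLDS[counter]:
        if value >= limit:
            return level
    return None


def alarms(port_counters):
    """Map {(switch, port): {counter: value}} to a sorted list of raised alarms."""
    raised = []
    for (switch, port), counters in sorted(port_counters.items()):
        for counter, value in sorted(counters.items()):
            level = criticality(counter, value)
            if level is not None:
                raised.append((switch, port, counter, level))
    return raised


if __name__ == "__main__":
    readings = {
        ("sw1", 3): {"symbol_errors": 42, "link_downed": 0},
        ("sw1", 7): {"symbol_errors": 250},
    }
    for alarm in alarms(readings):
        print(alarm)
```

Keeping the switch and port in each alarm record is what makes a drill-down from a fabric-wide view to a single port possible.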
  29. Open system
      ► Extensible architecture based on Web services
      ► Open API for users or 3rd party extensions
      ► Expose entire fabric and datacenter object model
      ► Allow simple reporting, provisioning, monitoring, and task automation
      ► Tools already benefiting from UFM API
        • Scheduler integration (e.g. Moab)
        • UFM Support tool kit
        • Various command line tools/extensions to UFM
        • Web fabric portal
      * Provided in UFM Advanced packages
  30. UFM Adaptive Suite
      - Separate UFM offering integrated with Platform LSF
        • Intelligent & automatic resource allocation
        • Optimize fabric performance
        • Maintain connectivity upon changes
        • Central monitoring
      This is the first integrated solution that correlates network fabric management and workload management for dynamic data centers
      Diagram labels: Platform LSF; Service Policy; UFM; Fabric Provisioning; Control & Optimization
  31. Integration with Platform LSF: how does it work?
      Automation and Optimization
  32. UFM Benefits
      Fabrics w/o UFM
        • Complex and Manual Processes: needs admin skills; many options left entirely unused
        • Low Performing Unutilized Fabrics: arbitrary routing algorithms, QoS seldom implemented; congested fabrics, latency affected
        • Ineffective Troubleshooting: long troubleshooting time; performance issues take days to analyze
        • Little Fabric Visibility: unnoticed performance degradation; difficult to assess impact
      UFM Customers
        • Simple and Automated: lowers administration task time from days to minutes
        • Increased Performance: reduce congestion, lower latency; quicker application runtime
        • Quick Issue Resolution: Dashboard, Alarms, Congestion Map; reduces downtime, high fabric utilization
        • In-Depth Visibility and Control: clear health and performance visualization; business-oriented impact and root cause analysis
