ISC 12 BoF: InfiniBand? Problems? Do you care?

Published in: Technology, Education

  1. InfiniBand? Problems? Do you care?
     Christian Kniep / Jan Wender
     science + computing ag
     IT services for sophisticated computer environments
     Tübingen | München | Berlin | Düsseldorf
  2. Agenda
     This is an interactive session!
     ▪ Who is on the podium?
     ▪ Living Histogram?
     ▪ Getting some statistics
     ▪ Living Histogram
     ▪ Existing Monitoring Solutions
     ▪ Discussion
     ▪ Quick and Dirty Analysis
     ▪ Conclusions
     BoF InfiniBand | 2012-06-19 © 2012 science + computing ag
  3. On the podium
  4. science + computing at a glance
     Founding year: 1989
     Locations: Tübingen, München, Berlin, Düsseldorf
     Employees: 270
     Shareholder: Bull S.A. (100%)
     Revenue 10/11: EUR 27 million
     Partners: Daikin Industries, Japan; NICE srl, Italy; Exa Corporation, USA; Platform Computing, Canada
  5. Living Histogram?
     Brian L. Joiner, International Statistical Review / Revue Internationale de Statistique, Vol. 43, No. 3 (Dec. 1975), pp. 339-340.
  6. Living Histogram: Size of Fabric
     ▪ <10
     ▪ <50
     ▪ <500
     ▪ >500
  7. Living Histogram: Switch Structure
     ▪ Switch size
       ▪ single switch (mlx4036, qlogic12300)
       ▪ modular switch (mlx5600, qlogic12800)
     ▪ Amount
       ▪ few
       ▪ many
  8. Living Histogram: Focus
     ▪ Stability ➡ maintenance cost
     ▪ High performance ➡ extremely optimized
  9. Living Histogram: Type of Use
     ▪ Cluster purpose
       ▪ Single-purpose cluster
       ▪ Multi-purpose cluster
     ▪ Usage
       ▪ One job at a time
       ▪ Multiple jobs
  10. Living Histogram: Kind/Amount of Problems
     ▪ Impact
       ▪ minor
       ▪ major
     ▪ Amount
       ▪ few
       ▪ many
  11. Living Histogram: Problem Solving
     ▪ Iterative ➡ reseat / reboot
     ▪ Analytic ➡ dig into the problem ➡ try to wipe it out
  12. Monitoring Solutions
     stable (but not useful to admins?)
     ▪ infiniband-diags
       ▪ ibcheckerrors
       ▪ ibdiagpath
     ▪ plugin to non-IB systems
       ▪ nagios
       ▪ collectl
     ▪ hardware vendor suites
       ▪ Unified Fabric Manager (Mellanox)
       ▪ InfiniBand Fabric Suites (QLogic)
     unstable (individually carved)
     ▪ wrapper of infiniband-diags
       ▪ INAM (Ohio State University)
       ▪ QNIB
       ▪ .....
     ▪ not listed stuff
       ▪ ...
  14. Modular Switches
     switchguid=0xac1(ac1)   # Spine 1
     Switch  36 "S-ac1"      # "A1" enhanced port 0 lid 11 lmc 0
     [1]  "S-bc1"[1]         # "B1" lid 21 4xQDR
     [2]  "S-bc2"[1]         # "B2" lid 22 4xQDR
     [3]  "S-bc3"[1]         # "B3" lid 23 4xQDR

     switchguid=0xac2(ac2)   # Spine 2
     Switch  36 "S-ac2"      # "A2" enhanced port 0 lid 12 lmc 0
     [1]  "S-bc1"[2]         # "B1" lid 21 4xQDR
     [2]  "S-bc2"[2]         # "B2" lid 22 4xQDR
     [3]  "S-bc3"[2]         # "B3" lid 23 4xQDR

     switchguid=0xbc1(bc1)   # Line 1
     Switch  36 "S-bc1"      # "B1" enhanced port 0 lid 21 lmc 0
     [1]  "S-ac1"[1]         # "A1" lid 11 4xQDR
     [2]  "S-ac2"[1]         # "A2" lid 12 4xQDR
     [3]  "H-1"[1](f1)       # "Host1" lid 101 4xQDR

     switchguid=0xbc2(bc2)   # Line 2
     Switch  36 "S-bc2"      # "B2" enhanced port 0 lid 22 lmc 0
     [1]  "S-ac1"[2]         # "A1" lid 11 4xQDR
     [2]  "S-ac2"[2]         # "A2" lid 12 4xQDR
     [3]  "H-2"[1](f2)       # "Host2" lid 102 4xQDR

     switchguid=0xbc3(bc3)   # Line 3
     Switch  36 "S-bc3"      # "B3" enhanced port 0 lid 23 lmc 0
     [1]  "S-ac1"[3]         # "A1" lid 11 4xQDR
     [2]  "S-ac2"[3]         # "A2" lid 12 4xQDR
     [3]  "H-3"[1](f3)       # "Host3" lid 103 4xQDR
  15-16. Modular Switches (the same topology listing as slide 14, repeated with diagram labels: Chassis1; Spine1, Spine2; Line1, Line2, Line3; Host1, Host2, Host3)
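The topology listing on the Modular Switches slides can be turned into a link table with a few lines of Python. The following is a minimal sketch, not the presenters' tooling: the function names `parse_links` and `classify` are hypothetical, and the rule "a switch with a host peer (an `H-` name) is a line switch, otherwise a spine" is an assumption taken from the naming used on the slide. Real fabrics may use other naming schemes.

```python
import re

# Topology text copied from the slide (ibnetdiscover-style listing).
TOPOLOGY = '''
switchguid=0xac1(ac1)   # Spine 1
Switch  36 "S-ac1"      # "A1" enhanced port 0 lid 11 lmc 0
[1]  "S-bc1"[1]         # "B1" lid 21 4xQDR
[2]  "S-bc2"[1]         # "B2" lid 22 4xQDR
[3]  "S-bc3"[1]         # "B3" lid 23 4xQDR

switchguid=0xac2(ac2)   # Spine 2
Switch  36 "S-ac2"      # "A2" enhanced port 0 lid 12 lmc 0
[1]  "S-bc1"[2]         # "B1" lid 21 4xQDR
[2]  "S-bc2"[2]         # "B2" lid 22 4xQDR
[3]  "S-bc3"[2]         # "B3" lid 23 4xQDR

switchguid=0xbc1(bc1)   # Line 1
Switch  36 "S-bc1"      # "B1" enhanced port 0 lid 21 lmc 0
[1]  "S-ac1"[1]         # "A1" lid 11 4xQDR
[2]  "S-ac2"[1]         # "A2" lid 12 4xQDR
[3]  "H-1"[1](f1)       # "Host1" lid 101 4xQDR

switchguid=0xbc2(bc2)   # Line 2
Switch  36 "S-bc2"      # "B2" enhanced port 0 lid 22 lmc 0
[1]  "S-ac1"[2]         # "A1" lid 11 4xQDR
[2]  "S-ac2"[2]         # "A2" lid 12 4xQDR
[3]  "H-2"[1](f2)       # "Host2" lid 102 4xQDR

switchguid=0xbc3(bc3)   # Line 3
Switch  36 "S-bc3"      # "B3" enhanced port 0 lid 23 lmc 0
[1]  "S-ac1"[3]         # "A1" lid 11 4xQDR
[2]  "S-ac2"[3]         # "A2" lid 12 4xQDR
[3]  "H-3"[1](f3)       # "Host3" lid 103 4xQDR
'''

# 'Switch <ports> "<id>"' starts a new switch record.
NODE_RE = re.compile(r'^Switch\s+\d+\s+"(?P<id>[^"]+)"')
# '[<port>] "<peer>"[<peer port>]' describes one link of the current switch.
LINK_RE = re.compile(r'^\[(?P<port>\d+)\]\s+"(?P<peer>[^"]+)"\[(?P<peer_port>\d+)\]')


def parse_links(text):
    """Return {switch_id: [(local_port, peer_id, peer_port), ...]}."""
    links = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        node = NODE_RE.match(line)
        if node:
            current = node.group("id")
            links[current] = []
            continue
        link = LINK_RE.match(line)
        if link and current is not None:
            links[current].append(
                (int(link.group("port")), link.group("peer"), int(link.group("peer_port")))
            )
    return links


def classify(links):
    """Assumed rule: a switch with an 'H-' peer is a line switch, else a spine."""
    roles = {}
    for switch, ports in links.items():
        has_host = any(peer.startswith("H-") for _, peer, _ in ports)
        roles[switch] = "line" if has_host else "spine"
    return roles


if __name__ == "__main__":
    for switch, role in sorted(classify(parse_links(TOPOLOGY)).items()):
        print(switch, role)
```

Grouping the resulting spine and line switches under one chassis name, as the diagram labels suggest, would need extra information (for example a GUID-to-chassis map) that the listing itself does not carry.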
  22. Discussion - Quick Analysis
     Fabric size
     ▪ small -> easy as pie?
     ▪ big -> critical mass for real analysis?
     Switch structure
     ▪ what is your routing algorithm?
     Focus
     ▪ 80:20 rule? performance / maintenance
     Type of use
     ▪ willing/forced to share
     Problem type / amount
     ▪ runs smoothly enough
     Problem solving
     ▪ learning curve starts steep
  29. Discussion - Conclusions
     Monitoring
     ▪ what approach?
     Do we scare you?
     ▪ not intending to spread Fear, Uncertainty and Doubt
     Our conclusions
     Your conclusions
  34. Thank you for your attention and participation!
     science + computing ag
     www.science-computing.de
     Phone: +49 (0)7071 9457 - 0
     Email: info@science-computing.de
