Overlay HPC Information

In this presentation from ISC'14, Christian Kniep of Bull presents "Understand Your Cluster by Overlaying Multiple Information Layers." Kniep uses Docker technology in a novel way to ease the administration of InfiniBand networks.

"Today's data center managers are burdened by a lack of aligned information of multiple layers. Work-flow events like 'job XYZ starts' aligned with performance metrics and events extracted from log facilities are low-hanging fruit that is on the edge to become use-able due to open-source software like Graphite, StatsD, logstash and alike. This talk aims to show off the benefits of merging multiple layers of information within an InfiniBand cluster by using use-cases for system operation and management level personal. Mr. Kniep held two BoF sessions which described the lack of InfiniBand (ISC'12) and generic HPC monitoring (ISC'13). This years' session aims to propose a way to fix it. To drill into the issue, Mr. Kniep uses his recently started project QNIBTerminal to spin up a complete clusterstack using LXC containers."

Transcript

  1. Overlay HPC Information. Christian Kniep, R&D HPC Engineer, 2014-06-25
  2. About Me: 10y+ SysAdmin, 8y+ SysOps, 6y+ DevOps, 1y+ R&D; B.Sc. (2008-2011). @CQnib, http://blog.qnib.org, https://github.com/ChristianKniep
  3. My ISC History - Motivation
  4. My ISC History - Description
  5. HPC Software Stack (rough estimate): Hardware: HW-sensors/-errors; OS: kernel, userland tools; Services: storage, job scheduler; MiddleWare: MPI, ISV-libs; Software: end-user application; Excel: KPI, SLA. Each layer is watched by a different audience (HW, SysOps, Mgmt, users, power users/ISVs).
  6. HPC Software Stack (goal): the same layers, now with Log/Events and Performance columns spanning all of them.
  7. QNIBTerminal - History
  8. QNIB: a cluster of n*1000+ IB nodes is hard to debug; with no useful tools in sight, I created my own.
  9. QNIB: a cluster of n*1000+ IB nodes is hard to debug; with no useful tools in sight, I created my own, which got a Graphite update in late 2013.
  10. Achieved HPC Software Stack: Hardware: IB-sensors/-errors; OS: kernel, userland tools; Services: storage, job scheduler; MiddleWare: MPI, ISV-libs; Software: end-user application; Excel: KPI, SLA; Log/Events and Performance now span the stack.
  11. QNIBTerminal - Implementation
  12. QNIBTerminal (blog.qnib.org): haproxy in front; Services: dns (helixdns), etcd; Log/Events: elk (elasticsearch, logstash, kibana); Performance: carbon, graphite-web, graphite-api, grafana; Compute: slurmctld plus compute0..compute<N> running slurmd. (A container-launch sketch follows the transcript.)
  13. DEMONSTRATION
  14. Future Work
  15. More Services: improve the work-flow for log events; a Nagios(-like) node is missing; cluster file system; LDAP; additional dashboards; inventory; using InfiniBand for communication traffic.
  16. Graph Representation: a graph inventory is needed; a hierarchical view is not enough.
  17. Graph Representation: a GraphDB seems to be a good idea. Example topology: comp0..comp2, ibsw0, ibsw2, eth1, eth10, ldap12, lustre0. Pseudo-query: RETURN node=comp* WHERE ROUTE TO lustre_service INCLUDES ibsw2. (A query sketch follows the transcript.)
  18. Conclusion
  19. Conclusion: n*100 containers are easy (50 on my laptop; running a 300-node cluster stack); n*1000 containers through clustering; the complete toolchain could be automated (testing, verification, Q&A); Showcase: show a customer his (to-be) software stack and convince the SysOps team 'they have nothing to fear'; Training: new SysOps could start on a virtual cluster, and a node can be 'strangulated' to replay an error. (A node-freeze sketch follows the transcript.)
  20. Log AND Performance Management: metrics without logs are useless!
  21. Log AND Performance Management: metrics without logs are useless, and the other way around...
  22. Log AND Performance Management: metrics without logs are useless, and the other way around; overlapping is king. (An event/metric overlay sketch follows the transcript.)
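
The stack on slide 12 maps naturally onto plain docker run calls. The sketch below wires up those boxes as linked containers, 2014-style; the qnib/* image names and the link topology are assumptions for illustration (see blog.qnib.org for the real setup).

    # Sketch: spin up a QNIBTerminal-like stack as linked Docker containers.
    # Image names and links are assumed, not taken from the talk.
    import subprocess

    SERVICES = [
        # (container name, image, links to already-running containers)
        ("etcd",         "qnib/etcd",         []),
        ("dns",          "qnib/helixdns",     ["etcd"]),
        ("elk",          "qnib/elk",          []),  # elasticsearch/logstash/kibana
        ("carbon",       "qnib/carbon",       []),
        ("graphite-web", "qnib/graphite-web", ["carbon"]),
        ("graphite-api", "qnib/graphite-api", ["carbon"]),
        ("grafana",      "qnib/grafana",      ["graphite-web", "elk"]),
        ("haproxy",      "qnib/haproxy",      ["grafana", "graphite-web", "elk"]),
        ("slurmctld",    "qnib/slurmctld",    ["dns", "elk", "carbon"]),
    ]

    def run(name, image, links):
        cmd = ["docker", "run", "-d", "--name", name]
        for target in links:
            cmd += ["--link", "%s:%s" % (target, target)]
        cmd.append(image)
        subprocess.run(cmd, check=True)

    for name, image, links in SERVICES:
        run(name, image, links)

    # Compute layer: compute0..compute<N-1>, each running slurmd.
    for i in range(4):
        run("compute%d" % i, "qnib/slurmd", ["slurmctld", "dns", "elk", "carbon"])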
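
Slide 17's pseudo-query translates almost one-to-one into an off-the-shelf graph database. A minimal sketch with Neo4j's Python driver, assuming a schema (Node/Switch/Service labels, LINK relationships) and connection details the talk does not prescribe:

    # Sketch: "RETURN node=comp* WHERE ROUTE TO lustre_service INCLUDES ibsw2"
    # expressed as Cypher over an assumed topology schema.
    from neo4j import GraphDatabase

    QUERY = """
    MATCH (n:Node)-[:LINK*]->(sw:Switch {name: 'ibsw2'})-[:LINK*]->(s:Service {name: 'lustre0'})
    WHERE n.name STARTS WITH 'comp'
    RETURN DISTINCT n.name
    """

    driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "secret"))
    with driver.session() as session:
        for record in session.run(QUERY):
            print(record["n.name"])  # compute nodes routed through ibsw2
    driver.close()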
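
The 'strangulate a node to replay an error' idea from slide 19 falls out of the container approach almost for free: freezing a container makes it look like a hung node to the rest of the stack. A sketch, assuming the compute<N> container names from the architecture slide:

    # Sketch: freeze a virtual node with the cgroup freezer (docker pause),
    # wait for monitoring/SLURM to react, then revive it.
    import subprocess
    import time

    def strangulate(node, seconds):
        subprocess.run(["docker", "pause", node], check=True)    # node goes dark
        time.sleep(seconds)                                      # let the stack notice
        subprocess.run(["docker", "unpause", node], check=True)  # node comes back

    strangulate("compute2", 120)  # e.g. long enough for slurmctld to mark it down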
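
Slides 20-22 argue that logs and metrics only become useful once they overlap. One way to get there with the stack above: write events into graphite-web's events API right next to the carbon metric stream, so a dashboard can draw both on the same time axis. Host names, the metric path, and the event payload below are assumptions:

    # Sketch: a performance sample and a matching event, time-aligned.
    import json
    import socket
    import time
    import urllib.request

    now = int(time.time())

    # Metric via carbon's plaintext protocol ("path value timestamp").
    with socket.create_connection(("carbon.example.com", 2003)) as sock:
        sock.sendall(("hpc.compute0.ib.port1.xmit_data 1234 %d\n" % now).encode())

    # Event via graphite-web's /events/ endpoint; Grafana and graphite-web
    # can overlay such events on the metric graphs.
    event = {"what": "job XYZ starts", "tags": ["slurm", "job"], "when": now}
    req = urllib.request.Request(
        "http://graphite-web.example.com/events/",
        data=json.dumps(event).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)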