Overlay HPC Information

In this presentation from ISC'14, Christian Kniep from Bull presents "Understand Your Cluster by Overlaying Multiple Information Layers." Kniep is using Docker technology in a novel way to ease the administration of InfiniBand networks.

"Today's data center managers are burdened by a lack of aligned information of multiple layers. Work-flow events like 'job XYZ starts' aligned with performance metrics and events extracted from log facilities are low-hanging fruit that is on the edge to become use-able due to open-source software like Graphite, StatsD, logstash and alike. This talk aims to show off the benefits of merging multiple layers of information within an InfiniBand cluster by using use-cases for system operation and management level personal. Mr. Kniep held two BoF sessions which described the lack of InfiniBand (ISC'12) and generic HPC monitoring (ISC'13). This years' session aims to propose a way to fix it. To drill into the issue, Mr. Kniep uses his recently started project QNIBTerminal to spin up a complete clusterstack using LXC containers."



Presentation Transcript

  • Overlay HPC Information. Christian Kniep, R&D HPC Engineer, 2014-06-25
  • About Me: 10y+ SysAdmin, 8y+ SysOps, B.Sc. (2008-2011), 6y+ DevOps, 1y+ R&D. @CQnib, http://blog.qnib.org, https://github.com/ChristianKniep
  • My ISC History - Motivation
  • My ISC History - Description
  • HPC Software Stack (rough estimate): Hardware (HW sensors/errors), OS (kernel, userland tools), Services (storage, job scheduler), Middleware (MPI, ISV libs), Software (end-user application), Excel (KPI, SLA); each layer is owned by a different role (HW, SysOps, Mgmt, User, PowerUser/ISV).
  • HPC Software Stack (goal): the same layers, with Log/Events and Performance information spanning the whole stack.
  • QNIBTerminal - History
  • QNIB: a cluster of n*1000+ IB nodes is hard to debug; with no useful tools in sight, I created my own (Graphite update in late 2013).
  • Achieved HPC Software Stack: Hardware (IB sensors/errors), OS (kernel, userland tools), Services (storage, job scheduler), Middleware (MPI, ISV libs), Software (end-user application), Excel (KPI, SLA), overlaid with Log/Events and Performance data.
  • QNIBTerminal - Implementation
  • QNIBTerminal (blog.qnib.org), one container per role. Services: haproxy, dns (helixdns), etcd. Log/Events: elasticsearch, logstash, kibana. Performance: carbon, graphite-web, graphite-api, grafana. Compute: slurmctld plus compute0..compute<N> running slurmd. (A sketch of spinning this stack up follows the transcript.)
  • DEMONSTRATION
  • Future Work
  • More Services: improve the workflow for log events; a Nagios(-like) node is missing; cluster file system; LDAP; additional dashboards; inventory; using InfiniBand for communication traffic.
  • Graph Representation: a graph inventory is needed; a hierarchical view is not enough.
  • Graph Representation: a GraphDB seems to be a good idea. Example graph: comp0, comp1, comp2, ibsw0, ibsw2, eth1, eth10, ldap12, lustre0. Pseudo-query: RETURN node=comp* WHERE ROUTE TO lustre_service INCLUDES ibsw2 (see the graph-query sketch after the transcript).
  • Conclusion
  • Conclusion: n*100 containers are easy (50 on my laptop; running a 300-node cluster stack), n*1000 containers through clustering. Training: new SysOps could start on a virtual cluster and 'strangulate' a node to replay an error. Showcase: show a customer his (to-be) software stack and convince the SysOps team 'they have nothing to fear'. The complete toolchain could be automated: testing, verification, Q&A.
  • Log AND Performance Management: metrics without logs are useless, and the other way around; overlapping is king.
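
The container stack on the QNIBTerminal implementation slide can be reproduced, in spirit, by starting one Docker container per service. Here is a sketch under stated assumptions: the image names below are placeholders, not QNIBTerminal's actual images, and networking, volumes, and configuration are omitted for brevity.

    import subprocess

    # Placeholder image names; QNIBTerminal's real images are not named here.
    SERVICES = {
        "haproxy":       "haproxy",
        "dns":           "helixdns-image",      # hypothetical
        "etcd":          "etcd-image",          # hypothetical
        "elasticsearch": "elasticsearch",
        "logstash":      "logstash",
        "kibana":        "kibana",
        "carbon":        "carbon-image",        # hypothetical
        "graphite-web":  "graphite-web-image",  # hypothetical
        "graphite-api":  "graphite-api-image",  # hypothetical
        "grafana":       "grafana/grafana",
        "slurmctld":     "slurm-ctld-image",    # hypothetical
    }

    def docker_run(name, image):
        # One detached container per service, named after its role.
        subprocess.run(["docker", "run", "-d", "--name", name, image], check=True)

    for name, image in SERVICES.items():
        docker_run(name, image)

    # Compute layer: compute0..compute<N> each run slurmd (N=4 here for illustration).
    for i in range(4):
        docker_run("compute%d" % i, "slurm-node-image")  # hypothetical image

This is also why the conclusion's numbers are plausible: each service is a lightweight container rather than a VM, so a few hundred fit on a single machine.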
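The pseudo-query on the graph-representation slide maps naturally onto a property-graph query. Below is a hedged sketch using Neo4j's Cypher via the official Python driver; the schema ((:Node {name}) vertices connected by [:LINKED_TO] edges) and the credentials are assumptions, not something the slides prescribe.

    from neo4j import GraphDatabase

    # Find all comp* nodes whose route to the lustre0 service crosses switch ibsw2,
    # i.e. the slide's "RETURN node=comp* WHERE ROUTE TO lustre_service INCLUDES ibsw2".
    QUERY = """
    MATCH p = (c:Node)-[:LINKED_TO*]-(s:Node {name: 'lustre0'})
    WHERE c.name STARTS WITH 'comp'
      AND any(hop IN nodes(p) WHERE hop.name = 'ibsw2')
    RETURN DISTINCT c.name AS node
    """

    driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "secret"))
    with driver.session() as session:
        for record in session.run(QUERY):
            print(record["node"])  # e.g. comp0, comp2 if their route crosses ibsw2
    driver.close()

A query like this answers exactly the operational question a hierarchical inventory cannot: which compute nodes are affected when a given switch misbehaves.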