OSDC 2014
Overlay Datacenter Information
Christian Kniep
Bull SAS
2014-04-10
QNIBTerminal: Understand your datacenter by overlaying multiple information layers.

Today's data center managers are burdened by a lack of aligned information across multiple layers. Work-flow events like 'job starts', aligned with performance metrics and with events extracted from log facilities, are low-hanging fruit that is becoming usable thanks to open-source software like Graphite, StatsD, logstash and the like.
This talk aims to show the benefits of merging multiple layers of information within an InfiniBand cluster, using use-cases for level 1/2/3 personnel.

  1. OSDC 2014: Overlay Datacenter Information. Christian Kniep, Bull SAS, 2014-04-10
  2. About Me ❖ Me (>30y) ❖ SysOps (>10y) ❖ SysOps v1.1 (>8y) ❖ BSc (2008-2011) ❖ DevOps (>4y) ❖ R&D [OpsDev?] (>1y)
  3. Agenda ❖ I. Cluster Stack ❖ II. Motivation (InfiniBand use-case): QNIB/ng ❖ III. QNIBTerminal (virtual cluster using docker)
  4. Cluster Stack: Work Environment
  5. Cluster? "A computer cluster consists of a set of loosely connected or tightly connected computers that work together so that in many respects they can be viewed as a single system." (wikipedia.org)
  6. HPC-Cluster: High Performance Computing ❖ HPC: surfing the bottleneck ❖ The weakest link breaks performance
  7. Cluster Layers (rough estimate) ❖ Software: end-user application ❖ Services: storage, job scheduler, sshd ❖ Middleware: MPI, ISV libs ❖ Operating System: kernel, userland tools ❖ Hardware: IPMI, lm_sensors, IB counters ❖ Each layer has its audience (end user, power user/ISV, Mgmt, SysOps L1-L3) and emits events and metrics
  8. Layers ❖ Every layer is composed of layers ❖ How deep to go?
  9. Little Data w/o Connection ❖ Multiple data sources ❖ No way of connecting them ❖ Connecting is manual labour ❖ Experience driven ❖ Niche solutions misleading
  10. IB + QNIBng: Motivation
  11. Modular Switch ❖ Looks like one "switch"
  12. Modular Switch ❖ Looks like one "switch" ❖ Composed of a network itself
  13. Modular Switch ❖ Looks like one "switch" ❖ Composed of a network itself ❖ Which route is taken is transparent to the application ❖ LB1<>FB1<>LB4
  14. Modular Switch ❖ Looks like one "switch" ❖ Composed of a network itself ❖ Which route is taken is transparent to the application ❖ LB1<>FB1<>LB4 ❖ LB1<>FB2<>LB4
  15. Modular Switch ❖ Looks like one "switch" ❖ Composed of a network itself ❖ Which route is taken is transparent to the application ❖ LB1<>FB1<>LB4 ❖ LB1<>FB2<>LB4 ❖ LB1 ->FB1 ->LB4 / LB1 <-FB2 <-LB4
  16. Debug-Nightmare ❖ 96 port switch ❖ multiple autonomous job-cells ❖ Job seems to fail due to a bad internal link ❖ Relevant information: ❖ Job status (Resource Scheduler) ❖ Routes (IB Subnet Manager) ❖ IB counters (command line) ❖ Changing one plug recomputes the routes :)
  17. Communication Networks: "IBPM: An Open-Source-Based Framework for InfiniBand Performance Monitoring" (Michael Hoefling, Michael Menth, Christian Kniep, Marcus Camen) ❖ Background, InfiniBand (IB): state-of-the-art interconnect for HPC data centers; point-to-point bidirectional links; high throughput (40 Gbit/s with QDR); low latency; dynamic on-line network reconfiguration ❖ Idea: extract raw network information from the IB network, analyze the output, derive statistics about the network's performance ❖ Topology extraction: subnet discovery using ibnetdiscover produces a human-readable topology file, processed into a graphical representation of the network ❖ Remote counter readout: each port has its own set of performance counters, measuring e.g. transferred data, congestion, errors, link state changes ❖ ibsim-based network simulation: ibsim simulates an IB network, simple topology changes possible (GUI); limitations: no performance simulation, no data rate changes ❖ Real IB network: physical network, allows performance measurements, GUI-controlled traffic scenarios
  18. OpenSM ❖ OpenSM Performance Manager ❖ Sends a token to all ports ❖ All ports reply with metrics ❖ osmeventplugin: callback triggered for every reply, dumps info to file
  19. OpenSM PerfMgmt plugins: qnib and qnibng ❖ qnib ❖ sends metrics to RRDtool ❖ events to PostgreSQL ❖ qnibng ❖ sends metrics to Graphite ❖ events to logstash
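Shipping counters to Graphite, as qnibng does, comes down to Graphite's plaintext protocol: one `name value timestamp` line per datapoint, sent to carbon on TCP port 2003. A minimal sketch of such a sender; the host, port default, and the IB counter name are illustrative assumptions, not qnibng's actual naming:

```python
import socket
import time

def format_metric(path, value, timestamp=None):
    """Render one datapoint in Graphite's plaintext line protocol."""
    if timestamp is None:
        timestamp = int(time.time())
    return "%s %s %d\n" % (path, value, timestamp)

def send_metric(path, value, host="graphite", port=2003):
    """Ship a single datapoint to carbon-cache over TCP."""
    line = format_metric(path, value)
    sock = socket.create_connection((host, port))
    try:
        sock.sendall(line.encode("ascii"))
    finally:
        sock.close()

# Example line for a (hypothetical) per-port IB counter:
line = format_metric("ib.switch01.port12.xmit_data", 4096, timestamp=1397124000)
```

The dotted metric path is what later enables the wildcard and `sumSeries` queries shown on the "Big Data!" slide.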
  20. Graphite Events: port is up/down
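Port up/down events land in Graphite through its events API, a JSON POST to `/events/`. A sketch of building such a request; the field names (`what`, `tags`, `data`) are Graphite's, while the event content and the Graphite URL are illustrative assumptions:

```python
import json
import urllib.request

def port_event(state, port, graphite_url="http://graphite:80"):
    """Build a Graphite event request for an IB port state change."""
    payload = json.dumps({
        "what": "port %s" % state,        # e.g. "port down"
        "tags": "infiniband %s" % port,   # space-separated tag string
        "data": "reported by qnibng",
    }).encode("utf-8")
    return urllib.request.Request(graphite_url + "/events/", data=payload,
                                  headers={"Content-Type": "application/json"})

req = port_event("down", "switch01.port12")
```

Rendered as vertical markers on a graph, these events line up port flaps with the metric curves around them.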
  21. (screenshot)
  22. (screenshot)
  23. QNIBTerminal: Proof of Concept
  24. Cluster Stack Mock-Up ❖ IB events and metrics are not enough ❖ How to get real-world behavior? ❖ Wanted: ❖ Slurm (resource scheduler) ❖ MPI-enabled compute nodes ❖ As much additional cluster stack as possible (Graphite, elasticsearch/logstash/kibana, Icinga, Cluster-FS, …)
  25. Classical Virtualization ❖ Big overhead for a simple node ❖ Resources provisioned in advance ❖ Host resources allocated
  26. LXC (docker) ❖ minimal overhead (a couple of MB) ❖ no resource pinning ❖ cgroups optional ❖ highly automatable ❖ NOW: watch the OSDC 2014 talk 'Docker' by Tobias Schwab
  27. Virtual Cluster Nodes ❖ master node (etcd, DNS, slurmctld) ❖ monitoring (graphite + statsd) ❖ log mgmt (ELK) ❖ compute nodes (slurmd) ❖ alarming (Icinga) [not integrated] ❖ One host runs master, monitoring, log mgmt and compute0…computeN
  28. Master Node ❖ takes care of inventory (etcd) ❖ provides DNS (+PTR) ❖ Integrate Rudder, Ansible, Chef, …?
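The etcd-backed inventory boils down to HTTP calls against etcd's v2 keys API: a PUT with a form-encoded `value` creates or updates a key. A sketch of registering a compute node; the key layout (`/v2/keys/nodes/<name>`), the etcd address, and the IP are illustrative assumptions, not the talk's actual scheme:

```python
import urllib.parse
import urllib.request

def register_node(name, ip, etcd="http://master:4001"):
    """Build an etcd v2 keys-API request registering a node's IP."""
    body = urllib.parse.urlencode({"value": ip}).encode("ascii")
    return urllib.request.Request(etcd + "/v2/keys/nodes/" + name,
                                  data=body, method="PUT")

req = register_node("compute0", "172.17.0.10")
```

A DNS shim on the master can then answer forward and PTR queries from the same key space.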
  29. Non-Master Nodes (in general) ❖ are started with the master as DNS ❖ mount /scratch and /chome (both on SSDs) ❖ supervisord kicks in and starts services and setup scripts ❖ send metrics to graphite ❖ logs to logstash
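The "supervisord kicks in" step could be wired up roughly as below; this is a sketch, and the program names, paths, and the `setup.sh` helper are illustrative assumptions, not the actual image contents:

```ini
; /etc/supervisord.conf (sketch)
[supervisord]
nodaemon=true                    ; stay in the foreground as container PID 1

[program:setup]
command=/opt/qnib/setup.sh       ; hypothetical one-shot setup script
startsecs=0
autorestart=false

[program:sshd]
command=/usr/sbin/sshd -D

[program:slurmd]
command=/usr/sbin/slurmd -D
```

Running supervisord in the foreground keeps the container alive while it supervises the per-node services.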
  30. docker-compute ❖ slurmd ❖ sshd ❖ logstash-forwarder ❖ openmpi ❖ qperf
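An image carrying that service list could be described by a Dockerfile along these lines; the base tag and package names are assumptions for illustration (slide 45 only says most images are fd20-based), not the talk's actual build files:

```dockerfile
# Sketch of a compute-node image; package availability is assumed
FROM fedora:20
RUN yum install -y slurm openmpi qperf openssh-server supervisor
ADD supervisord.conf /etc/supervisord.conf
# supervisord then starts slurmd, sshd and the logstash-forwarder
CMD ["/usr/bin/supervisord", "-n"]
```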
  31. docker-graphite (monitoring) ❖ full graphite stack + statsd ❖ stresses IO (<3 SSDs) ❖ needs more care (optimize IO)
  32. docker-elk (Log Mgmt) ❖ elasticsearch, logstash, kibana ❖ inputs: syslog, lumberjack ❖ filters: none ❖ outputs: elasticsearch
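That input/filter/output wiring maps onto a logstash config roughly like the following sketch; the ports and certificate paths are illustrative assumptions (lumberjack, used by logstash-forwarder, requires TLS material):

```
input {
  syslog { port => 514 }
  lumberjack {
    port => 5043
    ssl_certificate => "/etc/logstash/forwarder.crt"
    ssl_key => "/etc/logstash/forwarder.key"
  }
}
# no filter block: events pass through unchanged, as on the slide
output {
  elasticsearch { host => "localhost" }
}
```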
  33. It's alive!
  34. Start Compute Node
  35. Start Compute Node
  36. Check Slurm Config
  37. Run MPI-Job
  38. TCP Benchmark
  39. QNIBTerminal: Future Work
  40. docker-icinga ❖ Icinga to provide a state-of-the-cluster overview ❖ bundle with graphite/elk ❖ no big deal… ❖ Is this going to scale?
  41. docker-(GlusterFS, Lustre) ❖ Cluster scratch file system to integrate ❖ The need for kernel modules freezes this attempt ❖ Might be pushed into VirtualBox (Vagrant)
  42. Humans! ❖ How do SysOps/DevOps/Mgmt ❖ react to the changes? ❖ adopt them? ❖ fear them?
  43. Big Data! ❖ Truckload of ❖ events ❖ metrics ❖ interaction
Per-node metrics vs. job-scoped metrics:
node01.system.memory.usage 9
node13.system.memory.usage 14
node35.system.memory.usage 12
node95.system.memory.usage 11
target=sumSeries(node{01,13,35,95}.system.memory.usage)
job1.node01.system.memory.usage 9
job1.node13.system.memory.usage 14
job1.node35.system.memory.usage 12
job1.node95.system.memory.usage 11
target=sumSeries(job1.*.system.memory.usage)
  44. pipework / mininet ❖ Currently all containers are bound to the docker0 bridge ❖ Creating a topology with virtual/real switches would be nice ❖ A first iteration might use pipework ❖ A more complete one should use vSwitches (mininet?)
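A first pipework iteration could attach containers to one bridge per virtual leaf switch, roughly as in this command sketch (bridge names, container names, and addresses are illustrative assumptions):

```
# pipework <bridge> <container> <ip/prefix>: adds an extra interface
# to a running container and enslaves the host side to the bridge
pipework br-leaf1 compute0 10.1.1.10/24
pipework br-leaf1 compute1 10.1.1.11/24
pipework br-leaf2 compute2 10.1.2.10/24
```

Traffic between the two bridges would then have to cross whatever "spine" links the mock topology provides, mimicking the modular-switch routing from the motivation slides.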
  45. Dockerfiles ❖ Only 3 images are fd20 based
  46. Questions? ❖ Pictures
❖ p2: http://de.wikipedia.org/wiki/Datei:Audi_logo.svg, http://commons.wikimedia.org/wiki/File:Daimler_AG.svg, http://ffb.uni-lueneburg.de/20JahreFFB/
❖ p4: https://www.flickr.com/photos/adeneko/4229090961
❖ p6: cae t100 https://www.flickr.com/photos/losalamosnatlab/7422429706
❖ p8: http://www.brendangregg.com/Slides/SCaLE_Linux_Performance2013.pdf
❖ p9: https://www.flickr.com/photos/riafoge/6796129047
❖ p10: https://www.flickr.com/photos/119364768@N03/12928685224/
❖ p11: http://www.mellanox.com/page/products_dyn?product_family=74
❖ p23: https://www.flickr.com/photos/jaxport/3077543062
❖ p25/26: https://blog.trifork.com/2013/08/08/next-step-in-virtualization-docker-lightweight-containers/
❖ p33: https://www.flickr.com/photos/fkehren/5139094564
❖ p39: https://www.flickr.com/photos/brizzlebornandbred/12852909293