OSDC 2014
Overlay Datacenter Information
Christian Kniep

Bull SAS
2014-04-10
About Me
2
❖ Me (>30y)
❖ SysOps (>10y)
❖ SysOps v1.1 (>8y)
❖ BSc (2008-2011)
❖ DevOps (>4y)
❖ R&D [OpsDev?] (>1y)
Agenda
3
❖ Cluster Stack
❖ Motivation (InfiniBand use-case)
❖ QNIB/ng
❖ QNIBTerminal (virtual cluster using docker)
[Diagram: Cluster Stack, IB, QNIBng, QNIBTerminal; parts labeled I. to III.]
Cluster Stack Work Environment
4
Cluster?
5
"A computer cluster consists of a set of loosely connected or tightly connected computers that work together so that in many respects they can be viewed as a single system." - wikipedia.org
[Figure: User]
HPC-Cluster
6
High Performance Computing
❖ HPC: Surfing the bottleneck
❖ Weakest link breaks performance
Cluster Layers
7
Hardware: IPMI, lm_sensors, IB counter
Operating System: Kernel, Userland tools
MiddleWare: MPI, ISV-libs
Services: Storage, Job Scheduler, sshd
Software: End user application
Excel: KPI, SLA
(rough estimate)
[Figure: roles placed alongside the layers: End User, PowerUser/ISV, Mgmt, SysOps, SysOpsL1, SysOpsL2, SysOpsL3, SysOpsMgmt, ISVMgmt; data types: Events, Metrics]
Layer n
8
❖ Every layer is composed of layers
❖ How deep to go?
Little Data w/o Connection
9
❖ Multiple data sources
❖ No way of connecting them
❖ Connecting is manual labour
❖ Experience driven
❖ Niche solutions misleading
IB + QNIBng Motivation
10
Modular Switch
11-15
❖ Looks like one "switch"
❖ Composed of a network itself
❖ Which route is taken is transparent to the application
  ❖ LB1 <> FB1 <> LB4
  ❖ LB1 <> FB2 <> LB4
  ❖ LB1 -> FB1 -> LB4 / LB1 <- FB2 <- LB4
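The route choices listed on this slide can be enumerated in a few lines. This is a minimal sketch, assuming every LB board is wired to every FB board; that wiring and the function names `candidate_paths` and `round_trips` are illustrative assumptions, not taken from the talk. It shows that with two FBs there are already four forward/return combinations between LB1 and LB4, including the asymmetric one from the last bullet.

```python
from itertools import product


def candidate_paths(src, dst, fabric_boards):
    """Return every src -> FB -> dst path through the internal fabric.

    Assumes (for illustration only) that each FB connects to both boards.
    """
    return [(src, fb, dst) for fb in fabric_boards]


def round_trips(src, dst, fabric_boards):
    """Pair each forward path with each return path; they need not share an FB."""
    forward = candidate_paths(src, dst, fabric_boards)
    backward = candidate_paths(dst, src, fabric_boards)
    return list(product(forward, backward))


if __name__ == "__main__":
    for fwd, back in round_trips("LB1", "LB4", ["FB1", "FB2"]):
        print(" -> ".join(fwd), "/", " -> ".join(back))
```

The application only sees "LB1 talks to LB4"; which of these combinations is in use is decided by the routing inside the switch.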
Debug-Nightmare
16
❖ Job seems to fail due to bad internal link
❖ 96 port switch
❖ Multiple autonomous job-cells
❖ Relevant information
  ❖ Job status (Resource Scheduler)
  ❖ Routes (IB Subnet Manager)
  ❖ IB Counter (Command Line)
❖ Changing one plug recomputes routes :)
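Overlaying the three information sources named on this slide could look roughly like the sketch below. It assumes the data has already been collected into plain dictionaries; the function name `suspect_links`, the link naming scheme, and the sample values are all hypothetical and only illustrate the join, not how the scheduler, subnet manager, or counters are actually queried.

```python
def suspect_links(job_nodes, routes, link_errors, threshold=0):
    """Return links used by a job whose error counter exceeds the threshold.

    job_nodes:   nodes allocated to the job (from the resource scheduler)
    routes:      {(src, dst): [link, ...]} as computed by the subnet manager
    link_errors: {link: error_count} read from the IB port counters
    """
    used = set()
    for src in job_nodes:
        for dst in job_nodes:
            if src != dst:
                used.update(routes.get((src, dst), []))
    return sorted(link for link in used if link_errors.get(link, 0) > threshold)


if __name__ == "__main__":
    # Hypothetical sample data: forward and return path use different FBs.
    routes = {
        ("node01", "node13"): ["LB1-FB1", "FB1-LB4"],
        ("node13", "node01"): ["LB4-FB2", "FB2-LB1"],
    }
    errors = {"LB4-FB2": 17, "LB1-FB1": 0}
    print(suspect_links(["node01", "node13"], routes, errors))
```

Because replugging a cable recomputes the routes, the `routes` input would have to be captured at the time the job ran, which is the part that is hard to do by hand.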
[Poster] Communication Networks
IBPM: An Open-Source-Based Framework for InfiniBand Performance Monitoring
Michael Hoefling¹, Michael Menth¹, Christian Kniep², Marcus Camen²
Background: InfiniBand (IB)
❖ State-of-the-art communication technology for interconnection in high-performance computing data centers
❖ Point-to-point bidirectional links
❖ High throughput (40 Gbit/s with QDR)
❖ Low latency
❖ Dynamic on-line network reconfiguration
Rate Measurement in IB Networks
❖ Idea
  ❖ Extract raw network information from IB network
  ❖ Analyze output
  ❖ Derive statistics about performance of the network
❖ Topology Extraction
  ❖ Subnet discovery using ibnetdiscover
  ❖ Produces human readable file of network topology
  ❖ Process output to produce graphical representation of the network
❖ Remote Counter Readout
  ❖ Each port has its own set of performance counters
  ❖ Counters measure, e.g., transferred data, congestion, errors, link state changes
IBPM: Demo Overview
❖ ibsim-Based Network Simulation
  ❖ ibsim simulates an IB network
  ❖ Simple topology changes possible (GUI)
❖ ibsim limitations
  ❖ No performance simulation possible
  ❖ No data rate changes possible
❖ Real IB Network
  ❖ Physical network
  ❖ Allows performance measurements
  ❖ GUI controlled traffic scenarios
17
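Deriving a rate from the counter readout described on the poster comes down to a difference of two samples. The sketch below is an illustration under assumptions: the function name `data_rate_gbit` is hypothetical, and it relies on the InfiniBand data counters (PortXmitData, PortRcvData) counting 32-bit words, so each tick stands for 32 bits. It refuses to guess when the counter went backwards, for example after a reset.

```python
def data_rate_gbit(counter_before, counter_after, seconds):
    """Estimate the data rate in Gbit/s between two data-counter readouts.

    The counters tick once per 32-bit word, so each tick is worth 32 bits.
    """
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    if counter_after < counter_before:
        raise ValueError("counter went backwards (reset between readouts?)")
    ticks = counter_after - counter_before
    return ticks * 32 / seconds / 1e9


if __name__ == "__main__":
    # Two hypothetical readouts of the same port taken 10 seconds apart.
    print(data_rate_gbit(1_000_000, 2_500_001_000_000, 10.0))
```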
OpenSM
18
❖ OpenSM Performance Manager
  ❖ Sends token to all ports
  ❖ All ports reply with metrics
  ❖ Callback triggered for every reply
❖ osmeventplugin
  ❖ Dumps info to file
[Diagram: OpenSM with PerfMgmt and osmeventplugin, connected to switches (Sw) and nodes]
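The poll-and-callback flow on this slide can be mimicked with a toy model. This is not OpenSM's plugin API; the class `PerfManager`, the helper `line_dumper`, and the port and counter names are hypothetical stand-ins, and a Python list plays the role of the dump file.

```python
class PerfManager:
    """Toy stand-in for a performance manager that polls ports."""

    def __init__(self, ports):
        # ports: {port_name: callable returning a dict of counter values}
        self.ports = ports
        self.callbacks = []

    def register(self, callback):
        """Register a plugin callback taking (port_name, metrics)."""
        self.callbacks.append(callback)

    def sweep(self):
        """Query every port once and fire each callback once per reply."""
        for name, read_counters in self.ports.items():
            reply = read_counters()
            for callback in self.callbacks:
                callback(name, reply)


def line_dumper(lines):
    """Return a callback that appends one text line per reply to `lines`."""
    def on_reply(port, metrics):
        fields = " ".join(f"{key}={value}" for key, value in sorted(metrics.items()))
        lines.append(f"{port} {fields}")
    return on_reply


if __name__ == "__main__":
    dump = []
    manager = PerfManager({"sw1/p1": lambda: {"xmit_data": 10, "rcv_data": 7}})
    manager.register(line_dumper(dump))
    manager.sweep()
    print("\n".join(dump))
```

Swapping `line_dumper` for a callback that forwards the reply elsewhere is the step the following slides take with qnib and qnibng.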
OpenSM
19
❖ qnib
  ❖ sends metrics to RRDtool
  ❖ events to PostgreSQL
❖ qnibng
  ❖ sends metrics to graphite
  ❖ events to logstash
[Diagram: OpenSM, PerfMgmt, qnibng]
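Sending a metric to Graphite can be as small as one text line per data point. The sketch below uses Graphite's plaintext protocol (`path value timestamp`, newline-terminated, conventionally on TCP port 2003); the function names and the metric path `ib.sw1.port1.xmit_data` are made up for illustration and are not claimed to be what qnibng emits.

```python
import socket
import time


def graphite_line(path, value, timestamp=None):
    """Format one data point as 'path value timestamp' plus a newline."""
    if timestamp is None:
        timestamp = int(time.time())
    return f"{path} {value} {int(timestamp)}\n"


def send_metrics(lines, host="localhost", port=2003):
    """Send formatted lines to a carbon plaintext listener.

    host and port are assumptions; 2003 is the usual plaintext default.
    """
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall("".join(lines).encode("ascii"))


if __name__ == "__main__":
    # Only formats the line; call send_metrics() when a carbon daemon is reachable.
    print(graphite_line("ib.sw1.port1.xmit_data", 42, 1397088000), end="")
```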
Graphite Events: port is up/down
20
21
22
QNIBTerminal Proof of Concept
23
Cluster Stack Mock-Up
❖ IB events and metrics are not enough
❖ How to get real-world behavior?
❖ Wanted:
  ❖ Slurm (Resource Scheduler)
  ❖ MPI enabled compute nodes
  ❖ As much additional cluster stack as possible (Graphite, elasticsearch/logstash/kibana, Icinga, Cluster-FS, …)
24
Classical Virtualization
❖ Big overhead for a simple node
❖ Resources provisioned in advance
❖ Host resources allocated
25
LXC (docker)
❖ minimal overhead (a couple of MB)
❖ no resource pinning
❖ cgroups option
❖ highly automatable
26
NOW: Watch OSDC2014 talk 'Docker' by Tobias Schwab
Virtual Cluster Nodes
❖ Master Node (etcd, DNS, slurmctld)
❖ monitoring (graphite + statsd)
❖ log mgmt (ELK)
❖ compute nodes (slurmd)
❖ alarming (Icinga) [not integrated]
27
[Diagram: host running the containers master, monitoring, logmgmt, compute0, compute1, … computeN]
Master Node
❖ takes care of inventory (etcd)
❖ provides DNS (+PTR)
❖ Integrate Rudder, Ansible, Chef, …?
28
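Turning an inventory into forward and reverse DNS entries might look like the sketch below. It assumes the inventory is available as a simple host-to-IP mapping; how it is stored in etcd is not shown in the talk, and the function name `dns_records`, the domain `cluster.local`, and the sample addresses are hypothetical.

```python
import ipaddress


def dns_records(inventory, domain="cluster.local"):
    """Build (name, type, value) tuples with an A and a PTR record per host."""
    records = []
    for host, ip in sorted(inventory.items()):
        fqdn = f"{host}.{domain}."
        records.append((fqdn, "A", ip))
        # reverse_pointer yields e.g. '2.0.17.172.in-addr.arpa' for 172.17.0.2
        reverse_name = ipaddress.ip_address(ip).reverse_pointer + "."
        records.append((reverse_name, "PTR", fqdn))
    return records


if __name__ == "__main__":
    inventory = {"compute0": "172.17.0.2", "compute1": "172.17.0.3"}
    for name, rtype, value in dns_records(inventory):
        print(f"{name:32} {rtype:4} {value}")
```

The PTR half matters for services that resolve peers by address, which is presumably why the slide calls it out.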
Non-Master Nodes (in general)
❖ are started with master as DNS
❖ mounting /scratch, /chome (sits on SSDs)
❖ supervisord kicks in and starts services and setup-scripts
❖ sending metrics to graphite
❖ logs to logstash
29
docker-compute
❖ slurmd
❖ sshd
❖ logstash-forwarder
❖ openmpi
❖ qperf
30
docker-graphite (monitoring)
❖ full graphite stack + statsd
❖ stresses IO (<3 SSDs)
❖ needs more care (optimize IO)
31
docker-elk (Log Mgmt)
❖ elasticsearch, logstash, kibana
❖ inputs: syslog, lumberjack
❖ filters: none
❖ outputs: elasticsearch
32
It’s alive!
33
Start Compute Node
34
Start Compute Node
35
Check Slurm Config
36
Run MPI-Job
37
TCP benchmark
38
QNIBTerminal Future Work
39
docker-icinga
40
❖ Icinga to provide
  ❖ state-of-the-cluster overview
  ❖ bundle with graphite/elk
  ❖ no big deal…
❖ Is this going to scale?
docker-(GlusterFS,Lustre)
❖ Cluster scratch to integrate with
❖ Need for kernel modules stalls the attempt
❖ Might be moved into VirtualBox (vagrant)
41
Humans
42
❖ How will SysOps/DevOps/Mgmt
  ❖ react to the changes?
  ❖ adopt them?
  ❖ fear them?
Big Data
43
❖ Truckload of
  ❖ Events
  ❖ Metrics
  ❖ Interaction
node01.system.memory.usage 9
node13.system.memory.usage 14
node35.system.memory.usage 12
node95.system.memory.usage 11
target=sumSeries(node{01,13,35,95}.system.memory.usage)
job1.node01.system.memory.usage 9
job1.node13.system.memory.usage 14
job1.node35.system.memory.usage 12
job1.node95.system.memory.usage 11
target=sumSeries(job1.*.system.memory.usage)
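The difference between the two targets on this slide can be reproduced offline. The sketch below is a simplified emulation, not Graphite's implementation: it supports only the `*` and `{a,b}` path globs, and it sums a single latest value per metric, whereas Graphite's `sumSeries` sums whole time series point by point. The function names are hypothetical; the metric values are the ones from the slide.

```python
import re


def glob_to_regex(pattern):
    """Translate a Graphite-style path glob ('*' and '{a,b}') into a regex."""
    parts = []
    for token in re.split(r"(\*|\{[^}]*\})", pattern):
        if token == "*":
            parts.append(r"[^.]*")  # '*' never crosses a dot
        elif token.startswith("{") and token.endswith("}"):
            alternatives = token[1:-1].split(",")
            parts.append("(?:" + "|".join(map(re.escape, alternatives)) + ")")
        else:
            parts.append(re.escape(token))
    return re.compile("".join(parts) + r"\Z")


def sum_series(pattern, metrics):
    """Sum the latest value of every metric whose path matches the glob."""
    regex = glob_to_regex(pattern)
    return sum(value for path, value in metrics.items() if regex.match(path))


if __name__ == "__main__":
    usage = {"node01": 9, "node13": 14, "node35": 12, "node95": 11}
    metrics = {}
    for node, value in usage.items():
        metrics[f"{node}.system.memory.usage"] = value
        metrics[f"job1.{node}.system.memory.usage"] = value
    print(sum_series("node{01,13,35,95}.system.memory.usage", metrics))
    print(sum_series("job1.*.system.memory.usage", metrics))
```

Both queries give the same total, but the second one needs no knowledge of which nodes the job ran on; that knowledge is baked into the metric path when the data is written.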
pipework / mininet
❖ Currently all containers are bound to the docker0 bridge
❖ Creating topology with virtual/real switches would be nice
❖ First iteration might use pipework
❖ More complete one should use vSwitches (mininet?)
44
Dockerfiles
❖ Only 3 images are Fedora 20 (fd20) based
45
Questions?
❖ Pictures
  ❖ p2: http://de.wikipedia.org/wiki/Datei:Audi_logo.svg
        http://commons.wikimedia.org/wiki/File:Daimler_AG.svg
        http://ffb.uni-lueneburg.de/20JahreFFB/
  ❖ p4: https://www.flickr.com/photos/adeneko/4229090961
  ❖ p6: cae t100
        https://www.flickr.com/photos/losalamosnatlab/7422429706
  ❖ p8: http://www.brendangregg.com/Slides/SCaLE_Linux_Performance2013.pdf
  ❖ p9: https://www.flickr.com/photos/riafoge/6796129047
  ❖ p10: https://www.flickr.com/photos/119364768@N03/12928685224/
  ❖ p11: http://www.mellanox.com/page/products_dyn?product_family=74
  ❖ p23: https://www.flickr.com/photos/jaxport/3077543062
  ❖ p25/26: https://blog.trifork.com/2013/08/08/next-step-in-virtualization-docker-lightweight-containers/
  ❖ p33: https://www.flickr.com/photos/fkehren/5139094564
  ❖ p39: https://www.flickr.com/photos/brizzlebornandbred/12852909293
46

OSDC 2014: Christian Kniep - Understand your data center by overlaying multiple information layers
