OSDC 2014: Christian Kniep - Understand your data center by overlaying multiple information layers

Today's data center managers are burdened by a lack of aligned information across multiple layers. Workflow events such as 'job starts', aligned with performance metrics and events extracted from log facilities, are low-hanging fruit that is on the verge of becoming usable thanks to open-source software like Graphite, StatsD, logstash and the like.
This talk aims to show the benefits of merging multiple layers of information within an InfiniBand cluster, using use-cases for level 1/2/3 personnel.

Presentation Transcript

• OSDC 2014: Overlay Datacenter Information. Christian Kniep, Bull SAS, 2014-04-10
• About Me ❖ Me (>30y) ❖ SysOps (>10y) ❖ SysOps v1.1 (>8y) ❖ BSc (2008-2011) ❖ DevOps (>4y) ❖ R&D [OpsDev?] (>1y)
• Agenda ❖ I. Cluster Stack ❖ II. Motivation (InfiniBand use-case): QNIB/ng ❖ III. QNIBTerminal (virtual cluster using docker)
• Cluster Stack: Work Environment
• Cluster? „A computer cluster consists of a set of loosely connected or tightly connected computers that work together so that in many respects they can be viewed as a single system.“ - wikipedia.org
• HPC-Cluster: High Performance Computing ❖ HPC: surfing the bottleneck ❖ The weakest link breaks performance
• Cluster Layers (rough estimate) ❖ Hardware: IPMI, lm_sensors, IB counters ❖ Operating System: kernel, userland tools ❖ Middleware: MPI, ISV libs ❖ Services: storage, job scheduler, sshd ❖ Software: end-user application ❖ Excel: KPI, SLA. The slide's diagram maps personas onto these layers (End User, PowerUser/ISV, SysOps L1/L2/L3, SysOps Mgmt, ISV Mgmt, Mgmt), with events and metrics flowing from each layer.
• Layer n ❖ Every layer is composed of layers ❖ How deep to go?
• Little Data w/o Connection ❖ Multiple data sources ❖ No way of connecting them ❖ Connecting is manual labour ❖ Experience-driven ❖ Niche solutions are misleading
• IB + QNIBng: Motivation
• Modular Switch ❖ Looks like one „switch“ ❖ Composed of a network itself ❖ Which route is taken is transparent to the application: LB1<>FB1<>LB4, LB1<>FB2<>LB4, or even asymmetric LB1 ->FB1 ->LB4 / LB1 <-FB2 <-LB4
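Which of those internal routes a pair of endpoints is actually using can be queried from the fabric itself. A minimal sketch using ibtracert from infiniband-diags; the two LIDs are placeholders, not values from the talk:

    import subprocess

    # ibtracert prints the hop-by-hop path between two LIDs;
    # "12" and "47" are placeholder LIDs for illustration.
    path = subprocess.check_output(["ibtracert", "12", "47"])
    print(path.decode())  # reveals whether FB1 or FB2 carries the traffic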
• Debug-Nightmare ❖ Job seems to fail due to a bad internal link ❖ 96-port switch ❖ Multiple autonomous job-cells ❖ Relevant information: job status (resource scheduler), routes (IB subnet manager), IB counters (command line) ❖ Changing one plug recomputes the routes :)
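The pay-off of overlaying those three sources is that the correlation can be scripted instead of done from experience. A toy sketch; all three data structures are illustrative assumptions, not the talk's actual data:

    # job status (scheduler), routes (subnet manager) and per-hop
    # error counters (IB counters) -- all values are made up.
    failed_job_nodes = {"node03", "node07"}
    routes = {("node03", "node07"): ["LB1", "FB1", "LB4"]}
    symbol_errors = {"FB1": 4212, "FB2": 0}

    for (src, dst), hops in routes.items():
        if {src, dst} & failed_job_nodes:
            noisy = [h for h in hops if symbol_errors.get(h, 0) > 0]
            if noisy:
                print("route %s<>%s crosses noisy hop(s): %s" % (src, dst, noisy))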
• IBPM: An Open-Source-Based Framework for InfiniBand Performance Monitoring (Michael Hoefling, Michael Menth, Christian Kniep, Marcus Camen) ❖ Background: InfiniBand (IB) is the state-of-the-art interconnect for high-performance computing data centers: point-to-point bidirectional links, high throughput (40 Gbit/s with QDR), low latency, dynamic on-line network reconfiguration ❖ Idea: extract raw network information from the IB network, analyze the output, and derive statistics about network performance ❖ Topology extraction: subnet discovery using ibnetdiscover produces a human-readable file of the network topology, which is processed into a graphical representation ❖ Remote counter readout: each port has its own set of performance counters measuring, e.g., transferred data, congestion, errors, link state changes ❖ ibsim-based network simulation: ibsim simulates an IB network and allows simple topology changes (GUI), but no performance simulation and no data-rate changes ❖ Real IB network: the physical network allows performance measurements with GUI-controlled traffic scenarios
• OpenSM ❖ The OpenSM Performance Manager sends a token to all ports ❖ All ports reply with metrics ❖ A callback is triggered for every reply ❖ osmeventplugin dumps the info to a file
• OpenSM PerfMgmt plugins ❖ qnib: sends metrics to RRDtool, events to PostgreSQL ❖ qnibng: sends metrics to Graphite, events to logstash
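The Graphite side that qnibng targets is just the plaintext protocol on carbon's TCP port 2003. A minimal Python sketch of the same idea; the host name and metric path are assumptions, not qnibng's actual output:

    import socket
    import time

    def send_metric(path, value, host="graphite", port=2003):
        # Graphite's plaintext listener accepts one
        # "<path> <value> <epoch>" line per metric.
        line = "%s %s %d\n" % (path, value, int(time.time()))
        sock = socket.create_connection((host, port))
        try:
            sock.sendall(line.encode("ascii"))
        finally:
            sock.close()

    # e.g. an IB port's transmit counter; path chosen for illustration
    send_metric("ib.switch1.port12.xmit_data", 123456789)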
• Graphite Events: port is up/down
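graphite-web can store such events explicitly through its /events/ endpoint, so port up/down markers can be drawn on top of the metric graphs. A hedged sketch; host, tags and payload are assumptions:

    import json
    import urllib.request

    event = {
        "what": "IB port down",             # short event title
        "tags": "infiniband port",          # space-separated tags
        "data": "switch1 port 12 -> down",  # free-form detail
    }
    req = urllib.request.Request(
        "http://graphite/events/",
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)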
• QNIBTerminal: Proof of Concept
• Cluster Stack Mock-Up ❖ IB events and metrics are not enough ❖ How to get real-world behavior? ❖ Wanted: Slurm (resource scheduler), MPI-enabled compute nodes, and as much additional cluster stack as possible (Graphite, elasticsearch/logstash/kibana, Icinga, cluster FS, …)
• Classical Virtualization ❖ Big overhead for a simple node ❖ Resources provisioned in advance ❖ Host resources allocated
• LXC (docker) ❖ Minimal overhead (a couple of MB) ❖ No resource pinning ❖ cgroups option ❖ Highly automatable. NOW: watch the OSDC 2014 talk 'Docker' by Tobias Schwab
• Virtual Cluster Nodes (all on one host) ❖ Master node (etcd, DNS, slurmctld) ❖ Monitoring (graphite + statsd) ❖ Log mgmt (ELK) ❖ Compute nodes (slurmd): compute0 … computeN ❖ Alarming (Icinga) [not integrated]
• Master Node ❖ Takes care of inventory (etcd) ❖ Provides DNS (+PTR) ❖ Integrate Rudder, Ansible, Chef, …?
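A hedged sketch of the inventory side, assuming etcd's v2 keys API on its 2014-era default port 4001; the key layout and addresses are illustrative, not the talk's actual scheme:

    import requests

    def register_node(name, ip, etcd="http://master:4001"):
        # PUT /v2/keys/<key> with value=<ip> creates or updates the entry
        url = "%s/v2/keys/cluster/nodes/%s" % (etcd, name)
        resp = requests.put(url, data={"value": ip})
        resp.raise_for_status()

    register_node("compute0", "172.17.0.10")  # placeholder address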
• Non-Master Nodes (in general) ❖ Are started with the master as DNS ❖ Mount /scratch and /chome (sitting on SSDs) ❖ supervisord kicks in and starts services and setup scripts ❖ Send metrics to graphite and logs to logstash
• docker-compute ❖ slurmd ❖ sshd ❖ logstash-forwarder ❖ openmpi ❖ qperf
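Starting a handful of such compute containers is then one docker invocation per node. A sketch under stated assumptions: the qnib/compute image name and the master container's address are hypothetical:

    import subprocess

    MASTER_DNS = "172.17.0.2"  # assumed address of the master container

    for i in range(3):
        name = "compute%d" % i
        subprocess.check_call([
            "docker", "run", "-d",
            "--name", name,
            "-h", name,            # hostname inside the container
            "--dns", MASTER_DNS,   # resolve cluster names via the master
            "qnib/compute",        # hypothetical image name
        ])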
• docker-graphite (monitoring) ❖ Full graphite stack + statsd ❖ Stresses IO (<3 SSDs) ❖ Needs more care (optimize IO)
• docker-elk (Log Mgmt) ❖ elasticsearch, logstash, kibana ❖ Inputs: syslog, lumberjack ❖ Filters: none ❖ Outputs: elasticsearch
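Feeding the syslog input needs nothing beyond Python's standard library on the sending side. A minimal sketch; the logmgmt host name is an assumption, 514 is the conventional syslog port:

    import logging
    import logging.handlers

    log = logging.getLogger("compute0")
    log.setLevel(logging.INFO)
    # UDP syslog towards the ELK container's syslog input
    log.addHandler(logging.handlers.SysLogHandler(address=("logmgmt", 514)))

    log.info("slurmd started")  # ends up in elasticsearch via logstash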
• It’s alive!
• Start Compute Node
• Check Slurm Config
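That check can be automated against sinfo's node-oriented output; a sketch, with the expected node names as assumptions:

    import subprocess

    expected = {"compute0", "compute1", "compute2"}  # assumed names
    # -h: no header, -N: node-oriented, %N/%t: node name and state
    out = subprocess.check_output(["sinfo", "-h", "-N", "-o", "%N %t"])
    seen = {line.split()[0] for line in out.decode().splitlines() if line.strip()}
    print("missing from slurm:", sorted(expected - seen) or "none")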
• Run MPI-Job
• TCP benchmark
• QNIBTerminal: Future Work
• docker-icinga ❖ Icinga to provide a state-of-the-cluster overview ❖ Bundle with graphite/ELK ❖ No big deal… ❖ Is this going to scale?
• docker-(GlusterFS, Lustre) ❖ Cluster scratch FS to integrate with ❖ The need for kernel modules stalls the attempt ❖ Might be pushed into VirtualBox (vagrant)
• Humans ❖ How do SysOps/DevOps/Mgmt react to the changes? ❖ How do they adopt them? ❖ What about them is feared?
• Truckload of Big Data ❖ Events ❖ Metrics ❖ Interaction. Per-node metric paths:
node01.system.memory.usage 9
node13.system.memory.usage 14
node35.system.memory.usage 12
node95.system.memory.usage 11
Aggregating these by hand: target=sumSeries(node{01,13,35,95}.system.memory.usage)
With a job prefix (job1.node01.system.memory.usage, …) the same aggregation becomes: target=sumSeries(job1.*.system.memory.usage)
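The naming trick is the whole point: once the job ID leads the metric path, one wildcard replaces a hand-maintained node list. A small sketch of that convention:

    # build job-scoped metric paths instead of plain node paths
    def metric_path(job, node, metric="system.memory.usage"):
        return "%s.%s.%s" % (job, node, metric)

    for node in ["node01", "node13", "node35", "node95"]:
        print(metric_path("job1", node))

    # aggregation no longer needs the node list:
    print("target=sumSeries(job1.*.system.memory.usage)")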
• pipework / mininet ❖ Currently all containers are bound to the docker0 bridge ❖ Creating a topology with virtual/real switches would be nice ❖ A first iteration might use pipework ❖ A more complete one should use vSwitches (mininet?)
• Dockerfiles ❖ Only 3 images are fd20 (Fedora 20) based
• Questions? ❖ Picture credits:
p2: http://de.wikipedia.org/wiki/Datei:Audi_logo.svg, http://commons.wikimedia.org/wiki/File:Daimler_AG.svg, http://ffb.uni-lueneburg.de/20JahreFFB/
p4: https://www.flickr.com/photos/adeneko/4229090961
p6: cae t100, https://www.flickr.com/photos/losalamosnatlab/7422429706
p8: http://www.brendangregg.com/Slides/SCaLE_Linux_Performance2013.pdf
p9: https://www.flickr.com/photos/riafoge/6796129047
p10: https://www.flickr.com/photos/119364768@N03/12928685224/
p11: http://www.mellanox.com/page/products_dyn?product_family=74
p23: https://www.flickr.com/photos/jaxport/3077543062
p25/26: https://blog.trifork.com/2013/08/08/next-step-in-virtualization-docker-lightweight-containers/
p33: https://www.flickr.com/photos/fkehren/5139094564
p39: https://www.flickr.com/photos/brizzlebornandbred/12852909293