SlideShare a Scribd company logo
©Bull 2012
Overlay HPC Information
1
Christian Kniep
R&D HPC Engineer
2014-06-25
©Bull 2014
About Me
2
‣ 10y+ SysAdmin
‣ 8y+ SysOps
‣ B.Sc. (2008-2011)
‣ 6y+ DevOps
‣ 1y+ R&D
- @CQnib
- http://blog.qnib.org
- https://github.com/ChristianKniep
©Bull 2014
My ISC History - Motivation
3
©Bull 2014
My ISC History - Description
4
©Bull 2014
HPC Software Stack (rough estimate)
5
Hardware:! ! HW-sensors/-errors
OS:! ! ! Kernel, Userland tools
MiddleWare:! MPI, ISV-libs
Software:! ! End user application
Excel:!! ! KPI, SLA
MgmtSysOps
SysOpsMgmt
User
PowerUser/ISV
ISVMgmt
Services:! ! Storage, Job Scheduler
HW
©Bull 2014
HPC Software Stack (goal)
6
Hardware:! ! HW-sensors/-errors
OS:! ! ! Kernel, Userland tools
MiddleWare:! MPI, ISV-libs
Software:! ! End user application
Excel:!! ! KPI, SLA
Services:! ! Storage, Job Scheduler
Log/Events
Perf
©Bull 2012
QNIBTerminal - History
7
!
!
!
• Created my own
!
!
• No useful tools in sight
©Bull 2014
QNIB
8
‣ Cluster of n*1000+ IB nodes
• Hard to debug
!
!
!
• Created my own - Graphite-Update in late 2013
!
!
• No useful tools in sight
©Bull 2014
QNIB
9
‣ Cluster of n*1000+ IB nodes
• Hard to debug
©Bull 2014
Achieved HPC Software Stack
10
Hardware:! ! IB-sensors/-errors
OS:! ! ! Kernel, Userland tools
MiddleWare:! MPI, ISV-libs
Software:! ! End user application
Excel:!! ! KPI, SLA
Services:! ! Storage, Job Scheduler
Log/Events
Perf
©Bull 2012
QNIBTerminal - Implementation
11
©Bull 2014
QNIBTerminal -blog.qnib.org
12
haproxy haproxy
dns
helixdns
elk
kibana
logstash
etcd
carboncarbon
graphite-webgraphite-web
graphite-apigraphite-api
grafanagrafana
slurmctldslurmctld
compute0slurmd
compute<N>slurmd
Log/Events
Services Performance
Compute
elasticsearch
©Bull 2012
DEMONSTRATION
13
©Bull 2012
Future Work
14
©Bull 2014
More Services
15
‣ Improve work-flow for log-events
‣ Nagios(-like) node is missing
‣ Cluster-FileSystem
‣ LDAP
‣ Additional dashboards
‣ Inventory
‣ using InfiniBand for communication traffic
©Bull 2014
Graph Representation
16
‣ Graph inventory needed
• Hierarchical view is not enough
©Bull 2014
Graph Representation
17
!
!
• GraphDB seems to be a good idea
comp0 comp1 comp2
ibsw0
eth1
eth10
ldap12
lustre0
ibsw2
‣ Graph inventory needed
• Hierarchical view is not enough
RETURN node=comp* WHERE 
ROUTE TO lustre_service INCLUDES ibsw2
©Bull 2012
Conclusion
18
!
!
!
!
!
!
!
!
!
!
‣ Training
• New SysOps could start on virtual cluster
• ‚Strangulate’ node to replay an error.
!
!
!
!
!
!
!
‣ Showcase
• Showing a customer his (to-be) software stack
• Convince the SysOps-Team ‚they have nothing to fear‘
!
!
!
‣ complete toolchain could be automated
• Testing
• Verification
• Q&A
!
!
‣ n*1000 containers through clustering
©Bull 2014
Conclusion
19
‣ n*100 of containers are easy (50 on my laptop)
• Running a 300 node cluster stack
©Bull 2014
Log AND Performance Management
20
‣ Metric w/o Logs are useless!
©Bull 2014
Log AND Performance Management
21
‣ Metric w/o Logs are useless!!
• and the other way around…
©Bull 2014
Log AND Performance Management
22
!
!
• overlapping is king
‣ Metric w/o Logs are useless!!
• and the other way around…
©Bull 2012 23

More Related Content

What's hot

mabl's Machine Learning Implementation on Google Cloud Platform
mabl's Machine Learning Implementation on Google Cloud Platformmabl's Machine Learning Implementation on Google Cloud Platform
mabl's Machine Learning Implementation on Google Cloud Platform
Joseph Lust
 
Replacing Rails asset pipeline with Gulp
Replacing Rails asset pipeline with GulpReplacing Rails asset pipeline with Gulp
Replacing Rails asset pipeline with Gulp
Tomasz Bak
 
Taipei City Bike Prediction
Taipei City Bike PredictionTaipei City Bike Prediction
Taipei City Bike Predictionyaoch29
 
End to-end test automation at scale
End to-end test automation at scaleEnd to-end test automation at scale
End to-end test automation at scale
mabl
 
Дмитрий Лемешко. Comaqa Spring 2018. Continuous mobile automation in build pi...
Дмитрий Лемешко. Comaqa Spring 2018. Continuous mobile automation in build pi...Дмитрий Лемешко. Comaqa Spring 2018. Continuous mobile automation in build pi...
Дмитрий Лемешко. Comaqa Spring 2018. Continuous mobile automation in build pi...
COMAQA.BY
 
Advanced Spark and Tensorflow Meetup - London - Nov 15, 2016 - Deploy Spark M...
Advanced Spark and Tensorflow Meetup - London - Nov 15, 2016 - Deploy Spark M...Advanced Spark and Tensorflow Meetup - London - Nov 15, 2016 - Deploy Spark M...
Advanced Spark and Tensorflow Meetup - London - Nov 15, 2016 - Deploy Spark M...
Chris Fregly
 
The State of the Developer Ecosystem - .NET Conf Barcelona 2018
The State of the Developer Ecosystem - .NET Conf Barcelona 2018The State of the Developer Ecosystem - .NET Conf Barcelona 2018
The State of the Developer Ecosystem - .NET Conf Barcelona 2018
Carlos Mendible
 
Balkan - data eng meetup - data fusion
Balkan - data eng meetup - data fusionBalkan - data eng meetup - data fusion
Balkan - data eng meetup - data fusion
Balkan Misirli
 
Transition to Infrastructure as Code
Transition to Infrastructure as CodeTransition to Infrastructure as Code
Transition to Infrastructure as Code
Wise Engineering
 
Making Angular2 lean and Fast
Making Angular2 lean and FastMaking Angular2 lean and Fast
Making Angular2 lean and Fast
Vinci Rufus
 
Front end architecture patterns
Front end architecture patternsFront end architecture patterns
Front end architecture patterns
Oleksandr Tryshchenko
 
Continuous Delivery with Cloud Foundry
Continuous Delivery with Cloud FoundryContinuous Delivery with Cloud Foundry
Continuous Delivery with Cloud FoundryPlatform CF
 
GitLab Product update July 25
GitLab Product update July 25GitLab Product update July 25
GitLab Product update July 25
🌍 Job van der Voort
 
From AWS Data Pipeline to Airflow - managing data pipelines in Nielsen Market...
From AWS Data Pipeline to Airflow - managing data pipelines in Nielsen Market...From AWS Data Pipeline to Airflow - managing data pipelines in Nielsen Market...
From AWS Data Pipeline to Airflow - managing data pipelines in Nielsen Market...
Itai Yaffe
 
Alieaters#8 アリババクラウドで
サーバレス頑張って
モバイルアプリを作った
Alieaters#8 アリババクラウドで
サーバレス頑張って
モバイルアプリを作ったAlieaters#8 アリババクラウドで
サーバレス頑張って
モバイルアプリを作った
Alieaters#8 アリババクラウドで
サーバレス頑張って
モバイルアプリを作った
Ryota Suenaga
 
ArcGIS 10.1 Upgrade Cowlitz PUD & SSP Innovations
ArcGIS 10.1 Upgrade Cowlitz PUD & SSP InnovationsArcGIS 10.1 Upgrade Cowlitz PUD & SSP Innovations
ArcGIS 10.1 Upgrade Cowlitz PUD & SSP Innovations
SSP Innovations
 
Released WEBridge 4 SAP R 3 on 9/9 of 2014
Released WEBridge 4 SAP R 3 on 9/9 of 2014Released WEBridge 4 SAP R 3 on 9/9 of 2014
Released WEBridge 4 SAP R 3 on 9/9 of 2014
Gandhavalla Informatics Pvt Ltd.
 
Serverless
ServerlessServerless
Serverless
Natan Deitch
 

What's hot (20)

mabl's Machine Learning Implementation on Google Cloud Platform
mabl's Machine Learning Implementation on Google Cloud Platformmabl's Machine Learning Implementation on Google Cloud Platform
mabl's Machine Learning Implementation on Google Cloud Platform
 
Replacing Rails asset pipeline with Gulp
Replacing Rails asset pipeline with GulpReplacing Rails asset pipeline with Gulp
Replacing Rails asset pipeline with Gulp
 
Taipei City Bike Prediction
Taipei City Bike PredictionTaipei City Bike Prediction
Taipei City Bike Prediction
 
End to-end test automation at scale
End to-end test automation at scaleEnd to-end test automation at scale
End to-end test automation at scale
 
Дмитрий Лемешко. Comaqa Spring 2018. Continuous mobile automation in build pi...
Дмитрий Лемешко. Comaqa Spring 2018. Continuous mobile automation in build pi...Дмитрий Лемешко. Comaqa Spring 2018. Continuous mobile automation in build pi...
Дмитрий Лемешко. Comaqa Spring 2018. Continuous mobile automation in build pi...
 
Advanced Spark and Tensorflow Meetup - London - Nov 15, 2016 - Deploy Spark M...
Advanced Spark and Tensorflow Meetup - London - Nov 15, 2016 - Deploy Spark M...Advanced Spark and Tensorflow Meetup - London - Nov 15, 2016 - Deploy Spark M...
Advanced Spark and Tensorflow Meetup - London - Nov 15, 2016 - Deploy Spark M...
 
Go
GoGo
Go
 
The State of the Developer Ecosystem - .NET Conf Barcelona 2018
The State of the Developer Ecosystem - .NET Conf Barcelona 2018The State of the Developer Ecosystem - .NET Conf Barcelona 2018
The State of the Developer Ecosystem - .NET Conf Barcelona 2018
 
Balkan - data eng meetup - data fusion
Balkan - data eng meetup - data fusionBalkan - data eng meetup - data fusion
Balkan - data eng meetup - data fusion
 
Transition to Infrastructure as Code
Transition to Infrastructure as CodeTransition to Infrastructure as Code
Transition to Infrastructure as Code
 
Making Angular2 lean and Fast
Making Angular2 lean and FastMaking Angular2 lean and Fast
Making Angular2 lean and Fast
 
Front end architecture patterns
Front end architecture patternsFront end architecture patterns
Front end architecture patterns
 
Continuous Delivery with Cloud Foundry
Continuous Delivery with Cloud FoundryContinuous Delivery with Cloud Foundry
Continuous Delivery with Cloud Foundry
 
GitLab Product update July 25
GitLab Product update July 25GitLab Product update July 25
GitLab Product update July 25
 
From AWS Data Pipeline to Airflow - managing data pipelines in Nielsen Market...
From AWS Data Pipeline to Airflow - managing data pipelines in Nielsen Market...From AWS Data Pipeline to Airflow - managing data pipelines in Nielsen Market...
From AWS Data Pipeline to Airflow - managing data pipelines in Nielsen Market...
 
Alieaters#8 アリババクラウドで
サーバレス頑張って
モバイルアプリを作った
Alieaters#8 アリババクラウドで
サーバレス頑張って
モバイルアプリを作ったAlieaters#8 アリババクラウドで
サーバレス頑張って
モバイルアプリを作った
Alieaters#8 アリババクラウドで
サーバレス頑張って
モバイルアプリを作った
 
ArcGIS 10.1 Upgrade Cowlitz PUD & SSP Innovations
ArcGIS 10.1 Upgrade Cowlitz PUD & SSP InnovationsArcGIS 10.1 Upgrade Cowlitz PUD & SSP Innovations
ArcGIS 10.1 Upgrade Cowlitz PUD & SSP Innovations
 
Big datalab
Big datalabBig datalab
Big datalab
 
Released WEBridge 4 SAP R 3 on 9/9 of 2014
Released WEBridge 4 SAP R 3 on 9/9 of 2014Released WEBridge 4 SAP R 3 on 9/9 of 2014
Released WEBridge 4 SAP R 3 on 9/9 of 2014
 
Serverless
ServerlessServerless
Serverless
 

Similar to Overlay HPC Information

Deploying Machine Learning in production without servers - #serverlessCPH
Deploying Machine Learning in production without servers - #serverlessCPHDeploying Machine Learning in production without servers - #serverlessCPH
Deploying Machine Learning in production without servers - #serverlessCPH
Damien Cavaillès
 
OSDC 2014: Christian Kniep - Understand your data center by overlaying multi...
OSDC 2014: Christian Kniep -  Understand your data center by overlaying multi...OSDC 2014: Christian Kniep -  Understand your data center by overlaying multi...
OSDC 2014: Christian Kniep - Understand your data center by overlaying multi...
NETWAYS
 
Building Enterprise OLAP on Hadoop for FSI
Building Enterprise OLAP on Hadoop for FSIBuilding Enterprise OLAP on Hadoop for FSI
Building Enterprise OLAP on Hadoop for FSI
Luke Han
 
SkyBase - a Devops Platform for Hybrid Cloud
SkyBase - a Devops Platform for Hybrid CloudSkyBase - a Devops Platform for Hybrid Cloud
SkyBase - a Devops Platform for Hybrid Cloud
Vlad Kuusk
 
Hadoop and subsystems in livedoor #Hcj11f
Hadoop and subsystems in livedoor #Hcj11fHadoop and subsystems in livedoor #Hcj11f
Hadoop and subsystems in livedoor #Hcj11fSATOSHI TAGOMORI
 
SAP #BOBJ #BI 4.1 Upgrade Webcast Series 3: BI 4.1 Sizing and Virtualization
SAP #BOBJ #BI 4.1 Upgrade Webcast Series 3: BI 4.1 Sizing and VirtualizationSAP #BOBJ #BI 4.1 Upgrade Webcast Series 3: BI 4.1 Sizing and Virtualization
SAP #BOBJ #BI 4.1 Upgrade Webcast Series 3: BI 4.1 Sizing and Virtualization
SAP Analytics
 
Why use Gitlab
Why use GitlabWhy use Gitlab
Why use Gitlab
abenyeung1
 
Why and How to Run Your Own Gitlab Runners as Your Company Grows
Why and How to Run Your Own Gitlab Runners as Your Company GrowsWhy and How to Run Your Own Gitlab Runners as Your Company Grows
Why and How to Run Your Own Gitlab Runners as Your Company Grows
NGINX, Inc.
 
Denver Cloud Foundry Meetup - February 2016
Denver Cloud Foundry Meetup - February 2016Denver Cloud Foundry Meetup - February 2016
Denver Cloud Foundry Meetup - February 2016
Josh Ghiloni
 
To be or not to be serverless
To be or not to be serverlessTo be or not to be serverless
To be or not to be serverless
Steve Houël
 
Aws, play! couch db scaling soa in the cloud
Aws, play! couch db  scaling soa in the cloudAws, play! couch db  scaling soa in the cloud
Aws, play! couch db scaling soa in the cloud
Christophe Marchal
 
KoprowskiT_SQLRelay2014#2_Southampton_MaintenancePlansForBeginners
KoprowskiT_SQLRelay2014#2_Southampton_MaintenancePlansForBeginnersKoprowskiT_SQLRelay2014#2_Southampton_MaintenancePlansForBeginners
KoprowskiT_SQLRelay2014#2_Southampton_MaintenancePlansForBeginners
Tobias Koprowski
 
re:Invent re:cap 2020
re:Invent re:cap 2020re:Invent re:cap 2020
re:Invent re:cap 2020
Kaushik Mohanraj
 
Multitenancy At Bloomberg - HBase and Oozie
Multitenancy At Bloomberg - HBase and OozieMultitenancy At Bloomberg - HBase and Oozie
Multitenancy At Bloomberg - HBase and Oozie
DataWorks Summit
 
Self-Service BI for big data applications using Apache Drill (Big Data Amster...
Self-Service BI for big data applications using Apache Drill (Big Data Amster...Self-Service BI for big data applications using Apache Drill (Big Data Amster...
Self-Service BI for big data applications using Apache Drill (Big Data Amster...
Dataconomy Media
 
Self-Service BI for big data applications using Apache Drill (Big Data Amster...
Self-Service BI for big data applications using Apache Drill (Big Data Amster...Self-Service BI for big data applications using Apache Drill (Big Data Amster...
Self-Service BI for big data applications using Apache Drill (Big Data Amster...
Mats Uddenfeldt
 
Berlin AWS meetup: here.com on AWS
Berlin AWS meetup: here.com on AWSBerlin AWS meetup: here.com on AWS
Berlin AWS meetup: here.com on AWS
Cristian Măgherușan-Stanciu
 
Scaling systems for research computing
Scaling systems for research computingScaling systems for research computing
Scaling systems for research computingThe BioTeam Inc.
 
Lambda Architectures in Practice
Lambda Architectures in PracticeLambda Architectures in Practice
Lambda Architectures in Practice
C4Media
 

Similar to Overlay HPC Information (20)

Deploying Machine Learning in production without servers - #serverlessCPH
Deploying Machine Learning in production without servers - #serverlessCPHDeploying Machine Learning in production without servers - #serverlessCPH
Deploying Machine Learning in production without servers - #serverlessCPH
 
Hive et Hadoop Usage chez Square
Hive et Hadoop Usage chez SquareHive et Hadoop Usage chez Square
Hive et Hadoop Usage chez Square
 
OSDC 2014: Christian Kniep - Understand your data center by overlaying multi...
OSDC 2014: Christian Kniep -  Understand your data center by overlaying multi...OSDC 2014: Christian Kniep -  Understand your data center by overlaying multi...
OSDC 2014: Christian Kniep - Understand your data center by overlaying multi...
 
Building Enterprise OLAP on Hadoop for FSI
Building Enterprise OLAP on Hadoop for FSIBuilding Enterprise OLAP on Hadoop for FSI
Building Enterprise OLAP on Hadoop for FSI
 
SkyBase - a Devops Platform for Hybrid Cloud
SkyBase - a Devops Platform for Hybrid CloudSkyBase - a Devops Platform for Hybrid Cloud
SkyBase - a Devops Platform for Hybrid Cloud
 
Hadoop and subsystems in livedoor #Hcj11f
Hadoop and subsystems in livedoor #Hcj11fHadoop and subsystems in livedoor #Hcj11f
Hadoop and subsystems in livedoor #Hcj11f
 
SAP #BOBJ #BI 4.1 Upgrade Webcast Series 3: BI 4.1 Sizing and Virtualization
SAP #BOBJ #BI 4.1 Upgrade Webcast Series 3: BI 4.1 Sizing and VirtualizationSAP #BOBJ #BI 4.1 Upgrade Webcast Series 3: BI 4.1 Sizing and Virtualization
SAP #BOBJ #BI 4.1 Upgrade Webcast Series 3: BI 4.1 Sizing and Virtualization
 
Why use Gitlab
Why use GitlabWhy use Gitlab
Why use Gitlab
 
Why and How to Run Your Own Gitlab Runners as Your Company Grows
Why and How to Run Your Own Gitlab Runners as Your Company GrowsWhy and How to Run Your Own Gitlab Runners as Your Company Grows
Why and How to Run Your Own Gitlab Runners as Your Company Grows
 
Denver Cloud Foundry Meetup - February 2016
Denver Cloud Foundry Meetup - February 2016Denver Cloud Foundry Meetup - February 2016
Denver Cloud Foundry Meetup - February 2016
 
To be or not to be serverless
To be or not to be serverlessTo be or not to be serverless
To be or not to be serverless
 
Aws, play! couch db scaling soa in the cloud
Aws, play! couch db  scaling soa in the cloudAws, play! couch db  scaling soa in the cloud
Aws, play! couch db scaling soa in the cloud
 
KoprowskiT_SQLRelay2014#2_Southampton_MaintenancePlansForBeginners
KoprowskiT_SQLRelay2014#2_Southampton_MaintenancePlansForBeginnersKoprowskiT_SQLRelay2014#2_Southampton_MaintenancePlansForBeginners
KoprowskiT_SQLRelay2014#2_Southampton_MaintenancePlansForBeginners
 
re:Invent re:cap 2020
re:Invent re:cap 2020re:Invent re:cap 2020
re:Invent re:cap 2020
 
Multitenancy At Bloomberg - HBase and Oozie
Multitenancy At Bloomberg - HBase and OozieMultitenancy At Bloomberg - HBase and Oozie
Multitenancy At Bloomberg - HBase and Oozie
 
Self-Service BI for big data applications using Apache Drill (Big Data Amster...
Self-Service BI for big data applications using Apache Drill (Big Data Amster...Self-Service BI for big data applications using Apache Drill (Big Data Amster...
Self-Service BI for big data applications using Apache Drill (Big Data Amster...
 
Self-Service BI for big data applications using Apache Drill (Big Data Amster...
Self-Service BI for big data applications using Apache Drill (Big Data Amster...Self-Service BI for big data applications using Apache Drill (Big Data Amster...
Self-Service BI for big data applications using Apache Drill (Big Data Amster...
 
Berlin AWS meetup: here.com on AWS
Berlin AWS meetup: here.com on AWSBerlin AWS meetup: here.com on AWS
Berlin AWS meetup: here.com on AWS
 
Scaling systems for research computing
Scaling systems for research computingScaling systems for research computing
Scaling systems for research computing
 
Lambda Architectures in Practice
Lambda Architectures in PracticeLambda Architectures in Practice
Lambda Architectures in Practice
 

More from inside-BigData.com

Major Market Shifts in IT
Major Market Shifts in ITMajor Market Shifts in IT
Major Market Shifts in IT
inside-BigData.com
 
Preparing to program Aurora at Exascale - Early experiences and future direct...
Preparing to program Aurora at Exascale - Early experiences and future direct...Preparing to program Aurora at Exascale - Early experiences and future direct...
Preparing to program Aurora at Exascale - Early experiences and future direct...
inside-BigData.com
 
Transforming Private 5G Networks
Transforming Private 5G NetworksTransforming Private 5G Networks
Transforming Private 5G Networks
inside-BigData.com
 
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
inside-BigData.com
 
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
inside-BigData.com
 
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
inside-BigData.com
 
HPC Impact: EDA Telemetry Neural Networks
HPC Impact: EDA Telemetry Neural NetworksHPC Impact: EDA Telemetry Neural Networks
HPC Impact: EDA Telemetry Neural Networks
inside-BigData.com
 
Biohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
Biohybrid Robotic Jellyfish for Future Applications in Ocean MonitoringBiohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
Biohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
inside-BigData.com
 
Machine Learning for Weather Forecasts
Machine Learning for Weather ForecastsMachine Learning for Weather Forecasts
Machine Learning for Weather Forecasts
inside-BigData.com
 
HPC AI Advisory Council Update
HPC AI Advisory Council UpdateHPC AI Advisory Council Update
HPC AI Advisory Council Update
inside-BigData.com
 
Fugaku Supercomputer joins fight against COVID-19
Fugaku Supercomputer joins fight against COVID-19Fugaku Supercomputer joins fight against COVID-19
Fugaku Supercomputer joins fight against COVID-19
inside-BigData.com
 
Energy Efficient Computing using Dynamic Tuning
Energy Efficient Computing using Dynamic TuningEnergy Efficient Computing using Dynamic Tuning
Energy Efficient Computing using Dynamic Tuning
inside-BigData.com
 
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPODHPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
inside-BigData.com
 
State of ARM-based HPC
State of ARM-based HPCState of ARM-based HPC
State of ARM-based HPC
inside-BigData.com
 
Versal Premium ACAP for Network and Cloud Acceleration
Versal Premium ACAP for Network and Cloud AccelerationVersal Premium ACAP for Network and Cloud Acceleration
Versal Premium ACAP for Network and Cloud Acceleration
inside-BigData.com
 
Zettar: Moving Massive Amounts of Data across Any Distance Efficiently
Zettar: Moving Massive Amounts of Data across Any Distance EfficientlyZettar: Moving Massive Amounts of Data across Any Distance Efficiently
Zettar: Moving Massive Amounts of Data across Any Distance Efficiently
inside-BigData.com
 
Scaling TCO in a Post Moore's Era
Scaling TCO in a Post Moore's EraScaling TCO in a Post Moore's Era
Scaling TCO in a Post Moore's Era
inside-BigData.com
 
CUDA-Python and RAPIDS for blazing fast scientific computing
CUDA-Python and RAPIDS for blazing fast scientific computingCUDA-Python and RAPIDS for blazing fast scientific computing
CUDA-Python and RAPIDS for blazing fast scientific computing
inside-BigData.com
 
Introducing HPC with a Raspberry Pi Cluster
Introducing HPC with a Raspberry Pi ClusterIntroducing HPC with a Raspberry Pi Cluster
Introducing HPC with a Raspberry Pi Cluster
inside-BigData.com
 
Overview of HPC Interconnects
Overview of HPC InterconnectsOverview of HPC Interconnects
Overview of HPC Interconnects
inside-BigData.com
 

More from inside-BigData.com (20)

Major Market Shifts in IT
Major Market Shifts in ITMajor Market Shifts in IT
Major Market Shifts in IT
 
Preparing to program Aurora at Exascale - Early experiences and future direct...
Preparing to program Aurora at Exascale - Early experiences and future direct...Preparing to program Aurora at Exascale - Early experiences and future direct...
Preparing to program Aurora at Exascale - Early experiences and future direct...
 
Transforming Private 5G Networks
Transforming Private 5G NetworksTransforming Private 5G Networks
Transforming Private 5G Networks
 
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
 
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
 
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
 
HPC Impact: EDA Telemetry Neural Networks
HPC Impact: EDA Telemetry Neural NetworksHPC Impact: EDA Telemetry Neural Networks
HPC Impact: EDA Telemetry Neural Networks
 
Biohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
Biohybrid Robotic Jellyfish for Future Applications in Ocean MonitoringBiohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
Biohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
 
Machine Learning for Weather Forecasts
Machine Learning for Weather ForecastsMachine Learning for Weather Forecasts
Machine Learning for Weather Forecasts
 
HPC AI Advisory Council Update
HPC AI Advisory Council UpdateHPC AI Advisory Council Update
HPC AI Advisory Council Update
 
Fugaku Supercomputer joins fight against COVID-19
Fugaku Supercomputer joins fight against COVID-19Fugaku Supercomputer joins fight against COVID-19
Fugaku Supercomputer joins fight against COVID-19
 
Energy Efficient Computing using Dynamic Tuning
Energy Efficient Computing using Dynamic TuningEnergy Efficient Computing using Dynamic Tuning
Energy Efficient Computing using Dynamic Tuning
 
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPODHPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
 
State of ARM-based HPC
State of ARM-based HPCState of ARM-based HPC
State of ARM-based HPC
 
Versal Premium ACAP for Network and Cloud Acceleration
Versal Premium ACAP for Network and Cloud AccelerationVersal Premium ACAP for Network and Cloud Acceleration
Versal Premium ACAP for Network and Cloud Acceleration
 
Zettar: Moving Massive Amounts of Data across Any Distance Efficiently
Zettar: Moving Massive Amounts of Data across Any Distance EfficientlyZettar: Moving Massive Amounts of Data across Any Distance Efficiently
Zettar: Moving Massive Amounts of Data across Any Distance Efficiently
 
Scaling TCO in a Post Moore's Era
Scaling TCO in a Post Moore's EraScaling TCO in a Post Moore's Era
Scaling TCO in a Post Moore's Era
 
CUDA-Python and RAPIDS for blazing fast scientific computing
CUDA-Python and RAPIDS for blazing fast scientific computingCUDA-Python and RAPIDS for blazing fast scientific computing
CUDA-Python and RAPIDS for blazing fast scientific computing
 
Introducing HPC with a Raspberry Pi Cluster
Introducing HPC with a Raspberry Pi ClusterIntroducing HPC with a Raspberry Pi Cluster
Introducing HPC with a Raspberry Pi Cluster
 
Overview of HPC Interconnects
Overview of HPC InterconnectsOverview of HPC Interconnects
Overview of HPC Interconnects
 

Recently uploaded

Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
ThousandEyes
 
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
OnBoard
 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
DianaGray10
 
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Product School
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
Safe Software
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Product School
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
UiPathCommunity
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
James Anderson
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
DianaGray10
 
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Tobias Schneck
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
BookNet Canada
 
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Thierry Lestable
 
Knowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and backKnowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and back
Elena Simperl
 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
Sri Ambati
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
Elena Simperl
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
Thijs Feryn
 
"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi
Fwdays
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
Jemma Hussein Allen
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
Product School
 

Recently uploaded (20)

Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
 
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
 
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
 
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
 
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
 
Knowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and backKnowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and back
 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
 
"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
 

Overlay HPC Information

  • 1. ©Bull 2012 Overlay HPC Information 1 Christian Kniep R&D HPC Engineer 2014-06-25
  • 2. ©Bull 2014 About Me 2 ‣ 10y+ SysAdmin ‣ 8y+ SysOps ‣ B.Sc. (2008-2011) ‣ 6y+ DevOps ‣ 1y+ R&D - @CQnib - http://blog.qnib.org - https://github.com/ChristianKniep
  • 3. ©Bull 2014 My ISC History - Motivation 3
  • 4. ©Bull 2014 My ISC History - Description 4
  • 5. ©Bull 2014 HPC Software Stack (rough estimate) 5 Hardware:! ! HW-sensors/-errors OS:! ! ! Kernel, Userland tools MiddleWare:! MPI, ISV-libs Software:! ! End user application Excel:!! ! KPI, SLA MgmtSysOps SysOpsMgmt User PowerUser/ISV ISVMgmt Services:! ! Storage, Job Scheduler HW
  • 6. ©Bull 2014 HPC Software Stack (goal) 6 Hardware:! ! HW-sensors/-errors OS:! ! ! Kernel, Userland tools MiddleWare:! MPI, ISV-libs Software:! ! End user application Excel:!! ! KPI, SLA Services:! ! Storage, Job Scheduler Log/Events Perf
  • 8. ! ! ! • Created my own ! ! • No useful tools in sight ©Bull 2014 QNIB 8 ‣ Cluster of n*1000+ IB nodes • Hard to debug
  • 9. ! ! ! • Created my own - Graphite-Update in late 2013 ! ! • No useful tools in sight ©Bull 2014 QNIB 9 ‣ Cluster of n*1000+ IB nodes • Hard to debug
  • 10. ©Bull 2014 Achieved HPC Software Stack 10 Hardware:! ! IB-sensors/-errors OS:! ! ! Kernel, Userland tools MiddleWare:! MPI, ISV-libs Software:! ! End user application Excel:!! ! KPI, SLA Services:! ! Storage, Job Scheduler Log/Events Perf
  • 11. ©Bull 2012 QNIBTerminal - Implementation 11
  • 12. ©Bull 2014 QNIBTerminal -blog.qnib.org 12 haproxy haproxy dns helixdns elk kibana logstash etcd carboncarbon graphite-webgraphite-web graphite-apigraphite-api grafanagrafana slurmctldslurmctld compute0slurmd compute<N>slurmd Log/Events Services Performance Compute elasticsearch
  • 15. ©Bull 2014 More Services 15 ‣ Improve work-flow for log-events ‣ Nagios(-like) node is missing ‣ Cluster-FileSystem ‣ LDAP ‣ Additional dashboards ‣ Inventory ‣ using InfiniBand for communication traffic
  • 16. ©Bull 2014 Graph Representation 16 ‣ Graph inventory needed • Hierarchical view is not enough
  • 17. ©Bull 2014 Graph Representation 17 ! ! • GraphDB seems to be a good idea comp0 comp1 comp2 ibsw0 eth1 eth10 ldap12 lustre0 ibsw2 ‣ Graph inventory needed • Hierarchical view is not enough RETURN node=comp* WHERE ROUTE TO lustre_service INCLUDES ibsw2
  • 19. ! ! ! ! ! ! ! ! ! ! ‣ Training • New SysOps could start on virtual cluster • ‚Strangulate’ node to replay an error. ! ! ! ! ! ! ! ‣ Showcase • Showing a customer his (to-be) software stack • Convince the SysOps-Team ‚they have nothing to fear‘ ! ! ! ‣ complete toolchain could be automated • Testing • Verification • Q&A ! ! ‣ n*1000 containers through clustering ©Bull 2014 Conclusion 19 ‣ n*100 of containers are easy (50 on my laptop) • Running a 300 node cluster stack
  • 20. ©Bull 2014 Log AND Performance Management 20 ‣ Metric w/o Logs are useless!
  • 21. ©Bull 2014 Log AND Performance Management 21 ‣ Metric w/o Logs are useless!! • and the other way around…
  • 22. ©Bull 2014 Log AND Performance Management 22 ! ! • overlapping is king ‣ Metric w/o Logs are useless!! • and the other way around…