SlideShare a Scribd company logo
ZABBIX FOR 

HPC MONITORING AND
SUPPORT
Mikhail Serkov
Delivery Manager/HPC Engineer
2016
CONFIDENTIAL 2
What’s next?
HPC monitoring – differences from classic support
model
How do we use Zabbix
AGENDA
Overview of the customer infrastructure and software
stack
High Performance Computing – what is it about#1
#4
#5
#2
#3
CONFIDENTIAL 3
• Scientific research in Pharma area:
Bioinformatics, Computational
Chemistry, Drug Discovery, etc.
• About 10k CPU cores used for a
scientific computation.
• Shared clusters - different workflows
could run simultaneously within the
same cluster.
• About 500 different scientific tools.
• Custom software ( Python, Java, R)
Novartis Institute For Biomedical Research (NIBR)
CONFIDENTIAL 4
• Hundreds or even thousands of computation nodes
• Grid Computing technologies and software ( SGE, UGE, SoGE, PBS, etc)
• Massive parallel computation across the nodes
• Strong requirements for all subsystems on hardware and software level ( storage, network,
power, OS )
• No magic. Linux boxes, shell scripts on a low level ☺
Example of a job submission:
HIGH PERFORMANCE COMPUTING
CONFIDENTIAL 5
OVERVIEW OF THE CUSTOMER INFRASTRUCTURE
250 GPU’s
70TB RAM
35-40KW/Rack
CONFIDENTIAL 6
• 28 CPU cores ( 2 sockets x 14 cores each )
− Intel(R) Xeon(R) CPU E5-2697 v3 @ 2.60GHz
• 200 GB RAM
• 10 GB Ethernet + InfiniBand interfaces
• 8 GPU cores ( 4 cards x 2 cores each )
• NFS over 10 GB Ehternet
• Lustre over InfiniBand
TYPICAL COMPUTATION NODE CONFIGURATION
CONFIDENTIAL 7
OVERVIEW OF SOFTWARE STACK
• More than 500 of scientific tools
• Bioinformatics, Computation
Chemistry, Xtallography, Molecular
Dynamics, etc
• RHEL6.5
• Univa Grid Engine
• Zabbix 2.4
CONFIDENTIAL 8
• We need information like ‘who, what, when’, not only system metrics.
• Users are allowed to run whatever they want using grid scheduler on the computation
nodes.
• 100% CPU utilization and 100% RAM utilization for node is perfectly fine.
• Node crash – not such a big deal.
• Preventing global issues by using aggregated metrics.
• Metrics not only for monitoring but for a performance analysis.
• Users are having access to the monitoring system ( but restricted ).
HPC MONITORING DIFFERENCES
CONFIDENTIAL 9
• Able to monitor of a huge systems with a lot of metrics
• Flexible
• Out of the box
• Ability to aggregate metrics
• API for a data extraction
• GUI convenient for both support team and scientists
• Autodiscovery
• New nodes automatic configuration
WHY ZABBIX?
CONFIDENTIAL 10
ZABBIX CONFIGURATION
Server configuration:
• 20 CPU cores ( 2 sockets x 10 cores each )
− Intel(R) Xeon(R) CPU E5-2697 v3 @ 2.60GHz
• 120 GB RAM
Number of hosts: 601
Number of items: ~200k
Number of triggers: ~37k
DB Size: 187GB
CONFIDENTIAL 11
WHAT DO WE MONITOR
Local metrics ( node level ) Global metrics ( cluster level )
All default Linux checks (LA, CPU utilization, RAM,
swap, etc) - agent
Meta CPU utilization – aggregation of CPU utilization
of HPC nodes.
Every single GPU core ( Temperature, Utilization if
possible) - agent
NFS global transmit/retransmit - aggregation of
nodes values
Every single CPU core ( Utilization, Temperature) -
agent
Grid specific – used/active slots, running jobs,
pending jobs, top users - external scripts
NFS shares availability / utilization / mount details -
agent
CPU/Memory oversubscription - aggregation of
nodes values
Slots / RAM reserved - external scripts Overloaded nodes - aggregation of HPC values
HPC jobs - external scripts Pending time - external scripts
... ....
CONFIDENTIAL 12
HPC specific examples
1) Expected utilization VS Real one
Every job has a resource request for number of CPUs, RAM, etc. In every moment we can compare real
utilization with an expected one. If they are not close, we need to investigate if someone oversubscribing
resources or overload nodes.
Solution: Zabbix not only checks current system metrics, but also keeps an expected values. If they
are too different we receive warning.
2) Users on a computation node
Users are not restricted to SSH to any node ( debugging, tracing job in real time, interactive jobs,
etc). However we should check if user has job on the node he is logged into.
Solution: We have a trigger that notify us if we have anyone logged on the node with no job running.
Additionally we store a list of logged in users for any single moment.
CONFIDENTIAL 13
HPC specific examples
Pending time probes
It is really hard to predict the pending time for any particular job in the pending list, as they all have different resource
requests, and runtimes. It is not a FIFO and the pending time is always related to resources user wants to have.
Solution: Zabbix runs ‘pending probes’ ( empty jobs) and checks how long does it take. This is a good indicator for
queue state at the moment.
CONFIDENTIAL 14
WHAT DO WE MONITOR: GLOBAL METRICS
Global cluster utilization
CONFIDENTIAL 15
WHAT DO WE MONITOR: GLOBAL METRICS
RAM oversubscription
CONFIDENTIAL 16
WHAT DO WE MONITOR: GLOBAL METRICS
CPU time oversubscription
CONFIDENTIAL 17
WHAT DO WE MONITOR: GLOBAL METRICS
Meta CPU utilization
CONFIDENTIAL 18
WHAT DO WE MONITOR: GLOBAL METRICS
Aggregated cluster status
CONFIDENTIAL 19
WHAT DO WE MONITOR: GLOBAL METRICS
Storage operational metrics
CONFIDENTIAL 20
WHAT DO WE MONITOR: LOCAL METRICS
CONFIDENTIAL 21
WHAT DO WE MONITOR: LOCAL METRICS
CONFIDENTIAL 22
USER ACCESS
We want to provide a limited amount of information to users. They don’t need any info about triggers and issues, but only metrics. We
have patched Zabbix to remove all unnecessary data for guest access.
After
Before
CONFIDENTIAL 23
Benefits
• Better understanding of a global issues on the cluster an reasons of why have they happened.
• Great performance indicators for other infrastructure teams ( especially Storage team )
• Performance tuning of a scientific workflows. Jobs profiling. In some cases information we cat get from
Zabbix is helping us to significantly improve performance of jobs.
• Proactive monitoring. With Zabbix it’s easier to understand if something is not right on the cluster or
with some job. In most cases we are able to prevent global cluster issues, or at least minimize an
impact.
• One monitoring system for clusters and HPC infrastructure.
• “All in one”. Lower efforts on support/maintain monitoring system(s).
CONFIDENTIAL 24
• Tight integration with Grid HPC software.
• Data analysis using external tools, but with Zabbix data source.
• Create a set of CLI utilities for getting Zabbix statistics in ‘human-readable’ format.
• Automation of jobs profiling using Zabbix API.
WHAT’S NEXT?
CONFIDENTIAL 25
Questions?

More Related Content

What's hot

実運用して分かったRabbit MQの良いところ・気をつけること #jjug
実運用して分かったRabbit MQの良いところ・気をつけること #jjug実運用して分かったRabbit MQの良いところ・気をつけること #jjug
実運用して分かったRabbit MQの良いところ・気をつけること #jjug
Yahoo!デベロッパーネットワーク
 
Introduction to Zabbix - Company, Product, Services and Use Cases
Introduction to Zabbix - Company, Product, Services and Use CasesIntroduction to Zabbix - Company, Product, Services and Use Cases
Introduction to Zabbix - Company, Product, Services and Use Cases
Zabbix
 
単なるキャッシュじゃないよ!?infinispanの紹介
単なるキャッシュじゃないよ!?infinispanの紹介単なるキャッシュじゃないよ!?infinispanの紹介
単なるキャッシュじゃないよ!?infinispanの紹介
AdvancedTechNight
 
「GraphDB徹底入門」〜構造や仕組み理解から使いどころ・種々のGraphDBの比較まで幅広く〜
「GraphDB徹底入門」〜構造や仕組み理解から使いどころ・種々のGraphDBの比較まで幅広く〜「GraphDB徹底入門」〜構造や仕組み理解から使いどころ・種々のGraphDBの比較まで幅広く〜
「GraphDB徹底入門」〜構造や仕組み理解から使いどころ・種々のGraphDBの比較まで幅広く〜Takahiro Inoue
 
3種類のTEE比較(Intel SGX, ARM TrustZone, RISC-V Keystone)
3種類のTEE比較(Intel SGX, ARM TrustZone, RISC-V Keystone)3種類のTEE比較(Intel SGX, ARM TrustZone, RISC-V Keystone)
3種類のTEE比較(Intel SGX, ARM TrustZone, RISC-V Keystone)
Kuniyasu Suzaki
 
MongoDBを用いたソーシャルアプリのログ解析 〜解析基盤構築からフロントUIまで、MongoDBを最大限に活用する〜
MongoDBを用いたソーシャルアプリのログ解析 〜解析基盤構築からフロントUIまで、MongoDBを最大限に活用する〜MongoDBを用いたソーシャルアプリのログ解析 〜解析基盤構築からフロントUIまで、MongoDBを最大限に活用する〜
MongoDBを用いたソーシャルアプリのログ解析 〜解析基盤構築からフロントUIまで、MongoDBを最大限に活用する〜Takahiro Inoue
 
Kubernetesによる機械学習基盤への挑戦
Kubernetesによる機械学習基盤への挑戦Kubernetesによる機械学習基盤への挑戦
Kubernetesによる機械学習基盤への挑戦
Preferred Networks
 
データプロダクトを支えるビッグデータ基盤
データプロダクトを支えるビッグデータ基盤データプロダクトを支えるビッグデータ基盤
データプロダクトを支えるビッグデータ基盤
Google Cloud Platform - Japan
 
Yahoo! JAPANのプライベートRDBクラウドとマルチライター型 MySQL #dbts2017 #dbtsOSS
Yahoo! JAPANのプライベートRDBクラウドとマルチライター型 MySQL #dbts2017 #dbtsOSSYahoo! JAPANのプライベートRDBクラウドとマルチライター型 MySQL #dbts2017 #dbtsOSS
Yahoo! JAPANのプライベートRDBクラウドとマルチライター型 MySQL #dbts2017 #dbtsOSS
Yahoo!デベロッパーネットワーク
 
Zabbix最新情報 ~Zabbix 6.0に向けて~ @OSC2021 Online/Fall
Zabbix最新情報 ~Zabbix 6.0に向けて~ @OSC2021 Online/FallZabbix最新情報 ~Zabbix 6.0に向けて~ @OSC2021 Online/Fall
Zabbix最新情報 ~Zabbix 6.0に向けて~ @OSC2021 Online/Fall
Atsushi Tanaka
 
Apache Pulsarの概要と近況
Apache Pulsarの概要と近況Apache Pulsarの概要と近況
Apache Pulsarの概要と近況
Yahoo!デベロッパーネットワーク
 
AWSのログ管理ベストプラクティス
AWSのログ管理ベストプラクティスAWSのログ管理ベストプラクティス
AWSのログ管理ベストプラクティス
Akihiro Kuwano
 
今さら聞けない人のためのGitLabの始め方 Ubuntu編
今さら聞けない人のためのGitLabの始め方 Ubuntu編今さら聞けない人のためのGitLabの始め方 Ubuntu編
今さら聞けない人のためのGitLabの始め方 Ubuntu編
VirtualTech Japan Inc./Begi.net Inc.
 
ストリーム処理を支えるキューイングシステムの選び方
ストリーム処理を支えるキューイングシステムの選び方ストリーム処理を支えるキューイングシステムの選び方
ストリーム処理を支えるキューイングシステムの選び方
Yoshiyasu SAEKI
 
PostgreSQL10を導入!大規模データ分析事例からみるDWHとしてのPostgreSQL活用のポイント
PostgreSQL10を導入!大規模データ分析事例からみるDWHとしてのPostgreSQL活用のポイントPostgreSQL10を導入!大規模データ分析事例からみるDWHとしてのPostgreSQL活用のポイント
PostgreSQL10を導入!大規模データ分析事例からみるDWHとしてのPostgreSQL活用のポイント
NTT DATA OSS Professional Services
 
AWS X-Rayによるアプリケーションの分析とデバッグ
AWS X-Rayによるアプリケーションの分析とデバッグAWS X-Rayによるアプリケーションの分析とデバッグ
AWS X-Rayによるアプリケーションの分析とデバッグ
Amazon Web Services Japan
 
ヤフーのプライベートクラウドとクラウドエンジニアの業務について
ヤフーのプライベートクラウドとクラウドエンジニアの業務についてヤフーのプライベートクラウドとクラウドエンジニアの業務について
ヤフーのプライベートクラウドとクラウドエンジニアの業務について
Yahoo!デベロッパーネットワーク
 
PostgreSQL 12は ここがスゴイ! ~性能改善やpluggable storage engineなどの新機能を徹底解説~ (NTTデータ テクノ...
PostgreSQL 12は ここがスゴイ! ~性能改善やpluggable storage engineなどの新機能を徹底解説~ (NTTデータ テクノ...PostgreSQL 12は ここがスゴイ! ~性能改善やpluggable storage engineなどの新機能を徹底解説~ (NTTデータ テクノ...
PostgreSQL 12は ここがスゴイ! ~性能改善やpluggable storage engineなどの新機能を徹底解説~ (NTTデータ テクノ...
NTT DATA Technology & Innovation
 
明日から始める! ソフトウェアのグリーン化(GSF MeetUp Tokyo 発表資料)
明日から始める! ソフトウェアのグリーン化(GSF MeetUp Tokyo 発表資料)明日から始める! ソフトウェアのグリーン化(GSF MeetUp Tokyo 発表資料)
明日から始める! ソフトウェアのグリーン化(GSF MeetUp Tokyo 発表資料)
NTT DATA Technology & Innovation
 
NTTデータ流Infrastructure as Code~ 大規模プロジェクトを通して考え抜いた基盤自動化の新たな姿~(NTTデータ テクノロジーカンフ...
NTTデータ流Infrastructure as Code~ 大規模プロジェクトを通して考え抜いた基盤自動化の新たな姿~(NTTデータ テクノロジーカンフ...NTTデータ流Infrastructure as Code~ 大規模プロジェクトを通して考え抜いた基盤自動化の新たな姿~(NTTデータ テクノロジーカンフ...
NTTデータ流Infrastructure as Code~ 大規模プロジェクトを通して考え抜いた基盤自動化の新たな姿~(NTTデータ テクノロジーカンフ...
NTT DATA Technology & Innovation
 

What's hot (20)

実運用して分かったRabbit MQの良いところ・気をつけること #jjug
実運用して分かったRabbit MQの良いところ・気をつけること #jjug実運用して分かったRabbit MQの良いところ・気をつけること #jjug
実運用して分かったRabbit MQの良いところ・気をつけること #jjug
 
Introduction to Zabbix - Company, Product, Services and Use Cases
Introduction to Zabbix - Company, Product, Services and Use CasesIntroduction to Zabbix - Company, Product, Services and Use Cases
Introduction to Zabbix - Company, Product, Services and Use Cases
 
単なるキャッシュじゃないよ!?infinispanの紹介
単なるキャッシュじゃないよ!?infinispanの紹介単なるキャッシュじゃないよ!?infinispanの紹介
単なるキャッシュじゃないよ!?infinispanの紹介
 
「GraphDB徹底入門」〜構造や仕組み理解から使いどころ・種々のGraphDBの比較まで幅広く〜
「GraphDB徹底入門」〜構造や仕組み理解から使いどころ・種々のGraphDBの比較まで幅広く〜「GraphDB徹底入門」〜構造や仕組み理解から使いどころ・種々のGraphDBの比較まで幅広く〜
「GraphDB徹底入門」〜構造や仕組み理解から使いどころ・種々のGraphDBの比較まで幅広く〜
 
3種類のTEE比較(Intel SGX, ARM TrustZone, RISC-V Keystone)
3種類のTEE比較(Intel SGX, ARM TrustZone, RISC-V Keystone)3種類のTEE比較(Intel SGX, ARM TrustZone, RISC-V Keystone)
3種類のTEE比較(Intel SGX, ARM TrustZone, RISC-V Keystone)
 
MongoDBを用いたソーシャルアプリのログ解析 〜解析基盤構築からフロントUIまで、MongoDBを最大限に活用する〜
MongoDBを用いたソーシャルアプリのログ解析 〜解析基盤構築からフロントUIまで、MongoDBを最大限に活用する〜MongoDBを用いたソーシャルアプリのログ解析 〜解析基盤構築からフロントUIまで、MongoDBを最大限に活用する〜
MongoDBを用いたソーシャルアプリのログ解析 〜解析基盤構築からフロントUIまで、MongoDBを最大限に活用する〜
 
Kubernetesによる機械学習基盤への挑戦
Kubernetesによる機械学習基盤への挑戦Kubernetesによる機械学習基盤への挑戦
Kubernetesによる機械学習基盤への挑戦
 
データプロダクトを支えるビッグデータ基盤
データプロダクトを支えるビッグデータ基盤データプロダクトを支えるビッグデータ基盤
データプロダクトを支えるビッグデータ基盤
 
Yahoo! JAPANのプライベートRDBクラウドとマルチライター型 MySQL #dbts2017 #dbtsOSS
Yahoo! JAPANのプライベートRDBクラウドとマルチライター型 MySQL #dbts2017 #dbtsOSSYahoo! JAPANのプライベートRDBクラウドとマルチライター型 MySQL #dbts2017 #dbtsOSS
Yahoo! JAPANのプライベートRDBクラウドとマルチライター型 MySQL #dbts2017 #dbtsOSS
 
Zabbix最新情報 ~Zabbix 6.0に向けて~ @OSC2021 Online/Fall
Zabbix最新情報 ~Zabbix 6.0に向けて~ @OSC2021 Online/FallZabbix最新情報 ~Zabbix 6.0に向けて~ @OSC2021 Online/Fall
Zabbix最新情報 ~Zabbix 6.0に向けて~ @OSC2021 Online/Fall
 
Apache Pulsarの概要と近況
Apache Pulsarの概要と近況Apache Pulsarの概要と近況
Apache Pulsarの概要と近況
 
AWSのログ管理ベストプラクティス
AWSのログ管理ベストプラクティスAWSのログ管理ベストプラクティス
AWSのログ管理ベストプラクティス
 
今さら聞けない人のためのGitLabの始め方 Ubuntu編
今さら聞けない人のためのGitLabの始め方 Ubuntu編今さら聞けない人のためのGitLabの始め方 Ubuntu編
今さら聞けない人のためのGitLabの始め方 Ubuntu編
 
ストリーム処理を支えるキューイングシステムの選び方
ストリーム処理を支えるキューイングシステムの選び方ストリーム処理を支えるキューイングシステムの選び方
ストリーム処理を支えるキューイングシステムの選び方
 
PostgreSQL10を導入!大規模データ分析事例からみるDWHとしてのPostgreSQL活用のポイント
PostgreSQL10を導入!大規模データ分析事例からみるDWHとしてのPostgreSQL活用のポイントPostgreSQL10を導入!大規模データ分析事例からみるDWHとしてのPostgreSQL活用のポイント
PostgreSQL10を導入!大規模データ分析事例からみるDWHとしてのPostgreSQL活用のポイント
 
AWS X-Rayによるアプリケーションの分析とデバッグ
AWS X-Rayによるアプリケーションの分析とデバッグAWS X-Rayによるアプリケーションの分析とデバッグ
AWS X-Rayによるアプリケーションの分析とデバッグ
 
ヤフーのプライベートクラウドとクラウドエンジニアの業務について
ヤフーのプライベートクラウドとクラウドエンジニアの業務についてヤフーのプライベートクラウドとクラウドエンジニアの業務について
ヤフーのプライベートクラウドとクラウドエンジニアの業務について
 
PostgreSQL 12は ここがスゴイ! ~性能改善やpluggable storage engineなどの新機能を徹底解説~ (NTTデータ テクノ...
PostgreSQL 12は ここがスゴイ! ~性能改善やpluggable storage engineなどの新機能を徹底解説~ (NTTデータ テクノ...PostgreSQL 12は ここがスゴイ! ~性能改善やpluggable storage engineなどの新機能を徹底解説~ (NTTデータ テクノ...
PostgreSQL 12は ここがスゴイ! ~性能改善やpluggable storage engineなどの新機能を徹底解説~ (NTTデータ テクノ...
 
明日から始める! ソフトウェアのグリーン化(GSF MeetUp Tokyo 発表資料)
明日から始める! ソフトウェアのグリーン化(GSF MeetUp Tokyo 発表資料)明日から始める! ソフトウェアのグリーン化(GSF MeetUp Tokyo 発表資料)
明日から始める! ソフトウェアのグリーン化(GSF MeetUp Tokyo 発表資料)
 
NTTデータ流Infrastructure as Code~ 大規模プロジェクトを通して考え抜いた基盤自動化の新たな姿~(NTTデータ テクノロジーカンフ...
NTTデータ流Infrastructure as Code~ 大規模プロジェクトを通して考え抜いた基盤自動化の新たな姿~(NTTデータ テクノロジーカンフ...NTTデータ流Infrastructure as Code~ 大規模プロジェクトを通して考え抜いた基盤自動化の新たな姿~(NTTデータ テクノロジーカンフ...
NTTデータ流Infrastructure as Code~ 大規模プロジェクトを通して考え抜いた基盤自動化の新たな姿~(NTTデータ テクノロジーカンフ...
 

Viewers also liked

Dimitri Bellini and Pietro Antonacci - Manage Zabbix Proxies in Remote Networ...
Dimitri Bellini and Pietro Antonacci - Manage Zabbix Proxies in Remote Networ...Dimitri Bellini and Pietro Antonacci - Manage Zabbix Proxies in Remote Networ...
Dimitri Bellini and Pietro Antonacci - Manage Zabbix Proxies in Remote Networ...
Zabbix
 
Lukáš Malý - Log management ELISA controlled by Zabbix | ZabConf2016
Lukáš Malý - Log management ELISA controlled by Zabbix | ZabConf2016Lukáš Malý - Log management ELISA controlled by Zabbix | ZabConf2016
Lukáš Malý - Log management ELISA controlled by Zabbix | ZabConf2016
Zabbix
 
Rihards Olups - Zabbix at Nokia - Case Study
Rihards Olups - Zabbix at Nokia - Case StudyRihards Olups - Zabbix at Nokia - Case Study
Rihards Olups - Zabbix at Nokia - Case Study
Zabbix
 
Zabbix Conference LatAm 2016 - Jorge Pretel - Low Level Discovery for ODBC an...
Zabbix Conference LatAm 2016 - Jorge Pretel - Low Level Discovery for ODBC an...Zabbix Conference LatAm 2016 - Jorge Pretel - Low Level Discovery for ODBC an...
Zabbix Conference LatAm 2016 - Jorge Pretel - Low Level Discovery for ODBC an...
Zabbix
 
Raymond Kuiper - Zen and The Art of Zabbix Template Design | ZabConf2016
Raymond Kuiper - Zen and The Art of Zabbix Template Design | ZabConf2016Raymond Kuiper - Zen and The Art of Zabbix Template Design | ZabConf2016
Raymond Kuiper - Zen and The Art of Zabbix Template Design | ZabConf2016
Zabbix
 
Zabbix Conference LatAm 2016 - Daniel Nasiloski - Extending Zabbix - Interact...
Zabbix Conference LatAm 2016 - Daniel Nasiloski - Extending Zabbix - Interact...Zabbix Conference LatAm 2016 - Daniel Nasiloski - Extending Zabbix - Interact...
Zabbix Conference LatAm 2016 - Daniel Nasiloski - Extending Zabbix - Interact...
Zabbix
 
Zabbix Conference LatAm 2016 - Jessian Ferreira - Wireless with Zabbix
Zabbix Conference LatAm 2016 - Jessian Ferreira - Wireless with ZabbixZabbix Conference LatAm 2016 - Jessian Ferreira - Wireless with Zabbix
Zabbix Conference LatAm 2016 - Jessian Ferreira - Wireless with Zabbix
Zabbix
 
Alexei Vladishev - Zabbix - Monitoring Solution for Everyone
Alexei Vladishev - Zabbix - Monitoring Solution for EveryoneAlexei Vladishev - Zabbix - Monitoring Solution for Everyone
Alexei Vladishev - Zabbix - Monitoring Solution for Everyone
Zabbix
 
OpenStack Marketing Meeting Oct 2
OpenStack Marketing Meeting Oct 2OpenStack Marketing Meeting Oct 2
OpenStack Marketing Meeting Oct 2
OpenStack Foundation
 
Openstack高度自动化持续交付
Openstack高度自动化持续交付Openstack高度自动化持续交付
Openstack高度自动化持续交付
Bill Zhong Qibin
 
General Bare-metal Provisioning Framework.pdf
General Bare-metal Provisioning Framework.pdfGeneral Bare-metal Provisioning Framework.pdf
General Bare-metal Provisioning Framework.pdf
OpenStack Foundation
 
OpenStack Day CEE 2015: Real-World Use Cases
OpenStack Day CEE 2015: Real-World Use CasesOpenStack Day CEE 2015: Real-World Use Cases
OpenStack Day CEE 2015: Real-World Use Cases
OpenStack Foundation
 
Feedbackstr - Verbessern Sie Ihr Geschäft durch das Feedback Ihrer Kunden!
Feedbackstr - Verbessern Sie Ihr Geschäft  durch das Feedback Ihrer Kunden!Feedbackstr - Verbessern Sie Ihr Geschäft  durch das Feedback Ihrer Kunden!
Feedbackstr - Verbessern Sie Ihr Geschäft durch das Feedback Ihrer Kunden!
Spectos GmbH
 
Social Media: 4 Tipps für ein gutes Kundenfeedback
Social Media: 4 Tipps für ein gutes KundenfeedbackSocial Media: 4 Tipps für ein gutes Kundenfeedback
Social Media: 4 Tipps für ein gutes Kundenfeedback
TWT
 
Oleg Ivanivskyi - Lessons Learned While Being On-Site | ZabConf2016
Oleg Ivanivskyi - Lessons Learned While Being On-Site | ZabConf2016Oleg Ivanivskyi - Lessons Learned While Being On-Site | ZabConf2016
Oleg Ivanivskyi - Lessons Learned While Being On-Site | ZabConf2016
Zabbix
 
Vladimir Ulogov - Large Scale Simulation | ZabConf2016 Lightning Talk
Vladimir Ulogov - Large Scale Simulation | ZabConf2016 Lightning TalkVladimir Ulogov - Large Scale Simulation | ZabConf2016 Lightning Talk
Vladimir Ulogov - Large Scale Simulation | ZabConf2016 Lightning Talk
Zabbix
 
Inaba Kazuhiko - Ahiruyaki Zabbix in Japan Part 2 | ZabConf2016 Lightning Talk
Inaba Kazuhiko - Ahiruyaki Zabbix in Japan Part 2 | ZabConf2016 Lightning TalkInaba Kazuhiko - Ahiruyaki Zabbix in Japan Part 2 | ZabConf2016 Lightning Talk
Inaba Kazuhiko - Ahiruyaki Zabbix in Japan Part 2 | ZabConf2016 Lightning Talk
Zabbix
 
Rafael Martinez Guerrero Zabbix CLI | ZabConf2016 Lightning Talk
Rafael Martinez Guerrero Zabbix CLI | ZabConf2016 Lightning TalkRafael Martinez Guerrero Zabbix CLI | ZabConf2016 Lightning Talk
Rafael Martinez Guerrero Zabbix CLI | ZabConf2016 Lightning Talk
Zabbix
 
Wolfgang Alper - Zabbix Meets OPS Control / Rundeck | ZabConf2016
Wolfgang Alper - Zabbix Meets OPS Control / Rundeck | ZabConf2016Wolfgang Alper - Zabbix Meets OPS Control / Rundeck | ZabConf2016
Wolfgang Alper - Zabbix Meets OPS Control / Rundeck | ZabConf2016
Zabbix
 

Viewers also liked (20)

Dimitri Bellini and Pietro Antonacci - Manage Zabbix Proxies in Remote Networ...
Dimitri Bellini and Pietro Antonacci - Manage Zabbix Proxies in Remote Networ...Dimitri Bellini and Pietro Antonacci - Manage Zabbix Proxies in Remote Networ...
Dimitri Bellini and Pietro Antonacci - Manage Zabbix Proxies in Remote Networ...
 
Lukáš Malý - Log management ELISA controlled by Zabbix | ZabConf2016
Lukáš Malý - Log management ELISA controlled by Zabbix | ZabConf2016Lukáš Malý - Log management ELISA controlled by Zabbix | ZabConf2016
Lukáš Malý - Log management ELISA controlled by Zabbix | ZabConf2016
 
Rihards Olups - Zabbix at Nokia - Case Study
Rihards Olups - Zabbix at Nokia - Case StudyRihards Olups - Zabbix at Nokia - Case Study
Rihards Olups - Zabbix at Nokia - Case Study
 
Zabbix Conference LatAm 2016 - Jorge Pretel - Low Level Discovery for ODBC an...
Zabbix Conference LatAm 2016 - Jorge Pretel - Low Level Discovery for ODBC an...Zabbix Conference LatAm 2016 - Jorge Pretel - Low Level Discovery for ODBC an...
Zabbix Conference LatAm 2016 - Jorge Pretel - Low Level Discovery for ODBC an...
 
Raymond Kuiper - Zen and The Art of Zabbix Template Design | ZabConf2016
Raymond Kuiper - Zen and The Art of Zabbix Template Design | ZabConf2016Raymond Kuiper - Zen and The Art of Zabbix Template Design | ZabConf2016
Raymond Kuiper - Zen and The Art of Zabbix Template Design | ZabConf2016
 
Zabbix Conference LatAm 2016 - Daniel Nasiloski - Extending Zabbix - Interact...
Zabbix Conference LatAm 2016 - Daniel Nasiloski - Extending Zabbix - Interact...Zabbix Conference LatAm 2016 - Daniel Nasiloski - Extending Zabbix - Interact...
Zabbix Conference LatAm 2016 - Daniel Nasiloski - Extending Zabbix - Interact...
 
Zabbix Conference LatAm 2016 - Jessian Ferreira - Wireless with Zabbix
Zabbix Conference LatAm 2016 - Jessian Ferreira - Wireless with ZabbixZabbix Conference LatAm 2016 - Jessian Ferreira - Wireless with Zabbix
Zabbix Conference LatAm 2016 - Jessian Ferreira - Wireless with Zabbix
 
Alexei Vladishev - Zabbix - Monitoring Solution for Everyone
Alexei Vladishev - Zabbix - Monitoring Solution for EveryoneAlexei Vladishev - Zabbix - Monitoring Solution for Everyone
Alexei Vladishev - Zabbix - Monitoring Solution for Everyone
 
OpenStack Marketing Meeting Oct 2
OpenStack Marketing Meeting Oct 2OpenStack Marketing Meeting Oct 2
OpenStack Marketing Meeting Oct 2
 
Openstack高度自动化持续交付
Openstack高度自动化持续交付Openstack高度自动化持续交付
Openstack高度自动化持续交付
 
General Bare-metal Provisioning Framework.pdf
General Bare-metal Provisioning Framework.pdfGeneral Bare-metal Provisioning Framework.pdf
General Bare-metal Provisioning Framework.pdf
 
OpenStack Day CEE 2015: Real-World Use Cases
OpenStack Day CEE 2015: Real-World Use CasesOpenStack Day CEE 2015: Real-World Use Cases
OpenStack Day CEE 2015: Real-World Use Cases
 
Feedbackstr - Verbessern Sie Ihr Geschäft durch das Feedback Ihrer Kunden!
Feedbackstr - Verbessern Sie Ihr Geschäft  durch das Feedback Ihrer Kunden!Feedbackstr - Verbessern Sie Ihr Geschäft  durch das Feedback Ihrer Kunden!
Feedbackstr - Verbessern Sie Ihr Geschäft durch das Feedback Ihrer Kunden!
 
NICE Fizzback: Echtzeit-Kundenfeedback
NICE Fizzback: Echtzeit-KundenfeedbackNICE Fizzback: Echtzeit-Kundenfeedback
NICE Fizzback: Echtzeit-Kundenfeedback
 
Social Media: 4 Tipps für ein gutes Kundenfeedback
Social Media: 4 Tipps für ein gutes KundenfeedbackSocial Media: 4 Tipps für ein gutes Kundenfeedback
Social Media: 4 Tipps für ein gutes Kundenfeedback
 
Oleg Ivanivskyi - Lessons Learned While Being On-Site | ZabConf2016
Oleg Ivanivskyi - Lessons Learned While Being On-Site | ZabConf2016Oleg Ivanivskyi - Lessons Learned While Being On-Site | ZabConf2016
Oleg Ivanivskyi - Lessons Learned While Being On-Site | ZabConf2016
 
Vladimir Ulogov - Large Scale Simulation | ZabConf2016 Lightning Talk
Vladimir Ulogov - Large Scale Simulation | ZabConf2016 Lightning TalkVladimir Ulogov - Large Scale Simulation | ZabConf2016 Lightning Talk
Vladimir Ulogov - Large Scale Simulation | ZabConf2016 Lightning Talk
 
Inaba Kazuhiko - Ahiruyaki Zabbix in Japan Part 2 | ZabConf2016 Lightning Talk
Inaba Kazuhiko - Ahiruyaki Zabbix in Japan Part 2 | ZabConf2016 Lightning TalkInaba Kazuhiko - Ahiruyaki Zabbix in Japan Part 2 | ZabConf2016 Lightning Talk
Inaba Kazuhiko - Ahiruyaki Zabbix in Japan Part 2 | ZabConf2016 Lightning Talk
 
Rafael Martinez Guerrero Zabbix CLI | ZabConf2016 Lightning Talk
Rafael Martinez Guerrero Zabbix CLI | ZabConf2016 Lightning TalkRafael Martinez Guerrero Zabbix CLI | ZabConf2016 Lightning Talk
Rafael Martinez Guerrero Zabbix CLI | ZabConf2016 Lightning Talk
 
Wolfgang Alper - Zabbix Meets OPS Control / Rundeck | ZabConf2016
Wolfgang Alper - Zabbix Meets OPS Control / Rundeck | ZabConf2016Wolfgang Alper - Zabbix Meets OPS Control / Rundeck | ZabConf2016
Wolfgang Alper - Zabbix Meets OPS Control / Rundeck | ZabConf2016
 

Similar to Mikhail Serkov - Zabbix for HPC Cluster Support | ZabConf2016

Lessons learned from embedding Cassandra in xPatterns
Lessons learned from embedding Cassandra in xPatternsLessons learned from embedding Cassandra in xPatterns
Lessons learned from embedding Cassandra in xPatternsClaudiu Barbura
 
Leveraging Cassandra for real-time multi-datacenter public cloud analytics
Leveraging Cassandra for real-time multi-datacenter public cloud analyticsLeveraging Cassandra for real-time multi-datacenter public cloud analytics
Leveraging Cassandra for real-time multi-datacenter public cloud analytics
Julien Anguenot
 
iland Internet Solutions: Leveraging Cassandra for real-time multi-datacenter...
iland Internet Solutions: Leveraging Cassandra for real-time multi-datacenter...iland Internet Solutions: Leveraging Cassandra for real-time multi-datacenter...
iland Internet Solutions: Leveraging Cassandra for real-time multi-datacenter...
DataStax Academy
 
Monitoring federation open stack infrastructure
Monitoring federation open stack infrastructureMonitoring federation open stack infrastructure
Monitoring federation open stack infrastructure
Fernando Lopez Aguilar
 
Atmosphere 2016 - Pawel Mastalerz, Wojciech Inglot - New way of building inf...
Atmosphere 2016 -  Pawel Mastalerz, Wojciech Inglot - New way of building inf...Atmosphere 2016 -  Pawel Mastalerz, Wojciech Inglot - New way of building inf...
Atmosphere 2016 - Pawel Mastalerz, Wojciech Inglot - New way of building inf...
PROIDEA
 
Kubernetes @ Squarespace (SRE Portland Meetup October 2017)
Kubernetes @ Squarespace (SRE Portland Meetup October 2017)Kubernetes @ Squarespace (SRE Portland Meetup October 2017)
Kubernetes @ Squarespace (SRE Portland Meetup October 2017)
Kevin Lynch
 
Intro to open source telemetry linux con 2016
Intro to open source telemetry   linux con 2016Intro to open source telemetry   linux con 2016
Intro to open source telemetry linux con 2016
Matthew Broberg
 
HPC and cloud distributed computing, as a journey
HPC and cloud distributed computing, as a journeyHPC and cloud distributed computing, as a journey
HPC and cloud distributed computing, as a journey
Peter Clapham
 
The power of linux advanced tracer [POUG18]
The power of linux advanced tracer [POUG18]The power of linux advanced tracer [POUG18]
The power of linux advanced tracer [POUG18]
Mahmoud Hatem
 
OS for AI: Elastic Microservices & the Next Gen of ML
OS for AI: Elastic Microservices & the Next Gen of MLOS for AI: Elastic Microservices & the Next Gen of ML
OS for AI: Elastic Microservices & the Next Gen of ML
Nordic APIs
 
The state of Hive and Spark in the Cloud (July 2017)
The state of Hive and Spark in the Cloud (July 2017)The state of Hive and Spark in the Cloud (July 2017)
The state of Hive and Spark in the Cloud (July 2017)
Nicolas Poggi
 
Cassandra Tools and Distributed Administration (Jeffrey Berger, Knewton) | C*...
Cassandra Tools and Distributed Administration (Jeffrey Berger, Knewton) | C*...Cassandra Tools and Distributed Administration (Jeffrey Berger, Knewton) | C*...
Cassandra Tools and Distributed Administration (Jeffrey Berger, Knewton) | C*...
DataStax
 
FPGA Hardware Accelerator for Machine Learning
FPGA Hardware Accelerator for Machine Learning FPGA Hardware Accelerator for Machine Learning
FPGA Hardware Accelerator for Machine Learning
Dr. Swaminathan Kathirvel
 
Tsinghua University: Two Exemplary Applications in China
Tsinghua University: Two Exemplary Applications in ChinaTsinghua University: Two Exemplary Applications in China
Tsinghua University: Two Exemplary Applications in China
DataStax Academy
 
Scaling Hadoop at LinkedIn
Scaling Hadoop at LinkedInScaling Hadoop at LinkedIn
Scaling Hadoop at LinkedIn
DataWorks Summit
 
OCP Telco Engineering Workshop at BCE2017
OCP Telco Engineering Workshop at BCE2017OCP Telco Engineering Workshop at BCE2017
OCP Telco Engineering Workshop at BCE2017
Radisys Corporation
 
"Making Computer Vision Software Run Fast on Your Embedded Platform," a Prese...
"Making Computer Vision Software Run Fast on Your Embedded Platform," a Prese..."Making Computer Vision Software Run Fast on Your Embedded Platform," a Prese...
"Making Computer Vision Software Run Fast on Your Embedded Platform," a Prese...
Edge AI and Vision Alliance
 
2018 03 25 system ml ai and openpower meetup
2018 03 25 system ml ai and openpower meetup2018 03 25 system ml ai and openpower meetup
2018 03 25 system ml ai and openpower meetup
Ganesan Narayanasamy
 
Webinar: OpenEBS - Still Free and now FASTEST Kubernetes storage
Webinar: OpenEBS - Still Free and now FASTEST Kubernetes storageWebinar: OpenEBS - Still Free and now FASTEST Kubernetes storage
Webinar: OpenEBS - Still Free and now FASTEST Kubernetes storage
MayaData Inc
 

Similar to Mikhail Serkov - Zabbix for HPC Cluster Support | ZabConf2016 (20)

Lessons learned from embedding Cassandra in xPatterns
Lessons learned from embedding Cassandra in xPatternsLessons learned from embedding Cassandra in xPatterns
Lessons learned from embedding Cassandra in xPatterns
 
Leveraging Cassandra for real-time multi-datacenter public cloud analytics
Leveraging Cassandra for real-time multi-datacenter public cloud analyticsLeveraging Cassandra for real-time multi-datacenter public cloud analytics
Leveraging Cassandra for real-time multi-datacenter public cloud analytics
 
iland Internet Solutions: Leveraging Cassandra for real-time multi-datacenter...
iland Internet Solutions: Leveraging Cassandra for real-time multi-datacenter...iland Internet Solutions: Leveraging Cassandra for real-time multi-datacenter...
iland Internet Solutions: Leveraging Cassandra for real-time multi-datacenter...
 
Monitoring federation open stack infrastructure
Monitoring federation open stack infrastructureMonitoring federation open stack infrastructure
Monitoring federation open stack infrastructure
 
Atmosphere 2016 - Pawel Mastalerz, Wojciech Inglot - New way of building inf...
Atmosphere 2016 -  Pawel Mastalerz, Wojciech Inglot - New way of building inf...Atmosphere 2016 -  Pawel Mastalerz, Wojciech Inglot - New way of building inf...
Atmosphere 2016 - Pawel Mastalerz, Wojciech Inglot - New way of building inf...
 
Kubernetes @ Squarespace (SRE Portland Meetup October 2017)
Kubernetes @ Squarespace (SRE Portland Meetup October 2017)Kubernetes @ Squarespace (SRE Portland Meetup October 2017)
Kubernetes @ Squarespace (SRE Portland Meetup October 2017)
 
Intro to open source telemetry linux con 2016
Intro to open source telemetry   linux con 2016Intro to open source telemetry   linux con 2016
Intro to open source telemetry linux con 2016
 
HPC and cloud distributed computing, as a journey
HPC and cloud distributed computing, as a journeyHPC and cloud distributed computing, as a journey
HPC and cloud distributed computing, as a journey
 
Cassandra in xPatterns
Cassandra in xPatternsCassandra in xPatterns
Cassandra in xPatterns
 
The power of linux advanced tracer [POUG18]
The power of linux advanced tracer [POUG18]The power of linux advanced tracer [POUG18]
The power of linux advanced tracer [POUG18]
 
OS for AI: Elastic Microservices & the Next Gen of ML
OS for AI: Elastic Microservices & the Next Gen of MLOS for AI: Elastic Microservices & the Next Gen of ML
OS for AI: Elastic Microservices & the Next Gen of ML
 
The state of Hive and Spark in the Cloud (July 2017)
The state of Hive and Spark in the Cloud (July 2017)The state of Hive and Spark in the Cloud (July 2017)
The state of Hive and Spark in the Cloud (July 2017)
 
Cassandra Tools and Distributed Administration (Jeffrey Berger, Knewton) | C*...
Cassandra Tools and Distributed Administration (Jeffrey Berger, Knewton) | C*...Cassandra Tools and Distributed Administration (Jeffrey Berger, Knewton) | C*...
Cassandra Tools and Distributed Administration (Jeffrey Berger, Knewton) | C*...
 
FPGA Hardware Accelerator for Machine Learning
FPGA Hardware Accelerator for Machine Learning FPGA Hardware Accelerator for Machine Learning
FPGA Hardware Accelerator for Machine Learning
 
Tsinghua University: Two Exemplary Applications in China
Tsinghua University: Two Exemplary Applications in ChinaTsinghua University: Two Exemplary Applications in China
Tsinghua University: Two Exemplary Applications in China
 
Scaling Hadoop at LinkedIn
Scaling Hadoop at LinkedInScaling Hadoop at LinkedIn
Scaling Hadoop at LinkedIn
 
OCP Telco Engineering Workshop at BCE2017
OCP Telco Engineering Workshop at BCE2017OCP Telco Engineering Workshop at BCE2017
OCP Telco Engineering Workshop at BCE2017
 
"Making Computer Vision Software Run Fast on Your Embedded Platform," a Prese...
"Making Computer Vision Software Run Fast on Your Embedded Platform," a Prese..."Making Computer Vision Software Run Fast on Your Embedded Platform," a Prese...
"Making Computer Vision Software Run Fast on Your Embedded Platform," a Prese...
 
2018 03 25 system ml ai and openpower meetup
2018 03 25 system ml ai and openpower meetup2018 03 25 system ml ai and openpower meetup
2018 03 25 system ml ai and openpower meetup
 
Webinar: OpenEBS - Still Free and now FASTEST Kubernetes storage
Webinar: OpenEBS - Still Free and now FASTEST Kubernetes storageWebinar: OpenEBS - Still Free and now FASTEST Kubernetes storage
Webinar: OpenEBS - Still Free and now FASTEST Kubernetes storage
 

More from Zabbix

Zabbix Conference LatAm 2016 - Andre Deo - Zabbix Brazil Community
Zabbix Conference LatAm 2016 - Andre Deo - Zabbix Brazil CommunityZabbix Conference LatAm 2016 - Andre Deo - Zabbix Brazil Community
Zabbix Conference LatAm 2016 - Andre Deo - Zabbix Brazil Community
Zabbix
 
Zabbix Conference LatAm 2016 - Andre Deo - SNMP and Zabbix
Zabbix Conference LatAm 2016 - Andre Deo - SNMP and ZabbixZabbix Conference LatAm 2016 - Andre Deo - SNMP and Zabbix
Zabbix Conference LatAm 2016 - Andre Deo - SNMP and Zabbix
Zabbix
 
Zabbix Conference LatAm 2016 - Rodrigo Mohr - Challenges on Large Env with Or...
Zabbix Conference LatAm 2016 - Rodrigo Mohr - Challenges on Large Env with Or...Zabbix Conference LatAm 2016 - Rodrigo Mohr - Challenges on Large Env with Or...
Zabbix Conference LatAm 2016 - Rodrigo Mohr - Challenges on Large Env with Or...
Zabbix
 
Zabbix Conference LatAm 2016 - Marcio Prop - Monitoring Complex Environments ...
Zabbix Conference LatAm 2016 - Marcio Prop - Monitoring Complex Environments ...Zabbix Conference LatAm 2016 - Marcio Prop - Monitoring Complex Environments ...
Zabbix Conference LatAm 2016 - Marcio Prop - Monitoring Complex Environments ...
Zabbix
 
Zabbix Conference LatAm 2016 - Filipe Paternot - Zbx@Globo Automation+Integra...
Zabbix Conference LatAm 2016 - Filipe Paternot - Zbx@Globo Automation+Integra...Zabbix Conference LatAm 2016 - Filipe Paternot - Zbx@Globo Automation+Integra...
Zabbix Conference LatAm 2016 - Filipe Paternot - Zbx@Globo Automation+Integra...
Zabbix
 
Zabbix Conference LatAm 2016 - Douglas Esteves - Zabbix at UNICAMP
Zabbix Conference LatAm 2016 - Douglas Esteves - Zabbix at UNICAMPZabbix Conference LatAm 2016 - Douglas Esteves - Zabbix at UNICAMP
Zabbix Conference LatAm 2016 - Douglas Esteves - Zabbix at UNICAMP
Zabbix
 
Ryan Armstrong - Monitoring More Than 6000 Devices in Zabbix | ZabConf2016
Ryan Armstrong - Monitoring More Than 6000 Devices in Zabbix | ZabConf2016Ryan Armstrong - Monitoring More Than 6000 Devices in Zabbix | ZabConf2016
Ryan Armstrong - Monitoring More Than 6000 Devices in Zabbix | ZabConf2016
Zabbix
 
Rafael Martinez Guerrero - Zabbix at the University of Oslo | ZabConf2016
Rafael Martinez Guerrero - Zabbix at the University of Oslo | ZabConf2016Rafael Martinez Guerrero - Zabbix at the University of Oslo | ZabConf2016
Rafael Martinez Guerrero - Zabbix at the University of Oslo | ZabConf2016
Zabbix
 
Wolfgang Alper - Zabbix Meets OPS Control / Rundeck | ZabConf2016
Wolfgang Alper - Zabbix Meets OPS Control / Rundeck | ZabConf2016Wolfgang Alper - Zabbix Meets OPS Control / Rundeck | ZabConf2016
Wolfgang Alper - Zabbix Meets OPS Control / Rundeck | ZabConf2016
Zabbix
 
Sumit Goel - Monitoring Cloud Applications Using Zabbix | ZabConf2016
Sumit Goel - Monitoring Cloud Applications Using Zabbix | ZabConf2016Sumit Goel - Monitoring Cloud Applications Using Zabbix | ZabConf2016
Sumit Goel - Monitoring Cloud Applications Using Zabbix | ZabConf2016
Zabbix
 
Erik Skytthe - Monitoring Mesos, Docker, Containers with Zabbix | ZabConf2016
Erik Skytthe - Monitoring Mesos, Docker, Containers with Zabbix | ZabConf2016Erik Skytthe - Monitoring Mesos, Docker, Containers with Zabbix | ZabConf2016
Erik Skytthe - Monitoring Mesos, Docker, Containers with Zabbix | ZabConf2016
Zabbix
 
Konstantin Yakovlev - Event Analysis Toolset | ZabConf2016
Konstantin Yakovlev - Event Analysis Toolset | ZabConf2016Konstantin Yakovlev - Event Analysis Toolset | ZabConf2016
Konstantin Yakovlev - Event Analysis Toolset | ZabConf2016
Zabbix
 
Ingus Vilnis - Benefits of Zabbix Training | ZabConf2016
Ingus Vilnis -  Benefits of Zabbix Training | ZabConf2016Ingus Vilnis -  Benefits of Zabbix Training | ZabConf2016
Ingus Vilnis - Benefits of Zabbix Training | ZabConf2016
Zabbix
 
Alexei Vladishev - Opening Speech | ZabConf2016
Alexei Vladishev - Opening Speech | ZabConf2016Alexei Vladishev - Opening Speech | ZabConf2016
Alexei Vladishev - Opening Speech | ZabConf2016
Zabbix
 
Alexander Naydenko - Nagios to Zabbix Migration | ZabConf2016
Alexander Naydenko - Nagios to Zabbix Migration | ZabConf2016Alexander Naydenko - Nagios to Zabbix Migration | ZabConf2016
Alexander Naydenko - Nagios to Zabbix Migration | ZabConf2016
Zabbix
 
Alain Ganuchaud - Trouble Ticket Integration with Zabbix in Large Environment...
Alain Ganuchaud - Trouble Ticket Integration with Zabbix in Large Environment...Alain Ganuchaud - Trouble Ticket Integration with Zabbix in Large Environment...
Alain Ganuchaud - Trouble Ticket Integration with Zabbix in Large Environment...
Zabbix
 
Rihards Olups - Zabbix log management
Rihards Olups - Zabbix log managementRihards Olups - Zabbix log management
Rihards Olups - Zabbix log management
Zabbix
 
Zabbix Conference LatAm 2016 - Paulo Deolindo - Case Study_BBTS and Zabbix
Zabbix Conference LatAm 2016 - Paulo Deolindo - Case Study_BBTS and ZabbixZabbix Conference LatAm 2016 - Paulo Deolindo - Case Study_BBTS and Zabbix
Zabbix Conference LatAm 2016 - Paulo Deolindo - Case Study_BBTS and Zabbix
Zabbix
 

More from Zabbix (18)

Zabbix Conference LatAm 2016 - Andre Deo - Zabbix Brazil Community
Zabbix Conference LatAm 2016 - Andre Deo - Zabbix Brazil CommunityZabbix Conference LatAm 2016 - Andre Deo - Zabbix Brazil Community
Zabbix Conference LatAm 2016 - Andre Deo - Zabbix Brazil Community
 
Zabbix Conference LatAm 2016 - Andre Deo - SNMP and Zabbix
Zabbix Conference LatAm 2016 - Andre Deo - SNMP and ZabbixZabbix Conference LatAm 2016 - Andre Deo - SNMP and Zabbix
Zabbix Conference LatAm 2016 - Andre Deo - SNMP and Zabbix
 
Zabbix Conference LatAm 2016 - Rodrigo Mohr - Challenges on Large Env with Or...
Zabbix Conference LatAm 2016 - Rodrigo Mohr - Challenges on Large Env with Or...Zabbix Conference LatAm 2016 - Rodrigo Mohr - Challenges on Large Env with Or...
Zabbix Conference LatAm 2016 - Rodrigo Mohr - Challenges on Large Env with Or...
 
Zabbix Conference LatAm 2016 - Marcio Prop - Monitoring Complex Environments ...
Zabbix Conference LatAm 2016 - Marcio Prop - Monitoring Complex Environments ...Zabbix Conference LatAm 2016 - Marcio Prop - Monitoring Complex Environments ...
Zabbix Conference LatAm 2016 - Marcio Prop - Monitoring Complex Environments ...
 
Zabbix Conference LatAm 2016 - Filipe Paternot - Zbx@Globo Automation+Integra...
Zabbix Conference LatAm 2016 - Filipe Paternot - Zbx@Globo Automation+Integra...Zabbix Conference LatAm 2016 - Filipe Paternot - Zbx@Globo Automation+Integra...
Zabbix Conference LatAm 2016 - Filipe Paternot - Zbx@Globo Automation+Integra...
 
Zabbix Conference LatAm 2016 - Douglas Esteves - Zabbix at UNICAMP
Zabbix Conference LatAm 2016 - Douglas Esteves - Zabbix at UNICAMPZabbix Conference LatAm 2016 - Douglas Esteves - Zabbix at UNICAMP
Zabbix Conference LatAm 2016 - Douglas Esteves - Zabbix at UNICAMP
 
Ryan Armstrong - Monitoring More Than 6000 Devices in Zabbix | ZabConf2016
Ryan Armstrong - Monitoring More Than 6000 Devices in Zabbix | ZabConf2016Ryan Armstrong - Monitoring More Than 6000 Devices in Zabbix | ZabConf2016
Ryan Armstrong - Monitoring More Than 6000 Devices in Zabbix | ZabConf2016
 
Rafael Martinez Guerrero - Zabbix at the University of Oslo | ZabConf2016
Rafael Martinez Guerrero - Zabbix at the University of Oslo | ZabConf2016Rafael Martinez Guerrero - Zabbix at the University of Oslo | ZabConf2016
Rafael Martinez Guerrero - Zabbix at the University of Oslo | ZabConf2016
 
Wolfgang Alper - Zabbix Meets OPS Control / Rundeck | ZabConf2016
Wolfgang Alper - Zabbix Meets OPS Control / Rundeck | ZabConf2016Wolfgang Alper - Zabbix Meets OPS Control / Rundeck | ZabConf2016
Wolfgang Alper - Zabbix Meets OPS Control / Rundeck | ZabConf2016
 
Sumit Goel - Monitoring Cloud Applications Using Zabbix | ZabConf2016
Sumit Goel - Monitoring Cloud Applications Using Zabbix | ZabConf2016Sumit Goel - Monitoring Cloud Applications Using Zabbix | ZabConf2016
Sumit Goel - Monitoring Cloud Applications Using Zabbix | ZabConf2016
 
Erik Skytthe - Monitoring Mesos, Docker, Containers with Zabbix | ZabConf2016
Erik Skytthe - Monitoring Mesos, Docker, Containers with Zabbix | ZabConf2016Erik Skytthe - Monitoring Mesos, Docker, Containers with Zabbix | ZabConf2016
Erik Skytthe - Monitoring Mesos, Docker, Containers with Zabbix | ZabConf2016
 
Konstantin Yakovlev - Event Analysis Toolset | ZabConf2016
Konstantin Yakovlev - Event Analysis Toolset | ZabConf2016Konstantin Yakovlev - Event Analysis Toolset | ZabConf2016
Konstantin Yakovlev - Event Analysis Toolset | ZabConf2016
 
Ingus Vilnis - Benefits of Zabbix Training | ZabConf2016
Ingus Vilnis -  Benefits of Zabbix Training | ZabConf2016Ingus Vilnis -  Benefits of Zabbix Training | ZabConf2016
Ingus Vilnis - Benefits of Zabbix Training | ZabConf2016
 
Alexei Vladishev - Opening Speech | ZabConf2016
Alexei Vladishev - Opening Speech | ZabConf2016Alexei Vladishev - Opening Speech | ZabConf2016
Alexei Vladishev - Opening Speech | ZabConf2016
 
Alexander Naydenko - Nagios to Zabbix Migration | ZabConf2016
Alexander Naydenko - Nagios to Zabbix Migration | ZabConf2016Alexander Naydenko - Nagios to Zabbix Migration | ZabConf2016
Alexander Naydenko - Nagios to Zabbix Migration | ZabConf2016
 
Alain Ganuchaud - Trouble Ticket Integration with Zabbix in Large Environment...
Alain Ganuchaud - Trouble Ticket Integration with Zabbix in Large Environment...Alain Ganuchaud - Trouble Ticket Integration with Zabbix in Large Environment...
Alain Ganuchaud - Trouble Ticket Integration with Zabbix in Large Environment...
 
Rihards Olups - Zabbix log management
Rihards Olups - Zabbix log managementRihards Olups - Zabbix log management
Rihards Olups - Zabbix log management
 
Zabbix Conference LatAm 2016 - Paulo Deolindo - Case Study_BBTS and Zabbix
Zabbix Conference LatAm 2016 - Paulo Deolindo - Case Study_BBTS and ZabbixZabbix Conference LatAm 2016 - Paulo Deolindo - Case Study_BBTS and Zabbix
Zabbix Conference LatAm 2016 - Paulo Deolindo - Case Study_BBTS and Zabbix
 

Recently uploaded

FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance
 
Connector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonConnector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a button
DianaGray10
 
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
OnBoard
 
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Tobias Schneck
 
ODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User GroupODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User Group
CatarinaPereira64715
 
Search and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical FuturesSearch and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical Futures
Bhaskar Mitra
 
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualitySoftware Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Inflectra
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
DanBrown980551
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Product School
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance
 
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
Product School
 
JMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and GrafanaJMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and Grafana
RTTS
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
Elena Simperl
 
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Product School
 
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Thierry Lestable
 
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Product School
 
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
Product School
 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
DianaGray10
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
Thijs Feryn
 
Knowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and backKnowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and back
Elena Simperl
 

Recently uploaded (20)

FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
 
Connector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonConnector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a button
 
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
 
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
 
ODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User GroupODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User Group
 
Search and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical FuturesSearch and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical Futures
 
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualitySoftware Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
 
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
 
JMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and GrafanaJMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and Grafana
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
 
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...
 
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
 
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
 
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
 
Knowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and backKnowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and back
 

Mikhail Serkov - Zabbix for HPC Cluster Support | ZabConf2016

  • 1. ZABBIX FOR 
 HPC MONITORING AND SUPPORT Mikhail Serkov Delivery Manager/HPC Engineer 2016
  • 2. CONFIDENTIAL 2 What’s next? HPC monitoring – differences from classic support model How do we use Zabbix AGENDA Overview of the customer infrastructure and software stack High Performance Computing – what is it about#1 #4 #5 #2 #3
  • 3. CONFIDENTIAL 3 • Scientific research in Pharma area: Bioinformatics, Computational Chemistry, Drug Discovery, etc. • About 10k CPU cores used for a scientific computation. • Shared clusters - different workflows could run simultaneously within the same cluster. • About 500 different scientific tools. • Custom software ( Python, Java, R) Novartis Institute For Biomedical Research (NIBR)
  • 4. CONFIDENTIAL 4 • Hundreds or even thousands of computation nodes • Grid Computing technologies and software ( SGE, UGE, SoGE, PBS, etc) • Massive parallel computation across the nodes • Strong requirements for all subsystems on hardware and software level ( storage, network, power, OS ) • No magic. Linux boxes, shell scripts on a low level ☺ Example of a job submission: HIGH PERFORMANCE COMPUTING
  • 5. CONFIDENTIAL 5 OVERVIEW OF THE CUSTOMER INFRASTRUCTURE 250 GPU’s 70TB RAM 35-40KW/Rack
  • 6. CONFIDENTIAL 6 • 28 CPU cores ( 2 sockets x 14 cores each ) − Intel(R) Xeon(R) CPU E5-2697 v3 @ 2.60GHz • 200 GB RAM • 10 GB Ethernet + InfiniBand interfaces • 8 GPU cores ( 4 cards x 2 cores each ) • NFS over 10 GB Ehternet • Lustre over InfiniBand TYPICAL COMPUTATION NODE CONFIGURATION
  • 7. CONFIDENTIAL 7 OVERVIEW OF SOFTWARE STACK • More than 500 of scientific tools • Bioinformatics, Computation Chemistry, Xtallography, Molecular Dynamics, etc • RHEL6.5 • Univa Grid Engine • Zabbix 2.4
  • 8. CONFIDENTIAL 8 • We need information like ‘who, what, when’, not only system metrics. • Users are allowed to run whatever they want using grid scheduler on the computation nodes. • 100% CPU utilization and 100% RAM utilization for node is perfectly fine. • Node crash – not such a big deal. • Preventing global issues by using aggregated metrics. • Metrics not only for monitoring but for a performance analysis. • Users are having access to the monitoring system ( but restricted ). HPC MONITORING DIFFERENCES
  • 9. CONFIDENTIAL 9 • Able to monitor of a huge systems with a lot of metrics • Flexible • Out of the box • Ability to aggregate metrics • API for a data extraction • GUI convenient for both support team and scientists • Autodiscovery • New nodes automatic configuration WHY ZABBIX?
  • 10. CONFIDENTIAL 10 ZABBIX CONFIGURATION Server configuration: • 20 CPU cores ( 2 sockets x 10 cores each ) − Intel(R) Xeon(R) CPU E5-2697 v3 @ 2.60GHz • 120 GB RAM Number of hosts: 601 Number of items: ~200k Number of triggers: ~37k DB Size: 187GB
  • 11. CONFIDENTIAL 11 WHAT DO WE MONITOR Local metrics ( node level ) Global metrics ( cluster level ) All default Linux checks (LA, CPU utilization, RAM, swap, etc) - agent Meta CPU utilization – aggregation of CPU utilization of HPC nodes. Every single GPU core ( Temperature, Utilization if possible) - agent NFS global transmit/retransmit - aggregation of nodes values Every single CPU core ( Utilization, Temperature) - agent Grid specific – used/active slots, running jobs, pending jobs, top users - external scripts NFS shares availability / utilization / mount details - agent CPU/Memory oversubscription - aggregation of nodes values Slots / RAM reserved - external scripts Overloaded nodes - aggregation of HPC values HPC jobs - external scripts Pending time - external scripts ... ....
  • 12. CONFIDENTIAL 12 HPC specific examples 1) Expected utilization VS Real one Every job has a resource request for number of CPUs, RAM, etc. In every moment we can compare real utilization with an expected one. If they are not close, we need to investigate if someone oversubscribing resources or overload nodes. Solution: Zabbix not only checks current system metrics, but also keeps an expected values. If they are too different we receive warning. 2) Users on a computation node Users are not restricted to SSH to any node ( debugging, tracing job in real time, interactive jobs, etc). However we should check if user has job on the node he is logged into. Solution: We have a trigger that notify us if we have anyone logged on the node with no job running. Additionally we store a list of logged in users for any single moment.
  • 13. CONFIDENTIAL 13 HPC specific examples Pending time probes It is really hard to predict the pending time for any particular job in the pending list, as they all have different resource requests, and runtimes. It is not a FIFO and the pending time is always related to resources user wants to have. Solution: Zabbix runs ‘pending probes’ ( empty jobs) and checks how long does it take. This is a good indicator for queue state at the moment.
  • 14. CONFIDENTIAL 14 WHAT DO WE MONITOR: GLOBAL METRICS Global cluster utilization
  • 15. CONFIDENTIAL 15 WHAT DO WE MONITOR: GLOBAL METRICS RAM oversubscription
  • 16. CONFIDENTIAL 16 WHAT DO WE MONITOR: GLOBAL METRICS CPU time oversubscription
  • 17. CONFIDENTIAL 17 WHAT DO WE MONITOR: GLOBAL METRICS Meta CPU utilization
  • 18. CONFIDENTIAL 18 WHAT DO WE MONITOR: GLOBAL METRICS Aggregated cluster status
  • 19. CONFIDENTIAL 19 WHAT DO WE MONITOR: GLOBAL METRICS Storage operational metrics
  • 20. CONFIDENTIAL 20 WHAT DO WE MONITOR: LOCAL METRICS
  • 21. CONFIDENTIAL 21 WHAT DO WE MONITOR: LOCAL METRICS
  • 22. CONFIDENTIAL 22 USER ACCESS We want to provide a limited amount of information to users. They don’t need any info about triggers and issues, but only metrics. We have patched Zabbix to remove all unnecessary data for guest access. After Before
  • 23. CONFIDENTIAL 23 Benefits • Better understanding of a global issues on the cluster an reasons of why have they happened. • Great performance indicators for other infrastructure teams ( especially Storage team ) • Performance tuning of a scientific workflows. Jobs profiling. In some cases information we cat get from Zabbix is helping us to significantly improve performance of jobs. • Proactive monitoring. With Zabbix it’s easier to understand if something is not right on the cluster or with some job. In most cases we are able to prevent global cluster issues, or at least minimize an impact. • One monitoring system for clusters and HPC infrastructure. • “All in one”. Lower efforts on support/maintain monitoring system(s).
  • 24. CONFIDENTIAL 24 • Tight integration with Grid HPC software. • Data analysis using external tools, but with Zabbix data source. • Create a set of CLI utilities for getting Zabbix statistics in ‘human-readable’ format. • Automation of jobs profiling using Zabbix API. WHAT’S NEXT?