For the last two years I've been working in Cambridge (US) at the Novartis Institute for Biomedical Research (NIBR) on a project supporting HPC cluster infrastructure and its users. We use Zabbix for HPC cluster monitoring (more than 1000 nodes, 10000+ cores, GPU cores, etc.). This presentation covers interesting use cases of Zabbix for an HPC cluster, which is not regular infrastructure monitoring. We will talk about some of the challenges we face in HPC monitoring and how Zabbix helps us work with scientists, and present some solutions that might be of interest to the Zabbix community.
Introduction to Zabbix - Company, Product, Services and Use Cases (Zabbix)
About Zabbix Software:
Zabbix is an enterprise-class open source distributed monitoring solution designed to monitor and track performance and availability of network servers, devices, services and other IT resources.
Zabbix is an all-in-one monitoring solution that allows users to collect, store, manage and analyze information received from IT infrastructure, display it on-screen, and send alerts by e-mail, SMS or Jabber when thresholds are reached.
Zabbix allows administrators to recognize server and device problems within a short period of time, and therefore reduces system downtime and the risk of system failure. The monitoring solution is actively used by SMBs and large enterprises across all industries and in almost every country of the world.
Dimitri Bellini and Pietro Antonacci - Manage Zabbix Proxies in Remote Networ... (Zabbix)
Monitoring multiple server farms spread all around the world is not an easy task: many small problems have to be addressed, but with Zabbix it is all a breeze.
We will talk about our experience setting up Zabbix proxies in very remote networks, the problems we encountered and how we worked on fixing them.
Lukáš Malý - Log management ELISA controlled by Zabbix | ZabConf2016 (Zabbix)
Datasys ELISA log management is a robust, powerful, yet inexpensive solution for the collection, correlation and analysis of logs. The core system consists of the Elasticsearch "NoSQL" database and the Kibana web user interface, which makes it comfortable to analyze detected security incidents and relevant logs. The Elasticsearch database is commonly distributed across multiple servers to achieve load balancing and high availability of indexed data. ELISA relies heavily on Zabbix for user authentication and role-based access control, notifications and self-monitoring. Elasticsearch indices can be managed right in the Zabbix frontend. Zabbix "trapper" items and monitoring templates are used to centrally manage the configuration of a distributed environment of NXlog agents. Agents can securely auto-register as Zabbix "hosts".
Rihards Olups - Zabbix at Nokia - Case Study (Zabbix)
We will explore a fairly complicated Zabbix environment at one division in Nokia. With several different Zabbix versions in use and a lot of custom products monitored, it is a place one can easily get lost in. We'll discuss JMX monitoring, approaches to keeping notification configuration simple and notifications useful, different use cases for the Zabbix API and a lot of other topics. The importance of SSL compliance will be covered, along with some of the many ways custom solutions are monitored.
The aim of the lecture is to demonstrate the new low-level discovery (LLD) resources introduced in Zabbix 3.0, and to present and demonstrate LLD settings for Windows and ODBC services.
Raymond Kuiper - Zen and The Art of Zabbix Template Design | ZabConf2016 (Zabbix)
The Zabbix monitoring solution can help bring balance to your organisation's IT landscape. However, success depends greatly on the templates you use to set up your monitoring system. As any Zabbix veteran will tell you, the default templates don't really suffice for any setup other than a proof of concept. How, then, do you set about creating your own templates? Following practical examples, we'll discuss some of the design decisions that need to be made to achieve template perfection.
Unirede has worked since 2008 on projects of all sizes involving Zabbix. Since then, needs have arisen where we must ensure that Zabbix interacts with a wide variety of messaging forms, methods and tools to notify about events (e-mail, SMS, ticket creation, log files, WhatsApp, VoIP, Telegram, etc.). In this talk I will try to show by example how we can interact with Telegram, receiving and sending messages to Zabbix, and in this way make communication between remote users and their servers and equipment in the datacenter more dynamic.
Wireless network monitoring data are abundant, since it is useful to store information about connected devices and users, especially in a multi-campus environment like Unesp. As a result, the number of database records tends to grow rapidly, making it necessary to optimize Zabbix's periodic data cleanup routine. We present our way of improving the native "housekeeping" process. We will also demonstrate extensive use of the "Zabbix trapper" item type to make the set of information collected about the Wi-Fi infrastructure more flexible, and various techniques using "low-level discovery" to monitor wireless access points.
Alexei Vladishev - Zabbix - Monitoring Solution for Everyone (Zabbix)
Paris Zabbix User Group Meetup 2016
June 23, 2016
1. Open Source
2. Zabbix Architecture
3. Data Collection
4. Problem Detection
5. Problem Forecasting / Trend Prediction
6. Lifecycle and Support Policy
At OpenStack Day CEE 2015, we discuss the latest user survey results, some real-world OpenStack case studies and how new users and cloud operators can get involved with the community.
Feedbackstr - Improve Your Business Through Your Customers' Feedback! (Spectos GmbH)
Find out how mobile customer feedback via smartphone, tablet and computer can improve your business. Learn how to receive real feedback from real customers on site and at the moment of truth, evaluate it and react immediately. This lets you increase customer satisfaction and improve your business day by day.
Social Media: 4 Tips for Good Customer Feedback (TWT)
On the famous shopping day "Black Friday", numerous retailers offer discounts and special deals. Retailers can score points on this day with strong customer service.
Wolfgang Alper - Zabbix Meets OPS Control / Rundeck | ZabConf2016 (Zabbix)
Zabbix is an excellent tool for network monitoring and for alerting when something bad happens. But Zabbix can do more. An underestimated feature of Zabbix is its ability to perform actions in addition to simple notifications. However, this requires setting up those actions precisely within Zabbix, which is not always an easy task and might duplicate existing work. So what if Zabbix actually worked in concert with an external task runner / job scheduler that is built to do exactly this: run a task or action against a host and report its outcome? Zabbix would perform the same well-defined steps that an ops team member would perform with this kind of tool in case of certain failures. A well-known example of this kind of software is "Rundeck", which is licensed under the Apache License Version 2.0.
Leveraging Cassandra for real-time multi-datacenter public cloud analytics (Julien Anguenot)
iland has built a global data warehouse across multiple data centers, collecting and aggregating data from core cloud services including compute, storage and network as well as chargeback and compliance. iland's warehouse brings actionable intelligence that customers can use to manipulate resources, analyze trends, define alerts and share information.
In this session, we would like to present the lessons learned around Cassandra, both at the development and operations level, but also the technology and architecture we put in action on top of Cassandra such as Redis, syslog-ng, RabbitMQ, Java EE, etc.
Finally, we would like to share insights on how we are currently extending our platform with Spark and Kafka and what our motivations are.
iland Internet Solutions: Leveraging Cassandra for real-time multi-datacenter... (DataStax Academy)
iland has built a global data warehouse across multiple data centers, collecting and aggregating data from core cloud services including compute, storage and network as well as chargeback and compliance. iland's warehouse brings actionable intelligence that customers can use to manipulate resources, analyze trends, define alerts and share information.
In this session, we would like to present the lessons learned around Cassandra, both at the development and operations level, but also the technology and architecture we put in action on top of Cassandra such as Redis, syslog-ng, RabbitMQ, Java EE, etc.
Finally, we would like to share insights on how we are currently extending our platform with Spark and Kafka and what our motivations are.
Atmosphere 2016 - Pawel Mastalerz, Wojciech Inglot - New way of building inf... (PROIDEA)
Creating infrastructure for global web and mobile applications can be hard. Creating infrastructure for fast-growing global applications can be very hard :) At Brainly we had to move from a traditional LAMP setup with bare-metal servers to something new, and the cloud was not enough. With software like Ansible, Mesos, Docker and Consul, we designed a fully automated immutable setup, even with tests! In this presentation we will show you how, and share our experience of running this kind of platform.
Kubernetes @ Squarespace (SRE Portland Meetup October 2017) (Kevin Lynch)
In this presentation I talk about our motivation for converting our microservices to run on Kubernetes. I discuss many of the technical challenges we encountered along the way, including networking issues, Java issues, monitoring and alerting, and managing all of our resources!
Intro to open source telemetry linux con 2016 (Matthew Broberg)
Abstract
As part of the team delivering Snap, an open telemetry framework, I've run through dozens of use cases where gathering disparate metrics from services can roll up into meaningful diagrams for operations engineers and developers alike. We will use Snap's plugin model to collect, process and publish these measurements into meaningful graphs using open source tools. By joining this session, you can follow along and install industry-standard open source projects, deploy them and then use Snap to collect, process and visualize these metrics.
Audience
Anyone with an operations background (or one ahead of them) who wants to see the breadth of available open source tooling around telemetry. This proposal is designed for the hands-on user who is comfortable running containers or virtual machines locally.
Experience Level
Intermediate
Benefits to the Ecosystem
By joining this session, you can follow along and install industry-standard open source projects, deploy them and then use Snap to collect, process and visualize these metrics. This empowers users within the Linux ecosystem to see their knowledge as powerful when visualized next to other layers of the datacenter.
HPC and cloud distributed computing, as a journey (Peter Clapham)
Introducing an internal cloud brings new paradigms, tools and infrastructure management. When placed alongside traditional HPC, the new opportunities are significant. But getting to the new world with micro-services, autoscaling and autodialing is a journey that cannot be achieved in a single step.
OS for AI: Elastic Microservices & the Next Gen of ML (Nordic APIs)
AI has been a hot topic lately, with advances constantly being made in what is possible, but there has not been as much discussion of the infrastructure and scaling challenges that come with it. How do you support dozens of different languages and frameworks, and make them interoperate invisibly? How do you scale to run abstract code from thousands of different developers, simultaneously and elastically, while maintaining less than 15ms of overhead?
At Algorithmia, we’ve built, deployed, and scaled thousands of algorithms and machine learning models, using every kind of framework (from scikit-learn to tensorflow). We’ve seen many of the challenges faced in this area, and in this talk I’ll share some insights into the problems you’re likely to face, and how to approach solving them.
In brief, we’ll examine the need for, and implementations of, a complete “Operating System for AI” – a common interface for different algorithms to be used and combined, and a general architecture for serverless machine learning which is discoverable, versioned, scalable and sharable.
The state of Hive and Spark in the Cloud (July 2017) (Nicolas Poggi)
Originally presented at the BDOOP and Spark Barcelona meetup groups: http://meetu.ps/3bwCTM
Cloud providers currently offer convenient on-demand managed big data clusters (PaaS) with a pay-as-you-go model. In PaaS, analytical engines such as Spark and Hive come ready to use, with a general-purpose configuration and upgrade management. Over the last year, the Spark framework and APIs have been evolving very rapidly, with major improvements on performance and the release of v2, making it challenging to keep up-to-date production services both on-premises and in the cloud for compatibility and stability. The talk compares:
• The performance of both v1 and v2 for Spark and Hive
• PaaS cloud services: Azure HDinsight, Amazon Web Services EMR, Google Cloud Dataproc
• Out-of-the-box support for Spark and Hive versions from providers
• PaaS reliability, scalability, and price-performance of the solutions
The comparison uses BigBench, the new Big Data benchmark standard. BigBench combines SQL queries, MapReduce, user code (UDF), and machine learning, which makes it ideal for stressing Spark libraries (SparkSQL, DataFrames, MLlib, etc.).
Cassandra Tools and Distributed Administration (Jeffrey Berger, Knewton) | C*... (DataStax)
At Knewton we operate a total of 29 clusters across five different VPCs, each ranging from 3 to 24 nodes. For a team of three, maintaining this is not herculean; however, good tools to diagnose issues and gather information in a distributed manner are vital to moving quickly and minimizing the engineering time spent.
The database team at Knewton has been successfully using a combination of Ansible and custom open sourced tools to maintain and improve the Cassandra deployment at Knewton. I will be talking about several of these tools and giving examples of how we are using them. Specifically I will discuss the cassandra-tracing tool, which analyzes the contents of the system_traces keyspace, and the cassandra-stat tool, which gives real-time output of the operations of a cassandra cluster. Distributed administration with ad-hoc Ansible will also be covered and I will walk through examples of using these commands to identify and remediate clusterwide issues.
About the Speaker
Jeffrey Berger Lead Database Engineer, Knewton
Dr. Jeffrey Berger is currently the lead database engineer at Knewton, an education tech startup in NYC. He joined the tech scene in NYC in 2013 and spent two years working with MongoDB, becoming a certified MongoDB administrator and a MongoDB Master. He received his Cassandra Administrator certification at Cassandra Summit 2015. He holds a Ph.D. in Theoretical Physics from Penn State and spent several years working on high energy nuclear interactions.
Tsinghua University: Two Exemplary Applications in China (DataStax Academy)
In this talk, we will share our experience of applying Cassandra with two real customers in China. In the first use case, we deployed Cassandra at Sany Group, a leading machinery manufacturing company, to manage the sensor data generated by construction machinery. By designing a specific schema and optimizing the write process, we successfully managed over 1.5 billion historical data records and achieved an online write throughput of 10k write operations per second with 5 servers. MapReduce is also used on Cassandra for value-added services, e.g. operations management, machine failure prediction, and abnormal behavior mining. In the second use case, Cassandra is deployed at the China Meteorological Administration to manage meteorological data. We designed a hybrid schema to efficiently support both slice queries and time-window-based queries. We also explored an optimized compaction and deletion strategy for meteorological data in this case.
LinkedIn leverages the Apache Hadoop ecosystem for its big data analytics. Steady growth of the member base at LinkedIn along with their social activities results in exponential growth of the analytics infrastructure. Innovations in analytics tooling lead to heavier workloads on the clusters, which generate more data, which in turn encourage innovations in tooling and more workloads. Thus, the infrastructure remains under constant growth pressure. Heterogeneous environments embodied via a variety of hardware and diverse workloads make the task even more challenging.
This talk will tell the story of how we doubled our Hadoop infrastructure twice in the past two years.
• We will outline our main use cases and historical rates of cluster growth in multiple dimensions.
• We will focus on optimizations, configuration improvements, performance monitoring and architectural decisions we undertook to allow the infrastructure to keep pace with business needs.
• The topics include improvements in HDFS NameNode performance, and fine tuning of block report processing, the block balancer, and the namespace checkpointer.
• We will reveal a study on the optimal storage device for HDFS persistent journals (SATA vs. SAS vs. SSD vs. RAID).
• We will also describe Satellite Cluster project which allowed us to double the objects stored on one logical cluster by splitting an HDFS cluster into two partitions without the use of federation and practically no code changes.
• Finally, we will take a peek at our future goals, requirements, and growth perspectives.
SPEAKERS
Konstantin Shvachko, Sr Staff Software Engineer, LinkedIn
Erik Krogen, Senior Software Engineer, LinkedIn
Radisys' CTO, Andrew Alleman, was one of the featured speakers at the OCP Telco Engineering Workshop during the 2017 Big Communications Event. Andrew discussed carrier-grade open rack architecture (CG-OpenRack-19), the future of open hardware standards and commercial products in the OCP pipeline during his presentation.
For the full video of this presentation, please visit:
http://www.embedded-vision.com/platinum-members/luxoft/embedded-vision-training/videos/pages/may-2016-embedded-vision-summit
For more information about embedded vision, please visit:
http://www.embedded-vision.com
Alexey Rybakov, Senior Director at LUXOFT, presents the "Making Computer Vision Software Run Fast on Your Embedded Platform" tutorial at the May 2016 Embedded Vision Summit.
Many computer vision algorithms perform well on desktop class systems, but struggle on resource constrained embedded platforms. This how-to talk provides a comprehensive overview of various optimization methods that make vision software run fast on low power, small footprint hardware that is widely used in automotive, surveillance, and mobile devices. The presentation explores practical aspects of deep algorithm and software optimization such as thinning of input data, using dynamic regions of interest, mastering data pipelines and memory access, overcoming compiler inefficiencies, and more.
Webinar: OpenEBS - Still Free and now FASTEST Kubernetes storage (MayaData Inc)
Webinar Session - https://youtu.be/_5MfGMf8PG4
In this webinar, we share how the Container Attached Storage pattern makes performance tuning more tractable, by giving each workload its own storage system, thereby decreasing the variables needed to understand and tune performance.
We then introduce MayaStor, a breakthrough in the use of containers and Kubernetes as a data plane. MayaStor is the first containerized data engine available that delivers near the theoretical maximum performance of underlying systems. MayaStor performance scales with the underlying hardware and has been shown, for example, to deliver in excess of 10 million IOPS in a particular environment.
Similar to Mikhail Serkov - Zabbix for HPC Cluster Support | ZabConf2016 (20)
Zabbix Conference LatAm 2016 - Andre Deo - Zabbix Brazil Community (Zabbix)
In 2008 Brazil had no Zabbix community, and the software was unknown to most people. What has changed in 8 years? How did a community started by one single man (as in Japan) make the difference? Today this community has more than 3,000 members, many talks at local events, articles in magazines, books, many blogs, and members involved in building additional functions for Zabbix and translating the official documentation!
Zabbix Conference LatAm 2016 - Andre Deo - SNMP and Zabbix (Zabbix)
The aim of the lecture is to discuss the main questions people have when using SNMP with Zabbix. It will present an overview of SNMP, MIBs, Net-SNMP and the items used in Zabbix templates.
Zabbix Conference LatAm 2016 - Rodrigo Mohr - Challenges on Large Env with Or... (Zabbix)
Scalability on a large environment can be a challenge on many different aspects involving customization of monitors, performance and reporting. The goal of this presentation is to share the experience we had at Dell, monitoring a big number of servers in an environment with constant changes, lots of custom monitors and new servers configured every week. We will present, from our 3 years of experience with Zabbix and Oracle, which positive/negative aspects we have taken from the configuration parameters we used, involving strong use of User Macros, optimization of Database Queries, Table Partitioning and Automation.
Lojas Renner has always been close to the Open Source movement in Brazil. Back in the 90s, all the company's POS solutions were migrated to Linux, and in the early 2000s migration began across all the company's systems, including the main components of critical infrastructure. Since then, much has changed. Open Source has become a worldwide standard for all products and companies, making its adoption not only an innovation but a necessity. Learn how Zabbix has helped us since 2008 to monitor the entire IT infrastructure, remote units and our business processes.
Zabbix Conference LatAm 2016 - Filipe Paternot - Zbx@Globo Automation+Integra... (Zabbix)
The Zabbix API offers us a lot of power and possibilities. We will talk about automation and integrations at scale at Globo.com. Automation gives us the power to clone instances of Zabbix, perform batch operations, manage MANY networks for discovery and more. We will present our abstraction layer over the API, which democratizes API access and offers a nice UI, standards for every newly monitored service and a few cached responses. We will also show how we have integrated with CloudStack to deliver automated private cloud monitoring into Zabbix.
Zabbix Conference LatAm 2016 - Douglas Esteves - Zabbix at UNICAMP (Zabbix)
This talk presents the Zabbix use case at the UNICAMP Computer Center, where it is an excellent option for monitoring datacenter and university environments. It covers the use of the tool at UNICAMP for simple monitoring, and a case of IT service monitoring to measure server and database availability.
Ryan Armstrong - Monitoring More Than 6000 Devices in Zabbix | ZabConf2016 (Zabbix)
Ryan will describe a Skunkworks project executed by Kinetic IT at the Department of Education to deliver an autonomous infrastructure monitoring solution for over 6000 devices distributed across WA. The team were given the opportunity to experiment with DevOps practices such as Scrum product development, Infrastructure as Code and Continuous Integration to determine where the value lay and which practices should be adopted at greater scale.
Rafael Martinez Guerrero - Zabbix at the University of Oslo | ZabConf2016 (Zabbix)
A case study showing the problems we have resolved with Zabbix and the challenges we had when we implemented Zabbix as the main monitoring tool at the University of Oslo. The challenges are many in an organization as heterogeneous as ours, with many thousands of servers and clients, all kinds of devices connected to our infrastructure, different operating systems, multiple locations and hundreds of IT staff. Full automation and delegation of privileges are the key words in the work we have done during the past year and a half.
Sumit Goel - Monitoring Cloud Applications Using Zabbix | ZabConf2016 (Zabbix)
With the global shift towards the flexibility of the cloud, there are different demands on monitoring the availability and performance of applications provided in the cloud. There are obvious limitations in accessing components of an app that is hosted by a third party and runs outside the internal environment. At the same time, there are opportunities in using the vendor's API and status page. At Salesforce, named one of the most innovative companies in the world by Forbes and one of the biggest cloud service providers, we understand customers' need to see the availability and performance of a cloud application in real time. In this presentation we list and describe multiple ways of monitoring cloud apps. Some of the methods are: built-in web monitoring using Curl, web browser automation tools like Selenium, external scripts (reading the vendor status dashboard) and API calls to the app.
Erik Skytthe - Monitoring Mesos, Docker, Containers with Zabbix | ZabConf2016 (Zabbix)
At DBC we are running docker and other container types in a mesos/marathon cluster environment. I will demonstrate how we collect statistics, logs etc. and monitor this environment, showing configuration examples, data flows and templates.
Some of the covered topics:
- Mesos master and agents
- Marathon Framework
- Docker engine
- Containers
- Zookeeper
- Elasticsearch/ELK
Konstantin Yakovlev - Event Analysis Toolset | ZabConf2016 (Zabbix)
During outages in a 10k+ host environment, NOC and Operations teams may face hundreds of alerts while performing root cause analysis, remediation or escalation, and at the same time logging resolution progress in the Incident Management system for audit purposes.
This presentation will describe RingCentral's approach to Incident and Problem Management in a large Zabbix-monitored cloud.
Co-authors of the presentation: Dmitry Shchemelinin, Ph.D., Sr. Director of Operations, RingCentral, USA.
Alexei will talk about exciting new features of the Zabbix 3.2 event correlation module, aimed at simplifying large-scale monitoring and root cause analysis. He will also share his thoughts about future plans for Zabbix 3.4 and further releases.
Alain Ganuchaud - Trouble Ticket Integration with Zabbix in Large Environment... (Zabbix)
Large environments rely on a trouble ticket tool and a help desk for managing IT issues. Manually bridging a Zabbix installation of over 5000 servers with the help desk is a painful and impossible project. In this presentation we will cover how Zabbix can be integrated with the help desk, the architecture, and the issues that arise, especially in large environments.
As an example, we will cover the case study of the Zabbix - ServiceNow integration, as it was developed for SwissLife and released as open source.
Zabbix Conference LatAm 2016 - Paulo Deolindo - Case Study_BBTS and Zabbix (Zabbix)
This presentation gives a brief introductory overview of how BB Technology, a services company within a banking conglomerate in Brazil, uses Zabbix to monitor assets and services in its current infrastructure, and how it plans to use it to monitor its new data center in the Digital City in Brasilia. As a provider of services to BB Group companies, it has yet to schedule the wider use and extension of Zabbix, which is being planned with high expectations for the entire group.
Connector Corner: Automate dynamic content and events by pushing a button (DianaGray10)
Here is something new! In our next Connector Corner webinar, we will demonstrate how you can use a single workflow to:
Create a campaign using Mailchimp with merge tags/fields
Send an interactive Slack channel message (using buttons)
Have the message received by managers and peers along with a test email for review
But there’s more:
In a second workflow supporting the same use case, you’ll see:
Your campaign sent to target colleagues for approval
If the “Approve” button is clicked, a Jira/Zendesk ticket is created for the marketing design team
But—if the “Reject” button is pushed, colleagues will be alerted via Slack message
Join us to learn more about this new, human-in-the-loop capability, brought to you by Integration Service connectors.
And...
Speakers:
Akshay Agnihotri, Product Manager
Charlie Greenberg, Host
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024 (Tobias Schneck)
As AI technology pushes into IT, I wondered, as an "infrastructure container Kubernetes guy", how this fancy AI technology gets managed from an infrastructure operations point of view. Is it possible to apply our beloved cloud native principles as well? What benefits could the two technologies bring to each other?
Let me take these questions and guide you on a short journey through existing deployment models and use cases for AI software. Using practical examples, we discuss what cloud/on-premise strategy we may need to apply it to our own infrastructure and make it work from an enterprise perspective. I want to give an overview of infrastructure requirements and technologies, and of what could benefit or limit your AI use cases in an enterprise environment. An interactive demo will give you some insights into which approaches I already have working for real.
Let's dive deeper into the world of ODC! Ricardo Alves (OutSystems) will join us to tell all about the new Data Fabric. After that, Sezen de Bruijn (OutSystems) will get into the details on how to best design a sturdy architecture within ODC.
Search and Society: Reimagining Information Access for Radical FuturesBhaskar Mitra
The field of Information retrieval (IR) is currently undergoing a transformative shift, at least partly due to the emerging applications of generative AI to information access. In this talk, we will deliberate on the sociotechnical implications of generative AI for information access. We will argue that there is both a critical necessity and an exciting opportunity for the IR community to re-center our research agendas on societal needs while dismantling the artificial separation between the work on fairness, accountability, transparency, and ethics in IR and the rest of IR research. Instead of adopting a reactionary strategy of trying to mitigate potential social harms from emerging technologies, the community should aim to proactively set the research agenda for the kinds of systems we should build inspired by diverse explicitly stated sociotechnical imaginaries. The sociotechnical imaginaries that underpin the design and development of information access technologies needs to be explicitly articulated, and we need to develop theories of change in context of these diverse perspectives. Our guiding future imaginaries must be informed by other academic fields, such as democratic theory and critical theory, and should be co-developed with social science scholars, legal scholars, civil rights and social justice activists, and artists, among others.
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualityInflectra
In this insightful webinar, Inflectra explores how artificial intelligence (AI) is transforming software development and testing. Discover how AI-powered tools are revolutionizing every stage of the software development lifecycle (SDLC), from design and prototyping to testing, deployment, and monitoring.
Learn about:
• The Future of Testing: How AI is shifting testing towards verification, analysis, and higher-level skills, while reducing repetitive tasks.
• Test Automation: How AI-powered test case generation, optimization, and self-healing tests are making testing more efficient and effective.
• Visual Testing: Explore the emerging capabilities of AI in visual testing and how it's set to revolutionize UI verification.
• Inflectra's AI Solutions: See demonstrations of Inflectra's cutting-edge AI tools like the ChatGPT plugin and Azure Open AI platform, designed to streamline your testing process.
Whether you're a developer, tester, or QA professional, this webinar will give you valuable insights into how AI is shaping the future of software delivery.
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...DanBrown980551
Do you want to learn how to model and simulate an electrical network from scratch in under an hour?
Then welcome to this PowSyBl workshop, hosted by Rte, the French Transmission System Operator (TSO)!
During the webinar, you will discover the PowSyBl ecosystem as well as handle and study an electrical network through an interactive Python notebook.
PowSyBl is an open source project hosted by LF Energy, which offers a comprehensive set of features for electrical grid modelling and simulation. Among other advanced features, PowSyBl provides:
- A fully editable and extendable library for grid component modelling;
- Visualization tools to display your network;
- Grid simulation tools, such as power flows, security analyses (with or without remedial actions) and sensitivity analyses;
The framework is mostly written in Java, with a Python binding so that Python developers can access PowSyBl functionalities as well.
What you will learn during the webinar:
- For beginners: discover PowSyBl's functionalities through a quick general presentation and the notebook, without needing any expert coding skills;
- For advanced developers: master the skills to efficiently apply PowSyBl functionalities to your real-world scenarios.
JMeter webinar - integration with InfluxDB and GrafanaRTTS
Watch this recorded webinar about real-time monitoring of application performance. See how to integrate Apache JMeter, the open-source leader in performance testing, with InfluxDB, the open-source time-series database, and Grafana, the open-source analytics and visualization application.
In this webinar, we will review the benefits of leveraging InfluxDB and Grafana when executing load tests and demonstrate how these tools are used to visualize performance metrics.
Length: 30 minutes
Session Overview
-------------------------------------------
During this webinar, we will cover the following topics while demonstrating the integrations of JMeter, InfluxDB and Grafana:
- What out-of-the-box solutions are available for real-time monitoring JMeter tests?
- What are the benefits of integrating InfluxDB and Grafana into the load testing stack?
- Which features are provided by Grafana?
- Demonstration of InfluxDB and Grafana using a practice web application
To view the webinar recording, go to:
https://www.rttsweb.com/jmeter-integration-webinar
UiPath Test Automation using UiPath Test Suite series, part 3DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 3. In this session, we will cover desktop automation along with UI automation.
Topics covered:
UI automation Introduction,
UI automation Sample
Desktop automation flow
Pradeep Chinnala, Senior Consultant Automation Developer @WonderBotz and UiPath MVP
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Accelerate your Kubernetes clusters with Varnish CachingThijs Feryn
A presentation about the usage and availability of Varnish on Kubernetes. This talk explores the capabilities of Varnish caching and shows how to use the Varnish Helm chart to deploy it to Kubernetes.
This presentation was delivered at K8SUG Singapore. See https://feryn.eu/presentations/accelerate-your-kubernetes-clusters-with-varnish-caching-k8sug-singapore-28-2024 for more details.
Knowledge engineering: from people to machines and back
Mikhail Serkov - Zabbix for HPC Cluster Support | ZabConf2016
1. ZABBIX FOR HPC MONITORING AND SUPPORT
Mikhail Serkov
Delivery Manager/HPC Engineer
2016
2. CONFIDENTIAL 2
AGENDA
#1 High Performance Computing – what is it about
#2 Overview of the customer infrastructure and software stack
#3 HPC monitoring – differences from classic support model
#4 How do we use Zabbix
#5 What’s next?
3. Novartis Institute For Biomedical Research (NIBR)
• Scientific research in the Pharma area: Bioinformatics, Computational Chemistry, Drug Discovery, etc.
• About 10k CPU cores used for scientific computation.
• Shared clusters - different workflows can run simultaneously within the same cluster.
• About 500 different scientific tools.
• Custom software (Python, Java, R)
4. HIGH PERFORMANCE COMPUTING
• Hundreds or even thousands of computation nodes
• Grid Computing technologies and software (SGE, UGE, SoGE, PBS, etc)
• Massive parallel computation across the nodes
• Strong requirements for all subsystems on hardware and software level (storage, network, power, OS)
• No magic. Linux boxes, shell scripts on a low level ☺
Example of a job submission:
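The job submission on the original slide was an image and did not survive in this transcript. As a rough stand-in, the sketch below builds a Grid Engine style `qsub` command line. The script name, the job name, the parallel environment name `smp` and the memory resource `h_vmem` are assumptions for illustration only; the real names depend on how a site configures its scheduler.

```python
import shlex


def build_qsub_command(script, name, slots, mem, pe="smp"):
    """Return the argv list for a Grid Engine style batch submission.

    Assumptions: the parallel environment is called "smp" and memory is
    requested through the "h_vmem" resource; both are site-specific.
    """
    return [
        "qsub",
        "-N", name,             # job name
        "-cwd",                 # run in the current working directory
        "-pe", pe, str(slots),  # parallel environment and number of slots
        "-l", f"h_vmem={mem}",  # memory request (resource name is assumed)
        script,
    ]


# Hypothetical job: a 4-slot run of a script called align.sh.
cmd = build_qsub_command("align.sh", name="align_sample", slots=4, mem="4G")
print(shlex.join(cmd))
```

The slot and memory values in such a request are what the scheduler treats as the job's expected resource usage.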
6. TYPICAL COMPUTATION NODE CONFIGURATION
• 28 CPU cores (2 sockets x 14 cores each)
− Intel(R) Xeon(R) CPU E5-2697 v3 @ 2.60GHz
• 200 GB RAM
• 10 Gb Ethernet + InfiniBand interfaces
• 8 GPU cores (4 cards x 2 cores each)
• NFS over 10 Gb Ethernet
• Lustre over InfiniBand
7. OVERVIEW OF SOFTWARE STACK
• More than 500 scientific tools
• Bioinformatics, Computational Chemistry, Crystallography, Molecular Dynamics, etc
• RHEL 6.5
• Univa Grid Engine
• Zabbix 2.4
8. HPC MONITORING DIFFERENCES
• We need information like ‘who, what, when’, not only system metrics.
• Users are allowed to run whatever they want on the computation nodes using the grid scheduler.
• 100% CPU utilization and 100% RAM utilization on a node is perfectly fine.
• A node crash is not such a big deal.
• Global issues are prevented by using aggregated metrics.
• Metrics are used not only for monitoring but also for performance analysis.
• Users have access to the monitoring system (but restricted).
9. WHY ZABBIX?
• Able to monitor huge systems with a lot of metrics
• Flexible
• Out of the box
• Ability to aggregate metrics
• API for data extraction
• GUI convenient for both the support team and scientists
• Autodiscovery
• Automatic configuration of new nodes
10. ZABBIX CONFIGURATION
Server configuration:
• 20 CPU cores ( 2 sockets x 10 cores each )
− Intel(R) Xeon(R) CPU E5-2697 v3 @ 2.60GHz
• 120 GB RAM
Number of hosts: 601
Number of items: ~200k
Number of triggers: ~37k
DB Size: 187GB
11. WHAT DO WE MONITOR
Local metrics (node level):
• All default Linux checks (LA, CPU utilization, RAM, swap, etc) - agent
• Every single GPU core (Temperature, Utilization if possible) - agent
• Every single CPU core (Utilization, Temperature) - agent
• NFS shares availability / utilization / mount details - agent
• Slots / RAM reserved - external scripts
• HPC jobs - external scripts
• ...
Global metrics (cluster level):
• Meta CPU utilization – aggregation of CPU utilization of HPC nodes
• NFS global transmit/retransmit - aggregation of node values
• Grid specific – used/active slots, running jobs, pending jobs, top users - external scripts
• CPU/Memory oversubscription - aggregation of node values
• Overloaded nodes - aggregation of HPC values
• Pending time - external scripts
• ...
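Several of the metrics above come from external scripts. The deck does not show those scripts, so the following is only a minimal sketch of what one could look like: it counts pending jobs in `qstat`-style text and prints the number, since a Zabbix external check takes the item value from the script's standard output. The sample text is invented for illustration, and the column layout (state in the fifth column, after a line of dashes) is an assumption about the scheduler's plain table output.

```python
# Made-up sample in the style of a plain qstat table; not real cluster data.
SAMPLE_QSTAT = """\
job-ID  prior    name    user   state  submit/start at      queue          slots
---------------------------------------------------------------------------------
101     0.55500  align   alice  r      06/01/2016 10:00:00  all.q@node001  4
102     0.55500  dock    bob    qw     06/01/2016 10:05:00                 1
103     0.50000  md_run  carol  qw     06/01/2016 10:06:00                 8
"""


def count_jobs_in_state(qstat_text, states=("qw",)):
    """Count job rows whose state column is one of `states`.

    Assumes a header line, then a line of dashes, then one row per job
    with the job state in the fifth whitespace-separated column.
    """
    count = 0
    in_body = False
    for line in qstat_text.splitlines():
        if line.startswith("---"):
            in_body = True  # everything after the dashes is a job row
            continue
        fields = line.split()
        if in_body and len(fields) >= 5 and fields[4] in states:
            count += 1
    return count


# An external check would print only the value for Zabbix to store.
print(count_jobs_in_state(SAMPLE_QSTAT))
```

In a real script the text would come from running the scheduler's command rather than from a constant, and the set of states counted as pending would need to match the site's scheduler.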
12. HPC specific examples
1) Expected utilization vs. real utilization
Every job has a resource request for the number of CPUs, RAM, etc. At any moment we can compare the real utilization with the expected one. If they are not close, we need to investigate whether someone is oversubscribing resources or overloading nodes.
Solution: Zabbix not only checks current system metrics, but also keeps the expected values. If they differ too much, we receive a warning.
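The deck does not say how the expected and real values are compared. A minimal sketch, assuming one reserved slot corresponds to one fully used core and using an arbitrary tolerance of 15 percentage points, might look like this (function names and the tolerance are hypothetical):

```python
def utilization_gap(reserved_slots, total_cores, cpu_busy_percent):
    """Return real minus expected CPU utilization, in percentage points.

    Assumption: each reserved slot is expected to keep one core busy.
    """
    expected_percent = 100.0 * reserved_slots / total_cores
    return cpu_busy_percent - expected_percent


def classify(gap, tolerance=15.0):
    """Label a node by comparing the gap with a tolerance (assumed value)."""
    if gap > tolerance:
        return "overloaded"  # jobs use more CPU than they requested
    if gap < -tolerance:
        return "underused"   # reserved resources sit idle
    return "ok"


# A 28-core node with 14 slots reserved is expected to be about 50% busy.
gap = utilization_gap(reserved_slots=14, total_cores=28, cpu_busy_percent=95.0)
print(gap, classify(gap))
```

The same comparison can be made for memory by replacing slots and cores with reserved and installed RAM.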
2) Users on a computation node
Users are allowed to SSH to any node (debugging, tracing a job in real time, interactive jobs, etc). However, we should check whether a user has a job on the node he is logged into.
Solution: We have a trigger that notifies us if anyone is logged in to a node with no job running there. Additionally, we store the list of logged-in users at every moment.
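The check behind such a trigger can be sketched as a set difference between the users logged in to a node and the owners of the jobs running there. How the two lists are collected is not described in the deck; the inputs, the function name and the ignored `root` account below are assumptions for illustration.

```python
def users_without_jobs(logged_in_users, job_owners, ignored=("root",)):
    """Return sorted users who are logged in but own no job on this node."""
    return sorted(set(logged_in_users) - set(job_owners) - set(ignored))


logged_in = ["alice", "bob", "root"]  # e.g. parsed from `who` on the node
job_owners = ["alice", "carol"]       # e.g. owners of jobs running on the node
offenders = users_without_jobs(logged_in, job_owners)

# The count could feed a numeric item (a trigger fires when it is above zero),
# and the names a text item that keeps the history of logged-in users.
print(len(offenders), ",".join(offenders))
```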
13. HPC specific examples
Pending time probes
It is really hard to predict the pending time for any particular job in the pending list, as they all have different resource requests and runtimes. The queue is not a FIFO, and the pending time always depends on the resources the user requests.
Solution: Zabbix runs ‘pending probes’ (empty jobs) and checks how long they stay pending. This is a good indicator of the queue state at the moment.
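A pending probe can be sketched as "submit an empty job, then poll until it starts". The deck does not show the implementation, so the function below is a hypothetical outline: `submit` and `has_started` stand for whatever wrappers around the scheduler a site would write, and the polling interval and timeout are arbitrary.

```python
import time


def measure_pending_time(submit, has_started, poll_interval=5.0, timeout=3600.0,
                         clock=time.monotonic, sleep=time.sleep):
    """Submit an empty probe job and return the seconds until it starts.

    `submit()` returns a job id and `has_started(job_id)` reports whether
    the job has left the pending state; both are hypothetical callables
    wrapping the scheduler. Returns None if the probe is still pending
    after `timeout` seconds.
    """
    submitted_at = clock()
    job_id = submit()
    while not has_started(job_id):
        if clock() - submitted_at >= timeout:
            return None
        sleep(poll_interval)
    return clock() - submitted_at
```

The clock and sleep functions are parameters so the logic can be exercised without a cluster. The measured time is only as precise as the polling interval, which is usually enough for a queue-state indicator.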
22. USER ACCESS
We want to provide a limited amount of information to users. They don’t need any info about triggers and issues, only metrics. We have patched Zabbix to remove all unnecessary data for guest access.
[Screenshots: Before, After]
23. Benefits
• Better understanding of global issues on the cluster and the reasons why they happened.
• Great performance indicators for other infrastructure teams (especially the Storage team).
• Performance tuning of scientific workflows. Job profiling. In some cases the information we can get from Zabbix helps us significantly improve the performance of jobs.
• Proactive monitoring. With Zabbix it’s easier to understand if something is not right on the cluster or with some job. In most cases we are able to prevent global cluster issues, or at least minimize the impact.
• One monitoring system for clusters and HPC infrastructure.
• “All in one”. Less effort to support and maintain the monitoring system(s).
24. WHAT’S NEXT?
• Tight integration with Grid HPC software.
• Data analysis using external tools, but with Zabbix as the data source.
• Create a set of CLI utilities for getting Zabbix statistics in ‘human-readable’ format.
• Automation of job profiling using the Zabbix API.
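As an illustration of the last two points, the sketch below builds the JSON-RPC body for the Zabbix API method `history.get` and formats a number with a unit suffix. It only constructs the request and sends nothing. Passing the token in the `auth` field follows the API of the Zabbix 2.x series used here (newer versions may expect it elsewhere), and the token, item id and timestamps are placeholders, not values from the presentation.

```python
import json


def build_history_request(auth_token, item_ids, time_from, time_till, request_id=1):
    """Build a JSON-RPC 2.0 body for the Zabbix API method history.get."""
    return {
        "jsonrpc": "2.0",
        "method": "history.get",
        "params": {
            "output": "extend",
            "history": 0,            # 0 = numeric float values
            "itemids": item_ids,
            "time_from": time_from,  # Unix timestamps
            "time_till": time_till,
            "sortfield": "clock",
            "sortorder": "ASC",
        },
        "auth": auth_token,
        "id": request_id,
    }


def human_readable(value, unit=""):
    """Format a number with K/M/G/T suffixes (powers of 1000)."""
    for suffix in ("", "K", "M", "G", "T"):
        if abs(value) < 1000 or suffix == "T":
            return f"{value:.1f} {suffix}{unit}".rstrip()
        value /= 1000.0


# Placeholder token, item id and time window.
request = build_history_request("<token>", ["12345"], 1464775200, 1464778800)
print(json.dumps(request, indent=2))
print(human_readable(187_000_000_000, "B"))
```

A CLI utility would POST this body to the server's `api_jsonrpc.php` endpoint and run the returned values through a formatter such as `human_readable`.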