Deploying Prometheus is easy, and running a single instance can be sufficient for most deployments. We will talk about the scalability limits of a Prometheus instance, when and how to use sharding, what Trickster is and why you should use it too, and how Thanos can help you when all hope is lost.
2. Labyrinth Labs
Rock-solid infrastructure and DevOps
● Building rock-solid and secure foundations for all your digital operations. Our
mission is to let you focus on your business without ever needing to worry
about technical issues again.
● Making you ready for growing traffic, safe against new security vulnerabilities
and data-loss.
3. TL;DR
● We will start with common monitoring issues and problems.
● Deploying Prometheus is easy and running a single instance can be sufficient
for most deployments.
● We will have a quick look at Alertmanager.
● We will talk about the scalability limits of a Prometheus instance, and when and how to
use sharding.
● What Trickster is and why you should use it too.
● How Thanos/Cortex can help you when all hope is lost.
4. Common Monitoring Problems
● Monitoring tools are limited both technically and conceptually
● Most existing tools don’t really scale with current infrastructure needs.
● Limited visibility
○ Generally we want to monitor and gather as much information as we can.
○ Even if we don’t need it right away, it will usually be useful in the future (I promise).
● No common application monitoring interface. There are different
protocols/standards
○ OpenMetrics
○ SNMP
6. Prometheus Monitoring System
The Prometheus monitoring system and time series database is a CNCF graduated
project.
● Originally developed by ex-Googlers for SoundCloud as their internal monitoring
system
● Inspired by Google’s Borgmon monitoring system
● Open source under the Apache License
● Written as a monolithic application in Go
7. Prometheus Server Overview
● Multi-dimensional data model with time series data identified by metric
name and key/value pairs (labels)
● PromQL, a flexible query language to leverage this dimensionality
● No reliance on distributed storage; single server nodes are autonomous
● Targets are discovered via service discovery or static configuration
● Pushing time series is supported via an intermediary gateway
● Monitor Services not Machines/Servers
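To make the data model above concrete, here is a minimal sketch of how a metric name plus a set of labels identifies one time series. The metric name `http_requests_total`, the label names, and the helper `series_id` are hypothetical, chosen only for illustration; this is not how Prometheus stores series internally.

```python
def series_id(metric_name, labels):
    """Render a series identity in the familiar name{label="value"} notation."""
    # Sort the labels so that the same label set always yields the same identity,
    # regardless of the order in which the labels were supplied.
    rendered = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    return f"{metric_name}{{{rendered}}}"


# Same metric name and same label set: one and the same series.
a = series_id("http_requests_total", {"method": "get", "env": "prod"})
b = series_id("http_requests_total", {"env": "prod", "method": "get"})

# Changing a single label value produces a different series.
c = series_id("http_requests_total", {"method": "post", "env": "prod"})
```

Every distinct label combination is its own series, which is why label choices matter so much for the cardinality discussion later in the deck.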
9. Company Prometheus Usage
● Deploy the first Prometheus servers
● Add some services
● Set up Trickster as a Grafana cache
● Add more services/servers
● Keep adding CPU/memory to the Prometheus instance
● Set up simple federation/sharding if a single instance gets too big
● Use Thanos
10. First Prometheus Deployment
● Deploying your first Prometheus server is very easy: fetch the Prometheus
binary and a config file.
● There is no concept of a Prometheus cluster.
● Generally Prometheus scales very well with CPU/memory
○ Providing more CPU/memory allows Prometheus to handle more
metrics
○ It’s hard to run a large pod in a Kubernetes cluster if it’s as big as a
worker node.
● If a job is too big for a single server, you can use federation/sharding
(remote reads) for simple scaling
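The sharding idea can be sketched as follows: each scrape target is hashed, and the hash modulo the number of Prometheus servers decides which server scrapes it. This is loosely modeled on the `hashmod` relabel action; the exact hashing shown here, the helper names, and the target addresses are assumptions made for illustration.

```python
import hashlib


def shard_for(target: str, num_shards: int) -> int:
    """Map a target address to a shard index in [0, num_shards)."""
    digest = hashlib.md5(target.encode("utf-8")).digest()
    # Use the last 8 bytes of the digest as an unsigned integer (an assumption
    # for this sketch), then reduce it modulo the number of shards.
    return int.from_bytes(digest[8:], "big") % num_shards


def split_targets(targets, num_shards):
    """Group targets by the shard that should scrape them."""
    shards = {i: [] for i in range(num_shards)}
    for target in targets:
        shards[shard_for(target, num_shards)].append(target)
    return shards


# Hypothetical targets split across three Prometheus servers.
targets = [f"10.0.0.{i}:9100" for i in range(1, 21)]
shards = split_targets(targets, 3)
```

Because the assignment depends only on the target address and the shard count, every server can compute it independently without coordination. Changing the shard count reshuffles most targets, which is one reason this stays a "simple" scaling approach.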
11. Trickster Setup
● Loading a complicated/big dashboard in Grafana can overload your
Prometheus server
○ Use Trickster to cache PromQL results for future reuse
○ Queries on metrics with high cardinality can use a lot of memory on
your Prometheus instance [1].
○ Use limits to make sure users cannot overload your server:
--query.max-concurrency / --query.max-samples
● Delta proxy caching: Trickster inspects the time range of a client query to
determine which data points are already cached, and requests only the missing ones from Prometheus
1. https://www.robustperception.io/limiting-promql-resource-usage
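The delta proxy idea can be illustrated with a small range calculation: given the time range already in the cache and the range a dashboard asks for, only the uncovered gaps need to be fetched from Prometheus. This is a simplified sketch under the assumption of a single contiguous cached range; it is not Trickster's actual implementation, and `missing_ranges` is a hypothetical helper.

```python
def missing_ranges(cached, requested):
    """Return the parts of `requested` that are not covered by `cached`.

    Ranges are (start, end) tuples of timestamps, start inclusive and end
    exclusive. `cached` may be None when nothing has been cached yet.
    """
    req_start, req_end = requested
    if cached is None:
        return [requested]
    c_start, c_end = cached
    # No overlap at all: the whole requested range has to be fetched.
    if c_end <= req_start or c_start >= req_end:
        return [requested]
    gaps = []
    if req_start < c_start:
        gaps.append((req_start, c_start))  # older data missing from the cache
    if req_end > c_end:
        gaps.append((c_end, req_end))      # newer data missing from the cache
    return gaps


# A dashboard refresh usually asks for a slightly newer window than before,
# so only the small tail at the end has to be queried upstream.
tail_only = missing_ranges(cached=(100, 200), requested=(150, 260))
```

With auto-refreshing dashboards the requested window mostly overlaps the previous one, so the upstream query shrinks to a short recent interval.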
14. Metrics Cardinality
● Prometheus performance almost always comes down to one thing: metric
cardinality.
● Cardinality describes how many unique label combinations (time series) a metric has
○ container_tasks_state will have a unique series for each (pod, container) pair of every running
container in your cluster
○ custom_api_http_request will have a unique series for each combination of
url/http_method/env (/api/v2/users, get, dev; /api/v2/users, post, prod; ...)
1. https://www.robustperception.io/cardinality-is-key
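A quick way to reason about cardinality is to multiply the number of possible values of each label. The sketch below does this for the custom_api_http_request example; the concrete label values and the `user_id` label are made up for illustration.

```python
from itertools import product


def worst_case_series(label_values):
    """Upper bound on the number of series: product of per-label value counts."""
    count = 1
    for values in label_values.values():
        count *= len(values)
    return count


# Hypothetical label values for custom_api_http_request.
labels = {
    "url": ["/api/v2/users", "/api/v2/orders", "/api/v2/items"],
    "http_method": ["get", "post"],
    "env": ["dev", "prod"],
}

total = worst_case_series(labels)

# Enumerating the combinations gives the same number of series.
combinations = list(product(*labels.values()))

# Adding one unbounded label (for example a user ID) multiplies everything.
with_user_id = worst_case_series({**labels, "user_id": list(range(100))})
```

The multiplication is why a single label with many values, such as a user ID or a file inode, can dominate the memory usage of a Prometheus server.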
15. Bad Metrics Cardinality
● In our example, we threw away bad fluentd metrics and cut the number of
scraped metrics in half
● If you are using fluentd, look for fluentd_tail_file_inode and fluentd_tail_file_position
○ In our use case, these two metrics alone had a cardinality of 1220 per node!
1. https://www.robustperception.io/cardinality-is-key
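Dropping such metrics is typically done at scrape time with a drop rule in metric_relabel_configs. The sketch below only models the matching logic in Python so the effect is easy to see; the list of scraped names is hypothetical, and the real rule would live in the Prometheus configuration rather than in code.

```python
import re

# Matches exactly the two high-cardinality fluentd metrics named above.
DROP_PATTERN = re.compile(r"fluentd_tail_file_(inode|position)")


def keep_sample(metric_name: str) -> bool:
    """Return False for metrics that the drop rule should discard."""
    # fullmatch mirrors the fact that Prometheus relabel regexes are anchored
    # at both ends, so a partial match is not enough to drop a metric.
    return DROP_PATTERN.fullmatch(metric_name) is None


# Hypothetical metric names from one scrape.
scraped = [
    "fluentd_tail_file_inode",
    "fluentd_tail_file_position",
    "fluentd_output_status_retry_count",
    "up",
]
kept = [name for name in scraped if keep_sample(name)]
```

Dropped samples never reach the TSDB, so they cost neither memory nor disk on the Prometheus server.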
16. Thanos/Cortex as ultimate solution
● If you have multiple Kubernetes clusters or datacenters with millions of
metrics, and adding more CPU/memory to Prometheus is not an option:
○ Consider adding Thanos/Cortex to your infrastructure
● Thanos Querier provides HA for Prometheus servers: it can load metrics from multiple
Prometheus servers and make sure the user is presented with complete data.
○ Implements the Prometheus HTTP v1 API.
● Thanos Compactor can downsample your metrics and enforce retention for
each resolution.
● Thanos Store is a component that serves metrics kept in an AWS S3
compatible object store.
18. Thanos SideCar
● It implements Thanos’ Store API on top of Prometheus’ remote-read API. This allows
Queriers to treat Prometheus servers as yet another source of time series data without
directly talking to its APIs.
● Optionally, the sidecar uploads TSDB blocks to an object storage bucket as Prometheus
produces them every 2 hours. This allows Prometheus servers to be run with relatively
low retention while their historic data is made durable and queryable via object storage.
● Optionally, the Thanos sidecar can watch Prometheus rules and configuration,
decompress them and substitute environment variables if needed, and ping Prometheus to
reload them.
19. Thanos Query
● The PromQL query is posted to the Querier
● The Querier interprets the query and passes it to a pre-filter
● The Querier fans out its requests to stores, Prometheus servers, or other Queriers on the basis of labels and
time-range requirements
● The Querier only sends and receives StoreAPI messages
● After it has collected all the responses, it merges and deduplicates them (if enabled)
● It then sends the series back to the user
1. https://banzaicloud.com/img/blog/multi-cluster-monitoring/life_of_a_query.png
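The merge and deduplication step can be sketched as follows, assuming that HA replicas differ only in a label named `replica`. The label name, the data layout, and the "first replica wins" rule are all simplifying assumptions; Thanos' real deduplication logic is more elaborate.

```python
def merge_and_dedup(responses, replica_label="replica"):
    """Merge series from several sources and collapse HA replicas.

    `responses` is a list of per-source results; each result is a list of
    (labels, samples) pairs, where labels is a dict and samples is a list of
    (timestamp, value) tuples.
    """
    merged = {}
    for series_list in responses:
        for labels, samples in series_list:
            # Ignore the replica label so both replicas map to the same series.
            key = tuple(sorted((k, v) for k, v in labels.items() if k != replica_label))
            points = merged.setdefault(key, {})
            for timestamp, value in samples:
                # Keep the first value seen for a timestamp; later replicas
                # only fill in the gaps.
                points.setdefault(timestamp, value)
    return {key: sorted(points.items()) for key, points in merged.items()}


# Two replicas of the same Prometheus, each missing one scrape.
replica_a = [({"__name__": "up", "job": "api", "replica": "a"}, [(10, 1.0), (20, 1.0)])]
replica_b = [({"__name__": "up", "job": "api", "replica": "b"}, [(20, 1.0), (30, 1.0)])]

result = merge_and_dedup([replica_a, replica_b])
```

The user sees one gap-free series even though neither replica had all the samples, which is the "full data" promise made for the Querier earlier in the deck.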