1) The document discusses setting up a system to monitor metrics from multiple devices using Graphite and RabbitMQ. It notes that Graphite stores and displays metrics and RabbitMQ is a message broker.
2) It explains the switch to the CFQ I/O scheduler, which aims for a fair allocation of disk bandwidth, and the request deadlines used to guard against cascading failures.
3) It gives troubleshooting tips for issues such as OOM-killed processes and low delivery rates.
2. Important information
● There are sweets
● You can always do better on a greenfield project
● Let's skip the question of why I didn't go to the Prometheus School of Running Away From Things *
● I tend to use metaphors that are not suitable for all audiences ^_^
* https://www.youtube.com/watch?v=-BWnTW4rL0U (spoiler alert)
3. Context
● A system for viewing graphs of values from a multitude of devices. It sounds like IoT, but don't get carried away by the hype
● RabbitMQ is a message broker. One of its uses is decoupling producers from consumers *
● Graphite is a system for storing and displaying graphs, split into multiple components
* http://www.eferro.net/2017/09/pub-sub-swiss-army-knife-tech-pill.html
6. Change the I/O scheduler to CFQ
The main aim of CFQ scheduler is to provide a fair allocation of the disk
I/O bandwidth for all the processes which requests an I/O operation.
CFQ maintains the per process queue for the processes which request I/O
operation(synchronous requests). In case of asynchronous requests, all the
requests from all the processes are batched together according to their
process's I/O priority.
https://www.kernel.org/doc/Documentation/block/cfq-iosched.txt
https://www.kernel.org/doc/Documentation/block/ioprio.txt - bonus track
vs I
13. vs IV
# Limits the number of whisper update_many() calls per second, which effectively
# means the number of write requests sent to the disk. This is intended to
# prevent over-utilizing the disk and thus starving the rest of the system.
# When the rate of required updates exceeds this, then carbon's caching will
# take effect and increase the overall throughput accordingly.
# MAX_UPDATES_PER_SECOND = 500
Describe the system of processes that collect metrics from devices and publish them to RabbitMQ to be consumed by Graphite.
If Kafka comes up, mention that Kafka requires consumers to keep track of where they are. https://content.pivotal.io/blog/understanding-when-to-use-rabbitmq-or-apache-kafka
Diagram of the original setup
SSH into the metrics server to enjoy response times that make you cry. This is the problem with keeping everything on the same physical disks: when there is a lot of I/O, dedicated disks are best; otherwise you get stuck in the traffic jam.
My reasoning is that the problem is the wild read/write cycle Graphite is running, which is so frequent that it starves everything else. So I decide to try changing the scheduler so that other processes get a share of the I/O. Besides, if things got ugly, ionice was available.
There seem to be other, newer schedulers in the kernel that will be worth trying.
https://lwn.net/Articles/720675/
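Before changing the scheduler it helps to see which one is active. Linux exposes this in /sys/block/<device>/queue/scheduler as a line such as "noop deadline [cfq]", with the active scheduler in brackets. The following is a minimal sketch of reading it; the device name sda is an assumption, not something stated in the talk.

```python
from pathlib import Path


def parse_active_scheduler(text):
    """Return (active, available) from a sysfs scheduler line.

    The kernel marks the active scheduler with square brackets,
    e.g. 'noop deadline [cfq]'.
    """
    active = None
    available = []
    for name in text.split():
        if name.startswith("[") and name.endswith("]"):
            name = name[1:-1]
            active = name
        available.append(name)
    return active, available


def read_scheduler(device="sda"):
    # Assumption: 'sda' is a placeholder; use the disk Graphite writes to.
    path = Path(f"/sys/block/{device}/queue/scheduler")
    return parse_active_scheduler(path.read_text())


if __name__ == "__main__":
    print(parse_active_scheduler("noop deadline [cfq]"))
```

Writing one of the available names to the same file as root switches the scheduler for that device until the next reboot.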
Graphite was migrated to another machine, and while we were at it we put it in Docker to be hipster enough (really it was so that future Graphite installations would be reproducible). The curious thing is that Riemann started showing high response times and NGINX errors that knocked it out.
Looking at the logs, remembering what I had read in a Netflix post and in the SRE book about cascading failures, and after running load tests with ab, I decided, using the Force, that nginx should cancel the operation if a graph took more than 5 seconds. As Dan Luu puts it, you have to set deadlines to avoid zombie requests.
https://danluu.com/google-sre-book/
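The deadline idea can be sketched from the worker's side: the caller passes an absolute deadline and the worker checks it between units of work, so an abandoned request stops consuming resources. This is only an illustration of the principle, not the nginx setting used in the talk; render_points and its doubling step are hypothetical stand-ins for real graph rendering.

```python
import time


class DeadlineExceeded(Exception):
    """Raised when work is abandoned because its deadline has passed."""


def render_points(points, deadline):
    """Process points until done or until the absolute deadline passes.

    `deadline` is a time.monotonic() timestamp supplied by the caller.
    """
    rendered = []
    for point in points:
        if time.monotonic() >= deadline:
            raise DeadlineExceeded(
                f"gave up after {len(rendered)} of {len(points)} points"
            )
        rendered.append(point * 2)  # stand-in for the real rendering work
    return rendered


if __name__ == "__main__":
    # Same 5-second budget as the limit mentioned in the notes.
    print(render_points([1, 2, 3], time.monotonic() + 5.0))
```

Checking inside the loop is what prevents zombie requests: a timeout enforced only by the caller would leave the worker running after nobody is waiting for the answer.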
Before migrating the servers (again), I decide to run load tests on AWS. This is the magic moment when you discover that Graphite reading metrics over AMQP is rubbish: it maxes out the CPU before saturating the disks.
In the end, to keep testing I had to start using the carbon-relay processes and edit their code so that they all used the same named queue.
Diagram of the result
Going through the nginx logs again, I saw that many graph requests were not loading because they took more than 5 seconds.
In the end, after a bit of syscall tracing, I see that with the disk so swamped by writes (yes, on a dedicated disk) reads could not be served; besides the monitoring itself, we have people keeping windows open to watch the graphs as if it were a NOC. So I looked at the updates per second, gathered statistics on the average, and then lowered this value in the Graphite configuration to 70% of it to leave room for reads. Perhaps queue depth and that sort of thing needed tuning too =)
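The arithmetic behind that setting is small enough to sketch. The sample values below are invented for illustration; the talk does not give the measured rates.

```python
from statistics import mean


def suggested_max_updates(samples, fraction=0.70):
    """Return `fraction` of the mean observed updates per second."""
    if not samples:
        raise ValueError("need at least one sample")
    return round(mean(samples) * fraction)


if __name__ == "__main__":
    # Hypothetical updates-per-second readings taken from carbon's own metrics.
    observed = [400, 500, 600]
    print(f"MAX_UPDATES_PER_SECOND = {suggested_max_updates(observed)}")
```

With these made-up samples the mean is 500, so the suggested value is 350. Capping writes below the observed average makes carbon-cache hold more points in memory, which is what frees disk time for reads.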
Diagram of the result
I had @adrianco's talk on chaos engineering fresh in my mind and, taking advantage of having Graphite and the broker duplicated, I felt like finding out whether our configuration would really survive an event as mundane as an upgrade (yes, an upgrade was due).
One of the figures I had to validate was that carbon-cache would take approximately 5 hours and 30 minutes to shut down cleanly. So we stop the relays and the cache. This makes the broker accumulate messages until, boom, the connection to both machines is lost; connecting to the hypervisor, I see that the OOM killer took out the KVM processes.
The fix was as simple as reconfiguring the memory of the 2 virtual machines so that they do not blow up if they use all their assigned memory.
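One way to read that fix is as a sizing check: the memory assigned to the guests, plus some headroom for the hypervisor itself, must fit in the host's RAM. The sketch below assumes that interpretation; every figure in it is hypothetical, including the 2 GiB reserve.

```python
def fits_in_host(host_mb, vm_mbs, host_reserve_mb=2048):
    """True if all guests can use their full allocation without overcommit.

    host_reserve_mb is an assumed allowance for the hypervisor's own needs.
    """
    return sum(vm_mbs) + host_reserve_mb <= host_mb


if __name__ == "__main__":
    host = 32768  # hypothetical 32 GiB host
    print(fits_in_host(host, [16384, 16384]))  # guests alone fill the host
    print(fits_in_host(host, [14336, 14336]))  # leaves room for the host
```

If the check fails, a guest that legitimately fills its allocation can push the host into the OOM killer, which is the failure described above.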
Now the other instance had to be upgraded, so it was a good opportunity to test the shutdown and restore process again. This time the shutdown caused no problems, other than RabbitMQ no longer accepting messages from the shovels once it had used all its memory.
The fun part was restoring the service. The queue was full of messages and they were being read very slowly. I thought it was the relays, so I started even more relays and it did not get better. In the end I realised RabbitMQ was maxing out the disk. The problem seemed to be that the messages to deliver were not in cache and had to be read from disk while an avalanche of messages kept arriving.
In the end, to improve the speed I used a two-pronged strategy. On one hand, vmtouch (https://hoytech.com/vmtouch/) to keep the Mnesia files in cache; on the other, blocking inbound traffic to the broker and accepting it little by little with iptables and the statistic module, taking advantage of the fact that TCP slows down when many packets are lost =)
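The gradual re-admission can be sketched as a series of iptables rules that drop a shrinking fraction of inbound packets. The talk does not show the actual rules, so the chain, the port (5672, the AMQP default) and the probability steps below are all assumptions. The code only builds the command lines and does not run them.

```python
def drop_rule(probability, port=5672):
    """Build an iptables command that randomly drops inbound TCP packets.

    Uses the `statistic` match in random mode. Port 5672 is the AMQP
    default and is an assumption about the broker's configuration.
    """
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    return [
        "iptables", "-A", "INPUT", "-p", "tcp", "--dport", str(port),
        "-m", "statistic", "--mode", "random",
        "--probability", f"{probability:.2f}", "-j", "DROP",
    ]


def ramp(steps=(0.90, 0.50, 0.10)):
    """Hypothetical schedule: drop 90%, then 50%, then 10% of packets."""
    return [drop_rule(p) for p in steps]


if __name__ == "__main__":
    for command in ramp():
        print(" ".join(command))
```

In practice each step would replace the previous rule rather than be appended after it, and the last step would remove the rule altogether once the backlog has drained.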