OSMC 2015: Testing in Production by Devdas Bhagat


For most ecommerce companies, software is not the final deliverable product. It is a research tool for determining what customers will pay for. To get good data from software, monitoring and analytics must be built into the system. Alerting must come from business requirements and be based on application-generated data. In the traditional operations world, we monitor what is easy and avoid monitoring what is difficult. This talk argues that monitoring must be driven by metrics from the CxO office, and only then involve technical metrics if needed. It explains why functional and business-level monitoring is crucial, and covers the trade-offs in moving from a DTAP model to continuous deployment. There will be a brief introduction to a couple of useful tools for functional monitoring. No special technical skills are expected of the audience, but a general overview of the monitoring world helps. This talk is not limited to ecommerce companies, but is most applicable to that environment.


Testing
“I have a test network. You may know it as production.”
– Andreas Thienemann

Asking the right questions
● Information is not a scarce resource, attention is
– Herb Simon
● What do you know?
● What do you not know?
● What do you not know that you do not know?

Fast Feedback Loops
● Test driven business
– It's like Test Driven Design
● Enable IT to speak the language of business
– Show me the data!
– HiPPO (the highest paid person's opinion)

What makes software “good”?
● Developer metrics
– Object Oriented/Functional/
– Is testable
– Follows DRY
– …
● Operational metrics
– Bug “free”
– Scalable
– Secure

What makes software “good”?
● Business metrics
– Does it help me make money?
– Does it save me money?
– Does it generate value?

Testing vs Reality
● Testing
– Stable environment
– No humans
– Low latency
– Not always the same size of dataset
● Reality
– Highly unstable
– Humans
– Potentially high latency
– Large, ever changing datasets

Users
● Humans do strange things
● Or sometimes make mistakes
● They come up with different requirements
● They change the world your software works in

Risk management
● Approach 1:
– Scope your problems well
– Test a lot
– Release stable code
– Avoid changing a working system

Risk management
● Approach 2:
– Accept that you have an ill-defined problem
– Iterate rapidly
– Make a large number of small changes
– Build software to be able to isolate these changes
– Test them in the real world
– Keep only what works
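
One way to read "build software to be able to isolate these changes" is a feature flag with a percentage rollout. The sketch below is only an illustration under that assumption; the function names, the experiment name and the button labels are hypothetical and do not come from the talk.

```python
import hashlib


def in_experiment(user_id: str, experiment: str, rollout_percent: int) -> bool:
    """Deterministically decide whether a user sees an experimental change."""
    # Hashing the experiment name together with the user id gives each user
    # a stable bucket per experiment, so they see consistent behaviour.
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in the range 0..99
    return bucket < rollout_percent


def checkout_button_label(user_id: str, rollout_percent: int = 10) -> str:
    # The change is isolated behind a single conditional. Setting
    # rollout_percent to 0 removes it for everyone without a new release.
    if in_experiment(user_id, "new-checkout-label", rollout_percent):
        return "Buy now"
    return "Add to cart"
```

Keeping each change behind its own flag makes it possible to compare the two groups in production and then keep only the variant that works.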

When (not if) it breaks
● Fix fast (maybe)
– “Do it right the first time” does not apply
● Business process for handling failure
– Hardware will eventually fail, software will eventually work

The lifetime of code
● How long does your code live?
– Hours or days?
● This should be most code out there
– Months?
● A little code, often “libraries” with a single application

Event processing
● Generate information about software use and changes in realtime
● For more information and tooling:
– https://www.quora.com/Are-there-any-open-source-CEP-tools?share=1
– https://en.wikipedia.org/wiki/Complex_event_processing
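
As a rough illustration of generating information about software use and changes, an application could emit one structured event per business action or deployment. This is a minimal sketch; the field names and the `make_event` helper are assumptions made for the example, not a format prescribed by the talk or by any particular CEP tool.

```python
import json
import time
from typing import Any, Dict, Optional


def make_event(service: str, name: str, value: float = 1.0,
               tags: Optional[Dict[str, Any]] = None,
               now: Optional[float] = None) -> str:
    """Serialize one application event as a single JSON line."""
    event = {
        "time": time.time() if now is None else now,  # seconds since the epoch
        "service": service,   # which application produced the event
        "name": name,         # what happened, e.g. "order_completed"
        "value": value,       # a business-relevant number, e.g. order total
        "tags": tags or {},   # extra context, e.g. the deployed version
    }
    return json.dumps(event, sort_keys=True)


# Usage: one event for a sale, one for a rollout (values are made up).
print(make_event("shop", "order_completed", 49.90, {"version": "abc123"}))
print(make_event("shop", "deploy_finished", tags={"version": "abc124"}))
```

Emitting deployments as events alongside business events lets both appear on the same timeline.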

Monitoring/Alerting
● Process events to generate graphs
● Riemann is an excellent tool for generating alerts from event streams
● Generate graphs as close to realtime as possible
– Developers doing rollouts know that something else is changing
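
The idea of alerting from an event stream can be sketched as a sliding-window check on a business event, for example "fewer than N completed orders in the last minute". The code below is plain Python for illustration only. It is not Riemann configuration (Riemann streams are written in Clojure), and the `RateAlert` class, its parameters and the thresholds are hypothetical.

```python
from collections import deque
from typing import Deque


class RateAlert:
    """Alert when too few events arrive within a sliding time window."""

    def __init__(self, window_seconds: float, min_events: int) -> None:
        self.window_seconds = window_seconds
        self.min_events = min_events
        self._timestamps: Deque[float] = deque()

    def record(self, timestamp: float) -> None:
        # Assumes events are recorded in non-decreasing time order.
        self._timestamps.append(timestamp)

    def is_alerting(self, now: float) -> bool:
        # Drop events that have fallen out of the window, then compare the
        # remaining count against the business threshold.
        cutoff = now - self.window_seconds
        while self._timestamps and self._timestamps[0] < cutoff:
            self._timestamps.popleft()
        return len(self._timestamps) < self.min_events


# Usage: expect at least 3 completed orders per 60 seconds.
orders = RateAlert(window_seconds=60, min_events=3)
for t in (0.0, 10.0, 20.0):
    orders.record(t)
print(orders.is_alerting(now=30.0))  # three orders in the window
print(orders.is_alerting(now=75.0))  # only the order at t=20 remains
```

The threshold here is a business number (orders per minute) rather than a host metric, which matches the talk's argument that alerts should come from business requirements.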
