Presented at Monitorama 2016 in Portland, OR. This presentation covers using statistical information to simulate high-volume data traffic during product development.
A recommendation system that suggests similar bugs and estimates the effort required to fix them. Each defect is broken into a set of keywords, and ML algorithms are applied to calculate a similarity coefficient.
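The deck does not name the exact algorithm, but a minimal keyword-overlap approach can be sketched with Jaccard similarity; the tokenizer, stopword list, and function names below are illustrative, not the talk's actual implementation:

```python
def keyword_set(defect_text):
    """Tokenize a defect description into a set of lowercase keywords."""
    stopwords = {"the", "a", "an", "is", "on", "in", "when", "to", "of"}
    return {w for w in defect_text.lower().split() if w not in stopwords}

def similarity(defect_a, defect_b):
    """Jaccard similarity between two defects' keyword sets (0.0 to 1.0)."""
    a, b = keyword_set(defect_a), keyword_set(defect_b)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)
```

Ranking candidate defects by this coefficient and surfacing the top matches is the essence of such a similar-bug recommender.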
- Lifeguard is a set of optimizations to the SWIM protocol and the memberlist failure detector that make them more robust. It addresses an issue seen in the field: healthy nodes being falsely detected as failed.
- Lifeguard introduces three components: self-awareness via a node health counter, dogpiling, which requires multiple independent suspicions before declaring a failure, and a buddy system that lets suspected nodes refute suspicions more quickly.
- Experiments show Lifeguard significantly reduces false positives, with only modest increases in latency and network load in pathological situations. It gives users a tunable way to trade off failure-detection speed against false positives.
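A minimal sketch of the self-awareness idea, assuming a simple counter that inflates the node's own probe timeouts when it misses acks (the real memberlist implementation uses its own event set and bounds):

```python
class LocalHealth:
    """Lifeguard-style self-awareness sketch: a node that misses acks
    inflates its own failure-detection timeouts instead of hastily
    suspecting healthy peers."""

    def __init__(self, base_timeout=0.5, max_multiplier=8):
        self.base_timeout = base_timeout
        self.max_multiplier = max_multiplier
        self.score = 0  # 0 = healthy; higher = node trusts itself less

    def on_probe_timeout(self):
        # Missing an ack may mean *we* are degraded, not the peer.
        self.score = min(self.score + 1, self.max_multiplier - 1)

    def on_ack_received(self):
        self.score = max(self.score - 1, 0)

    def probe_timeout(self):
        # A degraded node waits longer before suspecting others.
        return self.base_timeout * (self.score + 1)
```

The effect is that a node experiencing local slowness (GC pause, CPU starvation) backs off its accusations, which is where many of the field false positives came from.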
Lifting the Blinds: Monitoring Windows Server 2012 - Datadog
Operating systems monitor resources continuously in order to effectively schedule processes.
In this webinar, Evan Mouzakitis (Datadog) discusses how to get operational data from Windows Server 2012 using a variety of native tools.
This document summarizes the key aspects of a public cloud archive storage solution. It offers affordable and unlimited storage using standard transfer protocols. Data is stored using erasure coding for redundancy and fault tolerance. Accessing archived data takes 10 minutes to 12 hours depending on previous access patterns, with faster access for inactive archives. The solution uses middleware to handle sealing and unsealing archives along with tracking access patterns to regulate retrieval times.
Monitoring Docker at Scale - Docker San Francisco Meetup - August 11, 2015 - Datadog
In this session I showed building a multi-container app from beginning to end, using Docker, Docker-Machine, Docker-Compose and everything in between. You can even try it out yourself using the link in the deck to a repo on GitHub.
Leveraging open source tools to gain insight into OpenStack Swift - Dmitry Sotnikov
Performance monitoring and troubleshooting of cloud-based object storage is as much an art as a science. Although there is a plethora of open source monitoring tools that gather system metrics, the real challenge is how to use them to find the root cause of a problem.
In this presentation we present a general, open-source-based, step-by-step methodology for understanding performance bottlenecks in an OpenStack Swift system. Our approach uses standard tools including Logstash, collectd, StatsD, Elasticsearch, Kibana, and Graphite. We also describe an additional simple Swift middleware we developed to help gain further insights. Finally, we demonstrate results obtained by applying our approach to an internal deployment of OpenStack Swift.
Resource Scheduling using Apache Mesos in Cloud Native Environments - Sharma Podila
This document discusses using Apache Mesos for scheduling heterogeneous resources in a cloud environment. It describes Mantis, a Mesos framework for reactive stream processing. Mantis provides lightweight jobs, dynamic scaling, and custom SLAs. Fenzo is introduced as Mantis' task scheduler, which uses plugins for constraints, fitness functions, and autoscaling. Mantis allows for stream locality, backpressure handling, and job autoscaling. The document argues that Mesos provides benefits over instance-level scheduling through finer-grained resource allocation and faster task startup times.
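Fenzo's pluggable fitness calculators can be illustrated with a toy CPU bin-packing score; the function names and host representation below are hypothetical, not Fenzo's actual Java API:

```python
def cpu_bin_packing_fitness(task_cpus, host_used_cpus, host_total_cpus):
    """Fitness in [0, 1]: higher when the task packs the host more tightly,
    so the scheduler fills hosts before spreading to empty ones."""
    if task_cpus + host_used_cpus > host_total_cpus:
        return 0.0  # task doesn't fit on this host at all
    return (host_used_cpus + task_cpus) / host_total_cpus

def pick_host(task_cpus, hosts):
    """Choose the best-fitting host; hosts is a list of
    (name, used_cpus, total_cpus) tuples."""
    scored = [(cpu_bin_packing_fitness(task_cpus, used, total), name)
              for name, used, total in hosts]
    best_score, best_name = max(scored)
    return best_name if best_score > 0 else None
```

Tight bin packing like this is also what makes cluster autoscaling practical: fully drained hosts can be terminated, which is part of the finer-grained allocation argument above.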
Storm makes it easy to write and scale complex realtime computations on a cluster of computers, doing for realtime processing what Hadoop did for batch processing. Storm guarantees that every message will be processed. And it’s fast — you can process millions of messages per second with a small cluster. Best of all, you can write Storm topologies using any programming language. Storm was open-sourced by Twitter in September of 2011 and has since been adopted by many companies around the world.
Storm has a wide range of use cases, from stream processing to continuous computation to distributed RPC. In this talk I'll introduce Storm and show how easy it is to use for realtime computation.
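Real Storm topologies are wired up with TopologyBuilder and run distributed across a cluster; the following single-process Python sketch only illustrates the spout-to-bolt dataflow of a word count, with invented class names:

```python
class SentenceSpout:
    """Toy stand-in for a Storm spout: a source that emits tuples."""
    def __init__(self, sentences):
        self.sentences = sentences

    def emit(self):
        yield from self.sentences

class SplitBolt:
    """Bolt that splits each sentence tuple into word tuples."""
    def process(self, sentence):
        yield from sentence.split()

class CountBolt:
    """Terminal bolt that accumulates per-word counts."""
    def __init__(self):
        self.counts = {}

    def process(self, word):
        self.counts[word] = self.counts.get(word, 0) + 1

def run_topology(spout, split_bolt, count_bolt):
    # Storm would parallelize these stages across workers and handle
    # acking/replay; here we just run the dataflow in-process.
    for sentence in spout.emit():
        for word in split_bolt.process(sentence):
            count_bolt.process(word)
    return count_bolt.counts
```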
- Fred de Villamil is the director of infrastructure at Synthesio and has been working with Linux/BSD and open source since 1996.
- Synthesio uses Elasticsearch to power over 13,000 dashboards, indexing over 75 billion documents and 200TB of data across 5 clusters with 163 servers and 400TB of storage.
- They initially had performance issues with cross-cluster queries in MySQL but migrated to Elasticsearch in 2015 and saw significant performance improvements with their "Clipping Revolution" implementation.
- Over time they encountered issues at scale including too many shards, slow restarts, and garbage collection problems. They optimized their implementation with changes like rack awareness, G1GC tuning, and field data cache configuration.
How to achieve advanced scheduling of heterogeneous tasks onto resources using the now open-source NetflixOSS Fenzo scheduling library for Apache Mesos frameworks, including autoscaling of the execution cluster.
Realtime Statistics based on Apache Storm and RocketMQ - Xin Wang
This document discusses using Apache Storm and RocketMQ for real-time statistics. It begins with an overview of the streaming ecosystem and components. It then describes challenges with stateful statistics and introduces Alien, an open-source middleware for handling stateful event counting. The document concludes with best practices for Storm performance and data hot points.
How are systems in finance designed for deterministic outcomes and performance? What are the benefits, and what performance can you achieve? A demo you can download is included.
The document discusses the evolution of Ceilometer, an OpenStack project that collects measurements from deployed clouds and persists the data for later retrieval and analysis. It describes how Ceilometer has scaled out its data collection capabilities over time by adding agents, partitioning workloads, and integrating with Gnocchi to provide more efficient time-series storage. The document also provides best practices for Ceilometer deployment and configuration to optimize data collection, storage and querying.
This document proposes a hybrid approach to securely sharding decentralized databases with low redundancy. It involves using a real-time validator layer for fast transactions, a shared fisherman pool for independent verification of transaction histories uploaded to a decentralized storage network like Swarm, and smart contracts on Ethereum to resolve disputes. This approach reduces the risk of shard takeover to 0.00037% while keeping redundancy costs low compared to naive consensus-based or blockchain-only approaches.
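The quoted takeover risk presumably comes from a tail-probability calculation over committee sampling; a generic hypergeometric version (parameters here are illustrative, not the paper's) looks like this:

```python
from math import comb

def takeover_probability(n_validators, n_malicious, committee_size, threshold):
    """Probability that a uniformly sampled committee contains at least
    `threshold` malicious validators (hypergeometric tail)."""
    total = comb(n_validators, committee_size)
    bad = sum(
        comb(n_malicious, k) * comb(n_validators - n_malicious, committee_size - k)
        for k in range(threshold, committee_size + 1)
    )
    return bad / total
```

Plugging in a scheme's actual validator count, adversary fraction, committee size, and corruption threshold yields figures of the kind cited above.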
Greg Parmer, Information Technology Specialist
Jonas Bowersock, Information Technology Specialist
Alabama Cooperative Extension System
Auburn University
Storm is an open-source distributed real-time computation system. It provides a framework for processing unbounded streams of data reliably and fault-tolerantly. Storm allows data to be analyzed in real-time using spouts, bolts, and topologies. It is scalable, fault-tolerant, guarantees processing, and is easy to code. Storm powers many real-time systems at Twitter and is useful for applications like analytics, personalization, and ETL.
Using Simplicity to Make Hard Big Data Problems Easy - nathanmarz
The document proposes a simple approach to solving a complex problem of computing unique visitors over time ranges that involves maintaining normalized and denormalized views of the data. The approach involves:
1) Storing all data in a master dataset and continuously recomputing indexes and views as a function of all the data to maintain normalized and denormalized views.
2) Querying both recent real-time views and historical batch views to retrieve the necessary data for a time range query, combining for high performance and accuracy.
3) Approximating unique counts for recent data by ignoring real-time equivalences to keep the real-time layer simple while still providing good query performance and eventual accuracy.
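The batch/realtime split above can be sketched as follows, with plain sets standing in for the indexed views (all names hypothetical):

```python
def unique_visitors(batch_view, realtime_events, start, end, batch_horizon):
    """Lambda-architecture style query: precomputed per-hour user sets for
    hours the batch layer has indexed, plus raw recent events for newer
    hours. batch_view maps hour -> set of user ids; realtime_events is a
    list of (hour, user_id) pairs for hours after batch_horizon."""
    users = set()
    # Batch layer: cheap lookups over fully recomputed views.
    for hour in range(start, min(end, batch_horizon) + 1):
        users |= batch_view.get(hour, set())
    # Realtime layer: scan recent raw events, ignoring user-id equivalences
    # (the simplification in point 3); batch recomputation fixes it later.
    for hour, user in realtime_events:
        if start <= hour <= end:
            users.add(user)
    return len(users)
```

Because the batch layer continuously recomputes from the master dataset, any overcounting introduced by the simplified realtime layer is eventually corrected.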
PHP Backends for Real-Time User Interaction using Apache Storm - DECK36
Engaging users in real time is the topic of our times. Whether it's a game, a shop, or a content network, the aim remains the same: providing a personalized experience. In this workshop we will look under the hood of Apache Storm and lay a firm foundation for using it with PHP. That way, you can leverage your existing codebase and PHP expertise for an entirely new world: real-time analytics and business logic working on message streams.

During the course of the workshop, we will introduce Apache Storm and take a look at all of its components. We will then skyrocket the applicability of Storm by showing you how to implement its components in PHP. All exercises will be conducted using an example project, the infamous and most exhilarating lolcat kitten game ever conceived: Plan 9 From Outer Kitten.

To follow the hands-on exercises, you will need a development VM prepared by us with all relevant system components and our project repositories. To make the workshop experience as smooth as possible for all participants, please bring a prepared computer to the workshop, as there will be no time to deal with installation and setup issues. Please download all prerequisites and install them as described: VM, Plan 9 webapp, Plan 9 storm backend (tutorial: https://github.com/DECK36/plan9_workshop_tutorial ).
Real-Time Big Data at In-Memory Speed, Using Storm - Nati Shalom
Storm, a popular framework from Twitter, is used for real-time event processing. The challenge presented is how to manage the state of your real-time data processing at all times. In addition, you need Storm to integrate with your batch processing system (such as Hadoop) in a consistent manner.
This session will demonstrate how to integrate Storm with an in-memory database/grid, and explore various strategies for integrating the data grid with Hadoop and Cassandra seamlessly. By achieving smooth integration with consistent management, you will be able to easily manage all the tiers of your Big Data stack in a consistent and effective way.
- See more at: http://nosql2013.dataversity.net/sessionPop.cfm?confid=74&proposalid=5526
The document discusses various configuration parameters for process engines: Max Jobs sets the maximum number of concurrent process instances in memory; Activation Limit loads process instances sequentially into memory one at a time; and Flow Limit sets the maximum number of concurrently running process instances before suspending new starts. The effects of different configuration combinations are explained.
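A toy model of the Flow Limit behavior, assuming suspended instances are resumed oldest-first (a simplification of real process-engine semantics; class and method names are invented):

```python
class ProcessEngine:
    """Sketch of the Flow Limit parameter: new process instances are
    suspended once the number of concurrently running instances hits
    the configured limit."""

    def __init__(self, flow_limit):
        self.flow_limit = flow_limit
        self.running = 0
        self.suspended = []  # FIFO queue of instances waiting to start

    def start(self, instance_id):
        if self.running < self.flow_limit:
            self.running += 1
            return "running"
        self.suspended.append(instance_id)
        return "suspended"

    def complete(self):
        """One running instance finishes; resume a suspended one if any."""
        self.running -= 1
        if self.suspended:
            self.running += 1
            return self.suspended.pop(0)
        return None
```

Max Jobs and Activation Limit interact with this the same way: they bound how many of those running instances may actually occupy memory at once.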
Webinar: Diagnosing Apache Cassandra Problems in Production - DataStax Academy
This document provides guidance on diagnosing problems in Cassandra production systems. It recommends first using OpsCenter to identify issues, then monitoring servers, applications, and logs. Common problems discussed include incorrect timestamps, tombstones slowing queries, not using a snitch, version mismatches, and disk space not being reclaimed. Diagnostic tools like htop, iostat, and nodetool are presented. The document also covers JVM garbage collection profiling to identify issues like early object promotion and long minor GCs slowing the system.
This document discusses advanced inter-process communication (IPC) techniques using off-heap memory in Java. It introduces OpenHFT, a company that develops low-latency software, and their open-source projects Chronicle and OpenHFT Collections that provide high-performance IPC and embedded data stores. It then discusses problems with on-heap memory and solutions using off-heap memory mapped files for sharing data across processes at microsecond latency levels and high throughput.
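Chronicle does this in Java with lock-free off-heap queues; the underlying memory-mapped-file idea can be shown in a few lines of Python (the file path and 16-byte layout are made up for the example):

```python
import mmap
import os
import struct
import tempfile

# One process writes a value into a memory-mapped file; another process
# could map the same file and read it back without copying through a socket.
path = os.path.join(tempfile.mkdtemp(), "ipc.dat")
with open(path, "wb") as f:
    f.write(b"\x00" * 16)  # pre-size the file before mapping

with open(path, "r+b") as f:
    writer = mmap.mmap(f.fileno(), 16)
    writer[:8] = struct.pack("<q", 42)  # store a little-endian int64 off-heap
    writer.flush()
    writer.close()

with open(path, "r+b") as f:
    reader = mmap.mmap(f.fileno(), 16)  # a second process would do this
    value = struct.unpack("<q", reader[:8])[0]
    reader.close()
```

Because the data lives in the page cache rather than the Java heap, it is invisible to the garbage collector, which is the key to the microsecond latencies claimed above.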
This document discusses common problems a platform engineer may see with ElastiCache and provides solutions. It covers issues related to unexpected behavior, performance, cluster stability, and HA/DR. Specific problems addressed include data becoming stale when the cache-aside pattern is not followed, latency increases from Redis calls in transactions, large key sizes causing spikes, empty cache values when the database has no value, and missing reconciliation logic. Solutions involve updating empty cache values, using Bloom filters, and ensuring availability during cache penetration or stampedes. Distributed locking challenges and sharding without online resharding are also covered, along with metrics to monitor such as cache hit rate and the Datadog ElastiCache dashboard.
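The cache-aside pattern and the empty-value fix can be sketched as follows, with plain dicts standing in for both Redis and the database (a minimal illustration, not the talk's code):

```python
class CacheAside:
    """Cache-aside with negative caching: keys absent from the database are
    cached too, so repeated lookups for missing keys don't hammer the
    database (the cache-penetration problem)."""

    MISSING = object()  # sentinel cached when the database has no value

    def __init__(self, db):
        self.db = db        # dict standing in for the backing database
        self.cache = {}     # dict standing in for Redis
        self.db_reads = 0   # instrumentation for the example

    def get(self, key):
        if key in self.cache:
            hit = self.cache[key]
            return None if hit is self.MISSING else hit
        self.db_reads += 1
        value = self.db.get(key)
        self.cache[key] = self.MISSING if value is None else value
        return value
```

A Bloom filter in front of `get` would serve the same purpose probabilistically, rejecting most never-existing keys without even a cache lookup.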
This document provides an introduction to Storm, an open source distributed real-time processing system. It discusses the types of data processing in Storm as either batch or real-time. The key components of a Storm cluster are the Nimbus master node, supervisor worker nodes, and ZooKeeper coordination service. A Storm topology defines the computation as a directed acyclic graph of spouts emitting streams and bolts processing the streams.
Learning Stream Processing with Apache Storm - Eugene Dvorkin
Over the last couple of years, Apache Storm became a de facto standard for developing real-time analytics and complex event processing applications. Storm enables companies to tackle real-time data processing challenges the same way Hadoop enables batch processing of Big Data, giving them "Fast Data" alongside "Big Data". Some use cases where Storm can be applied are fraud detection, operational intelligence, machine learning, ETL, analytics, etc.
In this meetup, Eugene Dvorkin, Architect @WebMD and NYC Storm User Group organizer will teach Apache Storm and Stream Processing fundamentals. While this meeting is geared toward new Storm users, experienced users may find something interesting as well.
The following topics will be covered:
• Why use Apache Storm?
• Common use cases
• Storm Architecture - components, concepts, topology
• Building simple Storm topology with Java and Groovy
• Trident and micro-batch processing
• Fault tolerance and guaranteed message delivery
• Running and monitoring Storm in production
• Kafka
• Storm at WebMD
• Resources
Webinar - How to Build Data Pipelines for Real-Time Applications with SMACK &... - DataStax
Data is being collected more and more every year. Cloud applications, including IoT, web, and mobile send torrents of bits at our data centers that have to be processed and stored. In addition, users expect an always-on experience, with little room for error. Numerous companies are successfully doing this every day. In this webinar, you will learn about the convergence of complementary technologies: Spark, Mesos, Akka, Cassandra and Kafka (SMACK), how Apache Kafka can help you get your data under control and the critical role Kafka plays in your data pipeline.
Webinar recording: https://youtu.be/uwYlwLyv-1s
Webinar Q&A will be posted shortly.
Elasticsearch is a distributed, open source search and analytics engine. It allows for horizontal scaling with no single point of failure. Data is automatically rebalanced across nodes in a cluster. Elasticsearch is used by many large companies and sees heavy use for log analytics and search capabilities. It uses RESTful APIs and JSON documents.
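For example, Elasticsearch's REST API takes plain JSON bodies; the index name, field names, and document below are invented for illustration:

```python
import json

# A log document and a full-text match query, as JSON bodies you would
# send to Elasticsearch's REST endpoints.
doc = {
    "timestamp": "2016-01-01T00:00:00Z",
    "level": "ERROR",
    "message": "disk watermark exceeded",
}
query = {"query": {"match": {"message": "watermark"}}}

index_body = json.dumps(doc)
query_body = json.dumps(query)
# e.g. PUT  /logs-2016.01.01/... with index_body to index the document,
#      GET  /logs-2016.01.01/_search with query_body to search it
```

This JSON-over-HTTP surface is a large part of why it sees such heavy use for log analytics: any language with an HTTP client can index and query.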
The premise of Dave's talk: "Good monitoring changes people." Through his evolving experience with monitoring, Dave Josephsen realized he had been “carrying a misapprehension about what monitoring was and who it was for.” His prior experiences with monitoring were much like those of folks he nowadays meets at conferences: monitoring is terrible, alerts are flooding from everything, and the world is probably burning right now. Through observing all teams - ops, data engineering, design - interact at Librato, he realized that the purpose of monitoring isn’t creating alerts but asking questions. There is no “owner” of monitoring, as everyone has the ability to measure things and ask their own questions.
This document summarizes Brian Overstreet's talk on scaling Pinterest's monitoring system over time as the company and traffic grew. It describes how Pinterest started with just Ganglia for system metrics and no application metrics. They introduced Graphite but faced challenges with packet loss and metrics being dropped. They then introduced OpenTSDB which users were happier with due to its querying speed. Pinterest developed an agent-based pipeline using Kafka and Storm to address packet loss issues and allow over 1.5 million points per second to be ingested by OpenTSDB. Key lessons included the need to educate users, control incoming metrics, and ensure the monitoring system scales with engineers rather than just site users.
Design principles for building useful graph displays and visualisations for monitoring data. What goes into designing graphs, creating a good user experience and what other types of visualisations are appropriate for which situations?
1) Heinrich Hartmann presented on statistics and monitoring for engineers. He discussed various methods for API monitoring including external monitoring, log analysis, and measuring latency averages and percentiles.
2) Histograms were presented as another method that involves dividing the latency and time scales into bands and reporting periods to count samples, allowing flexible analysis while enabling aggregation.
3) Takeaways included being wary of line graphs, not aggregating percentiles but instead using histograms, keeping all raw data, and striving for meaningful metrics.
Monitorama: How monitoring can improve the rest of the companyJeff Weinstein
Monitoring can improve the entire company by sharing data and techniques across teams. By implementing structured logging, automatic metrics collection, and common data visualization tools, monitoring can become the central data platform. This allows all teams like developers, analysts, and executives to access insights that help improve products, prioritize issues, and make data-driven decisions.
I gave a talk about monitoring your people at Monitorama PDX 2016. I don't know much about monitoring, but I do know that measuring your people matters to scale a business and grow your revenue.
Plus, happy people work better!
The document discusses sessionization with Spark streaming to analyze user sessions from a constant stream of page visit data. Key points include:
- Streaming page visit data presents challenges like joining new visits to ongoing sessions and handling variable data volumes and long user sessions.
- The proposed solution uses Spark streaming to join a checkpoint of incomplete sessions with new visit data to calculate session metrics in real-time.
- Important aspects are controlling data ingress size and partitioning to optimize performance of operations like joins and using custom formats to handle output to multiple sinks.
Everything obfuscurity taught me about monitoringPete Cheslock
The document appears to be a series of tweets from Pete Cheslock about monitoring best practices and lessons learned over his career. Some key points discussed include using Graphite for time-series data collection and storage, leveraging existing tools like StatsD that developers are already using, building services that are consumable by developers to encourage cross-team collaboration, and focusing on solving your own company's problems rather than trying to replicate what large companies do.
Infrastructure as code might be literally impossible part 2ice799
The document discusses various issues with infrastructure as code including complexities that arise from software licenses, bugs, and inconsistencies across tools and platforms. Specific examples covered include problems with SSL and APT package management on Debian/Ubuntu, Linux networking configuration difficulties, and inconsistencies in Python packaging related to naming conventions for packages containing hyphens, underscores, or periods. Potential causes discussed include legacy code, lack of time for thorough testing and bug fixing, and economic pressures against developing fully working software systems.
We don't always think of it this way, but your metrics *are* your culture... Your metrics shape behavior and incentives, which really is the heart of culture.
Monitorama - Please, no more Minutes, Milliseconds, Monoliths or Monitoring T...Adrian Cockcroft
Monitorama opening keynote talk on the challenges of Monitoring in a world where we need to deal with continuous delivery, cloud, and automated control feedback loops.
Google Cloud Platform: Prototype ->Production-> Planet scaleIdan Tohami
As one of Big Data’s Founding Fathers, Google explored the technological changes we have faced over the past 10 years and presented their solutions to the new data challenges within the Google Cloud ecosystem.
All of Your Network Monitoring is (probably) Wrongice799
The document discusses the challenges of network monitoring due to complexity in systems and lack of standardization. It notes that drivers and tools like ethtool may report statistics differently or incompletely between hardware. This makes it difficult to understand monitoring data and diagnose issues from graphs alone without deep knowledge of underlying driver and hardware implementations.
Opening talk at Monitorama, talks about the problems of monitoring, challenges of creating monitoring tools and why monitoring vendors keep getting disrupted. Ended with a discussion of simulation testing and serverless architectures - Monitorless.
Introduction to Apache ZooKeeper | Big Data Hadoop Spark Tutorial | CloudxLabCloudxLab
Big Data with Hadoop & Spark Training: http://bit.ly/2kvXlPd
This CloudxLab Introduction to Apache ZooKeeper tutorial helps you to understand ZooKeeper in detail. Below are the topics covered in this tutorial:
1) Data Model
2) Znode Types
3) Persistent Znode
4) Sequential Znode
5) Architecture
6) Election & Majority Demo
7) Why Do We Need Majority?
8) Guarantees - Sequential consistency, Atomicity, Single system image, Durability, Timeliness
9) ZooKeeper APIs
10) Watches & Triggers
11) ACLs - Access Control Lists
12) Usecases
13) When Not to Use ZooKeeper
How does the Cloud Foundry Diego Project Run at Scale?VMware Tanzu
From Pivotal's Amit Gupta on July 9, 2015, a look at how the Cloud Foundry Diego project runs at scale, and what it took to get there. Offering a look into the Diego project scheduler and the performance testing efforts, all the tools necessary to ensure that Cloud Foundry can scale quickly and effortlessly.
To learn more, visit pivotal.io/platform-as-a-service/pivotal-cloud-foundry
How does the Cloud Foundry Diego Project Run at Scale, and Updates on .NET Su...Amit Gupta
The Cloud Foundry Diego team at Pivotal has been hard at work for the past few months exploring and improving Diego's performance at scale and under stress. This talk covers the goals, tools, and results of the experiments to date, as well as a glimpse of what's next.
And finally, a brief teaser about the current state of .NET support in Diego
Big data from the LHC commissioning: practical lessons from big science - Sim...jaxLondonConference
Presented at JAX London 2013
The Large Hadron Collider experiments manage tens of petabytes of data spread across hundreds of data centres. Managing and processing this volume required significant infrastructure and novel software systems, involving years of R&D and significant commissioning to prepare for the LHC First Data. The evolution of this global computing infrastructure, and the specialisations made by the experiments, have lessons relevant for many commercial "big data" users.
This document discusses using Hadoop and Elasticsearch for real-time analytics. It provides an overview of Elasticsearch, including how it is document-oriented, schema-free, distributed and fast. It also demonstrates indexing, retrieving, updating and deleting documents from Elasticsearch. The demo portion involves extracting data from a SQL database using Hive, transforming it with Hadoop/Hive, and loading it into Elasticsearch to run queries. Lessons learned focus on concurrency, filtering, field data caching and JVM memory usage.
Tomas Doran presented on their implementation of Logstash at TIM Group to process over 55 million messages per day. Their applications are all Java/Scala/Clojure and they developed their own library to send structured log events as JSON to Logstash using ZeroMQ for reliability. They index data in Elasticsearch and use it for metrics, alerts and dashboards but face challenges with data growth.
Lightning talk showing various aspectos of software system performance. It goes through: latency, data structures, garbage collection, troubleshooting method like workload saturation method, quick diagnostic tools, famegraph and perfview
Алексей Петров "PHP at Scale: Knowing enough to be dangerous!"Fwdays
PHP at Scale: Knowing enough to be dangerous! by Oleksii Petrov discusses how to scale PHP applications. It covers strategies like caching, queueing, read/write splitting, and sharding. It also discusses using load balancers and choosing the right database. The key is to improve system metrics without dramatically changing the system. Scaling is predefined by your stack and architecture. Performance comes from optimizations everywhere, not just PHP. Being distributed is very challenging.
The document proposes using MapReduce as a general framework to support research in mining software repositories (MSR). It describes how MapReduce can provide efficiency, scalability, adaptability and flexibility for common MSR tasks like analyzing large code repositories. A case study of applying MapReduce to the J-REX MSR tool shows significant reductions in running time for large datasets. Minimal programming effort was required and MapReduce could run on various computing environments.
2013 py con awesome big data algorithmsc.titus.brown
This document provides an overview of algorithms for analyzing large datasets, referred to as "big data". It discusses skip lists, HyperLogLog counting, and Bloom filters as examples of probabilistic data structures that can be used for problems involving big data. These algorithms provide approximate answers but are more scalable and memory efficient than exact algorithms. The document also describes applications of these algorithms to analyzing shotgun DNA sequencing data from metagenomics studies.
Flink Forward Berlin 2018: Lasse Nedergaard - "Our successful journey with Fl...Flink Forward
At Trackunit we have based our telematic IoT processing pipeline on Flink. We started out on version 1.2 and are now on 1.5. In this session I will share the lessons learned going from one giant Flink job to many small ones, and some of the problems we have seen operating Flink on an AWS EMR cluster, including topics such as:
• Why external enrichment can be challenging with Flink Async operator.
• Pattern to change external enrichment into streaming join.
• Building your own source
• Why Flink restart is great but should be avoided as it will terminate your cluster.
• Why iteration can cause deadlocking when backpressure occurs.
• Kinesis rate exceeded exception
• Why throttling Flink source read during catchup is needed.
• Why we moved from EMR/Kinesis and into DC/OS and kafka.
• And much more.
Melbourne Big Data Meetup Talk: Scaling a Real-Time Anomaly Detection Applica...Paul Brebner
Apache Kafka, Apache Cassandra and Kubernetes are open source big data technologies enabling applications and business operations to scale massively and rapidly. While Kafka and Cassandra underpins the data layer of the stack providing capability to stream, disseminate, store and retrieve data at very low latency, Kubernetes is a container orchestration technology that helps in automated application deployment and scaling of application clusters.
In this presentation, Paul will reveal how he architected a massive scale deployment of a streaming data pipeline with Kafka and Cassandra to cater to an example Anomaly detection application running on a Kubernetes cluster and generating and processing massive amount of events. Anomaly detection is a method used to detect unusual events in an event stream.
It is widely used in a range of applications such as financial fraud detection, security, threat detection, website user analytics, sensors, IoT, system health monitoring, etc. When such applications operate at massive scale generating millions or billions of events, they impose significant computational, performance and scalability challenges to anomaly detection algorithms and data layer technologies. Paul will demonstrate the scalability, performance and cost effectiveness of Apache Kafka, Cassandra and Kubernetes, with results from his experiments allowing the Anomaly detection application to scale to 19 Billion anomaly checks per day.
Melbourne Big Data Meetup, March 5 2020
https://www.eventbrite.com/e/melbourne-big-data-meetup-realtime-anomaly-detection-with-cassandra-kafka-tickets-93028445585
A presentation about the deployment of an ELK stack at bol.com
At bol.com we use Elasticsearch, Logstash and Kibana in a logsearch system that allows our developers and operations people to easily access and search through log events coming from all layers of its infrastructure.
The presentation explains the initial design and its failures, then the latest design (mid-2014) and its improvements. Finally, a set of tips is given regarding Logstash and Elasticsearch scaling.
These slides were first presented at the Elasticsearch NL meetup on September 22nd 2014 at the Utrecht bol.com HQ.
This document discusses HDInsight interactive query architecture and performance. It summarizes that:
1. HDInsight uses LLAP (Low Latency Analytical Processing) clusters to serve queries directly from Azure blob storage and data lake store for fast performance on text data.
2. Testing showed LLAP had high query concurrency and interactive query speed compared to Spark SQL and Presto.
3. The document also outlines HDInsight's logging architecture where the OMS agent collects logs and metrics from HDInsight clusters and sends them to Log Analytics for analysis.
This document summarizes lessons learned from scaling HDFS storage at Twitter to over 1 exabyte across tens of thousands of nodes. Some key challenges discussed include identifying scale limits through benchmarking, abstracting access across multiple clusters and datacenters, implementing extensive metrics and auditing, preventing single points of failure, handling failures and slowdowns silently, understanding network bottlenecks, implementing throttling, preventing data loss, carefully planning upgrades, and monitoring all aspects of the system. The lessons have helped Twitter scale HDFS and are also useful for scaling other systems.
This document discusses moving from host-centric monitoring to fact-based monitoring using Puppet facts. It argues that hosts should not be the center of the monitoring universe, but rather facts should be. Effective monitoring uses queries against existing facts and metrics to express conditions like ensuring web servers respond quickly or PostgreSQL processes are running. This mirrors how Puppet, SQL, and MCollective improved systems management by moving from imperative programming to declarative queries based on available facts and metadata.
Your configuration management is fact-based.
Your orchestration is fact-based.
Is your monitoring fact-based?
What does that even mean? Monitoring is very similar to configuration, at least in its expression. Configuration cares about files, services, and hosts being present and in a certain state ("nginx should be running with the following configuration"). Monitoring cares about services being present, running, and in a certain state. Both describe your infrastructure as it should be ("nginx should be running and respond in less than 200ms").
Fact-based monitoring is about being able to control monitoring with the same facts that Puppet uses ("monitor nginx latency wherever Puppet says it should run"). This is in contrast with imperative monitoring ("monitor nginx on hosts a, b and c") that gets out of sync and leads to mailbox meltdowns from spurious alerts.
Using open source and commercial examples, this talk will help you express your monitoring in a way that will feel very natural to your Puppet configuration.
This document provides an introduction to single-cell RNA-seq (scRNA-seq) analysis. It discusses different scRNA-seq assays such as Smart-Seq2, Drop-seq, and 10X, and how their protocols and sequencing outputs differ. It also covers scRNA-seq data characteristics like zero inflation and overdispersion. The document outlines common analysis steps like filtering, dimensionality reduction, clustering, and differential expression. It emphasizes that scRNA-seq data requires specialized analysis due to its noisy and sparse nature compared to bulk RNA-seq data.
Diagnosing Problems in Production - CassandraJon Haddad
1) The document discusses various tools for diagnosing problems in Cassandra production environments, including OpsCenter for monitoring, application metrics collection with Statsd/Graphite, and log aggregation with Splunk or Logstash.
2) Some common issues covered are incorrect server times causing data inconsistencies, tombstone overhead slowing queries, not using the proper snitch, and disk space not being reclaimed on new nodes.
3) Diagnostic tools described are htop, iostat, vmstat, dstat, strace, tcpdump, and nodetool for investigating process activity, disk usage, memory, networking, and Cassandra-specific statistics. GC profiling and query tracing are also recommended.
Diagnosing Problems in Production (Nov 2015)Jon Haddad
Diagnosing Problems in Production involves first preparing monitoring tools like OpsCenter, server monitoring, application metrics, and log aggregation. Common issues include incorrect server times causing data inconsistencies, tombstone overhead slowing queries, not using the proper snitch, and version mismatches breaking functionality. Diagnostic tools like htop, iostat, vmstat, dstat, strace, jstack, nodetool, histograms, and query tracing help narrow down performance problems which could be due to compaction, garbage collection, or other bottlenecks.
14. • Roughly 25K messages/hour
• Controllers are 100x noisier than compute nodes
• Swift generates 60% of traffic
• 99.9% of the time, there were less than 65 messages/sec
• There are some traffic spikes of ~1600 messages/sec
• 95% of messages were less than 240 bytes
• Largest message was 782 bytes
22. [flood_1ms]
ip_address = "127.0.0.1:9997"
sender = "tcp"
pprof_file = ""
encoder = "protobuf"
num_messages = 0
message_interval = "1ms"
max_message_size = 2800
variable_message_size = true
late_bind_timestamp = true
• Add flood process
• Monitor everything
• Repeat until it breaks
The plan
23. heka-flood
Seems like a good idea
• The good:
• Control over message rate
• The not so good:
• Not enough control over message content
• Timestamps assigned at initialization (pull request forthcoming)
• Couldn’t get variable message sizing to work
• Here comes the meatloaf…
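The "not so good" items above are mostly about control over message content and per-message timestamps, which are cheap to get from a hand-rolled generator. A hedged sketch (this is not the tool we actually used; the host, port, and JSON field names are assumptions):

```python
import json
import random
import socket
import time

def build_message(max_size=2800):
    """One log event: timestamp assigned at send time, size drawn per message."""
    size = random.randint(64, max_size)
    event = {"timestamp": time.time(), "payload": "x" * size}
    return (json.dumps(event) + "\n").encode()

def flood(host="127.0.0.1", port=9997, rate=1000, num_messages=10_000):
    """Send newline-delimited JSON events at roughly `rate` messages/sec."""
    interval = 1.0 / rate
    with socket.create_connection((host, port)) as sock:
        for _ in range(num_messages):
            sock.sendall(build_message())
            time.sleep(interval)
```

Sleeping per message caps the achievable rate well below what batching would allow, but it keeps the rate knob honest for this kind of experiment.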
25. Tactical
• System sustained 4000 x ~1k msgs/sec
• Collector and/or ES started to pause above that
• No messages were dropped
Strategic
• Load tool was underqualified
• Monitoring tool resolution is important for interpretation
• I should prepare for presentations further in advance
What did we learn?
26. Intuitions
• 4MB/sec is:
• 17K messages of 240 bytes
• 240 bytes is 95th percentile
• 64 msgs/sec is:
• 99.9th percentile
• 265 controllers
• 1060 compute
• 1:100 ratio
Our First Model
27. • Find the bottlenecks
• System resources?
• Elasticsearch?
• Collector?
• Load generator?
• Improve the model
• Probability distributions
• Noise
• Real message samples
• Off host load
• Regime changes
• Real world feedback
Next Steps
Lately, most of my time has been spent with these tools, and generally shifting to the right.
We recently started using Heka instead of Logstash. It’s an experiment, but we’re embracing Go in some other development contexts, and we have no Ruby skills in house. Heka is interesting, but still a young project. I have high hopes for it from a performance perspective, and have had good interactions with the community for both Q & A and code acceptance.
When sales or consultants come to me and ask how many nodes our product can support, I’d like to provide a better answer than the shrug emoji, so we’re going to walk through a modeling exercise.
Let’s start with the big picture.
Since the missing servers are compute nodes, they most likely have a very similar profile to the other compute nodes.
Query all the logs for 7 days, and do a date histogram on the timestamp with an interval of 1 second. With our amount of data, the aggregation took about 15 seconds. Your situation might be different, so you can adjust the lookback or the interval to suit your needs.
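The per-second date histogram described here can be expressed as an Elasticsearch aggregation along these lines. A sketch: the "@timestamp" field name is an assumption for a typical Logstash-style index, and the exact aggregation syntax varies across Elasticsearch versions.

```python
import json

# Count messages per second over the last 7 days. "size": 0 skips the hits;
# we only want the aggregation buckets.
query = {
    "size": 0,
    "query": {"range": {"@timestamp": {"gte": "now-7d"}}},
    "aggs": {
        "per_second": {
            "date_histogram": {"field": "@timestamp", "interval": "1s"}
        }
    },
}

# POST this body to http://<es-host>:9200/<index-pattern>/_search
print(json.dumps(query, indent=2))
```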
We see that 99.9% of the time, there were no more than 64 messages per second, but there are a few outliers. I didn’t include the chart, but I think there were about 10 instances of a rate over 1000 messages/sec. We’ll want to consider those in a robust model.
But for now, let’s remove them so we can see the shape of the message rate histogram. I’ve hand drawn in the red line to highlight what I think is a reasonable shape.
Let’s do a similar analysis of the message size. Here we’ll look at the last 7 days, but do a percentiles aggregation. We can see that our max message size was 782 bytes, but that 95% of messages were less than 240 bytes. This is only the text component of the message, not the other fields like tags, timestamp, etc., but those things are pretty constant in this environment, so it should be ok.
So, what did we learn about our environment?
Now, we’ve got a bunch of ingredients, and we want to get down to the business of making a model.
First, we might try to select a distribution that looks most like our model parameters. That’s pretty straightforward, and you can use tools like numpy and scipy to generate these curves, then take samples from the curve to get a value for a parameter.
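For example (with synthetic stand-in data, since the raw measurements aren’t reproduced here), you could estimate lognormal parameters from the observed message sizes and sample new ones for the load generator:

```python
import numpy as np

rng = np.random.default_rng(42)

# Stand-in for the measured message sizes: heavy-tailed, roughly matching
# the "95% under 240 bytes" figure from the slides.
observed = rng.lognormal(mean=4.8, sigma=0.45, size=10_000)

# "Fit" a lognormal by estimating the mean and std of log(size)...
mu, sigma = np.log(observed).mean(), np.log(observed).std()

# ...then draw per-message sizes for the load generator from the fit.
sizes = rng.lognormal(mean=mu, sigma=sigma, size=1_000)
```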
You might want to add some noise to your parameters to give them some “texture”. For example, you might want to vary the cadence of the message stream by making a decision at each sample time about how many messages to generate, or increase or decrease the message size.
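A sketch of that per-tick texture follows. Only the ~64 msgs/sec and 782-byte figures come from the measurements earlier; the Poisson count and normal size jitter are illustrative choices, not the real distributions.

```python
import numpy as np

rng = np.random.default_rng(0)

def tick(base_rate=64, typical_size=150, max_size=782):
    """Simulate one second of traffic with jitter on count and size."""
    # Decide per tick how many messages to generate...
    n = int(rng.poisson(base_rate))
    # ...and vary each message's size, clipped to the observed maximum.
    sizes = np.clip(rng.normal(typical_size, 60, size=n), 1, max_size).astype(int)
    return n, sizes
```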
And then all you need to do is execute flawlessly.
But what happens when you’re not a professional data modeler, or your knives aren’t sharp, or people are pounding their forks and knives on the table… You probably end up with something like...
Meatloaf. Arguably not horrible, but no Thomas Keller dish either.
Let’s revisit what we have to work with. Shippers, collectors, and a datastore. The shippers and collectors are distributed. In our case, the datastore is on the same node as the collector. It’s unlikely that an individual shipper will generate more data than the network, or an individual collector, can handle, so we’ll focus on flooding the collector from the local host, and see how the collector / Elasticsearch combo handles things. There’s a chance that we can overwork the OS, but this still seems like a good starting point to isolate bottlenecks.
Heka has a load testing utility called flood. I tend to build a mental model of what these types of utilities should do, then by the time I get a chance to prototype a solution using them, I can only hope that they are close to my mental model. In this case, I missed the mark a little, and hit a few other snags, but overall, I was able to demonstrate what I had hoped to.
A simplified overview of what I found was that everything ramped up pretty well until I got over 5500 messages/sec, then things started to destabilize. The shippers started pausing for a few seconds at a time, then would restart, run for a while and pause again. I included two views of this behavior to show how the sample rate can mask or highlight system behavior. Marvel (on top) is aggregating over 10 seconds, Kibana (bottom) is aggregating over 1 second. You can see the destabilization in Marvel, but it’s much more obvious in Kibana.
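The sample-rate effect is easy to reproduce synthetically. The ~5500 msgs/sec rate and the multi-second shipper pauses below come from the behavior just described; the exact placement of the pauses is made up.

```python
import numpy as np

# 60 seconds of per-second message counts: steady 5500/s with two
# three-second shipper pauses.
rates = np.full(60, 5500)
rates[20:23] = 0
rates[40:43] = 0

fine = rates                                  # 1 s buckets (the Kibana view)
coarse = rates.reshape(-1, 10).mean(axis=1)   # 10 s buckets (the Marvel view)

print(fine.min())    # the pauses are unmistakable at 1 s resolution
print(coarse.min())  # averaging over 10 s smears them out
```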
…. But on to our first model...
… So, when someone asks if we can handle traffic from 100 controllers and 15000 compute nodes, I can say, “The model says yes!”, but when they ask if we can handle 250 controllers and 10000 compute nodes, I can say, “The model says ‘Oh, no. That would be a bad idea...’”
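A back-of-envelope sketch of such a model: only the node counts, the ~100x controller noise factor, and the ~1600 msgs/sec spike come from the measurements above. The scaling logic is an illustrative simplification, so don’t expect it to reproduce the talk’s exact verdicts.

```python
CONTROLLER_NOISE = 100          # controllers are ~100x noisier (slide 14)
MEASURED_NODES = (265, 1060)    # controllers, compute nodes we measured
MEASURED_PEAK = 1600.0          # msgs/sec in the worst observed spike

def weight(controllers, compute):
    # Express cluster size in "compute-node equivalents".
    return controllers * CONTROLLER_NOISE + compute

def projected_peak(controllers, compute):
    # Scale the measured peak rate by relative cluster weight.
    scale = weight(controllers, compute) / weight(*MEASURED_NODES)
    return MEASURED_PEAK * scale

# Compare projections against the ~5500 msgs/sec destabilization point
# observed in the load test before saying "yes" to a sizing question.
print(round(projected_peak(100, 15000)))
print(round(projected_peak(250, 10000)))
```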
Well, we didn’t really find the bottleneck, which means we didn’t get the chance to tune it or scale it. We also have a very simple model using fairly simple tools. We could definitely improve that. Finally, we would want to take some real world samples to support or refute our model.
In other words, we should continuously improve by understanding our environment, tuning parameters, and validating our model with feedback from the real world.
Since I paid for the right to use Meatloaf, he would like to say Thanks, and ask if you have any questions…